src package#
Submodules#
src.classification module#
Supervised Classification and Performance Evaluation.
This module provides a robust pipeline for supervised machine learning, using the unsupervised clusters identified in the previous stage as ground truth labels. It includes functions for preparing the data, training a Random Forest classifier, evaluating its performance, and analyzing the importance of the extracted features. The module is designed to be fully configurable, with all algorithm parameters specified in the application’s configuration.
- Functions:
label_clustered_data : Merges features with cluster assignments and assigns descriptive labels.
prepare_training_data_splits : Splits the labeled dataset into training and testing sets.
train_classifier : Trains and saves a Random Forest classifier model.
evaluate_classifier : Evaluates the performance of the trained model.
get_feature_importances : Calculates and returns the feature importance scores from the model.
predict_wound_border_type : Classifies new wound images using a pre-trained model.
- Typical Use:
This module is typically used in the main application script (run_pipeline.py) after the clustering stage. The main script orchestrates a sequence of calls to these functions to train a classifier on the clustered data and report its performance.
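A rough sketch of that orchestration is shown below (function names are from this module; variable names, parameter dictionaries, and the model save path are illustrative assumptions, not part of the API):
>>> from src.classification import (label_clustered_data, prepare_training_data_splits,
...                                 train_classifier, evaluate_classifier,
...                                 get_feature_importances)
>>> # labeled data -> stratified split -> train -> evaluate -> inspect importances
>>> labeled_df = label_clustered_data(features_df, cluster_map_df, label_map)
>>> X_train, X_test, y_train, y_test, X_train_df = prepare_training_data_splits(
...     labeled_df, classification_params)
>>> model = train_classifier(X_train, y_train, classification_params, model_save_path)
>>> results = evaluate_classifier(model, X_test, y_test)
>>> importances = get_feature_importances(model, X_train_df)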
- src.classification.label_clustered_data(features_df, cluster_map_df, label_map)#
Merges feature data with cluster assignments and assigns descriptive wound type labels.
This function combines the extracted features with the cluster labels identified by HDBSCAN. It then uses a provided label_map to replace the numerical cluster labels with more descriptive strings, preparing the final dataset for supervised classification.
- Parameters:
features_df (pd.DataFrame) – DataFrame containing extracted features. Must include an ‘image_id’ column.
cluster_map_df (pd.DataFrame) – DataFrame containing image-cluster assignments. Must include ‘image_id’ and ‘cluster_label’ columns.
label_map (Dict[int, str]) – A dictionary that maps numerical cluster labels to descriptive string labels.
- Returns:
Merged and labeled DataFrame (df_merged).
- Return type:
pd.DataFrame
- Raises:
TypeError – If inputs are not of the expected DataFrame types.
ValueError – If required columns (‘image_id’, ‘cluster_label’) are missing or if the label map is empty.
KeyError – If a required column is missing from one of the DataFrames.
- Output:
- Console/Log:
Informational messages about the merging process and outlier removal.
- Return Value:
A Pandas DataFrame ready for supervised learning.
Examples
>>> import pandas as pd
>>> from src.classification import label_clustered_data
>>> # Dummy data
>>> df_features = pd.DataFrame({'image_id': ['A', 'B'], 'feat1': [1, 2]})
>>> df_clusters = pd.DataFrame({'image_id': ['A', 'B'], 'cluster_label': [0, -1]})
>>> label_map = {0: 'Type A', -1: 'Outlier'}
>>> labeled_data = label_clustered_data(df_features, df_clusters, label_map)
- Relationships:
- Dependencies:
Relies on pandas for DataFrame manipulation.
- Used by:
The classification pipeline to prepare the input for model training.
- src.classification.prepare_training_data_splits(df, classification_params)#
Prepares and splits the labeled dataset into training and testing sets.
This function takes the final labeled DataFrame, separates the feature columns (X) from the target variable (y), and then uses a stratified split to ensure that the distribution of wound types is preserved in both the training and test sets.
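A minimal sketch of the stratified split described above, assuming scikit-learn's train_test_split (the 'wound_type' target column name follows the example below; exact column handling may differ):
>>> from sklearn.model_selection import train_test_split
>>> X = df.drop(columns=['image_id', 'wound_type']).values
>>> y = df['wound_type'].values
>>> X_train, X_test, y_train, y_test = train_test_split(
...     X, y, test_size=classification_params['test_size'],
...     random_state=classification_params['random_state'], stratify=y)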
- Parameters:
df (pd.DataFrame) – The labeled DataFrame to be split.
classification_params (Dict[str, Any]) – A dictionary containing parameters for the classification pipeline, including ‘test_size’ and ‘random_state’.
- Returns:
- A tuple containing:
X_train (np.ndarray): Training features.
X_test (np.ndarray): Testing features.
y_train (np.ndarray): Training target labels.
y_test (np.ndarray): Testing target labels.
X_train_df (pd.DataFrame): Training features as a DataFrame for feature importance.
- Return type:
Tuple[np.ndarray, np.ndarray, np.ndarray, np.ndarray, pd.DataFrame]
- Raises:
TypeError – If df is not a Pandas DataFrame.
ValueError – If required columns are missing or if df is empty.
- Output:
- Console/Log:
Informational messages about the split and the resulting shapes of the datasets.
- Return Value:
Four NumPy arrays and a DataFrame representing the split data.
Examples
>>> import pandas as pd
>>> import numpy as np
>>> from src.classification import prepare_training_data_splits
>>> # Dummy data
>>> df_mock = pd.DataFrame({
...     'image_id': ['A', 'B', 'C', 'D'], 'feat1': [1, 2, 3, 4],
...     'wound_type': ['Type A', 'Type A', 'Type B', 'Type B']
... })
>>> classification_params = {'test_size': 0.5, 'random_state': 42}
>>> X_train, X_test, y_train, y_test, X_train_df = prepare_training_data_splits(
...     df_mock, classification_params=classification_params)
>>> print(f"X_train shape: {X_train.shape}, X_test shape: {X_test.shape}")
- Relationships:
- Dependencies:
Relies on pandas and sklearn.model_selection.train_test_split.
- Used by:
The classification pipeline to prepare data for train_classifier.
- src.classification.train_classifier(X_train, y_train, classification_params, save_path)#
Trains a Random Forest classifier model.
This function initializes a RandomForestClassifier with a fixed set of hyperparameters and trains it on the provided training data.
- Parameters:
X_train (np.ndarray) – Training features.
y_train (np.ndarray) – Training target labels.
classification_params (Dict[str, Any]) – A dictionary containing parameters for the classification pipeline, including ‘random_state’ and ‘random_forest_n_estimators’.
save_path (Path) – The full path, including filename, to save the trained model.
- Returns:
The trained classifier model.
- Return type:
RandomForestClassifier
- Raises:
TypeError – If inputs are not NumPy arrays.
ValueError – If input arrays are empty or have inconsistent shapes.
RuntimeError – If model training fails.
- Output:
- Console/Log:
Informational messages about model training and performance.
- Return Value:
The trained classifier model.
Examples
>>> import numpy as np
>>> from pathlib import Path
>>> from src.classification import train_classifier
>>> # Dummy data
>>> X_train = np.random.rand(10, 5)
>>> y_train = np.array([0, 1, 0, 1, 0, 1, 0, 1, 0, 1])
>>> classification_params = {'random_state': 42, 'random_forest_n_estimators': 100}
>>> model = train_classifier(X_train, y_train, classification_params,
...                          save_path=Path('./rf_model.joblib'))
- Relationships:
- Dependencies:
Relies on numpy, sklearn.ensemble.RandomForestClassifier, and joblib.
- Used by:
The classification pipeline to train the model.
- src.classification.evaluate_classifier(model, X_test, y_test)#
Evaluates the performance of the trained classifier model on a test set.
This function calculates several standard classification metrics, including overall accuracy and a detailed classification report, providing a comprehensive view of the model’s performance.
- Parameters:
model (RandomForestClassifier) – The trained classifier model.
X_test (np.ndarray) – Testing features.
y_test (np.ndarray) – Testing target labels.
- Returns:
A dictionary containing the accuracy score, the full classification report, and the model itself.
- Return type:
Dict[str, Any]
- Raises:
TypeError – If inputs are not of expected types.
ValueError – If input arrays are empty or have inconsistent shapes.
RuntimeError – If model prediction or evaluation fails.
- Output:
- Console/Log:
Informational messages about model performance.
- Return Value:
A dictionary of evaluation metrics.
Examples
>>> import numpy as np
>>> from sklearn.ensemble import RandomForestClassifier
>>> from src.classification import evaluate_classifier
>>> # Dummy data
>>> model = RandomForestClassifier(random_state=42).fit(np.random.rand(10, 5), np.array([0, 1]*5))
>>> X_test = np.random.rand(10, 5)
>>> y_test = np.array([0, 1]*5)
>>> results = evaluate_classifier(model, X_test, y_test)
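For reference, the metrics described above are conventionally computed with scikit-learn roughly as follows (a sketch, not necessarily this module's exact implementation):
>>> from sklearn.metrics import accuracy_score, classification_report
>>> y_pred = model.predict(X_test)
>>> accuracy = accuracy_score(y_test, y_pred)
>>> report = classification_report(y_test, y_pred, output_dict=True)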
- Relationships:
- Dependencies:
Relies on numpy and sklearn.metrics for evaluation.
- Used by:
The classification pipeline to report model performance.
- src.classification.get_feature_importances(model, X_train_df)#
Calculates and returns feature importances from a trained model.
This function extracts the feature importance scores from a trained RandomForestClassifier model, and returns them as a DataFrame. This is useful for understanding which features are most influential for the model’s predictions.
- Parameters:
model (RandomForestClassifier) – The trained classifier model.
X_train_df (pd.DataFrame) – Training features.
- Returns:
A DataFrame with ‘feature’ and ‘importance’ columns, sorted by importance in descending order, or None if the model does not support feature importances.
- Return type:
Optional[pd.DataFrame]
- Raises:
TypeError – If inputs are not of expected types.
ValueError – If input DataFrame is empty or missing required columns.
- Output:
- Console/Log:
Informational messages about the top features and any warnings if the model does not support feature importances.
- Return Value:
A DataFrame of feature importances.
Examples
>>> import numpy as np
>>> import pandas as pd
>>> from sklearn.ensemble import RandomForestClassifier
>>> from src.classification import get_feature_importances
>>> # Dummy data
>>> X_train = pd.DataFrame(np.random.rand(10, 5), columns=[f'feat{i}' for i in range(5)])
>>> y_train = np.array([0, 1]*5)
>>> model = RandomForestClassifier(random_state=42).fit(X_train, y_train)
>>> importances_df = get_feature_importances(model, X_train)
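The underlying extraction is conventionally along these lines (a sketch, assuming the model exposes feature_importances_, which Random Forests do):
>>> importances_sketch = (
...     pd.DataFrame({'feature': X_train.columns,
...                   'importance': model.feature_importances_})
...     .sort_values('importance', ascending=False)
... )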
- Relationships:
- Dependencies:
pandas: For DataFrame manipulation.
numpy: For array operations.
sklearn.ensemble.RandomForestClassifier: The type of model expected.
logging: For outputting messages.
- Used by:
The classification pipeline, to report feature importance.
- src.classification.predict_wound_border_type(model_path, new_features_path)#
Classifies new wound images using a pre-trained model and returns the prediction.
This function loads a pre-trained RandomForestClassifier model from a file, loads and standardizes the new features from a CSV file, and then uses the model to predict the wound border type. The result is returned as a DataFrame containing the original image IDs and the predicted labels.
- Parameters:
model_path (Path) – The full path to the saved model file (e.g., ‘.joblib’).
new_features_path (Path) – The full path to the CSV file containing the new feature vectors. The CSV file must have an ‘image_id’ column.
- Returns:
A DataFrame with ‘image_id’, ‘predicted_label’, and ‘probability’. Returns None if the prediction fails.
- Return type:
Optional[pd.DataFrame]
- Raises:
FileNotFoundError – If the model or feature file does not exist.
IOError – If there is an issue loading the model or feature file.
TypeError – If inputs are not of expected types.
- Output:
- Console/Log:
Informational messages about the prediction and the result.
- Return Value:
A DataFrame of prediction results.
Examples
>>> import pandas as pd
>>> from pathlib import Path
>>> from src.classification import predict_wound_border_type
>>> # Dummy data setup
>>> dummy_model_path = Path('./dummy_model.joblib')
>>> dummy_features_path = Path('./dummy_features.csv')
>>> prediction_df = predict_wound_border_type(dummy_model_path, dummy_features_path)
- Relationships:
- Dependencies:
joblib: For loading the pre-trained model.
pandas: For DataFrame manipulation.
numpy: For array operations.
sklearn.ensemble.RandomForestClassifier: The type of model expected.
logging: For outputting messages.
- Used by:
The classification pipeline, to classify new, unseen images.
src.clustering module#
This module provides functions for dimensionality reduction and clustering of image features.
It leverages PaCMAP (Pairwise Controlled Manifold Approximation) for transforming high-dimensional feature vectors into a low-dimensional space, followed by HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with Noise) to automatically identify clusters and outliers.
- Functions:
apply_pacmap: Performs dimensionality reduction on a feature set using PaCMAP.
perform_hdbscan_clustering: Applies the HDBSCAN algorithm to find clusters and noise points in a low-dimensional embedding.
- Typical use:
This module is used in the machine learning pipeline after feature extraction to prepare the data for visualization and to group similar images based on their extracted features. The outputs are then used for subsequent analysis and classification.
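A rough sketch of that sequence (variable names are illustrative; parameters come from the application's configuration in practice):
>>> from src.clustering import apply_pacmap, perform_hdbscan_clustering
>>> # reduce the cleaned feature set, then cluster the embedding
>>> embedding = apply_pacmap(features_df, clustering_params)
>>> if embedding is not None:
...     df_labeled, n_clusters, n_noise, labels = perform_hdbscan_clustering(
...         embedding, features_df, clustering_params)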
- src.clustering.apply_pacmap(df, clustering_params)#
Applies PaCMAP for dimensionality reduction on the feature set.
This function transforms the high-dimensional feature vectors of the dataset into a low-dimensional space (typically 2D) for visualization and to improve the performance of subsequent clustering algorithms. It first standardizes the features using StandardScaler to ensure a consistent scale before applying PaCMAP.
- Parameters:
df (pd.DataFrame) – DataFrame containing the feature vectors, with ‘image_id’ as the index. The DataFrame is expected to have been cleaned of NaNs.
clustering_params (Dict[str, Any]) – Dictionary containing configurable parameters for PaCMAP. Expected keys: ‘pacmap_n_components’, ‘pacmap_mn_ratio’, ‘pacmap_fp_ratio’, and a ‘random_state’.
- Returns:
A NumPy array representing the 2D embedding of the feature data. Returns None if an error occurs during the process.
- Return type:
Optional[np.ndarray]
- Raises:
TypeError – If the input df is not a Pandas DataFrame.
ValueError – If required keys are missing from clustering_params or if df is empty.
RuntimeError – If the PaCMAP algorithm fails to run.
- Output:
- Console/Log:
Informational messages about the dimensions of the input and output data. Error messages for invalid input or algorithm failures.
- Return Value:
A NumPy array representing the 2D embedding.
Examples
>>> import pandas as pd
>>> import numpy as np
>>> from src.clustering import apply_pacmap
>>> # Dummy feature data (cleaned, without image_id column)
>>> dummy_features = pd.DataFrame(np.random.rand(100, 28))
>>> clustering_params = {'pacmap_n_components': 2, 'pacmap_mn_ratio': 2,
...                      'pacmap_fp_ratio': 8, 'random_state': 42}
>>> embedding = apply_pacmap(dummy_features, clustering_params)
- Relationships:
- Dependencies:
pandas: For DataFrame handling.
pacmap: For the PaCMAP algorithm.
sklearn.preprocessing.StandardScaler: For feature standardization.
logging: For outputting messages.
- Used by:
The main clustering pipeline in run_pipeline.py to prepare data for HDBSCAN.
- src.clustering.perform_hdbscan_clustering(embedding, df, clustering_params)#
Performs HDBSCAN clustering on the reduced data embedding.
This function applies the Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN) algorithm to identify distinct clusters in the input data. It is particularly effective at finding clusters of varying shapes and densities and automatically handles outliers (assigning them a label of -1).
- Parameters:
embedding (np.ndarray) – The 2D embedding of the feature data, typically from PaCMAP. Shape: (N, 2).
df (pd.DataFrame) – The original DataFrame used to create the embedding. This is used to add the cluster labels back to the original data.
clustering_params (Dict[str, Any]) – Dictionary containing configurable parameters for HDBSCAN. Expected keys: ‘hdbscan_min_cluster_size’, ‘hdbscan_min_samples’, ‘hdbscan_epsilon’.
- Returns:
- A tuple containing:
df (pd.DataFrame): The original DataFrame with an added ‘cluster_label’ column.
num_clusters (int): The number of clusters identified.
num_noise (int): The number of points classified as noise.
cluster_labels (np.ndarray): The raw cluster labels returned by HDBSCAN.
- Return type:
Tuple[pd.DataFrame, int, int, np.ndarray]
- Raises:
TypeError – If embedding is not a NumPy array or df is not a Pandas DataFrame.
ValueError – If required keys are missing from clustering_params or if inputs are empty.
RuntimeError – If the HDBSCAN algorithm fails to run.
- Output:
- Console/Log:
Informational messages about the number of clusters and noise points. Error messages for invalid input or algorithm failures.
- Return Value:
The updated DataFrame and clustering summary.
Examples
>>> import pandas as pd
>>> import numpy as np
>>> from src.clustering import perform_hdbscan_clustering
>>> # Dummy data: A 100x2 embedding and corresponding DataFrame
>>> dummy_embedding = np.random.rand(100, 2)
>>> dummy_df = pd.DataFrame(np.random.rand(100, 28))
>>> clustering_params = {'hdbscan_min_cluster_size': 10, 'hdbscan_min_samples': 5,
...                      'hdbscan_epsilon': 0.5}
>>> df_with_labels, num_clusters, num_noise, labels = perform_hdbscan_clustering(
...     dummy_embedding, dummy_df, clustering_params)
- Relationships:
- Dependencies:
pandas: For DataFrame handling.
hdbscan: For the HDBSCAN algorithm.
logging: For outputting messages.
- Used by:
The main clustering pipeline in run_pipeline.py to assign cluster labels.
src.config_manager module#
This module provides a robust configuration management system for the project.
It defines the Config class, which handles loading, accessing, and organizing application settings from a JSON configuration file. This centralizes all configuration parameters, including file paths, filtering thresholds, and model parameters, making the application more flexible and easier to manage.
- Classes:
Config: The main class providing methods to load, access, and manage application configurations.
- Typical use:
This module is typically used at the application’s startup to load all necessary configuration parameters and file paths, providing a single source of truth for settings throughout the pipeline.
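A minimal startup sketch using the documented accessors (the config filename is the documented default):
>>> from src.config_manager import Config
>>> config = Config("config.json")
>>> paths = config.get_paths()
>>> clustering_params = config.get_clustering_params()
>>> classification_params = config.get_classification_params()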
- class src.config_manager.Config(config_filepath='config.json')#
Bases: object
Manages configuration settings loaded from a JSON file.
This class provides a structured way to load, access, and manage application configuration parameters from a JSON file. It centralizes configuration logic, making it easy to retrieve parameters by section and option, and to resolve file paths.
- Parameters:
config_filepath (str) – The relative or absolute path to the JSON configuration file. Defaults to ‘config.json’ in the current working directory.
- config_filepath#
The resolved Path object pointing to the configuration file.
- Type:
Path
- data#
A dictionary holding the parsed content of the JSON file.
- Type:
Dict[str, Any]
- Raises:
FileNotFoundError – If the specified configuration file does not exist.
json.JSONDecodeError – If the configuration file is found but is not valid JSON.
ValueError – If a required configuration option is missing when accessed without a fallback.
- Methods:
_load_config() : Internal method to read and parse the JSON file.
get(section, option, fallback) : Retrieves a specific configuration value.
get_paths() : Constructs and returns resolved pathlib.Path objects for all file and directory paths defined in the configuration.
get_filtering_params(), get_feature_extraction_params(), etc. : Convenience methods to return specific sections of the configuration as dictionaries.
- Output:
- Log:
Informational messages about successful configuration loading, warnings if optional configuration options are missing (and a fallback is used), and errors for critical issues (e.g., file not found, invalid format).
Example
>>> # Example: Basic configuration loading and access
>>> # Assuming a 'config.json' exists in the project root with:
>>> # {"DataPaths": {"data_path": "./data"}, "Filtering": {"threshold": 100}}
>>> from pathlib import Path
>>> from src.config_manager import Config
>>> # Assume logging is set up: from src.logging_setup import setup_logging; setup_logging(log_level="INFO")
>>> config = Config("config.json")
>>> data_path = config.get("DataPaths", "data_path")
>>> threshold = config.get("Filtering", "threshold")
>>> print(f"Loaded data path: {data_path}, threshold: {threshold}")
Loaded data path: ./data, threshold: 100
- Relationships:
- Dependencies:
Relies on Python’s built-in json module for parsing, pathlib.Path for path management, and logging for output.
- Used by:
The main application entry point (run_pipeline.py) to load and manage all application parameters and file paths.
- get(section, option, fallback=None)#
Retrieves a configuration value from a specific section and option.
- Parameters:
section (str) – The name of the section (e.g., ‘DataPaths’, ‘Filtering’).
option (str) – The name of the option within the section (e.g., ‘data_path’, ‘threshold’).
fallback (Any, optional) – A default value to return if the option is not found. If None and the option is not found, a ValueError is raised.
- Returns:
The value of the specified configuration option.
- Return type:
Any
- Raises:
ValueError – If the option is not found and no fallback value is provided.
- Attempts to access the value using dictionary lookup.
- If `KeyError` occurs and `fallback` is provided, returns `fallback`.
- If `KeyError` occurs and no `fallback` is provided, raises `ValueError`.
- Output:
Log: A warning message is logged if an option is not found but a fallback is used.
- get_paths()#
Constructs and returns a dictionary of all resolved Path objects needed by the pipeline. This centralizes all path creation logic and ensures OS-independent path handling.
- Returns:
A dictionary where keys are descriptive path names (e.g., ‘base_data_dir’, ‘filtered_manifest_path’) and values are resolved pathlib.Path objects.
- Return type:
Dict[str, Path]
- Retrieves base directory paths (e.g., `data_path`, `metadata_path`) and subdirectory names from the configuration.
- Uses `pathlib.Path` and its `/` operator to construct full paths for all relevant files and directories; resolve() is used to obtain absolute paths.
Examples
>>> from src.config_manager import Config
>>> from pathlib import Path
>>> # Assume config.json in project root with "DataPaths": {"data_path": "./data"}
>>> config = Config("config.json")
>>> paths = config.get_paths()
>>> print(paths['base_data_dir'])
/absolute/path/to/your/project/data
- Relationships:
- Used by:
The main application entry point (run_pipeline.py) to retrieve all necessary file system references.
- get_filtering_params()#
Returns the filtering parameters section from the configuration.
- Returns:
A dictionary containing parameters related to image filtering (e.g., area thresholds).
- Return type:
Dict[str, Any]
- Directly retrieves the 'Filtering' section from the loaded configuration data.
Examples
>>> from src.config_manager import Config
>>> # Assume config.json has {"Filtering": {"threshold": 100}}
>>> config = Config("config.json")
>>> params = config.get_filtering_params()
>>> print(params)
{'threshold': 100}
- Relationships:
- Used by:
The data filtering stage (e.g., filter_masks_by_area_and_component_count) to apply specific quality criteria.
- get_feature_extraction_params()#
Returns the feature extraction parameters section from the configuration.
- Returns:
A dictionary containing parameters for feature extraction (e.g., unroll iterations).
- Return type:
Dict[str, Any]
- Directly retrieves the 'FeatureExtraction' section.
- get_clustering_params()#
Returns the clustering parameters section from the configuration.
- Returns:
A dictionary containing parameters for clustering algorithms (e.g., HDBSCAN parameters).
- Return type:
Dict[str, Any]
- Directly retrieves the 'Clustering' section.
- get_classification_params()#
Returns the classification parameters section from the configuration.
- Returns:
A dictionary containing parameters for classification models (e.g., Random Forest settings).
- Return type:
Dict[str, Any]
- Directly retrieves the 'Classification' section.
- get_subdirs_params()#
Returns the subdirectory names and mappings from the configuration.
- Returns:
A dictionary containing mappings for subdirectory names.
- Return type:
Dict[str, str]
- Directly retrieves the 'subdirs' section.
- get_descriptive_labels()#
Returns the descriptive labels section from the configuration.
- Return type:
Dict[int, str]
src.data_loader module#
This module provides robust functions for loading various data types critical to the wound analysis pipeline, including images, masks, depth maps, feature vectors, and cluster assignments.
It centralizes data loading operations, ensuring consistency in file handling, path resolution, and initial data validation (e.g., checking for missing files, data integrity, and applying masks).
- Functions:
data_loader: Loads all image, mask, and depth map files for a single ImageID.
load_and_clean_features: Loads a CSV of feature vectors, handles missing data, and prepares it for ML tasks.
load_cluster_groups: Loads image-to-cluster assignment data from a CSV and groups images by their assigned cluster.
- Typical use:
This module is primarily used by the main application entry point (run_pipeline.py) and other pipeline stages to retrieve and prepare necessary data for subsequent processing, such as feature extraction, clustering, and classification.
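A rough sketch of how these loaders might be combined in the pipeline (the image IDs, `paths` dictionary key, and `subdirs_config` source are illustrative assumptions):
>>> from src.data_loader import data_loader
>>> # Hypothetical loop; 'base_data_dir' is an example key from Config.get_paths(),
>>> # and subdirs_config would come from Config.get_subdirs_params().
>>> for image_id in ["sample_001", "sample_002"]:
...     sample = data_loader(image_id, paths['base_data_dir'], subdirs_config)
...     if sample is None:
...         continue  # skip samples whose critical files are missing or unreadable
...     image, wound, depth = sample['image'], sample['wound'], sample['depth']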
- src.data_loader.data_loader(ImageID, data_root_path, subdirs_config)#
Loads various image and mask files associated with a given ImageID from specified paths.
This function is responsible for retrieving all necessary image, mask, and depth map files for a single wound. It supports loading optional marker images and applies initial masking operations to the depth map. The function enforces strict checks for critical files (image, wound mask, depth map, body mask) and raises an error if any are missing or unreadable. Warnings are logged for optional files.
- Parameters:
ImageID (str) – The unique identifier for the image set (e.g., filename without extension). All corresponding files (body mask, wound mask, depths) are expected to share this name within their respective subdirectories.
data_root_path (Path) – The base directory where the ‘images’, ‘wound_masks’, etc., subdirectories are located. This should be a resolved Path object.
subdirs_config (Dict[str, str]) – A dictionary containing subdirectory names (e.g., 'images_subdir', 'wound_masks_subdir', 'body_mask_subdir', 'depth_maps_subdir', 'marker_mask_subdir').
- Returns:
- A dictionary containing the loaded image and mask data:
'image' (np.ndarray): The loaded main RGB image. Its shape is (height, width, 3).
'wound' (np.ndarray): The loaded grayscale wound mask. Its shape is (height, width).
'body' (np.ndarray): The loaded grayscale body mask. Its shape is (height, width).
'depth' (np.ndarray): The loaded grayscale depth map (e.g., 16-bit). Its shape is (height, width), and it will have been masked by the body mask and marker mask (if applied).
Returns None if ImageID is invalid or a critical file is missing/unreadable.
- Return type:
Optional[Dict[str, Any]]
- Raises:
FileNotFoundError – If a critical file (image, wound, depth, body mask) is not found.
IOError – If a critical file is found but cannot be read (e.g., corrupted, permission issues).
ValueError – If subdirs_config is missing expected keys.
- Output:
- Log:
Informational messages on successful loads. Warnings for optional file issues or shape mismatches. Errors for critical file loading failures.
- Return Value:
A dictionary of NumPy arrays for image, wound, body, depth.
Example
>>> import numpy as np
>>> import cv2
>>> from pathlib import Path
>>> from src.data_loader import data_loader
>>> # Assume logging is set up
>>> # Assume data root path, subdirectory configuration and Image ID are as follows:
>>> temp_data_root = Path("./temp_data_loader_example")
>>> subdirs = {
...     "images_subdir": "images", "wound_masks_subdir": "wound_masks",
...     "body_mask_subdir": "body_mask", "depth_maps_subdir": "depth_maps",
...     "marker_mask_subdir": "marker_mask"
... }
>>> img_id = "sample_001"
>>> loaded_data = data_loader(img_id, temp_data_root, subdirs)
- Relationships:
- Dependencies:
cv2: For image I/O (cv2.imread) and masking operations (cv2.bitwise_and).
numpy: For array operations (np.zeros, np.any).
pathlib: For robust file path handling (Path).
logging: For outputting informational messages, warnings, and errors.
- Used by:
The main application entry point (e.g., `run_pipeline.py`) to load raw image data for individual image processing.
- src.data_loader.load_and_clean_features(file_path)#
Loads a CSV file containing feature vectors, cleans the data by dropping rows with NaN values, and separates image IDs from feature vectors.
This function acts as a robust loader for the comprehensive feature set, ensuring data integrity before further processing.
- Parameters:
file_path (Path) – The Path to the comprehensive_features.csv file.
- Returns:
- A tuple containing:
df_clean (pd.DataFrame): Cleaned DataFrame without NaNs (includes ‘image_id’).
image_ids (np.ndarray): A NumPy array of image IDs from the cleaned DataFrame.
features (np.ndarray): A NumPy array of feature vectors (without 'image_id').
If the CSV is empty, returns an empty DataFrame and empty NumPy arrays.
- Return type:
Tuple[pd.DataFrame, np.ndarray, np.ndarray]
- Raises:
FileNotFoundError – If the CSV file specified by file_path is not found.
IOError – If there’s an issue reading the CSV file (e.g., permissions, corrupted, or other unexpected errors).
KeyError – If the ‘image_id’ column is missing after loading.
- Output:
- Log:
Informational messages about loading progress and NaN rows. Error messages for critical failures like file not found or read errors.
- Return Value:
A tuple containing the cleaned DataFrame, image IDs array, and features array.
Example
>>> import pandas as pd
>>> import numpy as np
>>> from pathlib import Path
>>> from src.data_loader import load_and_clean_features
>>> # Assume logging is set up
>>> # Create a dummy CSV for the example
>>> temp_csv_path = Path("./temp_features_data.csv")
>>> pd.DataFrame({'image_id': ['A', 'B'], 'feat1': [1.0, 2.0]}).to_csv(temp_csv_path, index=False)
>>> df_clean, ids, feats = load_and_clean_features(temp_csv_path)
- Relationships:
- Dependencies:
pandas: For DataFrame operations (pd.read_csv, pd.DataFrame, .dropna(), column selection).
numpy: For array manipulation (np.ndarray, np.array).
pathlib: For robust file path handling (Path).
logging: For outputting informational messages and errors.
- Used by:
The clustering and classification pipeline sections (e.g., in run_pipeline.py) to load the prepared feature set.
- src.data_loader.load_cluster_groups(file_path)#
Loads a CSV containing image IDs and cluster labels, and groups the image IDs by cluster.
This function reads the manifest that maps image identifiers to their assigned cluster labels, providing a convenient structure for accessing cluster-specific lists of images.
- Parameters:
file_path (Path) – Path to the image_cluster_map.csv file.
- Returns:
- A tuple containing:
pd.Series: A Series mapping each cluster label to a list of image IDs, sorted by cluster label.
pd.DataFrame: The original DataFrame loaded from the CSV (includes 'image_id' and 'cluster_label').
If the CSV is empty, returns an empty Series and an empty DataFrame with the expected columns.
- Return type:
Tuple[pd.Series, pd.DataFrame]
- Raises:
FileNotFoundError – If the CSV file specified by file_path is not found.
IOError – If there’s an issue reading the CSV file (e.g., permissions, corrupted, or other unexpected errors).
KeyError – If required columns (‘image_id’, ‘cluster_label’) are missing.
- Output:
- Log:
Informational messages about loading. Error messages for critical failures.
- Return Value:
A tuple containing a Series of grouped image IDs and the original DataFrame.
Examples
>>> import pandas as pd
>>> from pathlib import Path
>>> from src.data_loader import load_cluster_groups
>>> # Assume logging is set up
>>> # Create a dummy cluster map CSV for the example
>>> temp_csv_path = Path("./temp_cluster_map.csv")
>>> pd.DataFrame({'image_id': ['A', 'B'], 'cluster_label': [0, 1]}).to_csv(temp_csv_path, index=False)
>>> cluster_groups_series, cluster_map_df = load_cluster_groups(temp_csv_path)
- Relationships:
- Dependencies:
pandas: For DataFrame operations (pd.read_csv, pd.DataFrame, pd.Series, .groupby(), .apply()).
pathlib: For robust file path handling (Path).
logging: For outputting informational messages and errors.
- Used by:
The clustering and classification pipelines (e.g., in run_pipeline.py) to retrieve cluster assignment information.
src.feature_extraction module#
This module provides a set of functions for extracting quantitative features from rectified wound images’ depth profiles.
It includes utilities for calculating mean and standard deviation profiles, performing curve fitting with linear and sigmoid models, computing statistical and spectral features, and applying a Butterworth low-pass filter to smooth data. The core functionality is encapsulated in a single function that orchestrates these steps to produce a full feature vector for a given image’s profile.
- Functions:
calculate_depth_profiles: Calculates mean and standard deviation profiles from a 2D depth strip.
r_squared: Computes the R-squared value for a given curve fit.
linear_func: A helper function defining a linear model for curve fitting.
sigmoid_func: A helper function defining a sigmoid model for curve fitting.
get_spectral_features: Extracts spectral centroid and entropy from a profile segment.
get_statistical_features: Computes statistical moments (mean, std, skew, kurtosis) for a profile segment.
butter_lowpass_filter: Applies a Butterworth low-pass filter to smooth a signal.
extract_features_from_profile: The main function that orchestrates all feature extraction steps.
- Typical use:
This module is a core part of the feature extraction pipeline. It is typically used after preprocessing and unrolling a depth map to generate a single, comprehensive feature vector for each image, which is then used for downstream clustering and classification.
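A rough sketch of the chain described above (the rectified depth strip `rect_depth`, the edge index `d1`, and `feature_params` are assumed to come from earlier pipeline stages):
>>> from src.feature_extraction import calculate_depth_profiles, extract_features_from_profile
>>> # collapse the 2D strip to 1D profiles, then extract the feature vector
>>> mean_profile, std_profile = calculate_depth_profiles(rect_depth)
>>> features, smoothed_profile, success = extract_features_from_profile(
...     mean_profile, d1=d1, feature_params=feature_params)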
- src.feature_extraction.calculate_depth_profiles(rect_depth)#
Calculates the mean and standard deviation profiles from a rectified depth map.
This function processes a 2D rectified depth strip by computing the mean and standard deviation for each column (the cross-sectional profile). It is designed to be a straightforward utility for initial profile generation.
- Parameters:
rect_depth (np.ndarray) – The 2D NumPy array representing the rectified depth strip. This array is expected to have NaNs for masked regions. Shape: (Strip_Height, Strip_Width).
- Returns:
- A tuple containing:
mean_profile (np.ndarray): The 1D mean depth profile.
std_profile (np.ndarray): The 1D standard deviation profile.
- Return type:
Tuple[np.ndarray, np.ndarray]
- Raises:
TypeError – If rect_depth is not a NumPy array.
ValueError – If rect_depth is empty or has an unexpected number of dimensions.
- Output:
- Console/Log:
Informational messages about profile dimensions. Errors for invalid inputs.
- Return Value:
Two NumPy arrays representing the mean and standard deviation profiles.
Examples
>>> import numpy as np
>>> from src.feature_extraction import calculate_depth_profiles
>>> # Assume a 10x100 rectified depth map
>>> dummy_rect_depth = np.random.rand(10, 100)
>>> mean_profile, std_profile = calculate_depth_profiles(dummy_rect_depth)
- Relationships:
- Dependencies:
numpy: For array operations (np.ndarray, np.nanmean, np.nanstd).
logging: For outputting messages.
- Used by:
The run_pipeline.py script to generate the 1D profile from which all features are extracted.
- src.feature_extraction.r_squared(y_true, y_pred)#
Calculates the R-squared (coefficient of determination) value for a curve fit.
The R-squared value measures how well the fitted model explains the variation in the actual data. A value closer to 1 indicates a better fit.
- Parameters:
y_true (np.ndarray) – The actual data values.
y_pred (np.ndarray) – The values predicted by the fitted model.
- Returns:
The R-squared value. Returns 0 if calculation is not possible.
- Return type:
float
- Raises:
TypeError – If inputs are not NumPy arrays.
ValueError – If inputs have different shapes or are empty.
- Output:
- Console/Log:
A warning message is logged if the variance is zero.
- Return Value:
A floating-point number representing the R-squared value.
Examples
>>> import numpy as np
>>> from src.feature_extraction import r_squared
>>> y_actual = np.array([1, 2, 3, 4, 5])
>>> y_fitted = np.array([1.1, 2.1, 3.2, 4.0, 5.1])
>>> r2 = r_squared(y_actual, y_fitted)
- Relationships:
- Dependencies:
numpy: For array operations.
logging: For outputting messages.
- Used by:
extract_features_from_profile to quantify the goodness of fit for the linear and sigmoid curve fits.
- src.feature_extraction.linear_func(x, m, c)#
A helper function that defines a linear model for curve fitting.
This function represents the equation y = m * x + c and is specifically designed to be compatible with scipy.optimize.curve_fit. It takes an array of x-coordinates and calculates the corresponding y-coordinates based on the provided slope and intercept.
- Parameters:
x (np.ndarray) – The input independent variable (e.g., array of x-coordinates).
m (float) – The slope of the line.
c (float) – The y-intercept of the line.
- Returns:
The calculated dependent variable (y) values.
- Return type:
np.ndarray
- Raises:
TypeError – If the input x is not a NumPy array.
- Output:
- Console/Log:
Debug messages confirming the calculation.
- Return Value:
A NumPy array containing the computed y-values.
Examples
>>> import numpy as np
>>> from src.feature_extraction import linear_func
>>> x_vals = np.array([0, 1, 2])
>>> y_vals = linear_func(x_vals, m=2.0, c=1.0)
- Relationships:
- Dependencies:
Relies on numpy for array operations.
- Used by:
extract_features_from_profile for fitting the wound bed and periwound skin regions.
- src.feature_extraction.sigmoid_func(x, L, k, x0, offset)#
A helper function that defines a generalized logistic (sigmoid) model for curve fitting.
This function represents the equation y = L / (1 + exp(-k * (x - x0))) + offset. It is useful for modeling the S-shaped transition of the wound edge. It’s designed to be compatible with scipy.optimize.curve_fit.
- Parameters:
x (np.ndarray) – The input independent variable (e.g., array of x-coordinates).
L (float) – The curve’s maximum value.
k (float) – The steepness or growth rate of the curve.
x0 (float) – The x-value of the curve’s midpoint.
offset (float) – The vertical offset of the curve.
- Returns:
The calculated dependent variable (y) values.
- Return type:
np.ndarray
- Raises:
TypeError – If the input x is not a NumPy array.
- Output:
- Console/Log:
Debug messages confirming the calculation.
- Return Value:
A NumPy array containing the computed y-values.
Examples
>>> import numpy as np
>>> from src.feature_extraction import sigmoid_func
>>> x_vals = np.linspace(-10, 10, 100)
>>> y_vals = sigmoid_func(x_vals, L=10.0, k=0.5, x0=0.0, offset=0.0)
- Relationships:
- Dependencies:
numpy: For array operations.
- Used by:
extract_features_from_profile for fitting the wound edge transition region.
- src.feature_extraction.get_spectral_features(profile_segment)#
Computes spectral features (centroid, entropy) from a profile’s power spectrum.
This function is used to characterize the textural properties of the wound surface. Spectral centroid measures the “center of mass” of the spectrum, while spectral entropy measures the “peakiness” or randomness of the signal. The function is designed to be robust against zero-variance inputs and uses a refined method for power spectrum calculation.
- Parameters:
profile_segment (np.ndarray) – A 1D NumPy array representing the full profile.
- Returns:
A dictionary containing the spectral centroid and entropy. Returns NaN values for both if the input profile is too short or lacks variance.
- Return type:
Dict[str, float]
- Raises:
TypeError – If profile_segment is not a NumPy array.
ValueError – If profile_segment is an empty array.
- Output:
- Console/Log:
A warning message is logged if the profile is too short or if the power spectrum sum is zero. Debug messages are logged for successful calculation.
- Return Value:
A dictionary of spectral features.
Examples
>>> import numpy as np
>>> from src.feature_extraction import get_spectral_features
>>> dummy_profile = np.array([1, 2, 3, 4, 5])
>>> spectral_feats = get_spectral_features(dummy_profile)
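For intuition, the two quantities are commonly derived from a normalized power spectrum roughly as follows (a sketch, not necessarily this module's exact implementation):
>>> seg = dummy_profile - np.mean(dummy_profile)
>>> power = np.abs(np.fft.rfft(seg)) ** 2
>>> freqs = np.fft.rfftfreq(len(seg))
>>> p_norm = power / np.sum(power)                        # normalized power spectrum
>>> centroid = np.sum(freqs * p_norm)                     # spectral "center of mass"
>>> entropy = -np.sum(p_norm * np.log2(p_norm + 1e-12))   # spectral entropy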
- Relationships:
- Dependencies:
numpy: For array operations and FFT (np.fft.fft, np.abs, np.sum, etc.).
logging: For outputting messages.
- Used by:
extract_features_from_profile for extracting global textural features from the entire profile.
- src.feature_extraction.get_statistical_features(profile)#
Computes statistical features (mean, std, skewness, kurtosis) for a given 1D profile.
This function quantifies the distribution and shape of the data within a specific region of the depth profile. The four moments of the distribution (mean, standard deviation, skewness, and kurtosis) are calculated and returned as a dictionary.
- Parameters:
profile (np.ndarray) – A 1D NumPy array representing a segmented profile region.
- Returns:
A dictionary containing the calculated statistical features.
- Return type:
Dict[str, float]
- Raises:
TypeError – If profile is not a NumPy array.
ValueError – If profile is an empty array.
- Output:
- Console/Log:
A debug message is logged upon successful calculation. A warning is logged for an empty input.
- Return Value:
A dictionary of statistical features.
Examples
>>> import numpy as np
>>> from src.feature_extraction import get_statistical_features
>>> dummy_profile = np.array([1, 2, 3, 4, 5])
>>> stats = get_statistical_features(dummy_profile)
- Relationships:
- Dependencies:
numpy: For array operations.
scipy.stats: For skewness and kurtosis calculations.
logging: For outputting messages.
- Used by:
extract_features_from_profile for each of the three profile regions (bed, edge, skin).
- src.feature_extraction.butter_lowpass_filter(data, cutoff, fs, order=4)#
Applies a Butterworth low-pass filter to a 1D signal.
This function designs and applies a Butterworth digital low-pass filter, which is a signal processing technique used to smooth out high-frequency noise from a signal. The filtfilt function is used to apply the filter, which ensures zero phase shift in the output.
- Parameters:
data (np.ndarray) – The 1D input signal (e.g., a depth profile).
cutoff (float) – The cutoff frequency of the filter.
fs (float) – The sampling frequency of the signal.
order (int) – The order of the filter. A higher order results in a sharper cutoff. Defaults to 4.
- Returns:
The filtered and smoothed 1D signal.
- Return type:
np.ndarray
- Raises:
TypeError – If data is not a NumPy array or cutoff/fs are not numeric types.
ValueError – If data is empty, cutoff or fs are non-positive, or order is non-positive.
RuntimeError – If the filter design or application fails.
- Output:
- Console/Log:
A debug message is logged upon successful filter application. Warnings are logged for invalid input values.
- Return Value:
A NumPy array containing the filtered signal.
Examples
>>> import numpy as np
>>> from src.feature_extraction import butter_lowpass_filter
>>> # Assume a signal with some high-frequency noise
>>> t = np.linspace(0, 1, 500, endpoint=False)
>>> sig = np.sin(2 * np.pi * 10 * t) + np.sin(2 * np.pi * 50 * t)
>>> fs = 500
>>> cutoff_freq = 20
>>> filtered_sig = butter_lowpass_filter(sig, cutoff=cutoff_freq, fs=fs)
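Internally, such a filter is conventionally designed by normalizing the cutoff frequency by the Nyquist frequency; a minimal sketch under that assumption (not necessarily this module's exact implementation):
>>> from scipy.signal import butter, filtfilt
>>> nyquist = 0.5 * fs
>>> b, a = butter(4, cutoff_freq / nyquist, btype='low', analog=False)
>>> smoothed = filtfilt(b, a, sig)  # forward-backward filtering gives zero phase shift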
- Relationships:
- Dependencies:
numpy: For array operations.
scipy.signal: For filter design and application (butter, filtfilt).
logging: For outputting messages.
- Used by:
This filter function can be used to smooth the mean depth profile (e.g., within calculate_depth_profiles).
- src.feature_extraction.extract_features_from_profile(mean_profile, d1, feature_params)#
Extracts a comprehensive set of quantitative features from a depth profile.
This is the main function for feature extraction. It segments the profile into three regions (wound bed, edge, skin) and applies statistical, spectral, and curve-fitting methods to quantify their characteristics. A comprehensive feature dictionary is returned.
- Parameters:
mean_profile (np.ndarray) – The 1D raw mean depth profile. Shape: (Profile_Length,).
d1 (int) – The index of the wound edge (the baseline contour).
feature_params (Dict[str, Any]) – A dictionary containing parameters for feature extraction, including ‘transition_width’.
- Returns:
- A tuple containing:
- features (Dict[str, Any]):
A dictionary of extracted features. Keys are feature names, values are floats or booleans.
- smoothed_profile (np.ndarray):
The low-pass filtered version of the mean_profile.
- success (bool):
True if feature extraction was successful, False otherwise.
- Return type:
Tuple[Dict[str, Any], np.ndarray, bool]
- Raises:
TypeError – If mean_profile is not a NumPy array.
ValueError – If mean_profile is empty or inputs have inconsistent shapes.
RuntimeError – If profile segmentation fails unexpectedly.
- Output:
- Console/Log:
Informational messages about each step and warnings for unsuccessful curve fitting. Errors for critical input issues.
- Return Value:
A dictionary of features, the smoothed profile, and a success flag.
Examples
>>> import numpy as np
>>> from src.feature_extraction import extract_features_from_profile
>>> # Assume a mean profile generated by calculate_depth_profiles
>>> dummy_mean = np.linspace(10, 0, 200)  # Simple linear transition
>>> # Assume wound edge is at pixel 100
>>> feature_params = {'transition_width': 50, 'cutoff_freq': 0.1, 'butter_order': 4}
>>> features_dict, smoothed, success_flag = extract_features_from_profile(
...     dummy_mean, d1=100, feature_params=feature_params)
- Relationships:
- Dependencies:
numpy: For array operations.
scipy.optimize.curve_fit: For fitting curves.
scipy.stats: For statistical calculations.
scipy.signal: For filtering.
logging: For outputting messages.
r_squared(), linear_func(), sigmoid_func(), get_spectral_features(), get_statistical_features(), butter_lowpass_filter(): All functions within this module.
- Used by:
The main application entry point (run_pipeline.py) to generate the final feature vector for a single image.
src.logging_setup module#
This module provides a centralized and configurable logging setup for the project.
It defines the setup_logging function, which allows for easy configuration of console and file-based logging, ensuring consistent log formatting and levels across the application.
- Functions:
setup_logging: Configures the global root logger for console and optional file output.
- Typical use:
This module is intended to be called once at the application’s startup to establish the primary logging configuration for the entire project.
- src.logging_setup.setup_logging(log_level='INFO', log_file_path=None)#
Sets up the global logging configuration for the project.
This function configures the root logger to output messages to both the console and, optionally, to a specified file. It ensures that logs include timestamps, log levels, and the origin module name, facilitating debugging and monitoring. It also clears any existing handlers to prevent duplicate log messages if called multiple times.
- Parameters:
log_level (str, optional) – The minimum level of messages to log for the root logger (e.g., 'DEBUG', 'INFO', 'WARNING', 'ERROR', 'CRITICAL'). Messages below this level will be ignored. Case-insensitive. Defaults to 'INFO'.
log_file_path (Optional[Path], optional) – If a Path object is provided, logs will be written to this file, appending to it if it exists. The necessary parent directories will be created if they don't exist. If None (default), logs will only go to the console.
- Returns:
The function does not return any value; it configures the global logging system as a side effect.
- Return type:
None
- Raises:
ValueError – If log_level is not a valid logging level string.
Note: a specific IOError on file handling is caught internally and logged, but not re-raised.
Examples
>>> import logging
>>> from pathlib import Path
>>> from src.logging_setup import setup_logging
>>>
>>> # Example 1: Set up console logging only at INFO level
>>> setup_logging(log_level="INFO")
>>> logger = logging.getLogger(__name__)
>>> logger.info("This is an info message to console.")
>>> logger.debug("This debug message will not appear (level is INFO).")
>>>
>>> # Example 2: Set up logging to a file and console at DEBUG level
>>> log_file = Path("/tmp/my_application.log")
>>> setup_logging(log_level="DEBUG", log_file_path=log_file)
>>> logger_file = logging.getLogger(__name__)
>>> logger_file.debug("This debug message will go to console and file.")
- Relationships:
- Used by:
Expected to be called once by the main entry point of the application at startup (e.g., run_pipeline.py).
- Affects:
All logging.getLogger(__name__) instances throughout the application will inherit and adhere to this configuration.
Notes
This function modifies the global root logger.
Existing log handlers are cleared upon each call to prevent duplicate log output if setup_logging is invoked multiple times during application runtime.
If file logging setup fails (e.g., due to permission errors), an error is logged, and the function falls back to console-only logging.
src.plotting module#
Data Visualization Utilities for Wound Border Characterization.
This module contains a collection of functions for visualizing the various stages of the wound border characterization pipeline. It includes utilities for displaying raw image data, unrolled depth and RGB strips, dimensionality reduction embeddings, cluster distributions, and classification performance metrics. All plots are designed to be saved to file, and can optionally be displayed to the user.
- Functions:
plot_initial_data : Displays the raw RGB, wound mask, body mask, and depth map.
show_unrolled_strip : Visualizes the rectified depth profile and corresponding RGB image.
plot_depth_profile : Creates a line plot of the mean depth profile with its standard deviation.
plot_profiles_and_fits : Visualizes the mean depth profile, smoothed profile, and curve fits.
plot_embedding : Creates a scatter plot of the 2D PaCMAP embedding.
plot_clusters : Creates a scatter plot of the HDBSCAN clusters on the embedding.
plot_feature_distributions_by_cluster : Generates box plots for key features across clusters.
plot_cluster_image_grid : Displays a grid of sample images for each identified cluster.
plot_confusion_matrix : Visualizes the confusion matrix for the supervised classifier.
plot_feature_importances : Creates a bar chart of feature importance scores.
- Typical Use:
These functions are called by the main application script (run_pipeline.py) or utility scripts (ML_Pipeline.py) to provide visual feedback and to save key results for reporting and analysis. Plots are configurable to be saved automatically without requiring manual interaction.
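A rough sketch of how a few of these might be called after clustering (the output directory, `embedding`, and `cluster_labels` are illustrative assumptions from earlier pipeline stages):
>>> from pathlib import Path
>>> from src.plotting import plot_embedding, plot_clusters
>>> results_dir = Path("./results")  # hypothetical output directory
>>> results_dir.mkdir(parents=True, exist_ok=True)
>>> plot_embedding(embedding, results_dir / "pacmap_embedding.png")
>>> plot_clusters(embedding, cluster_labels, results_dir / "hdbscan_clusters.png")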
- src.plotting.plot_initial_data(image, wound_mask, body_mask, depth_map, save_path=None)#
Creates and saves a grid plot of the raw input data.
This utility visualizes the initial data for a single sample, including the RGB image, wound mask, body mask, and depth map.
- Parameters:
image (np.ndarray) – The loaded RGB image.
wound_mask (np.ndarray) – The loaded wound mask.
body_mask (np.ndarray) – The loaded body mask.
depth_map (np.ndarray) – The loaded depth map.
save_path (Optional[Path], optional) – The full path to save the plot. If None, the plot is not saved. Defaults to None.
- Returns:
None
- Raises:
TypeError – If any of the inputs are not NumPy arrays.
ValueError – If the shapes of the inputs are inconsistent.
- Output:
- Log:
A debug message is logged upon successful plotting.
- File:
A PNG file of the plot at save_path.
Examples
>>> import numpy as np
>>> from src.plotting import plot_initial_data
>>> # Create dummy data
>>> dummy_img = np.random.randint(0, 255, (100, 100, 3), dtype=np.uint8)
>>> dummy_mask = np.zeros((100, 100), dtype=np.uint8)
>>> dummy_depth = np.random.rand(100, 100)
>>> plot_initial_data(dummy_img, dummy_mask, dummy_mask, dummy_depth)
- src.plotting.show_unrolled_strip(rect_depth, unrolled_image, d1, p1, iterations, save_path=None)#
Creates and saves a plot of the rectified depth and RGB strips.
This function visualizes the output of the periwound unrolling process, displaying the depth and RGB data in a standardized rectangular format.
- Parameters:
rect_depth (np.ndarray) – The rectified depth profile.
unrolled_image (np.ndarray) – The rectified RGB image.
d1 (int) – The position of the wound border in the depth profile.
p1 (int) – The position of the wound border in the RGB image.
iterations (int) – The number of erosion/dilation steps.
save_path (Optional[Path], optional) – The full path to save the plot. If None, the plot is not saved. Defaults to None.
- Returns:
None
- Raises:
TypeError – If any of the inputs are not NumPy arrays or integers.
- Output:
- Log:
A debug message is logged upon successful plotting.
- File:
A PNG file of the plot at save_path.
Examples
>>> import numpy as np
>>> from src.plotting import show_unrolled_strip
>>> dummy_depth_strip = np.random.rand(100, 20)
>>> dummy_rgb_strip = np.random.randint(0, 255, (100, 20, 3), dtype=np.uint8)
>>> show_unrolled_strip(dummy_depth_strip, dummy_rgb_strip, d1=50, p1=50, iterations=10)
- src.plotting.plot_depth_profile(mean_profile, std_profile, d1, save_path=None)#
Creates and saves a line plot of the raw mean depth profile.
This function creates a line plot of the mean depth profile, and uses a shaded area to represent the standard deviation across the strip. It also adds a vertical line at the estimated wound border position (d1).
- Parameters:
mean_profile (np.ndarray) – A 1D array of the mean depth profile.
std_profile (np.ndarray) – A 1D array of the standard deviation profile.
d1 (int) – The index of the wound edge (the baseline contour).
save_path (Optional[Path], optional) – The full path to save the plot. If None, the plot is not saved. Defaults to None.
- Returns:
None
- Raises:
TypeError – If inputs are not of expected types.
ValueError – If inputs have inconsistent shapes or are empty.
- Output:
- Log:
A debug message is logged upon successful plotting.
- File:
A PNG file of the plot at save_path.
Examples
>>> import numpy as np
>>> from src.plotting import plot_depth_profile
>>> dummy_mean = np.linspace(10, 0, 100) + np.random.rand(100)
>>> dummy_std = np.ones(100) * 0.5
>>> plot_depth_profile(dummy_mean, dummy_std, d1=50)
- src.plotting.plot_profiles_and_fits(mean_profile, std_profile, smoothed_profile, features, d1, p1, transition_width, save_path=None)#
Creates and saves a plot of the depth profile and its piecewise curve fits.
This function visualizes the core output of the feature extraction module, showing how the linear and sigmoid functions fit the smoothed depth profile across the bed, edge, and skin regions.
- Parameters:
mean_profile (np.ndarray) – The raw mean depth profile.
std_profile (np.ndarray) – The standard deviation profile.
smoothed_profile (np.ndarray) – The smoothed mean depth profile used for fitting.
features (Dict[str, Any]) – A dictionary of extracted features from the profile.
d1 (int) – The index of the wound edge (baseline contour).
p1 (int) – A redundant parameter retained for interface consistency.
transition_width (int) – The width of the edge transition region.
save_path (Optional[Path], optional) – The full path to save the plot. If None, the plot is not saved. Defaults to None.
- Returns:
None
- Raises:
TypeError – If inputs are not of expected types.
ValueError – If inputs have inconsistent shapes or are empty.
- Output:
- Log:
A debug message is logged upon successful plotting.
- File:
A PNG file of the plot at save_path.
Examples
>>> import numpy as np
>>> from src.plotting import plot_profiles_and_fits
>>> dummy_mean = np.linspace(10, 0, 200) + np.random.rand(200)
>>> dummy_std = np.ones(200) * 0.5
>>> dummy_smoothed = np.linspace(10, 0, 200)
>>> dummy_features = {'bed_fit_success': 1, 'bed_slope': -0.1, 'bed_intercept': 10,
...                   'edge_fit_success': 1, 'edge_amplitude': 10, 'edge_steepness': 0.5,
...                   'edge_midpoint': 100, 'edge_offset': 0,
...                   'skin_fit_success': 1, 'skin_slope': 0, 'skin_intercept': 0}
>>> plot_profiles_and_fits(dummy_mean, dummy_std, dummy_smoothed, dummy_features,
...                        d1=100, p1=100, transition_width=50)
- src.plotting.plot_embedding(embedding, save_path)#
Creates and saves a scatter plot of the 2D PaCMAP embedding.
This function visualizes the feature dataset after dimensionality reduction, providing an initial view of the data’s inherent structure and potential clusters.
- Parameters:
embedding (np.ndarray) – The 2D NumPy array of the PaCMAP embedding. Expected shape: (N, 2).
save_path (Path) – The full path, including filename, to save the plot.
- Returns:
None
- Raises:
TypeError – If embedding is not a NumPy array or save_path is not a Path object.
ValueError – If embedding does not have 2 dimensions.
IOError – If there’s an issue saving the file.
- Output:
- Log:
Informational messages about successful saving. Errors if saving fails.
- File:
A PNG file of the plot at save_path.
Examples
>>> import numpy as np
>>> from src.plotting import plot_embedding
>>> from pathlib import Path
>>> # Create a dummy embedding and temporary path
>>> dummy_embedding = np.random.rand(100, 2)
>>> dummy_path = Path('./dummy_embedding.png')
>>> plot_embedding(dummy_embedding, dummy_path)
- src.plotting.plot_clusters(embedding, cluster_labels, save_path)#
Creates and saves a scatter plot of the HDBSCAN clusters on the embedding.
This function visualizes the output of the HDBSCAN algorithm, with each identified cluster represented by a different color. Noise points (labeled -1) are shown in a distinct color to provide a clear view of the clustering results.
- Parameters:
embedding (np.ndarray) – The 2D NumPy array of the PaCMAP embedding.
cluster_labels (np.ndarray) – A 1D NumPy array of cluster labels from HDBSCAN.
save_path (Path) – The full path, including filename, to save the plot.
- Returns:
None
- Raises:
TypeError – If inputs are not NumPy arrays or save_path is not a Path object.
ValueError – If input shapes are inconsistent.
IOError – If there’s an issue saving the file.
- Output:
- Log:
Informational messages about successful saving. Errors if saving fails.
- File:
A PNG file of the plot at save_path.
Examples
>>> import numpy as np
>>> from src.plotting import plot_clusters
>>> from pathlib import Path
>>> dummy_embedding = np.random.rand(100, 2)
>>> dummy_labels = np.random.randint(-1, 3, 100)  # -1, 0, 1, 2
>>> dummy_path = Path('./dummy_clusters.png')
>>> plot_clusters(dummy_embedding, dummy_labels, dummy_path)
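One way to render noise points in a distinct colour, as described above, is sketched below. This is a minimal illustration, not the module's implementation; the colormap choice and marker sizes are assumptions.

import numpy as np
import matplotlib.pyplot as plt

def sketch_cluster_scatter(embedding, cluster_labels, save_path):
    """Illustrative scatter of a 2D embedding with HDBSCAN labels; noise (-1) drawn in grey."""
    noise = cluster_labels == -1
    fig, ax = plt.subplots()
    ax.scatter(embedding[noise, 0], embedding[noise, 1], c='lightgrey', s=10, label='noise')
    ax.scatter(embedding[~noise, 0], embedding[~noise, 1], c=cluster_labels[~noise],
               cmap='tab10', s=10)
    ax.legend()
    fig.savefig(save_path)
    plt.close(fig)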
- src.plotting.plot_feature_distributions_by_cluster(df, features_to_plot, save_path)#
Generates and saves a grid of box plots for key features across identified clusters.
This function provides a visual, statistical overview of the differences between the identified wound border types. Each box plot shows the distribution of a feature within a cluster, highlighting the unique characteristics of each group.
- Parameters:
df (pd.DataFrame) – The DataFrame containing feature values and ‘cluster_label’.
features_to_plot (List[str]) – A list of feature names to include in the plots.
save_path (Path) – The full path, including filename, to save the plot.
- Returns:
None
- Raises:
TypeError – If df is not a Pandas DataFrame or features_to_plot is not a list.
ValueError – If required columns are missing from df.
IOError – If there’s an issue saving the file.
- Output:
- Log:
Informational messages about successful saving. Errors if saving fails.
- File:
A PNG file of the plot at save_path.
Examples
>>> import numpy as np
>>> import pandas as pd
>>> from src.plotting import plot_feature_distributions_by_cluster
>>> from pathlib import Path
>>> dummy_df = pd.DataFrame({
...     'feat_A': np.random.rand(100),
...     'feat_B': np.random.rand(100),
...     'cluster_label': np.random.randint(0, 3, 100)
... })
>>> dummy_path = Path('./dummy_distributions.png')
>>> plot_feature_distributions_by_cluster(dummy_df, ['feat_A', 'feat_B'], dummy_path)
- src.plotting.plot_cluster_image_grid(cluster_groups, image_dir, save_path, num_samples=3)#
Displays a grid of representative sample images for each identified cluster.
This function provides a critical qualitative validation step, allowing a user to visually inspect whether the computationally identified clusters correspond to distinct and meaningful morphological patterns in the wound images.
- Parameters:
cluster_groups (pd.Series) – A Pandas Series mapping each cluster label to a list of image IDs. This is typically the output of save_cluster_assignments.
image_dir (Path) – The Path to the directory containing the original RGB image files.
save_path (Path) – The full path, including filename, to save the plot.
num_samples (int) – The number of random images to display for each cluster. Defaults to 3.
- Returns:
None
- Raises:
TypeError – If cluster_groups is not a Pandas Series or image_dir is not a Path object.
IOError – If there’s an issue loading image files or saving the plot.
- Output:
- Log:
Informational messages about successful plotting. Warnings for images that fail to load. Errors if saving fails.
- File:
A PNG file of the plot at save_path.
Examples
>>> import pandas as pd
>>> import numpy as np
>>> from pathlib import Path
>>> import cv2
>>> from src.plotting import plot_cluster_image_grid
>>> # Create a dummy image directory and files
>>> temp_dir = Path('./temp_images_for_grid'); temp_dir.mkdir(exist_ok=True)
>>> cv2.imwrite(str(temp_dir / 'imgA.png'), np.zeros((10,10,3), dtype=np.uint8))
>>> cv2.imwrite(str(temp_dir / 'imgB.png'), np.ones((10,10,3), dtype=np.uint8) * 255)
>>> # Create a dummy cluster_groups Series
>>> groups = pd.Series([['imgA'], ['imgB']], index=[0, 1])
>>> dummy_path = Path('./dummy_image_grid.png')
>>> plot_cluster_image_grid(groups, temp_dir, dummy_path)
- src.plotting.plot_confusion_matrix(model, X_test, y_test, save_path)#
Visualizes and saves the confusion matrix for the supervised classifier.
This function provides a visual overview of the classifier’s performance, showing which classes are correctly identified and which are frequently misclassified.
- Parameters:
model (RandomForestClassifier) – The trained classifier model.
X_test (np.ndarray) – Testing features.
y_test (np.ndarray) – Testing target labels.
save_path (Path) – The full path, including filename, to save the plot.
- Returns:
None
- Raises:
TypeError – If inputs are not of expected types.
ValueError – If inputs are empty or have inconsistent shapes.
IOError – If there’s an issue saving the file.
- Output:
- Console/Log:
Informational messages about successful saving. Errors if saving fails.
- File:
A PNG file of the plot at save_path.
Examples
>>> import numpy as np
>>> from sklearn.ensemble import RandomForestClassifier
>>> from src.plotting import plot_confusion_matrix
>>> from pathlib import Path
>>> # Dummy data
>>> model = RandomForestClassifier(random_state=42).fit(np.random.rand(10, 5), np.array([0, 1]*5))
>>> X_test = np.random.rand(10, 5)
>>> y_test = np.array([0, 1]*5)
>>> save_path = Path('./dummy_confusion_matrix.png')
>>> plot_confusion_matrix(model, X_test, y_test, save_path)
- Relationships:
- Dependencies:
Relies on numpy, sklearn.metrics, and matplotlib.
- Used by:
The classification pipeline to report model performance.
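A minimal sketch of how such a confusion-matrix figure could be produced with scikit-learn's ConfusionMatrixDisplay; the actual plot_confusion_matrix implementation may build the figure differently, and the helper name below is hypothetical.

from pathlib import Path
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay

def sketch_confusion_matrix(model, X_test, y_test, save_path: Path):
    """Illustrative confusion-matrix plot from a fitted classifier and a held-out test set."""
    disp = ConfusionMatrixDisplay.from_estimator(model, X_test, y_test)  # predicts internally
    disp.figure_.savefig(save_path)
    plt.close(disp.figure_)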
- src.plotting.plot_feature_importances(feature_importance_df, save_path)#
Creates and saves a bar chart of feature importance scores from a trained model.
This function provides a visual ranking of the most influential features for the classification task, offering insights into which quantitative metrics are most critical for distinguishing between wound border types.
- Parameters:
feature_importance_df (pd.DataFrame) – A DataFrame with ‘feature’ and ‘importance’ columns.
save_path (Path) – The full path, including filename, to save the plot.
- Returns:
None
- Raises:
TypeError – If feature_importance_df is not a Pandas DataFrame or save_path is not a Path object.
ValueError – If required columns (‘feature’, ‘importance’) are missing.
IOError – If there’s an issue saving the file.
- Output:
- Log:
Informational messages about successful saving. Errors if saving fails.
- File:
A PNG file of the plot at save_path.
Examples
>>> import pandas as pd
>>> from src.plotting import plot_feature_importances
>>> from pathlib import Path
>>> dummy_df = pd.DataFrame({
...     'feature': ['feat_A', 'feat_B', 'feat_C'],
...     'importance': [0.5, 0.3, 0.2]
... })
>>> dummy_path = Path('./dummy_feature_importances.png')
>>> plot_feature_importances(dummy_df, dummy_path)
src.preprocessing module#
This module provides a suite of preprocessing functions essential for wound image analysis.
It encompasses functionalities for filtering wound masks based on quality criteria, cleaning depth map data using Z-score filtering, correcting depth maps for body curvature, and unrolling peri-wound regions into standardized strips for feature extraction.
- Functions:
validate_wound_masks: Filters wound mask images based on area and component count.
zscore_filter: Applies a Z-score filter to depth maps within body masks to remove outliers.
quad_surface: A helper function defining the quadratic surface model for curve fitting.
depth_corrction_for_body_curvature: Corrects depth maps for the body’s natural curvature using surface fitting.
sample_pixels_from_contour: Samples image pixels along a mask’s longest contour.
unroll_periwound_to_image: Transforms irregular peri-wound regions into rectangular strips via morphological operations.
- Typical use:
This module is typically used in the early stages of the image processing pipeline, after initial data loading, to clean, normalize, and transform raw image and depth data into a suitable format for subsequent feature extraction, clustering, and classification.
- src.preprocessing.validate_wound_masks(mask_directory, filtering_params)#
Filters wound mask images based on specified quality criteria.
This function processes image files within a given directory, applying checks for the number of connected components and the area of the primary wound region. Only image IDs that satisfy all criteria are considered valid and returned as a Pandas DataFrame.
- Parameters:
mask_directory (Path) – The path to the folder containing the wound mask images. This should be a resolved Path object.
filtering_params (Dict[str, Any]) –
- Dictionary containing filtering parameters:
’pixel_area_threshold’ (int): Minimum pixel area required for the largest wound component to be considered valid.
’max_wound_components’ (int): The expected (and required) number of connected wound components in the mask. Typically 1.
- Returns:
A Pandas DataFrame with a single column ‘image_id’ containing the IDs of images that meet the filtering criteria. Returns an empty DataFrame if no images pass the filter or if the mask directory is empty/invalid.
- Return type:
pd.DataFrame
- Raises:
FileNotFoundError – If the specified mask_directory does not exist.
IOError – If there’s an issue reading files from the mask_directory.
TypeError – If input parameters (mask_directory, filtering_params) are not of the expected types.
ValueError – If filtering_params is improperly structured (e.g., missing keys).
Examples
>>> from pathlib import Path
>>> import pandas as pd
>>> from src.preprocessing import validate_wound_masks
>>> from src.config_manager import Config
>>> # Assume a Config instance named `config` is available with methods for paths and filtering parameters
>>> paths = config.get_paths()  # load paths from config_manager
>>> filtering_params = config.get_filtering_params()  # load filtering parameters from config_manager
>>> valid_ids_df = validate_wound_masks(paths['wound_masks_dir'], filtering_params)
- Relationships:
- Used by:
The main application entry point (e.g., run_pipeline.py) might call this function as part of the initial data preparation phase.
- src.preprocessing.zscore_filter(body, depth)#
Applies a Z-score filter to the depth map within the body mask to remove outliers.
This function helps in removing noise (e.g., specular reflections) from the depth data by identifying and masking out pixels whose depth values are statistically too far from the mean. The body mask is simultaneously updated to reflect the removed outlier regions.
- Parameters:
body (np.ndarray) – A 2D NumPy array representing the binary body mask (e.g., 0 for background, >0 for body). Expected dtype: uint8. Shape: (H, W).
depth (np.ndarray) – A 2D NumPy array representing the depth map values. Expected dtype: float32 or uint16 (converted internally). Shape: (H, W).
- Returns:
- A tuple containing:
- body_clensed (np.ndarray): The updated body mask with outlier regions zeroed out.
dtype: uint8. Shape: (H, W).
- depth_clensed (np.ndarray): The depth map with outlier pixels removed (set to 0).
dtype: Same as input depth. Shape: (H, W).
- Return type:
Tuple[np.ndarray, np.ndarray]
- Raises:
TypeError – If input body or depth are not NumPy arrays.
ValueError – If input image shapes are inconsistent or empty.
RuntimeError – If the standard deviation is zero, causing a division by zero in the Z-score calculation.
- Output:
- Console/Log:
Debug messages on filter application. Error messages for invalid inputs.
- Return Value:
Two NumPy arrays representing the cleaned body mask and depth map.
Examples
>>> import numpy as np
>>> from src.preprocessing import zscore_filter
>>> # Dummy data: a 5x5 depth map with one outlier and a corresponding body mask
>>> depth_map_in = np.array([[10, 10, 10, 10, 10],
...                          [10, 10, 1000, 10, 10],
...                          [10, 10, 10, 10, 10],
...                          [10, 10, 10, 10, 10],
...                          [10, 10, 10, 10, 10]], dtype=np.float32)
>>> body_mask_in = np.ones((5, 5), dtype=np.uint8) * 255  # All body initially
>>> cleaned_body, cleaned_depth = zscore_filter(body_mask_in.copy(), depth_map_in.copy())
- Relationships:
- Dependencies:
Relies on numpy for array operations (np.ndarray, .mean(), .std()) and cv2 for bitwise operations (cv2.bitwise_and).
- Used by:
depth_corrction_for_body_curvature in this module as a preprocessing step.
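The filtering logic described above can be sketched as follows. This is illustrative only: the actual zscore_filter may use a different threshold and masking strategy; the value z_thresh=3.0 and the helper name are assumptions.

import numpy as np

def sketch_zscore_filter(body, depth, z_thresh=3.0):
    """Illustrative Z-score outlier removal on depth values inside the body mask."""
    inside = body > 0
    vals = depth[inside].astype(np.float32)
    mean, std = vals.mean(), vals.std()
    if std == 0:
        raise RuntimeError("Zero standard deviation; Z-scores are undefined.")
    z = np.zeros_like(depth, dtype=np.float32)
    z[inside] = np.abs((depth[inside].astype(np.float32) - mean) / std)
    outliers = inside & (z > z_thresh)
    body_out, depth_out = body.copy(), depth.copy()
    body_out[outliers] = 0      # drop outlier pixels from the body mask
    depth_out[outliers] = 0     # and zero their depth values
    return body_out, depth_out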
- src.preprocessing.quad_surface(xy, a, b, c, d, e, f)#
Defines a quadratic surface equation for fitting body curvature.
This helper function represents the mathematical model $z = ax^2 + by^2 + cxy + dx + ey + f$ used to approximate the natural curvature of the human body from depth data. It’s specifically designed to be compatible with scipy.optimize.curve_fit.
- Parameters:
xy (Tuple[np.ndarray, np.ndarray]) – A tuple of two 2D NumPy arrays, (x_coordinates_grid, y_coordinates_grid), giving the grid points at which the surface is evaluated. Note: because of how depth_corrction_for_body_curvature builds its coordinate grids before calling scipy.optimize.curve_fit, the tuple it passes in practice is effectively (y_coords_grid, x_coords_grid).
a (float) – Coefficients of the quadratic surface equation. These are the parameters that curve_fit will determine.
b (float) – Coefficients of the quadratic surface equation. These are the parameters that curve_fit will determine.
c (float) – Coefficients of the quadratic surface equation. These are the parameters that curve_fit will determine.
d (float) – Coefficients of the quadratic surface equation. These are the parameters that curve_fit will determine.
e (float) – Coefficients of the quadratic surface equation. These are the parameters that curve_fit will determine.
f (float) – Coefficients of the quadratic surface equation. These are the parameters that curve_fit will determine.
- Returns:
A 2D NumPy array representing the calculated height (z) values for the given coordinates. The shape will be that of x_coordinates_grid (or y_coordinates_grid).
- Return type:
np.ndarray
- Output:
- Return Value:
A NumPy array of calculated z-values.
Examples
>>> import numpy as np
>>> from src.preprocessing import quad_surface
>>> # Create a simple 2x2 grid for x and y coordinates
>>> y_coords, x_coords = np.mgrid[0:2, 0:2]  # y_coords is rows, x_coords is columns
>>> surface = quad_surface((x_coords, y_coords), a=1, b=2, c=3, d=4, e=5, f=6)
- Relationships:
- Used by:
depth_corrction_for_body_curvature as the model function for scipy.optimize.curve_fit.
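A hedged sketch of how quad_surface plugs into scipy.optimize.curve_fit. Flattened coordinate grids are passed because curve_fit expects one-dimensional dependent data; this works here because the quadratic model is evaluated element-wise. The grid sizes and noise level are illustrative.

import numpy as np
from scipy.optimize import curve_fit
from src.preprocessing import quad_surface

# Build a synthetic quadratic surface, add noise, and recover its coefficients.
y_grid, x_grid = np.mgrid[0:50, 0:50]
true_z = quad_surface((x_grid, y_grid), 0.01, 0.02, 0.0, 0.1, -0.1, 5.0)
noisy_z = true_z + np.random.normal(scale=0.1, size=true_z.shape)
params, _ = curve_fit(quad_surface,
                      (x_grid.ravel().astype(float), y_grid.ravel().astype(float)),
                      noisy_z.ravel())
fitted_surface = quad_surface((x_grid, y_grid), *params)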
- src.preprocessing.depth_corrction_for_body_curvature(wound, body, depth, kernel_size=(20, 20), dilation_iterations=15)#
Corrects the depth map for the natural curvature of the human body.
This function isolates the intrinsic topography of the wound by subtracting a fitted quadratic surface from the depth map, which represents the body’s curvature. It applies Z-score filtering first to clean initial depth data outliers.
- Parameters:
wound (np.ndarray) – A 2D NumPy array representing the binary wound mask (255=wound, 0=background). Shape: (H, W).
body (np.ndarray) – A 2D NumPy array representing the binary body mask (255=body, 0=background). Shape: (H, W).
depth (np.ndarray) – A 2D NumPy array representing the raw depth map. Shape: (H, W).
kernel_size (Tuple[int, int]) – Size of the elliptical kernel for morphological operations. Defaults to (20, 20).
dilation_iterations (int) – Number of iterations for wound mask dilation to define the peri-wound ROI. Defaults to 15.
- Returns:
- A tuple containing:
body_clensed (np.ndarray): The body mask after Z-score filtering. dtype: uint8.
depth_clensed (np.ndarray): The depth map after Z-score filtering (before curvature correction).
- depth_corrected (np.ndarray): The final depth map corrected for body curvature,
with background and outliers masked out.
- Return type:
Tuple[np.ndarray, np.ndarray, np.ndarray]
- Raises:
ValueError – If input image shapes are inconsistent or scipy.optimize.curve_fit fails to converge or fit.
TypeError – If input arrays are not NumPy arrays.
RuntimeError – If image dimensions change unexpectedly during processing.
- Output:
- Console/Log:
Informational messages about filtering and surface fitting steps. Warnings if fitting data is insufficient or if curve_fit fails. Errors for critical input issues.
- Return Value:
Three NumPy arrays representing the cleaned body, cleaned depth, and curvature-corrected depth.
Examples
>>> import numpy as np
>>> import cv2
>>> from src.preprocessing import depth_corrction_for_body_curvature
>>> # Assume logging is set up
>>> # Dummy data: a simple wound, body, and depth map (100x100)
>>> dummy_wound = np.zeros((100, 100), dtype=np.uint8); cv2.circle(dummy_wound, (50, 50), 10, 255, -1)
>>> dummy_body = np.ones((100, 100), dtype=np.uint8) * 255
>>> dummy_depth = np.linspace(0, 100, 10000).reshape(100, 100).astype(np.float32)  # Simulated gradient depth
>>> dummy_depth[50, 50] = 500  # A 'wound' dip to make it interesting
>>> cleaned_body, cleaned_depth, corrected_depth = depth_corrction_for_body_curvature(
...     dummy_wound, dummy_body, dummy_depth)
- Relationships:
- Dependencies:
Calls zscore_filter() and quad_surface(). Uses cv2 for morphology (cv2.getStructuringElement, cv2.dilate, cv2.bitwise_and), numpy for array operations, and scipy.optimize.curve_fit for surface fitting.
- Used by:
The main application entry point (run_pipeline.py) during the preprocessing stage, immediately after data loading.
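The core correction idea (fit a quadratic surface to healthy-skin depth outside the dilated peri-wound ROI, then subtract it) can be sketched as below. The choice of fit region and the masking of the result are assumptions inferred from the parameter descriptions, and the helper name is hypothetical; the real function also applies Z-score filtering first.

import numpy as np
import cv2
from scipy.optimize import curve_fit
from src.preprocessing import quad_surface

def sketch_curvature_correction(wound, body, depth, kernel_size=(20, 20), dilation_iterations=15):
    """Illustrative curvature correction: fit a quadratic surface to depth outside
    the peri-wound ROI, then subtract it from the depth map inside the body mask."""
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, kernel_size)
    periwound = cv2.dilate(wound, kernel, iterations=dilation_iterations)
    fit_region = (body > 0) & (periwound == 0)           # healthy skin used for the fit
    y_idx, x_idx = np.nonzero(fit_region)
    z_vals = depth[fit_region].astype(np.float64)
    params, _ = curve_fit(quad_surface, (x_idx.astype(float), y_idx.astype(float)), z_vals)
    y_grid, x_grid = np.mgrid[0:depth.shape[0], 0:depth.shape[1]]
    surface = quad_surface((x_grid.astype(float), y_grid.astype(float)), *params)
    corrected = np.where(body > 0, depth - surface, 0)   # subtract curvature inside the body mask
    return corrected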
- src.preprocessing.sample_pixels_from_contour(img, mask)#
Samples pixels from an image along its longest contour within a given binary mask.
This utility function extracts pixel values that lie directly on the primary contour identified within the mask. It’s typically used to get a baseline pixel strip along a wound border or other defined boundary for subsequent analysis.
- Parameters:
img (np.ndarray) – The input image (RGB or grayscale) from which to sample pixels. Shape: (H, W) or (H, W, C).
mask (np.ndarray) – A binary mask (uint8, 255=foreground, 0=background) defining the region of interest from which contours are extracted. Shape: (H, W).
- Returns:
A NumPy array containing the sampled pixels. For color images, shape will be (N, 1, C); for grayscale, (N, 1). N is the contour length.
- Return type:
np.ndarray
- Raises:
TypeError – If input img or mask are not NumPy arrays.
ValueError – If the mask is empty, has inconsistent shape with img, or no valid contours are found, or if a contour is degenerate.
- Output:
- Console/Log:
Debug messages on pixel sampling. Error messages for invalid masks/contours.
- Return Value:
A NumPy array of sampled pixel values.
Example
>>> import numpy as np
>>> import cv2
>>> from src.preprocessing import sample_pixels_from_contour
>>> # Dummy image and mask: a white square on a black background
>>> dummy_img = np.zeros((50, 50, 3), dtype=np.uint8); dummy_img[10:40, 10:40] = 200  # Grey square
>>> dummy_mask = np.zeros((50, 50), dtype=np.uint8); dummy_mask[10:40, 10:40] = 255  # White square mask
>>> pixels = sample_pixels_from_contour(dummy_img, dummy_mask)
- Relationships:
- Dependencies:
Relies on cv2 for contour finding (cv2.findContours) and numpy for array manipulation.
- Used by:
unroll_periwound_to_image to get the initial baseline pixel strip and subsequent pixel rings during erosion/dilation.
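The contour-sampling step can be sketched with cv2.findContours plus fancy indexing; this is a simplified illustration (it omits the extra singleton axis the documented return shape describes) and the helper name is hypothetical.

import numpy as np
import cv2

def sketch_sample_contour_pixels(img, mask):
    """Illustrative sampling of image pixels along the longest contour of a binary mask."""
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_NONE)
    if not contours:
        raise ValueError("No contours found in mask.")
    longest = max(contours, key=len)   # contour points are (x, y) pairs of shape (N, 1, 2)
    xs = longest[:, 0, 0]
    ys = longest[:, 0, 1]
    return img[ys, xs]                 # shape (N, C) for colour images, (N,) for grayscale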
- src.preprocessing.unroll_periwound_to_image(img, mask, iterations=100, kernel_size=(3, 3))#
“Unrolls” the peri-wound region from an image (RGB or depth map) into a standardized rectangular strip.
This technique transforms the irregular, ring-like area around a wound into a fixed-geometry strip, enabling consistent feature extraction. It achieves this by iteratively dilating (outwards from wound) and eroding (inwards from wound) the wound mask and sampling pixels at each step. This process helps to flatten the wound border profile for quantitative analysis.
- Parameters:
img (np.ndarray) – The input image (RGB or depth map) to unroll. Shape: (H, W) or (H, W, C).
mask (np.ndarray) – The binary wound mask (uint8, 255=wound, 0=background). Shape: (H, W).
iterations (int) – The maximum number of erosion/dilation steps. This determines the potential total width of the unrolled strip. Defaults to 100.
kernel_size (Tuple[int, int]) – Size of the elliptical kernel for morphological operations. Defaults to (3, 3).
- Returns:
- A tuple containing:
- unrolled_strip (np.ndarray): The 2D (for grayscale/depth) or 3D (for RGB) unrolled rectangular strip.
Shape: (Contour_Length, Total_Strip_Width, [Channels]).
- erosion_dilation_counts (Tuple[int, int]): A tuple (num_eroded_strips, num_dilated_strips)
representing the effective width of the inner (wound bed) and outer (periwound skin) regions.
- Return type:
Tuple[np.ndarray, Tuple[int, int]]
- Raises:
TypeError – If input img or mask are not NumPy arrays.
ValueError – If input mask or img is empty or has inconsistent shape, or if sample_pixels_from_contour fails internally.
RuntimeError – If image dimensions change unexpectedly during processing, or if unrolled image has an unexpected number of dimensions.
- Output:
- Console/Log:
Informational messages about unrolling progress, and warnings for early stopping due to mask issues. Errors for critical input or processing failures.
- Return Value:
The unrolled image strip and a tuple of erosion/dilation counts.
Examples
>>> import numpy as np
>>> import cv2
>>> from pathlib import Path
>>> from src.preprocessing import unroll_periwound_to_image
>>> # Assume logging is set up
>>> # Dummy data: a simple 100x100 image and wound mask (circle in center)
>>> dummy_img = np.zeros((100, 100, 3), dtype=np.uint8); dummy_img[40:60, 40:60] = 200  # Grey square
>>> dummy_wound_mask = np.zeros((100, 100), dtype=np.uint8); cv2.circle(dummy_wound_mask, (50, 50), 10, 255, -1)
>>> unrolled_strip, counts = unroll_periwound_to_image(dummy_img, dummy_wound_mask, iterations=10)
- Relationships:
- Dependencies:
Calls sample_pixels_from_contour(). Uses cv2 for morphological operations (cv2.getStructuringElement, cv2.dilate, cv2.erode, cv2.resize) and numpy for array manipulation.
- Used by:
The main application entry point (run_pipeline.py) during the preprocessing stage, typically after depth correction.
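The unrolling described above (iteratively dilate outwards and erode inwards, sampling a pixel ring at each step, then stacking the rings into a strip) can be sketched as follows. The ring ordering, the resize-to-baseline-length step, and the early-stop condition are assumptions about the procedure, not the module's exact implementation; the baseline ring itself is omitted for brevity.

import numpy as np
import cv2
from src.preprocessing import sample_pixels_from_contour

def sketch_unroll(img, mask, iterations=100, kernel_size=(3, 3)):
    """Illustrative unrolling of the peri-wound region into a rectangular strip."""
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, kernel_size)
    base_len = len(sample_pixels_from_contour(img, mask))   # length of the baseline ring
    outer, current = [], mask.copy()
    for _ in range(iterations):                              # outward rings (peri-wound skin)
        current = cv2.dilate(current, kernel)
        ring = sample_pixels_from_contour(img, current)
        outer.append(cv2.resize(ring, (1, base_len)))        # normalise ring length
    inner, current = [], mask.copy()
    for _ in range(iterations):                              # inward rings (wound bed)
        current = cv2.erode(current, kernel)
        if cv2.countNonZero(current) == 0:
            break                                            # mask exhausted: stop early
        ring = sample_pixels_from_contour(img, current)
        inner.append(cv2.resize(ring, (1, base_len)))
    strip = np.hstack(inner[::-1] + outer)                   # bed on the left, skin on the right
    return strip, (len(inner), len(outer))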
src.utils module#
This module provides a collection of utility functions for common data manipulation, analysis, and file I/O operations within the project’s data processing pipeline.
It centralizes functions for saving Pandas DataFrames to CSV, handling feature saving incrementally, and generating various statistical summaries and profiles from clustered data.
- Functions:
- save_dataframe_to_csv: A general-purpose function for saving DataFrames to CSV,
with flexible options for appending and header control.
save_features_to_csv: Appends individual image features to a CSV file.
save_cluster_assignments: Calculates and saves image-to-cluster assignments.
generate_cluster_summary: Generates and saves comprehensive statistics for each cluster.
generate_cluster_profiles: Calculates and saves mean feature values (profiles) for each cluster.
- Typical use:
This module is primarily used by various stages of the data processing and machine learning pipeline to perform standardized saving operations, data aggregation, and reporting of results.
- src.utils.save_dataframe_to_csv(df, output_filepath, append_mode=False, include_header=None, index=False)#
Saves a Pandas DataFrame to a CSV file, handling directory creation and header logic.
This function provides flexible options for saving DataFrames, including appending to existing files and intelligent control over header writing to avoid duplicate headers when appending. It ensures the output directory exists before attempting to write.
- Parameters:
df (pd.DataFrame) – The DataFrame to save. It is expected to contain the data intended for the CSV.
output_filepath (Path) – The full path, including filename, where the CSV will be saved. This should be a resolved Path object.
append_mode (bool, optional) – If True, the DataFrame will be appended to the file if it exists. If False, the file will be overwritten. Defaults to False.
include_header (Optional[bool], optional) – Controls when the header is written. If True, the header is always written; if False, it is never written; if None (default), the header is written only when append_mode is True and the file does not already exist. When append_mode is False (overwrite mode), pd.to_csv writes a header by default unless include_header is explicitly set to False.
index (bool, optional) – Whether to write the DataFrame index as a column in the CSV. Defaults to False.
- Returns:
The function does not return any value. It performs a file-saving operation.
- Return type:
None
- Raises:
IOError – If there’s an issue creating the directory or writing the DataFrame to the file. This could be due to permission issues, disk full, or an invalid path.
TypeError – If df is not a Pandas DataFrame or output_filepath is not a Path object.
Example
>>> from pathlib import Path
>>> import pandas as pd
>>> from src.utils import save_dataframe_to_csv
>>> # Assume 'config' provides paths like paths['filtered_manifest_csv']
>>> valid_ids_df = pd.DataFrame({'image_id': ['img_001', 'img_002']})  # Assume this is the DataFrame
>>> temp_csv_path = Path("./temp_data.csv")
>>> save_dataframe_to_csv(valid_ids_df, temp_csv_path, append_mode=False)
- Relationships:
- Dependencies:
pandas: For DataFrame operations (pd.DataFrame, .to_csv()).
pathlib: For path manipulation (Path, .parent, .mkdir(), .is_file()).
logging: For logging informational and error messages.
typing.Optional: For type hinting.
- Used by:
This function is typically called by various parts of the main application (run_pipeline.py or other utility functions) to persist data to CSV files.
Notes
The function uses pandas.DataFrame.to_csv() internally.
Logging messages indicate the success or failure of the file saving operation.
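The append/header behaviour documented above reduces to a small amount of logic; the following is a minimal sketch under those stated rules (the helper name is hypothetical and error handling/logging are omitted).

from pathlib import Path
from typing import Optional
import pandas as pd

def sketch_save_dataframe_to_csv(df: pd.DataFrame, output_filepath: Path,
                                 append_mode: bool = False,
                                 include_header: Optional[bool] = None,
                                 index: bool = False) -> None:
    """Illustrative save logic: append vs overwrite, header written only when needed."""
    output_filepath.parent.mkdir(parents=True, exist_ok=True)   # ensure the directory exists
    mode = 'a' if append_mode else 'w'
    if include_header is None:
        # default: when appending, write the header only if the file does not exist yet
        header = not (append_mode and output_filepath.is_file())
    else:
        header = include_header
    df.to_csv(output_filepath, mode=mode, header=header, index=index)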
- src.utils.save_features_to_csv(ImageID, features_dict, output_filepath)#
Appends a dictionary of features for a single image to a CSV file, ensuring the ‘image_id’ column is first. This function is designed for incremental saving during a batch process.
If the CSV file does not exist, it will be created with headers. If it exists, a new row will be appended without headers.
- Parameters:
ImageID (str) – The unique ID of the image for which features are being saved.
features_dict (Dict[str, Any]) – A dictionary where keys are feature names and values are their corresponding data.
output_filepath (Path) – The full path, including filename, where the features CSV will be saved.
- Returns:
The function does not return any value. It performs a file-saving operation.
- Return type:
None
- Raises:
ValueError – If the feature dictionary is empty or contains non-scalar values (e.g., lists).
TypeError – If ImageID is not a string, features_dict is not a dictionary, or output_filepath is not a Path object.
IOError – If there’s an issue writing to the file via save_dataframe_to_csv.
- Output:
- Log:
Warning if features_dict is empty. Error if saving fails.
- CSV File:
A new row appended to the CSV file at output_filepath.
Examples
>>> from pathlib import Path
>>> import pandas as pd
>>> from src.utils import save_features_to_csv
>>> # Assume logging is set up
>>> temp_output_file = Path("./temp_features.csv")
>>> features_1 = {'feature_A': 10.5, 'feature_B': 20.1}
>>> save_features_to_csv('image_001', features_1, temp_output_file)
- Relationships:
- Dependencies:
Relies on pandas for DataFrame conversion and save_dataframe_to_csv for I/O. Uses Python’s built-in logging module for output.
- Used by:
Primarily used by the feature extraction pipeline (run_pipeline.py) to save extracted features for individual images incrementally.
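The incremental-save behaviour (image_id first, header only on first write, append thereafter) can be sketched as below; the helper name is hypothetical and validation/logging are reduced to the essentials.

from pathlib import Path
from typing import Any, Dict
import pandas as pd

def sketch_save_features_to_csv(image_id: str, features_dict: Dict[str, Any],
                                output_filepath: Path) -> None:
    """Illustrative incremental save: one row per image, 'image_id' as the first column."""
    if not features_dict:
        raise ValueError("features_dict is empty.")
    row = {'image_id': image_id, **features_dict}         # image_id first, then features
    df = pd.DataFrame([row])
    output_filepath.parent.mkdir(parents=True, exist_ok=True)
    header = not output_filepath.is_file()                 # headers only on the first write
    df.to_csv(output_filepath, mode='a', header=header, index=False)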
- src.utils.save_cluster_assignments(df, output_filepath)#
Calculates and displays the number of samples in each cluster, then saves the image IDs with their cluster assignments to a CSV file.
This function provides a summary of the clustering results by showing the distribution of samples across different identified clusters. It also generates a manifest mapping each image to its assigned cluster.
- Parameters:
df (pd.DataFrame) – DataFrame containing at least ‘image_id’ and ‘cluster_label’ columns.
output_filepath (Path) – The full path, including filename, where the cluster assignment CSV will be saved. The filename ‘image_cluster_map.csv’ is typically derived from configuration but this function will save to the provided full path.
- Returns:
A dictionary mapping each cluster label to its corresponding count of samples.
- Return type:
Dict[Any, np.intp]
- Raises:
ValueError – If the input DataFrame does not contain ‘image_id’ or ‘cluster_label’ columns.
TypeError – If df is not a Pandas DataFrame or output_filepath is not a Path object.
IOError – If there’s an issue writing the CSV file via save_dataframe_to_csv.
- Output:
- Log:
Informational messages about cluster sample counts. Errors if saving fails.
- CSV File:
A CSV file created at output_filepath containing image IDs and their cluster labels.
Examples
>>> import pandas as pd
>>> from pathlib import Path
>>> from src.utils import save_cluster_assignments
>>> # Create a dummy DataFrame with cluster assignments
>>> df_mock = pd.DataFrame({
...     'image_id': ['imgA', 'imgB', 'imgC', 'imgD', 'imgE'],
...     'cluster_label': [0, 1, 0, 2, 1]
... })
>>> temp_output_file = Path("./temp_image_cluster_map.csv")
>>> counts = save_cluster_assignments(df_mock, temp_output_file)
- Relationships:
- Dependencies:
Relies on pandas for DataFrame operations, numpy for unique counts, and save_dataframe_to_csv for I/O. Uses logging for output.
- Used by:
The clustering pipeline (run_pipeline.py) to record and summarize cluster assignments.
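In essence this is a per-cluster count plus a two-column CSV export; a minimal sketch (hypothetical helper, logging omitted) is shown below.

from pathlib import Path
import numpy as np
import pandas as pd

def sketch_save_cluster_assignments(df: pd.DataFrame, output_filepath: Path):
    """Illustrative version: count samples per cluster, then persist the image-to-cluster map."""
    labels, counts = np.unique(df['cluster_label'], return_counts=True)
    cluster_counts = dict(zip(labels, counts))             # e.g. {0: 2, 1: 2, 2: 1}
    df[['image_id', 'cluster_label']].to_csv(output_filepath, index=False)
    return cluster_counts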
- src.utils.generate_cluster_summary(df, output_filepath)#
Generates and saves a summary of numeric statistics (mean, std, count) for each cluster.
This function calculates descriptive statistics for all numeric features, grouped by their assigned cluster label, providing insights into the characteristics of each cluster. It’s particularly useful for understanding the quantitative differences between discovered wound border types.
- Parameters:
df (pd.DataFrame) – DataFrame that includes numeric feature columns and ‘cluster_label’ column.
output_filepath (Path) – The full path, including filename, where the summary CSV will be saved. The filename ‘cluster_summary_stats.csv’ is typically derived from configuration but this function will save to the provided full path.
- Returns:
The summary statistics DataFrame (mean, std, count) for each cluster.
- Return type:
pd.DataFrame
- Raises:
ValueError – If ‘cluster_label’ column is missing from the DataFrame.
TypeError – If df is not a Pandas DataFrame or output_filepath is not a Path object.
IOError – If there’s an issue writing the CSV file via save_dataframe_to_csv.
- Output:
- Log:
Informational messages about summary generation. Errors if saving fails.
- CSV File:
A CSV file created at output_filepath containing the aggregated statistics.
Examples
>>> import pandas as pd
>>> import numpy as np
>>> from pathlib import Path
>>> from src.utils import generate_cluster_summary
>>> df_mock = pd.DataFrame({
...     'image_id': ['id1', 'id2', 'id3', 'id4', 'id5'],
...     'feature_A': [10, 12, 11, 20, 22],
...     'feature_B': [1, 2, 1, 5, 6],
...     'cluster_label': [0, 0, 0, 1, 1]
... })
>>> temp_output_file = Path("./temp_cluster_summary_stats.csv")
>>> summary_df = generate_cluster_summary(df_mock, temp_output_file)
- Relationships:
- Dependencies:
Relies on pandas for DataFrame operations, numpy for numeric types, and save_dataframe_to_csv for I/O. Uses logging for output.
- Used by:
The clustering pipeline (run_pipeline.py) to provide a statistical overview of the identified clusters.
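The aggregation amounts to a grouped mean/std/count over the numeric feature columns; the sketch below shows one plausible way to do it (hypothetical helper, minimal error handling).

from pathlib import Path
import numpy as np
import pandas as pd

def sketch_generate_cluster_summary(df: pd.DataFrame, output_filepath: Path) -> pd.DataFrame:
    """Illustrative summary: mean, std, and count of every numeric feature per cluster."""
    numeric_cols = df.select_dtypes(include=np.number).columns.drop('cluster_label')
    summary = df.groupby('cluster_label')[list(numeric_cols)].agg(['mean', 'std', 'count'])
    summary.to_csv(output_filepath)
    return summary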
- src.utils.generate_cluster_profiles(df, output_filepath)#
Calculates the mean feature values for each cluster and saves the result as a profile table.
This function provides a “profile” for each identified cluster by computing the average value for every feature within that cluster. This helps to quantitatively characterize and differentiate distinct wound types discovered during clustering.
- Parameters:
df (pd.DataFrame) – DataFrame that includes numeric feature columns and ‘cluster_label’ column.
output_filepath (Path) – The full path, including filename, where the cluster profiles CSV will be saved. The filename ‘cluster_profiles.csv’ is typically derived from configuration but this function will save to the provided full path.
- Returns:
DataFrame containing the mean feature profile of each cluster.
- Return type:
pd.DataFrame
- Raises:
ValueError – If ‘cluster_label’ column is missing from the DataFrame.
TypeError – If df is not a Pandas DataFrame or output_filepath is not a Path object.
IOError – If there’s an issue writing the CSV file via save_dataframe_to_csv.
- Output:
- Log:
Informational messages about profile generation. Errors if saving fails.
- CSV File:
A CSV file created at output_filepath containing the mean feature values per cluster.
Examples
>>> import pandas as pd
>>> from pathlib import Path
>>> from src.utils import generate_cluster_profiles
>>> df_mock = pd.DataFrame({
...     'image_id': ['id1', 'id2', 'id3', 'id4', 'id5'],
...     'feature_X': [10.0, 11.0, 10.5, 20.0, 21.0],
...     'feature_Y': [1.0, 1.2, 1.1, 2.0, 2.3],
...     'cluster_label': [0, 0, 0, 1, 1]
... })
>>> temp_output_file = Path("./temp_cluster_profiles.csv")
>>> profiles_df = generate_cluster_profiles(df_mock, temp_output_file)
- Relationships:
- Dependencies:
Relies on pandas for DataFrame operations, numpy for numeric types, and save_dataframe_to_csv for I/O. Uses logging for output.
- Used by:
The clustering pipeline (run_pipeline.py) to store mean feature values for each cluster.