Load and Distribution of Data ========================== .. _cids.data.load_dist: This document explains the code and workflow in the provided Jupyter notebook for distributing non-IID data to clients in a Federated Learning scenario. 1. Function: `load_data(client_id)` ----------------------------------- **Purpose**: Load a unique subset of data for a specific client in a Federated Learning setup, simulating non-IID data distribution. **Code**: .. code-block:: python def load_data(client_id): # Set seed for reproducibility based on client_id np.random.seed(client_id) indices = np.arange(len(X_df_scl)) np.random.shuffle(indices) # Define fraction of data allocated to the client fraction = 0.02 client_data_size = int(fraction * len(X_df_scl)) client_indices = indices[:client_data_size] # Extract client-specific data X_client = X_df_scl[client_indices] y_client = y_df.iloc[client_indices] return X_client, y_client **Details**: - **Seed Initialization**: - ``np.random.seed(client_id)`` ensures reproducibility and uniqueness for each client. - The seed is tied to the ``client_id``, guaranteeing different shuffling patterns across clients. - **Data Shuffling**: - ``indices = np.arange(len(X_df_scl))`` generates an array of indices corresponding to the scaled feature dataset. - ``np.random.shuffle(indices)`` randomizes the order of indices. - Each client receives a unique shuffled order due to the client-specific seed. - **Subset Selection**: - ``fraction = 0.02`` specifies that 2% of the total data is allocated to each client. Adjust this value to change the client's data portion. - ``client_data_size`` calculates the number of samples per client. - ``client_indices`` selects the first ``client_data_size`` indices from the shuffled array, ensuring non-overlapping subsets across clients. - **Data Extraction**: - ``X_client`` and ``y_client`` extract features and labels using the client-specific indices. - Assumes ``X_df_scl`` (scaled features) and ``y_df`` (labels) are predefined from prior preprocessing steps. 2. Key Notes ------------ - **Non-IID Simulation**: - The client-specific shuffling ensures each client's data distribution is unique and non-identically distributed. - Suitable for scenarios requiring heterogeneous data partitions (e.g., edge devices with varying data sources). - **Dependencies**: - Requires ``numpy`` for index manipulation and assumes prior execution of code defining ``X_df_scl`` and ``y_df`` (scaling and splitting steps). - **Class Distribution**: - The function does not explicitly balance classes. Inherits any class imbalance present in the original dataset. - To address imbalance, additional preprocessing (e.g., stratified sampling) may be required. - **Adjustability**: - Modify ``fraction`` to control the proportion of data allocated per client. - Example: ``fraction = 0.05`` allocates 5% of the dataset to each client.