Load and Distribution of Data
This document explains the code and workflow in the provided Jupyter notebook for distributing non-IID data to clients in a Federated Learning scenario.
1. Function: load_data(client_id)
Purpose: Load a unique subset of data for a specific client in a Federated Learning setup, simulating non-IID data distribution.
Code:
import numpy as np

def load_data(client_id):
    # Seed NumPy's global RNG with client_id so each client's shuffle is reproducible
    np.random.seed(client_id)
    indices = np.arange(len(X_df_scl))
    np.random.shuffle(indices)
    # Define fraction of data allocated to the client
    fraction = 0.02
    client_data_size = int(fraction * len(X_df_scl))
    client_indices = indices[:client_data_size]
    # Extract client-specific data (X_df_scl is indexed positionally, y_df via .iloc)
    X_client = X_df_scl[client_indices]
    y_client = y_df.iloc[client_indices]
    return X_client, y_client
Details:
Seed Initialization:
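np.random.seed(client_id) makes each client's shuffle reproducible: calling load_data again with the same client_id returns the same subset.
Because the seed is tied to client_id, different clients get different shuffling patterns. Note that the call reseeds NumPy's global random state, which also affects any later np.random calls in the notebook.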
Data Shuffling:
indices = np.arange(len(X_df_scl)) generates an array of indices corresponding to the scaled feature dataset.
np.random.shuffle(indices) randomizes the order of the indices in place.
Each client receives its own shuffled order because of the client-specific seed.
Subset Selection:
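fraction = 0.02 specifies that 2% of the total data is allocated to each client. Adjust this value to change the client's data portion.
client_data_size calculates the number of samples per client, rounded down to an integer.
client_indices selects the first client_data_size indices from the shuffled array. Each client's subset is drawn independently from the full dataset, so subsets of different clients can share samples; the function does not guarantee non-overlapping subsets.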
Data Extraction:
X_client and y_client extract features and labels using the client-specific indices.
Assumes X_df_scl (scaled features) and y_df (labels) are predefined from prior preprocessing steps.
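The index-selection steps above can be sketched in isolation to check two properties: the same client_id always yields the same subset, and subsets of different clients may overlap. The helper names sample_client_indices and disjoint_client_indices, and the dataset size of 10,000, are assumptions made for illustration; they are not part of the notebook. The second helper shows one possible way to get strictly disjoint subsets, if that is what the experiment needs.

```python
import numpy as np

def sample_client_indices(client_id, n_samples, fraction=0.02):
    """Hypothetical helper mirroring only the index-selection step of load_data."""
    np.random.seed(client_id)          # same global seeding as load_data
    indices = np.arange(n_samples)
    np.random.shuffle(indices)
    return indices[: int(fraction * n_samples)]

def disjoint_client_indices(n_clients, n_samples, fraction=0.02, seed=0):
    """Hypothetical alternative: slice one shared permutation so clients never overlap."""
    size = int(fraction * n_samples)
    if n_clients * size > n_samples:
        raise ValueError("not enough samples for disjoint subsets")
    perm = np.random.default_rng(seed).permutation(n_samples)
    return [perm[i * size:(i + 1) * size] for i in range(n_clients)]

n_samples = 10_000  # assumed dataset size, for illustration only

idx_a = sample_client_indices(1, n_samples)
idx_a_again = sample_client_indices(1, n_samples)
idx_b = sample_client_indices(2, n_samples)

print("client 1 reproducible:", np.array_equal(idx_a, idx_a_again))
print("indices shared by clients 1 and 2:", len(np.intersect1d(idx_a, idx_b)))

parts = disjoint_client_indices(n_clients=5, n_samples=n_samples)
print("disjoint scheme, unique indices:", len(np.unique(np.concatenate(parts))))
```

With independent shuffles, the number of shared indices depends on the seeds and the dataset size, so it is worth measuring rather than assuming it is zero.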
2. Key Notes
Non-IID Simulation:
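The client-specific shuffling gives each client a different random subset, so the empirical data distribution differs from client to client. Because every subset is drawn uniformly at random from the same dataset, this heterogeneity comes from sampling variation only; stronger non-IID behavior (e.g., label skew) would require a different partitioning scheme.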
Suitable for scenarios requiring heterogeneous data partitions (e.g., edge devices with varying data sources).
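A quick way to see how heterogeneous the resulting partitions are is to compare class proportions across clients. The sketch below uses synthetic labels as a stand-in for y_df (three classes with assumed proportions 0.7, 0.2, 0.1) and a hypothetical helper, client_class_proportions, that repeats the same seeding and slicing logic as load_data.

```python
import numpy as np

# Synthetic stand-in for y_df; class proportions are an assumption for illustration.
rng = np.random.default_rng(0)
labels = rng.choice(3, size=10_000, p=[0.7, 0.2, 0.1])

def client_class_proportions(client_id, labels, fraction=0.02, n_classes=3):
    """Hypothetical helper: class proportions in one client's subset."""
    np.random.seed(client_id)                  # same seeding scheme as load_data
    indices = np.arange(len(labels))
    np.random.shuffle(indices)
    chosen = indices[: int(fraction * len(labels))]
    counts = np.bincount(labels[chosen], minlength=n_classes)
    return counts / counts.sum()

# One row per client, one column per class.
proportions = np.vstack(
    [client_class_proportions(cid, labels) for cid in range(5)]
)
print(np.round(proportions, 3))
```

If the rows come out close to the global class proportions, the partitions are only mildly heterogeneous, which is what uniform random sampling would be expected to produce.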
Dependencies:
Requires numpy for index manipulation and assumes prior execution of the code defining X_df_scl and y_df (scaling and splitting steps).
Class Distribution:
The function does not explicitly balance classes, so each client's subset inherits, on average, any class imbalance present in the original dataset.
To address imbalance, additional preprocessing (e.g., stratified sampling) may be required.
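As a minimal sketch of what stratified sampling could look like here, the function below draws the same fraction from every class, so each client's subset keeps the dataset's class proportions. The name stratified_client_indices is hypothetical, and the use of np.random.default_rng (instead of the global seed used in load_data) is a design choice of this sketch, not something taken from the notebook.

```python
import numpy as np

def stratified_client_indices(client_id, labels, fraction=0.02):
    """Hypothetical helper: sample `fraction` of the indices of each class."""
    rng = np.random.default_rng(client_id)     # per-client generator, no global state
    labels = np.asarray(labels)
    chosen = []
    for cls in np.unique(labels):
        cls_idx = np.flatnonzero(labels == cls)
        # Take at least one sample per class so rare classes are not dropped
        n_take = max(1, int(round(fraction * len(cls_idx))))
        chosen.append(rng.choice(cls_idx, size=n_take, replace=False))
    return np.concatenate(chosen)

# Toy labels with a 9:1 imbalance, for illustration only
toy_labels = np.array([0] * 900 + [1] * 100)
idx = stratified_client_indices(client_id=1, labels=toy_labels, fraction=0.1)
class_counts = np.bincount(toy_labels[idx])
print(class_counts)
```

The returned indices can be used in place of client_indices, for example X_df_scl[idx] and y_df.iloc[idx]. Note that this preserves the original imbalance in every client rather than removing it; rebalancing would need over- or under-sampling on top.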
Adjustability:
Modify fraction to control the proportion of data allocated per client.
Example: fraction = 0.05 allocates 5% of the dataset to each client.
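Instead of editing the constant inside the function, fraction could be exposed as a parameter. The variant below is a sketch under a few assumptions: the name load_data_with_fraction is hypothetical, the data is passed in as arguments rather than read from the notebook's globals, and both X and y are NumPy arrays (for a pandas object such as y_df, y.iloc[client_indices] would be used instead).

```python
import numpy as np

def load_data_with_fraction(client_id, X, y, fraction=0.02):
    """Hypothetical variant of load_data with a configurable fraction."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    np.random.seed(client_id)                  # same seeding scheme as load_data
    indices = np.arange(len(X))
    np.random.shuffle(indices)
    client_indices = indices[: int(fraction * len(X))]
    return X[client_indices], y[client_indices]

# Synthetic stand-ins for X_df_scl and y_df, for illustration only
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 4))
y_demo = rng.integers(0, 2, size=1000)

X_small, y_small = load_data_with_fraction(3, X_demo, y_demo, fraction=0.02)
X_large, y_large = load_data_with_fraction(3, X_demo, y_demo, fraction=0.05)
print(X_small.shape, X_large.shape)  # (20, 4) (50, 4)
```

Because the shuffle depends only on client_id, a larger fraction extends the same client's subset rather than replacing it: the first 20 rows of X_large are the rows of X_small.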