Load and Distribution of Data

This document explains the code and workflow in the provided Jupyter notebook for distributing non-IID data to clients in a Federated Learning scenario.

1. Function: load_data(client_id)

Purpose: Load a unique subset of data for a specific client in a Federated Learning setup, simulating non-IID data distribution.

Code:

import numpy as np

def load_data(client_id):
    # Seed NumPy's global generator so the same client_id always yields the same subset
    np.random.seed(client_id)
    indices = np.arange(len(X_df_scl))
    np.random.shuffle(indices)

    # Fraction of the full dataset allocated to this client
    fraction = 0.02
    client_data_size = int(fraction * len(X_df_scl))
    client_indices = indices[:client_data_size]

    # Extract client-specific data
    # (assumes X_df_scl is a NumPy array and y_df is a pandas object)
    X_client = X_df_scl[client_indices]
    y_client = y_df.iloc[client_indices]

    return X_client, y_client

Details:

  • Seed Initialization:

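    • np.random.seed(client_id) seeds NumPy's global random generator, so the same client_id always reproduces the same subset.

    • Because the seed is tied to the client_id, different clients get different shuffling patterns in practice. Seeding the global generator also affects any later np.random calls in the notebook.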

  • Data Shuffling:

    • indices = np.arange(len(X_df_scl)) generates an array of indices corresponding to the scaled feature dataset.

    • np.random.shuffle(indices) randomizes the order of indices.

    • Each client receives its own shuffled order because the seed is client-specific.

  • Subset Selection:

    • fraction = 0.02 specifies that 2% of the total data is allocated to each client. Adjust this value to change the client’s data portion.

    • client_data_size = int(fraction * len(X_df_scl)) is the number of samples per client, truncated to a whole number.

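    • client_indices selects the first client_data_size indices from the shuffled array. Because every client shuffles the full index range independently, the subsets of different clients can overlap; nothing in the function makes them disjoint.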

  • Data Extraction:

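    • X_client and y_client hold the features and labels at the client-specific indices. The features are indexed positionally with X_df_scl[client_indices], and the labels with y_df.iloc[client_indices].

    • The function assumes X_df_scl (scaled features) and y_df (labels) are already defined by prior preprocessing steps and that their rows are in the same order.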
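
The overlap mentioned under Subset Selection can be checked directly. The sketch below is a minimal illustration, not part of the notebook: the dataset size of 1,000 rows and the helper names client_indices_for and disjoint_partition are hypothetical. The first helper mirrors the index selection in load_data using a local np.random.RandomState instead of the global generator; the second shows one possible way to obtain non-overlapping subsets, by splitting a single shared permutation.

```python
import numpy as np

N_SAMPLES = 1000  # hypothetical dataset size, stands in for len(X_df_scl)
FRACTION = 0.02   # same fraction as in load_data


def client_indices_for(client_id, n_samples=N_SAMPLES, fraction=FRACTION):
    """Mirror the index selection of load_data with a local generator."""
    rng = np.random.RandomState(client_id)  # local state, global RNG untouched
    indices = np.arange(n_samples)
    rng.shuffle(indices)
    return indices[: int(fraction * n_samples)]


def disjoint_partition(n_samples, n_clients, seed=0):
    """Split one shared permutation into n_clients non-overlapping index arrays."""
    rng = np.random.RandomState(seed)
    permutation = rng.permutation(n_samples)
    return np.array_split(permutation, n_clients)


if __name__ == "__main__":
    a = client_indices_for(1)
    b = client_indices_for(2)
    # Independent shuffles may share indices; the count depends on the seeds.
    print("shared indices between clients 1 and 2:", len(np.intersect1d(a, b)))

    parts = disjoint_partition(N_SAMPLES, n_clients=5)
    # Every index appears in exactly one part.
    print("total indices across parts:", sum(len(p) for p in parts))
```

With a small fraction and few clients the expected overlap is small, so independent shuffling may be acceptable; the shared-permutation variant is worth considering when strictly disjoint client datasets are required.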

2. Key Notes

  • Non-IID Simulation:

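    • The client-specific shuffling gives each client a different random sample. Because every sample is drawn uniformly from the same dataset, the clients' distributions differ only by sampling noise; stronger heterogeneity (e.g., label skew) requires partitioning by label or by feature.

    • Suitable as a simple baseline for simulating clients that hold separate local datasets (e.g., edge devices with varying data sources).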

  • Dependencies:

    • Requires numpy (imported as np) for index manipulation and pandas for y_df.iloc, and assumes prior execution of the code that defines X_df_scl and y_df (scaling and splitting steps).

  • Class Distribution:

    • The function does not balance classes, so each client's subset inherits, on average, any class imbalance present in the original dataset.

    • To address imbalance, additional preprocessing (e.g., stratified sampling) may be required.

  • Adjustability:

    • Modify fraction to control the proportion of data allocated per client.

    • Example: fraction = 0.05 allocates 5% of the dataset to each client.
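
If stronger heterogeneity than random sampling is needed, one common option is a label-skewed partition, in which each client's subset is dominated by a single class. The sketch below is an illustration under stated assumptions rather than the notebook's method: the function name label_skew_indices, the major_share parameter, and the rule for choosing the dominant class (client_id modulo the number of classes) are all hypothetical choices.

```python
import numpy as np


def label_skew_indices(labels, client_id, fraction=0.02, major_share=0.8):
    """Select indices for one client so that one class dominates its subset.

    Hypothetical helper: the dominant class is chosen as
    classes[client_id % n_classes], and roughly major_share of the
    client's samples are drawn from it.
    """
    labels = np.asarray(labels)
    rng = np.random.RandomState(client_id)  # reproducible per client

    classes = np.unique(labels)
    dominant = classes[client_id % len(classes)]

    size = int(fraction * len(labels))
    major_pool = np.flatnonzero(labels == dominant)
    minor_pool = np.flatnonzero(labels != dominant)

    # Cap the requested counts by what each pool can actually supply.
    n_major = min(int(major_share * size), len(major_pool))
    n_minor = min(size - n_major, len(minor_pool))

    major = rng.choice(major_pool, size=n_major, replace=False)
    minor = rng.choice(minor_pool, size=n_minor, replace=False)

    picked = np.concatenate([major, minor])
    rng.shuffle(picked)  # avoid all dominant-class samples coming first
    return picked


if __name__ == "__main__":
    # Synthetic, balanced labels standing in for y_df
    labels = np.repeat([0, 1, 2], 500)
    for cid in range(3):
        idx = label_skew_indices(labels, cid)
        values, counts = np.unique(labels[idx], return_counts=True)
        print(cid, dict(zip(values.tolist(), counts.tolist())))
```

The returned indices could be used in place of client_indices inside load_data, for example X_df_scl[idx] and y_df.iloc[idx], assuming the labels passed in are in the same row order as X_df_scl.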