Data Preprocessing ========================== .. _cids.data.preprocessing: This document explains the code and workflow in the simulator for data preprocessing part. 1. Loading the Dataset ---------------------- **Purpose**: Load the dataset stored in a `.parquet` file for efficient data reading. **Code**: .. code-block:: python file_path = os.path.join('..', '..', 'dataset', 'CoAt_NF-UQ-NIDS-V2.parquet') df = pd.read_parquet(file_path, engine='pyarrow') **Details**: - The dataset ``CoAt_NF-UQ-NIDS-V2.parquet`` contains network flow (NF) features tailored for Non-IID scenarios. - The ``pyarrow`` engine is used to read the Parquet file for better performance. 2. Viewing Dataset Information ------------------------------ **Purpose**: Inspect the structure and metadata of the loaded DataFrame. **Code**: .. code-block:: python df.info() **Details**: - Displays column names, data types, and non-null counts. - Helps verify the dataset's integrity (e.g., missing values, memory usage). 3. Preparing Binary Labels -------------------------- **Purpose**: Convert the problem to binary classification by removing the multi-class label. **Code**: .. code-block:: python df = df.drop(columns=['Label']) df.info() **Details**: - The original dataset includes a multi-class column ``Label``, which is dropped. - The remaining column ``Attack`` is used as the binary label (0 = normal, 1 = anomaly). 4. Analyzing Class Distribution ------------------------------- **Purpose**: Check the balance between normal and anomaly traffic samples. **Code**: .. code-block:: python df['Attack'].value_counts() **Details**: - Outputs the count of samples labeled ``0`` (normal) and ``1`` (anomaly). - Critical for assessing potential class imbalance issues. 5. Splitting Features and Labels -------------------------------- **Purpose**: Separate input features (``X``) from target labels (``y``). **Code**: .. code-block:: python X_df = df.drop(columns=['Attack']) y_df = df['Attack'] **Details**: - ``X_df`` contains all columns except ``Attack`` (input features). - ``y_df`` contains only the ``Attack`` column (target variable). 6. Feature Scaling ------------------ **Purpose**: Normalize feature values to ensure uniformity in scale. **Code**: .. code-block:: python scaler = QuantileTransformer(output_distribution='normal') X_df_scl = scaler.fit_transform(X_df) **Details**: - ``QuantileTransformer`` maps features to a normal distribution, reducing the impact of outliers. - Suitable for scenarios where features have varying ranges or skewed distributions. Notes ----- - **Dependencies**: Requires ``pandas``, ``pyarrow``, and ``scikit-learn`` (for ``QuantileTransformer``). - **Dataset Assumptions**: The ``Attack`` column is assumed to exist and contain binary labels. - **Non-IID Context**: The preprocessing steps are tailored for Non-IID data, where sample independence and identical distribution assumptions do not hold.