Data Preprocessing

This document explains the code and workflow of the simulator's data preprocessing stage.

1. Loading the Dataset

Purpose: Load the dataset stored in a .parquet file for efficient data reading.

Code:

import os
import pandas as pd
file_path = os.path.join('..', '..', 'dataset', 'CoAt_NF-UQ-NIDS-V2.parquet')
df = pd.read_parquet(file_path, engine='pyarrow')

Details:

The dataset CoAt_NF-UQ-NIDS-V2.parquet contains network flow (NF) features tailored for Non-IID scenarios.
The pyarrow engine is set explicitly so that the Parquet file is read through Apache Arrow, chosen here for read performance.
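
Because the path above is relative, it resolves against the current working directory rather than the script's location. A minimal sketch of a loader that fails early with a clear message could look like the following. The helper name load_flows and its base_dir parameter are hypothetical and are not part of the simulator.

```python
from pathlib import Path

import pandas as pd


def load_flows(base_dir, filename="CoAt_NF-UQ-NIDS-V2.parquet"):
    """Read the Parquet dataset from base_dir, failing early if it is absent.

    Hypothetical helper: the name and signature are illustrative only.
    """
    path = Path(base_dir) / filename
    if not path.is_file():
        # resolve() shows the absolute path that was actually looked up.
        raise FileNotFoundError(f"Dataset not found: {path.resolve()}")
    # Same call as in the simulator: pyarrow is the Parquet backend.
    return pd.read_parquet(path, engine="pyarrow")
```

Calling load_flows(os.path.join('..', '..', 'dataset')) would then behave like the original two lines, except that a wrong working directory produces an explicit error naming the absolute path.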

Purpose: Inspect the structure and metadata of the loaded DataFrame.

Code:

df.info()

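Details:

df.info() prints the number of rows, the name, non-null count, and dtype of each column, and the memory usage of the DataFrame.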

Purpose: Convert the problem to binary classification by removing the multi-class label.

Code:

df = df.drop(columns=['Label'])
df.info()

Details:

The original dataset includes a multi-class column Label, which is dropped.
The remaining column Attack is used as the binary label (0 = normal, 1 = anomaly).
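
The drop can be paired with a quick sanity check that the remaining target really is binary. The sketch below uses a made-up toy frame; the IN_BYTES column and all values are illustrative assumptions, not taken from the dataset.

```python
import pandas as pd

# Toy frame standing in for the real dataset; all values are made up.
df = pd.DataFrame({
    "IN_BYTES": [120, 4800, 64, 900],
    "Label": [0, 3, 0, 7],    # multi-class column to drop
    "Attack": [0, 1, 0, 1],   # binary target that is kept
})

# errors="ignore" makes the drop safe to re-run once Label is already gone.
df = df.drop(columns=["Label"], errors="ignore")

# Verify that the remaining target contains only the values 0 and 1.
is_binary = set(df["Attack"].unique()) <= {0, 1}
```

If is_binary comes out False on the real data, the Attack column holds something other than 0/1 labels and the assumption stated in this document does not hold.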

Purpose: Check the balance between normal and anomaly traffic samples.

Code:

df['Attack'].value_counts()

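Details:

value_counts() returns the number of rows for each distinct value of Attack, sorted from most to least frequent.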
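
Absolute counts are often easier to judge alongside relative shares. A small sketch on toy labels (not real data) shows both, plus a simple imbalance ratio; the variable names are illustrative.

```python
import pandas as pd

y = pd.Series([0, 0, 0, 1], name="Attack")  # toy labels, not real data

counts = y.value_counts()                # absolute count per class
shares = y.value_counts(normalize=True)  # fraction of rows per class

# Ratio of the largest to the smallest class; 1.0 means perfectly balanced.
imbalance_ratio = counts.max() / counts.min()
```

On the real DataFrame the same calls would be made on df['Attack'].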

Purpose: Separate input features (X) from target labels (y).

Code:

X_df = df.drop(columns=['Attack'])
y_df = df['Attack']

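Details:

X_df is a DataFrame holding every column except Attack, and y_df is the Attack column as a Series with the same row index.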

Purpose: Normalize feature values to ensure uniformity in scale.

Code:

from sklearn.preprocessing import QuantileTransformer

scaler = QuantileTransformer(output_distribution='normal')
X_df_scl = scaler.fit_transform(X_df)  # returns a NumPy array, not a DataFrame

Details:

QuantileTransformer maps each feature independently onto a normal distribution using that feature's empirical quantiles, which reduces the impact of outliers.
This makes it suitable when features have widely varying ranges or skewed distributions.
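
The effect on a skewed feature can be seen in a short sketch. It uses synthetic log-normal data as a stand-in for a heavy-tailed flow feature; the data, the n_quantiles value, and the seed are assumptions made for illustration and are not the simulator's settings.

```python
import numpy as np
from sklearn.preprocessing import QuantileTransformer

rng = np.random.default_rng(0)
# Synthetic, heavily right-skewed feature (log-normal); not real flow data.
X = rng.lognormal(mean=0.0, sigma=2.0, size=(1000, 1))

# n_quantiles must not exceed the number of samples.
scaler = QuantileTransformer(
    output_distribution="normal", n_quantiles=1000, random_state=0
)
X_scl = scaler.fit_transform(X)

# The transform is rank-based: the median lands near 0 ...
median_after = float(np.median(X_scl))
# ... and sorting by the raw values also sorts the transformed values.
sorted_scl = X_scl[np.argsort(X[:, 0]), 0]
monotone = bool(np.all(np.diff(sorted_scl) >= 0))
```

Because only ranks matter, an extreme raw value ends up at the edge of the normal range instead of stretching the scale. The fitted scaler can later be applied to unseen rows with scaler.transform.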

Dependencies: Requires pandas, pyarrow, and scikit-learn (for QuantileTransformer).
Dataset Assumptions: The Attack column is assumed to exist and contain binary labels.
Non-IID Context: The preprocessing steps are tailored for Non-IID data, where the assumption that samples are independent and identically distributed does not hold.