Data Preprocessing
This document explains the code and workflow of the data preprocessing part of the simulator.
1. Loading the Dataset
Purpose: Load the dataset stored in a .parquet file for efficient data reading.
Code:
import os
import pandas as pd
file_path = os.path.join('..', '..', 'dataset', 'CoAt_NF-UQ-NIDS-V2.parquet')
df = pd.read_parquet(file_path, engine='pyarrow')
Details:
The dataset CoAt_NF-UQ-NIDS-V2.parquet contains network flow (NF) features tailored for Non-IID scenarios.
The pyarrow engine is used to read the Parquet file for better performance.
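A minimal sketch of a more defensive load, assuming a hypothetical helper named load_flows that is not part of the simulator. It checks that the file exists before reading and can optionally read only some columns, which is cheap for Parquet because the format stores data column by column.

```python
from pathlib import Path
from typing import Optional, Sequence

import pandas as pd


def load_flows(path: str, columns: Optional[Sequence[str]] = None) -> pd.DataFrame:
    """Read a Parquet file of network flows (hypothetical helper)."""
    p = Path(path)
    if not p.is_file():
        # Fail early with the resolved path, which helps when relative
        # paths such as '../../dataset' are resolved from an unexpected cwd.
        raise FileNotFoundError(f"Dataset not found: {p.resolve()}")
    # columns=None reads every column; a list restricts the read to those columns.
    cols = list(columns) if columns is not None else None
    return pd.read_parquet(p, engine="pyarrow", columns=cols)
```

For example, load_flows(file_path, columns=['Attack']) would read only the label column, which may be enough for a quick look at the class balance.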
2. Viewing Dataset Information
Purpose: Inspect the structure and metadata of the loaded DataFrame.
Code:
df.info()
Details:
Displays column names, data types, and non-null counts.
Helps verify the dataset’s integrity (e.g., missing values, memory usage).
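The same checks can also be collected in a table that is easier to sort or filter than the printed output of df.info(). The sketch below uses a tiny made-up DataFrame; the column names IN_BYTES and PROTOCOL and the helper name summarize are illustrative assumptions, not taken from the simulator.

```python
import numpy as np
import pandas as pd


def summarize(df: pd.DataFrame) -> pd.DataFrame:
    """Per-column dtype, missing-value count, and distinct-value count."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "n_missing": df.isna().sum(),
        "n_unique": df.nunique(),  # NaN is not counted as a distinct value
    })


# Made-up stand-in for the real dataset.
demo = pd.DataFrame({
    "IN_BYTES": [100.0, 250.0, np.nan, 40.0],
    "PROTOCOL": [6, 17, 6, 6],
    "Attack": [0, 1, 0, 0],
})

summary = summarize(demo)
total_bytes = int(demo.memory_usage(deep=True).sum())  # total memory footprint
print(summary)
print(f"memory: {total_bytes} bytes")
```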
3. Preparing Binary Labels
Purpose: Convert the problem to binary classification by removing the multi-class label.
Code:
df = df.drop(columns=['Label'])
df.info()
Details:
The original dataset includes a multi-class column Label, which is dropped.
The remaining column Attack is used as the binary label (0 = normal, 1 = anomaly).
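Because the rest of the pipeline relies on Attack holding only 0 and 1, it may be worth verifying this before dropping Label. The following is a sketch under that assumption; the helper name to_binary_frame and the demo values are hypothetical.

```python
import pandas as pd


def to_binary_frame(df: pd.DataFrame,
                    label_col: str = "Label",
                    target_col: str = "Attack") -> pd.DataFrame:
    """Drop the multi-class column after checking the binary target."""
    if target_col not in df.columns:
        raise KeyError(f"Expected a '{target_col}' column")
    values = set(df[target_col].dropna().unique().tolist())
    if not values <= {0, 1}:
        raise ValueError(f"'{target_col}' is not binary, found: {sorted(map(str, values))}")
    # errors='ignore' keeps the call safe to repeat once label_col is gone.
    return df.drop(columns=[label_col], errors="ignore")


# Made-up stand-in for the real dataset.
demo = pd.DataFrame({
    "IN_BYTES": [100, 250, 40],
    "Label": ["Benign", "DoS", "Benign"],
    "Attack": [0, 1, 0],
})
binary_df = to_binary_frame(demo)
```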
4. Analyzing Class Distribution
Purpose: Check the balance between normal and anomaly traffic samples.
Code:
df['Attack'].value_counts()
Details:
Outputs the count of samples labeled 0 (normal) and 1 (anomaly).
Critical for assessing potential class imbalance issues.
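Raw counts are easier to judge next to proportions and a single imbalance ratio. A small sketch on made-up labels (the real counts depend on the dataset and are not reproduced here):

```python
import pandas as pd

# Made-up labels: six normal samples and two anomalies.
y = pd.Series([0, 0, 0, 0, 0, 0, 1, 1], name="Attack")

counts = y.value_counts()                 # absolute count per class
shares = y.value_counts(normalize=True)   # fraction of samples per class
imbalance_ratio = counts.max() / counts.min()  # majority count / minority count

print(counts)
print(shares)
print(f"imbalance ratio: {imbalance_ratio:.1f}")
```

A ratio close to 1 indicates balanced classes; a large ratio suggests that accuracy alone would be a misleading metric.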
5. Splitting Features and Labels
Purpose: Separate input features (X) from target labels (y).
Code:
X_df = df.drop(columns=['Attack'])
y_df = df['Attack']
Details:
X_df contains all columns except Attack (input features).
y_df contains only the Attack column (target variable).
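After the split, a few quick checks can confirm that features and labels still line up and that every feature is numeric, which the scaler in the next step requires. This is a sketch on made-up data; the helper names and the PROTOCOL_NAME column are hypothetical.

```python
from typing import List, Tuple

import pandas as pd


def split_features_labels(df: pd.DataFrame,
                          target_col: str = "Attack") -> Tuple[pd.DataFrame, pd.Series]:
    """Return (X, y) with the target column removed from X."""
    X = df.drop(columns=[target_col])
    y = df[target_col]
    return X, y


def non_numeric_columns(X: pd.DataFrame) -> List[str]:
    """Names of columns that a numeric scaler could not handle."""
    return X.select_dtypes(exclude="number").columns.tolist()


# Made-up stand-in for the real dataset.
demo = pd.DataFrame({
    "IN_BYTES": [100, 250, 40],
    "PROTOCOL_NAME": ["tcp", "udp", "tcp"],
    "Attack": [0, 1, 0],
})

X_demo, y_demo = split_features_labels(demo)
non_numeric = non_numeric_columns(X_demo)
print(non_numeric)
```

Any column reported here would need to be encoded or dropped before scaling.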
6. Feature Scaling
Purpose: Normalize feature values to ensure uniformity in scale.
Code:
from sklearn.preprocessing import QuantileTransformer
scaler = QuantileTransformer(output_distribution='normal')
X_df_scl = scaler.fit_transform(X_df)
Details:
QuantileTransformer maps features to a normal distribution, reducing the impact of outliers.
Suitable for scenarios where features have varying ranges or skewed distributions.
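The code above fits the scaler on the full feature matrix. If the data is later divided into training and test sets, a common alternative is to fit the scaler on the training portion only and reuse it on the test portion, so that test-set statistics do not leak into training. The sketch below illustrates this on synthetic skewed data; the split, the column names, and the n_quantiles and random_state values are assumptions for illustration, not the simulator's settings.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import QuantileTransformer

rng = np.random.default_rng(0)

# Synthetic, heavily skewed features standing in for real flow statistics.
X_demo = pd.DataFrame({
    "IN_BYTES": rng.lognormal(mean=5.0, sigma=2.0, size=400),
    "FLOW_DURATION_MILLISECONDS": rng.exponential(scale=1000.0, size=400),
})
y_demo = pd.Series(rng.integers(0, 2, size=400), name="Attack")

# Stratify so both splits keep the same class proportions.
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=0
)

# n_quantiles must not exceed the number of training samples (300 here).
scaler = QuantileTransformer(
    output_distribution="normal", n_quantiles=100, random_state=0
)
X_train_scl = scaler.fit_transform(X_train)  # learn quantiles on train only
X_test_scl = scaler.transform(X_test)        # reuse the learned quantiles

print(X_train_scl.shape, X_test_scl.shape)
```

Both outputs are NumPy arrays; the column names are lost unless the arrays are wrapped in a DataFrame again.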
Notes
Dependencies: Requires pandas, pyarrow, and scikit-learn (for QuantileTransformer).
Dataset Assumptions: The Attack column is assumed to exist and contain binary labels.
Non-IID Context: The preprocessing steps are tailored for Non-IID data, where the assumption that samples are independent and identically distributed does not hold.