Data Preprocessing

This document explains the code and workflow of the simulator's data preprocessing stage.

1. Loading the Dataset

Purpose: Load the dataset from a .parquet file, a columnar format that is efficient to read.

Code:

import os
import pandas as pd
file_path = os.path.join('..', '..', 'dataset', 'CoAt_NF-UQ-NIDS-V2.parquet')
df = pd.read_parquet(file_path, engine='pyarrow')

Details:

  • The dataset CoAt_NF-UQ-NIDS-V2.parquet contains network flow (NF) features tailored for Non-IID scenarios.

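  • The pyarrow engine is requested explicitly to read the Parquet file, so the pyarrow package must be installed. The relative path is resolved against the current working directory.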
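Because the path is relative, the load depends on where the script is started from. The following is a minimal sketch of a guarded load, assuming a hypothetical helper named load_dataset (it is not part of the simulator) that fails with a clear message when the file is missing:

```python
import os

import pandas as pd


def load_dataset(file_path: str) -> pd.DataFrame:
    """Hypothetical helper: read a Parquet file, failing early if it is missing."""
    resolved = os.path.abspath(file_path)
    if not os.path.isfile(resolved):
        raise FileNotFoundError(f"Dataset not found at {resolved}")
    return pd.read_parquet(resolved, engine="pyarrow")


# Same relative path as in the step above, resolved against the working directory.
file_path = os.path.join("..", "..", "dataset", "CoAt_NF-UQ-NIDS-V2.parquet")
```

Calling df = load_dataset(file_path) would then behave like the original read, with an earlier and more explicit error if the dataset directory is not where the script expects it.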

2. Viewing Dataset Information

Purpose: Inspect the structure and metadata of the loaded DataFrame.

Code:

df.info()

Details:

  • Displays column names, data types, and non-null counts.

  • Helps verify the dataset’s integrity (e.g., columns with missing values or unexpected data types) and reports memory usage.
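df.info() prints a summary but does not return values that can be checked programmatically. As a rough sketch, the same properties can be computed explicitly; the toy DataFrame and its column names below are illustrative stand-ins, not the real dataset:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the loaded DataFrame; column names and values are illustrative only.
df = pd.DataFrame({
    "IN_BYTES": [120, 4500, np.nan, 80],
    "PROTOCOL": [6, 17, 6, 6],
    "Attack": [0, 1, 0, 1],
})

# Per-column count of missing values (complements the non-null counts from df.info()).
missing_per_column = df.isna().sum()

# Total memory footprint in bytes; deep=True also measures object columns fully.
total_bytes = int(df.memory_usage(deep=True).sum())

print(missing_per_column[missing_per_column > 0])
print(f"Memory usage: {total_bytes} bytes")
```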

3. Preparing Binary Labels

Purpose: Convert the problem to binary classification by removing the multi-class label.

Code:

df = df.drop(columns=['Label'])
df.info()

Details:

  • The original dataset includes a multi-class column Label, which is dropped.

  • The remaining column Attack is used as the binary label (0 = normal, 1 = anomaly).
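The assumption that Attack holds only 0 and 1 can be checked right after the drop. This is a small sketch on made-up data, not the simulator's own code:

```python
import pandas as pd

# Toy stand-in for the dataset; values are made up for illustration.
df = pd.DataFrame({
    "IN_BYTES": [120, 4500, 300, 80],
    "Label": [0, 3, 0, 7],   # multi-class label, as described above
    "Attack": [0, 1, 0, 1],  # binary label
})

df = df.drop(columns=["Label"])

# Confirm that Label is gone and that Attack contains nothing but 0 and 1.
label_removed = "Label" not in df.columns
is_binary = set(df["Attack"].unique()) <= {0, 1}
print(f"Label removed: {label_removed}, Attack is binary: {is_binary}")
```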

4. Analyzing Class Distribution

Purpose: Check the balance between normal and anomaly traffic samples.

Code:

df['Attack'].value_counts()

Details:

  • Outputs the count of samples labeled 0 (normal) and 1 (anomaly).

  • Critical for assessing potential class imbalance issues.
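Raw counts are easier to judge alongside class proportions. As an illustration on made-up labels, value_counts(normalize=True) gives the share of each class, and the ratio of the largest to the smallest count summarizes the imbalance in one number:

```python
import pandas as pd

# Toy labels standing in for df['Attack']; the counts are made up.
attack = pd.Series([0] * 8 + [1] * 2, name="Attack")

counts = attack.value_counts()                 # absolute count per class
shares = attack.value_counts(normalize=True)   # fraction of samples per class
imbalance_ratio = counts.max() / counts.min()  # majority-to-minority ratio

print(counts)
print(shares)
print(f"Imbalance ratio: {imbalance_ratio:.1f}")
```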

5. Splitting Features and Labels

Purpose: Separate input features (X) from target labels (y).

Code:

X_df = df.drop(columns=['Attack'])
y_df = df['Attack']

Details:

  • X_df contains all columns except Attack (input features).

  • y_df contains only the Attack column (target variable).
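Two quick checks are useful after the split: that features and labels are still row-aligned, and that every feature column is numeric, since the scaling step that follows expects numeric input. The sketch below runs them on a toy frame with illustrative column names:

```python
import pandas as pd

# Toy stand-in for the dataset; column names and values are illustrative.
df = pd.DataFrame({
    "IN_BYTES": [120, 4500, 300, 80],
    "PROTOCOL": [6, 17, 6, 6],
    "Attack": [0, 1, 0, 1],
})

X_df = df.drop(columns=["Attack"])
y_df = df["Attack"]

# Rows stay aligned because both objects keep the original index.
aligned = X_df.index.equals(y_df.index)

# Any non-numeric column would need encoding or removal before scaling.
non_numeric = X_df.select_dtypes(exclude="number").columns.tolist()

print(f"Features: {X_df.shape[1]}, samples: {len(y_df)}, aligned: {aligned}")
print(f"Non-numeric feature columns: {non_numeric}")
```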

6. Feature Scaling

Purpose: Transform feature values so that all features share a comparable scale and distribution.

Code:

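from sklearn.preprocessing import QuantileTransformer
scaler = QuantileTransformer(output_distribution='normal')
# fit_transform returns a NumPy array, not a DataFrame
X_df_scl = scaler.fit_transform(X_df)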

Details:

  • QuantileTransformer maps each feature independently to an approximately standard normal distribution using its empirical quantiles, which reduces the impact of outliers.

  • Suitable for scenarios where features have varying ranges or skewed distributions.
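The effect on a skewed feature can be seen on synthetic data. The sketch below is illustrative only: the log-normal column stands in for a flow statistic, and the explicit n_quantiles and random_state arguments are assumptions added for reproducibility rather than settings taken from the simulator. It also wraps the result back into a DataFrame, because fit_transform returns a plain NumPy array:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import QuantileTransformer

# Synthetic, heavily right-skewed feature standing in for a flow statistic.
rng = np.random.default_rng(0)
X_df = pd.DataFrame({"IN_BYTES": rng.lognormal(mean=5.0, sigma=1.5, size=2000)})

# random_state only affects quantile subsampling; it is set here for reproducibility.
scaler = QuantileTransformer(n_quantiles=1000, output_distribution="normal", random_state=0)

# Wrap the returned array to keep the original column names and index.
X_df_scl = pd.DataFrame(scaler.fit_transform(X_df), columns=X_df.columns, index=X_df.index)

skew_before = X_df["IN_BYTES"].skew()
skew_after = X_df_scl["IN_BYTES"].skew()
print(f"skew before: {skew_before:.2f}, after: {skew_after:.2f}")
print(f"mean after: {X_df_scl['IN_BYTES'].mean():.2f}, std after: {X_df_scl['IN_BYTES'].std():.2f}")
```

When the data is later divided into training and evaluation portions, a common practice is to fit the scaler on the training portion only and apply scaler.transform to the rest; whether the simulator does so is not stated in this document.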

Notes

  • Dependencies: Requires pandas, pyarrow, and scikit-learn (for QuantileTransformer).

  • Dataset Assumptions: The Attack column is assumed to exist and contain binary labels.

  • Non-IID Context: The preprocessing steps are tailored for Non-IID data, where the assumption that samples are independent and identically distributed does not hold.