1

I have a a networking dataset and it seems the data is coming from different channels. So the same timestamp is repeated for different rows which makes it meaningless to do timeseries analysis.

My goal is to see of there is a way to extract only one specific flow/channel data and do the timeseries analysis on that data only.

There is no flow/channel ID so among 80 features, I want to check if there is a column that can act like an id for a flow.

So I need to groupby timestamp, check if there is column that is not repeated in a group, like feature1, and if this feature1 has the same property in all other groups, meaning it is not repeated within a timestamp group. I have written this code but it give me wrong results.

def find_unique_features(df, timestamp_col='Timestamp', exclude_features=None):
# Set default excluded features if none provided
if exclude_features is None:
    exclude_features = ['Label','Flow Duration']  # flow duration and label is known to be non-unique

Initialize a list to hold the names of unique features

unique_features = []

Loop through each column in the DataFrame, excluding timestamp and specified features

for col in df.columns: if col == timestamp_col or col in exclude_features: continue

# Group by timestamp and count unique values in the column for each timestamp group
unique_counts = df.groupby(timestamp_col)[col].nunique()
print(f"In column '{col}', there are {unique_counts} unique values for the timestamp group.")        # Check if all groups have unique values (unique count equals group size)
if all(unique_counts == df.groupby(timestamp_col).size()):
    unique_features.append(col)

return unique_features

Here is the link to a piece of this dataset: https://www.kaggle.com/datasets/rihanna1/cic2018-02-14-2018

    Active Min  Idle Max    Fwd IAT Mean    Bwd Pkts/s  Fwd IAT Min     Bwd IAT Max     Fwd IAT Max     Pkt Size Avg    Fwd IAT Std     Fwd Header Len  ...     Idle Mean   Bwd IAT Mean    Fwd Seg Size Min    Flow IAT Max    Dst Port    Pkt Len Mean    Bwd Header Len  Active Max  Rounded_Timestamp   Time
0   0   56320958    5.632086e+07    0.000000    56320761    0   56320958    0.00    1.393000e+02    0   ...     56320859.5  0.000000e+00    0   56320958    0   0.000000    0   0   2018-02-14 08:31:01     08:31:01
1   0   56320814    5.632073e+07    0.000000    56320652    0   56320814    0.00    1.145513e+02    0   ...     56320733.0  0.000000e+00    0   56320814    0   0.000000    0   0   2018-02-14 08:33:50     08:33:50
2   0   56319525    5.631931e+07    0.000000    56319098    0   56319525    0.00    3.019346e+02    0   ...     56319311.5  0.000000e+00    0   56319525    0   0.000000    0   0   2018-02-14 08:36:39     08:36:39
Hanna
  • 111
  • 2

0 Answers0