I have a a networking dataset and it seems the data is coming from different channels. So the same timestamp is repeated for different rows which makes it meaningless to do timeseries analysis.
My goal is to see of there is a way to extract only one specific flow/channel data and do the timeseries analysis on that data only.
There is no flow/channel ID so among 80 features, I want to check if there is a column that can act like an id for a flow.
So I need to groupby timestamp, check if there is column that is not repeated in a group, like feature1, and if this feature1 has the same property in all other groups, meaning it is not repeated within a timestamp group. I have written this code but it give me wrong results.
def find_unique_features(df, timestamp_col='Timestamp', exclude_features=None):
# Set default excluded features if none provided
if exclude_features is None:
exclude_features = ['Label','Flow Duration'] # flow duration and label is known to be non-unique
Initialize a list to hold the names of unique features
unique_features = []
Loop through each column in the DataFrame, excluding timestamp and specified features
for col in df.columns:
if col == timestamp_col or col in exclude_features:
continue
# Group by timestamp and count unique values in the column for each timestamp group
unique_counts = df.groupby(timestamp_col)[col].nunique()
print(f"In column '{col}', there are {unique_counts} unique values for the timestamp group.") # Check if all groups have unique values (unique count equals group size)
if all(unique_counts == df.groupby(timestamp_col).size()):
unique_features.append(col)
return unique_features
Here is the link to a piece of this dataset: https://www.kaggle.com/datasets/rihanna1/cic2018-02-14-2018
Active Min Idle Max Fwd IAT Mean Bwd Pkts/s Fwd IAT Min Bwd IAT Max Fwd IAT Max Pkt Size Avg Fwd IAT Std Fwd Header Len ... Idle Mean Bwd IAT Mean Fwd Seg Size Min Flow IAT Max Dst Port Pkt Len Mean Bwd Header Len Active Max Rounded_Timestamp Time
0 0 56320958 5.632086e+07 0.000000 56320761 0 56320958 0.00 1.393000e+02 0 ... 56320859.5 0.000000e+00 0 56320958 0 0.000000 0 0 2018-02-14 08:31:01 08:31:01
1 0 56320814 5.632073e+07 0.000000 56320652 0 56320814 0.00 1.145513e+02 0 ... 56320733.0 0.000000e+00 0 56320814 0 0.000000 0 0 2018-02-14 08:33:50 08:33:50
2 0 56319525 5.631931e+07 0.000000 56319098 0 56319525 0.00 3.019346e+02 0 ... 56319311.5 0.000000e+00 0 56319525 0 0.000000 0 0 2018-02-14 08:36:39 08:36:39