2

I am looking for a dataset of log files with labeled cybersecurity issues. I am trying to build a cybersecurity log analysis model, so I have no preference on the type of log, but the data does need to contain known cybersecurity issues.

So far, all I have been able to find are log datasets (HDFS, BGL) whose anomalies are execution-flow errors rather than cybersecurity issues. I have also found a large amount of network data, such as at https://vizsec.org/data/, but it contains network traffic rather than logs. Finally, I have found log datasets that do contain cybersecurity issues, but too few of them to train a model on.

It would also be helpful to know how such a dataset could be generated in large quantities.

jsbc
  • 21
  • 2

3 Answers

0

Up-to-date public log datasets with labels for new attacks are hard to find, but there are some older log-based datasets covering known attacks (e.g., SQL injection, XSS) in web logs or HTTP requests, from the context of Web-server Log Anomaly Detection (WLAD), if that fits your needs.

Please see Table II in this paper:

Majd, Mehryar, et al. "A Comprehensive Review of Anomaly Detection in Web Logs." 2022 IEEE/ACM International Conference on Big Data Computing, Applications and Technologies (BDCAT). IEEE, 2022.

Context: Web-server Log Anomaly Detection (WLAD)

There, the authors collect recent works from the cybersecurity literature they reviewed, including the datasets of web logs or HTTPS requests those works used. As the table shows, one of the most recent papers, from Amazon, used HTTP CSIC 2010 and ISCX IDS 2012 in its approach, which are old public datasets of the kind I mentioned.

I would also like to share a conversation on ResearchGate (RG) that I saw a long time ago and that you might look at:

There are also old posts at https://security.stackexchange.com/ :

Some related GitHub repos:

A recent survey:

Mario
  • 571
  • 1
  • 6
  • 24
0

With the small amount of labeled data you found, either augment it or apply cross-validation to it.

Otherwise, look for the data you need at https://datasetsearch.research.google.com/
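
For the cross-validation suggestion, a minimal sketch in plain Python might look like the following. It assumes each log line already has a label (1 for attack-related, 0 for benign); the function name `stratified_k_fold` and the toy labels are hypothetical, made up for illustration. The split is stratified so that the few attack samples are spread across all folds instead of landing in one.

```python
import random
from collections import defaultdict


def stratified_k_fold(labels, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs with similar class ratios in every fold."""
    rng = random.Random(seed)

    # Group sample indices by class label
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    # Shuffle within each class, then deal indices round-robin into the folds
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)

    # Each fold serves once as the test set; the rest form the training set
    for i in range(k):
        test_idx = sorted(folds[i])
        train_idx = sorted(
            idx for j, fold in enumerate(folds) if j != i for idx in fold
        )
        yield train_idx, test_idx


# Hypothetical toy labels: 20 benign log lines and 5 attack-related ones
labels = [0] * 20 + [1] * 5
splits = list(stratified_k_fold(labels, k=5))
```

Each `(train_idx, test_idx)` pair can then be used to train and evaluate the model once, and the scores averaged over the folds. With very few attack samples, the per-fold scores will still be noisy, so this helps with evaluation more than with the shortage of data itself.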

Durga K
  • 31
  • 2
0

See if this helps: Publicly Available Datasets

You can also use the SMOTE technique if you have insufficient data.
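
As a rough illustration of the idea behind SMOTE, here is a simplified sketch in plain Python. It is not the reference implementation from any library: it only shows the core step of creating synthetic minority samples by interpolating between a minority sample and one of its nearest minority neighbours. SMOTE works on numeric feature vectors, so the log lines are assumed to have been converted to features already; the function name `smote`, the feature choice, and the values below are hypothetical.

```python
import math
import random


def smote(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)

    synthetic = []
    for _ in range(n_new):
        # Pick a random minority sample as the base point
        i = rng.randrange(len(minority))
        base = minority[i]

        # Find its k nearest minority neighbours (excluding itself)
        others = [p for j, p in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]

        # Place the new point at a random position on the segment base -> neighbour
        neighbour = rng.choice(neighbours)
        gap = rng.random()
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Hypothetical features per attack log line: [request length, special-character count]
attack_features = [[120.0, 9.0], [135.0, 11.0], [110.0, 8.0], [150.0, 14.0]]
new_samples = smote(attack_features, n_new=6)
```

Synthetic samples should be generated from the training split only, otherwise interpolated copies of test samples leak into training and inflate the scores.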

Madhur Yadav
  • 158
  • 1
  • 14