I'm looking for a corpus of toy tabular datasets that can be used to test software for data profiling, machine learning, data manipulation, and similar tasks. Some example attributes:

  • Strange column names (empty string, long names, duplicate names, names containing spaces, periods, syntax characters, or escaped delimiters and tokens)
  • Non-rectangular
  • Mixed scientific notation in floats, inf literals
  • Row-empty or column-empty
  • Mixed file encodings
  • Numeric and string values designed to overflow memory buffers/cause truncation/rounding to int
  • Ambiguous and invalid dates
  • Diacritics, emojis
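
To make this concrete, here is a rough sketch of the kind of file I have in mind, built with Python's standard `csv` module. The function name `make_pathological_csv` and the specific values are just illustrative assumptions on my part, and it only covers a handful of the attributes above (no mixed encodings, for instance):

```python
import csv
import io


def make_pathological_csv() -> str:
    """Build one small CSV combining several of the attributes listed above.

    Hypothetical helper for illustration only; the values are arbitrary.
    """
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    # Header: empty name, duplicate names, a space, a period,
    # and a name containing the delimiter and a quote character.
    writer.writerow(["", "id", "id", "first name", "a.b", 'x,"y"'])
    # Scientific notation, a diacritic, an inf literal,
    # and an integer one past the unsigned 64-bit range.
    writer.writerow(["1e3", "1", "2", "Zoë", "-inf", str(2**64)])
    # Non-rectangular: a row that is too short.
    writer.writerow(["0.001", "3"])
    # Non-rectangular: a row that is too long, with an emoji,
    # an invalid date, and an ambiguous date.
    writer.writerow(["1E-3", "4", "5", "😀", "2021-02-30", "01/02/03", "extra"])
    # Row-empty: every field is the empty string.
    writer.writerow([""] * 6)
    return buf.getvalue()


text = make_pathological_csv()
rows = list(csv.reader(io.StringIO(text)))
```

Reading the text back with `csv.reader` gives rows of differing lengths, which is already enough to trip up tools that assume a rectangular table.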

I was going to build a corpus myself, but surely there is some prior work here?

Shoeboxam
