
I have some doubts about how to deal with high volumes of data. I currently work in the data analysis/data science field, so I've had the chance to perform calculations, manipulate data, and draw conclusions from data using statistics and machine learning techniques; however, the amount of data that I will have to deal with soon is considerably larger than the volume I've processed until now.

I'm not sure whether I can consider this a Big Data problem. I've done some research, and, according to this post for example, my problem is closer to a Medium Data situation.

I'm looking for some insight from the community on the tools, procedures, techniques, and other elements that I should learn before facing the new tasks that I'll describe below: something like a learning path, which I have not found clearly laid out anywhere, since most of the documentation I find focuses more on the analysis itself (in theory) than on the practical difficulties one faces when trying to perform that analysis.

Current situation

I have been dealing with different volumes of data. In general, I work with data captured over a period of time (usually a week), and the amount is close to 5 GB per day. I've faced some difficulties, which I think are mostly associated with hardware limitations, but I've managed to reduce the data according to some criteria so that I can operate on it.

The tools that I used are mostly:

  • Python
  • pandas (perform operations over data, joins, filters, etc.)
  • scikit-learn (mostly clustering, binning, etc.)
  • numpy, matplotlib, etc.
  • Elasticsearch (store information, some queries).

Some issues that I've run into:

  • pandas is not completely suitable when the amount of data is comparable to the RAM size; the pandas documentation even suggests using other tools for high volumes of data. I'm using a laptop with 8 GB of RAM, and it almost freezes when dealing with a 4 GB CSV file (a chunked-reading sketch follows this list).
  • The Elasticsearch Python API has problems retrieving a high number of documents; I've mostly used pagination to deal with it.
  • scikit-learn freezes when performing some clustering tasks; my Jupyter kernels usually die.
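To be concrete, this is the kind of chunked read-and-filter workaround I mean (pd.read_csv with chunksize keeps only one chunk in memory at a time; the file name and the status column are placeholders):

```python
import pandas as pd

total_rows = 0
# read_csv with chunksize returns an iterator of DataFrames,
# so only one chunk has to fit in memory at a time
for chunk in pd.read_csv("events.csv", chunksize=500_000):
    # reduce each chunk according to some criterion before aggregating
    filtered = chunk[chunk["status"] == "ok"]
    total_rows += len(filtered)

print(total_rows)
```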

Future situation

In the first stage, I'll have to perform computations (sorting, filtering, aggregations) over static data. The volume will be considerably higher than what I've handled until now: about 10 GB per day, over a longer time period.

In the second stage, I'll have to deal with data ingestion at a high rate of transactions per second.

I'd appreciate it if you could give me some insight into the tools, techniques, hardware considerations, and concepts that I should consider to face this new situation.

Best regards

tms

2 Answers


The fastest data structures in Python for lookups and deduplication are sets. They are low-level structures, and they let you handle your data quickly. To avoid using too much RAM, you can build them progressively by reading the CSV in a loop. Once all your data is in sets, you should see much faster results.

See also: https://copyprogramming.com/howto/what-makes-sets-faster-than-lists
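For example, a minimal sketch of building a set incrementally from a CSV (the file name and the column holding the identifier are assumptions):

```python
import csv

unique_ids = set()
# build the set row by row so the whole file never has to fit in memory at once
with open("events.csv", newline="") as f:
    reader = csv.reader(f)
    next(reader)                  # skip the header row
    for row in reader:
        unique_ids.add(row[0])    # first column assumed to hold the identifier

# membership tests on a set are O(1) on average
print(len(unique_ids), "some-id" in unique_ids)
```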

Otherwise, you can use Cython together with NumPy arrays (a C-backed array library for Python): https://towardsdatascience.com/numpy-array-processing-with-cython-1250x-faster-a80f8b3caa52
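Even before compiling anything with Cython, the NumPy half of that suggestion already pays off, because vectorized operations run in compiled C rather than in a Python loop; a small illustrative comparison:

```python
import numpy as np

values = np.random.rand(1_000_000)

# vectorized: the loop runs in compiled C inside NumPy
mean_fast = values.mean()

# equivalent pure-Python loop, much slower
total = 0.0
for v in values:
    total += v
mean_slow = total / len(values)

print(mean_fast, mean_slow)
```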

Keep in mind that parallelism (multi-threading, or multiprocessing for CPU-bound work) could improve the speed greatly. https://towardsdatascience.com/multithreading-multiprocessing-python-180d0975ab29
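For CPU-bound work in Python it is usually multiple processes, not threads, that sidestep the GIL; a minimal sketch with multiprocessing.Pool, where the file names and the value column are placeholders:

```python
from multiprocessing import Pool

import pandas as pd

def summarise(path):
    # each worker loads and aggregates one daily file independently
    df = pd.read_csv(path)
    return df["value"].sum()

if __name__ == "__main__":
    paths = [f"day_{i}.csv" for i in range(7)]
    with Pool(processes=4) as pool:
        daily_totals = pool.map(summarise, paths)
    print(sum(daily_totals))
```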

However, if your 10 GB of data is complex, loading it into a MySQL database could be the best option for retrieving useful data quickly.
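A minimal sketch of that approach with pandas and SQLAlchemy, loading the CSV into MySQL once and then letting the database do the filtering and aggregation (the connection string, table, and columns are placeholders, and the pymysql driver is assumed to be installed):

```python
import pandas as pd
from sqlalchemy import create_engine

# placeholder connection string: adjust user, password, host and schema
engine = create_engine("mysql+pymysql://user:password@localhost/analytics")

# load the raw CSV once, in chunks, into a table
for chunk in pd.read_csv("events.csv", chunksize=500_000):
    chunk.to_sql("events", engine, if_exists="append", index=False)

# then query only what you need
daily = pd.read_sql(
    "SELECT DATE(ts) AS day, COUNT(*) AS n FROM events GROUP BY DATE(ts)",
    engine,
)
print(daily)
```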

Nicolas Martin

As mentioned, you can use your memory several times more efficiently with other tools (numpy, Redis, C++, R / data.table, etc.), but after that you will still hit your laptop's RAM limit, and then you will need to use or implement some solution that stores your data on disk.

The problems you face were solved long ago: for large-scale data analysis and production code, some database engine is usually used.

Both free solutions (like Postgres, Apache Spark, etc.) and paid ones (Google Cloud BigQuery, Microsoft SQL Server, Vertica, kdb+, etc.) often come with an associated ecosystem of additional tools that you can use for data processing.
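For instance, with Apache Spark the kind of sorting, filtering, and aggregation described in the question fits in a few lines of PySpark; the column names here are illustrative:

```python
from pyspark.sql import SparkSession, functions as F

# local Spark session; point the reader at the daily CSV files
spark = SparkSession.builder.appName("daily-aggregation").getOrCreate()

df = spark.read.csv("data/*.csv", header=True, inferSchema=True)

result = (
    df.filter(F.col("status") == "ok")           # filtering
      .groupBy("day")                             # aggregation per day
      .agg(F.count("*").alias("events"),
           F.avg("value").alias("avg_value"))
      .orderBy("day")                             # sorting
)
result.show()
```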

If you are familiar with Python and pandas, a professional option is to use Apache Beam to process your data in parallel (it is also available as part of some commercial offerings).
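A minimal local Beam pipeline, just to show the shape of the API; the file names, field positions, and filter condition are placeholders:

```python
import apache_beam as beam

with beam.Pipeline() as p:
    (
        p
        | "Read"   >> beam.io.ReadFromText("events.csv", skip_header_lines=1)
        | "Parse"  >> beam.Map(lambda line: line.split(","))
        | "Filter" >> beam.Filter(lambda fields: fields[2] == "ok")
        | "KeyBy"  >> beam.Map(lambda fields: (fields[0], 1))
        | "Count"  >> beam.CombinePerKey(sum)
        | "Write"  >> beam.io.WriteToText("counts")
    )
```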

Valentas