I have some doubts about how to deal with high volumes of data. I currently work in data analysis/data science, so I've had the chance to perform calculations, manipulate data, and draw conclusions from it using statistics and machine learning techniques; however, the amount of data that I will have to deal with soon is considerably larger than anything I've processed until now.
I'm not sure whether I can consider this a Big Data problem. From the research I've done (for example, this post), my problem seems closer to a "Medium Data" situation.
I'm looking for insights from the community on the tools, procedures, techniques, and other elements I should study before facing the new tasks described below: something like a learning path, which I haven't found laid out clearly, since most of the documentation I find focuses on the analysis itself (in theory) rather than on the practical difficulties of performing it at scale.
Current situation
I have been dealing with different volumes of data. In general, I work with data captured over a period of time (usually a week), at roughly 5 GB per day. I've run into some difficulties, which I think are mostly due to hardware limitations, but I've managed to reduce the data according to some criteria so that I can work with it.
The tools I use are mostly:
- Python
- pandas (operations over data: joins, filters, etc.)
- scikit-learn (mostly clustering, binning, etc.)
- numpy, matplotlib, etc.
- Elasticsearch (storing data, some queries)
Some issues that I've run into:

- pandas is not well suited to datasets comparable in size to the available RAM; the pandas documentation itself suggests other tools for larger volumes. On my 8 GB RAM laptop it almost freezes on a 4 GB CSV file (my current workaround is chunked reading; see the first sketch after this list).
- The Elasticsearch Python API has trouble returning a large number of documents; I've mostly worked around it with pagination (second sketch below).
- scikit-learn freezes when performing some clustering; my Jupyter kernels usually die.
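To make those workarounds concrete, this is roughly what I mean by chunked reading in pandas (a minimal sketch; the file name, column names, and filter criterion are placeholders, not my real data):

```python
import pandas as pd

CHUNK_SIZE = 200_000  # rows per chunk; tuned to the available RAM

# Read the large CSV in pieces instead of loading it all at once,
# keeping only the rows and columns needed for the analysis.
filtered_parts = []
for chunk in pd.read_csv("capture_week.csv", chunksize=CHUNK_SIZE):
    reduced = chunk[chunk["event_type"] == "error"][["timestamp", "host", "value"]]
    filtered_parts.append(reduced)

reduced_df = pd.concat(filtered_parts, ignore_index=True)
```

And for Elasticsearch, the pagination I mentioned is along the lines of the scroll-based `scan` helper (again a sketch; the index pattern and query are placeholders):

```python
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")

# scan() wraps the scroll API, so documents come back in batches
# instead of one huge response.
hits = helpers.scan(
    es,
    index="captures-*",                        # placeholder index pattern
    query={"query": {"match_all": {}}},
    size=1000,                                 # documents per scroll batch
)

for hit in hits:
    doc = hit["_source"]
    # ... process each document incrementally ...
```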
Future situation
In the first stage, I'll have to perform computations (sorting, filtering, aggregations) over static data. The volume will be considerably higher than what I've handled so far: about 10 GB per day over a longer time period.
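To illustrate the kind of computation I have in mind for this stage, here is a rough sketch of an aggregation done chunk by chunk so the whole file never has to fit in RAM (the file and column names are made up for the example):

```python
import pandas as pd

# Compute partial sums/counts per key on each chunk,
# then combine the partial results at the end.
partial = []
for chunk in pd.read_csv("static_dump.csv", chunksize=500_000):
    partial.append(chunk.groupby("host")["value"].agg(["sum", "count"]))

combined = pd.concat(partial).groupby(level=0).sum()
combined["mean"] = combined["sum"] / combined["count"]
```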
In the second stage, I'll have to deal with data ingestion at a high number of transactions per second.
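For this stage I imagine batching writes instead of indexing one document per request; a minimal sketch with my current stack (the index name and document shape are invented for the example):

```python
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")

def generate_actions(records):
    """Turn incoming records into bulk-index actions."""
    for record in records:
        yield {
            "_index": "ingest-events",   # placeholder index
            "_source": record,
        }

# Index documents in batches; bulk() returns (successes, errors).
records = [{"timestamp": "2023-01-01T00:00:00", "value": 1.0}]  # stand-in data
success, errors = helpers.bulk(es, generate_actions(records), chunk_size=1000)
```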
I'd appreciate it if you could give me some insight into the tools, techniques, hardware considerations, and concepts I should look into to prepare for this new situation.
Best regards