I'm new to the data analytics world, but perhaps my question can help others; we all start somewhere. We have a process that extracts data from an SQL database and stores it in temporary files in CSV format. A second process reads the CSV data and saves it in Parquet format (to an S3 bucket).
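For context, the first process is conceptually something like the following (the connection string, query, and use of pandas/SQLAlchemy here are simplified placeholders rather than our exact code):

import pandas as pd
from sqlalchemy import create_engine

# Placeholder connection string and query; the real job pulls the day's data from our production database.
engine = create_engine('postgresql://user:password@host/dbname')
df = pd.read_sql('SELECT * FROM my_table', engine)
df.to_csv('myfile.csv', index=False)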
The CSV-to-Parquet step is a very simple Python script using pyarrow, for example:
from pyarrow import csv
from pyarrow import parquet as pq

table = csv.read_csv('myfile.csv')
# ... do a bit of processing, e.g. add a date column
pq.write_to_dataset(table, root_path='my_dataset')  # root_path shown here is a local placeholder
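In reality the write goes to the S3 bucket; simplified, it looks roughly like this (bucket name, prefix, and region are placeholders):

from pyarrow import fs

# Placeholder bucket/prefix; credentials are picked up from the environment in our setup.
s3 = fs.S3FileSystem(region='us-east-1')
pq.write_to_dataset(table, root_path='my-bucket/parquet/my_table', filesystem=s3)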
We then have a set of scripts that read from Parquet and load the data into a data warehouse (using pandas).
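Those scripts boil down to something like this (the warehouse connection, table name, and use of to_sql are placeholders for whatever load method the warehouse actually needs; reading straight from S3 assumes pyarrow/s3fs support is installed):

import pandas as pd
from sqlalchemy import create_engine

# Placeholder warehouse connection; the real scripts do some column mapping first.
warehouse = create_engine('postgresql://user:password@warehouse-host/dw')
df = pd.read_parquet('s3://my-bucket/parquet/my_table/')
df.to_sql('my_table', warehouse, if_exists='append', index=False)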
It feels like we are missing something here. I know that Parquet is a "columnar" format, but for us it is just a convenient way of putting data on S3 so we can get to it later, without having to manage multiple CSV files or other data formats.
Does the above process ignore the advantages that Parquet (columnar storage) offers, or is this all handled transparently within pyarrow? At present we export data from the databases daily, so each day's CSV still fits into RAM when loaded, but that will not be the case in a year's time. Is there something we can do now to avoid a rewrite in the future? Or is this all an oversimplification of how it should work?
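On the RAM point, I have seen that pyarrow can read a CSV in batches and write Parquet incrementally, so I sketched something like the code below (untested on my part, file names are placeholders); is that the direction we should be heading in?

import pyarrow as pa
from pyarrow import csv
from pyarrow import parquet as pq

# Sketch only: stream the CSV in record batches so the whole file never has to sit in memory.
reader = csv.open_csv('myfile.csv')                        # incremental CSV reader
writer = pq.ParquetWriter('myfile.parquet', reader.schema)
for batch in reader:                                       # each batch is a pyarrow RecordBatch
    writer.write_table(pa.Table.from_batches([batch]))
writer.close()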