I'm new to the data analytics world, but perhaps my question can help others; we all start somewhere. We have a process that extracts data from an SQL database and stores it in temporary files in CSV format. A second process reads the CSV data and saves it in Parquet format (to an S3 bucket).
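For context, the first process is conceptually something like the following (the connection string, query, and use of pandas/SQLAlchemy here are simplified placeholders rather than our exact code):

import pandas as pd
from sqlalchemy import create_engine

# Placeholder connection string and query; the real job pulls the day's data from our production database.
engine = create_engine('postgresql://user:password@host/dbname')
df = pd.read_sql('SELECT * FROM my_table', engine)
df.to_csv('myfile.csv', index=False)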
The CSV-to-Parquet step is a very simple Python script using pyarrow, for example:
from pyarrow import csv
from pyarrow import parquet as pq

table = csv.read_csv('myfile.csv')
# ... do a bit of processing, e.g. add a date column
pq.write_to_dataset(table, root_path='my_dataset')  # root_path shown here is a local placeholder
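In reality the write goes to the S3 bucket; simplified, it looks roughly like this (bucket name, prefix, and region are placeholders):

from pyarrow import fs

# Placeholder bucket/prefix; credentials are picked up from the environment in our setup.
s3 = fs.S3FileSystem(region='us-east-1')
pq.write_to_dataset(table, root_path='my-bucket/parquet/my_table', filesystem=s3)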
We then have a set of scripts that read from Parquet and load the data into a data warehouse (using pandas).
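Those scripts boil down to something like this (the warehouse connection, table name, and use of to_sql are placeholders for whatever load method the warehouse actually needs; reading straight from S3 assumes pyarrow/s3fs support is installed):

import pandas as pd
from sqlalchemy import create_engine

# Placeholder warehouse connection; the real scripts do some column mapping first.
warehouse = create_engine('postgresql://user:password@warehouse-host/dw')
df = pd.read_parquet('s3://my-bucket/parquet/my_table/')
df.to_sql('my_table', warehouse, if_exists='append', index=False)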
It feels like we are missing something here. I know that Parquet is a "columnar" format, but for us it is just a convenient way of putting data on S3 so we can get to it later, without having to manage multiple CSV files or other data formats.
Does the above process ignore the advantages that Parquet (columnar storage) offers, or is this all handled transparently within pyarrow? At present we export data from the databases daily, so each day's CSV still fits into RAM when loaded, but that will not be the case in a year's time. Is there something we can do now to avoid a rewrite in the future? Or is this all an oversimplification of how it should work?
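On the RAM point, I have seen that pyarrow can read a CSV in batches and write Parquet incrementally, so I sketched something like the code below (untested on my part, file names are placeholders); is that the direction we should be heading in?

import pyarrow as pa
from pyarrow import csv
from pyarrow import parquet as pq

# Sketch only: stream the CSV in record batches so the whole file never has to sit in memory.
reader = csv.open_csv('myfile.csv')                        # incremental CSV reader
writer = pq.ParquetWriter('myfile.parquet', reader.schema)
for batch in reader:                                       # each batch is a pyarrow RecordBatch
    writer.write_table(pa.Table.from_batches([batch]))
writer.close()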