I'm new to data science, and I have the following problem:
1) Read a 1 GB file in which each line is a JSON object.
2) There are two more files, much smaller, from which I need to JOIN some data.
In this case, which kind of tool is best?
I saw some examples in PySpark, but I have no idea what happens under the hood.
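
To make the two steps above concrete, here is a rough sketch of the join I have in mind, in plain Python with only the standard library. The field names `user_id` and `product_id`, the helper names, and the assumption that the two small files are also one JSON object per line are all made up for illustration; my real data may look different.

```python
import json

def load_lookup(path, key):
    """Load a small JSON-lines file fully into memory as a dict keyed by `key`."""
    lookup = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():  # skip blank lines
                record = json.loads(line)
                lookup[record[key]] = record
    return lookup

def join_stream(big_path, users, products):
    """Stream the big file line by line and attach matching small-table rows.

    `users` and `products` are the dicts built by load_lookup (hypothetical
    names). Only one line of the big file is held in memory at a time.
    """
    with open(big_path, encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue
            record = json.loads(line)
            # Left join: a missing key yields None instead of dropping the row.
            record["user"] = users.get(record.get("user_id"))
            record["product"] = products.get(record.get("product_id"))
            yield record
```

Since the two small files fit in memory, they become dictionaries and the big file is only read once, one line at a time. I don't know whether this is similar to what PySpark does internally, which is part of what I'm asking.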