I have a huge number of JSON files that I need to transform into Parquet. They look something like this:
{
  "foo": "bar",
  "props": {
    "prop1": "val1",
    "prop2": "val2"
  }
}
And I need to transform them into a Parquet file with this structure (nested properties are made top-level and prefixed with _):
foo=bar
_prop1=val1
_prop2=val2
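
To make the renaming rule concrete, this is the kind of select I would end up writing by hand for the sample above with Spark's DataFrame API (just a sketch; the column names are hard-coded from the example and the path is a placeholder):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().appName("json-to-parquet").getOrCreate()

// Spark infers "props" as a struct column for the sample document.
val df = spark.read.json("sample.json")

// Nested fields pulled up to the top level, prefixed with an underscore.
val flat = df.select(
  col("foo"),
  col("props.prop1").as("_prop1"),
  col("props.prop2").as("_prop2")
)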
Now here's the catch: not all of the JSON documents have the same properties. So if doc1 has prop1 and prop2, but doc2 has prop3, the final Parquet file must contain all three properties (some of them will be null for some of the records).
I understand that Parquet needs a schema up front, so my current plan is:
- Traverse all the JSON files
- Infer a schema per document (using Kite, like this)
- Merge all the schemas
- Start writing the Parquet
This approach strikes me as very complicated, slow and error-prone. I'm wondering if there's a better way to achieve this using Spark.
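
For reference, this is roughly the Spark version I'm picturing (a minimal sketch, assuming Spark SQL's JSON reader unions the fields it sees across files into a single inferred schema, that all nested properties live under props, and that the paths and app name are placeholders):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.StructType

val spark = SparkSession.builder().appName("json-to-parquet").getOrCreate()

// Reading the whole directory: the inferred schema covers the union of all
// fields seen across the documents (missing fields come back as null).
val df = spark.read.json("hdfs:///path/to/json/*.json")

// Build the flattened column list dynamically: keep top-level fields as-is
// and pull every field of the "props" struct up with an underscore prefix.
val propsFields = df.schema("props").dataType.asInstanceOf[StructType].fieldNames

val flattened = df.select(
  df.schema.fieldNames.filter(_ != "props").map(col) ++
    propsFields.map(f => col(s"props.$f").as(s"_$f")): _*
)

flattened.write.parquet("hdfs:///path/to/output")

If that union-of-schemas behaviour really holds for my data, the whole infer/merge step would disappear, which is what I'm hoping someone can confirm.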