I am looking for concepts, or to be pointed in the right direction, on best practices for a project I am working on. I am currently experimenting with the Luigi package for an ETL pipeline.
I followed the tutorial: Python: Create an ETL with Luigi, Pandas and SQLAlchemy
This was easy enough to get up and running. Where I am struggling conceptually is what happens when I run this pipeline nightly.
What is the best practice for handling new or changed data from the source?
Question 1) Should you gather the full dataset from the Source and overwrite the data at the Target? Or is it better to insert only new records, i.e. the difference between Source and Target?
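To make the two options in Question 1 concrete, here is a minimal sketch of what I mean. It uses the standard-library sqlite3 module as a stand-in for the real Target, and the table `target`, its columns, and the sample rows are all made up for illustration. `INSERT OR IGNORE` is SQLite-specific; other databases would need their own equivalent.

```python
import sqlite3


def full_refresh(conn, source_rows):
    """Option A: replace everything in the Target with the current Source snapshot."""
    with conn:  # single transaction, so a failure leaves the old data intact
        conn.execute("DELETE FROM target")
        conn.executemany("INSERT INTO target (id, name) VALUES (?, ?)", source_rows)


def insert_new_only(conn, source_rows):
    """Option B: insert only rows whose id is not yet in the Target."""
    with conn:
        # Rows with an existing primary key are skipped, not updated.
        conn.executemany(
            "INSERT OR IGNORE INTO target (id, name) VALUES (?, ?)", source_rows
        )


# Hypothetical Target table and two nightly Source snapshots.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE target (id INTEGER PRIMARY KEY, name TEXT)")

night_1 = [(1, "alice"), (2, "bob")]
night_2 = [(1, "alice"), (2, "bobby"), (3, "carol")]  # id 2 changed, id 3 is new

insert_new_only(conn, night_1)
insert_new_only(conn, night_2)
incremental = conn.execute("SELECT id, name FROM target ORDER BY id").fetchall()

full_refresh(conn, night_2)
refreshed = conn.execute("SELECT id, name FROM target ORDER BY id").fetchall()
```

With the insert-only approach, the new record (id 3) arrives but the change to id 2 is missed, which is what leads to my second question.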
Question 2) If you only insert new records, what is the best practice for handling updated records? Is it better to capture the whole record again with a date on it, or to update the record that already sits in the Target?
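Again as a rough sketch of the two options in Question 2, assuming the same kind of sqlite3 stand-in: the tables `target` and `target_history`, the `loaded_at` column, and the dates are all hypothetical names and values I picked for illustration. `INSERT OR REPLACE` is SQLite-specific.

```python
import sqlite3


def update_in_place(conn, source_rows):
    """Option A: overwrite the existing record, keeping only the latest state."""
    with conn:
        conn.executemany(
            "INSERT OR REPLACE INTO target (id, name) VALUES (?, ?)", source_rows
        )


def append_version(conn, source_rows, loaded_at):
    """Option B: keep every version, adding a dated row only when a record changed."""
    with conn:
        for row_id, name in source_rows:
            latest = conn.execute(
                "SELECT name FROM target_history "
                "WHERE id = ? ORDER BY loaded_at DESC LIMIT 1",
                (row_id,),
            ).fetchone()
            if latest is None or latest[0] != name:
                conn.execute(
                    "INSERT INTO target_history (id, name, loaded_at) VALUES (?, ?, ?)",
                    (row_id, name, loaded_at),
                )


# Hypothetical tables: one holds current state, the other holds dated versions.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE target (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute(
    "CREATE TABLE target_history ("
    "id INTEGER, name TEXT, loaded_at TEXT, PRIMARY KEY (id, loaded_at))"
)

night_1 = [(1, "alice"), (2, "bob")]
night_2 = [(1, "alice"), (2, "bobby")]  # id 2 was updated at the Source

for loaded_at, rows in [("2024-01-01", night_1), ("2024-01-02", night_2)]:
    update_in_place(conn, rows)
    append_version(conn, rows, loaded_at)

current = conn.execute("SELECT id, name FROM target ORDER BY id").fetchall()
history = conn.execute(
    "SELECT id, name, loaded_at FROM target_history ORDER BY id, loaded_at"
).fetchall()
```

The first option leaves one row per record and loses the old value; the second keeps both versions of id 2, each with the date it was loaded.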