
I am looking for concepts, or a pointer in the right direction, on best practices for a project I am working on. I am currently experimenting with the Luigi package for an ETL pipeline.

I followed the tutorial: Python: Create an ETL with Luigi, Pandas and SQLAlchemy

This was easy enough to get up and running. Where I am struggling conceptually is what happens when I run this pipeline nightly.

What is the best practice for handling new or changed data from the source?

  • Question 1) Should you pull the full dataset from the Source and overwrite the data at the Target? Or is it better to insert only new records, i.e. the difference between Source and Target?

  • Question 2) If you are only inserting new records, what is the best practice for handling updated records? Is it better to store the whole record again with a date on it, or to update the existing record in the Target?
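
To make the two options concrete, here is a minimal sketch of what I mean. It uses the standard-library sqlite3 module rather than my actual Luigi/SQLAlchemy setup, and the table name `target`, its columns, and the sample rows are all made up for illustration. `INSERT OR REPLACE` is SQLite-specific; other databases would need their own upsert syntax.

```python
import sqlite3


def make_target():
    """Create an in-memory stand-in for the Target table (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE target (id INTEGER PRIMARY KEY, name TEXT, updated_at TEXT)"
    )
    return conn


def full_refresh(conn, rows):
    """Option from Question 1: wipe the Target and reload everything from the Source."""
    with conn:  # commits on success, rolls back on error
        conn.execute("DELETE FROM target")
        conn.executemany(
            "INSERT INTO target (id, name, updated_at) VALUES (?, ?, ?)", rows
        )


def incremental_upsert(conn, rows):
    """Option from Question 2: insert new ids and overwrite rows whose id already exists."""
    with conn:
        conn.executemany(
            "INSERT OR REPLACE INTO target (id, name, updated_at) VALUES (?, ?, ?)",
            rows,
        )


# Hypothetical Source extracts for two consecutive nightly runs.
night_1 = [(1, "alice", "2024-01-01"), (2, "bob", "2024-01-01")]
night_2_changes = [(2, "bobby", "2024-01-02"), (3, "carol", "2024-01-02")]

conn = make_target()
full_refresh(conn, night_1)                # first run: load everything
incremental_upsert(conn, night_2_changes)  # second run: only new/changed rows
result = conn.execute("SELECT id, name FROM target ORDER BY id").fetchall()
print(result)
```

The "whole record with a date on it" variant would instead append every version as a new row, keyed on something like (id, updated_at), which is not shown here.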

Mario
ghawes

0 Answers