There is a wide variety of "pipelines" in today's Data Science world:
1. data ("lift & shift," curation, reconciliation?)
2. inference
3. modeling
4. machine learning (as distinct from 2 and 3?)
5. continuous integration and deployment (CICD of content, but these days, IaC too)
I don't necessarily mean the different companies in each of these spaces, but more the function of each category: where do they start and stop? When should my Airflow DAGs put raw data down as curated in Azure Storage or S3 buckets and say to Synapse or SageMaker, "okay, this is yours now"? Why not run basic modeling processes in Airflow, or why not do my data prep with MLStudio pipelines?
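To make the boundary question concrete, here is a minimal sketch of one possible handoff, assuming Airflow 2.4+ on the AWS side of the example; the DAG name, task names, and bucket/key details are hypothetical placeholders, and the actual curation and SageMaker trigger logic is left as comments rather than a definitive implementation:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def curate_raw_data(**context):
    # Placeholder for the "lift & shift" / curation step: read raw objects,
    # clean and reconcile them, and write curated files back to the lake
    # (e.g. an S3 "curated/" prefix). Details are assumed, not prescribed.
    ...


def hand_off_to_sagemaker(**context):
    # Placeholder for the boundary itself: Airflow stops here and only
    # signals SageMaker (e.g. via boto3's sagemaker client) that curated
    # data is ready, rather than running the modeling step in the DAG.
    ...


with DAG(
    dag_id="curate_then_hand_off",  # hypothetical name
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    curate = PythonOperator(
        task_id="curate_raw_data",
        python_callable=curate_raw_data,
    )
    hand_off = PythonOperator(
        task_id="hand_off_to_sagemaker",
        python_callable=hand_off_to_sagemaker,
    )

    # The ">>" is the whole question in one operator: everything left of it
    # is "data pipeline", everything triggered by it is "ML pipeline".
    curate >> hand_off
```

The alternative framings in the question amount to moving that `>>` boundary: either pull basic modeling into the DAG as another task, or push the curation step out into ML Studio / SageMaker pipelines instead.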
#5 might seem like it has nothing to do with Data Science, except: when do the applications that consume the data output from #2-4 have to notify their users that the infrastructure, or the software running on it, has changed? Why not let those downstream consumers change the EventHub schemas, or let the upstream DS producers approve infra and content updates and redeploys?
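One way to picture where that authority could live is a contract check inside the CICD pipeline itself. This is only a sketch under assumptions: it supposes producers and consumers both version the event schema as JSON in the repo, and the file paths and "breaking change" rules here are hypothetical, not any EventHub or CI feature:

```python
import json
import sys


def breaking_changes(current: dict, proposed: dict) -> list[str]:
    """Flag fields that downstream consumers rely on but that the
    proposed schema drops or retypes."""
    problems = []
    for field, spec in current.get("properties", {}).items():
        new_spec = proposed.get("properties", {}).get(field)
        if new_spec is None:
            problems.append(f"field removed: {field}")
        elif new_spec.get("type") != spec.get("type"):
            problems.append(f"type changed: {field}")
    return problems


if __name__ == "__main__":
    # Hypothetical paths: the "current" schema is what consumers read today,
    # the "proposed" one is what this pull request wants to deploy.
    with open("schemas/eventhub_current.json") as f:
        current = json.load(f)
    with open("schemas/eventhub_proposed.json") as f:
        proposed = json.load(f)

    problems = breaking_changes(current, proposed)
    if problems:
        print("breaking schema changes need sign-off from producers and consumers:")
        print("\n".join(f"  - {p}" for p in problems))
        sys.exit(1)  # fail the CICD run until both sides have approved
```

Whether the consumers, the DS producers, or the platform team owns the repo that this check lives in is exactly the governance question being asked.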