Data Engineering is the process of extracting data, (cleaning and) transforming it into a proper format, and loading it into a location where other users can access it. This process is commonly abbreviated as ETL (extract, transform, load). The data is later used by Data Scientists for their analysis.
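The three ETL steps described above can be sketched in a few lines of Python. This is only an illustrative sketch using the standard library: the sample CSV text, the `scores` table, and the function names `extract`, `transform`, `load`, and `run_etl` are all hypothetical, and a real pipeline would typically read from and write to external systems.

```python
import csv
import io
import sqlite3

# Hypothetical raw input: untrimmed names and one missing score.
RAW_CSV = """name,score
 alice ,10
bob,
carol,7
"""


def extract(text):
    """Extract: parse raw CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: trim names, drop rows with no score, cast scores to int."""
    cleaned = []
    for row in rows:
        score = row["score"].strip()
        if not score:
            continue  # cleaning step: skip incomplete records
        cleaned.append((row["name"].strip(), int(score)))
    return cleaned


def load(rows, conn):
    """Load: write the cleaned rows into a table other users can query."""
    conn.execute("CREATE TABLE IF NOT EXISTS scores (name TEXT, score INTEGER)")
    conn.executemany("INSERT INTO scores VALUES (?, ?)", rows)
    conn.commit()


def run_etl(text, conn):
    """Run the three steps in order: extract, then transform, then load."""
    load(transform(extract(text)), conn)
```

A caller could run it against an in-memory database with `run_etl(RAW_CSV, sqlite3.connect(":memory:"))` and then query the `scores` table, which is the "location where other users can access it" in this toy setup.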
Questions tagged [data-engineering]
41 questions
6
votes
3 answers
ETL and Data Engineering - is it purely the knowledge of tools or is there theory behind it?
I would like to better understand what a good Data Engineer must know or what he does. Job descriptions mostly list tools that are required, such as Python.
If it is possible to separate Data Engineering from Data Science, on what principles is…
MindYB
- 61
- 3
5
votes
1 answer
Data engineering good and bad practice?
I'm a Data Analyst in a pretty big company and I'm having a really bad time with the data I'm being given. I spend about 70% of my time thinking about where to find the data and how to pull it instead of analyzing it. I have to pull from tables that…
Marc
- 222
- 1
- 7
4
votes
2 answers
Data Engineering Stack - collect, transform and visualize geospatial data
I'm making a side project where I collect geospatial data by web scraping and from the OSM API. I've started with a simple Java application; however, I would like to build it as a data flow, purely for learning purposes.
Unfortunately, my knowledge about…
Forin
- 141
- 1
3
votes
2 answers
Storing Large dataset for processing and analysis of data
I am new to data engineering and wanted to know what the best way is to store more than 3000 GB of data for further processing and analysis. I am specifically looking for open source resources. I have explored many data formats for storage. The…
user14519285
- 41
- 3
3
votes
0 answers
Where do the different pipelines start and end?
There is a wide variety of "pipelines" that exist in today's Data Science world:
1. data ("lift & shift," curation, reconciliation?)
2. inference
3. modeling
4. machine learning (as distinct from 2 and 3?)
5. continuous integration and deployment (CICD of…
d8aninja
- 151
- 5
3
votes
1 answer
Which software engineering design patterns are most commonly applicable in building pipelines and other DE/DS/ML workflows?
In software engineering, a design pattern is a general, reusable solution to a common problem in software design. It is not a finished piece of code but rather a template or best practice that can be applied to specific challenges in different…
Robert Long
- 3,518
- 12
- 30
2
votes
1 answer
Alternative to EC2 for running ML batch training jobs on AWS
We are building an ML pipeline on AWS, which will obviously require some heavy-compute components including preprocessing and batch training.
Most of the pipeline is on Lambda, but Lambda is known to have time limits on how long a job can be run…
Cybernetic
- 800
- 1
- 4
- 11
2
votes
3 answers
Is there a cost associated with converting Koalas dataframe to Spark dataframe?
I know that pandas works "under the hood" with numpy arrays stored in dictionaries. In contrast, Koalas works with the underlying Spark framework. Does that mean that there is no extra cost associated with switching back and forth between Koalas and…
DataBach
- 165
- 1
- 9
2
votes
2 answers
Loading models from external source
I have a 500 MB model which I am committing to Git. That is a really bad practice, since with newer model versions the repository will become huge. It will also slow down all builds for deployments.
I thought of using another repository that contains…
room13
- 133
- 5
2
votes
1 answer
Can a fact table have a 1:1 relationship with a dimension table?
I am trying to build a small healthcare fact table with the following information
[patientid], [organid], [value]
Each [patientid] is unique to that patient, but there are only 10 available [organid] in the system (Heart, Left Lung, Right Lung,…
A. Romain
- 21
- 1
2
votes
1 answer
How to build this data pipeline?
I don't have much experience in data engineering, so I'm here to ask for advice. I am working on a project which consists of building a dashboard for the IT department of a bank. The dashboard should present information from log data. Log data…
Wissem Boujlida
- 21
- 2
2
votes
2 answers
How to deal with high data volumes? (Tools, techniques, concepts, etc.)
I have some doubts about how to deal with high volumes of data. I'm currently working in the data analysis/data science field, so I've had the chance to perform calculations, manipulate data, and obtain some conclusions about data using statistics…
tms
- 31
- 2
2
votes
1 answer
How to partition data effectively?
I have a pipeline which outputs model scores to S3. I need to partition the data by model_type and date. What is the most efficient way to partition the data from the…
CyberPunk
- 141
- 4
1
vote
1 answer
Best Technologies for Opening Large Sets of Sensor Time-Series Data to Analytics
My team is exploring options to create a robust "analytics" capability that is well-suited for our large quantities of sensor test data. I'd appreciate any suggestions for technologies that would perform well for my use case.
About my data:
For…
CrashLandon
- 11
- 2
1
vote
3 answers
Advice on where to continue in the field of data engineering and machine learning
I finished a 28-hour Machine Learning with Python (basic course) on Udemy, and it was very beneficial.
My aim is to be able to understand what ML is and how to use its concepts while working with data.
I am confused about where to continue. My…
alim1990
- 173
- 1
- 8