What exactly do data engineers do?


I'm in college, and data engineers seemingly make a shit ton of money.
I googled what they do and it told me they build some kind of pipelines, but I still don't understand what that means. Can someone help me understand it in simple terms, please? Thanks.


Data engineering is a broad field that deals with making computers handle data, usually large amounts of data, automatically.

For example, suppose you have a hospital. You have a bunch of data on each patient, entered by doctors on tablets carried around to the various rooms. That data matters to a lot of people, but different parts of the data matter to different people. Some data is confidential and needs access restrictions. Other data can be wrong and should be flagged as such. A data engineer sets up a system that collects the data from the tablets (“ingestion”), makes sure it makes sense (“validation”), possibly restructures it, and then makes it available in a transformed and restricted form to users (roughly the “transform” and “load” steps of “ETL”, which is short for extract, transform, load).
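To make those three steps a bit more tangible, here is a minimal Python sketch. The field names (patient_id, systolic, diastolic, ssn), the plausibility ranges, and the function names are all made up for illustration; a real hospital system would look very different.

```python
# Hypothetical readings as they might arrive from the tablets (ingestion).
RAW_READINGS = [
    {"patient_id": 1, "systolic": 120, "diastolic": 80, "ssn": "000-00-0001"},
    {"patient_id": 2, "systolic": 1200, "diastolic": 80, "ssn": "000-00-0002"},
]


def is_plausible(reading):
    """Validation: check the reading against rough, assumed plausible ranges."""
    return 50 <= reading["systolic"] <= 300 and 20 <= reading["diastolic"] <= 200


def restrict(reading):
    """Transformation: drop the confidential field before sharing the record."""
    return {key: value for key, value in reading.items() if key != "ssn"}


def run_pipeline(readings):
    """Split readings into valid and flagged lists, both with 'ssn' removed."""
    valid, flagged = [], []
    for reading in readings:
        target = valid if is_plausible(reading) else flagged
        target.append(restrict(reading))
    return valid, flagged


valid, flagged = run_pipeline(RAW_READINGS)
print("valid:", valid)
print("flagged:", flagged)
```

Note that the implausible reading is flagged rather than thrown away, since someone probably needs to go back and correct it.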

To make it more concrete, the data architecture of a hospital might consist of:

– A server that collects incoming data from the tablets (“data lake”).
– A batch job that runs every ten minutes. It takes the collected data and processes it, looking for irregularities (e.g. “oh, this patient’s blood pressure is probably not 1200/80”), then sorts it into a collection of appropriate databases or other storage systems (a “data warehouse”, distinguished from the original lake by being structured) and stores it there. (This whole process is a “data pipeline”, though not the only one in this example.)
– A set of separate systems (“data marts”) that make parts of this data available. For example, the hospital’s billing department gets one set of data in one format, while the doctors get a different set of data in another. Each data mart is fed by a separate data pipeline drawing data from the main data warehouse. Data marts sometimes also contain secondary (derived) data. For example, rather than showing each patient’s satisfaction score from a survey, a mart might show a computed average satisfaction that wasn’t directly in the original data set.
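As a rough sketch of the warehouse-to-mart idea, the snippet below uses Python's built-in sqlite3 module with an in-memory database standing in for the warehouse. The table name, column names, and scores are invented for illustration, and a real data mart is usually a separate system rather than a simple view.

```python
import sqlite3

# In-memory database as a stand-in for the structured data warehouse.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE surveys (patient_id INTEGER, doctor TEXT, score INTEGER)"
)
conn.executemany(
    "INSERT INTO surveys VALUES (?, ?, ?)",
    [(1, "Dr. A", 4), (2, "Dr. A", 2), (3, "Dr. B", 3)],
)

# A toy "data mart": a view that exposes only a computed average,
# not the individual patients' answers.
conn.execute(
    "CREATE VIEW satisfaction_mart AS "
    "SELECT AVG(score) AS avg_score, COUNT(*) AS responses FROM surveys"
)

avg_score, responses = conn.execute(
    "SELECT avg_score, responses FROM satisfaction_mart"
).fetchone()
print(avg_score, responses)
```

Someone who can only query satisfaction_mart sees the average and the number of responses, but never which patient gave which score.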

This takes expertise because you want to guarantee certain properties:

– What happens if one of these steps crashes? Can you afford to lose data? (In the hospital’s case usually no for things like patient health data, but maybe yes for things like a satisfaction survey.)
– How secure does your data need to be? (In a hospital’s case, most data needs to be secure.)
– How quickly can you access large amounts of data? (In a hospital’s case you might not care that much because you’re usually accessing small amounts of data at a time, but a few things like analytics are probably higher scale.)
– What connections can you draw among your data? (For example, do certain doctors have higher satisfaction? If none of your data marts make both the feedback and the doctor involved available, no one can draw this connection.)
– Do you need to be able to track where data comes from (“provenance”)? (A hospital probably does, because of potential malpractice by its doctors.)
– Can you afford for your data to be unavailable for a little while? (For your billing system yes, probably; for your patient data system no, probably not.)

And so on. The set of properties you want, and the cost you can afford, determine how you want to set all these pieces up.
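To pick just one of those properties, here is a small sketch of one common way to make a step safe to rerun after a crash: store each record under a unique ID, so replaying a batch overwrites records instead of duplicating them. The record IDs, the fields, and the dictionary standing in for a warehouse are all assumptions for illustration.

```python
def load_batch(batch, warehouse):
    """Store each record under its ID; replaying a batch cannot create duplicates."""
    for record in batch:
        warehouse[record["id"]] = record


warehouse = {}  # stand-in for real storage
batch = [
    {"id": "r1", "blood_pressure": "120/80"},
    {"id": "r2", "blood_pressure": "130/85"},
]

# Simulate a crash partway through: only the first record was stored.
load_batch(batch[:1], warehouse)

# After a restart, simply rerun the whole batch.
load_batch(batch, warehouse)

print(len(warehouse))
```

Because rerunning is harmless, the job does not need to remember exactly where it crashed. Whether that is good enough depends on the properties you need, which is the kind of trade-off described above.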