Deadly sins of Data Engineering

Hasan Özdemir
5 min read · Dec 18, 2022

Neither documentation nor a data model exists

Assume that you are working on an e-commerce project. The marketing team asks the DE team for a DWH to store customer metadata and comments on sold products, and you must deliver this request in a short period of time. If you don't create a data model and documentation, how will you answer the following questions? (A minimal data-model sketch follows them.)

  1. Which validation methods should you implement for each field?
  2. Which optimization strategies can you follow based on the fields?
  3. How do you plan to estimate storage requirements for the next quarters, years, etc.?
  4. Which challenges might you face based on the data model?
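As an illustration, here is a minimal sketch of what writing the model down buys you; the entity, field names, and limits are hypothetical, not from any real project.

from dataclasses import dataclass
from datetime import datetime

@dataclass
class ProductComment:
    comment_id: int       # surrogate key, monotonically increasing
    customer_id: int      # must reference an existing customer row
    product_id: int       # must reference an existing product row
    rating: int           # validation: 1 <= rating <= 5
    body: str             # validation: at most 2_000 characters, caps row size
    created_at: datetime  # enables date-based partitioning and retention rules

# With a maximum row size written down (~2 KB here), next quarter's storage
# becomes simple arithmetic: expected_comments_per_day * 90 days * 2 KB.

Each annotated field answers one of the questions above: the validation rules are explicit, the bounded text field and timestamp suggest optimization strategies (partitioning, retention), and the known row size turns capacity planning into a back-of-the-envelope calculation.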

There are no foreign key constraints in the database

The FOREIGN KEY constraint is crucial to relational database design. It lets us link data according to our needs. Because it creates dependencies between the columns of the parent and child tables, it also lets us decide what happens on ON UPDATE and ON DELETE actions performed on the rows of the parent table.

  • With an FK constraint, the database constantly maintains referential integrity, meaning it checks the data inserted into the parent and child tables (demonstrated in the sketch after this list).
  • One of the intangibles of the FK constraint is how much easier it makes it for DBAs and developers to understand how the database is designed.
  • Performance: by recording how tables relate, the constraint gives the server extra information about how to retrieve data when those joins are used.
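Here is a runnable sketch of both behaviors, using SQLite since it ships with Python; the table names and the ON DELETE CASCADE policy are just one possible choice.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite disables FK checks by default
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY)")
conn.execute("""
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL
            REFERENCES customers(id) ON DELETE CASCADE
    )
""")
conn.execute("INSERT INTO customers VALUES (1)")
conn.execute("INSERT INTO orders VALUES (10, 1)")

# Referential integrity: an order pointing at a missing customer is rejected
try:
    conn.execute("INSERT INTO orders VALUES (11, 999)")
except sqlite3.IntegrityError as err:
    print("rejected:", err)

# ON DELETE CASCADE: deleting the parent row removes its child rows as well
conn.execute("DELETE FROM customers WHERE id = 1")
print(conn.execute("SELECT COUNT(*) FROM orders").fetchone())  # (0,)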

If the data is loaded, it must be fine, so no need for testing

Data quality is very important for building trust in any relationship, especially the relationship that organizations have with their customers. Assume that you are a software company and you build an ML model that measures the impact of different marketing campaigns. Validity, veracity, consistency, completeness, timeliness, uniqueness: if you don't have tests for these metrics, you may face the following problems (a minimal test sketch follows the list)

  1. It is hard to build a long-term relationship with the end customer
  2. If the product you built is business-critical, you may lose more than money
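Here is a minimal sketch of automated checks for three of those dimensions; the campaign events frame and its columns are hypothetical.

import pandas as pd

events = pd.DataFrame({
    "event_id": [1, 2, 3],
    "campaign": ["summer", "summer", None],
    "clicks": [120, -5, 40],
})

def check_quality(df: pd.DataFrame) -> list:
    failures = []
    if df["event_id"].duplicated().any():  # uniqueness
        failures.append("duplicate event_id values")
    if df["campaign"].isna().any():        # completeness
        failures.append("missing campaign labels")
    if (df["clicks"] < 0).any():           # validity
        failures.append("negative click counts")
    return failures

print(check_quality(events))  # ['missing campaign labels', 'negative click counts']

Run such checks right after every load; a model trained on the frame above would otherwise silently learn from a negative click count.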

Misunderstanding of multiple environments

It's always good to split your environments into at least two, DEV and PRODUCTION, but do people really understand this correctly? If one of the data consumers is running business operations on DEV, that means they don't.

Let's assume that you are preparing your monthly executive summary on sales for the accounting team, and you have prepared it from the DEV environment. What do you think is the worst that could happen? (A guardrail sketch follows these questions.)

  • Can you confirm that all the data you analyzed passes all data tests?
  • Can you make sure all data is included or excluded based on date or other filters?
  • Is there any chance you caused a delay for the data engineers because you were also working on the DEV environment?
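One cheap guardrail is to make the environment choice explicit in code. A minimal sketch, assuming connection strings are selected through an APP_ENV environment variable; the variable name and DSNs are hypothetical.

import os

DSNS = {
    "DEV": "postgresql://dev-dwh.internal/analytics",
    "PROD": "postgresql://prod-dwh.internal/analytics",
}

def get_dsn() -> str:
    env = os.environ.get("APP_ENV", "DEV")  # default to DEV, never to PROD
    if env not in DSNS:
        raise ValueError(f"unknown environment: {env}")
    return DSNS[env]

def executive_report_dsn() -> str:
    # anything that reaches executives must be built from PROD data
    if os.environ.get("APP_ENV") != "PROD":
        raise RuntimeError("executive reports must run against PROD")
    return get_dsn()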

Doing everything on Apache Airflow

Apache Airflow is an open-source platform for developing, scheduling, and monitoring batch-oriented workflows. Airflow is very useful for most use cases, but writing large data transformations purely in Airflow tasks is bad practice and you should avoid it. Performance is a key metric of your pipeline; avoid things that make it worse.
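A minimal sketch of the healthier division of labor: Airflow schedules and monitors, while the heavy transformation runs inside the warehouse. It assumes the Postgres provider package is installed; the DAG id, connection id, and table names are hypothetical.

from datetime import datetime

from airflow import DAG
from airflow.providers.postgres.operators.postgres import PostgresOperator

with DAG(
    dag_id="daily_sales_rollup",
    start_date=datetime(2022, 12, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # the transformation executes inside the database engine, not on the
    # Airflow worker; Airflow only triggers it and tracks its state
    build_rollup = PostgresOperator(
        task_id="build_sales_rollup",
        postgres_conn_id="dwh",
        sql="""
            INSERT INTO sales_rollup (sale_date, total_amount)
            SELECT sale_date, SUM(amount) FROM sales GROUP BY sale_date;
        """,
    )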

Writing the whole ETL pipeline in Pandas

Pandas is mainly used for data analysis and the associated manipulation of tabular data in DataFrames. Pandas is still one of the leading Python packages for data-related jobs, but forcing DEs to write the whole ETL pipeline with Pandas causes a lot of headaches. With Python and SQL the limits depend on your imagination; with Pandas that is not the case. The following missing features make things harder in Pandas:

  • Pandas can only perform equality-based joins natively; conditional joins are not as easy to implement as in SQL. You have to do it indirectly, which makes the code unreadable. Let's check the workaround in the code below.
import pandas as pd

# sample tables; the desired join is t1.c1 == t2.c3 AND c2 BETWEEN c4 - 2 AND c4 + 2
t1 = pd.DataFrame({"c1": [1, 2, 3], "c2": [10, 20, 30]})
t2 = pd.DataFrame({"c3": [1, 2, 3], "c4": [11, 25, 50]})

# step 1: merge handles only the equality part of the join
df_merged = pd.merge(t1, t2, left_on="c1", right_on="c3")
# step 2: filter the merged frame to apply the non-equi part of the condition
final_df = df_merged[(df_merged.c2 >= df_merged.c4 - 2) & (df_merged.c2 <= df_merged.c4 + 2)]
# step 3: drop the intermediate join column
final_df = final_df.drop(columns=["c3"])
print(final_df.head())
  • Lack of performance on huge amounts of data: Pandas sticks to a single core, which leads to increased run time.
  • Pandas does not support multi-threading, which hurts performance for independent ETL tasks such as data transformations within the pipeline (a common workaround is sketched after this list).
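Because Pandas itself stays on one core, a common workaround is to run independent transformations in separate processes. A minimal sketch; the file names and the cleanup step are hypothetical.

from concurrent.futures import ProcessPoolExecutor

import pandas as pd

def transform(path: str) -> pd.DataFrame:
    df = pd.read_csv(path)
    df["amount"] = df["amount"].fillna(0)  # example cleanup step
    return df

if __name__ == "__main__":
    paths = ["sales_2021.csv", "sales_2022.csv"]  # independent inputs
    with ProcessPoolExecutor() as pool:           # one process per input
        frames = list(pool.map(transform, paths))
    print(pd.concat(frames).shape)

This only helps when the tasks are truly independent; for genuinely large joins and aggregations, pushing the work to SQL or a distributed engine remains the better option.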

We have VCS and it's enough. No need for DVC!!

Data version control tracks and manages changes in a collection of related entities. It records changes and modifications over time, so you can recall, revert, compare, reference, and restore anything you want. Let's assume you have a DWH and you are facing the following problems:

  • The DWH server(s) failed catastrophically
  • You discovered a leak or defect in the production DWH

Do you know how long it would take to revert to the previous version while you resolve these problems? What will you do when these errors occur?
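With data under version control, "give me the last good version" becomes a single call. A minimal sketch using DVC's Python API; the repo URL, file path, and tag are hypothetical, and dvc must be installed.

import dvc.api

# read the exact dataset that shipped with the last known-good tag,
# without touching the broken production copy
good_csv = dvc.api.read(
    "data/customers.csv",
    repo="https://github.com/org/dwh-data",
    rev="v1.4.0",  # tag recorded before the defect appeared
)
print(good_csv[:200])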

No need for future-proofing

Future-proofing is the process of anticipating the future and developing methods to minimize the effects of shocks and stresses from future events. These are the most common issues that you may face one day as a data team:

  • Outdated systems: Outdated DB models can be your worst enemy and will rapidly limit digital transformation efforts, as they are unable to keep up with the speed of modern business growth
  • Poor data quality: There are not enough words to explain the problems that poor data can cause. It hurts the organization's overall agility.
  • Shortage of skills
  • Lack of clearly defined data governance policies

CI/CD is for software engineering

If you have ever searched for the success metrics of Data Engineers / Data Scientists, you may have seen that the metrics are all related to the data. If you write unclean code or do not automate things, people will rarely be mad at you, but it is time to read about the principles of DataOps and MLOps. CI/CD is for every engineer who wants things to be automated and easy to manage.

Unclear DWH Access Management

A DWH is usually accessed by the company’s employees through business intelligence tools, internal applications, dashboards, and sometimes directly via SQL queries.

There are two crucial points for accessing a DWH.

  1. User Access Management: direct queries vs. intermediate tables
  2. Data Access Methods: specifying roles and access levels for each data consumer

Unfortunately, because of lack of knowledge, pressure from data consumers, and many other reasons, DEs set up top-level access for users and let them run direct queries in the DWH, which causes a lot of headaches after a while. A safer starting point is sketched below.
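A minimal sketch of role-based access for a Postgres-backed DWH using psycopg2; the schema, role, and user names are hypothetical, and real deployments usually apply such grants through migrations rather than ad-hoc scripts.

import psycopg2

GRANTS = """
CREATE ROLE analyst_ro NOLOGIN;                        -- shared read-only role
GRANT USAGE ON SCHEMA marts TO analyst_ro;             -- expose only the curated schema
GRANT SELECT ON ALL TABLES IN SCHEMA marts TO analyst_ro;
CREATE USER alice LOGIN PASSWORD 'change-me';
GRANT analyst_ro TO alice;                             -- alice inherits read-only access
"""

with psycopg2.connect("dbname=dwh user=admin") as conn:
    with conn.cursor() as cur:
        cur.execute(GRANTS)

New consumers get a role, never top-level access; direct queries stay confined to curated marts instead of raw tables.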

Reach out to me with any questions: https://bio.link/hasanozdemir

Thank you so much for your interest and attention.
