Data Engineering

Discussion of data engineering topics: data pipelines, tools and technologies, databases and DBMS, best practices.

Rules:

Limited to data engineering, no general CS/programming posts.
No technical questions. Example: how to fix this bug in my code.
No marketing
No resumes, jobs
No PII

founded 1 year ago

How do I convince my data engineer to not modify data before including it in our db? (lemm.ee)

submitted 1 year ago by [email protected] to c/[email protected]

Our data engineer insists on lowercasing everything and stripping some other formatting, like newlines, from free-text fields.

They say it's "better for Elasticsearch".

To me that makes no sense, and it loses information that can't be added back. But I couldn't really convince them otherwise. So far no real problem has come out of it, but it makes for a worse experience for the user. For example, company names that are acronyms show up in all lowercase (ibm, llc, etc.), and in free-text fields we lose the places where the user wrote in caps or added paragraphs.

What are your thoughts on this?

Disclaimer: I'm not a data engineer, just a PM on a data-related product.

[email protected] 7 points 1 year ago

Where are you getting the data from, and do you maintain access to the originals after ingestion?

Is the database used for anything other than Elasticsearch?

If you do not have access to the originals after ingestion, you should keep a perfect copy of the data because, as you noted, you lose information otherwise. This is especially important when you need to fix bugs in the normalization logic or when requirements change. For example, suppose your normalization logic replaces "-" with "_", and at some point in the future you need to distinguish between "this-phrase" and "this_phrase". If you've lost the original data, you've also lost the ability to fix your normalized data and indexes.
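To make that concrete, here is a minimal sketch in Python, assuming the raw value is stored next to the normalized one. The function names (`normalize_v1`, `normalize_v2`, `ingest`, `renormalize`) and the record layout are hypothetical, made up for illustration, and not taken from anyone's actual pipeline.

```python
def normalize_v1(text: str) -> str:
    """Hypothetical first version: lowercase and replace '-' with '_'."""
    return text.lower().replace("-", "_")


def normalize_v2(text: str) -> str:
    """Hypothetical revised version: lowercase only, keep '-' and '_' distinct."""
    return text.lower()


def ingest(raw_values, normalize):
    """Store each raw value untouched next to its normalized form."""
    return [{"raw": value, "normalized": normalize(value)} for value in raw_values]


def renormalize(records, normalize):
    """Rebuild the normalized values from the preserved raw values."""
    return [{"raw": r["raw"], "normalized": normalize(r["raw"])} for r in records]


records = ingest(["This-Phrase", "this_phrase"], normalize_v1)
# Under v1, both rows collapse to the same normalized value.
collapsed = {r["normalized"] for r in records}

# Because the raw values were kept, the normalized values can be rebuilt.
fixed = renormalize(records, normalize_v2)
distinct = {r["normalized"] for r in fixed}
```

If only the `normalized` values had been stored, there would be nothing to feed into `normalize_v2`, and the two phrases would stay merged.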

Similarly, while the existing normalization logic might be better for Elasticsearch, you may not be using Elasticsearch forever, and you don't know the requirements of the next system.

That all said, I'm also skeptical that there is any real Elasticsearch benefit to modifying your data as described, in particular converting to lowercase. You might want to ask your data engineer to tell you explicitly what the purported benefits are. If they tell you it's for performance, ask for metrics, and weigh performance gains/costs against the usability gains/costs. If they can't give you metrics, ask for the documentation supporting their claims. If they can't give you metrics or docs, find a new data engineer.
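One possible reason for that skepticism: Elasticsearch can lowercase values at index time by itself while the stored `_source` keeps the original text. The sketch below builds an index definition as a Python dict, assuming a recent Elasticsearch version with typeless mappings. The field names `company_name` and `notes` and the normalizer name `lowercase_normalizer` are hypothetical. `text` fields use the `standard` analyzer by default, which already lowercases tokens. `keyword` fields need an explicit normalizer for case-insensitive matching.

```python
import json

# Hypothetical index definition: case-insensitive matching is configured in
# the index, so the documents themselves can be sent with original casing.
index_body = {
    "settings": {
        "analysis": {
            "normalizer": {
                # Custom normalizer that lowercases keyword values when indexing.
                "lowercase_normalizer": {
                    "type": "custom",
                    "filter": ["lowercase"],
                }
            }
        }
    },
    "mappings": {
        "properties": {
            # Exact-match field, compared case-insensitively via the normalizer.
            "company_name": {
                "type": "keyword",
                "normalizer": "lowercase_normalizer",
            },
            # Full-text field; the default standard analyzer lowercases tokens.
            "notes": {"type": "text"},
        }
    },
}

# JSON body that could be sent when creating the index.
index_json = json.dumps(index_body, indent=2)
print(index_json)
```

With a setup along these lines, a search for "ibm" can still match a document whose stored value is "IBM", so the lowercasing would not have to happen before the data reaches the database. Whether that applies to your setup is something to confirm with your data engineer.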

[email protected] 2 points 1 year ago

We build market analytics and reports from the data in Elasticsearch.

Thank you for your suggestion. I'll address this with them to see if I can get a better understanding of the reasoning behind it.

We don't have access to all of the past data. Most of it, yes, but a lot of it, no.