this post was submitted on 23 Aug 2023
Data Engineering
Where are you getting the data from, and do you maintain access to the originals after ingestion?
Is the database used for anything other than Elasticsearch?
If you do not have access to it after ingestion, you should keep a perfect copy of the data because, as you noted, you lose information otherwise. This is especially important for fixing bugs in the normalization logic or handling requirement changes. For example, suppose your normalization logic replaces "-" with "_", and at some point in the future you need to distinguish between "this-phrase" and "this_phrase". If you've lost the original data, you've also lost the ability to fix your normalized data and indexes.
Similarly, while the existing normalization logic might be better for Elasticsearch, you may not be using Elasticsearch forever, and you don't know the requirements of the next system.
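To make the "this-phrase" vs "this_phrase" problem concrete, here is a minimal sketch of keeping the original next to the normalized value. The `normalize` rule, the `Record` layout, and the checksum field are all hypothetical, made up for illustration rather than taken from your pipeline:

```python
import hashlib
from dataclasses import dataclass


def normalize(value: str) -> str:
    """Hypothetical rule: replace '-' with '_' and lowercase the result."""
    return value.replace("-", "_").lower()


@dataclass(frozen=True)
class Record:
    raw: str         # untouched original, kept so it can be reprocessed later
    normalized: str  # derived value, can be regenerated from raw at any time
    raw_sha256: str  # checksum to verify the stored original is intact


def ingest(raw: str) -> Record:
    """Store the original alongside its normalized form."""
    digest = hashlib.sha256(raw.encode("utf-8")).hexdigest()
    return Record(raw=raw, normalized=normalize(raw), raw_sha256=digest)


records = [ingest("this-phrase"), ingest("this_phrase")]

# Both inputs collapse to the same normalized value...
assert records[0].normalized == records[1].normalized
# ...but the originals remain distinguishable, so the normalized data
# and indexes can be rebuilt if the rule changes.
assert records[0].raw != records[1].raw
```

If only `normalized` were stored, the two inputs would be indistinguishable and there would be nothing to rebuild from.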
That all said, I'm also skeptical that there is any real Elasticsearch benefit to modifying your data as described, in particular converting to lowercase. You might want to ask your data engineer to tell you explicitly what the purported benefits are. If they tell you it's for performance, ask for metrics, and weigh performance gains/costs against the usability gains/costs. If they can't give you metrics, ask for the documentation supporting their claims. If they can't give you metrics or docs, find a new data engineer.
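On the lowercase point specifically: Elasticsearch can lowercase at index time by itself. The default `standard` analyzer lowercases tokens for `text` fields, and `keyword` fields accept a `normalizer`. A rough sketch of an index definition that does this while leaving the stored document untouched is below. The field name `company_name` and the normalizer name `lowercase_normalizer` are assumptions for illustration, and the dict is only printed here, not sent to a cluster:

```python
import json

# Hypothetical index definition. "company_name" and "lowercase_normalizer"
# are illustrative names, not taken from the original pipeline.
index_body = {
    "settings": {
        "analysis": {
            "normalizer": {
                "lowercase_normalizer": {
                    "type": "custom",
                    "filter": ["lowercase"],  # built-in lowercase token filter
                }
            }
        }
    },
    "mappings": {
        "properties": {
            "company_name": {
                "type": "keyword",
                # Indexed terms are lowercased; the document in _source keeps its original casing.
                "normalizer": "lowercase_normalizer",
            }
        }
    },
}

print(json.dumps(index_body, indent=2))
```

If something like this covers the search requirements, lowercasing the data before ingestion mostly just throws information away.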
We build market analytics/reports out of the data from Elasticsearch.
Thank you for your suggestion. I'll address this with them to see if I can get a better understanding of the reasoning behind it.
We don't have access to all of the past data. Most of it, yes, but a lot of it, no.