this post was submitted on 25 Jun 2024

Hi, what is the best way to keep the databases I use across different projects? I use a lot of CSVs that I need to prepare every time I work with them (I just copy and paste the code from other projects), but I would like to make a module that I can import and that holds all the processing for each database. For example, for this database I usually define columns = [(configuration of, my columns)], names = [names], dates = [list of date columns], dtypes = {column: type},

then database_1 = pd.read_fwf(**kwargs), database_2 = pd.read_fwf(**kwargs), database_3 = pd.read_fwf(**kwargs)...

Then database = pd.concat([database_1...])

But I would like to have a module that I could import, with all my databases and their ETL configuration in it, so I could just write something like 'database = my_module.database' to load the database without repeating the whole process every time.

Thanks for any help.

[–] [email protected] 4 points 3 months ago (9 children)

Thanks, I solved it by creating a file with a function:

def get_database(name):
    if name == 'database':
        # all the processing to create the database
        return database

Then df = get_database('database') runs all the processing and returns the result.
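One way to keep that function from growing an if branch per database is to look the name up in a dict of settings. This is only a rough sketch under assumptions: the sample text, column positions, and the CONFIGS layout are made up for illustration, and in real use each source would be a file path rather than an in-memory string.

```python
import io

import pandas as pd

# Hypothetical fixed-width samples standing in for the real files on disk.
SAMPLE_A = "001Alice 2024-01-05\n002Bob   2024-02-10\n"
SAMPLE_B = "003Carol 2024-03-15\n"

# One entry per database: its sources plus the read_fwf settings from the post.
CONFIGS = {
    "database": {
        "sources": [SAMPLE_A, SAMPLE_B],
        "colspecs": [(0, 3), (3, 9), (9, 19)],
        "names": ["id", "name", "joined"],
        "header": None,  # the samples have no header row
        "dtype": {"id": "int64"},
        "parse_dates": ["joined"],
    },
}


def get_database(name):
    """Read every source configured for `name` and return them concatenated."""
    if name not in CONFIGS:
        raise ValueError(f"unknown database: {name!r}")
    config = dict(CONFIGS[name])  # copy so pop() does not alter CONFIGS
    sources = config.pop("sources")
    # StringIO keeps the sketch runnable; pass file paths to read_fwf in real use.
    frames = [pd.read_fwf(io.StringIO(text), **config) for text in sources]
    return pd.concat(frames, ignore_index=True)


df = get_database("database")
```

Adding another database is then a new entry in CONFIGS rather than a new branch in the function.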

[–] [email protected] 5 points 3 months ago (8 children)

I am a little curious about the conditional. I have a suspicion that this is a bit of over-engineering.

The problem you seem to be trying to solve is “I need to access the same data in multiple ways, places, or projects.” That’s what a database is really great for. However, if you just need to combine the same csv files you have on disk over and over, why not combine them and dump the output to a csv? Next time you need it, just load the combined csv. FWIW this is loosely what SQLite is doing.

If you are defining a method or function that performs these ETL operations over and over, and the underlying data is not changing, I think updating your local files to be the desired content and format is actually what you want.
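To make the "combine once, then just load the result" idea concrete, here is a minimal sketch. The function name load_combined and its arguments are hypothetical, and it assumes all the source CSVs share the same columns.

```python
from pathlib import Path

import pandas as pd


def load_combined(source_paths, combined_path):
    """Return the combined table, building the combined CSV only if it is missing."""
    combined_path = Path(combined_path)
    if combined_path.exists():
        # The expensive combine step already ran once; reuse its output.
        return pd.read_csv(combined_path)
    frames = [pd.read_csv(path) for path in source_paths]
    combined = pd.concat(frames, ignore_index=True)
    combined.to_csv(combined_path, index=False)
    return combined
```

Note that this never notices when the source files change; deleting the combined file forces a rebuild.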

If instead you’re trying to explore modules, imports, abstraction, writing DRY code, or other software development fundamentals, great! Play around, it’s a great way to learn. I might also recommend picking up some books! Usually your local library has some books on Python development, object-oriented programming, and data engineering basics that you might find fascinating (as I have).

[–] [email protected] 3 points 3 months ago (6 children)

Some data comes as CSV; other data comes as database files, from the SQL server, Excel, or web APIs. For some of them I even need to combine multiple sources with different formats.

I guess I could have a database with everything tidier, easier to use, more secure, and less prone to failure. I'm still going to prepare the databases (I'm thinking of DataFrame objects in a pickle, but I want to experiment with Parquet) so they don't have to be processed every time, but I wanted something where I could just write the name of the database and get the updated version.
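Roughly, "write the name and get the updated version" could look like a pickle cache that is rebuilt whenever a source file is newer than the cache. This is a sketch with hypothetical names (load_cached, build), and it only covers file-based sources; SQL server tables or web APIs would need a different freshness check.

```python
import pickle
from pathlib import Path


def load_cached(cache_path, source_paths, build):
    """Return the cached object, rebuilding it when any source file is newer."""
    cache_path = Path(cache_path)
    sources = [Path(p) for p in source_paths]
    if cache_path.exists():
        cache_time = cache_path.stat().st_mtime
        # The cache is fresh if no source was modified after it was written.
        if all(src.stat().st_mtime <= cache_time for src in sources):
            with cache_path.open("rb") as fh:
                return pickle.load(fh)
    # Missing or stale cache: run the processing and store the result.
    result = build(sources)
    with cache_path.open("wb") as fh:
        pickle.dump(result, fh)
    return result
```

Here build would be whatever function does the processing and returns the DataFrame. Pickle files should only be loaded if you created them yourself. Swapping pickle for DataFrame.to_parquet and pd.read_parquet would work the same way, but needs pyarrow or fastparquet installed.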

[–] [email protected] 3 points 3 months ago (1 children)

This sounds kind of like a data warehouse. Depending on the size of the data and the number of connections, I’d say that whether you use a script, a database, or a module, this is a much bigger problem. Look into dbt (data build tool) and Airflow.

[–] [email protected] 2 points 3 months ago (1 children)

I have a data warehouse, and some of the databases I use come from there, but it can only be accessed from the virtual machine.

[–] [email protected] 2 points 3 months ago

I would say consider having a script that combines all these sources into a single data mart for your monthly reports. Could also be useful for the ad hoc studies, but idk how much of the same fields you're using for these studies.
