this post was submitted on 18 Oct 2023

WebDev


To preface, I’m currently rewriting a personal webapp to use MySQL instead of storing everything in hundreds of JSON files. I’m currently in the testing phase of generating tables with the data from the JSON files, destroying the tables, adding more columns and data, repeat, all to make sure everything is working as intended.

My issue is that occasionally I’ll create too many columns and then I get an error saying something about the row being too large? I’ve also noticed that if I change the parameters of what data is allowed to go in the column, I can generate more columns. I know there is some relationship between number of columns, the data that can go in a column, data size, and row size but I don’t know what’s going on. I’d appreciate it if someone could broadly go over how row length(?) can affect number of columns.

Thank you

[–] a_fancy_kiwi 1 points 1 year ago* (last edited 1 year ago) (2 children)

For instance, this is why on a ‘varchar’ column you have to specify how many characters it can hold.

That's likely what I'm doing wrong. I have roughly 220 columns and 95% of them were set to varchar(255) not understanding the size restrictions. A few were set to text(1000) as well (but these don't count?). I definitely don't need 255 characters worth of data in all fields
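For a rough sense of scale, here is a back-of-the-envelope sketch in Python of how much row space that many varchar columns can reserve. It assumes a 4-byte-per-character charset such as utf8mb4 (an assumption, the thread never states the charset), takes "95% of 220" as 209 columns, and uses a hypothetical narrower width of 50 characters for comparison. It ignores per-row overhead and any tighter limit the storage engine applies, so treat the numbers as an estimate only.

```python
# Rough worst-case estimate of the row space taken by VARCHAR columns in MySQL.
# Assumptions (not from the thread): utf8mb4 charset at 4 bytes per character,
# and a hypothetical alternative width of VARCHAR(50).

MYSQL_MAX_ROW_BYTES = 65_535  # MySQL's row-size limit, shared by all columns


def varchar_max_bytes(chars: int, bytes_per_char: int) -> int:
    """Worst-case bytes for one VARCHAR(chars) column, including its length prefix."""
    data_bytes = chars * bytes_per_char
    # The length prefix is 1 byte when the data fits in 255 bytes, otherwise 2.
    prefix_bytes = 1 if data_bytes <= 255 else 2
    return data_bytes + prefix_bytes


def row_bytes(num_columns: int, chars: int, bytes_per_char: int) -> int:
    """Worst-case bytes for num_columns identical VARCHAR(chars) columns."""
    return num_columns * varchar_max_bytes(chars, bytes_per_char)


wide = row_bytes(209, 255, 4)   # 209 columns of VARCHAR(255)
narrow = row_bytes(209, 50, 4)  # the same columns shrunk to VARCHAR(50)

print(f"VARCHAR(255): {wide} bytes, fits: {wide <= MYSQL_MAX_ROW_BYTES}")
print(f"VARCHAR(50):  {narrow} bytes, fits: {narrow <= MYSQL_MAX_ROW_BYTES}")
```

Under these assumptions the varchar(255) layout asks for several times the 65,535-byte limit, while the narrower one fits. That would be consistent with shrinking the declared sizes letting more columns through.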

This excludes column types ‘text’ and ‘blob,’ which your storage engine will store separately.

Are there any storage limits when using text or blob? I assume that the lookups are slower?

So what are you doing that requires creating more than 1017 columns or more than 64k of data per row?

64 kilobytes, correct?

And I suppose it's a glorified project book generator. 220ish variables per project and once it's all entered into the webapp, I can generate a few hundred page PDF. Each variable of info isn't very big, most are one word variables, a few are like 3 sentences; each row worth of info fits in a max 15KB JSON file.

At the moment, I was just putting all the data of each JSON file in one table. As I progress, variables like contacts will be put in a separate contact table and I'll reference them (once I figure out how to do that lol). Based on the way the work is split up to collect the data on each project, I could logically split the table into 3 but I was hoping to avoid that just to reduce complexity and have everything, minus contacts, easily viewable on one table

[–] ShunkW 3 points 1 year ago (1 children)

220 columns in one table? Yeah, you definitely need to do some research and learn how to design a schema that isn't based on incredibly wide tables.

[–] a_fancy_kiwi 1 points 1 year ago (1 children)

If you had to ball park it, what’s the max number of columns you would use per table?

Right now I’m considering splitting the table into 4. 3 tables per person/job and 1 for contacts.

Or, within each job there are 5ish main topics. So I could have 2 tables. One table for contacts and one table 15 columns wide but I’d store JSON in the cells. The data between each row isn’t related to another row except by contacts.

Is one option more correct than the other?

[–] ShunkW 3 points 1 year ago (1 children)

In a perfect world, I'd say 20 columns per table max, but shoot under that if possible. This isn't always feasible, and I've definitely had some fat tables in some legacy apps I've worked on. But 220 is just unmanageable, especially if you're doing a select * against that table ever in your app.

[–] a_fancy_kiwi 1 points 1 year ago

Thank you for the info, I appreciate it

[–] dual_sport_dork 1 points 1 year ago

If your variables are inconsistent in number but relatively consistent in format, and can be uniquely identified, it is probably a better idea to have a table that's all variables, one per row. Your table structure would be something like id as INT or BIGINT, variable_name as VARCHAR(x), and variable_value as TEXT. When you look up a record, you SELECT * FROM variables WHERE id=whatever and parse the results. Note that in this case id cannot be the primary key on its own: you'll have more than one row with the id of whatever, which matches the ID of your document. You'll still want an index on id so the lookup stays fast. You can keep whatever metadata about the document in another table, which will hopefully be short.
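A minimal sketch of that one-variable-per-row layout, written in Python with the built-in sqlite3 module as a stand-in for MySQL so it can run anywhere. The table and column names follow the description above; the index name, the sample project data, and the load_document helper are made up for illustration. In MySQL the column types would be INT or BIGINT, VARCHAR(x), and TEXT rather than SQLite's INTEGER and TEXT.

```python
import sqlite3

# In-memory SQLite database standing in for MySQL (an assumption of this sketch).
conn = sqlite3.connect(":memory:")

# One row per variable; id repeats for every variable of the same document.
conn.execute(
    "CREATE TABLE variables ("
    " id INTEGER NOT NULL,"
    " variable_name TEXT NOT NULL,"
    " variable_value TEXT)"
)
# Non-unique index so that looking up one document does not scan the whole table.
conn.execute("CREATE INDEX idx_variables_id ON variables (id)")

# Hypothetical sample data: two documents with a different number of variables.
rows = [
    (1, "project_name", "Bridge repair"),
    (1, "site_notes", "Access from the north side only."),
    (2, "project_name", "Roof survey"),
]
conn.executemany("INSERT INTO variables VALUES (?, ?, ?)", rows)


def load_document(conn, doc_id):
    """Fetch every variable belonging to one document and return them as a dict."""
    cursor = conn.execute(
        "SELECT variable_name, variable_value FROM variables WHERE id = ?",
        (doc_id,),
    )
    return dict(cursor.fetchall())


project = load_document(conn, 1)
print(project)
```

Adding a new kind of variable is then just another row, not another column, so the row-size limit stops being a concern.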

Having everything stored in text fields will not necessarily make lookups slow, but it may make retrieval of the data in them slow if they contain a lot of data and there are an awful lot of them. Especially if you retrieve it all the time when maybe you don't have to. It will also make your app temporarily contain a lot of data in memory while it's holding the result of the SQL call.

In SQL, finding a row (the seek or lookup) is a very different procedure from returning the data within it once found. The amount of time and CPU cycles it takes to find a given row can be quite high, especially if your tables are not efficiently designed and do not have suitable indexes, or you have to use a complex query to narrow it down. Once the row is found, the time it takes to return the data depends only on how much of it there is. A huge result will also be very slow if it has to be piped over an external connection. If your app and the database live on the same machine, the data transfer from database to app can be pretty fast even if the result is huge. If they're on separate machines and that data has to be squeezed through a network connection, though, that's going to be painful.

Consider that:

SELECT * FROM table WHERE id=1

and

SELECT * FROM table WHERE id REGEXP '^1$'

Will result in very different lookup times despite superficially accomplishing the same thing. The second one is going to take longer. If your table is long, it will take a lot longer. You could even cause it to exceed the query time limit of your connection if you're not careful. The first is just finding an ID by value, on a column that should (hopefully) be indexed. The second is using a regular expression to match the digit "1" as a string, which must be compared against every single value in column id one at a time in a full scan of every row in the table. Full table scans are slow and expensive, and you should avoid them whenever possible.
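The difference can be observed with a small Python sketch. It uses the built-in sqlite3 module instead of MySQL (an assumption made so the example runs anywhere), a made-up projects table with 1000 rows, and a REGEXP function registered from Python, since SQLite does not ship one. MySQL's planner and output differ, but the same principle applies: the equality test can use the key, while the regular expression has to be evaluated against every row.

```python
import re
import sqlite3

# In-memory SQLite database standing in for MySQL (an assumption of this sketch).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE projects (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO projects (id, name) VALUES (?, ?)",
    [(i, f"project {i}") for i in range(1, 1001)],
)

# Count how many times the regular expression is evaluated.
calls = {"regexp": 0}


def regexp(pattern, value):
    """Backs SQLite's REGEXP operator; called once per row that gets examined."""
    calls["regexp"] += 1
    return re.search(pattern, str(value)) is not None


conn.create_function("REGEXP", 2, regexp)


def plan(sql):
    """Return the human-readable steps of SQLite's query plan for a statement."""
    return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]


by_value = "SELECT * FROM projects WHERE id = 1"
by_regex = "SELECT * FROM projects WHERE id REGEXP '^1$'"

plan_by_value = plan(by_value)  # a SEARCH step that uses the primary key
plan_by_regex = plan(by_regex)  # a SCAN step over the whole table

rows_by_value = conn.execute(by_value).fetchall()
rows_by_regex = conn.execute(by_regex).fetchall()

print(plan_by_value, plan_by_regex)
print("regexp evaluations:", calls["regexp"])
```

Both queries return the same single row, but the regular expression is evaluated once for each of the 1000 rows. On MySQL, putting EXPLAIN in front of each query gives a comparable view of whether an index is used.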