Organizing SQL tables and named queries efficiently

If you have a bunch of machines, and multiple events to track, is it best to have a transaction group for each event type, and for each machine?

I read one guy saying that if you had a bunch of machines, you wanted to organize all of them into one table.

I thought the limitations on SQL were that as a table gets bigger, the queries slow down. So I thought that if my tables are split up, then my tables will be smaller with the same total growth.

I thought the limitations on SQL were that as a table gets bigger, the queries slow down.

Not necessarily true; in fact, splitting the tables could make things slower than a single table with proper indexes. Unless otherwise specified, I believe indexes build a tree (a BTREE) of the values, which is what makes lookups quick. A binary tree grows at a rate of log_2: for 1 record you have 1 level of nodes, for 2 to <4 records you have 2 levels of nodes, and for 4 to <8 records you have 3 levels of nodes.

I will give you a quick example - Say you had a table of 12 records and you index on some integer column. It might look like

          8
        /   \
       4     11
      / \    /  \
     2   6  10   12
    / \ / \  /
   1  3 5 7 9

(Ordering: values to the left of a node are less than it and values to the right are greater, which is what lets the database discard half of the remaining records at each step.)

So any lookup with a WHERE myIndexColumn = ? would take at most 4 comparisons for the database to find the record (since there are only 4 levels of the binary tree to go through).

Now say you split the records into two tables and are still generous enough to give both an index on that same column. Each table’s binary tree is now 3 levels deep. So in the worst case, where the value you’re looking for sits at the bottom of the tree, the database has to do 3 + 3 = 6 comparisons, which is worse than if you had kept all the data in a single table.
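A quick way to sanity-check that arithmetic is to count levels directly. This is just a sketch of the math, not tied to any particular database:

```python
def btree_levels(records: int) -> int:
    """Worst-case comparisons (levels) needed to find a value in a
    complete binary tree holding `records` values."""
    levels, capacity = 0, 0
    while capacity < records:
        levels += 1
        capacity = 2 ** levels - 1  # a full binary tree of `levels` levels
    return max(levels, 1)

single = btree_levels(12)                  # one table of 12 records
split = btree_levels(6) + btree_levels(6)  # two tables of 6 records each
print(single, split)  # 4 6
```

The single table stays ahead because the two half-size trees are each only one level shallower, not half as deep.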

I’ve worked with a database that did this. The worst part is, all the queries required a UNION of the “current” and “old” records, which eliminates the ability to use the indexes as well, and that tends to force full table scans instead. Those are about the worst case scenario you can have for a WHERE clause: instead of “knowing” about the values in your table, as in the binary tree, and being able to discard up to half of the remaining records at each level you go down, SQL now has to look through every single record in both tables. So you go from at most 4 (one table) or 6 (two indexed tables) comparisons to a full 12.
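I can’t reproduce MySQL’s planner here, but as a sketch of how you can check index use versus a full scan yourself, SQLite’s EXPLAIN QUERY PLAN reports a SEARCH (index) or SCAN (full table) step per query; the table and index names below are made up:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE record (id INTEGER, code INTEGER);
    CREATE INDEX idx_record_code ON record (code);
    CREATE TABLE record_history (id INTEGER, code INTEGER);  -- no index
""")

def plan(sql):
    # The last column of each EXPLAIN QUERY PLAN row describes the step.
    return [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]

indexed = plan("SELECT * FROM record WHERE code = 5")            # SEARCH ... USING INDEX
unindexed = plan("SELECT * FROM record_history WHERE code = 5")  # SCAN ...
print(indexed, unindexed)
```

Running EXPLAIN (MySQL) or EXPLAIN QUERY PLAN (SQLite) on your actual UNION query is the reliable way to see whether the indexes survive.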

The second part of what I said, about losing indexes when you UNION a table with a _log or _history table (or whatever you would call the table with the older records), is from my experience with MySQL; I do not know if the same is true for, say, SQL Server.

However, the first part would still be true: if SQL Server still utilized both indexes, you’d still be at 6 comparisons instead of 4. If SQL Server were really smart and able to combine those two index trees into a single tree, you’d have the same situation as if you had just kept all the records in one table to begin with.

All this to say, I would recommend not splitting up the table. At least, not until you have millions of records and the performance is notably slow. Then you might want to move records older than a certain date (such that they’re not required by whatever queries you run for reports etc.) into a myTable_history type of table. Though at that point I think you’re actually better off setting up date partitioning to help with performance instead.

The last drawback is that when you have the records in two tables, your queries get unnecessarily convoluted. And again, at least in the case of MySQL, you will be killing your indexes (you lose them all when you UNION the two tables), making the performance much worse than if you had just kept everything in a single table.

Please someone correct me if I misspoke about anything here.


Just a pedantic point – B-trees and B+ trees (the default for many database indexes) don’t have to be binary. Their fanout (the number of children per node) is often far greater than 2. How much larger depends on the size of the data type being indexed, the size of block pointers, and the block size of the disk the index is on. If your fanout is 300, by just the third level you can reach 27,000,000 index entries.
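The fanout arithmetic is easy to check; a small sketch:

```python
def max_entries(fanout: int, levels: int) -> int:
    # Upper bound on entries reachable within `levels` levels
    return fanout ** levels

def levels_needed(entries: int, fanout: int) -> int:
    # Smallest number of levels whose capacity covers `entries`
    levels, capacity = 1, fanout
    while capacity < entries:
        levels += 1
        capacity *= fanout
    return levels

print(max_entries(300, 3))             # 27000000
print(levels_needed(27_000_000, 300))  # 3
print(levels_needed(27_000_000, 2))    # 25 for a binary tree
```

That difference (3 levels versus 25) is why real index trees with a large fanout stay shallow even for huge tables.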

Wikipedia article on B+ trees: B+ tree - Wikipedia


Right, I just used an actual binary tree for the example as that was hard enough to get a good ASCII version of.

The major broad stroke I wanted to convey is that by splitting the data into two tables, your combined B+ trees are larger than a single B+ tree of the same data would be, due to the logarithmic growth: a single B+ tree takes fewer steps (in the worst case) to find the right record than two B+ trees with the same data split between them. Also, you lose out on the power of indexes entirely in the event you need to UNION the current table with the history table, which is really the major reason not to split it (at least for MySQL).

I worked with a database that did this: it had a Record and a Record_History table, which was thought to speed up queries, but then they had to UNION these two tables for their purposes, leading to two full table scans and making it much slower than if they had just left everything in one table with an index. That’s the major pitfall I want to make sure @zacharyw.larson avoids.

I would think your best bet for keeping the table performant despite the number of rows, beyond an index, would be partitioning based on the date: Database table partitioning in SQL Server

TLDR: Do not split the table into two tables.

This got the math proof juices for me flowing a bit and I just wanted to make sure I wasn’t saying anything incorrect so here’s a little proof:

Let x = the number of records in the database table.
Let n = the fanout of the B+ tree.
Assume that when the records are split between two tables, they are split right down the middle: x/2 records in each.

Then, for one table to be more performant than two tables of split records, the following must be true

log_n(x) <= log_n(x/2) + log_n(x/2)

i.e., the max number of steps for a B-tree of x records with fanout n must be at most the combined max for the two tables with half the records each. Using log rules:

log_n(x) <= log_n(x) - log_n(2) + log_n(x) - log_n(2)
log_n(x) <= log_n(x^2) - 2log_n(2) 
log_n(x) <= log_n(x^2) - log_n(4)
log_n(x) <= log_n(x^2/4)
x <= x^2/4
0 <= x^2/4 -x
0 <= x(x/4 - 1)

and since x is the number of records, x >= 0, so this becomes true at x >= 4 records no matter the fanout of the B+ tree.

I wasn’t sure of this, so I plotted it in Desmos, and it looks like I am right: the inequality crossover stays at x = 4 no matter what I make the fanout.

Should be noted this is really measuring worst case scenarios of number of steps, and assuming that the split tables can both utilize their indexes.
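The same check can be done numerically instead of in Desmos; a sketch of those worst-case step counts:

```python
import math

def single_steps(x, n):
    """Worst-case steps for one B+ tree of x records, fanout n."""
    return math.log(x, n)

def split_steps(x, n):
    """Combined worst case for two trees of x/2 records each."""
    return 2 * math.log(x / 2, n)

# For x > 4 the single table wins, regardless of fanout:
for n in (2, 10, 300):
    for x in (5, 100, 10**6):
        assert single_steps(x, n) <= split_steps(x, n)

# At exactly x = 4 the two sides meet:
print(single_steps(4, 2), split_steps(4, 2))  # 2.0 2.0
```

Which matches the proof: the crossover sits at x = 4 for every fanout tried.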

If I am wrong please let me know, been a while since proofs class.


Comment removed. I quoted wrong and had put in an extra x. I had tried to use Wolfram Alpha to assist with the math to help myself understand, then made an error right off the bat in the equation.

I completely agree with @bkarabinchak.psi about keeping it to a single table (one caveat at the end). But I just want to emphasize that if this is going to be a very large table, you need to really understand the common queries you are going to be running against it. If you define indexes without thinking about your use case, they may not cover your needs, and you end up with an extremely slow query because it cannot utilize the defined indexes and requires a full table scan.

Adding indexes to a large table isn’t fun; it can lock the table up for quite some time while it builds the index. So it’s worthwhile putting in some effort in the beginning to really understand the queries you expect to run against it.

On the flip side, you do not want to create more indexes than you need. Indexes have a performance hit on table writes (every index needs to be updated each time a new record is added to the table). For tables with way too many indexes, this can also cause problems by confusing the query optimizer when it’s trying to pick an index to use for a given query (at least that has been my experience with MySQL).

For very large tables dealing with time series data, partitions can definitely help. BUT, and this is a big one, you need to be careful and know what you are doing. Partitioning is not a silver bullet that you can just throw at a table without any thought and expect it to help (it is often used incorrectly, causing performance problems rather than helping). And partitioning has drawbacks that may prevent it from being a viable option (restrictions around the partitioning key and unique constraints/indexes, etc.). If configured correctly and for the right use case, partitions can speed up queries and make deleting old data almost instant via DROP PARTITION. Trying to DELETE from a very large table without this can lock the table for a long time.

Typically the only time I split into two tables is if I know that data after X amount of time can be archived because it is rarely needed. In most of our use cases we only care about the raw data from the last year (or less) for 99% of our queries. For those situations I will have one table for data from the last year and an archive table to hold everything older than a year (sometimes the archive has a copy of everything from the last year too). Data is then moved/copied to the archive table on a regular interval (this is where the instant DROP with partitioning is helpful). I only bother with this for tables where I expect growth to more than ~100 million rows (that just seems to be the threshold where it has made sense for us with MySQL). Sometimes we’ll even place the archive table in a different database in order to speed up backups/restores for the data that really matters, which can also allow us to use cheaper AWS RDS instances for storing that archive data.
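The move-to-archive step described above might look like the following sketch (SQLite in-memory here just for illustration; the table and column names are made up, and in MySQL you’d run the equivalent INSERT…SELECT and DELETE in one transaction):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE event (machine TEXT, code INTEGER, t_stamp TEXT);
    CREATE TABLE event_archive (machine TEXT, code INTEGER, t_stamp TEXT);
    INSERT INTO event VALUES
        ('press1', 10, '2022-01-15'),
        ('press1', 11, '2023-06-01'),
        ('press2', 10, '2023-07-04');
""")

cutoff = "2023-01-01"
with conn:  # one transaction, so the move is all-or-nothing
    conn.execute(
        "INSERT INTO event_archive SELECT * FROM event WHERE t_stamp < ?", (cutoff,))
    conn.execute("DELETE FROM event WHERE t_stamp < ?", (cutoff,))

live = conn.execute("SELECT COUNT(*) FROM event").fetchone()[0]              # 2
archived = conn.execute("SELECT COUNT(*) FROM event_archive").fetchone()[0]  # 1
```

With date partitioning, the DELETE half of this becomes a near-instant partition drop instead of a row-by-row delete.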

I’m sure I have other thoughts on this…but that’s enough of my ramblings for now…


Yes, I’ve run into this before, make your indexes as soon as possible.

When you say indexes, what do you mean?
As in total entries?

In my case, I am using transaction groups for this.
I have an MS SQL / SQL Bridge setup.

Right now I have a test group. It has a t_stamp, and it has the year, minute, seconds, day, and month of each event. Then it has an event code and a qty of the event, which I am not sure about.

For some events, this is a quantity. For the stop events, it is a duration. So I might have two different columns.

Possibly useful:



I understand a bit better about indexes now.

I guess the way I do this for the different machines and different machine events is to make a bunch of different transaction groups, but point them all at the same table.

If I point a transaction group at a table with an additional column, the transaction group will add that column, yes?

I am still learning how to analyze the structure of my transaction groups, tables, and named queries.
I was convinced the other day that one big table is better.
Today, though, I guess I don’t understand again.

Option 1
One huge table, all events, all machines

ndx, machine, event code, quantity or duration, t_stamp

Pro - one big table; all the machines and all the codes make it easy to parameterize
Con - table will be huge

Option 2
one table of events per machine

ndx, event code, quantity or duration, t_stamp

pro - smaller tables, faster queries?
con - a different named query for every machine

Option 3
A table for uptime events
A table for rejection events
Each of these for each machine

ndx, downtime code, duration, t_stamp
ndx, rejection code, quantity, t_stamp

pro - smallest tables? fastest queries?
con - named queries for each machine

Option 1 would be the only way I would ever go.

Define “huge”… if you have a properly sized DB setup and the table is indexed according to the way you will be querying the data, then until you’re into billions of rows it’s not very big.

The better questions are: What’s your transaction rate? How many rows a day/hour/second do you generate? What’s your retention period? Will you purge data older than x days? These answers will impact your decision. If one machine is generating 1 billion rows a month and you want 5-year retention, then yeah, not gonna work… but I doubt that’s the case.


Event logs are the poster child for “tall” tables. So, option 1, unquestionably.

For efficiency, I would use a numeric foreign key for machine, where a separate, relatively static table of machines has that as a primary key. I would also omit duration, as it is easily generated from lead(t_stamp) over (partition by machine order by t_stamp).


Option 1, hands down. In addition to everyone else’s comments, consider also the ability to query and summarise your data. With multiple tables this would be an absolute nightmare, in addition to being a lot slower.


I would do a 4th option. A table for uptime events for all machines and a 2nd table for rejection events for all machines.


Many of my charts require uptime and rejects, so I am trying to add my reject transaction groups to the table that already has uptime.

I am planning to run alter table add column in the database query browser to add the quantity column.

ALTER TABLE table_name
ADD column_name datatype;
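As a sketch of what that does (SQLite here just for illustration; the table and column names are made up), existing rows simply get NULL for the new column unless you specify a DEFAULT:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE event (machine TEXT, code INTEGER, t_stamp TEXT)")

# Add the new column; existing rows get NULL (or the DEFAULT, if given)
conn.execute("ALTER TABLE event ADD COLUMN quantity INTEGER")

cols = [row[1] for row in conn.execute("PRAGMA table_info(event)")]
print(cols)  # ['machine', 'code', 't_stamp', 'quantity']
```

On a very large MySQL table the same ALTER can lock the table while it runs, so it’s worth doing during a quiet window.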

I have a column in the transaction groups with datatype string for the machine/line name.
Usually, though, I see people using varchar(xxx).
How bad is this, and is there an easy fix?

For your charts and reports you would join the two tables in your query.

Thanks, you’re right, I could join them.
They are used a ton though, and these guys said it is faster to have stops and rejects together in a table.

I’d like some help with the varchar(xxx) question regarding one of the data points being a string in the transaction group.
I don’t see where I would make the datatype varchar(xxx).
Also, I already have string in there, so what is the best way to keep my data and make the appropriate change?

It would be slower to insert the data with just one table. You would need to test in your application to determine which would be faster to run your report queries.

For the transaction group datatype, hopefully someone else can comment since I don’t use transaction groups. But I would guess that the two are compatible and that you can leave it as-is.

Is int2 in Ignition a smallint in SQL?
How easy is changing datatypes later?