Organizing SQL tables and named queries efficiently

bkarabinchak.psi · November 18, 2021, 3:52pm

Right, I just used an actual binary tree for the example as that was hard enough to get a good ASCII version of.

The major broad stroke I wanted to convey is that by splitting it into two tables, your combined B+ Trees are larger than a single B+ Tree of the same data would be due to the logarithmic growth, that a single B+ tree has less steps to find the right record than two B+ trees of the same data split between them (in the worst case). Also that you lose out on the power of indexes in the event you need to UNION the current database table with the history table, which is really the major reason to not split it (at least for MySQL).

I worked with a database that did this, had a Record and a Record_History, which was thought to speed up queries, but then they had to UNION these two tables for their purposes, leading to two full table scans and making it much slower than if they just left it in one table with an index. That’s the major point I want to make sure @zacharyw.larson avoids.

I would think your best bet for making sure the table stays performant despite the number of rows, if not an index, would be to do partitioning based on the date Database table partitioning in SQL Server

Edit:
TLDR: Do not split the table into two tables.

This got the math proof juices for me flowing a bit and I just wanted to make sure I wasn’t saying anything incorrect so here’s a little proof:

let X = the number of records in the database table
Let N be the fanout of the B+ tree
Let’s assume that in the scenario where records are split between two tables are done so right down the middle, x/2 and x/2 records in each.

Then, for one table to be more performant than two tables of split records, the following must be true

log_n(x) <= log_n(x/2) + log_n(x/2)

ie the number of max steps for the a btree of x records with fanout n must be less than the max of the two tables with half the records. using log rules -

log_n(x) <= log_n(x) - log_n(2) + log_n(x) - log_n(2)
log_n(x) <= log_n(x^2) - 2log_n(2) 
log_n(x) <= log_n(x^2) - log_n(4)
log_n(x) <= log_n(x^2/4)
x <= x^2/4
0 <= x^2/4 -x
0 <= x(x^2/4 -1)

and since x is the number of records x >=0, so this becomes true at x>4 records no matter the fanout of the B+ tree

i wasn’t sure of this so I plotted it in desmos and it looks like I am right the inquelaity line stays at x=4 nomatter what I make the fanout -

Should be noted this is really measuring worst case scenarios of number of steps, and assuming that the split tables can both utilize their indexes.

If I am wrong please let me know, been a while since proofs class.