Indexing SQL database guidance

I am trying to learn how to use indexes for my SQL database.

I have a basic understanding, mostly from YouTube and some dry reading:

I have transaction groups adding fault, reject, and a few other kinds of events for many machines to a very large table.

I wonder how I should set up my indexes, how and when to update them, and what happens when I run a query after a lot of data has been added since the index was created.

The video said the syntax was like:

CREATE INDEX nameofindex_idx
ON tablename (column_x, column_y);

-My slowest queries group events by the day or month in which they occurred.
-New data is always being entered.

As I understand it, I can drop the index later and create a new one if needed.

I won't comment fully on indexes at this time; however, if there is an index of any kind on the table, then odds are the performance is more about the size of the table than the execution plan of the query.

There is an argument to be made here for partitioning this table into many tables, particularly depending on the rate at which new events are entered.

What flavor of DB are you using?
How many rows are already in the table?
What rate are new rows being entered?
How often are you querying the data?

Do you have access to the tools needed to view the execution plan of these queries?

What flavor of DB are you using?
MS SQL

How many rows are already in the table?
1 million (estimating)

What rate are new rows being entered?
10,000 rows a day

How often are you querying the data?
several times per hour

Do you have access to the tools needed to view the execution plan of these queries?
not at the moment

Depending on the DB, one potential issue with this type of grouping query is that even if you have an index on your datetime/timestamp column, you likely will not gain the benefit of your index if you group by something like:

GROUP BY YEAR(tstamp_col), MONTH(tstamp_col)
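
A hedged workaround sketch, reusing the mybigtable/t_stamp names from the test queries later in this thread: constrain the scan with a plain range predicate on the bare column (which an index on t_stamp can seek), then group on the derived day.

SELECT CAST(t_stamp AS date) AS event_day,
       COUNT(*) AS event_count
FROM mybigtable
-- Bare-column range predicate: an index on t_stamp can seek this,
-- even though the GROUP BY expression itself wraps the column.
WHERE t_stamp >= DATEADD(day, -30, CAST(GETDATE() AS date))
GROUP BY CAST(t_stamp AS date)
ORDER BY event_day;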

Often I will create separate summary tables to periodically record these aggregated results so that I don't have to run this slow query against the entire table, I can just run it for "new/updated" records.
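
A minimal sketch of that summary-table pattern, assuming hypothetical names (event_summary, a daily grain, and an event_code column from the poster's description). Only fully completed days are rolled up, so rows are never counted twice:

CREATE TABLE event_summary (
    event_day   date     NOT NULL,
    event_code  smallint NOT NULL,
    event_count int      NOT NULL,
    PRIMARY KEY (event_day, event_code)
);

-- Run periodically: summarize only completed days not yet recorded.
INSERT INTO event_summary (event_day, event_code, event_count)
SELECT CAST(t_stamp AS date), event_code, COUNT(*)
FROM mybigtable
WHERE t_stamp >= (SELECT COALESCE(DATEADD(day, 1, MAX(event_day)), '19000101')
                  FROM event_summary)
  AND t_stamp < CAST(GETDATE() AS date)   -- exclude today (still filling)
GROUP BY CAST(t_stamp AS date), event_code;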


I would recommend partitioning the table.

  1. Generally you would do that by creating a new table at the start of each interval; however, you then have to do some processing to determine the appropriate table for any given query.

  2. You could also choose to move old data to an "archive" table after some timespan. This is useful if the data you are querying is generally recent and older data doesn't need to be queried as often.

  3. You can blend the two and keep small archive tables.

#1 is good if you have the ability to set up the process at the onset. This can be more difficult if there is already programming in place which would break should the table name change. One of the drawbacks of this method is processing requests that span the partitions. This is how Ignition's Historian partitions tables when configured to do so.

#2 is good if you need to set up the process after the table and the automation around it have already been created, because you keep the same table name and none of your queries break. However, if you need to query data in the archive, you still have the performance issue of querying against millions of rows. This is sometimes okay if you rarely need to query that table. Processing queries across the partition is also less complicated, as there are only ever two tables; there is no need to dynamically determine and potentially span multiple tables, which makes unioning the data much easier to accomplish.

#3 comes with all the good and bad of #1 except it keeps the original table name the same, making the process easier to implement after the fact.
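
A hedged T-SQL sketch of option #2, assuming a hypothetical mybigtable_archive table with the same schema as mybigtable: the DELETE ... OUTPUT form moves rows in one atomic statement, and a UNION ALL view keeps cross-partition queries simple.

-- Move rows older than 6 months into the archive in one statement.
DELETE FROM mybigtable
OUTPUT DELETED.* INTO mybigtable_archive
WHERE t_stamp < DATEADD(month, -6, GETDATE());

-- Optional (run in its own batch): a view so reports that need
-- everything can still query a single name.
CREATE VIEW all_events AS
SELECT * FROM mybigtable
UNION ALL
SELECT * FROM mybigtable_archive;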

Querying data against 300K rows will pretty much always be faster than against 3M no matter how good your indexing is.

Also, remember that while indexes help the performance of SELECTs and WHERE clauses, they hurt the performance of inserts and updates. So, by trying to improve the performance of the ~100s of queries per day, you'll hurt the performance of the 10,000 inserts per day, because the DB must modify every index on the table with each insert. There is a balance to be struck here.

Generally speaking, best practice is to put indexes on columns which are frequently used to sort on, filter on, or join tables together. If the data is rarely used for those things, then an index may not be worth the cost.
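
For example, a hedged sketch matching the queries shown later in this thread (the index names and the composite variant are illustrative assumptions, not a prescription):

-- Single-column index: supports the range filter and ORDER BY on t_stamp.
CREATE NONCLUSTERED INDEX ix_mybigtable_tstamp
ON mybigtable (t_stamp);

-- Composite variant, if queries also filter or group by event_code:
-- CREATE NONCLUSTERED INDEX ix_mybigtable_tstamp_event
-- ON mybigtable (t_stamp, event_code);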


The idea of splitting into two tables is very appealing:
-Having one table that receives new information quickly
-Having another table that the old information moves into

  1. Are these ideas more or less appealing?
    -Keep one table.
    -For data which doesn't require the current day: run the query once a day, store the result in a dataset, and use the dataset.
    -Some method to trim the query when it needs to be close to real time: TOP x, index by date, or a subquery filtering to after date x.

  2. If the second set is appealing, how do I know that my query/dataset is good?

I could have a button that has a time limit on how often it can be used, and have it let a user call a refresh on the dataset.

  3. How much slower would adding data from transaction groups be if I had t_stamp indexed? Is it going to corrupt my data, or is it not really an issue?
    (Collecting infeed and outfeed bigints, maybe 3 smallints, and a varchar(50) description.)

  4. Is it best to index on t_stamp alone, or should I index on t_stamp and the event code? Or are these both effectively just indexing by t_stamp, because two events don't happen at the same time?


Testing:

SELECT * FROM mybigtable
WHERE t_stamp > DATEADD(day, -2, GETDATE())
ORDER BY t_stamp DESC;

time: 0.204s
sample size limit: 200,000
rows fetched: 27,200

SELECT TOP 1000 * FROM mybigtable
WHERE t_stamp > DATEADD(day, -2, GETDATE())
ORDER BY t_stamp DESC;

time: 0.036s
sample size limit: 200,000
rows fetched: 1000

As a side note, functions in the WHERE clause can limit the effectiveness of indexes. You’d have to look at the query plan to know for sure, but, without checking, I wouldn’t be confident that an index on t_stamp would actually help those queries.
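
As a hedged illustration of that rule (the dates are hypothetical), the usual distinction is whether the function wraps the indexed column itself:

-- Typically cannot use an index seek: the functions wrap the column.
SELECT COUNT(*) FROM mybigtable
WHERE YEAR(t_stamp) = 2024 AND MONTH(t_stamp) = 1;

-- Typically can seek: the bare column is compared to a precomputed range.
SELECT COUNT(*) FROM mybigtable
WHERE t_stamp >= '20240101' AND t_stamp < '20240201';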
