SQL Design

Let’s say I am connecting to a bunch of machines that are quite different. For instance, let’s say I wanted to monitor and record the motor speed of each machine, but the number of motors varied between 1 and 8 (but someday there might be a machine with 12, or whatever).

What would be the best way to design a table? Should I have eight columns (Motor_1, Motor_2, etc) even though most of the entries will be empty, and add additional columns later if necessary? Or should I use some sort of varbinary type and tease the bytes apart as needed? Since I will be using SQL 2008, which compresses unused ints down to nothing, I’m not worried about disk space (I’ll have a very large number of records), but it just sort of bugs me that something like the number of instances of anything is basically hardcoded in the table design. Then again, maybe this is just how it’s done.

I also thought of creating a second table as I normally do and reference the identity key and just write a new record for each update for the motors that actually exist. But since I will eventually have a very large number of records (tens of millions?), I didn’t know if this would be very efficient to do a join on every query.

With this very limited information, do you have any ideas to throw out?

Well, this is actually a tricky issue. The rule of thumb is that you almost never want to design tables with columns like “Something_1, Something_2…”. The reason is that it makes SQL queries that aggregate information really hard to write.

I would go for the two-table related aproach that you shied away from. For the uninitiated, you would have two tables:

MachineId (PK)

MachineId (FK->Machines.MachineId)
MotorId (Indexed)
Timestamp (Indexed)

The keys and indices should make the join performance a non-issue.

Now the flip side to this is that if your only aim is to throw this in an EasyChart, the simple one-table approach will lead to less queries executed against the DB, which will be faster. This is because the easy chart will be able to run a single query for lots of different motor history pens.

Not to tease you with something you can’t use yet, but in case this is a project early in the planning stages, the upcoming SQLTags history feature will make this choice irrelevant.

Ok, let’s say I went with two tables. How do I get the indentity key of the machine update record so I can use it in the motor table. For instance, if I had two updates like this:

Update 1 at 12:03:35:

Machine_ID = 203
Motor_1_Speed = 2300
Motor_2_Speed = 2400

Update 2 at 12:03:37:

Machine_ID = 203
Motor_1_Speed = 2100
Motor_2_Speed = 2900

In my machine table, I would have two records that were unique based on the indentity column (probably called something like Machine_ndx). In my motor table, I would have four records and would want to use this Machine_ndx value to join on, right? So would I write to the machine table, and then read back the new _ndx, and then write to the motor table? Or should I create a different unique identifier another way?

I many not fully understand what you’re trying to accomplish so please excuse me if my following thoughts are irrelevent.

One approach is to have a table for each machine. Each table would have the standard index and time stamp along with a column for each motor. The tables would have different numbers of columns. For example:

Table Machine203 (for Machine 203 with 3 motors)
Machine203_ndx, motor_1_speed, motor_2_speed, motor_3_speed, t_stamp

Table Machine204 (for Machine 204 with 4 motors)
Machine204_ndx, motor_1_speed, motor_2_speed, motor_3_speed, motor_4_speed, t_stamp

Another approach is to have one table and columns like this:

Table Machine (1)
Machine_ndx, Machine203_motor_1_speed, Machine203_motor_2_speed, Machine203_motor_3_speed, Machine204_motor_1_speed, Machine204_motor_2_speed, Machine204_motor_3_speed, Machine204_motor_4_speed, t_stamp

Yet a third approach is to have one table and columns like this:

Table Machine (2)
Machine_ndx, MachineID, MotorID, Speed, t_stamp

Each approach has its advantages and disadvantages. The best approach depends on how the data is generated and in what form, and what you intend to do with the data.

The first approach is straight forward. It may be harder to generate boiler plate windows as the table to query would differ with the machine. It also means that charts with data from multiple machines will be a little slower to update because multiple tables will have to be hit. It’s also going to be more of challenge to join data from multiple tables based on time for reports or exporting since there is not guarantee that time stamps will match exactly. If additional machines are added, you just add additional tables. If additional motors are added to a machine, you just add new columns to the appropriate table. Null values will be written for existing rows.

The second approach may be best for reporting data for multiple machines over time in reports and charts. It means you only have to hit one table to get the data and t_stamps are guaranteed to be consistent amongst different machines/motors. Additional columns must be added if additional machines and/or additional motors are added. You have to make sure that the database can handle the required number of columns. If the column count approaches the column limit, additional tables may have to be created, in effect, making this approach the same as the first. In this scenario though, you have a less than obvious idea where a particular machines data resides. This approach doesn’t lend itself well to boiler plates unless you use sophisticated expressions to generate dynamic SQL queries. When you add more motors, the database engine will have to add null values for any existing rows.

The third approach works best for boiler plate windows but is more difficult to use with trending. The data must also be available in a manner which is not likely to occur naturally or easily generated. Again, joining data from multiple machines, and motors, based on time would be a challenge. In terms of columns not having null values, this approach is the most efficient. But in terms of the number of rows, it’s the least efficient.

Depending on which approach you take, some of the short comings of the design can possibly be overcome by automatic post processing within the database. Or you could use a combination of approaches (i.e., have two tables of different design for the same machine). A dual approach may not be the most space efficient but it may greatly speed up queries or simplify your FPMI application.

Just some food for thought…

Thanks MickeyBob and Carl. I’ve played around with a few things, and am committed to the two table approach (my first gut instinct). Mickey, the number of machines can be large and there really isn’t heck of a lot of commonality between them, so any machine properties that are not common will get lumped into their own tables and referenced through a join.

I ended up using “scope_identity” and got something to work. Here is some pseudo-code for anyone who is interested:


DECLARE @RefID bigint

SELECT @RefID = scope_identity()


Basically, the first record is written to T1. Then, the current identity is copied to @RefID, and passed into the next two insert statements in the other table. If you saw my machine layouts, you would understand why this is going to help a lot. I’ve got a lot of testing to do to see where this falls apart though.

Another general design question.

In playing around with some charts and tables that Travis made for me for a demo (Travis, they loved it!), I pushed the CPU resources to 100% due to the multiple sql queries.

What I did was add multiple charts and tables so they could see different ways the data could be displayed, so I actualy had about 4 indentical queries running at once. I probably should know this, but if I have a query that is used on two or more components, should I instead use a single query to populate a dataset and then bind the visible tables and charts to this dataset?

Absolutely. wherever you can, reduce the # of queries going to the DB. If they’re an indentical query on the same window: don’t run it twice.