Dropped data points

I’ve had some database crashes recently (the server is in dire need of an upgrade; sadly I have no control there), and in all cases the Store and Forward system did not recover the data. I read somewhere in the docs that the Gateway should be able to store up to 10 million rows before it starts dropping data. I thought that was a default internal setting, but maybe I'm misunderstanding the system. Now I’m wondering if I need to increase the disk cache value to 10 million to have the best chance of recovering at least a few hours’ worth of data in the event of a database crash. Or maybe the quarantine only writes to RAM and I’m simply running out of memory. If you could help me understand the system better, that would be great. Let me know if there is any other info that would be helpful in diagnosing the problem.

Running Ignition 8.1.43 with PostgreSQL 16.10
Writing around 500 rec/sec

That figure is for Ignition Edge, not Standard S&F.

The default S&F cache is 25K records. These settings can be tweaked on the Gateway's Store and Forward configuration page.

Ok, would it be correct to say that in the event of a main gateway crash, the Edges can hold 10 million rows of data before they start dropping data? And in the case of the database crashing, this main gateway disk cache is the limiting factor?

Do you know if there is some upper limit on this disk cache setting, or am I just limited by RAM or disk storage?

They can hold more than 10M if you are using MQTT; in that case it is 35 days’ worth.

As to your other question, note this from the docs:

Note: The disk cache is intended to be a short term storage system. It is not recommended to increase the Max Records beyond 50,000.

Ok, 50,000 is very low; that’s barely 2 minutes of data for me. That seems like it can’t be correct. I think we must be missing something. Maybe the quarantine is a separate buffer that is bigger and is just not working for some reason.

You are using Postgres, so you have a normal Ignition Gateway and not the Edge version. Based on the documentation you should not increase the disk-cached S&F to more than 50,000 records. Although a record can be a batch of values, that is still not much, depending on the amount of data you want to log. I feel your pain :slight_smile:

If the DB is not available, you will drop data pretty fast. Maybe you can use the internal DB as a short-term buffer with a historian splitter in case the Postgres connection is not available, or set up an HA DB cluster.

Regarding Edge: I‘m not sure if there is an S&F system for Edge. It uses the internal DB, which is on the same system, with a tag row limit of 10 million rows or 35 days (whichever comes first; afterwards the data is pruned). I heard that the new historian uses a WAL, which can maybe be seen as an S&F function, but I‘m not sure if this would help in case the internal DB crashes.

You should set up a PostgreSQL high-availability cluster. The standard tool to do this with PostgreSQL is the Replication Manager (repmgr). You will probably want to use either a master-tracking proxy in front of the cluster or an extra, movable IP address for the active master.
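
To make the master-tracking idea concrete, here is a minimal Python sketch that asks each node whether it is the active primary using PostgreSQL's built-in pg_is_in_recovery() function. The hostnames, database name, and credentials are placeholders I made up; a real proxy or failover agent runs this kind of check continuously, this only shows the underlying query.

```python
# Minimal sketch: locate the current primary in a PostgreSQL cluster.
# Hostnames, database name, and credentials below are placeholders.
import psycopg2

NODES = ["pg-node-1.example.com", "pg-node-2.example.com"]

def find_primary(nodes):
    """Return the first reachable node that is NOT in recovery (i.e. the primary)."""
    for host in nodes:
        try:
            conn = psycopg2.connect(host=host, dbname="ignition",
                                    user="monitor", password="secret",
                                    connect_timeout=3)
        except psycopg2.OperationalError:
            continue  # node down or unreachable; try the next one
        try:
            with conn.cursor() as cur:
                cur.execute("SELECT pg_is_in_recovery()")
                in_recovery = cur.fetchone()[0]
            if not in_recovery:
                return host  # standbys report True, the primary reports False
        finally:
            conn.close()
    return None

if __name__ == "__main__":
    print("Active primary:", find_primary(NODES))
```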

Yeah, if I had control over the server side of things, I could solve the root problem, which is the database server hardware and software/network setup. But unfortunately I can’t do anything about it. I was hoping the central gateway could be configured to handle at least a few hours of database outage.

Sometimes you just have to leave the problem in the hands of those with the authority to actually fix it.

I'd like to offer some clarification on the 50,000 record size recommendation. That figure appears to have been determined by Support, and I plan on doing some digging to find out how they arrived at it. From my perspective as the Product Lead in this area who overhauled the new system in 8.3, the prior system should absolutely be able to cache more data than 50k records within the HSQL cache. The confusion often stems from the definition of a "record" in the older architecture, which was not a single tag value but rather a batch of data like what @Andre_Hennecke mentioned. For instance, if you have a tag group on an onChange sample mode (which actually runs at a 1-second rate), all tag values that changed during this execution are bundled together and stored as a single record.
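
To put rough numbers on that record-versus-row distinction, here is some back-of-the-envelope arithmetic. The rows-per-record figures are illustrative assumptions, not measurements; the point is how long a 50,000-record disk cache would last at the original 500 rows/second, depending on how many rows land in each record.

```python
# Back-of-the-envelope arithmetic: how long the disk cache lasts before
# dropping data, depending on how many rows are bundled into each record.
MAX_RECORDS = 50_000        # documented upper recommendation for the disk cache
ROWS_PER_SECOND = 500       # the original poster's stated write rate

for rows_per_record in (1, 50, 500):    # illustrative batch sizes
    records_per_second = ROWS_PER_SECOND / rows_per_record
    seconds = MAX_RECORDS / records_per_second
    print(f"{rows_per_record:>3} rows/record -> ~{seconds / 60:.1f} minutes of buffering")
```

The first line matches the "barely 2 minutes" estimate above; batching changes the window dramatically.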


Store & Forward Engine Metrics

Now, let's address the 500 records/second metric, as its meaning depends entirely on where you observed it. On the Store & Forward status page, there's a critical difference in the units used:

  • The store throughput and forwarding throughput metrics are measured in data points/second (i.e., individual rows).
  • The pending and quarantined counts are shown in records (i.e., batches).

If the 500 value you're seeing is from the forwarding throughput, it means your system is only writing an average of 500 rows to the database per second, which isn't a very high rate. Could you clarify where you saw this "500 records/second" value? Knowing the source is key to understanding what's truly happening.


Cache Data Flow and Performance

The actual performance limitation within Store & Forward often comes down to the catch-up rate, which is governed by a specific data flow. All incoming data is first written to an in-memory buffer. This data is then moved to the on-disk HSQL cache when one of two triggers is met: either the batch reaches the store size limit (e.g., 25 records) or the store time limit is exceeded (e.g., 5 seconds). This leads to two levels of what can be considered a "healthy state." While the system is technically healthy any time it forwards data to the final database as fast as it arrives, the optimal state is when data is sent directly from the in-memory buffer. This is significantly faster because it bypasses the overhead of querying the local on-disk cache and deserializing the data.
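
As a purely illustrative sketch of that first hop (the class and trigger names are my own, not Ignition's internals), the two triggers look roughly like this:

```python
import time

class MemoryBufferSketch:
    """Illustrative only: the two flush triggers described above.
    (A real implementation would also flush on a timer, not only when data arrives.)"""

    def __init__(self, flush_to_disk_cache, store_size=25, store_time_s=5.0):
        self.flush_to_disk_cache = flush_to_disk_cache  # writes a batch into the on-disk cache
        self.store_size = store_size                    # flush once this many records accumulate...
        self.store_time_s = store_time_s                # ...or once this much time has passed
        self.records = []
        self.last_flush = time.monotonic()

    def add(self, record):
        self.records.append(record)
        too_many = len(self.records) >= self.store_size
        too_old = time.monotonic() - self.last_flush >= self.store_time_s
        if too_many or too_old:
            self.flush_to_disk_cache(list(self.records))
            self.records.clear()
            self.last_flush = time.monotonic()
```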

However, once data exists in the HSQL cache, the forwarding mechanism prioritizes clearing this backlog. It queries data from the cache from oldest to newest to ensure the historical order is maintained. Each of these queries is limited by the forward write size setting, which dictates the number of rows pulled from the HSQL database to be bundled for writing to your PostgreSQL database. It's also important to clarify that a setting like the max cache size (e.g., 25,000) refers to the total number of data rows in the HSQL database table.
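
A similarly hedged sketch of the forwarding pass (the table name, columns, and write helper are assumptions, written against a generic DB-API connection as a stand-in for the HSQL cache): it pulls the oldest rows first, capped by the forward write size, writes them to the target database, and only then removes them from the cache.

```python
FORWARD_WRITE_SIZE = 250   # max rows pulled from the disk cache per pass (illustrative value)

def forward_pass(cache_conn, write_rows_to_target_db):
    """Drain the on-disk cache oldest-first so historical order is preserved."""
    cur = cache_conn.cursor()
    cur.execute("SELECT id, payload FROM cache_rows ORDER BY id ASC LIMIT ?",
                (FORWARD_WRITE_SIZE,))
    rows = cur.fetchall()
    if not rows:
        return 0  # cache is empty: data can flow straight from the memory buffer
    write_rows_to_target_db([payload for _, payload in rows])  # e.g. a batched INSERT into Postgres
    cur.execute("DELETE FROM cache_rows WHERE id <= ?", (rows[-1][0],))
    cache_conn.commit()
    return len(rows)
```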

A crucial detail is that this on-disk cache is a shared resource. Its total capacity must accommodate both normal pending data and any data that has been quarantined due to a critical failure when forwarding. The limit applies to the combined total of both. For example, if you have 25,000 rows of data in quarantine, the cache is effectively full, and no new pending data can be stored.


PostgreSQL Clustering

I also want to agree with the original recommendation for using a PostgreSQL cluster. While the local cache is a tool for handling temporary disconnections, it's not a substitute for true high availability. If your system needs to tolerate a database being down for an extended period, then a cluster is absolutely the superior architecture. Having another node that can automatically take over and accept writes is far better than relying on the local cache to buffer a large volume of data over a long time.


8.3 Overhaul

It's also crucial to understand that the "record" concept is now legacy. The Store & Forward system underwent a complete architectural shift in 8.3. In this version, the idea of a "record" as a batch was eliminated, and the system now processes data on a one-to-one basis. Given the numerous other performance improvements, I'd highly recommend upgrading when you are ready. The benefits are significant.


Ignition Edge

Finally, it's important not to confuse this behavior with Ignition Edge. On Edge, there is no local HSQL cache for Store & Forward, as data is written directly to the local Edge Historian, which uses a SQLite database. Its 10 million point limit is not a limit of "records" in the batch sense but a hard limit of 10 million individual rows of data.
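
For scale, using the 500 rows/second figure quoted earlier in the thread (which may or may not apply to an Edge node), that 10-million-row limit works out to only a handful of hours of history:

```python
# Simple arithmetic with numbers already in this thread.
rows_limit = 10_000_000     # Edge Historian row limit
rows_per_second = 500       # write rate quoted by the original poster
print(f"~{rows_limit / rows_per_second / 3600:.1f} hours")   # ~5.6 hours
```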
