Ignition Observability

Overview

Introduction / Motivation

Ignition doesn’t run in a vacuum—it sits at the center of controllers, networks, databases, brokers, and other infrastructure and applications that all depend on each other. In IT operations, the way to keep that whole picture healthy is observability: a single place to see correlated traces, metrics, and logs across every layer so you can spot causes and effects quickly. At 4IR, we already monitor the rest of our ecosystem this way, but Ignition was the missing piece. The built-in diagnostics are useful, yet we needed those signals in the same observability suite as everything else. This post shares how we solved that challenge—and how you can too.

Approach

To bring Ignition into our observability stack, we focused on two key areas: metrics and logs.

  • Metrics: Prometheus is an open-source monitoring system that collects performance data (metrics) and makes it easy to graph and alert on. Using the OpenTelemetry Java Agent, we exposed Ignition’s internal metrics (like JVM health, database pool usage, Perspective sessions, and script performance) in a Prometheus-friendly format.

  • Logs: We reconfigured Ignition’s logging to output in structured JSON format. This allowed our log aggregation system to index, search, and correlate Ignition events alongside everything else—making it easy to connect the “what happened” in logs with the “why it happened” in metrics.

Once metrics and logs were available locally in standard formats, we used Grafana Alloy to scrape the Prometheus endpoint, collect the JSON logs, and forward everything to our central observability suite. With all of this data in one place, we could see what “normal” looked like, and set proactive alerts when the health of the ecosystem drifted outside those bounds.

Note: This approach exposes a vast number of metrics—some more useful than others. To stay cost-efficient and avoid noisy dashboards, we filtered out lower-value signals and focused on the ones that truly help us detect issues early.

Dashboard Screenshots





Implementation Details / Tutorial

To get metrics out of Ignition, we first had to tap in. That meant figuring out the best way to expose what Ignition already knows about itself—JVM health, database pools, Perspective sessions, scripts, and more—so our observability stack could scrape and store it.

Metrics

We briefly considered writing a custom Ignition module, but chose not to—mainly to avoid long-term maintenance overhead. Since we run Ignition in containers and deploy with infrastructure-as-code, it made more sense to use existing, well-supported tooling like the Prometheus JMX Exporter or OpenTelemetry Java Instrumentation.

  • Prometheus JMX Exporter – Works well for JVM/runtime telemetry (Garbage Collection, memory, threads, etc.), but it only covers Java internals. To get Ignition-specific metrics, we would have had to build and maintain custom MBeans—too much ongoing work.

  • OpenTelemetry Java Instrumentation – OpenTelemetry (OTel) is an open-source observability framework. Its Java instrumentation agent provides both the JVM/runtime metrics and the Ignition application metrics we cared about. It integrates cleanly with collectors via OTLP, so it plugged right into the dashboards and alerts we already had.

Given the broader coverage, almost no code changes, and room to expand into traces/logs later, we decided OpenTelemetry Java Instrumentation was the clear win for us.

OpenTelemetry Integration:

Once we chose the OTel approach, the next step was integrating it with Ignition. The agent runs as a lightweight Java agent inside the container, exposing a Prometheus endpoint with all the metrics.

Step 1: Add the Java Agent

Extend your custom Ignition image to include the OTel Java agent. In your Dockerfile, download the agent and copy it into the container:

RUN apt-get update && apt-get install -y wget \
    && wget -O opentelemetry-javaagent.jar https://github.com/open-telemetry/opentelemetry-java-instrumentation/releases/download/v2.14.0/opentelemetry-javaagent.jar

# In a multi-stage build (like the linked Dockerfile example), the downloaded jar is
# typically copied into the final Ignition stage with COPY --from=<download-stage>:
COPY opentelemetry-javaagent.jar /mnt/shared-files/lib/

Step 2: Configure Environment Variables

Enable Prometheus metrics export by setting environment variables in the Dockerfile (or in your Docker Compose / Kubernetes manifests):

ENV OTEL_TRACES_EXPORTER=none 
ENV OTEL_METRICS_EXPORTER=prometheus 
ENV OTEL_LOGS_EXPORTER=none 
ENV OTEL_EXPORTER_PROMETHEUS_PORT=9000 
ENV OTEL_EXPORTER_PROMETHEUS_HOST=0.0.0.0 
ENV OTEL_EXPORTER_PROMETHEUS_PATH=/metrics
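These variables can be baked into the image as shown, or supplied at runtime instead. A minimal sketch of the equivalent docker-compose environment: block, with an illustrative service name:

services:
  ignition:                                # service name is illustrative
    environment:
      OTEL_TRACES_EXPORTER: none
      OTEL_METRICS_EXPORTER: prometheus
      OTEL_LOGS_EXPORTER: none
      OTEL_EXPORTER_PROMETHEUS_PORT: "9000"
      OTEL_EXPORTER_PROMETHEUS_HOST: "0.0.0.0"
      OTEL_EXPORTER_PROMETHEUS_PATH: /metrics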

Step 3: Start Ignition with the Agent

The last step is telling Ignition’s JVM to actually start with the Java agent attached. In a docker-compose.yml, this is done by extending the command: section so it includes the -javaagent flag (and the -Dotel.* flags).

      "-javaagent:/mnt/shared-files/lib/opentelemetry-javaagent.jar", 
      "-Dotel.instrumentation.dropwizard-metrics.enabled=true", 
      "-Dotel.instrumentation.jdbc-datasource.enabled=true",

If you’re not running Ignition in Docker or Kubernetes, you can still attach the OpenTelemetry Java agent by updating the wrapper.conf file. Look for the wrapper.java.additional entries and add the following lines (adjust the numbers to be the next available).

wrapper.java.additional.100=-javaagent:/path/to/opentelemetry-javaagent.jar 
wrapper.java.additional.101=-Dotel.instrumentation.dropwizard-metrics.enabled=true 
wrapper.java.additional.102=-Dotel.instrumentation.jdbc-datasource.enabled=true

Linked below is a multi-stage Dockerfile example with the Java agent and custom logging config, plus a docker-compose file to run Ignition with metrics enabled.
Dockerfile (1.1 KB)
docker-compose.yaml (832 Bytes)

Scraping

With the above complete, Ignition exposes a Prometheus-compatible /metrics endpoint on port 9000. From here, a collector, like Grafana Alloy, can scrape the metrics and forward them to your observability suite for dashboards and alerts.
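If you are not using Alloy, any Prometheus-compatible scraper will do. A minimal plain-Prometheus scrape_config sketch, with an illustrative target hostname:

scrape_configs:
  - job_name: "ignition"
    scrape_interval: 15s                       # see the interval recommendation below
    metrics_path: /metrics
    static_configs:
      - targets: ["ignition-gateway:9000"]     # hostname is illustrative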

In our case, we run on Kubernetes and already deploy Alloy via the k8s-monitoring Helm chart. To have Alloy pick up Ignition’s metrics automatically through a ServiceMonitor, we simply enabled the Prometheus Operator objects in our Helm values:

    prometheusOperatorObjects:
      enabled: true

With this setting, Alloy creates the ServiceMonitor resources needed to discover and scrape the /metrics endpoint, then ships the data upstream.

An important configuration to consider is the scrapeInterval parameter, since it directly determines how many samples are collected and stored per series. We recommend 15 seconds, which gives four data points per minute. Lowering the scrape interval further collects and stores considerably more metric data and can also impact performance.

Filtering

By default, the configuration above exports all Ignition and JVM metrics. Some of these are very high-cardinality (lots of unique label values) — for example, metrics with request paths or a separate series for every Perspective session. These are great when you’re deep-diving into debugging, but they quickly become overkill (read: expensive) for day-to-day monitoring.

We found it more effective to filter down to the most useful metrics for operations. If you think other signals should be added (or dropped), please share your experience — this is definitely a place for the community to collaborate.

In Kubernetes, filtering can be handled directly in the ServiceMonitor using metricRelabelings. Rather than filtering out “noisy” metrics, we chose to be explicit about which metrics to keep.

Here's an example ServiceMonitor with metricRelabelings that keeps only the useful metrics:
service-monitor-ignition.yaml (2.4 KB)
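For reference, the general shape of such a ServiceMonitor looks like the sketch below. The label selector, port name, and metric-name regex here are illustrative assumptions; the linked file above contains the actual keep list we use:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: ignition
spec:
  selector:
    matchLabels:
      app: ignition                            # must match the labels on your Ignition Service
  endpoints:
    - port: metrics                            # the named Service port that maps to 9000
      path: /metrics
      interval: 15s                            # the recommended scrape interval
      metricRelabelings:
        - sourceLabels: [__name__]
          regex: "jvm_.*|process_.*|perspective_.*"   # illustrative prefixes only
          action: keep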

If you aren’t using Kubernetes, you can configure the filtering in your config.alloy file.

Ignition Logging

To improve our observability of Ignition, we changed the log output to JSON format. This lets our monitoring systems index and process the log information more effectively.

To achieve this, follow these steps:

  1. Generate a new logback.xml.
    logback.xml (2.7 KB)

    DBAsync is the default configuration used by Ignition for internal logs.
    The JSONConsoleAppender configures the output of Ignition logs in JSON format.

  2. Add the new logback.xml to the Ignition data directory: $IGNITION_HOME/data

  3. To enable our logging system's parser, we configured Ignition to output only the log message by adding the following wrapper configurations to the Ignition start arguments:

      # logging configuration
      "wrapper.console.format=M", 
      "wrapper.logfile.format=M", 
      "wrapper.console.loglevel=INFO",

Ignition will output JSON logs as shown in the example below:
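The exact field names depend on the encoder configured in the linked logback.xml, but each entry comes out as one JSON object per line, roughly like this (values invented for illustration):

{"timestamp": "2025-01-01T12:00:00.000Z", "level": "INFO", "logger_name": "gateway.ExampleLogger", "thread_name": "main", "message": "Example gateway log message"}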

From there, Alloy tails the container stdout and log files, parses each entry as JSON (falling back to regex where needed), enriches it with Kubernetes and other metadata labels, filters out noise, and buffers and batches the stream before shipping it to Grafana Cloud Logs via the Loki endpoint over TLS, authenticated with our tenant and API key.

Integrated Ignition Ecosystem Dashboard

With the above information gathered, we built a unified dashboard designed to bring together all the signals needed to identify issues and understand system behavior. Instead of bouncing between tools, we can see infrastructure, broker, databases, and Ignition health in one place.

Here are the sections of the dashboard and some details about each one:
Key Metrics Overview:

  • CPU%, Memory%, Disk usage, Perspective sessions, DB ops/s, Uptime, SSL cert expiry.

  • Answers “is the gateway healthy right now?” with color thresholds and 1-hour sparklines.

System Overview:

  • Threads (total/running/blocked) to spot contention.

  • JVM heap vs max + G1 pools (Eden/Old/Survivor) and GC pauses for memory pressure.

  • Container I/O/CPU/RSS and network RX/TX to separate JVM issues from cgroup or node constraints.

Database Metrics:

  • Throughput (ops/s) & connections per DB.

  • Optional latency/p95 if exposed; correlates with disk and network to explain slowdowns.

Perspective Metrics:

  • Sessions & pages (server load) and client views/components (front-end activity).

  • Spikes here should align with CPU and script execution.

Scripts Metrics:

  • Execution rate by script/project, Top slow projects, and p95 duration from histograms.

  • Surfaces heavy or regressing scripts without exploding cardinality.

Gateway Network Metrics:

  • Message rate (incoming/outgoing) beside container network bytes/sec to detect external bottlenecks.

  • Connection churn/restarts for stability signals.

Logs:

  • JSON logs via Alloy with quick links from panels; error/warn rate and recent exceptions.

  • Used for drill-downs when a metric crosses a threshold.

RED Metrics (Rate - Errors - Duration):

  • Rate: HTTP requests/sec (RPS) handled by the Gateway/Ingress.

  • Errors: 5xx error rate and ratio (5xx ÷ total). Track 4xx separately for client issues.

  • Duration: p95/p99 server request latency from request-duration histograms.

MQTT Metrics:

  • Queue depth/in-flight, publish/receive rate, dropped messages, connected clients.

  • Plotted next to Gateway message rate to confirm whether issues are upstream (MQTT) or inside Ignition.

Wrapping up:

With metrics, logs, and scraping in place, Ignition becomes a first-class citizen in our observability stack. That means:

  • Unified dashboards – We can see Ignition health right alongside databases, brokers, Kubernetes, and network infrastructure. Golden signals (latency, errors, throughput, saturation) sit next to Ignition-specific metrics like Perspective sessions, database pool usage, and script performance.

  • Actionable alerts – Symptom-based alerts proactively flag issues and link directly to dashboards.

  • Scalability – Whether you’re running a single gateway or hundreds worldwide, the same pattern applies.

  • Faster troubleshooting – The unification of telemetry makes it easy to connect “what happened” with “why” without logging into each gateway.

We’ve shared our setup and example configs to help others get started, but this is just one path. Ignition exposes a lot of signals, and every project is different.

We’d love to hear how others in the community are handling observability for Ignition.

  • Which metrics do you find most valuable?

  • What proactive alerts are most useful to catch issues before they cause downtime?

The more we share, the stronger the playbook becomes for everyone running Ignition in production.

Acknowledgments

Special thanks to The Kevin Collins, whose brilliant ideas, wise counsel, and seemingly endless depth of knowledge helped shape this project.

Example Files

Here are some example files to help you get started on this approach.

- This Dockerfile builds a custom Ignition image that bundles the OpenTelemetry Java agent and logging configs, enabling Prometheus metrics export on port 9000.
Dockerfile-Ignition-Metrics (1.1 KB)

- A simple example Docker Compose file that runs an Ignition gateway with the OpenTelemetry agent enabled, exposing metrics on port 9000.
docker-compose-custom-otel-2.yaml (832 Bytes)

- We use this file to define the format of the logs Ignition exports.
logback.xml (2.7 KB)

- This is the list of metrics we find most useful.
all-metrics-list.txt (5.5 KB)

- This file configures the Kubernetes ServiceMonitor, keeping only the metrics of interest.
service-monitor-ignition.yaml (2.4 KB)

- This importable JSON file defines how the dashboard looks in Grafana.
dashboard-grafana.json (107.4 KB)
