Observability
As you build and operate applications in OpenShift, observability becomes one of your most valuable tools.
You’ll use metrics to scale applications with HPAs, inspect logs to debug crashes, and rely on probes and events to understand how deployments are behaving. Observability isn’t just a feature — it’s how you’ll understand what’s happening inside your workloads, at every stage of their lifecycle.
This module takes a closer look at the tools OpenShift provides to help you monitor and troubleshoot your own applications, not just the platform.
We’ll focus on two key systems:
- User Workload Monitoring — for gathering custom application metrics and using them to power dashboards, alerts, and autoscalers
- User Workload Logging — for collecting and querying logs from your workloads at scale
By the end of this module, you’ll know how to instrument your apps, access metrics and logs from the web console or CLI, and build a more observable system from day one.
User Workload Monitoring
OpenShift includes a built-in monitoring stack powered by Prometheus, Alertmanager, and Grafana. This stack collects platform-level metrics — like node health, etcd performance, and API server activity — and stores them in a highly available fashion.
By default, this stack is focused on monitoring OpenShift itself. But what about your own apps?
That’s where User Workload Monitoring comes in.
What Is It?
User Workload Monitoring is an OpenShift feature that lets you scrape, store, and query Prometheus-style metrics from applications running in your own projects.
It gives you access to:
- A dedicated Prometheus instance for user-defined workloads
- A ServiceMonitor or PodMonitor interface to tell Prometheus what to scrape
- The ability to create alerts and visualize metrics using Grafana or the OpenShift web console
How to Enable It
User workload monitoring is disabled by default. You can enable it by modifying the cluster-monitoring-config ConfigMap:
oc edit configmap cluster-monitoring-config -n openshift-monitoring
Add or update the following:
data:
  config.yaml: |
    enableUserWorkload: true
This enables a second Prometheus stack in the openshift-user-workload-monitoring namespace.
This step usually requires cluster-admin privileges. You’ll also need to wait a few minutes for the new Prometheus instance to come online.
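Once the change is saved, you can confirm that the new stack is coming up by listing the pods in that namespace:

oc -n openshift-user-workload-monitoring get pods

Within a few minutes you should see pods such as prometheus-user-workload-0 in a Running state; if they never appear, re-check the ConfigMap edit above.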
How to Expose Your Metrics
To expose metrics from your app, you’ll need:
- An HTTP /metrics endpoint (in Prometheus format)
- A Service that selects your pods
- A ServiceMonitor or PodMonitor resource that tells Prometheus where to scrape
Example ServiceMonitor:
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-app-monitor
  labels:
    release: prometheus
spec:
  selector:
    matchLabels:
      app: my-app
  namespaceSelector:
    matchNames:
      - my-project
  endpoints:
    - port: metrics
      interval: 30s
This tells Prometheus to scrape the metrics port of any Service labeled app=my-app in the my-project namespace, every 30 seconds.
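The ServiceMonitor only works if a matching Service exists and exposes a port named metrics. A minimal sketch of such a Service follows; the port numbers are assumptions and should match whatever your application actually listens on:

apiVersion: v1
kind: Service
metadata:
  name: my-app
  namespace: my-project
  labels:
    app: my-app          # matched by the ServiceMonitor's selector
spec:
  selector:
    app: my-app          # selects the application pods
  ports:
    - name: metrics      # the port name referenced by the ServiceMonitor endpoint
      port: 8080         # assumed service port
      targetPort: 8080   # assumed container port serving /metrics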
Where to View Metrics
Once metrics are flowing, you can query them using:
- The OpenShift web console → Observe → Metrics
- The Prometheus UI exposed via oc -n openshift-user-workload-monitoring port-forward svc/prometheus-user-workload 9090
- A connected Grafana instance (if installed)
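For example, once your metrics are being scraped, a query like the following in the Metrics view would show request throughput for the project (http_requests_total is a placeholder name; use whatever counters your application actually exports):

sum(rate(http_requests_total{namespace="my-project"}[5m]))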
You can also use these metrics to power:
- Horizontal Pod Autoscalers (HPAs), as sketched below
- Custom alerts with Alertmanager
- Dashboards for internal or external users
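Scaling on application metrics requires an adapter that publishes them through the Kubernetes custom metrics API, such as the Custom Metrics Autoscaler (KEDA) Operator or a Prometheus adapter. Assuming such an adapter is in place and exposes a per-pod metric named http_requests_per_second (both the adapter setup and the metric name are assumptions), an HPA could look like this sketch:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app-hpa
  namespace: my-project
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second    # assumed custom metric served by the adapter
        target:
          type: AverageValue
          averageValue: "50"                # scale out above roughly 50 requests/s per pod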
User workload metrics are stored separately from platform metrics — and you are responsible for maintaining their signal quality. Avoid high cardinality, unbounded labels, or overly frequent scrapes.
User Workload Logging
Metrics give you the big picture — but logs are how you zoom in on specific events, requests, or failures. In OpenShift, you can enable a logging stack that collects, stores, and makes searchable the logs from your application containers.
This is especially useful when:
- You want to debug applications without exec’ing into a pod
- You want to correlate logs across multiple pods or services
- You need to store logs longer than a container’s lifetime
- Your developers or support teams need log access without cluster-admin rights
What Is User Workload Logging?
OpenShift’s modern logging architecture is built around the Vector log collector, which gathers logs from all nodes and routes them to one or more backends, such as:
- Loki — a horizontally scalable log store designed for Kubernetes
- Elasticsearch — a traditional full-text log indexing engine
- (Optional) External log stores — S3, Splunk, Kafka, etc.
These logs can be accessed directly through:
- The OpenShift Console → Observe → Logs
- The Grafana Loki UI, if exposed
Enabling Logging for Workloads
To collect logs from application namespaces, follow these steps:
- Install the OpenShift Logging Operator via Operators → OperatorHub (or via a Subscription, as sketched after this list)
- Install the Loki Operator (required if using Loki as your log store)
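If you manage installations declaratively, the same operators can be installed with Subscription resources instead of clicking through OperatorHub. The sketch below covers the logging operator only and assumes the openshift-logging namespace and a matching OperatorGroup already exist; confirm the current channel name in OperatorHub before applying it:

apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: cluster-logging
  namespace: openshift-logging
spec:
  channel: stable                    # confirm the current channel in OperatorHub
  name: cluster-logging              # package name of the OpenShift Logging Operator
  source: redhat-operators           # catalog source shipped with OpenShift
  sourceNamespace: openshift-marketplace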
Then, deploy a LokiStack instance:
apiVersion: loki.grafana.com/v1
kind: LokiStack
metadata:
  name: logging-loki
  namespace: openshift-logging
spec:
  size: 1x.extra-small
  storage:
    schemas:
      - version: v12
        effectiveDate: "2022-06-01"
    secret:
      name: logging-loki-s3
      type: s3
  tenants:
    mode: openshift-logging
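The storage section above refers to a secret named logging-loki-s3 that holds your object storage credentials. A sketch of what that secret might contain is shown below — the key names follow the Loki Operator's s3 convention, but check the documentation for your operator version, and every value here is a placeholder:

apiVersion: v1
kind: Secret
metadata:
  name: logging-loki-s3
  namespace: openshift-logging
stringData:
  access_key_id: <YOUR_ACCESS_KEY>        # placeholder
  access_key_secret: <YOUR_SECRET_KEY>    # placeholder
  bucketnames: loki-logs                  # placeholder bucket name
  endpoint: https://s3.example.com        # placeholder S3 endpoint
  region: us-east-1                       # placeholder region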
Finally, configure log forwarding using a ClusterLogForwarder resource:
apiVersion: logging.openshift.io/v1
kind: ClusterLogForwarder
metadata:
  name: instance
  namespace: openshift-logging
spec:
  inputs:
    - name: application-logs
      application:
        namespaces: ["my-project"]
  outputs:
    - name: default
      type: loki
      url: http://logging-loki-gateway.openshift-logging.svc:8080
  pipelines:
    - inputRefs: [application-logs]
      outputRefs: [default]
This setup tells Vector to collect logs from the my-project namespace and forward them to your LokiStack instance.
Log collection is namespace-aware. You must explicitly include the namespaces you want to collect from in the ClusterLogForwarder's inputs.
Accessing Logs
You can access logs directly from the OpenShift Console:
- Go to Observe → Logs
- Filter by project, pod, container, or log level
- Search or tail logs in real time without shell access
If desired, you can also expose and use the Grafana Loki UI for advanced log querying.
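Both the console's log view and Grafana query Loki using LogQL under the hood. As a sketch, a query for error lines from a single project might look like the following; the label name kubernetes_namespace_name reflects how the OpenShift Loki stack commonly labels application logs, so verify it against the labels visible in your environment:

{ kubernetes_namespace_name="my-project" } |= "error"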
Best Practices
- Use structured logs (JSON is strongly preferred)
- Limit high-volume debug output in production
- Avoid logging sensitive information (secrets, tokens)
- Always log to stdout and stderr — not to local files
Enabling user workload logging gives you durable, searchable insight into your applications — making it easier to diagnose issues, investigate incidents, and support multi-team environments.
Optional: Network Observability Operator
If you want to go beyond metrics for CPU, memory, and custom application data, OpenShift also provides network-level observability using the Network Observability Operator.
This operator enables flow-based network monitoring — giving you visibility into traffic between namespaces, pods, services, external endpoints, and even dropped packets.
It is especially useful for:
- Troubleshooting network latency or traffic anomalies
- Understanding who is talking to whom inside your cluster
- Identifying unexpected or unauthorized external traffic
- Visualizing data flow between workloads
Once installed, it collects network flows using the eBPF-based Flow Collector and surfaces them in the OpenShift Console under Observe → Network Traffic.
You can:
- Filter flows by namespace, pod, protocol, or direction
- View conversations (source/destination pairs) and traffic rates
- Export flow logs to a remote system for long-term analysis
To install the operator:
- Go to Operators → OperatorHub in the OpenShift Console
- Search for Network Observability Operator
- Install it into the netobserv namespace
- Accept the default configuration (or customize as needed)
Once installed, navigate to Observe → Network Traffic to explore live and historical traffic flows.
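Note that flows generally appear only after a FlowCollector resource exists. A minimal sketch is shown below; the API version and exact field names vary between operator releases, so treat it as an outline rather than a drop-in manifest:

apiVersion: flows.netobserv.io/v1beta2   # API version depends on the operator release
kind: FlowCollector
metadata:
  name: cluster                          # the operator expects a single instance named cluster
spec:
  namespace: netobserv                   # namespace where the flow collection components run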
The Network Observability Operator is powerful for debugging network issues, enforcing policies, and gaining visibility into production traffic patterns. It complements Prometheus-based monitoring by showing how data moves, not just how applications behave.
Optional: Distributed Tracing
For workloads that span multiple services, logs and metrics may not be enough. That’s where distributed tracing comes in.
Tracing lets you see how a single request flows across your entire system — including timing information for each hop. This is invaluable for debugging performance issues, latency bottlenecks, or failed transactions.
In OpenShift, you can deploy Grafana Tempo, a lightweight, scalable tracing backend designed for cloud-native environments.
Deploying Tempo via the Tempo Operator
If you’re already using Grafana and Loki, Tempo integrates seamlessly and allows you to correlate logs and traces using shared trace IDs.
To get started:
- Install the Tempo Operator from OperatorHub
- Deploy a TempoStack resource
- Configure your applications to send traces using OpenTelemetry-compatible SDKs or exporters
Example:
apiVersion: tempo.grafana.com/v1alpha1
kind: TempoStack
metadata:
  name: tempo
  namespace: openshift-monitoring
spec:
  storage:
    type: memory
  replicas: 1
This creates a minimal, in-cluster Tempo setup suitable for development and testing.
Instrumenting Applications
Your applications need to be instrumented to generate and export trace data.
This is typically done using:
- OpenTelemetry SDKs for languages like Go, Java, Python, JavaScript, etc.
- Libraries or frameworks that support tracing headers and propagation
- Sidecars or ingress layers that forward trace headers (in service mesh scenarios)
Instrumentation includes:
- Creating spans for each operation
- Tagging spans with metadata (e.g., endpoint, status, duration)
- Forwarding trace data to Tempo via OTLP, as sketched below
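Most OpenTelemetry SDKs pick up their export target from standard environment variables, so pointing a workload at Tempo is often just a small Deployment change. The service hostname and port below are assumptions (look up the actual OTLP endpoint your TempoStack exposes); only the variable names are standard:

# Fragment of a Deployment's container spec (sketch)
env:
  - name: OTEL_EXPORTER_OTLP_ENDPOINT                                # standard OpenTelemetry SDK variable
    value: "http://tempo-distributor.openshift-monitoring.svc:4318"  # assumed OTLP/HTTP endpoint for the Tempo instance above
  - name: OTEL_SERVICE_NAME                                          # how this workload is named in traces
    value: "my-app"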
Distributed tracing adds critical visibility into the flow of requests across services — but only works if trace data is generated and exported by your workloads. Start with a single service and expand as you gain confidence.
Knowledge Check
- What is the difference between platform monitoring and user workload monitoring in OpenShift?
- How do you expose Prometheus metrics from your application to the OpenShift monitoring stack?
- What are ServiceMonitor and PodMonitor, and when would you use each?
- How can you view user workload metrics from the OpenShift web console?
- What is the role of Vector in OpenShift’s logging stack?
- How do you configure which application logs are collected by OpenShift?
- Why is it a bad practice to write logs to files inside the container?
- What kinds of issues is the Network Observability Operator designed to help identify?
- How does distributed tracing differ from traditional logging?
- What are some ways to instrument an application to produce trace data?