MyObservability

Distributed Tracing

Distributed tracing is a technique used to monitor and observe requests as they flow through different services or components in a distributed system, such as microservices architectures. It helps to track the journey of a single request as it propagates through multiple interconnected services, offering insights into the performance, latency, and potential bottlenecks.

Examples:

- Dynatrace: PurePath
- AppDynamics: Business Transactions
- Datadog: APM Traces
- Splunk Observability Cloud: APM Traces

Context Propagation

To maintain visibility into the flow of requests, distributed tracing uses context propagation to pass tracing information (such as the trace ID and span ID) between services, typically in request headers. This allows downstream services to continue the same trace instead of starting a new one.

Examples:

- Dynatrace: Context Propagation
- AppDynamics: Correlation Context
- Datadog: Trace Context
- Splunk Observability Cloud: Trace Context Propagation
- New Relic: Distributed Trace Context
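
The hand-off described above can be sketched in a few lines. This is a minimal illustration, assuming the W3C Trace Context `traceparent` header layout (`version-traceid-parentid-flags`); the helper names `inject`, `extract`, and `new_span_id` are hypothetical, and the sketch skips parts of the specification such as rejecting all-zero IDs.

```python
import re
import secrets

# Assumed header layout (W3C Trace Context): version-traceid-parentid-flags
TRACEPARENT_RE = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


def new_span_id() -> str:
    """Return a random 8-byte span ID as 16 hex characters."""
    return secrets.token_hex(8)


def inject(headers: dict, trace_id: str, span_id: str, sampled: bool = True) -> None:
    """Write the current trace context into outgoing request headers."""
    flags = "01" if sampled else "00"
    headers["traceparent"] = f"00-{trace_id}-{span_id}-{flags}"


def extract(headers: dict):
    """Read trace context from incoming headers; return None if absent or malformed."""
    match = TRACEPARENT_RE.match(headers.get("traceparent", ""))
    if match is None:
        return None
    _version, trace_id, parent_span_id, flags = match.groups()
    return {
        "trace_id": trace_id,
        "parent_span_id": parent_span_id,
        "sampled": (int(flags, 16) & 1) == 1,
    }


# Service A starts a trace and calls service B.
trace_id = secrets.token_hex(16)
span_a = new_span_id()
outgoing = {}
inject(outgoing, trace_id, span_a)

# Service B continues the same trace with a new span whose parent is span_a.
ctx = extract(outgoing)
span_b = new_span_id()
print(f"trace {ctx['trace_id']}: span {span_b} has parent {ctx['parent_span_id']}")
```

Service B keeps the trace ID it received and only generates a new span ID, which is what lets a backend stitch both spans into one trace.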

Service Dependency Mapping or Topology Mapping

Service Dependency Mapping or Topology Mapping provides a visual overview of the relationships between different services and components in a distributed system. It helps with understanding dependencies, identifying bottlenecks, and diagnosing performance issues across the architecture.

Examples:

- Dynatrace: Smartscape
- AppDynamics: Flow Map
- Datadog: Service Map
- Splunk Observability Cloud: Service Dependency Map or Infrastructure Navigator
- New Relic: Service Map or Entity Explorer
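
One way such a map could be derived is from trace data itself: whenever a span's parent belongs to a different service, that is a call edge. The sketch below is illustrative only; the span tuples and the `build_service_map` helper are hypothetical and do not reflect how any of the listed products build their maps.

```python
from collections import Counter

# Hypothetical span records: (span_id, parent_span_id, service_name)
spans = [
    ("a1", None, "frontend"),
    ("b1", "a1", "checkout"),
    ("c1", "b1", "payments"),
    ("c2", "b1", "inventory"),
    ("b2", "a1", "checkout"),
]


def build_service_map(spans):
    """Count caller -> callee edges wherever a span's parent lives in another service."""
    service_of = {span_id: service for span_id, _parent, service in spans}
    edges = Counter()
    for _span_id, parent_id, service in spans:
        parent_service = service_of.get(parent_id)
        if parent_service is not None and parent_service != service:
            edges[(parent_service, service)] += 1
    return edges


service_map = build_service_map(spans)
for (caller, callee), calls in sorted(service_map.items()):
    print(f"{caller} -> {callee} ({calls} calls)")
```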

Trace Snapshot or Trace Detail

A Trace Snapshot or Trace Detail captures detailed information about a specific transaction or request at a granular level, including all the interactions and spans within a distributed trace. It essentially provides a "snapshot" of all the steps involved in processing a request for diagnostic purposes.

Examples:

- AppDynamics: Snapshot
- Dynatrace: PurePath
- Datadog: Trace View or Trace Details
- Splunk Observability Cloud: Trace Detail or Span Analytics
- New Relic: Transaction Trace
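
To make the idea concrete, here is a minimal sketch of how the spans of one trace could be rendered as an indented tree with durations. The span tuples and the `render_trace` helper are hypothetical, chosen only for illustration.

```python
from collections import defaultdict

# Hypothetical spans from one trace: (span_id, parent_id, name, start_ms, end_ms)
trace = [
    ("s1", None, "GET /checkout", 0, 120),
    ("s2", "s1", "auth.verify", 5, 20),
    ("s3", "s1", "db.query orders", 25, 110),
    ("s4", "s3", "cache.lookup", 26, 30),
]


def render_trace(spans):
    """Return one indented line per span, with children listed under their parent."""
    children = defaultdict(list)
    for span in spans:
        children[span[1]].append(span)

    lines = []

    def walk(parent_id, depth):
        # Visit siblings in start-time order so the output reads chronologically.
        for span_id, _parent, name, start, end in sorted(
            children[parent_id], key=lambda s: s[3]
        ):
            lines.append(f"{'  ' * depth}{name} [{end - start} ms]")
            walk(span_id, depth + 1)

    walk(None, 0)
    return lines


print("\n".join(render_trace(trace)))
```

In this made-up trace the database query accounts for most of the request time, which is the kind of observation a trace detail view is meant to surface.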

Error or Exception Tracking

Error or Exception Tracking refers to the process of detecting, collecting, and analyzing errors that occur in an application or system. This includes capturing exceptions, failures, or unexpected behaviors in real time, helping developers and operations teams understand the root causes of issues. Effective error tracking tools provide detailed insights such as stack traces, error rates, and service dependencies, enabling faster debugging and resolution of application problems.

Examples:

- Dynatrace: Problem Detection & Analysis
- AppDynamics: Error Analytics
- Datadog: Error Tracking
- Splunk Observability Cloud: Error Analytics
- New Relic: Error Analytics
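
A rough sketch of the capture-and-group step follows. It assumes errors are grouped by exception type plus the function and line that raised them; that grouping rule and the `track_error` helper are assumptions for illustration, not the behavior of any listed tool.

```python
import traceback
from collections import Counter

error_counts = Counter()  # occurrences per error group
error_samples = {}        # one stack trace kept per group


def track_error(exc: BaseException) -> str:
    """Group an exception by type and raising location, and keep a sample stack trace."""
    frames = traceback.extract_tb(exc.__traceback__)
    location = f"{frames[-1].name}:{frames[-1].lineno}" if frames else "unknown"
    key = f"{type(exc).__name__}@{location}"
    error_counts[key] += 1
    error_samples.setdefault(
        key,
        "".join(traceback.format_exception(type(exc), exc, exc.__traceback__)),
    )
    return key


def parse_quantity(raw: str) -> int:
    return int(raw)  # raises ValueError on non-numeric input


for raw in ["3", "abc", "7", "x1"]:
    try:
        parse_quantity(raw)
    except ValueError as exc:
        track_error(exc)

print(error_counts.most_common())
```

Both failures land in the same group because they share a type and a raising location, so the counter reports one error seen twice rather than two unrelated errors.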

Alerts or Threshold-Based Monitoring

Alerts or Threshold-Based Monitoring is the capability of observability tools to monitor predefined metrics or conditions and automatically trigger notifications when thresholds are breached or anomalies are detected. This allows operations teams to respond quickly to critical issues (e.g., high error rates, performance degradation) before they impact users. Real-time alerts can be sent via various channels, such as email, SMS, or incident management systems, and are crucial for proactive system management and minimizing downtime.

Examples:

- Dynatrace: Davis AI Alerts
- AppDynamics: Health Rules and Alerting
- Datadog: Monitors and Alerts
- Splunk Observability Cloud: Alerts
- New Relic: Alerts & Applied Intelligence
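
The threshold logic can be sketched as a sliding window over recent request outcomes. This is a simplified illustration: the `ErrorRateAlert` class is hypothetical, and `notify` stands in for whatever channel (email, SMS, incident management) would deliver the alert.

```python
from collections import deque


class ErrorRateAlert:
    """Fire when the error rate over the last `window` requests exceeds `threshold`."""

    def __init__(self, threshold: float, window: int, notify):
        self.threshold = threshold
        self.outcomes = deque(maxlen=window)  # True means the request failed
        self.notify = notify
        self.firing = False

    def record(self, failed: bool) -> None:
        self.outcomes.append(failed)
        if len(self.outcomes) < self.outcomes.maxlen:
            return  # not enough data yet
        rate = sum(self.outcomes) / len(self.outcomes)
        if rate > self.threshold and not self.firing:
            self.firing = True
            self.notify(f"error rate {rate:.0%} exceeded {self.threshold:.0%}")
        elif rate <= self.threshold:
            self.firing = False  # recovered; allow the alert to fire again


sent = []
alert = ErrorRateAlert(threshold=0.2, window=10, notify=sent.append)
for failed in [False] * 10 + [True] * 3:
    alert.record(failed)
print(sent)
```

The `firing` flag keeps the alert from re-sending on every request while the condition persists, a common way to avoid notification floods.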

SLOs and SLIs (Service-Level Objectives and Indicators)

Service-Level Objectives (SLOs) are specific, measurable goals set for the performance and reliability of a service, typically defined in terms of availability, latency, or error rates. Service-Level Indicators (SLIs) are the metrics used to measure whether a service is meeting those objectives. Together, SLOs and SLIs help organizations ensure that they are delivering the expected level of service to users and can be used to monitor compliance with Service-Level Agreements (SLAs). They provide clear targets for maintaining system reliability and user satisfaction.

Examples:

- Dynatrace: SLOs (Service-Level Objectives) Monitoring
- AppDynamics: Service-Level Management
- Datadog: SLO Monitoring
- Splunk Observability Cloud: SLO Monitoring
- New Relic: SLO/Reliability Monitoring
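
As a worked example, an availability SLI can be computed as the fraction of successful requests and then compared against the SLO target. The numbers below are made up, and the "error budget" calculation is a commonly used companion idea that the text above does not itself define.

```python
def availability_sli(good_requests: int, total_requests: int) -> float:
    """SLI: fraction of requests that were served successfully."""
    return good_requests / total_requests


def error_budget_remaining(sli: float, slo: float) -> float:
    """Fraction of the error budget still unspent (negative once the SLO is violated)."""
    allowed_failure = 1.0 - slo
    actual_failure = 1.0 - sli
    return 1.0 - actual_failure / allowed_failure


# Hypothetical month: 1,000,000 requests, 400 failures, SLO target of 99.9%.
sli = availability_sli(good_requests=999_600, total_requests=1_000_000)
remaining = error_budget_remaining(sli, slo=0.999)
print(f"SLI = {sli:.4%}, SLO met = {sli >= 0.999}, budget left = {remaining:.0%}")
```

With a 99.9% target, 1,000 failures per million requests are tolerated; 400 failures use about 40% of that allowance, leaving roughly 60%.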

Anomaly Detection

Anomaly Detection is the process of identifying unusual patterns or behaviors in system performance metrics or logs that deviate from the expected norm. By using machine learning or statistical models, observability tools can detect performance issues, failures, or security breaches in real time, even if predefined thresholds aren’t breached. This helps organizations catch issues early, often before they escalate into larger problems affecting users.
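
A very simple statistical model is a z-score against a trailing baseline: a point is flagged when it sits too many standard deviations from recent history. The sketch below uses that approach; the window size, the threshold of 3, and the `find_anomalies` helper are illustrative assumptions, and real tools typically use more sophisticated models.

```python
from statistics import mean, stdev


def find_anomalies(values, baseline_size=20, z_threshold=3.0):
    """Flag points more than z_threshold standard deviations from the trailing baseline."""
    anomalies = []
    for i in range(baseline_size, len(values)):
        baseline = values[i - baseline_size:i]
        mu, sigma = mean(baseline), stdev(baseline)
        if sigma > 0 and abs(values[i] - mu) / sigma > z_threshold:
            anomalies.append((i, values[i]))
    return anomalies


# Hypothetical latency series in ms: steady around 100, with one spike at index 25.
latencies = [100 + (i % 5) for i in range(30)]
latencies[25] = 180
print(find_anomalies(latencies))
```

No fixed latency threshold is configured anywhere here; the spike is flagged only because it is far outside what the preceding 20 samples looked like.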

Log Management

Log Management involves the aggregation, storage, and analysis of log data generated by applications, services, and infrastructure. Logs capture detailed, timestamped records of events occurring within a system, such as error messages, database queries, and transaction failures. Effective log management allows operators to search, analyze, and correlate log data in real time to diagnose issues, monitor security, and track application behavior. Logs are critical for root cause analysis and troubleshooting.

Examples:

- Dynatrace: Log Monitoring
- AppDynamics: Log Analytics
- Datadog: Log Management
- Splunk Observability Cloud: Log Observer
- New Relic: Log Management
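
The search-and-correlate step can be sketched with structured (JSON) logs. The log lines, field names such as `trace_id`, and the `parse_logs` and `search` helpers below are all hypothetical; they only illustrate how a shared identifier lets logs from different services be pulled together.

```python
import json

# Hypothetical JSON log lines, one event per line, as an application might emit them.
raw_logs = """\
{"ts": "2024-05-01T10:00:00Z", "level": "INFO", "service": "checkout", "trace_id": "t1", "msg": "order received"}
{"ts": "2024-05-01T10:00:01Z", "level": "ERROR", "service": "payments", "trace_id": "t1", "msg": "card declined"}
{"ts": "2024-05-01T10:00:02Z", "level": "INFO", "service": "checkout", "trace_id": "t2", "msg": "order received"}
not-json garbage line
"""


def parse_logs(text):
    """Parse newline-delimited JSON logs, skipping lines that are not valid JSON."""
    events = []
    for line in text.splitlines():
        try:
            events.append(json.loads(line))
        except json.JSONDecodeError:
            continue
    return events


def search(events, **filters):
    """Return events whose fields match every given filter exactly."""
    return [e for e in events if all(e.get(k) == v for k, v in filters.items())]


events = parse_logs(raw_logs)
errors = search(events, level="ERROR")
# Correlate: pull every log line that shares a trace ID with an error.
related = [e for err in errors for e in search(events, trace_id=err["trace_id"])]
for e in related:
    print(e["ts"], e["service"], e["level"], e["msg"])
```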

Root Cause Analysis (RCA)

Root Cause Analysis (RCA) is the process of identifying the fundamental cause of a problem or performance issue in a system. By correlating metrics, logs, traces, and events, observability tools help engineers pinpoint the exact component or service responsible for an issue, enabling faster resolution. RCA is essential for reducing the time to identify and fix critical issues, ensuring system reliability and preventing recurring problems.

Examples:

- Dynatrace: AI-Powered Root Cause Analysis
- AppDynamics: Root Cause Diagnostics
- Datadog: Root Cause Detection
- Splunk Observability Cloud: Root Cause Explorer
- New Relic: Root Cause Analysis
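
One simple heuristic for pinpointing the responsible component from trace data is to look for the deepest failing span: an erroring span none of whose children also errored. The sketch below assumes that heuristic; the span data and `root_cause_candidates` helper are hypothetical, and the products listed above combine many more signals.

```python
# Hypothetical spans from one failed trace: span_id -> (parent_id, service, has_error)
spans = {
    "s1": (None, "frontend", True),
    "s2": ("s1", "checkout", True),
    "s3": ("s2", "payments", True),
    "s4": ("s2", "inventory", False),
    "s5": ("s3", "card-gateway", False),
}


def root_cause_candidates(spans):
    """Return erroring spans that have no erroring child: where the failure likely began."""
    parents_of_errors = {parent for parent, _svc, err in spans.values() if err}
    return sorted(
        span_id
        for span_id, (_parent, _svc, err) in spans.items()
        if err and span_id not in parents_of_errors
    )


for span_id in root_cause_candidates(spans):
    print(f"likely origin: {spans[span_id][1]} (span {span_id})")
```

Here the error surfaces in `frontend` and `checkout` only because it propagated upward from `payments`, so `payments` is the place to start investigating.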

Sampling

Collecting every metric, or a trace for every transaction, is resource intensive. For each traced operation you create a new span and record its start time, end time, attributes, propagated context, and possibly events or exceptions.

- This generates in-memory data.
- This data needs to be batched.
- This data needs to be exported to backends (e.g., Jaeger, Tempo, Datadog).

As a result, tracing every request can:

- Consume CPU (building spans)
- Consume memory (storing spans)
- Consume network bandwidth (exporting spans)
- Slow down your app if not carefully managed

Instead of recording every single request, you can record only some requests and drop the rest to save resources. You can use any of the following sampling types in OpenTelemetry:

| Type | Meaning |
|------|---------|
| AlwaysOnSampler | Sample (record) every trace (default for development). |
| AlwaysOffSampler | Never sample traces (only manually start them). |
| TraceIdRatioBasedSampler | Sample a percentage (e.g., 10%, 1%) based on Trace ID. |
| ParentBasedSampler | Follow the parent span’s sampling decision. (If parent is sampled, child is too.) |
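
To show how ratio-based and parent-based decisions can work, here is a plain-Python sketch. It is not the OpenTelemetry SDK: the `ratio_sampler` and `parent_based` helpers are hypothetical, and comparing the low 64 bits of the trace ID against a bound is an assumption about how such a sampler can be made deterministic.

```python
import random

TRACE_ID_LIMIT = 1 << 64  # assumption: decide using only the low 64 bits of the trace ID


def ratio_sampler(ratio: float):
    """Return a decision function that keeps roughly `ratio` of traces, keyed on trace ID."""
    bound = round(ratio * TRACE_ID_LIMIT)

    def should_sample(trace_id: int) -> bool:
        return (trace_id & (TRACE_ID_LIMIT - 1)) < bound

    return should_sample


def parent_based(root_sampler):
    """Follow the parent's decision when there is one; otherwise ask root_sampler."""

    def should_sample(trace_id: int, parent_sampled=None) -> bool:
        if parent_sampled is not None:
            return parent_sampled
        return root_sampler(trace_id)

    return should_sample


rng = random.Random(42)  # fixed seed so the demo is repeatable
trace_ids = [rng.getrandbits(128) for _ in range(10_000)]

ten_percent = ratio_sampler(0.10)
kept = sum(ten_percent(t) for t in trace_ids)
print(f"kept {kept} of {len(trace_ids)} traces")
```

Because the decision depends only on the trace ID, every service that sees the same trace reaches the same answer without coordinating, which keeps sampled traces complete rather than fragmentary.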

Because of sampling, critical traces (such as failed transactions) are sometimes missed, so consider "recording sampled spans on demand": you can programmatically override the sampling decision and force a span to be recorded when an error is detected.
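
One way to picture this override is to postpone the keep-or-drop decision until the request finishes, then always keep requests that contain an error. The sketch below takes that approach in plain Python; the `SpanBuffer` class is hypothetical and is not an OpenTelemetry API. Note that it holds spans in memory until the request ends, so it trades some memory for the guarantee that failures are kept.

```python
import random


class SpanBuffer:
    """Hold a request's finished spans and decide at the end whether to export them."""

    def __init__(self, sample_ratio: float, rng=None):
        self.sample_ratio = sample_ratio
        self.rng = rng or random.Random()
        self.exported = []

    def finish_request(self, spans) -> bool:
        """Export every request that contains an error; sample the rest by ratio."""
        has_error = any(span.get("error") for span in spans)
        if has_error or self.rng.random() < self.sample_ratio:
            self.exported.append(spans)
            return True
        return False


buffer = SpanBuffer(sample_ratio=0.0)  # drop everything except errors
ok = buffer.finish_request([{"name": "GET /cart"}])
failed = buffer.finish_request([{"name": "POST /pay", "error": "card declined"}])
print(ok, failed)
```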
