MyObservability

Designing AIOps

Machine learning

Machine learning is a core feature of AIOps.

Supervised learning: Predicting a category based on labeled past data.
Unsupervised learning: Finding inherent groupings in data without known labels, such as grouping related alerts and surfacing potential correlations.
- Anomaly Detection: Identifying unusual data points that deviate from established norms.
Reinforcement Learning: Involves an agent learning optimal actions through trial and error in a given environment.

Challenges for traditional root cause analysis:

Siloed data
Manual, time-consuming correlation of events, patterns, and dependencies
Alert overload

How Machine Learning (ML) tackles root cause analysis (RCA) within AIOps:

Log & Event Correlation: Using ML algorithms and pattern recognition.
Anomaly Detection: ML models learn the normal baseline behavior of system metrics. Deviations outside established norms pinpoint where and when something unusual occurred, triggering deeper investigations.
Knowledge Graphs: Map the interdependencies between IT infrastructure components.
Predictive Models: Anticipate issues based on historical data and patterns.
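The baseline idea behind anomaly detection can be sketched in a few lines. This is a minimal illustration, not how any particular AIOps product works: it assumes a simple mean and standard deviation baseline with a 3-sigma threshold, and the CPU values and function names (`learn_baseline`, `find_anomalies`) are hypothetical.

```python
from statistics import mean, stdev

def learn_baseline(history):
    """Return (mean, standard deviation) of the historical samples."""
    return mean(history), stdev(history)

def find_anomalies(samples, baseline, threshold=3.0):
    """Return (index, value) pairs lying more than `threshold`
    standard deviations away from the baseline mean."""
    mu, sigma = baseline
    return [(i, v) for i, v in enumerate(samples)
            if abs(v - mu) > threshold * sigma]

# Hypothetical CPU utilisation (%) observed during normal operation
history = [41, 43, 40, 42, 44, 39, 42, 41, 43, 40]
baseline = learn_baseline(history)

# New samples to check against the learned baseline
new_samples = [42, 40, 95, 43]
anomalies = find_anomalies(new_samples, baseline)
print(anomalies)
```

The index in each result says when the deviation occurred, which is the starting point for the deeper investigation. Production systems typically use richer models (seasonality, trends), so treat the fixed threshold as an assumption.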

Example:

A customer reports that a web application is inaccessible.
The AIOps platform collects the data: alerts from the web server, middleware, database server, and infrastructure.
It groups the alerts based on timing and component overlap, which points to a network device common to the affected components.
Anomaly detection models simultaneously show unusual traffic patterns associated with that network device.
IT teams focus on that network device to investigate, rather than troubleshooting everything downstream of the outage.
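The grouping step in this example could look roughly like the sketch below. It assumes each alert carries a timestamp and the set of components on its path; the alert data, the 30-second window, and the names `group_by_time` and `shared_components` are all hypothetical.

```python
# Hypothetical alerts: (timestamp in seconds, source, components involved)
alerts = [
    (100, "web-server", {"web-01", "switch-a"}),
    (104, "middleware", {"mw-01", "switch-a"}),
    (109, "database", {"db-01", "switch-a"}),
    (900, "infrastructure", {"backup-01"}),
]

def group_by_time(alerts, window=30):
    """Put an alert in the current group when it arrives within
    `window` seconds of the previous alert; otherwise start a new group."""
    groups = []
    for alert in sorted(alerts, key=lambda a: a[0]):
        if groups and alert[0] - groups[-1][-1][0] <= window:
            groups[-1].append(alert)
        else:
            groups.append([alert])
    return groups

def shared_components(group):
    """Return the components that appear in every alert of the group."""
    return set.intersection(*(a[2] for a in group))

groups = group_by_time(alerts)
for group in groups:
    if len(group) > 1:  # a single alert has nothing to overlap with
        print([a[1] for a in group], "share", shared_components(group))
```

Here the three alerts that arrive close together overlap only on `switch-a`, which is the kind of signal that would direct the team to one network device.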

ML algorithms & Pattern recognition

ML algorithms analyze the flood of raw data and alerts to identify events that typically co-occur or exhibit distinct signatures indicating a shared problem.
Clustering: Groups similar events and logs together, based on features like timestamps, error codes, affected components, and text patterns within the event message.
Predefined Patterns: Rules configured to seek specific, known combinations of events suggesting an incident is forming.
Dynamic Pattern Learning: Advanced AIOps systems incorporate ML to automatically learn new patterns as they observe how events typically cluster over time.
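As a rough illustration of clustering on text patterns, the sketch below groups event messages by token overlap (Jaccard similarity) after masking numbers. The greedy single-pass approach, the 0.5 threshold, and the sample messages are assumptions chosen for brevity; real platforms generally use more robust algorithms and more features (timestamps, error codes, affected components).

```python
import re

def tokens(message):
    """Lower-case word tokens, with digits masked so that IDs and
    counters do not make otherwise identical messages look different."""
    return set(re.sub(r"\d+", "<n>", message.lower()).split())

def jaccard(a, b):
    """Share of tokens that two token sets have in common."""
    return len(a & b) / len(a | b)

def cluster(messages, threshold=0.5):
    """Greedy single-pass clustering: a message joins the first cluster
    whose first message is similar enough, else it starts a new cluster."""
    clusters = []
    for msg in messages:
        for c in clusters:
            if jaccard(tokens(msg), tokens(c[0])) >= threshold:
                c.append(msg)
                break
        else:
            clusters.append([msg])
    return clusters

# Hypothetical event messages
messages = [
    "Connection timeout to db-01 after 30 s",
    "Connection timeout to db-02 after 45 s",
    "Disk usage 91 percent on web-01",
    "Connection timeout to db-01 after 60 s",
    "Disk usage 95 percent on web-02",
]
clusters = cluster(messages)
for c in clusters:
    print(len(c), "x", c[0])
```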

Types of Anomalies:

Consider what you want to detect:

Point anomalies: Spikes or dips at a specific time.
Contextual anomalies: Values that are unusual only within a specific context or period (e.g., low CPU usage is normal on weekends but unusual during weekday peak hours).
Collective anomalies: A group of data points acting unusually together.
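The difference between a global check and a contextual check can be shown with a small sketch. It assumes a 3-sigma rule and two hand-made contexts ("weekday" and "weekend"); the CPU figures and the `is_anomalous` helper are hypothetical.

```python
from statistics import mean, stdev

# Hypothetical CPU utilisation (%) history, split by context
history = {
    "weekday": [70, 72, 68, 75, 71, 69, 73, 74],
    "weekend": [12, 15, 10, 14, 11, 13, 12, 16],
}

def is_anomalous(value, samples, threshold=3.0):
    """True when `value` lies more than `threshold` standard
    deviations from the mean of `samples`."""
    mu, sigma = mean(samples), stdev(samples)
    return abs(value - mu) > threshold * sigma

value = 70  # a reading that is ordinary on a weekday

# Ignoring context: compare against all history pooled together
global_flag = is_anomalous(value, history["weekday"] + history["weekend"])

# Using context: compare against the history of the matching period
weekday_flag = is_anomalous(value, history["weekday"])
weekend_flag = is_anomalous(value, history["weekend"])

print(global_flag, weekday_flag, weekend_flag)
```

The pooled baseline is too wide to flag the reading at all, while the weekend baseline flags it, which is why the type of anomaly you want to detect affects how baselines should be built.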

Knowledge Graphs:

A knowledge graph is a network of interconnected data points representing IT components, dependencies, historical incidents, alerts, and other relevant operational information.

A knowledge graph is built using:

Discovery tools for infrastructure details
Monitoring and logging systems for events and alerts
Configuration item (CI) information from the CMDB
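A dependency portion of such a graph can be sketched with a plain dictionary. This is only an illustration of how dependencies help narrow a root cause: the component names, the edges, and the helper functions are hypothetical, and a real knowledge graph would also hold incidents, alerts, and other relationships.

```python
# Hypothetical dependency edges: component -> components it depends on
depends_on = {
    "web-app": ["web-server"],
    "web-server": ["middleware", "switch-a"],
    "middleware": ["database", "switch-a"],
    "database": ["switch-a"],
    "switch-a": [],
}

def upstream(component, graph):
    """Return every component reachable by following dependency edges."""
    seen, stack = set(), list(graph.get(component, []))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(graph.get(node, []))
    return seen

def common_dependencies(alerting, graph):
    """Components that every alerting component is or depends on."""
    reach = [upstream(c, graph) | {c} for c in alerting]
    return set.intersection(*reach)

alerting = ["web-server", "middleware", "database"]
candidates = common_dependencies(alerting, depends_on)

# Keep only candidates that do not themselves depend on another candidate
roots = {c for c in candidates if not (upstream(c, depends_on) & candidates)}
print(candidates, roots)
```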

Predictive Model

Predictive models use machine learning algorithms to analyze historical data (logs, metrics, past incidents) and identify patterns that anticipate potential issues, breakdowns, or performance degradations.
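As a very small example of the idea, the sketch below fits a straight line to historical disk usage and estimates when a threshold will be crossed. It assumes a linear trend, a hypothetical 90% threshold, and made-up daily readings; real predictive models are usually more sophisticated than a least-squares line.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return slope, mean_y - slope * mean_x

def days_until(threshold, xs, ys):
    """Days from the last observation until the fitted line reaches
    `threshold`, or None when usage is not growing."""
    slope, intercept = fit_line(xs, ys)
    if slope <= 0:
        return None
    return (threshold - intercept) / slope - xs[-1]

# Hypothetical daily disk usage (%) over one week
days = [0, 1, 2, 3, 4, 5, 6]
disk_used_pct = [60, 62, 64, 66, 68, 70, 72]

days_until_full = days_until(90, days, disk_used_pct)
print(days_until_full)
```

A forecast like this lets a team act before the threshold is reached instead of reacting to an alert after the fact.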
