Designing AIOps
Machine learning
Machine learning is a core feature of AIOps. The main approaches are:
- Supervised learning: Predicting a category based on labeled past data.
- Unsupervised learning: Finding inherent groupings in data without known labels, such as grouping related alerts and surfacing potential correlations.
- Anomaly Detection: Identifying unusual data points that deviate from established norms.
- Reinforcement Learning: Involves an agent learning optimal actions through trial and error in a given environment.
Challenges for traditional root cause analysis:
- Siloed data
- Manual, time-consuming correlation of events, patterns, and dependencies
- Alert overload
How machine learning (ML) tackles root cause analysis (RCA) within AIOps:
- Log and event correlation: Uses ML algorithms and pattern recognition.
- Anomaly Detection: ML models learn the normal baseline behavior of system metrics. Deviations outside established norms pinpoint where and when something unusual occurred, triggering deeper investigations.
- Knowledge Graphs: Model the interdependencies between IT infrastructure components.
- Predictive Models: Anticipate issues based on historical data and patterns.
Example:
- A customer reports that a web application is inaccessible.
- The AIOps platform collects the data: alerts from the web server, middleware, database server, and infrastructure.
- It groups the alerts based on timing and component overlap, which points to a network device shared by the affected components.
- Anomaly detection models simultaneously show unusual traffic patterns associated with that network device.
- IT teams focus on that network device to investigate, rather than troubleshooting everything downstream of the outage.
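The grouping step in this example can be sketched in a few lines of Python. This is a minimal illustration, not how any particular AIOps platform does it: the `Alert` fields, the 60-second window, and component names such as `switch-7` are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    timestamp: float        # seconds since an arbitrary reference point
    source: str             # component that raised the alert
    components: frozenset   # components the alert touches (hypothetical field)


def group_alerts(alerts, window_seconds=60.0):
    """Group alerts that are close in time and share at least one component."""
    groups = []
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        for group in groups:
            last = group[-1]
            close_in_time = alert.timestamp - last.timestamp <= window_seconds
            shared = any(alert.components & other.components for other in group)
            if close_in_time and shared:
                group.append(alert)
                break
        else:
            # No existing group matched, so this alert starts a new one.
            groups.append([alert])
    return groups


def common_components(group):
    """Components referenced by every alert in the group."""
    return frozenset.intersection(*(a.components for a in group))


# Hypothetical alerts: three share a network device, one is unrelated.
alerts = [
    Alert(0.0, "web-server", frozenset({"web-server", "switch-7"})),
    Alert(12.0, "middleware", frozenset({"middleware", "switch-7"})),
    Alert(20.0, "database", frozenset({"database", "switch-7"})),
    Alert(900.0, "batch-job", frozenset({"batch-job"})),
]
groups = group_alerts(alerts)
suspects = common_components(groups[0])
```

Here the first three alerts land in one group and `suspects` contains only the shared device, which is where the investigation would start.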
ML algorithms and pattern recognition
- ML algorithms analyze the flood of raw data and alerts to identify events that typically co-occur or exhibit distinct signatures indicating a shared problem.
- Clustering: Groups similar events and logs together, based on features like timestamps, error codes, affected components, and text patterns within the event message.
- Predefined Patterns: Rules configured to seek specific, known combinations of events suggesting an incident is forming.
- Dynamic Pattern Learning: Advanced AIOps systems incorporate ML to automatically learn new patterns as they observe how events typically cluster over time.
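As a rough sketch of the clustering idea, the snippet below groups event messages by text pattern alone, using Jaccard similarity over word tokens with numbers masked. It ignores the other features listed above (timestamps, error codes, affected components), and the 0.6 threshold, the greedy assignment, and the sample messages are assumptions made for illustration.

```python
import re


def tokens(message):
    """Lowercase word tokens, with digits masked so '30s' and '45s' look alike."""
    masked = re.sub(r"\d+", "<num>", message.lower())
    return set(re.findall(r"[a-z<>_]+", masked))


def jaccard(a, b):
    """Share of tokens two sets have in common (1.0 means identical)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def cluster_events(messages, threshold=0.6):
    """Greedily assign each message to the most similar existing cluster."""
    clusters = []  # each entry: {"signature": token set, "members": [messages]}
    for message in messages:
        toks = tokens(message)
        best = max(
            clusters,
            key=lambda c: jaccard(toks, c["signature"]),
            default=None,
        )
        if best is not None and jaccard(toks, best["signature"]) >= threshold:
            best["members"].append(message)
        else:
            # The first message of a cluster serves as its fixed signature.
            clusters.append({"signature": toks, "members": [message]})
    return [c["members"] for c in clusters]


# Hypothetical event messages.
messages = [
    "Connection timeout after 30s to db-01",
    "Connection timeout after 45s to db-02",
    "Disk usage at 91% on /var",
    "Connection timeout after 30s to db-03",
    "Disk usage at 95% on /var",
]
clusters = cluster_events(messages)
```

With these inputs the timeouts and the disk warnings end up in two separate clusters. A production system would more likely use a library clustering algorithm over richer features; the greedy loop is only meant to show the shape of the technique.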
Types of Anomalies:
Consider what you want to detect:
- Point anomalies: Spikes or dips at a specific time.
- Contextual anomalies: Values that are unusual only within a given context or period (e.g., a CPU level that is expected on a low-usage weekend would be unusual on a weekday).
- Collective anomalies: A group of data points acting unusually together.
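The three types can be told apart with a simple z-score baseline. The sketch below assumes roughly stable, normally distributed metrics and a threshold of 3 standard deviations; the CPU figures and function names are hypothetical, and real detectors are usually more sophisticated.

```python
from math import sqrt
from statistics import mean, stdev


def z_score(value, baseline):
    """Distance from the baseline mean in standard deviations.

    Assumes the baseline has some variation (stdev > 0).
    """
    return (value - mean(baseline)) / stdev(baseline)


def is_point_anomaly(value, baseline, threshold=3.0):
    """A single value far outside the baseline."""
    return abs(z_score(value, baseline)) > threshold


def is_contextual_anomaly(value, context, baselines, threshold=3.0):
    """A value judged only against the baseline for its own context."""
    return is_point_anomaly(value, baselines[context], threshold)


def is_collective_anomaly(window, baseline, threshold=3.0):
    """A run of values whose average has shifted, even if each looks normal."""
    standard_error = stdev(baseline) / sqrt(len(window))
    return abs(mean(window) - mean(baseline)) / standard_error > threshold


# Hypothetical CPU utilisation history in percent, split by context.
baselines = {
    "weekday": [62, 58, 65, 60, 63, 59, 61, 64],
    "weekend": [12, 15, 11, 14, 13, 12, 16, 13],
}
all_values = baselines["weekday"] + baselines["weekend"]

spike = is_point_anomaly(98, baselines["weekday"])
# 60% is unremarkable overall but unusual for a weekend.
globally_odd = is_point_anomaly(60, all_values)
weekend_odd = is_contextual_anomaly(60, "weekend", baselines)
# Each reading is within the threshold alone; together they signal a shift.
drift_window = [65, 66, 65, 66, 65]
any_single = any(is_point_anomaly(v, baselines["weekday"]) for v in drift_window)
drift = is_collective_anomaly(drift_window, baselines["weekday"])
```

The contextual case shows why the baseline matters: the same reading passes against the combined history but is flagged against the weekend baseline.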
Knowledge Graphs:
A knowledge graph is a network of interconnected data points representing IT components, dependencies, historical incidents, alerts, and other relevant operational information.
A knowledge graph is built using:
- Discovery tools for infrastructure details
- Monitoring and logging systems for events and alerts
- Configuration item (CI) information from the CMDB
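To illustrate how such a graph supports root cause analysis, the sketch below stores dependencies as a plain dictionary and looks for components that every alerting component depends on. The component names and edges are hypothetical, and a real knowledge graph would also carry incidents, alerts, and CMDB attributes rather than dependency edges alone.

```python
# Hypothetical dependency edges: component -> components it depends on.
DEPENDS_ON = {
    "web-app": ["web-server"],
    "web-server": ["middleware", "switch-7"],
    "middleware": ["database", "switch-7"],
    "database": ["switch-7"],
    "switch-7": [],
}


def upstream(component, graph):
    """Return every component reachable by following dependency edges."""
    seen = set()
    stack = list(graph.get(component, []))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(graph.get(node, []))
    return seen


def shared_dependencies(alerting, graph):
    """Components that every alerting component either is or depends on."""
    candidate_sets = [upstream(c, graph) | {c} for c in alerting]
    return set.intersection(*candidate_sets)


candidates = shared_dependencies(["web-server", "middleware", "database"], DEPENDS_ON)
```

For these edges the candidates are `database` and `switch-7`. The graph alone does not pick between them; a second signal, such as an anomaly on one of the two, would narrow it down.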
Predictive Models:
Predictive models use machine learning algorithms to analyze historical data (logs, metrics, past incidents) and identify patterns that anticipate potential issues, breakdowns, or performance degradations.
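A very small instance of this idea is fitting a straight line to a metric's history and extrapolating when it will cross a limit. The sketch below uses ordinary least squares on hypothetical disk usage readings. The 90% threshold and the function names are assumptions, and real predictive models typically account for seasonality and use many more signals.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def days_until_threshold(days, usage_pct, threshold=90.0):
    """Estimate days from the last reading until the trend reaches the threshold.

    Returns None when usage is flat or falling, since no crossing is predicted.
    """
    slope, intercept = fit_line(days, usage_pct)
    if slope <= 0:
        return None
    crossing_day = (threshold - intercept) / slope
    return max(0.0, crossing_day - days[-1])


# Hypothetical daily disk usage readings in percent.
days = [0, 1, 2, 3, 4, 5, 6]
usage = [70.0, 72.0, 74.0, 76.0, 78.0, 80.0, 82.0]
remaining = days_until_threshold(days, usage)
```

With usage growing by two points a day, the estimate gives the team about four days of warning before the disk reaches the threshold.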