A KPI (Key Performance Indicator) is a recurring saved search that returns the value of an IT performance metric.
Recommended number of KPIs per service
- A service with 50 or more KPIs is difficult to monitor and troubleshoot effectively. Spend time crafting and fostering the KPIs you care about and want to measure, which saves time troubleshooting later.
- It’s best to have 20 or fewer KPIs per individual service.
Path: Log in to ITSI -> Configuration -> Services -> Select Service -> KPIs -> New.
Select one of the following options:
- Generic KPI
- Select a KPI template to populate the KPI with a preconfigured source search based on an ITSI module.
Steps:
- Step 1 - (Required) Define a KPI source search
- Step 2 - (Optional) Split and filter by entities
- Step 3 - (Required) Configure KPI monitoring calculations
- Step 4 - (Optional) Define KPI unit and monitoring lag
- Step 5 - (Optional) Enable backfill
- Step 6 - (Required) Configure KPI thresholds
Step 1 - (Required) Define a KPI source search
Consider the performance implications for your particular deployment.
- Data models are suitable for smaller test environments
- Base searches generally provide the best performance in larger production environments
Search types:
- Data model
- Metric Search
- Ad hoc Search
- Base Search
Note:
- When you create a KPI search from a data model, the data model object field becomes the threshold field.
- When you create a KPI search from an ad hoc search, you must manually enter the threshold field.
- Using transforming commands, the mstats command, the gettime macro, or time modifiers in your KPI search is not recommended, because this may cause issues with KPI backfill.
Step 2 - (Optional) Split and filter by entities
Split a KPI by entities in IT Service Intelligence (ITSI) to monitor each individual entity against which the KPI search runs.
Note:
- ITSI doesn’t limit the number of matching entities for a service. Be mindful of the performance implication when you have a lot of entities matched for a single service.
- Entity filtering lets you specify the service entities against which a KPI search runs.
Step 3 - (Required) Configure KPI monitoring calculations
KPI monitoring calculations determine how and when ITSI performs statistical calculations on the KPI. They also determine how ITSI displays gaps in your data.
Options:
- KPI Search Schedule: How often to check the value of the KPI.
- Entity Calculation: Method of aggregating the KPI for the entity level.
- Service/Aggregate Calculation: Method of aggregating the KPI for the Service level.
- Calculation window: Time period over which the search applies.
- Fill Data Gaps with: Null value / last available value / Custom value.
- Threshold level for Null values.
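The relationship between the entity calculation and the service/aggregate calculation can be sketched as a two-stage aggregation. This is only an illustration, not ITSI's implementation: the entity names, the sample values, and the choice of an average for both stages are assumptions.

```python
from statistics import mean

# Hypothetical raw samples collected in one calculation window, keyed by entity.
samples = {
    "web-01": [40.0, 50.0, 60.0],
    "web-02": [80.0, 90.0, 100.0],
}

def entity_values(samples_by_entity, entity_calc=mean):
    """Stage 1: apply the entity calculation to each entity's samples."""
    return {entity: entity_calc(values) for entity, values in samples_by_entity.items()}

def aggregate_value(per_entity, aggregate_calc=mean):
    """Stage 2: apply the service/aggregate calculation across the entity results."""
    return aggregate_calc(list(per_entity.values()))

per_entity = entity_values(samples)
service_value = aggregate_value(per_entity)
print(per_entity, service_value)
```

Passing `max` instead of `mean` for either stage models a different calculation choice.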
Note:
- Each time the saved search runs for a KPI with Fill Data Gaps with set to Last available value, ITSI caches the alert value for the KPI in the itsi_kpi_summary_cache KV store collection.
- ITSI fills data gaps with the last reported value for at most 30 to 45 minutes, in accordance with the default modular input interval and retention time (15 minutes + 30 minutes). If data gaps for a KPI continue to occur for more than 45 minutes, the data gaps appear as N/A values.
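The gap-filling behavior for Last available value can be modeled roughly as follows. This is a simplified sketch that uses the 45-minute upper bound from the note above. The function name and arguments are hypothetical, and the real cutoff depends on when the modular input runs.

```python
from datetime import datetime, timedelta

# Upper bound from the text: 15-minute interval + 30-minute retention.
MAX_FILL_AGE = timedelta(minutes=45)

def displayed_value(current, last_value, last_seen, now, max_age=MAX_FILL_AGE):
    """Return the value shown for a KPI when gaps are filled with the last value."""
    if current is not None:
        return current            # fresh data, nothing to fill
    if last_value is not None and now - last_seen <= max_age:
        return last_value         # gap is recent enough to fill from the cache
    return "N/A"                  # gap has lasted too long

now = datetime(2024, 1, 1, 12, 0)
print(displayed_value(None, 42, now - timedelta(minutes=20), now))
print(displayed_value(None, 42, now - timedelta(minutes=50), now))
```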
Step 4 - (Optional) Define KPI unit and monitoring lag
Configure the monitoring lag to offset indexing lag and improve performance.
The unit is the measurement label displayed for the KPI, such as %, Secs, or MBps.
The monitoring lag time, in seconds, is used to offset the indexing lag. Monitoring lag is an estimate of the number of seconds it takes for new events to move from the source to the index. When indexing large quantities of data, an indexing lag can occur, which can cause performance issues. Delay the search time window to ensure that events are actually in the index before running the search.
- As a best practice, don’t set the monitoring lag to less than 30 seconds.
- If you’re working with a new data source, click Determine Recommended Lag to sample a 60-minute time period and find the minimum, maximum, and recommended monitoring lag settings for your data source.
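To make the effect of monitoring lag concrete, the sketch below shifts a search time window back by the lag so that the window covers only events that should already be indexed. The function name is hypothetical, and the 30-second default mirrors the best-practice minimum mentioned above.

```python
from datetime import datetime, timedelta

def search_window(now, window_minutes, lag_seconds=30):
    """Return (earliest, latest) for a KPI search delayed by the monitoring lag."""
    latest = now - timedelta(seconds=lag_seconds)          # skip the newest, possibly unindexed events
    earliest = latest - timedelta(minutes=window_minutes)  # the calculation window keeps its length
    return earliest, latest

earliest, latest = search_window(datetime(2024, 1, 1, 12, 0, 0), window_minutes=5)
print(earliest, latest)
```

The window length is unchanged. Only its position moves, which is why a larger lag delays the KPI value without shrinking the data it is computed from.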
Step 5 - (Optional) Enable backfill
Enable backfill for a KPI in IT Service Intelligence (ITSI) to fill the summary index with historical raw KPI data. In other words, even though the summary index only started collecting data at the start of this week when the KPI was created, if necessary you can use the backfill option to fill the summary index with data from the past month.
Prerequisite:
- Disable KPI alerting. If KPI alerting is enabled when you backfill a KPI, ITSI can generate duplicate alerts.
- Indexed raw data requirements. The backfill option requires you to have adequate indexed raw data for the backfill period you select.
Note: Backfill is a one-time operation. Once started, it cannot be redone or undone. For example, if you backfill 60 days of data and then later decide that you want 120 days, you cannot go back and change the backfill period. Think carefully about how many days of data you want to backfill before saving the service.
- Choose a backfill period - When you enable backfill, you must indicate how many days of data to backfill. You can choose a predefined time range like last 7 days, or select a custom date prior to the current date.
How backfill fills data gaps
- If you backfill a KPI that uses Last available value to fill data gaps, the gaps are backfilled with filled-in alert values, using the last reported value for the KPI instead of N/A alert values.
Status
ITSI supports a maximum of 60 days of data in the summary index. Therefore, after you configure backfill, you see one of the following messages:
- Backfill is not available - More than 60 days of summary index data already exists.
- Backfill has been configured for last <#> days of data - The backfill job is configured but hasn’t run yet. It might not have run because you haven’t saved the service yet.
- Backfill completed for last <#> days - Backfill has completed successfully. This message only shows up until a total of 60 days of data is in the summary index, then it changes to Backfill is not available.
Step 6 - (Required) Configure KPI thresholds
Severity-level thresholds determine the current status of your KPI in IT Service Intelligence (ITSI). When KPI values meet or exceed threshold conditions, the KPI status changes.
Threshold Types:
- Set aggregate thresholds: Aggregate thresholds are useful for monitoring the status of aggregated KPI values, such as values based on a calculation that uses the stats count function.
- Set per-entity thresholds: Per-entity thresholds are useful for monitoring multiple separate entities against which a single KPI is running.
- Advanced Thresholds:
- Time-based thresholds - user-defined threshold values to be used at different times of the day or week to account for changing KPI workloads.
- Adaptive thresholds - thresholds calculated by machine learning algorithms that dynamically adapt and change based on the KPI’s observed behavior.
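As a rough illustration of static severity thresholds, the sketch below maps a KPI value to a severity level using "meets or exceeds" comparisons. The threshold numbers are hypothetical values for a CPU % KPI, not ITSI defaults.

```python
# Hypothetical thresholds: (lower bound, severity level).
THRESHOLDS = [(90, "Critical"), (80, "High"), (70, "Medium"), (60, "Low")]

def severity(value, thresholds=THRESHOLDS, base="Normal"):
    """Return the most severe level whose bound the value meets or exceeds."""
    for bound, level in sorted(thresholds, reverse=True):  # check the highest bound first
        if value >= bound:
            return level
    return base  # below every bound

print(severity(95), severity(80), severity(10))
```

A time-based policy could be modeled by selecting a different `thresholds` list depending on the hour or weekday before calling `severity`.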
Set KPI Importance values in ITSI
After you create a KPI in IT Service Intelligence (ITSI), assign the KPI an importance value. ITSI uses KPI importance values, along with the KPI severity levels, to calculate the overall service health score. A service’s health score is a weighted average of the severity levels of a service’s KPIs and dependencies.
Importance values range from 0 to 11. KPI importance values from 1-11 are included in the health score calculation, with 1 being the least important and 11 being the most important. KPIs with an importance value of 0 aren’t included in the health score calculation. The greater the KPI importance value, the greater the impact that KPI has on the service health score.
ITSI considers KPIs that have an importance value of 11 as a special case that represents a “minimum health indicator” for the service. When a KPI with an importance value of 11 reaches the critical state, the overall health score for the service turns critical, regardless of the status of other KPIs in the service.
How service health scores are calculated
- A decline in a service’s health can be the first sign of an issue that might lead to an outage.
- The health score calculation is based on the current severity level of service KPIs (Critical, High, Medium, Low, and Normal) and the weighted average of the importance values of all KPIs in a service.
Note: The Info severity level isn’t included in the service health score calculation.
- ITSI doesn’t directly use KPIs or health scores of dependent services to calculate a service’s health score.
- Service health scores are calculated based on the score_contribution value for each severity level. Score contribution values are defined in threshold_labels.conf. Don’t modify these values.
For example, a service contains 2 KPIs. One KPI is Critical, so the score_contribution value is 0. The other KPI is Normal, so the score_contribution value is 100. Assuming both KPIs have the same importance values, the service health score will be 50.
The following formula is used to calculate service health scores:
$$\text{Service health score} = \frac{\sum_{i=1}^{N} K_i \, G_i}{\sum_{i=1}^{N} G_i}$$
Where:
- N = count of KPIs
- G = importance value of one KPI
- K = the score contribution of the KPI (Normal=100, Low=70, Medium=50, High=30, Critical=0)
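For example, suppose a service has three KPIs with importance values of 10, 7, and 5, whose current severity levels are Normal, Low, and High respectively. The importance values sum to 22, so the service health score is calculated as follows:
Service health score = (100 ∗ 10/22) + (70 ∗ 7/22) + (30 ∗ 5/22) = 1640/22 ≈ 74.55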
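The weighted average can be checked with a few lines of Python. This is a minimal sketch built from the definitions of N, G, and K above. The function name is hypothetical, and the special handling of importance value 11 is deliberately not modeled.

```python
# Score contributions per severity level, as listed in the text.
SCORE_CONTRIBUTION = {"Normal": 100, "Low": 70, "Medium": 50, "High": 30, "Critical": 0}

def service_health_score(kpis):
    """kpis: iterable of (severity, importance) pairs. Returns the weighted average."""
    # KPIs with importance 0 are excluded from the calculation.
    scored = [(SCORE_CONTRIBUTION[sev], imp) for sev, imp in kpis if imp > 0]
    total_importance = sum(imp for _, imp in scored)
    if total_importance == 0:
        return None  # no KPI contributes to the score
    return sum(k * g for k, g in scored) / total_importance

score = service_health_score([("Normal", 10), ("Low", 7), ("High", 5)])
print(round(score, 2))  # 74.55
```

Calling it with one Critical and one Normal KPI of equal importance returns 50, which matches the two-KPI example above.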
Impact of per-entity thresholds on service health scores
- When a KPI is split by entity, if any entity has a severity level that’s worse than the service aggregate severity, the service health score is impacted.
If the KPI is split by entity, the worst entity severity is taken as the KPI’s score contribution.
Therefore, while the aggregate KPI score might be 100 (Normal), one of the entities within that KPI might be 30 (High), so the overall score contribution of that KPI will be 30.
In some cases, entity severity contributions can cause the overall service health score to change significantly, while the aggregate KPI severity level changes only marginally or not at all. For example, if you have a CPU % utilization KPI that is running against three entities, and two of those entities show normal severity, while the third shows critical, the overall service health score might show critical, while the aggregate KPI severity level remains normal.
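The "worst entity wins" rule can be sketched as taking the minimum score contribution across the aggregate and all entities. The function name is hypothetical, and the contribution values are the ones listed earlier in this section.

```python
SCORE_CONTRIBUTION = {"Normal": 100, "Low": 70, "Medium": 50, "High": 30, "Critical": 0}

def kpi_score_contribution(aggregate_severity, entity_severities=()):
    """Return the contribution of a KPI, letting the worst entity override the aggregate."""
    candidates = [aggregate_severity, *entity_severities]
    return min(SCORE_CONTRIBUTION[s] for s in candidates)

# Aggregate is Normal, but one of three entities is Critical.
print(kpi_score_contribution("Normal", ["Normal", "Normal", "Critical"]))  # 0
```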
Create KPI base searches in ITSI
KPI base searches let you share a search definition across multiple KPIs in IT Service Intelligence (ITSI). Create base searches to consolidate multiple similar KPIs, reduce search load, and improve search performance.
ITSI module base searches
ITSI includes several pre-configured KPI base searches based on ITSI modules that you can use with your services.
- The titles of these base searches begin with “DA-ITSI”.
- KPI base searches that come with ITSI modules are read-only and cannot be modified or deleted.
- To customize, clone the base search, then perform your edits on the clone.
Path: Configuration -> KPI Base Searches -> Create KPI Base Search
ITSI stores each KPI base search in savedsearches.conf as a saved search named “Indicator - shared - <-name-> Search”.
Service templates and base searches
- Service templates use base searches for their KPIs. When a service template is created from a service, all of the KPIs in the service are imported into the template.
- Any service KPIs that use ad hoc searches, data model searches, or metrics searches are converted into base searches.
Overview of Service Template
Path: Configuration -> Service/Service Template -> Select Service/Service Template -> KPIs -> New -> Select KPI
Delete a KPI base search
- Only users with write permissions to the Global team can delete a KPI base search. The itoa_admin and itoa_team_admin roles have this capability by default.
- When you delete a base search, any service KPIs that use the base search are converted to ad hoc searches.
- You can’t delete a base search that is being used by a KPI in a service template. You must modify those service templates to remove the dependency before you can delete the base search.
- Additionally, you can’t delete a metric that is being used by a base search in a service template.
Wildcards in KPI base searches
- As a best practice, don’t use wildcards for entities in a KPI base search.
KPI base search performance considerations
The performance of KPI base searches is dependent on the following factors:
- The number of KPIs that use the base search.
- The number of services that contain KPIs that use the base search.
- The number of entities matching service entity rules.
Note:
- In general, a KPI base search can support fewer KPIs with many entities or many KPIs with fewer entities.
- It’s not advised to use a single KPI base search for both a high number of KPIs and a high number of entities.
- As the number of services or matching entities increases, the search runtime also increases.
Fix truncated or incorrect KPI values
Search results are processed, created, and written to the itsi_summary index via an alert action. The default limit on the number of rows that can be written is 50,000.
Calculate the number of the result rows generated by a shared base search using the following formula:
<number of services> x <number of KPIs in each service> x <number of entities per service entity rule> + <number of services> x 2 (one row for the service aggregation result and one for the service maximum result)
For example, for 500 services with 10 KPIs in each service and 15 matching entities, the expected number of result rows is 500 x 10 x 15 + 500 x 2 = 76,000 rows.
If the number of result rows expected is more than 50,000, ITSI truncates the results and displays incorrect KPI values.
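The row estimate is easy to script when sizing a base search. The sketch below applies the formula above and compares the result with the 50,000-row default. The function and constant names are hypothetical.

```python
ROW_LIMIT = 50_000  # default row limit mentioned in the text

def expected_result_rows(services, kpis_per_service, entities_per_service):
    """Estimate the rows a shared base search writes to the summary index."""
    entity_rows = services * kpis_per_service * entities_per_service
    service_rows = services * 2  # one aggregation row and one maximum row per service
    return entity_rows + service_rows

rows = expected_result_rows(500, 10, 15)
print(rows, rows > ROW_LIMIT)  # 76000 True
```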
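To avoid truncation, increase the result row limit in limits.conf. Make the change in $SPLUNK_HOME/etc/system/local/limits.conf rather than editing $SPLUNK_HOME/etc/system/default/limits.conf, because files under default are overwritten when Splunk is upgraded.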
Increase the KV store bulk get limit
The KPI base search tries to get all the relevant services from the KV store internally for thresholding related operations. When a KPI base search is attached to a lot of services, the bulk get might reach the KV store bulk get size limit. The default limit is 500MB.
Synchronize KPI searches in ITSI
By default, ITSI staggers the search scheduling of KPIs in order to reduce search load. For example, if you have five KPIs that are scheduled to run every 5 minutes, the search to update the value of each KPI from the summary index is staggered over the 5 minute interval (the first KPI at minute 1, the second KPI at minute 2, and so on).
You can synchronize KPI searches so they update at the same time during the scheduled interval.
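The difference between staggered and synchronized scheduling can be pictured with a small sketch. The round-robin assignment below is an assumption that reproduces the five-KPI example. How ITSI actually assigns offsets is not described here, and the function name and the shared minute used for synchronized searches are hypothetical.

```python
def schedule_offsets(kpi_names, interval_minutes=5, synchronized=False):
    """Return the minute within the interval at which each KPI search runs."""
    if synchronized:
        # Every KPI search updates at the same point in the interval.
        return {name: 1 for name in kpi_names}
    # Staggered: spread the searches round-robin across the interval.
    return {name: (i % interval_minutes) + 1 for i, name in enumerate(kpi_names)}

kpis = ["kpi_a", "kpi_b", "kpi_c", "kpi_d", "kpi_e"]
print(schedule_offsets(kpis))
print(schedule_offsets(kpis, synchronized=True))
```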