Introducing the new plugin for Grafana - Statusmap panel

Grafana can show the current status of objects, and Grafana can display data over time. Paradoxically, though, until now Grafana has had no convenient way to show status over time!
We present our plugin: Statusmap panel. It visually displays the status of a set of objects over a selected period of time. As an example of the plugin at work, imagine a number of locations where someone makes coffee.
The panel shows how Nikki saves electricity, how Gerry quickly replenishes his water supply, how Valera's coffee machine often breaks down, and that Wi-Fi on Bifrost is clearly better than at the lunar station, which also seems to be very short of water.
Looks interesting? Let's start with how we got here. We had been using the Status Panel and Status Dot plugins. They display the current state of a set of objects, for example hosts, pods, or coffee machines in different parts of the world.
Everything went well until we wanted to see the statuses of these objects over time. The first and simplest solution was to add a regular Graph panel with the Stacked option checked.
The idea was that Status Panel plus a stacked Graph would show both the state of the objects "right now" and how the situation developed over time. However, the stacked Graph is not very clear:
  • color distinguishes different time series rather than values, whereas Status Dot and Status Panel use color for values. The colors on the two panels do not match, which is confusing;
  • if null appears among the values, the graph breaks.

We tried to adapt the standard Heatmap panel, but it did not work: it treats the Y axis purely as a scale of values and cannot show text labels there. Then we tried the following Grafana plugins:
  • Carpet plot - groups values by day and by a selected part of the day;
  • Discrete Panel - a good plugin, but we need to show status over time in discrete buckets;
  • Status By Group Panel - a good improvement on the Status Panel that can display many statuses, but still without the capabilities we need.

Based on everything we tried, we formulated the following requirements for the plugin:
  • each object gets its own clearly separated row on the graph;
  • the object name is displayed along the Y axis and is specified in the legend field;
  • one object can have several statuses at once - in that case the most significant one is shown in color, and the rest are shown in the tooltip;
  • buckets are drawn no narrower than a specified width (5 px), because one-pixel buckets are hard to hit with the mouse;
  • manual color management - the ability to assign a color to each numerical value from a discrete set.
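The minimum bucket width requirement translates into a lower bound on the query interval. The sketch below is only an illustration of that arithmetic, not the plugin's actual code; the function name and the example panel width are assumptions.

```python
import math

def min_interval_seconds(range_seconds, panel_width_px, min_bucket_px=5):
    """Smallest query interval that keeps every bucket at least
    min_bucket_px wide (hypothetical helper, not the plugin's API)."""
    # How many buckets of the minimum width fit on the panel.
    max_buckets = max(1, panel_width_px // min_bucket_px)
    # Each bucket must then cover at least this many seconds.
    return math.ceil(range_seconds / max_buckets)

# A 6-hour range on a 1000 px wide panel (assumed numbers).
interval = min_interval_seconds(6 * 3600, 1000)
print(interval)
```

With these assumed numbers, 200 buckets fit on the panel, so each bucket has to cover at least 108 seconds.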

Let me now make a small digression about Heatmap charts, Prometheus, and discrete statuses.

A bit of theory

A classic heatmap is a three-dimensional graph:
  • time is plotted along the X axis,
  • the possible values of some quantity are plotted along the Y axis,
  • the Z axis holds the number of times each value was observed at a given time.

The standard Heatmap plugin displays the Z axis as color, for example from white to red, or as a green-yellow-red gradient. This works very well for continuous values: response time, queue length, number of requests to a server.
For discrete statuses of a set of objects, we need something different: the Y axis shows the names of the monitored objects, and the Z axis shows, for each object, the status observed at that point in time. But wait! What does the set of an object's statuses at a point in time mean? Let me try to explain.
Those who use Prometheus with Grafana know about the step (or interval) setting on the Query tab. If you set it to 1m while collecting data every 5s, then a simple query for the coffee_maker_status metric makes Prometheus return only every 12th value, and the other 11 values never appear on the chart. How can we improve the situation?
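To make the loss concrete, here is a small Python sketch of the sampling effect. It is a simplification: real Prometheus picks the most recent sample before each step timestamp, while this model just takes every 12th element. The sample data is made up.

```python
# Numbers from the text: data is scraped every 5s, the query step is 1m.
SCRAPE_INTERVAL_S = 5
STEP_S = 60

def visible_samples(samples, scrape_interval_s, step_s):
    """Return the samples a plain query would plot: one per step.
    Simplified model, not Prometheus's exact lookback behaviour."""
    stride = step_s // scrape_interval_s
    return samples[::stride]

# Five minutes of hypothetical statuses: all 0 (ok) except one
# 5-second reading of 3 (no water).
samples = [0] * 60
samples[17] = 3

shown = visible_samples(samples, SCRAPE_INTERVAL_S, STEP_S)
print(len(samples), len(shown))  # 60 5
print(3 in shown)                # False
```

The short "no water" reading falls between two steps, so it never reaches the chart.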
The first thing that comes to mind is to use aggregation functions, for example *_over_time(coffee_maker_status[1m]). But which function exactly? It is time to look at how statuses are represented in Prometheus metrics. In most cases, a status is encoded as one of a fixed set of values. For example, coffee_maker_status might take these values:
  • 0 - ok,
  • 1 - off,
  • 2 - no beans,
  • 3 - no water,
  • 4 - fail.

From here everything would seem simple: count the zeros, ones, twos, and so on within one minute, and we have excellent data for the chart! But Prometheus sees it differently: coffee_maker_status[1m] is a range vector, so expressions like max_over_time(coffee_maker_status[1m] == 2) or count_values_over_time(coffee_maker_status[1m], 3), which would suit us perfectly, are not possible.
Everything works fine if the metric takes only two values, 0 (the status was not observed) and 1 (the status was observed), and the status itself is stored in a label. Then you can write queries like this: (max_over_time(coffee_maker_status{status="3"}[1m]) == 1) * 3
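The effect of that query can be mimicked in a few lines of Python. This is only a model of the PromQL semantics under made-up data: the == comparison in PromQL filters out non-matching series, which the sketch represents as None.

```python
def max_over_time(window):
    """Maximum of all samples inside the range window."""
    return max(window)

def status_if_observed(window, status_code):
    """Model of (max_over_time(metric{status="N"}[1m]) == 1) * N.
    Returns N if the 0/1 flag was set at least once in the window,
    otherwise None (PromQL would return no sample at all)."""
    if max_over_time(window) == 1:
        return 1 * status_code
    return None

# Hypothetical one-minute window of 0/1 samples for status="3":
# the "no water" status was seen in two of the twelve scrapes.
no_water_flag = [0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0]

print(status_if_observed(no_water_flag, 3))
print(status_if_observed([0] * 12, 3))
```

Even a short occurrence of the status inside the window survives the aggregation, which is exactly what the plain query loses.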
What about a metric that takes several values? The article "Composing range vector functions in PromQL" suggested the idea of turning a metric with discrete values into a set of metrics with labels. This can be done with a recording rule:
    - record: coffee_maker_status:discrete
      expr: |
        count_values("status", coffee_maker_status)

This rule transforms the coffee_maker_status metric as follows: if the value 3 arrives, Prometheus creates the metric coffee_maker_status:discrete{status="3"} with a value of 1. And so on for each observed value.
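A rough Python model of what count_values does at a single instant may help. One caveat worth stating: count_values counts how many input series share each value, so the result is 1 only when a single series (one coffee machine) carries that value within the group. The function below is a sketch, not Prometheus code.

```python
from collections import Counter

def count_values(label, sample_values):
    """Model of PromQL count_values(label, metric) at one instant.
    sample_values holds the current value of every input series.
    Returns {(label, value-as-string): number of series with that value}."""
    counts = Counter(str(v) for v in sample_values)
    return {(label, value): n for value, n in counts.items()}

# One coffee machine currently reporting 3 (no water).
print(count_values("status", [3]))

# Three machines (hypothetical): two are ok, one is off.
print(count_values("status", [0, 0, 1]))
```

In the second call the "0" entry has a count of 2, so with several machines the rule would need grouping (for example by a per-machine label) to get the 0/1 behaviour described above.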
Statuses are usually defined in advance, so you can compose a set of queries, one per status, to avoid missing any of the required values. The legend must be the same for all queries so that the values can be grouped.
Now, if over a 30-minute interval the machine was switched off for 30 seconds (the off status, 1) and worked the rest of the time (the ok status, 0), we still keep the information about the shutdown, because the plugin receives two values with the same legend for the same moment in time: 0 from query A and 1 from query B.
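The grouping of values from several queries can be sketched as follows. The tuple layout and the function name are assumptions made for illustration; the plugin's real data structures may differ.

```python
from collections import defaultdict

def merge_by_legend(query_results):
    """Group values from all queries by (legend, timestamp).
    query_results: iterable of (legend, timestamp, value) tuples."""
    merged = defaultdict(list)
    for legend, timestamp, value in query_results:
        merged[(legend, timestamp)].append(value)
    return dict(merged)

# Hypothetical results for one 30-minute interval ending at t=1800.
results = [
    ("Valera", 1800, 0),  # query A: status ok was observed
    ("Valera", 1800, 1),  # query B: status off was observed
    ("Nikki", 1800, 0),   # query A: status ok was observed
]

merged = merge_by_legend(results)
print(merged[("Valera", 1800)])  # [0, 1]
print(merged[("Nikki", 1800)])   # [0]
```

Because both queries use the same legend, the short shutdown ends up next to the ok status instead of being dropped.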
So, we have figured out how to aggregate data about discrete statuses without losing information. What remains is to combine the data by legend and draw it on the panel.

The Statusmap plugin

Of course, we did not arrive at everything described above right away, but once it all came together, it became clear that the only thing missing was a rendering mechanism. Now that mechanism exists - the Statusmap panel plugin, which can do the following:
  • values at each point in time are grouped into buckets by matching the legend text specified in Query;
  • each legend text gets its own row on the graph and is shown as a label on the Y axis, while empty values are shown as blank or as 0;
  • the exact color of a bucket can be set for any value;
  • if a bucket contains several values, its color is taken from the value listed higher on the Colors tab, and hovering over the bucket shows all the values it contains;
  • the plugin can set the interval for the Prometheus query so that buckets do not turn into one-pixel lines.
The result is a very convenient view of the status of multiple objects. You can see both the current status (the rightmost buckets) and how each object's status changed over time.
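The color rule for a bucket holding several values can be sketched like this. The color table and its ordering are invented for the example; in the plugin they come from the Colors tab, and the actual implementation is not shown here.

```python
# Hypothetical color table. Order matters: entries listed earlier
# take priority, like entries placed higher on the Colors tab.
COLORS = [
    (4, "red"),     # fail
    (3, "blue"),    # no water
    (2, "brown"),   # no beans
    (1, "gray"),    # off
    (0, "green"),   # ok
]

def bucket_color(values, colors=COLORS, empty_color=None):
    """Pick the color of the highest-priority value present in the
    bucket; return empty_color for a bucket with no known values."""
    for value, color in colors:
        if value in values:
            return color
    return empty_color

print(bucket_color([0, 1]))     # gray
print(bucket_color([0, 3, 4]))  # red
print(bucket_color([]))         # None
```

The tooltip would then list every value in the bucket, while the cell itself shows only the most significant one.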

Where to get it?

The source code of the Grafana Statusmap plugin is distributed under the free MIT license (like other Grafana plugins). At the moment it is available on our GitHub. We sincerely hope that it will soon appear in the Grafana plugin repository as well.
And finally, an illustration of how Statusmap helps to visualize pod statuses from a production Kubernetes cluster.