The Children Of Zabbix: A Ripple In Time
I encountered an intriguing challenge while first learning to configure Zabbix, and I wanted to share the experience as a follow-up to my previous article, "The Zabbix Investment."
The Zabbix systemctl documentation instructs you to create an item that gathers systemctl data for a specific service, along with a dependent child item that extracts the active state from the parent item. At this stage, a trigger can be set up to compare the active state of the service against the desired state of the service (running). If the service is not running, an alert will be sent.
While following the guide to test this alert, I ran into an issue where the trigger expressions I had previously implemented on other items were not functioning as expected. My frustration stemmed from the fact that while others offered suggestions to get this alert functioning, there was a lack of low-level explanations for why it wasn't working. I will clarify this here.
When evaluating data points for an item in Zabbix, you can use a trigger expression to assess whether the last data point gathered is not equal to the desired value. If this condition is met, the trigger activates immediately, which in practice means a notification every time a service is restarted. It is better to alert only if the service has been down for a set amount of time.
To avoid this undesired notification scenario, you can add a second condition using an "and" operator. This requires that both the last data point and the first data point over a 3-minute period differ from the desired state before an alert is triggered. Therefore, if a service goes down during a routine restart, the alert will not trigger immediately, since the first data point recorded in the 3-minute window was still running. However, if the service remains down three minutes later, the alert will then be triggered.
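The two-condition idea can be sketched as a small Python simulation. This is not Zabbix code and not its trigger syntax: the helper names, the "running"/"stopped" labels, and the half-open time window are assumptions made for illustration, and the item is assumed to store a value on every 60-second poll.

```python
from typing import List, Optional, Tuple

Sample = Tuple[int, str]  # (timestamp in seconds, state label)


def last(samples: List[Sample], now: int) -> Optional[str]:
    """Most recent stored value at or before `now`."""
    values = [v for t, v in samples if t <= now]
    return values[-1] if values else None


def first(samples: List[Sample], now: int, window: int) -> Optional[str]:
    """Oldest stored value inside the window (now - window, now]."""
    values = [v for t, v in samples if now - window < t <= now]
    return values[0] if values else None


def should_alert(samples: List[Sample], now: int,
                 desired: str = "running", window: int = 180) -> bool:
    newest = last(samples, now)
    oldest = first(samples, now, window)
    if newest is None or oldest is None:
        return False  # simplification: with no data, do not fire
    # Both the newest and the oldest value in the window must be wrong.
    return newest != desired and oldest != desired


# Polled every 60 s; the service stops at t=300 and stays down.
stays_down = [(t, "running" if t < 300 else "stopped") for t in range(0, 601, 60)]
# Same item, but the service is only down for the single poll at t=300.
quick_restart = [(t, "stopped" if t == 300 else "running") for t in range(0, 601, 60)]

print(should_alert(stays_down, now=300))   # False: window still starts with "running"
print(should_alert(stays_down, now=420))   # True: every value in the window is "stopped"
print(any(should_alert(quick_restart, now=t) for t in range(0, 601, 60)))  # False
```

The sketch only behaves this way because a value is stored on every poll, so the window always contains older values to compare against.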
After implementing this configuration for the child item that extracts the active state of a systemctl service from the parent, you might anticipate that the trigger would activate only if the systemctl status remains not running for three minutes. However, it actually triggers immediately. So, what’s the difference between the other items where I had applied this technique successfully and the current configuration? Let's dig a little deeper.
I have a parent item with a check interval of one minute. When I go into Monitoring and look at the latest data, I can see that the time since the last check for the parent item never exceeds 1 minute, while the time since the last check for the dependent child item is significantly higher.
So, how do I set the check interval for the dependent child item to 1 minute or synchronize it with the parent check? As it turns out, the interval is inherited from the parent item. However, if that's the case, why isn't the "first data point" condition preventing an instant alert in this situation?
After stopping the systemctl service, I observed that the last check time for the child item updated once, when the state changed, and then continued to climb past 1 minute. This led me to realize that the dependent item inherits its check interval from the parent item but, in my setup, only records a data point when the value changes (the behavior you would expect from a "Discard unchanged" preprocessing step on the item). This means that if the service was started yesterday and has been running since, the only data point in the past three minutes is the change from the running state to the stopped state. The last data point and the first data point in the window are therefore the same data point, both conditions are true at once, and the trigger fires immediately.
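A minimal Python sketch of this effect, under the assumption that the child item stores a value only when it differs from the previously stored one. The function names, the "running"/"stopped" labels, and the half-open window are illustrative, not Zabbix internals.

```python
from typing import List, Optional, Tuple

Sample = Tuple[int, str]  # (timestamp in seconds, state label)


def store_on_change(polled: List[Sample]) -> List[Sample]:
    """Keep only samples whose value differs from the last stored value."""
    stored: List[Sample] = []
    for t, v in polled:
        if not stored or stored[-1][1] != v:
            stored.append((t, v))
    return stored


def last(samples: List[Sample], now: int) -> Optional[str]:
    """Most recent stored value at or before `now`."""
    values = [v for t, v in samples if t <= now]
    return values[-1] if values else None


def first(samples: List[Sample], now: int, window: int) -> Optional[str]:
    """Oldest stored value inside the window (now - window, now]."""
    values = [v for t, v in samples if now - window < t <= now]
    return values[0] if values else None


# The parent is polled every 60 s; the service stops at t=300.
polled = [(t, "running" if t < 300 else "stopped") for t in range(0, 601, 60)]
child = store_on_change(polled)

print(child)                          # [(0, 'running'), (300, 'stopped')]
print(last(child, now=300))           # stopped
print(first(child, now=300, window=180))  # stopped: the same single data point
```

With only one data point inside the window, the "last" and "first" checks look at the same value, so the 3-minute grace period disappears.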
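To fix this, I use the last data expression combined with the "no data" expression (using a time period of 3 minutes). Because the child item only records a data point when the state changes, three minutes without new data after a stop means the state has not changed for three minutes. This way, if the service is not running and three minutes elapse, the alert will trigger. However, if the service recovers before reaching the three-minute threshold, the last data point will be running again, the "not equal" condition will no longer hold, and no alert will be sent.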
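The combined "last value" and "no data" check can be sketched in Python as follows. As before, this is a simulation under assumptions rather than Zabbix trigger syntax: the stored samples are written out by hand as a change-only history, and the helper names and state labels are hypothetical.

```python
from typing import List, Optional, Tuple

Sample = Tuple[int, str]  # (timestamp in seconds, state label)


def last(samples: List[Sample], now: int) -> Optional[str]:
    """Most recent stored value at or before `now`."""
    values = [v for t, v in samples if t <= now]
    return values[-1] if values else None


def nodata(samples: List[Sample], now: int, window: int) -> bool:
    """True when nothing was stored inside the window (now - window, now]."""
    return not any(now - window < t <= now for t, _ in samples)


def should_alert(samples: List[Sample], now: int,
                 desired: str = "running", window: int = 180) -> bool:
    newest = last(samples, now)
    # Fire only if the state is wrong AND it has not changed for `window` seconds.
    return newest is not None and newest != desired and nodata(samples, now, window)


# Change-only history: the service stops at t=300 and stays down.
stays_down = [(0, "running"), (300, "stopped")]
# Change-only history: the service stops at t=300 and recovers at t=360.
recovers = [(0, "running"), (300, "stopped"), (360, "running")]

print(should_alert(stays_down, now=300))  # False: the change was just recorded
print(should_alert(stays_down, now=420))  # False: only 2 minutes without data
print(should_alert(stays_down, now=480))  # True: 3 minutes with no change
print(should_alert(recovers, now=600))    # False: last value is "running" again
```

A service that has been running untouched for days also produces no data, but its last value equals the desired state, so the first half of the condition keeps the alert quiet.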
When implementing Zabbix, you may encounter a variety of challenges that require thoughtful planning and problem-solving. These challenges often stem from the platform's highly customizable nature, which, while being one of its greatest strengths, also introduces a level of complexity. Designing an effective monitoring solution with Zabbix demands meticulous attention to detail, as even small oversights during the planning and configuration stages can lead to issues down the line. From setting up the architecture to fine-tuning triggers, templates, and alerts, every aspect of the implementation process requires a thorough understanding of both the system and the specific needs of your environment. Proper preparation and a methodical approach are essential to fully leverage Zabbix's capabilities while ensuring a reliable and efficient monitoring solution.