Design of event correlation analysis for fault location in a monitoring system

Introduction
About the author
The author of this article is Wucheng (contact: autohomeops@autohome.com.cn), mainly responsible for the development and technical management of the Autohome cloud platform. Personal blog: http://jackywu.github.io/
About the team
We are the Autohome operations team, a core team in the Autohome technology department, made up of both ops and dev engineers. Our goal is to build a high-performance, highly scalable, low-cost, stable, and reliable website infrastructure platform for the Autohome group. Our team technology blog: http://autohomeops.corpautohome.com
Contact
You can reach us by email or by leaving a message on the official technology blog.
1. Preface
After the first milestone of the Autohome monitoring system (http://autohomeops.corpautohome.com/articles/), we implemented the following small features:
URL monitoring
Log monitoring, with error log fragments included in alarm messages
Next, we want to achieve so-called “automatic fault location” to improve the efficiency of problem diagnosis.
2. Ideas
We believe that an abnormal problem must be caused by one or more reasons, and we use an “event” to describe such an anomaly. A surge in the website's QPS is an event; a backend interface's response time growing beyond expectations is an event; an increase in a server's CPU load is an event; a configuration change to a MySQL server is an event; and a business code release is also an event. Identifying these events and the relationships between them is the key to fault location.
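To make the notion of an “event” concrete, here is a minimal sketch of how such an event record could be represented in Python; the field names (source, obj, description, timestamp) are illustrative assumptions, not the actual schema of our event library.

```python
from dataclasses import dataclass, field
import time

@dataclass
class Event:
    """An anomaly or change observed somewhere on the platform.
    Field names are illustrative; the real event schema may differ."""
    source: str       # which system reported it, e.g. "monitoring" or "release-system"
    obj: str          # the affected object, e.g. "A.b" or a hostname
    description: str  # human-readable summary, e.g. "QPS surge"
    timestamp: float = field(default_factory=time.time)

# The events mentioned above, expressed as records:
events = [
    Event("monitoring", "website", "QPS surge"),
    Event("monitoring", "backend-api", "response time above expectation"),
    Event("monitoring", "server-01", "CPU load increase"),
    Event("config-system", "mysql-01", "configuration change"),
    Event("release-system", "A.b", "business code release"),
]
```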
We know of two methods for fault location:
Event correlation analysis
Automated reasoning with decision trees
We are currently practicing the first method.
3. The scheme
Classification of monitoring indicators
To facilitate analysis and fault location, we classify all collected monitoring indicators into layers.
Explanation:
Business layer: indicators at this layer reflect the quality of service, such as the order success rate of an order system.
Application layer: indicators at this layer reflect the running state of application software, such as the number of Nginx connections.
System layer: indicators at this layer reflect the running state of the operating system, such as the load average.
Hardware layer: indicators at this layer reflect the running state of hardware devices, such as CPU temperature.
By layering in this way, we map problems into domains that we understand well.
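As a sketch, this layering can be encoded as an ordered enum so that “lower layer” has a concrete meaning for later ranking; the numeric ordering and the example indicator names are assumptions for illustration.

```python
from enum import IntEnum

class Layer(IntEnum):
    """Indicator layers, ordered from lowest (hardware) to highest (business).
    A smaller value means a lower layer, closer to potential root causes."""
    HARDWARE = 0     # e.g. CPU temperature
    SYSTEM = 1       # e.g. load average
    APPLICATION = 2  # e.g. Nginx connection count
    BUSINESS = 3     # e.g. order success rate

# Example classification of a few indicators:
INDICATOR_LAYER = {
    "order_success_rate": Layer.BUSINESS,
    "nginx_connections": Layer.APPLICATION,
    "load_average": Layer.SYSTEM,
    "cpu_temperature": Layer.HARDWARE,
}
```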
Building a service tree model
The key concepts are “service” and “module”.
Module: a group of servers that provide a certain function; servers with the same function belong to the same module, for example the “caching module” or the “DB module”.
Service: composed of several modules that together provide a service; a “service” is a higher-level abstraction than a “function”, for example the “order splitting service”.
These two concepts lay the foundation for the module call relationships described below.
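A minimal sketch of the service tree (service → modules → servers); the class and field names, and the example service, are assumptions for illustration only.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Module:
    name: str            # e.g. "caching module"
    servers: List[str]   # servers that all provide the same function

@dataclass
class Service:
    name: str            # e.g. "order splitting service"
    modules: List[Module]

# Hypothetical service A composed of modules a, b and c:
service_a = Service("A", [
    Module("A.a", ["a-01.example.local", "a-02.example.local"]),
    Module("A.b", ["b-01.example.local"]),
    Module("A.c", ["c-01.example.local"]),
])
```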
Building module call relationships
Capital letters represent services (e.g., service A), lowercase letters represent modules (e.g., module a), and arrows represent call or result-return relationships.
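One simple way to record these call relationships is an adjacency map from each caller module to the modules it calls; the topology below is purely hypothetical.

```python
# caller module -> modules it calls (hypothetical example topology)
CALL_GRAPH = {
    "A.a": ["A.b"],          # service A's module a calls module b
    "A.b": ["A.c", "B.a"],   # module b calls module c and service B's module a
    "A.c": [],
    "B.a": [],
}
```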
Building a “unified event library”
We believe there are several ways to determine the relationships between events:
Module call relationships, as described above, which are human-defined and deterministic.
Time correlation, a non-deterministic strategy that expresses relative likelihood.
Factual correlation, discovered by analyzing and computing over large volumes of historical data to find correlations between events that have actually occurred together.
First of all, we need a unified “event library” to collect all events.
We believe the factors that cause an object to behave abnormally come from the following sources:
An anomaly in the object itself, such as a damaged hard disk.
An anomaly on the dependency side, such as A.b calling the service provided by A.c while A.c is abnormal.
Changes from external sources, such as a developer upgrading the code of service A.b, a failover to another server, or the network of the local data center being cut off.
For the first point, as long as our four layers of indicators are monitored completely, the situation is controllable. For the second point, we need to fully map out the call relationships of each module. The third point worries us the most, because it is difficult to collect all external events; we use the following methods to work toward this goal.
Set up a “notice center” so that known, deterministic changes are announced through it and recorded in the event library, for example a code upgrade to service A.b.
Establish a common event bus on top of a message queue's Pub/Sub model, so that each system on the platform can publish the important changes it produces in this loosely coupled way. Other systems interested in those changes can consume them selectively; for example, the monitoring system can consume them and write them into the event library.
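The article does not name the specific message queue, so the sketch below uses Redis Pub/Sub (an assumption) with the redis-py client to show the pattern: a system publishes an important change onto the bus, and the monitoring system subscribes and records what it receives into the event library.

```python
import json
import redis  # assumption: any message queue with Pub/Sub works; Redis is used here for brevity

CHANNEL = "event-bus"  # hypothetical channel name
r = redis.Redis(host="localhost", port=6379)

def publish_change(source: str, obj: str, description: str) -> None:
    """Producer side: publish an important change onto the common event bus."""
    payload = {"source": source, "object": obj, "description": description}
    r.publish(CHANNEL, json.dumps(payload))

def consume_into_event_library() -> None:
    """Monitoring-system side: consume bus messages and store them in the event library."""
    pubsub = r.pubsub()
    pubsub.subscribe(CHANNEL)
    for message in pubsub.listen():
        if message["type"] != "message":
            continue
        event = json.loads(message["data"])
        # a real implementation would insert `event` into the unified event library here
        print("recording event:", event)

# Example publication, e.g. a code upgrade to service A.b:
# publish_change("release-system", "A.b", "code upgrade")
```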
Here is how our systems use the event bus:
AutoBank (our resource management system) changes the state of devices (servers, switches, and so on), attaches the work order, records this in its own log as an event, and publishes it to the event bus. As I mentioned in the resource bank talk at OpsWorld 2016, a state change implies a change process, and wherever there is change there is risk.
AutoGear (our service management system) directly changes the software environment on servers, so which software is installed on which server, and how a piece of software's parameters are modified, are encapsulated as events in its own log and published to the event bus.
AutoPush (our code release system) requires whoever is going live to publish an “online bulletin” through the “notice center”. The bulletin is sent to the relevant people via DingTalk, email, or SMS, and is recorded in the event library. AutoPush then reads the release window from the bulletin; during that window release engineers can publish code from the page, and once the window expires AutoPush automatically locks and further releases are forbidden.
Fault location strategy
Our current strategy is:
The further toward the end of the call chain an event is, the more likely it is to be the root cause of the failure.
The lower the layer of the monitoring indicator, the more likely it is to be the root cause of the failure.
Based on the concepts of “module call relationships” and “indicator layering”, we traverse all events recursively and produce a diagnosis report of suspected causes (a sketch follows the example below).
For example:
The interface response time of the “advertisement service” is abnormal (3 s); the suspected cause is that the disk util of server xxx.autohome.cc in the “index module” is abnormal (98%).
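A minimal sketch of that recursion, under the two heuristics above: starting from the module where the alarm fired, walk the call graph, collect the events attached to each module, and rank candidates so that positions deeper in the call chain and lower indicator layers come first. The data structures reuse the illustrative call-graph and layer numbering from the earlier sketches; the real diagnosis logic is certainly richer.

```python
from typing import Dict, List, Tuple

# module -> list of (layer, description) events currently attached to it,
# where layer uses the ordering hardware=0, system=1, application=2, business=3
ModuleEvents = Dict[str, List[Tuple[int, str]]]

def locate_fault(start: str, call_graph: Dict[str, List[str]],
                 events: ModuleEvents) -> List[Tuple[str, int, str]]:
    """Return suspected causes ranked by call depth (deeper first),
    then by indicator layer (lower first)."""
    suspects = []

    def walk(module: str, depth: int) -> None:
        for layer, description in events.get(module, []):
            suspects.append((depth, layer, module, description))
        for callee in call_graph.get(module, []):
            walk(callee, depth + 1)

    walk(start, 0)
    suspects.sort(key=lambda s: (-s[0], s[1]))  # deeper first, then lower layer first
    return [(module, layer, desc) for _, layer, module, desc in suspects]

# Example mirroring the diagnosis above: the advertisement service's entry module
# is slow, and a disk on an index-module server is at 98% util.
call_graph = {"ad.front": ["ad.index"], "ad.index": []}
events = {
    "ad.front": [(2, "interface response time 3 s")],    # application layer
    "ad.index": [(1, "xxx.autohome.cc disk util 98%")],  # system layer
}
print(locate_fault("ad.front", call_graph, events)[0])
# -> ('ad.index', 1, 'xxx.autohome.cc disk util 98%')
```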
Preserving the scene
Business layer indicators directly reflect the quality of service and best represent the user experience, so when a business layer anomaly occurs we run a snapshot script on the affected servers through SaltStack's remote execution channel to save a snapshot of each server's current running state. It is essentially a standardized packaging of the routine commands one would otherwise run after logging in to the server once a problem has occurred. The script does the following (a sketch of such a script follows the list):
Save the output of uptime
Save the top 10 processes by CPU usage
Save the output of free
Save the top 10 processes by memory usage
Save the output of df
Save the output of ip addr show
Save the routing table
Save the local DNS configuration
Save the output of dig autohome.com.cn
Save the output of ping -c 3 autohome.com.cn
Save a 10-second tcpdump packet capture
And so on
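A minimal sketch of such a snapshot script; in practice it would be pushed out and executed through SaltStack's remote execution channel, and the exact commands and output directory are approximations of the list above (tcpdump and dig require the corresponding tools and privileges).

```python
import subprocess
import time
from pathlib import Path

# Commands approximating the list above; adjust paths and options to taste.
COMMANDS = {
    "uptime": "uptime",
    "top_cpu": "ps aux --sort=-%cpu | head -n 11",  # top 10 processes by CPU (plus header line)
    "free": "free -m",
    "top_mem": "ps aux --sort=-%mem | head -n 11",  # top 10 processes by memory
    "df": "df -h",
    "ip_addr": "ip addr show",
    "route": "ip route show",
    "dns": "cat /etc/resolv.conf",
    "dig": "dig autohome.com.cn",
    "ping": "ping -c 3 autohome.com.cn",
    "tcpdump": "timeout 10 tcpdump -w tcpdump.pcap",  # 10-second packet capture
}

def take_snapshot(base_dir: str = "/var/tmp/snapshots") -> Path:
    """Run each command and save its output under a timestamped directory."""
    out_dir = Path(base_dir) / time.strftime("%Y%m%d-%H%M%S")
    out_dir.mkdir(parents=True, exist_ok=True)
    for name, cmd in COMMANDS.items():
        result = subprocess.run(cmd, shell=True, capture_output=True,
                                text=True, cwd=out_dir)
        (out_dir / f"{name}.txt").write_text(result.stdout + result.stderr)
    return out_dir

if __name__ == "__main__":
    print("snapshot saved to", take_snapshot())
```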
Meanwhile, on the monitoring system's “latest problems” page, clicking “live snapshot” presents the information above directly on the page, and clicking “historical data” shows the historical curves for the 30 minutes before and after the time of the problem, including CPU, memory, disk, IO, network traffic, and so on, helping operations engineers locate problems quickly.
Log tracking
Building a log analysis system on ELK meets the needs of fault location in the following two ways (a query sketch follows the list):
Searching the logs of application B on server A within a given time range, which can be jumped to directly from the problem page above.
Tracing the full path of a single user request by searching for its TraceID.
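A minimal sketch of the TraceID search against Elasticsearch's REST _search API over plain HTTP; the host, index pattern, and field names (trace_id, @timestamp, message) are assumptions and will differ in a real ELK deployment.

```python
import json
import requests

ES_URL = "http://elk.example.local:9200"  # hypothetical Elasticsearch endpoint
INDEX = "app-logs-*"                      # hypothetical index pattern

def search_by_trace_id(trace_id: str, size: int = 100) -> list:
    """Return the log entries of a single user request, ordered by time."""
    query = {
        "size": size,
        "sort": [{"@timestamp": {"order": "asc"}}],
        "query": {"term": {"trace_id": trace_id}},  # field name is an assumption
    }
    resp = requests.post(f"{ES_URL}/{INDEX}/_search",
                         headers={"Content-Type": "application/json"},
                         data=json.dumps(query), timeout=10)
    resp.raise_for_status()
    return [hit["_source"] for hit in resp.json()["hits"]["hits"]]

# Example usage:
# for entry in search_by_trace_id("3f9c0a7e"):
#     print(entry.get("@timestamp"), entry.get("message"))
```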
4. Summary
Building a monitoring system is a very heavy task, and what is described above is only part of the event correlation analysis work. The next step is to develop “decision reasoning” to improve the accuracy of fault location and lay the foundation for subsequent fault self-healing. If anything in this article is wrong, you are welcome to point it out.
