An introduction to the author
The author of this article is Wucheng (contact: email@example.com), who is mainly responsible for the development and management of the Autohome cloud platform. Personal blog: http://jackywu.github.io/
We are the Autohome operations team, the core team in the Autohome technology department, made up of OP and dev engineers. Our goal is to build a high-performance, highly scalable, low-cost, stable and reliable website infrastructure platform for the Autohome group. The team technology blog is at http://autohomeops.corpautohome.com.
You can reach us by email or by leaving a message on the official technology blog.
Every company has a monitoring system, big or small, simple or complex. Autohome’s monitoring system started with Zabbix, moving from cold and hot standby, to a distributed deployment built on Zabbix Proxy, to cross-datacenter disaster recovery, much the same path as anyone would expect.
So why did we later build a new monitoring system?
Zabbix has several shortcomings:
It relies heavily on its database, so it cannot support monitoring of very large clusters.
The Zabbix Server is not very modular, so secondary development is inflexible.
We hope to have a distributed and more extensible system architecture.
3.2 Our design ideas
Our design goals for the system are:
Fault self-healing
Product positioning of the system
The operations team provides the monitoring system and is responsible for its stability.
Business operations engineers can help users use the system, focusing on configuring basic monitoring and receiving alarms.
The business department can use the system on its own, focusing on configuring business monitoring and receiving alarms.
Based on our research into monitoring systems, we concluded that we need an architecture with the following components:
Agent: responsible for collecting data.
Transfer: responsible for forwarding the collected data to different backends. The backend includes: parser, storage, and so on.
Storage: historical data storage.
Dashboard: responsible for monitoring policy configuration and monitoring data viewing.
Detector: responsible for detecting whether the data reported by Agent indicates an anomaly.
Analyzer: responsible for analyzing the abnormal events generated by Detector, applying some logic, and sending an alarm.
Sender: receives messages from Analyzer and sends alarm notifications.
Processor: responsible for automatic handling of abnormal events, that is, “self healing”.
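The data flow between these components can be sketched in a few lines of Python. This is only an in-process illustration under our own assumptions, not the actual implementation: the class names mirror the component names above, but the fixed threshold rule, the sample metric "cpu.busy", and the host "web-01" are hypothetical, and Analyzer and Processor are left out for brevity.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Metric:
    endpoint: str   # host that reported the value
    name: str       # monitoring item, e.g. "cpu.busy" (hypothetical)
    value: float
    timestamp: int


@dataclass
class Storage:
    """Keeps historical data (a list stands in for a real time-series store)."""
    history: List[Metric] = field(default_factory=list)

    def handle(self, metric: Metric) -> None:
        self.history.append(metric)


@dataclass
class Sender:
    """Delivers alarm notifications (collected in memory here)."""
    sent: List[str] = field(default_factory=list)

    def notify(self, message: str) -> None:
        self.sent.append(message)


@dataclass
class Detector:
    """Flags a metric as abnormal when it exceeds a fixed, assumed threshold."""
    threshold: float
    sender: Sender

    def handle(self, metric: Metric) -> None:
        if metric.value > self.threshold:
            # In the full design an Analyzer sits between Detector and Sender.
            self.sender.notify(f"{metric.endpoint} {metric.name}={metric.value}")


@dataclass
class Transfer:
    """Forwards every metric received from Agent to all registered backends."""
    backends: List[Callable[[Metric], None]]

    def push(self, metric: Metric) -> None:
        for backend in self.backends:
            backend(metric)


sender = Sender()
storage = Storage()
detector = Detector(threshold=90.0, sender=sender)
transfer = Transfer(backends=[storage.handle, detector.handle])

# Agent's role: collect values and hand them to Transfer.
transfer.push(Metric("web-01", "cpu.busy", 42.0, 1000))
transfer.push(Metric("web-01", "cpu.busy", 97.5, 1060))
```

Both samples end up in Storage, while only the second one crosses the threshold and reaches Sender. Keeping Transfer unaware of what each backend does is what allows new backends to be added without touching Agent.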
3.3 Our implementation strategy
With this architecture in mind, we started our investigation. We could have built our own system from very basic building blocks such as collectd, statsd, MySQL, and HBase, or from other similar open source components. We then found that, at almost the same time, Xiaomi had open-sourced its own monitoring system, Open-Falcon. Its design ideas were very close to ours, so we chose to do secondary development on top of it.
3.4 Product design
This section describes the differences between Open-Falcon and our product.
A. Service tree
We built our own service tree according to the organizational structure of the company and the organizational relationships between business systems. The organizational structure and the server information are synchronized automatically with the CMDB.
Based on how our company’s business is organized and on past usage habits, we built our own Dashboard.
We abstracted the concepts of “business” and “function module” to represent “systems” and “subsystems”, or equivalently “top-level groups” and “second-level groups”. Open-Falcon supports only first-level groupings.
A “function module” is associated with a host group, an “alarm strategy”, and an “alarm template”.
In the “alarm strategy”, we implemented “alarm escalation” and “delayed alarms”, and we merge host alarms by “function module”.
On the page, we provide self-service subscription to the host alarms within a “function module”.
We will share this content separately.
C. Alarm judgment expression
We modified Judge and added functions that compare against yesterday’s data:
daydiff(#3) > 1: the difference between yesterday and today is greater than 1.
daypdiff(#3) > 1: the difference between yesterday and today is greater than 1%.
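To make the two functions more concrete, here is a small Python sketch of one possible reading. The exact semantics in our modified Judge are not spelled out above, so everything here is an assumption: that "#3" means the latest three points, that each point is compared with the point at the same time yesterday, that the absolute difference is used, and that the rule fires only when all compared points exceed the threshold.

```python
from typing import List, Sequence


def daydiff(today: Sequence[float], yesterday: Sequence[float], n: int) -> List[float]:
    """Absolute difference between each of the latest n points and yesterday's.

    Assumes both series are aligned, so the last n points of `yesterday`
    were taken at the same clock times as the last n points of `today`.
    """
    return [abs(t - y) for t, y in zip(today[-n:], yesterday[-n:])]


def daypdiff(today: Sequence[float], yesterday: Sequence[float], n: int) -> List[float]:
    """Same comparison, expressed as a percentage of yesterday's value."""
    # Points where yesterday's value is 0 are skipped to avoid dividing by zero.
    return [
        abs(t - y) / abs(y) * 100.0
        for t, y in zip(today[-n:], yesterday[-n:])
        if y != 0
    ]


def triggered(diffs: Sequence[float], threshold: float) -> bool:
    """Assumed rule: fire only if every compared point exceeds the threshold."""
    return bool(diffs) and all(d > threshold for d in diffs)


# Hypothetical sample data: yesterday was flat, today is rising.
today = [100.0, 104.0, 110.0, 120.0]
yesterday = [100.0, 100.0, 100.0, 100.0]

fires_abs = triggered(daydiff(today, yesterday, 3), 1)    # daydiff(#3) > 1
fires_pct = triggered(daypdiff(today, yesterday, 3), 5)   # daypdiff(#3) > 5
```

With this data the absolute rule fires, because all three latest points differ from yesterday by more than 1, while the 5% rule does not, because the oldest of the three points differs by only about 4%.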
D. Alarm notification strategy
To solve the problem of alarm bombing, we added an “alarm escalation mechanism”: when an alarm occurs, a notification is sent to “group 1” immediately; if the alarm is still active after X minutes, notifications are sent to “group 2”, and so on. The groups and the interval can be customized.
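The escalation rule can be sketched as a function of how long the alarm has been active. This is a minimal illustration, not our production code: the function name and its parameters are hypothetical, a single fixed interval is assumed between levels, and the sketch returns only the group for the current level because the text above does not say whether earlier groups keep receiving notifications.

```python
from typing import List


def escalation_target(elapsed_minutes: float, groups: List[str], interval_minutes: float) -> str:
    """Return the group to notify for an alarm that is still active.

    Level 0 (the first group) is used from minute 0; each further
    `interval_minutes` moves one level up, capped at the last group.
    """
    if not groups:
        raise ValueError("at least one notification group is required")
    if interval_minutes <= 0 or elapsed_minutes < 0:
        raise ValueError("interval must be positive and elapsed time non-negative")
    level = int(elapsed_minutes // interval_minutes)
    return groups[min(level, len(groups) - 1)]


groups = ["group 1", "group 2", "group 3"]
first = escalation_target(0, groups, 10)      # alarm has just occurred
second = escalation_target(10, groups, 10)    # still active after X = 10 minutes
last = escalation_target(120, groups, 10)     # long-running alarm stays at the top level
```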
To further reduce alarm flooding within a period of time, we added this mechanism: within one time period, the notification is sent immediately when the alarm first occurs; following a delay principle, later alarms between that first notification and the end of the time period are suppressed, and at the end of the period a single notification with the fault or recovery state is sent. The time period can be customized.
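One way to picture this suppression is a small state machine per alarm. The sketch below is an assumption-laden illustration rather than our actual code: the class name, the "PROBLEM"/"OK" state strings, and the use of integer timestamps in seconds are all hypothetical.

```python
from typing import List, Optional, Tuple


class SuppressionWindow:
    """Send the first alarm at once, then hold back updates until the period ends."""

    def __init__(self, period_seconds: int) -> None:
        self.period = period_seconds
        self.window_start: Optional[int] = None
        self.pending: Optional[str] = None       # latest suppressed state
        self.sent: List[Tuple[int, str]] = []    # (timestamp, state) notifications

    def on_event(self, now: int, state: str) -> None:
        """Handle a new event; `state` is "PROBLEM" or "OK" in this sketch."""
        self.flush(now)
        if self.window_start is None:
            # First event of a period: notify immediately and open the window.
            self.window_start = now
            self.sent.append((now, state))
        else:
            # Inside the window: remember only the most recent state.
            self.pending = state

    def flush(self, now: int) -> None:
        """Close the window once the period has elapsed and send the final state."""
        if self.window_start is None or now - self.window_start < self.period:
            return
        if self.pending is not None:
            self.sent.append((self.window_start + self.period, self.pending))
        self.window_start = None
        self.pending = None


window = SuppressionWindow(period_seconds=300)
window.on_event(0, "PROBLEM")    # sent immediately
window.on_event(60, "PROBLEM")   # suppressed
window.on_event(120, "OK")       # suppressed, becomes the pending final state
window.flush(300)                # period over: the final state is sent
```

In this run only two notifications go out: the initial fault at time 0 and the recovery at the end of the 300-second period.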
To further reduce the alarm flood caused when a group of machines with the same function (or with something else in common) hit the same kind of anomaly at the same time, we added this mechanism: following the same delay principle, at the end of the time period the alarms from machines under the same function module are merged by monitoring item. The time period can be customized.
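The merge step at the end of the period amounts to grouping by function module and monitoring item. The following is a minimal sketch under our own assumptions: the tuple layout, the function names, the message format, and the sample hosts and items are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# (function module, host, monitoring item); the layout is assumed for this sketch.
Alarm = Tuple[str, str, str]


def merge_alarms(alarms: Iterable[Alarm]) -> Dict[Tuple[str, str], List[str]]:
    """Group the hosts that alarmed in one period by (function module, item)."""
    merged: Dict[Tuple[str, str], List[str]] = defaultdict(list)
    for module, host, item in alarms:
        hosts = merged[(module, item)]
        if host not in hosts:  # a host that alarmed repeatedly is listed once
            hosts.append(host)
    return dict(merged)


def render(merged: Dict[Tuple[str, str], List[str]]) -> List[str]:
    """Build one notification line per merged group."""
    return [
        f"[{module}] {item}: {len(hosts)} host(s) affected ({', '.join(hosts)})"
        for (module, item), hosts in merged.items()
    ]


alarms = [
    ("web", "web-01", "cpu.busy"),
    ("web", "web-02", "cpu.busy"),
    ("web", "web-01", "mem.free"),
    ("db", "db-01", "cpu.busy"),
    ("web", "web-01", "cpu.busy"),
]
merged = merge_alarms(alarms)
messages = render(merged)
```

Five raw alarms collapse into three notifications here, one per function module and monitoring item.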
We are also considering open-sourcing this part.
Autohome has a large number of Windows servers. To keep the logical architecture of the system unified, we developed our own Agent (author: ninjadq) in Python, running as a Windows Service.
The differences between this Agent and freedomkk-qfeng/falcon-scripts are:
It supports collecting IIS and SQL Server monitoring items.
It runs as a Windows Service, so no scheduled tasks need to be configured. The Agent operates in the same way as the Go agent on Linux.
Each collector runs in its own thread; because collection happens only periodically and the threads are idle in between, the cost of thread switching is negligible.
The local HTTP proxy interface is implemented.
The difference from LeonZYang/agent is:
There is no need to patch Golang.
With the support of our company, we have open-sourced the code under the Apache license. See “Windows Agent”: https://github.com/AutohomeRadar/Windows-Agent/. We borrowed part of the collection code from “freedomkk-qfeng/falcon-scripts” and thank its author.
PRs and issues are welcome. You are also welcome to contact us via QQ or email.
4. Future roadmap
Event correlation analysis based on temporal dimension and semantics
Visualization of network and business and analysis of abnormal possibility
Dynamic threshold analysis
Monitoring item behavior analysis + dynamic monitoring strategy matching
Multi-datacenter data synchronization and a cold standby plan for Judge, Graph, and other components
Building a monitoring system is a long and arduous process. As long as new demands keep coming, there may never be a day when the work stops. We will stay in close contact with the community, learn from its experience, and contribute back to it. Everyone is welcome to get in touch with us.