Presto in practice: why and how we use it

Reasons for use:
As big data developers, we see BI colleagues querying all kinds of data in Hive every day, and more and more reporting business runs on Hive. Although our CDH cluster has Impala deployed, most of our Hive tables use the ORC format, which Impala supports poorly. Before Presto, querying Hive from the big data side meant going through MapReduce, so query efficiency was low.
In the early days of Presto, operations set up a 3-node trial environment; we queried Hive through Presto, tried some simple SQL queries, and found the efficiency very good.
Our existing solutions could not query historical and real-time data at the same time. At one point BI asked the big data team to solve interactive queries spanning Hive and MySQL. Within the team we tested Spark, Presto, and other tools, and found Presto the easiest to use and also a good fit for BI colleagues. After internal deliberation, we decided to promote Presto vigorously.
We built a Presto cluster on the Hadoop nodes, configured with 1 coordinator and 3 workers. As Presto usage grew, it has since been expanded to 7 workers, and worker memory has been raised from the original 8 GB to 24 GB.
Presto introduction:
Presto is an open source distributed SQL query engine suited to interactive analytical queries over massive data. It mainly addresses the slowness of interactive analysis on commercial data warehouses. It supports standard ANSI SQL, including complex queries, aggregation, joins, and window functions.
Presto supports online data query over Hive, Cassandra, relational databases, and proprietary data stores. A single Presto query can merge data from multiple sources, allowing analysis across the entire organization.
Working principle:
Presto's execution model is fundamentally different from Hive or MapReduce. Hive translates a query into multi-stage MapReduce tasks that run one after another, each task reading its input from disk and writing intermediate results back to disk. Presto does not use MapReduce: it uses a custom query execution engine with operators designed to support SQL semantics. Beyond an improved scheduling algorithm, all data processing happens in memory, with the processing stages forming a pipeline over the network. This avoids unnecessary disk reads and writes and the extra latency they add. The pipelined execution model runs multiple stages concurrently and streams data from one stage to the next as soon as it is available, which greatly reduces the end-to-end response time of most queries.
Use cases:
1. Ordinary Hive queries: more and more colleagues query Hive through Presto; compared with MapReduce-based Hive queries, efficiency is greatly improved.
2. The data platform queries Hive through Presto for business displays.
3. Unions across Cobar shards: cobarc has 5 shards, so its data lives in 5 databases; querying cobarc used to mean querying the 5 shards separately and merging the results by hand, which was inefficient. cobarb has 8 databases with the same problem. Through Presto, the 5 shards can be combined with a single UNION, improving efficiency and simplifying the query.
4. Interaction between Hive and MySQL: in the past, Hive's historical data and cobarc's real-time data could not be joined, so you could only query the previous day's data or only the current day's. With Presto, Hive tables and cobarc tables can be joined together (see the sketch after this list).
5. Mapping Kafka topics into tables: some topics in our existing Kafka carry structured JSON data; through Presto the JSON can be mapped into a table and queried directly with SQL, though this has not been applied to a real scene yet.
6. At present, web tools such as Hue and Zeppelin are used to operate Presto, with support for exporting result sets.
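To make use case 4 concrete, here is a minimal sketch of a federated Hive/MySQL join issued through Presto from Python, using the presto-python-client package. The coordinator host, catalog names, and table names are assumptions for illustration, not our real schema.

import prestodb

conn = prestodb.dbapi.connect(
    host='presto-coordinator.example.com',  # hypothetical coordinator host
    port=8080,
    user='bi_user',
    catalog='hive',
    schema='default',
)
cur = conn.cursor()

# Join historical rows in Hive against real-time rows in MySQL.
cur.execute("""
    SELECT h.order_id, h.amount, m.status
    FROM hive.dw.orders_history AS h
    JOIN mysql.cobarc.orders_realtime AS m
      ON h.order_id = m.order_id
""")
for row in cur.fetchall():
    print(row)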

Using Sidecar to introduce Node.js into Spring Cloud

Theory
Brief introduction
Spring Cloud is currently a popular microservice solution. It combines the development convenience of Spring Boot with the rich solutions of Netflix OSS. As we all know, Spring Cloud, unlike Dubbo, builds its whole service system on REST services over HTTP(S).
Is it possible to develop some of those REST services in a non-JVM language, such as the Node.js we are familiar with? Of course. However, a bare REST service cannot join the Spring Cloud system: we also want to use the Eureka provided by Spring Cloud for service discovery, Config Server for configuration management, and Ribbon for client-side load balancing. This is where Spring Cloud's Sidecar can show its talents.
Sidecar originated from Netflix Prana. It provides an HTTP API for querying all instances (host, port, and so on) of a registered service. It also proxies service calls through an embedded Zuul proxy, which gets its routes from Eureka. Spring Cloud Config Server can be reached directly or through the Zuul proxy.
One thing to be aware of: the Node.js application you develop must implement a health-check interface so that Sidecar can report the health of this service instance to Eureka.
To use Sidecar, create a Spring Boot application annotated with @EnableSidecar. Let's look at what this annotation does.
@EnableCircuitBreaker
@EnableDiscoveryClient
@EnableZuulProxy
@Target(ElementType.TYPE)
@Retention(RetentionPolicy.RUNTIME)
@Documented
@Import(SidecarConfiguration.class)
public @interface EnableSidecar {
}
See: the Hystrix circuit breaker, Eureka service discovery, and the Zuul proxy are all switched on by it.
Health check
Next, we add sidecar.port and sidecar.health-uri to application.yml. The sidecar.port property is the port the Node.js application listens on, which lets Sidecar register the service in Eureka. sidecar.health-uri is a URI that imitates a Spring Boot application's health-indicator endpoint; it must return a JSON document of the following form:
{
  "status": "UP"
}
The application.yml of the whole Sidecar application is as follows:
server:
  port: 5678
spring:
  application:
    name: sidecar
sidecar:
  port: 8000
  health-uri: http://localhost:8000/health.json
Service access
After building this application, you can use the /hosts/{serviceId} API to get the result of DiscoveryClient.getInstances(). Below is an example of /hosts/customers returning two instances on different hosts. If Sidecar runs on port 5678, the Node.js application can reach this API at http://localhost:5678/hosts/{serviceId}.
/hosts/customers
[
  {
    "host": "myhost",
    "port": 9000,
    "uri": "http://myhost:9000",
    "serviceId": "CUSTOMERS",
    "secure": false
  },
  {
    "host": "myhost2",
    "port": 9000,
    "uri": "http://myhost2:9000",
    "serviceId": "CUSTOMERS",
    "secure": false
  }
]
The Zuul proxy automatically adds a /<serviceId> route for every service registered in Eureka, so the customers service can be reached via the /customers URI. Again assuming Sidecar listens on port 5678, our Node.js application can reach the customers service at http://localhost:5678/customers.
Config Server
If we run Config Server and register it in Eureka, the Node.js application can access it through the Zuul proxy. If Config Server's serviceId is configserver and Sidecar listens on port 5678, Config Server can be reached at http://localhost:5678/configserver. This works, thanks to Eureka, because Config Server exposes a REST interface over HTTP.
The Node.js application can also use Config Server's capabilities to fetch configuration documents, for example in YAML format. A request to http://sidecar.local.spring.io:5678/configserver/default-master.yml might return a YAML document like the following:
eureka:
  client:
    serviceUrl:
      defaultZone: http://localhost:8761/eureka/
  password: password
info:
  description: Spring Cloud Samples
  url: https://github.com/spring-cloud-samples
So the overall architecture of a Node.js application joining a Spring Cloud microservice cluster through Sidecar looks roughly like this:
Demo practice
Let's suppose there is a very simple data structure called User:
class User {
    private Long id;
    private String username;
    private Integer age;
}
Looks very classic, ha!
Another data structure represents books, Book:
class Book {
    private Long id;
    private Long authorId;
    private String name;
    private String publishDate;
    private String des;
    private String isbn;
}
The authorId in Book corresponds to a User's id. Now we need to develop REST services for these two data structures.
First, the User service, developed with Spring: in the controller's constructor we mock some fake user data, then expose a very simple GET endpoint that returns a user by id.
@GetMapping("/{id}")
public User findById(@PathVariable Long id) { ... }
After starting it, we test with curl:
curl localhost:8720/12
{"id": 12, "username": "user12", "age": 16}
Next, we use Node.js to develop the Book-related interfaces.
The Node.js community is very active, so there are plenty of REST frameworks to choose from: the mainstream ones are Express, Koa, hapi, and so on, plus very lightweight, easily extended ones like connect. Weighing user base and documentation richness, I chose Express to build this REST service that will connect to Spring Cloud.
const express = require('express')
const faker = require('faker/locale/zh_CN')
const logger = require('morgan')
const services = require('./service')

const app = express()

let count = 100
const books = new Array(count)
while (count > 0) {
  books[count] = {
    id: count,
    name: faker.name.title(),
    authorId: parseInt(Math.random() * 100) + 1,
    publishDate: faker.date.past().toLocaleString(),
    des: faker.lorem.paragraph(),
    isbn: `ISBN 000-0000-00-0`
  }
  count--
}

app.use(logger('combined'))

// service health-check endpoint for the sidecar
app.get('/health', (req, res) => {
  res.json({
    status: 'UP'
  })
})

app.get('/book/:id', (req, res, next) => {
  const id = parseInt(req.params.id)
  if (isNaN(id)) {
    return next()
  }
  res.json(books[id])
})
// ...
/ /…
First we use faker to mock 100 books, then we write a simple GET route.
After startup, we visit http://localhost:3000/book/1 in a browser.
Now that we have two microservices, we next launch a Sidecar instance to connect Node.js into Spring Cloud.
@SpringBootApplication
@EnableSidecar
public class SidecarApplication {
    public static void main(String[] args) {
        SpringApplication.run(SidecarApplication.class, args);
    }
}
Very simple. Note that you need a eureka-server running beforehand; and to test Sidecar's ability to proxy access to Spring Cloud Config, I also use a config-server. Readers familiar with Spring Cloud will know these pieces.
In Sidecar's configuration, bootstrap.yaml simply specifies the service port and the config-server address, and the node-sidecar.yaml configuration is as follows:
eureka:
  client:
    serviceUrl:
      defaultZone: ${EUREKA_SERVICE_URL:http://localhost:8700/eureka/}
sidecar:
  port: 3000
  home-page-uri: http://localhost:${sidecar.port}/
  health-uri: http://localhost:${sidecar.port}/health
hystrix:
  command:
    default:
      execution:
        timeout:
          enabled: false
Here we point Sidecar at the address of the Node.js service. hystrix.command.default.execution.timeout.enabled: false is there because Sidecar wraps calls in Hystrix's default timeout circuit breaker; given the speed of access to GitHub from here, as you know, my config-server calls often timed out during testing, so I disabled the timeout. You could instead simply lengthen the timeout.
With eureka-server, config-server, user-service, node-sidecar, and node-book-service all started, we open the Eureka main page at http://localhost:8700/:
All our services are in the UP state, indicating everything is normal. Next, look at the console of the Node.js application:
Traffic is already coming in, and the path accessed is /health. Obviously this is node-sidecar checking the health of our Node application.
Now comes the moment to witness the miracle. We curl port 8741 of the Sidecar:
curl localhost:8741/user-service/12
{"id": 12, "username": "user12", "age": 16}
The result is identical to accessing user-service directly, showing that Sidecar's Zuul proxy can forward our request to the user-service service.
With this proxy in place, we want the book service to expose an author-information endpoint:
const SIDECAR = {
  uri: 'http://localhost:8741'
}
const USER_SERVICE = 'user-service'

// call user-service through the sidecar's Zuul proxy (lives in ./service)
const getUserById = (id) => fetch(`${SIDECAR.uri}/${USER_SERVICE}/${id}`).then((resp) => resp.json())

app.get('/book/:bookId/author', (req, res, next) => {
  const bookId = parseInt(req.params.bookId)
  if (isNaN(bookId)) {
    return next()
  }
  const book = books[bookId]
  if (book) {
    let uid = book.authorId
    services.getUserById(uid).then((user) => {
      if (user.id) {
        res.json(user)
      } else {
        throw new Error('user not found')
      }
    }).catch((error) => next(error))
  }
})

// given a uid, return all books whose authorId equals that uid
app.get('/books', (req, res, next) => {
  const uid = req.query.uid
  res.json(books.filter((book) => book.authorId == uid))
})
When we visit http://localhost:3000/book/2/author, we get back the author information for bookId 2. But there is a problem: we cannot reach the Node.js endpoints through http://localhost:8741/node-sidecar/book/1 the way we proxied user-service. So how does user-service get at book-service's data? Looking back at the theory section, we can reach any service's instances through /hosts/<serviceId>.
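For completeness, here is a minimal sketch of that /hosts lookup, written in Python purely to illustrate the HTTP flow (any language works the same way); the sidecar address and serviceId follow the assumptions of this article.

import random
import requests

SIDECAR = 'http://localhost:8741'

def get_instance(service_id):
    # Ask the sidecar for all registered instances of the service.
    instances = requests.get('%s/hosts/%s' % (SIDECAR, service_id)).json()
    return random.choice(instances)  # naive client-side load balancing

inst = get_instance('node-book-service')
book = requests.get('%s/book/1' % inst['uri']).json()
print(book)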

The data bank: automated asset data acquisition

Introduction
About the author
The author of this article is Wu Cheng (contact: autohomeops@autohome.com.cn), mainly responsible for the development and technical management of the Autohome cloud platform. Personal blog: http://jackywu.github.io/
Team Introduction
We are the Autohome operation and maintenance team, a core team in the Autohome technology department, composed of OPs and devs. Our goal is to build a high-performance, highly scalable, low-cost, stable and reliable website infrastructure platform for the Autohome group. Our team technology blog is at http://autohomeops.corpautohome.com.
Contact information
You can reach us via email or by leaving a message on the official technology blog.
1. Preface
At the OpsWorld conference in Shenzhen in December 2016, we shared "The operation and maintenance data bank: how we build our CMDB", which mentioned our internal asset data acquisition tool and our intention to open-source it. We have now shared it on GitHub (Assets_Report).
2. How it works
The tool is implemented on top of the Facter mechanism, and the collected results are reported to the AutoBank resource management system through a Puppet report processor.
Below is the workflow between Puppet's server and agent.
When the agent sends the request for its catalog, it reports its own facts to the master.
We developed our own report processor, assets_report, which posts the facts to AutoBank's data storage interface over HTTP.
Readers interested in developing custom facts can refer to fact_overview and the custom facts documentation.
3. Features
Compared with Facter's built-in facts, this plug-in provides more hardware data, such as:
CPU count and model
Memory capacity, serial number, manufacturer, and slot position
The IP, netmask, MAC address, and type of each address bound to a network card, including the case of multiple IPs bound to one card
RAID card count, model, memory capacity, and RAID level
Disk count, capacity, serial number, vendor, attached RAID card, and slot position
Operating system type and version
Server vendor and SN
An advanced feature: to avoid reporting identical data repeatedly and to reduce pressure on the AutoBank database, the plug-in has a cache function; if a server's asset data has not changed, only a not_modify tag is reported.
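A minimal sketch of that cache idea, assuming an MD5 digest of the facts and a local cache file (the real plug-in implements this inside its report processor; paths and field names here are invented):

import hashlib
import json
import os

CACHE_FILE = '/var/lib/assets_report/last_report.md5'  # hypothetical path

def build_payload(facts):
    digest = hashlib.md5(
        json.dumps(facts, sort_keys=True).encode('utf-8')).hexdigest()
    last = None
    if os.path.exists(CACHE_FILE):
        with open(CACHE_FILE) as f:
            last = f.read().strip()
    if digest == last:
        # Nothing changed since the last run: report only the marker.
        return {'certname': facts['certname'], 'not_modify': True}
    with open(CACHE_FILE, 'w') as f:
        f.write(digest)
    return dict(facts, not_modify=False)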
The operating systems supported by this plug-in are as follows (the system must be 64-bit, because the collection tools bundled with the plug-in are 64-bit):
CentOS-6
CentOS-7
Windows 2008 R2
The server vendors supported by this plug-in are:
HP
DELL
CISCO
4. Installation
Put the whole code directory into the Puppet master's module directory; assuming your module directory is /etc/puppet/modules:
cd /etc/puppet/modules
git clone git@github.com:AutohomeOps/Assets_Report.git assets_report
Then make every node include the assets_report module. Through the configuration in the module's manifests/init.pp, the acquisition tool is installed on the server automatically, and the plug-in starts working the next time the Puppet agent runs.
5. Configuration
The configuration file is lib/puppet/reports/report_setting.yaml.
parameter     | meaning                                       | example / notes
report_url    | address of the report interface               | http://localhost/api/report
auth_required | whether the interface requires authentication | true/false, default false; the validation code is implemented in auth.rb
user          | username for authentication                   | required if auth_required is true
passwd        | password for authentication                   | required if auth_required is true
enable_cache  | whether the cache function is enabled         | true/false, default false
6. Usage
Manual trigger
puppet agent -t
Alternatively, when the puppet agent daemon runs on its schedule, the AutoBank resource management system interface receives an HTTP call.
7. Data format
{
    'os_type': ...,                # operating system type
    'os_distribution': ...,        # operating system distribution
    'os_release': ...,             # operating system version number
    'not_modify': ...,             # whether the data is unchanged since the last report
    'setuptime': ...,              # installation time
    'sn': ...,                     # serial number
    'manufactory': ...,            # server manufacturer
    'productname': ...,            # server product name
    'model': ...,                  # server model
    'cpu_count': ...,              # number of physical CPUs
    'cpu_core_count': ...,         # number of logical CPU cores
    'cpu_model': ...,              # CPU model
    'nic_count': ...,              # number of network cards
    'nic': ...,                    # detailed network card parameters
    'raid_adaptor_count': ...,     # number of RAID controllers
    'raid_adaptor': ...,           # detailed RAID controller parameters
    'raid_type': ...,              # RAID type
    'physical_disk_driver': ...,   # detailed physical disk parameters
    'ram_size': ...,               # total memory capacity
    'ram_slot': ...,               # detailed memory parameters
    'certname': ...                # Puppet certname
}
8. Summary
Every company building a CMDB ends up writing its own set of collection tools, facing many different server models, operating system types and versions, and gluing different tools together to solve these messy problems. We have done this dirty work over and over; we want to open it up to reduce that pain for others, and we hope that, with the power of the community, a common set of tools can be maintained that meets the configuration collection needs of the vast majority of equipment on the market.
There are surely places in the code that need more thought and scenes not yet considered; please raise them as issues on GitHub. PRs are welcome.

Design of event correlation analysis for fault location in monitoring system

Introduction
About the author
The author of this article is Wu Cheng (contact: autohomeops@autohome.com.cn), mainly responsible for the development and technical management of the Autohome cloud platform. Personal blog: http://jackywu.github.io/
Team Introduction
We are the Autohome operation and maintenance team, a core team in the Autohome technology department, composed of OPs and devs. Our goal is to build a high-performance, highly scalable, low-cost, stable and reliable website infrastructure platform for the Autohome group. Our team technology blog is at http://autohomeops.corpautohome.com.
Contact information
You can reach us via email or by leaving a message on the official technology blog.
1. Preface
After "The first milestone of the Autohome monitoring system" (http://autohomeops.corpautohome.com/articles/), we implemented the following small features:
URL monitoring
Log monitoring, with error log fragments carried in the alarm message
Next, we hope to achieve the so-called "automatic fault location" to improve the efficiency of problem diagnosis.
2. Ideas
We believe an abnormal problem must be caused by one or more reasons, and we use an "event" to describe such an anomaly. A surge in a site's QPS is an event; a backend interface's response time growing beyond expectations is an event; a rise in a server's CPU load is an event; a configuration change on a MySQL server is an event; a business code release is also an event. Identifying these events and the relationships between them is the key to fault location.
We know of two approaches to fault location:
Event correlation analysis
Automatic reasoning of decision tree
We are now practicing the first method.
3. The scheme
Classification of monitoring metrics
To facilitate analysis and location, we classify all collected monitoring metrics into layers.
Explanation:
Business layer: metrics reflecting quality of service, such as the order success rate of an ordering system.
Application layer: metrics reflecting the running state of application software, such as the number of Nginx connections.
System layer: metrics reflecting the running state of the operating system, such as the load average.
Hardware layer: metrics reflecting the running state of hardware devices, such as CPU temperature.
By stratifying the metrics, we sort problems into domains we understand well.
Building a service tree model
The key concepts are "service" and "module":
Module: a combination of servers providing some function; servers with the same function belong to the same module, for example a "caching module" or "DB module".
Service: several modules organized to provide a service; a "service" is a higher-level abstraction over "functions", for example an "order splitting service".
Defining these two concepts lays the foundation for the module call relationships below.
Building module call relationships
Capital letters represent services, such as service A; lowercase letters represent modules, such as module a; arrows represent calls or returned results.
Building a “unified event library”
We believe the relationships between events can be determined in several ways:
The module call relationships above: a human-defined, deterministic relationship.
Time correlation: a non-deterministic strategy expressing relative likelihood.
Factual correlation: by analyzing and computing over masses of historical data, find the correlations that actually hold between events.
First of all, we need a unified “event library” to collect all events.
We believe the factors behind an object's anomaly come from these aspects:
The object itself failing, such as a damaged hard disk.
A dependency failing, such as A.b calling A.c's service while A.c is abnormal.
Changes from external sources, such as a developer's code upgrade to service A.b, a failed server switchover, or a network cut in the local machine room.
For the first point, as long as our four layers of metrics are monitored completely, we can keep things controllable. For the second, we need to fully map each module's call relationships. The third worries us most, because it is hard to collect all external events; we approach it with the following methods.
A "notice center" is set up, so that known, deterministic changes are announced through it and recorded into the event library, for example a code upgrade of service A.b.
A common event bus is established using the Pub/Sub model of a message queue, so that every system on the platform can publish the important changes it produces in this loosely coupled way, and systems interested in those changes can pick them up selectively; the monitoring system, for example, grabs them and writes them into the event library (a sketch follows the examples below).
Around the event bus, this is how our systems work:
AutoBank (the resource management system) changes the state of devices (servers, switches, and so on) with the work ticket attached, records that as an event in its own log, and publishes it to the event bus. As mentioned in the 2016 OpsWorld data-bank sharing, a state change means a change process has taken place, and wherever there is change there is change risk.
AutoGear (the service management system) directly changes the software environment on servers, so which software was installed on which server, or how a piece of software's parameters were modified, is encapsulated as an event in its own log and published to the event bus.
AutoPush (the code release system) requires whoever puts code online to publish an "online bulletin" through the notice center. The announcement is sent to the relevant people via DingTalk, mail or SMS, and recorded into the event library. AutoPush then reads the announcement's release window, during which release staff can publish code on the page; after the window expires, AutoPush locks automatically and publishing is forbidden.
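As a sketch of the bus itself: the article does not name the message queue, so the Redis Pub/Sub below is purely illustrative of the publish/subscribe flow; channel and field names are assumptions.

import json
import redis

bus = redis.StrictRedis(host='localhost', port=6379)

def publish_event(source, kind, payload):
    # Publish a change event to the shared bus.
    bus.publish('event-bus', json.dumps(
        {'source': source, 'kind': kind, 'payload': payload}))

# e.g. AutoGear announcing a software change on a server
publish_event('AutoGear', 'software_change',
              {'host': 'xxx.autohome.cc', 'package': 'nginx', 'action': 'upgrade'})

# The monitoring system subscribes and writes each event into the event library.
sub = bus.pubsub()
sub.subscribe('event-bus')
for msg in sub.listen():
    if msg['type'] == 'message':
        event = json.loads(msg['data'])
        # store_event(event)  # hypothetical write into the unified event library
        print('captured', event)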
Fault location strategy
Our current strategy:
The closer an event is to the end of the calling chain, the more likely it is the root cause of the failure.
The lower the layer of the monitoring metric, the more likely it is the root cause of the failure.
Following the "module call relationships" and "metric stratification" above, a recursive traversal over all events yields a diagnosis report of suspected causes (a toy sketch follows the example).
An example:
The interface response time of the "advertisement service" is abnormal (3 s); the suspected cause is that the "index module" server xxx.autohome.cc has abnormal disk util (98%).
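A toy sketch of this strategy: walk the module call chain recursively and rank candidate events so that deeper positions in the chain and lower monitoring layers come first. All structures are invented for illustration.

LAYER_DEPTH = {'business': 0, 'application': 1, 'system': 2, 'hardware': 3}

# module -> modules it calls (the human-defined call relationships)
CALLS = {'ad-service': ['index-module'], 'index-module': []}

# module -> events observed in the time window, tagged with a metric layer
EVENTS = {'index-module': [{'layer': 'system',
                            'desc': 'xxx.autohome.cc disk util 98%'}]}

def suspects(module, depth=0):
    # Collect (chain depth, layer depth, module, event) tuples, deepest first.
    found = []
    for callee in CALLS.get(module, []):
        found += suspects(callee, depth + 1)
    for ev in EVENTS.get(module, []):
        found.append((depth, LAYER_DEPTH[ev['layer']], module, ev['desc']))
    return sorted(found, reverse=True)

for cand in suspects('ad-service'):
    print(cand)  # the top entry is the most suspected root cause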
Preserving the scene
Business-layer metrics respond directly to quality of service and are the most representative of user experience, so when a business-layer anomaly occurs we execute a snapshot script on the server through the SaltStack remote-execution channel, saving a snapshot of the server's current running state. It is a standardized packaging of the routine commands one would run after logging onto the server (a sketch follows the list):
Save the uptime output
Save the top 10 processes by CPU
Save the free output
Save the top 10 processes by memory
Save the df output
Save the ip addr show output
Save the routing table
Save the local DNS configuration
Save the dig autohome.com.cn output
Save the ping -c 3 autohome.com.cn output
Save a 10-second tcpdump capture
etc.
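A hedged sketch of that snapshot job via the SaltStack Python API; the command list mirrors the items above, and the target names and paths are assumptions.

import salt.client

SNAPSHOT_CMDS = [
    'uptime',
    'ps aux --sort=-%cpu | head -n 11',    # top 10 by CPU
    'free -m',
    'ps aux --sort=-%mem | head -n 11',    # top 10 by memory
    'df -h',
    'ip addr show',
    'ip route',                            # routing table
    'cat /etc/resolv.conf',                # local DNS
    'dig autohome.com.cn',
    'ping -c 3 autohome.com.cn',
    'timeout 10 tcpdump -w /tmp/snapshot.pcap',  # 10-second capture
]

def take_snapshot(target):
    local = salt.client.LocalClient()
    return {cmd: local.cmd(target, 'cmd.run', [cmd]) for cmd in SNAPSHOT_CMDS}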
Meanwhile, on the "latest problems" page of the monitoring system, clicking "live snapshot" presents the information above directly on the page, and clicking "historical data" shows the historical curves for the 30 minutes before and after the problem, including CPU, memory, disk, IO, network traffic and so on, helping operations locate problems quickly.
Log tracking
Building a log analysis system on ELK meets fault-location needs in the following two aspects:
Searching application B's logs on server A within a time window, jumped to directly from the problem page above.
Searching the whole course of a single user request through its TraceID.
4. Summary
Building a monitoring system is a very heavy task, and this article covers only part of the event correlation analysis. The next step is to develop "decision reasoning" to improve location accuracy and lay the groundwork for subsequent fault self-healing. If anything in this article is wrong, feel free to point it out.

A small pit big enough to make a private cloud service crumble: on CMDB asset auditing

1. Introduction
About the author
The author of this article is Wu Xiumin (contact: autohomeops@autohome.com.cn), mainly responsible for the development of Autohome's asset management and configuration management systems. Personal blog: http://pylixm.cc/
Team Introduction
We are the Autohome operation and maintenance team, a core team in the Autohome technology department, composed of OPs and devs. Our goal is to build a high-performance, highly scalable, low-cost, stable and reliable website infrastructure platform for the Autohome group. Our team technology blog is at http://autohomeops.corpautohome.com.
Contact information
You can reach us via email or by leaving a message on the official technology blog.
2. Preface
As the private cloud and the company's automation systems deepen their dependence on CMDB data, the CMDB has become the basic data source for maintaining the company's servers. Once CMDB data goes wrong, the consequences can be unexpected; in serious cases it can paralyze online business. At the beginning of our CMDB development we set the direction of "high accuracy, high availability, high automation", with "process control" at its core. But process control is a long-term construction effort, and at times it cannot keep up with the evolution of business processes. At such times, to keep daily work moving smoothly, manual handling must follow the prescribed standard process; over time this can make the machine data inaccurate. Data accuracy has always been a major problem in CMDB construction. Below we share our exploration of guaranteeing data accuracy beyond "process control"; comments are welcome.
3. The problems we encountered
With the construction of the private cloud and automation systems, all kinds of CMDB data maintenance have been automated. But as mentioned above, "process control" and "business process evolution" are an ongoing game, and opportunities for manual intervention remain. Several common problems we encountered:
Problem 1: inaccurate values in individual asset fields.
When the company's private cloud platform fetches an IP from the CMDB, it picks from the pre-allocated IP segments according to machine room and business line, so when the business line or machine room data is wrong, it fetches the wrong IP. Previously, a colleague created a machine room in the CMDB to divide IPs; at the time there was no prescribed naming format and no backend validation. The private cloud then could not obtain an available IP during installation and the process stalled. Inaccurate data like this is very dangerous, and this was only a machine room name: if the private cloud were ever allocated a wrong IP, it could cover an online server's business, which would be a serious problem.
Problem 2: in a given asset state, a field that should be empty still held a value, causing uniqueness errors when other private cloud workflows wrote to the library.
A colleague on a business line applied for a cloud host. After approvals at every level, he never received the result email of the cloud host application. He contacted operations; operations contacted the cloud host administrator; the administrator contacted the cloud host developers, who found that the machine had failed to enter the CMDB asset library automatically, and came to us. Checking the logs, we found that a server taken offline had not had its IP emptied, so the insert kept failing on a uniqueness error. Only after a joint investigation by the developers of several systems was the data problem finally discovered. Problems like this are time-consuming and unforeseeable.
Problem 3: states modified manually, without a private cloud work ticket, left the times wrong, so statistics on online assets were inaccurate.
Analyzing CMDB data depends on the time of various events, but when people modify data by hand they may not update the change time. So when we aggregated the asset data, it did not match the private cloud work ticket data and needed further verification. At that point, things crumble.
4. Our audit scheme
4.1 Overview
Starting from the problems above, we read a lot of material and found very little about self-auditing of assets. So, tailored to our own problems, we developed a self-audit scheme whose core is "inventory", "trial" and "penalty", to ensure the accuracy of CMDB data.
4.2 Inventory: log replay and external checks
4.2.1 Recording the data source
Our CMDB is built on Django. We override Django's model layer so that whenever data changes a change log is recorded, saving the values before and after the change as the data source for later calculations.
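A minimal sketch of the idea on a Django model, assuming invented model and field names rather than the real AutoBank schema: override save() and write one log row per changed field.

from django.db import models

class AssetChangeLog(models.Model):
    asset_sn = models.CharField(max_length=64)
    field = models.CharField(max_length=64)
    old_value = models.TextField()
    new_value = models.TextField()
    changed_at = models.DateTimeField(auto_now_add=True)

class Asset(models.Model):
    sn = models.CharField(max_length=64, unique=True)
    business_line = models.CharField(max_length=64)
    status = models.CharField(max_length=32)

    def save(self, *args, **kwargs):
        if self.pk:  # only diff updates, not first inserts
            old = Asset.objects.get(pk=self.pk)
            for f in ('business_line', 'status'):
                before, after = getattr(old, f), getattr(self, f)
                if before != after:
                    AssetChangeLog.objects.create(
                        asset_sn=self.sn, field=f,
                        old_value=before, new_value=after)
        super().save(*args, **kwargs)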
4.2.2 Back-calculation
With the asset's most detailed change logs in hand, traversing the change log tells us the asset's state at any moment in history. For example, to count the machines a business line applied for last month, we simply traverse last month's CMDB change log and accumulate the records where the business line matches and the state changed at the same time; that is exactly the data we need (sketched below).
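A sketch of that back-calculation, replaying invented log dicts in time order:

def machines_applied(change_logs, business_line, month):
    # Count assets of one business line that went online in a given month.
    total = 0
    for log in change_logs:  # dicts replayed in chronological order
        if (log['month'] == month
                and log['business_line'] == business_line
                and log['field'] == 'status'
                and log['new_value'] == 'online'):
            total += 1
    return total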
4.2.3 External inventory
The replay gives us aggregated data, and we can also obtain collation data from external systems such as the work ticket system. Comparing the two tells us whether the CMDB's records are wrong: are there extra assets, or missing ones?
The flow of the external inventory is as follows:
4.3 Trial: periodic review and self-correction
Besides the inventory, we also built scheduled self-audit background tasks that check the accuracy of CMDB assets every day, for example whether fields that must not be empty are empty (a sketch of the daily check follows). Problem assets are blacklisted and sent to the responsible operations staff by mail, reminding them that something is wrong with the asset and needs correction. In this way we discover problems first and keep the initiative, rather than being called out by external systems.
Mail style:
Besides mail, we also developed a blacklist confirmation feature to urge operators to correct the data: the operator corrects the assets listed in the blacklist and confirms them there, and a second mail in the afternoon publishes the progress of corrections, so that everyone supervises one another.
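A minimal sketch of the daily check, with an assumed list of required fields:

REQUIRED = ('sn', 'ip', 'idc', 'business_line', 'status')

def daily_audit(assets):
    # Blacklist every asset whose required fields are empty or missing.
    blacklist = []
    for asset in assets:  # assets as dicts pulled from the CMDB
        missing = [f for f in REQUIRED if not asset.get(f)]
        if missing:
            blacklist.append({'sn': asset.get('sn'), 'missing': missing})
    return blacklist  # mailed to the responsible operators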
4.4 Penalty: dividing responsibility down to individuals
Execution is also a major guarantee of data accuracy. We find problem assets through the automated inventory above, but getting the data corrected promptly and effectively is a big problem of its own.
4.4.1 Visualizing the rules and regulations
To solve the "execution" problem, we established rules and regulations and visualize them in the CMDB alongside the assets, for every operator to view and learn from.
The page prototype is as follows:
4.4.2 Combining the blacklist with the regulations
Once the blacklist and the regulations are visualized, there are provisions to follow. When asset data has a problem, operations are given time to self-correct; if the data error still exists afterwards, small penalties of varying degrees can be applied.
The page prototype is as follows:
Flow chart of the reward and penalty process:
5. Summary of experience
Throughout the construction of the CMDB, data accuracy has always been the big problem. Our experience in this direction, in summary:
Divide responsibility down to individuals to reduce wrangling: when a problem appears, it is clear whose role must handle it.
In private cloud workflow control, the time of every asset data change must be recorded in detail, so that all kinds of statistical requirements can be met.
Separate the super administrator from the developers: it frees the developers and avoids having them chase bad data and delay normal development.
6. Future roadmap
Asset locking based on the blacklist
Self-correction of some classes of problem assets
7. Reference materials
CMDB understanding
HUAWEI CMDB
Blue whale
Excellent cloud software

The first milestone of the Autohome monitoring system

1. Introduction
About the author
The author of this article is Wu Cheng (contact: autohomeops@autohome.com.cn), mainly responsible for the development and management of the Autohome cloud platform. Personal blog: http://jackywu.github.io/
Team Introduction
We are the Autohome operation and maintenance team, a core team in the Autohome technology department, composed of OPs and devs. Our goal is to build a high-performance, highly scalable, low-cost, stable and reliable website infrastructure platform for the Autohome group. Our team technology blog is at http://autohomeops.corpautohome.com.
Contact information
You can reach us via email or by leaving a message on the official technology blog.
2. Preface
Every company has a monitoring system, big or small, simple or complex. Autohome's monitoring first used Zabbix, going from cold standby to hot standby, then to a distributed model built with proxies achieving cross-machine-room disaster tolerance, much as everyone would imagine.
3. Main text
3.1 History
So why build a new monitoring system later?
Zabbix has several shortcomings:
It relies heavily on the database and cannot support monitoring of very large clusters.
The Zabbix server side is poorly modularized, making secondary development inflexible.
We wanted a distributed, more extensible system architecture.
3.2 Our design ideas
Our design goals for the system:
Accurate alerting
Automatic fault location
Fault self-healing
Product positioning of the system:
Operations development provides the monitoring system and is responsible for its stability.
Business operations helps users use the system, focusing on the configuration of basic monitoring and on alarm reception.
Business departments can use the system on their own, focusing on the configuration of business monitoring and on alarm reception.
From our research into monitoring systems, we concluded we need an architecture like this:
Concepts:
Agent: collects data.
Transfer: forwards the collected data to different backends, including the parser, storage, and so on.
Storage: stores historical data.
Dashboard: monitoring policy configuration and monitoring data viewing.
Detector: detects whether the data reported by Agents indicates an anomaly.
Analyzer: analyzes the abnormal events produced by the Detector, applies some logic, and sends alarms.
Sender: receives messages from the Analyzer and sends alarm notifications.
Processor: handles abnormal events automatically, i.e. "self-healing".
3.3 Our implementation strategy
With this architecture settled, we began investigating. We could build our own system from basic building blocks such as collectd, statsd, MySQL, and HBase, or similar open source components. Then we found that at almost the same time Xiaomi had open-sourced its own monitoring system, Open-Falcon, whose design ideas were very close to ours, so we chose to do secondary development on top of it.
3.4 Product design
Below are the differences between our product and Open-Falcon.
A. Service tree
We built our own service tree according to the company's organizational structure and the organization of its business systems. The organizational structure and the servers are synchronized automatically from the CMDB.
B. Dashboard
According to our company's business organization and past usage habits, we built our own Dashboard.
We abstracted the concepts of "business" and "function module" to represent "system" and "subsystem", or "top-level group" and "second-level group"; Open-Falcon only supports one level of grouping.
A function module is associated with a host group, alarm policies, and alarm templates.
Within alarm policies we implemented "alarm upgrade" and "delayed alarm", and alarms are merged per host according to the function module.
On the page we provide self-service subscription to the host alarms of a function module.
We will share this content separately.
C. Alarm judgment expressions
We modified Judge and added day-over-day comparison functions (sketched below):
daydiff(#3) > 1: over the last 3 points, the difference between today and the same moment yesterday exceeds 1.
daypdiff(#3) > 1: over the last 3 points, the difference between today and yesterday exceeds 1%.
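A sketch of the semantics, with invented helper names; each series holds one point per sampling moment, today's aligned with yesterday's:

def daydiff(today, yesterday, n):
    # Max absolute day-over-day difference across the last n points.
    return max(abs(t - y) for t, y in zip(today[-n:], yesterday[-n:]))

def daypdiff(today, yesterday, n):
    # Max day-over-day difference in percent across the last n points.
    return max(abs(t - y) / y * 100 for t, y in zip(today[-n:], yesterday[-n:]))

# daydiff(#3) > 1 then reads as:
# fire = daydiff(today_series, yesterday_series, 3) > 1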
D. Alarm notification strategy
To curb alarm bombardment we added an "alarm upgrade" mechanism: when an alarm occurs, notify "group 1" immediately; if it still persists after X minutes, send the notifications to "group 2", and so on. The groups and intervals can be customized (a sketch follows these mechanisms).
To further reduce alarm flooding within a time window, we added this mechanism: within a window, the first occurrence of an alarm is notified immediately, alarms in between are suppressed on a lengthening-interval principle, and at the end of the window the fault or recovery state is sent. The window length can be customized.
To further handle a group of machines with the same function (or some other commonality) raising the same kind of anomaly at the same time and overflowing alarms, we added this mechanism: on the lengthening-interval principle, at the end of the window, alarms from machines under the same function module are merged by monitoring item. The window length can be customized.
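A toy sketch of the upgrade mechanism alone, with invented group lists and send/still_firing callbacks:

import time

GROUPS = [['oncall-a'], ['oncall-b'], ['team-lead']]  # group 1, 2, ...
ESCALATE_AFTER = 10 * 60  # "X minutes", customizable

def escalate(alarm, still_firing, send):
    level = 0
    while level < len(GROUPS):
        send(GROUPS[level], alarm)
        deadline = time.time() + ESCALATE_AFTER
        while time.time() < deadline:
            if not still_firing(alarm):
                return  # recovered: stop escalating
            time.sleep(30)
        level += 1  # still firing after X minutes: notify the next group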
This content will also consider open source.
E. Agent
Autohome has a large number of Windows servers. To unify the logical architecture of the system, we developed our own Agent (author: ninjadq) in Python, running as a Windows Service.
The differences between this Agent and freedomkk-qfeng/falcon-scripts:
Supports collection of IIS and SQL Server monitoring items.
Runs as a Windows service, with no scheduled tasks to configure, so the Agent operates the same way as the Go agent does under Linux.
One thread per collected application; the cost of thread switching is negligible while an application is not being collected, and collection happens on schedule.
Implements a local HTTP proxy interface.
The difference from LeonZYang/agent:
No need to patch golang.
With the company's support, we opened the code under the Apache license; see "Windows Agent" at https://github.com/AutohomeRadar/Windows-Agent/. We borrowed part of the collection code from freedomkk-qfeng/falcon-scripts; thanks to its author.
PRs and issues are welcome, and feel free to contact us by QQ or email.
4. Future roadmap
Fault location
Event correlation analysis based on the time dimension and semantics
Visualization of network and business topology with anomaly-likelihood analysis
Dynamic threshold analysis
Monitoring item behavior analysis + dynamic matching of monitoring strategies
Multi-machine-room data synchronization and cold-standby plans for Judge, Graph, and other components
5. Summary
Building a monitoring system is a long and arduous process; as long as requirements keep coming there may never be a day it stops. We will keep in close communication with the community, learn from its experience and contribute back, and we welcome everyone to exchange ideas with us.

Design ideas of the Autohome CMDB

Introduction
About the author: Du Hui (duhui@autohome.com.cn) works on the Autohome system platform team, mainly responsible for web infrastructure operation and maintenance, with rich experience in building operation and maintenance processes and in the architecture design and implementation of automated operations.
Team Introduction
The Autohome operation and maintenance team is a core team in the Autohome technology department, composed of OPs and devs. Our goal is to build a high-performance, highly scalable, low-cost, stable and reliable website infrastructure platform for the Autohome group.
Team technology blog address is http://autohomeops.github.io/.
Foreword
What is CMDB?
In short, it is a platform on which an Internet company unifies the management of IT data: device status, assets, and other information. A CMDB differs from a pure asset system: it contains not only the asset system but also all service support and service delivery processes, providing the data basis on which the whole ITIL process runs and from which configuration information derives its value. A successful CMDB system is therefore of great value to an enterprise.
Design purpose
What is the Autohome CMDB for, and which problems does it solve? After several unsuccessful CMDB attempts, we emptied our minds, thought hard, and went carefully over the problems encountered in our work, summarizing the following points:
As the company's operating scale expands, the number of devices grows large and varied; auditing costs multiply and statistics become difficult, consuming time and money, so a unified platform for management and collection is urgently needed.
As the company's operation and maintenance automation develops, the deployment system, release system, monitoring system, work ticket system, device utilization system and so on are isolated and inconsistent, each maintained separately. Integrating all the data into one underlying data platform is urgently needed, to ease future expansion and maintenance.
For various reasons, some work is still done in the original manual way: unordered management, a huge amount of duplication with little visible output. A system tailored to ourselves and convenient to use is needed to reduce workload and improve efficiency and output.
From the above, the purpose emerges: a system that concentrates data, replaces the spreadsheets, reduces duplicated work, and gives operations staff asset and configuration management.
Design direction, core idea, and functions
Direction: high accuracy, high availability, high automation.
Accuracy is the basis of all the data and the premise of every purpose: every piece of data the CMDB gives out must be accurate and valid.
Availability is fundamental: however powerful the system, it is useless if nobody uses it. Understand the data consumption scenarios, understand the users' focus and habits, accept user feedback, and keep optimizing without stopping.
Automation guarantees efficiency in time and manpower, and is, from a cost and return perspective, the key to the CMDB's success.
So how does the CMDB keep to the right direction? What is the key to solving all these problems?
Core idea: process control
The essence of process is to solve the series of problems brought by change. Manage change with an automated, repeatable process, so that when a change happens there is an accurate process to execute, and the impact of the change on the whole system can be predicted, evaluated and controlled. This is the key with which the CMDB solves all the problems above.
With direction and core settled, our CMDB's functional design became clear:
Integration: merge the existing multiple data sources into a single view and generate unified, complete data reports.
Self-mediation: ensure the multiple data sources recorded in the CMDB do not duplicate each other, and maintain the integrity of every field in the CMDB.
Synchronization and cascading: information in the CMDB reflects updates and coordinates the data in all subsystems.
Permission classification: fine-grained assignment of create, read, update, and delete permissions to people under all kinds of conditions.
Visualization: lower the threshold for understanding the information, and make the effects of data changes intuitive.
Platform display
1. Asset management platform
Figure 1
Figure 2
As shown in Figures 1 and 2 above, the Autohome asset list carries 35 fields, mainly to satisfy daily queries, statistics of all kinds, audit records and so on.
On the granularity of each CMDB field:
Granularity should be proportional to the company's overall capability and manpower.
Management requirements determine the granularity; large, all-covering data must weigh its maintenance cost.
The output value of the data is not proportional to the fineness of granularity, but the ability to extend at any time must be guaranteed.
Every field, whatever its granularity, must be traceable within the process.
2. Process control platform
Autohome's process control covers 4 aspects: data initialization (entry), data change (state machine), data value (reporting), and data audit (work ticket queries).
To design an all-weather, flawless control process that keeps all the data accurate, one must first understand the device life cycle and its links with the CMDB, as shown:
From the device life cycle diagram, we defined a state machine to reflect the series of changes a data change triggers. Figure:
For each state transition we prescribe the fields that must be filled in, ensuring data accuracy (a sketch follows).
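Since the original state machine illustration is an image, here is a toy sketch of the idea: legal transitions plus the fields each transition must fill in, with invented states and field names.

TRANSITIONS = {
    ('in_stock', 'online'):  ['ip', 'idc', 'cabinet', 'business_line'],
    ('online', 'offline'):   ['offline_reason'],
    ('offline', 'scrapped'): ['scrap_ticket'],
}

def change_state(asset, new_state, **fields):
    required = TRANSITIONS.get((asset['state'], new_state))
    if required is None:
        raise ValueError('illegal transition %s -> %s'
                         % (asset['state'], new_state))
    missing = [f for f in required if f not in fields]
    if missing:
        raise ValueError('missing required fields: %s' % missing)
    asset.update(fields, state=new_state)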
And through these transitions we obtain the information we need:
Of course, a single automated process is only the first step. As the underlying data platform of the whole operation and maintenance automation effort, the CMDB guides and connects the entire process, enabling multi-process collaboration end to end and laying the foundation for the automation system, as shown below:
3. Other functional platforms
IP pool management: IP resources are unified and catalogued. For the current status of IPs we define three usage states, and a background program verifies the whole network's IPs against them; combined with the three states, IP resources can be allocated and used rationally. Schematic:
Machine room view: the data in the asset list is reflected intuitively in a machine room view, which provides cabinet usage, rack position queries, and highlighting of machines under specific search conditions. Figure:
There are also some customized functions, permissions, and business-specific fields of our own.
Summary
The Autohome CMDB has been on this road for 2 years. The road to success is full of thorns and setbacks, and CMDB construction is no different. Over the past two years we have read a lot of material and made many attempts, some successful and some failed. Here are some common lessons and experiences.
Lessons:
At the beginning of the project there was no one truly in charge, and communication between development and the demand side was insufficient.
The initial goals were too big and covered too many functions, blurring priorities and dragging out the construction period.
The goals of operations and development were inconsistent, and the steps of discussion, testing, acceptance and later optimization were imperfect or lacked operations participation.
The core process problems were not seized, or were thought through carelessly.
Feedback was not collected, or was collected but not followed up.
Compromise, the famous linear curve problem.
Experience:
Leadership commitment of time, manpower, and resources.
Use all the authority granted to verify the data; all kinds of historical problems and hard points must be resolved.
A CMDB is just a system, while configuration management is a process. The CMDB is of course important, but the process that maintains it matters more; to design a good CMDB, design a set of processes suited to your own company from a global point of view.
Standardization and platformization cannot be achieved apart from the CMDB.
On high availability, strive for excellence: only a CMDB that is found useful and actually used can keep going.

The Autohome configuration management system AutoCMS

Introduction
About the author: this article's author is Wang Xianbao (wangxianbao@autohome.com.cn), mainly responsible for AutoCMS development and the operation and maintenance of the caching platform, skilled in Python automated operations, distributed caching, and distributed file system application management.
Team introduction: we are the Autohome operation and maintenance team, a core team in the Autohome technology department, composed of OPs and devs. Our goal is to build a high-performance, highly scalable, low-cost, stable and reliable website infrastructure platform for the Autohome group. Our team technology blog is at http://autohomeops.github.io/.
Contact: you can reach us via email or by leaving a message on the official technology blog.
Outline
1. Preface
AutoCMS is the unified configuration management tool currently used at Autohome. This article introduces the methods and architecture behind the system in detail. First, let's look back at the stages software deployment and configuration file management went through.
Original stage (manual deployment, Wiki specification)
In the initial stage, software deployment and configuration file management depended entirely on manpower. To deploy a new business, operations would follow the software versions of the R&D test environment, download packages from official sites, compile and install them on servers, then configure files by hand and start services according to R&D requirements. This stage exposed many problems:
Software package sources were not enforced, bringing big security risks.
When new business went online or existing business expanded urgently, operations poured manpower into heavy manual labor.
Deployment directory standardization relied on Wiki conventions; manual deployment frequently caused problems.
Semi-automated stage (manual packaging, yum deployment)
As operations automation advanced, we built our own yum repository and packaged software ourselves, standardizing software versions and deployment directories. For batch deployment, running yum install on the servers saved a lot of manpower, but we still had to log in to servers to modify configuration files to meet online needs, and backups before configuration changes and recovery after failures still required manual maintenance.
Automated stage
The management system AutoCMS now in use manages software deployment and configuration files from web pages, enabling fast batch management of software.
Below is a detailed introduction to the AutoCMS we are using.
2. Introduction to using AutoCMS
AutoCMS is a software deployment and configuration file management system based on puppet. It removes manual software installation and provides reliable services for standardized software deployment, configuration files, and basic service management. Its main functions are:
Batch deployment of software packages
Page-managed configuration file changes, with backup and rollback
Multi-environment deployment, with development, testing, and online environments isolated
Grayscale configuration push
Remote execution of safe commands
Basic statistics
The user's operation flow is as follows:
Create a host group: a host group is a collection of hosts that need the same software or configuration deployed, and a configuration module (a puppet module written in advance) is associated with the group.
Add hosts: hosts that need the module are added to the group (the host list comes from the CMDB interface).
Select the environment: choose the hosts' environment as needed; each environment's configuration data is stored independently, without affecting the others.
Configure parameters: each configuration module declares its parameters in a layout file, from which its configuration page is generated. Business operations selects the parameters on the page, and puppet generates the corresponding configuration files from them.
Push the configuration: when configuration is complete, return to the host group management page, select the hosts, and push; the puppet agents of those hosts are triggered, and the corresponding software and configuration are deployed.
3. System architecture
AutoCMS uses Django as the front-end framework; the backend deployer is built on puppet, implementing the complete deployment and configuration logic through puppet's ENC and report functions. The overall architecture is as follows:
Brief analysis:
Front-end configuration pages
Since each software's configuration options differ, the front-end parameter configuration pages must be generated dynamically from layout files.
Data storage
At design time we weighed MySQL against MongoDB and chose MongoDB (see the sketch after this list):
Schema-less, JSON-style storage: configuration data differs per software and cannot be stored as flat key-value pairs; each software's configuration items may have a custom structure, so JSON fits, and MongoDB's BSON meets the requirement while sparing the foreign keys that would be used everywhere in MySQL. Besides, JSON data is easy to grasp and understand: the stored data is clear at a glance.
CRUD is convenient and fast, with support for range queries, regular-expression queries, and the upsert option.
Compared with other NoSQL products, we are more familiar with MongoDB's persistence and high availability.
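A sketch of what such a schema-less configuration document might look like via pymongo; collection and field names are illustrative, not AutoCMS's real schema.

from pymongo import MongoClient

db = MongoClient('mongodb://localhost:27017')['autocms']

# Each software keeps its own custom structure in the same collection.
db.configs.update_one(
    {'hostgroup': 'web-frontend', 'module': 'tomcat', 'env': 'online'},
    {'$set': {'params': {'instances': [
        {'name': 'app1', 'http_port': 8080, 'xmx': '2g'},
        {'name': 'app2', 'http_port': 8081, 'xmx': '2g'},
    ]}}},
    upsert=True,  # the upsert option mentioned above
)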
The deployment layer is implemented with puppet modules; the parameters referenced in a module come either from custom facter facts or from the ENC reading the configuration store in MongoDB.
Multi-instance deployment:
Multi-instance deployment on one host is implemented with the create_resources function. For example, when a host needs several Tomcat instances, the data is structured as follows (the original illustration is an image; a plausible shape is sketched below):
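A plausible shape of that data (names assumed), expressed here as the Python/JSON-style document that create_resources would consume as a hash of resource titles to parameters:

tomcat_instances = {
    'app1': {'http_port': 8080, 'ajp_port': 8009, 'xmx': '2g'},
    'app2': {'http_port': 8081, 'ajp_port': 8010, 'xmx': '2g'},
}
# in the puppet module: create_resources(tomcat::instance, $instances)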
Multi-environment separation: the puppet code of this system is version-controlled with git, and environment separation at the puppet code level relies on git's own branching, executing the puppet code of different branches as required.
4. A configuration case
Below, the Tomcat automatic deployment module serves as an example.
Design a storage structure for the parameters according to the deployment requirements.
Fill in the front-end layout file, from which the configuration page is generated automatically.
Write the puppet automatic deployment module.
Write the ENC conversion program (a sketch follows).
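A puppet ENC is simply an executable that receives a certname and prints YAML; below is a hedged Python sketch of such a conversion program, with an assumed MongoDB layout.

#!/usr/bin/env python
import sys

import yaml
from pymongo import MongoClient

certname = sys.argv[1]
db = MongoClient('mongodb://localhost:27017')['autocms']
conf = db.configs.find_one({'hosts': certname}) or {}

# puppet reads the classes and parameters for this node from stdout.
print(yaml.safe_dump({
    'classes': {'tomcat': conf.get('params', {})},
    'environment': conf.get('env', 'production'),
}))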
With the preparation above done, the system goes online. An operations group creates a host group, associates the Tomcat module, adds hosts to the group, and configures the parameters.
After configuration, a grayscale release is verified; with no problems found, the configuration is pushed to all hosts, the state is observed, and the configuration is complete.
5. Summary
With the arrival of the cloud era, many servers can have their environments baked into images, but as business scale grows, per-server personalized configuration multiplies and image maintenance grows with it. AutoCMS solves this flexibly: writing configuration requirements into configuration modules, grouping servers, and pushing in batches saves operators a great deal of manual labor and lets them devote their energy to more valuable work.

“Transaction management” sharing

Introduction
The author of this article is Wu Cheng (wucheng@autohome.com.cn), who has a strong interest in systems development and in team management skills.
Team introduction: we are the Autohome operation and maintenance team, a core team in the Autohome technology department, composed of OPs and devs. Our goal is to build a high-performance, highly scalable, low-cost, stable and reliable website infrastructure platform for the Autohome group. Our team technology blog is at http://autohomeops.github.io/.
Contact: you can reach us via email or by leaving a message on the official technology blog.
Preface
This is an internal company lecture on transaction management, aimed at sharing some time management techniques to help improve the ability to handle a large number of tasks.
PDF download
Slideshare