How we use Presto

Why we use it:

Our big data developers and BI colleagues query all kinds of data through Hive every day, and more and more reporting workloads depend on Hive. Although Impala is deployed on our CDH cluster, most of our Hive tables are stored in ORC format, which Impala supports poorly. Before we adopted Presto, the big data team queried these tables through Hive on MapReduce, and query efficiency was low.

When we first evaluated Presto, the operations team set up a 3-node trial environment. We queried Hive through Presto with simple SQL queries and found the performance very good.

Our existing solutions could not query historical and real-time data at the same time. At one point, BI asked the big data team to support interactive queries across Hive and MySQL. The big data team tested Spark, Presto and other tools and found Presto the easiest to use. It is also a good fit for our BI colleagues. After internal discussion, we decided to promote Presto vigorously.

We built a Presto cluster on the Hadoop nodes with 1 coordinator and 3 workers. As more business workloads moved onto Presto, the cluster was expanded to 7 workers, and the memory of each worker node was increased from the original 8 GB to 24 GB.
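As a rough sketch of the 1-coordinator, 3-worker layout described above, the Python snippet below renders the `etc/config.properties` content for each role. The property names are the standard ones from the Presto deployment documentation. The discovery hostname, the port, and the helper function names are assumptions made up for illustration and are not our actual settings. Memory settings are left out on purpose because they depend on the node size.

```python
from typing import Dict

# Hypothetical coordinator address, used only for illustration.
DISCOVERY_URI = "http://coordinator.example.internal:8080"


def coordinator_config(discovery_uri: str) -> Dict[str, str]:
    """Properties for the single coordinator node (it does not run query tasks itself)."""
    return {
        "coordinator": "true",
        "node-scheduler.include-coordinator": "false",
        "http-server.http.port": "8080",
        "discovery-server.enabled": "true",
        "discovery.uri": discovery_uri,
    }


def worker_config(discovery_uri: str) -> Dict[str, str]:
    """Properties shared by every worker node; workers register with the coordinator."""
    return {
        "coordinator": "false",
        "http-server.http.port": "8080",
        "discovery.uri": discovery_uri,
    }


def render_properties(props: Dict[str, str]) -> str:
    """Render a dict as the text of a Java-style .properties file."""
    return "\n".join(f"{key}={value}" for key, value in props.items()) + "\n"


if __name__ == "__main__":
    print("# coordinator etc/config.properties")
    print(render_properties(coordinator_config(DISCOVERY_URI)))
    print("# worker etc/config.properties (same file on each of the workers)")
    print(render_properties(worker_config(DISCOVERY_URI)))
```

Growing the cluster from 3 to 7 workers under this layout only means placing the same worker file on the new nodes, since every worker points at the same discovery URI.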

Introduction to Presto:

Presto is an open source distributed SQL query engine designed for interactive analytic queries over very large data sets. It was built mainly to address the slow interactive analysis and processing speed of commercial data warehouses. It supports standard ANSI SQL, including complex queries, aggregations, joins, and window functions.

Presto queries data where it lives, including Hive, Cassandra, relational databases, and proprietary data stores. A single Presto query can combine data from multiple sources, which allows analysis across the entire organization.

How it works:

The execution model of Presto is fundamentally different from that of Hive on MapReduce. Hive translates a query into multiple stages of MapReduce tasks that run one after another. Each task reads its input from disk and writes its intermediate output back to disk. Presto does not use MapReduce. It uses a custom query and execution engine with operators designed to support SQL semantics. In addition to improved scheduling, all processing is done in memory and pipelined across the network between stages, which avoids unnecessary disk I/O and the latency that comes with it. The pipelined execution model runs multiple stages at once and streams data from one stage to the next as soon as it becomes available. This greatly reduces the end-to-end response time of many kinds of queries.
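The difference between staged and pipelined execution can be illustrated with a small single-process analogy. The Python sketch below is not Presto's implementation: Presto pipelines data between workers over the network, while this example only chains generators in one process. It does show the core idea, that each row flows to the next stage as soon as it is produced and no intermediate result is written out in full. All function names and the sample rows are made up for illustration.

```python
from typing import Dict, Iterable, Iterator, Tuple

Row = Tuple[str, int]


def scan(rows: Iterable[Row]) -> Iterator[Row]:
    """Source stage: hands rows downstream one at a time."""
    for row in rows:
        yield row


def filter_positive(rows: Iterable[Row]) -> Iterator[Row]:
    """Filter stage: forwards a row as soon as it arrives, if its value is positive."""
    for key, value in rows:
        if value > 0:
            yield key, value


def sum_by_key(rows: Iterable[Row]) -> Dict[str, int]:
    """Aggregation stage: consumes the stream and keeps only running totals in memory."""
    totals: Dict[str, int] = {}
    for key, value in rows:
        totals[key] = totals.get(key, 0) + value
    return totals


def run_pipeline(rows: Iterable[Row]) -> Dict[str, int]:
    """Chain the stages; nothing is materialized between scan, filter and aggregate."""
    return sum_by_key(filter_positive(scan(rows)))


SAMPLE_ROWS = [("a", 1), ("b", -2), ("a", 3), ("b", 5)]

if __name__ == "__main__":
    print(run_pipeline(SAMPLE_ROWS))
```

A MapReduce-style version of the same query would instead finish the filter step completely, write its output to disk, and only then start the aggregation step.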

Usage scenarios:

1, Everyday Hive queries: more and more colleagues now query Hive through Presto. Compared with Hive queries running on MapReduce, Presto is far more efficient.

2, The data platform queries Hive through Presto to serve business-facing displays.

3, Union across Cobar shards. Cobarc has 5 shards, and its data is spread across all 5. To query cobarc, you previously had to query the 5 shards separately and then combine the results by hand, which was inefficient. Cobarb has 8 databases, with its data spread across all 8, and has the same problem as cobarc. With Presto, the 5 shards can be combined with a union, which improves query efficiency and simplifies the queries.

4, Interaction between Hive and MySQL. In the past, Hive's historical data and cobarc's real-time data could not be joined. You could query either the data up to the previous day or the current day's data, but not both together. With Presto, Hive tables and cobarc tables can be joined.

5, Mapping Kafka topics to tables. Some topics in our existing Kafka carry structured JSON data. Presto can map this JSON to a table, which can then be queried directly with SQL. This has not yet been applied to a real scenario.

6, At present we use Hue, Zeppelin and other web tools to work with Presto. They support exporting result sets.
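Scenarios 3, 4 and 5 can be sketched in a few lines of Python that build the SQL and the Kafka table description involved. This is only an illustration under stated assumptions: the catalog names (`cobarc_1` to `cobarc_5`, `hive`), the schemas, tables, columns and the topic name are all hypothetical, and each shard is assumed to be registered in Presto as its own catalog. The `catalog.schema.table` naming and the layout of the Kafka table description (`tableName`, `schemaName`, `topicName`, `message.dataFormat`, `message.fields`) follow the Presto documentation.

```python
import json
from typing import Dict, List


def union_shards_sql(catalogs: List[str], schema: str, table: str, columns: List[str]) -> str:
    """Scenario 3: one query that reads the same table from every shard catalog."""
    column_list = ", ".join(columns)
    selects = [f"SELECT {column_list} FROM {catalog}.{schema}.{table}" for catalog in catalogs]
    return "\nUNION ALL\n".join(selects)


def hive_mysql_join_sql(hive_table: str, mysql_table: str, key: str) -> str:
    """Scenario 4: join historical data in Hive with real-time data in MySQL."""
    return (
        f"SELECT h.{key}, h.amount AS historical_amount, m.amount AS realtime_amount\n"
        f"FROM {hive_table} AS h\n"
        f"JOIN {mysql_table} AS m ON h.{key} = m.{key}"
    )


def kafka_table_description(table: str, topic: str, fields: Dict[str, str]) -> dict:
    """Scenario 5: table description mapping top-level JSON keys of a topic to columns."""
    return {
        "tableName": table,
        "schemaName": "default",
        "topicName": topic,
        "message": {
            "dataFormat": "json",
            "fields": [
                {"name": name, "mapping": name, "type": sql_type}
                for name, sql_type in fields.items()
            ],
        },
    }


if __name__ == "__main__":
    # Hypothetical catalog names, one per cobarc shard.
    shards = [f"cobarc_{i}" for i in range(1, 6)]
    print(union_shards_sql(shards, "orders_db", "orders", ["order_id", "user_id", "amount"]))
    print()
    print(hive_mysql_join_sql("hive.dw.orders_history", "cobarc_1.orders_db.orders", "user_id"))
    print()
    description = kafka_table_description(
        "order_events", "order-events", {"order_id": "BIGINT", "status": "VARCHAR"}
    )
    print(json.dumps(description, indent=2))
```

The generated SQL can be pasted into Hue or Zeppelin as-is, and the JSON output is meant to be saved as a table description file for the Kafka connector.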
