Reasons for use:
Point me to big data developers, BI colleagues need to use hive to query various kinds of data every day, more and more reports business is used to hive. Although the CDH cluster has deployed impala, most of our hive tables use ORC format, and impala is unfriendly to ORC support. Before using presto, I would like to use big data to use hive and go to MapReduce to query related tables. The efficiency of query is low.
Presto is an open source distributed SQL query engine, which is suitable for interactive analysis and query, and supports massive data. It is mainly to solve the interactive analysis of commercial data warehouse and to deal with low speed. It supports standard ANSI SQL, including complex queries, aggregation (aggregation), connection (join) and window function (window functions).
The running model of Presto is essentially different from that of Hive or MapReduce. Hive translates queries into multistage MapReduce tasks and runs one after another. Each task reads the input data from the disk and outputs the intermediate results to the disk. However, the Presto engine does not use MapReduce. It uses a custom query and execution engine and response operators to support SQL syntax. In addition to the improved scheduling algorithm, all data processing is carried out in memory. Different processing terminals constitute the pipeline processed through the network. This will avoid unnecessary disk read and write and additional delay. This pipelined execution model runs multiple data processing segments at the same time, and once the data is available, the data will be passed from one processing segment to the next processing segment. Such a way will greatly reduce the end to end response time of various queries.
Use the scene:
1, commonly used hive queries: more and more colleagues have been querying hive through presto. Compared with the MR hive query, the efficiency of Presto has been greatly improved.