Presto from the point of view of my use

Reasons for adoption:

As big data developers, we see our BI colleagues query many kinds of data through Hive every day, and more and more reporting workloads depend on it. Although our CDH cluster has Impala deployed, most of our Hive tables use the ORC format, and Impala's support for ORC is poor. Before we adopted Presto, those queries ran through Hive on MapReduce, so query efficiency was low.

Early on, our operations team set up a 3-node trial environment. We queried Hive through Presto, ran some simple SQL, and found the performance to be very good.

Our existing tools could not query historical and real-time data together. When BI raised the requirement that the big data team support interactive queries spanning Hive and MySQL, we tested Spark, Presto, and other tools, and found Presto the easiest to use; it also suited our BI colleagues. After internal discussion, we decided to promote Presto widely.

We built a Presto cluster on our Hadoop nodes with 1 coordinator and 3 workers. As Presto usage grew, the cluster has since been expanded to 7 workers, and worker memory has been increased from the original 8 GB to 24 GB.
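A minimal coordinator/worker setup along these lines can be sketched with Presto's `etc/config.properties`. The hostnames, ports, and memory values below are illustrative assumptions, not our exact production settings:

```properties
# etc/config.properties on the coordinator
coordinator=true
node-scheduler.include-coordinator=false
http-server.http.port=8080
query.max-memory=50GB
query.max-memory-per-node=20GB
discovery-server.enabled=true
discovery.uri=http://coordinator-host:8080

# etc/config.properties on each worker
coordinator=false
http-server.http.port=8080
query.max-memory=50GB
query.max-memory-per-node=20GB
discovery.uri=http://coordinator-host:8080
```

Raising `query.max-memory-per-node` is the knob that corresponds to our 8 GB to 24 GB worker upgrade; it must stay within the JVM heap configured in `etc/jvm.config`.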

An introduction to Presto:

Presto is an open-source distributed SQL query engine designed for interactive analytical queries over massive data. It mainly addresses the slow, batch-oriented interactive analysis typical of commercial data warehouses. It supports standard ANSI SQL, including complex queries, aggregations, joins, and window functions.

Presto supports querying data where it lives, including Hive, Cassandra, relational databases, and proprietary data stores. A single Presto query can combine data from multiple sources, enabling analysis across the entire organization.

How it works:

Presto's execution model is fundamentally different from that of Hive on MapReduce. Hive translates a query into multiple stages of MapReduce tasks that run one after another; each task reads its input from disk and writes intermediate results back to disk. Presto does not use MapReduce. It has a custom query and execution engine with operators designed to support SQL semantics. Beyond an improved scheduling algorithm, all data processing happens in memory: processing stages form a pipeline connected over the network, which avoids unnecessary disk reads and writes and the latency they add. This pipelined model runs multiple processing stages concurrently, and data is passed from one stage to the next as soon as it becomes available, greatly reducing end-to-end response time for all kinds of queries.

Use cases:

1. Routine Hive queries: more and more colleagues now query Hive through Presto. Compared with Hive on MapReduce, Presto improves query efficiency dramatically.

2. The data platform queries Hive through Presto to power business dashboards.

3. Unions over Cobar shards: cobarc has 5 shards, so its data is spread across 5 databases. Querying it used to mean querying each shard separately and merging the results by hand, which was inefficient. cobarb has 8 databases and the same problem. Through Presto, the union over the 5 shards can be expressed in a single query, improving efficiency and simplifying the SQL.
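The shard-union pattern above can be sketched in Presto SQL. The catalog, schema, and table names here (`cobarc_shard1` etc.) are hypothetical; the assumption is that each MySQL shard is registered as its own Presto catalog via the MySQL connector:

```sql
-- Illustrative only: catalog/schema/table names are assumptions.
-- One catalog per shard lets a single query span all of them.
SELECT order_id, amount FROM cobarc_shard1.orders.t_order
UNION ALL
SELECT order_id, amount FROM cobarc_shard2.orders.t_order
UNION ALL
SELECT order_id, amount FROM cobarc_shard3.orders.t_order;
```

`UNION ALL` avoids the deduplication cost of plain `UNION`, which is usually what sharded data wants since each row lives in exactly one shard.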

4. Joins between Hive and MySQL: previously, Hive's historical data and cobarc's real-time data could not be joined; you could query either the previous day's data or the current day's data, but not both together. With Presto, Hive tables and cobarc tables can be joined in a single query.
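A cross-catalog join of that kind might look like the following sketch. The catalog, schema, table, and column names are hypothetical, assuming a `hive` catalog for the warehouse and a `mysql` catalog for the real-time store:

```sql
-- Illustrative only: names are assumptions, not our real schema.
-- Historical data from Hive joined with today's data from MySQL.
SELECT h.user_id,
       h.total_before_today,
       m.total_today
FROM hive.dw.user_history AS h
JOIN mysql.app.user_realtime AS m
  ON h.user_id = m.user_id;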

5. Mapping Kafka topics to tables: some of our Kafka topics carry structured JSON data. Through Presto's Kafka connector, such a topic can be mapped to a table and queried directly with SQL. For now, however, we have not applied this in production.
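Mapping a topic works by registering a Kafka catalog; a minimal sketch, assuming a broker at `kafka-host:9092` and a topic named `events` (both hypothetical):

```properties
# etc/catalog/kafka.properties
connector.name=kafka
kafka.nodes=kafka-host:9092
kafka.table-names=default.events
kafka.hide-internal-columns=false
```

With the connector's JSON decoder configured via a table description file, the topic's JSON fields become columns queryable as `SELECT ... FROM kafka.default.events`.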

6. At present, web tools such as Hue and Zeppelin are used to run Presto queries, with support for exporting result sets.

The Flow of Project Control Needs Consistent Optimization and Progression

One of my projects this summer has undoubtedly been the excruciating job of keeping a swamp from flooding my rental cabin. I have already told the story of how I gained control over the flood. What was once a boggy marsh is now a moist field with a stream running through it. Three moose have even found a safe haven in a grassy area only twenty feet from my back door. However, because I can only manage the area with a simple spade, the flooding is still a problem; I can't seem to keep up. Getting machinery up the steep, rough road is expensive and risky, and it would take a lot of work just to keep a tractor from settling into the mud. Because everything is so slow by hand, I must constantly dig trenches and channel the water into a deeper stream bed. Even after a week's rest, the water starts to swell into the small depressions where it can stagnate. There are a few things I have been doing to keep this swamp at bay with the limited time I have. I think project managers may similarly have limited time and tools on certain projects, and they can apply the same principles in their control techniques.

First, I must be consistent with my time, working on a regular basis. Fallen pine needles and silt quickly clog sections of the stream, but if I keep the stream bed clear of debris, the water gains enough force to naturally carry the material downstream. However, even a small blockage can stop the water, and other sections can slow down in no time. If I don't take care of this, the stream bed simply disappears back into a swamp, and I am back to square one. Part of managing a project is to always keep the project constraints visible. If, for whatever reason, a project manager neglects to do so, scope creep can arrive quite suddenly, and regaining control over the project constraints can be difficult.

Second, if I keep the stream clear of debris, then I have more time to improve the flow. For example, I might cut down a dead pine that keeps dropping its needles into the water. I might dig a section deeper, wider, or straighter so debris doesn't collect around the edges where the stream makes a turn. In project management, once the project constraints are under control, the project manager has this extra time to develop better ways of avoiding recurring risks and other problems.

Third, with an optimized stream flow, I am able to track down the spots where the water is coming from and make progress from there. Though the swamp around my cabin is gone, there are acres of further swamps that flow down from higher elevations. That water spreads in all directions whether or not the main stream is clogged. No matter how hard I work on the stream's current cut, the water from the upper swamps remains a risk. To solve this problem, I must guide each of those sources into one flow, cutting channels into the higher swamps. This puts more water into the main stream and less water into those random stagnating spots. The more water there is, the more efficiently it carries out the debris, which also helps grind the ditch wider and deeper. In project management, beyond keeping project constraints visible and controlled, the manager should go further and identify the sources of "flooding," then channel them into one path. In other words, a good project manager knows where a project should progress.

In a nutshell, these three aspects combine into management that is consistent, optimized, and progressing, and the tools and strategies used on a project should accomplish exactly that. I think the idea of consistent optimization and progression is particularly relevant for organizations that use project cycle management (PCM). In a way, when a company takes on a project very similar to ones it has done in the past, not having an optimized project management cycle is like the stream getting clogged all over again. The stronger the flow, the more time there is for the project manager to channel in more resources and worry less about risks, scope creep, and everything else that can become a problem.