Introduction To Apache Tez And Its Benefits
INTRODUCTION TO APACHE TEZ:
Another tool we used to optimize big data processing was Apache Tez. Like MapReduce, Apache Tez is an execution framework built on top of Hadoop YARN. It is generally much faster and more tunable than the MapReduce framework. When used with a SQL engine such as Hive, a complex SQL statement is compiled into a more optimized execution plan. We used Apache Tez not only because it is faster than MapReduce, but also because data processing that previously required multiple MapReduce jobs can be done by a single Tez job. Tez also supports running existing MapReduce jobs on top of the Tez framework, which gives existing MapReduce users an easy upgrade path.
Tez follows the traditional Hadoop model of dividing a job into individual tasks, all of which run as processes via YARN on the user's behalf. Tez improves on the MapReduce framework by greatly increasing performance while preserving the capacity to scale to petabytes of data. To summarize, Apache Tez is a more optimized execution engine than MapReduce and delivers better performance at the same scale.
HOW IT WORKS:
Let us look at how the core mechanisms of Apache Tez work.
· Representing and processing via Dataflow Graphs:
  o Apache Tez represents the processing of data as a dataflow graph.
  o The vertices of this graph represent application logic and its edges represent the movement of data.
  o A well-defined dataflow definition allows users to express complex query logic naturally.
· Interaction between Input, Output, and Processor:
  o Tez models the user logic in each vertex of the dataflow graph as a composition of Input, Processor, and Output modules.
  o Input and Output determine the data format and how and where the data is read or written.
  o The Processor holds the data transformation logic.
· Ability to Dynamically Reconfigure graphs:
  o Distributed data processing is dynamic, and it is very difficult to determine the optimal data movements in advance.
  o Tez enables us to optimize these movements at runtime.
  o It can do so because it includes support for pluggable vertex management modules.
  o These modules collect runtime information and change the dataflow graph dynamically to optimize performance and resource utilization.
· Managing Resources for Optimized Performance:
  o In a Hadoop cluster, YARN manages the resources.
  o Tez acquires these resources from YARN and reuses components across the pipeline so that no operation is repeated unless it is required.
· Directed Acyclic Graphs (DAGs):
  o Tez defines a simple Java API for expressing a Directed Acyclic Graph of data processing. This API is made up of three components:
    ↬ DAG – The DAG defines the overall job. The user creates a separate DAG object for each data processing job.
    ↬ Vertex – The Vertex defines the user logic and the resources and environment needed to execute that logic. A Vertex object is created for each step in the job and added to the DAG.
    ↬ Edge – The Edge defines the connection between a producer vertex and a consumer vertex. An Edge object is created for each such connection and added to the DAG.
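The three components above can be sketched in plain Java. This is a minimal, hypothetical model: SketchDag, SketchVertex, and SketchEdge are names invented for this illustration and are not the classes in Tez's org.apache.tez.dag.api package, which also carry processor descriptors, resources, and edge properties. The sketch only shows how vertices and edges form a DAG and how an engine could derive an order in which every producer runs before its consumers. The vertex names and parallelism values are made up.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for one processing step (NOT the Tez Vertex class).
class SketchVertex {
    final String name;
    final int parallelism; // number of tasks that would run this step

    SketchVertex(String name, int parallelism) {
        this.name = name;
        this.parallelism = parallelism;
    }
}

// Hypothetical stand-in for a producer -> consumer connection.
class SketchEdge {
    final SketchVertex producer;
    final SketchVertex consumer;

    SketchEdge(SketchVertex producer, SketchVertex consumer) {
        this.producer = producer;
        this.consumer = consumer;
    }
}

// Hypothetical stand-in for the overall job.
class SketchDag {
    final String name;
    private final Map<String, SketchVertex> vertices = new LinkedHashMap<>();
    private final List<SketchEdge> edges = new ArrayList<>();

    SketchDag(String name) {
        this.name = name;
    }

    SketchDag addVertex(SketchVertex v) {
        if (vertices.putIfAbsent(v.name, v) != null) {
            throw new IllegalArgumentException("duplicate vertex: " + v.name);
        }
        return this;
    }

    SketchDag addEdge(SketchEdge e) {
        if (!vertices.containsKey(e.producer.name) || !vertices.containsKey(e.consumer.name)) {
            throw new IllegalArgumentException("edge refers to a vertex that is not in the DAG");
        }
        edges.add(e);
        return this;
    }

    // Kahn's algorithm: producers always come before their consumers.
    // Throws if the graph has a cycle, i.e. is not a valid DAG.
    List<String> executionOrder() {
        Map<String, Integer> pendingInputs = new LinkedHashMap<>();
        for (String v : vertices.keySet()) {
            pendingInputs.put(v, 0);
        }
        for (SketchEdge e : edges) {
            pendingInputs.merge(e.consumer.name, 1, Integer::sum);
        }
        Deque<String> ready = new ArrayDeque<>();
        for (Map.Entry<String, Integer> entry : pendingInputs.entrySet()) {
            if (entry.getValue() == 0) {
                ready.add(entry.getKey());
            }
        }
        List<String> order = new ArrayList<>();
        while (!ready.isEmpty()) {
            String current = ready.remove();
            order.add(current);
            for (SketchEdge e : edges) {
                if (e.producer.name.equals(current)
                        && pendingInputs.merge(e.consumer.name, -1, Integer::sum) == 0) {
                    ready.add(e.consumer.name);
                }
            }
        }
        if (order.size() != vertices.size()) {
            throw new IllegalStateException("graph contains a cycle");
        }
        return order;
    }
}

class DagSketchDemo {
    // A three-step job: tokenize lines, sum counts per word, sort the result.
    static SketchDag wordCount() {
        SketchVertex tokenizer = new SketchVertex("tokenizer", 4);
        SketchVertex summer = new SketchVertex("summer", 2);
        SketchVertex sorter = new SketchVertex("sorter", 1);
        // Vertices are deliberately added out of order; the edges decide the order.
        return new SketchDag("wordcount-sorted")
                .addVertex(sorter).addVertex(summer).addVertex(tokenizer)
                .addEdge(new SketchEdge(tokenizer, summer))
                .addEdge(new SketchEdge(summer, sorter));
    }

    public static void main(String[] args) {
        System.out.println(wordCount().executionOrder()); // prints [tokenizer, summer, sorter]
    }
}
```

Expressed as two or three chained MapReduce jobs, the same pipeline would write intermediate results to HDFS between jobs. As a single DAG it is submitted as one job, which is the advantage described in the introduction.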
BENEFITS:
Tez does not impose any data format. It only requires that the Input, Processor, and Output used together are compatible with each other.
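One way to picture this compatibility requirement is with Java generics. The sketch below is hypothetical: SketchInput, SketchOutput, and SketchTask are invented for illustration and are not Tez's runtime interfaces. It assumes an Input that yields records of some type T, a Processor that maps T to R, and an Output that accepts R. The record format is left entirely to the user, but mismatched pieces fail to compile.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

// Hypothetical interfaces for illustration only (not Tez's runtime API).
interface SketchInput<T> {
    List<T> read();
}

interface SketchOutput<R> {
    void write(R record);
}

// A task wires one Input, one Processor and one Output together.
// The type parameters force the three pieces to agree on record types.
class SketchTask<T, R> {
    private final SketchInput<T> input;
    private final Function<T, R> processor;
    private final SketchOutput<R> output;

    SketchTask(SketchInput<T> input, Function<T, R> processor, SketchOutput<R> output) {
        this.input = input;
        this.processor = processor;
        this.output = output;
    }

    // Reads every record, transforms it, writes it; returns the record count.
    int run() {
        int processed = 0;
        for (T record : input.read()) {
            output.write(processor.apply(record));
            processed++;
        }
        return processed;
    }
}

class IpoSketchDemo {
    // Input yields strings, the processor maps each string to its length,
    // and the output collects integers into a list.
    static List<Integer> lineLengths(List<String> lines) {
        List<Integer> sink = new ArrayList<>();
        SketchInput<String> input = () -> lines;
        Function<String, Integer> processor = String::length;
        SketchOutput<Integer> output = sink::add;
        new SketchTask<>(input, processor, output).run();
        return sink;
    }

    public static void main(String[] args) {
        System.out.println(lineLengths(Arrays.asList("tez", "hive", "yarn"))); // prints [3, 4, 4]
    }
}
```

Swapping in an output that accepts String instead of Integer would be rejected by the compiler. That is a compile-time analogue of the requirement that the modules be compatible.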
INTEGRATING TEZ ON HIVE:
Running Hive on Tez provides a SQL-based data warehouse system in which Hive queries are executed by the Tez engine. This greatly improves SQL query performance, along with security and auditing capabilities. In practice, Hive is the main user of Tez.
BENEFITS:
2. Tez enables Hive to transition from batch mode to interactive mode.
3. At first, MapReduce was the only framework available in Hive for converting Hive queries into execution jobs on Hadoop clusters. Now the Tez execution engine is integrated into Hive to boost the efficiency of complex Hive queries.