Introduction To Apache Tez And Its Benefits

INTRODUCTION TO APACHE TEZ:

Another tool we used for optimizing big data processing was Apache Tez. Like MapReduce, Apache Tez is an execution framework built on top of Hadoop YARN. It is much faster and more tunable than the MapReduce framework, and it lets complex SQL statements be executed as an optimized data processing graph. We chose Apache Tez not only because it is faster than MapReduce, but also because data processing tasks that previously required multiple MapReduce jobs can be done in a single Tez job. Tez can also run existing MapReduce jobs on top of the Tez framework, which gives existing MapReduce users an easy upgrade path.

Tez follows the traditional Hadoop model of dividing a job into individual tasks, all of which run as processes via YARN on the user's behalf. In short, Tez improves on the MapReduce framework by greatly increasing performance while preserving the capacity to scale to petabytes of data.
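The point about replacing several MapReduce jobs with one Tez job can be illustrated with a toy model. The sketch below is plain Python, not the Tez API: the `storage` dictionary is a hypothetical stand-in for the intermediate output that chained jobs write between steps, and all function names are invented for illustration.

```python
from collections import Counter

# Hypothetical stand-in for cluster storage: every entry models an
# intermediate result written out between two chained jobs.
storage = {}

def count_words(lines):
    """Stage 1: count how often each word occurs."""
    return Counter(word for line in lines for word in line.split())

def top_n(counts, n):
    """Stage 2: keep the n most frequent words (ties broken alphabetically)."""
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:n]

def run_as_two_jobs(lines, n):
    """Chained-job style: job 1 writes its output, job 2 reads it back."""
    storage["job1_output"] = count_words(lines)  # intermediate write
    return top_n(storage["job1_output"], n)      # intermediate read

def run_as_one_dag(lines, n):
    """Single-job style: both stages belong to one job, data flows directly."""
    return top_n(count_words(lines), n)

lines = ["tez runs on yarn", "hive runs on tez", "tez is fast"]
two_jobs_result = run_as_two_jobs(lines, 2)
intermediate_writes = len(storage)  # writes caused by the chained version
one_dag_result = run_as_one_dag(lines, 2)
```

Both variants compute the same answer; the difference is that the chained version has to materialize an intermediate result between its two jobs.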

HOW IT WORKS:

Let us further explain how exactly the core processes in Apache Tez take place.

· Representing and processing via Dataflow Graphs:

o Apache Tez represents the processing of data as a data flow graph.

o The vertices of this graph represent application logic and its edges represent the movement of the data.

o A well-defined dataflow definition allows users to express complex query logic naturally.

· Interaction between Input, Output, and Processor:

o Tez models the user logic in each vertex of the dataflow graph as a composition of Input, Processor, and Output modules.

o Input & Output determine the data format and how and where it is read or written.

o The Processor holds the data transformation logic.
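The split between Input, Processor, and Output can be sketched as follows. This is a minimal plain-Python model under stated assumptions, not Tez code: the class names `LineInput`, `UppercaseProcessor`, `LineOutput` and the helper `run_vertex` are hypothetical.

```python
import io

class LineInput:
    """Input: knows where the data comes from and how to decode it."""
    def __init__(self, stream):
        self.stream = stream

    def read(self):
        for line in self.stream:
            yield line.rstrip("\n")

class UppercaseProcessor:
    """Processor: holds only the transformation logic."""
    def run(self, records):
        for record in records:
            yield record.upper()

class LineOutput:
    """Output: knows where and in which format the data is written."""
    def __init__(self, stream):
        self.stream = stream

    def write(self, records):
        for record in records:
            self.stream.write(record + "\n")

def run_vertex(vertex_input, processor, vertex_output):
    """A vertex task wires the three parts together."""
    vertex_output.write(processor.run(vertex_input.read()))

source = io.StringIO("tez\nhive\n")
sink = io.StringIO()
run_vertex(LineInput(source), UppercaseProcessor(), LineOutput(sink))
result = sink.getvalue()
```

Because the processor never touches the streams directly, the same transformation could be reused with a different input or output, as long as the record format stays compatible.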

· Ability to Dynamically Reconfigure graphs:

o Distributed data processing is dynamic and it is very difficult to determine the optimal data movements in advance.

o Tez enables us to optimize these movements during runtime.

o It can do so because it includes support for pluggable vertex management modules.

o These modules collect runtime information and change the dataflow graph dynamically to optimize performance and resource utilization.
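One way to picture runtime reconfiguration is a manager that lowers the number of tasks in a consumer vertex once the real output size of its producers is known. The sketch below is an illustrative plain-Python model; `SizeBasedVertexManager`, its methods, and the 64 MB per-task target are assumptions made for this example, not part of the Tez API.

```python
import math

class Vertex:
    def __init__(self, name, parallelism):
        self.name = name
        self.parallelism = parallelism  # number of tasks planned for this vertex

class SizeBasedVertexManager:
    """Hypothetical pluggable manager: shrinks a consumer vertex once the
    actual output size of its producer tasks has been observed."""
    def __init__(self, bytes_per_task):
        self.bytes_per_task = bytes_per_task
        self.observed_bytes = 0

    def on_producer_task_finished(self, output_bytes):
        # Runtime information arrives as producer tasks complete.
        self.observed_bytes += output_bytes

    def reconfigure(self, consumer):
        # Never go below one task and never above the planned parallelism.
        needed = max(1, math.ceil(self.observed_bytes / self.bytes_per_task))
        consumer.parallelism = min(consumer.parallelism, needed)
        return consumer.parallelism

MB = 1024 * 1024
reducer = Vertex("reducer", parallelism=100)
manager = SizeBasedVertexManager(bytes_per_task=64 * MB)
for size in (10 * MB, 50 * MB, 100 * MB):
    manager.on_producer_task_finished(size)
final_parallelism = manager.reconfigure(reducer)
```

Here 160 MB of observed producer output and a 64 MB target per task reduce the planned 100 consumer tasks to 3.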

· Managing Resources for Optimized Performance:

o In a Hadoop cluster, YARN manages the resources.

o Tez acquires these resources from YARN and reuses components along the pipeline so that no operation is repeated unnecessarily.
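The idea of reuse can be sketched with a simple pool, assuming that the reused component is a container that runs one task after another. This is a hypothetical plain-Python model; `ContainerPool` and `run_tasks` are invented names and do not reflect how Tez talks to YARN.

```python
class ContainerPool:
    """Hypothetical pool: hands out an idle container before requesting a new one."""
    def __init__(self):
        self.idle = []
        self.allocated = 0  # containers requested from the resource manager

    def acquire(self):
        if self.idle:
            return self.idle.pop()
        self.allocated += 1
        return f"container-{self.allocated}"

    def release(self, container):
        self.idle.append(container)

def run_tasks(pool, tasks):
    """Run tasks one after another, returning (task, container) placements."""
    placements = []
    for task in tasks:
        container = pool.acquire()
        placements.append((task, container))
        pool.release(container)  # task finished, container is idle again
    return placements

pool = ContainerPool()
placements = run_tasks(pool, ["map-1", "map-2", "reduce-1"])
```

In this model three sequential tasks need only one allocation, because each finished task returns its container to the pool before the next task starts.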

· Directed Acyclic Graphs (DAGs):

o Tez defines a simple Java API for expressing a Directed Acyclic Graph (DAG) of data processing. This API is made up of three components:

↬ DAG – The DAG defines the overall job. A separate DAG object is created by the user for each data processing job.

↬ Vertex – The vertex defines the user logic and the resources & environment needed to execute that user logic. A Vertex object for each step in the job is created and added to the DAG.

↬ Edge – The Edge defines the connection between producer and consumer vertices. An Edge object is created to connect each producer vertex to its consumer vertex.
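The relationship between the three components can be sketched in a few lines. The real Tez API is Java; the plain-Python model below only mirrors the DAG, Vertex, and Edge concepts described above, and every method name in it (`add_vertex`, `add_edge`, `run`) is an assumption made for illustration.

```python
from collections import deque

class Vertex:
    def __init__(self, name, logic):
        self.name = name
        self.logic = logic  # callable: list of producer outputs -> output

class Edge:
    def __init__(self, producer, consumer):
        self.producer = producer
        self.consumer = consumer

class DAG:
    def __init__(self, name):
        self.name = name
        self.vertices = {}
        self.edges = []

    def add_vertex(self, vertex):
        self.vertices[vertex.name] = vertex
        return self

    def add_edge(self, edge):
        self.edges.append(edge)
        return self

    def run(self):
        """Run each vertex once all of its producers have finished."""
        inputs = {name: [] for name in self.vertices}
        pending = {name: 0 for name in self.vertices}
        for edge in self.edges:
            pending[edge.consumer.name] += 1
        ready = deque(name for name, count in pending.items() if count == 0)
        results = {}
        while ready:
            name = ready.popleft()
            results[name] = self.vertices[name].logic(inputs[name])
            for edge in self.edges:
                if edge.producer.name == name:
                    # Data moves along the edge to the consumer vertex.
                    inputs[edge.consumer.name].append(results[name])
                    pending[edge.consumer.name] -= 1
                    if pending[edge.consumer.name] == 0:
                        ready.append(edge.consumer.name)
        if len(results) != len(self.vertices):
            raise ValueError("graph contains a cycle")
        return results

# Inputs reach "net" in the order the producer vertices were added.
orders = Vertex("orders", lambda _: [10, 20, 30])
refunds = Vertex("refunds", lambda _: [5])
net = Vertex("net", lambda ins: sum(ins[0]) - sum(ins[1]))

dag = DAG("net-revenue")
dag.add_vertex(orders).add_vertex(refunds).add_vertex(net)
dag.add_edge(Edge(orders, net)).add_edge(Edge(refunds, net))
results = dag.run()
```

One DAG object describes the whole job, each step is a vertex, and each edge states which producer feeds which consumer. The `run` method simply executes vertices in dependency order and rejects graphs that are not acyclic.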

BENEFITS:

1. Tez offers a flexible execution architecture that lets us define complex computations as dataflow graphs and apply dynamic performance optimizations based on information about the input data and processing resources gathered at runtime.

2. Compared to the MapReduce framework, Tez improves processing speed while scaling from gigabytes to petabytes of data and from tens to thousands of nodes.

3. The Apache Tez library enables developers to build applications that integrate with YARN and run efficiently on Hadoop clusters.

Tez does not impose any data format. However, the Inputs, Outputs, and Processors must be compatible with each other.

INTEGRATING TEZ ON HIVE:

Hive on Tez provides a SQL-based data warehouse system built on Apache Hive. The integration greatly improves SQL query performance, security, and auditing capabilities. In practice, Hive is the main user of Tez.

BENEFITS:

1. Tez converts complex SQL queries into purpose-built data processing graphs that strike the ideal balance of performance, throughput, and scalability across a wide range of use cases and data set sizes.
2. Tez enables Hive to transition from batch mode to interactive mode.
3. Originally, MapReduce was the only framework available in Hive for converting Hive queries into execution jobs on Hadoop clusters. The Tez execution engine is now integrated into Hive to improve the efficiency of complex Hive queries.
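In Hive, the engine used for a session is selected through the `hive.execution.engine` configuration property. The small Python helper below only builds the HiveQL text and does not connect to any cluster; the function name `with_engine` is hypothetical, and which engine values are actually accepted depends on the Hive version and distribution in use.

```python
# Values documented for hive.execution.engine; availability varies by Hive version.
VALID_ENGINES = {"mr", "tez", "spark"}

def with_engine(query, engine="tez"):
    """Prefix a HiveQL query with the statement that selects the execution engine."""
    if engine not in VALID_ENGINES:
        raise ValueError(f"unknown execution engine: {engine}")
    # Normalize the query so it ends with exactly one semicolon.
    body = query.rstrip().rstrip(";")
    return f"SET hive.execution.engine={engine};\n{body};"

script = with_engine("SELECT COUNT(*) FROM sales")
```

The resulting script can be pasted into a Hive session; the `sales` table is a placeholder name.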

HOW IT WORKS:

Hive queries that run on Tez instead of the MapReduce framework perform better because Tez expresses them as Directed Acyclic Graphs (DAGs) and uses data transfer primitives to move data between stages. Hive on Tez runs tasks in ephemeral containers and uses the standard YARN shuffle service. By default, Hive data is stored on HDFS.

Hive executes SQL queries in the following order:

• Hive compiles the query.

• Tez executes the query.

• YARN assigns resources to applications across the cluster.

• Hive returns query results after updating the data in the data source.

COMMON PROBLEMS:

• SequenceFile schema changes do not work properly when Hive runs on Tez.

• The CommonMergeJoinOperator sets the big table position only when it has inputs for the big table. The method is not invoked if that input is empty.

• Some queries fail with an error when many queries are submitted concurrently in the same HiveServer2 session.