A Complete Guide To Apache Hive Join Optimizations

Overview

To achieve Database normalization, Join Query was introduced where we combine fields from multiple tables by using values common to each. A syntax for HQL Join is:

Different Types of JOINs

🔶 (INNER) JOIN: Returns matching rows from both tables having the same key

🔶 LEFT (OUTER) JOIN: Returns all records from the left table, and the matched records from the right table

🔶 RIGHT (OUTER) JOIN: Returns all records from the right table, and the matched records from the left table

🔶 FULL (OUTER) JOIN: Returns all records when there is a match in either the left or right table

Check How to setup Apache Hive using MySQL metastore

Working of JOINs

First, let's go over how Hive's join works. A MapReduce task will be created from a common join operation. A map stage and a reduce stage are common components of a join task. A mapper reads join tables and writes the join key and value pair to an intermediate file. In the shuffle stage, Hadoop sorts and merges these pairs. The reducer receives the sorted results and performs the actual join work. Because it must sort and merge, the shuffle stage is extremely expensive. The ability to save the shuffle and reduce stages improves task performance.

One thing to remember when writing the join query is that the table mentioned last is streamed through the reducers, while the rest are buffered in memory in the reducers. So, keep in mind that the large table is mentioned at the end, which reduces the memory required by the reducer, or use the STREAMTABLE hint.

Another thing to keep in mind is that when a where clause is used in conjunction with a join, the join part is executed first, and then the results are filtered using the where clauses.

While the same record filtering could have been performed when joining the tables. This can be accomplished by combining the filtering and join conditions.

This join includes a shuffle phase in which the mapper outputs are sorted and shuffled with the join keys, which incurs high costs in terms of processing, I/O, and data transfer. So, in order to reduce the aforementioned costs, many join algorithms were introduced.

🔹 Multi-way Join

Multi-way join algorithms are capable of joining more than two relations at a time hence improving performance by avoiding extra I/Os from multiple partitioning steps. In short, Reducing the number of reducers by doing all reducing tasks in just one reducer is the key to optimization.

🔹 Map Join (Broadcast Join or Broadcast-Hash Join)

This algorithm keeps all of the small tables (dimension tables) in memory in each mapper, while the big table (fact table) is streamed over it. For each of the small tables, a hash table would be created with the join key as the hash table key, and data would be matched with the mapping hash value when merging in the Mapper Function. As a result, because no reducers are required, these map joins in Hive are much faster than regular joins.

The following settings are to be achieved to perform Map join:

hive> set hive.auto.convert.join = true;
hive> set hive.auto.convert.join.noconditionaltask = true;

Or else, we can use the Map Join hint in the query:

The restriction is that a FULL/RIGHT OUTER JOIN b cannot be performed.

🔹 Bucket Map Join

The buckets can be joined with each other if the tables being joined are bucketed on the join columns and the number of buckets in one table is a multiple of the number of buckets in the other table. If tables A and B each have four buckets, the following join is appropriate.

Instead of fetching B completely for each mapper of A, only the required buckets are fetched. The following configuration works on it:

hive> set hive.optimize.bucketmapjoin = true;

🔹 Sort-Merge-Bucket (SMB) Join

The SMB join algorithm is an optimized version of the Bucket Map Join algorithm. A sort-merge join can be performed if the tables being joined are sorted and bucketed on the join columns and have the same number of buckets. At the mapper, the corresponding buckets are linked together. If both A and B have four buckets, the query is the same as the bucket map join above. The following parameters must be configured:

hive> set hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;
hive> set hive.optimize.bucketmapjoin = true;
hive> set hive.optimize.bucketmapjoin.sortedmerge = true;

🔹 Skew Join

When there is a table with skew data (having values that are present in large numbers in the table compared to other data) affecting the join performance by disturbing the balance of resources, Skew join is used to tackle this. It would execute one query to implement the join of the skewed key and another join for the rest of the join keys, and would further merge the outputs of both queries. Hence, these rows are loaded into the memory. The major disadvantage of this algorithm is that the unskewed table will be read twice. The following parameters should be set:

hive> set hive.optimize.skewjoin=true;
hive> set hive.skewjoin.key=100000;

More Resources:

Thanks for reading!

To conclude, We have studied JOINs and their 5 optimization techniques. Joins prove to be a major advantage in normalizing databases with optimized techniques. To learn more about Hive file formats, you can check out here: Apache Hive File Formats Explained.

If any query occurs feel free to ask in the comment section.

See you next time,

@TechAE