What Is Apache Hive?

HISTORY:

When Apache Hadoop first arrived, it quickly became the backbone of Big Data. To get the most out of it, users had to write extensive, sophisticated Java code, which was difficult for novices.

Organizations realized that teaching their staff to write sophisticated Java programs just to work efficiently on Apache Hadoop would be strenuous, so they saw the need for another tool with a simpler interface.

As a result, Apache Hive was created. It offers a SQL-like interface, since SQL was already widely used by engineers and analysts.

Joydeep Sen Sarma and Ashish Thusoo co-created Apache Hive while working at Facebook.

INTRODUCTION:

Apache Hive is an open-source data warehouse system built on Hadoop. It allows reading, writing, querying, and managing huge datasets stored in distributed storage using SQL-like queries.

Hive provides the SQL abstraction needed to run SQL-like queries on top of the underlying Java code, so the queries do not have to be implemented in the low-level Java API. It also lets users project structure onto data that is already in storage. To connect to Hive, users can use a command-line tool or a JDBC driver.
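To make "projecting structure onto data that is already in storage" more concrete, here is a minimal Python sketch of the idea. It is not Hive code: the sample rows, the `SCHEMA` list, and the `read_with_schema` helper are all hypothetical names made up for illustration. The raw text carries no column names or types; the structure is applied only when the data is read.

```python
import csv
import io

# Raw, comma-separated text as it might already sit in storage.
# The data itself carries no column names or types.
RAW_DATA = "1,alice,34\n2,bob,27\n3,carol,41\n"

# Hypothetical table definition: column names plus Python types, standing in
# for the column types a table declaration would provide.
SCHEMA = [("id", int), ("name", str), ("age", int)]


def read_with_schema(raw_text, schema):
    """Apply the schema to each raw row at read time and yield dicts."""
    for fields in csv.reader(io.StringIO(raw_text)):
        yield {name: cast(value) for (name, cast), value in zip(schema, fields)}


rows = list(read_with_schema(RAW_DATA, SCHEMA))

# Once the structure is projected, the rows can be filtered by column name.
over_30 = [row["name"] for row in rows if row["age"] > 30]
print(over_30)
```

The stored text is never modified here. Only the reader's view of it changes, which is the point of the sketch.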

We keep mentioning SQL here, but strictly speaking Hive does not use standard SQL. It uses a SQL-like language known as HQL (Hive Query Language, also written HiveQL), which closely resembles SQL.

Hive removed the barrier that Java posed for people working on Hadoop. A basic operation that used to take 10-15 lines of Java code can now be written as a one-line HQL statement.
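The contrast between hand-written job code and a one-line query can be sketched in Python. This is only an illustration under stated assumptions: the word list and both function names are hypothetical, and the standard-library `sqlite3` module stands in for a SQL-like engine. It is not Hive and does not run on Hadoop.

```python
import sqlite3
from collections import defaultdict

WORDS = ["hive", "hadoop", "hive", "tez", "hadoop", "hive"]


def count_words_by_hand(words):
    """Hand-written style: explicit map, shuffle, and reduce phases."""
    mapped = [(word, 1) for word in words]       # map: emit (key, 1) pairs
    grouped = defaultdict(list)
    for key, value in mapped:                    # shuffle: group values by key
        grouped[key].append(value)
    return {key: sum(values) for key, values in grouped.items()}  # reduce


def count_words_with_sql(words):
    """Declarative style: the counting logic is a single SQL statement."""
    conn = sqlite3.connect(":memory:")           # sqlite3 is a stand-in, not Hive
    conn.execute("CREATE TABLE words (word TEXT)")
    conn.executemany("INSERT INTO words (word) VALUES (?)", [(w,) for w in words])
    result = dict(conn.execute("SELECT word, COUNT(*) FROM words GROUP BY word"))
    conn.close()
    return result


by_hand = count_words_by_hand(WORDS)
with_sql = count_words_with_sql(WORDS)
print(by_hand == with_sql)
```

Both functions produce the same counts. In the second one, the grouping and counting are expressed by the `SELECT ... GROUP BY` statement alone, which is the kind of saving the paragraph above describes.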

MYTHS ASSOCIATED WITH HIVE:

Hive is not a database, but rather a query engine. It has no storage of its own; the data typically lives in the Hadoop Distributed File System (HDFS). Hive's job is to query and analyze that data.
There is no need to master complicated SQL in order to use Hive. A basic understanding of insert, select, and joins is enough to get started.
Hive is an abstraction over the MapReduce engine, not a replacement for it. It translates HQL queries into jobs for the Apache Hadoop MapReduce engine, so in effect we are using MapReduce through HQL. Apache Tez and Apache Spark are the alternative execution engines that can take the place of MapReduce.
Hive reads the data from HDFS, processes it using MapReduce, and stores the results back on HDFS.

METASTORE OF HIVE:

Because Hive uses HQL, a SQL-like language, we work with tables, often a large number of them. Information about these tables is stored as metadata. The important point to remember is that the metadata contains all of the Hive table information, but not the actual data! This information includes each table's location, schema, and so on.

Hive stores this metadata in an RDBMS (relational database), called the metastore, which may be MySQL, Oracle, and so forth.
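The split between metadata and data can be sketched in a few lines of Python. Everything here is hypothetical and simplified: the `TableInfo` class, the `metastore` dictionary, and the `employees.csv` file are made-up stand-ins, and a real metastore is a relational database rather than an in-memory dictionary.

```python
import os
import tempfile
from dataclasses import dataclass


@dataclass
class TableInfo:
    """Hypothetical metastore record: describes a table, holds no rows."""
    name: str
    columns: list
    location: str


with tempfile.TemporaryDirectory() as storage_dir:
    # The actual data lives in a file in storage (a local temp dir here).
    data_path = os.path.join(storage_dir, "employees.csv")
    with open(data_path, "w") as f:
        f.write("1,alice\n2,bob\n")

    # The metastore only records where the table is and what it looks like.
    metastore = {"employees": TableInfo("employees", ["id", "name"], data_path)}

    # To answer a query: look up the metadata, then read the data it points to.
    info = metastore["employees"]
    with open(info.location) as f:
        rows = [dict(zip(info.columns, line.strip().split(","))) for line in f]

print(rows)
```

The `metastore` entry contains a path and column names but none of the rows, which mirrors the distinction made above.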

This metastore can be set up in two ways:

Embedded Metastore
Remote Metastore

Usually, people install a separate RDBMS, such as MySQL or Oracle. But if we do not install one, where will our metadata go?

The answer is that Hive ships with an embedded RDBMS, Apache Derby, which serves as the default metastore and comes with the Apache Hive installation. If we instead set up a separate metastore that all nodes can reach, it is known as a remote metastore.

Now, you're probably wondering why we need a remote metastore when we already have an embedded one.

Let’s look a little deeper with an example.

Imagine you have a 4-node Hadoop cluster.

You install Hive on all 4 nodes.

You use the embedded metastore that comes with Hive itself.

Consider the following scenario: you create a table, the request is handled by Node 3, and the table's metadata is saved in Node 3's Derby metastore. You then insert a row into that table, and this time the request is sent to Node 2. Node 2 denies the table's existence, because the metastore is embedded in Node 3 alone and is not accessible to the other nodes. As a result, the insert fails with an error!

Now, imagine the same scenario.

But this time, the nodes share a single remote metastore instead of each having an embedded one.

This time, every node has access to the metastore. Node 3 serves the first request, creating the table, and the table's location is recorded in the remote metastore. Node 2 serves the second request, the insert into that table; it finds the table by looking up the location recorded in the remote metastore and executes the insert correctly.

That’s the main difference between the two metastores.
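The two scenarios above can be simulated in a short Python sketch. It is a toy model, not Hive: the `Node` class, the `sales` table, and the `/warehouse/sales` path are hypothetical, and each metastore is just a dictionary mapping a table name to a storage location.

```python
class Node:
    """Toy cluster node that consults a metastore before touching a table."""

    def __init__(self, name, metastore):
        self.name = name
        self.metastore = metastore  # dict: table name -> storage location

    def create_table(self, table, location):
        self.metastore[table] = location

    def insert(self, table, row):
        if table not in self.metastore:
            raise LookupError(f"{self.name}: table '{table}' not found")
        return f"{self.name}: wrote {row} to {self.metastore[table]}"


# Embedded setup: every node has its own private metastore.
node2 = Node("node2", {})
node3 = Node("node3", {})
node3.create_table("sales", "/warehouse/sales")
try:
    embedded_result = node2.insert("sales", (1, "book"))
except LookupError as err:
    embedded_result = f"error: {err}"

# Remote setup: all nodes share one metastore.
shared_metastore = {}
remote_node2 = Node("node2", shared_metastore)
remote_node3 = Node("node3", shared_metastore)
remote_node3.create_table("sales", "/warehouse/sales")
remote_result = remote_node2.insert("sales", (1, "book"))

print(embedded_result)
print(remote_result)
```

In the embedded setup, `node2` cannot see the table that `node3` registered, so the insert fails. In the shared setup, the same request succeeds because both nodes look in the same place.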

Again, the obvious question arises: what is the purpose of an embedded metastore?

The answer is simple!

It is best for single-node clusters.

ADVANTAGES:

It gives us an easy HQL (SQL-like) interface instead of writing Java for Apache Hadoop MapReduce.
It is highly scalable.
It is highly compatible with Apache Hadoop.
It supports ETL (Extract, Transform, and Load).
Hive allows users to access files from HDFS, Apache HBase, Amazon S3, and other storage systems.
Since Hive data is stored on HDFS, fault tolerance is provided by Hadoop.
Hive can help us with data mining, predictive modeling, and document indexing.

DISADVANTAGES:

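Apache Hive is designed for online analytical processing (OLAP); it is not suited to online transaction processing (OLTP).
Hive offers only limited support for row-level update and delete. Early versions did not support them at all, and later versions (from Hive 0.14) allow them only on transactional (ACID) tables.
Subquery support is more limited than in standard SQL. Early versions permitted subqueries only in the FROM clause.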

CONCLUSION:

Created at Facebook, Hive has proved to be a very productive tool in the data field and is used worldwide by companies of all sizes.

Two more concepts supported by Hive, Partitioning and Bucketing, are generating a lot of hype nowadays and will be discussed in our upcoming article.

Till then, stay tuned!😊
