
Simplified Architecture Behind ORC Format

Overview

ORC is often described as the best format for storing files on Hive, but why is it such a good fit? Let's look into this file format and check out its architecture. For an outline of other Hive file formats, you can look into this article.

What is ORC?

ORC (Optimized Row Columnar) is a data storage format designed to store large amounts of data efficiently. It is a column-oriented file format, meaning it stores data in columns instead of rows. This allows data to be read and written more efficiently, since readers can access specific columns without scanning through all the data in the file.
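To make this concrete, here is a minimal sketch using the pyarrow library (the file name and column names are illustrative): because ORC is columnar, the reader only has to touch the streams of the columns you ask for.

```python
import pyarrow.orc as orc

# Open a hypothetical ORC file and project just two of its columns;
# the streams for all other columns are never read from disk.
reader = orc.ORCFile("sales.orc")
table = reader.read(columns=["region", "amount"])
print(table.num_rows, table.column_names)
```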

ORC Format Layout

An ORC file contains stripes, which are groups of row data, along with auxiliary information in the file footer. A postscript at the end of the file holds the compression parameters and the size of the compressed footer.

ORC Architecture is illustrated below:

ORC Architecture

The stripe size is set to 250 MB by default. Large stripe sizes allow for large, efficient HDFS reads. The file footer comprises a list of the stripes in the file, the number of rows per stripe, and the data type of each column. It also includes column-level aggregates like count, min, max, and sum.
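The footer metadata described above can be inspected programmatically. A small sketch with pyarrow (the file name is illustrative):

```python
import pyarrow.orc as orc

reader = orc.ORCFile("sales.orc")
print("stripes:", reader.nstripes)  # number of stripes in the file
print("rows:", reader.nrows)        # total row count from the footer
print("schema:", reader.schema)     # the data type of each column

# Stripes are independent, so each one can be read on its own.
first_stripe = reader.read_stripe(0)
print("rows in first stripe:", first_stripe.num_rows)
```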

Stripe Structure

Each stripe in an ORC file, as illustrated in the diagram, contains index data, row data, and a stripe footer.

The stripe footer has a list of stream locations. The index data includes the minimum and maximum values for each column as well as the row positions within each column. Row index entries contain offsets that allow a reader to seek to the correct compression block and to the right byte within the decompressed block.

Having periodic row index entries allows rows to be skipped within a stripe for fast reads, even with very large stripe sizes. By default, a row index entry is created every 10,000 rows.
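Both of these knobs can be set at write time. A hedged sketch with pyarrow; the stripe_size and row_index_stride parameter names are assumptions that may vary by pyarrow version, so check your version's documentation:

```python
import pyarrow as pa
import pyarrow.orc as orc

table = pa.table({"id": list(range(1_000_000)),
                  "value": [0.5] * 1_000_000})
orc.write_table(
    table,
    "tuned.orc",
    stripe_size=256 * 1024 * 1024,  # large stripes for big sequential reads
    row_index_stride=10_000,        # one row index entry every 10,000 rows
)
```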

Serialization and Compression

What is Serialization?

Serialization is the process of encoding data so it can be written to disk or sent elsewhere. Variable-width encoding, for example, saves space by using fewer bytes to encode smaller values.
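As a toy example of variable-width encoding, here is a base-128 varint encoder, one common scheme of this kind (a simplified sketch, not ORC's exact wire format): small values fit in a single byte, larger values take more.

```python
def encode_varint(n):
    """Encode a non-negative integer as a base-128 varint."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)

print(encode_varint(5).hex())    # '05'   -> one byte
print(encode_varint(300).hex())  # 'ac02' -> two bytes
```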

Two kinds of column serialization are described below:

💠 Serialization of Integer Columns

Integer columns are serialized in two streams.

  1. present bit stream: is the value non-null?
  2. data stream: a stream of integers

Integer columns are serialized using run-length encoding (a form of lossless data compression in which runs of data are stored as a single data value and count, rather than as the original run).

Run-length encoding
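A toy run-length encoder makes the idea concrete (this is a simplification; ORC's actual integer RLE also handles literal runs and deltas):

```python
def rle_encode(values):
    runs = []  # list of (value, count) pairs
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1] = (v, runs[-1][1] + 1)  # extend the current run
        else:
            runs.append((v, 1))              # start a new run
    return runs

print(rle_encode([7, 7, 7, 7, 3, 3, 9]))  # [(7, 4), (3, 2), (9, 1)]
```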

💠 Serialization of String Columns

String columns are serialized in four streams.

  1. present bit stream: is the value non-null?
  2. dictionary data: the bytes for the strings
  3. dictionary length: the length of each entry
  4. row data: the row values

String columns are serialized using a dictionary of distinct column values; the row data stream stores indexes into that dictionary, compressed with the same run-length encoding. The dictionary is sorted to increase compression ratios and speed up predicate filtering.
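A toy sketch of the idea: distinct values go into a sorted dictionary, and each row stores only a small integer index, which run-length encoding then compresses well.

```python
def dictionary_encode(rows):
    dictionary = sorted(set(rows))  # sorted, as ORC's dictionaries are
    index = {value: i for i, value in enumerate(dictionary)}
    return dictionary, [index[value] for value in rows]

dictionary, row_data = dictionary_encode(["NY", "CA", "NY", "NY", "TX"])
print(dictionary)  # ['CA', 'NY', 'TX']
print(row_data)    # [1, 0, 1, 1, 2]
```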

Compression

Streams are compressed using a codec (Snappy, Zlib, or none). Compression is done incrementally, as each block is produced, to minimize memory use. Readers can jump across compressed blocks without first decompressing them for scanning: a position in a stream is indicated by a block start point and an offset into the block.
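The codec is chosen when the file is written. A minimal sketch with pyarrow (the file name is illustrative; the compression parameter is assumed to accept values such as 'snappy', 'zlib', or 'uncompressed' in recent versions):

```python
import pyarrow as pa
import pyarrow.orc as orc

table = pa.table({"city": ["NY", "NY", "CA"], "amount": [1.0, 2.5, 3.0]})
orc.write_table(table, "compressed.orc", compression="snappy")
```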

ORC vs Parquet

Both are columnar formats well suited to read-heavy workloads. ORC files are arranged into data stripes, which are the fundamental building blocks of data and are independent of one another. Parquet, on the other hand, stores data in pages, with each page containing header information, definition and repetition level information, and the actual data. One notable difference: in Hive, ACID transactions are supported only on tables stored as ORC.

Conclusion

To conclude, we have studied the architecture of the Hive ORC file format and its strengths, such as efficient serialization and compression. ORC has proven successful in large-scale deployments thanks to its strong compression and fast reads.

Stay tuned for an upcoming article on Parquet and why it, too, is a reliable file format.

See you next time,

@TechAE
