Digging into Parquet: A Long-Awaited Deep Dive
Unpacking the Columnar Powerhouse Behind Modern Analytics
I have to confess: Apache Parquet has been sitting quietly in my mental backlog for many months. It’s a name that pops up constantly in the world of data engineering, but one I’d only ever brushed past. I always figured, "Yeah, it’s that columnar format everyone uses... I'll dig into it someday, when I have the time."
But “someday” kept getting postponed.
I've been deep in the weeds writing about sprawling topics: reasoning in large language models, the limits of formal logic, DAG execution semantics in Apache Airflow. Each of those demanded full attention. File formats? They seemed... deceptively mundane in comparison, and (sorry for saying so) not important enough to deserve attention in the short term.
Until recently, when I completely changed my perspective.
I finally carved out time to really study Parquet: not just to skim the docs, but to break it open, trace how it works, and understand why it’s become a go-to format across tools from Spark to Snowflake. And wow, it was worth it, even though it took me at least 10 hours.
What I found was a compact yet elegant design: deeply optimized for analytic workloads, carefully structured to minimize I/O, and full of tricks like dictionary encoding, page-level statistics, and predicate pushdown—all the things that let modern engines scan billions of rows with surgical precision.
In this post, I’ll walk you through the inner workings of Parquet: how it stores data, how encodings and compression work, what makes it fast, and why it thrives in distributed data systems. You’ll also get a bit of my own running commentary as someone who came late to the party but left genuinely impressed, and believe me, that’s not always the case.
Why Columnar Storage (and Parquet) Matters
First off, we need to address the “obvious”:
Why don’t people just stick with simple row-wise files like CSV or JSON?
After all, they’re everywhere—human-readable, dead simple, and work fine for most small-scale needs. But when you start moving into the world of big data—hundreds of millions of rows, petabytes of logs, ad clicks, event streams—those humble and intuitive formats start to crack under pressure. And that’s where columnar formats like Apache Parquet shine.
Here’s the core idea: instead of storing data row by row, like a spreadsheet or a CSV, Parquet stores it column by column. That might sound subtle, but it completely flips how data is laid out on disk—and it pays huge dividends for analytical workloads.
Now, imagine you’ve got a dataset with 50 columns, but your query only needs 3 of them. With a row-based format like CSV, your engine still has to read every single row in full—even the 47 columns you don’t care about—just to extract the few bits you need. It’s pretty much like having to read the entire page of a book just to find one simple word. Believe me, it’s not worth it.
Parquet avoids this immense waste. It stores all values for each column together, meaning when your query only references those 3 columns, it can skip the other 47 entirely. That leads to massive I/O savings. As one engineer put it: “when we run a query, Parquet only pulls the specific columns we need instead of loading everything.” That’s not a minor win: if the columns are similar in size, reading 3 of 50 cuts the read size by over 90%, which is huge.
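To make that arithmetic concrete, here is a minimal sketch that lays out the same toy table row-wise and column-wise and counts the bytes a 3-column query has to touch. The table shape, the fixed 8-byte integers, and the column positions are assumptions chosen for illustration; this is not Parquet's actual on-disk encoding.

```python
import struct

# Toy table: 1,000 rows x 50 integer columns, 8 bytes per value (an assumption).
NUM_ROWS, NUM_COLS = 1_000, 50
rows = [[r * NUM_COLS + c for c in range(NUM_COLS)] for r in range(NUM_ROWS)]

# Row-oriented layout: the 50 values of each row are stored together.
row_layout = b"".join(struct.pack(f"<{NUM_COLS}q", *row) for row in rows)

# Column-oriented layout: the 1,000 values of each column are stored together.
columns = list(zip(*rows))
col_layout = [struct.pack(f"<{NUM_ROWS}q", *col) for col in columns]

wanted = [0, 7, 42]  # hypothetical query touching 3 of the 50 columns

# Row layout: every row is read in full just to pick out 3 values.
bytes_read_rows = len(row_layout)

# Column layout: only the 3 requested column blocks are read.
bytes_read_cols = sum(len(col_layout[c]) for c in wanted)

savings = 1 - bytes_read_cols / bytes_read_rows
print(f"row layout: {bytes_read_rows} bytes, column layout: {bytes_read_cols} bytes")
print(f"I/O saved: {savings:.0%}")
```

With equally sized columns the saving is simply 47/50; real tables have columns of very different widths, so the actual figure depends on which columns a query needs.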
And that’s just the start.
Because each column contains values of the same type (like all strings, or all integers), Parquet can apply compression and encoding that’s tailored to that data. For instance, a string column with lots of repeated values can benefit from dictionary encoding. Integer columns with small ranges can use bit-packing.
Nulls and repeated patterns are compressed aggressively. This per-column optimization isn’t just efficient—it’s almost surgical.
The result? Parquet files are not only smaller on disk, they’re also faster to scan. Tools like Apache Spark, Presto, DuckDB, and even Pandas can leap through them with near-native performance, especially when paired with features like predicate pushdown (i.e., skipping chunks of data that don’t match a filter).
It’s no accident. Parquet was designed from day one for very big analytical queries—things like OLAP, dashboards, SQL aggregates, and machine learning preprocessing. The original creators (a collaboration between Twitter and Cloudera) explicitly built it to support nested data, column-level compression, and efficient read paths. It’s not trying to be everything to everyone—it’s super laser-focused on complex analytics.
Compare that to CSV and JSON, which are row-oriented, text-based, and often bloated. They’re perfectly fine for simple dumps, logs, and record-at-a-time exchange between systems, but they fall flat when you need to crunch large datasets or perform column-heavy scans.
To put it as a metaphor: Parquet is like a very specialized scalpel for analytics, while CSV and JSON are more like duct tape and a Swiss Army knife. Both have their place—but when it comes to speed, scale, and efficiency, Parquet is in a league of its own.
Parquet’s Hybrid File Layout
At a high level, a Parquet file has three parts: a header, a data section, and a footer. The data section is organized hierarchically to support both columnar storage and efficient reads.
A Detailed Breakdown
Here's a closer look at each of these components and how they are structured:
Header
The header is a 4-byte magic string (PAR1) that marks the beginning of a Parquet file. A reader checks this magic number to confirm that the file really is a Parquet file before trying to parse it.
Data Section
The data section is where the actual tabular data is stored and is the most intricate part of the Parquet file. It is structured hierarchically to support efficient storage and retrieval for both row-based and column-based access patterns. The data section is further divided into row groups, column chunks, and pages.
Row Groups:
A row group is essentially a horizontal slice of the dataset that contains a subset of rows, often hundreds of thousands to millions of rows in one group (a common target is ~128MB). This segmentation allows for parallel processing of different row groups. For example, in a distributed system like Spark, multiple tasks can process different row groups simultaneously.
The row group is a container for column chunks, one per column, with each column chunk storing all values for a single column over the entire row group. Because row groups are independent of one another, a table split into multiple row groups lets systems distribute computational work, improving performance for large-scale data processing.
Column Chunks:
Inside each row group, data is organized by column chunks. A column chunk contains all the values for a specific column across all rows in the row group. For example, in a table with N columns, each row group contains exactly N column chunks, one for each column. The values of a given column are stored together in a contiguous region of the file. This columnar organization facilitates efficient compression and access, as similar values (e.g., all values of the "age" column) are stored next to each other. This separation by column also makes column pruning possible, which is useful for queries that only request a subset of columns. For example, if a query only requires the "age" and "name" columns, only the respective column chunks need to be read, minimizing I/O operations.
Pages:
A column chunk is further divided into pages, which are the smallest unit of encoding and compression. Pages are typically around 1MB (a common default), but this is configurable. Pages are the building blocks of encoding, compression, and I/O operations. Parquet supports different types of pages:
Data Pages: These contain the actual values for the column.
Dictionary Pages: If dictionary encoding is used, this page stores a dictionary of distinct values for a column, with subsequent pages referring to indices in this dictionary.
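Index Pages: The format defines an index page type, but current writers generally keep page-level min/max statistics in a separate, optional page index stored near the footer rather than in pages inside the column chunk. Either way, the goal is the same: allowing quick skipping of irrelevant pages when applying predicates (e.g., a filter such as "age < 30").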
Within each page, metadata such as the number of values, page size, and min/max statistics for the column values is stored. This metadata is crucial for minimizing unnecessary I/O during querying, as it allows systems to efficiently skip pages that do not meet the query conditions (e.g., filtering out pages where the maximum age is ≤ 30 in a query that filters for ages > 30).
Footer
The footer serves as a critical piece of information that systems like Apache Spark use when processing Parquet files. Upon reading the footer, a system can discover all row groups, column chunk offsets, and other metadata necessary for efficient query execution. This enables the system to directly access the relevant parts of the file, optimizing both I/O and computation.
Parquet File Structure in Diagram Form (ASCII Representation):
[ PAR1 (magic) ]
<Row Group 1>
    Column Chunk A (pages of column A)
    Column Chunk B (pages of column B)
    ...
    Column Chunk N (pages of column N)
<Row Group 2>
    Column Chunk A (pages)
    Column Chunk B (pages)
    ...
    Column Chunk N (pages)
Footer with metadata, size, row group information, etc.
[ PAR1 (magic) ]
Parquet’s Hybrid Storage Model
Parquet's design is often described as a hybrid storage model because it combines aspects of both row-based and column-based storage:
Row Group Structure (Row-Based): A row group encapsulates multiple rows, which is a row-based concept. This structure allows for independent processing of chunks of rows, enabling parallelism in distributed systems like Apache Spark or Hadoop.
Column Chunk Structure (Column-Based): Inside each row group, data is stored column-wise, meaning all values for a particular column are grouped together in a contiguous block. This columnar layout makes compression and column pruning more efficient, as similar data types are stored together.
Pages (Fine-Grained I/O): The finer granularity of pages allows for efficient I/O. Systems can read just the necessary pages within a column chunk, enabling optimized data access and reducing unnecessary reads.
Benefits of Parquet’s Hybrid Model
The hierarchical structure of row groups, column chunks, and pages offers several key advantages:
Parallel Processing: Since row groups are independent, they can be processed in parallel. This enables distributed systems to process different chunks of the dataset simultaneously.
Column Pruning: Systems can efficiently skip unnecessary columns by reading only the relevant column chunks.
Predicate Pushdown: The page-level statistics (min/max values) allow systems to skip entire pages of irrelevant data, based on query filters, without needing to read the data itself.
Efficient Compression: Because each column chunk contains similar data (values for a specific column), compression algorithms can be more effective, reducing file size and improving performance.
Optimized I/O: The use of pages enables efficient I/O operations by reading data in smaller, manageable chunks. Only the relevant pages (based on query filters) are read, minimizing disk I/O.
Because of this hierarchy, Parquet can do clever things: it can skip entire row groups that a query doesn’t need, and within a row group it can read only the specific column chunks requested. It can even skip irrelevant pages using stored stats.
This essentially minimizes unnecessary reads and maximizes parallel processing.
In practice, an engine like Spark will first read the footer to discover all row groups and column offsets, then schedule tasks per row group (each task reading only needed columns).
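The "one task per row group, each reading only the needed columns" idea can be sketched with a thread pool. The in-memory row groups and the sum_column helper are hypothetical stand-ins; a real engine would schedule distributed tasks and read byte ranges from the file.

```python
from concurrent.futures import ThreadPoolExecutor

# Three toy row groups; each maps a column name to that row group's column chunk.
row_groups = [
    {"price": [10, 20, 30], "qty": [1, 2, 3]},
    {"price": [40, 50], "qty": [4, 5]},
    {"price": [60], "qty": [6]},
]

def sum_column(row_group, column):
    # Touches only the requested column chunk; "qty" is never read.
    return sum(row_group[column])

# One task per row group, mirroring how an engine can schedule work independently.
with ThreadPoolExecutor(max_workers=3) as pool:
    partials = list(pool.map(lambda rg: sum_column(rg, "price"), row_groups))

total = sum(partials)
print(partials, total)
```

Because no task needs data from another row group, the partial results can be computed in any order and combined at the end.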
Metadata Everywhere
Parquet packs a lot of metadata into the footer and into headers at each level. At the very end of the file, as we’ve briefly introduced before, you’ll find the file footer: the serialized file metadata, followed by a 4-byte length of that metadata and the closing “PAR1” magic. The footer contains:
The schema (column names, types, nullability, and any logical types like STRING or DECIMAL).
A list of row group metadata blocks. Each row group’s metadata has the number of rows in it and, for each column chunk, the byte offset and size in the file, the compression codec, the encodings used, and chunk-level statistics (min, max, null count).
Optionally, a page index, stored just before the footer and referenced from the column-chunk metadata, which lists each page’s offset and size along with its min, max, and null count.
Crucially, the chunk-level stats live in the footer rather than inside the row group data. This design means the reader can scan the footer once, learn exactly where to find each column chunk, and decide which row groups are worth reading. When a page index is present, it can also see min/max for each page before touching the data.
For example, a query with WHERE price>30 can look at each page’s max, skip pages where max ≤ 30, and only decode the rest. Also, column-level stats (min/max per entire column chunk) and optional bloom filters or indexes can speed queries by enabling predicate pushdown.
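Here is a minimal sketch of that page-skipping logic for a filter like price > 30. The Page class, the page size of 4 values, and the sample prices are assumptions for illustration; in a real file the values would be encoded and compressed, and the statistics would come from page headers or the page index.

```python
from dataclasses import dataclass

@dataclass
class Page:
    values: list      # stand-in for the encoded, compressed page body
    min_value: float  # per-page statistics
    max_value: float

def make_pages(values, page_size):
    """Split a column into pages and record min/max for each one."""
    pages = []
    for i in range(0, len(values), page_size):
        chunk = values[i:i + page_size]
        pages.append(Page(chunk, min(chunk), max(chunk)))
    return pages

def scan_greater_than(pages, threshold):
    """Return matching values and how many pages actually had to be decoded."""
    matches, decoded = [], 0
    for page in pages:
        if page.max_value <= threshold:
            continue  # no value in this page can match, so skip it entirely
        decoded += 1
        matches.extend(v for v in page.values if v > threshold)
    return matches, decoded

prices = [5, 12, 18, 25, 29, 30, 31, 44, 8, 9, 10, 11]
pages = make_pages(prices, page_size=4)
matches, decoded = scan_greater_than(pages, 30)
print(matches, f"decoded {decoded} of {len(pages)} pages")
```

Note that the statistics only rule pages out. A page that survives the check still has to be decoded and filtered value by value.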
At the page level, each page begins with a small page header. This header (in serialized Thrift form) includes:
the page type (Data or Dictionary, etc.),
the number of values in the page,
the uncompressed and compressed byte size,
and for Data pages, optionally the min and max values of that page plus null counts.
Having this header lets the reader skip a page without decompressing or decoding it. In short, Parquet’s metadata is deep: file footer has global schema and all row-group offsets, row-group metadata has each column’s offsets and stats, and page headers have fine-grained stats.
This is what makes Parquet such an optimized format: the engine rarely has to read something it doesn’t need.
Encoding and Compression Schemes
Parquet doesn’t just lay out data nicely, as many including myself believed; it also compresses and encodes it in smart ways. Each column chunk can use its own compression codec (commonly Snappy, Gzip, Zstd, or LZO) to squash data.
More importantly, Parquet supports several encoding schemes to make data smaller before compression:
Dictionary Encoding: For columns with many repeated values (e.g. country codes, product IDs), Parquet will build a dictionary. One page (the dictionary page) contains all the unique values, and subsequent Data pages store just integer indexes into that dictionary. This can massively shrink storage for strings or low-cardinality data. (If the dictionary would get too big, Parquet simply falls back to plain encoding.)
Run-Length Encoding (RLE): Sequences of repeated values (especially for Booleans or enums) are compressed using RLE, meaning “write value X 1000 times” instead of X,X,X,…. Parquet actually uses a hybrid of bit-packing and RLE to efficiently encode the index stream from dictionary encoding or definition levels. This is great for things like null maps, Booleans, and dictionary indexes with long runs.
Delta Encoding: For sorted integers or timestamps, Parquet can store the difference (delta) between consecutive values. If those deltas are small, they compress extremely well. Parquet defines several delta schemes: DELTA_BINARY_PACKED for integers, DELTA_LENGTH_BYTE_ARRAY for the lengths of byte arrays, and DELTA_BYTE_ARRAY for strings that share prefixes.
Plain Encoding: The fallback. Plain encoding writes values in a simple binary form with minimal overhead (fixed-width bytes for numbers, a length prefix followed by the raw bytes for strings), making it fast for both writing and reading. It's especially useful when values are mostly unique and the other encodings would not help.
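To see why small deltas matter, here is a simplified sketch of delta encoding on a timestamp column. It only illustrates the idea; the real DELTA_BINARY_PACKED encoding additionally groups deltas into blocks and bit-packs them, which is left out here, and the sample timestamps are made up.

```python
from itertools import accumulate

def delta_encode(values):
    """Keep the first value, then store each difference from the previous value."""
    if not values:
        return []
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]

def delta_decode(encoded):
    """A running sum restores the original values."""
    return list(accumulate(encoded))

timestamps = [1700000000, 1700000003, 1700000004, 1700000010, 1700000011]
encoded = delta_encode(timestamps)

# Bits needed for the largest original value versus the largest delta.
bits_original = max(v.bit_length() for v in timestamps)
bits_delta = max(d.bit_length() for d in encoded[1:])

print(encoded)
print(f"{bits_original} bits per value before, {bits_delta} bits per delta after")
```

The first value is still stored in full, but every following value shrinks from a 31-bit number to a delta that fits in 3 bits, which is what bit-packing then exploits.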
Typically a column chunk will pick among these per page. For example, dictionary encoding might be used first: a dictionary page with values A, B, C is written, then a data page stores [A,B,C,B,A,…] as [0,1,2,1,0,…] using RLE. In fact, Parquet calls the scheme “RLE_DICTIONARY” for data pages.
If you inspect a Parquet file, you might see a column chunk with “encodings: ('PLAIN', 'RLE', 'RLE_DICTIONARY')” in its metadata.
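The dictionary-plus-RLE combination described above can be sketched in a few lines. This is a simplification under stated assumptions: indexes are kept as Python integers and runs as (index, count) pairs, whereas the real RLE_DICTIONARY encoding mixes bit-packed and run-length groups in a compact binary form.

```python
def dictionary_encode(values):
    """Return (dictionary, indexes): unique values in first-seen order plus one index per value."""
    dictionary, index_of, indexes = [], {}, []
    for v in values:
        if v not in index_of:
            index_of[v] = len(dictionary)
            dictionary.append(v)
        indexes.append(index_of[v])
    return dictionary, indexes

def rle_encode(indexes):
    """Collapse consecutive repeats into (index, run_length) pairs."""
    runs = []
    for idx in indexes:
        if runs and runs[-1][0] == idx:
            runs[-1][1] += 1
        else:
            runs.append([idx, 1])
    return [tuple(run) for run in runs]

def decode(dictionary, runs):
    """Expand the runs and look each index up in the dictionary."""
    return [dictionary[idx] for idx, count in runs for _ in range(count)]

countries = ["IT", "IT", "IT", "US", "US", "IT", "DE", "DE", "DE", "DE"]
dictionary, indexes = dictionary_encode(countries)
runs = rle_encode(indexes)

print(dictionary)  # contents of the dictionary page
print(indexes)     # what the data page logically stores
print(runs)        # the same indexes after run-length encoding
```

Ten strings become a three-entry dictionary plus four runs, and the gain grows with longer runs and lower cardinality.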
Because data in a column tends to have patterns (e.g. repeated or similar values), these encodings plus compression yield very high compression ratios on Parquet. In practice, Parquet files are often several times smaller than the same data in CSV, with the exact ratio depending on the data and the codec.
Writing Data: From Rows to Parquet
Understanding how Parquet handles writing reveals a powerful design choice: although Parquet is a columnar format on disk, its writer accepts data row by row.
This is why appending data feels natural—you hand it rows—but the final layout is highly optimized for columnar analytics.
Let’s break down the internals of how that illusion works, especially how rows become columnar data behind the scenes.
Internally, a Parquet writer (e.g., in the Hadoop implementation) accepts data one row at a time through something like InternalParquetRecordWriter.write(row). But—and here’s the trick—it immediately shreds that row into column-oriented buffers.
Each value in the row gets routed to its corresponding column buffer in memory, and that’s where the transformation begins.
How Rows Turn into Columnar Parquet
1. Row Ingestion and Immediate Shredding
As soon as a new row comes in, the writer decomposes it field-by-field. Say you’re working with a record like:
{
  "user_id": 42,
  "username": "jchen",
  "signup_date": "2023-11-14",
  "premium_member": true
}
Rather than store this row as a unit, the writer dispatches 42 to the user_id buffer, "jchen" to the username buffer, and so on. Think of it like a transposition happening in real time: rows go in, columns start growing separately in memory.
This is handled by the schema-aware Parquet write path, which knows how to walk the structure of a row (flat or nested) and map each leaf field to the right column store. Every column store is just a contiguous in-memory buffer for values of the same field.
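A minimal sketch of that shredding step for a flat schema follows. The ColumnBuffers class is a hypothetical stand-in, not the Hadoop implementation, and nested fields (which need repetition levels) are out of scope here.

```python
class ColumnBuffers:
    """Accepts rows one at a time and routes each field to its own column buffer."""

    def __init__(self, schema):
        self.schema = list(schema)
        self.buffers = {name: [] for name in self.schema}

    def write(self, row):
        # Shred the row: one value goes to each column buffer, in schema order.
        # Missing fields are recorded as None so every buffer stays row-aligned.
        for name in self.schema:
            self.buffers[name].append(row.get(name))

writer = ColumnBuffers(["user_id", "username", "signup_date", "premium_member"])
writer.write({"user_id": 42, "username": "jchen",
              "signup_date": "2023-11-14", "premium_member": True})
writer.write({"user_id": 43, "username": "mrossi",
              "signup_date": None, "premium_member": False})

for name, values in writer.buffers.items():
    print(name, values)
```

The caller only ever sees write(row), while the state that accumulates in memory is already organized by column.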
2. Statistics and Definition/Repetition Levels
Every time a value is added to a column’s buffer, Parquet updates column-level statistics—like min, max, and null count—as well as definition and repetition levels, which are crucial for supporting nested or optional fields.
For example, if signup_date was sometimes null, the writer would track which rows had values and which didn’t, using definition levels: small integers stored alongside the values (for a flat optional column, 0 means null and 1 means present).
Why do this during buffering? Because when we eventually flush a page, we want every page to carry its own summary stats. This enables downstream optimizations like predicate pushdown, without ever scanning the full page content.
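As a rough sketch, this is what tracking statistics and definition levels during buffering could look like for a flat, optional column. The OptionalColumnBuffer class is hypothetical, and it assumes the simplest case where the maximum definition level is 1; nested schemas use higher levels plus repetition levels.

```python
class OptionalColumnBuffer:
    """Buffers a flat optional column, keeping stats and definition levels up to date."""

    def __init__(self):
        self.values = []             # only non-null values are stored
        self.definition_levels = []  # 1 = value present, 0 = null
        self.min_value = None
        self.max_value = None
        self.null_count = 0

    def add(self, value):
        if value is None:
            self.definition_levels.append(0)
            self.null_count += 1
            return
        self.definition_levels.append(1)
        self.values.append(value)
        # Update the running min/max so they are ready when the page is flushed.
        if self.min_value is None or value < self.min_value:
            self.min_value = value
        if self.max_value is None or value > self.max_value:
            self.max_value = value

signup_date = OptionalColumnBuffer()
for v in ["2023-11-14", None, "2022-01-03", None, "2024-06-30"]:
    signup_date.add(v)

print(signup_date.definition_levels)
print(signup_date.min_value, signup_date.max_value, signup_date.null_count)
```

Keeping the stats incrementally means flushing a page never requires a second pass over the buffered values.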
3. In-Memory Column Buffers
Each column in the schema has its own in-memory buffer, often implemented as a dynamic array or run-length encoding structure. These buffers keep accumulating values until we reach the threshold to form a "page"—a subunit of a column chunk.
At this point, as you might have understood, nothing has been written to disk. All data lives in memory, column-aligned, ready to be chunked and compressed.
4. Page Flush: Encoding, Compression, and Header Creation
Parquet aims to flush pages when a certain threshold is reached—typically around 1MB uncompressed, though this is configurable. Suppose we’re collecting username strings, and after buffering 2,000 users, the memory footprint crosses the 1MB mark.
Now the writer:
Encodes the data using the chosen encoding (e.g., dictionary, delta, or plain).
Compresses it using the configured codec (e.g., Snappy, Zstd).
Writes a page header, which includes metadata like:
Page type (data, dictionary, etc.)
Number of values
Compressed and uncompressed sizes
Optional stats like min/max values and null count
This encoded and compressed blob becomes a page in the column chunk. It’s appended to the pages already buffered for that column, but we’re still not touching the file yet, just assembling parts.
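The flush sequence above (encode, compress, build a header) can be sketched as follows. Several things here are assumptions for illustration: the 64-byte threshold is tiny so the example flushes quickly, zlib stands in for the configured codec, and the header is a plain dict rather than the Thrift structure a real writer serializes.

```python
import struct
import zlib

PAGE_THRESHOLD = 64  # bytes; deliberately tiny compared to the ~1MB in the text

def plain_encode(strings):
    """Plain-style string encoding: 4-byte little-endian length, then the UTF-8 bytes."""
    out = bytearray()
    for s in strings:
        raw = s.encode("utf-8")
        out += struct.pack("<I", len(raw)) + raw
    return bytes(out)

def flush_page(values):
    """Encode and compress the buffered values, and build a header describing the page."""
    encoded = plain_encode(values)
    compressed = zlib.compress(encoded)
    header = {
        "type": "DATA_PAGE",
        "num_values": len(values),
        "uncompressed_size": len(encoded),
        "compressed_size": len(compressed),
        "min": min(values),
        "max": max(values),
    }
    return header, compressed

usernames = [f"user_{i:04d}" for i in range(20)]

pages, buffer, buffered_bytes = [], [], 0
for name in usernames:
    buffer.append(name)
    buffered_bytes += 4 + len(name.encode("utf-8"))
    if buffered_bytes >= PAGE_THRESHOLD:  # threshold crossed: cut a page
        pages.append(flush_page(buffer))
        buffer, buffered_bytes = [], 0
if buffer:                                # flush any partial page at the end
    pages.append(flush_page(buffer))

for header, body in pages:
    print(header)
```

Each page ends up self-describing: a reader can learn its size, value count, and value range from the header alone.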
5. Row Group Flush: From Memory to File
Eventually, the collective uncompressed size of all column buffers in a row group (often set to 128MB or 512MB) will exceed the row_group_size threshold. This is the moment Parquet performs a row group flush—a big write to the file or stream.
Steps:
Finalize and flush any partial pages.
Write all the pages of each column to disk in column-major order. That means all pages for user_id first, then username, and so on.
Each column chunk is written contiguously so it can be read efficiently in isolation.
Memory buffers are cleared to make room for the next row group.
At this point, Parquet has physically written data to disk, and it’s all cleanly columnar: grouped by field, page-aligned, and optionally compressed.
6. Footer Write: Schema + Metadata Index
Once all rows have been written and all row groups flushed, the Parquet writer appends the file footer, which is the glue that ties everything together.
The footer includes:
Full schema definition
For each row group:
Number of rows
Column chunk offsets and sizes
Encoding/compression metadata
Column-level stats
A final “magic number” marker (PAR1) to signal the end of the file
Critically, this design avoids any need to go back and rewrite earlier parts of the file. The writer just appends the footer once at the end—making streaming, single-pass writes possible, even to remote storage like S3.
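Here is a toy writer that shows why a single append-only pass is enough: offsets are recorded as chunks are written, and the footer that indexes them goes last. The layout mirrors the PAR1 ... footer, 4-byte length, PAR1 framing, but the footer is JSON purely for readability (real files use Thrift), and write_toy_file and the sample chunks are hypothetical.

```python
import io
import json
import struct

MAGIC = b"PAR1"

def write_toy_file(row_groups):
    """row_groups: list of dicts mapping column name -> already-encoded chunk bytes."""
    out = io.BytesIO()
    out.write(MAGIC)                       # header magic
    footer = {"row_groups": []}
    for row_group in row_groups:
        meta = {}
        for name, chunk in row_group.items():
            # Record where the chunk starts before writing it; nothing is rewritten later.
            meta[name] = {"offset": out.tell(), "size": len(chunk)}
            out.write(chunk)
        footer["row_groups"].append(meta)
    footer_bytes = json.dumps(footer).encode("utf-8")
    out.write(footer_bytes)                        # metadata index, written once at the end
    out.write(struct.pack("<I", len(footer_bytes)))  # 4-byte little-endian footer length
    out.write(MAGIC)                               # trailing magic
    return out.getvalue()

data = write_toy_file([
    {"user_id": b"\x2a\x2b", "username": b"jchen|mrossi"},
    {"user_id": b"\x2c", "username": b"akumar"},
])

# Recover the footer from the tail of the file, the same way a reader would.
(footer_len,) = struct.unpack("<I", data[-8:-4])
footer = json.loads(data[-8 - footer_len:-8])
print(footer)
```

Because every write is an append, the same logic works against a stream that cannot seek backwards.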
Why This Design Matters
The beauty of Parquet’s write path is in how it combines row-wise ingestion with columnar output. Writers operate naturally with rows (which is how most data is generated), but readers get the full benefit of fast, column-oriented analytics.
Because each column is flushed in chunks with its own encoding, compression, and fine-grained stats, the engine can do selective reads, skip irrelevant pages, and push filters deep into the file—all while having written the data just once, row-by-row.
It’s like having a magic machine that takes in messy rows and quietly reorganizes them into tightly-packed, indexed, columnar blocks behind the curtain. That’s Parquet’s genius.
An interesting note: because chunk-level statistics are written only in the footer metadata, a column chunk on disk is just a sequence of pages, each with its own small header. The stats an engine uses to prune whole row groups all live in the footer.
Reading Data: Lazy, Selective Scans
The read workflow is essentially the reverse (and heavily optimized). A Parquet reader typically follows these steps:
Read the Footer: The reader seeks to the last bytes of the file, checks the “PAR1” magic and footer length, and loads the entire footer metadata. This gives the schema, the list of row groups, and for each row group the offsets and sizes of every column chunk.
Apply Projections/Filters: If you only need some columns, the reader now knows exactly which column chunks to read (and can skip the others entirely). If a filter can be checked against row-group-level statistics (e.g. the min/max of a column chunk), it can skip whole row groups.
Read Column Chunks: For each needed row group and column: seek to the start of that column chunk and read the pages sequentially. Column chunks are just a stream of pages, so the reader can loop: read page header, check it (maybe skip based on stats), then read the data bytes of the page. Dictionary pages, if any, appear first in the chunk; after reading the dictionary page, the reader has the lookup table needed to decode the data pages.
Decode Pages: Each page’s data is decoded according to its encoding. For a dictionary page, it simply reads all unique values into an array. For a data page using RLE_DICTIONARY, it uses RLE and bit-packing to extract the dictionary indexes, then looks up each index. This reconstructs the column values for that page.
Repeat: Continue reading pages until you hit the end of the column chunk (the metadata told you how many bytes it spans). Then move on to the next column or row group as needed.
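The first steps of that workflow (seek to the tail, load the footer, then read only the wanted column chunk) can be sketched against a toy file. As before, the JSON footer and the helper names are assumptions for illustration; only the trailing 4-byte length plus PAR1 framing mirrors the real format.

```python
import io
import json
import struct

MAGIC = b"PAR1"

def build_toy_file(columns):
    """Build a one-row-group toy file: magic, chunks, JSON footer, length, magic."""
    body = io.BytesIO()
    body.write(MAGIC)
    meta = {}
    for name, chunk in columns.items():
        meta[name] = {"offset": body.tell(), "size": len(chunk)}
        body.write(chunk)
    footer = json.dumps({"columns": meta}).encode("utf-8")
    body.write(footer + struct.pack("<I", len(footer)) + MAGIC)
    return body.getvalue()

def read_footer(f):
    """Seek to the last 8 bytes, validate the magic, then load the footer."""
    f.seek(-8, io.SEEK_END)
    tail = f.read(8)
    if tail[4:] != MAGIC:
        raise ValueError("not a toy Parquet-style file")
    (footer_len,) = struct.unpack("<I", tail[:4])
    f.seek(-8 - footer_len, io.SEEK_END)
    return json.loads(f.read(footer_len))

def read_column(f, footer, name):
    """Jump straight to one column chunk; other chunks are never touched."""
    meta = footer["columns"][name]
    f.seek(meta["offset"])
    return f.read(meta["size"])

raw = build_toy_file({
    "user_id": b"\x2a\x2b",
    "username": b"jchen|mrossi",
    "signup_date": b"2023-11-14|2022-01-03",
})

f = io.BytesIO(raw)
footer = read_footer(f)
usernames = read_column(f, footer, "username")
print(sorted(footer["columns"]), usernames)
```

The reader performs two small reads at the tail and one targeted read per requested column, no matter how many other columns the file holds.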
Critically, nothing in the file forces you to read extra data: readers “first read the file metadata to find all the column chunks they are interested in,” then read exactly those columns.
Within each column chunk, they can skip pages if they don’t match a predicate (thanks to min/max stats). This is how Parquet enables predicate pushdown: filters like WHERE value < 1000 can skip pages whose minimum is already ≥1000, saving tons of I/O.
From a user perspective, writing and reading Parquet often happens in libraries (Spark, Pandas, etc.), but under the hood all these steps happen. The neat thing is that because Parquet is so self-descriptive, tools can pick up a Parquet file and know exactly how to get the data without extra configuration.
Performance Optimizations: Parallelism and Compression
What makes Parquet really shine in big data pipelines is how well it parallelizes and compresses. A few key points:
Parallel read/write via row groups: Each row group can be handled independently. In a cluster, one task can write or read row group 1, another task row group 2, etc. In Hadoop parlance, the unit of parallelism for map tasks is the row group (or file). This means you can process TBs of data in parallel easily. Inside a row group, reading different columns can also be done in parallel threads (I/O on column chunks).
Selective column reading: By storing columns separately, a query touching 3 out of 20 columns reads only about 15% of the bytes (assuming the columns are similar in size). This is column pruning. We’ve already noted this advantage. It’s especially great for analytics where wide tables are common but queries often cover only a few columns.
Predicate pushdown: As mentioned, the metadata lets engines skip irrelevant data early. Parquet’s column statistics (min/max per chunk or page) are used by engines like Spark to avoid scanning row groups that don’t match a filter. This can cut runtime dramatically for selective queries.
Tuning row group and page sizes: There is a trade-off between I/O overhead and parallelism. Larger row groups (say 512MB) mean fewer row groups and less footer metadata, which is good for sequential scans. Smaller row groups (say 16MB) mean more parallel chunks but more overhead. Similarly, page size tuning matters: smaller pages allow more fine-grained skipping (fewer irrelevant values per page) but incur more header overhead. Parquet defaults (128MB RG, ~1MB pages) work well for many use cases.
Efficient compression: Since each column is contiguous, compressions like Snappy or Zstd exploit patterns better. A column of integers or strings often compresses much more than mixed-type rows. Also encodings like dictionary and RLE, as noted, vastly improve compressibility. For example, if a column has only a few thousand unique customer IDs over millions of rows, dictionary encoding + Snappy can shrink it 10× or more. This is particularly effective on repetitive or slowly-changing data.
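The row group and page size trade-off is easier to feel with numbers. This back-of-the-envelope sketch assumes a hypothetical 10 GiB dataset with 20 equally sized columns and 1 MiB pages, and it ignores compression; the layout helper is made up for illustration.

```python
import math

MiB = 1024 ** 2
GiB = 1024 ** 3

def layout(total_bytes, num_columns, row_group_bytes, page_bytes):
    """Return (number of row groups, pages per column chunk) for a simple size model."""
    row_groups = math.ceil(total_bytes / row_group_bytes)
    chunk_bytes = row_group_bytes / num_columns  # assumes equally sized columns
    pages_per_chunk = max(1, math.ceil(chunk_bytes / page_bytes))
    return row_groups, pages_per_chunk

for rg_size in (16 * MiB, 128 * MiB, 512 * MiB):
    row_groups, pages = layout(10 * GiB, 20, rg_size, 1 * MiB)
    print(f"row group {rg_size // MiB:>3} MiB -> {row_groups:>3} row groups, "
          f"{pages:>2} pages per column chunk")
```

Small row groups give many units of parallelism but inflate the footer and leave each chunk with barely a page to skip; large ones do the opposite.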
We’ve already said it before, but here’s a reminder: Parquet’s design is tailor-made for analytical workloads (OLAP): it gives you parallelism (via row groups), columnar I/O (skip data you don’t need), and compression efficiency (same-type data + encodings). No wonder it’s a default or first-class file format for Spark, Hive, Dremio, Drill, BigQuery export, AWS Athena, and many more analytics engines.
Final Thoughts and Next Steps
Diving into Parquet was both fun and humbling. It’s an impressively engineered format – much more than “just columnar CSV.” Fitting together all the pieces (row groups, column chunks, pages, Thrift metadata, various encodings) felt like solving a puzzle.
At first it’s easy to get lost in the jargon, but stepping through a writer’s and reader’s workflow made it click for me: everything has its place.
On the personal side, it’s great to finally check this off my learning list. I feel more confident now when an engineer mentions “predicate pushdown” or when I tweak Parquet write options in Spark.
There’s still more to explore – for example, Parquet 2.0 introduced even more encodings, and real systems add things like page indexes and Bloom filters. But for now I’ll savor the clarity I have and maybe tackle schema evolution or performance tuning next.
I hope this walkthrough was useful (and not too dry!). If you’re a data engineer or developer working with big data, I highly encourage you to look inside your Parquet files sometime. Use tools like parquet-tools or PyArrow metadata inspectors to peek at the statistics and encodings.
And if anything here sparks questions or feedback, drop a comment or reach out – I love a good chat about data format internals. Follow along for future deep dives into related tech. Until next time, happy data scheming!
Sources: Official Apache Parquet documentation and community tutorials were instrumental in piecing together how Parquet works under the hood. (Plus a lot of late-night reading 😉.)
—Lorenzo



