02 Apr 2025
The Parquet file format uses the Thrift compact protocol to define its binary representation. This blog post takes a closer look at the binary layout of the metadata in Parquet files.
Parquet can be used in Hadoop to distribute workload across different nodes in a cluster. It's a database-like format that stores data column-wise, grouped into row groups. This way, the workload can be better distributed among the worker nodes, and the data compresses better than in row-based formats. There are many standalone implementations that make it easy to create sample files to work with. For this post I chose fastparquet (Python) and read data from a CSV file into a pandas DataFrame. This is a small portion of the sample data:
===================================================================================================================================================================================================================
| Index | Customer Id | First Name | Last Name | Company | City | Country | Phone 1 | Phone 2 | Email | Subscription Date | Website |
| 1 | DD37Cf93aecA6Dc | Sheryl | Baxter | Rasmussen Group | East Leonard | Chile | 229.077.5154 | 397.884.0519x718 | zunigavanessa@smith.info | 2020-08-24 | http://www.stephenson.com/ |
| 2 | 1Ef7b82A4CAAD10 | Preston | Lozano | Vega-Gentry | East Jimmychester | Djibouti | 5153435776 | 686-620-1820x944 | vmata@colon.com | 2021-04-23 | http://www.hobbs.com/ |
===================================================================================================================================================================================================================
The parquet file was created with fastparquet using the following commands, deliberately without compression and with everything in a single row group:
import pandas as pd
import fastparquet as fp

df = pd.read_csv("~/downloads/customers-100.csv")
# single row group, no compression, one output file -- keeps the binary layout simple
fp.write("./pq/file.parquet", df, row_group_offsets=[0], compression="UNCOMPRESSED", file_scheme="simple")
The resulting file has the format as defined here. To give a short explanation: there is a magic header and trailer, PAR1. Between them, the rows are split into row groups, each containing a chunk of every column. Inside each column chunk there is metadata describing the values and their repetition within that column. Row groups follow one another until the footer is reached, where the schema is located in a trailing metadata block. The length of this block is stored as a 4-byte little-endian integer at the end of the file, just before the magic trailer. We will now take a closer look at the schema part, as this is the most interesting part of the metadata.
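Before decoding anything by hand, the metadata block can be located programmatically. Here is a minimal sketch (plain Python, using the file written above) based on the footer layout just described:
with open("./pq/file.parquet", "rb") as f:
    f.seek(-8, 2)                          # the last 8 bytes: footer length + "PAR1"
    footer_len = int.from_bytes(f.read(4), "little")
    assert f.read(4) == b"PAR1"            # magic trailer
    f.seek(-(8 + footer_len), 2)           # jump to the start of the metadata block
    metadata = f.read(footer_len)          # raw thrift compact payload
print(footer_len, metadata[:2].hex())      # the first bytes should read 1502, see below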
The Parquet definition of the metadata is given here. The raw code of the definition can be found here. The Thrift compact protocol uses variable-length encodings for its data types (ULEB128 varints, with zigzag encoding for signed integers). Using the first entry, version, as an example, this is the protocol definition and binary representation:
# from parquet.thrift
struct FileMetaData {
/** Version of this file **/
1: required i32 version
/* ... */
}
# payload
1502
======
1  -> initial field tag 1 (version) / selection of struct element in definition
5  -> type identifier (5: i32)
02 -> zigzag varint encoded value: 2 -> decodes to 1 (format version 1)
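To make the decoding reproducible, here is a minimal sketch of the two helpers needed throughout this post -- a ULEB128 varint reader and the zigzag decode -- applied to the payload above:
def read_varint(buf, pos):
    # ULEB128: 7 value bits per byte, high bit set on every byte except the last
    result = shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7

def zigzag_decode(n):
    # undo zigzag encoding: 0 -> 0, 1 -> -1, 2 -> 1, 3 -> -2, ...
    return (n >> 1) ^ -(n & 1)

payload = bytes.fromhex("1502")
delta, ftype = payload[0] >> 4, payload[0] & 0x0F   # field id delta 1, type 5 (i32)
value, _ = read_varint(payload, 1)
print(delta, ftype, zigzag_decode(value))           # -> 1 5 1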
Another good textual overview and source of the Thrift compact elements and types can be found here. In the example above, the 4-byte (i32) value of the version is encoded in just 2 bytes. In the Thrift definition, a number is assigned to each field of a struct, which serves as its index. The compact protocol does not store this index directly: the first 4 bits of a field header contain the delta to the previous field id, so missing optional fields simply increase the delta. At the beginning of a struct the previous id counts as 0, which means the first delta equals the absolute field id -- here 1, the version. The next 4 bits denote the type of the value (5: i32), and the rest of the data is the zigzag encoded varint value (2, which decodes to version 1). After the version follows a list of schema elements, which describe the data types used in the columns:
# from parquet.thrift
struct SchemaElement {
1: optional Type type;
2: optional i32 type_length;
3: optional FieldRepetitionType repetition_type;
4: required string name;
5: optional i32 num_children;
6: optional ConvertedType converted_type;
7: optional i32 scale
8: optional i32 precision
9: optional i32 field_id;
10: optional LogicalType logicalType
}
# payload
19dc 4806 736368656d61 1518 00
1504 158001 1502 1805 496e646578 00
150c 2502 180b 437573746f6d6572204964 2500 00
================================================
# schema header
19 -> field tag +1 from version (field 2: schema), field type list
dc -> d != f (short form list, size: 13 = root + 12 columns), element type struct
# schema element 1 (root)
48 -> initial field tag 4 (name; the optional fields 1-3 are skipped), field type binary / string
06 -> field / string length 6
7363 6865 6d61 -> "schema"
15 -> field tag +1 (num_children)
18 -> varint 24, zigzag decoded: 12 (the number of columns)
00 -> struct stop / next element
# schema element 2 (leaf)
15 -> initial field tag 1 (type), field type i32
04 -> zigzag varint: 4 -> 2 (Type INT64)
15 -> field tag +1 (type_length)
80 01 -> varint 128, zigzag decoded: 64 (the bit width of INT64)
15 -> field tag +1 (repetition_type)
02 -> zigzag varint: 2 -> 1 (OPTIONAL)
18 -> field tag +1 (name), field type binary / string
05 -> field / string length 5
496e 6465 78 -> "Index"
00 -> struct stop / next element
# schema element 3 (leaf)
15 -> initial field tag 1 (type), field type i32
0c -> zigzag varint: 12 -> 6 (Type BYTE_ARRAY)
25 -> field tag +2 (repetition_type; the optional type_length is skipped)
02 -> zigzag varint: 2 -> 1 (OPTIONAL)
18 -> field tag +1 (name), field type binary / string
0b -> field / string length 11
4375 7374 6f6d 6572 2049 64 -> "Customer Id"
25 -> field tag +2 (converted_type)
00 -> zigzag varint: 0 -> 0 (UTF8)
00 -> struct stop
# ...
First, the full specification of a schema element is shown in the example code above. The only field marked as required is the name of the element. Fields such as converted_type and logicalType provide options for specifying advanced Parquet types. The repetition_type declares whether a value is required, optional, or repeated; this is the information from which the definition and repetition levels in the column chunks are built. The remaining fields should be self-explanatory.
Looking at the payload, the first entry is the header of the schema list. In the root element, the first field tag is 4 (name): the optional fields 1-3 are absent, so the field id delta jumps straight to 4. The next field (num_children) decodes to 12, which is exactly the number of columns nested under the root. It is only set for non-leaf nodes; the leaf elements that follow omit it and carry a concrete type instead. The zero is the stop byte that terminates every struct in the Thrift protocol. The next element after the root is called a leaf node. It comes with a primitive type (2: INT64) and a type_length of 64 (64 bit / 8 = 8 byte; matching the type). Element 3 and each subsequent element are a little different: they carry the primitive type BYTE_ARRAY (6) plus a converted_type (UTF8). The pitfall in all of these fields is that the raw value bytes (0x18, 0x04, 0x0c, ...) are zigzag encoded varints; read as plain varints they yield confusing values such as 24 children or type 12, which is not a valid primitive type at all. The other elements all look similar to element number 3.
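Reusing read_varint and zigzag_decode from the sketch above, schema element 2 can be decoded mechanically; the resulting field ids map straight back to the SchemaElement definition:
element = bytes.fromhex("150415800115021805496e64657800")
pos, field_id, fields = 0, 0, {}
while element[pos] != 0x00:                  # 0x00 is the struct stop byte
    header = element[pos]
    pos += 1
    field_id += header >> 4                  # high nibble: field id delta
    ftype = header & 0x0F                    # low nibble: compact protocol type
    if ftype == 5:                           # i32: zigzag encoded varint
        raw, pos = read_varint(element, pos)
        fields[field_id] = zigzag_decode(raw)
    elif ftype == 8:                         # binary: varint length + raw bytes
        length, pos = read_varint(element, pos)
        fields[field_id] = element[pos:pos + length].decode()
        pos += length
print(fields)  # {1: 2, 2: 64, 3: 1, 4: 'Index'} -> INT64, 64 bit, OPTIONAL, "Index"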
Looking back, when I was trying to understand the general format of parquet bytes, I struggled to find much information about it. A good resource that provided hints on field tagging was this blog post. The official specs repo is also a good place to start. At some point I'd like to delve further into the unknown and hopefully better documented realms of parquet. But not today.