Cost and performance optimization of Amazon Athena through data partitioning
Introduction & Context
At JOBOO we work in a highly data-driven manner, with all of our application platforms continuously streaming event records to our data warehouse.
Each of these records encapsulates a relevant event from one of the following areas:
- Business (“a new user signed up”, “a user has taken out a subscription”)
- Application (“an email was sent to a user”, “an error occurred”)
- Web requests (“Page X was requested with these and those parameters”)
- Conversion tracking (“a user has reached a campaign goal”)
This relatively simple form of data collection with the semantics of “what happened when?” allows us to carry out many meaningful evaluations in a very flexible manner.
The data warehouse is based on an AWS-native tech stack with an architecture that integrates multiple managed services such as Amazon Data Firehose, S3, and Athena.
The events flow into the data warehouse in the form of individual JSON-structured objects via an HTTP endpoint provided by Data Firehose, where they are converted to Apache Parquet files and then stored in S3.
From there, these records can be queried via SQL using Athena, with all the aggregation, join, filter, and transformation options that SQL offers.
This pipeline provides the basis for all further analyses and evaluations.
The problem
Our constantly growing application landscape records a large number of events, which results in a correspondingly large volume of data.
Without optimization, Athena scans all files in the data set on every query, which is inefficient and costly.
That’s because, without further measures, Athena essentially treats all Parquet files generated by Data Firehose as one large, unsorted “blob of data” when executing an SQL query.
Although Athena’s query engine can handle these inefficient structures, it doesn’t scale well cost-wise, as Athena’s “pay-per-use” cost model always uses the amount of data to be scanned for an SQL query as the basis for calculating the fee.
Query time also grows linearly with the number of stored records, even if the answer to a query is ultimately contained in only a few kilobytes of data.
Queries against the event data are typically structured from a business perspective and limited in time, for example: “Calculate the number of all new registrations in the past 7 days”.
However, this logical structuring of the query does not by itself make the query more efficient if the physical layout of the stored event data does not support it.
You can imagine it as if you were entering a room with a completely unsorted collection of books from the last 200 years.
For example, even if one only wanted to determine the titles of all crime novels from 1998, one would still be forced to pick up each individual book and examine it for the properties “is from 1998” and “is a crime novel” — limiting the query doesn’t reduce the effort required to scan the entire data set.
The solution approach
However, if the books in the room were physically partitioned according to a logical criterion — for example: all books from a year are always on a separate shelf for that year of publication — then determining the desired titles would be much more efficient, since you now only have to examine all books on the “1998” shelf for the property “is a crime novel”.
A fundamental improvement in the situation can therefore be achieved by partitioning the physical data according to logical aspects.
This is natively supported by all components involved — Data Firehose, S3, and Athena.
The solution here is to identify a logical partitioning that makes sense for typical query use cases (for example, splitting the data per calendar day) and then to map this partitioning physically, for example with an S3 folder hierarchy following the schema year/month/day/.
An example:
The Parquet files are stored in S3 by applying a partitioning schema with this structure:
business-events/
  year=2024/
    month=09/
      day=01/
        file1.parquet
        file2.parquet
        file3.parquet
      day=02/
        file4.parquet
        file5.parquet
Event data is therefore available in a total of 5 files:
business-events/year=2024/month=09/day=01/file1.parquet
business-events/year=2024/month=09/day=01/file2.parquet
business-events/year=2024/month=09/day=01/file3.parquet
business-events/year=2024/month=09/day=02/file4.parquet
business-events/year=2024/month=09/day=02/file5.parquet
Partitioning during query execution (Athena view)
The optimization at query time works because Athena can recognize this three-level folder structure and target it specifically; the key=value structure of the directory names, together with a few other measures (explained in detail later), makes exactly that possible.
A query of the following type:
SELECT COUNT(*) FROM business_events
WHERE year = '2024'
AND month = '09'
AND day = '02'
not only filters the result set at the SQL level, but also changes the access pattern to the physical files in S3: thanks to the partitioning, Athena understands that only the two files
business-events/year=2024/month=09/day=02/file4.parquet
business-events/year=2024/month=09/day=02/file5.parquet
must be physically read in order to answer the query correctly and completely. All other files can be safely ignored — that is the efficiency gain.
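The effect is easy to verify, because Athena reports the amount of data scanned for every query. The following is a minimal boto3 sketch (not part of our production setup; the result bucket, region, and database name are placeholder assumptions) that compares the scanned bytes of an unfiltered query with those of the partition-filtered query from above:

import time
import boto3

athena = boto3.client("athena", region_name="eu-central-1")  # region is an assumption

def scanned_bytes(sql: str) -> int:
    """Run a query and return how many bytes Athena had to scan for it."""
    qid = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": "business_events"},               # assumed database name
        ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},   # placeholder bucket
    )["QueryExecutionId"]
    while True:
        execution = athena.get_query_execution(QueryExecutionId=qid)["QueryExecution"]
        if execution["Status"]["State"] in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(1)
    return execution["Statistics"]["DataScannedInBytes"]

# Full scan over all partitions vs. a partition-pruned scan
print(scanned_bytes("SELECT COUNT(*) FROM business_events"))
print(scanned_bytes(
    "SELECT COUNT(*) FROM business_events "
    "WHERE year = '2024' AND month = '09' AND day = '02'"
))

With partitioning in place, the second number should be dramatically smaller, because only file4.parquet and file5.parquet are read.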
The cost savings can be significant, as the following graph of daily Athena costs for the affected AWS account shows (the switch to partitioned data was implemented gradually over several days):
The partitioning mechanism relies on the Athena table having three columns, year, month, and day, which contain the actual date information. Additionally, these columns are “mapped” to the subfolders in S3.
To achieve this, an Athena table must be created with special partitioning instructions.
Example:
CREATE EXTERNAL TABLE business_events
(
`eventid` string,
`userid` string,
`somedata` boolean,
`someotherdata` double
)
PARTITIONED BY (year string, month string, day string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION 's3://my-data-warehouse/business-events';
As you can see, these three special fields are listed only in the PARTITIONED BY clause, not in the actual field list of the table. Nevertheless, they are available in SQL queries just like any of the “normal” fields.
Partitioning in data storage (Firehose view)
In a setup where data is continuously streamed to S3 via Amazon Data Firehose, a few measures must be taken to ensure that the records end up in the correct logical partitions.
A short aside:
Without any further action, Data Firehose will already store the Parquet files in a time-segmented subfolder structure, like this:
business-events/
  2024/
    09/
      02/
        00/
          file1.parquet
          file2.parquet
        01/
          file3.parquet
          file4.parquet
As you can see, it is a four-level hierarchy with the segments year/month/day/hour.
But be careful: this is in no way a logical or content-based partitioning! On the contrary, it is a purely technical division: Data Firehose buffers incoming records for a (configurable) period of time, and the directory structure in which a Parquet file ends up is based solely on the timestamp of the moment when Data Firehose writes it to S3.
The content of the data records to be written is not taken into account in any way.
The record of an event that occurred in an application platform on September 1st can therefore also be stored in the S3 folder structure for September 2nd if, for some reason, there is a longer delay between the occurrence of the event and its data warehouse processing.
To solve this issue, we use “Dynamic Partitioning” in combination with “Inline parsing for JSON”, which ensures that records are stored based on their content rather than on when they were processed.
This allows the Firehose Stream to access the individual JSON fields of an event record during processing, extract the values of the relevant fields, and then reuse these field values when assembling the S3 folder path.
In the following, we assume this JSON structure when receiving records in Data Firehose:
{
  "eventid": "9148766a-dc1e-473f-a198-11b5191f4cb0",
  "userid": "01615373-bb20-480e-921b-b7610bacc828",
  "somedata": true,
  "someotherdata": 1.94387,
  "year": "2024",
  "month": "09",
  "day": "01"
}
Step by step: Configuring Glue & Data Firehose
The procedure is now as follows:
- First, a new AWS Glue table must be created in which the schema of the received records is specified.
- The Firehose stream needs this schema to correctly interpret the JSON records received from the application.
- The name of the table is business_events_as_json and it belongs to the Glue database business_events.
- The table format is “Standard AWS Glue table”.
- The data format is “JSON”.
For the Glue Schema, all fields of the supplied JSON, including the three date fields year, month, and day, must now be defined. The easiest way to do this is via JSON:
[
  {
    "Name": "eventid",
    "Type": "string",
    "Comment": "",
    "Parameters": {}
  },
  {
    "Name": "userid",
    "Type": "string",
    "Comment": "",
    "Parameters": {}
  },
  {
    "Name": "somedata",
    "Type": "boolean",
    "Comment": "",
    "Parameters": {}
  },
  {
    "Name": "someotherdata",
    "Type": "double",
    "Comment": "",
    "Parameters": {}
  },
  {
    "Name": "year",
    "Type": "string",
    "Comment": "",
    "Parameters": {}
  },
  {
    "Name": "month",
    "Type": "string",
    "Comment": "",
    "Parameters": {}
  },
  {
    "Name": "day",
    "Type": "string",
    "Comment": "",
    "Parameters": {}
  }
]
Note that AWS Glue already allows the definition of partition fields here too — but contrary to intuition, this is irrelevant at this point, which is why the three date fields are not declared as such. Partitioning only takes place later through mechanisms that are defined in the configuration of the Firehose stream.
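If you prefer to script this step instead of clicking through the console, the same Glue table can also be created via the API. The following boto3 sketch is only an approximation of the console result under the assumptions noted in the comments, not an exact export of our configuration:

import boto3

glue = boto3.client("glue", region_name="eu-central-1")  # region is an assumption

columns = [
    {"Name": "eventid", "Type": "string"},
    {"Name": "userid", "Type": "string"},
    {"Name": "somedata", "Type": "boolean"},
    {"Name": "someotherdata", "Type": "double"},
    {"Name": "year", "Type": "string"},
    {"Name": "month", "Type": "string"},
    {"Name": "day", "Type": "string"},
]

glue.create_table(
    DatabaseName="business_events",               # the Glue database must already exist
    TableInput={
        "Name": "business_events_as_json",
        "TableType": "EXTERNAL_TABLE",
        "Parameters": {"classification": "json"},  # marks the data format as JSON
        "StorageDescriptor": {
            "Columns": columns,
            # No S3 location is set here; Firehose only uses this table as a schema definition.
        },
    },
)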
Now a new Firehose stream can be created that stores the JSON data as Parquet files in the target structure on S3, taking the partitioning information into account:
- The source is (at least in the case of JOBOO) “Direct PUT”, so that the stream offers an HTTP endpoint through which JSON records can be delivered from outside.
- The destination is “Amazon S3”.
- We use the name business-events.
- We activate “Enable record format conversion” and select “Apache Parquet” as the output format.
- As “Schema for source records” we define the previously created Glue table by specifying the correct Glue Region, Database, and Table business_events_as_json.
- In the Destination Settings we specify the desired S3 bucket as the target bucket (in the example above my-data-warehouse) and activate “Dynamic Partitioning”.
- Now comes the relevant part: we activate “Inline Parsing for JSON” and add three entries under “Dynamic Partitioning Keys”:
  - Key name: year, JQ expression: .year
  - Key name: month, JQ expression: .month
  - Key name: day, JQ expression: .day
- This extracts the values of the three relevant date fields from the JSON using Data Firehose’s native JQ support and makes them available under the Key Names.
- These key names now come into play again in the “S3 Bucket Prefix” field, because there we specify:
business-events/
year=!{partitionKeyFromQuery:year}/
month=!{partitionKeyFromQuery:month}/
day=!{partitionKeyFromQuery:day}/
(Important: this bucket prefix is specified in the configuration without spaces and on a single line; here, the text has only been wrapped for better readability).
This is exactly what ensures that the Parquet files are stored in the desired folder structure, based on the values for year, month, and day within each individual JSON event object, and regardless of when they are processed.
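For completeness, the stream configuration described in the console walkthrough above can also be expressed in code. The following boto3 sketch shows only the parts relevant to dynamic partitioning and format conversion; the IAM role ARN, error prefix, buffering values, and region are placeholders and assumptions, not our actual settings:

import boto3

firehose = boto3.client("firehose", region_name="eu-central-1")  # region is an assumption

firehose.create_delivery_stream(
    DeliveryStreamName="business-events",
    DeliveryStreamType="DirectPut",
    ExtendedS3DestinationConfiguration={
        "RoleARN": "arn:aws:iam::123456789012:role/firehose-delivery-role",  # placeholder
        "BucketARN": "arn:aws:s3:::my-data-warehouse",
        # The partition keys extracted below are referenced in the S3 prefix:
        "Prefix": (
            "business-events/"
            "year=!{partitionKeyFromQuery:year}/"
            "month=!{partitionKeyFromQuery:month}/"
            "day=!{partitionKeyFromQuery:day}/"
        ),
        "ErrorOutputPrefix": "business-events-errors/",                 # placeholder
        "BufferingHints": {"SizeInMBs": 64, "IntervalInSeconds": 300},  # example values
        "DynamicPartitioningConfiguration": {"Enabled": True},
        # "Inline Parsing for JSON": extract year/month/day with JQ expressions
        "ProcessingConfiguration": {
            "Enabled": True,
            "Processors": [
                {
                    "Type": "MetadataExtraction",
                    "Parameters": [
                        {
                            "ParameterName": "MetadataExtractionQuery",
                            "ParameterValue": "{year:.year,month:.month,day:.day}",
                        },
                        {"ParameterName": "JsonParsingEngine", "ParameterValue": "JQ-1.6"},
                    ],
                }
            ],
        },
        # Convert the incoming JSON to Parquet using the Glue table as the schema
        "DataFormatConversionConfiguration": {
            "Enabled": True,
            "InputFormatConfiguration": {"Deserializer": {"OpenXJsonSerDe": {}}},
            "OutputFormatConfiguration": {"Serializer": {"ParquetSerDe": {}}},
            "SchemaConfiguration": {
                "RoleARN": "arn:aws:iam::123456789012:role/firehose-delivery-role",  # placeholder
                "DatabaseName": "business_events",
                "TableName": "business_events_as_json",
                "Region": "eu-central-1",  # assumption
            },
        },
    },
)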
Whether you actually “match” the right JSON fields for extracting the relevant date fields can be easily verified locally on a command line terminal, provided jq is installed on your system:
echo '{
  "eventid": "9148766a-dc1e-473f-a198-11b5191f4cb0",
  "userid": "01615373-bb20-480e-921b-b7610bacc828",
  "somedata": true,
  "someotherdata": 1.94387,
  "year": "2024",
  "month": "09",
  "day": "01"
}' | jq ".year"
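If the expression matches, the command prints "2024" (including the quotes, since the field is a JSON string).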
Events can now be delivered to this stream via HTTP, and after an appropriate delay (the buffer interval) they should appear in the correct subfolders in S3 in the form of Parquet files.
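“Delivered via HTTP” here simply means calling the Firehose API against the Direct PUT endpoint; for a quick test this can also be done with a few lines of boto3 (the event body is the sample record from above, the region is an assumption):

import json
import boto3

firehose = boto3.client("firehose", region_name="eu-central-1")  # region is an assumption

event = {
    "eventid": "9148766a-dc1e-473f-a198-11b5191f4cb0",
    "userid": "01615373-bb20-480e-921b-b7610bacc828",
    "somedata": True,
    "someotherdata": 1.94387,
    "year": "2024",
    "month": "09",
    "day": "01",
}

# A trailing newline keeps individual records separable when Firehose batches them into one file.
firehose.put_record(
    DeliveryStreamName="business-events",
    Record={"Data": (json.dumps(event) + "\n").encode("utf-8")},
)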
In order to query this data in a structured manner, the partitioned table can now be created in Athena using the CREATE EXTERNAL TABLE business_events... statement listed above.
Attention: queries on this table will initially return empty result sets; in order for Athena to actually “see” the data in the partition structure, the table’s metadata must be updated. This is done with the MSCK REPAIR TABLE business_events statement.
Please note that this must be done regularly, namely whenever a new partition folder is created — in our example, at the start of each day.
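How this refresh is triggered is up to you. One possible approach (a sketch, not necessarily our exact setup) is a small scheduled job, for example a Lambda function that runs the statement once per day via the Athena API:

import boto3

athena = boto3.client("athena", region_name="eu-central-1")  # region is an assumption

def handler(event, context):
    """Scheduled (e.g. daily) entry point that registers newly created partitions."""
    athena.start_query_execution(
        QueryString="MSCK REPAIR TABLE business_events",
        QueryExecutionContext={"Database": "business_events"},              # assumed database name
        ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},  # placeholder bucket
    )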
Conclusion
By introducing data partitioning in Amazon Athena, we’ve significantly improved both the cost-efficiency and performance of our queries at JOBOO. Partitioning data based on logical criteria, like dates, means Athena only scans the files that are relevant to the query, which reduces query time and the volume of data processed. This directly translates to noticeable cost savings and faster response times, even as our data sets continue to grow.
With the right configuration of AWS services such as Data Firehose, S3, and Athena, we’ve built a streamlined pipeline that not only organizes our event data effectively but also keeps query costs in check. This approach ensures we can scale without running into performance issues or escalating costs.
In short, partitioning allows us to take full advantage of Athena’s pay-per-use model, making it easier to handle large amounts of data efficiently. If you’re managing growing datasets, this is a simple yet powerful way to keep your infrastructure scalable and cost-effective.