Athena Partition Projection

27 Jun 2021

We can make sure Athena only reads as much data as it needs for a particular query by partitioning our data. We do this by storing the data files in a Hive folder structure that represents the patitions we’ll use in our queries.

s3://mybucket/data/year=2021/month=06/day=27/file1.json
s3://mybucket/data/year=2021/month=06/day=27/file2.json
s3://mybucket/data/year=2021/month=06/day=28/file1.json

We can then create a table partitioned by the keys used in the folder structure.

CREATE EXTERNAL TABLE example (
    foo string,
    bar string,
    baz string
)
PARTITIONED BY (year int, month int, day int)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://mybucket/data/'

We then need to tell Athena about the partitions. We can either do this with ALTER TABLE example ADD PARTITION (year='2021, month=06, day=27);, or by running MSCK REPAIR TABLE example;, which will crawl the folder structure and add any partitions it finds. Once the partitions are loaded we can query the data, restricting the query to just the required partitions:

SELECT * FROM example
WHERE year=2021 AND month=6 AND day=27

The problem with this is that we either need to know every about every partition before we can query the data, or repair the table to make sure our partitions are up to date - a process that will take longer and longer to run as our table grows.

There is a better way! By using partition projection we can tell Athena where to look for partitions. At query time, if the partition doesn’t exist, the query will just return no rows for that partition. Queries should also be faster when there are a lot of partitions, since Athena doesn’t need to query the metadata store to find them.

CREATE EXTERNAL TABLE example (
    foo string,
    bar string,
    baz string
)
PARTITIONED BY (year int, month int, day int)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://mybucket/data/'
TABLEPROPERTIES (
    'projection.enabled' = 'true',
    'projection.year.type' = 'integer',
    'projection.year.range' = '2020,2021',
    'projection.month.type' = 'integer',
    'projection.month.range' = '1-12',
    'projection.day.type' = 'integer',
    'projection.day.range' = '1-31',
    'storage.location.template' = 's3://mybucket/data/${year}/${month}/${day}/'
)

We can query this table immediately, without needing to run ADD PARTITION or REPAIR TABLE, since Athena now knows what partitions can exist. Since we need to provide Athena with the range of expected values for each key, the year partition range will eventually need to be updated to keep up with new data.

Another option is to project an actual date partition. This time we treat the date path in S3 (yyyy/MM/dd) as a single partition key, which Athena will read and convert to a date field. We call this partition date_created as date is a reserved keyword.

CREATE EXTERNAL TABLE example (
    foo string,
    bar string,
    baz string
)
PARTITIONED BY (date_created string)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://mybucket/data/'
TBLPROPERTIES (
    'projection.enabled' = 'true',
    'projection.date_created.type' = 'date',
    'projection.date_created.format' = 'yyyy/MM/dd',
    'projection.date_created.interval' = '1',
    'projection.date_created.interval.unit' = 'DAYS',
    'projection.date_created.range' = '2021/01/01,NOW',
    'storage.location.template' = 's3://mybucket/data/${date_created}/'
)

With a date partition we no longer need to update the partition ranges. Using NOW for the upper boundary allows new data to automatically become queryable at the appropriate UTC time. We can also now use the date() function in queries and Athena will still find the required partitions to limit the amount of data read.

SELECT * FROM example
WHERE date_created >= date('2021-06-27')

Tom Henderson

Athena Partition Projection

Related Posts

Notes on Migrating from Serverless Framework to AWS SAM 07 Mar 2025

Notes on Packaging Python Lambdas 25 Sep 2024

Tunnelling SSH over AWS SSM Session Manager 13 Aug 2024

Adding a Capacity Provider to an ECS Cluster with Terraform 18 Jul 2023

ECS Anywhere 01 May 2021