We can make sure Athena only reads as much data as it needs for a particular query by partitioning our data. We do this by storing the data files in a Hive folder structure that represents the patitions we’ll use in our queries.
We can then create a table partitioned by the keys used in the folder structure.
We then need to tell Athena about the partitions. We can either do this with
ALTER TABLE example ADD PARTITION (year='2021, month=06, day=27);, or by running
MSCK REPAIR TABLE example;, which will crawl the folder structure and add any partitions it finds. Once the partitions are loaded we can query the data, restricting the query to just the required partitions:
The problem with this is that we either need to know every about every partition before we can query the data, or repair the table to make sure our partitions are up to date - a process that will take longer and longer to run as our table grows.
There is a better way! By using partition projection we can tell Athena where to look for partitions. At query time, if the partition doesn’t exist, the query will just return no rows for that partition. Queries should also be faster when there are a lot of partitions, since Athena doesn’t need to query the metadata store to find them.
We can query this table immediately, without needing to run
ADD PARTITION or
REPAIR TABLE, since Athena now knows what partitions can exist. Since we need to provide Athena with the range of expected values for each key, the year partition range will eventually need to be updated to keep up with new data.
Another option is to project an actual
date partition. This time we treat the date path in S3 (
yyyy/MM/dd) as a single partition key, which Athena will read and convert to a date field. We call this partition
date is a reserved keyword.
With a date partition we no longer need to update the partition ranges. Using
NOW for the upper boundary allows new data to automatically become queryable at the appropriate UTC time. We can also now use the
date() function in queries and Athena will still find the required partitions to limit the amount of data read.