Forget about Glue Crawlers… all hail Partition Projection!

If you’re working with large datasets in data lakes on AWS, you’ve likely come across the AWS Glue service. Glue is a serverless data integration service that helps you extract, transform, and load data for analytics, and its Data Catalog holds the table metadata that query engines such as Athena rely on. A common use case is using Glue Crawlers to infer the schema and partition structure of data stored in S3 and register both in the catalog.

However, there’s a more efficient way of dealing with partitioned data: partition projection, an Amazon Athena feature that is configured through table properties in the Glue Data Catalog. In this post, we’ll explore what partition projection is and why you should consider it over crawlers.

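What is Partition Projection?

Partition projection lets you describe the partitioning structure of your data as table properties in the Glue Data Catalog, so partitions never have to be crawled or registered one by one. For each partition key you declare a type (enum, integer, date, or injected) and the values it can take, and optionally a template for the S3 path of each partition. At query time, Athena computes the partition values and locations from that configuration instead of looking them up in the catalog.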
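As a rough sketch of what that configuration looks like, the snippet below builds the table parameters for a single date-typed partition key. The property names are the ones Athena reads; the helper name `date_projection_parameters`, the column `dt`, the bucket, and the start date are assumptions made up for illustration.

```python
def date_projection_parameters(column, start, table_prefix):
    """Build table parameters enabling partition projection for one date key.

    `column`, `start`, and `table_prefix` are illustrative inputs, not values
    taken from any real table.
    """
    return {
        "projection.enabled": "true",
        f"projection.{column}.type": "date",
        f"projection.{column}.format": "yyyy-MM-dd",
        # NOW keeps the upper bound of the range moving forward over time.
        f"projection.{column}.range": f"{start},NOW",
        f"projection.{column}.interval": "1",
        f"projection.{column}.interval.unit": "DAYS",
        # ${column} is a placeholder that Athena fills in per partition value.
        "storage.location.template": f"{table_prefix}/{column}=${{{column}}}/",
    }


params = date_projection_parameters("dt", "2023-01-01", "s3://example-bucket/events")
for key, value in params.items():
    print(f"{key} = {value}")
```

These key/value pairs would go into the table's parameters in the Data Catalog, for example through the console or as `TBLPROPERTIES` in Athena DDL.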

Benefits of Using Partition Projections

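  1. Faster Metadata Discovery: With partition projection, no crawler has to list your S3 prefixes to find partitions, and Athena does not have to fetch partition metadata from the catalog. It computes the partitions in memory from the table properties, which helps most on tables with a very large number of partitions.
  2. Lower Costs: Crawlers are billed for the time they run, which adds up for large or frequently updated datasets. Projection needs no crawler runs at all.
  3. Separation of Compute and Storage: You define the partitioning scheme once in the catalog, decoupled from the actual data storage. New partitions become queryable as soon as their data lands in S3, with no metadata update in between.
  4. Query Performance: Queries that filter on partition keys are more efficient because Athena can prune partitions during query planning without a catalog lookup. Note that projection is honored only by Athena; other engines reading the same table, such as Spark on EMR or Redshift Spectrum, ignore the projection properties and see only partitions registered in the catalog.
  5. Schema Evolution: With crawlers, a change to the partition layout requires a re-crawl. With projection, you update the table properties (for example the value range or the location template) directly.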
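To illustrate why partition resolution gets cheaper, here is a simplified model of the idea: partition locations are derived from a template and a date filter, with no lookup against a metadata store. This is not Athena's actual implementation, only a sketch; the function name, template, and dates are hypothetical.

```python
from datetime import date, timedelta


def projected_locations(template, column, first, last):
    """Expand a location template into one S3 prefix per day in [first, last].

    Mimics, in a simplified way, how partition locations can be computed from
    configuration alone instead of being read from a catalog.
    """
    placeholder = "${" + column + "}"
    day_count = (last - first).days + 1
    return [
        template.replace(placeholder, (first + timedelta(days=offset)).isoformat())
        for offset in range(day_count)
    ]


# A filter such as dt BETWEEN '2024-01-30' AND '2024-02-01' maps to three prefixes.
locations = projected_locations(
    "s3://example-bucket/events/dt=${dt}/", "dt", date(2024, 1, 30), date(2024, 2, 1)
)
for location in locations:
    print(location)
# s3://example-bucket/events/dt=2024-01-30/
# s3://example-bucket/events/dt=2024-01-31/
# s3://example-bucket/events/dt=2024-02-01/
```

Only the prefixes that match the filter are produced, which is the same effect partition pruning has on the amount of data a query needs to read.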

When to Use Crawlers

While partition projection is great, crawlers still have their use cases:

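  • When partition values or paths are too irregular to express as an enum, an integer or date range, or a location template
  • When the table is also queried by engines other than Athena, which need the partitions registered in the catalog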
  • For initial schema inference on new datasets
  • If you lack the expertise to define partition projections

Getting Started with Partition Projection

You can set up partition projection using the Glue console, the CLI, Athena DDL, or infrastructure as code. The key steps are:

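  1. Create a table in the Data Catalog that declares its partition keys, without registering any partitions
  2. Add the projection table properties: projection.enabled, a type and value range for each partition key, and optionally storage.location.template
  3. Query the table from Athena. No crawler run or MSCK REPAIR TABLE is needed, because partitions are computed at query time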
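One way to define such a table is a single Athena DDL statement that declares the partition key and sets the projection properties together. The sketch below only generates that statement as a string; the helper name `projection_ddl`, the table, columns, bucket, and Parquet storage format are all assumptions for illustration, and you would still need to run the result in Athena yourself.

```python
def projection_ddl(table, columns, partition_column, location, start):
    """Render Athena DDL for a table using date-based partition projection.

    All arguments are illustrative; `location` is expected to end with '/'.
    """
    column_lines = ",\n  ".join(f"{name} {type_}" for name, type_ in columns)
    template = f"{location}{partition_column}=${{{partition_column}}}/"
    return (
        f"CREATE EXTERNAL TABLE {table} (\n  {column_lines}\n)\n"
        # The partition key is declared, but no partitions are ever registered.
        f"PARTITIONED BY ({partition_column} string)\n"
        "STORED AS PARQUET\n"
        f"LOCATION '{location}'\n"
        "TBLPROPERTIES (\n"
        "  'projection.enabled' = 'true',\n"
        f"  'projection.{partition_column}.type' = 'date',\n"
        f"  'projection.{partition_column}.format' = 'yyyy-MM-dd',\n"
        f"  'projection.{partition_column}.range' = '{start},NOW',\n"
        f"  'storage.location.template' = '{template}'\n"
        ")"
    )


ddl = projection_ddl(
    table="events",
    columns=[("event_id", "string"), ("payload", "string")],
    partition_column="dt",
    location="s3://example-bucket/events/",
    start="2023-01-01",
)
print(ddl)
```

Once a table like this exists, a query filtering on `dt` can run immediately against whatever data is present under the matching prefixes.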

By moving to partition projection, you can save on crawler costs, drop partition maintenance, and speed up query planning on highly partitioned tables in your AWS data lake. Give it a try on your next Athena table!
