Amazon Fills a Big Data Hole with Athena
Less than a year ago, Amazon launched a new big data database named Athena. Amazon Web Services CEO Andy Jassy framed the announcement around the theme of giving enterprises “superpowers.” Yet as the database approaches its one-year anniversary, a couple of questions arise: What’s all the fuss about, and where should you consider using Athena?
At its core, Athena is a fusion of Hadoop Hive for the Data Definition Language (DDL) and Facebook’s Presto for SQL. Athena can ingest data directly from Amazon S3 storage and uses Amazon’s Lambda serverless programming framework to allocate resources on demand with a very attractive pricing model.
Let’s unpack this description to best explain Athena’s popularity.
- Hive for DDL provides several benefits:
  - Hive has been around for over a decade, so big data engineers are comfortable with it.
  - It implements schema on read, which provides flexibility in how a table is read, regardless of how it was written.
  - Users can define partitions, which organize the way data is stored on disk for faster queries on the partition key.
  - Hive includes HCatalog, a persistent, widely visible data dictionary that enables users to define and share data schemas.
- Presto for SQL: Facebook created Presto as a near real-time big data SQL engine. The advantage of Presto is that it provides a full SQL implementation, without annoying gaps such as missing support for nested queries. The problem with Facebook’s version of Presto was that it achieved performance at the expense of stability. To speed up queries, it made some simplifying assumptions (e.g., that it could fit needed subsets of data into memory), and it occasionally ran out of memory. Amazon seems to have tamed this problem by implementing Presto in a serverless architecture that allocates whatever resources it needs on demand.
- Data ingestion from Amazon S3 storage: The traditional Extract-Transform-Load (ETL) cycle has been eliminated. Instead, Athena reads data directly from a single S3 bucket, in a variety of data formats. The catch (and it is a big one) is the limitation to one bucket. All the data to be processed must be gathered into a single bucket in the current version of Athena.
- Based on Lambda serverless architecture: You no longer have to pay for a server cluster waiting idly for work to do. The servers are deployed on demand, to whatever scale is needed, and decommissioned when the request has completed. The downside is latency – it can take several hundred milliseconds for the servers to deploy.
- A very attractive pricing model: You pay only for the data your query reads, not for the servers required to process it. If you organize your data properly, e.g., by using a columnar format such as ORC or data compression that saves space, the cost can be an order of magnitude less than other options.
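To make the pricing point concrete, here is a back-of-the-envelope sketch in Python. It assumes a rate of $5 per terabyte scanned (check current AWS pricing before relying on it) and treats a terabyte as 2^40 bytes. The 3x compression ratio and the one-quarter column fraction are hypothetical numbers chosen for illustration, not measurements.

```python
# Rough estimate of the cost of a query billed by data scanned.
# All figures below are illustrative assumptions, not measured results.

PRICE_PER_TB_USD = 5.0    # assumed rate; verify against current AWS pricing
BYTES_PER_TB = 1024 ** 4  # assumption: binary terabyte


def scanned_bytes(raw_bytes, compression_ratio=1.0, column_fraction=1.0):
    """Bytes a query actually reads after compression and column pruning.

    compression_ratio: raw size divided by stored size (3.0 means 3x smaller).
    column_fraction: share of the stored bytes that belong to the columns
                     the query touches (only meaningful for columnar formats).
    """
    return raw_bytes / compression_ratio * column_fraction


def query_cost_usd(bytes_read, price_per_tb=PRICE_PER_TB_USD):
    """Cost of a single query that reads `bytes_read` bytes."""
    return bytes_read / BYTES_PER_TB * price_per_tb


raw = 1 * BYTES_PER_TB  # hypothetical 1 TB table stored as uncompressed text

full_scan = query_cost_usd(scanned_bytes(raw))
optimized = query_cost_usd(
    scanned_bytes(raw, compression_ratio=3.0, column_fraction=0.25)
)

print(f"uncompressed text, all columns:      ${full_scan:.2f}")
print(f"compressed columnar, 1/4 of columns: ${optimized:.2f}")
```

Under these assumed numbers the same query drops from $5.00 to about $0.42, roughly a twelvefold saving, which is the kind of order-of-magnitude difference described above.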
As you can see, Athena has a lot of things going for it. In particular, its features make it an ideal solution for ad hoc queries on an S3-based data lake because:
- You can define whatever schema you need at read time for the ad hoc query.
- You do not need to do any ETL – the data can be read directly from S3 storage.
- You get full SQL support.
- You get near real-time performance.
- Your ad hoc queries will not affect the performance of other queries being performed on the data lake (e.g., ETL to data marts, reports, transaction processing) because Athena runs on its own dynamically deployed servers.
- You can leverage all this at a very attractive price, assuming you have minimized disk usage.
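As a rough illustration of the first three points, the Python sketch below assembles a Hive-style DDL statement and a nested ad hoc query as plain strings. The table name `weblogs`, its columns, the `dt` partition key, and the bucket `s3://example-bucket/weblogs/` are all hypothetical, and the helper `create_table_ddl` is invented for this example. Nothing here talks to AWS; it only shows the shape of the statements.

```python
# Sketch: build a schema-on-read table definition and an ad hoc query.
# Table, column, and bucket names are hypothetical placeholders.

def create_table_ddl(table, columns, partition_cols, s3_location, delimiter="\\t"):
    """Return a CREATE EXTERNAL TABLE statement for delimited text in S3.

    The data files are left untouched; the schema is applied only when
    the table is read, so the same files can back several definitions.
    """
    cols = ",\n  ".join(f"{name} {typ}" for name, typ in columns)
    parts = ", ".join(f"{name} {typ}" for name, typ in partition_cols)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} (\n  {cols}\n)\n"
        f"PARTITIONED BY ({parts})\n"
        f"ROW FORMAT DELIMITED FIELDS TERMINATED BY '{delimiter}'\n"
        f"LOCATION '{s3_location}'"
    )


ddl = create_table_ddl(
    table="weblogs",
    columns=[("request_time", "string"), ("url", "string"), ("status", "int")],
    partition_cols=[("dt", "string")],
    s3_location="s3://example-bucket/weblogs/",
)

# A nested query that filters on the partition key, so only one day's
# worth of files needs to be read.
adhoc_query = """
SELECT url, hits
FROM (
  SELECT url, COUNT(*) AS hits
  FROM weblogs
  WHERE dt = '2017-10-01'
  GROUP BY url
) AS per_url
WHERE hits > 100
ORDER BY hits DESC
""".strip()

print(ddl)
print()
print(adhoc_query)
```

Because the query filters on `dt`, it touches only the matching partition, which ties the schema-on-read and partitioning benefits back to the pay-per-data-read pricing. In practice the partitions would also have to be registered before they are queryable, and the statements would be submitted through the console or an SDK; both steps are omitted here.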
Before we get carried away with enthusiasm, however, let’s identify some use cases where Athena would not be the best choice:
- Analytic queries that need advanced analytic SQL extensions, e.g., sliding windows. Redshift would be a better choice in this instance.
- Reports that require extensive preprocessing of data, e.g., for cleansing or deduplication. Amazon’s Elastic MapReduce (EMR) is probably better suited here.
- Queries that require truly real-time responses, where the server deployment latency described above is unacceptable. Spark or an in-memory database would be a better option.
- Unstructured or semi-structured data. A NoSQL database such as Cassandra or DynamoDB would be a better choice to consider.
Overall, Athena is not a panacea for all big data use cases, and Google Cloud users will recognize many of Athena’s features from Google’s BigQuery, which has been around since 2012. However, Athena certainly fills a hole in the AWS big data ecosystem: ad hoc queries on a data lake. That is why Athena should be considered a major step forward for Amazon, which should now look towards filling the next big data hole.
About the author: Moshe Kranc is Chief Technology Officer at Ness Digital Engineering, a provider of digital transformation and software engineering services. Moshe has worked in the high-tech industry for over 30 years in the United States and Israel, and has extensive experience in leading adoption of bleeding-edge technologies. He previously headed the Big Data Centre of Excellence at Barclays’ Israel Development Centre (IDEC), and was part of the Emmy award-winning team that designed the scrambling system for DIRECTV. Moshe holds six patents in areas related to pay television, computer security and text mining. He has led R&D teams at companies such as Zoomix (purchased by Microsoft) and NDS (purchased by Cisco). He is a graduate of Brandeis University and earned graduate degrees from both the University of California at Berkeley and Boston University.