AWS To Build You a Data Lake in ‘A Few Clicks’
AWS yesterday announced Lake Formation, a new service that it says will let users build their own data lake on S3, complete with the requisite provisions for security, access control, data transformation, and cataloging, with just “a few clicks.” Together with Control Tower and Security Hub, it forms a trio of new services designed to radically simplify the onboarding and management of large amounts of data in the cloud.
“Everybody wants a data lake,” AWS CEO Andy Jassy said during his keynote address yesterday. “We have over 10,000 data lakes built on S3. But if you try to build a data lake, it’s hard.”
First, you have to prepare your storage and configure your S3 buckets, Jassy said. “Then you have to move your data from all the disparate places, and in the process you have to crawl the data to extract the schema. And you have to add metadata tags to the data so you can find it so you can put it in a catalog.”
Customers must also figure out how they’re going to store the data, including partitioning and indexing it in such a way that analysts and data scientists can access it when they need it. The data also needs to be cleaned and transformed to make it useful, Jassy said.
“And then the hardest part, as if that’s not enough, which is actually setting up the right security policies,” he said. “This is some of the most sensitive data in your enterprise, so you have to create data access rules at the table and column and row levels. And you have to figure out how to encrypt that data and you have to have the right access control, for each of the analytic and machine learning services that you want.
“It’s just a lot of work,” he continued. “This is a lot of work and for most customers it takes them several months to set up the data lake, which is frustrating.”
AWS has responded to the market demand for data lakes with the launch of Lake Formation, which will automate many of the tasks mentioned above and guide the user through the design decisions for the rest.
Customers get started with AWS Lake Formation by pointing the service at their existing S3 buckets, as well as at any data stored in AWS relational or NoSQL databases (external data can also be loaded via JDBC jobs managed using AWS’ ETL service, Glue). Then they can select which data access and security policies they want to apply to the data as they’re loading it into the lake.
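To make the idea of column-level access rules concrete, here is a minimal sketch in plain Python. It is not Lake Formation's API; the `POLICIES` table, the principal names, and the `read_table` helper are all hypothetical, invented only to illustrate the kind of rule a customer might select while loading data.

```python
from typing import Dict, List, Set

# Hypothetical policy table (an assumption for illustration, not an AWS structure):
# principal -> table -> set of columns that principal is allowed to read.
POLICIES: Dict[str, Dict[str, Set[str]]] = {
    "analyst": {"orders": {"order_id", "amount", "region"}},
    "data_scientist": {"orders": {"order_id", "amount", "region", "customer_email"}},
}


def read_table(principal: str, table: str, rows: List[dict]) -> List[dict]:
    """Return rows with every column the principal may not see stripped out."""
    allowed = POLICIES.get(principal, {}).get(table)
    if allowed is None:
        # No grant at the table level: deny access entirely.
        raise PermissionError(f"{principal} has no access to table {table}")
    return [{col: val for col, val in row.items() if col in allowed} for row in rows]


orders = [
    {"order_id": 1, "amount": 40.0, "region": "EU", "customer_email": "a@example.com"},
]
analyst_view = read_table("analyst", "orders", orders)
```

In this sketch the analyst's view of `orders` omits `customer_email`, while a principal with no entry for the table is refused outright.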
As the data is moved into the AWS lake, Lake Formation extracts technical metadata to create a data catalog, which makes the data easier to access and discover down the line. It also applies automated partitioning rules to make storage efficient, with the option to transform the data into Apache Parquet and ORC for faster analysis in downstream services, such as Redshift, Athena, or Elastic MapReduce (EMR) for Spark, AWS says.
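AWS did not spell out which partitioning rules Lake Formation applies. As a rough illustration of what partitioning data in object storage can look like, the sketch below groups records under Hive-style `key=value` prefixes by date, a layout commonly used with S3-backed query engines. The `clicks` table name, the `event_date` field, and both helper functions are assumptions made for the example.

```python
from collections import defaultdict
from datetime import date
from typing import Dict, List


def partition_prefix(table: str, day: date) -> str:
    """Build a Hive-style partition prefix such as table/year=YYYY/month=MM/day=DD/."""
    return f"{table}/year={day.year}/month={day.month:02d}/day={day.day:02d}/"


def group_by_partition(table: str, records: List[dict]) -> Dict[str, List[dict]]:
    """Bucket records by the partition prefix derived from their event_date field."""
    groups: Dict[str, List[dict]] = defaultdict(list)
    for record in records:
        groups[partition_prefix(table, record["event_date"])].append(record)
    return dict(groups)


records = [
    {"event_date": date(2018, 11, 28), "user": "u1"},
    {"event_date": date(2018, 11, 29), "user": "u2"},
    {"event_date": date(2018, 11, 29), "user": "u3"},
]
partitions = group_by_partition("clicks", records)
```

A query that filters on a single day then only has to read the objects under that day's prefix rather than scanning the whole table.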
The Lake Formation service also applies machine learning to de-duplicate the data, and lets users build their own ML Transforms using Glue to customize the transformation. The service also applies encryption to protect data stored in Lake Formation, and uses AWS Key Management Service to store the keys. It also logs all access to the data for compliance purposes via CloudTrail.
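The article does not describe the model behind that de-duplication. As a simple stand-in, the sketch below uses plain string similarity from Python's standard `difflib` module rather than machine learning, just to show the kind of near-duplicate matching involved. The `dedupe` function and the 0.9 threshold are assumptions chosen for illustration.

```python
from difflib import SequenceMatcher
from typing import List


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences do not matter."""
    return " ".join(text.lower().split())


def dedupe(records: List[str], threshold: float = 0.9) -> List[str]:
    """Keep the first record of each group whose similarity ratio meets the threshold."""
    kept: List[str] = []
    for record in records:
        candidate = normalize(record)
        is_duplicate = any(
            SequenceMatcher(None, candidate, normalize(existing)).ratio() >= threshold
            for existing in kept
        )
        if not is_duplicate:
            kept.append(record)
    return kept


unique = dedupe(["Acme Corp", "acme  corp", "Acme Corp.", "Globex"])
```

This pairwise comparison is quadratic in the number of records, so it only suits small inputs. A learned matcher would typically weigh several fields at once instead of a single string.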
“[L]ake Formation solves a lot of the problems and challenges [in setting up a data lake]. It lets you do it from a dashboard with just a few clicks,” Jassy said. “This is a step-level change in how easy it’s going to be for all of you to set up data lakes.”
AWS sees Lake Formation being used alongside two other new services that it just announced: Security Hub and Control Tower.
Security Hub is a new AWS service that provides a centralized GUI for managing all the security services a customer has running in their AWS environment. The service integrates with AWS security services, such as GuardDuty and Macie (which uses machine learning to detect anomalies in data patterns), as well as third-party security software, such as McAfee and Qualys.
“It will take all that data and aggregate that data for you, normalize that data, and make it easy and coherent for you to see and take action on in a single GUI in Security Hub,” Jassy said.
Control Tower, meanwhile, acts as a centralized place to set up and manage multi-account AWS environments in a secure and compliant manner. The service provides “best practices” blueprints for setting up and managing identities of users who have access to AWS accounts. It also provides pre-built blueprints for setting up virtual private clouds (VPCs) in AWS, as well as “guardrails” for implementing rules that enforce the level of security, compliance, and operational control that the customer demands.
“We know our customers love the breadth of capability available in AWS, but they also tell us they want us to package our services in ways that make it easier for them to build an architecture quickly,” Charlie Bell, senior vice president of AWS, says in a press release. “One of the central benefits of the cloud is that it removes the vast operational complexities of managing physical infrastructure. AWS’s new services abstract away additional complexity, speeding and simplifying the process of deploying and managing cloud workloads, so customers can