The Cure for Kubernetes Storage Headaches: Break Your Data Free
If you’re using Kubernetes, there’s likely a simple reason why: Because it makes your life easier. That is, after all, the whole premise behind container-based orchestration. Infrastructure becomes disposable. Spin it up when you need it, throw it away when you’re done, and let Kubernetes worry about the underlying infrastructure, so you don’t have to think too much about it.
At least, that’s how things are supposed to work. As you know if you’ve actually set up workloads that depend on persistent data, there’s one big asterisk – storage.
As great as Kubernetes is at abstracting away compute and networking infrastructure, it doesn’t work that way for storage when your apps are stateful and your data is persistent. Your application must still know all about the underlying storage infrastructure to find its way to the data it needs. And not just the location of that data, but all the other fine-grained considerations (performance, protection, resiliency, data governance, and cost) that come with different kinds of storage infrastructure, and that most data scientists don’t want to think about.
Why, in a cloud-native world where we’ve automated away the management of so much underlying hardware complexity, is storage still so painful? Two words: data silos.
As long as we continue to manage data via the different infrastructures it lives on, rather than focusing on the data itself, we’ll inevitably end up juggling islands of storage, with all the headaches that come with them. Fortunately, this is not an intractable problem. By changing the way we think about data management, from an infrastructure-centric to a data-centric approach, we can use Kubernetes to give us what was promised in the first place: making storage SEP (Someone Else’s Problem).
Virtualize Your Data
When the data you need is sprawled across different storage silos, each with its own unique attributes (this-or-that cloud, on-premises, object, high-performance, etc.), there’s just no way to abstract away infrastructure considerations. Someone still has to answer all those questions about performance and cost and data governance to set up your pipeline. (And if that person is an IT admin you call for help, you can bet they cringe every time your name pops up on a ticket. Because they know they’re going to be spending the day wrestling with arcane infrastructure interfaces to wrangle your data across all the different copies and data stores, and there’s no way they’re getting that done before lunch.)
The only way to get rid of that headache, and the only way to actually realize the speed and simplicity that Kubernetes is supposed to give you, is by virtualizing your data. Basically, you need an intelligent abstraction layer between your data and all your diverse storage infrastructure. That abstraction layer should let you see and access your data everywhere, without having to worry about whether a given infrastructure has the right cost, location, or governance for what you’re doing, and without having to constantly make new copies.
Making this happen is not as difficult as it sounds. The key: metadata. When you can encode all the data requirements, context, or lineage considerations into metadata that follows your data everywhere, then it no longer matters which infrastructure data happens to reside on at any given moment. Now, when you’re setting up a data pipeline, you can work entirely with metadata. And your virtualization layer can use AI/ML to automatically handle all the underlying data management and infrastructure considerations for you.
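To make the idea concrete, here is a minimal Python sketch of metadata-driven placement. It is an illustration under stated assumptions, not any product’s API: the names StorageTier, DataRequirements, choose_tier, and the sample tiers and prices are all hypothetical. The point is only that once requirements live in metadata, a layer can pick the infrastructure for you.

```python
from __future__ import annotations

from dataclasses import dataclass

# Hypothetical model: none of these names come from a real product or API.


@dataclass(frozen=True)
class StorageTier:
    name: str
    location: str          # e.g. "on-prem", "cloud-us"
    throughput_mbps: int   # sustained read throughput
    cost_per_gb: float     # illustrative monthly cost


@dataclass(frozen=True)
class DataRequirements:
    """Metadata that travels with a dataset, independent of where it is stored."""
    min_throughput_mbps: int = 0
    allowed_locations: frozenset = frozenset()  # empty means "anywhere"


def choose_tier(req: DataRequirements, tiers: list[StorageTier]) -> StorageTier:
    """Return the cheapest tier that satisfies the metadata requirements."""
    candidates = [
        t for t in tiers
        if t.throughput_mbps >= req.min_throughput_mbps
        and (not req.allowed_locations or t.location in req.allowed_locations)
    ]
    if not candidates:
        raise LookupError("no storage tier satisfies the requirements")
    return min(candidates, key=lambda t: t.cost_per_gb)


# Made-up tiers standing in for real storage silos.
TIERS = [
    StorageTier("nvme-onprem", "on-prem", 5000, 0.20),
    StorageTier("cloud-object", "cloud-us", 200, 0.02),
    StorageTier("cloud-block", "cloud-us", 1000, 0.10),
]
```

A pipeline would then state only what it needs, for example DataRequirements(min_throughput_mbps=500, allowed_locations=frozenset({"cloud-us"})), and never name a tier. A real virtualization layer would weigh far more than two attributes; this sketch uses a simple cheapest-match rule in place of whatever logic such a layer actually applies.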
Capitalize on Infrastructure Abstraction
Once you have your virtualization layer in place, and you’re handling data management via metadata, you can do all sorts of things you couldn’t do before. Things like:
- Eliminate data silos: Now, it doesn’t matter which infrastructure the data you need lives on or where that infrastructure is located. To your application, all those previously siloed storage resources (on-premises, cloud, hybrid, archival) just look like a universal global namespace.
- Access storage resources programmatically: Since you’re dealing in metadata—instead of a dozen different underlying hardware infrastructures—you can now set up your pipeline and access your data via declarative statements: I need this data, with this performance, and that’s all I really care about. The intelligent virtualization layer then goes and makes it happen, without your application (or your overburdened IT admin) needing to tell it exactly how.
- Make data management self-service: Data scientists don’t want to worry about comparing the costs of different storage types, enabling data protection, or making sure they’re meeting security and compliance requirements every time they set up a pipeline. (For that matter, your IT and security teams likely don’t want data scientists making those choices either—unless they like having everything run on the most expensive storage, without proper compliance.) Once you separate management of metadata from data, that all goes away. Storage administrators can set guardrails by configuring basic policy once. Users can then self-service most of their data management needs from then on—without opening a ticket, and without the errors that arise when they’re manually making those calls every time they set up a pipeline.
- Continually enrich your data: When your system supports customizable, extensible metadata, you can now do all sorts of interesting things. For example, you can build recursive processes, where you run data through a system, get some results, add those results back to the metadata, and run the job again. You can begin to build deep contextual understanding of the data around the data. The more that data is processed and used, the richer it becomes for other jobs in the future. And, that intelligence now always lives with that data everywhere, for any other application or data scientist who wants to use it. It’s not restricted to one copy, on one island of storage hidden away somewhere.
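Two of the ideas in this list, admin-set guardrails and recursive enrichment, can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the policy fields, the metadata keys, and the functions apply_guardrails, enrich, and tag_job are hypothetical and do not describe a real system.

```python
from __future__ import annotations

# Hypothetical sketch: policy fields and metadata keys are invented for illustration.

ADMIN_POLICY = {
    "max_cost_per_gb": 0.15,     # guardrail configured once by storage admins
    "require_encryption": True,
}


def apply_guardrails(request: dict, policy: dict = ADMIN_POLICY) -> dict:
    """Resolve a user's declarative request against admin policy; reject violations."""
    if request.get("max_cost_per_gb", 0.0) > policy["max_cost_per_gb"]:
        raise ValueError("requested cost ceiling exceeds the admin policy")
    resolved = dict(request)
    # Fill in anything the user left unsaid from the policy defaults.
    resolved.setdefault("max_cost_per_gb", policy["max_cost_per_gb"])
    resolved["encrypted"] = policy["require_encryption"] or request.get("encrypted", False)
    return resolved


def enrich(metadata: dict, job, passes: int = 2) -> dict:
    """Run a job repeatedly, folding each result back into a copy of the metadata."""
    enriched = dict(metadata)
    for _ in range(passes):
        enriched.update(job(enriched))
    return enriched


def tag_job(meta: dict) -> dict:
    """Toy job: the first pass adds a label, later passes build on that label."""
    if "label" not in meta:
        return {"label": "image"}
    return {"label_confidence": "reviewed", "runs": meta.get("runs", 1) + 1}
```

With these definitions, a data scientist would submit something like apply_guardrails({"dataset": "scans", "min_throughput_mbps": 500}) and get back a request that already carries the cost ceiling and encryption setting, with no ticket involved. Calling enrich({"dataset": "scans"}, tag_job) shows the second pass using what the first pass wrote, which is the recursive pattern described above.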
Unshackle Your Data
All of these things are possible when you virtualize your data, just because metadata is so much more flexible to work with than siloed storage infrastructures. The storage considerations that used to come with setting up and orchestrating your data pipeline can now just happen for you. Your storage resources become programmable, self-service, and automatically compliant, typically requiring no manual intervention.
All of a sudden, you’re actually living the reality that Kubernetes and software-defined storage were always supposed to deliver. Storage is software-defined, programmable, and consistent across hybrid cloud environments, regardless of the underlying infrastructure. Your data is richer and more flexible. Your IT team no longer keeps a blown-up picture from your ID card on the wall to throw darts at. Most important, you’re spending a lot more of your time actually working with your data, instead of worrying about where it lives.
About the author: Hammerspace Vice President of Product Marketing Brendan Wolfe has a long history of product marketing and product management in enterprise IT from servers to storage. Working with both large companies and startups, Brendan helps bring innovative products to new emerging markets.