The Cure for Kubernetes Storage Headaches: Break Your Data Free
If you’re using Kubernetes, there’s likely a simple reason why: Because it makes your life easier. That is, after all, the whole premise behind container-based orchestration. Infrastructure becomes disposable. Spin it up when you need it, throw it away when you’re done, and let Kubernetes worry about the underlying infrastructure, so you don’t have to think too much about it.
At least, that’s how things are supposed to work. As you know if you’ve actually set up workloads that depend on persistent data, there’s one big asterisk – storage.
As great as Kubernetes is at abstracting away compute and networking infrastructure, it just doesn’t work that way for storage when your apps are stateful and data is persistent. Your application still must know all about the underlying storage infrastructure to find its way to the data it needs. And not just the location of that data, but all the other fine-grained considerations (performance, protection, resiliency, data governance, and cost) that come with different kinds of storage infrastructure and that most data scientists don’t want to think about.
Why, in a cloud-native world where we’ve automated away the management of so much underlying hardware complexity, is storage still so painful? Two words: data silos.
As long as we continue to manage data via the different infrastructures it lives on, rather than focusing on the data itself, we’ll inevitably end up juggling islands of storage, with all the headaches that come with them. Fortunately, this is not an intractable problem. By changing the way we think about data management, from an infrastructure-centric to a data-centric approach, we can use Kubernetes to give us what was promised in the first place: making storage SEP (Someone Else’s Problem).
Virtualize Your Data
When the data you need is sprawled across different storage silos, each with its own unique attributes (this-or-that cloud, on-premises, object, high-performance, etc.), there’s just no way to abstract away infrastructure considerations. Someone still has to answer all those questions about performance and cost and data governance to set up your pipeline. (And if that person is an IT admin you call for help, you can bet they cringe every time your name pops up on a ticket. Because they know they’re going to be spending the day wrestling with arcane infrastructure interfaces to wrangle your data across all the different copies and data stores, and there’s no way they’re getting that done before lunch.)
The only way to get rid of that headache, and the only way to actually realize the speed and simplicity that Kubernetes is supposed to give you, is by virtualizing your data. Basically, you need an intelligent abstraction layer between your data and all your diverse storage infrastructure. That abstraction layer should let you see and access your data everywhere, without having to worry about whether a given infrastructure has the right cost, location, or governance for what you’re doing, and without having to constantly make new copies.
Making this happen is not as difficult as it sounds. The key: metadata. When you can encode all the data requirements, context, or lineage considerations into metadata that follows your data everywhere, then it no longer matters which infrastructure data happens to reside on at any given moment. Now, when you’re setting up a data pipeline, you can work entirely with metadata. And your virtualization layer can use AI/ML to automatically handle all the underlying data management and infrastructure considerations for you.
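To make the idea concrete, here is a minimal sketch in Python of requirements carried as metadata and a placement decision driven only by that metadata. Everything in it is an assumption made for illustration: the `StorageTier` and `DataRequirements` classes, the `choose_tier` function, the tier names, and the numbers do not come from any real product or from Kubernetes itself, and a simple "cheapest tier that qualifies" rule stands in for whatever logic a real virtualization layer would apply.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass(frozen=True)
class StorageTier:
    """A hypothetical storage backend and the properties it offers."""
    name: str
    location: str
    max_iops: int
    cost_per_gb: float


@dataclass
class DataRequirements:
    """Hypothetical metadata that travels with a dataset."""
    min_iops: int
    allowed_locations: Set[str]
    lineage: List[str] = field(default_factory=list)


def choose_tier(req: DataRequirements,
                tiers: List[StorageTier]) -> Optional[StorageTier]:
    """Return the cheapest tier that satisfies the metadata, or None."""
    candidates = [
        t for t in tiers
        if t.max_iops >= req.min_iops and t.location in req.allowed_locations
    ]
    return min(candidates, key=lambda t: t.cost_per_gb, default=None)


# Illustrative tiers only; names and figures are made up.
TIERS = [
    StorageTier("nvme-onprem", "on-prem", 100_000, 0.30),
    StorageTier("cloud-block", "cloud-eu", 20_000, 0.10),
    StorageTier("cloud-object", "cloud-eu", 500, 0.02),
]

if __name__ == "__main__":
    training_set = DataRequirements(min_iops=10_000,
                                    allowed_locations={"cloud-eu"})
    print(choose_tier(training_set, TIERS))
```

The pipeline author only states what the data needs. Which tier ends up holding it is decided elsewhere, and can change later without the pipeline definition changing.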
Capitalize on Infrastructure Abstraction
Once you have your virtualization layer in place, and you’re handling data management via metadata, you can do all sorts of things you couldn’t do before. Things like:
- Eliminate data silos: Now, it doesn’t matter which infrastructure the data you need lives on or where that infrastructure is located. To your application, all those previously siloed storage resources (on-premises, cloud, hybrid, archival) just look like a universal global namespace.
- Access storage resources programmatically: Since you’re dealing in metadata—instead of a dozen different underlying hardware infrastructures—you can now set up your pipeline and access your data via declarative statements: I need this data, with this performance, and that’s all I really care about. The intelligent virtualization layer then goes and makes it happen, without your application (or your overburdened IT admin) needing to tell it exactly how.
- Make data management self-service: Data scientists don’t want to worry about comparing the costs of different storage types, enabling data protection, or making sure they’re meeting security and compliance requirements every time they set up a pipeline. (For that matter, your IT and security teams likely don’t want data scientists making those choices either—unless they like having everything run on the most expensive storage, without proper compliance.) Once you separate management of metadata from data, that all goes away. Storage administrators can set guardrails by configuring basic policy once. Users can then self-service most of their data management needs from then on—without opening a ticket, and without the errors that arise when they’re manually making those calls every time they set up a pipeline.
- Continually enrich your data: When your system supports customizable, extensible metadata, you can now do all sorts of interesting things. For example, you can build recursive processes, where you run data through a system, get some results, add those results back to the metadata, and run the job again. You can begin to build deep contextual understanding of the data around the data. The more that data is processed and used, the richer it becomes for other jobs in the future. And, that intelligence now always lives with that data everywhere, for any other application or data scientist who wants to use it. It’s not restricted to one copy, on one island of storage hidden away somewhere.
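Three of the ideas above (declarative requests, administrator guardrails, and metadata that grows richer with each job) can be sketched in a few lines of Python. This is an illustration under stated assumptions, not any vendor's API: `Policy`, `DataRequest`, `resolve`, `enrich`, and the three performance levels are all hypothetical names invented for this example.

```python
from dataclasses import dataclass
from typing import Any, Dict

# Hypothetical performance levels, ordered from slowest to fastest.
PERFORMANCE_ORDER = ["low", "medium", "high"]


@dataclass(frozen=True)
class Policy:
    """Hypothetical guardrails an administrator configures once."""
    max_performance: str
    require_encryption: bool


@dataclass(frozen=True)
class DataRequest:
    """A user's declarative request: which data, at what performance."""
    dataset: str
    performance: str
    encrypted: bool = False


def resolve(request: DataRequest, policy: Policy) -> DataRequest:
    """Apply the guardrails and return the request that will be honored."""
    if request.performance not in PERFORMANCE_ORDER:
        raise ValueError(f"unknown performance level: {request.performance}")
    requested = PERFORMANCE_ORDER.index(request.performance)
    ceiling = PERFORMANCE_ORDER.index(policy.max_performance)
    # Cap the request at the administrator's ceiling.
    capped = PERFORMANCE_ORDER[min(requested, ceiling)]
    return DataRequest(
        dataset=request.dataset,
        performance=capped,
        encrypted=request.encrypted or policy.require_encryption,
    )


def enrich(metadata: Dict[str, Any], results: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of the metadata with job results merged in."""
    enriched = dict(metadata)
    enriched.update(results)
    # Track how many jobs have contributed to this metadata.
    enriched["runs"] = metadata.get("runs", 0) + 1
    return enriched


if __name__ == "__main__":
    policy = Policy(max_performance="medium", require_encryption=True)
    print(resolve(DataRequest("clickstream", "high"), policy))

    meta: Dict[str, Any] = {}
    meta = enrich(meta, {"label": "images"})
    meta = enrich(meta, {"quality_score": 0.9})
    print(meta)
```

The user never chooses a storage type or a compliance setting; the policy decides both. The enriched metadata is returned as a new dictionary on each pass, so every run can build on what earlier runs learned.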
Unshackle Your Data
All of these things are possible when you virtualize your data, just because metadata is so much more flexible to work with than siloed storage infrastructures. The storage considerations that used to come with setting up and orchestrating your data pipeline can now just happen for you. Your storage resources become programmable, self-service, and automatically compliant, typically requiring no manual intervention.
All of a sudden, you’re actually living the reality that Kubernetes and software-defined storage were always supposed to deliver. Storage is software-defined, programmable, and consistent across hybrid cloud environments, regardless of the underlying infrastructure. Your data is richer and more flexible. Your IT team no longer keeps a blown-up picture from your ID card on the wall to throw darts at. Most important, you’re spending a lot more of your time actually working with your data, instead of worrying about where it lives.
About the author: Hammerspace Vice President of Product Marketing Brendan Wolfe has a long history of product marketing and product management in enterprise IT from servers to storage. Working with both large companies and startups, Brendan helps bring innovative products to new emerging markets.