The Cure for Kubernetes Storage Headaches: Break Your Data Free
If you’re using Kubernetes, there’s likely a simple reason why: Because it makes your life easier. That is, after all, the whole premise behind container-based orchestration. Infrastructure becomes disposable. Spin it up when you need it, throw it away when you’re done, and let Kubernetes worry about the underlying infrastructure, so you don’t have to think too much about it.
At least, that’s how things are supposed to work. As you know if you’ve actually set up workloads that depend on persistent data, there’s one big asterisk – storage.
As great as Kubernetes is at abstracting away compute and networking infrastructure, it doesn’t work that way for storage when your apps are stateful and your data is persistent. Your application still must know all about the underlying storage infrastructure to find its way to the data it needs. And not just the location of that data, but all the other fine-grained considerations (performance, protection, resiliency, data governance, and cost) that come with different kinds of storage infrastructure, which most data scientists don’t want to think about.
Why, in a cloud-native world where we’ve automated away the management of so much underlying hardware complexity, is storage still so painful? Two words: data silos.
As long as we continue to manage data via the different infrastructures it lives on, rather than focusing on the data itself, we’ll inevitably end up juggling islands of storage, with all the headaches that come with them. Fortunately, this is not an intractable problem. By changing the way we think about data management, from an infrastructure-centric to a data-centric approach, we can use Kubernetes to give us what was promised in the first place: making storage SEP (Someone Else’s Problem).
Virtualize Your Data
When the data you need is sprawled across different storage silos, each with its own unique attributes (this-or-that cloud, on-premises, object, high-performance, etc.), there’s just no way to abstract away infrastructure considerations. Someone still has to answer all those questions about performance and cost and data governance to set up your pipeline. (And if that person is an IT admin you call for help, you can bet they cringe every time your name pops up on a ticket. Because they know they’re going to be spending the day wrestling with arcane infrastructure interfaces to wrangle your data across all the different copies and data stores, and there’s no way they’re getting that done before lunch.)
The only way to get rid of that headache, and the only way to actually realize the speed and simplicity that Kubernetes is supposed to give you, is by virtualizing your data. Basically, you need an intelligent abstraction layer between your data and all your diverse storage infrastructure. That abstraction layer should let you see and access your data everywhere, without having to worry about whether a given infrastructure has the right cost, location, or governance for what you’re doing, and without having to constantly make new copies.
Making this happen is not as difficult as it sounds. The key: metadata. When you can encode all the data requirements, context, or lineage considerations into metadata that follows your data everywhere, then it no longer matters which infrastructure data happens to reside on at any given moment. Now, when you’re setting up a data pipeline, you can work entirely with metadata. And your virtualization layer can use AI/ML to automatically handle all the underlying data management and infrastructure considerations for you.
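To make the idea concrete, here is a minimal sketch of what "working entirely with metadata" could look like. It is an illustration only, not any product's actual API: the `StorageTier` and `DataRequirements` types, the tier names, and the numbers are all hypothetical, and a simple rule (the cheapest tier that satisfies the requirements) stands in for the AI/ML-driven decision described above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StorageTier:
    """One storage backend the virtualization layer can place data on (hypothetical)."""
    name: str
    read_mbps: int
    cost_per_gb: float
    region: str


@dataclass(frozen=True)
class DataRequirements:
    """Requirements carried as metadata alongside the data (hypothetical fields)."""
    min_read_mbps: int
    allowed_regions: frozenset


def choose_tier(req, tiers):
    """Return the cheapest tier that meets the performance and location requirements."""
    candidates = [
        t for t in tiers
        if t.read_mbps >= req.min_read_mbps and t.region in req.allowed_regions
    ]
    if not candidates:
        raise LookupError("no storage tier satisfies the metadata requirements")
    return min(candidates, key=lambda t: t.cost_per_gb)


# Illustrative tiers; names and figures are invented for this example.
TIERS = [
    StorageTier("nvme-onprem", 3000, 0.20, "eu"),
    StorageTier("cloud-object", 200, 0.02, "us"),
    StorageTier("cloud-block", 800, 0.10, "eu"),
]

req = DataRequirements(min_read_mbps=500, allowed_regions=frozenset({"eu"}))
print(choose_tier(req, TIERS).name)  # cloud-block
```

The pipeline author states only what the data needs. Which backend ends up holding it is decided by the layer, and can change later without the pipeline being rewritten.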
Capitalize on Infrastructure Abstraction
Once you have your virtualization layer in place, and you’re handling data management via metadata, you can do all sorts of things you couldn’t do before. Things like:
- Eliminate data silos: Now, it doesn’t matter which infrastructure the data you need lives on or where that infrastructure is located. To your application, all those previously siloed storage resources (on-premises, cloud, hybrid, archival) just look like a universal global namespace.
- Access storage resources programmatically: Since you’re dealing in metadata—instead of a dozen different underlying hardware infrastructures—you can now set up your pipeline and access your data via declarative statements: I need this data, with this performance, and that’s all I really care about. The intelligent virtualization layer then goes and makes it happen, without your application (or your overburdened IT admin) needing to tell it exactly how.
- Make data management self-service: Data scientists don’t want to worry about comparing the costs of different storage types, enabling data protection, or making sure they’re meeting security and compliance requirements every time they set up a pipeline. (For that matter, your IT and security teams likely don’t want data scientists making those choices either—unless they like having everything run on the most expensive storage, without proper compliance.) Once you separate management of metadata from data, that all goes away. Storage administrators can set guardrails by configuring basic policy once. Users can then self-service most of their data management needs from then on—without opening a ticket, and without the errors that arise when they’re manually making those calls every time they set up a pipeline.
- Continually enrich your data: When your system supports customizable, extensible metadata, you can now do all sorts of interesting things. For example, you can build recursive processes, where you run data through a system, get some results, add those results back to the metadata, and run the job again. You can begin to build deep contextual understanding of the data around the data. The more that data is processed and used, the richer it becomes for other jobs in the future. And, that intelligence now always lives with that data everywhere, for any other application or data scientist who wants to use it. It’s not restricted to one copy, on one island of storage hidden away somewhere.
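Two of the ideas above, policy guardrails for self-service requests and metadata that gets richer with each job, can be sketched in a few lines. This is a hedged illustration under assumed names: `Policy`, `Dataset`, `apply_request`, and `enrich` are hypothetical and do not correspond to any specific product or Kubernetes API.

```python
from dataclasses import dataclass, field


@dataclass
class Policy:
    """Guardrails a storage administrator configures once (hypothetical fields)."""
    max_cost_per_gb: float
    require_encryption: bool


@dataclass
class Dataset:
    """A dataset identified by name, with extensible metadata attached to it."""
    name: str
    metadata: dict = field(default_factory=dict)


def apply_request(dataset, request, policy):
    """Check a self-service request against policy, then record it as metadata."""
    if request.get("cost_per_gb", 0.0) > policy.max_cost_per_gb:
        raise PermissionError("request exceeds the cost guardrail")
    objectives = dict(request)
    if policy.require_encryption:
        # Enforced by policy regardless of what the user asked for.
        objectives["encrypted"] = True
    dataset.metadata.update(objectives)
    return dataset


def enrich(dataset, job):
    """Run a job that reads the current metadata and writes its results back."""
    dataset.metadata.update(job(dataset.metadata))
    return dataset


def count_passes(meta):
    """Toy job: record how many times the data has been processed."""
    return {"passes": meta.get("passes", 0) + 1}


policy = Policy(max_cost_per_gb=0.15, require_encryption=True)
ds = Dataset("training-images")

# The user declares what they need; no ticket, no storage-specific settings.
apply_request(ds, {"min_read_mbps": 500, "cost_per_gb": 0.10}, policy)

# Each run sees the metadata left by the previous run.
enrich(ds, count_passes)
enrich(ds, count_passes)
print(ds.metadata)
```

The administrator touches `Policy` once. Users only ever submit requests, and every job's output stays attached to the dataset for the next consumer.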
Unshackle Your Data
All of these things are possible when you virtualize your data, just because metadata is so much more flexible to work with than siloed storage infrastructures. The storage considerations that used to come with setting up and orchestrating your data pipeline can now just happen for you. Your storage resources become programmable, self-service, and automatically compliant, typically requiring no manual intervention.
All of a sudden, you’re actually living the reality that Kubernetes and software-defined storage were always supposed to deliver. Storage is software-defined, programmable, and consistent across hybrid cloud environments, regardless of the underlying infrastructure. Your data is richer and more flexible. Your IT team no longer keeps a blown-up picture from your ID card on the wall to throw darts at. Most important, you’re spending a lot more of your time actually working with your data instead of worrying about where it lives.
About the author: Hammerspace Vice President of Product Marketing Brendan Wolfe has a long history of product marketing and product management in enterprise IT from servers to storage. Working with both large companies and startups, Brendan helps bring innovative products to new emerging markets.