Tag: apache spark

To Centralize or Not to Centralize Your Data–That Is the Question

Should you strive to centralize your data, or leave it scattered about? It seems like it should be a simple question, but it’s actually a tough one to answer, particularly because it has so many ramifications for how d Read more…

Google Cloud’s Dataproc Gets a GPU-Powered Spark Boost

Google Cloud’s Dataproc – its big data platform that allows users to run Apache Hadoop and Spark jobs – is getting a boost. Apache Spark 3 and Hadoop 3 have reached general availability, enhancing users’ data an Read more…

Spark 3.0 Brings Big SQL Speed-Up, Better Python Hooks

Apache Spark 3.0 is now here, and it’s bringing a host of enhancements across its diverse range of capabilities. The headliner is a big bump in performance for the SQL engine and better coverage of ANSI specs, while e Read more…

Databricks Brings Data Science, Engineering Together with New Workspace

Data scientists and software engineers work in different ways and use different tools. But both personas will feel more comfortable developing applications in the new version of Databricks Data Science Workspace, which t Read more…

Databricks Cranks Delta Lake Performance, Nabs Redash for SQL Viz

Today at its Spark + AI Summit, Databricks unveiled Delta Engine, a new layer in its Delta Lake cloud offering that uses several techniques to significantly accelerate the performance of SQL queries. The company also ann Read more…

Spark 3.0 to Get Native GPU Acceleration

NVIDIA today announced that it’s working with Apache Spark’s open source community to bring native GPU acceleration to the next version of the big data processing framework. With Spark version 3.0, which is due out n Read more…

Kaskada Accelerates ML Workflow with Its Feature Store

There’s a lot of surface area in the typical data science workflow for the purveyors of automation to attack. What moves the needle for the folks at the startup Kaskada is the feature engineering and deployment stage, Read more…

Data Lakes Get Structured

The explosion of unstructured and partially structured data has made traditional data lakes harder to manage. Adding to the challenge are “brittle” data pipelines that are time-consuming to create as well as ephemera Read more…

StreamSets Eases Spark-ETL Pipeline Development

Apache Spark gives developers a powerful tool for creating data pipelines for ETL workflows, but the framework is complex and can be difficult to troubleshoot. StreamSets is aiming to simplify Spark pipeline development Read more…

Program Synthesis Moves a Step Closer to Reality

As data scientists and software developers sort through the plethora of tools and APIs ranging from Python to Apache Spark, automation schemes are emerging to help programmers navigate those tools and the accompanying in Read more…

Understanding Your Options for Stream Processing Frameworks

Real-time stream processing isn't a new concept, but it's experiencing renewed interest from organizations tasked with finding ways to quickly process large volumes of streaming data. Luckily for you, there are a handful Read more…

Apache Spark Is Great, But It’s Not Perfect

Apache Spark is one of the most widely used tools in the big data space, and will continue to be a critical piece of the technology puzzle for data scientists and data engineers for the foreseeable future. With that said Read more…

Startup MemVerge Combines Memory, Storage

A startup combining persistent memory and data storage has emerged from stealth mode with a platform running on chip maker Intel’s Optane architecture. MemVerge claims to have invented what it calls “memory-conver Read more…

Here’s What Doug Cutting Says Is Hadoop’s Biggest Contribution

Apache Hadoop isn't the center of attention in the IT world anymore, and much of the hype has dissipated (or at least regrouped behind AI). But the open source software project still has a place for on-premises workloads, Read more…

How Walmart Uses Nvidia GPUs for Better Demand Forecasting

During a presentation at Nvidia's GPU Technology Conference (GTC) this week, the director of data science for Walmart Labs shared how the company's new GPU-based demand forecasting model achieved a 1.7% increase in forec Read more…

What Makes Apache Spark Sizzle? Experts Sound Off

Apache Spark is one of the most popular open source projects in the world, and has lowered the barrier of entry for processing and analyzing data at scale. We asked some of the leaders in the big data space to give us th Read more…

A Decade Later, Apache Spark Still Going Strong

Don't look now but Apache Spark is about to turn 10 years old. The open source project began quietly at UC Berkeley in 2009 before emerging as an open source project in 2010. For the past five years, Spark has been on an Read more…

Microsoft Invests in Databricks

Databricks, the high-flying analytics startup founded by the creators of Apache Spark, announced yet another venture funding haul this week as it hustles to meet what it says is growing demand for its analytics platform. Read more…

Google Brings Kubernetes Operator for Spark to GCP

Those looking to run Apache Spark on clusters managed with Kubernetes will be interested in the new Spark operator for Kubernetes unveiled by Google today. The software, which is in beta, will be supported on the Google Read more…
