Spark Summit 2020 Session: Chromatic Learning for Sparse Data

By Vlad Feinberg - May 28, 2020

While the format of Spark Summit 2020 might look a little different this year, I’m excited to be speaking remotely with over 7,500 engineers, scientists, developers, analysts, and leaders from around the world.

On Friday, June 26th, I’m leading a session on an effective tool for dealing with large, sparse datasets: graph coloring. Chromatic learning generalizes a previously studied technique to enable machine learning on enterprise data, which usually has so many features that it must be truncated before modeling.

Enterprise data poses a number of challenges for analysis. It’s high-dimensional but tends to be very sparse, so you have to rethink your data representation to make analysis tractable. As a result, organizations are unable to apply typical machine learning tools such as neural networks or polynomial regression to critical datasets about marketing spend efficiency, content consumption, or consumer behavior.

 

We could approach this problem using simpler models, like logistic regression with interaction terms, or specialized representations, like compressed sparse row format. But these methods come with conflicting tradeoffs: they’re memory-intensive, suffer performance degradation, or require sticking to a specific modeling approach.

In my talk, I’m going to break down these challenges and demonstrate a different approach for using these complex datasets for analysis and machine learning. Specifically, we’ll look at how approximate graph coloring can significantly collapse dataset width. We’ll walk through the speed gains and the accuracy implications of taking advantage of mutual exclusivity in the data.
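To make the idea concrete, here is a minimal sketch of how coloring can collapse width. It is not the method from the talk: it uses a plain greedy coloring rather than an approximate one, and the toy rows, feature names, and helper functions (`cooccurrence_graph`, `greedy_coloring`, `collapse`) are all made up for illustration. Features that never appear in the same row are mutually exclusive, so they can share a color, and each color becomes a single column.

```python
from itertools import combinations

def cooccurrence_graph(rows):
    """Map each feature to the set of features it shares at least one row with."""
    graph = {}
    for row in rows:
        for f in row:
            graph.setdefault(f, set())  # keep features that never co-occur
        for a, b in combinations(sorted(row), 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph

def greedy_coloring(graph):
    """Give each feature the smallest color not used by a co-occurring feature."""
    colors = {}
    # Visit high-degree features first (ties broken by name for determinism).
    for f in sorted(graph, key=lambda f: (-len(graph[f]), f)):
        taken = {colors[n] for n in graph[f] if n in colors}
        c = 0
        while c in taken:
            c += 1
        colors[f] = c
    return colors

def collapse(rows, colors):
    """Rewrite each sparse row as one slot per color, holding the active feature."""
    width = max(colors.values()) + 1
    out = []
    for row in rows:
        dense = [None] * width
        for f in row:
            dense[colors[f]] = f  # safe: co-occurring features have distinct colors
        out.append(dense)
    return out

# Hypothetical toy data: each row is the set of features that are "on".
rows = [
    {"country=US", "device=ios"},
    {"country=FR", "device=android"},
    {"country=US", "device=android"},
    {"country=DE"},
]

colors = greedy_coloring(cooccurrence_graph(rows))
collapsed = collapse(rows, colors)
print(len(colors), "features ->", max(colors.values()) + 1, "columns")
```

In this toy case the five one-hot features fit into two columns, because no two country features (and no two device features) are ever active in the same row. Each collapsed column can then be treated as a categorical feature by a downstream model.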

(Illustrations by Michie Cao)

Now that Spark Summit is virtual, attendance is free for General Admission. That means you can attend my session on June 26th at 10:30 am PST as well as the sessions and keynotes throughout the week. And if you can’t make it, sessions will be available on demand. Sign up now, and see you at Spark Summit 2020!

 

