By Vlad Feinberg - May 28, 2020
While the format of Spark Summit 2020 might look a little different this year, I’m excited to be speaking remotely with over 7,500 engineers, scientists, developers, analysts, and leaders from around the world.
On Friday, June 26th, I’m leading a session on an effective tool for dealing with large, sparse datasets: graph coloring. Chromatic learning generalizes a previously studied technique to enable machine learning on enterprise data, which usually has so many features that it must be truncated before modeling.
Enterprise data poses a number of challenges for analysis. It’s high-dimensional but tends to be very sparse, so you have to rethink your data representation to make analysis tractable. As it stands, organizations are unable to apply typical machine learning tools such as neural networks or polynomial regression to critical datasets about marketing spend efficiency, content consumption, or consumer behavior.
We could approach this problem using simpler models, like logistic regression with interaction terms, or specialized representations, like compressed sparse row format. But these methods come with conflicting tradeoffs: they’re memory-intensive, suffer performance degradation, or require sticking to a specific modeling approach.
In my talk, I’m going to break down these challenges and demonstrate a different approach to using these complex datasets for analysis and machine learning. Specifically, we’ll look at approximate graph coloring to significantly collapse dataset width. We’ll walk through the speed gains and the accuracy implications of taking advantage of mutual exclusivity in the data.
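To make the idea concrete, here is a minimal sketch in Python. It is not the implementation from the talk. It assumes binary features, treats two features as conflicting if they are ever active in the same row, and uses a plain greedy coloring as a stand-in for the approximate coloring the session covers. The function names and the toy data are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def build_cooccurrence_graph(rows):
    """Map each feature to the set of features it co-occurs with in some row."""
    neighbors = defaultdict(set)
    for active in rows:
        for f in active:
            neighbors[f]  # touch the key so isolated features still become nodes
        for a, b in combinations(sorted(active), 2):
            neighbors[a].add(b)
            neighbors[b].add(a)
    return neighbors


def greedy_coloring(neighbors):
    """Give each feature the smallest color not used by its neighbors."""
    colors = {}
    # Visiting high-degree features first is a common heuristic (an assumption
    # here, not something the post specifies).
    for f in sorted(neighbors, key=lambda f: (-len(neighbors[f]), f)):
        taken = {colors[n] for n in neighbors[f] if n in colors}
        color = 0
        while color in taken:
            color += 1
        colors[f] = color
    return colors


def collapse(rows, colors):
    """Rewrite each sparse row as one categorical value per color."""
    n_colors = max(colors.values()) + 1
    dense = []
    for active in rows:
        row = [None] * n_colors  # None: no feature of that color is active
        for f in active:
            row[colors[f]] = f  # safe: same-colored features never co-occur
        dense.append(row)
    return dense


# Toy data: each row is the set of active (nonzero) binary features.
rows = [
    {"country=US", "device=ios"},
    {"country=DE", "device=android"},
    {"country=US", "device=android"},
    {"country=FR", "device=ios"},
]

graph = build_cooccurrence_graph(rows)
colors = greedy_coloring(graph)
dense = collapse(rows, colors)
```

On this toy data, the five one-hot features collapse into two categorical columns, one per color, because no two features of the same color are ever active together. Real enterprise data is rarely this clean, which is where an approximate coloring and its accuracy tradeoffs come in.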
(Illustrations by Michie Cao)
Now that Spark Summit is virtual, attendance is free for General Admission. That means you can attend my session on June 26th at 10:30 am PST as well as the sessions and keynotes throughout the week. And if you can’t make it, sessions will be available on demand. Sign up now, and see you at Spark Summit 2020!