By Vlad Feinberg - February 11, 2020
At Sisu, we’re working with large, sparse datasets of Internet scale and shape. One of the most common and painful issues we see, particularly in enterprise data, is a glut of rare and high-cardinality categorical features, driving a large column count that slows down computation. For instance, key business metrics like click-through rate for a content publisher are impacted by a variety of infrastructure configurations (which CDNs are you using, page load latency), product decisions (a new layout or user flow), and first-party content metadata (content topics, author name).
Not only are these wide datasets expensive to analyze, they’re also frustrating to interpret because they tend to be mostly NULL. In practice, this means organizations aren’t able to make use of critical data like user behavior, content consumption, and marketing influence. To explain this challenge in more detail, let’s start with a dataset that’s representative of this problem: the Malicious URL dataset. It contains information on URLs, namely lexical and host-related features, with the security-oriented objective of identifying malicious links.
Let’s break down what this data set looks like:
This many categorical features is a showstopper for many popular classification algorithms, like logistic regression, where the number of parameters equals the number of features. Even algorithms with sparse input support, like XGBoost, suffer from performance degradation because there are so many feature columns to consider for a tree split.
Vowpal Wabbit’s hashing trick, which maps each of a large number of features into one of N integer buckets, is a flexible existing approach to this problem. However, to avoid mapping two distinct features to the same bucket, you need N to be very large, which is problematic if you can’t stomach the entire width of the table.
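To make the trade-off concrete, here is a minimal sketch of the hashing trick in plain Python (not Vowpal Wabbit’s actual implementation; the `cdn` feature names are made up for illustration). Each `column=value` feature is hashed to one of N buckets with a stable hash, and collisions become likely as soon as N is small relative to the feature count:

```python
import hashlib

def hash_feature(column, value, n_buckets):
    """Map a (column, value) feature to one of n_buckets integer slots.

    Uses a stable hash so the mapping is consistent across runs;
    distinct features may collide when n_buckets is small.
    """
    key = f"{column}={value}".encode()
    digest = hashlib.md5(key).hexdigest()
    return int(digest, 16) % n_buckets

# With only a handful of buckets, distinct CDN values can land
# in the same slot, silently merging unrelated features.
slots = [hash_feature("cdn", v, 8) for v in ["akamai", "fastly", "cloudflare"]]
```

Growing `n_buckets` makes collisions rarer but brings back exactly the width problem we were trying to avoid.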
We’ll demonstrate a powerful input representation that fixes these cases by leveraging a sparsity structure that occurs frequently in practice: mutual exclusivity. This technique was first proposed by the LightGBM authors, but we’ve found it very useful in lots of other settings, so we wanted to shine some light on it.
Many data-generating processes for structured data, especially those coming from processes with control flow, contain natural mutual exclusivity. Consider a hypothetical medical survey:
As long as we don’t care about null values, i.e., people who didn’t answer “How long ago was your most recent surgery?”, then we can combine them into a single column containing values surgery=3mo, surgery=12mo, or surgery=never.
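A sketch of that collapse in plain Python (the survey column names are hypothetical, mirroring the surgery example above): because at most one of the mutually exclusive columns is non-null in any row, the three columns compress losslessly into one label per row.

```python
# Hypothetical survey rows; the three one-hot columns are mutually
# exclusive, so at most one is non-null in any given row.
rows = [
    {"surgery=3mo": 1, "surgery=12mo": None, "surgery=never": None},
    {"surgery=3mo": None, "surgery=12mo": None, "surgery=never": 1},
]

def collapse(row, columns):
    """Collapse mutually exclusive columns into a single label:
    the name of the one column that is present, or None if the
    row answered none of them."""
    present = [c for c in columns if row.get(c) is not None]
    assert len(present) <= 1, "columns are not mutually exclusive"
    return present[0] if present else None

exclusive = ["surgery=3mo", "surgery=12mo", "surgery=never"]
labels = [collapse(r, exclusive) for r in rows]
# labels == ["surgery=3mo", "surgery=never"]
```

Three sparse columns become one dense categorical column, which is exactly the width reduction we’re after.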
But given a dataframe with a large number of categorical columns, like under_65, hip_replacement_surgery, and health_insurance, identifying these cases of mutual exclusivity requires intensive manual review of how the data creation process relates each pair of columns. For example, is hip_replacement_surgery always NULL when under_65 is true? We’d like to detect such correspondences automatically, and use them to collapse the data as much as possible.
This brings us to the crux of the transform: we use approximate graph coloring. Suppose we start out with a dataframe looking like this:
We say two features (column-value pairs) co-occur if there’s a row with both of them present. In one pass, we can quickly construct the co-occurrence graph where vertices are features and edges are present whenever two features co-occur.
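That single pass can be sketched as follows (a toy illustration, not Sisu’s production code; the column names are made up). We scan each row once, collect its present features, and record an edge for every pair of features that appear together:

```python
from itertools import combinations

# Toy dataframe as a list of rows; None marks a missing value.
rows = [
    {"a": "x", "b": None, "c": "u"},
    {"a": None, "b": "y", "c": "v"},
]

def cooccurrence_graph(rows):
    """One pass over the rows: vertices are (column, value) features,
    with an edge between every pair of features present in the same row."""
    vertices, edges = set(), set()
    for row in rows:
        feats = sorted((col, val) for col, val in row.items() if val is not None)
        vertices.update(feats)
        for f, g in combinations(feats, 2):
            edges.add((f, g))
    return vertices, edges

verts, edges = cooccurrence_graph(rows)
# ("a", "x") and ("b", "y") never share a row, so no edge joins them:
# they are mutually exclusive and can be bundled.
```

Features left non-adjacent in this graph are exactly the mutually exclusive ones, which sets up the coloring step below.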
The next step is to perform an approximate graph coloring on our co-occu