By Vlad Feinberg - February 11, 2020
At Sisu, we’re working with large, sparse datasets of Internet scale and shape. One of the most common and painful issues we see–particularly in enterprise data–is a glut of rare and high-cardinality categorical features, driving a large column count that slows down computation. For instance, key business metrics like click-through rate for a content publisher are impacted by a variety of infrastructure configurations (which CDNs are you using, page load latency), product decisions (a new layout or user flow), and first-party content metadata (content topics, author name).
Not only are these wide data sets expensive to analyze, they’re also frustrating to interpret because they tend to be highly NULL. In practice, this means organizations aren’t able to make use of critical data like user behavior, content consumption, and marketing influence. To explain this challenge in more detail, let’s start with a data set that’s representative of this problem: the Malicious URL dataset. It contains information on URLs, namely lexical and host-related features, with the security-oriented objective of identifying malicious links.
Let’s break down what this data set looks like:
This many categorical features is a showstopper for many popular classification algorithms, like logistic regression, where the number of parameters equals the number of features. Even algorithms with sparse input support, like XGBoost, suffer from performance degradation because there are so many feature columns to consider for a tree split.
Vowpal Wabbit’s hashing trick, which converts a large number of columns to N integers, is a flexible existing approach to this problem. However, to avoid mapping two columns to the same integer, you need N to be very large, which is problematic if you can’t stomach the entire width of the table.
We’ll demonstrate a powerful input representation that fixes these cases by leveraging a sparsity structure that occurs frequently in practice: mutual exclusivity. This technique was first proposed by the LightGBM authors, but we’ve found it very useful in lots of other settings, so we wanted to give it some light.
Many data-generating processes for structured data, especially those coming from processes with control flow, contain natural mutual exclusivity. Consider a hypothetical medical survey:
As long as we don’t care about null values, i.e., people who didn’t answer “How long ago was your most recent surgery?”, then we can combine them into a single column containing values surgery=3mo, surgery=12mo, or surgery=never.
But given a dataframe with a large number of categorical columns, like under_65, hip_replacement_surgery, and health_insurance, identifying these cases of mutual exclusivity requires intensive manual review of the data creation process’ interaction between each pair of columns. For example, is hip_replacement_surgery always NULL when under_65 is true? We’d like to detect such correspondences automatically, and use them to collapse the data as much as possible.
This brings us to the crux of the transform: we use approximate graph coloring. Suppose we start out with a dataframe looking like this:
We say two features (column-value pairs) co-occur if there’s a row with both of them present. In one pass, we can quickly construct the co-occurrence graph where vertices are features and edges are present whenever two features co-occur.
The next step is to perform an approximate graph coloring on our co-occurrence graph. In practice greedy coloring is exceedingly fast, orders of magnitude faster than graph construction. A coloring yields a color map from vertices to colors, such that no two vertices of the same color are adjacent, which means they never co-occur.
But if they never co-occur then we can make a new column, one for each color, that recovers the original features with perfect fidelity.
In effect, we were able to perform a column reduction. Since the original set of columns was a valid coloring, we’ll tend to end up with fewer columns to represent the same non-null features, making it easier to express decision boundaries. A decision tree that would normally have to be depth-3 in the above example can now get away with being depth-2.
Column reduction in particular can be very powerful for improving speed and accuracy when working with high-dimensional data. To illustrate this, let’s examine the impact on our real dataset when we take XGBoost and apply graph coloring to it.
(Illustrations by Michie Cao)
As we can see above, this can be quite a powerful tool for working with structured data, whether doing diagnostic analysis, supervised learning, or modeling. We provide open-source code for this technique, with a vectorized implementation, so that you can use it in contexts outside of decision tree training, which can take advantage of parallel compute.
If you like to solve these kinds of practical ML problems, join us.