By Vlad Feinberg - June 22, 2020
This is the second post in a series. In his previous post, Graph Coloring for Machine Learning, Vlad Feinberg describes how approximate graph coloring can be used to transform sparse enterprise data for analysis.
At Sisu, we help analysts make better decisions with enterprise data, and to do so, we frequently need to process very high-dimensional, sparse datasets efficiently. Previously, we demonstrated how graph coloring can be an effective strategy for decision tree training, but sometimes we want to consider other kinds of machine learning models. Neural networks, for instance, are a powerful tool that many data scientists might want to apply in sparse settings. For this kind of data, though, it’s not practical to apply neural networks directly because of the amount of memory they require to process sparse inputs. In this post, we’ll describe how you can use graph coloring to make neural networks applicable in such sparse settings.
As we discussed before, the rise of cheap data warehousing and the standardization of the SaaS data-collection pipeline with companies like Segment, Stitch, and Heap make it easy to collect data that looks like this:
This type of sparse data contains a lot of useful information. For example, let’s say we’re analyzing e-commerce data, and we have an analytics event that captures an order. A reasonable question might be: will the customer who placed this order want to re-order this item later?
To answer this question, we’d pull in a large number of features about the order itself, including data about the customer who placed it, their user preferences, and their full order history. Many of these features are set-valued: everything a user ordered in the past is a small subset of the entire online catalog.
High-dimensional set-valued data poses a challenge to anyone who wants to continuously analyze their data: the features themselves are constantly changing (e.g., users might change their preferences), and the schema describing the features is changing too (e.g., the catalog grows as we sell new types of products).
Like the Malicious URL datasets we used in our previous post, we often deal with sparsity that looks like:
For neural networks, D-dimensional inputs, where D is in the millions, require correspondingly large amounts of memory on the GPU used to run them. A neural network’s input layer is typically a large matrix: the number of rows is determined by the number of neurons in that layer, and the number of columns by the number of inputs you have. If your hidden layer has H neurons (the output dimension of the linear input layer), and your batch size is B, then a dense representation of the first layer’s activations will take B × H × D × 4 bytes, where 4 bytes represent one single-precision floating-point number.
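To make that arithmetic concrete, here’s a minimal sketch of the memory estimate. The batch size, hidden width, and feature count below are illustrative choices, not the exact configuration from our experiments:

```python
def activation_bytes(batch, hidden, num_features, bytes_per_float=4):
    """Memory for a dense B x H x D tensor of 4-byte floats."""
    return batch * hidden * num_features * bytes_per_float

# Illustrative values: batch of 256, hidden width 8, and roughly the
# 3.2M features of the Malicious URLs dataset (counts are approximate).
gb = activation_bytes(256, 8, 3_200_000) / 1e9
print(f"{gb:.1f} GB")  # prints "26.2 GB": far beyond a single GPU's memory
```

Even modest hidden widths push the total past what a single accelerator can hold, which matches the numbers below.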
If we were only going to use a neural network framework to evaluate the Malicious URL dataset directly, we’d need over 27GB of GPU memory! This won’t fit on even advanced cloud GPUs like the Tesla V100. What makes this hard is that our inputs are set-valued, so we can’t just use a single embedding.
This isn’t to say there aren’t workarounds: you could buy more expensive GPUs or use slower algorithms—today, this means writing your own sparse backpropagation implementation or switching to CPUs. However, you would still end up paying for that many parameters with run time. Even when we lowered H down to 4—so we could run it on a V100—training took multiple hours (we gave up at that point). For a dataset that takes a few GB to represent on disk, this is just way too much time.
Instead, if we take our sparse dataset with n training examples and color it using the approximate graph coloring techniques in the previous blog post, we’ll end up with 𝜒 categorical columns.
The total number of categories across all 𝜒 columns is still D, but we only need H × 𝜒 dimensions to embed any input vector. This means we can use the usual embedding approach to represent categorical data, which is available in most neural network frameworks, including PyTorch and TensorFlow. As a result, the memory usage is negligible (in our experiments, well under 8GB of GPU memory without any optimization for a batch size of 256).
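A minimal PyTorch sketch of this embedding approach might look like the following. The module name and shapes are illustrative; it assumes the coloring step has already mapped each example to 𝜒 category indices, with 0 reserved for “no feature active for this color”:

```python
import torch
import torch.nn as nn

class ChromaticInput(nn.Module):
    """Input layer over chromatic (colored) categorical features.

    Assumes the D original sparse features have been partitioned into
    num_colors groups by graph coloring, so each example is a vector
    of num_colors category indices rather than a D-dimensional vector.
    """
    def __init__(self, num_categories, num_colors, hidden):
        super().__init__()
        # One table over all D categories; index 0 is a zero "absent"
        # embedding for colors with no active feature.
        self.embed = nn.Embedding(num_categories + 1, hidden, padding_idx=0)
        self.num_colors = num_colors

    def forward(self, color_indices):
        # color_indices: (batch, num_colors) integer tensor.
        e = self.embed(color_indices)          # (batch, num_colors, hidden)
        return e.reshape(color_indices.shape[0], -1)  # (batch, num_colors * hidden)
```

The activations here are B × 𝜒 × H instead of B × D × H, which is where the memory savings come from when 𝜒 is orders of magnitude smaller than D.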
Of course, with a compact representation, there are tremendous speed advantages as well. Training with sparse data that took multiple hours before now takes less than 10 minutes in the chromatic representation.
So how does this method perform in terms of speed and accuracy on sparse datasets? We evaluated our new neural network, built on the Wide and Deep architecture, on the URLs dataset, comparing against our previous decision tree results.
As we can see, a larger model achieves higher accuracy at the cost of a longer runtime. Now we can make the tradeoff between larger, more expensive models and cheaper, less accurate ones, whereas previously, for sparse data, we simply didn’t have the option of using neural networks at all.
One very curious observation is that the categorical variables constructed in the chromatic representation are virtual: features grouped into the same color don’t need to be semantically related; they only need to be mutually exclusive in the training data. Let’s think about this in terms of our e-commerce example. In our training data, customers may never have purchased toilet paper together with laptop chargers. As a result, the did_purchase_toilet_paper and did_purchase_laptop_charger categories could be placed in the same color variable. If this is the case, the embedding dimensions corresponding to the chromatic categorical variable “red” could take the value for toilet paper or for laptop charger, but not both at once.
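Here’s a small pure-Python sketch of this encoding step (the function name and feature ids are hypothetical). Given the color assigned to each feature by the approximate graph coloring, it packs a sparse set of active features into a dense 𝜒-length vector, with 0 meaning “no feature for this color”:

```python
def to_chromatic(active_features, color_of, num_colors):
    """Map a sparse set of active feature ids (>= 1) to a dense
    chromatic vector of length num_colors.

    color_of: dict mapping each feature id to its color, produced by
    the approximate graph coloring. Features sharing a color are
    assumed mutually exclusive in the training data; if two active
    features do share a color, the later one silently wins here.
    """
    row = [0] * num_colors
    for f in active_features:
        row[color_of[f]] = f
    return row

# e.g. features 3 and 7 active, with 3 colored 0 and 7 colored 1:
# to_chromatic([3, 7], {3: 0, 7: 1, 9: 1}, 2) -> [3, 7]
```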
But couldn’t a customer order both toilet paper and a laptop charger in the same transaction, causing both items to appear in the test data? This kind of test-time collision can and does happen, and it is part of why chromatic learning’s effectiveness is so intriguing. In short, if you have enough training data, then any pair of features that co-occurs often enough to affect your accuracy is likely to co-occur in the training set as well, so those features are never assigned the same color, and the collisions that remain at test time are rare. We formalize this intuitive argument in our paper, which also evaluates this technique on a variety of other sparse datasets and neural network architectures.
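One way to sanity-check this on your own data is to measure how often collisions actually occur. The sketch below (names are illustrative) computes the fraction of examples in which two active features share a color:

```python
def collision_rate(rows, color_of):
    """Fraction of examples with a test-time collision, i.e. two
    active features mapped to the same color.

    rows: iterable of sets of active feature ids.
    color_of: dict mapping each feature id to its color.
    """
    collided = 0
    total = 0
    for feats in rows:
        total += 1
        colors = [color_of[f] for f in feats]
        if len(set(colors)) < len(colors):
            collided += 1
    return collided / total if total else 0.0
```

If this rate is low on held-out data, collisions are unlikely to move the accuracy numbers much.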
If you like to solve these kinds of practical ML problems, join us.
Illustrations by Michie Cao.
For benchmark code, view our open source repository.