By Vlad Feinberg - June 22, 2020
This is the second post in a series. In his previous post, Graph Coloring for Machine Learning, Vlad Feinberg describes how approximate graph coloring can be used to transform sparse enterprise data for analysis.
At Sisu, we help analysts make better decisions with enterprise data, and to do so, we frequently need to process very high-dimensional, sparse datasets efficiently. Previously, we demonstrated how graph coloring could be an effective strategy for decision tree training. Sometimes, though, we want to consider other kinds of machine learning models. For instance, neural networks are a powerful tool that many data scientists might want to apply in sparse settings. Unfortunately, applying them directly to this kind of data is impractical because of the memory required to process dense representations of sparse inputs. In this post, we'll describe how you can use graph coloring to make neural networks applicable in such sparse settings.
As we discussed before, the rise of cheap data warehousing and the standardization of the SaaS data-collection pipeline with companies like Segment, Stitch, and Heap make it easy to collect data that looks like this:
This type of sparse data contains a lot of useful information. For example, let’s say we’re analyzing e-commerce data, and we have an analytics event that captures an order. A reasonable question might be: Might the customer who placed this order want to re-order this item later?
To answer this question, we'd pull in a large number of features about the order itself, including data about the customer who placed it, their user preferences, and their full order history. Many of these features are set-valued: everything a user ordered in the past is a small subset of the entire online catalog.
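To make "set-valued" concrete, one common way to feed such a feature to a model is a multi-hot encoding: a 0/1 vector over the whole catalog with a 1 at each item the user has ordered. This is a minimal sketch with a hypothetical five-item catalog; a real catalog would have millions of items, which is exactly where the dense vector becomes a problem.

```python
# Hypothetical toy catalog; a real e-commerce catalog has millions of items.
catalog = ["shoes", "socks", "hat", "scarf", "gloves"]
index = {item: i for i, item in enumerate(catalog)}

def multi_hot(order_history):
    """Encode a set of ordered items as a dense 0/1 vector over the catalog."""
    vec = [0] * len(catalog)
    for item in order_history:
        vec[index[item]] = 1
    return vec

multi_hot({"shoes", "hat"})  # -> [1, 0, 1, 0, 0]
```

Note that the vector's length is fixed by the catalog size, not by how many items the user actually ordered, so almost every entry is zero for a typical user.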
High-dimensional set-valued data poses a challenge to anyone who wants to continuously analyze their data: all the features are constantly changing (e.g., users might change their preferences), and the schema describing the features is itself changing (e.g., our catalog grows as we sell different types of products).
Like the Malicious URL datasets we used in our previous post, we often deal with sparsity that looks like:
For neural networks, D-dimensional inputs, where D is in the millions, require correspondingly large amounts of memory on the GPU used to run them. A neural network's input layer is a large matrix: the number of rows is determined by the number of neurons in that layer, and the number of columns by the number of input features. If your hidden layer has H neurons (the output dimension of the linear input layer), the input layer's weight matrix alone takes H × D × 4 bytes, where it takes 4 bytes to represent a floating-point number. On top of that, with a batch size of B, a dense representation of the input batch takes another B × D × 4 bytes.
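To get a feel for the scale, here is a minimal sketch of that arithmetic. The function and the specific numbers (D = 3M features, H = 1024 hidden units, B = 256) are hypothetical choices for illustration, not values from any particular dataset.

```python
def dense_input_layer_bytes(d, h, b, bytes_per_float=4):
    """Approximate memory for the input layer's weight matrix
    plus one densified input batch, in bytes.

    d: input dimension (number of features)
    h: hidden-layer width (neurons in the input layer)
    b: batch size
    """
    weight_matrix = h * d * bytes_per_float  # H x D weights
    input_batch = b * d * bytes_per_float    # B x D dense inputs
    return weight_matrix + input_batch

# Hypothetical sizes: 3M features, 1024 hidden units, batch of 256.
total = dense_input_layer_bytes(3_000_000, 1024, 256)
gib = total / 2**30  # roughly 14 GiB -- already near a single GPU's capacity
```

Even these modest hypothetical settings land in the tens of gigabytes, which is why a dense treatment of million-dimensional sparse inputs doesn't fit on typical accelerators.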