Engineering

By John Hallman - January 7, 2021

At Sisu, we regularly analyze massive text and unstructured datasets of internet scale, such as customer reviews, item descriptions, and transaction details. While the size and complexity of these datasets render many common natural language processing (NLP) techniques prohibitively slow for interactive data analysis, n-gram featurization, one of the simplest tools in NLP, has proven to be invaluable to us due to the interpretability of its features and its computational efficiency.

While n-grams are most commonly used as features in classification and regression tasks, Sisu’s use case is slightly different. We utilize n-grams in conjunction with other dataset-specific and derived features to analyze changes in customer KPIs based on statistical properties such as increases or decreases in the prevalence of specific n-grams across text datasets. These signals make it easier to draw connections between our customers’ KPIs and the content of their unstructured data, while preserving interpretability of results.

For a more concrete example, let’s take a look at the Amazon Reviews dataset, which contains all the reviews ever posted on Amazon, along with their respective ratings from 1 to 5. We’ll focus on a subset of this data containing 100,000 reviews and start by examining a couple of examples.

A data analyst could use n-grams to extract the phrases in this dataset that appear regularly in positive reviews but rarely in negative ones or vice versa. Comparing 1- and 5-star reviews in our sample of 100,000 reviews, the most interesting 2-grams for our dataset would then be:

N-grams can also be used as inputs to more sophisticated models, like logistic regressors or decision trees.

Although n-gram featurization is far more efficient than almost every other NLP tool, computing n-grams efficiently for large datasets is still a massive challenge. As `n` increases, the number of possible n-grams grows exponentially as `V ^{n}`, where

The number of unique words (1-grams), as seen in the first bar in the plot above, pales in comparison to the number of higher-order n-grams. While most applications only require n-grams of length up to around 3 or 5 at most, even these cases result in an impractical number of features. For instance, using up to 3-grams or 5-grams instead of only word count features would correspond to an increase in feature dimensionality by around 20x and 60x respectively for our example above. Although the dimensionality can be reduced using embeddings, this defeats the purpose of using n-grams to improve the interpretability of our analysis.

The simplest solution to this problem is to observe that the vast majority of n-grams occur very infrequently and can therefore be discarded without significant impact on our model or analysis. In fact, many of our analyses only require the use of more common features. As shown below, the distribution of words and n-grams is extremely skewed, as is common with text data. **A full 67% of all words that appear in the Amazon Review dataset appear only a single time.
**

Below, we plot the proportion of n-grams that appear less than `k` times for varying `n` and `k`.

To resolve this issue, we usually discard all n-grams that occur fewer than 10 times in the dataset. For n-gram features up to length 5, this reduces the number of n-gram features we have to work with from around 21 million to 160 thousand, a **99.3% decrease**.

While removing rare n-grams helps, we still have a computational bottleneck. In our 100,000 review example above, counting the n-grams takes 2.9 seconds and requires over a gigabyte of memory on an AWS instance using a single CPU and an optimized Rust implementation. While 3 seconds isn’t bad, the runtime increases to 213 seconds for a 1 million review dataset, revealing the limitations of our approach when dealing with large datasets.

Although computation of n-grams can be distributed over multiple machines using MapReduce or Spark, this is not always a feasible, nor convenient, solution. Especially for the kinds of low-latency, interactive analytics we perform at Sisu, the startup of a single Spark executor often exceeds our response service-level agreements (SLAs) with our end-users.

Even if we filter out rare n-grams, it can be expensive in terms of runtime and memory to determine whether a given n-gram is actually rare. Fortunately, there’s a simple approach to doing so: looking at the occurrence counts of each n-gram’s *parents*.

Every n-gram of length `n > 1` has exactly two parent n-grams of length `n – 1` which can be extracted by removing the first or last word of the original n-gram. For the example reviews presented in the beginning, the first 4-gram of the first review is *“full of intrigue and,”* which has parent 3-grams *“full of intrigue”* and *“of intrigue and.”*

Because every n-gram of length 2 or more contains both of its parents, the parents are guaranteed to occur at least the same number of times as the child n-gram (similar to the Apriori algorithm). If we first compute the parents’ occurrence counts and observe that an n-gram has a parent with fewer than `k` occurrences, we can know that the child n-gram will occur fewer than `k` times as well. We can use this observation to improve the computational efficiency of our n-gram counting algorithm. We start by counting the occurrences of all 1-grams, then 2-grams, etc., and for each n-gram of length 2 or more, if either of its parents occurred fewer than `k` times, we can skip over it. A formalized version of this is presented in the pseudocode below.

for n = 1, ..., max_ngram_length: S_n <- {} # Initialize empty dict for storing length-n n-gram counts. for n_gram of length n in text_dataset: if n > 1 and (parent_1 not in S_{n-1} or parent_2 not in S_{n-1}: continue S_n[n_gram] += 1 for n_gram, count in S_n: if count < threshold: remove n_gram from S_n return union_{n = 1,... max_ngram_length} S_n

In this example, we need the second loop because the parent-checking filter only removes n-grams for which some parent has already been filtered, which does not capture cases where both of an n-gram’s parents may have sufficient counts but the n-gram itself does not.

We add the filtering heuristic based on parent counts and measure the time it takes to count the occurrences of n-grams for our Amazon Reviews example.

Next, we take a look at the maximum memory usage for our counting procedure for varying n. Notice that the memory use for the unoptimized procedure is less than the gigabyte described earlier because we now filter out n-grams with fewer than 10 occurrences.

With these optimizations, we are able to achieve a 4x speedup and 2x memory savings. This efficiency is even larger as we process longer n-grams.

Traditionally, counting the number of occurrences of n-grams in large text datasets is a computationally expensive procedure due to the large number of n-grams with very few occurrences. These rare n-grams correspond to a disproportionately large fraction of all n-grams, and are often not as useful as common n-grams due to how infrequently they are observed.

Filtering out these rare n-grams not only reduces the number of features we have to work with but also allows us to optimize the computation of the n-gram counts, enabling us to utilize n-grams for even larger datasets.

*If you like to solve these kinds of practical ML problems, join our team. *