Efficient Featurization of Common N-grams via Dynamic Programming

By John Hallman - January 7, 2021

At Sisu, we regularly analyze internet-scale text and other unstructured datasets, such as customer reviews, item descriptions, and transaction details. The size and complexity of these datasets make many common natural language processing (NLP) techniques prohibitively slow for interactive data analysis. N-gram featurization, one of the simplest tools in NLP, has nonetheless proven invaluable to us because its features are interpretable and cheap to compute.

While n-grams are most commonly used as features in classification and regression tasks, Sisu’s use case is slightly different. We use n-grams alongside other dataset-specific and derived features to analyze changes in customer KPIs, based on statistical properties such as increases or decreases in the prevalence of specific n-grams across text datasets. These signals make it easier to connect our customers’ KPIs to the content of their unstructured data while keeping the results interpretable.
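To make the idea of a prevalence shift tangible, here is a minimal Python sketch. It is not Sisu's implementation: the function names (`word_ngrams`, `prevalence`, `prevalence_shift`), the whitespace tokenization, and the toy review strings are all assumptions chosen for illustration. Prevalence is taken here to mean the fraction of documents that contain an n-gram at least once.

```python
from collections import Counter


def word_ngrams(text, n):
    """Distinct word n-grams in `text` (lowercased, split on whitespace)."""
    tokens = text.lower().split()
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def prevalence(docs, n):
    """Fraction of documents in `docs` containing each n-gram at least once."""
    counts = Counter()
    for doc in docs:
        # Each document contributes at most 1 per n-gram, since we add a set.
        counts.update(word_ngrams(doc, n))
    return {gram: c / len(docs) for gram, c in counts.items()}


def prevalence_shift(before, after, n):
    """Prevalence in `after` minus prevalence in `before`, per n-gram."""
    p_before = prevalence(before, n)
    p_after = prevalence(after, n)
    grams = set(p_before) | set(p_after)
    return {g: p_after.get(g, 0.0) - p_before.get(g, 0.0) for g in grams}


# Hypothetical toy data: two small batches of review text.
before = ["fast shipping great price", "great price but slow shipping"]
after = ["slow shipping again", "slow shipping and damaged box"]

shifts = prevalence_shift(before, after, n=2)

# Print the bigrams whose prevalence moved the most in either direction.
for gram, delta in sorted(shifts.items(), key=lambda kv: -abs(kv[1]))[:3]:
    print(f"{gram!r}: {delta:+.2f}")
```

In this toy data, "slow shipping" appears in half of the earlier reviews and all of the later ones, while "great price" disappears entirely. Shifts like these stay readable because each one is tied to a phrase a person can inspect directly.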

For a more concrete example, let’s take a look at the Amazon Reviews dataset, a large collection of product reviews posted on Amazon, each with a rating from 1 to 5. We’ll focus on a subset of this data containing 100,000 reviews and start by examining a couple of examples.