ML Systems

# ScaNN | August 3, 2020

By Schuyler Fried - August 3, 2020

This week, we take a look at ScaNN (Scalable Nearest Neighbors), a method for efficient vector similarity search at scale.

## Materials

## Why ScaNN?

• Maximum inner product search (MIPS) is useful for recommendation systems, label prediction, gradient computation, and many other tasks
• Traditional quantization uses a reconstruction loss, which picks codewords based on how well the quantized embedding reconstructs the original vector
• ScaNN uses a loss function that picks codewords based on how well the quantized embedding reconstructs the component of the vector parallel to the query direction, and achieves better recall, accuracy, and queries/second than existing state-of-the-art algorithms.
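The parallel-component idea can be sketched numerically. The following is a toy NumPy illustration of mine (not ScaNN's implementation): the quantization residual $x - \tilde{x}$ decomposes exactly into a part parallel to $x$ and a part orthogonal to it, and it is the parallel part that ScaNN's loss penalizes more heavily.

```python
import numpy as np

# Toy illustration (assumed setup, not the ScaNN implementation):
# decompose the quantization residual r = x - x_tilde into a component
# parallel to the datapoint x and a component orthogonal to it.
rng = np.random.default_rng(0)
x = rng.normal(size=8)                    # a datapoint
x_tilde = x + 0.1 * rng.normal(size=8)    # some quantized approximation of x

r = x - x_tilde                           # quantization residual
r_parallel = (r @ x) / (x @ x) * x        # projection of r onto x
r_orthogonal = r - r_parallel

# The decomposition is exact and the two parts are orthogonal:
assert np.allclose(r_parallel + r_orthogonal, r)
assert abs(r_parallel @ r_orthogonal) < 1e-10
```

For a query $q$ pointing in roughly the same direction as $x$ (the high-score case that matters for MIPS), the inner-product error $\langle q, r \rangle$ is dominated by $r_{\parallel}$, which is why weighting it more heavily in the loss improves retrieval of top-scoring items.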

## Nuggets

• A score-aware loss function $\ell(x_i, \tilde{x}_i, w) = \mathbb{E}_{q \sim Q} \left[ w(\langle q, x_i \rangle) \langle q, x_i - \tilde{x}_i \rangle^2 \right]$ with weight function $w$ allows higher weighting of loss terms where the inner product between the query $q$ and the datapoint $x_i$ is large.
• Product quantization (splitting embeddings into $m$ partitions and quantizing each separately) is a useful technique that reduces the number of codewords needed per partition from $k$ to $k^{1/m}$ while still representing $k$ distinct quantized vectors. See the paper for more.
• ScaNN can be implemented using k-means-style clustering with the custom loss function detailed above, in the same way as other product quantization techniques.
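A minimal product-quantization sketch in NumPy may make the codeword counting concrete. This is my own toy version using plain squared-error k-means (Lloyd iterations), not the score-aware variant; dimensions and codebook sizes are arbitrary choices.

```python
import numpy as np

# Toy product quantization sketch (assumed parameters, not ScaNN's):
# split each d-dim vector into m chunks and quantize each chunk with its
# own small codebook of k_sub centers learned by a few Lloyd iterations.
rng = np.random.default_rng(1)
d, m, k_sub = 16, 4, 8
X = rng.normal(size=(1000, d))

chunks = np.split(X, m, axis=1)           # m arrays of shape (n, d/m)
codebooks = []
for chunk in chunks:
    centers = chunk[rng.choice(len(chunk), k_sub, replace=False)]
    for _ in range(5):                    # a few Lloyd (k-means) steps
        dists = ((chunk[:, None, :] - centers[None]) ** 2).sum(-1)
        assign = dists.argmin(axis=1)
        for j in range(k_sub):
            if (assign == j).any():
                centers[j] = chunk[assign == j].mean(axis=0)
    codebooks.append(centers)

# Each vector is stored as m small codes: m * k_sub = 32 stored codewords
# cover k_sub ** m = 4096 distinct quantized vectors.
codes = np.stack(
    [((c[:, None, :] - cb[None]) ** 2).sum(-1).argmin(axis=1)
     for c, cb in zip(chunks, codebooks)],
    axis=1,
)
```

Swapping the squared-error assignment step for the score-aware loss above is, at a sketch level, what distinguishes ScaNN's anisotropic quantization from standard product quantization.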

## Raw Notes

2020-08-03 Score-Aware Loss Functions for Vector Quantization.pdf

If you like applying these kinds of methods to practical ML problems, join our team.