
ScaNN | August 3, 2020

By Schuyler Fried - August 3, 2020

This week, we take a look at ScaNN (Scalable Nearest Neighbors), a method for efficient vector similarity search at scale.

Materials

Why ScaNN?

  • Maximum inner product search (MIPS) is useful for recommendation systems, label prediction, gradient computation, and many other applications (a brute-force baseline is sketched after this list)
  • Traditional quantization uses a reconstruction loss, which picks codewords based on how well the quantized embedding reconstructs the original vector
  • ScaNN uses a loss function that picks codewords based on how well the quantized embedding reconstructs the parallel component of the vector (the part of the residual that most affects inner-product scores), and achieves better recall, accuracy, and queries/second than existing state-of-the-art algorithms.
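
For concreteness, here is a minimal brute-force MIPS baseline in NumPy; this is a sketch for illustration, not the ScaNN library API. Exact search costs O(nd) per query over n database vectors of dimension d, which is exactly the cost that quantization-based methods like ScaNN cut down.

```python
import numpy as np

def brute_force_mips(X: np.ndarray, q: np.ndarray, k: int = 10) -> np.ndarray:
    """Return indices of the k database rows with the largest <q, x_i>."""
    scores = X @ q                  # inner product of q with every row of X
    return np.argsort(-scores)[:k]  # top-k indices by descending score

# Usage: 100k random 128-d database vectors and a single query.
rng = np.random.default_rng(0)
X = rng.normal(size=(100_000, 128))
q = rng.normal(size=128)
print(brute_force_mips(X, q, k=5))
```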

Nuggets

  • A score-aware reconstruction loss, $\ell(x_i, \tilde{x}_i, w) = \mathbb{E}_{q \in Q}\left[w(\langle q, x_i \rangle)\,\langle q, x_i - \tilde{x}_i \rangle^2\right]$, where the weight function $w$ allows higher weighting of loss terms with a larger inner product $\langle q, x_i \rangle$ between the query and the datapoint (see the first sketch after this list).
  • Product quantization (splitting embeddings into $m$ partitions and quantizing each separately) is a useful technique to reduce the number of codewords needed to store an embedding from $k$ to $k^{1/m}$, since the $m$ codebooks combine into $k$ distinct reconstructions. See the product quantization paper, and the second sketch after this list, for more.
  • ScaNN can be implemented using k-means clustering with the custom loss function detailed above, the same machinery as other product quantization techniques.
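
The first sketch below shows the score-aware loss in its anisotropic form from the paper: under a spherical query distribution, the expected loss decomposes into the residual components parallel and orthogonal to the datapoint, with the parallel component weighted more heavily. The weight `eta` here is a hypothetical knob, not a value from the paper; `assign_codewords` shows how the loss slots into the assignment step of a Lloyd-style k-means iteration, per the last bullet.

```python
import numpy as np

def anisotropic_loss(x: np.ndarray, x_tilde: np.ndarray, eta: float = 4.0) -> float:
    """Score-aware loss: penalize the residual component parallel to x more."""
    r = x - x_tilde                # quantization residual
    x_hat = x / np.linalg.norm(x)  # unit vector along the datapoint
    r_par = (r @ x_hat) * x_hat    # residual component parallel to x
    r_orth = r - r_par             # remaining orthogonal component
    return eta * (r_par @ r_par) + (r_orth @ r_orth)

def assign_codewords(X: np.ndarray, codebook: np.ndarray, eta: float = 4.0) -> np.ndarray:
    """Assignment step of a Lloyd-style iteration under the custom loss."""
    return np.array([
        min(range(len(codebook)), key=lambda j: anisotropic_loss(x, codebook[j], eta))
        for x in X
    ])
```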
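The second sketch is standard product quantization with a plain reconstruction loss, using SciPy's k-means for the per-subspace codebooks. With $m$ codebooks of $c$ codewords each, the code space covers $c^m$ distinct reconstructions, which is why matching $k$ total codewords only requires $c = k^{1/m}$ per codebook; swapping the per-subspace objective for the anisotropic loss above is, roughly, the modification ScaNN makes.

```python
import numpy as np
from scipy.cluster.vq import kmeans2

def pq_train(X: np.ndarray, m: int, c: int) -> list:
    """Train m codebooks of c codewords each, one per subspace of X."""
    d = X.shape[1]
    assert d % m == 0, "dimension must split evenly into m partitions"
    sub = d // m
    codebooks = []
    for i in range(m):
        block = X[:, i * sub:(i + 1) * sub]  # one subspace of the data
        centroids, _ = kmeans2(block, c, minit="points", seed=0)
        codebooks.append(centroids)
    return codebooks

def pq_encode(x: np.ndarray, codebooks: list) -> np.ndarray:
    """Encode a vector as m small indices, one codeword per subspace."""
    sub = len(x) // len(codebooks)
    return np.array([
        np.argmin(np.linalg.norm(cb - x[i * sub:(i + 1) * sub], axis=1))
        for i, cb in enumerate(codebooks)
    ])

def pq_decode(codes: np.ndarray, codebooks: list) -> np.ndarray:
    """Reconstruct by concatenating the selected codeword from each subspace."""
    return np.concatenate([cb[j] for cb, j in zip(codebooks, codes)])
```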

Raw Notes:

2020-08-03 Score-Aware Loss Functions for Vector Quantization.pdf

If you like applying these kinds of methods to practical ML problems, join our team.

