Engineering

Lightning-fast Schema Inference in Redshift

By Dan Burkert - February 5, 2019


Inferring the schema — or organizational structure — of a SQL query is key to understanding what data is contained in a result set. The query schema is how we obtain names of columns, headers in dataframes, and metadata for result export. At Sisu, fast schema inference is a critical first step in helping our users put their data to work.

While schema inference is easy for simple tables (e.g., using SQL’s DESCRIBE), schema inference can be prohibitively expensive for complex queries including joins, subqueries, and derived tables. In this post, we’ll show you a simple trick we’ve used to improve schema inference performance by over 100x in Redshift. The principles here should translate to most relational databases, too.

KISS?

The simplest way to infer the schema of a query is to execute the query and inspect the results. Every library for interacting with a database has some way to inspect the metadata of a result; for example, JDBC has a ResultSet.getMetaData method that returns the schema of the result set.
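
As a concrete sketch of this execute-and-inspect approach (the connection URL, credentials, and query below are placeholders, not Sisu's actual setup, and a Redshift JDBC driver is assumed to be on the classpath):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.ResultSetMetaData;
import java.sql.SQLException;
import java.sql.Statement;

public class ExecuteQueryInference {
    public static void main(String[] args) throws SQLException {
        // Placeholder connection details; substitute your own cluster and credentials.
        String url = "jdbc:redshift://example-cluster:5439/dev";
        String query = "SELECT col1 FROM tbl1";

        try (Connection conn = DriverManager.getConnection(url, "user", "password");
             Statement stmt = conn.createStatement();
             // The database fully evaluates the query just to produce this metadata.
             ResultSet rs = stmt.executeQuery(query)) {
            ResultSetMetaData meta = rs.getMetaData();
            for (int i = 1; i <= meta.getColumnCount(); i++) {
                System.out.println(meta.getColumnName(i) + " " + meta.getColumnTypeName(i));
            }
        }
    }
}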

inference method    runtime
execute-query       11.2s

Unfortunately, this simple strategy incurs a large cost: the database must fully evaluate the query and return the result, which is subsequently thrown away. As a result, the performance of this approach to schema inference depends on both the complexity of the query and how much data the query processes. We found firsthand that this approach doesn’t scale to large datasets, so we set out to find a better solution.

Who said the database should actually use query options?

At this point, we looked for a way to trim the data returned to the client without affecting the metadata. The JDBC API provides a convenient method that limits the rows included in the result set: Statement.setMaxRows.

inference method    runtime
max-rows            11.3s

Unfortunately, setting the maximum number of rows to 0 via the JDBC API's setMaxRows method has a negligible effect on performance. It turns out that setMaxRows is only a hint in the Redshift JDBC driver and has no effect on the amount of work the database performs or the amount of data passed back to the client. As a result, this option is just as slow as our simple baseline.

Subqueries to the Rescue!

Stymied by slow schema inference, our team received a well-timed hint from a Sisu user with deep Redshift expertise. The idea is simple: wrap the query in an outer SELECT whose filter or LIMIT clause forces it to return zero rows:

SELECT * FROM (<query>) WHERE true = false;
-- or
SELECT * FROM (<query>) WHERE true <> true;
-- or
SELECT * FROM (<query>) LIMIT 0;

This results in a huge speed increase:

inference method    runtime
true = false        153ms
true <> true        86ms
LIMIT 0             83ms

What causes this massive performance improvement? Each added clause guarantees an empty result set, which immediately reduces the amount of data that must be transmitted back to the client. More importantly, these clauses also enable the database itself to perform additional optimizations.

Specifically, Redshift is able to push the outer filter (or limit) into the subquery (i.e., the query being analyzed), which in turn enables execution to short-circuit and return an empty result set without ever paying the cost of query evaluation!
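
Putting the trick together, a minimal helper might look like the sketch below. The class and method names are our own illustration rather than Sisu's production code, and note that some other databases (e.g., PostgreSQL) would require an alias on the derived table:

import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.ResultSetMetaData;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.ArrayList;
import java.util.List;

public final class SchemaInference {
    /** Returns a "name type" entry for each column of the query's result set. */
    public static List<String> inferColumns(Connection conn, String query) throws SQLException {
        // Wrap the caller's query so the planner can short-circuit evaluation entirely.
        String wrapped = "SELECT * FROM (" + query + ") LIMIT 0";
        List<String> columns = new ArrayList<>();
        try (Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(wrapped)) {
            ResultSetMetaData meta = rs.getMetaData();
            for (int i = 1; i <= meta.getColumnCount(); i++) {
                columns.add(meta.getColumnName(i) + " " + meta.getColumnTypeName(i));
            }
        }
        return columns;
    }
}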

Of course, there is a caveat: the effectiveness of this trick depends on the internals of each database's query optimizer, and relying on the optimizer to short-circuit consistently across databases and versions is fragile.

We found that Redshift optimizes seemingly similar filters differently; for example, the true = false filter appears to do more work than the other two. Whereas the true <> true and LIMIT 0 queries are short-circuited before they are even added to the query log (the SVL_QLOG system table), the true = false query is not:

user@cluster:db> SELECT * FROM (SELECT col1 FROM tbl1) WHERE true = false;
...
user@cluster:db> SELECT * FROM (SELECT col1 FROM tbl1) WHERE true <> true;
...
user@cluster:db> SELECT * FROM (SELECT col1 FROM tbl1) LIMIT 0;
...
user@cluster:db> SELECT query, elapsed, substring
   FROM SVL_QLOG
  WHERE userid <> 1
    AND starttime > (GETDATE() - INTERVAL '10 seconds')
  ORDER BY query DESC;
+---------+-----------+-----------------------+
| query   | elapsed   | substring             |