By Dan Burkert - February 5, 2019
Inferring the schema — or organizational structure — of a SQL query is key to understanding what data is contained in a result set. The query schema is how we obtain names of columns, headers in dataframes, and metadata for result export. At Sisu, fast schema inference is a critical first step in helping our users put their data to work.
While schema inference is easy for simple tables (e.g., using SQL’s DESCRIBE), schema inference can be prohibitively expensive for complex queries including joins, subqueries, and derived tables. In this post, we’ll show you a simple trick we’ve used to improve schema inference performance by over 100x in Redshift. The principles here should translate to most relational databases, too.
The simplest way to infer the schema of a query is to execute the query and inspect the results. Every library for interacting with a database has some way to inspect the metadata of a result; for example, JDBC has a ResultSet.getMetaData method that returns the schema of the table.
Unfortunately, this simple strategy incurs a large cost: the database must fully evaluate the query and return the result, which is subsequently thrown away. As a result, the performance of this approach to schema inference depends on both the complexity of the query and how much data the query processes. We found firsthand that this approach doesn’t scale to large datasets, so we set out to find a better solution.
At this point, we looked for a way to trim the data returned to the client without affecting the metadata. The JDBC API provides a convenient method that limits the rows included in the result set:
Unfortunately, setting the maximum number of rows to 0 via the JDBC API’s
setMaxRows parameter has a negligible effect on performance. It turns out that the
setMaxRows option is only a hint in the Redshift JDBC driver library and has no effect on the amount of work the database performs or the amount of data passed back to the client. As a result, this option is just as slow as our simple baseline.
Stymied by slow schema inference, our team received a well-timed hint from a Sisu user with a lot of Redshift expertise. The idea is simple: wrap the query in a filter or limit clause that will force it to return 0 rows:
SELECT * FROM (<query>) WHERE true = false; -- or SELECT * FROM (<query>) WHERE true <> true; -- or SELECT * FROM (<query>) LIMIT 0;
This results in a huge speed increase:
What causes this massive performance improvement? Each additional clause guarantees an empty result. One consequence of this empty result set is that this optimization cuts down on the amount of data which needs to be transmitted. However, these clauses also enable the database to perform additional optimizations.
Specifically, Redshift is able to push the outer filter (or limit) into the subquery (i.e., the query being analyzed), which in turn enables execution to short-circuit and return an empty result set without ever paying the cost of query evaluation!
Of course, there is a caveat: the effectiveness of this optimization across databases is subject to the internals of the database query optimizer. Ensuring that the optimizer behaves correctly across all databases and versions is fragile.
We found that Redshift will optimize seemingly similar filters differently. For example, with Redshift the
true = false filter appears to do more work than the other examples. Whereas
true <> true and
LIMIT 0 are short-circuited before even being added to the query log (the
SVL_QLOG system table), the
true = false query is not:
user@cluster:db> SELECT * FROM (SELECT col1 FROM tbl1) WHERE true = false; ... user@cluster:db> SELECT * FROM (SELECT col1 FROM tbl1) WHERE true <> true; ... user@cluster:db> SELECT * FROM (SELECT col1 FROM tbl1) LIMIT 0; ... user@cluster:db> SELECT query, elapsed, substring FROM SVL_QLOG WHERE userid <> 1 AND starttime > (GETDATE() - INTERVAL '10 seconds') ORDER BY query DESC; +---------+-----------+-----------------------+ | query | elapsed | substring |