Programming Efficient Joins

Fritz Henglein

No content found (yet?).

A relational join is a function that combines data from multiple sources and joins them by programmer-defined attributes they share. They are everywhere data records are stored and processed, also outside database systems. Common examples include joining customers and orders; connecting genes, mutations, and experiments; and combining songs, artists and plays. Joins have a reputation of being tricky to program and difficult to implement efficiently. For example, computing the set of triangles (x, y, z) where x likes y, y likes z and z likes x, cannot be implemented efficiently using techniques employed in standard SQL query engines such as binary hash joins and any form of query plan optimization.

In this talk we show that joins are simple to program to be guaranteed worst-case optimal, even for cyclic queries like triangles. They require only basic data structures and programming techniques: Nested dictionaries, iterating over the smallest set, and nested iteration. We illustrate this by 20-line Python and Haskell programs for triangles, which can execute orders of magnitude faster than common SQL engines. We give a useful intuition underlying the method's worst-case optimality proof and provide additional techniques for rather straightforwardly coded optimizations.

Licensed to the public under https://creativecommons.org/licenses/by/3.0/de

Download

Embed

Share:

Tags