Wondering what algorithms look like for implementing a complex JOIN query handler. Taking a look at this paper Building Query Compilers they talk about Query Graphs and Join Trees and Cost Functions for determining how well join trees are when the engine figures out how to organize the steps in the query.
What I'm particularly looking for is the part of how they segment the input query into the query/join tree. Specifically, this relates to scheduling and queries. I understand how the cost functions work. I just don't understand how you take a relational algebra query tree and break them apart into distinct chunks of operations sequenced together. They usually just say "we have a query x and here is the corresponding tree":
without a description of how the tree was obtained.
I'm confused as to why they would break apart conditional statements, and how they know where to break them apart, and how to order them (when do heuristics apply, etc.)
What follows is a brief overview of the algorithm, showing where I am confused.
Given a query tree as input, we first convert it to CNF. Then we construct a query graph, which is a graph over the WHERE + FROM portions of the query. This graph is only used for determining if the query is correct or not in some cases. Otherwise we can skip the query graph. Then we construct the JOIN tree by x... This is what I'm wondering. In the following algorithm it would be the last part.
\begin{align} & \texttt{Input}\ Q: \texttt{QueryTree} \\ & \texttt{Output}\ Q''': \texttt{JoinTree} \\ & \texttt{Begin} \\ & X : Q \rightarrow Q'\ //\ convert\ to\ CNF \\ & G : Q' \rightarrow Q''\ //\ convert\ to\ query\ graph \\ & Validate\ Q'' \\ & J : Q'' \rightarrow Q'''\ //\ convert\ to\ join\ tree (confused here) \\ & Return\ Q''' \\ & \texttt{End} \end{align}
The question is, how a database query engine goes through and figures out the order of operations.
