Quantcast
Viewing all articles
Browse latest Browse all 3353

Re: Does Hana use nested loop Join on two column tables?

The question here is

 

1. What is the join method used? we expect nested loop to be used

2. why do we see 10,000 as output instead of 1 for DEPT table.

3. Can we control Join method and Join order?

 

Let's start with the second question: the reason for the 10.000 records mentioned in the explain plan is that this is the default estimate for unfiltered table access.

Since SAP HANA creates the final execution plan only at execution time, the EXPLAIN PLAN can only provide a rough layout of it. The numbers you see in it are not precise by any means.

 

The first question requires to ask back: why would you or your team expect a nested loop join?

This would be a reasonable assumption for a row-oriented database that uses cached pages to access table data. In such a scenario the simple join of employee to department can be performed by using a nested loop inner join implementation rather efficiently - especially when efficiency is determined by the amount of computational work and not by response time.

This is to say: this way of thinking (less total computational work = better) is the single-threaded, most data on disk way of thinking.

 

With columnar in-memory databases this approach doesn't get you all that far; to yield fundamentally better performance, a sequential row-by-row model like nested loop joins don't leverage the columnar compression, the SIMD instruction set, CPU cache line optimised block processing or the available hundreds of processing threads.

 

Implementing a join via nested loops on a columnar structure is actually rather inefficient.

To draw a parallel here: imagine what work would be required on a DBMS like Oracle, if the nested loop join would have to use BITMAP indexes.

 

So, what join algorithm does SAP HANA use now? For most cases we"re looking at flavours of HASH JOINS (and yes, the discussion around HASH vs. SORT MERGE execution speed is well known also within SAP).

For your example, SAP HANA would look up the valueID for 'john' in the employee table and find all rowIDs for that reference to this value. This is either done via a full scan of the column or via the inverted index that you've created.

With now the DEPT_NO column of the EMP table gets 'joined' in, so that we now have all DEPT_NO values necessary for the join matching.

Since the values for DEPT_NO are not necessarily encoded with the same valueIDs in both tables, a mapping has to be created. SAP HANA now creates a so-called translation table that allows to map valueIDs from two tables for joins.

Now we have the valueIDs we need to look out for in the DEPT.DEPNO column. Again, this is done either via a full column scan (with SIMD support - extremely fast) or via the inverted index.

 

As an outcome we now know with rowIDs on both sides of the join will belong to the result set.

Finally the result set rows are constructed in parallel by resolving the valueIDs from all the columns that are part of the final projection list to actual values and returning them to the client - this is the late materialization step you see in the EXPLAIN PLAN.

 

How is this now more efficient? During the whole execution we always worked with valueIDs and rowIDs not with actual table content. This means a lot less data needed to be shuffled around. No effort was required to 'peel out' the required values from the row oriented data structures.

Certain steps can be done in parallel like looking up rowIDs for a given valueID.

 

You probably realise by now that the actual join implementation - even just the pseudo code description - is more complex than the nested loop join (btw: the openHPI courses by Mr. Plattner and also his book on in-memory databases explain all that in a decent, practical way. Good options to learn up on these points).

 

And this leads me to the answer for the last question:

Yes, there are ways to influence the query processing. As of SPS 10, the SAP HANA documentation lists HINTS for that (SELECT - SAP HANA SQL and System Views Reference - SAP Library).

 

However, I strongly oppose to using hints in any production code.

First of all, as you've just seen for this really simple example, it's unlikely that there is a correct understanding of the default execution.

 

Even if the actual execution is well understood, the alternative options would need to be understood equally well to be able to decide which one would be better. Due to the high speed of change in SAP HANA development such know-how is usually only available to the very core HANA developers who write the code for it.

 

And even then, still, I would not recommend using those hints in production code, since they always introduce a technology-dept. With every change of data, data model or SAP HANA revision it has to be checked whether the hint is still necessary and if it still yields what it supposed to do in the first place.

That has to be done for all hints that have been introduced anywhere in your coding.

This is a task virtually unmanageable if not tightly controlled.

 

Oh well, I got side-tracked here topic wise... however I hope you got your three questions answered nevertheless.


Viewing all articles
Browse latest Browse all 3353

Trending Articles



<script src="https://jsc.adskeeper.com/r/s/rssing.com.1596347.js" async> </script>