Conditional joins: Part 2
Posted by anotherdavewebster on December 30, 2010
There’s nothing like the Christmas break to slow things down at work a little! This has given me a little time to finally write up Conditional Joins Part 2. Part 2 is actually “Part 1″ from a chronological perspective but I’ve written them the other way round simply because Part 1 is a little more general whereas Part 2, as useful as it is, solves a very specific problem.
So the real world problem is as follows. Think of a large set of orders which are placed over many months but “settled” at one point in time. My driving table contains the settled orders but I now need to go back to the record of each order being placed to retrieve some data from that record. The placement table is partitioned by placement date (a_date) and has a local non unique index on the ID (an_id). Both the placement date and the ID exist on the settled (driving) table. The vast majority of the settled orders were placed in the last couple of days but the “tail” stretches back many months. I was asked to look at the problem which basically came down to a choice between a hash join and a nested loop join from the settled table to the placement table. Neither approach really suited the problem due to the long tail. I explained the dilemma to the client to which he replied “isn’t it a shame that Oracle can’t choose to hash join to the recent partitions and nested loop to the older ones”. Bingo. That’s exactly what we want and we can do it using a conditional join.
If you read my original post, you probably don’t need the detail but here’s a test case to demonstrate the approach. We start with a driving table and a lookup table:
create table driving_table
(
a_date date not null,
an_id number not null
);
create table lookup_partitioned_table
(
a_date date not null,
an_id number not null,
text_column varchar2(1000) not null
)
partition by range (a_date)
(
partition TOP values less than (TO_DATE(' 9999-12-31 00:00:00', 'SYYYY-MM-DD HH24:MI:SS', 'NLS_CALENDAR=GREGORIAN'))
);
On the lookup table, I split off partitions for each day for best part of three months and fill each day with 864000 records. I also populate the driving table with 90% of the data for the last two days but only 0.01% of the data for previous days. This is a rather crude but effective simulation of the “tail”.
DECLARE
l_start_date DATE := TO_DATE( '01-JAN-2010', 'DD-MON-YYYY' );
l_end_date DATE := TO_DATE( '29-MAR-2010', 'DD-MON-YYYY' );
BEGIN
WHILE l_start_date < l_end_date LOOP
-- Create a new partition by spliting a slice off the TOP partition.
EXECUTE IMMEDIATE 'alter table lookup_partitioned_table split partition top at (to_date('''
|| TO_CHAR( l_start_date + 1, 'DDMMYYYY' )
|| ''',''DDMMYYYY'')) into (partition dy'
|| TO_CHAR( l_start_date, 'DDMMYYYY' )
|| ' , partition top )';
-- Write some data into the partition
INSERT INTO lookup_partitioned_table
SELECT l_start_date +( ( ROWNUM - 1 ) / 864000 )
, ( ( l_start_date - TO_DATE( '01-JAN-2010', 'DD-MON-YYYY' ) ) * 86400 )
+ ( ROWNUM - 1 )
, LPAD( 'X', 1000, 'X' )
FROM DUAL
CONNECT BY ROWNUM <= 864000;
-- If we are within 2 days of the end date, put 90% of the rows into the driving table. If not, put 0.01% in
INSERT INTO driving_table
SELECT a_date, an_id
FROM lookup_partitioned_table
WHERE a_date >= l_start_date
AND a_date < l_start_date + 1
AND rownum <= (864000 * case when l_start_date < (l_end_date - 2) then 0.0001 else 0.9 end);
COMMIT;
l_start_date := l_start_date + 1;
END LOOP;
END;
and finally a local index on the lookup table:
CREATE INDEX lookup_id ON lookup_partitioned_table (an_id) LOCAL;
The query I actually want to run is:
select drv.a_date, drv.an_id, lpt.text_column from driving_table drv, lookup_partitioned_table lpt where drv.a_date = lpt.a_date and drv.an_id = lpt.an_id;
Without playing with the code, I have the option of either a hash join or a nested loop:
DECLARE
CURSOR c_test
IS
SELECT /*+ USE_HASH (DRV LPT) */
drv.a_date
, drv.an_id
, lpt.text_column
FROM driving_table drv, lookup_partitioned_table lpt
WHERE drv.a_date = lpt.a_date AND drv.an_id = lpt.an_id;
BEGIN
FOR rec IN c_test LOOP
NULL;
END LOOP;
END;
--------------------------------------------------------------------------
| Id | Operation | Name | E-Rows |
--------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | |
|* 1 | HASH JOIN | | 84M|
| 2 | PART JOIN FILTER CREATE | :BF0000 | 108M|
| 3 | TABLE ACCESS FULL | DRIVING_TABLE | 108M|
| 4 | PARTITION RANGE JOIN-FILTER| | 84M|
| 5 | TABLE ACCESS FULL | LOOKUP_PARTITIONED_TABLE | 84M|
--------------------------------------------------------------------------
Nested Loop gives me:
declare
DECLARE
CURSOR c_test
IS
SELECT /*+ USE_NL (DRV LPT) */
drv.a_date
, drv.an_id
, lpt.text_column
FROM driving_table drv, lookup_partitioned_table lpt
WHERE drv.a_date = lpt.a_date AND drv.an_id = lpt.an_id;
BEGIN
FOR rec IN c_test LOOP
NULL;
END LOOP;
END;
Finally, there’s the conditional joining version which hash joins to the last two partitions (28 and 29 March) but nested loops to the remaining ones:
DECLARE
CURSOR c_test
IS
SELECT /*+ USE_NL (LOOKUP_NL) INDEX (LOOKUP_NL LOOKUP_ID) */
hash_join_res.a_date
, hash_join_res.an_id
, DECODE( hash_join_res.lookup_id
, NULL, lookup_nl.text_column
, hash_join_res.text_column
)
FROM ( SELECT /*+ ORDERED USE_HASH (DRV LOOKUP_HASH) */
drv.a_date
, drv.an_id
, lookup_hash.text_column
, lookup_hash.an_id lookup_id
FROM driving_table drv, lookup_partitioned_table lookup_hash
WHERE drv.a_date = lookup_hash.a_date(+)
AND drv.an_id = lookup_hash.an_id(+)
AND lookup_hash.a_date(+) >= '28-MAR-2010'
AND lookup_hash.a_date(+) < '30-MAR-2010' ) hash_join_res
, lookup_partitioned_table lookup_nl
WHERE DECODE( hash_join_res.lookup_id, NULL, hash_join_res.a_date, NULL ) = lookup_nl.a_date(+)
AND DECODE( hash_join_res.lookup_id, NULL, hash_join_res.an_id, NULL ) = lookup_nl.an_id(+);
BEGIN
FOR rec IN c_test LOOP
NULL;
END LOOP;
END;
---------------------------------------------------------------------------------
| Id | Operation | Name | E-Rows |
---------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | |
| 1 | NESTED LOOPS OUTER | | 9958G|
| 2 | VIEW | | 108M|
|* 3 | HASH JOIN OUTER | | 108M|
| 4 | TABLE ACCESS FULL | DRIVING_TABLE | 108M|
| 5 | PARTITION RANGE ITERATOR | | 1711K|
| 6 | TABLE ACCESS FULL | LOOKUP_PARTITIONED_TABLE | 1711K|
| 7 | PARTITION RANGE ITERATOR | | 91532 |
|* 8 | TABLE ACCESS BY LOCAL INDEX ROWID| LOOKUP_PARTITIONED_TABLE | 91532 |
|* 9 | INDEX RANGE SCAN | LOOKUP_ID | 3661K|
---------------------------------------------------------------------------------
Wanna see some stats? Go on then:
CPU used by physical reads session logical
this session DB time direct* reads
Hash Join 37288 116633 10888944 13674686
Nested Loop 9529 26475 4108 1119849
Conditional 3449 3994 247476 320629
*Nigel will be delighted to see the serial direct reads in effect on the database I’ve borrowed from him!
Now obviously the results will vary depending on the shape of the “tail” and you may need to programmatically vary the partitions to which you hash join to but I do believe that the pattern holds in general. Note that the shape of the tail in our live data varies with the type of market but it tapers in a way that my test case does not. It’s possible to get too carried away with this and spend more time working out where to introduce the break between Hash Join and Nested Loop but we’ve plumped for two days and get a couple of orders of magnitude reduction in run time.
