Dave Webster's Blog

Tinkering with Oracle

Conditional joins: Part 2

Posted by anotherdavewebster on December 30, 2010

There’s nothing like the Christmas break to slow things down at work a little! This has given me a little time to finally write up Conditional Joins Part 2. Part 2 is actually “Part 1″ from a chronological perspective but I’ve written them the other way round simply because Part 1 is a little more general whereas Part 2, as useful as it is, solves a very specific problem.

So the real world problem is as follows. Think of a large set of orders which are placed over many months but “settled” at one point in time. My driving table contains the settled orders but I now need to go back to the record of each order being placed to retrieve some data from that record. The placement table is partitioned by placement date (a_date) and has a local non unique index on the ID (an_id). Both the placement date and the ID exist on the settled (driving) table. The vast majority of the settled orders were placed in the last couple of days but the “tail” stretches back many months. I was asked to look at the problem which basically came down to a choice between a hash join and a nested loop join from the settled table to the placement table. Neither approach really suited the problem due to the long tail. I explained the dilemma to the client to which he replied “isn’t it a shame that Oracle can’t choose to hash join to the recent partitions and nested loop to the older ones”. Bingo. That’s exactly what we want and we can do it using a conditional join.

If you read my original post, you probably don’t need the detail but here’s a test case to demonstrate the approach. We start with a driving table and a lookup table:

create table driving_table
(
  a_date                 date not null,
  an_id                  number not null
);

create table lookup_partitioned_table
(
  a_date                 date not null,
  an_id                  number not null,
  text_column            varchar2(1000) not null
)
partition by range (a_date)
(
partition TOP values less than (TO_DATE(' 9999-12-31 00:00:00', 'SYYYY-MM-DD HH24:MI:SS', 'NLS_CALENDAR=GREGORIAN'))
);

On the lookup table, I split off partitions for each day for best part of three months and fill each day with 864000 records. I also populate the driving table with 90% of the data for the last two days but only 0.01% of the data for previous days. This is a rather crude but effective simulation of the “tail”.

DECLARE
   l_start_date   DATE := TO_DATE( '01-JAN-2010', 'DD-MON-YYYY' );
   l_end_date     DATE := TO_DATE( '29-MAR-2010', 'DD-MON-YYYY' );
BEGIN
   WHILE l_start_date < l_end_date LOOP
      -- Create a new partition by spliting a slice off the TOP partition.
      EXECUTE IMMEDIATE    'alter table lookup_partitioned_table split partition top at (to_date('''
                        || TO_CHAR( l_start_date + 1, 'DDMMYYYY' )
                        || ''',''DDMMYYYY'')) into (partition dy'
                        || TO_CHAR( l_start_date, 'DDMMYYYY' )
                        || ' , partition top )';

      -- Write some data into the partition
      INSERT INTO lookup_partitioned_table
         SELECT     l_start_date +(  ( ROWNUM - 1 ) / 864000 )
                  ,   (  ( l_start_date - TO_DATE( '01-JAN-2010', 'DD-MON-YYYY' ) ) * 86400 )
                    + ( ROWNUM - 1 )
                  , LPAD( 'X', 1000, 'X' )
               FROM DUAL
         CONNECT BY ROWNUM <= 864000;

      -- If we are within 2 days of the end date, put 90% of the rows into the driving table.  If not, put 0.01% in
      INSERT INTO driving_table
      SELECT a_date, an_id
        FROM lookup_partitioned_table
       WHERE a_date >= l_start_date
         AND a_date <  l_start_date + 1
         AND rownum <= (864000 * case when l_start_date < (l_end_date - 2) then 0.0001 else 0.9 end);

      COMMIT;
      l_start_date    := l_start_date + 1;
   END LOOP;
END;

and finally a local index on the lookup table:

CREATE INDEX lookup_id ON lookup_partitioned_table (an_id) LOCAL;

The query I actually want to run is:

select drv.a_date, drv.an_id, lpt.text_column
  from driving_table drv, lookup_partitioned_table lpt
 where drv.a_date = lpt.a_date
   and drv.an_id = lpt.an_id;

Without playing with the code, I have the option of either a hash join or a nested loop:

DECLARE
   CURSOR c_test
   IS
      SELECT /*+ USE_HASH (DRV LPT) */
             drv.a_date
           , drv.an_id
           , lpt.text_column
        FROM driving_table drv, lookup_partitioned_table lpt
       WHERE drv.a_date = lpt.a_date AND drv.an_id = lpt.an_id;
BEGIN
   FOR rec IN c_test LOOP
      NULL;
   END LOOP;
END;


--------------------------------------------------------------------------
| Id  | Operation                    | Name                     | E-Rows |
--------------------------------------------------------------------------
|   0 | SELECT STATEMENT             |                          |        |
|*  1 |  HASH JOIN                   |                          |     84M|
|   2 |   PART JOIN FILTER CREATE    | :BF0000                  |    108M|
|   3 |    TABLE ACCESS FULL         | DRIVING_TABLE            |    108M|
|   4 |   PARTITION RANGE JOIN-FILTER|                          |     84M|
|   5 |    TABLE ACCESS FULL         | LOOKUP_PARTITIONED_TABLE |     84M|
--------------------------------------------------------------------------

Nested Loop gives me:

declare
DECLARE
   CURSOR c_test
   IS
      SELECT /*+ USE_NL (DRV LPT) */
             drv.a_date
           , drv.an_id
           , lpt.text_column
        FROM driving_table drv, lookup_partitioned_table lpt
       WHERE drv.a_date = lpt.a_date AND drv.an_id = lpt.an_id;
BEGIN
   FOR rec IN c_test LOOP
      NULL;
   END LOOP;
END;

Finally, there’s the conditional joining version which hash joins to the last two partitions (28 and 29 March) but nested loops to the remaining ones:

DECLARE
   CURSOR c_test
   IS
      SELECT /*+ USE_NL (LOOKUP_NL) INDEX (LOOKUP_NL LOOKUP_ID) */
             hash_join_res.a_date
           , hash_join_res.an_id
           , DECODE( hash_join_res.lookup_id
                   , NULL, lookup_nl.text_column
                   , hash_join_res.text_column
                   )
        FROM ( SELECT /*+ ORDERED USE_HASH (DRV LOOKUP_HASH) */
                     drv.a_date
                   , drv.an_id
                   , lookup_hash.text_column
                   , lookup_hash.an_id lookup_id
                FROM driving_table drv, lookup_partitioned_table lookup_hash
               WHERE drv.a_date = lookup_hash.a_date(+)
                 AND drv.an_id = lookup_hash.an_id(+)
                 AND lookup_hash.a_date(+) >= '28-MAR-2010'
                 AND lookup_hash.a_date(+) < '30-MAR-2010' ) hash_join_res
           , lookup_partitioned_table lookup_nl
       WHERE DECODE( hash_join_res.lookup_id, NULL, hash_join_res.a_date, NULL ) = lookup_nl.a_date(+)
         AND DECODE( hash_join_res.lookup_id, NULL, hash_join_res.an_id, NULL ) = lookup_nl.an_id(+);
BEGIN
   FOR rec IN c_test LOOP
      NULL;
   END LOOP;
END;


---------------------------------------------------------------------------------
| Id  | Operation                           | Name                     | E-Rows |
---------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                    |                          |        |
|   1 |  NESTED LOOPS OUTER                 |                          |   9958G|
|   2 |   VIEW                              |                          |    108M|
|*  3 |    HASH JOIN OUTER                  |                          |    108M|
|   4 |     TABLE ACCESS FULL               | DRIVING_TABLE            |    108M|
|   5 |     PARTITION RANGE ITERATOR        |                          |   1711K|
|   6 |      TABLE ACCESS FULL              | LOOKUP_PARTITIONED_TABLE |   1711K|
|   7 |   PARTITION RANGE ITERATOR          |                          |  91532 |
|*  8 |    TABLE ACCESS BY LOCAL INDEX ROWID| LOOKUP_PARTITIONED_TABLE |  91532 |
|*  9 |     INDEX RANGE SCAN                | LOOKUP_ID                |   3661K|
---------------------------------------------------------------------------------

Wanna see some stats? Go on then:

               CPU used by             physical reads   session logical
              this session   DB time           direct*            reads
Hash Join            37288    116633         10888944          13674686
Nested Loop           9529     26475             4108           1119849
Conditional           3449      3994           247476            320629

*Nigel will be delighted to see the serial direct reads in effect on the database I’ve borrowed from him!

Now obviously the results will vary depending on the shape of the “tail” and you may need to programmatically vary the partitions to which you hash join to but I do believe that the pattern holds in general. Note that the shape of the tail in our live data varies with the type of market but it tapers in a way that my test case does not. It’s possible to get too carried away with this and spend more time working out where to introduce the break between Hash Join and Nested Loop but we’ve plumped for two days and get a couple of orders of magnitude reduction in run time.

About these ads

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

 
Follow

Get every new post delivered to your Inbox.

%d bloggers like this: