Dave Webster's Blog

Tinkering with Oracle

Polling for new data: Part 2

Posted by anotherdavewebster on October 7, 2010

Apologies for the delay. The day job has been rather busy recently and I was fortunate enough to get a trip out to Open World again this year. Much more black and white for me this time: some real gems and some truly terrible presentations! My personal highlights include a few wonderful sessions from Craig Shallahamer (you have to see his physical demonstration of the workings of the buffer cache), some real index treats from Richard Foote and a really good session on Query Transformations from a fantastically enthusiastic Joze Senegacnik, which shone a light on a few areas of the optimiser which I’d never really thought about (but should have).

So, back to Polling for new data. One point that has come to light since I started this blog is that I should have made clear from the outset that almost everything I do is about performance. Our main production system performs many thousands of transactions a second and up to one hundred thousand queries a second at peak. The vast majority of these calls require response times measured in milliseconds. Consequently, we need to be pretty careful about the SQL and PL/SQL we write.

The aim is to provide a sequenced stream of data from a table which:
– adds the minimum overhead to DML operations against the table in question.
– generates the least possible load on the system for both the sequencing and polling components.
– bonus points for a clean design with no more objects and code than is necessary!

I’ve put together some detail on the technique I described in the original item. We create our table with an additional column “commit_sequence” which defaults to 0 and is indexed:

CREATE TABLE test_table
(
   column1         NUMBER       NOT NULL,
   column2         VARCHAR2(10) NOT NULL,
   commit_sequence NUMBER       DEFAULT 0 NOT NULL
);

CREATE INDEX test_commit_sequence ON test_table(commit_sequence);

If the table is going to receive any update activity then we’ll also need an update trigger:

CREATE OR REPLACE TRIGGER test_table_update
   BEFORE UPDATE
   ON test_table
   FOR EACH ROW
   WHEN (OLD.commit_sequence = NEW.commit_sequence AND NEW.commit_sequence != 0)
BEGIN
   -- Reset the commit_sequence as long as this is not the update which is doing the sequencing.
   :NEW.commit_sequence := 0;
END test_table_update;

So now the table can be used in the normal way. INSERTs are almost exactly the same cost as they were previously – the “DEFAULT 0” costs nothing compared to the other work going on, although we do have an extra index on the table. The UPDATE is a little more expensive due to the trigger, but you can’t have everything! We now just need our background job which does the sequencing of the data:

DECLARE
   c_batch_size   CONSTANT NUMBER := 10;
   c_sleep_time   CONSTANT NUMBER := 1;
   l_commit_sequence       NUMBER;
   l_row_count             NUMBER;
BEGIN
   -- Get the next commit_sequence to be used
   SELECT NVL( MAX( commit_sequence ), 0 ) + 1
     INTO l_commit_sequence
     FROM test_table;

   LOOP
      UPDATE test_table
         SET commit_sequence = l_commit_sequence
       WHERE commit_sequence = 0 AND ROWNUM <= c_batch_size;

      l_row_count := SQL%ROWCOUNT;
      COMMIT;

      IF l_row_count > 0 THEN
         l_commit_sequence := l_commit_sequence + 1;
      END IF;
      -- If the number of rows updated != the batch size then we updated all visible rows
      -- so we should sleep for a while before trying again.
      -- If not, there is every chance that there are more rows to be sequenced so we
      -- should get on with it.
      IF l_row_count != c_batch_size THEN
         DBMS_LOCK.sleep( c_sleep_time );
      END IF;
   END LOOP;
END;

This can be run as a dbms_job / scheduler job or as an external process. The only really important thing is to make sure there is always one, and only one, of these running. There are obviously a variety of ways of accomplishing this.
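One sketch of the scheduler route (the job name is mine, and pr_sequence_test_table is assumed to be a procedure wrapping the sequencing loop above): a single DBMS_SCHEDULER job gives the “one and only one” guarantee for free, since the scheduler won’t start a new run of a job while a previous run is still executing.

```sql
BEGIN
   -- pr_sequence_test_table is a hypothetical wrapper around the sequencing loop.
   DBMS_SCHEDULER.create_job(
      job_name   => 'seq_test_table_job',
      job_type   => 'PLSQL_BLOCK',
      job_action => 'BEGIN pr_sequence_test_table; END;',
      start_date => SYSTIMESTAMP,
      enabled    => TRUE );
END;
/
```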

Just a quick note on the CONSTANTs in the code snippet. First of all, be aware that a package level constant actually means that the package has state, which is rarely a good thing. We tend to use (deterministic) functions to return values instead of using constants.

Digression aside, the constant c_batch_size is important. If it’s set to be greater than one and the application can perform updates to multiple rows in one transaction then there is a chance of a deadlock. Given the default “no wait” behaviour of commits within PL/SQL (see here), we needn’t worry too much about the cost of the commits, although obviously updating 100 rows in one statement is cheaper than updating 100 rows in 100 statements! The primary goal of this approach is to impact the processes writing to the table as little as possible, so if there is a chance of introducing a deadlock then we should certainly avoid that by using a batch size of 1. Just to be clear, if I’m only inserting into this table or updating only single rows in each transaction, a batch size of greater than one will be more efficient. Applying this technique to an INSERT only table is ideal – we don’t need the trigger (or at least it won’t fire) and we can use large batch sizes.
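For what it’s worth, the deterministic-function alternative to a package constant looks something like this (a sketch; the function name is mine):

```sql
-- A package-level constant gives the package state; a DETERMINISTIC
-- function exposes the same fixed value without holding any state.
CREATE OR REPLACE FUNCTION fn_batch_size
   RETURN NUMBER
   DETERMINISTIC
IS
BEGIN
   RETURN 10;
END fn_batch_size;
/
```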

I now have a clean and efficient technique for sequencing the data in a table. All that remains is to create the consumer process, which would generally come from an external application. We might provide a helper PL/SQL package so that the application would call fn_get_hwm once and then repeatedly poll for data by calling get_deltas with the last HWM. The HWM for each call is the commit_sequence from the last row returned by the previous call.

CREATE OR REPLACE PACKAGE pkg_get_deltas
AS
   FUNCTION fn_get_hwm
      RETURN NUMBER;

   PROCEDURE get_deltas( i_commit_sequence_hwm IN NUMBER, o_rc OUT sys_refcursor );
END pkg_get_deltas;

CREATE OR REPLACE PACKAGE BODY pkg_get_deltas
AS
   FUNCTION fn_get_hwm
      RETURN NUMBER
   IS
      l_commit_sequence_hwm   NUMBER;
   BEGIN
      SELECT NVL( MAX( commit_sequence ), 0 )
        INTO l_commit_sequence_hwm
        FROM test_table;

      RETURN l_commit_sequence_hwm;
   END fn_get_hwm;

   PROCEDURE get_deltas( i_commit_sequence_hwm IN NUMBER, o_rc OUT sys_refcursor )
   IS
   BEGIN
      OPEN o_rc FOR
          SELECT   column1
                 , column2
                 , commit_sequence
              FROM test_table
             WHERE commit_sequence > i_commit_sequence_hwm
          ORDER BY commit_sequence;
   END get_deltas;
END pkg_get_deltas;
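To illustrate how a consumer might drive the package, here’s a sketch of a single poll from PL/SQL (a real consumer would typically be an external application repeating the get_deltas call on a timer):

```sql
DECLARE
   l_hwm             NUMBER;
   l_rc              sys_refcursor;
   l_column1         test_table.column1%TYPE;
   l_column2         test_table.column2%TYPE;
   l_commit_sequence test_table.commit_sequence%TYPE;
BEGIN
   -- Establish the starting high water mark once...
   l_hwm := pkg_get_deltas.fn_get_hwm;

   -- ...then poll for anything sequenced after it (one poll shown here).
   pkg_get_deltas.get_deltas( l_hwm, l_rc );
   LOOP
      FETCH l_rc INTO l_column1, l_column2, l_commit_sequence;
      EXIT WHEN l_rc%NOTFOUND;
      -- Process the row, then carry its commit_sequence forward as the next HWM.
      l_hwm := l_commit_sequence;
   END LOOP;
   CLOSE l_rc;
END;
/
```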

There are a few other things to consider including:
– depending on what we’re doing with the “stream”, we may want to persist the HWM to the DB every now and again so we don’t get any gaps. This would obviously influence the retrieval of the first HWM.
– This approach wouldn’t allow us to capture DELETEs.
– A combination of “select all data from the table” and picking up the maximum commit_sequence allows a remote cache to be built and kept up to date. This sounds rather like the RESULT CACHE but with one, massively important, difference. With the result cache, if I change ONE row in a table, the entire result cache is thrown away. That’s not much use for something like an account table which may have millions of rows yet I only change one row every few seconds.
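A sketch of the cache-priming step: because a single query is one consistent read, an analytic MAX can hand back the data and the starting HWM in the same result set.

```sql
-- One consistent read primes the remote cache and yields the starting HWM
-- together; subsequent polls with that HWM keep the cache up to date.
SELECT column1,
       column2,
       commit_sequence,
       MAX( commit_sequence ) OVER () AS hwm
  FROM test_table;
```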

2 Responses to “Polling for new data: Part 2”

  1. dombrooks said

    That’s an interesting point on the use of deterministic functions rather than packaged globals.

    Any chance of further expansion? 😉

  2. […] The material I’ll be presenting will include and extend on my posts Polling for new data and Polling for new data: Part 2.  Come and see me here and we can have a good old chat on the […]
