Polling for new data: Part 2
Posted by anotherdavewebster on October 7, 2010
Apologies for the delay. The day job has been rather busy recently and I was fortunate enough to get a trip out to Open World again this year. Much more black and white for me this time: some real gems and some truly terrible presentations! My personal highlights include a few wonderful sessions from Craig Shallahamer (you have to see his physical demonstration of the workings of the buffer cache), some real index treats from Richard Foote and a really good session on Query Transformations from a fantastically enthusiastic Joze Senegacnik which shone the light on a few areas of the optimiser which I’d never really thought about (but should have).
So, back to Polling for new data. One point that has come to light since I started this blog is that I should have made clear from the outset that almost everything I do is about performance. Our main production system performs many thousands of transactions a second and up to one hundred thousand queries a second at peak. The vast majority of these calls require response times measure in milliseconds. Consequently, we need to be pretty careful about the SQL and PL/SQL we write.
The aim is to provide a sequenced stream of data from a table which:
- adds the minimum overhead to DML operations against the table in question.
- generates the least possible load on the system for both the sequencing and polling components.
- bonus points for a clean design with no more objects and code than is necessary!
I’ve put together some detail on the technique I described in the original item. We create our table with an additional column “commit_sequence” which defaults to 0 and is indexed:
CREATE TABLE test_table ( column1 NUMBER NOT NULL, column2 VARCHAR2(10) NOT NULL, commit_sequence NUMBER DEFAULT 0 NOT NULL ); CREATE INDEX test_commit_sequence ON test_table(commit_sequence);
If the table is going to receive any update activity then we’ll also need an update trigger:
CREATE OR REPLACE TRIGGER test_table_update BEFORE UPDATE ON test_table FOR EACH ROW WHEN (OLD.commit_sequence = NEW.commit_sequence AND NEW.commit_sequence != 0) BEGIN -- Reset the commit_sequence as long as this is not the update which is doing the sequencing. :NEW.commit_sequence := 0; END test_table_update;
So now the table can be used in the normal way. INSERTs are almost exactly the same cost as they were previously – the “DEFAULT 0″ costs nothing compared to the other work going on although we do have an extra index on the table. The UPDATE is a little more expensive due to the trigger but you can’t have everything! We now just need our background job which does the sequencing of the data:
DECLARE c_batch_size CONSTANT NUMBER := 10; c_sleep_time CONSTANT NUMBER := 1; l_commit_sequence NUMBER; l_row_count NUMBER; BEGIN -- Get the next commit_sequence to be used SELECT NVL( MAX( commit_sequence ), 0 ) + 1 INTO l_commit_sequence FROM test_table; LOOP UPDATE test_table SET commit_sequence = l_commit_sequence WHERE commit_sequence = 0 AND ROWNUM <= c_batch_size; l_row_count := SQL%ROWCOUNT; COMMIT; IF l_row_count > 0 THEN l_commit_sequence := l_commit_sequence + 1; END IF; -- If the number of rows updated != the batch size then we updated all visible rows so we should sleep for a while before trying again. -- If not, there is every chance that there are more rows to be sequenced so we should get on with it. IF l_row_count != c_batch_size THEN DBMS_LOCK.sleep( c_sleep_time ); END IF; END LOOP; END;
This can be run as a dbms_job / scheduler or an external process. The only really important thing is to make sure there is always one and only one of these running. There are obviously a variety of ways of accomplishing this.
Just a quick note on the CONSTANTs in the code snippet. First of all, be aware that a package level constant actually means that the package has state which is rarely a good thing. We tend to use (deterministic) functions to return values instead of using constants. Digression aside, the constant c_batch_size is important. If it’s set to be greater than one and the application can perform updates to multiple rows in one transaction then there is a chance of a deadlock. Given the default “no wait” behaviour of commits within PL/SQL (see here), we needn’t worry too much about the cost of the commits although obviously updating 100 rows in one statement is cheaper than updating 100 rows in 100 statements! The primary goal of this approach is to impact the processes writing to the table as little as possible so if there is a chance of introducing a deadlock then we should certainly avoid that by using a batch size of 1. Just to be clear, if I’m only inserting into this table or updating only single rows in each transaction, a batch size of greater than one will be more efficient. Applying this technique to an INSERT only table is ideal – we don’t need the trigger (or at least it won’t fire) and we can use large batch sizes.
I now have a clean and efficient technique for sequencing the data in a table. All that remains is to create the consumer process which would generally come from an external application. We might provide a helper PL/SQL function so that the application would call fn_get_hwm once and then repeatedly poll for data by calling pr_get_deltas with the last HWM. The HWM for each call is the commit_sequence from the last row returned by the previous call.
CREATE OR REPLACE PACKAGE pkg_get_deltas IS FUNCTION fn_get_hwm RETURN NUMBER; PROCEDURE get_deltas( i_commit_sequence_hwm IN NUMBER, o_rc OUT sys_refcursor ); END pkg_get_deltas; CREATE OR REPLACE PACKAGE BODY pkg_get_deltas IS FUNCTION fn_get_hwm RETURN NUMBER IS l_commit_sequence_hwm NUMBER; BEGIN SELECT NVL( MAX( commit_sequence ), 0 ) INTO l_commit_sequence_hwm FROM test_table; RETURN l_commit_sequence_hwm; END fn_get_hwm; PROCEDURE get_deltas( i_commit_sequence_hwm IN NUMBER, o_rc OUT sys_refcursor ) AS BEGIN OPEN o_rc FOR SELECT column1 , column2 , commit_sequence FROM test_table WHERE commit_sequence > i_commit_sequence_hwm ORDER BY commit_sequence; END get_deltas; END pkg_get_deltas;
There are a few other things to consider including:
- depending on what we’re doing with the “stream”, we may want to persist the HWM to the DB every now and again so we don’t get any gaps. This would obviously influence the retrieval of the first HWM.
- This approach wouldn’t allow us to capture DELETEs.
- A combination of “select all data from the table” and picking up the maximum commit_sequence allows a remote cache to be built and kept up to date. This sounds rather like the RESULT CACHE but with one, massively important, difference. With the result cache, if I change ONE row in a table, the entire result cache is thrown away. That’s not much use for something like an account table which may have millions of rows yet I only change one row every few seconds.