Polling for new data
Posted by anotherdavewebster on September 13, 2010
Most places I’ve worked at have had some sort of requirement for picking up new or changed rows from a table in the database. It’s only recently that I’ve found what I would consider to be a clean way of doing this. Before I talk about that, I’d like to digress and share a rather amusing anecdote. A few years back, my current employer outsourced the Oracle development of a new product to a third party. From what I understand, the implementation didn’t go well and a colleague of mine was involved in reviewing some of the third party code. He was fortunate enough to stumble upon quite possibly the most disturbing approach he’d seen to the problem of polling for new data: the developer coded his application to look for new data by searching for rows with a rowid greater than the last rowid retrieved. Since a rowid identifies a row’s physical storage location and implies nothing whatsoever about insertion order, this is spectacularly broken. We enjoyed that one for a while, I can tell you.
Anyway, back to the problem of picking up new data. Just so we’re clear on why this is not completely trivial, let me state the problem. The database has many concurrent sessions, all inserting (or updating, but we’ll touch on that later) rows in a table. How do I poll for new data in that table? The most common gut reaction seems to be to write a sequence number on each row, maintain a High Water Mark, and look for data more recent than the last value seen. This doesn’t work because of the time difference between the insert and the commit: even if sequence numbers are allocated in strict order, there’s every chance that they will be committed out of order. A quick example:
|Session|Sequence Number|Commit Order|
|---|---|---|
|A|1|1|
|B|2|2|
|A|3|3|
All good so far, but it’s only a matter of time before the order in which the sequence numbers are given out and the order in which they are committed deviate:
|Session|Sequence Number|Commit Order|
|---|---|---|
|A|4|5|
|B|5|4|
So my polling code, which was keeping track of a High Water Mark to detect new data, has a problem. It saw Sequence Number 5, so it is now only looking for Sequence Numbers greater than 5 and consequently misses 4. We get the same problem if we choose to use timestamps, with the added twist that I will, at some point, get duplicate timestamps, so I’d always have to read >= the last HWM and remove the duplicates (I have seen this solution attempted!).
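The failure mode is easy to simulate in a few lines. The sketch below is plain Python with a hard-coded commit order matching the example above; the variable names are my own:

```python
# Sequence numbers are allocated in order 1..5, but the row holding
# sequence number 5 commits (becomes visible) before the row holding 4.
commit_order = [1, 2, 3, 5, 4]  # order in which rows become visible

hwm = 0      # high water mark: largest sequence number seen so far
seen = []    # rows the poller actually picks up

for seq in commit_order:
    # The poller only looks for rows with seq > hwm, so once it has
    # seen 5 it never goes back for 4.
    if seq > hwm:
        seen.append(seq)
        hwm = seq

print(seen)  # [1, 2, 3, 5]: row 4 is silently missed
```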
So how do we get around the problem? There are a few options, of course. Oracle would have had us use Streams (although these days they’d rather we used Golden Gate). Their next option may well be persistent AQ. A couple of problems spring to mind:
1. AQ is pretty expensive: persistent messages are written to, and later removed from, an underlying queue table, which is a lot of work per row.
2. It requires the reader / poller / dequeuer to “commit” the fact that the data has been read. If my poller is coming from outside the DB and reads the data without committing the fact, I have a problem.
A “homegrown” solution I’ve also seen before is a row-level trigger on the table which writes a message into a “message table”, from which a “polling” package reads, deleting the data after it has been read. This gives us very similar problems to the AQ solution.
My current favoured solution is to add an indexed Commit Sequence column to the table, defaulting to 0 (meaning “not yet sequenced”). I then have a Sequencer process (dbms_job / dbms_scheduler / external) which, on each cycle, stamps every unsequenced row with a Batch Sequence number held in its internal state. When the process starts, it picks up its first batch number by adding one to the maximum Commit Sequence on the table (the column is indexed, so this is fast). The polling process can now work from a High Water Mark, safe in the knowledge that there will be no gaps and that the numbers only ever increase: the Sequencer stamps only rows whose inserts have already committed, and stamps each batch in a single transaction, so a lower batch number can never appear after a higher one has been seen.
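As a sketch of the mechanics, here it is in Python with an in-memory SQLite database standing in for Oracle; the table, column, and function names are my own invention, and in Oracle the Sequencer would of course be a dbms_scheduler job running the equivalent UPDATE:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE orders (
        id         INTEGER PRIMARY KEY,
        payload    TEXT,
        commit_seq INTEGER DEFAULT 0   -- 0 = not yet sequenced
    )""")
conn.execute("CREATE INDEX orders_cseq ON orders(commit_seq)")

# Sequencer start-up: resume one past the highest batch number already stamped.
batch = conn.execute(
    "SELECT COALESCE(MAX(commit_seq), 0) + 1 FROM orders").fetchone()[0]

def sequencer_cycle(conn):
    """One cycle: stamp all committed-but-unsequenced rows with the next batch number."""
    global batch
    conn.execute("UPDATE orders SET commit_seq = ? WHERE commit_seq = 0", (batch,))
    conn.commit()
    batch += 1

def poll(conn, hwm):
    """Poller: fetch rows stamped after the high water mark; gap-free by construction."""
    rows = conn.execute(
        "SELECT id, commit_seq FROM orders WHERE commit_seq > ? ORDER BY commit_seq",
        (hwm,)).fetchall()
    new_hwm = rows[-1][1] if rows else hwm
    return rows, new_hwm

# Two inserts arrive, the Sequencer runs, the poller picks both up.
conn.execute("INSERT INTO orders (payload) VALUES ('a')")
conn.execute("INSERT INTO orders (payload) VALUES ('b')")
conn.commit()
sequencer_cycle(conn)
rows, hwm = poll(conn, 0)
```

Note that the poller only ever reads: it keeps its own High Water Mark and never has to write anything back to the database, which is exactly the property the AQ approach lacked.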
There are a few points to be aware of:
1. There must be one and only one Sequencer job running.
2. There is no meaningful way to order the rows within a single update statement / cycle of the Sequencer. A number of rows will therefore share the same Commit Sequence, which simply means they were sequenced in the same cycle. I don’t consider this to be an issue.
3. If the table is going to be updated then life gets a little more complicated. I would suggest a row-level trigger which sets the Commit Sequence back to zero, causing the row to be sequenced (and hence picked up) again. Obviously, if you’re expecting a large volume of updates, this may give you a performance issue.
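Point 3 can be sketched with a trigger; again SQLite is standing in here, and the names are mine. (In Oracle this would more naturally be a BEFORE UPDATE ... FOR EACH ROW trigger assigning :new.commit_seq := 0, which avoids the self-referencing UPDATE entirely.)

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE orders (
        id         INTEGER PRIMARY KEY,
        payload    TEXT,
        commit_seq INTEGER DEFAULT 0)""")

# Reset commit_seq whenever the application changes a row, so the Sequencer
# re-stamps it and the poller sees it again.  The WHEN clause leaves the
# Sequencer's own stamping (which changes commit_seq) alone.
conn.execute("""
    CREATE TRIGGER orders_touch AFTER UPDATE ON orders
    WHEN NEW.commit_seq = OLD.commit_seq AND NEW.commit_seq != 0
    BEGIN
        UPDATE orders SET commit_seq = 0 WHERE id = NEW.id;
    END""")

conn.execute("INSERT INTO orders (payload) VALUES ('a')")
conn.execute("UPDATE orders SET commit_seq = 1 WHERE commit_seq = 0")  # Sequencer stamps
conn.execute("UPDATE orders SET payload = 'a2' WHERE id = 1")          # application update
conn.commit()

cseq = conn.execute("SELECT commit_seq FROM orders WHERE id = 1").fetchone()[0]
print(cseq)  # 0: the updated row is back in the "to be sequenced" state
```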
Being able to construct a reliable stream of change from the database has all kinds of applications and is, in my opinion, something for which Oracle should have provided a high-performance solution. I believe Golden Gate is probably the closest they have, but that’s an extra-cost option. With any luck, someone reading this will point me in the direction of something better and all the above ramblings can be consigned to the scrap heap! If that proves to be the case, at least we’ll still have the rowid tale….