Dave Webster's Blog

Tinkering with Oracle

Polling for new data

Posted by anotherdavewebster on September 13, 2010

Most places I’ve worked at have had some sort of requirement for picking up new or changed rows from a table in the database. It’s only recently that I’ve found, what I would consider to be, a clean way of doing this. Before I talk about that, I’d like to digress and share a rather amusing anecdote with you. A few years back, my current employer outsourced the Oracle development of a new product to a third party. From what I understand, the implementation didn’t go well and a colleague of mine was involved in reviewing some of the third party code. He was fortunate enough to stumble upon quite possibly the most disturbing approach he’d seen to solving the problem of polling for new data. The developer coded his application to look for new data by searching for rows with a rowid greater than the last rowid retrieved. We enjoyed that one for a while I can tell you.

Anyway, back to the problem of picking up new data. Just so we’re clear on why this is not completely trivial, I will attempt to clarify the problem. The database has many concurrent sessions all inserting (or updating – but we’ll touch on that later) rows in a table. How do I poll for new data in that table? It seems that the most common gut reaction is to use a sequence number written on each row. It’s then maintain a High Water Mark and look for data more recent than the last value seen but this doesn’t work due to the time difference between the insert and the commit. Even if I’m allocating sequence numbers in strict order, there’s every chance that they will be committed out of order. A quick example:

Session Sequence Number Commit Order
A 1 1
B 2 2
C 3 3

 
all good so far but it’s only a matter of time before the order in which the sequence number are given out and the order in which they are committed deviates:

Session Sequence Number Commit Order
A 4 5
B 5 4
C 6 6

 
So my polling code which was keeping track of a high watermark to enable it to detect new data has a problem. It saw Sequence Number 5 so is now only looking for Sequence Numbers greater than 5 and consequently misses 4. We get the same problem if we choose to use timestamps plus the fact that I will, at some point, get duplicate timestamps so I’ll always have to read >= the last HWM and remove duplicates (I have seen this solution attempted!).

So how do we get around the problem? There are a few options of course. Oracle would have had us use Streams (although they’d now rather we use Golden Gate). Their next option may well be persistent AQ. A couple of problems spring to mind:
1. AQ is pretty expensive.
2. It requires the reader / poller / dequeuer to “commit” the fact that the data has been read. If my poller is coming from outside the DB and reads the data without committing the fact, I have a problem.

A “homegrown” solution I’ve also seen before is a trigger on the row which writes a message into a “message table” which a “polling” package reads from, deleting the data after it’s been read. This gives use very similar problems to the AQ solution.

My current favoured solution is to add a Commit Sequence column (indexed) to the table which defaults to 0. I then have a Sequencer process (dbms_job / dbms_scheduler / external) running which updates the Commit Sequence with a Batch Sequence from internal state. When the process starts, it picks up the first sequence number by adding one to the maximum sequence number on the table (the column is indexed so this is fast). The polling process can now work from a High Water Mark, safe in the knowledge that there will be no gaps and that the numbers will also increase.

There are a few points to be aware of:
1. There must be one and only one Sequencer job running.
2. There is no meaningful way to order the rows within each update statement / cycle of the Sequencer. This means I will have a number of rows with the same Commit Sequence which just means that they were sequenced in the same cycle of the sequencer. I don’t consider this to be an issue.
3. If the table is going to be updated then life gets a little more complicated. I would suggest a row level trigger which sets the Commit Sequence back to zero which will cause it to be sequenced again. Obviously if you’re expecting a large volume of updates, this may give you a performance issue.

Being able to construct a reliable stream of change from the database has all kinds of applications and is, in my opinion, something for which Oracle should have provided a high performance solution. I believe that Golden Gate is probably the closest they have but that’s an extra cost option. With some luck, someone reading this will point me in the direction of something better and all the above ramblings will be consigned to the scrap heap! If that proves to be the case, at least we have the rowid tale….

About these ads

9 Responses to “Polling for new data”

  1. Alexandr said

    If you’re interested, I’ve described the way we used Streams to solve such a problem here: http://antonoal.blogspot.com/2010/08/data-change-tracking-via-streams.html

    • I’ve had a quick look at your code and I can see how this kind of approach could work. Unless Streams has become an awful lot more efficient, it will be much more expensive than the approach I suggest. As is so often the case, most choices include a degree of compromise!

  2. Gary said

    There is a new feature in 11gr2 called DBMS_UTILITY.WAIT_ON_PENDING_DML. I blogged about it here :

    http://blog.sydoracle.com/2010/01/lite-serialisation-in-11gr2-with.html

    The short version is that it makes it easier to use sequences/timestamps.
    You grab the current timestamp or sequence nextval.
    Do a WAIT_ON_PENDING_DML so that you wait until the completion of any transactions which were outstanding when you did your pick.
    Then you can poll using your sequence/timestamp safe in the knowledge that you don’t have to worry about uncommitted transactions.

    You probably don’t want to use it when you have VERY long running transactions.

  3. dombrooks said

    Hi Dave,

    Is it possible that you could clarify the detail of this solution a little with an example?

    Cheers,
    Dominic

    • Leave it with me Dominic. I actually received a similar comment from a blogger friend of mine. It has also been pointed out to me that I didn’t make enough of a big deal about how performance critical and efficient the required approach had to be so an example would be a good opportunity to clarify that point.

      Interesting blog btw – your place of work sounds like it has much in common with mine (and probably that of half the Oracle guys out there).

      Thanks,

      Dave

    • Does this help?

      • dombrooks said

        Yes – much clearer, Dave.
        Thanks.

        Apologies for the delay in reading the follow-up – I feel guilty as I had specifically asked for more detail. But I didn’t know the follow-up was there. I always use the OraNA feed to consume blog updates – hope you don’t mind but I’ve asked Eddie Awad to add your blog in there (it didn’t look like it was there already).

  4. [...] My associate Dave Webster has an alternative viewpoint on [...]

  5. […] the Oracle database.  The material I’ll be presenting will include and extend on my posts Polling for new data and Polling for new data: Part 2.  Come and see me here and we can have a good old chat on the […]

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

 
Follow

Get every new post delivered to your Inbox.

%d bloggers like this: