Well ladies and gentlemen, it seems that I’ll be sharing my thoughts on different ways of streaming data from the Oracle database. The material I’ll be presenting will include and extend my posts Polling for new data and Polling for new data: Part 2. Come and see me here and we can have a good old chat on the subject.
Posted by anotherdavewebster on September 30, 2013
Posted by anotherdavewebster on August 19, 2013
I was recently asked to take a look at a problem with a nightly batch job which deletes historical data from some transaction tables. The requirement is as follows:
- I would like to retain three years of data in my transaction table.
- Each night, I’d like to delete data older than 3 years which means, providing my nightly job doesn’t fail, I’ll be deleting one day of data.
- One day of data is around two million rows and everyone would feel more comfortable if we deleted that data in chunks rather than all two million rows in a single transaction.
- I don’t have partitioning and/or it’s not appropriate to partition the data.
- I don’t have an index on transaction_date (the column used to determine the “age” of the data).
The general approach that had been taken was along the lines of:
DELETE FROM my_table WHERE transaction_date < trunc( add_months( sysdate, -36 ) ) AND rownum < 10000;
They would run this in a loop, committing after each delete until no rows were deleted.
I’m sure you can see the problem: with no index on transaction_date, we full scan the table around 2m/10k times, i.e. roughly 200 full scans every night. Even if we were happy with a single 2m row delete we’d still be taking a very long time to find the data to be deleted and poor old read consistency would be working its little socks off for the OLTP application reading and writing to this table.
The developers wanted to add an index on transaction_date. There’s a lot of sense in this: it would allow me to very efficiently find each batch of 10k rows I wanted to delete. The trouble is that I hate adding indexes, especially to big tables like this and even more so when they are there for a single process. Do I really need that extra index?
Not surprisingly, I had another suggestion (else what would be the point of this blog item?). There are other indexes already present on the table. There is one on account_id and one on transaction_id. You probably have a pretty good idea what these columns are and it’s the second one, the Primary Key that we’re interested in. As is fairly common, this is a number generated from a sequence at the time of insert. You may well see where I’m going here but, if not, then please read on.
Because the transaction_id is sourced from a sequence then it will increase in a generally contiguous manner*. The order of the rows sorted by transaction_id will be similar (although not exactly the same) to the order of the rows sorted by transaction_date. Although the transaction_id index can’t exactly identify the rows with the transaction_date I want, it can get me very close. So I rewrite the delete statement as follows:
DELETE FROM my_table
WHERE transaction_id < ( SELECT min( transaction_id ) + 20000
FROM my_table )
AND transaction_date < trunc( add_months( sysdate, -36 ) )
AND rownum < 10000;
So I can use the PK index to read the first 20k rows from the table, via the index. Obviously this 20k can be played with a little bit depending on how contiguous the PK values are and also how much the order of transaction_id can deviate from the order of transaction_date. As before, this statement should be run until no rows have been deleted.
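The loop itself can be sketched in PL/SQL along the following lines. This is only a minimal sketch, assuming the table and column names used above; the variable name l_cutoff is mine and the 20000 and 10000 figures are the ones from the statement above, to be tuned as described.

```sql
DECLARE
  -- Evaluate the cut-off once so every chunk uses the same boundary.
  l_cutoff CONSTANT DATE := TRUNC( ADD_MONTHS( SYSDATE, -36 ) );
BEGIN
  LOOP
    DELETE FROM my_table
     WHERE transaction_id < ( SELECT MIN( transaction_id ) + 20000
                                FROM my_table )
       AND transaction_date < l_cutoff
       AND ROWNUM < 10000;

    -- Check the row count before committing: stop once a pass deletes nothing.
    EXIT WHEN SQL%ROWCOUNT = 0;

    COMMIT;
  END LOOP;

  COMMIT;
END;
/
```

Each pass finds MIN( transaction_id ) from the PK index and then only visits rows in that narrow range, so no pass needs a full scan of the table.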
Obviously there are cases where we can’t use this approach and we may well need to code some additional logic to cope with large gaps in sequence numbers but, in the case I was trying to solve, we avoided creating another index.
* This sequence does not CYCLE. If it did then it’s time to go back to the drawing board. We’re also running on a single instance DB so we don’t have to contend with multiple RAC nodes having multiple caches of the sequence values.
Posted by anotherdavewebster on August 2, 2013
I’m constantly surprised by how little information I find on the internet about this so I thought I’d give a short summary of what we consider best practice to be and what other options we’ve looked at. Let’s take the common case of wanting to return the results of a SQL query to the client Java code. There’s just boat loads of this code so surely it’s worth trying to make it efficient. In terms of performance, I’ve not found anything better than a PreparedStatement wrapping the query and executing over JDBC. It seems that Oracle have optimised this route as it’s probably the most common. They give us statement caching, fetch and prefetch size which all help. I’m not really going to talk much about this approach. Instead, I’m going to look at using a PL/SQL API (stored procedures) to the data, which provides a number of advantages over executing queries directly from Java:
- Impact analysis for schema changes is improved as packages will not compile if objects are changed.
- It allows a comprehensive DB Unit Test to be constructed for the SQL.
- It provides a level of abstraction from the DB allowing some DB changes to be made “transparently” to the middle-tier.
- It allows security to be more tightly controlled as access is only granted to packages and not underlying tables.
- It allows me to make multiple procedure calls and queries which result in a single result set.
- There’s no “danger” of Java coders writing SQL queries. There you go, I’ve said it, it’s out there. You were all thinking it anyway.
So I think it’s fair to say that there are some advantages of a PL/SQL API. The question is, what’s the best way of passing the data back to Java? I’ve tried a few things over the years:
- Returning data in an object array as a return value from a function or OUT parameter of a procedure. It probably comes as no great surprise that the creation of these objects isn’t cheap. My tests showed a not insignificant overhead. It also requires the database to be full of TYPEs for each shape of result set (at least it did when I last played with it).
- Returning data as a list of OUT parameters. This works well (to a point – see below) for a single row of data.
- Returning a refcursor to the result set.
It’s the last two options that I’ve been playing with recently.
The refcursor option is good in many ways, mainly because the process remains close to the optimised “query over JDBC” approach. The only real downside I’ve found is a really annoying one which I was amazed to find out hadn’t been addressed: there’s no prefetch for refcursors through JDBC.
We use the thin JDBC client so we fired up the thick client to see if that helped. We made sure the prefetch value we were setting was actually being applied. We tried the latest version of the thin driver. Nothing. We couldn’t get it to work. Reading around on the internet it seems that prefetching refcursors is possible through the PHP driver and I found numerous pieces suggesting that prefetch over JDBC worked but nothing concrete. In the end, we got word from the man that knows, the JDBC man at Oracle. He confirmed that it’s not supported.
This is a real nuisance. The lack of prefetch means at least one extra round trip. One for the procedure call and one for the fetch from the refcursor. It also reduces the benefit of returning multiple refcursors from a single call. That would have given me the edge over direct SQL. I could have executed and fetched multiple queries with a single round trip. I don’t like to compromise, I really don’t but now direct SQL has an advantage over calling packages that return refcursors. Irritating doesn’t cover it.
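For anyone who hasn’t used the refcursor approach, the database side looks something like the sketch below. The function name get_orders, the orders table and its columns are all hypothetical, made up purely for illustration.

```sql
CREATE OR REPLACE FUNCTION get_orders( i_account_id IN NUMBER )
  RETURN SYS_REFCURSOR
IS
  l_orders SYS_REFCURSOR;
BEGIN
  -- Open the cursor and hand it back; no rows are fetched here.
  OPEN l_orders FOR
    SELECT order_id, order_date, amount
      FROM orders
     WHERE account_id = i_account_id;

  RETURN l_orders;
END get_orders;
/
```

The client makes one call to get the cursor back and then has to fetch from it separately, which is where the extra round trip described above comes from.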
Up until recently, our internal best practice for a single row of data was a procedure call with multiple OUT parameters. It doesn’t have the horrible (did I mention that I didn’t like it?) downside of the extra round trip we get with refcursors. There is an issue though. We created a stored procedure with 70 OUT parameters. Everything looked fine until we ramped up the contention and then it got really, really expensive. When I say expensive, I mean properly expensive i.e. lots of CPU. The OS (Linux) shows very high system time and a little digging reveals lots of slow mmap calls. I may blog the detail at some point but to summarise: directly selecting into >= 24 OUT parameters triggers lots of mmap calls on the underlying OS. Some strange specifics:
- The mmap calls only seem expensive under contention. Running our tests single threaded still generates the calls but they are quick.
- The size of the OUT parameters doesn’t have an impact i.e. 23 large parameters are as fast as 23 small parameters. 24 large parameters are as slow as 24 small parameters.
- If you declare PL/SQL variables in the stored procedure, SELECT INTO those variables and then assign the values to the OUT parameters then the problem goes away! This obviously provides us with a work around but who wants to publish that as a best practice? I’d be laughed out of the room.
- All of this was on 126.96.36.199
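The workaround from the third point looks something like the sketch below. It uses two OUT parameters rather than 24 to keep it short, and the procedure name get_order, the orders table and its columns are hypothetical.

```sql
CREATE OR REPLACE PROCEDURE get_order(
  i_order_id   IN  NUMBER,
  o_order_date OUT DATE,
  o_amount     OUT NUMBER )
IS
  -- Local variables that take the SELECT INTO instead of the OUT parameters.
  l_order_date DATE;
  l_amount     NUMBER;
BEGIN
  SELECT order_date, amount
    INTO l_order_date, l_amount
    FROM orders
   WHERE order_id = i_order_id;

  -- Only now copy the values across to the OUT parameters.
  o_order_date := l_order_date;
  o_amount     := l_amount;
END get_order;
/
```

The only difference from the obvious version is that the SELECT never targets the OUT parameters directly, which is the pattern that made the problem go away in our tests.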
I think that’s about the end of it for now. In summary, we like the PL/SQL API. We like refcursors as the method of passing data and we’ll like them even more when the JDBC guys give us the prefetch that is, in my opinion, a significant oversight. We quite like scalar OUT parameters BUT we know we need to avoid >= 24 of them!
Posted by anotherdavewebster on August 1, 2013
Has it really been two and a half years since I last posted? In that time I have changed roles a few times and travelled the world a little. I know it’s no excuse but there you have it. The good news is that while I’ve been idle, we’ve got ourselves onto Oracle 11 which means Nigel and I have better toys to play with. I have a few posts in mind and I reckon I’ll kick off with a general ramble about getting data from Oracle into Java. Let’s see how long it takes me to do the first real post.
Posted by anotherdavewebster on December 30, 2010
There’s nothing like the Christmas break to slow things down at work a little! This has given me a little time to finally write up Conditional Joins Part 2. Part 2 is actually “Part 1” from a chronological perspective but I’ve written them the other way round simply because Part 1 is a little more general whereas Part 2, as useful as it is, solves a very specific problem.
So the real world problem is as follows. Think of a large set of orders which are placed over many months but “settled” at one point in time. My driving table contains the settled orders but I now need to go back to the record of each order being placed to retrieve some data from that record. The placement table is partitioned by placement date (a_date) and has a local non unique index on the ID (an_id). Both the placement date and the ID exist on the settled (driving) table. The vast majority of the settled orders were placed in the last couple of days but the “tail” stretches back many months. I was asked to look at the problem which basically came down to a choice between a hash join and a nested loop join from the settled table to the placement table. Neither approach really suited the problem due to the long tail. I explained the dilemma to the client to which he replied “isn’t it a shame that Oracle can’t choose to hash join to the recent partitions and nested loop to the older ones”. Bingo. That’s exactly what we want and we can do it using a conditional join.
Read the rest of this entry »
Posted by anotherdavewebster on October 7, 2010
Apologies for the delay. The day job has been rather busy recently and I was fortunate enough to get a trip out to Open World again this year. Much more black and white for me this time: some real gems and some truly terrible presentations! My personal highlights include a few wonderful sessions from Craig Shallahamer (you have to see his physical demonstration of the workings of the buffer cache), some real index treats from Richard Foote and a really good session on Query Transformations from a fantastically enthusiastic Joze Senegacnik which shone the light on a few areas of the optimiser which I’d never really thought about (but should have).
So, back to Polling for new data. One point that has come to light since I started this blog is that I should have made clear from the outset that almost everything I do is about performance. Read the rest of this entry »
Posted by anotherdavewebster on September 13, 2010
Most places I’ve worked at have had some sort of requirement for picking up new or changed rows from a table in the database. It’s only recently that I’ve found, what I would consider to be, a clean way of doing this. Before I talk about that, I’d like to digress and share a rather amusing anecdote with you. A few years back, my current employer outsourced the Oracle development of a new product to a third party. From what I understand, the implementation didn’t go well and a colleague of mine was involved in reviewing some of the third party code. He was fortunate enough to stumble upon quite possibly the most disturbing approach he’d seen to solving the problem of polling for new data. The developer coded his application to look for new data by searching for rows with a rowid greater than the last rowid retrieved. We enjoyed that one for a while I can tell you.
Anyway, back to the problem of picking up new data. Just so we’re clear on why this is not completely trivial, I will attempt to clarify the problem. Read the rest of this entry »
Posted by anotherdavewebster on August 30, 2010
This little trick came my way courtesy of a colleague of mine named Matt Cobley. Over the years, I’ve frequently needed to produce a comma delimited list of strings from multiple rows of a query. This is easy enough if you know how many rows you have but if not, I always end up writing a little piece of PL/SQL containing a cursor which builds a string which I then output to dbms_output. All rather long winded and more than a little limiting. That was until Matt sent an eMail (on a subject I no longer recall) containing the little gem XMLAGG which works something like this:
SELECT RTRIM( XMLAGG( XMLELEMENT( any_old_name, table_name || ',' )
                      ORDER BY last_analyzed DESC ).EXTRACT( '//text()' )
            , ',' )
  FROM dba_tables
 WHERE ROWNUM < 20
   AND owner = 'SYS'
Now I don’t suppose for a minute that it’s cheap but it really is very easy and it provides us with a nice clean:
I made a mental note at the time to have a dig around the XML stuff in Oracle as I’m sure there are lots of other treats in there. I’m sorry to say that I’ve not yet got started.
UPDATE: A couple of you have kindly pointed out that 11gR2 contains LISTAGG, an extension to the wonderful analytics functionality, which does the same job in a more elegant way.
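For comparison, the same list written with LISTAGG would look roughly like this. It is a sketch against the same dba_tables query as above and needs 11gR2 or later.

```sql
SELECT LISTAGG( table_name, ',' )
         WITHIN GROUP ( ORDER BY last_analyzed DESC )
  FROM dba_tables
 WHERE ROWNUM < 20
   AND owner = 'SYS'
```

There is no trailing comma to trim because the delimiter is only placed between values.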
Posted by anotherdavewebster on August 26, 2010
It would be reasonable to assume that everything that could ever be written about bind variables has already been written. Certainly regarding the concept of them and why they are a good thing anyway. A little while back, I was digging around on a system which was suffering from very high parse rates. I did the normal stuff and took a look in v$sql:
SELECT ( SELECT username
           FROM dba_users
          WHERE user_id = parsing_user_id ) username
     , parsing_user_id
     , SUBSTR( sql_text, 1, 30 )
     , COUNT( * )
     , MAX( TO_DATE( first_load_time, 'YYYY-MM-DD/HH24:MI:SS' ) )
     , MAX( last_active_time )
  FROM v$sql
 GROUP BY parsing_user_id, SUBSTR( sql_text, 1, 30 )
HAVING COUNT( * ) > 5
 ORDER BY 4 DESC
Sure enough, I found a couple of statements with many thousands of variations, differing only by a handful of source table names and many different literal values. A quick look at the code confirmed that we had a piece of dynamic SQL (to manage a few different source tables) which concatenated a literal value into the SQL string. Something along the lines of:
execute immediate 'insert into table_a select * from ' || l_tab_name || ' where id = ' || l_id;
“Great stuff” I thought, I’ll go and find the developer who wrote that piece of code and ask him to switch to using a bind variable for the id but obviously retain the dynamic nature of the source table name. So I wrote him a quick eMail asking him to use a bind variable and the response rather surprised me. When I tell people the story, I ask them what they think the most worrying answer would be. To a man, they all go for “What’s a bind variable?”. That would indeed be a worrying response, but not unprecedented. The response I actually got was “I have”.
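For the record, what I was asking for was something along these lines. It is a minimal sketch in which the values given to l_tab_name and l_id are made up for illustration; the table name still has to be concatenated because an object name cannot be bound, but the id is passed with USING.

```sql
DECLARE
  l_tab_name VARCHAR2(30) := 'TABLE_B';  -- hypothetical source table
  l_id       NUMBER       := 42;         -- hypothetical id value
BEGIN
  -- Only the table name varies in the SQL text; the id is a bind variable,
  -- so there is one cursor per source table rather than one per id.
  EXECUTE IMMEDIATE
    'insert into table_a select * from ' || l_tab_name || ' where id = :id'
    USING l_id;
END;
/
```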
Posted by anotherdavewebster on August 24, 2010
A few years back, some colleagues of mine were bemoaning the fact that functions created as DETERMINISTIC weren’t actually implemented as such in 10.2. Just to be clear, this is no longer the case with Oracle 11. DETERMINISTIC functions actually are deterministic:
CREATE OR REPLACE FUNCTION non_det_fn( i_parameter IN NUMBER )
  RETURN NUMBER
IS
BEGIN
  DBMS_OUTPUT.put_line( 'Called non_det_fn with i_parameter = ' || i_parameter );
  RETURN i_parameter;
END;
/

CREATE OR REPLACE FUNCTION det_fn( i_parameter IN NUMBER )
  RETURN NUMBER DETERMINISTIC
IS
BEGIN
  DBMS_OUTPUT.put_line( 'Called det_fn with i_parameter = ' || i_parameter );
  RETURN i_parameter;
END;
/

DECLARE
  l_result PLS_INTEGER;
BEGIN
  dbms_output.put_line( 'Standard function' );
  FOR i IN 1 .. 5 LOOP
    l_result := non_det_fn( 1 );
  END LOOP;
  dbms_output.put_line( 'Deterministic function' );
  FOR i IN 1 .. 5 LOOP
    l_result := det_fn( 1 );
  END LOOP;
END;
/
You get pretty much what you would expect from dbms_output:
Standard function
Called non_det_fn with i_parameter = 1
Called non_det_fn with i_parameter = 1
Called non_det_fn with i_parameter = 1
Called non_det_fn with i_parameter = 1
Called non_det_fn with i_parameter = 1
Deterministic function
Called det_fn with i_parameter = 1
On 10.2, Oracle doesn’t yet implement the deterministic functionality and we get:
Standard function
Called non_det_fn with i_parameter = 1
Called non_det_fn with i_parameter = 1
Called non_det_fn with i_parameter = 1
Called non_det_fn with i_parameter = 1
Called non_det_fn with i_parameter = 1
Deterministic function
Called det_fn with i_parameter = 1
Called det_fn with i_parameter = 1
Called det_fn with i_parameter = 1
Called det_fn with i_parameter = 1
Called det_fn with i_parameter = 1
Now this is not news of course but I did stumble on a rather surprising case where Oracle did implement the functionality in 10.2:
SELECT non_det_fn( mod_number )
     , det_fn( mod_number )
  FROM ( SELECT MOD( ROWNUM, 3 ) mod_number
           FROM DUAL
        CONNECT BY ROWNUM <= 10 );
dbms_output gives us:
Called non_det_fn with i_parameter = 1
Called det_fn with i_parameter = 1
Called non_det_fn with i_parameter = 2
Called det_fn with i_parameter = 2
Called non_det_fn with i_parameter = 0
Called det_fn with i_parameter = 0
Called non_det_fn with i_parameter = 1
Called non_det_fn with i_parameter = 2
Called non_det_fn with i_parameter = 0
Called non_det_fn with i_parameter = 1
Called non_det_fn with i_parameter = 2
Called non_det_fn with i_parameter = 0
Called non_det_fn with i_parameter = 1
It’s not the same as the PL/SQL implementation in 11.1 but it’s certainly more than I was expecting and may well come in useful at some point. Of course, you can make all functions “deterministic” by calling them from Scalar Subqueries but that’s another matter…
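As a quick sketch of the scalar subquery trick, using the non_det_fn function created above: wrapping the call in a subquery against DUAL lets Oracle cache the result for each distinct input value. How many calls are actually saved depends on the caching, so treat this as something to test rather than a guarantee.

```sql
SELECT ( SELECT non_det_fn( mod_number ) FROM DUAL ) AS cached_result
  FROM ( SELECT MOD( ROWNUM, 3 ) mod_number
           FROM DUAL
        CONNECT BY ROWNUM <= 10 );
```

Running it with dbms_output enabled shows how many times the function body was really executed.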