Dave Webster's Blog

Tinkering with Oracle

Archive for August, 2013

Do you really need another index?

Posted by anotherdavewebster on August 19, 2013

I was recently asked to take a look at a problem with a nightly batch job which deletes historical data from some transaction tables.  The requirement is as follows:

  • I would like to retain three years of data in my transaction table.
  • Each night, I’d like to delete data older than three years, which means, provided my nightly job doesn’t fail, I’ll be deleting one day of data.
  • One day of data is around two million rows and everyone would feel more comfortable if we deleted that data in chunks rather than all two million rows in a single transaction.
  • I don’t have partitioning and/or it’s not appropriate to partition the data.
  • I don’t have an index on transaction_date (the column used to determine the “age” of the data).

The general approach that had been taken was along the lines of:
DELETE FROM my_table WHERE transaction_date < trunc( add_months( sysdate, -36 ) ) AND rownum < 10000;

They would run this in a loop, committing after each delete until no rows were deleted.
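Sketched in PL/SQL, that loop looks something like this (my sketch of the approach described, reusing the names from the statement above; the loop could equally have lived in the batch script):

BEGIN
  LOOP
    DELETE FROM my_table
    WHERE  transaction_date < trunc( add_months( sysdate, -36 ) )
    AND    rownum < 10000;
    -- stop once a pass finds nothing left to delete
    EXIT WHEN SQL%ROWCOUNT = 0;
    COMMIT;
  END LOOP;
END;
/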

I’m sure you can see the problem: with no index on transaction_date, we full scan the table around 200 times (2m rows / 10k per batch).  Even if we were happy with a single 2m row delete, we’d still be taking a very long time to find the data to be deleted, and poor old read consistency would be working its little socks off for the OLTP application reading and writing to this table.

The developers wanted to add an index on transaction_date.  There’s a lot of sense in this: it would allow me to very efficiently find each batch of 10k rows I wanted to delete.  The trouble is that I hate adding indexes, especially to big tables like this, and even more so when they are there for a single process.  Do I really need that extra index?

Not surprisingly, I had another suggestion (else what would be the point of this blog item?).  There are other indexes already present on the table: one on account_id and one on transaction_id.  You probably have a pretty good idea what these columns are, and it’s the second one, the Primary Key, that we’re interested in.  As is fairly common, this is a number generated from a sequence at the time of insert.  You may well see where I’m going here but, if not, then please read on.

Because transaction_id is sourced from a sequence, it will increase in a generally contiguous manner*.  The order of the rows sorted by transaction_id will therefore be similar (although not exactly the same) to the order of the rows sorted by transaction_date.  Although the transaction_id index can’t exactly identify the rows with the transaction_date I want, it can get me very close.  So I rewrite the delete statement as follows:

DELETE
FROM   my_table
WHERE  transaction_id < ( SELECT min( transaction_id ) + 20000
                          FROM   my_table )
AND    transaction_date < trunc( add_months( sysdate, -36 ) )
AND    rownum < 10000;

So I use the PK index to read just the first 20k rows of the table.  Obviously this 20k can be played with a little, depending on how contiguous the PK values are and how far the order of transaction_id can deviate from the order of transaction_date.  As before, this statement should be run in a loop until no rows have been deleted.

Obviously there are cases where we can’t use this approach and we may well need to code some additional logic to cope with large gaps in sequence numbers but, in the case I was trying to solve, we avoided creating another index.

* This sequence does not CYCLE.  If it did then it’s time to go back to the drawing board.  We’re also running on a single-instance DB, so we don’t have to contend with multiple RAC nodes having multiple caches of the sequence values.

Posted in Patterns, performance | Leave a Comment »

How best to get data from Oracle into our Java code?

Posted by anotherdavewebster on August 2, 2013

I’m constantly surprised by how little information I find on the internet about this, so I thought I’d give a short summary of what we consider best practice to be and what other options we’ve looked at.  Let’s take the common case of wanting to return the results of a SQL query to the client Java code.  There are just boatloads of this code, so surely it’s worth trying to make it efficient.  In terms of performance, I’ve not found anything better than a PreparedStatement wrapping the query and executing over JDBC.  It seems that Oracle have optimised this route, as it’s probably the most common: they give us statement caching, fetch size and prefetch size, which all help.  I’m not really going to talk much about this approach.  Instead, I’m going to look at using a PL/SQL API (stored procedures) to access the data, which provides a number of advantages over executing queries directly from Java:

  • Impact analysis for schema changes is improved as packages will not compile if objects are changed.
  • It allows a comprehensive DB Unit Test to be constructed for the SQL.
  • It provides a level of abstraction from the DB allowing some DB changes to be made “transparently” to the middle-tier.
  • It allows security to be more tightly controlled as access is only granted to packages and not underlying tables.
  • It allows me to make multiple procedure calls and queries which result in a single result set.
  • There’s no “danger” of Java coders writing SQL queries.  There you go, I’ve said it, it’s out there.  You were all thinking it anyway.

So I think it’s fair to say that there are some advantages of a PL/SQL API.  The question is, what’s the best way of passing the data back to Java?  I’ve tried a few things over the years:

  • Returning data in an object array as a return value from a function or OUT parameter of a procedure.  It probably comes as no great surprise that the creation of these objects isn’t cheap.  My tests showed a not insignificant overhead.  It also requires the database to be full of TYPEs for each shape of result set (at least it did when I last played with it).
  • Returning data as a list of OUT parameters.  This works well (to a point – see below) for a single row of data.
  • Returning a refcursor for the result set.

It’s the last two options that I’ve been playing with recently.
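As a sketch, those two shapes might look something like this behind a package (account_api, the accounts table and the column names are all mine, purely for illustration):

CREATE OR REPLACE PACKAGE account_api AS
  -- a single row back through scalar OUT parameters
  PROCEDURE get_account( p_account_id IN  NUMBER,
                         p_name       OUT VARCHAR2,
                         p_balance    OUT NUMBER );
  -- a result set back through a refcursor
  PROCEDURE get_transactions( p_account_id IN  NUMBER,
                              p_results    OUT SYS_REFCURSOR );
END account_api;
/
CREATE OR REPLACE PACKAGE BODY account_api AS
  PROCEDURE get_account( p_account_id IN  NUMBER,
                         p_name       OUT VARCHAR2,
                         p_balance    OUT NUMBER ) AS
  BEGIN
    SELECT name, balance
    INTO   p_name, p_balance
    FROM   accounts
    WHERE  account_id = p_account_id;
  END get_account;

  PROCEDURE get_transactions( p_account_id IN  NUMBER,
                              p_results    OUT SYS_REFCURSOR ) AS
  BEGIN
    -- the client fetches from this cursor over JDBC
    OPEN p_results FOR
      SELECT transaction_id, transaction_date
      FROM   my_table
      WHERE  account_id = p_account_id;
  END get_transactions;
END account_api;
/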

The refcursor option is good in many ways, mainly because the process remains close to the optimised “query over JDBC” approach.  The only real downside I’ve found is a really annoying one which I was amazed to find out hadn’t been addressed: there’s no prefetch for refcursors through JDBC.

We use the thin JDBC client, so we fired up the thick client to see if that helped.  We made sure the prefetch value we were setting was actually being applied.  We tried the latest version of the thin driver.  Nothing.  We couldn’t get it to work.  Reading around on the internet, it seems that prefetching refcursors is possible through the PHP driver, and I found numerous pieces suggesting that prefetch over JDBC worked, but nothing concrete.  In the end, we got word from the man who knows, the JDBC man at Oracle.  He confirmed that it’s not supported.

This is a real nuisance.  The lack of prefetch means at least one extra round trip: one for the procedure call and one for the fetch from the refcursor.  It also reduces the benefit of returning multiple refcursors from a single call.  That would have given me the edge over direct SQL: I could have executed and fetched multiple queries with a single round trip.  I don’t like to compromise, I really don’t, but now direct SQL has an advantage over calling packages that return refcursors.  Irritating doesn’t cover it.
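For illustration, the sort of call I had in mind might have looked like this (a sketch with made-up names, reusing the hypothetical tables above):

CREATE OR REPLACE PROCEDURE get_account_summary( p_account_id   IN  NUMBER,
                                                 p_account      OUT SYS_REFCURSOR,
                                                 p_transactions OUT SYS_REFCURSOR ) AS
BEGIN
  -- two queries opened in one call: with refcursor prefetch this
  -- would have been a single round trip for both result sets
  OPEN p_account FOR
    SELECT name, balance
    FROM   accounts
    WHERE  account_id = p_account_id;
  OPEN p_transactions FOR
    SELECT transaction_id, transaction_date
    FROM   my_table
    WHERE  account_id = p_account_id;
END get_account_summary;
/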

Up until recently, our internal best practice said that for a single row of data, a procedure call with multiple OUT parameters was the way to go.  It doesn’t have the horrible (did I mention that I didn’t like it?) downside of the extra round trip we get with refcursors.  There is an issue though.  We created a stored procedure with 70 OUT parameters.  Everything looked fine until we ramped up the contention, and then it got really, really expensive.  When I say expensive, I mean properly expensive, i.e. lots of CPU.  The OS (Linux) shows very high system time, and a little digging reveals lots of slow mmap calls.  I may blog the detail at some point but, to summarise: directly selecting into >= 24 OUT parameters triggers lots of mmap calls on the underlying OS.  Some strange specifics:

  • The mmap calls only seem expensive under contention.  Running our tests single threaded still generates the calls but they are quick.
  • The size of the OUT parameters doesn’t have an impact i.e. 23 large parameters are as fast as 23 small parameters.  24 large parameters are as slow as 24 small parameters.
  • If you declare PL/SQL variables in the stored procedure, SELECT INTO those variables and then assign the values to the OUT parameters, the problem goes away (see the sketch after this list)!  This obviously provides us with a workaround, but who wants to publish that as a best practice?  I’d be laughed out of the room.
  • All of this was on 11.2.0.3.
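The workaround from the third bullet, sketched with made-up names (wide_table is hypothetical, and only a few of the 24+ parameters are shown):

CREATE OR REPLACE PROCEDURE get_wide_row( p_id   IN  NUMBER,
                                          p_col1 OUT VARCHAR2,
                                          p_col2 OUT VARCHAR2,
                                          p_col3 OUT VARCHAR2
                                          -- ...and so on, past the 24 mark
                                        ) AS
  l_col1 wide_table.col1%TYPE;
  l_col2 wide_table.col2%TYPE;
  l_col3 wide_table.col3%TYPE;
BEGIN
  -- SELECT INTO local PL/SQL variables first...
  SELECT col1, col2, col3
  INTO   l_col1, l_col2, l_col3
  FROM   wide_table
  WHERE  id = p_id;
  -- ...then assign them to the OUT parameters; selecting directly
  -- into the OUT parameters is what triggered the mmap calls for us
  p_col1 := l_col1;
  p_col2 := l_col2;
  p_col3 := l_col3;
END get_wide_row;
/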

I think that’s about the end of it for now.  In summary, we like the PL/SQL API.  We like refcursors as the method of passing data back, and we’ll like them even more when the JDBC guys give us the prefetch whose absence is, in my opinion, a significant oversight.  We quite like scalar OUT parameters BUT we know we need to avoid >= 24 of them!

Posted in Patterns, performance | Leave a Comment »

Wakey Wakey Mr Webster

Posted by anotherdavewebster on August 1, 2013

Has it really been two and a half years since I last posted? In that time I have changed roles a few times and travelled the world a little. I know it’s no excuse but there you have it. The good news is that while I’ve been idle, we’ve got ourselves onto Oracle 11, which means Nigel and I have better toys to play with. I have a few posts in mind and I reckon I’ll kick off with a general ramble about getting data from Oracle into Java. Let’s see how long it takes me to do the first real post.

Posted in Uncategorized | Leave a Comment »