I was recently asked to take a look at a problem with a nightly batch job which deletes historical data from some transaction tables. The requirement is as follows:
- I would like to retain three years of data in my transaction table.
- Each night, I’d like to delete data older than 3 years which means, providing my nightly job doesn’t fail, I’ll be deleting one day of data.
- One day of data is around two million rows and everyone would feel more comfortable if we deleted that data in chunks rather than all two million rows in a single transaction.
- I don’t have partitioning and/or it’s not appropriate to partition the data.
- I don’t have an index on transaction_date (the column used to determine the “age” of the data).
The general approach that had been taken was along the lines of:
DELETE FROM my_table WHERE transaction_date < trunc( add_months( sysdate, -36 ) ) AND rownum < 10000;
They would run this in a loop, committing after each delete until no rows were deleted.
I’m sure you can see the problem: with no index on transaction_date, we full scan the table around 2m/10k times. Even if we were happy with a single 2m row delete we’d still be taking a very long time to find the data to be deleted and poor old read consistency would be working it’s little socks off for the OLTP application reading and writing to this table.
The developers wanted to add an index in transaction_date. There’s a lot of sense in this: it would allow me to very efficiently find each batch of 10k rows I wanted to delete. The trouble is that I hate adding indexes, especially to big tables like this and even more so when they are there for a single process. Do I really need that extra index?
Not surprisingly, I had another suggestion (else what would be the point of this blog item?). There are other indexes already present on the table. There is one on account_id and one on transaction_id. You probably have a pretty good idea what these columns are and it’s the second one, the Primary Key that we’re interested in. As is fairly common, this is a number generated from a sequence at the time of insert. You may well see where I’m going here but, if not, then please read on.
Because the transaction_id is sourced from a sequence then it will increase in a generally contiguous manner*. The order of the rows sorted by transaction_id will be similar (although not exactly the same) to the order of the rows sorted by transaction_date. Although the transaction_id index can’t exactly identify the rows with the transaction_date I want, it can get me very close. So I rewrite the delete statement as follows:
DELETE
FROM my_table
WHERE transaction_id < ( SELECT min( transaction_id) + 20000
FROM my_table )
AND transaction_date < trunc( add_months( sysdate, -36 ) )
AND rownum < 10000;
So I can using the PK index to read the first 20k rows from the table, via the index. Obviously this 20k can be played with a little bit depending on how contiguous the PK values are and also how much the order of transaction_id can deviate from the order of transaction_date. As before, this statement should be run until no rows have been deleted.
Obviously there are cases where we can’t use this approach and we may well need to code some additional logic to cope with large gaps in sequence numbers but, in the case I was trying to solve, we avoided creating another index.
* This sequence does not CYCLE. If it did then it’s time to go back to the drawing board. We’re also running on a single instance DB so we don’t have to contend with multiple RAC nodes having multiple caches of the sequence values.