Wednesday, May 11, 2016

Increasing Performance through Data Guard

I came across an interesting problem. A customer's database kept reporting "Checkpoint not complete could not allocate log" in its alert log.

This error means lgwr was switching logs faster than dbwr could write out dirty buffers. Redo logs protect dirty buffers in case of a database crash. If a crash were to occur, transactions would be read from the redo logs (but not the archived logs), and applied to the database blocks.

Dirty buffers in the buffer cache could not be written fast enough to keep up with lgwr switching & archiving redo logs. And the database halts when this happens. It halts so that unwritten blocks in its buffer cache can be written out to disk.

I also found that logs were switching every 5 minutes or so. Googled it, and it seems that that also causes a performance hit. At every log switch, the database may also halt.

The database had an online redo log config of 3 x 50 MB. So, 150 MB of redo could be written before dbwr would be blocking the entire database because its dirty buffers contain data protected by an online redo log, about to be overwritten. Well, I know redo is very efficient, so a great many blocks would have to be updated. But, the number is relative. In a busy database, the number of dirty blocks waiting to be written could exceed the capacity of 150 MB redo.

For example, let's say 150 MB of redo "allows" a dirty buffer wait list of 10k blocks. Nice. Except, it's a busy database, and the buffer cache is huge. At every log switch, 100k blocks have to be written. Well, lgwr has no choice but to wait for dbwr to catch up, then it can safely overwrite the oldest online redo logfile.

In any case, seems a waste that data, written out to redo & archive, is halting the database. The Undo tablespace also generates redo. Makes me a little sad when I think about it :(.

Hmm, well, the redo logs are usually small to minimise the amount of redo that could be lost in case of catastrophic failure.

But, wait a second.

The database has a standby database maintained by Data Guard Broker. It's configured in Maximum Availability mode. The standby db has Standby Redo Logs (SRL's) (the primary also has them). The Oracle "Data Guard Concepts and Administration" manual has this to say about Maximum Availability mode:

"Like maximum protection mode, a transaction will not commit until the redo needed to recover that transaction is written to the local online redo log and to the standby redo log of at least one transactionally consistent standby database."

The same manual says:

"During a failover, Data Guard can recover and apply more redo data from standby redo log files than from the archived log files alone."

Well, if, that is the case - complete protection at commit - then, why use small redo logs? I suppose, if the network were down, and a very isolated hardware issue were to affect both copies of a redo log group, I might lose data, and to minimise the amount of data lost, the redo log could be kept small.

This is a bit of a Hobson's choice - it's preferable that the database crashes, than to lose data. Understandable, I suppose.

Except it's no longer 1990! Hardware is extremely reliable. I've never had a hardware issue affect redo logs, far less two copies of a redo log group. And, both conditions would have to occur for data loss - a network failure and a hardware fault affecting ALL copies of the redo log group.

Once there is no loss of both copies of a redo log group, I can't see why not make the redo logs much larger. It would mean less redo log switches. It would mean the lazy dbwr will not be halting the database because unwritten data has to be written.

Database startup will likely take longer, but, meh, how often do we do that?

The data protection offered by Data Guard allows us to increase the size of the redo logs without increasing the risk of data loss. In turn this increases database performance.

I settled on 5 x 1GB redo log groups.

The parameter archive_lag_target was already set to 3600, guaranteeing at least one switch per hour.

Update:

https://prutser.files.wordpress.com/2008/12/checkpointsukoug.pdf

Looks like I was right. This Oracle chap says on page 24:

"Online redo log files can only be too small, never too large!"

Sounds good to me.

No comments:

Post a Comment