AWS Database Blog
Implement corruption protection mechanisms in Amazon RDS for Oracle
Amazon Relational Database Service (Amazon RDS) for Oracle allows customers to focus on innovation and application development, while Amazon RDS manages time-consuming database administration tasks, including provisioning, backups, software patching, monitoring, and hardware scaling.
Ensuring the integrity of data is crucial to customers, especially when it pertains to mission-critical systems where data inconsistency can result in serious financial consequences or reputational damage. Database block corruption is a phenomenon that poses a risk to the integrity of any Oracle database, irrespective of the hosting platform. A data block is the smallest logical unit of data storage in an Oracle database. Block corruption can be logical or physical and it occurs when the contents of a block has changed from what the Oracle database expects. Block corruption can be caused by a variety of reasons, including but not limited to faulty disks or controllers, memory corruption, firmware errors, or Oracle database software defects.
Logical corruption occurs when a block has a valid checksum but its contents are logically inconsistent. One cause of logical corruption is lost writes. For example, during normal transaction processing, changes made to data in the buffer cache is flushed to disk. Lost write happens when a false acknowledgement of a write is received from the I/O subsystem. Lost writes can be a result of hardware or software bugs in the operating system or I/O subsystem. Other causes of logical corruption include cache or memory corruption, or database software bugs.
Physical block corruption is also known as media corruption, it happens when the checksum or header of a block is invalid or when the header and footer of a block do not match, such that the block is unrecognizable to the Oracle database. Common causes of physical block corruption include disk or controller faults, storage firmware bugs, or memory issues.
In this post, we discuss some recommended solutions and practices you should adopt in order to protect your databases hosted on Amazon RDS for Oracle from database block corruption.
Overview
The nature of hardware and software systems is such that failures are inevitable. One of the pillars of the AWS Well-Architected Framework is Reliability, which provides guidance to customers on how to embed resilience and proven failure recoverability into their system architecture. The solution described in this post helps you build and run an Oracle database that is resilient to block corruption on Amazon RDS using a combination of AWS services and native Oracle features to prevent, promptly detect, and recover from block corruption.
Preventing block corruption in Amazon RDS for Oracle
Prevention is crucial when it comes to maintaining the integrity and reliability of Oracle databases. In this section, we explore effective methods for safeguarding your Amazon RDS for Oracle from block corruption.
Enabling Oracle block checking
You can proactively prevent physical block corruption by enabling block checking on the database by configuring some initialization parameters. In Amazon RDS, you manage database parameters using a container for engine configuration called parameter groups.
It is recommended to identify and fix any logical block corruptions in the database before setting block-checking parameters to prevent such blocks from being permanently irrecoverable.
To enable block checking in your Amazon RDS for Oracle database, set the following parameters that control protection levels:
Parameter name | Type | Default Value | Allowed Values | Recommended Value | Description | Overhead |
DB_BLOCK_CHECKSUM | Dynamic | TYPICAL | OFF, FALSE, TYPICAL, TRUE, FULL | FULL | Store checksum in db blocks and check during reads | 4% – 5% performance overhead depending on the workload |
DB_BLOCK_CHECKING | Dynamic | MEDIUM | FALSE, OFF, LOW, MEDIUM, TRUE, FULL | FULL | Header checking and data and index block checking | 1% – 10% performance overhead depending on the workload |
DB_LOST_WRITE_PROTECT | Dynamic | NONE | NONE, TYPICAL, FULL | TYPICAL | Enable lost write detection | Increase in the amount of redo generation |
Alternatively, you can enable block checking by setting only the parameter DB_ULTRA_SAFE. Using this parameter requires that you do not set any of the DB_BLOCK_*
and DB_LOST_WRITE_PROTECT
parameters. DB_ULTRA_SAFE
is a static parameter, which requires a reboot of your RDS instance to take effect.
Parameter name | Type | Default Value | Allowed Values | Recommended Value | Description | Overhead |
DB_ULTRA_SAFE | Static | TYPICAL | OFF, DATA_ONLY, DATA_AND_INDEX | DATA_AND_INDEX | Sets defaults for other parameters that control protection levels | 5% – 15% performance overhead with increase in the amount of redo generation |
For the most comprehensive block checking (recommended), set DB_ULTRA_SAFE
to DATA_AND_INDEX. This automatically sets the parameters DB_BLOCK_CHECKING
to FULL, DB_BLOCK_CHECKSUM
to FULL, and DB_LOST_WRITE_PROTECT
to TYPICAL. This setting has the highest performance impact on the database.
For a less comprehensive block checking, set DB_ULTRA_SAFE
to DATA_ONLY. This automatically sets the parameters DB_BLOCK_CHECKING
to MEDIUM, DB_BLOCK_CHECKSUM
to FULL, and DB_LOST_WRITE_PROTECT
to TYPICAL. The performance impact although high, is slightly less compared to the recommended setting.
Enabling block checking adds overhead, which varies based on your workload pattern and the parameters set. It is highly recommended that you perform thorough testing to ascertain the impact on your application before enabling database block checking in production. Amazon RDS simplifies testing by making it easy to create a clone of your database instance using the RDS snapshot feature. Test your applications against a clone copy of your production database with block-checking parameters enabled.
If you have RDS for Oracle replicas configured, enable these parameters on at least one of the replicas.
Using Amazon RDS for Oracle Replicas
Amazon RDS simplifies creation of physical copies of your primary database referred to as replicas. It allows up to five replicas for one primary database. Amazon RDS for Oracle Replica is a feature available only with Oracle Database Enterprise Edition. There are two replica types — mounted and read-only. Mounted replicas use Oracle Data Guard for replication, but do not accept user connections. Read-only is the default replica type, and uses Oracle Active Data Guard to transmit and apply changes from the primary database to all read replicas and accepts read only user connections. Read-only replicas require additional Oracle Active Data Guard license for the primary database and each of the read replicas.
To prevent block corruption use RDS for Oracle read-only replicas. RDS for Oracle read-only replicas use the Oracle Active Data Guard automatic block media recovery feature to automatically fix physical block corruption, which is the most common type of block corruption. However, it does not address logical block corruption, which can be prevented by setting the parameter DB_BLOCK_CHECKING
on the primary and the replica databases.
Detecting block corruption in Amazon RDS for Oracle
Oracle only reports corruption the first time it accesses a corrupt block; in other words, block corruption can be a latent issue and remains undetected for a long time. The longer it takes to detect corruption, the more difficult your ability to recover from it. Therefore, it is important to detect block corruption as soon as possible.
Automated monitoring and alerting for ORA-* errors in the alert log
When Oracle encounters a corrupt block, it reports errors such as ORA-01578, ORA-00600 or ORA-07445 in the alert log. We recommend you automate the monitoring and configure alerts for ORA-* errors in the alert log to ensure prompt detection of block corruption in your Amazon RDS for Oracle database. You can implement the solution documented in Enable notifications for block corruption on Amazon RDS for Oracle to generate automatic alerts. Include the replicas in your monitoring setup to ensure block corruption is caught early on the replicas too.
Validating the database using Recovery Manager
The primary solution for managing database backup and restore activities in Amazon RDS is based on RDS snapshots. However, Amazon RDS for Oracle provides support for some Oracle Recovery Manager (RMAN) functions, including creating database backups and validating database files. To ensure prompt detection of block corruption, it is recommended to frequently perform RMAN validation of your Amazon RDS for Oracle database files. You can use the Amazon RDS package rdsadmin.rdsadmin_rman_util
on the primary database to validate your database instance files, such as data files, control files, or server parameter files (SPFILEs). Refer Validating DB instance files in Amazon RDS documentation for details. For a comprehensive validation of the database, we recommend executing the procedure rdsadmin.rdsadmin_rman_util.validate_database
using the parameter p_validation_type
set to PHYSICAL+LOGICAL
to check for both physical and logical corruptions. Any block corruption detected is reported in the output of the RMAN validation, alert log and also in the V$DATABASE_BLOCK_CORRUPTION database view. We recommend that you schedule the database validation procedure during a lean period with a frequency smaller than your backup retention period. A more frequent validation can help prevent the detection of corruption that might be older than your latest available backup.
The following example schedules a validation job VALIDATE_DB_JOB
to run every Sunday at 6 AM.
Validating individual objects using Structured Query Language (SQL)
You can also detect block corruption on individual database objects such as tables and their partitions or indexes and their partitions by running the ANALYZE SQL command, which instructs Oracle to validate the structure of the specified object. The command either reports successful validation or an error message if it encounters data corruption. Running ANALYZE is recommended if you want to avoid running a full database validation. You can take corrective actions such as recreating indexes or restoring from backup if the ANALYZE command reveals corruption.
The following SQL command analyzes the table example_corrupt_table
.
The following SQL command analyzes the table example_corrupt_table
and all its indexes.
The following SQL command analyzes the index example_corrupt_index
.
Detecting corruption in Amazon RDS for Oracle Replicas
Amazon RDS for Oracle replicas, both read-only and mounted, use Oracle Data Guard for replication which can detect block corruption. The apply process will stop due to a corruption in the redo stream or when it encounters a lost write and ORA-01578 error will be reported in the alert log. Monitor replication lag closely, as a significant increase in lag may indicate a block corruption.
Recovering from block corruption in Amazon RDS for Oracle
The key to recovering from block corruption is having robust backup and recovery methods in place. One of the inherent benefits of Amazon RDS is that it simplifies backup and recovery using features such as automatic snapshots and continuous backups of archivelogs to Amazon Simple Storage Service (Amazon S3), which allows for point-in-time recovery.
Using Amazon RDS point-in-time recovery
When you detect logical or physical database corruption in your RDS for Oracle database, recover by restoring the database to a point in time before the corruption happened. With RDS point-in-time-recovery (PITR), you can restore to any time within the latest restorable time for a DB instance and the configured retention period. View the latest restorable time for each DB instance in the Amazon RDS Console or by running the AWS CLI describe-db-instances command. Amazon RDS has a maximum retention period of 35 days for point-in-time recovery. PITR creates a new RDS instance based on the state of your data at the specified point in time. You can access a good copy of your data once the restored instance is available.
Using Amazon RDS for Oracle Replicas
Amazon RDS for Oracle read-only replicas use the automatic block repair feature of Active Data Guard to automatically detect and repair physical block corruptions. Although physical block corruptions can be prevented to a certain extent using read-only replicas, logical block corruptions will still need setting the DB_BLOCK_CHECKING
parameter.
Mounted replicas do not repair corrupt blocks automatically. However, mounted replicas stop redo apply when a physically corrupt block is detected in the redo stream, making them consistent to the point of corruption. You can promote the replica as the new primary database to overcome physical corruption.
Skipping corrupt blocks
Alternatively, if the corrupt block is inconsequential and if the minor data loss (if any) is acceptable, you can either drop and recreate the segment or mark the segment to skip the corrupt block.
To identify the type of data stored in the corrupt block, run the SQL commands below.
If the corrupt block is a FREE block, you can safely ignore the corruption and wait until the block is re-used which will automatically re-format the block.
If the corrupt block belongs to an INDEX, drop and recreate the INDEX.
If the corrupt block belongs to a TABLE, refer AWS documentation on how to skip the corrupt block.
Recommendations at a Glance
The following table shows a brief summary of the block corruption prevention, detection and recovery options discussed in this post for the different RDS for Oracle deployment types.
Deployment type | Licensing requirement | Prevention | Detection | Recovery |
RDS for Oracle – no replica |
|
or
|
|
|
RDS for Oracle – mounted replica |
|
or
|
|
|
RDS for Oracle – read replica |
|
or
|
|
|
Conclusion
As an Oracle database user, database block corruption poses a risk to data integrity and reliability. Block corruption can be caused by hardware failure, firmware errors, memory corruption, software bugs, and even human error. If left unchecked, block corruption can lead to data loss, degraded performance, and even system crashes. It is essential to note that the risk of block corruption remains irrespective of the platform where the database is hosted.
In this post, we described how to use the features of Amazon RDS and other services, with native Oracle database options to prevent, detect, and recover from block corruption on Amazon RDS for Oracle.
If you are unsure on how to proceed further, or need technical assistance with corruption recovery, you can contact AWS Support to get help from experienced technical support engineers.
For feedback or suggestions on the content, leave a comment in the comments section.
About the Authors
Lanre Showunmi is a Sr. DB Specialty Architect with AWS Professional Services. He helps customers and partners to build robust solutions in AWS cloud.
Naveen Gowda is a DB Migrations Consultant with AWS Professional Services. He works closely with customers to migrate and modernize their databases.