helps ensure high drive reliability
by Gary Herbst
Operating electronic components such as disk drives at high temperatures can dramatically reduce their reliability. In many computer systems, failures in cooling components (such as clogged filters on fans) can go undetected for an extended time. The resulting stress can lead to unexpected failures and even data loss. To prevent this from happening, IBM has integrated temperature sensors into its new Ultrastar 9LP, 18XP, and 9ZX server disk drives. High temperature conditions are reported to the host system using the Self-Monitoring Analysis and Reporting Technology (S.M.A.R.T.) standard. Once the computer system is alerted to any temperature problems, the user or system administrator can take action.
This white paper describes how a new Ultrastar feature, the Temperature Indicator Processor (Drive-TIP), works and how it benefits users of data-intensive applications.
Today's applications require outstanding drive reliability
When it comes to capacity, performance, and reliability, one name stands above the rest: the IBM Ultrastar family of high-capacity, high-performance disk drives. IBM Ultrastar was the first drive family to implement the features now defined in the S.M.A.R.T. standard. IBM's implementation, called Predictive Failure Analysis* (PFA), monitors parameters such as head flying height, noise and signal amplitude, signal coherence, and writing parameters. PFA predicts impending drive failures using algorithms that are robust enough to help avoid failing good drives.¹ Likewise, IBM is first to market with temperature-sensing drives. Following on from PFA, the Drive-TIP feature is also expected to find widespread use as an aid to improving data availability.
Heat has a major effect on drive reliability
Figure 2 shows the dramatic effect that temperature has on the overall reliability of a hard disk drive. Deviations from a nominal operating temperature (assumed to be maintained over the life of a drive) can result in a deviation from the nominal failure rate. As the temperature exceeds the recommended level, the failure rate increases two to three percent for every one-degree rise above it. For example, a hard disk drive running for an extended period of time at five degrees above the recommended temperature can experience an increase in failure rate of 10 to 15 percent. Likewise, operating a drive below the recommended temperature can extend drive life.
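The arithmetic in the example above can be written out as a short Python sketch. It assumes, as the example does, that the two-to-three-percent figure is applied linearly per degree. The function name and the handling of temperatures at or below the recommended level (reported here as no increase) are illustrative assumptions, not part of the IBM data.

```python
def failure_rate_increase_percent(degrees_above_recommended):
    """Return the (low, high) estimated increase in failure rate, in percent.

    Assumes the linear rule of thumb from the text: 2 to 3 percent per
    degree above the recommended temperature. Temperatures at or below the
    recommended level are reported as no increase (an assumption; the text
    only says that cooler operation can extend drive life).
    """
    excess = max(0.0, degrees_above_recommended)
    return (2.0 * excess, 3.0 * excess)


low, high = failure_rate_increase_percent(5)
print(f"{low:.0f} to {high:.0f} percent")  # prints "10 to 15 percent"
```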
Several failure modes within a disk drive are exacerbated by temperature. Thermal tilt of the disk stack and actuator arms can occur very quickly and cause off-track writes, corrupting data on adjacent cylinders. Outgassing of the lubricants in the spindle motor and voice coil motor occurs at high temperatures (experienced over a relatively short 30-60 day time period), which can lead to stiction failures or a possible head crash. Over an extended period of time, the bearings can wear out and cause mechanical failures.
Heat can build up within computer systems due to a clogged fan, failure of air conditioning in a room, operating more drives than the cooling system can handle, and so on. Unfortunately, these conditions can go completely unnoticed until a failure occurs. Because of the essential nature of today's workstations and servers, such risks are unacceptable for many users. What is needed is a way to identify high-temperature situations before they affect data integrity.
Drive-TIP helps warn of extreme temperatures
Two temperature trip points have been preprogrammed into Ultrastar drives. The first trip point is defined by the system provider (or in some cases the system administrator) in the Vendor Unique Parameter Mode page (00h) in the drive. Typically, this is set to the expected nominal temperature. The default is 50 degrees Celsius. The second trip point is 65 degrees Celsius, the maximum allowable temperature of the base casting.
If the first temperature trip point is exceeded, Drive-TIP sets an internal flag in the drive. A warning is sent to the drive controller when the PFA interval timer expires. The Information Exception Control (IEC) mode page (1Ch) controls the interval for posting the PFA errors and warnings.
The drive microprocessor reads the temperature when it is powered on and every 25 minutes thereafter, as part of the Drive-TIP algorithm. The temperature warning is generated in compliance with the SCSI-3 standard as defined in Figure 4, which is a portion of Table 66 in the SCSI-3 Primary Commands (SPC) document, ASC and ASCQ Assignments. A unique Unit Error Code (UEC) of 22F is also returned on a subsequent Sense command.
Figure 4: SCSI-3 Definition of temperature warnings
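As a rough illustration of how host software might recognize the warning, the Python sketch below checks fixed-format sense data for the temperature warning. The figure is not reproduced here, so the code pair used (ASC 0Bh, ASCQ 01h, "WARNING - SPECIFIED TEMPERATURE EXCEEDED") is an assumption taken from the general SPC ASC/ASCQ assignments and should be checked against the drive's own specification. The function name and the hand-built sense buffer are hypothetical, and the vendor-specific Unit Error Code is not decoded.

```python
# Assumed ASC/ASCQ pair for "WARNING - SPECIFIED TEMPERATURE EXCEEDED".
TEMPERATURE_WARNING = (0x0B, 0x01)


def is_temperature_warning(sense):
    """Return True if fixed-format sense data carries the temperature warning."""
    if len(sense) < 14:
        return False  # too short to hold ASC (byte 12) and ASCQ (byte 13)
    response_code = sense[0] & 0x7F
    if response_code not in (0x70, 0x71):
        return False  # not fixed-format sense data
    return (sense[12], sense[13]) == TEMPERATURE_WARNING


# Hypothetical 18-byte sense buffer, built by hand for illustration only.
example = bytearray(18)
example[0] = 0x70   # current error, fixed format
example[2] = 0x01   # sense key: RECOVERED ERROR (illustrative choice)
example[7] = 10     # additional sense length
example[12] = 0x0B  # ASC
example[13] = 0x01  # ASCQ
print(is_temperature_warning(bytes(example)))  # prints "True"
```

The sense key that accompanies the warning depends on how the Information Exception Control mode page is configured, so the check deliberately looks only at the ASC and ASCQ bytes.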
When the first temperature trip point is exceeded, the sampling period changes from 25 minutes to 15 minutes. Also, a log entry is made in the permanent drive error log that includes the temperature and the Power-On Hours (POH) at the time the trip point was exceeded. As long as the temperature remains above the first trip point, the drive continues to create log entries. If the temperature exceeds the 65-degree trip point, the sampling period changes from 15 minutes to 10 minutes. Entries in the permanent drive error log then continue at the 10-minute interval.
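The sampling behavior described above can be modeled in a few lines of Python. This is a sketch of the logic as the text describes it, not the drive firmware: the function names, the list used as an error log, and the dictionary layout of a log entry are all hypothetical. The 50-degree and 65-degree values are the trip points given in the text.

```python
FIRST_TRIP_DEFAULT_C = 50  # first trip point value given in the text
SECOND_TRIP_C = 65         # maximum base casting temperature from the text


def next_sample_interval_minutes(temp_c, first_trip_c=FIRST_TRIP_DEFAULT_C):
    """Return the minutes until the next temperature reading."""
    if temp_c > SECOND_TRIP_C:
        return 10
    if temp_c > first_trip_c:
        return 15
    return 25


def poll(temp_c, power_on_hours, error_log, first_trip_c=FIRST_TRIP_DEFAULT_C):
    """Handle one reading: log it if above the first trip point,
    then return the minutes until the next reading."""
    if temp_c > first_trip_c:
        # Hypothetical entry layout: temperature plus power-on hours.
        error_log.append({"temp_c": temp_c, "poh": power_on_hours})
    return next_sample_interval_minutes(temp_c, first_trip_c)


log = []
for reading in (45, 55, 70):
    print(reading, poll(reading, 1000, log))
# prints "45 25", then "55 15", then "70 10"; log now holds two entries
```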
All log entries in the media error or hardware error logs also include the temperature at the time of the error. All unit starts and unit stops also include the temperature. In addition, the disk drive records the accumulated power-on hours during which the temperature is above each trip point and the maximum temperature experienced during the life of the drive. This information is stored in the non-customer data cylinders on the drive.
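One way to picture the lifetime statistics described above is as a small accumulator that is updated on every reading. The Python sketch below is illustrative only: the class, its field names, and the assumption that each reading represents the whole interval since the previous one are not taken from the drive's actual on-disk format.

```python
from dataclasses import dataclass


@dataclass
class TemperatureHistory:
    """Hypothetical model of the per-drive lifetime temperature record."""
    first_trip_c: float = 50.0
    second_trip_c: float = 65.0
    hours_above_first: float = 0.0
    hours_above_second: float = 0.0
    max_temp_c: float = float("-inf")

    def record(self, temp_c, elapsed_hours):
        """Fold one reading into the lifetime totals.

        Assumes the reading is representative of the whole elapsed interval.
        """
        self.max_temp_c = max(self.max_temp_c, temp_c)
        if temp_c > self.first_trip_c:
            self.hours_above_first += elapsed_hours
        if temp_c > self.second_trip_c:
            self.hours_above_second += elapsed_hours


history = TemperatureHistory()
history.record(48, 0.5)
history.record(56, 0.25)
history.record(67, 0.25)
print(history.hours_above_first, history.hours_above_second, history.max_temp_c)
# prints "0.5 0.25 67"
```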
Applications of Drive-TIP
If the warning is recognized by system users or administrators, corrective action can actually save data. Systems now have the information to vary cooling capacity based on component needs. For example, fan speed can be controlled based on temperature within the system, producing better reliability for customer data.
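As one hedged example of the fan-speed idea, a host could map the reported drive temperature to a fan duty cycle. The Python sketch below uses a simple linear ramp between the two trip points. The policy itself, the function name, and the 30 and 100 percent duty limits are assumptions made for illustration; the paper does not prescribe a control scheme.

```python
def fan_duty_percent(temp_c, first_trip_c=50.0, second_trip_c=65.0,
                     min_duty=30.0, max_duty=100.0):
    """Map a drive temperature to a fan duty cycle (hypothetical policy).

    At or below the first trip point the fan runs at min_duty, at or above
    the second trip point it runs at max_duty, and in between the duty
    cycle rises linearly.
    """
    if temp_c <= first_trip_c:
        return min_duty
    if temp_c >= second_trip_c:
        return max_duty
    fraction = (temp_c - first_trip_c) / (second_trip_c - first_trip_c)
    return min_duty + fraction * (max_duty - min_duty)


print(fan_duty_percent(45))    # prints "30.0"
print(fan_duty_percent(57.5))  # prints "65.0"
print(fan_duty_percent(70))    # prints "100.0"
```

A linear ramp is only one option; a real system might add hysteresis so the fan does not oscillate when the temperature hovers near a trip point.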
The Ultrastar Family: Storage Solutions for Data-Intensive Applications
Product description data represents design objectives and is provided for comparative purposes; actual results may vary depending on a variety of factors. Product claims are true as of the date of the first printing. This product data does not constitute a warranty. Questions regarding IBM warranty terms or the methodology used to derive this data should be referred to an IBM representative. Data subject to change without notice.
© International Business Machines Corporation 1997
* The following are trademarks or registered trademarks of the IBM Corporation in the United States, other countries, or both: IBM, Ultrastar, Drive-TIP, No-ID, and Predictive Failure Analysis.
IBM Storage Systems Division
IBM hard disk drive product information and technical support center:
IBM TECHFAX: 1-408-256-5418 (requires a touch-tone phone)