= AST RESEARCH, INC.   TECHNICAL BULLETIN # 1158   4-26-95 =

MANHATTAN SMP HOT REPLACEMENT AND HOT SWAP

This is an informational bulletin.

The information contained in this bulletin is intended to assist
trained technicians in performing Hot Swap and Hot Replacement of hard
disks with the Empire DAC 960 disk array subsystem on the Manhattan
SMP. Knowledge of the Manhattan SMP and of the operation of the DAC 960
and its associated utilities is required.

Do not attempt a Hot Replacement of a hard disk unless the hardware and
software configuration of the Manhattan SMP, the Empire DAC 960, the
operating system, and the replacement drive are all set correctly.
Attempting a Hot Replacement with incorrectly configured hardware or
software may result in loss or corruption of data.

For more information regarding the Manhattan SMP, see the Manhattan SMP
Installation and Maintenance Manual, AST Part Numbers 001169-001 or
001725-001. For more information on the Empire DAC 960, see the AST
Disk Array Users Manual, AST Part Number 001724-001.

This document is divided into two sections. Section one lists problems
that may have caused a Hot Replacement drive to fail. Section two lists
problems that may have caused a Hot Stand-by to fail.

HOT REPLACEMENT FAILURES

If a Hot Replacement drive fails, verify that the following guidelines
have been met and check for the following error or configuration
conditions.

1. The replacement drive must have a capacity equal to or greater than
   that of the failed drive.

2. The replacement drive must be installed into the failed drive's
   drive bay.

3. Ensure that the drive being replaced is the drive intended to be
   replaced, and that the replacement drive's hardware configuration
   (i.e., SCSI ID) is as intended.

4. The replacement drive must have been low level formatted.

5. Ensure that the replacement drive does not have a SCSI ID conflict.
   A SCSI ID conflict may stop all the drives on the same channel and
   crash the system.

6. Ensure that the SCSI ID cables are connected correctly.
   Pin 1 of the SCSI ID cable must be connected to SCSI ID jumper
   pin 1. If the cable is reversed, the SCSI ID will not be set
   correctly.

7. Ensure that the termination is set correctly on the replacement
   drive. A termination conflict may cause the subsystem to fail.
   Termination is set on the backplane of the Manhattan SMP, so hard
   drives should have termination disabled.

8. Ensure that the drive is set to spin up when initialized by the
   DAC 960.

9. If the replacement drive cannot be set to Stand-by, wait 10 seconds
   and then try the command again (the drive may still be spinning up).
   Check that the drive status is Dead. If it is not Dead, use the Kill
   option in the DAC960TK utility to change the drive's status to Dead.

   Note: Use care with the Kill option. If the wrong drive has its
   status set to Dead by the Kill command, the system will crash.

HOT STAND-BY FAILURES

If a Hot Stand-by drive fails, investigate the following areas and take
the appropriate steps to bring the system back to its expected
configuration. Determine why the Hot Stand-by failed and take any
required action to ensure that the configuration will function with a
Hot Stand-by drive.

1. The Hot Stand-by drive may be the wrong capacity. The Hot Stand-by
   needs to be the same capacity as the failed drive. (See also
   Technical Bulletin number 1055, RAID Considerations with AST's Mylex
   DAC.) Perform a Hot Replacement of the failed drive, not of the
   Stand-by drive.

2. The Hot Stand-by drive may have been used previously as a data
   drive. If so, use the Hot Replacement method to replace the failing
   drive, then low level format the Hot Stand-by drive and set its
   status to 'Stand-by.'

3. A Hot Stand-by drive may have been installed when a Hot Replacement
   was required. Perform a Hot Replacement of the failed drive using
   its drive bay location.

4. A drive may have been killed using the DAC960TK 'kill' command.
   A Hot Replacement must be performed on the drive that was killed.
   The 'kill' option will not start a rebuild on a Hot Stand-by.

5. The RAID configuration may be either RAID level 0 or RAID level 7.
   These RAID environments are not redundant and cannot be rebuilt.
   The drive must be replaced and reconfigured, and the data restored
   from a backup. If no backup is available and there may be corrupt
   data on the RAID drive, the user can attempt to recover by setting
   the dead drive's status to 'on-line.'

6. The Hot Stand-by drive may have failed while rebuilding data.
   Perform a Hot Replacement on the Hot Stand-by drive. A new drive
   should be placed in the original failed drive's location, low level
   formatted, and set to 'Stand-by.'

7. The Hot Stand-by may have been installed into the wrong drive bay.
   Perform a Hot Replacement of the failed drive. When the failed drive
   is removed from the system and the new Hot Stand-by is installed,
   the new Hot Stand-by drive must be installed into the failed drive's
   drive bay. The new Hot Stand-by drive must be low level formatted,
   must spin up, and must be set to 'Stand-by.'

8. A new Hot Stand-by drive may have failed. Perform a Hot Replacement
   on either the failed drive or the Hot Stand-by. Use the DAC960TK
   utility's remap command to determine which drive needs replacement.
   The new Hot Stand-by drive must be low level formatted, must spin
   up, and must be set to 'Stand-by.'

9. Eight Hot Stand-by rebuilds may have been performed. Perform a Hot
   Replacement of the failed drive. Check the number of Stand-by
   Rebuilds using the remap option of the DAC960TK utility. Reset using
   the DAC960CF.BAT utility (view/edit - save).
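
The Hot Replacement guidelines in section one, together with the
non-redundant RAID levels named in section two, can be written down as
a simple pre-flight checklist. The Python sketch below is illustrative
only. The Drive record, its field names, and both functions are
hypothetical and are not part of DAC960TK, DAC960CF.BAT, or any other
AST utility; the values would have to be gathered by the technician by
inspection. It also assumes that any RAID level other than 0 or 7 is
redundant, which this bulletin does not state explicitly.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass
class Drive:
    """Hypothetical record of facts a technician notes by inspection."""
    bay: int
    scsi_id: int
    capacity_mb: int
    low_level_formatted: bool = False
    termination_enabled: bool = False   # backplane terminates, so expect False
    spins_on_init: bool = True


# RAID levels this bulletin names as non-redundant (cannot be rebuilt).
NON_REDUNDANT_LEVELS = {0, 7}


def can_rebuild(raid_level: int) -> bool:
    """Assumption: every level not listed above is redundant."""
    return raid_level not in NON_REDUNDANT_LEVELS


def hot_replacement_problems(failed: Drive,
                             replacement: Drive,
                             other_ids_on_channel: Set[int]) -> List[str]:
    """Return the guidelines from section one that the replacement violates."""
    problems = []
    if replacement.capacity_mb < failed.capacity_mb:
        problems.append("capacity is smaller than the failed drive")
    if replacement.bay != failed.bay:
        problems.append("not installed in the failed drive's bay")
    if replacement.scsi_id in other_ids_on_channel:
        problems.append("SCSI ID conflicts with another drive on the channel")
    if not replacement.low_level_formatted:
        problems.append("drive has not been low level formatted")
    if replacement.termination_enabled:
        problems.append("termination is enabled on the drive")
    if not replacement.spins_on_init:
        problems.append("drive is not set to spin up when initialized")
    return problems


# Example: a correctly prepared replacement for the drive in bay 2.
failed = Drive(bay=2, scsi_id=2, capacity_mb=1000)
new = Drive(bay=2, scsi_id=2, capacity_mb=1000, low_level_formatted=True)
print(hot_replacement_problems(failed, new, other_ids_on_channel={0, 1, 3}))
```

An empty list means none of the listed conditions were detected. It
does not confirm that the cabling is correct or that a rebuild will
succeed; those must still be checked on the hardware.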