Monday, November 7, 2011

Replacing a Fiber Adapter (HBA) in IBM AIX 5.2.0 on a p595

One of the LPARs running on our p595 servers started reporting errors on one of its two Fiber adapters (fscsi1) in the error report. The errors appeared as Type: TEMP, as shown below.
825849BF   1106105211 T H fcs1           ADAPTER ERROR
B8FBD189   1106105211 T S fscsi1         SOFTWARE PROGRAM ERROR

LABEL:          FSCSI_ERR6
IDENTIFIER:     B8FBD189

Date/Time:       Sun Nov  6 10:52:31 EST
Sequence Number: 872005
Machine Id:      XXXXXX
Node Id:        XXXXXXX
Class:           S
Type:            TEMP
Resource Name:   fscsi1

LABEL:          FCS_ERR2
IDENTIFIER:     825849BF

Date/Time:       Sun Nov  6 10:52:31 EST
Sequence Number: 873473
Machine Id:      XXXXXXXX
Node Id:         XXXXXXX
Class:           H
Type:            TEMP
Resource Name:   fcs1
Resource Class:  adapter
Resource Type:   df1000fa
Location:        U5791.001.992083W-P1-C05-T1
VPD:

Description
SOFTWARE PROGRAM ERROR

Probable Causes
ADAPTER MICROCODE
SOFTWARE PROGRAM
SOFTWARE DEVICE DRIVER

Failure Causes
ADAPTER MICROCODE
SOFTWARE PROGRAM
SOFTWARE DEVICE DRIVER

        Recommended Actions
        IF PROBLEM PERSISTS THEN DO THE FOLLOWING
        CONTACT APPROPRIATE SERVICE REPRESENTATIVE
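
When the error report is long, it can help to count how often each resource logs a TEMP entry. The following is a minimal sketch, not part of the original procedure. It assumes the summary column layout shown in the listing above (identifier, timestamp, type, class, resource name, description), and the function name temp_errors_by_resource is made up for illustration.

```sh
#!/bin/sh
# Count TEMP (type "T") entries per resource from errpt summary lines on stdin.
# Assumed columns: IDENTIFIER TIMESTAMP TYPE CLASS RESOURCE_NAME DESCRIPTION
# temp_errors_by_resource is a hypothetical helper name, not an AIX command.
temp_errors_by_resource() {
    awk '$1 != "IDENTIFIER" && $3 == "T" { n[$5]++ }   # skip header, keep TEMP rows
         END { for (r in n) print r, n[r] }' | sort
}
```

On the LPAR itself this would be fed from the summary report, for example: errpt | temp_errors_by_resource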

This server was running DB2, and DB2 crashed around the same time we started getting these alerts. We found that DB2 had crashed because of superblock corruption in the filesystem, which left DB2 unable to create files. We unmounted the filesystem and corrected the superblock corruption by copying the secondary superblock over the primary one.
dd count=1 bs=4k skip=31 seek=1 if=/dev/LVXX of=/dev/LVXX
dd: 1+0 records in.
dd: 1+0 records out.
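
To make the dd arithmetic above concrete: with bs=4k, skip=31 reads the 4 KB block at index 31 and seek=1 writes it at index 1, so exactly one block is copied within the same device. The sketch below replays that geometry on a scratch file instead of a logical volume. The file, the marker string and the function name demo_superblock_copy are all made up for illustration, and conv=notrunc is added only because the target here is a regular file, which dd would otherwise truncate.

```sh
#!/bin/sh
# Replay the block-31 -> block-1 copy on a scratch file (NOT a real LV).
# demo_superblock_copy is a hypothetical name used only for this sketch.
demo_superblock_copy() {
    img=$(mktemp) || return 1
    # 32 zero-filled 4 KB blocks: indexes 0..31
    dd if=/dev/zero of="$img" bs=4k count=32 2>/dev/null
    # Marker at the start of block 31, standing in for the secondary superblock
    printf 'SECONDARY-SB' | dd of="$img" bs=4k seek=31 conv=notrunc 2>/dev/null
    # Same skip/seek/count as the command above, plus conv=notrunc for a regular file
    dd count=1 bs=4k skip=31 seek=1 if="$img" of="$img" conv=notrunc 2>/dev/null
    # Read back the first 12 bytes of block 1 to confirm the copy landed there
    dd if="$img" bs=4k skip=1 count=1 2>/dev/null | head -c 12
    echo
    rm -f "$img"
}
```

Calling demo_superblock_copy prints the marker that was originally written only into block 31.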
We contacted the SAN team to look into any SAN issues, and they did not find any problems on the SAN side.
We found that the adapter works for a while, then goes into a degraded state and eventually into a failed state.
We placed a service call with IBM to replace the adapter, and replacing the adapter fixed the issue. The steps followed to replace the Fiber adapter:

  • datapath query adapter --- To identify the faulty or degraded adapter
Adpt# Name State Mode Select Errors Paths Active
0 fscsi1 FAILED ACTIVE 2059041211 711 20 0
1 fscsi3 NORMAL ACTIVE 231765446 37 20 18

  • datapath remove adapter 0 --- To remove the adapter from SDD
  • rmdev -Rdl fcs1 --- To remove the adapter from the ODM. This also removes all child devices (hdiskXX). This step is not needed, because the Hot Plug Manager in the following step will do it for you, and leaving the device in place makes the adapter easier to identify
  • diag --- Choose Task Selection, Hot Plug Task, PCI Hot Plug Manager, then follow the tasks to replace the adapter
  • lscfg -vl fcs1 --- To identify the new WWN. Have the SAN team assign the LUNs to this adapter
  • cfgmgr -vl fscsi1 --- Run cfgmgr to rediscover the deleted paths to the LUNs
    cfgmgr -v
  • addpaths --- To add the paths to SDD
  • datapath query device --- To verify that all LUNs have the same number of active paths
     datapath query adapter
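
The first and last checks in the steps above can be scripted. The following is a minimal sketch, assuming the column layout of the datapath query adapter output shown earlier (Adpt#, Name, State, Mode, Select, Errors, Paths, Active). The function name degraded_adapters is made up for illustration and is not an SDD command.

```sh
#!/bin/sh
# List adapters whose State is not NORMAL, reading
# `datapath query adapter` output on stdin.
# Assumed columns: Adpt# Name State Mode Select Errors Paths Active
# degraded_adapters is a hypothetical helper name.
degraded_adapters() {
    # Only rows starting with an adapter number are data rows.
    awk '$1 ~ /^[0-9]+$/ && $3 != "NORMAL" { print $2, $3, $8 "/" $7 }'
}
```

Run as: datapath query adapter | degraded_adapters. Each output line gives the adapter name, its state and active/total paths. Empty output after the replacement means every adapter is reported as NORMAL.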