PV may be permanently corrupted
 
Author Topic: PV may be permanently corrupted  (Read 1398 times)
Michael
Administrator
Hero Member
*****
Posts: 539


« Reply #4 on: February 18, 2008, 09:06:19 AM »

Removed? as in rmdev -dl hdisk7; rmdev -dl vscsi0;

If so, this did not remove anything - other than the entries in the ODM. To have actually removed them you would need to do that AND also remove the vscsi client definition on your HMC.
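To make that concrete, here is a minimal dry-run sketch of the client-side sequence. The `run` helper and the `DRY_RUN` variable are made up for illustration and are not AIX tools; hdisk7 and vscsi0 are the names used above. It only prints the commands unless DRY_RUN is set to 0.

```shell
#!/bin/sh
# Hypothetical helper: print the command when DRY_RUN=1 (the default),
# otherwise execute it.
DRY_RUN="${DRY_RUN:-1}"
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# Remove the client-side definitions of a virtual disk and its parent adapter.
# -l names the device, -d also deletes its definition from the ODM.
remove_client_disk() {
    disk="$1"; adapter="$2"
    run rmdev -dl "$disk"
    run rmdev -dl "$adapter"
    # As long as the virtual adapter is still defined for the partition,
    # a later cfgmgr can be expected to configure adapter and disk again.
}

remove_client_disk hdisk7 vscsi0
```

With the default settings this prints the two rmdev commands and changes nothing, which is a reasonable way to review the sequence before running it for real.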

As to what could have gone wrong, given that the SAN was "down" at least partially: the most likely place to find any hint of an explanation is the errpt log on the VIOS, although I do not know whether communication errors between the SAN LUNs (data) and the VIOS are recorded there. Another place to check is the Service Focal Point on the HMC, for messages from the partition.

I suspect you did many things on the VIOS as well, e.g. with the commands mkvdev and (I hope) rmvdev.
Carl
Jr. Member
**
Posts: 7


« Reply #3 on: February 18, 2008, 07:13:25 AM »

Thank you both for your replies. Unfortunately there was no time left to try any of your suggestions. I solved this little inconvenience by removing the hdisk, including its vscsi device, from the client. Then I proceeded to remove the virtual adapter on the VIOS, but I left the vpath device (the data) intact. I rebuilt the whole setup, but used another (existing) virtual adapter (vhost). This solved the problem and the hdisk/volume group combination was no longer sick as a dog.
I hope I have given enough feedback. If anyone knows why this worked, please feel free to share it with all of us.
John Peck
Global Moderator
Senior Member
*****
Posts: 46


« Reply #2 on: February 09, 2008, 01:07:10 AM »


The errpt extract means there was a problem writing to a JFS (2) log area, which would be in the course of a write to a filesystem, maybe not on the same "disk".  Could be an unfortunate bad block under that area, could be a total loss of ability to write to the "disk", could be any number of things as Michael says.
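The hex words in the "Detail Data" part of the errpt extract can be turned into decimal to see which device and which error are meant. This is only a small sketch: the helper name `hex2dec` is made up, and I am assuming the first two words are the major and minor number and that the low word of the error code is the errno value.

```shell
#!/bin/sh
# Convert a 4-digit hex word from the errpt "Detail Data" section to decimal.
hex2dec() {
    printf '%d\n' "0x$1"
}

major=$(hex2dec 000D)   # JFS2 LOG MAJOR/MINOR DEVICE NUMBER, first word
minor=$(hex2dec 0001)   # JFS2 LOG MAJOR/MINOR DEVICE NUMBER, second word
errno=$(hex2dec 0005)   # ERROR CODE, low word

echo "log device major=$major minor=$minor errno=$errno"
```

Errno 5 is EIO (I/O error), which matches the J2_LOG_EIO label. On the system itself, something like `ls -l /dev | grep "13, *1"` should show which logical volume device has that major/minor pair.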

However, to explain the rest of it...

The "brin1app" VG was apparently not varied on and could not be varied on because there was not a quorum of VGDAs that agreed.

If it's a one "disk" VG, that's because there's a problem with that "disk" and in writing to the VGDAs (2) on it, presumably at the same time as the problem with the write to the JFS (2) log.

With a two "disk" VG you have two VGDAs on the first "disk" (added first or lowest numbered when added together), and one on the second.  You need two of the three to match, but, if it was one VGDA on the "disk" with two VGDAs that was noted as different to the others, that "disk" is considered bad, thus the other VGDA on it is also ignored and you only have one "good" VGDA which is not a quorum.  In that situation you can attempt to varyon after setting the VG to ignore quorum ("smit chvg" as one should with a two disk, hopefully mirrored pair of "disks").  At the point when a two "disk" VG loses quorum (which will be checked at each write to the VG), the VG is varied off immediately - in the case of it being rootvg, you are then stuffed.  This is why not using quorum checking is a good idea with two "disk" VGs, whether mirrored or not.

With a three or more "disk" VG, you get one VGDA on each disk, and then only the "disk" that has a problem will be declared bad and not in the quorum of the others, so there are fewer issues.
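The VGDA arithmetic above can be written down as a small sketch. The function names are made up, and the quorum rule is my assumption that a strict majority of all VGDA copies must still be good; it models only the counting, not what LVM does internally.

```shell
#!/bin/sh
# VGDA copies held by disk number $2 (1-based) in a VG of $1 disks,
# following the layout described above.
vgdas_on_disk() {
    ndisks="$1"; disk="$2"
    if [ "$ndisks" -eq 1 ]; then echo 2
    elif [ "$ndisks" -eq 2 ] && [ "$disk" -eq 1 ]; then echo 2
    else echo 1
    fi
}

# Total VGDA copies in a VG of $1 disks (2 for one disk, 3 for two, else one each).
total_vgdas() {
    if [ "$1" -le 2 ]; then echo $(( $1 + 1 )); else echo "$1"; fi
}

# Print "quorum" or "no quorum" for a VG of $1 disks after disk $2 goes bad.
quorum_after_loss() {
    total=$(total_vgdas "$1")
    lost=$(vgdas_on_disk "$1" "$2")
    good=$(( total - lost ))
    # Assumption: quorum means a strict majority of all VGDA copies.
    if [ $(( good * 2 )) -gt "$total" ]; then echo "quorum"; else echo "no quorum"; fi
}

quorum_after_loss 2 1   # first disk of two bad: 1 of 3 copies left
quorum_after_loss 2 2   # second disk of two bad: 2 of 3 copies left
quorum_after_loss 3 1   # one disk of three bad: 2 of 3 copies left
```

The first call shows the awkward two "disk" case: losing the "disk" that holds two VGDAs leaves no quorum. If I recall correctly, the command line equivalent of the smit panel for turning quorum checking off is `chvg -Qn` followed by the VG name, but check the chvg man page before relying on that.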

You have to have a VG varied on to do an fsck, or anything else, to the things in it of course. 

Is "fslv07" a filesystem that uses the log in question?  You may need the "logform" command to wipe the JFS (2) log, rather than have fsck replay it (with -n, fsck does not change anything; that only happens once you start using options other than -n).
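To find which log device logform would have to be pointed at, the type column of `lsvg -l` output can be filtered. This is a sketch only: the sample output, the log name loglv07 and the mount point are made up for illustration, the function name `jfs2_logs` is mine, and the VG has to be varied on before `lsvg -l` will give real output.

```shell
#!/bin/sh
# Print the names of jfs2log logical volumes found in `lsvg -l` style output
# read from stdin (column 1 = LV name, column 2 = type).
jfs2_logs() {
    awk '$2 == "jfs2log" { print $1 }'
}

# Made-up sample for illustration; real output comes from `lsvg -l brin1app`.
sample='brin1app:
LV NAME             TYPE       LPs     PPs     PVs  LV STATE      MOUNT POINT
loglv07             jfs2log    1       1       1    closed/syncd  N/A
fslv07              jfs2       64      64      1    closed/syncd  /brin1app'

for lv in $(printf '%s\n' "$sample" | jfs2_logs); do
    # Only print the command: logform destroys the contents of the log.
    echo "logform /dev/$lv"
done
```

The sketch prints the logform command instead of running it, since any transactions still sitting in the log are lost once it is reformatted, and the filesystems using that log should be unmounted first.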
Michael
Administrator
Hero Member
*****
Posts: 539


« Reply #1 on: February 08, 2008, 05:18:25 PM »

Well, this sort of error may have many causes - with the J2 log error just being a symptom. What I would want to see is the hdisk errors (errpt -N hdisk*), or in your case perhaps errpt -N hdisk7. This gives some indication of a possible error with the hdisk itself.

Further, there may be other errors, e.g. in communication between the system and SAN storage - assuming that hdisk7 was served by the SAN environment.

Another simple command that could help with knowing a bit about your system is:
# lspv to see the disks and their associated volume groups. Also,
# lsdev -Cc disk to see a bit about their connectivity.
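As a small sketch of how the lspv output can be used, the following picks out the disks that belong to one volume group. The function name `disks_in_vg` is made up, and the sample lines, including the PVIDs, are placeholders and not taken from the system in question.

```shell
#!/bin/sh
# List the disks that belong to volume group $1, given `lspv` style output
# on stdin (column 1 = hdisk name, column 3 = volume group).
disks_in_vg() {
    awk -v vg="$1" '$3 == vg { print $1 }'
}

# Made-up sample; the PVIDs are placeholders.
sample='hdisk0          00ce62de00000001                    rootvg          active
hdisk7          00ce62de00000007                    brin1app'

printf '%s\n' "$sample" | disks_in_vg brin1app
```

On the live system the same filter would be fed directly, as in `lspv | disks_in_vg brin1app`.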

What is running through my head right now is that AIX was not able to update data in the "corrupted" environment, and its concept of the hdisk - if it is in fact a LUN - is dead.

I further recommend you check the man page for errpt and learn how you can get all the error messages between a start and stop time, and maybe include all the messages from when you first get i/o or adapter (communication) errors, to the failure of your jfs log.
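For the start and stop times, errpt takes -s and -e with a timestamp in mmddhhmmyy form. A minimal sketch for building such a window follows; the helper name `errpt_stamp` is made up, the dates are only an example, and the arguments must be plain integers without leading zeros.

```shell
#!/bin/sh
# Build an errpt timestamp (mmddhhmmyy) from year, month, day, hour, minute.
# Pass plain integers without leading zeros, e.g. 2 and not 02.
errpt_stamp() {
    year="$1"; month="$2"; day="$3"; hour="$4"; min="$5"
    yy=$(( year % 100 ))
    printf '%02d%02d%02d%02d%02d\n' "$month" "$day" "$hour" "$min" "$yy"
}

# Example window: all of February 7, 2008.
start=$(errpt_stamp 2008 2 7 0 0)
end=$(errpt_stamp 2008 2 7 23 59)

# Print the command rather than run it, so the window can be checked first.
echo "errpt -a -s $start -e $end"
```

Widening the window to begin before the first i/o or adapter error should show the order in which things failed.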
Carl
Jr. Member
**
Posts: 7


« on: February 07, 2008, 09:17:58 AM »

Two days ago our DS4100 had a disk failure (the disk has since been replaced), and since then two machines have had trouble with PVs. On one of them I could quite easily remove the hdisk, vscsi, virtual i/o adapters and so forth and recreate a new disk. This solved the problem, but I never really understood what happened and whether there is another way of solving the problem. Since the other machine is a test machine, I would like to take this opportunity to share my problem with you and maybe find a nice way to solve it.
The following is the sequence of events.
errpt contains the following error:
---------------------------------------------------------------------------
LABEL:          J2_LOG_EIO
IDENTIFIER:     C1348779

Date/Time:       Thu Feb  7 09:35:29 CET 2008
Sequence Number: 15063
Machine Id:      00CE62DE4C00
Node Id:         tstapp2
Class:           O
Type:            INFO
Resource Name:   SYSJ2           

Description
LOG I/O ERROR

Probable Causes
ADAPTER HARDWARE OR MICROCODE
DISK DRIVE HARDWARE OR MICROCODE
SOFTWARE DEVICE DRIVER
STORAGE CABLE LOOSE, DEFECTIVE, OR UNTERMINATED

        Recommended Actions
        CHECK CABLES AND THEIR CONNECTIONS
        INSTALL LATEST ADAPTER AND DRIVE MICROCODE
        INSTALL LATEST STORAGE DEVICE DRIVERS
        IF PROBLEM PERSISTS, CONTACT APPROPRIATE SERVICE REPRESENTATIVE

Detail Data
JFS2 LOG MAJOR/MINOR DEVICE NUMBER
000D 0001
ERROR CODE
0000 0005
BUF STRUCTURE B_FLAGS
000C 0005
BLOCK NUMBER
0000 0008
---------------------------------------------------------------------------

$ lsvg brin1app
0516-010 : Volume group must be varied on; use varyonvg command.
$ varyonvg brin1app
0516-013 varyonvg: The volume group cannot be varied on because there are no good copies of the descriptor area.
$ varyoffvg brin1app
0516-010 lqueryvg: Volume group must be varied on; use varyonvg command.
0516-010 lvaryoffvg: Volume group must be varied on; use varyonvg command.
0516-942 varyoffvg: Unable to vary off volume group brin1app.
$ importvg -y brin1app hdisk7
0516-360 getvgname: The device name is already used; choose a different name.
0516-776 importvg: Cannot import hdisk7 as brin1app.
$ fsck -n /dev/fslv07
The current volume is: /dev/fslv07
Open volume read-only returned, rc = 6
fsck: 0507-289 Device unavailable or locked by another process. Cannot continue.

Just supplying as much info as possible. Hope to hear from you soon.
Copyright 2001-2008 Michael Felt and ROOTVG.NET