ROOTVG

AIX => Administration => Topic started by: gz3xzf on May 13, 2008, 11:24:44 AM



Title: Failure of a disk in a mirrored rootvg.
Post by: gz3xzf on May 13, 2008, 11:24:44 AM
One and all, I have a question. Suppose you have a server (AIX 5.2 ML7) with a rootvg that is mirrored across two internal SCSI disks (hdisk0 and hdisk1, obviously) and is running happily. I have never witnessed the failure of a root disk and was wondering what you would expect to happen: would the system crash and reboot on the working disk, carry on running on the good disk, or crash completely and require manual booting on the good disk? (Assume that the mirrored rootvg has been created using the IBM recognised procedure, i.e. bosboot has been run, bootlist reconfigured, etc.)

A secondary question is, why is the dumplv never mirrored, does anybody know? I normally mirror it manually after the mirrorvg command has completed.
 :)


Title: Re: Failure of a disk in a mirrored rootvg.
Post by: kondoor on May 13, 2008, 08:06:27 PM
I just went and pulled hdisk0 out of my running test server, which has hdisk0 and hdisk1 mirrored, and I got a nice errpt entry and the server kept running just fine.  So if it fails, your server should stay up.  I cannot answer about dumplv as I do not know.


Title: Re: Failure of a disk in a mirrored rootvg.
Post by: gz3xzf on May 13, 2008, 08:18:45 PM
Quote from: kondoor
I just went and pulled hdisk0 out of my running test server, which has hdisk0 and hdisk1 mirrored, and I got a nice errpt entry and the server kept running just fine.  So if it fails, your server should stay up.  I cannot answer about dumplv as I do not know.
Cor  :o :o can I come and work with you? We have plenty of servers but none we can just have fun with, they all seem to be used for something meaningful!!!
Thanks for that.


Title: Re: Failure of a disk in a mirrored rootvg.
Post by: kondoor on May 13, 2008, 08:24:15 PM
 :D it's nice having a test server. BTW, if you pull both your mirrored drives your system will crash.  I needed to redo the test machine anyway.


Title: Re: Failure of a disk in a mirrored rootvg.
Post by: Michael on May 13, 2008, 09:04:51 PM
The command mirrorvg mirrors all the logical volumes that are present on rootvg at the time the command is run, so I suspect that your dumplv was made after mirrorvg of rootvg had been run. Simply run mklvcopy to make a copy of your dumplv and it will be mirrored as well.

Further, mirrorvg on rootvg runs chvg -Qn rootvg to disable quorum checking, so that if the disk with 2 VGDAs goes bad, rootvg stays up (otherwise the volume group would go offline/varyoff).

With a mirrored rootvg the system should operate normally as long as the failed disk is visible. When the disk is physically removed (PVMISSING) you may have some difficulties rebooting, as AIX, by default, does not varyon a volume group with a PVMISSING disk if quorum checking is off.


Title: Re: Failure of a disk in a mirrored rootvg.
Post by: John Peck on May 13, 2008, 11:10:41 PM
Actually that's not correct Michael.

Mirroring rootvg does NOT mirror the dumplv if that is there as a separate LV dumplv,
and that's because writes to a mirrored dump device were not supported as they used to say.

However, the default, for a time anyway, was for dumping in the paging space and that is
mirrored - go figure  :o

I do not recommend mirroring your dumplv, nor using the paging space as the dump area.
Use sysdumpdev to change settings.

Note also that the quorum change requires a reboot to take effect, otherwise
you carry on with quorum required and one of the disks with 2 VGDAs, the other with 1 VGDA.

The result being that if you happen to kill the disk with the 2 VGDAs on it before you reboot
to drop the quorum requirement, then your rootvg will lose quorum and hang.
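
The VGDA arithmetic behind that can be sketched in a few lines of shell. This is only a model of the rule that a volume group with quorum enabled needs more than half of its VGDAs reachable; has_quorum is a made-up helper name, not an AIX command, and nothing here touches LVM.

```shell
# Quorum model: with quorum checking on, a volume group stays online only
# while MORE than half of its VGDAs are reachable. Pure arithmetic.
# has_quorum is a hypothetical helper, not an AIX command.
has_quorum() {
    total=$1        # VGDAs in the volume group
    reachable=$2    # VGDAs on disks that can still be accessed
    [ $((reachable * 2)) -gt "$total" ]
}

# Two-disk rootvg: one disk carries 2 VGDAs, the other 1 (3 in total).
if has_quorum 3 2; then echo "1-VGDA disk lost: quorum kept"; else echo "1-VGDA disk lost: quorum lost"; fi
if has_quorum 3 1; then echo "2-VGDA disk lost: quorum kept"; else echo "2-VGDA disk lost: quorum lost"; fi
```

So which disk dies matters until the quorum requirement has actually been dropped.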

Just so happens that this happened at a customer of mine the other week
- they plugged in an un-terminated external SCSI tape and blew up the SCSI card
that had one of the two rootvg disks on the internal bus.  (So much for independent
buses.)  The effects were most odd - errors about not being able to write disks
anywhere basically. 

After pulling a mirrored disk out and re-inserting it, you will/should have lots of
stale PPs in lsvg.  That will need to be syncvg matched up again, but it might require
the disk to be "removed" logically and re-added - worthy of further testing kondoor
as you'll be in that situation.  I don't think it's as nice as just plug it back in and
continue if the system was doing anything to the VG at the time - you will
remain highly vulnerable running on one disk until it's fixed properly.

Secondly, I do not believe the comment that rootvg will not vary on at boot with a PVMISSING disk
- I have in the past pulled a disk in a rootvg and booted on only one of them.
Precisely the sort of thing you might well want to do if you're testing a configuration
properly before using it ! 

Pretty sure I've done it with SSA disks too, but that was a long time ago.
Anyway, it would be amazingly stupid if it were set up to not vary on with a
missing disk !

P.S.
Loving the option for Recent Posts off the main menu - didn't he do well  8)
And if you're not seeing Michael's new logo yet, with the green tag line, it's
because the old image will still be cached in your browser.


Title: Re: Failure of a disk in a mirrored rootvg.
Post by: gz3xzf on May 14, 2008, 08:36:14 AM
Quote from: John Peck
I do not recommend mirroring your dumplv, nor using the paging space as the dump area.
sysdumpdev to change settings.
If the dumplv is not mirrored and the system is not recording dump information in the paging space; what will happen to a system if the root disk with the dumplv device on fails?

I have also asked IBM what they think will happen and so far the answer seems to be, anything! The system may carry on running if the failed disk hasn't screwed the SCSI bus, it may crash if the failure has screwed the SCSI bus.


Title: Re: Failure of a disk in a mirrored rootvg.
Post by: John R Peck on May 14, 2008, 12:46:59 PM
If your disks are on separate SCSI cards (not just buses), i.e. you cover any single points of failure,
and follow the mirrorvg instructions, reboot etc,
then
any disk failing will not be a problem.

Indeed you might want to test that at least once by physically pulling a disk, fixing, testing the other,
just to be sure your system is working the way you intend, before you use it in anger.
(This comes under the heading of final testing in my Build Guide, hopefully not too final  ??? - backup first - and I wouldn't try it on an old existing system as in extreme circumstances you may find as I did once that dust stops the connectors working when you try to plug the disk back in.)

If the paging space is your dump area, dumps do work in my experience, but the
official advice was that writes to mirrored dump devices were not supported.  Don't know why, but you can imagine that it's a bit much to ask a dying system that feels like taking a dump to also handle all the LVM and mirror write it, or for a re-syncing VG to guarantee the dump stays if there were a fault in one of those PPs that needed to be resynced.  So my preference is to use the unmirrored dumplv as the dump device. 

If an unmirrored dumplv is on a dead disk, it shouldn't be a problem for everything else to continue.  The only time anything is written to the dump area is when there's a dump of course.  However I do recall statements that the dump device must always remain open.  So although no dumps are possible if your area is missing, I don't know what happens if you then try to use it.  One for testing to see what exactly happens, if you feel like also forcing a dump after killing the dump area disk.  However, unless you are trying to resolve a problem that needs a good dump, there seems very little point in worrying about the dump device on a failed disk - better deal with one problem at a time.  Normally, how often do you need to take a dump  ;) hardly ever in my case  ;D and even then, how often do you analyse it (I gather on mainland Europe they do it all the time whereas in the UK we have designed the area differently and just flush it away :D )

What happened the other day, where quorum checking was still in place (not rebooted after mirrorvg) and the disk with the majority of VGDAs was killed, was that with just one dead disk, the VG was then also dead.  The system was running in memory only then, unable to write to disk, rootvg apparently locked.  So whatever was in memory seemed fine, anything else was error messages and weird ones too.  sync ; halt didn't complain; rebooting apparently reset whatever SCSI fault and it came up all OK again.  There were logs showing fsck rollbacks on certain areas.  I assume the card fuse blew and reset as a result of the non-terminated hot-plug connection - which again is not supported, but in most cases is never a problem, just bad luck.


Title: Re: Failure of a disk in a mirrored rootvg.
Post by: kondoor on May 14, 2008, 02:24:07 PM
Quote
After pulling a mirrored disk out and re-inserting it, you will/should have lots of
stale PPs in lsvg.  That will need to be syncvg matched up again, but it might require
the disk to be "removed" logically and re-added - worthy of further testing kondoor
as you'll be in that situation.  I don't think it's as nice as just plug it back in and
continue if the system was doing anything to the VG at the time - you will
remain highly vulnerable running on one disk until it's fixed properly.

Just to clarify, I just pulled the drive to see if the system would stay up and run.  I didn't try to reinsert the drive.  After I pulled the first drive and let it run for a while, I pulled the second drive in the mirror just to see if it crashed instantly, and it did.  I did a clean install last night. If I get time today, I will pull the drive again and see what it takes to remirror that vg, most likely only a sync, but I am now curious to see if I have to do a rmdev on it first and then add it back in.


Title: Re: Failure of a disk in a mirrored rootvg.
Post by: Michael on May 18, 2008, 11:17:29 PM
No, you should not have to do a rmdev. You have the option of running cfgmgr to rediscover the disk state, or you can just run mkdev -l hdiskX to bring the disk back online yourself.

You do not need to run rmdev - because the disk is already in a Defined status.
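
Put together with the syncvg and lsvg remarks earlier in the thread, the re-insert case might look like the sketch below. It is a dry run that only prints commands; resync_plan is an invented name, the varyonvg step is my assumption about how the missing PV gets marked active again, and the whole order wants testing on a box you can afford to lose.

```shell
# Dry-run sketch: prints the commands instead of running them.
# resync_plan is a hypothetical helper; the disk name is a placeholder.
resync_plan() {
    disk=$1
    echo "mkdev -l $disk"      # or cfgmgr: bring the Defined disk back to Available
    echo "varyonvg rootvg"     # ASSUMPTION: re-reads PV state so the missing disk goes active
    echo "syncvg -v rootvg"    # resynchronise the stale partitions
    echo "lsvg rootvg"         # check the STALE PPs count afterwards
}

resync_plan hdisk0
```

Review the printed list before typing any of it on a live system.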


Title: Re: Failure of a disk in a mirrored rootvg.
Post by: aixdude71 on May 20, 2008, 02:26:07 PM
Quote
One and all, I have a question. Suppose you have a server (AIX 5.2 ML7) with a rootvg that is mirrored across two internal SCSI disks (hdisk0 and hdisk1, obviously) and is running happily. I have never witnessed the failure of a root disk and was wondering what you would expect to happen: would the system crash and reboot on the working disk, carry on running on the good disk, or crash completely and require manual booting on the good disk? (Assume that the mirrored rootvg has been created using the IBM recognised procedure, i.e. bosboot has been run, bootlist reconfigured, etc.)

A secondary question is, why is the dumplv never mirrored, does anybody know? I normally mirror it manually after the mirrorvg command has completed.

For the first part - it will happen. Disks spin and have tiny little ball bearings that will wear out. Just a matter of time. You will receive an alert via errpt and whatever other non-AIX trap/alert tools you have.

What you will do is call IBM and they will dispatch a CE with your new drive. You will need to provide the system serial number (prtconf) and the FRU of the failed drive (lscfg -vl hdiskX). You will then break the rootvg mirror and reduce the disk out of the vg, then let the CE replace the drive using diag and the hot plug task. You then extend it back into rootvg, re-mirror, and do a bosboot on the new disk. If you don't do things in that order you may have problems.
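
As a rough sketch of that order, here is a dry run that only prints the commands. replace_plan and its two arguments are invented for illustration, the final bootlist step is an assumption carried over from the original question rather than something stated in this post, and the new drive may not come back under the same hdisk name.

```shell
# Dry-run sketch of the replacement order; prints commands, runs nothing.
# replace_plan is a hypothetical helper; disk names are placeholders.
replace_plan() {
    bad=$1      # failed disk
    good=$2     # surviving mirror disk
    echo "lscfg -vl $bad"                  # note the FRU number for the IBM call
    echo "unmirrorvg rootvg $bad"          # break the mirror
    echo "reducevg rootvg $bad"            # take the disk out of rootvg
    echo "# CE swaps the drive via diag (hot plug task)"
    echo "extendvg rootvg $bad"            # add the new disk to rootvg
    echo "mirrorvg rootvg $bad"            # re-mirror
    echo "bosboot -ad /dev/$bad"           # write a boot image to the new disk
    echo "bootlist -m normal $good $bad"   # ASSUMPTION: restore both disks in the boot list
}

replace_plan hdisk1 hdisk0
```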

Second part - do a man on sysdumpdev - you should create a primary and secondary dump space, rather than mirroring the one dump lv.

System dumps can be huge and are not system critical (other than cause analysis). They are also many times worthless even when fully captured.
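
The primary/secondary dump arrangement suggested above could be sketched like this. It is a dry run that only prints commands; dump_plan, the LV names dumplv0 and dumplv1, and the size of 8 PPs are all placeholders I made up, so check the man page and size the LVs from the sysdumpdev -e estimate before using any of it.

```shell
# Dry-run sketch: one dump LV per disk, set as primary and secondary.
# dump_plan, dumplv0/dumplv1 and the PP count are hypothetical placeholders.
dump_plan() {
    primary_disk=$1
    secondary_disk=$2
    pps=$3      # size of each dump LV in physical partitions
    echo "sysdumpdev -e"                                          # estimate the dump size first
    echo "mklv -t sysdump -y dumplv0 rootvg $pps $primary_disk"
    echo "mklv -t sysdump -y dumplv1 rootvg $pps $secondary_disk"
    echo "sysdumpdev -P -p /dev/dumplv0"                          # permanent primary dump device
    echo "sysdumpdev -P -s /dev/dumplv1"                          # permanent secondary dump device
    echo "sysdumpdev -l"                                          # list the resulting settings
}

dump_plan hdisk0 hdisk1 8
```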