RAID Recovery



dbritch- 12-31-2006
I'm looking for help in recovering from a problem on my RAID array. I would appreciate any assistance you can provide.

I have used an A8N-SLI Premium for nearly a year, with no major problems. I use a single system disk, a pair of disks striped using the nVidia controller, and 4 disks in a RAID-10 configuration using the SIL controller.

Yesterday, when I booted my system, the 4th disk in the RAID-10 set did not show up. It appeared to be missing. I powered down the system, checked the cables, and powered it up again. At first I was pleased - all 4 disks appeared. However, when I started up the SATARAID5 app to check the status, it indicated that the second disk was active, and all 3 remaining disks were available. As you might expect, Windows did not mount any drive from this RAID.

I powered down again, and looked at the system using the BIOS utility. It grouped disks 1, 3, and 4 together, and indicated invalid RAID configuration. It also wasn't too happy with disk 2. ;-)

So - how can I recover my RAID and my data?

I removed power from the four disks, so that I could use the rest of my system while sorting this out. One approach that has occurred to me is to plug in disks 1 and 3 - I presume that these disks should have the two sides of the striped array, and the two that appear the most likely to be damaged are disks 2 and 4.

Another is to power disks 1-3, hoping that I have an actual failure on disk 4, and that metadata on disk 4 is conflicting with metadata on the other disks.

I'm concerned about the possibility that the RAID controller will think it knows what to do, and will start to incorrectly rebuild a mirror, and wipe out data.

Thank you for any help on this!

David

Arlie- 12-31-2006
Before you do anything else, replace all four cables and see if that clears up the problem. We've had a rash of cable failures in here of late with the cables that ship with the mbs.

If that does not clear up the problem, you will need to test each drive independently to determine which has failed. Obtain a copy of the drive manufacturer's testing software. Unpower three of the four drives and plug the drive to be tested into a free NV port and test. Repeat for ALL four drives until you find the bad one(s). Replace that drive(s) with identical drive(s) and the array should rebuild itself. That's the safest option.

dbritch- 12-31-2006
Thank you for your suggestions. They are consistent with my understanding and expectations of RAID. I'll try them to the extent I can.

I started with a variant of your recommendations - I don't have immediate access to additional SATA cables. I can order a new set, or scavenge them from another array, but in the meantime...

Before testing each disk, I tried powering up (to BIOS) the system with only 3 of 4 disks powered up. My disks are numbered 0-3. The BIOS consistently reported that disk 1 was an invalid RAID array, and that disks 0, 2, and 3 were reserved.

I'm concerned even if/when I identify a bad disk or cable, that my metadata may be damaged, and may need to be rebuilt. I'm not sure how to do that...

David

Arlie- 12-31-2006
You don't have "metadata" on a RAID 1/0 array; you may be thinking about the parity data stored on the redundant disk in a RAID 5. What you have is essentially two RAID 0 arrays. Each disk in the primary RAID 0 has a failover disk in the second array.

The problem with trying to do anything until we know which disk is bad is that we don't know which of the four disks are in which array set. If you go playing with them before you determine specifically what went wrong, you run a very high risk of damaging data on one disk and you are then absolutely toast in terms of rebuilding. You really need to take your time on this one and play it by the book.

You can start testing each disk now before you get new cables. If you have one known good cable, you can use that to test the four drives immediately. If all four drives test out, you have a bad cable, a bad SATA port, or a corrupted array. At least you have that much figured out.

dbritch- 01-01-2007
Fair enough - I'll test each disk separately. I don't want to damage my data.
The drives are Seagate Barracudas. Is the Seatools online edition (available here: http://www.seagate.com/support/seatools/B7c.html) an adequate disk test? Is using this test and the current SATA cables likely sufficient to detect both failed cables and failed disks?

Could you elaborate on what you mean when you say that there is no metadata on a RAID-10 array? There are four disks - you need to know how they are paired. Furthermore, you need to know the block size for the striping. The specification of these things is metadata. It may be that with a 3114 controller, there is no choice of block size, or of layout - I don't know this controller. Is that what you meant? Or perhaps we're using different terminology.
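To illustrate why the pairing and the stripe size count as configuration metadata, here is a minimal Python sketch. It assumes a conventional RAID-10 layout in which chunks are striped round-robin across mirror pairs; the actual layout used by the 3114 controller is not known here, and the names `Raid10Config` and `locate` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Raid10Config:
    """Hypothetical configuration record; NOT the 3114's real on-disk format."""
    stripe_size_blocks: int  # blocks per stripe chunk (assumed value below)
    mirror_pairs: tuple      # e.g. ((0, 1), (2, 3)): each inner tuple is one mirror


def locate(config, logical_block):
    """Return (mirror pair holding the block, block offset on each disk of that pair)."""
    chunk = logical_block // config.stripe_size_blocks
    within = logical_block % config.stripe_size_blocks
    pair_index = chunk % len(config.mirror_pairs)      # round-robin across pairs
    chunk_on_pair = chunk // len(config.mirror_pairs)  # chunks already on that pair
    offset = chunk_on_pair * config.stripe_size_blocks + within
    return config.mirror_pairs[pair_index], offset


# Two assumed configurations that differ only in how the disks are paired.
config_a = Raid10Config(stripe_size_blocks=128, mirror_pairs=((0, 1), (2, 3)))
config_b = Raid10Config(stripe_size_blocks=128, mirror_pairs=((0, 2), (1, 3)))

print(locate(config_a, 128))
print(locate(config_b, 128))
```

With the same stripe size but a different pairing, the same logical block lands on different physical disks, which is why a controller that loses or misreads this information cannot assemble the array even when every disk is healthy.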

It sounds like you see the symptoms I'm observing as consistent with a failed disk. I was concerned that this may be something deeper, such as a failed controller. The enterprise-class RAID systems I've worked with before typically have much clearer and more consistent error reporting. I guess that's part of what you pay for in an enterprise-class system.

You indicated that the array should rebuild itself when I replace the failed drive. Shouldn't it also operate in degraded mode with the failed drive removed?
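As a sketch of what degraded mode should mean in principle: a RAID-10 set stays readable as long as every mirror pair still has at least one working member. The pairings below are assumptions for illustration (the real pairing on this array is unknown), and `array_is_readable` is a hypothetical helper, not part of any controller tool.

```python
def array_is_readable(mirror_pairs, working_disks):
    """True if every mirror pair has at least one working disk."""
    working = set(working_disks)
    return all(any(disk in working for disk in pair) for pair in mirror_pairs)


# Assumed pairing: disks 0+1 mirror each other, disks 2+3 mirror each other.
pairs = ((0, 1), (2, 3))

print(array_is_readable(pairs, [0, 2, 3]))  # one disk missing from the first pair
print(array_is_readable(pairs, [2, 3]))     # the whole first pair is missing
print(array_is_readable(pairs, [0, 2]))     # one survivor from each pair

# Under a different assumed pairing, the same two disks are not enough.
print(array_is_readable(((0, 2), (1, 3)), [0, 2]))
```

The last call shows why powering only two disks is a gamble without knowing the pairing: two disks cover the array only if they come from different mirror pairs.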

Thanks!

David

Arlie- 01-01-2007
The Seatools testing suite is about as good as it gets in the SATA world. That is an excellent choice.

Yes, we are using different terminologies. I don't consider stripe size, block size, and drive cardinality "metadata." Sorry for the confusion on that one. I come from a data warehousing background and I think of something totally different when the term "metadata" comes to mind. Using the term in the manner to which you are accustomed, you will not damage anything by removing drives from the array and testing them as stand alone drives. JUST DON'T WRITE to them!!! And, make sure you keep careful track of which drives you remove and replace them in exactly the same location on the same controllers when you're done testing.

I am still not sure you have a failed disk. The Barracudas have a great reputation for longevity. Unless you heard the disk grinding or whining, I doubt the problem is a bad disk. I think it much more likely that you have a bad cable.

As to whether or not the Sil Image array will operate in a degraded condition, I can't answer that because I don't use it on my machine and I have not encountered anyone that has experienced a "disk" failure on it. On an enterprise class raid 1/0, I would agree that it should function normally with one malfunctioning disk/cable. I'll be interested in your results on this one, as it will speak volumes about the use of mirroring on the Sil Image controllers. If the entire array craps when one disk/cable goes, then the value of mirroring is close to nothing.

dbritch- 01-01-2007
Thank you for your help in this.

I'm testing drives, and I've ordered cables. The price difference between ordering and purchasing locally is remarkable!

Is it likely sufficient to run the short disk self-test? Or should I plan to run the 20+ minute tests on each drive? I haven't heard any whining or grinding. I was rather surprised when the array abruptly stopped working.

Now I understand the difference in our terminology - I believe you're referring to filesystem metadata, and I was talking about lower-level RAID configuration metadata. In this system, is the RAID configuration stored in CMOS somewhere, or on the disks themselves? The thing that has concerned me most about this is that after the reboot on Saturday, the controller has not appeared to recognize that the disks - errors or no - are supposed to be in a mirrored and striped configuration. If the cables are bad enough to confuse the controller *that* much, I'm surprised the system has worked as well as it has for this long.

I decided to go ahead and order new cables for my other 3 drives, as well - just in case. I've always been a little concerned about these cables. They don't have a latch, and they are very easy to knock off of the connector.

David

Arlie- 01-01-2007
I'm afraid I can't answer your question with any authority. It is my conjecture that the stripe, block, and cardinality are stored within CMOS or in a secondary chip accessed by CMOS. However, I have not researched the Sil Image controller extensively.

Testing the drives on the NV controller using the quick test from Seagate should be sufficient. That testing should not damage the drives or their ability to function in the array, provided you unpower the rest of the drives before you remove one for testing. I base that statement on my experience with the NV raid ports, so take it with a grain of salt.

Personally, I run RAID 0 on these boards and rely on the backup. I don't yet trust the mainstream mb builders to build a reliable mirrored array. The SCSI guys have been doing this for two decades. The SATA folks are still wet behind the ears.

dbritch- 01-02-2007
All four disks have passed the short test. 3 passed the extended test, and the fourth is currently running the extended. New cables should arrive tomorrow, and I'll replace them.

What is the procedure that you expect to be necessary to bring my array up again?

It appears that the controller no longer believes that all four disks are a RAID-10 array. I presume that I'll need to program the CMOS (or wherever the config is stored). It would be nice if it simply figured that out when I replace the cables, but I suspect the real solution will be a bit more complicated....

Thanks!

David

Arlie- 01-02-2007
Actually, I expect it to pick up the entire array once you have four good cables and move on with life. If not, you'll have to enter the Sil Image BIOS (F4/Ctrl-S? on boot) and poke around gingerly to find the rebuild option.

If it does not bounce right back up, then the mb is suspect.

dbritch- 01-02-2007
OK - all four disks passed their tests. I hope you're right - that the system will pick up and move on. The new cables should be here in the morning. However, I've been somewhat suspicious of the motherboard/controller from the start.

I don't know how the Seagate tools test - the tools run something that is described as a self-test. I used the existing cables that the disks have been using for the last year or so when I ran the tests. I don't suppose this is an indication that the cables are OK? Does the test run natively on the disks, or does it load data across the cables to test?

I guess it doesn't matter much at this point - I'll find out tomorrow, anyway. ;-)

Thanks,

David

Arlie- 01-02-2007
Communication across the cable is required for the test, so the fact that you've tested all four successfully with their original cables indicates that the cables are OK with the port on which you tested them. Whether or not they are seating correctly on the Sil Image controllers may be another matter entirely. I have replaced all of my cables here with latching cables to make sure they seat correctly. I've had more than my fair share of issues with the stock ASUS cables working loose over time and causing problems. I have my fingers crossed for you.

dbritch- 01-03-2007
Looks like a motherboard issue. I replaced all four cables with new (clip-on) cables, and at the BIOS config, I see the same thing. Disks 0, 2, and 3 are reserved, and disk 1 is an invalid raid array. Not only that, it hung while getting the disk info.

The motherboard appears to have a 3-year warranty, and it's just under a year old. I guess I'll find out how Asus's warranty support is....

Thanks for your help in -*test*-('")ing and diagnosis!

David

Arlie- 01-03-2007
That is NOT what I was hoping you would find. If you purchased the board through a local or online retailer, I recommend that you RMA through them. ASUS support is very slow. That's one of the reasons the founder here created this forum.

Make sure you mark all of your disks and cables carefully so as to replace them on the new board in precisely the same locations. I am hoping that a bad controller did not scramble disk 1. If it did, you may need to reformat it and put it back in the array and rebuild, even after the new board.

dbritch- 01-04-2007
I bought it through NewEgg. However, I don't think I can RMA it through them. The invoice gives a phone number for Asus for support. I'll check, though.

Interestingly, I sent a query to Asus through the member site a couple of days ago. Their response suggested that they didn't read my email. They suggested that if I re-install Windows, and carefully install the device drivers, then this may eliminate the conflicts in the Device Manager. Apparently, either it was a canned reply, or it was intended to answer someone else's question.

I've been waiting for a few minutes for a response on Asus's livesupport page. Meanwhile, I'll contact NewEgg.

I'm glad that this site is here!

dbr
