Works with Linux, not with Windows
dbritch- 11-05-2007
I have an Asus A8N-SLI Premium with 4GB of RAM. On Friday morning, I updated my copy of WinSCP, and it said I needed to reboot. My system never came back up. Instead, it goes through a normal-looking BIOS process, and at the point where the Windows splash screen should appear, the screen stays blank.
I put in a Knoppix disk, and booted up Linux. It could see all my individual disk drives, and my NVIDIA raid (two drives striped together). It could not see my Sil-based mirrored pair, but I imagine that's because the standard kernel doesn't support it (although I haven't verified this).
So, I thought that my MBR may be corrupted. I put in my WindowsXP Pro installation disk, and booted from it. I started to run fixmbr and fixboot, but I found that the machine didn't seem to see my hard drives. Instead, it said it saw 4 drives, and none had a disk in it.
So, Linux can see the drives (and read data from them) but Windows could not. What could be causing the difference?
Remembering that (at least once, long ago) Windows accessed disks through BIOS calls and Linux did not, I thought my BIOS may be corrupted. (I'm running the latest non-beta release. I think it's 1009.) I don't have a floppy drive, so I thought I'd try the BIOS recovery feature. I put in the original CD, and booted up. The menu it presented offered lots of options to make a floppy disk, but that seems rather pointless, since I don't have a drive. There was no option presented to re-load the BIOS. Does this mean that the BIOS passed some sort of internal validation test?
I went back to the Windows install disk. When I ran diskpart to look at the disks again, I noticed that when it exited I got a BSOD. It said that the problem was with setupdd.sys. The problems appear to be consistent with a memory problem. Great! I thought! I just added my second 2GB a few weeks ago - maybe it's bad. I pulled out the new memory. Same problem.
I did some testing with the new memory and the old. The new memory *does* seem to have an occasional error (on test 7 of memtest, it seems to have two bits in error in one word). I also noticed that the automatic timings for memory are significantly different (such as 333MHz instead of 400MHz). I saw that someone else had memory problems until he manually set his memory voltage to 2.6V. So - I manually set my memory voltage to 2.6V, and set the timings to match the auto timings that I see with 2GB installed, and I'm currently running memtest on it. Haven't seen any errors, yet.
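For readers wondering what "two bits in error in one word" means in practice, here is a minimal Python sketch of the comparison a pattern-based memory tester performs: write a known pattern, read it back, and XOR the two values to see which bits flipped. The function names and the 32-bit word width are assumptions made for illustration; this is not how memtest itself is implemented.

```python
def flipped_bits(expected, actual, width=32):
    """Return the bit positions (0 = least significant) where two words differ."""
    mask = (1 << width) - 1
    diff = (expected ^ actual) & mask  # XOR leaves a 1 wherever the bits disagree
    return [bit for bit in range(width) if (diff >> bit) & 1]


def find_errors(written_words, read_words, width=32):
    """Compare what was written with what was read back, word by word."""
    errors = []
    for address, (written, read) in enumerate(zip(written_words, read_words)):
        bad_bits = flipped_bits(written, read, width)
        if bad_bits:
            errors.append((address, bad_bits))
    return errors


if __name__ == "__main__":
    written = [0xAAAAAAAA] * 4
    read_back = list(written)
    # Simulate a faulty word: two bits flipped at word index 2.
    read_back[2] ^= (1 << 3) | (1 << 17)
    print(find_errors(written, read_back))
```

A real tester repeats this with many patterns and access orders, which is why an intermittent fault may show up in only one test.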
So - any ideas on this? Why would it appear to work with Knoppix, but not Windows? Any suggestions for the next step, other than replacing the motherboard?
Thanks!
David
64dragon- 11-05-2007
to an extent, i had a similar issue where my system was unable to recognize my hard drives. i wanted to format windows to have a fresh install for the beginning of the semester but the XP install disk wouldn't see the drive. Arlie (one of the mods here) allowed me to borrow some parts he had lying around (board, ram and psu) and i still had the issue with his parts. i ended up upgrading to am2 (replacing board, ram and cpu) which runs fine. so i never figured out what the issue was, the only thing i didn't try was a different cpu.
dbritch- 11-05-2007
Thanks for your comments, 64dragon.
Yeah - I'm pretty sure that replacing the mobo, cpu, and memory will take care of the problem - I was hoping for something a little less expensive. ;-)
Maybe it's time to upgrade my system.
I will try to reload the BIOS again before giving up on the system. I keep coming back to the fact that Linux seems to be able to see my disks and files just fine! I wonder why Windows can't.
I guess I'll start looking at new motherboards and chipsets while I'm debugging....
64dragon- 11-05-2007
i'm not directly suggesting that you should upgrade your whole system, but just wanted to point out that the board may not necessarily be the issue as it was in my case. and at the time upgrading was the easier solution/ quick fix at the start of my senior yr of college (majoring in engineering) since i depend on a computer
good luck
if you are considering upgrading, depending on budget, and if you wish to stay with amd, the quad core phenom is supposed to come out on the 19th
Arlie- 11-05-2007
Before you rush out and replace the system, try recovering from your backup (please tell me you have one), using only the original two memory sticks. I would put money on the fact that you fubar'd the drives with memory that was throwing errors.
If you can't do that, use the linux disk to transfer your data to an external drive and reinstall XP. If the install fails with the original two mem sticks, then look for new hardware.
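As a rough illustration of the "transfer your data to an external drive" step, the sketch below copies a directory tree and re-reads every file to confirm the copy matches. It assumes a live environment with Python 3 available and both drives already mounted; the function names and any paths you pass in are hypothetical, and a plain file manager or `cp` would do the same job without the verification.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copy_and_verify(source_root, dest_root):
    """Copy every file under source_root to dest_root; return files that failed to verify."""
    source_root = Path(source_root)
    dest_root = Path(dest_root)
    failed = []
    for source in sorted(source_root.rglob("*")):
        if not source.is_file():
            continue
        target = dest_root / source.relative_to(source_root)
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(source, target)  # copy2 also preserves timestamps
        if sha256_of(source) != sha256_of(target):
            failed.append(source)
    return failed
```

An empty list from `copy_and_verify` means every copied file hashed identically to its source at the time of the check. It says nothing about whether the source data was already corrupted.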
dbritch- 11-05-2007
I'm not sure I follow you - I don't know how to reinstall XP when the XP installation disk cannot find any hard drives on which to install.
David
dbritch- 11-06-2007
64dragon - thanks, that makes sense. I have trouble going without my primary system for any length of time, too. Fortunately, I have some others on which I can communicate, although all my photography and development software is on the dead machine.
Arlie - I have tried taking out the new memory, and booting from my XP installation disk. At that point, I have tried both the recovery console and the install procedure. In both cases, Windows sees 4 drives, and says that none has a disk in it. If I plug a CF memory card into the card reader and do the same, Windows sees that and offers to install onto the CF card - but I'm not sure the CF card is big enough, and that seems to have rather limited usefulness, anyway.
Anyone - This is the second apparent failure I've seen on my A8N-Sli Premium in less than two years. I replaced my original board when I had a disk crash, and none of my drives exhibited *any* symptoms. The Sil controller seemed to lose its mind. I know that this is a very small sample size, but it strikes me as a *very* high failure rate.
Is this typical for these boards? Am I simply unlucky? Or is it likely that there is something else in my environment (such as a power supply) that is faulty? I'm using filtered power (on an APC UPS). The room is more-or-less climate-controlled (it's in an air-conditioned and heated house).
Both failures were during winter, times of low humidity. Are these boards particularly subject to static and/or low humidity? Are they particularly subject to fluctuations in power, perhaps from a so-so power supply?
Is there something other than a problem on the board that is likely to cause these symptoms? I really don't think that badly corrupted data on the disks would cause an OS installation disk to not find them, and then to exhibit a BSOD. I'm not wild about Windows, but I give it more credit than that.
Thanks for any thoughts, comments, or suggestions!
David
64dragon- 11-06-2007
dbritch, over the roughly 2 yrs that i had my 939 system using an a8n32 sli deluxe i had roughly 5 or so disk crashes/data corruption incidents (between 3 hard drives) in which the drives were not physically dead or bad after (i'm still using them).
what are you using for a psu?
in my situation i came up with a hypothesis that my hard drives were running too cool, having large temp fluctuations when transferring data and such. i came up with this after reading the pdf which is linked below. since i've only been running my am2 system since september (using no fans on the hard drives), i've not made a conclusion about this yet but hope it was the issue.
http://rds.yahoo.com/_ylt=A0geu6VqzTBHoaMA...sk_failures.pdf
Arlie- 11-06-2007
I should have been more explicit. If Linux can see and read your drives, then your hardware (motherboard and drives) must be OK. That leaves the new RAM sticks as the likely culprit for your problems. The AMD processors really don't like working with four sticks, especially if they are not exactly matched sticks. And XP really, really does not like working with that situation.
Remove the two new sticks and go back to the original two in slots A1 and B1. Start and go into BIOS and make sure that you are using the automatic settings for your RAM. It should come up in dual-channel, 400MHz mode. If, from there, it won't boot into XP, then that indicates that one of your hard drives is corrupted. If that is the case, the easiest solution is to recover from backup. If you can't do that, then you should use the Linux route to recover your data from the drives.
When you attempt to use the XP install disk recovery console, you will have to use the F6 option at the start to load in the Sil Image drivers via floppy/memory stick. Otherwise, XP will NOT see the drives, just as you describe.
By the way, you do not mention if you are running RAID, are you? If so, then things get a little more complex.
dbritch- 11-06-2007
Oh, my! Five incidents resulting in data loss over two years, with three disks? I would not be happy with that either.
And judging by the Google paper, that seems to be considerably less reliable than Google's disks have been. I'd seen that paper, but had not read it closely. The results on temperature are interesting - average temp doesn't seem to be an issue until the disks are 3-4 years old. There was also a similar paper from CMU that was published about the same time. The most earth-shaking results that I recall from the two papers were that Fibre Channel and "Enterprise" disks weren't much more reliable than consumer ones - they seemed to be so primarily because they were more often kept in a climate-controlled machine room.
Were your disks new? What OS were you running? Did the incidents appear to be a glitch in the performance of your disks themselves, or do you think that it may have been an OS issue - maybe a filesystem or driver?
You may well be onto something with the temperature fluctuation. However, I'd be happier with a thermostatically controlled fan on them than with none.
I'm typing this on a 5-year old Mac laptop, which I've had for about 3 years. I haven't used it quite as hard as my desktop system, but I've never had a problem with it that resulted in data loss. (Of course, now that I've said that, I'll probably have a disk crash tonight! ;-) )
On my primary system, I've now had two incidents that may result in data loss. I'm using a 500W Turbolink ATX-CW500P4. It is the one that came with my case, and I'm a little suspicious of it at this point.
The first incident was a RAID crash. I had most of my important data on a filesystem held on 4 disks, mirrored and raided. I've worked with big systems (NetApp, SGI, etc.) and was quite surprised when I found that there were no documented recovery procedures for a disk failure with the Sil 3114 on the A8N-Sli Premium. Moreover, there were not even any great success stories (yeah, I had a disk fail, so I replaced it and things went smoothly along).
I was able to recover most of my data, and only lost one directory (that I know of). The behavior of the system suggested that the problem was with the controller rather than disks, but was inconclusive. All my disks passed the Seagate diagnostics, and I replaced my motherboard and SATA cables. After that, all went well until Friday.
I'm pretty sure that my current problem is not a bad disk. It's not unreasonable to expect a disk to fail - I do have 7, after all (74GB Raptor system disk and 6 300GB Seagates - I do a lot of digital photography). And I've been prepared for that. However, I also think that it's unreasonable to have two failures in less than two years - particularly since both appear to be unrelated to moving parts.
I've contacted Asus to request an RMA. The board has a 3-year warranty. We'll see what they say about it.
Arlie- 11-06-2007
OK, with that information, I'll say this. It is NOT your board gone bad. The problem you have is software and related to the addition of mismatched RAM. The Sil Image controller is not a hardware RAID solution, it depends on drivers loaded with the OS to function. The same holds true for the NV Raid solution as well. That is why you don't see success stories recovering from a failed hard drive. Even though you are running RAID 1, there is no hardware handling the rebuild effort. Instead, you are dependent on an OS that you can't get to because one of the drives is fubar or the entire array is corrupted. For that reason, I only run RAID 0 on these boards and depend on a good backup solution in the event a drive fails. You can RMA the board and wait 2-3 weeks for a replacement, but it will not solve your problem.
The key question is this, will Linux read the array, or is it simply seeing four independent drives? If it only sees four independent drives, then you need to circle back to the XP recovery console and make sure that you load the Sil Image drivers using the F6 prompt. If that fails, you're installing from scratch.
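To make the point about driver-dependent RAID concrete, here is a small Python sketch of the bookkeeping a software RAID driver does: it translates a logical block number into a (disk, block) location. The round-robin layout, the stripe size, and the function names are assumptions for illustration only; the actual on-disk layout and metadata used by the Sil or NVIDIA RAID are not described in this thread.

```python
def raid0_locate(logical_block, num_disks, stripe_blocks):
    """Map a logical block to (disk index, block on that disk) for simple striping."""
    stripe_index, offset = divmod(logical_block, stripe_blocks)
    row, disk = divmod(stripe_index, num_disks)  # stripes go round-robin across disks
    return disk, row * stripe_blocks + offset


def raid1_locate(logical_block, num_disks):
    """In a mirror, every disk holds the same block at the same position."""
    return [(disk, logical_block) for disk in range(num_disks)]


if __name__ == "__main__":
    # Two striped disks, 4 blocks per stripe (hypothetical values).
    for block in (0, 4, 9, 13):
        print(block, raid0_locate(block, num_disks=2, stripe_blocks=4))
```

Without a driver performing this translation, an operating system has no way to reassemble a striped array and can only treat the members as separate drives. A mirror is more forgiving in this sketch, since each member holds a full copy of the data.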
dbritch- 11-07-2007
Arlie,
The new RAM sticks may have caused the initial problem, but they are no longer an issue. As I indicated before, I removed them. They are gone. I tested them, and they're gone.
I tried to do a clean install of XP on my drive plugged into the nvidia controller. XP sees 4 drives and no disks. Are you saying that I need the Sil drivers in order for XP to see the disks on the nvidia controller? That seems particularly strange to me, because I'm pretty sure that I didn't do that for the initial install; however, I may not have had all the disks plugged in - I don't recall. XP works with the nvidia controller without extra drivers. (Yes, nvidia provides drivers that may improve performance, but the standard MS drivers work.)
I respect you and your opinions, but your repeated admonition to remove the bad RAM concerns me - either I'm not presenting myself well, or you are not reading carefully. The bad RAM is *gone*. It's on a shelf, not in the computer. I *cannot* remove the bad RAM. I have my original RAM in the original sockets, and I'm still seeing exactly the same behavior. The RAM is sitting on a shelf. If that's still influencing the behavior of my computer, I have some serious problems! ;-)
Also, I have another Linux CD, Ubuntu 7.10. I tried to boot from that, and it was not successful. If this is a bad disk, it is more aggressive than simple data corruption - it must be actively interfering with the system's ability to recognize and access other devices.
I guess I'll unplug some disks, turn off the Sil controller, unplug *all* the Sil disks, and try again. I'll let you know what happens.
dbritch- 11-07-2007
I tried another experiment. I began by leaving the defective RAM on the shelf across the room. ;-)
I unplugged all my SATA disks except the single Raptor non-raided system disk. I made sure that my memory timings and all board voltages were set to auto. I disabled the Sil controller. I booted from the Windows XP installation CD. I saw the same symptoms - I have 4 "unknown disk"s, each with no disk in the drive - both when I went to the recovery console, and when I tried to do a clean install.
I had an old PATA disk in the basement, so I tried plugging it in in place of my second optical drive, and tried to boot from the CD. At the "press any key to boot from cd" prompt, I pressed the space bar. The screen went black, and the CD whirred for a few moments, then stopped. After several minutes, the screen was still black. I don't think that this is caused by data corruption on a hard drive, nor by missing Sil drivers. Could those cause these symptoms?
I'm not sure what the problem is, but I keep coming back to the motherboard or the BIOS. Does the motherboard keep a CRC for the BIOS, or something like that? How can I re-flash the BIOS, to verify that this is *not* the issue?
Thanks!
David
dbritch- 11-07-2007
Arlie - I tried to do as you suggested. However, I don't know how to get the Windows installer to load the Sil drivers. I don't have a floppy drive. I tried putting them on a USB drive, and inserting it when prompted. However, the installer could not find it. When I continued anyway, it located the thumb drive as drive G:. Apparently, it was looking for the drivers on drive A:.
I also tried putting the original driver CD in the boot CD drive when it prompted for drivers. It didn't find that, either.
So - you said that if this didn't work, I'd have to do a clean install. OK - I've tried to do that. The installation program (still) finds four drives, and says that none has a disk in it. IT WON'T INSTALL, BECAUSE IT SAYS I HAVE NO DISKS!
I'm trying to follow your advice. How would you suggest that I do a clean install?
I have an RMA number now. I was able to use this installation disk to install on the system originally. I bet that if I replaced the motherboard I'd be able to install again.
Oh - and the defective RAM is still on a shelf across the room, and the RAM timings and voltages are set to auto.
Thanks!
David
Arlie- 11-07-2007
OK, let's try this. Unplug everything from the Sil Image ports and plug just the Raptor into the first NV port. Fire up the machine and go into BIOS and verify that BIOS recognizes the drive on the first port. If it does not, then stop there and report back.
Provided it does recognize it, download a copy of the Western Digital Data Lifeguard software from http://support.wdc.com/download/index.asp?swid=1. I believe that version can be set up on CD. Create the bootable CD, drop it in the machine, start up, and format the Raptor. If you get through this phase, your hardware is good.
Replace the CD with the XP CD and restart. You should see the Raptor as the only available drive on which to install the OS. If you don't, then I would suspect that your install CD has a problem. Report back at that point and let us know if you are using an original Microsoft disk or one from an OEM build.
If you have a floppy drive laying around, it would be helpful for you to hook it up. They come in mighty handy in cases like this.