Weird hard drive issue – questions

Description of the problem and steps taken so far, then questions below.

I have an old Windows 10-based PC acting as a file server. Twice now, after it has been on for a few days, I've seen the HD access LED stuck on. In Explorer, one of the hard drives looked like it was half-disconnected; it had a volume name and drive letter, but no numbers for free/total capacity. The system was responsive until I tried to access that drive, at which point Explorer hung and I couldn't shut down or restart except with the power button. After restart, it was absolutely fine for a few days. All hard drives worked. Thought the problem was a fluke until it happened the second time.

The problematic hard drive is one of two identical Seagate IronWolf drives connected via the motherboard's Marvell 88SE9172 chip. The other drive – again, same model, same controller – has yet to miss a beat.

I plugged the problematic hard drive into an external enclosure last night and everything seems fine so far. Chkdsk found no file system issues and no bad sectors. Virus scan (Webroot SecureAnywhere CE) found nothing.

I've already replaced the hard drive with a spare and updated the drivers for the Marvell controller (it had been using the generic driver). I plan to restore all the data from backups and move on. But ideally I'd like to know more about what may have happened.


Questions:

1. Is it possible that the hard drive is defective in a way that shows up only when connected via the onboard controller but not when connected to the external enclosure? If so, how can I check? The drive is still under warranty.

2. What other hardware/software problems are worth looking for at this point?

3. Can I trust the integrity of the data on that drive, so that I can copy everything directly from it to the new drive? Or does it make sense to download everything from cloud storage?


TIA!
 
1. Is it possible that the hard drive is defective in a way that shows up only when connected via the onboard controller but not when connected to the external enclosure? If so, how can I check? The drive is still under warranty.

A hard drive can start to fail in the ways you've seen and then seem fine once it's powered on again. It's very possible it would end up letting you down again when connected to the external enclosure. Either way, I would not trust this drive anymore.

2. What other hardware/software problems are worth looking for at this point?

Well, it's possible that the SATA port is bad and you'll end up with the same issue on the new drive, unless the new drive has already run longer than it took the old one to act up. Generally a cheap chipset like that shouldn't be trusted; you'd want a real HBA.

Run the Seagate diagnostics, HD Tune, or other disk diagnostic software to get at the SMART data and see what you find. Something like StableBit Scanner or Hard Disk Sentinel could be worth the price; have it run scheduled scans on your disks and set up email alerts for SMART issues.

3. Can I trust the integrity of the data on that drive, so that I can copy everything directly from it to the new drive? Or does it make sense to download everything from cloud storage?

That's tough to say. Odds are the data is fine, but it's possible it's not. If it's critical, I would download it.

I would purchase some kind of pre-built NAS, or build your own, with some redundancy so that a single drive failing doesn't force you to restore from backups and question the data. Synology is a good choice for home, but I'm a sucker for ZFS, so look at iXsystems for a TrueNAS box; they do sell some smaller ones.
 
You can look at the SMART stats with a utility of some kind and see whether any of the error counters keep increasing: SATA interface errors, reallocated sectors, servo errors, etc. They should not increase regularly, and if you see that, it's probably a bad drive or a bad motherboard port (I've seen it).

However, like others said, better safe than sorry: just replace the drive.
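
The "watch whether the counters keep increasing" check above can be sketched in a few lines of Python. This is only an illustration: the attribute names and raw values below are hypothetical, typed in by hand from two readings taken some days apart, and the helper name `rising_attributes` is made up for this example. Only feed it error-type counters; things like power-on hours rise on a healthy drive too.

```python
def rising_attributes(earlier, later):
    """Return {name: (old, new)} for attributes whose raw value went up."""
    return {
        name: (earlier[name], later[name])
        for name in earlier
        if name in later and later[name] > earlier[name]
    }


# Hypothetical raw values copied by hand from two SMART readings.
week1 = {
    "Reallocated Sector Count": 0,
    "UltraDMA CRC Error Count": 0,
    "Current Pending Sector": 0,
}
week2 = {
    "Reallocated Sector Count": 8,
    "UltraDMA CRC Error Count": 0,
    "Current Pending Sector": 2,
}

for name, (old, new) in rising_attributes(week1, week2).items():
    print(f"{name}: {old} -> {new}")
```

A counter that shows up here once is worth noting; one that shows up in every comparison is the pattern described above.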
 
Sincere thanks, everyone.

Thankfully I've already replaced the drive with a spare. What I'm trying to figure out at this point is whether I can get a warranty replacement, what other hardware problems are worth checking for, and whether I can trust the data.

Related: I'd love to install a real HBA. The thing is, I really don't need speed and I'm kind of pinching watts in this application. The machine isn't really critical and it lives in a low-airflow environment. If there's a very power-efficient HBA out there that would take load off the CPU while adding reliability, I'm all ears.
 
You’ll find out which sector(s) are bad when you copy all the data off the bad HDD and onto the good one. The HDD electronics try 5 times to read a bad sector before giving up, and Windows tries 3 times before it gives up. That’s a total of 15 head reseeks, which is characteristically noisy. If a read attempt succeeds, the copy moves on. (A NetApp has algorithms that look for longer than normal read times and pre-fail a disk based on them.)

A full data copy is the most accurate HDD health check, much more accurate than the short or long self-test.

An HDD never remaps a bad sector on a read; it will remap on a write, most of the time.
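
If you can't sit and listen to the drive, a rough stand-in for the full-copy check is a script that reads every file and notes which ones fail or stall. The sketch below is only that, a sketch: the function name `read_all_files` and the 5-second `slow_seconds` threshold are arbitrary choices for illustration, not anything the drive or Windows defines.

```python
import os
import time


def read_all_files(root, slow_seconds=5.0, chunk_size=1024 * 1024):
    """Read every file under root and return (failed, slow).

    failed: list of (path, error message) for files that raised an OS error.
    slow:   list of (path, seconds) where a single chunk read took longer
            than slow_seconds (an arbitrary threshold, not a standard).
    """
    failed, slow = [], []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            worst = 0.0  # longest single read seen for this file
            try:
                with open(path, "rb") as handle:
                    while True:
                        started = time.monotonic()
                        chunk = handle.read(chunk_size)
                        worst = max(worst, time.monotonic() - started)
                        if not chunk:
                            break
            except OSError as error:
                failed.append((path, str(error)))
                continue
            if worst > slow_seconds:
                slow.append((path, worst))
    return failed, slow
```

Calling `read_all_files` on the suspect drive's root and printing both lists gives a record of which files errored out and which ones made the drive retry for a long time.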
 
This is great info. Thanks.

Just under 1.9 TB on that drive, so I wasn't there to listen to the drive while it copied everything. The operation did succeed, though. Every time I checked the transfer rate it was between 100 and 200 mbps (hope I capitalized that right – whatever the units are in Windows file xfer dialogs). Would you infer from this that everything's probably fine, within an acceptable margin of certainty for non-critical data?
 
Would you infer from this that everything's probably fine, within an acceptable margin of certainty for non-critical data?

I would accept that margin.

Any chance to check the event viewer to see any event IDs related to the time it took place?
 
Could it be the controller or cable and not the hard drive?
Yes, furthermore, I have personally had an SATA cable go bad. It drove me nuts for over a week until, in desperation, I replaced the cable and voila, that was it. I didn't think that this could happen, but it did.
A while later I had a buddy that was having a hard drive problem with his NVR. I told him to try replacing the SATA cable. He was shocked to discover that the cable had gone bad.
A bad SATA cable is not the likely cause of the OP's problem because replacing the hard drive fixed it. His likely problem is a bad Seagate hard drive. I will NOT purchase Seagate hard drives any longer because of a rash of failures I experienced with them about 12 years ago, a few years after they acquired Maxtor. Maxtor manufactured some really lousy hard drives and I suspect that I got ahold of a bunch of rebadged Maxtors.
 
Yeah if I/O stayed even and high then it’s a casual indicator the copy was fine.
The real way to verify it went fine is to compute a cryptographic hash of every file on both drives, then compare the hashes file by file to see if there are any differences, but that takes loads of time.
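
For what it's worth, the hash-and-compare idea can be sketched with nothing but the Python standard library. SHA-256 is my pick here, any cryptographic hash would do, and the function names are made up for this example. Files are matched by their path relative to each drive's root, which assumes the copy kept the same folder layout.

```python
import hashlib
import os


def hash_file(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of one file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def hash_tree(root):
    """Map each file's path (relative to root) to its digest."""
    hashes = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            hashes[os.path.relpath(full, root)] = hash_file(full)
    return hashes


def compare_trees(source_root, copy_root):
    """Return (missing, extra, changed) lists of relative paths."""
    source, copy = hash_tree(source_root), hash_tree(copy_root)
    missing = sorted(set(source) - set(copy))  # on source only
    extra = sorted(set(copy) - set(source))    # on copy only
    changed = sorted(
        path for path in set(source) & set(copy) if source[path] != copy[path]
    )
    return missing, extra, changed
```

Three empty lists from `compare_trees` mean the two drives agree with each other. That does not prove the source was right to begin with, which is where the cloud copy comes in.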
 
Internal to the HDD there is a lot of ECC correction, and most likely a semi-good sector would have been reallocated elsewhere (and the grown bad sector / reallocated sector count increased).

 
I was at Maxtor. What Seagate did after the merger was close down California and move to Colorado, combining the Colorado operations of Maxtor and Seagate. The "better" talent in the earlier, higher cost-of-living part of the company was laid off, and most of those employees went to WD, SanDisk, Marvell, and all sorts of other companies. In the end, "you get what you pay for" is what damaged Seagate, as they lost the better part of the employee pool.

They also didn't have the better read/write heads that WD got from their merger with Read-Rite. Seagate ended up flying their heads lower, with less safety margin on read/write quality, and their drives therefore failed more often.
 
Any chance to check the event viewer to see any event IDs related to the time it took place?
Found events in the System log for the most recent incident.

Lots of 153 (disk – IO op retried)
Some 140 (NTFS – failed to flush data to transaction log)
Some 129 (storahci – reset issued to port)
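
To see how often each of these shows up over time, the System log can be exported from Event Viewer as CSV and tallied. The sketch below is a rough take on that: the column names `Source` and `Event ID` are what I'd expect in such an export but are an assumption, so check them against the header row of the actual file, and `tally_events` is just a name made up for this example.

```python
import csv
from collections import Counter


def tally_events(csv_path, source_column="Source", id_column="Event ID"):
    """Count rows per (source, event ID) pair in an exported event log CSV.

    The column names are assumptions about the export format; pass
    different ones if the header row of the file says otherwise.
    """
    counts = Counter()
    # utf-8-sig reads the file whether or not it starts with a BOM.
    with open(csv_path, newline="", encoding="utf-8-sig") as handle:
        for row in csv.DictReader(handle):
            counts[(row.get(source_column), row.get(id_column))] += 1
    return counts
```

Printing `tally_events(path).most_common()` for an export filtered to one incident, and again for the whole log, shows whether the 153/140/129 pattern is a one-off or has been building for a while.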
 
SMART data from SeaTools:
[Screenshot: SMART attributes as reported by SeaTools]


Stats missing from the screenshot are Ultra DMA CRC Error (0), plus head flight hours and lifetime reads/writes, which I assume aren't important here (someone please correct me if I'm wrong).
 
Seagate IronWolf is a NAS drive. I don't know how well it works in a PC as a desktop drive, due to timing or other issues. Your SMART stats look OK other than ECC on the fly and read error rate being less than perfect. The Windows log seems to show retries and port resets due to timeouts. If the other IronWolf does not behave like that, it could be that this drive is not doing well. If the other IronWolf behaves the same, it could be a NAS-related timing issue. WD got caught changing their Red NAS drives and causing timeouts in some servers, so I'd check whether IronWolf has the same issue.

Do you trust the Marvell SATA? Could that or the cable be a problem?
 
What do you mean by timing?

I "trust" the Marvell controller in the sense that I think it's a lot less likely to be the problem than the hard drive or cable is. I don't trust it enough to rule it out completely as a potential cause.

I just looked back in the event log and found many more of the same kinds of events, apparently on the same channel. So, maybe that channel or the cable is bad...
 