JMSWRNR

Hardware Failure 2023

What I've learned from a month-long nightmare of returns and repairs following unexpected hardware failure.

Type
Article
Published
25 Aug 2023

Introduction

Just over a month ago, I went to boot up my workstation PC to jump into some 3D work (definitely no Diablo 4 as well), and it failed to start. Not only that, it didn't even make it to the BIOS GUI, so I knew something serious had failed.

I wasn't worried about data loss; after killing a bunch of drives from heavy usage, I'm well prepared for that with a local NAS and cloud backups. But to make matters worse, my NAS motherboard also recently died due to an unrelated issue.

TLDR

My workstation PC and NAS broke, and it took a month to troubleshoot, RMA, source parts, and get back up and running.

If you don't want to read the story, skip to the final section for my takeaways and advice!

The Workstation PC

In early 2021, I built a new workstation PC for 3D work; I went all out and meticulously planned it for my needs. I "maxed out" a Zen 3 build at the end of the DDR4 era for stability, intending to make it last.

For those interested in specs, it looked something like this:
  • CPU AMD Ryzen™ 9 5950X
  • GPU 2x ASUS ROG Strix OC 3090 with NVLink Bridge
  • RAM 128GB Crucial DDR4 3600MHz CL16
  • Motherboard ASUS ROG Crosshair VIII Dark Hero
  • 4-Port USB 3 PCIe Card with 4 Controllers
  • Custom water cooling loop for the two hot GPUs

I wanted to be able to play games but also run demanding work tasks, and this was an amazing gaming/workstation hybrid.

I didn't go for Threadripper/Quadro because:
  • I believed it was overkill, given the availability of cloud computing/rendering for extreme cases
  • Gaming performance suffers heavily on Threadripper
  • The unethical restrictions on consumer-grade Nvidia GPUs can be removed by using nvidia-patch
  • Ampere NVLink Bridge also works on RTX 3090 to pool VRAM to 48GB

The Issue

The monitor displayed the error "BIOS is updating Led firmware." and the PC was stuck in a boot loop, never reaching the BIOS GUI. After a great deal of troubleshooting, part swapping, and BIOS flashing, it was time to accept that some hardware had failed.

After much Reddit searching, I narrowed it down to the CPU or Motherboard, which are both luckily within the three-year warranty. This seemed like a simple RMA process to replace a part within its warranty period.

RMA Time

From my search, it appeared the CPU was the main issue here, and the AMD RMA process was amazing. I sent it off using the prepaid label, and a few days after they received it, I already had a brand-new replacement at my door. However, the same error was displayed after rebuilding with the new CPU.

So that confirms it: the motherboard is still an issue. Maybe the CPU also failed, which caused the motherboard to brick, but it was time to RMA the motherboard. Now, that's where things get interesting.

Unlike AMD, the ASUS motherboard had to be returned to the retailer I purchased it from, not to ASUS. It was collected for free, and they confirmed it was broken after testing. However, because this retailer is now focused on newer hardware, they did not have any suitable replacement to send me and decided to refund the payment method I used two years ago without any confirmation; I'm pretty sure the bank account is now closed. Fun!

I searched frantically for a suitable X570 motherboard that could handle the demanding hardware I intended to use it with. Turns out, that is now incredibly hard to find in the UK; not a single store sold the Crosshair VIII Dark Hero anymore, and this is only two years later.

A couple of days later, the retailer listed a suitable open-box X570 motherboard for sale, an ASUS ProArt X570-Creator. I purchased it! Unfortunately, it arrived in very poor condition. I assumed it had been properly tested, and the damage was purely cosmetic, but it didn't respond to the power button after building! It was straight back into the box for a refund. They even expected me to pay to return the faulty motherboard until I mentioned UK consumer rights.

The retailer was Scan, and my experience with their support was not great. I often had no response and had to follow up for acknowledgment or updates. I also noticed their website often shows inaccurate, misleading restock due dates that differ from what their support says.

The Last Resort

It had been around a month of email threads and RMAs since my workstation PC initially died. So, as a last resort, I imported a motherboard from another country; this voids the manufacturer warranty that comes with it, but I just wanted to be back up and running again. I got a good deal on Amazon EU on a Gigabyte X570 Aorus Xtreme, and luckily, it was the latest Revision 2.0 with new OC features. I was slightly worried that the motherboard box had already been opened, as it was sealed on the wrong side, but I went ahead and rebuilt my entire workstation PC. And it works! Luckily, no other hardware was affected by the faulty motherboards.

The NAS

My NAS is a custom-built machine running TrueNAS Core. When building the NAS in 2015, I picked the most recommended motherboard, the ASRock Rack C2750D4I. Unfortunately, this model suffered a couple of major issues and finally died this year, unable to POST.

At that point, I couldn't access the files on the NAS because I had no alternative hardware supporting 6 SATA drives to use as a backup NAS.

I considered switching to a prebuilt Synology NAS for simplicity but realized that it doesn't support ZFS, so I'd have to buy new drives and copy everything across or upload and download everything to migrate from TrueNAS Core to Synology. Both cases require a way to mount the ZFS drives.

So, I decided to stick with the DIY TrueNAS Core route. I had to replace the motherboard, but it was no longer under warranty. It was an ITX form factor, used DDR3 RAM, and the CPU was embedded. After searching for a replacement motherboard, I soon realized a suitable DDR3 board was next to impossible to source, and an upgrade to DDR4 was the best course of action.

Luckily, I found some great deals on eBay for used hardware and picked up 128GB of DDR4 ECC RAM and a Supermicro X11SDV-8C-TLN2F. I added an 80mm Noctua fan to the CPU heatsink, applied fresh thermal paste, fitted a new battery, updated the BIOS, and reset the IPMI config, and it's running great! I would have preferred a Supermicro A2SDi-H-TF, which runs a bit cooler and has a lower idle power draw, but this wasn't readily available, and the low cost of the used Xeon-D board was a no-brainer.

This was a much smoother repair than the PC, but both failures happening simultaneously added to the complexity and stress.

What I Learned

I completely overlooked hardware availability and ease of maintenance/repair, which had costly consequences. I must have partially rebuilt this workstation PC 10-20 times, and being water-cooled involved annoying draining and cleaning.

I wrote this blog post to shed some light on how this could impact other self-employed work-from-home professionals. If hardware uptime is important to you, my advice would be:

  • If building a custom PC, be mindful of the future availability of replacement parts in case something goes wrong.
  • If buying a prebuilt workstation PC, ensure they offer reputable support and warranty in case it needs to be returned for repair, ideally local with fast turnarounds. Apple workstations are no exception here; be prepared to drive that desktop to an Apple store and for them to send it off for repair unless stated otherwise (speaking from experience).
  • Have backup hardware, such as a suitable laptop or old desktop, to switch to during downtime.
  • Actively back up your important work files to the cloud and work locally from NAS storage instead of your PC drive.
  • When buying or building a NAS, just like a workstation PC, consider its upgradability and repairability.
  • Have a SATA or NVMe to USB adapter to plug your PC drive directly into another device if you ever need to recover something urgently.
  • If using water cooling in your custom-built PC, quick disconnect fittings are essential; quickly removing the CPU/GPU/Motherboard without draining the system is a lifesaver. Halfway through this process, I added Koolance QD3 fittings, an expensive but worthwhile upgrade.