RTX 3080 Crashing HiveOS

As the title implies, I am having trouble and I’m hoping someone can help. I started mining on HiveOS back in February with two RX 580s. I then added an MSI Ventus 3X RTX 3080, and everything went great for several months. Then, a month ago, I added two more MSI Ventus 3X RTX 3080s. My motherboard is an MSI Z390-A Pro with 4 GB of RAM.

And that’s when the trouble started. My motherboard would not detect or load more than four GPUs. I changed the recommended settings in the BIOS: enabled Above 4G decoding, set PCIe to Gen2, disabled SATA, enabled power on, etc. Every time, I would get a GOP error on reboot.

Then, after much reading, I discovered that it wasn’t working because I was using an i3-9100F CPU, which lacks integrated graphics. So I replaced it with one that has integrated graphics and still got GOP driver errors. After much more reading, I finally found that I had to roll back the BIOS to the 2018 version. I did that, changed the settings again, and it finally loaded all 5 cards.

Unfortunately, I still get crashes, and it seems to be a common problem. HiveOS shows my rig as offline; it’s still drawing power at the wall, but the whole OS seems frozen. I have tried everything I could find online:

I replaced all the risers with version 009 ones. I have tried multiple miners, multiple versions of each miner, and multiple versions of HiveOS. I have tried several different NVIDIA drivers as well. I have narrowed the culprit down to one of my RTX 3080s (one of the two new ones).

Without it, the other four cards run with no issues for several days. With this card in the rig, or with this card by itself, the rig crashes within 3 hours. I have yet to see any errors, other than occasional high CPU use and a high load average (LA). Side note: I have logs turned on and have pored over them, but I’m not sure what I’m looking for or which log is the right one to spot such an error.
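On the question of what to look for in the logs: as far as I understand, the NVIDIA kernel driver reports GPU faults as "Xid" lines in the kernel log (the `dmesg` output), each tagged with the PCI address of the card. Below is a minimal Python sketch that pulls those lines out of saved log text. The function names and the log path in the comment are my own placeholders, not anything HiveOS provides, and a hard freeze may well leave nothing in the log at all.

```python
import re

# Matches NVIDIA kernel driver fault lines of the general form:
#   NVRM: Xid (PCI:0000:03:00): 79, pid=1234, GPU has fallen off the bus.
XID_RE = re.compile(r"NVRM: Xid \((PCI:[0-9a-fA-F:.]+)\): (\d+)(?:, (.*))?")


def find_xid_events(lines):
    """Return a list of (pci_address, xid_code, detail) for each Xid line."""
    events = []
    for line in lines:
        match = XID_RE.search(line)
        if match:
            detail = (match.group(3) or "").strip()
            events.append((match.group(1), int(match.group(2)), detail))
    return events


def scan_file(path):
    """Scan a saved kernel log. The path is an assumption, for example
    a file produced with `dmesg > kernel.txt` or /var/log/kern.log."""
    with open(path, errors="replace") as handle:
        return find_xid_events(handle)
```

If every event points at the same PCI address, that would at least confirm which slot or card is faulting; the Xid number can then be looked up in NVIDIA's Xid documentation.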

I suspected thermal issues, so I replaced the thermal pads (and voided the warranty). I have also tried removing the overclock, and it still crashes at base settings.

With the whole rig intact, I have two 1000 W PSUs. The first powers the motherboard, CPU, and two 3080s with their risers. The second powers the third 3080 and the two 580s.

It doesn’t seem to be a power issue, driver issue, version issue, or riser issue, since it all works great without this one card. However, I cannot for the life of me figure out what to try next. The best it has done is 3 hours, and that was at 0 core, 0 memory, and a power limit (PL) of 210 W.

I do not have Windows, so I can’t check the memory junction temps, but during this last 3-hour run the GPU temp never got above 45 °C.
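To capture what the card was doing right before a freeze, one option is to poll `nvidia-smi` and flush every sample to disk. This is only a sketch under my own assumptions: the output file name and the 5-second interval are arbitrary, and it records the core temperature, power draw, and clocks, not the memory junction temperature, which I am not sure is exposed for this card under Linux.

```python
import csv
import io
import os
import subprocess
import time

# Query fields passed to nvidia-smi --query-gpu
FIELDS = ["index", "temperature.gpu", "power.draw", "clocks.sm", "clocks.mem"]


def parse_smi_csv(text):
    """Turn nvidia-smi CSV output (noheader, nounits) into a list of dicts."""
    rows = []
    for record in csv.reader(io.StringIO(text), skipinitialspace=True):
        if len(record) == len(FIELDS):  # skips blank or malformed lines
            rows.append(dict(zip(FIELDS, (value.strip() for value in record))))
    return rows


def sample():
    """Run nvidia-smi once; raises if it fails or hangs for over 10 seconds."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=" + ",".join(FIELDS),
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True, timeout=10,
    )
    return parse_smi_csv(result.stdout)


def log_forever(path="gpu_samples.log", interval_s=5):
    """Append one line per GPU per sample and force it to disk each round,
    so the last readings survive a hard freeze. The path is a placeholder."""
    with open(path, "a") as handle:
        while True:
            stamp = time.strftime("%Y-%m-%d %H:%M:%S")
            for row in sample():
                handle.write(f"{stamp} {row}\n")
            handle.flush()
            os.fsync(handle.fileno())
            time.sleep(interval_s)
```

Calling `log_forever()` in a separate shell while the miner runs would leave a file whose last few lines show the readings just before the rig went offline.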

I’m really hoping someone out there has a suggestion that could fix this.

I just happened to catch a brief glimpse of an error in the miner log. It said something about a CUDA error: launch failed, reduce overclocks. But I don’t have any overclocks set, so…?