5700xt keeps crashing after a few hours

My rig has four RX 5700 XT cards.
It keeps crashing after anywhere from 20 mins to 18 hours. I’ve never got it to run for over 18 hours.

Error logs

Miner logs

Average speed (10s): 0.00mh/s | 48.23mh/s | 0.00mh/s | 0.00mh/s Total: 48.23 mh/s
New job received: 0xd01ff5 Epoch: 387 Target: 000000006df37f67
Stuck device detected, invoking emergency script

The real problem : OS logs (repeated over and over)

Jan 10 12:25:23 hive5700XT kernel: [58483.988705] amdgpu: Failed to export SMU metrics table!
Jan 10 12:25:28 hive5700XT kernel: [58488.988954] amdgpu: Msg issuing pre-check failed and SMU may be not in the right state!
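
In case it helps anyone comparing notes, here is a rough Python sketch for summarizing these amdgpu messages. It assumes syslog-style lines in exactly the format quoted above; the `summarize` helper and the regex are my own and not part of Hive OS. The bracketed number is the kernel's seconds-since-boot timestamp, so it shows how long the rig had been up when the errors started.

```python
import re
from collections import Counter

# Matches syslog-style kernel lines like the two amdgpu messages quoted above.
LINE_RE = re.compile(
    r"kernel: \[\s*(?P<uptime>\d+\.\d+)\] amdgpu: (?P<msg>.+)$"
)

def summarize(lines):
    """Return (uptime in seconds of the first amdgpu line, Counter of messages)."""
    counts = Counter()
    first = None
    for line in lines:
        m = LINE_RE.search(line)
        if not m:
            continue  # not an amdgpu kernel message
        if first is None:
            first = float(m.group("uptime"))
        counts[m.group("msg").strip()] += 1
    return first, counts

# The two lines from the OS log above, used as sample input.
sample = [
    "Jan 10 12:25:23 hive5700XT kernel: [58483.988705] amdgpu: Failed to export SMU metrics table!",
    "Jan 10 12:25:28 hive5700XT kernel: [58488.988954] amdgpu: Msg issuing pre-check failed and SMU may be not in the right state!",
]

first, counts = summarize(sample)
print(f"first amdgpu error at {first / 3600:.1f} h of uptime")  # 16.2 h for this sample
for msg, n in counts.most_common():
    print(n, msg)
```

To run it on a real log, pass the lines of the log file to `summarize` instead of `sample`; the file location depends on the system.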

I’ve tried numerous OC settings; the current ones are pretty conservative and give good temps.

What I’ve tried so far :

  • Updated the B250 motherboard with this guide
  • Changed risers, splitters and power cables
  • Ran a single card at stock; it crashed after ~20 hours (same OS error log)
  • Updated Hive OS to 0.6-191@210109
  • Switched miners from PhoenixMiner to lolMiner (same OS error log)

Other info :

  • Kernel 5.0.21-201105-hiveos
  • AMD Driver OpenCL 20.30

I can’t find much information about these error logs, most of the ones that I found online are related to monitor issues, which doesn’t apply to me.

Anybody else encountered this issue?

Maybe a power supply issue. The watts reported by Hive / software are lower than the actual draw, which is why people get a monitor that plugs into the wall to measure power draw.

I just put a Sapphire in my system (See GPU 2) and it has been running fine, so I have not modded the vBIOS yet.

Try using my overclock setting and see how it works.

Oh, also, I am using the Asus B250 on the latest BIOS (updated via web), not using the PCI x16 slot, and all PCI slots are set to Gen2. Just FYI, I read the protocol you linked and I think I followed the same instructions.

For your risers, how are you powering them? I know many come with SATA power cables but those are unreliable at best and dangerous at worst. I power all of mine using the 6-pin PCIe cable. AND, if you are running multiple power supplies the riser and card need to be on the same PSU.
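
A back-of-the-envelope check of why SATA is the weak link, assuming the commonly cited connector ratings (three 12 V pins at 1.5 A each for SATA, and 75 W each for the PCIe slot and a 6-pin PCIe cable). The constant names are mine, and real cables and adapters may be rated lower, so treat it as a sketch rather than a spec:

```python
# Commonly cited ratings; assumptions, not measurements from this rig.
SATA_12V_PINS = 3          # 12 V pins on a SATA power connector
SATA_AMPS_PER_PIN = 1.5    # rated current per pin
PCIE_SLOT_MAX_WATTS = 75   # maximum a card may draw through the slot
PCIE_6PIN_WATTS = 75       # rating of a 6-pin PCIe power cable

sata_watts = SATA_12V_PINS * SATA_AMPS_PER_PIN * 12  # 54.0 W on the 12 V rail

# A riser has to supply whatever the card pulls through the slot.
print(f"SATA 12 V limit:     {sata_watts:.0f} W")
print(f"PCIe slot maximum:   {PCIE_SLOT_MAX_WATTS} W")
print(f"SATA covers slot:    {sata_watts >= PCIE_SLOT_MAX_WATTS}")
print(f"6-pin covers slot:   {PCIE_6PIN_WATTS >= PCIE_SLOT_MAX_WATTS}")
```

So on paper a SATA-powered riser can fall short of what the slot is allowed to draw, while a 6-pin cable covers it.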

I have 6x RX 5700 XT. I found that undervolting the memory controller caused stability issues. Try removing your undervolts for those.

Then slowly work them all down over 24 hours again.

I ended up changing the mobo to a brand-new TB250-BTC. I was powering the risers through PCIe and then changed to Molex, with a single Molex cable per riser. Unfortunately, I'm still encountering the same problem.