Bad Riser example


#1

The error looks like this:

[drm:amdgpu_ib_ring_tests] *ERROR* amdgpu: failed testing IB on ring 1 (-110).
[drm:amdgpu_vce_ring_test_ib] *ERROR* amdgpu: IB test timed out.
[drm:amdgpu_ib_ring_tests] *ERROR* amdgpu: failed testing IB on ring 2 (-110).
[drm:amdgpu_vce_ring_test_ib] *ERROR* amdgpu: IB test timed out.
[drm:amdgpu_ib_ring_tests] *ERROR* amdgpu: failed testing IB on ring 3 (-110).
[drm:amdgpu_vce_ring_test_ib] *ERROR* amdgpu: IB test timed out.
[drm:amdgpu_ib_ring_tests] *ERROR* amdgpu: failed testing IB on ring 4 (-110).
[drm:amdgpu_vce_ring_test_ib] *ERROR* amdgpu: IB test timed out.
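
For anyone who wants to pull the failing ring numbers out of a longer dmesg dump, here is a minimal sketch in Python. It assumes the messages look exactly like the ones above. The helper name failed_rings and the small errno table are only illustrative, not part of any tool. The -110 in the log is the negated Linux errno ETIMEDOUT, which fits the "IB test timed out" lines.

```python
import re

# Linux errno values relevant here (assumption: only the timeout is mapped;
# anything else is reported as "unknown").
LINUX_ERRNO_NAMES = {110: "ETIMEDOUT"}

# Matches "failed testing IB on ring N (-E)". The \s+ between the words
# tolerates the line wrap that often appears in pasted logs.
PATTERN = re.compile(r"failed testing IB on\s+ring\s+(\d+)\s+\((-?\d+)\)")


def failed_rings(log_text):
    """Return a list of (ring_number, errno_name) pairs found in log_text."""
    results = []
    for ring, code in PATTERN.findall(log_text):
        name = LINUX_ERRNO_NAMES.get(abs(int(code)), "unknown")
        results.append((int(ring), name))
    return results


if __name__ == "__main__":
    sample = (
        "[drm:amdgpu_ib_ring_tests] *ERROR* amdgpu: failed testing IB on\n"
        "ring 1 (-110).\n"
        "[drm:amdgpu_vce_ring_test_ib] *ERROR* amdgpu: IB test timed out.\n"
    )
    for ring, name in failed_rings(sample):
        print(f"ring {ring}: {name}")
```

Saving the output of dmesg to a file and passing its contents to failed_rings should show at a glance whether the same rings fail on every boot or the numbers move around.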

The cause is bad soldering or even a broken line on the riser. Third from the right in the image.

And another example


#2

I’m also getting this error from two different cards, even when plugged directly into my motherboard (no riser). I have 2 cards with this error.

There is also this: https://bugs.archlinux.org/task/53042, https://bbs.archlinux.org/viewtopic.php?id=225597

Not only that: I can swap risers with working cards and those cards continue to work, while the same cards that fail the IB test on ring 12 mine just fine in a different system (it is only ever ring 12).

Both are XFX RX 480s without modded BIOS (didn’t get to mod them yet).

I’ve proven it isn’t a riser issue and it isn’t with the cards (nor with which PCIe slot they are using).

Help?


#3

I can confirm what Rootless said. I’m facing the same problem on my 12-GPU rig, running RX 570 / RX 580 cards. I swapped risers and changed motherboards (H110/TB250), and still get the same random IB ring errors on boot (with different ring numbers). This affects Ethos, Smos and Hive as well, though I had more luck running Ethos (different kernel and firmware, I suppose, but this did not solve the problem completely). It’s so freaking annoying, because it requires a physical restart of the PSU.
Help, please…?


#4

I can confirm it too. But I flashed the stock ROM to the cards, did 20+ reboots, and didn’t get this error. I think the kernel in HiveOS can’t work with a modded BIOS. Maybe it’s ROCm that works badly. RX 580 4GB Elpida cards have this problem, but Radeon RX 580 8GB cards work well. Dima, we need help, what can we do?


#5

Other RX 580 4GB Elpida cards work well, Hive ver. 0-5.32 on all rigs.


#6

Same problem here. I have changed more than 20 risers and tried different cards and motherboards. No stability… help.


#7

Same problem with an RX 580 Pulse Elpida 4GB. No solution found yet.


#8