I’ve asked this before and typically I do not have any issues, yet I am running into random events like this that are driving me crazy.
Everything is on 240V power with Tripp Lite PDUs, loaded to only ~60-70% of each PDU’s rated amperage. I literally swapped ALL of the ATX PSUs out for 2400W Platinum server PSUs on rigs with either 8 or 13 GPUs (plus a Corsair 400W ATX PSU to power just the motherboard, SSD, and CPU). The rigs had been running for 24 hours without issue, and then this morning 2 rigs went offline without warning and 3 rigs have 5-9 GPUs ALL reporting temps in the 511C range. I know this is obviously a false temp, but what is causing this error? Risers are brand new, server PSUs are all good, and there is more than enough power and overhead for 54 GPUs. DPM is ~4 or 5 on all rigs, and power is 880-900 running CN Heavy (higher than it needs to be, as I can’t get this stuff stable enough to undervolt).
This rig setup can typically run for 3-5 days straight without ever having an issue. Once I swapped to server PSUs, nearly all of my problems disappeared, until now.