"Autofan: GPU driver error, rebooting" message

gpu
driver
error
autofan
nvidia

#1

Hello,

One of my rigs began producing an “Autofan: GPU driver error, rebooting” message yesterday. What could be the cause? I upgraded to version 5.63 yesterday. Could it be an issue with that version?

GPUs are MSI GeForce GTX 1060 and 1070 Gaming X.

Please advise. Thank you.

Update: Same message now appearing on another rig.


#2

UPDATE

On the first rig, the issue seems to have been related to a defective riser card or GPU. I disconnected both, and the system has been mining without issue for 16 hours now.

On the second rig, I upgraded it to 5.64, and the rig produced a new error message of “Autofan: GPU driver error, no temps.”

Identical and similar messages on two different rigs with different brand and model GPUs… the root problem seems to be with HiveOS. It’s disappointing that I haven’t gotten any replies, especially from the HiveOS team.


#3

I have the same problem; I never saw this error before the .63 update.
Hope the devs can help us.


#4

Same here. The error started with the 05-63 update, and I thought 05-64 would fix it. Please help: the rig turns off after 3 reboots. Thanks.


#5

I have the same issue. It started a couple of updates back. Nvidia cards. Sometimes it’s just my 980 that stops. A manual restart fixes it.


#6

Same here.


#9

Same here. Anyone got an update on this?


#10

Same here.


#11

Enabling REBOOT_ON_ERROR=1 in /hive-config/autofan.conf seems to be the only solution, according to the Hive OS changelog.
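
For anyone who wants to set that flag from an SSH session without opening an editor, here is a minimal sketch. The helper name set_conf_option is made up for illustration and is not part of Hive OS; the only things taken from this thread are the file path and the REBOOT_ON_ERROR=1 setting. It assumes the file uses plain KEY=VALUE lines.

```shell
#!/bin/sh
# Hypothetical helper (not a Hive OS tool): set KEY=VALUE in a
# KEY=VALUE-style config file, creating the file if it is missing.
# Usage: set_conf_option FILE KEY VALUE
set_conf_option() {
    file=$1
    key=$2
    value=$3
    # Make sure the file exists so grep and sed have something to read.
    touch "$file" || return 1
    if grep -q "^${key}=" "$file"; then
        # Key already present: rewrite its value via a temporary file.
        sed "s|^${key}=.*|${key}=${value}|" "$file" > "${file}.tmp" &&
            mv "${file}.tmp" "$file"
    else
        # Key missing: append it as a new line.
        printf '%s=%s\n' "$key" "$value" >> "$file"
    fi
}

# On the rig, using the path and setting mentioned in this thread:
# set_conf_option /hive-config/autofan.conf REBOOT_ON_ERROR 1
```

Running it twice leaves a single REBOOT_ON_ERROR line rather than appending a duplicate, which is the main reason to prefer it over a bare `echo >>`.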


#12

Same for me. I reseated the risers, and nothing helps.


#13

autofan.conf didn’t exist, so I created it with the following data:

#URL
#https://forum.hiveos.farm/t/how-to-use-autofan-autofan/4551/4?u=77164

#https://hiveos.farm/changelog/
REBOOT_ON_ERROR=1

# Target GPU temperature
TARGET_TEMP=60

# Minimal fan speed
MIN_FAN=30

# Stop mining at critical temp
CRITICAL_TEMP=85

# Set to 1 to disable AMD fan control
NO_AMD=1

Please note that I disabled auto fan control for AMD GPUs because I don’t have any.


#14

For those wondering how to create the autofan.conf file:

First, SSH into your rig. Then type the following:

nano /hive-config/autofan.conf

Then add the following to the file:

#https://hiveos.farm/changelog/
REBOOT_ON_ERROR=1

# Target GPU temperature
TARGET_TEMP=60

# Minimal fan speed
MIN_FAN=30

# Stop mining at critical temp
CRITICAL_TEMP=85

# Set to 1 to disable AMD fan control
NO_AMD=1

Then press:

Ctrl-x (press ‘y’, then Enter, to confirm saving)

Restart the rig.
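
As an alternative to editing in nano, the same file can be written in one step with a here-document. This is only a sketch: the function name write_autofan_conf is made up for illustration, the values are the ones posted in this thread, and whether autofan honors each key depends on your Hive OS version.

```shell
#!/bin/sh
# Hypothetical helper (not a Hive OS tool): write the autofan settings
# quoted in this thread to a config file, overwriting any existing file.
# Usage: write_autofan_conf [FILE]   (defaults to /hive-config/autofan.conf)
write_autofan_conf() {
    conf=${1:-/hive-config/autofan.conf}
    # The quoted 'EOF' delimiter keeps the text literal (no expansion).
    cat > "$conf" <<'EOF'
# https://hiveos.farm/changelog/
REBOOT_ON_ERROR=1

# Target GPU temperature
TARGET_TEMP=60

# Minimal fan speed
MIN_FAN=30

# Stop mining at critical temp
CRITICAL_TEMP=85

# Set to 1 to disable AMD fan control
NO_AMD=1
EOF
}

# On the rig:
# write_autofan_conf && cat /hive-config/autofan.conf
```

Note that this overwrites the file, so back up any existing autofan.conf first. Leave NO_AMD=1 out if you have AMD cards. Restart the rig afterwards, as with the nano method.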


#15

So after extensive testing, the riser and GPU in question were not defective at all. What actually resolved the issue was reducing the GPU quantity from 12 to 10 in that particular rig. I came to that conclusion yesterday: when I reinstalled the riser and GPU, the error messages reappeared, and when I removed a riser and GPU again, they stopped.

This particular rig used to function properly with 12 GPUs. However, that may have been on Windows, before I switched the rigs over to Hive OS.

For clarification, my other rig that also produced these error messages has only 6 GPUs. So the 11+ GPU issue doesn’t apply in every situation.


#16

Getting the same error. I have five 1060s and five 1080 Tis on a rig with an Asus B250 Mining motherboard. The rig works fine for 24 hours and then gets this error. I will try to fix it with a manual autofan config…


#17

Mine is doing the same thing on both 8-GPU and 10-GPU rigs.

When the error shows up on the dashboard, all my cards drop off the dashboard in red. Although I can still SSH in, the miner runs at 50% hashrate.


#18

I’ve just started using HiveOS (7x 1060 3GB MSI on 270P mobo), but this issue started right away for me. I’ve replaced risers & graphics cards with no change to the error. The error is random in timing, ranging from about 45 minutes to 26 hours.

Using the “Tuning” option & enabling the hashrate watchdog has mitigated the issue for me without requiring manual intervention. I didn’t try the manual autofan config because my experience is that manual fixes are undone by upgrades.

Just my 2¢


#19

Here are my results after creating autofan.config: I got all of these errors in the last 12 hours…

If anyone has suggestions on how to fix these, I would much appreciate it. Thanks.


#20

I had the same problem with one of my freshly upgraded rigs. For now I’ve downgraded to 0.5-60, which it was running before. We’ll see how it goes, but I think it’s the right version to use, from before autofan was implemented. So far it has worked great for a couple of hours. Just a solution for you to try.

If it helped, tips are welcome:
3LMaJKvM5UgWqhJ6dgmLMGryuBafUo9gdT

P.S. Version downgrade thread: Version Downgrade