Changing PCI-E slot forces reinstalling HiveOS from scratch?

Hi, I have 2x 3060 Ti LHR and 1x RX 5700.

Whenever I move the AMD card to a different PCI-E slot (e.g. to check whether some instability during fine-tuning comes from the slot, or simply to clean the card), gminer (I tried other miners too) can no longer use the card. There’s nothing wrong with the hardware itself: reflashing HiveOS fixed it. I thought this might be specific to AMD, but the other cards also misbehave, as if they had aggressive OC settings (they don’t; their OC is tested and runs for 3 days without issues). So it seems that changing the PCI-E slot of an AMD card causes all sorts of issues overall.

I didn’t try reinstalling the AMD drivers, as the HiveOS docs advise against it. I tried selfupgrade, but obviously nothing changed since I’m already on the latest stable version.

Please refrain from suggesting my risers and/or motherboard are the issue. Remember: re-flashing HiveOS from scratch onto the pen drive fixes the issue. The only thing I didn’t try is an SSD instead of the flash drive, but I don’t see how that could affect whether miners can access the GPUs.

amd-info shows everything correctly.

Gminer just fails to start, and the watchdog (WD) tries to restart it in an endless loop. Stopping the miners and simply running /hive/miners/gminer/xx.xx/gminer --list_devices throws the error:
“Illegal instruction (core dumped)”

There’s nothing useful in the miner logs. Should I look at another log?

It’s really not user-friendly if I have to reflash the USB drive every time I want to change the PCI-E slot layout…

I’m using an SSD now, but I don’t plan to remove any card in the near future. Next time I need to clean them, I can test this out.

Nevertheless, it would be nice if someone from the HiveOS team could investigate this issue; it has happened consistently after changing the cards’ PCI-E slots. My motherboard is an MSI Z270A-PRO, if that helps debugging. Clearing CMOS has no effect either, by the way, and if I boot into Windows, all cards are detected by the same miner (gminer, for instance). This is 100% a software issue.

What does the miner say in shell when starting after moving the cards around?

@keaton_hiveon Miner output is blank; the only output displayed is the WD warning stating it will restart the miner. That’s why I mentioned the miner logs are useless in my case and asked whether there’s any other log that might shed some light.

This is why I stopped the miners and tried to list devices with gminer, which again fails completely, with no output apart from the “Illegal instruction (core dumped)” mentioned above.

Once HiveOS gets into this state, reinstalling miners or installing new ones makes no difference; something gets corrupted. I only had gminer for the AMD card, and one of the times HiveOS got into this state I tried installing TRM, which failed too. The output was slightly different though: I COULD run the “list devices” command, but it wouldn’t detect any card and showed a red OpenCL error at the bottom of the output (not sure if related). amd-info consistently works fine.

It’s almost as if your build of gminer (and of all miners) is compiled specifically for one setup, and once that setup changes (e.g. a specific PCI-E device ID) it starts erroring out… I can’t otherwise explain why amd-info sees the card just fine and why the card also works when I boot into Windows. Really weird.

When you tried to install TRM, did it fail? Or did it install fine and then fail when running?

Can you repeat the issue and take some screenshots of your worker overview screen, amd-info, miner watchdog messages, etc.?

@keaton_hiveon I’ve already lost a lot of mining hours due to this. The best I can do until I need to clean or re-pad a card (which fortunately won’t be anytime soon, that’s all done) is get the logs from the pen drive. Tell me which logs you want to see and I’ll retrieve them. I’ve worked with Linux for over a decade, so feel free to ask me to grab whatever you need.

This was replicated consistently 3 times; otherwise I wouldn’t be reporting it and would assume it’s my system’s issue (it could still be related to the pen drive itself, since I didn’t try to replicate it with an SSD; I just moved to one today). I believe it also happened once after getting some AMD driver errors and rebooting, ending in the same state… but I can’t confirm this, as I didn’t try to replicate it, for obvious reasons.

It should be fairly simple for the devs to replicate:

  • Flash HiveOS onto a pen drive (I did it from a separate system);
  • Plug multiple cards into PCI-E slots (in my case 2 Nvidia and 1 AMD, but I’d argue 1 Nvidia + 1 AMD, or even a single AMD card, might give the same result);
  • Boot from the pen drive and do the initial setup;
  • Let it mine for a bit and make sure everything is fine;
  • Shut down (shutdown now) and move the AMD card to a different slot;
  • Boot; the miner assigned to the AMD card should start failing.
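To test the theory that the breakage is tied to the AMD card’s PCI-E address, one could record the card’s PCI address before and after the swap in the steps above. This is a minimal sketch, not part of HiveOS or any miner; it assumes the standard Linux sysfs layout under /sys/bus/pci/devices, where each device directory exposes vendor and class files. 0x1002 is AMD/ATI’s PCI vendor ID and display controllers have a class starting with 0x03 (which filters out the card’s HDMI audio function). The helper name amd_gpu_addresses is hypothetical.

```python
import os

AMD_VENDOR_ID = "0x1002"       # PCI vendor ID used by AMD/ATI GPUs
DISPLAY_CLASS_PREFIX = "0x03"  # PCI base class 0x03 = display controller


def _read(path):
    """Return the stripped contents of a small sysfs file, or '' if unreadable."""
    try:
        with open(path) as f:
            return f.read().strip()
    except OSError:
        return ""


def amd_gpu_addresses(root="/sys/bus/pci/devices"):
    """Hypothetical helper: list PCI addresses (e.g. '0000:03:00.0') of AMD display devices.

    Assumes the standard Linux sysfs layout, where each device directory
    contains 'vendor' and 'class' files holding hex strings.
    """
    addresses = []
    if not os.path.isdir(root):
        return addresses
    for name in sorted(os.listdir(root)):
        dev = os.path.join(root, name)
        vendor = _read(os.path.join(dev, "vendor")).lower()
        pci_class = _read(os.path.join(dev, "class")).lower()
        # Keep only AMD devices whose class marks them as a display controller.
        if vendor == AMD_VENDOR_ID and pci_class.startswith(DISPLAY_CLASS_PREFIX):
            addresses.append(name)
    return addresses


if __name__ == "__main__":
    # Run once before and once after moving the card, then compare the output.
    for addr in amd_gpu_addresses():
        print(addr)
```

Comparing the two outputs only shows whether the card’s PCI address changed with the swap; by itself it says nothing about why the miners fail afterwards.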

PS: to answer your question, all miners install and reinstall fine with HiveOS in this state. They just can’t access the AMD drivers (I’m assuming), because they all fail. I tried 3 miners; to be honest I don’t remember the 3rd one, I just did it for the rule of 3.

I have 30+ rigs and move cards between them all the time, and I’ve never encountered an issue like yours. I don’t think any of my rigs still have the cards/order I set them up with. If I could replicate it myself, I would.

I’ve also moved cards around in Hive just fine. This seems specific to this setup, hence my suggestion that it should be easy to replicate if you replicate the setup too.

  • MSI Z270A-PRO motherboard (not that it should matter: no BIOS setting caused this, nor does changing one fix the issue once HiveOS is in that state)
  • 2x RTX 3060 Ti LHR
  • 1x RX 5700 reference (shouldn’t matter which one; they all have the exact same BIOS)

Swapping the Nvidia cards is fine. The rig used to be Nvidia-only, and this started happening specifically once the RX 5700 was added. If it were a card problem, it wouldn’t work in Windows.

If you tell me what info you need, I can grab it; this isn’t a rig-measuring contest (pun intended). You’re basically asking me to replicate it a 4th time. I’d argue the rule of 3 is plenty to consider this a possible issue worth investigating, but I’m not going to intentionally break my miner once again and force a fresh install when I’ve already replicated this 3 times.

I did my job of not coming here crying without proper replication or with missing information (which, unfortunately, is just blank miner logs). So if what you need is “seeing is believing”, tell me which HiveOS logs contain useful information and I’ll share them. I don’t see how doing it a 4th time would give you any extra information or assurance. The pen drive is still intact, so I can still grab whatever logs were produced the last time it got into that state; that was the last time the pen drive was used before I moved to the SSD, so they’re easy to find.

So… which logs do you need?

I’m not trying to have a measuring contest or saying you’re not having an issue; I’m simply saying I’ve never run into this issue, nor heard of anyone else with your exact issue, especially in a repeatable fashion. Unfortunately, I don’t have a Z270 chipset board on hand to test with.
1.) Do all the cards show up correctly in the dashboard, with temps/fans showing and fan control working, after you move cards around? (This would indicate the driver is working.)
2.) What kernel(s)?
3.) What HiveOS version(s)?
4.) Does it work if you put the cards back in the same order as before?
5.) Does an Nvidia-only miner see the Nvidia cards afterwards?
6.) Does the botched install drive work on other systems after it fails on the Z270?
7.) Does this happen with any other USB drives/SSDs, or only that one drive? (Possibly the drive is failing?)

Basically, I’m just trying to isolate/rule out variables.

Yeah, getting your hands on a Z270 these days won’t be easy, but I honestly don’t think it’s related to the motherboard. If the motherboard had issues, Windows should have them too, not just HiveOS.

  1. Yes, as previously stated. amd-info works fine (along with the other AMD-related binaries, such as the temps one; I can’t remember its name anymore). I assume the dashboard just pulls info from amd-info, but nonetheless: yes, the dashboard also shows all cards correctly. Changing OC/fan values works too and is reflected in amd-info right after.
  2. Whatever kernel ships by default with 0.6-212-stable; it should be 5.10.83 (the last re-flash to fix the issue was yesterday).
  3. 0.6-212-stable
  4. No. Once it gets into this state, any AMD miner has the same issue with the AMD card specifically. In miners that support both Nvidia and AMD (such as T-rex), the list-devices command also shows only the Nvidia cards.
  5. The Nvidia cards keep working, and I can keep switching them around after the breaking point, no problem. It’s specifically the AMD card that never works again after a slot switch until a freshly flashed HiveOS runs. So yes, Nvidia-only miners see the Nvidia cards, and Nvidia+AMD miners also see the Nvidia cards.
  6. Haven’t tried. The best I can do is not boot the pen drive again until the weekend, when I should have some time to try this out. My steps will be: using the botched OS on the pen drive, I’ll check whether the issue is still there, and if so, I’ll move the AMD card to my main gaming PC and boot from the same pen drive.
  7. Unfortunately, this is the only decent 8GB+ pen drive I have around, so I haven’t tried. That’s why one of my first suggestions was either the pen drive itself or HiveOS running on a (any) pen drive. I can’t answer the former for sure, but nothing else points to a bricked pen drive… The latter is something the devs should be able to replicate if it is indeed the issue.

I can still access the pen drive without booting from it, so if you need logs, let me know.

PS: I only mentioned one HiveOS version because, even though this also happened with the previous stable release, nothing changed in the latest release except for miner updates. Not only do I not use the miners that got updated, but it also happened in both versions.
Ignore that; the recent releases (one more today) aren’t for the stable version, I just realised that.

Edit: yeah… you guys did an awesome job obfuscating external access and whatnot. I see now that logs are removed on shutdown, so I can’t give you old logs unless you teach me something I don’t know about Hive 🙂

This topic was automatically closed 416 days after the last reply. New replies are no longer allowed.