this post was submitted on 20 Jan 2024
41 points (93.6% liked)

Linux


Any advice on where to go from here? This console was running dmesg -w to try and catch an intermittent crash... And this is what I got. I am using an el cheapo USB wifi adapter that I'm suspicious of.

Everything was working fine until I rebuilt nixos with Nvidia support... Now my old generations of the OS are crashing after a few minutes (display on, no response to input, keyboard lights don't respond, SysRq doesn't work)

all 19 comments
[–] The2b 18 points 11 months ago* (last edited 11 months ago) (1 children)

You're only showing us part of the error. There should be more above the list of loaded modules that will provide useful information.

dmesg > dmesg-out will give the entire dmesg log as a text file, and you can cut out the irrelevant parts
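As a rough sketch of the "cut out the irrelevant parts" step, the Python below keeps a few lines of context before the first line that looks like the start of a kernel report and drops everything earlier. The marker strings are common kernel log prefixes, but the list, the function name `trim_to_first_report`, and the default of 20 context lines are assumptions made for illustration, not anything the thread specifies.

```python
# Strings that commonly open a kernel warning, oops, or panic report.
# This list is an assumption; extend it if your log uses other prefixes.
MARKERS = ("BUG:", "WARNING:", "Oops:", "Kernel panic", "Call Trace:")


def trim_to_first_report(lines, before=20):
    """Return the log from `before` lines ahead of the first marker to the end.

    Returns an empty list when no marker is found.
    """
    for i, line in enumerate(lines):
        if any(marker in line for marker in MARKERS):
            # Keep some context before the report, then everything after it.
            return lines[max(0, i - before):]
    return []
```

With the file from the command above, something like `trim_to_first_report(open("dmesg-out", errors="replace").readlines())` would give the part worth posting.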

[–] mvirts 6 points 11 months ago (1 children)

Good to know! I need to set that up next time, the whole system was unresponsive when I took the photo.

[–] The2b 4 points 11 months ago (1 children)

In that case it should be in your logs. I believe the default is /var/log/dmesg.log*, depending on how many rotations have occurred since the error.
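A minimal Python sketch for searching a set of rotated logs, newest first, including gzip-compressed rotations. The glob pattern is passed in by the caller; /var/log/dmesg.log* is only the path guessed at above, and whether it exists depends on how the distro sets up logging. The helper names `rotated_logs` and `grep_logs` are made up for this example.

```python
import glob
import gzip
import os


def rotated_logs(pattern):
    """Return files matching `pattern`, newest first by modification time."""
    return sorted(glob.glob(pattern), key=os.path.getmtime, reverse=True)


def grep_logs(pattern, needle):
    """Yield (path, line) for each line containing `needle`; reads .gz files too."""
    for path in rotated_logs(pattern):
        # Rotated logs are often compressed, so pick the opener by extension.
        opener = gzip.open if path.endswith(".gz") else open
        with opener(path, "rt", errors="replace") as fh:
            for line in fh:
                if needle in line:
                    yield path, line.rstrip("\n")
```

For example, `list(grep_logs("/var/log/dmesg.log*", "Call Trace:"))` would show which rotation, if any, captured a trace.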

[–] mvirts 2 points 11 months ago

Lol I checked the system journal but forgot to check if the dmesg log is being written 😹 thanks for the reminder, going to take a look later today

[–] Sorcaeden 12 points 11 months ago* (last edited 11 months ago) (1 children)

I don't pretend to be an expert in this, and I have no idea what the state machine looks like for unauthenticated WiFi. My reading of the call stack is that either you were authenticated and the association with the AP dropped while sending a frame and it puked, or it failed while attempting to authenticate to an AP. I have no idea why a mutex would be taken, or on what, but apparently it timed out.

So why would this happen after a rebuild?

  1. freak accident/timing thing.

  2. I see multiple mt## modules loaded. Without looking it up, I suspect they drive a MediaTek chip in that dongle and are potentially conflicting.

  3. Lots of wifi devices I've seen recently load their firmware separately from the driver, from /usr/lib (or lib64)/firmware. The firmware version may have changed from before and may need updating now, or you updated it before, or whatever.
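To check point 2 against a saved trace, here is a small Python sketch that pulls the mt## module names out of a "Modules linked in:" line. The function name `mediatek_modules` and the "mt followed by a digit" rule are assumptions for illustration. As far as I know, the mt76 driver family is split into several helper modules, so more than one match is not by itself proof of a conflict.

```python
import re


def mediatek_modules(log_line):
    """Return module names from a 'Modules linked in:' line that start with mt<digit>."""
    _, _, tail = log_line.partition("Modules linked in:")
    # Drop taint annotations such as "(PO)" that the kernel appends to some names.
    names = [re.sub(r"\(.*\)$", "", token) for token in tail.split()]
    return [name for name in names if re.match(r"mt\d", name)]
```

Feed it the matching line from the dmesg output. The remaining names can then be compared with what the adapter actually needs.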

I agree with others - I'd give you a fiver if it happens again without the adapter connected.

[–] mvirts 3 points 11 months ago

I think you're right, it is a mediatek chip and I used to add the USB device id manually to load the module, but with nixos 23.11 it started working automatically. I'm also running a preemptable kernel... Probably related now that I think about it :P

I should track down the firmware, that was one of the things I was looking into when setting up the device id hack.

I think this happened once before after uptime of about a week... But I didn't get any information from that crash. Also, I'm remembering that some configurations were failing to see this wifi device and falling back to wired so maybe this has been a hidden problem since the new nixos release...

Thanks to everyone for your thoughts, it's very helpful.

[–] mkwt 10 points 11 months ago (1 children)

Don't know much, but nl80211 in the stack is indicative that the crash happens in a WiFi driver.

Looks like maybe some bad behaviour with a mutex.

[–] [email protected] 9 points 11 months ago

It's the networking stack causing the panic, my guess is the WiFi card gets sad.

[–] [email protected] 7 points 11 months ago (1 children)

Does it still get the error without the wifi adapter connected? The stack trace shows some network-related stuff (which doesn't necessarily mean that's where the issue arose, but it would be a bit of a coincidence given what you said).

That's the first thing I'd try, and if removing the adapter fixes it (long term) I wouldn't use the adapter anymore. Sometimes broken hardware breaks other hardware it's connected to.

If removing the adapter doesn't fix it, then the next thing I'd try is booting back into the known-good old OS, maybe removing the NVidia card. Basically, simplify everything one step at a time until it stops happening, if you can.

[–] mvirts 1 points 11 months ago

Next chance I get I'm booting without the USB wifi adapter. I'm worried I may have broken something because it was mostly stable before :/ lol I actually don't have the Nvidia card yet, I ordered a cheap Tesla K80 that's arriving on Tuesday 😹 and it already brokey system :P

That's a good idea, I have an Ubuntu partition that I should try.

[–] [email protected] 3 points 11 months ago (1 children)

Comm: wpa_supplicant being the wifi function made me suspicious of your wifi hardware even before I saw the rest of your post. I've had the best success with PCIe based wifi cards (if this is a desktop pc).

[–] mvirts 2 points 11 months ago

Agreed, this wifi stick was mega cheap on AliExpress so I went for it. I may take a look at the PCB in detail if removing it restores order to my PC. Yes, desktop PC (still hanging on to 2012 hardware woohoo!)

[–] [email protected] 3 points 11 months ago* (last edited 11 months ago) (1 children)
[–] mvirts 3 points 11 months ago

Good going asrock :P

[–] [email protected] 3 points 11 months ago* (last edited 11 months ago) (1 children)

Look at the line with the asm_exc_invalid_op. That seems like a hardware fault caused by an invalid asm instruction to me. Either something wrong is being interpreted as an opcode (unlikely) or maybe the driver was compiled with extensions not available on the current machine.

OP, how old is your CPU? And how old is the nic you are using?

Edit: ~~did you use a custom driver for the NIC? I'm looking at the Linux src and rt_mutex_schedule does not exist.~~ Never mind. I was checking 4.18 instead of 6.7. Found it now. The bug is most likely inside a macro called preempt_disable(). Unfortunately most of the functions are pretty heavily inlined and architecture dependent, so you won't get much out of it. But it is likely that any changes you made in terms of preemption are also contributing to the bug.
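One way to test the "compiled with extensions not available on the current machine" theory is to compare the CPU's advertised feature flags against whatever the build is suspected to need. The Python sketch below parses the `flags` line from /proc/cpuinfo text on x86. Which flags to require is an assumption you have to supply, since nothing in the thread says which extension would be involved, and the helper names are made up for this example.

```python
def cpu_flags(cpuinfo_text):
    """Return the set of feature flags from the first 'flags' line of /proc/cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            return set(value.split())
    return set()


def missing_flags(cpuinfo_text, required):
    """Return the required flags that the CPU does not advertise, sorted."""
    return sorted(set(required) - cpu_flags(cpuinfo_text))
```

On the affected machine, `missing_flags(open("/proc/cpuinfo").read(), ["avx2"])` would return `["avx2"]` if the CPU lacks AVX2, and an empty list otherwise. Substitute whichever flags you suspect.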

[–] mvirts 1 points 11 months ago (1 children)

It's a 3770k... So super old? 😅 The USB nic is this guy: CF-953AX https://a.aliexpress.com/_mNfj796

Maybe I should set up a config that doesn't use a preemptable kernel for when I want faster wifi :P

Maybe this is my chance to actually fix something kernel related

Thanks for taking a look at this, your comments are super helpful.

[–] [email protected] 2 points 11 months ago (1 children)

My suggestion would be to try compiling the kernel locally. It's highly likely the one packaged in your distro contains extensions that you don't have. Doing a local native compile should rule that out pretty quickly without having to disable any additional features.

[–] mvirts 1 points 11 months ago

Looks like dmesg isn't being logged to disk... But I made my font smaller 😹 Definitely more to go on there, this happened while playing Minecraft with a small human so I didn't dig into it yet. I'm pretty sure the kernel I'm running was built by a derivation that applies some preempt patches so I'll start there. Ubuntu works fine with the adapter, but it's also not a preemptable kernel.