this post was submitted on 01 Dec 2023

The system:

MSI Raider GE67 HX 12UHS

Intel Core i9-12900HX

nVidia GeForce RTX 3080Ti (laptop)

32GiB RAM

Win11 Pro 64-bit

The problem:

Once in a while (usually 2-3 times per day), the system crashes, usually resulting in a blue screen with one of various error codes. Codes I've seen include:

HYPERVISOR_ERROR

CLOCK_WATCHDOG_TIMEOUT

VIDEO_TDR_FAILURE

IRQL_NOT_LESS_OR_EQUAL

Sometimes the system hangs but the blue screen never comes, and I have to power it off manually. When this happens, the fans go to full speed and yet the laptop quickly becomes incredibly hot if I don't power it off as soon as possible, suggesting that the CPU or GPU is maxing out for some reason.

Checking with Event Viewer shows nothing out of the ordinary in the lead up to the crash.
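For anyone tracking a similar mix of stop codes, a hand-kept tally can show whether one code dominates. The sketch below is only an illustration: the bug check values are the ones Microsoft documents for the four codes listed above, while `tally_crashes` and the example log are hypothetical and not data from this post.

```python
from collections import Counter

# Documented bug check values for the stop codes named in the post.
BUGCHECK_CODES = {
    "HYPERVISOR_ERROR": 0x00020001,
    "CLOCK_WATCHDOG_TIMEOUT": 0x00000101,
    "VIDEO_TDR_FAILURE": 0x00000116,
    "IRQL_NOT_LESS_OR_EQUAL": 0x0000000A,
}


def tally_crashes(observed):
    """Count how often each stop code appears in a hand-kept crash log.

    Returns (name, hex value or "unknown", count) tuples, most frequent first.
    """
    rows = []
    for name, count in Counter(observed).most_common():
        code = BUGCHECK_CODES.get(name)
        label = f"0x{code:08X}" if code is not None else "unknown"
        rows.append((name, label, count))
    return rows


if __name__ == "__main__":
    # Hypothetical log entries, made up for illustration only.
    example_log = [
        "CLOCK_WATCHDOG_TIMEOUT",
        "VIDEO_TDR_FAILURE",
        "CLOCK_WATCHDOG_TIMEOUT",
        "IRQL_NOT_LESS_OR_EQUAL",
    ]
    for row in tally_crashes(example_log):
        print(row)
```

A spread across unrelated codes, as described here, tends to point away from a single driver and toward hardware or firmware, but a tally like this is only a starting point.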

Things I've ruled out:

I initially thought it only happened while plugged in, and bought a new power supply. That didn't seem to affect the frequency of the issue, and I have now also seen it happen while on battery.

I also initially thought it was more frequent while playing games that use the dedicated graphics card, but I'm not sure that's actually true; I have seen it happen even while just watching YouTube.

At one point I felt that it happened more when I moved the laptop or plugged in USB devices, but I think that may be magical thinking; I have never been able to make it happen on purpose by doing those things.

It does seem to be true that after it happens, if I let the laptop restart automatically, it often happens again within a short time, whereas shutting down and then turning it back on gives more time before the next incident.
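The impression that an automatic restart crashes again sooner than a cold boot could be checked by noting how long the system stays up after each kind of boot. This is a minimal sketch, assuming a hand-written list of observations; the function name, the boot labels, and the sample numbers are all hypothetical.

```python
from statistics import mean


def mean_uptime_by_boot(records):
    """Average minutes until the next crash, grouped by how the system was booted.

    records: iterable of (boot_type, minutes_until_crash) tuples.
    """
    grouped = {}
    for boot_type, minutes in records:
        grouped.setdefault(boot_type, []).append(minutes)
    return {boot_type: mean(values) for boot_type, values in grouped.items()}


if __name__ == "__main__":
    # Made-up observations for illustration, not measurements from the post.
    sample = [
        ("auto_restart", 20),
        ("auto_restart", 40),
        ("cold_boot", 300),
        ("cold_boot", 500),
    ]
    print(mean_uptime_by_boot(sample))
```

With only a handful of crashes per day the averages will be noisy, so a consistent gap over a week or two is more convincing than any single pair of numbers.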

Solutions I've tried:

I tried updating the BIOS and the Intel firmware to the latest available on MSI's website, but that doesn't seem to have helped. I also updated my nVidia drivers.

A possibly related issue:

A week or so before this happened for the first time, I updated the BIOS to fix a different issue. What happened then was: I was playing a game on battery unintentionally, and didn't notice until the "low battery - switching to Super Battery" warning appeared and began throttling system performance. I plugged the laptop in, but performance didn't improve. I restarted and performance was terrible across all applications, even Firefox. I checked Resource Manager and noticed that the CPU was being throttled down to around 0.16 GHz. Event Viewer was showing warnings about this that said the processor was being limited by system firmware.

I tried using various Windows and MSI power management settings to resolve the issue, which persisted across restarts, fully charging the battery, etc. In the end, I solved it by updating the BIOS (to a version that is now one version back from the most current one).
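A rough way to flag this kind of throttling is to compare the observed clock speed against the CPU's rated speed. The sketch below assumes both values are read by hand (for example from Task Manager); `looks_throttled`, the 0.5 threshold, and the 2.0 GHz rated value are placeholders chosen for illustration, and only the 0.16 GHz figure comes from the post.

```python
def looks_throttled(current_ghz, rated_ghz, threshold=0.5):
    """Return True when the observed clock is below `threshold` times the rated clock."""
    if rated_ghz <= 0:
        raise ValueError("rated_ghz must be positive")
    return current_ghz < threshold * rated_ghz


if __name__ == "__main__":
    # 0.16 GHz is the value reported in the post; 2.0 GHz is a placeholder,
    # not the actual rated speed of this CPU.
    print(looks_throttled(0.16, 2.0))
```

A clock stuck at a small fraction of the rated speed across restarts, as described above, suggests a limit imposed by firmware rather than a Windows power plan.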

It was a while, maybe a week, after running the update that the crash happened for the first time.

Current theory:

Is it possible I screwed up the BIOS update somehow? I noticed that it instructs you to return clock speeds to stock before doing the update. I don't think I've manually adjusted them, but MSI's "MSI Center" software seems to offer automatic adjustment. It was set to "Balanced" when I did the most recent update, but it may have been set to "Auto" when I did the first one, which I guess could be a problem if the CPU was automatically overclocked.

[–] [email protected] 4 points 1 year ago (2 children)

Doesn't seem like you have tested the RAM, have you? Repeated BSODs with different error codes can be a sign of bad RAM, and loading memtest on a bootable USB is a great way to test for that.

[–] [email protected] 1 points 1 year ago* (last edited 1 year ago) (1 children)

Update: memtest86 passed! That's good, I guess, but I really did think this was the best suggestion, so I'm kind of surprised. I'm going to find a test for the graphics card, and if it passes I'm following the other recommendation to clean reinstall the OS.

[–] [email protected] 2 points 1 year ago (1 children)

Good and bad news indeed. I think you've got the right course of action: if it's not a discernible piece of hardware, then a nuclear approach to software is warranted. BIOS/microcode updates are another thing I would try as well. I wish you luck!

[–] [email protected] 1 points 10 months ago* (last edited 10 months ago)

Hey just so you know, I finally got around to fixing this after puttering around with it a bit at a time for months, and long story short the SSD was failing, despite several test programs claiming it was good (???). New SSD is running fine.

Edit: Well, that didn't last long. The bluescreens are back on the new hardware with a clean install. New hypothesis: whatever is causing them is also what caused the previous SSD to fail. Rather than sacrifice additional components trying to figure it out, I'm just going to call it here and see if it's still under warranty.

[–] [email protected] 1 points 1 year ago

Good call, I will do that!