Yet another interesting case lands on our doorstep thanks to NTDEV (original post here).
I firmly believe that you have zero chance of diagnosing a non-trivial crash if you don’t understand the bugcheck code. The bugcheck code is, in fact, THE definitive reason for the crash. Of course, just understanding the bugcheck code is hardly ever sufficient to diagnose the problem, but it’s fundamental to how you approach the particular crash.
The OP’s crash was fun for me because I’d never seen the bugcheck code before and I’m always happy to meet a new type of system crash (especially if I wasn’t the one that caused it):
IRQL_UNEXPECTED_VALUE (c8)
The processor's IRQL is not what it should be at this time. This is
usually caused by a lower level routine changing IRQL for some period
and not restoring IRQL at the end of that period (eg acquires spinlock
but doesn't release it).
    if UniqueValue is 0 or 1
        2 = APC->KernelRoutine
        3 = APC
        4 = APC->NormalRoutine
Arguments:
Arg1: 0000000000000000, (Current IRQL << 16) | (Expected IRQL << 8) | UniqueValue
Arg2: 0000000000000002
Arg3: 0000000000000000
Arg4: 0000000000000000
Any time I’m presented with a crash, I try to dwell on the crash code and its arguments for a while before looking at anything else. In this case I was on board and feeling pretty good about the crash as I read the description. It seems reasonable to have a crash that results from someone changing the IRQL without ever restoring it.
Once I got to the arguments though I went right off a cliff. As I followed the description and decoded the arguments I learned that:
- The Current IRQL is 0
- The Expected IRQL is 0
- UniqueValue is 0, so:
- Arg2 is an APC’s Kernel Routine and it’s 2
- Arg3 is an APC and it’s NULL
- Arg4 is an APC’s Normal Routine and it’s 0
Presumably this crash only happens if the “Current IRQL” doesn’t match the “Expected IRQL”, but what I just decoded doesn’t support that. The other arguments don’t make sense to me either, because I’d expect some other kind of crash if someone queued a NULL APC or an APC with a Kernel Routine set to 2.
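The decoding above can be sketched in a few lines of C. This is only an illustration of the formula printed in the bugcheck description, (Current IRQL << 16) | (Expected IRQL << 8) | UniqueValue. The struct and function names are made up for this example, and treating each field as exactly 8 bits wide is an assumption based on the shift amounts.

```c
#include <stdint.h>
#include <stdio.h>

/* Fields packed into Arg1 per the bugcheck description:
 * (Current IRQL << 16) | (Expected IRQL << 8) | UniqueValue.
 * Hypothetical names; not part of any Windows header.
 * Assumption: each field occupies 8 bits. */
typedef struct {
    uint8_t current_irql;
    uint8_t expected_irql;
    uint8_t unique_value;
} c8_arg1_fields;

/* Split a raw Arg1 value into its three documented fields. */
c8_arg1_fields decode_c8_arg1(uint64_t arg1)
{
    c8_arg1_fields f;
    f.current_irql  = (uint8_t)((arg1 >> 16) & 0xFF);
    f.expected_irql = (uint8_t)((arg1 >> 8) & 0xFF);
    f.unique_value  = (uint8_t)(arg1 & 0xFF);
    return f;
}

/* Print the decoded fields of one Arg1 value. */
void print_c8_arg1(uint64_t arg1)
{
    c8_arg1_fields f = decode_c8_arg1(arg1);
    printf("Current IRQL = %u, Expected IRQL = %u, UniqueValue = %u\n",
           (unsigned)f.current_irql,
           (unsigned)f.expected_irql,
           (unsigned)f.unique_value);
}
```

Calling print_c8_arg1(0) with the Arg1 from this dump reports all three fields as 0, which is exactly the contradiction described above: a current IRQL that matches the expected one.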
This got me curious as to what the crash code actually meant, so I broke out WinDbg and started poking. The call stack indicated that ndis!ndisExpandStack called some function exported by NT, which then ended up in some optimized code area and crashed the machine:
nt!KeBugCheckEx
nt! ?? ::FNODOBFM::`string'+0x18d14
ndis!ndisExpandStack+0x19
Based on the name of the NDIS function I had a guess as to what function this was, but to confirm I disassembled ndis!ndisExpandStack:
uf ndis!ndisexpandstack
sub     rsp,38h
and     qword ptr [rsp+20h],0
xor     r9d,r9d
mov     r8d,4CCCh
call    qword ptr [ndis!_imp_KeExpandKernelStackAndCalloutEx]
add     rsp,38h
ret
The theory at this point is that KeExpandKernelStackAndCalloutEx is the one generating the bugcheck code.
Looking at that function I see the source of the 0xC8 bugcheck and the mystery of the arguments is solved:
mov rsi, cr8                ; Get the current IRQL
...
call qword ptr [rsp+60h]    ; Call the stack expand callback
...
mov rax, cr8                ; Get the current IRQL
cmp al, sil                 ; Are they the same?
jz short loc_14007040C      ; If yes just return, otherwise crash with C8
movzx r8d, al               ; BugcheckArg1 = Current IRQL
movzx edx, sil              ; BugcheckArg2 = Previous IRQL
...
mov ecx, 0C8h               ; IRQL_UNEXPECTED_VALUE
call KeBugCheckEx           ; Grand closing...
The bugcheck makes much more sense now. Someone’s stack expansion callback was called at DISPATCH_LEVEL (Arg2 == 2) and returned at PASSIVE_LEVEL (Arg1 == 0). That’s against the rules, thus you get a system crash.
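To make the check concrete, here is a small user-mode model of the save, call, compare sequence from the disassembly. It is a sketch only: the global g_irql stands in for CR8, and checked_callout, callout_result and both callbacks are hypothetical names invented for this illustration. The real routine calls KeBugCheckEx on a mismatch instead of returning a result.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for the processor's IRQL (CR8 on x64). Purely a user-mode model. */
static uint8_t g_irql = 0;

typedef void (*callout_fn)(void *context);

/* What the model reports instead of bugchecking. Hypothetical type. */
typedef struct {
    bool    mismatch;        /* true where the real code would raise 0xC8 */
    uint8_t irql_at_entry;   /* IRQL captured before the callback ran */
    uint8_t irql_at_return;  /* IRQL observed after the callback returned */
} callout_result;

/* Mirror of the disassembled logic: capture the IRQL, run the callback,
 * capture the IRQL again, and flag any difference. */
callout_result checked_callout(callout_fn callout, void *context)
{
    callout_result r;
    r.irql_at_entry = g_irql;                            /* mov rsi, cr8 */
    callout(context);                                    /* call the callback */
    r.irql_at_return = g_irql;                           /* mov rax, cr8 */
    r.mismatch = (r.irql_at_return != r.irql_at_entry);  /* cmp al, sil */
    return r;
}

/* Hypothetical callback that changes the IRQL but puts it back. */
void raises_and_restores(void *context)
{
    (void)context;
    uint8_t saved = g_irql;
    g_irql = 2;
    g_irql = saved;
}

/* Hypothetical callback that drops to 0 and never restores the old value. */
void lowers_without_restoring(void *context)
{
    (void)context;
    g_irql = 0;
}
```

With g_irql set to 2 beforehand, checked_callout(raises_and_restores, NULL) reports no mismatch, while checked_callout(lowers_without_restoring, NULL) reports an entry IRQL of 2 and a return IRQL of 0, the same pair of values read out of this dump.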
Personally, I would call this a bug in KeExpandKernelStackAndCalloutEx, seeing as how it generates an IRQL_UNEXPECTED_VALUE bugcheck using invalid (unexpected?) arguments. At a minimum, though, the documentation is currently wrong, and I have filed a bug to try to get that addressed.