One of my favorite parts of learning a new programming model is learning what happens when something fails. Getting a simple program built and running is fun and all, but until you dig into a serious project you don’t realize what the repercussions are for your stupid, blindingly obvious mistakes.
Once you do make mistakes, you’re often left wondering what the heck just failed and following insane threads of logic as to why. If you’re especially foolish and write a lot of code before testing it, it’s a lot like searching for a needle in a haystack… only you don’t know what a needle looks like, and the guy who told you it was there only speaks Old English. If you only write a little code first, you might have a hay bale rather than a haystack, but the basic problem of not knowing anything yet is still there.
The fun part about driver code is that kernel modules generally trust each other, so innocent-looking but horribly broken code can thrash around for a while before someone notices. Fortunately, Windows provides a few tools to help you catch things. Driver Verifier is amazing, and the stuff it does with the pool to help catch bad accesses is really clever. Four or five lines of SAL before a function declaration might seem excessive at first, and even unreadable, but it’s definitely worth using – the static analysis tool will often catch you breaking your own rules before you have to spend hours debugging it.
Of course, it’s still impossible for any tool to read your mind and figure out what you actually meant with your broken code. For example, the other day my previously working code triggered a WORKER_THREAD_RETURNED_AT_BAD_IRQL bugcheck.
“Oh, I know what IRQL is,” I said to myself. “I’ll just look for any functions that raise it and see how I somehow returned without lowering it.”
I couldn’t find any places where I was obviously raising the IRQL without lowering it, so I figured that one of the system calls I was making raised the IRQL. Unfortunately, none of those raised the IRQL either. At this point I began doubting my own sanity, so I added ASSERTs at the beginning and end of each function to make sure that I wasn’t changing the IRQL. Even with this in place, I still couldn’t see anything wrong.
Eventually, I asked one of the engineers here how this could be. Basically, I was told that it couldn’t – and that I should check that I’m not corrupting memory somehow.
Of course, my mistake was not realizing that a bugcheck is like any other type of fatal error you might hit – it’s only the last thing you see before the program keels over and dies. It’s a symptom of your original problem, but it might not be directly related. This should be obvious, but the bugcheck code was so descriptive and specific that it was really tempting to latch on to it.
So, what was ye olde bugchecke trying to tell me? My needle ended up being shaped something like this:
typedef struct _MY_COOL_TYPE {
    LIST_ENTRY RightField;
    ULONG Counter;
    ULONGLONG WrongField;
} MY_COOL_TYPE, *PMY_COOL_TYPE;

// ...

PMY_COOL_TYPE someType;

// Bug: listEntry points at RightField, but the offset of WrongField is subtracted.
someType = (PMY_COOL_TYPE)CONTAINING_RECORD(listEntry, MY_COOL_TYPE, WrongField);
someType->Counter = 1;
By passing the wrong member as the field argument to the CONTAINING_RECORD macro, I got back a pointer to an invalid memory location: the macro subtracts that member’s offset from the list entry’s address, so the result pointed somewhere before my actual structure. Unfortunately, that “invalid” memory location was actually valid to someone else. By writing through the invalid pointer I then corrupted their state sufficiently that they set or restored an invalid IRQL.
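To make the pointer arithmetic concrete, here is a minimal user-mode sketch of the same mistake. It is not my driver code: LIST_ENTRY and CONTAINING_RECORD are redefined locally as stand-ins for the kernel headers so it compiles anywhere, and DEMO_RESULT and run_demo are hypothetical names invented for the illustration. It only computes addresses and never writes through the bad pointer.

```c
#include <assert.h>
#include <stddef.h>

/* User-mode stand-ins for the kernel definitions (assumption: same layout idea). */
typedef struct _LIST_ENTRY {
    struct _LIST_ENTRY *Flink;
    struct _LIST_ENTRY *Blink;
} LIST_ENTRY;

#define CONTAINING_RECORD(address, type, field) \
    ((type *)((char *)(address) - offsetof(type, field)))

typedef struct _MY_COOL_TYPE {
    LIST_ENTRY RightField;
    unsigned long Counter;
    unsigned long long WrongField;
} MY_COOL_TYPE;

/* Hypothetical result type for this demo. */
typedef struct {
    ptrdiff_t recordSkew;   /* real record address minus the wrongly computed one */
    int       hitsNeighbor; /* nonzero if a Counter write would land in records[0] */
} DEMO_RESULT;

static DEMO_RESULT run_demo(void)
{
    /* Two adjacent records: records[0] plays the innocent bystander. */
    static MY_COOL_TYPE records[2];
    LIST_ENTRY *listEntry = &records[1].RightField;

    MY_COOL_TYPE *right = CONTAINING_RECORD(listEntry, MY_COOL_TYPE, RightField);
    MY_COOL_TYPE *wrong = CONTAINING_RECORD(listEntry, MY_COOL_TYPE, WrongField);
    assert(right == &records[1]);

    /* Where "wrong->Counter = 1" would write, computed without dereferencing. */
    char *target = (char *)wrong + offsetof(MY_COOL_TYPE, Counter);

    DEMO_RESULT result;
    result.recordSkew = (char *)right - (char *)wrong;
    result.hitsNeighbor = target >= (char *)&records[0] &&
                          target <  (char *)&records[1];
    return result;
}
```

Calling run_demo() from a test program shows the computed pointer sitting exactly offsetof(MY_COOL_TYPE, WrongField) bytes below the real record, with the Counter write landing inside the neighbouring record. In the kernel, that neighbour is whoever happens to own the adjacent memory.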
The original problem ended up having nothing (directly) to do with IRQL at all, and was just a case of me pasting a CONTAINING_RECORD macro without triple-checking my arguments. Definitely not what I was expecting when I first saw the bugcheck!
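One way to get the compiler to do some of that triple-checking is a thin wrapper that refuses any field that is not a LIST_ENTRY. The sketch below is an assumption on my part rather than anything the WDK provides: RECORD_FROM_LIST_ENTRY and record_from_entry are hypothetical names, it relies on C11 _Generic being available in your toolchain, and LIST_ENTRY and CONTAINING_RECORD are again redefined locally so it builds in user mode.

```c
#include <assert.h>
#include <stddef.h>

/* User-mode stand-ins for the kernel definitions. */
typedef struct _LIST_ENTRY {
    struct _LIST_ENTRY *Flink;
    struct _LIST_ENTRY *Blink;
} LIST_ENTRY;

#define CONTAINING_RECORD(address, type, field) \
    ((type *)((char *)(address) - offsetof(type, field)))

/* Hypothetical wrapper: only compiles when `field` is a LIST_ENTRY member of
   `type`. The controlling expression of _Generic is never evaluated, so the
   null pointer is not dereferenced. */
#define RECORD_FROM_LIST_ENTRY(address, type, field)               \
    _Generic(&((type *)0)->field,                                   \
             LIST_ENTRY *: CONTAINING_RECORD(address, type, field))

typedef struct _MY_COOL_TYPE {
    LIST_ENTRY RightField;
    unsigned long Counter;
    unsigned long long WrongField;
} MY_COOL_TYPE;

static MY_COOL_TYPE *record_from_entry(LIST_ENTRY *listEntry)
{
    assert(listEntry != NULL);
    /* Replacing RightField with WrongField here fails to compile, because
       WrongField is not a LIST_ENTRY and _Generic has no matching branch. */
    return RECORD_FROM_LIST_ENTRY(listEntry, MY_COOL_TYPE, RightField);
}
```

This only checks the type of the field, so it would not help with a structure that contains two LIST_ENTRY members and the wrong one is named. It would have caught my particular paste error, though.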