Last reviewed and updated: 10 August 2020
Most of us are still stuck supporting drivers and system components that date all the way back to Windows 7. If you’re one of these “lucky” folks, you might want to read about an interesting race condition we encountered when a driver was being unloaded and loaded at the same time. In this article, we describe what we saw, how we reached our conclusion, and the steps we considered to mitigate this particular problem.
The Crash
We were recently given a crash dump from a system that had been under test with a file system filter driver that performs isolation – that is, it controls the cache and uses shadow file objects to distinguish between the resources that it controls and the resources that belong to the underlying file system (typically NTFS).
Analyzing the crash with WinDbg, we found two interesting threads. Here’s the first:
THREAD fffffa80018b8b50 Cid 0004.0044 Teb: 0000000000000000 Win32Thread: 0000000000000000 RUNNING on processor 2
Not impersonating
DeviceMap fffff8a0000060f0
Owning Process fffffa8001844840 Image: System
Attached Process N/A Image: N/A
Wait Start TickCount 264493 Ticks: 1 (0:00:00:00.015)
Context Switch Count 60091 IdealProcessor: 2
UserTime 00:00:00.000
KernelTime 00:00:01.263
Win32 Start Address nt!ExpWorkerThread (0xfffff80002ad8530)
Stack Init fffff88003195db0 Current fffff88003195230
Base fffff88003196000 Limit fffff88003190000 Call 0
Priority 13 BasePriority 12 UnusualBoost 0 ForegroundBoost 0 IoPriority 2 PagePriority 5
Child-SP RetAddr Call Site
fffff880`03194fc0 fffff800`02db4c57 nt!ObLogSecurityDescriptor+0x50
fffff880`03195030 fffff800`02db6057 nt!SeDefaultObjectMethod+0x57
fffff880`03195080 fffff800`02db4ee2 nt!ObpAssignSecurity+0xc7
fffff880`031950f0 fffff800`02db76ff nt!ObInsertObjectEx+0x1e2
fffff880`03195340 fffff800`02db6b06 nt!PspInsertThread+0x2f3
fffff880`031954c0 fffff800`02d65da5 nt!PspCreateThread+0x246
fffff880`03195740 fffff880`0799b3e2 nt!PsCreateSystemThread+0x125
fffff880`03195830 fffff880`0799b6d8 Driver!SetupReadWorkQueue+0xe2 [x:\driver\isolate\workerqueue.cpp @ 125]
fffff880`03195890 fffff880`0799c34a Driver!SetupWorkerQueues+0x22c [x:\driver\isolate\workerqueue.cpp @ 333]
fffff880`03195970 fffff800`02eb32c7 Driver!DriverEntry+0x72 [x:\driver\isolate\driver.cpp @ 43]
fffff880`031959a0 fffff800`02eb36c5 nt!IopLoadDriver+0xa07
fffff880`03195c70 fffff800`02ad8641 nt!IopLoadUnloadDriver+0x55
fffff880`03195cb0 fffff800`02d65e5a nt!ExpWorkerThread+0x111
fffff880`03195d40 fffff800`02abfd26 nt!PspSystemThreadStartup+0x5a
fffff880`03195d80 00000000`00000000 nt!KiStartSystemThread+0x16
This is the DriverEntry thread. It is setting up various global resources; in this case it is in the middle of creating a work queue for a custom queue package that runs in this driver.
Here is the second thread:
THREAD fffffa80018b9b50 Cid 0004.0038 Teb: 0000000000000000 Win32Thread: 0000000000000000 RUNNING on processor 1
Not impersonating
DeviceMap fffff8a0000060f0
Owning Process fffffa8001844840 Image: System
Attached Process N/A Image: N/A
Wait Start TickCount 264493 Ticks: 1 (0:00:00:00.015)
Context Switch Count 52067 IdealProcessor: 2
UserTime 00:00:00.000
KernelTime 00:00:01.357
Win32 Start Address nt!ExpWorkerThread (0xfffff80002ad8530)
Stack Init fffff88003180db0 Current fffff880031809e0
Base fffff88003181000 Limit fffff8800317b000 Call 0
Priority 13 BasePriority 12 UnusualBoost 1 ForegroundBoost 0 IoPriority 2 PagePriority 5
Child-SP RetAddr Call Site
fffff880`0317f7e8 fffff800`02e391c4 nt!KeBugCheckEx
fffff880`0317f7f0 fffff800`02df405d nt!PspUnhandledExceptionInSystemThread+0x24
fffff880`0317f830 fffff800`02afa06c nt! ?? ::NNGAKEGL::`string'+0x227d
fffff880`0317f860 fffff800`02af9aed nt!_C_specific_handler+0x8c
fffff880`0317f8d0 fffff800`02af88c5 nt!RtlpExecuteHandlerForException+0xd
fffff880`0317f900 fffff800`02b09851 nt!RtlDispatchException+0x415
fffff880`0317ffe0 fffff800`02ace642 nt!KiDispatchException+0x135
fffff880`03180680 fffff800`02acd1ba nt!KiExceptionDispatch+0xc2
fffff880`03180860 fffff880`0799b724 nt!KiPageFault+0x23a (TrapFrame @ fffff880`03180860)
fffff880`031809f0 fffff880`0799c477 Driver!StopWorkerQueues+0x14 [x:\driver\isolate\workerqueue.cpp @ 351]
fffff880`03180a20 fffff880`010fae09 Driver!UnloadCallback+0xd3 [x:\driver\isolate\driver.cpp @ 76]
fffff880`03180a80 fffff880`010f9dcd fltmgr!FltpDoUnloadFilter+0xf9
fffff880`03180c70 fffff800`02ad8641 fltmgr!FltpSyncOpWorker+0x2d
fffff880`03180cb0 fffff800`02d65e5a nt!ExpWorkerThread+0x111
fffff880`03180d40 fffff800`02abfd26 nt!PspSystemThreadStartup+0x5a
fffff880`03180d80 00000000`00000000 nt!KiStartSystemThread+0x16
This is a thread that is unloading the driver. The Filter Manager has called the driver's unload callback, which took a page fault in StopWorkerQueues while the first thread was still inside SetupWorkerQueues. The resulting unhandled exception in a system thread is what brought the system down.
Driver load and unload are supposed to be serialized against one another by the operating system, because there is no way for a driver to protect itself against this scenario. Preventing it properly requires external serialization.
We did a bit of research and confirmed with our friends in Redmond that this is a known issue that was fixed in Windows 8. Unfortunately, the system under test (and the customer solution itself) still requires support for Windows XP as the primary platform, and Windows 7 as the secondary platform. Windows 8 is not even on the customer’s radar yet.
Solutions to Consider
One approach to handling this issue before Windows 8 could involve building a multi-driver system. The first driver would be responsible for starting the second driver in a serialized fashion. Driver 1 would load Driver 2 via ZwLoadDriver. When this function returned successfully, Driver 1 would then call Driver 2 (via an IOCTL, FSCTL or exported function) to actually perform the registration as a mini-filter.
Driver 2’s Unload routine would call back to Driver 1 to ensure that the registration call had completed, by waiting on an event object owned by Driver 1. This would enforce strict ordering between the two. The only purpose of Driver 1 would be to avoid this narrow race condition.
Another potential approach that we considered was to have the DriverEntry function create a device object. In the Unload routine, we can look at the Flags field of the device object to see if the DO_DEVICE_INITIALIZING bit has been cleared. If it has not, then we know that there is still a risk that DriverEntry has not yet exited and we should sleep and then check again.
This relies upon the fact that the I/O Manager clears this bit after DriverEntry returns, as the WDK documentation notes:
Note: It is not necessary to clear the DO_DEVICE_INITIALIZING flag on device objects that are created in DriverEntry, because this is done automatically by the I/O Manager. However, your driver should clear this flag on all other device objects that it creates.
Source: http://msdn.microsoft.com/en-ca/library/windows/hardware/ff539265(v=vs.85).aspx (Last Accessed 11 May 2020).
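The check described above might look something like the following. This is a user-mode C model rather than driver code, so that the logic can be compiled and exercised anywhere: ModelDeviceObject, ModelSleepFn and WaitForDriverEntryExit are hypothetical names invented for this sketch. Only the DO_DEVICE_INITIALIZING value (0x80 in wdm.h) comes from the WDK. In a real Unload routine the flags would be read from the DEVICE_OBJECT created in DriverEntry, and the sleep would be a call such as KeDelayExecutionThread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Same value as DO_DEVICE_INITIALIZING in wdm.h. */
#define DO_DEVICE_INITIALIZING 0x00000080u

/* Hypothetical stand-in for the Flags field of a DEVICE_OBJECT. */
typedef struct {
    volatile uint32_t Flags;
} ModelDeviceObject;

/* Hypothetical sleep hook; a driver would call KeDelayExecutionThread. */
typedef void (*ModelSleepFn)(void *context);

/*
 * Poll until DO_DEVICE_INITIALIZING is clear, which the I/O Manager arranges
 * only after DriverEntry has returned. Gives up after max_polls attempts so
 * that Unload cannot spin forever. Returns true if the bit is clear.
 * A NULL sleep_fn polls without sleeping (useful only in tests).
 */
bool WaitForDriverEntryExit(const ModelDeviceObject *device,
                            unsigned max_polls,
                            ModelSleepFn sleep_fn,
                            void *context)
{
    assert(device != NULL);
    for (unsigned i = 0; i < max_polls; i++) {
        if ((device->Flags & DO_DEVICE_INITIALIZING) == 0) {
            return true;            /* DriverEntry has exited */
        }
        if (sleep_fn != NULL) {
            sleep_fn(context);      /* back off, then check again */
        }
    }
    return (device->Flags & DO_DEVICE_INITIALIZING) == 0;
}
```

The poll limit is a design choice of this sketch, not something from the original analysis: if the bit never clears, Unload has to decide whether to proceed anyway or fail the unload.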
Mitigation
Building a two-driver system to protect against a very narrow race condition might be overkill in a situation like this. So rather than an outright solution, what can we do to at least minimize the window in which DriverEntry could still be running?
The simplest thing we can do is make sure the driver performs filter registration as its last step, after setting up all of its other internal data structures and queues. This doesn’t entirely prevent the crash, but it shrinks the window considerably. This is ultimately the approach the owner of the driver took to address the problem.
However, if that hadn’t been enough, we proposed using a global event in the driver that is set at the end of DriverEntry. The Unload routine would wait on that event and then pause for some period of time. This wouldn’t entirely prevent the race condition, because DriverEntry still has to return to the I/O Manager after setting the event, but it would further narrow the window in which it could occur. A short delay (a few seconds) is likely to be sufficient in most production environments.
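The ordering described above can be sketched as a small user-mode C model. All names here (ModelDriver, ModelSetupWorkerQueues, ModelRegisterFilter, ModelDriverEntry) are hypothetical; in a real mini-filter the registration step would be FltRegisterFilter followed by FltStartFiltering, and the setup step would be whatever the driver's own initialization is, such as SetupWorkerQueues in the trace above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical record of what the model driver has done so far. */
typedef struct {
    bool queues_ready;      /* internal state (work queues, globals) is built */
    bool filter_registered; /* from here on, the unload callback may run */
    bool fail_queue_setup;  /* test knob: force the setup step to fail */
} ModelDriver;

/* Stand-in for the driver's own setup work. */
static bool ModelSetupWorkerQueues(ModelDriver *d)
{
    if (d->fail_queue_setup) {
        return false;
    }
    d->queues_ready = true;
    return true;
}

/* Stand-in for FltRegisterFilter followed by FltStartFiltering. */
static bool ModelRegisterFilter(ModelDriver *d)
{
    /* Registration must never be reached with half-built state. */
    assert(d->queues_ready);
    d->filter_registered = true;
    return true;
}

/* Model DriverEntry: everything Unload depends on first, registration last. */
bool ModelDriverEntry(ModelDriver *d)
{
    assert(d != NULL);
    if (!ModelSetupWorkerQueues(d)) {
        /* Not registered yet, so there is no unload callback to invoke. */
        return false;
    }
    return ModelRegisterFilter(d);
}
```

With this shape, the only part of DriverEntry that can overlap with the unload callback is whatever runs after the registration call returns.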
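As a rough illustration of this proposal, here is a user-mode C model of the event-plus-delay sequence. ModelEvent, ModelGlobals, ModelDriverEntryTail, ModelUnload and RelativeIntervalFromMs are hypothetical names. In a driver the event would be a KEVENT initialized before registration and set with KeSetEvent, Unload would block in KeWaitForSingleObject rather than return, and the pause would be KeDelayExecutionThread. The one kernel convention the model reproduces exactly is that relative intervals are negative counts of 100-nanosecond units.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a KEVENT that DriverEntry sets as its last act. */
typedef struct {
    volatile bool signaled;
} ModelEvent;

typedef struct {
    ModelEvent entry_done;     /* signaled at the very end of DriverEntry */
    int64_t    grace_interval; /* interval the delay step would be given */
    bool       torn_down;      /* global state has been released */
} ModelGlobals;

/* Relative kernel intervals: negative, in 100 ns units (1 ms = 10,000). */
int64_t RelativeIntervalFromMs(uint32_t milliseconds)
{
    return -((int64_t)milliseconds * 10000);
}

/* Last statement of the model DriverEntry, after registration. */
void ModelDriverEntryTail(ModelGlobals *g)
{
    assert(g != NULL);
    g->entry_done.signaled = true;
}

/*
 * Model Unload. Returns false, touching nothing, while DriverEntry is still
 * running; a real driver would block on the event at that point instead.
 */
bool ModelUnload(ModelGlobals *g, uint32_t grace_ms)
{
    assert(g != NULL);
    if (!g->entry_done.signaled) {
        return false;                       /* kernel: wait on the event */
    }
    /* Grace period: DriverEntry still has to return after setting the event. */
    g->grace_interval = RelativeIntervalFromMs(grace_ms);
    g->torn_down = true;                    /* only now release global state */
    return true;
}
```

The grace period is the part that makes this a mitigation rather than a fix: it is a guess at how long the return from DriverEntry takes, not a guarantee.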
Conclusions
Since observing this particular crash, we have followed this structure for our own mini-filters: we do registration at the end of our DriverEntry function. By doing so, we minimize the likelihood of the crash happening.
We have not implemented or tested the other solutions and mitigations that we proposed, but we offer them to our readers for consideration in the event they need to further mitigate this problem.