Some of our customers have been experiencing difficulties with Windows "delayed
write failure" errors. While this is a general class of errors that actually has many
root causes, we have diagnosed one particular case that has clearly affected our
customers. We have been able to replicate this problem in our lab, define the conditions
under which the problem may occur, and identify some viable workarounds.
The Symptoms
The Problem
The Culprit
Test Methodology
Server Notes
Workstation Notes
How to Test Your System
Immediate Workarounds
Conclusions and Recommendations
The Symptoms
In the field, there are a nearly infinite number of problems these Delayed Write Failures
may cause for our (or any other) product. Of course, the primary symptom we are targeting
is a data loss error when performing rapid additions to database tables (.DBF in this case),
but we have every reason to believe that the delayed write failures can affect us in other
high I/O contexts as well. For example, one developer was able to induce a failure during
a data recovery operation, leaving database files and indices out of synchronization.
The database update process probably just triggers the problem because it involves a good
burst of file I/O. Since the failure can occur at nearly any point of the process, the
trail of evidence can appear very inconsistent. Sometimes, our program will actually
terminate with an exception error, but often it does not. Sometimes the activity log will
appear as if everything is alright, even though there was a failure. The errors usually
indicate that a previous process failed, but we have seen that they can also occasionally
appear if there is a delayed write failure during the current process.
The Problem
The primary error in question is the delayed write failure, shown in the Event Viewer with
a source of MrxSmb and an Event ID of 50. This particular variation of the delayed write
failures, caused by SMB signing, is a known Microsoft problem. The log entries will
sometimes refer to a specific file, but other times just refer to
\Device\LanmanRedirector. The final identifying characteristic is that the last word of
the error description when viewed as data type Words displays a status code of c000020c.
Here is a Microsoft link relating to the problem:
http://support.microsoft.com/default.aspx?scid=%2Fservicedesks%2Fbin%2Fkbsearch.asp%3FArticle%3D293842
Despite these assertions from Microsoft, the problem clearly was NOT fixed in Service
Pack 3 (Windows 2000 Server SP3). We were able to confirm this through testing, and we are
not alone (see the link in the next section). There appear to be both workstation-side and
server-side components to this problem.
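
To make the status-code check concrete, here is a minimal Python sketch that pulls the last word out of event data copied from Event Viewer in Words view. The dump layout it expects (an optional offset column followed by 8-digit hex words) and the sample text are assumptions for illustration, not captured output, and the function names are hypothetical.

```python
import re

def last_status_word(words_dump):
    """Return the final 8-digit hex word from a Words-view dump, in lowercase."""
    words = []
    for line in words_dump.splitlines():
        # Drop an optional leading offset column such as "0000:".
        _, _, rest = line.rpartition(":")
        words.extend(re.findall(r"\b[0-9a-fA-F]{8}\b", rest))
    if not words:
        raise ValueError("no 8-digit hex words found in the dump")
    return words[-1].lower()

def classify(status):
    """Map a status word to the two variants this article discusses."""
    if status == "c000020c":
        return "c000020c-type delayed write failure (the SMB signing case)"
    if status == "c000022c":
        return "c000022c-type delayed write failure"
    return "some other status code"

# Made-up sample for illustration only; real event data will differ.
SAMPLE_DUMP = """\
0000: 00040000 00540001 00000000 80040032
0010: 00000000 c000020c
"""

if __name__ == "__main__":
    print(classify(last_status_word(SAMPLE_DUMP)))
```

Paste the Words view of a real MrxSmb Event ID 50 entry in place of the sample to check which variant you are seeing.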
The Culprit
The problem can occur when a particular feature is enabled on the server: SMB signing. SMB
signing is a security feature intended to help keep hackers from hijacking open sessions
or sniffing passwords. It also incurs a performance penalty on both the servers and the
workstations by additionally taxing the CPUs (10-15% by Microsoft's estimate). On the
server, SMB signing is controlled by the following registry entry:
HKEY_LOCAL_MACHINE\System\CurrentControlSet\Services\Lanmanserver\Parameters\enablesecuritysignature
If this parameter is set to 1 (enabling SMB signing), these problems may arise (again, with
certain workstation/server combinations). If this parameter is set to 0 (disabling SMB
signing), the problem goes away; 0 is the default setting. Note that there are
also workstation-side registry entries that control whether this feature is enabled and/or
required.
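
As a rough illustration, the server-side value can be read with Python's standard winreg module. This is a sketch under assumptions: it must run on the Windows server itself with sufficient rights, and the helper names are hypothetical. The key path and value name are the ones given above.

```python
import sys

SERVER_KEY = r"System\CurrentControlSet\Services\Lanmanserver\Parameters"
VALUE_NAME = "enablesecuritysignature"

def describe_signing(value):
    """Interpret the DWORD; None means the value is absent from the key."""
    if value is None:
        return "not set (the article gives 0, disabled, as the default)"
    if value == 1:
        return "enabled"
    if value == 0:
        return "disabled"
    return f"unexpected value {value!r}"

def read_server_signing():
    """Read enablesecuritysignature from the local registry (Windows only)."""
    import winreg  # imported here so the module still loads on other platforms
    try:
        with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, SERVER_KEY) as key:
            value, _value_type = winreg.QueryValueEx(key, VALUE_NAME)
            return value
    except FileNotFoundError:
        return None

if __name__ == "__main__" and sys.platform == "win32":
    print("Server SMB signing:", describe_signing(read_server_signing()))
```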
As mentioned in the following link, some or all versions of Norton AntiVirus will enable
SMB signing, triggering the problem:
http://www.appliedsystems.com/SRVS/CGI-BIN/WEBCGI.EXE/,/?St=30,E=0000000000004150979,K=708,Sxi=1,Case=obj(17392)
Our customer had this enabled. It may have been enabled by Norton AntiVirus without their
knowledge (or ours), or it may be a corporate security policy. My testing also
directly confirms the assertion in the above link: although Microsoft
claims to have fixed the c000020c-type delayed write problems in Service Pack 3, the
problems persist.
Test Methodology
One of our developers was originally able to replicate the problem by running a database
update of a large number of items (i.e., 90,000) while also running Explorer and
performing refreshes to watch the files being updated. We stuck with this proven method so
we could change various workstation and server conditions to see if the problem would go
away. While the update was running, we would go to Explorer to look at the progress of the
file updates in the \DATA directory, sort the column to display the most recently updated
files first, and press F5 to refresh every couple of seconds.
Please note that it failed much less frequently, for whatever reason, if we did nothing
other than watch the update run. What is it about a little additional activity from
Explorer that helps trigger the error? That's a mystery.
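
The shape of this test can be sketched in Python as a burst of record writes interleaved with directory listings. This is only an approximation under stated assumptions: the record layout and file name are made up, the listing runs in the same process instead of in Explorer, and we have not established that this script reproduces the failure.

```python
import os

RECORD = b"X" * 127 + b"\n"  # hypothetical 128-byte record, not a real .DBF row

def write_burst(folder, records=90_000, refresh_every=1_000):
    """Append records to a file, listing the folder at regular intervals.

    Returns (final file size in bytes, number of listings performed).
    Any OSError raised by the write or the listing propagates to the caller.
    """
    path = os.path.join(folder, "burst_test.dat")
    listings = 0
    with open(path, "wb") as out:
        for i in range(records):
            out.write(RECORD)
            if (i + 1) % refresh_every == 0:
                out.flush()
                # Rough stand-in for pressing F5 in Explorer: enumerate the
                # folder and stat each entry, as a detailed view would.
                with os.scandir(folder) as entries:
                    for entry in entries:
                        entry.stat()
                listings += 1
    return os.path.getsize(path), listings
```

Call write_burst with a folder on a mapped network drive, then check the Event Viewer System log for new MrxSmb entries.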
Server Notes
Some observations regarding servers:
- The problem does not seem to happen at all on a NetWare 5 server (using an
XP workstation).
- The problem does clearly happen on a Windows NT SP 5 server (using an XP
workstation) with SMB signing enabled.
- The problem does clearly happen on a Windows 2000 server (no SP or SP3) with SMB signing
enabled. The problem was clearly worse (easier to make happen) for some reason on the 2000
server than it was on the NT server.
- The c000020c errors seem to be solved on a Windows 2000 server with SP4. However,
c000022c-type delayed write failures will still occur, though much less often, when writing
to an SP4 server.
Workstation Notes
Some observations regarding workstations:
- The problem clearly happens on XP workstations when using SMB signing with an affected
server. This holds unless XP SP1 and the Microsoft hotfix referenced in their
knowledgebase article 321733 are applied to the workstation.
http://support.microsoft.com/default.aspx?kbid=321733
Now, despite what it says in that article, we found that the hotfix not only cures the
c000022c-type errors, but it also cures the much more prevalent c000020c-type errors.
In fact, with the hotfix, despite the Windows 2000 server service pack level, the
delayed write problems appear to be solved. However, using an XP workstation with a
Windows NT SP5 server the delayed write problems still exist, even with the XP hotfix.
Please note: The hotfix mentioned above is also contained within XP SP2, although we
have not yet tested that combination extensively.
- While most of our testing was done on XP workstations, the problem did not seem to
happen from a Windows NT workstation, regardless of the server. This probably holds
unless someone has gone out of their way to enable SMB signing on the NT workstation. The
feature was made available in NT SP3, but it was disabled by default and could only be
enabled through an obscure registry entry (different from the one used on 2000 and XP).
- The problem did not seem to happen if the workstation was Windows 2000 SP4, regardless
of the server. This was a bit surprising at first, but now makes sense given that we
have shown that the fix documented in Microsoft's knowledgebase article 321733 addresses
the c000020c-type errors every bit as much as the c000022c-type errors it is purported to
correct.
How to Test Your System
Though we originally induced delayed write errors using some of our DocuTran software with
test data, we later created a stand-alone utility that simply emulated the sort of
processing that seems to trigger the problem. Basically, the code just writes a
large number of data records to a database file. This program may be freely
redistributed so that others may assess their platform's exposure to this problem.
1. Download the following executable program:
http://www.tangent-systems.com/support/dlaytest.exe
2. Create a folder on your server and copy the DLAYTEST.EXE to the server.
3. If you do not already have a drive mapped somewhere above where DLAYTEST.EXE
resides, map a drive. The program will not execute properly from a browsed folder or
UNC.
4. Execute DLAYTEST.EXE. The program display simply increments the number of
records being written and shows a time stamp. If it is able to write 99,999 records,
DLAYTEST will simply terminate.
5. While DLAYTEST is running, use Windows Explorer to display the folder where DLAYTEST
resides. You will see that two files are written: DLAYTEST.DBF and DLAYTEST.MDX.
Press the refresh key (F5) every few seconds as you watch the files grow.
This is often all that is required to induce delayed write failure errors on an affected
system. If it doesn't fail the first time, try again.
Further Ideas for Testing
- While we have seen that the errors will occasionally occur without the
"Explorer" refresh trick (and evidence from the field suggests the same),
it is much easier to trigger the Delayed Write Failures using Explorer for
some reason.
- You may wish to create a batch file to run DLAYTEST over and over, to see if errors
occur over a long period of time. Each time the program executes, it will simply
overwrite existing .MDX and .DBF files. Check your Event Viewer System log to see if
any errors occurred.
- Only one instance of DLAYTEST may be executed at a time. To stress test a server,
create multiple folders with DLAYTEST and execute separate copies for each workstation.
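
The repeated-run idea above could also be scripted in Python instead of a batch file. This is a sketch under assumptions: the helper name is hypothetical, and we do not know what exit code DLAYTEST returns on failure, so the Event Viewer System log remains the authoritative check. The recorded start times are there to help match runs against log entries.

```python
import subprocess
import time

def run_repeatedly(command, runs, pause_seconds=0.0):
    """Run a command several times in sequence.

    Returns a list of (run number, start time, exit code) tuples. Runs are
    sequential because only one DLAYTEST instance may execute at a time.
    """
    results = []
    for n in range(1, runs + 1):
        started = time.strftime("%Y-%m-%d %H:%M:%S")
        exit_code = subprocess.run(command).returncode
        results.append((n, started, exit_code))
        if pause_seconds:
            time.sleep(pause_seconds)
    return results

# Example (assumes DLAYTEST.EXE sits in the current folder on a mapped drive):
#     for run in run_repeatedly(["DLAYTEST.EXE"], runs=50):
#         print(run)
```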
Immediate Workarounds
The workaround is to disable SMB signing (on either the server or the workstation).
Disabling SMB (Server Message Block) signing removes the digital signatures that protect
data passing between two machines on a network against tampering. This,
however, is the only step we've found that ensures data integrity when Windows XP clients
write data to Windows 2000 Server or Windows NT Server.
To disable SMB signing on the server (Windows 2000 and Windows NT SP 3 or
higher), edit the registry:
HKEY_LOCAL_MACHINE\System\CurrentControlSet\Services\Lanmanserver\Parameters\
Set enablesecuritysignature to 0.
To disable SMB signing on a workstation (XP or Windows 2000),
edit the registry:
HKEY_LOCAL_MACHINE\System\CurrentControlSet\Services\lanmanworkstation\Parameters\
Set enablesecuritysignature to 0 and set requiresecuritysignature to 0.
To disable SMB signing on a workstation (Windows NT SP3 or
higher), edit the registry:
HKEY_LOCAL_MACHINE\System\CurrentControlSet\Services\Rdr\Parameters\
Set enablesecuritysignature to 0 and set requiresecuritysignature to 0.
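
For rolling these settings out to several machines, one option is to generate a .reg file per role. The Python sketch below builds the file text from the key paths and value names listed above; the role labels and function name are hypothetical, and the REGEDIT4 header is chosen on the assumption that the file should import on Windows NT as well as 2000 and XP. Back up the registry before importing anything.

```python
HKLM_SERVICES = r"HKEY_LOCAL_MACHINE\System\CurrentControlSet\Services"

# Role labels are illustrative; paths and value names come from the article.
SETTINGS = {
    "server": (
        HKLM_SERVICES + r"\Lanmanserver\Parameters",
        ["enablesecuritysignature"],
    ),
    "workstation_2000_xp": (
        HKLM_SERVICES + r"\lanmanworkstation\Parameters",
        ["enablesecuritysignature", "requiresecuritysignature"],
    ),
    "workstation_nt": (
        HKLM_SERVICES + r"\Rdr\Parameters",
        ["enablesecuritysignature", "requiresecuritysignature"],
    ),
}

def reg_file_text(role):
    """Return .reg file text that sets the signing values for a role to 0."""
    key_path, value_names = SETTINGS[role]
    lines = ["REGEDIT4", "", f"[{key_path}]"]
    lines += [f'"{name}"=dword:00000000' for name in value_names]
    return "\r\n".join(lines) + "\r\n"

if __name__ == "__main__":
    print(reg_file_text("workstation_2000_xp"))
```

Save the output to a file with a .reg extension and review it before importing it on the target machine.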
Conclusions and Recommendations
- If you are using a Windows 2000 server with XP workstations and you wish
to have SMB signing enabled, you should install XP SP1 and the hotfix referenced in
Microsoft's knowledgebase article 321733 on your workstations. XP SP2 also contains
this hotfix, although we have not tested extensively with XP SP2. If you are
unwilling or unable to roll out the hotfix to all your workstations, you should at least
install SP4 on your server.
- If you are using a Windows NT server with XP workstations, it appears that
you should disable SMB signing or consider moving to another platform.
- If you are using a Windows 2000 server with Windows 2000 workstations and
you wish to have SMB signing enabled, you should install SP4 on the workstations.