Delayed Write Failures

Some of our customers have been experiencing difficulties with Windows "delayed write failure" errors. While this is a general class of errors that actually has many root causes, we have diagnosed one particular case that has clearly affected our customers. We have been able to replicate this problem in our lab, define the conditions under which the problem may occur, and identify some viable workarounds.

The Symptoms
The Problem
The Culprit
Test Methodology
Server Notes
Workstation Notes
How to Test Your System
Immediate Workarounds
Conclusions and Recommendations

The Symptoms
In the field, these Delayed Write Failures can cause a nearly unlimited number of problems for our (or any other) product. The primary symptom we are targeting is a data loss error when performing rapid additions to database tables (.DBF in this case), but we have every reason to believe that delayed write failures can affect us in other high-I/O contexts as well. For example, one developer was able to induce a failure during a data recovery operation, leaving database files and indices out of synchronization.

The database update process probably triggers the problem simply because it involves a heavy burst of file I/O. Since the failure can occur at nearly any point in the process, the trail of evidence can appear very inconsistent. Sometimes our program will actually terminate with an exception error, but often it does not. Sometimes the activity log will appear as if everything is all right, even though there was a failure. The errors usually indicate that a previous process failed, but we have seen that they can also occasionally appear if there is a delayed write failure during the current process.

The Problem
The primary error in question is the delayed write failure, shown in the Event Viewer with a source of MrxSmb and an Event ID of 50. This particular variation of the delayed write failures, caused by SMB signing, is a known Microsoft problem. The log entries will sometimes refer to a specific file, but other times just refer to \Device\LanmanRedirector. The final identifying characteristic is that the last word of the error description when viewed as data type Words displays a status code of c000020c. Here is a Microsoft link relating to the problem:
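The identifying characteristics above can be expressed as a small check. The following Python sketch is illustrative only: the function names are hypothetical, and it assumes the event's binary data is laid out as 32-bit little-endian words, which is how the Event Viewer "Words" view appears to present it.

```python
import struct

# Status code named in the text above for the SMB signing variant.
SMB_SIGNING_STATUS = 0xC000020C


def last_status_word(event_data: bytes) -> int:
    """Return the final 32-bit word of an event's binary data.

    Assumption: the data is a sequence of little-endian 32-bit words.
    """
    if len(event_data) < 4 or len(event_data) % 4 != 0:
        raise ValueError("event data must be a non-empty multiple of 4 bytes")
    return struct.unpack_from("<I", event_data, len(event_data) - 4)[0]


def is_smb_signing_failure(source: str, event_id: int, event_data: bytes) -> bool:
    """True when all three characteristics described above are present."""
    return (
        source.lower() == "mrxsmb"
        and event_id == 50
        and last_status_word(event_data) == SMB_SIGNING_STATUS
    )
```

A c000022c-type failure from the same source and Event ID would not match this check, which is the intended distinction.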

http://support.microsoft.com/default.aspx?scid=%2Fservicedesks%2Fbin%2Fkbsearch.asp%3FArticle%3D293842

Despite these assertions from Microsoft, the problem clearly was NOT fixed in Service Pack 3 (Windows 2000 Server SP3). We were able to confirm this through testing, and we are not alone (see the link under The Culprit below). There appear to be both workstation-side and server-side components to this problem.

The Culprit
The problem can occur when a particular feature is enabled on the server: SMB signing. SMB signing is a security feature intended to help keep hackers from hijacking open sessions or sniffing passwords. It also incurs a performance penalty on both the servers and the workstations by additionally taxing the CPUs (10-15% by Microsoft's estimate). On the server, SMB signing is controlled by the following registry entry:

HKEY_LOCAL_MACHINE\System\CurrentControlSet\Services\Lanmanserver\Parameters\enablesecuritysignature

If this parameter is set to 1 (enabling SMB signing), these problems may arise (again, only with certain workstation/server combinations). If it is set to 0 (disabling SMB signing), which is the default, the problem goes away. Note that there are also workstation-side registry entries that control whether this feature is enabled and/or required.
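To see how a server is currently configured, the registry entry can be read without changing it. This is a minimal Python sketch using the standard winreg module, which exists only on Windows. The function names are hypothetical, and treating a missing entry as "disabled" is an assumption based on 0 being the default.

```python
import sys

SERVER_KEY = r"System\CurrentControlSet\Services\Lanmanserver\Parameters"
VALUE_NAME = "enablesecuritysignature"


def describe_signing(value):
    """Interpret the registry value; None means the entry is absent."""
    if value is None or value == 0:
        return "disabled"  # assumption: absent is treated like the default of 0
    if value == 1:
        return "enabled"
    return "unknown"


def read_server_signing():
    """Read the value on a Windows machine; return None if it is absent."""
    import winreg  # Windows-only standard library module

    try:
        with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, SERVER_KEY) as key:
            value, _value_type = winreg.QueryValueEx(key, VALUE_NAME)
            return value
    except FileNotFoundError:
        return None


if __name__ == "__main__" and sys.platform == "win32":
    print("Server SMB signing:", describe_signing(read_server_signing()))
```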

As mentioned in the following link, some or all versions of Norton AntiVirus will enable SMB signing, triggering the problem:

http://www.appliedsystems.com/SRVS/CGI-BIN/WEBCGI.EXE/,/?St=30,E=0000000000004150979,K=708,Sxi=1,Case=obj(17392)

Our customer had this enabled. It may have been enabled by Norton AntiVirus, unbeknownst to them (as it was to us), or it may be a corporate security policy. Our testing also directly confirms the assertion in the above link: although Microsoft claims to have fixed the c000020c-type delayed write problems in Service Pack 3, this is clearly not true.

Test Methodology
One of our developers was originally able to replicate the problem by running a database update of a large number of items (e.g., 90,000) while also running Explorer and performing refreshes to watch the files being updated. We stuck with this proven method so we could change various workstation and server conditions to see if the problem would go away. While the update was running, we would go to Explorer to look at the progress of the file updates in the \DATA directory, sort the column to display the most recently updated files first, and press F5 to refresh every couple of seconds.

Please note that it failed much less frequently, for whatever reason, if we did nothing other than watch the update run.  What is it about a little additional activity from Explorer that helps trigger the error?  That's a mystery.

Server Notes
Some observations regarding servers:

  • The problem does not seem to happen at all on a NetWare 5 server (using an
    XP workstation).
  • The problem does clearly happen on a Windows NT SP 5 server (using an XP
    workstation) with SMB signing enabled.
  • The problem does clearly happen on a Windows 2000 server (no SP or SP3) with SMB signing enabled. The problem was clearly worse (easier to make happen) for some reason on the 2000 server than it was on the NT server.
  • The c000020c errors seem to be solved on a Windows 2000 server with SP4.  However, c000022c-type delayed write failures will still occur, though much less often, when writing to an SP4 server.


Workstation Notes
Some observations regarding workstations:

  • The problem clearly happens on XP workstations when using SMB signing with an affected server.  This holds unless XP SP1 and the Microsoft hotfix referenced in Knowledge Base article 321733 are applied to the workstation.

http://support.microsoft.com/default.aspx?kbid=321733

Despite what that article says, we found that the hotfix cures not only the c000022c-type errors but also the much more prevalent c000020c-type errors.   In fact, with the hotfix applied, the delayed write problems appear to be solved regardless of the Windows 2000 server service pack level.  However, with an XP workstation and a Windows NT SP5 server, the delayed write problems still exist, even with the XP hotfix.

Please note: The hotfix mentioned above is also contained within XP SP2, although we have not yet tested that combination extensively.

  • While most of our testing was done on XP workstations, the problem did not seem to happen from a Windows NT workstation, regardless of the server. This probably holds unless someone has gone out of their way to enable SMB signing on the NT workstation. The feature was made available in NT SP3, but was disabled by default and only enabled by an obscure registry entry (different from the one used on 2000 and XP).
  • The problem did not seem to happen if the workstation was Windows 2000 SP4, regardless of the server.  This was a bit surprising at first, but now makes sense given that we have shown that the fix documented in Microsoft's Knowledge Base article 321733 addresses the c000020c-type errors every bit as much as the c000022c-type errors it is purported to correct.

How to Test Your System
Though we originally induced delayed write errors using some of our DocuTran software with test data, we later created a stand-alone utility that simply emulates the sort of processing that seems to trigger the problem.  Basically, the code just writes a large number of data records to a database file.   This program may be freely redistributed so that others may assess their platform's exposure to this problem.
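For readers who want a feel for the kind of workload involved, the following Python sketch writes a long run of small fixed-length records to a single file. It is not the DLAYTEST source and does not produce a real .DBF or .MDX file. The record layout, the record size, and the function name are all assumptions made for illustration.

```python
import os
import time

RECORD_SIZE = 64  # hypothetical fixed record length, in bytes


def write_records(path, count, record_size=RECORD_SIZE):
    """Write `count` fixed-length records to `path`, replacing any old file.

    Each record holds a sequence number and a time stamp, padded with spaces.
    Returns the final size of the file in bytes.
    """
    with open(path, "wb") as fh:
        for n in range(1, count + 1):
            stamp = time.strftime("%H:%M:%S")
            record = f"{n:08d} {stamp}".encode("ascii").ljust(record_size, b" ")
            fh.write(record)
            fh.flush()  # hand each record to the OS instead of buffering
    return os.path.getsize(path)
```

Pointing `path` at a file on a mapped network drive would send these writes through the network redirector. Whether this sketch provokes the failure as reliably as DLAYTEST is untested.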

1. Download the following executable program:

http://www.tangent-systems.com/support/dlaytest.exe

2. Create a folder on your server and copy the DLAYTEST.EXE to the server. 

3. If you do not already have a drive mapped somewhere above where DLAYTEST.EXE resides, map a drive.  The program will not execute properly from a browsed folder or UNC. 

4. Execute DLAYTEST.EXE.  The program display simply increments the number of records being written and shows a time stamp.  If it is able to write 99,999 records, DLAYTEST will simply terminate.

5. While DLAYTEST is running, use Windows Explorer to display the folder where DLAYTEST resides.  You will see that two files are written: DLAYTEST.DBF and DLAYTEST.MDX.   Press the refresh key (F5) every few seconds as you watch the files grow.  This is often all that is required to induce delayed write failure errors on an affected system.  If it doesn't fail the first time, try again.

Further Ideas for Testing

  • We have seen the errors occur occasionally without the "Explorer" refresh trick, and evidence from the field suggests the same. For some reason, however, it is much easier to trigger the Delayed Write Failures using Explorer.
  • You may wish to create a batch file to run DLAYTEST over and over, to see if errors occur over a long period of time.  Each time the program executes, it will simply overwrite existing .MDX and .DBF files.  Check your Event Viewer System log to see if any errors occurred.
  • Only one instance of DLAYTEST may be executed at a time.  To stress test a server, create multiple folders with DLAYTEST and execute separate copies for each workstation.
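The repeated-run idea above does not have to be a batch file. As one possible sketch, the Python function below runs a command a fixed number of times in sequence and records each exit code. The function name and the example path are hypothetical.

```python
import subprocess


def run_repeatedly(command, runs):
    """Run `command` `runs` times, one after another, and collect exit codes."""
    exit_codes = []
    for attempt in range(1, runs + 1):
        result = subprocess.run(command)  # waits for each run to finish
        exit_codes.append(result.returncode)
        print(f"run {attempt}/{runs}: exit code {result.returncode}")
    return exit_codes
```

A call such as `run_repeatedly([r"X:\dlaytest\DLAYTEST.EXE"], 50)` (with a path of your own) would run the utility fifty times. A clean exit code does not prove that no delayed write failure occurred, so the Event Viewer System log should still be checked afterwards.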

Immediate Workarounds
The workaround is to disable SMB signing (either on the server or on the workstation).   Disabling SMB (Server Message Block) signing removes the digital signature that protects packets passing between two machines on a network; signing authenticates the traffic but does not encrypt it. This, however, is the only step we've found that ensures data integrity when Windows XP clients write data to Windows 2000 Server or Windows NT Server.

To disable SMB signing on the server (Windows 2000 and Windows NT SP 3 or higher), edit the registry:

HKEY_LOCAL_MACHINE\System\CurrentControlSet\Services\Lanmanserver\Parameters\

Set enablesecuritysignature to 0.

To disable SMB signing on a workstation (XP or Windows 2000), edit the registry:

HKEY_LOCAL_MACHINE\System\CurrentControlSet\Services\lanmanworkstation\Parameters\

Set enablesecuritysignature to 0 and set requiresecuritysignature to 0.

To disable SMB signing on a workstation (Windows NT SP3 or higher), edit the registry:

HKEY_LOCAL_MACHINE\System\CurrentControlSet\Services\Rdr\Parameters\

Set enablesecuritysignature to 0 and set requiresecuritysignature to 0.
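The three registry edits above can also be scripted. The Python sketch below collects the key paths and value names from this section into one table and writes 0 to each value using the standard winreg module (Windows only). The role names and function names are hypothetical. Storing the values as REG_DWORD is an assumption, since the value type is not stated above. It must be run with administrative rights, and you should back up the registry first.

```python
# Key paths and value names as listed in this section; role names are made up.
SIGNING_SETTINGS = {
    "server": (
        r"System\CurrentControlSet\Services\Lanmanserver\Parameters",
        ("enablesecuritysignature",),
    ),
    "workstation": (  # XP or Windows 2000
        r"System\CurrentControlSet\Services\lanmanworkstation\Parameters",
        ("enablesecuritysignature", "requiresecuritysignature"),
    ),
    "nt_workstation": (  # Windows NT SP3 or higher
        r"System\CurrentControlSet\Services\Rdr\Parameters",
        ("enablesecuritysignature", "requiresecuritysignature"),
    ),
}


def planned_changes(role):
    """List the (key path, value name, data) triples for one role."""
    key_path, names = SIGNING_SETTINGS[role]
    return [(key_path, name, 0) for name in names]


def disable_signing(role):
    """Apply the planned changes under HKEY_LOCAL_MACHINE (Windows only)."""
    import winreg  # Windows-only standard library module

    for key_path, name, data in planned_changes(role):
        with winreg.OpenKey(
            winreg.HKEY_LOCAL_MACHINE, key_path, 0, winreg.KEY_SET_VALUE
        ) as key:
            # Assumption: the values are stored as REG_DWORD.
            winreg.SetValueEx(key, name, 0, winreg.REG_DWORD, data)
```

Calling `planned_changes("workstation")` first shows what would be written without touching the registry.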

Conclusions and Recommendations

  • If you are using a Windows 2000 server with XP workstations and you wish to have SMB signing enabled, you should install XP SP1 and the hotfix referenced in Microsoft's Knowledge Base article 321733 on your workstations.   XP SP2 also contains this hotfix, although we have not tested extensively with XP SP2.  If you are unwilling or unable to roll out the hotfix to all your workstations, you should at least install SP4 on your server.
  • If you are using a Windows NT server with XP workstations, it appears that you should disable SMB signing or consider moving to another platform.
  • If you are using a Windows 2000 server with Windows 2000 workstations and you wish to have SMB signing enabled, you should install SP4 on the workstations.