I posted earlier about a troublesome piece of code I came across that parses a string to see if it’s numeric. Phil came back with an interesting reply with his take on it; I’m the first to admit that perhaps I worded my worries a little too strongly in my original post (there are indeed some situations where blindly swallowing an exception is warranted, but only in the rare cases where the exception is benign and won’t bring the application to its knees). That said, that wasn’t really the intention of my post…what I was actually trying to get at was the cost associated with generating unneeded exceptions: don’t generate/catch/throw an exception unless something exceptional is going on, and only in rare cases use try/catch blocks to dictate flow of control or method returns.
I then came across this post which seems to reiterate exactly what I was getting at…exceptions are expensive to generate (there’s a lot of CLR overhead involved). I also noticed on Phil’s post (and mine) that the regex folks decided to chime in as well (I have nothing against regexes, I just usually prefer to work with raw char data instead). So I’ve decided to put all of this to rest (for me at least)…I borrowed Jeff’s code from his benchmark on TryParse vs. Parse and extended it with my own char array and regex testing routines to see what really is the most efficient way to parse strings for numeric data. Like all benchmarks, take it with a grain of salt…though some of the results were surprising even to me.
Outline of the String Parsing Benchmark Application
The application (link at the end of this post) is written in VB.Net and targets version 2.0 beta 2 of the .Net framework and consists of 4 core benchmark routines, 1 each for:
- Parsing the string collection for numeric data using the Int32/Decimal/Single/Double.Parse methods (which throw a FormatException on non-numeric input, so each call must be wrapped in a try/catch block).
- Parsing the string collection for numeric data using the new 2.0 CLI Int32/Decimal/Single/Double.TryParse methods (returns a boolean).
- Stripping demarcation data (commas/periods) out of the string collection, then breaking each string into a char array and iterating over it for numeric data. There is an option to use an optimized char array loop that stops on the first occurrence of non-numeric data, but to be fair to regexes (and all routines for that matter) the default is to iterate over every value in the char array.
- Using a Regular Expression to test the string for a numeric match. There is an option to provide your own expressions as admittedly I’m pretty rusty on my regex knowledge.
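For reference, the shape of the four routines can be sketched as follows (a hypothetical Python analogue for illustration only; the actual benchmark is VB.Net, and this sketch assumes positive integer data, matching the generated strings):

```python
import re

_DEMARCATION = str.maketrans("", "", ",.")  # strip commas/periods (routines 3/4 prep)
_NUMERIC_RE = re.compile(r"^\d+$")          # assumed pattern; the app lets you supply your own

def via_parse(s):
    """Routine 1: Parse-style -- an exception fires for every non-numeric string."""
    try:
        int(s)
        return True
    except ValueError:
        return False

def via_tryparse(s):
    """Routine 2: TryParse-style -- returns a boolean, no exception raised."""
    return s.isdigit()

def via_char_array(s, optimized=False):
    """Routine 3: strip demarcation data, then scan every character."""
    chars = s.translate(_DEMARCATION)
    ok = bool(chars)
    for c in chars:
        if not c.isdigit():
            ok = False
            if optimized:  # early exit on the first non-numeric value
                break
    return ok

def via_regex(s):
    """Routine 4: regular-expression match."""
    return _NUMERIC_RE.match(s.translate(_DEMARCATION)) is not None
```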
There are also options to specify the number of strings to use, and what percentage of the strings should test false for being purely numeric data. It’s worth mentioning that Jeff’s algorithm for creating the strings only generates positive values…perhaps in the future I’ll re-roll his routine, but for now I’m just keeping it simple for the sake of brevity.
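Jeff’s generation routine isn’t reproduced here, but a stand-in with the same two knobs (string count and failure percentage) might look like this hypothetical sketch (Python for illustration; like his, it only produces positive values):

```python
import random

def make_strings(count, pct_fail, seed=0):
    """Generate `count` strings, roughly `pct_fail` percent of them non-numeric."""
    rng = random.Random(seed)  # seeded so runs are repeatable
    out = []
    for _ in range(count):
        s = str(rng.randint(0, 999_999))       # positive values only
        if rng.random() < pct_fail / 100.0:
            i = rng.randrange(len(s))          # corrupt one position so the
            s = s[:i] + rng.choice("abcXYZ#") + s[i + 1:]  # string fails a numeric test
        out.append(s)
    return out
```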
Environment and Setup
For the purposes of my testing, I generated 10,000 strings and tested them in 10% failure intervals. Each interval was run 3 times and the average of the 3 runs was used as the final benchmark result. I don’t feel the need to give detailed specs of the machine used for testing…let’s just say it’s a fairly fast laptop with plenty of memory (which definitely comes into play towards the higher end of the failure spectrum). The application was stopped/restarted between percentage intervals to clean up memory usage, and AutoRecover was disabled in Visual Studio 2005 (if AutoRecover kicks in during the benchmark it will block the main thread until it’s done, thus skewing the results…IMO it shouldn’t kick in while an application is running, but that’s another post).
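The run-and-average setup described above can be sketched as a small harness (hypothetical Python, not the benchmark’s VB.Net code; `routine` stands for any of the parsing functions under test):

```python
import time

def average_of_runs(routine, data, runs=3):
    """Time `routine` over every string in `data` for `runs` passes and
    return the mean wall-clock seconds, mirroring the 3-run average."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        for s in data:
            routine(s)
        timings.append(time.perf_counter() - start)
    return sum(timings) / runs
```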
Results and Conclusions
I won’t get into terribly much detail here (see the link at the end of this post for full results as an .xls file), but here’s the 20,000 foot overview of the results I obtained:
- As is to be expected, the new TryParse routines in version 2.0 of the CLI trounce all other routines, averaging approximately 3–6 milliseconds across all percentage intervals. This was the only routine that maintained a constant speed, which means the developers over at Microsoft have indeed done their jobs correctly.
- Also as is to be expected, the Parse routine performed the worst of the bunch, degrading linearly as the percentage of failures increased. What was surprising is how drastic the initial degradation was from 100% success to 90%. Testing 10,000 strings at a 10% failure rate averaged approximately 4.5 seconds per data type tested, with the decimal type coming in quickest at 3.24 seconds. What this means is that unless you can guarantee with some certainty (as Phil pointed out as well) that 95%+ of the values being tested will not fail, performance will suffer drastically due to the overhead of creating and handling unneeded exceptions. At 20% failure the degradation more than doubles to just under 10 seconds. The floating-point types suffered the worst, in most cases taking about 30% more time to parse than the integer/decimal data types.
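A back-of-the-envelope calculation on the figures above gives a feel for the implied per-exception cost (arithmetic only; since the 4.5 seconds also covers the 9,000 successful parses, this is an upper bound):

```python
strings = 10_000
failure_rate = 0.10
total_seconds = 4.5                    # average per data type at 10% failure
failing = int(strings * failure_rate)  # 1,000 exceptions thrown
per_exception_ms = total_seconds / failing * 1000
print(per_exception_ms)                # at most ~4.5 ms attributable to each exception
```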
- Breaking the strings into a char array and iterating over each value in the array yielded some interesting results. Each type performed at roughly the same level (a negligible variance of 1–2%), and as I predicted earlier it was quite fast. The degradation trend over failure percentage fell somewhere between logarithmic and linear, with no real noticeable (from an end-user perspective) lag until the 40% failure rate, where it crept above the 1 second mark. At 10% failure the average was about 80 milliseconds…fast indeed.
- As I mentioned above, to be fair to the other routines which must process each individual value in the string, I made the char array routines evaluate every value as well. Optimizing the char array routine to break out of the evaluation loop on the first occurrence of a non-numeric value roughly doubled performance, with each percentage interval coming in at twice the speed of the unoptimized routine. With this optimization, no noticeable performance degradation appeared until 60% of the strings failed the test.
- The regex results were a little surprising. While I did expect working with char data to be faster, I didn’t expect it to outperform regexes to the extent that it did. Regexes were fairly on par with raw char data until the 20% failure mark, and things went steadily downhill from there. The unoptimized char array routine clocked in at a quarter of a second at the 20% failure mark, whereas regexes were taking three times as long at that point. And where the char array curve was more logarithmic in nature, regexes leaned towards the linear end of things, reaching 5 seconds at 50% failure (char arrays didn’t top 5 seconds until around an 85% failure rate). If you consider the optimized char array routine (which is more than likely how it would be done if char data were used in an application for this purpose), which performed roughly 10x better, it’s pretty apparent that regexes aren’t a viable solution for this type of scenario (they just don’t seem to perform very well in high-iteration loops). Regexes definitely have their place for data validation, but in this case working with char data is just outright faster. I’m curious to see if anyone can improve the performance of regexes in this scenario (see my disclaimer above about my regex skills).
The new TryParse methods in version 2.0 of .Net make this type of benchmark obsolete, of course, but it’s interesting nonetheless to see just how fast raw char data is. Because you can work with it on a char-by-char level, it’s easier to optimize loops, whereas with regexes it’s an all-or-nothing deal and, from what I can tell, all data values are evaluated. For simple data evaluation such as the above, char data is definitely the way to go. That said, as the number of char validations in the loop increases, performance does degrade and regexes start to pull through as the obvious choice, though I would definitely like to experiment with this some more to see if char data is faster for more complicated data validation situations (for example, validating a social security number/phone number/email address/etc.), so send me your regexes and I’ll see if I can get a char array to outperform them.
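As a starting point for that experiment, here’s one hypothetical matchup (Python for illustration; the 123-45-6789 social security number format is an assumption):

```python
import re

SSN_RE = re.compile(r"^\d{3}-\d{2}-\d{4}$")  # assumed format: 123-45-6789

def ssn_via_regex(s):
    """Regex validation of the assumed SSN shape."""
    return SSN_RE.match(s) is not None

def ssn_via_chars(s):
    """Char-by-char equivalent with early exit, in the spirit of the post."""
    if len(s) != 11:
        return False
    for i, c in enumerate(s):
        if i in (3, 6):
            if c != "-":               # dashes must sit at fixed positions
                return False
        elif not ("0" <= c <= "9"):    # everything else must be an ASCII digit
            return False
    return True
```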
Hopefully, more than anything, this article has served to show just how expensive generating unneeded exceptions is; it should be pretty obvious that doing so can easily bring an application to its knees (not to mention the huge memory spikes I saw while running the Parse routine due to the sheer number of exceptions being thrown), so this is definitely something to think about when designing your applications. Cheers, and happy coding.
Numeric validation application (with source) and results spreadsheet (for Visual Studio 2005 Beta 2/Excel 2003). This will be archived to my downloads page in the near future (when I get a newer version of the downloads page in place).
Sun, Aug 14 2005 6:31 PM