by Steven Roman
www.romanpress.com
Copyright 2001 © The Roman Press, Inc. All Rights Reserved.
Introduction
This article describes the concept of a string as it relates to Visual Basic. It is excerpted and condensed from my book Win32 API Programming with Visual Basic (available August 1999). We assume the reader is familiar with the contents of the articles Pointers and Data Types.
Strings
The subject of strings can be quite confusing, but this confusion tends to disappear with some careful attention to detail (as is usually the case). The problem stems from the tendency of many programmers to think of a string as an array of characters.
Indeed, the Visual Basic documentation tends to support this erroneous viewpoint at times. According to the VB documentation, a string is
A data type consisting of a sequence of contiguous characters that represent the characters themselves rather than their numeric values.
It seems to me that Microsoft is trying to say that the underlying set for the VB String data type is the set of finite-length sequences of characters. For Visual Basic, all characters are represented by two-byte Unicode integers. For instance, the ASCII representation for the character h is &H68 so the Unicode representation is &H0068, appearing in memory as 68 00.
Thus, the "string" help is represented as
00 68 00 65 00 6C 00 70
Note, however, that because words are written with their bytes reversed in memory, the "string" help appears in memory as
68 00 65 00 6C 00 70 00
This is fine, but it is definitely not how we should think of strings in VB programming. To avoid any possibility of ambiguity, we will refer to this type of object as a Unicode character array which is, after all, precisely what it is! This also helps distinguish it from an ANSI character array, that is, an array of characters represented using single-byte ANSI character codes.
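The byte sequences above can be reproduced outside VB. The following is only a sketch in Python, used here purely as a byte calculator; the cp1252 code page is our assumed stand-in for "ANSI" and is not taken from the text above.

```python
# Encode "help" three ways and show the resulting bytes as hex.
text = "help"

# Big-endian UTF-16: the "as written" form, high byte first.
as_written = text.encode("utf-16-be").hex(" ").upper()

# Little-endian UTF-16: the Unicode character array as it sits in memory.
in_memory = text.encode("utf-16-le").hex(" ").upper()

# Single-byte ANSI character array (cp1252 is an assumed ANSI code page).
ansi = text.encode("cp1252").hex(" ").upper()

print(as_written)  # 00 68 00 65 00 6C 00 70
print(in_memory)   # 68 00 65 00 6C 00 70 00
print(ansi)        # 68 65 6C 70
```

The Unicode character array takes two bytes per character, while the ANSI character array takes one.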
Here is the key to understanding VB strings. When we write the VB code
Dim str As String
str = "help"
we are not defining a Unicode character array per se. We are defining a member of a data type called BSTR, which is short for Basic String. A BSTR is, in fact, a pointer to a null-terminated Unicode character array that is preceded by a 4-byte length field. We had better elaborate on this.
Actually, the VB string data type defined by
Dim str As String
underwent a radical change between versions 3 and 4 of Visual Basic, due in part to an effort to make the type more compatible with the Win32 operating system.
Just for comparison (and to show that we are more fortunate now), Figure 1 shows the format for the VB string data type under Visual Basic 3, called an HLSTR (high-level string).
Figure 1 - The High-Level String Format (HLSTR) Used by VB3
The rather complex HLSTR format starts with a pointer to a string descriptor, which contains the 2-byte length of the string along with another pointer to the character array, which is in ANSI format (one byte per character).
With respect to the Win32 API, this string format is a nightmare. Beginning with Visual Basic 4, the VB string data type changed. The new data type, called a BSTR, is shown in Figure 2.
Figure 2 - A BSTR
This data type is actually defined in the OLE 2.0 specifications, that is, it is part of Microsoft's ActiveX specification.
There are several important things to note about the BSTR data type:
We should emphasize that an embedded null Unicode character is a 16-bit 0, not an 8-bit 0. Watch out for this when testing for null characters in Unicode arrays.
Note that it is common practice to speak of "the BSTR help" or to say that a BSTR may contain embedded null characters when what is really being referred to is the character array pointed to by the BSTR.
Because a BSTR may contain embedded null characters, the terminating null is not of much use, at least as far as VB is concerned. However, its presence is extremely important for Win32. The reason is that the Unicode version of a Win32 string (denoted by LPWSTR) is defined as a pointer to a null-terminated Unicode character array which is not allowed to contain embedded null characters.
This makes it clear why BSTRs are null terminated. A BSTR with no embedded nulls is also an LPWSTR. We will not discuss VC++ strings in this article. (For more on VC++ strings, please see my book Win32 API Programming with Visual Basic.)
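To make the difference between the two ways of reading the same memory concrete, here is a minimal sketch in Python that models a BSTR allocation as a byte string. The helper names make_bstr_block, read_as_bstr and read_as_lpwstr are hypothetical, chosen only for this illustration, and the little-endian layout of the length field is an assumption appropriate to Intel hardware.

```python
import struct

def make_bstr_block(text):
    """Return (block, ptr): the bytes of the allocation and the offset a BSTR would point to."""
    chars = text.encode("utf-16-le")
    # 4-byte length field (byte count, terminator excluded), characters, 16-bit null.
    block = struct.pack("<I", len(chars)) + chars + b"\x00\x00"
    return block, 4  # the pointer targets the character array, not the length field

def read_as_bstr(block, ptr):
    """Read using the length field that precedes the character array."""
    (nbytes,) = struct.unpack_from("<I", block, ptr - 4)
    return block[ptr:ptr + nbytes].decode("utf-16-le")

def read_as_lpwstr(block, ptr):
    """Read up to the first 16-bit null, as a null-terminated reader would."""
    end = ptr
    while block[end:end + 2] != b"\x00\x00":
        end += 2  # step one 16-bit character at a time
    return block[ptr:end].decode("utf-16-le")

block, ptr = make_bstr_block("he\0lp")      # embedded null character
print(read_as_bstr(block, ptr) == "he\0lp")  # True
print(read_as_lpwstr(block, ptr))            # he
```

With no embedded nulls the two readers agree, which is the sense in which such a BSTR is also an LPWSTR. Note that the null test compares two bytes at a time, not one.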
Let us emphasize that code such as
Dim str As String
str = "help"
means that str is the name of a BSTR, not a Unicode character array. In other words, str is the name of the variable that holds the address xxxx, as shown in Figure 2.
Here is a brief experiment we can do to test the fact that a VB string is a pointer to a character array and not a character array. Consider the following code, which defines a structure whose members are strings:
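Private Type utTest
    astring As String
    bstring As String
End Type
Dim uTest As utTest
Dim s As String
s = "testing"
Debug.Print Len(s)
Debug.Print Len(uTest)
The output from this code is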
7
8
In the case of the string variable s, the Len function reports the length of the character array; here there are 7 characters in the character array 'testing'. However, in the case of the structure variable uTest, the Len function actually reports the length of the structure (in bytes). The return value of 8 clearly indicates that each of the two BSTRs has length 4. This is because a BSTR is a pointer!
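The arithmetic can be checked with a small model. This Python sketch assumes 32-bit pointers, as in 32-bit Visual Basic, and represents each string member by an unsigned 4-byte integer; it models the sizes only and is not VB's own mechanism.

```python
import struct

# A structure with two String members stores only two 4-byte pointers.
two_bstr_members = struct.calcsize("<II")

# The character data lives elsewhere, so its size does not affect the structure.
chars_in_testing = len("testing")

print(chars_in_testing)  # 7
print(two_bstr_members)  # 8
```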
The functions VarPtr and StrPtr are not documented by Microsoft, but they can be very useful in understanding the structure of BSTRs.
If var is a variable, then VarPtr(var) is the address of that variable, returned as a Long. If str is a BSTR variable, then StrPtr(str) is the contents of the BSTR! These contents are the address of the Unicode character array pointed to by the BSTR.
Let us verify these statements. Figure 3 shows a BSTR.
Figure 3 - A BSTR
The code for this figure is simply
Dim str As String
str = "help"
Note that the variable str is located at address aaaa and the character array begins at address xxxx, which is the contents of the pointer variable str.
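To see that VarPtr(str) = aaaa and StrPtr(str) = xxxx, just run the following code:
' Assumes CopyMemory is declared as an alias for the Win32 RtlMoveMemory function
Dim lng As Long, i As Integer, s As String
Dim b(1 To 10) As Byte
Dim sp As Long, vp As Long
s = "help"
sp = StrPtr(s)
Debug.Print "StrPtr:" & sp
' The 4-byte length field immediately precedes the character array
CopyMemory lng, ByVal sp - 4, 4
Debug.Print "Length field: " & lng
vp = VarPtr(s)
Debug.Print "VarPtr:" & vp
' The Long stored at address vp should equal sp
CopyMemory lng, ByVal vp, 4
Debug.Print lng = sp
' Copy the character array, including the terminating null, to a byte array
CopyMemory b(1), ByVal sp, 10
For i = 1 To 10
    Debug.Print b(i);
Next
The output is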
StrPtr:1836612
Length field: 8
VarPtr:1243988
True
104 0 101 0 108 0 112 0 0 0
This shows that the character array in a BSTR is indeed in Unicode format and that the length field does indeed hold the byte count and not the character count.
Finally, we note that you can also simulate StrPtr using VarPtr as follows:
' Simulate StrPtr
Dim lng As Long
CopyMemory lng, ByVal VarPtr(str), 4
This code copies the contents of the BSTR pointer, which is the value of StrPtr, to a Long variable lng.
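For readers without a copy of Visual Basic at hand, the same relationships can be modeled in Python with the ctypes module. This is only a sketch under stated assumptions: the names block, bstr, var_ptr and str_ptr are ours, the memory is an ordinary Python-owned buffer and not a real BSTR, and the pointer is whatever size the Python process uses (typically 8 bytes, not VB's 4).

```python
import ctypes
import struct
import sys

chars = "help".encode("utf-16-le")
# Allocation: 4-byte length field, character array, 16-bit terminating null.
data = struct.pack("<I", len(chars)) + chars + b"\x00\x00"
block = (ctypes.c_ubyte * len(data)).from_buffer_copy(data)

# The "BSTR": a pointer variable holding the address of the character array.
bstr = ctypes.c_void_p(ctypes.addressof(block) + 4)

str_ptr = bstr.value              # plays the role of StrPtr(str), i.e. xxxx
var_ptr = ctypes.addressof(bstr)  # plays the role of VarPtr(str), i.e. aaaa

# Simulate StrPtr: copy the pointer-sized value stored at address var_ptr.
raw = ctypes.string_at(var_ptr, ctypes.sizeof(ctypes.c_void_p))
lng = int.from_bytes(raw, sys.byteorder)
print(lng == str_ptr)  # True

# The length field occupies the 4 bytes just before the character array.
(length,) = struct.unpack("<I", ctypes.string_at(str_ptr - 4, 4))
print("Length field:", length)  # Length field: 8

# Character array plus terminating null, byte by byte.
print(*ctypes.string_at(str_ptr, length + 2))  # 104 0 101 0 108 0 112 0 0 0
```

The addresses printed by such a model will differ from run to run, just as the StrPtr and VarPtr values do in VB; only the relationships between them are meaningful.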