Jump to content

Tutorial on Regular Expressions


Cannonfodder
 Share

Recommended Posts

As many of us know, figuring out regular expressions isn't intuitive... here's a tutorial I gleaned from work! I'm sure there are better ones out there.. yeah yeah its written by microsoft..

 

May 10, 1999

 

Contents

 

Regular Expressions: What the Heck Are They?

VBScript RegExp Object

VBScript Matches Collection Object

VBScript Match Object

So What Does a Pattern Look Like?

Gimme an Example Already!

More Power to You!

Information Overload!

 

Wow! Talk about living in Internet time. Only nine months ago, the Microsoft® Scripting Technologies team released Version 4.0 of its scripting engines in Visual Studio® 6.0. Now, with the release of Internet Explorer 5.0, we roll out version 5.0 of the script engines—with lots of new bells and whistles! Some welcome improvements include HTML and ASP performance improvements, script encoding, and many Visual Basic® Scripting Edition (VBScript) and JScript® language features. Here's some insight into one of the most often requested features in VBScript—regular expressions!

 

Regular Expressions: What the Heck Are They?

So what are regular expressions? Regular expressions provide tools for developing complex pattern matching and textual search-and-replace algorithms. Ask any Perl, egrep, awk or sed developer, and they'll tell you that regular expressions are one of the most powerful utilities available for manipulating text and data. By creating patterns to match specific strings, a developer has total control over searching, extracting, or replacing data. In short, to master regular expressions is to master your data.

 

In this article, I'll describe all objects related to VBScript Regular Expressions, summarize common regular expression patterns, and provide some examples of using regular expressions in code.

 

VBScript RegExp Object

VBScript Version 5.0 provides regular expressions as an object to developers. In design, the VBScript RegExp object is similar to JScript's RegExp and String objects; in syntax it is consistent with Visual Basic. First, let me describe the object and its properties and methods. The VBScript RegExp object provides 3 properties and 3 methods to the user.

 

Properties  Methods 

Pattern  Test (search-string) 

IgnoreCase  Replace (search-string, replace-string) 

Global  Execute (search-string) 

 

Pattern - A string that is used to define the regular expression. This must be set before use of the regular expression object. Patterns are described in more detail below.

IgnoreCase - A Boolean property that indicates if the regular expression should be tested against all possible matches in a string. By default, IgnoreCase is set to False.

Global - A Boolean property that indicates if the regular expression should be tested against all possible matches in a string. By default, Global is set to False.

Test (string) - The Test method takes a string as its argument and returns True if the regular expression can successfully be matched against the string, otherwise False is returned.

Replace (search-string, replace-string) - The Replace method takes 2 strings as its arguments. If it is able to successfully match the regular expression in the search-string, then it replaces that match with the replace-string, and the new string is returned. If no matches were found, then the original search-string is returned.

Execute (search-string) - The Execute method works like Replace, except that it returns a Matches collection object, containing a Match object for each successful match. It doesn't modify the original string.

For more information, including some sample code, see the Microsoft Scripting Site.

 

VBScript Matches Collection Object

As I mentioned, the Matches collection object is only returned as a result of the Execute method. This collection object can contain zero or more Match objects, and the properties of this object are read-only.

 

Properties 

Count 

Item 

 

Count - A read-only value that contains the number of Match objects in the collection.

Item - A read-only value that enables Match objects to be randomly accessed from the Matches collection object. The Match objects may also be incrementally accessed from the Matches collection object, using a For-Next loop.

For more information, including some sample code, see the Microsoft Scripting Site.

 

VBScript Match Object

Contained within each Matches collection object are zero or more Match objects. These represent each successful match when using regular expressions. The properties of these objects are also read-only, and contain information about each match.

 

Properties 

FirstIndex 

Length 

Value 

 

FirstIndex - A read-only value that contains the position within the original string where the match occurred. This index uses a zero-based offset to record positions, meaning that the first position in a string is 0.

Length - A read-only value that contains the total length of the matched string.

Value - A read-only value that contains the matched value or text. It is also the default value when accessing the Match object.

For more information, including some sample code, see the Microsoft Scripting Site.

 

So What Does a Pattern Look Like?

Ok, this all sounds great and wonderful, but what do they look like? Regular expressions are almost another language by itself, but users familiar with Perl will feel right at home. VBScript derives its pattern set from Perl, and its core features are, therefore, similar to Perl. I'll try to describe some of the pattern sets used to define regular expressions. These sets can be broken down into several categories and areas:

 

Position Matching

Position matching involves the use of the ^ and $ to search for beginning or ending of strings. Setting the pattern property to "^VBScript" will only successfully match "VBScript is cool." But it will fail to match "I like VBScript."

 

Symbol Function

^  Only match the beginning of a string.

 

"^A" matches first "A" in "An A+ for Anita." 

$  Only match the ending of a string.

 

"t$" matches the last "t" in "A cat in the hat" 

\b  Matches any word boundary

 

"ly\b" matches "ly" in "possibly tomorrow." 

\B  Matches any non-word boundary

 

Literals

Literals can be taken to mean alphanumeric characters, ACSII, octal characters, hexadecimal characters, UNICODE, or special escaped characters. Since some characters have special meanings, we must escape them. To match these special characters, we precede them with a "\" in a regular expression.

 

Symbol Function

Alphanumeric  Matches alphabetical and numerical characters literally. 

\n  Matches a new line 

\f  Matches a form feed 

\r  Matches carriage return 

\t  Matches horizontal tab 

\v  Matches vertical tab 

\?  Matches ? 

\*  Matches * 

\+  Matches + 

\.  Matches . 

\|  Matches | 

\{  Matches { 

\}  Matches } 

\\  Matches \ 

\[  Matches [ 

\]  Matches ] 

\(  Matches ( 

\)  Matches ) 

\xxx  Matches the ASCII character expressed by the octal number xxx.

 

"\50" matches "(" or chr (40). 

\xdd  Matches the ASCII character expressed by the hex number dd.

 

"\x28" matches "(" or chr (40). 

\uxxxx  Matches the ASCII character expressed by the UNICODE xxxx.

 

"\u00A3" matches "£". 

 

Character Classes

Character classes enable customized grouping by putting expressions within [] braces. A negated character class may be created by placing ^ as the first character inside the []. Also, a dash can be used to relate a scope of characters. For example, the regular expression "[^a-zA-Z0-9]" matches everything except alphanumeric characters. In addition, some common character sets are bundled as an escape plus a letter.

 

Symbol Function

[xyz]  Match any one character enclosed in the character set.

 

"[a-e]" matches "b" in "basketball". 

[^xyz]  Match any one character not enclosed in the character set.

 

"[^a-e]" matches "s" in "basketball". 

.  Match any character except \n. 

\w  Match any word character. Equivalent to [a-zA-Z_0-9]. 

\W  Match any non-word character. Equivalent to [^a-zA-Z_0-9]. 

\d  Match any digit. Equivalent to [0-9]. 

\D  Match any non-digit. Equivalent to [^0-9]. 

\s  Match any space character. Equivalent to [ \t\r\n\v\f]. 

\S  Match any non-space character. Equivalent to [^ \t\r\n\v\f]. 

 

Repetition

Repetition allows multiple searches on the clause within the regular expression. By using repetition matching, we can specify the number of times an element may be repeated in a regular expression.

 

Symbol Function

{x}  Match exactly x occurrences of a regular expression.

 

"\d{5}" matches 5 digits. 

(x,}  Match x or more occurrences of a regular expression.

 

"\s{2,}" matches at least 2 space characters. 

{x,y}  Matches x to y number of occurrences of a regular expression.

 

"\d{2,3}" matches at least 2 but no more than 3 digits. 

?  Match zero or one occurrences. Equivalent to {0,1}.

 

"a\s?b" matches "ab" or "a b". 

*  Match zero or more occurrences. Equivalent to {0,}. 

+  Match one or more occurrences. Equivalent to {1,}. 

 

Alternation & Grouping

Alternation and grouping is used to develop more complex regular expressions. Using alternation and grouping techniques can create intricate clauses within a regular expression, and offer more flexibility and control.

 

Symbol Function

()  Grouping a clause to create a clause. May be nested. "(ab)?©" matches "abc" or "c".

|  Alternation combines clauses into one regular expression and then matches any of the individual clauses.

 

"(ab)|(cd)|(ef)" matches "ab" or "cd" or "ef". 

 

BackReferences

Backreferences enable the programmer to refer back to a portion of the regular expression. This is done by use of parenthesis and the backslash (\) character followed by a single digit. The first parenthesis clause is referred by \1, the second by \2, etc.

 

Symbol Function

()\n  Matches a clause as numbered by the left parenthesis

 

"(\w+)\s+\1" matches any word that occurs twice in a row, such as "hubba hubba." 

 

Gimme an Example Already!

This example covers some of the things I've discussed in this article. It's a simple app that makes use of regular expressions to test if a valid input has been entered. It will repeatedly prompt the user for input until a valid input has been entered. First, I'll explain the initial pattern in detail.

 

"^\s*((\$\s?)|(£\s?))?((\d+(\.(\d\d)?)?)|(\.\d\d))\s*(UK|GBP|GB|USA|US|USD)?)\s*$"

"^\s*…" and "…\s*$" - means that there can be any number of leading and end space characters, and the input must be on a line by itself

"((\$\s?)|(£\s?))?" - means an optional $ or £ sign followed by an optional space

"((\d+(\.(\d\d)?)?)|(\.\d\d))" - searches for at least one digit, followed by an optional decimal and two digits (which are optional) or a decimal and two digits. This means that input such as 6., 23.33, .88 are all allowed, but 5.5 is not.

"\s*(UK|GBP|GB|USA|US|USD)?" - means that any number of space characters are valid followed by optional and acceptable arguments to the string.

In this example, regular expressions are used to determine if the user entered US dollars or British pounds. I search for the strings £, UK, GBP, or GB. If the regular expression is true, then the user has entered an amount in British pounds. Otherwise, I assume USD currency.

 

To run this code, you can save it as CurrencyEx.vbs and run it using Windows Script Host, copy it to VB (need to add references to Microsoft VBScript Regular Expressions) or embed the code in an HTML file.

 

Sub CurrencyEx

Dim inputstr, re, amt

Set re = new regexp  'Create the RegExp object

 

'Ask the user for the appropriate information

inputstr = inputbox("I will help you convert USA and CAN currency. Please enter the amount to convert:")

'Check to see if the input string is a valid one.

re.Pattern = "^\s*((\$\s?)|(£\s?))?((\d+(\.(\d\d)?)?)|(\.\d\d))\s*(UK|GBP|GB|USA|US|USD)?)\s*$"

re.IgnoreCase = true

do while re.

Test(inputstr) <> true

'Prompt for another input if inputstr is not valid

inputstr = inputbox("I will help you convert USA and GBP currency. Please enter the amount to(USD or GBP):")

 

loop

 

'Determine if we are going from GBP->US or USA->GBP

re.Pattern = "£|UK|GBP|GB"

if re.Test(inputstr) then

'The user wants to go from GBP->USD

 

re.Pattern = "[a-z$£ ]"

re.Global = True

amt = re.Replace(inputstr, "")

amt = amt * 1.6368

amt = cdbl(cint(amt * 100) / 100)

amt = "$" & amt

else

'The user wants to go from USD->GBP

 

re.Pattern = "[a-z$£ ]"

re.Global = True

amt = re.Replace(inputstr, "")

amt = amt * 0.609

amt = cdbl(cint(amt * 100) / 100)

amt = "£" & amt

end if

 

msgbox ("Your amount of: " & vbTab & inputstr & vbCrLf & "is equal to: " & vbTab & amt)

End sub

 

More Power to You!

To ensure that Visual Basic developers can use regular expressions, the VBScript regular expression engine has been implemented as a COM object. This makes them much more powerful, since they can be called from various sources outside of VBScript, such as Visual Basic or C.As an example, I have written a small Visual Basic application that sniffs through your contact list in Outlook® 97, Outlook 98 or Outlook 2000, and returns the names of contacts that live in those particular cities.

 

This program is simple. First, it prompts the user to enter the names of the cities to search for, separated by commas. Second, it prompts for the name of the new contact folder to create within Outlook. After each contact match, the contact is copied to the newly created contact folder.

 

By adding references to the Microsoft VBScript Regular Expressions object library we can do some fancy early binding. An early binding object has some advantages. It is faster and coding programs becomes simpler. This is because you can add the reference to your object and then cut and paste code from VBScript directly to VB since "new RegExp" will work right away.

 

I've also referenced the Outlook 9.0 object library in the same manner as regular expressions, and for the same reasons. Of course, you can still make COM calls using CreateObject(), but this proves to be simpler. Afterward creating these objects, it is just simple code that accesses the contact folders and tries to match the name of cities. I have a small helper-function, compareCollectionObjects(x,y) that takes 2 collection objects and compares them to see if they are equal.

 

To try this program, simply copy the code to VB(need to add references) and call the function FindCityContacts().

 

Sub FindCityContacts()

 

    Dim strTemp

    Dim index

    Dim citySearch

    Dim myNameSpace, myContacts, newCityContacts, newCityContactsName

    Dim contact

    Dim newContact

 

    'Set the early binding objects

    Dim re as New RegExp

    Dim myApp as New Outlook.Application

 

    re.Global = True

    re.IgnoreCase = True

 

    citySearch = InputBox("Please enter the cities of your search, separated by commas.")

    newCityContactsName = InputBox("Please enter the new contact folder name")

 

    'Set some of the objects and create the new Contacts folder

    Set myNameSpace = myApp.GetNamespace("MAPI")

    'olFolderContacts = 10

    Set myContacts = myNameSpace.GetDefaultFolder(10)

    Set newCityContacts = myContacts.Folders.Add(newCityContactsName)

 

    'Set cities, using regular expressions to contain the city names

    re.Pattern = "[^,]+"

    Set cities = re.Execute(citySearch)

    For Each city In cities

 

      'Set citytokens to be the individual tokens in the city name

      'Then we compare them to the address tokens in each contact

        re.Pattern = "[^ ]+"

        Set citytokens = re.Execute(city)

 

        For i = 1 to myContacts.Items.Count

            re.Pattern = "[^ ]+"

            Set contact = myContacts.Items.Item(i)

 

            Set HomeAddressCityTokens = re.Execute(contact.HomeAddressCity)

            If compareCollectionObjects(HomeAddressCityTokens, citytokens) = 1 Then

 

                Set newContact = contact.Copy

                newContact.Move newCityContacts

            End If

 

            Set OtherAddressCityTokens = re.Execute(contact.OtherAddressCity)

            If compareCollectionObjects(OtherAddressCityTokens, citytokens) = 1 Then

                Set newContact = contact.Copy

                newContact.Move newCityContacts

            End If

 

            Set BusinessAddressCityTokens = re.Execute(contact.BusinessAddressCity)

            If compareCollectionObjects(BusinessAddressCityTokens, citytokens) = 1 Then

                Set newContact = contact.Copy

                newContact.Move newCityContacts

            End If

        Next

    Next

 

MsgBox "done"

 

End Sub

 

'This function is provided as a helper-function

' to compare two collection objects.

Function compareCollectionObjects(x, y)

 

    Dim index

    Dim flag

    flag = 1

 

    If x.Count <> y.Count Then

        flag = 0

    Else

        index = x.Count

 

        For i = 0 To (index - 1)

            If StrComp(x.Item(i), y.Item(i), 1) Then

                flag = 0

            End If

        Next

    End If

 

    compareCollectionObjects = flag

 

End Function

 

Information Overload!

As you can see, Microsoft beefs up VBScript with regular expressions in Version 5.0. This was one of the biggest gripes when comparing the robustness of VBScript and JScript. In Version 5.0 of the scripting engines, we have made a dedicated move towards sharpening up the feature set of VBScript. With the addition of regular expressions, you now have more control and power over your data, enabling you to create more powerful Web applications on the client and server.

 

The Scripting Technologies Team is always looking for feedback, so please feel to use our scripting newsgroups. Or mail us at mailto:mssccript@microsoft.com. We are always looking for feedback on how we can continue to improve our product.

 

For more information:

 

Mastering Regular Expressions, by Jeffrey E. F. Friedl.

 

Microsoft Scripting Technologies Site

 

 

 

 

 

--------------------------------------------------------------------------------

 

Scripting Clinic

 

Vernon W. Hui is a Software Test Engineer in the Microsoft Scripting Technologies group. He splits his time as a basketball gym rat and playing network games with his close friends in Chicago and Urbana, IL.

 

Contact Us  |  E-Mail this Page  |  MSDN Flash Newsletter 

© 2002 Microsoft Corporation. All rights reserved.  Terms of Use  Privacy Statement  Accessibility 

Link to comment
Share on other sites

Join the conversation

You can post now and register later. If you have an account, sign in now to post with your account.
Note: Your post will require moderator approval before it will be visible.

Guest
Reply to this topic...

×   Pasted as rich text.   Paste as plain text instead

  Only 75 emoji are allowed.

×   Your link has been automatically embedded.   Display as a link instead

×   Your previous content has been restored.   Clear editor

×   You cannot paste images directly. Upload or insert images from URL.

Loading...
 Share

×
×
  • Create New...