C++ Boost

Regex++, Introduction.

Copyright (c) 1998-2001

Dr John Maddock

Permission to use, copy, modify, distribute and sell this software and its documentation for any purpose is hereby granted without fee, provided that the above copyright notice appear in all copies and that both that copyright notice and this permission notice appear in supporting documentation. Dr John Maddock makes no representations about the suitability of this software for any purpose. It is provided "as is" without express or implied warranty.


Introduction

Regular expressions are a form of pattern matching that is often used in text processing; many users will be familiar with the Unix utilities grep, sed and awk, and the programming language perl, each of which makes extensive use of regular expressions. Traditionally C++ users have been limited to the POSIX C APIs for manipulating regular expressions, and while regex++ does provide these APIs, they do not represent the best way to use the library. For example, regex++ can cope with wide character strings, and with search and replace operations (in a manner analogous to either sed or perl), something that traditional C libraries cannot do.

The class boost::reg_expression is the key class in this library; it represents a "machine readable" regular expression, and is very closely modelled on std::basic_string: think of it as a string plus the actual state-machine required by the regular expression algorithms. As with std::basic_string, there are two typedefs that are almost always the means by which this class is referenced:

namespace boost{

template <class charT, 
          class traits = regex_traits<charT>, 
          class Allocator = std::allocator<charT> >
class reg_expression;

typedef reg_expression<char> regex;
typedef reg_expression<wchar_t> wregex;

}

To see how this library can be used, imagine that we are writing a credit card processing application. Credit card numbers generally come as a string of 16 digits, separated into groups of 4 digits by either a space or a hyphen. Before storing a credit card number in a database (not necessarily something your customers will appreciate!), we may want to verify that the number is in the correct format. To match any digit we could use the regular expression [0-9]; however, ranges of characters like this are actually locale dependent. Instead we should use the POSIX standard form [[:digit:]], or the regex++ and perl shorthand for this, \d (note that many older libraries tended to be hard-coded to the C-locale, so this was not an issue for them). That leaves us with the following regular expression to validate credit card number formats:

(\d{4}[- ]){3}\d{4}

Here the parentheses act to group (and mark for future reference) sub-expressions, and the {4} means "repeat exactly 4 times". This is an example of the extended regular expression syntax used by perl, awk and egrep. Regex++ also supports the older "basic" syntax used by sed and grep, but this is generally less useful, unless you already have some basic regular expressions that you need to reuse.

Now let's take that expression and place it in some C++ code to validate the format of a credit card number:

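#include <string>
#include <boost/regex.hpp>

// Returns true if s is four groups of four digits,
// separated by spaces or hyphens.
bool validate_card_format(const std::string& s)
{
   static const boost::regex e("(\\d{4}[- ]){3}\\d{4}");
   return regex_match(s, e);
}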

Note how we had to add some extra escapes to the expression: remember that the escape is seen once by the C++ compiler before it gets to be seen by the regular expression engine, so escapes in regular expressions have to be doubled up when embedding them in C/C++ code. Also note that all the examples assume that your compiler supports Koenig lookup; if yours doesn't (for example VC6), then you will have to add some boost:: prefixes to some of the function calls in the examples.
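On compilers that support C++11 raw string literals, the doubling of escapes can be avoided. The following is only a minimal sketch, not part of regex++: it assumes a C++11 compiler and uses std::regex (the standard library facility derived from this library) as a stand-in for boost::regex, and both function names are hypothetical.

```cpp
#include <regex>
#include <string>

// Escaped form: each backslash is doubled so that the C++ compiler
// passes a single backslash through to the regular expression engine.
bool validate_card_format_escaped(const std::string& s)
{
   static const std::regex e("(\\d{4}[- ]){3}\\d{4}");
   return std::regex_match(s, e);
}

// Raw string literal form (C++11): the pattern is written exactly as
// the regular expression engine sees it, with no doubling.
bool validate_card_format_raw(const std::string& s)
{
   static const std::regex e(R"((\d{4}[- ]){3}\d{4})");
   return std::regex_match(s, e);
}
```

Both functions compile the same expression; only the spelling of the literal in the source file differs.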

Those of you who are familiar with credit card processing will have realised that while the format used above is suitable for human readable card numbers, it does not represent the format required by online credit card systems; these require the number as a string of 16 (or possibly 15) digits, without any intervening spaces. What we need is a means to convert easily between the two formats, and this is where search and replace comes in. Those who are familiar with the utilities sed and perl will already be ahead here; we need two strings: one a regular expression, the other a "format string" that provides a description of the text to replace the match with. In regex++ this search and replace operation is performed with the algorithm regex_merge. For our credit card example we can write two algorithms like this to provide the format conversions:

// match any format with the regular expression:
const boost::regex e("\\A(\\d{3,4})[- ]?(\\d{4})[- ]?(\\d{4})[- ]?(\\d{4})\\z");
// sed-style format strings: \N stands for the Nth marked sub-expression
const std::string machine_format("\\1\\2\\3\\4");
const std::string human_format("\\1-\\2-\\3-\\4");

// remove any separators, giving a plain string of digits:
std::string machine_readable_card_number(const std::string& s)
{
   return regex_merge(s, e, machine_format, boost::match_default | boost::format_sed);
}

// separate the four groups of digits with hyphens:
std::string human_readable_card_number(const std::string& s)
{
   return regex_merge(s, e, human_format, boost::match_default | boost::format_sed);
}

Here we've used marked sub-expressions in the regular expression to split out the four parts of the card number as separate fields; the format string then uses the sed-like syntax to replace the matched text with the reformatted version.
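For readers without regex++ to hand, the same conversions can be sketched with the C++11 standard library. This is an illustration under stated assumptions rather than the library's own interface: std::regex_replace stands in for regex_merge, the anchors ^ and $ stand in for \A and \z (which the default std::regex grammar does not provide), and the names card_re, machine_readable and human_readable are hypothetical.

```cpp
#include <regex>
#include <string>

// Assumption: std::regex stands in for boost::regex, and ^...$ replace
// the \A...\z anchors used in the regex++ version.
static const std::regex card_re(
   R"(^(\d{3,4})[- ]?(\d{4})[- ]?(\d{4})[- ]?(\d{4})$)");

// Strip the separators, leaving a plain string of digits.
std::string machine_readable(const std::string& s)
{
   return std::regex_replace(s, card_re, std::string("\\1\\2\\3\\4"),
                             std::regex_constants::format_sed);
}

// Insert hyphens between the four marked fields.
std::string human_readable(const std::string& s)
{
   return std::regex_replace(s, card_re, std::string("\\1-\\2-\\3-\\4"),
                             std::regex_constants::format_sed);
}
```

Text that does not match the expression is passed through unchanged, so a caller that needs to reject malformed numbers should validate the input first.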

In the examples above, we haven't directly manipulated the results of a regular expression match; in general, however, the result of a match contains a number of sub-expression matches in addition to the overall match. When the library needs to report a regular expression match it does so using an instance of the class match_results. As before, there are typedefs of this class for the most common cases:

namespace boost{
typedef match_results<const char*> cmatch;
typedef match_results<const wchar_t*> wcmatch;
typedef match_results<std::string::const_iterator> smatch;
typedef match_results<std::wstring::const_iterator> wsmatch; 
}

The algorithms regex_search and regex_grep (the latter finds all matches in a string) make use of match_results to report what matched.
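As a rough illustration of reading a match result, the sketch below uses the C++11 standard library equivalents: std::smatch in place of boost::smatch, and std::sregex_iterator to visit every match, which is the job regex_grep does in regex++. The regex_grep interface itself is not shown, and both function names are hypothetical.

```cpp
#include <regex>
#include <string>
#include <vector>

// Assumption: std::regex, std::smatch and std::sregex_iterator stand in
// for the regex++ types.

// Returns the last group of four digits of the first card number
// found in text, or an empty string if there is none.
std::string last_four_digits(const std::string& text)
{
   static const std::regex e(R"((\d{4})[- ](\d{4})[- ](\d{4})[- ](\d{4}))");
   std::smatch what;
   if (std::regex_search(text, what, e))
      return what[4].str();   // what[0] would be the overall match
   return std::string();
}

// Collects every card number found in text.
std::vector<std::string> all_card_numbers(const std::string& text)
{
   static const std::regex e(R"((\d{4}[- ]){3}\d{4})");
   std::vector<std::string> found;
   for (std::sregex_iterator i(text.begin(), text.end(), e), end; i != end; ++i)
      found.push_back(i->str());
   return found;
}
```

Index 0 of a match result refers to the whole match, and indexes 1 and upwards refer to the marked sub-expressions in the order their opening parentheses appear.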

Note that these algorithms are not restricted to searching regular C-strings; any bidirectional iterator type can be searched, allowing for the possibility of seamlessly searching almost any kind of data.
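To make the point about iterators concrete, here is a small sketch that searches a std::list<char>, a container that offers only bidirectional iterators. It assumes a C++11 compiler and again uses std::regex in place of boost::regex; the function name first_number is hypothetical.

```cpp
#include <list>
#include <regex>
#include <string>

// Assumption: std::regex stands in for boost::regex. The match_results
// type is instantiated on the list's iterator rather than on const char*.
std::string first_number(const std::list<char>& data)
{
   typedef std::list<char>::const_iterator iterator;
   static const std::regex e("[[:digit:]]+");
   std::match_results<iterator> what;
   if (std::regex_search(data.begin(), data.end(), what, e))
      return what.str();      // copy the matched characters into a string
   return std::string();      // no digits found
}
```

The only change from the string based examples is the iterator type given to match_results; the expression and the algorithm are used in the same way.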

For search and replace operations, in addition to the algorithm regex_merge that we have already seen, the algorithm regex_format takes the result of a match and a format string, and produces a new string by merging the two.
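The idea of merging a match result with a format string can be sketched as follows. This is an assumption-laden illustration, not the regex_format interface: it uses the C++11 member function std::match_results::format with the default $N format syntax of std::regex, and the function name masked_card_number is hypothetical.

```cpp
#include <regex>
#include <string>

// Assumption: std::match_results::format stands in for regex_format.
std::string masked_card_number(const std::string& s)
{
   static const std::regex e(R"((\d{4})[- ]?(\d{4})[- ]?(\d{4})[- ]?(\d{4}))");
   std::smatch what;
   if (!std::regex_match(s, what, e))
      return std::string();   // not a card number
   // Merge the match with the format string: $4 is replaced by the
   // fourth marked sub-expression, everything else is copied literally.
   return what.format("XXXX-XXXX-XXXX-$4");
}
```

Unlike a search and replace over the whole input, this works from a match that has already been made, so the same result can be formatted several ways without matching again.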

For those that dislike templates, there is a high level wrapper class RegEx that is an encapsulation of the lower level template code - it provides a simplified interface for those that don't need the full power of the library, and supports only narrow characters, and the "extended" regular expression syntax.

The POSIX API functions regcomp, regexec, regfree and regerror are available in both narrow character and Unicode versions, and are provided for those who need compatibility with these APIs.
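For comparison with the earlier C++ examples, the card format check might look like this when written against the POSIX calls. The sketch assumes a POSIX platform and includes the system's own <regex.h> rather than the regex++ versions of these functions; the function name posix_validate_card_format is hypothetical.

```cpp
#include <regex.h>   // assumption: the platform's POSIX header
#include <string>

// Validates a card number using regcomp, regexec and regfree.
bool posix_validate_card_format(const std::string& s)
{
   regex_t re;
   // REG_EXTENDED selects the "extended" syntax; REG_NOSUB skips
   // recording sub-expression matches, which are not needed here.
   if (regcomp(&re, "^([[:digit:]]{4}[- ]){3}[[:digit:]]{4}$",
               REG_EXTENDED | REG_NOSUB) != 0)
      return false;                         // pattern failed to compile
   const int result = regexec(&re, s.c_str(), 0, 0, 0);
   regfree(&re);                            // release the compiled state
   return result == 0;                      // 0 means the string matched
}
```

The compiled expression has to be released by hand and the input must be a null terminated narrow string, which are two of the reasons the C++ interface is usually more convenient.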

Finally, note that the library now has run-time localization support, and recognizes the full POSIX regular expression syntax - including advanced features like multi-character collating elements and equivalence classes - as well as providing compatibility with other regular expression libraries including GNU and BSD4 regex packages, and to a more limited extent perl 5.

Installation and Configuration Options

[ Important: If you are upgrading from the 2.x version of this library then you will find a number of changes to the documented header names and library interfaces; existing code should still compile unchanged, however - see Note for Upgraders. ]

When you extract the library from its zip file, you must preserve its internal directory structure (for example by using the -d option when extracting). If you didn't do that when extracting, then you'd better stop reading this, delete the files you just extracted, and try again!

This library should not need configuring before use; most popular compilers/standard libraries/platforms are already supported "as is". If you do experience configuration problems, or just want to test the configuration with your compiler, then the process is the same as for all of boost; see the configuration library documentation.

The library will encase all code inside namespace boost.

Unlike some other template libraries, this library consists of a mixture of template code (in the headers) and static code and data (in cpp files). Consequently it is necessary to build the library's support code into a library or archive file before you can use it; instructions for specific platforms are as follows:

Borland C++ Builder:

make -fbcb5.mak

The build process will build a variety of .lib and .dll files (the exact number depends upon the version of Borland's tools you are using). The .lib and .dll files will be in a sub-directory called bcb4 or bcb5 depending upon the makefile used. To install the libraries into your development system use:

make -fbcb5.mak install

The library files will be copied to <BCROOT>/lib and the dll files to <BCROOT>/bin, where <BCROOT> corresponds to the install path of your Borland C++ tools.

You may also remove temporary files created during the build process (excluding lib and dll files) by using:

make -fbcb5.mak clean

Finally, when you use regex++ it is only necessary for you to add the <boost> root directory to your list of include directories for that project. It is not necessary for you to manually add a .lib file to the project; the headers will automatically select the correct .lib file for your build mode and tell the linker to include it. There is one caveat, however: the library cannot tell the difference between VCL and non-VCL enabled builds when building a GUI application from the command line. If you build from the command line with the 5.5 command line tools then you must define the pre-processor symbol _NO_VCL in order to ensure that the correct link libraries are selected; the C++ Builder IDE normally sets this automatically. Hint: users of the 5.5 command line tools may want to add -D_NO_VCL to bcc32.cfg in order to set this option permanently.

If you would prefer to do a static link to the regex libraries even when using the dll runtime then define BOOST_REGEX_STATIC_LINK, and if you want to suppress automatic linking altogether (and supply your own custom build of the lib) then define BOOST_REGEX_NO_LIB.

If you are building with C++ Builder 6, you will find that <boost/regex.hpp> cannot be used in a pre-compiled header (the actual problem is in <locale>, which gets included by <boost/regex.hpp>). If this causes problems for you, then try defining BOOST_NO_STD_LOCALE when building; this will disable some features throughout boost, but may save you a lot in compile times!

Microsoft Visual C++ 6 and 7

You need version 6 of MSVC to build this library. If you are using VC5 then you may want to look at one of the previous releases of this library.

Open up a command prompt, which has the necessary MSVC environment variables defined (for example by using the batch file Vcvars32.bat installed by the Visual Studio installation), and change to the <boost>\libs\regex\build directory.

Select the correct makefile - vc6.mak for "vanilla" Visual C++ 6 or vc6-stlport.mak if you are using STLport.

Invoke the makefile like this:

nmake -fvc6.mak

You will now have a collection of lib and dll files in a "vc6" subdirectory; to install these into your development system use:

nmake -fvc6.mak install

The lib files will be copied to your <VC6>\lib directory and the dll files to <VC6>\bin, where <VC6> is the root of your Visual C++ 6 installation.

You can delete all the temporary files created during the build (excluding lib and dll files) using:

nmake -fvc6.mak clean

Finally when you use regex++ it is only necessary for you to add the <boost> root directory to your list of include directories for that project. It is not necessary for you to manually add a .lib file to the project; the headers will automatically select the correct .lib file for your build mode and tell the linker to include it.

Note that if you want to statically link to the regex library when using the dynamic C++ runtime, define BOOST_REGEX_STATIC_LINK when building your project (this only has an effect for release builds). If you want to add the source directly to your project then define BOOST_REGEX_NO_LIB to disable automatic library selection.

Important: there have been some reports of compiler-optimisation bugs affecting this library (particularly with VC6 versions prior to service pack 5). The workaround is to build the library using /Oityb1 rather than /O2, that is, to use all optimisation settings except /Oa. This problem is reported to affect some standard library code as well (in fact I'm not sure if the problem is with the regex code or the underlying standard library), so it's probably worthwhile applying this workaround in normal practice in any case.

Note: if you have replaced the C++ standard library that comes with VC6, then when you build the library you must ensure that the environment variables "INCLUDE" and "LIB" have been updated to reflect the include and library paths for the new library - see vcvars32.bat (part of your Visual Studio installation) for more details. Alternatively, if STLport is in c:/stlport then you could use:

nmake INCLUDES="-Ic:/stlport/stlport" XLFLAGS="/LIBPATH:c:/stlport/lib" -fvc6-stlport.mak

If you are building with the full STLport v4.x, then use the vc6-stlport.mak file provided and set the environment variable STLPORT_PATH to point to the location of your STLport installation. (Note that the full STLport libraries appear not to support single-thread static builds.)
 
 

GCC (2.95)

There is a conservative makefile for the g++ compiler. From the command prompt change to the <boost>/libs/regex/build directory and type:

make -fgcc.mak

At the end of the build process you should have a gcc sub-directory containing release and debug versions of the library (libboost_regex.a and libboost_regex_debug.a). When you build projects that use regex++, you will need to add the boost install directory to your list of include paths and add <boost>/libs/regex/build/gcc/libboost_regex.a to your list of library files.

There is also a makefile to build the library as a shared library:

make -fgcc-shared.mak

which will build libboost_regex.so and libboost_regex_debug.so.

Both of these makefiles support the following environment variables:

CXXFLAGS: extra compiler options - note that this applies to both the debug and release builds.

INCLUDES: additional include directories.

LDFLAGS: additional linker options.

LIBS: additional library files.

For the more adventurous there is a configure script in <boost>/libs/config; see the config library documentation.

Sun Workshop 6.1

There is a makefile for the Sun (6.1) compiler (C++ version 3.12). From the command prompt change to the <boost>/libs/regex/build directory and type:

dmake -f sunpro.mak

At the end of the build process you should have a sunpro sub-directory containing single and multithread versions of the library (libboost_regex.a, libboost_regex.so, libboost_regex_mt.a and libboost_regex_mt.so). When you build projects that use regex++, you will need to add the boost install directory to your list of include paths and add <boost>/libs/regex/build/sunpro/ to your library search path.

This makefile supports the following environment variables:

CXXFLAGS: extra compiler options - note that this applies to both the single and multithreaded builds.

INCLUDES: additional include directories.

LDFLAGS: additional linker options.

LIBS: additional library files.

LIBSUFFIX: a suffix to mangle the library name with (defaults to nothing).

This makefile does not set any architecture specific options like -xarch=v9; you can set these by defining the appropriate macros, for example:

dmake CXXFLAGS="-xarch=v9" LDFLAGS="-xarch=v9" LIBSUFFIX="_v9" -f sunpro.mak

will build v9 variants of the regex library named libboost_regex_v9.a etc.

Other compilers:

There is a generic makefile (generic.mak) provided in <boost-root>/libs/regex/build - see that makefile for details of environment variables that need to be set before use. Alternatively you can use the Jam based build system. If you need to configure the library for your platform, then refer to the config library documentation.


Copyright Dr John Maddock 1998-2001 all rights reserved.