Regex++, Traits Class Reference.

Copyright (c) 1998-2001 Dr John Maddock

Permission to use, copy, modify, distribute and sell this software and its documentation for any purpose is hereby granted without fee, provided that the above copyright notice appear in all copies and that both that copyright notice and this permission notice appear in supporting documentation. Dr John Maddock makes no representations about the suitability of this software for any purpose. It is provided "as is" without express or implied warranty.
This section describes the traits class requirements of the reg_expression template class. These requirements are somewhat complex (sorry) and subject to change as users ask for new features; however, I will try to keep them stable for a while, and ideally the requirements should lessen rather than increase.
The reg_expression traits classes encapsulate both the properties of a character type, and the properties of the locale associated with that type. The associated locale may be defined at run-time (via std::locale), or hard-coded into the traits class and determined at compile time.
The following example class illustrates the interface required by a "typical" traits class for use with class reg_expression:
class mytraits
{
public:
    typedef implementation_defined char_type;
    typedef implementation_defined uchar_type;
    typedef implementation_defined size_type;
    typedef implementation_defined string_type;
    typedef implementation_defined locale_type;
    typedef implementation_defined uint32_t;

    struct sentry
    {
        sentry(const mytraits&);
        operator void*() { return this; }
    };

    enum char_syntax_type
    {
        syntax_char = 0,
        syntax_open_bracket = 1,   // (
        syntax_close_bracket = 2,  // )
        syntax_dollar = 3,         // $
        syntax_caret = 4,          // ^
        syntax_dot = 5,            // .
        syntax_star = 6,           // *
        syntax_plus = 7,           // +
        syntax_question = 8,       // ?
        syntax_open_set = 9,       // [
        syntax_close_set = 10,     // ]
        syntax_or = 11,            // |
        syntax_slash = 12,         // \ (backslash)
        syntax_hash = 13,          // #
        syntax_dash = 14,          // -
        syntax_open_brace = 15,    // {
        syntax_close_brace = 16,   // }
        syntax_digit = 17,         // 0-9
        syntax_b = 18,             // for \b
        syntax_B = 19,             // for \B
        syntax_left_word = 20,     // for \<
        syntax_right_word = 21,    // for \>
        syntax_w = 22,             // for \w
        syntax_W = 23,             // for \W
        syntax_start_buffer = 24,  // for \`
        syntax_end_buffer = 25,    // for \'
        syntax_newline = 26,       // for newline alt
        syntax_comma = 27,         // for {x,y}
        syntax_a = 28,             // for \a
        syntax_f = 29,             // for \f
        syntax_n = 30,             // for \n
        syntax_r = 31,             // for \r
        syntax_t = 32,             // for \t
        syntax_v = 33,             // for \v
        syntax_x = 34,             // for \xdd
        syntax_c = 35,             // for \cx
        syntax_colon = 36,         // for [:...:]
        syntax_equal = 37,         // for [=...=]

        // perl ops:
        syntax_e = 38,             // for \e
        syntax_l = 39,             // for \l
        syntax_L = 40,             // for \L
        syntax_u = 41,             // for \u
        syntax_U = 42,             // for \U
        syntax_s = 43,             // for \s
        syntax_S = 44,             // for \S
        syntax_d = 45,             // for \d
        syntax_D = 46,             // for \D
        syntax_E = 47,             // for \Q\E
        syntax_Q = 48,             // for \Q\E
        syntax_X = 49,             // for \X
        syntax_C = 50,             // for \C
        syntax_Z = 51,             // for \Z
        syntax_G = 52,             // for \G
        syntax_bang = 53,          // reserved for future use '!'
        syntax_and = 54            // reserved for future use '&'
    };

    enum
    {
        char_class_none = 0,
        char_class_alpha,
        char_class_cntrl,
        char_class_digit,
        char_class_lower,
        char_class_punct,
        char_class_space,
        char_class_upper,
        char_class_xdigit,
        char_class_blank,
        char_class_unicode,
        char_class_alnum,
        char_class_graph,
        char_class_print,
        char_class_word
    };

    static size_t length(const char_type* p);
    unsigned int syntax_type(size_type c)const;
    char_type translate(char_type c, bool icase)const;
    void transform(string_type& out, const string_type& in)const;
    void transform_primary(string_type& out, const string_type& in)const;
    bool is_separator(char_type c)const;
    bool is_combining(char_type)const;
    bool is_class(char_type c, uint32_t f)const;
    int toi(char_type c)const;
    int toi(const char_type*& first, const char_type* last, int radix)const;
    uint32_t lookup_classname(const char_type* first, const char_type* last)const;
    bool lookup_collatename(string_type& buf, const char_type* first, const char_type* last)const;
    locale_type imbue(locale_type l);
    locale_type getloc()const;
    std::string error_string(unsigned id)const;

    mytraits();
    ~mytraits();
};
The member types required by a traits class are defined as follows:
Member name | Description
char_type | The character type encapsulated by this traits class; must be a POD type and be convertible to uchar_type.
uchar_type | The unsigned type corresponding to char_type; must be convertible to size_type.
size_type | An unsigned integral type with at least as much precision as uchar_type.
string_type | A type that offers the same facilities as std::basic_string<char_type>. This is used for collating elements and sort strings. If char_type has no locale-dependent collation (it is not a "character"), then it could be something simpler than std::basic_string.
locale_type | A type that encapsulates the locale used by the traits class; probably std::locale, but it could be a platform-specific type, or a dummy type if per-instance locales are not supported by the traits class.
uint32_t | An unsigned integral type with at least 32 bits of precision, used as a bitmask type for character classification.
sentry | A class or struct type that is constructible from an instance of the traits class and is convertible to void*. An instance of type sentry is constructed before compiling each regular expression; it provides an opportunity to carry out prefix/suffix operations on the traits class. For example, a traits class that encapsulates the global locale can use this as an opportunity to synchronize with the global locale (by updating any cached data).
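As a rough illustration of the member types and the sentry described above, the following sketch fills in the typedefs for a narrow-character traits class. The class name sketch_traits, the choice of std::locale as locale_type, and the sync_with_global_locale helper are assumptions made for this example; they are not taken from the library, and the rest of the required interface is omitted.

```cpp
#include <cassert>
#include <cstdint>
#include <locale>
#include <string>
#include <type_traits>

// Hypothetical skeleton: only the member types, getloc() and the sentry are
// shown; a real traits class needs the rest of the interface as well.
class sketch_traits {
public:
    typedef char                         char_type;
    typedef unsigned char                uchar_type;
    typedef unsigned int                 size_type;
    typedef std::basic_string<char_type> string_type;
    typedef std::locale                  locale_type;
    typedef std::uint_least32_t          uint32_t;   // at least 32 bits wide

    // Constructed before each regular expression is compiled; this sketch
    // uses the constructor to refresh the cached copy of the global locale.
    struct sentry {
        sentry(const sketch_traits& t) { t.sync_with_global_locale(); }
        operator void*() { return this; }
    };

    sketch_traits() : cached_(std::locale()) {}

    locale_type getloc() const { return cached_; }

private:
    // Assumed helper, not part of the documented interface.
    void sync_with_global_locale() const { cached_ = std::locale(); }

    mutable locale_type cached_;
};

// Compile-time checks of the requirements listed in the table.
static_assert(std::is_trivial<sketch_traits::char_type>::value &&
              std::is_standard_layout<sketch_traits::char_type>::value,
              "char_type must be a POD type");
static_assert(std::is_convertible<sketch_traits::char_type,
                                  sketch_traits::uchar_type>::value,
              "char_type must be convertible to uchar_type");
static_assert(std::is_unsigned<sketch_traits::size_type>::value &&
              sizeof(sketch_traits::size_type) >= sizeof(sketch_traits::uchar_type),
              "size_type must be unsigned and at least as wide as uchar_type");

// Usage: roughly what a regular expression compiler might do before parsing.
inline void example_usage() {
    sketch_traits t;
    sketch_traits::sentry guard(t);                // prefix operation runs here
    assert(static_cast<void*>(guard) != nullptr);  // sentry converts to a non-null pointer
    assert(t.getloc() == std::locale());           // cache matches the global locale
}
```

The cached locale is declared mutable because the sentry only receives a const reference to the traits object; a traits class with no cached state could give sentry an empty constructor instead.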
The following member constants are used to represent the locale-independent syntax of a regular expression. The member function syntax_type returns one of these values, and is used to convert a locale-dependent regular expression into a locale-independent sequence of tokens.
Member constant | English language representation
syntax_char | All non-special characters.
syntax_open_bracket | (
syntax_close_bracket | )
syntax_dollar | $
syntax_caret | ^
syntax_dot | .
syntax_star | *
syntax_plus | +
syntax_question | ?
syntax_open_set | [
syntax_close_set | ]
syntax_or | |
syntax_slash | \
syntax_hash | #
syntax_dash | -
syntax_open_brace | {
syntax_close_brace | }
syntax_digit | 0123456789
syntax_b | b
syntax_B | B
syntax_left_word | <
syntax_right_word | >
syntax_w | w
syntax_W | W
syntax_start_buffer | `
syntax_end_buffer | '
syntax_newline | \n
syntax_comma | ,
syntax_a | a
syntax_f | f
syntax_n | n
syntax_r | r
syntax_t | t
syntax_v | v
syntax_x | x
syntax_c | c
syntax_colon | :
syntax_equal | =
syntax_e | e
syntax_l | l
syntax_L | L
syntax_u | u
syntax_U | U
syntax_s | s
syntax_S | S
syntax_d | d
syntax_D | D
syntax_E | E
syntax_Q | Q
syntax_X | X
syntax_C | C
syntax_Z | Z
syntax_G | G
syntax_bang | !
syntax_and | &
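To make the idea of a locale-independent token sequence more concrete, here is a minimal sketch of a table-driven syntax_type for narrow characters, together with a helper that maps a pattern to its tokens. The names syntax_map and tokenize are hypothetical, only a handful of the constants are reproduced, and the letter that acts as the word-class shortcut is passed in as a constructor parameter purely to stand in for a localized syntax.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// A few of the syntax constants listed above, with the same values.
enum char_syntax_type {
    syntax_char          = 0,
    syntax_open_bracket  = 1,   // (
    syntax_close_bracket = 2,   // )
    syntax_star          = 6,   // *
    syntax_slash         = 12,  // backslash
    syntax_w             = 22   // word-class shortcut letter
};

// Hypothetical lookup-table mapping for narrow characters.
class syntax_map {
public:
    typedef unsigned int size_type;

    explicit syntax_map(char word_letter) {
        for (std::size_t i = 0; i < 256; ++i) table_[i] = syntax_char;
        set('(', syntax_open_bracket);
        set(')', syntax_close_bracket);
        set('*', syntax_star);
        set('\\', syntax_slash);
        set(word_letter, syntax_w);
    }

    // Every character without a special meaning maps to syntax_char.
    unsigned int syntax_type(size_type c) const {
        return c < 256 ? table_[c] : static_cast<unsigned int>(syntax_char);
    }

private:
    void set(char c, char_syntax_type t) {
        table_[static_cast<unsigned char>(c)] = static_cast<unsigned char>(t);
    }

    unsigned char table_[256];
};

// Converts a pattern into its sequence of syntax tokens, one per character.
inline std::vector<unsigned int> tokenize(const syntax_map& m,
                                          const std::string& pattern) {
    std::vector<unsigned int> tokens;
    for (std::size_t i = 0; i < pattern.size(); ++i)
        tokens.push_back(m.syntax_type(static_cast<unsigned char>(pattern[i])));
    return tokens;
}

inline void example_usage() {
    const syntax_map english('w');
    const syntax_map localized('m');   // hypothetical localized shortcut letter
    // Two differently spelled patterns yield the same token sequence.
    assert(tokenize(english, "(\\w*)") == tokenize(localized, "(\\m*)"));
    // In the English mapping, 'm' is an ordinary character.
    assert(english.syntax_type('m') == syntax_char);
}
```

A flat 256-entry table is only one possible design; a traits class for a wider character type would presumably need a sparse lookup instead.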
The following member constants are used to represent
particular character classifications:
Member constant | Description
char_class_none | No classification, must be zero.
char_class_alpha | All alphabetic characters.
char_class_cntrl | All control characters.
char_class_digit | All decimal digits.
char_class_lower | All lower case characters.
char_class_punct | All punctuation characters.
char_class_space | All white-space characters.
char_class_upper | All upper case characters.
char_class_xdigit | All hexadecimal digit characters.
char_class_blank | All blank characters (space + tab).
char_class_unicode | All extended Unicode characters: those that cannot be represented as a single narrow character.
char_class_alnum | All alpha-numeric characters.
char_class_graph | All graphic characters.
char_class_print | All printable characters.
char_class_word | All word characters (alphanumeric characters + the underscore).
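Because the classification constants are combined into a bitmask, one plausible layout gives each basic class its own bit and builds the composite classes from those bits. The sketch below assumes exactly that, using the C library's <cctype> functions for narrow characters; the bit values, the class_mask typedef, the classify helper and the char_class_underscore bit are all assumptions of this example rather than the library's actual definitions. char_class_unicode is left out, since no narrow character would belong to it.

```cpp
#include <cassert>
#include <cctype>

typedef unsigned long class_mask;   // stand-in for the traits' uint32_t

// Assumed bit assignments; the table only requires char_class_none == 0.
enum {
    char_class_none       = 0,
    char_class_alpha      = 1u << 0,
    char_class_cntrl      = 1u << 1,
    char_class_digit      = 1u << 2,
    char_class_lower      = 1u << 3,
    char_class_punct      = 1u << 4,
    char_class_space      = 1u << 5,
    char_class_upper      = 1u << 6,
    char_class_xdigit     = 1u << 7,
    char_class_blank      = 1u << 8,
    char_class_graph      = 1u << 9,
    char_class_print      = 1u << 10,
    char_class_underscore = 1u << 11,  // hypothetical helper bit for '_'
    char_class_alnum      = char_class_alpha | char_class_digit,
    char_class_word       = char_class_alnum | char_class_underscore
};

// Computes the set of basic classes that c belongs to.
inline class_mask classify(char c) {
    const unsigned char u = static_cast<unsigned char>(c);
    class_mask m = char_class_none;
    if (std::isalpha(u))  m |= char_class_alpha;
    if (std::iscntrl(u))  m |= char_class_cntrl;
    if (std::isdigit(u))  m |= char_class_digit;
    if (std::islower(u))  m |= char_class_lower;
    if (std::ispunct(u))  m |= char_class_punct;
    if (std::isspace(u))  m |= char_class_space;
    if (std::isupper(u))  m |= char_class_upper;
    if (std::isxdigit(u)) m |= char_class_xdigit;
    if (std::isgraph(u))  m |= char_class_graph;
    if (std::isprint(u))  m |= char_class_print;
    if (c == ' ' || c == '\t') m |= char_class_blank;
    if (c == '_')              m |= char_class_underscore;
    return m;
}

// True only if c is a member of at least one of the classes in f.
inline bool is_class(char c, class_mask f) {
    return (classify(c) & f) != 0;
}

inline void example_usage() {
    assert(is_class('a', char_class_alpha | char_class_digit));
    assert(is_class('7', char_class_alnum));
    assert(is_class('_', char_class_word));
    assert(!is_class('-', char_class_word));
    assert(!is_class('a', char_class_none));
}
```

Giving the underscore its own bit keeps char_class_word from pulling in every punctuation character, which would happen if the word class were built from char_class_punct.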
The following member functions are required by all regular expression traits classes. Those members that are declared here as const could be declared static instead if the class does not contain instance data:
Member function | Description
static size_t length(const char_type* p); | Returns the length of the null-terminated string p.
unsigned int syntax_type(size_type c)const; | Converts an input character into a locale-independent token (one of the syntax_xxx member constants). Called when parsing the regular expression into a locale-independent parse tree. Example: in English-language regular expressions we would use "[[:word:]]" to represent the character class of all word characters, and "\w" as a shortcut for this. Consequently syntax_type('w') returns syntax_w. In French-language regular expressions, we would use "[[:mot:]]" in place of "[[:word:]]" and "\m" in place of "\w"; therefore it is syntax_type('m') that returns syntax_w.
char_type translate(char_type c, bool icase)const; | Translates an input character into a unique identifier that represents the equivalence class that the character belongs to. If icase is true, then the returned value is insensitive to case. [An equivalence class is the set of all characters that must be treated as being equivalent to each other.]
void transform(string_type& out, const string_type& in)const; | Transforms the string in into a locale-dependent sort key, and stores the result in out.
void transform_primary(string_type& out, const string_type& in)const; | Transforms the string in into a locale-dependent primary sort key, and stores the result in out.
bool is_separator(char_type c)const; | Returns true only if c is a line separator.
bool is_combining(char_type c)const; | Returns true only if c is a Unicode combining character.
bool is_class(char_type c, uint32_t f)const; | Returns true only if c is a member of one of the character classes represented by the bitmask f.
int toi(char_type c)const; | Converts the character c to a decimal integer. [Precondition: is_class(c,char_class_digit)==true]
int toi(const char_type*& first, const char_type* last, int radix)const; | Converts the string [first, last) into an integral value using base radix. Stops when it finds the first non-digit character, and sets first to point to that character. [Precondition: is_class(*first,char_class_digit)==true]
uint32_t lookup_classname(const char_type* first, const char_type* last)const; | Returns the bitmask representing the character class named by [first, last), or char_class_none if [first, last) is not recognized as a character class name.
bool lookup_collatename(string_type& buf, const char_type* first, const char_type* last)const; | If the sequence [first, last) is the name of a known collating element, then stores the collating element in buf and returns true; otherwise returns false.
locale_type imbue(locale_type l); | Imbues the class with the locale l.
locale_type getloc()const; | Returns the traits-class locale.
std::string error_string(unsigned id)const; | Returns the locale-dependent error string associated with the error number id. The parameter id is one of the REG_XXX error codes described by the POSIX standard, and defined in <boost/cregex.hpp>.
mytraits(); | Constructor.
~mytraits(); | Destructor.
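A few of these members are simple enough to sketch for plain narrow characters. The example below is one possible reading of the descriptions of length, translate, both toi overloads and lookup_classname; it is not the library's implementation. The class name sketch_traits, the digit_value helper, the three class names recognized, the bit values, and the limit of radix 16 are all assumptions made for illustration.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <cstring>
#include <string>

class sketch_traits {
public:
    typedef char          char_type;
    typedef unsigned long uint32_t;   // at least 32 bits wide

    // Assumed bit values; only a few classes are shown.
    enum { char_class_none = 0, char_class_alpha = 1,
           char_class_digit = 2, char_class_space = 4 };

    static std::size_t length(const char_type* p) { return std::strlen(p); }

    // With icase set, both cases of a letter map to the same identifier.
    char_type translate(char_type c, bool icase) const {
        if (!icase) return c;
        return static_cast<char_type>(std::tolower(static_cast<unsigned char>(c)));
    }

    // Precondition: c is a decimal digit.
    int toi(char_type c) const { return c - '0'; }

    // Consumes digits valid in the given radix (assumed <= 16) and leaves
    // first pointing at the first character that is not such a digit.
    int toi(const char_type*& first, const char_type* last, int radix) const {
        int result = 0;
        while (first != last) {
            const int digit = digit_value(*first);
            if (digit < 0 || digit >= radix) break;
            result = result * radix + digit;
            ++first;
        }
        return result;
    }

    uint32_t lookup_classname(const char_type* first, const char_type* last) const {
        const std::string name(first, last);
        if (name == "alpha") return char_class_alpha;
        if (name == "digit") return char_class_digit;
        if (name == "space") return char_class_space;
        return char_class_none;   // not a recognized class name
    }

private:
    // Hypothetical helper: numeric value of a digit character, or -1.
    static int digit_value(char_type c) {
        if (c >= '0' && c <= '9') return c - '0';
        if (c >= 'a' && c <= 'f') return c - 'a' + 10;
        if (c >= 'A' && c <= 'F') return c - 'A' + 10;
        return -1;
    }
};

inline void example_usage() {
    sketch_traits t;

    // Parsing the count out of a "{42}" repeat: first ends up on '}'.
    const char bound[] = "42}";
    const char* first = bound;
    const char* last = bound + sketch_traits::length(bound);
    assert(t.toi(first, last, 10) == 42);
    assert(*first == '}');

    assert(t.translate('A', true) == t.translate('a', true));
    assert(t.translate('A', false) != t.translate('a', false));

    const char name[] = "digit";
    assert(t.lookup_classname(name, name + 5) == sketch_traits::char_class_digit);
}
```

Since none of these members touch instance data, they could equally be declared static, as noted above the table.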
There is also an example of a custom traits class supplied by Christian Engström; see iso8859_1_regex_traits.cpp and iso8859_1_regex_traits.hpp. This example inherits from c_regex_traits and provides its own implementations of two locale-specific functions. This ensures that the class gives consistent behaviour (albeit tied to one locale) on all platforms. A fuller description by the author is available in the readme file.
Copyright Dr John Maddock 1998-2001. All rights reserved.