regex++ traits-class reference

Regex++, Traits Class Reference.

Dr John Maddock

Permission to use, copy, modify, distribute and sell this software and its documentation for any purpose is hereby granted without fee, provided that the above copyright notice appear in all copies and that both that copyright notice and this permission notice appear in supporting documentation. Dr John Maddock makes no representations about the suitability of this software for any purpose. It is provided "as is" without express or implied warranty.

This section describes the traits class requirements of the reg_expression template class. These requirements are somewhat complex (sorry) and subject to change as users ask for new features; however, I will try to keep them stable for a while, and ideally the requirements should lessen rather than increase.

The reg_expression traits classes encapsulate both the properties of a character type, and the properties of the locale associated with that type. The associated locale may be defined at run-time (via std::locale), or hard-coded into the traits class and determined at compile time.

The following example class illustrates the interface required by a "typical" traits class for use with class reg_expression:

class mytraits
{
public:
   typedef implementation_defined char_type;
   typedef implementation_defined uchar_type;
   typedef implementation_defined size_type;
   typedef implementation_defined string_type;
   typedef implementation_defined locale_type;
   typedef implementation_defined uint32_t;
   struct sentry
   {
      sentry(const mytraits&);
      operator void*() { return this; }
   };

   enum char_syntax_type
   {
      syntax_char = 0,
      syntax_open_bracket = 1,                  // (
      syntax_close_bracket = 2,                 // )
      syntax_dollar = 3,                        // $
      syntax_caret = 4,                         // ^
      syntax_dot = 5,                           // .
      syntax_star = 6,                          // *
      syntax_plus = 7,                          // +
      syntax_question = 8,                      // ?
      syntax_open_set = 9,                      // [
      syntax_close_set = 10,                    // ]
      syntax_or = 11,                           // |
      syntax_slash = 12,                        // \ (backslash)
      syntax_hash = 13,                         // #
      syntax_dash = 14,                         // -
      syntax_open_brace = 15,                   // {
      syntax_close_brace = 16,                  // }
      syntax_digit = 17,                        // 0-9
      syntax_b = 18,                            // for \b
      syntax_B = 19,                            // for \B
      syntax_left_word = 20,                    // for \<
      syntax_right_word = 21,                   // for \>
      syntax_w = 22,                            // for \w
      syntax_W = 23,                            // for \W
      syntax_start_buffer = 24,                 // for \`
      syntax_end_buffer = 25,                   // for \'
      syntax_newline = 26,                      // for newline alt
      syntax_comma = 27,                        // for {x,y}

      syntax_a = 28,                            // for \a
      syntax_f = 29,                            // for \f
      syntax_n = 30,                            // for \n
      syntax_r = 31,                            // for \r
      syntax_t = 32,                            // for \t
      syntax_v = 33,                            // for \v
      syntax_x = 34,                            // for \xdd
      syntax_c = 35,                            // for \cx
      syntax_colon = 36,                        // for [:...:]
      syntax_equal = 37,                        // for [=...=]
   
      // perl ops:
      syntax_e = 38,                            // for \e
      syntax_l = 39,                            // for \l
      syntax_L = 40,                            // for \L
      syntax_u = 41,                            // for \u
      syntax_U = 42,                            // for \U
      syntax_s = 43,                            // for \s
      syntax_S = 44,                            // for \S
      syntax_d = 45,                            // for \d
      syntax_D = 46,                            // for \D
      syntax_E = 47,                            // for \Q\E
      syntax_Q = 48,                            // for \Q\E
      syntax_X = 49,                            // for \X
      syntax_C = 50,                            // for \C
      syntax_Z = 51,                            // for \Z
      syntax_G = 52,                            // for \G
      syntax_bang = 53,                         // reserved for future use '!'
      syntax_and = 54,                          // reserved for future use '&'
   };

   enum{
      char_class_none = 0,
      char_class_alpha,
      char_class_cntrl,
      char_class_digit,
      char_class_lower,
      char_class_punct,
      char_class_space,
      char_class_upper,
      char_class_xdigit,
      char_class_blank,
      char_class_unicode,
      char_class_alnum,
      char_class_graph,
      char_class_print,
      char_class_word
   };

   static size_t length(const char_type* p);
   unsigned int syntax_type(size_type c)const;
   char_type translate(char_type c, bool icase)const;
   void transform(string_type& out, const string_type& in)const;
   void transform_primary(string_type& out, const string_type& in)const;
   bool is_separator(char_type c)const;
   bool is_combining(char_type)const;
   bool is_class(char_type c, uint32_t f)const;
   int toi(char_type c)const;
   int toi(const char_type*& first, const char_type* last, int radix)const;
   uint32_t lookup_classname(const char_type* first, const char_type* last)const;
   bool lookup_collatename(string_type& buf, const char_type* first, const char_type* last)const;
   locale_type imbue(locale_type l);
   locale_type getloc()const;
   std::string error_string(unsigned id)const;

   mytraits();
   ~mytraits();
};

The member types required by a traits class are defined as follows:

	Member name	Description
	char_type	The character type encapsulated by this traits class, must be a POD type, and be convertible to uchar_type.
	uchar_type	The unsigned type corresponding to char_type, must be convertible to size_type.
	size_type	An unsigned integral type, with at least as much precision as uchar_type.
	string_type	A type that offers the same facilities as std::basic_string<char_type>. This is used for collating elements and sort strings. If char_type has no locale-dependent collation (it is not a "character"), then it could be something simpler than std::basic_string.
	locale_type	A type that encapsulates the locale used by the traits class, probably std::locale but could be a platform specific type, or a dummy type if per-instance locales are not supported by the traits class.
	uint32_t	An unsigned integral type with at least 32-bits of precision, used as a bitmask type for character classification.
	sentry	A class or struct type which is constructible from an instance of the traits class, and is convertible to void*. An instance of type sentry will be constructed before compiling each regular expression, it provides an opportunity to carry out prefix/suffix operations on the traits class. For example a traits class that encapsulates the global locale, can use this as an opportunity to synchronize with the global locale (by updating any cached data).
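As a rough illustration of the sentry requirement, the following sketch shows a traits fragment that caches the name of the global locale and refreshes that cache whenever a sentry is constructed. The names sentry_sketch_traits, sync, sync_count, cached_name and begin_compile are hypothetical and are not part of regex++; only the shape of sentry (constructible from the traits class, convertible to void*) is taken from the table above.

```cpp
#include <cassert>
#include <locale>
#include <string>

// Hypothetical traits fragment; only the sentry-related parts are shown.
class sentry_sketch_traits
{
public:
   struct sentry
   {
      // Constructed before each expression is compiled: refresh the cache.
      sentry(const sentry_sketch_traits& t) { t.sync(); }
      // A non-null result signals that the traits object is usable.
      operator void*() { return this; }
   };

   sentry_sketch_traits() : sync_count_(0) {}

   // Re-read the global locale and update the cached name (assumed cache).
   void sync() const
   {
      cached_name_ = std::locale().name();
      ++sync_count_;
   }

   int sync_count() const { return sync_count_; }
   const std::string& cached_name() const { return cached_name_; }

private:
   mutable std::string cached_name_;
   mutable int sync_count_;
};

// Mimics what a regular expression compiler might do before parsing.
inline bool begin_compile(const sentry_sketch_traits& t)
{
   sentry_sketch_traits::sentry s(t);
   void* ok = s;
   return ok != 0;
}
```

Calling begin_compile on a traits object increments its sync counter once and leaves cached_name equal to the current global locale name.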

The following member constants are used to represent the locale independent syntax of a regular expression; the member function syntax_type returns one of these values, and is used to convert a locale dependent regular expression, into a locale-independent sequence of tokens.

	Member constant	English language representation
	syntax_char	All non-special characters.
	syntax_open_bracket	(
	syntax_close_bracket	)
	syntax_dollar	$
	syntax_caret	^
	syntax_dot	.
	syntax_star	*
	syntax_plus	+
	syntax_question	?
	syntax_open_set	[
	syntax_close_set	]
	syntax_or	|
	syntax_slash	\
	syntax_hash	#
	syntax_dash	-
	syntax_open_brace	{
	syntax_close_brace	}
	syntax_digit	0123456789
	syntax_b	b
	syntax_B	B
	syntax_left_word	<
	syntax_right_word	>
	syntax_w	w
	syntax_W	W
	syntax_start_buffer	`
	syntax_end_buffer	'
	syntax_newline	\n
	syntax_comma	,
	syntax_a	a
	syntax_f	f
	syntax_n	n
	syntax_r	r
	syntax_t	t
	syntax_v	v
	syntax_x	x
	syntax_c	c
	syntax_colon	:
	syntax_equal	=
	syntax_e	e
	syntax_l	l
	syntax_L	L
	syntax_u	u
	syntax_U	U
	syntax_s	s
	syntax_S	S
	syntax_d	d
	syntax_D	D
	syntax_E	E
	syntax_Q	Q
	syntax_X	X
	syntax_C	C
	syntax_Z	Z
	syntax_G	G
	syntax_bang	!
	syntax_and	&
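The mapping from characters to tokens can be sketched as a pair of small classes, one for English-language expressions and one for a French-language variant in which "\m" stands for "\w" (the case mentioned in the description of syntax_type). The class names and the sk_ prefixed constants are hypothetical; the numeric values are copied from a subset of the enumeration shown earlier, and a real traits class would cover every token.

```cpp
#include <cassert>

// Token values copied from the char_syntax_type enumeration (subset only).
enum sketch_syntax
{
   sk_syntax_char = 0,
   sk_syntax_open_bracket = 1,
   sk_syntax_star = 6,
   sk_syntax_slash = 12,
   sk_syntax_w = 22
};

// Hypothetical English-language mapping.
struct english_syntax_sketch
{
   unsigned int syntax_type(unsigned int c) const
   {
      switch (c)
      {
      case '(':  return sk_syntax_open_bracket;
      case '*':  return sk_syntax_star;
      case '\\': return sk_syntax_slash;
      case 'w':  return sk_syntax_w;
      default:   return sk_syntax_char;
      }
   }
};

// Hypothetical French-language mapping: only the word-class letter differs.
struct french_syntax_sketch
{
   unsigned int syntax_type(unsigned int c) const
   {
      if (c == 'm') return sk_syntax_w;     // \m plays the role of \w
      if (c == 'w') return sk_syntax_char;  // 'w' is an ordinary character
      return english_syntax_sketch().syntax_type(c);
   }
};
```

Both classes produce the same token for locale-neutral characters such as '*', so the parser downstream of syntax_type never needs to know which language the expression was written in.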

The following member constants are used to represent particular character classifications:

	Member constant	Description
	char_class_none	No classification, must be zero.
	char_class_alpha	All alphabetic characters.
	char_class_cntrl	All control characters.
	char_class_digit	All decimal digits.
	char_class_lower	All lower case characters.
	char_class_punct	All punctuation characters.
	char_class_space	All white-space characters.
	char_class_upper	All upper case characters.
	char_class_xdigit	All hexadecimal digit characters.
	char_class_blank	All blank characters (space + tab).
	char_class_unicode	All extended Unicode characters - those that cannot be represented as a single narrow character.
	char_class_alnum	All alpha-numeric characters.
	char_class_graph	All graphic characters.
	char_class_print	All printable characters.
	char_class_word	All word characters (alphanumeric characters + the underscore).
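Because classifications are combined into a bitmask, a membership test reduces to checking each bit that is set. The sketch below assumes one bit per basic class and builds the compound classes by or-ing bits together; the bit assignments, the sk_ prefixed names, the extra underscore bit and the function sketch_is_class are all assumptions made for illustration, and the classification itself is delegated to the C library's <cctype> functions.

```cpp
#include <cassert>
#include <cctype>

typedef unsigned long sketch_mask;  // stands in for the traits' uint32_t

// Assumed bit assignments, chosen for this sketch only.
const sketch_mask sk_class_none       = 0;
const sketch_mask sk_class_alpha      = 1ul << 0;
const sketch_mask sk_class_digit      = 1ul << 1;
const sketch_mask sk_class_space      = 1ul << 2;
const sketch_mask sk_class_underscore = 1ul << 3;  // hypothetical helper bit
const sketch_mask sk_class_alnum      = sk_class_alpha | sk_class_digit;
const sketch_mask sk_class_word       = sk_class_alnum | sk_class_underscore;

// True if c belongs to at least one class whose bit is set in f.
inline bool sketch_is_class(char c, sketch_mask f)
{
   const unsigned char uc = static_cast<unsigned char>(c);
   if ((f & sk_class_alpha) && std::isalpha(uc)) return true;
   if ((f & sk_class_digit) && std::isdigit(uc)) return true;
   if ((f & sk_class_space) && std::isspace(uc)) return true;
   if ((f & sk_class_underscore) && c == '_') return true;
   return false;
}
```

With this layout a set such as "[[:alpha:][:digit:]]" can be tested with a single call by passing sk_class_alpha | sk_class_digit.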

The following member functions are required by all regular expression traits classes. Members that are declared here as const could be declared static instead if the class does not contain instance data:

	Member function	Description
	static size_t length(const char_type* p);	Returns the length of the null-terminated string p.
	unsigned int syntax_type(size_type c)const;	Converts an input character into a locale independent token (one of the syntax_xxx member constants). Called when parsing the regular expression into a locale-independent parse tree. Example: in English language regular expressions we would use "[[:word:]]" to represent the character class of all word characters, and "\w" as a shortcut for this. Consequently syntax_type('w') returns syntax_w. In French language regular expressions, we would use "[[:mot:]]" in place of "[[:word:]]" and therefore "\m" in place of "\w", therefore it is syntax_type('m') that returns syntax_w.
	char_type translate(char_type c, bool icase)const;	Translates an input character into a unique identifier that represents the equivalence class that that character belongs to. If icase is true, then the returned value is insensitive to case. [An equivalence class is the set of all characters that must be treated as being equivalent to each other.]
	void transform(string_type& out, const string_type& in)const;	Transforms the string in, into a locale-dependent sort key, and stores the result in out.
	void transform_primary(string_type& out, const string_type& in)const;	Transforms the string in, into a locale-dependent primary sort key, and stores the result in out.
	bool is_separator(char_type c)const;	Returns true only if c is a line separator.
	bool is_combining(char_type c)const;	Returns true only if c is a Unicode combining character.
	bool is_class(char_type c, uint32_t f)const;	Returns true only if c is a member of one of the character classes represented by the bitmap f.
	int toi(char_type c)const;	Converts the character c to a decimal integer. [Precondition: is_class(c,char_class_digit)==true]
	int toi(const char_type*& first, const char_type* last, int radix)const;	Converts the string [first-last) into an integral value using base radix. Stops when it finds the first non-digit character, and sets first to point to that character. [Precondition: is_class(*first,char_class_digit)==true]
	uint32_t lookup_classname(const char_type* first, const char_type* last)const;	Returns the bitmap representing the character class [first-last), or char_class_none if [first-last) is not recognized as a character class name.
	bool lookup_collatename(string_type& buf, const char_type* first, const char_type* last)const;	If the sequence [first-last) is the name of a known collating element, then stores the collating element in buf, and returns true, otherwise returns false.
	locale_type imbue(locale_type l);	Imbues the class with the locale l.
	locale_type getloc()const;	Returns the traits-class locale.
	std::string error_string(unsigned id)const;	Returns the locale-dependent error-string associated with the error-number id. The parameter id is one of the REG_XXX error codes described by the POSIX standard, and defined in <boost/cregex.hpp>.
	mytraits();	Constructor.
	~mytraits();	Destructor.
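To make the contracts of translate and the two toi overloads more concrete, here is a minimal sketch for plain char. It is not the implementation used by any of the supplied traits classes: narrow_traits_sketch and its private helper digit_value are hypothetical, case folding is delegated to std::tolower, and the digit conversion assumes an ASCII-like character set in which the letters a to f are contiguous.

```cpp
#include <cassert>
#include <cctype>

// Hypothetical traits fragment for char; only three members are shown.
struct narrow_traits_sketch
{
   // With icase set, both cases of a letter map to the same identifier.
   char translate(char c, bool icase) const
   {
      return icase
         ? static_cast<char>(std::tolower(static_cast<unsigned char>(c)))
         : c;
   }

   // Single decimal digit; precondition: c is in '0'..'9'.
   int toi(char c) const { return c - '0'; }

   // Consumes digits valid in the given radix and leaves first pointing
   // at the first character that was not consumed.
   int toi(const char*& first, const char* last, int radix) const
   {
      int result = 0;
      while (first != last)
      {
         const int d = digit_value(*first);
         if (d < 0 || d >= radix) break;
         result = result * radix + d;
         ++first;
      }
      return result;
   }

private:
   // Hypothetical helper: numeric value of a digit, or -1 if c is not one.
   static int digit_value(char c)
   {
      if (c >= '0' && c <= '9') return c - '0';
      if (c >= 'a' && c <= 'f') return c - 'a' + 10;
      if (c >= 'A' && c <= 'F') return c - 'A' + 10;
      return -1;
   }
};
```

Note that the range overload takes first by reference, which is what allows the caller to resume parsing at the character following the number.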

There is also an example of a custom traits class supplied by Christian Engström; see iso8859_1_regex_traits.cpp and iso8859_1_regex_traits.hpp. This example inherits from c_regex_traits and provides its own implementations of two locale-specific functions. This ensures that the class gives consistent behaviour (albeit tied to one locale) on all platforms. A fuller description by the author is available in the readme file.