C++ Boost

Regex++, Traits Class Reference.

Copyright (c) 1998-2001

Dr John Maddock

Permission to use, copy, modify, distribute and sell this software and its documentation for any purpose is hereby granted without fee, provided that the above copyright notice appear in all copies and that both that copyright notice and this permission notice appear in supporting documentation. Dr John Maddock makes no representations about the suitability of this software for any purpose. It is provided "as is" without express or implied warranty.


This section describes the traits class requirements of the reg_expression template class. These requirements are somewhat complex (sorry) and subject to change as users ask for new features; however, I will try to keep them stable for a while, and ideally the requirements should lessen rather than increase.

The reg_expression traits classes encapsulate both the properties of a character type and the properties of the locale associated with that type. The associated locale may be defined at run time (via std::locale), or hard-coded into the traits class and determined at compile time.
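As a rough illustration of these two options, the sketch below shows only the locale-related members (locale_type, imbue and getloc) of two traits classes. The class names are hypothetical, and having imbue return the previously installed locale is an assumption borrowed from std::ios_base::imbue rather than something this reference states.

```cpp
#include <cassert>
#include <locale>

// Option 1: the locale is chosen at run time and stored per instance.
class runtime_locale_traits   // hypothetical name
{
public:
   typedef std::locale locale_type;

   // Installs l and hands back the locale that was in effect before.
   // (Returning the previous locale is an assumption, see the text above.)
   locale_type imbue(locale_type l)
   {
      locale_type previous = m_loc;
      m_loc = l;
      return previous;
   }
   locale_type getloc() const { return m_loc; }

private:
   std::locale m_loc;   // default-constructed: a copy of the global locale
};

// Option 2: the locale is fixed at compile time, so locale_type is a dummy.
class fixed_locale_traits     // hypothetical name
{
public:
   struct locale_type {};
   locale_type imbue(locale_type l) { return l; }   // nothing to change
   locale_type getloc() const { return locale_type(); }
};

// Usage sketch.
inline void demo_locales()
{
   runtime_locale_traits t;
   t.imbue(std::locale::classic());
   assert(t.getloc() == std::locale::classic());

   fixed_locale_traits f;
   f.imbue(f.getloc());   // accepted, but has no effect
}
```

Since none of the members of fixed_locale_traits touch instance data, they could equally be declared static.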

The following example class illustrates the interface required by a "typical" traits class for use with class reg_expression:

class mytraits
{
public:
   typedef implementation_defined char_type;
   typedef implementation_defined uchar_type;
   typedef implementation_defined size_type;
   typedef implementation_defined string_type;
   typedef implementation_defined locale_type;
   typedef implementation_defined uint32_t;
   struct sentry
   {
      sentry(const mytraits&);
      operator void*() { return this; }
   };

   enum char_syntax_type
   {
      syntax_char = 0,
      syntax_open_bracket = 1,                  // (
      syntax_close_bracket = 2,                 // )
      syntax_dollar = 3,                        // $
      syntax_caret = 4,                         // ^
      syntax_dot = 5,                           // .
      syntax_star = 6,                          // *
      syntax_plus = 7,                          // +
      syntax_question = 8,                      // ?
      syntax_open_set = 9,                      // [
      syntax_close_set = 10,                    // ]
      syntax_or = 11,                           // |
      syntax_slash = 12,                        // \ (backslash)
      syntax_hash = 13,                         // #
      syntax_dash = 14,                         // -
      syntax_open_brace = 15,                   // {
      syntax_close_brace = 16,                  // }
      syntax_digit = 17,                        // 0-9
      syntax_b = 18,                            // for \b
      syntax_B = 19,                            // for \B
      syntax_left_word = 20,                    // for \<
      syntax_right_word = 21,                   // for \>
      syntax_w = 22,                            // for \w
      syntax_W = 23,                            // for \W
      syntax_start_buffer = 24,                 // for \`
      syntax_end_buffer = 25,                   // for \'
      syntax_newline = 26,                      // for newline alt
      syntax_comma = 27,                        // for {x,y}

      syntax_a = 28,                            // for \a
      syntax_f = 29,                            // for \f
      syntax_n = 30,                            // for \n
      syntax_r = 31,                            // for \r
      syntax_t = 32,                            // for \t
      syntax_v = 33,                            // for \v
      syntax_x = 34,                            // for \xdd
      syntax_c = 35,                            // for \cx
      syntax_colon = 36,                        // for [:...:]
      syntax_equal = 37,                        // for [=...=]
   
      // perl ops:
      syntax_e = 38,                            // for \e
      syntax_l = 39,                            // for \l
      syntax_L = 40,                            // for \L
      syntax_u = 41,                            // for \u
      syntax_U = 42,                            // for \U
      syntax_s = 43,                            // for \s
      syntax_S = 44,                            // for \S
      syntax_d = 45,                            // for \d
      syntax_D = 46,                            // for \D
      syntax_E = 47,                            // for \Q\E
      syntax_Q = 48,                            // for \Q\E
      syntax_X = 49,                            // for \X
      syntax_C = 50,                            // for \C
      syntax_Z = 51,                            // for \Z
      syntax_G = 52,                            // for \G
      syntax_bang = 53,                         // reserved for future use '!'
      syntax_and = 54,                          // reserved for future use '&'
   };

   enum{
      char_class_none = 0,
      char_class_alpha,
      char_class_cntrl,
      char_class_digit,
      char_class_lower,
      char_class_punct,
      char_class_space,
      char_class_upper,
      char_class_xdigit,
      char_class_blank,
      char_class_unicode,
      char_class_alnum,
      char_class_graph,
      char_class_print,
      char_class_word
   };

   static size_t length(const char_type* p);
   unsigned int syntax_type(size_type c)const;
   char_type translate(char_type c, bool icase)const;
   void transform(string_type& out, const string_type& in)const;
   void transform_primary(string_type& out, const string_type& in)const;
   bool is_separator(char_type c)const;
   bool is_combining(char_type c)const;
   bool is_class(char_type c, uint32_t f)const;
   int toi(char_type c)const;
   int toi(const char_type*& first, const char_type* last, int radix)const;
   uint32_t lookup_classname(const char_type* first, const char_type* last)const;
   bool lookup_collatename(string_type& buf, const char_type* first, const char_type* last)const;
   locale_type imbue(locale_type l);
   locale_type getloc()const;
   std::string error_string(unsigned id)const;

   mytraits();
   ~mytraits();
};

The member types required by a traits class are defined as follows:
  

  Member name   Description
  char_type     The character type encapsulated by this traits class; must be a POD type and be convertible to uchar_type.
  uchar_type    The unsigned type corresponding to char_type; must be convertible to size_type.
  size_type     An unsigned integral type with at least as much precision as uchar_type.
  string_type   A type that offers the same facilities as std::basic_string<char_type>. It is used for collating elements and sort strings. If char_type has no locale-dependent collation (it is not a "character"), then it could be something simpler than std::basic_string.
  locale_type   A type that encapsulates the locale used by the traits class; probably std::locale, but it could be a platform-specific type, or a dummy type if per-instance locales are not supported by the traits class.
  uint32_t      An unsigned integral type with at least 32 bits of precision, used as a bitmask type for character classification.
  sentry        A class or struct type that is constructible from an instance of the traits class and is convertible to void*. An instance of type sentry is constructed before each regular expression is compiled, which gives the traits class an opportunity to carry out prefix/suffix operations. For example, a traits class that encapsulates the global locale can use this as an opportunity to synchronize with the global locale (by updating any cached data).
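The sentry idea can be sketched as follows. This is a minimal illustration, not the library's implementation: the class name cached_locale_traits, the cached locale name, the sync counter and the compile_with driver function are all hypothetical, and the only "cached data" here is the name of the global locale.

```cpp
#include <cassert>
#include <locale>
#include <string>

// Hypothetical traits fragment: only the sentry-related parts are shown.
class cached_locale_traits
{
public:
   cached_locale_traits() : m_syncs(0) {}

   struct sentry
   {
      // Constructed before each expression is compiled: refresh the cache.
      sentry(const cached_locale_traits& t) { t.sync(); }
      operator void*() { return this; }
   };

   // Name of the global locale as seen at the last synchronisation.
   const std::string& cached_name() const { return m_name; }
   // Number of synchronisations so far (for demonstration only).
   unsigned sync_count() const { return m_syncs; }

private:
   void sync() const
   {
      m_name = std::locale().name();   // std::locale() copies the global locale
      ++m_syncs;
   }
   mutable std::string m_name;
   mutable unsigned m_syncs;
};

// How a regular expression compiler might use the sentry (hypothetical driver).
inline bool compile_with(const cached_locale_traits& t)
{
   cached_locale_traits::sentry guard(t);
   if (!guard)
      return false;   // the conversion to void* reports whether to proceed
   // ... parse the expression using t here ...
   return true;
}

// Usage sketch.
inline void demo_sentry()
{
   cached_locale_traits t;
   assert(compile_with(t));
   assert(t.cached_name() == std::locale().name());
}
```

Here the cache is refreshed once per compiled expression rather than once per character lookup.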

 


The following member constants represent the locale-independent syntax of a regular expression. The member function syntax_type returns one of these values, and is used to convert a locale-dependent regular expression into a locale-independent sequence of tokens.
 

  Member constant        English language representation
  syntax_char            All non-special characters.
  syntax_open_bracket    (
  syntax_close_bracket   )
  syntax_dollar          $
  syntax_caret           ^
  syntax_dot             .
  syntax_star            *
  syntax_plus            +
  syntax_question        ?
  syntax_open_set        [
  syntax_close_set       ]
  syntax_or              |
  syntax_slash           \
  syntax_hash            #
  syntax_dash            -
  syntax_open_brace      {
  syntax_close_brace     }
  syntax_digit           0123456789
  syntax_b               b
  syntax_B               B
  syntax_left_word       <
  syntax_right_word      >
  syntax_w               w
  syntax_W               W
  syntax_start_buffer    `
  syntax_end_buffer      '
  syntax_newline         \n
  syntax_comma           ,
  syntax_a               a
  syntax_f               f
  syntax_n               n
  syntax_r               r
  syntax_t               t
  syntax_v               v
  syntax_x               x
  syntax_c               c
  syntax_colon           :
  syntax_equal           =
  syntax_e               e
  syntax_l               l
  syntax_L               L
  syntax_u               u
  syntax_U               U
  syntax_s               s
  syntax_S               S
  syntax_d               d
  syntax_D               D
  syntax_E               E
  syntax_Q               Q
  syntax_X               X
  syntax_C               C
  syntax_Z               Z
  syntax_G               G
  syntax_bang            !
  syntax_and             &
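One way to picture how syntax_type maps locale-dependent characters onto these tokens is a simple lookup table. The sketch below is an assumption-laden illustration: syntax_table, english_syntax and french_syntax are hypothetical names, only a handful of the constants are included, and the French mapping follows the "\m" for "\w" example used elsewhere in this reference.

```cpp
#include <cassert>
#include <map>

// Subset of the syntax constants, with the values given in this reference.
enum char_syntax_type
{
   syntax_char = 0,
   syntax_open_bracket = 1,
   syntax_close_bracket = 2,
   syntax_star = 6,
   syntax_w = 22
};

// Hypothetical table-driven implementation of syntax_type.
class syntax_table
{
public:
   typedef unsigned long size_type;

   void assign(size_type c, char_syntax_type t) { m_map[c] = t; }

   // Any character that has no entry is an ordinary character.
   unsigned int syntax_type(size_type c) const
   {
      std::map<size_type, char_syntax_type>::const_iterator i = m_map.find(c);
      return i == m_map.end() ? syntax_char : i->second;
   }

private:
   std::map<size_type, char_syntax_type> m_map;
};

inline syntax_table english_syntax()
{
   syntax_table t;
   t.assign('(', syntax_open_bracket);
   t.assign(')', syntax_close_bracket);
   t.assign('*', syntax_star);
   t.assign('w', syntax_w);
   return t;
}

inline syntax_table french_syntax()
{
   syntax_table t = english_syntax();
   t.assign('w', syntax_char);   // 'w' is an ordinary character here
   t.assign('m', syntax_w);      // "\m" plays the role of "\w"
   return t;
}

// Usage sketch.
inline void demo_syntax()
{
   assert(english_syntax().syntax_type('w') == syntax_w);
   assert(french_syntax().syntax_type('m') == syntax_w);
}
```

Because the parser only ever sees the returned tokens, the rest of the expression compiler can stay the same whichever table is in use.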

The following member constants are used to represent particular character classifications:
 

  Member constant     Description
  char_class_none     No classification, must be zero.
  char_class_alpha    All alphabetic characters.
  char_class_cntrl    All control characters.
  char_class_digit    All decimal digits.
  char_class_lower    All lower case characters.
  char_class_punct    All punctuation characters.
  char_class_space    All white-space characters.
  char_class_upper    All upper case characters.
  char_class_xdigit   All hexadecimal digit characters.
  char_class_blank    All blank characters (space + tab).
  char_class_unicode  All extended Unicode characters: those that cannot be represented as a single narrow character.
  char_class_alnum    All alpha-numeric characters.
  char_class_graph    All graphic characters.
  char_class_print    All printable characters.
  char_class_word     All word characters (alphanumeric characters + the underscore).
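Since these constants are combined into a uint32_t bitmask, a traits class could give each basic class its own bit and build the composite classes by or-ing bits together. The following is only a sketch under that assumption: the bit values, the mask_type typedef, the classify helper and the char_class_underscore bit are all hypothetical, only a few classes are covered, and classification uses the C library functions for narrow characters.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <cstdint>
#include <string>

typedef std::uint32_t mask_type;   // plays the role of the traits member uint32_t

// Assumed bit assignments, for illustration only.
const mask_type char_class_none       = 0;
const mask_type char_class_alpha      = 1u << 0;
const mask_type char_class_digit      = 1u << 1;
const mask_type char_class_space      = 1u << 2;
const mask_type char_class_underscore = 1u << 3;   // hypothetical helper bit
const mask_type char_class_alnum      = char_class_alpha | char_class_digit;
const mask_type char_class_word       = char_class_alnum | char_class_underscore;

// Hypothetical helper: compute the basic class bits of one character.
inline mask_type classify(char c)
{
   const unsigned char u = static_cast<unsigned char>(c);
   mask_type m = char_class_none;
   if (std::isalpha(u)) m |= char_class_alpha;
   if (std::isdigit(u)) m |= char_class_digit;
   if (std::isspace(u)) m |= char_class_space;
   if (c == '_')        m |= char_class_underscore;
   return m;
}

// True if c belongs to at least one of the classes whose bits are set in f.
inline bool is_class(char c, mask_type f)
{
   return (classify(c) & f) != 0;
}

// Maps a class name in [first, last) to its mask, or char_class_none.
inline mask_type lookup_classname(const char* first, const char* last)
{
   static const struct entry { const char* name; mask_type mask; } names[] = {
      { "alpha", char_class_alpha },
      { "digit", char_class_digit },
      { "space", char_class_space },
      { "alnum", char_class_alnum },
      { "word",  char_class_word  }
   };
   const std::string wanted(first, last);
   for (std::size_t i = 0; i < sizeof(names) / sizeof(names[0]); ++i)
      if (wanted == names[i].name)
         return names[i].mask;
   return char_class_none;
}

// Usage sketch: resolve "[[:word:]]" once, then test characters against it.
inline void demo_classes()
{
   const char name[] = "word";
   const mask_type m = lookup_classname(name, name + 4);
   assert(is_class('x', m));
   assert(is_class('_', m));
   assert(!is_class(' ', m));
}
```

With this layout a single bitwise test answers "is c in any of these classes", which is what lets a character set store one combined mask instead of a list of class names.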

The following member functions are required by all regular expression traits classes. Members that are declared here as const could be declared static instead if the class does not contain instance data:
 

  Member function
      Description

  static size_t length(const char_type* p);
      Returns the length of the null-terminated string p.

  unsigned int syntax_type(size_type c)const;
      Converts an input character into a locale-independent token (one of the syntax_xxx member constants). Called when parsing the regular expression into a locale-independent parse tree.
      Example: in English-language regular expressions we would use "[[:word:]]" to represent the character class of all word characters, and "\w" as a shortcut for this. Consequently syntax_type('w') returns syntax_w. In French-language regular expressions we would use "[[:mot:]]" in place of "[[:word:]]", and so "\m" in place of "\w"; it is therefore syntax_type('m') that returns syntax_w.

  char_type translate(char_type c, bool icase)const;
      Translates an input character into a unique identifier that represents the equivalence class that the character belongs to. If icase is true, then the returned value is insensitive to case.
      [An equivalence class is the set of all characters that must be treated as being equivalent to each other.]

  void transform(string_type& out, const string_type& in)const;
      Transforms the string in into a locale-dependent sort key, and stores the result in out.

  void transform_primary(string_type& out, const string_type& in)const;
      Transforms the string in into a locale-dependent primary sort key, and stores the result in out.

  bool is_separator(char_type c)const;
      Returns true only if c is a line separator.

  bool is_combining(char_type c)const;
      Returns true only if c is a Unicode combining character.

  bool is_class(char_type c, uint32_t f)const;
      Returns true only if c is a member of one of the character classes represented by the bitmask f.

  int toi(char_type c)const;
      Converts the character c to its decimal integer value.
      [Precondition: is_class(c,char_class_digit)==true]

  int toi(const char_type*& first, const char_type* last, int radix)const;
      Converts the string [first, last) into an integral value using base radix. Stops when it finds the first non-digit character, and sets first to point to that character.
      [Precondition: is_class(*first,char_class_digit)==true]

  uint32_t lookup_classname(const char_type* first, const char_type* last)const;
      Returns the bitmask representing the character class named by [first, last), or char_class_none if [first, last) is not recognized as a character class name.

  bool lookup_collatename(string_type& buf, const char_type* first, const char_type* last)const;
      If the sequence [first, last) is the name of a known collating element, stores the collating element in buf and returns true; otherwise returns false.

  locale_type imbue(locale_type l);
      Imbues the class with the locale l.

  locale_type getloc()const;
      Returns the traits-class locale.

  std::string error_string(unsigned id)const;
      Returns the locale-dependent error string associated with the error number id. The parameter id is one of the REG_XXX error codes described by the POSIX standard, and defined in <boost/cregex.hpp>.

  mytraits();
      Constructor.

  ~mytraits();
      Destructor.
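To make the behaviour of translate and the two toi overloads more concrete, here is a minimal sketch for plain char. It is not the library's implementation: the functions are written as free functions for brevity, digit_value is a hypothetical helper, case folding via std::tolower is just one possible choice of equivalence class, and ASCII-style contiguous letters are assumed for hexadecimal digits.

```cpp
#include <cassert>
#include <cctype>

// Maps a character to its equivalence-class representative. When icase is
// true, upper and lower case forms collapse onto the lower case form.
inline char translate(char c, bool icase)
{
   if (!icase)
      return c;
   return static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
}

// Hypothetical helper: numeric value of a digit character, or -1 if c is
// not a digit in any base up to 16.
inline int digit_value(char c)
{
   if (c >= '0' && c <= '9') return c - '0';
   if (c >= 'a' && c <= 'f') return c - 'a' + 10;
   if (c >= 'A' && c <= 'F') return c - 'A' + 10;
   return -1;
}

// Single-character form. Precondition: c is a decimal digit.
inline int toi(char c)
{
   return digit_value(c);
}

// Range form: consumes digits valid in base radix, leaves first pointing at
// the first character that was not consumed, and returns the accumulated value.
inline int toi(const char*& first, const char* last, int radix)
{
   int result = 0;
   while (first != last)
   {
      const int d = digit_value(*first);
      if (d < 0 || d >= radix)
         break;
      result = result * radix + d;
      ++first;
   }
   return result;
}

// Usage sketch: reading the lower bound of a repeat such as "{12,5}".
inline void demo_toi()
{
   const char bounds[] = "12,5}";
   const char* p = bounds;
   const int lower = toi(p, bounds + 5, 10);
   assert(lower == 12);
   assert(*p == ',');   // p now points at the separator
}
```

Passing first by reference is what allows the parser to continue from the character that ended the number without rescanning it.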

There is also an example of a custom traits class supplied by Christian Engström; see iso8859_1_regex_traits.cpp and iso8859_1_regex_traits.hpp. This example inherits from c_regex_traits and provides its own implementations of two locale-specific functions. This ensures that the class gives consistent behaviour (albeit tied to one locale) on all platforms. A fuller description by the author is available in the readme file.


Copyright Dr John Maddock 1998-2001 all rights reserved.