The String class is designed to extend GNU C++ to support string processing capabilities similar to those in languages like Awk. The class provides facilities that ought to be convenient and efficient enough to be useful replacements for char*-based processing via the C string library (e.g., strcpy, strcmp, etc.) in many applications. Many details about String representations are described in the Representation section.
A separate SubString class supports substring extraction and modification operations. This is implemented in a way that user programs never directly construct or represent substrings, which are only used indirectly via String operations.
Another separate class, Regex, is also used indirectly via String operations in support of regular expression searching, matching, and the like. The Regex class is based entirely on the GNU Emacs regex functions. See Syntax of Regular Expressions, for a full explanation of regular expression syntax. (For implementation details, see the internal documentation in files `regex.h' and `regex.c'.)
Strings are initialized and assigned as in the following examples:
String x; String y = 0; String z = "";
String x = "Hello"; String y("Hello");
String x = 'A'; String y('A');
String u = x; String v(x);
String u = x.at(1,4); String v(x.at(1,4));
String x("abc", 2);
String x = dec(20);
As in the last example, a String may be initialized or assigned the result of any char* function.
There are no directly accessible forms for declaring SubString variables.
The declaration
Regex r("[a-zA-Z_][a-zA-Z0-9_]*");
creates a compiled regular expression suitable for use in the String operations described below (in this case, one that matches any C++ identifier). The first argument may also be a String. Be careful to distinguish the role of backslashes in quoted GNU C++ char* constants from their role in Regexes. For example, a Regex that matches either one or more tabs or all strings beginning with "ba" and ending with any number of occurrences of "na" could be declared as
Regex r = "\\(\t+\\)\\|\\(ba\\(na\\)*\\)";
Note that only one backslash is needed to signify the tab, but two are needed for the parentheses and the vertical bar, since the GNU C++ lexical analyzer decodes and strips backslashes before they are seen by Regex.
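The effect of the lexical analyzer on that pattern can be checked without any regular expression library. The following is a minimal sketch in standard C++ (std::string, not the String or Regex classes); the helper names pattern_after_lexing and count_char are hypothetical and exist only for this illustration.

```cpp
#include <cstddef>
#include <string>

// The pattern from the text, written as an ordinary C++ string literal.
// The compiler turns each "\\" into a single backslash and "\t" into a
// tab character before any regex code ever sees the characters.
inline std::string pattern_after_lexing() {
    return "\\(\t+\\)\\|\\(ba\\(na\\)*\\)";
}

// Hypothetical helper: count how many times character c occurs in s.
inline std::size_t count_char(const std::string& s, char c) {
    std::size_t n = 0;
    for (char ch : s) {
        if (ch == c) ++n;
    }
    return n;
}
```

The resulting string has 21 characters, of which 7 are backslashes and 1 is a literal tab, which is what the regular expression compiler would receive.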
There are three additional optional arguments to the Regex constructor that are less commonly useful:
fast (default 0)
fast may be set to true (1) if the Regex should be "fast-compiled". This causes an additional compilation step that is generally worthwhile if the Regex will be used many times.
bufsize (default max(40, length of the string))
transtable (default none == 0)
As a convenience, several Regexes are predefined and usable in any program. Here are their declarations from `String.h'.
extern Regex RXwhite;      // = "[ \n\t]+"
extern Regex RXint;        // = "-?[0-9]+"
extern Regex RXdouble;     // = "-?\\(\\([0-9]+\\.[0-9]*\\)\\|
                           //    \\([0-9]+\\)\\|
                           //    \\(\\.[0-9]+\\)\\)
                           //    \\([eE][---+]?[0-9]+\\)?"
extern Regex RXalpha;      // = "[A-Za-z]+"
extern Regex RXlowercase;  // = "[a-z]+"
extern Regex RXuppercase;  // = "[A-Z]+"
extern Regex RXalphanum;   // = "[0-9A-Za-z]+"
extern Regex RXidentifier; // = "[A-Za-z_][A-Za-z0-9_]*"
Most String class capabilities are best shown via example. The examples below use the following declarations.
String x = "Hello";
String y = "world";
String n = "123";
String z;
char* s = ",";
String lft, mid, rgt;
Regex r = "e[a-z]*o";
Regex r2("/[a-z]*/");
char c;
int i, pos, len;
double f;
String words[10];
words[0] = "a";
words[1] = "b";
words[2] = "c";
The usual lexicographic relational operators (==, !=, <, <=, >, >=) are defined. A functional form compare(String, String) is also provided, as is fcompare(String, String), which compares Strings without regard for upper vs. lower case.
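To make the idea of a case-insensitive comparison concrete, here is a minimal sketch written against std::string rather than the String class. The name fcompare_sketch and its strcmp-style return convention (negative, zero, positive) are assumptions for illustration, not the library's actual implementation.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical stand-in for a case-insensitive comparison: compares two
// strings character by character after folding each to lower case.
// Returns a negative value, 0, or a positive value (assumed convention).
inline int fcompare_sketch(const std::string& a, const std::string& b) {
    const std::size_t n = a.size() < b.size() ? a.size() : b.size();
    for (std::size_t i = 0; i < n; ++i) {
        const int ca = std::tolower(static_cast<unsigned char>(a[i]));
        const int cb = std::tolower(static_cast<unsigned char>(b[i]));
        if (ca != cb) return ca - cb;
    }
    // All shared characters are equal: the shorter string sorts first.
    if (a.size() == b.size()) return 0;
    return a.size() < b.size() ? -1 : 1;
}
```

With this sketch, "Hello" and "hELLO" compare equal, while an ordinary == on the same two strings reports them as different.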
All other matching and searching operations are based on some form of the (non-public) match and search functions. match and search differ in that match attempts to match only at the given starting position, while search starts at the position, and then proceeds left or right looking for a match. As seen in the following examples, the second optional startpos argument to functions using match and search specifies the starting position of the search: if non-negative, it results in a left-to-right search starting at position startpos, and if negative, a right-to-left search starting at position x.length() + startpos. In all cases, the index returned is that of the beginning of the match, or -1 if there is no match.
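The startpos convention can be sketched with std::string::find and std::string::rfind. This is an approximation built from the description above, not the library's code: search_sketch and match_sketch are hypothetical names, patterns are plain strings rather than Regexes, and match_sketch handles only non-negative positions. For multi-character patterns the real library may anchor a right-to-left search differently.

```cpp
#include <string>

// Hypothetical stand-in for search: a non-negative startpos searches
// left to right from startpos; a negative startpos searches right to
// left from text.size() + startpos. Returns the index of the start of
// the match, or -1 if there is none.
inline int search_sketch(const std::string& text, const std::string& pat,
                         int startpos = 0) {
    const int len = static_cast<int>(text.size());
    std::string::size_type hit = std::string::npos;
    if (startpos >= 0) {
        if (startpos > len) return -1;
        hit = text.find(pat, static_cast<std::string::size_type>(startpos));
    } else {
        const int from = len + startpos;
        if (from < 0) return -1;
        hit = text.rfind(pat, static_cast<std::string::size_type>(from));
    }
    return hit == std::string::npos ? -1 : static_cast<int>(hit);
}

// Hypothetical stand-in for match: succeeds only if pat occurs exactly
// at startpos. Returns the length of the match, or -1 on failure.
inline int match_sketch(const std::string& text, const std::string& pat,
                        int startpos = 0) {
    if (startpos < 0 || startpos > static_cast<int>(text.size())) return -1;
    const auto pos = static_cast<std::string::size_type>(startpos);
    return text.compare(pos, pat.size(), pat) == 0
               ? static_cast<int>(pat.size())
               : -1;
}
```

For the text "Hello", search_sketch(text, "l", -1) yields 3 and search_sketch(text, "l", -3) yields 2, while match_sketch(text, "lo", 2) fails because "lo" does not begin at position 2.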
Three String functions serve as front ends to search and match. index performs a search, returning the index, matches performs a match, returning nonzero (actually, the length of the match) on success, and contains is a boolean function performing either a search or match, depending on whether an index argument is provided:
x.index("lo")
x.index("l", 2)
x.index("l", -1)
x.index("l", -3)
pos = r.search("leo", 3, len, 0)
returns the index of r in the char* string of length 3, starting at position 0, also placing the length of the match in reference parameter len.
x.contains("He")
x.contains("el", 1)
The second argument to contains, if present, means to match the substring only at that position, and not to search elsewhere in the string.
x.contains(RXwhite);
RXwhite is a global whitespace Regex.
x.matches("lo", 3)
x.matches(r)
i = x.freq("l")
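The two behaviors of contains (search when no position is given, match when one is) and a frequency count can be sketched with std::string as follows. The function names are hypothetical, patterns are plain strings rather than Regexes, and counting non-overlapping occurrences in freq_sketch is an assumption of this sketch.

```cpp
#include <string>

// No position given: search anywhere in the text.
inline bool contains_sketch(const std::string& text, const std::string& pat) {
    return text.find(pat) != std::string::npos;
}

// Position given: match only at that position, never elsewhere.
inline bool contains_at_sketch(const std::string& text, const std::string& pat,
                               std::string::size_type pos) {
    return pos <= text.size() && text.compare(pos, pat.size(), pat) == 0;
}

// Count occurrences of pat, skipping past each match so that counted
// occurrences do not overlap (an assumption of this sketch).
inline int freq_sketch(const std::string& text, const std::string& pat) {
    if (pat.empty()) return 0;
    int n = 0;
    for (std::string::size_type p = text.find(pat); p != std::string::npos;
         p = text.find(pat, p + pat.size())) {
        ++n;
    }
    return n;
}
```

For "Hello", contains_at_sketch finds "el" at position 1 but not at position 0, and freq_sketch counts two occurrences of "l".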
Substrings may be extracted via the at, before, through, from, and after functions. These behave as either lvalues or rvalues.
z = x.at(2, 3)
x.at(2, 2) = "r"
x.at("He") = "je";
x.at("l", -1) = "i";
z = x.at(r)
z = x.before("o")
x.before("ll") = "Bri";
z = x.before(2)
z = x.after("Hel")
z = x.through("el")
z = x.from("el")
x.after("Hel") = "p";
z = x.after(3)
z = " ab c"; z = z.after(RXwhite)
x[0] = 'J';
common_prefix(x, "Help")
common_suffix(x, "to")
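The rvalue behavior of before, through, from, and after, and of common_prefix, can be approximated with std::string::find and substr. This is a sketch only: the function names are hypothetical, patterns are plain strings, returning an empty string when the pattern is absent is an assumption, and the lvalue forms (assigning into a substring) are not modeled.

```cpp
#include <cstddef>
#include <string>

// Everything to the left of the first occurrence of pat.
inline std::string before_sketch(const std::string& x, const std::string& pat) {
    const std::string::size_type p = x.find(pat);
    return p == std::string::npos ? std::string() : x.substr(0, p);
}

// Everything up to and including the first occurrence of pat.
inline std::string through_sketch(const std::string& x, const std::string& pat) {
    const std::string::size_type p = x.find(pat);
    return p == std::string::npos ? std::string() : x.substr(0, p + pat.size());
}

// The first occurrence of pat and everything to its right.
inline std::string from_sketch(const std::string& x, const std::string& pat) {
    const std::string::size_type p = x.find(pat);
    return p == std::string::npos ? std::string() : x.substr(p);
}

// Everything to the right of the first occurrence of pat.
inline std::string after_sketch(const std::string& x, const std::string& pat) {
    const std::string::size_type p = x.find(pat);
    return p == std::string::npos ? std::string() : x.substr(p + pat.size());
}

// Longest leading run of characters shared by a and b.
inline std::string common_prefix_sketch(const std::string& a,
                                        const std::string& b) {
    std::size_t i = 0;
    while (i < a.size() && i < b.size() && a[i] == b[i]) ++i;
    return a.substr(0, i);
}
```

Applied to "Hello", these give "Hell" for before "o", "Hel" for through "el", "ello" for from "el", "lo" for after "Hel", and "Hel" as the common prefix with "Help".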
z = x + s + ' ' + y.at("w") + y.after("w") + ".";
x += y;
cat(x, y, z)
cat(z, y, x, x)
y.prepend(x);
z = replicate(x, 3);
z = join(words, 3, "/")
z = "this string has five words"; i = split(z, words, 10, RXwhite);
int nmatches = x.gsub("l","ll")
z = x + y; z.del("loworl");
z = reverse(x)
z = upcase(x)
z = downcase(x)
z = capitalize(x)
x.reverse(), x.upcase(), x.downcase(), x.capitalize()
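The join and split examples above can be illustrated with a small sketch in standard C++. It uses std::vector<std::string> in place of a String array, and a fixed whitespace set (space, newline, tab) in place of the RXwhite Regex; both function names are hypothetical. Dropping leading and trailing whitespace instead of producing empty fields is an assumption of this sketch and may differ from the library's split.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Concatenate parts, placing sep between consecutive elements.
inline std::string join_sketch(const std::vector<std::string>& parts,
                               const std::string& sep) {
    std::string out;
    for (std::size_t i = 0; i < parts.size(); ++i) {
        if (i != 0) out += sep;
        out += parts[i];
    }
    return out;
}

// Break text into fields separated by runs of space, newline, or tab.
inline std::vector<std::string> split_white_sketch(const std::string& text) {
    const std::string white = " \n\t";
    std::vector<std::string> out;
    std::string::size_type pos = text.find_first_not_of(white);
    while (pos != std::string::npos) {
        const std::string::size_type end = text.find_first_of(white, pos);
        const std::string::size_type count =
            end == std::string::npos ? std::string::npos : end - pos;
        out.push_back(text.substr(pos, count));
        pos = text.find_first_not_of(white, end);
    }
    return out;
}
```

Joining "a", "b", "c" with "/" gives "a/b/c", and splitting "this string has five words" gives five fields.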
cout << x
cout << x.at(2, 3)
cin >> x
x.length()
s = (const char*)x
can be used to extract the underlying char array as a const char*. This coercion is useful for sending a String as an argument to any function expecting a const char* argument (like atoi and File::open). This operator must be used with care, since the conversion returns a pointer to String internals without copying the characters: the resulting (char*) is only valid until the next String operation, and you must not modify it. (The conversion is defined to return a const value so that GNU C++ will produce warning and/or error messages if changes are attempted.)
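The same caution applies to std::string::c_str() in standard C++, which is used here as an analogue of the conversion; the sketch below does not use the String class, and both function names are hypothetical. It shows the two safe patterns: use the pointer immediately, or copy the characters before the string is touched again.

```cpp
#include <cstdlib>
#include <string>

// Safe: the pointer to the internal characters is consumed immediately
// by a function expecting const char*, before any other string operation.
inline int to_int_sketch(const std::string& n) {
    return std::atoi(n.c_str());
}

// Safe: if the characters must outlive later modifications, copy them
// into an independent string instead of holding on to the raw pointer.
inline std::string snapshot_sketch(const std::string& x) {
    return std::string(x.c_str());
}
```

A snapshot taken this way stays unchanged when the original string is later appended to, whereas a retained raw pointer could be left dangling.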