regex_utils library#
This library contains utilities for regex-related tasks.
Regex to Wildcard Translator#
Goal#
Performs a best-effort translation of a regex string into an equivalent wildcard string.
CLP currently only recognizes three meta-characters in the wildcard syntax:
? matches any single character.
* matches zero or more characters.
\ suppresses the special meaning of metacharacters (including itself).
If the regex query can be expressed as a wildcard query using only the three metacharacters above, CLP should use the wildcard version.
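To make the semantics of the three wildcard metacharacters concrete, here is a minimal matcher sketch. The function name wildcard_match is hypothetical and is not part of regex_utils or CLP; it uses naive backtracking and treats a trailing lone backslash as a literal, both of which are assumptions made for illustration.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical helper (not part of CLP): returns true if `text` matches the
// wildcard `pattern`, where `?`, `*`, and `\` have the meanings listed above.
bool wildcard_match(std::string_view pattern, std::string_view text) {
    if (pattern.empty()) {
        return text.empty();
    }
    char const c{pattern.front()};
    if ('*' == c) {
        // `*` matches zero or more characters: try every possible split point.
        for (std::size_t i{0}; i <= text.size(); ++i) {
            if (wildcard_match(pattern.substr(1), text.substr(i))) {
                return true;
            }
        }
        return false;
    }
    if (text.empty()) {
        return false;
    }
    if ('?' == c) {
        // `?` consumes exactly one character, whatever it is.
        return wildcard_match(pattern.substr(1), text.substr(1));
    }
    if ('\\' == c && pattern.size() > 1) {
        // `\` makes the next pattern character a literal.
        return pattern[1] == text.front()
               && wildcard_match(pattern.substr(2), text.substr(1));
    }
    // Ordinary character (or a trailing lone backslash): compare literally.
    return c == text.front() && wildcard_match(pattern.substr(1), text.substr(1));
}
```

For example, the pattern a\*c matches only the text a*c, whereas a*c matches any text that starts with a and ends with c.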
Includes#
The translator function returns a Result<std::string, std::error_code>, which contains either a value or an error code.
To use the translator:
#include <regex_utils/regex_translation_utils.hpp>
using clp::regex_utils::regex_to_wildcard;
// Other code
// regex_str holds the regex to translate
auto result{regex_to_wildcard(regex_str)};
if (result.has_error()) {
    auto err_code{result.error()};
    // Handle error
} else {
    auto wildcard_str{result.value()};
    // Do things with the translated string
}
To add custom configuration to the translator:
#include <regex_utils/RegexToWildcardTranslatorConfig.hpp>
RegexToWildcardTranslatorConfig config{true, false, /*...other booleans*/};
auto result{regex_to_wildcard(regex_str, config)};
// Same as above
For a detailed description of the option order and usage, see the Custom configuration section.
Functionalities#
Wildcards
Turn . into ?
Turn .* into *
Turn .+ into ?*
E.g. abc.*def.ghi.+ will get translated to abc*def?ghi?*
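The three dot rules can be sketched as a single left-to-right pass. The function translate_dots below is a hypothetical illustration, not the library's implementation, and it assumes the input contains no regex metacharacters other than ., .*, and .+.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical sketch (not the regex_utils implementation): applies only the
// three dot rules and copies every other character unchanged.
std::string translate_dots(std::string_view regex) {
    std::string wildcard;
    for (std::size_t i{0}; i < regex.size(); ++i) {
        if ('.' != regex[i]) {
            wildcard += regex[i];
            continue;
        }
        // Look one character ahead to see whether the dot is quantified.
        char const next{i + 1 < regex.size() ? regex[i + 1] : '\0'};
        if ('*' == next) {
            wildcard += '*';   // .* becomes *
            ++i;
        } else if ('+' == next) {
            wildcard += "?*";  // .+ becomes ?* (at least one character)
            ++i;
        } else {
            wildcard += '?';   // a bare . becomes ?
        }
    }
    return wildcard;
}
```

Calling translate_dots("abc.*def.ghi.+") yields abc*def?ghi?*, matching the example above.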
Metacharacter escape sequences
An escaped regex metacharacter is treated as a literal and appended to the wildcard output.
The following characters require escaping to have their special meanings suppressed: [\/^$.|?*+(){}
Superfluous escape characters are ignored for the following characters: ],<>-_=!
E.g. a\[\+b\-\_c-_d will get translated to a[+b-_c-_d
Note: generally, any non-alphanumeric character can be escaped to use it as a literal. The list this utils library supports is non-exhaustive and can be expanded when necessary.
For metacharacters shared by both syntaxes, keep the escape backslashes.
The characters that fall into this category are: *?\
All wildcard metacharacters are also regex metacharacters.
E.g. a\*b\?c\\d will get translated to a\*b\?c\\d (no change)
Escape sequences with alphanumeric characters are disallowed.
E.g. special utility escape sequences such as \Q, \E, \A, and back references such as \1, \2, cannot be translated.
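The escape-sequence rules above can be sketched as follows. The function translate_escapes and its std::optional return type are hypothetical (the real translator returns a Result with an error code). The sketch copies unescaped characters unchanged and treats a trailing lone backslash as an error, both of which are assumptions.

```cpp
#include <cstddef>
#include <optional>
#include <string>
#include <string_view>

// Hypothetical sketch (not the regex_utils implementation): handles escape
// sequences only and returns std::nullopt for untranslatable input.
std::optional<std::string> translate_escapes(std::string_view regex) {
    // Metacharacters shared by both syntaxes: the backslash is kept.
    constexpr std::string_view cSharedMetaChars{"*?\\"};
    // Regex-only metacharacters and superfluously escaped characters:
    // the backslash is dropped and the character is emitted as a literal.
    constexpr std::string_view cLiteralAfterEscape{"[/^$.|+(){}],<>-_=!"};

    std::string wildcard;
    for (std::size_t i{0}; i < regex.size(); ++i) {
        if ('\\' != regex[i]) {
            wildcard += regex[i];
            continue;
        }
        if (i + 1 == regex.size()) {
            return std::nullopt;  // Assumption: a dangling backslash is an error
        }
        char const escaped{regex[++i]};
        if (std::string_view::npos != cSharedMetaChars.find(escaped)) {
            wildcard += '\\';
            wildcard += escaped;
        } else if (std::string_view::npos != cLiteralAfterEscape.find(escaped)) {
            wildcard += escaped;
        } else {
            // Alphanumeric escapes (\Q, \A, \1, ...) and anything unlisted
            return std::nullopt;
        }
    }
    return wildcard;
}
```

With this sketch, a\[\+b\-\_c-_d becomes a[+b-_c-_d, a\*b\?c\\d is returned unchanged, and \Q or \1 produce no value.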
Character set
Reduces a character set into a single character if possible.
A trivial character set containing a single character or a single escaped metacharacter is reduced to that character.
E.g. [a] into a, [\^] into ^
If the case_insensitive_wildcard config is turned on, the translator can also reduce case-insensitive style character set patterns into a single lowercase character:
E.g. [aA] into a, [Bb] into b, [xX][Yy][zZ] into xyz
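A minimal sketch of the character set reduction, operating on the text between the brackets. The function reduce_char_set is hypothetical and not part of regex_utils. As a simplifying assumption, it gives up on sets whose single character is itself a wildcard metacharacter (*, ?, \), since those would need an escape in the output, and on a lone ^, which would be a negation marker.

```cpp
#include <cctype>
#include <optional>
#include <string_view>

// Hypothetical sketch (not the regex_utils implementation): `body` is the
// text between `[` and `]`. Returns the single character the set reduces to,
// or std::nullopt if this sketch cannot reduce it.
std::optional<char> reduce_char_set(std::string_view body, bool case_insensitive_wildcard) {
    auto const is_wildcard_meta = [](char c) { return '*' == c || '?' == c || '\\' == c; };
    auto const is_alnum = [](char c) { return 0 != std::isalnum(static_cast<unsigned char>(c)); };
    auto const is_alpha = [](char c) { return 0 != std::isalpha(static_cast<unsigned char>(c)); };
    auto const to_lower
            = [](char c) { return static_cast<char>(std::tolower(static_cast<unsigned char>(c))); };

    if (1 == body.size()) {
        // Single character, e.g. [a]
        if (is_wildcard_meta(body[0]) || '^' == body[0]) {
            return std::nullopt;
        }
        return body[0];
    }
    if (2 != body.size()) {
        return std::nullopt;
    }
    if ('\\' == body[0]) {
        // Single escaped metacharacter, e.g. [\^]; alphanumeric escapes such
        // as \d are character classes and cannot be reduced.
        if (is_alnum(body[1]) || is_wildcard_meta(body[1])) {
            return std::nullopt;
        }
        return body[1];
    }
    if (case_insensitive_wildcard && is_alpha(body[0]) && is_alpha(body[1])
        && body[0] != body[1] && to_lower(body[0]) == to_lower(body[1]))
    {
        // Case-insensitive pair, e.g. [aA] or [Bb]
        return to_lower(body[0]);
    }
    return std::nullopt;
}
```

Applying the function to each set in [xX][Yy][zZ] with the option enabled gives x, y, and z, which concatenate to xyz.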
Custom configuration#
RegexToWildcardTranslatorConfig objects are currently immutable once instantiated. By default, all options are set to false.
The constructor takes the following option arguments in order:
case_insensitive_wildcard: see the Character set item in the Functionalities section.
add_prefix_suffix_wildcards: in the absence of regex anchors, add prefix or suffix wildcards so the query becomes a substring query.
E.g. info.*system gets translated into *info*system*, which makes the original query a substring query.
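The effect of add_prefix_suffix_wildcards can be sketched as a post-processing step on an already translated wildcard string. The function add_substring_wildcards and its two anchor flags are hypothetical; how the translator detects and handles the regex anchors themselves is not covered here.

```cpp
#include <string>
#include <string_view>

// Hypothetical sketch (not the regex_utils implementation): wraps a translated
// wildcard string with `*` on each side where the regex had no anchor.
std::string add_substring_wildcards(
        std::string_view wildcard,
        bool has_start_anchor,
        bool has_end_anchor
) {
    std::string result;
    if (false == has_start_anchor) {
        result += '*';  // No start anchor: allow any prefix
    }
    result.append(wildcard);
    if (false == has_end_anchor) {
        result += '*';  // No end anchor: allow any suffix
    }
    return result;
}
```

For the unanchored regex info.*system, whose translation is info*system, the call add_substring_wildcards("info*system", false, false) returns *info*system*.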