# Parsing wildcard queries
Given a wildcard query, we need to parse it like we would a message, turning it into a logtype and variable values that we can use to match encoded messages.
## Motivating example
Consider this message (timestamp omitted for brevity):

```text
INFO Task task_12 assigned to container: [NodeAddress:172.128.0.41, \
ContainerID:container_15], operation took 0.335 seconds
```
At a high level, we parse it as follows:

1. Tokenize the message using the delimiters from the schema file.
2. Compare each token against the variable patterns from the schema file. If a token matches a pattern, we:
   1. extract it,
   2. encode it either as a dictionary or a non-dictionary variable, and
   3. replace the token with a placeholder in the original message. The specific placeholder used depends on how the variable was encoded.
The output for the example is:

* Dictionary variables: `["task_12", "172.128.0.41", "container_15"]`
* Encoded variables: `[0.335]` (in reality, this is encoded, but we omit the details for brevity)
* Logtype: `INFO Task <dict-var> assigned to container: [NodeAddress:<dict-var>, ContainerID:<dict-var>], operation took <float> seconds`

where `<dict-var>` and `<float>` are single-byte placeholder characters.
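As an illustration, the parsing steps above can be sketched in Python. The delimiters, variable patterns, and textual placeholders below are simplified assumptions for this example, not CLP’s actual schema or encoding:

```python
import re

# Assumed delimiters and variable patterns (simplified stand-ins for what a
# real schema file would define).
DELIMITERS = " :,[]"
# Patterns are tried in order; the first match wins.
PATTERNS = [
    ("float", re.compile(r"\d+\.\d+"), "<float>"),    # encoded (non-dictionary) variable
    ("dict", re.compile(r"\S*\d\S*"), "<dict-var>"),  # any other token containing a digit
]

def parse_message(message):
    dict_vars, encoded_vars, logtype = [], [], []
    # Split on runs of delimiters, keeping the delimiter runs so the logtype
    # preserves the message's static text exactly.
    for token in re.split("([" + re.escape(DELIMITERS) + "]+)", message):
        for name, pattern, placeholder in PATTERNS:
            if pattern.fullmatch(token):
                if name == "float":
                    encoded_vars.append(float(token))
                else:
                    dict_vars.append(token)
                logtype.append(placeholder)
                break
        else:
            logtype.append(token)  # static text or a delimiter run
    return dict_vars, encoded_vars, "".join(logtype)
```

Running this on the example message reproduces the dictionary variables, encoded variables, and logtype shown above.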
Now consider the query `*task* took 0.3*`. To match this query against the encoded messages, we need to parse it like a log message and then use the parsed values as queries on the relevant data. For instance, after parsing, we might extract `0.3*` as an encoded variable, meaning we should look for encoded variables that match `0.3*`. But `0.3*` could also match a dictionary variable, which requires a separate query. Overall, wildcards create ambiguity that requires us to consider different query interpretations.

There are four query interpretations for the example (`*task* took 0.3*`):
Interpretation 1:

* Dictionary variable queries: `["*task*"]`
* Encoded variable queries: `["0.3*"]`
* Logtype query: `*<dict-var>* took <float>*`

Interpretation 2:

* Dictionary variable queries: `["*task*", "0.3*"]`
* Encoded variable queries: `[]`
* Logtype query: `*<dict-var>* took <dict-var>*`

Interpretation 3:

* Dictionary variable queries: `[]`
* Encoded variable queries: `["0.3*"]`
* Logtype query: `*task* took <float>*`

Interpretation 4:

* Dictionary variable queries: `["0.3*"]`
* Encoded variable queries: `[]`
* Logtype query: `*task* took <dict-var>*`
We call each of these interpretations a subquery. A message which matches any subquery matches the original wildcard query (with one exception mentioned later). In other words, the subqueries form a logical disjunction (i.e., the subqueries are OR-ed together to comprise the original query). The rest of this doc explains how we generate these subqueries. For more background on logtypes, variables, etc., see the CLP paper.
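To make the disjunction concrete, here is a hypothetical sketch of how subqueries could be represented and OR-ed together. The field names and the use of textual placeholders are assumptions for illustration, not CLP’s internal structures:

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase
from typing import List

@dataclass(frozen=True)
class Subquery:
    dict_var_queries: tuple    # wildcard queries against dictionary variables
    encoded_var_queries: tuple # wildcard queries against encoded variables
    logtype_query: str         # wildcard query against the logtype

def may_match(subqueries: List[Subquery], logtype, dict_vars, encoded_vars) -> bool:
    """A message can match the original wildcard query only if it matches at
    least one subquery, i.e., the subqueries form a logical disjunction."""
    def matches(sq: Subquery) -> bool:
        return (
            fnmatchcase(logtype, sq.logtype_query)
            and all(any(fnmatchcase(v, q) for v in dict_vars)
                    for q in sq.dict_var_queries)
            and all(any(fnmatchcase(str(v), q) for v in encoded_vars)
                    for q in sq.encoded_var_queries)
        )
    return any(matches(sq) for sq in subqueries)
```

Note that `fnmatch` also gives `[` special meaning, which a wildcard syntax supporting only `*` and `?` would not; a production matcher would implement the wildcard semantics directly.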
## Handling ambiguity
To parse a query, we need to consider two sources of ambiguity:

1. How each interpretation of a wildcard changes the tokenization.
2. What variable patterns match a wildcard-containing token, and the variable placeholders each matching pattern uses.
We consider each source of ambiguity below.
### Tokenization with wildcards
Consider `*task?123*` and assume we use the default variable patterns.

* If the `?` matches a non-delimiter, this query could match a single dictionary variable, e.g., `task_123`.
* If the `?` matches a delimiter (e.g., `:`), this query could match a message with some static text `task:` and an encoded variable `123`.

Thus, for every wildcard we need to consider each possibility (delimiter/non-delimiter). For `?`, this is simple, as shown in the example. However, `*` is more involved since it can match zero or more characters; in other words, a single `*` could match both delimiters and non-delimiters.
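For instance, the space of wildcard interpretations to examine can be enumerated as follows (an illustrative sketch that ignores escaped wildcards; CLP prunes many of these combinations rather than enumerating them naively):

```python
from itertools import product

# 'd' = matches delimiters only, 'n' = matches non-delimiters only,
# 'b' = matches both (possible for '*' only, since '?' matches exactly one
# character, which is either a delimiter or a non-delimiter).
def wildcard_interpretations(query):
    options = []
    for ch in query:
        if ch == "?":
            options.append(("d", "n"))
        elif ch == "*":
            options.append(("d", "n", "b"))
    # Every combination must be considered when parsing the query.
    return list(product(*options))
```

For `*task?123*`, this yields 3 × 2 × 3 = 18 combinations to examine.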
#### Handling `*`
Consider how we might tokenize `*to*container* 0.335 *`. `*to*container*` could be one or more tokens depending on how we interpret each `*`. `0.335` is a token that can be encoded as a float variable. The lone `*` can match any number of tokens.

For `*to*container*`, Table 1 below lists the spans we can generate based on how we interpret each `*`. We use the term *span* to refer to either a contiguous set of non-delimiters (i.e., a token) or a contiguous set of delimiters.
| # | `*` interpretation | Spans |
|---|---|---|
| 1 | Delimiters only | `*to`, `*`, `container*` |
| 2 | Non-delimiters only | `*to*container*` |
| 3 | Both | `*to*`, `*`, `*container*` |

Table 1: The spans generated by tokenizing `*to*container*` depending on the interpretation of the `*`s.
To understand the spans generated by the third interpretation, consider the central `*` and the surrounding non-wildcards in the original query. Since the `*` is interpreted as matching both non-delimiters and delimiters, there must be at least one delimiter between `to` and `container`. Table 2 below lists a set of substrings that could match `to*container`.
| Substring | Parts matched by the `*` |
|---|---|
| `to:container` | Delimiters (`:`) |
| `to123container` | Non-delimiters (`123`) |
| `to12:34:5container` | Non-delimiters, a delimiter, non-delimiters, a delimiter, and a non-delimiter (`12`, `:`, `34`, `:`, `5`). |

Table 2: Some substrings that can be matched by `to*container` where the central `*` is interpreted as matching a combination of non-delimiters and delimiters.
From the table, we can see that the central `*` could match the following in sequence:

1. zero or more non-delimiters attached to `to`, followed by
2. at least one delimiter or a combination of non-delimiters and delimiters, and finally
3. zero or more non-delimiters before `container`.

Thus, we can break the central `*` into three `*`s corresponding to each case of the sequence: one as a suffix of `to`, a lone `*`, and one as a prefix of `container`.
Comparing the first and third interpretations in Table 1, we can see that the third is a more general version of the first. As a result, we don’t need to consider the first interpretation. We can generalize this as follows:

> If a `*` is interpreted to have a different type than either of the characters surrounding it, the tokenization should split the string at the `*` while leaving a `*` attached to the surrounding characters.

So the wildcard-containing token, `*to*container*`, can be tokenized either as:

* `*to*` and `*container*`, or
* `*to*container*`
Note that we don’t need to consider the lone `*` as a potential variable since it matches all variable patterns; similarly, we don’t need to consider what variable placeholders it needs in the logtype since it matches all variable placeholders. A consequence of this is that the interpretation of a wildcard-containing token’s boundary `*` wildcards (wildcards at the beginning or end of a token) does not affect how we tokenize a wildcard-containing token. In other words, we don’t need to consider the non-delimiters-only case for boundary `*` wildcards.
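The splitting rule above can be sketched as follows for a token with a single central `*` (a simplified illustration; `left` and `right` are assumed to already carry any boundary wildcards):

```python
def tokenize_central_star(left, right, interpretation):
    """Split `left + '*' + right` (e.g., '*to' + '*' + 'container*')
    according to the central '*'s interpretation."""
    if interpretation == "non-delimiters":
        # The '*' matches only non-delimiters: everything stays one token.
        return [left + "*" + right]
    if interpretation == "delimiters":
        # The '*' matches only delimiters: it separates two tokens.
        return [left, "*", right]
    # "both": split at the '*' while leaving a '*' attached to each side.
    return [left + "*", "*", "*" + right]
```

For `*to*container*`, the "both" interpretation yields the three spans `*to*`, `*`, and `*container*` described above.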
### Matching variable patterns to wildcard-containing tokens
The precise mechanism for matching a variable pattern against a wildcard-containing token is an implementation detail, but it is worth considering the difference between matching a token in a log message versus matching a wildcard-containing token in a wildcard query.
In a log message, if two or more patterns match a token, we apply the pattern that appears first in the schema file. When two or more patterns match a wildcard-containing token, however, we can only choose the first pattern if it is a superset of the others (in which case the other patterns are redundant, since any token they match would also match the first pattern; in the future, we will likely warn users when their patterns have this property). If two or more non-nested patterns (i.e., where one is not a superset of the other) match, choosing the first would ignore cases where only another pattern’s variables match the query, so we must consider each matching pattern. For instance, consider these non-nested patterns:
```text
ip_addr: \d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}
float: \d+\.\d+
```
If we encounter a wildcard-containing token like `*1.2*`, we have to search for variables matching either `ip_addr` or `float`. For instance, the encoded messages might contain a message with the `float` `1.23` and/or a message with the `ip_addr` `1.2.3.4`. Since the two variables use different placeholders in the logtype, we need to generate a separate subquery for each.
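A variable-side sketch of this search, treating all variable values as strings for illustration (hypothetical; CLP determines the candidate patterns analytically rather than classifying every stored value):

```python
import re
from fnmatch import fnmatchcase

# The non-nested patterns from the example above, in schema-file order.
SCHEMA_PATTERNS = {
    "ip_addr": re.compile(r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}"),
    "float": re.compile(r"\d+\.\d+"),
}

def candidate_variables(wildcard_token, values):
    """Group stored variable values that match a wildcard-containing token
    by the first schema pattern each value matches."""
    matches = {name: [] for name in SCHEMA_PATTERNS}
    for value in values:
        if not fnmatchcase(value, wildcard_token):
            continue
        for name, pattern in SCHEMA_PATTERNS.items():
            if pattern.fullmatch(value):
                matches[name].append(value)
                break  # concrete values take the first matching pattern
    return matches
```

Here `*1.2*` turns up both `float` and `ip_addr` candidates, so each needs its own subquery.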
## Generating Subqueries
Based on the analysis above, we can develop an algorithm to generate all possible subqueries. One approach is to iterate through each possible interpretation of every wildcard. For a given interpretation, we would tokenize the query, and for each wildcard-containing token, we would iterate through its matching variable patterns. The approach we take is a slight variation of this.
At a high level, the algorithm is as follows:
First, we tokenize the query, treating every (unescaped) wildcard as a non-delimiter. At this point, if we were to remove all wildcard-containing tokens, then we would have no wildcards remaining in the query. This is helpful because it allows us to leave the part of the query without wildcards intact while we iterate on every interpretation of the wildcard-containing tokens.
When constructing a wildcard-containing token, we find each wildcard and determine whether it could be interpreted as matching only non-delimiters, only delimiters, or both.
When constructing a wildcard-containing token, we also tokenize it based on the current interpretation of its wildcards. This may lead to creating a token that’s static text, a token that’s a variable, and/or a smaller wildcard-containing token. For example, if we were to tokenize the wildcard-containing token `?abc?123?`, interpreting every `?` as matching a delimiter, then we would end up with two tokens, `abc` and `123`. `abc` is static text while `123` is an integer variable. However, if the central `?` were instead interpreted as matching a non-delimiter, then the only token generated would be `abc?123`, which can only match a dictionary variable.
As a result, we call the original wildcard-containing token a `CompositeWildcardToken`, since it can generate multiple smaller tokens based on the interpretation of its wildcards. We call each smaller wildcard-containing token a `WildcardToken` since it is not further divisible. Finally, we call each token that doesn’t contain a wildcard and that matches a variable pattern an `ExactVariableToken`, in contrast with a `WildcardToken`.
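These three token classes could be modeled along the following lines. Only the class names come from this doc; the fields and the interpretation encoding are assumptions for illustration:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class ExactVariableToken:
    # A wildcard-free token that matches a variable pattern.
    value: str
    pattern: str  # e.g., "float" or "int"

@dataclass
class WildcardToken:
    # An indivisible wildcard-containing token; `interpretations` lists the
    # variable patterns it can match, plus "static-text" when applicable.
    value: str
    interpretations: List[str]
    current: int = 0

    def advance(self) -> bool:
        # Move to the next interpretation; False signals a wrap-around.
        self.current = (self.current + 1) % len(self.interpretations)
        return self.current != 0

@dataclass
class CompositeWildcardToken:
    # The original wildcard-containing token; each interpretation of its
    # wildcards regenerates a different list of child tokens.
    value: str
    children: List[WildcardToken] = field(default_factory=list)
```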
When constructing a `WildcardToken`, we find all the variable patterns it can match, as well as whether it can match static text. Each case is an interpretation we must consider when generating subqueries.
Once tokenization is complete, we will already have an interpretation of the wildcards and `WildcardToken`s from which we can generate a subquery. So the next step is to generate a subquery and then begin iterating.
The first layer of iteration is the interpretation of each `WildcardToken`. Essentially, we change the interpretation of a single `WildcardToken` and then generate another subquery. We repeat this process until the chosen `WildcardToken` has no new interpretations, at which point we reset its interpretation and advance the interpretation of the next `WildcardToken`. This process continues much like a counter (e.g., 00, 01, 10, 11): when a digit overflows, we reset it, increment the next-highest digit, and then continue counting from the lowest digit.
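The counter-like iteration can be sketched as a generator (a simplified illustration, assuming each token exposes a plain list of interpretations):

```python
def iterate_interpretations(tokens):
    """Odometer-style iteration: advance the first token's interpretation;
    on wrap-around, reset it and carry into the next token, like a counter
    (00, 01, 10, 11). `tokens` is a list of interpretation lists; yields one
    combination per step."""
    state = [0] * len(tokens)
    while True:
        yield [tokens[i][state[i]] for i in range(len(tokens))]
        i = 0
        while i < len(tokens):
            state[i] += 1
            if state[i] < len(tokens[i]):
                break  # no carry needed
            state[i] = 0  # overflow: reset and carry into the next token
            i += 1
        if i == len(tokens):
            return  # every combination has been produced
```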
When we’ve exhausted all `WildcardToken`s, the second layer of iteration is the interpretation of each wildcard.
When every iteration is complete, we will have a complete list of subqueries. However, some subqueries may be duplicates of each other. For instance, consider `*abc*def?`. When all wildcards are interpreted to match delimiters, one subquery we would generate is:

* Dictionary variable queries: `[]`
* Encoded variable queries: `[]`
* Logtype query: `*abc*def?`

where both `*abc*` and `*def?` are interpreted as static text. Similarly, when the `?` is interpreted to match non-delimiters, we could again generate the same subquery. Therefore, we deduplicate the subqueries during generation.
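Deduplication can be as simple as keying each subquery on its contents (a sketch in which a subquery is a (dict-var queries, encoded-var queries, logtype query) triple):

```python
def dedup_subqueries(subqueries):
    """Drop duplicate subqueries, where each subquery is a
    (dict_var_queries, encoded_var_queries, logtype_query) triple."""
    seen = set()
    unique = []
    for dict_qs, enc_qs, logtype_q in subqueries:
        # Lists aren't hashable, so key on an immutable tuple form.
        key = (tuple(dict_qs), tuple(enc_qs), logtype_q)
        if key not in seen:
            seen.add(key)
            unique.append((dict_qs, enc_qs, logtype_q))
    return unique
```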
One final nuance of using the subqueries as described is that a message matching a subquery is not guaranteed to match the original wildcard query. Consider Interpretation 1 from the motivating example:

* Dictionary variable queries: `["*task*"]`
* Encoded variable queries: `["0.3*"]`
* Logtype query: `*<dict-var>* took <float>*`

And consider this encoded message:

* Dictionary variables: `["task_12"]`
* Encoded variables: `[0.4, 0.3]`
* Logtype: `<dict-var> took <float> above <float>`

We can see that this encoded message matches the subquery, but when decoded, it is `"task_12 took 0.4 above 0.3"`, which does not match the original wildcard query `*task* took 0.3*`. This is because the subqueries as described don’t consider the position of query variables in relation to the logtype query.
A brute-force solution is simply to decode messages that match the subqueries and then perform a wildcard match with the original query. However, more efficient approaches do exist and can be implemented when necessary.
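The brute-force check can be sketched as follows (assuming the textual placeholders used in this doc; CLP’s placeholders are single bytes and its decoding is more involved):

```python
from fnmatch import fnmatchcase

def matches_after_decoding(original_query, logtype, dict_vars, encoded_vars):
    """Decode a message by substituting its variable values back into the
    logtype, then wildcard-match the original query against the result."""
    decoded = logtype
    for v in dict_vars:
        decoded = decoded.replace("<dict-var>", v, 1)
    for v in encoded_vars:
        decoded = decoded.replace("<float>", str(v), 1)
    return fnmatchcase(decoded, original_query)
```

This rejects the false positive above: `"task_12 took 0.4 above 0.3"` fails the final match against `*task* took 0.3*` even though the encoded message matched the subquery.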