Parsing wildcard queries#

Given a wildcard query, we need to parse it like we would a message, turning it into a logtype and variable values that we can use to match encoded messages.

Motivating example#

Consider this message (timestamp omitted for brevity):

  INFO Task task_12 assigned to container: [NodeAddress:172.128.0.41, \
 ContainerID:container_15], operation took 0.335 seconds

At a high level, we parse it as follows:

  1. Tokenize the message using the delimiters from the schema file.

  2. Compare each token against the variable patterns from the schema file. If a token matches a pattern, we:

    1. extract it,

    2. encode it either as a dictionary or a non-dictionary variable, and

    3. replace the token with a placeholder in the original message.

      • The specific placeholder used depends on how the variable was encoded.

The output for the example is:

  • Dictionary variables: ["task_12", "172.128.0.41", "container_15"]

  • Encoded variables: [0.335] (in reality, this is encoded but we omit the details for brevity)

  • Logtype:

     INFO Task <dict-var> assigned to container: [NodeAddress:<dict-var>, \
    ContainerID:<dict-var>], operation took <float> seconds
    
    • Where <dict-var> and <float> are single-byte placeholder characters.
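
The steps above can be sketched in Python. This is a minimal sketch, not CLP's implementation: the delimiter set and the two variable patterns below are simplified stand-ins for what a real schema file defines, and the multi-character placeholders stand in for CLP's single-byte ones.

```python
import re

# Simplified stand-ins for a schema file's delimiters and variable patterns.
DELIMITERS = " :,[]"
FLOAT_PATTERN = re.compile(r"\d+\.\d+")   # encoded as a non-dictionary variable
DICT_VAR_PATTERN = re.compile(r".*\d.*")  # any other token containing a digit

def parse_message(message):
    """Parse a message into a logtype plus dictionary and encoded variables."""
    # 1. Tokenize using the delimiters; the capture group keeps delimiter runs.
    pieces = re.split(f"([{re.escape(DELIMITERS)}]+)", message)
    logtype_parts, dict_vars, encoded_vars = [], [], []
    for piece in pieces:
        # 2. Compare each token against the variable patterns.
        if FLOAT_PATTERN.fullmatch(piece):
            encoded_vars.append(float(piece))
            logtype_parts.append("<float>")      # 3. Replace with a placeholder.
        elif DICT_VAR_PATTERN.fullmatch(piece):
            dict_vars.append(piece)
            logtype_parts.append("<dict-var>")
        else:
            logtype_parts.append(piece)          # static text or delimiters
    return "".join(logtype_parts), dict_vars, encoded_vars

logtype, dict_vars, encoded_vars = parse_message(
    "INFO Task task_12 assigned to container: [NodeAddress:172.128.0.41, "
    "ContainerID:container_15], operation took 0.335 seconds"
)
```

Running this on the example message reproduces the output listed above.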

Now consider the query *task* took 0.3*. To match this query against the encoded messages, we need to parse it like a log message, and then use the parsed values as queries on the relevant data. For instance, after parsing, we might extract 0.3* as an encoded variable, meaning we should look for encoded variables that match 0.3*. But 0.3* could also match a dictionary variable which requires a separate query. Overall, wildcards create ambiguity that requires us to consider different query interpretations.

There are four query interpretations for the example (*task* took 0.3*):

  1. Interpretation 1:

    • Dictionary variable queries: ["*task*"]

    • Encoded variable queries: ["0.3*"]

    • Logtype query: *<dict-var>* took <float>*

  2. Interpretation 2:

    • Dictionary variable queries: ["*task*", "0.3*"]

    • Encoded variable queries: []

    • Logtype query: *<dict-var>* took <dict-var>*

  3. Interpretation 3:

    • Dictionary variable queries: []

    • Encoded variable queries: ["0.3*"]

    • Logtype query: *task* took <float>*

  4. Interpretation 4:

    • Dictionary variable queries: ["0.3*"]

    • Encoded variable queries: []

    • Logtype query: *task* took <dict-var>*

We call each of these interpretations a subquery. A message which matches any subquery matches the original wildcard query (with one exception mentioned later). In other words, the subqueries form a logical disjunction (i.e., the subqueries are OR-ed together to comprise the original query). The rest of this doc explains how we generate these subqueries. For more background on logtypes, variables, etc., see the CLP paper.
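
To make the disjunction concrete, here's a minimal sketch. It assumes, purely for illustration, that a subquery and an encoded message can each be reduced to a (logtype, dictionary variables, encoded variables) triple, and it uses Python's fnmatch as a stand-in for wildcard matching:

```python
from fnmatch import fnmatchcase

def matches_subquery(message, subquery):
    """Check one interpretation: the logtype query and every variable query
    must be satisfied by the encoded message."""
    logtype, dict_vars, encoded_vars = message
    logtype_query, dict_var_queries, encoded_var_queries = subquery
    return (
        fnmatchcase(logtype, logtype_query)
        and all(any(fnmatchcase(v, q) for v in dict_vars)
                for q in dict_var_queries)
        and all(any(fnmatchcase(str(v), q) for v in encoded_vars)
                for q in encoded_var_queries)
    )

def matches_query(message, subqueries):
    # The subqueries are OR-ed together: matching any one of them makes the
    # message a (candidate) match for the original wildcard query.
    return any(matches_subquery(message, sq) for sq in subqueries)

# The encoded message from the motivating example:
message = (
    "INFO Task <dict-var> assigned to container: [NodeAddress:<dict-var>, "
    "ContainerID:<dict-var>], operation took <float> seconds",
    ["task_12", "172.128.0.41", "container_15"],
    [0.335],
)
interp_1 = ("*<dict-var>* took <float>*", ["*task*"], ["0.3*"])
interp_3 = ("*task* took <float>*", [], ["0.3*"])
```

The example message satisfies Interpretation 1 but not Interpretation 3 (whose logtype query requires a literal task in the logtype), yet it matches the overall disjunction.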

Handling ambiguity#

To parse a query, we need to consider two sources of ambiguity:

  • How each interpretation of a wildcard changes the tokenization.

  • What variable patterns match a wildcard-containing token, and the variable placeholders each matching pattern uses.

We consider each source of ambiguity below.

Tokenization with wildcards#

Consider *task?123* and assume we use the default variable patterns.

  • If the ? matches a non-delimiter, this query could match a single dictionary variable, e.g., task_123.

  • If the ? matches a delimiter (e.g., :), this query could match a message with some static text task: and an encoded variable 123.

Thus, for every wildcard we need to consider each possibility (delimiter/non-delimiter). For ?, this is simple as shown in the example. However, * is more involved since it can match zero or more characters—in other words, a single * could match both delimiters and non-delimiters.
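
The ? case can be sketched as follows, under the assumption that ? is the only wildcard in the token (handling * needs the splitting rule developed in the next section):

```python
from itertools import product

def question_mark_tokenizations(token):
    """Map each (delimiter/non-delimiter) interpretation of the token's
    '?' wildcards to the resulting non-delimiter spans."""
    positions = [i for i, c in enumerate(token) if c == "?"]
    tokenizations = {}
    for choices in product(("non-delim", "delim"), repeat=len(positions)):
        spans, start = [], 0
        for pos, choice in zip(positions, choices):
            if choice == "delim":       # a matched delimiter splits the token
                spans.append(token[start:pos])
                start = pos + 1
        spans.append(token[start:])
        tokenizations[choices] = [s for s in spans if s]
    return tokenizations

tokenizations = question_mark_tokenizations("task?123")
```

For task?123 this yields the two cases above: the whole token task?123 when the ? matches a non-delimiter, and the separate tokens task and 123 when it matches a delimiter.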

Handling *#

Consider how we might tokenize *to*container* 0.335 *. *to*container* could be one or more tokens depending on how we interpret each *. 0.335 is a token that can be encoded as a float variable. The lone * can match any number of tokens.

For *to*container*, Table 1 below lists the spans we can generate based on how we interpret each *. We use the term span to refer to either a contiguous sequence of non-delimiters (i.e., a token) or a contiguous sequence of delimiters.

  #   * interpretation      Spans
  1   Delimiters only       *, to, *, container, *
  2   Non-delimiters only   *to*container*
  3   Both                  *, *to*, *, *container*, *

Table 1: The spans generated by tokenizing *to*container* depending on the interpretation of each *.

To understand the spans generated by the third interpretation, consider the central * and the surrounding non-wildcards in the original query. Since the * is interpreted as matching both non-delimiters and delimiters, there must be at least one delimiter between to and container. Table 2 below lists some substrings that could match to*container.

  Substring              Parts matched by the central *
  to:::container         Delimiters (:::)
  tools:container        Non-delimiters (ols) followed by a delimiter (:)
  tools:new:mcontainer   Non-delimiters (ols), a delimiter (:), non-delimiters (new), a delimiter (:), and a non-delimiter (m)

Table 2: Some substrings that can be matched by to*container where the central * is interpreted as matching a combination of non-delimiters and delimiters.

From the table, we can see that the central * could match the following in sequence:

  • zero or more non-delimiters attached to to, followed by

  • at least one delimiter or a combination of non-delimiters and delimiters, and finally

  • zero or more non-delimiters before container.

Thus, we can break the central * into three *s, one for each case in the sequence: one as a suffix of to, a lone *, and one as a prefix of container.

Comparing the first and third interpretations in Table 1, we can see that the third is a more general version of the first. As a result, we don’t need to consider the first interpretation. We can generalize this as follows:

If a * is interpreted to have a different type than either of the characters surrounding it, the tokenization should split the string at the * while leaving a * attached to the surrounding characters.

So the wildcard-containing token, *to*container*, can be tokenized either as:

  1. *to* and *container*, or

  2. *to*container*

Note that we don’t need to consider the lone * as a potential variable since it matches all variable patterns; similarly, we don’t need to consider what variable placeholders it needs in the logtype since it matches all variable placeholders. A consequence of this is that the interpretation of a wildcard-containing token’s boundary * wildcards (wildcards at the beginning or end of a token) does not affect how we tokenize a wildcard-containing token. In other words, we don’t need to consider the non-delimiters-only case for * boundary wildcards.
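
The splitting rule can be sketched as follows. This assumes a token with at most one internal * (boundary *s, per the argument above, never force a split), which is enough for the *to*container* example; a full implementation would handle every internal wildcard.

```python
def star_tokenizations(token):
    """Return the possible tokenizations of a token whose only internal
    wildcard is a single '*'."""
    inner = token.strip("*")           # boundary *s don't affect tokenization
    left, star, right = inner.partition("*")
    # Interpretation 1: the internal * matches non-delimiters only, so the
    # token stays whole.
    results = [[token]]
    if star:
        # Interpretation 2: the internal * also matches delimiters, so we
        # split at the *, leaving a * attached to each surrounding piece.
        results.append([f"*{left}*", f"*{right}*"])
    return results
```

star_tokenizations("*to*container*") produces exactly the two tokenizations listed above.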

Matching variable patterns to wildcard-containing tokens#

The precise mechanism for matching a variable pattern against a wildcard-containing token is an implementation detail, but it is worth considering the difference between matching a token in a log message versus matching a wildcard-containing token in a wildcard query.

In a log message, if two or more patterns match a token, we apply the pattern that appears first in the schema file. When two or more patterns match a wildcard-containing token, however, we can only choose the first pattern if it is a superset of the others; in that case the other patterns never apply to any token, since every token they match is claimed by the first pattern during parsing. (In the future, we will likely warn users when their patterns have this property.) If instead two or more non-nested patterns (i.e., where neither is a superset of the other) match the token, choosing only the first would ignore messages where the token was encoded using one of the other patterns. For instance, consider these non-nested patterns:

ip_addr: \d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}
float: \d+\.\d+

If we encounter a wildcard-containing token like *1.2*, we have to search for variables matching either ip_addr or float. For instance, the encoded messages might include one containing the float 1.23 and/or one containing the ip_addr 1.2.3.4. Since the two variable types use different placeholders in the logtype, we need to generate a separate subquery for each.
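
To illustrate why both patterns stay in play, here's a sketch that checks, via example values, which patterns a wildcard-containing token could match. (A real implementation decides this from the patterns alone, which is the implementation detail mentioned above; fnmatch stands in for wildcard matching.)

```python
import re
from fnmatch import fnmatchcase

# The two non-nested patterns from the schema snippet above.
PATTERNS = {
    "ip_addr": re.compile(r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}"),
    "float": re.compile(r"\d+\.\d+"),
}

def plausible_patterns(wildcard_token, sample_values):
    """Return the names of patterns with at least one sample value that
    matches both the pattern and the wildcard-containing token."""
    return {
        name
        for name, pattern in PATTERNS.items()
        for value in sample_values
        if pattern.fullmatch(value) and fnmatchcase(value, wildcard_token)
    }
```

Given the samples 1.23 and 1.2.3.4, the token *1.2* is plausible under both patterns, so each needs its own subquery.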

Generating subqueries#

Based on the analysis above, we can develop an algorithm to generate all possible subqueries. One approach is to iterate through each possible interpretation of every wildcard. For a given interpretation, we would tokenize the query, and for each wildcard-containing token, we would iterate through its matching variable patterns. The approach we take is a slight variation of this.

At a high level, the algorithm is as follows:

First, we tokenize the query, treating every (unescaped) wildcard as a non-delimiter. At this point, if we were to remove all wildcard-containing tokens, then we would have no wildcards remaining in the query. This is helpful because it allows us to leave the part of the query without wildcards intact while we iterate on every interpretation of the wildcard-containing tokens.

When constructing a wildcard-containing token, we find each wildcard and determine whether it could be interpreted as matching only non-delimiters, only delimiters, or both.

When constructing a wildcard-containing token, we also tokenize it based on the current interpretation of its wildcards. This may lead to creating a token that’s static text, a token that’s a variable, and/or a smaller wildcard-containing token. For example, if we were to tokenize the wildcard-containing token ?abc?123?, interpreting every ? as matching a delimiter, then we would end up with two tokens, abc and 123. abc is static text while 123 is an integer variable. Now if the central ? were interpreted as matching a non-delimiter, the only token generated would be abc?123, which could only match a dictionary variable.

As a result, we call the original wildcard-containing token a CompositeWildcardToken, since it can generate multiple smaller tokens based on the interpretation of its wildcards. We call each smaller wildcard-containing token a WildcardToken since it is not further divisible. Finally, we call each token that doesn’t contain a wildcard but matches a variable pattern an ExactVariableToken, in contrast with a WildcardToken.

When constructing a WildcardToken, we find all the variable patterns it can match, as well as whether it can match static text. Each case is an interpretation we must consider when generating subqueries.

Once tokenization is complete, we will already have an interpretation of wildcards and WildcardTokens from which we can generate a subquery. So the next step is to generate a subquery and then begin iterating.

The first layer of iteration is over the interpretation of each WildcardToken. Essentially, we change the interpretation of a single WildcardToken and then generate another subquery. We repeat this process until the chosen WildcardToken has no new interpretations, at which point we reset its interpretation and advance the interpretation of the next WildcardToken. This process works much like a counter (e.g., 00, 01, 10, 11): when a digit overflows, we increment the next-highest digit and resume counting from the lowest digit.

When we’ve exhausted all WildcardTokens, the second layer of iteration is the interpretation of each wildcard.
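
The counter-like iteration behaves like itertools.product over the per-token interpretation lists. In this sketch the interpretation labels are hypothetical names for the two WildcardTokens of the motivating example's query, and duplicates are dropped as they're generated:

```python
from itertools import product

# Hypothetical interpretation lists for the two WildcardTokens of
# "*task* took 0.3*"; each entry is (token, interpretation).
token_interpretations = [
    [("*task*", "dict-var"), ("*task*", "static-text")],
    [("0.3*", "float"), ("0.3*", "dict-var")],
]

subqueries = []
for combo in product(*token_interpretations):
    # product() advances the rightmost list first and "carries" into the
    # next list when it wraps around, exactly like a counter.
    if combo not in subqueries:    # deduplicate during generation
        subqueries.append(combo)
```

The four resulting combinations correspond to the four interpretations listed in the motivating example.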

When every iteration is complete, we will have a complete list of subqueries. However, some subqueries may be duplicates of each other. For instance, consider *abc*def?. When all wildcards are interpreted to match delimiters, one subquery we would generate is:

  • Dictionary variable queries: []

  • Encoded variable queries: []

  • Logtype query: *abc*def?

where both *abc* and *def? are interpreted as static text. Similarly, when the ? is interpreted to match non-delimiters, we could again generate the same subquery. Therefore, we deduplicate the subqueries during generation.

One final nuance of using the subqueries as described is that if a message matches a subquery, it does not guarantee that the message matches the original wildcard query. Consider Interpretation 1 from the motivating example:

  1. Interpretation 1:

    • Dictionary variable queries: ["*task*"]

    • Encoded variable queries: ["0.3*"]

    • Logtype query: *<dict-var>* took <float>*

And consider this encoded message:

  • Dictionary variables: ["task_12"]

  • Encoded variables: [0.4, 0.3]

  • Logtype: <dict-var> took <float> above <float>

We can see that this encoded message matches the subquery, but when decoded, it is "task_12 took 0.4 above 0.3", which does not match the original wildcard query *task* took 0.3*. This is because the subqueries as described don’t consider the position of query variables relative to the logtype query. A brute-force solution is simply to decode messages which match the subqueries and then perform a wildcard match with the original query. However, more efficient approaches exist and can be implemented when necessary.
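
The brute-force check can be sketched with Python's fnmatch, whose * and ? have the same semantics as the wildcards here (assuming the query contains no [ character, which fnmatch treats specially):

```python
from fnmatch import fnmatchcase

def confirm_match(decoded_message, wildcard_query):
    """After a subquery matches, decode the message and re-check it against
    the original wildcard query to rule out false positives."""
    return fnmatchcase(decoded_message, wildcard_query)

# The decoded message from the example above is a false positive: it matched
# the subquery, but "0.3" doesn't appear in the position the query requires.
confirm_match("task_12 took 0.4 above 0.3", "*task* took 0.3*")  # False
```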