Schema file syntax#
The schema file is used to specify the delimiters and variable patterns for compressing and searching logs. Logs are tokenized using delimiters and each token is classified as a variable or static text. Variable types are assigned a name and a pattern defined by the regex rules in the schema file. Some variable names are keywords and variables with these names are treated specially. Anything not matching a variable type is considered to be static text.
Example Schema File#
// Delimiters
delimiters: \t\r\n:,!;%
// Keywords
timestamp:\d{4}\-\d{2}\-\d{2} \d{2}:\d{2}:\d{2}(\.\d{3}){0,1}
timestamp:\[\d{8}\-\d{2}:\d{2}:\d{2}\]
int:\-{0,1}[0-9]+
float:\-{0,1}[0-9]+\.[0-9]+
// Custom variables
hex:[a-fA-F]+
hasNumber:.*\d.*
equals:.*=.*[a-zA-Z0-9].*
- delimiters: \t\r\n:,!;%indicates that- \t,- \r,- \n,- :,- ,,- !,- ;,- %, and- 'are delimiters. Note, this is not a regular expression but a collection of characters; every character specified is treated as a delimiter. In a log file, consecutive delimiters, e.g., N consecutive spaces, are stored as static text. Currently, at least one delimiter must be specified, and multiple delimiter specifications on separate lines are allowed.
- Keywords and custom variables are designated by specifying - typeName:regexPatternwhere- typeNameconsists of alphanumeric characters and is followed by a- :.- regexPatterncannot contain delimiters (except- timestamp) as this is currently incompatible with search. The supported regex for- regexPatternis specified in the section below. The same- typeNamemay show up on multiple lines allowing for it to be associated with multiple- regexPatternspecifications.- Although keywords are defined the same as variables, they are treated differently. 
- If a variable in a log matches multiple keyword/variable patterns with different names, the pattern specified earlier in the schema file is prioritized. 
 
- timestampis a keyword and its pattern may contain delimiters. Timestamps at the start of a log file or after a newline in a log file indicate the beginning of a new log message. If the log file contains no timestamp at the start of the file then a newline is used to indicate the beginning of a new log message. Timestamp patterns are not matched midline and are not stored as dictionary variables as they may contain delimiters.
- intand- floatare keywords. These are encoded specially for compression performance.
Supported Regex#
REGEX RULE   DEFINITION
ab           Match 'a' followed by 'b'
a|b          Match a OR b
[a-z]        Match any character in the brackets (e.g., any lowercase letter)
             - special characters must be escaped, even in brackets (e.g., [\.\(\\])
[^a-zA-Z]    Match any character NOT in the brackets (e.g., non-alphabet character)
a*           Match 'a' 0 or more times
a+           Match 'a' 1 or more times
a{N}         Match 'a' exactly N times
a{N,M}       Match 'a' between N and M times
(abc)        Subexpression (concatenates abc)
\d           Match any digit 0-9
\s           Match any whitespace character (' ', '\r', '\t', '\v', or '\f')
.            Match any *non-delimiter* character
Regex rules are listed in order of operation
Known issues#
Below are current known issues/limitations with schema support. Open an issue on GitHub if you require assistance.
- Wildcard search queries for archives compressed with a schema file currently consider - *as representing 0 or more non-delimiter characters.
- We currently only support ASCII characters. Future support will include UTF-8 encodings. 
- Timestamps must appear at the start of the message. 
- There is currently no way to specify text around a variable that is used for context but is not part of the variable. 
