Grok Pattern Reference
Grok is a powerful tool for extracting structured data from unstructured text. Grok syntax is composed of reusable elements called Grok patterns that enable parsing of data such as timestamps, IP addresses, hostnames, log levels, and more. The prebuilt patterns make Grok easier to use than writing new regular expressions to extract structured data, especially for long text strings.
Mezmo has some unique requirements for using Grok patterns, such as restricting the use of literals and regex between patterns, as described in the section on Unique Mezmo Grok Patterns, which also describes alternative shortcuts to use instead.
Key Concepts for Using Grok
Grok patterns follow the syntax: %{PATTERN TO MATCH:output label}
- PATTERN TO MATCH: the pattern to match on. Many patterns are predefined and available for review within the Logstash repository.
  - Example: %{TIMESTAMP_ISO8601} extracts an ISO 8601 timestamp.
  - Example: %{IP} extracts any IPv4 or IPv6 value.
- output label: the name of the key to use for the parsed data (optional).
  - While the output label is optional, if you don't specify a label, the extracted data is dropped and is not included in the output.
  - Example: %{TIMESTAMP_ISO8601:timestamp} extracts an ISO 8601 timestamp and outputs the result with the output label timestamp: {"timestamp": "2022-09-27 18:00:00.000"}
By combining multiple Grok patterns you can parse a single line of text and extract the relevant information into structured data.
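To make the syntax concrete, here is a minimal Python sketch of how a %{PATTERN:label} expression could be expanded into an ordinary regular expression. It is not Mezmo's or Logstash's implementation: the `PATTERNS` table, `grok_to_regex`, and `grok_parse` are hypothetical names, and the regexes are simplified stand-ins for the real pattern definitions.

```python
import re

# Hypothetical, simplified pattern library for illustration only.
# Real Grok definitions (for example TIMESTAMP_ISO8601) are far more thorough.
PATTERNS = {
    "TIMESTAMP_ISO8601": r"\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}:\d{2}(?:\.\d+)?(?:Z|[+-]\d{2}:?\d{2})?",
    "WORD": r"\b\w+\b",
    "GREEDYDATA": r".*",
}

# Matches %{NAME} or %{NAME:label}
GROK_TOKEN = re.compile(r"%\{(\w+)(?::(\w+))?\}")


def grok_to_regex(expression: str) -> str:
    """Translate a Grok expression into a Python regular expression."""
    def expand(token: re.Match) -> str:
        name, label = token.group(1), token.group(2)
        body = PATTERNS[name]
        # Labelled patterns become named groups; unlabelled ones still have
        # to match, but their text is not captured (it is "dropped").
        return f"(?P<{label}>{body})" if label else f"(?:{body})"
    return GROK_TOKEN.sub(expand, expression)


def grok_parse(expression: str, line: str) -> dict:
    """Return the labelled fields, or an empty dict when the line does not match."""
    match = re.search(grok_to_regex(expression), line)
    return match.groupdict() if match else {}


line = "2022-09-27 18:00:00.000 started"
print(grok_parse("%{TIMESTAMP_ISO8601:timestamp} %{WORD}", line))
print(grok_parse("%{TIMESTAMP_ISO8601} %{WORD:status}", line))
```

The first call keeps only the timestamp and the second keeps only the word, which mirrors the rule that unlabelled patterns are matched but not included in the output.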
Unique Mezmo Grok Patterns
Standard Grok usage allows characters between patterns. These can include literal characters as well as regular expressions. Mezmo does not allow any characters between Grok patterns other than \s,;:- (whitespace, comma, semicolon, colon, and hyphen).
Instead, Mezmo recommends using one or more of the patterns described here to aid in jumping over interstitial data. When you use these patterns between recognized patterns, you will usually span the appropriate characters automatically.
Pattern | Usage |
---|---|
%{GREEDYDATA} | This pattern serves as an excellent starting point for building any Grok expression. This will match an entire line of data, or until you specify another expression to buffer it. This pattern is often used for the remainder of any line not parsed. |
%{DATA} | This pattern matches lazily, consuming anywhere from zero to many characters until it is followed by another expression or reaches the end of the line. This means you can use this pattern with other patterns to "bookend" a larger set of interstitial data. |
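The difference between the two patterns in the table above is easiest to see with their underlying expressions, `.*` (greedy) and `.*?` (lazy). The following Python sketch uses plain regular expressions rather than a Grok engine, and the sample line is made up for illustration.

```python
import re

line = "alpha: beta: gamma"

# %{GREEDYDATA:head}: %{GREEDYDATA:tail}
# `.*` takes as much as possible, so head runs up to the LAST ": ".
greedy = re.search(r"(?P<head>.*): (?P<tail>.*)", line).groupdict()

# %{DATA:head}: %{GREEDYDATA:tail}
# `.*?` takes as little as possible, so head stops at the FIRST ": ".
lazy = re.search(r"(?P<head>.*?): (?P<tail>.*)", line).groupdict()

print(greedy)  # {'head': 'alpha: beta', 'tail': 'gamma'}
print(lazy)    # {'head': 'alpha', 'tail': 'beta: gamma'}
```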
Sometimes you will be unable to span the appropriate characters using these patterns. You can also employ the following patterns to aid in more specific advancement:
Pattern | Usage |
---|---|
%{SPACE} | Use this pattern to explicitly jump over whitespace if needed. |
%{NOTSPACE} | This pattern is useful for skipping sets of characters between whitespaces. |
%{WORD} | This will match whole words within text. |
%{QS} or %{QUOTEDSTRING} | Use this pattern to extract sets of characters between quotes. |
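As a rough illustration of how these more specific patterns advance through a line, the Python sketch below chains plain-regex stand-ins for WORD, SPACE, NOTSPACE, and QS. The `QS` expression here is a simplified, double-quote-only assumption, and the sample line is hypothetical.

```python
import re

# Plain-regex stand-ins for the Grok patterns. QS is a simplified
# assumption that only handles double quotes without escapes.
SPACE = r"\s*"
NOTSPACE = r"\S+"
WORD = r"\b\w+\b"
QS = r'"[^"]*"'

line = 'GET    /index.html "Mozilla/5.0 (X11)"'

# %{WORD:method}%{SPACE}%{NOTSPACE:path}%{SPACE}%{QS:agent}
regex = rf"(?P<method>{WORD}){SPACE}(?P<path>{NOTSPACE}){SPACE}(?P<agent>{QS})"
fields = re.search(regex, line).groupdict()
print(fields)
```

In this sketch SPACE absorbs the run of four spaces, NOTSPACE stops at the next whitespace, and the quoted-string capture keeps its surrounding quotes.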
Usage Examples
Simple Log Line
Given this log line, parse it into the individual data fields:
2022-12-23T18:33:21Z | WARN | User object with the id ’420’ was not found
This example includes a timestamp, a log level, and a log message. We will tackle parsing this line iteratively to see the results as we go.
We can start with %{GREEDYDATA:message} to get the whole line. Subsequent iterations add to the pattern to extract the individual values.
Iteration 1
Pattern: %{GREEDYDATA:message}
Output:
{
"message":"2022-12-23T18:33:21Z | WARN | User object with the id ’420’ was not found"
}
Iteration 2
Extract the timestamp from the beginning of the line, which is a standard ISO timestamp.
Note that we added a space between the timestamp pattern and greedy data. Spaces between patterns are treated as literals in Grok.
Pattern: %{TIMESTAMP_ISO8601:timestamp} %{GREEDYDATA:message}
Output:
This yields the timestamp as a separate field and the remainder of the line in the message field:
{
"message":"| WARN | User object with the id ’420’ was not found",
"timestamp":"2022-12-23T18:33:21Z"
}
Iteration 3
Now we need to get the WARN and assign it to the log level. This means we'll need an interstitial pattern that is not greedy, so we'll use %{DATA}.
Pattern: %{TIMESTAMP_ISO8601:timestamp} %{DATA} %{GREEDYDATA:message}
Output:
Unfortunately, because %{DATA} is a lazy pattern, the spaces between the pipes and the log level, in combination with the %{GREEDYDATA} pattern, mean that Grok doesn't try to jump to the message. We don't yet get the desired result for the level field. We need to help Grok step over the vertical pipes manually.
{
"message":"| WARN | User object with the id ’420’ was not found",
"timestamp":"2022-12-23T18:33:21Z"
}
Iteration 4 (Final)
Let's use the %{WORD} pattern combined with the %{DATA} pattern to see if we can extract the level.
Pattern: %{TIMESTAMP_ISO8601:timestamp} %{DATA} %{WORD:level} %{DATA} %{GREEDYDATA:message}
Output:
Now we've got the full object, and the pipes are dropped because they had no associated label defined.
{
"level":"WARN",
"message":"User object with the id ’420’ was not found",
"timestamp":"2022-12-23T18:33:21Z"
}
We could also have used %{DATA} instead of %{WORD} in this example. However, because the log level is always a single word, we used %{WORD} to illustrate that pattern.
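The final expression can be approximated outside of Mezmo with ordinary regular expressions. The Python sketch below substitutes simplified regexes for each Grok pattern (the `TIMESTAMP_ISO8601` stand-in in particular is an assumption and much narrower than the real definition), so it shows the matching logic rather than the behavior of any specific Grok engine.

```python
import re

# Simplified regex stand-ins for the Grok patterns (assumptions for this sketch).
TIMESTAMP_ISO8601 = r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(?:Z|[+-]\d{2}:?\d{2})?"
DATA = r".*?"
WORD = r"\b\w+\b"
GREEDYDATA = r".*"

# %{TIMESTAMP_ISO8601:timestamp} %{DATA} %{WORD:level} %{DATA} %{GREEDYDATA:message}
# Unlabelled patterns become non-capturing groups, so the pipes are matched but dropped.
regex = (
    rf"(?P<timestamp>{TIMESTAMP_ISO8601}) (?:{DATA}) "
    rf"(?P<level>{WORD}) (?:{DATA}) (?P<message>{GREEDYDATA})"
)

line = "2022-12-23T18:33:21Z | WARN | User object with the id ’420’ was not found"
fields = re.search(regex, line).groupdict()
print(fields)
```

Each lazy `.*?` expands just far enough to consume one pipe, because the literal space that follows it must line up with the space after the pipe.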
Hypothetical for Embedded Metrics
Let's take an example where metrics are being included within a log line and need to be explicitly extracted.
Iteration 1
Pattern: %{SYSLOGTIMESTAMP:timestamp} %{SYSLOGHOST:host} %{DATA:service} %{LOGLEVEL:level} %{WORD:method} %{NUMBER:response} %{GREEDYDATA:message}
Output:
At first try, we can pull the timestamp, host, service, log level, method, and response quickly while greedy data captures the remainder. Next, we need to get the URI and the metric value we want at the end.
Note that the JSON view does not show escape characters that are present in the raw data.
{
"host":"api-gateway-23",
"level":"info",
"method":"GET",
"message":"/api/transactions?offset=0&limit=999 18.580795ms",
"response":"200",
"service":"apigateway",
"timestamp":"Mar 23 14:46:29"
}
Iteration 2
Pattern: %{SYSLOGTIMESTAMP:timestamp} %{SYSLOGHOST:host} %{DATA:service} %{LOGLEVEL:level} %{WORD:method} %{NUMBER:response} %{URIPATH:path}%{URIPARAM:params} %{NUMBER:value}%{NOTSPACE:unit}
Output:
We need to extract the path of the URI and the parameters separately, as both are within the string. That leaves only the value with its unit of milliseconds (ms).
The value is not an integer but a floating-point number. The NUMBER pattern captures numeric values including the trailing decimals, which ensures we don't grab the ms at the end. If we want to preserve ms as the unit, we can use the NOTSPACE pattern, which grabs any run of non-whitespace characters.
Note that we did not leave a space between the URI patterns or between the NUMBER and NOTSPACE patterns because no space exists at those points in the line.
{
"host":"api-gateway-23",
"level":"info",
"method":"GET",
"params":"?offset=0&limit=999",
"path":"/api/transactions",
"response":"200",
"service":"apigateway",
"timestamp":"Mar 23 14:46:29",
"unit":"ms",
"value":"18.580795"
}
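For readers who want to experiment with this expression, the following Python sketch approximates it with plain regular expressions. The input line is hypothetical: it is reconstructed from the output fields above, and the real raw line may differ (for example, it may contain escape characters). The regexes are simplified stand-ins, not the actual Grok definitions.

```python
import re

# Hypothetical input, reconstructed from the output fields above; the
# original raw line is not shown and may contain escape characters.
line = ("Mar 23 14:46:29 api-gateway-23 apigateway info GET 200 "
        "/api/transactions?offset=0&limit=999 18.580795ms")

# Simplified stand-ins for the Grok patterns (assumptions, not the real definitions).
regex = re.compile(
    r"(?P<timestamp>[A-Z][a-z]{2} +\d{1,2} \d{2}:\d{2}:\d{2}) "  # SYSLOGTIMESTAMP
    r"(?P<host>[\w.-]+) "                                        # SYSLOGHOST
    r"(?P<service>.*?) "                                         # DATA
    r"(?P<level>[A-Za-z]+) "                                     # LOGLEVEL
    r"(?P<method>\w+) "                                          # WORD
    r"(?P<response>[+-]?\d+(?:\.\d+)?) "                         # NUMBER
    r"(?P<path>/[^?\s]*)"                                        # URIPATH
    r"(?P<params>\?\S*) "                                        # URIPARAM
    r"(?P<value>[+-]?\d+(?:\.\d+)?)"                             # NUMBER
    r"(?P<unit>\S+)"                                             # NOTSPACE
)

fields = regex.search(line).groupdict()
for key in sorted(fields):
    print(f"{key}: {fields[key]}")
```

As in the Grok expression, there is no separator between the path and params groups or between the value and unit groups, because the line has no space at those points.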
Currently, Mezmo's Grok patterns do not support the additional type capability for numbers and other data types, such as %{NUMBER:field:int}, where int represents the output format in the parsed data. You will need to add an additional processor to coerce the data into the appropriate format after parsing.
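The coercion step itself is simple. The Python sketch below shows the equivalent logic on a subset of the parsed fields from the example above; it is only an illustration of what a downstream step needs to do, not a Mezmo processor, and `COERCIONS` and `coerce` are hypothetical names.

```python
# A subset of the parsed output from the example above: every value is a string.
parsed = {"response": "200", "value": "18.580795", "unit": "ms", "method": "GET"}

# Hypothetical mapping of field name to target type, standing in for the
# %{NUMBER:field:int} style suffix that is not available.
COERCIONS = {"response": int, "value": float}


def coerce(fields: dict, coercions: dict) -> dict:
    """Return a copy of fields with the listed keys converted; leave others untouched."""
    result = dict(fields)
    for key, convert in coercions.items():
        if key in result:
            try:
                result[key] = convert(result[key])
            except ValueError:
                # Keep the original string when the value is not numeric.
                pass
    return result


typed = coerce(parsed, COERCIONS)
print(typed)
```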
Mezmo-supported Grok Patterns
This is not an exhaustive list. You can find additional pattern information in the Other Resources section if you need further reference.
Custom Patterns
These patterns are specific to Mezmo and can be very useful in common scenarios involving extracting data from logs.
Many of these patterns work in specific circumstances, but patterns like JSON_OBJECT and JSON_ARRAY are not limited to a single object within a string, and they do not count the number of occurrences.
For example, if you have multiple JSON objects in a string, the pattern returns both objects along with the characters between them. We recommend separating the objects using the Parse Processor before using the JSON_OBJECT pattern to prevent the return of invalid JSON.
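The effect is easy to reproduce with the same greedy expression in plain Python. This is a sketch with a made-up input line; the split on spaces simply stands in for whatever prior parsing step separates the objects.

```python
import json
import re

# Same idea as the JSON_OBJECT expression: greedy match between curly braces.
JSON_OBJECT = r"\{.*\}"

line = 'first={"id":1} second={"id":2} done'

# The greedy match runs from the first "{" to the LAST "}".
matched = re.search(JSON_OBJECT, line).group(0)
print(matched)  # {"id":1} second={"id":2}

try:
    json.loads(matched)
    is_valid = True
except json.JSONDecodeError:
    is_valid = False
print(is_valid)  # False

# Separating the line first gives each object its own match.
objects = []
for part in line.split(" "):
    found = re.search(JSON_OBJECT, part)
    if found:
        objects.append(json.loads(found.group(0)))
print(objects)  # [{'id': 1}, {'id': 2}]
```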
Pattern | Expression | Usage |
---|---|---|
ANY_CHAR | r#"."# | Matches any single valid character. Useful for advancing by just one character until the next pattern. |
CURLY_BRACKET | r#"{|}"# | Matches any single { open or } closed curly bracket. |
DOUBLE_QUOTE | r#"""# | Matches any single " double quote. |
JSON_ARRAY | r#"\[.*\]"# | Matches any embedded JSON array between square brackets. Note that this will match all arrays, not just one. If multiple arrays are present, separate them in a prior parse first. |
JSON_OBJECT | r#"{.*}"# | Matches any embedded JSON object between curly braces. Note that this will match all objects, not just one. If multiple objects are present, separate them in a prior parse first. |
OPTIONAL_SENTENCE | r#"[\p{L},":;\s\-]*"# | Matches any human-readable sentence with UTF-8 characters that includes words, spaces, and punctuation. It is greedy and, because it accepts zero characters, it also matches when no such text is present. |
PERIOD_CHAR | r#"\."# | Matches any single . period character. |
SENTENCE | r#"[\p{L},":;\s\-]+"# | Matches any human-readable sentence with UTF-8 characters that includes words, spaces, and punctuation. |
SINGLE_QUOTE | r#"'"# | Matches any ' single quote character. |
SINGLE_SPACE | r#"\s"# | Matches a single space. Note that this is different than the standard SPACE pattern, which is greedy and will return multiple spaces or tabs if they are present. |
TAB | r#"\t"# | Matches any tab character, which may be present in the form \t . |
UTF8_WORD | r#"\p{L}+"# | Matches one or more consecutive Unicode letters. |
XML | r#"<[\w"=\s]+>.*<\/[\w\s]+>"# | Matches any embedded XML object between an opening and a closing tag. Note that this will match all objects, not just one. If multiple objects are present, separate them in a prior parse first. |
Common Patterns
These patterns are frequently used within Grok expressions.
Note that certain Grok patterns are aliases or a combination of other patterns. They can be used as shortcuts however for extracting more
Pattern | Expression | Usage |
---|---|---|
USERNAME or USER | [a-zA-Z0-9._-]+ | Matches content that contains letters, digits, plus the characters . _-. |
EMAILLOCALPART | [a-zA-Z][a-zA-Z0-9_.+-=:]+ | Matches the characters before the @ symbol in an email address. |
EMAILADDRESS | %{EMAILLOCALPART}@%{HOSTNAME} | Matches email address formats where the pattern is <account_name>@<domain.com> . |
HTTPDUSER | %{EMAILADDRESS}|%{USER} | Matches either an email address or username. Especially useful for cases where usernames may be a mix of either type. |
INT | (?:[+-]?(?:[0-9]+)) | Matches integer (whole) numbers. |
BASE10NUM | (?<![0-9.+-])(?>[+-]?(?:(?:[0-9]+(?:\.[0-9]+)?)|(?:\.[0-9]+))) | Matches decimal numbers. |
NUMBER | (?:%{BASE10NUM}) | Matches decimal numbers, with or without a fractional part. This is an alias for BASE10NUM. |
BASE16NUM | (?<![0-9A-Fa-f])(?:[+-]?(?:0x)?(?:[0-9A-Fa-f]+)) | Matches hexadecimal numbers (0-9,A-F). |
BASE16FLOAT | \b(?<![0-9A-Fa-f.])(?:[+-]?(?:0x)?(?:(?:[0-9A-Fa-f]+(?:\.[0-9A-Fa-f]*)?)|(?:\.[0-9A-Fa-f]+)))\b | Matches hexadecimal floating point numbers. |
POSINT | \b(?:[1-9][0-9]*)\b | Matches positive integers only (zero is excluded). |
NONNEGINT | \b(?:[0-9]+)\b | Matches non-negative integers including zero. |
WORD | \b\w+\b | Matches letters, digits, and underscores (_) separated by spaces. |
NOTSPACE | \S+ | Matches any character that is not a whitespace character (spaces, tabs, line breaks). |
SPACE | \s* | Matches zero or more of any whitespace character (spaces, tabs, line breaks). |
DATA | .*? | Matches any character between zero and unlimited times, as few times as possible, expanding as needed. |
GREEDYDATA | .* | Matches zero or multiple characters until the end of the line |
QS or QUOTEDSTRING | (?>(?<!\\)(?>"(?>\\.|[^\\"]+)+"|""|(?>'(?>\\.|[^\\']+)+')|''|(?>`(?>\\.|[^\\`]+)+`)|``)) | Matches content between quotes. Note that this pattern can return double quotes in the output. |
UUID | [A-Fa-f0-9]{8}-(?:[A-Fa-f0-9]{4}-){3}[A-Fa-f0-9]{12} | Matches any standard set of characters in a universally unique identifier (UUID). |
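Several of the expressions in this table can be tried directly in Python, which helps when deciding between similar patterns such as INT, POSINT, and NONNEGINT. The sketch below copies only expressions that Python's `re` module accepts unchanged; the `matches` helper and the sample values are hypothetical.

```python
import re

# Expressions copied from the table above (only ones Python's re accepts as-is).
INT = r"(?:[+-]?(?:[0-9]+))"
POSINT = r"\b(?:[1-9][0-9]*)\b"
NONNEGINT = r"\b(?:[0-9]+)\b"
UUID = r"[A-Fa-f0-9]{8}-(?:[A-Fa-f0-9]{4}-){3}[A-Fa-f0-9]{12}"


def matches(pattern: str, text: str) -> bool:
    """True when the whole text matches the pattern."""
    return re.fullmatch(pattern, text) is not None


print(matches(INT, "-42"))        # True: a sign is allowed
print(matches(POSINT, "-42"))     # False: no sign allowed
print(matches(POSINT, "0"))       # False: must start with 1-9
print(matches(NONNEGINT, "0"))    # True: zero is included
print(matches(UUID, "123e4567-e89b-12d3-a456-426614174000"))  # True
```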
Other Resources
- The Logstash repository for supported Grok patterns.
- Grokdebugger.com, a site for building and testing Grok patterns.
- Elastic tutorial on using Grok.