CLP Connector
Overview
The CLP connector enables SQL-based querying of CLP archives via Presto. This document describes how to configure the connector for use with a CLP cluster and covers the details needed to understand how it works.
Configuration
To configure the CLP connector, create a catalog properties file etc/catalog/clp.properties with at least the following contents, modifying the properties as appropriate:
connector.name=clp
clp.metadata-provider-type=mysql
clp.metadata-db-url=jdbc:mysql://localhost:3306
clp.metadata-db-name=clp_db
clp.metadata-db-user=clp_user
clp.metadata-db-password=clp_password
clp.metadata-table-prefix=clp_
clp.split-filter-config=/path/to/split-filter-config.json
clp.split-filter-provider-type=mysql
clp.split-provider-type=mysql
Configuration Properties
The following configuration properties are available:
| Property Name | Description | Default |
|---|---|---|
| | Enables or disables support for polymorphic types in CLP, allowing the same field to have different types. This is useful for schema-less, semi-structured data where the same field may appear with different types. When enabled, type annotations are added to conflicting field names to distinguish between types. | |
| clp.metadata-provider-type | Specifies the metadata provider type. Currently, the only supported type is a MySQL database, which is also used by the CLP package to store metadata. Additional providers can be supported by implementing the ClpMetadataProvider interface. | |
| clp.metadata-db-url | The JDBC URL used to connect to the metadata database. This property is required when the MySQL metadata provider is used. | |
| clp.metadata-db-name | The name of the metadata database. This property is required when the MySQL metadata provider is used. | |
| clp.metadata-db-user | The database user with access to the metadata database. This property is required when the MySQL metadata provider is used. | |
| clp.metadata-db-password | The password for the metadata database user. This property is required when the MySQL metadata provider is used. | |
| clp.metadata-table-prefix | A string prefix prepended to all metadata table names when querying the database. Useful for namespacing or avoiding collisions. This property is required when the MySQL metadata provider is used. | |
| | Defines how long, in seconds, metadata entries remain valid before they need to be refreshed. | 600 |
| | Specifies how frequently metadata is refreshed from the source, in seconds. Set this to a lower value for frequently changing datasets or to a higher value to reduce load. | 60 |
| clp.split-filter-config | The absolute path to an optional split filter config file. See the Split Filter Config File section for details. | |
| clp.split-filter-provider-type | Specifies the split filter provider type. Currently, the only supported type is a MySQL database, which is also used by the CLP package to store metadata. Additional providers can be supported by implementing the ClpSplitFilterProvider interface. | |
| clp.split-provider-type | Specifies the split provider type. By default, it uses the same type as the metadata provider with the same connection parameters. Additional types can be supported by implementing the ClpSplitProvider interface. | |
Metadata and Split Providers
The CLP connector relies on metadata and split providers to retrieve information from various sources. By default, it uses a MySQL database for both metadata and split storage. We recommend using the CLP package for log ingestion, which automatically populates the database with the required information.
If you prefer to use a different source, or the same source with a custom implementation, you can provide your own implementations of the ClpMetadataProvider and ClpSplitProvider interfaces and configure the connector accordingly.
Split Filter Config File
The split filter config file allows you to configure the set of columns that can be used to filter out irrelevant splits (CLP archives) when querying CLP’s metadata database. This can significantly improve performance by reducing the amount of data that needs to be scanned. For a given query, the connector will translate any supported filter predicates that involve the configured columns into a query against CLP’s metadata database.
The configuration is a JSON object where each key under the root represents a scope and each scope maps to an array of filter configs.
Scopes
A scope can be one of the following:
A catalog name
A fully-qualified schema name
A fully-qualified table name
Filter configs under a particular scope will apply to all child scopes. For example, filter configs at the schema level will apply to all tables within that schema.
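The scope inheritance described above can be sketched as follows. This is only an illustration, not the connector's implementation: the function name applicable_filter_configs is hypothetical, and the sketch assumes the config file has already been parsed into a dictionary and that table names have exactly three dot-separated parts.

```python
def applicable_filter_configs(config, table_name):
    """Collect the filter configs that apply to a fully-qualified table name.

    `config` maps a scope (catalog, catalog.schema, or catalog.schema.table)
    to a list of filter configs. Parent scopes are visited first, so their
    filter configs are inherited by the table.
    """
    catalog, schema, _table = table_name.split(".", 2)
    scopes = [catalog, f"{catalog}.{schema}", table_name]
    collected = []
    for scope in scopes:
        collected.extend(config.get(scope, []))
    return collected


config = {
    "clp": [{"columnName": "level"}],
    "clp.default": [{"columnName": "author"}],
    "clp.default.table_1": [{"columnName": "file_name"}],
}

table_1_columns = [
    cfg["columnName"]
    for cfg in applicable_filter_configs(config, "clp.default.table_1")
]
table_2_columns = [
    cfg["columnName"]
    for cfg in applicable_filter_configs(config, "clp.default.table_2")
]
print(table_1_columns)  # ['level', 'author', 'file_name']
print(table_2_columns)  # ['level', 'author']
```

Here table_2 has no table-level entry, so it only receives the catalog-level and schema-level filter configs.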
Filter Configs
Each filter config indicates how a data column—i.e., a column in the Presto table—should be mapped to one or more metadata columns—i.e., columns in CLP’s metadata database.
For example, an integer data column (e.g., timestamp) may be remapped to a pair of metadata columns that represent the range of possible values (e.g., begin_timestamp and end_timestamp) of the data column within a split.
Each filter config has the following options:
columnName: The data column’s name.
customOptions (optional): Custom options for a split filter provider. Options for the default split filter provider (ClpMySqlSplitFilterProvider) are below.
required (optional, defaults to false): Whether the filter must be present in the generated metadata query. If a required filter is missing or cannot be added to the metadata query, the original query will be rejected.
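As a rough illustration of the required option, the sketch below rejects a query when a required filter column did not make it into the metadata query. The function name check_required_filters, the use of ValueError, and the idea of passing the pushed-down columns as a set are all assumptions made for this example; the connector's actual mechanism may differ.

```python
def check_required_filters(filter_configs, pushed_down_columns):
    """Raise if a required filter column is absent from the metadata query.

    `pushed_down_columns` is the set of data columns whose predicates were
    successfully translated into the metadata query (hypothetical input).
    """
    missing = [
        cfg["columnName"]
        for cfg in filter_configs
        if cfg.get("required", False)
        and cfg["columnName"] not in pushed_down_columns
    ]
    if missing:
        raise ValueError(f"query rejected; missing required filters: {missing}")


filter_configs = [
    {"columnName": "msg.timestamp", "required": True},
    {"columnName": "file_name"},  # "required" defaults to false
]

# Accepted: the required column is part of the metadata query.
check_required_filters(filter_configs, {"msg.timestamp"})

# Rejected: only the optional column was pushed down.
try:
    check_required_filters(filter_configs, {"file_name"})
    rejected = False
except ValueError:
    rejected = True
print(rejected)  # True
```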
ClpMySqlSplitFilterProvider-Specific Filter Config
For ClpMySqlSplitFilterProvider, the customOptions option of the filter config has the following sub-options:
rangeMapping (optional): an object with the following properties:
lowerBound: The metadata column that represents the lower bound of values in a split for the data column.
upperBound: The metadata column that represents the upper bound of values in a split for the data column.
Note
This option is only valid if the column has a numeric type.
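To make the idea of a range mapping concrete, the sketch below rewrites a comparison on a data column into a condition on the lowerBound and upperBound metadata columns, so that a split is kept only if its value range could contain a matching row. This is a sketch under stated assumptions, not the SQL that ClpMySqlSplitFilterProvider actually generates: the function name remap_comparison is hypothetical, and returning None for operators that a range alone cannot prune (such as !=) is a choice made for this example.

```python
def remap_comparison(op, value, range_mapping):
    """Rewrite `data_column <op> value` as a condition on the range columns.

    A split whose values lie in [lowerBound, upperBound] can contain a
    matching row only if the returned condition holds. Returns None when
    the range alone cannot rule out any split.
    """
    lower = range_mapping["lowerBound"]
    upper = range_mapping["upperBound"]
    if op == "=":
        # The value must fall inside the split's range.
        return f"{lower} <= {value} AND {upper} >= {value}"
    if op in (">", ">="):
        # Some value above the constant exists only if the maximum is above it.
        return f"{upper} {op} {value}"
    if op in ("<", "<="):
        # Some value below the constant exists only if the minimum is below it.
        return f"{lower} {op} {value}"
    return None


mapping = {"lowerBound": "begin_timestamp", "upperBound": "end_timestamp"}

print(remap_comparison(">", 100, mapping))  # end_timestamp > 100
print(remap_comparison("=", 5, mapping))    # begin_timestamp <= 5 AND end_timestamp >= 5
print(remap_comparison("!=", 5, mapping))   # None
```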
Filter Config Example
The following is an example split filter config file:
{
  "clp": [
    {
      "columnName": "level"
    }
  ],
  "clp.default": [
    {
      "columnName": "author"
    }
  ],
  "clp.default.table_1": [
    {
      "columnName": "msg.timestamp",
      "customOptions": {
        "rangeMapping": {
          "lowerBound": "begin_timestamp",
          "upperBound": "end_timestamp"
        }
      },
      "required": true
    },
    {
      "columnName": "file_name"
    }
  ]
}
The first key-value pair adds the following filter config for all schemas and tables under the clp catalog:
The column level is used as-is without remapping.
The second key-value pair adds the following filter config for all tables under the clp.default schema:
The column author is used as-is without remapping.
The third key-value pair adds two filter configs for the table clp.default.table_1:
The column msg.timestamp is remapped via a rangeMapping to the metadata columns begin_timestamp and end_timestamp, and is required to exist in every query.
The column file_name is used as-is without remapping.
If you prefer to use a different format for customOptions, you can provide your own implementation of the ClpSplitFilterProvider interface and configure the connector accordingly.
Supported SQL Expressions
The connector supports translations from a Presto SQL query to the split filter query for the following expressions:
Comparisons between variables and constants (e.g., =, !=, <, >, <=, >=).
Dereferencing fields from row-typed variables.
Logical operators: AND, OR, and NOT.
Data Types
The data type mappings are as follows:
| CLP Type | Presto Type |
|---|---|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| (others) | (unsupported) |
String Types
CLP uses three distinct string types: ClpString (strings with whitespace), VarString (strings without whitespace), and DateString (strings representing dates). Currently, all three are mapped to Presto’s VARCHAR type.
Array Types
CLP supports two array types: UnstructuredArray and StructuredArray. Unstructured arrays are stored as strings in CLP, and their elements can be of any type. Because Presto arrays are homogeneous, the elements are converted to strings when read. The StructuredArray type is not supported in Presto.
Object Types
CLP stores metadata using a global schema tree structure that captures all possible fields from various log structures. Internal nodes may represent objects containing nested fields as their children. In Presto, we map these internal object nodes to the ROW data type, including all subfields as fields within the ROW.
For instance, consider a table containing two distinct JSON log types:
Log Type 1:
{
  "msg": {
    "ts": 0,
    "status": "ok"
  }
}
Log Type 2:
{
  "msg": {
    "ts": 1,
    "status": "error",
    "thread_num": 4,
    "backtrace": ""
  }
}
In CLP’s schema tree, these two structures are combined into a unified internal node (msg) with four child nodes: ts, status, thread_num, and backtrace. In Presto, we represent this combined structure using the following ROW type:
ROW(ts BIGINT, status VARCHAR, thread_num BIGINT, backtrace VARCHAR)
Each JSON log maps to this unified ROW type, with absent fields represented as NULL. The child nodes (ts, status, thread_num, backtrace) become fields within the ROW, reflecting the nested and varying structures of the original JSON logs.
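The mapping from the two log types to one unified ROW can be mimicked in a few lines. This is only a conceptual sketch, not how CLP builds its schema tree or how the connector materializes rows: the helper names unified_fields and to_row are hypothetical, Python tuples stand in for ROW values, and None stands in for NULL.

```python
def unified_fields(logs, key):
    """Return the union of field names under `key`, in first-seen order."""
    fields = []
    for log in logs:
        for name in log.get(key, {}):
            if name not in fields:
                fields.append(name)
    return fields


def to_row(log, key, fields):
    """Project one log onto the unified field list; absent fields become None."""
    obj = log.get(key, {})
    return tuple(obj.get(name) for name in fields)


logs = [
    {"msg": {"ts": 0, "status": "ok"}},
    {"msg": {"ts": 1, "status": "error", "thread_num": 4, "backtrace": ""}},
]

fields = unified_fields(logs, "msg")
rows = [to_row(log, "msg", fields) for log in logs]

print(fields)   # ['ts', 'status', 'thread_num', 'backtrace']
print(rows[0])  # (0, 'ok', None, None)
print(rows[1])  # (1, 'error', 4, '')
```

Note that the empty backtrace string in the second log stays an empty string, while the fields missing from the first log become None.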
SQL support
The connector provides read-only access to data. It does not support DDL operations, such as creating or dropping tables. Currently, only one schema, named default, is supported.