CLP Connector¶

Overview¶

The CLP Connector enables SQL-based querying of CLP archives via Presto. This document describes how to configure the CLP Connector for use with a CLP cluster, as well as essential details for understanding the CLP connector.

Configuration¶

To configure the CLP connector, create a catalog properties file etc/catalog/clp.properties with at least the following contents, modifying the properties as appropriate:

connector.name=clp
clp.metadata-provider-type=mysql
clp.metadata-db-url=jdbc:mysql://localhost:3306
clp.metadata-db-name=clp_db
clp.metadata-db-user=clp_user
clp.metadata-db-password=clp_password
clp.metadata-table-prefix=clp_
clp.split-filter-config=/path/to/split-filter-config.json
clp.split-filter-provider-type=mysql
clp.split-provider-type=mysql

Configuration Properties¶

The following configuration properties are available:

Property Name	Description	Default
`clp.polymorphic-type-enabled`	Enables or disables support for polymorphic types in CLP, allowing the same field to have different types. This is useful for schema-less, semi-structured data where the same field may appear with different types. When enabled, type annotations are added to conflicting field names to distinguish between types. For example, if an `id` column appears with both an `int` and `string` types, the connector will create two columns named `id_bigint` and `id_varchar`. Supported type annotations include `bigint`, `varchar`, `double`, `boolean`, and `array(varchar)` (See Data Types for details). For columns with only one type, the original column name is used.	`false`
`clp.metadata-provider-type`	Specifies the metadata provider type. Currently, the only supported type is a MySQL database, which is also used by the CLP package to store metadata. Additional providers can be supported by implementing the `ClpMetadataProvider` interface.	`mysql`
`clp.metadata-db-url`	The JDBC URL used to connect to the metadata database. This property is required if `clp.metadata-provider-type` is set to `mysql`.
`clp.metadata-db-name`	The name of the metadata database. This option is required if `clp.metadata-provider-type` is set to `mysql` and the database name is not specified in the URL.
`clp.metadata-db-user`	The database user with access to the metadata database. This option is required if `clp.metadata-provider-type` is set to `mysql` and the database name is not specified in the URL.
`clp.metadata-db-password`	The password for the metadata database user. This option is required if `clp.metadata-provider-type` is set to `mysql`.
`clp.metadata-table-prefix`	A string prefix prepended to all metadata table names when querying the database. Useful for namespacing or avoiding collisions. This option is required if `clp.metadata-provider-type` is set to `mysql`.
`clp.metadata-expire-interval`	Defines how long, in seconds, metadata entries remain valid before they need to be refreshed.	600
`clp.metadata-refresh-interval`	Specifies how frequently metadata is refreshed from the source, in seconds. Set this to a lower value for frequently changing datasets or to a higher value to reduce load.	60
`clp.split-filter-config`	The absolute path to an optional split filter config file. See the Split Filter Config File section for details.
`clp.split-filter-provider-type`	Specifies the split filter provider type. Currently, the only supported type is a MySQL database, which is also used by the CLP package to store metadata. Additional providers can be supported by implementing the `ClpSplitFilterProvider` interface.	`mysql`
`clp.split-provider-type`	Specifies the split provider type. By default, it uses the same type as the metadata provider with the same connection parameters. Additional types can be supported by implementing the `ClpSplitProvider` interface.	`mysql`

Metadata and Split Providers¶

The CLP connector relies on metadata and split providers to retrieve information from various sources. By default, it uses a MySQL database for both metadata and split storage. We recommend using the CLP package for log ingestion, which automatically populates the database with the required information.

If you prefer to use a different source–or the same source with a custom implementation–you can provide your own implementations of the ClpMetadataProvider and ClpSplitProvider interfaces, and configure the connector accordingly.

Split Filter Config File¶

The split filter config file allows you to configure the set of columns that can be used to filter out irrelevant splits (CLP archives) when querying CLP’s metadata database. This can significantly improve performance by reducing the amount of data that needs to be scanned. For a given query, the connector will translate any supported filter predicates that involve the configured columns into a query against CLP’s metadata database.

The configuration is a JSON object where each key under the root represents a scope and each scope maps to an array of filter configs.

Scopes¶

A scope can be one of the following:

A catalog name
A fully-qualified schema name
A fully-qualified table name

Filter configs under a particular scope will apply to all child scopes. For example, filter configs at the schema level will apply to all tables within that schema.

Filter Configs¶

Each filter config indicates how a data column—i.e., a column in the Presto table—should be mapped to one or more metadata columns—i.e., columns in CLP’s metadata database.

For example, an integer data column (e.g., timestamp), may be remapped to a pair of metadata columns that represent the range of possible values (e.g., begin_timestamp and end_timestamp) of the data column within a split.

Each filter config has the following options:

columnName: The data column’s name.
customOptions (optional): Custom options for a split filter provider. Options for the default split filter provider (ClpMySqlSplitFilterProvider) are below.
required (optional, defaults to false): Whether the filter must be present in the generated metadata query. If a required filter is missing or cannot be added to the metadata query, the original query will be rejected.

ClpMySqlSplitFilterProvider-Specific Filter Config¶

For ClpMySqlSplitFilterProvider, the customOptions option of the filter config has the following sub-options:

rangeMapping (optional): an object with the following properties:

Note

This option is only valid if the column has a numeric type.
- lowerBound: The metadata column that represents the lower bound of values in a split for the data column.
- upperBound: The metadata column that represents the upper bound of values in a split for the data column.

Filter Config Example¶

The code block shows an example filter config file:

{
  "clp": [
    {
      "columnName": "level"
    }
  ],
  "clp.default": [
    {
      "columnName": "author"
    }
  ],
  "clp.default.table_1": [
    {
      "columnName": "msg.timestamp",
      "customOptions": {
        "rangeMapping": {
          "lowerBound": "begin_timestamp",
          "upperBound": "end_timestamp"
        }
      },
      "required": true
    },
    {
      "columnName": "file_name"
    }
  ]
}

The first key-value pair adds the following filter configs for all schemas and tables under the clp catalog:
- The column level is used as-is without remapping.
The second key-value pair adds the following filter configs for all tables under the clp.default schema:
- The column author is used as-is without remapping.
The third key-value pair adds two filter configs for the table clp.default.table_1:
- The column msg.timestamp is remapped via a rangeMapping to the metadata columns begin_timestamp and end_timestamp, and is required to exist in every query.
- The column file_name is used as-is without remapping.

If you prefer to use a different format for customOptions, you can provide your own implementation of the ClpSplitFilterProvider interface, and configure the connector accordingly.

Supported SQL Expressions¶

The connector supports translations from a Presto SQL query to the split filter query for the following expressions:

Comparisons between variables and constants (e.g., =, !=, <, >, <=, >=).
Dereferencing fields from row-typed variables.
Logical operators: AND, OR, and NOT.

Data Types¶

The data type mappings are as follows:

CLP Type	Presto Type
`Integer`	`BIGINT`
`Float`	`DOUBLE`
`ClpString`	`VARCHAR`
`VarString`	`VARCHAR`
`DateString`	`VARCHAR`
`Boolean`	`BOOLEAN`
`UnstructuredArray`	`ARRAY(VARCHAR)`
`Object`	`ROW`
(others)	(unsupported)

String Types¶

CLP uses three distinct string types: ClpString (strings with whitespace), VarString (strings without whitespace), and DateString (strings representing dates). Currently, all three are mapped to Presto’s VARCHAR type.

Array Types¶

CLP supports two array types: UnstructuredArray and StructuredArray. Unstructured arrays are stored as strings in CLP and elements can be any type. However, in Presto arrays are homogeneous, so the elements are converted to strings when read. StructuredArray type is not supported in Presto.

Object Types¶

CLP stores metadata using a global schema tree structure that captures all possible fields from various log structures. Internal nodes may represent objects containing nested fields as their children. In Presto, we map these internal object nodes to the ROW data type, including all subfields as fields within the ROW.

For instance, consider a table containing two distinct JSON log types:

Log Type 1:

{
  "msg": {
    "ts": 0,
    "status": "ok"
  }
}

Log Type 2:

{
  "msg": {
    "ts": 1,
    "status": "error",
    "thread_num": 4,
    "backtrace": ""
  }
}

In CLP’s schema tree, these two structures are combined into a unified internal node (msg) with four child nodes: ts, status, thread_num and backtrace. In Presto, we represent this combined structure using the following ROW type:

ROW(ts BIGINT, status VARCHAR, thread_num BIGINT, backtrace VARCHAR)

Each JSON log maps to this unified ROW type, with absent fields represented as NULL. The child nodes (ts, status, thread_num, backtrace) become fields within the ROW, clearly reflecting the nested and varying structures of the original JSON logs.

SQL support¶

The connector only provides read access to data. It does not support DDL operations, such as creating or dropping tables. Currently, we only support one default schema.