Sinks: where collected data is sent

Chalk allows you to configure reports (what key data is collected), sinks (where reports are sent), and custom keys (what entirely new data reports can be made to collect).

This guide is about sinks (where reports are sent).

The default_out sink

By default, reports are sent to the default_out sink, which appends entries (in JSON-newline format) to ~/.local/chalk/chalk.log.

Unsubscribing from default_out

You can disable this by creating a .c4m file that unsubscribes the default_out sink from the "report" topic.

$ cat null.c4m
unsubscribe("report", "default_out")

Then have Chalk load it.

chalk load ./null.c4m

This isn’t very useful on its own, since reports are now not written anywhere at all. But it’s useful if we want to disable the default and have reports go somewhere else.

The file sink type

We can send reports to another path by creating a new sink and subscribing it to the "report" topic.

$ cat myfile.c4m
sink_config my_file_out {
  enabled: true
  sink: "file"
  filename: "/tmp/chalkdata.jsonl"
}

subscribe("report", "my_file_out")

If we chalk load ./myfile.c4m and that was the only config we had loaded, Chalk reports would get sent both to the default (~/.local/chalk/chalk.log) and to /tmp/chalkdata.jsonl.

File-specific sink parameters in detail

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| filename | string | yes | The file name for the output. |
| log_search_path | list[string] | no | An ordered list of directories in which the file may live. |
| use_search_path | bool | no | Controls whether or not to use the log_search_path at all. Defaults to true. |

The log file consists of rows of JSON objects (the jsonl format).
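Because each row is a standalone JSON value, the log can be read back one line at a time. The following is a minimal sketch in Python, not part of Chalk itself; the helper name read_reports is hypothetical, and it makes no assumption about what each decoded row contains.

```python
import json
import os


def read_reports(path):
    """Yield one decoded JSON value per non-empty line of a jsonl file."""
    # expanduser lets callers pass paths such as ~/.local/chalk/chalk.log
    with open(os.path.expanduser(path), encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines rather than failing on them
                yield json.loads(line)
```

For example, `for report in read_reports("/tmp/chalkdata.jsonl"): print(report)` would print each row written by the my_file_out sink above.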

The log_search_path is a list of paths that the system will march down, trying to find a place where it can open the named sink, skipping directories where there isn’t write permission. If no value is provided, the default is ["/var/log/", "~/log/", "."].

If the filename parameter has a slash in it, it will always be tried first, before the search path is checked.

If nothing in the search path is openable, or if no search path was given, and the file location was not writable, the system tries to write to a temporary file as a last resort.

If use_search_path is false, the system just looks at the filename field; if it’s a relative path, it resolves it based on the current working directory. In this mode, if the log file cannot be opened, then the sink configuration will error when used.
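The lookup rules above can be sketched in Python as follows. This is only an illustration of the described behavior, not Chalk's implementation: the function names are hypothetical, and joining each search directory with the base name of filename is an assumption.

```python
import os
import tempfile

DEFAULT_SEARCH_PATH = ["/var/log/", "~/log/", "."]


def can_append(path):
    """Return True if `path` can be opened for appending (creates it if missing)."""
    try:
        with open(path, "a"):
            return True
    except OSError:
        return False


def resolve_log_path(filename, search_path=None, use_search_path=True):
    """Pick the file a sink would write to, following the rules described above."""
    if not use_search_path:
        # Only the filename is considered; relative paths resolve against the cwd.
        path = os.path.abspath(os.path.expanduser(filename))
        if not can_append(path):
            raise OSError(f"cannot open log file: {path}")
        return path

    candidates = []
    if "/" in filename:
        # A filename containing a slash is always tried first.
        candidates.append(os.path.expanduser(filename))
    dirs = DEFAULT_SEARCH_PATH if search_path is None else search_path
    for d in dirs:
        # Assumption: only the base name is combined with each directory.
        candidates.append(os.path.join(os.path.expanduser(d), os.path.basename(filename)))

    for candidate in candidates:
        if can_append(candidate):  # unwritable or missing directories are skipped
            return candidate

    # Last resort: a temporary file.
    fd, tmp_path = tempfile.mkstemp(suffix=".jsonl")
    os.close(fd)
    return tmp_path
```

Calling `resolve_log_path("chalkdata.jsonl", ["/tmp/"])` would return /tmp/chalkdata.jsonl on a system where /tmp is writable.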

The rotating_log sink type

If you’d like to write to files of bounded sizes you can use the rotating_log sink type. This sink behaves like a ring buffer.

Here’s an example to show how it works (though you won’t ever want a max this small in the real world).

$ cat myrotatinglog.c4m
sink_config my_rotating_log_out {
  enabled: true
  sink: "rotating_log"
  filename: "chalkdata.jsonl"
  max: <<10kib>>
  log_search_path: ["/tmp/"]
}

subscribe("report", "my_rotating_log_out")

Rotating log-specific parameters in detail

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| filename | string | true | The name to use for the log file. |
| max | Size | true | The size at which truncation should occur. |
| log_search_path | list[string] | false | An ordered list of directories for the file to live. |
| truncation_amount | Size | false | The target size to which the log file should be truncated. |

When the file size reaches the max threshold (in bytes), the file is truncated: records are removed until at least truncation_amount bytes of data have been deleted. If the truncation_amount field is not provided, it is set to 25% of max.

The log file consists of rows of JSON objects (the jsonl format). When we delete, we delete full records, from oldest to newest. Since we delete full records, we may delete slightly more than the truncation amount specified as a result.

The deletion process guards against catastrophic failure by copying undeleted data into a new, temporary log file, and swapping it into the destination file once finished. As a result, you should assume you need 2x the value of max available in terms of disk space.

max and truncation_amount should be Size objects (e.g., << 100mb >>).
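To make the truncation behavior concrete, here is a minimal Python sketch of the process described above: whole records are dropped from the oldest end, the survivors are copied to a temporary file, and that file is swapped into place. It is an illustration only, not Chalk's code; the function name is hypothetical and sizes are plain byte counts rather than Size objects.

```python
import os
import tempfile


def truncate_if_needed(path, max_size, truncation_amount=None):
    """Drop the oldest whole records once `path` reaches `max_size` bytes.

    Returns the number of bytes removed (0 if no truncation happened).
    """
    if truncation_amount is None:
        truncation_amount = max_size // 4  # default described above: 25% of max
    if os.path.getsize(path) < max_size:
        return 0

    removed = 0
    directory = os.path.dirname(os.path.abspath(path))
    # Write survivors to a temp file in the same directory, then swap it in.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with open(path, "rb") as src, os.fdopen(fd, "wb") as dst:
        for line in src:
            if removed < truncation_amount:
                removed += len(line)  # drop this (oldest) record in full
            else:
                dst.write(line)
    os.replace(tmp_path, path)
    return removed
```

Because only whole records are dropped, a log of ten 100-byte records with a max of 1000 loses three records (300 bytes) rather than exactly 250. While the copy is in progress both files exist on disk, which is why roughly twice max may be needed.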

The s3 sink type

We can also send reports to S3-compatible systems.

Let’s grab rclone, a single binary that will give us a local S3 API.

Create a directory for rclone to serve from, and a bucket within that directory.

mkdir ./data # Directory to serve from
mkdir ./data/myorg # A bucket

And run rclone with auth settings to mimic a production API.

rclone serve s3 --auth-key "mykey,mysecret" ./data

Now in another window create a config for an S3 sink.

$ cat mys3.c4m
sink_config my_s3_out {
  enabled: true
  sink: "s3"
  endpoint: "http://localhost:8080"
  uri: "s3://myorg/chalkdata/report.json"
  uid: "mykey"
  secret: "mysecret"
  region: "us-east-1"
}

subscribe("report", "my_s3_out")

Load it up in Chalk and Chalk a binary.

$ chalk load ./mys3.c4m
$ cp $(which cp) cp2
$ chalk insert ./cp2
$ ls ./data/myorg/chalkdata
1778607576222-BF1YCE5Y6PRQ9BDCD6PP1VB3P0-report.json

Amazon S3 vs S3-compatible APIs

If you are publishing to an S3-compatible API you must set the endpoint field. Set it to the base URL of the service and keep uri in the normal s3://bucket-name/object-path form.

The region field is still required by the AWS SigV4 signing algorithm; for most S3-compatible stores any non-empty value (e.g. "us-east-1") works.

AWS environment variables

While the S3 sink will not automatically read your existing AWS environment variables, you can forward them within the Chalk config with the env() builtin.

sink_config s3_sink_config {
  enabled: true
  sink:    "s3"
  region:  env("AWS_REGION")
  uri:     env("AWS_S3_BUCKET_URI")
  uid:     env("AWS_ACCESS_KEY_ID")
  secret:  env("AWS_SECRET_ACCESS_KEY")
}

S3-specific sink parameters in detail

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| uid | string | true | A valid AWS access key ID |
| secret | string | true | A valid AWS secret access key |
| token | string | false | AWS session token |
| uri | string | true | The URI for the bucket in s3: format; see below |
| region | string | true | The AWS region (or any non-empty string for S3-compatible stores) |
| extra | string | false | A prefix added to the object path within the bucket |
| endpoint | string | false | Custom S3-compatible endpoint URL (e.g. http://localhost:9000 for MinIO or http://localhost:8080 for rclone serve s3). When omitted, the standard AWS endpoint is used. |

To ensure uniqueness, each run of chalk constructs a unique object name. Here are the components:

  1. An integer consisting of the machine’s local time in ms
  2. A 26-character cryptographically random ID (using a base32 character set)
  3. The value of the extra field, if provided.
  4. Anything provided in the uri field after the host.

These items are separated by dashes.

The timestamp goes first to ensure files are listed in a sane order.

The user is responsible for making sure the last two values are valid. They are not checked, and the operation will fail if they are invalid.
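As a rough illustration of how such a name could be assembled, here is a Python sketch that produces names shaped like the one in the rclone example earlier (chalkdata/1778607576222-BF1YCE5Y6PRQ9BDCD6PP1VB3P0-report.json). Several details are assumptions rather than documented behavior: the exact base32 alphabet, the placement of extra, and the way the directory part of the path is kept in front of the generated name. The function name is hypothetical.

```python
import secrets
import time

# Assumption: a Crockford-style base32 alphabet (no I, L, O, U).
ALPHABET = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"


def object_name(uri_path, extra=""):
    """Build '<dir>/<ms timestamp>-<random id>[-<extra>]-<basename>'."""
    timestamp_ms = str(time.time_ns() // 1_000_000)  # local clock, in ms
    random_id = "".join(secrets.choice(ALPHABET) for _ in range(26))
    # Split the part of the URI after the bucket into directory and base name.
    directory, _, basename = uri_path.rpartition("/")
    parts = [timestamp_ms, random_id]
    if extra:
        parts.append(extra)
    parts.append(basename)
    name = "-".join(parts)  # components are separated by dashes
    return f"{directory}/{name}" if directory else name
```

Calling `object_name("chalkdata/report.json")` yields a different name on every call, since both the timestamp and the random ID change.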

Generally, you should not use dots in your bucket name, as this will thwart TLS protection of the connection.

The post HTTP sink type

We can also send reports via the HTTP POST method.

Let’s create a simple Python server that accepts POST requests at /upload and writes the request body to disk with a unique name.

$ cat upload_server.py
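import http.server as h, time

class H(h.BaseHTTPRequestHandler):
    def do_POST(self):
        # Only accept uploads at /upload.
        if self.path != "/upload":
            self.send_response(404); self.end_headers(); return
        # Read exactly Content-Length bytes and save them under a unique name.
        body = self.rfile.read(int(self.headers["Content-Length"]))
        with open(f"report-{time.time_ns()}.json", "wb") as f:
            f.write(body)
        self.send_response(200); self.end_headers()

# Listen on localhost only; files land in the current working directory.
h.HTTPServer(("127.0.0.1", 8000), H).serve_forever()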

Now create a data directory for the server and start the server in the data directory.

rm -rf data && mkdir data
cd data && python3 ../upload_server.py

Now create a Chalk config.

$ cat myhttp.c4m
sink_config my_http_out {
  enabled: true
  sink: "post"
  uri: "http://127.0.0.1:8000/upload"
}

subscribe("report", "my_http_out")

Load the Chalk config and Chalk a binary.

$ chalk load ./myhttp.c4m
$ cp $(which cp) cp2
$ chalk insert ./cp2
$ ls data
report-1778609825252755000.json

HTTP-specific sink parameters in detail

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| uri | string | true | The full URI to the endpoint to which the POST should be made. |
| content_type | string | false | The value to pass for the “content-type” header |
| headers | dict[string, string] | false | A dictionary of additional mime headers |
| disallow_http | bool | false | Do not allow HTTP connections, only HTTPS |
| timeout | Duration | false | Connection timeout in ms |
| pinned_cert_file | string | false | TLS certificate file |
| prefer_bundled_certs | bool | false | Whether to prefer chalk bundled root CA certs |
| auth | string | false | Auth configuration for the API |

The post will always be a single JSON object, and the default content-type field will be application/json. Changing this value doesn’t change what is posted; it is only there in case a particular endpoint requires a different value.
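To exercise an endpoint without running Chalk, a request of the same general shape can be built by hand. The Python sketch below is only an approximation of what the sink sends: a single JSON body with a content-type header plus any extra headers. The function name and the sample payload are hypothetical.

```python
import json
import urllib.request


def build_report_request(uri, report, content_type="application/json", headers=None):
    """Build a POST request carrying `report` as a single JSON document."""
    body = json.dumps(report).encode("utf-8")
    all_headers = {"Content-Type": content_type}
    all_headers.update(headers or {})  # e.g. {"Authorization": "Bearer ..."}
    return urllib.request.Request(uri, data=body, headers=all_headers, method="POST")


# Usage against the upload server above (requires it to be running):
#   req = build_report_request("http://127.0.0.1:8000/upload", {"hello": "chalk"})
#   with urllib.request.urlopen(req, timeout=5) as resp:
#       print(resp.status)
```

Note that changing content_type here, as with the sink, changes only the header and not the JSON body.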

If HTTPS is used, the connection will fail if the server doesn’t have a valid certificate. Unless you provide a specific certificate via the pinned_cert_file field, self-signed certificates will not be considered valid.

The underlying TLS library requires certificates to live on the file system. However, you can embed your certificate in your configuration in PEM format, and use config builtin functions to write it to disk, if needed, before configuring the sink.

If additional headers need to be passed (for instance, a bearer token), the headers field is converted directly to MIME. If you wish to pass the raw MIME, you can use the mime_to_dict builtin. For example, the default configuration uses the following sink configuration:

sink_config my_https_config {
  enabled: true
  sink:    "post"
  uri:     env("CHALK_POST_URL")

  if env_exists("TLS_CERT_FILE") {
    pinned_cert_file: env("TLS_CERT_FILE")
  }

  if env_exists("CHALK_POST_HEADERS") {
    headers: mime_to_dict(env("CHALK_POST_HEADERS"))
  }
}
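For intuition about the value CHALK_POST_HEADERS might hold, the Python sketch below turns raw "Name: value" header lines into a dictionary. It only illustrates the general idea; it is not the mime_to_dict builtin, whose exact parsing rules are not covered here, and the function name is hypothetical.

```python
def headers_to_dict(raw):
    """Parse 'Name: value' lines into a dict, ignoring blank lines."""
    result = {}
    for line in raw.splitlines():
        if not line.strip():
            continue
        name, sep, value = line.partition(":")
        if not sep:
            raise ValueError(f"not a header line: {line!r}")
        result[name.strip()] = value.strip()
    return result
```

With this sketch, the text `Authorization: Bearer abc` followed by a second line `X-Env: prod` becomes a dictionary with the keys Authorization and X-Env.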