Chalk™ User Guide

About This Guide
Source Code Availability
Basic Concepts
Additional Detail and Specs

About This Guide

Chalk is like GPS for software, allowing you to easily see where software comes from and where it is deployed. Chalk collects, stores, and reports metadata about software from build to production.

This connection is generally made by adding an identifying mark (which we call a chalk mark) into the artifact at build time, such that the mark is easy to validate and extract.

This document is meant to be a guide for users and implementers, to help them understand core concepts behind Chalk and our implementation. It should help you better understand Chalk’s behavior and some of the design decisions behind it.

This document is NOT intended to be a tutorial overview. For that, see the Getting Started Guide for an easy introduction to Chalk.

Source Code Availability

Source code is available at crashappsec/chalk. See the [installation guide] for how to download and install chalk locally. We will be making source code available at the time of our public launch.

Basic Concepts

Chalk operates on software metadata. Generally, it’s most important to run chalk during CI/CD so as to collect metadata on the build process and insert an identifier into any built artifacts. The process of adding metadata into an artifact is what we call chalk marking.

Chalk can also report on existing chalk marks in artifacts at any other time. Of particular interest, it can be set up to launch a program and, in parallel, report on that program and its operating environment on a regular heartbeat interval.

There are lots of different kinds of metadata that chalk can report on and inject into artifacts. Basic use would involve capturing enough build-time info to be able to identify, when looking at a production artifact, what code repository it came from and from which specific build it originated.

But there is plenty of other data that chalk can collect, including results from running analysis tools, such as security analysis or SBOM (software bill of materials) generation tools. This flexibility can help meet a wide variety of requirements (for instance, SLSA level 3 compliance if a customer is requesting it).

After chalk marks are added, the Chalk tool’s focus is on both reporting data about the existing mark (while validating its integrity) and reporting basic information about the runtime environment.

The actual act of adding a chalk mark into software artifacts is key to making it easy to trace software through its lifecycle. We add identifiers into the mark that can be extracted easily, allowing us to tie together metadata about the artifact whenever it was collected.

Reports

Chalk reports are fully configurable. You can configure what metadata gets added to reports, and in what circumstances. Reports can be sent to the terminal, logged, posted to a web URL, or written to object storage. You can make multiple reports for a single chalk operation, and you individually control where the data from each report goes. Typically, we recommend sending the bulk of the metadata collected directly to some kind of durable storage.

Generally, at least some metadata will get inserted into the artifact itself to make it easy to tie artifacts in production to their metadata. Chalk can then be used to find metadata in production environments.

Configuration

Chalk stores its own configuration inside its own binary. This configuration is used to set up behavior and preferences for each command, including how marking and reporting happens. For more information on how to write Chalk configurations, see the config overview.

Command Line Operations

Chalk insertion operations attempt to add chalk marks to artifacts, and then run any configured reporting. Currently, the chalk insert subcommand and chalk docker build subcommand are insertion operations.

Besides collecting and inserting metadata during CI/CD, there are other things the chalk command can be used to do. Most other chalk operations will extract existing chalk marks, report what is extracted, and also optionally report information from the time of extraction. Some of these operations also take further actions, such as spawning a program in the case of chalk exec. The most important of these operations are introduced below.

There are a few chalk operations that do not report, including things like chalk help and chalk version.

In the default shipped configuration, when you invoke chalk on the command line, you will get a summary report to stdout and a full report sent to a local log file. Chalk can be configured to send reports elsewhere, such as a server endpoint or an S3 bucket.

In the default configuration, the log level defaults to info. The log level is easily changed in the configuration or with command line flags.

When you run in docker wrapping mode (described below), most console logs are suppressed by default, unless the log level of the message is error. However, more logs will be added to any reporting.

Below is a high-level overview of the most important commands. Note that, by default, these commands will treat the current working directory as the place to look for existing artifacts and will scan recursively.

Basic Marking Experience

Philosophically, chalk aims to make the actual deployment into the CI/CD pipeline as easy as possible. While chalk is incredibly flexible, the intent is to pre-configure behavior and embed that configuration into the binary. That way, you can hand a binary to someone else and say, “just run this after building your artifacts”, and it should automagically work.

However, the software universe isn’t that simple. Different types of artifacts can have different types of considerations. Currently, the chalk binary works in two modes:

Build wrapping, in which it wraps the command that builds the artifact, adding a chalk mark into the artifact at build time.
Stand-alone insertion, in which it chalks the supported artifact types after the artifact has been built.

Currently, build wrapping only works with docker, and is introduced below.

When inserting chalk marks into other software artifacts (specifically, binaries, scripts, JAR files and similar), we use stand-alone insertion, which, out of the box, can be invoked by typing:

$ chalk insert <path>

After running this command, the file system is scanned from the current working directory, marking any runnable software (except for any artifacts that live in hidden directories, such as scripts in ${CWD}/.git).

By default, chalk will try to collect basic metadata:

It will look for a .git directory to associate a git repository.
It will collect host information about the build environment such as environment variables and platform details.
It will collect metadata about the CI/CD system doing the build, for example, who triggered the build and the build URL.
It will see if there’s a local CODEOWNERS or AUTHORS file, and capture it, if so.
It will generate identifiers for the artifact, including the CHALK_ID which uniquely identifies the unchalked artifact, and the METADATA_ID which uniquely identifies the artifact plus the metadata inserted into the chalk mark.

You can also configure chalk to:

run static security analysis via semgrep
create SBOMs via syft
run a secret scanner via trufflehog

Chalk also supports custom metadata collection and digital signing. See attestation for more information.

You can configure chalk to use insert as the default command, in which case the binary can be deployed with no command line options whatsoever.

Build Wrapping

The experience for chalking containers is different, as it leverages build wrapping.

Currently, build wrapping is only supported for building docker containers with docker build, although we are working on other options. The deployment could take various forms, all of which will work out of the box:

Put chalk in front of the build command. For example:
```
chalk docker build -t some:thing .
```
Rename the chalk binary to docker, and put it somewhere in the PATH that will generally show up before the actual docker command.

In all these cases, chalk will search the rest of the user’s PATH for docker (unless configured to look in a specific location).

Build wrapping is conservative in that if chalk cannot, for some reason, be confident about adding a chalk mark to an artifact, it runs the build process as if it hadn’t been invoked at all.

When wrapping docker, many docker commands are not affected, and are passed through without Chalk taking action. However, for builds, Chalk will, by default:

Add labels to the produced image with repository metadata.
Rewrite the Dockerfile to add a chalk mark.
Generate a chalk report with metadata on the build operation.

Chalk also reports a bit of metadata when pushing images to help provide full traceability.

Chalk can also be configured to add build-time attestation when possible.

Because of the way Docker works, there’s currently not a simple, pre-defined algorithm for getting a repeatable hash of a container image. As a result, the CHALK_ID will be based on a random value, and there will be some validation considerations if not using attestations.

Basic Extraction Experience

The chalk command is capable of extracting full chalk marks, and can report the presence of those chalk marks with the full flexibility of insertion. That means you have full flexibility in selecting what to report on and where to send reports. Additionally, on-demand extraction can report metadata about the runtime environment.

During the extraction operation, chalk runs once, reports what it finds (including information about the host environment), and then exits. Unlike insertion, it does not accept exclusions like .git; if there are chalk marks in specified paths, it will report them, even if they perhaps never should have been added in the first place.

On-demand Extraction

By default, running chalk extract recursively searches the current directory for artifacts with chalk marks (including any scripts that live in hidden directories such as ${CWD}/.git). Directory scanning is recursive unless specified otherwise with recursive: false in the chalk configuration or --no-recursive on the command line.

If we want to extract chalk marks from a single artifact (or directory), we can specify it as the only target:

$ chalk extract testbinary

Running chalk extract container will search all running containers for chalk marks; likewise, chalk extract image will search all images for chalk marks (but be aware that chalk extraction from docker images may take a long time, particularly if there are many images).

Similarly, to extract chalk marks from a specific container or image, we can specify the image ID, image name, container ID, or container name. For example:

$ chalk extract 0ed38928691b

Generally, when using chalk for container extraction, it is best to run it in the context of the host OS, as non-privileged containers may not have enough information to ensure which image is running.

In most cases, the container ID will be available from within the container in the HOSTNAME environment variable. But there generally won’t be any easy, ironclad way to tie that to the image from within the container, short of integrity checking the entire file system from the entry point.

Identifying Chalked Containers

While containers do keep the chalk mark in a file on the root of their file system, it can often be inconvenient, or even impossible, to get access to the container.

However, there are plenty of monitoring tools (including CSPM tools in the security space) that capture enough runtime metadata about containers such that you can tie it back to the chalk mark via the image hash of the container.

The image hash is not available at the time chalk marks are added (it is a cryptographic one-way function of the finished image, so it cannot be known before the mark is in place). It is, however, captured in the chalk report in the _IMAGE_ID key. Generally, you should have an information trail with your tooling from the container ID to the image ID.

For instance, if you were just using docker, and your container ID were 5eb97611e37f, you could generally retrieve the image ID simply with:

$ docker inspect 5eb97611e37f -f '{{ .Image }}'
sha256:4ee5e79272183b8313e43921b3e46c1809399391535c0c044dd6f2230041eede

Note that when Chalk is in Docker mode, it also wraps docker push, which generally will result in a new image registry manifest digest and list digest if applicable. Capturing that info on push provides the needed breadcrumbs.

Using Chalk to Launch Processes

If you want to better automate traceability across software’s life cycle, you can configure chalk to run software (ideally, software that you’ve previously marked via chalk insert or chalk docker build). Chalk supports a chalk exec operation where it will run your process, as well as report on that process and the host environment.

The current default behavior for chalk is then to exit quietly, without impacting your process. Chalk can also be configured to continue running in the background and emit a periodic heartbeat report that is fully customizable.

See exec guide for more information.

Metadata Keys

Metadata is at the core of Chalk, which categorizes data into four types:

Chalk-time artifact metadata, which is data specific to a software artifact, collected when inserting chalk marks. This data can be put into a chalk mark, and it can also be separately reported without putting it in the chalk mark.
Chalk-time host metadata, which is data about the environment in which chalk ran when inserting chalk marks. This data can also be added to the individual marks inserted, if desired.
Run-time artifact metadata, which is data about software artifacts that can be collected on any invocation of chalk, such as when launching a program you’ve previously marked, or when searching for chalk marks on a system.
Run-time host metadata, which is data about the host, captured for any chalk invocation.

Some things to note about metadata:

Some metadata is inappropriate for those looking for fully reproducible builds, such as time-specific keys. The default is to include some of these items, which are useful to a different set of people who want to be able to track which build came from which environment. These concerns can be dealt with via the configuration by setting up custom reports for different consumers.
Chalk-time keys can be reported at run-time if they’re being extracted from a chalk mark, but they will always contain the values added at chalk time.
Run-time keys cannot be added to chalk marks.
If the implementation is unable to collect a piece of metadata, it is not included in any reporting, no matter the configuration.

There’s plenty of flexibility on when to collect and report on metadata. This document covers some of the basics at a high level. More detailed information on metadata keys is available through the help documentation via:

$ chalk help metadata
$ chalk help metadata <key>

It is possible to create chalk marks without inserting identifiers into artifacts, called “virtual chalking”. However, it is not recommended, and is intended primarily for testing. Doing so means that deployed software will need to be independently identified and correlated, negating a lot of the value of Chalk.

Additional Detail and Specs

In this section, we will go into more detail on key Chalk concepts, and give pointers to deeper reference material where appropriate.

While we have produced an initial implementation of Chalk that is quite flexible, we designed it so that other people could build implementations, including partial implementations, and easily achieve interoperability across implementations.

For instance, it is easy to write a compliant chalk library that allows programs to store their configurations inside their executable, and retrieve them, while still inter-operating with other programs that collect a wider range of metadata.

We certainly intend to allow other people to implement compatible software, and if the software meets our requirements, call it a conforming Chalk implementation. To that end, as we explain parts of Chalk in this document, we will often indicate explicitly whether something is necessary to be ‘conformant’ or not.

However, until we are confident that we’ve been thorough enough at specifying conformance to ensure interoperability, note that nobody should use the Chalk trademark to describe an implementation of anything without the project’s express approval.

Currently, the project trademark is held by Crash Override, Inc. However, once the project is sufficiently mature, we expect to assign ownership of the trademark to a non-profit.

To help with interoperability and to help people to understand capabilities of various implementations, Chalk is versioned – not just the reference software, but the information that is required for compliance. Chalk versioning will follow the Semver standard.

Until Chalk is declared 1.0.0, new versions of the spec may contain breaking changes. We will document these as they happen in CHANGELOG.

Configuring Chalk

Chalk stores its configuration inside its binary. Configurations can be extracted from a binary with the chalk dump command, and new ones loaded with the chalk load command, which loads from either a local file or an https URL.

The actual configuration format is designed to be at least as simple as the typical *NIX config file whenever possible, while still supporting advanced use cases. The configuration file format itself is mostly in line with the NGINX family of configuration files, with sections and key/value pairs. For those advanced use cases, the config file also supports some limited programmability.

Generally, the average user shouldn’t need such features, but the people who do should find the syntax to be straightforward to anyone with basic programming experience.

For people who want to dig into the actual configuration file, we provide an overview in Config Overview. We also provide documentation via the help in chalk help configs.

Note that other implementations of Chalk are free to implement their own configuration mechanisms.

Chalk Mark Basics

Chalk writes arbitrary metadata into software artifacts. The metadata written into a single artifact is called the chalk mark. The mark itself is always a (utf-8 encoded) JSON object, where the first key/value pair is always "MAGIC" : "dadfedabbadabbed".

The presence of this value is required to consider data embedded into an artifact a chalk mark.

Beyond the initial key pair, key/value pairs can appear in any order.

Key names beginning with a leading underscore _ must never appear in a chalk mark (as they denote reporting data that was collected at the time a report was generated).

Key names beginning with a $ are considered to be internal to chalk implementations that add metadata, called injectors in this document. If an implementation uses such keys, then chalk marks using that implementation must identify the injector by adding the key INJECTOR_NAME to the chalk mark, which must be registered, currently through the Chalk development team.

These keys should only ever be written into programs that themselves modify chalk marks.

Starting with Chalk 0.1.1, Chalk mark injectors that find an existing chalk mark in an artifact will, if replacing the chalk mark, keep $ keys they do not recognize, unless specifically configured to remove them, while also considering them part of the previous chalk mark.
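
To make the structural rules above concrete, here is a minimal Python sketch that serializes a mark with MAGIC as the first key and rejects keys with a leading underscore. The function names `serialize_mark` and `looks_like_mark` are made up for this illustration and are not part of Chalk. The sketch checks only these two rules, not everything a conforming implementation must do.

```python
import json

MAGIC_VALUE = "dadfedabbadabbed"


def serialize_mark(metadata):
    """Return a single-line JSON mark with MAGIC as the first key."""
    mark = {"MAGIC": MAGIC_VALUE}
    for key, value in metadata.items():
        if key.startswith("_"):
            # Underscore keys are report-time data and never belong in a mark.
            raise ValueError(f"{key!r} must not appear in a chalk mark")
        if key != "MAGIC":
            mark[key] = value
    # json.dumps keeps dict insertion order, so MAGIC stays first.
    return json.dumps(mark)


def looks_like_mark(text):
    """Check two structural rules: MAGIC comes first, no underscore keys."""
    try:
        # object_pairs_hook=list preserves the key order found in the text.
        pairs = json.loads(text, object_pairs_hook=list)
    except json.JSONDecodeError:
        return False
    if not isinstance(pairs, list) or not pairs:
        return False
    if pairs[0] != ("MAGIC", MAGIC_VALUE):
        return False
    return not any(key.startswith("_") for key, _ in pairs)
```

Parsing with `object_pairs_hook=list` matters here because an ordinary JSON parser is free to ignore key order, while the MAGIC-first rule depends on it.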

Custom Keys

Users of the reference implementation of Chalk and other conforming implementations of Chalk cannot add arbitrary chalkable keys or arbitrary runtime keys, and the keys defined must conform. However, if these keys don’t meet your needs, you can add custom keys. Any key starting with an X_ is reserved for custom chalkable keys (both host and artifact), and any key starting with _X_ is reserved for custom run-time keys.
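
The prefix conventions described so far can be summarized in a small classifier. This is an illustrative Python sketch only: the function name `classify_key` and the category labels are chosen for this example and are not Chalk terminology beyond what the text above defines.

```python
def classify_key(name):
    """Classify a metadata key name by its prefix."""
    # Check the longer "_X_" prefix before the generic "_" prefix.
    if name.startswith("_X_"):
        return "custom run-time"
    if name.startswith("X_"):
        return "custom chalkable"
    if name.startswith("_"):
        return "run-time"
    if name.startswith("$"):
        return "injector-internal"
    return "chalkable"
```

The order of the checks is the only subtle part: `_X_` must be tested before `_`, or every custom run-time key would be reported as an ordinary run-time key.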

Marks Versus Reports

Marks are JSON objects inserted into a software artifact, or into a virtual-chalk.json file if virtual chalking is enabled (not recommended).

Chalk can also generate reports that are separate from the artifact. These reports can be generated at the time of chalking, but they can also be generated at any point after a mark is added, such as on extract or exec operations. Chalk reports are also structured as JSON. The format and requirements around such reports are discussed in Overview of Reporting Templates.

It’s important to understand the basic semantics of reports and how they differ from marks:

Reports can occur at any time, not just when inserting chalk marks.
When inserting chalk marks, if a report is also generated, the data in the mark and the data in the report can overlap, but do NOT need to be identical.

It will be common, at chalk insertion time, to add data about the software artifact to a report that is not added to the chalk mark itself, as discussed shortly. In such a case, metadata keys can get reported that are not in the mark. Similarly, it is fine for metadata to be put into the mark that is not in the associated report.

When a report is generated because a chalk mark is seen in the field (meaning not on an insertion operation), the report can report some, all, or none of the information it finds in chalk marks. It may also report new metadata that cannot be added to the chalk mark outside an insertion operation.

Reported metadata keys starting with a leading underscore _ are never added to a chalk mark, and represent collected metadata at the time the report was generated.

When reports contain metadata keys that do NOT start with an underscore, they contain information from the time of metadata insertion. If Chalk is not performing insertions at the time, such keys will always be directly taken from the chalk mark. No keys without the leading underscore can be reported for non-insertion operations unless they are found in a chalk mark.

We recommend being thoughtful, at chalk insertion time, about what metadata will be added to the chalk mark itself.

There are two key reasons for this:

Privacy. Depending on where the software is distributed and who has access to it, there may be information captured that people examining the artifact shouldn’t see. In this case, there should be enough information stored in the mark to find the report record that was generated at the time of chalking without leaking sensitive data in the mark itself.
Mark size. Although size generally won’t be much of a concern in practice, some metadata objects may be quite large, such as generated SBOMs or static analysis reports.

The first concern is, by far, the most significant. Even in cases where software never intentionally leaves an organization, there can be risks. For instance, if the chalk mark contains code ownership or other contact information, while it does make life easier for legitimate parties, it also could help an attacker who manages to get onto a node and is looking to pivot.

Required Keys in Chalk Marks

To be considered a valid chalk mark, the following keys must be present:

MAGIC. This key must be first, and must have the exact value "dadfedabbadabbed". This is a strong requirement even though JSON object items do not require ordering. This is part of how we make chalk mark extraction easy to implement.
CHALK_ID. This value is an encoding of the first 100 bits of an artifact’s unchalked SHA-256 hash, whenever such a hash can be unambiguously determined. Otherwise, it is derived from 100 bits selected from a cryptographic PRNG.
CHALK_VERSION. This value is required so that extractors can unambiguously deal with future changes to Chalk.
METADATA_ID. In contrast to the CHALK_ID, this is a unique identifier for the chalked artifact. It is the first 100 bits of the chalk mark’s SHA-256 hash.

Details about what keys contain can be found in chalk help metadata. Compliant implementations of Chalk must insert compatible information.

The JSON object representing the chalk mark can contain arbitrary spaces that would otherwise be valid in JSON, but cannot contain newlines (newlines in values are encoded). This requirement is only for chalk marks stored in an artifact; there are no requirements on presentation when displaying chalk marks.
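
As a rough illustration of the requirements above, the following Python sketch checks a parsed mark for the four required keys and extracts the first 100 bits of a SHA-256 hash. The helper names are hypothetical. The sketch deliberately stops at the raw 100-bit value, because the encoding that turns those bits into a CHALK_ID or METADATA_ID string, and the exact bytes that are hashed for each, are not specified in this section.

```python
import hashlib

REQUIRED_KEYS = ("MAGIC", "CHALK_ID", "CHALK_VERSION", "METADATA_ID")


def missing_required_keys(mark):
    """Return the required keys that are absent from a parsed mark."""
    return [key for key in REQUIRED_KEYS if key not in mark]


def first_100_bits(data):
    """Return the first 100 bits of SHA-256(data) as an integer."""
    digest = hashlib.sha256(data).digest()
    # 100 bits = 12 whole bytes plus the top 4 bits of the 13th byte,
    # so read 13 bytes (104 bits) and drop the lowest 4 bits.
    return int.from_bytes(digest[:13], "big") >> 4
```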

Current Chalk Mark Insertion Algorithms

How the chalk mark is stored in the artifact varies:

Software Type	Storage Approach
ELF Executables	Added to its own `.chalk` section
Container Images	Adds `/chalk.json` file to the top layer of the image file system; soon will store an attestation with the mark, whenever supported.
JAR/WAR/EAR files	A `.chalk.json` file at the top level of the archive
ZIP files	A `.chalk.json` file at the top level of the archive
Scripting Languages	Placed in a comment, generally at the end of the file. Note that currently this requires a Unix shebang or a .py file extension to be identified as a valid script.
Byte-compiled Python	Added to the end of the .pyc file
Mach-O (MacOS) executables	Due to Apple restrictions, we automatically wrap the binary into a shell script, and mark the shell script.

The implementation for scripting languages will do one of the following:

Replace an existing mark, wherever it is.
Place a mark in the first place in the file it finds a chalk placeholder.
Write the mark at the end of the file.

In the first case, the mark does NOT need to be at the end of the file, due to the support for placeholders.

A valid placeholder consists of the JSON object { "MAGIC" : "dadfedabbadabbed" }. The presence of spaces and the number of spaces is all flexible, but no newlines are allowed.

The intent here is to allow developers to specify where they want marks to go, either so that they’re the least in-the-way, or so that they can include them as data, instead of a comment. This requires a normalization function for computing the HASH value, which is described below.
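
The placeholder form described above can be recognized with a short regular expression. The Python sketch below is an assumption-laden illustration, not the reference implementation's matcher: it treats only the space character as permitted whitespace, and the name `find_placeholder` is invented for this example.

```python
import re

# Any number of spaces is allowed between tokens; newlines are not.
PLACEHOLDER = re.compile(r'\{ *"MAGIC" *: *"dadfedabbadabbed" *\}')


def find_placeholder(text):
    """Return the (start, end) span of the first placeholder, or None."""
    match = PLACEHOLDER.search(text)
    return match.span() if match else None
```

For a script containing the line `# { "MAGIC" : "dadfedabbadabbed" }`, the returned span covers just the JSON object, so an injector could replace that span with the full mark and leave the comment marker in place.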

Implementations won’t insert potentially ambiguous chalk marks relative to the current version of Chalk. For instance, it may soon be possible to put ELF chalk marks in places other than the end of a file, and older versions of Chalk would continue inserting to the same place.

In such a case, the newer version must remove the older chalk mark. However, the older version might later be invoked on an artifact that the newer version has already marked.

The current version of Chalk will not deal with this issue, but before version 1.0, we expect to define and implement a solution. Currently, we’re considering two approaches:

File-based artifacts will need to be scanned in their entirety before marking, and if a mark is found, the spot is reused. This would make things easier on implementers, but could impact performance for some larger artifacts.
We may require marking the locations that older versions would have selected with a mark that invalidates the location, and points to the correct location. This would allow for more efficient operation, but would make some parts of the implementation more difficult, especially around calculating the CHALK_ID, which is discussed below.

To be clear, while compliance for chalk implementations requires adhering to the algorithms as defined by the reference, it does not require implementing any specific algorithm for insertion or extraction of marks, as long as it implements at least one. Nor does an implementation need to be set up to chalk all possible artifacts.

For example, we expect to release small libraries for different language environments that allow programs to chalk themselves. This would allow them to easily load and store configuration information without using external files (as Chalk itself currently does). Similarly, we intend to use this to have programs automatically add bash completion scripts on their first run, if such scripts aren’t found in an environment.

Future Insertion Algorithms

We have approaches planned for the executable types on our roadmap, including in-browser JavaScript, PE binaries, and more.

Generally, even going forward, anything in an image format will have the mark stored in the root of its file system, either as chalk.json or .chalk.json. Anything else will generally be stored in a way where the raw JSON would be directly visible in the artifact’s bytes.

We want marking strategies to be unambiguous to implement and easy to extract, wherever possible. To that end, any algorithm that doesn’t meet the requirements laid out here should be brought to the project to be considered for approval.

Multiple marks in an artifact

Sometimes it might make sense for a software artifact to have multiple Chalk marks. For instance:

A zip file deployment might itself be marked, and contain multiple executables that also have marks.
A single script-based program may consist of several files, all independently marked, particularly when an entry point cannot easily be programmatically determined.
Individually compiled ELF objects could conceivably be marked independently, and then composed into an ELF object that is also marked.

In the first two cases, no changes need to be made; sub-items can be marked unambiguously. Implementations can either:

Leave sub-marks in place.
Lift them into the top-level object in full, adding them to the EMBEDDED_CHALK key, in which case they should not be placed in the embedded objects, or should be removed from them if already there.

In the third case, allowing individual object marks to exist independently in the artifact would make it harder to support simple extraction. As of the current version, multiple marks existing independently in a single document (such as an executable) that is not a well-defined image format are not allowed.

Replacing existing marks

When a Chalk mark already exists in a document, it’s up to the context of the insertion whether the existing chalk mark should be removed. In most cases, an existing chalk mark should be preserved. For instance, when chalking during deployment, any previous chalk mark from the build process should be preserved.

In such cases, there are three options:

The old chalk mark can be kept, in its entirety, in a key in the new chalk mark called OLD_CHALK_MARK.
If the user is confident that data about the chalk mark being replaced was captured, then the mark can be replaced with the single key OLD_CHALK_METADATA_ID, where the value of this key is the METADATA_ID of the mark being replaced.
Similarly, one can use the OLD_CHALK_METADATA_HASH, if full hashes are preferred to IDs.

Particularly in the latter two scenarios, note that if the old chalk mark is not reported before being replaced, and then the mark is replaced again, the link between marks will be lost. Therefore we strongly discourage using those keys without reporting.
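
The first two options can be sketched as follows. This Python example is illustrative only: `replacement_mark` is a hypothetical helper, and it does not recompute the new mark's METADATA_ID, which a real injector would need to do after changing the mark's contents.

```python
def replacement_mark(new_mark, old_mark, keep_full=True):
    """Return a copy of new_mark that records the mark it replaces."""
    result = dict(new_mark)  # copy, so the caller's dict is not modified
    if keep_full:
        # Option 1: embed the old mark in its entirety.
        result["OLD_CHALK_MARK"] = old_mark
    else:
        # Option 2: keep only a pointer to the old mark's metadata.
        result["OLD_CHALK_METADATA_ID"] = old_mark["METADATA_ID"]
    return result
```

The pointer-only form is smaller, but it is only useful if the old mark was reported somewhere before being replaced, as the paragraph above warns.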

Mark Extraction Algorithms

Extractors generally do not need to care about file structure for non-image formats. It should be sufficient for them to scan the bytes of such artifacts, looking for the existence of the Chalk MAGIC key.
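
A byte-scanning extractor of this kind might look like the Python sketch below. It rests on several assumptions that go beyond the text: it accepts only spaces between the opening tokens, it relies on the rule that stored marks contain no raw newlines to bound each candidate, and it does no integrity validation. The name `find_marks` is invented for this example.

```python
import json
import re

# The start of a mark: an opening brace followed by the MAGIC pair.
MAGIC_START = re.compile(rb'\{ *"MAGIC" *: *"dadfedabbadabbed"')


def find_marks(data):
    """Scan raw bytes and return every top-level mark that parses as JSON."""
    decoder = json.JSONDecoder()
    marks = []
    pos = 0
    while True:
        match = MAGIC_START.search(data, pos)
        if match is None:
            break
        # A stored mark has no raw newlines, so stop at the next one.
        end = data.find(b"\n", match.start())
        chunk = data[match.start():end if end != -1 else len(data)]
        text = chunk.decode("utf-8", errors="replace")
        try:
            # raw_decode parses one JSON value and ignores trailing bytes.
            mark, consumed = decoder.raw_decode(text)
        except json.JSONDecodeError:
            pos = match.end()
            continue
        marks.append(mark)
        # Skip past the whole mark so nested old marks are not re-reported.
        pos = match.start() + len(text[:consumed].encode("utf-8"))
    return marks
```

Because the scan works on bytes rather than file structure, the same function handles a mark in a script comment and a mark surrounded by binary data.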

However, for image-based formats, the extractor needs to be aware enough of the marking requirements for that format to be able to unambiguously locate the primary mark.

Mark Reporting

As mentioned in the section Marks Versus Reports, chalk reports do not have to contain the same data as in chalk marks.

Currently, Chalk can generate reports when any of the following operations are performed:

Operation	How to invoke the operation	Description
`insert`	`chalk insert`	Adds chalk marks to non-container artifacts.
`extract`	`chalk extract`	Scan an environment, looking for existing marks in software artifacts.
`build`	`chalk docker ...` where the docker command leads to a build	Adds chalk marks while building a container.
`push`	`chalk docker ...` where the docker command leads to a push	Report when pushing a container to a registry.
`exec`	`chalk exec`	Spawn a process, and perform reporting.
`heartbeat`	`chalk exec --heartbeat`	Same as `exec`, but with heartbeat enabled.
`delete`	`chalk delete`	Delete chalk marks from existing artifacts.
`env`	`chalk env`	Perform reporting on a host environment without performing a scan (as `chalk extract` does), and without spawning another process (as `chalk exec` does).
`load`	`chalk load`	Replace a chalk executable’s configuration.
`dump`	`chalk dump`	Output the currently loaded configuration.
`setup`	`chalk setup`	Set up signing and attestation.
`docker`	`chalk docker ...`, where chalk encounters an error.	Still runs docker, but reports on cases where `insert` or `push` operations could not complete.

The operation associated with a report is available via the _OPERATION key.

For more details on the command line usage of chalk, see the help documentation at chalk help commands.

Chalk reports can contain per-artifact information, as well as information specific to the host environment.

Chalk outputs each report as a JSON object wrapped in a JSON array, so in most cases the array contains a single object. Reports are always sent as an array because it may hold multiple objects when an implementation has cached reports that have not yet been delivered.

For each single report, the JSON keys that are valid in the top-level report will NEVER be artifact-specific data. Only what we call ‘host data’ is included at the top level, by which we mean data specific to one run of chalk on one host.

For artifact data, there’s a host-level metadata key called _CHALKS, the value of which is an array of JSON objects, containing the metadata specific to that artifact to be reported on. Each element in the array corresponds to a single artifact’s chalk information.

The data in a report contained in _CHALKS does not have to consist of a full chalk mark; the user could choose to report on a subset of keys, or no keys from the chalk mark at all. Furthermore, the artifact data in a _CHALKS field will not consist solely of chalk-time information; it can also contain information from the time the report was run. For instance, a single artifact report could contain both the filesystem path where the software lived when it was chalked in the build environment, as well as the file system path for its current location (PATH_WHEN_CHALKED vs _OP_ARTIFACT_PATH keys).

In all cases, the chalk-time keys at the top level of the report and at the top level of the objects in the _CHALKS array will not contain a leading underscore. The keys representing report-time operations will contain a leading underscore.

There are no specific requirements about what keys must be contained in a report; the user has the final say in what data gets reported on and what does not. In fact, reports do not have to report the _CHALKS field. However, removing that field does mean there will be no artifact-specific information in the report, making it suitable for host reporting and summary stats only.

Generally, if reporting on artifacts at all, we strongly recommend configuring the reporting to include, at a bare minimum, the CHALK_ID and METADATA_ID fields.
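To make the report layout concrete, here is a small Python sketch that walks a report array, visits each entry in _CHALKS, separates chalk-time keys from report-time keys by the leading underscore, and flags artifacts that lack the recommended CHALK_ID and METADATA_ID fields. The function names and all sample values are hypothetical; only the key names come from this guide.

```python
import json

RECOMMENDED_KEYS = ("CHALK_ID", "METADATA_ID")


def split_keys(obj: dict) -> tuple[dict, dict]:
    """Separate chalk-time keys from report-time keys (leading underscore)."""
    chalk_time = {k: v for k, v in obj.items() if not k.startswith("_")}
    report_time = {k: v for k, v in obj.items() if k.startswith("_")}
    return chalk_time, report_time


def summarize(report_json: str) -> list[dict]:
    """Return one summary per artifact found in a JSON array of reports."""
    summaries = []
    for report in json.loads(report_json):          # reports arrive as an array
        for artifact in report.get("_CHALKS", []):  # _CHALKS may be absent
            chalk_time, report_time = split_keys(artifact)
            summaries.append({
                "operation": report.get("_OPERATION"),
                "chalk_time": chalk_time,
                "report_time": report_time,
                "missing": [k for k in RECOMMENDED_KEYS if k not in artifact],
            })
    return summaries


# Hypothetical report; the values are made up for illustration.
SAMPLE = json.dumps([{
    "_OPERATION": "extract",
    "_CHALKS": [{
        "CHALK_ID": "example-chalk-id",
        "METADATA_ID": "example-metadata-id",
        "PATH_WHEN_CHALKED": "/build/app",
        "_OP_ARTIFACT_PATH": "/srv/app",
    }],
}])

for entry in summarize(SAMPLE):
    print(entry["operation"], sorted(entry["chalk_time"]), sorted(entry["report_time"]))
```

A report without _CHALKS simply yields no summaries, which matches the host-only reporting case described above.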

For each of the above operations, Chalk allows you to configure a primary report. That report can go to multiple different places, including the terminal, log files, HTTPS URLs, and S3 buckets.

You can also define additional reports that get sent at the same time, so that you can send different bits of data to different places. This is done with Chalk’s custom reporting facility.

For more information, see the following:

chalk help metadata contains documentation for what metadata keys are available in which operations, as well as the meaning of the fields. Documentation for keys will also include the conditions where the reference implementation can find them.
The Config Overview Guide covers how to configure WHERE reports get sent.

Note that compliant insertion implementations do not require compliant reporting implementations. But compliant chalk tools for other operations MUST produce fully conformant JSON.

However, there are no requirements on how that JSON gets distributed or managed, other than that compliant implementations must provide a straightforward way to make the JSON available to users if desired.

A report that is not in the proper format, or that contains key/value pairs that are not compliant, is not a Chalk report.

Mark Validation

Any time a mark is extracted, Chalk must go through a validation process.

If the artifact isn’t a container, extractors independently compute the value of HASH for that artifact (the field isn’t currently computed for containers).

That value is then used to derive what we expect the CHALK_ID to be. If the CHALK_ID in the mark does not match the derived value, we log an error, which may print to the terminal depending on the log level, and will generally be added to any chalk operation report under the top-level key _OP_ERRORS.

If the CHALK_ID validates, or if the artifact is a container, then we must also validate the METADATA_ID.

Validating the METADATA_ID requires independently recomputing the METADATA_HASH by normalizing and encoding the fields explicitly added into a chalk mark.

The normalization algorithm is as follows:

The following key/value pairs are removed:
- MAGIC
- METADATA_HASH
- METADATA_ID
- SIGNING
- SIGNATURE
- EMBEDDED_CHALK
The following key/value pairs are encoded first, in order (whenever present; they are skipped if they were not added to the mark):
- CHALK_ID
- CHALK_VERSION
- TIMESTAMP
- DATE
- TIME
- TZ_OFFSET
- DATETIME
The following key/value pair is encoded LAST (whenever present):
- ERR_INFO
The remaining keys are encoded in lexicographical order.
The encoding starts with the number of keys in the normalization, as a 32-bit little endian integer.
Each key/value pair is encoded in order by encoding the key, and then the value, using the item normalization algorithm below.

Individual items are conceptually normalized as follows:

Strings are normalized by adding the byte \x01, followed by the length of the JSON-encoded string in bytes represented as a 32-bit little endian unsigned value, followed by the encoded string.
Integers are normalized by adding the byte \x02, followed by the 64-bit value of the integer, represented as a little-endian unsigned value.
Booleans are represented as two bytes each: \x03\x00 for false and \x03\x01 for true.
Arrays are normalized by adding the byte \x04, followed by the number of items in the array encoded as a little endian 32-bit integer, followed by the normalized version of each item in order.
Dictionaries / JSON objects must be stored ordered in Chalk values. They are normalized by adding the byte \x05, followed by the number of key/value pairs in the dictionary encoded as a 32-bit little endian integer, followed by paired encodings for each pair, in their stored ordering.
Floats are represented as \x06 followed by the string representation of the float value.
null is represented as \x07.

The complete normalized string is hashed with SHA-256. The resulting 256-bit binary value is the base of both the METADATA_HASH field and the METADATA_ID field. The entire value is hex-encoded to get the METADATA_HASH, whereas the METADATA_ID is computed via the same algorithm used to calculate CHALK_ID fields.
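The normalization and hashing steps above can be sketched in Python as follows. This is an illustrative reading of the rules, not the reference implementation, and all function names are hypothetical. It makes several assumptions that the text leaves open: strings are encoded as their UTF-8 bytes without surrounding quotes, negative integers are encoded as 64-bit two's complement, and Python's repr stands in for the float string representation. Deriving the METADATA_ID from the hash is not shown, since it reuses the CHALK_ID algorithm.

```python
import hashlib
import struct

EXCLUDED = {"MAGIC", "METADATA_HASH", "METADATA_ID",
            "SIGNING", "SIGNATURE", "EMBEDDED_CHALK"}
FIRST = ["CHALK_ID", "CHALK_VERSION", "TIMESTAMP", "DATE",
         "TIME", "TZ_OFFSET", "DATETIME"]
LAST = "ERR_INFO"


def u32(n: int) -> bytes:
    """Encode a count or length as a 32-bit little-endian unsigned integer."""
    return struct.pack("<I", n)


def normalize_item(value) -> bytes:
    """Encode one JSON value using the type tags \\x01 through \\x07."""
    if isinstance(value, bool):  # must precede int: bool is a subclass of int
        return b"\x03\x01" if value else b"\x03\x00"
    if isinstance(value, str):
        data = value.encode("utf-8")  # assumption: raw UTF-8, no quotes
        return b"\x01" + u32(len(data)) + data
    if isinstance(value, int):
        # assumption: negatives wrap to 64-bit two's complement
        return b"\x02" + struct.pack("<Q", value & 0xFFFFFFFFFFFFFFFF)
    if isinstance(value, list):
        return b"\x04" + u32(len(value)) + b"".join(map(normalize_item, value))
    if isinstance(value, dict):  # Python dicts keep their stored order
        pairs = b"".join(normalize_item(k) + normalize_item(v)
                         for k, v in value.items())
        return b"\x05" + u32(len(value)) + pairs
    if isinstance(value, float):
        return b"\x06" + repr(value).encode("utf-8")  # assumption: repr()
    if value is None:
        return b"\x07"
    raise TypeError(f"unsupported type: {type(value).__name__}")


def ordered_keys(mark: dict) -> list[str]:
    """Apply the removal and ordering rules to the keys of a mark."""
    keys = [k for k in mark if k not in EXCLUDED]
    first = [k for k in FIRST if k in keys]
    last = [LAST] if LAST in keys else []
    rest = sorted(k for k in keys if k not in first and k != LAST)
    return first + rest + last


def normalize_mark(mark: dict) -> bytes:
    """Encode the key count, then each key and value in normalized order."""
    keys = ordered_keys(mark)
    body = b"".join(normalize_item(k) + normalize_item(mark[k]) for k in keys)
    return u32(len(keys)) + body


def metadata_hash(mark: dict) -> str:
    """Hex-encoded SHA-256 of the normalized mark."""
    return hashlib.sha256(normalize_mark(mark)).hexdigest()
```

Calling metadata_hash on a parsed mark and comparing the result with the mark's own METADATA_HASH value is the core of the check. Any difference in the assumptions listed above would produce a different hash, so treat this as a way to understand the layout rather than as a drop-in validator.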

Remember that keys beginning with an underscore are never added to chalk marks, and so are never considered in the normalization process at all.

We omit the key MAGIC because it is always constant across all invocations, and chalk marks will not even be recognized if it is modified in any way.

We omit METADATA_HASH and METADATA_ID because they are the output of the normalization process.

We omit SIGNATURE and SIGNING because they belong to a further layer of validation, discussed below, that is built on top of the METADATA_ID.

We currently omit EMBEDDED_CHALK, instead allowing embedded marks to be independently validated, if desired. While this does mean the EMBEDDED_CHALK key can be excised without detection at validation time, we expect that either the relevant sub-artifacts will have embedded chalk marks themselves, or the server will have a record of the insertion.

Currently, this key is only used for ZIP files. The HASH value, which must be present for ZIP files, is still used in validation, so the integrity of the underlying artifacts is guaranteed by this process.

This was initially done because there were some concerns about the potential amount of processing, especially since our implementation requires using the file system for ZIP files. But we have some reservations, and will consider changing this in future versions.

This validation process only proves that the chalk mark’s integrity is intact from when it was written (and, if there is a HASH field, that the core artifact as normalized is intact). It does not validate that the chalk mark was added by any particular party.

For that level of assurance, the METADATA_ID field reported at extraction time should be cross-referenced against insertion-time reports. When these IDs correlate, you can be confident that the metadata is identical across reports (and the files in question as well, as long as there is a HASH field).
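One way to picture this cross-referencing is the Python sketch below, which collects METADATA_ID values from insertion-time reports and lists any extraction-time IDs that have no match. The function names are hypothetical, and treating `insert` and `build` as the insertion-time operations is an assumption based on the operations table above.

```python
def metadata_ids(reports: list[dict], operations: set[str]) -> set[str]:
    """Collect METADATA_ID values from reports produced by the given operations."""
    ids = set()
    for report in reports:
        if report.get("_OPERATION") not in operations:
            continue
        for artifact in report.get("_CHALKS", []):
            if "METADATA_ID" in artifact:
                ids.add(artifact["METADATA_ID"])
    return ids


def unmatched_extractions(insert_reports: list[dict],
                          extract_reports: list[dict]) -> set[str]:
    """Return extraction-time METADATA_IDs that no insertion-time report contains."""
    known = metadata_ids(insert_reports, {"insert", "build"})  # assumption
    seen = metadata_ids(extract_reports, {"extract"})
    return seen - known
```

An empty result means every extracted mark correlates with a recorded insertion. Anything left over deserves a closer look.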

In containers, where we do not have an easy, reliable hash, metadata normalization and validation works the same way. But we strongly recommend automatic digital signatures to ensure that you can detect changes to the container.

Digital signing can be used both with containers and with other artifacts. With containers, we use Sigstore with their In-Toto attestations that we apply on docker push. The mark is replicated in full inside the attestation.

For other artifacts, the signature is stored in the Chalk mark, but is (necessarily) not part of the metadata hashing, since it needs to sign that data.

Mark Deletion

If needed, it’s possible to delete chalk marks from most artifacts, with containers being the exception (they should be rebuilt instead). Running the chalk delete command with no arguments recursively deletes chalk marks from all artifacts in the current working directory. To target a specific path, pass it as an argument:

$ chalk delete [targetpath]

The delete operation will produce a report where the deleted chalk mark (if any) will be reported, along with run-time environmental information.

Configuration File Syntax Basics

Many of the things you might want to configure simply involve setting configuration variables. For instance, we could create and load the following configuration file:

color: false # Also could set NO_COLOR env variable.
log_level: "error" # Otherwise, defaults to 'warn' in non-docker cases
run_sbom_tools: true # Run syft; off by default.
run_sast_tools: true # Run semgrep; off by default.
run_secret_scanner_tools: true # Run trufflehog; off by default.
# A backup; a self-truncating log file for reporting
use_report_cache: true
default_command: "docker" # Defaults to "help".
report_cache_location: "/var/log/chalk-reports"
# When using docker, the prefix to add to auto-added labels
docker.label_prefix: "com.example."

In the configuration file, we can also set up environment variables for reporting, such as by defining new environment variables and using simple if / else logic to set a default if the environment variable is not set on the host. For example, the line docker.label_prefix: "com.example." in the sample config above can be changed to:

if env_exists("CHALK_LABEL_PREFIX") {
  docker.label_prefix: env("CHALK_LABEL_PREFIX")
} else {
  docker.label_prefix: "com.default."
}

In this case, if CHALK_LABEL_PREFIX is set on the host, then all docker images built with chalk will have that label prefix; otherwise, "com.default." is used as the label prefix.

Testing Configurations

When you load a new configuration, Chalk will test it automatically. But you can generally test your configuration more quickly by leaving it in an external file.

Out of the box, Chalk will search for chalk.c4m on startup in:

current directory
/etc/chalk
/etc/
~/.config/chalk/
~

Or you can specify the file to use with --config-file (also -f).
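When debugging which external file Chalk might pick up, it can help to list the candidates yourself. The Python sketch below checks the locations listed above for a chalk.c4m file and returns those that exist, in the listed order. It does not claim to reproduce Chalk's own precedence rules, and the function name candidate_configs is hypothetical.

```python
from pathlib import Path

# Search locations as listed in this guide; "." stands for the current directory.
SEARCH_DIRS = [".", "/etc/chalk", "/etc/", "~/.config/chalk/", "~"]


def candidate_configs(dirs=SEARCH_DIRS, filename="chalk.c4m") -> list[Path]:
    """Return existing config files from the search locations, in listed order."""
    found = []
    for directory in dirs:
        path = Path(directory).expanduser() / filename  # expand "~" to the home dir
        if path.is_file():
            found.append(path)
    return found


for path in candidate_configs():
    print(path)
```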

Note that running chalk help commands will show globally available flags, and chalk config shows common configuration variables and their current values.

By default, Chalk will happily evaluate the embedded configuration, and then a configuration on disk. You can also force Chalk to skip one or both with the flags: --no-use-embedded-config and --no-use-external-config.

In fact, if you want to force Chalk to ignore any external configuration file, you can set use_external_config: false in the embedded configuration.

If a configuration has a syntax error, Chalk will not run. For instance, if we had our config file only set color to false, but we forgot the : (or =, which also works for setting config attributes), we would get:

error: chalk: ./chalk.conf: 1:7: Parse error: Expected an assignment, unpack (no parens), block start, or expression
  color false
error: Could not load configuration files. exiting.

Overview of Reporting Templates

Reports and chalk marks decide which keys to add based on report templates and mark templates, respectively. These are essentially lists of keys to include or not include for a given situation.

While you can build your own templates, the easiest thing to do is to copy and paste the default templates into your new configuration file, change the defaults, and then load the config file. Defaults are available in src/configs/base_report_templates.c4m for report templates and in src/configs/base_chalk_templates.c4m for mark templates.

For more information on how templates can be configured manually in the configuration file, see the Config Overview Guide.

Reporting Templates for Docker Labels

The default configuration for which labels to output automatically is kept in the chalk_labels mark template, which, by default, is:

mark_template chalk_labels {
  key.AUTHOR.use                              = true
  key.BRANCH.use                              = true
  key.COMMITTER.use                           = true
  key.COMMIT_ID.use                           = true
  key.DATE_AUTHORED.use                       = true
  key.DATE_COMMITTED.use                      = true
  key.DATE_TAGGED.use                         = true
  key.ORIGIN_URI.use                          = true
  key.TAG.use                                 = true
  key.TAGGER.use                              = true
}

You can set additional chalkable keys in this template, and/or disable reporting on any of those keys.
