---
title: "Canonicalization"
description: ""
url: https://instituteofprovenance.org/docs/canonicalization
source: Institute of Provenance
---
# Canonicalization

Canonicalization is the process of normalizing content into a deterministic byte sequence before hashing. This ensures that the same logical content always produces the same BLAKE3 hash, regardless of platform, key ordering, or insignificant formatting differences.

## Purpose

Without canonicalization, semantically identical content could produce different hashes due to:

- JSON key ordering differences across serializers
- Whitespace and line-ending differences (CRLF vs LF, trailing whitespace)
- Unicode normalization forms (NFC vs NFD)
- Encoding differences (UTF-8 BOM presence)

Canonicalization eliminates these variations so that a XION artifact signed on one platform can be verified on any other.
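
The variations listed above are easy to reproduce. The following is an illustrative sketch, not part of the specification: it builds pairs of inputs that a reader would consider identical and reports which pairs differ at the byte level, and would therefore hash differently. The sample strings and the `differing_pairs` helper are hypothetical.

```python
import json
import unicodedata

# Pairs of inputs that are "the same content" to a human reader.
variants = {
    "key order": (json.dumps({"a": 1, "b": 2}), json.dumps({"b": 2, "a": 1})),
    "line endings": ("line one\nline two\n", "line one\r\nline two\r\n"),
    "unicode form": (
        unicodedata.normalize("NFC", "caf\u00e9"),  # one code point for the accented e
        unicodedata.normalize("NFD", "caf\u00e9"),  # "e" plus a combining accent
    ),
    "bom": ("hello", "\ufeffhello"),
}


def differing_pairs(pairs):
    """Return the names of pairs whose UTF-8 encodings are not byte-identical."""
    return [
        name
        for name, (left, right) in pairs.items()
        if left.encode("utf-8") != right.encode("utf-8")
    ]


print(differing_pairs(variants))
```

Every pair is reported as differing, so hashing the raw bytes would yield four mismatches for content that has not meaningfully changed.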

## Text Content

For text-based content (markdown, plain text, structured text):

1. **Strip the trust block** — Remove the embedded XION trust block before hashing (the trust block is not part of the content being signed)
2. **Normalize line endings** — Convert all line endings to LF (`\n`). Remove trailing whitespace from each line
3. **Normalize Unicode** — Apply NFC (Canonical Decomposition followed by Canonical Composition)
4. **Encode as UTF-8** — Without BOM
5. **Trim** — Remove leading and trailing whitespace from the entire document
6. **Ensure trailing newline** — The canonical form ends with exactly one LF

The resulting byte sequence is hashed with BLAKE3.
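
The text steps can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the reference implementation: `canonicalize_text` is a hypothetical name, the input is assumed to have had its trust block removed already (the block's format is not defined in this section), "whitespace" is assumed to mean ASCII space and tab (plus LF when trimming the document), and a lone CR is assumed to count as a line ending. Encoding happens last here; because only ASCII characters are trimmed, this yields the same bytes as trimming after encoding.

```python
import unicodedata


def canonicalize_text(text: str) -> bytes:
    """Return a canonical UTF-8 byte sequence for text whose trust block is already removed."""
    # Step 2: convert CRLF and lone CR to LF, then drop trailing spaces/tabs per line.
    # Treating a lone CR as a line ending is an assumption.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    text = "\n".join(line.rstrip(" \t") for line in text.split("\n"))

    # Step 3: Unicode NFC normalization.
    text = unicodedata.normalize("NFC", text)

    # Step 5: trim leading and trailing whitespace from the whole document.
    text = text.strip(" \t\n")

    # Steps 4 and 6: exactly one trailing LF, encoded as UTF-8.
    # Python's "utf-8" codec never writes a BOM.
    return (text + "\n").encode("utf-8")


canonical = canonicalize_text("  Title  \r\n\r\nbody\t\rend\n\n\n")
print(canonical)
```

Hashing is left out because BLAKE3 is not in Python's standard library; a third-party BLAKE3 binding would take the returned bytes as its input.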

## Structured Data (JSON)

For JSON content:

1. **Parse** the JSON into an in-memory representation
2. **Sort object keys** lexicographically (Unicode code point order) at all nesting levels
3. **Normalize numbers** — No leading zeros, no trailing zeros after decimal point, no positive sign
4. **Normalize strings** — NFC Unicode normalization, no unnecessary escaping (only escape characters required by JSON: `"`, `\`, and control characters)
5. **Remove insignificant whitespace** — Produce compact JSON with no indentation or extra spaces
6. **Encode as UTF-8** without BOM
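
A minimal Python sketch of the JSON steps is shown below. Several details are assumptions that this section does not specify: `canonicalize_json` and `_encode` are hypothetical names, numbers with a fraction or exponent are written in plain decimal notation, integer-valued numbers such as `2.0` are written as `2`, object keys are NFC-normalized before sorting, and `NaN`/`Infinity` are rejected. Python compares strings by code point, so `sorted` matches the required key order.

```python
import json
import unicodedata
from decimal import Decimal


def _encode(value) -> str:
    """Serialize a parsed JSON value in compact, key-sorted form."""
    if value is None:
        return "null"
    if isinstance(value, bool):  # must be checked before int: bool is an int subclass
        return "true" if value else "false"
    if isinstance(value, int):
        return str(value)
    if isinstance(value, Decimal):
        text = format(value, "f")  # plain decimal notation, no exponent (assumption)
        if "." in text:
            text = text.rstrip("0").rstrip(".")  # drop trailing zeros after the point
        return text
    if isinstance(value, str):
        # ensure_ascii=False escapes only the quote, backslash, and control characters.
        return json.dumps(unicodedata.normalize("NFC", value), ensure_ascii=False)
    if isinstance(value, list):
        return "[" + ",".join(_encode(item) for item in value) + "]"
    if isinstance(value, dict):
        items = {unicodedata.normalize("NFC", k): v for k, v in value.items()}
        members = (
            json.dumps(key, ensure_ascii=False) + ":" + _encode(items[key])
            for key in sorted(items)  # code point order
        )
        return "{" + ",".join(members) + "}"
    raise TypeError(f"unsupported JSON value: {value!r}")  # e.g. NaN or Infinity


def canonicalize_json(raw: str) -> bytes:
    """Parse JSON text and return its canonical UTF-8 bytes."""
    parsed = json.loads(raw, parse_float=Decimal)  # Decimal avoids float rounding
    return _encode(parsed).encode("utf-8")


print(canonicalize_json('{"b": 1.50, "a": [1, 2.0, true, null]}'))
```

Parsing fractional numbers as `Decimal` rather than `float` is a design choice for the sketch: it keeps the digits exactly as written, so normalization only removes redundant zeros and never rounds.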

## Binary Content

For binary content (images, video, audio):

1. **Extract the content bytes** — The raw file content excluding any XION metadata that was appended or embedded in format-specific metadata fields
2. **Hash directly** — No normalization is applied to binary content. The raw bytes are hashed as-is with BLAKE3

The boundary between "content bytes" and "XION metadata" is defined per format. For example, in JPEG files, XION metadata stored in an EXIF APP1 segment is excluded from the hash; the remaining file bytes are hashed in order.
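
To illustrate the JPEG case, the sketch below walks the segments that precede the image data and drops any APP1 segment identified as XION metadata, leaving every other byte in its original order. It rests on assumptions: `XION_SIGNATURE` is a hypothetical identifier (the actual way XION metadata is recognized is defined by the format-specific rules, not here), `jpeg_content_bytes` is a hypothetical name, and the parser is simplified (it expects every marker before the start-of-scan segment to carry a length field).

```python
import struct

# Hypothetical identifier for a XION APP1 payload; the real one is format-defined.
XION_SIGNATURE = b"XION\x00"


def jpeg_content_bytes(data: bytes) -> bytes:
    """Return JPEG bytes with XION APP1 segments removed, all else unchanged."""
    if data[:2] != b"\xff\xd8":
        raise ValueError("not a JPEG: missing SOI marker")
    out = bytearray(data[:2])
    pos = 2
    while pos + 4 <= len(data):
        if data[pos] != 0xFF:
            raise ValueError("malformed JPEG segment header")
        marker = data[pos + 1]
        if marker == 0xDA:
            # Start of scan: compressed image data follows, so copy the rest verbatim.
            break
        # The 2-byte big-endian length counts itself but not the marker.
        (length,) = struct.unpack(">H", data[pos + 2 : pos + 4])
        end = pos + 2 + length
        payload = data[pos + 4 : end]
        is_xion = marker == 0xE1 and payload.startswith(XION_SIGNATURE)
        if not is_xion:
            out += data[pos:end]
        pos = end
    out += data[pos:]
    return bytes(out)
```

The returned bytes are what would be passed to BLAKE3. Ordinary EXIF data in its own APP1 segment is kept, since only segments carrying the XION identifier are skipped.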

## Idempotency

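Canonicalization is idempotent: applying it twice produces the same result as applying it once. This holds because the output of each normalization step is already in that step's normalized form, and no later step reintroduces a variation that an earlier step removed. Determinism alone would not be sufficient.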
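
Idempotency can be checked mechanically by canonicalizing a sample, canonicalizing the result again, and comparing the two. The helper below is a small sketch of such a check. `is_idempotent` is a hypothetical name, and the two functions passed to it are simplified stand-ins for illustration, not the full canonicalization procedure.

```python
from typing import Callable, Iterable


def is_idempotent(canon: Callable[[bytes], bytes], samples: Iterable[bytes]) -> bool:
    """Return True if canon(canon(x)) == canon(x) for every sample."""
    for sample in samples:
        once = canon(sample)
        if canon(once) != once:
            return False
    return True


def normalize_newlines(data: bytes) -> bytes:
    # Stand-in step: CRLF and lone CR become LF. A second pass changes nothing.
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")


def append_newline(data: bytes) -> bytes:
    # Fails the check: every pass adds another LF.
    return data + b"\n"


samples = [b"", b"a\r\nb\r", b"already\nclean\n"]
print(is_idempotent(normalize_newlines, samples))
print(is_idempotent(append_newline, samples))
```

The first call prints `True` and the second prints `False`. A step such as "ensure trailing newline" passes this check only when it adds a newline if one is missing, rather than appending one unconditionally.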

## Test Vectors

Implementations SHOULD validate against the published test vectors to ensure canonicalization compatibility. Test vectors cover:

- Whitespace and line ending variations
- Unicode normalization edge cases (composed vs. decomposed forms)
- JSON key ordering with nested objects
- Binary content with various metadata embedding formats

Test vectors are published alongside each specification version and are available in the reference implementation repository.

