Base62 Encoding System Whitepaper

This whitepaper defines a family of eight related encoding formats — Formats A through G and Gu — built on the base62 numeral system, designed to produce compact, human-readable, time-sortable identifiers suitable for use as filenames, record markers, and universal unique identifiers.

Author: Akri (technical review agent) Date: 2026-03-29 UUID: BGk26cHiqZ01

1. Abstract

This whitepaper defines a family of eight related encoding formats — Formats A, B, C, D, E, F, G, and Gu — built on the base62 numeral system. The system was designed to produce compact, human-readable, time-sortable identifiers suitable for use as filenames, record markers, and universal unique identifiers (UUIDs) within a note-taking and knowledge management infrastructure.

Formats A through F represent the six possible permutations of three character groups (digits, lowercase letters, uppercase letters) within a standard 62-character alphabet. Format-G modifies the standard alphabet by removing two visually ambiguous characters (l and O), producing a 60-character working set that maps cleanly onto timestamp segments requiring at most 60 distinct values. Format-Gu extends Format-G into a 12-character time-based UUID (tbUUID) with a class prefix and collision-handling order number.

The system prioritises: compactness (8–12 characters for a full timestamp identifier), URL safety (no special characters), human readability (no visually confusable characters in the primary format), and deterministic decodability (every character position has a fixed semantic role).

2. Motivation

2.1 Why Base62?

The need for compact, human-readable identifiers arises in systems where filenames, record markers, and cross-references must be typed, read, and sorted by both humans and machines. Common alternatives have trade-offs:

Base16 (hexadecimal): widely understood, but produces long strings — a Unix timestamp in hex is 8 characters and carries no structured date information.
Base64: offers high density, but includes +, /, and = characters that are unsafe in filenames and URLs without escaping.
UUID v4 (128-bit, hex): 36 characters with dashes; far too long for filenames and impossible to remember or type.
ISO 8601 timestamps: human-readable but verbose (2026-03-29T03:30:00+08:00 is 25 characters) and contain characters (:, +) that are problematic in filenames.

Base62 uses exactly the 62 alphanumeric characters (0-9, a-z, A-Z). These characters are filename-safe on all major operating systems, URL-safe without percent-encoding, shell-safe without quoting, and human-typeable without shift-key symbols.

A single base62 digit encodes values 0–61, which is sufficient to represent months (1–12), days (1–31), hours (0–23), minutes (0–59), and seconds (0–59) each in a single character.

2.2 Why Multiple Formats?

The six permutation formats (A–F) exist because the ordering of the three character groups (0-9, a-z, A-Z) determines the lexicographic sort order of encoded values. By defining all six permutations explicitly, the system provides a complete catalogue of base62 alphabet orderings. Each format is self-documenting: the format letter (A–F) tells you the alphabet ordering without looking it up.

2.3 Why Format-G?

Format-G was designed for human-facing identifiers. The characters l (lowercase L) and O (uppercase O) are removed because they are visually indistinguishable from 1 and 0 in many fonts. This reduces the alphabet from 62 to 60 characters, which is still sufficient for single-character encoding of all timestamp segments — the largest segment (minutes/seconds) requires exactly 60 values: 0–59.

2.4 Why Format-Gu?

Format-Gu wraps Format-G in a UUID structure. The "u" stands for "UUID." It adds a class prefix (1 character) identifying the record type, and an order number (2 characters) for sub-second collision handling. The result is a 12-character identifier encoding: record type, full timestamp to the second, and a collision-resolution suffix.

3. Base62 Alphabet Fundamentals

3.1 Definition

Base62 is a positional numeral system with 62 symbols drawn from the ASCII alphanumeric characters:

Digits:     0 1 2 3 4 5 6 7 8 9          (10 characters)
Lowercase:  a b c d e f g h i j k l m n o p q r s t u v w x y z   (26 characters)
Uppercase:  A B C D E F G H I J K L M N O P Q R S T U V W X Y Z   (26 characters)
Total:      10 + 26 + 26 = 62 characters

3.2 Comparison with Other Bases

Property	Base16	Base58	Base62	Base64
Character set size	16	58	62	64
Filename-safe	Yes	Yes	Yes	No (`/`, `=`)
URL-safe	Yes	Yes	Yes	No (`+`, `/`, `=`)
Visually unambiguous	Mostly	Yes	No	No
Bits per character	4.00	5.86	5.95	6.00
Chars for 64-bit value	16	11	11	11

Base62 offers nearly the same information density as base64 (5.95 bits/char vs. 6.00) while remaining safe for filenames, URLs, and shell arguments without escaping.

4. The Three-Group Permutation Model (Formats A–F)

4.1 All Six Permutations

Format	Group order	Full alphabet (62 characters)
A	`[0-9][a-z][A-Z]`	`0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ`
B	`[0-9][A-Z][a-z]`	`0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ abcdefghijklmnopqrstuvwxyz`
C	`[a-z][0-9][A-Z]`	`abcdefghijklmnopqrstuvwxyz0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ`
D	`[a-z][A-Z][0-9]`	`abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789`
E	`[A-Z][0-9][a-z]`	`ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789abcdefghijklmnopqrstuvwxyz`
F	`[A-Z][a-z][0-9]`	`ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789`

4.2 Timestamp Structure (Formats A–F)

All six formats share the same 8-character timestamp structure {F}{YYY}{M}{D}{h}{m}{s} where the year segment uses 3 characters: a century symbol (year divided by 100, encoded as a single base62 character) followed by the two-digit year modulo 100. Each single-character segment (month, day, hour, minute, second) is encoded by looking up the numeric value in the format's alphabet.

5. Format-G — Visual-Ambiguity-Free Encoding

5.1 Alphabet

Format-G uses 60 characters in the following order:

0123456789abcdefghijkmnopqrstuvwxyzABCDEFGHIJKLMNPQRSTUVWXYZ

This is the Format-A ordering with l removed from the lowercase group and O removed from the uppercase group.

5.2 Why 60 Is Sufficient

Segment	Values needed	Available (60)	Sufficient?
Century-sym	Up to 62 (theoretical)	60	Yes (covers years 0–5999)
Month	12	60	Yes
Day	31 (+ 1 reserved)	60	Yes
Hour	24	60	Yes
Minute	60	60	Yes (exact fit)
Second	60	60	Yes (exact fit)

5.3 Timestamp Structure

G{YYY}{M}{D}{h}{m}{s}

Total: 9 characters.

Position	Segment	Description
1	Prefix	Literal `G`
2	Century-sym	`year // 100` — character from the Format-G alphabet
3–4	2-digit year	`year % 100`, zero-padded decimal
5	Month	1-character encoding (months 1–6 map to `a`–`f`; months 7–12 map to `A`–`F`)
6	Day	1-character encoding (days 1–10 map to `0`–`9`; 11–21 map to `a`–`k`; 22–31 map to `A`–`J`)
7	Hour	1-character encoding (hour 0 maps to `0`; 1–12 map to `a`–`l`; 13–23 map to `A`–`K`)
8	Minute	Format-G 60-char alphabet positional lookup
9	Second	Format-G 60-char alphabet positional lookup

5.4 Verified Example

Input timestamp: 2026-03-10 04:00:45

Segment	Value	Result
Prefix	—	`G`
Century-sym	`2026 // 100 = 20` — alphabet position 20 — `k`	`k`
2-digit year	`2026 % 100 = 26`	`26`
Month	3 (March) maps to `c`	`c`
Day	10 maps to `9`	`9`
Hour	4 maps to `d`	`d`
Minute	0 maps to `0`	`0`
Second	45 maps to `K`	`K`

Result: Gk26c9d0K (9 characters)

6. Format-Gu — Time-Based UUID (tbUUID)

6.1 Structure

{C}G{YYY}{M}{D}{h}{m}{s}{XX}

Total length: 12 characters.

Position	Length	Segment	Description
1	1	Class indicator (C)	Single uppercase letter identifying the record type
2	1	Format marker	Literal `G`
3–5	3	Year (YYY)	Century-sym + 2-digit year
6	1	Month (M)	Format-G month encoding
7	1	Day (D)	Format-G day encoding
8	1	Hour (h)	Format-G hour encoding
9	1	Minute (m)	Format-G minute/second encoding
10	1	Second (s)	Format-G minute/second encoding
11–12	2	Order number (XX)	Collision-handling suffix, default `01`

6.2 Class Indicator

The class indicator is a single character (typically an uppercase letter) that identifies the type of record the UUID refers to. This enables routing, filtering, and display logic to operate on the UUID alone without reading the referenced file. All Format-Gu UUIDs match the regex pattern [A-Z]G[0-9a-zA-Z]{10}.

6.3 Order Number

The order number (positions 11–12) handles the case where multiple records are created within the same second. The default is 01. On collision, it increments through the sequence 01–09, 0a–0z, 0A–0Z, 10–..., providing 3,844 possible values per class per second.

6.4 UUID Immutability

Once a Format-Gu UUID is assigned to a record, it is permanent. The UUID must never be changed, reassigned, or reused, even when the associated file is moved or renamed. This preserves git history integrity and cross-reference stability.

7. Comparison Table — All Eight Formats

Property	A–F	Format-G	Format-Gu
Alphabet size	62	60	60
Excluded chars	None	`l`, `O`	`l`, `O`
Total length	9 chars	9 chars	12 chars
Order number	No	No	Yes (2 chars)
Visual ambiguity	Possible	Eliminated	Eliminated
Use case	General encoding	Human-facing IDs	Record UUIDs

8. Encoding and Decoding Algorithms

8.1 Format-G Encoding (Pseudocode)

function encode_format_g(datetime):
    ALPHABET = "0123456789abcdefghijkmnopqrstuvwxyzABCDEFGHIJKLMNPQRSTUVWXYZ"
    MONTH_MAP = {1:'a',2:'b',3:'c',4:'d',5:'e',6:'f',
                 7:'A',8:'B',9:'C',10:'D',11:'E',12:'F'}
    // DAY: 1-10 -> '0'-'9'; 11-21 -> 'a'-'k'; 22-31 -> 'A'-'J'
    // HOUR: 0->'0'; 1-12->'a'-'l'; 13-23->'A'-'K'
    // MINUTE/SECOND: ALPHABET[value]

    century_char = ALPHABET[datetime.year // 100]
    year_2digit = zero_pad(datetime.year % 100, 2)
    return "G" + century_char + year_2digit + MONTH_MAP[month]
           + DAY_MAP[day] + HOUR_MAP[hour]
           + ALPHABET[minute] + ALPHABET[second]

8.2 Format-Gu Encoding (Pseudocode)

function encode_format_gu(datetime, class_char, existing_uuids):
    prefix = class_char + encode_format_g(datetime)  // 10 chars
    order = 1
    while (prefix + format_order(order)) in existing_uuids:
        order += 1
    return prefix + format_order(order)  // 12 chars

function format_order(n):
    ORDER_CHARS = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"
    return ORDER_CHARS[n // 62] + ORDER_CHARS[n % 62]
    // n=1 -> "01", n=9 -> "09", n=10 -> "0a", n=36 -> "0A"