DOTSV Whitepaper
DOTSV (Database Oriented Tab Separated Vehicle) is a single-line-per-record flat-file database format designed for high processing rate, low memory footprint, git-traceable diffs, full Unicode support, and human readability.
Version: 0.1 Draft
File extensions: *.dotsv, *.dov
MIME type: text/dotsv
1. Overview
DOTSV is a single-line-per-record, flat-file database format designed for:
- High processing rate via
mmapand zero-copy parsing - Low memory footprint — data is borrowed directly from the memory-mapped buffer
- Git-traceable diffs — deterministic record ordering produces minimal, meaningful changesets
- Full Unicode support — CJK, emoji, accented characters pass through unescaped
- Human readability — plain text, editable in any text editor
A .dov file is a valid UTF-8 text file. Every tool that operates on text (grep, awk, diff, cat) works on DOTSV without modification.
2. Record Format
Each record occupies exactly one line (\n-terminated). A record consists of a fixed-width base62-Gu UUID followed by one or more tab-separated key-value pairs.
<12-char-base62-Gu-uuid> <key=value> <key=value> ...\n
Example
NGk26cHcv001 name=Alice city=Tokyo age=30
NGk26cHdn002 name=Bob city=Osaka msg=hello world
EGk26cICK001 name=Carol city=London note=a\x3Db
3. UUID Column — Base62-Gu
The first column is always a 12-character base62-Gu UUID — a time-based, class-prefixed identifier defined by the Base62 Encoding System whitepaper (Format-Gu).
3.1 Structure
{class}{G}{century}{year2d}{month}{day}{hour}{minute}{second}{order2}
1 1 1 2 1 1 1 1 1 2 = 12 chars
| Position | Length | Segment | Example | Notes |
|---|---|---|---|---|
| 1 | 1 | Class prefix | N |
Record type (uppercase A–Z) |
| 2 | 1 | Format marker | G |
Always literal G |
| 3 | 1 | Century | k |
2026 // 100 = 20 — Format-G char |
| 4–5 | 2 | Year (2d) | 26 |
Decimal year mod 100 |
| 6 | 1 | Month | c |
Format-G month table |
| 7 | 1 | Day | H |
Format-G day table |
| 8 | 1 | Hour | c |
Format-G hour table |
| 9 | 1 | Minute | v |
Format-G 60-char alphabet |
| 10 | 1 | Second | 0 |
Format-G 60-char alphabet |
| 11–12 | 2 | Order number | 01 |
Collision resolution (01–99, then base62) |
3.2 Properties
| Property | Specification |
|---|---|
| Width | Exactly 12 bytes (fixed) |
| Character set | [A-Z]G[0-9a-km-zA-NP-Z]{8}[0-9a-zA-Z]{2} |
| Byte position | line[0..12] |
| Followed by | \t at byte 12 |
Fixed width enables direct slicing (&line[..12]) without scanning.
3.3 Sort Order Semantics
Records naturally group and order as:
- By class — all
E-class records sort beforeN-class, etc. - By time within class — the embedded timestamp sorts chronologically within each class prefix.
- By order within second — the 2-char suffix breaks ties for records created in the same second.
3.4 Format-G Alphabet
0123456789abcdefghijkmnopqrstuvwxyzABCDEFGHIJKLMNPQRSTUVWXYZ
This ensures UUIDs are unambiguous when read, typed, or printed in any font.
4. Key-Value Pairs
After the UUID and its trailing tab, the remainder of the line is a sequence of key=value pairs delimited by \t.
- Keys and values are UTF-8 strings.
- Keys MUST NOT be empty.
- Values MAY be empty (e.g.,
tag=). - Key order within a record is not significant for semantics, but implementations SHOULD maintain a consistent order for git diff stability.
5. Escaping Rules
DOTSV uses backslash-hex escaping. Only four bytes require escaping when they appear inside a key or value:
| Byte | Escaped Form | Reason |
|---|---|---|
\n |
\x0A |
Record delimiter |
\t |
\x09 |
Field delimiter |
= |
\x3D |
Key-value separator |
\ |
\\ |
Escape character itself |
Optional
| Byte | Escaped Form | Reason |
|---|---|---|
\r |
\x0D |
Windows line-ending safety |
What is NOT escaped
Everything else passes through literally: CJK characters (Tokyo, Osaka), emoji, accented Latin characters, spaces, punctuation, and all other printable Unicode. Most real-world values need zero escaping, enabling the zero-copy fast path: if no \ is present in a slice, borrow it directly from the underlying buffer.
6. File Structure
A .dov file consists of two sections separated by a single blank line:
<sorted section>
<pending section>
6.1 Sorted Section
Records sorted lexicographically by UUID (ASCII byte-order comparison of the 12-character base62-Gu string). Because the class prefix is the first character and the timestamp follows, records naturally group by class and sort chronologically within each class.
Properties:
- O(log n) lookup via binary search on
mmap - Deterministic ordering — insert position is defined by UUID, not by insertion time
- Stable git diffs — new records appear as single inserted lines in predictable locations
6.2 Pending Section
A write-ahead buffer of uncommitted operations, appended after the blank line separator. Each line is prefixed with a single-byte opcode:
+<uuid> <key=val> ...\n insert
-<uuid>\n delete
~<uuid> <key=newval> ...\n patch (changed pairs only)
| Prefix | Operation | Payload |
|---|---|---|
+ |
Insert | UUID + all KV pairs |
- |
Delete | UUID only |
~ |
Patch | UUID + changed KV pairs |
Read path: Binary search the sorted section, then linear scan the pending section for overrides. Write path: Always append to the pending section — O(1). Compaction: Merge the pending section into the sorted section when it exceeds a threshold (e.g., 100 lines). This is a single sequential read + write pass.
7. Comments and Blank Lines
- Lines starting with
#are comments and MUST be ignored by parsers. - Blank lines within the sorted section are not permitted (the first blank line marks the section boundary).
- Additional blank lines after the section separator are ignored.
8. In-Place Modification
When modifying a value in the sorted section, two strategies apply:
| Condition | Strategy |
|---|---|
| New line length is less than or equal to old line length | Overwrite in place, pad with trailing spaces before \n |
| New line length is greater than old line length | Append a ~ patch to the pending section |
The padding strategy works because the KV parser trims trailing whitespace from the last value. This avoids file rewrites for most edits, since values typically stay similar in length.
9. Parsing — Zero-Copy Fast Path
The core parse loop in Rust:
fn parse_record(line: &str) -> (&str, Vec<(&str, &str)>) {
let uuid = &line[..12];
let kvs = line[13..] // skip uuid + \t
.split('\t')
.filter_map(|pair| pair.split_once('='))
.collect();
(uuid, kvs)
}
splitandsplit_oncearememchr-accelerated in Rust's standard library.- When no
\appears in a value slice, the slice is a direct borrow from themmap— zero allocation. - When
\is detected, only that value is decoded into aCow<str>.
Binary Search on mmap
fn find_record<'a>(mmap: &'a [u8], target: &str) -> Option<&'a str> {
let text = std::str::from_utf8(mmap).ok()?;
let mut lo = 0usize;
let mut hi = text.len();
while lo < hi {
let mid = (lo + hi) / 2;
let line_start = match text[..mid].rfind('\n') {
Some(p) => p + 1,
None => 0,
};
let uuid = &text[line_start..line_start + 12];
if uuid < target {
lo = text[mid..].find('\n').map(|p| mid + p + 1).unwrap_or(hi);
} else if uuid > target {
hi = line_start;
} else {
let line_end = text[line_start..].find('\n')
.map(|p| line_start + p)
.unwrap_or(text.len());
return Some(&text[line_start..line_end]);
}
}
None
}
10. Design Rationale
| Goal | Mechanism |
|---|---|
| Fast query | Sorted UUIDs + mmap + binary search = O(log n) |
| Fast parse | Tab-split is memchr-SIMD accelerated; fixed-width 12-char UUID avoids scanning |
| Low memory | Zero-copy borrows from mmap; no HashMap, no deserialization tree |
| Fast writes | O(1) append to pending section |
| Fast deletes | O(1) append - opcode to pending section |
| Fast modify | In-place overwrite when size fits; else O(1) patch append |
| Git-traceable | Sorted records = one-line insertion diffs; deterministic key order within records |
| Human-readable | Plain UTF-8 text; minimal escaping; tab-aligned columns in most editors |
| Full Unicode | Only 4 control bytes escaped; all other Unicode is literal |
11. Concurrency — Lock File Protocol
DOTSV defines a companion lock file for coordinating concurrent access by multiple tsdb instances.
11.1 Lock File
For every .dov file, a corresponding .dov.lock file serves as both a kernel-level lock target and a queue manifest.
target.dov data
target.dov.lock flock() target + queue manifest
Why a separate file: The atomic write strategy replaces the .dov via temp-file to rename, which invalidates any flock() held on the original fd. The .lock file is stable — never renamed, never rewritten during data operations.
11.2 Lock File Contents
The lock file contains one line per queued tsdb instance:
<status> <process_id> <uuid1>,<uuid2>,... <timestamp>\n
| Field | Spec |
|---|---|
| Status | EXEC (currently running) or WAIT (queued) |
| Process ID | 16 lowercase hex chars, randomly generated at startup |
| UUID list | Comma-separated target UUIDs extracted from action file |
| Timestamp | Unix epoch seconds, refreshed periodically by EXEC |
11.3 Conflict Detection
The rule: your UUID set intersected with any queued UUID set must be empty, or conflict is detected.
Opcodes are irrelevant — any queued operation ahead of you may alter the record's state before your turn arrives.
11.4 Stale Entry Eviction
Any entry whose timestamp exceeds a configurable staleness threshold (default: 30 seconds) is automatically evicted. This prevents a crashed writer from blocking the queue indefinitely.