DOTSV — Database Oriented Tab Separated Vehicle

Version: 0.2 File extensions: *.dotsv, *.dov MIME type: text/dotsv

Revision history: - 0.1 — initial release - 0.2 — timestamp tracking; related formats atsv, rtsv, qtsv

1. Overview

DOTSV is a single-line-per-record, flat-file database format designed for:

High processing rate via mmap and zero-copy parsing
Low memory footprint — data is borrowed directly from the memory-mapped buffer
Git-traceable diffs — deterministic record ordering produces minimal, meaningful changesets
Full Unicode support — CJK, emoji, accented characters pass through unescaped
Human readability — plain text, editable in any text editor

A .dov file is a valid UTF-8 text file. Every tool that operates on text (grep, awk, diff, cat) works on DOTSV without modification.

2. Record Format

Each record occupies exactly one line (\n-terminated). A record consists of a fixed-width base62-Gu UUID followed by one or more tab-separated key-value pairs.

<12-char-base62-Gu-uuid>\t<key=value>\t<key=value>\t...\n

Example

NGk26cHcv001    name=Alice  city=東京 age=30
NGk26cHdn002    name=Bob    city=大阪 msg=hello world
EGk26cICK001    name=Carol  city=London note=a\x3Db

3. UUID Column — Base62-Gu

The first column is always a 12-character base62-Gu UUID — a time-based, class-prefixed identifier defined by the Base62 Encoding System whitepaper (Format-Gu).

3.1 Structure

{class}{G}{century}{year2d}{month}{day}{hour}{minute}{second}{order2}
  1      1    1       2      1     1    1      1       1        2     = 12 chars

Position	Length	Segment	Example	Notes
1	1	Class prefix	`N`	Record type (uppercase A–Z)
2	1	Format marker	`G`	Always literal `G`
3	1	Century	`k`	`2026 // 100 = 20` → Format-G char
4–5	2	Year (2d)	`26`	Decimal year mod 100
6	1	Month	`c`	Format-G month table
7	1	Day	`H`	Format-G day table
8	1	Hour	`c`	Format-G hour table
9	1	Minute	`v`	Format-G 60-char alphabet
10	1	Second	`0`	Format-G 60-char alphabet
11–12	2	Order number	`01`	Collision resolution (01–99, then base62)

3.2 Properties

Property	Specification
Width	Exactly 12 bytes (fixed)
Character set	`[A-Z]G[0-9a-km-zA-NP-Z]{8}[0-9a-zA-Z]{2}`
Byte position	`line[0..12]`
Followed by	`\t` at byte 12

Fixed width enables direct slicing (&line[..12]) without scanning. The tab at byte 12 serves as a sanity check, not a parsing necessity.

3.3 Sort Order Semantics

Because UUIDs are sorted lexicographically (ASCII byte order), records naturally group and order as:

By class — all E-class records sort before N-class, etc.
By time within class — the embedded timestamp sorts chronologically within each class prefix.
By order within second — the 2-char suffix breaks ties for records created in the same second.

3.4 Format-G Alphabet

The Format-G alphabet removes visually ambiguous characters l (lowercase L) and O (uppercase O), producing a 60-character set:

0123456789abcdefghijkmnopqrstuvwxyzABCDEFGHIJKLMNPQRSTUVWXYZ

4. Key-Value Pairs

After the UUID and its trailing tab, the remainder of the line is a sequence of key=value pairs delimited by \t.

Keys and values are UTF-8 strings.
Keys MUST NOT be empty.
Values MAY be empty (e.g., tag=).
Key order within a record is not significant for semantics, but implementations SHOULD maintain a consistent order for git diff stability.

5. Escaping Rules

DOTSV uses backslash-hex escaping. Only four bytes require escaping when they appear inside a key or value:

Byte	Escaped Form	Reason
`\n`	`\x0A`	Record delimiter
`\t`	`\x09`	Field delimiter
`=`	`\x3D`	Key-value separator
`\`	`\\`	Escape character itself

Optional

Byte	Escaped Form	Reason
`\r`	`\x0D`	Windows line-ending safety

What is NOT escaped

Everything else passes through literally: CJK characters, emoji, accented Latin, spaces, punctuation. Most real-world values need zero escaping, enabling the zero-copy fast path.

6. File Structure

A .dov file consists of two sections separated by a single blank line:

<sorted section>

<pending section>
# YYYYDDMMhhmmss

6.1 Sorted Section

Records sorted lexicographically by UUID. Properties:

O(log n) lookup via binary search on mmap
Deterministic ordering — insert position is defined by UUID, not insertion time
Stable git diffs — new records appear as single inserted lines in predictable locations

6.2 Pending Section

A write-ahead buffer of uncommitted operations, appended after the blank line separator:

+<uuid>\t<key=val>\t...\n       insert
-<uuid>\n                       delete
~<uuid>\t<key=newval>\t...\n    patch (changed pairs only)

Read path: Binary search sorted section, then linear scan pending for overrides. Write path: Always append — O(1). Compaction: Merge pending into sorted when it exceeds threshold (default 100 lines).

7. Comments and Blank Lines

Lines starting with # are comments and MUST be ignored by parsers.
Blank lines within the sorted section are not permitted (the first blank line marks the section boundary).
Additional blank lines after the section separator are ignored.

8. In-Place Modification

Condition	Strategy
New line ≤ old line length	Overwrite in place, pad with trailing spaces
New line > old line length	Append a `~` patch to the pending section

9. Parsing — Zero-Copy Fast Path

fn parse_record(line: &str) -> (&str, Vec<(&str, &str)>) {
    let uuid = &line[..12];
    let kvs = line[13..]
        .split('\t')
        .filter_map(|pair| pair.split_once('='))
        .collect();
    (uuid, kvs)
}

split and split_once are memchr-accelerated in Rust's standard library.
When no \ appears in a value slice, the slice is a direct borrow from the mmap — zero allocation.

10. Design Rationale

Goal	Mechanism
Fast query	Sorted UUIDs + `mmap` + binary search = O(log n)
Fast parse	Tab-split is `memchr`-SIMD accelerated; fixed-width UUID avoids scanning
Low memory	Zero-copy borrows from `mmap`; no HashMap, no deserialization tree
Fast writes	O(1) append to pending section
Git-traceable	Sorted records = one-line insertion diffs
Human-readable	Plain UTF-8 text; minimal escaping
Full Unicode	Only 4 control bytes escaped; all other Unicode is literal

11. Concurrency — Lock File Protocol

For every .dov file, a corresponding .dov.lock file serves as both a kernel-level lock target and a queue manifest.

11.1 Lock File Contents

<status>\t<process_id>\t<uuid1>,<uuid2>,...\t<timestamp>\n

Field	Spec
Status	`EXEC` (currently running) or `WAIT` (queued)
Process ID	16 lowercase hex chars, randomly generated at startup
UUID list	Comma-separated target UUIDs from the action file
Timestamp	Unix epoch seconds, used for heartbeat / stale detection

11.2 Conflict Detection

Conflict  =  your_uuids  ∩  any_queued_uuids  ≠  ∅

Opcodes are irrelevant to conflict detection. On conflict, tsdb exits immediately without joining the queue.

11.3 Crash Safety

flock() is released automatically by the kernel when a process exits, even on SIGKILL. Stale entries (timestamp older than 30 seconds) are evicted by the next WAIT process.

11.4 Locking Granularity

The flock() on .dov.lock is held only for microseconds — just long enough to read/write the manifest. Actual .dov processing happens entirely outside the lock.

12. Limitations

Not suitable for values larger than ~1 MB (entire record must fit one line).
No built-in indexing beyond UUID — secondary key lookup requires full scan or an external index (--relate).
Concurrent writers that touch the same UUIDs are rejected.
The pending section must be periodically compacted to maintain read performance.

13. Companion Files

File	Purpose
`target.dov`	The database file
`target.dov.lock`	Queue manifest and `flock()` target
`target.dov.tmp`	Transient temp file during atomic write/compact
`target.kv.rtv`	Key-value inverted index (generated by --relate)
`target.vk.rtv`	Value-key inverted index (generated by --relate)

14. File Extension and Identification

Property	Value
Full name	Database Oriented Tab Separated Vehicle
Abbreviation	DOTSV
Extensions	`.dotsv`, `.dov`
Encoding	UTF-8 (no BOM)
Line ending	`\n` (LF only)

15. Timestamp Tracking

Every successful write to a .dov file appends a timestamp comment as the final line:

# YYYYDDMMhhmmss

The timestamp is in UTC. Field layout: year (4), day (2), month (2), hour (2), minute (2), second (2). Example: # 20262903143022.

This comment line is ignored by all existing DOTSV parsers (see §7).

15.1 Scope

The timestamp is written after every operation that produces a new .dov file:

Normal action file execution (tsdb <target.dov> <action.atv>)
Compaction (tsdb --compact <target.dov>)
After --relate completes its implicit compaction

15.2 Compaction Behaviour

--compact merges the pending section into the sorted section. The resulting file retains exactly the last timestamp line — all earlier accumulated timestamps are discarded. A new timestamp is appended reflecting the compaction time.

16. Related Formats

Format	Extension	Full name	Role	Hand-authored?
`atsv`	`*.atv`	Action Tab Separated Vehicle	Input file for standard `tsdb` write operations	Yes
`rtsv`	`*.rtv`	Relation Tab Separated Vehicle	Generated inverted index over a `.dov` file	No
`qtsv`	`*.qtv`	Query Tab Separated Vehicle	Input file for `tsdb --query` mode	Yes

`atsv` (Action TSV)

Formalises the existing action file as a named format. Each line is an opcode-prefixed DOTSV record using +, -, ~, or !. The parser is byte-identical to the DOTSV pending section parser.

`rtsv` (Relation TSV)

A generated flat three-column index. Two variants per .dov:

Variant	Filename	Column order
Key-Value	`<target>.kv.rtv`	key, value, uuid-list
Value-Key	`<target>.vk.rtv`	value, key, uuid-list

Rows sorted lexicographically by col 1, then col 2. UUID array in col 3 is ,-separated. Last line is a # YYYYDDMMhhmmss timestamp matching the source .dov. Generated by tsdb --relate; never hand-authored.

`qtsv` (Query TSV)

Input format for tsdb --query. Optional first line declares # mode\tunion or # mode\tintersect (default: intersect). Criterion lines:

Form	Syntax	Lookup
Bare token	`<token>`	Searches both `kv.rtv` col 1 and `vk.rtv` col 1
Key + value	`<key>\t<value>`	Exact pair lookup in `kv.rtv`