================================================================================
FINAL_SELF_GATHERING_CODE_SPEC_5_12_26.txt
CollabORhythm / Collabtunes — Engineering Blueprint Phase
Generated: 5.12.26 | Black Claude — Blueprint Session
STATUS: ENGINEERING PREP — no final code written yet
PURPOSE: Master spec for the 2 self-gathering code systems
================================================================================

THIS DOCUMENT IS THE AUTHORITY OVER THE TWO FLOWMAPS.
If a flowmap conflicts with this spec, this spec wins.

================================================================================
SECTION 1 — WHAT ARE THE TWO SELF-GATHERING CODES?
================================================================================

SELF-GATHERING CODE 1 (SGC-1) — LIVE SITE GATHERER
  Mission: Crawl the live collabtunes.com site and produce a structured,
           machine-readable snapshot of all live pages, URLs, nav structure,
           routing, and page metadata.
  Primary output: A current-state map of the live site at run time.
  Consumer: Production Claude, generators, QA tools, deployment checklist systems.

SELF-GATHERING CODE 2 (SGC-2) — REPOSITORY GATHERER
  Mission: Scan the local project repository (folder structure + all ZIPs and TXTs)
           and produce a structured, machine-readable inventory of all project
           assets, their authority level, and their dependency relationships.
  Primary output: A current-state map of what exists in the repo at run time.
  Consumer: Production Claude, handoff packages, session context loaders.

COMBINED PURPOSE:
  When both codes run, Production Claude has a complete dual-sided picture:
  - What exists on the live site (SGC-1)
  - What exists in the project files (SGC-2)
  This eliminates the need to manually re-read 28+ TXT files at the start of
  each new session. Claude receives structured data, not raw text dumps.

================================================================================
SECTION 2 — SHARED DESIGN PRINCIPLES (Both codes follow these)
================================================================================

PRINCIPLE 1 — NEVER DESTROY, NEVER OVERWRITE
  Neither code may delete, rename, or modify any existing file.
  All output is additive — new files only, with timestamped names.
  Naming rule: [CODE_ID]_[OUTPUT_NAME]_[DATE]_[TIME].json (or .txt if JSON is not viable)
  Example: SGC1_LIVE_SITE_SNAPSHOT_5_12_26_1430.json
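
  A minimal sketch of the naming rule and the no-overwrite guarantee, not final
  code. It assumes the middle segment of the name is a descriptive label (as in
  the example above), that dates are written M_D_YY, and that times are HHMM.
  The function names output_name and write_new_file are hypothetical.

```python
from datetime import datetime
from pathlib import Path


def output_name(code_id: str, label: str, when: datetime, ext: str = "json") -> str:
    """Build a timestamped name such as SGC1_LIVE_SITE_SNAPSHOT_5_12_26_1430.json."""
    # Assumed styles: date as M_D_YY, time as HHMM (inferred from the example).
    date_part = f"{when.month}_{when.day}_{when:%y}"
    time_part = f"{when:%H%M}"
    return f"{code_id}_{label}_{date_part}_{time_part}.{ext}"


def write_new_file(directory: Path, name: str, text: str) -> Path:
    """Create a new file; raise FileExistsError instead of overwriting."""
    path = directory / name
    # Mode "x" is exclusive creation: open() fails if the path already exists.
    with open(path, "x", encoding="utf-8") as handle:
        handle.write(text)
    return path
```

  Exclusive-create mode makes the "new files only" rule a property of the write
  call itself, so a name collision stops the run instead of replacing a file.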

PRINCIPLE 2 — CHECKPOINT BEFORE FULL RUN
  Each code writes a CHECKPOINT file at the start of every run and updates it
  after each completed step.
  If the run fails mid-way, the checkpoint holds the partial results gathered so far.
  Checkpoint format: [CODE_ID]_CHECKPOINT_[DATE]_[TIME]_PARTIAL.json
  On the next run, the code detects the checkpoint and resumes from the last
  completed step.
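
  A minimal sketch of checkpoint-and-resume, not final code. It assumes a run is
  a sequence of named steps that each return a list of items, and that the
  checkpoint stores "completed_steps" and "partial_items". Those field names and
  all function names here are hypothetical.

```python
import json
from pathlib import Path


def load_checkpoint(path: Path) -> dict:
    """Return saved progress, or an empty progress record if none exists."""
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    return {"completed_steps": [], "partial_items": []}


def save_checkpoint(path: Path, state: dict) -> None:
    """Write to a temp file, then swap it in, so a crash cannot leave half a file."""
    tmp = path.with_name(path.name + ".tmp")
    tmp.write_text(json.dumps(state, indent=2), encoding="utf-8")
    tmp.replace(path)


def run_steps(steps: dict, checkpoint_path: Path) -> dict:
    """Run named steps in order, skipping any the checkpoint already lists."""
    state = load_checkpoint(checkpoint_path)
    save_checkpoint(checkpoint_path, state)  # checkpoint exists from the start
    for name, step in steps.items():
        if name in state["completed_steps"]:
            continue  # finished in an earlier run
        state["partial_items"].extend(step())
        state["completed_steps"].append(name)
        save_checkpoint(checkpoint_path, state)
    return state
```

  If a step raises, the exception propagates and the checkpoint on disk still
  lists every step that finished before it.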

PRINCIPLE 3 — IDEMPOTENT OPERATION
  Running either code twice in a row on the same data produces the same output
  content; only the run ID, timestamps, and timestamped filenames differ.
  No side effects. No state mutation. Read-only everywhere except output files.
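
  One way to check idempotence is to fingerprint the output with the run-specific
  fields removed. This is a sketch under assumptions: the set of fields treated
  as run-specific (run_id, run_timestamp, run_log) and the function name
  content_fingerprint are not defined by this spec.

```python
import hashlib
import json

# Fields assumed to differ legitimately between two runs on identical data.
VOLATILE_KEYS = {"run_id", "run_timestamp", "run_log"}


def content_fingerprint(run_output: dict) -> str:
    """Hash the output with run-specific fields stripped and keys sorted."""
    stable = {k: v for k, v in run_output.items() if k not in VOLATILE_KEYS}
    canonical = json.dumps(stable, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

  Two runs over unchanged data should yield equal fingerprints; a mismatch means
  either the data changed or the code is not idempotent.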

PRINCIPLE 4 — AUTHORITY-AWARE OUTPUT
  Every item in both outputs carries an authority tag:
    LOCKED       = confirmed canon — do not alter
    AUTHORITATIVE = current best source — use this
    REFERENCE    = useful context, partially superseded
    DEPRECATED   = keep for rollback, never use as source
    CONFLICT     = two or more sources disagree — flag for Tom
    UNKNOWN      = no authority designation available — flag
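
  The six tags can be held in a single enumeration so that both codes share one
  vocabulary. A minimal sketch; the names Authority, NEEDS_FLAG, and
  parse_authority are hypothetical, and mapping unrecognized text to UNKNOWN is
  an assumption based on the UNKNOWN definition above.

```python
from enum import Enum


class Authority(str, Enum):
    LOCKED = "LOCKED"                # confirmed canon, do not alter
    AUTHORITATIVE = "AUTHORITATIVE"  # current best source
    REFERENCE = "REFERENCE"          # useful context, partially superseded
    DEPRECATED = "DEPRECATED"        # keep for rollback only
    CONFLICT = "CONFLICT"            # sources disagree
    UNKNOWN = "UNKNOWN"              # no designation available


# Tags that Principle 4 says must be flagged for review.
NEEDS_FLAG = {Authority.CONFLICT, Authority.UNKNOWN}


def parse_authority(raw: str) -> Authority:
    """Map text to a tag; anything unrecognized becomes UNKNOWN, never a guess."""
    try:
        return Authority(raw.strip().upper())
    except ValueError:
        return Authority.UNKNOWN
```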

PRINCIPLE 5 — DEDUPE LOGIC
  Both codes must identify and deduplicate:
  - SGC-1: same URL appearing in multiple nav sources (homepage vs 128-nav vs quicklinks)
  - SGC-2: same document topic appearing in multiple files (VOL1/VOL2/VOL3 pattern)
  Deduplication does NOT delete anything. It tags the canonical source and marks
  all others as SUPERSEDED or REFERENCE.
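
  For SGC-1, non-destructive dedupe can be sketched as merging every sighting of
  a URL into one record that keeps all of its nav sources. The normalization
  rule here (lowercase scheme and host, force a trailing slash) is an assumption
  that fits directory-style slugs such as /song-list-1/; the function names are
  hypothetical.

```python
from urllib.parse import urlsplit


def canonical_url(url: str) -> str:
    """Normalize scheme/host case and the trailing slash (assumed rule)."""
    parts = urlsplit(url.strip())
    path = parts.path if parts.path.endswith("/") else parts.path + "/"
    return f"{parts.scheme.lower()}://{parts.netloc.lower()}{path}"


def dedupe_by_url(sightings):
    """Merge (url, nav_source) pairs into one record per URL, keeping all sources."""
    merged = {}
    for url, source in sightings:
        key = canonical_url(url)
        record = merged.setdefault(key, {"url": key, "nav_sources": []})
        if source not in record["nav_sources"]:
            record["nav_sources"].append(source)
    return list(merged.values())
```

  Nothing is dropped: a URL seen on the homepage and in the 128-nav becomes one
  item whose nav_sources lists both.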

PRINCIPLE 6 — CONFLICT SURFACING
  Neither code resolves conflicts. Both codes surface them.
  Conflict format: CONFLICT_ITEM | SOURCE_A | SOURCE_B | CONFLICT_DESCRIPTION
  Conflicts are written to a separate CONFLICT_LOG in the output.

PRINCIPLE 7 — HUMAN-READABLE + MACHINE-READABLE
  All outputs are dual-format:
    Primary:   Structured JSON (machine-readable, for Claude context injection)
    Secondary: Formatted TXT (human-readable, for Tom and manual review)
  Both are written on every run.
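
  A minimal sketch of producing both renderings from one in-memory result, so
  the TXT and JSON cannot drift apart. The TXT layout is an assumption (this
  spec does not define one), and render_json and render_txt are hypothetical
  names.

```python
import json


def render_json(run_output: dict) -> str:
    """Primary format: structured JSON for context injection."""
    return json.dumps(run_output, indent=2, ensure_ascii=False)


def render_txt(run_output: dict) -> str:
    """Secondary format: a plain summary for manual review (assumed layout)."""
    lines = [
        f"RUN ID: {run_output['run_id']}",
        f"RUN MODE: {run_output['run_mode']}",
        f"ITEMS FOUND: {len(run_output['items'])}",
        "",
    ]
    for item in run_output["items"]:
        # SGC-1 items carry a url; SGC-2 items carry a filepath.
        ref = item.get("url") or item.get("filepath", "?")
        lines.append(f"  [{item.get('authority', 'UNKNOWN')}] {ref}")
    return "\n".join(lines) + "\n"
```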

PRINCIPLE 8 — SAFE RUN MODES
  Both codes support two run modes:
    DRY_RUN:   Logs what it would do, touches nothing, writes to /logs/ only
    LIVE_RUN:  Executes fully, writes output to /outputs/ directory
  Default is always DRY_RUN. The --live flag must be passed explicitly to
  execute a LIVE_RUN.
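
  The default-to-dry-run rule maps directly onto a boolean command-line flag. A
  minimal sketch using argparse from the standard library; build_parser and
  resolve_run are hypothetical names, and the returned directory names simply
  mirror /logs/ and /outputs/ above.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Self-gathering code runner (sketch)")
    # store_true means the flag is False unless the caller passes --live.
    parser.add_argument("--live", action="store_true",
                        help="execute fully and write to the outputs directory")
    return parser


def resolve_run(argv):
    """Return (run_mode, target_directory) for the given argument list."""
    args = build_parser().parse_args(argv)
    if args.live:
        return "LIVE_RUN", "outputs"
    return "DRY_RUN", "logs"
```

  Called with no arguments, resolve_run([]) selects DRY_RUN; only an explicit
  resolve_run(["--live"]) selects LIVE_RUN.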

================================================================================
SECTION 3 — WHAT EACH CODE MUST NOT DO
================================================================================

SGC-1 MUST NOT:
  - Modify any page on collabtunes.com
  - Attempt to log in to Yola or any CMS
  - Follow external links (YouTube, social media, etc.)
  - Crawl placeholder/coming-soon pages for body content
    (they are confirmed empty; catalog the URL only)
  - Crawl pages rated X without an explicit --include-x flag
  - Cache page content beyond the current run session
  - Report a page as LIVE if it returns anything other than HTTP 200
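
  Several of these prohibitions can be expressed as small guard functions that
  the crawler consults before and after each request. A sketch only: the
  function names are hypothetical, and treating www.collabtunes.com as the same
  site is an assumption.

```python
from urllib.parse import urlsplit

# Hosts treated as "the live site"; the www variant is an assumption.
SITE_HOSTS = {"collabtunes.com", "www.collabtunes.com"}


def is_internal(url: str) -> bool:
    """True only for links on the site itself; external links are never followed."""
    host = urlsplit(url).hostname or ""
    return host in SITE_HOSTS


def may_fetch(url: str, x_rated: bool, include_x: bool = False) -> bool:
    """Decide whether a URL may be requested at all."""
    if not is_internal(url):
        return False
    if x_rated and not include_x:
        return False
    return True


def may_capture_body(page_type: str) -> bool:
    """Placeholder pages are cataloged by URL only."""
    return page_type != "PLACEHOLDER"


def is_live(http_status: int) -> bool:
    """A page counts as LIVE only on an exact HTTP 200."""
    return http_status == 200
```

  One open point for the final code: the requests library follows redirects by
  default, so a redirected URL can end with a 200. Passing allow_redirects=False
  to requests.get would expose the original 3xx status instead.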

SGC-2 MUST NOT:
  - Extract or process content from DEFAMATION_RISK_REGISTRY
    (catalog its existence and path only — never open it)
  - Extract or process CREATOR_INTERVIEW_TRANSCRIPT content
    (catalog its existence only)
  - Modify any source file
  - Move files between folders
  - Create new folders (output goes to pre-existing /outputs/ or /logs/)
  - Infer authority level. Authority is read only from MANIFEST files or from
    filename tags (ACTIVE_CANON, DEPRECATED, ARCHIVE, or SUPERSEDED in the filename)
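
  A minimal sketch of cataloging a file by name alone, which is all that is
  permitted for the two sensitive documents. It never opens the file. The
  function names and the "authority_tag_from_filename" key are hypothetical, and
  the raw filename tag is reported as found because this spec does not define
  how filename tags map onto authority values.

```python
from pathlib import PurePosixPath

FILENAME_TAGS = ("ACTIVE_CANON", "DEPRECATED", "ARCHIVE", "SUPERSEDED")
SENSITIVE_MARKERS = ("DEFAMATION_RISK_REGISTRY", "CREATOR_INTERVIEW_TRANSCRIPT")


def filename_tag(filename: str):
    """Return the first tag literally present in the filename, else None."""
    upper = filename.upper()
    for tag in FILENAME_TAGS:
        if tag in upper:
            return tag
    return None  # no tag present: do not infer one


def is_sensitive(filename: str) -> bool:
    """True for files that may be cataloged but never opened."""
    upper = filename.upper()
    return any(marker in upper for marker in SENSITIVE_MARKERS)


def catalog_entry(filepath: str) -> dict:
    """Build an entry from the path string only; file contents are not read."""
    name = PurePosixPath(filepath).name
    sensitive = is_sensitive(name)
    return {
        "filepath": filepath,
        "filename": name,
        "authority_tag_from_filename": filename_tag(name),
        "contains_sensitive": sensitive,
        "flags": ["SENSITIVE_INTERNAL"] if sensitive else [],
    }
```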

================================================================================
SECTION 4 — OUTPUT FORMAT SPEC (Both codes)
================================================================================

JSON SCHEMA — TOP LEVEL:
{
  "run_id": "SGC[1|2]_[DATE]_[TIME]",
  "run_mode": "DRY_RUN | LIVE_RUN",
  "run_timestamp": "ISO 8601",
  "source_code_version": "v1.0",
  "total_items_found": N,
  "total_conflicts_found": N,
  "total_items_flagged": N,
  "items": [ ...per-item objects... ],
  "conflicts": [ ...conflict objects... ],
  "flags": [ ...flag objects... ],
  "run_log": [ ...step-by-step log entries... ]
}
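
  A sketch of assembling the top-level object so that the three totals are
  always derived from the lists they describe. Assumptions: the run_id reuses
  the M_D_YY and HHMM styles of the filenames, and total_items_flagged counts
  distinct flagged items rather than flag objects. build_run_output is a
  hypothetical name.

```python
import json
from datetime import datetime, timezone

RUN_MODES = {"DRY_RUN", "LIVE_RUN"}


def build_run_output(code_id, run_mode, items, conflicts, flags, run_log, when=None):
    """Assemble the top-level object; totals are computed, never passed in."""
    if run_mode not in RUN_MODES:
        raise ValueError(f"unknown run_mode: {run_mode}")
    when = when or datetime.now(timezone.utc)
    stamp = f"{when.month}_{when.day}_{when:%y}_{when:%H%M}"  # assumed style
    return {
        "run_id": f"{code_id}_{stamp}",
        "run_mode": run_mode,
        "run_timestamp": when.isoformat(),  # ISO 8601
        "source_code_version": "v1.0",
        "total_items_found": len(items),
        "total_conflicts_found": len(conflicts),
        # Assumption: count each flagged item once, however many flags it has.
        "total_items_flagged": len({flag["item_reference"] for flag in flags}),
        "items": items,
        "conflicts": conflicts,
        "flags": flags,
        "run_log": run_log,
    }
```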

PER-ITEM OBJECT (SGC-1 — live pages):
{
  "url": "https://collabtunes.com/song-list-1/",
  "slug": "/song-list-1/",
  "label": "Song List 1 — The Last Man Singing",
  "http_status": 200,
  "page_type": "ALBUM_AIO | NAV | SONGBOOK | QUICKGUIDE | HGIH | PLACEHOLDER | DEV | UNKNOWN",
  "rating": "PG-13 | PENDING | CONFLICT",
  "authority": "LOCKED | AUTHORITATIVE | CONFLICT | UNKNOWN",
  "nav_sources": ["homepage", "128-nav", "quicklinks"],
  "body_captured": true | false,
  "flags": ["GATE_REQUIRED", "X_RATED", "BROKEN_URL", "DEV_PAGE", "ORPHAN", "CONFLICT"],
  "notes": "free text"
}

PER-ITEM OBJECT (SGC-2 — repo assets):
{
  "filepath": "02_CANON/FINAL_CANON_AUTHORITY_REGISTRY_5_12_26.txt",
  "filename": "FINAL_CANON_AUTHORITY_REGISTRY_5_12_26.txt",
  "file_type": "TXT | ZIP | HTML | PDF | MD | OTHER",
  "category": "CANON | RATINGS | BLOCKERS | URL_MAPS | ...",
  "authority": "AUTHORITATIVE | REFERENCE | DEPRECATED | WORKING | UNKNOWN",
  "date_in_filename": "5_12_26",
  "vol_number": null | 1 | 2 | 3,
  "superseded_by": null | "FINAL_CANON_AUTHORITY_REGISTRY_5_13_26.txt",
  "supersedes": null | [...],
  "in_manifest": true | false,
  "contains_sensitive": false,  // true only for defamation/interview files
  "flags": ["SENSITIVE_INTERNAL", "DEPRECATED", "SUPERSEDED", "MISSING_MANIFEST"],
  "notes": "free text"
}

CONFLICT OBJECT:
{
  "conflict_id": "CC-[CODE]-[SEQUENCE]",
  "conflict_type": "URL_DUPLICATE | AUTHORITY_CONFLICT | MISSING_DATA | BROKEN_REF",
  "item_a": "...",
  "item_b": "...",
  "description": "...",
  "resolution_status": "OPEN | TOM_REQUIRED | KNOWN_ISSUE",
  "known_blocker_id": "CC-CH18 | BLOCK-H04 | ..." // map to existing blocker IDs
}
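
  A sketch of a conflict log that assigns sequential IDs and rejects values
  outside the enumerations above. Assumptions: [CODE] is the code ID (for
  example SGC1), the sequence is zero-padded to three digits, and the first
  field of the pipe-delimited line from Principle 6 is the conflict ID.
  ConflictLog and conflict_log_line are hypothetical names.

```python
from itertools import count

CONFLICT_TYPES = {"URL_DUPLICATE", "AUTHORITY_CONFLICT", "MISSING_DATA", "BROKEN_REF"}
RESOLUTION_STATUSES = {"OPEN", "TOM_REQUIRED", "KNOWN_ISSUE"}


class ConflictLog:
    """Collects conflicts; it records them and never resolves them."""

    def __init__(self, code: str):
        self.code = code
        self._sequence = count(1)
        self.entries = []

    def add(self, conflict_type, item_a, item_b, description,
            resolution_status="OPEN", known_blocker_id=None):
        # Validate before drawing a sequence number so IDs stay gap-free.
        if conflict_type not in CONFLICT_TYPES:
            raise ValueError(f"unknown conflict_type: {conflict_type}")
        if resolution_status not in RESOLUTION_STATUSES:
            raise ValueError(f"unknown resolution_status: {resolution_status}")
        entry = {
            "conflict_id": f"CC-{self.code}-{next(self._sequence):03d}",
            "conflict_type": conflict_type,
            "item_a": item_a,
            "item_b": item_b,
            "description": description,
            "resolution_status": resolution_status,
            "known_blocker_id": known_blocker_id,
        }
        self.entries.append(entry)
        return entry


def conflict_log_line(entry: dict) -> str:
    """Render one entry in the pipe-delimited form from Principle 6."""
    return " | ".join(
        [entry["conflict_id"], entry["item_a"], entry["item_b"], entry["description"]]
    )
```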

FLAG OBJECT:
{
  "flag_id": "FLAG-[SEQUENCE]",
  "flag_type": "GATE_REQUIRED | BROKEN | ORPHAN | DEV_PAGE | MISSING | SENSITIVE",
  "item_reference": "url or filepath",
  "severity": "CRITICAL | HIGH | MEDIUM | LOW",
  "description": "..."
}

================================================================================
SECTION 5 — DEPENDENCY MAP (What each code needs to run)
================================================================================

SGC-1 DEPENDENCIES:
  Required: Network access to collabtunes.com
  Required: MASTER_URL_AUTHORITY_REGISTRY_5_12_26.txt (seed URL list)
  Required: FINAL_NAVIGATION_AUTHORITY_MAP_5_12_26.txt (nav structure reference)
  Optional: Prior SGC-1 output (for diff/delta mode)
  Python libs: requests, beautifulsoup4, lxml, json, csv, time, hashlib

SGC-2 DEPENDENCIES:
  Required: Access to COLLABTUNES_PROJECT_ROOT folder
  Required: MASTER_NAMING_STANDARD_5_12_26.txt (to parse filenames)
  Required: REPOSITORY_STRUCTURE_PLAN_5_12_26.txt (expected folder structure)
  Optional: Prior SGC-2 output (for diff/delta mode)
  Python libs: os, zipfile, json, re, hashlib, datetime
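
  A preflight check can confirm these dependencies before any work starts. A
  minimal sketch with hypothetical function names. Two points to note: the
  beautifulsoup4 package is imported under the module name bs4, and because this
  spec does not say which folder holds the required TXT files, the sketch
  searches the whole project root by filename.

```python
import importlib.util
from pathlib import Path

# Import names, not package names: beautifulsoup4 installs the module "bs4".
SGC1_MODULES = ["requests", "bs4", "lxml", "json", "csv", "time", "hashlib"]
SGC2_MODULES = ["os", "zipfile", "json", "re", "hashlib", "datetime"]

SGC2_FILES = [
    "MASTER_NAMING_STANDARD_5_12_26.txt",
    "REPOSITORY_STRUCTURE_PLAN_5_12_26.txt",
]


def missing_modules(names):
    """Return the modules that cannot be located, without importing them."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def missing_files(root, filenames):
    """Return required filenames not found anywhere under root (read-only scan)."""
    found = {path.name for path in Path(root).rglob("*") if path.is_file()}
    return [name for name in filenames if name not in found]
```

  A run would stop with a clear message if either function returns a non-empty
  list, rather than failing partway through a crawl or scan.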

================================================================================
SECTION 6 — DELTA MODE (Future phase — not v1)
================================================================================

Both codes will eventually support a DELTA MODE:
  Input: Prior run output file
  Output: Only items that changed since last run
  Purpose: Efficiency on a large site; avoid re-crawling everything on every run
  Delta types: NEW_ITEM | REMOVED_ITEM | STATUS_CHANGED | AUTHORITY_CHANGED
  Implementation: Deferred to v2; not in scope for the heavy coding phase
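
  Although deferred, the comparison itself is simple enough to sketch now. This
  assumes items are matched on one key ("url" for SGC-1, "filepath" for SGC-2)
  and that STATUS_CHANGED refers to http_status. The function name, the
  "delta_type" key, and the shape of the change records are hypothetical.

```python
def compute_delta(prior_items, current_items, key="url"):
    """Compare two item lists and return one change record per difference."""
    prior = {item[key]: item for item in prior_items}
    current = {item[key]: item for item in current_items}
    changes = []
    # sorted() keeps the result stable from run to run.
    for ref in sorted(current.keys() - prior.keys()):
        changes.append({"delta_type": "NEW_ITEM", "item_reference": ref})
    for ref in sorted(prior.keys() - current.keys()):
        changes.append({"delta_type": "REMOVED_ITEM", "item_reference": ref})
    for ref in sorted(prior.keys() & current.keys()):
        old, new = prior[ref], current[ref]
        if old.get("http_status") != new.get("http_status"):
            changes.append({"delta_type": "STATUS_CHANGED", "item_reference": ref})
        if old.get("authority") != new.get("authority"):
            changes.append({"delta_type": "AUTHORITY_CHANGED", "item_reference": ref})
    return changes
```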

================================================================================
SECTION 7 — ROLLBACK SAFETY FOR THESE CODES THEMSELVES
================================================================================

Both codes are Python scripts stored in:
  COLLABTUNES_PROJECT_ROOT/15_GATHERING_TOOLS/
    SGC1_LIVE_SITE_GATHERER_v1_[DATE].py
    SGC2_REPOSITORY_GATHERER_v1_[DATE].py

Version rules:
  Never overwrite a prior version: a new date means a new file
  If a bug is found in v1, create v2; do not modify v1
  All outputs tagged with the code version that produced them

================================================================================
END FINAL_SELF_GATHERING_CODE_SPEC_5_12_26.txt
================================================================================
