================================================================================
SELF_GATHERING_CODE_1_FLOWMAP_5_12_26.txt
CollabORhythm / Collabtunes — Engineering Blueprint Phase
Generated: 5.12.26 | Black Claude — Blueprint Session
SUBJECT: SGC-1 — Live Site / Nav / Routing Gatherer
STATUS: ENGINEERING BLUEPRINT — not final code
================================================================================

PURPOSE:
Crawl collabtunes.com and produce a structured, machine-readable snapshot
of all live pages, their URLs, nav presence, HTTP status, rating flags,
and routing classification — at run time, from the real live site.

================================================================================
PHASE 0 — PRE-FLIGHT CHECKS
================================================================================

STEP 0.1 — ARGUMENT PARSE
  Read command-line args:
    --mode [DRY_RUN | LIVE_RUN]     (default: DRY_RUN)
    --include-x                      (flag: crawl X-rated pages)
    --max-pages [N]                  (default: 200, safety cap)
    --output-dir [path]              (default: ./outputs/)
    --seed-file [path]               (default: MASTER_URL_AUTHORITY_REGISTRY)
    --resume [checkpoint_file]       (optional: resume from partial checkpoint)
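
  A minimal argparse sketch of this argument surface (flag names and defaults
  are taken from the list above; the description string is illustrative):

    import argparse

    def parse_args(argv=None):
        # Sketch only; mirrors the flags listed in STEP 0.1.
        p = argparse.ArgumentParser(description="SGC-1 live site gatherer")
        p.add_argument("--mode", choices=["DRY_RUN", "LIVE_RUN"], default="DRY_RUN")
        p.add_argument("--include-x", action="store_true",
                       help="crawl X-rated pages (skipped by default)")
        p.add_argument("--max-pages", type=int, default=200, help="safety cap")
        p.add_argument("--output-dir", default="./outputs/")
        p.add_argument("--seed-file",
                       default="MASTER_URL_AUTHORITY_REGISTRY_5_12_26.txt")
        p.add_argument("--resume", default=None, help="checkpoint file to resume from")
        return p.parse_args(argv)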

STEP 0.2 — VALIDATE OUTPUT DIRECTORY
  If /outputs/ does not exist → CREATE it (only directory creation allowed)
  If /logs/ does not exist → CREATE it
  Confirm write access to both
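
  A sketch of the directory check (paths per STEP 0.2; the temp-file probe is
  just one way to confirm write access):

    import os
    import tempfile

    def ensure_writable_dirs(*dirs):
        # Create each directory if missing, then confirm it is writable.
        for d in dirs:
            os.makedirs(d, exist_ok=True)         # directory creation only
            with tempfile.TemporaryFile(dir=d):   # raises OSError if not writable
                pass

    ensure_writable_dirs("./outputs/", "./logs/")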

STEP 0.3 — LOAD SEED URL LIST
  Source: MASTER_URL_AUTHORITY_REGISTRY_5_12_26.txt
  Parse all URLs with status LIVE ✅ and CONFLICT ⚠️
  Skip: PENDING ⏳ (placeholders — no body to crawl)
  Skip: DEV 🔧 (dev pages — catalog URL only, do not crawl body)
  Skip: BROKEN ❌ (catalog as BROKEN, do not crawl)
  Output: seed_urls[] — list of dicts {url, expected_status, label, rating, nav_flag}
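
  The registry's on-disk format is not reproduced in this flowmap, so the
  sketch below assumes each entry has already been parsed into a dict with
  the fields named above; only the status-based skip logic is shown:

    CRAWL_STATUSES = {"LIVE", "CONFLICT"}          # crawl these
    CATALOG_ONLY   = {"PENDING", "DEV", "BROKEN"}  # record only, never crawl

    def build_seed_list(registry_entries):
        # registry_entries: iterable of dicts with (assumed) keys
        # url, status, label, rating, nav_flag.
        seed_urls = []
        for entry in registry_entries:
            if entry["status"] in CRAWL_STATUSES:
                seed_urls.append({
                    "url": entry["url"],
                    "expected_status": entry["status"],
                    "label": entry.get("label"),
                    "rating": entry.get("rating"),
                    "nav_flag": entry.get("nav_flag"),
                })
            # PENDING / DEV / BROKEN entries are cataloged elsewhere
        return seed_urls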

STEP 0.4 — LOAD NAV REFERENCE
  Source: FINAL_NAVIGATION_AUTHORITY_MAP_5_12_26.txt
  Parse: 14-section structure, all nav pages, chapter drift map
  Purpose: Cross-reference crawl results against known nav structure

STEP 0.5 — WRITE RUN CHECKPOINT (initial)
  File: SGC1_CHECKPOINT_[DATE]_[TIME]_START.json
  Contents: {run_id, timestamp, mode, seed_url_count, step: "PRE_FLIGHT_COMPLETE"}

IF --resume FLAG:
  Load checkpoint file → skip to last completed step → resume from there

================================================================================
PHASE 1 — SEED URL NORMALIZATION
================================================================================

STEP 1.1 — NORMALIZE ALL SEED URLS
  Rule: Strip trailing slashes for comparison, add back for canonical form
  Rule: Force https:// prefix on all collabtunes.com URLs
  Rule: Lowercase all slugs for comparison (preserve original case in output)
  Rule: Build canonical URL = https://collabtunes.com[/slug/]
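
  A minimal normalization helper following the four rules above (comparison
  key and canonical form are kept separate so original case survives in
  output; assumes every seed URL is on the collabtunes.com domain):

    from urllib.parse import urlparse

    def normalize(url):
        # Returns (comparison_key, canonical_url) per STEP 1.1.
        parsed = urlparse(url if "://" in url else "https://" + url)
        slug = parsed.path.strip("/")
        if slug:
            canonical = f"https://collabtunes.com/{slug}/"
        else:
            canonical = "https://collabtunes.com/"
        return canonical.lower(), canonical        # lowercase copy for comparison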

STEP 1.2 — BUILD CRAWL QUEUE
  Initialize: crawl_queue = deque(seed_urls)
  Initialize: crawled_set = set() (tracks already-crawled URLs to prevent loops)
  Initialize: results[] = [] (accumulates per-item result objects)
  Initialize: conflicts[] = [] (accumulates conflict objects)
  Initialize: flags[] = [] (accumulates flag objects)

STEP 1.3 — PRE-CLASSIFY URLS FROM SEED FILE
  For each seed URL, assign initial page_type from label and URL pattern:
    /song-list-[N]/                → ALBUM_AIO
    /set-list-[N]/                 → ALBUM_AIO
    /set-list-[N]-[title]/         → ALBUM_AIO (SL22-24 format)
    /[N]-of-35-[slug]/             → SONGBOOK
    /how-i-got-here[...]/          → HGIH
    /1---34-[slug]/ or /1-to-34-[slug]/  → QUICKGUIDE
    / (root)                       → NAV
    /128-section-[...]/            → NAV
    /switchboard-[...]/            → NAV
    /fast-scroll-[...]/            → NAV
    /read-my-stuff-[...]/          → READMYSTUFF
    /[future|coming-soon|comment-box]/  → PLACEHOLDER
    /html-test[N]/ or /practice-head/  → DEV
    default                        → UNKNOWN
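
  The same table as a first-match-wins regex sketch (the regexes are
  approximations of the slug patterns listed above, applied to the path
  portion of the canonical URL):

    import re

    # (pattern, page_type) pairs checked in order; first match wins.
    PAGE_TYPE_RULES = [
        (r"^/song-list-\d+/$",                    "ALBUM_AIO"),
        (r"^/set-list-\d+(-[\w-]+)?/$",           "ALBUM_AIO"),   # SL22-24 titled form
        (r"^/\d+-of-35-[\w-]+/$",                 "SONGBOOK"),
        (r"^/how-i-got-here",                     "HGIH"),
        (r"^/1(---|-to-)34-[\w-]+/$",             "QUICKGUIDE"),
        (r"^/$",                                  "NAV"),
        (r"^/128-section-",                       "NAV"),
        (r"^/switchboard-",                       "NAV"),
        (r"^/fast-scroll-",                       "NAV"),
        (r"^/read-my-stuff-",                     "READMYSTUFF"),
        (r"^/(future|coming-soon|comment-box)/$", "PLACEHOLDER"),
        (r"^/(html-test\d+|practice-head)/$",     "DEV"),
    ]

    def classify(path):
        for pattern, page_type in PAGE_TYPE_RULES:
            if re.match(pattern, path):
                return page_type
        return "UNKNOWN"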

================================================================================
PHASE 2 — HTTP STATUS GATHERING
================================================================================

STEP 2.1 — HTTP HEAD REQUEST (all URLs)
  For each URL in crawl_queue:
    Send HTTP HEAD request (not GET — faster, no body load)
    Capture: http_status, response_time, redirect_chain (if any)
    Rate limit: 1 request per 1.5 seconds (polite crawl — Yola is live production)
    Timeout: 10 seconds per request
    Retry: 1 retry on timeout, then mark as TIMEOUT_ERROR

  Status mapping:
    200 → LIVE
    301/302 → REDIRECT (log destination, mark original as REDIRECT)
    404 → NOT_FOUND
    500 → SERVER_ERROR
    TIMEOUT → TIMEOUT_ERROR
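
  A sketch of the HEAD pass using requests (timeout, single retry, and status
  mapping as above; capturing the Location header is one reasonable reading
  of "log destination", and the OTHER fallback for unlisted codes is an
  assumption):

    import time
    import requests

    STATUS_MAP = {200: "LIVE", 301: "REDIRECT", 302: "REDIRECT",
                  404: "NOT_FOUND", 500: "SERVER_ERROR"}

    def head_check(url, session):
        # One HEAD request with a single retry on timeout, per STEP 2.1.
        for attempt in range(2):
            try:
                start = time.monotonic()
                resp = session.head(url, timeout=10, allow_redirects=False)
                return {
                    "url": url,
                    "http_status": resp.status_code,
                    "status": STATUS_MAP.get(resp.status_code, "OTHER"),
                    "response_time": time.monotonic() - start,
                    "redirect_to": resp.headers.get("Location"),
                }
            except requests.Timeout:
                if attempt == 1:
                    return {"url": url, "http_status": None, "status": "TIMEOUT_ERROR"}

    # Polite crawl: the caller sleeps 1.5 s between calls while draining the queue.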

STEP 2.2 — CONFLICT: STATUS VS EXPECTED
  For each URL where http_status ≠ expected_status from seed file:
    Write CONFLICT object:
      conflict_type: STATUS_MISMATCH
      expected: seed file status
      actual: HTTP status received
    Flag severity:
      Expected LIVE, got 404 → CRITICAL
      Expected BROKEN, got 200 → HIGH (seed file is wrong — update it)
      Expected PENDING, got 200 → MEDIUM (placeholder may now have content)
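
  A sketch of the conflict object this step emits (field names follow the
  shape described above; the LOW default for unlisted combinations is an
  assumption):

    SEVERITY_RULES = {
        ("LIVE", "NOT_FOUND"): "CRITICAL",
        ("BROKEN", "LIVE"):    "HIGH",     # seed file is wrong; update it
        ("PENDING", "LIVE"):   "MEDIUM",   # placeholder may now have content
    }

    def status_conflict(url, expected, actual):
        # expected: status from the seed file; actual: mapped HTTP result.
        if expected == actual:
            return None
        return {
            "conflict_type": "STATUS_MISMATCH",
            "url": url,
            "expected": expected,
            "actual": actual,
            "severity": SEVERITY_RULES.get((expected, actual), "LOW"),
        }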

STEP 2.3 — CHECKPOINT AFTER PHASE 2
  Write: SGC1_CHECKPOINT_[DATE]_[TIME]_PHASE2_COMPLETE.json
  Contents: all results so far (status gathered, not yet body-crawled)

================================================================================
PHASE 3 — NAV SOURCE CROSS-REFERENCE
================================================================================

PURPOSE: For each live URL, determine which nav sources include it.

STEP 3.1 — LOAD NAV SOURCE DATA
  Source files required (already in seed load from Phase 0):
    Homepage nav links (from LIVE_CAPTURE or parsed from seed file nav_flag)
    128-nav links
    Switchboard Quicklinks links
  Parse each source into: nav_source_urls[source_name] = set(urls)

STEP 3.2 — FOR EACH LIVE URL — CROSS-REFERENCE NAV SOURCES
  For each url in results[]:
    nav_sources = []
    IF url in nav_source_urls["homepage"]: append "homepage"
    IF url in nav_source_urls["128-nav"]:  append "128-nav"
    IF url in nav_source_urls["quicklinks"]: append "quicklinks"
    result["nav_sources"] = nav_sources

  FLAG LOGIC:
    len(nav_sources) == 0 → flag ORPHAN (live but not in any nav)
    len(nav_sources) == 3 → FULLY_INDEXED (in all 3 nav sources)
    len(nav_sources) in (1, 2) → PARTIALLY_INDEXED
    In nav but HTTP_STATUS ≠ 200 → flag BROKEN_NAV_LINK (nav points to non-live page)
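
  A sketch of the cross-reference and flag assignment (nav_source_urls is
  the dict built in STEP 3.1; source names match the three listed above):

    def cross_reference(result, nav_source_urls):
        url = result["url"]
        nav_sources = [name for name in ("homepage", "128-nav", "quicklinks")
                       if url in nav_source_urls.get(name, set())]
        result["nav_sources"] = nav_sources

        flags = []
        if not nav_sources:
            flags.append("ORPHAN")             # live but not in any nav
        elif len(nav_sources) == 3:
            flags.append("FULLY_INDEXED")
        else:
            flags.append("PARTIALLY_INDEXED")
        if nav_sources and result.get("http_status") != 200:
            flags.append("BROKEN_NAV_LINK")    # nav points to a non-live page
        return flags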

STEP 3.3 — DETECT DUPLICATE URL SLOTS
  For each nav_source, scan for duplicate slugs (same URL appearing twice):
    Log as CONFLICT: conflict_type = NAV_DUPLICATE
    Link to known blocker IDs from UNRESOLVED_BLOCKERS file if matched
      Example: /20-35-the-lady-weaver/ appears twice → link to BLOCK-H04

STEP 3.4 — DETECT CROSS-SOURCE URL MISMATCHES
  Cases where same page has two different slugs across sources:
    Example: YouTube lyric video — /lyric-videos-...-on-one-video-on-youtube/
             vs /lyric-videos-...-on-one-youtube-video/
    Detection: Label matching — if same label appears with two different URLs
               across nav sources → CONFLICT of type URL_SLUG_MISMATCH
    Link to known blocker: BLOCK-M01

================================================================================
PHASE 4 — PAGE BODY GATHERING (Selective)
================================================================================

RULE: Full GET crawl only for pages where body_captured = false in seed data.
      Pages already captured in COLLABTUNES_LIVE_CAPTURE_FINAL_2026-05-12.zip
      are marked body_captured = true — do NOT re-crawl them.

STEP 4.1 — IDENTIFY PAGES NEEDING BODY CAPTURE
  From results[], filter where:
    http_status == 200
    AND page_type NOT IN [PLACEHOLDER, DEV]
    AND body_captured == false
    AND (not X_RATED OR --include-x flag set)

STEP 4.2 — GET REQUEST + BODY EXTRACTION
  For each qualifying URL:
    Send HTTP GET
    Parse response with BeautifulSoup
    Extract:
      page_title: <title> tag text
      h1_text: first <h1> text (or null)
      meta_description: <meta name="description"> content
      internal_links: all <a href> where href starts with collabtunes.com or /
      word_count: approximate (len(body_text.split()))
      has_rating_badge: boolean (search for known rating badge CSS class/text patterns)
      has_js_gate: boolean (search for collabtunes_selected_rating localStorage key in script tags)
      visible_text_excerpt: first 500 chars of stripped body text

  Rate limit: 2 seconds between GET requests (heavier than HEAD)
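
  A BeautifulSoup sketch of the extraction (field names as above; the
  rating-badge and JS-gate probes are simple substring checks, and the
  "rating-badge" pattern itself is a placeholder assumption):

    import requests
    from bs4 import BeautifulSoup

    def capture_body(url):
        resp = requests.get(url, timeout=10)
        soup = BeautifulSoup(resp.text, "html.parser")
        body_text = soup.get_text(" ", strip=True)
        meta = soup.find("meta", attrs={"name": "description"})
        scripts = " ".join(s.get_text() for s in soup.find_all("script"))
        return {
            "page_title": soup.title.get_text(strip=True) if soup.title else None,
            "h1_text": soup.h1.get_text(strip=True) if soup.h1 else None,
            "meta_description": meta.get("content") if meta else None,
            "internal_links": [a["href"] for a in soup.find_all("a", href=True)
                               if a["href"].startswith("/")
                               or "collabtunes.com" in a["href"]],
            "word_count": len(body_text.split()),
            "has_rating_badge": "rating-badge" in resp.text,
            "has_js_gate": "collabtunes_selected_rating" in scripts,
            "visible_text_excerpt": body_text[:500],
        }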

STEP 4.3 — JS-RENDERED PAGE DETECTION
  Some pages (homepage, Fast Scroll) render body via JavaScript.
  Detection: IF word_count < 50 AND <noscript> tag present → flag JS_RENDERED
  Action: Mark body_captured = false, flag JS_RENDERED
  Do NOT attempt JS execution — note in output that Playwright/Selenium is needed for full capture

STEP 4.4 — ADD DISCOVERED LINKS TO CRAWL QUEUE
  For each internal link discovered in body:
    IF link URL not in crawled_set AND not in crawl_queue:
      IF link is to collabtunes.com domain:
        Add to crawl_queue (discovery mode)
        Mark as DISCOVERED (not in seed file)
  Cap: Stop discovery after max-pages limit reached
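
  A sketch of the discovery step (queued_keys is an assumed helper set
  tracking what is already in crawl_queue; in practice the key would come
  from the normalize() helper sketched under STEP 1.1):

    def enqueue_discovered(internal_links, crawl_queue, crawled_set,
                           queued_keys, max_pages):
        for link in internal_links:
            key = link.lower().rstrip("/")
            if key in crawled_set or key in queued_keys:
                continue
            if len(crawled_set) + len(queued_keys) >= max_pages:
                break                          # stop discovery at --max-pages
            crawl_queue.append({"url": link, "source": "DISCOVERED"})
            queued_keys.add(key)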

STEP 4.5 — CHECKPOINT AFTER PHASE 4
  Write: SGC1_CHECKPOINT_[DATE]_[TIME]_PHASE4_COMPLETE.json

================================================================================
PHASE 5 — ROUTING CLASSIFICATION
================================================================================

STEP 5.1 — RATING ASSIGNMENT
  For each result item:
    Source priority order (first match wins):
      1. LOCKED rating from FINAL_CANON_AUTHORITY_REGISTRY
      2. Rating from MASTER_CONTENT_RATINGS_INDEX VOL3
      3. Rating inferred from URL (X in slug → flag SELF_DECLARED_X)
      4. Rating from page body (has_rating_badge match)
      5. Default: PENDING
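
  A sketch of the first-match-wins chain (the two registry lookups are
  assumed to be dicts keyed by canonical URL; the "X in slug" test and the
  badge_rating field are illustrative placeholders):

    def assign_rating(item, locked_ratings, ratings_index):
        url = item["url"]
        if url in locked_ratings:                 # 1. FINAL_CANON_AUTHORITY_REGISTRY
            return locked_ratings[url]
        if url in ratings_index:                  # 2. MASTER_CONTENT_RATINGS_INDEX VOL3
            return ratings_index[url]
        if "-x-" in url or url.rstrip("/").endswith("-x"):
            item.setdefault("flags", []).append("SELF_DECLARED_X")
            return "X"                            # 3. inferred from URL
        if item.get("has_rating_badge"):          # 4. badge found in page body
            return item.get("badge_rating", "PENDING")
        return "PENDING"                          # 5. default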

STEP 5.2 — GATE REQUIREMENT ASSIGNMENT
  Rules (from FINAL_NAVIGATION_AUTHORITY_MAP routing logic):
    Rating G/PG   → gate = NONE
    Rating PG-13  → gate = PG13_REQUIRED
    Rating R      → gate = R_REQUIRED
    Rating NC-17  → gate = NC17_REQUIRED
    Rating X      → gate = X_REQUIRED
    Rating PENDING → gate = UNKNOWN_GATE (flag for review)
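
  The same rules as a lookup table:

    GATE_BY_RATING = {
        "G": "NONE", "PG": "NONE",
        "PG-13": "PG13_REQUIRED",
        "R": "R_REQUIRED",
        "NC-17": "NC17_REQUIRED",
        "X": "X_REQUIRED",
    }

    def assign_gate(rating):
        # PENDING (or any unrecognized rating) falls through for review.
        return GATE_BY_RATING.get(rating, "UNKNOWN_GATE")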

STEP 5.3 — SAFE ROUTE CLASSIFICATION
  For each item, assign safe_route status:
    SAFE_ALWAYS:   G/PG pages — show in all routes
    SAFE_PG13:     PG-13 pages — show only if PG-13+ selected
    SAFE_R:        R pages — show only if R+ selected
    SAFE_NC17:     NC-17 pages
    SAFE_X:        X pages
    UNSAFE_LIVE:   Pages with missing gate that SHOULD be gated (flag GATE_MISSING)

STEP 5.4 — FLAG: UNGATED HIGH-CONTENT PAGES
  Cross-reference with UNRESOLVED_BLOCKERS known live blockers:
    BLOCK-L01: HOW I GOT HERE X pages → flag GATE_MISSING + CRITICAL
    BLOCK-L02: Quick Guide NC-17/X → flag GATE_MISSING + CRITICAL
    BLOCK-L03: Full Texts of Lyrics → flag GATE_MISSING + HIGH

================================================================================
PHASE 6 — DEDUPE LOGIC
================================================================================

STEP 6.1 — CANONICAL URL SELECTION FOR DUPLICATES
  For each conflict of type URL_SLUG_MISMATCH or NAV_DUPLICATE:
    Check MASTER_URL_AUTHORITY_REGISTRY for designated canonical URL
    IF canonical designated: mark other URL as NON_CANONICAL
    IF not designated: mark both as CONFLICT, write to conflict log

STEP 6.2 — CHAPTER DRIFT RECONCILIATION
  For each Songbook chapter URL (/[N]-of-35-[slug]/):
    Compare N (from URL slug) against nav label number (from 128-nav parse)
    If URL-N ≠ nav-label-N: flag CHAPTER_DRIFT
    Affected: chapters where nav_label = URL_N + 1 (known pattern)
    Output: chapter_drift_map[] in results — exact list of all drifted chapters
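
  A sketch of the drift comparison (nav_label_numbers is assumed to map
  Songbook URLs to the chapter number shown in their 128-nav labels):

    import re
    from urllib.parse import urlparse

    def find_chapter_drift(results, nav_label_numbers):
        drift_map = []
        for item in results:
            m = re.match(r"^/(\d+)-of-35-", urlparse(item["url"]).path)
            if not m:
                continue                          # not a Songbook chapter URL
            url_n = int(m.group(1))
            nav_n = nav_label_numbers.get(item["url"])
            if nav_n is not None and nav_n != url_n:
                item.setdefault("flags", []).append("CHAPTER_DRIFT")
                drift_map.append({"url": item["url"],
                                  "url_n": url_n, "nav_label_n": nav_n})
        return drift_map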

STEP 6.3 — FINAL DEDUPLICATION PASS
  Remove exact duplicate result objects (same URL appearing twice in results)
  Preserve first occurrence, mark subsequent as DUPLICATE_ENTRY

================================================================================
PHASE 7 — EXPORT LOGIC
================================================================================

STEP 7.1 — GENERATE JSON OUTPUT
  Filename: SGC1_LIVE_SITE_SNAPSHOT_[DATE]_[TIME].json
  Structure: per FINAL_SELF_GATHERING_CODE_SPEC top-level schema
  Validation: Confirm all required fields present on every item
  Write to: /outputs/
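
  The real top-level schema lives in FINAL_SELF_GATHERING_CODE_SPEC; as a
  minimal sketch, the keys below are only the collections this flowmap
  accumulates, plus a field-presence check (the REQUIRED_FIELDS list is
  illustrative):

    import json

    REQUIRED_FIELDS = {"url", "http_status", "page_type", "rating",
                       "gate", "safe_route", "nav_sources"}

    def export_json(results, conflicts, flags, run_meta, path):
        missing = [r["url"] for r in results
                   if not REQUIRED_FIELDS.issubset(r)]
        if missing:
            raise ValueError(f"items missing required fields: {missing}")
        snapshot = {"run": run_meta, "results": results,
                    "conflicts": conflicts, "flags": flags}
        with open(path, "w", encoding="utf-8") as fh:
            json.dump(snapshot, fh, indent=2, ensure_ascii=False)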

STEP 7.2 — GENERATE TXT SUMMARY (human-readable)
  Filename: SGC1_LIVE_SITE_SNAPSHOT_[DATE]_[TIME]_SUMMARY.txt
  Contents:
    Run metadata (mode, timestamp, counts)
    LIVE PAGES section: sorted by page_type, then by URL
    CONFLICTS section: all conflict objects in readable table format
    FLAGS section: all flags sorted by severity (CRITICAL first)
    CHAPTER DRIFT MAP: all drifted chapters listed
    DISCOVERED PAGES: pages found via crawl not in seed file
    MISSING PAGES: pages in seed file not found live (404 / TIMEOUT)
  Write to: /outputs/

STEP 7.3 — GENERATE MANIFEST
  Filename: SGC1_RUN_MANIFEST_[DATE]_[TIME].txt
  Contents:
    Run ID | Date | Mode | Pages crawled | Conflicts found | Flags raised
    Output files produced (with sizes)
    Error count | Timeout count
    Recommended next action (based on flag severity counts)
  Write to: /outputs/

STEP 7.4 — WRITE FINAL CHECKPOINT (complete)
  Filename: SGC1_CHECKPOINT_[DATE]_[TIME]_COMPLETE.json
  Contents: full results + run_log + status = "COMPLETE"
  Write to: /logs/

================================================================================
PHASE 8 — ROLLBACK SAFETY VERIFICATION
================================================================================

STEP 8.1 — VERIFY NO WRITES TO collabtunes.com
  Confirm: no POST, PUT, DELETE, or PATCH requests were made
  Confirm: all requests were GET or HEAD only
  Write to run_log: "SITE_INTEGRITY_VERIFIED: read-only crawl confirmed"

STEP 8.2 — VERIFY NO SOURCE FILE MODIFICATIONS
  Check modification timestamps on seed files loaded in Phase 0
  IF any seed file mtime changed during run: ABORT and write ALERT to log

STEP 8.3 — VERIFY OUTPUT FILES WRITTEN CORRECTLY
  For each output file: read back first 100 bytes to confirm write succeeded
  IF any output file is 0 bytes: flag WRITE_ERROR in run log
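
  A sketch of the two file-integrity checks in STEPS 8.2 and 8.3 (seed
  mtimes are assumed to have been recorded as a {path: mtime} dict when the
  files were loaded in Phase 0):

    import os

    def verify_seed_files_untouched(seed_mtimes):
        for path, original_mtime in seed_mtimes.items():
            if os.path.getmtime(path) != original_mtime:
                # ABORT path: caller writes an ALERT to the run log and stops.
                raise RuntimeError(f"seed file modified during run: {path}")

    def verify_outputs(output_paths, run_log):
        for path in output_paths:
            if os.path.getsize(path) == 0:
                run_log.append({"flag": "WRITE_ERROR", "file": path})
                continue
            with open(path, "rb") as fh:
                fh.read(100)       # read back first 100 bytes to confirm the write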

================================================================================
CRAWL LOGIC SUMMARY (Quick Reference)
================================================================================

INPUT:   MASTER_URL_AUTHORITY_REGISTRY (seed) + FINAL_NAVIGATION_AUTHORITY_MAP
CRAWL:   HEAD all URLs → GET qualifying URLs (rate-limited, polite)
CROSS-REF: Nav sources (homepage + 128-nav + quicklinks)
CLASSIFY: page_type, rating, gate, safe_route, nav_sources, body_captured
DETECT:  Conflicts, orphans, broken links, ungated X-rated, chapter drift, duplicates
DEDUPE:  Canonical URL designation for duplicates
EXPORT:  JSON + TXT + Manifest + Final Checkpoint

DOES NOT:
  Modify any file on site or in repo
  Execute JavaScript
  Follow external links
  Crawl X-rated pages without --include-x flag
  Open DEFAMATION_RISK_REGISTRY
  Guess at ratings or authority

================================================================================
END SELF_GATHERING_CODE_1_FLOWMAP_5_12_26.txt
================================================================================
