================================================================================
DRY_RUN_AND_LIVE_RUN_SAFETY_MATRIX_5_12_26.txt
CollabORhythm / Collabtunes — Engineering Hardening Phase
Generated: 5.12.26 | Black Claude — Final Engineering Hardening
PURPOSE: Exact behavioral matrix for every operation in DRY_RUN vs LIVE_RUN.
         Mixed Claude implements every cell of this table — no guessing.
STATUS: ENGINEERING PREP — authoritative safety matrix
================================================================================

================================================================================
SECTION 1 — MODE DEFINITIONS
================================================================================

DRY_RUN (DEFAULT):
  Activated by: default mode OR explicit --mode DRY_RUN flag
  Network calls: NONE (SGC-1)
  File reads:    YES (seed files, nav files, repo files — same as LIVE_RUN)
  File writes:   ONE ONLY — the DRY_RUN report to /logs/
  Stdout:        Minimal — mode banner + "Dry run complete. See: {report_path}"
  Confirmation:  NOT required
  Safe for:      First run, any run where operator is unsure, automated testing

LIVE_RUN:
  Activated by: explicit --mode LIVE_RUN flag
  Network calls: YES (SGC-1 — HEAD + selective GET to collabtunes.com)
  File reads:    YES (same as DRY_RUN)
  File writes:   YES — all output files to /outputs/, all checkpoints to /logs/
  Stdout:        Confirmation gate prompt + progress indicators
  Confirmation:  REQUIRED — must type "yes" before Phase 2 begins
  Safe for:      Production use after DRY_RUN has been reviewed
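
The mode selection above can be sketched with argparse. This is a minimal
sketch, not the project's actual parser: the helper name build_arg_parser is
hypothetical, and only the --mode flag, its two values, and the DRY_RUN
default come from this document.

```python
import argparse


def build_arg_parser() -> argparse.ArgumentParser:
    """Hypothetical helper; only --mode and its values come from the spec."""
    parser = argparse.ArgumentParser(description="SGC run mode selection (sketch)")
    parser.add_argument(
        "--mode",
        choices=["DRY_RUN", "LIVE_RUN"],
        default="DRY_RUN",  # omitting the flag must never trigger a live run
        help="DRY_RUN (default) or LIVE_RUN",
    )
    return parser
```

Because the default is DRY_RUN, an operator who forgets the flag gets the
safe mode; any value other than the two listed is rejected by argparse.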

================================================================================
SECTION 2 — PER-OPERATION BEHAVIOR MATRIX
================================================================================

FORMAT: OPERATION | DRY_RUN BEHAVIOR | LIVE_RUN BEHAVIOR

PHASE 0 — PRE-FLIGHT
──────────────────────────────────────────────────────────────────────────────
Parse command-line args         | Execute (both modes)           | Execute
Check /outputs/ exists          | Log "would create if absent"   | Create if absent
Check /logs/ exists             | Create if absent (report dir)  | Create if absent
Load seed file (SGC-1)          | Execute — file is READ         | Execute — file is READ
Load nav reference (SGC-1)      | Execute — file is READ         | Execute — file is READ
Walk directory (SGC-2)          | Execute — files are WALKED     | Execute — files are WALKED
Partition sensitive files (SGC-2)| Execute — filenames INSPECTED | Execute
Write START checkpoint          | Skip (no writes in dry run)    | Write to /logs/
Print confirmation gate         | Skip                           | Print and wait for "yes"

PHASE 1 — URL NORMALIZATION / FILE CLASSIFICATION
──────────────────────────────────────────────────────────────────────────────
Normalize seed URLs             | Execute                        | Execute
Build crawl queue               | Execute (but never dequeued)   | Execute and dequeue
Build raw_file_list             | Execute                        | Execute
Parse filenames                 | Execute                        | Execute
Write DRY_RUN report            | Write to /logs/ (only output)  | Skip

PHASE 2 — HTTP REQUESTS (SGC-1) / FILE EXTENSION CLASSIFICATION (SGC-2)
──────────────────────────────────────────────────────────────────────────────
HEAD request to collabtunes.com | SKIP — log "would request {url}"| Execute
GET request to collabtunes.com  | SKIP — log "would GET {url}"   | Execute (selective)
time.sleep() between requests   | SKIP (no requests to throttle) | Execute
HTTP response parsing           | SKIP                           | Execute
Rate limit enforcement          | N/A (no requests made)         | Enforce
Retry on timeout                | N/A                            | Execute (1 retry)
File extension classification (SGC-2)| Execute                   | Execute
Write PHASE2 checkpoint         | Skip                           | Write to /logs/

PHASE 3 — NAV CROSSREF (SGC-1) / ZIP INSPECTION (SGC-2)
──────────────────────────────────────────────────────────────────────────────
Parse nav sources               | Execute (uses seed data)       | Execute (uses live data)
Cross-reference URL vs nav      | Execute (dry run uses seed)    | Execute (live results)
Open ZIP in read mode           | Execute                        | Execute
List ZIP members                | Execute                        | Execute
Extract manifest text from ZIP  | Execute                        | Execute
Recurse into nested ZIP         | Execute (up to max_depth)      | Execute
Write PHASE3 checkpoint (SGC-2) | Skip                           | Write to /logs/
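
The ZIP rows above (open in read mode, list members, recurse up to
max_depth) could look roughly like the sketch below. The function name
list_zip_members and the "!/" path separator are hypothetical, and the
reading of max_depth (0 means nested archives are listed but not opened) is
an assumption. Nested archives are read into memory, so nothing is
extracted to disk.

```python
import io
import zipfile
from typing import BinaryIO, List, Union


def list_zip_members(
    source: Union[str, BinaryIO],
    max_depth: int,
    _depth: int = 0,
    _prefix: str = "",
) -> List[str]:
    """List member names, descending into nested ZIPs up to max_depth levels."""
    names: List[str] = []
    with zipfile.ZipFile(source, mode="r") as zf:  # read mode only, never "w" or "a"
        for info in zf.infolist():
            names.append(_prefix + info.filename)
            is_nested_zip = info.filename.lower().endswith(".zip")
            if is_nested_zip and _depth < max_depth:
                nested = io.BytesIO(zf.read(info))  # in-memory copy, no disk write
                names.extend(
                    list_zip_members(
                        nested, max_depth, _depth + 1, _prefix + info.filename + "!/"
                    )
                )
    return names
```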

PHASE 4 — BODY GATHERING (SGC-1) / TXT SCANNING (SGC-2)
──────────────────────────────────────────────────────────────────────────────
GET requests for body content   | SKIP — log "would GET {url}"   | Execute (selective)
BeautifulSoup parse             | SKIP (no body to parse)        | Execute
TXT file open + read            | Execute                        | Execute
Add to opened_files_log (SGC-2) | Execute                        | Execute
Detect JS-rendered pages        | SKIP (no body)                 | Execute
Discover new URLs from body     | SKIP                           | Execute
Write PHASE4 checkpoint (SGC-1) | Skip                           | Write to /logs/

PHASES 5–7 — CLASSIFICATION, CONFLICT DETECTION, DEDUPLICATION
──────────────────────────────────────────────────────────────────────────────
All classification logic        | Execute (on seed/file data)    | Execute (on live data)
All conflict detection          | Execute                        | Execute
All deduplication               | Execute                        | Execute
All flag generation             | Execute                        | Execute
Folder structure validation     | Execute                        | Execute
Dependency chain assessment     | Execute                        | Execute
Build version chains            | Execute                        | Execute

PHASE 8 — EXPORT
──────────────────────────────────────────────────────────────────────────────
Build JSON output object        | Execute (build in memory)      | Execute
Build TXT summary string        | Execute (build in memory)      | Execute
Build manifest string           | Execute (build in memory)      | Execute
Write JSON to /outputs/         | SKIP — log "would write {file}"| Execute via shared_output_writer
Write TXT to /outputs/          | SKIP — log "would write {file}"| Execute
Write manifest to /outputs/     | SKIP — log "would write {file}"| Execute
Write final checkpoint to /logs/| Skip                           | Write

PHASE 8R/9 — ROLLBACK SAFETY VERIFICATION
──────────────────────────────────────────────────────────────────────────────
Verify no writes to live site   | Log "SITE_INTEGRITY_VERIFIED (DRY — no requests made)" | Execute + log
Verify source file mtimes       | Execute (baseline vs current)  | Execute
Verify sensitive not opened     | Execute                        | Execute
Verify output files non-zero    | SKIP (no output files written) | Execute
Write integrity entries to log  | Execute (in DRY_RUN log only)  | Execute (in run_log)
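
The "Verify source file mtimes" row compares a baseline against current
values. A minimal sketch of that comparison, assuming the baseline is taken
before any file is opened; the function names snapshot_mtimes and
find_modified are hypothetical.

```python
import os
from typing import Dict, Iterable, List


def snapshot_mtimes(paths: Iterable[str]) -> Dict[str, int]:
    """Record each source file's modification time in nanoseconds."""
    return {path: os.stat(path).st_mtime_ns for path in paths}


def find_modified(baseline: Dict[str, int]) -> List[str]:
    """Return the paths whose mtime differs from the baseline, or that vanished."""
    changed: List[str] = []
    for path, old_mtime in baseline.items():
        try:
            current = os.stat(path).st_mtime_ns
        except FileNotFoundError:
            changed.append(path)  # a missing source file also counts as a change
            continue
        if current != old_mtime:
            changed.append(path)
    return changed
```

An empty list from find_modified means the check passed; any entry is a
candidate for a CRITICAL log entry.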

================================================================================
SECTION 3 — WHAT IS IDENTICAL IN BOTH MODES
================================================================================

These operations behave exactly the same in DRY_RUN and LIVE_RUN.
Mixed Claude does not add mode checks around these:

  - Argument parsing (Phase 0)
  - Seed file reading and URL parsing (Phase 0-1)
  - Directory walking and file listing (Phase 1 — SGC-2)
  - Sensitive file detection and partitioning (before any reads)
  - Filename parsing and authority inference (Phase 2)
  - ZIP opening in read mode and member listing (Phase 3)
  - Manifest text extraction from ZIPs (Phase 3)
  - TXT header scanning (Phase 4)
  - All conflict detection logic (Phase 6)
  - All deduplication logic (Phase 7)
  - All in-memory JSON/TXT construction (Phase 8)
  - Source file mtime verification (Phase 8R/9)
  - Sensitive file verification (Phase 8R/9)

REASON: The modes differ only in network calls and file writes.
        All data processing, classification, and analysis is mode-independent,
        so DRY_RUN is a genuine simulation that exercises all logic paths.

================================================================================
SECTION 4 — MODE GUARD IMPLEMENTATION
================================================================================

The mode guard lives in exactly ONE location in each codebase: the orchestrator
(LOCATION 1). LOCATION 2 is listed only to rule it out as a guard site.

LOCATION 1 — In orchestrator (sgc1_main.py or sgc2_main.py):
  After Phase 1 completes and before Phase 2 begins:

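    if args.mode == "DRY_RUN":
        # Assumes _write_dry_run_report() returns the path of the report it wrote.
        dry_run_report_path = _write_dry_run_report(phases_planned, files_planned, logger, args)
        logger.write_log_to_file()
        print(f"Dry run complete. See: {dry_run_report_path}")
        sys.exit(0)  # requires "import sys" at the top of the orchestrator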
    
    # LIVE_RUN continues here — confirmation gate fires next
    _show_confirmation_gate(args, seed_count, logger)

LOCATION 2 — In shared_output_writer.write_json() and write_txt():
  No mode check here. These functions do NOT check mode; they just write.
  The early exit in the orchestrator guarantees they are never called in DRY_RUN.
  Do not add mode checks inside write_json/write_txt; they would be redundant.

NO OTHER MODE GUARDS anywhere in the codebase.
Phase-level checks (e.g. "if mode == DRY_RUN: skip this request") are not needed:
the early exit in the orchestrator replaces scattered conditionals.
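
The single-guard design can be illustrated with a toy orchestrator. This is
a sketch under stated assumptions, not the real sgc1_main.py or
sgc2_main.py: run_orchestrator, its parameters, and the log strings are all
hypothetical stand-ins.

```python
import sys
from typing import Callable, List


def run_orchestrator(mode: str, write_json: Callable[[str], None], log: List[str]) -> None:
    """Toy orchestrator showing the one mode guard and the unguarded writer."""
    log.append("phases 0-1 complete")    # mode-independent work runs first
    if mode == "DRY_RUN":                # the ONE mode guard
        log.append("dry run report written")
        sys.exit(0)                      # nothing below this line runs in DRY_RUN
    log.append("confirmation gate")      # reached in LIVE_RUN only
    write_json("result.json")            # the writer itself never checks mode
```

Because the exit happens before any writer call, the writer needs no
knowledge of the mode, which keeps the guard auditable in one place.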

================================================================================
SECTION 5 — CONFIRMATION GATE IMPLEMENTATION
================================================================================

Fires in sgc1_main.py and sgc2_main.py after DRY_RUN check, before Phase 2:

  def _show_confirmation_gate(args, item_count, logger):
      code_num = 1 if "sgc1" in __file__ else 2
      # rough estimate: 9 sec per item; floor of 1 so small runs never show "~0 minutes"
      estimate_mins = max(1, round(item_count * 0.15))
      
      print("=" * 60)
      print(f"SGC-{code_num} LIVE RUN — CONFIRMATION REQUIRED")
      print("=" * 60)
      if code_num == 1:
          print(f"Seed URLs:     {item_count} loaded")
          print("Network:       https://collabtunes.com")
          print(f"Max Pages:     {args.max_pages}")
          print(f"Include X:     {args.include_x}")
      else:
          print(f"Files found:   {item_count}")
          print(f"Root Path:     {args.root}")
          print(f"Max ZIP Depth: {args.max_zip_depth}")
      print(f"Output Dir:    {args.output_dir}")
      print(f"Est. Time:     ~{estimate_mins} minutes")
      print("=" * 60)
      
      response = input("Type 'yes' to proceed or anything else to cancel: ").strip().lower()
      
      if response != "yes":
          print("Run cancelled. No output files written.")
          sys.exit(0)
      
      logger.log("SETUP", "LIVE_RUN confirmed by operator", severity="INFO")

RESUME BEHAVIOR: --resume skips the confirmation gate entirely.
  The original run was already confirmed. Resume inherits that confirmation.
  In the orchestrator, call _show_confirmation_gate() only when args.resume is not set.
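
The gate decision, including the resume bypass, can be isolated into a small
testable function. This is a sketch: confirm_live_run and its injectable ask
parameter are hypothetical, introduced so the prompt can be exercised
without a terminal.

```python
from typing import Callable


def confirm_live_run(resume: bool, ask: Callable[[str], str] = input) -> bool:
    """Return True if the run may proceed past the confirmation gate."""
    if resume:
        return True  # a resumed run inherits the original confirmation
    answer = ask("Type 'yes' to proceed or anything else to cancel: ")
    return answer.strip().lower() == "yes"
```

The orchestrator would exit with status 0 when this returns False, matching
the cancel path of _show_confirmation_gate().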

================================================================================
SECTION 6 — PROGRESS INDICATOR RULES
================================================================================

DRY_RUN: No progress indicators. Single-pass execution completes in seconds.
         Stdout carries only the mode banner and the completion line; the only
         file written is the dry run report in /logs/.

LIVE_RUN: Print progress to stdout at these milestones:

  SGC-1 progress:
    After each batch of 10 HEAD requests:
      print(f"  [{n}/{total}] HEAD requests complete")
    After Phase 2:
      print(f"  Phase 2 complete: {live} LIVE, {broken} BROKEN, {timeout} TIMEOUT")
    After every 5 body GET requests:
      print(f"  [{n}/{body_total}] pages body-crawled")
    After Phase 4:
      print(f"  Phase 4 complete: {body_captured} bodies captured, {js_rendered} JS-rendered")
    On final output write:
      print(f"  Output written: {output_path}")

  SGC-2 progress:
    After Phase 1:
      print(f"  Phase 1 complete: {file_count} files, {folder_count} folders found")
    After every 20 ZIPs inspected:
      print(f"  [{n}/{zip_total}] ZIPs inspected")
    After every 20 TXT files scanned:
      print(f"  [{n}/{txt_total}] TXT files scanned")
    On final output write:
      print(f"  Output written: {output_path}")

Do not write progress lines to the log; the log is for structured entries only.
Do not print progress in DRY_RUN mode.
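
The batch milestones above share one pattern: emit a line every N items. A
minimal sketch of that pattern follows; the helper name progress_line is
hypothetical, and it deliberately stays silent on a final partial batch
because the rules above only name full batches.

```python
from typing import Optional


def progress_line(n: int, total: int, every: int, label: str) -> Optional[str]:
    """Return a progress line when n completes a full batch of `every`, else None."""
    if n > 0 and n % every == 0:
        return f"  [{n}/{total}] {label}"
    return None
```

A caller would print the result only when it is not None, for example
progress_line(n, total, 10, "HEAD requests complete") inside the HEAD loop.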

================================================================================
SECTION 7 — SAFETY INVARIANTS THAT MUST HOLD IN BOTH MODES
================================================================================

These conditions must be true at the end of ANY run, regardless of mode.

INVARIANT 1: No file outside /outputs/ or /logs/ was created or modified.
  Verification: validate_output_path() was called before every write.
  Verification: Source file mtime baseline = mtime at end (SGC-2).
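
One way validate_output_path() could enforce INVARIANT 1 is sketched below.
Only the function name comes from this document; the signature, the
allowed_roots parameter, and the choice of ValueError are assumptions.

```python
from pathlib import Path
from typing import Iterable


def validate_output_path(target: str, allowed_roots: Iterable[str]) -> Path:
    """Return the resolved target if it sits under an allowed root, else raise."""
    resolved = Path(target).resolve()  # collapses ".." and follows symlinks
    for root in allowed_roots:
        if Path(root).resolve() in resolved.parents:
            return resolved
    raise ValueError(f"Refusing to write outside allowed roots: {resolved}")
```

Resolving before comparing is the important step: a path such as
/outputs/../secrets.txt looks like it starts in /outputs/ but resolves
outside it and is rejected.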

INVARIANT 2: No sensitive file was opened or read.
  Verification: opened_files_log contains no sensitive filepath.
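
The INVARIANT 2 check amounts to a set intersection between the opened-file
log and the sensitive partition. A minimal sketch, assuming both are
collections of path strings; the function name sensitive_files_opened is
hypothetical.

```python
import os
from typing import Iterable, List


def sensitive_files_opened(
    opened_files_log: Iterable[str], sensitive_paths: Iterable[str]
) -> List[str]:
    """Return every sensitive path that appears in the opened-file log."""
    sensitive = {os.path.normpath(p) for p in sensitive_paths}
    opened = {os.path.normpath(p) for p in opened_files_log}
    return sorted(opened & sensitive)  # empty list means the invariant holds
```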

INVARIANT 3: Only GET and HEAD requests were made to any URL (SGC-1).
  Verification: sgc1_http_requester asserts the method is GET or HEAD before sending any request.
  Verification: run_log contains SITE_INTEGRITY_VERIFIED entry.
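
The method check inside sgc1_http_requester might look like the sketch
below. The function name require_read_only_method is hypothetical. It uses
an explicit raise rather than a bare assert statement, since assert
statements are removed when Python runs with -O.

```python
ALLOWED_METHODS = frozenset({"GET", "HEAD"})


def require_read_only_method(method: str) -> str:
    """Return the normalized method, or raise if it could modify the site."""
    normalized = method.strip().upper()
    if normalized not in ALLOWED_METHODS:
        raise AssertionError(f"Blocked non-read-only HTTP method: {method!r}")
    return normalized
```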

INVARIANT 4: All output files are non-zero size.
  Verification: shared_output_writer checks size after every write.
  Exception: DRY_RUN — no output files, invariant N/A.

INVARIANT 5: The run_id is unique to this run.
  Verification: run_id includes datetime to millisecond + mode suffix.
  No two runs share a run_id in the same 24-hour period.
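
A run_id built from a millisecond timestamp plus a mode suffix could be
generated as sketched below. The exact layout (compact UTC timestamp,
underscore, mode) and the name make_run_id are assumptions; the document
only specifies the two ingredients.

```python
from datetime import datetime, timezone
from typing import Optional


def make_run_id(mode: str, now: Optional[datetime] = None) -> str:
    """Build a run_id from a UTC timestamp (millisecond precision) and the mode."""
    moment = now if now is not None else datetime.now(timezone.utc)
    millis = moment.microsecond // 1000  # truncate microseconds to milliseconds
    return f"{moment.strftime('%Y%m%dT%H%M%S')}{millis:03d}_{mode}"
```

Passing now explicitly keeps the function deterministic in tests; in a real
run it would be called once at startup with no argument.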

If any invariant is violated → log at CRITICAL or ABORT severity as appropriate.

================================================================================
END DRY_RUN_AND_LIVE_RUN_SAFETY_MATRIX_5_12_26.txt
================================================================================
