================================================================================
SELF_GATHERING_CODE_2_FLOWMAP_5_12_26.txt
CollabORhythm / Collabtunes — Engineering Blueprint Phase
Generated: 5.12.26 | Black Claude — Blueprint Session
SUBJECT: SGC-2 — Repository / ZIP / TXT / Manifest Gatherer
STATUS: ENGINEERING BLUEPRINT — not final code
================================================================================

PURPOSE:
Scan the local COLLABTUNES_PROJECT_ROOT directory and all its contents
(including inside ZIPs) and produce a structured, machine-readable inventory
of every project asset — its path, file type, authority level, category,
version, dependencies, and relationship to other files.

================================================================================
PHASE 0 — PRE-FLIGHT CHECKS
================================================================================

STEP 0.1 — ARGUMENT PARSE
  Read command-line args:
    --mode [DRY_RUN | LIVE_RUN]      (default: DRY_RUN)
    --root [path]                     (required: path to COLLABTUNES_PROJECT_ROOT)
    --output-dir [path]               (default: ./outputs/)
    --max-zip-depth [N]               (default: 2 — max levels of nested ZIPs to open)
    --skip-sensitive                  (flag: skip opening sensitive files, default ON)
    --resume [checkpoint_file]        (optional: resume from partial checkpoint)
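
The STEP 0.1 flag set maps directly onto argparse. A minimal sketch (flag names and defaults are taken from the list above; this is blueprint illustration, not final code):

```python
import argparse

def build_parser():
    # Flag names and defaults mirror STEP 0.1 of the blueprint.
    p = argparse.ArgumentParser(description="SGC-2 repository gatherer")
    p.add_argument("--mode", choices=["DRY_RUN", "LIVE_RUN"], default="DRY_RUN")
    p.add_argument("--root", required=True,
                   help="path to COLLABTUNES_PROJECT_ROOT")
    p.add_argument("--output-dir", default="./outputs/")
    p.add_argument("--max-zip-depth", type=int, default=2,
                   help="max levels of nested ZIPs to open")
    p.add_argument("--skip-sensitive", action="store_true", default=True,
                   help="skip opening sensitive files (default ON)")
    p.add_argument("--resume", default=None,
                   help="checkpoint file to resume from")
    return p
```

Note: with `action="store_true"` and `default=True` the sensitive skip is effectively always on, matching "default ON" above; an inverse flag would be needed to ever disable it.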

STEP 0.2 — VALIDATE ROOT DIRECTORY
  Confirm root path exists and is readable
  Confirm /outputs/ and /logs/ are writable (create if missing)
  DO NOT validate against expected folder structure yet — that is Phase 5

STEP 0.3 — WRITE INITIAL CHECKPOINT
  File: SGC2_CHECKPOINT_[DATE]_[TIME]_START.json
  Contents: {run_id, timestamp, mode, root_path, step: "PRE_FLIGHT_COMPLETE"}

IF --resume FLAG:
  Load checkpoint file → skip to last completed step → resume from there
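
The STEP 0.3 checkpoint write could look like the following sketch; the timestamp format and `run_id` argument are assumptions (the blueprint fixes only the SGC2_CHECKPOINT_ filename prefix and the JSON fields):

```python
import json
import os
import time

def write_checkpoint(log_dir, run_id, mode, root_path, step):
    # Timestamped checkpoint per STEP 0.3; the %m_%d_%y format is an
    # assumption based on the project's date tokens.
    stamp = time.strftime("%m_%d_%y_%H%M%S")
    payload = {
        "run_id": run_id,
        "timestamp": stamp,
        "mode": mode,
        "root_path": root_path,
        "step": step,
    }
    path = os.path.join(log_dir, f"SGC2_CHECKPOINT_{stamp}_{step}.json")
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(payload, fh, indent=2)
    return path
```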

================================================================================
PHASE 1 — DIRECTORY WALK
================================================================================

STEP 1.1 — RECURSIVE DIRECTORY SCAN
  Walk COLLABTUNES_PROJECT_ROOT recursively using os.walk()
  For every file encountered:
    Collect: filepath, filename, extension, size_bytes, mtime, parent_folder
    Add to: raw_file_list[]
  For every folder encountered:
    Collect: folderpath, folder_name, item_count
    Add to: folder_list[]
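
A sketch of the STEP 1.1 walk using os.walk() and os.stat(); record field names follow the lists above:

```python
import os

def walk_root(root):
    # Collect per-file and per-folder records as in STEP 1.1.
    raw_file_list, folder_list = [], []
    for dirpath, dirnames, filenames in os.walk(root):
        folder_list.append({
            "folderpath": dirpath,
            "folder_name": os.path.basename(dirpath),
            "item_count": len(dirnames) + len(filenames),
        })
        for name in filenames:
            fp = os.path.join(dirpath, name)
            st = os.stat(fp)
            raw_file_list.append({
                "filepath": fp,
                "filename": name,
                "extension": os.path.splitext(name)[1].lower(),
                "size_bytes": st.st_size,
                "mtime": st.st_mtime,
                "parent_folder": os.path.basename(dirpath),
            })
    return raw_file_list, folder_list
```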

STEP 1.2 — SENSITIVE FILE DETECTION (BEFORE OPENING ANYTHING)
  Check filename against sensitive file list:
    SENSITIVE_PATTERNS = [
      "DEFAMATION_RISK_REGISTRY",
      "CREATOR_INTERVIEW_TRANSCRIPT",
      "MASTER_DUMPS"  // session dumps — internal only
    ]
  For any file matching a pattern:
    Add to: sensitive_files[] (catalog path + existence only)
    Set flag: SENSITIVE_INTERNAL
    DO NOT open, read, or extract content from these files
    This check runs BEFORE any file reading logic
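
The STEP 1.2 gate reduces to a substring check against the pattern list; case-insensitive matching is an assumption here (the standard filenames are upper-case anyway):

```python
SENSITIVE_PATTERNS = [
    "DEFAMATION_RISK_REGISTRY",
    "CREATOR_INTERVIEW_TRANSCRIPT",
    "MASTER_DUMPS",
]

def is_sensitive(filename):
    # Substring match on the filename only; must run before any
    # open()/read() logic ever touches the file.
    upper = filename.upper()
    return any(pat in upper for pat in SENSITIVE_PATTERNS)
```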

STEP 1.3 — EXTENSION CLASSIFICATION
  Assign file_type to each file in raw_file_list:
    .txt   → TXT
    .zip   → ZIP
    .html  → HTML
    .htm   → HTML
    .pdf   → PDF
    .md    → MD
    .py    → SCRIPT
    .json  → JSON
    .odt   → DOCUMENT (read filename only, do not open)
    .csv   → CSV
    other  → OTHER
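
STEP 1.3 is a dictionary lookup with OTHER as the fallback:

```python
# Extension-to-type table from STEP 1.3.
EXT_MAP = {
    ".txt": "TXT", ".zip": "ZIP", ".html": "HTML", ".htm": "HTML",
    ".pdf": "PDF", ".md": "MD", ".py": "SCRIPT", ".json": "JSON",
    ".odt": "DOCUMENT", ".csv": "CSV",
}

def classify(extension):
    # Unknown extensions fall through to OTHER.
    return EXT_MAP.get(extension.lower(), "OTHER")
```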

STEP 1.4 — CHECKPOINT AFTER PHASE 1
  Write: SGC2_CHECKPOINT_[DATE]_[TIME]_PHASE1_COMPLETE.json
  Contents: raw_file_list count, folder_list count, sensitive_files flagged

================================================================================
PHASE 2 — FILENAME PARSING (Authority and Category Extraction)
================================================================================

STEP 2.1 — PARSE NAMING CONVENTION
  Source: MASTER_NAMING_STANDARD_5_12_26.txt + COLLABTUNES_OUTPUT_NAMING_RULES_PERMANENT
  For each file in raw_file_list (excluding SENSITIVE):
    Apply regex against filename to extract:
      file_count:    leading integer before first underscore (for ZIPs)
      category:      any of the approved category codes found in name
      date_token:    M_DD_YY or M_DD_26 pattern
      vol_number:    VOL[N] pattern (None if absent)
      authority_tag: ACTIVE_CANON | DEPRECATED | ARCHIVE | SUPERSEDED | LOCKED (if in name)
      project_tag:   COLLABTUNES | COLLABORHYTHM (if in name)
      purpose:       everything between project_tag and date_token

STEP 2.2 — CATEGORY ASSIGNMENT
  Using the parsed category token(s), assign category[] array:
    LIVE_CAPTURE | CANON | QA | HTML | REGISTRY | DEPLOYMENT | BACKUP |
    PLACEHOLDERS | RATINGS | URL_MAPS | BLOCKERS | NAV_STABILIZATION |
    TOM_DECISIONS | MASTER_DUMPS | GENERATOR | HANDOFF | MANIFESTS |
    AUTHORITY | PREHANDOFF | ROLLBACK
  If no category code found: category = ["UNCATEGORIZED"]

STEP 2.3 — AUTHORITY LEVEL INFERENCE
  Priority order (first match wins):
    1. Explicit tag in filename (ACTIVE_CANON, DEPRECATED, ARCHIVE, SUPERSEDED)
    2. Tag in MANIFEST file for this ZIP (if manifest exists)
    3. VOL number pattern: highest VOL = AUTHORITATIVE, others = REFERENCE
    4. Date pattern: most recent date = AUTHORITATIVE if same base name
    5. Default: UNKNOWN
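
The first-match-wins chain of STEP 2.3 as a sketch; the three derived inputs are assumed to be computed per file group beforehand (None meaning that signal is unavailable), and tagging non-current files REFERENCE in rules 3-4 follows the VOL rule above:

```python
def infer_authority(explicit_tag, manifest_tag, is_highest_vol, is_most_recent):
    # Priority order from STEP 2.3; first available signal wins.
    if explicit_tag:
        return explicit_tag               # 1. tag in filename
    if manifest_tag:
        return manifest_tag               # 2. tag in ZIP manifest
    if is_highest_vol is not None:        # 3. VOL number comparison
        return "AUTHORITATIVE" if is_highest_vol else "REFERENCE"
    if is_most_recent is not None:        # 4. date comparison, same base name
        return "AUTHORITATIVE" if is_most_recent else "REFERENCE"
    return "UNKNOWN"                      # 5. default
```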

STEP 2.4 — NAMING COMPLIANCE CHECK
  For each file:
    compliant = True
    IF no date_token found: compliant = False, flag MISSING_DATE
    IF filename is in banned_names list: compliant = False, flag BAD_NAMING
      banned_names = ["final", "output", "new", "fixed", "temp", "test",
                       "untitled", "copy", "export", "revised"]
      (exact match, case-insensitive)
    IF file_type == ZIP AND file_count not found in filename: flag MISSING_COUNT
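
STEP 2.4 as a sketch; date_token and file_count are assumed to come from the STEP 2.1 parse:

```python
BANNED_NAMES = {"final", "output", "new", "fixed", "temp", "test",
                "untitled", "copy", "export", "revised"}

def check_compliance(filename, file_type, date_token, file_count):
    # Returns (compliant, flags) per STEP 2.4; exact-match,
    # case-insensitive check of the stem against the banned list.
    flags = []
    stem = filename.rsplit(".", 1)[0]
    if date_token is None:
        flags.append("MISSING_DATE")
    if stem.lower() in BANNED_NAMES:
        flags.append("BAD_NAMING")
    if file_type == "ZIP" and file_count is None:
        flags.append("MISSING_COUNT")
    return (len(flags) == 0, flags)
```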

================================================================================
PHASE 3 — ZIP INSPECTION
================================================================================

STEP 3.1 — ZIP INVENTORY (Contents WITHOUT full extraction)
  For each file where file_type == ZIP:
    Open zip in read mode: zf = zipfile.ZipFile(filepath, "r")
    List all member files: zip_contents = zf.namelist()
    For each member:
      Collect: member_name, member_size, member_mtime, member_extension
    Add to: zip_index[filepath] = {member_count, members[]}
    DO NOT extract any members — read directory listing only
    EXCEPTION: If member matches MANIFEST or README pattern → extract text only
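
A sketch of STEP 3.1 using zipfile.ZipFile.infolist(), which reads the archive's central directory without extracting any member:

```python
import os
import zipfile

def index_zip(zip_path):
    # STEP 3.1: directory listing only; no member content is extracted.
    with zipfile.ZipFile(zip_path, "r") as zf:
        members = [{
            "member_name": info.filename,
            "member_size": info.file_size,
            "member_mtime": info.date_time,   # (Y, M, D, h, m, s) tuple
            "member_extension": os.path.splitext(info.filename)[1].lower(),
        } for info in zf.infolist()]
    return {"member_count": len(members), "members": members}
```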

STEP 3.2 — MANIFEST EXTRACTION FROM ZIPS
  For each ZIP, check members for MANIFEST or READ_ME_FIRST files:
    Patterns: filename contains "MANIFEST" OR "READ_ME_FIRST" OR "README"
    If found: extract text content of that member only
    Parse manifest for:
      ZIP_NAME, PHASE, CATEGORY, AUTHORITATIVE flag, CONTAINS list
      WHAT_THIS_RESOLVES, WHAT_REMAINS_UNRESOLVED, DO_NOT_OVERWRITE
    Store in: manifest_data[zip_filepath] = parsed_manifest

STEP 3.3 — ZIP vs MANIFEST CROSS-REFERENCE
  For each ZIP with a manifest:
    Compare: manifest member list vs actual zip_contents list
    If mismatch: flag MANIFEST_MISMATCH (manifest claims contents that differ)
  For each ZIP without a manifest:
    flag MISSING_MANIFEST

STEP 3.4 — NESTED ZIP HANDLING
  If a ZIP contains another ZIP as a member:
    IF current_depth < max_zip_depth: recurse into nested ZIP
    ELSE: catalog nested ZIP name only, flag MAX_DEPTH_REACHED
    Track depth via call stack counter
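
Depth-limited recursion for STEP 3.4, sketched over in-memory bytes; the "!/" separator for nested member paths is an assumption:

```python
import io
import zipfile

def list_nested(zip_bytes, depth, max_depth, prefix=""):
    # STEP 3.4: recurse into nested ZIPs only while depth < max_depth;
    # deeper ZIPs are cataloged by name and tagged MAX_DEPTH_REACHED.
    out = []
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        for name in zf.namelist():
            out.append(prefix + name)
            if name.lower().endswith(".zip"):
                if depth < max_depth:
                    out += list_nested(zf.read(name), depth + 1,
                                       max_depth, prefix + name + "!/")
                else:
                    out.append(prefix + name + "!MAX_DEPTH_REACHED")
    return out
```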

STEP 3.5 — ZIP AUTHORITY INFERENCE
  If manifest_data exists for ZIP: use manifest's AUTHORITATIVE value
  Else: use filename-based authority inference from Phase 2

STEP 3.6 — CHECKPOINT AFTER PHASE 3
  Write: SGC2_CHECKPOINT_[DATE]_[TIME]_PHASE3_COMPLETE.json

================================================================================
PHASE 4 — TXT FILE INSPECTION
================================================================================

STEP 4.1 — IDENTIFY TXT FILES TO INSPECT
  From raw_file_list, filter: file_type == TXT AND NOT sensitive
  Default: read only the first 2000 chars of each file (header scan, not a full read)
  Exception: a full read is permitted only when file size < 50KB

STEP 4.2 — HEADER SCAN
  For each TXT file:
    Read first 2000 characters
    Extract:
      doc_title: first non-empty line
      status_line: line containing "STATUS:" or "AUTHORITATIVE"
      generated_line: line containing "Generated:" date
      purpose_line: line containing "PURPOSE:"
      supersedes_refs: any line containing "SUPERSEDES:"
      prior_version_refs: any line containing "VOL" + lower VOL number
    Store in: txt_metadata[filepath] = extracted fields
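
A sketch of the STEP 4.2 field extraction; the function trims its input to 2000 characters itself, and prior_version_refs is omitted because the VOL comparison needs the Phase 2 parse:

```python
def scan_header(text):
    # Parse the first 2000 characters for the STEP 4.2 fields.
    head = text[:2000]
    lines = [ln.strip() for ln in head.splitlines()]

    def find(marker):
        return next((ln for ln in lines if marker in ln), None)

    return {
        "doc_title": next((ln for ln in lines if ln), None),
        "status_line": find("STATUS:") or find("AUTHORITATIVE"),
        "generated_line": find("Generated:"),
        "purpose_line": find("PURPOSE:"),
        "supersedes_refs": [ln for ln in lines if "SUPERSEDES:" in ln],
    }
```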

STEP 4.3 — STATUS TAG EXTRACTION
  Look for known status markers in header:
    ACTIVE_CANON → authority = LOCKED
    AUTHORITATIVE → authority = AUTHORITATIVE
    WORKING → authority = WORKING
    DEPRECATED → authority = DEPRECATED
    REFERENCE → authority = REFERENCE
  If no status marker: use filename-based inference from Phase 2

STEP 4.4 — SUPERSEDES/SUPERSEDED-BY DETECTION
  For each TXT file, check supersedes_refs:
    IF supersedes_refs found: link to referenced files
      supersedes: [list of filenames this file supersedes]
    Reverse lookup: for each referenced file, mark it:
      superseded_by: this file's filename

================================================================================
PHASE 5 — FOLDER STRUCTURE VALIDATION
================================================================================

STEP 5.1 — LOAD EXPECTED STRUCTURE
  Source: REPOSITORY_STRUCTURE_PLAN_5_12_26.txt (parse folder list)
  Expected top-level folders:
    00_OPERATIONAL_RULES, 01_LIVE_CAPTURE, 02_CANON, 03_RATINGS,
    04_BLOCKERS, 05_URL_MAPS, 06_NAV_STABILIZATION, 07_DEPLOYMENT,
    08_PLACEHOLDERS, 09_HTML_PROTOTYPES, 10_QA, 11_REGISTRY,
    12_MANIFESTS, 13_SOURCE_ZIPS, 14_GENERATED_OUTPUT

STEP 5.2 — COMPARE ACTUAL vs EXPECTED STRUCTURE
  For each expected folder:
    IF exists in folder_list: mark PRESENT
    IF not exists: mark MISSING, flag MISSING_FOLDER
  For each actual folder NOT in expected list:
    mark UNEXPECTED_FOLDER (not an error — just note it)
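
STEP 5.2 reduces to set arithmetic over folder names:

```python
def compare_structure(expected, actual):
    # PRESENT / MISSING / UNEXPECTED per STEP 5.2; unexpected folders
    # are noted, not treated as errors.
    expected, actual = set(expected), set(actual)
    return {
        "present": sorted(expected & actual),
        "missing": sorted(expected - actual),
        "unexpected": sorted(actual - expected),
    }
```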

STEP 5.3 — FILE PLACEMENT VALIDATION
  For each file in raw_file_list:
    Check if category[] matches the folder it's in
    Mismatch examples:
      HTML file in 02_CANON → flag MISPLACED_FILE
      Rating file in 07_DEPLOYMENT → flag MISPLACED_FILE
    These are flags only — do not move anything

================================================================================
PHASE 6 — DEPENDENCY AND RELATIONSHIP MAPPING
================================================================================

STEP 6.1 — BUILD SUPERSEDES CHAIN
  For all files with supersedes/superseded_by links from Phase 4:
    Build directed graph: older version → newer version
    Output: version_chains[] = [{base_name, versions: [v1, v2, v3], current: v3}]
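
A sketch of the STEP 6.1 chain construction; it assumes each file supersedes at most one other file, so the graph decomposes into simple chains (the base_name field is omitted here):

```python
def build_chains(links):
    # links: (older, newer) pairs from supersedes references.
    # Roots are files that are superseded but never supersede anything;
    # each chain is followed forward to its current version.
    newer_of = dict(links)
    olds = {old for old, _ in links}
    news = {new for _, new in links}
    chains = []
    for start in sorted(olds - news):
        chain = [start]
        while chain[-1] in newer_of:
            chain.append(newer_of[chain[-1]])
        chains.append({"versions": chain, "current": chain[-1]})
    return chains
```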

STEP 6.2 — CROSS-REFERENCE: ZIP MEMBERS vs STANDALONE FILES
  For each TXT file that also appears as a ZIP member:
    Are they the same version? (compare filename + date)
    If standalone is newer: flag ZIP_MEMBER_OUTDATED
    If ZIP member is newer: flag STANDALONE_OUTDATED

STEP 6.3 — DEPENDENCY MAP EXTRACTION
  Using the chains already read from CROSS_SYSTEM_DEPENDENCY_MAP_5_12_26.txt:
    Map each dependency chain to the files involved in the repo:
      CHAIN A (Rating Gate): find all involved files in repo
      CHAIN B (AIO Generator): find all involved files in repo
      CHAIN C (Nav Integrity): find all involved files
      CHAIN D (Crosslink Propagation): find all involved files
      CHAIN E (Defamation Clearance): catalog only (no content read)
    For each chain: output dependency_chain_status:
      COMPLETE: all required files present
      MISSING: one or more required files absent (list which)
      BLOCKED: dependency chain cannot proceed (list blocker)

STEP 6.4 — GENERATOR INPUT READINESS (from GENERATOR_INPUT_READINESS_REPORT)
  For each input required by the AIO generator:
    Check if corresponding file exists in repo with AUTHORITATIVE status
    Output: generator_inputs[] = [{input_name, file_found, authority, status}]
    If mood_settings_ratings_explicit_for_all_34_albums not found: flag CRITICAL_MISSING

STEP 6.5 — CHECKPOINT AFTER PHASE 6
  Write: SGC2_CHECKPOINT_[DATE]_[TIME]_PHASE6_COMPLETE.json

================================================================================
PHASE 7 — DEDUPE AND CONFLICT SURFACING
================================================================================

STEP 7.1 — DETECT DUPLICATE CONTENT (same base name, multiple versions)
  Group files by base_name (filename without date and VOL suffix):
    If multiple files share a base_name:
      Identify the most current by date (or highest VOL)
      Tag: current → AUTHORITATIVE, others → REFERENCE or DEPRECATED
      IF no clear most-current (same date, different content): flag AMBIGUOUS_VERSION
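
A date-only sketch of the STEP 7.1 grouping (the VOL comparison and the AMBIGUOUS_VERSION tie case are omitted); the token-stripping regex is an assumption about the naming grammar:

```python
import re
from collections import defaultdict

DATE_RE = re.compile(r"(\d{1,2})_(\d{1,2})_(\d{2})")

def tag_versions(filenames):
    # Group by base_name (date/VOL tokens stripped), then tag the file
    # with the latest date AUTHORITATIVE and the rest REFERENCE.
    def base(name):
        stem = name.rsplit(".", 1)[0]
        return re.sub(r"_?(VOL\d+|\d{1,2}_\d{1,2}_\d{2})", "", stem)

    def date_key(name):
        m = DATE_RE.search(name)
        # Sort key: (year, month, day); undated files sort first.
        return (int(m.group(3)), int(m.group(1)), int(m.group(2))) if m else (0, 0, 0)

    groups = defaultdict(list)
    for n in filenames:
        groups[base(n)].append(n)
    tags = {}
    for names in groups.values():
        names.sort(key=date_key)
        for n in names[:-1]:
            tags[n] = "REFERENCE"
        tags[names[-1]] = "AUTHORITATIVE"
    return tags
```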

STEP 7.2 — DETECT NAMING VIOLATIONS
  Collect all files flagged in Phase 2 (MISSING_DATE, BAD_NAMING, MISSING_COUNT)
  Add to: violations[] for naming compliance section of report

STEP 7.3 — DETECT ISOLATED FILES (no manifest, no ZIP membership, no chain)
  Files that are:
    Standalone TXT/HTML
    Not referenced by any manifest
    Not part of any version chain
    Not flagged as SENSITIVE
  Flag: ORPHAN_FILE — may need to be organized or manifested

================================================================================
PHASE 8 — EXPORT LOGIC
================================================================================

STEP 8.1 — GENERATE JSON OUTPUT
  Filename: SGC2_REPO_INVENTORY_[DATE]_[TIME].json
  Structure: per FINAL_SELF_GATHERING_CODE_SPEC top-level schema
  Sections:
    files[] — all non-sensitive files with full metadata
    folders[] — all folders with expected/actual status
    zips[] — all ZIPs with member lists
    version_chains[] — all identified version chains
    dependency_chains[] — all dependency chain statuses
    generator_inputs[] — generator readiness assessment
    conflicts[] — all conflict objects
    flags[] — all flag objects sorted by severity
    sensitive_files[] — path + existence only, no content
    run_log[] — step-by-step execution log
  Write to: /outputs/

STEP 8.2 — GENERATE TXT SUMMARY (human-readable)
  Filename: SGC2_REPO_INVENTORY_[DATE]_[TIME]_SUMMARY.txt
  Sections:
    RUN METADATA
    REPOSITORY OVERVIEW (counts by type, by category, by authority)
    AUTHORITATIVE FILES (primary sources — most current per topic)
    DEPRECATED/ARCHIVED FILES (keep but don't use)
    VERSION CHAINS (visual: VOL1 → VOL2 → VOL3 = CURRENT)
    FOLDER STRUCTURE STATUS (PRESENT / MISSING per expected folder)
    MISPLACED FILES (in wrong folder)
    ORPHAN FILES (no manifest, no chain)
    NAMING VIOLATIONS
    GENERATOR INPUT READINESS SUMMARY
    DEPENDENCY CHAIN STATUS SUMMARY
    CONFLICTS
    FLAGS (CRITICAL first)
    SENSITIVE FILES (names only — no content)
  Write to: /outputs/

STEP 8.3 — GENERATE MANIFEST
  Filename: SGC2_RUN_MANIFEST_[DATE]_[TIME].txt
  Contents:
    Run ID | Date | Mode | Files scanned | ZIPs opened | Conflicts | Flags
    Output files produced (with sizes)
    CRITICAL flags count | HIGH flags count
    Top 5 items requiring immediate attention
  Write to: /outputs/

STEP 8.4 — WRITE FINAL CHECKPOINT (complete)
  Filename: SGC2_CHECKPOINT_[DATE]_[TIME]_COMPLETE.json
  Contents: full results + run_log + status = "COMPLETE"
  Write to: /logs/

================================================================================
PHASE 9 — ROLLBACK SAFETY VERIFICATION
================================================================================

STEP 9.1 — VERIFY NO FILE MODIFICATIONS
  Check modification timestamps on all files that were OPENED for reading
  IF any source file mtime changed during run: ABORT and write ALERT
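
STEP 9.1 presupposes an mtime snapshot taken before any file is opened; the verification is then a direct comparison:

```python
import os

def verify_unmodified(snapshot):
    # snapshot: {filepath: mtime} captured before the run opened anything.
    # Returns the list of files whose mtime changed during the run.
    changed = []
    for path, old_mtime in snapshot.items():
        if os.path.exists(path) and os.stat(path).st_mtime != old_mtime:
            changed.append(path)
    return changed
```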

STEP 9.2 — VERIFY NO FILE MOVEMENTS OR DELETIONS
  Confirm the file count under the root directory matches the count taken at the start of the run
  IF count differs: flag REPO_INTEGRITY_ERROR

STEP 9.3 — VERIFY OUTPUT FILES WRITTEN CORRECTLY
  For each output file: read back first 100 bytes to confirm write success
  IF any output file is 0 bytes: flag WRITE_ERROR

================================================================================
SCAN LOGIC SUMMARY (Quick Reference)
================================================================================

INPUT:   COLLABTUNES_PROJECT_ROOT filesystem
SCAN:    os.walk() all files → classify by type and naming convention
ZIP:     Open ZIPs read-only → list members → extract manifests only
TXT:     Header scan (first 2000 chars) → extract status/authority/supersedes
SKIP:    SENSITIVE files (catalog path only, never open)
FOLDER:  Compare actual vs expected structure from REPOSITORY_STRUCTURE_PLAN
RELATE:  Build version chains, dependency maps, generator input readiness
DEDUPE:  Group by base_name → tag current vs superseded
EXPORT:  JSON + TXT + Manifest + Final Checkpoint

DOES NOT:
  Modify any file in repo
  Delete or move any file
  Open DEFAMATION_RISK_REGISTRY
  Open CREATOR_INTERVIEW_TRANSCRIPT
  Extract full ZIP contents (manifest text only)
  Read HTML file content (catalog metadata only)
  Exceed max_zip_depth on nested ZIPs

================================================================================
END SELF_GATHERING_CODE_2_FLOWMAP_5_12_26.txt
================================================================================
