Merge branch 'refactor/release-parser-v2'

2026-05-20 01:08:20 +02:00
parent 9f10f4e0ad 629387591f
commit fcd80763e2
25 changed files with 1516 additions and 468 deletions
@@ -15,8 +15,60 @@ callers).
 ## [Unreleased]
 ---
 ## [2026-05-20] — Release parser v2 (EASY + SHITTY)
 ### Added
 - **Release parser v2 — EASY path live** (`alfred/domain/release/parser/`):
  new annotate-based pipeline (tokenize → annotate → assemble) drives
  releases from known groups. Exposes `Token` (frozen VO with `index` +
  `role` + `extra`), `TokenRole` enum (structural/technical/meta families),
  and `GroupSchema` / `SchemaChunk` value objects.
  - `pipeline.tokenize`: string-ops separator split (no regex), strips
    a `[site.tag]` prefix/suffix first.
  - `pipeline.annotate`: detects the trailing group right-to-left
    (priority to `codec-GROUP` shape, fallback to any non-source dashed
    token), looks up its `GroupSchema`, then walks tokens and schema
    chunks in lockstep — optional chunks that don't match are skipped,
    mandatory mismatches abort EASY and return `None` so the caller can
    fall back to SHITTY.
  - `pipeline.assemble`: folds annotated tokens into a
    `ParsedRelease`-compatible dict.
  - `parse_release` (in `release.services`) tries the v2 EASY path first
    and falls through to the legacy SHITTY heuristic on `None`. Legacy
    SHITTY/PATH OF PAIN behavior is unchanged.
  - Knowledge: `alfred/knowledge/release/release_groups/{kontrast,elite,
    rarbg}.yaml` declare the canonical chunk order per group, loaded via
    new `ReleaseKnowledge.group_schema(name)` port method.
  - Tests in `tests/domain/release/test_parser_v2_{scaffolding,easy}.py`
    cover token VOs, site-tag stripping, group detection, schema-driven
    annotation (movie, TV episode, season pack with optional source),
    and field assembly.
 - **Release parser v2 — enricher pass** completes the EASY pipeline.
  The structural schema walk now tolerates non-positional tokens
  between chunks (instead of aborting on leftover tokens), and a second
  pass tags them with audio / video-meta / edition / language roles.
  Multi-token sequences from `audio.yaml`, `video.yaml`, `editions.yaml`
  (e.g. `DTS.HD.MA`, `DV.HDR10`, `TrueHD.Atmos`, `DIRECTORS.CUT`) are
  matched before single tokens. Channel layouts like `5.1` and `7.1`
  (split into two tokens by the `.` separator) are detected as
  consecutive pairs. Sequence members carry an `extra["sequence_member"]`
  marker so `assemble` extracts the canonical value only from the
  primary token. KONTRAST releases with audio / HDR / edition / language
  metadata now produce a fully populated `ParsedRelease`.
 - **Streaming distributor as a separate dimension** from encoding source.
  New `alfred/knowledge/release/distributors.yaml` (NF, AMZN, DSNP, HMAX,
  ATVP, HULU, PCOK, PMTP, CR) feeds a new `ReleaseKnowledge.distributors`
  port field, a `TokenRole.DISTRIBUTOR` annotation, and a
  `ParsedRelease.distributor` field. `WEB-DL` stays the source; the
  platform that produced the release is now recorded distinctly. The
  five entries (NF, AMZN, DSNP, HMAX, ATVP) were correspondingly removed
  from `sources.yaml`.
 - **Real-world release fixtures** under `tests/fixtures/releases/{easy,shitty,path_of_pain}/`,
  each documenting an expected `ParsedRelease` plus the future `routing`
  (library / torrents / seed_hardlinks) for the upcoming `organize_media`
@@ -54,6 +106,22 @@ callers).
 ### Changed
 - **Release parser v2 — SHITTY simplified to dict-driven tagging**.
  The legacy ~480-line heuristic block in `release/services.py` is gone;
  `pipeline._annotate_shitty` does a single pass that looks each token
  up in the kb buckets (resolutions / sources / codecs / distributors /
  year / `SxxExx`) with first-match-wins semantics, and the leftmost
  contiguous UNKNOWN run becomes the title. `annotate()` no longer
  returns `None` — SHITTY is the always-on fallback when no group schema
  matches. `services.py` shrunk from ~525 to ~85 lines. Four fixtures
  (`deutschland_franchise_box`, `sleaford_yt_slug`,
  `super_mario_bilingual`, `predator_space_separators` — the last one
  moved from `shitty/` → `path_of_pain/`) are now marked
  `pytest.mark.xfail(strict=False)` documenting PoP-grade pathologies
  that SHITTY intentionally won't handle. `ReleaseFixture` grows an
  `xfail_reason` field; the parametrized suite wires the xfail mark
  automatically.
 - **`parse_release` tokenizer is now data-driven**: it splits on any character
  listed in `separators.yaml` (regex character class) instead of `name.split(".")`.
  This makes YTS-style releases (`The Father (2020) [1080p] [WEBRip] [5.1] [YTS.MX]`),
@@ -0,0 +1,31 @@
 """Release parser v2 — annotate-based pipeline.
 This package is the future home of ``parse_release``. It restructures the
 parsing logic around a **tokenize → annotate → assemble** pipeline:
 1. **tokenize**: split the release name into atomic tokens.
 2. **annotate**: walk tokens left-to-right, assigning each one a
   :class:`TokenRole` (TITLE, YEAR, SEASON, RESOLUTION, …) using the
   injected :class:`~alfred.domain.release.ports.knowledge.ReleaseKnowledge`.
 3. **assemble**: fold the annotated tokens into a :class:`ParsedRelease`.
 The pipeline has three internal paths driven by the detected release group:
 - **EASY**: known group (KONTRAST, RARBG, …) with a schema-driven layout
  declared in ``knowledge/release/release_groups/<group>.yaml``.
 - **SHITTY**: unknown group, best-effort matching against the global
  knowledge sets, with a 0-100 confidence score.
 - **PATH OF PAIN**: score below threshold OR critical chunks missing —
  signaled to the caller, who decides whether to involve the LLM/user.
 Today the package exposes scaffolding only (token VOs and a thin pipeline
 stub). The legacy ``parse_release`` in ``release.services`` keeps serving
 production until each piece of the v2 pipeline is wired in.
 """
 from __future__ import annotations
 from .schema import GroupSchema, SchemaChunk
 from .tokens import Token, TokenRole
 __all__ = ["GroupSchema", "SchemaChunk", "Token", "TokenRole"]
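The three stages described in the package docstring above can be sketched end to end as follows. This is a minimal illustration under stated assumptions, not the package's implementation: `ToyToken`, `toy_tokenize`, `toy_annotate`, `toy_assemble`, and the hard-coded knowledge sets are hypothetical stand-ins for `Token`, the real pipeline functions, and the injected `ReleaseKnowledge`.

```python
from __future__ import annotations

from dataclasses import dataclass, replace

# Hypothetical stand-ins for the YAML-backed knowledge sets.
SEPARATORS = (".", " ", "_")
RESOLUTIONS = {"720p", "1080p", "2160p"}
SOURCES = {"bluray", "webrip", "hdtv"}
CODECS = {"x264", "x265", "hevc"}


@dataclass(frozen=True)
class ToyToken:
    text: str
    index: int
    role: str = "UNKNOWN"


def toy_tokenize(name: str) -> list[ToyToken]:
    """Stage 1: swap every separator for a NUL sentinel, then split on it."""
    buf = name.strip()
    for sep in SEPARATORS:
        buf = buf.replace(sep, "\x00")
    pieces = [p for p in buf.split("\x00") if p]
    return [ToyToken(text=p, index=i) for i, p in enumerate(pieces)]


def _lookup_role(text: str) -> str:
    """Return the first knowledge bucket that contains ``text``."""
    lower = text.lower()
    if len(lower) == 4 and lower.isdecimal() and 1900 <= int(lower) <= 2099:
        return "YEAR"
    if lower in RESOLUTIONS:
        return "RESOLUTION"
    if lower in SOURCES:
        return "SOURCE"
    if lower in CODECS:
        return "CODEC"
    return "UNKNOWN"


def toy_annotate(tokens: list[ToyToken]) -> list[ToyToken]:
    """Stage 2: tag known tokens, then promote the leading UNKNOWN run to TITLE."""
    tagged = [replace(t, role=_lookup_role(t.text)) for t in tokens]
    for i, tok in enumerate(tagged):
        if tok.role != "UNKNOWN":
            break
        tagged[i] = replace(tok, role="TITLE")
    return tagged


def toy_assemble(tokens: list[ToyToken]) -> dict:
    """Stage 3: fold role-tagged tokens into a flat dict."""
    by_role: dict[str, str] = {}
    for tok in tokens:
        if tok.role not in ("TITLE", "UNKNOWN"):
            by_role.setdefault(tok.role, tok.text)  # first occurrence wins
    year = by_role.get("YEAR")
    return {
        "title": ".".join(t.text for t in tokens if t.role == "TITLE"),
        "year": int(year) if year else None,
        "quality": by_role.get("RESOLUTION"),
        "source": by_role.get("SOURCE"),
        "codec": by_role.get("CODEC"),
    }


parsed = toy_assemble(toy_annotate(toy_tokenize("The.Father.2020.1080p.WEBRip.x264")))
print(parsed)
```

The sketch deliberately leaves out group detection, schema walks, and enrichers; it only shows how frozen tokens flow through the three stages and pick up roles along the way.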
@@ -0,0 +1,732 @@
 """Annotate-based pipeline.
 Three stages:
 1. :func:`tokenize` — release name → ``list[Token]`` (all UNKNOWN), plus
   a separately-returned site tag (e.g. ``[YTS.MX]``) that is never
   tokenized.
 2. :func:`annotate` — promote each token's :class:`TokenRole` using the
   injected knowledge base. Two sub-passes:
     a. **Structural** (schema-driven, EASY only). Detects the group at
        the right end, looks up its :class:`GroupSchema`, then matches
        the schema's chunk sequence against the token stream. Between
        two structural chunks, any number of unmatched tokens may
        remain — they are left UNKNOWN for the enricher pass to handle.
     b. **Enrichers** (non-positional). Walks UNKNOWN tokens and tags
        audio / video-meta / edition / language roles. Multi-token
        sequences (``DTS.HD.MA``, ``DV.HDR10``, ``DIRECTORS.CUT``) are
        matched first, single tokens after.
 3. :func:`assemble` — fold annotated tokens into a
   :class:`~alfred.domain.release.value_objects.ParsedRelease`-compatible
   dict.
 The pipeline is **pure**: no I/O, no TMDB, no probe. All knowledge
 arrives through ``kb: ReleaseKnowledge``.
 """
 from __future__ import annotations
 from ..ports.knowledge import ReleaseKnowledge
 from .schema import GroupSchema
 from .tokens import Token, TokenRole
 # ---------------------------------------------------------------------------
 # Stage 1 — tokenize
 # ---------------------------------------------------------------------------
 def strip_site_tag(name: str) -> tuple[str, str | None]:
    """Split off a ``[site.tag]`` prefix or suffix.
    Returns ``(clean_name, tag)``. If no tag is found, returns
    ``(name.strip(), None)``.
    """
    s = name.strip()
    if s.startswith("["):
        close = s.find("]")
        if close != -1:
            tag = s[1:close].strip()
            remainder = s[close + 1 :].strip()
            if tag and remainder:
                return remainder, tag
    if s.endswith("]"):
        open_bracket = s.rfind("[")
        if open_bracket != -1:
            tag = s[open_bracket + 1 : -1].strip()
            remainder = s[:open_bracket].strip()
            if tag and remainder:
                return remainder, tag
    return s, None
 def tokenize(name: str, kb: ReleaseKnowledge) -> tuple[list[Token], str | None]:
    """Split ``name`` into tokens after stripping any site tag.
    String-ops style: replace every configured separator with a single
    NUL byte then split. NUL cannot legally appear in a release name, so
    it's a safe sentinel.
    """
    clean, site_tag = strip_site_tag(name)
    DELIM = "\x00"
    buf = clean
    for sep in kb.separators:
        if sep != DELIM:
            buf = buf.replace(sep, DELIM)
    pieces = [p for p in buf.split(DELIM) if p]
    tokens = [Token(text=p, index=i) for i, p in enumerate(pieces)]
    return tokens, site_tag
 # ---------------------------------------------------------------------------
 # Helpers shared across passes
 # ---------------------------------------------------------------------------
 def _parse_season_episode(text: str) -> tuple[int, int | None, int | None] | None:
    """Parse a single token as ``SxxExx`` / ``SxxExxExx`` / ``Sxx`` / ``NxNN``.
    Returns ``(season, episode, episode_end)`` or ``None`` if the token
    is not a season/episode marker.
    """
    upper = text.upper()
    # SxxExx form
    if len(upper) >= 3 and upper[0] == "S" and upper[1:3].isdigit():
        season = int(upper[1:3])
        rest = upper[3:]
        if not rest:
            return season, None, None
        episodes: list[int] = []
        while rest.startswith("E") and len(rest) >= 3 and rest[1:3].isdigit():
            episodes.append(int(rest[1:3]))
            rest = rest[3:]
        if not episodes:
            return None
        return season, episodes[0], episodes[1] if len(episodes) >= 2 else None
    # NxNN form
    if "X" in upper:
        parts = upper.split("X")
        if len(parts) >= 2 and all(p.isdigit() and p for p in parts):
            season = int(parts[0])
            episode = int(parts[1])
            episode_end = int(parts[2]) if len(parts) >= 3 else None
            return season, episode, episode_end
    return None
 def _is_year(text: str) -> bool:
    """Return True if ``text`` is a 4-digit year in [1900, 2099]."""
    return len(text) == 4 and text.isdigit() and 1900 <= int(text) <= 2099
 def _split_codec_group(text: str, kb: ReleaseKnowledge) -> tuple[str, str] | None:
    """Split a ``codec-GROUP`` token into ``(codec, group)`` if it fits.
    Returns ``None`` if the token doesn't match the ``codec-GROUP``
    shape. Handles the empty-group case (``x265-``) as ``(codec, "")``.
    """
    if "-" not in text:
        return None
    head, _, tail = text.rpartition("-")
    if head.lower() in kb.codecs:
        return head, tail
    return None
 def _match_role(text: str, role: TokenRole, kb: ReleaseKnowledge) -> TokenRole | None:
    """Return ``role`` if ``text`` matches it under ``kb``, else ``None``."""
    lower = text.lower()
    if role is TokenRole.YEAR:
        return TokenRole.YEAR if _is_year(text) else None
    if role is TokenRole.SEASON_EPISODE:
        return (
            TokenRole.SEASON_EPISODE
            if _parse_season_episode(text) is not None
            else None
        )
    if role is TokenRole.RESOLUTION:
        return TokenRole.RESOLUTION if lower in kb.resolutions else None
    if role is TokenRole.SOURCE:
        return TokenRole.SOURCE if lower in kb.sources else None
    if role is TokenRole.CODEC:
        return TokenRole.CODEC if lower in kb.codecs else None
    return None
 # ---------------------------------------------------------------------------
 # Stage 2a — group detection
 # ---------------------------------------------------------------------------
 def _detect_group(tokens: list[Token], kb: ReleaseKnowledge) -> tuple[str, int | None]:
    """Identify the release group by walking tokens right-to-left.
    Returns ``(group_name, token_index_carrying_group)``. The index is
    ``None`` (and the name ``"UNKNOWN"``) when no dashed token qualifies
    as a group carrier.
    """
    # Priority 1: codec-GROUP shape (clearest signal).
    for tok in reversed(tokens):
        split = _split_codec_group(tok.text, kb)
        if split is not None:
            _, group = split
            return (group or "UNKNOWN"), tok.index
    # Priority 2: rightmost dash, excluding dashed sources (Web-DL, etc.).
    for tok in reversed(tokens):
        if "-" not in tok.text:
            continue
        head, _, tail = tok.text.rpartition("-")
        if (
            head.lower() in kb.sources
            or tok.text.lower().replace("-", "") in kb.sources
        ):
            continue
        if tail:
            return tail, tok.index
    return "UNKNOWN", None
 # ---------------------------------------------------------------------------
 # Stage 2b — structural annotation (schema-driven)
 # ---------------------------------------------------------------------------
 def _annotate_structural(
    tokens: list[Token],
    kb: ReleaseKnowledge,
    schema: GroupSchema,
    group_token_index: int,
 ) -> list[Token] | None:
    """Annotate structural tokens following a known group schema.
    Walks the schema's chunks against the body (tokens up to the group
    token). For each chunk, scans forward in the body for a matching
    token — tokens passed over without match are left UNKNOWN (the
    enricher pass will handle them).
    Returns ``None`` if any mandatory chunk fails to find a match.
    """
    result = list(tokens)
    # The codec-GROUP token carries CODEC + GROUP. Split it now so the
    # schema walk knows the codec is "pre-consumed" at the end.
    group_token = result[group_token_index]
    cg_split = _split_codec_group(group_token.text, kb)
    codec_pre_consumed = False
    if cg_split is not None:
        codec, group = cg_split
        result[group_token_index] = group_token.with_role(
            TokenRole.CODEC, codec=codec, group=group or "UNKNOWN"
        )
        codec_pre_consumed = True
    else:
        head, _, tail = group_token.text.rpartition("-")
        result[group_token_index] = group_token.with_role(
            TokenRole.GROUP, group=tail or "UNKNOWN", prefix=head
        )
    body_end = group_token_index  # exclusive
    tok_idx = 0
    chunk_idx = 0
    # 1) TITLE — leftmost contiguous tokens up to the first structural
    #    boundary. Title is special because it can be multi-token.
    while (
        chunk_idx < len(schema.chunks)
        and schema.chunks[chunk_idx].role is TokenRole.TITLE
    ):
        title_end = _find_title_end(result, body_end, kb)
        for i in range(tok_idx, title_end):
            result[i] = result[i].with_role(TokenRole.TITLE)
        tok_idx = title_end
        chunk_idx += 1
    # 2) Remaining structural chunks. For each, scan forward in the body
    #    for a matching token; tokens passed over remain UNKNOWN.
    for chunk in schema.chunks[chunk_idx:]:
        if chunk.role is TokenRole.GROUP:
            continue
        if chunk.role is TokenRole.CODEC and codec_pre_consumed:
            continue
        match_idx = _find_chunk(result, tok_idx, body_end, chunk.role, kb)
        if match_idx is None:
            if chunk.optional:
                continue
            return None
        result[match_idx] = result[match_idx].with_role(chunk.role)
        tok_idx = match_idx + 1
    return result
 def _find_title_end(
    tokens: list[Token], body_end: int, kb: ReleaseKnowledge
 ) -> int:
    """Return the exclusive index where the title ends.
    The title is the leftmost run of tokens whose text does not match
    any structural role (year, season/episode, resolution, source,
    codec). Enricher tokens (audio, HDR, language) are *not* boundaries
    because they can appear in the middle of the structural sequence;
    however, in canonical scene names they don't appear inside the title
    itself, so this heuristic holds in practice.
    """
    for i in range(body_end):
        text = tokens[i].text
        if _parse_season_episode(text) is not None:
            return i
        if _is_year(text):
            return i
        lower = text.lower()
        if lower in kb.resolutions:
            return i
        if lower in kb.sources:
            return i
        if lower in kb.codecs:
            return i
        # codec-GROUP token (e.g. "x265-KONTRAST") or dashed source (Web-DL).
        if "-" in text:
            head, _, _ = text.rpartition("-")
            if (
                head.lower() in kb.codecs
                or head.lower() in kb.sources
                or text.lower().replace("-", "") in kb.sources
            ):
                return i
    return body_end
 def _find_chunk(
    tokens: list[Token],
    start: int,
    end: int,
    role: TokenRole,
    kb: ReleaseKnowledge,
 ) -> int | None:
    """Return the first index in ``[start, end)`` whose token matches ``role``.
    Returns ``None`` if no token in the range matches. Tokens already
    annotated (non-UNKNOWN) are skipped — they belong to another chunk.
    """
    for i in range(start, end):
        if tokens[i].role is not TokenRole.UNKNOWN:
            continue
        if _match_role(tokens[i].text, role, kb) is not None:
            return i
    return None
 # ---------------------------------------------------------------------------
 # Stage 2b' — SHITTY annotation (schema-less heuristic)
 # ---------------------------------------------------------------------------
 def _annotate_shitty(
    tokens: list[Token],
    kb: ReleaseKnowledge,
    group_index: int | None,
 ) -> list[Token]:
    """Schema-less, dictionary-driven annotation.
    SHITTY's job is narrow: for releases that *look* like scene names
    but don't have a registered group schema, tag every token whose text
    falls into a known YAML bucket (resolutions, codecs, sources, …).
    Anything we can't classify stays UNKNOWN. The leftmost run of
    UNKNOWN tokens becomes the title. Done.
    Anything that requires more reasoning (parenthesized tech blocks,
    bare-dashed title fragments, year-disguised slug suffixes, …) is
    PATH OF PAIN territory and stays out of here on purpose.
    """
    result = list(tokens)
    # 1) Group token — split codec-GROUP or tag GROUP. Same logic as EASY.
    if group_index is not None:
        gt = result[group_index]
        cg_split = _split_codec_group(gt.text, kb)
        if cg_split is not None:
            codec, group = cg_split
            result[group_index] = gt.with_role(
                TokenRole.CODEC, codec=codec, group=group or "UNKNOWN"
            )
        else:
            _, _, tail = gt.text.rpartition("-")
            result[group_index] = gt.with_role(
                TokenRole.GROUP, group=tail or "UNKNOWN"
            )
    # 2) Enrichers (audio / video-meta / edition / language).
    result = _annotate_enrichers(result, kb)
    # 3) Single pass: tag each UNKNOWN token by looking it up in the kb
    #    buckets. First match wins per token, first occurrence wins per
    #    role (we don't overwrite an already-tagged role).
    matchers: list[tuple[TokenRole, callable]] = [
        (TokenRole.SEASON_EPISODE, lambda t: _parse_season_episode(t) is not None),
        (TokenRole.YEAR, _is_year),
        (TokenRole.RESOLUTION, lambda t: t.lower() in kb.resolutions),
        (TokenRole.DISTRIBUTOR, lambda t: t.upper() in kb.distributors),
        (TokenRole.SOURCE, lambda t: t.lower() in kb.sources),
        (TokenRole.CODEC, lambda t: t.lower() in kb.codecs),
    ]
    seen: set[TokenRole] = set()
    for i, tok in enumerate(result):
        if tok.role is not TokenRole.UNKNOWN:
            continue
        for role, matches in matchers:
            if role in seen:
                continue
            if matches(tok.text):
                result[i] = tok.with_role(role)
                seen.add(role)
                break
    # 4) Title = leftmost contiguous UNKNOWN tokens.
    for i, tok in enumerate(result):
        if tok.role is not TokenRole.UNKNOWN:
            break
        result[i] = tok.with_role(TokenRole.TITLE)
    return result
 # ---------------------------------------------------------------------------
 # Stage 2c — enricher pass (non-positional roles)
 # ---------------------------------------------------------------------------
 def _annotate_enrichers(tokens: list[Token], kb: ReleaseKnowledge) -> list[Token]:
    """Tag the remaining UNKNOWN tokens with non-positional roles.
    Multi-token sequences are matched first (so ``DTS.HD.MA`` wins over
    a single-token ``DTS``). For each sequence match, the first token
    receives the role + ``extra["sequence"]`` (the canonical joined
    value), and the trailing members are marked with the same role +
    ``extra["sequence_member"]="True"`` (a string marker) so
    :func:`assemble` extracts the value only from the primary.
    """
    result = list(tokens)
    # Multi-token sequences first.
    _apply_sequences(
        result, kb.audio.get("sequences", []), "codec", TokenRole.AUDIO_CODEC
    )
    _apply_sequences(
        result, kb.video_meta.get("sequences", []), "hdr", TokenRole.HDR
    )
    _apply_sequences(
        result, kb.editions.get("sequences", []), "edition", TokenRole.EDITION
    )
    # Single tokens.
    known_audio_codecs = {c.upper() for c in kb.audio.get("codecs", [])}
    known_audio_channels = set(kb.audio.get("channels", []))
    known_hdr = {h.upper() for h in kb.video_meta.get("hdr", [])} | kb.hdr_extra
    known_bit_depth = {d.lower() for d in kb.video_meta.get("bit_depth", [])}
    known_editions = {t.upper() for t in kb.editions.get("tokens", [])}
    # Channel layouts like "5.1" are tokenized as two tokens ("5", "1")
    # because "." is a separator. Detect consecutive pairs whose joined
    # value (without any trailing "-GROUP") is in the channel set.
    _detect_channel_pairs(result, known_audio_channels)
    for i, tok in enumerate(result):
        if tok.role is not TokenRole.UNKNOWN:
            continue
        text = tok.text
        upper = text.upper()
        lower = text.lower()
        if upper in known_audio_codecs:
            result[i] = tok.with_role(TokenRole.AUDIO_CODEC)
            continue
        if text in known_audio_channels:
            result[i] = tok.with_role(TokenRole.AUDIO_CHANNELS)
            continue
        if upper in known_hdr:
            result[i] = tok.with_role(TokenRole.HDR)
            continue
        if lower in known_bit_depth:
            result[i] = tok.with_role(TokenRole.BIT_DEPTH)
            continue
        if upper in known_editions:
            result[i] = tok.with_role(TokenRole.EDITION)
            continue
        if upper in kb.language_tokens:
            result[i] = tok.with_role(TokenRole.LANGUAGE)
            continue
        if upper in kb.distributors:
            result[i] = tok.with_role(TokenRole.DISTRIBUTOR)
            continue
    return result
 def _apply_sequences(
    tokens: list[Token],
    sequences: list[dict],
    value_key: str,
    role: TokenRole,
 ) -> None:
    """Mark every non-overlapping occurrence of each sequence in place.
    Mutates ``tokens`` (replacing entries with new role-tagged Token
    instances). Sequences in the YAML must be ordered most-specific
    first; the first match wins per starting position.
    """
    if not sequences:
        return
    upper_texts = [t.text.upper() for t in tokens]
    consumed: set[int] = set()
    for seq in sequences:
        seq_upper = [s.upper() for s in seq["tokens"]]
        n = len(seq_upper)
        for start in range(len(tokens) - n + 1):
            if any(idx in consumed for idx in range(start, start + n)):
                continue
            if any(
                tokens[start + k].role is not TokenRole.UNKNOWN for k in range(n)
            ):
                continue
            if upper_texts[start : start + n] == seq_upper:
                tokens[start] = tokens[start].with_role(
                    role, sequence=seq[value_key]
                )
                for k in range(1, n):
                    tokens[start + k] = tokens[start + k].with_role(
                        role, sequence_member="True"
                    )
                consumed.update(range(start, start + n))
 def _detect_channel_pairs(
    tokens: list[Token], known_channels: set[str]
 ) -> None:
    """Spot two consecutive numeric tokens that form a channel layout.
    Example: ``["5", "1-KTH"]`` → joined ``"5.1"`` (after stripping the
    ``-GROUP`` suffix on the second). The second token may be the trailing
    codec-GROUP token, in which case it is already tagged CODEC and its
    role is left untouched; only the first token is tagged.
    """
    for i in range(len(tokens) - 1):
        first = tokens[i]
        second = tokens[i + 1]
        if first.role is not TokenRole.UNKNOWN:
            continue
        # Strip a "-GROUP" suffix on the second token before joining.
        second_text = second.text.split("-")[0]
        candidate = f"{first.text}.{second_text}"
        if candidate not in known_channels:
            continue
        # Only tag the first token (carries the channel value). The
        # second token may legitimately remain UNKNOWN (or be the
        # codec-GROUP token, already tagged CODEC).
        tokens[i] = first.with_role(
            TokenRole.AUDIO_CHANNELS, sequence=candidate
        )
        if second.role is TokenRole.UNKNOWN:
            tokens[i + 1] = second.with_role(
                TokenRole.AUDIO_CHANNELS, sequence_member="True"
            )
 # ---------------------------------------------------------------------------
 # Stage 2 entry point
 # ---------------------------------------------------------------------------
 def annotate(tokens: list[Token], kb: ReleaseKnowledge) -> list[Token]:
    """Annotate token roles.
    Dispatch:
    * If a group is detected AND has a known schema, run the EASY
      structural walk. If the schema walk aborts on a mandatory chunk
      mismatch, fall through to SHITTY (the heuristic still does better
      than giving up).
    * Otherwise run SHITTY — schema-less, best-effort, never aborts.
    The enricher pass runs in both cases. The pipeline always returns a
    populated token list; downstream callers don't need to distinguish
    EASY vs SHITTY at this layer (the parse_path is decided in the
    service based on whether a schema matched).
    """
    group_name, group_index = _detect_group(tokens, kb)
    schema = kb.group_schema(group_name) if group_index is not None else None
    if schema is not None and group_index is not None:
        structural = _annotate_structural(tokens, kb, schema, group_index)
        if structural is not None:
            return _annotate_enrichers(structural, kb)
    # SHITTY fallback — heuristic positional pass. ``_annotate_shitty``
    # runs its own enricher pass internally (it has to, so the title
    # scan can skip enricher-tagged tokens).
    return _annotate_shitty(tokens, kb, group_index)
 def has_known_schema(tokens: list[Token], kb: ReleaseKnowledge) -> bool:
    """Return True if ``tokens`` would take the EASY path in :func:`annotate`."""
    group_name, group_index = _detect_group(tokens, kb)
    if group_index is None:
        return False
    return kb.group_schema(group_name) is not None
 # ---------------------------------------------------------------------------
 # Stage 3 — assemble
 # ---------------------------------------------------------------------------
 def assemble(
    annotated: list[Token],
    site_tag: str | None,
    raw_name: str,
    kb: ReleaseKnowledge,
 ) -> dict:
    """Fold annotated tokens into a ``ParsedRelease``-compatible dict.
    Returns a dict (not a ``ParsedRelease`` instance) so the caller can
    layer in additional fields (``parse_path``, ``raw``, …) before
    instantiation.
    """
    title_parts = [t.text for t in annotated if t.role is TokenRole.TITLE]
    title = ".".join(title_parts) if title_parts else (
        annotated[0].text if annotated else raw_name
    )
    year: int | None = None
    season: int | None = None
    episode: int | None = None
    episode_end: int | None = None
    quality: str | None = None
    source: str | None = None
    codec: str | None = None
    group = "UNKNOWN"
    audio_codec: str | None = None
    audio_channels: str | None = None
    bit_depth: str | None = None
    hdr_format: str | None = None
    edition: str | None = None
    distributor: str | None = None
    languages: list[str] = []
    for tok in annotated:
        # Skip non-primary members of a multi-token sequence.
        if tok.extra.get("sequence_member") == "True":
            continue
        role = tok.role
        if role is TokenRole.YEAR:
            year = int(tok.text)
        elif role is TokenRole.SEASON_EPISODE:
            parsed = _parse_season_episode(tok.text)
            if parsed is not None:
                season, episode, episode_end = parsed
        elif role is TokenRole.RESOLUTION:
            quality = tok.text
        elif role is TokenRole.SOURCE:
            source = tok.text
        elif role is TokenRole.CODEC:
            codec = tok.extra.get("codec", tok.text)
            if "group" in tok.extra:
                group = tok.extra["group"] or "UNKNOWN"
        elif role is TokenRole.GROUP:
            group = tok.extra.get("group", tok.text) or "UNKNOWN"
        elif role is TokenRole.AUDIO_CODEC:
            if audio_codec is None:
                audio_codec = tok.extra.get("sequence", tok.text)
        elif role is TokenRole.AUDIO_CHANNELS:
            if audio_channels is None:
                audio_channels = tok.extra.get("sequence", tok.text)
        elif role is TokenRole.BIT_DEPTH:
            if bit_depth is None:
                bit_depth = tok.text.lower()
        elif role is TokenRole.HDR:
            if hdr_format is None:
                hdr_format = tok.extra.get("sequence", tok.text.upper())
        elif role is TokenRole.EDITION:
            if edition is None:
                edition = tok.extra.get("sequence", tok.text.upper())
        elif role is TokenRole.LANGUAGE:
            languages.append(tok.text.upper())
        elif role is TokenRole.DISTRIBUTOR:
            if distributor is None:
                distributor = tok.text.upper()
    tech_parts = [p for p in (quality, source, codec) if p]
    tech_string = ".".join(tech_parts)
    # Media type heuristic. Doc/concert/integrale tokens win over the
    # generic tech-based fallback. We look across all tokens (not just
    # annotated ones) because these markers may be tagged UNKNOWN by the
    # structural pass — only the assemble step cares about them.
    upper_tokens = {tok.text.upper() for tok in annotated}
    doc_tokens = {t.upper() for t in kb.media_type_tokens.get("doc", [])}
    concert_tokens = {t.upper() for t in kb.media_type_tokens.get("concert", [])}
    integrale_tokens = {t.upper() for t in kb.media_type_tokens.get("integrale", [])}
    if upper_tokens & doc_tokens:
        media_type = "documentary"
    elif upper_tokens & concert_tokens:
        media_type = "concert"
    elif (
        edition in {"COMPLETE", "INTEGRALE", "COLLECTION"}
        or upper_tokens & integrale_tokens
    ) and season is None:
        media_type = "tv_complete"
    elif season is not None:
        media_type = "tv_show"
    elif any((quality, source, codec, year)):
        media_type = "movie"
    else:
        media_type = "unknown"
    return {
        "title": title,
        "title_sanitized": kb.sanitize_for_fs(title),
        "year": year,
        "season": season,
        "episode": episode,
        "episode_end": episode_end,
        "quality": quality,
        "source": source,
        "codec": codec,
        "group": group,
        "tech_string": tech_string,
        "media_type": media_type,
        "site_tag": site_tag,
        "languages": languages,
        "audio_codec": audio_codec,
        "audio_channels": audio_channels,
        "bit_depth": bit_depth,
        "hdr_format": hdr_format,
        "edition": edition,
        "distributor": distributor,
    }
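The precedence in the media-type heuristic above is easy to misread, so here is a minimal standalone sketch of the same chain. The name `infer_media_type` and its keyword parameters are illustrative assumptions, not part of the pipeline; the marker sets are passed in directly instead of being read from `kb.media_type_tokens`.

```python
from typing import Optional


def infer_media_type(
    upper_tokens: set,
    *,
    season: Optional[int] = None,
    edition: Optional[str] = None,
    has_tech_or_year: bool = False,
    doc: frozenset = frozenset(),
    concert: frozenset = frozenset(),
    integrale: frozenset = frozenset(),
) -> str:
    """Hypothetical standalone copy of the precedence chain (first match wins)."""
    if upper_tokens & doc:
        return "documentary"
    if upper_tokens & concert:
        return "concert"
    complete = edition in {"COMPLETE", "INTEGRALE", "COLLECTION"}
    if (complete or upper_tokens & integrale) and season is None:
        return "tv_complete"
    if season is not None:
        return "tv_show"
    if has_tech_or_year:
        return "movie"
    return "unknown"
```

Under these assumptions a DOC marker outranks a season token, and a COMPLETE edition only yields `tv_complete` when no season was parsed.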
@@ -0,0 +1,47 @@
 """Group schema value objects.
 A :class:`GroupSchema` describes the canonical chunk layout of releases
 from a known group (KONTRAST, RARBG, ELiTE, …). It is the EASY-road
 contract: when a release ends in ``-<GROUP>`` and we know the group,
 the annotator walks the schema instead of running the heuristic SHITTY
 matchers.
 Schemas are loaded from ``knowledge/release/release_groups/<group>.yaml``
 by an infrastructure adapter and surfaced via the
 :class:`~alfred.domain.release.ports.knowledge.ReleaseKnowledge` port.
 """
 from __future__ import annotations
 from dataclasses import dataclass
 from .tokens import TokenRole
@dataclass(frozen=True)
 class SchemaChunk:
    """One entry in a group's chunk order.
    ``role`` is the :class:`TokenRole` the chunk maps to. ``optional``
    is True for chunks that may be absent (e.g. ``year`` on TV releases,
    ``source`` on bare ELiTE TV releases).
    """
    role: TokenRole
    optional: bool = False
@dataclass(frozen=True)
 class GroupSchema:
    """Schema for a known release group.
    ``chunks`` is the left-to-right canonical order. The annotator walks
    tokens and chunks in lockstep: an optional chunk that doesn't match
    the current token is skipped (the chunk index advances, the token
    index stays), a mandatory chunk that doesn't match aborts the EASY
    path and falls back to SHITTY.
    """
    name: str
    separator: str
    chunks: tuple[SchemaChunk, ...]
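The lockstep walk described in the `GroupSchema` docstring can be sketched as follows. This is a simplified illustration, not the annotator itself: `Chunk`, `matches` and `walk` are hypothetical names, roles are plain strings, the title chunk is left out (the real schema special-cases it), and leftover tokens are simply ignored.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Chunk:  # hypothetical stand-in for SchemaChunk, role as a plain string
    role: str
    optional: bool = False


def matches(role: str, tok: str) -> bool:
    """Toy per-role matchers; the real ones consult the knowledge base."""
    if role == "year":
        return len(tok) == 4 and tok.isdigit()
    if role == "season_episode":
        return tok[:1].upper() == "S" and tok[1:3].isdigit()
    if role == "resolution":
        return tok.lower() in {"720p", "1080p", "2160p"}
    if role == "source":
        return tok.lower() in {"webrip", "bluray"}
    if role == "codec":
        return tok.lower() in {"x264", "x265"}
    return False


def walk(tokens: list, chunks: tuple) -> Optional[list]:
    """Return [(token, role), ...], or None when a mandatory chunk fails."""
    out = []
    ti = 0
    for chunk in chunks:
        if ti < len(tokens) and matches(chunk.role, tokens[ti]):
            out.append((tokens[ti], chunk.role))
            ti += 1  # token consumed, both indices advance
        elif chunk.optional:
            continue  # chunk index advances, token index stays
        else:
            return None  # mandatory mismatch: caller falls back to SHITTY
    return out


ELITE_LIKE = (
    Chunk("year", optional=True),
    Chunk("season_episode", optional=True),
    Chunk("resolution"),
    Chunk("source", optional=True),
    Chunk("codec"),
)
```

With this layout, `walk(["S02", "1080p", "x265"], ELITE_LIKE)` skips the optional `year` and `source` chunks, while a token list that lacks a resolution returns `None`.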
@@ -0,0 +1,90 @@
 """Token value objects for the annotate-based parser.
 A :class:`Token` carries both the original substring and its position in
 the original release name's token stream. A :class:`TokenRole` is the
 semantic tag assigned by the annotator.
 Why VOs instead of bare ``str``: the annotate step needs to flag tokens
 without consuming them (a token may carry residual info — e.g. a
 ``codec-GROUP`` token contributes both a CODEC and a GROUP role). Tracking
 the index also lets later stages reason about *order* (year must come
 after title, group must be rightmost, etc.) without re-scanning the list.
 """
 from __future__ import annotations
 from dataclasses import dataclass, field
 from enum import Enum
 class TokenRole(str, Enum):
    """Semantic role a token can take after annotation.
    A token starts as ``UNKNOWN`` and may be promoted by the annotator.
    ``str``-backed for cheap comparisons and YAML/JSON interop.
    Roles split into three families:
    - **structural**: TITLE / YEAR / SEASON_EPISODE / GROUP — drive folder
      and filename naming.
    - **technical**: RESOLUTION / SOURCE / CODEC / AUDIO_CODEC /
      AUDIO_CHANNELS / BIT_DEPTH / HDR / EDITION / LANGUAGE /
      DISTRIBUTOR; these feed ``tech_string`` and metadata fields.
    - **meta**: SITE_TAG (stripped pre-tokenize) and UNKNOWN (residual,
      contributes to the SHITTY score penalty).
    """
    UNKNOWN = "unknown"
    # Structural
    TITLE = "title"
    YEAR = "year"
    SEASON_EPISODE = "season_episode"
    GROUP = "group"
    # Technical
    RESOLUTION = "resolution"
    SOURCE = "source"
    CODEC = "codec"
    AUDIO_CODEC = "audio_codec"
    AUDIO_CHANNELS = "audio_channels"
    BIT_DEPTH = "bit_depth"
    HDR = "hdr"
    EDITION = "edition"
    LANGUAGE = "language"
    DISTRIBUTOR = "distributor"
    # Meta
    SITE_TAG = "site_tag"
@dataclass(frozen=True)
 class Token:
    """An atomic token from a release name.
    ``text`` is the substring exactly as it appeared after tokenization
    (case preserved — uppercase comparisons happen at match time).
    ``index`` is the 0-based position in the tokenized stream, used by
    downstream stages to enforce ordering invariants.
    ``role`` defaults to :attr:`TokenRole.UNKNOWN`. The annotator returns
    new :class:`Token` instances with the role set rather than mutating
    (the dataclass is frozen). ``extra`` carries role-specific payload
    when the token text alone isn't enough (e.g. a ``codec-GROUP`` token
    annotated as CODEC may record the group name in ``extra["group"]``).
    """
    text: str
    index: int
    role: TokenRole = TokenRole.UNKNOWN
    extra: dict[str, str] = field(default_factory=dict)
    def with_role(self, role: TokenRole, **extra: str) -> Token:
        """Return a copy of this token with ``role`` (and optional ``extra``)."""
        merged = {**self.extra, **extra} if extra else self.extra
        return Token(text=self.text, index=self.index, role=role, extra=merged)
    @property
    def is_annotated(self) -> bool:
        return self.role is not TokenRole.UNKNOWN
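A short usage sketch of the copy-on-annotate behavior. `Role` and `Tok` below are trimmed, hypothetical stand-ins that mirror the fields of `TokenRole` and `Token` so the snippet runs on its own; they are not the project's classes.

```python
from dataclasses import FrozenInstanceError, dataclass, field
from enum import Enum


class Role(str, Enum):  # hypothetical, trimmed stand-in for TokenRole
    UNKNOWN = "unknown"
    CODEC = "codec"


@dataclass(frozen=True)
class Tok:  # hypothetical stand-in mirroring Token's fields
    text: str
    index: int
    role: Role = Role.UNKNOWN
    extra: dict = field(default_factory=dict)

    def with_role(self, role: Role, **extra: str) -> "Tok":
        # Merge payloads into a new dict; the original token is untouched.
        merged = {**self.extra, **extra} if extra else self.extra
        return Tok(text=self.text, index=self.index, role=role, extra=merged)


raw = Tok(text="x265-KONTRAST", index=6)
tagged = raw.with_role(Role.CODEC, group="KONTRAST")

try:
    raw.role = Role.CODEC  # frozen dataclass: attribute assignment is rejected
    mutated = True
except FrozenInstanceError:
    mutated = False
```

The annotated copy keeps `text` and `index`, so ordering checks in later stages still work on the promoted token.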
@@ -10,7 +10,10 @@ object that satisfies this shape (e.g. a simple dataclass).
 from __future__ import annotations
-from typing import Protocol
+from typing import TYPE_CHECKING, Protocol
 if TYPE_CHECKING:
    from ..parser.schema import GroupSchema
 class ReleaseKnowledge(Protocol):
@@ -21,6 +24,7 @@ class ReleaseKnowledge(Protocol):
    resolutions: set[str]
    sources: set[str]
    codecs: set[str]
    distributors: set[str]
    language_tokens: set[str]
    forbidden_chars: set[str]
    hdr_extra: set[str]
@@ -50,3 +54,14 @@ class ReleaseKnowledge(Protocol):
    def sanitize_for_fs(self, text: str) -> str:
        """Strip filesystem-forbidden characters from ``text``."""
        ...
    # --- Release group schemas (EASY path) ---
    def group_schema(self, name: str) -> GroupSchema | None:
        """Return the parsing schema for the named release group, or
        ``None`` if the group is unknown (caller falls back to SHITTY).
        Lookup is case-insensitive: ``"KONTRAST"``, ``"kontrast"`` and
        ``"Kontrast"`` all resolve to the same schema.
        """
        ...
@@ -1,36 +1,43 @@
-"""Release domain — parsing service."""
+"""Release domain — parsing service.
 Thin orchestrator over the annotate-based pipeline in
 :mod:`alfred.domain.release.parser.pipeline`. Responsibilities:
 * Strip a leading/trailing ``[site.tag]`` and decide ``parse_path``.
 * Reject malformed names (forbidden characters) → ``parse_path=AI`` so
  the LLM can clean them up.
 * Otherwise call the v2 pipeline (tokenize → annotate → assemble) and
  wrap the result in :class:`ParsedRelease`.
 All structural and enricher logic now lives in the pipeline. This file
 no longer carries field extractors — the heuristic SHITTY path is part
 of :func:`~alfred.domain.release.parser.pipeline.annotate`.
 """
 from __future__ import annotations
-import re
+from .parser import pipeline as _v2
 from .ports import ReleaseKnowledge
 from .value_objects import MediaTypeToken, ParsedRelease, ParsePath
-def _tokenize(name: str, kb: ReleaseKnowledge) -> list[str]:
-    """Split a release name on the configured separators, dropping empty tokens."""
-    pattern = "[" + re.escape("".join(kb.separators)) + "]+"
-    return [t for t in re.split(pattern, name) if t]
 def parse_release(name: str, kb: ReleaseKnowledge) -> ParsedRelease:
-    """
-    Parse a release name and return a ParsedRelease.
+    """Parse a release name and return a :class:`ParsedRelease`.
    Flow:
-      1. Strip a leading/trailing [site.tag] if present (sets parse_path="sanitized").
-      2. Check the remainder for truly forbidden chars (anything not in the
-         configured separators list). If any remain → media_type="unknown",
-         parse_path="ai", and the LLM handles it.
-      3. Tokenize using the configured separators (".", " ", "[", "]", "(", ")", "_", ...)
-         and run token-level matchers (season/episode, tech, languages, audio,
-         video, edition, title, year).
+
+    1. Strip a leading/trailing ``[site.tag]`` if present (sets
+       ``parse_path="sanitized"``).
+    2. If the remainder still contains truly forbidden chars (anything
+       not in the configured separators), short-circuit to
+       ``media_type="unknown"`` / ``parse_path="ai"`` — the LLM handles
+       these.
+    3. Otherwise run the v2 pipeline: tokenize → annotate (EASY when a
+       group schema is known, SHITTY otherwise) → assemble.
    """
    parse_path = ParsePath.DIRECT.value
-    # Always try to extract a bracket-enclosed site tag first.
-    clean, site_tag = _strip_site_tag(name)
+    clean, site_tag = _v2.strip_site_tag(name)
    if site_tag is not None:
        parse_path = ParsePath.SANITIZED.value
@@ -54,453 +61,26 @@ def parse_release(name: str, kb: ReleaseKnowledge) -> ParsedRelease:
            parse_path=ParsePath.AI.value,
        )
-    name = clean
-    tokens = _tokenize(name, kb)
-
-    season, episode, episode_end = _extract_season_episode(tokens)
-    quality, source, codec, group, tech_tokens = _extract_tech(tokens, kb)
-    languages, lang_tokens = _extract_languages(tokens, kb)
-    audio_codec, audio_channels, audio_tokens = _extract_audio(tokens, kb)
-    bit_depth, hdr_format, video_tokens = _extract_video_meta(tokens, kb)
-    edition, edition_tokens = _extract_edition(tokens, kb)
-    title = _extract_title(
-        tokens,
-        tech_tokens | lang_tokens | audio_tokens | video_tokens | edition_tokens,
-        kb,
-    )
-    year = _extract_year(tokens, title)
-    media_type = _infer_media_type(
-        season, quality, source, codec, year, edition, tokens, kb
-    )
-    tech_parts = [p for p in [quality, source, codec] if p]
-    tech_string = ".".join(tech_parts)
+    tokens, v2_tag = _v2.tokenize(name, kb)
+    annotated = _v2.annotate(tokens, kb)
+    fields = _v2.assemble(annotated, v2_tag, name, kb)
    return ParsedRelease(
        raw=name,
-        normalised=name,
+        normalised=clean,
-        title=title,
-        title_sanitized=kb.sanitize_for_fs(title),
-        year=year,
-        season=season,
-        episode=episode,
-        episode_end=episode_end,
-        quality=quality,
-        source=source,
-        codec=codec,
-        group=group,
-        tech_string=tech_string,
-        media_type=media_type,
-        site_tag=site_tag,
        parse_path=parse_path,
-        languages=languages,
-        audio_codec=audio_codec,
-        audio_channels=audio_channels,
-        bit_depth=bit_depth,
-        hdr_format=hdr_format,
-        edition=edition,
+        **fields,
    )
-def _infer_media_type(
-    season: int | None,
-    quality: str | None,
-    source: str | None,
-    codec: str | None,
-    year: int | None,
-    edition: str | None,
-    tokens: list[str],
-    kb: ReleaseKnowledge,
-) -> str:
-    """
-    Infer media_type from token-level evidence only (no filesystem access).
-    - documentary  : DOC token present
-    - concert      : CONCERT token present
-    - tv_complete  : INTEGRALE/COMPLETE token, no season
-    - tv_show      : season token found
-    - movie        : no season, at least one tech marker
-    - unknown      : no conclusive evidence
-    """
-    upper_tokens = {t.upper() for t in tokens}
-    doc_tokens = {t.upper() for t in kb.media_type_tokens.get("doc", [])}
-    concert_tokens = {t.upper() for t in kb.media_type_tokens.get("concert", [])}
-    integrale_tokens = {t.upper() for t in kb.media_type_tokens.get("integrale", [])}
-    if upper_tokens & doc_tokens:
-        return MediaTypeToken.DOCUMENTARY.value
-    if upper_tokens & concert_tokens:
-        return MediaTypeToken.CONCERT.value
-    if (
-        edition in {"COMPLETE", "INTEGRALE", "COLLECTION"}
-        or upper_tokens & integrale_tokens
-    ) and season is None:
-        return MediaTypeToken.TV_COMPLETE.value
-    if season is not None:
-        return MediaTypeToken.TV_SHOW.value
-    if any([quality, source, codec, year]):
-        return MediaTypeToken.MOVIE.value
-    return MediaTypeToken.UNKNOWN.value
 def _is_well_formed(name: str, kb: ReleaseKnowledge) -> bool:
-    """Return True if name contains no forbidden characters per scene naming rules.
-    Characters listed as token separators (spaces, brackets, parens, …) are NOT
-    considered malforming — the tokenizer handles them. Only truly broken chars
-    like '@', '#', '!', '%' make a name malformed.
+    """Return True if ``name`` contains no forbidden characters per scene
+    naming rules.
+    Characters listed as token separators (spaces, brackets, parens, …)
+    are NOT considered malforming — the tokenizer handles them. Only
+    truly broken chars like ``@``, ``#``, ``!``, ``%`` make a name
+    malformed.
    """
    tokenizable = set(kb.separators)
    return not any(c in name for c in kb.forbidden_chars if c not in tokenizable)
-def _strip_site_tag(name: str) -> tuple[str, str | None]:
-    """
-    Strip a site watermark tag from the release name and return (clean_name, tag).
-    Handles two positions:
-    - Prefix:  "[ OxTorrent.vc ] The.Title.S01..."
-    - Suffix:  "The.Title.S01...-NTb[TGx]"
-    Anything between [...] is treated as a site tag.
-    Returns (original_name, None) if no tag found.
-    """
-    s = name.strip()
-    if s.startswith("["):
-        close = s.find("]")
-        if close != -1:
-            tag = s[1:close].strip()
-            remainder = s[close + 1 :].strip()
-            if tag and remainder:
-                return remainder, tag
-    if s.endswith("]"):
-        open_bracket = s.rfind("[")
-        if open_bracket != -1:
-            tag = s[open_bracket + 1 : -1].strip()
-            remainder = s[:open_bracket].strip()
-            if tag and remainder:
-                return remainder, tag
-    return s, None
-def _parse_season_episode(tok: str) -> tuple[int, int | None, int | None] | None:
-    """
-    Parse a single token as a season/episode marker.
-    Handles:
-      - SxxExx / SxxExxExx / Sxx        (canonical scene form)
-      - NxNN / NxNNxNN                  (alt form: 1x05, 12x07x08)
-    Returns (season, episode, episode_end) or None if not a season token.
-    """
-    upper = tok.upper()
-    # SxxExx form
-    if len(upper) >= 3 and upper[0] == "S" and upper[1:3].isdigit():
-        season = int(upper[1:3])
-        rest = upper[3:]
-        if not rest:
-            return season, None, None
-        episodes: list[int] = []
-        while rest.startswith("E") and len(rest) >= 3 and rest[1:3].isdigit():
-            episodes.append(int(rest[1:3]))
-            rest = rest[3:]
-        if not episodes:
-            return None  # malformed token like "S03XYZ"
-        return season, episodes[0], episodes[1] if len(episodes) >= 2 else None
-    # NxNN form — split on "X" (uppercased), all parts must be digits
-    if "X" in upper:
-        parts = upper.split("X")
-        if len(parts) >= 2 and all(p.isdigit() and p for p in parts):
-            season = int(parts[0])
-            episode = int(parts[1])
-            episode_end = int(parts[2]) if len(parts) >= 3 else None
-            return season, episode, episode_end
-    return None
-def _extract_season_episode(
-    tokens: list[str],
-) -> tuple[int | None, int | None, int | None]:
-    for tok in tokens:
-        parsed = _parse_season_episode(tok)
-        if parsed is not None:
-            return parsed
-    return None, None, None
-def _extract_tech(
-    tokens: list[str],
-    kb: ReleaseKnowledge,
-) -> tuple[str | None, str | None, str | None, str, set[str]]:
-    """
-    Extract quality, source, codec, group from tokens.
-    Returns (quality, source, codec, group, tech_token_set).
-    Group extraction strategy (in priority order):
-    1. Token where prefix is a known codec: x265-GROUP
-    2. Rightmost token with a dash that isn't a known source
-    """
-    quality: str | None = None
-    source: str | None = None
-    codec: str | None = None
-    group = "UNKNOWN"
-    tech_tokens: set[str] = set()
-    for tok in tokens:
-        tl = tok.lower()
-        if tl in kb.resolutions:
-            quality = tok
-            tech_tokens.add(tok)
-            continue
-        if tl in kb.sources:
-            source = tok
-            tech_tokens.add(tok)
-            continue
-        if "-" in tok:
-            parts = tok.rsplit("-", 1)
-            # codec-GROUP (highest priority for group)
-            if parts[0].lower() in kb.codecs:
-                codec = parts[0]
-                group = parts[1] if parts[1] else "UNKNOWN"
-                tech_tokens.add(tok)
-                continue
-            # source with dash: Web-DL, WEB-DL, etc.
-            if parts[0].lower() in kb.sources or tok.lower().replace("-", "") in kb.sources:
-                source = tok
-                tech_tokens.add(tok)
-                continue
-        if tl in kb.codecs:
-            codec = tok
-            tech_tokens.add(tok)
-    # Fallback: rightmost token with a dash that isn't a known source
-    if group == "UNKNOWN":
-        for tok in reversed(tokens):
-            if "-" in tok:
-                parts = tok.rsplit("-", 1)
-                tl = tok.lower()
-                if tl in kb.sources or tok.lower().replace("-", "") in kb.sources:
-                    continue
-                if parts[1]:
-                    group = parts[1]
-                    break
-    return quality, source, codec, group, tech_tokens
-def _is_year_token(tok: str) -> bool:
-    """Return True if tok is a 4-digit year between 1900 and 2099."""
-    return len(tok) == 4 and tok.isdigit() and 1900 <= int(tok) <= 2099
-def _extract_title(
-    tokens: list[str], tech_tokens: set[str], kb: ReleaseKnowledge
-) -> str:
-    """Extract the title portion: everything before the first season/year/tech token."""
-    title_parts = []
-    known_tech = kb.resolutions | kb.sources | kb.codecs
-    for tok in tokens:
-        if _parse_season_episode(tok) is not None:
-            break
-        if _is_year_token(tok):
-            break
-        if tok in tech_tokens or tok.lower() in known_tech:
-            break
-        if "-" in tok and any(p.lower() in kb.codecs | kb.sources for p in tok.split("-")):
-            break
-        title_parts.append(tok)
-    return ".".join(title_parts) if title_parts else tokens[0]
-def _extract_year(tokens: list[str], title: str) -> int | None:
-    """Extract a 4-digit year from tokens (only after the title)."""
-    title_len = len(title.split("."))
-    for tok in tokens[title_len:]:
-        if _is_year_token(tok):
-            return int(tok)
-    return None
-# ---------------------------------------------------------------------------
-# Sequence matcher
-# ---------------------------------------------------------------------------
-def _match_sequences(
-    tokens: list[str],
-    sequences: list[dict],
-    key: str,
-) -> tuple[str | None, set[str]]:
-    """
-    Try to match multi-token sequences against consecutive tokens.
-    Returns (matched_value, set_of_matched_tokens) or (None, empty_set).
-    Sequences must be ordered most-specific first in the YAML.
-    """
-    upper_tokens = [t.upper() for t in tokens]
-    for seq in sequences:
-        seq_upper = [s.upper() for s in seq["tokens"]]
-        n = len(seq_upper)
-        for i in range(len(upper_tokens) - n + 1):
-            if upper_tokens[i : i + n] == seq_upper:
-                matched = set(tokens[i : i + n])
-                return seq[key], matched
-    return None, set()
-# ---------------------------------------------------------------------------
-# Language extraction
-# ---------------------------------------------------------------------------
-def _extract_languages(
-    tokens: list[str], kb: ReleaseKnowledge
-) -> tuple[list[str], set[str]]:
-    """Extract language tokens. Returns (languages, matched_token_set)."""
-    languages = []
-    lang_tokens: set[str] = set()
-    for tok in tokens:
-        if tok.upper() in kb.language_tokens:
-            languages.append(tok.upper())
-            lang_tokens.add(tok)
-    return languages, lang_tokens
-# ---------------------------------------------------------------------------
-# Audio extraction
-# ---------------------------------------------------------------------------
-def _extract_audio(
-    tokens: list[str], kb: ReleaseKnowledge,
-) -> tuple[str | None, str | None, set[str]]:
-    """
-    Extract audio codec and channel layout.
-    Returns (audio_codec, audio_channels, matched_token_set).
-    Sequences are tried first (DTS.HD.MA, TrueHD.Atmos, …), then single tokens.
-    """
-    audio_codec: str | None = None
-    audio_channels: str | None = None
-    audio_tokens: set[str] = set()
-    known_codecs = {c.upper() for c in kb.audio.get("codecs", [])}
-    known_channels = set(kb.audio.get("channels", []))
-    # Try multi-token sequences first
-    matched_codec, matched_set = _match_sequences(
-        tokens, kb.audio.get("sequences", []), "codec"
-    )
-    if matched_codec:
-        audio_codec = matched_codec
-        audio_tokens |= matched_set
-    # Channel layouts like "5.1" or "7.1" are split into two tokens by normalize —
-    # detect them as consecutive pairs "X" + "Y" where "X.Y" is a known channel.
-    # The second token may have a "-GROUP" suffix (e.g. "1-KTH" → strip it).
-    for i in range(len(tokens) - 1):
-        second = tokens[i + 1].split("-")[0]
-        candidate = f"{tokens[i]}.{second}"
-        if candidate in known_channels and audio_channels is None:
-            audio_channels = candidate
-            audio_tokens.add(tokens[i])
-            audio_tokens.add(tokens[i + 1])
-    for tok in tokens:
-        if tok in audio_tokens:
-            continue
-        if tok.upper() in known_codecs and audio_codec is None:
-            audio_codec = tok
-            audio_tokens.add(tok)
-        elif tok in known_channels and audio_channels is None:
-            audio_channels = tok
-            audio_tokens.add(tok)
-    return audio_codec, audio_channels, audio_tokens
-# ---------------------------------------------------------------------------
-# Video metadata extraction (bit depth, HDR)
-# ---------------------------------------------------------------------------
-def _extract_video_meta(
-    tokens: list[str], kb: ReleaseKnowledge,
-) -> tuple[str | None, str | None, set[str]]:
-    """
-    Extract bit depth and HDR format.
-    Returns (bit_depth, hdr_format, matched_token_set).
-    """
-    bit_depth: str | None = None
-    hdr_format: str | None = None
-    video_tokens: set[str] = set()
-    known_hdr = {h.upper() for h in kb.video_meta.get("hdr", [])} | kb.hdr_extra
-    known_depth = {d.lower() for d in kb.video_meta.get("bit_depth", [])}
-    # Try HDR sequences first
-    matched_hdr, matched_set = _match_sequences(
-        tokens, kb.video_meta.get("sequences", []), "hdr"
-    )
-    if matched_hdr:
-        hdr_format = matched_hdr
-        video_tokens |= matched_set
-    for tok in tokens:
-        if tok in video_tokens:
-            continue
-        if tok.upper() in known_hdr and hdr_format is None:
-            hdr_format = tok.upper()
-            video_tokens.add(tok)
-        elif tok.lower() in known_depth and bit_depth is None:
-            bit_depth = tok.lower()
-            video_tokens.add(tok)
-    return bit_depth, hdr_format, video_tokens
-# ---------------------------------------------------------------------------
-# Edition extraction
-# ---------------------------------------------------------------------------
-def _extract_edition(
-    tokens: list[str], kb: ReleaseKnowledge
-) -> tuple[str | None, set[str]]:
-    """
-    Extract release edition (UNRATED, EXTENDED, DIRECTORS.CUT, …).
-    Returns (edition, matched_token_set).
-    """
-    known_tokens = {t.upper() for t in kb.editions.get("tokens", [])}
-    # Try multi-token sequences first
-    matched_edition, matched_set = _match_sequences(
-        tokens, kb.editions.get("sequences", []), "edition"
-    )
-    if matched_edition:
-        return matched_edition, matched_set
-    for tok in tokens:
-        if tok.upper() in known_tokens:
-            return tok.upper(), {tok}
-    return None, set()
@@ -105,6 +105,7 @@ class ParsedRelease:
    bit_depth: str | None = None  # "10bit", "8bit", …
    hdr_format: str | None = None  # "DV", "HDR10", "DV.HDR10", …
    edition: str | None = None  # "UNRATED", "EXTENDED", "DIRECTORS.CUT", …
    distributor: str | None = None  # "NF", "AMZN", "DSNP", … (streaming origin)
    def __post_init__(self) -> None:
        if not self.raw:
@@ -16,9 +16,11 @@ import alfred as _alfred_pkg
 _BUILTIN_ROOT = Path(_alfred_pkg.__file__).parent / "knowledge" / "release"
 _SITES_ROOT = _BUILTIN_ROOT / "sites"
 _GROUPS_ROOT = _BUILTIN_ROOT / "release_groups"
 _LEARNED_ROOT = (
    Path(_alfred_pkg.__file__).parent.parent / "data" / "knowledge" / "release"
 )
 _LEARNED_GROUPS_ROOT = _LEARNED_ROOT / "release_groups"
 def _merge(base: dict, overlay: dict) -> dict:
@@ -62,6 +64,15 @@ def load_sources() -> set[str]:
    return set(_load("sources.yaml").get("sources", []))
 def load_distributors() -> set[str]:
    """Streaming distributor tokens (NF, AMZN, DSNP, …).
    Distinct from ``load_sources()`` — distributors are uppercase scene
    tags identifying the platform, not the capture origin.
    """
    return {t.upper() for t in _load("distributors.yaml").get("distributors", [])}
 def load_codecs() -> set[str]:
    return set(_load("codecs.yaml").get("codecs", []))
@@ -128,6 +139,27 @@ def load_media_type_tokens() -> dict:
    return _load_sites().get("media_type_tokens", {})
 def load_group_schemas() -> dict:
    """Load every release-group schema YAML keyed by uppercase group name.
    Builtin schemas in ``alfred/knowledge/release/release_groups/`` are
    merged with user-learned schemas in
    ``data/knowledge/release/release_groups/`` (the learned ones win on
    name collision).
    """
    result: dict = {}
    for root in (_GROUPS_ROOT, _LEARNED_GROUPS_ROOT):
        if not root.is_dir():
            continue
        for path in sorted(root.glob("*.yaml")):
            data = _read(path)
            name = data.get("name")
            if not name:
                continue
            result[name.upper()] = data
    return result
 def load_separators() -> list[str]:
    """Single-char token separators used by the release name tokenizer.
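The "learned wins on name collision" rule in `load_group_schemas` follows from iteration order alone: later roots overwrite earlier keys. A minimal sketch of that behavior, using in-memory dicts instead of YAML files; `merge_schemas` and the sample data are hypothetical.

```python
def merge_schemas(builtin: list, learned: list) -> dict:
    """Key schema dicts by uppercase name; later sources overwrite earlier ones."""
    result = {}
    for source in (builtin, learned):  # same order as (_GROUPS_ROOT, _LEARNED_GROUPS_ROOT)
        for data in source:
            name = data.get("name")
            if not name:
                continue  # nameless entries are skipped, as in the loader
            result[name.upper()] = data
    return result


merged = merge_schemas(
    builtin=[{"name": "KONTRAST", "separator": "."}, {"name": "RARBG", "separator": "."}],
    learned=[{"name": "Kontrast", "separator": "_"}, {"separator": "."}],
)
```

Because keys are uppercased before insertion, a learned file named `Kontrast` replaces the builtin `KONTRAST` entry instead of sitting next to it.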
@@ -14,11 +14,16 @@ filesystem-level concerns.
 from __future__ import annotations
 from alfred.domain.release.parser.schema import GroupSchema, SchemaChunk
 from alfred.domain.release.parser.tokens import TokenRole
 from .release import (
    load_audio,
    load_codecs,
    load_distributors,
    load_editions,
    load_forbidden_chars,
    load_group_schemas,
    load_hdr_extra,
    load_language_tokens,
    load_media_type_tokens,
@@ -35,6 +40,26 @@ from .release import (
 )
 def _build_group_schema(data: dict) -> GroupSchema:
    """Translate a raw YAML schema dict into a frozen :class:`GroupSchema`.
    Unknown roles raise ``ValueError`` early so a typo in a YAML file
    surfaces at construction time, not on first parse.
    """
    chunks = tuple(
        SchemaChunk(
            role=TokenRole(entry["role"]),
            optional=bool(entry.get("optional", False)),
        )
        for entry in data.get("chunk_order", [])
    )
    return GroupSchema(
        name=data["name"],
        separator=data.get("separator", "."),
        chunks=chunks,
    )
 class YamlReleaseKnowledge:
    """Single object holding every parsed-release knowledge constant.
@@ -48,6 +73,7 @@ class YamlReleaseKnowledge:
        self.resolutions: set[str] = load_resolutions()
        self.sources: set[str] = load_sources() | load_sources_extra()
        self.codecs: set[str] = load_codecs()
        self.distributors: set[str] = load_distributors()
        self.language_tokens: set[str] = load_language_tokens()
        self.forbidden_chars: set[str] = load_forbidden_chars()
        self.hdr_extra: set[str] = load_hdr_extra()
@@ -78,6 +104,15 @@ class YamlReleaseKnowledge:
            "", "", "".join(load_win_forbidden_chars())
        )
        # Group schemas, keyed by uppercase group name for fast lookup.
        self._group_schemas: dict[str, GroupSchema] = {
            key: _build_group_schema(data)
            for key, data in load_group_schemas().items()
        }
    def sanitize_for_fs(self, text: str) -> str:
        """Strip Windows-forbidden characters from ``text``."""
        return text.translate(self._win_forbidden_table)
    def group_schema(self, name: str) -> GroupSchema | None:
        return self._group_schemas.get(name.upper())
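Two properties of the adapter above are worth seeing in isolation: an unknown role fails at construction time, and lookup is case-insensitive. The sketch below assumes nothing about the real classes; `Role`, `build_chunks` and `Registry` are hypothetical, trimmed stand-ins.

```python
from enum import Enum
from typing import Optional


class Role(str, Enum):  # hypothetical, trimmed stand-in for TokenRole
    TITLE = "title"
    CODEC = "codec"
    GROUP = "group"


def build_chunks(data: dict) -> tuple:
    # Role(...) raises ValueError for a value the enum does not define,
    # so a YAML typo surfaces while the registry is being built.
    return tuple(
        (Role(entry["role"]), bool(entry.get("optional", False)))
        for entry in data.get("chunk_order", [])
    )


class Registry:  # hypothetical stand-in for the schema store
    def __init__(self, raw: dict) -> None:
        self._schemas = {key.upper(): build_chunks(data) for key, data in raw.items()}

    def group_schema(self, name: str) -> Optional[tuple]:
        return self._schemas.get(name.upper())


registry = Registry(
    {"ELiTE": {"chunk_order": [{"role": "title"}, {"role": "codec"}, {"role": "group"}]}}
)

try:
    Registry({"BROKEN": {"chunk_order": [{"role": "codek"}]}})
    typo_rejected = False
except ValueError:
    typo_rejected = True
```

Uppercasing on both the write and the read side is what makes `"elite"`, `"ELiTE"` and `"ELITE"` resolve to the same entry.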
@@ -0,0 +1,17 @@
 # Known streaming distributor tokens (case-insensitive match).
 #
 # These tags identify *which platform* the release was sourced from
 # (Netflix, Amazon, Disney+, …). Distinct from ``sources.yaml`` which
 # captures the encoding origin (WEB-DL, BluRay, …). A typical release
 # carries both: ``Show.S01E01.1080p.NF.WEB-DL.x264-GROUP`` →
 # source=WEB-DL, distributor=NF.
 distributors:
  - NF      # Netflix
  - AMZN    # Amazon Prime Video
  - DSNP    # Disney+
  - HMAX    # HBO Max
  - ATVP    # Apple TV+
  - HULU    # Hulu
  - PCOK    # Peacock
  - PMTP    # Paramount+
  - CR      # Crunchyroll
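To illustrate the source/distributor split on the example release from the comment above, here is a small sketch. The two sets are hand-picked subsets and `classify` is a hypothetical helper, not the pipeline's matcher.

```python
SOURCES = {"webrip", "web-dl", "bluray"}  # lowercase, as listed in sources.yaml
DISTRIBUTORS = {"NF", "AMZN", "DSNP"}     # uppercase, as load_distributors() returns them


def classify(tokens: list) -> tuple:
    """Return (source, distributor), keeping the first hit for each dimension."""
    source = None
    distributor = None
    for tok in tokens:
        if source is None and tok.lower() in SOURCES:
            source = tok
        elif distributor is None and tok.upper() in DISTRIBUTORS:
            distributor = tok.upper()
    return source, distributor


result = classify("Show.S01E01.1080p.NF.WEB-DL.x264-GROUP".split("."))
```

Here `result` is `("WEB-DL", "NF")`: both dimensions are captured instead of one overwriting the other.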
@@ -0,0 +1,22 @@
 # ELiTE release naming schema.
 #
 # Examples seen in the wild:
 #   Foundation.S02.1080p.x265-ELiTE             (TV season pack, no source)
 #
 # ELiTE often omits the source token entirely on TV releases (no WEBRip /
 # BluRay), going straight from resolution to codec.
 name: ELiTE
 separator: "."
 chunk_order:
  - role: title
  - role: year
    optional: true
  - role: season_episode
    optional: true
  - role: resolution
  - role: source
    optional: true             # often absent on TV
  - role: codec
  - role: group
@@ -0,0 +1,28 @@
 # KONTRAST release naming schema.
 #
 # Examples seen in the wild:
 #   Back.in.Action.2025.1080p.WEBRip.x265-KONTRAST            (movie)
 #   The.Long.Walk.2025.1080p.WEBRip.x265-KONTRAST             (movie)
 #   Slow.Horses.S05E01.1080p.WEBRip.x265-KONTRAST             (TV episode)
 #   Slow.Horses.S05.1080p.WEBRip.x265-KONTRAST                (TV season pack)
 #
 # Schema is a left-to-right description of the canonical chunk order.
 # Each entry is a role (matching TokenRole). Optional chunks are marked
 # with `optional: true`. The parser consumes tokens greedily by role,
 # skipping over optional chunks that don't match.
 name: KONTRAST
 separator: "."
 # Canonical order of structural + technical chunks (left to right).
 # `title` is special-cased as "everything up to the first non-title role".
 chunk_order:
  - role: title
  - role: year
    optional: true             # absent on TV releases (S01E01 instead)
  - role: season_episode
    optional: true             # absent on movies
  - role: resolution           # always present (1080p, 2160p, …)
  - role: source               # always present (WEBRip, BluRay, …)
  - role: codec                # always present (x265, x264, …)
  - role: group                # everything after the final `-`
@@ -0,0 +1,20 @@
 # RARBG release naming schema.
 #
 # RARBG follows the canonical scene convention closely:
 #   Title.Year.Resolution.Source.Codec-RARBG
 # For TV:
 #   Title.S01E01.Resolution.Source.Codec-RARBG
 name: RARBG
 separator: "."
 chunk_order:
  - role: title
  - role: year
    optional: true
  - role: season_episode
    optional: true
  - role: resolution
  - role: source
  - role: codec
  - role: group
@@ -1,4 +1,9 @@
-# Known release source tokens (case-insensitive match)
+# Known release source tokens (case-insensitive match).
 #
 # "Source" here means the capture/encoding origin (disc, broadcast, web
 # stream) — NOT the streaming distributor (Netflix, Disney+, …). Those
 # live in ``distributors.yaml`` because they're a separate dimension:
 # a release is typically "WEB-DL from NF" — both should be captured.
 sources:
  - bluray
  - blu-ray
@@ -14,8 +19,3 @@ sources:
  - dvdrip
  - dvd
  - vodrip
  - amzn
  - nf
  - dsnp
  - hmax
  - atvp
@@ -0,0 +1,216 @@
 """EASY-path tests for the v2 annotate-based pipeline.
 These tests assert that the **v2 pipeline itself** produces the correct
 annotated stream and assembled fields for releases from known groups
 (KONTRAST, ELiTE, …) — without going through ``parse_release``. The
 fixtures suite (``tests/domain/test_release_fixtures.py``) already
 locks the user-visible ``ParsedRelease`` contract; here we cover the
 internal pipeline behavior so a future refactor of ``parse_release``
 can't quietly drop EASY without us noticing.
 """
 from __future__ import annotations
 from alfred.domain.release.parser import TokenRole
 from alfred.domain.release.parser.pipeline import (
    _detect_group,
    annotate,
    assemble,
    tokenize,
 )
 from alfred.infrastructure.knowledge.release_kb import YamlReleaseKnowledge
 _KB = YamlReleaseKnowledge()
 class TestDetectGroup:
    def test_codec_group(self) -> None:
        tokens, _ = tokenize(
            "Back.in.Action.2025.1080p.WEBRip.x265-KONTRAST", _KB
        )
        name, idx = _detect_group(tokens, _KB)
        assert name == "KONTRAST"
        assert idx == 6  # x265-KONTRAST is the 7th token (0-based index 6)
    def test_unknown_when_no_dash(self) -> None:
        tokens, _ = tokenize("Some.Movie.2020.1080p.WEBRip.x265.KONTRAST", _KB)
        # No dash anywhere → no group detected.
        name, idx = _detect_group(tokens, _KB)
        assert idx is None
        assert name == "UNKNOWN"
    def test_skips_dashed_source(self) -> None:
        # "Web-DL" must not be mistaken for a group token.
        tokens, _ = tokenize("Movie.2020.1080p.Web-DL.x265-GRP", _KB)
        name, idx = _detect_group(tokens, _KB)
        assert name == "GRP"
 class TestAnnotateEasy:
    def test_kontrast_movie(self) -> None:
        tokens, _ = tokenize(
            "Back.in.Action.2025.1080p.WEBRip.x265-KONTRAST", _KB
        )
        annotated = annotate(tokens, _KB)
        assert annotated is not None, "KONTRAST should hit the EASY path"
        roles = [t.role for t in annotated]
        assert roles == [
            TokenRole.TITLE,  # Back
            TokenRole.TITLE,  # in
            TokenRole.TITLE,  # Action
            TokenRole.YEAR,
            TokenRole.RESOLUTION,
            TokenRole.SOURCE,
            TokenRole.CODEC,  # x265-KONTRAST → CODEC with extra.group=KONTRAST
        ]
        assert annotated[-1].extra["group"] == "KONTRAST"
        assert annotated[-1].extra["codec"] == "x265"
    def test_kontrast_tv_episode(self) -> None:
        tokens, _ = tokenize(
            "Slow.Horses.S05E01.1080p.WEBRip.x265-KONTRAST", _KB
        )
        annotated = annotate(tokens, _KB)
        assert annotated is not None
        # Year is optional and absent → skipped. Season_episode present.
        roles = [t.role for t in annotated]
        assert TokenRole.SEASON_EPISODE in roles
        assert TokenRole.YEAR not in roles
    def test_elite_no_source(self) -> None:
        # ELiTE schema marks source as optional — Foundation.S02 omits it.
        tokens, _ = tokenize("Foundation.S02.1080p.x265-ELiTE", _KB)
        annotated = annotate(tokens, _KB)
        assert annotated is not None, "ELiTE optional source must be tolerated"
        roles = [t.role for t in annotated]
        assert TokenRole.SOURCE not in roles
        assert TokenRole.RESOLUTION in roles
        assert TokenRole.CODEC in roles
    def test_unknown_group_falls_to_shitty(self) -> None:
        tokens, _ = tokenize("Some.Movie.2020.1080p.WEBRip.x264-RANDOM", _KB)
        # RANDOM is not in our release_groups/ — annotate() now falls
        # through to the in-pipeline SHITTY pass and returns a populated
        # token list (no None sentinel anymore).
        annotated = annotate(tokens, _KB)
        assert annotated is not None
        roles = [t.role for t in annotated]
        # Title is "Some.Movie", then YEAR, RESOLUTION, SOURCE, CODEC
        # carrying the group in extra.
        assert TokenRole.TITLE in roles
        assert TokenRole.YEAR in roles
        assert TokenRole.RESOLUTION in roles
        assert TokenRole.SOURCE in roles
        assert TokenRole.CODEC in roles
        codec_tok = next(t for t in annotated if t.role is TokenRole.CODEC)
        assert codec_tok.extra.get("group") == "RANDOM"
 class TestAssemble:
    def test_kontrast_movie_fields(self) -> None:
        name = "Back.in.Action.2025.1080p.WEBRip.x265-KONTRAST"
        tokens, tag = tokenize(name, _KB)
        annotated = annotate(tokens, _KB)
        fields = assemble(annotated, tag, name, _KB)
        assert fields["title"] == "Back.in.Action"
        assert fields["year"] == 2025
        assert fields["season"] is None
        assert fields["quality"] == "1080p"
        assert fields["source"] == "WEBRip"
        assert fields["codec"] == "x265"
        assert fields["group"] == "KONTRAST"
        assert fields["tech_string"] == "1080p.WEBRip.x265"
        assert fields["media_type"] == "movie"
        assert fields["site_tag"] is None
    def test_kontrast_tv_fields(self) -> None:
        name = "Slow.Horses.S05E01.1080p.WEBRip.x265-KONTRAST"
        tokens, tag = tokenize(name, _KB)
        annotated = annotate(tokens, _KB)
        fields = assemble(annotated, tag, name, _KB)
        assert fields["title"] == "Slow.Horses"
        assert fields["year"] is None
        assert fields["season"] == 5
        assert fields["episode"] == 1
        assert fields["media_type"] == "tv_show"
        assert fields["group"] == "KONTRAST"
    def test_elite_season_pack(self) -> None:
        name = "Foundation.S02.1080p.x265-ELiTE"
        tokens, tag = tokenize(name, _KB)
        annotated = annotate(tokens, _KB)
        fields = assemble(annotated, tag, name, _KB)
        assert fields["title"] == "Foundation"
        assert fields["season"] == 2
        assert fields["episode"] is None  # season pack
        assert fields["source"] is None  # ELiTE omits it
        assert fields["tech_string"] == "1080p.x265"
        assert fields["group"] == "ELiTE"
 class TestEnrichers:
    """Non-positional roles populated alongside the structural walk.
    These releases would have failed the v2 EASY path before the enricher
    pass landed (leftover unknown tokens would force a fallback). They
    now succeed in v2 with rich metadata.
    """
    def test_bit_depth_and_audio(self) -> None:
        name = "Back.in.Action.2025.1080p.WEBRip.10bit.DDP.5.1.x265-KONTRAST"
        tokens, tag = tokenize(name, _KB)
        annotated = annotate(tokens, _KB)
        assert annotated is not None
        fields = assemble(annotated, tag, name, _KB)
        assert fields["title"] == "Back.in.Action"
        assert fields["bit_depth"] == "10bit"
        assert fields["audio_codec"] == "DDP"
        assert fields["audio_channels"] == "5.1"
    def test_hdr_sequence(self) -> None:
        # DV.HDR10 sequence + TrueHD.Atmos sequence + 7.1 channels +
        # DIRECTORS.CUT edition all in one release.
        name = (
            "Some.Movie.2024.DIRECTORS.CUT.2160p.BluRay.DV.HDR10."
            "TrueHD.Atmos.7.1.x265-KONTRAST"
        )
        tokens, tag = tokenize(name, _KB)
        annotated = annotate(tokens, _KB)
        assert annotated is not None
        fields = assemble(annotated, tag, name, _KB)
        assert fields["edition"] == "DIRECTORS.CUT"
        assert fields["hdr_format"] == "DV.HDR10"
        assert fields["audio_codec"] == "TrueHD.Atmos"
        assert fields["audio_channels"] == "7.1"
    def test_multiple_languages(self) -> None:
        name = "Movie.2020.FRENCH.MULTI.1080p.WEBRip.DTS.HD.MA.5.1.x265-KONTRAST"
        tokens, tag = tokenize(name, _KB)
        annotated = annotate(tokens, _KB)
        assert annotated is not None
        fields = assemble(annotated, tag, name, _KB)
        assert fields["languages"] == ["FRENCH", "MULTI"]
        assert fields["audio_codec"] == "DTS-HD.MA"
        assert fields["audio_channels"] == "5.1"
    def test_tv_with_language(self) -> None:
        name = "Show.S01E05.FRENCH.1080p.WEBRip.x265-KONTRAST"
        tokens, tag = tokenize(name, _KB)
        annotated = annotate(tokens, _KB)
        assert annotated is not None
        fields = assemble(annotated, tag, name, _KB)
        assert fields["title"] == "Show"
        assert fields["season"] == 1
        assert fields["episode"] == 5
        assert fields["languages"] == ["FRENCH"]
        assert fields["media_type"] == "tv_show"
@@ -0,0 +1,79 @@
 """Scaffolding tests for the v2 parser package.
 These tests lock the **shape** of the new pipeline (token VOs, tokenize
 output, site-tag stripping) before the annotate step is wired in. They
 do not check parsed-release output yet — that comes once :func:`annotate`
 is implemented and the fixtures-based suite switches over.
 """
 from __future__ import annotations
 from alfred.domain.release.parser import Token, TokenRole
 from alfred.domain.release.parser.pipeline import strip_site_tag, tokenize
 from alfred.infrastructure.knowledge.release_kb import YamlReleaseKnowledge
 _KB = YamlReleaseKnowledge()
 class TestToken:
    def test_default_role_is_unknown(self) -> None:
        t = Token(text="1080p", index=3)
        assert t.role is TokenRole.UNKNOWN
        assert not t.is_annotated
    def test_with_role_returns_new_instance(self) -> None:
        t = Token(text="1080p", index=3)
        promoted = t.with_role(TokenRole.RESOLUTION)
        assert promoted is not t
        assert promoted.role is TokenRole.RESOLUTION
        assert t.role is TokenRole.UNKNOWN  # original unchanged (frozen)
    def test_with_role_merges_extra(self) -> None:
        t = Token(text="x265-KONTRAST", index=5)
        promoted = t.with_role(TokenRole.CODEC, group="KONTRAST")
        assert promoted.role is TokenRole.CODEC
        assert promoted.extra == {"group": "KONTRAST"}
 class TestStripSiteTag:
    def test_no_tag(self) -> None:
        clean, tag = strip_site_tag("The.Movie.2020.1080p-GRP")
        assert tag is None
        assert clean == "The.Movie.2020.1080p-GRP"
    def test_suffix_tag(self) -> None:
        clean, tag = strip_site_tag("Sinners.2025.1080p-[YTS.MX]")
        assert tag == "YTS.MX"
        assert clean == "Sinners.2025.1080p-"
    def test_prefix_tag(self) -> None:
        clean, tag = strip_site_tag("[ OxTorrent.vc ] The.Title.S01E01")
        assert tag == "OxTorrent.vc"
        assert clean == "The.Title.S01E01"
 class TestTokenize:
    def test_simple_release(self) -> None:
        tokens, tag = tokenize("Back.in.Action.2025.1080p.WEBRip.x265-KONTRAST", _KB)
        assert tag is None
        texts = [t.text for t in tokens]
        # Dash is not a separator, so x265-KONTRAST stays glued.
        assert texts == [
            "Back", "in", "Action", "2025", "1080p", "WEBRip", "x265-KONTRAST",
        ]
    def test_all_tokens_start_unknown(self) -> None:
        tokens, _ = tokenize("Back.in.Action.2025.1080p.WEBRip.x265-KONTRAST", _KB)
        assert all(t.role is TokenRole.UNKNOWN for t in tokens)
    def test_indexes_are_contiguous(self) -> None:
        tokens, _ = tokenize("A.B.C.D", _KB)
        assert [t.index for t in tokens] == [0, 1, 2, 3]
    def test_strips_site_tag_before_tokenize(self) -> None:
        tokens, tag = tokenize(
            "Sinners.2025.1080p.WEBRip.x265.10bit.AAC5.1-[YTS.MX]", _KB
        )
        assert tag == "YTS.MX"
        # Site tag substring must not appear among tokens.
        assert not any("YTS" in t.text for t in tokens)
@@ -26,10 +26,16 @@ _KB = YamlReleaseKnowledge()
 FIXTURES = discover_fixtures()
 def _fixture_param(f: ReleaseFixture) -> pytest.param:
    marks = []
    if f.xfail_reason:
        marks.append(pytest.mark.xfail(reason=f.xfail_reason, strict=False))
    return pytest.param(f, id=f.name, marks=marks)
@pytest.mark.parametrize(
    "fixture",
-    FIXTURES,
+    [_fixture_param(f) for f in FIXTURES],
    ids=[f.name for f in FIXTURES],
 )
 def test_parse_matches_fixture(fixture: ReleaseFixture, tmp_path) -> None:
    # Materialize the tree to assert it is at least well-formed YAML +
@@ -39,6 +39,14 @@ class ReleaseFixture:
    def routing(self) -> dict:
        return self.data.get("routing", {})
    @property
    def xfail_reason(self) -> str | None:
        """If set, the fixture is expected to fail and is wrapped with
        ``pytest.mark.xfail`` by the test runner. Used for known
        unsupported pathological cases (typically the PATH OF PAIN bucket).
        """
        return self.data.get("xfail_reason")
    def materialize(self, root: Path) -> None:
        """Create the fixture's ``tree`` as empty files/dirs under ``root``."""
        for entry in self.tree:
@@ -1,5 +1,10 @@
 release_name: "Deutschland 83-86-89 (2015) Season 1-3 S01-S03 (1080p BluRay x265 HEVC 10bit AAC 5.1 German Kappa)"
 # Out of SHITTY scope by design: parenthesized tech blocks, group name as
 # the last bare word inside parens, year-suffix range in title, dual
 # season expression. PATH OF PAIN handles this via LLM pre-analysis.
 xfail_reason: "PoP-grade pathological franchise box-set, beyond simple-dict SHITTY"
 # Pathological franchise box-set:
 # - Title contains year-suffix range "83-86-89" (3 years glued)
 # - Season range expressed twice: "Season 1-3" AND "S01-S03"
@@ -1,5 +1,10 @@
 release_name: "Predator Badlands 2025 1080p HDRip HEVC x265 BONE"
 # Space-separated release with both codec aliases present (HEVC + x265)
 # and no dash before the group. Simple SHITTY is first-wins and picks HEVC;
 # the fixture expects x265 (legacy last-wins). Reclassified as PoP.
 xfail_reason: "Space-separated, dual codec aliases, no dashed group"
 # Space-separated release: tokenizer correctly splits and identifies year +
 # tech, but the dash-before-group convention is absent so 'BONE' is not
 # recognized as the group — falls to UNKNOWN. Anti-regression baseline.
@@ -1,5 +1,9 @@
 release_name: "SLEAFORD MODS   Live Glastonbury June 27th 2015-niNjHn8abyY.mp4"
 # YouTube-style slug with year-prefixed video-id dash suffix. Not a scene
 # release shape at all — PATH OF PAIN.
 xfail_reason: "YouTube slug with year-prefixed video-id, not a scene shape"
 # yt-dlp filename: triple space between band name and event, no canonical
 # tech markers, dashed YouTube video ID glued to the year, .mp4 extension
 # preserved in the title. Parser:
@@ -1,5 +1,10 @@
 release_name: "Super Mario Bros. le film [FR-EN] (2023).mkv"
 # Bare-dashed language pair interior to the title (``[FR-EN]``) is tagged
 # as group by ``_detect_group``, leaving the title fragment behind.
 # Out of simple-SHITTY scope.
 xfail_reason: "Interior bare-dashed language pair confuses group detection"
 # Hybrid English/French marketing title with:
 # - Trailing period after 'Bros' that is part of the title abbreviation
 #   (not a separator), but tokenizer treats it as one
@@ -1,7 +1,8 @@
 release_name: "Notre.planete.s01e01.1080p.NF.WEB-DL.DDP5.1.x264-NTb"
 # Lowercase 's01e01' and lowercased title word ('planete') correctly parsed.
-# NF (Netflix) source tag is not in the source KB — drops; WEB-DL wins.
+# NF is the Netflix streaming distributor (separate dimension from source);
 # WEB-DL is the encoding source.
 parsed:
  title: "Notre.planete"
  year: null
@@ -11,6 +12,7 @@ parsed:
  source: "WEB-DL"
  codec: "x264"
  group: "NTb"
  distributor: "NF"
  tech_string: "1080p.WEB-DL.x264"
  media_type: "tv_show"
  parse_path: "direct"