feat(release): scaffold v2 parser package (annotate-based pipeline)

New package alfred/domain/release/parser/ lays the foundation for the release parser refactor (specs in memory). Exposes: - Token: frozen VO carrying text + stream index + TokenRole + extra dict. with_role() returns a new instance (no mutation). - TokenRole: str-backed enum split into structural (TITLE/YEAR/SEASON_EP/ GROUP), technical (RESOLUTION/SOURCE/CODEC/AUDIO_*/BIT_DEPTH/HDR/ EDITION/LANGUAGE), and meta (SITE_TAG/UNKNOWN) families. - pipeline.strip_site_tag(): pulls a [site.tag] prefix or suffix. - pipeline.tokenize(): release name -> list[Token] (all UNKNOWN), string-ops split on kb.separators (no regex, per CLAUDE.md). - pipeline.annotate(): documented stub. Walk order recorded in docstring (group right-to-left, then season/episode, year, tech, title). Legacy parse_release in release.services remains the live implementation until the annotate step lands. Scaffolding tests verify Token API, site-tag stripping (prefix/suffix), and tokenize output shape. Refs: project_release_parser_v2_specs (memory)
2026-05-20 00:12:33 +02:00
parent 9f10f4e0ad
commit a2c917618f
6 changed files with 323 additions and 0 deletions
@@ -17,6 +17,16 @@ callers).

 ### Added

+- **Release parser v2 scaffolding** (`alfred/domain/release/parser/`):
+  new package laying the foundation for an annotate-based pipeline
+  (tokenize → annotate → assemble). Exposes `Token` (frozen VO with
+  `index` + `role` + `extra`), `TokenRole` enum (structural / technical /
+  meta families), and a `pipeline.py` module with working `strip_site_tag`
+  + `tokenize` and a documented `annotate` stub. Legacy `parse_release`
+  in `release.services` remains the live implementation until the
+  annotate step is wired in. Scaffolding tests in
+  `tests/domain/release/test_parser_v2_scaffolding.py`.
+
 - **Real-world release fixtures** under `tests/fixtures/releases/{easy,shitty,path_of_pain}/`,
  each documenting an expected `ParsedRelease` plus the future `routing`
  (library / torrents / seed_hardlinks) for the upcoming `organize_media`
@@ -0,0 +1,30 @@
+"""Release parser v2 — annotate-based pipeline.
+
+This package is the future home of ``parse_release``. It restructures the
+parsing logic around a **tokenize → annotate → assemble** pipeline:
+
+1. **tokenize**: split the release name into atomic tokens.
+2. **annotate**: walk tokens left-to-right, assigning each one a
+   :class:`TokenRole` (TITLE, YEAR, SEASON, RESOLUTION, …) using the
+   injected :class:`~alfred.domain.release.ports.knowledge.ReleaseKnowledge`.
+3. **assemble**: fold the annotated tokens into a :class:`ParsedRelease`.
+
+The pipeline has three internal paths driven by the detected release group:
+
+- **EASY**: known group (KONTRAST, RARBG, …) with a schema-driven layout
+  declared in ``knowledge/release/release_groups/<group>.yaml``.
+- **SHITTY**: unknown group, best-effort matching against the global
+  knowledge sets, with a 0-100 confidence score.
+- **PATH OF PAIN**: score below threshold OR critical chunks missing —
+  signaled to the caller, who decides whether to involve the LLM/user.
+
+Today the package exposes scaffolding only (token VOs and a thin pipeline
+stub). The legacy ``parse_release`` in ``release.services`` keeps serving
+production until each piece of the v2 pipeline is wired in.
+"""
+
+from __future__ import annotations
+
+from .tokens import Token, TokenRole
+
+__all__ = ["Token", "TokenRole"]
@@ -0,0 +1,115 @@
+"""Annotate-based pipeline skeleton.
+
+The pipeline is **declared here** in three named stages, but actual logic
+is wired in incrementally — current state is intentional scaffolding.
+
+Stages:
+
+1. :func:`tokenize` — release name → ``list[Token]`` (all UNKNOWN). Also
+   pulls out a leading/trailing site tag (e.g. ``[YTS.MX]``) which is
+   returned separately and never tokenized.
+2. :func:`annotate` — walk the tokens, promote roles using
+   :class:`~alfred.domain.release.ports.knowledge.ReleaseKnowledge`. The
+   walk is **right-to-left for the group** (scene convention puts it
+   last) and **left-to-right for the title** (which is always leftmost).
+3. :func:`assemble` — fold the annotated stream into a domain VO. Output
+   type still TBD: the migration target is the existing
+   :class:`~alfred.domain.release.value_objects.ParsedRelease`, but the
+   pipeline may grow an intermediate :class:`AnnotatedRelease` first to
+   keep the score / leftover-tokens information that ``ParsedRelease``
+   doesn't carry today.
+
+Road dispatch (EASY / SHITTY / PATH OF PAIN) happens **inside**
+:func:`annotate` — once the group is identified (or not), the annotator
+picks the right strategy. EASY consults a per-group schema; SHITTY runs
+the generic matcher loop; PATH OF PAIN is a return-state, not a
+separate path — the caller (``application/release/inspect.py``) decides
+what to do with a low-confidence result.
+"""
+
+from __future__ import annotations
+
+from ..ports.knowledge import ReleaseKnowledge
+from .tokens import Token
+
+
+def strip_site_tag(name: str) -> tuple[str, str | None]:
+    """Split off a ``[site.tag]`` prefix or suffix.
+
+    The bracketed substring is removed from ``name`` and returned as the
+    second element. If no tag is found, returns ``(name.strip(), None)``.
+    """
+    s = name.strip()
+
+    if s.startswith("["):
+        close = s.find("]")
+        if close != -1:
+            tag = s[1:close].strip()
+            remainder = s[close + 1 :].strip()
+            if tag and remainder:
+                return remainder, tag
+
+    if s.endswith("]"):
+        open_bracket = s.rfind("[")
+        if open_bracket != -1:
+            tag = s[open_bracket + 1 : -1].strip()
+            remainder = s[:open_bracket].strip()
+            if tag and remainder:
+                return remainder, tag
+
+    return s, None
+
+
+def tokenize(name: str, kb: ReleaseKnowledge) -> tuple[list[Token], str | None]:
+    """Split ``name`` into tokens after stripping any site tag.
+
+    Returns ``(tokens, site_tag)``. All tokens start with role
+    :attr:`~.tokens.TokenRole.UNKNOWN` — promotion happens in
+    :func:`annotate`.
+
+    The tokenizer is a pure character-class split on ``kb.separators``.
+    String-ops style: no regex (keeps the rule from CLAUDE.md), at the
+    cost of one pass per separator. The release names we parse are short
+    (<200 chars), so the constant factor is irrelevant.
+    """
+    clean, site_tag = strip_site_tag(name)
+
+    # Replace every separator with a single delimiter, then split. Using
+    # \x00 because it cannot legally appear in a release name.
+    DELIM = "\x00"
+    buf = clean
+    for sep in kb.separators:
+        if sep != DELIM:
+            buf = buf.replace(sep, DELIM)
+
+    pieces = [p for p in buf.split(DELIM) if p]
+    tokens = [Token(text=p, index=i) for i, p in enumerate(pieces)]
+    return tokens, site_tag
+
+
+def annotate(tokens: list[Token], kb: ReleaseKnowledge) -> list[Token]:
+    """Promote each token's role using ``kb``.
+
+    **Not implemented yet.** Returns the input unchanged so the package
+    is importable and the pipeline shape is visible. Will be filled in
+    by subsequent commits, one role family at a time.
+
+    The intended walk order, once implemented:
+
+    1. **Group (right-to-left)** — find the trailing ``-GROUP`` token,
+       which also reveals the codec when shaped as ``codec-GROUP``. If
+       the group matches a schema in ``knowledge/release/release_groups/``
+       → EASY path; otherwise SHITTY.
+    2. **Season/episode** — single-token scan, ``S01E05`` / ``1x05``.
+    3. **Year** — first 4-digit token in [1900, 2099] *after* index 0.
+    4. **Tech tokens** — resolutions, sources, codecs, audio, video meta,
+       editions, languages. Multi-token sequences (``DTS.HD.MA``,
+       ``Directors.Cut``) handled first to avoid greedy single-token
+       claims swallowing a sequence prefix.
+    5. **Title** — leftmost contiguous UNKNOWN tokens up to the first
+       structural/technical role boundary.
+    """
+    # TODO(parser-v2): implement annotation. See module docstring for the
+    # walk order. Until then, the legacy parse_release in
+    # release.services is the live implementation.
+    return tokens
@@ -0,0 +1,89 @@
+"""Token value objects for the annotate-based parser.
+
+A :class:`Token` carries both the original substring and its position in
+the original release name's token stream. A :class:`TokenRole` is the
+semantic tag assigned by the annotator.
+
+Why VOs instead of bare ``str``: the annotate step needs to flag tokens
+without consuming them (a token may carry residual info — e.g. a
+``codec-GROUP`` token contributes both a CODEC and a GROUP role). Tracking
+the index also lets later stages reason about *order* (year must come
+after title, group must be rightmost, etc.) without re-scanning the list.
+"""
+
+from __future__ import annotations
+
+from dataclasses import dataclass, field
+from enum import Enum
+
+
+class TokenRole(str, Enum):
+    """Semantic role a token can take after annotation.
+
+    A token starts as ``UNKNOWN`` and may be promoted by the annotator.
+    ``str``-backed for cheap comparisons and YAML/JSON interop.
+
+    Roles split into three families:
+
+    - **structural**: TITLE / YEAR / SEASON_EPISODE / GROUP — drive folder
+      and filename naming.
+    - **technical**: RESOLUTION / SOURCE / CODEC / AUDIO_CODEC /
+      AUDIO_CHANNELS / BIT_DEPTH / HDR / EDITION / LANGUAGE — feed
+      ``tech_string`` and metadata fields.
+    - **meta**: SITE_TAG (stripped pre-tokenize), SEPARATOR (kept for the
+      assemble step if a release uses spaces that need preservation in the
+      title), UNKNOWN (residual, contributes to the SHITTY score penalty).
+    """
+
+    UNKNOWN = "unknown"
+
+    # Structural
+    TITLE = "title"
+    YEAR = "year"
+    SEASON_EPISODE = "season_episode"
+    GROUP = "group"
+
+    # Technical
+    RESOLUTION = "resolution"
+    SOURCE = "source"
+    CODEC = "codec"
+    AUDIO_CODEC = "audio_codec"
+    AUDIO_CHANNELS = "audio_channels"
+    BIT_DEPTH = "bit_depth"
+    HDR = "hdr"
+    EDITION = "edition"
+    LANGUAGE = "language"
+
+    # Meta
+    SITE_TAG = "site_tag"
+
+
+@dataclass(frozen=True)
+class Token:
+    """An atomic token from a release name.
+
+    ``text`` is the substring exactly as it appeared after tokenization
+    (case preserved — uppercase comparisons happen at match time).
+    ``index`` is the 0-based position in the tokenized stream, used by
+    downstream stages to enforce ordering invariants.
+
+    ``role`` defaults to :attr:`TokenRole.UNKNOWN`. The annotator returns
+    new :class:`Token` instances with the role set rather than mutating
+    (the dataclass is frozen). ``extra`` carries role-specific payload
+    when the token text alone isn't enough (e.g. a ``codec-GROUP`` token
+    annotated as CODEC may record the group name in ``extra["group"]``).
+    """
+
+    text: str
+    index: int
+    role: TokenRole = TokenRole.UNKNOWN
+    extra: dict[str, str] = field(default_factory=dict)
+
+    def with_role(self, role: TokenRole, **extra: str) -> Token:
+        """Return a copy of this token with ``role`` (and optional ``extra``)."""
+        merged = {**self.extra, **extra} if extra else self.extra
+        return Token(text=self.text, index=self.index, role=role, extra=merged)
+
+    @property
+    def is_annotated(self) -> bool:
+        return self.role is not TokenRole.UNKNOWN
@@ -0,0 +1,79 @@
+"""Scaffolding tests for the v2 parser package.
+
+These tests lock the **shape** of the new pipeline (token VOs, tokenize
+output, site-tag stripping) before the annotate step is wired in. They
+do not check parsed-release output yet — that comes once :func:`annotate`
+is implemented and the fixtures-based suite switches over.
+"""
+
+from __future__ import annotations
+
+from alfred.domain.release.parser import Token, TokenRole
+from alfred.domain.release.parser.pipeline import strip_site_tag, tokenize
+from alfred.infrastructure.knowledge.release_kb import YamlReleaseKnowledge
+
+_KB = YamlReleaseKnowledge()
+
+
+class TestToken:
+    def test_default_role_is_unknown(self) -> None:
+        t = Token(text="1080p", index=3)
+        assert t.role is TokenRole.UNKNOWN
+        assert not t.is_annotated
+
+    def test_with_role_returns_new_instance(self) -> None:
+        t = Token(text="1080p", index=3)
+        promoted = t.with_role(TokenRole.RESOLUTION)
+        assert promoted is not t
+        assert promoted.role is TokenRole.RESOLUTION
+        assert t.role is TokenRole.UNKNOWN  # original unchanged (frozen)
+
+    def test_with_role_merges_extra(self) -> None:
+        t = Token(text="x265-KONTRAST", index=5)
+        promoted = t.with_role(TokenRole.CODEC, group="KONTRAST")
+        assert promoted.role is TokenRole.CODEC
+        assert promoted.extra == {"group": "KONTRAST"}
+
+
+class TestStripSiteTag:
+    def test_no_tag(self) -> None:
+        clean, tag = strip_site_tag("The.Movie.2020.1080p-GRP")
+        assert tag is None
+        assert clean == "The.Movie.2020.1080p-GRP"
+
+    def test_suffix_tag(self) -> None:
+        clean, tag = strip_site_tag("Sinners.2025.1080p-[YTS.MX]")
+        assert tag == "YTS.MX"
+        assert clean == "Sinners.2025.1080p-"
+
+    def test_prefix_tag(self) -> None:
+        clean, tag = strip_site_tag("[ OxTorrent.vc ] The.Title.S01E01")
+        assert tag == "OxTorrent.vc"
+        assert clean == "The.Title.S01E01"
+
+
+class TestTokenize:
+    def test_simple_release(self) -> None:
+        tokens, tag = tokenize("Back.in.Action.2025.1080p.WEBRip.x265-KONTRAST", _KB)
+        assert tag is None
+        texts = [t.text for t in tokens]
+        # Dash is not a separator, so x265-KONTRAST stays glued.
+        assert texts == [
+            "Back", "in", "Action", "2025", "1080p", "WEBRip", "x265-KONTRAST",
+        ]
+
+    def test_all_tokens_start_unknown(self) -> None:
+        tokens, _ = tokenize("Back.in.Action.2025.1080p.WEBRip.x265-KONTRAST", _KB)
+        assert all(t.role is TokenRole.UNKNOWN for t in tokens)
+
+    def test_indexes_are_contiguous(self) -> None:
+        tokens, _ = tokenize("A.B.C.D", _KB)
+        assert [t.index for t in tokens] == [0, 1, 2, 3]
+
+    def test_strips_site_tag_before_tokenize(self) -> None:
+        tokens, tag = tokenize(
+            "Sinners.2025.1080p.WEBRip.x265.10bit.AAC5.1-[YTS.MX]", _KB
+        )
+        assert tag == "YTS.MX"
+        # Site tag substring must not appear among tokens.
+        assert not any("YTS" in t.text for t in tokens)