Le champ s'appelait normalised mais ne faisait pas la normalisation
suggérée par son nom (dots instead of spaces). En pratique il contient
raw - site_tag - apostrophes, qui sert uniquement à season_folder_name()
via _strip_episode_from_normalized. Renommé en 'clean' qui décrit ce
qu'il contient réellement, docstring corrigée.
Wire the scoring foundations into the parser entry point. parse_release
now returns a tuple — the structural ParsedRelease and a diagnostic
ParseReport carrying confidence (0-100), road
(EASY / SHITTY / PATH_OF_PAIN), the residual UNKNOWN tokens, and the
list of critical fields that couldn't be filled.
EASY is decided structurally (a group schema matched), independently
of the score. SHITTY vs PATH_OF_PAIN is decided by score against the
60 cutoff from scoring.yaml. Malformed names (forbidden chars) emit a
zero-confidence PoP report and short-circuit to parse_path=AI as
before.
ParsePath stays as-is (DIRECT / SANITIZED / AI) — it records *how* we
tokenized, not how confident we are. The two dimensions are now
properly separated.
Call sites propagated:
- alfred/application/filesystem/resolve_destination.py (4 occurrences)
- alfred/agent/tools/filesystem.py
- tests/domain/test_release.py
- tests/domain/test_release_fixtures.py
- tests/application/test_detect_media_type.py
New tests/domain/release/test_parser_v2_scoring.py (22 cases) locks
ParseReport validation, compute_score arithmetic, decide_road
thresholding, the collector helpers, and the end-to-end tuple contract.
Replace the ~480-line legacy heuristic block in services.py with a
small dict-driven pass in pipeline._annotate_shitty: each token is
looked up against the kb buckets (resolutions / sources / codecs /
distributors / year / sxxexx) with first-match-wins semantics, the
leftmost contiguous UNKNOWN run becomes the title, done.
SHITTY's scope is intentionally narrow — releases that *look* like
scene names but don't have a registered group schema. Anything more
exotic (parenthesized tech, bare-dashed title fragments, YT slugs,
franchise boxes) is PATH OF PAIN territory and stays out of here.
- annotate() no longer returns None; SHITTY is the always-on fallback
- services.py shrunk from ~525 to ~85 lines (legacy extractors gone)
- 4 fixtures get xfail markers documenting PoP-grade pathologies
(deutschland franchise box, sleaford YT slug, super_mario bilingual,
predator space-separators — the last one moved from shitty/ → pop/)
- ReleaseFixture grows xfail_reason; the parametrized suite wires the
pytest.mark.xfail(strict=False) automatically
The EASY pipeline now extracts the full ParsedRelease surface from
known-group releases, not just the structural backbone. Behavior is
unchanged for releases that don't carry these tokens.
Pipeline (parser/pipeline.py):
- Structural walk (renamed _annotate_structural): no longer requires
body to be fully consumed. Tokens passed over between schema chunks
remain UNKNOWN so the enricher pass can claim them.
- _find_chunk(): scans forward in the body for the next token matching
a given role, skipping already-annotated tokens. Lets optional and
mandatory chunks both tolerate intercalated enricher tokens.
- _annotate_enrichers(): new non-positional pass. Walks UNKNOWN tokens
and tags AUDIO_CODEC / AUDIO_CHANNELS / BIT_DEPTH / HDR / EDITION /
LANGUAGE. Multi-token sequences from kb.audio / kb.video_meta /
kb.editions are matched first (longest-first ordering preserved from
the YAML), single tokens after.
- _apply_sequences(): mutates the token list, tagging the first token
of a matched sequence with extra['sequence']=<canonical value> and
trailing members with extra['sequence_member']='True' so assemble
skips them.
- _detect_channel_pairs(): handles the '5.1' / '7.1' case where the
'.' separator splits the layout into two tokens. Strips a trailing
'-GROUP' suffix on the second before joining.
Assemble:
- New fields populated: languages (list), audio_codec, audio_channels,
bit_depth, hdr_format, edition. Each role-handler skips
sequence_member tokens.
- media_type heuristic extended: edition in {COMPLETE, INTEGRALE,
COLLECTION} + no season → tv_complete (mirrors legacy).
Tests:
- 4 new TestEnrichers cases covering bit_depth+audio_codec+channels,
HDR sequence + edition sequence + TrueHD.Atmos + 7.1, multi-language
with DTS-HD.MA sequence, TV episode with single language.
- All 14 v2 tests + 30 fixture tests still green. Suite: 1011 passed,
8 skipped.
Refs: project_release_parser_v2_specs (memory)
The annotate-based v2 pipeline now handles releases ending in -KONTRAST,
-ELiTE, or -RARBG. Unknown groups still fall through to the legacy
SHITTY heuristic in services.py — nothing changes for them.
Pipeline (alfred/domain/release/parser/pipeline.py):
- tokenize(): string-ops separator split, strips [site.tag] first.
- annotate(): right-to-left group detection (priority to codec-GROUP
shape, fallback to any non-source dashed token), GroupSchema lookup
via the kb port, then lockstep walk of tokens against schema chunks.
Optional chunks skip on mismatch, mandatory mismatches return None so
the caller falls back gracefully. CODEC pre-consumed by a codec-GROUP
trailing token correctly skips the CODEC chunk in the body walk.
- assemble(): folds annotated tokens into a ParsedRelease-compatible
dict (title joined by '.', group from the codec-GROUP token's extras).
Schema (alfred/domain/release/parser/schema.py):
- GroupSchema + SchemaChunk frozen value objects.
- TokenRole.GROUP added.
Port + adapter:
- ReleaseKnowledge.group_schema(name) lookup added (case-insensitive).
- YamlReleaseKnowledge loads alfred/knowledge/release/release_groups/
*.yaml at construction time; learned overrides in
data/knowledge/release/release_groups/ also picked up.
Knowledge:
- release_groups/kontrast.yaml, elite.yaml, rarbg.yaml declare the
canonical chunk_order. ELiTE marks source as optional (Foundation.S02
has no WEBRip token).
Services:
- parse_release tries the v2 path first; on None falls through to the
legacy implementation untouched.
Tests:
- tests/domain/release/test_parser_v2_easy.py (10 cases) cover group
detection (codec-GROUP, dashed-source skip, no-dash → unknown),
schema-driven annotation (movie, TV episode, season pack with
optional source, unknown group returns None), and field assembly.
- Existing tests/domain/test_release_fixtures.py (30 cases) stay green:
5 EASY fixtures now produced by v2, 25 SHITTY/PATH OF PAIN fixtures
still produced by the legacy path. Verified via spy on v2.assemble.
Suite: 1007 passed, 8 skipped.
Refs: project_release_parser_v2_specs (memory)
New package alfred/domain/release/parser/ lays the foundation for the
release parser refactor (specs in memory). Exposes:
- Token: frozen VO carrying text + stream index + TokenRole + extra dict.
with_role() returns a new instance (no mutation).
- TokenRole: str-backed enum split into structural (TITLE/YEAR/SEASON_EP/
GROUP), technical (RESOLUTION/SOURCE/CODEC/AUDIO_*/BIT_DEPTH/HDR/
EDITION/LANGUAGE), and meta (SITE_TAG/UNKNOWN) families.
- pipeline.strip_site_tag(): pulls a [site.tag] prefix or suffix.
- pipeline.tokenize(): release name -> list[Token] (all UNKNOWN),
string-ops split on kb.separators (no regex, per CLAUDE.md).
- pipeline.annotate(): documented stub. Walk order recorded in docstring
(group right-to-left, then season/episode, year, tech, title).
Legacy parse_release in release.services remains the live implementation
until the annotate step lands. Scaffolding tests verify Token API,
site-tag stripping (prefix/suffix), and tokenize output shape.
Refs: project_release_parser_v2_specs (memory)