8 Commits

Author SHA1 Message Date
francwa 621bb96995 fix(release/parser): pre-strip apostrophes so titles like Don't parse cleanly
Apostrophes are in the forbidden-chars list, which made any release
with a title like "Don't" or "L'avare" short-circuit to the AI
fallback (parse_path=ai, everything UNKNOWN). They are now stripped
up front from the name before the well-formed check and tokenize,
so the parse completes normally. The raw name is preserved on the
VO; only the title field loses its apostrophe.

parse_path becomes 'sanitized' when an apostrophe was stripped, to
surface that the parser cleaned something up.

Fixtures updated:
- shitty/honey_uhd_hdr/ — went from total UNKNOWN to a clean parse
  (title=Honey.Dont, year=2025, quality=2160p, source=WEBRip,
  codec=x265, group=Amen).
- path_of_pain/the_prodigy_full_chaos/ — went from total failure to
  partial success (title, year, source, codec extracted). Remaining
  gaps (1080i, multi-word audio, Blu-ray-with-dash) are tracked
  separately in tech debt.
2026-05-20 23:29:10 +02:00
francwa 448ef3b79c fix(release/parser): recognize Sxx-yy season range as tv_complete
`Der.Tatortreiniger.S01-06.GERMAN...` previously parsed as a movie
with 'S01-06' glued to the title. The parser now matches the
season-range form in _parse_season_episode (returning season=first,
episode=None), and the assemble step detects the range token to
promote media_type to 'tv_complete'.

The first season is exposed as `season` so `is_season_pack`
fires (season is not None and episode is None) — useful for routing
to a series root folder.

Fixture shitty/tatortreiniger_flat_multiseason/ updated:
- title: Der.Tatortreiniger.S01-06 → Der.Tatortreiniger
- season: null → 1
- media_type: movie → tv_complete
- is_season_pack: false → true
2026-05-20 23:26:40 +02:00
francwa b1c7f35ffb fix(release/parser): drop pure-punctuation TITLE tokens at assembly
Releases using ' - ' as a separator (Vinyl - 1x01 - FHD) tokenize to
['Vinyl', '-', '1x01', '-', 'FHD'] — the standalone '-' tokens were
ending up in title_parts and leaked into the joined title
('Vinyl.-'). We can't add '-' to the separator list (it would break
codec-GROUP), so we filter at assembly: a TITLE token with no
alphanumeric characters carries no title content.

Side win: same logic eliminates the UTF-8 wide-pipe '|' from the
khruangbin_yt_wide_pipe fixture title.

Fixtures updated:
- shitty/vinyl_1x01_format/expected.yaml (title: Vinyl.- → Vinyl)
- path_of_pain/khruangbin_yt_wide_pipe/expected.yaml (| dropped)
2026-05-20 23:24:40 +02:00
francwa 5bbdc9081f fix(release/parser): collapse chained multi-episode markers to full range
S14E09E10E11 previously parsed to episode=9, episode_end=10 — E11
was silently dropped. The parser now takes episodes[-1] as
episode_end so the full chain is captured (episode=9, episode_end=11).
Intermediate values stay implied.

Fixture shitty/archer_multi_episode/ updated from anti-regression of
the bug to anti-regression of the fix.
2026-05-20 23:23:08 +02:00
francwa 3737f66851 refactor(release): simplify SHITTY to dict-driven token tagging
Replace the ~480-line legacy heuristic block in services.py with a
small dict-driven pass in pipeline._annotate_shitty: each token is
looked up against the kb buckets (resolutions / sources / codecs /
distributors / year / sxxexx) with first-match-wins semantics, the
leftmost contiguous UNKNOWN run becomes the title, done.

SHITTY's scope is intentionally narrow — releases that *look* like
scene names but don't have a registered group schema. Anything more
exotic (parenthesized tech, bare-dashed title fragments, YT slugs,
franchise boxes) is PATH OF PAIN territory and stays out of here.

- annotate() no longer returns None; SHITTY is the always-on fallback
- services.py shrunk from ~525 to ~85 lines (legacy extractors gone)
- 4 fixtures get xfail markers documenting PoP-grade pathologies
  (deutschland franchise box, sleaford YT slug, super_mario bilingual,
  predator space-separators — the last one moved from shitty/ → pop/)
- ReleaseFixture grows xfail_reason; the parametrized suite wires the
  pytest.mark.xfail(strict=False) automatically
2026-05-20 01:03:25 +02:00
francwa fd3bd1ad8c feat(release): distinguish streaming distributors from sources
Introduce a separate dimension for streaming-platform tags (NF, AMZN,
DSNP, HMAX, ATVP, …) so they stop polluting the encoding-source field.
WEB-DL is the source; the platform that released it is the distributor.

- new distributors.yaml knowledge file
- ReleaseKnowledge port exposes distributors set
- TokenRole.DISTRIBUTOR + ParsedRelease.distributor field
- removed NF/AMZN/DSNP/HMAX/ATVP from sources.yaml
- notre_planete fixture now records distributor: NF
2026-05-20 01:03:11 +02:00
francwa c1831e3f46 test(fixtures): drop derry_duplicate_naming (was a copy-paste artifact)
The release name mixed two distinct releases — not a real-world case
worth anti-regression. SHITTY bucket now holds 14 fixtures (down from 15).
2026-05-18 15:51:11 +02:00
francwa aa182458b8 test(fixtures): seed SHITTY release bucket with 15 anti-regression cases
Add 15 expected.yaml fixtures under tests/fixtures/releases/shitty/
covering the awkward but real-world release names from the downloads
folder. Each fixture locks in the current parse_release behavior so
future parser changes are intentional, not silent drift.

Cases captured:
- Angel INTEGRALE 3-level hierarchy (tv_complete media_type)
- Buffy custom French title with dots preserved
- Archer S14E09E10E11 multi-episode (E11 lost — tech debt)
- Notre Planète lowercase s01e01
- Vinyl ' - 1x01 - FHD' (stray dash artifact — tech debt)
- Deutschland.83 (year-suffix as part of title)
- Tatortreiniger S01-06 range (falls to movie — tech debt)
- Derry Girls duplicated title
- Jurassic Park bare folder (media_type=unknown)
- La Nuit au Musée bilingual MULTI
- Chérie j'ai agrandi (ASCII-stripped apostrophe, parses fine)
- Honey Don't (unescaped apostrophe — full AI-path degeneration)
- Hook MULTi.SUBS movie with Subs/ folder
- Predator Badlands space separators (group=UNKNOWN — tech debt)
- Westworld S04 Subs.Only (no video file)

Each fixture also captures the future 3-flow routing (library /
torrents / seed_hardlinks) ahead of the organize_media refactor.

Suite: 1011 passed, 8 skipped.
2026-05-18 15:48:41 +02:00