fix(release/parser): pre-strip apostrophes so titles like Don't parse cleanly

Apostrophes are in the forbidden-chars list, which made any release
with a title like "Don't" or "L'avare" short-circuit to the AI
fallback (parse_path=ai, everything UNKNOWN). They are now stripped
up front from the name before the well-formed check and tokenize,
so the parse completes normally. The raw name is preserved on the
VO; only the title field loses its apostrophe.

parse_path becomes 'sanitized' when an apostrophe was stripped, to
surface that the parser cleaned something up.

Fixtures updated:
- shitty/honey_uhd_hdr/ — went from total UNKNOWN to a clean parse
  (title=Honey.Dont, year=2025, quality=2160p, source=WEBRip,
  codec=x265, group=Amen).
- path_of_pain/the_prodigy_full_chaos/ — went from total failure to
  partial success (title, year, source, codec extracted). Remaining
  gaps (1080i, multi-word audio, Blu-ray-with-dash) are tracked
  separately in tech debt.
This commit is contained in:
2026-05-20 23:29:10 +02:00
parent 448ef3b79c
commit 621bb96995
4 changed files with 51 additions and 33 deletions
@@ -1,28 +1,26 @@
release_name: "The Prodigy World's on Fire 2011 Blu-ray Remux 1080i AVC DTS-HD MA 5.1 - KRaLiMaRKo.mkv"
# Apocalypse case combining every horror:
# - Unescaped apostrophe ("World's") → forces parse_path="ai" fallback
# - Spaces AND dashes used as separators inconsistently
# - "Blu-ray" with a dash (vs. canonical BluRay)
# - "1080i" interlaced flag (not 1080p)
# - "DTS-HD MA 5.1" multi-word audio codec
# - " - GROUP.mkv" trailing format (space-dash-space before group)
# Apocalypse case combining every horror — partially tamed by the
# apostrophe fix. Remaining gaps (still PoP-worthy):
# - "1080i" interlaced flag (not in quality KB)
# - "Blu-ray" with a dash (vs. canonical BluRay) — recognized as source
# but with the dash form
# - "DTS-HD MA 5.1" multi-word audio codec — the trailing "HD" leaks
# into the group
# - Trailing .mkv extension survives in title
# Result: total degeneration — UNKNOWN across the board, title=raw input.
# Once the apostrophe + multi-word-audio + 1080i are handled this fixture
# should be revisited. For now: anti-regression of the failure shape.
# - " - GROUP" trailing format (space-dash-space before group)
parsed:
title: "The Prodigy World's on Fire 2011 Blu-ray Remux 1080i AVC DTS-HD MA 5.1 - KRaLiMaRKo.mkv"
year: null
title: "The.Prodigy.Worlds.on.Fire"
year: 2011
season: null
episode: null
quality: null
source: null
codec: null
group: "UNKNOWN"
tech_string: ""
media_type: "unknown"
parse_path: "ai"
source: "Blu-ray"
codec: "AVC"
group: "HD"
tech_string: "Blu-ray.AVC"
media_type: "movie"
parse_path: "sanitized"
is_season_pack: false
tree:
+14 -13
View File
@@ -1,21 +1,22 @@
release_name: "Honey.Don't.2025.2160p.WEBRip.DSNP.DV.HDR.x265.EAC3.5.1-Amen"
# Tech debt: the unescaped apostrophe in "Don't" pushes the whole release
# through the AI fallback path (parse_path="ai") and the parse degenerates to
# UNKNOWN across the board. Anti-regression here — once the tokenizer learns
# to handle apostrophes, this fixture should be revisited.
# Apostrophes inside titles ("Don't", "L'avare") used to push the release
# through the AI fallback (parse_path="ai", everything UNKNOWN). They are
# now pre-stripped before well-formed check and tokenize, so the parse
# completes normally — only the title text loses its apostrophe
# ("Honey.Dont").
parsed:
title: "Honey.Don't.2025.2160p.WEBRip.DSNP.DV.HDR.x265.EAC3.5.1-Amen"
year: null
title: "Honey.Dont"
year: 2025
season: null
episode: null
quality: null
source: null
codec: null
group: "UNKNOWN"
tech_string: ""
media_type: "unknown"
parse_path: "ai"
quality: "2160p"
source: "WEBRip"
codec: "x265"
group: "Amen"
tech_string: "2160p.WEBRip.x265"
media_type: "movie"
parse_path: "sanitized"
is_season_pack: false
tree: