fix(release/parser): pre-strip apostrophes so titles like Don't parse cleanly

Apostrophes are in the forbidden-chars list, which made any release with a title like "Don't" or "L'avare" short-circuit to the AI fallback (parse_path=ai, everything UNKNOWN).

They are now stripped up front from the name, before the well-formed check and tokenize, so the parse completes normally. The raw name is preserved on the VO; only the title field loses its apostrophe. parse_path becomes 'sanitized' when an apostrophe was stripped, to surface that the parser cleaned something up.

Fixtures updated:
- shitty/honey_uhd_hdr/: went from total UNKNOWN to a clean parse (title=Honey.Dont, year=2025, quality=2160p, source=WEBRip, codec=x265, group=Amen).
- path_of_pain/the_prodigy_full_chaos/: went from total failure to partial success (title, year, source, codec extracted). Remaining gaps (1080i, multi-word audio, Blu-ray-with-dash) are tracked separately in tech debt.
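
For illustration only, the pre-strip step described above could look roughly like the sketch below. This is not the code from this commit: the names `SanitizedName` and `pre_strip_apostrophes` and the default parse_path value "direct" are hypothetical, and only the straight apostrophe is handled, since the commit does not show the actual forbidden-chars list.

```python
from dataclasses import dataclass

# Assumption: only the straight apostrophe is stripped; the real
# forbidden-chars list is not shown in this commit.
APOSTROPHE = "'"


@dataclass(frozen=True)
class SanitizedName:
    """Hypothetical holder for the result of the pre-strip step."""
    raw: str         # original release name, preserved untouched
    cleaned: str     # name passed on to the well-formed check and tokenizer
    parse_path: str  # "sanitized" if something was stripped, else "direct" (assumed default)


def pre_strip_apostrophes(name: str) -> SanitizedName:
    """Remove apostrophes before any further parsing and record that we did."""
    cleaned = name.replace(APOSTROPHE, "")
    parse_path = "sanitized" if cleaned != name else "direct"
    return SanitizedName(raw=name, cleaned=cleaned, parse_path=parse_path)


if __name__ == "__main__":
    result = pre_strip_apostrophes(
        "Honey.Don't.2025.2160p.WEBRip.DSNP.DV.HDR.x265.EAC3.5.1-Amen"
    )
    print(result.cleaned)     # the name without the apostrophe
    print(result.parse_path)  # "sanitized"
```

Keeping `raw` next to `cleaned` mirrors the statement that the raw name stays on the VO while only the parsed title loses its apostrophe.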
+14 -13
@@ -1,21 +1,22 @@
 release_name: "Honey.Don't.2025.2160p.WEBRip.DSNP.DV.HDR.x265.EAC3.5.1-Amen"
 
-# Tech debt: the unescaped apostrophe in "Don't" pushes the whole release
-# through the AI fallback path (parse_path="ai") and the parse degenerates to
-# UNKNOWN across the board. Anti-regression here — once the tokenizer learns
-# to handle apostrophes, this fixture should be revisited.
+# Apostrophes inside titles ("Don't", "L'avare") used to push the release
+# through the AI fallback (parse_path="ai", everything UNKNOWN). They are
+# now pre-stripped before well-formed check and tokenize, so the parse
+# completes normally — only the title text loses its apostrophe
+# ("Honey.Dont").
 parsed:
-  title: "Honey.Don't.2025.2160p.WEBRip.DSNP.DV.HDR.x265.EAC3.5.1-Amen"
-  year: null
+  title: "Honey.Dont"
+  year: 2025
   season: null
   episode: null
-  quality: null
-  source: null
-  codec: null
-  group: "UNKNOWN"
-  tech_string: ""
-  media_type: "unknown"
-  parse_path: "ai"
+  quality: "2160p"
+  source: "WEBRip"
+  codec: "x265"
+  group: "Amen"
+  tech_string: "2160p.WEBRip.x265"
+  media_type: "movie"
+  parse_path: "sanitized"
   is_season_pack: false
 
 tree: