fix(release/parser): pre-strip apostrophes so titles like Don't parse cleanly

Apostrophes are in the forbidden-chars list, which made any release with a title like "Don't" or "L'avare" short-circuit to the AI fallback (parse_path=ai, everything UNKNOWN).

They are now stripped up front from the name before the well-formed check and tokenize, so the parse completes normally. The raw name is preserved on the VO; only the title field loses its apostrophe. parse_path becomes 'sanitized' when an apostrophe was stripped, to surface that the parser cleaned something up.

Fixtures updated:
- shitty/honey_uhd_hdr/: went from total UNKNOWN to a clean parse (title=Honey.Dont, year=2025, quality=2160p, source=WEBRip, codec=x265, group=Amen).
- path_of_pain/the_prodigy_full_chaos/: went from total failure to partial success (title, year, source, codec extracted).

Remaining gaps (1080i, multi-word audio, Blu-ray-with-dash) are tracked separately in tech debt.

@@ -23,6 +23,16 @@ callers).
   with intermediate values implied. Fixture
   `shitty/archer_multi_episode/` updated from anti-regression-of-bug
   to anti-regression-of-fix.
+- **Apostrophes in titles no longer push the release through the AI
+  fallback.** `Honey.Don't.2025.2160p.WEBRip.DSNP.DV.HDR.x265-Amen`
+  previously parsed with `parse_path="ai"` and everything UNKNOWN
+  because `'` is in the forbidden-chars list. Apostrophes are now
+  pre-stripped before the well-formed check, so the parse completes
+  normally (`title=Honey.Dont, year=2025, quality=2160p, ...`); only
+  the title text loses its apostrophe. `parse_path` becomes
+  `sanitized` to surface the cleanup. Side win: PoP fixture
+  `the_prodigy_full_chaos/` also moves from total failure to a
+  partially-correct parse (year, source, codec extracted).
 - **Season-range markers (`Sxx-yy`) are now recognized as
   `tv_complete`.** `Der.Tatortreiniger.S01-06.GERMAN...` previously
   parsed as `media_type=movie` with `S01-06` glued onto the title.
@@ -46,7 +46,16 @@ def parse_release(
     """
     parse_path = ParsePath.DIRECT.value
 
-    clean, site_tag = _v2.strip_site_tag(name)
+    # Apostrophes inside titles ("Don't", "L'avare") are common and should
+    # not push the release through the AI fallback. Strip them up front so
+    # both strip_site_tag and tokenize see "Dont" / "Lavare", which is good
+    # enough for token-level matching. The raw name is preserved on the VO.
+    working_name = name
+    if "'" in working_name:
+        working_name = working_name.replace("'", "")
+        parse_path = ParsePath.SANITIZED.value
+
+    clean, site_tag = _v2.strip_site_tag(working_name)
     if site_tag is not None:
         parse_path = ParsePath.SANITIZED.value
 
@@ -77,7 +86,7 @@ def parse_release(
         )
         return parsed, report
 
-    tokens, v2_tag = _v2.tokenize(name, kb)
+    tokens, v2_tag = _v2.tokenize(working_name, kb)
     annotated = _v2.annotate(tokens, kb)
     fields = _v2.assemble(annotated, v2_tag, name, kb)
 
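Reviewer sketch of the ordering change in the two hunks above. This is a minimal, self-contained illustration and not the project's code: `FORBIDDEN_CHARS`, `is_well_formed`, `choose_parse_path`, the `pre_strip` switch, and the `"direct"` string are all hypothetical stand-ins, since the real forbidden-chars list, well-formed check, and `ParsePath` values are not shown in this diff. Only the apostrophe replacement and the `ai` / `sanitized` outcomes mirror what the commit describes.

```python
from __future__ import annotations

# Hypothetical stand-in: the real forbidden-chars list is not shown in this
# diff; the commit only states that "'" is on it.
FORBIDDEN_CHARS = frozenset("'")


def is_well_formed(name: str) -> bool:
    """Hypothetical well-formed check: no forbidden character present."""
    return not any(ch in FORBIDDEN_CHARS for ch in name)


def choose_parse_path(name: str, pre_strip: bool = True) -> tuple[str, str]:
    """Return (working_name, parse_path).

    pre_strip=False imitates the old order (check the raw name);
    pre_strip=True imitates the patched order (strip, then check).
    """
    parse_path = "direct"  # assumed value of ParsePath.DIRECT
    working_name = name
    if pre_strip and "'" in working_name:
        working_name = working_name.replace("'", "")
        parse_path = "sanitized"
    if not is_well_formed(working_name):
        parse_path = "ai"  # short-circuit to the AI fallback
    return working_name, parse_path


raw = "Honey.Don't.2025.2160p.WEBRip.DSNP.DV.HDR.x265.EAC3.5.1-Amen"
print(choose_parse_path(raw, pre_strip=False))  # old order
print(choose_parse_path(raw))                   # patched order
```

Under these assumptions the first call keeps the raw name and lands on `ai`, while the second returns the apostrophe-free name with `sanitized`, which matches the fixture change for `shitty/honey_uhd_hdr/`.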
+16 -18
@@ -1,28 +1,26 @@
 release_name: "The Prodigy World's on Fire 2011 Blu-ray Remux 1080i AVC DTS-HD MA 5.1 - KRaLiMaRKo.mkv"
 
-# Apocalypse case combining every horror:
-# - Unescaped apostrophe ("World's") → forces parse_path="ai" fallback
-# - Spaces AND dashes used as separators inconsistently
-# - "Blu-ray" with a dash (vs. canonical BluRay)
-# - "1080i" interlaced flag (not 1080p)
-# - "DTS-HD MA 5.1" multi-word audio codec
-# - " - GROUP.mkv" trailing format (space-dash-space before group)
+# Apocalypse case combining every horror — partially tamed by the
+# apostrophe fix. Remaining gaps (still PoP-worthy):
+# - "1080i" interlaced flag (not in quality KB)
+# - "Blu-ray" with a dash (vs. canonical BluRay) — recognized as source
+# but with the dash form
+# - "DTS-HD MA 5.1" multi-word audio codec — the trailing "HD" leaks
+# into the group
 # - Trailing .mkv extension survives in title
-# Result: total degeneration — UNKNOWN across the board, title=raw input.
-# Once the apostrophe + multi-word-audio + 1080i are handled this fixture
-# should be revisited. For now: anti-regression of the failure shape.
+# - " - GROUP" trailing format (space-dash-space before group)
 parsed:
-  title: "The Prodigy World's on Fire 2011 Blu-ray Remux 1080i AVC DTS-HD MA 5.1 - KRaLiMaRKo.mkv"
-  year: null
+  title: "The.Prodigy.Worlds.on.Fire"
+  year: 2011
   season: null
   episode: null
   quality: null
-  source: null
-  codec: null
-  group: "UNKNOWN"
-  tech_string: ""
-  media_type: "unknown"
-  parse_path: "ai"
+  source: "Blu-ray"
+  codec: "AVC"
+  group: "HD"
+  tech_string: "Blu-ray.AVC"
+  media_type: "movie"
+  parse_path: "sanitized"
   is_season_pack: false
 
 tree:
+14 -13
@@ -1,21 +1,22 @@
 release_name: "Honey.Don't.2025.2160p.WEBRip.DSNP.DV.HDR.x265.EAC3.5.1-Amen"
 
-# Tech debt: the unescaped apostrophe in "Don't" pushes the whole release
-# through the AI fallback path (parse_path="ai") and the parse degenerates to
-# UNKNOWN across the board. Anti-regression here — once the tokenizer learns
-# to handle apostrophes, this fixture should be revisited.
+# Apostrophes inside titles ("Don't", "L'avare") used to push the release
+# through the AI fallback (parse_path="ai", everything UNKNOWN). They are
+# now pre-stripped before well-formed check and tokenize, so the parse
+# completes normally — only the title text loses its apostrophe
+# ("Honey.Dont").
 parsed:
-  title: "Honey.Don't.2025.2160p.WEBRip.DSNP.DV.HDR.x265.EAC3.5.1-Amen"
-  year: null
+  title: "Honey.Dont"
+  year: 2025
   season: null
   episode: null
-  quality: null
-  source: null
-  codec: null
-  group: "UNKNOWN"
-  tech_string: ""
-  media_type: "unknown"
-  parse_path: "ai"
+  quality: "2160p"
+  source: "WEBRip"
+  codec: "x265"
+  group: "Amen"
+  tech_string: "2160p.WEBRip.x265"
+  media_type: "movie"
+  parse_path: "sanitized"
   is_season_pack: false
 
 tree:
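Reviewer sketch for reading the fixture change above as a field-level delta. The helper name `changed_fields` is hypothetical and not part of the repository; the two dicts simply transcribe the old and new `parsed:` blocks of the Honey fixture, with YAML `null` written as `None`.

```python
def changed_fields(before: dict, after: dict) -> dict:
    """Map each field whose value differs to its (before, after) pair."""
    return {
        key: (before.get(key), after.get(key))
        for key in sorted(before.keys() | after.keys())
        if before.get(key) != after.get(key)
    }


# Old expectations: the degenerate AI-fallback shape.
honey_before = {
    "title": "Honey.Don't.2025.2160p.WEBRip.DSNP.DV.HDR.x265.EAC3.5.1-Amen",
    "year": None, "season": None, "episode": None,
    "quality": None, "source": None, "codec": None,
    "group": "UNKNOWN", "tech_string": "", "media_type": "unknown",
    "parse_path": "ai", "is_season_pack": False,
}

# New expectations after the apostrophe pre-strip.
honey_after = {
    "title": "Honey.Dont",
    "year": 2025, "season": None, "episode": None,
    "quality": "2160p", "source": "WEBRip", "codec": "x265",
    "group": "Amen", "tech_string": "2160p.WEBRip.x265", "media_type": "movie",
    "parse_path": "sanitized", "is_season_pack": False,
}

for field, (old, new) in changed_fields(honey_before, honey_after).items():
    print(f"{field}: {old!r} -> {new!r}")
```

Nine of the twelve fields change; `season`, `episode`, and `is_season_pack` stay as they were, which is consistent with the release being a movie.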