fix(release/parser): drop pure-punctuation TITLE tokens at assembly

Releases using ' - ' as a separator (Vinyl - 1x01 - FHD) tokenize to
['Vinyl', '-', '1x01', '-', 'FHD']. The standalone '-' tokens ended up
in title_parts and leaked into the joined title ('Vinyl.-'). We can't
add '-' to the separator list (it would break codec-GROUP), so we
filter at assembly: a TITLE token with no alphanumeric characters
carries no title content and is dropped.

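As a rough sketch of the assembly-time filter described above: the `Token` and `TokenRole` definitions below are simplified stand-ins written for this example, not the parser's real types, and the role of each token is assigned by hand rather than by the actual annotator.

```python
from dataclasses import dataclass
from enum import Enum, auto


class TokenRole(Enum):
    # Hypothetical, reduced role set for illustration only.
    TITLE = auto()
    OTHER = auto()


@dataclass
class Token:
    text: str
    role: TokenRole


def title_from(annotated: list[Token]) -> str:
    """Join TITLE tokens with '.', skipping tokens with no alphanumerics."""
    parts = [
        t.text
        for t in annotated
        if t.role is TokenRole.TITLE and any(c.isalnum() for c in t.text)
    ]
    return ".".join(parts)


# 'Vinyl - 1x01 - FHD', with roles assigned by hand for the example.
annotated = [
    Token("Vinyl", TokenRole.TITLE),
    Token("-", TokenRole.TITLE),  # the stray dash that used to leak
    Token("1x01", TokenRole.OTHER),
    Token("-", TokenRole.OTHER),
    Token("FHD", TokenRole.OTHER),
]
print(title_from(annotated))  # Vinyl
```

Filtering at this point leaves the tokenizer untouched, so hyphenated tokens such as codec-GROUP are still seen whole by the earlier stages.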
Side win: the same logic removes the UTF-8 wide pipe '｜' (U+FF5C) from
the khruangbin_yt_wide_pipe fixture title.

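The check relies on `str.isalnum()`, which is Unicode-aware: the fullwidth pipe is classified as a symbol, while accented and non-Latin letters still count as alphanumeric. A small illustration follows; the helper name `has_title_content` is made up for this example and does not appear in the commit.

```python
def has_title_content(text: str) -> bool:
    """True if the token contains at least one letter or digit."""
    return any(c.isalnum() for c in text)


samples = ["-", "\uff5c", "|", "Khruangbin", "2024", "Amélie", "東京", "&"]
for token in samples:
    # Print each token alongside whether the filter would keep it.
    print(repr(token), has_title_content(token))
```

One consequence worth keeping in mind: any standalone TITLE token made only of symbols, such as a lone `&`, would be dropped by the same rule.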
Fixtures updated:
- shitty/vinyl_1x01_format/expected.yaml (title: Vinyl.- → Vinyl)
- path_of_pain/khruangbin_yt_wide_pipe/expected.yaml (| dropped)
@@ -23,6 +23,12 @@ callers).
   with intermediate values implied. Fixture
   `shitty/archer_multi_episode/` updated from anti-regression-of-bug
   to anti-regression-of-fix.
+- **Pure-punctuation TITLE tokens are dropped at assembly.** Releases
+  with surrounding ` - ` separators (`Vinyl - 1x01 - FHD`) previously
+  produced `title="Vinyl.-"`. Such tokens (a stray dash, a wide pipe
+  `｜`, …) carry no title content and are now filtered out. Side
+  effect: PoP fixture `khruangbin_yt_wide_pipe/` also benefits — the
+  YouTube wide-pipe no longer leaks into the title.
 
 ### Added
 

@@ -618,7 +618,14 @@ def assemble(
     layer in additional fields (``parse_path``, ``raw``, …) before
     instantiation.
     """
-    title_parts = [t.text for t in annotated if t.role is TokenRole.TITLE]
+    # Pure-punctuation tokens (e.g. a stray "-" left by ` - ` separators in
+    # human-friendly release names) carry no title content and would leak
+    # into the joined title as ``"Show.-.Episode"``. Drop them here.
+    title_parts = [
+        t.text
+        for t in annotated
+        if t.role is TokenRole.TITLE and any(c.isalnum() for c in t.text)
+    ]
     title = ".".join(title_parts) if title_parts else (
         annotated[0].text if annotated else raw_name
     )

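One edge case the `assemble` hunk leaves implicit: if every TITLE token is pure punctuation, `title_parts` is empty and the existing fallback takes over, so the title becomes `annotated[0].text` (which may itself be punctuation) or `raw_name`. A minimal sketch of that behaviour, with tokens reduced to hypothetical `(text, is_title)` pairs rather than the parser's real annotated tokens:

```python
def assemble_title(annotated: list[tuple[str, bool]], raw_name: str) -> str:
    """Mirror the filter-then-fallback expression from the diff."""
    title_parts = [
        text
        for text, is_title in annotated
        if is_title and any(c.isalnum() for c in text)
    ]
    # Same fallback chain as the diff: first token's text, else the raw name.
    return ".".join(title_parts) if title_parts else (
        annotated[0][0] if annotated else raw_name
    )


print(assemble_title([("Vinyl", True), ("-", True)], "Vinyl -"))  # Vinyl
print(assemble_title([("-", True), ("-", True)], "- -"))          # -
print(assemble_title([], "raw.name"))                             # raw.name
```

Whether the fallback should also skip punctuation-only tokens is a separate question that this commit does not address.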
@@ -1,13 +1,15 @@
 release_name: "Khruangbin ｜ Austin City Limits Music Festival 2024 ｜ Full Set [V_-7WWPPeBs].webm"
 
 # yt-dlp slug: UTF-8 wide pipe '｜' (U+FF5C, not the ASCII '|'), trailing
-# YouTube video ID in brackets, .webm extension. Parser extracts the year
-# (2024) correctly but mistakes the YouTube ID '7WWPPeBs' for a release
-# group, and the wide pipe survives the tokenizer (not a separator).
+# YouTube video ID in brackets, .webm extension. The wide pipe survives
+# the tokenizer (not a separator) but is now dropped at title assembly
+# (pure-punctuation TITLE tokens carry no content). Year (2024) parses
+# correctly; the YouTube ID '7WWPPeBs' is still mistaken for a release
+# group (separate gap, see PoP backlog).
 # This is a concert recording — closer to "live music" than "movie", but
 # media_type=movie is the current degenerate best guess.
 parsed:
-  title: "Khruangbin.｜.Austin.City.Limits.Music.Festival"
+  title: "Khruangbin.Austin.City.Limits.Music.Festival"
   year: 2024
   season: null
   episode: null

@@ -1,11 +1,12 @@
 release_name: "Vinyl - 1x01 - FHD"
 
-# Tech debt: surrounding ' - ' separators leave a stray '-' token attached
-# to the title ("Vinyl.-"). NxNN form correctly identifies S01E01; everything
-# tech-side empty (no quality token in KB — "FHD" not yet known). Anti-regression
-# the current degenerate title so a future fix is intentional.
+# Surrounding ' - ' separators in human-friendly release names left stray
+# '-' tokens attached to the title. They are now dropped at assembly time
+# (pure-punctuation TITLE tokens carry no content). NxNN form correctly
+# identifies S01E01; tech-side stays empty (no quality token in KB — "FHD"
+# not yet known).
 parsed:
-  title: "Vinyl.-"
+  title: "Vinyl"
   year: null
   season: 1
   episode: 1