3dc73a5214
CJK release names sometimes use the fullwidth vertical bar as a token separator, as do occasional decorative YouTube-style uploads. Adding the codepoint to separators.yaml lets the tokenizer split on it instead of leaving the wide pipe glued onto an adjacent token. The tokenizer in alfred/domain/release/parser/pipeline.py iterates the separator list as plain strings (no regex), so a multi-byte UTF-8 separator works without any code change.
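The plain-string behaviour described above can be illustrated with a minimal sketch. This is not the actual code in alfred/domain/release/parser/pipeline.py: the `tokenize` function and the `SEPARATORS` constant are hypothetical names, and the list is hardcoded here rather than loaded from separators.yaml. It only shows why a separator such as U+FF5C needs no special handling when each entry is applied as an ordinary string.

```python
# Hypothetical sketch, assuming the tokenizer splits on each separator
# as a literal string. Names below are illustrative, not the project's API.
SEPARATORS = [".", " ", "[", "]", "(", ")", "_", "\uff5c"]  # last entry: U+FF5C


def tokenize(name: str, separators: list[str] = SEPARATORS) -> list[str]:
    """Split a release name on every separator string, dropping empty tokens."""
    tokens = [name]
    for sep in separators:
        # str.split operates on code points, so a separator that takes
        # several bytes in UTF-8 behaves like any other one-character string.
        tokens = [part for tok in tokens for part in tok.split(sep)]
    return [tok for tok in tokens if tok]


print(tokenize("Show.S01E01\uff5c1080p"))  # ['Show', 'S01E01', '1080p']
```

Because every entry is applied independently, adding a new one-character separator to the list is enough for it to take effect in this sketch.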
25 lines
1.1 KiB
YAML
# Token separators encountered in release names.
#
# Used by parse_release() to tokenize a release name into atomic tokens before
# applying token-level matchers (resolutions, codecs, languages, season/episode
# markers, etc.).
#
# Why a YAML file and not hardcoded:
# - Different scene/p2p/site conventions evolve over time (brackets from YTS,
#   parens from some retro packs, underscores from older releases).
# - Lets us extend without code change when a new convention shows up.
#
# Caveats:
# - "." is always present because it's the canonical scene separator. Removing
#   it would break ~everything.
# - Order does not matter: the tokenizer applies each entry as a plain-string
#   separator (no regex).

separators:
  - "."   # canonical scene form: Show.S01E01.1080p
  - " "   # human-friendly form: The Father (2020) 1080p
  - "["   # bracket-prefixed/embedded: [1080p] [WEBRip] [YTS.MX]
  - "]"
  - "("   # parenthesis-embedded (year, edition): (2020) (Director's Cut)
  - ")"
  - "_"   # underscore-as-space (old usenet, some Asian releases)
  - "|"  # fullwidth vertical bar U+FF5C (CJK release names, occasional decorative use)