alfred/alfred/knowledge/release/separators.yaml
francwa 3dc73a5214 feat(release): add fullwidth vertical bar | (U+FF5C) to separators
CJK release names sometimes use the fullwidth vertical bar as a token
separator, as do occasional decorative YouTube-style uploads. Adding
the codepoint to separators.yaml lets the tokenizer split on it
instead of leaving the wide pipe glued onto an adjacent token.

The tokenizer in alfred/domain/release/parser/pipeline.py iterates
the separator list as plain strings (no regex), so a multi-byte
UTF-8 separator works without any code change.
2026-05-21 08:05:56 +02:00

25 lines
1.1 KiB
YAML

# Token separators encountered in release names.
#
# Used by parse_release() to tokenize a release name into atomic tokens before
# applying token-level matchers (resolutions, codecs, languages, season/episode
# markers, etc.).
#
# Why a YAML and not hardcoded:
# - Different scene/p2p/site conventions evolve over time (brackets from YTS,
# parens from some retro packs, underscores from older releases).
# - Lets us extend without code change when a new convention shows up.
#
# Caveats:
# - "." is always present because it's the canonical scene separator. Removing
# it would break ~everything.
# - Order does not matter: the tokenizer splits on each entry as a plain string (no regex).
separators:
- "." # canonical scene form: Show.S01E01.1080p
- " " # human-friendly form: The Father (2020) 1080p
- "[" # bracket-prefixed/embedded: [1080p] [WEBRip] [YTS.MX]
- "]"
- "(" # parenthesis-embedded (year, edition): (2020) (Director's Cut)
- ")"
- "_" # underscore-as-space (old usenet, some Asian releases)
- "|" # fullwidth vertical bar U+FF5C (CJK release names, occasional decorative use)
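
A minimal sketch of how a plain-string tokenizer could consume the list above. This is not the code in alfred/domain/release/parser/pipeline.py: the names `SEPARATORS` and `tokenize` are hypothetical, and the list is hardcoded here rather than loaded from separators.yaml. It only illustrates why a multi-byte separator such as U+FF5C needs no special handling when separators are matched literally.

```python
# Hypothetical illustration only; the real tokenizer may differ.
# Mirrors the entries of separators.yaml, with U+FF5C written as an escape
# so it cannot be confused with the ASCII pipe.
SEPARATORS = (".", " ", "[", "]", "(", ")", "_", "\uFF5C")


def tokenize(name, separators=SEPARATORS):
    """Split a release name on every separator, treating each as a plain string."""
    tokens = [name]
    for sep in separators:
        # str.split matches the separator literally, so no regex escaping is
        # needed and multi-byte characters behave like any other string.
        tokens = [piece for token in tokens for piece in token.split(sep)]
    # Adjacent separators produce empty pieces; drop them.
    return [token for token in tokens if token]


print(tokenize("Show.S01E01.1080p"))        # ['Show', 'S01E01', '1080p']
print(tokenize("The Father (2020) 1080p"))  # ['The', 'Father', '2020', '1080p']
print(tokenize("Title\uFF5C1080p"))         # ['Title', '1080p']
```

Because every entry is applied to every piece, the result does not depend on the order of the list.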