51 Commits

Author SHA1 Message Date
francwa 02e478a157 refactor(domain): freeze Movie and Episode, switch track collections to tuple
Movie and Episode become @dataclass(frozen=True, eq=False), with
audio_tracks/subtitle_tracks held as tuple[...] instead of list[...].
Identity-based equality is preserved via the existing __eq__/__hash__.
__post_init__ coercion (imdb_id, title, season_number, episode_number)
uses object.__setattr__ to stay compatible with frozen.

The MediaWithTracks mixin contract is updated to tuple accordingly.

Callers projecting enrichment results (probe output, file metadata) now
rebuild via dataclasses.replace(...) — same pattern recently adopted for
ParsedRelease.

Season and TVShow stay mutable for now: freezing the aggregate root
would cascade a full reconstruction on every add_episode, deferred.
2026-05-21 13:40:22 +02:00
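A minimal sketch of the freeze described in 02e478a157, assuming a trimmed-down Movie whose identity is its imdb_id. The field list, the coercion rules, and the identity key are illustrative assumptions; the real entity is not shown in this log.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True, eq=False)
class Movie:
    # Hypothetical, reduced field set for illustration only.
    imdb_id: str
    title: str
    audio_tracks: tuple[str, ...] = ()

    def __post_init__(self) -> None:
        # A frozen dataclass rejects normal assignment, so coercion
        # goes through object.__setattr__.
        object.__setattr__(self, "imdb_id", self.imdb_id.strip().lower())
        object.__setattr__(self, "audio_tracks", tuple(self.audio_tracks))

    # eq=False keeps these hand-written identity methods in charge.
    def __eq__(self, other: object) -> bool:
        return isinstance(other, Movie) and self.imdb_id == other.imdb_id

    def __hash__(self) -> int:
        return hash(self.imdb_id)


movie = Movie(imdb_id=" TT0133093 ", title="The Matrix")
# Enrichment rebuilds the entity instead of mutating it.
enriched = replace(movie, audio_tracks=("eng", "fre"))
```

Because equality is keyed on identity here, `enriched == movie` still holds even though the track tuples differ.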
francwa 3dc73a5214 feat(release): add fullwidth vertical bar ｜ (U+FF5C) to separators
CJK release names sometimes use the fullwidth vertical bar as a token
separator, as do occasional decorative YouTube-style uploads. Adding
the codepoint to separators.yaml lets the tokenizer split on it
instead of leaving the wide pipe glued onto an adjacent token.

The tokenizer in alfred/domain/release/parser/pipeline.py iterates
the separator list as plain strings (no regex), so a multi-byte
UTF-8 separator works without any code change.
2026-05-21 08:05:56 +02:00
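The plain-string split from 3dc73a5214 can be sketched as follows. The separator tuple is an assumed subset of separators.yaml, and `tokenize_sketch` is a hypothetical stand-in for the real tokenizer in pipeline.py.

```python
# Assumed subset of separators.yaml; "\uff5c" is the fullwidth vertical bar.
SEPARATORS = (".", "_", " ", "\uff5c")


def tokenize_sketch(name: str, separators: tuple[str, ...] = SEPARATORS) -> list[str]:
    """Split a release name on each separator in turn, without regex."""
    tokens = [name]
    for sep in separators:
        # str.split operates on code points, so a separator that is
        # multi-byte in UTF-8 needs no special handling.
        tokens = [piece for tok in tokens for piece in tok.split(sep)]
    # Drop the empty strings produced by adjacent separators.
    return [tok for tok in tokens if tok]


print(tokenize_sketch("Khruangbin\uff5cLive.2020"))
```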
francwa 88f156b7a4 refactor(subtitles): rename SubtitleCandidate → SubtitleScanResult
The old name conflated 'might become a placed subtitle' with 'what a
scan pass produced'. The class is the output of a scan/identify pass —
language/format may still be None while classification is in progress,
confidence reflects classifier certainty, raw_tokens holds filename
fragments under analysis. SubtitleScanResult says that directly.

Pure rename + refreshed docstring; no behavior change. Touches the
domain entity, the matcher/identifier/utils services, the
manage_subtitles use case, the placer, the metadata store, the
shared-media cross-ref comment, and 7 test modules.
2026-05-21 08:05:46 +02:00
francwa 5107cb32c0 feat(release): InspectedResult.recommended_action centralizes exclusion decision
Add a derived 'recommended_action' property on InspectedResult that
collapses the orchestrator's go / wait / skip decision into one value:

- 'skip'      → no main_video, or media_type == 'other'
- 'ask_user'  → media_type == 'unknown', or road == 'path_of_pain'
- 'process'   → confident parse with a main video on disk

The ordering is part of the contract (skip > ask_user > process) —
documented in the property docstring.

Until now every consumer (workflows, the agent, the orchestrator
sketch) had to re-derive this from the road / media_type / main_video
triple, with subtle drift between sites. One place, one rule.

Exposed through the analyze_release tool so the LLM can route on it.
Spec YAML updated to describe the new field.

Suite: 1083 passed (+6 new tests in tests/application/test_inspect.py
covering the four branches and the precedence rules).
2026-05-21 07:54:17 +02:00
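The precedence rule from 5107cb32c0 (skip > ask_user > process) can be sketched like this. The class below is a reduced, hypothetical InspectedResult holding only the three inputs the commit names; the real VO carries more fields.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Optional


@dataclass(frozen=True)
class InspectedResultSketch:
    media_type: str             # e.g. "movie", "unknown", "other"
    road: str                   # "easy", "shitty" or "path_of_pain"
    main_video: Optional[Path]  # None when no video was found

    @property
    def recommended_action(self) -> str:
        # Order matters: skip is checked first, then ask_user.
        if self.main_video is None or self.media_type == "other":
            return "skip"
        if self.media_type == "unknown" or self.road == "path_of_pain":
            return "ask_user"
        return "process"


video = Path("/dl/Some.Movie.2021.mkv")
print(InspectedResultSketch("movie", "easy", video).recommended_action)
```

A result that is both `unknown` and missing its main video resolves to `skip`, which is the precedence the commit message documents.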
francwa b7979c0f8b refactor(release): freeze ParsedRelease + enrich_from_probe returns new instance
ParsedRelease is now @dataclass(frozen=True). The enrichment passes that
used to patch fields in place now produce new instances:

- enrich_from_probe(parsed, info, kb) returns a new ParsedRelease via
  dataclasses.replace (no allocation when no field changed).
- inspect_release rebinds 'parsed' after detect_media_type (wrapped in
  MediaTypeToken — the strict isinstance check now also runs on
  replace) and after enrich_from_probe.

languages becomes a tuple[str, ...] so the VO is properly immutable.
Parser pipeline packs languages as a tuple in the assemble dict.

Callers updated: inspect_release, testing/recognize_folders_in_downloads.py.
Tests updated: 22 enrich_from_probe call sites rebound, language
assertions switched to tuple literals, test_release_fixtures normalizes
result['languages'] back to list for YAML-fixture comparison.

Suite: 1077 passed.
2026-05-21 07:51:49 +02:00
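A rough sketch of the return-a-new-instance contract from b7979c0f8b. The reduced field set and the `enrich_sketch` name and signature are assumptions; the real enrich_from_probe takes (parsed, info, kb) and handles more fields.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class ParsedReleaseSketch:
    title: str
    quality: Optional[str] = None
    codec: Optional[str] = None
    languages: tuple[str, ...] = ()


def enrich_sketch(parsed: ParsedReleaseSketch, probe: dict) -> ParsedReleaseSketch:
    """Fill only the fields that are still None; never overwrite parsed values."""
    updates = {
        name: value
        for name, value in probe.items()
        if value is not None and getattr(parsed, name) is None
    }
    # Hand back the very same object when nothing changed.
    return replace(parsed, **updates) if updates else parsed


parsed = ParsedReleaseSketch(title="Some.Movie", quality="1080p")
parsed = enrich_sketch(parsed, {"quality": "720p", "codec": "x265"})  # rebind
```

Callers have to rebind the name, as the 22 updated test call sites do; the original object is never modified.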
francwa 9f1ce94690 refactor(application): inject kb/prober into resolve_destination use cases
Remove the module-level _KB / _PROBER singletons from
alfred/application/filesystem/resolve_destination.py. The four
resolve_{season,episode,movie,series}_destination use cases now take
kb: ReleaseKnowledge and prober: MediaProber as required arguments,
matching the shape of inspect_release.

The singletons now live at the agent-tools frontier
(alfred/agent/tools/filesystem.py), where the LLM-facing wrappers
instantiate YamlReleaseKnowledge / FfprobeMediaProber once and thread
them through. The wrappers' Python signatures are unchanged — the
inspect-based JSON-schema generator in agent/registry.py still sees the
same LLM-passable params.

analyze_release drops the dirty 'from ... import _KB' indirection.

Tests inject their own stubs by keyword (prober=_StubProber(...)) via
thin convenience wrappers, replacing the prior
monkeypatch.setattr(rd, '_PROBER', ...) pattern.

testing/debug_release.py: instantiate YamlReleaseKnowledge() /
FfprobeMediaProber() inline at the two call sites.

Suite: 1077 passed.
2026-05-21 07:46:13 +02:00
francwa 5e0ed11672 refactor(release): rename ParsePath enum to TokenizationRoute
ParsePath collided with pathlib.Path in mental models, and was one
letter from the parse_path attribute that stores its value — confusion
on confusion. Road (EASY/SHITTY/PATH_OF_PAIN) is the parser-confidence
axis; TokenizationRoute (DIRECT/SANITIZED/AI) is the tokenization-method
axis. They're orthogonal and the new name makes that obvious.

Field name parse_path stays — it's the right name for the attribute
that *holds* the route. String values ("direct", "sanitized", "ai")
stay too, so YAML fixtures and the analyze_release tool spec are
unchanged. Only the type symbol changes:

- value_objects.py: class rename + docstring spelling out orthogonality
  with Road.
- services.py: 3 call sites.
- scoring.py: docstring cross-reference updated.
- tests/domain/release/test_parser_v2_scoring.py: import + 3 call sites.
2026-05-21 07:39:42 +02:00
francwa 0246f85ef8 refactor(release): move codec mappings from code to YAML knowledge
The three module-level dicts in enrich_from_probe (ffprobe codec name
to scene token, channel count to layout) were exactly the kind of
domain lookup table CLAUDE.md says belongs in YAML, not in Python.
Move them to alfred/knowledge/release/probe_mappings.yaml, load
through a new ReleaseKnowledge.probe_mappings port field, and add a
kb parameter to enrich_from_probe so the consumer reads the maps via
the same injection pattern as everything else.

- New knowledge file: alfred/knowledge/release/probe_mappings.yaml
- New loader: load_probe_mappings() in infrastructure/knowledge/release.py
  (normalizes channel-count keys back to int).
- Port: ReleaseKnowledge gains probe_mappings: dict.
- Adapter: YamlReleaseKnowledge populates it at __init__.
- Consumer: enrich_from_probe(parsed, info, kb) reads the three sub-maps
  from kb.probe_mappings; unknown codecs still fall back to uppercase
  raw value, same behaviour as before.
- Call sites updated: inspect_release passes kb through; the testing
  script gets its kb wiring (it had been broken since the
  ReleaseKnowledge refactor); all 22 enrich_from_probe call sites in
  tests/application/test_enrich_from_probe.py pass _KB.
2026-05-21 07:37:42 +02:00
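The key normalization and the uppercase fallback from 0246f85ef8 might look roughly like this. The mapping key names (`video_codecs`, `channels`) and both helper names are hypothetical; the actual layout of probe_mappings.yaml is not shown in this log.

```python
def normalize_probe_mappings(raw: dict) -> dict:
    """Turn channel-count keys back into int (they may arrive as strings)."""
    normalized = dict(raw)
    normalized["channels"] = {
        int(count): layout for count, layout in raw.get("channels", {}).items()
    }
    return normalized


def map_codec(codec_map: dict, raw_codec: str) -> str:
    # Unknown codecs fall back to the uppercased raw value.
    return codec_map.get(raw_codec.lower(), raw_codec.upper())


# Stand-in for the parsed YAML document (assumed shape).
RAW = {
    "video_codecs": {"hevc": "x265", "h264": "x264"},
    "channels": {"6": "5.1", "8": "7.1"},
}
mappings = normalize_probe_mappings(RAW)
print(mappings["channels"][6], map_codec(mappings["video_codecs"], "vp9"))
```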
francwa e62dc90bd1 refactor(release): make tech_string a derived property
ParsedRelease.tech_string was a stored str field re-computed in two
places (assemble() at parse time, enrich_from_probe() after the probe).
The second site was a reactive fix (e79ca46) for filename builders that
saw a stale value. Turn it into an @property so it stays in sync with
quality/source/codec by construction.

- Drop the field from the dataclass + the key from assemble()'s dict.
- Drop tech_string="" from parse_release's malformed-name fallback.
- Drop the manual recomputation at the end of enrich_from_probe.
- Inject the property into asdict() result in the fixtures runner
  (same treatment as is_season_pack).
- Update tests that passed tech_string= to the constructor; rewrite the
  TestTechString case that mutated p.tech_string manually.
2026-05-21 07:33:53 +02:00
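The derived property from e62dc90bd1 can be sketched as below, using the join rule stated in e79ca462b8 (quality, source, codec joined by dots, skipping None). The class is a reduced, hypothetical ParsedRelease.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class ParsedReleaseSketch:
    quality: Optional[str] = None
    source: Optional[str] = None
    codec: Optional[str] = None

    @property
    def tech_string(self) -> str:
        # Computed on access, so it cannot go stale after enrichment.
        parts = (self.quality, self.source, self.codec)
        return ".".join(part for part in parts if part)


release = ParsedReleaseSketch(quality="1080p", codec="x265")
# asdict() only sees fields, so a fixtures runner has to add the property.
as_fixture = {**asdict(release), "tech_string": release.tech_string}
```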
francwa 688c37bbec docs(changelog): recap session 2026-05-20 tech-debt cleanup
Consolidate the five domain-purity refactors of the session under
[Unreleased]: RuleScopeLevel enum, FilePath VO post_init, Language
strict + from_raw, ParsedRelease.normalised → clean, ParsedRelease
enum strictness. Removes the duplicate min_movie_size_bytes entry
(now sits under its proper Removed section).
2026-05-20 23:57:06 +02:00
francwa 757e4045ee refactor(release): ParsedRelease.media_type & parse_path are strict enums
The fields were already typed as MediaTypeToken / ParsePath, but a
tolerant __post_init__ coerced raw strings into their enum form. With
MediaTypeToken(str, Enum) (and ParsePath idem), the coercion served no
purpose — callers that pass '.value' got back the enum anyway, and
callers that pass an unknown string got a ValidationError just like
they would now.

Strict mode: constructor rejects non-enum values directly. The two
in-tree builders (parse_release() and the parser pipeline) already
produce enum values; all .value sites have been removed. Drops the
unused _VALID_MEDIA_TYPES / _VALID_PARSE_PATHS lookup tables.
2026-05-20 23:52:30 +02:00
francwa c3767aacb6 refactor(release): rename ParsedRelease.normalised → clean
The field was called normalised but did not perform the normalization
its name suggested (dots instead of spaces). In practice it holds
raw - site_tag - apostrophes, which is used only by season_folder_name()
via _strip_episode_from_normalized. Renamed to 'clean', which describes
what it actually contains; docstring corrected.
2026-05-20 23:50:05 +02:00
francwa 5bcf22b408 refactor(shared): Language VO is strict; from_raw() factory for un-normalized input
object.__setattr__ inside __post_init__ on a frozen dataclass is a
code smell — it bypasses the immutability guarantee to mutate fields
mid-construction. Split the responsibilities:

* Direct constructor is strict — rejects un-normalized input (uppercase
  iso, whitespace in aliases, etc.) so once a Language exists in the
  system, its fields are guaranteed canonical.
* Language.from_raw() factory handles arbitrary YAML/user input — it
  lowercases the iso, dedups/normalizes aliases, then constructs.

Only caller that built from raw data (LanguageRegistry loading YAML)
moves to from_raw(). Test fixtures already pass normalized data so
they keep using the direct constructor.
2026-05-20 23:48:30 +02:00
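A small sketch of the strict-constructor / lenient-factory split from 5bcf22b408. The exact validation rules and the ValueError type are assumptions; only the split itself comes from the commit message.

```python
from dataclasses import dataclass
from typing import Iterable


def _is_canonical(text: str) -> bool:
    return bool(text) and text == text.strip().lower()


@dataclass(frozen=True)
class Language:
    iso: str
    aliases: tuple[str, ...] = ()

    def __post_init__(self) -> None:
        # Strict: validate only, never rewrite fields on a frozen instance.
        if not _is_canonical(self.iso):
            raise ValueError(f"non-canonical iso: {self.iso!r}")
        for alias in self.aliases:
            if not _is_canonical(alias):
                raise ValueError(f"non-canonical alias: {alias!r}")

    @classmethod
    def from_raw(cls, iso: str, aliases: Iterable[str] = ()) -> "Language":
        # Lenient: normalize and dedup first, then use the strict constructor.
        cleaned = dict.fromkeys(a.strip().lower() for a in aliases if a.strip())
        return cls(iso=iso.strip().lower(), aliases=tuple(cleaned))


french = Language.from_raw(" FR ", ["French", "french ", "FRE"])
```

`dict.fromkeys` is used for deduplication because it preserves first-seen order.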
francwa cfa9f54d9f refactor(shared): FilePath VO uses __post_init__ instead of custom __init__
Custom __init__ on a @dataclass(frozen=True) is a code smell — it
bypasses the generated dataclass __init__ and re-implements the
str/Path coercion + frozen-aware setattr by hand. Replaced with a
single __post_init__ that performs the same normalization. Same
public API (FilePath(str) and FilePath(Path) both work), same
behavior, no callers touched.
2026-05-20 23:47:03 +02:00
francwa f0aaf50c97 refactor(subtitles): RuleScope.level → RuleScopeLevel enum
Six possible levels (global, release_group, movie, show, season,
episode) were passed as a free-form str, with the docstring comment as
the only documentation. Introduces RuleScopeLevel(str, Enum): still
serializable to YAML, but the fixed set is now enforced by the type
system. to_dict() explicitly emits .value to stay safe for YAML
writers.
2026-05-20 23:46:22 +02:00
francwa a09262b33f chore(settings): remove unused min_movie_size_bytes
The field and its validator had been orphaned since
MovieService.validate_movie_file was removed. Extension-based exclusion
(application/release/supported_media.py) plus PoP now cover the
'real movie vs sample' rule. If a size threshold is ever needed, it
will go in data/knowledge/, not in settings.
2026-05-20 23:41:41 +02:00
francwa 9c7cd66d2b Merge branch 'refactor/flatten-shared-media' 2026-05-20 23:35:52 +02:00
francwa 83dbed887b refactor(domain): flatten shared/media package into single module
Six small files (audio, video, subtitle, info, matching, tracks_mixin
+ __init__) collapsed into one ~250 LoC media.py module. Python treats
media.py and media/__init__.py interchangeably, so the 12 import sites
that read 'from alfred.domain.shared.media import ...' continue to work
without changes.

Reasoning: the whole bounded context fits on one screen; splitting into
sub-modules added more navigation friction than it saved. Tests stay
green (1077 passed).
2026-05-20 23:35:49 +02:00
francwa 0c9489e16b Merge branch 'feat/parser-phase-d' 2026-05-20 23:30:36 +02:00
francwa 621bb96995 fix(release/parser): pre-strip apostrophes so titles like Don't parse cleanly
Apostrophes are in the forbidden-chars list, which made any release
with a title like "Don't" or "L'avare" short-circuit to the AI
fallback (parse_path=ai, everything UNKNOWN). They are now stripped
up front from the name before the well-formed check and tokenize,
so the parse completes normally. The raw name is preserved on the
VO; only the title field loses its apostrophe.

parse_path becomes 'sanitized' when an apostrophe was stripped, to
surface that the parser cleaned something up.

Fixtures updated:
- shitty/honey_uhd_hdr/ — went from total UNKNOWN to a clean parse
  (title=Honey.Dont, year=2025, quality=2160p, source=WEBRip,
  codec=x265, group=Amen).
- path_of_pain/the_prodigy_full_chaos/ — went from total failure to
  partial success (title, year, source, codec extracted). Remaining
  gaps (1080i, multi-word audio, Blu-ray-with-dash) are tracked
  separately in tech debt.
2026-05-20 23:29:10 +02:00
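The pre-strip step from 621bb96995 could be sketched as follows. The set of apostrophe characters and the helper name are assumptions; the commit only states that apostrophes are removed up front and that parse_path becomes 'sanitized' when that happens.

```python
# Assumed: straight and typographic apostrophes are both stripped.
APOSTROPHES = ("'", "\u2019")


def pre_strip_apostrophes(name: str) -> tuple[str, bool]:
    """Return the cleaned name and whether anything was removed."""
    cleaned = name
    for char in APOSTROPHES:
        cleaned = cleaned.replace(char, "")
    return cleaned, cleaned != name


raw = "Honey.Don't.2025.2160p.WEBRip.x265-Amen"
cleaned, stripped = pre_strip_apostrophes(raw)
# The raw name stays available; only the working copy loses the apostrophe.
parse_path = "sanitized" if stripped else "direct"
```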
francwa 448ef3b79c fix(release/parser): recognize Sxx-yy season range as tv_complete
`Der.Tatortreiniger.S01-06.GERMAN...` previously parsed as a movie
with 'S01-06' glued to the title. The parser now matches the
season-range form in _parse_season_episode (returning season=first,
episode=None), and the assemble step detects the range token to
promote media_type to 'tv_complete'.

The first season is exposed as `season` so `is_season_pack`
fires (season is not None and episode is None) — useful for routing
to a series root folder.

Fixture shitty/tatortreiniger_flat_multiseason/ updated:
- title: Der.Tatortreiniger.S01-06 → Der.Tatortreiniger
- season: null → 1
- media_type: movie → tv_complete
- is_season_pack: false → true
2026-05-20 23:26:40 +02:00
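A hedged sketch of the Sxx-yy detection from 448ef3b79c, written with string operations only. The helper name and return shape are hypothetical; the real logic lives in _parse_season_episode and the assemble step.

```python
from typing import Optional


def _is_number(text: str) -> bool:
    return text.isascii() and text.isdigit()


def parse_season_range(token: str) -> Optional[tuple[int, int]]:
    """Return (first, last) for a token like 'S01-06', else None."""
    if len(token) < 2 or token[0] not in "sS":
        return None
    first, dash, last = token[1:].partition("-")
    if not dash or not (_is_number(first) and _is_number(last)):
        return None
    return int(first), int(last)


season_range = parse_season_range("S01-06")
if season_range is not None:
    # First season is exposed, episode stays None, so the
    # season-pack condition holds.
    season, episode = season_range[0], None
    media_type = "tv_complete"
```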
francwa b1c7f35ffb fix(release/parser): drop pure-punctuation TITLE tokens at assembly
Releases using ' - ' as a separator (Vinyl - 1x01 - FHD) tokenize to
['Vinyl', '-', '1x01', '-', 'FHD'] — the standalone '-' tokens were
ending up in title_parts and leaked into the joined title
('Vinyl.-'). We can't add '-' to the separator list (it would break
codec-GROUP), so we filter at assembly: a TITLE token with no
alphanumeric characters carries no title content.

Side win: the same logic eliminates the fullwidth vertical bar '｜'
(U+FF5C) from the khruangbin_yt_wide_pipe fixture title.

Fixtures updated:
- shitty/vinyl_1x01_format/expected.yaml (title: Vinyl.- → Vinyl)
- path_of_pain/khruangbin_yt_wide_pipe/expected.yaml (｜ dropped)
2026-05-20 23:24:40 +02:00
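The assembly-time filter from b1c7f35ffb amounts to keeping only TITLE tokens that contain at least one alphanumeric character. A minimal sketch, with a hypothetical helper name:

```python
def join_title(title_tokens: list[str]) -> str:
    """Join title tokens with dots, dropping pure-punctuation ones."""
    kept = [tok for tok in title_tokens if any(ch.isalnum() for ch in tok)]
    return ".".join(kept)


print(join_title(["Vinyl", "-"]))
```

A dash inside a token survives, since the check only rejects tokens with no alphanumeric character at all.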
francwa 5bbdc9081f fix(release/parser): collapse chained multi-episode markers to full range
S14E09E10E11 previously parsed to episode=9, episode_end=10 — E11
was silently dropped. The parser now takes episodes[-1] as
episode_end so the full chain is captured (episode=9, episode_end=11).
Intermediate values stay implied.

Fixture shitty/archer_multi_episode/ updated: it used to pin the buggy
output and now guards the fixed behaviour against regression.
2026-05-20 23:23:08 +02:00
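The episodes[-1] rule from 5bbdc9081f can be illustrated with a string-only sketch. The function name and return shape are assumptions, not the parser's actual API.

```python
from typing import Optional


def parse_chained_episodes(token: str) -> Optional[tuple[int, int, Optional[int]]]:
    """Return (season, episode, episode_end) for SxxEyy[Ezz...], else None."""
    upper = token.upper()
    if not upper.startswith("S"):
        return None
    season_text, *episode_texts = upper[1:].split("E")
    numbers = [season_text, *episode_texts]
    if not episode_texts or not all(n.isascii() and n.isdigit() for n in numbers):
        return None
    episodes = [int(n) for n in episode_texts]
    # The last marker closes the range; intermediate values stay implied.
    episode_end = episodes[-1] if len(episodes) > 1 else None
    return int(season_text), episodes[0], episode_end


print(parse_chained_episodes("S14E09E10E11"))
```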
francwa 5d7b214af2 Merge branch 'refactor/language-port' 2026-05-20 23:20:18 +02:00
francwa 18267d0165 refactor(language): LanguageRepository port + SubtitleKnowledgeBase wired to it
Mirror the MediaProber / FilesystemScanner pattern for language lookup:

- New Protocol `LanguageRepository` in alfred.domain.shared.ports
  covering from_iso, from_any, all, __contains__, __len__ — the
  surface previously coupled to the concrete LanguageRegistry.
- SubtitleKnowledgeBase types its `language_registry` parameter
  against the Protocol; the concrete LanguageRegistry stays in
  infrastructure as the YAML-backed adapter and remains the default
  when no repository is injected.
- New unit tests in tests/infrastructure/test_language_registry.py
  cover the adapter surface (from_iso, from_any, membership,
  case-insensitivity, non-string inputs).

Behaviour is unchanged for existing callers. The split opens the
door to in-memory fakes in future tests without loading the full
ISO 639 YAML.
2026-05-20 23:18:25 +02:00
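The in-memory fake that 18267d0165 makes possible might look like the sketch below. The method names come from the commit message; the return types, the lookup rules, and the Language shape are assumptions.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Protocol


@dataclass(frozen=True)
class Language:
    iso: str
    aliases: tuple[str, ...] = ()


class LanguageRepository(Protocol):
    def from_iso(self, iso: str) -> Optional[Language]: ...
    def from_any(self, raw: object) -> Optional[Language]: ...
    def all(self) -> list[Language]: ...
    def __contains__(self, raw: object) -> bool: ...
    def __len__(self) -> int: ...


class InMemoryLanguages:
    """Test fake: satisfies the Protocol structurally, no YAML involved."""

    def __init__(self, languages: Iterable[Language]) -> None:
        self._languages = list(languages)
        self._by_iso = {lang.iso: lang for lang in self._languages}
        self._by_alias = {a: lang for lang in self._languages for a in lang.aliases}

    def from_iso(self, iso: str) -> Optional[Language]:
        return self._by_iso.get(iso.lower()) if isinstance(iso, str) else None

    def from_any(self, raw: object) -> Optional[Language]:
        if not isinstance(raw, str):
            return None  # non-string input is simply "not found"
        key = raw.strip().lower()
        return self._by_iso.get(key) or self._by_alias.get(key)

    def all(self) -> list[Language]:
        return list(self._languages)

    def __contains__(self, raw: object) -> bool:
        return self.from_any(raw) is not None

    def __len__(self) -> int:
        return len(self._languages)


def describe(repo: LanguageRepository, raw: object) -> str:
    lang = repo.from_any(raw)
    return lang.iso if lang else "unknown"


repo = InMemoryLanguages([Language("fr", ("french", "vf"))])
```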
francwa 19fe8a519a Merge branch 'feat/release-inspect-orchestrator'
Inspection pipeline groundwork:
- MediaProber.probe() port extension (full media inspection on the port)
- inspect_release orchestrator + InspectedResult frozen VO
- enrich_from_probe now refreshes tech_string
- resolve_*_destination use cases consume inspect_release
- detect_media_type & enrich_from_probe moved to application/release
2026-05-20 09:31:22 +02:00
francwa a0d1846ff2 refactor(release): move detect_media_type & enrich_from_probe to application/release
Both helpers are inspection-pipeline pieces, not filesystem use cases —
they belong next to inspect_release, not next to move_media /
resolve_destination / list_folder.

The move also kills the lazy import that was hiding inside
_resolve_parsed: alfred.application.filesystem.resolve_destination
no longer triggers a cycle through alfred.application.filesystem
__init__ when loading inspect_release. Top-level import restored.

Call sites updated: inspect.py, test_detect_media_type.py,
test_enrich_from_probe.py, testing/recognize_folders_in_downloads.py.
Module docstrings + test-file docstrings updated to match the new
location.
2026-05-20 09:29:58 +02:00
francwa 0fb59a4581 feat(filesystem): wire inspect_release into resolve_destination
The four resolve_*_destination use cases now route through a private
_resolve_parsed helper that picks the right entry point:

  - source path provided AND it exists -> inspect_release(name, path)
    runs the full pipeline (parse + media-type refinement + probe
    + enrich), so missing tech tokens (quality, codec, ...) get
    filled by ffprobe and the refreshed tech_string lands in the
    destination folder / file names.

  - source path omitted or nonexistent  -> parse_release(name) only,
    same behavior as before. Back-compat: tests using fake /dl/*.mkv
    paths still pass unchanged.

resolve_episode_destination / resolve_movie_destination reuse their
existing source_file parameter as the inspection target. The two
folder-move use cases (season / series) gain a new OPTIONAL
source_path parameter — threaded through the agent tool wrappers
and documented in the YAML specs.

The lazy import inside _resolve_parsed avoids a circular import:
inspect_release imports detect_media_type / enrich_from_probe from
the same application.filesystem package whose __init__ re-exports
resolve_destination.

Three new tests in TestProbeEnrichmentWiring with a stub MediaProber
prove the wiring: movie picks up probe quality, season picks it up
via source_path, and a missing path correctly skips probe (back-compat
guard).
2026-05-20 09:26:30 +02:00
francwa e79ca462b8 fix(release): refresh tech_string after enrich_from_probe
enrich_from_probe fills None fields on ParsedRelease (quality, source,
codec, audio_*, languages) but left tech_string at its parser-time
value — so the filename builders (movie_folder_name, episode_filename,
…) saw stale tech tokens even after a successful probe.

Re-derive tech_string the same way the parser does — quality.source.codec
joined by dots, skipping None — at the end of enrich_from_probe. Token-
level values still win because enrich only fills None fields.

Four new tests in TestTechString cover: enrichment rebuilds it,
existing source survives, no-info input leaves it untouched, fully
empty parsed produces ''.
2026-05-20 09:26:09 +02:00
francwa 03aa844d7d feat(release): inspect_release orchestrator + InspectedResult VO
New application-layer entry point that composes the four inspection
layers in one call:

  1. parse_release(name, kb)              -> (ParsedRelease, ParseReport)
  2. detect_media_type(parsed, path, kb)  -> patch parsed.media_type
  3. find_main_video(path, kb)            -> Path | None (top-level scan)
  4. prober.probe(video) + enrich         -> when video exists and
                                             media_type not in
                                             {unknown, other}

Returns a frozen InspectedResult(parsed, report, source_path,
main_video, media_info, probe_used). kb and prober are injected — no
module-level singletons in inspect.py.

analyze_release tool now delegates to inspect_release; its output
gains two fields, confidence (0-100) and road (easy/shitty/path_of_pain),
surfaced from ParseReport so the LLM can route by confidence. Spec
updated to document them.

12 new tests covering happy paths, probe gating (no video, media_type
'other', probe failure), mutation contract (detect refining
parsed.media_type, enrich filling None fields), resilience
(nonexistent path), and frozen contract. Suite: 1058 passing.
2026-05-20 09:15:29 +02:00
francwa c303efea48 refactor(probe): consolidate full probe() into MediaProber port
Add probe(video) -> MediaInfo | None to the MediaProber Protocol and
implement it on FfprobeMediaProber. The standalone
alfred/infrastructure/filesystem/ffprobe.py module is removed; all
callers (analyze_release / probe_media tools, testing scripts) now go
through the adapter.

Tests for the probe path moved to tests/infrastructure/test_ffprobe_prober.py
(patching subprocess.run at the adapter module level).

Unblocks the upcoming inspect_release orchestrator, which needs the
port — not a free function — to compose parse + main-video selection
+ probe in one shot.
2026-05-20 09:11:24 +02:00
francwa 5db350a1df Merge branch 'feat/release-parser-scoring' 2026-05-20 08:47:38 +02:00
francwa 12dc796ea2 docs(changelog): freeze confidence scoring + exclusion work block 2026-05-20 08:47:29 +02:00
francwa 9ddd85929e feat(release): pre-pipeline exclusion helpers
Add the application-layer helpers that decide which files are worth
parsing, sitting one notch above parse_release.

- is_supported_video(path, kb): extension-only check against
  kb.video_extensions. Lowercased suffix lookup. Directories and
  broken symlinks return False.
- find_main_video(folder, kb): top-level scan only (no recursion into
  subdirectories — releases that wrap their video in Sample/ are
  PATH_OF_PAIN territory). Lexicographically-first eligible file wins
  when several qualify (deterministic, no size-based ranking). A bare
  file as folder argument is supported for single-file releases.

No size threshold and no filename heuristics ('sample' / 'trailer'):
the parser's job is to extract structure, not to second-guess
non-standard release shapes. PoP catches the rest.

17 tests under tests/application/test_supported_media.py.
2026-05-20 01:34:32 +02:00
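A sketch of the two helpers from 9ddd85929e, under the assumption that kb.video_extensions is a set of lowercase suffixes including the dot. The real functions take the kb port; here the set is passed directly, and sorting by file name is one possible reading of "lexicographically-first".

```python
from pathlib import Path
from typing import Optional

# Assumed stand-in for kb.video_extensions.
VIDEO_EXTENSIONS = frozenset({".mkv", ".mp4", ".avi"})


def is_supported_video(path: Path, extensions: frozenset = VIDEO_EXTENSIONS) -> bool:
    # Path.is_file() is False for directories and broken symlinks.
    return path.is_file() and path.suffix.lower() in extensions


def find_main_video(folder: Path, extensions: frozenset = VIDEO_EXTENSIONS) -> Optional[Path]:
    # A bare file is accepted for single-file releases.
    if is_supported_video(folder, extensions):
        return folder
    if not folder.is_dir():
        return None
    # Top-level scan only: iterdir() does not recurse into Sample/ etc.
    candidates = sorted(
        (p for p in folder.iterdir() if is_supported_video(p, extensions)),
        key=lambda p: p.name,
    )
    return candidates[0] if candidates else None
```

The choice is deterministic because it depends only on file names, not on size or directory listing order.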
francwa ed7680b58f docs(changelog): log parse-confidence scoring + ParseReport tuple 2026-05-20 01:21:47 +02:00
francwa b4c9efd13b feat(release): parse_release returns (ParsedRelease, ParseReport)
Wire the scoring foundations into the parser entry point. parse_release
now returns a tuple — the structural ParsedRelease and a diagnostic
ParseReport carrying confidence (0-100), road
(EASY / SHITTY / PATH_OF_PAIN), the residual UNKNOWN tokens, and the
list of critical fields that couldn't be filled.

EASY is decided structurally (a group schema matched), independently
of the score. SHITTY vs PATH_OF_PAIN is decided by score against the
60 cutoff from scoring.yaml. Malformed names (forbidden chars) emit a
zero-confidence PoP report and short-circuit to parse_path=AI as
before.

ParsePath stays as-is (DIRECT / SANITIZED / AI) — it records *how* we
tokenized, not how confident we are. The two dimensions are now
properly separated.

Call sites propagated:
- alfred/application/filesystem/resolve_destination.py (4 occurrences)
- alfred/agent/tools/filesystem.py
- tests/domain/test_release.py
- tests/domain/test_release_fixtures.py
- tests/application/test_detect_media_type.py

New tests/domain/release/test_parser_v2_scoring.py (22 cases) locks
ParseReport validation, compute_score arithmetic, decide_road
thresholding, the collector helpers, and the end-to-end tuple contract.
2026-05-20 01:21:30 +02:00
francwa 98c688f29b feat(release): foundations for parse-confidence scoring
Add the building blocks for Phase A scoring without yet wiring them
into parse_release. Nothing changes at runtime — parse_release still
returns a single ParsedRelease — but the pieces needed to upgrade it
in a follow-up commit are now in place.

- alfred/knowledge/release/scoring.yaml: weights / penalties /
  thresholds. Title and media_type are heavy (30 / 20), structural
  fields medium (year 15, season 10), tech fields light (5 each).
  Unknown-token penalty 5 capped at -30. SHITTY/PoP cutoff at 60.
- load_scoring() loader with safe defaults baked in: a missing or
  partial YAML only de-tunes, never breaks.
- ReleaseKnowledge port grows a 'scoring: dict' field. YamlReleaseKnowledge
  populates it from load_scoring().
- New parser/scoring.py module with Road enum (EASY / SHITTY /
  PATH_OF_PAIN, distinct from ParsePath which records the tokenization
  route), and pure functions: compute_score, decide_road,
  collect_unknown_tokens, collect_missing_critical.
- ParseReport frozen VO in value_objects.py — exported alongside
  ParsedRelease.
2026-05-20 01:21:17 +02:00
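The arithmetic described in 98c688f29b and b4c9efd13b can be sketched as below. The weights, the penalty, the cap and the cutoff come from the commit messages; which tech fields count, the clamping to 0-100, and whether the cutoff is inclusive are assumptions.

```python
# Weights as listed in the commit; the three tech field names are assumed.
WEIGHTS = {
    "title": 30, "media_type": 20,
    "year": 15, "season": 10,
    "quality": 5, "source": 5, "codec": 5,
}
UNKNOWN_PENALTY = 5
MAX_PENALTY = 30
SHITTY_CUTOFF = 60


def compute_score(filled_fields: set[str], unknown_tokens: int) -> int:
    earned = sum(weight for name, weight in WEIGHTS.items() if name in filled_fields)
    penalty = min(unknown_tokens * UNKNOWN_PENALTY, MAX_PENALTY)
    return max(0, min(100, earned - penalty))


def decide_road(score: int, schema_matched: bool) -> str:
    # EASY is structural: a matched group schema wins regardless of score.
    if schema_matched:
        return "easy"
    return "shitty" if score >= SHITTY_CUTOFF else "path_of_pain"


score = compute_score({"title", "media_type", "year"}, unknown_tokens=2)
print(score, decide_road(score, schema_matched=False))
```

With title, media_type and year filled (65 points) and two unknown tokens (10 points of penalty), the score is 55, which falls below the cutoff.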
francwa fcd80763e2 Merge branch 'refactor/release-parser-v2' 2026-05-20 01:08:20 +02:00
francwa 629387591f docs(changelog): freeze release parser v2 work block (2026-05-20) 2026-05-20 01:08:17 +02:00
francwa 230a7ab88a docs(changelog): log SHITTY simplification + distributor split 2026-05-20 01:03:52 +02:00
francwa 3737f66851 refactor(release): simplify SHITTY to dict-driven token tagging
Replace the ~480-line legacy heuristic block in services.py with a
small dict-driven pass in pipeline._annotate_shitty: each token is
looked up against the kb buckets (resolutions / sources / codecs /
distributors / year / sxxexx) with first-match-wins semantics, the
leftmost contiguous UNKNOWN run becomes the title, done.

SHITTY's scope is intentionally narrow — releases that *look* like
scene names but don't have a registered group schema. Anything more
exotic (parenthesized tech, bare-dashed title fragments, YT slugs,
franchise boxes) is PATH OF PAIN territory and stays out of here.

- annotate() no longer returns None; SHITTY is the always-on fallback
- services.py shrunk from ~525 to ~85 lines (legacy extractors gone)
- 4 fixtures get xfail markers documenting PoP-grade pathologies
  (deutschland franchise box, sleaford YT slug, super_mario bilingual,
  predator space-separators — the last one moved from shitty/ → pop/)
- ReleaseFixture grows xfail_reason; the parametrized suite wires the
  pytest.mark.xfail(strict=False) automatically
2026-05-20 01:03:25 +02:00
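The dict-driven pass from 3737f66851 can be approximated as follows. The bucket contents and the year check are illustrative assumptions, and the sxxexx bucket is left out for brevity.

```python
# Assumed bucket contents; order encodes first-match-wins.
BUCKETS = (
    ("RESOLUTION", {"720p", "1080p", "2160p"}),
    ("SOURCE", {"webrip", "web-dl", "bluray"}),
    ("CODEC", {"x264", "x265"}),
    ("DISTRIBUTOR", {"nf", "amzn"}),
)


def tag_token(token: str) -> str:
    lowered = token.lower()
    for role, values in BUCKETS:
        if lowered in values:
            return role
    if len(token) == 4 and token.isdigit() and token[:2] in ("19", "20"):
        return "YEAR"
    return "UNKNOWN"


def annotate_shitty_sketch(tokens: list[str]) -> list[tuple[str, str]]:
    roles = [tag_token(tok) for tok in tokens]
    # The leftmost contiguous UNKNOWN run becomes the title.
    start = next((i for i, role in enumerate(roles) if role == "UNKNOWN"), None)
    if start is not None:
        i = start
        while i < len(roles) and roles[i] == "UNKNOWN":
            roles[i] = "TITLE"
            i += 1
    return list(zip(tokens, roles))


tagged = annotate_shitty_sketch(
    ["Some", "Movie", "2021", "1080p", "WEBRip", "x265", "extra"]
)
```

Tokens left UNKNOWN after the title run (here `extra`) are what the scoring step would count as residual unknowns.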
francwa fd3bd1ad8c feat(release): distinguish streaming distributors from sources
Introduce a separate dimension for streaming-platform tags (NF, AMZN,
DSNP, HMAX, ATVP, …) so they stop polluting the encoding-source field.
WEB-DL is the source; the platform that released it is the distributor.

- new distributors.yaml knowledge file
- ReleaseKnowledge port exposes distributors set
- TokenRole.DISTRIBUTOR + ParsedRelease.distributor field
- removed NF/AMZN/DSNP/HMAX/ATVP from sources.yaml
- notre_planete fixture now records distributor: NF
2026-05-20 01:03:11 +02:00
francwa 7dc7f0c241 feat(release): v2 enricher pass for audio/video-meta/edition/language
The EASY pipeline now extracts the full ParsedRelease surface from
known-group releases, not just the structural backbone. Behavior is
unchanged for releases that don't carry these tokens.

Pipeline (parser/pipeline.py):
- Structural walk (renamed _annotate_structural): no longer requires
  body to be fully consumed. Tokens passed over between schema chunks
  remain UNKNOWN so the enricher pass can claim them.
- _find_chunk(): scans forward in the body for the next token matching
  a given role, skipping already-annotated tokens. Lets optional and
  mandatory chunks both tolerate intercalated enricher tokens.
- _annotate_enrichers(): new non-positional pass. Walks UNKNOWN tokens
  and tags AUDIO_CODEC / AUDIO_CHANNELS / BIT_DEPTH / HDR / EDITION /
  LANGUAGE. Multi-token sequences from kb.audio / kb.video_meta /
  kb.editions are matched first (longest-first ordering preserved from
  the YAML), single tokens after.
- _apply_sequences(): mutates the token list, tagging the first token
  of a matched sequence with extra['sequence']=<canonical value> and
  trailing members with extra['sequence_member']='True' so assemble
  skips them.
- _detect_channel_pairs(): handles the '5.1' / '7.1' case where the
  '.' separator splits the layout into two tokens. Strips a trailing
  '-GROUP' suffix on the second before joining.

Assemble:
- New fields populated: languages (list), audio_codec, audio_channels,
  bit_depth, hdr_format, edition. Each role-handler skips
  sequence_member tokens.
- media_type heuristic extended: edition in {COMPLETE, INTEGRALE,
  COLLECTION} + no season → tv_complete (mirrors legacy).

Tests:
- 4 new TestEnrichers cases covering bit_depth+audio_codec+channels,
  HDR sequence + edition sequence + TrueHD.Atmos + 7.1, multi-language
  with DTS-HD.MA sequence, TV episode with single language.
- All 14 v2 tests + 30 fixture tests still green. Suite: 1011 passed,
  8 skipped.

Refs: project_release_parser_v2_specs (memory)
2026-05-20 00:26:05 +02:00
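The channel-pair case from 7dc7f0c241 (a '.' separator splitting '5.1' into two tokens) can be sketched as below. The set of known layouts and the helper's return shape are assumptions; the real _detect_channel_pairs works on Token objects rather than strings.

```python
# Assumed layouts; the real list would come from the knowledge base.
KNOWN_LAYOUTS = {"2.0", "5.1", "7.1"}


def detect_channel_pairs(tokens: list[str]) -> list[tuple[int, str]]:
    """Return (index of first token, joined layout) for each split layout."""
    found = []
    for i in range(len(tokens) - 1):
        # Strip a trailing '-GROUP' suffix on the second token before joining.
        second = tokens[i + 1].partition("-")[0]
        layout = f"{tokens[i]}.{second}"
        if layout in KNOWN_LAYOUTS:
            found.append((i, layout))
    return found


print(detect_channel_pairs(["DDP", "5", "1-KONTRAST"]))
```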
francwa 075a827b0e feat(release): wire v2 EASY path for known release groups
The annotate-based v2 pipeline now handles releases ending in -KONTRAST,
-ELiTE, or -RARBG. Unknown groups still fall through to the legacy
SHITTY heuristic in services.py — nothing changes for them.

Pipeline (alfred/domain/release/parser/pipeline.py):
- tokenize(): string-ops separator split, strips [site.tag] first.
- annotate(): right-to-left group detection (priority to codec-GROUP
  shape, fallback to any non-source dashed token), GroupSchema lookup
  via the kb port, then lockstep walk of tokens against schema chunks.
  Optional chunks skip on mismatch, mandatory mismatches return None so
  the caller falls back gracefully. CODEC pre-consumed by a codec-GROUP
  trailing token correctly skips the CODEC chunk in the body walk.
- assemble(): folds annotated tokens into a ParsedRelease-compatible
  dict (title joined by '.', group from the codec-GROUP token's extras).

Schema (alfred/domain/release/parser/schema.py):
- GroupSchema + SchemaChunk frozen value objects.
- TokenRole.GROUP added.

Port + adapter:
- ReleaseKnowledge.group_schema(name) lookup added (case-insensitive).
- YamlReleaseKnowledge loads alfred/knowledge/release/release_groups/
  *.yaml at construction time; learned overrides in
  data/knowledge/release/release_groups/ also picked up.

Knowledge:
- release_groups/kontrast.yaml, elite.yaml, rarbg.yaml declare the
  canonical chunk_order. ELiTE marks source as optional (Foundation.S02
  has no WEBRip token).

Services:
- parse_release tries the v2 path first; on None falls through to the
  legacy implementation untouched.

Tests:
- tests/domain/release/test_parser_v2_easy.py (10 cases) cover group
  detection (codec-GROUP, dashed-source skip, no-dash → unknown),
  schema-driven annotation (movie, TV episode, season pack with
  optional source, unknown group returns None), and field assembly.
- Existing tests/domain/test_release_fixtures.py (30 cases) stay green:
  5 EASY fixtures now produced by v2, 25 SHITTY/PATH OF PAIN fixtures
  still produced by the legacy path. Verified via spy on v2.assemble.

Suite: 1007 passed, 8 skipped.

Refs: project_release_parser_v2_specs (memory)
2026-05-20 00:21:11 +02:00
francwa a2c917618f feat(release): scaffold v2 parser package (annotate-based pipeline)
New package alfred/domain/release/parser/ lays the foundation for the
release parser refactor (specs in memory). Exposes:

- Token: frozen VO carrying text + stream index + TokenRole + extra dict.
  with_role() returns a new instance (no mutation).
- TokenRole: str-backed enum split into structural (TITLE/YEAR/SEASON_EP/
  GROUP), technical (RESOLUTION/SOURCE/CODEC/AUDIO_*/BIT_DEPTH/HDR/
  EDITION/LANGUAGE), and meta (SITE_TAG/UNKNOWN) families.
- pipeline.strip_site_tag(): pulls a [site.tag] prefix or suffix.
- pipeline.tokenize(): release name -> list[Token] (all UNKNOWN),
  string-ops split on kb.separators (no regex, per CLAUDE.md).
- pipeline.annotate(): documented stub. Walk order recorded in docstring
  (group right-to-left, then season/episode, year, tech, title).

Legacy parse_release in release.services remains the live implementation
until the annotate step lands. Scaffolding tests verify Token API,
site-tag stripping (prefix/suffix), and tokenize output shape.

Refs: project_release_parser_v2_specs (memory)
2026-05-20 00:12:33 +02:00
francwa 9f10f4e0ad Merge branch 'refactor/domain-release-knowledge'
Final DDD purification of the release parser. Domain layer no longer
imports anything from infrastructure, no YAML at import time, and
ParsedRelease's filesystem-builders are pure (Option B).

- ReleaseKnowledge Protocol port + YamlReleaseKnowledge adapter
- parse_release(name, kb) explicit injection
- ParsedRelease.title_sanitized field; builders accept already-safe strings
- Callers (resolve_destination, detect_media_type, find_video,
  analyze_release) thread the kb through
- 987 tests pass
2026-05-19 22:05:36 +02:00
francwa cd814c7922 docs(changelog): log refactor/domain-release-knowledge work block 2026-05-19 22:05:29 +02:00
francwa 6802933acd test(release): adapt suite to explicit ReleaseKnowledge injection
- test_release.py / test_release_fixtures.py: module-level
  _KB = YamlReleaseKnowledge() + thin _parse(name) helper threading it
  into parse_release. test_show_folder_name_strips_windows_chars renamed
  to test_show_folder_name_uses_already_safe_title to reflect the
  Option B contract (caller sanitizes via kb.sanitize_for_fs).
- test_detect_media_type.py: same _KB pattern, all
  detect_media_type(parsed, path) calls now pass kb.
- test_filesystem_extras.py: find_video_file(path) calls now pass kb.
- test_enrich_from_probe.py: _bare() helper adds the new
  title_sanitized field.
- test_resolve_destination.py: drop _sanitize import + TestSanitize
  class (helper deleted), add tmdb_title_safe arg to
  _resolve_series_folder calls.

987 passed, 8 skipped.
2026-05-19 22:05:26 +02:00
francwa bf37a9d09e refactor(release): thread ReleaseKnowledge through callers
Wires the new explicit-kb signatures into every caller of the release
parser and the filesystem-extension helpers.

- application/filesystem/resolve_destination.py: module-level singleton
  _KB: ReleaseKnowledge = YamlReleaseKnowledge(); each use case now calls
  parse_release(release_name, _KB) and sanitizes TMDB strings via
  _KB.sanitize_for_fs(...) before passing them to the pure ParsedRelease
  builders. Local _sanitize helper + _WIN_FORBIDDEN regex dropped.
- application/filesystem/detect_media_type.py: signature is now
  detect_media_type(parsed, source_path, kb); uses kb.metadata_extensions,
  kb.video_extensions, kb.non_video_extensions.
- infrastructure/filesystem/find_video.py: find_video_file(path, kb) uses
  kb.video_extensions instead of an imported constant.
- agent/tools/filesystem.py::analyze_release imports the application _KB
  singleton and passes it through to parse_release / detect_media_type /
  find_video_file.
2026-05-19 22:05:19 +02:00
francwa 4a74fff9cc refactor(release): purify domain — parse_release(name, kb) + ParsedRelease Option B
Removes the last domain → infrastructure leak in the release parser.

services.py:
- parse_release(name, kb) takes the knowledge as an explicit parameter.
- Every helper (_tokenize, _is_well_formed, _extract_tech,
  _extract_languages, _extract_audio, _extract_video_meta,
  _extract_edition, _extract_title, _infer_media_type) takes kb.
- No more module-level YAML loading.

value_objects.py — Option B:
- Sanitization happens once at parse time; ParsedRelease now carries
  a title_sanitized: str field alongside title.
- Builder methods (show_folder_name, episode_filename, movie_folder_name,
  movie_filename) become pure: they accept already-sanitized
  tmdb_title_safe / tmdb_episode_title_safe arguments. Callers at the
  use-case boundary sanitize via kb.sanitize_for_fs(...) before passing in.
- All domain-knowledge constants removed (_RESOLUTIONS, _SOURCES, _CODECS,
  _AUDIO, _VIDEO_META, _EDITIONS, _HDR_EXTRA, _MEDIA_TYPE_TOKENS,
  _LANGUAGE_TOKENS, _FORBIDDEN_CHARS, _*_EXTENSIONS, _WIN_FORBIDDEN_TABLE,
  _sanitize_for_fs). The module is now pure DDD.
2026-05-19 22:05:10 +02:00
francwa c3a3cb50c9 refactor(release): introduce ReleaseKnowledge Protocol port + YamlReleaseKnowledge adapter
Adds the port/adapter pair that lets the release domain consume parsing
knowledge without importing infrastructure or loading YAML at import time.

- alfred/domain/release/ports/knowledge.py declares the read-only query
  surface: token sets (resolutions, sources, codecs, language_tokens,
  forbidden_chars, hdr_extra), structured dicts (audio, video_meta,
  editions, media_type_tokens), separators list, file-extension sets,
  and sanitize_for_fs(text).
- alfred/infrastructure/knowledge/release_kb.py loads every YAML once
  at construction and exposes them as attributes, with an immutable
  str.maketrans table backing sanitize_for_fs.

No domain code is wired to the port yet — that lands in the next commit.
2026-05-19 22:05:01 +02:00
98 changed files with 5069 additions and 1628 deletions
+421
@@ -15,8 +15,372 @@ callers).
## [Unreleased]
### Fixed
- **Multi-episode chain (e.g. `S14E09E10E11`) now collapses to a full
range.** The parser previously captured `episode=9, episode_end=10`
and dropped E11+. It now returns `episode=first, episode_end=last`,
with intermediate values implied. Fixture
`shitty/archer_multi_episode/` updated from anti-regression-of-bug
to anti-regression-of-fix.
- **Apostrophes in titles no longer push the release through the AI
fallback.** `Honey.Don't.2025.2160p.WEBRip.DSNP.DV.HDR.x265-Amen`
previously parsed with `parse_path="ai"` and everything UNKNOWN
because `'` is in the forbidden-chars list. Apostrophes are now
pre-stripped before the well-formed check, so the parse completes
normally (`title=Honey.Dont, year=2025, quality=2160p, ...`); only
the title text loses its apostrophe. `parse_path` becomes
`sanitized` to surface the cleanup. Side win: PoP fixture
`the_prodigy_full_chaos/` also moves from total failure to a
partially-correct parse (year, source, codec extracted).
- **Season-range markers (`Sxx-yy`) are now recognized as
`tv_complete`.** `Der.Tatortreiniger.S01-06.GERMAN...` previously
parsed as `media_type=movie` with `S01-06` glued onto the title.
The parser now recognizes the range, sets `season=first`,
`media_type=tv_complete`, and removes the marker from the title.
`is_season_pack` flips to `true`.
- **Pure-punctuation TITLE tokens are dropped at assembly.** Releases
with surrounding ` - ` separators (`Vinyl - 1x01 - FHD`) previously
produced `title="Vinyl.-"`. Such tokens (a stray dash, a wide pipe
`｜`, …) carry no title content and are now filtered out. Side
effect: PoP fixture `khruangbin_yt_wide_pipe/` also benefits — the
YouTube wide-pipe no longer leaks into the title.
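
The multi-episode fix above can be illustrated with a minimal sketch. It assumes the marker arrives as a single token and uses plain string operations only; `collapse_episode_chain` and `_is_number` are hypothetical names, not the parser's actual helpers, and returning `None` as the end of a single-episode marker is an assumption of this sketch.

```python
from typing import Optional


def _is_number(text: str) -> bool:
    # ASCII digits only, so int() below cannot fail on exotic Unicode digits.
    return text.isascii() and text.isdigit()


def collapse_episode_chain(token: str) -> Optional[tuple[int, int, Optional[int]]]:
    """Turn 'S14E09E10E11' into (season, first_episode, last_episode)."""
    upper = token.upper()
    if not upper.startswith("S"):
        return None
    # Split once on the first 'E': season digits on the left, chain on the right.
    season_text, marker, chain = upper[1:].partition("E")
    if not marker or not _is_number(season_text):
        return None
    parts = chain.split("E")
    if not all(_is_number(part) for part in parts):
        return None
    episodes = [int(part) for part in parts]
    # Intermediate episodes are implied; only the two ends are kept.
    last = episodes[-1] if len(episodes) > 1 else None
    return int(season_text), episodes[0], last


print(collapse_episode_chain("S14E09E10E11"))  # (14, 9, 11)
```

Keeping only the first and last values matches the `episode=first, episode_end=last` contract described in the entry.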
### Added
- **Fullwidth vertical bar `｜` (U+FF5C) is now a recognized release-name
token separator.** Added to `alfred/knowledge/release/separators.yaml`
so CJK release names (and the occasional decorative YouTube-style use)
tokenize cleanly instead of leaving the wide pipe glued onto an
adjacent token. The tokenizer in
`alfred/domain/release/parser/pipeline.py` already iterates the
separator list as plain strings (no regex), so a multi-byte UTF-8
separator works without any code change.
- **`InspectedResult.recommended_action` property** — derived hint that
collapses the orchestrator's go / wait / skip decision into a single
value (``"process"`` / ``"ask_user"`` / ``"skip"``). Centralizes the
exclusion logic that was previously dispersed across road /
media_type / main_video checks at each call site. Ordering is part of
the contract: ``skip`` (no main video, or media_type == ``"other"``)
wins over ``ask_user`` (media_type == ``"unknown"`` or road ==
``"path_of_pain"``) which wins over ``process``. Surfaced through the
``analyze_release`` tool so the LLM can route on it directly.
6 new tests in ``tests/application/test_inspect.py`` cover the four
branches and the precedence rules.
- **`LanguageRepository` port** in `alfred.domain.shared.ports`. Structural
Protocol covering `from_iso`, `from_any`, `all`, `__contains__`, `__len__`
— the surface previously coupled to the concrete `LanguageRegistry`.
Mirrors the `MediaProber` / `FilesystemScanner` pattern: domain code
depends on the Protocol, infrastructure provides the YAML-backed
adapter. Tests in `tests/infrastructure/test_language_registry.py`.
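
The precedence rule behind `recommended_action` can be sketched as follows. This is an illustration under assumptions: `InspectedSketch` is a hypothetical stand-in carrying only the three inputs named in the entry, not the real `InspectedResult`.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Optional


@dataclass(frozen=True)
class InspectedSketch:
    """Hypothetical reduction of InspectedResult to the fields the rule reads."""

    media_type: str
    road: str
    main_video: Optional[Path]

    @property
    def recommended_action(self) -> str:
        # Order matters: "skip" wins over "ask_user", which wins over "process".
        if self.main_video is None or self.media_type == "other":
            return "skip"
        if self.media_type == "unknown" or self.road == "path_of_pain":
            return "ask_user"
        return "process"


video = Path("Show.S01E01.mkv")
print(InspectedSketch("tv_show", "easy", video).recommended_action)
print(InspectedSketch("unknown", "easy", None).recommended_action)
```

The second call shows why the ordering is part of the contract: an unknown media type would normally ask the user, but the missing main video takes priority and the result is `skip`.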
### Changed
- **`Movie` and `Episode` are now frozen dataclasses.** Both entities
hold their track collections as `tuple[AudioTrack, ...]` and
`tuple[SubtitleTrack, ...]` instead of mutable lists, and are
`@dataclass(frozen=True, eq=False)` (identity-based equality
preserved via `__eq__`/`__hash__`). `__post_init__` coercion uses
`object.__setattr__` for the `imdb_id` / `title` /
`season_number` / `episode_number` normalizations. To project
enrichment results (probe output, file metadata) callers now rebuild
via `dataclasses.replace(...)`. Pattern aligned with the recent
`ParsedRelease` freeze. `MediaWithTracks` mixin contract updated to
`tuple` accordingly. `Season` and `TVShow` remain mutable for now —
freezing the aggregate root would cascade a full reconstruction on
every `add_episode`, deferred.
- **`SubtitleCandidate` renamed to `SubtitleScanResult`.** The old name
conflated "this might become a placed subtitle" with "this is what a
scan pass produced". The class is the output of a scan/identify pass
— language/format may still be `None`, confidence reflects how sure
the classifier is, and `raw_tokens` holds the filename fragments
under analysis. `SubtitleScanResult` says that directly. Pure rename
with a refreshed docstring in `alfred/domain/subtitles/entities.py`;
no behavior change. Touches the domain entity + `__init__` export,
the matcher / identifier / utils services, the manage_subtitles use
case, the placer, the metadata store, the shared-media cross-ref
comment, and the seven test modules that imported the type.
- **`ParsedRelease` is now frozen; enrichment passes return new
instances.** The VO was mutable so `detect_media_type` and
`enrich_from_probe` could patch fields in place — a code smell in a
value object whose identity *is* its content. `ParsedRelease` is now
`@dataclass(frozen=True)`; `languages` is a `tuple[str, ...]`
instead of a `list[str]`. `enrich_from_probe` returns a new
`ParsedRelease` via `dataclasses.replace` (only allocates when at
least one field actually changed). `inspect_release` rebinds
`parsed` after both `detect_media_type` (wrapped in `MediaTypeToken`
to satisfy the strict isinstance check that now also runs on
replace) and `enrich_from_probe`. Parser pipeline now packs
`languages` as a tuple in the assemble dict. Callers updated:
`inspect_release`, `testing/recognize_folders_in_downloads.py`, and
the enrichment tests (22 call sites + language assertions switched
to tuple literals).
- **`resolve_destination` use cases take `kb` / `prober` as required
params; module-level singletons gone.** The four
`resolve_{season,episode,movie,series}_destination` use cases now
accept `kb: ReleaseKnowledge` and `prober: MediaProber` as required
arguments, matching the shape of `inspect_release`. The module-level
`_KB = YamlReleaseKnowledge()` and `_PROBER = FfprobeMediaProber()`
singletons that previously lived in
`alfred/application/filesystem/resolve_destination.py` are removed —
the application layer no longer reaches into infrastructure. The
singletons now live at the agent-tools frontier
(`alfred/agent/tools/filesystem.py`), where the LLM-facing wrappers
instantiate them once and thread them through. `analyze_release` no
longer needs the dirty `from ... import _KB` indirection. Tests
inject their own stubs by keyword (`prober=_StubProber(...)`) instead
of monkeypatching a module attribute.
- **`ParsePath` enum renamed to `TokenizationRoute`.** The old name
collided with `pathlib.Path` in code-reading mental models, and was
one letter from `parse_path` (the field that holds the value) — making
it harder than it needed to be to spot the type vs the attribute.
``TokenizationRoute`` says what it actually captures (DIRECT /
SANITIZED / AI = how the name reached the tokenizer), and the class
docstring now spells out the orthogonality with ``Road`` (EASY /
SHITTY / PATH_OF_PAIN, which captures parser confidence on
``ParseReport``). The ``parse_path`` field name stays unchanged —
string values too — so YAML fixtures, the ``analyze_release`` tool
spec, and any external consumer are untouched.
- **`enrich_from_probe` codec mappings moved to YAML.** The three
hard-coded module dicts (`_VIDEO_CODEC_MAP`, `_AUDIO_CODEC_MAP`,
`_CHANNEL_MAP`) translating ffprobe output to scene tokens
(`hevc → x265`, `eac3 → EAC3`, `8 → "7.1"`, …) now live in
`alfred/knowledge/release/probe_mappings.yaml` and are loaded into
`ReleaseKnowledge.probe_mappings` (new port field, populated by
`YamlReleaseKnowledge`). `enrich_from_probe` gains a third `kb`
parameter and reads the maps from there. Aligns with the CLAUDE.md
rule that lookup tables of domain knowledge belong in YAML, not in
Python — and opens the door to a future "learn new codec" pass.
Callers updated: `inspect_release`, `testing/recognize_folders_in_downloads.py`,
and all 22 sites in `tests/application/test_enrich_from_probe.py`.
- **`ParsedRelease.tech_string` is now a derived `@property`**
(`alfred/domain/release/value_objects.py`). It computes
`quality.source.codec` joined by dots on every access, so it stays in
sync with the underlying fields by construction. The stored field is
gone from the dataclass, the dict returned by `assemble()` no longer
carries the key, `parse_release`'s malformed-name fallback drops the
`tech_string=""` kwarg, and `enrich_from_probe` no longer re-derives
it after filling `quality`/`source`/`codec`. Closes the
parser/enrichment double-source-of-truth that `e79ca46` had to fix
reactively. The fixtures runner now injects `tech_string` alongside
`is_season_pack` since `asdict()` skips properties.
- **`RuleScope.level` is now an enum (`RuleScopeLevel`).** The set of
valid levels (global, release_group, movie, show, season, episode)
was documented only in a docstring comment and validated nowhere.
`RuleScopeLevel(str, Enum)` keeps wire compatibility (YAML
serialization, `.value` access) while making the closed set explicit
to type-checkers and IDEs. `to_dict()` emits `.value` strings so
YAML output is unchanged.
- **`FilePath` VO uses `__post_init__` instead of a hand-rolled
`__init__`.** Same public API (accepts `str | Path`), same behavior,
but the dataclass-generated `__init__` is no longer bypassed. One
less smell in the shared VOs.
- **`Language` VO is strict by default; `Language.from_raw()` factory
for normalization.** The previous `__post_init__` mutated `iso` and
`aliases` via `object.__setattr__` on a frozen dataclass — a code
smell hiding behind the dataclass facade. Split: the direct
constructor now rejects un-normalized input (uppercase iso,
whitespace in aliases, etc.), and `Language.from_raw()` handles
arbitrary YAML/user input. Only one caller (LanguageRegistry loading
the ISO YAML) needed migration.
- **`ParsedRelease.normalised` renamed to `clean`.** The field name
promised "dots instead of spaces" but in practice held
`raw - site_tag - apostrophes` — only used by `season_folder_name()`.
Renamed and docstring corrected.
- **`ParsedRelease.media_type` / `parse_path` are strict enums.** The
fields were already typed as `MediaTypeToken` / `ParsePath`, but a
tolerant `__post_init__` coerced raw strings. With both classes
being `(str, Enum)`, the coercion served no purpose. Strict
constructor; `.value` no longer passed at call sites; dropped the
unused `_VALID_MEDIA_TYPES` / `_VALID_PARSE_PATHS` lookup tables.
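
The freeze pattern used for `Movie`, `Episode` and `ParsedRelease` can be sketched in a few lines. `EpisodeSketch`, its fields and its `_key()` identity are hypothetical simplifications; only the mechanisms (`frozen=True, eq=False`, `object.__setattr__` in `__post_init__`, tuple collections, `dataclasses.replace`) come from the entries above.

```python
from dataclasses import FrozenInstanceError, dataclass, replace


@dataclass(frozen=True, eq=False)
class EpisodeSketch:
    season_number: int
    episode_number: int
    title: str
    audio_tracks: tuple[str, ...] = ()

    def __post_init__(self) -> None:
        # Plain assignment raises on a frozen dataclass, so coercion goes
        # through object.__setattr__.
        object.__setattr__(self, "season_number", int(self.season_number))
        object.__setattr__(self, "episode_number", int(self.episode_number))
        object.__setattr__(self, "title", self.title.strip())

    # eq=False keeps these hand-written methods instead of generated ones.
    def __eq__(self, other: object) -> bool:
        if not isinstance(other, EpisodeSketch):
            return NotImplemented
        return self._key() == other._key()

    def __hash__(self) -> int:
        return hash(self._key())

    def _key(self) -> tuple[int, int]:
        # Assumed identity for this sketch: season and episode number.
        return (self.season_number, self.episode_number)


bare = EpisodeSketch("2", "5", "  Pilot ")
# Enrichment rebuilds the entity instead of mutating it.
enriched = replace(bare, audio_tracks=("eng", "fra"))

try:
    bare.title = "Other"
    mutated = True
except FrozenInstanceError:
    mutated = False
```

`replace` goes through `__init__` again, so the `__post_init__` coercion also runs on the rebuilt instance, and the original object is left untouched.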
### Removed
- **`settings.min_movie_size_bytes`** — orphan Pydantic field +
validator. Its only consumer (`MovieService.validate_movie_file`)
had been removed during an earlier refactor. The "real movie vs
sample" rule now lives in extension-based exclusion
(`application/release/supported_media.py`) and PoP. If a size
threshold is ever needed, it'll go in a knowledge YAML, not in
`settings`.
### Internal
- **Flattened `alfred.domain.shared.media/` package into a single
`media.py` module.** The 6-file package (audio, video, subtitle,
info, matching, tracks_mixin + `__init__`) collapsed into one ~250
LoC module. All 12 import sites continue to resolve unchanged
(`from alfred.domain.shared.media import AudioTrack, MediaInfo, …`)
since Python treats `media.py` and `media/__init__.py`
interchangeably for import paths. Easier to scan when the whole
bounded-context fits on one screen.
- **`SubtitleKnowledgeBase` types `language_registry` against the
`LanguageRepository` port** instead of the concrete `LanguageRegistry`
class. The default constructor still instantiates the concrete adapter
when no repository is injected — behaviour is unchanged for existing
callers. Opens the door to in-memory fakes in future tests without
loading the full ISO 639 YAML.
- **Moved `detect_media_type` and `enrich_from_probe` from
`alfred.application.filesystem` to `alfred.application.release`**.
They are inspection-pipeline helpers — their natural home is next to
`inspect_release`, not next to the filesystem use cases. The move
also eliminates a circular-import workaround in
`resolve_destination.py`: `inspect_release` can now be imported at
module top instead of lazily inside `_resolve_parsed`. Public
surface is unchanged for callers that imported the helpers from
their full module paths (the only call sites — `inspect.py`, two
tests, one testing script — were updated in this commit).
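
Typing against the `LanguageRepository` port is what makes in-memory fakes possible. A minimal sketch, assuming simplified signatures: the method names come from the port description, while the string return types, `InMemoryLanguages` and `known_languages` are hypothetical and chosen only to keep the example self-contained.

```python
from typing import Optional, Protocol


class LanguageRepositorySketch(Protocol):
    """Structural port: any object with these methods satisfies it."""

    def from_iso(self, iso: str) -> Optional[str]: ...
    def from_any(self, raw: str) -> Optional[str]: ...
    def all(self) -> list[str]: ...
    def __contains__(self, raw: str) -> bool: ...
    def __len__(self) -> int: ...


class InMemoryLanguages:
    """Test fake backed by a plain dict, with no YAML loading."""

    def __init__(self, aliases_by_iso: dict[str, tuple[str, ...]]) -> None:
        self._aliases_by_iso = dict(aliases_by_iso)

    def from_iso(self, iso: str) -> Optional[str]:
        return iso if iso in self._aliases_by_iso else None

    def from_any(self, raw: str) -> Optional[str]:
        needle = raw.strip().lower()
        for iso, aliases in self._aliases_by_iso.items():
            if needle == iso or needle in aliases:
                return iso
        return None

    def all(self) -> list[str]:
        return sorted(self._aliases_by_iso)

    def __contains__(self, raw: str) -> bool:
        return self.from_any(raw) is not None

    def __len__(self) -> int:
        return len(self._aliases_by_iso)


def known_languages(repo: LanguageRepositorySketch, tokens: list[str]) -> list[str]:
    # Depends on the Protocol only; the concrete class is never imported here.
    found = (repo.from_any(token) for token in tokens)
    return [iso for iso in found if iso is not None]


fake = InMemoryLanguages({"fra": ("french",), "eng": ("english",)})
print(known_languages(fake, ["FRENCH", "1080p", "eng"]))
```

Because the Protocol is structural, the fake needs no inheritance from the port to be accepted by `known_languages`.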
### Added
- **`resolve_*_destination` use cases now consume `inspect_release`**.
`resolve_episode_destination` and `resolve_movie_destination` reuse
their existing `source_file` parameter as the inspection target;
`resolve_season_destination` and `resolve_series_destination` gain
a new **optional** `source_path` parameter (also threaded through
the tool wrappers and YAML specs). When the path exists, ffprobe
data fills tokens missing from the release name (e.g. quality) and
refreshes `tech_string`, so the destination folder / file names
end up more accurate. When the path is omitted or does not exist (back-compat
callers), the use cases fall back to parse-only — same behavior as
before.
### Fixed
- **`enrich_from_probe` now refreshes `tech_string`** after filling
`quality` / `source` / `codec`. Previously the field stayed at its
parser-time value, so filename builders saw stale tech tokens even
after a successful probe. New `TestTechString` class in
`tests/application/test_enrich_from_probe.py` locks the behavior.
### Added
- **`inspect_release` orchestrator + `InspectedResult` VO**
(`alfred/application/release/inspect.py`). Single composition of the
four inspection layers: `parse_release` → `detect_media_type` (patches
`parsed.media_type`) → `find_main_video` (top-level scan) →
`prober.probe` + `enrich_from_probe` when a video exists and the
refined media type isn't in `{"unknown", "other"}`. Returns a frozen
`InspectedResult(parsed, report, source_path, main_video, media_info,
probe_used)` that downstream callers consume directly instead of
rebuilding the same chain. `kb` and `prober` are injected — no
module-level singletons. Never raises.
### Changed
- **`analyze_release` tool now delegates to `inspect_release`** — same
output shape, plus two new fields: `confidence` (0–100) and `road`
(`"easy"` / `"shitty"` / `"path_of_pain"`) surfaced from the parser's
`ParseReport`. The tool spec (`specs/analyze_release.yaml`) documents
both fields so the LLM can route releases by confidence.
- **`MediaProber` port now covers full media probing**: added
`probe(video) -> MediaInfo | None` alongside the existing
`list_subtitle_streams`. `FfprobeMediaProber` (in
`alfred/infrastructure/probe/`) implements both methods and is now
the single adapter shelling out to `ffprobe`. The standalone
`alfred/infrastructure/filesystem/ffprobe.py` module was removed —
all callers (tools, testing scripts) instantiate
`FfprobeMediaProber` instead. Unblocks the upcoming
`inspect_release` orchestrator, which depends on the port.
### Removed
- `alfred/infrastructure/filesystem/ffprobe.py` (folded into the
`FfprobeMediaProber` adapter).
---
## [2026-05-20] — Release parser confidence scoring + exclusion
### Added
- **Pre-pipeline exclusion helpers** (`alfred/application/release/supported_media.py`):
`is_supported_video(path, kb)` (extension-only check against
`kb.video_extensions`) and `find_main_video(folder, kb)` (top-level
scan, lexicographically-first eligible file, returns `None` when no
video qualifies; accepts a bare file as folder for single-file
releases). No size threshold, no filename heuristics —
PATH_OF_PAIN handles the exotic cases. Foundation for the future
`inspect_release` orchestrator.
- **Release parser — parse-confidence scoring** (`alfred/domain/release/parser/scoring.py`,
`alfred/knowledge/release/scoring.yaml`). `parse_release` now returns
`(ParsedRelease, ParseReport)`. The new `ParseReport` frozen VO
carries a 0–100 `confidence`, a `road` (`"easy"` / `"shitty"` /
`"path_of_pain"`), the residual UNKNOWN tokens, and the missing
critical fields. EASY is decided structurally (a group schema
matched); SHITTY vs PATH_OF_PAIN is decided by score against a
YAML-configurable cutoff (default 60). Weights and penalties also
live in `scoring.yaml` — title 30, media_type 20, year 15, season
10, episode 5, tech 5 each; penalty 5 per UNKNOWN token capped at
-30. `Road` is a new enum, distinct from `ParsePath` (which records
the tokenization route, not the confidence tier). `ReleaseKnowledge`
port gains a `scoring: dict` field.
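
The scoring arithmetic can be worked through with a short sketch. The weights, penalty, cap and cutoff are the defaults quoted above; the function names, the clamp to 0..100, the treatment of "tech" as a count of recognized technical fields, and the inclusive comparison against the cutoff are assumptions of this illustration, not the contents of `scoring.py`.

```python
WEIGHTS = {"title": 30, "media_type": 20, "year": 15, "season": 10, "episode": 5}
TECH_WEIGHT = 5        # applied once per recognized technical field
UNKNOWN_PENALTY = 5    # per residual UNKNOWN token
MAX_PENALTY = 30       # cap on the total penalty
CUTOFF = 60            # SHITTY vs PATH_OF_PAIN threshold


def confidence(present: set[str], tech_fields: int, unknown_tokens: int) -> int:
    base = sum(weight for field, weight in WEIGHTS.items() if field in present)
    base += TECH_WEIGHT * tech_fields
    penalty = min(UNKNOWN_PENALTY * unknown_tokens, MAX_PENALTY)
    # Clamp so the result stays within the documented 0..100 range.
    return max(0, min(100, base - penalty))


def road(schema_matched: bool, score: int) -> str:
    # EASY is structural: a matched group schema wins regardless of score.
    if schema_matched:
        return "easy"
    return "shitty" if score >= CUTOFF else "path_of_pain"


# Movie with title, media type, year, three tech fields and one stray token:
# 30 + 20 + 15 + 3 * 5 - 5 = 75
movie_score = confidence({"title", "media_type", "year"}, 3, 1)
# Title only, eight stray tokens: 30 - min(40, 30) = 0
messy_score = confidence({"title"}, 0, 8)
print(movie_score, road(False, movie_score))
print(messy_score, road(False, messy_score))
```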
### Changed
- **`parse_release` signature** is now `(name, kb) → tuple[ParsedRelease,
ParseReport]` instead of returning a bare `ParsedRelease`. Call
sites updated in `application/filesystem/resolve_destination.py` and
`agent/tools/filesystem.py`. Tests updated accordingly.
---
## [2026-05-20] — Release parser v2 (EASY + SHITTY)
### Added
- **Release parser v2 — EASY path live** (`alfred/domain/release/parser/`):
new annotate-based pipeline (tokenize → annotate → assemble) drives
releases from known groups. Exposes `Token` (frozen VO with `index` +
`role` + `extra`), `TokenRole` enum (structural/technical/meta families),
and `GroupSchema` / `SchemaChunk` value objects.
- `pipeline.tokenize`: string-ops separator split (no regex), strips
a `[site.tag]` prefix/suffix first.
- `pipeline.annotate`: detects the trailing group right-to-left
(priority to `codec-GROUP` shape, fallback to any non-source dashed
token), looks up its `GroupSchema`, then walks tokens and schema
chunks in lockstep — optional chunks that don't match are skipped,
mandatory mismatches abort EASY and return `None` so the caller can
fall back to SHITTY.
- `pipeline.assemble`: folds annotated tokens into a
`ParsedRelease`-compatible dict.
- `parse_release` (in `release.services`) tries the v2 EASY path first
and falls through to the legacy SHITTY heuristic on `None`. Legacy
SHITTY/PATH OF PAIN behavior is unchanged.
- Knowledge: `alfred/knowledge/release/release_groups/{kontrast,elite,
rarbg}.yaml` declare the canonical chunk order per group, loaded via
new `ReleaseKnowledge.group_schema(name)` port method.
- Tests in `tests/domain/release/test_parser_v2_{scaffolding,easy}.py`
cover token VOs, site-tag stripping, group detection, schema-driven
annotation (movie, TV episode, season pack with optional source),
and field assembly.
- **Release parser v2 — enricher pass** completes the EASY pipeline.
The structural schema walk now tolerates non-positional tokens
between chunks (instead of aborting on leftover tokens), and a second
pass tags them with audio / video-meta / edition / language roles.
Multi-token sequences from `audio.yaml`, `video.yaml`, `editions.yaml`
(e.g. `DTS.HD.MA`, `DV.HDR10`, `TrueHD.Atmos`, `DIRECTORS.CUT`) are
matched before single tokens. Channel layouts like `5.1` and `7.1`
(split into two tokens by the `.` separator) are detected as
consecutive pairs. Sequence members carry an `extra["sequence_member"]`
marker so `assemble` extracts the canonical value only from the
primary token. KONTRAST releases with audio / HDR / edition / language
metadata now produce a fully populated `ParsedRelease`.
- **Streaming distributor as a separate dimension** from encoding source.
New `alfred/knowledge/release/distributors.yaml` (NF, AMZN, DSNP, HMAX,
ATVP, HULU, PCOK, PMTP, CR) feeds a new `ReleaseKnowledge.distributors`
port field, a `TokenRole.DISTRIBUTOR` annotation, and a
`ParsedRelease.distributor` field. `WEB-DL` stays the source; the
platform that produced the release is now recorded distinctly. The
five entries (NF, AMZN, DSNP, HMAX, ATVP) were correspondingly removed
from `sources.yaml`.
- **Real-world release fixtures** under `tests/fixtures/releases/{easy,shitty,path_of_pain}/`,
each documenting an expected `ParsedRelease` plus the future `routing`
(library / torrents / seed_hardlinks) for the upcoming `organize_media`
@@ -54,6 +418,22 @@ callers).
### Changed
- **Release parser v2 — SHITTY simplified to dict-driven tagging**.
The legacy ~480-line heuristic block in `release/services.py` is gone;
`pipeline._annotate_shitty` does a single pass that looks each token
up in the kb buckets (resolutions / sources / codecs / distributors /
year / `SxxExx`) with first-match-wins semantics, and the leftmost
contiguous UNKNOWN run becomes the title. `annotate()` no longer
returns `None` — SHITTY is the always-on fallback when no group schema
matches. `services.py` shrunk from ~525 to ~85 lines. Four fixtures
(`deutschland_franchise_box`, `sleaford_yt_slug`,
`super_mario_bilingual`, `predator_space_separators` — the last one
moved from `shitty/` → `path_of_pain/`) are now marked
`pytest.mark.xfail(strict=False)` documenting PoP-grade pathologies
that SHITTY intentionally won't handle. `ReleaseFixture` grows an
`xfail_reason` field; the parametrized suite wires the xfail mark
automatically.
- **`parse_release` tokenizer is now data-driven**: it splits on any character
listed in `separators.yaml` (regex character class) instead of `name.split(".")`.
This makes YTS-style releases (`The Father (2020) [1080p] [WEBRip] [5.1] [YTS.MX]`),
@@ -184,6 +564,47 @@ callers).
globally — noisy on parser mappers and orchestrator use-cases where early-return
validation is essential complexity. Ignore `PLW0603` for the documented memory
singleton (`infrastructure/persistence/context.py`).
- **Release-knowledge DDD purification** (`refactor/domain-release-knowledge`):
the last domain → infrastructure leak (`domain/release/value_objects.py`
loading YAML at import-time) is gone. Achieved via:
- **`ReleaseKnowledge` Protocol port** at
`alfred/domain/release/ports/knowledge.py` declares the read-only query
surface release parsing needs (token sets for resolutions, sources, codecs,
languages, hdr extras; structured dicts for audio, video_meta, editions,
media_type_tokens; separators list; file-extension sets used by
application/infra callers; `sanitize_for_fs(text)` method).
- **`YamlReleaseKnowledge` adapter** at
`alfred/infrastructure/knowledge/release_kb.py` loads every YAML constant
once at construction. Builds an immutable `str.maketrans` translation
table for filesystem sanitization.
- **`parse_release(name, kb)`** takes the knowledge as an explicit
parameter — no more module-level YAML loading inside the domain. Every
internal helper (`_tokenize`, `_extract_tech`, `_extract_languages`,
`_extract_audio`, `_extract_video_meta`, `_extract_edition`,
`_extract_title`, `_infer_media_type`, `_is_well_formed`) takes `kb`.
- **`ParsedRelease` Option B**: sanitization happens once at parse time
and is stored on a new `title_sanitized: str` field. Builder methods
(`show_folder_name`, `season_folder_name`, `episode_filename`,
`movie_folder_name`, `movie_filename`) are now pure — they accept
already-sanitized `tmdb_title_safe` / `tmdb_episode_title_safe`
arguments. Callers at the use-case boundary sanitize TMDB strings
via `kb.sanitize_for_fs(...)` before passing them in.
- **All domain-knowledge constants removed from `value_objects.py`**:
`_RESOLUTIONS`, `_SOURCES`, `_CODECS`, `_AUDIO`, `_VIDEO_META`,
`_EDITIONS`, `_HDR_EXTRA`, `_MEDIA_TYPE_TOKENS`, `_LANGUAGE_TOKENS`,
`_FORBIDDEN_CHARS`, `_VIDEO_EXTENSIONS`, `_NON_VIDEO_EXTENSIONS`,
`_SUBTITLE_EXTENSIONS`, `_METADATA_EXTENSIONS`, `_WIN_FORBIDDEN_TABLE`,
and the `_sanitize_for_fs` helper. The domain module is now pure.
- **Application-layer KB singleton**: `resolve_destination.py` instantiates
a module-level `_KB: ReleaseKnowledge = YamlReleaseKnowledge()` and
threads it through every `parse_release(...)` call. The local
`_sanitize` helper and `_WIN_FORBIDDEN` regex were dropped in favor of
`_KB.sanitize_for_fs(...)`.
- **`detect_media_type(parsed, source_path, kb)` and
`find_video_file(path, kb)`** now take the knowledge explicitly
instead of importing `_*_EXTENSIONS` constants from the domain.
`agent/tools/filesystem.py::analyze_release` imports the application
KB singleton and passes it through.
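
The translation-table approach to filesystem sanitization can be sketched as below. The forbidden-character set and the choice to delete rather than substitute are assumptions for illustration; the real values live in the knowledge YAML and are not shown in this changelog.

```python
# Hypothetical forbidden set (the characters Windows rejects in file names).
FORBIDDEN_CHARS = '<>:"/\\|?*'

# Built once; the third argument of str.maketrans lists characters to delete.
_FS_TABLE = str.maketrans("", "", FORBIDDEN_CHARS)


def sanitize_for_fs(text: str) -> str:
    """Drop every forbidden character in a single pass over the string."""
    return text.translate(_FS_TABLE)


print(sanitize_for_fs("What If...?: Part 1"))  # What If... Part 1
```

Building the table once at construction avoids recompiling a pattern or looping over the forbidden characters on every call.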
---
+35 -23
@@ -13,8 +13,6 @@ from alfred.application.filesystem import (
     MoveMediaUseCase,
     SetFolderPathUseCase,
 )
-from alfred.application.filesystem.detect_media_type import detect_media_type
-from alfred.application.filesystem.enrich_from_probe import enrich_from_probe
 from alfred.application.filesystem.resolve_destination import (
     resolve_episode_destination as _resolve_episode_destination,
 )
@@ -28,10 +26,16 @@ from alfred.application.filesystem.resolve_destination import (
     resolve_series_destination as _resolve_series_destination,
 )
 from alfred.infrastructure.filesystem import FileManager, create_folder, move
-from alfred.infrastructure.filesystem.ffprobe import probe
-from alfred.infrastructure.filesystem.find_video import find_video_file
+from alfred.infrastructure.knowledge.release_kb import YamlReleaseKnowledge
 from alfred.infrastructure.metadata import MetadataStore
 from alfred.infrastructure.persistence import get_memory
+from alfred.infrastructure.probe import FfprobeMediaProber
+# Agent-tools frontier: this is the legitimate home for the singletons that
+# back every LLM-exposed wrapper. The use cases below take ``kb`` / ``prober``
+# as required params; tests inject their own stubs.
+_KB = YamlReleaseKnowledge()
+_PROBER = FfprobeMediaProber()
 _LEARNED_ROOT = Path(_alfred_pkg.__file__).parent.parent / "data" / "knowledge"
@@ -57,10 +61,17 @@ def resolve_season_destination(
     tmdb_title: str,
     tmdb_year: int,
     confirmed_folder: str | None = None,
+    source_path: str | None = None,
 ) -> dict[str, Any]:
     """Thin tool wrapper — semantics live in alfred/agent/tools/specs/resolve_season_destination.yaml."""
     return _resolve_season_destination(
-        release_name, tmdb_title, tmdb_year, confirmed_folder
+        release_name,
+        tmdb_title,
+        tmdb_year,
+        _KB,
+        _PROBER,
+        confirmed_folder,
+        source_path,
     ).to_dict()
@@ -78,6 +89,8 @@ def resolve_episode_destination(
source_file,
tmdb_title,
tmdb_year,
_KB,
_PROBER,
tmdb_episode_title,
confirmed_folder,
).to_dict()
@@ -91,7 +104,7 @@ def resolve_movie_destination(
) -> dict[str, Any]:
"""Thin tool wrapper — semantics live in alfred/agent/tools/specs/resolve_movie_destination.yaml."""
return _resolve_movie_destination(
release_name, source_file, tmdb_title, tmdb_year
release_name, source_file, tmdb_title, tmdb_year, _KB, _PROBER
).to_dict()
@@ -100,10 +113,17 @@ def resolve_series_destination(
tmdb_title: str,
tmdb_year: int,
confirmed_folder: str | None = None,
source_path: str | None = None,
) -> dict[str, Any]:
"""Thin tool wrapper — semantics live in alfred/agent/tools/specs/resolve_series_destination.yaml."""
return _resolve_series_destination(
release_name, tmdb_title, tmdb_year, confirmed_folder
release_name,
tmdb_title,
tmdb_year,
_KB,
_PROBER,
confirmed_folder,
source_path,
).to_dict()
@@ -190,21 +210,10 @@ def set_path_for_folder(folder_name: str, path_value: str) -> dict[str, Any]:
def analyze_release(release_name: str, source_path: str) -> dict[str, Any]:
"""Thin tool wrapper — semantics live in alfred/agent/tools/specs/analyze_release.yaml."""
from alfred.domain.release.services import parse_release # noqa: PLC0415
path = Path(source_path)
parsed = parse_release(release_name)
parsed.media_type = detect_media_type(parsed, path)
probe_used = False
if parsed.media_type not in ("unknown", "other"):
video_file = find_video_file(path)
if video_file:
media_info = probe(video_file)
if media_info:
enrich_from_probe(parsed, media_info)
probe_used = True
from alfred.application.release import inspect_release # noqa: PLC0415
result = inspect_release(release_name, Path(source_path), _KB, _PROBER)
parsed = result.parsed
return {
"status": "ok",
"media_type": parsed.media_type,
@@ -226,7 +235,10 @@ def analyze_release(release_name: str, source_path: str) -> dict[str, Any]:
"edition": parsed.edition,
"site_tag": parsed.site_tag,
"is_season_pack": parsed.is_season_pack,
"probe_used": probe_used,
"probe_used": result.probe_used,
"confidence": result.report.confidence,
"road": result.report.road,
"recommended_action": result.recommended_action,
}
@@ -240,7 +252,7 @@ def probe_media(source_path: str) -> dict[str, Any]:
"message": f"{source_path} does not exist",
}
media_info = probe(path)
media_info = _PROBER.probe(path)
if media_info is None:
return {
"status": "error",
@@ -80,3 +80,6 @@ returns:
site_tag: Source-site tag if present.
is_season_pack: True when the folder contains a full season.
probe_used: True when ffprobe successfully enriched the result.
confidence: Parser confidence score, 0-100 (higher = more reliable).
road: "Parser road: 'easy' (group schema matched), 'shitty' (heuristic but acceptable), or 'path_of_pain' (low confidence — ask the user before auto-routing)."
recommended_action: "Orchestrator hint: 'process' (go straight to resolve_*_destination), 'ask_user' (media_type unknown or road=path_of_pain — confirm with the user first), or 'skip' (no main video, or media_type=other — nothing to organize)."
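
One way a caller might act on these three values, as a sketch only. The `next_step` function and the labels it returns are hypothetical and not part of the spec; only the field name and its three documented values come from the text above.

```python
from typing import Any, Dict


def next_step(analysis: Dict[str, Any]) -> str:
    """Map analyze_release's recommended_action onto a coarse next move."""
    action = analysis.get("recommended_action")
    if action == "skip":
        return "ignore"  # no main video, or media_type=other
    if action == "ask_user":
        return "confirm_with_user"  # media_type unknown or road=path_of_pain
    if action == "process":
        return "resolve_destination"
    # Missing or unrecognized hint: stay conservative and ask.
    return "confirm_with_user"


print(next_step({"status": "ok", "recommended_action": "process"}))
```

Falling back to a confirmation for unexpected values is a design choice of this sketch, not something the spec prescribes.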
@@ -61,6 +61,17 @@ parameters:
one.
example: Oz.1997.1080p.WEBRip.x265-KONTRAST
source_path:
description: |
Absolute path to the release folder on disk. Optional.
why_needed: |
When provided, the tool runs ffprobe on the main video inside the
folder and uses the probe data to fill quality/codec tokens that
may be missing from the release name. The enriched tech tokens
end up in the destination folder name, so providing source_path
gives more accurate names for releases with sparse metadata.
example: /downloads/Oz.S03.1080p.WEBRip.x265-KONTRAST
returns:
ok:
description: Paths resolved unambiguously; ready to move.
@@ -56,6 +56,16 @@ parameters:
Forces the use case to use this exact folder name and skip detection.
example: The.Wire.2002.1080p.BluRay.x265-GROUP
source_path:
description: |
Absolute path to the release folder on disk. Optional.
why_needed: |
When provided, the tool runs ffprobe on the main video inside the
folder and uses probe data to fill quality/codec tokens that may
be missing from the release name, producing a more accurate
destination folder name.
example: /downloads/The.Wire.S01-S05.1080p.BluRay.x265-GROUP
returns:
ok:
description: Path resolved; ready to move the pack.
@@ -1,82 +0,0 @@
"""enrich_from_probe — fill missing ParsedRelease fields from MediaInfo."""
from __future__ import annotations
from alfred.domain.release.value_objects import ParsedRelease
from alfred.domain.shared.media import MediaInfo
# Map ffprobe codec names to scene-style codec tokens
_VIDEO_CODEC_MAP = {
"hevc": "x265",
"h264": "x264",
"h265": "x265",
"av1": "AV1",
"vp9": "VP9",
"mpeg4": "XviD",
}
# Map ffprobe audio codec names to scene-style tokens
_AUDIO_CODEC_MAP = {
"eac3": "EAC3",
"ac3": "AC3",
"dts": "DTS",
"truehd": "TrueHD",
"aac": "AAC",
"flac": "FLAC",
"opus": "OPUS",
"mp3": "MP3",
"pcm_s16l": "PCM",
"pcm_s24l": "PCM",
}
# Map channel count to standard layout string
_CHANNEL_MAP = {
8: "7.1",
6: "5.1",
2: "2.0",
1: "1.0",
}
def enrich_from_probe(parsed: ParsedRelease, info: MediaInfo) -> None:
"""
Fill None fields in parsed using data from ffprobe MediaInfo.
Only overwrites fields that are currently None — token-level values
from the release name always take priority.
Mutates parsed in place.
"""
if parsed.quality is None and info.resolution:
parsed.quality = info.resolution
if parsed.codec is None and info.video_codec:
parsed.codec = _VIDEO_CODEC_MAP.get(
info.video_codec.lower(), info.video_codec.upper()
)
if parsed.bit_depth is None and info.video_codec:
# ffprobe exposes bit depth via pix_fmt — not in MediaInfo yet, skip for now
pass
# Audio — use the default track, fallback to first
default_track = next((t for t in info.audio_tracks if t.is_default), None)
track = default_track or (info.audio_tracks[0] if info.audio_tracks else None)
if track:
if parsed.audio_codec is None and track.codec:
parsed.audio_codec = _AUDIO_CODEC_MAP.get(
track.codec.lower(), track.codec.upper()
)
if parsed.audio_channels is None and track.channels:
parsed.audio_channels = _CHANNEL_MAP.get(
track.channels, f"{track.channels}ch"
)
# Languages — merge ffprobe languages with token-level ones
# "und" = undetermined, not useful
if info.audio_languages:
existing = set(parsed.languages)
for lang in info.audio_languages:
if lang.lower() != "und" and lang.upper() not in existing:
parsed.languages.append(lang)
@@ -4,7 +4,7 @@ import logging
from pathlib import Path
from alfred.domain.shared.value_objects import ImdbId
from alfred.domain.subtitles.entities import SubtitleCandidate
from alfred.domain.subtitles.entities import SubtitleScanResult
from alfred.domain.subtitles.services.identifier import SubtitleIdentifier
from alfred.domain.subtitles.services.matcher import SubtitleMatcher
from alfred.domain.subtitles.services.pattern_detector import PatternDetector
@@ -278,7 +278,7 @@ class ManageSubtitlesUseCase:
def _to_unresolved_dto(
track: SubtitleCandidate, min_confidence: float = 0.7
track: SubtitleScanResult, min_confidence: float = 0.7
) -> UnresolvedTrack:
reason = "unknown_language" if track.language is None else "low_confidence"
return UnresolvedTrack(
@@ -291,10 +291,10 @@ def _to_unresolved_dto(
def _pair_placed_with_tracks(
placed: list[PlacedTrack],
tracks: list[SubtitleCandidate],
) -> list[tuple[PlacedTrack, SubtitleCandidate]]:
tracks: list[SubtitleScanResult],
) -> list[tuple[PlacedTrack, SubtitleScanResult]]:
"""
Pair each PlacedTrack with its originating SubtitleCandidate by source path.
Pair each PlacedTrack with its originating SubtitleScanResult by source path.
Falls back to positional matching if paths don't align.
"""
track_by_path = {t.file_path: t for t in tracks if t.file_path}
@@ -8,34 +8,58 @@ Four distinct use cases, one per release type:
- resolve_series_destination : complete series multi-season pack (folder move)
Each returns a dedicated DTO with only the fields that make sense for that type.
These use cases follow Option B of the snapshot-VO design: ``ParsedRelease``
arrives with ``title_sanitized`` already computed, and TMDB-supplied strings
are sanitized **at the use-case boundary** (here) before being passed into
``ParsedRelease`` builder methods. The builders themselves perform no I/O and
no sanitization.
"""
from __future__ import annotations
import logging
import re
from dataclasses import dataclass
from pathlib import Path
from alfred.application.release import inspect_release
from alfred.domain.release import parse_release
from alfred.domain.release.ports import ReleaseKnowledge
from alfred.domain.release.value_objects import ParsedRelease
from alfred.domain.shared.ports import MediaProber
from alfred.infrastructure.persistence import get_memory
logger = logging.getLogger(__name__)
_WIN_FORBIDDEN = re.compile(r'[?:*"<>|\\]')
def _resolve_parsed(
release_name: str,
source_path: str | None,
kb: ReleaseKnowledge,
prober: MediaProber,
) -> ParsedRelease:
"""Pick the right entry point depending on whether we have a path.
def _sanitize(text: str) -> str:
return _WIN_FORBIDDEN.sub("", text)
When ``source_path`` is provided and points to something that exists,
we run the full inspection pipeline so probe data can refresh tech
fields (which feed every filename builder). Otherwise we fall back
to a parse-only path — same behavior as before.
"""
if source_path:
path = Path(source_path)
if path.exists():
return inspect_release(release_name, path, kb, prober).parsed
parsed, _ = parse_release(release_name, kb)
return parsed
def _find_existing_tvshow_folders(
tv_root: Path, tmdb_title: str, tmdb_year: int
tv_root: Path, tmdb_title_safe: str, tmdb_year: int
) -> list[str]:
"""Return folder names in tv_root that match title + year prefix."""
if not tv_root.exists():
return []
clean_title = _sanitize(tmdb_title).replace(" ", ".")
clean_title = tmdb_title_safe.replace(" ", ".")
prefix = f"{clean_title}.{tmdb_year}".lower()
return sorted(
entry.name
@@ -66,6 +90,7 @@ class _Clarification:
def _resolve_series_folder(
tv_root: Path,
tmdb_title: str,
tmdb_title_safe: str,
tmdb_year: int,
computed_name: str,
confirmed_folder: str | None,
@@ -80,7 +105,7 @@ def _resolve_series_folder(
if confirmed_folder:
return confirmed_folder, not (tv_root / confirmed_folder).exists()
existing = _find_existing_tvshow_folders(tv_root, tmdb_title, tmdb_year)
existing = _find_existing_tvshow_folders(tv_root, tmdb_title_safe, tmdb_year)
if not existing:
return computed_name, True
@@ -230,13 +255,20 @@ def resolve_season_destination(
release_name: str,
tmdb_title: str,
tmdb_year: int,
kb: ReleaseKnowledge,
prober: MediaProber,
confirmed_folder: str | None = None,
source_path: str | None = None,
) -> ResolvedSeasonDestination:
"""
Compute destination paths for a season pack.
Returns series_folder + season_folder. No file paths — the whole
source folder is moved as-is into season_folder.
When ``source_path`` points to the release on disk, the parser is
augmented with ffprobe data so tech tokens missing from the release
name (quality / codec) end up in the folder names.
"""
tv_root = _get_tv_root()
if not tv_root:
@@ -246,11 +278,12 @@ def resolve_season_destination(
message="TV show library path is not configured.",
)
parsed = parse_release(release_name)
computed_name = _sanitize(parsed.show_folder_name(tmdb_title, tmdb_year))
parsed = _resolve_parsed(release_name, source_path, kb, prober)
tmdb_title_safe = kb.sanitize_for_fs(tmdb_title)
computed_name = parsed.show_folder_name(tmdb_title_safe, tmdb_year)
resolved = _resolve_series_folder(
tv_root, tmdb_title, tmdb_year, computed_name, confirmed_folder
tv_root, tmdb_title, tmdb_title_safe, tmdb_year, computed_name, confirmed_folder
)
if isinstance(resolved, _Clarification):
return ResolvedSeasonDestination(
@@ -279,6 +312,8 @@ def resolve_episode_destination(
source_file: str,
tmdb_title: str,
tmdb_year: int,
kb: ReleaseKnowledge,
prober: MediaProber,
tmdb_episode_title: str | None = None,
confirmed_folder: str | None = None,
) -> ResolvedEpisodeDestination:
@@ -286,6 +321,8 @@ def resolve_episode_destination(
Compute destination paths for a single episode file.
Returns series_folder + season_folder + library_file (full path to .mkv).
``source_file`` doubles as the inspection target — when it exists,
ffprobe enrichment refreshes tech tokens missing from the release name.
"""
tv_root = _get_tv_root()
if not tv_root:
@@ -295,12 +332,16 @@ def resolve_episode_destination(
message="TV show library path is not configured.",
)
parsed = parse_release(release_name)
parsed = _resolve_parsed(release_name, source_file, kb, prober)
ext = Path(source_file).suffix
computed_name = _sanitize(parsed.show_folder_name(tmdb_title, tmdb_year))
tmdb_title_safe = kb.sanitize_for_fs(tmdb_title)
tmdb_episode_title_safe = (
kb.sanitize_for_fs(tmdb_episode_title) if tmdb_episode_title else None
)
computed_name = parsed.show_folder_name(tmdb_title_safe, tmdb_year)
resolved = _resolve_series_folder(
tv_root, tmdb_title, tmdb_year, computed_name, confirmed_folder
tv_root, tmdb_title, tmdb_title_safe, tmdb_year, computed_name, confirmed_folder
)
if isinstance(resolved, _Clarification):
return ResolvedEpisodeDestination(
@@ -311,7 +352,7 @@ def resolve_episode_destination(
series_folder_name, is_new = resolved
season_folder_name = parsed.season_folder_name()
filename = _sanitize(parsed.episode_filename(tmdb_episode_title, ext))
filename = parsed.episode_filename(tmdb_episode_title_safe, ext)
series_path = tv_root / series_folder_name
season_path = series_path / season_folder_name
@@ -334,11 +375,15 @@ def resolve_movie_destination(
source_file: str,
tmdb_title: str,
tmdb_year: int,
kb: ReleaseKnowledge,
prober: MediaProber,
) -> ResolvedMovieDestination:
"""
Compute destination paths for a movie file.
Returns movie_folder + library_file (full path to .mkv).
``source_file`` doubles as the inspection target — when it exists,
ffprobe enrichment refreshes tech tokens missing from the release name.
"""
memory = get_memory()
movies_root = memory.ltm.library_paths.get("movie")
@@ -349,11 +394,12 @@ def resolve_movie_destination(
message="Movie library path is not configured.",
)
parsed = parse_release(release_name)
parsed = _resolve_parsed(release_name, source_file, kb, prober)
ext = Path(source_file).suffix
tmdb_title_safe = kb.sanitize_for_fs(tmdb_title)
folder_name = _sanitize(parsed.movie_folder_name(tmdb_title, tmdb_year))
filename = _sanitize(parsed.movie_filename(tmdb_title, tmdb_year, ext))
folder_name = parsed.movie_folder_name(tmdb_title_safe, tmdb_year)
filename = parsed.movie_filename(tmdb_title_safe, tmdb_year, ext)
folder_path = Path(movies_root) / folder_name
file_path = folder_path / filename
@@ -372,12 +418,18 @@ def resolve_series_destination(
release_name: str,
tmdb_title: str,
tmdb_year: int,
kb: ReleaseKnowledge,
prober: MediaProber,
confirmed_folder: str | None = None,
source_path: str | None = None,
) -> ResolvedSeriesDestination:
"""
Compute destination path for a complete multi-season series pack.
Returns only series_folder — the whole pack lands directly inside it.
When ``source_path`` points to the release on disk, ffprobe
enrichment refreshes tech tokens missing from the release name.
"""
tv_root = _get_tv_root()
if not tv_root:
@@ -387,11 +439,12 @@ def resolve_series_destination(
message="TV show library path is not configured.",
)
parsed = parse_release(release_name)
computed_name = _sanitize(parsed.show_folder_name(tmdb_title, tmdb_year))
parsed = _resolve_parsed(release_name, source_path, kb, prober)
tmdb_title_safe = kb.sanitize_for_fs(tmdb_title)
computed_name = parsed.show_folder_name(tmdb_title_safe, tmdb_year)
resolved = _resolve_series_folder(
tv_root, tmdb_title, tmdb_year, computed_name, confirmed_folder
tv_root, tmdb_title, tmdb_title_safe, tmdb_year, computed_name, confirmed_folder
)
if isinstance(resolved, _Clarification):
return ResolvedSeriesDestination(
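
The boundary-sanitization idea used throughout this file can be sketched in isolation. This is a minimal illustration under stated assumptions: `sanitize_for_fs` here simply reuses the character class from the removed `_WIN_FORBIDDEN` pattern (the real rules live behind `kb.sanitize_for_fs` and may differ), and `show_folder_name` is a hypothetical stand-in for the `ParsedRelease` builders.

```python
import re

# Same character class as the removed module-level _WIN_FORBIDDEN pattern.
_FORBIDDEN = re.compile(r'[?:*"<>|\\]')


def sanitize_for_fs(text: str) -> str:
    """Stand-in for kb.sanitize_for_fs (assumed behavior, not the real one)."""
    return _FORBIDDEN.sub("", text)


def show_folder_name(title_safe: str, year: int, tech: str) -> str:
    """Hypothetical pure builder: assumes inputs are already filesystem-safe."""
    return f"{title_safe.replace(' ', '.')}.{year}.{tech}"


# Sanitize once, at the boundary, then reuse the safe value everywhere.
tmdb_title_safe = sanitize_for_fs("Who Is America?")
folder = show_folder_name(tmdb_title_safe, 2018, "1080p.WEBRip.x265-GROUP")
print(folder)
```

Keeping the builder free of sanitization means the same safe title can feed both the folder lookup and the name builders without being cleaned twice.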
+20
@@ -0,0 +1,20 @@
"""Release application layer — orchestrators sitting between domain
parsing and infrastructure I/O.
Public surface:
- :func:`is_supported_video` / :func:`find_main_video` — pre-pipeline
filesystem helpers (extension-only filtering, top-level video pick).
- :func:`inspect_release` / :class:`InspectedResult` — full inspection
pipeline combining parse + filesystem refinement + probe enrichment.
"""
from .inspect import InspectedResult, inspect_release
from .supported_media import find_main_video, is_supported_video
__all__ = [
"InspectedResult",
"find_main_video",
"inspect_release",
"is_supported_video",
]
@@ -19,15 +19,13 @@ from __future__ import annotations
from pathlib import Path
from alfred.domain.release.value_objects import (
_METADATA_EXTENSIONS,
_NON_VIDEO_EXTENSIONS,
_VIDEO_EXTENSIONS,
ParsedRelease,
)
from alfred.domain.release.ports import ReleaseKnowledge
from alfred.domain.release.value_objects import ParsedRelease
def detect_media_type(parsed: ParsedRelease, source_path: Path) -> str:
def detect_media_type(
parsed: ParsedRelease, source_path: Path, kb: ReleaseKnowledge
) -> str:
"""
Return a refined media_type string for the given source_path.
@@ -37,10 +35,10 @@ def detect_media_type(parsed: ParsedRelease, source_path: Path) -> str:
extensions = _collect_extensions(source_path)
# Metadata extensions (.nfo, .srt, …) are always present alongside releases
# and must not influence the type decision.
conclusive = extensions - _METADATA_EXTENSIONS
conclusive = extensions - kb.metadata_extensions
has_video = bool(conclusive & _VIDEO_EXTENSIONS)
has_non_video = bool(conclusive & _NON_VIDEO_EXTENSIONS)
has_video = bool(conclusive & kb.video_extensions)
has_non_video = bool(conclusive & kb.non_video_extensions)
if has_video and has_non_video:
return "unknown"
@@ -0,0 +1,74 @@
"""enrich_from_probe — fill missing ParsedRelease fields from MediaInfo."""
from __future__ import annotations
from dataclasses import replace
from alfred.domain.release.ports import ReleaseKnowledge
from alfred.domain.release.value_objects import ParsedRelease
from alfred.domain.shared.media import MediaInfo
def enrich_from_probe(
parsed: ParsedRelease, info: MediaInfo, kb: ReleaseKnowledge
) -> ParsedRelease:
"""
Return a new ParsedRelease with None fields filled from ffprobe MediaInfo.
Only overwrites fields that are currently None — token-level values
from the release name always take priority. ``ParsedRelease`` is
frozen; this returns a new instance via :func:`dataclasses.replace`.
Translation tables (ffprobe codec name → scene token, channel count
→ layout) live in ``kb.probe_mappings`` (loaded from
``alfred/knowledge/release/probe_mappings.yaml``). When ffprobe
reports a value with no mapping entry, the fallback is the uppercase
raw value so unknown codecs still surface in a predictable form.
"""
mappings = kb.probe_mappings
video_codec_map: dict[str, str] = mappings.get("video_codec", {})
audio_codec_map: dict[str, str] = mappings.get("audio_codec", {})
channel_map: dict[int, str] = mappings.get("audio_channels", {})
updates: dict[str, object] = {}
if parsed.quality is None and info.resolution:
updates["quality"] = info.resolution
if parsed.codec is None and info.video_codec:
updates["codec"] = video_codec_map.get(
info.video_codec.lower(), info.video_codec.upper()
)
# bit_depth: ffprobe exposes it via pix_fmt — not in MediaInfo yet, skip.
# Audio — use the default track, fallback to first
default_track = next((t for t in info.audio_tracks if t.is_default), None)
track = default_track or (info.audio_tracks[0] if info.audio_tracks else None)
if track:
if parsed.audio_codec is None and track.codec:
updates["audio_codec"] = audio_codec_map.get(
track.codec.lower(), track.codec.upper()
)
if parsed.audio_channels is None and track.channels:
updates["audio_channels"] = channel_map.get(
track.channels, f"{track.channels}ch"
)
# Languages — merge ffprobe languages with token-level ones
# "und" = undetermined, not useful
if info.audio_languages:
existing_upper = {lang.upper() for lang in parsed.languages}
new_languages = list(parsed.languages)
for lang in info.audio_languages:
if lang.lower() != "und" and lang.upper() not in existing_upper:
new_languages.append(lang)
existing_upper.add(lang.upper())
if len(new_languages) != len(parsed.languages):
updates["languages"] = tuple(new_languages)
if not updates:
return parsed
return replace(parsed, **updates)
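
The "collect updates, then `replace`" pattern can be tried on a reduced model. The `Parsed` class and `fill_missing` helper below are hypothetical stand-ins with two fields only; they are meant to show the fill-only-`None` rule and the same-instance shortcut, not the real `ParsedRelease`.

```python
from dataclasses import dataclass, replace
from typing import Dict, Optional


@dataclass(frozen=True)
class Parsed:
    """Hypothetical, heavily reduced stand-in for ParsedRelease."""

    quality: Optional[str] = None
    codec: Optional[str] = None


def fill_missing(parsed: Parsed, probed: Dict[str, str]) -> Parsed:
    """Return a copy with None fields filled; values already set win."""
    updates = {
        name: value
        for name, value in probed.items()
        if getattr(parsed, name) is None
    }
    # Hand back the same instance when nothing changed.
    return replace(parsed, **updates) if updates else parsed


original = Parsed(quality="1080p")
enriched = fill_missing(original, {"quality": "720p", "codec": "x265"})
print(enriched)
```

The name-derived `quality` survives, the missing `codec` is filled, and `original` itself is left untouched.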
+193
@@ -0,0 +1,193 @@
"""Release inspection orchestrator — the canonical "look at this thing"
entry point.
``inspect_release`` is the single composition of the four layers we
care about for a freshly-arrived release:
1. **Parse the name.** :func:`alfred.domain.release.services.parse_release`
gives a ``ParsedRelease`` plus a ``ParseReport`` (confidence + road).
2. **Refine the media type.** :func:`detect_media_type` uses the
on-disk extension mix to override any token-level guess (e.g. a
bare ``.iso`` folder becomes ``"other"``). ``ParsedRelease`` is
frozen, so the refined value is applied by rebinding ``parsed`` via
:func:`dataclasses.replace`, not by mutating it in place.
3. **Pick the main video.** :func:`find_main_video` runs a top-level
scan over the source path. If nothing qualifies the result still
completes; downstream callers decide what to do with a videoless
release.
4. **Probe the video.** The injected :class:`MediaProber` fills in
missing technical fields via :func:`enrich_from_probe`. Skipped
when there is no main video or when ``media_type`` ended up in
``{"unknown", "other"}`` (the probe would tell us nothing useful).
The return type is :class:`InspectedResult`, a frozen VO that bundles
everything downstream callers need (``analyze_release`` tool,
``resolve_destination``, future workflow stages) without forcing them
to redo the same four calls.
Design notes:
- **Application layer.** This module touches both domain
(``parse_release``) and infrastructure (``MediaProber`` port). That
is exactly application's job — orchestrate.
- **Knowledge base is injected.** ``inspect_release`` takes ``kb`` and
``prober`` as parameters; no module-level singletons here. Callers
(the tool wrapper, tests) decide what to plug in.
- **No in-place mutation.** ``ParsedRelease`` is frozen, so the refined
``media_type`` and the probe-filled fields are applied with
:func:`dataclasses.replace`, which rebinds ``parsed`` to a new
instance. The outer ``InspectedResult`` is frozen too, so the
*bundle* is immutable from the caller's perspective.
- **Never raises.** Filesystem / probe errors surface as ``None``
fields on the result, never as exceptions — same contract as the
underlying adapters.
"""
from __future__ import annotations
from dataclasses import dataclass, replace
from pathlib import Path
from alfred.application.release.detect_media_type import detect_media_type
from alfred.application.release.enrich_from_probe import enrich_from_probe
from alfred.application.release.supported_media import find_main_video
from alfred.domain.release.ports import ReleaseKnowledge
from alfred.domain.release.services import parse_release
from alfred.domain.release.value_objects import (
MediaTypeToken,
ParsedRelease,
ParseReport,
)
from alfred.domain.shared.media import MediaInfo
from alfred.domain.shared.ports import MediaProber
# Media types for which a probe carries no useful information.
_NON_PROBABLE_MEDIA_TYPES = frozenset({"unknown", "other"})
# Media types for which there's nothing for the organizer to do.
# ``other`` covers things like games / ISOs / archives sitting on the
# downloads folder. ``unknown`` does NOT belong here — those need a
# user decision, not a skip.
_SKIPPABLE_MEDIA_TYPES = frozenset({"other"})
# Roads that signal the parser couldn't reach a confident answer on its
# own. ``Road`` values are kept as strings on the report to avoid a
# cross-package import here.
_ASK_USER_ROADS = frozenset({"path_of_pain"})
@dataclass(frozen=True)
class InspectedResult:
"""The full picture of a release: parsed name + filesystem reality.
Bundles everything the downstream pipeline needs after a single
inspection pass:
- ``parsed`` — :class:`ParsedRelease`, with ``media_type`` already
refined by :func:`detect_media_type` and ``None`` tech fields
filled in by :func:`enrich_from_probe` when a probe ran.
- ``report`` — :class:`ParseReport` from the parser (confidence +
road, untouched by inspection).
- ``source_path`` — the path the inspector was pointed at (file or
folder), as supplied by the caller.
- ``main_video`` — the canonical video file inside ``source_path``,
or ``None`` if no eligible file was found.
- ``media_info`` — the :class:`MediaInfo` snapshot when a probe
succeeded; ``None`` when no video was probed (no main video, or
``media_type`` in ``{"unknown", "other"}``) or when ffprobe
failed.
- ``probe_used`` — ``True`` iff ``media_info`` is non-``None`` and
``enrich_from_probe`` actually ran. Explicit flag so callers
don't have to re-derive the condition.
- ``recommended_action`` — derived hint for the orchestrator (see
property docstring). Encodes the exclusion / clarification /
go-ahead decision in one place so downstream callers don't
re-implement the same checks.
"""
parsed: ParsedRelease
report: ParseReport
source_path: Path
main_video: Path | None
media_info: MediaInfo | None
probe_used: bool
@property
def recommended_action(self) -> str:
"""Return one of ``"skip"`` / ``"ask_user"`` / ``"process"``.
- ``"skip"`` — nothing to organize:
* the source has no main video file, **or**
* ``media_type`` is ``"other"`` (games / ISOs / archives).
- ``"ask_user"`` — a decision is required before any action:
* ``media_type`` is ``"unknown"`` (parser couldn't classify), **or**
* the parse landed on ``Road.PATH_OF_PAIN``
(low-confidence, malformed name, etc.).
- ``"process"`` — everything else: a confident parse with a
usable media type and a main video on disk. The orchestrator
can move straight to the planning step.
The check ordering matters: ``"skip"`` wins over ``"ask_user"``
because if there's no video to organize, no question to the
user can change that. ``"ask_user"`` then wins over
``"process"`` because a confident parse alone isn't enough if
the type or road still flag uncertainty.
"""
if self.main_video is None:
return "skip"
if self.parsed.media_type.value in _SKIPPABLE_MEDIA_TYPES:
return "skip"
if self.parsed.media_type.value == "unknown":
return "ask_user"
if self.report.road in _ASK_USER_ROADS:
return "ask_user"
return "process"
def inspect_release(
release_name: str,
source_path: Path,
kb: ReleaseKnowledge,
prober: MediaProber,
) -> InspectedResult:
"""Run the full inspection pipeline on ``release_name`` /
``source_path``.
See module docstring for the four-step flow. ``kb`` and ``prober``
are injected so the caller controls the knowledge base layering
and the probe adapter (real ffprobe in production, stubs in tests).
Never raises. A missing or unreadable ``source_path`` simply
results in ``main_video=None`` and ``media_info=None``.
"""
parsed, report = parse_release(release_name, kb)
# Step 2: refine media_type from the on-disk extension mix.
# detect_media_type tolerates non-existent paths (returns parsed.media_type
# untouched), so no need to guard here. ParsedRelease is frozen — use
# dataclasses.replace to rebind with the refined value.
refined_media_type = MediaTypeToken(detect_media_type(parsed, source_path, kb))
if refined_media_type != parsed.media_type:
parsed = replace(parsed, media_type=refined_media_type)
# Step 3: pick the canonical main video (top-level scan only).
main_video = find_main_video(source_path, kb)
# Step 4: probe + enrich, when it makes sense.
media_info: MediaInfo | None = None
probe_used = False
if main_video is not None and parsed.media_type.value not in _NON_PROBABLE_MEDIA_TYPES:
media_info = prober.probe(main_video)
if media_info is not None:
parsed = enrich_from_probe(parsed, media_info, kb)
probe_used = True
return InspectedResult(
parsed=parsed,
report=report,
source_path=source_path,
main_video=main_video,
media_info=media_info,
probe_used=probe_used,
)
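
The check ordering behind `recommended_action` can be restated as a standalone function to make the precedence easy to test. This is an illustrative sketch, not the property itself: the flat signature is hypothetical, and `"movie"` is an assumed example of a usable media type.

```python
def recommended_action(has_main_video: bool, media_type: str, road: str) -> str:
    """Standalone restatement of the property's check order (illustrative)."""
    if not has_main_video or media_type == "other":
        return "skip"
    if media_type == "unknown" or road == "path_of_pain":
        return "ask_user"
    return "process"


cases = [
    (False, "unknown", "path_of_pain"),  # no video: skip wins over ask_user
    (True, "unknown", "easy"),           # unclassified type: ask
    (True, "movie", "path_of_pain"),     # low-confidence road: ask
    (True, "movie", "easy"),             # confident parse with a video
]
for case in cases:
    print(case, "->", recommended_action(*case))
```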
@@ -0,0 +1,74 @@
"""Pre-pipeline exclusion — decide which files are worth parsing.
These helpers live one notch above the domain: they touch the
filesystem (``Path.iterdir``, ``Path.suffix``) but carry no parsing
logic of their own. The goal is to filter out non-video files and pick
the canonical "main video" from a release folder *before* anything
hits :func:`~alfred.domain.release.parse_release`.
Design notes (Phase A bis, 2026-05-20):
- **Extension is the sole eligibility criterion.** A file is supported
iff its suffix is in ``kb.video_extensions``. No size threshold, no
filename heuristics ("sample", "trailer", …). If a release packs a
bloated featurette or names its sample alphabetically before the
main feature, that's PATH_OF_PAIN territory — not this layer's job.
- **Top-level scan only.** ``find_main_video`` does not descend into
subdirectories. Releases that wrap the main video in ``Sample/`` or
similar are non-scene-standard and handled by the orchestrator
upstream.
- **Lexicographic tie-break.** When several candidates qualify
(legitimate for season packs), we return the first by alphabetical
order. Deterministic, no size-based ranking.
- **Direct ``Path`` I/O.** No ``FilesystemScanner`` port — this layer
is application, not domain. If isolation becomes necessary for
testing at scale, we'll introduce a port then.
"""
from __future__ import annotations
from pathlib import Path
from alfred.domain.release.ports.knowledge import ReleaseKnowledge
def is_supported_video(path: Path, kb: ReleaseKnowledge) -> bool:
"""Return True when ``path`` is a video file the parser should
consider.
The check is purely extension-based: ``path.suffix.lower()`` must
belong to ``kb.video_extensions``. ``path`` must also be a regular
file — directories and broken symlinks return False.
"""
if not path.is_file():
return False
return path.suffix.lower() in kb.video_extensions
def find_main_video(folder: Path, kb: ReleaseKnowledge) -> Path | None:
"""Return the canonical main video file inside ``folder``, or
``None`` if there isn't one.
Behavior:
- Top-level scan only — subdirectories are ignored.
- Eligibility is :func:`is_supported_video`.
- When several files qualify, the lexicographically first one wins.
- When ``folder`` itself is a video file, it is returned as-is
(single-file releases are valid).
- When ``folder`` doesn't exist or isn't a directory (and isn't a
video file either), returns ``None``.
"""
if folder.is_file():
return folder if is_supported_video(folder, kb) else None
if not folder.is_dir():
return None
candidates = sorted(
child for child in folder.iterdir() if is_supported_video(child, kb)
)
return candidates[0] if candidates else None
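
The selection rules (top-level only, extension-only, lexicographic tie-break) can be exercised against a temporary folder. In this sketch `pick_main_video` is a simplified re-implementation for directories only, and `VIDEO_EXTENSIONS` is an assumed set standing in for `kb.video_extensions`.

```python
import tempfile
from pathlib import Path
from typing import Optional

VIDEO_EXTENSIONS = frozenset({".mkv", ".mp4"})  # assumed; the real set comes from kb


def pick_main_video(folder: Path) -> Optional[Path]:
    """Top-level, extension-only pick with a lexicographic tie-break."""
    candidates = sorted(
        child
        for child in folder.iterdir()
        if child.is_file() and child.suffix.lower() in VIDEO_EXTENSIONS
    )
    return candidates[0] if candidates else None


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for name in ("Show.S01E02.mkv", "Show.S01E01.MKV", "Show.nfo"):
        (root / name).touch()
    (root / "Sample").mkdir()
    (root / "Sample" / "A.sample.mkv").touch()  # ignored: not top-level
    picked = pick_main_video(root)
    picked_name = picked.name if picked else None
    print(picked_name)
```

The upper-case `.MKV` suffix still qualifies because the suffix is lower-cased before the membership test, and the file under `Sample/` never enters the candidate list.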
+6 -6
@@ -5,13 +5,13 @@ import os
from dataclasses import dataclass
from pathlib import Path
from alfred.domain.subtitles.entities import SubtitleCandidate
from alfred.domain.subtitles.entities import SubtitleScanResult
from alfred.domain.subtitles.value_objects import SubtitleType
logger = logging.getLogger(__name__)
def _build_dest_name(track: SubtitleCandidate, video_stem: str) -> str:
def _build_dest_name(track: SubtitleScanResult, video_stem: str) -> str:
"""
Build the destination filename for a subtitle track.
@@ -41,7 +41,7 @@ class PlacedTrack:
@dataclass
class PlaceResult:
placed: list[PlacedTrack]
skipped: list[tuple[SubtitleCandidate, str]] # (track, reason)
skipped: list[tuple[SubtitleScanResult, str]] # (track, reason)
@property
def placed_count(self) -> int:
@@ -54,7 +54,7 @@ class PlaceResult:
class SubtitlePlacer:
"""
Hard-links matched SubtitleCandidate files next to a destination video.
Hard-links matched SubtitleScanResult files next to a destination video.
Uses the same hard-link strategy as FileManager.copy_file:
instant, no data duplication, qBittorrent keeps seeding.
@@ -64,11 +64,11 @@ class SubtitlePlacer:
def place(
self,
tracks: list[SubtitleCandidate],
tracks: list[SubtitleScanResult],
destination_video: Path,
) -> PlaceResult:
placed: list[PlacedTrack] = []
-skipped: list[tuple[SubtitleCandidate, str]] = []
+skipped: list[tuple[SubtitleScanResult, str]] = []
dest_dir = destination_video.parent
+9 -6
@@ -8,19 +8,22 @@ from ..shared.value_objects import FilePath, FileSize, ImdbId
from .value_objects import MovieTitle, Quality, ReleaseYear
-@dataclass(eq=False)
+@dataclass(frozen=True, eq=False)
class Movie(MediaWithTracks):
"""
Movie aggregate root for the movies domain.
Carries file metadata (path, size) and the tracks discovered by the
-ffprobe + subtitle scan pipeline. The track lists may be empty when the
+ffprobe + subtitle scan pipeline. The track tuples may be empty when the
movie is known but not yet scanned, or when no file is downloaded.
Track helpers follow the same "C+" contract as ``Episode``: pass a
``Language`` for cross-format matching, or a ``str`` for case-insensitive
direct comparison.
+Frozen: rebuild via ``dataclasses.replace`` to project enrichment results
+(audio/subtitle tracks, file metadata) onto a new instance.
Equality is identity-based: two ``Movie`` instances are equal iff they
share the same ``imdb_id``, regardless of file/track contents. This is
the DDD aggregate invariant — the aggregate is identified by its root id.
@@ -34,15 +37,15 @@ class Movie(MediaWithTracks):
file_size: FileSize | None = None
tmdb_id: int | None = None
added_at: datetime = field(default_factory=datetime.now)
-audio_tracks: list[AudioTrack] = field(default_factory=list)
-subtitle_tracks: list[SubtitleTrack] = field(default_factory=list)
+audio_tracks: tuple[AudioTrack, ...] = field(default_factory=tuple)
+subtitle_tracks: tuple[SubtitleTrack, ...] = field(default_factory=tuple)
def __post_init__(self):
"""Validate movie entity."""
# Ensure ImdbId is actually an ImdbId instance
if not isinstance(self.imdb_id, ImdbId):
if isinstance(self.imdb_id, str):
-self.imdb_id = ImdbId(self.imdb_id)
+object.__setattr__(self, "imdb_id", ImdbId(self.imdb_id))
else:
raise ValueError(
f"imdb_id must be ImdbId or str, got {type(self.imdb_id)}"
@@ -51,7 +54,7 @@ class Movie(MediaWithTracks):
# Ensure MovieTitle is actually a MovieTitle instance
if not isinstance(self.title, MovieTitle):
if isinstance(self.title, str):
-self.title = MovieTitle(self.title)
+object.__setattr__(self, "title", MovieTitle(self.title))
else:
raise ValueError(
f"title must be MovieTitle or str, got {type(self.title)}"
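The Movie hunks above combine three things: a frozen dataclass, ``__post_init__`` coercion through ``object.__setattr__``, and identity-based equality that survives ``dataclasses.replace``. A minimal sketch of that pattern, assuming hypothetical ``Id`` and ``Film`` classes in place of ``ImdbId`` and ``Movie`` (the real entities carry more fields and validation):

```python
from __future__ import annotations

from dataclasses import FrozenInstanceError, dataclass, field, replace


@dataclass(frozen=True)
class Id:
    """Hypothetical stand-in for ImdbId."""

    value: str


@dataclass(frozen=True, eq=False)
class Film:
    """Hypothetical stand-in for Movie: frozen, equality driven by the id."""

    imdb_id: Id | str
    audio_tracks: tuple[str, ...] = field(default_factory=tuple)

    def __post_init__(self) -> None:
        # A frozen dataclass blocks normal assignment, so coercion has to
        # bypass the generated __setattr__.
        if isinstance(self.imdb_id, str):
            object.__setattr__(self, "imdb_id", Id(self.imdb_id))

    def __eq__(self, other: object) -> bool:
        return isinstance(other, Film) and self.imdb_id == other.imdb_id

    def __hash__(self) -> int:
        return hash(self.imdb_id)


def try_mutate(film: Film) -> bool:
    """Return True if in-place assignment succeeded (it should not)."""
    try:
        film.audio_tracks = ()  # type: ignore[misc]
    except FrozenInstanceError:
        return False
    return True


bare = Film("tt0133093")
# Enrichment results are projected onto a new instance, not written in place.
enriched = replace(bare, audio_tracks=("eng", "fre"))
```

``replace`` re-runs ``__init__`` and therefore ``__post_init__``; the coercion is a no-op the second time because ``imdb_id`` is already an ``Id``.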
+2 -2
@@ -1,6 +1,6 @@
"""Release domain — release name parsing and naming conventions."""
from .services import parse_release
-from .value_objects import ParsedRelease
+from .value_objects import ParsedRelease, ParseReport
-__all__ = ["ParsedRelease", "parse_release"]
+__all__ = ["ParsedRelease", "ParseReport", "parse_release"]
+31
@@ -0,0 +1,31 @@
"""Release parser v2 — annotate-based pipeline.
This package is the future home of ``parse_release``. It restructures the
parsing logic around a **tokenize → annotate → assemble** pipeline:
1. **tokenize**: split the release name into atomic tokens.
2. **annotate**: walk tokens left-to-right, assigning each one a
:class:`TokenRole` (TITLE, YEAR, SEASON, RESOLUTION, …) using the
injected :class:`~alfred.domain.release.ports.knowledge.ReleaseKnowledge`.
3. **assemble**: fold the annotated tokens into a :class:`ParsedRelease`.
The pipeline has three internal paths driven by the detected release group:
- **EASY**: known group (KONTRAST, RARBG, …) with a schema-driven layout
declared in ``knowledge/release/release_groups/<group>.yaml``.
- **SHITTY**: unknown group, best-effort matching against the global
knowledge sets, with a 0-100 confidence score.
- **PATH OF PAIN**: score below threshold OR critical chunks missing —
signaled to the caller, who decides whether to involve the LLM/user.
Today the package exposes scaffolding only (token VOs and a thin pipeline
stub). The legacy ``parse_release`` in ``release.services`` keeps serving
production until each piece of the v2 pipeline is wired in.
"""
from __future__ import annotations
from .schema import GroupSchema, SchemaChunk
from .tokens import Token, TokenRole
__all__ = ["GroupSchema", "SchemaChunk", "Token", "TokenRole"]
+763
@@ -0,0 +1,763 @@
"""Annotate-based pipeline.
Three stages:
1. :func:`tokenize` — release name → ``list[Token]`` (all UNKNOWN), plus
a separately-returned site tag (e.g. ``[YTS.MX]``) that is never
tokenized.
2. :func:`annotate` — promote each token's :class:`TokenRole` using the
injected knowledge base. Two sub-passes:
a. **Structural** (schema-driven, EASY only). Detects the group at
the right end, looks up its :class:`GroupSchema`, then matches
the schema's chunk sequence against the token stream. Between
two structural chunks, any number of unmatched tokens may
remain — they are left UNKNOWN for the enricher pass to handle.
b. **Enrichers** (non-positional). Walks UNKNOWN tokens and tags
audio / video-meta / edition / language roles. Multi-token
sequences (``DTS.HD.MA``, ``DV.HDR10``, ``DIRECTORS.CUT``) are
matched first, single tokens after.
3. :func:`assemble` — fold annotated tokens into a
:class:`~alfred.domain.release.value_objects.ParsedRelease`-compatible
dict.
The pipeline is **pure**: no I/O, no TMDB, no probe. All knowledge
arrives through ``kb: ReleaseKnowledge``.
"""
from __future__ import annotations
from ..ports.knowledge import ReleaseKnowledge
from ..value_objects import MediaTypeToken
from .schema import GroupSchema
from .tokens import Token, TokenRole
# ---------------------------------------------------------------------------
# Stage 1 — tokenize
# ---------------------------------------------------------------------------
def strip_site_tag(name: str) -> tuple[str, str | None]:
"""Split off a ``[site.tag]`` prefix or suffix.
Returns ``(clean_name, tag)``. If no tag is found, returns
``(name.strip(), None)``.
"""
s = name.strip()
if s.startswith("["):
close = s.find("]")
if close != -1:
tag = s[1:close].strip()
remainder = s[close + 1 :].strip()
if tag and remainder:
return remainder, tag
if s.endswith("]"):
open_bracket = s.rfind("[")
if open_bracket != -1:
tag = s[open_bracket + 1 : -1].strip()
remainder = s[:open_bracket].strip()
if tag and remainder:
return remainder, tag
return s, None
def tokenize(name: str, kb: ReleaseKnowledge) -> tuple[list[Token], str | None]:
"""Split ``name`` into tokens after stripping any site tag.
String-ops style: replace every configured separator with a single
NUL byte then split. NUL cannot legally appear in a release name, so
it's a safe sentinel.
"""
clean, site_tag = strip_site_tag(name)
DELIM = "\x00"
buf = clean
for sep in kb.separators:
if sep != DELIM:
buf = buf.replace(sep, DELIM)
pieces = [p for p in buf.split(DELIM) if p]
tokens = [Token(text=p, index=i) for i, p in enumerate(pieces)]
return tokens, site_tag
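The sentinel-based split used by ``tokenize`` above can be tried in isolation. The sketch below keeps only the string handling: ``split_on_separators`` and the ``SEPARATORS`` list are assumptions for illustration (the real list comes from ``kb.separators``), and plain strings stand in for ``Token`` objects.

```python
from __future__ import annotations

# Assumed separator list; the project loads its own from separators.yaml.
# "\uff5c" is the fullwidth vertical bar (U+FF5C).
SEPARATORS = [".", " ", "_", "\uff5c"]


def split_on_separators(name: str, separators: list[str]) -> list[str]:
    """Replace every separator with NUL, split on NUL, drop empty pieces."""
    delim = "\x00"
    buf = name
    for sep in separators:
        if sep != delim:
            buf = buf.replace(sep, delim)
    return [piece for piece in buf.split(delim) if piece]


pieces = split_on_separators("Show.Name\uff5c2021..1080p_WEB", SEPARATORS)
```

Because each separator is handled with ``str.replace`` rather than a regex, a multi-byte separator such as U+FF5C needs no escaping, and consecutive separators collapse naturally when empty pieces are filtered out.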
# ---------------------------------------------------------------------------
# Helpers shared across passes
# ---------------------------------------------------------------------------
def _parse_season_episode(text: str) -> tuple[int, int | None, int | None] | None:
"""Parse a single token as ``SxxExx`` / ``SxxExxExx`` / ``Sxx`` /
``Sxx-yy`` (season range) / ``NxNN``.
Returns ``(season, episode, episode_end)`` or ``None`` if the token
is not a season/episode marker. For ``Sxx-yy``, returns the first
season with no episode info — the caller is expected to detect the
range form and promote ``media_type`` to ``tv_complete`` separately.
"""
upper = text.upper()
# SxxExx form (and Sxx, Sxx-yy)
if len(upper) >= 3 and upper[0] == "S" and upper[1:3].isdigit():
season = int(upper[1:3])
rest = upper[3:]
if not rest:
return season, None, None
# Sxx-yy season-range form: capture the first season, treat as a
# complete-series marker (no episode info).
if (
len(rest) == 3
and rest[0] == "-"
and rest[1:3].isdigit()
):
return season, None, None
episodes: list[int] = []
while rest.startswith("E") and len(rest) >= 3 and rest[1:3].isdigit():
episodes.append(int(rest[1:3]))
rest = rest[3:]
if not episodes:
return None
# For chained multi-episode markers (E09E10E11), the range is the
# first → last episode. Intermediate values are implied.
return season, episodes[0], episodes[-1] if len(episodes) >= 2 else None
# NxNN form
if "X" in upper:
parts = upper.split("X")
if len(parts) >= 2 and all(p.isdigit() and p for p in parts):
season = int(parts[0])
episode = int(parts[1])
episode_end = int(parts[2]) if len(parts) >= 3 else None
return season, episode, episode_end
return None
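To make the return contract of ``_parse_season_episode`` concrete, here is an approximate re-expression using regular expressions. Note the swap: the function above uses slicing and ``str.split``, whereas this sketch uses ``re``, and it is stricter on malformed input (it rejects trailing junk after the episode markers). ``parse_marker`` and the pattern names are hypothetical.

```python
from __future__ import annotations

import re

_SEASON_RANGE = re.compile(r"^S(\d{2})-\d{2}$")
_SEASON_EPISODES = re.compile(r"^S(\d{2})((?:E\d{2})*)$")
_NXNN = re.compile(r"^(\d+)X(\d+)(?:X(\d+))?$")


def parse_marker(text: str) -> tuple[int, int | None, int | None] | None:
    """Return (season, episode, episode_end), or None for a non-marker."""
    upper = text.upper()
    if m := _SEASON_RANGE.match(upper):
        # Sxx-yy: first season only; the caller detects the range form.
        return int(m.group(1)), None, None
    if m := _SEASON_EPISODES.match(upper):
        episodes = [int(e) for e in re.findall(r"E(\d{2})", m.group(2))]
        if not episodes:
            return int(m.group(1)), None, None
        # Chained markers (E09E10E11) collapse to first and last episode.
        last = episodes[-1] if len(episodes) >= 2 else None
        return int(m.group(1)), episodes[0], last
    if m := _NXNN.match(upper):
        end = int(m.group(3)) if m.group(3) else None
        return int(m.group(1)), int(m.group(2)), end
    return None
```

For well-formed markers the tuples should agree with the string-ops version above; codec-like tokens such as ``x264`` fall through to ``None`` because the ``NxNN`` form requires digits on both sides of the ``x``.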
def _is_year(text: str) -> bool:
"""Return True if ``text`` is a 4-digit year in [1900, 2099]."""
return len(text) == 4 and text.isdigit() and 1900 <= int(text) <= 2099
def _split_codec_group(text: str, kb: ReleaseKnowledge) -> tuple[str, str] | None:
"""Split a ``codec-GROUP`` token into ``(codec, group)`` if it fits.
Returns ``None`` if the token doesn't match the ``codec-GROUP``
shape. Handles the empty-group case (``x265-``) as ``(codec, "")``.
"""
if "-" not in text:
return None
head, _, tail = text.rpartition("-")
if head.lower() in kb.codecs:
return head, tail
return None
def _match_role(text: str, role: TokenRole, kb: ReleaseKnowledge) -> TokenRole | None:
"""Return ``role`` if ``text`` matches it under ``kb``, else ``None``."""
lower = text.lower()
if role is TokenRole.YEAR:
return TokenRole.YEAR if _is_year(text) else None
if role is TokenRole.SEASON_EPISODE:
return (
TokenRole.SEASON_EPISODE
if _parse_season_episode(text) is not None
else None
)
if role is TokenRole.RESOLUTION:
return TokenRole.RESOLUTION if lower in kb.resolutions else None
if role is TokenRole.SOURCE:
return TokenRole.SOURCE if lower in kb.sources else None
if role is TokenRole.CODEC:
return TokenRole.CODEC if lower in kb.codecs else None
return None
# ---------------------------------------------------------------------------
# Stage 2a — group detection
# ---------------------------------------------------------------------------
def _detect_group(tokens: list[Token], kb: ReleaseKnowledge) -> tuple[str, int | None]:
"""Identify the release group by walking tokens right-to-left.
Returns ``(group_name, token_index_carrying_group)``. ``index`` is
``None`` when the group is absent (no trailing ``-`` in the stream).
"""
# Priority 1: codec-GROUP shape (clearest signal).
for tok in reversed(tokens):
split = _split_codec_group(tok.text, kb)
if split is not None:
_, group = split
return (group or "UNKNOWN"), tok.index
# Priority 2: rightmost dash, excluding dashed sources (Web-DL, etc.).
for tok in reversed(tokens):
if "-" not in tok.text:
continue
head, _, tail = tok.text.rpartition("-")
if (
head.lower() in kb.sources
or tok.text.lower().replace("-", "") in kb.sources
):
continue
if tail:
return tail, tok.index
return "UNKNOWN", None
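The two-priority group detection above relies on ``str.rpartition("-")``. A small sketch on bare strings, assuming illustrative ``CODECS`` and ``SOURCES`` sets (the real ones live in the knowledge base) and a hypothetical ``detect_group`` that returns list positions instead of ``Token.index``:

```python
from __future__ import annotations

# Assumed knowledge sets, for illustration only.
CODECS = {"x264", "x265", "h264"}
SOURCES = {"web", "webdl", "bluray"}


def detect_group(tokens: list[str]) -> tuple[str, int | None]:
    """Return (group, index of the token carrying it), scanning right to left."""
    # Priority 1: codec-GROUP shape, e.g. "x265-KONTRAST".
    for idx in range(len(tokens) - 1, -1, -1):
        head, dash, tail = tokens[idx].rpartition("-")
        if dash and head.lower() in CODECS:
            return (tail or "UNKNOWN"), idx
    # Priority 2: rightmost dashed token that is not a dashed source.
    for idx in range(len(tokens) - 1, -1, -1):
        text = tokens[idx]
        if "-" not in text:
            continue
        head, _, tail = text.rpartition("-")
        if head.lower() in SOURCES or text.lower().replace("-", "") in SOURCES:
            continue
        if tail:
            return tail, idx
    return "UNKNOWN", None
```

The source check in priority 2 is what keeps ``Web-DL`` from being read as a release by a group called ``DL``.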
# ---------------------------------------------------------------------------
# Stage 2b — structural annotation (schema-driven)
# ---------------------------------------------------------------------------
def _annotate_structural(
tokens: list[Token],
kb: ReleaseKnowledge,
schema: GroupSchema,
group_token_index: int,
) -> list[Token] | None:
"""Annotate structural tokens following a known group schema.
Walks the schema's chunks against the body (tokens up to the group
token). For each chunk, scans forward in the body for a matching
token — tokens passed over without match are left UNKNOWN (the
enricher pass will handle them).
Returns ``None`` if any mandatory chunk fails to find a match.
"""
result = list(tokens)
# The codec-GROUP token carries CODEC + GROUP. Split it now so the
# schema walk knows the codec is "pre-consumed" at the end.
group_token = result[group_token_index]
cg_split = _split_codec_group(group_token.text, kb)
codec_pre_consumed = False
if cg_split is not None:
codec, group = cg_split
result[group_token_index] = group_token.with_role(
TokenRole.CODEC, codec=codec, group=group or "UNKNOWN"
)
codec_pre_consumed = True
else:
head, _, tail = group_token.text.rpartition("-")
result[group_token_index] = group_token.with_role(
TokenRole.GROUP, group=tail or "UNKNOWN", prefix=head
)
body_end = group_token_index # exclusive
tok_idx = 0
chunk_idx = 0
# 1) TITLE — leftmost contiguous tokens up to the first structural
# boundary. Title is special because it can be multi-token.
while (
chunk_idx < len(schema.chunks)
and schema.chunks[chunk_idx].role is TokenRole.TITLE
):
title_end = _find_title_end(result, body_end, kb)
for i in range(tok_idx, title_end):
result[i] = result[i].with_role(TokenRole.TITLE)
tok_idx = title_end
chunk_idx += 1
# 2) Remaining structural chunks. For each, scan forward in the body
# for a matching token; tokens passed over remain UNKNOWN.
for chunk in schema.chunks[chunk_idx:]:
if chunk.role is TokenRole.GROUP:
continue
if chunk.role is TokenRole.CODEC and codec_pre_consumed:
continue
match_idx = _find_chunk(result, tok_idx, body_end, chunk.role, kb)
if match_idx is None:
if chunk.optional:
continue
return None
result[match_idx] = result[match_idx].with_role(chunk.role)
tok_idx = match_idx + 1
return result
def _find_title_end(
tokens: list[Token], body_end: int, kb: ReleaseKnowledge
) -> int:
"""Return the exclusive index where the title ends.
The title is the leftmost run of tokens whose text does not match
any structural role (year, season/episode, resolution, source,
codec). Enricher tokens (audio, HDR, language) are *not* boundaries
because they can appear in the middle of the structural sequence;
however, in canonical scene names they don't appear inside the title
itself, so this heuristic holds in practice.
"""
for i in range(body_end):
text = tokens[i].text
if _parse_season_episode(text) is not None:
return i
if _is_year(text):
return i
lower = text.lower()
if lower in kb.resolutions:
return i
if lower in kb.sources:
return i
if lower in kb.codecs:
return i
# codec-GROUP token (e.g. "x265-KONTRAST") or dashed source (Web-DL).
if "-" in text:
head, _, _ = text.rpartition("-")
if (
head.lower() in kb.codecs
or head.lower() in kb.sources
or text.lower().replace("-", "") in kb.sources
):
return i
return body_end
def _find_chunk(
tokens: list[Token],
start: int,
end: int,
role: TokenRole,
kb: ReleaseKnowledge,
) -> int | None:
"""Return the first index in ``[start, end)`` whose token matches ``role``.
Returns ``None`` if no token in the range matches. Tokens already
annotated (non-UNKNOWN) are skipped — they belong to another chunk.
"""
for i in range(start, end):
if tokens[i].role is not TokenRole.UNKNOWN:
continue
if _match_role(tokens[i].text, role, kb) is not None:
return i
return None
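The forward-scanning schema walk implemented by ``_annotate_structural`` and ``_find_chunk`` can be reduced to a few lines. In this sketch ``Chunk``, ``MATCHERS`` and ``walk`` are hypothetical, roles are plain strings, the title step is left out, and the matcher sets are assumed values rather than the project's knowledge base.

```python
from __future__ import annotations

from collections.abc import Callable
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    """Hypothetical stand-in for SchemaChunk, with a string role."""

    role: str
    optional: bool = False


# Assumed matchers; the real ones consult ReleaseKnowledge.
MATCHERS: dict[str, Callable[[str], bool]] = {
    "year": lambda t: len(t) == 4 and t.isdigit() and 1900 <= int(t) <= 2099,
    "resolution": lambda t: t.lower() in {"720p", "1080p", "2160p"},
    "source": lambda t: t.lower() in {"bluray", "web"},
}


def walk(body: list[str], chunks: tuple[Chunk, ...]) -> dict[int, str] | None:
    """Map token index to role, or return None if a mandatory chunk is missing."""
    roles: dict[int, str] = {}
    cursor = 0
    for chunk in chunks:
        match = next(
            (i for i in range(cursor, len(body)) if MATCHERS[chunk.role](body[i])),
            None,
        )
        if match is None:
            if chunk.optional:
                continue  # chunk skipped, cursor stays where it was
            return None  # mandatory chunk missing: caller falls back
        roles[match] = chunk.role
        cursor = match + 1  # tokens passed over stay unassigned
    return roles


SCHEMA = (Chunk("year", optional=True), Chunk("resolution"), Chunk("source"))
```

Tokens that the scan steps over (``REPACK`` in a name like ``Movie.2020.REPACK.1080p.BluRay``) simply never appear in the result, which mirrors how the real pass leaves them UNKNOWN for the enrichers.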
# ---------------------------------------------------------------------------
# Stage 2b' — SHITTY annotation (schema-less heuristic)
# ---------------------------------------------------------------------------
def _annotate_shitty(
tokens: list[Token],
kb: ReleaseKnowledge,
group_index: int | None,
) -> list[Token]:
"""Schema-less, dictionary-driven annotation.
SHITTY's job is narrow: for releases that *look* like scene names
but don't have a registered group schema, tag every token whose text
falls into a known YAML bucket (resolutions, codecs, sources, …).
Anything we can't classify stays UNKNOWN. The leftmost run of
UNKNOWN tokens becomes the title. Done.
Anything that requires more reasoning (parenthesized tech blocks,
bare-dashed title fragments, year-disguised slug suffixes, …) is
PATH OF PAIN territory and stays out of here on purpose.
"""
result = list(tokens)
# 1) Group token — split codec-GROUP or tag GROUP. Same logic as EASY.
if group_index is not None:
gt = result[group_index]
cg_split = _split_codec_group(gt.text, kb)
if cg_split is not None:
codec, group = cg_split
result[group_index] = gt.with_role(
TokenRole.CODEC, codec=codec, group=group or "UNKNOWN"
)
else:
_, _, tail = gt.text.rpartition("-")
result[group_index] = gt.with_role(
TokenRole.GROUP, group=tail or "UNKNOWN"
)
# 2) Enrichers (audio / video-meta / edition / language).
result = _annotate_enrichers(result, kb)
# 3) Single pass: tag each UNKNOWN token by looking it up in the kb
# buckets. First match wins per token, first occurrence wins per
# role (we don't overwrite an already-tagged role).
matchers: list[tuple[TokenRole, callable]] = [
(TokenRole.SEASON_EPISODE, lambda t: _parse_season_episode(t) is not None),
(TokenRole.YEAR, _is_year),
(TokenRole.RESOLUTION, lambda t: t.lower() in kb.resolutions),
(TokenRole.DISTRIBUTOR, lambda t: t.upper() in kb.distributors),
(TokenRole.SOURCE, lambda t: t.lower() in kb.sources),
(TokenRole.CODEC, lambda t: t.lower() in kb.codecs),
]
seen: set[TokenRole] = set()
for i, tok in enumerate(result):
if tok.role is not TokenRole.UNKNOWN:
continue
for role, matches in matchers:
if role in seen:
continue
if matches(tok.text):
result[i] = tok.with_role(role)
seen.add(role)
break
# 4) Title = leftmost contiguous UNKNOWN tokens.
for i, tok in enumerate(result):
if tok.role is not TokenRole.UNKNOWN:
break
result[i] = tok.with_role(TokenRole.TITLE)
return result
# ---------------------------------------------------------------------------
# Stage 2c — enricher pass (non-positional roles)
# ---------------------------------------------------------------------------
def _annotate_enrichers(tokens: list[Token], kb: ReleaseKnowledge) -> list[Token]:
"""Tag the remaining UNKNOWN tokens with non-positional roles.
Multi-token sequences are matched first (so ``DTS.HD.MA`` wins over
a single-token ``DTS``). For each sequence match, the first token
receives the role + ``extra["sequence"]`` (the canonical joined
value), and the trailing members are marked with the same role +
``extra["sequence_member"]="True"`` (a string flag, not a bool) so
:func:`assemble` extracts the value only from the primary.
"""
result = list(tokens)
# Multi-token sequences first.
_apply_sequences(
result, kb.audio.get("sequences", []), "codec", TokenRole.AUDIO_CODEC
)
_apply_sequences(
result, kb.video_meta.get("sequences", []), "hdr", TokenRole.HDR
)
_apply_sequences(
result, kb.editions.get("sequences", []), "edition", TokenRole.EDITION
)
# Single tokens.
known_audio_codecs = {c.upper() for c in kb.audio.get("codecs", [])}
known_audio_channels = set(kb.audio.get("channels", []))
known_hdr = {h.upper() for h in kb.video_meta.get("hdr", [])} | kb.hdr_extra
known_bit_depth = {d.lower() for d in kb.video_meta.get("bit_depth", [])}
known_editions = {t.upper() for t in kb.editions.get("tokens", [])}
# Channel layouts like "5.1" are tokenized as two tokens ("5", "1")
# because "." is a separator. Detect consecutive pairs whose joined
# value (without any trailing "-GROUP") is in the channel set.
_detect_channel_pairs(result, known_audio_channels)
for i, tok in enumerate(result):
if tok.role is not TokenRole.UNKNOWN:
continue
text = tok.text
upper = text.upper()
lower = text.lower()
if upper in known_audio_codecs:
result[i] = tok.with_role(TokenRole.AUDIO_CODEC)
continue
if text in known_audio_channels:
result[i] = tok.with_role(TokenRole.AUDIO_CHANNELS)
continue
if upper in known_hdr:
result[i] = tok.with_role(TokenRole.HDR)
continue
if lower in known_bit_depth:
result[i] = tok.with_role(TokenRole.BIT_DEPTH)
continue
if upper in known_editions:
result[i] = tok.with_role(TokenRole.EDITION)
continue
if upper in kb.language_tokens:
result[i] = tok.with_role(TokenRole.LANGUAGE)
continue
if upper in kb.distributors:
result[i] = tok.with_role(TokenRole.DISTRIBUTOR)
continue
return result
def _apply_sequences(
tokens: list[Token],
sequences: list[dict],
value_key: str,
role: TokenRole,
) -> None:
"""Mark every non-overlapping occurrence of each sequence in place.
Mutates ``tokens`` (replacing entries with new role-tagged Token
instances). Sequences in the YAML must be ordered most-specific
first; the first match wins per starting position.
"""
if not sequences:
return
upper_texts = [t.text.upper() for t in tokens]
consumed: set[int] = set()
for seq in sequences:
seq_upper = [s.upper() for s in seq["tokens"]]
n = len(seq_upper)
for start in range(len(tokens) - n + 1):
if any(idx in consumed for idx in range(start, start + n)):
continue
if any(
tokens[start + k].role is not TokenRole.UNKNOWN for k in range(n)
):
continue
if upper_texts[start : start + n] == seq_upper:
tokens[start] = tokens[start].with_role(
role, sequence=seq[value_key]
)
for k in range(1, n):
tokens[start + k] = tokens[start + k].with_role(
role, sequence_member="True"
)
consumed.update(range(start, start + n))
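Why the YAML must list the most specific sequence first is easiest to see on a toy input. The sketch below assumes a hypothetical ``mark_sequences`` working on plain strings and a made-up ``SEQUENCES`` list with a ``"value"`` key; the real function tags ``Token`` objects and reads the key name from ``value_key``.

```python
from __future__ import annotations

# Assumed sequence definitions, most specific first.
SEQUENCES = [
    {"tokens": ["DTS", "HD", "MA"], "value": "DTS-HD.MA"},
    {"tokens": ["DTS", "HD"], "value": "DTS-HD"},
]


def mark_sequences(tokens: list[str], sequences: list[dict]) -> dict[int, str]:
    """Return {start index: canonical value} for every matched sequence."""
    upper = [t.upper() for t in tokens]
    consumed: set[int] = set()
    found: dict[int, str] = {}
    for seq in sequences:
        wanted = [s.upper() for s in seq["tokens"]]
        n = len(wanted)
        for start in range(len(tokens) - n + 1):
            span = range(start, start + n)
            if any(i in consumed for i in span):
                continue  # already claimed by an earlier, more specific sequence
            if upper[start : start + n] == wanted:
                found[start] = seq["value"]
                consumed.update(span)
    return found
```

With the order reversed, the shorter ``DTS.HD`` entry would claim the first two tokens and the three-token form could never match.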
def _detect_channel_pairs(
tokens: list[Token], known_channels: set[str]
) -> None:
"""Spot two consecutive numeric tokens that form a channel layout.
Example: ``["5", "1-KTH"]`` → joined ``"5.1"`` (after stripping the
``-GROUP`` suffix on the second). The second token may be the trailing
codec-GROUP token, in which case it's already tagged CODEC and we
skip — we'd corrupt its role.
"""
for i in range(len(tokens) - 1):
first = tokens[i]
second = tokens[i + 1]
if first.role is not TokenRole.UNKNOWN:
continue
# Strip a "-GROUP" suffix on the second token before joining.
second_text = second.text.split("-")[0]
candidate = f"{first.text}.{second_text}"
if candidate not in known_channels:
continue
# Only tag the first token (carries the channel value). The
# second token may legitimately remain UNKNOWN (or be the
# codec-GROUP token, already tagged CODEC).
tokens[i] = first.with_role(
TokenRole.AUDIO_CHANNELS, sequence=candidate
)
if second.role is TokenRole.UNKNOWN:
tokens[i + 1] = second.with_role(
TokenRole.AUDIO_CHANNELS, sequence_member="True"
)
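Since ``.`` is a separator, a layout such as ``5.1`` only exists as two neighbouring tokens by the time the enrichers run. A minimal sketch of the rejoin step, assuming a hypothetical ``find_channel_pairs`` on plain strings and an illustrative ``KNOWN_CHANNELS`` set; role bookkeeping is omitted.

```python
from __future__ import annotations

# Assumed channel layouts; the real set comes from the audio knowledge file.
KNOWN_CHANNELS = {"2.0", "5.1", "7.1"}


def find_channel_pairs(tokens: list[str], known: set[str]) -> dict[int, str]:
    """Return {index of first token: joined layout} for each recognised pair."""
    hits: dict[int, str] = {}
    for i in range(len(tokens) - 1):
        # Drop a trailing "-GROUP" on the second token before joining.
        second = tokens[i + 1].split("-")[0]
        candidate = f"{tokens[i]}.{second}"
        if candidate in known:
            hits[i] = candidate
    return hits
```

Only the first token of the pair carries the joined value, which matches how the function above stores it under ``sequence`` on the first token.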
# ---------------------------------------------------------------------------
# Stage 2 entry point
# ---------------------------------------------------------------------------
def annotate(tokens: list[Token], kb: ReleaseKnowledge) -> list[Token]:
"""Annotate token roles.
Dispatch:
* If a group is detected AND has a known schema, run the EASY
structural walk. If the schema walk aborts on a mandatory chunk
mismatch, fall through to SHITTY (the heuristic still does better
than giving up).
* Otherwise run SHITTY — schema-less, best-effort, never aborts.
The enricher pass runs in both cases. The pipeline always returns a
populated token list; downstream callers don't need to distinguish
EASY vs SHITTY at this layer (the parse_path is decided in the
service based on whether a schema matched).
"""
group_name, group_index = _detect_group(tokens, kb)
schema = kb.group_schema(group_name) if group_index is not None else None
if schema is not None and group_index is not None:
structural = _annotate_structural(tokens, kb, schema, group_index)
if structural is not None:
return _annotate_enrichers(structural, kb)
# SHITTY fallback — heuristic positional pass. ``_annotate_shitty``
# runs its own enricher pass internally (it has to, so the title
# scan can skip enricher-tagged tokens).
return _annotate_shitty(tokens, kb, group_index)
def has_known_schema(tokens: list[Token], kb: ReleaseKnowledge) -> bool:
"""Return True if ``tokens`` would take the EASY path in :func:`annotate`."""
group_name, group_index = _detect_group(tokens, kb)
if group_index is None:
return False
return kb.group_schema(group_name) is not None
# ---------------------------------------------------------------------------
# Stage 3 — assemble
# ---------------------------------------------------------------------------
def assemble(
annotated: list[Token],
site_tag: str | None,
raw_name: str,
kb: ReleaseKnowledge,
) -> dict:
"""Fold annotated tokens into a ``ParsedRelease``-compatible dict.
Returns a dict (not a ``ParsedRelease`` instance) so the caller can
layer in additional fields (``parse_path``, ``raw``, …) before
instantiation.
"""
# Pure-punctuation tokens (e.g. a stray "-" left by ` - ` separators in
# human-friendly release names) carry no title content and would leak
# into the joined title as ``"Show.-.Episode"``. Drop them here.
title_parts = [
t.text
for t in annotated
if t.role is TokenRole.TITLE and any(c.isalnum() for c in t.text)
]
title = ".".join(title_parts) if title_parts else (
annotated[0].text if annotated else raw_name
)
year: int | None = None
season: int | None = None
episode: int | None = None
episode_end: int | None = None
quality: str | None = None
source: str | None = None
codec: str | None = None
group = "UNKNOWN"
audio_codec: str | None = None
audio_channels: str | None = None
bit_depth: str | None = None
hdr_format: str | None = None
edition: str | None = None
distributor: str | None = None
languages: list[str] = []
is_season_range = False
for tok in annotated:
# Skip non-primary members of a multi-token sequence.
if tok.extra.get("sequence_member") == "True":
continue
role = tok.role
if role is TokenRole.YEAR:
year = int(tok.text)
elif role is TokenRole.SEASON_EPISODE:
parsed = _parse_season_episode(tok.text)
if parsed is not None:
season, episode, episode_end = parsed
# Detect Sxx-yy range form to flag it as a multi-season pack.
upper = tok.text.upper()
if (
len(upper) == 6
and upper[0] == "S"
and upper[1:3].isdigit()
and upper[3] == "-"
and upper[4:6].isdigit()
):
is_season_range = True
elif role is TokenRole.RESOLUTION:
quality = tok.text
elif role is TokenRole.SOURCE:
source = tok.text
elif role is TokenRole.CODEC:
codec = tok.extra.get("codec", tok.text)
if "group" in tok.extra:
group = tok.extra["group"] or "UNKNOWN"
elif role is TokenRole.GROUP:
group = tok.extra.get("group", tok.text) or "UNKNOWN"
elif role is TokenRole.AUDIO_CODEC:
if audio_codec is None:
audio_codec = tok.extra.get("sequence", tok.text)
elif role is TokenRole.AUDIO_CHANNELS:
if audio_channels is None:
audio_channels = tok.extra.get("sequence", tok.text)
elif role is TokenRole.BIT_DEPTH:
if bit_depth is None:
bit_depth = tok.text.lower()
elif role is TokenRole.HDR:
if hdr_format is None:
hdr_format = tok.extra.get("sequence", tok.text.upper())
elif role is TokenRole.EDITION:
if edition is None:
edition = tok.extra.get("sequence", tok.text.upper())
elif role is TokenRole.LANGUAGE:
languages.append(tok.text.upper())
elif role is TokenRole.DISTRIBUTOR:
if distributor is None:
distributor = tok.text.upper()
# Media type heuristic. Doc/concert/integrale tokens win over the
# generic tech-based fallback. We look across all tokens (not just
# annotated ones) because these markers may be tagged UNKNOWN by the
# structural pass — only the assemble step cares about them.
upper_tokens = {tok.text.upper() for tok in annotated}
doc_tokens = {t.upper() for t in kb.media_type_tokens.get("doc", [])}
concert_tokens = {t.upper() for t in kb.media_type_tokens.get("concert", [])}
integrale_tokens = {t.upper() for t in kb.media_type_tokens.get("integrale", [])}
if upper_tokens & doc_tokens:
media_type = MediaTypeToken.DOCUMENTARY
elif upper_tokens & concert_tokens:
media_type = MediaTypeToken.CONCERT
elif is_season_range:
media_type = MediaTypeToken.TV_COMPLETE
elif (
edition in {"COMPLETE", "INTEGRALE", "COLLECTION"}
or upper_tokens & integrale_tokens
) and season is None:
media_type = MediaTypeToken.TV_COMPLETE
elif season is not None:
media_type = MediaTypeToken.TV_SHOW
elif any((quality, source, codec, year)):
media_type = MediaTypeToken.MOVIE
else:
media_type = MediaTypeToken.UNKNOWN
return {
"title": title,
"title_sanitized": kb.sanitize_for_fs(title),
"year": year,
"season": season,
"episode": episode,
"episode_end": episode_end,
"quality": quality,
"source": source,
"codec": codec,
"group": group,
"media_type": media_type,
"site_tag": site_tag,
"languages": tuple(languages),
"audio_codec": audio_codec,
"audio_channels": audio_channels,
"bit_depth": bit_depth,
"hdr_format": hdr_format,
"edition": edition,
"distributor": distributor,
}
+47
@@ -0,0 +1,47 @@
"""Group schema value objects.
A :class:`GroupSchema` describes the canonical chunk layout of releases
from a known group (KONTRAST, RARBG, ELiTE, …). It is the EASY-road
contract: when a release ends in ``-<GROUP>`` and we know the group,
the annotator walks the schema instead of running the heuristic SHITTY
matchers.
Schemas are loaded from ``knowledge/release/release_groups/<group>.yaml``
by an infrastructure adapter and surfaced via the
:class:`~alfred.domain.release.ports.knowledge.ReleaseKnowledge` port.
"""
from __future__ import annotations
from dataclasses import dataclass
from .tokens import TokenRole
@dataclass(frozen=True)
class SchemaChunk:
"""One entry in a group's chunk order.
``role`` is the :class:`TokenRole` the chunk maps to. ``optional``
is True for chunks that may be absent (e.g. ``year`` on TV releases,
``source`` on bare ELiTE TV releases).
"""
role: TokenRole
optional: bool = False
@dataclass(frozen=True)
class GroupSchema:
"""Schema for a known release group.
``chunks`` is the left-to-right canonical order. For each chunk the
annotator scans forward from the current token position for a matching
token; tokens passed over stay UNKNOWN. An optional chunk with no match
in the remaining body is skipped (the chunk index advances, the token
index stays); a mandatory chunk with no match aborts the EASY path and
falls back to SHITTY.
"""
name: str
separator: str
chunks: tuple[SchemaChunk, ...]
+139
@@ -0,0 +1,139 @@
"""Parse-confidence scoring.
``parse_release`` returns a :class:`ParseReport` alongside its
:class:`ParsedRelease`. The report carries:
- ``confidence``: integer 0-100 derived from which structural and
technical fields got populated, minus a penalty per UNKNOWN token
left in the annotated stream.
- ``road``: which of the three roads the parse took
(:class:`Road.EASY` / :class:`Road.SHITTY` / :class:`Road.PATH_OF_PAIN`).
- ``unknown_tokens``: textual residue, useful for diagnostics.
- ``missing_critical``: structural fields the score-tally found absent
(e.g. ``("year", "media_type")``) — the caller can use this to drive
PoP recovery (questions, LLM call).
All weights, penalties and thresholds come from the injected knowledge
base (``kb.scoring``), itself loaded from
``alfred/knowledge/release/scoring.yaml``. No magic numbers here.
The scoring functions are pure — they consume the annotated token list
and the resulting :class:`ParsedRelease` and return the report. They are
called by ``services.parse_release`` after ``assemble`` has run.
"""
from __future__ import annotations
from enum import Enum
from ..ports.knowledge import ReleaseKnowledge
from ..value_objects import ParsedRelease
from .tokens import Token, TokenRole
class Road(str, Enum):
"""How the parser handled a given release name.
Distinct from :class:`~alfred.domain.release.value_objects.TokenizationRoute`,
which records the tokenization route (DIRECT / SANITIZED / AI). Road
is about confidence in the *result*, not the *method*.
"""
EASY = "easy" # group schema matched — structural annotation
SHITTY = "shitty" # no schema, dict-driven annotation, score ≥ threshold
PATH_OF_PAIN = "path_of_pain" # score below threshold, needs help
# Critical structural fields — their absence drives the
# ``missing_critical`` list in the report.
_CRITICAL_FIELDS: tuple[str, ...] = ("title", "media_type", "year")
def _is_tv_shaped(parsed: ParsedRelease) -> bool:
"""Season/episode weights only count for releases that *look* like TV."""
return parsed.season is not None
def compute_score(
parsed: ParsedRelease,
annotated: list[Token],
kb: ReleaseKnowledge,
) -> int:
"""Compute a 0-100 confidence score for the parse.
Each populated field contributes its weight from
``kb.scoring["weights"]``. Season/episode only count when the parse
looks like TV. ``group == "UNKNOWN"`` is treated as absent.
Then a penalty is subtracted per residual UNKNOWN token in
``annotated``, capped at ``penalties["max_unknown_penalty"]``.
Result is clamped to ``[0, 100]``.
"""
weights = kb.scoring["weights"]
penalties = kb.scoring["penalties"]
score = 0
if parsed.title:
score += weights.get("title", 0)
if parsed.media_type and parsed.media_type.value != "unknown":
score += weights.get("media_type", 0)
if parsed.year is not None:
score += weights.get("year", 0)
if _is_tv_shaped(parsed):
if parsed.season is not None:
score += weights.get("season", 0)
if parsed.episode is not None:
score += weights.get("episode", 0)
if parsed.quality:
score += weights.get("resolution", 0)
if parsed.source:
score += weights.get("source", 0)
if parsed.codec:
score += weights.get("codec", 0)
if parsed.group and parsed.group != "UNKNOWN":
score += weights.get("group", 0)
unknown_count = sum(1 for t in annotated if t.role is TokenRole.UNKNOWN)
raw_penalty = unknown_count * penalties.get("unknown_token", 0)
capped_penalty = min(raw_penalty, penalties.get("max_unknown_penalty", 0))
score -= capped_penalty
return max(0, min(100, score))


def collect_unknown_tokens(annotated: list[Token]) -> tuple[str, ...]:
    """Return the text of every token still tagged UNKNOWN."""
    return tuple(t.text for t in annotated if t.role is TokenRole.UNKNOWN)


def collect_missing_critical(parsed: ParsedRelease) -> tuple[str, ...]:
    """Return the names of critical structural fields that are absent."""
    missing: list[str] = []
    if not parsed.title:
        missing.append("title")
    if not parsed.media_type or parsed.media_type.value == "unknown":
        missing.append("media_type")
    if parsed.year is None:
        missing.append("year")
    return tuple(missing)


def decide_road(
    score: int,
    has_schema: bool,
    kb: ReleaseKnowledge,
) -> Road:
    """Pick the road the parse took.

    EASY is decided structurally: if a known group schema matched, the
    annotation walked the schema, and that's enough; the score does not
    veto EASY. Otherwise the score decides between SHITTY and
    PATH_OF_PAIN using ``kb.scoring["thresholds"]["shitty_min"]``.
    """
    if has_schema:
        return Road.EASY
    threshold = kb.scoring["thresholds"].get("shitty_min", 60)
    if score >= threshold:
        return Road.SHITTY
    return Road.PATH_OF_PAIN
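
To make the scoring rules above concrete, here is a minimal, self-contained sketch of the weight/penalty/threshold arithmetic. The weights and penalties in `SCORING` are made-up values for illustration (the real ones come from `scoring.yaml`, which is not shown in this diff), and `score_fields` is a hypothetical simplification of `compute_score` that takes the set of populated field names directly instead of a `ParsedRelease`.

```python
from enum import Enum


class Road(str, Enum):
    EASY = "easy"
    SHITTY = "shitty"
    PATH_OF_PAIN = "path_of_pain"


# Assumed values, for illustration only; not the contents of scoring.yaml.
SCORING = {
    "weights": {
        "title": 30, "media_type": 20, "year": 15, "resolution": 10,
        "source": 10, "codec": 5, "group": 10,
    },
    "penalties": {"unknown_token": 5, "max_unknown_penalty": 30},
    "thresholds": {"shitty_min": 60},
}


def score_fields(populated: set[str], unknown_count: int) -> int:
    """Sum the weights of populated fields, subtract the capped penalty, clamp."""
    weights = SCORING["weights"]
    penalties = SCORING["penalties"]
    score = sum(weights.get(name, 0) for name in populated)
    penalty = min(
        unknown_count * penalties["unknown_token"],
        penalties["max_unknown_penalty"],
    )
    return max(0, min(100, score - penalty))


def decide_road(score: int, has_schema: bool) -> Road:
    """A matched schema always wins; otherwise the threshold decides."""
    if has_schema:
        return Road.EASY
    if score >= SCORING["thresholds"]["shitty_min"]:
        return Road.SHITTY
    return Road.PATH_OF_PAIN


partial = score_fields({"title", "media_type", "year", "resolution"}, unknown_count=2)
noisy = score_fields({"title"}, unknown_count=10)
```

With these assumed numbers, `partial` lands above the threshold and `noisy` is dragged to the floor by the capped penalty, while a schema match yields `Road.EASY` regardless of the score.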
+90
@@ -0,0 +1,90 @@
"""Token value objects for the annotate-based parser.

A :class:`Token` carries both the original substring and its position in
the original release name's token stream. A :class:`TokenRole` is the
semantic tag assigned by the annotator.

Why VOs instead of bare ``str``: the annotate step needs to flag tokens
without consuming them (a token may carry residual info, e.g. a
``codec-GROUP`` token contributes both a CODEC and a GROUP role). Tracking
the index also lets later stages reason about *order* (year must come
after title, group must be rightmost, etc.) without re-scanning the list.
"""

from __future__ import annotations

from dataclasses import dataclass, field
from enum import Enum


class TokenRole(str, Enum):
    """Semantic role a token can take after annotation.

    A token starts as ``UNKNOWN`` and may be promoted by the annotator.
    ``str``-backed for cheap comparisons and YAML/JSON interop.

    Roles split into three families:

    - **structural**: TITLE / YEAR / SEASON_EPISODE / GROUP, which drive
      folder and filename naming.
    - **technical**: RESOLUTION / SOURCE / CODEC / AUDIO_CODEC /
      AUDIO_CHANNELS / BIT_DEPTH / HDR / EDITION / LANGUAGE / DISTRIBUTOR,
      which feed ``tech_string`` and metadata fields.
    - **meta**: SITE_TAG (stripped pre-tokenize), SEPARATOR (kept for the
      assemble step if a release uses spaces that need preservation in the
      title), UNKNOWN (residual, contributes to the SHITTY score penalty).
    """

    UNKNOWN = "unknown"

    # Structural
    TITLE = "title"
    YEAR = "year"
    SEASON_EPISODE = "season_episode"
    GROUP = "group"

    # Technical
    RESOLUTION = "resolution"
    SOURCE = "source"
    CODEC = "codec"
    AUDIO_CODEC = "audio_codec"
    AUDIO_CHANNELS = "audio_channels"
    BIT_DEPTH = "bit_depth"
    HDR = "hdr"
    EDITION = "edition"
    LANGUAGE = "language"
    DISTRIBUTOR = "distributor"

    # Meta
    SITE_TAG = "site_tag"


@dataclass(frozen=True)
class Token:
    """An atomic token from a release name.

    ``text`` is the substring exactly as it appeared after tokenization
    (case preserved; uppercase comparisons happen at match time).
    ``index`` is the 0-based position in the tokenized stream, used by
    downstream stages to enforce ordering invariants.

    ``role`` defaults to :attr:`TokenRole.UNKNOWN`. The annotator returns
    new :class:`Token` instances with the role set rather than mutating
    (the dataclass is frozen). ``extra`` carries role-specific payload
    when the token text alone isn't enough (e.g. a ``codec-GROUP`` token
    annotated as CODEC may record the group name in ``extra["group"]``).
    """

    text: str
    index: int
    role: TokenRole = TokenRole.UNKNOWN
    extra: dict[str, str] = field(default_factory=dict)

    def with_role(self, role: TokenRole, **extra: str) -> Token:
        """Return a copy of this token with ``role`` (and optional ``extra``)."""
        merged = {**self.extra, **extra} if extra else self.extra
        return Token(text=self.text, index=self.index, role=role, extra=merged)

    @property
    def is_annotated(self) -> bool:
        return self.role is not TokenRole.UNKNOWN
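
As a quick illustration of the copy-on-annotate behaviour described in the `Token` docstring, the sketch below re-declares a cut-down `TokenRole` (two members only) and the `Token` dataclass so it runs on its own. The sample token text and index are made up.

```python
from dataclasses import FrozenInstanceError, dataclass, field
from enum import Enum


class TokenRole(str, Enum):
    # Reduced to the two roles this example needs.
    UNKNOWN = "unknown"
    CODEC = "codec"


@dataclass(frozen=True)
class Token:
    text: str
    index: int
    role: TokenRole = TokenRole.UNKNOWN
    extra: dict = field(default_factory=dict)

    def with_role(self, role: TokenRole, **extra: str) -> "Token":
        # Build a new instance; the frozen original is left untouched.
        merged = {**self.extra, **extra} if extra else self.extra
        return Token(text=self.text, index=self.index, role=role, extra=merged)


raw = Token(text="x265-KONTRAST", index=5)
annotated = raw.with_role(TokenRole.CODEC, group="KONTRAST")

# Direct mutation is rejected because the dataclass is frozen.
try:
    raw.role = TokenRole.CODEC  # type: ignore[misc]
    mutation_blocked = False
except FrozenInstanceError:
    mutation_blocked = True
```

`raw` keeps its `UNKNOWN` role and empty `extra`, while `annotated` carries the new role plus the residual group name.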
+10
@@ -0,0 +1,10 @@
"""Domain ports for the release domain.
Protocol-based abstractions that decouple ``parse_release`` and
``ParsedRelease`` from any concrete knowledge-base loader. The
infrastructure layer provides the adapter that satisfies this contract.
"""
from .knowledge import ReleaseKnowledge
__all__ = ["ReleaseKnowledge"]
+91
@@ -0,0 +1,91 @@
"""ReleaseKnowledge port: the read-only query surface that
``parse_release`` and ``ParsedRelease`` need from the release knowledge
base, expressed as a structural Protocol so the domain never imports any
concrete loader.

The concrete YAML-backed implementation lives in
``alfred/infrastructure/knowledge/release_kb.py``. Tests can supply any
object that satisfies this shape (e.g. a simple dataclass).
"""

from __future__ import annotations

from typing import TYPE_CHECKING, Protocol

if TYPE_CHECKING:
    from ..parser.schema import GroupSchema


class ReleaseKnowledge(Protocol):
    """Read-only snapshot of release-name parsing knowledge."""

    # --- Token sets used by the tokenizer / matchers ---
    resolutions: set[str]
    sources: set[str]
    codecs: set[str]
    distributors: set[str]
    language_tokens: set[str]
    forbidden_chars: set[str]
    hdr_extra: set[str]

    # --- Structured knowledge (loaded from YAML as dicts) ---
    audio: dict
    video_meta: dict
    editions: dict
    media_type_tokens: dict

    # --- Tokenizer separators ---
    separators: list[str]

    # --- Parse scoring (Phase A) ---
    #
    # ``scoring`` is a dict with three keys:
    #   - ``weights``: dict[field_name, int]  field weight contribution
    #   - ``penalties``: {"unknown_token": int, "max_unknown_penalty": int}
    #   - ``thresholds``: {"shitty_min": int}  SHITTY vs PATH_OF_PAIN cutoff
    #
    # Concrete values come from ``alfred/knowledge/release/scoring.yaml``.
    # The loader fills in safe defaults so this dict is always populated.
    scoring: dict

    # --- ffprobe → scene-token translation tables (consumed by
    # ``application.release.enrich_from_probe``). Domain parsing itself
    # doesn't touch these; they are exposed on the same KB to keep release
    # knowledge in a single ownership point.
    #
    # Shape:
    #   - ``video_codec``: dict[str, str]  ffprobe lower → scene token
    #   - ``audio_codec``: dict[str, str]  ffprobe lower → scene token
    #   - ``audio_channels``: dict[int, str]  channel count → layout ---
    probe_mappings: dict

    # --- File-extension sets (used by application/infra modules that work
    # directly with filesystem paths, e.g. media-type detection, video
    # lookup). Domain parsing itself doesn't touch these. ---
    video_extensions: set[str]
    non_video_extensions: set[str]
    subtitle_extensions: set[str]
    metadata_extensions: set[str]

    # --- Filesystem sanitization (Option B: pre-sanitize at parse time) ---

    def sanitize_for_fs(self, text: str) -> str:
        """Strip filesystem-forbidden characters from ``text``."""
        ...

    # --- Release group schemas (EASY path) ---

    def group_schema(self, name: str) -> GroupSchema | None:
        """Return the parsing schema for the named release group, or
        ``None`` if the group is unknown (caller falls back to SHITTY).

        Lookup is case-insensitive: ``"KONTRAST"``, ``"kontrast"`` and
        ``"Kontrast"`` all resolve to the same schema.
        """
        ...
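
The module docstring notes that tests can supply any object matching the port's shape. The sketch below shows that idea with a deliberately reduced protocol: `WellFormedKnowledge` and `FakeKnowledge` are hypothetical names, the separator and forbidden-character values are assumptions rather than the contents of the real YAML files, and `is_well_formed` mirrors the body of `_is_well_formed` from the service module in this commit.

```python
from dataclasses import dataclass, field
from typing import Protocol


class WellFormedKnowledge(Protocol):
    """Hypothetical slice of ReleaseKnowledge: only what this example uses."""

    separators: list[str]
    forbidden_chars: set[str]

    def sanitize_for_fs(self, text: str) -> str: ...


@dataclass
class FakeKnowledge:
    """Test double; satisfies the protocol structurally, no inheritance needed."""

    # Assumed sample values, not the real knowledge-base contents.
    separators: list[str] = field(
        default_factory=lambda: [".", " ", "[", "]", "(", ")", "_"]
    )
    forbidden_chars: set[str] = field(
        default_factory=lambda: {"@", "#", "!", "%", "[", "]"}
    )
    win_forbidden: str = '<>:"/\\|?*'

    def sanitize_for_fs(self, text: str) -> str:
        # Delete every character listed in win_forbidden.
        return text.translate(str.maketrans("", "", self.win_forbidden))


def is_well_formed(name: str, kb: WellFormedKnowledge) -> bool:
    # Separators are tokenizable, so they never count as malforming.
    tokenizable = set(kb.separators)
    return not any(c in name for c in kb.forbidden_chars if c not in tokenizable)


kb = FakeKnowledge()
```

Because the port is a `Protocol`, `FakeKnowledge` can be passed wherever the reduced port is expected without importing any loader, which is the decoupling the docstring describes.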
+85 -472
@@ -1,58 +1,70 @@
"""Release domain — parsing service."""
"""Release domain — parsing service.
Thin orchestrator over the annotate-based pipeline in
:mod:`alfred.domain.release.parser.pipeline`. Responsibilities:
* Strip a leading/trailing ``[site.tag]`` and decide ``parse_path``.
* Reject malformed names (forbidden characters) → ``parse_path=AI`` so
the LLM can clean them up.
* Otherwise call the v2 pipeline (tokenize → annotate → assemble) and
wrap the result in :class:`ParsedRelease`.
* Score the result and decide the road (EASY / SHITTY / PATH_OF_PAIN)
via :mod:`alfred.domain.release.parser.scoring`.
The public entry point is :func:`parse_release`, which returns
``(ParsedRelease, ParseReport)``. The report carries the confidence
score, the road, and diagnostic info for downstream callers.
"""
from __future__ import annotations
import re
from alfred.infrastructure.knowledge.release import load_separators
from .value_objects import (
_AUDIO,
_CODECS,
_EDITIONS,
_FORBIDDEN_CHARS,
_HDR_EXTRA,
_LANGUAGE_TOKENS,
_MEDIA_TYPE_TOKENS,
_RESOLUTIONS,
_SOURCES,
_VIDEO_META,
MediaTypeToken,
ParsedRelease,
ParsePath,
)
from .parser import pipeline as _v2
from .parser import scoring as _scoring
from .ports import ReleaseKnowledge
from .value_objects import MediaTypeToken, ParsedRelease, ParseReport, TokenizationRoute
def _tokenize(name: str) -> list[str]:
"""Split a release name on the configured separators, dropping empty tokens."""
pattern = "[" + re.escape("".join(load_separators())) + "]+"
return [t for t in re.split(pattern, name) if t]
def parse_release(
name: str, kb: ReleaseKnowledge
) -> tuple[ParsedRelease, ParseReport]:
"""Parse a release name.
def parse_release(name: str) -> ParsedRelease:
"""
Parse a release name and return a ParsedRelease.
Returns a tuple ``(ParsedRelease, ParseReport)``. The structural VO
is unchanged from the previous single-return contract; the report
is new and carries the confidence score + road decision.
Flow:
1. Strip a leading/trailing [site.tag] if present (sets parse_path="sanitized").
2. Check the remainder for truly forbidden chars (anything not in the
configured separators list). If any remain → media_type="unknown",
parse_path="ai", and the LLM handles it.
3. Tokenize using the configured separators (".", " ", "[", "]", "(", ")", "_", ...)
and run token-level matchers (season/episode, tech, languages, audio,
video, edition, title, year).
1. Strip a leading/trailing ``[site.tag]`` if present (sets
``parse_path="sanitized"``).
2. If the remainder still contains truly forbidden chars (anything
not in the configured separators), short-circuit to
``media_type="unknown"`` / ``parse_path="ai"`` and emit a
PATH_OF_PAIN report — the LLM handles these.
3. Otherwise run the v2 pipeline: tokenize → annotate (EASY when a
group schema is known, SHITTY otherwise) → assemble → score.
"""
parse_path = ParsePath.DIRECT.value
parse_path = TokenizationRoute.DIRECT
# Always try to extract a bracket-enclosed site tag first.
clean, site_tag = _strip_site_tag(name)
# Apostrophes inside titles ("Don't", "L'avare") are common and should
# not push the release through the AI fallback. Strip them up front so
# both strip_site_tag and tokenize see "Dont" / "Lavare", which is good
# enough for token-level matching. The raw name is preserved on the VO.
working_name = name
if "'" in working_name:
working_name = working_name.replace("'", "")
parse_path = TokenizationRoute.SANITIZED
clean, site_tag = _v2.strip_site_tag(working_name)
if site_tag is not None:
parse_path = ParsePath.SANITIZED.value
parse_path = TokenizationRoute.SANITIZED
if not _is_well_formed(clean):
return ParsedRelease(
if not _is_well_formed(clean, kb):
parsed = ParsedRelease(
raw=name,
normalised=clean,
clean=clean,
title=clean,
title_sanitized=kb.sanitize_for_fs(clean),
year=None,
season=None,
episode=None,
@@ -61,448 +73,49 @@ def parse_release(name: str) -> ParsedRelease:
source=None,
codec=None,
group="UNKNOWN",
tech_string="",
media_type=MediaTypeToken.UNKNOWN.value,
media_type=MediaTypeToken.UNKNOWN,
site_tag=site_tag,
parse_path=ParsePath.AI.value,
parse_path=TokenizationRoute.AI,
)
name = clean
tokens = _tokenize(name)
season, episode, episode_end = _extract_season_episode(tokens)
quality, source, codec, group, tech_tokens = _extract_tech(tokens)
languages, lang_tokens = _extract_languages(tokens)
audio_codec, audio_channels, audio_tokens = _extract_audio(tokens)
bit_depth, hdr_format, video_tokens = _extract_video_meta(tokens)
edition, edition_tokens = _extract_edition(tokens)
title = _extract_title(
tokens,
tech_tokens | lang_tokens | audio_tokens | video_tokens | edition_tokens,
)
year = _extract_year(tokens, title)
media_type = _infer_media_type(
season, quality, source, codec, year, edition, tokens
report = ParseReport(
confidence=0,
road=_scoring.Road.PATH_OF_PAIN.value,
unknown_tokens=(clean,),
missing_critical=("title", "media_type", "year"),
)
return parsed, report
tech_parts = [p for p in [quality, source, codec] if p]
tech_string = ".".join(tech_parts)
tokens, v2_tag = _v2.tokenize(working_name, kb)
annotated = _v2.annotate(tokens, kb)
fields = _v2.assemble(annotated, v2_tag, name, kb)
return ParsedRelease(
parsed = ParsedRelease(
raw=name,
normalised=name,
title=title,
year=year,
season=season,
episode=episode,
episode_end=episode_end,
quality=quality,
source=source,
codec=codec,
group=group,
tech_string=tech_string,
media_type=media_type,
site_tag=site_tag,
clean=clean,
parse_path=parse_path,
languages=languages,
audio_codec=audio_codec,
audio_channels=audio_channels,
bit_depth=bit_depth,
hdr_format=hdr_format,
edition=edition,
**fields,
)
def _infer_media_type(
season: int | None,
quality: str | None,
source: str | None,
codec: str | None,
year: int | None,
edition: str | None,
tokens: list[str],
) -> str:
"""
Infer media_type from token-level evidence only (no filesystem access).
- documentary : DOC token present
- concert : CONCERT token present
- tv_complete : INTEGRALE/COMPLETE token, no season
- tv_show : season token found
- movie : no season, at least one tech marker
- unknown : no conclusive evidence
"""
upper_tokens = {t.upper() for t in tokens}
doc_tokens = {t.upper() for t in _MEDIA_TYPE_TOKENS.get("doc", [])}
concert_tokens = {t.upper() for t in _MEDIA_TYPE_TOKENS.get("concert", [])}
integrale_tokens = {t.upper() for t in _MEDIA_TYPE_TOKENS.get("integrale", [])}
if upper_tokens & doc_tokens:
return MediaTypeToken.DOCUMENTARY.value
if upper_tokens & concert_tokens:
return MediaTypeToken.CONCERT.value
if (
edition in {"COMPLETE", "INTEGRALE", "COLLECTION"}
or upper_tokens & integrale_tokens
) and season is None:
return MediaTypeToken.TV_COMPLETE.value
if season is not None:
return MediaTypeToken.TV_SHOW.value
if any([quality, source, codec, year]):
return MediaTypeToken.MOVIE.value
return MediaTypeToken.UNKNOWN.value
def _is_well_formed(name: str) -> bool:
"""Return True if name contains no forbidden characters per scene naming rules.
Characters listed as token separators (spaces, brackets, parens, …) are NOT
considered malforming — the tokenizer handles them. Only truly broken chars
like '@', '#', '!', '%' make a name malformed.
"""
tokenizable = set(load_separators())
return not any(c in name for c in _FORBIDDEN_CHARS if c not in tokenizable)
def _strip_site_tag(name: str) -> tuple[str, str | None]:
"""
Strip a site watermark tag from the release name and return (clean_name, tag).
Handles two positions:
- Prefix: "[ OxTorrent.vc ] The.Title.S01..."
- Suffix: "The.Title.S01...-NTb[TGx]"
Anything between [...] is treated as a site tag.
Returns (original_name, None) if no tag found.
"""
s = name.strip()
if s.startswith("["):
close = s.find("]")
if close != -1:
tag = s[1:close].strip()
remainder = s[close + 1 :].strip()
if tag and remainder:
return remainder, tag
if s.endswith("]"):
open_bracket = s.rfind("[")
if open_bracket != -1:
tag = s[open_bracket + 1 : -1].strip()
remainder = s[:open_bracket].strip()
if tag and remainder:
return remainder, tag
return s, None
def _parse_season_episode(tok: str) -> tuple[int, int | None, int | None] | None:
"""
Parse a single token as a season/episode marker.
Handles:
- SxxExx / SxxExxExx / Sxx (canonical scene form)
- NxNN / NxNNxNN (alt form: 1x05, 12x07x08)
Returns (season, episode, episode_end) or None if not a season token.
"""
upper = tok.upper()
# SxxExx form
if len(upper) >= 3 and upper[0] == "S" and upper[1:3].isdigit():
season = int(upper[1:3])
rest = upper[3:]
if not rest:
return season, None, None
episodes: list[int] = []
while rest.startswith("E") and len(rest) >= 3 and rest[1:3].isdigit():
episodes.append(int(rest[1:3]))
rest = rest[3:]
if not episodes:
return None # malformed token like "S03XYZ"
return season, episodes[0], episodes[1] if len(episodes) >= 2 else None
# NxNN form — split on "X" (uppercased), all parts must be digits
if "X" in upper:
parts = upper.split("X")
if len(parts) >= 2 and all(p.isdigit() and p for p in parts):
season = int(parts[0])
episode = int(parts[1])
episode_end = int(parts[2]) if len(parts) >= 3 else None
return season, episode, episode_end
return None
def _extract_season_episode(
tokens: list[str],
) -> tuple[int | None, int | None, int | None]:
for tok in tokens:
parsed = _parse_season_episode(tok)
if parsed is not None:
return parsed
return None, None, None
def _extract_tech(
tokens: list[str],
) -> tuple[str | None, str | None, str | None, str, set[str]]:
"""
Extract quality, source, codec, group from tokens.
Returns (quality, source, codec, group, tech_token_set).
Group extraction strategy (in priority order):
1. Token where prefix is a known codec: x265-GROUP
2. Rightmost token with a dash that isn't a known source
"""
quality: str | None = None
source: str | None = None
codec: str | None = None
group = "UNKNOWN"
tech_tokens: set[str] = set()
for tok in tokens:
tl = tok.lower()
if tl in _RESOLUTIONS:
quality = tok
tech_tokens.add(tok)
continue
if tl in _SOURCES:
source = tok
tech_tokens.add(tok)
continue
if "-" in tok:
parts = tok.rsplit("-", 1)
# codec-GROUP (highest priority for group)
if parts[0].lower() in _CODECS:
codec = parts[0]
group = parts[1] if parts[1] else "UNKNOWN"
tech_tokens.add(tok)
continue
# source with dash: Web-DL, WEB-DL, etc.
if parts[0].lower() in _SOURCES or tok.lower().replace("-", "") in _SOURCES:
source = tok
tech_tokens.add(tok)
continue
if tl in _CODECS:
codec = tok
tech_tokens.add(tok)
# Fallback: rightmost token with a dash that isn't a known source
if group == "UNKNOWN":
for tok in reversed(tokens):
if "-" in tok:
parts = tok.rsplit("-", 1)
tl = tok.lower()
if tl in _SOURCES or tok.lower().replace("-", "") in _SOURCES:
continue
if parts[1]:
group = parts[1]
break
return quality, source, codec, group, tech_tokens
def _is_year_token(tok: str) -> bool:
"""Return True if tok is a 4-digit year between 1900 and 2099."""
return len(tok) == 4 and tok.isdigit() and 1900 <= int(tok) <= 2099
def _extract_title(tokens: list[str], tech_tokens: set[str]) -> str:
"""Extract the title portion: everything before the first season/year/tech token."""
title_parts = []
for tok in tokens:
if _parse_season_episode(tok) is not None:
break
if _is_year_token(tok):
break
if tok in tech_tokens or tok.lower() in _RESOLUTIONS | _SOURCES | _CODECS:
break
if "-" in tok and any(p.lower() in _CODECS | _SOURCES for p in tok.split("-")):
break
title_parts.append(tok)
return ".".join(title_parts) if title_parts else tokens[0]
def _extract_year(tokens: list[str], title: str) -> int | None:
"""Extract a 4-digit year from tokens (only after the title)."""
title_len = len(title.split("."))
for tok in tokens[title_len:]:
if _is_year_token(tok):
return int(tok)
return None
# ---------------------------------------------------------------------------
# Sequence matcher
# ---------------------------------------------------------------------------
def _match_sequences(
tokens: list[str],
sequences: list[dict],
key: str,
) -> tuple[str | None, set[str]]:
"""
Try to match multi-token sequences against consecutive tokens.
Returns (matched_value, set_of_matched_tokens) or (None, empty_set).
Sequences must be ordered most-specific first in the YAML.
"""
upper_tokens = [t.upper() for t in tokens]
for seq in sequences:
seq_upper = [s.upper() for s in seq["tokens"]]
n = len(seq_upper)
for i in range(len(upper_tokens) - n + 1):
if upper_tokens[i : i + n] == seq_upper:
matched = set(tokens[i : i + n])
return seq[key], matched
return None, set()
# ---------------------------------------------------------------------------
# Language extraction
# ---------------------------------------------------------------------------
def _extract_languages(tokens: list[str]) -> tuple[list[str], set[str]]:
"""Extract language tokens. Returns (languages, matched_token_set)."""
languages = []
lang_tokens: set[str] = set()
for tok in tokens:
if tok.upper() in _LANGUAGE_TOKENS:
languages.append(tok.upper())
lang_tokens.add(tok)
return languages, lang_tokens
# ---------------------------------------------------------------------------
# Audio extraction
# ---------------------------------------------------------------------------
def _extract_audio(
tokens: list[str],
) -> tuple[str | None, str | None, set[str]]:
"""
Extract audio codec and channel layout.
Returns (audio_codec, audio_channels, matched_token_set).
Sequences are tried first (DTS.HD.MA, TrueHD.Atmos, …), then single tokens.
"""
audio_codec: str | None = None
audio_channels: str | None = None
audio_tokens: set[str] = set()
known_codecs = {c.upper() for c in _AUDIO.get("codecs", [])}
known_channels = set(_AUDIO.get("channels", []))
# Try multi-token sequences first
matched_codec, matched_set = _match_sequences(
tokens, _AUDIO.get("sequences", []), "codec"
has_schema = _v2.has_known_schema(tokens, kb)
score = _scoring.compute_score(parsed, annotated, kb)
road = _scoring.decide_road(score, has_schema, kb)
report = ParseReport(
confidence=score,
road=road.value,
unknown_tokens=_scoring.collect_unknown_tokens(annotated),
missing_critical=_scoring.collect_missing_critical(parsed),
)
if matched_codec:
audio_codec = matched_codec
audio_tokens |= matched_set
# Channel layouts like "5.1" or "7.1" are split into two tokens by normalize —
# detect them as consecutive pairs "X" + "Y" where "X.Y" is a known channel.
# The second token may have a "-GROUP" suffix (e.g. "1-KTH" → strip it).
for i in range(len(tokens) - 1):
second = tokens[i + 1].split("-")[0]
candidate = f"{tokens[i]}.{second}"
if candidate in known_channels and audio_channels is None:
audio_channels = candidate
audio_tokens.add(tokens[i])
audio_tokens.add(tokens[i + 1])
for tok in tokens:
if tok in audio_tokens:
continue
if tok.upper() in known_codecs and audio_codec is None:
audio_codec = tok
audio_tokens.add(tok)
elif tok in known_channels and audio_channels is None:
audio_channels = tok
audio_tokens.add(tok)
return audio_codec, audio_channels, audio_tokens
return parsed, report
# ---------------------------------------------------------------------------
# Video metadata extraction (bit depth, HDR)
# ---------------------------------------------------------------------------
def _is_well_formed(name: str, kb: ReleaseKnowledge) -> bool:
"""Return True if ``name`` contains no forbidden characters per scene
naming rules.
def _extract_video_meta(
tokens: list[str],
) -> tuple[str | None, str | None, set[str]]:
Characters listed as token separators (spaces, brackets, parens, …)
are NOT considered malforming — the tokenizer handles them. Only
truly broken chars like ``@``, ``#``, ``!``, ``%`` make a name
malformed.
"""
Extract bit depth and HDR format.
Returns (bit_depth, hdr_format, matched_token_set).
"""
bit_depth: str | None = None
hdr_format: str | None = None
video_tokens: set[str] = set()
known_hdr = {h.upper() for h in _VIDEO_META.get("hdr", [])} | _HDR_EXTRA
known_depth = {d.lower() for d in _VIDEO_META.get("bit_depth", [])}
# Try HDR sequences first
matched_hdr, matched_set = _match_sequences(
tokens, _VIDEO_META.get("sequences", []), "hdr"
)
if matched_hdr:
hdr_format = matched_hdr
video_tokens |= matched_set
for tok in tokens:
if tok in video_tokens:
continue
if tok.upper() in known_hdr and hdr_format is None:
hdr_format = tok.upper()
video_tokens.add(tok)
elif tok.lower() in known_depth and bit_depth is None:
bit_depth = tok.lower()
video_tokens.add(tok)
return bit_depth, hdr_format, video_tokens
# ---------------------------------------------------------------------------
# Edition extraction
# ---------------------------------------------------------------------------
def _extract_edition(tokens: list[str]) -> tuple[str | None, set[str]]:
"""
Extract release edition (UNRATED, EXTENDED, DIRECTORS.CUT, …).
Returns (edition, matched_token_set).
"""
known_tokens = {t.upper() for t in _EDITIONS.get("tokens", [])}
# Try multi-token sequences first
matched_edition, matched_set = _match_sequences(
tokens, _EDITIONS.get("sequences", []), "edition"
)
if matched_edition:
return matched_edition, matched_set
for tok in tokens:
if tok.upper() in known_tokens:
return tok.upper(), {tok}
return None, set()
tokenizable = set(kb.separators)
return not any(c in name for c in kb.forbidden_chars if c not in tokenizable)
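
The first two steps of the flow in `parse_release` (apostrophe removal, site-tag stripping, forbidden-character check) can be sketched in isolation as follows. This is an approximation under stated assumptions: `choose_route` is a hypothetical helper, `strip_site_tag` reuses the bracket logic of the removed `_strip_site_tag` as a stand-in for the unseen `_v2.strip_site_tag`, and `forbidden` is assumed to already exclude separator characters.

```python
from enum import Enum
from typing import Optional


class TokenizationRoute(str, Enum):
    DIRECT = "direct"
    SANITIZED = "sanitized"
    AI = "ai"


def strip_site_tag(name: str) -> tuple[str, Optional[str]]:
    """Strip a leading or trailing [tag]; return (remainder, tag or None)."""
    s = name.strip()
    if s.startswith("["):
        close = s.find("]")
        if close != -1:
            tag = s[1:close].strip()
            remainder = s[close + 1:].strip()
            if tag and remainder:
                return remainder, tag
    if s.endswith("]"):
        open_bracket = s.rfind("[")
        if open_bracket != -1:
            tag = s[open_bracket + 1:-1].strip()
            remainder = s[:open_bracket].strip()
            if tag and remainder:
                return remainder, tag
    return s, None


def choose_route(
    name: str, forbidden: set[str]
) -> tuple[str, Optional[str], TokenizationRoute]:
    """Hypothetical helper: return (clean_name, site_tag, route)."""
    route = TokenizationRoute.DIRECT
    working = name
    if "'" in working:
        # Apostrophes are dropped rather than sent to the AI fallback.
        working = working.replace("'", "")
        route = TokenizationRoute.SANITIZED
    clean, site_tag = strip_site_tag(working)
    if site_tag is not None:
        route = TokenizationRoute.SANITIZED
    if any(c in clean for c in forbidden):
        # Still malformed after sanitization: hand over to the LLM.
        route = TokenizationRoute.AI
    return clean, site_tag, route


FORBIDDEN = {"@", "#", "!", "%"}
```

Names that need no cleanup stay on `DIRECT`, any stripping moves them to `SANITIZED`, and only a leftover forbidden character forces `AI`.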
+114 -85
@@ -1,55 +1,24 @@
"""Release domain — value objects and token sets."""
"""Release domain — value objects.
This module is **pure**: no I/O, no YAML loading, no knowledge-base
imports. All knowledge that the parser consumes is injected at runtime
via the ``ReleaseKnowledge`` port (see ``ports/knowledge.py``).
``ParsedRelease`` follows Option B of the snapshot-VO design: filesystem
sanitization is performed once at parse time and stored in
``title_sanitized``. The builder methods (``show_folder_name``,
``episode_filename``, etc.) are therefore pure string-formatting and do
**not** need access to any knowledge base — but they require the caller
to pass already-sanitized TMDB strings. The use case is responsible for
calling ``kb.sanitize_for_fs(tmdb_title)`` before invoking the builders.
"""
from __future__ import annotations
from dataclasses import dataclass, field
from dataclasses import dataclass
from enum import Enum
from ..shared.exceptions import ValidationError
from alfred.infrastructure.knowledge.release import (
load_audio,
load_codecs,
load_editions,
load_forbidden_chars,
load_hdr_extra,
load_language_tokens,
load_media_type_tokens,
load_metadata_extensions,
load_non_video_extensions,
load_resolutions,
load_sources,
load_sources_extra,
load_subtitle_extensions,
load_video,
load_video_extensions,
load_win_forbidden_chars,
)
# Token sets — loaded once at import time from alfred/knowledge/release/
_RESOLUTIONS: set[str] = load_resolutions()
_SOURCES: set[str] = load_sources() | load_sources_extra()
_CODECS: set[str] = load_codecs()
_VIDEO_EXTENSIONS: set[str] = load_video_extensions()
_NON_VIDEO_EXTENSIONS: set[str] = load_non_video_extensions()
_SUBTITLE_EXTENSIONS: set[str] = load_subtitle_extensions()
# Both metadata and subtitle extensions are ignored when deciding the media
# type of a folder — neither is a conclusive signal for movie/tv/other.
_METADATA_EXTENSIONS: set[str] = load_metadata_extensions() | _SUBTITLE_EXTENSIONS
_FORBIDDEN_CHARS: set[str] = load_forbidden_chars()
_LANGUAGE_TOKENS: set[str] = load_language_tokens()
_AUDIO: dict = load_audio()
_VIDEO_META: dict = load_video()
_EDITIONS: dict = load_editions()
_HDR_EXTRA: set[str] = load_hdr_extra()
_MEDIA_TYPE_TOKENS: dict = load_media_type_tokens()
# Translation table for stripping Windows-forbidden characters
_WIN_FORBIDDEN_TABLE = str.maketrans("", "", "".join(load_win_forbidden_chars()))
def _sanitize_for_fs(text: str) -> str:
"""Remove Windows-forbidden characters from a string."""
return text.translate(_WIN_FORBIDDEN_TABLE)
class MediaTypeToken(str, Enum):
@@ -71,19 +40,27 @@ class MediaTypeToken(str, Enum):
UNKNOWN = "unknown"
class ParsePath(str, Enum):
"""How a ``ParsedRelease`` was produced. ``str``-backed for the same
reasons as :class:`MediaTypeToken`."""
class TokenizationRoute(str, Enum):
"""How a ``ParsedRelease`` was produced.
Records the **tokenization route** — i.e. whether the release name
was tokenized as-is (``DIRECT``), after a sanitization pass like
site-tag stripping or apostrophe removal (``SANITIZED``), or whether
structural parsing failed and an LLM rebuild is needed (``AI``).
This is **orthogonal** to :class:`~alfred.domain.release.parser.scoring.Road`
(EASY / SHITTY / PATH_OF_PAIN), which captures parser confidence and
is recorded on :class:`ParseReport`. Both can vary independently —
a SANITIZED name can still land on the EASY road if a group schema
matches the tokens after stripping.
``str``-backed for the same reasons as :class:`MediaTypeToken`."""
DIRECT = "direct"
SANITIZED = "sanitized"
AI = "ai"
_VALID_MEDIA_TYPES: frozenset[str] = frozenset(m.value for m in MediaTypeToken)
_VALID_PARSE_PATHS: frozenset[str] = frozenset(p.value for p in ParsePath)
def _strip_episode_from_normalized(normalized: str) -> str:
"""
Remove all episode parts (Exx) from a normalized release name, keeping Sxx.
@@ -103,13 +80,57 @@ def _strip_episode_from_normalized(normalized: str) -> str:
return ".".join(result)
@dataclass
@dataclass(frozen=True)
class ParseReport:
"""Diagnostic report attached to a :class:`ParsedRelease`.
``parse_release`` returns ``(ParsedRelease, ParseReport)``. The
report describes *how confident* the parser is in the result and
*which road* produced it. It is intentionally separate from
``ParsedRelease`` so the structural VO stays free of meta-concerns
about its own quality.
Fields:
- ``confidence``: integer 0-100 (see :func:`parser.scoring.compute_score`).
- ``road``: ``"easy"`` / ``"shitty"`` / ``"path_of_pain"`` — distinct
from ``ParsedRelease.parse_path`` (which describes the
tokenization route, not the confidence tier).
- ``unknown_tokens``: tokens that finished annotation with role
UNKNOWN, in order of appearance.
- ``missing_critical``: names of critical structural fields the
parser couldn't fill (subset of ``{"title", "media_type", "year"}``).
"""
confidence: int
road: str # one of parser.scoring.Road values
unknown_tokens: tuple[str, ...] = ()
missing_critical: tuple[str, ...] = ()
def __post_init__(self) -> None:
if not (0 <= self.confidence <= 100):
raise ValidationError(
f"ParseReport.confidence out of range: {self.confidence}"
)
@dataclass(frozen=True)
class ParsedRelease:
"""Structured representation of a parsed release name."""
"""Structured representation of a parsed release name.
``title_sanitized`` carries the filesystem-safe form of ``title`` (computed
by the parser at construction time using the injected knowledge base).
Builder methods rely on it being already-sanitized — see module docstring.
Frozen: enrichment passes (``detect_media_type``, ``enrich_from_probe``)
return a **new** ``ParsedRelease`` via ``dataclasses.replace`` rather
than mutating in place. ``languages`` is a tuple for the same reason.
"""
raw: str # original release name (untouched)
normalised: str # dots instead of spaces
clean: str # raw minus site_tag and apostrophes — used by season_folder_name()
title: str # show/movie title (dots, no year/season/tech)
title_sanitized: str # title with filesystem-forbidden chars stripped
year: int | None # movie year or show start year (from TMDB)
season: int | None # season number (None for movies)
episode: int | None # first episode number (None if season-pack)
@@ -118,18 +139,18 @@ class ParsedRelease:
source: str | None # WEBRip, BluRay, …
codec: str | None # x265, HEVC, …
group: str # release group, "UNKNOWN" if missing
tech_string: str # quality.source.codec joined with dots
media_type: MediaTypeToken = MediaTypeToken.UNKNOWN
site_tag: str | None = (
None # site watermark stripped from name, e.g. "TGx", "OxTorrent.vc"
)
parse_path: ParsePath = ParsePath.DIRECT
languages: list[str] = field(default_factory=list) # ["MULTI", "VFF"], ["FRENCH"], …
parse_path: TokenizationRoute = TokenizationRoute.DIRECT
languages: tuple[str, ...] = () # ("MULTI", "VFF"), ("FRENCH",), …
audio_codec: str | None = None # "DTS-HD.MA", "DDP", "EAC3", …
audio_channels: str | None = None # "5.1", "7.1", "2.0", …
bit_depth: str | None = None # "10bit", "8bit", …
hdr_format: str | None = None # "DV", "HDR10", "DV.HDR10", …
edition: str | None = None # "UNRATED", "EXTENDED", "DIRECTORS.CUT", …
distributor: str | None = None # "NF", "AMZN", "DSNP", … (streaming origin)
def __post_init__(self) -> None:
if not self.raw:
@@ -158,36 +179,41 @@ class ParsedRelease:
f"ParsedRelease.episode_end ({self.episode_end}) < "
f"episode ({self.episode})"
)
# Coerce raw strings into their enum form (tolerant constructor).
if not isinstance(self.media_type, MediaTypeToken):
try:
self.media_type = MediaTypeToken(self.media_type)
except ValueError:
raise ValidationError(
f"ParsedRelease.media_type invalid: {self.media_type!r} "
f"(expected one of {sorted(_VALID_MEDIA_TYPES)})"
) from None
if not isinstance(self.parse_path, ParsePath):
try:
self.parse_path = ParsePath(self.parse_path)
except ValueError:
f"ParsedRelease.media_type must be a MediaTypeToken, "
f"got {type(self.media_type).__name__}: {self.media_type!r}"
)
if not isinstance(self.parse_path, TokenizationRoute):
raise ValidationError(
f"ParsedRelease.parse_path invalid: {self.parse_path!r} "
f"(expected one of {sorted(_VALID_PARSE_PATHS)})"
) from None
f"ParsedRelease.parse_path must be a TokenizationRoute, "
f"got {type(self.parse_path).__name__}: {self.parse_path!r}"
)
@property
def is_season_pack(self) -> bool:
return self.season is not None and self.episode is None
def show_folder_name(self, tmdb_title: str, tmdb_year: int) -> str:
@property
def tech_string(self) -> str:
"""``quality.source.codec`` joined by dots, skipping ``None`` parts.
Derived on every access so it stays in sync with the underlying
fields — no manual refresh needed after enrichment.
"""
return ".".join(p for p in (self.quality, self.source, self.codec) if p)
def show_folder_name(self, tmdb_title_safe: str, tmdb_year: int) -> str:
"""
Build the series root folder name.
Format: {Title}.{Year}.{Tech}-{Group}
Example: Oz.1997.1080p.WEBRip.x265-KONTRAST
``tmdb_title_safe`` must already be filesystem-safe (the caller is
expected to have run it through ``kb.sanitize_for_fs``).
"""
title_part = _sanitize_for_fs(tmdb_title).replace(" ", ".")
title_part = tmdb_title_safe.replace(" ", ".")
tech = self.tech_string or "Unknown"
return f"{title_part}.{tmdb_year}.{tech}-{self.group}"
@@ -199,44 +225,47 @@ class ParsedRelease:
For a single-episode release we still strip the episode token so the
folder can hold the whole season.
"""
return _strip_episode_from_normalized(self.normalised)
return _strip_episode_from_normalized(self.clean)
def episode_filename(self, tmdb_episode_title: str | None, ext: str) -> str:
def episode_filename(self, tmdb_episode_title_safe: str | None, ext: str) -> str:
"""
Build the episode filename.
Format: {Title}.{SxxExx}.{EpisodeTitle}.{Tech}-{Group}.{ext}
Example: Oz.S01E01.The.Routine.1080p.WEBRip.x265-KONTRAST.mkv
If tmdb_episode_title is None, omits the episode title segment.
``tmdb_episode_title_safe`` must already be filesystem-safe; pass
``None`` to omit the episode title segment.
"""
title_part = _sanitize_for_fs(self.title)
title_part = self.title_sanitized
s = f"S{self.season:02d}" if self.season is not None else ""
e = f"E{self.episode:02d}" if self.episode is not None else ""
se = s + e
ep_title = ""
if tmdb_episode_title:
ep_title = "." + _sanitize_for_fs(tmdb_episode_title).replace(" ", ".")
if tmdb_episode_title_safe:
ep_title = "." + tmdb_episode_title_safe.replace(" ", ".")
tech = self.tech_string or "Unknown"
ext_clean = ext.lstrip(".")
return f"{title_part}.{se}{ep_title}.{tech}-{self.group}.{ext_clean}"
def movie_folder_name(self, tmdb_title: str, tmdb_year: int) -> str:
def movie_folder_name(self, tmdb_title_safe: str, tmdb_year: int) -> str:
"""
Build the movie folder name.
Format: {Title}.{Year}.{Tech}-{Group}
Example: Inception.2010.1080p.BluRay.x265-GROUP
"""
return self.show_folder_name(tmdb_title, tmdb_year)
return self.show_folder_name(tmdb_title_safe, tmdb_year)
def movie_filename(self, tmdb_title: str, tmdb_year: int, ext: str) -> str:
def movie_filename(
self, tmdb_title_safe: str, tmdb_year: int, ext: str
) -> str:
"""
Build the movie filename (same as folder name + extension).
Example: Inception.2010.1080p.BluRay.x265-GROUP.mkv
"""
ext_clean = ext.lstrip(".")
return f"{self.movie_folder_name(tmdb_title, tmdb_year)}.{ext_clean}"
return f"{self.movie_folder_name(tmdb_title_safe, tmdb_year)}.{ext_clean}"
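To illustrate the frozen-plus-derived-property pattern described in the docstring above, here is a minimal sketch. `MiniRelease` is a hypothetical, cut-down stand-in for `ParsedRelease` (only the fields needed for the demo); the `tech_string` expression and folder format are copied from the diff, everything else is an assumption for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, replace


@dataclass(frozen=True)
class MiniRelease:
    """Hypothetical stand-in for ParsedRelease, reduced to a few fields."""

    title_sanitized: str
    group: str
    quality: str | None = None
    source: str | None = None
    codec: str | None = None

    @property
    def tech_string(self) -> str:
        # Same expression as in the diff: join the non-empty parts with dots.
        return ".".join(p for p in (self.quality, self.source, self.codec) if p)

    def show_folder_name(self, tmdb_title_safe: str, tmdb_year: int) -> str:
        # The caller is expected to pass an already filesystem-safe title.
        title_part = tmdb_title_safe.replace(" ", ".")
        tech = self.tech_string or "Unknown"
        return f"{title_part}.{tmdb_year}.{tech}-{self.group}"


parsed = MiniRelease(
    title_sanitized="Oz", group="KONTRAST", quality="1080p", source="WEBRip"
)
# Enrichment returns a new instance; `parsed` itself is left untouched.
enriched = replace(parsed, codec="x265")

print(parsed.tech_string)                      # 1080p.WEBRip
print(enriched.tech_string)                    # 1080p.WEBRip.x265
print(enriched.show_folder_name("Oz", 1997))   # Oz.1997.1080p.WEBRip.x265-KONTRAST
```

Because `tech_string` is computed on access, the enriched copy picks up the new codec without any manual refresh step.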
+267
@@ -0,0 +1,267 @@
"""Media — file-level track types (video/audio/subtitle) and MediaInfo container.
These are the **container-view** dataclasses, populated from ffprobe output and
used across the project to describe the content of a media file.
Not to be confused with ``alfred.domain.subtitles.entities.SubtitleScanResult``
which models a subtitle being **scanned/matched** (with confidence, raw tokens,
file path, etc.). The two coexist by design — they describe the same real-world
concept seen from two different bounded contexts.
"""
from __future__ import annotations
from dataclasses import dataclass, field
from .value_objects import Language
__all__ = [
"AudioTrack",
"MediaInfo",
"MediaWithTracks",
"SubtitleTrack",
"VideoTrack",
"track_lang_matches",
]
# ─────────────────────────────────────────────────────────────────────────────
# Track types — one frozen dataclass per stream kind
# ─────────────────────────────────────────────────────────────────────────────
@dataclass(frozen=True)
class AudioTrack:
"""A single audio track as reported by ffprobe."""
index: int
codec: str | None # aac, ac3, eac3, dts, truehd, flac, …
channels: int | None # 2, 6 (5.1), 8 (7.1), …
channel_layout: str | None # stereo, 5.1, 7.1, …
language: str | None # ISO 639-2: fre, eng, und, …
is_default: bool = False
@dataclass(frozen=True)
class SubtitleTrack:
"""A single embedded subtitle track as reported by ffprobe."""
index: int
codec: str | None # subrip, ass, hdmv_pgs_subtitle, …
language: str | None # ISO 639-2: fre, eng, und, …
is_default: bool = False
is_forced: bool = False
@dataclass(frozen=True)
class VideoTrack:
"""A single video track as reported by ffprobe.
A media file typically has one video track but can have several (alt
camera angles, attached thumbnail images reported as still-image streams,
etc.), hence the tuple[VideoTrack, ...] on MediaInfo.
"""
index: int
codec: str | None # h264, hevc, av1, …
width: int | None
height: int | None
is_default: bool = False
@property
def resolution(self) -> str | None:
"""
Best-effort resolution string: 2160p, 1080p, 720p, …
Width takes priority over height to handle widescreen/cinema crops
(e.g. 1920×960 scope → 1080p, not 720p). Falls back to height when
width is unavailable.
"""
match (self.width, self.height):
case (None, None):
return None
case (w, h) if w is not None:
match True:
case _ if w >= 3840:
return "2160p"
case _ if w >= 1920:
return "1080p"
case _ if w >= 1280:
return "720p"
case _ if w >= 720:
return "576p"
case _ if w >= 640:
return "480p"
case _:
return f"{h}p" if h else f"{w}w"
case (None, h):
match True:
case _ if h >= 2160:
return "2160p"
case _ if h >= 1080:
return "1080p"
case _ if h >= 720:
return "720p"
case _ if h >= 576:
return "576p"
case _ if h >= 480:
return "480p"
case _:
return f"{h}p"
# ─────────────────────────────────────────────────────────────────────────────
# MediaInfo — assembles video/audio/subtitle tracks for a media file
# ─────────────────────────────────────────────────────────────────────────────
@dataclass(frozen=True)
class MediaInfo:
"""
File-level media metadata extracted by ffprobe — immutable snapshot.
Symmetric design: every stream type is a tuple of typed track objects
(immutable on purpose — a MediaInfo is a frozen view of one ffprobe run,
not a mutable collection to append to).
Backwards-compatible flat accessors (``resolution``, ``width``, …) read
from the first video track when present.
"""
video_tracks: tuple[VideoTrack, ...] = field(default_factory=tuple)
audio_tracks: tuple[AudioTrack, ...] = field(default_factory=tuple)
subtitle_tracks: tuple[SubtitleTrack, ...] = field(default_factory=tuple)
# File-level (from ffprobe ``format`` block, not from any single stream)
duration_seconds: float | None = None
bitrate_kbps: int | None = None
# ──────────────────────────────────────────────────────────────────────
# Video conveniences — read the first video track
# ──────────────────────────────────────────────────────────────────────
@property
def primary_video(self) -> VideoTrack | None:
return self.video_tracks[0] if self.video_tracks else None
@property
def width(self) -> int | None:
v = self.primary_video
return v.width if v else None
@property
def height(self) -> int | None:
v = self.primary_video
return v.height if v else None
@property
def video_codec(self) -> str | None:
v = self.primary_video
return v.codec if v else None
@property
def resolution(self) -> str | None:
v = self.primary_video
return v.resolution if v else None
# ──────────────────────────────────────────────────────────────────────
# Audio conveniences
# ──────────────────────────────────────────────────────────────────────
@property
def audio_languages(self) -> list[str]:
"""Unique audio languages across all tracks (ISO 639-2)."""
seen: set[str] = set()
result: list[str] = []
for track in self.audio_tracks:
if track.language and track.language not in seen:
seen.add(track.language)
result.append(track.language)
return result
@property
def is_multi_audio(self) -> bool:
"""True if more than one audio language is present."""
return len(self.audio_languages) > 1
# ─────────────────────────────────────────────────────────────────────────────
# Language matching — shared helper + mixin
# ─────────────────────────────────────────────────────────────────────────────
def track_lang_matches(track_lang: str | None, query: str | Language) -> bool:
"""
Match a track's language string against a query (contract "C+").
* ``Language`` query → matches if the track string is any known
representation of that Language (delegates to ``Language.matches``).
Powerful, cross-format mode.
* ``str`` query → case-insensitive direct comparison against
``track_lang``. Simple, no normalization, no registry lookup.
Callers needing cross-format resolution (``"fr"`` ↔ ``"fre"`` ↔
``"french"``) should resolve their string through a ``LanguageRegistry``
once and pass the resulting ``Language``.
"""
if track_lang is None:
return False
if isinstance(query, Language):
return query.matches(track_lang)
if isinstance(query, str):
return track_lang.lower().strip() == query.lower().strip()
return False
class MediaWithTracks:
"""
Mixin providing audio/subtitle helpers for entities with track collections.
Hosts must expose two attributes:
* ``audio_tracks: tuple[AudioTrack, ...]``
* ``subtitle_tracks: tuple[SubtitleTrack, ...]``
The helpers follow the "C+" matching contract: pass a :class:`Language`
for cross-format matching, or a ``str`` for case-insensitive comparison.
"""
# These attributes are provided by the host entity (Movie, Episode, …).
# Declared here only for type-checkers and to make the contract explicit.
audio_tracks: tuple[AudioTrack, ...]
subtitle_tracks: tuple[SubtitleTrack, ...]
# ── Audio helpers ──────────────────────────────────────────────────────
def has_audio_in(self, lang: str | Language) -> bool:
"""True if at least one audio track is in the given language."""
return any(track_lang_matches(t.language, lang) for t in self.audio_tracks)
def audio_languages(self) -> list[str]:
"""Unique audio languages across all tracks, in track order."""
seen: set[str] = set()
result: list[str] = []
for t in self.audio_tracks:
if t.language and t.language not in seen:
seen.add(t.language)
result.append(t.language)
return result
# ── Subtitle helpers ───────────────────────────────────────────────────
def has_subtitles_in(self, lang: str | Language) -> bool:
"""True if at least one subtitle track is in the given language."""
return any(track_lang_matches(t.language, lang) for t in self.subtitle_tracks)
def has_forced_subs(self) -> bool:
"""True if at least one subtitle track is flagged as forced."""
return any(t.is_forced for t in self.subtitle_tracks)
def subtitle_languages(self) -> list[str]:
"""Unique subtitle languages across all tracks, in track order."""
seen: set[str] = set()
result: list[str] = []
for t in self.subtitle_tracks:
if t.language and t.language not in seen:
seen.add(t.language)
result.append(t.language)
return result
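The two modes of the "C+" matching contract can be sketched as follows. `track_lang_matches` is reproduced from the file above; the `Language` class here is a hypothetical stand-in whose `matches` simply checks the ISO code and aliases, which is an assumption about the real value object's behaviour.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Language:
    """Hypothetical stand-in; the real value object lives in value_objects.py."""

    iso: str
    aliases: tuple[str, ...] = ()

    def matches(self, raw: str) -> bool:
        # Assumed behaviour: compare against the iso code and every alias.
        needle = raw.lower().strip()
        return needle == self.iso or needle in self.aliases


def track_lang_matches(track_lang: str | None, query: str | Language) -> bool:
    if track_lang is None:
        return False
    if isinstance(query, Language):
        return query.matches(track_lang)
    if isinstance(query, str):
        return track_lang.lower().strip() == query.lower().strip()
    return False


french = Language(iso="fre", aliases=("fr", "fra", "french"))

print(track_lang_matches("FRE", "fre"))   # True: str mode is case-insensitive
print(track_lang_matches("fr", "fre"))    # False: str mode does no cross-format lookup
print(track_lang_matches("fr", french))   # True: Language mode delegates to matches()
print(track_lang_matches(None, french))   # False: track has no language tag
```

The second call is the case the docstring warns about: callers that need `"fr"` to match `"fre"` should resolve the string to a `Language` first.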
-21
@@ -1,21 +0,0 @@
"""Media — file-level track types (video/audio/subtitle) and MediaInfo container.
These are the **container-view** dataclasses, populated from ffprobe output and
used across the project to describe the content of a media file.
"""
from .audio import AudioTrack
from .info import MediaInfo
from .matching import track_lang_matches
from .subtitle import SubtitleTrack
from .tracks_mixin import MediaWithTracks
from .video import VideoTrack
__all__ = [
"AudioTrack",
"MediaInfo",
"MediaWithTracks",
"SubtitleTrack",
"VideoTrack",
"track_lang_matches",
]
-17
@@ -1,17 +0,0 @@
"""AudioTrack — a single audio stream as reported by ffprobe."""
from __future__ import annotations
from dataclasses import dataclass
@dataclass(frozen=True)
class AudioTrack:
"""A single audio track as reported by ffprobe."""
index: int
codec: str | None # aac, ac3, eac3, dts, truehd, flac, …
channels: int | None # 2, 6 (5.1), 8 (7.1), …
channel_layout: str | None # stereo, 5.1, 7.1, …
language: str | None # ISO 639-2: fre, eng, und, …
is_default: bool = False
-78
@@ -1,78 +0,0 @@
"""MediaInfo — assembles video, audio and subtitle tracks for a media file."""
from __future__ import annotations
from dataclasses import dataclass, field
from .audio import AudioTrack
from .subtitle import SubtitleTrack
from .video import VideoTrack
@dataclass(frozen=True)
class MediaInfo:
"""
File-level media metadata extracted by ffprobe — immutable snapshot.
Symmetric design: every stream type is a tuple of typed track objects
(immutable on purpose — a MediaInfo is a frozen view of one ffprobe run,
not a mutable collection to append to).
Backwards-compatible flat accessors (``resolution``, ``width``, …) read
from the first video track when present.
"""
video_tracks: tuple[VideoTrack, ...] = field(default_factory=tuple)
audio_tracks: tuple[AudioTrack, ...] = field(default_factory=tuple)
subtitle_tracks: tuple[SubtitleTrack, ...] = field(default_factory=tuple)
# File-level (from ffprobe ``format`` block, not from any single stream)
duration_seconds: float | None = None
bitrate_kbps: int | None = None
# ──────────────────────────────────────────────────────────────────────
# Video conveniences — read the first video track
# ──────────────────────────────────────────────────────────────────────
@property
def primary_video(self) -> VideoTrack | None:
return self.video_tracks[0] if self.video_tracks else None
@property
def width(self) -> int | None:
v = self.primary_video
return v.width if v else None
@property
def height(self) -> int | None:
v = self.primary_video
return v.height if v else None
@property
def video_codec(self) -> str | None:
v = self.primary_video
return v.codec if v else None
@property
def resolution(self) -> str | None:
v = self.primary_video
return v.resolution if v else None
# ──────────────────────────────────────────────────────────────────────
# Audio conveniences
# ──────────────────────────────────────────────────────────────────────
@property
def audio_languages(self) -> list[str]:
"""Unique audio languages across all tracks (ISO 639-2)."""
seen: set[str] = set()
result: list[str] = []
for track in self.audio_tracks:
if track.language and track.language not in seen:
seen.add(track.language)
result.append(track.language)
return result
@property
def is_multi_audio(self) -> bool:
"""True if more than one audio language is present."""
return len(self.audio_languages) > 1
-33
@@ -1,33 +0,0 @@
"""Language-matching helper shared by media-bearing entities.
Both ``Episode`` and ``Movie`` carry ``audio_tracks`` / ``subtitle_tracks`` and
need to answer "do I have audio in language X?". The matching contract is the
same in both cases — keep it in one place.
"""
from __future__ import annotations
from ..value_objects import Language
def track_lang_matches(track_lang: str | None, query: str | Language) -> bool:
"""
Match a track's language string against a query (contract "C+").
* ``Language`` query → matches if the track string is any known
representation of that Language (delegates to ``Language.matches``).
Powerful, cross-format mode.
* ``str`` query → case-insensitive direct comparison against
``track_lang``. Simple, no normalization, no registry lookup.
Callers needing cross-format resolution (``"fr"`` ↔ ``"fre"`` ↔
``"french"``) should resolve their string through a ``LanguageRegistry``
once and pass the resulting ``Language``.
"""
if track_lang is None:
return False
if isinstance(query, Language):
return query.matches(track_lang)
if isinstance(query, str):
return track_lang.lower().strip() == query.lower().strip()
return False
-25
@@ -1,25 +0,0 @@
"""SubtitleTrack — a single embedded subtitle stream as reported by ffprobe.
This is the **container-view** representation (ffprobe output) used uniformly
across the project to describe a subtitle stream embedded in a media file.
Not to be confused with ``alfred.domain.subtitles.entities.SubtitleCandidate``
which models a subtitle being **scanned/matched** (with confidence, raw tokens,
file path, etc.). The two coexist by design — they describe the same real-world
concept seen from two different bounded contexts.
"""
from __future__ import annotations
from dataclasses import dataclass
@dataclass(frozen=True)
class SubtitleTrack:
"""A single embedded subtitle track as reported by ffprobe."""
index: int
codec: str | None # subrip, ass, hdmv_pgs_subtitle, …
language: str | None # ISO 639-2: fre, eng, und, …
is_default: bool = False
is_forced: bool = False
@@ -1,77 +0,0 @@
"""Mixin shared by entities that carry audio + subtitle tracks.
Both ``Movie`` and ``Episode`` carry a ``list[AudioTrack]`` plus a
``list[SubtitleTrack]`` and answer the same 5 queries about them (language
presence, unique languages, forced flag). Keep that behavior in one place so a
fix in one is a fix in both.
The mixin is plain Python (no dataclass machinery) so it composes cleanly with
``@dataclass`` entities — it only reads ``self.audio_tracks`` and
``self.subtitle_tracks`` which the host class provides as fields.
"""
from __future__ import annotations
from typing import TYPE_CHECKING
from ..value_objects import Language
from .matching import track_lang_matches
if TYPE_CHECKING:
from .audio import AudioTrack
from .subtitle import SubtitleTrack
class MediaWithTracks:
"""
Mixin providing audio/subtitle helpers for entities with track collections.
Hosts must expose two attributes:
* ``audio_tracks: list[AudioTrack]``
* ``subtitle_tracks: list[SubtitleTrack]``
The helpers follow the "C+" matching contract: pass a :class:`Language`
for cross-format matching, or a ``str`` for case-insensitive comparison.
"""
# These attributes are provided by the host entity (Movie, Episode, …).
# Declared here only for type-checkers and to make the contract explicit.
audio_tracks: list["AudioTrack"]
subtitle_tracks: list["SubtitleTrack"]
# ── Audio helpers ──────────────────────────────────────────────────────
def has_audio_in(self, lang: str | Language) -> bool:
"""True if at least one audio track is in the given language."""
return any(track_lang_matches(t.language, lang) for t in self.audio_tracks)
def audio_languages(self) -> list[str]:
"""Unique audio languages across all tracks, in track order."""
seen: set[str] = set()
result: list[str] = []
for t in self.audio_tracks:
if t.language and t.language not in seen:
seen.add(t.language)
result.append(t.language)
return result
# ── Subtitle helpers ───────────────────────────────────────────────────
def has_subtitles_in(self, lang: str | Language) -> bool:
"""True if at least one subtitle track is in the given language."""
return any(track_lang_matches(t.language, lang) for t in self.subtitle_tracks)
def has_forced_subs(self) -> bool:
"""True if at least one subtitle track is flagged as forced."""
return any(t.is_forced for t in self.subtitle_tracks)
def subtitle_languages(self) -> list[str]:
"""Unique subtitle languages across all tracks, in track order."""
seen: set[str] = set()
result: list[str] = []
for t in self.subtitle_tracks:
if t.language and t.language not in seen:
seen.add(t.language)
result.append(t.language)
return result
-62
View File
@@ -1,62 +0,0 @@
"""VideoTrack — a single video stream as reported by ffprobe."""
from __future__ import annotations
from dataclasses import dataclass
@dataclass(frozen=True)
class VideoTrack:
"""A single video track as reported by ffprobe.
A media file typically has one video track but can have several (alt
camera angles, attached thumbnail images reported as still-image streams,
etc.), hence the list[VideoTrack] on MediaInfo.
"""
index: int
codec: str | None # h264, hevc, av1, …
width: int | None
height: int | None
is_default: bool = False
@property
def resolution(self) -> str | None:
"""
Best-effort resolution string: 2160p, 1080p, 720p, …
Width takes priority over height to handle widescreen/cinema crops
(e.g. 1920×960 scope → 1080p, not 720p). Falls back to height when
width is unavailable.
"""
match (self.width, self.height):
case (None, None):
return None
case (w, h) if w is not None:
match True:
case _ if w >= 3840:
return "2160p"
case _ if w >= 1920:
return "1080p"
case _ if w >= 1280:
return "720p"
case _ if w >= 720:
return "576p"
case _ if w >= 640:
return "480p"
case _:
return f"{h}p" if h else f"{w}w"
case (None, h):
match True:
case _ if h >= 2160:
return "2160p"
case _ if h >= 1080:
return "1080p"
case _ if h >= 720:
return "720p"
case _ if h >= 576:
return "576p"
case _ if h >= 480:
return "480p"
case _:
return f"{h}p"
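The width-first rule in `VideoTrack.resolution` is easiest to see on a cropped frame. The sketch below restates the same thresholds as a standalone function; `resolution_label` is a hypothetical name used only for this illustration, and the loop form is a rewrite of the nested `match` blocks, not the project's code.

```python
from __future__ import annotations

# (minimum value, label) pairs, checked from largest to smallest.
_WIDTH_BUCKETS = ((3840, "2160p"), (1920, "1080p"), (1280, "720p"), (720, "576p"), (640, "480p"))
_HEIGHT_BUCKETS = ((2160, "2160p"), (1080, "1080p"), (720, "720p"), (576, "576p"), (480, "480p"))


def resolution_label(width: int | None, height: int | None) -> str | None:
    """Width-first bucketing with the same thresholds as VideoTrack.resolution."""
    if width is None and height is None:
        return None
    if width is not None:
        for floor, label in _WIDTH_BUCKETS:
            if width >= floor:
                return label
        # Below every bucket: report the raw height, or the width if height is missing.
        return f"{height}p" if height else f"{width}w"
    # Width unknown: fall back to height buckets (height is not None here).
    for floor, label in _HEIGHT_BUCKETS:
        if height >= floor:
            return label
    return f"{height}p"


print(resolution_label(1920, 960))   # 1080p (scope crop, classified by width)
print(resolution_label(None, 960))   # 720p  (height-only fallback)
print(resolution_label(320, None))   # 320w
print(resolution_label(None, None))  # None
```

The first two calls show why width takes priority: the same 960-line frame would be labelled 720p if only its height were considered.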
+2
@@ -7,11 +7,13 @@ Protocol without going through real I/O.
"""
from .filesystem_scanner import FileEntry, FilesystemScanner
from .language_repository import LanguageRepository
from .media_prober import MediaProber, SubtitleStreamInfo
__all__ = [
"FileEntry",
"FilesystemScanner",
"LanguageRepository",
"MediaProber",
"SubtitleStreamInfo",
]
@@ -0,0 +1,36 @@
"""LanguageRepository port — abstracts canonical language lookup.
The adapter (typically loading from ISO 639 YAML knowledge) maps a wide
range of raw forms (codes, English/native names, aliases) onto the
canonical :class:`Language` value object. Domain code accepts the port
via constructor injection; tests can pass a small in-memory fake.
"""
from __future__ import annotations
from typing import Protocol
from alfred.domain.shared.value_objects import Language
class LanguageRepository(Protocol):
"""Canonical language lookup."""
def from_iso(self, code: str) -> Language | None:
"""Look up by canonical ISO 639-2/B code (case-insensitive)."""
...
def from_any(self, raw: str) -> Language | None:
"""Look up by any known representation: ISO code, name, alias.
Case-insensitive. Returns ``None`` when the raw form is unknown.
"""
...
def all(self) -> list[Language]:
"""Return all known languages, in a stable order."""
...
def __contains__(self, raw: str) -> bool: ...
def __len__(self) -> int: ...
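The module docstring mentions that tests can pass a small in-memory fake. One way such a fake might look is sketched below. `InMemoryLanguageRepository` is a hypothetical name, and the `Language` class is a reduced stand-in for the real value object; only the protocol's method names come from the diff.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class Language:
    """Hypothetical stand-in for alfred.domain.shared.value_objects.Language."""

    iso: str
    english_name: str
    aliases: tuple[str, ...] = ()


class LanguageRepository(Protocol):
    def from_iso(self, code: str) -> Language | None: ...
    def from_any(self, raw: str) -> Language | None: ...
    def all(self) -> list[Language]: ...
    def __contains__(self, raw: str) -> bool: ...
    def __len__(self) -> int: ...


class InMemoryLanguageRepository:
    """Test fake: satisfies the protocol structurally, no YAML loading involved."""

    def __init__(self, languages: list[Language]) -> None:
        self._languages = list(languages)
        self._by_iso = {lang.iso: lang for lang in self._languages}
        self._by_any: dict[str, Language] = {}
        for lang in self._languages:
            # Index every known representation; the first registration wins.
            for key in (lang.iso, lang.english_name.lower(), *lang.aliases):
                self._by_any.setdefault(key, lang)

    def from_iso(self, code: str) -> Language | None:
        return self._by_iso.get(code.lower())

    def from_any(self, raw: str) -> Language | None:
        return self._by_any.get(raw.lower().strip())

    def all(self) -> list[Language]:
        return list(self._languages)

    def __contains__(self, raw: str) -> bool:
        return self.from_any(raw) is not None

    def __len__(self) -> int:
        return len(self._languages)


repo: LanguageRepository = InMemoryLanguageRepository(
    [Language("fre", "French", ("fr", "fra")), Language("eng", "English", ("en",))]
)
print(repo.from_any("FR").iso)   # fre
print(repo.from_iso("xyz"))      # None
print("english" in repo)         # True
print(len(repo))                 # 2
```

Since `LanguageRepository` is a `Protocol`, the fake does not need to inherit from it; matching method signatures are enough for a type-checker.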
+14 -1
@@ -9,7 +9,10 @@ from __future__ import annotations
from dataclasses import dataclass
from pathlib import Path
from typing import Protocol
from typing import TYPE_CHECKING, Protocol
if TYPE_CHECKING:
from alfred.domain.shared.media import MediaInfo
@dataclass(frozen=True)
@@ -37,3 +40,13 @@ class MediaProber(Protocol):
no subtitle streams. Adapters must not raise.
"""
...
def probe(self, video: Path) -> MediaInfo | None:
"""Return the full :class:`MediaInfo` for ``video``, or ``None``.
Covers all stream families (video, audio, subtitle) plus
file-level duration / bitrate. ``None`` signals that ffprobe is
unavailable or the file can't be read — adapters must not
raise.
"""
...
+47 -23
@@ -1,5 +1,7 @@
"""Shared value objects used across multiple domains."""
from __future__ import annotations
import re
from dataclasses import dataclass
from pathlib import Path
@@ -43,29 +45,21 @@ class ImdbId:
@dataclass(frozen=True)
class FilePath:
"""
Value object representing a file path with validation.
Value object representing a file path.
Ensures the path is valid and optionally checks existence.
Accepts either ``str`` or :class:`pathlib.Path` at construction;
the value is normalized to ``Path`` in ``__post_init__``.
"""
value: Path
def __init__(self, path: str | Path):
"""
Initialize FilePath.
Args:
path: String or Path object representing the file path
"""
if isinstance(path, str):
path_obj = Path(path)
elif isinstance(path, Path):
path_obj = path
else:
raise ValidationError(f"Path must be str or Path, got {type(path)}")
# Use object.__setattr__ because dataclass is frozen
object.__setattr__(self, "value", path_obj)
def __post_init__(self) -> None:
if isinstance(self.value, Path):
return
if isinstance(self.value, str):
object.__setattr__(self, "value", Path(self.value))
return
raise ValidationError(f"Path must be str or Path, got {type(self.value)}")
def __str__(self) -> str:
return str(self.value)
@@ -150,19 +144,49 @@ class Language:
raise ValidationError(
f"Language.iso must be a 3-letter ISO 639-2/B code, got {self.iso!r}"
)
# Normalize iso to lowercase
object.__setattr__(self, "iso", self.iso.lower())
# Normalize aliases to a tuple of lowercase strings (dedup, preserve order)
if self.iso != self.iso.lower():
raise ValidationError(
f"Language.iso must be lowercase, got {self.iso!r}; "
f"use Language.from_raw() to construct from arbitrary input"
)
for alias in self.aliases:
if not isinstance(alias, str) or alias != alias.lower().strip() or not alias:
raise ValidationError(
f"Language.aliases must be lowercase non-empty strings, "
f"got {alias!r} — use Language.from_raw() to normalize"
)
@classmethod
def from_raw(
cls,
iso: str,
english_name: str,
native_name: str,
aliases: tuple[str, ...] | list[str] = (),
) -> Language:
"""
Construct a Language from arbitrary (possibly un-normalized) input.
Use this factory when loading from external sources (YAML, user input,
third-party APIs) — it lowercases the iso code and normalizes/dedups
the alias tuple. The direct constructor is strict and rejects
un-normalized input.
"""
seen: set[str] = set()
normalized: list[str] = []
for alias in self.aliases:
for alias in aliases:
if not isinstance(alias, str):
continue
a = alias.lower().strip()
if a and a not in seen:
seen.add(a)
normalized.append(a)
object.__setattr__(self, "aliases", tuple(normalized))
return cls(
iso=iso.lower(),
english_name=english_name,
native_name=native_name,
aliases=tuple(normalized),
)
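The split between a strict constructor and a tolerant `from_raw` factory can be sketched in isolation. The class below is a reduced, hypothetical version of `Language` (no name fields, no ISO length check) and `ValidationError` is a stand-in for the project's exception; the normalization loop follows the one in the diff.

```python
from __future__ import annotations

from dataclasses import dataclass


class ValidationError(ValueError):
    """Hypothetical stand-in for the project's ValidationError."""


@dataclass(frozen=True)
class Language:
    iso: str
    aliases: tuple[str, ...] = ()

    def __post_init__(self) -> None:
        # Strict: reject anything that is not already normalized.
        if self.iso != self.iso.lower():
            raise ValidationError(f"iso must be lowercase, got {self.iso!r}")
        for alias in self.aliases:
            if not isinstance(alias, str) or not alias or alias != alias.lower().strip():
                raise ValidationError(f"alias must be lowercase and non-empty, got {alias!r}")

    @classmethod
    def from_raw(cls, iso: str, aliases: tuple[str, ...] | list[str] = ()) -> Language:
        # Tolerant: lowercase, strip, drop empties and duplicates, keep order.
        seen: set[str] = set()
        normalized: list[str] = []
        for alias in aliases:
            if not isinstance(alias, str):
                continue
            a = alias.lower().strip()
            if a and a not in seen:
                seen.add(a)
                normalized.append(a)
        return cls(iso=iso.lower(), aliases=tuple(normalized))


lang = Language.from_raw("FRE", ["FR", " fr ", "", "French"])
print(lang)  # Language(iso='fre', aliases=('fr', 'french'))

try:
    Language(iso="FRE")
except ValidationError as exc:
    print(f"rejected: {exc}")
```

Keeping `__post_init__` free of mutation means a frozen instance never needs `object.__setattr__` for normalization; all coercion happens before the constructor is called.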
def matches(self, raw: str) -> bool:
"""
+4 -2
@@ -1,11 +1,12 @@
"""Subtitles domain — subtitle identification, classification and placement."""
from .aggregates import SubtitleRuleSet
from .entities import MediaSubtitleMetadata, SubtitleCandidate
from .entities import MediaSubtitleMetadata, SubtitleScanResult
from .exceptions import SubtitleNotFound
from .services import PatternDetector, SubtitleIdentifier, SubtitleMatcher
from .value_objects import (
RuleScope,
RuleScopeLevel,
ScanStrategy,
SubtitleFormat,
SubtitleLanguage,
@@ -16,7 +17,7 @@ from .value_objects import (
)
__all__ = [
"SubtitleCandidate",
"SubtitleScanResult",
"MediaSubtitleMetadata",
"SubtitleRuleSet",
"SubtitleIdentifier",
@@ -30,5 +31,6 @@ __all__ = [
"TypeDetectionMethod",
"SubtitleMatchingRules",
"RuleScope",
"RuleScopeLevel",
"SubtitleNotFound",
]
+6 -3
@@ -4,7 +4,7 @@ from dataclasses import dataclass, field
from typing import Any
from ..shared.value_objects import ImdbId
from .value_objects import RuleScope, SubtitleMatchingRules
from .value_objects import RuleScope, RuleScopeLevel, SubtitleMatchingRules
@dataclass
@@ -86,10 +86,13 @@ class SubtitleRuleSet:
if self._min_confidence is not None:
delta["min_confidence"] = self._min_confidence
return {
"scope": {"level": self.scope.level, "identifier": self.scope.identifier},
"scope": {
"level": self.scope.level.value,
"identifier": self.scope.identifier,
},
"override": delta,
}
@classmethod
def global_default(cls) -> SubtitleRuleSet:
return cls(scope=RuleScope(level="global"))
return cls(scope=RuleScope(level=RuleScopeLevel.GLOBAL))
+14 -12
@@ -12,16 +12,18 @@ from .value_objects import (
@dataclass
class SubtitleCandidate:
class SubtitleScanResult:
"""
A subtitle being scanned and matched — either an external file or an embedded stream.
A subtitle observed during a scan — either an external file or an embedded stream.
Unlike ``alfred.domain.shared.media.SubtitleTrack`` (the pure container-view
populated from ffprobe), a SubtitleCandidate carries the **flow state** of the
subtitle matching pipeline: language/format are typed value objects that may
be ``None`` while classification is in progress, ``confidence`` reflects how
certain we are, and ``raw_tokens`` holds the filename fragments still under
analysis. State evolves: unknown → resolved after user clarification.
populated from ffprobe), a ``SubtitleScanResult`` carries the **flow state**
of the subtitle matching pipeline: language/format are typed value objects
that may be ``None`` while classification is in progress, ``confidence``
reflects how certain we are, and ``raw_tokens`` holds the filename fragments
still under analysis. State evolves: unknown → resolved after user
clarification. The name reflects this — it's the **output of a scan pass**,
not a value object.
"""
# Classification (may be None if not yet resolved)
@@ -72,7 +74,7 @@ class SubtitleCandidate:
if self.is_embedded
else str(self.file_path.name if self.file_path else "?")
)
return f"SubtitleCandidate({lang}, {self.subtitle_type.value}, {fmt}, src={src}, conf={self.confidence:.2f})"
return f"SubtitleScanResult({lang}, {self.subtitle_type.value}, {fmt}, src={src}, conf={self.confidence:.2f})"
@dataclass
@@ -84,14 +86,14 @@ class MediaSubtitleMetadata:
media_id: ImdbId | None
media_type: str # "movie" | "tv_show"
embedded_tracks: list[SubtitleCandidate] = field(default_factory=list)
external_tracks: list[SubtitleCandidate] = field(default_factory=list)
embedded_tracks: list[SubtitleScanResult] = field(default_factory=list)
external_tracks: list[SubtitleScanResult] = field(default_factory=list)
release_group: str | None = None
detected_pattern_id: str | None = None # pattern id from knowledge base
pattern_confirmed: bool = False
@property
def all_tracks(self) -> list[SubtitleCandidate]:
def all_tracks(self) -> list[SubtitleScanResult]:
return self.embedded_tracks + self.external_tracks
@property
@@ -99,5 +101,5 @@ class MediaSubtitleMetadata:
return len(self.embedded_tracks) + len(self.external_tracks)
@property
def unresolved_tracks(self) -> list[SubtitleCandidate]:
def unresolved_tracks(self) -> list[SubtitleScanResult]:
return [t for t in self.external_tracks if t.language is None]
+11 -11
@@ -7,7 +7,7 @@ from pathlib import Path
from ...shared.ports import FilesystemScanner, MediaProber
from ..ports import SubtitleKnowledge
from ...shared.value_objects import ImdbId
from ..entities import MediaSubtitleMetadata, SubtitleCandidate
from ..entities import MediaSubtitleMetadata, SubtitleScanResult
from ..value_objects import ScanStrategy, SubtitlePattern, SubtitleType
logger = logging.getLogger(__name__)
@@ -94,7 +94,7 @@ class SubtitleIdentifier:
# Embedded tracks — via MediaProber
# ------------------------------------------------------------------
def _scan_embedded(self, video_path: Path) -> list[SubtitleCandidate]:
def _scan_embedded(self, video_path: Path) -> list[SubtitleScanResult]:
streams = self.prober.list_subtitle_streams(video_path)
tracks = []
@@ -111,7 +111,7 @@ class SubtitleIdentifier:
stype = SubtitleType.STANDARD
tracks.append(
SubtitleCandidate(
SubtitleScanResult(
language=lang,
format=None,
subtitle_type=stype,
@@ -131,7 +131,7 @@ class SubtitleIdentifier:
def _scan_external(
self, video_path: Path, pattern: SubtitlePattern
) -> list[SubtitleCandidate]:
) -> list[SubtitleScanResult]:
strategy = pattern.scan_strategy
episode_stem: str | None = None
@@ -200,7 +200,7 @@ class SubtitleIdentifier:
entries: list,
pattern: SubtitlePattern,
episode_stem: str | None = None,
) -> list[SubtitleCandidate]:
) -> list[SubtitleScanResult]:
tracks = [
self._classify_single(entry, episode_stem=episode_stem) for entry in entries
]
@@ -214,7 +214,7 @@ class SubtitleIdentifier:
def _classify_single(
self, entry, episode_stem: str | None = None
) -> SubtitleCandidate:
) -> SubtitleScanResult:
fmt = self.kb.format_for_extension(entry.suffix)
tokens = (
_tokenize_suffix(entry.stem, episode_stem)
@@ -253,7 +253,7 @@ class SubtitleIdentifier:
if entry.suffix.lower() == ".srt":
entry_count = _count_entries(self.scanner.read_text(entry.path))
return SubtitleCandidate(
return SubtitleScanResult(
language=language,
format=fmt,
subtitle_type=subtitle_type,
@@ -266,8 +266,8 @@ class SubtitleIdentifier:
)
def _disambiguate_by_size(
self, tracks: list[SubtitleCandidate]
) -> list[SubtitleCandidate]:
self, tracks: list[SubtitleScanResult]
) -> list[SubtitleScanResult]:
"""
When multiple tracks share the same language and type is UNKNOWN/STANDARD,
the one with the most entries (lines) is SDH, the smallest is FORCED if
@@ -277,7 +277,7 @@ class SubtitleIdentifier:
"""
# Group by language code
lang_groups: dict[str, list[SubtitleCandidate]] = {}
lang_groups: dict[str, list[SubtitleScanResult]] = {}
for track in tracks:
key = track.language.code if track.language else "__unknown__"
lang_groups.setdefault(key, []).append(track)
@@ -306,6 +306,6 @@ class SubtitleIdentifier:
return result
def _set_type(self, track: SubtitleCandidate, stype: SubtitleType) -> None:
def _set_type(self, track: SubtitleScanResult, stype: SubtitleType) -> None:
"""Mutate track type in-place."""
track.subtitle_type = stype
+12 -12
@@ -2,7 +2,7 @@
import logging
from ..entities import SubtitleCandidate
from ..entities import SubtitleScanResult
from ..value_objects import SubtitleMatchingRules
logger = logging.getLogger(__name__)
@@ -10,7 +10,7 @@ logger = logging.getLogger(__name__)
class SubtitleMatcher:
"""
Filters a list of SubtitleCandidate against effective SubtitleMatchingRules.
Filters a list of SubtitleScanResult against effective SubtitleMatchingRules.
Returns matched tracks (pass all filters, confidence >= min_confidence)
and unresolved tracks (need user clarification).
@@ -21,14 +21,14 @@ class SubtitleMatcher:
def match(
self,
tracks: list[SubtitleCandidate],
tracks: list[SubtitleScanResult],
rules: SubtitleMatchingRules,
) -> tuple[list[SubtitleCandidate], list[SubtitleCandidate]]:
) -> tuple[list[SubtitleScanResult], list[SubtitleScanResult]]:
"""
Returns (matched, unresolved).
"""
matched: list[SubtitleCandidate] = []
unresolved: list[SubtitleCandidate] = []
matched: list[SubtitleScanResult] = []
unresolved: list[SubtitleScanResult] = []
for track in tracks:
if track.is_embedded:
@@ -51,7 +51,7 @@ class SubtitleMatcher:
return matched, unresolved
def _passes_filters(
self, track: SubtitleCandidate, rules: SubtitleMatchingRules
self, track: SubtitleScanResult, rules: SubtitleMatchingRules
) -> bool:
# Language filter
if rules.preferred_languages:
@@ -76,14 +76,14 @@ class SubtitleMatcher:
def _resolve_conflicts(
self,
tracks: list[SubtitleCandidate],
tracks: list[SubtitleScanResult],
rules: SubtitleMatchingRules,
) -> list[SubtitleCandidate]:
) -> list[SubtitleScanResult]:
"""
When multiple tracks have same language + type, keep only the best one
according to format_priority. If no format_priority applies, keep the first.
"""
seen: dict[tuple, SubtitleCandidate] = {}
seen: dict[tuple, SubtitleScanResult] = {}
for track in tracks:
lang = track.language.code if track.language else None
@@ -106,8 +106,8 @@ class SubtitleMatcher:
def _prefer(
self,
candidate: SubtitleCandidate,
existing: SubtitleCandidate,
candidate: SubtitleScanResult,
existing: SubtitleScanResult,
format_priority: list[str],
) -> bool:
"""Return True if candidate is preferable to existing."""
+3 -3
@@ -1,9 +1,9 @@
"""Subtitle service utilities."""
from ..entities import SubtitleCandidate
from ..entities import SubtitleScanResult
def available_subtitles(tracks: list[SubtitleCandidate]) -> list[SubtitleCandidate]:
def available_subtitles(tracks: list[SubtitleScanResult]) -> list[SubtitleScanResult]:
"""
Return the distinct subtitle tracks available, deduped by (language, type).
@@ -11,7 +11,7 @@ def available_subtitles(tracks: list[SubtitleCandidate]) -> list[SubtitleCandida
preferences — e.g. eng, eng.sdh, fra all show up as separate entries.
"""
seen: set[tuple] = set()
result: list[SubtitleCandidate] = []
result: list[SubtitleScanResult] = []
for track in tracks:
lang = track.language.code if track.language else None
key = (lang, track.subtitle_type)
+12 -1
@@ -83,9 +83,20 @@ class SubtitleMatchingRules:
min_confidence: float = 0.7
class RuleScopeLevel(str, Enum):
"""At which level a subtitle rule set applies."""
GLOBAL = "global"
RELEASE_GROUP = "release_group"
MOVIE = "movie"
SHOW = "show"
SEASON = "season"
EPISODE = "episode"
@dataclass(frozen=True)
class RuleScope:
"""At which level a rule set applies."""
level: str # "global" | "release_group" | "movie" | "show" | "season" | "episode"
level: RuleScopeLevel
identifier: str | None = None # imdb_id, group name, "S01", "S01E03"…
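A minimal sketch of what the `str`-mixin enum buys over the former free-form `level: str` field. The two classes are re-declared from the hunk above so the snippet runs on its own; the identifier value and the deliberate typo are illustrative only.

```python
from __future__ import annotations

from dataclasses import dataclass
from enum import Enum


class RuleScopeLevel(str, Enum):
    """Re-declared from the hunk above so the sketch is self-contained."""

    GLOBAL = "global"
    RELEASE_GROUP = "release_group"
    MOVIE = "movie"
    SHOW = "show"
    SEASON = "season"
    EPISODE = "episode"


@dataclass(frozen=True)
class RuleScope:
    level: RuleScopeLevel
    identifier: str | None = None


# Values read from YAML arrive as plain strings; calling the enum with the
# raw value validates it and returns the matching member.
scope = RuleScope(level=RuleScopeLevel("release_group"), identifier="KONTRAST")

# Thanks to the str mixin, a member still compares equal to its raw string,
# so a call site written as `scope.level == "release_group"` keeps working.
legacy_compare = scope.level == "release_group"

# A typo is rejected at construction time instead of silently propagating.
try:
    RuleScopeLevel("relase_group")  # deliberate typo
    typo_rejected = False
except ValueError:
    typo_rejected = True
```

The design trade-off is that validation moves to the boundary where strings enter the domain, while comparisons against raw strings elsewhere stay valid.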
+13 -6
@@ -47,16 +47,19 @@ from .value_objects import (
# ════════════════════════════════════════════════════════════════════════════
@dataclass(eq=False)
@dataclass(frozen=True, eq=False)
class Episode(MediaWithTracks):
"""
A single episode of a TV show — leaf of the TVShow aggregate.
Carries the file metadata (path, size) and the discovered tracks
(audio + subtitle). Track lists are populated by the ffprobe + subtitle
(audio + subtitle). Track tuples are populated by the ffprobe + subtitle
scan pipeline; they may be empty when the episode is known but not yet
scanned, or when no file is downloaded yet.
Frozen: rebuild via ``dataclasses.replace`` to project enrichment results
onto a new instance.
Equality is identity-based within the aggregate: two ``Episode`` instances
are equal iff they share the same ``(season_number, episode_number)``,
regardless of title/file/track contents. The root TVShow guarantees
@@ -68,17 +71,21 @@ class Episode(MediaWithTracks):
title: str
file_path: FilePath | None = None
file_size: FileSize | None = None
audio_tracks: list[AudioTrack] = field(default_factory=list)
subtitle_tracks: list[SubtitleTrack] = field(default_factory=list)
audio_tracks: tuple[AudioTrack, ...] = field(default_factory=tuple)
subtitle_tracks: tuple[SubtitleTrack, ...] = field(default_factory=tuple)
def __post_init__(self) -> None:
# Coerce numbers if raw ints were passed
if not isinstance(self.season_number, SeasonNumber):
if isinstance(self.season_number, int):
self.season_number = SeasonNumber(self.season_number)
object.__setattr__(
self, "season_number", SeasonNumber(self.season_number)
)
if not isinstance(self.episode_number, EpisodeNumber):
if isinstance(self.episode_number, int):
self.episode_number = EpisodeNumber(self.episode_number)
object.__setattr__(
self, "episode_number", EpisodeNumber(self.episode_number)
)
def __eq__(self, other: object) -> bool:
if not isinstance(other, Episode):
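The frozen-dataclass pattern used in this hunk can be sketched in isolation. `SeasonNumber` and `EpisodeSketch` below are hypothetical stand-ins, not the real domain classes; only the `object.__setattr__` coercion and the `dataclasses.replace` rebuild mirror what the diff describes.

```python
from __future__ import annotations

from dataclasses import FrozenInstanceError, dataclass, field, replace


@dataclass(frozen=True)
class SeasonNumber:
    """Hypothetical stand-in for the real value object."""

    value: int


@dataclass(frozen=True, eq=False)
class EpisodeSketch:
    """Hypothetical, trimmed-down analogue of the frozen Episode."""

    season_number: SeasonNumber | int
    title: str
    audio_tracks: tuple[str, ...] = field(default_factory=tuple)

    def __post_init__(self) -> None:
        # A plain `self.season_number = ...` would raise FrozenInstanceError,
        # so coercion has to go through object.__setattr__.
        if isinstance(self.season_number, int):
            object.__setattr__(
                self, "season_number", SeasonNumber(self.season_number)
            )


ep = EpisodeSketch(season_number=1, title="Pilot")

# Enrichment results are projected onto a new instance; `ep` is untouched.
enriched = replace(ep, audio_tracks=("eac3",))

try:
    ep.title = "Renamed"
    mutation_blocked = False
except FrozenInstanceError:
    mutation_blocked = True
```

Using tuples for the track collections matters here: a frozen dataclass only blocks attribute rebinding, so a `list` field could still be mutated in place.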
-121
@@ -1,121 +0,0 @@
"""ffprobe — infrastructure adapter for extracting MediaInfo from a video file."""
from __future__ import annotations
import json
import logging
import subprocess
from pathlib import Path
from alfred.domain.shared.media import AudioTrack, MediaInfo, SubtitleTrack, VideoTrack
logger = logging.getLogger(__name__)
_FFPROBE_CMD = [
"ffprobe",
"-v",
"quiet",
"-print_format",
"json",
"-show_streams",
"-show_format",
]
def probe(path: Path) -> MediaInfo | None:
"""
Run ffprobe on path and return a MediaInfo.
Returns None if ffprobe is not available or the file cannot be probed.
"""
try:
result = subprocess.run(
[*_FFPROBE_CMD, str(path)],
capture_output=True,
text=True,
timeout=30,
check=False,
)
except subprocess.TimeoutExpired:
logger.warning("ffprobe timed out on %s", path)
return None
if result.returncode != 0:
logger.warning("ffprobe failed on %s: %s", path, result.stderr.strip())
return None
try:
data = json.loads(result.stdout)
except json.JSONDecodeError:
logger.warning("ffprobe returned invalid JSON for %s", path)
return None
return _parse(data)
def _parse(data: dict) -> MediaInfo:
streams = data.get("streams", [])
fmt = data.get("format", {})
# File-level duration/bitrate (ffprobe ``format`` block — independent of streams)
duration_seconds: float | None = None
bitrate_kbps: int | None = None
if "duration" in fmt:
try:
duration_seconds = float(fmt["duration"])
except ValueError:
pass
if "bit_rate" in fmt:
try:
bitrate_kbps = int(fmt["bit_rate"]) // 1000
except ValueError:
pass
video_tracks: list[VideoTrack] = []
audio_tracks: list[AudioTrack] = []
subtitle_tracks: list[SubtitleTrack] = []
for stream in streams:
codec_type = stream.get("codec_type")
if codec_type == "video":
video_tracks.append(
VideoTrack(
index=stream.get("index", len(video_tracks)),
codec=stream.get("codec_name"),
width=stream.get("width"),
height=stream.get("height"),
is_default=stream.get("disposition", {}).get("default", 0) == 1,
)
)
elif codec_type == "audio":
audio_tracks.append(
AudioTrack(
index=stream.get("index", len(audio_tracks)),
codec=stream.get("codec_name"),
channels=stream.get("channels"),
channel_layout=stream.get("channel_layout"),
language=stream.get("tags", {}).get("language"),
is_default=stream.get("disposition", {}).get("default", 0) == 1,
)
)
elif codec_type == "subtitle":
subtitle_tracks.append(
SubtitleTrack(
index=stream.get("index", len(subtitle_tracks)),
codec=stream.get("codec_name"),
language=stream.get("tags", {}).get("language"),
is_default=stream.get("disposition", {}).get("default", 0) == 1,
is_forced=stream.get("disposition", {}).get("forced", 0) == 1,
)
)
return MediaInfo(
video_tracks=tuple(video_tracks),
audio_tracks=tuple(audio_tracks),
subtitle_tracks=tuple(subtitle_tracks),
duration_seconds=duration_seconds,
bitrate_kbps=bitrate_kbps,
)
@@ -4,10 +4,10 @@ from __future__ import annotations
from pathlib import Path
from alfred.domain.release.value_objects import _VIDEO_EXTENSIONS
from alfred.domain.release.ports import ReleaseKnowledge
def find_video_file(path: Path) -> Path | None:
def find_video_file(path: Path, kb: ReleaseKnowledge) -> Path | None:
"""
Return the first video file found at path.
@@ -15,11 +15,12 @@ def find_video_file(path: Path) -> Path | None:
- If path is a folder — scan recursively, return the first video found
(sorted by name for determinism, picks S01E01 before S01E02 etc.).
"""
video_exts = kb.video_extensions
if path.is_file():
return path if path.suffix.lower() in _VIDEO_EXTENSIONS else None
return path if path.suffix.lower() in video_exts else None
for candidate in sorted(path.rglob("*")):
if candidate.is_file() and candidate.suffix.lower() in _VIDEO_EXTENSIONS:
if candidate.is_file() and candidate.suffix.lower() in video_exts:
return candidate
return None
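A small usage sketch of the new signature. The function body is copied from the post-change side of this hunk so the snippet runs alone; `stub_kb` is a hypothetical stand-in that provides only the `video_extensions` attribute, and the assumption that extensions are stored with a leading dot follows from the comparison against `Path.suffix`.

```python
from __future__ import annotations

import tempfile
from pathlib import Path
from types import SimpleNamespace


def find_video_file(path: Path, kb) -> Path | None:
    """Copy of the post-change function, repeated so the sketch runs alone."""
    video_exts = kb.video_extensions
    if path.is_file():
        return path if path.suffix.lower() in video_exts else None
    for candidate in sorted(path.rglob("*")):
        if candidate.is_file() and candidate.suffix.lower() in video_exts:
            return candidate
    return None


# Hypothetical stub: only the attribute the function reads is provided.
stub_kb = SimpleNamespace(video_extensions={".mkv", ".mp4"})

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "Show.S01E02.mkv").touch()
    (root / "Show.S01E01.MKV").touch()
    (root / "notes.nfo").touch()

    # Folder input: first video in sorted order, suffix matched case-insensitively.
    found = find_video_file(root, stub_kb)
    found_name = found.name if found else None

    # File input with a non-video suffix is rejected.
    nfo_result = find_video_file(root / "notes.nfo", stub_kb)
```

Passing the knowledge object in, rather than importing a module-level constant, lets a test swap the extension set without patching.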
@@ -87,7 +87,7 @@ class LanguageRegistry:
merged = _merge_language_entries(builtin, learned)
for iso, entry in merged.items():
language = Language(
language = Language.from_raw(
iso=iso,
english_name=entry.get("english_name", iso),
native_name=entry.get("native_name", iso),
@@ -16,9 +16,11 @@ import alfred as _alfred_pkg
_BUILTIN_ROOT = Path(_alfred_pkg.__file__).parent / "knowledge" / "release"
_SITES_ROOT = _BUILTIN_ROOT / "sites"
_GROUPS_ROOT = _BUILTIN_ROOT / "release_groups"
_LEARNED_ROOT = (
Path(_alfred_pkg.__file__).parent.parent / "data" / "knowledge" / "release"
)
_LEARNED_GROUPS_ROOT = _LEARNED_ROOT / "release_groups"
def _merge(base: dict, overlay: dict) -> dict:
@@ -62,6 +64,15 @@ def load_sources() -> set[str]:
return set(_load("sources.yaml").get("sources", []))
def load_distributors() -> set[str]:
"""Streaming distributor tokens (NF, AMZN, DSNP, …).
Distinct from ``load_sources()`` — distributors are uppercase scene
tags identifying the platform, not the capture origin.
"""
return {t.upper() for t in _load("distributors.yaml").get("distributors", [])}
def load_codecs() -> set[str]:
return set(_load("codecs.yaml").get("codecs", []))
@@ -128,6 +139,88 @@ def load_media_type_tokens() -> dict:
return _load_sites().get("media_type_tokens", {})
def load_group_schemas() -> dict:
"""Load every release-group schema YAML keyed by uppercase group name.
Builtin schemas in ``alfred/knowledge/release/release_groups/`` are
merged with user-learned schemas in
``data/knowledge/release/release_groups/`` (the learned ones win on
name collision).
"""
result: dict = {}
for root in (_GROUPS_ROOT, _LEARNED_GROUPS_ROOT):
if not root.is_dir():
continue
for path in sorted(root.glob("*.yaml")):
data = _read(path)
name = data.get("name")
if not name:
continue
result[name.upper()] = data
return result
def load_scoring() -> dict:
"""Load the parse-scoring config.
Returns a dict with three top-level keys: ``weights``, ``penalties``,
``thresholds``. Defaults are baked in so a missing or partial YAML
never breaks the parser — only de-tunes it.
"""
raw = _load("scoring.yaml")
weights = {
"title": 30,
"media_type": 20,
"year": 15,
"season": 10,
"episode": 5,
"resolution": 5,
"source": 5,
"codec": 5,
"group": 5,
}
weights.update(raw.get("weights", {}) or {})
penalties = {"unknown_token": 5, "max_unknown_penalty": 30}
penalties.update(raw.get("penalties", {}) or {})
thresholds = {"shitty_min": 60}
thresholds.update(raw.get("thresholds", {}) or {})
return {
"weights": weights,
"penalties": penalties,
"thresholds": thresholds,
}
def load_probe_mappings() -> dict:
"""Load ffprobe→scene-token translation tables.
Returns a dict with three keys:
- ``video_codec``: ``{ffprobe_codec_lower: scene_token}``
- ``audio_codec``: ``{ffprobe_codec_lower: scene_token}``
- ``audio_channels``: ``{channel_count_int: layout_str}``
Channel-count keys are normalized to ``int`` here so the consumer can
look up ``track.channels`` directly. Missing sections fall back to
empty dicts — the enrichment code degrades to its uppercase-fallback
path when a mapping is absent.
"""
raw = _load("probe_mappings.yaml")
video_codec = {k.lower(): v for k, v in (raw.get("video_codec") or {}).items()}
audio_codec = {k.lower(): v for k, v in (raw.get("audio_codec") or {}).items()}
audio_channels: dict[int, str] = {}
for k, v in (raw.get("audio_channels") or {}).items():
try:
audio_channels[int(k)] = v
except (TypeError, ValueError):
continue
return {
"video_codec": video_codec,
"audio_codec": audio_codec,
"audio_channels": audio_channels,
}
def load_separators() -> list[str]:
"""Single-char token separators used by the release name tokenizer.
@@ -0,0 +1,127 @@
"""YamlReleaseKnowledge — concrete adapter for the ``ReleaseKnowledge``
domain port.
Loads every release-knowledge YAML once at construction time and exposes
the parsed snapshots as plain attributes. The application layer builds a
single instance at boot and passes it down to ``parse_release`` and to
``ParsedRelease`` builder methods.
A few extras (``video_extensions``, ``non_video_extensions``,
``subtitle_extensions``, ``metadata_extensions``) are not part of the
domain port — they are consumed by application/infra modules that handle
filesystem-level concerns.
"""
from __future__ import annotations
from alfred.domain.release.parser.schema import GroupSchema, SchemaChunk
from alfred.domain.release.parser.tokens import TokenRole
from .release import (
load_audio,
load_codecs,
load_distributors,
load_editions,
load_forbidden_chars,
load_group_schemas,
load_hdr_extra,
load_language_tokens,
load_media_type_tokens,
load_metadata_extensions,
load_non_video_extensions,
load_probe_mappings,
load_resolutions,
load_scoring,
load_separators,
load_sources,
load_sources_extra,
load_subtitle_extensions,
load_video,
load_video_extensions,
load_win_forbidden_chars,
)
def _build_group_schema(data: dict) -> GroupSchema:
"""Translate a raw YAML schema dict into a frozen :class:`GroupSchema`.
Unknown roles raise ``ValueError`` early so a typo in a YAML file
surfaces at construction time, not on first parse.
"""
chunks = tuple(
SchemaChunk(
role=TokenRole(entry["role"]),
optional=bool(entry.get("optional", False)),
)
for entry in data.get("chunk_order", [])
)
return GroupSchema(
name=data["name"],
separator=data.get("separator", "."),
chunks=chunks,
)
class YamlReleaseKnowledge:
"""Single object holding every parsed-release knowledge constant.
Built once at application boot. Read-only at runtime — call sites
treat it as a snapshot. To pick up newly learned tokens without a
restart, build a fresh instance and swap it in at the call sites.
"""
def __init__(self) -> None:
# Domain-port surface
self.resolutions: set[str] = load_resolutions()
self.sources: set[str] = load_sources() | load_sources_extra()
self.codecs: set[str] = load_codecs()
self.distributors: set[str] = load_distributors()
self.language_tokens: set[str] = load_language_tokens()
self.forbidden_chars: set[str] = load_forbidden_chars()
self.hdr_extra: set[str] = load_hdr_extra()
self.audio: dict = load_audio()
self.video_meta: dict = load_video()
self.editions: dict = load_editions()
self.media_type_tokens: dict = load_media_type_tokens()
self.separators: list[str] = load_separators()
# Parse-scoring config (weights / penalties / thresholds).
self.scoring: dict = load_scoring()
# ffprobe → scene-token mapping tables (consumed by
# ``application.release.enrich_from_probe``).
self.probe_mappings: dict = load_probe_mappings()
# File-extension sets (used by application/infra modules, not by
# the parser itself — kept here so there is a single ownership
# point for release knowledge).
self.video_extensions: set[str] = load_video_extensions()
self.non_video_extensions: set[str] = load_non_video_extensions()
self.subtitle_extensions: set[str] = load_subtitle_extensions()
# Metadata + subtitle extensions are both ignored when deciding
# the media type of a folder (neither is a conclusive signal for
# movie/tv/other), so we expose the union under the historical
# name.
self.metadata_extensions: set[str] = (
load_metadata_extensions() | self.subtitle_extensions
)
# Translation table for stripping Windows-forbidden chars.
self._win_forbidden_table = str.maketrans(
"", "", "".join(load_win_forbidden_chars())
)
# Group schemas, keyed by uppercase group name for fast lookup.
self._group_schemas: dict[str, GroupSchema] = {
key: _build_group_schema(data)
for key, data in load_group_schemas().items()
}
def sanitize_for_fs(self, text: str) -> str:
"""Strip Windows-forbidden characters from ``text``."""
return text.translate(self._win_forbidden_table)
def group_schema(self, name: str) -> GroupSchema | None:
return self._group_schemas.get(name.upper())
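The deletion table behind `sanitize_for_fs` relies on the three-argument form of `str.maketrans`, sketched below. The character list is an assumption standing in for whatever `load_win_forbidden_chars()` returns; the sample title is illustrative.

```python
# Hypothetical character list; the real one comes from load_win_forbidden_chars().
win_forbidden_chars = ["<", ">", ":", '"', "/", "\\", "|", "?", "*"]

# With three arguments, the third string lists characters to delete outright.
table = str.maketrans("", "", "".join(win_forbidden_chars))


def sanitize_for_fs(text: str) -> str:
    """Strip every character present in the deletion table."""
    return text.translate(table)


cleaned = sanitize_for_fs("What If...? Vol: 2 <Extended>")
```

Building the table once in `__init__` keeps each later call to a single `str.translate` pass.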
@@ -2,7 +2,7 @@
import logging
from alfred.infrastructure.knowledge.language_registry import LanguageRegistry
from alfred.domain.shared.ports import LanguageRepository
from alfred.domain.subtitles.value_objects import (
ScanStrategy,
SubtitleFormat,
@@ -12,6 +12,8 @@ from alfred.domain.subtitles.value_objects import (
SubtitleType,
TypeDetectionMethod,
)
from alfred.infrastructure.knowledge.language_registry import LanguageRegistry
from .loader import KnowledgeLoader
logger = logging.getLogger(__name__)
@@ -28,10 +30,12 @@ class SubtitleKnowledgeBase:
def __init__(
self,
loader: KnowledgeLoader | None = None,
language_registry: LanguageRegistry | None = None,
language_registry: LanguageRepository | None = None,
):
self._loader = loader or KnowledgeLoader()
self._language_registry = language_registry or LanguageRegistry()
self._language_registry: LanguageRepository = (
language_registry or LanguageRegistry()
)
self._build()
def _build(self) -> None: # noqa: PLR0912 — straight-line YAML projection
@@ -7,12 +7,23 @@ import logging
import subprocess
from pathlib import Path
from alfred.domain.shared.media import AudioTrack, MediaInfo, SubtitleTrack, VideoTrack
from alfred.domain.shared.ports import SubtitleStreamInfo
logger = logging.getLogger(__name__)
_FFPROBE_TIMEOUT_SECONDS = 30
_FFPROBE_FULL_CMD = [
"ffprobe",
"-v",
"quiet",
"-print_format",
"json",
"-show_streams",
"-show_format",
]
class FfprobeMediaProber:
"""Inspect media files by shelling out to ``ffprobe``.
@@ -63,3 +74,101 @@ class FfprobeMediaProber:
)
)
return streams
def probe(self, video: Path) -> MediaInfo | None:
"""Run ffprobe on ``video`` and return a :class:`MediaInfo`.
Returns ``None`` when ffprobe is not available, times out, or
the file cannot be parsed. Never raises.
"""
try:
result = subprocess.run(
[*_FFPROBE_FULL_CMD, str(video)],
capture_output=True,
text=True,
timeout=_FFPROBE_TIMEOUT_SECONDS,
check=False,
)
except (subprocess.TimeoutExpired, FileNotFoundError) as e:
logger.warning("ffprobe failed on %s: %s", video, e)
return None
if result.returncode != 0:
logger.warning("ffprobe failed on %s: %s", video, result.stderr.strip())
return None
try:
data = json.loads(result.stdout)
except json.JSONDecodeError:
logger.warning("ffprobe returned invalid JSON for %s", video)
return None
return _parse_media_info(data)
def _parse_media_info(data: dict) -> MediaInfo:
"""Translate raw ffprobe JSON into a :class:`MediaInfo` snapshot."""
streams = data.get("streams", [])
fmt = data.get("format", {})
duration_seconds: float | None = None
bitrate_kbps: int | None = None
if "duration" in fmt:
try:
duration_seconds = float(fmt["duration"])
except ValueError:
pass
if "bit_rate" in fmt:
try:
bitrate_kbps = int(fmt["bit_rate"]) // 1000
except ValueError:
pass
video_tracks: list[VideoTrack] = []
audio_tracks: list[AudioTrack] = []
subtitle_tracks: list[SubtitleTrack] = []
for stream in streams:
codec_type = stream.get("codec_type")
if codec_type == "video":
video_tracks.append(
VideoTrack(
index=stream.get("index", len(video_tracks)),
codec=stream.get("codec_name"),
width=stream.get("width"),
height=stream.get("height"),
is_default=stream.get("disposition", {}).get("default", 0) == 1,
)
)
elif codec_type == "audio":
audio_tracks.append(
AudioTrack(
index=stream.get("index", len(audio_tracks)),
codec=stream.get("codec_name"),
channels=stream.get("channels"),
channel_layout=stream.get("channel_layout"),
language=stream.get("tags", {}).get("language"),
is_default=stream.get("disposition", {}).get("default", 0) == 1,
)
)
elif codec_type == "subtitle":
subtitle_tracks.append(
SubtitleTrack(
index=stream.get("index", len(subtitle_tracks)),
codec=stream.get("codec_name"),
language=stream.get("tags", {}).get("language"),
is_default=stream.get("disposition", {}).get("default", 0) == 1,
is_forced=stream.get("disposition", {}).get("forced", 0) == 1,
)
)
return MediaInfo(
video_tracks=tuple(video_tracks),
audio_tracks=tuple(audio_tracks),
subtitle_tracks=tuple(subtitle_tracks),
duration_seconds=duration_seconds,
bitrate_kbps=bitrate_kbps,
)
@@ -13,7 +13,7 @@ from datetime import UTC, datetime
from pathlib import Path
from typing import Any
from alfred.domain.subtitles.entities import SubtitleCandidate
from alfred.domain.subtitles.entities import SubtitleScanResult
from alfred.application.subtitles.placer import PlacedTrack
from alfred.infrastructure.metadata.store import MetadataStore
@@ -25,7 +25,7 @@ class SubtitleMetadataStore:
Subtitle-pipeline view of the per-release `.alfred/metadata.yaml`.
Backed by a generic MetadataStore; this class only knows how to build
a subtitle_history entry from PlacedTrack/SubtitleCandidate pairs.
a subtitle_history entry from PlacedTrack/SubtitleScanResult pairs.
"""
def __init__(self, library_root: Path):
@@ -45,7 +45,7 @@ class SubtitleMetadataStore:
def append_history(
self,
placed_pairs: list[tuple[PlacedTrack, SubtitleCandidate]],
placed_pairs: list[tuple[PlacedTrack, SubtitleScanResult]],
season: int | None = None,
episode: int | None = None,
release_group: str | None = None,
@@ -7,7 +7,7 @@ from typing import TYPE_CHECKING
import yaml
from alfred.domain.subtitles.aggregates import SubtitleRuleSet
from alfred.domain.subtitles.value_objects import RuleScope
from alfred.domain.subtitles.value_objects import RuleScope, RuleScopeLevel
if TYPE_CHECKING:
from alfred.infrastructure.persistence.memory.ltm.components.subtitle_preferences import (
@@ -72,7 +72,9 @@ class RuleSetRepository:
rg_data = _load_yaml(rg_path).get("override", {})
if rg_data:
rg_ruleset = SubtitleRuleSet(
scope=RuleScope(level="release_group", identifier=release_group),
scope=RuleScope(
level=RuleScopeLevel.RELEASE_GROUP, identifier=release_group
),
parent=current,
)
rg_ruleset.override(**_filter_override(rg_data))
@@ -85,7 +87,7 @@ class RuleSetRepository:
local_data = _load_yaml(self._alfred_dir / "rules.yaml").get("override", {})
if local_data:
local_ruleset = SubtitleRuleSet(
scope=RuleScope(level="show"),
scope=RuleScope(level=RuleScopeLevel.SHOW),
parent=current,
)
local_ruleset.override(**_filter_override(local_data))
@@ -0,0 +1,17 @@
# Known streaming distributor tokens (case-insensitive match).
#
# These tags identify *which platform* the release was sourced from
# (Netflix, Amazon, Disney+, …). Distinct from ``sources.yaml`` which
# captures the encoding origin (WEB-DL, BluRay, …). A typical release
# carries both: ``Show.S01E01.1080p.NF.WEB-DL.x264-GROUP`` →
# source=WEB-DL, distributor=NF.
distributors:
- NF # Netflix
- AMZN # Amazon Prime Video
- DSNP # Disney+
- HMAX # HBO Max
- ATVP # Apple TV+
- HULU # Hulu
- PCOK # Peacock
- PMTP # Paramount+
- CR # Crunchyroll
@@ -0,0 +1,45 @@
# Translation table — ffprobe output → scene-style release tokens.
#
# Consumed by ``alfred.application.release.enrich_from_probe`` when filling
# missing ParsedRelease fields from a probed MediaInfo. Token-level values
# from the release name always win; these mappings only fire when the
# corresponding ParsedRelease field is None.
#
# Lookup is case-insensitive on the key side (ffprobe sometimes emits
# uppercase, sometimes lowercase). When no key matches, the fallback is
# ``ffprobe_value.upper()`` so unknown codecs still surface in a
# predictable form (and signal the gap to a future "learn" pass).
#
# Each section is a flat dict — values are the canonical scene tokens
# Alfred uses everywhere (filename builders, ParsedRelease fields).
# ffprobe video codec name → scene codec token
video_codec:
hevc: x265
h264: x264
h265: x265
av1: AV1
vp9: VP9
mpeg4: XviD
# ffprobe audio codec name → scene audio token
audio_codec:
eac3: EAC3
ac3: AC3
dts: DTS
truehd: TrueHD
aac: AAC
flac: FLAC
opus: OPUS
mp3: MP3
pcm_s16le: PCM
pcm_s24le: PCM
# Channel count (integer) → standard layout string.
# Keys are strings here because YAML mappings prefer string keys; the
# loader normalizes them back to int.
audio_channels:
"8": "7.1"
"6": "5.1"
"2": "2.0"
"1": "1.0"
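A sketch of how these tables might be consulted, assuming the lookup rules stated in the header comment (case-insensitive keys, uppercase fallback) and the int normalization described for channel counts. `scene_video_codec` and `channel_layout` are hypothetical helper names, not functions from the codebase.

```python
from __future__ import annotations

# A few entries from the YAML above, as the parsed dict would look after loading.
raw = {
    "video_codec": {"hevc": "x265", "h264": "x264"},
    "audio_channels": {"8": "7.1", "6": "5.1", "2": "2.0"},
}

# Same normalization the loader applies: lowercase codec keys, int channel keys.
video_codec = {k.lower(): v for k, v in raw["video_codec"].items()}
audio_channels = {int(k): v for k, v in raw["audio_channels"].items()}


def scene_video_codec(ffprobe_codec: str) -> str:
    """Hypothetical helper: mapped scene token, else the uppercase fallback."""
    return video_codec.get(ffprobe_codec.lower(), ffprobe_codec.upper())


def channel_layout(channels: int) -> str | None:
    """Hypothetical helper: layout string for a channel count, if known."""
    return audio_channels.get(channels)


mapped = scene_video_codec("HEVC")
fallback = scene_video_codec("prores")
layout = channel_layout(6)
unknown_layout = channel_layout(3)
```

The uppercase fallback keeps an unmapped codec visible in the output instead of dropping it, which is what makes the gap noticeable later.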
@@ -0,0 +1,22 @@
# ELiTE release naming schema.
#
# Examples seen in the wild:
# Foundation.S02.1080p.x265-ELiTE (TV season pack, no source)
#
# ELiTE often omits the source token entirely on TV releases (no WEBRip /
# BluRay), going straight from resolution to codec.
name: ELiTE
separator: "."
chunk_order:
- role: title
- role: year
optional: true
- role: season_episode
optional: true
- role: resolution
- role: source
optional: true # often absent on TV
- role: codec
- role: group
@@ -0,0 +1,28 @@
# KONTRAST release naming schema.
#
# Examples seen in the wild:
# Back.in.Action.2025.1080p.WEBRip.x265-KONTRAST (movie)
# The.Long.Walk.2025.1080p.WEBRip.x265-KONTRAST (movie)
# Slow.Horses.S05E01.1080p.WEBRip.x265-KONTRAST (TV episode)
# Slow.Horses.S05.1080p.WEBRip.x265-KONTRAST (TV season pack)
#
# Schema is a left-to-right description of the canonical chunk order.
# Each entry is a role (matching TokenRole). Optional chunks are marked
# with `optional: true`. The parser consumes tokens greedily by role,
# skipping over optional chunks that don't match.
name: KONTRAST
separator: "."
# Canonical order of structural + technical chunks (left to right).
# `title` is special-cased as "everything up to the first non-title role".
chunk_order:
- role: title
- role: year
optional: true # absent on TV releases (S01E01 instead)
- role: season_episode
optional: true # absent on movies
- role: resolution # always present (1080p, 2160p, …)
- role: source # always present (WEBRip, BluRay, …)
- role: codec # always present (x265, x264, …)
- role: group # everything after the final `-`
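To make the "consume by role, skip optional chunks" description concrete, here is a deliberately simplified sketch. It is not the project's parser: the regex recognizers, the `match_schema` name, and the way the title boundary is found are all assumptions chosen to fit the four example names listed above.

```python
from __future__ import annotations

import re

# (role, optional) pairs mirroring chunk_order in the YAML above.
CHUNK_ORDER = [
    ("title", False),
    ("year", True),
    ("season_episode", True),
    ("resolution", False),
    ("source", False),
    ("codec", False),
    ("group", False),
]

# Hypothetical recognizers; the real parser uses its knowledge base instead.
RECOGNIZERS = {
    "year": re.compile(r"(19|20)\d{2}"),
    "season_episode": re.compile(r"S\d{2}(E\d{2})?", re.IGNORECASE),
    "resolution": re.compile(r"\d{3,4}p", re.IGNORECASE),
    "source": re.compile(r"WEBRip|WEB-DL|BluRay", re.IGNORECASE),
    "codec": re.compile(r"x26[45]", re.IGNORECASE),
}


def match_schema(name: str) -> dict[str, str] | None:
    """Return role -> token if the name fits the schema, else None."""
    body, _, group = name.rpartition("-")  # group: everything after the final "-"
    if not body or not group:
        return None
    tokens = body.split(".")

    # Title: everything up to the first token recognized as a non-title role.
    first = next(
        (i for i, tok in enumerate(tokens)
         if any(rx.fullmatch(tok) for rx in RECOGNIZERS.values())),
        len(tokens),
    )
    if first == 0:
        return None
    result = {"title": " ".join(tokens[:first]), "group": group}

    pos = first
    for role, optional in CHUNK_ORDER[1:-1]:
        if pos < len(tokens) and RECOGNIZERS[role].fullmatch(tokens[pos]):
            result[role] = tokens[pos]
            pos += 1
        elif not optional:
            return None  # a mandatory chunk is missing
    return result if pos == len(tokens) else None


movie = match_schema("Back.in.Action.2025.1080p.WEBRip.x265-KONTRAST")
episode = match_schema("Slow.Horses.S05E01.1080p.WEBRip.x265-KONTRAST")
no_source = match_schema("Foundation.S02.1080p.x265-KONTRAST")
```

The last call returns `None` because `source` is mandatory in this schema, which is the distinction the ELiTE schema draws by marking `source` optional.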
@@ -0,0 +1,20 @@
# RARBG release naming schema.
#
# RARBG follows the canonical scene convention closely:
# Title.Year.Resolution.Source.Codec-RARBG
# For TV:
# Title.S01E01.Resolution.Source.Codec-RARBG
name: RARBG
separator: "."
chunk_order:
- role: title
- role: year
optional: true
- role: season_episode
optional: true
- role: resolution
- role: source
- role: codec
- role: group
+42
@@ -0,0 +1,42 @@
# Release parse scoring.
#
# `parse_release` returns a `ParseReport` alongside the `ParsedRelease`.
# The report carries a 0-100 confidence score computed from the annotated
# tokens, plus the road decision (EASY / SHITTY / PATH_OF_PAIN).
#
# Why YAML: the weights and the SHITTY/PoP cutoff are tuning knobs we
# expect to iterate on as fixtures grow. Keeping them in code would
# mean a commit per tweak; here the user can adjust without touching
# Python.
#
# Weights are awarded when the corresponding ParsedRelease field is
# populated (non-None, non-"UNKNOWN" for group). Season and episode
# only contribute when the parse looks like TV (season is not None).
weights:
title: 30 # structural pivot — without it nothing else matters
media_type: 20 # movie / tv_show / tv_complete / …
year: 15
season: 10 # only counted for TV-shaped releases
episode: 5
resolution: 5
source: 5
codec: 5
group: 5 # "UNKNOWN" yields 0
# Penalty applied per UNKNOWN token left in the annotated stream.
# Capped at `max_unknown_penalty` to keep a long-tail of garbage from
# pushing every release into PoP.
penalties:
unknown_token: 5
max_unknown_penalty: 30
# Decision thresholds.
#
# EASY is decided structurally (a known group schema matched) — it does
# not look at the score. SHITTY vs PATH_OF_PAIN is decided here:
#
# score >= shitty_min → SHITTY (best-effort parse usable)
# score < shitty_min → PATH_OF_PAIN (needs user / LLM help)
thresholds:
shitty_min: 60
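The arithmetic these comments describe can be sketched as follows. The weights, penalties, and threshold are the values from this file; the function names, the clamp to the 0-100 range, and the shape of the `fields` dict are assumptions for illustration, not the actual `parse_release` internals.

```python
from __future__ import annotations

WEIGHTS = {
    "title": 30, "media_type": 20, "year": 15, "season": 10, "episode": 5,
    "resolution": 5, "source": 5, "codec": 5, "group": 5,
}
PENALTIES = {"unknown_token": 5, "max_unknown_penalty": 30}
SHITTY_MIN = 60


def confidence(fields: dict[str, object], unknown_tokens: int) -> int:
    """Sum the weights of populated fields, then subtract the capped penalty."""
    is_tv = fields.get("season") is not None
    total = 0
    for name, weight in WEIGHTS.items():
        value = fields.get(name)
        if value is None:
            continue
        if name == "group" and value == "UNKNOWN":
            continue  # "UNKNOWN" yields 0
        if name in ("season", "episode") and not is_tv:
            continue  # only counted for TV-shaped releases
        total += weight
    penalty = min(
        unknown_tokens * PENALTIES["unknown_token"],
        PENALTIES["max_unknown_penalty"],
    )
    return max(0, min(100, total - penalty))  # clamp is an assumption


def road(score: int, schema_matched: bool) -> str:
    """EASY is structural; otherwise the score picks SHITTY or PATH_OF_PAIN."""
    if schema_matched:
        return "EASY"
    return "SHITTY" if score >= SHITTY_MIN else "PATH_OF_PAIN"


movie_fields = {
    "title": "Back in Action", "media_type": "movie", "year": 2025,
    "resolution": "1080p", "source": "WEBRip", "codec": "x265",
    "group": "UNKNOWN",
}
movie_score = confidence(movie_fields, unknown_tokens=2)   # 80 - 10
messy_score = confidence({"title": "x", "media_type": "movie"}, unknown_tokens=8)
```

With these numbers the movie lands at 70 (SHITTY when no schema matched), while the messy case hits the penalty cap and falls to 20 (PATH_OF_PAIN).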
+1
@@ -21,3 +21,4 @@ separators:
- "(" # parenthesis-embedded (year, edition): (2020) (Director's Cut)
- ")"
- "_" # underscore-as-space (old usenet, some Asian releases)
- "｜" # fullwidth vertical bar U+FF5C (CJK release names, occasional decorative use)
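A sketch of why a plain-string separator list handles U+FF5C without code changes. The `tokenize` function is a hypothetical character-by-character splitter, not the pipeline's implementation, and the separator list is only partly taken from this file (the entries before `(` are assumed).

```python
from __future__ import annotations

# Partly assumed list; "\uff5c" is the fullwidth vertical bar added here.
SEPARATORS = [".", " ", "[", "]", "(", ")", "_", "\uff5c"]


def tokenize(name: str, separators: list[str]) -> list[str]:
    """Split on any single-character separator, dropping empty tokens."""
    tokens: list[str] = []
    current = ""
    for ch in name:
        if ch in separators:
            if current:
                tokens.append(current)
                current = ""
        else:
            current += ch
    if current:
        tokens.append(current)
    return tokens


tokens = tokenize("進撃の巨人\uff5cS01E01_1080p.WEB", SEPARATORS)
```

Python strings are sequences of code points, so a separator that takes three bytes in UTF-8 is still one character in the membership test.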
+6 -6
@@ -1,4 +1,9 @@
# Known release source tokens (case-insensitive match)
# Known release source tokens (case-insensitive match).
#
# "Source" here means the capture/encoding origin (disc, broadcast, web
# stream) — NOT the streaming distributor (Netflix, Disney+, …). Those
# live in ``distributors.yaml`` because they're a separate dimension:
# a release is typically "WEB-DL from NF" — both should be captured.
sources:
- bluray
- blu-ray
@@ -14,8 +19,3 @@ sources:
- dvdrip
- dvd
- vodrip
- amzn
- nf
- dsnp
- hmax
- atvp
-15
@@ -37,12 +37,6 @@ class Settings(BaseSettings):
llm_temperature: float = 0.2
data_storage_dir: str = "data"
# --- MEDIA ---
# Minimum file size to consider a video file as a real movie (in bytes).
# 100 MB is generous enough to skip sample clips / trailers without rejecting
# legitimate low-bitrate releases (e.g. older anime, certain web rips).
min_movie_size_bytes: int = 100 * 1024 * 1024
# --- BUILD ---
alfred_version: str | None = None
@@ -90,15 +84,6 @@ class Settings(BaseSettings):
)
return v
@field_validator("min_movie_size_bytes")
@classmethod
def validate_min_movie_size(cls, v: int) -> int:
if v < 0:
raise ConfigurationError(
f"min_movie_size_bytes must be non-negative, got {v}"
)
return v
@field_validator("request_timeout")
@classmethod
def validate_timeout(cls, v: int) -> int:
+20 -4
@@ -88,13 +88,13 @@ def analyze(release_name: str, source_path: str | None = None) -> None:
if not path.exists():
print(" (path does not exist, probe skipped)")
else:
from alfred.infrastructure.filesystem.ffprobe import probe
from alfred.infrastructure.filesystem.find_video import find_video_file
from alfred.infrastructure.probe import FfprobeMediaProber
video = find_video_file(path) if path.is_dir() else path
if video:
print(f" video file: {video.name}")
info = probe(video)
info = FfprobeMediaProber().probe(video)
if info:
print(f" codec: {info.video_codec}")
print(f" resolution: {info.resolution}")
@@ -124,8 +124,16 @@ def dry_run(release_name: str) -> None:
from alfred.application.filesystem.resolve_destination import (
resolve_season_destination,
)
from alfred.infrastructure.knowledge.release_kb import YamlReleaseKnowledge
from alfred.infrastructure.probe import FfprobeMediaProber
result = resolve_season_destination(release_name, tmdb_title, tmdb_year)
result = resolve_season_destination(
release_name,
tmdb_title,
tmdb_year,
YamlReleaseKnowledge(),
FfprobeMediaProber(),
)
d = result.to_dict()
print()
print(json.dumps(d, indent=2, ensure_ascii=False))
@@ -203,8 +211,16 @@ def do_move(release_name: str, source_folder: str | None = None) -> None:
from alfred.application.filesystem.resolve_destination import (
resolve_season_destination,
)
from alfred.infrastructure.knowledge.release_kb import YamlReleaseKnowledge
from alfred.infrastructure.probe import FfprobeMediaProber
result = resolve_season_destination(release_name, tmdb_title, tmdb_year)
result = resolve_season_destination(
release_name,
tmdb_title,
tmdb_year,
YamlReleaseKnowledge(),
FfprobeMediaProber(),
)
d = result.to_dict()
if d["status"] == "needs_clarification":
+2 -2
@@ -98,9 +98,9 @@ def main() -> None:
print(c(f"Error: {path} does not exist", RED), file=sys.stderr)
sys.exit(1)
from alfred.infrastructure.filesystem.ffprobe import probe
from alfred.infrastructure.probe import FfprobeMediaProber
info = probe(path)
info = FfprobeMediaProber().probe(path)
if info is None:
print(c("Error: ffprobe failed to probe the file", RED), file=sys.stderr)
sys.exit(1)
+14 -7
@@ -100,11 +100,18 @@ def main() -> None:
print(c(f"Error: {downloads} does not exist", RED), file=sys.stderr)
sys.exit(1)
from alfred.application.filesystem.detect_media_type import detect_media_type
from alfred.application.filesystem.enrich_from_probe import enrich_from_probe
from dataclasses import replace
from alfred.application.release.detect_media_type import detect_media_type
from alfred.application.release.enrich_from_probe import enrich_from_probe
from alfred.domain.release.services import parse_release
from alfred.infrastructure.filesystem.ffprobe import probe
from alfred.domain.release.value_objects import MediaTypeToken
from alfred.infrastructure.filesystem.find_video import find_video_file
from alfred.infrastructure.knowledge.release_kb import YamlReleaseKnowledge
from alfred.infrastructure.probe import FfprobeMediaProber
_kb = YamlReleaseKnowledge()
_prober = FfprobeMediaProber()
entries = sorted(downloads.iterdir(), key=lambda p: p.name.lower())
total = len(entries)
@@ -121,14 +128,14 @@ def main() -> None:
name = entry.name
try:
p = parse_release(name)
p.media_type = detect_media_type(p, entry)
p, _report = parse_release(name, _kb)
p = replace(p, media_type=MediaTypeToken(detect_media_type(p, entry, _kb)))
if p.media_type not in ("unknown", "other"):
video_file = find_video_file(entry)
if video_file:
media_info = probe(video_file)
media_info = _prober.probe(video_file)
if media_info:
enrich_from_probe(p, media_info)
p = enrich_from_probe(p, media_info, _kb)
warnings = _assess(p)
except Exception as e:
warnings = [f"parse error: {e}"]
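The switch from in-place mutation to `p = replace(p, ...)` follows the usual frozen-dataclass pattern. Below is a minimal sketch of that pattern with "fill only `None` fields" semantics; `Release` and `fill_missing` are hypothetical stand-ins for `ParsedRelease` and `enrich_from_probe`, whose real signatures take more arguments.

```python
from __future__ import annotations

from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Release:
    """Hypothetical stand-in for a frozen ParsedRelease."""

    title: str
    quality: str | None = None
    codec: str | None = None


def fill_missing(release: Release, **probed: str | None) -> Release:
    """Return a copy with probed values filled in where the field is None.

    Values already present on ``release`` always win; the input object is
    never modified because the dataclass is frozen.
    """
    updates = {
        name: value
        for name, value in probed.items()
        if value is not None and getattr(release, name) is None
    }
    return replace(release, **updates) if updates else release


parsed = Release(title="X", quality="2160p")
enriched = fill_missing(parsed, quality="1080p", codec="x265")
print(enriched)
```

Callers have to rebind the result, which is why the loop above now reads `p = enrich_from_probe(...)` instead of calling it for its side effect.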
+27 -21
@@ -1,4 +1,4 @@
"""Tests for ``alfred.application.filesystem.detect_media_type``.
"""Tests for ``alfred.application.release.detect_media_type``.
The function refines a ``ParsedRelease.media_type`` using filesystem evidence.
@@ -18,18 +18,24 @@ from pathlib import Path
import pytest
from alfred.application.filesystem.detect_media_type import detect_media_type
from alfred.application.release.detect_media_type import detect_media_type
from alfred.domain.release.services import parse_release
from alfred.infrastructure.knowledge.release_kb import YamlReleaseKnowledge
_KB = YamlReleaseKnowledge()
def _parsed(media_type: str = "movie"):
"""Build a ParsedRelease with the requested media_type via the real parser."""
if media_type == "tv_show":
return parse_release("Show.S01E01.1080p-GRP")
parsed, _ = parse_release("Show.S01E01.1080p-GRP", _KB)
return parsed
if media_type == "movie":
return parse_release("Movie.2020.1080p-GRP")
parsed, _ = parse_release("Movie.2020.1080p-GRP", _KB)
return parsed
# "unknown" / other — feed a name the parser can't classify
return parse_release("randomthing")
parsed, _ = parse_release("randomthing", _KB)
return parsed
# --------------------------------------------------------------------------- #
@@ -41,30 +47,30 @@ class TestFile:
def test_video_file_preserves_parsed_type(self, tmp_path: Path):
f = tmp_path / "x.mkv"
f.write_bytes(b"")
assert detect_media_type(_parsed("movie"), f) == "movie"
assert detect_media_type(_parsed("movie"), f, _KB) == "movie"
def test_video_file_preserves_tv_type(self, tmp_path: Path):
f = tmp_path / "ep.mp4"
f.write_bytes(b"")
assert detect_media_type(_parsed("tv_show"), f) == "tv_show"
assert detect_media_type(_parsed("tv_show"), f, _KB) == "tv_show"
def test_non_video_file_returns_other(self, tmp_path: Path):
f = tmp_path / "x.iso"
f.write_bytes(b"")
assert detect_media_type(_parsed("movie"), f) == "other"
assert detect_media_type(_parsed("movie"), f, _KB) == "other"
@pytest.mark.parametrize("ext", [".rar", ".zip", ".7z", ".exe", ".dmg"])
def test_various_non_video_extensions(self, tmp_path: Path, ext):
f = tmp_path / f"x{ext}"
f.write_bytes(b"")
assert detect_media_type(_parsed("movie"), f) == "other"
assert detect_media_type(_parsed("movie"), f, _KB) == "other"
def test_metadata_only_file_keeps_parsed_type(self, tmp_path: Path):
# Metadata extension is stripped from conclusive set — no video, no
# non-video → falls through to parsed.media_type.
f = tmp_path / "x.nfo"
f.write_bytes(b"")
assert detect_media_type(_parsed("movie"), f) == "movie"
assert detect_media_type(_parsed("movie"), f, _KB) == "movie"
# --------------------------------------------------------------------------- #
@@ -75,27 +81,27 @@ class TestFile:
class TestFolder:
def test_folder_with_video_keeps_parsed_type(self, tmp_path: Path):
(tmp_path / "main.mkv").write_bytes(b"")
assert detect_media_type(_parsed("movie"), tmp_path) == "movie"
assert detect_media_type(_parsed("movie"), tmp_path, _KB) == "movie"
def test_folder_only_non_video_returns_other(self, tmp_path: Path):
(tmp_path / "disc.iso").write_bytes(b"")
(tmp_path / "part.rar").write_bytes(b"")
assert detect_media_type(_parsed("movie"), tmp_path) == "other"
assert detect_media_type(_parsed("movie"), tmp_path, _KB) == "other"
def test_folder_mixed_returns_unknown(self, tmp_path: Path):
(tmp_path / "main.mkv").write_bytes(b"")
(tmp_path / "extras.iso").write_bytes(b"")
assert detect_media_type(_parsed("movie"), tmp_path) == "unknown"
assert detect_media_type(_parsed("movie"), tmp_path, _KB) == "unknown"
def test_empty_folder_keeps_parsed_type(self, tmp_path: Path):
assert detect_media_type(_parsed("tv_show"), tmp_path) == "tv_show"
assert detect_media_type(_parsed("tv_show"), tmp_path, _KB) == "tv_show"
def test_folder_only_metadata_keeps_parsed_type(self, tmp_path: Path):
(tmp_path / "info.nfo").write_bytes(b"")
(tmp_path / "cover.jpg").write_bytes(b"")
(tmp_path / "subs.srt").write_bytes(b"")
# All metadata → conclusive set empty → falls through.
assert detect_media_type(_parsed("movie"), tmp_path) == "movie"
assert detect_media_type(_parsed("movie"), tmp_path, _KB) == "movie"
# --------------------------------------------------------------------------- #
@@ -109,18 +115,18 @@ class TestMetadataIgnored:
(tmp_path / "info.nfo").write_bytes(b"")
(tmp_path / "cover.jpg").write_bytes(b"")
(tmp_path / "subs.srt").write_bytes(b"")
assert detect_media_type(_parsed("movie"), tmp_path) == "movie"
assert detect_media_type(_parsed("movie"), tmp_path, _KB) == "movie"
def test_non_video_plus_metadata_still_other(self, tmp_path: Path):
(tmp_path / "disc.iso").write_bytes(b"")
(tmp_path / "info.nfo").write_bytes(b"")
assert detect_media_type(_parsed("movie"), tmp_path) == "other"
assert detect_media_type(_parsed("movie"), tmp_path, _KB) == "other"
def test_case_insensitive_extensions(self, tmp_path: Path):
# Suffix is lowercased before classification.
f = tmp_path / "X.MKV"
f.write_bytes(b"")
assert detect_media_type(_parsed("movie"), f) == "movie"
assert detect_media_type(_parsed("movie"), f, _KB) == "movie"
# --------------------------------------------------------------------------- #
@@ -132,11 +138,11 @@ class TestMissing:
def test_nonexistent_path_keeps_parsed_type(self, tmp_path: Path):
missing = tmp_path / "does_not_exist.mkv"
# Doesn't exist → empty extension set → falls through.
assert detect_media_type(_parsed("movie"), missing) == "movie"
assert detect_media_type(_parsed("movie"), missing, _KB) == "movie"
def test_nonexistent_folder_keeps_parsed_type(self, tmp_path: Path):
missing = tmp_path / "ghost"
assert detect_media_type(_parsed("tv_show"), missing) == "tv_show"
assert detect_media_type(_parsed("tv_show"), missing, _KB) == "tv_show"
def test_subfolder_not_recursed(self, tmp_path: Path):
# _collect_extensions scans only the first level — files inside
@@ -145,4 +151,4 @@ class TestMissing:
sub.mkdir()
(sub / "deep.mkv").write_bytes(b"")
# Top level has no files at all → empty → falls through to parsed type.
assert detect_media_type(_parsed("movie"), tmp_path) == "movie"
assert detect_media_type(_parsed("movie"), tmp_path, _KB) == "movie"
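The behaviour these tests pin down can be summarised in a small decision function. This is a sketch reconstructed from the assertions above, not the actual `detect_media_type` implementation; the extension sets and the names `collect_extensions` and `refine_media_type` are assumptions.

```python
from pathlib import Path

# Illustrative extension sets; the real ones come from the knowledge base.
VIDEO_EXTS = {".mkv", ".mp4", ".avi"}
METADATA_EXTS = {".nfo", ".jpg", ".srt"}


def collect_extensions(path: Path) -> set[str]:
    """Lowercased suffixes of a file, or of a folder's first level only."""
    if path.is_file():
        return {path.suffix.lower()}
    if path.is_dir():
        return {p.suffix.lower() for p in path.iterdir() if p.is_file()}
    return set()  # missing path: no evidence at all


def refine_media_type(parsed_type: str, extensions: set[str]) -> str:
    """Refine the parsed media type using filesystem evidence."""
    conclusive = extensions - METADATA_EXTS  # metadata never decides
    has_video = bool(conclusive & VIDEO_EXTS)
    has_non_video = bool(conclusive - VIDEO_EXTS)
    if has_video and has_non_video:
        return "unknown"  # mixed content: cannot decide
    if has_non_video:
        return "other"  # only ISO / archives / executables
    return parsed_type  # video only, or nothing conclusive


print(refine_media_type("movie", {".mkv", ".iso"}))
```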
+73 -31
@@ -1,8 +1,8 @@
"""Tests for ``alfred.application.filesystem.enrich_from_probe``.
"""Tests for ``alfred.application.release.enrich_from_probe``.
The function mutates a ``ParsedRelease`` in place using ffprobe ``MediaInfo``.
Token-level values from the release name always win; only ``None`` fields
are filled.
The function returns a new ``ParsedRelease`` with ``None`` fields filled
from ffprobe ``MediaInfo``. Token-level values from the release name
always win; only ``None`` fields are filled.
Coverage:
@@ -18,9 +18,12 @@ Uses real ``ParsedRelease`` / ``MediaInfo`` instances — no mocking needed.
from __future__ import annotations
from alfred.application.filesystem.enrich_from_probe import enrich_from_probe
from alfred.application.release.enrich_from_probe import enrich_from_probe
from alfred.domain.release.value_objects import ParsedRelease
from alfred.domain.shared.media import AudioTrack, MediaInfo, VideoTrack
from alfred.infrastructure.knowledge.release_kb import YamlReleaseKnowledge
_KB = YamlReleaseKnowledge()
def _info_with_video(*, width=None, height=None, codec=None, **rest) -> MediaInfo:
@@ -35,8 +38,9 @@ def _bare(**overrides) -> ParsedRelease:
"""Build a minimal ParsedRelease with all enrichable fields = None."""
defaults = dict(
raw="X",
normalised="X",
clean="X",
title="X",
title_sanitized="X",
year=None,
season=None,
episode=None,
@@ -45,7 +49,6 @@ def _bare(**overrides) -> ParsedRelease:
source=None,
codec=None,
group="UNKNOWN",
tech_string="",
)
defaults.update(overrides)
return ParsedRelease(**defaults)
@@ -59,17 +62,17 @@ def _bare(**overrides) -> ParsedRelease:
class TestQuality:
def test_fills_when_none(self):
p = _bare()
enrich_from_probe(p, _info_with_video(width=1920, height=1080))
p = enrich_from_probe(p, _info_with_video(width=1920, height=1080), _KB)
assert p.quality == "1080p"
def test_does_not_overwrite_existing(self):
p = _bare(quality="2160p")
enrich_from_probe(p, _info_with_video(width=1920, height=1080))
p = enrich_from_probe(p, _info_with_video(width=1920, height=1080), _KB)
assert p.quality == "2160p"
def test_no_dims_leaves_none(self):
p = _bare()
enrich_from_probe(p, MediaInfo())
p = enrich_from_probe(p, MediaInfo(), _KB)
assert p.quality is None
@@ -81,27 +84,27 @@ class TestQuality:
class TestVideoCodec:
def test_hevc_to_x265(self):
p = _bare()
enrich_from_probe(p, _info_with_video(codec="hevc"))
p = enrich_from_probe(p, _info_with_video(codec="hevc"), _KB)
assert p.codec == "x265"
def test_h264_to_x264(self):
p = _bare()
enrich_from_probe(p, _info_with_video(codec="h264"))
p = enrich_from_probe(p, _info_with_video(codec="h264"), _KB)
assert p.codec == "x264"
def test_unknown_codec_uppercased(self):
p = _bare()
enrich_from_probe(p, _info_with_video(codec="weird"))
p = enrich_from_probe(p, _info_with_video(codec="weird"), _KB)
assert p.codec == "WEIRD"
def test_does_not_overwrite_existing(self):
p = _bare(codec="HEVC")
enrich_from_probe(p, _info_with_video(codec="h264"))
p = enrich_from_probe(p, _info_with_video(codec="h264"), _KB)
assert p.codec == "HEVC"
def test_no_codec_leaves_none(self):
p = _bare()
enrich_from_probe(p, MediaInfo())
p = enrich_from_probe(p, MediaInfo(), _KB)
assert p.codec is None
@@ -119,7 +122,7 @@ class TestAudio:
]
)
p = _bare()
enrich_from_probe(p, info)
p = enrich_from_probe(p, info, _KB)
assert p.audio_codec == "EAC3"
assert p.audio_channels == "5.1"
@@ -131,32 +134,32 @@ class TestAudio:
]
)
p = _bare()
enrich_from_probe(p, info)
p = enrich_from_probe(p, info, _KB)
assert p.audio_codec == "AC3"
assert p.audio_channels == "5.1"
def test_channel_count_unknown_falls_back(self):
info = MediaInfo(audio_tracks=[AudioTrack(0, "aac", 4, "quad", "eng")])
p = _bare()
enrich_from_probe(p, info)
p = enrich_from_probe(p, info, _KB)
assert p.audio_channels == "4ch"
def test_unknown_audio_codec_uppercased(self):
info = MediaInfo(audio_tracks=[AudioTrack(0, "newcodec", 2, "stereo", "eng")])
p = _bare()
enrich_from_probe(p, info)
p = enrich_from_probe(p, info, _KB)
assert p.audio_codec == "NEWCODEC"
def test_no_audio_tracks(self):
p = _bare()
enrich_from_probe(p, MediaInfo())
p = enrich_from_probe(p, MediaInfo(), _KB)
assert p.audio_codec is None
assert p.audio_channels is None
def test_does_not_overwrite_existing_audio_fields(self):
info = MediaInfo(audio_tracks=[AudioTrack(0, "ac3", 6, "5.1", "eng")])
p = _bare(audio_codec="DTS-HD.MA", audio_channels="7.1")
enrich_from_probe(p, info)
p = enrich_from_probe(p, info, _KB)
assert p.audio_codec == "DTS-HD.MA"
assert p.audio_channels == "7.1"
@@ -175,8 +178,8 @@ class TestLanguages:
]
)
p = _bare()
enrich_from_probe(p, info)
assert p.languages == ["eng", "fre"]
p = enrich_from_probe(p, info, _KB)
assert p.languages == ("eng", "fre")
def test_skips_und(self):
info = MediaInfo(
@@ -186,8 +189,8 @@ class TestLanguages:
]
)
p = _bare()
enrich_from_probe(p, info)
assert p.languages == ["eng"]
p = enrich_from_probe(p, info, _KB)
assert p.languages == ("eng",)
def test_dedup_against_existing_case_insensitive(self):
# existing token-level languages are typically upper-case ("FRENCH", "ENG")
@@ -199,13 +202,52 @@ class TestLanguages:
AudioTrack(1, "aac", 2, "stereo", "fre"),
]
)
p = _bare()
p.languages = ["ENG"]
enrich_from_probe(p, info)
p = _bare(languages=("ENG",))
p = enrich_from_probe(p, info, _KB)
# "eng" → upper "ENG" already present → skipped. "fre" → "FRE" new → kept.
assert p.languages == ["ENG", "fre"]
assert p.languages == ("ENG", "fre")
def test_no_audio_tracks_leaves_languages_empty(self):
p = _bare()
enrich_from_probe(p, MediaInfo())
assert p.languages == []
p = enrich_from_probe(p, MediaInfo(), _KB)
assert p.languages == ()
# --------------------------------------------------------------------------- #
# tech_string #
# --------------------------------------------------------------------------- #
class TestTechString:
"""tech_string is a derived property on ParsedRelease: it always
reflects the current quality/source/codec. Enrichment never writes
it directly; it stays in sync by construction."""
def test_rebuilt_from_filled_quality_and_codec(self):
p = _bare()
p = enrich_from_probe(
p, _info_with_video(width=1920, height=1080, codec="hevc"), _KB
)
assert p.quality == "1080p"
assert p.codec == "x265"
assert p.tech_string == "1080p.x265"
def test_keeps_existing_source_when_enriching(self):
# Token-level source must stay; probe fills only None fields.
p = _bare(source="BluRay")
p = enrich_from_probe(
p, _info_with_video(width=1920, height=1080, codec="hevc"), _KB
)
assert p.tech_string == "1080p.BluRay.x265"
def test_unchanged_when_no_enrichable_video_info(self):
# No video info → nothing to fill → derived tech_string stays as it was.
p = _bare(quality="2160p", source="WEB-DL", codec="x265")
assert p.tech_string == "2160p.WEB-DL.x265"
p = enrich_from_probe(p, MediaInfo(), _KB)
assert p.tech_string == "2160p.WEB-DL.x265"
def test_empty_when_nothing_known(self):
p = _bare()
p = enrich_from_probe(p, MediaInfo(), _KB)
assert p.tech_string == ""
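A derived `tech_string` of the kind `TestTechString` exercises can be sketched as a read-only property on a frozen dataclass. `TechFields` is a hypothetical stand-in for `ParsedRelease`; the dot-joined quality/source/codec ordering is inferred from the expected values in the tests above.

```python
from __future__ import annotations

from dataclasses import dataclass, replace


@dataclass(frozen=True)
class TechFields:
    """Hypothetical stand-in carrying only the fields tech_string uses."""

    quality: str | None = None
    source: str | None = None
    codec: str | None = None

    @property
    def tech_string(self) -> str:
        # Computed on every access, so it cannot drift from the fields.
        parts = (self.quality, self.source, self.codec)
        return ".".join(part for part in parts if part)


bare = TechFields(source="BluRay")
enriched = replace(bare, quality="1080p", codec="x265")
print(enriched.tech_string)
```

Since the value is never stored, enrichment only has to fill `quality` and `codec`; there is no second field to keep in sync.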
+356
@@ -0,0 +1,356 @@
"""Tests for the ``inspect_release`` orchestrator (Phase C).
Covers the four composition steps as a black box: a real
``YamlReleaseKnowledge``, real on-disk filesystem under ``tmp_path``,
and a stubbed ``MediaProber`` so we don't depend on a system ``ffprobe``.
"""
from __future__ import annotations
from pathlib import Path
from alfred.application.release import InspectedResult, inspect_release
from alfred.domain.shared.media import AudioTrack, MediaInfo, VideoTrack
from alfred.infrastructure.knowledge.release_kb import YamlReleaseKnowledge
_KB = YamlReleaseKnowledge()
_MOVIE_NAME = "Inception.2010.1080p.BluRay.x264-GROUP"
_TV_NAME = "Dexter.S01E01.1080p.WEB-DL.x264-GROUP"
# --------------------------------------------------------------------------- #
# Test doubles #
# --------------------------------------------------------------------------- #
class _StubProber:
"""Minimal MediaProber stub. Records the path it was asked to probe."""
def __init__(self, info: MediaInfo | None) -> None:
self._info = info
self.calls: list[Path] = []
def list_subtitle_streams(self, video: Path): # pragma: no cover - unused here
return []
def probe(self, video: Path) -> MediaInfo | None:
self.calls.append(video)
return self._info
class _RaisingProber:
"""A prober that would explode if called — used to assert no probe."""
def list_subtitle_streams(self, video: Path): # pragma: no cover
raise AssertionError("list_subtitle_streams must not be called")
def probe(self, video: Path): # pragma: no cover
raise AssertionError("probe must not be called")
def _media_info_1080p_h264() -> MediaInfo:
return MediaInfo(
video_tracks=(VideoTrack(index=0, codec="h264", width=1920, height=1080),),
audio_tracks=(
AudioTrack(
index=1,
codec="ac3",
channels=6,
channel_layout="5.1",
language="eng",
is_default=True,
),
),
subtitle_tracks=(),
duration_seconds=7200.0,
bitrate_kbps=8000,
)
# --------------------------------------------------------------------------- #
# Happy paths #
# --------------------------------------------------------------------------- #
class TestInspectMovieFolder:
def test_returns_inspected_result_with_all_fields(self, tmp_path: Path) -> None:
folder = tmp_path / _MOVIE_NAME
folder.mkdir()
video = folder / "movie.mkv"
video.write_bytes(b"")
prober = _StubProber(_media_info_1080p_h264())
result = inspect_release(_MOVIE_NAME, folder, _KB, prober)
assert isinstance(result, InspectedResult)
assert result.source_path == folder
assert result.main_video == video
assert result.media_info is not None
assert result.probe_used is True
assert prober.calls == [video]
def test_parsed_carries_token_level_fields(self, tmp_path: Path) -> None:
folder = tmp_path / _MOVIE_NAME
folder.mkdir()
(folder / "movie.mkv").write_bytes(b"")
prober = _StubProber(_media_info_1080p_h264())
result = inspect_release(_MOVIE_NAME, folder, _KB, prober)
assert result.parsed.title.lower().startswith("inception")
assert result.parsed.year == 2010
assert result.parsed.group == "GROUP"
assert result.parsed.media_type == "movie"
def test_report_has_confidence_and_road(self, tmp_path: Path) -> None:
folder = tmp_path / _MOVIE_NAME
folder.mkdir()
(folder / "movie.mkv").write_bytes(b"")
prober = _StubProber(None)
result = inspect_release(_MOVIE_NAME, folder, _KB, prober)
assert 0 <= result.report.confidence <= 100
assert result.report.road in ("easy", "shitty", "path_of_pain")
class TestInspectSingleFile:
def test_file_is_its_own_main_video(self, tmp_path: Path) -> None:
f = tmp_path / f"{_MOVIE_NAME}.mkv"
f.write_bytes(b"")
prober = _StubProber(_media_info_1080p_h264())
result = inspect_release(_MOVIE_NAME, f, _KB, prober)
assert result.main_video == f
assert result.probe_used is True
# --------------------------------------------------------------------------- #
# Probe-gating logic #
# --------------------------------------------------------------------------- #
class TestProbeGating:
def test_no_video_means_no_probe(self, tmp_path: Path) -> None:
folder = tmp_path / _MOVIE_NAME
folder.mkdir()
# Only a non-video file present.
(folder / "readme.txt").write_text("hi")
prober = _RaisingProber()
result = inspect_release(_MOVIE_NAME, folder, _KB, prober)
assert result.main_video is None
assert result.media_info is None
assert result.probe_used is False
def test_media_type_other_means_no_probe(self, tmp_path: Path) -> None:
# An ISO-only folder gets detect_media_type → "other".
folder = tmp_path / _MOVIE_NAME
folder.mkdir()
(folder / "disc.iso").write_bytes(b"")
prober = _RaisingProber()
result = inspect_release(_MOVIE_NAME, folder, _KB, prober)
assert result.parsed.media_type == "other"
assert result.media_info is None
assert result.probe_used is False
def test_probe_failure_keeps_probe_used_false(self, tmp_path: Path) -> None:
folder = tmp_path / _MOVIE_NAME
folder.mkdir()
(folder / "movie.mkv").write_bytes(b"")
prober = _StubProber(None) # ffprobe simulated as failing
result = inspect_release(_MOVIE_NAME, folder, _KB, prober)
assert result.main_video is not None
assert result.media_info is None
assert result.probe_used is False
# --------------------------------------------------------------------------- #
# Mutation contract #
# --------------------------------------------------------------------------- #
class TestMutationContract:
def test_detect_media_type_refines_parsed(self, tmp_path: Path) -> None:
# Release name parses to "movie", but folder mixes video + non_video
# (e.g. an ISO sitting next to an mkv) → detect_media_type returns
# "unknown", which is in _NON_PROBABLE_MEDIA_TYPES → no probe.
folder = tmp_path / _MOVIE_NAME
folder.mkdir()
(folder / "movie.mkv").write_bytes(b"")
(folder / "extras.iso").write_bytes(b"")
prober = _RaisingProber()
result = inspect_release(_MOVIE_NAME, folder, _KB, prober)
assert result.parsed.media_type == "unknown"
assert result.probe_used is False
def test_enrich_runs_when_probe_succeeds(self, tmp_path: Path) -> None:
# Build a release name with no codec; probe should fill it in.
name = "Inception.2010.1080p.BluRay-GROUP"
folder = tmp_path / name
folder.mkdir()
(folder / "movie.mkv").write_bytes(b"")
prober = _StubProber(_media_info_1080p_h264())
result = inspect_release(name, folder, _KB, prober)
assert result.probe_used is True
# enrich_from_probe should have filled the missing codec field.
assert result.parsed.codec is not None
# --------------------------------------------------------------------------- #
# Resilience #
# --------------------------------------------------------------------------- #
class TestResilience:
def test_nonexistent_path_does_not_raise(self, tmp_path: Path) -> None:
ghost = tmp_path / "does-not-exist"
prober = _RaisingProber()
result = inspect_release(_MOVIE_NAME, ghost, _KB, prober)
assert result.main_video is None
assert result.media_info is None
assert result.probe_used is False
def test_tv_release_inspection(self, tmp_path: Path) -> None:
folder = tmp_path / _TV_NAME
folder.mkdir()
video = folder / "episode.mkv"
video.write_bytes(b"")
prober = _StubProber(_media_info_1080p_h264())
result = inspect_release(_TV_NAME, folder, _KB, prober)
assert result.parsed.media_type == "tv_show"
assert result.parsed.season == 1
assert result.parsed.episode == 1
assert result.main_video == video
assert result.probe_used is True
# --------------------------------------------------------------------------- #
# Frozen contract #
# --------------------------------------------------------------------------- #
class TestFrozen:
def test_inspected_result_is_frozen(self, tmp_path: Path) -> None:
folder = tmp_path / _MOVIE_NAME
folder.mkdir()
(folder / "movie.mkv").write_bytes(b"")
prober = _StubProber(None)
result = inspect_release(_MOVIE_NAME, folder, _KB, prober)
# frozen=True → assigning a field raises FrozenInstanceError.
import dataclasses
try:
result.probe_used = True # type: ignore[misc]
except dataclasses.FrozenInstanceError:
pass
else: # pragma: no cover
raise AssertionError("InspectedResult should be frozen")
# --------------------------------------------------------------------------- #
# recommended_action #
# --------------------------------------------------------------------------- #
class TestRecommendedAction:
"""``recommended_action`` collapses the orchestrator's go / wait /
skip decision into a single property. The check ordering is part
of the contract (skip wins over ask_user, ask_user wins over
process); see the property docstring."""
def test_skip_when_no_main_video(self, tmp_path: Path) -> None:
# Folder with no video at all → main_video is None → skip.
folder = tmp_path / _MOVIE_NAME
folder.mkdir()
(folder / "readme.txt").write_text("hi")
result = inspect_release(_MOVIE_NAME, folder, _KB, _RaisingProber())
assert result.main_video is None
assert result.recommended_action == "skip"
def test_skip_when_media_type_other(self, tmp_path: Path) -> None:
# Folder with only non-video files (ISO) → media_type == "other"
# AND main_video is None (find_main_video filters by video ext).
# Both branches resolve to "skip"; this asserts the contract holds.
folder = tmp_path / _MOVIE_NAME
folder.mkdir()
(folder / "disc.iso").write_bytes(b"")
result = inspect_release(_MOVIE_NAME, folder, _KB, _RaisingProber())
assert result.parsed.media_type == "other"
assert result.recommended_action == "skip"
def test_ask_user_when_media_type_unknown(self, tmp_path: Path) -> None:
# Mixed video + non-video → detect_media_type returns "unknown".
folder = tmp_path / _MOVIE_NAME
folder.mkdir()
(folder / "movie.mkv").write_bytes(b"")
(folder / "extras.iso").write_bytes(b"")
result = inspect_release(
_MOVIE_NAME, folder, _KB, _StubProber(_media_info_1080p_h264())
)
assert result.parsed.media_type == "unknown"
assert result.recommended_action == "ask_user"
def test_ask_user_when_path_of_pain_road(self, tmp_path: Path) -> None:
# Malformed name (forbidden chars) → road == "path_of_pain".
name = "garbage@#%name"
folder = tmp_path / "release"
folder.mkdir()
(folder / "movie.mkv").write_bytes(b"")
result = inspect_release(
name, folder, _KB, _StubProber(_media_info_1080p_h264())
)
assert result.report.road == "path_of_pain"
# main_video is found but the road still flags uncertainty.
assert result.main_video is not None
assert result.recommended_action == "ask_user"
def test_process_for_confident_movie(self, tmp_path: Path) -> None:
folder = tmp_path / _MOVIE_NAME
folder.mkdir()
(folder / "movie.mkv").write_bytes(b"")
result = inspect_release(
_MOVIE_NAME, folder, _KB, _StubProber(_media_info_1080p_h264())
)
assert result.parsed.media_type == "movie"
assert result.report.road in ("easy", "shitty")
assert result.recommended_action == "process"
def test_process_for_confident_tv_show(self, tmp_path: Path) -> None:
folder = tmp_path / _TV_NAME
folder.mkdir()
(folder / "episode.mkv").write_bytes(b"")
result = inspect_release(
_TV_NAME, folder, _KB, _StubProber(_media_info_1080p_h264())
)
assert result.parsed.media_type == "tv_show"
assert result.recommended_action == "process"
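The check ordering that `TestRecommendedAction` treats as part of the contract (skip before ask_user before process) can be sketched as follows. `Inspection` is a hypothetical, flattened stand-in for `InspectedResult`; the real class nests these values under `parsed` and `report`.

```python
from __future__ import annotations

from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class Inspection:
    """Hypothetical flattened view of an inspection result."""

    media_type: str          # "movie", "tv_show", "unknown", "other"
    road: str                # "easy", "shitty", "path_of_pain"
    main_video: Path | None  # None when no video file was found

    @property
    def recommended_action(self) -> str:
        # 1. Nothing to process at all: skip wins over everything.
        if self.main_video is None or self.media_type == "other":
            return "skip"
        # 2. Something is there, but we are not sure what it is.
        if self.media_type == "unknown" or self.road == "path_of_pain":
            return "ask_user"
        # 3. Confident parse with a usable video.
        return "process"


video = Path("movie.mkv")
print(Inspection("movie", "easy", video).recommended_action)
```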
+3 -3
@@ -40,7 +40,7 @@ from alfred.application.filesystem.manage_subtitles import (
_to_imdb_id,
_to_unresolved_dto,
)
from alfred.domain.subtitles.entities import MediaSubtitleMetadata, SubtitleCandidate
from alfred.domain.subtitles.entities import MediaSubtitleMetadata, SubtitleScanResult
from alfred.application.subtitles.placer import PlacedTrack, PlaceResult
from alfred.domain.subtitles.value_objects import (
ScanStrategy,
@@ -63,8 +63,8 @@ def _track(
is_embedded: bool = False,
raw_tokens: list[str] | None = None,
file_size_kb: float | None = None,
) -> SubtitleCandidate:
return SubtitleCandidate(
) -> SubtitleScanResult:
return SubtitleScanResult(
language=lang,
format=fmt,
subtitle_type=stype,
+146 -19
@@ -9,7 +9,6 @@ Four use cases compute library paths from a release name + TMDB metadata:
Coverage:
- ``TestSanitize``: Windows-forbidden chars stripped.
- ``TestFindExistingTvshowFolders``: empty root, prefix match (case + space/dot).
- ``TestResolveSeriesFolderInternal``: confirmed_folder, no existing, single match,
ambiguous → _Clarification.
@@ -32,14 +31,53 @@ from alfred.application.filesystem.resolve_destination import (
_Clarification,
_find_existing_tvshow_folders,
_resolve_series_folder,
_sanitize,
resolve_episode_destination,
resolve_movie_destination,
resolve_season_destination,
resolve_series_destination,
resolve_episode_destination as _resolve_episode_destination,
resolve_movie_destination as _resolve_movie_destination,
resolve_season_destination as _resolve_season_destination,
resolve_series_destination as _resolve_series_destination,
)
from alfred.infrastructure.knowledge.release_kb import YamlReleaseKnowledge
from alfred.infrastructure.persistence import Memory, set_memory
_KB = YamlReleaseKnowledge()
class _NullProber:
"""Default prober stub — never returns probe data."""
def list_subtitle_streams(self, video): # pragma: no cover
return []
def probe(self, video):
return None
_DEFAULT_PROBER = _NullProber()
def resolve_season_destination(*args, prober=None, **kwargs):
return _resolve_season_destination(
*args, kb=_KB, prober=prober or _DEFAULT_PROBER, **kwargs
)
def resolve_episode_destination(*args, prober=None, **kwargs):
return _resolve_episode_destination(
*args, kb=_KB, prober=prober or _DEFAULT_PROBER, **kwargs
)
def resolve_movie_destination(*args, prober=None, **kwargs):
return _resolve_movie_destination(
*args, kb=_KB, prober=prober or _DEFAULT_PROBER, **kwargs
)
def resolve_series_destination(*args, prober=None, **kwargs):
return _resolve_series_destination(
*args, kb=_KB, prober=prober or _DEFAULT_PROBER, **kwargs
)
REL_EPISODE = "Oz.S01E01.1080p.WEBRip.x265-KONTRAST"
REL_SEASON = "Oz.S03.1080p.WEBRip.x265-KONTRAST"
REL_MOVIE = "Inception.2010.1080p.BluRay.x265-GROUP"
@@ -51,15 +89,6 @@ REL_SERIES = "Oz.Complete.Series.1080p.WEBRip.x265-KONTRAST"
# --------------------------------------------------------------------------- #
class TestSanitize:
def test_passthrough_safe_chars(self):
assert _sanitize("Oz.1997.1080p-GRP") == "Oz.1997.1080p-GRP"
def test_strips_windows_forbidden(self):
# ? : * " < > | \
assert _sanitize('a?b:c*d"e<f>g|h\\i') == "abcdefghi"
# --------------------------------------------------------------------------- #
# _find_existing_tvshow_folders #
# --------------------------------------------------------------------------- #
@@ -107,6 +136,7 @@ class TestResolveSeriesFolderInternal:
out = _resolve_series_folder(
tmp_path,
"Oz",
"Oz",
1997,
"Oz.1997.WEBRip-KONTRAST",
confirmed_folder="Oz.1997.X-GRP",
@@ -117,6 +147,7 @@ class TestResolveSeriesFolderInternal:
out = _resolve_series_folder(
tmp_path,
"Oz",
"Oz",
1997,
"Oz.1997.WEBRip-KONTRAST",
confirmed_folder="Oz.1997.New-X",
@@ -125,21 +156,21 @@ class TestResolveSeriesFolderInternal:
def test_no_existing_returns_computed_as_new(self, tmp_path):
out = _resolve_series_folder(
tmp_path, "Oz", 1997, "Oz.1997.WEBRip-KONTRAST", None
tmp_path, "Oz", "Oz", 1997, "Oz.1997.WEBRip-KONTRAST", None
)
assert out == ("Oz.1997.WEBRip-KONTRAST", True)
def test_single_existing_matching_computed_returns_existing(self, tmp_path):
(tmp_path / "Oz.1997.WEBRip-KONTRAST").mkdir()
out = _resolve_series_folder(
tmp_path, "Oz", 1997, "Oz.1997.WEBRip-KONTRAST", None
tmp_path, "Oz", "Oz", 1997, "Oz.1997.WEBRip-KONTRAST", None
)
assert out == ("Oz.1997.WEBRip-KONTRAST", False)
def test_single_existing_different_name_returns_clarification(self, tmp_path):
(tmp_path / "Oz.1997.BluRay-OTHER").mkdir()
out = _resolve_series_folder(
- tmp_path, "Oz", 1997, "Oz.1997.WEBRip-KONTRAST", None
+ tmp_path, "Oz", "Oz", 1997, "Oz.1997.WEBRip-KONTRAST", None
)
assert isinstance(out, _Clarification)
assert "Oz" in out.question
@@ -149,7 +180,7 @@ class TestResolveSeriesFolderInternal:
def test_multiple_existing_returns_clarification(self, tmp_path):
(tmp_path / "Oz.1997.A-GRP").mkdir()
(tmp_path / "Oz.1997.B-GRP").mkdir()
- out = _resolve_series_folder(tmp_path, "Oz", 1997, "Oz.1997.A-GRP", None)
+ out = _resolve_series_folder(tmp_path, "Oz", "Oz", 1997, "Oz.1997.A-GRP", None)
assert isinstance(out, _Clarification)
# Computed already in existing → not duplicated.
assert out.options.count("Oz.1997.A-GRP") == 1
@@ -331,6 +362,102 @@ class TestSeries:
assert out.status == "needs_clarification"
# --------------------------------------------------------------------------- #
# Probe enrichment wiring #
# --------------------------------------------------------------------------- #
class _StubProber:
"""Minimal MediaProber stub used to drive enrich_from_probe."""
def __init__(self, info):
self._info = info
def list_subtitle_streams(self, video): # pragma: no cover - unused here
return []
def probe(self, video):
return self._info
def _stereo_movie_info():
"""A MediaInfo that fills quality+codec when the release name omits them."""
from alfred.domain.shared.media import AudioTrack, MediaInfo, VideoTrack
return MediaInfo(
video_tracks=(VideoTrack(index=0, codec="hevc", width=1920, height=1080),),
audio_tracks=(
AudioTrack(
index=1,
codec="aac",
channels=2,
channel_layout="stereo",
language="eng",
is_default=True,
),
),
subtitle_tracks=(),
)
class TestProbeEnrichmentWiring:
"""When source_path/source_file points to a real file, the resolver
should pick up ffprobe data via inspect_release and let the enriched
tech_string land in the destination name."""
def test_movie_picks_up_probe_quality(self, cfg_memory, tmp_path):
# Release name parses to "movie" but is missing the quality token;
# probe must supply 1080p and refresh tech_string.
bare_name = "Inception.2010.BluRay.x264-GROUP"
video = tmp_path / "movie.mkv"
video.write_bytes(b"")
out = resolve_movie_destination(
bare_name,
str(video),
"Inception",
2010,
prober=_StubProber(_stereo_movie_info()),
)
assert out.status == "ok"
# tech_string -> "1080p.BluRay.x264" -> "1080p" shows up in names.
assert "1080p" in out.movie_folder_name
assert "1080p" in out.filename
def test_movie_skips_probe_when_path_missing(self, cfg_memory):
# If the file doesn't exist, no probe runs (the stub would have
# injected 1080p — its absence proves the skip).
out = resolve_movie_destination(
"Inception.2010.BluRay.x264-GROUP",
"/nowhere/m.mkv",
"Inception",
2010,
prober=_StubProber(_stereo_movie_info()),
)
assert out.status == "ok"
assert "1080p" not in out.movie_folder_name
def test_season_picks_up_probe_via_source_path(self, cfg_memory, tmp_path):
# Season pack name missing quality token; probe must add it.
bare_name = "Oz.S03.BluRay.x265-KONTRAST"
release_dir = tmp_path / bare_name
release_dir.mkdir()
(release_dir / "episode.mkv").write_bytes(b"")
out = resolve_season_destination(
bare_name,
"Oz",
1997,
source_path=str(release_dir),
prober=_StubProber(_stereo_movie_info()),
)
assert out.status == "ok"
# Series folder name embeds tech_string -> "1080p" surfaced by probe.
assert "1080p" in out.series_folder_name
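The enrichment behaviour these tests pin down can be pictured with a small stand-in. This is only a sketch under stated assumptions: `Parsed`, `fill_quality_from_height`, and the height-to-label rule are hypothetical names invented for illustration, not the project's `inspect_release` or `enrich_from_probe`. It shows a frozen value object being rebuilt with `dataclasses.replace` when the release name omits the quality token.

```python
from __future__ import annotations

from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Parsed:
    """Hypothetical stand-in for a parsed release (not the real ParsedRelease)."""

    quality: str | None
    source: str | None
    codec: str | None

    @property
    def tech_string(self) -> str:
        # Join only the parts that are present, in quality.source.codec order.
        return ".".join(p for p in (self.quality, self.source, self.codec) if p)


def fill_quality_from_height(parsed: Parsed, height: int | None) -> Parsed:
    """Return a copy whose quality comes from the probed height, if missing."""
    if parsed.quality is not None or height is None:
        return parsed  # the release name wins; nothing to fill in
    # Assumption: the quality label is simply "<height>p".
    return replace(parsed, quality=f"{height}p")


bare = Parsed(quality=None, source="BluRay", codec="x264")
enriched = fill_quality_from_height(bare, 1080)
```

Because the object is frozen, the original `bare` instance is untouched and the enriched copy carries the refreshed `tech_string`.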
# --------------------------------------------------------------------------- #
# DTO to_dict() #
# --------------------------------------------------------------------------- #
+3 -3
@@ -21,7 +21,7 @@ from unittest.mock import patch
import pytest
- from alfred.domain.subtitles.entities import SubtitleCandidate
+ from alfred.domain.subtitles.entities import SubtitleScanResult
from alfred.application.subtitles.placer import (
PlacedTrack,
PlaceResult,
@@ -46,8 +46,8 @@ def _track(
fmt=SRT,
stype=SubtitleType.STANDARD,
is_embedded: bool = False,
- ) -> SubtitleCandidate:
- return SubtitleCandidate(
+ ) -> SubtitleScanResult:
+ return SubtitleScanResult(
language=lang,
format=fmt,
subtitle_type=stype,
+130
@@ -0,0 +1,130 @@
"""Tests for the pre-pipeline exclusion helpers (Phase A bis)."""
from __future__ import annotations
from pathlib import Path
import pytest
from alfred.application.release.supported_media import (
find_main_video,
is_supported_video,
)
from alfred.infrastructure.knowledge.release_kb import YamlReleaseKnowledge
_KB = YamlReleaseKnowledge()
# --------------------------------------------------------------------- #
# is_supported_video #
# --------------------------------------------------------------------- #
class TestIsSupportedVideo:
def test_mkv_is_supported(self, tmp_path: Path) -> None:
f = tmp_path / "movie.mkv"
f.touch()
assert is_supported_video(f, _KB) is True
def test_mp4_is_supported(self, tmp_path: Path) -> None:
f = tmp_path / "movie.mp4"
f.touch()
assert is_supported_video(f, _KB) is True
def test_uppercase_extension_is_supported(self, tmp_path: Path) -> None:
# File systems can return mixed case; we lowercase the suffix.
f = tmp_path / "movie.MKV"
f.touch()
assert is_supported_video(f, _KB) is True
def test_srt_is_not_video(self, tmp_path: Path) -> None:
f = tmp_path / "movie.srt"
f.touch()
assert is_supported_video(f, _KB) is False
def test_nfo_is_not_video(self, tmp_path: Path) -> None:
f = tmp_path / "movie.nfo"
f.touch()
assert is_supported_video(f, _KB) is False
def test_no_extension_is_not_video(self, tmp_path: Path) -> None:
f = tmp_path / "README"
f.touch()
assert is_supported_video(f, _KB) is False
def test_directory_is_not_video(self, tmp_path: Path) -> None:
d = tmp_path / "subdir.mkv" # even with a video extension
d.mkdir()
assert is_supported_video(d, _KB) is False
def test_nonexistent_path_is_not_video(self, tmp_path: Path) -> None:
assert is_supported_video(tmp_path / "ghost.mkv", _KB) is False
# --------------------------------------------------------------------- #
# find_main_video #
# --------------------------------------------------------------------- #
class TestFindMainVideo:
def test_single_video_file_in_folder(self, tmp_path: Path) -> None:
main = tmp_path / "Movie.2020.mkv"
main.touch()
assert find_main_video(tmp_path, _KB) == main
def test_returns_lexicographically_first_among_multiple(
self, tmp_path: Path
) -> None:
# Legitimate for season packs: pick the first episode by name.
ep2 = tmp_path / "Show.S01E02.mkv"
ep1 = tmp_path / "Show.S01E01.mkv"
ep2.touch()
ep1.touch()
assert find_main_video(tmp_path, _KB) == ep1
def test_skips_non_video_files(self, tmp_path: Path) -> None:
# nfo and srt come alphabetically before .mkv, must not win.
(tmp_path / "Movie.nfo").touch()
(tmp_path / "Movie.srt").touch()
vid = tmp_path / "Movie.mkv"
vid.touch()
assert find_main_video(tmp_path, _KB) == vid
def test_ignores_subdirectories(self, tmp_path: Path) -> None:
# A Sample/ subdir must NOT be descended into.
sample_dir = tmp_path / "Sample"
sample_dir.mkdir()
(sample_dir / "sample.mkv").touch()
main = tmp_path / "Movie.mkv"
main.touch()
assert find_main_video(tmp_path, _KB) == main
def test_only_subdirectory_with_video_returns_none(
self, tmp_path: Path
) -> None:
# No top-level video, only one inside a subdir → None.
sub = tmp_path / "Sample"
sub.mkdir()
(sub / "video.mkv").touch()
assert find_main_video(tmp_path, _KB) is None
def test_empty_folder_returns_none(self, tmp_path: Path) -> None:
assert find_main_video(tmp_path, _KB) is None
def test_nonexistent_folder_returns_none(self, tmp_path: Path) -> None:
assert find_main_video(tmp_path / "ghost", _KB) is None
def test_single_file_release_passed_as_folder_arg(
self, tmp_path: Path
) -> None:
# Some releases are a bare .mkv with no enclosing folder.
f = tmp_path / "Movie.2020.1080p.mkv"
f.touch()
assert find_main_video(f, _KB) == f
def test_single_file_non_video_passed_as_folder_arg(
self, tmp_path: Path
) -> None:
f = tmp_path / "README.nfo"
f.touch()
assert find_main_video(f, _KB) is None
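Read together, the tests above describe a fairly small contract. The following is a minimal sketch of helpers that would satisfy it, assuming a hard-coded extension set in place of the knowledge base; `VIDEO_EXTENSIONS` and the one-argument signatures are illustrative assumptions, not the code in `alfred.application.release.supported_media`.

```python
from __future__ import annotations

import tempfile
from pathlib import Path

# Assumption: the real extension list comes from the YAML knowledge base.
VIDEO_EXTENSIONS = frozenset({".mkv", ".mp4"})


def is_supported_video(path: Path) -> bool:
    # Directories and missing paths fail is_file(); the suffix is lowercased.
    return path.is_file() and path.suffix.lower() in VIDEO_EXTENSIONS


def find_main_video(target: Path) -> Path | None:
    if target.is_file():
        # A bare single-file release is passed where a folder is expected.
        return target if is_supported_video(target) else None
    if not target.is_dir():
        return None
    # Top level only (no descent into Sample/), lexicographically first wins.
    candidates = sorted(p for p in target.iterdir() if is_supported_video(p))
    return candidates[0] if candidates else None


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "Show.S01E02.mkv").touch()
    (root / "Show.S01E01.mkv").touch()
    (root / "Show.nfo").touch()
    (root / "Sample").mkdir()
    (root / "Sample" / "a-sample.mkv").touch()
    picked = find_main_video(root)
    picked_name = picked.name if picked else None
    sample_only = find_main_video(root / "Sample" / "missing")
```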
+216
@@ -0,0 +1,216 @@
"""EASY-path tests for the v2 annotate-based pipeline.
These tests assert that the **v2 pipeline itself** produces the correct
annotated stream and assembled fields for releases from known groups
(KONTRAST, ELiTE, ...) without going through ``parse_release``. The
fixtures suite (``tests/domain/test_release_fixtures.py``) already
locks the user-visible ``ParsedRelease`` contract; here we cover the
internal pipeline behavior so a future refactor of ``parse_release``
can't quietly drop EASY without us noticing.
"""
from __future__ import annotations
from alfred.domain.release.parser import TokenRole
from alfred.domain.release.parser.pipeline import (
_detect_group,
annotate,
assemble,
tokenize,
)
from alfred.infrastructure.knowledge.release_kb import YamlReleaseKnowledge
_KB = YamlReleaseKnowledge()
class TestDetectGroup:
def test_codec_group(self) -> None:
tokens, _ = tokenize(
"Back.in.Action.2025.1080p.WEBRip.x265-KONTRAST", _KB
)
name, idx = _detect_group(tokens, _KB)
assert name == "KONTRAST"
assert idx == 6 # x265-KONTRAST is the 7th token
def test_unknown_when_no_dash(self) -> None:
tokens, _ = tokenize("Some.Movie.2020.1080p.WEBRip.x265.KONTRAST", _KB)
# No dash anywhere → no group detected.
name, idx = _detect_group(tokens, _KB)
assert idx is None
assert name == "UNKNOWN"
def test_skips_dashed_source(self) -> None:
# "Web-DL" must not be mistaken for a group token.
tokens, _ = tokenize("Movie.2020.1080p.Web-DL.x265-GRP", _KB)
name, idx = _detect_group(tokens, _KB)
assert name == "GRP"
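The three cases above suggest a right-to-left scan for a dashed token. Here is a hedged sketch over plain strings; `detect_group` and `DASHED_SOURCES` are hypothetical, and the real `_detect_group` works on `Token` objects and consults the knowledge base.

```python
from __future__ import annotations

# Assumption: dashed tokens that name a source rather than a "codec-GROUP" pair.
DASHED_SOURCES = frozenset({"web-dl"})


def detect_group(tokens: list[str]) -> tuple[str, int | None]:
    """Scan right to left for the last 'something-GROUP' token."""
    for index in range(len(tokens) - 1, -1, -1):
        text = tokens[index]
        if "-" not in text or text.lower() in DASHED_SOURCES:
            continue  # no dash, or a dashed source such as Web-DL
        group = text.rpartition("-")[2]
        if group:
            return group, index
    return "UNKNOWN", None


easy = detect_group("Back.in.Action.2025.1080p.WEBRip.x265-KONTRAST".split("."))
no_dash = detect_group("Some.Movie.2020.1080p.WEBRip.x265.KONTRAST".split("."))
dashed_source = detect_group("Movie.2020.1080p.Web-DL.x265-GRP".split("."))
```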
class TestAnnotateEasy:
def test_kontrast_movie(self) -> None:
tokens, tag = tokenize(
"Back.in.Action.2025.1080p.WEBRip.x265-KONTRAST", _KB
)
annotated = annotate(tokens, _KB)
assert annotated is not None, "KONTRAST should hit the EASY path"
roles = [t.role for t in annotated]
assert roles == [
TokenRole.TITLE, # Back
TokenRole.TITLE, # in
TokenRole.TITLE, # Action
TokenRole.YEAR,
TokenRole.RESOLUTION,
TokenRole.SOURCE,
TokenRole.CODEC, # x265-KONTRAST → CODEC with extra.group=KONTRAST
]
assert annotated[-1].extra["group"] == "KONTRAST"
assert annotated[-1].extra["codec"] == "x265"
def test_kontrast_tv_episode(self) -> None:
tokens, _ = tokenize(
"Slow.Horses.S05E01.1080p.WEBRip.x265-KONTRAST", _KB
)
annotated = annotate(tokens, _KB)
assert annotated is not None
# Year is optional and absent → skipped. Season_episode present.
roles = [t.role for t in annotated]
assert TokenRole.SEASON_EPISODE in roles
assert TokenRole.YEAR not in roles
def test_elite_no_source(self) -> None:
# ELiTE schema marks source as optional — Foundation.S02 omits it.
tokens, _ = tokenize("Foundation.S02.1080p.x265-ELiTE", _KB)
annotated = annotate(tokens, _KB)
assert annotated is not None, "ELiTE optional source must be tolerated"
roles = [t.role for t in annotated]
assert TokenRole.SOURCE not in roles
assert TokenRole.RESOLUTION in roles
assert TokenRole.CODEC in roles
def test_unknown_group_falls_to_shitty(self) -> None:
tokens, _ = tokenize("Some.Movie.2020.1080p.WEBRip.x264-RANDOM", _KB)
# RANDOM is not in our release_groups/ — annotate() now falls
# through to the in-pipeline SHITTY pass and returns a populated
# token list (no None sentinel anymore).
annotated = annotate(tokens, _KB)
assert annotated is not None
roles = [t.role for t in annotated]
# Title is "Some.Movie", then YEAR, RESOLUTION, SOURCE, CODEC
# carrying the group in extra.
assert TokenRole.TITLE in roles
assert TokenRole.YEAR in roles
assert TokenRole.RESOLUTION in roles
assert TokenRole.SOURCE in roles
assert TokenRole.CODEC in roles
codec_tok = next(t for t in annotated if t.role is TokenRole.CODEC)
assert codec_tok.extra.get("group") == "RANDOM"
class TestAssemble:
def test_kontrast_movie_fields(self) -> None:
name = "Back.in.Action.2025.1080p.WEBRip.x265-KONTRAST"
tokens, tag = tokenize(name, _KB)
annotated = annotate(tokens, _KB)
fields = assemble(annotated, tag, name, _KB)
assert fields["title"] == "Back.in.Action"
assert fields["year"] == 2025
assert fields["season"] is None
assert fields["quality"] == "1080p"
assert fields["source"] == "WEBRip"
assert fields["codec"] == "x265"
assert fields["group"] == "KONTRAST"
assert fields["media_type"] == "movie"
assert fields["site_tag"] is None
def test_kontrast_tv_fields(self) -> None:
name = "Slow.Horses.S05E01.1080p.WEBRip.x265-KONTRAST"
tokens, tag = tokenize(name, _KB)
annotated = annotate(tokens, _KB)
fields = assemble(annotated, tag, name, _KB)
assert fields["title"] == "Slow.Horses"
assert fields["year"] is None
assert fields["season"] == 5
assert fields["episode"] == 1
assert fields["media_type"] == "tv_show"
assert fields["group"] == "KONTRAST"
def test_elite_season_pack(self) -> None:
name = "Foundation.S02.1080p.x265-ELiTE"
tokens, tag = tokenize(name, _KB)
annotated = annotate(tokens, _KB)
fields = assemble(annotated, tag, name, _KB)
assert fields["title"] == "Foundation"
assert fields["season"] == 2
assert fields["episode"] is None # season pack
assert fields["source"] is None # ELiTE omits it
assert fields["quality"] == "1080p"
assert fields["codec"] == "x265"
assert fields["group"] == "ELiTE"
class TestEnrichers:
"""Non-positional roles populated alongside the structural walk.
These releases would have failed the v2 EASY path before the enricher
pass landed (leftover unknown tokens would force a fallback). They
now succeed in v2 with rich metadata.
"""
def test_bit_depth_and_audio(self) -> None:
name = "Back.in.Action.2025.1080p.WEBRip.10bit.DDP.5.1.x265-KONTRAST"
tokens, tag = tokenize(name, _KB)
annotated = annotate(tokens, _KB)
assert annotated is not None
fields = assemble(annotated, tag, name, _KB)
assert fields["title"] == "Back.in.Action"
assert fields["bit_depth"] == "10bit"
assert fields["audio_codec"] == "DDP"
assert fields["audio_channels"] == "5.1"
def test_hdr_sequence(self) -> None:
# DV.HDR10 sequence + TrueHD.Atmos sequence + 7.1 channels +
# DIRECTORS.CUT edition all in one release.
name = (
"Some.Movie.2024.DIRECTORS.CUT.2160p.BluRay.DV.HDR10."
"TrueHD.Atmos.7.1.x265-KONTRAST"
)
tokens, tag = tokenize(name, _KB)
annotated = annotate(tokens, _KB)
assert annotated is not None
fields = assemble(annotated, tag, name, _KB)
assert fields["edition"] == "DIRECTORS.CUT"
assert fields["hdr_format"] == "DV.HDR10"
assert fields["audio_codec"] == "TrueHD.Atmos"
assert fields["audio_channels"] == "7.1"
def test_multiple_languages(self) -> None:
name = "Movie.2020.FRENCH.MULTI.1080p.WEBRip.DTS.HD.MA.5.1.x265-KONTRAST"
tokens, tag = tokenize(name, _KB)
annotated = annotate(tokens, _KB)
assert annotated is not None
fields = assemble(annotated, tag, name, _KB)
assert fields["languages"] == ("FRENCH", "MULTI")
assert fields["audio_codec"] == "DTS-HD.MA"
assert fields["audio_channels"] == "5.1"
def test_tv_with_language(self) -> None:
name = "Show.S01E05.FRENCH.1080p.WEBRip.x265-KONTRAST"
tokens, tag = tokenize(name, _KB)
annotated = annotate(tokens, _KB)
assert annotated is not None
fields = assemble(annotated, tag, name, _KB)
assert fields["title"] == "Show"
assert fields["season"] == 1
assert fields["episode"] == 5
assert fields["languages"] == ("FRENCH",)
assert fields["media_type"] == "tv_show"
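Fields such as `DTS-HD.MA` and `TrueHD.Atmos` span several dot-separated tokens, so the enricher pass presumably matches token sequences rather than single tokens. A minimal sketch of longest-first sequence matching follows; the `AUDIO_SEQUENCES` table and `match_sequence` are assumptions made up for this example, not the project's knowledge base or enricher code.

```python
from __future__ import annotations

# Hypothetical table: token sequences mapped to a canonical label.
AUDIO_SEQUENCES: dict[tuple[str, ...], str] = {
    ("DTS", "HD", "MA"): "DTS-HD.MA",
    ("TrueHD", "Atmos"): "TrueHD.Atmos",
    ("DTS",): "DTS",
    ("DDP",): "DDP",
}


def match_sequence(
    tokens: list[str], start: int, table: dict[tuple[str, ...], str]
) -> tuple[str, int] | None:
    """Return (canonical label, tokens consumed) for the longest match at start."""
    for sequence in sorted(table, key=len, reverse=True):
        if tuple(tokens[start : start + len(sequence)]) == sequence:
            return table[sequence], len(sequence)
    return None


tokens = ["Movie", "2020", "1080p", "WEBRip", "DTS", "HD", "MA", "x265-GRP"]
long_match = match_sequence(tokens, 4, AUDIO_SEQUENCES)
short_match = match_sequence(["DTS", "x264-GRP"], 0, AUDIO_SEQUENCES)
```

Trying longer sequences first keeps `DTS.HD.MA` from being consumed as a bare `DTS` followed by two leftover tokens.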
@@ -0,0 +1,79 @@
"""Scaffolding tests for the v2 parser package.
These tests lock the **shape** of the new pipeline (token VOs, tokenize
output, site-tag stripping) before the annotate step is wired in. They
do not check parsed-release output yet; that comes once :func:`annotate`
is implemented and the fixtures-based suite switches over.
"""
from __future__ import annotations
from alfred.domain.release.parser import Token, TokenRole
from alfred.domain.release.parser.pipeline import strip_site_tag, tokenize
from alfred.infrastructure.knowledge.release_kb import YamlReleaseKnowledge
_KB = YamlReleaseKnowledge()
class TestToken:
def test_default_role_is_unknown(self) -> None:
t = Token(text="1080p", index=3)
assert t.role is TokenRole.UNKNOWN
assert not t.is_annotated
def test_with_role_returns_new_instance(self) -> None:
t = Token(text="1080p", index=3)
promoted = t.with_role(TokenRole.RESOLUTION)
assert promoted is not t
assert promoted.role is TokenRole.RESOLUTION
assert t.role is TokenRole.UNKNOWN # original unchanged (frozen)
def test_with_role_merges_extra(self) -> None:
t = Token(text="x265-KONTRAST", index=5)
promoted = t.with_role(TokenRole.CODEC, group="KONTRAST")
assert promoted.role is TokenRole.CODEC
assert promoted.extra == {"group": "KONTRAST"}
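For readers without the parser package at hand, the contract tested here can be sketched as a frozen dataclass whose `with_role` builds a new instance. `Tok` and `Role` below are simplified, hypothetical stand-ins for `Token` and `TokenRole`; the real classes may differ in fields and defaults.

```python
from __future__ import annotations

from dataclasses import dataclass, field, replace
from enum import Enum


class Role(Enum):
    UNKNOWN = "unknown"
    RESOLUTION = "resolution"
    CODEC = "codec"


@dataclass(frozen=True)
class Tok:
    text: str
    index: int
    role: Role = Role.UNKNOWN
    extra: dict = field(default_factory=dict)

    @property
    def is_annotated(self) -> bool:
        return self.role is not Role.UNKNOWN

    def with_role(self, role: Role, **extra: str) -> Tok:
        # Frozen instances cannot be mutated, so build a copy instead and
        # merge the new extras over the existing ones.
        return replace(self, role=role, extra={**self.extra, **extra})


original = Tok(text="x265-KONTRAST", index=5)
promoted = original.with_role(Role.CODEC, group="KONTRAST")
```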
class TestStripSiteTag:
def test_no_tag(self) -> None:
clean, tag = strip_site_tag("The.Movie.2020.1080p-GRP")
assert tag is None
assert clean == "The.Movie.2020.1080p-GRP"
def test_suffix_tag(self) -> None:
clean, tag = strip_site_tag("Sinners.2025.1080p-[YTS.MX]")
assert tag == "YTS.MX"
assert clean == "Sinners.2025.1080p-"
def test_prefix_tag(self) -> None:
clean, tag = strip_site_tag("[ OxTorrent.vc ] The.Title.S01E01")
assert tag == "OxTorrent.vc"
assert clean == "The.Title.S01E01"
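One way to obtain the behaviour asserted above is a pair of anchored regular expressions, one for a leading bracket and one for a trailing bracket. This is a sketch only: `strip_tag` and both patterns are assumptions written for this example, and the real `strip_site_tag` may be implemented differently.

```python
from __future__ import annotations

import re

# A bracketed tag at the very start or the very end of the name.
_PREFIX = re.compile(r"^\[\s*([^\]]+?)\s*\]\s*")
_SUFFIX = re.compile(r"\[\s*([^\]]+?)\s*\]\s*$")


def strip_tag(name: str) -> tuple[str, str | None]:
    """Return (name without the site tag, tag or None)."""
    match = _PREFIX.match(name)
    if match:
        return name[match.end():], match.group(1)
    match = _SUFFIX.search(name)
    if match:
        return name[: match.start()], match.group(1)
    return name, None


untagged = strip_tag("The.Movie.2020.1080p-GRP")
suffixed = strip_tag("Sinners.2025.1080p-[YTS.MX]")
prefixed = strip_tag("[ OxTorrent.vc ] The.Title.S01E01")
```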
class TestTokenize:
def test_simple_release(self) -> None:
tokens, tag = tokenize("Back.in.Action.2025.1080p.WEBRip.x265-KONTRAST", _KB)
assert tag is None
texts = [t.text for t in tokens]
# Dash is not a separator, so x265-KONTRAST stays glued.
assert texts == [
"Back", "in", "Action", "2025", "1080p", "WEBRip", "x265-KONTRAST",
]
def test_all_tokens_start_unknown(self) -> None:
tokens, _ = tokenize("Back.in.Action.2025.1080p.WEBRip.x265-KONTRAST", _KB)
assert all(t.role is TokenRole.UNKNOWN for t in tokens)
def test_indexes_are_contiguous(self) -> None:
tokens, _ = tokenize("A.B.C.D", _KB)
assert [t.index for t in tokens] == [0, 1, 2, 3]
def test_strips_site_tag_before_tokenize(self) -> None:
tokens, tag = tokenize(
"Sinners.2025.1080p.WEBRip.x265.10bit.AAC5.1-[YTS.MX]", _KB
)
assert tag == "YTS.MX"
# Site tag substring must not appear among tokens.
assert not any("YTS" in t.text for t in tokens)
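The commit notes describe the tokenizer as iterating the separator list as plain strings, with no regex, which is why a multi-byte separator such as U+FF5C needs no code change. A minimal sketch of that idea follows; `split_on_separators` and the separator list passed to it are hypothetical, and the real `tokenize` also strips site tags and returns `Token` objects.

```python
from __future__ import annotations


def split_on_separators(name: str, separators: list[str]) -> list[str]:
    """Split name on any separator, compared as plain strings (no regex)."""
    # Longest first so a multi-character separator beats its own prefix;
    # empty strings are dropped because they would never advance the cursor.
    ordered = sorted((s for s in separators if s), key=len, reverse=True)
    tokens: list[str] = []
    current: list[str] = []
    i = 0
    while i < len(name):
        for sep in ordered:
            if name.startswith(sep, i):
                if current:
                    tokens.append("".join(current))
                    current = []
                i += len(sep)
                break
        else:
            current.append(name[i])
            i += 1
    if current:
        tokens.append("".join(current))
    return tokens


dotted = split_on_separators(
    "Back.in.Action.2025.1080p.WEBRip.x265-KONTRAST", [".", " ", "_"]
)
wide_bar = split_on_separators("Show\uff5cS01E01\uff5c1080p", [".", "\uff5c"])
```

The dash is deliberately absent from the separator list here, matching the test comment that `x265-KONTRAST` stays glued.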
@@ -0,0 +1,279 @@
"""Phase A — parse-confidence scoring.
These tests pin the score / road semantics without going through
fixtures. They exercise the small pure functions in
``alfred.domain.release.parser.scoring`` and the end-to-end contract
that ``parse_release`` returns a ``(ParsedRelease, ParseReport)`` tuple.
"""
from __future__ import annotations
import pytest
from alfred.domain.release.parser.scoring import (
Road,
collect_missing_critical,
collect_unknown_tokens,
compute_score,
decide_road,
)
from alfred.domain.release.parser.tokens import Token, TokenRole
from alfred.domain.release.services import parse_release
from alfred.domain.release.value_objects import (
MediaTypeToken,
ParsedRelease,
ParseReport,
TokenizationRoute,
)
from alfred.domain.shared.exceptions import ValidationError
from alfred.infrastructure.knowledge.release_kb import YamlReleaseKnowledge
_KB = YamlReleaseKnowledge()
# --------------------------------------------------------------------- #
# ParseReport VO #
# --------------------------------------------------------------------- #
class TestParseReport:
def test_construct_with_defaults(self) -> None:
report = ParseReport(confidence=80, road="easy")
assert report.confidence == 80
assert report.road == "easy"
assert report.unknown_tokens == ()
assert report.missing_critical == ()
def test_is_frozen(self) -> None:
report = ParseReport(confidence=50, road="shitty")
with pytest.raises(Exception): # FrozenInstanceError
report.confidence = 99 # type: ignore[misc]
def test_confidence_lower_bound(self) -> None:
with pytest.raises(ValidationError):
ParseReport(confidence=-1, road="easy")
def test_confidence_upper_bound(self) -> None:
with pytest.raises(ValidationError):
ParseReport(confidence=101, road="easy")
# --------------------------------------------------------------------- #
# compute_score #
# --------------------------------------------------------------------- #
def _movie(year: int | None = 2020, **overrides) -> ParsedRelease:
"""Build a populated movie ParsedRelease for scoring tests."""
base = dict(
raw="Inception.2010.1080p.BluRay.x264-GROUP",
clean="Inception.2010.1080p.BluRay.x264-GROUP",
title="Inception",
title_sanitized="Inception",
year=year,
season=None,
episode=None,
episode_end=None,
quality="1080p",
source="BluRay",
codec="x264",
group="GROUP",
media_type=MediaTypeToken.MOVIE,
parse_path=TokenizationRoute.DIRECT,
)
base.update(overrides)
return ParsedRelease(**base)
def _all_annotated() -> list[Token]:
"""Token stream where everything is annotated — zero penalty."""
return [
Token("Inception", 0, TokenRole.TITLE),
Token("2010", 1, TokenRole.YEAR),
Token("1080p", 2, TokenRole.RESOLUTION),
Token("BluRay", 3, TokenRole.SOURCE),
Token("x264", 4, TokenRole.CODEC),
Token("GROUP", 5, TokenRole.GROUP),
]
class TestComputeScore:
def test_fully_populated_movie_scores_high(self) -> None:
parsed = _movie()
score = compute_score(parsed, _all_annotated(), _KB)
# title 30 + media_type 20 + year 15 + resolution 5 + source 5
# + codec 5 + group 5 = 85
assert score == 85
def test_tv_show_gets_season_and_episode_weight(self) -> None:
parsed = ParsedRelease(
raw="Oz.S01E01.1080p.WEBRip.x265-KONTRAST",
clean="Oz.S01E01.1080p.WEBRip.x265-KONTRAST",
title="Oz",
title_sanitized="Oz",
year=None,
season=1,
episode=1,
episode_end=None,
quality="1080p",
source="WEBRip",
codec="x265",
group="KONTRAST",
media_type=MediaTypeToken.TV_SHOW,
parse_path=TokenizationRoute.DIRECT,
)
tokens = [
Token("Oz", 0, TokenRole.TITLE),
Token("S01E01", 1, TokenRole.SEASON_EPISODE),
Token("1080p", 2, TokenRole.RESOLUTION),
Token("WEBRip", 3, TokenRole.SOURCE),
Token("x265", 4, TokenRole.CODEC),
Token("KONTRAST", 5, TokenRole.GROUP),
]
score = compute_score(parsed, tokens, _KB)
# title 30 + media_type 20 + season 10 + episode 5 + resolution 5
# + source 5 + codec 5 + group 5 = 85 (no year)
assert score == 85
def test_unknown_tokens_subtract_penalty(self) -> None:
parsed = _movie()
tokens = _all_annotated() + [
Token("noise", 6, TokenRole.UNKNOWN),
Token("more", 7, TokenRole.UNKNOWN),
]
score = compute_score(parsed, tokens, _KB)
# 85 baseline - 2*5 unknown tokens = 75
assert score == 75
def test_unknown_penalty_capped(self) -> None:
parsed = _movie()
# 20 unknown tokens × 5 = 100 raw, capped at 30
tokens = _all_annotated() + [
Token(f"t{i}", 6 + i, TokenRole.UNKNOWN) for i in range(20)
]
score = compute_score(parsed, tokens, _KB)
assert score == 85 - 30
def test_score_clamped_to_zero(self) -> None:
# Empty-ish parse with lots of unknown tokens
parsed = _movie(year=None, quality=None, source=None, codec=None)
tokens = [Token(f"t{i}", i, TokenRole.UNKNOWN) for i in range(10)]
score = compute_score(parsed, tokens, _KB)
# title 30 + media_type 20 + group 5 = 55, minus the capped penalty of 30 = 25.
# This input cannot go negative, so only the [0, 100] bounds are asserted.
assert 0 <= score <= 100
def test_unknown_media_type_does_not_count(self) -> None:
parsed = _movie(media_type=MediaTypeToken.UNKNOWN)
score = compute_score(parsed, _all_annotated(), _KB)
# Loses the 20 of media_type vs baseline
assert score == 85 - 20
def test_unknown_group_does_not_count(self) -> None:
parsed = _movie(group="UNKNOWN")
score = compute_score(parsed, _all_annotated(), _KB)
assert score == 85 - 5
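The arithmetic in the comments above can be reproduced with a small weighted sum. The weights, the per-token penalty of 5, and the cap of 30 are read off the test comments; the function shape is an assumption, since the real `compute_score` takes a `ParsedRelease`, a token list, and the knowledge base.

```python
from __future__ import annotations

# Weights as quoted in the test comments (they sum to 100).
WEIGHTS = {
    "title": 30,
    "media_type": 20,
    "year": 15,
    "season": 10,
    "episode": 5,
    "resolution": 5,
    "source": 5,
    "codec": 5,
    "group": 5,
}
UNKNOWN_PENALTY = 5
UNKNOWN_PENALTY_CAP = 30


def score(present: set[str], unknown_count: int) -> int:
    """Sum the weights of populated fields, subtract a capped penalty, clamp."""
    earned = sum(WEIGHTS[name] for name in present)
    penalty = min(unknown_count * UNKNOWN_PENALTY, UNKNOWN_PENALTY_CAP)
    return max(0, min(100, earned - penalty))


movie = {"title", "media_type", "year", "resolution", "source", "codec", "group"}
tv = {"title", "media_type", "season", "episode",
      "resolution", "source", "codec", "group"}
```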
# --------------------------------------------------------------------- #
# decide_road #
# --------------------------------------------------------------------- #
class TestDecideRoad:
def test_known_schema_is_easy_regardless_of_score(self) -> None:
# Even a terrible score returns EASY when a schema matched.
assert decide_road(score=0, has_schema=True, kb=_KB) is Road.EASY
def test_no_schema_high_score_is_shitty(self) -> None:
assert decide_road(score=80, has_schema=False, kb=_KB) is Road.SHITTY
def test_no_schema_low_score_is_pop(self) -> None:
assert decide_road(score=10, has_schema=False, kb=_KB) is Road.PATH_OF_PAIN
def test_threshold_boundary_is_inclusive(self) -> None:
threshold = _KB.scoring["thresholds"]["shitty_min"]
assert decide_road(threshold, has_schema=False, kb=_KB) is Road.SHITTY
assert (
decide_road(threshold - 1, has_schema=False, kb=_KB)
is Road.PATH_OF_PAIN
)
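The road decision tested above reduces to two checks: a matched schema always wins, otherwise the score is compared against an inclusive threshold. In this sketch the threshold is a plain parameter because its configured value (`shitty_min` in the knowledge base) is not shown here, and the string value given to `PATH_OF_PAIN` is a guess.

```python
from enum import Enum


class Road(Enum):
    EASY = "easy"
    SHITTY = "shitty"
    PATH_OF_PAIN = "path_of_pain"  # assumption: actual value not shown


def decide_road(score: int, has_schema: bool, shitty_min: int) -> Road:
    if has_schema:
        return Road.EASY  # a known group schema wins regardless of score
    if score >= shitty_min:  # the boundary is inclusive
        return Road.SHITTY
    return Road.PATH_OF_PAIN


# 60 is an arbitrary threshold chosen for the demonstration only.
at_threshold = decide_road(60, has_schema=False, shitty_min=60)
below_threshold = decide_road(59, has_schema=False, shitty_min=60)
```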
# --------------------------------------------------------------------- #
# Collectors #
# --------------------------------------------------------------------- #
class TestCollectors:
def test_collect_unknown_tokens_preserves_order(self) -> None:
tokens = [
Token("A", 0, TokenRole.TITLE),
Token("X", 1, TokenRole.UNKNOWN),
Token("B", 2, TokenRole.RESOLUTION),
Token("Y", 3, TokenRole.UNKNOWN),
]
assert collect_unknown_tokens(tokens) == ("X", "Y")
def test_collect_missing_critical_full(self) -> None:
empty = ParsedRelease(
raw="x",
clean="x",
title="",
title_sanitized="",
year=None,
season=None,
episode=None,
episode_end=None,
quality=None,
source=None,
codec=None,
group="UNKNOWN",
media_type=MediaTypeToken.UNKNOWN,
parse_path=TokenizationRoute.DIRECT,
)
assert set(collect_missing_critical(empty)) == {
"title",
"media_type",
"year",
}
def test_collect_missing_critical_none(self) -> None:
parsed = _movie()
assert collect_missing_critical(parsed) == ()
# --------------------------------------------------------------------- #
# End-to-end contract #
# --------------------------------------------------------------------- #
class TestParseReleaseReturnsReport:
def test_returns_tuple(self) -> None:
result = parse_release("Inception.2010.1080p.BluRay.x264-GROUP", _KB)
assert isinstance(result, tuple)
assert len(result) == 2
parsed, report = result
assert isinstance(parsed, ParsedRelease)
assert isinstance(report, ParseReport)
def test_known_group_is_easy_road(self) -> None:
# KONTRAST has a schema in release_groups/
_, report = parse_release(
"Oz.S03E01.1080p.WEBRip.x265-KONTRAST", _KB
)
assert report.road == Road.EASY.value
assert report.confidence > 0
def test_unknown_group_well_formed_is_shitty(self) -> None:
# No registered schema but well-formed scene name → SHITTY
_, report = parse_release(
"Inception.2010.1080p.BluRay.x264-NOSCHEMA", _KB
)
assert report.road == Road.SHITTY.value
def test_malformed_name_is_pop(self) -> None:
# Forbidden chars (@) — short-circuits to AI / PoP.
_, report = parse_release("garbage@#%name", _KB)
assert report.road == Road.PATH_OF_PAIN.value
assert report.confidence == 0
+43 -33
@@ -20,13 +20,21 @@ import pytest
from alfred.domain.release.services import parse_release
from alfred.domain.release.value_objects import ParsedRelease
+ from alfred.infrastructure.knowledge.release_kb import YamlReleaseKnowledge
+ _KB = YamlReleaseKnowledge()
+ def _parse(name: str) -> ParsedRelease:
+ parsed, _report = parse_release(name, _KB)
+ return parsed
class TestParseTVEpisode:
"""Single-episode TV releases."""
def test_basic_tv_episode(self):
- r = parse_release("Oz.S03E01.1080p.WEBRip.x265-KONTRAST")
+ r = _parse("Oz.S03E01.1080p.WEBRip.x265-KONTRAST")
assert r.title == "Oz"
assert r.season == 3
assert r.episode == 1
@@ -40,27 +48,27 @@ class TestParseTVEpisode:
assert r.is_season_pack is False
def test_multi_episode(self):
- r = parse_release("Archer.S14E09E10.1080p.WEB.x265-GRP")
+ r = _parse("Archer.S14E09E10.1080p.WEB.x265-GRP")
assert r.season == 14
assert r.episode == 9
assert r.episode_end == 10
def test_nxnn_alt_form(self):
# Alt season/episode form: 1x05 instead of S01E05.
- r = parse_release("Some.Show.1x05.720p.HDTV.x264-GRP")
+ r = _parse("Some.Show.1x05.720p.HDTV.x264-GRP")
assert r.season == 1
assert r.episode == 5
assert r.episode_end is None
assert r.media_type == "tv_show"
def test_nxnnxnn_multi_episode_alt_form(self):
- r = parse_release("Some.Show.2x07x08.1080p.WEB.x265-GRP")
+ r = _parse("Some.Show.2x07x08.1080p.WEB.x265-GRP")
assert r.season == 2
assert r.episode == 7
assert r.episode_end == 8
def test_season_pack(self):
- r = parse_release("Oz.S03.1080p.WEBRip.x265-KONTRAST")
+ r = _parse("Oz.S03.1080p.WEBRip.x265-KONTRAST")
assert r.season == 3
assert r.episode is None
assert r.is_season_pack is True
@@ -71,7 +79,7 @@ class TestParseMovie:
"""Movie releases."""
def test_basic_movie(self):
- r = parse_release("Inception.2010.1080p.BluRay.x264-GROUP")
+ r = _parse("Inception.2010.1080p.BluRay.x264-GROUP")
assert r.title == "Inception"
assert r.year == 2010
assert r.season is None
@@ -83,13 +91,13 @@ class TestParseMovie:
assert r.media_type == "movie"
def test_movie_multi_word_title(self):
- r = parse_release("The.Dark.Knight.2008.2160p.UHD.BluRay.x265-TERMINAL")
+ r = _parse("The.Dark.Knight.2008.2160p.UHD.BluRay.x265-TERMINAL")
assert r.title == "The.Dark.Knight"
assert r.year == 2008
assert r.quality == "2160p"
def test_movie_without_year_still_movie_if_tech_present(self):
- r = parse_release("UntitledFilm.1080p.WEBRip.x264-GRP")
+ r = _parse("UntitledFilm.1080p.WEBRip.x264-GRP")
# No season, no year, but tech markers → still movie
assert r.media_type == "movie"
assert r.year is None
@@ -99,39 +107,39 @@ class TestParseEdgeCases:
"""Site tags, malformed names, and unknown media types."""
def test_site_tag_prefix_stripped(self):
- r = parse_release("[ OxTorrent.vc ] The.Title.S01E01.1080p.WEB.x265-GRP")
+ r = _parse("[ OxTorrent.vc ] The.Title.S01E01.1080p.WEB.x265-GRP")
assert r.site_tag == "OxTorrent.vc"
assert r.parse_path == "sanitized"
assert r.season == 1
assert r.episode == 1
def test_site_tag_suffix_stripped(self):
- r = parse_release("The.Title.S01E01.1080p.WEB.x265-NTb[TGx]")
+ r = _parse("The.Title.S01E01.1080p.WEB.x265-NTb[TGx]")
assert r.site_tag == "TGx"
# Suffix-tagged names are well-formed (only [] in tag → after strip clean)
assert r.season == 1
def test_irrecoverably_malformed(self):
# @ is a forbidden char and not stripped by _sanitize → stays malformed
- r = parse_release("foo@bar@baz")
+ r = _parse("foo@bar@baz")
assert r.media_type == "unknown"
assert r.parse_path == "ai"
assert r.group == "UNKNOWN"
def test_empty_unknown_when_no_evidence(self):
- r = parse_release("Some.Random.Title")
+ r = _parse("Some.Random.Title")
# No season, no year, no tech markers → unknown
assert r.media_type == "unknown"
def test_missing_group_defaults_to_unknown(self):
- r = parse_release("Movie.2020.1080p.WEBRip.x265")
+ r = _parse("Movie.2020.1080p.WEBRip.x265")
# No "-GROUP" suffix → group = "UNKNOWN"
assert r.group == "UNKNOWN"
def test_yts_bracket_release(self):
# YTS-style: spaces, parens for year, multiple bracketed tech tokens.
# The tokenizer must handle ' ', '(', ')', '[', ']' transparently.
- r = parse_release("The Father (2020) [1080p] [WEBRip] [5.1] [YTS.MX]")
+ r = _parse("The Father (2020) [1080p] [WEBRip] [5.1] [YTS.MX]")
assert r.title == "The.Father"
assert r.year == 2020
assert r.quality == "1080p"
@@ -141,7 +149,7 @@ class TestParseEdgeCases:
def test_human_friendly_spaces(self):
# Spaces as separators (no brackets).
- r = parse_release("Inception 2010 1080p BluRay x264-GROUP")
+ r = _parse("Inception 2010 1080p BluRay x264-GROUP")
assert r.title == "Inception"
assert r.year == 2010
assert r.quality == "1080p"
@@ -151,7 +159,7 @@ class TestParseEdgeCases:
def test_underscore_separators(self):
# Old usenet style: underscores between tokens.
- r = parse_release("Some_Show_S01E01_1080p_WEB_x265-GRP")
+ r = _parse("Some_Show_S01E01_1080p_WEB_x265-GRP")
assert r.season == 1
assert r.episode == 1
assert r.quality == "1080p"
@@ -162,15 +170,15 @@ class TestParseAudioVideoEdition:
"""Audio, video metadata, edition extraction."""
def test_audio_codec_and_channels(self):
- r = parse_release("Movie.2020.1080p.BluRay.DTS.5.1.x264-GRP")
+ r = _parse("Movie.2020.1080p.BluRay.DTS.5.1.x264-GRP")
assert r.audio_channels == "5.1"
def test_language_token(self):
- r = parse_release("Movie.2020.MULTI.1080p.WEBRip.x265-GRP")
+ r = _parse("Movie.2020.MULTI.1080p.WEBRip.x265-GRP")
assert "MULTI" in r.languages
def test_edition_token(self):
- r = parse_release("Movie.2020.UNRATED.1080p.BluRay.x264-GRP")
+ r = _parse("Movie.2020.UNRATED.1080p.BluRay.x264-GRP")
assert r.edition == "UNRATED"
@@ -178,19 +186,21 @@ class TestParsedReleaseFolderNames:
"""Helpers that build filesystem-safe folder/filenames."""
def _parsed_tv(self) -> ParsedRelease:
- return parse_release("Oz.S03E01.1080p.WEBRip.x265-KONTRAST")
+ return _parse("Oz.S03E01.1080p.WEBRip.x265-KONTRAST")
def _parsed_movie(self) -> ParsedRelease:
- return parse_release("Inception.2010.1080p.BluRay.x264-GROUP")
+ return _parse("Inception.2010.1080p.BluRay.x264-GROUP")
def test_show_folder_name(self):
r = self._parsed_tv()
assert r.show_folder_name("Oz", 1997) == "Oz.1997.1080p.WEBRip.x265-KONTRAST"
def test_show_folder_name_strips_windows_chars(self):
def test_show_folder_name_uses_already_safe_title(self):
# Option B: callers sanitize at the use-case boundary via
# kb.sanitize_for_fs(...) before passing the title in.
r = self._parsed_tv()
# Colons and question marks are Windows-forbidden — must be stripped.
result = r.show_folder_name("Oz: The Series?", 1997)
safe = _KB.sanitize_for_fs("Oz: The Series?")
result = r.show_folder_name(safe, 1997)
assert ":" not in result
assert "?" not in result
@@ -202,7 +212,7 @@ class TestParsedReleaseFolderNames:
assert "E01" not in result
def test_season_folder_name_multi_episode(self):
r = parse_release("Archer.S14E09E10E11.1080p.WEB.x265-GRP")
r = _parse("Archer.S14E09E10E11.1080p.WEB.x265-GRP")
result = r.season_folder_name()
assert "S14" in result
assert "E09" not in result
@@ -251,21 +261,21 @@ class TestParsedReleaseInvariants:
def test_raw_is_preserved(self):
raw = "Oz.S03E01.1080p.WEBRip.x265-KONTRAST"
r = parse_release(raw)
r = _parse(raw)
assert r.raw == raw
def test_languages_defaults_to_empty_list_not_none(self):
r = parse_release("Movie.2020.1080p.BluRay.x264-GRP")
# __post_init__ ensures languages is a list, never None
assert r.languages == []
def test_languages_defaults_to_empty_tuple_not_none(self):
r = _parse("Movie.2020.1080p.BluRay.x264-GRP")
# ``languages`` defaults to an empty tuple (frozen VO).
assert r.languages == ()
def test_tech_string_joined(self):
r = parse_release("Movie.2020.1080p.BluRay.x264-GRP")
r = _parse("Movie.2020.1080p.BluRay.x264-GRP")
assert r.tech_string == "1080p.BluRay.x264"
def test_tech_string_partial(self):
# Codec-only release (no quality/source): tech_string == codec
r = parse_release("Show.S01E01.x265-GRP")
r = _parse("Show.S01E01.x265-GRP")
assert r.tech_string == "x265"
assert r.codec == "x265"
assert r.quality is None
@@ -280,4 +290,4 @@ class TestParsedReleaseInvariants:
],
)
def test_media_type_inference(self, name, expected_type):
assert parse_release(name).media_type == expected_type
assert _parse(name).media_type == expected_type
+19 -5
@@ -19,24 +19,38 @@ from dataclasses import asdict
import pytest
from alfred.domain.release.services import parse_release
from alfred.infrastructure.knowledge.release_kb import YamlReleaseKnowledge
from tests.fixtures.releases.conftest import ReleaseFixture, discover_fixtures
_KB = YamlReleaseKnowledge()
FIXTURES = discover_fixtures()
def _fixture_param(f: ReleaseFixture) -> pytest.param:
marks = []
if f.xfail_reason:
marks.append(pytest.mark.xfail(reason=f.xfail_reason, strict=False))
return pytest.param(f, id=f.name, marks=marks)
@pytest.mark.parametrize(
"fixture",
FIXTURES,
ids=[f.name for f in FIXTURES],
[_fixture_param(f) for f in FIXTURES],
)
def test_parse_matches_fixture(fixture: ReleaseFixture, tmp_path) -> None:
# Materialize the tree to assert it is at least well-formed YAML +
# plausible filesystem paths. Catches typos / missing leading dirs early.
fixture.materialize(tmp_path)
result = asdict(parse_release(fixture.release_name))
# ``is_season_pack`` is a @property — asdict() does not include it.
result["is_season_pack"] = parse_release(fixture.release_name).is_season_pack
parsed, _report = parse_release(fixture.release_name, _KB)
result = asdict(parsed)
# ``is_season_pack`` and ``tech_string`` are @property values —
# ``asdict()`` does not include them.
result["is_season_pack"] = parsed.is_season_pack
result["tech_string"] = parsed.tech_string
# ``languages`` is a tuple on the VO; fixtures encode it as a YAML list.
# Compare list-to-list so the equality is unambiguous.
result["languages"] = list(result.get("languages", ()))
for field, expected in fixture.expected_parsed.items():
assert field in result, (
+3 -3
@@ -23,7 +23,7 @@ from unittest.mock import patch
import pytest
from alfred.domain.shared.ports import FileEntry
from alfred.domain.subtitles.entities import SubtitleCandidate
from alfred.domain.subtitles.entities import SubtitleScanResult
from alfred.domain.subtitles.services.identifier import (
SubtitleIdentifier,
_count_entries,
@@ -310,8 +310,8 @@ class TestSizeDisambiguation:
detection=TypeDetectionMethod.SIZE_AND_COUNT,
)
def _track(self, lang_code: str, entries: int) -> SubtitleCandidate:
return SubtitleCandidate(
def _track(self, lang_code: str, entries: int) -> SubtitleScanResult:
return SubtitleScanResult(
language=SubtitleLanguage(code=lang_code, tokens=[lang_code]),
format=None,
subtitle_type=SubtitleType.UNKNOWN,
+3 -3
@@ -18,7 +18,7 @@ from __future__ import annotations
import pytest
from alfred.domain.subtitles.entities import SubtitleCandidate
from alfred.domain.subtitles.entities import SubtitleScanResult
from alfred.domain.subtitles.services.matcher import SubtitleMatcher
from alfred.domain.subtitles.value_objects import (
SubtitleFormat,
@@ -40,8 +40,8 @@ def _track(
stype: SubtitleType = SubtitleType.STANDARD,
confidence: float = 1.0,
is_embedded: bool = False,
) -> SubtitleCandidate:
return SubtitleCandidate(
) -> SubtitleScanResult:
return SubtitleScanResult(
language=lang,
format=fmt,
subtitle_type=stype,
+31 -30
@@ -5,9 +5,9 @@ uncovered:
- ``TestSubtitleFormat``: extension matching (case-insensitive).
- ``TestSubtitleLanguage``: token matching (case-insensitive).
- ``TestSubtitleCandidateDestName``: ``destination_name`` property:
- ``TestSubtitleScanResultDestName``: ``destination_name`` property:
standard / SDH / forced naming, error on missing language or format.
- ``TestSubtitleCandidateRepr``: debug repr for embedded vs external.
- ``TestSubtitleScanResultRepr``: debug repr for embedded vs external.
- ``TestMediaSubtitleMetadata``: ``all_tracks`` / ``total_count`` /
``unresolved_tracks``.
- ``TestAvailableSubtitles``: utility dedup by (lang, type).
@@ -24,10 +24,11 @@ from pathlib import Path
import pytest
from alfred.domain.subtitles.aggregates import SubtitleRuleSet
from alfred.domain.subtitles.entities import MediaSubtitleMetadata, SubtitleCandidate
from alfred.domain.subtitles.entities import MediaSubtitleMetadata, SubtitleScanResult
from alfred.domain.subtitles.services.utils import available_subtitles
from alfred.domain.subtitles.value_objects import (
RuleScope,
RuleScopeLevel,
SubtitleFormat,
SubtitleLanguage,
SubtitleMatchingRules,
@@ -73,7 +74,7 @@ class TestSubtitleLanguage:
# --------------------------------------------------------------------------- #
# SubtitleCandidate #
# SubtitleScanResult #
# --------------------------------------------------------------------------- #
@@ -81,50 +82,50 @@ SRT = SubtitleFormat(id="srt", extensions=[".srt"])
FRA = SubtitleLanguage(code="fra", tokens=["fr", "fre"])
class TestSubtitleCandidateDestName:
class TestSubtitleScanResultDestName:
def test_standard(self):
t = SubtitleCandidate(
t = SubtitleScanResult(
language=FRA, format=SRT, subtitle_type=SubtitleType.STANDARD
)
assert t.destination_name == "fra.srt"
def test_sdh(self):
t = SubtitleCandidate(language=FRA, format=SRT, subtitle_type=SubtitleType.SDH)
t = SubtitleScanResult(language=FRA, format=SRT, subtitle_type=SubtitleType.SDH)
assert t.destination_name == "fra.sdh.srt"
def test_forced(self):
t = SubtitleCandidate(
t = SubtitleScanResult(
language=FRA, format=SRT, subtitle_type=SubtitleType.FORCED
)
assert t.destination_name == "fra.forced.srt"
def test_unknown_treated_as_standard(self):
t = SubtitleCandidate(
t = SubtitleScanResult(
language=FRA, format=SRT, subtitle_type=SubtitleType.UNKNOWN
)
# UNKNOWN doesn't add a suffix → same as standard.
assert t.destination_name == "fra.srt"
def test_missing_language_raises(self):
t = SubtitleCandidate(language=None, format=SRT)
t = SubtitleScanResult(language=None, format=SRT)
with pytest.raises(ValueError, match="language or format missing"):
t.destination_name
def test_missing_format_raises(self):
t = SubtitleCandidate(language=FRA, format=None)
t = SubtitleScanResult(language=FRA, format=None)
with pytest.raises(ValueError, match="language or format missing"):
t.destination_name
def test_extension_dot_stripped(self):
# Format extension is ".srt" — leading dot must not be duplicated.
t = SubtitleCandidate(language=FRA, format=SRT)
t = SubtitleScanResult(language=FRA, format=SRT)
assert t.destination_name.endswith(".srt")
assert ".." not in t.destination_name
class TestSubtitleCandidateRepr:
class TestSubtitleScanResultRepr:
def test_embedded_repr(self):
t = SubtitleCandidate(
t = SubtitleScanResult(
language=FRA, format=None, is_embedded=True, confidence=1.0
)
r = repr(t)
@@ -134,14 +135,14 @@ class TestSubtitleCandidateRepr:
def test_external_repr_uses_filename(self, tmp_path):
f = tmp_path / "fr.srt"
f.write_text("")
t = SubtitleCandidate(language=FRA, format=SRT, file_path=f, confidence=0.85)
t = SubtitleScanResult(language=FRA, format=SRT, file_path=f, confidence=0.85)
r = repr(t)
assert "fra" in r
assert "fr.srt" in r
assert "0.85" in r
def test_unresolved_repr(self):
t = SubtitleCandidate(language=None, format=None)
t = SubtitleScanResult(language=None, format=None)
r = repr(t)
assert "?" in r
@@ -159,8 +160,8 @@ class TestMediaSubtitleMetadata:
assert m.unresolved_tracks == []
def test_aggregates_embedded_and_external(self):
e = SubtitleCandidate(language=FRA, format=None, is_embedded=True)
x = SubtitleCandidate(language=FRA, format=SRT, file_path=Path("/x.srt"))
e = SubtitleScanResult(language=FRA, format=None, is_embedded=True)
x = SubtitleScanResult(language=FRA, format=SRT, file_path=Path("/x.srt"))
m = MediaSubtitleMetadata(
media_id=None,
media_type="movie",
@@ -173,13 +174,13 @@ class TestMediaSubtitleMetadata:
def test_unresolved_tracks_only_external_with_none_lang(self):
# An embedded with None language must NOT appear in unresolved_tracks
# (the property only iterates external_tracks).
embedded_unknown = SubtitleCandidate(
embedded_unknown = SubtitleScanResult(
language=None, format=None, is_embedded=True
)
external_known = SubtitleCandidate(
external_known = SubtitleScanResult(
language=FRA, format=SRT, file_path=Path("/a.srt")
)
external_unknown = SubtitleCandidate(
external_unknown = SubtitleScanResult(
language=None, format=SRT, file_path=Path("/b.srt")
)
m = MediaSubtitleMetadata(
@@ -200,14 +201,14 @@ class TestAvailableSubtitles:
def test_dedup_by_lang_and_type(self):
ENG = SubtitleLanguage(code="eng", tokens=["en"])
tracks = [
SubtitleCandidate(
SubtitleScanResult(
language=FRA, format=SRT, subtitle_type=SubtitleType.STANDARD
),
SubtitleCandidate(
SubtitleScanResult(
language=FRA, format=SRT, subtitle_type=SubtitleType.STANDARD
),
SubtitleCandidate(language=FRA, format=SRT, subtitle_type=SubtitleType.SDH),
SubtitleCandidate(
SubtitleScanResult(language=FRA, format=SRT, subtitle_type=SubtitleType.SDH),
SubtitleScanResult(
language=ENG, format=SRT, subtitle_type=SubtitleType.STANDARD
),
]
@@ -221,10 +222,10 @@ class TestAvailableSubtitles:
def test_none_language_treated_as_key(self):
# Tracks with no language form a single None-keyed bucket.
t1 = SubtitleCandidate(
t1 = SubtitleScanResult(
language=None, format=SRT, subtitle_type=SubtitleType.UNKNOWN
)
t2 = SubtitleCandidate(
t2 = SubtitleScanResult(
language=None, format=SRT, subtitle_type=SubtitleType.UNKNOWN
)
result = available_subtitles([t1, t2])
@@ -257,7 +258,7 @@ class TestSubtitleRuleSet:
def test_override_partial_keeps_parent_for_unset_fields(self):
parent = SubtitleRuleSet.global_default()
child = SubtitleRuleSet(
scope=RuleScope(level="show", identifier="tt1"),
scope=RuleScope(level=RuleScopeLevel.SHOW, identifier="tt1"),
parent=parent,
)
child.override(languages=["jpn"])
@@ -267,14 +268,14 @@ class TestSubtitleRuleSet:
assert rules.min_confidence == parent.resolve(_DEFAULT_RULES).min_confidence
def test_to_dict_only_emits_set_deltas(self):
rs = SubtitleRuleSet(scope=RuleScope(level="show", identifier="tt1"))
rs = SubtitleRuleSet(scope=RuleScope(level=RuleScopeLevel.SHOW, identifier="tt1"))
rs.override(languages=["fra"])
out = rs.to_dict()
assert out["scope"] == {"level": "show", "identifier": "tt1"}
assert out["override"] == {"languages": ["fra"]}
def test_to_dict_full_override(self):
rs = SubtitleRuleSet(scope=RuleScope(level="global"))
rs = SubtitleRuleSet(scope=RuleScope(level=RuleScopeLevel.GLOBAL))
rs.override(
languages=["fra"],
formats=["srt"],
+8
@@ -39,6 +39,14 @@ class ReleaseFixture:
def routing(self) -> dict:
return self.data.get("routing", {})
@property
def xfail_reason(self) -> str | None:
"""If set, the fixture is expected to fail — wrapped with
``pytest.mark.xfail`` by the test runner. Used for known
not-supported pathological cases (typically PATH OF PAIN bucket).
"""
return self.data.get("xfail_reason")
def materialize(self, root: Path) -> None:
"""Create the fixture's ``tree`` as empty files/dirs under ``root``."""
for entry in self.tree:
@@ -1,5 +1,10 @@
release_name: "Deutschland 83-86-89 (2015) Season 1-3 S01-S03 (1080p BluRay x265 HEVC 10bit AAC 5.1 German Kappa)"
# Out of SHITTY scope by design: parenthesized tech blocks, group name as
# the last bare word inside parens, year-suffix range in title, dual
# season expression. PATH OF PAIN handles this via LLM pre-analysis.
xfail_reason: "PoP-grade pathological franchise box-set, beyond simple-dict SHITTY"
# Pathological franchise box-set:
# - Title contains year-suffix range "83-86-89" (3 years glued)
# - Season range expressed twice: "Season 1-3" AND "S01-S03"
@@ -1,13 +1,15 @@
release_name: "Khruangbin Austin City Limits Music Festival 2024 Full Set [V_-7WWPPeBs].webm"
# yt-dlp slug: UTF-8 wide pipe '｜' (U+FF5C, not the ASCII '|'), trailing
# YouTube video ID in brackets, .webm extension. Parser extracts the year
# (2024) correctly but mistakes the YouTube ID '7WWPPeBs' for a release
# group, and the wide pipe survives the tokenizer (not a separator).
# YouTube video ID in brackets, .webm extension. The wide pipe survives
# the tokenizer (not a separator) but is now dropped at title assembly
# (pure-punctuation TITLE tokens carry no content). Year (2024) parses
# correctly; the YouTube ID '7WWPPeBs' is still mistaken for a release
# group (separate gap, see PoP backlog).
# This is a concert recording — closer to "live music" than "movie", but
# media_type=movie is the current degenerate best guess.
parsed:
title: "Khruangbin..Austin.City.Limits.Music.Festival"
title: "Khruangbin.Austin.City.Limits.Music.Festival"
year: 2024
season: null
episode: null
@@ -1,5 +1,10 @@
release_name: "Predator Badlands 2025 1080p HDRip HEVC x265 BONE"
# Space-separated release with both codec aliases present (HEVC + x265)
# and no dash-before-group. Simple-SHITTY first-wins picks HEVC, expected
# was x265 (legacy last-wins). Reclassified PoP.
xfail_reason: "Space-separated, dual codec aliases, no dashed group"
# Space-separated release: tokenizer correctly splits and identifies year +
# tech, but the dash-before-group convention is absent so 'BONE' is not
# recognized as the group — falls to UNKNOWN. Anti-regression baseline.
@@ -1,5 +1,9 @@
release_name: "SLEAFORD MODS Live Glastonbury June 27th 2015-niNjHn8abyY.mp4"
# YouTube-style slug with year-prefixed video-id dash suffix. Not a scene
# release shape at all — PATH OF PAIN.
xfail_reason: "YouTube slug with year-prefixed video-id, not a scene shape"
# yt-dlp filename: triple space between band name and event, no canonical
# tech markers, dashed YouTube video ID glued to the year, .mp4 extension
# preserved in the title. Parser:
@@ -1,5 +1,10 @@
release_name: "Super Mario Bros. le film [FR-EN] (2023).mkv"
# Bare-dashed language pair interior to the title (``[FR-EN]``) is tagged
# as group by ``_detect_group``, leaving the title fragment behind.
# Out of simple-SHITTY scope.
xfail_reason: "Interior bare-dashed language pair confuses group detection"
# Hybrid English/French marketing title with:
# - Trailing period after 'Bros' that is part of the title abbreviation
# (not a separator), but tokenizer treats it as one
@@ -1,28 +1,26 @@
release_name: "The Prodigy World's on Fire 2011 Blu-ray Remux 1080i AVC DTS-HD MA 5.1 - KRaLiMaRKo.mkv"
# Apocalypse case combining every horror:
# - Unescaped apostrophe ("World's") → forces parse_path="ai" fallback
# - Spaces AND dashes used as separators inconsistently
# - "Blu-ray" with a dash (vs. canonical BluRay)
# - "1080i" interlaced flag (not 1080p)
# - "DTS-HD MA 5.1" multi-word audio codec
# - " - GROUP.mkv" trailing format (space-dash-space before group)
# Apocalypse case combining every horror — partially tamed by the
# apostrophe fix. Remaining gaps (still PoP-worthy):
# - "1080i" interlaced flag (not in quality KB)
# - "Blu-ray" with a dash (vs. canonical BluRay) — recognized as source
# but with the dash form
# - "DTS-HD MA 5.1" multi-word audio codec — the trailing "HD" leaks
# into the group
# - Trailing .mkv extension survives in title
# Result: total degeneration — UNKNOWN across the board, title=raw input.
# Once the apostrophe + multi-word-audio + 1080i are handled this fixture
# should be revisited. For now: anti-regression of the failure shape.
# - " - GROUP" trailing format (space-dash-space before group)
parsed:
title: "The Prodigy World's on Fire 2011 Blu-ray Remux 1080i AVC DTS-HD MA 5.1 - KRaLiMaRKo.mkv"
year: null
title: "The.Prodigy.Worlds.on.Fire"
year: 2011
season: null
episode: null
quality: null
source: null
codec: null
group: "UNKNOWN"
tech_string: ""
media_type: "unknown"
parse_path: "ai"
source: "Blu-ray"
codec: "AVC"
group: "HD"
tech_string: "Blu-ray.AVC"
media_type: "movie"
parse_path: "sanitized"
is_season_pack: false
tree:
@@ -1,14 +1,13 @@
release_name: "Archer.S14E09E10E11.1080p.WEB.h264-ETHEL"
# Tech debt: triple-episode chain (E09E10E11) — current parser captures
# episode=9 and episode_end=10, but E11 is lost. Anti-regression: lock in
# the partial behavior so any future improvement is intentional.
# Triple-episode chain (E09E10E11) — the parser collapses the chain to a
# range (episode=first, episode_end=last). Intermediate values are implied.
parsed:
title: "Archer"
year: null
season: 14
episode: 9
episode_end: 10
episode_end: 11
quality: "1080p"
source: "WEB"
codec: "h264"
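The chain-to-range collapse described in the fixture comment above can be sketched as follows. This is a minimal illustration, not the parser's actual code: the function name, the regular expression, and the choice to return ``None`` for ``episode_end`` on a single-episode token are all assumptions.

```python
from __future__ import annotations

import re

# Hypothetical pattern: a season marker followed by one or more episode markers.
_CHAIN = re.compile(r"^S(\d{1,2})((?:E\d{1,3})+)$", re.IGNORECASE)


def collapse_episode_chain(token: str) -> tuple[int, int, int | None] | None:
    """Return (season, episode, episode_end), or None if the token is not SxxEyy...

    A chain such as E09E10E11 is collapsed to its first and last values;
    the intermediate episodes are implied by the range.
    """
    m = _CHAIN.match(token)
    if m is None:
        return None
    season = int(m.group(1))
    episodes = [int(e) for e in re.findall(r"E(\d{1,3})", m.group(2), re.IGNORECASE)]
    first, last = episodes[0], episodes[-1]
    # Assumption: a single-episode token carries no episode_end.
    return season, first, (last if len(episodes) > 1 else None)
```

For the fixture's token, ``collapse_episode_chain("S14E09E10E11")`` yields ``(14, 9, 11)``, matching ``episode: 9`` and ``episode_end: 11``.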
+14 -13
@@ -1,21 +1,22 @@
release_name: "Honey.Don't.2025.2160p.WEBRip.DSNP.DV.HDR.x265.EAC3.5.1-Amen"
# Tech debt: the unescaped apostrophe in "Don't" pushes the whole release
# through the AI fallback path (parse_path="ai") and the parse degenerates to
# UNKNOWN across the board. Anti-regression here — once the tokenizer learns
# to handle apostrophes, this fixture should be revisited.
# Apostrophes inside titles ("Don't", "L'avare") used to push the release
# through the AI fallback (parse_path="ai", everything UNKNOWN). They are
# now pre-stripped before well-formed check and tokenize, so the parse
# completes normally — only the title text loses its apostrophe
# ("Honey.Dont").
parsed:
title: "Honey.Don't.2025.2160p.WEBRip.DSNP.DV.HDR.x265.EAC3.5.1-Amen"
year: null
title: "Honey.Dont"
year: 2025
season: null
episode: null
quality: null
source: null
codec: null
group: "UNKNOWN"
tech_string: ""
media_type: "unknown"
parse_path: "ai"
quality: "2160p"
source: "WEBRip"
codec: "x265"
group: "Amen"
tech_string: "2160p.WEBRip.x265"
media_type: "movie"
parse_path: "sanitized"
is_season_pack: false
tree:
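The apostrophe pre-strip described in the fixture comment above might look roughly like this. It is a sketch under assumptions: the helper name is hypothetical, and handling the typographic apostrophes alongside the ASCII one is a guess rather than something the fixture states.

```python
# Assumption: the ASCII apostrophe plus the two typographic single quotes.
_APOSTROPHES = ("'", "\u2019", "\u2018")


def strip_apostrophes(release_name: str) -> str:
    """Remove apostrophes so the later well-formed check and tokenizer never see them."""
    for ch in _APOSTROPHES:
        release_name = release_name.replace(ch, "")
    return release_name
```

Applied to the fixture, ``strip_apostrophes("Honey.Don't.2025.2160p")`` gives ``"Honey.Dont.2025.2160p"``, which is why the expected title loses its apostrophe while the rest of the parse completes normally.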
@@ -1,7 +1,8 @@
release_name: "Notre.planete.s01e01.1080p.NF.WEB-DL.DDP5.1.x264-NTb"
# Lowercase 's01e01' and lowercased title word ('planete') correctly parsed.
# NF (Netflix) source tag is not in the source KB — drops; WEB-DL wins.
# NF is the Netflix streaming distributor (separate dimension from source);
# WEB-DL is the encoding source.
parsed:
title: "Notre.planete"
year: null
@@ -11,6 +12,7 @@ parsed:
source: "WEB-DL"
codec: "x264"
group: "NTb"
distributor: "NF"
tech_string: "1080p.WEB-DL.x264"
media_type: "tv_show"
parse_path: "direct"
@@ -1,22 +1,22 @@
release_name: "Der.Tatortreiniger.S01-06.GERMAN.1080p.WEB.x264-WAYNE"
# Tech debt: range syntax 'S01-06' is not recognized as TV — falls through
# to media_type=movie with the range glued onto the title. Captured here so a
# future ranger-aware parser change is intentional.
# Range syntax 'S01-06' is now recognized as a season-range marker:
# season=1 (first of the range), media_type=tv_complete, and the token
# no longer leaks into the title.
parsed:
title: "Der.Tatortreiniger.S01-06"
title: "Der.Tatortreiniger"
year: null
season: null
season: 1
episode: null
quality: "1080p"
source: "WEB"
codec: "x264"
group: "WAYNE"
tech_string: "1080p.WEB.x264"
media_type: "movie"
media_type: "tv_complete"
languages: ["GERMAN"]
parse_path: "direct"
is_season_pack: false
is_season_pack: true
tree:
- "Der.Tatortreiniger.S01-06.GERMAN.1080p.WEB.x264-WAYNE/"
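A season-range marker such as ``S01-06`` could be recognized with a small matcher like the one below. This is only a sketch: the pattern and function name are hypothetical, and accepting the ``S01-S03`` spelling (seen in the Deutschland fixture's release name) is an assumption about what the real parser tolerates.

```python
from __future__ import annotations

import re

# Hypothetical pattern: "S01-06" or "S01-S06".
_SEASON_RANGE = re.compile(r"^S(\d{1,2})-S?(\d{1,2})$", re.IGNORECASE)


def parse_season_range(token: str) -> tuple[int, int] | None:
    """Return (first_season, last_season), or None if the token is not a range."""
    m = _SEASON_RANGE.match(token)
    if m is None:
        return None
    first, last = int(m.group(1)), int(m.group(2))
    if last < first:
        # Assumption: a descending range is treated as not-a-range.
        return None
    return first, last
```

With this shape, ``S01-06`` resolves to ``(1, 6)``; the fixture keeps only the first value as ``season: 1`` and flags the release as a season pack.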
@@ -1,11 +1,12 @@
release_name: "Vinyl - 1x01 - FHD"
# Tech debt: surrounding ' - ' separators leave a stray '-' token attached
# to the title ("Vinyl.-"). NxNN form correctly identifies S01E01; everything
# tech-side empty (no quality token in KB — "FHD" not yet known). Anti-regression
# the current degenerate title so a future fix is intentional.
# Surrounding ' - ' separators in human-friendly release names left stray
# '-' tokens attached to the title. They are now dropped at assembly time
# (pure-punctuation TITLE tokens carry no content). NxNN form correctly
# identifies S01E01; tech-side stays empty (no quality token in KB — "FHD"
# not yet known).
parsed:
title: "Vinyl.-"
title: "Vinyl"
year: null
season: 1
episode: 1
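The "pure-punctuation TITLE tokens carry no content" rule mentioned in the comment above can be illustrated with a short filter. The function name and the use of ``str.isalnum`` as the content test are assumptions made for this sketch, not the project's implementation.

```python
def assemble_title(title_tokens: list[str]) -> str:
    """Join title tokens with dots, dropping tokens that hold no letter or digit."""
    kept = [tok for tok in title_tokens if any(ch.isalnum() for ch in tok)]
    return ".".join(kept)
```

Under this rule ``["Vinyl", "-"]`` assembles to ``"Vinyl"``, and a stray fullwidth bar between two words is dropped in the same way instead of producing a doubled dot.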
+155
@@ -0,0 +1,155 @@
"""Tests for :class:`FfprobeMediaProber`.
Covers the full-probe path (``probe()`` returning a ``MediaInfo``) by
patching ``subprocess.run`` at the adapter module level. The
subtitle-streams path is exercised by the subtitle domain tests via
the same adapter.
"""
from __future__ import annotations
import json
import subprocess
from unittest.mock import MagicMock, patch
from alfred.infrastructure.probe import FfprobeMediaProber
_PROBER = FfprobeMediaProber()
_PATCH_TARGET = "alfred.infrastructure.probe.ffprobe_prober.subprocess.run"
def _ffprobe_result(returncode=0, stdout="{}", stderr="") -> MagicMock:
return MagicMock(returncode=returncode, stdout=stdout, stderr=stderr)
class TestProbe:
def test_timeout_returns_none(self, tmp_path):
f = tmp_path / "x.mkv"
f.write_bytes(b"")
with patch(
_PATCH_TARGET,
side_effect=subprocess.TimeoutExpired(cmd="ffprobe", timeout=30),
):
assert _PROBER.probe(f) is None
def test_nonzero_returncode_returns_none(self, tmp_path):
f = tmp_path / "x.mkv"
f.write_bytes(b"")
with patch(
_PATCH_TARGET,
return_value=_ffprobe_result(returncode=1, stderr="not a media file"),
):
assert _PROBER.probe(f) is None
def test_invalid_json_returns_none(self, tmp_path):
f = tmp_path / "x.mkv"
f.write_bytes(b"")
with patch(
_PATCH_TARGET,
return_value=_ffprobe_result(stdout="not json {"),
):
assert _PROBER.probe(f) is None
def test_parses_format_duration_and_bitrate(self, tmp_path):
f = tmp_path / "x.mkv"
f.write_bytes(b"")
payload = {
"format": {"duration": "1234.5", "bit_rate": "5000000"},
"streams": [],
}
with patch(
_PATCH_TARGET,
return_value=_ffprobe_result(stdout=json.dumps(payload)),
):
info = _PROBER.probe(f)
assert info is not None
assert info.duration_seconds == 1234.5
assert info.bitrate_kbps == 5000 # bit_rate // 1000
def test_invalid_numeric_format_fields_skipped(self, tmp_path):
f = tmp_path / "x.mkv"
f.write_bytes(b"")
payload = {
"format": {"duration": "garbage", "bit_rate": "also-bad"},
"streams": [],
}
with patch(
_PATCH_TARGET,
return_value=_ffprobe_result(stdout=json.dumps(payload)),
):
info = _PROBER.probe(f)
assert info is not None
assert info.duration_seconds is None
assert info.bitrate_kbps is None
def test_parses_streams(self, tmp_path):
f = tmp_path / "x.mkv"
f.write_bytes(b"")
payload = {
"format": {},
"streams": [
{
"index": 0,
"codec_type": "video",
"codec_name": "h264",
"width": 1920,
"height": 1080,
},
{
"index": 1,
"codec_type": "audio",
"codec_name": "ac3",
"channels": 6,
"channel_layout": "5.1",
"tags": {"language": "eng"},
"disposition": {"default": 1},
},
{
"index": 2,
"codec_type": "audio",
"codec_name": "aac",
"channels": 2,
"tags": {"language": "fra"},
},
{
"index": 3,
"codec_type": "subtitle",
"codec_name": "subrip",
"tags": {"language": "fra"},
"disposition": {"forced": 1},
},
],
}
with patch(
_PATCH_TARGET,
return_value=_ffprobe_result(stdout=json.dumps(payload)),
):
info = _PROBER.probe(f)
assert info.video_codec == "h264"
assert info.width == 1920 and info.height == 1080
assert len(info.audio_tracks) == 2
eng = info.audio_tracks[0]
assert eng.language == "eng"
assert eng.is_default is True
assert info.audio_tracks[1].is_default is False
assert len(info.subtitle_tracks) == 1
assert info.subtitle_tracks[0].is_forced is True
def test_first_video_stream_wins(self, tmp_path):
# The implementation only fills video_codec on the FIRST video stream.
f = tmp_path / "x.mkv"
f.write_bytes(b"")
payload = {
"format": {},
"streams": [
{"codec_type": "video", "codec_name": "h264", "width": 1920},
{"codec_type": "video", "codec_name": "hevc", "width": 3840},
],
}
with patch(
_PATCH_TARGET,
return_value=_ffprobe_result(stdout=json.dumps(payload)),
):
info = _PROBER.probe(f)
assert info.video_codec == "h264"
assert info.width == 1920
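The behaviour pinned down by ``test_parses_format_duration_and_bitrate`` and ``test_invalid_numeric_format_fields_skipped`` could be implemented along these lines. This is a hedged sketch: ``parse_format_fields`` is a hypothetical helper, not a function of ``FfprobeMediaProber``, and only the ``bit_rate // 1000`` conversion is taken from the tests.

```python
from __future__ import annotations


def parse_format_fields(fmt: dict) -> tuple[float | None, int | None]:
    """Extract (duration_seconds, bitrate_kbps) from ffprobe's ``format`` object.

    ffprobe reports both values as strings; anything missing or
    unparseable is skipped (left as None) instead of raising.
    """
    duration: float | None = None
    bitrate_kbps: int | None = None
    try:
        duration = float(fmt["duration"])
    except (KeyError, TypeError, ValueError):
        pass
    try:
        bitrate_kbps = int(fmt["bit_rate"]) // 1000
    except (KeyError, TypeError, ValueError):
        pass
    return duration, bitrate_kbps
```

Parsing each field in its own ``try`` block means one bad value does not discard the other.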
+11 -152
View File
@@ -1,21 +1,19 @@
"""Tests for the smaller ``alfred.infrastructure.filesystem`` helpers.
Covers four siblings of ``FileManager`` that had near-zero coverage:
Covers three siblings of ``FileManager`` that had near-zero coverage:
- ``ffprobe.probe``: wraps ``ffprobe`` JSON output into a ``MediaInfo``.
- ``filesystem_operations.create_folder`` / ``move``: thin
``mkdir`` / ``mv`` wrappers returning dict-shaped responses.
- ``organizer.MediaOrganizer``: computes destination paths for movies
and TV episodes; creates folders for them.
- ``find_video.find_video_file``: first-video lookup in a folder.
External commands (``ffprobe`` / ``mv``) are patched via ``subprocess.run``.
(``ffprobe`` coverage now lives in ``test_ffprobe_prober.py`` alongside
its adapter.)
"""
from __future__ import annotations
import json
import subprocess
from unittest.mock import MagicMock, patch
from alfred.domain.movies.entities import Movie
@@ -27,154 +25,15 @@ from alfred.domain.tv_shows.value_objects import (
SeasonNumber,
ShowStatus,
)
from alfred.infrastructure.filesystem import ffprobe
from alfred.infrastructure.filesystem.filesystem_operations import (
create_folder,
move,
)
from alfred.infrastructure.filesystem.find_video import find_video_file
from alfred.infrastructure.filesystem.organizer import MediaOrganizer
from alfred.infrastructure.knowledge.release_kb import YamlReleaseKnowledge
# --------------------------------------------------------------------------- #
# ffprobe.probe #
# --------------------------------------------------------------------------- #
def _ffprobe_result(returncode=0, stdout="{}", stderr="") -> MagicMock:
return MagicMock(returncode=returncode, stdout=stdout, stderr=stderr)
class TestFfprobe:
def test_timeout_returns_none(self, tmp_path):
f = tmp_path / "x.mkv"
f.write_bytes(b"")
with patch(
"alfred.infrastructure.filesystem.ffprobe.subprocess.run",
side_effect=subprocess.TimeoutExpired(cmd="ffprobe", timeout=30),
):
assert ffprobe.probe(f) is None
def test_nonzero_returncode_returns_none(self, tmp_path):
f = tmp_path / "x.mkv"
f.write_bytes(b"")
with patch(
"alfred.infrastructure.filesystem.ffprobe.subprocess.run",
return_value=_ffprobe_result(returncode=1, stderr="not a media file"),
):
assert ffprobe.probe(f) is None
def test_invalid_json_returns_none(self, tmp_path):
f = tmp_path / "x.mkv"
f.write_bytes(b"")
with patch(
"alfred.infrastructure.filesystem.ffprobe.subprocess.run",
return_value=_ffprobe_result(stdout="not json {"),
):
assert ffprobe.probe(f) is None
def test_parses_format_duration_and_bitrate(self, tmp_path):
f = tmp_path / "x.mkv"
f.write_bytes(b"")
payload = {
"format": {"duration": "1234.5", "bit_rate": "5000000"},
"streams": [],
}
with patch(
"alfred.infrastructure.filesystem.ffprobe.subprocess.run",
return_value=_ffprobe_result(stdout=json.dumps(payload)),
):
info = ffprobe.probe(f)
assert info is not None
assert info.duration_seconds == 1234.5
assert info.bitrate_kbps == 5000 # bit_rate // 1000
def test_invalid_numeric_format_fields_skipped(self, tmp_path):
f = tmp_path / "x.mkv"
f.write_bytes(b"")
payload = {
"format": {"duration": "garbage", "bit_rate": "also-bad"},
"streams": [],
}
with patch(
"alfred.infrastructure.filesystem.ffprobe.subprocess.run",
return_value=_ffprobe_result(stdout=json.dumps(payload)),
):
info = ffprobe.probe(f)
assert info is not None
assert info.duration_seconds is None
assert info.bitrate_kbps is None
def test_parses_streams(self, tmp_path):
f = tmp_path / "x.mkv"
f.write_bytes(b"")
payload = {
"format": {},
"streams": [
{
"index": 0,
"codec_type": "video",
"codec_name": "h264",
"width": 1920,
"height": 1080,
},
{
"index": 1,
"codec_type": "audio",
"codec_name": "ac3",
"channels": 6,
"channel_layout": "5.1",
"tags": {"language": "eng"},
"disposition": {"default": 1},
},
{
"index": 2,
"codec_type": "audio",
"codec_name": "aac",
"channels": 2,
"tags": {"language": "fra"},
},
{
"index": 3,
"codec_type": "subtitle",
"codec_name": "subrip",
"tags": {"language": "fra"},
"disposition": {"forced": 1},
},
],
}
with patch(
"alfred.infrastructure.filesystem.ffprobe.subprocess.run",
return_value=_ffprobe_result(stdout=json.dumps(payload)),
):
info = ffprobe.probe(f)
assert info.video_codec == "h264"
assert info.width == 1920 and info.height == 1080
assert len(info.audio_tracks) == 2
eng = info.audio_tracks[0]
assert eng.language == "eng"
assert eng.is_default is True
assert info.audio_tracks[1].is_default is False
assert len(info.subtitle_tracks) == 1
assert info.subtitle_tracks[0].is_forced is True
def test_first_video_stream_wins(self, tmp_path):
# The implementation only fills video_codec on the FIRST video stream.
f = tmp_path / "x.mkv"
f.write_bytes(b"")
payload = {
"format": {},
"streams": [
{"codec_type": "video", "codec_name": "h264", "width": 1920},
{"codec_type": "video", "codec_name": "hevc", "width": 3840},
],
}
with patch(
"alfred.infrastructure.filesystem.ffprobe.subprocess.run",
return_value=_ffprobe_result(stdout=json.dumps(payload)),
):
info = ffprobe.probe(f)
assert info.video_codec == "h264"
assert info.width == 1920
_KB = YamlReleaseKnowledge()
# --------------------------------------------------------------------------- #
@@ -263,35 +122,35 @@ class TestFindVideo:
def test_returns_file_directly_when_video(self, tmp_path):
f = tmp_path / "Movie.mkv"
f.write_bytes(b"")
assert find_video_file(f) == f
assert find_video_file(f, _KB) == f
def test_returns_none_when_file_is_not_video(self, tmp_path):
f = tmp_path / "notes.txt"
f.write_text("x")
assert find_video_file(f) is None
assert find_video_file(f, _KB) is None
def test_returns_none_when_folder_has_no_video(self, tmp_path):
(tmp_path / "a.txt").write_text("x")
assert find_video_file(tmp_path) is None
assert find_video_file(tmp_path, _KB) is None
def test_returns_first_sorted_video(self, tmp_path):
(tmp_path / "B.mkv").write_bytes(b"")
(tmp_path / "A.mkv").write_bytes(b"")
(tmp_path / "C.mkv").write_bytes(b"")
found = find_video_file(tmp_path)
found = find_video_file(tmp_path, _KB)
assert found.name == "A.mkv"
def test_recurses_into_subfolders(self, tmp_path):
sub = tmp_path / "sub"
sub.mkdir()
(sub / "X.mkv").write_bytes(b"")
found = find_video_file(tmp_path)
found = find_video_file(tmp_path, _KB)
assert found is not None and found.name == "X.mkv"
def test_case_insensitive_extension(self, tmp_path):
f = tmp_path / "Movie.MKV"
f.write_bytes(b"")
assert find_video_file(f) == f
assert find_video_file(f, _KB) == f
# --------------------------------------------------------------------------- #
@@ -0,0 +1,82 @@
"""Tests for ``LanguageRegistry`` — the YAML-backed adapter for the
:class:`alfred.domain.shared.ports.LanguageRepository` port.
The port is structural (Protocol), so the assertion that the adapter
satisfies it is a static one; we exercise the public surface here and
let mypy / runtime polymorphism do the rest.
"""
from __future__ import annotations
from alfred.domain.shared.ports import LanguageRepository
from alfred.domain.shared.value_objects import Language
from alfred.infrastructure.knowledge.language_registry import LanguageRegistry
def _registry() -> LanguageRepository:
"""Return a fresh registry typed as the port — proves structural fit."""
return LanguageRegistry()
class TestPortSurface:
def test_satisfies_protocol(self):
# If LanguageRegistry diverged from LanguageRepository, the annotation
# below would already be wrong at type-check time; at runtime, this
# just confirms the methods exist.
reg: LanguageRepository = LanguageRegistry()
assert hasattr(reg, "from_iso")
assert hasattr(reg, "from_any")
assert hasattr(reg, "all")
def test_len_reflects_loaded_entries(self):
reg = _registry()
# The builtin YAML ships dozens of languages — exact count drifts
# with knowledge updates, so just sanity-check it's non-empty.
assert len(reg) > 0
class TestFromIso:
def test_known_iso_returns_language(self):
reg = _registry()
fre = reg.from_iso("fre")
assert isinstance(fre, Language)
assert fre.iso == "fre"
def test_case_insensitive(self):
reg = _registry()
assert reg.from_iso("FRE") == reg.from_iso("fre")
def test_unknown_iso_returns_none(self):
assert _registry().from_iso("zzz") is None
def test_non_string_returns_none(self):
assert _registry().from_iso(None) is None # type: ignore[arg-type]
class TestFromAny:
def test_english_name(self):
reg = _registry()
lang = reg.from_any("French")
assert lang is not None
assert lang.iso == "fre"
def test_iso_639_1_alias(self):
# "fr" is the 639-1 form, registered as an alias.
reg = _registry()
lang = reg.from_any("fr")
assert lang is not None
assert lang.iso == "fre"
def test_unknown_returns_none(self):
assert _registry().from_any("vostfr") is None
def test_non_string_returns_none(self):
assert _registry().from_any(123) is None # type: ignore[arg-type]
class TestMembership:
def test_contains_known(self):
assert "english" in _registry()
def test_does_not_contain_unknown(self):
assert "klingon" not in _registry()
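The structural-port idea these tests rely on can be shown in a minimal sketch: an adapter that never inherits from the port but still satisfies it by shape. All names here (`Lang`, `LangRepository`, `InMemoryRegistry`) are hypothetical stand-ins, not the project's classes. The sketch also uses `typing.runtime_checkable`, which is a swapped-in alternative to the `hasattr` checks in the tests above; nothing in the diff indicates the real port is decorated that way.

```python
from dataclasses import dataclass
from typing import Optional, Protocol, runtime_checkable


@dataclass(frozen=True)
class Lang:
    """Hypothetical stand-in for the domain's Language value object."""

    iso: str
    name: str


@runtime_checkable
class LangRepository(Protocol):
    """Hypothetical port mirroring part of the surface the tests touch."""

    def from_iso(self, code: str) -> Optional[Lang]: ...

    def from_any(self, raw: str) -> Optional[Lang]: ...


class InMemoryRegistry:
    """Adapter with no inheritance from the port; it only matches its shape."""

    def __init__(self) -> None:
        french = Lang(iso="fre", name="French")
        self._by_iso = {"fre": french}
        # Aliases cover the ISO code, the 639-1 form, and the English name.
        self._by_alias = {"fre": french, "fr": french, "french": french}

    def from_iso(self, code):
        if not isinstance(code, str):
            return None
        return self._by_iso.get(code.lower())

    def from_any(self, raw):
        if not isinstance(raw, str):
            return None
        return self._by_alias.get(raw.strip().lower())

    def __contains__(self, raw) -> bool:
        return self.from_any(raw) is not None

    def __len__(self) -> int:
        return len(self._by_iso)


# The annotation is what a type checker verifies; isinstance() below only
# confirms at runtime that the protocol's method names are present.
registry: LangRepository = InMemoryRegistry()
print(isinstance(registry, LangRepository))
```

Note that `isinstance` against a runtime-checkable protocol checks only that the methods exist, not their signatures, so the static check by mypy remains the stronger guarantee.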
@@ -16,7 +16,7 @@ from __future__ import annotations
from pathlib import Path
from alfred.domain.subtitles.entities import SubtitleCandidate
from alfred.domain.subtitles.entities import SubtitleScanResult
from alfred.application.subtitles.placer import PlacedTrack
from alfred.domain.subtitles.value_objects import (
SubtitleFormat,
@@ -32,8 +32,8 @@ ENG = SubtitleLanguage(code="eng", tokens=["en"])
def _track(
lang=FRA, *, embedded: bool = False, confidence: float = 0.92
) -> SubtitleCandidate:
return SubtitleCandidate(
) -> SubtitleScanResult:
return SubtitleScanResult(
language=lang,
format=SRT,
subtitle_type=SubtitleType.STANDARD,