Compare commits
36 Commits
9f10f4e0ad
...
688c37bbec
| Author | SHA1 | Date | |
|---|---|---|---|
| 688c37bbec | |||
| 757e4045ee | |||
| c3767aacb6 | |||
| 5bcf22b408 | |||
| cfa9f54d9f | |||
| f0aaf50c97 | |||
| a09262b33f | |||
| 9c7cd66d2b | |||
| 83dbed887b | |||
| 0c9489e16b | |||
| 621bb96995 | |||
| 448ef3b79c | |||
| b1c7f35ffb | |||
| 5bbdc9081f | |||
| 5d7b214af2 | |||
| 18267d0165 | |||
| 19fe8a519a | |||
| a0d1846ff2 | |||
| 0fb59a4581 | |||
| e79ca462b8 | |||
| 03aa844d7d | |||
| c303efea48 | |||
| 5db350a1df | |||
| 12dc796ea2 | |||
| 9ddd85929e | |||
| ed7680b58f | |||
| b4c9efd13b | |||
| 98c688f29b | |||
| fcd80763e2 | |||
| 629387591f | |||
| 230a7ab88a | |||
| 3737f66851 | |||
| fd3bd1ad8c | |||
| 7dc7f0c241 | |||
| 075a827b0e | |||
| a2c917618f |
+271
@@ -15,8 +15,263 @@ callers).
|
||||
|
||||
## [Unreleased]
|
||||
|
||||
### Fixed
|
||||
|
||||
- **Multi-episode chain (e.g. `S14E09E10E11`) now collapses to a full
|
||||
range.** The parser previously captured `episode=9, episode_end=10`
|
||||
and dropped E11+. It now returns `episode=first, episode_end=last`,
|
||||
with intermediate values implied. Fixture
|
||||
`shitty/archer_multi_episode/` updated from anti-regression-of-bug
|
||||
to anti-regression-of-fix.
|
||||
- **Apostrophes in titles no longer push the release through the AI
|
||||
fallback.** `Honey.Don't.2025.2160p.WEBRip.DSNP.DV.HDR.x265-Amen`
|
||||
previously parsed with `parse_path="ai"` and everything UNKNOWN
|
||||
because `'` is in the forbidden-chars list. Apostrophes are now
|
||||
pre-stripped before the well-formed check, so the parse completes
|
||||
normally (`title=Honey.Dont, year=2025, quality=2160p, ...`); only
|
||||
the title text loses its apostrophe. `parse_path` becomes
|
||||
`sanitized` to surface the cleanup. Side win: PoP fixture
|
||||
`the_prodigy_full_chaos/` also moves from total failure to a
|
||||
partially-correct parse (year, source, codec extracted).
|
||||
- **Season-range markers (`Sxx-yy`) are now recognized as
|
||||
`tv_complete`.** `Der.Tatortreiniger.S01-06.GERMAN...` previously
|
||||
parsed as `media_type=movie` with `S01-06` glued onto the title.
|
||||
The parser now recognizes the range, sets `season=first`,
|
||||
`media_type=tv_complete`, and removes the marker from the title.
|
||||
`is_season_pack` flips to `true`.
|
||||
- **Pure-punctuation TITLE tokens are dropped at assembly.** Releases
|
||||
with surrounding ` - ` separators (`Vinyl - 1x01 - FHD`) previously
|
||||
produced `title="Vinyl.-"`. Such tokens (a stray dash, a wide pipe
|
||||
`|`, …) carry no title content and are now filtered out. Side
|
||||
effect: PoP fixture `khruangbin_yt_wide_pipe/` also benefits — the
|
||||
YouTube wide-pipe no longer leaks into the title.
|
||||
|
||||
### Added
|
||||
|
||||
- **`LanguageRepository` port** in `alfred.domain.shared.ports`. Structural
|
||||
Protocol covering `from_iso`, `from_any`, `all`, `__contains__`, `__len__`
|
||||
— the surface previously coupled to the concrete `LanguageRegistry`.
|
||||
Mirrors the `MediaProber` / `FilesystemScanner` pattern: domain code
|
||||
depends on the Protocol, infrastructure provides the YAML-backed
|
||||
adapter. Tests in `tests/infrastructure/test_language_registry.py`.
|
||||
|
||||
### Changed
|
||||
|
||||
- **`RuleScope.level` is now an enum (`RuleScopeLevel`).** The set of
|
||||
valid levels (global, release_group, movie, show, season, episode)
|
||||
was documented only in a docstring comment and validated nowhere.
|
||||
`RuleScopeLevel(str, Enum)` keeps wire compatibility (YAML
|
||||
serialization, `.value` access) while making the closed set explicit
|
||||
to type-checkers and IDEs. `to_dict()` emits `.value` strings so
|
||||
YAML output is unchanged.
|
||||
- **`FilePath` VO uses `__post_init__` instead of a hand-rolled
|
||||
`__init__`.** Same public API (accepts `str | Path`), same behavior,
|
||||
but the dataclass-generated `__init__` is no longer bypassed. One
|
||||
less smell in the shared VOs.
|
||||
- **`Language` VO is strict by default; `Language.from_raw()` factory
|
||||
for normalization.** The previous `__post_init__` mutated `iso` and
|
||||
`aliases` via `object.__setattr__` on a frozen dataclass — a code
|
||||
smell hiding behind the dataclass facade. Split: the direct
|
||||
constructor now rejects un-normalized input (uppercase iso,
|
||||
whitespace in aliases, etc.), and `Language.from_raw()` handles
|
||||
arbitrary YAML/user input. Only one caller (LanguageRegistry loading
|
||||
the ISO YAML) needed migration.
|
||||
- **`ParsedRelease.normalised` renamed to `clean`.** The field name
|
||||
promised "dots instead of spaces" but in practice held
|
||||
`raw - site_tag - apostrophes` — only used by `season_folder_name()`.
|
||||
Renamed and docstring corrected.
|
||||
- **`ParsedRelease.media_type` / `parse_path` are strict enums.** The
|
||||
fields were already typed as `MediaTypeToken` / `ParsePath`, but a
|
||||
tolerant `__post_init__` coerced raw strings. With both classes
|
||||
being `(str, Enum)`, the coercion served no purpose. Strict
|
||||
constructor; `.value` no longer passed at call sites; dropped the
|
||||
unused `_VALID_MEDIA_TYPES` / `_VALID_PARSE_PATHS` lookup tables.
|
||||
|
||||
### Removed
|
||||
|
||||
- **`settings.min_movie_size_bytes`** — orphan Pydantic field +
|
||||
validator. Its only consumer (`MovieService.validate_movie_file`)
|
||||
had been removed during an earlier refactor. The "real movie vs
|
||||
sample" rule now lives in extension-based exclusion
|
||||
(`application/release/supported_media.py`) and PoP. If a size
|
||||
threshold is ever needed, it'll go in a knowledge YAML, not in
|
||||
`settings`.
|
||||
|
||||
### Internal
|
||||
|
||||
- **Flattened `alfred.domain.shared.media/` package into a single
|
||||
`media.py` module.** The 6-file package (audio, video, subtitle,
|
||||
info, matching, tracks_mixin + `__init__`) collapsed into one ~250
|
||||
LoC module. All 12 import sites continue to resolve unchanged
|
||||
(`from alfred.domain.shared.media import AudioTrack, MediaInfo, …`)
|
||||
since Python treats `media.py` and `media/__init__.py`
|
||||
interchangeably for import paths. Easier to scan when the whole
|
||||
bounded-context fits on one screen.
|
||||
- **`SubtitleKnowledgeBase` types `language_registry` against the
|
||||
`LanguageRepository` port** instead of the concrete `LanguageRegistry`
|
||||
class. The default constructor still instantiates the concrete adapter
|
||||
when no repository is injected — behaviour is unchanged for existing
|
||||
callers. Opens the door to in-memory fakes in future tests without
|
||||
loading the full ISO 639 YAML.
|
||||
- **Moved `detect_media_type` and `enrich_from_probe` from
|
||||
`alfred.application.filesystem` to `alfred.application.release`**.
|
||||
They are inspection-pipeline helpers — their natural home is next to
|
||||
`inspect_release`, not next to the filesystem use cases. The move
|
||||
also eliminates a circular-import workaround in
|
||||
`resolve_destination.py`: `inspect_release` can now be imported at
|
||||
module top instead of lazily inside `_resolve_parsed`. Public
|
||||
surface is unchanged for callers that imported the helpers from
|
||||
their full module paths (the only call sites — `inspect.py`, two
|
||||
tests, one testing script — were updated in this commit).
|
||||
|
||||
### Added
|
||||
|
||||
- **`resolve_*_destination` use cases now consume `inspect_release`**.
|
||||
`resolve_episode_destination` and `resolve_movie_destination` reuse
|
||||
their existing `source_file` parameter as the inspection target;
|
||||
`resolve_season_destination` and `resolve_series_destination` gain
|
||||
a new **optional** `source_path` parameter (also threaded through
|
||||
the tool wrappers and YAML specs). When the path exists, ffprobe
|
||||
data fills tokens missing from the release name (e.g. quality) and
|
||||
refreshes `tech_string`, so the destination folder / file names
|
||||
end up more accurate. When the path is missing or absent (back-compat
|
||||
callers), the use cases fall back to parse-only — same behavior as
|
||||
before.
|
||||
|
||||
### Fixed
|
||||
|
||||
- **`enrich_from_probe` now refreshes `tech_string`** after filling
|
||||
`quality` / `source` / `codec`. Previously the field stayed at its
|
||||
parser-time value, so filename builders saw stale tech tokens even
|
||||
after a successful probe. New `TestTechString` class in
|
||||
`tests/application/test_enrich_from_probe.py` locks the behavior.
|
||||
|
||||
### Added
|
||||
|
||||
- **`inspect_release` orchestrator + `InspectedResult` VO**
|
||||
(`alfred/application/release/inspect.py`). Single composition of the
|
||||
four inspection layers: `parse_release` → `detect_media_type` (patches
|
||||
`parsed.media_type`) → `find_main_video` (top-level scan) →
|
||||
`prober.probe` + `enrich_from_probe` when a video exists and the
|
||||
refined media type isn't in `{"unknown", "other"}`. Returns a frozen
|
||||
`InspectedResult(parsed, report, source_path, main_video, media_info,
|
||||
probe_used)` that downstream callers consume directly instead of
|
||||
rebuilding the same chain. `kb` and `prober` are injected — no
|
||||
module-level singletons. Never raises.
|
||||
|
||||
### Changed
|
||||
|
||||
- **`analyze_release` tool now delegates to `inspect_release`** — same
|
||||
output shape, plus two new fields: `confidence` (0–100) and `road`
|
||||
(`"easy"` / `"shitty"` / `"path_of_pain"`) surfaced from the parser's
|
||||
`ParseReport`. The tool spec (`specs/analyze_release.yaml`) documents
|
||||
both fields so the LLM can route releases by confidence.
|
||||
|
||||
- **`MediaProber` port now covers full media probing**: added
|
||||
`probe(video) -> MediaInfo | None` alongside the existing
|
||||
`list_subtitle_streams`. `FfprobeMediaProber` (in
|
||||
`alfred/infrastructure/probe/`) implements both methods and is now
|
||||
the single adapter shelling out to `ffprobe`. The standalone
|
||||
`alfred/infrastructure/filesystem/ffprobe.py` module was removed —
|
||||
all callers (tools, testing scripts) instantiate
|
||||
`FfprobeMediaProber` instead. Unblocks the upcoming
|
||||
`inspect_release` orchestrator, which depends on the port.
|
||||
|
||||
### Removed
|
||||
|
||||
- `alfred/infrastructure/filesystem/ffprobe.py` (folded into the
|
||||
`FfprobeMediaProber` adapter).
|
||||
|
||||
---
|
||||
|
||||
## [2026-05-20] — Release parser confidence scoring + exclusion
|
||||
|
||||
### Added
|
||||
|
||||
- **Pre-pipeline exclusion helpers** (`alfred/application/release/supported_media.py`):
|
||||
`is_supported_video(path, kb)` (extension-only check against
|
||||
`kb.video_extensions`) and `find_main_video(folder, kb)` (top-level
|
||||
scan, lexicographically-first eligible file, returns `None` when no
|
||||
video qualifies; accepts a bare file as folder for single-file
|
||||
releases). No size threshold, no filename heuristics —
|
||||
PATH_OF_PAIN handles the exotic cases. Foundation for the future
|
||||
`inspect_release` orchestrator.
|
||||
|
||||
- **Release parser — parse-confidence scoring** (`alfred/domain/release/parser/scoring.py`,
|
||||
`alfred/knowledge/release/scoring.yaml`). `parse_release` now returns
|
||||
`(ParsedRelease, ParseReport)`. The new `ParseReport` frozen VO
|
||||
carries a 0–100 `confidence`, a `road` (`"easy"` / `"shitty"` /
|
||||
`"path_of_pain"`), the residual UNKNOWN tokens, and the missing
|
||||
critical fields. EASY is decided structurally (a group schema
|
||||
matched); SHITTY vs PATH_OF_PAIN is decided by score against a
|
||||
YAML-configurable cutoff (default 60). Weights and penalties also
|
||||
live in `scoring.yaml` — title 30, media_type 20, year 15, season
|
||||
10, episode 5, tech 5 each; penalty 5 per UNKNOWN token capped at
|
||||
-30. `Road` is a new enum, distinct from `ParsePath` (which records
|
||||
the tokenization route, not the confidence tier). `ReleaseKnowledge`
|
||||
port gains a `scoring: dict` field.
|
||||
|
||||
### Changed
|
||||
|
||||
- **`parse_release` signature** is now `(name, kb) → tuple[ParsedRelease,
|
||||
ParseReport]` instead of returning a bare `ParsedRelease`. Call
|
||||
sites updated in `application/filesystem/resolve_destination.py` and
|
||||
`agent/tools/filesystem.py`. Tests updated accordingly.
|
||||
|
||||
---
|
||||
|
||||
## [2026-05-20] — Release parser v2 (EASY + SHITTY)
|
||||
|
||||
### Added
|
||||
|
||||
- **Release parser v2 — EASY path live** (`alfred/domain/release/parser/`):
|
||||
new annotate-based pipeline (tokenize → annotate → assemble) drives
|
||||
releases from known groups. Exposes `Token` (frozen VO with `index` +
|
||||
`role` + `extra`), `TokenRole` enum (structural/technical/meta families),
|
||||
and `GroupSchema` / `SchemaChunk` value objects.
|
||||
- `pipeline.tokenize`: string-ops separator split (no regex), strips
|
||||
a `[site.tag]` prefix/suffix first.
|
||||
- `pipeline.annotate`: detects the trailing group right-to-left
|
||||
(priority to `codec-GROUP` shape, fallback to any non-source dashed
|
||||
token), looks up its `GroupSchema`, then walks tokens and schema
|
||||
chunks in lockstep — optional chunks that don't match are skipped,
|
||||
mandatory mismatches abort EASY and return `None` so the caller can
|
||||
fall back to SHITTY.
|
||||
- `pipeline.assemble`: folds annotated tokens into a
|
||||
`ParsedRelease`-compatible dict.
|
||||
- `parse_release` (in `release.services`) tries the v2 EASY path first
|
||||
and falls through to the legacy SHITTY heuristic on `None`. Legacy
|
||||
SHITTY/PATH OF PAIN behavior is unchanged.
|
||||
- Knowledge: `alfred/knowledge/release/release_groups/{kontrast,elite,
|
||||
rarbg}.yaml` declare the canonical chunk order per group, loaded via
|
||||
new `ReleaseKnowledge.group_schema(name)` port method.
|
||||
- Tests in `tests/domain/release/test_parser_v2_{scaffolding,easy}.py`
|
||||
cover token VOs, site-tag stripping, group detection, schema-driven
|
||||
annotation (movie, TV episode, season pack with optional source),
|
||||
and field assembly.
|
||||
|
||||
- **Release parser v2 — enricher pass** completes the EASY pipeline.
|
||||
The structural schema walk now tolerates non-positional tokens
|
||||
between chunks (instead of aborting on leftover tokens), and a second
|
||||
pass tags them with audio / video-meta / edition / language roles.
|
||||
Multi-token sequences from `audio.yaml`, `video.yaml`, `editions.yaml`
|
||||
(e.g. `DTS.HD.MA`, `DV.HDR10`, `TrueHD.Atmos`, `DIRECTORS.CUT`) are
|
||||
matched before single tokens. Channel layouts like `5.1` and `7.1`
|
||||
(split into two tokens by the `.` separator) are detected as
|
||||
consecutive pairs. Sequence members carry an `extra["sequence_member"]`
|
||||
marker so `assemble` extracts the canonical value only from the
|
||||
primary token. KONTRAST releases with audio / HDR / edition / language
|
||||
metadata now produce a fully populated `ParsedRelease`.
|
||||
|
||||
- **Streaming distributor as a separate dimension** from encoding source.
|
||||
New `alfred/knowledge/release/distributors.yaml` (NF, AMZN, DSNP, HMAX,
|
||||
ATVP, HULU, PCOK, PMTP, CR) feeds a new `ReleaseKnowledge.distributors`
|
||||
port field, a `TokenRole.DISTRIBUTOR` annotation, and a
|
||||
`ParsedRelease.distributor` field. `WEB-DL` stays the source; the
|
||||
platform that produced the release is now recorded distinctly. The
|
||||
five entries (NF, AMZN, DSNP, HMAX, ATVP) were correspondingly removed
|
||||
from `sources.yaml`.
|
||||
|
||||
- **Real-world release fixtures** under `tests/fixtures/releases/{easy,shitty,path_of_pain}/`,
|
||||
each documenting an expected `ParsedRelease` plus the future `routing`
|
||||
(library / torrents / seed_hardlinks) for the upcoming `organize_media`
|
||||
@@ -54,6 +309,22 @@ callers).
|
||||
|
||||
### Changed
|
||||
|
||||
- **Release parser v2 — SHITTY simplified to dict-driven tagging**.
|
||||
The legacy ~480-line heuristic block in `release/services.py` is gone;
|
||||
`pipeline._annotate_shitty` does a single pass that looks each token
|
||||
up in the kb buckets (resolutions / sources / codecs / distributors /
|
||||
year / `SxxExx`) with first-match-wins semantics, and the leftmost
|
||||
contiguous UNKNOWN run becomes the title. `annotate()` no longer
|
||||
returns `None` — SHITTY is the always-on fallback when no group schema
|
||||
matches. `services.py` shrunk from ~525 to ~85 lines. Four fixtures
|
||||
(`deutschland_franchise_box`, `sleaford_yt_slug`,
|
||||
`super_mario_bilingual`, `predator_space_separators` — the last one
|
||||
moved from `shitty/` → `path_of_pain/`) are now marked
|
||||
`pytest.mark.xfail(strict=False)` documenting PoP-grade pathologies
|
||||
that SHITTY intentionally won't handle. `ReleaseFixture` grows an
|
||||
`xfail_reason` field; the parametrized suite wires the xfail mark
|
||||
automatically.
|
||||
|
||||
- **`parse_release` tokenizer is now data-driven**: it splits on any character
|
||||
listed in `separators.yaml` (regex character class) instead of `name.split(".")`.
|
||||
This makes YTS-style releases (`The Father (2020) [1080p] [WEBRip] [5.1] [YTS.MX]`),
|
||||
|
||||
@@ -13,8 +13,6 @@ from alfred.application.filesystem import (
|
||||
MoveMediaUseCase,
|
||||
SetFolderPathUseCase,
|
||||
)
|
||||
from alfred.application.filesystem.detect_media_type import detect_media_type
|
||||
from alfred.application.filesystem.enrich_from_probe import enrich_from_probe
|
||||
from alfred.application.filesystem.resolve_destination import (
|
||||
resolve_episode_destination as _resolve_episode_destination,
|
||||
)
|
||||
@@ -28,10 +26,11 @@ from alfred.application.filesystem.resolve_destination import (
|
||||
resolve_series_destination as _resolve_series_destination,
|
||||
)
|
||||
from alfred.infrastructure.filesystem import FileManager, create_folder, move
|
||||
from alfred.infrastructure.filesystem.ffprobe import probe
|
||||
from alfred.infrastructure.filesystem.find_video import find_video_file
|
||||
from alfred.infrastructure.metadata import MetadataStore
|
||||
from alfred.infrastructure.persistence import get_memory
|
||||
from alfred.infrastructure.probe import FfprobeMediaProber
|
||||
|
||||
_PROBER = FfprobeMediaProber()
|
||||
|
||||
_LEARNED_ROOT = Path(_alfred_pkg.__file__).parent.parent / "data" / "knowledge"
|
||||
|
||||
@@ -57,10 +56,11 @@ def resolve_season_destination(
|
||||
tmdb_title: str,
|
||||
tmdb_year: int,
|
||||
confirmed_folder: str | None = None,
|
||||
source_path: str | None = None,
|
||||
) -> dict[str, Any]:
|
||||
"""Thin tool wrapper — semantics live in alfred/agent/tools/specs/resolve_season_destination.yaml."""
|
||||
return _resolve_season_destination(
|
||||
release_name, tmdb_title, tmdb_year, confirmed_folder
|
||||
release_name, tmdb_title, tmdb_year, confirmed_folder, source_path
|
||||
).to_dict()
|
||||
|
||||
|
||||
@@ -100,10 +100,11 @@ def resolve_series_destination(
|
||||
tmdb_title: str,
|
||||
tmdb_year: int,
|
||||
confirmed_folder: str | None = None,
|
||||
source_path: str | None = None,
|
||||
) -> dict[str, Any]:
|
||||
"""Thin tool wrapper — semantics live in alfred/agent/tools/specs/resolve_series_destination.yaml."""
|
||||
return _resolve_series_destination(
|
||||
release_name, tmdb_title, tmdb_year, confirmed_folder
|
||||
release_name, tmdb_title, tmdb_year, confirmed_folder, source_path
|
||||
).to_dict()
|
||||
|
||||
|
||||
@@ -191,21 +192,10 @@ def set_path_for_folder(folder_name: str, path_value: str) -> dict[str, Any]:
|
||||
def analyze_release(release_name: str, source_path: str) -> dict[str, Any]:
|
||||
"""Thin tool wrapper — semantics live in alfred/agent/tools/specs/analyze_release.yaml."""
|
||||
from alfred.application.filesystem.resolve_destination import _KB # noqa: PLC0415
|
||||
from alfred.domain.release.services import parse_release # noqa: PLC0415
|
||||
|
||||
path = Path(source_path)
|
||||
parsed = parse_release(release_name, _KB)
|
||||
parsed.media_type = detect_media_type(parsed, path, _KB)
|
||||
|
||||
probe_used = False
|
||||
if parsed.media_type not in ("unknown", "other"):
|
||||
video_file = find_video_file(path, _KB)
|
||||
if video_file:
|
||||
media_info = probe(video_file)
|
||||
if media_info:
|
||||
enrich_from_probe(parsed, media_info)
|
||||
probe_used = True
|
||||
from alfred.application.release import inspect_release # noqa: PLC0415
|
||||
|
||||
result = inspect_release(release_name, Path(source_path), _KB, _PROBER)
|
||||
parsed = result.parsed
|
||||
return {
|
||||
"status": "ok",
|
||||
"media_type": parsed.media_type,
|
||||
@@ -227,7 +217,9 @@ def analyze_release(release_name: str, source_path: str) -> dict[str, Any]:
|
||||
"edition": parsed.edition,
|
||||
"site_tag": parsed.site_tag,
|
||||
"is_season_pack": parsed.is_season_pack,
|
||||
"probe_used": probe_used,
|
||||
"probe_used": result.probe_used,
|
||||
"confidence": result.report.confidence,
|
||||
"road": result.report.road,
|
||||
}
|
||||
|
||||
|
||||
@@ -241,7 +233,7 @@ def probe_media(source_path: str) -> dict[str, Any]:
|
||||
"message": f"{source_path} does not exist",
|
||||
}
|
||||
|
||||
media_info = probe(path)
|
||||
media_info = _PROBER.probe(path)
|
||||
if media_info is None:
|
||||
return {
|
||||
"status": "error",
|
||||
|
||||
@@ -80,3 +80,5 @@ returns:
|
||||
site_tag: Source-site tag if present.
|
||||
is_season_pack: True when the folder contains a full season.
|
||||
probe_used: True when ffprobe successfully enriched the result.
|
||||
confidence: Parser confidence score, 0–100 (higher = more reliable).
|
||||
road: "Parser road: 'easy' (group schema matched), 'shitty' (heuristic but acceptable), or 'path_of_pain' (low confidence — ask the user before auto-routing)."
|
||||
|
||||
@@ -61,6 +61,17 @@ parameters:
|
||||
one.
|
||||
example: Oz.1997.1080p.WEBRip.x265-KONTRAST
|
||||
|
||||
source_path:
|
||||
description: |
|
||||
Absolute path to the release folder on disk. Optional.
|
||||
why_needed: |
|
||||
When provided, the tool runs ffprobe on the main video inside the
|
||||
folder and uses the probe data to fill quality/codec tokens that
|
||||
may be missing from the release name. The enriched tech tokens
|
||||
end up in the destination folder name, so providing source_path
|
||||
gives more accurate names for releases with sparse metadata.
|
||||
example: /downloads/Oz.S03.1080p.WEBRip.x265-KONTRAST
|
||||
|
||||
returns:
|
||||
ok:
|
||||
description: Paths resolved unambiguously; ready to move.
|
||||
|
||||
@@ -56,6 +56,16 @@ parameters:
|
||||
Forces the use case to use this exact folder name and skip detection.
|
||||
example: The.Wire.2002.1080p.BluRay.x265-GROUP
|
||||
|
||||
source_path:
|
||||
description: |
|
||||
Absolute path to the release folder on disk. Optional.
|
||||
why_needed: |
|
||||
When provided, the tool runs ffprobe on the main video inside the
|
||||
folder and uses probe data to fill quality/codec tokens that may
|
||||
be missing from the release name, producing a more accurate
|
||||
destination folder name.
|
||||
example: /downloads/The.Wire.S01-S05.1080p.BluRay.x265-GROUP
|
||||
|
||||
returns:
|
||||
ok:
|
||||
description: Path resolved; ready to move the pack.
|
||||
|
||||
@@ -22,10 +22,13 @@ import logging
|
||||
from dataclasses import dataclass
|
||||
from pathlib import Path
|
||||
|
||||
from alfred.application.release import inspect_release
|
||||
from alfred.domain.release import parse_release
|
||||
from alfred.domain.release.ports import ReleaseKnowledge
|
||||
from alfred.domain.release.value_objects import ParsedRelease
|
||||
from alfred.infrastructure.knowledge.release_kb import YamlReleaseKnowledge
|
||||
from alfred.infrastructure.persistence import get_memory
|
||||
from alfred.infrastructure.probe import FfprobeMediaProber
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
@@ -33,6 +36,26 @@ logger = logging.getLogger(__name__)
|
||||
# Tests that need a custom KB can monkeypatch this attribute.
|
||||
_KB: ReleaseKnowledge = YamlReleaseKnowledge()
|
||||
|
||||
# Module-level prober — same singleton style as _KB. Tests that need a custom
|
||||
# adapter can monkeypatch this attribute.
|
||||
_PROBER = FfprobeMediaProber()
|
||||
|
||||
|
||||
def _resolve_parsed(release_name: str, source_path: str | None) -> ParsedRelease:
|
||||
"""Pick the right entry point depending on whether we have a path.
|
||||
|
||||
When ``source_path`` is provided and points to something that exists,
|
||||
we run the full inspection pipeline so probe data can refresh
|
||||
``tech_string`` (which feeds every filename builder). Otherwise we
|
||||
fall back to a parse-only path — same behavior as before.
|
||||
"""
|
||||
if source_path:
|
||||
path = Path(source_path)
|
||||
if path.exists():
|
||||
return inspect_release(release_name, path, _KB, _PROBER).parsed
|
||||
parsed, _ = parse_release(release_name, _KB)
|
||||
return parsed
|
||||
|
||||
|
||||
def _find_existing_tvshow_folders(
|
||||
tv_root: Path, tmdb_title_safe: str, tmdb_year: int
|
||||
@@ -237,12 +260,17 @@ def resolve_season_destination(
|
||||
tmdb_title: str,
|
||||
tmdb_year: int,
|
||||
confirmed_folder: str | None = None,
|
||||
source_path: str | None = None,
|
||||
) -> ResolvedSeasonDestination:
|
||||
"""
|
||||
Compute destination paths for a season pack.
|
||||
|
||||
Returns series_folder + season_folder. No file paths — the whole
|
||||
source folder is moved as-is into season_folder.
|
||||
|
||||
When ``source_path`` points to the release on disk, the parser is
|
||||
augmented with ffprobe data so tech tokens missing from the release
|
||||
name (quality / codec) end up in the folder names.
|
||||
"""
|
||||
tv_root = _get_tv_root()
|
||||
if not tv_root:
|
||||
@@ -252,7 +280,7 @@ def resolve_season_destination(
|
||||
message="TV show library path is not configured.",
|
||||
)
|
||||
|
||||
parsed = parse_release(release_name, _KB)
|
||||
parsed = _resolve_parsed(release_name, source_path)
|
||||
tmdb_title_safe = _KB.sanitize_for_fs(tmdb_title)
|
||||
computed_name = parsed.show_folder_name(tmdb_title_safe, tmdb_year)
|
||||
|
||||
@@ -293,6 +321,8 @@ def resolve_episode_destination(
|
||||
Compute destination paths for a single episode file.
|
||||
|
||||
Returns series_folder + season_folder + library_file (full path to .mkv).
|
||||
``source_file`` doubles as the inspection target — when it exists,
|
||||
ffprobe enrichment refreshes tech tokens missing from the release name.
|
||||
"""
|
||||
tv_root = _get_tv_root()
|
||||
if not tv_root:
|
||||
@@ -302,7 +332,7 @@ def resolve_episode_destination(
|
||||
message="TV show library path is not configured.",
|
||||
)
|
||||
|
||||
parsed = parse_release(release_name, _KB)
|
||||
parsed = _resolve_parsed(release_name, source_file)
|
||||
ext = Path(source_file).suffix
|
||||
tmdb_title_safe = _KB.sanitize_for_fs(tmdb_title)
|
||||
tmdb_episode_title_safe = (
|
||||
@@ -350,6 +380,8 @@ def resolve_movie_destination(
|
||||
Compute destination paths for a movie file.
|
||||
|
||||
Returns movie_folder + library_file (full path to .mkv).
|
||||
``source_file`` doubles as the inspection target — when it exists,
|
||||
ffprobe enrichment refreshes tech tokens missing from the release name.
|
||||
"""
|
||||
memory = get_memory()
|
||||
movies_root = memory.ltm.library_paths.get("movie")
|
||||
@@ -360,7 +392,7 @@ def resolve_movie_destination(
|
||||
message="Movie library path is not configured.",
|
||||
)
|
||||
|
||||
parsed = parse_release(release_name, _KB)
|
||||
parsed = _resolve_parsed(release_name, source_file)
|
||||
ext = Path(source_file).suffix
|
||||
tmdb_title_safe = _KB.sanitize_for_fs(tmdb_title)
|
||||
|
||||
@@ -385,11 +417,15 @@ def resolve_series_destination(
|
||||
tmdb_title: str,
|
||||
tmdb_year: int,
|
||||
confirmed_folder: str | None = None,
|
||||
source_path: str | None = None,
|
||||
) -> ResolvedSeriesDestination:
|
||||
"""
|
||||
Compute destination path for a complete multi-season series pack.
|
||||
|
||||
Returns only series_folder — the whole pack lands directly inside it.
|
||||
|
||||
When ``source_path`` points to the release on disk, ffprobe
|
||||
enrichment refreshes tech tokens missing from the release name.
|
||||
"""
|
||||
tv_root = _get_tv_root()
|
||||
if not tv_root:
|
||||
@@ -399,7 +435,7 @@ def resolve_series_destination(
|
||||
message="TV show library path is not configured.",
|
||||
)
|
||||
|
||||
parsed = parse_release(release_name, _KB)
|
||||
parsed = _resolve_parsed(release_name, source_path)
|
||||
tmdb_title_safe = _KB.sanitize_for_fs(tmdb_title)
|
||||
computed_name = parsed.show_folder_name(tmdb_title_safe, tmdb_year)
|
||||
|
||||
|
||||
@@ -0,0 +1,20 @@
|
||||
"""Release application layer — orchestrators sitting between domain
|
||||
parsing and infrastructure I/O.
|
||||
|
||||
Public surface:
|
||||
|
||||
- :func:`is_supported_video` / :func:`find_main_video` — pre-pipeline
|
||||
filesystem helpers (extension-only filtering, top-level video pick).
|
||||
- :func:`inspect_release` / :class:`InspectedResult` — full inspection
|
||||
pipeline combining parse + filesystem refinement + probe enrichment.
|
||||
"""
|
||||
|
||||
from .inspect import InspectedResult, inspect_release
|
||||
from .supported_media import find_main_video, is_supported_video
|
||||
|
||||
__all__ = [
|
||||
"InspectedResult",
|
||||
"find_main_video",
|
||||
"inspect_release",
|
||||
"is_supported_video",
|
||||
]
|
||||
+7
@@ -80,3 +80,10 @@ def enrich_from_probe(parsed: ParsedRelease, info: MediaInfo) -> None:
|
||||
for lang in info.audio_languages:
|
||||
if lang.lower() != "und" and lang.upper() not in existing:
|
||||
parsed.languages.append(lang)
|
||||
|
||||
# Re-derive tech_string so filename builders see the enriched
|
||||
# quality/source/codec. Built the same way as in the parser pipeline:
|
||||
# the non-None parts joined by dots, in order.
|
||||
parsed.tech_string = ".".join(
|
||||
p for p in (parsed.quality, parsed.source, parsed.codec) if p
|
||||
)
|
||||
@@ -0,0 +1,140 @@
|
||||
"""Release inspection orchestrator — the canonical "look at this thing"
|
||||
entry point.
|
||||
|
||||
``inspect_release`` is the single composition of the four layers we
|
||||
care about for a freshly-arrived release:
|
||||
|
||||
1. **Parse the name** — :func:`alfred.domain.release.services.parse_release`
|
||||
gives a ``ParsedRelease`` plus a ``ParseReport`` (confidence + road).
|
||||
2. **Pick the main video** — :func:`find_main_video` runs a top-level
|
||||
scan over the source path. If nothing qualifies the result still
|
||||
completes; downstream callers decide what to do with a videoless
|
||||
release.
|
||||
3. **Refine the media type** — :func:`detect_media_type` uses the
|
||||
on-disk extension mix to override any token-level guess (e.g. a
|
||||
bare ``.iso`` folder becomes ``"other"``). The refined value is
|
||||
patched onto ``parsed`` in place — same convention as
|
||||
``analyze_release`` had before.
|
||||
4. **Probe the video** — the injected :class:`MediaProber` fills in
|
||||
missing technical fields via :func:`enrich_from_probe`. Skipped
|
||||
when there is no main video or when ``media_type`` ended up in
|
||||
``{"unknown", "other"}`` (the probe would tell us nothing useful).
|
||||
|
||||
The return type is :class:`InspectedResult`, a frozen VO that bundles
|
||||
everything downstream callers need (``analyze_release`` tool,
|
||||
``resolve_destination``, future workflow stages) without forcing them
|
||||
to redo the same four calls.
|
||||
|
||||
Design notes:
|
||||
|
||||
- **Application layer.** This module touches both domain
|
||||
(``parse_release``) and infrastructure (``MediaProber`` port). That
|
||||
is exactly application's job — orchestrate.
|
||||
- **Knowledge base is injected.** ``inspect_release`` takes ``kb`` and
|
||||
``prober`` as parameters; no module-level singletons here. Callers
|
||||
(the tool wrapper, tests) decide what to plug in.
|
||||
- **Mutation is contained.** We still mutate ``parsed.media_type`` and
|
||||
let ``enrich_from_probe`` fill its ``None`` fields, because
|
||||
``ParsedRelease`` is intentionally a mutable dataclass. The outer
|
||||
``InspectedResult`` is frozen so the *bundle* is immutable from the
|
||||
caller's perspective.
|
||||
- **Never raises.** Filesystem / probe errors surface as ``None``
|
||||
fields on the result, never as exceptions — same contract as the
|
||||
underlying adapters.
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
from dataclasses import dataclass
|
||||
from pathlib import Path
|
||||
|
||||
from alfred.application.release.detect_media_type import detect_media_type
|
||||
from alfred.application.release.enrich_from_probe import enrich_from_probe
|
||||
from alfred.application.release.supported_media import find_main_video
|
||||
from alfred.domain.release.ports import ReleaseKnowledge
|
||||
from alfred.domain.release.services import parse_release
|
||||
from alfred.domain.release.value_objects import ParsedRelease, ParseReport
|
||||
from alfred.domain.shared.media import MediaInfo
|
||||
from alfred.domain.shared.ports import MediaProber
|
||||
|
||||
|
||||
@dataclass(frozen=True)
|
||||
class InspectedResult:
|
||||
"""The full picture of a release: parsed name + filesystem reality.
|
||||
|
||||
Bundles everything the downstream pipeline needs after a single
|
||||
inspection pass:
|
||||
|
||||
- ``parsed`` — :class:`ParsedRelease`, with ``media_type`` already
|
||||
refined by :func:`detect_media_type` and ``None`` tech fields
|
||||
filled in by :func:`enrich_from_probe` when a probe ran.
|
||||
- ``report`` — :class:`ParseReport` from the parser (confidence +
|
||||
road, untouched by inspection).
|
||||
- ``source_path`` — the path the inspector was pointed at (file or
|
||||
folder), as supplied by the caller.
|
||||
- ``main_video`` — the canonical video file inside ``source_path``,
|
||||
or ``None`` if no eligible file was found.
|
||||
- ``media_info`` — the :class:`MediaInfo` snapshot when a probe
|
||||
succeeded; ``None`` when no video was probed (no main video, or
|
||||
``media_type`` in ``{"unknown", "other"}``) or when ffprobe
|
||||
failed.
|
||||
- ``probe_used`` — ``True`` iff ``media_info`` is non-``None`` and
|
||||
``enrich_from_probe`` actually ran. Explicit flag so callers
|
||||
don't have to re-derive the condition.
|
||||
"""
|
||||
|
||||
parsed: ParsedRelease
|
||||
report: ParseReport
|
||||
source_path: Path
|
||||
main_video: Path | None
|
||||
media_info: MediaInfo | None
|
||||
probe_used: bool
|
||||
|
||||
|
||||
# Media types for which a probe carries no useful information.
|
||||
_NON_PROBABLE_MEDIA_TYPES = frozenset({"unknown", "other"})
|
||||
|
||||
|
||||
def inspect_release(
|
||||
release_name: str,
|
||||
source_path: Path,
|
||||
kb: ReleaseKnowledge,
|
||||
prober: MediaProber,
|
||||
) -> InspectedResult:
|
||||
"""Run the full inspection pipeline on ``release_name`` /
|
||||
``source_path``.
|
||||
|
||||
See module docstring for the four-step flow. ``kb`` and ``prober``
|
||||
are injected so the caller controls the knowledge base layering
|
||||
and the probe adapter (real ffprobe in production, stubs in tests).
|
||||
|
||||
Never raises. A missing or unreadable ``source_path`` simply
|
||||
results in ``main_video=None`` and ``media_info=None``.
|
||||
"""
|
||||
parsed, report = parse_release(release_name, kb)
|
||||
|
||||
# Step 2: refine media_type from the on-disk extension mix.
|
||||
# detect_media_type tolerates non-existent paths (returns parsed.media_type
|
||||
# untouched), so no need to guard here.
|
||||
parsed.media_type = detect_media_type(parsed, source_path, kb)
|
||||
|
||||
# Step 3: pick the canonical main video (top-level scan only).
|
||||
main_video = find_main_video(source_path, kb)
|
||||
|
||||
# Step 4: probe + enrich, when it makes sense.
|
||||
media_info: MediaInfo | None = None
|
||||
probe_used = False
|
||||
if main_video is not None and parsed.media_type not in _NON_PROBABLE_MEDIA_TYPES:
|
||||
media_info = prober.probe(main_video)
|
||||
if media_info is not None:
|
||||
enrich_from_probe(parsed, media_info)
|
||||
probe_used = True
|
||||
|
||||
return InspectedResult(
|
||||
parsed=parsed,
|
||||
report=report,
|
||||
source_path=source_path,
|
||||
main_video=main_video,
|
||||
media_info=media_info,
|
||||
probe_used=probe_used,
|
||||
)
|
||||
@@ -0,0 +1,74 @@
|
||||
"""Pre-pipeline exclusion — decide which files are worth parsing.
|
||||
|
||||
These helpers live one notch above the domain: they touch the
|
||||
filesystem (``Path.iterdir``, ``Path.suffix``) but carry no parsing
|
||||
logic of their own. The goal is to filter out non-video files and pick
|
||||
the canonical "main video" from a release folder *before* anything
|
||||
hits :func:`~alfred.domain.release.parse_release`.
|
||||
|
||||
Design notes (Phase A bis, 2026-05-20):
|
||||
|
||||
- **Extension is the sole eligibility criterion.** A file is supported
|
||||
iff its suffix is in ``kb.video_extensions``. No size threshold, no
|
||||
filename heuristics ("sample", "trailer", …). If a release packs a
|
||||
bloated featurette or names its sample alphabetically before the
|
||||
main feature, that's PATH_OF_PAIN territory — not this layer's job.
|
||||
|
||||
- **Top-level scan only.** ``find_main_video`` does not descend into
|
||||
subdirectories. Releases that wrap the main video in ``Sample/`` or
|
||||
similar are non-scene-standard and handled by the orchestrator
|
||||
upstream.
|
||||
|
||||
- **Lexicographic tie-break.** When several candidates qualify
|
||||
(legitimate for season packs), we return the first by alphabetical
|
||||
order. Deterministic, no size-based ranking.
|
||||
|
||||
- **Direct ``Path`` I/O.** No ``FilesystemScanner`` port — this layer
|
||||
is application, not domain. If isolation becomes necessary for
|
||||
testing scale, we'll introduce a port then.
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
from pathlib import Path
|
||||
|
||||
from alfred.domain.release.ports.knowledge import ReleaseKnowledge
|
||||
|
||||
|
||||
def is_supported_video(path: Path, kb: ReleaseKnowledge) -> bool:
|
||||
"""Return True when ``path`` is a video file the parser should
|
||||
consider.
|
||||
|
||||
The check is purely extension-based: ``path.suffix.lower()`` must
|
||||
belong to ``kb.video_extensions``. ``path`` must also be a regular
|
||||
file — directories and broken symlinks return False.
|
||||
"""
|
||||
if not path.is_file():
|
||||
return False
|
||||
return path.suffix.lower() in kb.video_extensions
|
||||
|
||||
|
||||
def find_main_video(folder: Path, kb: ReleaseKnowledge) -> Path | None:
|
||||
"""Return the canonical main video file inside ``folder``, or
|
||||
``None`` if there isn't one.
|
||||
|
||||
Behavior:
|
||||
|
||||
- Top-level scan only — subdirectories are ignored.
|
||||
- Eligibility is :func:`is_supported_video`.
|
||||
- When several files qualify, the lexicographically first one wins.
|
||||
- When ``folder`` itself is a video file, it is returned as-is
|
||||
(single-file releases are valid).
|
||||
- When ``folder`` doesn't exist or isn't a directory (and isn't a
|
||||
video file either), returns ``None``.
|
||||
"""
|
||||
if folder.is_file():
|
||||
return folder if is_supported_video(folder, kb) else None
|
||||
|
||||
if not folder.is_dir():
|
||||
return None
|
||||
|
||||
candidates = sorted(
|
||||
child for child in folder.iterdir() if is_supported_video(child, kb)
|
||||
)
|
||||
return candidates[0] if candidates else None
|
||||
@@ -1,6 +1,6 @@
|
||||
"""Release domain — release name parsing and naming conventions."""
|
||||
|
||||
from .services import parse_release
|
||||
from .value_objects import ParsedRelease
|
||||
from .value_objects import ParsedRelease, ParseReport
|
||||
|
||||
__all__ = ["ParsedRelease", "parse_release"]
|
||||
__all__ = ["ParsedRelease", "ParseReport", "parse_release"]
|
||||
|
||||
@@ -0,0 +1,31 @@
|
||||
"""Release parser v2 — annotate-based pipeline.
|
||||
|
||||
This package is the future home of ``parse_release``. It restructures the
|
||||
parsing logic around a **tokenize → annotate → assemble** pipeline:
|
||||
|
||||
1. **tokenize**: split the release name into atomic tokens.
|
||||
2. **annotate**: walk tokens left-to-right, assigning each one a
|
||||
:class:`TokenRole` (TITLE, YEAR, SEASON, RESOLUTION, …) using the
|
||||
injected :class:`~alfred.domain.release.ports.knowledge.ReleaseKnowledge`.
|
||||
3. **assemble**: fold the annotated tokens into a :class:`ParsedRelease`.
|
||||
|
||||
The pipeline has three internal paths driven by the detected release group:
|
||||
|
||||
- **EASY**: known group (KONTRAST, RARBG, …) with a schema-driven layout
|
||||
declared in ``knowledge/release/release_groups/<group>.yaml``.
|
||||
- **SHITTY**: unknown group, best-effort matching against the global
|
||||
knowledge sets, with a 0-100 confidence score.
|
||||
- **PATH OF PAIN**: score below threshold OR critical chunks missing —
|
||||
signaled to the caller, who decides whether to involve the LLM/user.
|
||||
|
||||
Today the package exposes scaffolding only (token VOs and a thin pipeline
|
||||
stub). The legacy ``parse_release`` in ``release.services`` keeps serving
|
||||
production until each piece of the v2 pipeline is wired in.
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
from .schema import GroupSchema, SchemaChunk
|
||||
from .tokens import Token, TokenRole
|
||||
|
||||
__all__ = ["GroupSchema", "SchemaChunk", "Token", "TokenRole"]
|
||||
@@ -0,0 +1,767 @@
|
||||
"""Annotate-based pipeline.
|
||||
|
||||
Three stages:
|
||||
|
||||
1. :func:`tokenize` — release name → ``list[Token]`` (all UNKNOWN), plus
|
||||
a separately-returned site tag (e.g. ``[YTS.MX]``) that is never
|
||||
tokenized.
|
||||
2. :func:`annotate` — promote each token's :class:`TokenRole` using the
|
||||
injected knowledge base. Two sub-passes:
|
||||
|
||||
a. **Structural** (schema-driven, EASY only). Detects the group at
|
||||
the right end, looks up its :class:`GroupSchema`, then matches
|
||||
the schema's chunk sequence against the token stream. Between
|
||||
two structural chunks, any number of unmatched tokens may
|
||||
remain — they are left UNKNOWN for the enricher pass to handle.
|
||||
b. **Enrichers** (non-positional). Walks UNKNOWN tokens and tags
|
||||
audio / video-meta / edition / language roles. Multi-token
|
||||
sequences (``DTS.HD.MA``, ``DV.HDR10``, ``DIRECTORS.CUT``) are
|
||||
matched first, single tokens after.
|
||||
|
||||
3. :func:`assemble` — fold annotated tokens into a
|
||||
:class:`~alfred.domain.release.value_objects.ParsedRelease`-compatible
|
||||
dict.
|
||||
|
||||
The pipeline is **pure**: no I/O, no TMDB, no probe. All knowledge
|
||||
arrives through ``kb: ReleaseKnowledge``.
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
from ..ports.knowledge import ReleaseKnowledge
|
||||
from ..value_objects import MediaTypeToken
|
||||
from .schema import GroupSchema
|
||||
from .tokens import Token, TokenRole
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Stage 1 — tokenize
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
|
||||
def strip_site_tag(name: str) -> tuple[str, str | None]:
|
||||
"""Split off a ``[site.tag]`` prefix or suffix.
|
||||
|
||||
Returns ``(clean_name, tag)``. If no tag is found, returns
|
||||
``(name.strip(), None)``.
|
||||
"""
|
||||
s = name.strip()
|
||||
|
||||
if s.startswith("["):
|
||||
close = s.find("]")
|
||||
if close != -1:
|
||||
tag = s[1:close].strip()
|
||||
remainder = s[close + 1 :].strip()
|
||||
if tag and remainder:
|
||||
return remainder, tag
|
||||
|
||||
if s.endswith("]"):
|
||||
open_bracket = s.rfind("[")
|
||||
if open_bracket != -1:
|
||||
tag = s[open_bracket + 1 : -1].strip()
|
||||
remainder = s[:open_bracket].strip()
|
||||
if tag and remainder:
|
||||
return remainder, tag
|
||||
|
||||
return s, None
|
||||
|
||||
|
||||
def tokenize(name: str, kb: ReleaseKnowledge) -> tuple[list[Token], str | None]:
|
||||
"""Split ``name`` into tokens after stripping any site tag.
|
||||
|
||||
String-ops style: replace every configured separator with a single
|
||||
NUL byte then split. NUL cannot legally appear in a release name, so
|
||||
it's a safe sentinel.
|
||||
"""
|
||||
clean, site_tag = strip_site_tag(name)
|
||||
|
||||
DELIM = "\x00"
|
||||
buf = clean
|
||||
for sep in kb.separators:
|
||||
if sep != DELIM:
|
||||
buf = buf.replace(sep, DELIM)
|
||||
|
||||
pieces = [p for p in buf.split(DELIM) if p]
|
||||
tokens = [Token(text=p, index=i) for i, p in enumerate(pieces)]
|
||||
return tokens, site_tag
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Helpers shared across passes
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
|
||||
def _parse_season_episode(text: str) -> tuple[int, int | None, int | None] | None:
|
||||
"""Parse a single token as ``SxxExx`` / ``SxxExxExx`` / ``Sxx`` /
|
||||
``Sxx-yy`` (season range) / ``NxNN``.
|
||||
|
||||
Returns ``(season, episode, episode_end)`` or ``None`` if the token
|
||||
is not a season/episode marker. For ``Sxx-yy``, returns the first
|
||||
season with no episode info — the caller is expected to detect the
|
||||
range form and promote ``media_type`` to ``tv_complete`` separately.
|
||||
"""
|
||||
upper = text.upper()
|
||||
|
||||
# SxxExx form (and Sxx, Sxx-yy)
|
||||
if len(upper) >= 3 and upper[0] == "S" and upper[1:3].isdigit():
|
||||
season = int(upper[1:3])
|
||||
rest = upper[3:]
|
||||
|
||||
if not rest:
|
||||
return season, None, None
|
||||
|
||||
# Sxx-yy season-range form: capture the first season, treat as a
|
||||
# complete-series marker (no episode info).
|
||||
if (
|
||||
len(rest) == 3
|
||||
and rest[0] == "-"
|
||||
and rest[1:3].isdigit()
|
||||
):
|
||||
return season, None, None
|
||||
|
||||
episodes: list[int] = []
|
||||
while rest.startswith("E") and len(rest) >= 3 and rest[1:3].isdigit():
|
||||
episodes.append(int(rest[1:3]))
|
||||
rest = rest[3:]
|
||||
|
||||
if not episodes:
|
||||
return None
|
||||
# For chained multi-episode markers (E09E10E11), the range is the
|
||||
# first → last episode. Intermediate values are implied.
|
||||
return season, episodes[0], episodes[-1] if len(episodes) >= 2 else None
|
||||
|
||||
# NxNN form
|
||||
if "X" in upper:
|
||||
parts = upper.split("X")
|
||||
if len(parts) >= 2 and all(p.isdigit() and p for p in parts):
|
||||
season = int(parts[0])
|
||||
episode = int(parts[1])
|
||||
episode_end = int(parts[2]) if len(parts) >= 3 else None
|
||||
return season, episode, episode_end
|
||||
|
||||
return None
|
||||
|
||||
|
||||
def _is_year(text: str) -> bool:
|
||||
"""Return True if ``text`` is a 4-digit year in [1900, 2099]."""
|
||||
return len(text) == 4 and text.isdigit() and 1900 <= int(text) <= 2099
|
||||
|
||||
|
||||
def _split_codec_group(text: str, kb: ReleaseKnowledge) -> tuple[str, str] | None:
|
||||
"""Split a ``codec-GROUP`` token into ``(codec, group)`` if it fits.
|
||||
|
||||
Returns ``None`` if the token doesn't match the ``codec-GROUP``
|
||||
shape. Handles the empty-group case (``x265-``) as ``(codec, "")``.
|
||||
"""
|
||||
if "-" not in text:
|
||||
return None
|
||||
head, _, tail = text.rpartition("-")
|
||||
if head.lower() in kb.codecs:
|
||||
return head, tail
|
||||
return None
|
||||
|
||||
|
||||
def _match_role(text: str, role: TokenRole, kb: ReleaseKnowledge) -> TokenRole | None:
|
||||
"""Return ``role`` if ``text`` matches it under ``kb``, else ``None``."""
|
||||
lower = text.lower()
|
||||
|
||||
if role is TokenRole.YEAR:
|
||||
return TokenRole.YEAR if _is_year(text) else None
|
||||
|
||||
if role is TokenRole.SEASON_EPISODE:
|
||||
return (
|
||||
TokenRole.SEASON_EPISODE
|
||||
if _parse_season_episode(text) is not None
|
||||
else None
|
||||
)
|
||||
|
||||
if role is TokenRole.RESOLUTION:
|
||||
return TokenRole.RESOLUTION if lower in kb.resolutions else None
|
||||
|
||||
if role is TokenRole.SOURCE:
|
||||
return TokenRole.SOURCE if lower in kb.sources else None
|
||||
|
||||
if role is TokenRole.CODEC:
|
||||
return TokenRole.CODEC if lower in kb.codecs else None
|
||||
|
||||
return None
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Stage 2a — group detection
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
|
||||
def _detect_group(tokens: list[Token], kb: ReleaseKnowledge) -> tuple[str, int | None]:
|
||||
"""Identify the release group by walking tokens right-to-left.
|
||||
|
||||
Returns ``(group_name, token_index_carrying_group)``. ``index`` is
|
||||
``None`` when the group is absent (no trailing ``-`` in the stream).
|
||||
"""
|
||||
# Priority 1: codec-GROUP shape (clearest signal).
|
||||
for tok in reversed(tokens):
|
||||
split = _split_codec_group(tok.text, kb)
|
||||
if split is not None:
|
||||
_, group = split
|
||||
return (group or "UNKNOWN"), tok.index
|
||||
|
||||
# Priority 2: rightmost dash, excluding dashed sources (Web-DL, etc.).
|
||||
for tok in reversed(tokens):
|
||||
if "-" not in tok.text:
|
||||
continue
|
||||
head, _, tail = tok.text.rpartition("-")
|
||||
if (
|
||||
head.lower() in kb.sources
|
||||
or tok.text.lower().replace("-", "") in kb.sources
|
||||
):
|
||||
continue
|
||||
if tail:
|
||||
return tail, tok.index
|
||||
|
||||
return "UNKNOWN", None
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Stage 2b — structural annotation (schema-driven)
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
|
||||
def _annotate_structural(
|
||||
tokens: list[Token],
|
||||
kb: ReleaseKnowledge,
|
||||
schema: GroupSchema,
|
||||
group_token_index: int,
|
||||
) -> list[Token] | None:
|
||||
"""Annotate structural tokens following a known group schema.
|
||||
|
||||
Walks the schema's chunks against the body (tokens up to the group
|
||||
token). For each chunk, scans forward in the body for a matching
|
||||
token — tokens passed over without match are left UNKNOWN (the
|
||||
enricher pass will handle them).
|
||||
|
||||
Returns ``None`` if any mandatory chunk fails to find a match.
|
||||
"""
|
||||
result = list(tokens)
|
||||
|
||||
# The codec-GROUP token carries CODEC + GROUP. Split it now so the
|
||||
# schema walk knows the codec is "pre-consumed" at the end.
|
||||
group_token = result[group_token_index]
|
||||
cg_split = _split_codec_group(group_token.text, kb)
|
||||
codec_pre_consumed = False
|
||||
if cg_split is not None:
|
||||
codec, group = cg_split
|
||||
result[group_token_index] = group_token.with_role(
|
||||
TokenRole.CODEC, codec=codec, group=group or "UNKNOWN"
|
||||
)
|
||||
codec_pre_consumed = True
|
||||
else:
|
||||
head, _, tail = group_token.text.rpartition("-")
|
||||
result[group_token_index] = group_token.with_role(
|
||||
TokenRole.GROUP, group=tail or "UNKNOWN", prefix=head
|
||||
)
|
||||
|
||||
body_end = group_token_index # exclusive
|
||||
tok_idx = 0
|
||||
chunk_idx = 0
|
||||
|
||||
# 1) TITLE — leftmost contiguous tokens up to the first structural
|
||||
# boundary. Title is special because it can be multi-token.
|
||||
while (
|
||||
chunk_idx < len(schema.chunks)
|
||||
and schema.chunks[chunk_idx].role is TokenRole.TITLE
|
||||
):
|
||||
title_end = _find_title_end(result, body_end, kb)
|
||||
for i in range(tok_idx, title_end):
|
||||
result[i] = result[i].with_role(TokenRole.TITLE)
|
||||
tok_idx = title_end
|
||||
chunk_idx += 1
|
||||
|
||||
# 2) Remaining structural chunks. For each, scan forward in the body
|
||||
# for a matching token; tokens passed over remain UNKNOWN.
|
||||
for chunk in schema.chunks[chunk_idx:]:
|
||||
if chunk.role is TokenRole.GROUP:
|
||||
continue
|
||||
if chunk.role is TokenRole.CODEC and codec_pre_consumed:
|
||||
continue
|
||||
|
||||
match_idx = _find_chunk(result, tok_idx, body_end, chunk.role, kb)
|
||||
if match_idx is None:
|
||||
if chunk.optional:
|
||||
continue
|
||||
return None
|
||||
|
||||
result[match_idx] = result[match_idx].with_role(chunk.role)
|
||||
tok_idx = match_idx + 1
|
||||
|
||||
return result
|
||||
|
||||
|
||||
def _find_title_end(
|
||||
tokens: list[Token], body_end: int, kb: ReleaseKnowledge
|
||||
) -> int:
|
||||
"""Return the exclusive index where the title ends.
|
||||
|
||||
The title is the leftmost run of tokens whose text does not match
|
||||
any structural role (year, season/episode, resolution, source,
|
||||
codec). Enricher tokens (audio, HDR, language) are *not* boundaries
|
||||
because they can appear in the middle of the structural sequence;
|
||||
however, in canonical scene names they don't appear inside the title
|
||||
itself, so this heuristic holds in practice.
|
||||
"""
|
||||
for i in range(body_end):
|
||||
text = tokens[i].text
|
||||
if _parse_season_episode(text) is not None:
|
||||
return i
|
||||
if _is_year(text):
|
||||
return i
|
||||
lower = text.lower()
|
||||
if lower in kb.resolutions:
|
||||
return i
|
||||
if lower in kb.sources:
|
||||
return i
|
||||
if lower in kb.codecs:
|
||||
return i
|
||||
# codec-GROUP token (e.g. "x265-KONTRAST") or dashed source (Web-DL).
|
||||
if "-" in text:
|
||||
head, _, _ = text.rpartition("-")
|
||||
if (
|
||||
head.lower() in kb.codecs
|
||||
or head.lower() in kb.sources
|
||||
or text.lower().replace("-", "") in kb.sources
|
||||
):
|
||||
return i
|
||||
return body_end
|
||||
|
||||
|
||||
def _find_chunk(
|
||||
tokens: list[Token],
|
||||
start: int,
|
||||
end: int,
|
||||
role: TokenRole,
|
||||
kb: ReleaseKnowledge,
|
||||
) -> int | None:
|
||||
"""Return the first index in ``[start, end)`` whose token matches ``role``.
|
||||
|
||||
Returns ``None`` if no token in the range matches. Tokens already
|
||||
annotated (non-UNKNOWN) are skipped — they belong to another chunk.
|
||||
"""
|
||||
for i in range(start, end):
|
||||
if tokens[i].role is not TokenRole.UNKNOWN:
|
||||
continue
|
||||
if _match_role(tokens[i].text, role, kb) is not None:
|
||||
return i
|
||||
return None
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Stage 2b' — SHITTY annotation (schema-less heuristic)
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
|
||||
def _annotate_shitty(
|
||||
tokens: list[Token],
|
||||
kb: ReleaseKnowledge,
|
||||
group_index: int | None,
|
||||
) -> list[Token]:
|
||||
"""Schema-less, dictionary-driven annotation.
|
||||
|
||||
SHITTY's job is narrow: for releases that *look* like scene names
|
||||
but don't have a registered group schema, tag every token whose text
|
||||
falls into a known YAML bucket (resolutions, codecs, sources, …).
|
||||
Anything we can't classify stays UNKNOWN. The leftmost run of
|
||||
UNKNOWN tokens becomes the title. Done.
|
||||
|
||||
Anything that requires more reasoning (parenthesized tech blocks,
|
||||
bare-dashed title fragments, year-disguised slug suffixes, …) is
|
||||
PATH OF PAIN territory and stays out of here on purpose.
|
||||
"""
|
||||
result = list(tokens)
|
||||
|
||||
# 1) Group token — split codec-GROUP or tag GROUP. Same logic as EASY.
|
||||
if group_index is not None:
|
||||
gt = result[group_index]
|
||||
cg_split = _split_codec_group(gt.text, kb)
|
||||
if cg_split is not None:
|
||||
codec, group = cg_split
|
||||
result[group_index] = gt.with_role(
|
||||
TokenRole.CODEC, codec=codec, group=group or "UNKNOWN"
|
||||
)
|
||||
else:
|
||||
_, _, tail = gt.text.rpartition("-")
|
||||
result[group_index] = gt.with_role(
|
||||
TokenRole.GROUP, group=tail or "UNKNOWN"
|
||||
)
|
||||
|
||||
# 2) Enrichers (audio / video-meta / edition / language).
|
||||
result = _annotate_enrichers(result, kb)
|
||||
|
||||
# 3) Single pass: tag each UNKNOWN token by looking it up in the kb
|
||||
# buckets. First match wins per token, first occurrence wins per
|
||||
# role (we don't overwrite an already-tagged role).
|
||||
matchers: list[tuple[TokenRole, callable]] = [
|
||||
(TokenRole.SEASON_EPISODE, lambda t: _parse_season_episode(t) is not None),
|
||||
(TokenRole.YEAR, _is_year),
|
||||
(TokenRole.RESOLUTION, lambda t: t.lower() in kb.resolutions),
|
||||
(TokenRole.DISTRIBUTOR, lambda t: t.upper() in kb.distributors),
|
||||
(TokenRole.SOURCE, lambda t: t.lower() in kb.sources),
|
||||
(TokenRole.CODEC, lambda t: t.lower() in kb.codecs),
|
||||
]
|
||||
seen: set[TokenRole] = set()
|
||||
|
||||
for i, tok in enumerate(result):
|
||||
if tok.role is not TokenRole.UNKNOWN:
|
||||
continue
|
||||
for role, matches in matchers:
|
||||
if role in seen:
|
||||
continue
|
||||
if matches(tok.text):
|
||||
result[i] = tok.with_role(role)
|
||||
seen.add(role)
|
||||
break
|
||||
|
||||
# 4) Title = leftmost contiguous UNKNOWN tokens.
|
||||
for i, tok in enumerate(result):
|
||||
if tok.role is not TokenRole.UNKNOWN:
|
||||
break
|
||||
result[i] = tok.with_role(TokenRole.TITLE)
|
||||
|
||||
return result
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Stage 2c — enricher pass (non-positional roles)
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
|
||||
def _annotate_enrichers(tokens: list[Token], kb: ReleaseKnowledge) -> list[Token]:
|
||||
"""Tag the remaining UNKNOWN tokens with non-positional roles.
|
||||
|
||||
Multi-token sequences are matched first (so ``DTS.HD.MA`` wins over
|
||||
a single-token ``DTS``). For each sequence match, the first token
|
||||
receives the role + ``extra["sequence"]`` (the canonical joined
|
||||
value), and the trailing members are marked with the same role +
|
||||
``extra["sequence_member"]=True`` so :func:`assemble` extracts the
|
||||
value only from the primary.
|
||||
"""
|
||||
result = list(tokens)
|
||||
|
||||
# Multi-token sequences first.
|
||||
_apply_sequences(
|
||||
result, kb.audio.get("sequences", []), "codec", TokenRole.AUDIO_CODEC
|
||||
)
|
||||
_apply_sequences(
|
||||
result, kb.video_meta.get("sequences", []), "hdr", TokenRole.HDR
|
||||
)
|
||||
_apply_sequences(
|
||||
result, kb.editions.get("sequences", []), "edition", TokenRole.EDITION
|
||||
)
|
||||
|
||||
# Single tokens.
|
||||
known_audio_codecs = {c.upper() for c in kb.audio.get("codecs", [])}
|
||||
known_audio_channels = set(kb.audio.get("channels", []))
|
||||
known_hdr = {h.upper() for h in kb.video_meta.get("hdr", [])} | kb.hdr_extra
|
||||
known_bit_depth = {d.lower() for d in kb.video_meta.get("bit_depth", [])}
|
||||
known_editions = {t.upper() for t in kb.editions.get("tokens", [])}
|
||||
|
||||
# Channel layouts like "5.1" are tokenized as two tokens ("5", "1")
|
||||
# because "." is a separator. Detect consecutive pairs whose joined
|
||||
# value (without any trailing "-GROUP") is in the channel set.
|
||||
_detect_channel_pairs(result, known_audio_channels)
|
||||
|
||||
for i, tok in enumerate(result):
|
||||
if tok.role is not TokenRole.UNKNOWN:
|
||||
continue
|
||||
text = tok.text
|
||||
upper = text.upper()
|
||||
lower = text.lower()
|
||||
|
||||
if upper in known_audio_codecs:
|
||||
result[i] = tok.with_role(TokenRole.AUDIO_CODEC)
|
||||
continue
|
||||
if text in known_audio_channels:
|
||||
result[i] = tok.with_role(TokenRole.AUDIO_CHANNELS)
|
||||
continue
|
||||
if upper in known_hdr:
|
||||
result[i] = tok.with_role(TokenRole.HDR)
|
||||
continue
|
||||
if lower in known_bit_depth:
|
||||
result[i] = tok.with_role(TokenRole.BIT_DEPTH)
|
||||
continue
|
||||
if upper in known_editions:
|
||||
result[i] = tok.with_role(TokenRole.EDITION)
|
||||
continue
|
||||
if upper in kb.language_tokens:
|
||||
result[i] = tok.with_role(TokenRole.LANGUAGE)
|
||||
continue
|
||||
if upper in kb.distributors:
|
||||
result[i] = tok.with_role(TokenRole.DISTRIBUTOR)
|
||||
continue
|
||||
|
||||
return result
|
||||
|
||||
|
||||
def _apply_sequences(
|
||||
tokens: list[Token],
|
||||
sequences: list[dict],
|
||||
value_key: str,
|
||||
role: TokenRole,
|
||||
) -> None:
|
||||
"""Mark the first occurrence of each sequence in place.
|
||||
|
||||
Mutates ``tokens`` (replacing entries with new role-tagged Token
|
||||
instances). Sequences in the YAML must be ordered most-specific
|
||||
first; the first match wins per starting position.
|
||||
"""
|
||||
if not sequences:
|
||||
return
|
||||
|
||||
upper_texts = [t.text.upper() for t in tokens]
|
||||
consumed: set[int] = set()
|
||||
|
||||
for seq in sequences:
|
||||
seq_upper = [s.upper() for s in seq["tokens"]]
|
||||
n = len(seq_upper)
|
||||
for start in range(len(tokens) - n + 1):
|
||||
if any(idx in consumed for idx in range(start, start + n)):
|
||||
continue
|
||||
if any(
|
||||
tokens[start + k].role is not TokenRole.UNKNOWN for k in range(n)
|
||||
):
|
||||
continue
|
||||
if upper_texts[start : start + n] == seq_upper:
|
||||
tokens[start] = tokens[start].with_role(
|
||||
role, sequence=seq[value_key]
|
||||
)
|
||||
for k in range(1, n):
|
||||
tokens[start + k] = tokens[start + k].with_role(
|
||||
role, sequence_member="True"
|
||||
)
|
||||
consumed.update(range(start, start + n))
|
||||
|
||||
|
||||
def _detect_channel_pairs(
|
||||
tokens: list[Token], known_channels: set[str]
|
||||
) -> None:
|
||||
"""Spot two consecutive numeric tokens that form a channel layout.
|
||||
|
||||
Example: ``["5", "1-KTH"]`` → joined ``"5.1"`` (after stripping the
|
||||
``-GROUP`` suffix on the second). The second token may be the trailing
|
||||
codec-GROUP token, in which case it's already tagged CODEC and we
|
||||
skip — we'd corrupt its role.
|
||||
"""
|
||||
for i in range(len(tokens) - 1):
|
||||
first = tokens[i]
|
||||
second = tokens[i + 1]
|
||||
if first.role is not TokenRole.UNKNOWN:
|
||||
continue
|
||||
# Strip a "-GROUP" suffix on the second token before joining.
|
||||
second_text = second.text.split("-")[0]
|
||||
candidate = f"{first.text}.{second_text}"
|
||||
if candidate not in known_channels:
|
||||
continue
|
||||
# Only tag the first token (carries the channel value). The
|
||||
# second token may legitimately remain UNKNOWN (or be the
|
||||
# codec-GROUP token, already tagged CODEC).
|
||||
tokens[i] = first.with_role(
|
||||
TokenRole.AUDIO_CHANNELS, sequence=candidate
|
||||
)
|
||||
if second.role is TokenRole.UNKNOWN:
|
||||
tokens[i + 1] = second.with_role(
|
||||
TokenRole.AUDIO_CHANNELS, sequence_member="True"
|
||||
)
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Stage 2 entry point
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
|
||||
def annotate(tokens: list[Token], kb: ReleaseKnowledge) -> list[Token]:
|
||||
"""Annotate token roles.
|
||||
|
||||
Dispatch:
|
||||
|
||||
* If a group is detected AND has a known schema, run the EASY
|
||||
structural walk. If the schema walk aborts on a mandatory chunk
|
||||
mismatch, fall through to SHITTY (the heuristic still does better
|
||||
than giving up).
|
||||
* Otherwise run SHITTY — schema-less, best-effort, never aborts.
|
||||
|
||||
The enricher pass runs in both cases. The pipeline always returns a
|
||||
populated token list; downstream callers don't need to distinguish
|
||||
EASY vs SHITTY at this layer (the parse_path is decided in the
|
||||
service based on whether a schema matched).
|
||||
"""
|
||||
group_name, group_index = _detect_group(tokens, kb)
|
||||
|
||||
schema = kb.group_schema(group_name) if group_index is not None else None
|
||||
if schema is not None and group_index is not None:
|
||||
structural = _annotate_structural(tokens, kb, schema, group_index)
|
||||
if structural is not None:
|
||||
return _annotate_enrichers(structural, kb)
|
||||
|
||||
# SHITTY fallback — heuristic positional pass. ``_annotate_shitty``
|
||||
# runs its own enricher pass internally (it has to, so the title
|
||||
# scan can skip enricher-tagged tokens).
|
||||
return _annotate_shitty(tokens, kb, group_index)
|
||||
|
||||
|
||||
def has_known_schema(tokens: list[Token], kb: ReleaseKnowledge) -> bool:
|
||||
"""Return True if ``tokens`` would take the EASY path in :func:`annotate`."""
|
||||
group_name, group_index = _detect_group(tokens, kb)
|
||||
if group_index is None:
|
||||
return False
|
||||
return kb.group_schema(group_name) is not None
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Stage 3 — assemble
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
|
||||
def assemble(
|
||||
annotated: list[Token],
|
||||
site_tag: str | None,
|
||||
raw_name: str,
|
||||
kb: ReleaseKnowledge,
|
||||
) -> dict:
|
||||
"""Fold annotated tokens into a ``ParsedRelease``-compatible dict.
|
||||
|
||||
Returns a dict (not a ``ParsedRelease`` instance) so the caller can
|
||||
layer in additional fields (``parse_path``, ``raw``, …) before
|
||||
instantiation.
|
||||
"""
|
||||
# Pure-punctuation tokens (e.g. a stray "-" left by ` - ` separators in
|
||||
# human-friendly release names) carry no title content and would leak
|
||||
# into the joined title as ``"Show.-.Episode"``. Drop them here.
|
||||
title_parts = [
|
||||
t.text
|
||||
for t in annotated
|
||||
if t.role is TokenRole.TITLE and any(c.isalnum() for c in t.text)
|
||||
]
|
||||
title = ".".join(title_parts) if title_parts else (
|
||||
annotated[0].text if annotated else raw_name
|
||||
)
|
||||
|
||||
year: int | None = None
|
||||
season: int | None = None
|
||||
episode: int | None = None
|
||||
episode_end: int | None = None
|
||||
quality: str | None = None
|
||||
source: str | None = None
|
||||
codec: str | None = None
|
||||
group = "UNKNOWN"
|
||||
audio_codec: str | None = None
|
||||
audio_channels: str | None = None
|
||||
bit_depth: str | None = None
|
||||
hdr_format: str | None = None
|
||||
edition: str | None = None
|
||||
distributor: str | None = None
|
||||
languages: list[str] = []
|
||||
is_season_range = False
|
||||
|
||||
for tok in annotated:
|
||||
# Skip non-primary members of a multi-token sequence.
|
||||
if tok.extra.get("sequence_member") == "True":
|
||||
continue
|
||||
|
||||
role = tok.role
|
||||
if role is TokenRole.YEAR:
|
||||
year = int(tok.text)
|
||||
elif role is TokenRole.SEASON_EPISODE:
|
||||
parsed = _parse_season_episode(tok.text)
|
||||
if parsed is not None:
|
||||
season, episode, episode_end = parsed
|
||||
# Detect Sxx-yy range form to flag it as a multi-season pack.
|
||||
upper = tok.text.upper()
|
||||
if (
|
||||
len(upper) == 6
|
||||
and upper[0] == "S"
|
||||
and upper[1:3].isdigit()
|
||||
and upper[3] == "-"
|
||||
and upper[4:6].isdigit()
|
||||
):
|
||||
is_season_range = True
|
||||
elif role is TokenRole.RESOLUTION:
|
||||
quality = tok.text
|
||||
elif role is TokenRole.SOURCE:
|
||||
source = tok.text
|
||||
elif role is TokenRole.CODEC:
|
||||
codec = tok.extra.get("codec", tok.text)
|
||||
if "group" in tok.extra:
|
||||
group = tok.extra["group"] or "UNKNOWN"
|
||||
elif role is TokenRole.GROUP:
|
||||
group = tok.extra.get("group", tok.text) or "UNKNOWN"
|
||||
elif role is TokenRole.AUDIO_CODEC:
|
||||
if audio_codec is None:
|
||||
audio_codec = tok.extra.get("sequence", tok.text)
|
||||
elif role is TokenRole.AUDIO_CHANNELS:
|
||||
if audio_channels is None:
|
||||
audio_channels = tok.extra.get("sequence", tok.text)
|
||||
elif role is TokenRole.BIT_DEPTH:
|
||||
if bit_depth is None:
|
||||
bit_depth = tok.text.lower()
|
||||
elif role is TokenRole.HDR:
|
||||
if hdr_format is None:
|
||||
hdr_format = tok.extra.get("sequence", tok.text.upper())
|
||||
elif role is TokenRole.EDITION:
|
||||
if edition is None:
|
||||
edition = tok.extra.get("sequence", tok.text.upper())
|
||||
elif role is TokenRole.LANGUAGE:
|
||||
languages.append(tok.text.upper())
|
||||
elif role is TokenRole.DISTRIBUTOR:
|
||||
if distributor is None:
|
||||
distributor = tok.text.upper()
|
||||
|
||||
tech_parts = [p for p in (quality, source, codec) if p]
|
||||
tech_string = ".".join(tech_parts)
|
||||
|
||||
# Media type heuristic. Doc/concert/integrale tokens win over the
|
||||
# generic tech-based fallback. We look across all tokens (not just
|
||||
# annotated ones) because these markers may be tagged UNKNOWN by the
|
||||
# structural pass — only the assemble step cares about them.
|
||||
upper_tokens = {tok.text.upper() for tok in annotated}
|
||||
doc_tokens = {t.upper() for t in kb.media_type_tokens.get("doc", [])}
|
||||
concert_tokens = {t.upper() for t in kb.media_type_tokens.get("concert", [])}
|
||||
integrale_tokens = {t.upper() for t in kb.media_type_tokens.get("integrale", [])}
|
||||
|
||||
if upper_tokens & doc_tokens:
|
||||
media_type = MediaTypeToken.DOCUMENTARY
|
||||
elif upper_tokens & concert_tokens:
|
||||
media_type = MediaTypeToken.CONCERT
|
||||
elif is_season_range:
|
||||
media_type = MediaTypeToken.TV_COMPLETE
|
||||
elif (
|
||||
edition in {"COMPLETE", "INTEGRALE", "COLLECTION"}
|
||||
or upper_tokens & integrale_tokens
|
||||
) and season is None:
|
||||
media_type = MediaTypeToken.TV_COMPLETE
|
||||
elif season is not None:
|
||||
media_type = MediaTypeToken.TV_SHOW
|
||||
elif any((quality, source, codec, year)):
|
||||
media_type = MediaTypeToken.MOVIE
|
||||
else:
|
||||
media_type = MediaTypeToken.UNKNOWN
|
||||
|
||||
return {
|
||||
"title": title,
|
||||
"title_sanitized": kb.sanitize_for_fs(title),
|
||||
"year": year,
|
||||
"season": season,
|
||||
"episode": episode,
|
||||
"episode_end": episode_end,
|
||||
"quality": quality,
|
||||
"source": source,
|
||||
"codec": codec,
|
||||
"group": group,
|
||||
"tech_string": tech_string,
|
||||
"media_type": media_type,
|
||||
"site_tag": site_tag,
|
||||
"languages": languages,
|
||||
"audio_codec": audio_codec,
|
||||
"audio_channels": audio_channels,
|
||||
"bit_depth": bit_depth,
|
||||
"hdr_format": hdr_format,
|
||||
"edition": edition,
|
||||
"distributor": distributor,
|
||||
}
|
||||
@@ -0,0 +1,47 @@
|
||||
"""Group schema value objects.
|
||||
|
||||
A :class:`GroupSchema` describes the canonical chunk layout of releases
|
||||
from a known group (KONTRAST, RARBG, ELiTE, …). It is the EASY-road
|
||||
contract: when a release ends in ``-<GROUP>`` and we know the group,
|
||||
the annotator walks the schema instead of running the heuristic SHITTY
|
||||
matchers.
|
||||
|
||||
Schemas are loaded from ``knowledge/release/release_groups/<group>.yaml``
|
||||
by an infrastructure adapter and surfaced via the
|
||||
:class:`~alfred.domain.release.ports.knowledge.ReleaseKnowledge` port.
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
from dataclasses import dataclass
|
||||
|
||||
from .tokens import TokenRole
|
||||
|
||||
|
||||
@dataclass(frozen=True)
|
||||
class SchemaChunk:
|
||||
"""One entry in a group's chunk order.
|
||||
|
||||
``role`` is the :class:`TokenRole` the chunk maps to. ``optional``
|
||||
is True for chunks that may be absent (e.g. ``year`` on TV releases,
|
||||
``source`` on bare ELiTE TV releases).
|
||||
"""
|
||||
|
||||
role: TokenRole
|
||||
optional: bool = False
|
||||
|
||||
|
||||
@dataclass(frozen=True)
|
||||
class GroupSchema:
|
||||
"""Schema for a known release group.
|
||||
|
||||
``chunks`` is the left-to-right canonical order. The annotator walks
|
||||
tokens and chunks in lockstep: an optional chunk that doesn't match
|
||||
the current token is skipped (the chunk index advances, the token
|
||||
index stays), a mandatory chunk that doesn't match aborts the EASY
|
||||
path and falls back to SHITTY.
|
||||
"""
|
||||
|
||||
name: str
|
||||
separator: str
|
||||
chunks: tuple[SchemaChunk, ...]
|
||||
@@ -0,0 +1,139 @@
|
||||
"""Parse-confidence scoring.
|
||||
|
||||
``parse_release`` returns a :class:`ParseReport` alongside its
|
||||
:class:`ParsedRelease`. The report carries:
|
||||
|
||||
- ``confidence``: integer 0–100 derived from which structural and
|
||||
technical fields got populated, minus a penalty per UNKNOWN token
|
||||
left in the annotated stream.
|
||||
- ``road``: which of the three roads the parse took
|
||||
(:class:`Road.EASY` / :class:`Road.SHITTY` / :class:`Road.PATH_OF_PAIN`).
|
||||
- ``unknown_tokens``: textual residue, useful for diagnostics.
|
||||
- ``missing_critical``: structural fields the score-tally found absent
|
||||
(e.g. ``("year", "media_type")``) — the caller can use this to drive
|
||||
PoP recovery (questions, LLM call).
|
||||
|
||||
All weights, penalties and thresholds come from the injected knowledge
|
||||
base (``kb.scoring``), itself loaded from
|
||||
``alfred/knowledge/release/scoring.yaml``. No magic numbers here.
|
||||
|
||||
The scoring functions are pure — they consume the annotated token list
|
||||
and the resulting :class:`ParsedRelease` and return the report. They are
|
||||
called by ``services.parse_release`` after ``assemble`` has run.
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
from enum import Enum
|
||||
|
||||
from ..ports.knowledge import ReleaseKnowledge
|
||||
from ..value_objects import ParsedRelease
|
||||
from .tokens import Token, TokenRole
|
||||
|
||||
|
||||
class Road(str, Enum):
|
||||
"""How the parser handled a given release name.
|
||||
|
||||
Distinct from :class:`~alfred.domain.release.value_objects.ParsePath`,
|
||||
which records the tokenization route (DIRECT / SANITIZED / AI). Road
|
||||
is about confidence in the *result*, not the *method*.
|
||||
"""
|
||||
|
||||
EASY = "easy" # group schema matched — structural annotation
|
||||
SHITTY = "shitty" # no schema, dict-driven annotation, score ≥ threshold
|
||||
PATH_OF_PAIN = "path_of_pain" # score below threshold, needs help
|
||||
|
||||
|
||||
# Critical structural fields — their absence drives the
|
||||
# ``missing_critical`` list in the report.
|
||||
_CRITICAL_FIELDS: tuple[str, ...] = ("title", "media_type", "year")
|
||||
|
||||
|
||||
def _is_tv_shaped(parsed: ParsedRelease) -> bool:
|
||||
"""Season/episode weights only count for releases that *look* like TV."""
|
||||
return parsed.season is not None
|
||||
|
||||
|
||||
def compute_score(
|
||||
parsed: ParsedRelease,
|
||||
annotated: list[Token],
|
||||
kb: ReleaseKnowledge,
|
||||
) -> int:
|
||||
"""Compute a 0–100 confidence score for the parse.
|
||||
|
||||
Each populated field contributes its weight from
|
||||
``kb.scoring["weights"]``. Season/episode only count when the parse
|
||||
looks like TV. ``group == "UNKNOWN"`` is treated as absent.
|
||||
|
||||
Then a penalty is subtracted per residual UNKNOWN token in
|
||||
``annotated``, capped at ``penalties["max_unknown_penalty"]``.
|
||||
|
||||
Result is clamped to ``[0, 100]``.
|
||||
"""
|
||||
weights = kb.scoring["weights"]
|
||||
penalties = kb.scoring["penalties"]
|
||||
|
||||
score = 0
|
||||
if parsed.title:
|
||||
score += weights.get("title", 0)
|
||||
if parsed.media_type and parsed.media_type.value != "unknown":
|
||||
score += weights.get("media_type", 0)
|
||||
if parsed.year is not None:
|
||||
score += weights.get("year", 0)
|
||||
if _is_tv_shaped(parsed):
|
||||
if parsed.season is not None:
|
||||
score += weights.get("season", 0)
|
||||
if parsed.episode is not None:
|
||||
score += weights.get("episode", 0)
|
||||
if parsed.quality:
|
||||
score += weights.get("resolution", 0)
|
||||
if parsed.source:
|
||||
score += weights.get("source", 0)
|
||||
if parsed.codec:
|
||||
score += weights.get("codec", 0)
|
||||
if parsed.group and parsed.group != "UNKNOWN":
|
||||
score += weights.get("group", 0)
|
||||
|
||||
unknown_count = sum(1 for t in annotated if t.role is TokenRole.UNKNOWN)
|
||||
raw_penalty = unknown_count * penalties.get("unknown_token", 0)
|
||||
capped_penalty = min(raw_penalty, penalties.get("max_unknown_penalty", 0))
|
||||
score -= capped_penalty
|
||||
|
||||
return max(0, min(100, score))
|
||||
|
||||
|
||||
def collect_unknown_tokens(annotated: list[Token]) -> tuple[str, ...]:
|
||||
"""Return the text of every token still tagged UNKNOWN."""
|
||||
return tuple(t.text for t in annotated if t.role is TokenRole.UNKNOWN)
|
||||
|
||||
|
||||
def collect_missing_critical(parsed: ParsedRelease) -> tuple[str, ...]:
|
||||
"""Return the names of critical structural fields that are absent."""
|
||||
missing: list[str] = []
|
||||
if not parsed.title:
|
||||
missing.append("title")
|
||||
if not parsed.media_type or parsed.media_type.value == "unknown":
|
||||
missing.append("media_type")
|
||||
if parsed.year is None:
|
||||
missing.append("year")
|
||||
return tuple(missing)
|
||||
|
||||
|
||||
def decide_road(
|
||||
score: int,
|
||||
has_schema: bool,
|
||||
kb: ReleaseKnowledge,
|
||||
) -> Road:
|
||||
"""Pick the road the parse took.
|
||||
|
||||
EASY is decided structurally: if a known group schema matched, the
|
||||
annotation walked the schema, and that's enough — the score does not
|
||||
veto EASY. Otherwise the score decides between SHITTY and
|
||||
PATH_OF_PAIN using ``kb.scoring["thresholds"]["shitty_min"]``.
|
||||
"""
|
||||
if has_schema:
|
||||
return Road.EASY
|
||||
threshold = kb.scoring["thresholds"].get("shitty_min", 60)
|
||||
if score >= threshold:
|
||||
return Road.SHITTY
|
||||
return Road.PATH_OF_PAIN
|
||||
@@ -0,0 +1,90 @@
|
||||
"""Token value objects for the annotate-based parser.
|
||||
|
||||
A :class:`Token` carries both the original substring and its position in
|
||||
the original release name's token stream. A :class:`TokenRole` is the
|
||||
semantic tag assigned by the annotator.
|
||||
|
||||
Why VOs instead of bare ``str``: the annotate step needs to flag tokens
|
||||
without consuming them (a token may carry residual info — e.g. a
|
||||
``codec-GROUP`` token contributes both a CODEC and a GROUP role). Tracking
|
||||
the index also lets later stages reason about *order* (year must come
|
||||
after title, group must be rightmost, etc.) without re-scanning the list.
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
from dataclasses import dataclass, field
|
||||
from enum import Enum
|
||||
|
||||
|
||||
class TokenRole(str, Enum):
|
||||
"""Semantic role a token can take after annotation.
|
||||
|
||||
A token starts as ``UNKNOWN`` and may be promoted by the annotator.
|
||||
``str``-backed for cheap comparisons and YAML/JSON interop.
|
||||
|
||||
Roles split into three families:
|
||||
|
||||
- **structural**: TITLE / YEAR / SEASON_EPISODE / GROUP — drive folder
|
||||
and filename naming.
|
||||
- **technical**: RESOLUTION / SOURCE / CODEC / AUDIO_CODEC /
|
||||
AUDIO_CHANNELS / BIT_DEPTH / HDR / EDITION / LANGUAGE — feed
|
||||
``tech_string`` and metadata fields.
|
||||
- **meta**: SITE_TAG (stripped pre-tokenize), SEPARATOR (kept for the
|
||||
assemble step if a release uses spaces that need preservation in the
|
||||
title), UNKNOWN (residual, contributes to the SHITTY score penalty).
|
||||
"""
|
||||
|
||||
UNKNOWN = "unknown"
|
||||
|
||||
# Structural
|
||||
TITLE = "title"
|
||||
YEAR = "year"
|
||||
SEASON_EPISODE = "season_episode"
|
||||
GROUP = "group"
|
||||
|
||||
# Technical
|
||||
RESOLUTION = "resolution"
|
||||
SOURCE = "source"
|
||||
CODEC = "codec"
|
||||
AUDIO_CODEC = "audio_codec"
|
||||
AUDIO_CHANNELS = "audio_channels"
|
||||
BIT_DEPTH = "bit_depth"
|
||||
HDR = "hdr"
|
||||
EDITION = "edition"
|
||||
LANGUAGE = "language"
|
||||
DISTRIBUTOR = "distributor"
|
||||
|
||||
# Meta
|
||||
SITE_TAG = "site_tag"
|
||||
|
||||
|
||||
@dataclass(frozen=True)
|
||||
class Token:
|
||||
"""An atomic token from a release name.
|
||||
|
||||
``text`` is the substring exactly as it appeared after tokenization
|
||||
(case preserved — uppercase comparisons happen at match time).
|
||||
``index`` is the 0-based position in the tokenized stream, used by
|
||||
downstream stages to enforce ordering invariants.
|
||||
|
||||
``role`` defaults to :attr:`TokenRole.UNKNOWN`. The annotator returns
|
||||
new :class:`Token` instances with the role set rather than mutating
|
||||
(the dataclass is frozen). ``extra`` carries role-specific payload
|
||||
when the token text alone isn't enough (e.g. a ``codec-GROUP`` token
|
||||
annotated as CODEC may record the group name in ``extra["group"]``).
|
||||
"""
|
||||
|
||||
text: str
|
||||
index: int
|
||||
role: TokenRole = TokenRole.UNKNOWN
|
||||
extra: dict[str, str] = field(default_factory=dict)
|
||||
|
||||
def with_role(self, role: TokenRole, **extra: str) -> Token:
|
||||
"""Return a copy of this token with ``role`` (and optional ``extra``)."""
|
||||
merged = {**self.extra, **extra} if extra else self.extra
|
||||
return Token(text=self.text, index=self.index, role=role, extra=merged)
|
||||
|
||||
@property
|
||||
def is_annotated(self) -> bool:
|
||||
return self.role is not TokenRole.UNKNOWN
|
||||
@@ -10,7 +10,10 @@ object that satisfies this shape (e.g. a simple dataclass).
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
from typing import Protocol
|
||||
from typing import TYPE_CHECKING, Protocol
|
||||
|
||||
if TYPE_CHECKING:
|
||||
from ..parser.schema import GroupSchema
|
||||
|
||||
|
||||
class ReleaseKnowledge(Protocol):
|
||||
@@ -21,6 +24,7 @@ class ReleaseKnowledge(Protocol):
|
||||
resolutions: set[str]
|
||||
sources: set[str]
|
||||
codecs: set[str]
|
||||
distributors: set[str]
|
||||
language_tokens: set[str]
|
||||
forbidden_chars: set[str]
|
||||
hdr_extra: set[str]
|
||||
@@ -36,6 +40,18 @@ class ReleaseKnowledge(Protocol):
|
||||
|
||||
separators: list[str]
|
||||
|
||||
# --- Parse scoring (Phase A) ---
|
||||
#
|
||||
# ``scoring`` is a dict with three keys:
|
||||
# - ``weights``: dict[field_name, int] field weight contribution
|
||||
# - ``penalties``: {"unknown_token": int, "max_unknown_penalty": int}
|
||||
# - ``thresholds``: {"shitty_min": int} SHITTY vs PATH_OF_PAIN cutoff
|
||||
#
|
||||
# Concrete values come from ``alfred/knowledge/release/scoring.yaml``.
|
||||
# The loader fills in safe defaults so this dict is always populated.
|
||||
|
||||
scoring: dict
|
||||
|
||||
# --- File-extension sets (used by application/infra modules that work
|
||||
# directly with filesystem paths, e.g. media-type detection, video
|
||||
# lookup). Domain parsing itself doesn't touch these. ---
|
||||
@@ -50,3 +66,14 @@ class ReleaseKnowledge(Protocol):
|
||||
def sanitize_for_fs(self, text: str) -> str:
|
||||
"""Strip filesystem-forbidden characters from ``text``."""
|
||||
...
|
||||
|
||||
# --- Release group schemas (EASY path) ---
|
||||
|
||||
def group_schema(self, name: str) -> GroupSchema | None:
|
||||
"""Return the parsing schema for the named release group, or
|
||||
``None`` if the group is unknown (caller falls back to SHITTY).
|
||||
|
||||
Lookup is case-insensitive: ``"KONTRAST"``, ``"kontrast"`` and
|
||||
``"Kontrast"`` all resolve to the same schema.
|
||||
"""
|
||||
...
|
||||
|
||||
@@ -1,43 +1,68 @@
|
||||
"""Release domain — parsing service."""
|
||||
"""Release domain — parsing service.
|
||||
|
||||
Thin orchestrator over the annotate-based pipeline in
|
||||
:mod:`alfred.domain.release.parser.pipeline`. Responsibilities:
|
||||
|
||||
* Strip a leading/trailing ``[site.tag]`` and decide ``parse_path``.
|
||||
* Reject malformed names (forbidden characters) → ``parse_path=AI`` so
|
||||
the LLM can clean them up.
|
||||
* Otherwise call the v2 pipeline (tokenize → annotate → assemble) and
|
||||
wrap the result in :class:`ParsedRelease`.
|
||||
* Score the result and decide the road (EASY / SHITTY / PATH_OF_PAIN)
|
||||
via :mod:`alfred.domain.release.parser.scoring`.
|
||||
|
||||
The public entry point is :func:`parse_release`, which returns
|
||||
``(ParsedRelease, ParseReport)``. The report carries the confidence
|
||||
score, the road, and diagnostic info for downstream callers.
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import re
|
||||
|
||||
from .parser import pipeline as _v2
|
||||
from .parser import scoring as _scoring
|
||||
from .ports import ReleaseKnowledge
|
||||
from .value_objects import MediaTypeToken, ParsedRelease, ParsePath
|
||||
from .value_objects import MediaTypeToken, ParsedRelease, ParsePath, ParseReport
|
||||
|
||||
|
||||
def _tokenize(name: str, kb: ReleaseKnowledge) -> list[str]:
|
||||
"""Split a release name on the configured separators, dropping empty tokens."""
|
||||
pattern = "[" + re.escape("".join(kb.separators)) + "]+"
|
||||
return [t for t in re.split(pattern, name) if t]
|
||||
def parse_release(
|
||||
name: str, kb: ReleaseKnowledge
|
||||
) -> tuple[ParsedRelease, ParseReport]:
|
||||
"""Parse a release name.
|
||||
|
||||
|
||||
def parse_release(name: str, kb: ReleaseKnowledge) -> ParsedRelease:
|
||||
"""
|
||||
Parse a release name and return a ParsedRelease.
|
||||
Returns a tuple ``(ParsedRelease, ParseReport)``. The structural VO
|
||||
is unchanged from the previous single-return contract; the report
|
||||
is new and carries the confidence score + road decision.
|
||||
|
||||
Flow:
|
||||
1. Strip a leading/trailing [site.tag] if present (sets parse_path="sanitized").
|
||||
2. Check the remainder for truly forbidden chars (anything not in the
|
||||
configured separators list). If any remain → media_type="unknown",
|
||||
parse_path="ai", and the LLM handles it.
|
||||
3. Tokenize using the configured separators (".", " ", "[", "]", "(", ")", "_", ...)
|
||||
and run token-level matchers (season/episode, tech, languages, audio,
|
||||
video, edition, title, year).
|
||||
"""
|
||||
parse_path = ParsePath.DIRECT.value
|
||||
|
||||
# Always try to extract a bracket-enclosed site tag first.
|
||||
clean, site_tag = _strip_site_tag(name)
|
||||
1. Strip a leading/trailing ``[site.tag]`` if present (sets
|
||||
``parse_path="sanitized"``).
|
||||
2. If the remainder still contains truly forbidden chars (anything
|
||||
not in the configured separators), short-circuit to
|
||||
``media_type="unknown"`` / ``parse_path="ai"`` and emit a
|
||||
PATH_OF_PAIN report — the LLM handles these.
|
||||
3. Otherwise run the v2 pipeline: tokenize → annotate (EASY when a
|
||||
group schema is known, SHITTY otherwise) → assemble → score.
|
||||
"""
|
||||
parse_path = ParsePath.DIRECT
|
||||
|
||||
# Apostrophes inside titles ("Don't", "L'avare") are common and should
|
||||
# not push the release through the AI fallback. Strip them up front so
|
||||
# both strip_site_tag and tokenize see "Dont" / "Lavare", which is good
|
||||
# enough for token-level matching. The raw name is preserved on the VO.
|
||||
working_name = name
|
||||
if "'" in working_name:
|
||||
working_name = working_name.replace("'", "")
|
||||
parse_path = ParsePath.SANITIZED
|
||||
|
||||
clean, site_tag = _v2.strip_site_tag(working_name)
|
||||
if site_tag is not None:
|
||||
parse_path = ParsePath.SANITIZED.value
|
||||
parse_path = ParsePath.SANITIZED
|
||||
|
||||
if not _is_well_formed(clean, kb):
|
||||
return ParsedRelease(
|
||||
parsed = ParsedRelease(
|
||||
raw=name,
|
||||
normalised=clean,
|
||||
clean=clean,
|
||||
title=clean,
|
||||
title_sanitized=kb.sanitize_for_fs(clean),
|
||||
year=None,
|
||||
@@ -49,458 +74,49 @@ def parse_release(name: str, kb: ReleaseKnowledge) -> ParsedRelease:
|
||||
codec=None,
|
||||
group="UNKNOWN",
|
||||
tech_string="",
|
||||
media_type=MediaTypeToken.UNKNOWN.value,
|
||||
media_type=MediaTypeToken.UNKNOWN,
|
||||
site_tag=site_tag,
|
||||
parse_path=ParsePath.AI.value,
|
||||
parse_path=ParsePath.AI,
|
||||
)
|
||||
|
||||
name = clean
|
||||
tokens = _tokenize(name, kb)
|
||||
|
||||
season, episode, episode_end = _extract_season_episode(tokens)
|
||||
quality, source, codec, group, tech_tokens = _extract_tech(tokens, kb)
|
||||
languages, lang_tokens = _extract_languages(tokens, kb)
|
||||
audio_codec, audio_channels, audio_tokens = _extract_audio(tokens, kb)
|
||||
bit_depth, hdr_format, video_tokens = _extract_video_meta(tokens, kb)
|
||||
edition, edition_tokens = _extract_edition(tokens, kb)
|
||||
title = _extract_title(
|
||||
tokens,
|
||||
tech_tokens | lang_tokens | audio_tokens | video_tokens | edition_tokens,
|
||||
kb,
|
||||
)
|
||||
year = _extract_year(tokens, title)
|
||||
media_type = _infer_media_type(
|
||||
season, quality, source, codec, year, edition, tokens, kb
|
||||
report = ParseReport(
|
||||
confidence=0,
|
||||
road=_scoring.Road.PATH_OF_PAIN.value,
|
||||
unknown_tokens=(clean,),
|
||||
missing_critical=("title", "media_type", "year"),
|
||||
)
|
||||
return parsed, report
|
||||
|
||||
tech_parts = [p for p in [quality, source, codec] if p]
|
||||
tech_string = ".".join(tech_parts)
|
||||
tokens, v2_tag = _v2.tokenize(working_name, kb)
|
||||
annotated = _v2.annotate(tokens, kb)
|
||||
fields = _v2.assemble(annotated, v2_tag, name, kb)
|
||||
|
||||
return ParsedRelease(
|
||||
parsed = ParsedRelease(
|
||||
raw=name,
|
||||
normalised=name,
|
||||
title=title,
|
||||
title_sanitized=kb.sanitize_for_fs(title),
|
||||
year=year,
|
||||
season=season,
|
||||
episode=episode,
|
||||
episode_end=episode_end,
|
||||
quality=quality,
|
||||
source=source,
|
||||
codec=codec,
|
||||
group=group,
|
||||
tech_string=tech_string,
|
||||
media_type=media_type,
|
||||
site_tag=site_tag,
|
||||
clean=clean,
|
||||
parse_path=parse_path,
|
||||
languages=languages,
|
||||
audio_codec=audio_codec,
|
||||
audio_channels=audio_channels,
|
||||
bit_depth=bit_depth,
|
||||
hdr_format=hdr_format,
|
||||
edition=edition,
|
||||
**fields,
|
||||
)
|
||||
|
||||
|
||||
def _infer_media_type(
|
||||
season: int | None,
|
||||
quality: str | None,
|
||||
source: str | None,
|
||||
codec: str | None,
|
||||
year: int | None,
|
||||
edition: str | None,
|
||||
tokens: list[str],
|
||||
kb: ReleaseKnowledge,
|
||||
) -> str:
|
||||
"""
|
||||
Infer media_type from token-level evidence only (no filesystem access).
|
||||
|
||||
- documentary : DOC token present
|
||||
- concert : CONCERT token present
|
||||
- tv_complete : INTEGRALE/COMPLETE token, no season
|
||||
- tv_show : season token found
|
||||
- movie : no season, at least one tech marker
|
||||
- unknown : no conclusive evidence
|
||||
"""
|
||||
upper_tokens = {t.upper() for t in tokens}
|
||||
|
||||
doc_tokens = {t.upper() for t in kb.media_type_tokens.get("doc", [])}
|
||||
concert_tokens = {t.upper() for t in kb.media_type_tokens.get("concert", [])}
|
||||
integrale_tokens = {t.upper() for t in kb.media_type_tokens.get("integrale", [])}
|
||||
|
||||
if upper_tokens & doc_tokens:
|
||||
return MediaTypeToken.DOCUMENTARY.value
|
||||
if upper_tokens & concert_tokens:
|
||||
return MediaTypeToken.CONCERT.value
|
||||
if (
|
||||
edition in {"COMPLETE", "INTEGRALE", "COLLECTION"}
|
||||
or upper_tokens & integrale_tokens
|
||||
) and season is None:
|
||||
return MediaTypeToken.TV_COMPLETE.value
|
||||
if season is not None:
|
||||
return MediaTypeToken.TV_SHOW.value
|
||||
if any([quality, source, codec, year]):
|
||||
return MediaTypeToken.MOVIE.value
|
||||
return MediaTypeToken.UNKNOWN.value
|
||||
has_schema = _v2.has_known_schema(tokens, kb)
|
||||
score = _scoring.compute_score(parsed, annotated, kb)
|
||||
road = _scoring.decide_road(score, has_schema, kb)
|
||||
report = ParseReport(
|
||||
confidence=score,
|
||||
road=road.value,
|
||||
unknown_tokens=_scoring.collect_unknown_tokens(annotated),
|
||||
missing_critical=_scoring.collect_missing_critical(parsed),
|
||||
)
|
||||
return parsed, report
|
||||
|
||||
|
||||
def _is_well_formed(name: str, kb: ReleaseKnowledge) -> bool:
|
||||
"""Return True if name contains no forbidden characters per scene naming rules.
|
||||
"""Return True if ``name`` contains no forbidden characters per scene
|
||||
naming rules.
|
||||
|
||||
Characters listed as token separators (spaces, brackets, parens, …) are NOT
|
||||
considered malforming — the tokenizer handles them. Only truly broken chars
|
||||
like '@', '#', '!', '%' make a name malformed.
|
||||
Characters listed as token separators (spaces, brackets, parens, …)
|
||||
are NOT considered malforming — the tokenizer handles them. Only
|
||||
truly broken chars like ``@``, ``#``, ``!``, ``%`` make a name
|
||||
malformed.
|
||||
"""
|
||||
tokenizable = set(kb.separators)
|
||||
return not any(c in name for c in kb.forbidden_chars if c not in tokenizable)
|
||||
|
||||
|
||||
def _strip_site_tag(name: str) -> tuple[str, str | None]:
|
||||
"""
|
||||
Strip a site watermark tag from the release name and return (clean_name, tag).
|
||||
|
||||
Handles two positions:
|
||||
- Prefix: "[ OxTorrent.vc ] The.Title.S01..."
|
||||
- Suffix: "The.Title.S01...-NTb[TGx]"
|
||||
|
||||
Anything between [...] is treated as a site tag.
|
||||
Returns (original_name, None) if no tag found.
|
||||
"""
|
||||
s = name.strip()
|
||||
|
||||
if s.startswith("["):
|
||||
close = s.find("]")
|
||||
if close != -1:
|
||||
tag = s[1:close].strip()
|
||||
remainder = s[close + 1 :].strip()
|
||||
if tag and remainder:
|
||||
return remainder, tag
|
||||
|
||||
if s.endswith("]"):
|
||||
open_bracket = s.rfind("[")
|
||||
if open_bracket != -1:
|
||||
tag = s[open_bracket + 1 : -1].strip()
|
||||
remainder = s[:open_bracket].strip()
|
||||
if tag and remainder:
|
||||
return remainder, tag
|
||||
|
||||
return s, None
|
||||
|
||||
|
||||
def _parse_season_episode(tok: str) -> tuple[int, int | None, int | None] | None:
|
||||
"""
|
||||
Parse a single token as a season/episode marker.
|
||||
|
||||
Handles:
|
||||
- SxxExx / SxxExxExx / Sxx (canonical scene form)
|
||||
- NxNN / NxNNxNN (alt form: 1x05, 12x07x08)
|
||||
|
||||
Returns (season, episode, episode_end) or None if not a season token.
|
||||
"""
|
||||
upper = tok.upper()
|
||||
|
||||
# SxxExx form
|
||||
if len(upper) >= 3 and upper[0] == "S" and upper[1:3].isdigit():
|
||||
season = int(upper[1:3])
|
||||
rest = upper[3:]
|
||||
|
||||
if not rest:
|
||||
return season, None, None
|
||||
|
||||
episodes: list[int] = []
|
||||
while rest.startswith("E") and len(rest) >= 3 and rest[1:3].isdigit():
|
||||
episodes.append(int(rest[1:3]))
|
||||
rest = rest[3:]
|
||||
|
||||
if not episodes:
|
||||
return None # malformed token like "S03XYZ"
|
||||
|
||||
return season, episodes[0], episodes[1] if len(episodes) >= 2 else None
|
||||
|
||||
# NxNN form — split on "X" (uppercased), all parts must be digits
|
||||
if "X" in upper:
|
||||
parts = upper.split("X")
|
||||
if len(parts) >= 2 and all(p.isdigit() and p for p in parts):
|
||||
season = int(parts[0])
|
||||
episode = int(parts[1])
|
||||
episode_end = int(parts[2]) if len(parts) >= 3 else None
|
||||
return season, episode, episode_end
|
||||
|
||||
return None
|
||||
|
||||
|
||||
def _extract_season_episode(
|
||||
tokens: list[str],
|
||||
) -> tuple[int | None, int | None, int | None]:
|
||||
for tok in tokens:
|
||||
parsed = _parse_season_episode(tok)
|
||||
if parsed is not None:
|
||||
return parsed
|
||||
return None, None, None
|
||||
|
||||
|
||||
def _extract_tech(
|
||||
tokens: list[str],
|
||||
kb: ReleaseKnowledge,
|
||||
) -> tuple[str | None, str | None, str | None, str, set[str]]:
|
||||
"""
|
||||
Extract quality, source, codec, group from tokens.
|
||||
|
||||
Returns (quality, source, codec, group, tech_token_set).
|
||||
|
||||
Group extraction strategy (in priority order):
|
||||
1. Token where prefix is a known codec: x265-GROUP
|
||||
2. Rightmost token with a dash that isn't a known source
|
||||
"""
|
||||
quality: str | None = None
|
||||
source: str | None = None
|
||||
codec: str | None = None
|
||||
group = "UNKNOWN"
|
||||
tech_tokens: set[str] = set()
|
||||
|
||||
for tok in tokens:
|
||||
tl = tok.lower()
|
||||
|
||||
if tl in kb.resolutions:
|
||||
quality = tok
|
||||
tech_tokens.add(tok)
|
||||
continue
|
||||
|
||||
if tl in kb.sources:
|
||||
source = tok
|
||||
tech_tokens.add(tok)
|
||||
continue
|
||||
|
||||
if "-" in tok:
|
||||
parts = tok.rsplit("-", 1)
|
||||
# codec-GROUP (highest priority for group)
|
||||
if parts[0].lower() in kb.codecs:
|
||||
codec = parts[0]
|
||||
group = parts[1] if parts[1] else "UNKNOWN"
|
||||
tech_tokens.add(tok)
|
||||
continue
|
||||
# source with dash: Web-DL, WEB-DL, etc.
|
||||
if parts[0].lower() in kb.sources or tok.lower().replace("-", "") in kb.sources:
|
||||
source = tok
|
||||
tech_tokens.add(tok)
|
||||
continue
|
||||
|
||||
if tl in kb.codecs:
|
||||
codec = tok
|
||||
tech_tokens.add(tok)
|
||||
|
||||
# Fallback: rightmost token with a dash that isn't a known source
|
||||
if group == "UNKNOWN":
|
||||
for tok in reversed(tokens):
|
||||
if "-" in tok:
|
||||
parts = tok.rsplit("-", 1)
|
||||
tl = tok.lower()
|
||||
if tl in kb.sources or tok.lower().replace("-", "") in kb.sources:
|
||||
continue
|
||||
if parts[1]:
|
||||
group = parts[1]
|
||||
break
|
||||
|
||||
return quality, source, codec, group, tech_tokens
|
||||
|
||||
|
||||
def _is_year_token(tok: str) -> bool:
|
||||
"""Return True if tok is a 4-digit year between 1900 and 2099."""
|
||||
return len(tok) == 4 and tok.isdigit() and 1900 <= int(tok) <= 2099
|
||||
|
||||
|
||||
def _extract_title(
|
||||
tokens: list[str], tech_tokens: set[str], kb: ReleaseKnowledge
|
||||
) -> str:
|
||||
"""Extract the title portion: everything before the first season/year/tech token."""
|
||||
title_parts = []
|
||||
known_tech = kb.resolutions | kb.sources | kb.codecs
|
||||
for tok in tokens:
|
||||
if _parse_season_episode(tok) is not None:
|
||||
break
|
||||
if _is_year_token(tok):
|
||||
break
|
||||
if tok in tech_tokens or tok.lower() in known_tech:
|
||||
break
|
||||
if "-" in tok and any(p.lower() in kb.codecs | kb.sources for p in tok.split("-")):
|
||||
break
|
||||
title_parts.append(tok)
|
||||
|
||||
return ".".join(title_parts) if title_parts else tokens[0]
|
||||
|
||||
|
||||
def _extract_year(tokens: list[str], title: str) -> int | None:
|
||||
"""Extract a 4-digit year from tokens (only after the title)."""
|
||||
title_len = len(title.split("."))
|
||||
for tok in tokens[title_len:]:
|
||||
if _is_year_token(tok):
|
||||
return int(tok)
|
||||
return None
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Sequence matcher
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
|
||||
def _match_sequences(
|
||||
tokens: list[str],
|
||||
sequences: list[dict],
|
||||
key: str,
|
||||
) -> tuple[str | None, set[str]]:
|
||||
"""
|
||||
Try to match multi-token sequences against consecutive tokens.
|
||||
|
||||
Returns (matched_value, set_of_matched_tokens) or (None, empty_set).
|
||||
Sequences must be ordered most-specific first in the YAML.
|
||||
"""
|
||||
upper_tokens = [t.upper() for t in tokens]
|
||||
for seq in sequences:
|
||||
seq_upper = [s.upper() for s in seq["tokens"]]
|
||||
n = len(seq_upper)
|
||||
for i in range(len(upper_tokens) - n + 1):
|
||||
if upper_tokens[i : i + n] == seq_upper:
|
||||
matched = set(tokens[i : i + n])
|
||||
return seq[key], matched
|
||||
return None, set()
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Language extraction
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
|
||||
def _extract_languages(
|
||||
tokens: list[str], kb: ReleaseKnowledge
|
||||
) -> tuple[list[str], set[str]]:
|
||||
"""Extract language tokens. Returns (languages, matched_token_set)."""
|
||||
languages = []
|
||||
lang_tokens: set[str] = set()
|
||||
for tok in tokens:
|
||||
if tok.upper() in kb.language_tokens:
|
||||
languages.append(tok.upper())
|
||||
lang_tokens.add(tok)
|
||||
return languages, lang_tokens
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Audio extraction
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
|
||||
def _extract_audio(
|
||||
tokens: list[str], kb: ReleaseKnowledge,
|
||||
) -> tuple[str | None, str | None, set[str]]:
|
||||
"""
|
||||
Extract audio codec and channel layout.
|
||||
|
||||
Returns (audio_codec, audio_channels, matched_token_set).
|
||||
Sequences are tried first (DTS.HD.MA, TrueHD.Atmos, …), then single tokens.
|
||||
"""
|
||||
audio_codec: str | None = None
|
||||
audio_channels: str | None = None
|
||||
audio_tokens: set[str] = set()
|
||||
|
||||
known_codecs = {c.upper() for c in kb.audio.get("codecs", [])}
|
||||
known_channels = set(kb.audio.get("channels", []))
|
||||
|
||||
# Try multi-token sequences first
|
||||
matched_codec, matched_set = _match_sequences(
|
||||
tokens, kb.audio.get("sequences", []), "codec"
|
||||
)
|
||||
if matched_codec:
|
||||
audio_codec = matched_codec
|
||||
audio_tokens |= matched_set
|
||||
|
||||
# Channel layouts like "5.1" or "7.1" are split into two tokens by normalize —
|
||||
# detect them as consecutive pairs "X" + "Y" where "X.Y" is a known channel.
|
||||
# The second token may have a "-GROUP" suffix (e.g. "1-KTH" → strip it).
|
||||
for i in range(len(tokens) - 1):
|
||||
second = tokens[i + 1].split("-")[0]
|
||||
candidate = f"{tokens[i]}.{second}"
|
||||
if candidate in known_channels and audio_channels is None:
|
||||
audio_channels = candidate
|
||||
audio_tokens.add(tokens[i])
|
||||
audio_tokens.add(tokens[i + 1])
|
||||
|
||||
for tok in tokens:
|
||||
if tok in audio_tokens:
|
||||
continue
|
||||
if tok.upper() in known_codecs and audio_codec is None:
|
||||
audio_codec = tok
|
||||
audio_tokens.add(tok)
|
||||
elif tok in known_channels and audio_channels is None:
|
||||
audio_channels = tok
|
||||
audio_tokens.add(tok)
|
||||
|
||||
return audio_codec, audio_channels, audio_tokens
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Video metadata extraction (bit depth, HDR)
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
|
||||
def _extract_video_meta(
|
||||
tokens: list[str], kb: ReleaseKnowledge,
|
||||
) -> tuple[str | None, str | None, set[str]]:
|
||||
"""
|
||||
Extract bit depth and HDR format.
|
||||
|
||||
Returns (bit_depth, hdr_format, matched_token_set).
|
||||
"""
|
||||
bit_depth: str | None = None
|
||||
hdr_format: str | None = None
|
||||
video_tokens: set[str] = set()
|
||||
|
||||
known_hdr = {h.upper() for h in kb.video_meta.get("hdr", [])} | kb.hdr_extra
|
||||
known_depth = {d.lower() for d in kb.video_meta.get("bit_depth", [])}
|
||||
|
||||
# Try HDR sequences first
|
||||
matched_hdr, matched_set = _match_sequences(
|
||||
tokens, kb.video_meta.get("sequences", []), "hdr"
|
||||
)
|
||||
if matched_hdr:
|
||||
hdr_format = matched_hdr
|
||||
video_tokens |= matched_set
|
||||
|
||||
for tok in tokens:
|
||||
if tok in video_tokens:
|
||||
continue
|
||||
if tok.upper() in known_hdr and hdr_format is None:
|
||||
hdr_format = tok.upper()
|
||||
video_tokens.add(tok)
|
||||
elif tok.lower() in known_depth and bit_depth is None:
|
||||
bit_depth = tok.lower()
|
||||
video_tokens.add(tok)
|
||||
|
||||
return bit_depth, hdr_format, video_tokens
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Edition extraction
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
|
||||
def _extract_edition(
|
||||
tokens: list[str], kb: ReleaseKnowledge
|
||||
) -> tuple[str | None, set[str]]:
|
||||
"""
|
||||
Extract release edition (UNRATED, EXTENDED, DIRECTORS.CUT, …).
|
||||
|
||||
Returns (edition, matched_token_set).
|
||||
"""
|
||||
known_tokens = {t.upper() for t in kb.editions.get("tokens", [])}
|
||||
|
||||
# Try multi-token sequences first
|
||||
matched_edition, matched_set = _match_sequences(
|
||||
tokens, kb.editions.get("sequences", []), "edition"
|
||||
)
|
||||
if matched_edition:
|
||||
return matched_edition, matched_set
|
||||
|
||||
for tok in tokens:
|
||||
if tok.upper() in known_tokens:
|
||||
return tok.upper(), {tok}
|
||||
|
||||
return None, set()
|
||||
|
||||
@@ -49,10 +49,6 @@ class ParsePath(str, Enum):
|
||||
AI = "ai"
|
||||
|
||||
|
||||
_VALID_MEDIA_TYPES: frozenset[str] = frozenset(m.value for m in MediaTypeToken)
|
||||
_VALID_PARSE_PATHS: frozenset[str] = frozenset(p.value for p in ParsePath)
|
||||
|
||||
|
||||
def _strip_episode_from_normalized(normalized: str) -> str:
|
||||
"""
|
||||
Remove all episode parts (Exx) from a normalized release name, keeping Sxx.
|
||||
@@ -72,6 +68,40 @@ def _strip_episode_from_normalized(normalized: str) -> str:
|
||||
return ".".join(result)
|
||||
|
||||
|
||||
@dataclass(frozen=True)
|
||||
class ParseReport:
|
||||
"""Diagnostic report attached to a :class:`ParsedRelease`.
|
||||
|
||||
``parse_release`` returns ``(ParsedRelease, ParseReport)``. The
|
||||
report describes *how confident* the parser is in the result and
|
||||
*which road* produced it. It is intentionally separate from
|
||||
``ParsedRelease`` so the structural VO stays free of meta-concerns
|
||||
about its own quality.
|
||||
|
||||
Fields:
|
||||
|
||||
- ``confidence``: integer 0–100 (see :func:`parser.scoring.compute_score`).
|
||||
- ``road``: ``"easy"`` / ``"shitty"`` / ``"path_of_pain"`` — distinct
|
||||
from ``ParsedRelease.parse_path`` (which describes the
|
||||
tokenization route, not the confidence tier).
|
||||
- ``unknown_tokens``: tokens that finished annotation with role
|
||||
UNKNOWN, in order of appearance.
|
||||
- ``missing_critical``: names of critical structural fields the
|
||||
parser couldn't fill (subset of ``{"title", "media_type", "year"}``).
|
||||
"""
|
||||
|
||||
confidence: int
|
||||
road: str # one of parser.scoring.Road values
|
||||
unknown_tokens: tuple[str, ...] = ()
|
||||
missing_critical: tuple[str, ...] = ()
|
||||
|
||||
def __post_init__(self) -> None:
|
||||
if not (0 <= self.confidence <= 100):
|
||||
raise ValidationError(
|
||||
f"ParseReport.confidence out of range: {self.confidence}"
|
||||
)
|
||||
|
||||
|
||||
@dataclass
|
||||
class ParsedRelease:
|
||||
"""Structured representation of a parsed release name.
|
||||
@@ -82,7 +112,7 @@ class ParsedRelease:
|
||||
"""
|
||||
|
||||
raw: str # original release name (untouched)
|
||||
normalised: str # dots instead of spaces
|
||||
clean: str # raw minus site_tag and apostrophes — used by season_folder_name()
|
||||
title: str # show/movie title (dots, no year/season/tech)
|
||||
title_sanitized: str # title with filesystem-forbidden chars stripped
|
||||
year: int | None # movie year or show start year (from TMDB)
|
||||
@@ -105,6 +135,7 @@ class ParsedRelease:
|
||||
bit_depth: str | None = None # "10bit", "8bit", …
|
||||
hdr_format: str | None = None # "DV", "HDR10", "DV.HDR10", …
|
||||
edition: str | None = None # "UNRATED", "EXTENDED", "DIRECTORS.CUT", …
|
||||
distributor: str | None = None # "NF", "AMZN", "DSNP", … (streaming origin)
|
||||
|
||||
def __post_init__(self) -> None:
|
||||
if not self.raw:
|
||||
@@ -133,23 +164,16 @@ class ParsedRelease:
|
||||
f"ParsedRelease.episode_end ({self.episode_end}) < "
|
||||
f"episode ({self.episode})"
|
||||
)
|
||||
# Coerce raw strings into their enum form (tolerant constructor).
|
||||
if not isinstance(self.media_type, MediaTypeToken):
|
||||
try:
|
||||
self.media_type = MediaTypeToken(self.media_type)
|
||||
except ValueError:
|
||||
raise ValidationError(
|
||||
f"ParsedRelease.media_type invalid: {self.media_type!r} "
|
||||
f"(expected one of {sorted(_VALID_MEDIA_TYPES)})"
|
||||
) from None
|
||||
f"ParsedRelease.media_type must be a MediaTypeToken, "
|
||||
f"got {type(self.media_type).__name__}: {self.media_type!r}"
|
||||
)
|
||||
if not isinstance(self.parse_path, ParsePath):
|
||||
try:
|
||||
self.parse_path = ParsePath(self.parse_path)
|
||||
except ValueError:
|
||||
raise ValidationError(
|
||||
f"ParsedRelease.parse_path invalid: {self.parse_path!r} "
|
||||
f"(expected one of {sorted(_VALID_PARSE_PATHS)})"
|
||||
) from None
|
||||
f"ParsedRelease.parse_path must be a ParsePath, "
|
||||
f"got {type(self.parse_path).__name__}: {self.parse_path!r}"
|
||||
)
|
||||
|
||||
@property
|
||||
def is_season_pack(self) -> bool:
|
||||
@@ -177,7 +201,7 @@ class ParsedRelease:
|
||||
For a single-episode release we still strip the episode token so the
|
||||
folder can hold the whole season.
|
||||
"""
|
||||
return _strip_episode_from_normalized(self.normalised)
|
||||
return _strip_episode_from_normalized(self.clean)
|
||||
|
||||
def episode_filename(self, tmdb_episode_title_safe: str | None, ext: str) -> str:
|
||||
"""
|
||||
|
||||
@@ -0,0 +1,267 @@
|
||||
"""Media — file-level track types (video/audio/subtitle) and MediaInfo container.
|
||||
|
||||
These are the **container-view** dataclasses, populated from ffprobe output and
|
||||
used across the project to describe the content of a media file.
|
||||
|
||||
Not to be confused with ``alfred.domain.subtitles.entities.SubtitleCandidate``
|
||||
which models a subtitle being **scanned/matched** (with confidence, raw tokens,
|
||||
file path, etc.). The two coexist by design — they describe the same real-world
|
||||
concept seen from two different bounded contexts.
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
from dataclasses import dataclass, field
|
||||
|
||||
from .value_objects import Language
|
||||
|
||||
__all__ = [
|
||||
"AudioTrack",
|
||||
"MediaInfo",
|
||||
"MediaWithTracks",
|
||||
"SubtitleTrack",
|
||||
"VideoTrack",
|
||||
"track_lang_matches",
|
||||
]
|
||||
|
||||
|
||||
# ─────────────────────────────────────────────────────────────────────────────
|
||||
# Track types — one frozen dataclass per stream kind
|
||||
# ─────────────────────────────────────────────────────────────────────────────
|
||||
|
||||
|
||||
@dataclass(frozen=True)
|
||||
class AudioTrack:
|
||||
"""A single audio track as reported by ffprobe."""
|
||||
|
||||
index: int
|
||||
codec: str | None # aac, ac3, eac3, dts, truehd, flac, …
|
||||
channels: int | None # 2, 6 (5.1), 8 (7.1), …
|
||||
channel_layout: str | None # stereo, 5.1, 7.1, …
|
||||
language: str | None # ISO 639-2: fre, eng, und, …
|
||||
is_default: bool = False
|
||||
|
||||
|
||||
@dataclass(frozen=True)
|
||||
class SubtitleTrack:
|
||||
"""A single embedded subtitle track as reported by ffprobe."""
|
||||
|
||||
index: int
|
||||
codec: str | None # subrip, ass, hdmv_pgs_subtitle, …
|
||||
language: str | None # ISO 639-2: fre, eng, und, …
|
||||
is_default: bool = False
|
||||
is_forced: bool = False
|
||||
|
||||
|
||||
@dataclass(frozen=True)
|
||||
class VideoTrack:
|
||||
"""A single video track as reported by ffprobe.
|
||||
|
||||
A media file typically has one video track but can have several (alt
|
||||
camera angles, attached thumbnail images reported as still-image streams,
|
||||
etc.), hence the list[VideoTrack] on MediaInfo.
|
||||
"""
|
||||
|
||||
index: int
|
||||
codec: str | None # h264, hevc, av1, …
|
||||
width: int | None
|
||||
height: int | None
|
||||
is_default: bool = False
|
||||
|
||||
@property
|
||||
def resolution(self) -> str | None:
|
||||
"""
|
||||
Best-effort resolution string: 2160p, 1080p, 720p, …
|
||||
|
||||
Width takes priority over height to handle widescreen/cinema crops
|
||||
(e.g. 1920×960 scope → 1080p, not 720p). Falls back to height when
|
||||
width is unavailable.
|
||||
"""
|
||||
match (self.width, self.height):
|
||||
case (None, None):
|
||||
return None
|
||||
case (w, h) if w is not None:
|
||||
match True:
|
||||
case _ if w >= 3840:
|
||||
return "2160p"
|
||||
case _ if w >= 1920:
|
||||
return "1080p"
|
||||
case _ if w >= 1280:
|
||||
return "720p"
|
||||
case _ if w >= 720:
|
||||
return "576p"
|
||||
case _ if w >= 640:
|
||||
return "480p"
|
||||
case _:
|
||||
return f"{h}p" if h else f"{w}w"
|
||||
case (None, h):
|
||||
match True:
|
||||
case _ if h >= 2160:
|
||||
return "2160p"
|
||||
case _ if h >= 1080:
|
||||
return "1080p"
|
||||
case _ if h >= 720:
|
||||
return "720p"
|
||||
case _ if h >= 576:
|
||||
return "576p"
|
||||
case _ if h >= 480:
|
||||
return "480p"
|
||||
case _:
|
||||
return f"{h}p"
|
||||
|
||||
|
||||
# ─────────────────────────────────────────────────────────────────────────────
|
||||
# MediaInfo — assembles video/audio/subtitle tracks for a media file
|
||||
# ─────────────────────────────────────────────────────────────────────────────
|
||||
|
||||
|
||||
@dataclass(frozen=True)
|
||||
class MediaInfo:
|
||||
"""
|
||||
File-level media metadata extracted by ffprobe — immutable snapshot.
|
||||
|
||||
Symmetric design: every stream type is a tuple of typed track objects
|
||||
(immutable on purpose — a MediaInfo is a frozen view of one ffprobe run,
|
||||
not a mutable collection to append to).
|
||||
Backwards-compatible flat accessors (``resolution``, ``width``, …) read
|
||||
from the first video track when present.
|
||||
"""
|
||||
|
||||
video_tracks: tuple[VideoTrack, ...] = field(default_factory=tuple)
|
||||
audio_tracks: tuple[AudioTrack, ...] = field(default_factory=tuple)
|
||||
subtitle_tracks: tuple[SubtitleTrack, ...] = field(default_factory=tuple)
|
||||
|
||||
# File-level (from ffprobe ``format`` block, not from any single stream)
|
||||
duration_seconds: float | None = None
|
||||
bitrate_kbps: int | None = None
|
||||
|
||||
# ──────────────────────────────────────────────────────────────────────
|
||||
# Video conveniences — read the first video track
|
||||
# ──────────────────────────────────────────────────────────────────────
|
||||
|
||||
@property
|
||||
def primary_video(self) -> VideoTrack | None:
|
||||
return self.video_tracks[0] if self.video_tracks else None
|
||||
|
||||
@property
|
||||
def width(self) -> int | None:
|
||||
v = self.primary_video
|
||||
return v.width if v else None
|
||||
|
||||
@property
|
||||
def height(self) -> int | None:
|
||||
v = self.primary_video
|
||||
return v.height if v else None
|
||||
|
||||
@property
|
||||
def video_codec(self) -> str | None:
|
||||
v = self.primary_video
|
||||
return v.codec if v else None
|
||||
|
||||
@property
|
||||
def resolution(self) -> str | None:
|
||||
v = self.primary_video
|
||||
return v.resolution if v else None
|
||||
|
||||
# ──────────────────────────────────────────────────────────────────────
|
||||
# Audio conveniences
|
||||
# ──────────────────────────────────────────────────────────────────────
|
||||
|
||||
@property
|
||||
def audio_languages(self) -> list[str]:
|
||||
"""Unique audio languages across all tracks (ISO 639-2)."""
|
||||
seen: set[str] = set()
|
||||
result: list[str] = []
|
||||
for track in self.audio_tracks:
|
||||
if track.language and track.language not in seen:
|
||||
seen.add(track.language)
|
||||
result.append(track.language)
|
||||
return result
|
||||
|
||||
@property
|
||||
def is_multi_audio(self) -> bool:
|
||||
"""True if more than one audio language is present."""
|
||||
return len(self.audio_languages) > 1
|
||||
|
||||
|
||||
# ─────────────────────────────────────────────────────────────────────────────
|
||||
# Language matching — shared helper + mixin
|
||||
# ─────────────────────────────────────────────────────────────────────────────
|
||||
|
||||
|
||||
def track_lang_matches(track_lang: str | None, query: str | Language) -> bool:
|
||||
"""
|
||||
Match a track's language string against a query (contract "C+").
|
||||
|
||||
* ``Language`` query → matches if the track string is any known
|
||||
representation of that Language (delegates to ``Language.matches``).
|
||||
Powerful, cross-format mode.
|
||||
* ``str`` query → case-insensitive direct comparison against
|
||||
``track_lang``. Simple, no normalization, no registry lookup.
|
||||
|
||||
Callers needing cross-format resolution (``"fr"`` ↔ ``"fre"`` ↔
|
||||
``"french"``) should resolve their string through a ``LanguageRegistry``
|
||||
once and pass the resulting ``Language``.
|
||||
"""
|
||||
if track_lang is None:
|
||||
return False
|
||||
if isinstance(query, Language):
|
||||
return query.matches(track_lang)
|
||||
if isinstance(query, str):
|
||||
return track_lang.lower().strip() == query.lower().strip()
|
||||
return False
|
||||
|
||||
|
||||
class MediaWithTracks:
|
||||
"""
|
||||
Mixin providing audio/subtitle helpers for entities with track collections.
|
||||
|
||||
Hosts must expose two attributes:
|
||||
|
||||
* ``audio_tracks: list[AudioTrack]``
|
||||
* ``subtitle_tracks: list[SubtitleTrack]``
|
||||
|
||||
The helpers follow the "C+" matching contract: pass a :class:`Language`
|
||||
for cross-format matching, or a ``str`` for case-insensitive comparison.
|
||||
"""
|
||||
|
||||
# These attributes are provided by the host entity (Movie, Episode, …).
|
||||
# Declared here only for type-checkers and to make the contract explicit.
|
||||
audio_tracks: list[AudioTrack]
|
||||
subtitle_tracks: list[SubtitleTrack]
|
||||
|
||||
# ── Audio helpers ──────────────────────────────────────────────────────
|
||||
|
||||
def has_audio_in(self, lang: str | Language) -> bool:
|
||||
"""True if at least one audio track is in the given language."""
|
||||
return any(track_lang_matches(t.language, lang) for t in self.audio_tracks)
|
||||
|
||||
def audio_languages(self) -> list[str]:
|
||||
"""Unique audio languages across all tracks, in track order."""
|
||||
seen: set[str] = set()
|
||||
result: list[str] = []
|
||||
for t in self.audio_tracks:
|
||||
if t.language and t.language not in seen:
|
||||
seen.add(t.language)
|
||||
result.append(t.language)
|
||||
return result
|
||||
|
||||
# ── Subtitle helpers ───────────────────────────────────────────────────
|
||||
|
||||
def has_subtitles_in(self, lang: str | Language) -> bool:
|
||||
"""True if at least one subtitle track is in the given language."""
|
||||
return any(track_lang_matches(t.language, lang) for t in self.subtitle_tracks)
|
||||
|
||||
def has_forced_subs(self) -> bool:
|
||||
"""True if at least one subtitle track is flagged as forced."""
|
||||
return any(t.is_forced for t in self.subtitle_tracks)
|
||||
|
||||
def subtitle_languages(self) -> list[str]:
|
||||
"""Unique subtitle languages across all tracks, in track order."""
|
||||
seen: set[str] = set()
|
||||
result: list[str] = []
|
||||
for t in self.subtitle_tracks:
|
||||
if t.language and t.language not in seen:
|
||||
seen.add(t.language)
|
||||
result.append(t.language)
|
||||
return result
|
||||
@@ -1,21 +0,0 @@
|
||||
"""Media — file-level track types (video/audio/subtitle) and MediaInfo container.
|
||||
|
||||
These are the **container-view** dataclasses, populated from ffprobe output and
|
||||
used across the project to describe the content of a media file.
|
||||
"""
|
||||
|
||||
from .audio import AudioTrack
|
||||
from .info import MediaInfo
|
||||
from .matching import track_lang_matches
|
||||
from .subtitle import SubtitleTrack
|
||||
from .tracks_mixin import MediaWithTracks
|
||||
from .video import VideoTrack
|
||||
|
||||
__all__ = [
|
||||
"AudioTrack",
|
||||
"MediaInfo",
|
||||
"MediaWithTracks",
|
||||
"SubtitleTrack",
|
||||
"VideoTrack",
|
||||
"track_lang_matches",
|
||||
]
|
||||
@@ -1,17 +0,0 @@
|
||||
"""AudioTrack — a single audio stream as reported by ffprobe."""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
from dataclasses import dataclass
|
||||
|
||||
|
||||
@dataclass(frozen=True)
|
||||
class AudioTrack:
|
||||
"""A single audio track as reported by ffprobe."""
|
||||
|
||||
index: int
|
||||
codec: str | None # aac, ac3, eac3, dts, truehd, flac, …
|
||||
channels: int | None # 2, 6 (5.1), 8 (7.1), …
|
||||
channel_layout: str | None # stereo, 5.1, 7.1, …
|
||||
language: str | None # ISO 639-2: fre, eng, und, …
|
||||
is_default: bool = False
|
||||
@@ -1,78 +0,0 @@
|
||||
"""MediaInfo — assembles video, audio and subtitle tracks for a media file."""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
from dataclasses import dataclass, field
|
||||
|
||||
from .audio import AudioTrack
|
||||
from .subtitle import SubtitleTrack
|
||||
from .video import VideoTrack
|
||||
|
||||
|
||||
@dataclass(frozen=True)
|
||||
class MediaInfo:
|
||||
"""
|
||||
File-level media metadata extracted by ffprobe — immutable snapshot.
|
||||
|
||||
Symmetric design: every stream type is a tuple of typed track objects
|
||||
(immutable on purpose — a MediaInfo is a frozen view of one ffprobe run,
|
||||
not a mutable collection to append to).
|
||||
Backwards-compatible flat accessors (``resolution``, ``width``, …) read
|
||||
from the first video track when present.
|
||||
"""
|
||||
|
||||
video_tracks: tuple[VideoTrack, ...] = field(default_factory=tuple)
|
||||
audio_tracks: tuple[AudioTrack, ...] = field(default_factory=tuple)
|
||||
subtitle_tracks: tuple[SubtitleTrack, ...] = field(default_factory=tuple)
|
||||
|
||||
# File-level (from ffprobe ``format`` block, not from any single stream)
|
||||
duration_seconds: float | None = None
|
||||
bitrate_kbps: int | None = None
|
||||
|
||||
# ──────────────────────────────────────────────────────────────────────
|
||||
# Video conveniences — read the first video track
|
||||
# ──────────────────────────────────────────────────────────────────────
|
||||
|
||||
@property
|
||||
def primary_video(self) -> VideoTrack | None:
|
||||
return self.video_tracks[0] if self.video_tracks else None
|
||||
|
||||
@property
|
||||
def width(self) -> int | None:
|
||||
v = self.primary_video
|
||||
return v.width if v else None
|
||||
|
||||
@property
|
||||
def height(self) -> int | None:
|
||||
v = self.primary_video
|
||||
return v.height if v else None
|
||||
|
||||
@property
|
||||
def video_codec(self) -> str | None:
|
||||
v = self.primary_video
|
||||
return v.codec if v else None
|
||||
|
||||
@property
|
||||
def resolution(self) -> str | None:
|
||||
v = self.primary_video
|
||||
return v.resolution if v else None
|
||||
|
||||
# ──────────────────────────────────────────────────────────────────────
|
||||
# Audio conveniences
|
||||
# ──────────────────────────────────────────────────────────────────────
|
||||
|
||||
@property
|
||||
def audio_languages(self) -> list[str]:
|
||||
"""Unique audio languages across all tracks (ISO 639-2)."""
|
||||
seen: set[str] = set()
|
||||
result: list[str] = []
|
||||
for track in self.audio_tracks:
|
||||
if track.language and track.language not in seen:
|
||||
seen.add(track.language)
|
||||
result.append(track.language)
|
||||
return result
|
||||
|
||||
@property
|
||||
def is_multi_audio(self) -> bool:
|
||||
"""True if more than one audio language is present."""
|
||||
return len(self.audio_languages) > 1
|
||||
@@ -1,33 +0,0 @@
|
||||
"""Language-matching helper shared by media-bearing entities.
|
||||
|
||||
Both ``Episode`` and ``Movie`` carry ``audio_tracks`` / ``subtitle_tracks`` and
|
||||
need to answer "do I have audio in language X?". The matching contract is the
|
||||
same in both cases — keep it in one place.
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
from ..value_objects import Language
|
||||
|
||||
|
||||
def track_lang_matches(track_lang: str | None, query: str | Language) -> bool:
|
||||
"""
|
||||
Match a track's language string against a query (contract "C+").
|
||||
|
||||
* ``Language`` query → matches if the track string is any known
|
||||
representation of that Language (delegates to ``Language.matches``).
|
||||
Powerful, cross-format mode.
|
||||
* ``str`` query → case-insensitive direct comparison against
|
||||
``track_lang``. Simple, no normalization, no registry lookup.
|
||||
|
||||
Callers needing cross-format resolution (``"fr"`` ↔ ``"fre"`` ↔
|
||||
``"french"``) should resolve their string through a ``LanguageRegistry``
|
||||
once and pass the resulting ``Language``.
|
||||
"""
|
||||
if track_lang is None:
|
||||
return False
|
||||
if isinstance(query, Language):
|
||||
return query.matches(track_lang)
|
||||
if isinstance(query, str):
|
||||
return track_lang.lower().strip() == query.lower().strip()
|
||||
return False
|
||||
@@ -1,25 +0,0 @@
|
||||
"""SubtitleTrack — a single embedded subtitle stream as reported by ffprobe.
|
||||
|
||||
This is the **container-view** representation (ffprobe output) used uniformly
|
||||
across the project to describe a subtitle stream embedded in a media file.
|
||||
|
||||
Not to be confused with ``alfred.domain.subtitles.entities.SubtitleCandidate``
|
||||
which models a subtitle being **scanned/matched** (with confidence, raw tokens,
|
||||
file path, etc.). The two coexist by design — they describe the same real-world
|
||||
concept seen from two different bounded contexts.
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
from dataclasses import dataclass
|
||||
|
||||
|
||||
@dataclass(frozen=True)
|
||||
class SubtitleTrack:
|
||||
"""A single embedded subtitle track as reported by ffprobe."""
|
||||
|
||||
index: int
|
||||
codec: str | None # subrip, ass, hdmv_pgs_subtitle, …
|
||||
language: str | None # ISO 639-2: fre, eng, und, …
|
||||
is_default: bool = False
|
||||
is_forced: bool = False
|
||||
@@ -1,77 +0,0 @@
|
||||
"""Mixin shared by entities that carry audio + subtitle tracks.
|
||||
|
||||
Both ``Movie`` and ``Episode`` carry a ``list[AudioTrack]`` plus a
|
||||
``list[SubtitleTrack]`` and answer the same 5 queries about them (language
|
||||
presence, unique languages, forced flag). Keep that behavior in one place so a
|
||||
fix in one is a fix in both.
|
||||
|
||||
The mixin is plain Python (no dataclass machinery) so it composes cleanly with
|
||||
``@dataclass`` entities — it only reads ``self.audio_tracks`` and
|
||||
``self.subtitle_tracks`` which the host class provides as fields.
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
from typing import TYPE_CHECKING
|
||||
|
||||
from ..value_objects import Language
|
||||
from .matching import track_lang_matches
|
||||
|
||||
if TYPE_CHECKING:
|
||||
from .audio import AudioTrack
|
||||
from .subtitle import SubtitleTrack
|
||||
|
||||
|
||||
class MediaWithTracks:
|
||||
"""
|
||||
Mixin providing audio/subtitle helpers for entities with track collections.
|
||||
|
||||
Hosts must expose two attributes:
|
||||
|
||||
* ``audio_tracks: list[AudioTrack]``
|
||||
* ``subtitle_tracks: list[SubtitleTrack]``
|
||||
|
||||
The helpers follow the "C+" matching contract: pass a :class:`Language`
|
||||
for cross-format matching, or a ``str`` for case-insensitive comparison.
|
||||
"""
|
||||
|
||||
# These attributes are provided by the host entity (Movie, Episode, …).
|
||||
# Declared here only for type-checkers and to make the contract explicit.
|
||||
audio_tracks: list["AudioTrack"]
|
||||
subtitle_tracks: list["SubtitleTrack"]
|
||||
|
||||
# ── Audio helpers ──────────────────────────────────────────────────────
|
||||
|
||||
def has_audio_in(self, lang: str | Language) -> bool:
|
||||
"""True if at least one audio track is in the given language."""
|
||||
return any(track_lang_matches(t.language, lang) for t in self.audio_tracks)
|
||||
|
||||
def audio_languages(self) -> list[str]:
|
||||
"""Unique audio languages across all tracks, in track order."""
|
||||
seen: set[str] = set()
|
||||
result: list[str] = []
|
||||
for t in self.audio_tracks:
|
||||
if t.language and t.language not in seen:
|
||||
seen.add(t.language)
|
||||
result.append(t.language)
|
||||
return result
|
||||
|
||||
# ── Subtitle helpers ───────────────────────────────────────────────────
|
||||
|
||||
def has_subtitles_in(self, lang: str | Language) -> bool:
|
||||
"""True if at least one subtitle track is in the given language."""
|
||||
return any(track_lang_matches(t.language, lang) for t in self.subtitle_tracks)
|
||||
|
||||
def has_forced_subs(self) -> bool:
|
||||
"""True if at least one subtitle track is flagged as forced."""
|
||||
return any(t.is_forced for t in self.subtitle_tracks)
|
||||
|
||||
def subtitle_languages(self) -> list[str]:
|
||||
"""Unique subtitle languages across all tracks, in track order."""
|
||||
seen: set[str] = set()
|
||||
result: list[str] = []
|
||||
for t in self.subtitle_tracks:
|
||||
if t.language and t.language not in seen:
|
||||
seen.add(t.language)
|
||||
result.append(t.language)
|
||||
return result
|
||||
@@ -1,62 +0,0 @@
|
||||
"""VideoTrack — a single video stream as reported by ffprobe."""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
from dataclasses import dataclass
|
||||
|
||||
|
||||
@dataclass(frozen=True)
|
||||
class VideoTrack:
|
||||
"""A single video track as reported by ffprobe.
|
||||
|
||||
A media file typically has one video track but can have several (alt
|
||||
camera angles, attached thumbnail images reported as still-image streams,
|
||||
etc.), hence the list[VideoTrack] on MediaInfo.
|
||||
"""
|
||||
|
||||
index: int
|
||||
codec: str | None # h264, hevc, av1, …
|
||||
width: int | None
|
||||
height: int | None
|
||||
is_default: bool = False
|
||||
|
||||
@property
|
||||
def resolution(self) -> str | None:
|
||||
"""
|
||||
Best-effort resolution string: 2160p, 1080p, 720p, …
|
||||
|
||||
Width takes priority over height to handle widescreen/cinema crops
|
||||
(e.g. 1920×960 scope → 1080p, not 720p). Falls back to height when
|
||||
width is unavailable.
|
||||
"""
|
||||
match (self.width, self.height):
|
||||
case (None, None):
|
||||
return None
|
||||
case (w, h) if w is not None:
|
||||
match True:
|
||||
case _ if w >= 3840:
|
||||
return "2160p"
|
||||
case _ if w >= 1920:
|
||||
return "1080p"
|
||||
case _ if w >= 1280:
|
||||
return "720p"
|
||||
case _ if w >= 720:
|
||||
return "576p"
|
||||
case _ if w >= 640:
|
||||
return "480p"
|
||||
case _:
|
||||
return f"{h}p" if h else f"{w}w"
|
||||
case (None, h):
|
||||
match True:
|
||||
case _ if h >= 2160:
|
||||
return "2160p"
|
||||
case _ if h >= 1080:
|
||||
return "1080p"
|
||||
case _ if h >= 720:
|
||||
return "720p"
|
||||
case _ if h >= 576:
|
||||
return "576p"
|
||||
case _ if h >= 480:
|
||||
return "480p"
|
||||
case _:
|
||||
return f"{h}p"
|
||||
@@ -7,11 +7,13 @@ Protocol without going through real I/O.
|
||||
"""
|
||||
|
||||
from .filesystem_scanner import FileEntry, FilesystemScanner
|
||||
from .language_repository import LanguageRepository
|
||||
from .media_prober import MediaProber, SubtitleStreamInfo
|
||||
|
||||
__all__ = [
|
||||
"FileEntry",
|
||||
"FilesystemScanner",
|
||||
"LanguageRepository",
|
||||
"MediaProber",
|
||||
"SubtitleStreamInfo",
|
||||
]
|
||||
|
||||
@@ -0,0 +1,36 @@
|
||||
"""LanguageRepository port — abstracts canonical language lookup.
|
||||
|
||||
The adapter (typically loading from ISO 639 YAML knowledge) maps a wide
|
||||
range of raw forms (codes, English/native names, aliases) onto the
|
||||
canonical :class:`Language` value object. Domain code accepts the port
|
||||
via constructor injection; tests can pass a small in-memory fake.
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
from typing import Protocol
|
||||
|
||||
from alfred.domain.shared.value_objects import Language
|
||||
|
||||
|
||||
class LanguageRepository(Protocol):
|
||||
"""Canonical language lookup."""
|
||||
|
||||
def from_iso(self, code: str) -> Language | None:
|
||||
"""Look up by canonical ISO 639-2/B code (case-insensitive)."""
|
||||
...
|
||||
|
||||
def from_any(self, raw: str) -> Language | None:
|
||||
"""Look up by any known representation: ISO code, name, alias.
|
||||
|
||||
Case-insensitive. Returns ``None`` when the raw form is unknown.
|
||||
"""
|
||||
...
|
||||
|
||||
def all(self) -> list[Language]:
|
||||
"""Return all known languages, in a stable order."""
|
||||
...
|
||||
|
||||
def __contains__(self, raw: str) -> bool: ...
|
||||
|
||||
def __len__(self) -> int: ...
|
||||
@@ -9,7 +9,10 @@ from __future__ import annotations
|
||||
|
||||
from dataclasses import dataclass
|
||||
from pathlib import Path
|
||||
from typing import Protocol
|
||||
from typing import TYPE_CHECKING, Protocol
|
||||
|
||||
if TYPE_CHECKING:
|
||||
from alfred.domain.shared.media import MediaInfo
|
||||
|
||||
|
||||
@dataclass(frozen=True)
|
||||
@@ -37,3 +40,13 @@ class MediaProber(Protocol):
|
||||
no subtitle streams. Adapters must not raise.
|
||||
"""
|
||||
...
|
||||
|
||||
def probe(self, video: Path) -> MediaInfo | None:
|
||||
"""Return the full :class:`MediaInfo` for ``video``, or ``None``.
|
||||
|
||||
Covers all stream families (video, audio, subtitle) plus
|
||||
file-level duration / bitrate. ``None`` signals that ffprobe is
|
||||
unavailable or the file can't be read — adapters must not
|
||||
raise.
|
||||
"""
|
||||
...
|
||||
|
||||
@@ -1,5 +1,7 @@
|
||||
"""Shared value objects used across multiple domains."""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import re
|
||||
from dataclasses import dataclass
|
||||
from pathlib import Path
|
||||
@@ -43,29 +45,21 @@ class ImdbId:
|
||||
@dataclass(frozen=True)
|
||||
class FilePath:
|
||||
"""
|
||||
Value object representing a file path with validation.
|
||||
Value object representing a file path.
|
||||
|
||||
Ensures the path is valid and optionally checks existence.
|
||||
Accepts either ``str`` or :class:`pathlib.Path` at construction;
|
||||
the value is normalized to ``Path`` in ``__post_init__``.
|
||||
"""
|
||||
|
||||
value: Path
|
||||
|
||||
def __init__(self, path: str | Path):
|
||||
"""
|
||||
Initialize FilePath.
|
||||
|
||||
Args:
|
||||
path: String or Path object representing the file path
|
||||
"""
|
||||
if isinstance(path, str):
|
||||
path_obj = Path(path)
|
||||
elif isinstance(path, Path):
|
||||
path_obj = path
|
||||
else:
|
||||
raise ValidationError(f"Path must be str or Path, got {type(path)}")
|
||||
|
||||
# Use object.__setattr__ because dataclass is frozen
|
||||
object.__setattr__(self, "value", path_obj)
|
||||
def __post_init__(self) -> None:
|
||||
if isinstance(self.value, Path):
|
||||
return
|
||||
if isinstance(self.value, str):
|
||||
object.__setattr__(self, "value", Path(self.value))
|
||||
return
|
||||
raise ValidationError(f"Path must be str or Path, got {type(self.value)}")
|
||||
|
||||
def __str__(self) -> str:
|
||||
return str(self.value)
|
||||
@@ -150,19 +144,49 @@ class Language:
|
||||
raise ValidationError(
|
||||
f"Language.iso must be a 3-letter ISO 639-2/B code, got {self.iso!r}"
|
||||
)
|
||||
# Normalize iso to lowercase
|
||||
object.__setattr__(self, "iso", self.iso.lower())
|
||||
# Normalize aliases to a tuple of lowercase strings (dedup, preserve order)
|
||||
if self.iso != self.iso.lower():
|
||||
raise ValidationError(
|
||||
f"Language.iso must be lowercase, got {self.iso!r} — "
|
||||
f"use Language.from_raw() to construct from arbitrary input"
|
||||
)
|
||||
for alias in self.aliases:
|
||||
if not isinstance(alias, str) or alias != alias.lower().strip() or not alias:
|
||||
raise ValidationError(
|
||||
f"Language.aliases must be lowercase non-empty strings, "
|
||||
f"got {alias!r} — use Language.from_raw() to normalize"
|
||||
)
|
||||
|
||||
@classmethod
|
||||
def from_raw(
|
||||
cls,
|
||||
iso: str,
|
||||
english_name: str,
|
||||
native_name: str,
|
||||
aliases: tuple[str, ...] | list[str] = (),
|
||||
) -> Language:
|
||||
"""
|
||||
Construct a Language from arbitrary (possibly un-normalized) input.
|
||||
|
||||
Use this factory when loading from external sources (YAML, user input,
|
||||
third-party APIs) — it lowercases the iso code and normalizes/dedups
|
||||
the alias tuple. The direct constructor is strict and rejects
|
||||
un-normalized input.
|
||||
"""
|
||||
seen: set[str] = set()
|
||||
normalized: list[str] = []
|
||||
for alias in self.aliases:
|
||||
for alias in aliases:
|
||||
if not isinstance(alias, str):
|
||||
continue
|
||||
a = alias.lower().strip()
|
||||
if a and a not in seen:
|
||||
seen.add(a)
|
||||
normalized.append(a)
|
||||
object.__setattr__(self, "aliases", tuple(normalized))
|
||||
return cls(
|
||||
iso=iso.lower(),
|
||||
english_name=english_name,
|
||||
native_name=native_name,
|
||||
aliases=tuple(normalized),
|
||||
)
|
||||
|
||||
def matches(self, raw: str) -> bool:
|
||||
"""
|
||||
|
||||
@@ -6,6 +6,7 @@ from .exceptions import SubtitleNotFound
|
||||
from .services import PatternDetector, SubtitleIdentifier, SubtitleMatcher
|
||||
from .value_objects import (
|
||||
RuleScope,
|
||||
RuleScopeLevel,
|
||||
ScanStrategy,
|
||||
SubtitleFormat,
|
||||
SubtitleLanguage,
|
||||
@@ -30,5 +31,6 @@ __all__ = [
|
||||
"TypeDetectionMethod",
|
||||
"SubtitleMatchingRules",
|
||||
"RuleScope",
|
||||
"RuleScopeLevel",
|
||||
"SubtitleNotFound",
|
||||
]
|
||||
|
||||
@@ -4,7 +4,7 @@ from dataclasses import dataclass, field
|
||||
from typing import Any
|
||||
|
||||
from ..shared.value_objects import ImdbId
|
||||
from .value_objects import RuleScope, SubtitleMatchingRules
|
||||
from .value_objects import RuleScope, RuleScopeLevel, SubtitleMatchingRules
|
||||
|
||||
|
||||
@dataclass
|
||||
@@ -86,10 +86,13 @@ class SubtitleRuleSet:
|
||||
if self._min_confidence is not None:
|
||||
delta["min_confidence"] = self._min_confidence
|
||||
return {
|
||||
"scope": {"level": self.scope.level, "identifier": self.scope.identifier},
|
||||
"scope": {
|
||||
"level": self.scope.level.value,
|
||||
"identifier": self.scope.identifier,
|
||||
},
|
||||
"override": delta,
|
||||
}
|
||||
|
||||
@classmethod
|
||||
def global_default(cls) -> SubtitleRuleSet:
|
||||
return cls(scope=RuleScope(level="global"))
|
||||
return cls(scope=RuleScope(level=RuleScopeLevel.GLOBAL))
|
||||
|
||||
@@ -83,9 +83,20 @@ class SubtitleMatchingRules:
|
||||
min_confidence: float = 0.7
|
||||
|
||||
|
||||
class RuleScopeLevel(str, Enum):
|
||||
"""At which level a subtitle rule set applies."""
|
||||
|
||||
GLOBAL = "global"
|
||||
RELEASE_GROUP = "release_group"
|
||||
MOVIE = "movie"
|
||||
SHOW = "show"
|
||||
SEASON = "season"
|
||||
EPISODE = "episode"
|
||||
|
||||
|
||||
@dataclass(frozen=True)
|
||||
class RuleScope:
|
||||
"""At which level a rule set applies."""
|
||||
|
||||
level: str # "global" | "release_group" | "movie" | "show" | "season" | "episode"
|
||||
level: RuleScopeLevel
|
||||
identifier: str | None = None # imdb_id, group name, "S01", "S01E03"…
|
||||
|
||||
@@ -1,121 +0,0 @@
|
||||
"""ffprobe — infrastructure adapter for extracting MediaInfo from a video file."""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import json
|
||||
import logging
|
||||
import subprocess
|
||||
from pathlib import Path
|
||||
|
||||
from alfred.domain.shared.media import AudioTrack, MediaInfo, SubtitleTrack, VideoTrack
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
_FFPROBE_CMD = [
|
||||
"ffprobe",
|
||||
"-v",
|
||||
"quiet",
|
||||
"-print_format",
|
||||
"json",
|
||||
"-show_streams",
|
||||
"-show_format",
|
||||
]
|
||||
|
||||
|
||||
def probe(path: Path) -> MediaInfo | None:
|
||||
"""
|
||||
Run ffprobe on path and return a MediaInfo.
|
||||
|
||||
Returns None if ffprobe is not available or the file cannot be probed.
|
||||
"""
|
||||
try:
|
||||
result = subprocess.run(
|
||||
[*_FFPROBE_CMD, str(path)],
|
||||
capture_output=True,
|
||||
text=True,
|
||||
timeout=30,
|
||||
check=False,
|
||||
)
|
||||
except subprocess.TimeoutExpired:
|
||||
logger.warning("ffprobe timed out on %s", path)
|
||||
return None
|
||||
|
||||
if result.returncode != 0:
|
||||
logger.warning("ffprobe failed on %s: %s", path, result.stderr.strip())
|
||||
return None
|
||||
|
||||
try:
|
||||
data = json.loads(result.stdout)
|
||||
except json.JSONDecodeError:
|
||||
logger.warning("ffprobe returned invalid JSON for %s", path)
|
||||
return None
|
||||
|
||||
return _parse(data)
|
||||
|
||||
|
||||
def _parse(data: dict) -> MediaInfo:
|
||||
streams = data.get("streams", [])
|
||||
fmt = data.get("format", {})
|
||||
|
||||
# File-level duration/bitrate (ffprobe ``format`` block — independent of streams)
|
||||
duration_seconds: float | None = None
|
||||
bitrate_kbps: int | None = None
|
||||
if "duration" in fmt:
|
||||
try:
|
||||
duration_seconds = float(fmt["duration"])
|
||||
except ValueError:
|
||||
pass
|
||||
if "bit_rate" in fmt:
|
||||
try:
|
||||
bitrate_kbps = int(fmt["bit_rate"]) // 1000
|
||||
except ValueError:
|
||||
pass
|
||||
|
||||
video_tracks: list[VideoTrack] = []
|
||||
audio_tracks: list[AudioTrack] = []
|
||||
subtitle_tracks: list[SubtitleTrack] = []
|
||||
|
||||
for stream in streams:
|
||||
codec_type = stream.get("codec_type")
|
||||
|
||||
if codec_type == "video":
|
||||
video_tracks.append(
|
||||
VideoTrack(
|
||||
index=stream.get("index", len(video_tracks)),
|
||||
codec=stream.get("codec_name"),
|
||||
width=stream.get("width"),
|
||||
height=stream.get("height"),
|
||||
is_default=stream.get("disposition", {}).get("default", 0) == 1,
|
||||
)
|
||||
)
|
||||
|
||||
elif codec_type == "audio":
|
||||
audio_tracks.append(
|
||||
AudioTrack(
|
||||
index=stream.get("index", len(audio_tracks)),
|
||||
codec=stream.get("codec_name"),
|
||||
channels=stream.get("channels"),
|
||||
channel_layout=stream.get("channel_layout"),
|
||||
language=stream.get("tags", {}).get("language"),
|
||||
is_default=stream.get("disposition", {}).get("default", 0) == 1,
|
||||
)
|
||||
)
|
||||
|
||||
elif codec_type == "subtitle":
|
||||
subtitle_tracks.append(
|
||||
SubtitleTrack(
|
||||
index=stream.get("index", len(subtitle_tracks)),
|
||||
codec=stream.get("codec_name"),
|
||||
language=stream.get("tags", {}).get("language"),
|
||||
is_default=stream.get("disposition", {}).get("default", 0) == 1,
|
||||
is_forced=stream.get("disposition", {}).get("forced", 0) == 1,
|
||||
)
|
||||
)
|
||||
|
||||
return MediaInfo(
|
||||
video_tracks=tuple(video_tracks),
|
||||
audio_tracks=tuple(audio_tracks),
|
||||
subtitle_tracks=tuple(subtitle_tracks),
|
||||
duration_seconds=duration_seconds,
|
||||
bitrate_kbps=bitrate_kbps,
|
||||
)
|
||||
@@ -87,7 +87,7 @@ class LanguageRegistry:
|
||||
merged = _merge_language_entries(builtin, learned)
|
||||
|
||||
for iso, entry in merged.items():
|
||||
language = Language(
|
||||
language = Language.from_raw(
|
||||
iso=iso,
|
||||
english_name=entry.get("english_name", iso),
|
||||
native_name=entry.get("native_name", iso),
|
||||
|
||||
@@ -16,9 +16,11 @@ import alfred as _alfred_pkg
|
||||
|
||||
_BUILTIN_ROOT = Path(_alfred_pkg.__file__).parent / "knowledge" / "release"
|
||||
_SITES_ROOT = _BUILTIN_ROOT / "sites"
|
||||
_GROUPS_ROOT = _BUILTIN_ROOT / "release_groups"
|
||||
_LEARNED_ROOT = (
|
||||
Path(_alfred_pkg.__file__).parent.parent / "data" / "knowledge" / "release"
|
||||
)
|
||||
_LEARNED_GROUPS_ROOT = _LEARNED_ROOT / "release_groups"
|
||||
|
||||
|
||||
def _merge(base: dict, overlay: dict) -> dict:
|
||||
@@ -62,6 +64,15 @@ def load_sources() -> set[str]:
|
||||
return set(_load("sources.yaml").get("sources", []))
|
||||
|
||||
|
||||
def load_distributors() -> set[str]:
|
||||
"""Streaming distributor tokens (NF, AMZN, DSNP, …).
|
||||
|
||||
Distinct from ``load_sources()`` — distributors are uppercase scene
|
||||
tags identifying the platform, not the capture origin.
|
||||
"""
|
||||
return {t.upper() for t in _load("distributors.yaml").get("distributors", [])}
|
||||
|
||||
|
||||
def load_codecs() -> set[str]:
|
||||
return set(_load("codecs.yaml").get("codecs", []))
|
||||
|
||||
@@ -128,6 +139,58 @@ def load_media_type_tokens() -> dict:
|
||||
return _load_sites().get("media_type_tokens", {})
|
||||
|
||||
|
||||
def load_group_schemas() -> dict:
|
||||
"""Load every release-group schema YAML keyed by uppercase group name.
|
||||
|
||||
Builtin schemas in ``alfred/knowledge/release/release_groups/`` are
|
||||
merged with user-learned schemas in
|
||||
``data/knowledge/release/release_groups/`` (the learned ones win on
|
||||
name collision).
|
||||
"""
|
||||
result: dict = {}
|
||||
for root in (_GROUPS_ROOT, _LEARNED_GROUPS_ROOT):
|
||||
if not root.is_dir():
|
||||
continue
|
||||
for path in sorted(root.glob("*.yaml")):
|
||||
data = _read(path)
|
||||
name = data.get("name")
|
||||
if not name:
|
||||
continue
|
||||
result[name.upper()] = data
|
||||
return result
|
||||
|
||||
|
||||
def load_scoring() -> dict:
|
||||
"""Load the parse-scoring config.
|
||||
|
||||
Returns a dict with three top-level keys: ``weights``, ``penalties``,
|
||||
``thresholds``. Defaults are baked in so a missing or partial YAML
|
||||
never breaks the parser — only de-tunes it.
|
||||
"""
|
||||
raw = _load("scoring.yaml")
|
||||
weights = {
|
||||
"title": 30,
|
||||
"media_type": 20,
|
||||
"year": 15,
|
||||
"season": 10,
|
||||
"episode": 5,
|
||||
"resolution": 5,
|
||||
"source": 5,
|
||||
"codec": 5,
|
||||
"group": 5,
|
||||
}
|
||||
weights.update(raw.get("weights", {}) or {})
|
||||
penalties = {"unknown_token": 5, "max_unknown_penalty": 30}
|
||||
penalties.update(raw.get("penalties", {}) or {})
|
||||
thresholds = {"shitty_min": 60}
|
||||
thresholds.update(raw.get("thresholds", {}) or {})
|
||||
return {
|
||||
"weights": weights,
|
||||
"penalties": penalties,
|
||||
"thresholds": thresholds,
|
||||
}
|
||||
|
||||
|
||||
def load_separators() -> list[str]:
|
||||
"""Single-char token separators used by the release name tokenizer.
|
||||
|
||||
|
||||
@@ -14,17 +14,23 @@ filesystem-level concerns.
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
from alfred.domain.release.parser.schema import GroupSchema, SchemaChunk
|
||||
from alfred.domain.release.parser.tokens import TokenRole
|
||||
|
||||
from .release import (
|
||||
load_audio,
|
||||
load_codecs,
|
||||
load_distributors,
|
||||
load_editions,
|
||||
load_forbidden_chars,
|
||||
load_group_schemas,
|
||||
load_hdr_extra,
|
||||
load_language_tokens,
|
||||
load_media_type_tokens,
|
||||
load_metadata_extensions,
|
||||
load_non_video_extensions,
|
||||
load_resolutions,
|
||||
load_scoring,
|
||||
load_separators,
|
||||
load_sources,
|
||||
load_sources_extra,
|
||||
@@ -35,6 +41,26 @@ from .release import (
|
||||
)
|
||||
|
||||
|
||||
def _build_group_schema(data: dict) -> GroupSchema:
|
||||
"""Translate a raw YAML schema dict into a frozen :class:`GroupSchema`.
|
||||
|
||||
Unknown roles raise ``ValueError`` early so a typo in a YAML file
|
||||
surfaces at construction time, not on first parse.
|
||||
"""
|
||||
chunks = tuple(
|
||||
SchemaChunk(
|
||||
role=TokenRole(entry["role"]),
|
||||
optional=bool(entry.get("optional", False)),
|
||||
)
|
||||
for entry in data.get("chunk_order", [])
|
||||
)
|
||||
return GroupSchema(
|
||||
name=data["name"],
|
||||
separator=data.get("separator", "."),
|
||||
chunks=chunks,
|
||||
)
|
||||
|
||||
|
||||
class YamlReleaseKnowledge:
|
||||
"""Single object holding every parsed-release knowledge constant.
|
||||
|
||||
@@ -48,6 +74,7 @@ class YamlReleaseKnowledge:
|
||||
self.resolutions: set[str] = load_resolutions()
|
||||
self.sources: set[str] = load_sources() | load_sources_extra()
|
||||
self.codecs: set[str] = load_codecs()
|
||||
self.distributors: set[str] = load_distributors()
|
||||
self.language_tokens: set[str] = load_language_tokens()
|
||||
self.forbidden_chars: set[str] = load_forbidden_chars()
|
||||
self.hdr_extra: set[str] = load_hdr_extra()
|
||||
@@ -59,6 +86,9 @@ class YamlReleaseKnowledge:
|
||||
|
||||
self.separators: list[str] = load_separators()
|
||||
|
||||
# Parse-scoring config (weights / penalties / thresholds).
|
||||
self.scoring: dict = load_scoring()
|
||||
|
||||
# File-extension sets (used by application/infra modules, not by
|
||||
# the parser itself — kept here so there is a single ownership
|
||||
# point for release knowledge).
|
||||
@@ -78,6 +108,15 @@ class YamlReleaseKnowledge:
|
||||
"", "", "".join(load_win_forbidden_chars())
|
||||
)
|
||||
|
||||
# Group schemas, keyed by uppercase group name for fast lookup.
|
||||
self._group_schemas: dict[str, GroupSchema] = {
|
||||
key: _build_group_schema(data)
|
||||
for key, data in load_group_schemas().items()
|
||||
}
|
||||
|
||||
def sanitize_for_fs(self, text: str) -> str:
|
||||
"""Strip Windows-forbidden characters from ``text``."""
|
||||
return text.translate(self._win_forbidden_table)
|
||||
|
||||
def group_schema(self, name: str) -> GroupSchema | None:
|
||||
return self._group_schemas.get(name.upper())
|
||||
|
||||
@@ -2,7 +2,7 @@
|
||||
|
||||
import logging
|
||||
|
||||
from alfred.infrastructure.knowledge.language_registry import LanguageRegistry
|
||||
from alfred.domain.shared.ports import LanguageRepository
|
||||
from alfred.domain.subtitles.value_objects import (
|
||||
ScanStrategy,
|
||||
SubtitleFormat,
|
||||
@@ -12,6 +12,8 @@ from alfred.domain.subtitles.value_objects import (
|
||||
SubtitleType,
|
||||
TypeDetectionMethod,
|
||||
)
|
||||
from alfred.infrastructure.knowledge.language_registry import LanguageRegistry
|
||||
|
||||
from .loader import KnowledgeLoader
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
@@ -28,10 +30,12 @@ class SubtitleKnowledgeBase:
|
||||
def __init__(
|
||||
self,
|
||||
loader: KnowledgeLoader | None = None,
|
||||
language_registry: LanguageRegistry | None = None,
|
||||
language_registry: LanguageRepository | None = None,
|
||||
):
|
||||
self._loader = loader or KnowledgeLoader()
|
||||
self._language_registry = language_registry or LanguageRegistry()
|
||||
self._language_registry: LanguageRepository = (
|
||||
language_registry or LanguageRegistry()
|
||||
)
|
||||
self._build()
|
||||
|
||||
def _build(self) -> None: # noqa: PLR0912 — straight-line YAML projection
|
||||
|
||||
@@ -7,12 +7,23 @@ import logging
|
||||
import subprocess
|
||||
from pathlib import Path
|
||||
|
||||
from alfred.domain.shared.media import AudioTrack, MediaInfo, SubtitleTrack, VideoTrack
|
||||
from alfred.domain.shared.ports import SubtitleStreamInfo
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
_FFPROBE_TIMEOUT_SECONDS = 30
|
||||
|
||||
_FFPROBE_FULL_CMD = [
|
||||
"ffprobe",
|
||||
"-v",
|
||||
"quiet",
|
||||
"-print_format",
|
||||
"json",
|
||||
"-show_streams",
|
||||
"-show_format",
|
||||
]
|
||||
|
||||
|
||||
class FfprobeMediaProber:
|
||||
"""Inspect media files by shelling out to ``ffprobe``.
|
||||
@@ -63,3 +74,101 @@ class FfprobeMediaProber:
|
||||
)
|
||||
)
|
||||
return streams
|
||||
|
||||
def probe(self, video: Path) -> MediaInfo | None:
|
||||
"""Run ffprobe on ``video`` and return a :class:`MediaInfo`.
|
||||
|
||||
Returns ``None`` when ffprobe is not available, times out, or
|
||||
the file cannot be parsed. Never raises.
|
||||
"""
|
||||
try:
|
||||
result = subprocess.run(
|
||||
[*_FFPROBE_FULL_CMD, str(video)],
|
||||
capture_output=True,
|
||||
text=True,
|
||||
timeout=_FFPROBE_TIMEOUT_SECONDS,
|
||||
check=False,
|
||||
)
|
||||
except (subprocess.TimeoutExpired, FileNotFoundError) as e:
|
||||
logger.warning("ffprobe failed on %s: %s", video, e)
|
||||
return None
|
||||
|
||||
if result.returncode != 0:
|
||||
logger.warning("ffprobe failed on %s: %s", video, result.stderr.strip())
|
||||
return None
|
||||
|
||||
try:
|
||||
data = json.loads(result.stdout)
|
||||
except json.JSONDecodeError:
|
||||
logger.warning("ffprobe returned invalid JSON for %s", video)
|
||||
return None
|
||||
|
||||
return _parse_media_info(data)
|
||||
|
||||
|
||||
def _parse_media_info(data: dict) -> MediaInfo:
|
||||
"""Translate raw ffprobe JSON into a :class:`MediaInfo` snapshot."""
|
||||
streams = data.get("streams", [])
|
||||
fmt = data.get("format", {})
|
||||
|
||||
duration_seconds: float | None = None
|
||||
bitrate_kbps: int | None = None
|
||||
if "duration" in fmt:
|
||||
try:
|
||||
duration_seconds = float(fmt["duration"])
|
||||
except ValueError:
|
||||
pass
|
||||
if "bit_rate" in fmt:
|
||||
try:
|
||||
bitrate_kbps = int(fmt["bit_rate"]) // 1000
|
||||
except ValueError:
|
||||
pass
|
||||
|
||||
video_tracks: list[VideoTrack] = []
|
||||
audio_tracks: list[AudioTrack] = []
|
||||
subtitle_tracks: list[SubtitleTrack] = []
|
||||
|
||||
for stream in streams:
|
||||
codec_type = stream.get("codec_type")
|
||||
|
||||
if codec_type == "video":
|
||||
video_tracks.append(
|
||||
VideoTrack(
|
||||
index=stream.get("index", len(video_tracks)),
|
||||
codec=stream.get("codec_name"),
|
||||
width=stream.get("width"),
|
||||
height=stream.get("height"),
|
||||
is_default=stream.get("disposition", {}).get("default", 0) == 1,
|
||||
)
|
||||
)
|
||||
|
||||
elif codec_type == "audio":
|
||||
audio_tracks.append(
|
||||
AudioTrack(
|
||||
index=stream.get("index", len(audio_tracks)),
|
||||
codec=stream.get("codec_name"),
|
||||
channels=stream.get("channels"),
|
||||
channel_layout=stream.get("channel_layout"),
|
||||
language=stream.get("tags", {}).get("language"),
|
||||
is_default=stream.get("disposition", {}).get("default", 0) == 1,
|
||||
)
|
||||
)
|
||||
|
||||
elif codec_type == "subtitle":
|
||||
subtitle_tracks.append(
|
||||
SubtitleTrack(
|
||||
index=stream.get("index", len(subtitle_tracks)),
|
||||
codec=stream.get("codec_name"),
|
||||
language=stream.get("tags", {}).get("language"),
|
||||
is_default=stream.get("disposition", {}).get("default", 0) == 1,
|
||||
is_forced=stream.get("disposition", {}).get("forced", 0) == 1,
|
||||
)
|
||||
)
|
||||
|
||||
return MediaInfo(
|
||||
video_tracks=tuple(video_tracks),
|
||||
audio_tracks=tuple(audio_tracks),
|
||||
subtitle_tracks=tuple(subtitle_tracks),
|
||||
duration_seconds=duration_seconds,
|
||||
bitrate_kbps=bitrate_kbps,
|
||||
)
|
||||
|
||||
@@ -7,7 +7,7 @@ from typing import TYPE_CHECKING
|
||||
import yaml
|
||||
|
||||
from alfred.domain.subtitles.aggregates import SubtitleRuleSet
|
||||
from alfred.domain.subtitles.value_objects import RuleScope
|
||||
from alfred.domain.subtitles.value_objects import RuleScope, RuleScopeLevel
|
||||
|
||||
if TYPE_CHECKING:
|
||||
from alfred.infrastructure.persistence.memory.ltm.components.subtitle_preferences import (
|
||||
@@ -72,7 +72,9 @@ class RuleSetRepository:
|
||||
rg_data = _load_yaml(rg_path).get("override", {})
|
||||
if rg_data:
|
||||
rg_ruleset = SubtitleRuleSet(
|
||||
scope=RuleScope(level="release_group", identifier=release_group),
|
||||
scope=RuleScope(
|
||||
level=RuleScopeLevel.RELEASE_GROUP, identifier=release_group
|
||||
),
|
||||
parent=current,
|
||||
)
|
||||
rg_ruleset.override(**_filter_override(rg_data))
|
||||
@@ -85,7 +87,7 @@ class RuleSetRepository:
|
||||
local_data = _load_yaml(self._alfred_dir / "rules.yaml").get("override", {})
|
||||
if local_data:
|
||||
local_ruleset = SubtitleRuleSet(
|
||||
scope=RuleScope(level="show"),
|
||||
scope=RuleScope(level=RuleScopeLevel.SHOW),
|
||||
parent=current,
|
||||
)
|
||||
local_ruleset.override(**_filter_override(local_data))
|
||||
|
||||
@@ -0,0 +1,17 @@
|
||||
# Known streaming distributor tokens (case-insensitive match).
|
||||
#
|
||||
# These tags identify *which platform* the release was sourced from
|
||||
# (Netflix, Amazon, Disney+, …). Distinct from ``sources.yaml`` which
|
||||
# captures the encoding origin (WEB-DL, BluRay, …). A typical release
|
||||
# carries both: ``Show.S01E01.1080p.NF.WEB-DL.x264-GROUP`` →
|
||||
# source=WEB-DL, distributor=NF.
|
||||
distributors:
|
||||
- NF # Netflix
|
||||
- AMZN # Amazon Prime Video
|
||||
- DSNP # Disney+
|
||||
- HMAX # HBO Max
|
||||
- ATVP # Apple TV+
|
||||
- HULU # Hulu
|
||||
- PCOK # Peacock
|
||||
- PMTP # Paramount+
|
||||
- CR # Crunchyroll
|
||||
@@ -0,0 +1,22 @@
|
||||
# ELiTE release naming schema.
|
||||
#
|
||||
# Examples seen in the wild:
|
||||
# Foundation.S02.1080p.x265-ELiTE (TV season pack, no source)
|
||||
#
|
||||
# ELiTE often omits the source token entirely on TV releases (no WEBRip /
|
||||
# BluRay), going straight from resolution to codec.
|
||||
|
||||
name: ELiTE
|
||||
separator: "."
|
||||
|
||||
chunk_order:
|
||||
- role: title
|
||||
- role: year
|
||||
optional: true
|
||||
- role: season_episode
|
||||
optional: true
|
||||
- role: resolution
|
||||
- role: source
|
||||
optional: true # often absent on TV
|
||||
- role: codec
|
||||
- role: group
|
||||
@@ -0,0 +1,28 @@
|
||||
# KONTRAST release naming schema.
|
||||
#
|
||||
# Examples seen in the wild:
|
||||
# Back.in.Action.2025.1080p.WEBRip.x265-KONTRAST (movie)
|
||||
# The.Long.Walk.2025.1080p.WEBRip.x265-KONTRAST (movie)
|
||||
# Slow.Horses.S05E01.1080p.WEBRip.x265-KONTRAST (TV episode)
|
||||
# Slow.Horses.S05.1080p.WEBRip.x265-KONTRAST (TV season pack)
|
||||
#
|
||||
# Schema is a left-to-right description of the canonical chunk order.
|
||||
# Each entry is a role (matching TokenRole). Optional chunks are marked
|
||||
# with `optional: true`. The parser consumes tokens greedily by role,
|
||||
# skipping over optional chunks that don't match.
|
||||
|
||||
name: KONTRAST
|
||||
separator: "."
|
||||
|
||||
# Canonical order of structural + technical chunks (left to right).
|
||||
# `title` is special-cased as "everything up to the first non-title role".
|
||||
chunk_order:
|
||||
- role: title
|
||||
- role: year
|
||||
optional: true # absent on TV releases (S01E01 instead)
|
||||
- role: season_episode
|
||||
optional: true # absent on movies
|
||||
- role: resolution # always present (1080p, 2160p, …)
|
||||
- role: source # always present (WEBRip, BluRay, …)
|
||||
- role: codec # always present (x265, x264, …)
|
||||
- role: group # everything after the final `-`
|
||||
@@ -0,0 +1,20 @@
|
||||
# RARBG release naming schema.
|
||||
#
|
||||
# RARBG follows the canonical scene convention closely:
|
||||
# Title.Year.Resolution.Source.Codec-RARBG
|
||||
# For TV:
|
||||
# Title.S01E01.Resolution.Source.Codec-RARBG
|
||||
|
||||
name: RARBG
|
||||
separator: "."
|
||||
|
||||
chunk_order:
|
||||
- role: title
|
||||
- role: year
|
||||
optional: true
|
||||
- role: season_episode
|
||||
optional: true
|
||||
- role: resolution
|
||||
- role: source
|
||||
- role: codec
|
||||
- role: group
|
||||
@@ -0,0 +1,42 @@
|
||||
# Release parse scoring.
|
||||
#
|
||||
# `parse_release` returns a `ParseReport` alongside the `ParsedRelease`.
|
||||
# The report carries a 0-100 confidence score computed from the annotated
|
||||
# tokens, plus the road decision (EASY / SHITTY / PATH_OF_PAIN).
|
||||
#
|
||||
# Why YAML: the weights and the SHITTY/PoP cutoff are tuning knobs we
|
||||
# expect to iterate on as fixtures grow. Keeping them in code would
|
||||
# mean a commit per tweak; here the user can adjust without touching
|
||||
# Python.
|
||||
#
|
||||
# Weights are awarded when the corresponding ParsedRelease field is
|
||||
# populated (non-None, non-"UNKNOWN" for group). Season and episode
|
||||
# only contribute when the parse looks like TV (season is not None).
|
||||
|
||||
weights:
|
||||
title: 30 # structural pivot — without it nothing else matters
|
||||
media_type: 20 # movie / tv_show / tv_complete / …
|
||||
year: 15
|
||||
season: 10 # only counted for TV-shaped releases
|
||||
episode: 5
|
||||
resolution: 5
|
||||
source: 5
|
||||
codec: 5
|
||||
group: 5 # "UNKNOWN" yields 0
|
||||
|
||||
# Penalty applied per UNKNOWN token left in the annotated stream.
|
||||
# Capped at `max_unknown_penalty` to keep a long-tail of garbage from
|
||||
# pushing every release into PoP.
|
||||
penalties:
|
||||
unknown_token: 5
|
||||
max_unknown_penalty: 30
|
||||
|
||||
# Decision thresholds.
|
||||
#
|
||||
# EASY is decided structurally (a known group schema matched) — it does
|
||||
# not look at the score. SHITTY vs PATH_OF_PAIN is decided here:
|
||||
#
|
||||
# score >= shitty_min → SHITTY (best-effort parse usable)
|
||||
# score < shitty_min → PATH_OF_PAIN (needs user / LLM help)
|
||||
thresholds:
|
||||
shitty_min: 60
|
||||
@@ -1,4 +1,9 @@
|
||||
# Known release source tokens (case-insensitive match)
|
||||
# Known release source tokens (case-insensitive match).
|
||||
#
|
||||
# "Source" here means the capture/encoding origin (disc, broadcast, web
|
||||
# stream) — NOT the streaming distributor (Netflix, Disney+, …). Those
|
||||
# live in ``distributors.yaml`` because they're a separate dimension:
|
||||
# a release is typically "WEB-DL from NF" — both should be captured.
|
||||
sources:
|
||||
- bluray
|
||||
- blu-ray
|
||||
@@ -14,8 +19,3 @@ sources:
|
||||
- dvdrip
|
||||
- dvd
|
||||
- vodrip
|
||||
- amzn
|
||||
- nf
|
||||
- dsnp
|
||||
- hmax
|
||||
- atvp
|
||||
|
||||
@@ -37,12 +37,6 @@ class Settings(BaseSettings):
|
||||
llm_temperature: float = 0.2
|
||||
data_storage_dir: str = "data"
|
||||
|
||||
# --- MEDIA ---
|
||||
# Minimum file size to consider a video file as a real movie (in bytes).
|
||||
# 100 MB is generous enough to skip sample clips / trailers without rejecting
|
||||
# legitimate low-bitrate releases (e.g. older anime, certain web rips).
|
||||
min_movie_size_bytes: int = 100 * 1024 * 1024
|
||||
|
||||
# --- BUILD ---
|
||||
alfred_version: str | None = None
|
||||
|
||||
@@ -90,15 +84,6 @@ class Settings(BaseSettings):
|
||||
)
|
||||
return v
|
||||
|
||||
@field_validator("min_movie_size_bytes")
|
||||
@classmethod
|
||||
def validate_min_movie_size(cls, v: int) -> int:
|
||||
if v < 0:
|
||||
raise ConfigurationError(
|
||||
f"min_movie_size_bytes must be non-negative, got {v}"
|
||||
)
|
||||
return v
|
||||
|
||||
@field_validator("request_timeout")
|
||||
@classmethod
|
||||
def validate_timeout(cls, v: int) -> int:
|
||||
|
||||
@@ -88,13 +88,13 @@ def analyze(release_name: str, source_path: str | None = None) -> None:
|
||||
if not path.exists():
|
||||
print(" (chemin inexistant, probe skipped)")
|
||||
else:
|
||||
from alfred.infrastructure.filesystem.ffprobe import probe
|
||||
from alfred.infrastructure.filesystem.find_video import find_video_file
|
||||
from alfred.infrastructure.probe import FfprobeMediaProber
|
||||
|
||||
video = find_video_file(path) if path.is_dir() else path
|
||||
if video:
|
||||
print(f" video file: {video.name}")
|
||||
info = probe(video)
|
||||
info = FfprobeMediaProber().probe(video)
|
||||
if info:
|
||||
print(f" codec: {info.video_codec}")
|
||||
print(f" resolution: {info.resolution}")
|
||||
|
||||
@@ -98,9 +98,9 @@ def main() -> None:
|
||||
print(c(f"Error: {path} does not exist", RED), file=sys.stderr)
|
||||
sys.exit(1)
|
||||
|
||||
from alfred.infrastructure.filesystem.ffprobe import probe
|
||||
from alfred.infrastructure.probe import FfprobeMediaProber
|
||||
|
||||
info = probe(path)
|
||||
info = FfprobeMediaProber().probe(path)
|
||||
if info is None:
|
||||
print(c("Error: ffprobe failed to probe the file", RED), file=sys.stderr)
|
||||
sys.exit(1)
|
||||
|
||||
@@ -100,11 +100,13 @@ def main() -> None:
|
||||
print(c(f"Error: {downloads} does not exist", RED), file=sys.stderr)
|
||||
sys.exit(1)
|
||||
|
||||
from alfred.application.filesystem.detect_media_type import detect_media_type
|
||||
from alfred.application.filesystem.enrich_from_probe import enrich_from_probe
|
||||
from alfred.application.release.detect_media_type import detect_media_type
|
||||
from alfred.application.release.enrich_from_probe import enrich_from_probe
|
||||
from alfred.domain.release.services import parse_release
|
||||
from alfred.infrastructure.filesystem.ffprobe import probe
|
||||
from alfred.infrastructure.filesystem.find_video import find_video_file
|
||||
from alfred.infrastructure.probe import FfprobeMediaProber
|
||||
|
||||
_prober = FfprobeMediaProber()
|
||||
|
||||
entries = sorted(downloads.iterdir(), key=lambda p: p.name.lower())
|
||||
total = len(entries)
|
||||
@@ -126,7 +128,7 @@ def main() -> None:
|
||||
if p.media_type not in ("unknown", "other"):
|
||||
video_file = find_video_file(entry)
|
||||
if video_file:
|
||||
media_info = probe(video_file)
|
||||
media_info = _prober.probe(video_file)
|
||||
if media_info:
|
||||
enrich_from_probe(p, media_info)
|
||||
warnings = _assess(p)
|
||||
|
||||
@@ -1,4 +1,4 @@
|
||||
"""Tests for ``alfred.application.filesystem.detect_media_type``.
|
||||
"""Tests for ``alfred.application.release.detect_media_type``.
|
||||
|
||||
The function refines a ``ParsedRelease.media_type`` using filesystem evidence.
|
||||
|
||||
@@ -18,7 +18,7 @@ from pathlib import Path
|
||||
|
||||
import pytest
|
||||
|
||||
from alfred.application.filesystem.detect_media_type import detect_media_type
|
||||
from alfred.application.release.detect_media_type import detect_media_type
|
||||
from alfred.domain.release.services import parse_release
|
||||
from alfred.infrastructure.knowledge.release_kb import YamlReleaseKnowledge
|
||||
|
||||
@@ -28,11 +28,14 @@ _KB = YamlReleaseKnowledge()
|
||||
def _parsed(media_type: str = "movie"):
|
||||
"""Build a ParsedRelease with the requested media_type via the real parser."""
|
||||
if media_type == "tv_show":
|
||||
return parse_release("Show.S01E01.1080p-GRP", _KB)
|
||||
parsed, _ = parse_release("Show.S01E01.1080p-GRP", _KB)
|
||||
return parsed
|
||||
if media_type == "movie":
|
||||
return parse_release("Movie.2020.1080p-GRP", _KB)
|
||||
parsed, _ = parse_release("Movie.2020.1080p-GRP", _KB)
|
||||
return parsed
|
||||
# "unknown" / other — feed a name the parser can't classify
|
||||
return parse_release("randomthing", _KB)
|
||||
parsed, _ = parse_release("randomthing", _KB)
|
||||
return parsed
|
||||
|
||||
|
||||
# --------------------------------------------------------------------------- #
|
||||
|
||||
@@ -1,4 +1,4 @@
|
||||
"""Tests for ``alfred.application.filesystem.enrich_from_probe``.
|
||||
"""Tests for ``alfred.application.release.enrich_from_probe``.
|
||||
|
||||
The function mutates a ``ParsedRelease`` in place using ffprobe ``MediaInfo``.
|
||||
Token-level values from the release name always win — only ``None`` fields
|
||||
@@ -18,7 +18,7 @@ Uses real ``ParsedRelease`` / ``MediaInfo`` instances — no mocking needed.
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
from alfred.application.filesystem.enrich_from_probe import enrich_from_probe
|
||||
from alfred.application.release.enrich_from_probe import enrich_from_probe
|
||||
from alfred.domain.release.value_objects import ParsedRelease
|
||||
from alfred.domain.shared.media import AudioTrack, MediaInfo, VideoTrack
|
||||
|
||||
@@ -35,7 +35,7 @@ def _bare(**overrides) -> ParsedRelease:
|
||||
"""Build a minimal ParsedRelease with all enrichable fields = None."""
|
||||
defaults = dict(
|
||||
raw="X",
|
||||
normalised="X",
|
||||
clean="X",
|
||||
title="X",
|
||||
title_sanitized="X",
|
||||
year=None,
|
||||
@@ -210,3 +210,42 @@ class TestLanguages:
|
||||
p = _bare()
|
||||
enrich_from_probe(p, MediaInfo())
|
||||
assert p.languages == []
|
||||
|
||||
|
||||
# --------------------------------------------------------------------------- #
|
||||
# tech_string #
|
||||
# --------------------------------------------------------------------------- #
|
||||
|
||||
|
||||
class TestTechString:
|
||||
"""tech_string drives the filename builders; it must be re-derived
|
||||
whenever quality / source / codec change."""
|
||||
|
||||
def test_rebuilt_from_filled_quality_and_codec(self):
|
||||
p = _bare()
|
||||
enrich_from_probe(
|
||||
p, _info_with_video(width=1920, height=1080, codec="hevc")
|
||||
)
|
||||
assert p.quality == "1080p"
|
||||
assert p.codec == "x265"
|
||||
assert p.tech_string == "1080p.x265"
|
||||
|
||||
def test_keeps_existing_source_when_enriching(self):
|
||||
# Token-level source must stay; probe fills only None fields.
|
||||
p = _bare(source="BluRay")
|
||||
enrich_from_probe(
|
||||
p, _info_with_video(width=1920, height=1080, codec="hevc")
|
||||
)
|
||||
assert p.tech_string == "1080p.BluRay.x265"
|
||||
|
||||
def test_unchanged_when_no_enrichable_video_info(self):
|
||||
# No video info → nothing to fill → tech_string stays as it was.
|
||||
p = _bare(quality="2160p", source="WEB-DL", codec="x265")
|
||||
p.tech_string = "2160p.WEB-DL.x265"
|
||||
enrich_from_probe(p, MediaInfo())
|
||||
assert p.tech_string == "2160p.WEB-DL.x265"
|
||||
|
||||
def test_empty_when_nothing_known(self):
|
||||
p = _bare()
|
||||
enrich_from_probe(p, MediaInfo())
|
||||
assert p.tech_string == ""
|
||||
|
||||
@@ -0,0 +1,265 @@
|
||||
"""Tests for the ``inspect_release`` orchestrator (Phase C).
|
||||
|
||||
Covers the four composition steps as a black box: a real
|
||||
``YamlReleaseKnowledge``, real on-disk filesystem under ``tmp_path``,
|
||||
and a stubbed ``MediaProber`` so we don't depend on a system ``ffprobe``.
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
from pathlib import Path
|
||||
|
||||
from alfred.application.release import InspectedResult, inspect_release
|
||||
from alfred.domain.shared.media import AudioTrack, MediaInfo, VideoTrack
|
||||
from alfred.infrastructure.knowledge.release_kb import YamlReleaseKnowledge
|
||||
|
||||
_KB = YamlReleaseKnowledge()
|
||||
|
||||
_MOVIE_NAME = "Inception.2010.1080p.BluRay.x264-GROUP"
|
||||
_TV_NAME = "Dexter.S01E01.1080p.WEB-DL.x264-GROUP"
|
||||
|
||||
|
||||
# --------------------------------------------------------------------------- #
|
||||
# Test doubles #
|
||||
# --------------------------------------------------------------------------- #
|
||||
|
||||
|
||||
class _StubProber:
|
||||
"""Minimal MediaProber stub. Records the path it was asked to probe."""
|
||||
|
||||
def __init__(self, info: MediaInfo | None) -> None:
|
||||
self._info = info
|
||||
self.calls: list[Path] = []
|
||||
|
||||
def list_subtitle_streams(self, video: Path): # pragma: no cover - unused here
|
||||
return []
|
||||
|
||||
def probe(self, video: Path) -> MediaInfo | None:
|
||||
self.calls.append(video)
|
||||
return self._info
|
||||
|
||||
|
||||
class _RaisingProber:
|
||||
"""A prober that would explode if called — used to assert no probe."""
|
||||
|
||||
def list_subtitle_streams(self, video: Path): # pragma: no cover
|
||||
raise AssertionError("list_subtitle_streams must not be called")
|
||||
|
||||
def probe(self, video: Path): # pragma: no cover
|
||||
raise AssertionError("probe must not be called")
|
||||
|
||||
|
||||
def _media_info_1080p_h264() -> MediaInfo:
|
||||
return MediaInfo(
|
||||
video_tracks=(VideoTrack(index=0, codec="h264", width=1920, height=1080),),
|
||||
audio_tracks=(
|
||||
AudioTrack(
|
||||
index=1,
|
||||
codec="ac3",
|
||||
channels=6,
|
||||
channel_layout="5.1",
|
||||
language="eng",
|
||||
is_default=True,
|
||||
),
|
||||
),
|
||||
subtitle_tracks=(),
|
||||
duration_seconds=7200.0,
|
||||
bitrate_kbps=8000,
|
||||
)
|
||||
|
||||
|
||||
# --------------------------------------------------------------------------- #
|
||||
# Happy paths #
|
||||
# --------------------------------------------------------------------------- #
|
||||
|
||||
|
||||
class TestInspectMovieFolder:
|
||||
def test_returns_inspected_result_with_all_fields(self, tmp_path: Path) -> None:
|
||||
folder = tmp_path / _MOVIE_NAME
|
||||
folder.mkdir()
|
||||
video = folder / "movie.mkv"
|
||||
video.write_bytes(b"")
|
||||
prober = _StubProber(_media_info_1080p_h264())
|
||||
|
||||
result = inspect_release(_MOVIE_NAME, folder, _KB, prober)
|
||||
|
||||
assert isinstance(result, InspectedResult)
|
||||
assert result.source_path == folder
|
||||
assert result.main_video == video
|
||||
assert result.media_info is not None
|
||||
assert result.probe_used is True
|
||||
assert prober.calls == [video]
|
||||
|
||||
def test_parsed_carries_token_level_fields(self, tmp_path: Path) -> None:
|
||||
folder = tmp_path / _MOVIE_NAME
|
||||
folder.mkdir()
|
||||
(folder / "movie.mkv").write_bytes(b"")
|
||||
prober = _StubProber(_media_info_1080p_h264())
|
||||
|
||||
result = inspect_release(_MOVIE_NAME, folder, _KB, prober)
|
||||
|
||||
assert result.parsed.title.lower().startswith("inception")
|
||||
assert result.parsed.year == 2010
|
||||
assert result.parsed.group == "GROUP"
|
||||
assert result.parsed.media_type == "movie"
|
||||
|
||||
def test_report_has_confidence_and_road(self, tmp_path: Path) -> None:
|
||||
folder = tmp_path / _MOVIE_NAME
|
||||
folder.mkdir()
|
||||
(folder / "movie.mkv").write_bytes(b"")
|
||||
prober = _StubProber(None)
|
||||
|
||||
result = inspect_release(_MOVIE_NAME, folder, _KB, prober)
|
||||
|
||||
assert 0 <= result.report.confidence <= 100
|
||||
assert result.report.road in ("easy", "shitty", "path_of_pain")
|
||||
|
||||
|
||||
class TestInspectSingleFile:
|
||||
def test_file_is_its_own_main_video(self, tmp_path: Path) -> None:
|
||||
f = tmp_path / f"{_MOVIE_NAME}.mkv"
|
||||
f.write_bytes(b"")
|
||||
prober = _StubProber(_media_info_1080p_h264())
|
||||
|
||||
result = inspect_release(_MOVIE_NAME, f, _KB, prober)
|
||||
|
||||
assert result.main_video == f
|
||||
assert result.probe_used is True
|
||||
|
||||
|
||||
# --------------------------------------------------------------------------- #
|
||||
# Probe-gating logic #
|
||||
# --------------------------------------------------------------------------- #
|
||||
|
||||
|
||||
class TestProbeGating:
|
||||
def test_no_video_means_no_probe(self, tmp_path: Path) -> None:
|
||||
folder = tmp_path / _MOVIE_NAME
|
||||
folder.mkdir()
|
||||
# Only a non-video file present.
|
||||
(folder / "readme.txt").write_text("hi")
|
||||
prober = _RaisingProber()
|
||||
|
||||
result = inspect_release(_MOVIE_NAME, folder, _KB, prober)
|
||||
|
||||
assert result.main_video is None
|
||||
assert result.media_info is None
|
||||
assert result.probe_used is False
|
||||
|
||||
def test_media_type_other_means_no_probe(self, tmp_path: Path) -> None:
|
||||
# An ISO-only folder gets detect_media_type → "other".
|
||||
folder = tmp_path / _MOVIE_NAME
|
||||
folder.mkdir()
|
||||
(folder / "disc.iso").write_bytes(b"")
|
||||
prober = _RaisingProber()
|
||||
|
||||
result = inspect_release(_MOVIE_NAME, folder, _KB, prober)
|
||||
|
||||
assert result.parsed.media_type == "other"
|
||||
assert result.media_info is None
|
||||
assert result.probe_used is False
|
||||
|
||||
def test_probe_failure_keeps_probe_used_false(self, tmp_path: Path) -> None:
|
||||
folder = tmp_path / _MOVIE_NAME
|
||||
folder.mkdir()
|
||||
(folder / "movie.mkv").write_bytes(b"")
|
||||
prober = _StubProber(None) # ffprobe simulated as failing
|
||||
|
||||
result = inspect_release(_MOVIE_NAME, folder, _KB, prober)
|
||||
|
||||
assert result.main_video is not None
|
||||
assert result.media_info is None
|
||||
assert result.probe_used is False
|
||||
|
||||
|
||||
# --------------------------------------------------------------------------- #
|
||||
# Mutation contract #
|
||||
# --------------------------------------------------------------------------- #
|
||||
|
||||
|
||||
class TestMutationContract:
|
||||
def test_detect_media_type_refines_parsed(self, tmp_path: Path) -> None:
|
||||
# Release name parses to "movie", but folder mixes video + non_video
|
||||
# (e.g. an ISO sitting next to an mkv) → detect_media_type returns
|
||||
# "unknown", which is in _NON_PROBABLE_MEDIA_TYPES → no probe.
|
||||
folder = tmp_path / _MOVIE_NAME
|
||||
folder.mkdir()
|
||||
(folder / "movie.mkv").write_bytes(b"")
|
||||
(folder / "extras.iso").write_bytes(b"")
|
||||
prober = _RaisingProber()
|
||||
|
||||
result = inspect_release(_MOVIE_NAME, folder, _KB, prober)
|
||||
|
||||
assert result.parsed.media_type == "unknown"
|
||||
assert result.probe_used is False
|
||||
|
||||
def test_enrich_runs_when_probe_succeeds(self, tmp_path: Path) -> None:
|
||||
# Build a release name with no codec; probe should fill it in.
|
||||
name = "Inception.2010.1080p.BluRay-GROUP"
|
||||
folder = tmp_path / name
|
||||
folder.mkdir()
|
||||
(folder / "movie.mkv").write_bytes(b"")
|
||||
prober = _StubProber(_media_info_1080p_h264())
|
||||
|
||||
result = inspect_release(name, folder, _KB, prober)
|
||||
|
||||
assert result.probe_used is True
|
||||
# enrich_from_probe should have filled the missing codec field.
|
||||
assert result.parsed.codec is not None
|
||||
|
||||
|
||||
# --------------------------------------------------------------------------- #
|
||||
# Resilience #
|
||||
# --------------------------------------------------------------------------- #
|
||||
|
||||
|
||||
class TestResilience:
|
||||
def test_nonexistent_path_does_not_raise(self, tmp_path: Path) -> None:
|
||||
ghost = tmp_path / "does-not-exist"
|
||||
prober = _RaisingProber()
|
||||
|
||||
result = inspect_release(_MOVIE_NAME, ghost, _KB, prober)
|
||||
|
||||
assert result.main_video is None
|
||||
assert result.media_info is None
|
||||
assert result.probe_used is False
|
||||
|
||||
def test_tv_release_inspection(self, tmp_path: Path) -> None:
|
||||
folder = tmp_path / _TV_NAME
|
||||
folder.mkdir()
|
||||
video = folder / "episode.mkv"
|
||||
video.write_bytes(b"")
|
||||
prober = _StubProber(_media_info_1080p_h264())
|
||||
|
||||
result = inspect_release(_TV_NAME, folder, _KB, prober)
|
||||
|
||||
assert result.parsed.media_type == "tv_show"
|
||||
assert result.parsed.season == 1
|
||||
assert result.parsed.episode == 1
|
||||
assert result.main_video == video
|
||||
assert result.probe_used is True
|
||||
|
||||
|
||||
# --------------------------------------------------------------------------- #
|
||||
# Frozen contract #
|
||||
# --------------------------------------------------------------------------- #
|
||||
|
||||
|
||||
class TestFrozen:
|
||||
def test_inspected_result_is_frozen(self, tmp_path: Path) -> None:
|
||||
folder = tmp_path / _MOVIE_NAME
|
||||
folder.mkdir()
|
||||
(folder / "movie.mkv").write_bytes(b"")
|
||||
prober = _StubProber(None)
|
||||
|
||||
result = inspect_release(_MOVIE_NAME, folder, _KB, prober)
|
||||
|
||||
# frozen=True → assigning a field raises FrozenInstanceError.
|
||||
import dataclasses
|
||||
|
||||
try:
|
||||
result.probe_used = True # type: ignore[misc]
|
||||
except dataclasses.FrozenInstanceError:
|
||||
pass
|
||||
else: # pragma: no cover
|
||||
raise AssertionError("InspectedResult should be frozen")
|
||||
@@ -322,6 +322,104 @@ class TestSeries:
|
||||
assert out.status == "needs_clarification"
|
||||
|
||||
|
||||
# --------------------------------------------------------------------------- #
|
||||
# Probe enrichment wiring #
|
||||
# --------------------------------------------------------------------------- #
|
||||
|
||||
|
||||
class _StubProber:
|
||||
"""Minimal MediaProber stub used to drive enrich_from_probe."""
|
||||
|
||||
def __init__(self, info):
|
||||
self._info = info
|
||||
|
||||
def list_subtitle_streams(self, video): # pragma: no cover - unused here
|
||||
return []
|
||||
|
||||
def probe(self, video):
|
||||
return self._info
|
||||
|
||||
|
||||
def _stereo_movie_info():
|
||||
"""A MediaInfo that fills quality+codec when the release name omits them."""
|
||||
from alfred.domain.shared.media import AudioTrack, MediaInfo, VideoTrack
|
||||
|
||||
return MediaInfo(
|
||||
video_tracks=(VideoTrack(index=0, codec="hevc", width=1920, height=1080),),
|
||||
audio_tracks=(
|
||||
AudioTrack(
|
||||
index=1,
|
||||
codec="aac",
|
||||
channels=2,
|
||||
channel_layout="stereo",
|
||||
language="eng",
|
||||
is_default=True,
|
||||
),
|
||||
),
|
||||
subtitle_tracks=(),
|
||||
)
|
||||
|
||||
|
||||
class TestProbeEnrichmentWiring:
|
||||
"""When source_path/source_file points to a real file, the resolver
|
||||
should pick up ffprobe data via inspect_release and let the enriched
|
||||
tech_string land in the destination name."""
|
||||
|
||||
def test_movie_picks_up_probe_quality(
|
||||
self, cfg_memory, tmp_path, monkeypatch
|
||||
):
|
||||
from alfred.application.filesystem import resolve_destination as rd
|
||||
|
||||
monkeypatch.setattr(rd, "_PROBER", _StubProber(_stereo_movie_info()))
|
||||
# Release name parses to "movie" but is missing the quality token;
|
||||
# probe must supply 1080p and refresh tech_string.
|
||||
bare_name = "Inception.2010.BluRay.x264-GROUP"
|
||||
video = tmp_path / "movie.mkv"
|
||||
video.write_bytes(b"")
|
||||
|
||||
out = resolve_movie_destination(bare_name, str(video), "Inception", 2010)
|
||||
|
||||
assert out.status == "ok"
|
||||
# tech_string -> "1080p.BluRay.x264" -> "1080p" shows up in names.
|
||||
assert "1080p" in out.movie_folder_name
|
||||
assert "1080p" in out.filename
|
||||
|
||||
def test_movie_skips_probe_when_path_missing(self, cfg_memory, monkeypatch):
|
||||
# If the file doesn't exist, no probe runs (the stub would have
|
||||
# injected 1080p — its absence proves the skip).
|
||||
from alfred.application.filesystem import resolve_destination as rd
|
||||
|
||||
monkeypatch.setattr(rd, "_PROBER", _StubProber(_stereo_movie_info()))
|
||||
out = resolve_movie_destination(
|
||||
"Inception.2010.BluRay.x264-GROUP",
|
||||
"/nowhere/m.mkv",
|
||||
"Inception",
|
||||
2010,
|
||||
)
|
||||
assert out.status == "ok"
|
||||
assert "1080p" not in out.movie_folder_name
|
||||
|
||||
def test_season_picks_up_probe_via_source_path(
|
||||
self, cfg_memory, tmp_path, monkeypatch
|
||||
):
|
||||
from alfred.application.filesystem import resolve_destination as rd
|
||||
|
||||
monkeypatch.setattr(rd, "_PROBER", _StubProber(_stereo_movie_info()))
|
||||
# Season pack name missing quality token; probe must add it.
|
||||
bare_name = "Oz.S03.BluRay.x265-KONTRAST"
|
||||
release_dir = tmp_path / bare_name
|
||||
release_dir.mkdir()
|
||||
(release_dir / "episode.mkv").write_bytes(b"")
|
||||
|
||||
out = resolve_season_destination(
|
||||
bare_name, "Oz", 1997, source_path=str(release_dir)
|
||||
)
|
||||
|
||||
assert out.status == "ok"
|
||||
# Series folder name embeds tech_string -> "1080p" surfaced by probe.
|
||||
assert "1080p" in out.series_folder_name
|
||||
|
||||
|
||||
# --------------------------------------------------------------------------- #
|
||||
# DTO to_dict() #
|
||||
# --------------------------------------------------------------------------- #
|
||||
|
||||
@@ -0,0 +1,130 @@
|
||||
"""Tests for the pre-pipeline exclusion helpers (Phase A bis)."""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
from pathlib import Path
|
||||
|
||||
import pytest
|
||||
|
||||
from alfred.application.release.supported_media import (
|
||||
find_main_video,
|
||||
is_supported_video,
|
||||
)
|
||||
from alfred.infrastructure.knowledge.release_kb import YamlReleaseKnowledge
|
||||
|
||||
_KB = YamlReleaseKnowledge()
|
||||
|
||||
|
||||
# --------------------------------------------------------------------- #
|
||||
# is_supported_video #
|
||||
# --------------------------------------------------------------------- #
|
||||
|
||||
|
||||
class TestIsSupportedVideo:
|
||||
def test_mkv_is_supported(self, tmp_path: Path) -> None:
|
||||
f = tmp_path / "movie.mkv"
|
||||
f.touch()
|
||||
assert is_supported_video(f, _KB) is True
|
||||
|
||||
def test_mp4_is_supported(self, tmp_path: Path) -> None:
|
||||
f = tmp_path / "movie.mp4"
|
||||
f.touch()
|
||||
assert is_supported_video(f, _KB) is True
|
||||
|
||||
def test_uppercase_extension_is_supported(self, tmp_path: Path) -> None:
|
||||
# File systems can return mixed case; we lowercase the suffix.
|
||||
f = tmp_path / "movie.MKV"
|
||||
f.touch()
|
||||
assert is_supported_video(f, _KB) is True
|
||||
|
||||
def test_srt_is_not_video(self, tmp_path: Path) -> None:
|
||||
f = tmp_path / "movie.srt"
|
||||
f.touch()
|
||||
assert is_supported_video(f, _KB) is False
|
||||
|
||||
def test_nfo_is_not_video(self, tmp_path: Path) -> None:
|
||||
f = tmp_path / "movie.nfo"
|
||||
f.touch()
|
||||
assert is_supported_video(f, _KB) is False
|
||||
|
||||
def test_no_extension_is_not_video(self, tmp_path: Path) -> None:
|
||||
f = tmp_path / "README"
|
||||
f.touch()
|
||||
assert is_supported_video(f, _KB) is False
|
||||
|
||||
def test_directory_is_not_video(self, tmp_path: Path) -> None:
|
||||
d = tmp_path / "subdir.mkv" # even with a video extension
|
||||
d.mkdir()
|
||||
assert is_supported_video(d, _KB) is False
|
||||
|
||||
def test_nonexistent_path_is_not_video(self, tmp_path: Path) -> None:
|
||||
assert is_supported_video(tmp_path / "ghost.mkv", _KB) is False
|
||||
|
||||
|
||||
# --------------------------------------------------------------------- #
|
||||
# find_main_video #
|
||||
# --------------------------------------------------------------------- #
|
||||
|
||||
|
||||
class TestFindMainVideo:
|
||||
def test_single_video_file_in_folder(self, tmp_path: Path) -> None:
|
||||
main = tmp_path / "Movie.2020.mkv"
|
||||
main.touch()
|
||||
assert find_main_video(tmp_path, _KB) == main
|
||||
|
||||
def test_returns_lexicographically_first_among_multiple(
|
||||
self, tmp_path: Path
|
||||
) -> None:
|
||||
# Legitimate for season packs: pick the first episode by name.
|
||||
ep2 = tmp_path / "Show.S01E02.mkv"
|
||||
ep1 = tmp_path / "Show.S01E01.mkv"
|
||||
ep2.touch()
|
||||
ep1.touch()
|
||||
assert find_main_video(tmp_path, _KB) == ep1
|
||||
|
||||
def test_skips_non_video_files(self, tmp_path: Path) -> None:
|
||||
# nfo and srt come alphabetically before .mkv, must not win.
|
||||
(tmp_path / "Movie.nfo").touch()
|
||||
(tmp_path / "Movie.srt").touch()
|
||||
vid = tmp_path / "Movie.mkv"
|
||||
vid.touch()
|
||||
assert find_main_video(tmp_path, _KB) == vid
|
||||
|
||||
def test_ignores_subdirectories(self, tmp_path: Path) -> None:
|
||||
# A Sample/ subdir must NOT be descended into.
|
||||
sample_dir = tmp_path / "Sample"
|
||||
sample_dir.mkdir()
|
||||
(sample_dir / "sample.mkv").touch()
|
||||
main = tmp_path / "Movie.mkv"
|
||||
main.touch()
|
||||
assert find_main_video(tmp_path, _KB) == main
|
||||
|
||||
def test_only_subdirectory_with_video_returns_none(
|
||||
self, tmp_path: Path
|
||||
) -> None:
|
||||
# No top-level video, only one inside a subdir → None.
|
||||
sub = tmp_path / "Sample"
|
||||
sub.mkdir()
|
||||
(sub / "video.mkv").touch()
|
||||
assert find_main_video(tmp_path, _KB) is None
|
||||
|
||||
def test_empty_folder_returns_none(self, tmp_path: Path) -> None:
|
||||
assert find_main_video(tmp_path, _KB) is None
|
||||
|
||||
def test_nonexistent_folder_returns_none(self, tmp_path: Path) -> None:
|
||||
assert find_main_video(tmp_path / "ghost", _KB) is None
|
||||
|
||||
def test_single_file_release_passed_as_folder_arg(
|
||||
self, tmp_path: Path
|
||||
) -> None:
|
||||
# Some releases are a bare .mkv with no enclosing folder.
|
||||
f = tmp_path / "Movie.2020.1080p.mkv"
|
||||
f.touch()
|
||||
assert find_main_video(f, _KB) == f
|
||||
|
||||
def test_single_file_non_video_passed_as_folder_arg(
|
||||
self, tmp_path: Path
|
||||
) -> None:
|
||||
f = tmp_path / "README.nfo"
|
||||
f.touch()
|
||||
assert find_main_video(f, _KB) is None
|
||||
@@ -0,0 +1,216 @@
|
||||
"""EASY-path tests for the v2 annotate-based pipeline.
|
||||
|
||||
These tests assert that the **v2 pipeline itself** produces the correct
|
||||
annotated stream and assembled fields for releases from known groups
|
||||
(KONTRAST, ELiTE, …) — without going through ``parse_release``. The
|
||||
fixtures suite (``tests/domain/test_release_fixtures.py``) already
|
||||
locks the user-visible ``ParsedRelease`` contract; here we cover the
|
||||
internal pipeline behavior so a future refactor of ``parse_release``
|
||||
can't quietly drop EASY without us noticing.
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
from alfred.domain.release.parser import TokenRole
|
||||
from alfred.domain.release.parser.pipeline import (
|
||||
_detect_group,
|
||||
annotate,
|
||||
assemble,
|
||||
tokenize,
|
||||
)
|
||||
from alfred.infrastructure.knowledge.release_kb import YamlReleaseKnowledge
|
||||
|
||||
_KB = YamlReleaseKnowledge()
|
||||
|
||||
|
||||
class TestDetectGroup:
|
||||
def test_codec_group(self) -> None:
|
||||
tokens, _ = tokenize(
|
||||
"Back.in.Action.2025.1080p.WEBRip.x265-KONTRAST", _KB
|
||||
)
|
||||
name, idx = _detect_group(tokens, _KB)
|
||||
assert name == "KONTRAST"
|
||||
assert idx == 6 # x265-KONTRAST is the 7th token
|
||||
|
||||
def test_unknown_when_no_dash(self) -> None:
|
||||
tokens, _ = tokenize("Some.Movie.2020.1080p.WEBRip.x265.KONTRAST", _KB)
|
||||
# No dash anywhere → no group detected.
|
||||
name, idx = _detect_group(tokens, _KB)
|
||||
assert idx is None
|
||||
assert name == "UNKNOWN"
|
||||
|
||||
def test_skips_dashed_source(self) -> None:
|
||||
# "Web-DL" must not be mistaken for a group token.
|
||||
tokens, _ = tokenize("Movie.2020.1080p.Web-DL.x265-GRP", _KB)
|
||||
name, idx = _detect_group(tokens, _KB)
|
||||
assert name == "GRP"
|
||||
|
||||
|
||||
class TestAnnotateEasy:
|
||||
def test_kontrast_movie(self) -> None:
|
||||
tokens, tag = tokenize(
|
||||
"Back.in.Action.2025.1080p.WEBRip.x265-KONTRAST", _KB
|
||||
)
|
||||
annotated = annotate(tokens, _KB)
|
||||
assert annotated is not None, "KONTRAST should hit the EASY path"
|
||||
|
||||
roles = [t.role for t in annotated]
|
||||
assert roles == [
|
||||
TokenRole.TITLE, # Back
|
||||
TokenRole.TITLE, # in
|
||||
TokenRole.TITLE, # Action
|
||||
TokenRole.YEAR,
|
||||
TokenRole.RESOLUTION,
|
||||
TokenRole.SOURCE,
|
||||
TokenRole.CODEC, # x265-KONTRAST → CODEC with extra.group=KONTRAST
|
||||
]
|
||||
assert annotated[-1].extra["group"] == "KONTRAST"
|
||||
assert annotated[-1].extra["codec"] == "x265"
|
||||
|
||||
def test_kontrast_tv_episode(self) -> None:
|
||||
tokens, _ = tokenize(
|
||||
"Slow.Horses.S05E01.1080p.WEBRip.x265-KONTRAST", _KB
|
||||
)
|
||||
annotated = annotate(tokens, _KB)
|
||||
assert annotated is not None
|
||||
|
||||
# Year is optional and absent → skipped. Season_episode present.
|
||||
roles = [t.role for t in annotated]
|
||||
assert TokenRole.SEASON_EPISODE in roles
|
||||
assert TokenRole.YEAR not in roles
|
||||
|
||||
def test_elite_no_source(self) -> None:
|
||||
# ELiTE schema marks source as optional — Foundation.S02 omits it.
|
||||
tokens, _ = tokenize("Foundation.S02.1080p.x265-ELiTE", _KB)
|
||||
annotated = annotate(tokens, _KB)
|
||||
assert annotated is not None, "ELiTE optional source must be tolerated"
|
||||
|
||||
roles = [t.role for t in annotated]
|
||||
assert TokenRole.SOURCE not in roles
|
||||
assert TokenRole.RESOLUTION in roles
|
||||
assert TokenRole.CODEC in roles
|
||||
|
||||
def test_unknown_group_falls_to_shitty(self) -> None:
|
||||
tokens, _ = tokenize("Some.Movie.2020.1080p.WEBRip.x264-RANDOM", _KB)
|
||||
# RANDOM is not in our release_groups/ — annotate() now falls
|
||||
# through to the in-pipeline SHITTY pass and returns a populated
|
||||
# token list (no None sentinel anymore).
|
||||
annotated = annotate(tokens, _KB)
|
||||
assert annotated is not None
|
||||
roles = [t.role for t in annotated]
|
||||
# Title is "Some.Movie", then YEAR, RESOLUTION, SOURCE, CODEC
|
||||
# carrying the group in extra.
|
||||
assert TokenRole.TITLE in roles
|
||||
assert TokenRole.YEAR in roles
|
||||
assert TokenRole.RESOLUTION in roles
|
||||
assert TokenRole.SOURCE in roles
|
||||
assert TokenRole.CODEC in roles
|
||||
codec_tok = next(t for t in annotated if t.role is TokenRole.CODEC)
|
||||
assert codec_tok.extra.get("group") == "RANDOM"
|
||||
|
||||
|
||||
class TestAssemble:
|
||||
def test_kontrast_movie_fields(self) -> None:
|
||||
name = "Back.in.Action.2025.1080p.WEBRip.x265-KONTRAST"
|
||||
tokens, tag = tokenize(name, _KB)
|
||||
annotated = annotate(tokens, _KB)
|
||||
fields = assemble(annotated, tag, name, _KB)
|
||||
|
||||
assert fields["title"] == "Back.in.Action"
|
||||
assert fields["year"] == 2025
|
||||
assert fields["season"] is None
|
||||
assert fields["quality"] == "1080p"
|
||||
assert fields["source"] == "WEBRip"
|
||||
assert fields["codec"] == "x265"
|
||||
assert fields["group"] == "KONTRAST"
|
||||
assert fields["tech_string"] == "1080p.WEBRip.x265"
|
||||
assert fields["media_type"] == "movie"
|
||||
assert fields["site_tag"] is None
|
||||
|
||||
def test_kontrast_tv_fields(self) -> None:
|
||||
name = "Slow.Horses.S05E01.1080p.WEBRip.x265-KONTRAST"
|
||||
tokens, tag = tokenize(name, _KB)
|
||||
annotated = annotate(tokens, _KB)
|
||||
fields = assemble(annotated, tag, name, _KB)
|
||||
|
||||
assert fields["title"] == "Slow.Horses"
|
||||
assert fields["year"] is None
|
||||
assert fields["season"] == 5
|
||||
assert fields["episode"] == 1
|
||||
assert fields["media_type"] == "tv_show"
|
||||
assert fields["group"] == "KONTRAST"
|
||||
|
||||
def test_elite_season_pack(self) -> None:
|
||||
name = "Foundation.S02.1080p.x265-ELiTE"
|
||||
tokens, tag = tokenize(name, _KB)
|
||||
annotated = annotate(tokens, _KB)
|
||||
fields = assemble(annotated, tag, name, _KB)
|
||||
|
||||
assert fields["title"] == "Foundation"
|
||||
assert fields["season"] == 2
|
||||
assert fields["episode"] is None # season pack
|
||||
assert fields["source"] is None # ELiTE omits it
|
||||
assert fields["tech_string"] == "1080p.x265"
|
||||
assert fields["group"] == "ELiTE"
|
||||
|
||||
|
||||
class TestEnrichers:
|
||||
"""Non-positional roles populated alongside the structural walk.
|
||||
|
||||
These releases would have failed the v2 EASY path before the enricher
|
||||
pass landed (leftover unknown tokens would force a fallback). They
|
||||
now succeed in v2 with rich metadata.
|
||||
"""
|
||||
|
||||
def test_bit_depth_and_audio(self) -> None:
|
||||
name = "Back.in.Action.2025.1080p.WEBRip.10bit.DDP.5.1.x265-KONTRAST"
|
||||
tokens, tag = tokenize(name, _KB)
|
||||
annotated = annotate(tokens, _KB)
|
||||
assert annotated is not None
|
||||
fields = assemble(annotated, tag, name, _KB)
|
||||
|
||||
assert fields["title"] == "Back.in.Action"
|
||||
assert fields["bit_depth"] == "10bit"
|
||||
assert fields["audio_codec"] == "DDP"
|
||||
assert fields["audio_channels"] == "5.1"
|
||||
|
||||
def test_hdr_sequence(self) -> None:
|
||||
# DV.HDR10 sequence + TrueHD.Atmos sequence + 7.1 channels +
|
||||
# DIRECTORS.CUT edition all in one release.
|
||||
name = (
|
||||
"Some.Movie.2024.DIRECTORS.CUT.2160p.BluRay.DV.HDR10."
|
||||
"TrueHD.Atmos.7.1.x265-KONTRAST"
|
||||
)
|
||||
tokens, tag = tokenize(name, _KB)
|
||||
annotated = annotate(tokens, _KB)
|
||||
assert annotated is not None
|
||||
fields = assemble(annotated, tag, name, _KB)
|
||||
|
||||
assert fields["edition"] == "DIRECTORS.CUT"
|
||||
assert fields["hdr_format"] == "DV.HDR10"
|
||||
assert fields["audio_codec"] == "TrueHD.Atmos"
|
||||
assert fields["audio_channels"] == "7.1"
|
||||
|
||||
def test_multiple_languages(self) -> None:
|
||||
name = "Movie.2020.FRENCH.MULTI.1080p.WEBRip.DTS.HD.MA.5.1.x265-KONTRAST"
|
||||
tokens, tag = tokenize(name, _KB)
|
||||
annotated = annotate(tokens, _KB)
|
||||
assert annotated is not None
|
||||
fields = assemble(annotated, tag, name, _KB)
|
||||
|
||||
assert fields["languages"] == ["FRENCH", "MULTI"]
|
||||
assert fields["audio_codec"] == "DTS-HD.MA"
|
||||
assert fields["audio_channels"] == "5.1"
|
||||
|
||||
def test_tv_with_language(self) -> None:
|
||||
name = "Show.S01E05.FRENCH.1080p.WEBRip.x265-KONTRAST"
|
||||
tokens, tag = tokenize(name, _KB)
|
||||
annotated = annotate(tokens, _KB)
|
||||
assert annotated is not None
|
||||
fields = assemble(annotated, tag, name, _KB)
|
||||
|
||||
assert fields["title"] == "Show"
|
||||
assert fields["season"] == 1
|
||||
assert fields["episode"] == 5
|
||||
assert fields["languages"] == ["FRENCH"]
|
||||
assert fields["media_type"] == "tv_show"
|
||||
@@ -0,0 +1,79 @@
|
||||
"""Scaffolding tests for the v2 parser package.
|
||||
|
||||
These tests lock the **shape** of the new pipeline (token VOs, tokenize
|
||||
output, site-tag stripping) before the annotate step is wired in. They
|
||||
do not check parsed-release output yet — that comes once :func:`annotate`
|
||||
is implemented and the fixtures-based suite switches over.
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
from alfred.domain.release.parser import Token, TokenRole
|
||||
from alfred.domain.release.parser.pipeline import strip_site_tag, tokenize
|
||||
from alfred.infrastructure.knowledge.release_kb import YamlReleaseKnowledge
|
||||
|
||||
_KB = YamlReleaseKnowledge()
|
||||
|
||||
|
||||
class TestToken:
|
||||
def test_default_role_is_unknown(self) -> None:
|
||||
t = Token(text="1080p", index=3)
|
||||
assert t.role is TokenRole.UNKNOWN
|
||||
assert not t.is_annotated
|
||||
|
||||
def test_with_role_returns_new_instance(self) -> None:
|
||||
t = Token(text="1080p", index=3)
|
||||
promoted = t.with_role(TokenRole.RESOLUTION)
|
||||
assert promoted is not t
|
||||
assert promoted.role is TokenRole.RESOLUTION
|
||||
assert t.role is TokenRole.UNKNOWN # original unchanged (frozen)
|
||||
|
||||
def test_with_role_merges_extra(self) -> None:
|
||||
t = Token(text="x265-KONTRAST", index=5)
|
||||
promoted = t.with_role(TokenRole.CODEC, group="KONTRAST")
|
||||
assert promoted.role is TokenRole.CODEC
|
||||
assert promoted.extra == {"group": "KONTRAST"}
|
||||
|
||||
|
||||
class TestStripSiteTag:
|
||||
def test_no_tag(self) -> None:
|
||||
clean, tag = strip_site_tag("The.Movie.2020.1080p-GRP")
|
||||
assert tag is None
|
||||
assert clean == "The.Movie.2020.1080p-GRP"
|
||||
|
||||
def test_suffix_tag(self) -> None:
|
||||
clean, tag = strip_site_tag("Sinners.2025.1080p-[YTS.MX]")
|
||||
assert tag == "YTS.MX"
|
||||
assert clean == "Sinners.2025.1080p-"
|
||||
|
||||
def test_prefix_tag(self) -> None:
|
||||
clean, tag = strip_site_tag("[ OxTorrent.vc ] The.Title.S01E01")
|
||||
assert tag == "OxTorrent.vc"
|
||||
assert clean == "The.Title.S01E01"
|
||||
|
||||
|
||||
class TestTokenize:
|
||||
def test_simple_release(self) -> None:
|
||||
tokens, tag = tokenize("Back.in.Action.2025.1080p.WEBRip.x265-KONTRAST", _KB)
|
||||
assert tag is None
|
||||
texts = [t.text for t in tokens]
|
||||
# Dash is not a separator, so x265-KONTRAST stays glued.
|
||||
assert texts == [
|
||||
"Back", "in", "Action", "2025", "1080p", "WEBRip", "x265-KONTRAST",
|
||||
]
|
||||
|
||||
def test_all_tokens_start_unknown(self) -> None:
|
||||
tokens, _ = tokenize("Back.in.Action.2025.1080p.WEBRip.x265-KONTRAST", _KB)
|
||||
assert all(t.role is TokenRole.UNKNOWN for t in tokens)
|
||||
|
||||
def test_indexes_are_contiguous(self) -> None:
|
||||
tokens, _ = tokenize("A.B.C.D", _KB)
|
||||
assert [t.index for t in tokens] == [0, 1, 2, 3]
|
||||
|
||||
def test_strips_site_tag_before_tokenize(self) -> None:
|
||||
tokens, tag = tokenize(
|
||||
"Sinners.2025.1080p.WEBRip.x265.10bit.AAC5.1-[YTS.MX]", _KB
|
||||
)
|
||||
assert tag == "YTS.MX"
|
||||
# Site tag substring must not appear among tokens.
|
||||
assert not any("YTS" in t.text for t in tokens)
|
||||
@@ -0,0 +1,282 @@
|
||||
"""Phase A — parse-confidence scoring.
|
||||
|
||||
These tests pin the score / road semantics without going through
|
||||
fixtures. They exercise the small pure functions in
|
||||
``alfred.domain.release.parser.scoring`` and the end-to-end contract
|
||||
that ``parse_release`` returns a ``(ParsedRelease, ParseReport)`` tuple.
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import pytest
|
||||
|
||||
from alfred.domain.release.parser.scoring import (
|
||||
Road,
|
||||
collect_missing_critical,
|
||||
collect_unknown_tokens,
|
||||
compute_score,
|
||||
decide_road,
|
||||
)
|
||||
from alfred.domain.release.parser.tokens import Token, TokenRole
|
||||
from alfred.domain.release.services import parse_release
|
||||
from alfred.domain.release.value_objects import (
|
||||
MediaTypeToken,
|
||||
ParsedRelease,
|
||||
ParsePath,
|
||||
ParseReport,
|
||||
)
|
||||
from alfred.domain.shared.exceptions import ValidationError
|
||||
from alfred.infrastructure.knowledge.release_kb import YamlReleaseKnowledge
|
||||
|
||||
_KB = YamlReleaseKnowledge()
|
||||
|
||||
|
||||
# --------------------------------------------------------------------- #
|
||||
# ParseReport VO #
|
||||
# --------------------------------------------------------------------- #
|
||||
|
||||
|
||||
class TestParseReport:
|
||||
def test_construct_with_defaults(self) -> None:
|
||||
report = ParseReport(confidence=80, road="easy")
|
||||
assert report.confidence == 80
|
||||
assert report.road == "easy"
|
||||
assert report.unknown_tokens == ()
|
||||
assert report.missing_critical == ()
|
||||
|
||||
def test_is_frozen(self) -> None:
|
||||
report = ParseReport(confidence=50, road="shitty")
|
||||
with pytest.raises(Exception): # FrozenInstanceError
|
||||
report.confidence = 99 # type: ignore[misc]
|
||||
|
||||
def test_confidence_lower_bound(self) -> None:
|
||||
with pytest.raises(ValidationError):
|
||||
ParseReport(confidence=-1, road="easy")
|
||||
|
||||
def test_confidence_upper_bound(self) -> None:
|
||||
with pytest.raises(ValidationError):
|
||||
ParseReport(confidence=101, road="easy")
|
||||
|
||||
|
||||
# --------------------------------------------------------------------- #
|
||||
# compute_score #
|
||||
# --------------------------------------------------------------------- #
|
||||
|
||||
|
||||
def _movie(year: int = 2020, **overrides) -> ParsedRelease:
|
||||
"""Build a populated movie ParsedRelease for scoring tests."""
|
||||
base = dict(
|
||||
raw="Inception.2010.1080p.BluRay.x264-GROUP",
|
||||
clean="Inception.2010.1080p.BluRay.x264-GROUP",
|
||||
title="Inception",
|
||||
title_sanitized="Inception",
|
||||
year=year,
|
||||
season=None,
|
||||
episode=None,
|
||||
episode_end=None,
|
||||
quality="1080p",
|
||||
source="BluRay",
|
||||
codec="x264",
|
||||
group="GROUP",
|
||||
tech_string="1080p.BluRay.x264",
|
||||
media_type=MediaTypeToken.MOVIE,
|
||||
parse_path=ParsePath.DIRECT,
|
||||
)
|
||||
base.update(overrides)
|
||||
return ParsedRelease(**base)
|
||||
|
||||
|
||||
def _all_annotated() -> list[Token]:
|
||||
"""Token stream where everything is annotated — zero penalty."""
|
||||
return [
|
||||
Token("Inception", 0, TokenRole.TITLE),
|
||||
Token("2010", 1, TokenRole.YEAR),
|
||||
Token("1080p", 2, TokenRole.RESOLUTION),
|
||||
Token("BluRay", 3, TokenRole.SOURCE),
|
||||
Token("x264", 4, TokenRole.CODEC),
|
||||
Token("GROUP", 5, TokenRole.GROUP),
|
||||
]
|
||||
|
||||
|
||||
class TestComputeScore:
|
||||
def test_fully_populated_movie_scores_high(self) -> None:
|
||||
parsed = _movie()
|
||||
score = compute_score(parsed, _all_annotated(), _KB)
|
||||
# title 30 + media_type 20 + year 15 + resolution 5 + source 5
|
||||
# + codec 5 + group 5 = 85
|
||||
assert score == 85
|
||||
|
||||
def test_tv_show_gets_season_and_episode_weight(self) -> None:
|
||||
parsed = ParsedRelease(
|
||||
raw="Oz.S01E01.1080p.WEBRip.x265-KONTRAST",
|
||||
clean="Oz.S01E01.1080p.WEBRip.x265-KONTRAST",
|
||||
title="Oz",
|
||||
title_sanitized="Oz",
|
||||
year=None,
|
||||
season=1,
|
||||
episode=1,
|
||||
episode_end=None,
|
||||
quality="1080p",
|
||||
source="WEBRip",
|
||||
codec="x265",
|
||||
group="KONTRAST",
|
||||
tech_string="1080p.WEBRip.x265",
|
||||
media_type=MediaTypeToken.TV_SHOW,
|
||||
parse_path=ParsePath.DIRECT,
|
||||
)
|
||||
tokens = [
|
||||
Token("Oz", 0, TokenRole.TITLE),
|
||||
Token("S01E01", 1, TokenRole.SEASON_EPISODE),
|
||||
Token("1080p", 2, TokenRole.RESOLUTION),
|
||||
Token("WEBRip", 3, TokenRole.SOURCE),
|
||||
Token("x265", 4, TokenRole.CODEC),
|
||||
Token("KONTRAST", 5, TokenRole.GROUP),
|
||||
]
|
||||
score = compute_score(parsed, tokens, _KB)
|
||||
# title 30 + media_type 20 + season 10 + episode 5 + resolution 5
|
||||
# + source 5 + codec 5 + group 5 = 85 (no year)
|
||||
assert score == 85
|
||||
|
||||
def test_unknown_tokens_subtract_penalty(self) -> None:
|
||||
parsed = _movie()
|
||||
tokens = _all_annotated() + [
|
||||
Token("noise", 6, TokenRole.UNKNOWN),
|
||||
Token("more", 7, TokenRole.UNKNOWN),
|
||||
]
|
||||
score = compute_score(parsed, tokens, _KB)
|
||||
# 85 baseline - 2*5 unknown tokens = 75
|
||||
assert score == 75
|
||||
|
||||
def test_unknown_penalty_capped(self) -> None:
|
||||
parsed = _movie()
|
||||
# 20 unknown tokens × 5 = 100 raw, capped at 30
|
||||
tokens = _all_annotated() + [
|
||||
Token(f"t{i}", 6 + i, TokenRole.UNKNOWN) for i in range(20)
|
||||
]
|
||||
score = compute_score(parsed, tokens, _KB)
|
||||
assert score == 85 - 30
|
||||
|
||||
def test_score_clamped_to_zero(self) -> None:
|
||||
# Empty-ish parse with lots of unknown tokens
|
||||
parsed = _movie(year=None, quality=None, source=None, codec=None)
|
||||
tokens = [Token(f"t{i}", i, TokenRole.UNKNOWN) for i in range(10)]
|
||||
score = compute_score(parsed, tokens, _KB)
|
||||
# title 30 + media_type 20 + group 5 = 55, -30 cap = 25
|
||||
# Sanity: still clamped at 0 minimum even if math goes weird
|
||||
assert 0 <= score <= 100
|
||||
|
||||
def test_unknown_media_type_does_not_count(self) -> None:
|
||||
parsed = _movie(media_type=MediaTypeToken.UNKNOWN)
|
||||
score = compute_score(parsed, _all_annotated(), _KB)
|
||||
# Loses the 20 of media_type vs baseline
|
||||
assert score == 85 - 20
|
||||
|
||||
def test_unknown_group_does_not_count(self) -> None:
|
||||
parsed = _movie(group="UNKNOWN")
|
||||
score = compute_score(parsed, _all_annotated(), _KB)
|
||||
assert score == 85 - 5
|
||||
|
||||
|
||||
# --------------------------------------------------------------------- #
|
||||
# decide_road #
|
||||
# --------------------------------------------------------------------- #
|
||||
|
||||
|
||||
class TestDecideRoad:
|
||||
def test_known_schema_is_easy_regardless_of_score(self) -> None:
|
||||
# Even a terrible score returns EASY when a schema matched.
|
||||
assert decide_road(score=0, has_schema=True, kb=_KB) is Road.EASY
|
||||
|
||||
def test_no_schema_high_score_is_shitty(self) -> None:
|
||||
assert decide_road(score=80, has_schema=False, kb=_KB) is Road.SHITTY
|
||||
|
||||
def test_no_schema_low_score_is_pop(self) -> None:
|
||||
assert decide_road(score=10, has_schema=False, kb=_KB) is Road.PATH_OF_PAIN
|
||||
|
||||
def test_threshold_boundary_is_inclusive(self) -> None:
|
||||
threshold = _KB.scoring["thresholds"]["shitty_min"]
|
||||
assert decide_road(threshold, has_schema=False, kb=_KB) is Road.SHITTY
|
||||
assert (
|
||||
decide_road(threshold - 1, has_schema=False, kb=_KB)
|
||||
is Road.PATH_OF_PAIN
|
||||
)
|
||||
|
||||
|
||||
# --------------------------------------------------------------------- #
|
||||
# Collectors #
|
||||
# --------------------------------------------------------------------- #
|
||||
|
||||
|
||||
class TestCollectors:
|
||||
def test_collect_unknown_tokens_preserves_order(self) -> None:
|
||||
tokens = [
|
||||
Token("A", 0, TokenRole.TITLE),
|
||||
Token("X", 1, TokenRole.UNKNOWN),
|
||||
Token("B", 2, TokenRole.RESOLUTION),
|
||||
Token("Y", 3, TokenRole.UNKNOWN),
|
||||
]
|
||||
assert collect_unknown_tokens(tokens) == ("X", "Y")
|
||||
|
||||
def test_collect_missing_critical_full(self) -> None:
|
||||
empty = ParsedRelease(
|
||||
raw="x",
|
||||
clean="x",
|
||||
title="",
|
||||
title_sanitized="",
|
||||
year=None,
|
||||
season=None,
|
||||
episode=None,
|
||||
episode_end=None,
|
||||
quality=None,
|
||||
source=None,
|
||||
codec=None,
|
||||
group="UNKNOWN",
|
||||
tech_string="",
|
||||
media_type=MediaTypeToken.UNKNOWN,
|
||||
parse_path=ParsePath.DIRECT,
|
||||
)
|
||||
assert set(collect_missing_critical(empty)) == {
|
||||
"title",
|
||||
"media_type",
|
||||
"year",
|
||||
}
|
||||
|
||||
def test_collect_missing_critical_none(self) -> None:
|
||||
parsed = _movie()
|
||||
assert collect_missing_critical(parsed) == ()
|
||||
|
||||
|
||||
# --------------------------------------------------------------------- #
|
||||
# End-to-end contract #
|
||||
# --------------------------------------------------------------------- #
|
||||
|
||||
|
||||
class TestParseReleaseReturnsReport:
|
||||
def test_returns_tuple(self) -> None:
|
||||
result = parse_release("Inception.2010.1080p.BluRay.x264-GROUP", _KB)
|
||||
assert isinstance(result, tuple)
|
||||
assert len(result) == 2
|
||||
parsed, report = result
|
||||
assert isinstance(parsed, ParsedRelease)
|
||||
assert isinstance(report, ParseReport)
|
||||
|
||||
def test_known_group_is_easy_road(self) -> None:
|
||||
# KONTRAST has a schema in release_groups/
|
||||
_, report = parse_release(
|
||||
"Oz.S03E01.1080p.WEBRip.x265-KONTRAST", _KB
|
||||
)
|
||||
assert report.road == Road.EASY.value
|
||||
assert report.confidence > 0
|
||||
|
||||
def test_unknown_group_well_formed_is_shitty(self) -> None:
|
||||
# No registered schema but well-formed scene name → SHITTY
|
||||
_, report = parse_release(
|
||||
"Inception.2010.1080p.BluRay.x264-NOSCHEMA", _KB
|
||||
)
|
||||
assert report.road == Road.SHITTY.value
|
||||
|
||||
def test_malformed_name_is_pop(self) -> None:
|
||||
# Forbidden chars (@) — short-circuits to AI / PoP.
|
||||
_, report = parse_release("garbage@#%name", _KB)
|
||||
assert report.road == Road.PATH_OF_PAIN.value
|
||||
assert report.confidence == 0
|
||||
@@ -26,7 +26,8 @@ _KB = YamlReleaseKnowledge()
|
||||
|
||||
|
||||
def _parse(name: str) -> ParsedRelease:
|
||||
return parse_release(name, _KB)
|
||||
parsed, _report = parse_release(name, _KB)
|
||||
return parsed
|
||||
|
||||
|
||||
class TestParseTVEpisode:
|
||||
|
||||
@@ -26,19 +26,26 @@ _KB = YamlReleaseKnowledge()
|
||||
FIXTURES = discover_fixtures()
|
||||
|
||||
|
||||
def _fixture_param(f: ReleaseFixture) -> pytest.param:
|
||||
marks = []
|
||||
if f.xfail_reason:
|
||||
marks.append(pytest.mark.xfail(reason=f.xfail_reason, strict=False))
|
||||
return pytest.param(f, id=f.name, marks=marks)
|
||||
|
||||
|
||||
@pytest.mark.parametrize(
|
||||
"fixture",
|
||||
FIXTURES,
|
||||
ids=[f.name for f in FIXTURES],
|
||||
[_fixture_param(f) for f in FIXTURES],
|
||||
)
|
||||
def test_parse_matches_fixture(fixture: ReleaseFixture, tmp_path) -> None:
|
||||
# Materialize the tree to assert it is at least well-formed YAML +
|
||||
# plausible filesystem paths. Catches typos / missing leading dirs early.
|
||||
fixture.materialize(tmp_path)
|
||||
|
||||
result = asdict(parse_release(fixture.release_name, _KB))
|
||||
parsed, _report = parse_release(fixture.release_name, _KB)
|
||||
result = asdict(parsed)
|
||||
# ``is_season_pack`` is a @property — asdict() does not include it.
|
||||
result["is_season_pack"] = parse_release(fixture.release_name, _KB).is_season_pack
|
||||
result["is_season_pack"] = parsed.is_season_pack
|
||||
|
||||
for field, expected in fixture.expected_parsed.items():
|
||||
assert field in result, (
|
||||
|
||||
@@ -28,6 +28,7 @@ from alfred.domain.subtitles.entities import MediaSubtitleMetadata, SubtitleCand
|
||||
from alfred.domain.subtitles.services.utils import available_subtitles
|
||||
from alfred.domain.subtitles.value_objects import (
|
||||
RuleScope,
|
||||
RuleScopeLevel,
|
||||
SubtitleFormat,
|
||||
SubtitleLanguage,
|
||||
SubtitleMatchingRules,
|
||||
@@ -257,7 +258,7 @@ class TestSubtitleRuleSet:
|
||||
def test_override_partial_keeps_parent_for_unset_fields(self):
|
||||
parent = SubtitleRuleSet.global_default()
|
||||
child = SubtitleRuleSet(
|
||||
scope=RuleScope(level="show", identifier="tt1"),
|
||||
scope=RuleScope(level=RuleScopeLevel.SHOW, identifier="tt1"),
|
||||
parent=parent,
|
||||
)
|
||||
child.override(languages=["jpn"])
|
||||
@@ -267,14 +268,14 @@ class TestSubtitleRuleSet:
|
||||
assert rules.min_confidence == parent.resolve(_DEFAULT_RULES).min_confidence
|
||||
|
||||
def test_to_dict_only_emits_set_deltas(self):
|
||||
rs = SubtitleRuleSet(scope=RuleScope(level="show", identifier="tt1"))
|
||||
rs = SubtitleRuleSet(scope=RuleScope(level=RuleScopeLevel.SHOW, identifier="tt1"))
|
||||
rs.override(languages=["fra"])
|
||||
out = rs.to_dict()
|
||||
assert out["scope"] == {"level": "show", "identifier": "tt1"}
|
||||
assert out["override"] == {"languages": ["fra"]}
|
||||
|
||||
def test_to_dict_full_override(self):
|
||||
rs = SubtitleRuleSet(scope=RuleScope(level="global"))
|
||||
rs = SubtitleRuleSet(scope=RuleScope(level=RuleScopeLevel.GLOBAL))
|
||||
rs.override(
|
||||
languages=["fra"],
|
||||
formats=["srt"],
|
||||
|
||||
Vendored
+8
@@ -39,6 +39,14 @@ class ReleaseFixture:
|
||||
def routing(self) -> dict:
|
||||
return self.data.get("routing", {})
|
||||
|
||||
@property
|
||||
def xfail_reason(self) -> str | None:
|
||||
"""If set, the fixture is expected to fail — wrapped with
|
||||
``pytest.mark.xfail`` by the test runner. Used for known
|
||||
not-supported pathological cases (typically PATH OF PAIN bucket).
|
||||
"""
|
||||
return self.data.get("xfail_reason")
|
||||
|
||||
def materialize(self, root: Path) -> None:
|
||||
"""Create the fixture's ``tree`` as empty files/dirs under ``root``."""
|
||||
for entry in self.tree:
|
||||
|
||||
@@ -1,5 +1,10 @@
|
||||
release_name: "Deutschland 83-86-89 (2015) Season 1-3 S01-S03 (1080p BluRay x265 HEVC 10bit AAC 5.1 German Kappa)"
|
||||
|
||||
# Out of SHITTY scope by design: parenthesized tech blocks, group name as
|
||||
# the last bare word inside parens, year-suffix range in title, dual
|
||||
# season expression. PATH OF PAIN handles this via LLM pre-analysis.
|
||||
xfail_reason: "PoP-grade pathological franchise box-set, beyond simple-dict SHITTY"
|
||||
|
||||
# Pathological franchise box-set:
|
||||
# - Title contains year-suffix range "83-86-89" (3 years glued)
|
||||
# - Season range expressed twice: "Season 1-3" AND "S01-S03"
|
||||
|
||||
@@ -1,13 +1,15 @@
|
||||
release_name: "Khruangbin | Austin City Limits Music Festival 2024 | Full Set [V_-7WWPPeBs].webm"
|
||||
|
||||
# yt-dlp slug: UTF-8 wide pipe '|' (U+FF5C, not the ASCII '|'), trailing
|
||||
# YouTube video ID in brackets, .webm extension. Parser extracts the year
|
||||
# (2024) correctly but mistakes the YouTube ID '7WWPPeBs' for a release
|
||||
# group, and the wide pipe survives the tokenizer (not a separator).
|
||||
# YouTube video ID in brackets, .webm extension. The wide pipe survives
|
||||
# the tokenizer (not a separator) but is now dropped at title assembly
|
||||
# (pure-punctuation TITLE tokens carry no content). Year (2024) parses
|
||||
# correctly; the YouTube ID '7WWPPeBs' is still mistaken for a release
|
||||
# group (separate gap, see PoP backlog).
|
||||
# This is a concert recording — closer to "live music" than "movie", but
|
||||
# media_type=movie is the current degenerate best guess.
|
||||
parsed:
|
||||
title: "Khruangbin.|.Austin.City.Limits.Music.Festival"
|
||||
title: "Khruangbin.Austin.City.Limits.Music.Festival"
|
||||
year: 2024
|
||||
season: null
|
||||
episode: null
|
||||
|
||||
+5
@@ -1,5 +1,10 @@
|
||||
release_name: "Predator Badlands 2025 1080p HDRip HEVC x265 BONE"
|
||||
|
||||
# Space-separated release with both codec aliases present (HEVC + x265)
|
||||
# and no dash-before-group. Simple-SHITTY first-wins picks HEVC, expected
|
||||
# was x265 (legacy last-wins). Reclassified PoP.
|
||||
xfail_reason: "Space-separated, dual codec aliases, no dashed group"
|
||||
|
||||
# Space-separated release: tokenizer correctly splits and identifies year +
|
||||
# tech, but the dash-before-group convention is absent so 'BONE' is not
|
||||
# recognized as the group — falls to UNKNOWN. Anti-regression baseline.
|
||||
@@ -1,5 +1,9 @@
|
||||
release_name: "SLEAFORD MODS Live Glastonbury June 27th 2015-niNjHn8abyY.mp4"
|
||||
|
||||
# YouTube-style slug with year-prefixed video-id dash suffix. Not a scene
|
||||
# release shape at all — PATH OF PAIN.
|
||||
xfail_reason: "YouTube slug with year-prefixed video-id, not a scene shape"
|
||||
|
||||
# yt-dlp filename: triple space between band name and event, no canonical
|
||||
# tech markers, dashed YouTube video ID glued to the year, .mp4 extension
|
||||
# preserved in the title. Parser:
|
||||
|
||||
@@ -1,5 +1,10 @@
|
||||
release_name: "Super Mario Bros. le film [FR-EN] (2023).mkv"
|
||||
|
||||
# Bare-dashed language pair interior to the title (``[FR-EN]``) is tagged
|
||||
# as group by ``_detect_group``, leaving the title fragment behind.
|
||||
# Out of simple-SHITTY scope.
|
||||
xfail_reason: "Interior bare-dashed language pair confuses group detection"
|
||||
|
||||
# Hybrid English/French marketing title with:
|
||||
# - Trailing period after 'Bros' that is part of the title abbreviation
|
||||
# (not a separator), but tokenizer treats it as one
|
||||
|
||||
+16
-18
@@ -1,28 +1,26 @@
|
||||
release_name: "The Prodigy World's on Fire 2011 Blu-ray Remux 1080i AVC DTS-HD MA 5.1 - KRaLiMaRKo.mkv"
|
||||
|
||||
# Apocalypse case combining every horror:
|
||||
# - Unescaped apostrophe ("World's") → forces parse_path="ai" fallback
|
||||
# - Spaces AND dashes used as separators inconsistently
|
||||
# - "Blu-ray" with a dash (vs. canonical BluRay)
|
||||
# - "1080i" interlaced flag (not 1080p)
|
||||
# - "DTS-HD MA 5.1" multi-word audio codec
|
||||
# - " - GROUP.mkv" trailing format (space-dash-space before group)
|
||||
# Apocalypse case combining every horror — partially tamed by the
|
||||
# apostrophe fix. Remaining gaps (still PoP-worthy):
|
||||
# - "1080i" interlaced flag (not in quality KB)
|
||||
# - "Blu-ray" with a dash (vs. canonical BluRay) — recognized as source
|
||||
# but with the dash form
|
||||
# - "DTS-HD MA 5.1" multi-word audio codec — the trailing "HD" leaks
|
||||
# into the group
|
||||
# - Trailing .mkv extension survives in title
|
||||
# Result: total degeneration — UNKNOWN across the board, title=raw input.
|
||||
# Once the apostrophe + multi-word-audio + 1080i are handled this fixture
|
||||
# should be revisited. For now: anti-regression of the failure shape.
|
||||
# - " - GROUP" trailing format (space-dash-space before group)
|
||||
parsed:
|
||||
title: "The Prodigy World's on Fire 2011 Blu-ray Remux 1080i AVC DTS-HD MA 5.1 - KRaLiMaRKo.mkv"
|
||||
year: null
|
||||
title: "The.Prodigy.Worlds.on.Fire"
|
||||
year: 2011
|
||||
season: null
|
||||
episode: null
|
||||
quality: null
|
||||
source: null
|
||||
codec: null
|
||||
group: "UNKNOWN"
|
||||
tech_string: ""
|
||||
media_type: "unknown"
|
||||
parse_path: "ai"
|
||||
source: "Blu-ray"
|
||||
codec: "AVC"
|
||||
group: "HD"
|
||||
tech_string: "Blu-ray.AVC"
|
||||
media_type: "movie"
|
||||
parse_path: "sanitized"
|
||||
is_season_pack: false
|
||||
|
||||
tree:
|
||||
|
||||
@@ -1,14 +1,13 @@
|
||||
release_name: "Archer.S14E09E10E11.1080p.WEB.h264-ETHEL"
|
||||
|
||||
# Tech debt: triple-episode chain (E09E10E11) — current parser captures
|
||||
# episode=9 and episode_end=10, but E11 is lost. Anti-regression: lock in
|
||||
# the partial behavior so any future improvement is intentional.
|
||||
# Triple-episode chain (E09E10E11) — the parser collapses the chain to a
|
||||
# range (episode=first, episode_end=last). Intermediate values are implied.
|
||||
parsed:
|
||||
title: "Archer"
|
||||
year: null
|
||||
season: 14
|
||||
episode: 9
|
||||
episode_end: 10
|
||||
episode_end: 11
|
||||
quality: "1080p"
|
||||
source: "WEB"
|
||||
codec: "h264"
|
||||
|
||||
+14
-13
@@ -1,21 +1,22 @@
|
||||
release_name: "Honey.Don't.2025.2160p.WEBRip.DSNP.DV.HDR.x265.EAC3.5.1-Amen"
|
||||
|
||||
# Tech debt: the unescaped apostrophe in "Don't" pushes the whole release
|
||||
# through the AI fallback path (parse_path="ai") and the parse degenerates to
|
||||
# UNKNOWN across the board. Anti-regression here — once the tokenizer learns
|
||||
# to handle apostrophes, this fixture should be revisited.
|
||||
# Apostrophes inside titles ("Don't", "L'avare") used to push the release
|
||||
# through the AI fallback (parse_path="ai", everything UNKNOWN). They are
|
||||
# now pre-stripped before well-formed check and tokenize, so the parse
|
||||
# completes normally — only the title text loses its apostrophe
|
||||
# ("Honey.Dont").
|
||||
parsed:
|
||||
title: "Honey.Don't.2025.2160p.WEBRip.DSNP.DV.HDR.x265.EAC3.5.1-Amen"
|
||||
year: null
|
||||
title: "Honey.Dont"
|
||||
year: 2025
|
||||
season: null
|
||||
episode: null
|
||||
quality: null
|
||||
source: null
|
||||
codec: null
|
||||
group: "UNKNOWN"
|
||||
tech_string: ""
|
||||
media_type: "unknown"
|
||||
parse_path: "ai"
|
||||
quality: "2160p"
|
||||
source: "WEBRip"
|
||||
codec: "x265"
|
||||
group: "Amen"
|
||||
tech_string: "2160p.WEBRip.x265"
|
||||
media_type: "movie"
|
||||
parse_path: "sanitized"
|
||||
is_season_pack: false
|
||||
|
||||
tree:
|
||||
|
||||
@@ -1,7 +1,8 @@
|
||||
release_name: "Notre.planete.s01e01.1080p.NF.WEB-DL.DDP5.1.x264-NTb"
|
||||
|
||||
# Lowercase 's01e01' and lowercased title word ('planete') correctly parsed.
|
||||
# NF (Netflix) source tag is not in the source KB — drops; WEB-DL wins.
|
||||
# NF is the Netflix streaming distributor (separate dimension from source);
|
||||
# WEB-DL is the encoding source.
|
||||
parsed:
|
||||
title: "Notre.planete"
|
||||
year: null
|
||||
@@ -11,6 +12,7 @@ parsed:
|
||||
source: "WEB-DL"
|
||||
codec: "x264"
|
||||
group: "NTb"
|
||||
distributor: "NF"
|
||||
tech_string: "1080p.WEB-DL.x264"
|
||||
media_type: "tv_show"
|
||||
parse_path: "direct"
|
||||
|
||||
+7
-7
@@ -1,22 +1,22 @@
|
||||
release_name: "Der.Tatortreiniger.S01-06.GERMAN.1080p.WEB.x264-WAYNE"
|
||||
|
||||
# Tech debt: range syntax 'S01-06' is not recognized as TV — falls through
|
||||
# to media_type=movie with the range glued onto the title. Captured here so a
|
||||
# future ranger-aware parser change is intentional.
|
||||
# Range syntax 'S01-06' is now recognized as a season-range marker:
|
||||
# season=1 (first of the range), media_type=tv_complete, and the token
|
||||
# no longer leaks into the title.
|
||||
parsed:
|
||||
title: "Der.Tatortreiniger.S01-06"
|
||||
title: "Der.Tatortreiniger"
|
||||
year: null
|
||||
season: null
|
||||
season: 1
|
||||
episode: null
|
||||
quality: "1080p"
|
||||
source: "WEB"
|
||||
codec: "x264"
|
||||
group: "WAYNE"
|
||||
tech_string: "1080p.WEB.x264"
|
||||
media_type: "movie"
|
||||
media_type: "tv_complete"
|
||||
languages: ["GERMAN"]
|
||||
parse_path: "direct"
|
||||
is_season_pack: false
|
||||
is_season_pack: true
|
||||
|
||||
tree:
|
||||
- "Der.Tatortreiniger.S01-06.GERMAN.1080p.WEB.x264-WAYNE/"
|
||||
|
||||
@@ -1,11 +1,12 @@
|
||||
release_name: "Vinyl - 1x01 - FHD"
|
||||
|
||||
# Tech debt: surrounding ' - ' separators leave a stray '-' token attached
|
||||
# to the title ("Vinyl.-"). NxNN form correctly identifies S01E01; everything
|
||||
# tech-side empty (no quality token in KB — "FHD" not yet known). Anti-regression
|
||||
# the current degenerate title so a future fix is intentional.
|
||||
# Surrounding ' - ' separators in human-friendly release names left stray
|
||||
# '-' tokens attached to the title. They are now dropped at assembly time
|
||||
# (pure-punctuation TITLE tokens carry no content). NxNN form correctly
|
||||
# identifies S01E01; tech-side stays empty (no quality token in KB — "FHD"
|
||||
# not yet known).
|
||||
parsed:
|
||||
title: "Vinyl.-"
|
||||
title: "Vinyl"
|
||||
year: null
|
||||
season: 1
|
||||
episode: 1
|
||||
|
||||
@@ -0,0 +1,155 @@
|
||||
"""Tests for :class:`FfprobeMediaProber`.
|
||||
|
||||
Covers the full-probe path (``probe()`` returning a ``MediaInfo``) by
|
||||
patching ``subprocess.run`` at the adapter module level. The
|
||||
subtitle-streams path is exercised by the subtitle domain tests via
|
||||
the same adapter.
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import json
|
||||
import subprocess
|
||||
from unittest.mock import MagicMock, patch
|
||||
|
||||
from alfred.infrastructure.probe import FfprobeMediaProber
|
||||
|
||||
_PROBER = FfprobeMediaProber()
|
||||
_PATCH_TARGET = "alfred.infrastructure.probe.ffprobe_prober.subprocess.run"
|
||||
|
||||
|
||||
def _ffprobe_result(returncode=0, stdout="{}", stderr="") -> MagicMock:
|
||||
return MagicMock(returncode=returncode, stdout=stdout, stderr=stderr)
|
||||
|
||||
|
||||
class TestProbe:
|
||||
def test_timeout_returns_none(self, tmp_path):
|
||||
f = tmp_path / "x.mkv"
|
||||
f.write_bytes(b"")
|
||||
with patch(
|
||||
_PATCH_TARGET,
|
||||
side_effect=subprocess.TimeoutExpired(cmd="ffprobe", timeout=30),
|
||||
):
|
||||
assert _PROBER.probe(f) is None
|
||||
|
||||
def test_nonzero_returncode_returns_none(self, tmp_path):
|
||||
f = tmp_path / "x.mkv"
|
||||
f.write_bytes(b"")
|
||||
with patch(
|
||||
_PATCH_TARGET,
|
||||
return_value=_ffprobe_result(returncode=1, stderr="not a media file"),
|
||||
):
|
||||
assert _PROBER.probe(f) is None
|
||||
|
||||
def test_invalid_json_returns_none(self, tmp_path):
|
||||
f = tmp_path / "x.mkv"
|
||||
f.write_bytes(b"")
|
||||
with patch(
|
||||
_PATCH_TARGET,
|
||||
return_value=_ffprobe_result(stdout="not json {"),
|
||||
):
|
||||
assert _PROBER.probe(f) is None
|
||||
|
||||
def test_parses_format_duration_and_bitrate(self, tmp_path):
|
||||
f = tmp_path / "x.mkv"
|
||||
f.write_bytes(b"")
|
||||
payload = {
|
||||
"format": {"duration": "1234.5", "bit_rate": "5000000"},
|
||||
"streams": [],
|
||||
}
|
||||
with patch(
|
||||
_PATCH_TARGET,
|
||||
return_value=_ffprobe_result(stdout=json.dumps(payload)),
|
||||
):
|
||||
info = _PROBER.probe(f)
|
||||
assert info is not None
|
||||
assert info.duration_seconds == 1234.5
|
||||
assert info.bitrate_kbps == 5000 # bit_rate // 1000
|
||||
|
||||
def test_invalid_numeric_format_fields_skipped(self, tmp_path):
|
||||
f = tmp_path / "x.mkv"
|
||||
f.write_bytes(b"")
|
||||
payload = {
|
||||
"format": {"duration": "garbage", "bit_rate": "also-bad"},
|
||||
"streams": [],
|
||||
}
|
||||
with patch(
|
||||
_PATCH_TARGET,
|
||||
return_value=_ffprobe_result(stdout=json.dumps(payload)),
|
||||
):
|
||||
info = _PROBER.probe(f)
|
||||
assert info is not None
|
||||
assert info.duration_seconds is None
|
||||
assert info.bitrate_kbps is None
|
||||
|
||||
def test_parses_streams(self, tmp_path):
|
||||
f = tmp_path / "x.mkv"
|
||||
f.write_bytes(b"")
|
||||
payload = {
|
||||
"format": {},
|
||||
"streams": [
|
||||
{
|
||||
"index": 0,
|
||||
"codec_type": "video",
|
||||
"codec_name": "h264",
|
||||
"width": 1920,
|
||||
"height": 1080,
|
||||
},
|
||||
{
|
||||
"index": 1,
|
||||
"codec_type": "audio",
|
||||
"codec_name": "ac3",
|
||||
"channels": 6,
|
||||
"channel_layout": "5.1",
|
||||
"tags": {"language": "eng"},
|
||||
"disposition": {"default": 1},
|
||||
},
|
||||
{
|
||||
"index": 2,
|
||||
"codec_type": "audio",
|
||||
"codec_name": "aac",
|
||||
"channels": 2,
|
||||
"tags": {"language": "fra"},
|
||||
},
|
||||
{
|
||||
"index": 3,
|
||||
"codec_type": "subtitle",
|
||||
"codec_name": "subrip",
|
||||
"tags": {"language": "fra"},
|
||||
"disposition": {"forced": 1},
|
||||
},
|
||||
],
|
||||
}
|
||||
with patch(
|
||||
_PATCH_TARGET,
|
||||
return_value=_ffprobe_result(stdout=json.dumps(payload)),
|
||||
):
|
||||
info = _PROBER.probe(f)
|
||||
assert info.video_codec == "h264"
|
||||
assert info.width == 1920 and info.height == 1080
|
||||
assert len(info.audio_tracks) == 2
|
||||
eng = info.audio_tracks[0]
|
||||
assert eng.language == "eng"
|
||||
assert eng.is_default is True
|
||||
assert info.audio_tracks[1].is_default is False
|
||||
assert len(info.subtitle_tracks) == 1
|
||||
assert info.subtitle_tracks[0].is_forced is True
|
||||
|
||||
def test_first_video_stream_wins(self, tmp_path):
|
||||
# The implementation only fills video_codec on the FIRST video stream.
|
||||
f = tmp_path / "x.mkv"
|
||||
f.write_bytes(b"")
|
||||
payload = {
|
||||
"format": {},
|
||||
"streams": [
|
||||
{"codec_type": "video", "codec_name": "h264", "width": 1920},
|
||||
{"codec_type": "video", "codec_name": "hevc", "width": 3840},
|
||||
],
|
||||
}
|
||||
with patch(
|
||||
_PATCH_TARGET,
|
||||
return_value=_ffprobe_result(stdout=json.dumps(payload)),
|
||||
):
|
||||
info = _PROBER.probe(f)
|
||||
assert info.video_codec == "h264"
|
||||
assert info.width == 1920
|
||||
@@ -1,21 +1,19 @@
|
||||
"""Tests for the smaller ``alfred.infrastructure.filesystem`` helpers.
|
||||
|
||||
Covers four siblings of ``FileManager`` that had near-zero coverage:
|
||||
Covers three siblings of ``FileManager`` that had near-zero coverage:
|
||||
|
||||
- ``ffprobe.probe`` — wraps ``ffprobe`` JSON output into a ``MediaInfo``.
|
||||
- ``filesystem_operations.create_folder`` / ``move`` — thin
|
||||
``mkdir`` / ``mv`` wrappers returning dict-shaped responses.
|
||||
- ``organizer.MediaOrganizer`` — computes destination paths for movies
|
||||
and TV episodes; creates folders for them.
|
||||
- ``find_video.find_video_file`` — first-video lookup in a folder.
|
||||
|
||||
External commands (``ffprobe`` / ``mv``) are patched via ``subprocess.run``.
|
||||
(``ffprobe`` coverage now lives in ``test_ffprobe_prober.py`` alongside
|
||||
its adapter.)
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import json
|
||||
import subprocess
|
||||
from unittest.mock import MagicMock, patch
|
||||
|
||||
from alfred.domain.movies.entities import Movie
|
||||
@@ -27,7 +25,6 @@ from alfred.domain.tv_shows.value_objects import (
|
||||
SeasonNumber,
|
||||
ShowStatus,
|
||||
)
|
||||
from alfred.infrastructure.filesystem import ffprobe
|
||||
from alfred.infrastructure.filesystem.filesystem_operations import (
|
||||
create_folder,
|
||||
move,
|
||||
@@ -38,147 +35,6 @@ from alfred.infrastructure.knowledge.release_kb import YamlReleaseKnowledge
|
||||
|
||||
_KB = YamlReleaseKnowledge()
|
||||
|
||||
# --------------------------------------------------------------------------- #
|
||||
# ffprobe.probe #
|
||||
# --------------------------------------------------------------------------- #
|
||||
|
||||
|
||||
def _ffprobe_result(returncode=0, stdout="{}", stderr="") -> MagicMock:
|
||||
return MagicMock(returncode=returncode, stdout=stdout, stderr=stderr)
|
||||
|
||||
|
||||
class TestFfprobe:
|
||||
def test_timeout_returns_none(self, tmp_path):
|
||||
f = tmp_path / "x.mkv"
|
||||
f.write_bytes(b"")
|
||||
with patch(
|
||||
"alfred.infrastructure.filesystem.ffprobe.subprocess.run",
|
||||
side_effect=subprocess.TimeoutExpired(cmd="ffprobe", timeout=30),
|
||||
):
|
||||
assert ffprobe.probe(f) is None
|
||||
|
||||
def test_nonzero_returncode_returns_none(self, tmp_path):
|
||||
f = tmp_path / "x.mkv"
|
||||
f.write_bytes(b"")
|
||||
with patch(
|
||||
"alfred.infrastructure.filesystem.ffprobe.subprocess.run",
|
||||
return_value=_ffprobe_result(returncode=1, stderr="not a media file"),
|
||||
):
|
||||
assert ffprobe.probe(f) is None
|
||||
|
||||
def test_invalid_json_returns_none(self, tmp_path):
|
||||
f = tmp_path / "x.mkv"
|
||||
f.write_bytes(b"")
|
||||
with patch(
|
||||
"alfred.infrastructure.filesystem.ffprobe.subprocess.run",
|
||||
return_value=_ffprobe_result(stdout="not json {"),
|
||||
):
|
||||
assert ffprobe.probe(f) is None
|
||||
|
||||
def test_parses_format_duration_and_bitrate(self, tmp_path):
|
||||
f = tmp_path / "x.mkv"
|
||||
f.write_bytes(b"")
|
||||
payload = {
|
||||
"format": {"duration": "1234.5", "bit_rate": "5000000"},
|
||||
"streams": [],
|
||||
}
|
||||
with patch(
|
||||
"alfred.infrastructure.filesystem.ffprobe.subprocess.run",
|
||||
return_value=_ffprobe_result(stdout=json.dumps(payload)),
|
||||
):
|
||||
info = ffprobe.probe(f)
|
||||
assert info is not None
|
||||
assert info.duration_seconds == 1234.5
|
||||
assert info.bitrate_kbps == 5000 # bit_rate // 1000
|
||||
|
||||
def test_invalid_numeric_format_fields_skipped(self, tmp_path):
|
||||
f = tmp_path / "x.mkv"
|
||||
f.write_bytes(b"")
|
||||
payload = {
|
||||
"format": {"duration": "garbage", "bit_rate": "also-bad"},
|
||||
"streams": [],
|
||||
}
|
||||
with patch(
|
||||
"alfred.infrastructure.filesystem.ffprobe.subprocess.run",
|
||||
return_value=_ffprobe_result(stdout=json.dumps(payload)),
|
||||
):
|
||||
info = ffprobe.probe(f)
|
||||
assert info is not None
|
||||
assert info.duration_seconds is None
|
||||
assert info.bitrate_kbps is None
|
||||
|
||||
def test_parses_streams(self, tmp_path):
|
||||
f = tmp_path / "x.mkv"
|
||||
f.write_bytes(b"")
|
||||
payload = {
|
||||
"format": {},
|
||||
"streams": [
|
||||
{
|
||||
"index": 0,
|
||||
"codec_type": "video",
|
||||
"codec_name": "h264",
|
||||
"width": 1920,
|
||||
"height": 1080,
|
||||
},
|
||||
{
|
||||
"index": 1,
|
||||
"codec_type": "audio",
|
||||
"codec_name": "ac3",
|
||||
"channels": 6,
|
||||
"channel_layout": "5.1",
|
||||
"tags": {"language": "eng"},
|
||||
"disposition": {"default": 1},
|
||||
},
|
||||
{
|
||||
"index": 2,
|
||||
"codec_type": "audio",
|
||||
"codec_name": "aac",
|
||||
"channels": 2,
|
||||
"tags": {"language": "fra"},
|
||||
},
|
||||
{
|
||||
"index": 3,
|
||||
"codec_type": "subtitle",
|
||||
"codec_name": "subrip",
|
||||
"tags": {"language": "fra"},
|
||||
"disposition": {"forced": 1},
|
||||
},
|
||||
],
|
||||
}
|
||||
with patch(
|
||||
"alfred.infrastructure.filesystem.ffprobe.subprocess.run",
|
||||
return_value=_ffprobe_result(stdout=json.dumps(payload)),
|
||||
):
|
||||
info = ffprobe.probe(f)
|
||||
assert info.video_codec == "h264"
|
||||
assert info.width == 1920 and info.height == 1080
|
||||
assert len(info.audio_tracks) == 2
|
||||
eng = info.audio_tracks[0]
|
||||
assert eng.language == "eng"
|
||||
assert eng.is_default is True
|
||||
assert info.audio_tracks[1].is_default is False
|
||||
assert len(info.subtitle_tracks) == 1
|
||||
assert info.subtitle_tracks[0].is_forced is True
|
||||
|
||||
def test_first_video_stream_wins(self, tmp_path):
|
||||
# The implementation only fills video_codec on the FIRST video stream.
|
||||
f = tmp_path / "x.mkv"
|
||||
f.write_bytes(b"")
|
||||
payload = {
|
||||
"format": {},
|
||||
"streams": [
|
||||
{"codec_type": "video", "codec_name": "h264", "width": 1920},
|
||||
{"codec_type": "video", "codec_name": "hevc", "width": 3840},
|
||||
],
|
||||
}
|
||||
with patch(
|
||||
"alfred.infrastructure.filesystem.ffprobe.subprocess.run",
|
||||
return_value=_ffprobe_result(stdout=json.dumps(payload)),
|
||||
):
|
||||
info = ffprobe.probe(f)
|
||||
assert info.video_codec == "h264"
|
||||
assert info.width == 1920
|
||||
|
||||
|
||||
# --------------------------------------------------------------------------- #
|
||||
# filesystem_operations #
|
||||
|
||||
@@ -0,0 +1,82 @@
|
||||
"""Tests for ``LanguageRegistry`` — the YAML-backed adapter for the
|
||||
:class:`alfred.domain.shared.ports.LanguageRepository` port.
|
||||
|
||||
The port is structural (Protocol), so the assertion that the adapter
|
||||
satisfies it is a static one — we exercise the public surface here and
|
||||
let mypy / runtime polymorphism do the rest.
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
from alfred.domain.shared.ports import LanguageRepository
|
||||
from alfred.domain.shared.value_objects import Language
|
||||
from alfred.infrastructure.knowledge.language_registry import LanguageRegistry
|
||||
|
||||
|
||||
def _registry() -> LanguageRepository:
|
||||
"""Return a fresh registry typed as the port — proves structural fit."""
|
||||
return LanguageRegistry()
|
||||
|
||||
|
||||
class TestPortSurface:
|
||||
def test_satisfies_protocol(self):
|
||||
# If LanguageRegistry diverged from LanguageRepository, the annotation
|
||||
# below would already be wrong at type-check time; at runtime, this
|
||||
# just confirms the methods exist.
|
||||
reg: LanguageRepository = LanguageRegistry()
|
||||
assert hasattr(reg, "from_iso")
|
||||
assert hasattr(reg, "from_any")
|
||||
assert hasattr(reg, "all")
|
||||
|
||||
def test_len_reflects_loaded_entries(self):
|
||||
reg = _registry()
|
||||
# The builtin YAML ships dozens of languages — exact count drifts
|
||||
# with knowledge updates, so just sanity-check it's non-empty.
|
||||
assert len(reg) > 0
|
||||
|
||||
|
||||
class TestFromIso:
|
||||
def test_known_iso_returns_language(self):
|
||||
reg = _registry()
|
||||
fre = reg.from_iso("fre")
|
||||
assert isinstance(fre, Language)
|
||||
assert fre.iso == "fre"
|
||||
|
||||
def test_case_insensitive(self):
|
||||
reg = _registry()
|
||||
assert reg.from_iso("FRE") == reg.from_iso("fre")
|
||||
|
||||
def test_unknown_iso_returns_none(self):
|
||||
assert _registry().from_iso("zzz") is None
|
||||
|
||||
def test_non_string_returns_none(self):
|
||||
assert _registry().from_iso(None) is None # type: ignore[arg-type]
|
||||
|
||||
|
||||
class TestFromAny:
|
||||
def test_english_name(self):
|
||||
reg = _registry()
|
||||
lang = reg.from_any("French")
|
||||
assert lang is not None
|
||||
assert lang.iso == "fre"
|
||||
|
||||
def test_iso_639_1_alias(self):
|
||||
# "fr" is the 639-1 form, registered as an alias.
|
||||
reg = _registry()
|
||||
lang = reg.from_any("fr")
|
||||
assert lang is not None
|
||||
assert lang.iso == "fre"
|
||||
|
||||
def test_unknown_returns_none(self):
|
||||
assert _registry().from_any("vostfr") is None
|
||||
|
||||
def test_non_string_returns_none(self):
|
||||
assert _registry().from_any(123) is None # type: ignore[arg-type]
|
||||
|
||||
|
||||
class TestMembership:
|
||||
def test_contains_known(self):
|
||||
assert "english" in _registry()
|
||||
|
||||
def test_does_not_contain_unknown(self):
|
||||
assert "klingon" not in _registry()
|
||||
Reference in New Issue
Block a user