Commit Graph

28 Commits

Author SHA1 Message Date
francwa 621bb96995 fix(release/parser): pre-strip apostrophes so titles like Don't parse cleanly
Apostrophes are in the forbidden-chars list, which made any release
with a title like "Don't" or "L'avare" short-circuit to the AI
fallback (parse_path=ai, everything UNKNOWN). They are now stripped
up front from the name before the well-formed check and tokenize,
so the parse completes normally. The raw name is preserved on the
VO; only the title field loses its apostrophe.

parse_path becomes 'sanitized' when an apostrophe was stripped, to
surface that the parser cleaned something up.

Fixtures updated:
- shitty/honey_uhd_hdr/ — went from total UNKNOWN to a clean parse
  (title=Honey.Dont, year=2025, quality=2160p, source=WEBRip,
  codec=x265, group=Amen).
- path_of_pain/the_prodigy_full_chaos/ — went from total failure to
  partial success (title, year, source, codec extracted). Remaining
  gaps (1080i, multi-word audio, Blu-ray-with-dash) are tracked
  separately in tech debt.
2026-05-20 23:29:10 +02:00
francwa 448ef3b79c fix(release/parser): recognize Sxx-yy season range as tv_complete
`Der.Tatortreiniger.S01-06.GERMAN...` previously parsed as a movie
with 'S01-06' glued to the title. The parser now matches the
season-range form in _parse_season_episode (returning season=first,
episode=None), and the assemble step detects the range token to
promote media_type to 'tv_complete'.

The first season is exposed as `season` so `is_season_pack`
fires (season is not None and episode is None) — useful for routing
to a series root folder.

Fixture shitty/tatortreiniger_flat_multiseason/ updated:
- title: Der.Tatortreiniger.S01-06 → Der.Tatortreiniger
- season: null → 1
- media_type: movie → tv_complete
- is_season_pack: false → true
2026-05-20 23:26:40 +02:00
francwa b1c7f35ffb fix(release/parser): drop pure-punctuation TITLE tokens at assembly
Releases using ' - ' as a separator (Vinyl - 1x01 - FHD) tokenize to
['Vinyl', '-', '1x01', '-', 'FHD'] — the standalone '-' tokens were
ending up in title_parts and leaked into the joined title
('Vinyl.-'). We can't add '-' to the separator list (it would break
codec-GROUP), so we filter at assembly: a TITLE token with no
alphanumeric characters carries no title content.

Side win: same logic eliminates the UTF-8 wide-pipe '|' from the
khruangbin_yt_wide_pipe fixture title.

Fixtures updated:
- shitty/vinyl_1x01_format/expected.yaml (title: Vinyl.- → Vinyl)
- path_of_pain/khruangbin_yt_wide_pipe/expected.yaml (| dropped)
2026-05-20 23:24:40 +02:00
francwa 5bbdc9081f fix(release/parser): collapse chained multi-episode markers to full range
S14E09E10E11 previously parsed to episode=9, episode_end=10 — E11
was silently dropped. The parser now takes episodes[-1] as
episode_end so the full chain is captured (episode=9, episode_end=11).
Intermediate values stay implied.

Fixture shitty/archer_multi_episode/ updated from anti-regression of
the bug to anti-regression of the fix.
2026-05-20 23:23:08 +02:00
francwa 18267d0165 refactor(language): LanguageRepository port + SubtitleKnowledgeBase wired to it
Mirror the MediaProber / FilesystemScanner pattern for language lookup:

- New Protocol `LanguageRepository` in alfred.domain.shared.ports
  covering from_iso, from_any, all, __contains__, __len__ — the
  surface previously coupled to the concrete LanguageRegistry.
- SubtitleKnowledgeBase types its `language_registry` parameter
  against the Protocol; the concrete LanguageRegistry stays in
  infrastructure as the YAML-backed adapter and remains the default
  when no repository is injected.
- New unit tests in tests/infrastructure/test_language_registry.py
  cover the adapter surface (from_iso, from_any, membership,
  case-insensitivity, non-string inputs).

Behaviour is unchanged for existing callers. The split opens the
door to in-memory fakes in future tests without loading the full
ISO 639 YAML.
2026-05-20 23:18:25 +02:00
francwa a0d1846ff2 refactor(release): move detect_media_type & enrich_from_probe to application/release
Both helpers are inspection-pipeline pieces, not filesystem use cases —
they belong next to inspect_release, not next to move_media /
resolve_destination / list_folder.

The move also kills the lazy import that was hiding inside
_resolve_parsed: alfred.application.filesystem.resolve_destination
no longer triggers a cycle through alfred.application.filesystem
__init__ when loading inspect_release. Top-level import restored.

Call sites updated: inspect.py, test_detect_media_type.py,
test_enrich_from_probe.py, testing/recognize_folders_in_downloads.py.
Module docstrings + test-file docstrings updated to match the new
location.
2026-05-20 09:29:58 +02:00
francwa 0fb59a4581 feat(filesystem): wire inspect_release into resolve_destination
The four resolve_*_destination use cases now route through a private
_resolve_parsed helper that picks the right entry point:

  - source path provided AND it exists -> inspect_release(name, path)
    runs the full pipeline (parse + media-type refinement + probe
    + enrich), so missing tech tokens (quality, codec, ...) get
    filled by ffprobe and the refreshed tech_string lands in the
    destination folder / file names.

  - source path missing or absent       -> parse_release(name) only,
    same behavior as before. Back-compat: tests using fake /dl/*.mkv
    paths still pass unchanged.

resolve_episode_destination / resolve_movie_destination reuse their
existing source_file parameter as the inspection target. The two
folder-move use cases (season / series) gain a new OPTIONAL
source_path parameter — threaded through the agent tool wrappers
and documented in the YAML specs.

The lazy import inside _resolve_parsed avoids a circular import:
inspect_release imports detect_media_type / enrich_from_probe from
the same application.filesystem package whose __init__ re-exports
resolve_destination.

Three new tests in TestProbeEnrichmentWiring with a stub MediaProber
prove the wiring: movie picks up probe quality, season picks it up
via source_path, and a missing path correctly skips probe (back-compat
guard).
2026-05-20 09:26:30 +02:00
francwa e79ca462b8 fix(release): refresh tech_string after enrich_from_probe
enrich_from_probe fills None fields on ParsedRelease (quality, source,
codec, audio_*, languages) but left tech_string at its parser-time
value — so the filename builders (movie_folder_name, episode_filename,
…) saw stale tech tokens even after a successful probe.

Re-derive tech_string the same way the parser does — quality.source.codec
joined by dots, skipping None — at the end of enrich_from_probe. Token-
level values still win because enrich only fills None fields.

Four new tests in TestTechString cover: enrichment rebuilds it,
existing source survives, no-info input leaves it untouched, fully
empty parsed produces ''.
2026-05-20 09:26:09 +02:00
francwa 03aa844d7d feat(release): inspect_release orchestrator + InspectedResult VO
New application-layer entry point that composes the four inspection
layers in one call:

  1. parse_release(name, kb)              -> (ParsedRelease, ParseReport)
  2. detect_media_type(parsed, path, kb)  -> patch parsed.media_type
  3. find_main_video(path, kb)            -> Path | None (top-level scan)
  4. prober.probe(video) + enrich         -> when video exists and
                                             media_type not in
                                             {unknown, other}

Returns a frozen InspectedResult(parsed, report, source_path,
main_video, media_info, probe_used). kb and prober are injected — no
module-level singletons in inspect.py.

analyze_release tool now delegates to inspect_release; its output
gains two fields, confidence (0-100) and road (easy/shitty/path_of_pain),
surfaced from ParseReport so the LLM can route by confidence. Spec
updated to document them.

12 new tests covering happy paths, probe gating (no video, media_type
'other', probe failure), mutation contract (detect refining
parsed.media_type, enrich filling None fields), resilience
(nonexistent path), and frozen contract. Suite: 1058 passing.
2026-05-20 09:15:29 +02:00
francwa c303efea48 refactor(probe): consolidate full probe() into MediaProber port
Add probe(video) -> MediaInfo | None to the MediaProber Protocol and
implement it on FfprobeMediaProber. The standalone
alfred/infrastructure/filesystem/ffprobe.py module is removed; all
callers (analyze_release / probe_media tools, testing scripts) now go
through the adapter.

Tests for the probe path moved to tests/infrastructure/test_ffprobe_prober.py
(patching subprocess.run at the adapter module level).

Unblocks the upcoming inspect_release orchestrator, which needs the
port — not a free function — to compose parse + main-video selection
+ probe in one shot.
2026-05-20 09:11:24 +02:00
francwa 12dc796ea2 docs(changelog): freeze confidence scoring + exclusion work block 2026-05-20 08:47:29 +02:00
francwa 9ddd85929e feat(release): pre-pipeline exclusion helpers
Add the application-layer helpers that decide which files are worth
parsing, sitting one notch above parse_release.

- is_supported_video(path, kb): extension-only check against
  kb.video_extensions. Lowercased suffix lookup. Directories and
  broken symlinks return False.
- find_main_video(folder, kb): top-level scan only (no recursion into
  subdirectories — releases that wrap their video in Sample/ are
  PATH_OF_PAIN territory). Lexicographically-first eligible file wins
  when several qualify (deterministic, no size-based ranking). A bare
  file as folder argument is supported for single-file releases.

No size threshold and no filename heuristics ('sample' / 'trailer'):
the parser's job is to extract structure, not to second-guess
non-standard release shapes. PoP catches the rest.

17 tests under tests/application/test_supported_media.py.
2026-05-20 01:34:32 +02:00
francwa ed7680b58f docs(changelog): log parse-confidence scoring + ParseReport tuple 2026-05-20 01:21:47 +02:00
francwa 629387591f docs(changelog): freeze release parser v2 work block (2026-05-20) 2026-05-20 01:08:17 +02:00
francwa 230a7ab88a docs(changelog): log SHITTY simplification + distributor split 2026-05-20 01:03:52 +02:00
francwa 7dc7f0c241 feat(release): v2 enricher pass for audio/video-meta/edition/language
The EASY pipeline now extracts the full ParsedRelease surface from
known-group releases, not just the structural backbone. Behavior is
unchanged for releases that don't carry these tokens.

Pipeline (parser/pipeline.py):
- Structural walk (renamed _annotate_structural): no longer requires
  body to be fully consumed. Tokens passed over between schema chunks
  remain UNKNOWN so the enricher pass can claim them.
- _find_chunk(): scans forward in the body for the next token matching
  a given role, skipping already-annotated tokens. Lets optional and
  mandatory chunks both tolerate intercalated enricher tokens.
- _annotate_enrichers(): new non-positional pass. Walks UNKNOWN tokens
  and tags AUDIO_CODEC / AUDIO_CHANNELS / BIT_DEPTH / HDR / EDITION /
  LANGUAGE. Multi-token sequences from kb.audio / kb.video_meta /
  kb.editions are matched first (longest-first ordering preserved from
  the YAML), single tokens after.
- _apply_sequences(): mutates the token list, tagging the first token
  of a matched sequence with extra['sequence']=<canonical value> and
  trailing members with extra['sequence_member']='True' so assemble
  skips them.
- _detect_channel_pairs(): handles the '5.1' / '7.1' case where the
  '.' separator splits the layout into two tokens. Strips a trailing
  '-GROUP' suffix on the second before joining.

Assemble:
- New fields populated: languages (list), audio_codec, audio_channels,
  bit_depth, hdr_format, edition. Each role-handler skips
  sequence_member tokens.
- media_type heuristic extended: edition in {COMPLETE, INTEGRALE,
  COLLECTION} + no season → tv_complete (mirrors legacy).

Tests:
- 4 new TestEnrichers cases covering bit_depth+audio_codec+channels,
  HDR sequence + edition sequence + TrueHD.Atmos + 7.1, multi-language
  with DTS-HD.MA sequence, TV episode with single language.
- All 14 v2 tests + 30 fixture tests still green. Suite: 1011 passed,
  8 skipped.

Refs: project_release_parser_v2_specs (memory)
2026-05-20 00:26:05 +02:00
francwa 075a827b0e feat(release): wire v2 EASY path for known release groups
The annotate-based v2 pipeline now handles releases ending in -KONTRAST,
-ELiTE, or -RARBG. Unknown groups still fall through to the legacy
SHITTY heuristic in services.py — nothing changes for them.

Pipeline (alfred/domain/release/parser/pipeline.py):
- tokenize(): string-ops separator split, strips [site.tag] first.
- annotate(): right-to-left group detection (priority to codec-GROUP
  shape, fallback to any non-source dashed token), GroupSchema lookup
  via the kb port, then lockstep walk of tokens against schema chunks.
  Optional chunks skip on mismatch, mandatory mismatches return None so
  the caller falls back gracefully. CODEC pre-consumed by a codec-GROUP
  trailing token correctly skips the CODEC chunk in the body walk.
- assemble(): folds annotated tokens into a ParsedRelease-compatible
  dict (title joined by '.', group from the codec-GROUP token's extras).

Schema (alfred/domain/release/parser/schema.py):
- GroupSchema + SchemaChunk frozen value objects.
- TokenRole.GROUP added.

Port + adapter:
- ReleaseKnowledge.group_schema(name) lookup added (case-insensitive).
- YamlReleaseKnowledge loads alfred/knowledge/release/release_groups/
  *.yaml at construction time; learned overrides in
  data/knowledge/release/release_groups/ also picked up.

Knowledge:
- release_groups/kontrast.yaml, elite.yaml, rarbg.yaml declare the
  canonical chunk_order. ELiTE marks source as optional (Foundation.S02
  has no WEBRip token).

Services:
- parse_release tries the v2 path first; on None falls through to the
  legacy implementation untouched.

Tests:
- tests/domain/release/test_parser_v2_easy.py (10 cases) cover group
  detection (codec-GROUP, dashed-source skip, no-dash → unknown),
  schema-driven annotation (movie, TV episode, season pack with
  optional source, unknown group returns None), and field assembly.
- Existing tests/domain/test_release_fixtures.py (30 cases) stay green:
  5 EASY fixtures now produced by v2, 25 SHITTY/PATH OF PAIN fixtures
  still produced by the legacy path. Verified via spy on v2.assemble.

Suite: 1007 passed, 8 skipped.

Refs: project_release_parser_v2_specs (memory)
2026-05-20 00:21:11 +02:00
francwa a2c917618f feat(release): scaffold v2 parser package (annotate-based pipeline)
New package alfred/domain/release/parser/ lays the foundation for the
release parser refactor (specs in memory). Exposes:

- Token: frozen VO carrying text + stream index + TokenRole + extra dict.
  with_role() returns a new instance (no mutation).
- TokenRole: str-backed enum split into structural (TITLE/YEAR/SEASON_EP/
  GROUP), technical (RESOLUTION/SOURCE/CODEC/AUDIO_*/BIT_DEPTH/HDR/
  EDITION/LANGUAGE), and meta (SITE_TAG/UNKNOWN) families.
- pipeline.strip_site_tag(): pulls a [site.tag] prefix or suffix.
- pipeline.tokenize(): release name -> list[Token] (all UNKNOWN),
  string-ops split on kb.separators (no regex, per CLAUDE.md).
- pipeline.annotate(): documented stub. Walk order recorded in docstring
  (group right-to-left, then season/episode, year, tech, title).

Legacy parse_release in release.services remains the live implementation
until the annotate step lands. Scaffolding tests verify Token API,
site-tag stripping (prefix/suffix), and tokenize output shape.

Refs: project_release_parser_v2_specs (memory)
2026-05-20 00:12:33 +02:00
francwa cd814c7922 docs(changelog): log refactor/domain-release-knowledge work block 2026-05-19 22:05:29 +02:00
francwa df798f55cc refactor(subtitles): introduce SubtitleKnowledge Protocol port
Domain services (SubtitleIdentifier, PatternDetector) used to import the
concrete SubtitleKnowledgeBase class directly from infrastructure for
their type hint. With this commit they depend on a structural Protocol
in alfred/domain/subtitles/ports/knowledge.py declaring just the 7
read-only query methods the domain actually consumes.

The concrete YAML-backed SubtitleKnowledgeBase in infrastructure remains
the sole adapter — no rename, no shim. With this change
alfred/domain/subtitles/ has zero imports from alfred/infrastructure/.

Also extend the changelog entry covering the full domain-io-extraction
branch.
2026-05-19 15:15:43 +02:00
francwa 535935cc73 docs(changelog): summarize refactor/domain-io-extraction work block 2026-05-19 15:11:17 +02:00
francwa f6eef59fca refactor: tech debt mini-pass (items 5, 6, 7, 20)
Low-risk cleanup items, no functional change to the parser. The
philosophy remains: keep the parser simple, the AI handles edge cases.

- Extract duplicated 'fs-safe title → dot-folder-name' regex into
  to_dot_folder_name() in domain/shared/value_objects.py. Used by both
  MovieTitle.normalized() and TVShow.get_folder_name() (item #5).
- ParsedRelease.languages now uses field(default_factory=list) instead
  of a manual __post_init__ assigning [] via object.__setattr__ (#6).
- tv_shows/entities.py module docstring: prepend ASCII ownership tree
  for quicker visual scan of the aggregate hierarchy (#7).
- file_extensions.yaml: split subtitle sidecars (.srt/.sub/.idx/.ass/.ssa)
  into a dedicated 'subtitle:' category instead of lumping them under
  'metadata:'. _METADATA_EXTENSIONS at the value_objects.py level remains
  the union of both — detect_media_type behavior unchanged. New loader
  load_subtitle_extensions() exposes the distinct subtitle set for future
  callers in the subtitles domain (#20).

Suite: 1020 passed, 8 skipped.
2026-05-18 16:24:28 +02:00
francwa 273510dff8 test(fixtures): seed PATH OF PAIN bucket with 10 worst-case fixtures
10 pathological release names mined from the real downloads folder.
Each fixture locks in the current parse_release output (including
its silent losses and false positives) so future parser improvements
are intentional, not silent drift.

Cases:
- Khruangbin yt-dlp slug (UTF-8 wide pipe '|', YT ID as group)
- Deutschland 83-86-89 franchise box (group=S03 misdetection)
- Chérie Le BéBé (accented chars preserved, VFF language)
- Jimmy Carr 8-word stand-up special title
- [ OxTorrent.vc ] prefix + XviD codec (site_tag prefix)
- Prodiges S12E01 with episode title + air-date silently lost
- The Prodigy: apostrophe + Blu-ray dash + 1080i + multi-word audio
  = full AI-path degeneration (everything UNKNOWN)
- Sleaford Mods yt-dlp slug (YT ID glued to year)
- Super Mario Bros [FR-EN] (bilingual tag mistaken for group)
- Gilmore Girls Complete S01-S07 (the well-behaved exception:
  COMPLETE token correctly drives tv_complete + REPACK + 10bit)

Also adds shitty + path_of_pain to the per-bucket sanity assertion.

Suite: 1020 passed, 8 skipped.
2026-05-18 15:57:56 +02:00
francwa c1831e3f46 test(fixtures): drop derry_duplicate_naming (was a copy-paste artifact)
The release name mixed two distinct releases — not a real-world case
worth anti-regression. SHITTY bucket now holds 14 fixtures (down from 15).
2026-05-18 15:51:11 +02:00
francwa aa182458b8 test(fixtures): seed SHITTY release bucket with 15 anti-regression cases
Add 15 expected.yaml fixtures under tests/fixtures/releases/shitty/
covering the awkward but real-world release names from the downloads
folder. Each fixture locks in the current parse_release behavior so
future parser changes are intentional, not silent drift.

Cases captured:
- Angel INTEGRALE 3-level hierarchy (tv_complete media_type)
- Buffy custom French title with dots preserved
- Archer S14E09E10E11 multi-episode (E11 lost — tech debt)
- Notre Planète lowercase s01e01
- Vinyl ' - 1x01 - FHD' (stray dash artifact — tech debt)
- Deutschland.83 (year-suffix as part of title)
- Tatortreiniger S01-06 range (falls to movie — tech debt)
- Derry Girls duplicated title
- Jurassic Park bare folder (media_type=unknown)
- La Nuit au Musée bilingual MULTI
- Chérie j'ai agrandi (ASCII-stripped apostrophe, parses fine)
- Honey Don't (unescaped apostrophe — full AI-path degeneration)
- Hook MULTi.SUBS movie with Subs/ folder
- Predator Badlands space separators (group=UNKNOWN — tech debt)
- Westworld S04 Subs.Only (no video file)

Each fixture also captures the future 3-flow routing (library /
torrents / seed_hardlinks) ahead of the organize_media refactor.

Suite: 1011 passed, 8 skipped.
2026-05-18 15:48:41 +02:00
francwa 7bc50fd5b8 test: add real-world release fixtures (EASY bucket)
Captures 5 canonical releases from /mnt/testipool/downloads as parametrized
fixtures under tests/fixtures/releases/easy/. Each fixture declares the
release name, expected ParsedRelease fields, original tree, and the future
routing (library / torrents / seed_hardlinks) for the upcoming organize_media
refactor.

Today only the 'parsed' section is asserted; tree is materialized into a
tmp_path to catch typos. Routing is captured ahead of the planner work — it
becomes verifiable once organize_media lands.

Cases: back_in_action (movie), slow_horses_single_ep (TV single),
foundation_season_pack (S02 + .nfo noise), long_walk_with_noise (movie +
KONTRAST.TOP.txt), sinners_yts (YTS bracket-heavy + Subs/ dir).

Also tracks CHANGELOG.md under [Unreleased] / Added.
2026-05-18 15:36:19 +02:00
francwa 6940c76e58 Updated README and did a little bit of cleanup 2025-12-09 04:24:16 +01:00
francwa 9ca31e45e0 feat!: migrate to OpenAI native tool calls and fix circular deps (#fuck-gemini)
- Fix circular dependencies in agent/tools
- Migrate from custom JSON to OpenAI tool calls format
- Add async streaming (step_stream, complete_stream)
- Simplify prompt system and remove token counting
- Add 5 new API endpoints (/health, /v1/models, /api/memory/*)
- Add 3 new tools (get_torrent_by_index, add_torrent_by_index, set_language)
- Fix all 500 tests and add coverage config (80% threshold)
- Add comprehensive docs (README, pytest guide)

BREAKING: LLM interface changed, memory injection via get_memory()
2025-12-06 19:11:05 +01:00