Content-hash duplicate file detection with two-pass efficiency
```shell
pip install philiprehberger-duplicate-finder
```
```python
from philiprehberger_duplicate_finder import find_duplicates

# Find duplicates in a directory
groups = find_duplicates("~/Documents")

for group in groups:
    print(f"Size: {group.size} bytes, {group.count} copies, wasted: {group.wasted_bytes} bytes")
    for path in group.paths:
        print(f"  {path}")
```
```python
# Multiple directories with filters
groups = find_duplicates(
    paths=["~/Documents", "~/Downloads"],
    min_size=1024,
    extensions=[".pdf", ".jpg", ".png"],
    algorithm="sha256",
)
```
```python
# Progress tracking
groups = find_duplicates(
    "~/Pictures",
    on_progress=lambda current, total: print(f"{current}/{total}"),
)
```
```python
for group in groups:
    # Keep the most recently modified file
    keep = group.keep_newest()
    print(f"Keep: {keep}")

    # Or keep the file with the shortest path (shallowest)
    keep = group.keep_shortest_path()
    print(f"Keep: {keep}")

    # Get the list of files safe to delete
    to_delete = group.deletable(strategy="newest")
    for path in to_delete:
        print(f"  Delete: {path}")
```
| Function / Class | Description |
|---|---|
| `find_duplicates(paths, ...)` | Find duplicate files using a two-pass size-then-hash approach |
| `DuplicateGroup` | A group of duplicate files with `paths`, `size`, `hash`, `count`, and `wasted_bytes` |
| `DuplicateGroup.keep_newest()` | Return the path with the most recent modification time |
| `DuplicateGroup.keep_shortest_path()` | Return the path with the shortest string length |
| `DuplicateGroup.deletable(strategy)` | Return all paths except the one to keep (`"newest"` or `"shortest_path"`) |
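The two-pass idea behind `find_duplicates` can be sketched in plain Python: bucket files by size first, then hash only files whose size collides, since files of different sizes can never be duplicates. This is a minimal illustration under stated assumptions, not the library's actual implementation; `find_duplicates_sketch` and `chunk_size` are names invented for this sketch:

```python
import hashlib
import os
from collections import defaultdict

def find_duplicates_sketch(root, algorithm="sha256", chunk_size=1 << 20):
    # Pass 1: bucket files by size; files with a unique size need no hashing.
    by_size = defaultdict(list)
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                by_size[os.path.getsize(path)].append(path)
            except OSError:
                continue  # unreadable entry: skip it
    # Pass 2: hash only size-colliding files, reading in chunks to bound memory.
    groups = []
    for paths in by_size.values():
        if len(paths) < 2:
            continue
        by_hash = defaultdict(list)
        for path in paths:
            h = hashlib.new(algorithm)
            with open(path, "rb") as f:
                for chunk in iter(lambda: f.read(chunk_size), b""):
                    h.update(chunk)
            by_hash[h.hexdigest()].append(path)
        groups.extend(g for g in by_hash.values() if len(g) > 1)
    return groups
```

The size pre-pass is what makes the approach cheap on large trees: only the (usually small) fraction of files sharing a byte count ever gets read in full.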
```shell
pip install -e .
python -m pytest tests/ -v
```
If you find this project useful: