🖼️ Paper Scraper for Reddit

Paper Scraper is an asynchronous Python tool that downloads images from Reddit, either from your saved posts or from any subreddits you specify.

It recognizes direct image links, Reddit image/gallery posts (i.redd.it, preview.redd.it), Imgur images/albums/galleries, and Flickr photos. Posts that don't resolve to a downloadable image are skipped.

Requirements

Python 3.13+
uv
A Reddit account with a registered script app
An Imgur account with a registered app (used to resolve Imgur links)
(optional) a Flickr API key, only needed to resolve Flickr links

Installation

Clone this repository:

```
git clone https://github.com/samlowe106/PaperScraper.git
cd PaperScraper
```

Ensure uv is installed, then create the environment:
```
uv sync
```
Create a Reddit app from your Reddit app preferences page and choose script as the app type.
Create an Imgur app from your Imgur application settings page.

Create a file named .env in the project root with your credentials:

```
REDDIT_CLIENT_ID="..."
REDDIT_CLIENT_SECRET="..."
IMGUR_CLIENT_ID="..."
IMGUR_CLIENT_SECRET="..."
FLICKR_CLIENT_ID="..."   # optional, only needed to resolve Flickr links
```
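
This README does not say how the tool loads `.env`, so the following is only a sketch of a pre-flight check you could run yourself. It assumes the variables have already been exported into the process environment; `missing_credentials`, `REQUIRED`, and `OPTIONAL` are hypothetical names, while the variable names come from the example above.

```python
import os
from collections.abc import Mapping

# Variable names are taken from the .env example; the required/optional split
# follows the Requirements section (Flickr is optional).
REQUIRED = (
    "REDDIT_CLIENT_ID",
    "REDDIT_CLIENT_SECRET",
    "IMGUR_CLIENT_ID",
    "IMGUR_CLIENT_SECRET",
)
OPTIONAL = ("FLICKR_CLIENT_ID",)


def missing_credentials(env: Mapping[str, str] = os.environ) -> list[str]:
    """Return the required variable names that are unset or empty in env."""
    return [name for name in REQUIRED if not env.get(name)]
```

Calling `missing_credentials()` before a run returns an empty list when all four required values are set.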

Usage

After running uv sync, start the tool with the paperscraper command (equivalent to uv run python -m src.main):

```
uv run paperscraper [options]
```

Some common options (run with --help for the full list):

| Flag | Description |
| --- | --- |
| `-r`, `--subreddit` | Include posts from a subreddit (repeatable) |
| `--sortby` | How to sort subreddit posts: `hot`, `new`, `controversial`, `gilded`, or `top_all` / `top_day` / `top_week` / `top_month` / `top_year` / `top_hour` |
| `--limit` | Max submissions to pull from each source (default: `10`); an album submission may still yield several images |
| `-k`, `--karma` | Only download posts with at least this score |
| `--hours` / `--days` / `--years` | Only download posts at most this old (mutually exclusive) |
| `-d`, `--dir` | Output directory (default: `Output`) |
| `--organize` | Sort downloaded images into per-subreddit subfolders |
| `--nolog` | Disable the per-run JSON log (written into the output dir by default) |
| `-u`, `--saved` | Include your saved posts; prompts for Reddit login (see note) |
| `--unsave` | Un-save saved posts after a successful download (opt-in; requires login) |
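
To show how a repeatable flag and mutually exclusive age flags can be declared, here is a minimal `argparse` sketch covering a few of the options above. It is an illustration under assumptions, not the project's parser: `build_parser` is a hypothetical name, and the integer types are guesses (only the `--limit` default of `10` is stated in the table).

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build an illustrative parser for a subset of the documented flags."""
    parser = argparse.ArgumentParser(prog="paperscraper")
    # action="append" lets -r be passed several times.
    parser.add_argument("-r", "--subreddit", action="append", default=[])
    parser.add_argument("--limit", type=int, default=10)
    parser.add_argument("-k", "--karma", type=int, default=None)

    # argparse rejects any command line that sets more than one of these.
    age = parser.add_mutually_exclusive_group()
    age.add_argument("--hours", type=int)  # int is an assumption
    age.add_argument("--days", type=int)
    age.add_argument("--years", type=int)
    return parser
```

With this sketch, passing both `--days` and `--hours` makes `argparse` print a usage error and exit.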

Examples:

```
# Download images from r/wallpapers, sorted by top of all time, into ./pics
uv run paperscraper -r wallpapers --sortby top_all -d pics

# Top posts from the last week with >= 100 score, at most 5 per subreddit
uv run paperscraper -r wallpapers -r art --sortby top_week --days 7 -k 100 --limit 5
```

Files are written to a timestamped directory (e.g. Output/PaperScraper 2026-06-20 08:30/).
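
A directory name in that shape can be produced with `strftime`. The sketch below is an assumption about the format, inferred only from the example path; `run_directory` is a hypothetical helper, not a function from this repository.

```python
from datetime import datetime
from pathlib import Path


def run_directory(base: str, now: datetime) -> Path:
    """Return base/'PaperScraper YYYY-MM-DD HH:MM' for the given timestamp."""
    # Format string inferred from the example path in this README.
    return Path(base) / now.strftime("PaperScraper %Y-%m-%d %H:%M")


example = run_directory("Output", datetime(2026, 6, 20, 8, 30))
print(example.as_posix())  # Output/PaperScraper 2026-06-20 08:30
```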

Note: the saved-posts flow (--saved / --unsave) is implemented but has so far only been exercised against mocked Reddit responses; it still needs to be verified end-to-end with a real login. --saved also uses getpass, so it needs a real terminal (not an IDE console).

Testing

The test suite uses pytest (with pytest-asyncio for the async code and vcrpy cassettes for recorded HTTP interactions).

```
uv run pytest                                   # run the suite
uv run pytest --cov=src --block-network         # with coverage, no live network
```

As of the latest run, 139 tests pass with ~97% line coverage. CI runs the suite on the pinned Python version and fails the build if coverage drops below 80%.

This repo also ships a pre-commit config (ruff, black, mypy, and assorted file checks):

```
uv run pre-commit install      # hooks will now run on every commit
uv run pre-commit run --all-files
```

Technical Overview

After argument parsing, main() runs a fully asynchronous pipeline:

1. Stream building. StreamBuilder (see src/reddit/submission_source.py) signs into Reddit via asyncpraw and turns the requested subreddits into async listing generators (capped per source by --limit). These are interleaved with merge() and adapted with amap() / afilter() (see src/core/functional.py), yielding each submission as a SubmissionWrapper. A predicate built from --karma and the age flags (--hours/--days/--years) filters out submissions that don't qualify.
2. URL finding. Each SubmissionWrapper.find_urls() runs every parser (single_image, reddit, imgur, flickr in src/parsing/) concurrently, following a strategy pattern, and collects the direct media links the parsers can resolve.
3. Downloading & saving. Resolved URLs are fetched with httpx and written to disk with aiofiles via UniqueDirectoryFileManager, which guarantees unique filenames and (with --organize) per-subreddit folders.

Concurrency is bounded by two asyncio.Semaphores, one for URL finding and a larger one for downloads, and the whole pipeline runs inside an asyncio.TaskGroup so submissions are processed as they stream in rather than in fixed batches. Each submission is handled independently: a failure is logged and skipped rather than aborting the run, and individual downloads retry transient errors with backoff. Unless --nolog is passed, a JSON record of each processed post is appended to a log in the output directory.

License

Paper Scraper is licensed under the MIT license.
