paper-profit

Don’t ask an LLM if a stock is a buy — collect evidence first, then ask

What this is and why I built it

I set out to build an open-source stock scoring tool that works the way a real equity analyst would — not by asking an LLM “is this stock a good buy?” (useless), and not by averaging Wall Street analyst ratings (also useless, and conflicted).

The first attempt was embarrassingly naive. A direct prompt to an AI gave back confident, fluent, and almost completely empty answers — the model had no idea what the current price was, whether the stock was cheap or expensive, or what the market was doing. Words that sounded right with no substance behind them.

The second attempt, averaging published analyst ratings, failed for a different reason: when every analyst rates everything a buy, the signal disappears.

So I spent time studying how professional equity analysts actually work — the people at hedge funds and banks whose literal job is figuring out which stocks to own. The conclusion: they don’t rely on any single signal. They combine multiple lenses, each answering a different question about a company.

That became the architecture for PaperProfit.

The core idea

Most of the heavy lifting is deterministic: fetching stock prices, calculating ratios, computing moving averages, and detecting red flags all run as regular Python. It’s fast, cheap, and perfectly repeatable. No AI required.

The LLM comes in only for the qualitative layer: the things that require reading, interpreting, and summarizing unstructured text. Specifically, it makes targeted API calls to analyze:

Earnings call transcripts: management confidence, revenue guidance, margin trajectory, competitive positioning, and the strongest bull and bear arguments for the stock.
SEC filings: comparing the current 10-K against the previous year’s to detect changes in earnings quality, balance sheet health, and any new risks worth flagging.

The design principle: collect evidence first, then let normal Python handle the numbers, and use the LLM only where reading and judgment are genuinely useful.

The three-pillar framework

| Pillar | What it answers |
| --- | --- |
| Fundamental | Is this a financially strong business at a reasonable price? |
| Technical | Is the market currently moving toward or away from this stock? |
| Qualitative | What do the company story, management tone, and filing language suggest? |

Each pillar feeds into a five-dimension scoring system: Quality, Growth, Valuation, Momentum, and Risk — each scored −2 to +2, combined with weights that reflect long-term investing research (quality and growth carry the most weight; risk acts as a penalty).

The final signal: BUY / ACCUMULATE / HOLD / REDUCE / SELL.

The goal isn’t to predict the future. It’s to create a repeatable process that looks at a stock from several angles before making a judgment — the way a professional would.

How it works

run.py is the front door for the stock rating experiment.

You give it a ticker, like AAPL. It fetches market and company data, asks a few different kinds of questions about the stock, then prints a report with a final signal such as BUY, HOLD, or SELL.

The idea is simple: do not ask an LLM to guess whether a stock is good. First collect evidence. Then let normal Python code handle the numbers, and use the LLM only where reading and judgment are useful.

What Happens When You Run It

python run.py AAPL

The program follows this path:

run.py reads the ticker and selected pillars from the command line.
It loads .env so it knows which LLM provider to use.
fundamental.py fetches stock data from yfinance.
The selected analysis pillars run.
scoring.py combines the scores into one weighted result.
scoring.py prints a readable console report.

The Three Pillars

The evaluator has three analysis pillars.

| Pillar | File | What it answers |
| --- | --- | --- |
| 1. Fundamental | `fundamental.py` | Is this a financially strong business at a reasonable price? |
| 2. Technical | `technical.py` | Is the market currently moving toward or away from the stock? |
| 3. Qualitative | `qualitative.py` | What do the company story, management tone, and filing language suggest? |

You can run all pillars:

python run.py AAPL

Or only specific pillars:

python run.py AAPL --pillars 1,2
python run.py AAPL --pillars 3

--pillars accepts a comma-separated list using:

| Number | Meaning |
| --- | --- |
| `1` | Fundamental analysis |
| `2` | Technical analysis |
| `3` | Qualitative LLM analysis |
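The `--pillars` handling described above could be sketched with `argparse` as follows. This is a minimal illustration, not the actual code in run.py: the helper names `parse_pillars` and `build_parser` are hypothetical, and the real script may validate its input differently.

```python
import argparse


def parse_pillars(value: str) -> list[int]:
    """Turn a string like "1,2" into a sorted list of unique pillar numbers."""
    try:
        pillars = sorted({int(part) for part in value.split(",")})
    except ValueError:
        raise argparse.ArgumentTypeError(
            f"expected comma-separated numbers, got {value!r}"
        )
    invalid = [p for p in pillars if p not in (1, 2, 3)]
    if invalid:
        raise argparse.ArgumentTypeError(f"unknown pillar(s): {invalid}")
    return pillars


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser: one positional ticker plus the optional --pillars list."""
    parser = argparse.ArgumentParser(description="Rate a stock from several angles")
    parser.add_argument("ticker", help="ticker symbol, e.g. AAPL")
    parser.add_argument(
        "--pillars",
        type=parse_pillars,
        default=[1, 2, 3],  # run every pillar when the flag is omitted
        help="comma-separated pillar numbers: 1=fundamental, 2=technical, 3=qualitative",
    )
    return parser


args = build_parser().parse_args(["AAPL", "--pillars", "1,2"])
print(args.ticker, args.pillars)  # AAPL [1, 2]
```

Parsing into a list of integers up front means the rest of the program can simply check `if 3 in pillars` before making any LLM call.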

Main Files

`run.py`

This is the orchestrator. It does not contain most of the scoring logic itself.

Its main job is to:

parse command-line arguments
load environment variables from .env
fetch stock data
call the requested analysis pillars
merge red flags from different sources
call the final report printer

The main function is:

evaluate(ticker: str, provider: str, pillars: list[int] = None)

`fundamental.py`

This file handles the numbers behind the business.

It uses yfinance to fetch:

current price
52-week high and low
50-day and 200-day moving averages
RSI
valuation ratios like P/E and PEG
margins, revenue growth, free cash flow, debt, liquidity, and beta
raw financial statements
CIK, when available

Then score_fundamental() scores:

valuation
quality
growth
risk

Each dimension is scored from -2 to +2.
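The moving averages and RSI listed above can be derived from a plain list of closing prices. The sketch below is only an illustration: `sma` and `rsi` are hypothetical helper names, the 14-period window is the conventional default rather than something this document specifies, and the RSI here uses simple averages of gains and losses instead of Wilder's smoothing, so the real fundamental.py may produce slightly different numbers.

```python
def sma(closes: list[float], window: int) -> float:
    """Simple moving average of the most recent `window` closing prices."""
    if len(closes) < window:
        raise ValueError(f"need at least {window} closes, got {len(closes)}")
    return sum(closes[-window:]) / window


def rsi(closes: list[float], period: int = 14) -> float:
    """Relative Strength Index over the last `period` price changes.

    Simple-average variant (an assumption): average gain divided by
    average loss over the window, mapped onto a 0-100 scale.
    """
    if len(closes) < period + 1:
        raise ValueError(f"need at least {period + 1} closes, got {len(closes)}")
    recent = closes[-(period + 1):]
    changes = [later - earlier for earlier, later in zip(recent, recent[1:])]
    avg_gain = sum(c for c in changes if c > 0) / period
    avg_loss = sum(-c for c in changes if c < 0) / period
    if avg_loss == 0:
        return 100.0  # no losing days in the window
    relative_strength = avg_gain / avg_loss
    return 100 - 100 / (1 + relative_strength)


closes = [float(price) for price in range(100, 130)]  # 30 steadily rising closes
print(sma(closes, 5))  # 127.0
print(rsi(closes))     # 100.0
```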

`technical.py`

This file scores momentum from price data.

It checks:

whether the 50-day moving average is above the 200-day moving average
whether the current price is above or below the 50-day moving average
whether RSI looks oversold, healthy, neutral, or overbought
whether the stock is near its 52-week high or far below it

The output is one score:

momentum, from -2 to +2
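To make the four checks concrete, here is one way they might add up to a single momentum score. Everything numeric in this sketch is an assumption made for illustration: the point values, the RSI bands (below 30 oversold, 50 to 70 healthy, above 70 overbought), and the 52-week cutoffs are not taken from technical.py, and `score_momentum` is a hypothetical name.

```python
def score_momentum(price: float, ma50: float, ma200: float,
                   rsi: float, high_52w: float) -> float:
    """Combine four trend checks into one score clamped to [-2, +2]."""
    points = 0.0

    # 1. Long-term trend: 50-day average above the 200-day average.
    points += 1.0 if ma50 > ma200 else -1.0

    # 2. Short-term trend: price relative to the 50-day average.
    points += 0.5 if price > ma50 else -0.5

    # 3. RSI bands (assumed cutoffs).
    if rsi > 70:        # overbought
        points -= 0.5
    elif rsi >= 50:     # healthy
        points += 0.5
    elif rsi < 30:      # oversold
        points -= 0.5
    # 30 <= rsi < 50 is treated as neutral and adds nothing

    # 4. Distance from the 52-week high (assumed cutoffs).
    if price >= 0.9 * high_52w:
        points += 0.5
    elif price <= 0.7 * high_52w:
        points -= 0.5

    return max(-2.0, min(2.0, points))


print(score_momentum(price=190, ma50=180, ma200=170, rsi=60, high_52w=200))  # 2.0
```

Clamping at the end keeps momentum on the same -2 to +2 scale as the other dimensions, no matter how many checks agree.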

`qualitative.py`

This is where the LLM is used.

It supports:

Anthropic
OpenAI
DeepSeek

The qualitative pillar asks the LLM to score:

management quality
competitive position
earnings quality
growth narrative

It also runs two deeper checks:

a year-over-year earnings comparison using raw financial statements from yfinance
a 10-Q vs 10-K narrative comparison using SEC EDGAR filings

The LLM is expected to return JSON. _extract_json() tries to recover valid JSON even if the model wraps the answer in extra text.
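A recovery step like this can be sketched with the standard library alone. The version below is an assumption about the general approach, not the actual `_extract_json()` implementation: it scans for each opening brace and asks `json.JSONDecoder.raw_decode` to parse one object starting there, ignoring whatever text comes before or after.

```python
import json


def extract_json(text: str) -> dict:
    """Return the first JSON object embedded in `text`.

    Hypothetical stand-in for _extract_json(): tolerant of chatty
    prefixes and suffixes around the object the model was asked for.
    """
    decoder = json.JSONDecoder()
    for start, char in enumerate(text):
        if char != "{":
            continue
        try:
            # raw_decode parses one value at `start` and ignores trailing text.
            obj, _end = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            continue  # this brace did not open valid JSON; try the next one
        if isinstance(obj, dict):
            return obj
    raise ValueError("no JSON object found in model output")


reply = 'Here is my analysis: {"management_quality": 1, "red_flags": []} Hope that helps.'
print(extract_json(reply))  # {'management_quality': 1, 'red_flags': []}
```

Raising an error when nothing parses, rather than returning an empty dict, makes it obvious when a model reply has to be retried or skipped.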

`scoring.py`

This file turns all the evidence into the final signal.

It contains:

scoring weights
signal thresholds
score combination logic
red flag detection
console report formatting

The default weights are:

| Dimension | Weight |
| --- | --- |
| Quality | 30% |
| Growth | 25% |
| Valuation | 20% |
| Momentum | 15% |
| Risk | 10% |

One important detail: the LLM qualitative score is blended into the quality dimension. If both fundamental quality and qualitative quality exist, they are averaged together.
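The blending rule is small enough to show in a few lines. In this sketch `blend_quality` is a hypothetical name, and the fallback when only one of the two scores exists (use it unchanged) is my assumption; the document only states what happens when both are present.

```python
from typing import Optional


def blend_quality(fundamental: Optional[float],
                  qualitative: Optional[float]) -> Optional[float]:
    """Average the two quality scores when both exist.

    Assumption: if only one pillar ran, its score is used as-is,
    and if neither ran there is no quality score at all.
    """
    available = [score for score in (fundamental, qualitative) if score is not None]
    if not available:
        return None
    return sum(available) / len(available)


print(blend_quality(2, 1))     # 1.5
print(blend_quality(1, None))  # 1.0
```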

Final Signal

After the dimensions are scored, combine_scores() calculates a weighted total.

| Weighted total | Signal |
| --- | --- |
| `>= 1.0` | `BUY` |
| `>= 0.5` | `ACCUMULATE (weak buy)` |
| `>= -0.5` | `HOLD` |
| `>= -1.0` | `REDUCE (weak sell)` |
| `< -1.0` | `SELL` |
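Putting the default weights and the thresholds together, the arithmetic looks roughly like this. The weights and cutoffs come from the tables in this document; the function names `weighted_total` and `signal_for` are hypothetical, and the sketch assumes all five dimensions are present and that the risk score is already signed so that a negative value means more risk. The real `combine_scores()` may handle missing pillars differently.

```python
WEIGHTS = {
    "quality": 0.30,
    "growth": 0.25,
    "valuation": 0.20,
    "momentum": 0.15,
    "risk": 0.10,
}


def weighted_total(scores: dict[str, float]) -> float:
    """Weighted sum of the five dimension scores, each in [-2, +2]."""
    return sum(weight * scores[name] for name, weight in WEIGHTS.items())


def signal_for(total: float) -> str:
    """Map a weighted total onto the five-step signal scale."""
    if total >= 1.0:
        return "BUY"
    if total >= 0.5:
        return "ACCUMULATE"
    if total >= -0.5:
        return "HOLD"
    if total >= -1.0:
        return "REDUCE"
    return "SELL"


scores = {"quality": 2, "growth": 1, "valuation": 0, "momentum": 1, "risk": -1}
total = weighted_total(scores)  # 0.60 + 0.25 + 0.00 + 0.15 - 0.10
print(round(total, 2), signal_for(total))  # 0.9 ACCUMULATE
```

Because every dimension is capped at ±2 and the weights sum to 1, the total also stays within ±2, so a BUY needs the evidence to agree across several dimensions rather than spike in one.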

Environment Setup

Get the code:

git clone https://github.com/pg1/paper-profit.git
cd paper-profit/docs/experiments/rating-stocks-llm

Install the Python packages:

pip install yfinance anthropic openai python-dotenv requests

Create or edit .env in this directory:

LLM_PROVIDER=anthropic
ANTHROPIC_API_KEY=your_key_here
OPENAI_API_KEY=your_key_here
DEEPSEEK_API_KEY=your_key_here

Only the key for the selected provider is required.

Supported LLM_PROVIDER values:

| Value | Provider | Model used |
| --- | --- | --- |
| `anthropic` | Anthropic | `claude-opus-4-5` |
| `openai` | OpenAI | `gpt-4o` |
| `deepseek` | DeepSeek | `deepseek-chat` |

If LLM_PROVIDER is missing, run.py defaults to anthropic.
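The provider lookup and its default can be sketched as below. The model names and environment variable names are the ones listed in this document; `resolve_provider` is a hypothetical helper, and the up-front check for a missing API key is my assumption about sensible behavior rather than something run.py is known to do. In the real script, `load_dotenv()` from python-dotenv would first copy the `.env` values into `os.environ`.

```python
import os
from collections.abc import Mapping

MODELS = {
    "anthropic": "claude-opus-4-5",
    "openai": "gpt-4o",
    "deepseek": "deepseek-chat",
}
KEY_VARS = {
    "anthropic": "ANTHROPIC_API_KEY",
    "openai": "OPENAI_API_KEY",
    "deepseek": "DEEPSEEK_API_KEY",
}


def resolve_provider(env: Mapping[str, str] = os.environ) -> tuple[str, str]:
    """Return (provider, model), defaulting to anthropic when LLM_PROVIDER is unset."""
    provider = env.get("LLM_PROVIDER", "anthropic").strip().lower()
    if provider not in MODELS:
        raise ValueError(f"unsupported LLM_PROVIDER: {provider!r}")
    # Assumed check: only the selected provider's key has to be present.
    if not env.get(KEY_VARS[provider]):
        raise RuntimeError(f"{KEY_VARS[provider]} is not set")
    return provider, MODELS[provider]


# Passing a plain dict instead of os.environ keeps the example self-contained.
print(resolve_provider({"ANTHROPIC_API_KEY": "your_key_here"}))
# ('anthropic', 'claude-opus-4-5')
```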

Red Flags

The evaluator can show red flags even when the total score looks okay.

Automatic red flags include:

declining revenue
high debt-to-equity
current ratio below 1.0
negative free cash flow
high short interest
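The automatic checks above amount to a handful of threshold tests on already-fetched metrics. In the sketch below only the current-ratio cutoff of 1.0 comes from this document. The metric key names, the debt-to-equity limit of 2.0, and the short-interest limit of 10% of float are hypothetical placeholders, and the real thresholds in scoring.py may differ.

```python
# Assumed cutoffs, chosen only for illustration.
DEBT_TO_EQUITY_LIMIT = 2.0   # ratio, not percent
SHORT_INTEREST_LIMIT = 0.10  # fraction of float sold short

# (metric key, test that returns True when the value is a problem, flag text)
CHECKS = [
    ("revenue_growth", lambda v: v < 0, "declining revenue"),
    ("debt_to_equity", lambda v: v > DEBT_TO_EQUITY_LIMIT, "high debt-to-equity"),
    ("current_ratio", lambda v: v < 1.0, "current ratio below 1.0"),
    ("free_cash_flow", lambda v: v < 0, "negative free cash flow"),
    ("short_percent_of_float", lambda v: v > SHORT_INTEREST_LIMIT, "high short interest"),
]


def detect_red_flags(metrics: dict) -> list[str]:
    """Return the text of every check that fails; missing metrics are skipped."""
    flags = []
    for key, is_bad, label in CHECKS:
        value = metrics.get(key)
        if value is not None and is_bad(value):
            flags.append(label)
    return flags


metrics = {
    "revenue_growth": -0.05,
    "debt_to_equity": 0.8,
    "current_ratio": 0.9,
    "free_cash_flow": 1.2e9,
}
print(detect_red_flags(metrics))  # ['declining revenue', 'current ratio below 1.0']
```

Skipping missing values instead of flagging them matters in practice, since data providers do not report every field for every ticker.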

The LLM can also add red flags from:

qualitative company analysis
year-over-year financial statement comparison
10-Q vs 10-K narrative comparison

run.py deduplicates these before printing the report.
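Deduplication of this kind might look like the following. This is a sketch under my own assumptions: `dedupe_flags` is a hypothetical name, and treating flags as equal when they match after lowercasing and collapsing whitespace is one possible rule, not necessarily the one run.py applies.

```python
def dedupe_flags(*sources: list[str]) -> list[str]:
    """Merge flag lists, keeping the first spelling of each flag in order."""
    seen = set()
    merged = []
    for flags in sources:
        for flag in flags:
            # Normalize case and whitespace so near-identical flags collapse.
            key = " ".join(flag.lower().split())
            if key and key not in seen:
                seen.add(key)
                merged.append(flag.strip())
    return merged


automatic = ["Negative free cash flow"]
from_llm = ["negative free cash flow ", "New litigation risk"]
print(dedupe_flags(automatic, from_llm))
# ['Negative free cash flow', 'New litigation risk']
```

An exact-match rule like this will not merge two flags that describe the same problem in different words, which is a reasonable limit for a console report.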

Mental Model

Think of the script as a small analyst team:

fundamental.py is the accountant.
technical.py is the chart watcher.
qualitative.py is the filing reader.
scoring.py is the editor who turns everyone else’s notes into one clear report.
run.py is the person at the desk making sure each specialist gets called in the right order.

The goal is not to predict the future perfectly. The goal is to create a repeatable process that looks at a stock from several angles before making a judgment.
