data engineering · 2026 · ◌ paused

Market Cynic Pipeline

Bronze→Silver→Gold pipeline correlating Reddit sentiment with Yahoo Finance prices

Automated runs are paused: Reddit shut down the public .json endpoints used for ingestion and blocked GitHub Actions IPs. A v2 is planned with OAuth-authenticated Reddit access, Airflow orchestration, and Spark processing.
architecture

When a stock is heavily discussed with positive retail sentiment but its price is simultaneously falling, that divergence is a signal worth watching. Detecting it requires correlating two noisy, differently-structured data streams in near real time.
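As a rough illustration of that divergence check, the sketch below compares the oldest and newest observations in a rolling window of runs. The `Run` tuple, the `is_divergent` name, and the definition of momentum as a simple last-minus-first difference are assumptions made for this example, not the project's actual code; the 6-run window comes from the design notes on this page.

```python
from typing import NamedTuple, Sequence

WINDOW = 6  # runs per window, as stated in the design notes (~2 days at 3 runs/day)


class Run(NamedTuple):
    """One pipeline run for a single ticker (hypothetical record shape)."""
    sentiment: float  # aggregated, weighted sentiment for the run
    price: float      # price observed during the run


def is_divergent(runs: Sequence[Run], window: int = WINDOW) -> bool:
    """Return True when sentiment rose while price fell across the last `window` runs.

    Assumption: "momentum" is the newest value minus the oldest value in the window.
    """
    if len(runs) < window:
        return False  # not enough history to judge
    recent = runs[-window:]
    sentiment_momentum = recent[-1].sentiment - recent[0].sentiment
    price_momentum = recent[-1].price - recent[0].price
    return sentiment_momentum > 0 and price_momentum < 0
```

A real implementation would likely smooth both series first, since single-run differences on noisy data tend to produce false positives.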

The pipeline follows a Bronze → Silver → Gold medallion layout. Yahoo Finance price data is scraped with a Playwright headless browser, and Reddit sentiment is pulled from four subreddits (r/stocks, r/wallstreetbets, r/investing, r/stockmarket). A two-layer "Cynic Heuristic" weights each post by a controversy score (log-scaled by comment count) and by a per-subreddit trust multiplier. The Gold layer detects divergence events, meaning positive sentiment momentum paired with negative price momentum, and surfaces them in a Streamlit dashboard with dual-axis charts.
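The two-layer weighting could look roughly like this in Python. The multipliers and the controversy formula are the ones listed in the design notes on this page; the function names, the treatment of `controversy_factor` as a non-negative number, the choice to multiply the two layers together, and the default multiplier of 1.0 for unlisted subreddits are illustrative guesses.

```python
import math

# Per-subreddit trust multipliers, as listed in the design notes.
TRUST_MULTIPLIERS = {
    "investing": 1.5,
    "stocks": 1.2,
    "stockmarket": 1.0,
    "wallstreetbets": 0.7,
}


def controversy_weight(controversy_factor: float, comments: int) -> float:
    """Layer 1: 1.0 + (controversy_factor * log1p(comments) * 0.2)."""
    return 1.0 + controversy_factor * math.log1p(comments) * 0.2


def weighted_sentiment(
    compound: float, subreddit: str, controversy_factor: float, comments: int
) -> float:
    """Scale a raw sentiment score by both layers.

    Assumptions: the layers combine by multiplication, and subreddits
    missing from the table fall back to a neutral 1.0 multiplier.
    """
    key = subreddit.lower().removeprefix("r/")
    trust = TRUST_MULTIPLIERS.get(key, 1.0)  # layer 2
    return compound * controversy_weight(controversy_factor, comments) * trust
```

With these definitions, a post with no comments or zero controversy keeps a controversy weight of exactly 1.0, so only the subreddit multiplier applies.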

  • Medallion architecture: Bronze (raw JSON/posts) → Silver (Pydantic validation) → Gold (merged divergence signals)
  • Subreddit trust weighting: r/investing 1.5×, r/stocks 1.2×, r/stockmarket 1.0×, r/wallstreetbets 0.7×
  • Controversy signal weight: 1.0 + (controversy_factor × log1p(comments) × 0.2) — viral controversial posts weighted heavier
  • Rolling divergence detection over 6-run window (~2 days at 3 runs/day)
  • Git as a database: market_history.parquet append-only, committed by MarketCynicBot on each scheduled run
  • Gatekeeper pattern: main.py exits with code 1 on any stage failure rather than propagating bad data downstream
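The gatekeeper idea from the last item can be sketched as a small stage runner. This is a minimal illustration under stated assumptions, not the project's main.py: `run_pipeline` and the (name, callable) stage list are hypothetical, and each stage is assumed to signal failure by raising an exception.

```python
import sys
from typing import Callable, Sequence


def run_pipeline(stages: Sequence[tuple[str, Callable[[], object]]]) -> int:
    """Run stages in order and stop at the first failure.

    Returns 0 when every stage succeeds and 1 as soon as one raises,
    so later stages never see output from a failed stage.
    """
    for name, stage in stages:
        try:
            stage()
        except Exception as exc:  # any stage failure closes the gate
            print(f"[gatekeeper] stage '{name}' failed: {exc}", file=sys.stderr)
            return 1
    return 0
```

An entry point would then end with something like `sys.exit(run_pipeline(stages))`, so a scheduled run reports a non-zero exit code and any later step that depends on it, such as the commit, can be skipped.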
Python · Playwright · VADER / NLTK · Pydantic v2 · pandas · PyArrow / Parquet · Streamlit · GitHub Actions