Fed Speech Sentiment Analysis

Ingests, transcribes, and analyzes Federal Reserve speeches to extract sentiment signals for market correlation and trading research.

Overview

Federal Reserve communications move markets, but extracting actionable signals from speeches requires processing audio, parsing nuanced language, and correlating with price movements. This project builds an NLP pipeline that transcribes Fed speeches, applies financial sentiment analysis, and visualizes correlations with market data.

The goal was to design a reliable pipeline that handles unstructured audio data and produces structured sentiment metrics for quantitative analysis.

Goals

Architecture


┌─────────────────────────────────────────────────────────────┐
│ Data Sources │
│ Fed Website │ YouTube │ FRED API │ Yahoo Finance │
└──────────────────────────┬──────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│ Ingestion │
│ yt-dlp │ Requests │ API Clients │ Scrapy │
└──────────────────────────┬──────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│ Transformation │
│ Whisper (STT) → Text Cleaning → FinBERT (Sentiment) │
└──────────────────────────┬──────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│ Storage │
│ PostgreSQL │ Transcripts │ Scores │ Prices │
└──────────────────────────┬──────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│ Analytics / Consumption │
│ Streamlit Dashboard │ Jupyter │ SQL Queries │
└─────────────────────────────────────────────────────────────┘
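The transformation stage sits between Whisper's raw transcript output and sentence-level FinBERT scoring. A minimal sketch of that intermediate cleaning step is below; `clean_transcript` and the filler-word list are illustrative assumptions, not the project's actual code (the real pipeline uses spaCy for sentence segmentation).

```python
import re

# Hypothetical cleanup between Whisper output and FinBERT scoring.
# Whisper emits raw spoken text; filler words and stray whitespace add
# noise to sentence-level sentiment, so we strip them first.
FILLERS = re.compile(r"\s*\b(?:um|uh|you know)\b,?", re.IGNORECASE)

def clean_transcript(raw: str) -> list[str]:
    """Normalize a raw transcript and split it into sentences."""
    text = FILLERS.sub("", raw)
    text = re.sub(r"\s+", " ", text).strip()
    # Naive regex split; spaCy's sentencizer handles this in the real pipeline.
    sentences = re.split(r"(?<=[.!?])\s+", text)
    return [s for s in sentences if s]

print(clean_transcript(
    "Inflation expectations, um, remain anchored.  Policy is restrictive."
))
```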

Technology Stack

| Layer          | Technologies                           |
|----------------|----------------------------------------|
| Ingestion      | Python, yt-dlp, Requests, Scrapy       |
| Processing     | OpenAI Whisper, FinBERT, spaCy, pandas |
| Storage        | PostgreSQL, S3 (audio files)           |
| Orchestration  | Cron, GitHub Actions                   |
| Infrastructure | Docker, Vercel (dashboard)             |
| Visualization  | Streamlit, Plotly                      |

Implementation Details

Batch vs Streaming: Batch processing triggered by new speech releases (typically weekly). Fed communications are scheduled events, so real-time processing isn’t necessary.

Schema Design: Normalized relational schema—speeches table linked to sentences (with timestamps), sentiment scores, and market data snapshots. Enables flexible aggregation at speech, paragraph, or sentence level.
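A self-contained sketch of that normalized schema is below. Production uses PostgreSQL; sqlite3 is used here only so the snippet runs standalone, and the table and column names are assumptions rather than the project's actual DDL.

```python
import sqlite3

# Illustrative version of the normalized schema: speeches -> sentences
# (with audio timestamps) -> per-sentence sentiment scores. Aggregating
# at speech, paragraph, or sentence level is then a matter of JOIN + GROUP BY.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE speeches (
    speech_id   TEXT PRIMARY KEY,   -- date + speaker + event_type
    speaker     TEXT NOT NULL,
    event_type  TEXT NOT NULL,
    given_on    DATE NOT NULL
);
CREATE TABLE sentences (
    sentence_id INTEGER PRIMARY KEY,
    speech_id   TEXT NOT NULL REFERENCES speeches(speech_id),
    position    INTEGER NOT NULL,   -- order within the speech
    started_at  REAL,               -- audio timestamp, seconds
    text        TEXT NOT NULL
);
CREATE TABLE sentiment_scores (
    sentence_id INTEGER NOT NULL REFERENCES sentences(sentence_id),
    positive    REAL, negative REAL, neutral REAL
);
""")
```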

Handling Duplicates: Each speech has a unique identifier (date + speaker + event_type). Idempotent upserts prevent duplicate processing on retries.
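The idempotent-upsert pattern can be sketched as follows. The `ON CONFLICT ... DO UPDATE` syntax is shared by PostgreSQL (9.5+) and SQLite (3.24+); sqlite3 keeps the example self-contained, and `speech_key` and the column names are illustrative assumptions.

```python
import sqlite3

def speech_key(date: str, speaker: str, event_type: str) -> str:
    """Derive the unique speech identifier from date + speaker + event_type."""
    return f"{date}:{speaker.lower()}:{event_type.lower()}"

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE speeches (
    speech_id TEXT PRIMARY KEY, title TEXT, transcript TEXT)""")

def upsert_speech(conn, date, speaker, event_type, title, transcript):
    # Re-running the same speech updates in place instead of duplicating,
    # so retries after a partial failure are safe.
    conn.execute(
        """INSERT INTO speeches (speech_id, title, transcript)
           VALUES (?, ?, ?)
           ON CONFLICT(speech_id) DO UPDATE SET
               title = excluded.title,
               transcript = excluded.transcript""",
        (speech_key(date, speaker, event_type), title, transcript),
    )

# Processing the same speech twice leaves exactly one row.
for _ in range(2):
    upsert_speech(conn, "2024-05-01", "Powell", "FOMC", "Press Conference", "...")
```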

Scheduling Strategy: GitHub Actions workflow runs daily, checking for new speeches. Processing only triggers when new content is detected.
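The detection step inside that daily run reduces to a set difference between what the Fed calendar lists and what the database already holds. A minimal sketch, where `listed` and `known` stand in for the real scraper output and a `SELECT speech_id FROM speeches` query:

```python
def new_speech_ids(listed_ids: set[str], known_ids: set[str]) -> set[str]:
    """Return only the speech IDs not yet processed; empty set -> no-op run."""
    return listed_ids - known_ids

# Stand-ins for the scraped Fed calendar and the database contents.
listed = {"2024-05-01:powell:fomc", "2024-05-15:waller:speech"}
known = {"2024-05-01:powell:fomc"}
print(new_speech_ids(listed, known))  # only the unseen speech gets processed
```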

Tradeoffs: Accepted Whisper’s higher compute cost (vs cloud STT APIs) for better accuracy on financial terminology and ability to run locally.

Data Characteristics

| Metric    | Value                        |
|-----------|------------------------------|
| Volume    | 50+ speeches, ~2-3 new/month |
| Frequency | Event-driven (Fed calendar)  |
| Format    | Audio → Text → Structured    |
| Growth    | Linear (~36 speeches/year)   |

Reliability & Edge Cases

Lessons Learned

What surprised me: FinBERT’s financial domain training made a massive difference vs general sentiment models. “Inflation expectations remain anchored” scores very differently than generic sentiment would suggest.
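FinBERT emits positive/negative/neutral probabilities per sentence, so a generic-vs-financial difference shows up directly in those class scores. One common way to collapse them into a single speech-level signal (an assumption here, not necessarily the project's exact formula) is a net score, `positive - negative`, averaged over sentences:

```python
def net_sentiment(scores: list[dict[str, float]]) -> float:
    """Average (positive - negative) over per-sentence FinBERT class scores."""
    return sum(s["positive"] - s["negative"] for s in scores) / len(scores)

# Illustrative scores: a line like "Inflation expectations remain anchored"
# reads as reassuring to a finance-tuned model, where a generic model might
# flag "inflation" as negative.
speech_scores = [
    {"positive": 0.55, "negative": 0.05, "neutral": 0.40},
    {"positive": 0.10, "negative": 0.60, "neutral": 0.30},
]
print(net_sentiment(speech_scores))  # 0.0: one positive line offsets one negative
```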

What broke: Early versions tried to correlate raw sentiment with same-day returns, which produced no meaningful signal. Switching to a proper event study methodology, comparing returns in pre- and post-speech windows, made the signal extraction meaningful.
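The event-study idea reduces to comparing mean returns before and after the speech rather than reading the same-day move. A minimal sketch with illustrative window lengths and return data:

```python
def event_window_effect(returns: list[float], event_idx: int,
                        pre: int = 5, post: int = 5) -> float:
    """Mean return in the post-event window minus the pre-event window."""
    pre_w = returns[event_idx - pre:event_idx]       # days before the speech
    post_w = returns[event_idx + 1:event_idx + 1 + post]  # days after
    return sum(post_w) / len(post_w) - sum(pre_w) / len(pre_w)

# Illustrative daily returns around a speech at index 5.
daily_returns = [0.001, -0.002, 0.000, 0.001, -0.001,  # pre-window
                 0.004,                                  # event day
                 0.006, 0.005, 0.004, 0.006, 0.005]      # post-window
print(event_window_effect(daily_returns, event_idx=5))
```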

What I’d redesign: Would add more structured feature extraction—forward guidance detection, rate path implications, specific topic classification (inflation, employment, banking).

Future Improvements