
Building an Automated Legislative Transcription Pipeline with Whisper and GitHub Actions

Michael Pichardo February 11, 2026

Michigan House and Senate hearing videos are public but have no transcripts. I built a pipeline that scrapes the videos, downloads them, transcribes with Whisper large-v3, runs QC scoring on every segment, applies grammar correction, uploads to S3, and runs three times daily on GitHub Actions — completely unattended.

Why I Built It

The Michigan House and Senate post video recordings of their committee hearings. Thousands of hours of legislative proceedings — witness testimony, member questions, floor debate — are available publicly, but there are no transcripts.

That's an accessibility gap. People who are deaf or hard of hearing can't use the videos. Researchers and journalists who need to search for specific testimony can't grep a video file. Anyone who wants to know what a specific committee said about a specific bill has to scrub through hours of video hoping to find the right moment.

I built a pipeline that converts these videos to searchable, quality-checked transcripts, automatically, without anyone having to run it.

Architecture

The pipeline handles House and Senate separately because they're on different hosting platforms, but they share the same transcription and post-processing stages.

House (house.mi.gov): Selenium scrapes the video archive page, collects video links, downloads MP4s via direct HTTP from https://www.house.mi.gov/ArchiveVideoFiles/{filename}.

Senate (cloud.castus.tv): Selenium scrapes the CastUS platform — a JavaScript SPA — to extract HLS m3u8 stream URLs. Downloads via ffmpeg -c copy (no re-encoding, remux only).
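The remux step can be sketched roughly as below. This is an illustration rather than the pipeline's actual code: the function names are hypothetical, and it assumes ffmpeg is on the PATH and that the m3u8 URL has already been extracted by the scraper.

```python
import subprocess
from pathlib import Path


def build_remux_command(m3u8_url: str, output_path: Path) -> list[str]:
    """Build an ffmpeg command that copies HLS streams into a file without re-encoding."""
    return [
        "ffmpeg",
        "-y",             # overwrite a leftover partial file from an earlier run
        "-i", m3u8_url,   # HLS playlist URL found by the scraper
        "-c", "copy",     # copy audio/video streams as-is (remux only)
        str(output_path),
    ]


def download_hls(m3u8_url: str, output_path: Path) -> None:
    """Run ffmpeg; raises subprocess.CalledProcessError on a non-zero exit."""
    subprocess.run(build_remux_command(m3u8_url, output_path), check=True)
```

Because the streams are copied rather than re-encoded, the download is limited mostly by network speed, not CPU.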

Shared stages: Both feed into the same transcription + QC + grammar correction + S3 upload pipeline.

scrape → download → [Whisper transcribe + S3 upload in parallel] → QC → grammar correct → delete local MP4

The upload and transcription run in parallel threads. For most videos the S3 upload finishes before Whisper does, so the wall-clock time per video is max(upload, transcribe) instead of upload + transcribe. The upload adds nothing to the cycle time.
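One way to get that overlap is a two-worker thread pool. The sketch below is an assumption about the shape of the code, not a copy of it: `transcribe` and `upload` are hypothetical callables standing in for the Whisper call and the S3 upload.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def transcribe_and_upload(
    video_path: str,
    transcribe: Callable[[str], dict],
    upload: Callable[[str], str],
) -> tuple[dict, str]:
    """Run transcription and upload concurrently; return (transcript, s3_key)."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        transcript_future = pool.submit(transcribe, video_path)
        upload_future = pool.submit(upload, video_path)
        # .result() blocks until the task finishes and re-raises any exception
        # from the worker, so the local MP4 should only be deleted after both
        # calls have returned.
        return transcript_future.result(), upload_future.result()
```

Threads are enough here because the upload spends its time waiting on the network while transcription spends its time inside the model.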

The SSL Gotcha

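house.mi.gov serves an incomplete SSL certificate chain: the intermediate CA certificate is missing. Python's requests fails with SSLCertVerificationError. Installing certifi doesn't help, because certifi only supplies trusted root certificates. The missing intermediate has to come from the server, so this is a server-side misconfiguration that no client-side root bundle can fix.

Fix: verify=False on all requests to that domain, with urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning). Not ideal, because it turns off certificate checking for those requests, but the alternative is a script that doesn't work. (A stricter option is to point verify= at a CA bundle file that includes the missing intermediate.) Government SSL infrastructure is often incomplete and not in your control.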
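A minimal sketch of a download helper that applies the workaround only to this host. The function names, chunk size, and timeout are illustrative assumptions. Only the URL pattern and the verify=False / disable_warnings calls come from the description above.

```python
HOUSE_VIDEO_URL = "https://www.house.mi.gov/ArchiveVideoFiles/{filename}"


def house_video_url(filename: str) -> str:
    """Build the direct MP4 URL for a file listed on the archive page."""
    return HOUSE_VIDEO_URL.format(filename=filename)


def download_house_video(filename: str, dest_path: str) -> None:
    """Stream one House video to disk with certificate verification disabled."""
    # Imported here so the URL helper can be used without requests installed.
    import requests
    import urllib3

    # Silence the warning that verify=False would otherwise print per request.
    urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

    # verify=False is passed per request, so it only affects this domain.
    with requests.get(
        house_video_url(filename), stream=True, timeout=60, verify=False
    ) as response:
        response.raise_for_status()
        with open(dest_path, "wb") as out:
            # Write in 1 MiB chunks so a long hearing never sits fully in memory.
            for chunk in response.iter_content(chunk_size=1 << 20):
                out.write(chunk)
```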

Why `large-v3` and Not `base`

My first run used Whisper base. The results were unusable.

Three failure modes:

Silence hallucination — base generates repeated tokens on pre-meeting dead air ("you you you you") even when no_speech_prob is high. There's often 5–10 minutes of camera-on-but-nothing-happening time before hearings start.

Wrong language detection — Whisper samples the first 30 seconds to detect language. When that's silence, it guesses wrong. One transcript (CHIL-020326) was detected as Welsh and produced gibberish. Forced English with language="en" in every call to model.transcribe() eliminates this entirely.

Weak vocabulary — base can't handle domain-specific terms. "Quorum" became "corn." "Rep. Tisdale" became "Reptiles Tisdale."

Upgrading to large-v3 fixed all three. It's about 10x slower than base on CPU but handles legislative vocabulary, proper nouns, and accented speech correctly. For offline batch jobs where runtime isn't a constraint, this is the right default.
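For reference, pinning the language looks roughly like this. The sketch assumes the openai-whisper package (`whisper.load_model` and `model.transcribe`); the helper names and the file name in the usage note are hypothetical.

```python
def load_large_v3():
    """Load the large-v3 checkpoint (downloads it on first use)."""
    import whisper  # openai-whisper; imported lazily because it is a heavy dependency

    return whisper.load_model("large-v3")


def transcribe_english(model, video_path: str) -> dict:
    """Transcribe with the language pinned to English, skipping auto-detection."""
    # language="en" stops Whisper from guessing the language from the first
    # 30 seconds of audio, which may be pre-meeting silence.
    return model.transcribe(video_path, language="en")
```

Usage would be something like `result = transcribe_english(load_large_v3(), "hearing.mp4")`, with the per-segment metrics then available under `result["segments"]`.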

QC Scoring Every Segment

Even with large-v3, some segments fail. The pipeline scores every segment using four metrics Whisper exposes directly in its output:

| Metric | Threshold | What it catches |
|---|---|---|
| no_speech_prob | > 0.85 | Silence / dead air |
| avg_logprob | < -1.0 | Uncertain transcription |
| compression_ratio | > 2.4 | Repeated token loops |
| temperature | > 0 | Model struggled, used sampling |

A transcript gets qc_passed: true only if no segment crosses any threshold. The QC result is embedded directly in the transcript JSON under a "qc" key, so there is no separate file and no separate pass.
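The scoring rule can be sketched as follows. The thresholds come from the table above and the segment keys are the ones Whisper emits. The exact layout of the "qc" object beyond `qc_passed` (here, a `flagged_segments` list) is an assumption made for illustration.

```python
# (metric name, predicate that returns True when the segment should be flagged)
QC_CHECKS = [
    ("no_speech_prob", lambda seg: seg["no_speech_prob"] > 0.85),
    ("avg_logprob", lambda seg: seg["avg_logprob"] < -1.0),
    ("compression_ratio", lambda seg: seg["compression_ratio"] > 2.4),
    ("temperature", lambda seg: seg["temperature"] > 0),
]


def segment_flags(segment: dict) -> list[str]:
    """Return the names of every metric on which this segment fails."""
    return [name for name, failed in QC_CHECKS if failed(segment)]


def attach_qc(result: dict) -> dict:
    """Score every segment and embed the outcome under result["qc"]."""
    flagged = [
        {"id": seg["id"], "flags": flags}
        for seg in result["segments"]
        if (flags := segment_flags(seg))
    ]
    # The transcript passes only when no segment was flagged on any metric.
    result["qc"] = {"qc_passed": not flagged, "flagged_segments": flagged}
    return result
```

Keeping the list of flagged segments alongside the pass/fail bit makes it easier to see later which part of a hearing caused a failure.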

Transcripts that fail QC are flagged for retranscription. The --retranscribe flag re-downloads and re-transcribes any failed transcript, bypassing the normal "already processed" check.
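The skip logic behind that flag might look like the sketch below. The shape of the execution log (a dict keyed by video ID with a `qc_passed` field) and the function names are assumptions; only the `--retranscribe` flag itself is from the pipeline.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """CLI with the single flag discussed here."""
    parser = argparse.ArgumentParser(description="Legislative transcription pipeline")
    parser.add_argument(
        "--retranscribe",
        action="store_true",
        help="re-download and re-transcribe transcripts that failed QC",
    )
    return parser


def should_process(video_id: str, execution_log: dict, retranscribe: bool) -> bool:
    """Decide whether a video needs to go through the pipeline on this run."""
    entry = execution_log.get(video_id)
    if entry is None:
        return True  # never processed before
    if retranscribe and not entry.get("qc_passed", False):
        return True  # processed, but failed QC and a redo was requested
    return False  # already processed; skip
```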

Results after upgrading to large-v3 + forced English: All 16 transcripts passed QC. The 5 that originally failed with base were retranscribed and all passed.

Grammar Correction

Whisper doesn't add punctuation. Raw transcripts are a wall of lowercase words with no periods, commas, or capitalization. Two-step correction:

Step 1: oliverguhr/fullstop-punctuation-multilang-large (HuggingFace) — adds periods, commas, and question marks.

Step 2: language_tool_python (LanguageTool) — capitalization and grammar. Requires Java. Use version 2.7.3 for Java 11 compatibility.

Output goes to data/transcripts-final/ as .md files. The raw JSON with QC scores stays in data/transcripts/.
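A rough sketch of the two steps chained together. It assumes the punctuation model is loaded through the deepmultilingualpunctuation wrapper (`PunctuationModel.restore_punctuation`) and that LanguageTool is used via `LanguageTool("en-US").correct`. The wrapper choice, the function names, and the one-file-per-transcript naming are all assumptions.

```python
from pathlib import Path
from typing import Callable


def build_steps() -> tuple[Callable[[str], str], Callable[[str], str]]:
    """Load both models once and return (punctuate, fix_grammar) callables."""
    # Heavy imports kept local: both download models, and LanguageTool needs Java.
    from deepmultilingualpunctuation import PunctuationModel
    import language_tool_python

    punct = PunctuationModel(model="oliverguhr/fullstop-punctuation-multilang-large")
    tool = language_tool_python.LanguageTool("en-US")
    return punct.restore_punctuation, tool.correct


def correct_transcript(
    raw_text: str,
    punctuate: Callable[[str], str],
    fix_grammar: Callable[[str], str],
) -> str:
    """Step 1 restores punctuation; step 2 fixes capitalization and grammar."""
    return fix_grammar(punctuate(raw_text))


def write_final(
    transcript_id: str, text: str, out_dir: Path = Path("data/transcripts-final")
) -> Path:
    """Write the corrected transcript as a .md file and return its path."""
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / f"{transcript_id}.md"
    path.write_text(text + "\n", encoding="utf-8")
    return path
```

The order matters: LanguageTool's capitalization rules depend on sentence boundaries, so punctuation has to be restored first.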

GitHub Actions for Scheduling

The pipeline runs three times daily. I started with local cron, but that requires the laptop to be awake, which means caffeinate workarounds, and it ties the pipeline to one machine.

GitHub Actions is better for this: public repo means unlimited free Linux minutes. The pipeline runs on the schedule regardless of whether my laptop is on, closed, or in another city.

Schedule (3x daily, CT, written as UTC):

11 AM CT → `0 17 * * *`
4 PM CT → `0 22 * * *`
9 PM CT → `0 3 * * *`
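The conversion is simple arithmetic, sketched here with a hypothetical helper. It assumes Central Standard Time (UTC-6) by default and produces the five-field cron form that GitHub Actions expects, which is always evaluated in UTC.

```python
def ct_to_utc_cron(hour_ct: int, minute: int = 0, utc_offset_hours: int = -6) -> str:
    """Convert a daily Central Time hour into a five-field UTC cron expression."""
    # Subtracting a negative offset adds hours; % 24 wraps past midnight,
    # which is why 9 PM CT lands at 03:00 UTC on the following day.
    hour_utc = (hour_ct - utc_offset_hours) % 24
    return f"{minute} {hour_utc} * * *"


schedule = [ct_to_utc_cron(hour) for hour in (11, 16, 21)]
# schedule == ["0 17 * * *", "0 22 * * *", "0 3 * * *"]
```

Because the cron stays fixed in UTC, the same expressions fire an hour later on the local clock during daylight saving time (UTC-5); passing `utc_offset_hours=-5` gives the summer equivalents.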

Important workflow settings:

cancel-in-progress: false (set under the workflow's concurrency group): if a run is still going when the next trigger fires, the new run waits instead of killing the one in progress. Transcription jobs can easily run longer than the 5 hours between triggers.
Cache the Whisper model (about 3 GB for the official large-v3 checkpoint) keyed as whisper-large-v3-v1. Without this, every run re-downloads the model.
execution_log.json uses split restore/save: actions/cache/restore to load at the start, actions/cache/save with if: always() to save even if the pipeline fails. If you use the combined actions/cache action and the pipeline fails, you lose the state file and the next run reprocesses everything.

AWS credentials (S3 bucket/keys) go in GitHub Actions secrets, written to .env at runtime in the workflow. Never committed.

What 16 Passing Transcripts Looks Like

The pipeline has been running in production since February 2026. 16 transcripts, all passing QC, all human-audited. The failure pattern is predictable: pre-meeting silence, off-mic side conversations, occasional domain-specific terms that even large-v3 gets wrong. Those show up in the QC scores and get flagged for review.

The grammar-corrected versions in transcripts-final/ are readable enough that you can search them, quote from them, or hand them directly to someone who needs to understand what was said in a hearing.

That's the gap it fills.
