Vulnerability Consolidation System
- Consolidates vulnerability data from 3 APIs (GitHub Dependabot, Vanta, Jira) into a unified Google Sheets dashboard with SLA deadline tracking
- Parallel collection via ThreadPoolExecutor with failure isolation, if one source is down the others still flow
- 28 property-based tests (Hypothesis) validate merge logic, deduplication stability, and aggregation invariants across arbitrary inputs
- Incremental sync minimizes API calls after the initial full sync, reducing runtime from ~60s to ~20-30s
- Feeds directly into FedRAMP continuous monitoring evidence with clear first_seen/last_seen/status lifecycle for every finding
The Challenge
The platform generates vulnerability findings from multiple sources: GitHub Dependabot alerts across 20+ repositories, Vanta compliance findings from continuous monitoring, and Jira tickets for manual remediation tracking. Each source has its own UI, its own status model, and its own view of the world. Nobody had a single answer to "how many open vulnerabilities do we have right now?" without manually checking three dashboards and reconciling overlapping records.
For FedRAMP continuous monitoring, I need to demonstrate that vulnerabilities are tracked from discovery through remediation with clear timelines. Doing that manually across three systems doesn't scale.
Approach & Role
I built a Python consolidation tool that pulls from all three APIs, merges records by unique identifier, tracks lifecycle status (open, dismissed, fixed, retracted), and writes everything to a unified Google Sheets dashboard with calculated metrics. It runs on a schedule and supports incremental sync so it's not hammering APIs for unchanged data.
The design priorities were reliability (collector failures are isolated. If Vanta is down, GitHub and Jira data still flows), correctness (28 property-based tests validate merge logic, field extraction, and aggregation invariants), and auditability (every record tracks first_seen_date, last_seen_date, and status transitions).
Architecture & Patterns
Pipeline architecture:
- Timestamp Manager reads last sync times from a metadata sheet
- Collectors (GitHub, Vanta, Jira) run in parallel via ThreadPoolExecutor
- Data Validator checks CVE IDs, dates, and URLs before writing
- Data Merger upserts records by unique identifier, preserving historical data and manual entries
- Dashboard Calculator aggregates metrics by severity and priority
- Sheet Writer outputs to multiple tabs with formula hyperlinks and sheet protection
- Google Chat Notifier alerts on new critical/high findings and SLA deadline warnings
Collector design:
- GitHub Collector - incremental sync: sorts by updated_at descending, terminates early when it hits records older than last sync. Filters archived repos, supports an inclusive repo allowlist.
- Vanta Collector - OIDC authentication, pulls from both active findings and remediation endpoints. Always full sync (API limitation).
- Jira Collector - incremental JQL queries with
updated >=filter. Extracts multiple Dependabot links per issue via regex parsing. - All collectors are failure-isolated: if one throws, the others complete and the system writes partial data with a PARTIAL_SUCCESS status.
Data merge logic:
- Upsert-based persistence: reads existing sheet data before writing, matches by GitHub Alert URL / Vanta Finding ID / Jira Issue Key
- Field preservation: retains values when APIs no longer provide them (e.g., Vanta stops returning title/severity once a finding is remediated, but I keep the original values)
- Manual entry protection: "Fixed in Release" column stays editable while all other columns are formula-protected
- Status lifecycle tracking: open to dismissed/fixed/retracted, with retraction for repos removed from the scan list
- Schema migration: automatically adds new tracking columns to existing sheets without data loss
Alerting:
- Google Chat webhook integration for real-time notifications
- Alerts on: new Critical/High vulnerabilities, SLA deadline warnings (10 days and 3 days before deadline)
- Delivery status tracking to avoid duplicate alerts
Testing Strategy
253 tests across 68 test files. The testing approach is split between deterministic unit tests and property-based correctness validation:
Property-based tests (28 properties via Hypothesis):
- Collector failure isolation: any subset of collectors can fail without crashing the system
- Consolidated view properties: merged records never lose data, deduplication is stable
- Dashboard aggregation: metric sums are consistent with source data
- Field extraction: URL parsing and CVE ID extraction handle arbitrary inputs
- Fixed vulnerability exclusion: remediated records never count as active
- GitHub alert URL matching: the regex correctly identifies valid Dependabot URLs across any repository/alert combination
- Data merger properties: upsert logic is idempotent, historical dates are immutable
Deterministic tests cover:
- Each collector's API response parsing and error handling
- Configuration loading and validation
- Alert message formatting
- Date formatting edge cases
- Sheet writer protection logic
- Backward compatibility with schema changes
- Performance monitoring accuracy
100+ iterations per property test ensures edge cases surface.
Impact & Scale
- Consolidates vulnerability data from 3 APIs into a single dashboard updated on schedule
- Parallel collection reduces sync time from ~60s (sequential) to ~20-30s
- Incremental sync minimizes API calls after the initial full sync
- 253 tests with 28 property-based validations ensure merge correctness across schema changes
- Feeds directly into FedRAMP continuous monitoring evidence: clear first_seen/last_seen/status lifecycle for every finding
- SLA alerting catches approaching deadlines before they're missed
- Manual entry preservation means the security team can annotate records without their work being overwritten