Data Products & Projects

Clean, verified, provenance-tracked datasets and AI benchmarks for structured finance, company filings, and machine-readable financial infrastructure.

Daniel Cheah • Projects
Data Products

Licensable datasets and open benchmarks

These projects are designed around a simple thesis: models are increasingly commoditised, but clean, verified, provenance-rich financial data is scarce. The focus is not generic dashboards but analyst-grade data products that humans, LLMs, and downstream data pipelines can all use.

UK Company Financials

Available for Licensing

An ML-ready dataset of 3.7 million UK company filings across approximately 3.5 million companies, parsed from Companies House iXBRL accounts and normalised across the FRS 102 and FRS 105 taxonomies.

  • Turnover, assets, liabilities, net assets, funding structure, and employees.
  • Per-figure provenance for auditability and model-training confidence.
  • Normalised UK GAAP fields, GDPR-clean design, and planned quarterly refreshes.
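
To make "per-figure provenance" concrete, here is a minimal sketch of what a single record could look like. It is an illustration only: the class, the field names, the company number, the filing identifier, and the iXBRL tag shown are all assumptions made for the example, not the dataset's actual schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Figure:
    """One reported number plus where it came from (hypothetical schema)."""
    company_number: str     # Companies House company number (placeholder below)
    name: str               # normalised field name, e.g. "turnover"
    value: Optional[int]    # amount in GBP, or None if the filing omits it
    period_end: str         # ISO date of the accounting period end
    taxonomy: str           # "FRS 102" or "FRS 105"
    source_tag: str         # iXBRL concept the value was read from (illustrative)
    source_document: str    # identifier of the filing it was parsed from


def cite(fig: Figure) -> str:
    """Render a short audit trail for one figure."""
    return (
        f"{fig.name}={fig.value} "
        f"[{fig.taxonomy} {fig.source_tag} in {fig.source_document}, "
        f"period ending {fig.period_end}]"
    )


# Entirely made-up example values.
example = Figure(
    company_number="00000000",
    name="turnover",
    value=1_250_000,
    period_end="2023-12-31",
    taxonomy="FRS 102",
    source_tag="core:TurnoverRevenue",
    source_document="example-filing.xhtml",
)
print(cite(example))
```

Carrying the source tag and document alongside every value is what lets an analyst, or a model-training pipeline, trace a number back to the filing it came from.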

UK Securitisation SPV & Lender Map

Available for Licensing

A curated, structured-finance map of 551 UK securitisation SPVs and non-bank lenders, grouped by shelf, sponsor, and asset class, with charges-register intelligence converted into analyst-ready fields.

  • RMBS, specialist mortgage, auto ABS, card ABS, and equity release coverage.
  • Charges register mapping across trustees, fixed/floating security, and deal dates.
  • Designed for market mapping, originator analysis, and structured-credit research.

SEC EDGAR Securitisation Connector & Skills

Open Source

A free, open-source Claude / Cowork plugin with a local SEC EDGAR data connector that turns US structured-finance filings into analyst-ready output. It searches registered ABS and CMBS deals, pulls prospectuses and investor reports, and runs loan-level analysis on Form ABS-EE data. Contributed back to Anthropic’s financial-services repository.

  • Local MCP connector (Python, standard library) exposing six tools; streams 100 MB+ ABS-EE loan tapes with flat memory.
  • Loan-level analytics for auto ABS and conduit CMBS: pool stratifications, balance-weighted coupon / FICO / DSCR / debt yield, property and geographic concentration, and the maturity wall.
  • The only no-subscription connector in the marketplace: it uses free, public, rights-clean EDGAR data. Validated on real deals; licensed under Apache 2.0.
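
As a rough illustration of how a large loan tape can be read with flat memory while computing a balance-weighted statistic, the sketch below uses only the Python standard library. It is not the connector's code: the function names are hypothetical, and the element names ("assets", "balance", "rate") are simplified placeholders rather than the real ABS-EE schema.

```python
import io
import xml.etree.ElementTree as ET


def local(tag: str) -> str:
    """Strip any '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]


def weighted_average(source, record_tag, balance_tag, value_tag):
    """Stream records from XML and return the balance-weighted mean of value_tag.

    Records with a missing or non-numeric balance or value are skipped.
    Returns None if no usable balance was found.
    """
    weighted_sum = 0.0
    total_balance = 0.0
    context = ET.iterparse(source, events=("start", "end"))
    _event, root = next(context)  # first event is the start of the root element
    for event, elem in context:
        if event != "end" or local(elem.tag) != record_tag:
            continue
        fields = {local(child.tag): child.text for child in elem}
        try:
            balance = float(fields[balance_tag])
            value = float(fields[value_tag])
        except (KeyError, TypeError, ValueError):
            balance = value = 0.0  # contributes nothing to either sum
        weighted_sum += balance * value
        total_balance += balance
        root.clear()  # discard processed records so memory does not grow
    return weighted_sum / total_balance if total_balance else None


# Tiny made-up tape with two loans.
sample = b"""<assetData>
  <assets><balance>10000</balance><rate>0.05</rate></assets>
  <assets><balance>30000</balance><rate>0.07</rate></assets>
</assetData>"""

wac = weighted_average(io.BytesIO(sample), "assets", "balance", "rate")
print(round(wac, 4))  # 0.065
```

The same loop shape works for any balance-weighted field (coupon, FICO, DSCR): only the running sums are kept, so memory use does not depend on the size of the file.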

UK Accounts LLM Benchmark

Open Benchmark

Can frontier LLMs read UK company accounts? A 1,000-question verified benchmark and five-model evaluation comparing proprietary and open-weight systems. The core finding: LLMs can read the accounts well; the defensible asset is the clean, verified data layer.

  • Benchmark coverage across Claude, GPT-5.5, Gemini, DeepSeek V4 Pro, and GLM 5.2.
  • Designed to show where model performance ends and data infrastructure begins.
  • Supports the commercial thesis behind the UK company financials dataset.
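
For readers wondering how a verified benchmark of this kind might be scored, the following is a minimal sketch that compares each model's numeric answer with a verified value and reports per-model accuracy. The scoring rule (exact match unless a relative tolerance is given), the function names, the model labels, and the sample data are all assumptions for illustration; they are not the benchmark's published method or results.

```python
from collections import defaultdict


def is_correct(predicted, expected, rel_tol=0.0):
    """True if predicted is within rel_tol (relative) of the verified value."""
    if predicted is None:
        return False
    return abs(predicted - expected) <= rel_tol * abs(expected)


def accuracy_by_model(answers, gold, rel_tol=0.0):
    """answers: iterable of (model, question_id, predicted) tuples.
    gold: dict mapping question_id to the verified value."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for model, qid, predicted in answers:
        totals[model] += 1
        hits[model] += is_correct(predicted, gold[qid], rel_tol)
    return {model: hits[model] / totals[model] for model in totals}


# Made-up verified values and answers from two placeholder models.
gold = {"q1": 1_250_000, "q2": -4_300}
answers = [
    ("model_a", "q1", 1_250_000),
    ("model_a", "q2", -4_300),
    ("model_b", "q1", 1_250_000),
    ("model_b", "q2", 4_300),  # sign error on a net-liabilities figure
]
scores = accuracy_by_model(answers, gold)
print(scores)  # {'model_a': 1.0, 'model_b': 0.5}
```

Scoring against verified values is only meaningful if those values are themselves trustworthy, which is the point the benchmark makes about the underlying data layer.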
Profile

Hugging Face

Dataset distribution profile for published samples, gated datasets, and machine-readable data products.