Project Record
Reconciliation of Financial Instruments
Book-to-book reconciliation framework for instruments, quantities, prices, and currencies.
Purpose
Match internal and external records while surfacing breaks beyond configurable tolerances.
Context
During platform or custodian migrations, internal books (e.g. Front Arena, order management) and external sources (prime brokers, custodians, fund admins) often disagree on positions, quantities, or prices. Manual reconciliation is error-prone and does not scale. This framework was built to support a probabilistic instrument-matching pipeline so that breaks could be identified and resolved before cutover.
Problem
Records from different systems use different identifiers, rounding, and timing. Simple key-based joins miss valid matches (e.g. same ISIN, different lot sizes or price decimals) and produce false breaks. The system must tolerate small numerical differences while still flagging material discrepancies.
Matching Rule
Candidates are grouped by ISIN and currency, then matched on minimum quantity+price distance. Distance is normalized so that tolerance thresholds can be expressed in basis points or units.
Implementation
- Ingest: normalized positions from internal DB and external files (CSV/API).
- Grouping: by instrument identifier (ISIN) and currency to reduce search space.
- Scoring: weighted L1 or L2 distance on quantity and price; configurable weights and tolerances per asset class.
- Output: matched pairs, unmatched internal, unmatched external, and break report with suggested actions.
Trade-offs
Probabilistic matching improves recall but can hide systematic data issues if tolerances are too loose. We keep an audit trail of all matches and expose parameters so compliance can tune sensitivity.
Use case: inferring cross-system instrument ID mappings from trade logs
A natural extension of the reconciliation framework is statistical record linkage over logs and entities: treat each system’s instrument ID as an entity, each trade or position record as a log, and use probabilistic record linkage to score and match ID pairs based on how their logs co-occur and agree on attributes. The goal is to learn a mapping between instrument IDs across systems with controllable precision, using only routine trade/position logs—no master golden mapping required up front.
Working title (for a methods + use case piece):
Inferring Cross-System Instrument ID Mappings from Trade Logs via Probabilistic Record Linkage
Core idea in one line:
Treat each system’s instrument ID as an “entity,” each trade/position record as a “log,” then use probabilistic record linkage to score and match ID pairs based on how their logs co-occur and agree on attributes.
Problem & setting
- Two (or more) systems with different instrument IDs.
- Observed logs: (timestamp, side, qty, price, currency, venue, desk, …, system_id).
- Goal: learn a mapping between IDs across systems with controllable precision.
- Framing is generic (logs ↔ entities) so it reuses beyond finance (e.g. users ↔ devices, products ↔ catalogues).
Method (end-to-end pipeline)
A. Candidate generation (blocking)
Join in time windows (e.g. ±Δt) and coarse buckets (currency, venue, price bands, trading day). Keep pairs with at least k co-occurring events. Use recordlinkage indexers or Splink blocking rules to shrink comparisons while retaining likely matches.
B. Features (per candidate ID pair)
- Temporal: Jaccard of trading days; same-minute hit count; cross-correlation of event times.
- Price/size: fraction of co-events with |Δprice| < ε and |Δqty| < r·qty.
- Categorical: currency, venue, desk, ISIN (if present); description similarity (e.g. Jaro–Winkler).
- Distributional: KS/AD on price or size distributions.
- Optional: day-by-day re-weighting for corporate actions.
C. Scoring model
- Fellegi–Sunter (unsupervised; EM for m/u rates), or supervised logistic/GBM if ~200 hand-labelled pairs exist.
- Python: recordlinkage (FS / Naive Bayes / ECM), Splink (FS with scalable backends).
D. One-to-one assignment
Convert pair scores to a bipartite graph; maximum-weight matching (Hungarian / Kuhn–Munkres). Enforce 1:1 or relax to 1:many for legacy aliases.
E. Thresholding & calibration
Choose score threshold τ for target precision (e.g. 99.5%) on a validation slice; report recall/coverage; add a “gray zone” for human review.
Evaluation
- Ground truth: adjudicated sample or synthetic data with known mapping.
- Metrics: precision@τ, recall, F1, coverage, pairwise/cluster purity; error taxonomy (same-currency lookalikes, DST offsets, partial trading days, post–corporate-action rekeys).
- Ablations: impact of temporal vs price/size vs text features.
Results (concise)
- Table: precision/recall at several τ; confusion breakdown.
- Figure: PR curve or precision vs threshold.
- On realistic synthetic data (clock skew, rounding, corporate actions): >99% precision at useful coverage with minutes-scale runtimes.
Operational notes
- Privacy: if needed, privacy-preserving RL via Bloom-filter encodings for quasi-identifiers.
- Runtime: blocking + vectorized features → minutes on millions of logs with Splink/DuckDB/Spark.
- Failure modes: corporate actions, stale tick sizes, OTC vs lit venues, rounding.
Conclusion & reuse
General recipe: logs → features → Fellegi–Sunter score → graph match. Reusable for users↔devices, products↔catalogues, etc. Publish synthetic dataset + code (e.g. Zenodo DOI) for reproducibility.
Abstract (drop-in)
We present a practical method to infer cross-system mappings between instrument identifiers using only routine trade/position logs. Our approach frames the problem as probabilistic record linkage: we (i) generate candidate ID pairs via temporal and categorical blocking, (ii) compute agreement features from co-occurring events (time, price, size, venue, description), (iii) estimate match probabilities with the Fellegi–Sunter model, and (iv) enforce global one-to-one consistency via maximum-weight bipartite matching. On realistic synthetic data reflecting clock skew, rounding, and corporate actions, the method attains >99% precision at useful coverage with minutes-scale runtimes. We provide an open-source implementation and synthetic dataset to support reuse across domains where entities must be reconciled from passive logs.
Minimal reproducible code plan
- Libraries: pandas, recordlinkage or splink, scipy (Hungarian via
linear_sum_assignment) or networkx. - Synthetic generator: N base instruments; simulate trades across two systems; randomly permute IDs; inject noise (clock skew, rounding, missing fields) and some non-matches.
- Notebook 1: data gen + feature engineering.
- Notebook 2: FS scoring (or logistic), calibration, Hungarian matching.
- Notebook 3: evaluation + plots.
- Archive repo on Zenodo for DOI; cite in the paper.
Blog-post companion (non-confidential)
- Hook: “We mapped thousands of instrument IDs across systems—using only trade logs.”
- Three diagrams: pipeline, feature table, precision–coverage curve.
- Link to preprint + code DOI; short “try it yourself” snippet.
Tooling to mention
- Python Record Linkage Toolkit (indexing, FS/ECM).
- Splink (scalable FS; docs and guides).
- Fellegi–Sunter theory.
- Hungarian algorithm for final assignment.
- Bloom-filter PPRL if publishing with hashed keys.