Architecture Decisions

Production architectures I've designed and led — real systems handling real traffic under real constraints.

These aren't textbook designs. They're the result of navigating tradeoffs between speed and correctness, convincing teams to invest in the right abstractions, and learning what actually matters when systems can't go down.

Embedded Insurance Platform

A production architecture handling high-throughput traffic with strict regulatory and financial correctness requirements.

The Context

Embedded insurance integrates coverage directly into partner transaction journeys — device protection during mobile checkout, loan protection bundled with credit disbursement, travel insurance in ticket booking. It's highly automated, API-driven, real-time, and partner-distributed across e-commerce, fintech, and NBFC platforms.

Building this platform meant solving a fundamentally different problem than traditional insurance. We needed a configurable product engine, a reliable issuance pipeline, a financially correct ledger, and a compliant reporting backbone — all without rewriting the core for each new partner or line of business.

Architecture Overview

The platform is organized into 6 bounded contexts:

Domain	Responsibility
Product Configuration	Product, Plan, Cover, Benefit, SKU hierarchy
Pricing & Eligibility	Grid-based pricing, rule engine for eligibility
Policy & Issuance	Proposal → Underwriting → Policy → Certificate
Finance & Ledger	Double-entry accounting, float wallet
Claims	FNOL, assessment, approval, payment, reserve tracking
Partner & Overlay	Partner config, pricing/commission overrides

Key Architecture Decisions & Why I Made Them

Decision 1: Immutable Product Versioning

The problem: Early on, we allowed in-place plan edits. A pricing change propagated to active policies in production — incorrect premiums, regulatory exposure, and significant cleanup effort.

The decision: Plans are immutable once published. New version, soft-deprecate the old one, maintain backward compatibility. Partner overlays (custom pricing, modified eligibility) live in a separate layer.

The tradeoff: More storage, more version management complexity. But we never had that issue again. In a regulated industry, immutability is cheaper than incorrectness.
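A minimal sketch of the versioning rule, assuming an in-memory catalog. The names PlanVersion and PlanCatalog and the single premium field are hypothetical stand-ins, not the production schema:

```python
from dataclasses import dataclass, replace
from decimal import Decimal
from typing import Dict, Tuple


@dataclass(frozen=True)
class PlanVersion:
    """One published version of a plan; frozen so its terms cannot be edited."""
    plan_id: str
    version: int
    premium: Decimal
    deprecated: bool = False


class PlanCatalog:
    """Hypothetical in-memory catalog keyed by (plan_id, version)."""

    def __init__(self) -> None:
        self._versions: Dict[Tuple[str, int], PlanVersion] = {}
        self._latest: Dict[str, int] = {}

    def publish(self, plan_id: str, premium: Decimal) -> PlanVersion:
        previous = self._latest.get(plan_id)
        if previous is not None:
            old = self._versions[(plan_id, previous)]
            # Soft-deprecate: only the lifecycle flag changes, never the terms.
            self._versions[(plan_id, previous)] = replace(old, deprecated=True)
        version = (previous or 0) + 1
        plan = PlanVersion(plan_id, version, premium)
        self._versions[(plan_id, version)] = plan
        self._latest[plan_id] = version
        return plan

    def get(self, plan_id: str, version: int) -> PlanVersion:
        return self._versions[(plan_id, version)]

    def latest(self, plan_id: str) -> PlanVersion:
        return self._versions[(plan_id, self._latest[plan_id])]
```

A policy would store the (plan_id, version) pair it was issued against, so publishing a new version cannot change the premium of a policy already in force.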

Decision 2: Grid-Based Pricing Over Runtime Actuarial Models

The problem: Traditional insurance uses complex actuarial models at runtime. Embedded insurance operates at a scale and speed where runtime computation introduces latency and failure points.

The decision: Pre-computed pricing grids versioned and preloaded in memory, with partner-level overrides.

Why this matters: At high throughput, every millisecond of pricing computation counts. Grids give us deterministic, auditable pricing with no model evaluation on the request path, so a quote is a memory lookup rather than a computation that can time out or fail. Grid versioning also simplified regulatory compliance: when regulators ask "what rate did this policy get?", we can point to the exact grid version.
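A small sketch of a versioned grid lookup with a partner override. The grid dimensions (price band, tenure in months) and the PricingGrid name are assumptions for illustration; the real grid keys are not described here:

```python
from decimal import Decimal
from types import MappingProxyType
from typing import Mapping, Optional, Tuple

GridKey = Tuple[str, int]  # (price band, tenure in months); hypothetical dimensions


class PricingGrid:
    """Pre-computed rates for one grid version, held read-only in memory."""

    def __init__(self, version: str, rates: Mapping[GridKey, Decimal]) -> None:
        self.version = version
        # Copy, then expose a read-only view so the loaded grid cannot be mutated.
        self._rates = MappingProxyType(dict(rates))

    def quote(
        self,
        key: GridKey,
        overrides: Optional[Mapping[GridKey, Decimal]] = None,
    ) -> Tuple[Decimal, str]:
        """Return (premium, source) so the source can be stored on the policy."""
        if overrides and key in overrides:
            return overrides[key], f"{self.version}+override"
        return self._rates[key], self.version
```

Returning the grid version alongside the premium is what makes the audit question answerable later: the policy record carries the version that priced it.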

Decision 3: Double-Entry Immutable Ledger

The problem: This is where most naive insurance systems fail. When I inherited the financial pipeline, there were data inconsistencies between the ledger, finance systems, and compliance reports. Financial closures were slow.

The decision: Every financial movement creates balanced journal entries in an immutable, append-only ledger. Never mix business tables with ledger tables. Never update ledger rows — ever.

Premium collection: Debit Partner Float → Credit Premium Receivable
Allocation: Debit Premium Receivable → Credit Insurer Payable + Commission Payable
Settlement: Debit Insurer Payable → Credit Bank Account
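The flows above can be sketched as an append-only ledger that rejects any entry whose debits and credits do not balance. This is a minimal in-memory illustration; the Ledger and Line names are hypothetical, and corrections are modeled as mirror-image reversal entries rather than updates:

```python
from dataclasses import dataclass
from decimal import Decimal
from typing import List, Sequence, Tuple

ZERO = Decimal("0")


@dataclass(frozen=True)
class Line:
    account: str
    debit: Decimal = ZERO
    credit: Decimal = ZERO


class Ledger:
    """Append-only journal: entries are added, never updated or deleted."""

    def __init__(self) -> None:
        self._entries: List[Tuple[str, Tuple[Line, ...]]] = []

    def post(self, reference: str, lines: Sequence[Line]) -> None:
        debits = sum((line.debit for line in lines), ZERO)
        credits = sum((line.credit for line in lines), ZERO)
        if debits != credits or debits == ZERO:
            raise ValueError(f"unbalanced or empty entry {reference}")
        self._entries.append((reference, tuple(lines)))

    def reverse(self, original: str, reference: str) -> None:
        """Correct a posted entry by appending its mirror image."""
        for ref, lines in self._entries:
            if ref == original:
                mirrored = [
                    Line(l.account, debit=l.credit, credit=l.debit) for l in lines
                ]
                self.post(reference, mirrored)
                return
        raise KeyError(original)

    def balance(self, account: str) -> Decimal:
        """Net balance as debits minus credits across all entries."""
        return sum(
            (
                line.debit - line.credit
                for _, lines in self._entries
                for line in lines
                if line.account == account
            ),
            ZERO,
        )
```

For example, an allocation of a 1000 premium with a hypothetical 850/150 insurer/commission split would post one debit line on Premium Receivable and two credit lines; a one-sided entry raises ValueError before anything is written.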

The organizational challenge: The engineering team initially wanted a simpler approach. An immutable ledger is harder to build, harder to debug, and requires more storage. I had to make the case that in a regulated financial system, correctness is non-negotiable — and that the alternative (a mutable financial record) was a ticking time bomb.

The result: Closure time reduced significantly. Reporting accuracy improved. The ledger became the single source of truth for regulatory and financial reporting across all partners.

Decision 4: Failure Handling as a First-Class Concern

The philosophy: In high-throughput insurance systems, every failure mode needs a recovery path, and every recovery path needs testing.

Scenario	Strategy
Payment success, issuance failure	Compensation workflow → Retry → If irrecoverable, refund + reversal entry
Policy created, ledger fails	Mark pending, retry ledger. Never leave a policy without ledger entries
Ledger posted, COI fails	Non-financial failure — retry independently, never reverse ledger
Duplicate requests	Idempotency keys + unique proposal reference + state validation

The lesson: Designing failure handling up front is what separates systems that work at demo scale from systems that work at production scale. At high volume, even a tiny failure rate without a recovery path is unacceptable.
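The duplicate-request row can be sketched as follows. This is an in-memory illustration with hypothetical names (IssuanceGateway, handle); in a real system the two dictionaries would be database tables with unique constraints, since plain dictionaries are not safe across processes:

```python
from typing import Callable, Dict, Set, Tuple


class IssuanceGateway:
    """Replays completed requests and refuses to issue a proposal twice."""

    def __init__(self, issue: Callable[[str], str]) -> None:
        self._issue = issue
        # idempotency key -> (proposal reference, policy id)
        self._by_key: Dict[str, Tuple[str, str]] = {}
        self._issued_proposals: Set[str] = set()

    def handle(self, idempotency_key: str, proposal_ref: str) -> str:
        if idempotency_key in self._by_key:
            stored_ref, policy_id = self._by_key[idempotency_key]
            if stored_ref != proposal_ref:
                raise ValueError("idempotency key reused with a different proposal")
            # Retry of a completed request: return the stored result.
            return policy_id
        if proposal_ref in self._issued_proposals:
            # Same proposal under a new key: reject instead of issuing twice.
            raise ValueError("proposal already issued")
        policy_id = self._issue(proposal_ref)
        self._by_key[idempotency_key] = (proposal_ref, policy_id)
        self._issued_proposals.add(proposal_ref)
        return policy_id
```

A client that times out and retries with the same key gets the original policy id back, and the underlying issuance function runs only once.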

Decision 5: Reporting Architecture — Separate Reads from Writes

The decision: Operational DB → Event Stream → Reporting Warehouse → Regulatory Reports. Never run heavy reports on the transactional database. Reporting is eventually consistent — never block issuance for a reporting write.

Why: When regulators and finance teams need reports, they need them accurate and fast. But generating reports from a high-throughput database would degrade the very system generating the data. This separation gave us compliant reporting with audit trails without sacrificing platform performance.
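A toy version of the read/write split, assuming a simple (partner_id, premium) event shape. EventStream, PremiumReport, and issue_policy are hypothetical names, and a Python list stands in for the real event stream:

```python
from collections import defaultdict
from decimal import Decimal
from typing import Dict, List, Tuple

Event = Tuple[str, Decimal]  # (partner_id, premium); hypothetical event shape


class EventStream:
    """Append-only stand-in for the stream between the two stores."""

    def __init__(self) -> None:
        self._events: List[Event] = []

    def append(self, event: Event) -> None:
        self._events.append(event)

    def read_from(self, offset: int) -> List[Event]:
        return self._events[offset:]


class PremiumReport:
    """Read model that catches up on its own schedule."""

    def __init__(self, stream: EventStream) -> None:
        self._stream = stream
        self._offset = 0
        self.totals: Dict[str, Decimal] = defaultdict(Decimal)

    def catch_up(self) -> int:
        """Apply events not yet seen; return how many were applied."""
        batch = self._stream.read_from(self._offset)
        for partner_id, premium in batch:
            self.totals[partner_id] += premium
        self._offset += len(batch)
        return len(batch)


def issue_policy(stream: EventStream, partner_id: str, premium: Decimal) -> None:
    # The write path only appends an event; it never waits on the report.
    stream.append((partner_id, premium))
```

The report is stale until catch_up runs, which is the eventual consistency described above: issuance never blocks on it, and the stored offset lets the report resume or be rebuilt from the start of the stream.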

Production Guarantees

Daily reconciliation comparing ledger vs float vs settlement
Event replay safety for disaster recovery
Versioned pricing models with full audit trails
Immutable ledger (append-only, no updates)
Unique constraints on proposal references
Idempotency keys on all mutating APIs
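As an illustration of the daily reconciliation guarantee, here is a sketch that assumes each of the three sources can report one total per partner for the day and that the three totals should agree. That matching rule and the function name are assumptions; a real job would also have to handle timing windows and partial settlements:

```python
from decimal import Decimal
from typing import Dict, Tuple

Totals = Dict[str, Decimal]  # partner_id -> amount for the day


def reconcile(
    ledger: Totals, float_debits: Totals, settled: Totals
) -> Dict[str, Tuple[Decimal, Decimal, Decimal]]:
    """Return partners whose ledger, float, and settlement totals disagree."""
    zero = Decimal("0")
    mismatches: Dict[str, Tuple[Decimal, Decimal, Decimal]] = {}
    for partner in sorted(set(ledger) | set(float_debits) | set(settled)):
        row = (
            ledger.get(partner, zero),
            float_debits.get(partner, zero),
            settled.get(partner, zero),
        )
        # A partner missing from one source counts as zero there, so it is flagged.
        if len(set(row)) != 1:
            mismatches[partner] = row
    return mismatches
```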

Platform Results

High-throughput traffic at peak with zero-downtime migrations
Hundreds of dealers onboarded with real-time settlement
Full regulatory compliance across all partners, with zero non-compliances across audits

Unified Partner Integration Layer

How I eliminated fragmented partner APIs across business lines and built a platform nobody asked for.

The Problem I Saw

The company had grown fast. Each business line had its own partner integration logic — different authentication mechanisms, different API contracts, different onboarding flows. Every new partner meant rebuilding integration logic that already existed somewhere else in the org.

The cost wasn't just engineering hours. It was inconsistent partner experience, painful debugging, and an inability to launch new products quickly because every launch required re-integration with existing partners.

The Architecture Decision

I proposed and led a unified partner integration platform standardizing APIs, authentication, and onboarding across all business lines.

The key design decisions:

Shared authentication and authorization layer — one authentication mechanism for all partner interactions, regardless of product line
Unified policy issuance and claim service standards — common API contracts with product-specific extensions rather than product-specific APIs
Reusable SDKs and unified API documentation — partners integrate once, access all product lines
Multi-LOB coordination layer — routing and orchestration across business lines without coupling

The Organizational Challenge

The technical design was the easy part. The hard part was getting multiple independent teams to agree on shared standards. Each team had their own integration patterns.

I chose to lead by influence rather than authority. I didn't have organizational control over the other teams. So I built the platform within my team first, proved the value, and let adoption happen through results:

Started with my own team as the proving ground
Demonstrated measurable improvements in integration speed
Made the alternative — continuing with fragmented APIs — obviously worse

Results

Engineering efficiency improved significantly by eliminating redundant integration logic
New partner integration timelines reduced substantially
Directly enabled rapid launches of multiple new product lines

Partner Onboarding Framework

How I turned a multi-day manual process into a self-service flow.

The Problem

Every new partner onboarding required multiple days of engineering time — credential setup, configuration, validation, QA checks, deployment approvals. With many active partners and a growing pipeline, the engineering team was becoming a bottleneck to business growth.

The Architecture Decision

Build a self-service onboarding platform that eliminates engineering involvement from routine partner setup:

Self-serve APIs and UI workflows for configuration and credential management
Automated validation and QA checks — no engineering review needed for standard onboarding
Workflow automation integration for access and deployment approvals
Built on top of the unified API layer for instant partner connectivity
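An automated validation check might look like the sketch below. The field names (partner_id, callback_url, product_lines) and the rules are hypothetical examples of what a "standard onboarding" check could cover, not the actual framework's schema:

```python
from typing import Any, Dict, List
from urllib.parse import urlparse

REQUIRED_FIELDS = ("partner_id", "callback_url", "product_lines")


def validate_partner_config(config: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the config can proceed."""
    errors = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in config]
    if errors:
        return errors
    url = urlparse(str(config["callback_url"]))
    if url.scheme != "https" or not url.netloc:
        errors.append("callback_url must be an absolute https URL")
    if not config["product_lines"]:
        errors.append("product_lines must not be empty")
    return errors
```

Returning every problem at once, instead of failing on the first, lets a partner fix a submission in one pass without an engineer in the loop.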

The Tradeoff

The pushback was real: "We only onboard a few partners a month. Why build a platform for that?"

The bet wasn't about current volume. It was about where the business was heading. This was an investment in organizational scalability — and it paid off when we needed to rapidly onboard partners for new product launches.

Results

Onboarding time reduced from days to hours
Many partners onboarded with near-zero engineering involvement
Internal throughput improved significantly
Framework reused across all business lines for partner setup

New Business Line: Life Insurance

Architecture decisions behind launching a new insurance vertical and a hybrid product.

The Problem

The company was a General Insurance (GI) provider entering Life Insurance (LI), a completely different regulatory domain, actuarial model, and compliance framework. And after establishing the initial product, the challenge deepened: could we combine GI and LI into a single hybrid product?

Key Architecture Decisions

Decision 1: Build for configurability, not speed. The team wanted to hardcode assumptions to ship faster. I pushed for building issuance, endorsement, and claims systems that could extend to future products. This slowed the initial launch — but subsequent products shipped faster because the foundations were right.

Decision 2: Cross-entity policy ownership model. For the hybrid product, we designed a flexible ownership model allowing either LOB (GI or LI) to lead issuance while maintaining independent claims control. This required dual-regulatory orchestration — premium apportioning rules that satisfied both regulatory frameworks simultaneously.

Decision 3: Compliance architecture. When new regulatory guidelines required separating covers with different durations under distinct master policies, we designed a flexible policy framework — issuing multiple policies linked under a virtual "shallow" policy while keeping partner integration unchanged. Zero partner-side API changes.
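One way to picture the linked-policy idea in Decision 3: group covers by duration, issue one policy per duration, and return them under the single reference the partner already uses. Everything here (Cover, Policy, LinkedPolicy, the id format) is a hypothetical illustration of that shape, not the production model:

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Sequence, Tuple


@dataclass(frozen=True)
class Cover:
    name: str
    duration_months: int


@dataclass(frozen=True)
class Policy:
    policy_id: str
    duration_months: int
    covers: Tuple[str, ...]


@dataclass(frozen=True)
class LinkedPolicy:
    """Virtual parent: the partner sees one reference, the insurer sees N policies."""
    reference: str
    children: Tuple[Policy, ...]


def issue_linked(reference: str, covers: Sequence[Cover]) -> LinkedPolicy:
    by_duration: Dict[int, List[str]] = defaultdict(list)
    for cover in covers:
        by_duration[cover.duration_months].append(cover.name)
    # One child policy per distinct duration, in ascending duration order.
    children = tuple(
        Policy(f"{reference}-{months}M", months, tuple(names))
        for months, names in sorted(by_duration.items())
    )
    return LinkedPolicy(reference, children)
```

Because the partner-facing reference stays the same, the split into multiple policies is invisible at the integration boundary.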

The Organizational Lesson

Building in regulated environments taught me that technical competence alone isn't enough. Earning the trust of compliance, actuarial, and finance teams — all of whom had legitimate concerns about a tech team moving fast — was more important than the architecture itself. I invested weeks in cross-functional alignment before writing a line of code.

Results

Successfully launched new Life Insurance vertical with zero regulatory non-compliances
Delivered a first-of-its-kind hybrid GI + LI product
Core systems reused for subsequent product launches
Scalable compliance architecture for future regulatory updates

Payments & Checkout Platform (E-commerce)

Architecture decisions for systems handling high-throughput traffic during flash sales.

The Context

At a major e-commerce company, I led checkout, payments, and order management — platforms that could not go down during flash sales and high-traffic events. This was where I learned to think about scale, reliability, and the operational reality of mission-critical systems.

Key Architecture Decisions

Decision 1: Closed-Wallet System for Instant Payments

The problem: Refunds through external payment providers took 5–7 days to settle. In e-commerce, slow refunds destroy customer trust.

The decision: Built an internal wallet for instant payments and refunds without external settlement delays. Users could add money and checkout instantly. Refunds credited immediately to wallet balance.

The tradeoff: Managing an internal float wallet introduced financial reconciliation complexity. But offering instant refunds instead of multi-day bank reversals was a customer experience decision that justified the engineering investment.

Decision 2: Multi-Provider Payment Gateway with Automatic Failover

The problem: Single payment provider = single point of failure during the highest-traffic moments.

The decision: Integrated multiple payment providers with:

Automatic retry with provider rotation on failure
Configurable routing rules (cost optimization, success rate)
Health monitoring with circuit breakers

The result: Very high payment success rate through intelligent multi-provider routing.
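A compact sketch of provider rotation guarded by circuit breakers. CircuitBreaker and PaymentRouter are hypothetical names, providers are modeled as plain functions returning True on success, and routing is by fixed priority rather than by cost or success rate:

```python
import time
from typing import Callable, Dict, List, Optional, Tuple

Provider = Tuple[str, Callable[[int], bool]]  # (name, charge function)


class CircuitBreaker:
    def __init__(self, threshold: int = 3, cooldown: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._threshold = threshold
        self._cooldown = cooldown
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def available(self) -> bool:
        if self._opened_at is None:
            return True
        # After the cooldown, allow requests again; a failure re-opens the breaker.
        return self._clock() - self._opened_at >= self._cooldown

    def record(self, ok: bool) -> None:
        if ok:
            self._failures = 0
            self._opened_at = None
            return
        self._failures += 1
        if self._failures >= self._threshold:
            self._opened_at = self._clock()


class PaymentRouter:
    def __init__(self, providers: List[Provider], threshold: int = 3,
                 cooldown: float = 30.0) -> None:
        self._providers = providers
        self._breakers: Dict[str, CircuitBreaker] = {
            name: CircuitBreaker(threshold, cooldown) for name, _ in providers
        }

    def charge(self, amount_minor: int) -> str:
        """Try providers in priority order; return the name of the one that succeeded."""
        for name, charge_fn in self._providers:
            breaker = self._breakers[name]
            if not breaker.available():
                continue  # skip providers whose breaker is open
            try:
                ok = charge_fn(amount_minor)
            except Exception:
                ok = False
            breaker.record(ok)
            if ok:
                return name
        raise RuntimeError("all providers failed or are unavailable")
```

One caveat the sketch glosses over: rotating to another provider is only safe when the first attempt definitively failed. A timeout with unknown outcome needs a status check or an idempotent charge reference before retrying elsewhere.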

Decision 3: Flash Sale Concurrency Patterns

The problem: During flash sales, thousands of users try to buy the same limited inventory simultaneously. Overselling = customer complaints, refund costs, and trust erosion.

The decisions:

Pessimistic locking at the inventory layer to prevent oversells
Time-limited cart holds to prevent inventory stockpiling
Queue-based checkout during extreme traffic spikes
Inventory holds with timeout — reserve stock during checkout, release on failure

The result: Zero inventory oversells during flash sales. High throughput sustained during sale events.
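The hold-with-timeout pattern can be sketched in a single process as below. The threading lock here is only an in-process analogue of pessimistic locking (at the database layer this would typically be a row lock), and the Inventory class and its method names are hypothetical:

```python
import threading
import time
from typing import Callable, Dict


class Inventory:
    """Reserve stock under a lock; unconfirmed holds expire and return to stock."""

    def __init__(self, stock: int, hold_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._lock = threading.Lock()
        self._available = stock
        self._hold_seconds = hold_seconds
        self._clock = clock
        self._holds: Dict[str, float] = {}  # hold id -> expiry time

    def _release_expired(self) -> None:
        # Caller must hold the lock.
        now = self._clock()
        for hold_id in [h for h, expiry in self._holds.items() if expiry <= now]:
            del self._holds[hold_id]
            self._available += 1

    def reserve(self, hold_id: str) -> bool:
        with self._lock:
            self._release_expired()
            if hold_id in self._holds:
                return True  # repeated reserve for the same cart is a no-op
            if self._available == 0:
                return False
            self._available -= 1
            self._holds[hold_id] = self._clock() + self._hold_seconds
            return True

    def confirm(self, hold_id: str) -> bool:
        """Convert a live hold into a sale; False if the hold expired or never existed."""
        with self._lock:
            self._release_expired()
            return self._holds.pop(hold_id, None) is not None

    def release(self, hold_id: str) -> None:
        """Give stock back when checkout fails or is abandoned."""
        with self._lock:
            if self._holds.pop(hold_id, None) is not None:
                self._available += 1
```

Because the check and the decrement happen under the same lock, concurrent buyers cannot both take the last unit, and a hold that is never confirmed returns to stock once it expires.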

Decision 4: HSM-Based Key Management

The problem: Payment gateway credentials stored in application config or environment variables = security vulnerability.

The decision: HSM-based encryption for all payment credentials. Credentials never stored in application config. Key rotation without service restart.

Why this matters: Not glamorous. Not visible. But foundational security decisions are the ones that prevent the headlines you never want to see.

Core Platform Modernization

Zero-downtime migration of high-throughput systems to a unified core platform.

The Problem

Multiple products ran on legacy stacks with frequent scalability issues. The migration to a modern platform was essential for unification and future product reuse — but it involved migrating systems handling high-throughput traffic at peak.

The Architecture Approach

Phase-wise migration of issuance and claim modules with rollback capability at every stage
Domain-driven design patterns for clean service boundaries
Observability and metrics tracking for faster RCA and rollout safety
Shadow traffic testing before cutting over production traffic
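Shadow traffic testing, in its simplest form, can be sketched like this: the legacy system keeps serving the response while the candidate handles a copy of the request and any difference is recorded. The function name and the synchronous call are simplifications; a real setup would run the shadow call asynchronously and make sure it has no side effects:

```python
from typing import Any, Callable, List, Tuple

Mismatch = Tuple[Any, Any, Any]  # (request, legacy response, candidate response or error)


def serve_with_shadow(
    request: Any,
    legacy: Callable[[Any], Any],
    candidate: Callable[[Any], Any],
    mismatches: List[Mismatch],
) -> Any:
    """Serve from legacy; compare the candidate's answer without exposing it."""
    response = legacy(request)
    try:
        shadow = candidate(request)
        if shadow != response:
            mismatches.append((request, response, shadow))
    except Exception as exc:
        # A candidate failure is recorded, never surfaced to the caller.
        mismatches.append((request, response, repr(exc)))
    return response
```

An empty mismatch list over a representative traffic window is the evidence that justifies moving a phase forward; a non-empty one shows exactly which requests diverge.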

The Leadership Challenge

The hardest part wasn't the technical migration — it was managing organizational risk. A failed migration at high volume would impact real customers, real partners, and real revenue. I had to balance the team's confidence with appropriate caution, ensure fallback plans were tested (not just documented), and maintain stakeholder trust throughout a multi-phase rollout.

Results

Zero downtime during migration — including the highest-traffic partner
Uptime and infrastructure efficiency improved significantly
Unified platform across all business lines, enabling faster releases
Platform ready for new product lines

What Ties These Decisions Together

Across all these systems, three principles have guided my architecture decisions:

Correctness over convenience — Immutable ledgers, idempotency keys, and append-only logs are harder to build. But in regulated, high-throughput environments, shortcuts become liabilities.
Build the platform before you need it — The unified integration layer, onboarding framework, and configurable product engine all required upfront investment that wasn't tied to immediate features. Every one of them paid for itself multiple times over.
The organizational decision is harder than the technical one — Choosing the right architecture is important. But getting teams, stakeholders, and regulators aligned behind that choice is where leadership actually happens.