Author: Wolyra

  • Post-Quantum Cryptography Migration: A 2026 Engineering Playbook

    Quantum computing concept with glowing crystalline structures
    Photo by Manuel on Unsplash

    The quantum threat to public-key cryptography is no longer theoretical, and the regulatory clocks are no longer abstract. NIST finalized the first post-quantum standards in August 2024. The NSA’s CNSA 2.0 mandate requires post-quantum cryptography across National Security Systems by 2030. Industry guidance, including from CISA and the major cloud providers, points to 2035 as the practical deadline for everything else. If your TLS termination, code signing, S/MIME, or VPN stack still relies exclusively on RSA or ECC in 2030, you will be migrating under duress.

    This is a playbook for engineering leaders who need to move from “we should look into PQC” to a multi-year migration program with measurable milestones. Once the algorithms are chosen, the work decomposes into four phases: inventory, hybrid deployment, code signing and PKI, and crypto-agility. None of them is optional, and the first one is harder than it looks.

    The NIST PQC Standards You Need to Know

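    NIST selected four algorithms in the first wave: three were finalized as standards in August 2024, and the fourth is still forthcoming. Together they cover two primitives, key encapsulation and digital signatures, and you will likely need three of them in production.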

    • ML-KEM (FIPS 203), formerly Kyber. Key encapsulation mechanism. This replaces RSA and ECDH for key exchange in TLS, IPsec, SSH, and any protocol that establishes a session key. ML-KEM-768 is the recommended general-purpose parameter set; ML-KEM-1024 for high-assurance environments.
    • ML-DSA (FIPS 204), formerly Dilithium. Digital signature algorithm. Replaces RSA and ECDSA for code signing, certificate signing, and document signing. ML-DSA-65 is the typical choice; ML-DSA-87 for long-lived signatures.
    • SLH-DSA (FIPS 205), formerly SPHINCS+. Stateless hash-based signatures. Slower and larger than ML-DSA but built on conservative hash-based assumptions, making it the hedge against unforeseen lattice attacks. Use for root certificate authorities, firmware signing, and anything with a multi-decade trust horizon.
    • FALCON (forthcoming as FIPS 206, under the name FN-DSA). Lattice-based signatures, smaller than those of ML-DSA but more complex to implement. Choose FALCON when bandwidth or storage for signatures dominates the cost calculation, such as constrained IoT or high-throughput certificate systems.

    For most enterprise migrations, the working pair is ML-KEM for key exchange and ML-DSA for signatures, with SLH-DSA reserved for the highest trust roots. FALCON enters the conversation only for specific bandwidth-constrained use cases.

    Macro of a chip wafer with intricate metallic lattice patterns
    Photo by Manuel on Unsplash

    Phase One: Cryptographic Inventory

    You cannot migrate what you have not catalogued. The inventory phase is where most programs stall, because cryptography is embedded in places no one documented. Plan for this phase to take six to twelve months in a mid-sized organization, longer if you have a significant on-premises footprint or many third-party integrations.

    What to Inventory

    The minimum viable inventory covers eight categories: TLS endpoints (both server and client roles), code signing infrastructure, internal and public certificate authorities, S/MIME and email signing, VPN and IPsec tunnels, secrets management and HSM-backed keys, document signing systems, and any cryptography embedded in proprietary protocols or firmware. For each item, capture the algorithm, key length, certificate validity, ownership, and renewal process.
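
    The per-item fields above map naturally onto a record type. The following is a minimal sketch, assuming a hypothetical `CryptoAsset` shape and made-up sample entries; the set of quantum-vulnerable algorithm names is illustrative and not exhaustive.

```python
from dataclasses import dataclass
from datetime import date
from typing import List

# Public-key families that Shor's algorithm breaks. Symmetric ciphers and
# hashes are deliberately left out of this illustrative set.
QUANTUM_VULNERABLE = {"RSA", "DSA", "DH", "ECDSA", "ECDH", "Ed25519", "X25519"}


@dataclass
class CryptoAsset:
    name: str              # hypothetical identifier for the system or endpoint
    category: str          # e.g. "tls-endpoint", "code-signing", "vpn"
    algorithm: str
    key_bits: int
    valid_until: date      # certificate or key expiry
    owner: str
    renewal_process: str   # e.g. "ACME", "manual ticket"

    def is_quantum_vulnerable(self) -> bool:
        return self.algorithm in QUANTUM_VULNERABLE


def migration_queue(assets: List[CryptoAsset]) -> List[CryptoAsset]:
    """Quantum-vulnerable assets, longest remaining validity first."""
    vulnerable = [a for a in assets if a.is_quantum_vulnerable()]
    return sorted(vulnerable, key=lambda a: a.valid_until, reverse=True)


# Made-up sample inventory for illustration.
inventory = [
    CryptoAsset("edge-lb", "tls-endpoint", "ECDSA", 256, date(2026, 9, 1), "platform", "ACME"),
    CryptoAsset("firmware-root", "code-signing", "RSA", 4096, date(2040, 1, 1), "security", "manual ticket"),
    CryptoAsset("backup-vault", "secrets", "AES", 256, date(2030, 1, 1), "infra", "KMS rotation"),
]
queue = migration_queue(inventory)
```

    Sorting by expiry is only one possible prioritization; it reflects the idea that the longest-lived keys are the hardest to replace later.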

    Tools That Help

    Three sources cover most of the external surface: network scanning with Nmap’s ssl-enum-ciphers script, certificate transparency logs for your domains, and CBOM (cryptography bill of materials) tooling such as IBM’s CBOMkit or the open-source CycloneDX CBOM extension. For source code, semgrep rules targeting calls to crypto primitives in your major languages will surface most usage. None of these tools are complete; they are starting points. Plan for manual review of high-risk systems, particularly anything that loads a certificate or key from a configuration file.
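
    To make the source-code pass concrete, here is a deliberately crude sketch: a regex scan that flags lines mentioning classical public-key primitives. It is a stand-in for real semgrep rules, not a replacement for them, and the two patterns and the sample snippet are assumptions chosen for illustration.

```python
import re

# Illustrative patterns only; real rules would be per-language and far broader.
PATTERNS = {
    "RSA": re.compile(r"\bRSA\b|\brsa\.generate_private_key\b", re.IGNORECASE),
    "ECC": re.compile(r"\bECDSA\b|\bECDH\b|\bec\.generate_private_key\b", re.IGNORECASE),
}


def scan_source(text: str):
    """Return (line number, algorithm family, stripped line) for each hit."""
    findings = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for family, pattern in PATTERNS.items():
            if pattern.search(line):
                findings.append((lineno, family, line.strip()))
    return findings


SAMPLE = """\
from cryptography.hazmat.primitives.asymmetric import rsa
key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
digest = hashlib.sha256(data).hexdigest()
peer = ec.generate_private_key(ec.SECP256R1())
"""

findings = scan_source(SAMPLE)
```

    A scan like this produces false positives and misses anything loaded from configuration, which is why the manual review of high-risk systems still matters.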

    The Harvest-Now-Decrypt-Later Problem

    The inventory must include data in motion that an adversary could record today and decrypt in a decade. Long-lived secrets, source code, intellectual property, and personal data with extended sensitivity windows are the priority. If your TLS sessions today carry data that will still be sensitive in 2035, those sessions need post-quantum key exchange now, not in 2030.
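
    The rule in the paragraph above reduces to a single comparison. A small sketch follows, assuming a planning year for when recorded traffic could be decrypted; the 2035 value and the sample data classes are assumptions for illustration, not predictions.

```python
# Planning assumption: the year by which recorded traffic might be decrypted.
ASSUMED_BREAK_YEAR = 2035


def needs_pq_key_exchange_now(sensitive_until_year: int,
                              assumed_break_year: int = ASSUMED_BREAK_YEAR) -> bool:
    """True if data captured today is still sensitive once decryption is feasible."""
    return sensitive_until_year >= assumed_break_year


# Hypothetical data classes mapped to the last year they remain sensitive.
data_classes = {"session cookies": 2026, "source code": 2036, "patient records": 2050}

priority = sorted(
    name for name, until in data_classes.items()
    if needs_pq_key_exchange_now(until)
)
```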

    Phase Two: Hybrid Deployment

    Hybrid mode combines a classical algorithm with a post-quantum algorithm, deriving the final session key from both. If either algorithm is broken, the other still protects the session. This is the migration pattern recommended by BSI, ANSSI, and the IETF working groups (CNSA 2.0 permits hybrid schemes but does not require them), and it is what AWS, Cloudflare, and Google have already deployed in production TLS.
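
    The "derive from both" idea can be sketched with the standard library alone. The example below concatenates two shared secrets and runs them through HKDF-SHA256 (RFC 5869). The secrets are fixed placeholder bytes, not the output of a real ML-KEM or X25519 exchange, and the salt and `info` label are assumptions; real protocols such as TLS 1.3 feed the combined secret into their own key schedule.

```python
import hashlib
import hmac


def hkdf_sha256(ikm: bytes, salt: bytes, info: bytes, length: int = 32) -> bytes:
    """HKDF (extract then expand) with HMAC-SHA256, per RFC 5869."""
    prk = hmac.new(salt, ikm, hashlib.sha256).digest()  # extract step
    okm, block, counter = b"", b"", 1
    while len(okm) < length:                            # expand step
        block = hmac.new(prk, block + info + bytes([counter]), hashlib.sha256).digest()
        okm += block
        counter += 1
    return okm[:length]


def hybrid_session_key(pq_secret: bytes, classical_secret: bytes) -> bytes:
    """Derive one key from both secrets; both inputs are fixed-length here."""
    combined = pq_secret + classical_secret
    return hkdf_sha256(combined, salt=b"\x00" * 32, info=b"hybrid-demo")


# Placeholder secrets standing in for the two key-exchange outputs.
pq_secret = b"\x01" * 32
classical_secret = b"\x02" * 32
session_key = hybrid_session_key(pq_secret, classical_secret)
```

    Because the key depends on both inputs, an attacker who recovers only one of the two secrets still cannot compute it.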

    Hybrid TLS Today

    TLS 1.3 with the X25519MLKEM768 hybrid key exchange is supported in OpenSSL 3.5, BoringSSL, and recent versions of Chrome, Firefox, and Edge. AWS Network Load Balancer, CloudFront, and KMS support hybrid TLS. Cloudflare enables it by default for inbound connections. Enabling hybrid TLS at your edge is the single highest-leverage move in the migration program: it protects new sessions against harvest-now-decrypt-later with minimal application change, and it surfaces compatibility issues with legacy clients while you still have time to address them.

    Performance Realities

    ML-KEM-768 adds roughly 1.2 kilobytes to the ClientHello (a 1,184-byte encapsulation key) and about 1.1 kilobytes to the ServerHello (a 1,088-byte ciphertext). On modern hardware, the cryptographic cost is negligible; the real cost is the extra packet the larger ClientHello typically needs, which can push the handshake into a second round trip on lossy networks. ML-DSA signatures are far larger than ECDSA signatures: an ML-DSA-65 signature is 3,309 bytes, roughly fifty times the size of an ECDSA P-256 signature, which matters for certificate chain size and OCSP stapling. SLH-DSA signatures are larger still, in the 8 to 50 kilobyte range depending on parameters. Budget for these sizes in any protocol with tight MTU constraints or high signature throughput.
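
    A quick size budget makes these numbers easier to reason about. The sketch below uses the byte sizes from FIPS 203, 204, and 205 together with the usual raw encodings for X25519 and ECDSA P-256. The 400-byte baseline ClientHello and the 1,460-byte segment size are assumptions; measure your own handshakes before relying on them.

```python
import math

SIZES_BYTES = {
    "X25519 key share": 32,
    "ML-KEM-768 encapsulation key": 1184,
    "ML-KEM-768 ciphertext": 1088,
    "ECDSA P-256 signature": 64,        # raw r||s; DER encoding is slightly larger
    "ML-DSA-65 signature": 3309,
    "SLH-DSA-SHA2-128s signature": 7856,
    "SLH-DSA-SHA2-256f signature": 49856,
}


def packets_needed(payload_bytes: int, segment_bytes: int = 1460) -> int:
    """Number of TCP segments for a payload, under an assumed segment size."""
    return math.ceil(payload_bytes / segment_bytes)


# Hybrid key shares carry the classical and the post-quantum part together.
client_share = SIZES_BYTES["X25519 key share"] + SIZES_BYTES["ML-KEM-768 encapsulation key"]
server_share = SIZES_BYTES["X25519 key share"] + SIZES_BYTES["ML-KEM-768 ciphertext"]

BASELINE_CLIENT_HELLO = 400  # assumed size of everything except the key share
classical_packets = packets_needed(BASELINE_CLIENT_HELLO + SIZES_BYTES["X25519 key share"])
hybrid_packets = packets_needed(BASELINE_CLIENT_HELLO + client_share)

signature_growth = SIZES_BYTES["ML-DSA-65 signature"] / SIZES_BYTES["ECDSA P-256 signature"]
```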

    Phase Three: Code Signing and PKI

    Code signing and certificate authorities are harder than TLS because the trust horizon is longer and the verifier population is more diverse. A code signature issued today may need to verify on devices for ten or fifteen years. A root CA certificate may be embedded in firmware that ships for two decades.

    The pragmatic pattern is dual-signing: produce both an ECDSA and an ML-DSA signature, and let the verifier accept either. This requires updates to verifier code wherever signatures are checked, which is the slow path of the migration. Start with the systems that have the longest signature lifetime and the smallest verifier population, typically internal firmware and enterprise software updates. Public code signing for consumer software follows the certificate authority ecosystem, which is moving on its own timeline coordinated through the CA/Browser Forum.
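
    The accept-either logic on the verifier side might look like the sketch below. The two "signature schemes" are HMAC stand-ins used only so the example runs with the standard library; they are not ECDSA or ML-DSA, and the function and variable names are hypothetical.

```python
import hashlib
import hmac
from typing import Callable, Dict, Optional

Verifier = Callable[[bytes, bytes], bool]  # (message, signature) -> is it valid?


def verify_any(message: bytes, signatures: Dict[str, bytes],
               verifiers: Dict[str, Verifier]) -> Optional[str]:
    """Return the first algorithm this verifier supports whose signature checks out."""
    for algorithm, verify in verifiers.items():
        signature = signatures.get(algorithm)
        if signature is not None and verify(message, signature):
            return algorithm
    return None


def stub_scheme(key: bytes):
    """HMAC stand-in for a signature scheme, only to make the sketch runnable."""
    def sign(message: bytes) -> bytes:
        return hmac.new(key, message, hashlib.sha256).digest()

    def verify(message: bytes, signature: bytes) -> bool:
        return hmac.compare_digest(sign(message), signature)

    return sign, verify


ecdsa_sign, ecdsa_verify = stub_scheme(b"stand-in for an ECDSA key pair")
mldsa_sign, mldsa_verify = stub_scheme(b"stand-in for an ML-DSA key pair")

artifact = b"firmware image v1.2.3"
dual_signed = {"ecdsa": ecdsa_sign(artifact), "ml-dsa": mldsa_sign(artifact)}

legacy_device = {"ecdsa": ecdsa_verify}                           # not yet updated
updated_device = {"ml-dsa": mldsa_verify, "ecdsa": ecdsa_verify}  # prefers ML-DSA
```

    Accepting either signature means a forged classical signature would still pass, so this is a transition pattern: the classical entry eventually has to be removed from the verifier's table.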

    Phase Four: Crypto-Agility

    The PQC migration is the second of many. The lattice assumptions underlying ML-KEM and ML-DSA are well-studied but not as old as the integer factorization assumption underlying RSA. If a cryptanalytic advance forces a third migration in a decade, the organizations that build crypto-agility now will move in months instead of years.

    What Crypto-Agility Looks Like

    Algorithm identifiers are configuration, not code constants. Every cryptographic operation goes through an abstraction layer that can swap algorithms without changing call sites. Keys carry algorithm metadata, not just key material. Certificate templates and signing pipelines are parameterized by algorithm. Most importantly, the organization runs a periodic exercise in which a designated algorithm is deprecated in a non-production environment and the migration is timed end-to-end. That exercise will surface the hard-coded OIDs, the assumed key sizes, and the legacy clients that the inventory missed.
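
    A minimal sketch of such an abstraction layer follows. To stay runnable with the standard library it swaps between two HMAC variants rather than signature schemes, and the `MacProvider` class and algorithm names are hypothetical. What it illustrates is the shape: call sites never name an algorithm, outputs carry algorithm metadata, and deprecation is a single switch.

```python
import hashlib
import hmac
from dataclasses import dataclass


@dataclass(frozen=True)
class Tag:
    algorithm: str   # algorithm metadata travels with the output
    value: bytes


class MacProvider:
    """Call sites use protect()/check(); only configuration names an algorithm."""

    def __init__(self, default_algorithm: str):
        self._digests = {"hmac-sha256": hashlib.sha256, "hmac-sha512": hashlib.sha512}
        self._deprecated = set()
        self.default_algorithm = default_algorithm

    def protect(self, key: bytes, data: bytes) -> Tag:
        digest = self._digests[self.default_algorithm]
        return Tag(self.default_algorithm, hmac.new(key, data, digest).digest())

    def check(self, key: bytes, data: bytes, tag: Tag) -> bool:
        # Unknown or deprecated algorithms are rejected outright.
        if tag.algorithm in self._deprecated or tag.algorithm not in self._digests:
            return False
        expected = hmac.new(key, data, self._digests[tag.algorithm]).digest()
        return hmac.compare_digest(expected, tag.value)

    def deprecate(self, algorithm: str) -> None:
        self._deprecated.add(algorithm)


provider = MacProvider(default_algorithm="hmac-sha256")  # value would come from config
old_tag = provider.protect(b"key", b"payload")

provider.default_algorithm = "hmac-sha512"   # the "migration" is a config change
new_tag = provider.protect(b"key", b"payload")
still_valid = provider.check(b"key", b"payload", old_tag)

provider.deprecate("hmac-sha256")            # the deprecation drill
valid_after_drill = provider.check(b"key", b"payload", old_tag)
```

    Running the drill in a non-production environment and counting what fails `check` is a cheap way to find data that still depends on the deprecated algorithm.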

    Abstract crystalline geometry suggesting lattice based cryptography
    Photo by Growtika on Unsplash

    Realistic Timeline

    The deadlines are not uniform. Federal contractors and National Security Systems have a 2030 hard date under CNSA 2.0. Financial services regulators are signaling 2030 to 2032 for critical infrastructure. The German BSI and French ANSSI recommend completion by 2030 for high-sensitivity data. The 2035 industry-wide horizon is the latest defensible date, not the target.

    A reasonable schedule for a mid-sized enterprise: complete the inventory by end of 2026, deploy hybrid TLS at all internet-facing edges by end of 2027, migrate internal certificate authorities to dual-signing by end of 2028, retire pure-classical signatures from new code-signing operations by end of 2029, and complete the long-tail cleanup by 2032. Every quarter you delay the inventory pushes the entire schedule back, because the inventory is the dependency for every other phase.

    Recommendation

    Start the inventory this quarter. Enable hybrid TLS at your edge before the end of next quarter. Build the algorithm abstraction layer in your shared libraries before you migrate the first internal CA. Do not wait for vendor announcements or perfect tooling; the standards are stable, the production deployments at hyperscalers are live, and the deadlines compress every quarter.

    When This Applies, and When It Does Not

    This playbook applies to any organization that operates its own TLS endpoints, signs its own code, runs internal certificate authorities, or handles data with sensitivity windows extending past 2035. It is overkill for a small startup that consumes managed TLS exclusively from a hyperscaler and signs nothing itself; in that case, the migration happens to you when your providers flip the switch, and your only obligation is to keep client libraries current. For everyone else, the migration is your problem, and the work starts with knowing what cryptography you have.

  • Enterprise LLM Selection Framework for 2026

    Abstract AI neural network visualization with glowing nodes
    Photo by Google DeepMind on Unsplash

    The 2026 LLM market is the first one in which the right answer for an enterprise buyer is genuinely unclear at the level of the model itself. Two years ago, GPT-4 was the safe default and the conversation was about how much you were willing to compromise to escape it. Today the frontier is contested across at least four serious vendors and two open-weight families, capabilities are similar enough that benchmark differences rarely survive contact with a real workload, and the procurement decision has shifted from “which model is best” to “which combination of contracts, capabilities, and exit options is right for our specific risk profile.”

    This framework is what we use with enterprise CTOs evaluating their LLM stack for the next two-year procurement cycle. It is opinionated. It will not give you a single vendor recommendation, because the right answer is almost never a single vendor.

    The 2026 landscape, named honestly

    Five families matter for enterprise selection in 2026: three proprietary and two open-weight. Anthropic Claude 4.7 has the strongest agent and tool-use behavior on the market, the cleanest constitutional AI safety story for regulated industries, and a one-million-token context window that finally makes long-document reasoning practical rather than theoretical. OpenAI GPT-5 has the broadest ecosystem, the most mature fine-tuning and assistants tooling, and the deepest integration into Microsoft enterprise stacks. Google Gemini 2.5 Pro has native multimodal support that genuinely outperforms competitors on video and large-PDF reasoning, and the strongest data-residency story for organizations already on Google Cloud.

    The two open-weight families are DeepSeek-V3, which closed the gap with the frontier on reasoning benchmarks at a fraction of the cost and is a credible self-hosted option if your security posture allows it, and Qwen-3, which is the strongest non-English-first model family and the right choice for any enterprise with substantial Chinese, Japanese, or Korean operations. Meta Llama remains a serious option for fine-tuning workloads and on-premises deployment but has lost ground at the frontier.

    Smaller specialty vendors (Mistral, Cohere, AI21) are still relevant for specific niches: data residency in Europe, retrieval-tuned models, smaller-context efficient models. They are rarely the primary vendor for an enterprise but are often the right secondary choice.

    Abstract iridescent shape representing a large language model
    Photo by Google DeepMind on Unsplash

    Evaluation axes that survive contact with reality

    Public benchmarks are nearly useless for enterprise selection in 2026. The frontier models cluster within a few percentage points on every standard eval, and your specific workload will reorder them. The axes below are the ones that actually drive the decision.

    Capability on your workload

    Build an internal eval set of two hundred to a thousand examples drawn from your real production traffic, with human-graded ground truth. Run every candidate model on it. Track pass rate, latency, and cost per example. This is the only capability number that should drive your decision. Public benchmarks are useful for ruling out clearly inferior options; they cannot rank the top three.
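
    A harness for this can be small. The sketch below assumes each candidate is wrapped as a callable that returns an answer and its cost in dollars; the `Model` signature, the stub model, and the exact-match grader are illustrative assumptions rather than any vendor's API.

```python
import time
from dataclasses import dataclass
from typing import Callable, List, Tuple

Model = Callable[[str], Tuple[str, float]]  # prompt -> (answer, cost in USD)


@dataclass
class EvalSummary:
    pass_rate: float
    mean_latency_s: float
    cost_per_example: float


def run_eval(model: Model, examples: List[Tuple[str, str]],
             grade: Callable[[str, str], bool]) -> EvalSummary:
    """Run every (prompt, expected) pair and aggregate the three tracked numbers."""
    passed, latency, cost = 0, 0.0, 0.0
    for prompt, expected in examples:
        start = time.perf_counter()
        answer, call_cost = model(prompt)
        latency += time.perf_counter() - start
        cost += call_cost
        passed += int(grade(answer, expected))
    n = len(examples)
    return EvalSummary(passed / n, latency / n, cost / n)


def stub_model(prompt: str) -> Tuple[str, float]:
    """Stand-in for a real vendor call; upper-cases the prompt at a flat cost."""
    return prompt.upper(), 0.002


examples = [("refund", "REFUND"), ("cancel", "CANCEL"), ("upgrade", "downgrade")]
summary = run_eval(stub_model, examples, grade=lambda answer, expected: answer == expected)
```

    In practice the grader is the hard part: exact match rarely works for free-text output, which is why the ground truth needs to be human-graded.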

    Latency under realistic load

    Measure p50, p95, and p99 latency at your expected concurrent request rate, not at single-request rate. Vendor-quoted latencies are usually optimistic and rarely reflect the queueing behavior at scale. Time-to-first-token matters for streaming UIs; total response time matters for batch and agent workflows. Different vendors are strong on different axes, and a vendor that is fast at low load can degrade ungracefully at peak.
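
    One way to collect these numbers is sketched below: issue requests from a thread pool at the target concurrency and read percentiles from the recorded latencies. The `call` argument is a placeholder for a real client call, and the sleep-based stub is an assumption for illustration.

```python
import statistics
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List


def timed(call: Callable[[], object]) -> float:
    """Wall-clock duration of one call, in seconds."""
    start = time.perf_counter()
    call()
    return time.perf_counter() - start


def measure_under_load(call: Callable[[], object], total_requests: int,
                       concurrency: int) -> List[float]:
    """Keep up to `concurrency` calls in flight and record each call's latency."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        return list(pool.map(lambda _: timed(call), range(total_requests)))


def latency_percentiles(samples: List[float]) -> Dict[str, float]:
    cuts = statistics.quantiles(samples, n=100, method="inclusive")  # 99 cut points
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}


# Stub workload: replace the lambda with a real request to the candidate model.
samples = measure_under_load(lambda: time.sleep(0.001), total_requests=40, concurrency=8)
report = latency_percentiles(samples)
```

    With only a few dozen samples the p99 value is little more than the slowest request, so a real run needs enough requests for the tail to mean something.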

    Cost per million tokens, in and out

    List prices in 2026 range from under one dollar per million input tokens for some open-weight inference providers to over fifteen dollars per million output tokens for the most premium frontier offerings. The right number is your blended cost per resolved task, which depends on input length, output length, retry rate, and model success rate. A cheaper model that takes three attempts to succeed is more expensive than a premium model that succeeds on the first call. Compute this for your specific workload before signing a contract.
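
    The blended figure can be computed directly. The sketch below assumes retries are independent and continue until success, so the expected number of attempts is one divided by the success rate; all prices and token counts are hypothetical.

```python
def cost_per_attempt(input_tokens: int, output_tokens: int,
                     input_price: float, output_price: float) -> float:
    """Cost of one call in USD; prices are USD per million tokens."""
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


def cost_per_resolved_task(attempt_cost: float, success_rate: float) -> float:
    """Expected cost when retrying until success, assuming independent attempts."""
    if not 0 < success_rate <= 1:
        raise ValueError("success_rate must be in (0, 1]")
    return attempt_cost / success_rate


# Hypothetical workload: 2,000 input tokens and 500 output tokens per call.
cheap = cost_per_resolved_task(cost_per_attempt(2000, 500, 2.0, 6.0), success_rate=1 / 3)
premium = cost_per_resolved_task(cost_per_attempt(2000, 500, 3.0, 15.0), success_rate=1.0)
```

    With these made-up numbers the cheaper model costs about $0.021 per resolved task against $0.0135 for the premium one. The comparison flips whenever the per-attempt price gap is wider than the gap in attempts, which is why it has to be computed per workload.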

    Data residency and processing geography

    For enterprise customers in regulated industries, data residency is often a hard constraint that eliminates entire vendors. AWS Bedrock, Azure OpenAI Service, and Google Vertex AI all offer regional model deployment with explicit data-residency guarantees; direct API access from the model vendor often does not. Get the contractual data flow in writing, including any zero-retention guarantees, training-data policies, and incident-disclosure obligations. Trust the contract, not the marketing page.

    Fine-tuning and customization support

    If your strategy involves fine-tuning, the vendor’s support for it is a hard constraint. Some frontier vendors offer no fine-tuning at all, some offer only LoRA-style adapters, and some offer full parameter fine-tuning at significant cost. Open-weight models give you full control but require the platform team to support the inference infrastructure. Decide your customization strategy first; then narrow vendors.

    Compliance and contract terms

    SOC 2 Type II is now table-stakes; the absence of it should disqualify a vendor for enterprise use. HIPAA-eligible deployments are available from all major frontier vendors but require a Business Associate Agreement and specific deployment configurations. ISO 27001, FedRAMP, and PCI DSS attestations matter for specific verticals. The contract should also include indemnification for IP claims (most major vendors now offer this), explicit data-use restrictions, and clear breach-notification obligations.

    Procurement red flags

    The following items in a vendor proposal should slow your procurement process and trigger additional review:

    • No commitment to model version stability or deprecation notice (frontier models change behavior on minor version bumps; a deploy that works today can break next week without notice)
    • Indemnification limited to direct damages with low caps (an AI feature that causes a regulatory penalty or class-action exposure can exceed direct contract value by orders of magnitude)
    • Mandatory training-data clauses that require enterprise data to be usable for model improvement, with opt-out only available at a higher pricing tier
    • SLA targets below 99.5 percent for production endpoints, or no SLA at all on the specific models you intend to use
    • Incident notification obligations longer than 72 hours for security incidents, or no obligation to notify on model behavior changes
    • Pricing structures that incentivize the vendor to maximize tokens consumed (per-token-only with no efficiency credits, no committed-use discounts, no rate-limit transparency)

    Multi-vendor strategy

    The single most consequential strategic decision in 2026 LLM procurement is whether to commit to a single vendor or to architect for portability across two or three. Single-vendor commitment buys deeper integration, simpler contracts, volume discounts, and faster iteration. Multi-vendor architecture buys negotiating leverage, vendor-failure resilience, and the ability to route specific workloads to the model that handles them best.

    Our recommendation for any enterprise spending more than two million dollars per year on inference is a primary-secondary architecture: one vendor handles seventy to eighty percent of traffic with deep integration, a second vendor handles the remainder with the same provider abstraction layer. The cost premium of maintaining the second integration is usually under ten percent of the inference budget; the leverage it provides at contract renewal time often exceeds that.

    Build the abstraction layer with care. The temptation is to use a thin wrapper that exposes a lowest-common-denominator API; the result is that you cannot use any vendor’s distinguishing features. The better pattern is a capability-based router: each route in your application declares what it needs (tool use, structured output, long context, low latency, low cost) and the router selects the best vendor for that capability mix. This is more work upfront and pays off for years.
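
    A capability-based router can start as a set-containment check. The sketch below uses hypothetical vendor names, capability labels, and a simple cost rank as the tie-breaker; a production router would also weigh latency, quotas, and failover.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Set


@dataclass(frozen=True)
class Vendor:
    name: str
    capabilities: FrozenSet[str]
    cost_rank: int  # lower is cheaper; hypothetical ordering


def route(required: Set[str], vendors: List[Vendor]) -> Vendor:
    """Pick the cheapest vendor that offers every capability the route declares."""
    candidates = [v for v in vendors if required <= v.capabilities]
    if not candidates:
        raise LookupError(f"no vendor offers all of: {sorted(required)}")
    return min(candidates, key=lambda v: v.cost_rank)


VENDORS = [
    Vendor("vendor_a", frozenset({"tool_use", "structured_output", "long_context"}), cost_rank=2),
    Vendor("vendor_b", frozenset({"structured_output", "low_latency", "low_cost"}), cost_rank=1),
]

agent_route = route({"tool_use", "long_context"}, VENDORS)
extraction_route = route({"structured_output"}, VENDORS)
```

    Because routes declare needs rather than vendors, adding or swapping a vendor is a change to the `VENDORS` table, not to the application code.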

    Flowing abstract waves symbolizing model outputs and embeddings
    Photo by Google DeepMind on Unsplash

    Recommendation

    Run a structured evaluation on your top three candidates with an internal eval set, real production traffic shadowed against each model, and explicit measurement of cost per resolved task. Negotiate enterprise contracts with at least two of them, even if you intend to start with one. Architect for portability without paying the full price of a vendor-neutral abstraction. Revisit the decision every twelve months; the model landscape in 2026 changes quarterly and your procurement strategy should not be locked in for three years.

    When this applies and when it does not

    This framework applies to any organization with an annual inference budget above five hundred thousand dollars or any deployment that touches regulated data or revenue-impacting decisions. The procurement leverage and contract terms become genuinely meaningful at that scale.

    It does not apply to startups in the first eighteen months of building an AI product. There, vendor agility and feature velocity matter more than contractual terms; pick the model that lets you ship fastest and revisit the procurement question when you have product-market fit. It also does not apply to single-developer experiments or proof-of-concept work; the overhead of running this framework on a four-week prototype consumes the time you should be spending on the prototype itself.

  • Build vs Buy: A Decision Framework for Custom Software vs SaaS

    Team collaborating around a whiteboard during a strategy session
    Photo by Mapbox on Unsplash

    Every quarter, an engineering org somewhere greenlights a custom build that should have been a SaaS subscription, or signs a SaaS contract for the one capability that defines its product. Both mistakes cost millions. The question is not whether to build or buy. The question is which decision rule survives contact with the next five years of your roadmap.

    This framework is the one we walk CTOs through during architecture reviews. It assumes you already know how to read an invoice and how to estimate a sprint. What it gives you is a way to defend the decision to a board, a CFO, and your future self when the tradeoffs surface eighteen months in.

    The Differentiator Rule

    The first filter is brutal and binary. If a capability is part of how you win in the market, build it. If it is not, buy it. Auth flows, billing, helpdesk, error tracking, feature flags, internal analytics dashboards, document signing, video conferencing, status pages: these are not where you win. Customers do not pay you because your SSO is elegant. They pay you because of the thing your competitors cannot do.

    The trap is that engineering teams genuinely enjoy building these things. They are well-scoped, satisfying problems with clear shapes. Auth0, Stripe, Zendesk, Sentry, LaunchDarkly, DocuSign, Daily, Statuspage, Mixpanel, Segment, Snowflake, Looker, Datadog all exist because thousands of teams concluded that those problems were solved well enough by people who solve them full-time. Your team should reach the same conclusion before they write the first migration.

    Total Cost of Ownership Beyond License Fees

    License fees are the most visible cost and almost never the largest one. A useful TCO model spans five years and counts every line item that engineering, finance, and security will eventually pay. The numbers below are illustrative bands we have seen across mid-market consulting engagements, not benchmarks.

    • Build path: initial engineering (loaded cost per FTE multiplied by team size and duration), opportunity cost of those engineers not shipping product features, ongoing maintenance at roughly 15 to 25 percent of initial build per year, on-call burden, security patching, dependency upgrades, infrastructure spend, observability, compliance audits when in scope, and the eventual rewrite that arrives every 4 to 7 years.
    • Buy path: contract value, integration engineering for connecting the SaaS to your stack, vendor management overhead, data egress costs, audit and procurement effort, and the cost of switching if the vendor underperforms or repackages pricing.
    • Hidden cost on both sides: the time leadership spends defending the decision when something breaks. Build it and an outage is your fault. Buy it and an outage is the vendor’s fault but still your problem.

    The honest version of TCO almost always shows that buying is cheaper for the first 18 to 36 months and that build economics only start to compete once your usage scale outgrows the vendor’s pricing model. Below 10,000 active users, build is rarely cheaper. Above 1 million, the math sometimes flips, but not always.
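
    A stripped-down version of the five-year comparison is sketched below. It keeps only a few of the line items listed above, and every dollar figure is hypothetical; the 20 percent maintenance rate sits inside the 15 to 25 percent band mentioned earlier.

```python
from typing import Optional


def build_tco(initial: float, maintenance_rate: float, annual_infra: float, years: int) -> float:
    """Initial build plus yearly maintenance (a fraction of the build) and infrastructure."""
    return initial + years * (initial * maintenance_rate + annual_infra)


def buy_tco(annual_contract: float, integration: float, annual_overhead: float, years: int) -> float:
    """One-off integration plus yearly contract and vendor-management overhead."""
    return integration + years * (annual_contract + annual_overhead)


def breakeven_year(build_args: dict, buy_args: dict, horizon: int = 5) -> Optional[int]:
    """First year within the horizon in which cumulative build cost is not above buy."""
    for year in range(1, horizon + 1):
        if build_tco(years=year, **build_args) <= buy_tco(years=year, **buy_args):
            return year
    return None


# Hypothetical inputs in USD.
build_args = {"initial": 600_000, "maintenance_rate": 0.20, "annual_infra": 50_000}
buy_args = {"annual_contract": 300_000, "integration": 100_000, "annual_overhead": 30_000}

year = breakeven_year(build_args, buy_args)
```

    With these inputs buying stays cheaper through year three and building catches up in year four, which matches the shape described above. The opportunity cost of the engineers is not modeled here and would push the breakeven later.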

    Switching Cost as the Real Lock-In

    Two paths diverging in a minimalist landscape representing a build or buy decision
    Photo by Vladislav Babienko on Unsplash

    The standard concern about SaaS is vendor lock-in. The standard concern is misframed. The real question is switching cost, and switching cost applies equally to your custom build. A homegrown billing system is locked in too. The lock-in is just to your own team’s tribal knowledge instead of to a vendor’s roadmap.

    What Increases SaaS Switching Cost

    Proprietary data formats with no clean export, deep workflow integrations with custom logic, identity provider entanglement, vendor-specific UI embedded in your own product, and pricing models that compound with usage so that migrating grows more expensive the longer you stay. The mitigation is to insist on data portability clauses, to keep an integration abstraction layer between your code and the vendor SDK, and to track the cost of staying versus leaving on a yearly basis.

    What Increases Build Switching Cost

    Tribal knowledge that left with the original team, undocumented business rules encoded in stored procedures, custom protocols nobody wants to support, and tightly coupled internal systems that all assume the build will exist forever. The mitigation is documentation, modularization, and an explicit owner. Most internal builds fail this test by year three.

    The Hybrid Pattern That Usually Wins

    Most mature engineering orgs end up with a hybrid posture rather than a pure build or pure buy stance. The pattern looks like this: buy the commodity layers, build a thin orchestration layer on top, and reserve custom engineering for the differentiated workflow that touches the customer. Use Auth0 or WorkOS for identity, but build the tenant-specific authorization model that encodes your domain. Use Stripe for payments, but build the pricing engine that calculates what to charge. Use Snowflake for storage, but build the semantic layer your analysts and product team consume.

    This pattern works because it isolates vendor risk to interchangeable layers and concentrates engineering investment on the layer that compounds. Replacing Stripe with Adyen is painful but tractable. Replacing your pricing engine is a strategic project either way.

    The Anti-Pattern: Rebuilding the Commodity

    The most expensive mistake in this space is rebuilding undifferentiated commodity software because it feels strategic. We see it most often in three forms. The first is the in-house feature flag platform that started as a Friday afternoon hack and now consumes an SRE quarter every year. The second is the bespoke ETL pipeline built to avoid Fivetran or Airbyte license fees that ends up costing four times the annual contract in headcount. The third is the internal admin tool framework that reinvents Retool, Forest, or Appsmith because someone read a blog post about low-code being a trap.

    The pattern is recognizable. Engineers like the work, leadership likes the optionality, and finance does not see the line item because the cost is hidden inside payroll. Two years later, the system has one maintainer who cannot take vacation, no documentation, and a quiet plan to migrate to the SaaS that was rejected on day one.

    Laptop on a wooden desk next to a notebook and a coffee cup
    Photo by Andrew Neel on Unsplash

    The Decision Checklist

    When the tradeoff is genuinely close, work through these questions in order. The first one to flip the decision is the answer.

    1. Is this capability part of how we win in the market? If yes, build. If no, continue.
    2. Does a credible vendor exist with a clean API, documented data portability, and a track record beyond 5 years? If no, build or wait. If yes, continue.
    3. Will our usage in 24 months exceed the vendor’s pricing model breakpoint by a factor of 3 or more? If yes, model the breakeven and continue. If no, buy.
    4. Do we have the operational maturity to own this system on-call for the next 5 years? If no, buy. If yes, continue.
    5. Does the build path require us to hire specialist talent we do not currently have? If yes, lean buy. If no, the decision is now financial, and TCO over 5 years decides.
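
    The checklist can be encoded as a short function, which makes the ordering explicit. This is one reading of the five questions, with hypothetical field names; in particular it treats a "yes" on question 3 as "model the breakeven, then keep going".

```python
from dataclasses import dataclass


@dataclass
class Capability:
    differentiator: bool                # Q1: part of how we win in the market?
    credible_vendor: bool               # Q2: clean API, portability, 5+ year record?
    usage_exceeds_breakpoint_3x: bool   # Q3: usage in 24 months >= 3x the breakpoint?
    can_own_on_call: bool               # Q4: operational maturity for 5 years?
    needs_specialist_hires: bool        # Q5: talent we do not currently have?


def decide(c: Capability) -> str:
    """Walk the questions in order; the first one that settles it wins."""
    if c.differentiator:
        return "build"
    if not c.credible_vendor:
        return "build or wait"
    if not c.usage_exceeds_breakpoint_3x:
        return "buy"
    # Q3 answered yes: the breakeven needs modeling, and the checklist continues.
    if not c.can_own_on_call:
        return "buy"
    if c.needs_specialist_hires:
        return "lean buy"
    return "decide on 5-year TCO"


# Hypothetical example: a billing system at modest scale.
billing = Capability(False, True, False, True, False)
billing_decision = decide(billing)
```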

    How to Run the Decision in Practice

    The framework is only useful if it produces a defensible decision in a finite amount of time. The version that survives contact with real organizations looks like a two-week exercise with three artifacts at the end: a one-page TCO model with documented assumptions, a one-page risk register that names the top three failure modes for each path, and a one-page recommendation that the sponsor signs. Anything more becomes a project. Anything less becomes a hallway conversation that gets re-litigated every quarter.

    The most common process failure is letting the analysis sprawl until the decision has been made by inertia. The team that spends six weeks evaluating six vendors at the line-item level is the team that ships a Frankenstein POC because nobody wanted to call the question. Set a deadline, name a decision-maker, and accept that the choice will be made with imperfect information. The cost of a wrong decision is recoverable. The cost of no decision is the year you spent not making it.

    When This Framework Applies

    This framework works for capabilities that are well-defined, have a vendor market, and are not currently a competitive crisis. It works for billing, identity, observability, support, internal tooling, data infrastructure, and most platform layers.

    When It Does Not Apply

    It does not apply to your core product surface. It does not apply to capabilities that no vendor sells because the market does not yet exist. It does not apply when regulatory constraints prohibit data leaving your environment, although in those cases the relevant choice is between self-hosted commercial software and pure custom build, not between commodity SaaS and custom. And it does not apply when speed to market is the only thing that matters and the build path adds 6 months you do not have. In that case, buy now, accept the lock-in, and revisit in 18 months with a real TCO model in hand.

    The discipline is to make the decision once, defend it with numbers, document the assumptions, and revisit those assumptions every two years. The teams that get this right are not smarter. They are more honest about which problems they are paid to solve.