Building AI at the speed of light on Cloudflare's global edge network
For over 15 years, we’ve trusted Cloudflare. Their forever-free tier grants you the world’s fastest DNS without surveillance capitalism. They sell domains at cost. Their free compute tier is the most generous in this galaxy. They’ve earned trust through action, not marketing.
Today, we’re honored to announce that Divinci AI has been accepted into Cloudflare Workers Launchpad Cohort #6—joining 25 other innovative startups building the future on Cloudflare’s edge computing platform.
Why This Matters: Migration to Cloudflare
Joining the Workers Launchpad marks the beginning of our complete infrastructure migration to Cloudflare. Our architecture will cascade through resilience layers:
Primary: Eco-Colo → Secondary: Cloudflare → Tertiary: GCP → Quaternary: AWS
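One way to read this cascade is as an ordered failover chain: serve from the first layer that answers, and fall through to the next when one fails. Here is a minimal TypeScript sketch of that idea. It assumes each layer can be modeled as an async handler. The `Layer` type and `cascade` function are hypothetical illustrations, not Divinci's routing code.

```typescript
// Hypothetical: each resilience layer is modeled as a named async handler.
type Layer<Req, Res> = {
  name: string;
  handle: (req: Req) => Promise<Res>;
};

// Try each layer in priority order and return the first successful result.
// If every layer fails, throw one error that lists each layer's failure.
export async function cascade<Req, Res>(
  layers: Layer<Req, Res>[],
  req: Req,
): Promise<{ servedBy: string; result: Res }> {
  const errors: string[] = [];
  for (const layer of layers) {
    try {
      const result = await layer.handle(req);
      return { servedBy: layer.name, result };
    } catch (err) {
      errors.push(`${layer.name}: ${err instanceof Error ? err.message : String(err)}`);
    }
  }
  throw new Error(`all layers failed -> ${errors.join("; ")}`);
}
```

With layers ordered Eco-Colo, Cloudflare, GCP, AWS, a request only reaches GCP if both Eco-Colo and Cloudflare have thrown.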
We estimate that when we scale, most compute will run on Cloudflare Workers. Why? Because their pricing structure enables something rare in tech: profitable altruism.
Cloudflare’s economics allow us to expand our budget for supporting non-profits, causes, and organizations making the world better. Specifically, we’re committing resources toward Universal Basic Income research and advocacy (see our UBI blog post).
When infrastructure costs drop, mission-driven work becomes possible.
The Edge Computing Revolution
Traditional cloud computing follows a centralized model: your data travels thousands of miles to a distant data center, gets processed, then returns. For AI applications requiring real-time intelligence, this creates unacceptable latency.
Cloudflare’s edge network changes the equation entirely:
- 330+ cities globally: Your code runs milliseconds from every internet user
- 298% faster than AWS Lambda: Cloudflare’s published benchmarks show Workers well ahead of traditional serverless platforms[1]
- Zero cold starts: V8 isolates eliminate the container startup penalty
- Sub-100ms global latency: Achieving the responsiveness threshold for real-time AI
For Retrieval-Augmented Generation (RAG) systems—where every millisecond compounds through retrieval, embedding, ranking, and generation—edge deployment is transformative.
RAG at the Edge: Why It’s Game-Changing
Retrieval-Augmented Generation became the dominant strategy in 2024 for grounding LLMs in authoritative, up-to-date knowledge. Over 1,200 RAG papers appeared on arXiv in 2024 alone, a 12x increase from 2023[2].
Traditional RAG architecture suffers from latency accumulation:
- Embedding generation: 50-200ms
- Vector search: 20-100ms (regional database)
- Context retrieval: 50-150ms (object storage)
- LLM generation: 200-800ms
- Network round-trips: 100-400ms (multi-region)
Total latency: 420-1,650ms for a single query.
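The 420-1,650ms total is the sum of the stage ranges. That assumes the five stages run strictly one after another with no overlap. Here is a quick TypeScript check of the arithmetic, using the numbers from the list above:

```typescript
// Per-stage latency ranges in milliseconds, copied from the list above.
type Range = { min: number; max: number };

const traditionalRag: Record<string, Range> = {
  embedding: { min: 50, max: 200 },
  vectorSearch: { min: 20, max: 100 },
  contextRetrieval: { min: 50, max: 150 },
  llmGeneration: { min: 200, max: 800 },
  networkRoundTrips: { min: 100, max: 400 },
};

// Sequential stages add up: the best case is the sum of the minimums,
// the worst case the sum of the maximums.
export function totalLatency(stages: Record<string, Range>): Range {
  let min = 0;
  let max = 0;
  for (const { min: lo, max: hi } of Object.values(stages)) {
    min += lo;
    max += hi;
  }
  return { min, max };
}

console.log(totalLatency(traditionalRag)); // { min: 420, max: 1650 }
```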
Edge-based RAG collapses these timelines:
- Embedding, search, retrieval, and generation happen in the same data center
- Document chunks and metadata stored at the edge (D1, with source files in R2)
- Network overhead reduced by 60-80%
- Achievable total latency: 100-400ms
This isn’t a minor optimization—it’s the difference between usable and frustrating.
How Divinci Actually Uses Cloudflare — the Production Stack
We’ve avoided the trap of “Cloudflare-flavoured marketing prose” here. What follows is the actual stack as it ships in our monorepo: 29 production Workers, 3 Worker Workflows, 5 Workers AI models, 4 R2 buckets, 6 Queue types, Hyperdrive on Postgres, Durable-Object-backed Containers for PDF and audio, and 36 tail consumers streaming structured logs to observability. The pieces are named after their real bindings and route domains so engineers reading this can grep for them.
Layer 1 — Five core Workers at the edge
Every HTTP request hits one of five custom-domain Workers:
- divinci-api at api.divinci.app: the REST boundary (auth, JWT validation, route resolution, fan-out to internal workers). Bindings include the FILES R2 bucket, the CACHE KV namespace, the D1 doc-elements database, Workers AI, Hyperdrive, Analytics Engine, and four named Queues. This is the worker that sees the request first.
- web-client-r2-server at chat.divinci.app: the static frontend, served directly from R2 through a thin Worker that handles rewrites and routing into the SPA.
- divinci-agent: the answer-composition orchestrator. It pulls context from D1, KV, and R2, decides which Workers AI model to call (or whether to delegate to an external model provider), and composes the response.
- chunks-workflow at rag-workflow.divinci.app: the Worker Workflows entrypoint, called whenever a long-running RAG pipeline needs to be kicked off.
- connector-sync-worker: the external-ingestion worker that syncs from Dropbox, Drive, and similar third-party connectors into the RAG pipeline.
There are 24 more workers behind these five (tail consumers, internal microservices) — the five above are what’s exposed to the public internet.
Layer 2a — Worker Workflows (three multi-step async pipelines)
Cloudflare Workflows replaced our older Durable-Object-based job runners last year. Three workflows are in production today, all using the step.do("name", async () => {…}) checkpoint pattern so each step is independently retried on failure without re-running the whole pipeline:
- ReindexWithVersionWorkflow: re-embeds an entire customer corpus when the embedding model version changes, and versions the resulting index so a rollback is one binding swap.
- BrowserExtractionWorkflow: extracts text from uploaded documents via the openparse-cf Durable-Object container, then chunks the text and queues the chunks for embedding.
- AudioToRagWorkflow: transcribes audio with Workers AI Whisper, runs speaker diarization through the audio-services Container, chunks the transcript, and queues it for embedding.
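To make the checkpoint pattern concrete, here is a toy, in-memory model of step.do semantics. A step that already completed returns its stored result when the pipeline is replayed, and a failing step is retried on its own. This is not Cloudflare's runtime or its WorkflowStep class. The `ToyStep` class, the three-attempt limit, and the step names are hypothetical and exist only to show the behavior. In a real Worker, the class would extend WorkflowEntrypoint from cloudflare:workers, and the runtime would pass the step object into run(event, step).

```typescript
// Toy model of step.do checkpointing (NOT the Cloudflare runtime): results are
// kept in a Map so a replay of the pipeline skips steps that already succeeded.
class ToyStep {
  private results = new Map<string, unknown>();
  public executions: string[] = [];

  constructor(private maxAttempts = 3) {}

  async do<T>(name: string, fn: () => Promise<T>): Promise<T> {
    // A completed step is not re-run; its stored result is reused.
    if (this.results.has(name)) return this.results.get(name) as T;
    let lastError: unknown;
    for (let attempt = 1; attempt <= this.maxAttempts; attempt++) {
      this.executions.push(`${name}#${attempt}`);
      try {
        const value = await fn();
        this.results.set(name, value);
        return value;
      } catch (err) {
        lastError = err; // retry only this step, not the whole pipeline
      }
    }
    throw lastError;
  }
}

// Hypothetical three-step pipeline shaped like the workflows above.
async function runPipeline(step: ToyStep, flakyEmbed: () => Promise<number>) {
  const text = await step.do("extract", async () => "hello edge rag");
  const chunks = await step.do("chunk", async () => text.split(" "));
  return step.do("embed", async () => {
    await flakyEmbed(); // a transient failure here re-runs only "embed"
    return chunks.length;
  });
}
```

If the embedding call fails once, the recorded executions are extract#1, chunk#1, embed#1, embed#2. Extraction and chunking do not run a second time.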
All three are declared in wrangler.toml like:
```toml
[[env.production.workflows]]
name = "reindex-with-version"
binding = "REINDEX_WITH_VERSION"
class_name = "ReindexWithVersionWorkflow"
```
Layer 2b — Six Queues, tuned for D1’s single-writer limit
Async work flows through six named Queues, each with max_batch_size, max_concurrency, and max_retries tuned to whatever bottleneck the downstream service has. The chunking and api-jobs queues run at 10-batch / 5-concurrency because they write to D1 (whose per-shard writer is single-threaded); the vectorize and reindex queues run hotter at 25/10 because they call external embedding APIs. The d1-sync queue serialises writes to the per-vector D1 shards so two workflows don’t race on the same row.
The lesson we wish we’d learned earlier: Queues are the only thing that keeps a per-customer-sharded D1 setup honest. Without them, a single tenant with a big upload starves everyone else on the same shard until the request times out.
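Here is a sketch of how a consumer like d1-sync might serialise writes. It groups the batch by shard and runs different shards concurrently, but applies messages for the same shard one at a time. Each message is acked only after its write succeeds and is retried otherwise. `QueueMessage` is a minimal local stand-in that mirrors the body, ack(), and retry() members of Cloudflare's Queues Message type. `ShardWrite`, `shardId`, and `writeToShard` are hypothetical names, not Divinci's code.

```typescript
// Minimal stand-in for the Cloudflare Queues Message type (body, ack, retry).
interface QueueMessage<T> {
  body: T;
  ack(): void;
  retry(): void;
}

// Hypothetical message body: which D1 shard to write, and which row.
interface ShardWrite {
  shardId: string;
  rowId: string;
}

// Shards run in parallel; messages within one shard run sequentially,
// so two writes to the same shard never overlap.
export async function consumeBatch(
  messages: QueueMessage<ShardWrite>[],
  writeToShard: (w: ShardWrite) => Promise<void>,
): Promise<void> {
  const byShard = new Map<string, QueueMessage<ShardWrite>[]>();
  for (const msg of messages) {
    const list = byShard.get(msg.body.shardId) ?? [];
    list.push(msg);
    byShard.set(msg.body.shardId, list);
  }
  await Promise.all(
    Array.from(byShard.values()).map(async (shardQueue) => {
      for (const msg of shardQueue) {
        try {
          await writeToShard(msg.body);
          msg.ack();
        } catch {
          msg.retry(); // redelivered later, up to the consumer's max_retries
        }
      }
    }),
  );
}
```

In a deployed Worker, this function would be called from the queue(batch, env, ctx) handler with batch.messages.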
Layer 3 — R2, D1, KV, and Hyperdrive
The storage layer is split across four primitives, each chosen for a different access pattern.
R2 (four buckets per environment) — the bindings are FILES (RAG documents), AUDIO_FILES (source audio for transcription pipelines), PUBLIC_UPLOADS (chat attachments served at signed-URL endpoints), and TEMP_UPLOADS (the presigned-upload landing pad). Zero egress fees are the headline reason. The deeper one is that the same Worker can sign a URL, accept a multi-MB upload, kick off the BrowserExtractionWorkflow, and serve the resulting RAG context, all without a hop off Cloudflare’s edge.
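R2 itself supports S3-style presigned URLs. A simpler pattern some Workers use for their own upload endpoints is an HMAC-signed link: the Worker issues it, then verifies it before writing to a bucket such as TEMP_UPLOADS. The sketch below shows that HMAC pattern using Web Crypto, which is available in Workers. It is not R2's presigned-URL mechanism. The `/upload/` path, the `exp`/`sig` query parameters, and the secret handling are illustrative assumptions.

```typescript
// HMAC-SHA256 over "objectKey:expiry" using Web Crypto (crypto.subtle).
const encoder = new TextEncoder();

function hmacKey(secret: string): Promise<CryptoKey> {
  return crypto.subtle.importKey("raw", encoder.encode(secret), { name: "HMAC", hash: "SHA-256" }, false, ["sign", "verify"]);
}

function hexToBytes(hex: string) {
  const bytes = new Uint8Array(hex.length / 2);
  for (let i = 0; i < bytes.length; i++) bytes[i] = parseInt(hex.slice(i * 2, i * 2 + 2), 16);
  return bytes;
}

// Build a hypothetical upload URL that expires after ttlSeconds.
export async function signUploadUrl(
  origin: string, objectKey: string, secret: string, ttlSeconds: number, nowMs = Date.now(),
): Promise<string> {
  const exp = Math.floor(nowMs / 1000) + ttlSeconds;
  const sig = await crypto.subtle.sign("HMAC", await hmacKey(secret), encoder.encode(`${objectKey}:${exp}`));
  const sigHex = Array.from(new Uint8Array(sig), (b) => b.toString(16).padStart(2, "0")).join("");
  const url = new URL(`/upload/${encodeURIComponent(objectKey)}`, origin);
  url.searchParams.set("exp", String(exp));
  url.searchParams.set("sig", sigHex);
  return url.toString();
}

// Check expiry and signature before accepting the upload.
export async function verifyUploadUrl(urlString: string, secret: string, nowMs = Date.now()): Promise<boolean> {
  const url = new URL(urlString);
  const exp = Number(url.searchParams.get("exp"));
  const sigHex = url.searchParams.get("sig") ?? "";
  if (!Number.isFinite(exp) || exp < Math.floor(nowMs / 1000)) return false;
  if (!/^[0-9a-f]{64}$/.test(sigHex)) return false; // SHA-256 HMAC is 32 bytes
  const objectKey = decodeURIComponent(url.pathname.replace(/^\/upload\//, ""));
  // Let Web Crypto do the comparison rather than comparing hex strings with ===.
  return crypto.subtle.verify("HMAC", await hmacKey(secret), hexToBytes(sigHex), encoder.encode(`${objectKey}:${exp}`));
}
```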
D1 (per-tenant sharded) — each customer gets their own D1 database, with chunks and metadata in normal tables and an FTS5 virtual table for text-only search. Sharding by customer was the only way to avoid the single-writer bottleneck on hot tenants. The cost is that we manage a fan-out across shards in the application layer; the benefit is that one tenant’s spike can’t starve another’s reads.
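For the FTS5 half, user input usually needs escaping before it reaches MATCH. In FTS5, unquoted punctuation such as quotes, `*`, `:`, or `-` is either query syntax or a syntax error. Here is a minimal sketch that assumes a hypothetical FTS5 table named chunks_fts. Each term is wrapped in double quotes, with embedded quotes doubled per FTS5 string rules, so adjacent terms combine through FTS5's implicit AND. Results are ordered by bm25(), where lower scores mean better matches. The binding name in the usage comment is also hypothetical.

```typescript
// Hypothetical table name; D1 is SQLite, so FTS5 query syntax applies.
export const searchSql = `
  SELECT rowid, bm25(chunks_fts) AS score
  FROM chunks_fts
  WHERE chunks_fts MATCH ?1
  ORDER BY score
  LIMIT ?2`;

// Turn free text into a safe FTS5 query: every whitespace-separated term
// becomes a quoted string, and adjacent strings are implicitly ANDed.
export function toFts5Query(input: string): string {
  return input
    .split(/\s+/)
    .filter((term) => term.length > 0)
    .map((term) => `"${term.replace(/"/g, '""')}"`)
    .join(" ");
}

// Usage inside a Worker (the TENANT_DB binding name is hypothetical):
//   const { results } = await env.TENANT_DB.prepare(searchSql)
//     .bind(toFts5Query(userQuery), 10)
//     .all();
```

An empty result from toFts5Query should short-circuit before the query runs, since an empty MATCH expression is an error.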
KV (three namespaces) — CACHE holds JWT validation results and tenant config. EMBEDDING_CACHE maps a content hash to embedding bytes with a 30-day TTL. This was the single biggest cost reduction we made: caching embeddings by content hash cut the daily embedding-API bill by an order of magnitude. VECTORIZE_CACHE is the wrapper layer the vectorize-cache worker uses to memoize vector lookups.
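The content-hash cache is easy to sketch. The version below hashes the chunk text with SHA-256 via Web Crypto and checks the cache. It calls the embedding function only on a miss, then writes the result back with a 30-day TTL (2,592,000 seconds). Several details are assumptions, not taken from the text:

- `KvLike` is a minimal stand-in that mirrors the get and put-with-expirationTtl shape of a Workers KV binding.
- The `emb:` key prefix, the model id in the key, JSON encoding, and the `embed` function are illustrative.
- The text describes the real namespace as storing raw embedding bytes, not JSON.

```typescript
// Minimal stand-in for a Workers KV namespace (text get / put with TTL).
interface KvLike {
  get(key: string): Promise<string | null>;
  put(key: string, value: string, options?: { expirationTtl?: number }): Promise<void>;
}

const THIRTY_DAYS_SECONDS = 30 * 24 * 60 * 60; // 2,592,000

async function sha256Hex(text: string): Promise<string> {
  const digest = await crypto.subtle.digest("SHA-256", new TextEncoder().encode(text));
  return Array.from(new Uint8Array(digest), (b) => b.toString(16).padStart(2, "0")).join("");
}

// Return a cached embedding for identical text, computing it only on a miss.
// Putting the (hypothetical) model id in the key means a model change never
// reuses vectors produced by the previous model.
export async function cachedEmbedding(
  kv: KvLike,
  modelId: string,
  text: string,
  embed: (text: string) => Promise<number[]>,
): Promise<number[]> {
  const key = `emb:${modelId}:${await sha256Hex(text)}`;
  const hit = await kv.get(key);
  if (hit !== null) return JSON.parse(hit) as number[];
  const vector = await embed(text);
  await kv.put(key, JSON.stringify(vector), { expirationTtl: THIRTY_DAYS_SECONDS });
  return vector;
}
```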
Hyperdrive — Postgres connection pooling at the edge. The HYPERDRIVE binding lets a Worker open a Postgres connection without paying the TCP handshake + auth cost on every request. We use it for the small slice of relational data (subscription state, org-level ACLs) that doesn’t fit D1’s sharded model.
Layer 4a — Workers AI (five models in production)
Workers AI is the on-platform inference layer; we use it where the model is small enough that round-tripping to an external provider isn’t worth the latency or cost:
| Model | Task | What it does |
|---|---|---|
| @cf/openai/moderation-stable | content safety | gate every user input through a moderation pass before any other processing |
| @cf/huggingface/distilbert-sst-2-int8 | sentiment | quick classification for routing + analytics |
| @cf/meta/llama-3-8b-instruct | text generation | the small-model fallback for low-stakes answer composition |
| @cf/google/gemma-3-12b-it-preview | text generation | the preview model we use to A/B fine-tunes against |
| @cf/openai/whisper-large-v3-turbo | audio transcription | called from the AudioToRagWorkflow for transcription |
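The table implies an ordering: moderation runs first, and only inputs that pass reach a generation model. Here is a minimal sketch of that gate. It assumes a Workers AI-style binding with a run(model, inputs) method. The moderation verdict sits behind a hypothetical isFlagged function, because the moderation model's output schema isn't described here. The `AiLike` interface, `isFlagged`, and the `lowStakes` routing flag are all illustrative.

```typescript
// Minimal stand-in for the Workers AI binding shape: ai.run(model, inputs).
interface AiLike {
  run(model: string, inputs: Record<string, unknown>): Promise<unknown>;
}

const SMALL_MODEL = "@cf/meta/llama-3-8b-instruct";

type Answer =
  | { status: "blocked" }
  | { status: "answered"; model: string; text: string }
  | { status: "escalate" }; // hand off to an external frontier-model provider

// Gate every input through moderation before any other processing.
export async function answer(
  ai: AiLike,
  userInput: string,
  isFlagged: (input: string) => Promise<boolean>, // hypothetical moderation check
  lowStakes: boolean, // hypothetical routing signal
): Promise<Answer> {
  if (await isFlagged(userInput)) return { status: "blocked" };
  if (!lowStakes) return { status: "escalate" };
  const out = (await ai.run(SMALL_MODEL, {
    messages: [{ role: "user", content: userInput }],
  })) as { response?: string };
  return { status: "answered", model: SMALL_MODEL, text: out.response ?? "" };
}
```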
For frontier-scale generation (Claude, GPT-4-class) we still route to external providers. Workers AI’s catalog is growing but doesn’t yet include the largest models we need for the hardest queries.
Layer 4b — Containers, Email, Analytics, Tail Consumers
Durable-Object Containers are the newest piece of the stack: full Docker images running on the Workers runtime, scoped per DO instance. We run two:
- openparse-cf is a Python PDF parser packaged as a Container, called by the BrowserExtractionWorkflow for document chunking.
- audio-services-container runs ffmpeg + pyannote-audio for speaker diarization, called by the AudioToRagWorkflow. Memory tier standard-2 (6 GB), so the heavier models load without OOM.
Email Workers — a transactional-notification Worker sends product email, and a routing Worker manages inbound mail at email.divinci.app/verified-emails. Both use Cloudflare’s Email Routing primitive instead of an external email API.
Analytics Engine — a Workers Analytics Engine dataset is the structured-event sink for product analytics. Anything we’d previously have sent to Segment/Amplitude lands here first, then forwards downstream.
Tail Consumers (36 workers) — every production worker has its tail_consumers list populated with a dedicated *_tail consumer. Each consumer parses the Worker’s invocation logs and forwards structured events to our observability pipeline. The fan-out is what makes the eight-worker microservice topology debuggable.
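A tail consumer receives an array of trace items describing the producer Worker's invocations. Here is a minimal sketch of the flattening step. It assumes only a subset of the fields on Cloudflare's TraceItem (scriptName, outcome, logs, exceptions), declared locally so the example stands alone. The `StructuredEvent` shape is hypothetical, and forwarding to an observability backend is left out. Non-string log messages are JSON-encoded.

```typescript
// Local subset of Cloudflare's TraceItem type, declared so the sketch is standalone.
interface TraceLogLite { timestamp: number; level: string; message: unknown }
interface TraceExceptionLite { timestamp: number; name: string; message: string }
interface TraceItemLite {
  scriptName: string | null;
  outcome: string;
  logs: TraceLogLite[];
  exceptions: TraceExceptionLite[];
}

// Hypothetical flat event shape sent to the observability pipeline.
interface StructuredEvent {
  worker: string;
  outcome: string;
  kind: "log" | "exception";
  level: string;
  timestamp: number;
  message: string;
}

// Flatten every log line and exception into one event each.
export function toStructuredEvents(items: TraceItemLite[]): StructuredEvent[] {
  const events: StructuredEvent[] = [];
  for (const item of items) {
    const worker = item.scriptName ?? "unknown";
    for (const log of item.logs) {
      events.push({
        worker, outcome: item.outcome, kind: "log", level: log.level, timestamp: log.timestamp,
        message: typeof log.message === "string" ? log.message : JSON.stringify(log.message),
      });
    }
    for (const ex of item.exceptions) {
      events.push({
        worker, outcome: item.outcome, kind: "exception", level: "error",
        timestamp: ex.timestamp, message: `${ex.name}: ${ex.message}`,
      });
    }
  }
  return events;
}

// In a deployed tail Worker this would run inside the handler, e.g.
//   export default { async tail(events, env, ctx) { const out = toStructuredEvents(events); /* forward */ } };
```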
Cron Triggers — production runs an orphan-cleanup job every 30 minutes; stage runs every 10 minutes for tighter feedback while we iterate on the cleanup logic.
A note on Vectorize — what we don’t use, and why
We evaluated Cloudflare Vectorize during the migration and ultimately did not adopt it as our primary vector store. The decision had nothing to do with Vectorize itself — it has improved significantly through 2025–2026. The reason we landed on D1 FTS5 + an external embedding service was that our retrieval architecture is hybrid (lexical + semantic with a calibrated re-ranker on top), and FTS5 in D1 gave us the lexical half of that for free, on the same shard as the document metadata. Adding Vectorize would have introduced a second consistency model — a separate index that has to stay in sync with D1 — for marginal recall improvement at the volumes we run. The VECTORIZE_CACHE KV namespace name is a leftover from the evaluation period; the worker behind it now caches embedding lookups, not Vectorize results.
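Hybrid retrieval needs some way to merge the lexical (FTS5) and semantic candidate lists before re-ranking. The text doesn't say how that merge is done. The sketch below uses Reciprocal Rank Fusion, a common and parameter-light choice, purely for illustration. It is not the calibrated re-ranker described above, which would typically rescore the fused candidates afterwards. k = 60 is the constant conventionally used with RRF, and the chunk ids are hypothetical.

```typescript
// Reciprocal Rank Fusion: score(d) = sum over lists of 1 / (k + rank_d), ranks 1-based.
export function reciprocalRankFusion(rankings: string[][], k = 60): Array<{ id: string; score: number }> {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((id, index) => {
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + index + 1));
    });
  }
  // Highest fused score first; ties broken by id for a stable order.
  return Array.from(scores, ([id, score]) => ({ id, score }))
    .sort((a, b) => b.score - a.score || a.id.localeCompare(b.id));
}

// Example: lexical and semantic rankings over hypothetical chunk ids.
const fused = reciprocalRankFusion([
  ["c3", "c1", "c7"], // lexical (FTS5) order
  ["c1", "c9", "c3"], // semantic order
]);
console.log(fused.map((r) => r.id)); // [ 'c1', 'c3', 'c9', 'c7' ]
```

Chunk c1 wins because it ranks near the top of both lists. A document that appears in only one list, like c9 or c7, is kept but sinks.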
If our retrieval model shifts toward dense-only retrieval at very large scale, Vectorize is the natural next step. Honest answer beats a marketing claim.
The Workers Launchpad Program: What We’re Gaining
Cloudflare’s Workers Launchpad isn’t just credits—it’s a comprehensive accelerator program:
Financial Support:
- Up to $250,000 in cloud credits (one year)
- Eliminates infrastructure costs during critical growth phase
- Enables experimentation without budget constraints
Technical Resources:
- Bootcamp sessions with Cloudflare engineering teams
- Early access to beta products and features
- Design support for architecture optimization
- Direct access to product teams for feedback and bug reports
Network & Growth:
- VC introductions to Cloudflare’s investor network
- Partnership opportunities with Cloudflare’s enterprise customers
- Co-marketing potential with Cloudflare’s brand
Proven Track Record: Since launching in 2022, Workers Launchpad has supported 145 startups from 23 countries. Notable alumni include:
- Nefeli Networks: Acquired by Cloudflare (2024)
- Outerbase: Acquired by Cloudflare (2024)
- Companies now processing billions of monthly requests on Workers
Nearly 1/3 of Cohort #5 were led by female founders—evidence of Cloudflare’s commitment to diverse entrepreneurship.
What This Means for Our Customers
For enterprises using Divinci AI, joining Workers Launchpad translates to tangible benefits:
Performance:
- Sub-100ms AI responses globally: Edge computing eliminates regional bottlenecks
- 99.99% uptime SLA: Cloudflare’s network reliability becomes ours
- Infinite scale: No capacity planning—Workers auto-scale to billions of requests
Privacy & Compliance:
- Data residency: Process data at the edge closest to users
- Zero-knowledge architecture: Cloudflare can’t decrypt customer data
- GDPR/CCPA compliance: Built-in privacy controls and data retention policies
Innovation:
- Beta access: Test cutting-edge features before public release
- Custom integrations: Deeper Cloudflare product integration
- Rapid deployment: Ship new features without infrastructure blockers
Economics:
- Lower costs: Cloudflare’s pricing passes through to customers
- Predictable billing: No surprise egress charges or regional surcharges
- Value reinvestment: Savings redirected to product R&D and customer support
The Road Ahead: Building in Public
Over the coming months, we’ll be documenting our infrastructure migration and lessons learned:
Upcoming deep-dives:
- RAG at the edge: Architecture patterns and performance benchmarks
- D1 for vector metadata: Scaling distributed SQL for AI workloads
- Workflows orchestration: Building multi-step AI pipelines
- Cost analysis: Cloudflare vs. AWS/GCP for AI infrastructure
- Real-world latency: P50/P95/P99 metrics from production traffic
We believe in building in public and sharing knowledge. If you’re building on Cloudflare Workers or exploring edge computing for AI, we’d love to collaborate.
Join Us on This Journey
We’re incredibly excited about this partnership and the opportunities ahead. As we build the future of AI-powered enterprise collaboration, Cloudflare’s platform will remain at the heart of our infrastructure—enabling us to deliver exceptional experiences to teams worldwide.
Want to see it in action?
- Request a demo to explore Divinci AI’s platform
- Follow our blog for technical deep-dives and updates
- Join our community to discuss edge AI architecture
Building on Cloudflare Workers?
If you’re exploring edge computing for AI/ML workloads, we’d love to share lessons learned. Reach out at hello@divinci.ai.
About Workers Launchpad
The Cloudflare Workers Launchpad is Cloudflare’s startup accelerator program, providing funding, technical support, and go-to-market resources to companies building on the Workers platform.
Since 2022, the program has supported 145 startups across 23 countries, with two companies acquired by Cloudflare and dozens processing billions of monthly requests.
Learn more about Cohort #6 and participating companies.
[1] Serverless Performance: Cloudflare Workers, Lambda and Lambda@Edge - Cloudflare Engineering Blog (2024)
[2] The Rise and Evolution of RAG in 2024: A Year in Review - RAGFlow Research (2024)
Ready to Build Your Custom AI Solution?
Discover how Divinci AI can help you implement RAG systems, automate quality assurance, and streamline your AI development process.