Scaling Vector Search: How We Cut Query Latency by 80%

AI · Atharv Mahajan

When we launched Closot's semantic search, it worked beautifully for small workspaces. But as customers grew to 50,000+ pages, query latency climbed past 450ms — unacceptable for a feature that needs to feel instant. Universal search is Closot's most-used feature. Users hit Cmd+K an average of 47 times per day. That means search must be fast, accurate, and aware of permissions — every single time, across every object type in the workspace: pages, board tickets, meeting notes, wiki pages, calendar events, and even tickets created from chat.

This post is a detailed technical walkthrough of how we rebuilt our search infrastructure, the architectural decisions we made, and the numbers that came out the other side.

The problem

Our initial implementation used brute-force cosine similarity across all document embeddings. Simple, accurate, but O(n). Every new page made every search slightly slower. At 50K documents with 1536-dimensional embeddings, we were scanning 300MB of vectors per query.
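The baseline looked roughly like this (a simplified sketch, not our production code — with unit-normalized embeddings, cosine similarity reduces to a dot product, but every query still touches every row):

```python
import numpy as np

def brute_force_search(query: np.ndarray, doc_vectors: np.ndarray, k: int = 10):
    """Scan every document embedding — O(n) per query.

    query:       (d,) unit-normalized query embedding
    doc_vectors: (n, d) unit-normalized document embeddings
    """
    scores = doc_vectors @ query            # (n,) — scans the full matrix
    top = np.argpartition(-scores, k)[:k]   # unordered top-k candidates
    return top[np.argsort(-scores[top])]    # sorted by similarity, best first
```

At 50K documents the `doc_vectors` matrix is the 300MB being scanned on every keystroke, which is exactly where the linear cost comes from.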

The problem compounded as we expanded what gets indexed. In early versions, search only covered pages. But users expected to find everything — a ticket titled "auth regression" on an engineering sprint board, a meeting note from last Tuesday's architecture review, a wiki page about deployment procedures with a verification date from two weeks ago, a calendar event for next week's sprint planning, and a ticket that was created from a chat message by the Closot AI Agent. Each of these object types has different content structures, different metadata, and different access patterns. Our brute-force approach could not scale across all of them.

The solution: HNSW + intelligent caching

We moved to Hierarchical Navigable Small World (HNSW) graphs for approximate nearest-neighbor search. The trade-off — slightly less perfect recall for dramatically faster queries — was worth it. We measured recall@10 at 98.2%, meaning users almost never miss a relevant result.

HNSW works by building a multi-layer graph where each node is an embedding vector and edges connect nearby vectors. At query time, the algorithm starts at the top layer (sparse, for coarse navigation) and descends through denser layers to find the nearest neighbors. The key parameters are M (the number of connections per node; we use 16) and efConstruction (the size of the candidate list maintained while building the index, which governs index quality; we use 200). Higher values mean better recall but slower index builds — we tuned these over three weeks of A/B testing against production queries.
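The layer descent can be illustrated with a toy sketch (pure Python, greedy single-point navigation; real HNSW keeps a candidate beam of size ef at each layer rather than one point):

```python
def greedy_search_layer(entry, query, neighbors, dist):
    """Greedy walk on one layer: hop to the closest neighbor until stuck."""
    current, d_cur = entry, dist(entry, query)
    improved = True
    while improved:
        improved = False
        for nb in neighbors.get(current, []):
            d = dist(nb, query)
            if d < d_cur:
                current, d_cur, improved = nb, d, True
    return current

def hnsw_search(query, layers, entry_point, dist):
    """Descend from the sparse top layer to the dense bottom layer.

    layers: adjacency dicts, top (coarse) layer first, layer 0 last.
    """
    node = entry_point
    for layer in layers[:-1]:              # coarse navigation layers
        node = greedy_search_layer(node, query, layer, dist)
    return greedy_search_layer(node, query, layers[-1], dist)
```

The top layers get the search near the right neighborhood in a few long hops; layer 0 then refines locally, which is why query cost grows roughly logarithmically instead of linearly.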

[Diagram] Search architecture: user query (Cmd+K) → inverted index (BM25) + vector index (HNSW) → permission filter (pre-computed) → learned ranking model → results (<62ms P95)

Hybrid search: Keywords meet semantics

We run two search engines in parallel for every query. A traditional inverted index handles exact keyword matches, acronyms, ticket IDs, and proper nouns. A vector index handles semantic similarity — so "how do we handle refunds" finds the "Return Policy" wiki page even though neither word appears in the query.

Results from both engines are merged using a learned ranking model. The model considers six signals: BM25 text relevance score, cosine similarity from the vector index, document recency (with time decay), the searcher's personal access patterns (which pages they visit most), page popularity within their teamspace, and object type (users searching from a board context get board results boosted). The model was trained on 2.3 million anonymized search sessions and retrained weekly.
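To make the merge concrete, here is a heavily simplified sketch. The production ranker is a learned model; the hand-set linear weights and field names below are purely illustrative of how the six signals combine:

```python
import math
from dataclasses import dataclass

@dataclass
class Candidate:
    doc_id: str
    bm25: float           # keyword relevance (0 if only the vector engine found it)
    cosine: float         # semantic similarity (0 if only BM25 found it)
    age_days: float       # time since last modification
    personal_visits: int  # the searcher's visit count for this document
    popularity: float     # teamspace-level popularity, 0..1
    type_boost: float     # e.g. >1 for board results in a board context

def score(c: Candidate, w=(1.0, 1.0, 0.5, 0.3, 0.2)) -> float:
    recency = math.exp(-c.age_days / 30.0)     # time decay
    personal = math.log1p(c.personal_visits)   # diminishing returns on visits
    base = (w[0] * c.bm25 + w[1] * c.cosine + w[2] * recency
            + w[3] * personal + w[4] * c.popularity)
    return base * c.type_boost

def merge(bm25_hits, vector_hits, k=10):
    """Union the two result sets by doc_id, then rank by combined score."""
    by_id = {}
    for c in list(bm25_hits) + list(vector_hits):
        by_id.setdefault(c.doc_id, c)
    return sorted(by_id.values(), key=score, reverse=True)[:k]
```

The important structural point is the union-then-rerank shape: a document found by only one engine still competes, it just enters with a zero for the other engine's signal.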

Indexing every object type

The challenge of universal search is that different objects have very different structures. Here is how we handle each:

Pages and wiki articles: Chunked into 512-token blocks with 64-token overlap. Each chunk is embedded independently. Title and headers receive 2x weight in the BM25 index. Wiki pages include their verification date as metadata — stale pages (unverified for 90+ days) are ranked lower.

Board tickets: Each ticket is embedded as a single unit: title + description + labels + comments (truncated to 1024 tokens). Tickets on active sprint cycles receive a recency boost. Priority and label metadata are indexed as filterable facets.

Meeting notes: Embedded as full documents. AI-generated summaries are indexed separately as a high-signal representation. Action items extracted from meeting notes are cross-referenced with their linked tickets.

Calendar events: Event titles and descriptions are indexed. Recurring events are deduplicated — we index the template once, not every instance. Calendar search supports natural-language date queries: "meetings about pricing last month" resolves the date range before searching.

Chat-created tickets: Tickets created by the Closot Agent from chat include the original chat message as additional context in their embedding. This means searching for something said in chat surfaces the ticket even if the ticket title was paraphrased by the agent.

Two-tier caching strategy

We layered two caches on top of the index: an in-memory LRU cache for the 1000 most-queried workspaces (holding pre-computed HNSW subgraphs), and a Redis layer for warm embeddings of recently accessed documents. Combined with request batching during high-traffic windows (grouping queries that arrive within a 5ms window and executing them as a batch against the index), P95 latency dropped to 62ms.
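The first tier is a classic LRU keyed by workspace. A minimal sketch of that tier (the Redis tier and the batcher are omitted; class and method names are illustrative):

```python
from collections import OrderedDict

class SubgraphCache:
    """Tier 1: in-memory LRU of pre-built HNSW subgraphs, keyed by workspace."""

    def __init__(self, capacity: int = 1000):
        self.capacity = capacity
        self._data: OrderedDict = OrderedDict()

    def get(self, workspace_id: str):
        if workspace_id not in self._data:
            return None                        # miss — caller falls back to tier 2
        self._data.move_to_end(workspace_id)   # mark most-recently-used
        return self._data[workspace_id]

    def put(self, workspace_id: str, subgraph) -> None:
        self._data[workspace_id] = subgraph
        self._data.move_to_end(workspace_id)
        if len(self._data) > self.capacity:
            self._data.popitem(last=False)     # evict least-recently-used
```

Keying the hot tier by workspace rather than by query works because search traffic is heavily skewed: a small fraction of workspaces generate most of the Cmd+K volume.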

[Chart] P95 query latency (ms), progression of optimizations: brute-force 450ms → HNSW 210ms → +cache 90ms → +batching 62ms → +prefetch 38ms

Permission-aware results

The trickiest part of universal search is not relevance — it is permissions. Every result must respect workspace, teamspace, and page-level access controls. We pre-compute permission sets per user as a bitmap (updated on access-control changes via a webhook) and filter results at the index level using bitmap intersection. Restricted pages never appear even briefly — the filter happens before results are ranked, not after.

For workspaces with complex permission hierarchies (nested teamspaces with inherited and overridden access), the permission bitmap can be up to 128KB per user. We compress these with roaring bitmaps, bringing the median down to 2KB while maintaining O(1) membership checks.
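The filtering mechanic itself is simple bit arithmetic. A sketch using a plain Python integer as the bitmap (production uses compressed roaring bitmaps, but the intersection logic is the same idea):

```python
def build_permission_bitmap(allowed_doc_slots) -> int:
    """Pack a user's readable document slots into one integer bitmask.

    Each indexed document owns a fixed bit position ("slot"); the bitmap
    is rebuilt when an access-control webhook fires.
    """
    bitmap = 0
    for slot in allowed_doc_slots:
        bitmap |= 1 << slot
    return bitmap

def filter_candidates(candidate_slots, user_bitmap: int):
    """Drop candidates the user cannot read — applied before ranking,
    so a restricted page never appears even transiently."""
    return [s for s in candidate_slots if (user_bitmap >> s) & 1]
```

Because membership is a single bit test, the filter adds effectively constant work per candidate regardless of how deep the permission hierarchy that produced the bitmap is.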

Embedding pipeline at scale

Documents are embedded asynchronously on save. We chunk pages into 512-token blocks with 64-token overlap, embed each chunk with our fine-tuned model, and store them alongside metadata (page ID, workspace ID, object type, last modified, permission scope). The pipeline processes roughly 2M chunks daily with a 99.97% success rate. Failed embeddings are retried with exponential backoff, and a dead-letter queue catches persistent failures for manual inspection.
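The chunking step is worth showing concretely, since the overlap is what keeps sentences that straddle a boundary visible to both chunks (a minimal sketch of the sliding window, not the full pipeline):

```python
def chunk_tokens(tokens, size: int = 512, overlap: int = 64):
    """Split a token sequence into overlapping fixed-size blocks.

    Consecutive chunks share `overlap` tokens, so content near a
    boundary is embedded in both neighbors.
    """
    if size <= overlap:
        raise ValueError("chunk size must exceed overlap")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):   # last chunk reached the end
            break
    return chunks
```

Each chunk then goes to the embedding model independently and is stored with its metadata, so a hit on any chunk resolves back to the parent page.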

The embedding model itself is a fine-tuned variant of a 768-dimensional model (down from the 1536-dimensional model we started with). We found that fine-tuning on workspace-specific content — technical docs, meeting notes, ticket descriptions — let us halve the embedding dimension while actually improving recall@10 by 1.3 percentage points. This cut our vector storage costs by 50% and improved HNSW search speed by roughly 30%.

What comes next

We are working on three improvements. First, personalized re-ranking that learns individual users' search patterns over time. Second, cross-workspace search for organizations with multiple Closot workspaces. Third, real-time index updates — currently there is a 2-3 second delay between a page save and its appearance in search results; we want to bring this under 500ms.

The result today: search that feels instant regardless of workspace size, with semantic understanding that surfaces results keyword search would miss, across every object type in your workspace — pages, tickets, meeting notes, wiki articles, calendar events, and chat-originated content. All under 62ms at P95.
