Building A Better Semantic Cache with HyperBinder
A recent post describes a familiar situation: LLM API costs growing 30% month over month, 47% of queries arriving as semantic duplicates, and a semantic caching system that cut costs by 73%. The system works, but it is more complex than it needs to be. Here we show that with HyperBinder, our next-generation neurosymbolic retrieval and reasoning engine, semantic caching comes essentially for free, with no data labeling, no per-type threshold tuning, and none of the other overhead.
The author needed a vector store, a response store, a query classifier, per-type similarity thresholds tuned across four categories, and a human labeling pipeline with 5,000 query pairs scored by three annotators each, just to answer a question that should be structural: are these two queries asking the same thing in the same context?
Below we walk through building that same cache with HyperBinder, with no labeled data and no per-type threshold tuning. The payoff is simple: fewer redundant LLM calls + no manual data labeling = saved time and money.
What Is Semantic Caching?
In case you're unfamiliar: semantic caching reduces the cost of LLM API calls by storing responses to previous queries and reusing them when a sufficiently similar query arrives. The query is encoded as a vector, and vector similarity is used to find previously answered queries that are close in meaning.
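As a toy illustration of the flat-embedding baseline (not HyperBinder), a minimal semantic cache can be sketched as follows. The `toy_embed` bag-of-words function is a stand-in for a real sentence embedding model:

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

VOCAB = ["how", "do", "i", "cancel", "my", "subscription", "order", "please"]

def toy_embed(text):
    """Stand-in for a real embedding model: bag-of-words counts over a tiny vocabulary."""
    words = text.lower().replace("?", "").split()
    return [words.count(w) for w in VOCAB]

class FlatSemanticCache:
    """Flat-embedding cache: one vector per query, reuse on high similarity."""
    def __init__(self, embed, threshold=0.9):
        self.embed = embed
        self.threshold = threshold
        self.entries = []  # (vector, response) pairs

    def put(self, query, response):
        self.entries.append((self.embed(query), response))

    def get(self, query):
        v = self.embed(query)
        scored = [(cosine(v, ev), resp) for ev, resp in self.entries]
        if not scored:
            return None
        best_score, best_resp = max(scored)
        return best_resp if best_score >= self.threshold else None
```

With a high threshold the rephrasing hits and the cross-domain query misses in this toy, but real embedding spaces do not separate so cleanly, which is exactly the problem examined next.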
The False Positive That Breaks Everything
The issue is that pure vector similarity cannot distinguish what a question is about from how it is phrased.
Here is the example from the article:
```
Query:      "How do I cancel my subscription?"
Cached:     "How do I cancel my order?"
Similarity: 0.87
```
These questions are 87% similar in embedding space, yet they have completely different answers. The article's solution is to raise the threshold to somewhere between 0.92 and 0.97 depending on query type, then build a classifier to route each query to the right threshold.
This works. But it treats a structural problem with statistical tools, and it makes the difference between a good and a bad output a matter of a few decimal points. The real issue is that the embedding does not know these queries belong to different domains.
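A toy bag-of-words model makes the failure mode concrete: the two queries share most of their tokens, so any representation that mixes topic and phrasing into one vector scores them as close. (The 0.87 above comes from a real sentence embedding model; the toy number below is merely in the same neighborhood.)

```python
import math

def bow(text, vocab):
    """Bag-of-words counts: a crude stand-in for a sentence embedding."""
    words = text.lower().replace("?", "").split()
    return [words.count(w) for w in vocab]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

vocab = ["how", "do", "i", "cancel", "my", "subscription", "order"]
a = bow("How do I cancel my subscription?", vocab)
b = bow("How do I cancel my order?", vocab)
similarity = cosine(a, b)  # 5 of 6 tokens shared -> roughly 0.83
```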
What If The Cache Knew About Structure?
Consider a different representation. Instead of encoding the entire query as a single vector and hoping the threshold separates intent from phrasing, encode the query's structure:
| API field | Value | Encoding role |
|---|---|---|
| `query` | "How do I cancel?" | SEMANTIC, for similarity matching |
| `context` | "subscription" | EXACT, for domain isolation |
| `response` | "Go to Settings > ..." | SEMANTIC, for stored response context |
Now "How do I cancel my subscription?" and "How do I cancel my order?" are separated by construction. The semantic component absorbs phrasing variation ("cancel", "end", "stop my"); the exact component enforces domain isolation. No per-type threshold tuning is required to prevent cross-domain contamination.
This is what HyperBinder's compositional encoding does. A cache entry is a structured record with a multi-slot schema, and each slot has its own encoding type. Search operates across slots simultaneously: semantic similarity on the query text, exact matching on the context, and a weighted combination for the final score.
```python
from hybi import HyperBinder, SemanticCache

# `model` is any sentence-embedding encoder, e.g. a SentenceTransformer
hb = HyperBinder(local=True, encode_fn=model.encode)
cache = SemanticCache(hb, collection="llm_cache")

cache.put(
    query="How do I cancel my subscription?",
    context="subscription",
    response="Go to Settings > Subscription > Cancel.",
)

hit = cache.get("Cancel my subscription please", context="subscription")
miss = cache.get("Cancel my subscription please", context="order")
```
The first lookup is a hit. The second is a miss, by construction.
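HyperBinder's internals aren't shown here, but the lookup semantics can be sketched with a hypothetical slot-wise scorer: the EXACT slot acts as a hard gate, and the SEMANTIC slot contributes graded similarity only when the gate passes.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def slot_score(query_vec, query_ctx, entry_vec, entry_ctx):
    """Hypothetical two-slot scorer: EXACT context gate, then SEMANTIC similarity."""
    if query_ctx != entry_ctx:
        return 0.0  # cross-domain entries score zero by construction
    return cosine(query_vec, entry_vec)
```

However the production scorer weights its slots, the key property survives: no threshold value can admit a cross-context hit.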
What This Eliminates
The article's production system requires several extra moving parts that become unnecessary when structure is explicit:
| Component | Purpose | HyperBinder equivalent |
|---|---|---|
| Vector store (FAISS / Pinecone) | Similarity search | Built into HyperBinder |
| Response store (Redis / DynamoDB) | Store cached responses | Built into HyperBinder |
| QueryClassifier | Route to per-type thresholds | Not needed because context is explicit |
| Per-type thresholds | Prevent false positives | Not needed because EXACT encoding handles isolation |
| Human labeling (5K pairs × 3 annotators) | Tune thresholds | Not needed because structure provides ground truth |
The labeling pipeline is the most telling part. You need human annotators because the embedding space genuinely cannot distinguish "cancel subscription" from "cancel order": they are close in every dimension. No amount of threshold tuning fixes this cleanly, because you are drawing a decision boundary through a region where the signal you care about (domain) is not directly encoded.
Compositional encoding puts that signal into the representation itself.
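One way to see why exact slots compose cleanly with similarity search (a sketch of the idea, not HyperBinder's actual encoding) is to map each context label to its own one-hot key vector. Keys for different contexts are orthogonal, so their contribution to any combined similarity score is exactly zero:

```python
_context_index = {}

def context_key(context, dim=64):
    """Assign each context label a dedicated one-hot vector.
    Distinct labels occupy distinct coordinates, so their dot product is 0."""
    idx = _context_index.setdefault(context, len(_context_index))
    vec = [0.0] * dim
    vec[idx] = 1.0
    return vec

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))
```

Orthogonality is a property of the representation, not of a tuned boundary, which is why no labeled pairs are needed to establish it.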
Production-Ready Features
A cache that is only accurate is not enough. TTL, filtering, and staleness all matter in production:
```python
from datetime import timedelta

cache = SemanticCache(
    hb,
    collection="llm_cache",
    threshold=0.65,
    default_ttl=timedelta(hours=4),
    should_cache=lambda q, ctx, r: "personal" not in r.lower(),
)

# Short TTL for volatile data
cache.put("Current BTC price?", "crypto", "$67,200", ttl=timedelta(minutes=5))

# Invalidate an entire domain at once
cache.invalidate(context="pricing")

# Warm the cache from an existing FAQ dataset
cache.seed(faq_dataframe)
```
TTL is stored as an expiry timestamp alongside each entry. There is no background sweeper and no separate TTL store. Expired entries are filtered during get(), so stale data is never returned. The should_cache callback runs before put(), giving you application-level control over what enters the cache.
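The read-time filtering pattern is easy to picture in isolation. A minimal sketch (hypothetical names, not the library's internals):

```python
import time

class ReadFilteredStore:
    """Expiry timestamps stored per entry; expired entries are dropped at read
    time, so no background sweeper or separate TTL store is needed."""
    def __init__(self):
        self._entries = {}  # key -> (value, expires_at or None)

    def put(self, key, value, ttl_seconds=None):
        expires = time.monotonic() + ttl_seconds if ttl_seconds is not None else None
        self._entries[key] = (value, expires)

    def get(self, key):
        item = self._entries.get(key)
        if item is None:
            return None
        value, expires = item
        if expires is not None and time.monotonic() >= expires:
            del self._entries[key]  # lazy eviction: stale data is never returned
            return None
        return value
```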
Benchmark Results
We tested against 275 canonical query-response pairs across 11 domains (account, billing, loyalty/rewards, orders,
privacy/security, product info, promotions, returns, shipping, subscriptions, technical support), with 1,375 semantic
rephrasings (5 per canonical), 30 cross-context adversarial pairs, and 20 intra-context confusable pairs, using
all-MiniLM-L6-v2 embeddings. The cache was populated with 11,000 entries (canonicals plus domain-specific filler).
Threshold sweep with a single global threshold and no per-type threshold tuning:
| Threshold | Recall | Cross-context FP rate |
|---|---|---|
| 0.50 | 1.000 | 0.0% |
| 0.60 | 0.995 | 0.0% |
| 0.65 | 0.981 | 0.0% |
| 0.70 | 0.928 | 0.0% |
| 0.75 | 0.803 | 0.0% |
Cross-context false positives are zero at every threshold tested, including 0.50. Context isolation is algebraic, not threshold-dependent: EXACT-encoded context keys produce orthogonal vectors, so cross-domain leakage stays at zero no matter how permissive the similarity threshold is.
Intra-context discrimination (confusable query pairs within the same domain, such as "cancel my subscription" versus "pause my subscription") scored 100% accuracy across all 20 confusable pairs.
We validated cross-context isolation at scale with 30 adversarial pairs designed to confuse flat-embedding caches,
including pairs like "cancel my subscription" versus "cancel my order" and "update my billing address" versus "update my
shipping address". Zero leaks at every scale point tested:
| Scale | Recall | Cross-context leaks |
|---|---|---|
| 275 | 0.907 | 0/60 |
| 1,000 | 0.928 | 0/60 |
| 5,000 | 0.974 | 0/60 |
| 11,000 | 0.981 | 0/60 |
Recall holds strong as the cache grows, with no degradation from increased density. Cross-context isolation holds perfectly at every scale.
Lookup latency (excluding embedding generation) stays sub-millisecond at typical cache sizes:
| Scale | p50 | p95 | p99 |
|---|---|---|---|
| 275 | 0.3ms | 0.4ms | 0.5ms |
| 1,000 | 0.2ms | 0.4ms | 0.6ms |
| 5,000 | 0.4ms | 0.6ms | 0.8ms |
| 11,000 | 0.7ms | 1.2ms | 2.6ms |
Even at 11,000 entries, p99 is 2.6ms — orders of magnitude cheaper than the LLM call it replaces.
The Deeper Principle
Flattening a structured query into a single vector is a lossy operation. The domain axis, the phrasing axis, and the intent axis all get compressed into the same dimensions. A classifier or per-type threshold is an attempt to recover that lost structure after the fact.
Compositional encoding avoids the loss in the first place. Each axis gets its own encoding type, and similarity respects that structure from the start.
This is why no labeling is needed. Instead of trying to reconstruct a decision boundary from a representation that discarded the distinctions you care about, you preserve those distinctions directly in the representation.
The original article advises "don't use a single global threshold", and for flat embeddings, that advice is correct. When domain information is compressed into the same vector space as phrasing and intent, different query types genuinely need different thresholds to separate them. Compositional encoding removes that requirement. Because domain isolation is handled structurally by the EXACT-encoded context slot, a single global threshold only needs to balance recall against miss rate — a one-dimensional tuning problem instead of a per-category one.
Getting Started
```
pip install hybi
```
```python
from hybi import HyperBinder, SemanticCache
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
hb = HyperBinder(local=True, encode_fn=model.encode)
cache = SemanticCache(hb, collection="my_cache")

cache.seed(faq_dataframe)

# Inside your request handler:
hit = cache.get(user_query, context=domain)
if hit:
    return hit.response
```
No vector store to provision. No response store to manage. No classifier to train. No threshold labeling loop. Only structure. Compositional encoding eliminates the false-positive problem that drives the labeling and classification pipeline: structure does the work that statistics cannot.
If the prospect of doing more with vectors intrigues you, we're proud to announce we're now in beta. This is just one of several bleeding-edge features our platform enables. Feel free to sign up if you'd like to see for yourself.