Tag Search That Scales: Why You Must Store BOTH item→tags and tag→items

Tag index diagram

In a tagging-service interview (or design review), don’t hand-wave "search by tag." At scale, a naive approach won't meet latency or cost requirements. The key is a reverse index: you must store both the forward mapping (item → tags) and the reverse mapping (tag → items).

The problem: scanning is infeasible

If you only store item_id → [tags], searching for items with tag "cat" requires scanning every item and checking its tags. With millions or billions of items that becomes impossible to return results in a few hundred milliseconds (or even seconds).

Correct design: keep two views

Maintain two complementary data views:

item_tags: item_id → [tags]
- Fast reads for a specific item (fetch its tags).
tag_items: tag → [item_ids]
- Fast search by tag (fetch items that have that tag).

Example (conceptual):

item_tags:
  item:123 -> ["cat", "cute"]
  item:456 -> ["dog", "cute"]

tag_items:
  "cat"  -> [item:123, item:789, ...]
  "cute" -> [item:123, item:456, ...]

This is essentially an inverted index for tags.

Writes: unavoidable duplication and write amplification

Yes — every add/remove of a tag must update both tables. That increases write amplification and storage, but it's necessary for fast reads. The real question interviewers want you to address is: how do you keep them consistent?

Consistency strategies

Pick based on the guarantees and tooling available to you.

1) Strong consistency (preferred when supported)

Perform both updates in a single atomic transaction (ACID) if your datastore supports multi-row or multi-partition transactions.
Example: two-row transaction in a relational DB or a transactional key-value store.
Pros: simple correctness model, reads always consistent.
Cons: may limit throughput or complicate distributed transactions.

2) Asynchronous eventual consistency (when transactions are impractical)

Emit a change event to an ordered durable log (e.g., Kafka) on tag add/remove.
A background worker consumes events and updates the tag_items view.
Make the update idempotent and include a monotonic operation ID or version.
Use retries with exponential backoff and dead-lettering for failures.
Pros: higher write throughput, decouples read/write paths.
Cons: small window of inconsistency between views.

3) Hybrid approaches

Synchronous update of the primary view (item_tags) + fast best-effort update of tag_items, plus background reconciliation.
Use compare-and-swap or version numbers to avoid lost updates when reconciling.

4) Reconciliation and repair

Periodic full or incremental reconciliation jobs: scan item_tags and rebuild or check tag_items for drift.
Maintain a change-log or vector clocks to reduce reconciliation scope.
Track metrics (lag, failure rate) and alerts for anomalies.

Practical reliability techniques

Make tag_items updates idempotent (set-add, set-remove semantics), so retries are safe.
Include operation metadata (timestamp, op-id) to skip duplicates or out-of-order events.
Use tombstones for deletes to avoid resurrection during eventual consistency windows.
For high-cardinality tags, paginate and shard tag_items (shard by tag or by tag+bucket).
Cache hot tags in memory (Redis) and backfill asynchronously.

Operational considerations

Storage overhead: you store tag references twice (or more) — plan capacity accordingly.
Latency: reading tag → items becomes O(1) (or O(shard)), so queries are fast.
Hot tags: some tags will be very popular — shard, cache, or rate-limit queries for them.
Sorting and paging: store timestamps or scores with item_ids (e.g., for recent-first or ranked results).

Quick interview checklist (what to say)

Describe both views: item_tags and tag_items.
Explain the write amplification trade-off and why it's necessary.
Explain consistency options: atomic transactions vs async + retries + reconciliation.
Mention idempotency, ordering, tombstones, and reconciliation jobs.
Mention sharding, caching, and how you'd handle hot tags and pagination.

TL;DR

Do not design tag search by scanning items. Store both item→tags and tag→items. Handle the resulting write amplification explicitly: use transactions when possible or an async, idempotent event-driven pipeline with reconciliation and monitoring.

Hashtags: #SystemDesign #DataEngineering #InterviewPrep

Tag Search That Scales: Why You Must Store BOTH item→tags and tag→items

Tag Search That Scales: Why You Must Store BOTH item→tags and tag→items

The problem: scanning is infeasible

Correct design: keep two views

Writes: unavoidable duplication and write amplification

Consistency strategies

Practical reliability techniques

Operational considerations

Quick interview checklist (what to say)

TL;DR

Comments

More from this blog

Caching in System Design: The Rules Interviewers Expect You to Know

Caching in System Design: The Rules Interviewers Expect You to Know

High-Score TikTok ML Interview: RecSys Deep Dive and Fast, Bug‑Free Coding

High-Score TikTok ML Interview Experience: RecSys Deep Dive & Fast Coding

Meta SWE Manager Interview — What Really Gets Tested (High-Score Bugfree Users)

Command Palette

Tag Search That Scales: Why You Must Store BOTH item→tags and tag→items

The problem: scanning is infeasible

Correct design: keep two views

Writes: unavoidable duplication and write amplification

Consistency strategies

Practical reliability techniques

Operational considerations

Quick interview checklist (what to say)

TL;DR

Comments

More from this blog