Resumable Chunked Uploads

Backup Interviews: The One Detail That Separates Seniors—Resumable, Chunked Uploads

In cloud backup systems, one of the simplest reliability levers is also one that interviewers love to ask about: resumable, chunked uploads. Backups regularly fail mid-transfer: Wi‑Fi drops, laptops sleep, cellular connections blip. If you upload each file as a single atomic transfer, any interruption sends you back to zero and you miss backup windows. Splitting data into chunks and making uploads resumable means a failure costs you at most one chunk of work, not the whole file.

Below is a practical, interview-friendly design pattern you can explain and defend.

Problem

Long transfers are brittle: network interruptions, client restarts, and transient server errors will abort uploads.
Re-uploading entire files wastes time, bandwidth, and energy, and makes it more likely that a backup overruns its window.

High-level solution

Split files into fixed-size chunks.
Hash each chunk and upload it through an idempotent API like putChunk(backupId, chunkIndex, checksum, data).
Persist progress (which chunks are stored) in per-backup metadata so the client can resume exactly where it stopped.
Add exponential backoff on the client, plus integrity checks and duplicate handling on the server.

This combination makes uploads resumable, verifiable, and efficient.

Core components

1) Chunking

Use fixed-size chunks (e.g., 4–16 MiB). Fixed size simplifies indexing and retry logic. Consider variable chunking only for dedup-heavy designs.
Compute a checksum per chunk with SHA-256 or another strong cryptographic hash. This enables integrity checks and idempotency.

Trade-offs:

Small chunks: better resume granularity, larger index/metadata.
Large chunks: fewer round trips, worse restart cost on failure.
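
To make the chunking step concrete, here is a minimal sketch in Python. The names iter_chunks, chunk_count, and CHUNK_SIZE are illustrative, not part of any real backup API, and the "sha256:" prefix simply mirrors the checksum format used in the metadata example in this post. It assumes a buffered binary stream such as one returned by open(path, "rb").

```python
import hashlib
import io
from typing import BinaryIO, Iterator, Tuple

CHUNK_SIZE = 4 * 1024 * 1024  # 4 MiB, the low end of the suggested range


def iter_chunks(stream: BinaryIO, chunk_size: int = CHUNK_SIZE) -> Iterator[Tuple[int, str, bytes]]:
    """Yield (chunk_index, checksum, data) for each fixed-size chunk of the stream."""
    index = 0
    while True:
        data = stream.read(chunk_size)
        if not data:
            break  # end of stream
        checksum = "sha256:" + hashlib.sha256(data).hexdigest()
        yield index, checksum, data
        index += 1


def chunk_count(file_size: int, chunk_size: int = CHUNK_SIZE) -> int:
    """Number of chunks needed for file_size bytes (the last chunk may be short)."""
    return (file_size + chunk_size - 1) // chunk_size


# Small demonstration with a 10-byte payload and 4-byte chunks.
demo = list(iter_chunks(io.BytesIO(b"abcdefghij"), chunk_size=4))
print([data for _, _, data in demo])  # [b'abcd', b'efgh', b'ij']
```

For a sense of the trade-off: a 1 GiB file is 256 chunks at 4 MiB and 64 chunks at 16 MiB, so the larger size cuts round trips by 4x while quadrupling the worst-case work lost to a single failed chunk.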

2) Idempotent chunk upload API

Provide a server API like:

putChunk(backupId, chunkIndex, checksum, data)

Server behavior:

If chunkIndex already stored with the same checksum: return success (idempotent).
If stored with different checksum: reject (checksum mismatch) to avoid corruption.
If not present: store chunk and mark it in metadata.

This avoids duplicate work and ensures safe retries.
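
The three server rules above can be sketched with an in-memory store. This is an illustration under assumptions, not a production design: InMemoryChunkStore, ChecksumMismatch, and the "stored" / "already_stored" return values are made-up names, and a real server would write to durable storage. The sketch also verifies the payload against the supplied checksum before applying the rules.

```python
import hashlib
from typing import Dict, Tuple


class ChecksumMismatch(Exception):
    """The payload or the stored chunk disagrees with the supplied checksum."""


class InMemoryChunkStore:
    def __init__(self) -> None:
        # (backup_id, chunk_index) -> (checksum, data)
        self._chunks: Dict[Tuple[str, int], Tuple[str, bytes]] = {}

    def put_chunk(self, backup_id: str, chunk_index: int, checksum: str, data: bytes) -> str:
        # Verify the bytes we received actually hash to the claimed checksum.
        actual = "sha256:" + hashlib.sha256(data).hexdigest()
        if actual != checksum:
            raise ChecksumMismatch("payload does not match supplied checksum")

        key = (backup_id, chunk_index)
        existing = self._chunks.get(key)
        if existing is not None:
            if existing[0] == checksum:
                return "already_stored"  # idempotent retry: succeed without rewriting
            raise ChecksumMismatch("chunk index already stored with a different checksum")

        self._chunks[key] = (checksum, data)
        return "stored"
```

Because a repeated call with the same arguments returns success without side effects, the client can retry blindly after a timeout without knowing whether the first attempt landed.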

3) Persisted progress (metadata)

Keep a per-backup metadata object that tracks which chunk indices are already accepted. Minimal metadata schema:

{
  "backupId": "...",
  "fileId": "...",
  "chunkSize": 4194304,
  "chunks": {
    "0": "sha256:...",
    "1": "sha256:...",
    "4": "sha256:..."
  },
  "status": "in_progress"
}

The client can query this metadata and resume uploading only missing chunks. Persist metadata atomically (or use a compare-and-set) so progress is never lost.
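
As a small illustration of the resume step, the sketch below computes which chunk indices still need uploading from a metadata object shaped like the one above. The function name missing_chunks is hypothetical, and the total chunk count is an assumption: it is not in the minimal schema, so here the client derives it from the file size.

```python
from typing import Any, Dict, List


def missing_chunks(metadata: Dict[str, Any], chunk_count: int) -> List[int]:
    """Return the chunk indices the server has not yet accepted, in order."""
    # JSON object keys are strings, so convert them back to integers.
    stored = {int(index) for index in metadata["chunks"]}
    return [i for i in range(chunk_count) if i not in stored]


metadata = {
    "backupId": "...",
    "fileId": "...",
    "chunkSize": 4194304,
    "chunks": {"0": "sha256:...", "1": "sha256:...", "4": "sha256:..."},
    "status": "in_progress",
}

# Assuming the file splits into 6 chunks, only three remain to upload.
print(missing_chunks(metadata, chunk_count=6))  # [2, 3, 5]
```

Note that the stored indices do not have to be contiguous: chunk 4 landed before chunks 2 and 3, which is exactly what happens when chunks are uploaded in parallel.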

4) Retries and backoff

Use exponential backoff with jitter for transient errors.
On client restart, re-check metadata and only upload missing chunks.
Limit retry budget per chunk to avoid infinite loops.
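
One common way to implement this is "full jitter": sleep for a random duration between zero and an exponentially growing, capped ceiling. The sketch below is one possible shape, with illustrative names (backoff_delay, with_retries, TransientError, RetryBudgetExceeded) and arbitrary default values for the base delay, cap, and attempt budget.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """A retryable failure such as a timeout or dropped connection."""


class RetryBudgetExceeded(Exception):
    """The operation kept failing until the attempt budget ran out."""


def backoff_delay(attempt: int, base: float = 0.5, cap: float = 30.0) -> float:
    """Full-jitter delay: uniform in [0, min(cap, base * 2**attempt)] seconds."""
    ceiling = min(cap, base * (2 ** attempt))
    return random.uniform(0, ceiling)


def with_retries(
    operation: Callable[[], T],
    max_attempts: int = 5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run operation, retrying transient failures within a fixed budget."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts - 1:
                break  # budget spent; do not sleep before giving up
            sleep(backoff_delay(attempt))
    raise RetryBudgetExceeded(f"gave up after {max_attempts} attempts")
```

The jitter matters when many clients fail at the same moment (for example, after a server restart): without it they would all retry in lockstep and hit the server in synchronized waves.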

5) Integrity and verification

The server recomputes each chunk's checksum on receipt and compares it with the client-supplied value; it rejects the upload if they differ.
Optionally: re-verify stored chunks later (for example, at finalize time) to catch corruption that happened after the write.
After all chunks are uploaded, perform a final composition step that verifies the assembled file hash matches the expected file hash (if provided).
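
The final verification step might look like the sketch below. It assumes the file digest is a SHA-256 over the chunks concatenated in index order, which is one reasonable convention and not the only one (some systems hash the list of chunk checksums). The names finalize_backup, IncompleteBackup, and DigestMismatch are hypothetical.

```python
import hashlib
from typing import Dict, Optional


class IncompleteBackup(Exception):
    """Finalize was called before every chunk was stored."""


class DigestMismatch(Exception):
    """The assembled file does not hash to the expected digest."""


def finalize_backup(
    chunks: Dict[int, bytes],
    chunk_count: int,
    expected_digest: Optional[str] = None,
) -> str:
    """Check that all chunks are present, then hash them in index order."""
    missing = [i for i in range(chunk_count) if i not in chunks]
    if missing:
        raise IncompleteBackup(f"missing chunks: {missing}")

    hasher = hashlib.sha256()
    for i in range(chunk_count):
        hasher.update(chunks[i])  # stream chunk by chunk; no full-file buffer needed
    digest = "sha256:" + hasher.hexdigest()

    if expected_digest is not None and digest != expected_digest:
        raise DigestMismatch(f"expected {expected_digest}, got {digest}")
    return digest
```

Feeding chunks into a single hasher incrementally gives the same digest as hashing the whole file at once, so the server never has to hold the assembled file in memory.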

Example resumable upload flow (client)

Split file into chunkCount chunks and compute checksums.
Request or create backupId and read metadata about already-uploaded chunks.
For each missing chunk:
- putChunk(backupId, chunkIndex, checksum, data) with retries/backoff
- On success, record progress locally or rely on server metadata
When all chunks are present, call finalizeBackup(backupId, expectedDigest), which triggers server-side verification and composition.

Pseudocode:

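import hashlib
import random
import time

MAX_ATTEMPTS = 5

class TransientError(Exception): pass  # retryable failure (timeout, dropped connection)
class UploadFailed(Exception): pass    # retry budget exhausted for a chunk

def upload_backup(client, backupId, chunks, fileChecksum):
    # `client` stands in for whatever object exposes the API calls named above.
    for i, chunk in enumerate(chunks):
        if client.server_has_chunk(backupId, i):
            continue  # already stored on the server; skip it
        checksum = "sha256:" + hashlib.sha256(chunk).hexdigest()
        for attempt in range(MAX_ATTEMPTS):
            try:
                client.putChunk(backupId, i, checksum, chunk)
                break
            except TransientError:
                # exponential backoff with jitter, capped at 30 seconds
                time.sleep(random.uniform(0, min(30, 2 ** attempt)))
        else:
            # the loop finished without a successful put
            raise UploadFailed(f"chunk {i} failed after {MAX_ATTEMPTS} attempts")
    client.finalizeBackup(backupId, fileChecksum)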

Server-side considerations

Atomic metadata updates: use compare-and-swap to avoid races when multiple clients upload the same backup.
Garbage collection: remove orphaned chunks after a timeout or when a backup is abandoned.
Authorization: ensure clients can only write chunks for their backups.
Storage layout: store chunks keyed by (backupId, chunkIndex) or by checksum (content-addressed) to enable deduplication.
Concurrency: allow parallel chunk uploads to speed up large backups; throttle to control IO.
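
To illustrate the atomic-metadata point, here is a sketch of compare-and-swap using a version number on each metadata record. Everything here is an assumption made for illustration: MetadataStore, VersionConflict, and mark_chunk_stored are invented names, and the lock-protected dictionary stands in for whatever conditional-write primitive your real datastore offers.

```python
import threading
from typing import Dict, Tuple


class VersionConflict(Exception):
    """Another writer updated the record since it was read."""


class MetadataStore:
    def __init__(self) -> None:
        self._lock = threading.Lock()
        # backup_id -> (version, {chunk_index: checksum})
        self._records: Dict[str, Tuple[int, Dict[int, str]]] = {}

    def read(self, backup_id: str) -> Tuple[int, Dict[int, str]]:
        with self._lock:
            version, chunks = self._records.get(backup_id, (0, {}))
            return version, dict(chunks)  # copy so callers cannot mutate stored state

    def compare_and_set(self, backup_id: str, expected_version: int, chunks: Dict[int, str]) -> int:
        """Write only if nobody else has written since expected_version was read."""
        with self._lock:
            current_version, _ = self._records.get(backup_id, (0, {}))
            if current_version != expected_version:
                raise VersionConflict(f"expected v{expected_version}, found v{current_version}")
            self._records[backup_id] = (current_version + 1, dict(chunks))
            return current_version + 1


def mark_chunk_stored(store: MetadataStore, backup_id: str, chunk_index: int,
                      checksum: str, max_retries: int = 10) -> int:
    """Read-modify-write loop: on conflict, re-read and try again."""
    for _ in range(max_retries):
        version, chunks = store.read(backup_id)
        chunks[chunk_index] = checksum
        try:
            return store.compare_and_set(backup_id, version, chunks)
        except VersionConflict:
            continue  # someone else won the race; retry against fresh state
    raise VersionConflict(f"could not update {backup_id} after {max_retries} tries")
```

The key property is that a writer holding a stale copy of the metadata can never overwrite another writer's progress; it is forced to re-read and merge instead.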

Additional enhancements

Content-addressed storage: store chunks by checksum to deduplicate across backups and users (watch multi-tenant privacy/legal constraints).
Client-side encryption: encrypt chunks before upload; keep only wrapped (encrypted) keys or key references in per-backup metadata so the server never holds plaintext decryption keys (or use zero-knowledge patterns).
Partial restores: let clients request ranges or subsets of chunks for faster restores.
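
Fixed-size chunks make partial restores a matter of integer arithmetic. The sketch below maps a requested byte range to the chunk indices that cover it; chunks_for_range is an illustrative helper, not part of any API described in this post.

```python
def chunks_for_range(offset: int, length: int, chunk_size: int) -> range:
    """Return the chunk indices covering bytes [offset, offset + length)."""
    if offset < 0 or length <= 0 or chunk_size <= 0:
        raise ValueError("offset must be >= 0; length and chunk_size must be > 0")
    first = offset // chunk_size
    last = (offset + length - 1) // chunk_size  # index of the chunk holding the final byte
    return range(first, last + 1)


MIB = 1024 * 1024

# Restoring 4 MiB starting at the 5 MiB mark, with 4 MiB chunks,
# touches chunk 1 (bytes 4-8 MiB) and chunk 2 (bytes 8-12 MiB).
print(list(chunks_for_range(5 * MIB, 4 * MIB, 4 * MIB)))  # [1, 2]
```

With variable-size chunking this lookup would need an offset index instead, which is one more reason fixed sizes are the simpler default.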

Interview-ready takeaway

Reliability in backups is not a single promise you make to users — it's a protocol you design and implement: chunk, hash, idempotent put, persist progress, retry with backoff, and verify. Explain the trade-offs (chunk size, metadata complexity, concurrency) and you've shown the mentality of a senior engineer: think beyond "it works once" to "it recovers gracefully."

#CloudComputing #SystemDesign #DevOps
