Updated 2026-06-11

DeepSeek Rate Limit and user_id Isolation: production rules for V4 traffic

If you run DeepSeek for more than one user, the important question is no longer just model quality. You need to know how much concurrency one account gets, where `user_id` belongs in each protocol, and what a healthy waiting connection looks like before the server actually starts inference. DeepSeek's official Rate Limit & Isolation page now answers those questions directly.

1. The official account-level concurrency numbers

DeepSeek's official Rate Limit & Isolation page says concurrency is calculated at the account level, not per API key. The published limits are 500 concurrent requests for `deepseek-v4-pro` and 2500 for `deepseek-v4-flash`.

That is the baseline to use when planning routing. If one service account fans out traffic for many tenants, the account ceiling matters more than the number of API keys you have generated under that account.

DeepSeek official concurrency limits

| Model | Official concurrency limit | Operational meaning |
| --- | --- | --- |
| `deepseek-v4-pro` | 500 | Reserve for harder reasoning, review, and narrower high-value workloads |
| `deepseek-v4-flash` | 2500 | Default lane for repeated tool calls, chat, and high-volume background work |
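
One way to respect these ceilings on the client side is to cap in-flight requests per model before they leave your service. The sketch below is only an illustration: `ModelGate`, `HEADROOM_PERCENT`, and the `make_call` callback are hypothetical names, and the 80% headroom is an assumption for a process that shares the account with other services.

```python
import asyncio

# Published account-level ceilings for the two models.
ACCOUNT_LIMITS = {"deepseek-v4-pro": 500, "deepseek-v4-flash": 2500}

# Hypothetical safety margin: use only part of the ceiling from this process,
# leaving room for other services that share the same account.
HEADROOM_PERCENT = 80


class ModelGate:
    """Caps in-flight requests per model with one semaphore per model."""

    def __init__(self, limits=None, headroom_percent=HEADROOM_PERCENT):
        limits = ACCOUNT_LIMITS if limits is None else limits
        self.caps = {m: max(1, n * headroom_percent // 100) for m, n in limits.items()}
        self._sems = {m: asyncio.Semaphore(c) for m, c in self.caps.items()}

    async def run(self, model, make_call):
        # make_call is any zero-argument coroutine function that sends the request.
        async with self._sems[model]:
            return await make_call()
```

Create the gate inside the running event loop and route every outbound request through `gate.run(model, ...)`. A single-process semaphore does not coordinate across replicas, so a multi-instance deployment would need a shared counter or a per-instance share of the ceiling.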

2. What user_id is actually for

DeepSeek documents `user_id` as a real production control, not just a logging field. The official page ties it to content-safety isolation, KV-cache isolation, and scheduling isolation under one account.

That means `user_id` should usually represent your own stable application-side user or tenant identifier, as long as it does not include private user data. The docs explicitly say not to place privacy information inside the field.

If you run a shared DeepSeek account behind your own product, `user_id` is the cleanest official mechanism for separating users without inventing your own pseudo-protocol.
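
A stable identifier that carries no private data can be derived by keyed hashing of whatever internal key you already have. This is a minimal sketch, assuming an HMAC-SHA256 digest truncated to 32 hex characters is acceptable as a `user_id` value; the function name, the `u_` prefix, and the truncation length are hypothetical, and any length or character restrictions should be checked against the official docs.

```python
import hashlib
import hmac


def derive_user_id(internal_id: str, secret: bytes, prefix: str = "u_") -> str:
    """Map an internal user or tenant key to a stable, opaque identifier."""
    digest = hmac.new(secret, internal_id.encode("utf-8"), hashlib.sha256).hexdigest()
    # 32 hex characters (128 bits) keeps the value short while making collisions unlikely.
    return prefix + digest[:32]
```

The same input and secret always produce the same value, so the isolation lane stays stable across requests, while the raw email address or account number never leaves your system.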

3. OpenAI-format and Anthropic-format placement are different

This is where teams often make avoidable mistakes. In DeepSeek's OpenAI-format route, `user_id` belongs inside `extra_body`. In the Anthropic-format route, the official docs place it under `metadata.user_id`.

If you mix those placements, you may think you are isolating users while the backend never receives the field in the expected position.

```python
import os

from openai import OpenAI

client = OpenAI(
    api_key=os.environ["DEEPSEEK_API_KEY"],
    base_url="https://api.deepseek.com",
)

response = client.chat.completions.create(
    model="deepseek-v4-flash",
    messages=[{"role": "user", "content": "Summarize the latest deploy logs."}],
    # OpenAI-format route: user_id travels inside extra_body.
    extra_body={"user_id": "tenant_acme_42"},
)
```
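
To make the placement difference hard to get wrong, the two shapes can be built by separate helpers. The sketch below only constructs keyword arguments and sends nothing. The helper names and the `max_tokens` default are hypothetical. The `metadata` parameter is what the Anthropic Python SDK's `messages.create` accepts, and the base URL for DeepSeek's Anthropic-format route should be taken from the official docs.

```python
def openai_format_kwargs(model, messages, user_id):
    """Keyword arguments for client.chat.completions.create (OpenAI SDK)."""
    return {
        "model": model,
        "messages": messages,
        # OpenAI-format route: user_id goes inside extra_body.
        "extra_body": {"user_id": user_id},
    }


def anthropic_format_kwargs(model, messages, user_id, max_tokens=1024):
    """Keyword arguments for client.messages.create (Anthropic SDK)."""
    return {
        "model": model,
        "max_tokens": max_tokens,
        "messages": messages,
        # Anthropic-format route: user_id goes under metadata.
        "metadata": {"user_id": user_id},
    }
```

Call them as `client.chat.completions.create(**openai_format_kwargs(...))` or `client.messages.create(**anthropic_format_kwargs(...))`, so each route has exactly one code path that decides where `user_id` lives.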

4. How to handle 429s without guessing

The official docs say requests within the concurrency limit receive a response and requests above the limit return HTTP 429. That means 429s are not a mystery signal; they are the expected backpressure response when your account or one `user_id` lane is over capacity.

The operational answer is usually not to add more retries everywhere. Instead, lower fan-out, move repetitive traffic to Flash, queue overflow, and separate your highest-value Pro requests from background jobs.

If you are granted increased account concurrency, DeepSeek says each `user_id` can still be limited individually. That is a strong reason to choose a stable and bounded `user_id` strategy instead of leaving it blank for every request.
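
Lowering fan-out in response to 429s can be automated with an additive-increase, multiplicative-decrease cap, paired with a simple routing rule that keeps Pro for a short list of job kinds. This is a sketch under stated assumptions, not a documented DeepSeek mechanism: `AdaptiveLimit`, `pick_model`, and the job kinds in `PRO_KINDS` are hypothetical names chosen for illustration.

```python
# Hypothetical job kinds that justify the scarcer Pro lane.
PRO_KINDS = {"code_review", "hard_reasoning"}


def pick_model(job_kind: str) -> str:
    """Send only designated high-value work to Pro; everything else to Flash."""
    return "deepseek-v4-pro" if job_kind in PRO_KINDS else "deepseek-v4-flash"


class AdaptiveLimit:
    """Additive-increase, multiplicative-decrease cap on in-flight requests."""

    def __init__(self, ceiling: int, floor: int = 1):
        self.ceiling = ceiling
        self.floor = floor
        self.limit = ceiling

    def on_success(self) -> None:
        # Recover slowly: one extra slot per successful response.
        self.limit = min(self.ceiling, self.limit + 1)

    def on_429(self) -> None:
        # Back off quickly: halve the allowed fan-out on each 429.
        self.limit = max(self.floor, self.limit // 2)
```

The dispatcher would consult `limit` before starting a new request and hold anything beyond it in a queue, so a burst of 429s reduces pressure instead of multiplying it through retries.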

5. Keep-alive lines are not broken responses

DeepSeek's docs also explain the request keep-alive mechanism. For non-streaming requests, the server may send empty lines while the HTTP connection remains open. For streaming requests, it may send SSE keep-alive comments such as `: keep-alive`.

If your client or proxy treats those bytes as malformed output, you can misclassify a healthy waiting request as a failure. The docs also say the server closes the connection if inference has not started after 10 minutes, which gives you a concrete upper bound for pre-inference waiting.

This matters for DeepSeek-first agent tools because a slow but healthy queued request should not look the same as a real transport failure.
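
A client that tolerates both keep-alive forms can be sketched in a few lines. The function names are hypothetical, and the `[DONE]` terminator is an assumption based on OpenAI-style streaming rather than something stated above.

```python
import json


def parse_non_streaming_body(raw: bytes) -> dict:
    """Parse a JSON body that may be preceded by keep-alive empty lines."""
    # json.loads already ignores surrounding whitespace; strip() makes the intent explicit.
    return json.loads(raw.decode("utf-8").strip())


def iter_sse_data(lines):
    """Yield `data:` payloads from SSE lines, skipping comments and blank lines."""
    for line in lines:
        line = line.rstrip("\r\n")
        if not line or line.startswith(":"):
            continue  # blank separator or SSE comment such as ": keep-alive"
        if line.startswith("data:"):
            payload = line[len("data:"):].strip()
            if payload == "[DONE]":  # assumed OpenAI-style end-of-stream marker
                return
            yield payload
```

With this shape, a queued request that emits only keep-alive bytes yields nothing and raises nothing, so it stays distinguishable from a dropped connection or a malformed payload.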

FAQ

What are DeepSeek's official concurrency limits?

The official Rate Limit & Isolation page lists 500 concurrent requests for `deepseek-v4-pro` and 2500 for `deepseek-v4-flash`, calculated at the account level.

Is concurrency tracked per API key or per account?

Per account. DeepSeek's docs say concurrency is calculated at the account level regardless of which API key is used.

Where does user_id go in the OpenAI-format API?

Put it in `extra_body.user_id`.

Where does user_id go in the Anthropic-format API?

Put it in `metadata.user_id`.

What do empty lines or `: keep-alive` comments mean?

They are documented keep-alive behavior while the request is still waiting for inference to start. On their own, they do not indicate an error.

DeepSeek traffic gets easier to run when you treat concurrency, `user_id`, and keep-alives as first-class production constraints. Flash should absorb the broad high-volume lane, Pro should stay scarce and deliberate, and your client code should pass `user_id` in the exact place the chosen protocol expects.
