DeepSeek 多模态主线已成形：Janus-Pro 与 VL2 是官方研究线，V4 API 仍需按文本主路由描述

截至 2026 年 6 月 18 日，DeepSeek 官方研究与 GitHub 页面已经能确认 Janus-Pro、DeepSeek-VL2 等多模态能力；但官方 V4 API 文档仍把 Pro 和 Flash 写成文本主路由，Copilot 文档也说明 V4 本身是 text-only，因此应写成多模态生态扩展，而不是已确认原生 V4 Vision API。

中文摘要

阅读提示

这篇中文稿保留原始来源链接，并把 DeepSeek 官方发布、报道和市场传闻分开标注。购买相关判断仍以 /zh/pricing 的真实库存卡片为准；出现在新闻或基准中的模型不代表可购买。

英文原文

Confirmed multimodal work, with an important V4 boundary

DeepSeek already has a real multimodal line. The safest source-backed wording is not that every DeepSeek V4 API route has suddenly become native vision. The accurate headline is narrower: DeepSeek's official research stack includes multimodal models, while the current public V4 API docs still present Pro and Flash as text-first routes.

That distinction matters for developers and buyers. Janus-Pro and DeepSeek-VL2 are official DeepSeek multimodal releases. They show that DeepSeek is not only a text and coding-model lab. But the public V4 API surface still needs to be described exactly as DeepSeek documents it today.

What is confirmed

Janus-Pro is part of DeepSeek's official Janus series. The Janus repository describes the project as unified multimodal understanding and generation, and its January 27, 2025 release note says Janus-Pro improves both multimodal understanding and visual generation.
DeepSeek-VL2 is an official vision-language model family. Its repository describes MoE vision-language models for visual question answering, OCR, document/table/chart understanding, and visual grounding.
The DeepSeek homepage still links DeepSeek VL under research. That keeps the vision-language line visible from a DeepSeek-owned surface rather than only from third-party commentary.
The current V4 API docs list V4 Pro and V4 Flash through the OpenAI and Anthropic-compatible text interfaces. The V4 release page emphasizes 1M context, reasoning, agent capability, API availability, JSON/tool calling, and model-name migration.
DeepSeek's own GitHub Copilot integration page is explicit about the limitation. It says DeepSeek V4 is text-only, and that image handling in the extension uses another installed Copilot model to describe screenshots before sending text to DeepSeek.

What "DeepSeek goes multimodal" should mean today

For readers, this is still important news. The DeepSeek ecosystem now has credible multimodal building blocks:

Janus-Pro points toward unified image understanding and image generation.
DeepSeek-VL2 points toward practical visual understanding: OCR, charts, documents, screenshots, and grounding.
V4 remains the production text/coding/agent backbone documented in the public API.
Copilot-style extensions can bridge screenshots into DeepSeek through a proxy model, but that is not the same as native V4 image input.

So the right product interpretation is: DeepSeek has official multimodal research releases, but teams should not assume a native V4 Vision API model name or image-token pricing until DeepSeek publishes it directly.

Why this matters for developers

Multimodal support changes the kind of workflows DeepSeek can eventually own. Text-only coding models handle code, logs, prompts, API payloads, and documentation. Vision-language models add screenshot review, UI-state debugging, chart interpretation, scanned document QA, visual grounding, and image-backed support triage.

That is the real reason this belongs in the news stream. DeepSeek's V4 story has mostly been about long context, cost, coding, and agent routing. Janus and VL2 show the adjacent capability layer: the model family can move from pure language tasks into visual evidence and, in Janus-Pro's case, image generation.

What not to overclaim

Do not turn this into a stocked-product update. A multimodal research model is not a Coding Plan. It does not create a new purchasable card on /pricing, it does not prove available API-key inventory, and it does not justify adding a new subscription plan.

Also do not invent model names such as deepseek-v4-vision unless DeepSeek publishes that exact route. The current safe API model names remain deepseek-v4-pro and deepseek-v4-flash.

Editorial takeaway

The publish-safe headline is that DeepSeek's multimodal line is real, but the public V4 API boundary is still text-first. Janus-Pro and DeepSeek-VL2 give developers official DeepSeek multimodal artifacts to watch. V4 Pro and Flash remain the documented production routes for current API users.