Official2026-06-18

DeepSeek multimodal line comes into focus: Janus-Pro and VL2 are official, while V4 API docs still keep Pro and Flash text-first

Checked on June 18, 2026: DeepSeek's official research and GitHub surfaces already include multimodal work through Janus-Pro and DeepSeek-VL2, covering image understanding, OCR, document/chart reasoning, visual grounding, and text-to-image generation. Official V4 API docs still list V4 Pro and V4 Flash as text-first API routes, and DeepSeek's Copilot page says V4 itself is text-only with an optional vision proxy, so this should be framed as DeepSeek's multimodal ecosystem expanding rather than a newly confirmed native V4 Vision API.

Confirmed multimodal work, with an important V4 boundary

DeepSeek already has a real multimodal line. The safest source-backed wording is not that every DeepSeek V4 API route has suddenly become native vision. The accurate headline is narrower: DeepSeek's official research stack includes multimodal models, while the current public V4 API docs still present Pro and Flash as text-first routes.

That distinction matters for developers and buyers. Janus-Pro and DeepSeek-VL2 are official DeepSeek multimodal releases. They show that DeepSeek is not only a text and coding-model lab. But the public V4 API surface still needs to be described exactly as DeepSeek documents it today.

What is confirmed

  • Janus-Pro is part of DeepSeek's official Janus series. The Janus repository describes the project as unified multimodal understanding and generation, and its January 27, 2025 release note says Janus-Pro improves both multimodal understanding and visual generation.
  • DeepSeek-VL2 is an official vision-language model family. Its repository describes MoE vision-language models for visual question answering, OCR, document/table/chart understanding, and visual grounding.
  • The DeepSeek homepage still links DeepSeek VL under research. That keeps the vision-language line visible from a DeepSeek-owned surface rather than only from third-party commentary.
  • The current V4 API docs list V4 Pro and V4 Flash through the OpenAI and Anthropic-compatible text interfaces. The V4 release page emphasizes 1M context, reasoning, agent capability, API availability, JSON/tool calling, and model-name migration.
  • DeepSeek's own GitHub Copilot integration page is explicit about the limitation. It says DeepSeek V4 is text-only, and that image handling in the extension uses another installed Copilot model to describe screenshots before sending text to DeepSeek.

What "DeepSeek goes multimodal" should mean today

For readers, this is still important news. The DeepSeek ecosystem now has credible multimodal building blocks:

  • Janus-Pro points toward unified image understanding and image generation.
  • DeepSeek-VL2 points toward practical visual understanding: OCR, charts, documents, screenshots, and grounding.
  • V4 remains the production text/coding/agent backbone documented in the public API.
  • Copilot-style extensions can bridge screenshots into DeepSeek through a proxy model, but that is not the same as native V4 image input.

So the right product interpretation is: DeepSeek has official multimodal research releases, but teams should not assume a native V4 Vision API model name or image-token pricing until DeepSeek publishes it directly.

Why this matters for developers

Multimodal support changes the kind of workflows DeepSeek can eventually own. Text-only coding models handle code, logs, prompts, API payloads, and documentation. Vision-language models add screenshot review, UI-state debugging, chart interpretation, scanned document QA, visual grounding, and image-backed support triage.

That is the real reason this belongs in the news stream. DeepSeek's V4 story has mostly been about long context, cost, coding, and agent routing. Janus and VL2 show the adjacent capability layer: the model family can move from pure language tasks into visual evidence and, in Janus-Pro's case, image generation.

What not to overclaim

Do not turn this into a stocked-product update. A multimodal research model is not a Coding Plan. It does not create a new purchasable card on /pricing, it does not prove available API-key inventory, and it does not justify adding a new subscription plan.

Also do not invent model names such as deepseek-v4-vision unless DeepSeek publishes that exact route. The current safe API model names remain deepseek-v4-pro and deepseek-v4-flash.

Editorial takeaway

The publish-safe headline is that DeepSeek's multimodal line is real, but the public V4 API boundary is still text-first. Janus-Pro and DeepSeek-VL2 give developers official DeepSeek multimodal artifacts to watch. V4 Pro and Flash remain the documented production routes for current API users.

Sources checked