AI industry news, research papers, and global updates
24/7 automated news assistant. Pulls updates every 2 hours from Anthropic, DeepMind, arXiv, The Verge, and more.
Agentic RL today assigns credit over coarse units (tool-call boundaries, fixed workflows), but pilot analysis shows influential decision points are spread throughout the generated sequence, and token entropy alone doesn't reliably flag them. APPO shifts both branching and credit assignment to fine-grained decision points: a Branching Score combines token uncertainty with policy-induced likelihood gain of continuations, plus procedure-level advantage scaling distributes credit across branched rollouts. Across 13 benchmarks APPO lifts strong agentic-RL baselines by ~4 points on average while keeping tool-call counts efficient and rollouts interpretable.
Autonomous research agents usually fail because they treat each attempt as isolated. Arbor frames autonomous research as a long-lived Hypothesis-Tree Refinement (HTR) loop: a persistent tree links hypotheses, artifacts, evidence, and distilled insights across runs, with a long-lived coordinator managing global strategy and short-lived executors implementing and testing individual hypotheses in isolated worktrees. Across six real research tasks in model training, harness engineering, and data synthesis, Arbor beats Codex and Claude Code by 2.5× average relative held-out gain. On MLE-Bench Lite with GPT-5.5 it hits 86.36% Any Medal — strongest in the comparison.
Modern LLM training pipelines recursively depend on other models (data generation, filtering, judging, RL rewards) but the full dependency graph is fragmented across heterogeneous public artifacts. ModSleuth is an agentic system that recursively reconstructs these LLM dependency graphs from public sources with source-grounded evidence. Applying it to four artifact-rich LLM releases recovers 1,060 source-verified dependencies and exposes multi-hop license obligations, train/eval coupling, and documentation inconsistencies.
RL with verifiable rewards is the dominant recipe for reasoning post-training, but environment construction is the bottleneck — manual or per-task scaling is linear and shallow. RACES treats 300 individual verifiable environments as composable building blocks and defines composition operators (SEQUENTIAL, PARALLEL, SORT, SELECT) that fuse them when the codomain of one matches the domain of the next. RL training on these composites lifts DeepSeek-R1-Distill-Qwen-14B by +3.1 points (48.2 → 51.3) and Qwen3-14B from 58.8 → 61.1 on six held-out benchmarks not seen during construction.
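The composition rule described above can be sketched in a few lines. This is only an illustration under stated assumptions: the `Env` class, the string type tags, and the toy environments are hypothetical, and the `parallel` semantics shown here (same input, paired outputs) is a guess rather than the RACES definition.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class Env:
    """Hypothetical verifiable environment: typed input, typed output, reference solver."""
    name: str
    domain: str                    # type tag of the input
    codomain: str                  # type tag of the output
    solve: Callable[[Any], Any]    # reference solution used to check answers


def sequential(first: Env, second: Env) -> Env:
    """Fuse two environments when the codomain of one matches the domain of the next."""
    if first.codomain != second.domain:
        raise TypeError(f"{first.name} -> {second.name}: {first.codomain} != {second.domain}")
    return Env(
        name=f"SEQUENTIAL({first.name}, {second.name})",
        domain=first.domain,
        codomain=second.codomain,
        solve=lambda x: second.solve(first.solve(x)),
    )


def parallel(a: Env, b: Env) -> Env:
    """Assumed semantics: run both environments on the same input and pair the outputs."""
    if a.domain != b.domain:
        raise TypeError(f"{a.name} || {b.name}: {a.domain} != {b.domain}")
    return Env(
        name=f"PARALLEL({a.name}, {b.name})",
        domain=a.domain,
        codomain=f"({a.codomain}, {b.codomain})",
        solve=lambda x: (a.solve(x), b.solve(x)),
    )


def verify(env: Env, x: Any, answer: Any) -> float:
    """Binary verifiable reward: 1.0 if the answer matches the reference, else 0.0."""
    return 1.0 if answer == env.solve(x) else 0.0


# Toy building blocks (hypothetical, not taken from the paper).
sort_list = Env("sort", "list[int]", "list[int]", sorted)
middle = Env("middle", "list[int]", "int", lambda xs: xs[len(xs) // 2])
length = Env("length", "list[int]", "int", len)

sort_then_middle = sequential(sort_list, middle)
both = parallel(length, middle)
```

Here `sort_then_middle` type-checks because `sort` returns a `list[int]` and `middle` accepts one, while `sequential(length, middle)` would raise a `TypeError`. That check is what makes the building blocks composable without hand-writing each composite task.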
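A Chinese AI daily digest reports an experiment in which different AI models governed virtual societies: the Claude civilization achieved sustainable prosperity; GPT-5 built an "efficient but indifferent technocratic state"; Gemini created a "creatively rich but economically unbalanced republic"; and the Grok civilization collapsed completely within four days. The experiment highlights differences in AI "personality" in complex social governance. Single aggregator source only; the specific paper and authors were not disclosed. Flagged as pending verification.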
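Released June 9. Fable 5 is the public GA release (API/AWS Bedrock/Vertex/Foundry), priced at $10/$50 per million input/output tokens (less than half the price of Mythos Preview); Mythos 5 shares the same weights but without the safeguards, and is opening first to Project Glasswing defense organizations and the U.S. government. Described as Anthropic's most capable public model to date, with SOTA long-horizon operation (able to run autonomously for days), million-line code migration, and visual understanding.
Lead story in TechCrunch's security section: Anthropic is deploying Claude Mythos at scale into critical infrastructure such as energy and transportation across 15+ countries. It is the first time a frontier model has entered production in a "national critical infrastructure" role, and governance and security-review pressure is rising in step.
According to Reuters, Bloomberg, and WSJ reports, Anthropic confidentially filed an S-1 registration statement with the SEC on June 2, based on a $65 billion funding round at the end of May, at a post-money valuation of about $965 billion, well above OpenAI's current valuation of $200–300 billion. Amazon holds roughly 30%. Market focus: whether the Fable 5/Mythos 5 safety controversy will affect the IPO timeline.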
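WWDC 2026 keynote (June 8–12) at Apple Park: the core themes are Apple Intelligence delivering on delayed promises, Siri's AI upgrade, and opening on-device models to third-party apps (mirroring the Google Gemini Nano approach).
Recent digests show new paper submissions clustering around emergent misalignment, adversarial robustness, constraint drift in multi-turn interaction, and semantics-aware sandboxed agent checkpoints; AI for Science applications (spectral VQA, EEG diagnosis, Alzheimer's disease modeling) are accelerating toward deployment.
The DeepMind homepage (updated 2026-06-13) leads with the Gemini 3.5 series, Gemini Omni (the video-generation entry point), the Co-Scientist collaborative research partner, Gemini for Science, and the Gemma 4 open-source line. Taken together with I/O 2026, Google is making agents its core direction for the next 12 months, with the Nano Banana image model and the Veo video stack working in tandem.
HF Trending (2026-06-07): NVIDIA Nemotron-3-Ultra-550B-A55B picked up 145 likes and 47,000 downloads at launch, and DeepSeek V4 Pro/V4 Flash swept the language-model chart, reflecting an open-source ecosystem concentrating at two ends: ultra-large-scale MoE and high-efficiency inference.
A May 21 HF blog post introduced LeRobot Humanoid, a bipedal robot project for robotics developers and researchers, offering a complete hardware and software package starting at $2,500 and bringing the entry barrier for open-source robotics down to consumer level.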
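Announced during GTC Taipei. It uses a dual-tower "reasoning Transformer + generation Transformer" architecture that unifies text, image, video, audio, and action trajectories, targeting a closed "understand, generate, simulate, act" loop for robotics, autonomous driving, and visual agents.
On June 4 OpenAI announced an upgrade to ChatGPT's memory feature, based on the Dreaming V3 mechanism, that reduces stale memories, improves accuracy, and strengthens serving at scale; accuracy rose to about 82.8%. The Dreaming mechanism was introduced in April 2025, and this is its second major iteration.
On June 1, 2026, OpenAI announced it was upgrading its world-simulation research project into OpenAI Robotics, led by core Sora developers, and openly recruiting hardware engineers. The move marks its formal entry into physical robot hardware development, in direct competition with Cosmos 3 and with Unitree, Galaxea (星海图), and XPeng IRON.
OpenAI advanced on two fronts on June 11: (1) it acquired Ona (a developer-tools/agent platform), expanding its desktop and CLI agent surface; (2) it published a case study with astrophysicists using Codex to drive black-hole simulations, the first time it has publicly pushed Codex into scientific computing. Separately, on June 12 it released the Academy course "The Next Phase of Work".
On May 15 xAI launched an early beta of Grok Build; in early June it paused "AI tutor" roles in accounting, finance, science, comedy, and other areas, shifting toward specialized data annotation, consistent with the organizational restructuring after the merger with SpaceX.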
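XPeng's IRON robot is scheduled to debut officially in Q3 2026, with mass production of a high-end version by year-end and deliveries to commercial customers in early 2027. Astribot (星尘智能) raised a cumulative 1 billion-plus over three months, with its valuation passing 10 billion, and became the world's first company to mass-produce a cable-driven AI robot (T series starting at 89,900 yuan).
Zhipu's flagship GLM-5 (745B parameters, 202K context, released February 2026, with the overseas version launching first and API prices raised 67–100%) has stabilized; GLM-5.1 further strengthens long-horizon tasks, with overall performance on par with Claude Opus 4.6, and GLM-5V-Turbo adds native multimodality. It represents China's "ultra-long context plus steep price increase" path.
The EU Talent Pool regulation formally took effect in June 2026, establishing a unified framework for talent mobility in key fields such as AI, quantum, and biotechnology; together with the AI Act it forms Europe's one-two punch of "technological sovereignty plus talent sovereignty".
On June 3 the European Commission introduced the Cloud & AI Development Act, Chips Act 2.0, an open-source strategy, and an energy digitalization roadmap, currently at the third-reading stage in the Council and Parliament. It is the first time the EU has tied an open-source strategy directly into industrial policy, complementing the AI Act.
Signed this month. Following industry lobbying, the pre-release review window was cut from 90 days in the draft to 30 days; it establishes centralized federal AI approval authority and states an explicit intent to "preempt" state-level AI regulation (per AI Czar David Sacks). It also directs national-security and cybersecurity officials to coordinate with top labs on Mythos-level software vulnerability discovery.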
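A Chinese AI daily digest reports that the U.S. Department of Commerce, citing national security, issued an export-control directive requiring Anthropic to immediately suspend all foreign nationals' access to Fable 5 / Mythos 5 (including foreign employees), triggered just 72 hours after Fable 5's release. Anthropic published a lengthy statement and a "Security Posture Update" publicly rebutting the government's assessment. The same day, Amazon's CEO was reported to have held talks with U.S. officials about Anthropic model safety. Single Chinese aggregator source only, not independently verified in English-language mainstream media. Flagged as pending verification.
On June 3, UK Labour MP Jess Asato sued xAI in the High Court, alleging that Grok generated deepfake pornographic images of her without consent and that this was "by deliberate design"; she invokes UK data-protection law to seek damages, a public apology, and an injunction. It is the first legal challenge against a generative AI platform brought at MP level.
TechCrunch reports that SoftBank plans to invest up to €75 billion in building data centers in France, adding up to 5 GW of capacity. Large-model competition has escalated into a combined contest over power, land, networking, chips, and long-term capital expenditure.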
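On June 9, 2026, Anthropic officially launched Claude Fable 5, positioned as the first Mythos-class model open to the public, focused on long-horizon tasks (able to run autonomously for days), million-line codebase migration, complex R&D, and deep visual understanding. It is described as a "step-change" iteration for coding, R&D, and knowledge-work scenarios.
The Verge reports that the WWDC 2026 keynote runs June 8–12 at Apple Park; the core themes are Apple Intelligence delivering on delayed promises, Siri's AI upgrade, and opening on-device models to third-party apps (mirroring the Google Gemini Nano approach).
The DeepMind homepage (updated 2026-06-13) leads with the Gemini 3.5 series, Gemini Omni (the video-generation entry point), and the Co-Scientist collaborative research partner, with Gemini for Science and the Gemma 4 open-source line advancing alongside. Taken together with I/O 2026, Google has made agents its core direction for the next 12 months.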
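OpenAI recently upgraded ChatGPT's memory so the model keeps remembering user preferences, writing style, and tech stack across long-running conversations; it has also published a run of updates on biosecurity, frontier-model governance, and industry deployment. Large-model companies are advancing "product experience plus safety governance" in parallel.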
The authors argue the bottleneck for autonomous scientific discovery has shifted from agent workflow design to environment engineering. EurekAgent engineers the environment along four axes: permissions (bounded execution + isolated eval), artifacts (filesystem + Git collaboration), budgets (budget-aware exploration), and human oversight. Their system already outperforms human-designed baselines on metric-driven discovery tasks.
EvoArena is a benchmark suite modeling environment drift as progressive updates across terminal, software, and social-preference domains, paired with EvoMem, a patch-based memory paradigm that records memory as structured update histories. Current agents score only 39.6% on average; EvoMem delivers +1.5% on EvoArena and +6.1% / +4.8% on GAIA and LoCoMo.
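To make "memory as structured update histories" concrete, here is a minimal sketch of a patch-style memory. It is an assumption-laden illustration, not EvoMem itself: the `Patch` and `PatchMemory` names, the key/value fact format, and the replay logic are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Patch:
    """One recorded update: what a fact was before and after, and when it changed."""
    step: int
    key: str
    old: Optional[str]
    new: Optional[str]   # None means the fact was retracted


@dataclass
class PatchMemory:
    """Hypothetical memory that stores an append-only history of patches."""
    history: List[Patch] = field(default_factory=list)

    def update(self, step: int, key: str, new: Optional[str]) -> None:
        # Record the previous value so the history shows how the fact drifted.
        old = self.current().get(key)
        self.history.append(Patch(step, key, old, new))

    def current(self, upto: Optional[int] = None) -> Dict[str, str]:
        # Rebuild the state by replaying patches, optionally only up to a given step.
        state: Dict[str, str] = {}
        for patch in self.history:
            if upto is not None and patch.step > upto:
                continue
            if patch.new is None:
                state.pop(patch.key, None)
            else:
                state[patch.key] = patch.new
        return state

    def changes(self, key: str) -> List[Patch]:
        # Full update history for one fact, oldest first.
        return [p for p in self.history if p.key == key]


memory = PatchMemory()
memory.update(step=1, key="cli.output_flag", new="--out")
memory.update(step=2, key="cli.output_flag", new="--output")  # the environment drifted
```

Compared with overwriting a single stored value, keeping the patches lets an agent answer both "what is true now" (`memory.current()`) and "what changed, and from what" (`memory.changes(...)`), which is the kind of information a drifting environment demands.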
A two-agent pipeline (planner + critic) bolted onto existing image generators to produce text-image interleaved sequences for visual narratives, guidance, and embodied manipulation. Trained on Interleave-Planner-SFT-80k + Interleave-Critic-SFT-112k and reinforced with Interleave-Critic-RL-13k, it is the first work to give off-the-shelf UMMs strong interleaved-generation behavior.
The paper formalizes Recursive Agent Harnesses (RAH) — a pattern where the recursive unit is a full agent harness (filesystem tools, code execution, planning) instead of a bare model call. The parent agent generates an executable script that spawns subagent harnesses in parallel, then calls structured functions for small subtasks. On Oolong-Synthetic long-context reasoning, RAH lifts the Codex coding-agent baseline from 71.75% to 81.36% with GPT-5, and further improves with Claude Sonnet 4.5.
Anthropic ($61.5B valuation) shipped Claude for Education with a "Learning Mode" that uses Socratic questioning instead of direct answers — explicit positioning against the "AI cheats student learning" critique. OpenAI followed with ChatGPT for Teachers: unlimited GPT-5.1 Auto, FERPA-compliant student data handling, free for all US K-12 faculty through June 2027. Both are textbook "land grab" moves into the classroom default stack.
The all-stock Coursera–Udemy merger announced Dec 2025 closed May 11, 2026. Udemy is now a wholly-owned subsidiary and has been delisted from NASDAQ; Udemy shareholders received 0.8 Coursera shares per share. Combined equity value ~$2.5B. CFO Mike Foley will host a post-merger modeling call June 23 to lay out FY26 guidance. Coursera separately announced a $500M share buyback on May 18 (≈⅓ of its $1.51B market cap) — a confidence signal after the deal.
The EU AI Act's education-relevant provisions are now in the active compliance window for 2026 — schools and EdTech vendors are scrambling on transparency, data governance, and high-risk system classification. In parallel, London EdTech Week 2026 (June 15–19) opens with the full sector convening around AI in the classroom. The UK government also released "Inclusive Mainstream Funding 2026–27" guidance (June 10) — new SEND (special educational needs) budgets districts can route into EdTech procurement.
Tencent Cloud's "2026 AI-Native Education Report" (published June 14) puts 2025 global EdTech VC at $2.6B — a continued reset from the 2021 peak. The report's framing: the core challenge is no longer "more AI tools" but "embedding AI deep into teaching workflows," shifting from knowledge transfer to competency construction. Confirmed by 36Kr coverage of Chinese giants (ByteDance's Doubao Aixue, Alibaba's Qianwen) consolidating AI-study entry points inside super-apps.
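Anthropic confidentially filed an S-1 with the SEC on June 2 at a valuation of $965 billion, surpassing OpenAI's $200–300 billion; Amazon holds roughly 30%. The market is watching whether the Fable 5 ban incident affects the IPO process.
Meta and Manus have completed their operational separation and stopped sharing data, a result of China's foreign-investment security review that also marks a freeze in U.S.–China cross-border AI M&A. Meta had originally planned to acquire the China-linked AI agent startup for $2 billion.
SemiAnalysis figures show more than 75 AI data-center projects worldwide have been delayed or canceled for regulatory, environmental, power, and other reasons, representing more than $130 billion in total investment. From Ireland to Virginia, governments and communities are starting to push back against the high energy consumption of AI data centers.
Zuckerberg made a rare public admission that Meta's AI transformation has run into serious problems: "We adjusted too fast." Cutting 10% of staff and reassigning 7,000 people while pushing the AI strategy caused severe organizational turbulence and the loss of key talent; a former employee said the company "destroyed the foundation for building AI with the slogan of building AI."
Zhipu AI has made GLM-5.2 fully available, with support for a 1-million-token context window, and announced it will be open-sourced next week. It posts strong results on multiple benchmarks across coding, reasoning, and long-text understanding, an important breakthrough for the Chinese camp in the "ultra-long context" race.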
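The European Commission released the European Technological Sovereignty Package, a set of measures to strengthen Europe's capabilities in AI and semiconductors in response to U.S.–China technology competition.
The U.S. Department of Commerce issued an export-control directive requiring Anthropic to immediately suspend all foreign nationals' access to Fable 5 and Mythos 5 (including foreign employees), only 72 hours after Fable 5's release. Anthropic published a lengthy statement and a "Security Posture Update" publicly rebutting the government's conclusions, a first for the industry. The same day, Amazon's CEO was reported to have met with U.S. officials about model safety.
The head of Google's Android security business resigned over disagreement with the company's AI cooperation with the U.S. military, writing in an internal memo that they "cannot accept AI technology being used for military purposes." The episode reflects internal tension over Google's deepening AI cooperation with the Department of Defense.
Andrej Karpathy, former Tesla AI director and former OpenAI researcher, announced he is joining Anthropic, where he will work on model training and the education ecosystem; the industry sees it as a major endorsement of the frontier-model R&D direction.
Recent arXiv submissions focus heavily on the reliability and safety of agentic systems, with several papers exploring how to deploy AI agents on the premise that they cannot be fully trusted. ProtoAda and CRAM independently made breakthroughs in continual learning for vision-language models (avoiding catastrophic forgetting); speculative-decoding variants for diffusion language models and molecular dynamics have also emerged.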
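Gemini Omni is DeepMind's flagship multimodal generative model, covering cross-modal creation across video, image, text, and audio; Antigravity 2.0 (Managed Antigravity Agents) launched alongside it for Gemini API users, marking a new stage in agent commercialization.
A May 21 HF blog post introduced LeRobot Humanoid, a complete hardware and software package starting at $2,500 (about 17,000 yuan), sharply lowering the barrier to open-source robotics.
TechCrunch reports that Prometheus, a new company led by Bezos, closed a $12 billion funding round with the goal of building an artificial general engineer for the physical world. NVIDIA, OpenAI, and others participated; it is seen as the largest single round in the robotics-plus-foundation-model space.
Mistral released Mistral Vibe v2.1.0 (June 6), continuing its client plus lightweight-deployment path and closing the ecosystem loop with the Mistral 3 series (including Mistral Large 3).
After OpenAI and Microsoft revised their partnership agreement, AWS has listed OpenAI's full product lineup, with Frontier Models and Codex open to AWS customers. This removes the cooperation barrier at the cloud-distribution layer, with the partnership valued at up to the $50 billion level.
OpenAI policy chief Chris Lehane said at Davos that the company will launch its first hardware device in the second half of 2026 and has issued RFPs to U.S. domestic manufacturers for key components including chips, motors, packaging materials, and data-center cooling, spanning consumer devices, robotics, and cloud data-center expansion.
South Korea's Framework Act on the Development of Artificial Intelligence and the Establishment of a Foundation for Trust took effect on January 22, making it the world's first comprehensively implemented national AI regulatory law; the EU AI Act will not be fully in force until 2027. In China, the second-instance trial of a developer's criminal obscenity case has adjourned at the Shanghai No. 1 Intermediate People's Court, as the legal boundaries of AI content compliance remain under scrutiny.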
A training-free editor for bitwise-residual visual autoregressive generators (e.g. Infinity). BitEdit tilts per-bit Bernoulli log-odds via source-negative guidance under a closed-form Bernoulli-KL trust region; ResEdit converts sampled bits into per-scale continuous residuals, gates them with a localization mask, and re-injects them through the generator's native sum-of-scales. On PIE-Bench with Infinity-2B it improves CLIP on the edited region by +1.07 over the strongest prior editor at competitive background preservation.
A verifiable benchmark for short-horizon epigenomics workflows spanning CUT&Tag/CUT&RUN, ATAC-seq, ChIP-seq and DNA methylation. Across 5,088 valid trajectories from 16 model-harness pairs, no system passes a majority of attempts: GPT-5.5/Pi leads at 45.0%, GPT-5.5/Codex 39.9%, Opus 4.8 Max 39.0%, GPT-5.4/Pi 39.0%. Agents locate the right files and compute useful intermediates, but break when the task demands deeper, assay-specific scientific judgment.
Identifies lab data and embodiment as the central bottlenecks for VLA models and addresses both: a simulation-based RoboGenesis data engine that composes workflows from atomic skills, plus a two-stage policy (FAST action-token pretraining on Qwen3-VL-4B-Instruct, then flow-matching DiT action expert under knowledge insulation). Hits the highest average success rate on the LabUtopia benchmark in-distribution and OOD. Project at zjunlp.github.io/LabVLA/.
First systematic treatment of producing a single confidence for an entire multi-agent system. Three protocols: calibrate raw per-agent signals, then combine via soft voting or a new "Bayesian fusion" rule. Across 6 homogeneous/heterogeneous debating pairs × 5 benchmarks × 4 task types, the aggregated confidence beats the best single agent and standard debate baselines on AUARC, while F1 stays stable (recovering the losses that multi-agent debate incurs on ambiguous tasks).
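The "calibrate, then combine" pipeline can be illustrated with a small sketch. Everything here is an assumption made for illustration: temperature scaling stands in for whatever calibration the paper uses, and `log_odds_fusion` is a generic naive-Bayes style rule that treats agents as independent, not the paper's "Bayesian fusion" rule. Each confidence is read as one agent's probability that the system's final answer is correct.

```python
import math


def _clip(p, eps=1e-6):
    # Keep probabilities away from 0 and 1 so the log-odds stay finite.
    return min(max(p, eps), 1.0 - eps)


def logit(p):
    p = _clip(p)
    return math.log(p / (1.0 - p))


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def calibrate(p, temperature=1.0):
    # Temperature scaling on the log-odds; temperature > 1 softens overconfidence.
    return sigmoid(logit(p) / temperature)


def soft_vote(confidences):
    # Soft voting: average the calibrated per-agent confidences.
    return sum(confidences) / len(confidences)


def log_odds_fusion(confidences, prior=0.5):
    # Generic independent-evidence fusion: add each agent's log-odds shift from the prior.
    z = logit(prior) + sum(logit(p) - logit(prior) for p in confidences)
    return sigmoid(z)


raw = [0.9, 0.8]                                   # hypothetical raw agent confidences
calibrated = [calibrate(p, temperature=1.5) for p in raw]
system_confidence_vote = soft_vote(calibrated)
system_confidence_fused = log_odds_fusion(calibrated)
```

The two rules behave differently when agents agree: averaging never exceeds the most confident agent, whereas log-odds fusion compounds agreement and can exceed it. Which behavior is appropriate depends on how independent the agents' errors really are.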
A self-supervised framework that trains a Bradley-Terry reward model on intermediate artifacts from multi-agent executions — no human labels, no costly sub-agent rollouts. Operates directly at the orchestration level, cuts training token usage by up to 10x and improves MAS test-time scaling accuracy by up to 8% across math reasoning, web QA, and multi-hop reasoning. Code promised at github.com/Wang-ML-Lab/OrchRM.
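For readers unfamiliar with Bradley-Terry reward models, the sketch below shows the pairwise objective on toy data: the model is trained so that the preferred item in each pair receives the higher reward. It assumes a linear reward over hand-made feature vectors and plain SGD; how OrchRM builds its preference pairs from intermediate artifacts is not shown here, and the features and pairs are hypothetical.

```python
import math


def reward(w, x):
    # Linear reward model: dot product of weights and features.
    return sum(wi * xi for wi, xi in zip(w, x))


def bt_loss(w, pairs):
    # Bradley-Terry negative log-likelihood: -log sigmoid(r_preferred - r_rejected).
    total = 0.0
    for preferred, rejected in pairs:
        margin = reward(w, preferred) - reward(w, rejected)
        total += math.log1p(math.exp(-margin))
    return total / len(pairs)


def train(pairs, dim, lr=0.5, epochs=200):
    # Plain SGD on the Bradley-Terry loss, starting from zero weights.
    w = [0.0] * dim
    for _ in range(epochs):
        for preferred, rejected in pairs:
            margin = reward(w, preferred) - reward(w, rejected)
            p_wrong = 1.0 / (1.0 + math.exp(margin))   # equals 1 - sigmoid(margin)
            for i in range(dim):
                w[i] += lr * p_wrong * (preferred[i] - rejected[i])
    return w


# Hypothetical feature vectors for (preferred, rejected) artifacts.
pairs = [
    ([1.0, 0.0], [0.0, 1.0]),
    ([0.9, 0.2], [0.1, 0.8]),
]
weights = train(pairs, dim=2)
```

At test time a model like this would score candidate orchestration outputs and the highest-reward candidate would be kept, which is one common way a reward model supports test-time scaling.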
China's May 2026 "AI + Education" Action Plan (issued jointly by five ministries) is now being operationalized. Xiuzhou District confirmed AI literacy will be a mandatory general-education course for all students starting Fall 2026. The 2026 World Digital Education Conference set the policy frame; provincial rollout is the next 12 months.
Nuventive (June 2, 2026) shipped an AI Feedback Assistant aimed at university institutional-effectiveness and planning teams, automating the close-the-loop cycle on assessment results and accreditation evidence. The move signals the "AI for the back office" wave hitting higher ed a full year after the K-12 classroom.
EdTech Chronicle (June 10) reports that U.S. K-12 districts are increasingly consolidating family-communication spend onto ClassDojo as phone-free classroom policies force a return to managed apps and tight 2026 budgets push out point solutions. ClassDojo now claims 45M+ students and parents on its platform.
Edumentors, the UK-based 1:1 tutoring marketplace, has crossed $4M in cumulative revenue, per EdTech Chronicle. Growth is driven by UK/EU parent demand for vetted human tutors post-ChatGPT — a notable counter-signal to the "AI eats tutoring" thesis.
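A report shows global AI-related investment exceeded $800 billion in 2025, with venture funding for AI startups at $225.8 billion (up 97% year over year) and average deal size doubling; India has become a core market for AI applications, and enterprise "pay-for-outcome" pricing is an emerging trend.
On June 4 Anthropic launched two new features, a Service Tracker and a Partner Hub, to help enterprises track the progress of Claude-generated deliverables and compare them against requirements; both are part of the Claude Partner Network launched in March.
Google is offering managed Antigravity 2.0 Agents through the Gemini API, letting developers directly invoke hosted multi-agent execution and pushing agent capabilities further down to the API layer.
On April 3 DeepMind released Gemma 4 in four sizes spanning on-device to workstation; the on-device Gemini Nano 4 delivers 4× the performance with 60% lower battery consumption. Multiple Gemma 4 sizes have recently held the top of the Hugging Face trending chart.
On May 20 Google's developer conference opened with Gemini 3.5 Flash (4× the speed of comparable models at less than half the price), unveiled "Neural Expressive", a new visual language for the Gemini app, and introduced an agent ecosystem and full-stack technology loop.
On April 29 Mistral shipped three things at once: the mid-size model Medium 3.5, a remote coding agent on the Vibe platform powered by it, and a work mode for Le Chat, a formal push into enterprise collaboration and remote development.
Bloomberg, June 13: Mistral AI is in early talks with investors on a new funding round at a valuation of about €20 billion (roughly 156.7 billion yuan), aiming to raise €3 billion for operational expansion.
On April 17 xAI officially opened its speech-to-text (STT) and text-to-speech (TTS) APIs, emphasizing high fidelity and low latency, so developers can integrate voice conversation and Grok gains another commercial entry point.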
Agent evaluation is fragmented because most benchmarks use LLM-centric fixed harnesses that create test-production mismatch. The authors advocate "Agentified Agent Assessment" (AAA): judges are themselves agents, and every participant — judge and subject — interacts only through standardized protocols (A2A for task management, MCP for tools). They back this with a 5-month open competition that drew 298 judge agents across 12 categories + 467 subject agents from independent teams, plus a controlled coding-agent case study that confirms fidelity with public records.
Research agents orchestrate calls well but treat papers as flat citations. Agents-K1 is an end-to-end pipeline that converts raw scientific documents into agent-native knowledge graphs: a multimodal parser (entities, evidence, claims, method lineages, 5-module schema), a 4B extraction backbone trained with GRPO + rule-based rewards, and a tri-source agent interface (graphanything CLI) that unifies web search, multimodal graph retrieval, and cross-document traversal. Already used to build Scholar-KG over 2.46M papers across six subjects (1M-paper subset released).
Today's tool agents expose every atomic tool call, observation, and value transfer in the model's main reasoning trace — creating execution-granularity mismatch where deterministic sub-routines are unfolded into repeated model-visible decisions. HyperTool changes the model-visible unit: the model emits a code block that calls existing tools through their original MCP schemas, manipulates returned values, and passes intermediates locally, folding deterministic sub-workflows into one outer call. Trained on synthesized HyperTool-format trajectories verified in real MCP environments.
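The idea of folding a deterministic sub-workflow into one outer call can be sketched as follows. This is a toy under stated assumptions: plain Python functions stand in for MCP tools, the tool names and data are hypothetical, and `exec` with a trimmed builtins table is not a security sandbox.

```python
def search_orders(customer):
    # Hypothetical tool: return order IDs for a customer.
    return {"alice": ["A1", "B2"]}.get(customer, [])


def get_price(order_id):
    # Hypothetical tool: return the price of one order.
    return {"A1": 10.0, "B2": 5.5}[order_id]


def run_hypertool(code, tools):
    """Run a model-emitted code block with tools injected; return its result and a call log."""
    calls = []

    def wrap(name, fn):
        def inner(*args, **kwargs):
            calls.append(name)           # logged locally, not shown to the model
            return fn(*args, **kwargs)
        return inner

    # A single namespace dict so that names resolve inside generator expressions too.
    namespace = {"__builtins__": {"sum": sum, "len": len}}
    namespace.update({name: wrap(name, fn) for name, fn in tools.items()})
    exec(code, namespace)
    return namespace.get("result"), calls


# What the model would emit as ONE outer call instead of three separate tool turns.
emitted_block = """
ids = search_orders("alice")
result = sum(get_price(i) for i in ids)
"""

result, call_log = run_hypertool(
    emitted_block, {"search_orders": search_orders, "get_price": get_price}
)
```

Three tool invocations happen inside the block, but the model's reasoning trace only has to contain the emitted code and the final `result`; the intermediate IDs and prices are passed locally.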
A sim-to-real framework that reframes dexterous manipulation as an animation problem: procedurally-generated grasp keyframes get turned into manipulation trajectories via motion planning + RL. Data generation is nearly automatic (<1 minute per tool). Achieves zero-shot sim-to-real transfer for grasping and in-hand manipulation across four articulated tools spanning different scales and joint types.
A simple post-training recipe that turns a single DiT into a joint image+depth generator by assigning separate noise levels per modality — trainable on sparse real-world depth, with per-modality decoders. The key finding: scale works. Training T2I DiTs from scratch at 370M → 3.3B parameters shows larger models trained on more image data produce more accurate depth, with the largest model competitive with SOTA monocular depth estimators and a 57% relative AbsRel reduction vs. existing joint image-depth generative models.
When LLMs fail in seemingly random ways, the standard explanation is "they're not really reasoning, just pattern-matching." This paper tests 25 LLMs and human participants on the same everyday causal-reasoning problems and finds they make strikingly similar errors — including errors humans make from irrelevant prompt details. The authors then identify the specific attention heads driving LLM responses and show these heads implement a form of pattern-matching that also predicts the human errors.
A training-free spatial-reasoning framework that gives a VLM a stateful Python kernel pre-loaded with perception and geometry primitives — the agent writes one executable cell per step, conditioned on prior outputs. Across 20 static/dynamic 3D-4D spatial reasoning benchmarks, it hits 59.9% average accuracy, +11.2 points over the prior best spatial agent, with consistent gains across 6 VLM backbones.
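A stateful kernel of the kind described above can be approximated with a persistent namespace: each cell runs against the variables left behind by earlier cells, and its printed output is what the agent sees next. The sketch below is a minimal illustration; the `StatefulKernel` class and the `distance` primitive are hypothetical stand-ins for the framework's perception and geometry primitives.

```python
import contextlib
import io
import math


class StatefulKernel:
    """Hypothetical kernel: cells share one namespace, so later cells see earlier results."""

    def __init__(self, primitives):
        self.namespace = dict(primitives)

    def run_cell(self, code):
        # Execute one cell and return whatever it printed as the observation.
        buffer = io.StringIO()
        with contextlib.redirect_stdout(buffer):
            exec(code, self.namespace)
        return buffer.getvalue()


def distance(a, b):
    # Geometry primitive: Euclidean distance between two 3D points.
    return math.dist(a, b)


kernel = StatefulKernel({"distance": distance})

# Step 1: the agent records object positions (made-up values).
kernel.run_cell("chair = (0.0, 0.0, 0.0)\ntable = (3.0, 4.0, 0.0)")

# Step 2: a later cell reuses state from step 1.
observation = kernel.run_cell("d = distance(chair, table)\nprint(d)")
```

Because `chair` and `table` persist between cells, the second step can build on the first without recomputing or restating anything, which is the property that makes one-cell-per-step reasoning workable.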