3. Models & Benchmarks
- Open LLM Leaderboard 2026 - Compare Open Source LLM Rankings
13 hours ago - Updated continuously from provider APIs and verified benchmarks. See the LLM Stats Score methodology for how rankings are computed. Best for coding (Arena): Claude Opus 4.6 (21.3 arena score); Best on GPQA Diamond: Claude Mythos Preview (94.6%); Best on AIME 2025: Ge
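- DeepSeek-V4-Pro - AI Model Price Comparison (2026/6/3)
3 minutes ago · **DeepSeek V4 Pro** is a large-scale Mixture-of-Experts model from DeepSeek with 1.6T (trillion) total parameters and 49B (billion) activated parameters, supporting a context window of 1 million tokens. The model is designed for advanced reasoning, coding, and long-horizon agentic workflows, and performs strongly on knowledge, math, and software engineering benchmarks. Based on … with DeepSeek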
- SWE-Review Leaderboard
2 hours ago - SWE-Review leaderboard: MiniMax M2.1 leads the 1 AI model evaluated, at 0.089. Software Engineering Review benchmark evaluating code review capabilities.
- SWE-Bench Pro Leaderboard AI Coding Benchmark (Public Dataset) | Scale
20 hours ago - Massive Performance Drop on SWE-Bench Pro: A major finding is the significant drop in performance for all models when moving from the SWE-Bench Verified benchmark to the more challenging SWE-Bench Pro. While most top models score over 70% on the verified version, t
- LLM Leaderboard 2026: Compare 300+ Top AI Models by Intelligence, Speed & Price
The LLM Stats Score is a composite that blends verified benchmark results (GPQA Diamond, SWE-Bench Verified, coding-arena), live performance metrics (output throughput, time-to-first-token) and per-token pricing into one comparable number. Pricing and metadata revalidate hourly;
- SWE-Bench Verified Leaderboard
1 day ago - 92 models evaluated on SWE-Bench Verified. Compare scores, rankings, and performance metrics.
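
The LLM Leaderboard snippet above describes the LLM Stats Score as a composite of benchmark results, live performance metrics, and per-token pricing. The actual methodology is not given in the snippet, so the following is only a minimal sketch of one common way to build such a composite: min-max normalize each metric across models, flip the metrics where lower is better, and take a weighted average. The field names, the `WEIGHTS` values, and the normalization choice are all assumptions made for illustration, not the leaderboard's real formula.

```python
from dataclasses import dataclass


@dataclass
class ModelStats:
    """Hypothetical per-model record; field names are illustrative only."""
    name: str
    gpqa_diamond: float        # benchmark accuracy in percent (higher is better)
    swe_bench_verified: float  # benchmark accuracy in percent (higher is better)
    coding_arena: float        # arena score (higher is better)
    throughput_tps: float      # output tokens per second (higher is better)
    ttft_s: float              # time to first token, seconds (lower is better)
    price_per_mtok: float      # USD per million tokens (lower is better)


# Assumed weights, NOT taken from the source; they only need to be positive.
WEIGHTS = {
    "gpqa_diamond": 0.25,
    "swe_bench_verified": 0.25,
    "coding_arena": 0.20,
    "throughput_tps": 0.10,
    "ttft_s": 0.10,
    "price_per_mtok": 0.10,
}

# Metrics where a smaller raw value should produce a larger score.
LOWER_IS_BETTER = {"ttft_s", "price_per_mtok"}


def min_max(values, higher_is_better=True):
    """Scale values to [0, 1]; a constant column maps to 0.5 for every model."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.5] * len(values)
    scaled = [(v - lo) / (hi - lo) for v in values]
    return scaled if higher_is_better else [1.0 - s for s in scaled]


def composite_scores(models, weights=WEIGHTS):
    """Return {model name: weighted average of normalized metrics in [0, 1]}."""
    columns = {
        key: min_max(
            [getattr(m, key) for m in models],
            higher_is_better=key not in LOWER_IS_BETTER,
        )
        for key in weights
    }
    total = sum(weights.values())
    return {
        m.name: sum(weights[k] * columns[k][i] for k in weights) / total
        for i, m in enumerate(models)
    }
```

Because the normalization is relative to the set of models being compared, a score computed this way changes whenever a model is added or removed, which is one reason real leaderboards publish their methodology separately.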