LMArena

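LMArena, long known as LMSYS Chatbot Arena or simply Chatbot Arena, is a human-preference leaderboard for AI models that covers text and a growing set of other modalities. It is well suited to tracking how models are perceived, but it should not be the only basis for choosing a model.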

Added: Jan 2026
Website: lmarena.ai

Tags: LMArena, Chatbot Arena, LMSYS, LLM leaderboard, human preference, model evaluation, Bradley Terry

Published 1/21/2026

Editorial Review

About LMArena

What it is

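The Chatbot Arena paper describes it as an open platform for evaluating LLMs through crowdsourced pairwise human comparisons. LMSYS methodology updates also describe a move from online Elo-style ratings to the Bradley-Terry model, which yields more stable ratings and confidence intervals.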
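
To make the Bradley-Terry idea concrete, here is a minimal sketch that fits model strengths from a list of pairwise votes. It is an illustration only, not LMArena's actual pipeline: the function names, the model names, the vote counts, and the 1000/400 rating scale are all assumptions chosen for the example, and ties are ignored.

```python
import math
from collections import defaultdict


def fit_bradley_terry(battles, iterations=500, tol=1e-10):
    """Estimate Bradley-Terry strengths from (winner, loser) pairs.

    Uses the classic minorization-maximization (MM) update. Assumes the two
    models in a battle are distinct and that every model has at least one
    win and one loss, so the estimates stay finite.
    """
    wins = defaultdict(int)    # total wins per model
    games = defaultdict(int)   # games played per unordered pair of models
    models = set()
    for winner, loser in battles:
        wins[winner] += 1
        games[frozenset((winner, loser))] += 1
        models.update((winner, loser))

    strength = {m: 1.0 for m in models}
    for _ in range(iterations):
        updated = {}
        for m in models:
            denom = 0.0
            for pair, n in games.items():
                if m in pair:
                    (other,) = pair - {m}
                    denom += n / (strength[m] + strength[other])
            updated[m] = wins[m] / denom
        # Strengths are defined only up to a constant factor; rescale to mean 1.
        factor = len(models) / sum(updated.values())
        updated = {m: s * factor for m, s in updated.items()}
        change = max(abs(updated[m] - strength[m]) for m in models)
        strength = updated
        if change < tol:
            break
    return strength


def win_probability(strength, a, b):
    """Modelled probability that a is preferred over b."""
    return strength[a] / (strength[a] + strength[b])


def to_elo_like_scale(strength, base=1000.0, scale=400.0):
    """Map strengths to a log scale so rating gaps read like Elo gaps."""
    return {m: base + scale * math.log10(s) for m, s in strength.items()}


# Hypothetical votes: each tuple is (winner, loser).
battles = (
    [("model-a", "model-b")] * 7 + [("model-b", "model-a")] * 3
    + [("model-b", "model-c")] * 6 + [("model-c", "model-b")] * 4
    + [("model-a", "model-c")] * 8 + [("model-c", "model-a")] * 2
)

strength = fit_bradley_terry(battles)
ratings = to_elo_like_scale(strength)
for model in sorted(ratings, key=ratings.get, reverse=True):
    print(f"{model}: {ratings[model]:.1f}")
```

Because the fit uses all votes at once, reordering the `battles` list does not change the result, which is one reason a batch Bradley-Terry fit can be more stable than an online update.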

Who it's for

It suits teams comparing models before choosing an LLM for a product, people tracking how public preference shifts after new model releases, and researchers studying crowdsourced pairwise evaluation.

Key features

  • Human-preference leaderboard for comparing frontier AI models.
  • Blind pairwise comparisons aggregated into Elo-like/Bradley-Terry ratings.
  • Leaderboards now span text and other modalities such as image, video, search, and code.
  • Useful public signal for model quality, but not a complete benchmark suite.
  • Backed by the Chatbot Arena research paper and LMSYS methodology updates.
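
For contrast with a batch fit, the sketch below shows a plain online Elo-style update applied to a stream of votes. It is a textbook Elo update under assumed settings (K-factor 32, starting rating 1000, no ties), not the platform's code, and the model names are hypothetical. The point it illustrates is that an online update is sensitive to the order in which the same votes arrive.

```python
def elo_ratings(battles, k=32.0, initial=1000.0):
    """Apply an online Elo update to (model_a, model_b, winner) votes in order."""
    ratings = {}
    for model_a, model_b, winner in battles:
        ra = ratings.setdefault(model_a, initial)
        rb = ratings.setdefault(model_b, initial)
        # Expected score of model_a under the logistic Elo curve.
        expected_a = 1.0 / (1.0 + 10 ** ((rb - ra) / 400.0))
        score_a = 1.0 if winner == model_a else 0.0
        delta = k * (score_a - expected_a)
        ratings[model_a] = ra + delta
        ratings[model_b] = rb - delta
    return ratings


# The same two hypothetical votes, processed in opposite orders.
votes = [
    ("model-a", "model-b", "model-a"),
    ("model-a", "model-b", "model-b"),
]
forward = elo_ratings(votes)
backward = elo_ratings(list(reversed(votes)))

print(forward)
print(backward)
```

Each model wins exactly once in both runs, yet whichever model won most recently finishes slightly ahead. A fit that uses all votes at once, such as Bradley-Terry, does not have this ordering effect.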

Real-world use cases

  • Compare model performance before choosing an LLM for a product.
  • Track public preference shifts after new model releases.
  • Explain why human-preference evaluation can differ from static academic benchmarks.
  • Use leaderboard signals in procurement, model-routing, or evaluation planning.
  • Study crowdsourced pairwise evaluation methodology for AI benchmarking.

Recommended workflow

  • Check which arena category matches your use case: text, code, image, video, search, or other modality.
  • Look at confidence intervals and recency, not just rank order.
  • Compare Arena results with your own private evals before switching models.
  • Use human preference rankings as one signal alongside cost, latency, safety, context, and tool support.
  • Watch for leaderboard instability and category-specific differences.
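
The advice to read confidence intervals rather than rank order can be sketched as a simple check: treat any model whose interval overlaps the leader's as not clearly separated from it. The scores, intervals, and model names below are made up for illustration, and interval overlap is a rough heuristic rather than a formal significance test.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    model: str
    score: float
    ci_low: float
    ci_high: float


def not_clearly_behind_leader(entries):
    """Return models whose interval overlaps the top-scoring model's interval."""
    leader = max(entries, key=lambda e: e.score)
    return [e.model for e in entries if e.ci_high >= leader.ci_low]


# Hypothetical leaderboard rows.
leaderboard = [
    Entry("model-a", 1300.0, 1292.0, 1308.0),
    Entry("model-b", 1295.0, 1286.0, 1304.0),
    Entry("model-c", 1270.0, 1262.0, 1278.0),
]

top_group = not_clearly_behind_leader(leaderboard)
print(top_group)  # ['model-a', 'model-b']
```

Here model-a ranks first, but model-b's interval overlaps it, so the two are better read as a tie than as a strict first and second place.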

Strengths and limitations

  • Very useful for public preference signals and frontier-model comparison.
  • Crowdsourced votes can reflect user mix, prompt mix, UI effects, and recency bias.
  • Pairwise rankings do not replace domain-specific private evaluations.
  • Scores can move as new votes, models, and methodology changes arrive.

Comparable alternatives

  • HELM for broader academic benchmark coverage.
  • Artificial Analysis for speed, price, and quality metrics.
  • OpenRouter rankings for usage and routing ecosystem signals.
  • Your own eval harness for domain-specific business tasks.

FAQ

Is Chatbot Arena the same as LMArena?

Largely, yes. The branding has evolved from LMSYS Chatbot Arena and LMArena toward Arena-style leaderboards, but the core idea is still human-preference model comparison.

How are models ranked?

Users vote in blind pairwise comparisons, and those votes are aggregated into ratings using the Bradley-Terry (Elo-like) methodology described in the Chatbot Arena paper and LMSYS updates.

Should teams choose a model only by Arena rank?

No. Use it as one signal and also evaluate cost, latency, safety, context length, tool use, and your own domain tasks.
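
One way to treat the Arena score as a single signal among several is a weighted scorecard. The sketch below is an assumption-laden example, not a recommended formula: the candidate models, metric values, and weights are all hypothetical, and min-max scaling is just one simple normalization choice.

```python
def min_max(values, higher_is_better=True):
    """Scale values to [0, 1]; flip the scale when lower is better."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [1.0] * len(values)
    scaled = [(v - lo) / (hi - lo) for v in values]
    return scaled if higher_is_better else [1.0 - s for s in scaled]


def rank_candidates(candidates, weights, lower_is_better=("cost", "latency")):
    """Return (model, weighted score) pairs, best first."""
    names = list(candidates)
    totals = {name: 0.0 for name in names}
    for metric, weight in weights.items():
        raw = [candidates[name][metric] for name in names]
        scaled = min_max(raw, higher_is_better=metric not in lower_is_better)
        for name, value in zip(names, scaled):
            totals[name] += weight * value
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Hypothetical candidates: arena score, cost per million tokens (USD),
# median latency (seconds), and pass rate on a private eval set.
candidates = {
    "model-a": {"arena": 1300, "cost": 10.0, "latency": 1.2, "private_eval": 0.80},
    "model-b": {"arena": 1280, "cost": 2.0, "latency": 0.6, "private_eval": 0.85},
    "model-c": {"arena": 1250, "cost": 0.5, "latency": 0.4, "private_eval": 0.60},
}
weights = {"arena": 0.25, "private_eval": 0.45, "cost": 0.15, "latency": 0.15}

ranking = rank_candidates(candidates, weights)
for model, score in ranking:
    print(f"{model}: {score:.3f}")
```

With these made-up numbers, model-a has the highest Arena score but model-b comes out on top overall, because the private eval, cost, and latency carry most of the weight.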



Quick Info

Added: 1/21/2026
Published: 1/21/2026
Updated: 6/12/2026

