[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"$fISRimALEnfM2SgMul-OQMhF-wP6yAvA7wbxAhUFToQY":3},{"code":4,"msg":5,"data":6},200,"操作成功",{"id":7,"title":8,"content":9,"digest":10,"source":10,"coverPath":11,"thumbsCoverPath":12,"isTop":13,"isShow":14,"baseClick":13,"clickCount":15,"createTime":16,"typeId":17,"isNewest":18,"newsInfoTypeRespVo":19,"voiceUrl":22,"voiceSize":23,"taskId":24,"releaseTime":25,"titleEn":26,"contentEn":27,"voiceUrlEn":28,"taskIdEn":29,"voiceSizeEn":30},1201,"AI竞技场，归根到底只是一门生意","\u003Cp>\u003Cstrong class=\"ql-lineHeight-1-75\" style=\"font-size: 18px;\">\u003Cem>“XX发布最强开源大模型，多项基准测试全面超越XX等闭源模型！”\u003C\u002Fem>\u003C\u002Fstrong>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cstrong class=\"ql-lineHeight-1-75\" style=\"font-size: 18px;\">\u003Cem>“万亿参数开源模型XX强势登顶全球开源模型榜首！”\u003C\u002Fem>\u003C\u002Fstrong>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cstrong class=\"ql-lineHeight-1-75\" style=\"font-size: 18px;\">\u003Cem>“国产之光！XX模型在中文评测榜单拿下第一！”\u003C\u002Fem>\u003C\u002Fstrong>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cstrong class=\"ql-lineHeight-1-75\" style=\"font-size: 18px;\">随着AI时代的到来，各位的朋友圈、微博等社交平台是不是也常常被诸如此类的新闻刷屏了？\u003C\u002Fstrong>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">今天这个模型拿到了冠军，明天那个模型变成了王者。评论区里有的人热血沸腾，有的人一头雾水。\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">一个又一个的现实问题摆在眼前：\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cstrong style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">这些模型所谓的“登顶”比的是什么？谁给它们评分，而评分的依据又是什么？为什么每个平台的榜单座次都不一样，到底谁更权威？\u003C\u002Fstrong>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">如果各位也产生了类似的困惑，说明各位已经开始从“看热闹”转向“看门道”。\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan 
style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">本文之中，我们便来拆解一下不同类型“AI竞技场”——也就是大语言模型排行榜——的“游戏规则”。\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp class=\"ql-align-center\">\u003Cstrong style=\"font-size: 18px; color: rgb(255, 153, 0);\" class=\"ql-lineHeight-1-75\">01 类型一：客观基准测试（Benchmark），给AI准备的“高考”\u003C\u002Fstrong>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">人类社会中，高考分数是决定学生大学档次的最主要评判标准。\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">同样地，在AI领域，也有很多\u003C\u002Fspan>\u003Cstrong style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">高度标准化\u003C\u002Fstrong>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">的测试题，用来尽可能客观地衡量AI模型在特定能力上的表现。\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">因此，在这个大模型产品频繁推陈出新的时代，各家厂商推出新模型后，第一件事就是拿到“高考”考场上跑个分，是骡子是马，拉出来遛遛。\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">Artificial Analysis平台提出了一项名为“\u003C\u002Fspan>\u003Cstrong style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">Artificial Analysis Intelligence Index（AAII）\u003C\u002Fstrong>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">”的综合性评测基准，汇总了7个极为困难且专注于前沿能力的单项评测结果。\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">类似于股票价格指数，AAII能够给出衡量AI智能水平的综合分数，尤其专注于需要\u003C\u002Fspan>\u003Cstrong style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">深度推理、专业知识和复杂问题解决能力\u003C\u002Fstrong>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">的任务。\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" 
class=\"ql-lineHeight-1-75\">这7项评测覆盖了被普遍视作衡量高级智能核心的三个领域：\u003C\u002Fspan>\u003Cstrong style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">知识推理、数学和编程\u003C\u002Fstrong>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">。\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cstrong style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">（1）知识与推理领域\u003C\u002Fstrong>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">MMLU-Pro：\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">全称Massive Multitask Language Understanding - Professional Level\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">MMLU的加强版。MMLU涵盖57个学科的知识问答测试，而MMLU-Pro在此基础上，通过更复杂的提问方式和推理要求，进一步增加难度以测试模型在专业领域的知识广度和深度推理能力。\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">GPQA Diamond：\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">全称Graduate-Level Google-Proof Q&amp;A - Diamond Set\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">此测试集包含生物学、物理学和化学领域的专业问题。与其名称对应，其设计初衷很直白：即使是相关领域的研究生，在允许使用Google搜索的情况下也很难在短时间内找到答案。而Diamond正是其中难度最高的一个子集，需要AI具备较强的推理能力和问题分解能力，而非简单的信息检索。\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">Humanity’s Last Exam：\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">由Scale AI和Center for AI 
Safety（CAIS）联合发布的一项难度极高的基准测试，涵盖科学、技术、工程、数学甚至是人文艺术等多个领域。题目大多为开放式，不仅需要AI进行多个步骤的复杂推理，还需要AI发挥一定的创造性。这项测试能够有效评估AI是否具备跨学科的综合问题解决能力。\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cstrong style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">（2）编程领域\u003C\u002Fstrong>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">LiveCodeBench：\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">这是一项贴近现实的编程能力测试。与传统的编程测试只关注代码的正确性不同，AI会被置于一个“实时”的编程环境中，并根据问题描述和一组公开的测试用例编写代码，而代码将会使用一组更复杂的隐藏测试用例运行并评分。这项测试主要考验AI编程是否具备较高的鲁棒性以及处理边界情况的能力。\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">SciCode：\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">这一项编程测试则更偏向于学术性，专注于科学计算和编程。AI需要理解复杂的科学问题并用代码实现相应的算法或模拟。除了考验编程技巧，还需要AI对科学原理具备一定深度的理解。\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cstrong style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">（3）数学领域\u003C\u002Fstrong>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">AIME：\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">全称American Invitational Mathematics Examination\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">美国高中生数学竞赛体系中的一环，难度介于AMC（美国数学竞赛）和USAMO（美国数学奥林匹克）之间。其题目具备较高的挑战性，需要AI具备创造性的解题思路和数学功底，能够衡量AI在高级数学领域中的推理能力。\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" 
class=\"ql-lineHeight-1-75\">MATH-500：\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">从大型数学问题数据集“MATH”中随机抽取500道题构成的测试，覆盖从初中到高中竞赛水平的各类数学题目，涵盖代数、几何和数论等领域。题目以LaTeX格式给出，模型不仅要给出答案，还需要有详细的解题步骤，是评估AI形式化数学推理和解题能力的重要标准。\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cimg alt=\"undefined\" src=\"https:\u002F\u002Fimage.51xinwei.com\u002F2025\u002F08\u002Ffd128eb74e204eefb71ad3b0457ef24f\u002FAA1JZb1i.jpg\" width=\"undefined\" height=\"undefined\" style=\"display: block; margin: auto;\">\u003Cp class=\"ql-align-center\">\u003Cspan style=\"color: rgb(187, 187, 187);\">图：Artificial Analysis的AI模型智能排行榜\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">不过，由于模型的用处不同，各大平台并不会采用相同的测评标准。\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">例如，司南（OpenCompass）的大语言模型榜单根据其自有的闭源评测数据集（CompassBench）进行评测，我们无法得知具体测试规则，但该团队面向社区提供了公开的验证集，每隔3个月更新评测题目。\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cimg alt=\"undefined\" src=\"https:\u002F\u002Fimage.51xinwei.com\u002F2025\u002F08\u002Fc590b7c4b44442a0bebd338ba52b9d3d\u002FAA1JZi6a.jpg\" width=\"undefined\" height=\"undefined\" style=\"display: block; margin: auto;\">\u003Cp class=\"ql-align-center\">\u003Cspan style=\"color: rgb(187, 187, 187);\" class=\"ql-lineHeight-1-75\">&nbsp;图：OpenCompass大语言模型榜\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">与此同时，该网站也选取了一些合作伙伴的评测集，针对AI模型的主流应用领域进行评测并发布了测试榜单：\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cimg alt=\"undefined\" src=\"https:\u002F\u002Fimage.51xinwei.com\u002F2025\u002F08\u002F305e679c8ed64252bb0dff4de1881c4e\u002FAA1JZ8Fj.jpg\" width=\"undefined\" height=\"undefined\" style=\"display: block; 
margin: auto;\">\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">&nbsp;而HuggingFace也有类似的开源大语言模型榜单，测评标准中包含了前面提过的MATH、GPQA和MMLU-Pro：\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cimg alt=\"undefined\" src=\"https:\u002F\u002Fimage.51xinwei.com\u002F2025\u002F08\u002F9c1a65d73f664a3fb253d65c9e72973c\u002FAA1JZb1p.jpg\" width=\"undefined\" height=\"undefined\" style=\"display: block; margin: auto;\">\u003Cp class=\"ql-align-center\">\u003Cspan style=\"color: rgb(187, 187, 187);\" class=\"ql-lineHeight-1-75\">图：HuggingFace上的开源大语言模型排行榜\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">在这个榜单中，还增加了一些测评标准，并附有解释：\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">IFEval：\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">全称Instruction-Following Evaluation\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">用于测评大语言模型遵循指令的能力，其重点在于格式化。这项测评不仅需要模型给出正确的回答，还注重于模型能否严格按照用户给出的特定格式来输出答案。\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">BBH：\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">全称Big Bench Hard\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">从Big Bench基准测试中筛选出的一部分较为困难的任务，构成了专门为大语言模型设计的高难度问题集合。作为一张“综合试卷”，它包含多种类型的难题，如语言理解、数学推理、常识和世界知识等方面。不过，这份试卷上只有选择题，评分标准为准确率。\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" 
class=\"ql-lineHeight-1-75\">MuSR：\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">全称Multistep Soft Reasoning\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">用于测试AI模型在长篇文本中进行复杂、多步骤推理能力的评测集。其测试过程类似于人类的“阅读理解”，在阅读文章后，需要将散落在不同地方的线索和信息点串联起来才能得到最终结论，即“多步骤”和“软推理”。此测评同样采用选择题的形式，以准确率为评分标准。\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">CO2&nbsp;Cost：\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">这是最有趣的一项指标，因为大部分LLM榜单上都不会标注二氧化碳排放量。它只代表了模型的环保性和能源效率，而无法反映其聪明程度和性能。\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">同样地，在HuggingFace上搜索LLM Leaderboard，也可以看到有多个领域的排行榜。\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cimg alt=\"undefined\" src=\"https:\u002F\u002Fimage.51xinwei.com\u002F2025\u002F08\u002Feec20d77925c4c97a17c74757b6a45fc\u002FAA1JZ8Fm.jpg\" width=\"undefined\" height=\"undefined\" style=\"display: block; margin: auto;\">\u003Cp class=\"ql-align-center\">\u003Cspan class=\"ql-lineHeight-1-75\" style=\"color: rgb(187, 187, 187);\">图：HuggingFace上的其他大语言模型排行榜\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">可以看到，把客观基准测试作为AI的“高考”，其优点很明确：\u003C\u002Fspan>\u003Cstrong style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">客观、高效、可复现\u003C\u002Fstrong>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">。\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">同时，可以快速衡量模型在某一领域或某一方面的“硬实力”。\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cstrong 
style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">但伴随“高考”而来的，则是应试教育固有的弊端。\u003C\u002Fstrong>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cstrong style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">模型可能在测试中受到数据污染的影响，导致分数虚高，但实际应用中却一问三不知。\u003C\u002Fstrong>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">毕竟，在我们先前的大模型测评中，简单的财务指标计算也可能出错。\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">同时，客观基准测试很难衡量模型的“软实力”。\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">文本上的创造力、答案的情商和幽默感、语言的优美程度，这些难以量化、平时不会特意拿出来说的衡量指标，却决定着我们使用模型的体验。\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">因此，当一个模型大规模宣传自己在某个基准测试上“登顶”时，它就成为了“单科状元”，这已经是很了不起的成就，但离“全能学霸”还有很远距离。\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp class=\"ql-align-center\">\u003Cstrong style=\"font-size: 18px; color: rgb(255, 153, 0);\" class=\"ql-lineHeight-1-75\">02 类型二：人类偏好竞技场（Arena），匿名才艺大比拼\u003C\u002Fstrong>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">前面已经说过，客观基准测试更注重于模型的“硬实力”，但它无法回答一个最实际的问题：\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">一个模型，到底用起来“爽不爽”？\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">一个模型可能在MMLU测试中知晓天文地理，但面对简单的文字编辑任务却束手无策；\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">一个模型可能在MATH测试中秒解代数几何，却无法理解用户话语中的一丝幽默和讽刺。\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 
18px;\" class=\"ql-lineHeight-1-75\">面对上述困境，来自加州大学伯克利分校等高校的研究人员组成的LMSys.org团队提出了一个想法：\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">“既然模型最终为人而服务，那为什么不直接让人来评判呢？”\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">这一次，评判标准不再是试卷和题集，评分标准交到了用户手中。\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">LMSys Chatbot Arena，一个通过“盲测对战”来对大语言模型进行排名的大型众包平台。\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">对战时，两个模型同时登场，并对同一个问题进行解答，由用户决定谁输谁赢。\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">用户在投票前无法得知两个“选手”的“真实身份”，有效消除了刻板偏见。\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">对于一般用户来说，LMArena的使用方法非常简单：\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">登录https:\u002F\u002Flmarena.ai\u002F后，首先由用户进行提问，系统会随机挑选两个不同的大语言模型，并将问题同时发送给它们。\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cimg alt=\"undefined\" src=\"https:\u002F\u002Fimage.51xinwei.com\u002F2025\u002F08\u002Fb28b6f0601e0459790206122f24c4354\u002FAA1JZb1s.jpg\" width=\"undefined\" height=\"undefined\" style=\"display: block; margin: auto;\">\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">匿名标注为Assistant A和Assistant B两个模型生成的答案会并排显示，而用户需要根据自己的判断，投票选择最合适的回答。\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">而在投票后，系统才会告知用户Assistant A和Assistant 
B分别是哪个模型，而这次投票也会加入到全球用户的投票数据中。\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cimg alt=\"undefined\" src=\"https:\u002F\u002Fimage.51xinwei.com\u002F2025\u002F08\u002Fb560285509cd4831a2dab6a7d0b15b82\u002FAA1JZoMT.jpg\" width=\"undefined\" height=\"undefined\" style=\"display: block; margin: auto;\">\u003Cp class=\"ql-align-center\">\u003Cspan style=\"color: rgb(187, 187, 187);\" class=\"ql-lineHeight-1-75\">图：LMArena文本能力排行榜\u003C\u002Fspan>\u003C\u002Fp>\u003Cp class=\"ql-align-center\">\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">LMArena中设计了七个分类的排行榜，分别是Text（文本\u002F语言能力）、WebDev（Web开发）、Vision（视觉\u002F图像理解）、Text-to-Image（文生图）、Image Edit（图像编辑）、Search（搜索\u002F联网能力）和Copilot（智能助力\u002F代理能力）。\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">每个榜单都是由用户的投票产生的，而LMArena采用的核心创新机制就是Elo评级系统。\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">这套系统最初用于国际象棋等双人对战游戏，可用于衡量选手的相对实力。\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">而在大模型排行榜中，每个模型都会有一个初始分数，即Elo分。\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">当模型A在一场对决中战胜模型B时，模型A就可以从模型B那赢得一些分数。\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">而赢得多少分数，取决于对手有多少实力。如果击败了分数远高于自己的模型，则会获得大量分数；如果只是击败了分数远低于自己的模型，则只能获得少量分数。\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">因此，一旦输给弱者，则会丢掉大量分数。\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" 
class=\"ql-lineHeight-1-75\">这个系统很适合处理大量的“1v1”成对比较数据，能够判断相对强弱而非绝对强弱，并能够使\u003C\u002Fspan>\u003Cstrong style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">排行榜动态更新\u003C\u002Fstrong>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">，更具备可信度。\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cstrong style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">尽管有相关研究人员指出LMArena的排行榜存在私测特权、采样不公等问题，但它仍是目前衡量大语言模型综合实力较为权威的排行榜之一。\u003C\u002Fstrong>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">在AI新闻满天飞的环境下，它的优势在于\u003C\u002Fspan>\u003Cstrong style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">消除用户先入为主的偏见\u003C\u002Fstrong>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">。\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">同时，我们前面提到的创造力、幽默感、语气和写作风格等难以量化的指标将在投票中得以体现，有助于\u003C\u002Fspan>\u003Cstrong style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">衡量主观质量\u003C\u002Fstrong>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">。\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cstrong style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">但是，简单的流程和直观的“二选一”也为类似的竞技场平台带来了不少局限性：\u003C\u002Fstrong>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">一是\u003C\u002Fspan>\u003Cstrong style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">聚焦于单轮对话\u003C\u002Fstrong>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">：其评测主要采取“一问一答”的方式，而对于需要多轮对话的任务则难以充分进行评估；\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">二是\u003C\u002Fspan>\u003Cstrong style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">存在投票者偏差\u003C\u002Fstrong>\u003Cspan style=\"font-size: 18px;\" 
class=\"ql-lineHeight-1-75\">：这是统计中难以避免的现象，投票的用户群体可能更偏向于技术爱好者，其问题类型和评判标准必然无法覆盖普通用户；\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">三是\u003C\u002Fspan>\u003Cstrong style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">主观性过强\u003C\u002Fstrong>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">：用户对于“好”和“坏”的评判过于主观，而Elo分数则只是体现主观偏好的平均结果；\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">四是\u003C\u002Fspan>\u003Cstrong style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">缺失事实核查性\u003C\u002Fstrong>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">：用户在对两个模型进行评判时，注意力往往放在答案的表述上，而忽视了回答内容的真实性。\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp class=\"ql-align-center\">\u003Cstrong style=\"font-size: 18px; color: rgb(255, 153, 0);\" class=\"ql-lineHeight-1-75\">03 我们到底该看哪个排行榜？\u003C\u002Fstrong>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">AI江湖的“武林大会”远不止我们提到的这些排行榜。随着AI领域规模的不断扩大，评测的战场本身也变得越来越复杂和多元化。\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">很多学术机构或大型AI公司会发布自家的评测报告或自建榜单，体现出技术自信，但作为用户，则需要“打个问号”。\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">就像足球比赛有主客场之分，机构也可以巧妙地设计评测的维度和题目，使其恰好能放大某些模型的优势，同时规避其弱点。\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">另一个更加宏大的趋势是，\u003C\u002Fspan>\u003Cstrong style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">大模型的评测榜单正在从“大一统”走向“精细化”\u003C\u002Fstrong>\u003Cspan style=\"font-size: 18px;\" 
class=\"ql-lineHeight-1-75\">。\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">据不完全统计，迄今为止，全球已发布大模型总数达到3755个。\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">“千模大战”的时代，一份冗长的通用榜单，显然无法满足所有人的需求。\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">因此，评测的趋势也不可避免地走向\u003C\u002Fspan>\u003Cstrong style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">细分化和垂直化\u003C\u002Fstrong>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">。\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">那么回到最初的核心问题：到底谁更权威？\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cstrong style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">观点很明确：没有任何一个单一的排行榜是绝对权威的。\u003C\u002Fstrong>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cstrong style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">排行榜终究是参考，甚至不客气的说，“AI竞技场”归根到底只是一门生意。对于高频刷榜的模型，我们务必要警惕——不是估值需求驱动，便是PR导向驱动。是骡子是马，终究不是一个竞技场能盖棺定论的。\u003C\u002Fstrong>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cstrong style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">但对于普通用户来说，评判一个模型的最终标准是唯一的：它是否真正对你有用。\u003C\u002Fstrong>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cstrong style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">评价和选择模型，要先看应用场景\u003C\u002Fstrong>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">。\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">如果你是程序员，就去试试AI编写代码、检查和修复Bug的能力；\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" 
class=\"ql-lineHeight-1-75\">如果你是大学生，就让AI去做文献综述，解释学术名词和概念；\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">如果你是营销人，就看看AI能否写出精彩的文案、构思和创意。\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cstrong style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">别让“登顶”的喧嚣干扰了你的判断。\u003C\u002Fstrong>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">大模型是工具，不是神。看懂排行榜，是为了更好地选择工具。\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">与其迷信排行榜，不如把实际问题交给它试一试，哪个模型能\u003C\u002Fspan>\u003Cstrong style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">最高效优质地解决问题\u003C\u002Fstrong>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">，它就是你的“私人冠军”。\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"color: rgb(187, 187, 187);\">【新闻来源】钛媒体APP 文 | 锦缎 \u003C\u002Fspan>\u003Ca href=\"https:\u002F\u002Fwww.msn.cn\u002Fzh-cn\u002Fnews\u002Fother\u002Fai%E7%AB%9E%E6%8A%80%E5%9C%BA-%E5%BD%92%E6%A0%B9%E5%88%B0%E5%BA%95%E5%8F%AA%E6%98%AF%E4%B8%80%E9%97%A8%E7%94%9F%E6%84%8F\u002Far-AA1JZi6L?ocid=BingNewsLanding&amp;cvid=2c1b98dc2a524cf6800a32e6f00b1563&amp;ei=10\" rel=\"noopener noreferrer\" target=\"_blank\" style=\"color: rgb(187, 187, 187);\">http:\u002F\u002Fu5a.cn\u002FWbQ6a\u003C\u002Fa>\u003C\u002Fp>\u003Cp class=\"ql-align-justify\">\u003Cspan style=\"color: rgb(187, 187, 
187);\">（本网转发此文章，旨在为读者提供更多的信息资讯，所涉内容不构成投资、消费建议。文章事实如有疑问，请与有关方核实，文章观点非本网观点，仅供读者参考。）\u003C\u002Fspan>\u003C\u002Fp>","","https:\u002F\u002Fimage.51xinwei.com\u002F2025\u002F08\u002Fe022ca3581bf4f269b15076cf749351d\u002FAI领域.jpg","https:\u002F\u002Fimage.51xinwei.com\u002F2025\u002F08\u002Fthumbs\u002Fe022ca3581bf4f269b15076cf749351d\u002FAI领域.jpg",0,1,217,"2025-08-08 17:40",2,false,{"id":17,"name":20,"enName":21},"芯位视野","Xinwei Vision","https:\u002F\u002Fxinwei-dev-test.oss-cn-shenzhen.aliyuncs.com\u002Fintelligent\u002Faudio%3A4de17232-dd4e-466d-b295-fecede22a38e%3A0.wav?Expires=1754652797&OSSAccessKeyId=LTAI5tNvY2RkKjZw4LLWsrPK&Signature=uUKI1lOs8%2FcOkP6RZ65jtefqr5k%3D",24635194,"4de17232-dd4e-466d-b295-fecede22a38e","2025-08-08 17:27","AI Arena, in the end, is just a business.","\u003Cp>\u003Cstrong class=\"ql-lineHeight-1-75\" style=\"font-size: 18px;\">\u003Cem>\"XX releases the strongest open-source large model, surpassing closed-source models like XX in multiple benchmark tests!\"\u003C\u002Fem>\u003C\u002Fstrong>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cstrong class=\"ql-lineHeight-1-75\" style=\"font-size: 18px;\">\u003Cem>\"The trillion-parameter open-source model XX strongly ascends to the top of the global open-source model list!\"\u003C\u002Fem>\u003C\u002Fstrong>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cstrong class=\"ql-lineHeight-1-75\" style=\"font-size: 18px;\">\u003Cem>\"National pride! 
The XX model takes first place on the Chinese evaluation list!\"\u003C\u002Fem>\u003C\u002Fstrong>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cstrong class=\"ql-lineHeight-1-75\" style=\"font-size: 18px;\">With the arrival of the AI era, are your social media circles and Weibo frequently flooded with such news?\u003C\u002Fstrong>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">Today this model wins the championship, tomorrow that model becomes the king. Some people get excited, while others are confused.\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">One after another, real problems are put before us:\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cstrong style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">What does it mean for these models to \"ascend\"? Who gives them scores, and what is the basis for scoring? 
Why do the rankings differ across platforms, and who is more authoritative?\u003C\u002Fstrong>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">If you have similar doubts, it means you have started to move from \"watching the spectacle\" to \"understanding the rules\".\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">In this article, we will break down the \"rules of the game\" of different types of \"AI arenas\", that is, large language model rankings.\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp class=\"ql-align-center\">\u003Cstrong style=\"font-size: 18px; color: rgb(255, 153, 0);\" class=\"ql-lineHeight-1-75\">01 Type One: Objective Benchmark Tests (Benchmark), the \"Gaokao\" for AI\u003C\u002Fstrong>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">In human society, college entrance exam (Gaokao) scores are the main criterion that determines which tier of university a student can attend.\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">Similarly, in the AI field, there are many \u003C\u002Fspan>\u003Cstrong style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">highly standardized\u003C\u002Fstrong>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\"> tests used to measure AI models' performance in specific abilities as objectively as possible.\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">Therefore, in this era of frequent new large model launches, the first thing each manufacturer does after launching a new model is to take the \"Gaokao\" test, to see whether it's a horse or a 
donkey.\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">The Artificial Analysis platform proposed a composite evaluation benchmark called \"Artificial Analysis Intelligence Index (AAII)\", which aggregates the results of seven extremely difficult individual evaluations focused on frontier capabilities.\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">Like a stock price index, AAII provides a composite score measuring AI intelligence levels, focusing especially on tasks requiring \u003C\u002Fspan>\u003Cstrong style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">deep reasoning, professional knowledge, and complex problem-solving abilities\u003C\u002Fstrong>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">.\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">These seven evaluations cover three areas widely regarded as core indicators of advanced intelligence: \u003C\u002Fspan>\u003Cstrong style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">knowledge reasoning, mathematics, and programming\u003C\u002Fstrong>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">.\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cstrong style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">（1）Knowledge and Reasoning Field\u003C\u002Fstrong>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">MMLU-Pro：\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">Full name: Massive Multitask Language Understanding - Professional Level\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan 
style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">An enhanced version of MMLU. MMLU is a knowledge question-answering test spanning 57 subjects; MMLU-Pro raises the difficulty with more complex questions and heavier reasoning requirements, testing both the breadth of a model's knowledge and the depth of its reasoning in specialized fields.\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">GPQA Diamond:\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">Full name: Graduate-Level Google-Proof Q&amp;A - Diamond Set\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">This test set contains expert-level questions in biology, physics, and chemistry. As the name suggests, its design goal is blunt: even graduate students in the relevant fields should struggle to find the answers quickly with a Google search. Diamond is the most challenging subset, requiring strong reasoning and problem decomposition rather than simple information retrieval.\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">Humanity’s Last Exam:\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">A high-difficulty benchmark jointly released by Scale AI and the Center for AI Safety (CAIS), covering science, technology, engineering, and mathematics as well as the humanities and arts. Most questions are open-ended and demand multi-step reasoning and a degree of creativity. 
This test effectively evaluates whether a model can solve interdisciplinary problems.\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cstrong style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">(2) Programming\u003C\u002Fstrong>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">LiveCodeBench:\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">A programming test that stays close to real-world scenarios. Unlike traditional programming tests that only check code correctness, it places the model in a \"live\" programming environment where it must write code from a problem description and a set of public test cases. The code is then run and scored against a more demanding set of hidden test cases. The test mainly examines the robustness of the generated code and how well it handles edge cases.\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">SciCode:\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">This programming test is more academic, focusing on scientific computing. The model must understand a complex scientific problem and implement the corresponding algorithm or simulation in code. 
In addition to testing programming skills, it requires the model to understand the underlying scientific principles in some depth.\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cstrong style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">(3) Mathematics\u003C\u002Fstrong>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">AIME:\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">Full name: American Invitational Mathematics Examination\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">Part of the American high school mathematics competition system, sitting between the AMC (American Mathematics Competitions) and the USAMO (United States of America Mathematical Olympiad). Its problems are highly challenging and call for creative problem-solving and solid mathematical skill, so it is used to measure a model's reasoning ability in advanced mathematics.\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">MATH-500:\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">A test of 500 questions randomly selected from the large-scale mathematics dataset \"MATH\", covering problems from middle school level up to high school competitions, including algebra, geometry, and number theory. 
The questions are written in LaTeX, and models must provide detailed solution steps as well as the final answer, which makes it an important yardstick for formal mathematical reasoning and problem-solving.\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cimg alt=\"undefined\" src=\"https:\u002F\u002Fimage.51xinwei.com\u002F2025\u002F08\u002Ffd128eb74e204eefb71ad3b0457ef24f\u002FAA1JZb1i.jpg\" width=\"undefined\" height=\"undefined\" style=\"display: block; margin: auto;\">\u003Cp class=\"ql-align-center\">\u003Cspan style=\"color: rgb(187, 187, 187);\">Figure: Artificial Analysis AI Model Intelligence Ranking List\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">However, because models serve different purposes, the major platforms do not share the same evaluation criteria.\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">For example, the large language model leaderboard of OpenCompass (Sinan) is scored on its own closed-source evaluation dataset, CompassBench, so the specific test rules are not public. The team does, however, provide an open validation set for the community and refreshes the evaluation questions every three months.\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cimg alt=\"undefined\" src=\"https:\u002F\u002Fimage.51xinwei.com\u002F2025\u002F08\u002Fc590b7c4b44442a0bebd338ba52b9d3d\u002FAA1JZi6a.jpg\" width=\"undefined\" height=\"undefined\" style=\"display: block; margin: auto;\">\u003Cp class=\"ql-align-center\">\u003Cspan style=\"color: rgb(187, 187, 187);\" class=\"ql-lineHeight-1-75\">&nbsp;Figure: OpenCompass Large Language Model Ranking\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">At the same time, the website also selects 
some partners' evaluation sets to evaluate AI models in mainstream application areas and publishes test rankings:\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cimg alt=\"undefined\" src=\"https:\u002F\u002Fimage.51xinwei.com\u002F2025\u002F08\u002F305e679c8ed64252bb0dff4de1881c4e\u002FAA1JZ8Fj.jpg\" width=\"undefined\" height=\"undefined\" style=\"display: block; margin: auto;\">\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">&nbsp;HuggingFace also has a similar open-source large language model ranking list, with evaluation standards including the aforementioned MATH, GPQA, and MMLU-Pro:\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cimg alt=\"undefined\" src=\"https:\u002F\u002Fimage.51xinwei.com\u002F2025\u002F08\u002F9c1a65d73f664a3fb253d65c9e72973c\u002FAA1JZb1p.jpg\" width=\"undefined\" height=\"undefined\" style=\"display: block; margin: auto;\">\u003Cp class=\"ql-align-center\">\u003Cspan style=\"color: rgb(187, 187, 187);\" class=\"ql-lineHeight-1-75\">Figure: Open-Source Large Language Model Ranking List on HuggingFace\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">In this ranking list, some additional evaluation standards are added, along with explanations:\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">IFEval：\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">Full name: Instruction-Following Evaluation\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">Used to evaluate the ability of large language models to follow instructions, with a focus on formatting. 
This evaluation requires not only correct answers but also strict adherence to the specific output format the user asks for.\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">BBH:\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">Full name: BIG-Bench Hard\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">A set of the harder tasks selected from the BIG-Bench benchmark, forming a high-difficulty question set designed specifically for large language models. As a \"comprehensive exam paper\", it includes many types of difficult questions, such as language comprehension, mathematical reasoning, common sense, and world knowledge. However, this exam paper contains only multiple-choice questions, scored by accuracy.\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">MuSR:\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">Full name: Multistep Soft Reasoning\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">An evaluation set that tests a model's ability to perform complex, multi-step reasoning over long texts. 
Its testing process resembles a human \"reading comprehension\" exercise: after reading the passage, the model must connect scattered clues and pieces of information to reach the final conclusion, hence \"multistep\" and \"soft reasoning\". This evaluation also uses multiple-choice questions, scored by accuracy.\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">CO2&nbsp;Cost:\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">This is the most interesting indicator, since most LLM leaderboards do not list carbon dioxide emissions. It reflects only the model's environmental friendliness and energy efficiency, not its intelligence or performance.\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">Similarly, searching for \"LLM Leaderboard\" on HuggingFace turns up leaderboards for many other domains.\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cimg alt=\"undefined\" src=\"https:\u002F\u002Fimage.51xinwei.com\u002F2025\u002F08\u002Feec20d77925c4c97a17c74757b6a45fc\u002FAA1JZ8Fm.jpg\" width=\"undefined\" height=\"undefined\" style=\"display: block; margin: auto;\">\u003Cp class=\"ql-align-center\">\u003Cspan class=\"ql-lineHeight-1-75\" style=\"color: rgb(187, 187, 187);\">Figure: Other Large Language Model Rankings on HuggingFace\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">As you can see, using objective benchmarks as the \"Gaokao\" for AI has clear advantages: they are \u003C\u002Fspan>\u003Cstrong style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">objective, efficient, and reproducible\u003C\u002Fstrong>\u003Cspan style=\"font-size: 18px;\" 
class=\"ql-lineHeight-1-75\">.\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">They can also quickly measure a model's \"hard power\" in a specific area.\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cstrong style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">But along with the \"Gaokao\" come the inherent drawbacks of exam-oriented education.\u003C\u002Fstrong>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cstrong style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">Test scores may be inflated by data contamination, where benchmark questions leak into the training data, while the same model turns out to be clueless in practical applications.\u003C\u002Fstrong>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">After all, in our previous large model evaluations, models got even simple financial calculations wrong.\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">Objective benchmarks also struggle to measure a model's \"soft power\".\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">Creativity in writing, emotional intelligence and humor in answers, the beauty of language: these hard-to-quantify qualities, rarely mentioned day to day, determine what a model is actually like to use.\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">Therefore, when a model loudly advertises that it has \"topped\" a certain benchmark, it has become a \"single-subject top scorer\". That is already a remarkable achievement, but it is still far from being a \"versatile 
scholar.\"\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp class=\"ql-align-center\">\u003Cstrong style=\"font-size: 18px; color: rgb(255, 153, 0);\" class=\"ql-lineHeight-1-75\">02 Type Two: Human Preference Arenas, an Anonymous Talent Show\u003C\u002Fstrong>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">As mentioned earlier, objective benchmarks focus on a model's \"hard power\", but they cannot answer the most practical question:\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">How good is a model to actually use?\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">A model might seem to know everything under the sun on the MMLU test, yet be helpless when faced with a simple text editing task;\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">A model might solve algebra and geometry problems on the MATH test in seconds, yet fail to pick up a bit of humor or sarcasm in the user's words.\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">Faced with these dilemmas, researchers from institutions such as the University of California, Berkeley, formed the LMSys.org team and came up with an idea:\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">“Since models ultimately serve people, why not let people judge directly?”\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">This time, the evaluation is no longer based on exams and question sets, 
and the scoring is handed over to users.\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">The result is LMSys Chatbot Arena (now LMArena), a large crowdsourced platform that ranks large language models through \"blind battles\".\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">In each battle, two models answer the same question at the same time, and the user decides which one wins.\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">Users cannot see the \"true identity\" of the two \"contestants\" before voting, which effectively removes preconceived bias.\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">For ordinary users, LMArena is very simple to use:\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">After opening https:\u002F\u002Flmarena.ai\u002F, ask a question; the system randomly selects two different large language models and sends the question to both at the same time.\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cimg alt=\"undefined\" src=\"https:\u002F\u002Fimage.51xinwei.com\u002F2025\u002F08\u002Fb28b6f0601e0459790206122f24c4354\u002FAA1JZb1s.jpg\" width=\"undefined\" height=\"undefined\" style=\"display: block; margin: auto;\">\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">The answers from the two models, labeled anonymously as Assistant A and Assistant B, are displayed side by side, and the user votes for the better answer based on their own 
judgment.\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">After the vote, the system reveals which models Assistant A and Assistant B are, and the vote is added to the global pool of user voting data.\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cimg alt=\"undefined\" src=\"https:\u002F\u002Fimage.51xinwei.com\u002F2025\u002F08\u002Fb560285509cd4831a2dab6a7d0b15b82\u002FAA1JZoMT.jpg\" width=\"undefined\" height=\"undefined\" style=\"display: block; margin: auto;\">\u003Cp class=\"ql-align-center\">\u003Cspan style=\"color: rgb(187, 187, 187);\" class=\"ql-lineHeight-1-75\">Figure: LMArena Text Ability Ranking List\u003C\u002Fspan>\u003C\u002Fp>\u003Cp class=\"ql-align-center\">\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">LMArena has seven leaderboard categories: Text (language ability), WebDev (web development), Vision (image understanding), Text-to-Image (image generation from text prompts), Image Edit (image editing), Search (web search capability), and Copilot (assistant and agent ability).\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">Each leaderboard is generated from user votes, and the core mechanism behind LMArena's rankings is the Elo rating system.\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">The system was originally devised for two-player games such as chess, where it measures the relative strength of players.\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">In large model rankings, each model has an initial score, known as the Elo 
score.\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">When model A defeats model B in a match, model A takes some points from model B.\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">The number of points gained depends on the opponent's strength. Beating a model with a much higher score earns many points; beating a model with a much lower score earns only a few.\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">Conversely, losing to a weaker opponent costs a lot of points.\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">The system is well suited to large volumes of \"1v1\" pairwise comparison data. It measures relative rather than absolute strength, and it keeps the \u003C\u002Fspan>\u003Cstrong style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">leaderboard dynamically updated\u003C\u002Fstrong>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">, which makes the ranking more credible.\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cstrong style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">Although some researchers have pointed out problems with LMArena's rankings, such as private testing privileges and unfair sampling, it remains one of the more authoritative leaderboards for the overall strength of large language models.\u003C\u002Fstrong>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">In an environment saturated with AI news, its advantage lies in \u003C\u002Fspan>\u003Cstrong style=\"font-size: 18px;\" 
class=\"ql-lineHeight-1-75\">eliminating users' preconceived biases\u003C\u002Fstrong>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">.\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">At the same time, the hard-to-quantify qualities mentioned earlier, such as creativity, humor, tone, and writing style, are reflected in the votes, which helps to \u003C\u002Fspan>\u003Cstrong style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">measure subjective quality\u003C\u002Fstrong>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">.\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cstrong style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">However, the simple process and the intuitive \"pick one of two\" format also impose many limitations on arena platforms:\u003C\u002Fstrong>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">First, \u003C\u002Fspan>\u003Cstrong style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">a focus on single-turn conversations\u003C\u002Fstrong>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">: the evaluation is mostly a single question and a single answer, which makes it hard to fully assess tasks that require multi-turn conversations;\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">Second, \u003C\u002Fspan>\u003Cstrong style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">voter bias\u003C\u002Fstrong>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">: this is a statistically unavoidable sampling problem. The voters likely skew towards tech enthusiasts, whose question types and evaluation criteria cannot fully represent ordinary 
users;\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">Third, \u003C\u002Fspan>\u003Cstrong style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">strong subjectivity\u003C\u002Fstrong>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">: users' judgments of \"good\" and \"bad\" are highly subjective, and an Elo score is only an aggregate of those subjective preferences;\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">Fourth, \u003C\u002Fspan>\u003Cstrong style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">lack of factual verification\u003C\u002Fstrong>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">: when comparing two models, users often focus on how well the answers are expressed and neglect whether the content is accurate.\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp class=\"ql-align-center\">\u003Cstrong style=\"font-size: 18px; color: rgb(255, 153, 0);\" class=\"ql-lineHeight-1-75\">03 Which Leaderboard Should We Look At?\u003C\u002Fstrong>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">The \"Wulin Conference\" (martial arts tournament) of the AI world goes beyond the leaderboards mentioned above. 
As the AI field keeps expanding, the battlefield of evaluation itself has become increasingly complex and diverse.\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">Many academic institutions and large AI companies release their own evaluation reports or self-built leaderboards. This reflects technological confidence, but as users we should keep a question mark in mind.\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">Just as football has home and away games, an institution can design the dimensions and questions of an evaluation to play to certain models' strengths and hide their weaknesses.\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">A broader trend is that \u003C\u002Fspan>\u003Cstrong style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">large model leaderboards are moving from \"one size fits all\" to \"fine-grained\"\u003C\u002Fstrong>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">.\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">According to incomplete statistics, 3,755 large models have been released globally so far.\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">In the era of the \"battle of a thousand models\", one long, general-purpose leaderboard obviously cannot meet everyone's needs.\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">Therefore, evaluation is inevitably moving towards \u003C\u002Fspan>\u003Cstrong style=\"font-size: 18px;\" 
class=\"ql-lineHeight-1-75\">specialization and verticalization\u003C\u002Fstrong>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">.\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">So, back to the original question: which is more authoritative?\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cstrong style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">Our view is clear: no single leaderboard is absolutely authoritative.\u003C\u002Fstrong>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cstrong style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">Leaderboards are ultimately references, and frankly, the \"AI arena\" is in the end just a business. We should be wary of models that constantly top the charts, whether they are driven by valuation needs or by PR. Whether it's a horse or a donkey cannot be settled by a single arena.\u003C\u002Fstrong>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cstrong style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">But for ordinary users, there is only one ultimate criterion for evaluating a model: whether it is truly useful to you.\u003C\u002Fstrong>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cstrong style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">Evaluating and selecting a model starts with the application scenario\u003C\u002Fstrong>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">.\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">If you are a programmer, test how well the AI writes code and finds and fixes bugs;\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">If you are a university student, let AI do 
literature reviews and explain academic terms and concepts;\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">If you work in marketing, see whether AI can write compelling copy and come up with creative ideas.\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cstrong style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">Don't let the noise about \"topping the charts\" interfere with your judgment.\u003C\u002Fstrong>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">Large models are tools, not gods. The point of understanding leaderboards is to choose tools better.\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">Rather than blindly trusting leaderboards, give the models a real problem of your own. Whichever model solves it \u003C\u002Fspan>\u003Cstrong style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">most efficiently and effectively\u003C\u002Fstrong>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\"> is your \"private champion\".\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"color: rgb(187, 187, 187);\">【News Source】 Titanium Media APP | Wen Jinduan \u003C\u002Fspan>\u003Ca href=\"https:\u002F\u002Fwww.msn.cn\u002Fzh-cn\u002Fnews\u002Fother\u002Fai%E7%AB%9E%E6%8A%80%E5%9C%BA-%E5%BD%92%E6%A0%B9%E5%88%B0%E5%BA%95%E5%8F%AA%E6%98%AF%E4%B8%80%E9%97%A8%E7%94%9F%E6%84%8F\u002Far-AA1JZi6L?ocid=BingNewsLanding&amp;cvid=2c1b98dc2a524cf6800a32e6f00b1563&amp;ei=10\" rel=\"noopener noreferrer\" target=\"_blank\" style=\"color: rgb(187, 187, 
187);\">http:\u002F\u002Fu5a.cn\u002FWbQ6a\u003C\u002Fa>\u003C\u002Fp>\u003Cp class=\"ql-align-justify\">\u003Cspan style=\"color: rgb(187, 187, 187);\">（This article is republished by the website to provide readers with more information and news. The content does not constitute investment or consumption advice. If there are facts in the article that need verification, please contact the relevant party. The views in the article are not the views of the website and are for reference only.）\u003C\u002Fspan>\u003C\u002Fp>","https:\u002F\u002Fxinwei-dev-test.oss-cn-shenzhen.aliyuncs.com\u002Fintelligent\u002Faudio%3A4847dad5-89eb-47ca-b516-09b899c94166%3A0.wav?Expires=1774838500&OSSAccessKeyId=LTAI5tNvY2RkKjZw4LLWsrPK&Signature=j2dA2dmx7S6Fy0oDHBqSbsAUvEk%3D","4847dad5-89eb-47ca-b516-09b899c94166",17712092]