[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"$fweShN2ErwB5pRzT_SbzNoVDtby_znLxwFTXswrpEkWE":3},{"code":4,"msg":5,"data":6},200,"操作成功",{"id":7,"title":8,"content":9,"digest":10,"source":10,"coverPath":11,"thumbsCoverPath":12,"isTop":13,"isShow":14,"baseClick":13,"clickCount":15,"createTime":16,"typeId":17,"isNewest":18,"newsInfoTypeRespVo":19,"voiceUrl":22,"voiceSize":23,"taskId":24,"releaseTime":25,"titleEn":26,"contentEn":27,"voiceUrlEn":28,"taskIdEn":29,"voiceSizeEn":30},1325,"一文读懂AI大模型之「盾」，全行业283个LLM基准测试都在这了","\u003Cimg alt=\"undefined\" src=\"https:\u002F\u002Fimage.51xinwei.com\u002F2025\u002F09\u002F62d5da90c73b489a984757df6c549c97\u002FAA1LH1VW.jpg\" width=\"547\" height=\"undefined\" style=\"display: block; margin: auto;\">\u003Cp class=\"ql-align-center\">\u003Cspan style=\"color: rgb(187, 187, 187);\" class=\"ql-lineHeight-1-75\">图｜代表性 LLM 基准测试（按时间线）。\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cstrong style=\"font-size: 18px; color: rgb(255, 153, 0);\" class=\"ql-lineHeight-1-75\">大模型技术如矛，基准测试（benchmark）如盾。只有矛愈锋利，盾愈坚固，AI 行业才会不断被推向更高处。\u003C\u002Fstrong>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">如果你身处大模型行业或已经被大模型技术所影响，除了知道有哪些“矛”，还必须了解有哪些“盾”，从而更好地了解人工智能（AI）行业的真正发展现状。\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">日前，\u003C\u002Fspan>\u003Cstrong style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">中国科学院深圳先进技术研究院\u003C\u002Fstrong>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">团队及其合作者\u003C\u002Fspan>\u003Cstrong style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">首次对「LLM 基准测试」的现状与发展进行了系统性回顾\u003C\u002Fstrong>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">，并将\u003C\u002Fspan>\u003Cstrong style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">&nbsp;283 个\u003C\u002Fstrong>\u003Cspan style=\"font-size: 18px;\" 
class=\"ql-lineHeight-1-75\">具有代表性的基准测试分为了三类：&nbsp;\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cul>\u003Cli>\u003Cstrong style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">通用能力\u003C\u002Fstrong>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">（general capabilities）基准测试：涵盖核心语言学、知识和推理等方面的内容；\u003C\u002Fspan>\u003C\u002Fli>\u003C\u002Ful>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cul>\u003Cli>\u003Cstrong style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">领域特定\u003C\u002Fstrong>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">（domain-specific）基准测试：聚焦于自然科学、人文社会科学和工程技术等领域；\u003C\u002Fspan>\u003C\u002Fli>\u003C\u002Ful>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cul>\u003Cli>\u003Cstrong style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">目标特定\u003C\u002Fstrong>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">（target-specific）基准测试：关注风险、可靠性、代理等方面的内容。\u003C\u002Fspan>\u003C\u002Fli>\u003C\u002Ful>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">他们指出，当前基准测试存在因数据污染导致的\u003C\u002Fspan>\u003Cstrong style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">“分数虚高”\u003C\u002Fstrong>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">、因文化和语言偏见导致的\u003C\u002Fspan>\u003Cstrong style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">“不公平评估”\u003C\u002Fstrong>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">，以及缺乏对过程可信度和动态环境的评估等问题，并为未来基准测试创新提供了可参考的设计范式。\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cimg alt=\"undefined\" src=\"https:\u002F\u002Fimage.51xinwei.com\u002F2025\u002F09\u002F20a3b2aeca3b442ca38dc6564bbb3c8f\u002FAA1LHbsq.jpg\" width=\"697\" height=\"undefined\" style=\"display: block; margin: auto;\">\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cstrong class=\"ql-lineHeight-1-75\" style=\"font-size: 18px; color: rgb(255, 153, 
0);\">大模型考试亟待突破\u003C\u002Fstrong>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">自 2017 年 Transformer 架构问世以来，从基础的语言理解与文本生成任务，到复杂的逻辑推理与智能体（Agent）交互，LLM 持续拓展着 AI 的能力边界，进而重塑人机交互模式和信息处理范式。\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">如今，LLM 已广泛渗透到智能客服、内容创作、教育、医疗、法律等领域，成为推动数字经济发展和社会智能化转型的核心力量。然而，\u003C\u002Fspan>\u003Cstrong style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">随着 LLM 技术的快速演进，建立一套科学、系统且全面的评估体系变得尤为迫切\u003C\u002Fstrong>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">。\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">基准测试作为衡量模型性能的量化工具，不仅是评价模型能力的核心手段，更是推动技术迭代与模型优化的重要因素。通过基准测试，研究者能够客观比较不同模型的性能，准确识别技术瓶颈，并为算法优化与结构设计提供数据支持。同时，\u003C\u002Fspan>\u003Cstrong style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">标准化评估结果也有助于增强用户信任，确保模型在安全性、公平性等方面符合社会和道德规范\u003C\u002Fstrong>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">。\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">与早期 LLM 相比，现今 LLM 的参数规模已呈指数级增长，其能力也从单一任务拓展至多任务、多领域。因此，评估内容也从固定任务转变为多任务、多领域，对评估方法的科学性和适应性都提出了更高要求。\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">当前，\u003C\u002Fspan>\u003Cstrong style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">大语言模型（LLM）的评估体系仍存在诸多亟待突破的难题\u003C\u002Fstrong>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">，如下：\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cul>\u003Cli>\u003Cspan style=\"font-size: 18px;\" 
class=\"ql-lineHeight-1-75\">部分模型在训练阶段可能已接触过评测数据，从而导致评估结果存在“数据泄露效应”，难以真实反映模型的泛化能力。\u003C\u002Fspan>\u003C\u002Fli>\u003C\u002Ful>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cul>\u003Cli>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">静态评测方法在很大程度上无法刻画动态真实世界环境的复杂性，亦难以有效预测模型在新任务或新领域下的适应性表现。\u003C\u002Fspan>\u003C\u002Fli>\u003C\u002Ful>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cul>\u003Cli>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">现有评估指标维度相对单一，难以全面揭示 LLM 在推理、理解、生成等多方面的综合能力。\u003C\u002Fspan>\u003C\u002Fli>\u003C\u002Ful>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cul>\u003Cli>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">在偏见检测、安全漏洞识别以及指令合规性等核心环节上，尚缺乏系统性与可扩展的评估框架。\u003C\u002Fspan>\u003C\u002Fli>\u003C\u002Ful>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cul>\u003Cli>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">大规模评估所需的高昂算力与人力成本，也成为限制 LLM 评估体系可持续发展的关键瓶颈。\u003C\u002Fspan>\u003C\u002Fli>\u003C\u002Ful>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cstrong class=\"ql-lineHeight-1-75\" style=\"font-size: 18px; color: rgb(255, 153, 0);\">大模型考题的「发展史」\u003C\u002Fstrong>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">语言能力基准的演化，体现了 LLM 进步与评估方法之间持续不断的“军备竞赛”。这一过程的核心动力在于对“广义语言能力”的探索，即从表层的模式匹配，转向对语法、语义及语用等深层语言理解的考察。\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cstrong style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">2018 年推出的 GLUE 是一个关键进展\u003C\u002Fstrong>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">，它通过将 9 个不同的英语自然语言理解（NLU）任务，如情感分析、文本蕴含，纳入统一框架来应对这一问题。随后，SuperGLUE 引入了更具挑战性的任务，强调复杂推理能力。\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">然而，研究发现，模型倾向于利用数据标注中的人工痕迹。HellaSwag 等基准应运而生。这类任务对人类而言轻而易举，但对模型却具有较高难度，从而更直接地测试常识和脚本知识。在中文方面，CLUE 是首个具有代表性的中文 NLU 基准，而 Xtreme 则扩展至包含 12 个语系、40 
种语言，系统评估了形态变化、词序等不同语言属性下的泛化能力。\u003C\u002Fspan>\u003Cstrong style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">HELM 则引入了“动态基准”概念\u003C\u002Fstrong>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">，通过不断扩展场景来动态整合新兴语言维度。\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cimg alt=\"undefined\" src=\"https:\u002F\u002Fimage.51xinwei.com\u002F2025\u002F09\u002F3bf40073e9914529a75e8ffbe089d6ba\u002FAA1LGXCr.jpg\" width=\"undefined\" height=\"undefined\" style=\"display: block; margin: auto;\">\u003Cp class=\"ql-align-center\">\u003Cspan class=\"ql-lineHeight-1-75\" style=\"color: rgb(187, 187, 187);\">图｜代表性语言核心基准。\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">存储并准确提取大量现实世界信息的能力，是现代 LLM 的基石之一。这些模型如同知识库一般，从海量训练语料中吸收信息。因此，衡量其知识范围与可靠性，成为模型评估的重要维度。这类测试通常模拟“闭卷考试”，要求模型完全依赖其内部参数化知识。\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">知识评估的演化路径，呈现出 LLM 从信息检索工具向内化知识转变的趋势。MMLU 的引入成为开创性的突破，它确立了一个新的、有影响力的范式。\u003C\u002Fspan>\u003Cstrong style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">MMLU-Pro 通过增加选项数量和推理密集型问题的比例\u003C\u002Fstrong>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">，提高了任务的对抗性难度。GPQA 等基准由领域专家设计，旨在实现“防谷歌化”，直接应对模型依赖网络搜索而非内化知识作答的挑战。\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">当前最主要的评估形式为多项选择问答（MCQA），其优势在于具备良好的可扩展性，且可以通过准确率这一核心指标实现客观自动评估。AGIEval 和 GAOKAO-Bench 等基准正是采用这一方式，从高风险的人类考试中精选题目。\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cstrong style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">HELM 和 BIG-Bench 等框架则将知识能力评估纳入更广泛的指标体系中\u003C\u002Fstrong>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">，涵盖鲁棒性、公平性与校准性等维度。为打破英语中心、文本导向的评估范式，业内还提出了如 M3Exam 等多语言基准，以及 GAOKAO-MM、CMMMU 
等以中文为主的多模态知识测试。\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cimg alt=\"undefined\" src=\"https:\u002F\u002Fimage.51xinwei.com\u002F2025\u002F09\u002Fe3f5dce86e834fbcac71aa9bbc52e4c6\u002FAA1LGStN.jpg\" width=\"undefined\" height=\"undefined\" style=\"display: block; margin: auto;\">\u003Cp class=\"ql-align-center\">\u003Cspan class=\"ql-lineHeight-1-75\" style=\"color: rgb(187, 187, 187);\">图｜代表性知识导向基准。\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">尽管知识导向的基准测试在方法和形式上愈发严谨、多元，但仍面临一系列关键挑战。其中最普遍的挑战，是数据污染的隐患。其次，封闭式评估方法本身也存在局限性。\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cstrong style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">推理能力——涵盖形式逻辑、常识推理和应用问题求解——是构建高级智能的关键基础。\u003C\u002Fstrong>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">在 LLM 中评估这一能力，对于理解其认知边界与实际应用潜力至关重要。\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cstrong style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">逻辑推理领域\u003C\u002Fstrong>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">，是 LLM 评估中最为成熟且最密集的方向。整体演进轨迹清晰：从测试离散推理步骤的基础性基准（如 SimpleLogic）开始，逐渐发展到对高度复杂、多步甚至程序化推理的评估（如 LogicPro）。\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cstrong style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">常识推理与专业推理\u003C\u002Fstrong>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">的引入，标志着评估维度的进一步拓展。智能需要的不仅仅是形式逻辑。Corr2Cause 与 CLadder 等基准首次尝试系统评估因果推理，推动模型从相关性走向理解。与此同时，主动推理（AR-Bench）和语言规则归纳（IOLBENCH）类基准的出现代表了一种范式转变，将评估从被动的模式识别转向主动的、具备能动性的问题解决。\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">在\u003C\u002Fspan>\u003Cstrong style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">应用推理和上下文推理\u003C\u002Fstrong>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">上，HotpotQA 
要求模型定位并连接分散的证据以进行多跳推理，而ARC则需要运用科学知识。BIG-Bench Hard 在 23 个多样任务上专注挑战性组合推理，而 LiveBench 的创新之处在于使用实时的、私有用户查询。\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">LLM 推理能力的评估方式已从早期的形式逻辑测验不断演进，发展出更加贴近现实应用的复杂评估体系。\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cimg alt=\"undefined\" src=\"https:\u002F\u002Fimage.51xinwei.com\u002F2025\u002F09\u002Fe70b88ca887f4e229379eb7533a872a5\u002FAA1LHgjG.jpg\" width=\"undefined\" height=\"undefined\" style=\"display: block; margin: auto;\">\u003Cp class=\"ql-align-center\">\u003Cspan class=\"ql-lineHeight-1-75\" style=\"color: rgb(187, 187, 187);\">图｜用于评估 LLM 推理的各种基准的全面概述。\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">将评估视角从通用能力转向专业领域，是测试 LLM 能力边界的另一个关键步骤。\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">作为人类知识体系中逻辑最严密、结构最有组织的领域之一，\u003C\u002Fspan>\u003Cstrong style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">自然科学对 LLM 的知识基础和推理能力提出了巨大挑战。\u003C\u002Fstrong>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">该领域涵盖数学、物理、化学、生物等核心学科，在此类任务中取得成功不仅要求模型具备扎实的通用能力，还需具备强大的抽象推理、符号操作能力，以及追踪复杂因果链的能力。\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cimg alt=\"undefined\" src=\"https:\u002F\u002Fimage.51xinwei.com\u002F2025\u002F09\u002F53bcefd74056409fa5a38c42c0e67597\u002FAA1LHbsQ(1).jpg\" width=\"undefined\" height=\"undefined\" style=\"display: block; margin: auto;\">\u003Cp class=\"ql-align-center\">\u003Cspan class=\"ql-lineHeight-1-75\" style=\"color: rgb(187, 187, 187);\">图｜自然科学领域代表性基准。\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">除了在自然科学中对 LLM 进行理性层面的能力评估，\u003C\u002Fspan>\u003Cstrong style=\"font-size: 18px;\" 
class=\"ql-lineHeight-1-75\">其拟人化的对话特性还使其与人类交流更加自然、高效，从而增强了交互式应用的潜力。\u003C\u002Fstrong>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cstrong style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">社会科学作为最以人为中心的领域之一\u003C\u002Fstrong>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">，在这一背景下显得尤为重要。一个关键问题是，LLM 能否在法律、知识产权、教育、心理学和金融等领域有效应对现实世界的挑战。所有人文与社会科学领域都高度适用于现实场景。其中的最大挑战之一，是如何科学评估 LLM 在这些领域的知识水平，这涉及定义合适的任务、构建相关的数据集，以及选择适当的评估方法。\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cimg alt=\"undefined\" src=\"https:\u002F\u002Fimage.51xinwei.com\u002F2025\u002F09\u002F88b4054192fc40c6b074d802fede67b6\u002FAA1LGXCZ.jpg\" width=\"undefined\" height=\"undefined\" style=\"display: block; margin: auto;\">\u003Cp class=\"ql-align-center\">\u003Cspan class=\"ql-lineHeight-1-75\" style=\"color: rgb(187, 187, 187);\">图｜人文和社会科学代表性基准。\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cstrong style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">工程与技术领域是 LLM 的另一座试炼场，\u003C\u002Fstrong>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">它测试模型在任务中的能力，这些任务不仅要求语言流畅，还要求逻辑严谨、功能正确，以及具有深厚的专业知识。不同于通用任务，工程应用往往存在唯一正确答案，或仅有一小部分在严格的物理定律、数学原理或语法规则下成立的合理解。在这一领域中，成功的模型需要能够像真正的工具一样运作，而不仅仅是提供语言交互。因此，工程与技术方向也产生了一系列最为复杂且成熟的评估框架。\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cimg alt=\"undefined\" src=\"https:\u002F\u002Fimage.51xinwei.com\u002F2025\u002F09\u002Ffc8907dd09524041a9ebce1a0bfd5e52\u002FAA1LHgk5.jpg\" width=\"undefined\" height=\"undefined\" style=\"display: block; margin: auto;\">\u003Cp class=\"ql-align-center\">\u003Cspan style=\"color: rgb(187, 187, 187);\" class=\"ql-lineHeight-1-75\">图｜工程与技术领域代表性基准。\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cstrong class=\"ql-lineHeight-1-75\" style=\"font-size: 18px; color: rgb(255, 153, 0);\">未来评估：更安全、更全面\u003C\u002Fstrong>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" 
class=\"ql-lineHeight-1-75\">随着 LLM 模型从研究原型走向现实部署，尤其是应用于医疗咨询、法律推理、金融顾问或客户支持等高风险场景中，也同步催生了一些显著的风险，如幻觉生成、偏见输出、对抗性脆弱性以及隐私泄露等问题。这些风险已不再停留在理论层面，而是对用户、组织乃至整个社会产生了切实影响。\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">因此，风险与可靠性评估已从边缘议题演变为现代 LLM 基准测试体系的核心支柱，其核心动因包括：\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cul>\u003Cli>\u003Cstrong style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">识别与量化\u003C\u002Fstrong>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">：系统性地探测 LLM 的各种负面影响模式（如生成有害内容、虚构事实 、泄露私人数据），并量化这些风险的发生频率与严重程度。这需要在多样化且具有挑战性的输入下进行测试，包括极端情况、对抗性提示和边缘案例（如越狱尝试 、带偏见的提示 、高事实密度的查询）。\u003C\u002Fspan>\u003C\u002Fli>\u003C\u002Ful>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cul>\u003Cli>\u003Cstrong style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">风险缓解\u003C\u002Fstrong>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">：利用基准测试揭示的弱点推动开发者进行技术改进（如更鲁棒的 RLHF、事实性增强、隐私保护训练），并为部署方提供更有效的防护措施（如内容过滤、使用政策）。最终目标是尽量降低模型出错或造成伤害的可能性。\u003C\u002Fspan>\u003C\u002Fli>\u003C\u002Ful>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cul>\u003Cli>\u003Cstrong style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">符合期望\u003C\u002Fstrong>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">：检验模型在复杂的现实交互中，是否能够遵守既定的伦理规范、法律边界与安全标准（即对齐问题），特别是在涉及敏感话题时展现出足够的鲁棒性。\u003C\u002Fspan>\u003C\u002Fli>\u003C\u002Ful>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cul>\u003Cli>\u003Cstrong style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">构建与维持信任\u003C\u002Fstrong>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">：通过提供严谨、可复现的风险评估证据，向用户、监管机构和社会传达某一 LLM 的安全性与可信度，从而推动生态系统健康发展，实现负责任的广泛应用。\u003C\u002Fspan>\u003C\u002Fli>\u003C\u002Ful>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">本质上，该研究方向所关注的核心问题是：在具备令人惊叹的能力之外，模型是否足够安全、可靠，且值得信赖？它旨在为模型的责任担保提供实证基础，作为 LLM 
从实验室走向现实世界的关键“安全检查点”。\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">LLM Agent 是基于基础 LLM 构建的自主系统，旨在超越静态的提示-响应交互，并参与以目标为导向的行为。通过整合规划模块、工具使用能力、记忆系统和观察循环等组件，这些 Agent 能够将复杂目标分解为可执行的步骤，与外部环境进行动态交互，并不断迭代调整其策略直至任务完成。\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">随着 LLM Agent 在现实场景中的应用日益增加，构建系统化、全面性的评估框架变得尤为重要，评估框架主要包括以下四个维度：\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cul>\u003Cli>\u003Cstrong style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">特定能力评估\u003C\u002Fstrong>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">，侧重于对单一功能（如规划、推理、博弈）以及执行能力（如工具使用、外部控制）的细粒度评估。\u003C\u002Fspan>\u003C\u002Fli>\u003C\u002Ful>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cul>\u003Cli>\u003Cstrong style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">综合能力评估\u003C\u002Fstrong>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">，强调在解决复杂任务过程中多种能力的协调与协同。\u003C\u002Fspan>\u003C\u002Fli>\u003C\u002Ful>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cul>\u003Cli>\u003Cstrong style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">领域专业性评估\u003C\u002Fstrong>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">，侧重于评估在特定专业领域中应用专门知识并完成任务的有效性。\u003C\u002Fspan>\u003C\u002Fli>\u003C\u002Ful>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cul>\u003Cli>\u003Cstrong style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">安全与风险评估\u003C\u002Fstrong>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">，关注 Agent 在对抗性或不安全场景中的韧性、脆弱性及防护机制。\u003C\u002Fspan>\u003C\u002Fli>\u003C\u002Ful>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">总而言之，要让模型真正融入社会技术系统，评估的重点必须\u003C\u002Fspan>\u003Cstrong style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">从“模型能做什么”转向“模型应如何负责任地表现”\u003C\u002Fstrong>\u003Cspan style=\"font-size: 18px;\" 
class=\"ql-lineHeight-1-75\">。\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">未来的基准需要具备动态性（以匹配模型演进）、因果性（用于解释结果）、包容性（避免偏见）以及鲁棒性（预判风险）。\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">实现这一目标，需要跨学科力量的深度协作，在保持技术科学性的同时，也确保与社会价值体系的高度一致。\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp class=\"ql-align-justify\">\u003Cspan style=\"color: rgb(187, 187, 187);\">【新闻来源】学术头条 \u003C\u002Fspan>\u003Ca href=\"https:\u002F\u002Fwww.msn.cn\u002Fzh-cn\u002Fnews\u002Fother\u002F%E4%B8%80%E6%96%87%E8%AF%BB%E6%87%82ai%E5%A4%A7%E6%A8%A1%E5%9E%8B%E4%B9%8B-%E7%9B%BE-%E5%85%A8%E8%A1%8C%E4%B8%9A283%E4%B8%AAllm%E5%9F%BA%E5%87%86%E6%B5%8B%E8%AF%95%E9%83%BD%E5%9C%A8%E8%BF%99%E4%BA%86\u002Far-AA1LGXE2?ocid=msedgdhphdr&amp;cvid=08eea14c1cfa46498156db5aea161a51&amp;ei=23\" rel=\"noopener noreferrer\" target=\"_blank\" style=\"color: rgb(187, 187, 187);\">http:\u002F\u002Fu5a.cn\u002FnTORR\u003C\u002Fa>\u003C\u002Fp>\u003Cp class=\"ql-align-justify\">\u003Cspan style=\"color: rgb(187, 187, 187);\">（本网转发此文章，旨在为读者提供更多的信息资讯，所涉内容不构成投资、消费建议。文章事实如有疑问，请与有关方核实，文章观点非本网观点，仅供读者参考。）\u003C\u002Fspan>\u003C\u002Fp>","","https:\u002F\u002Fimage.51xinwei.com\u002F2025\u002F09\u002Fe29f5d13459e412b8412ddb7873c09d0\u002FAI领域.jpg","https:\u002F\u002Fimage.51xinwei.com\u002F2025\u002F09\u002Fthumbs\u002Fe29f5d13459e412b8412ddb7873c09d0\u002FAI领域.jpg",0,1,50,"2025-09-04 18:15",2,false,{"id":17,"name":20,"enName":21},"芯位视野","Xinwei 
Vision","https:\u002F\u002Fxinwei-dev-test.oss-cn-shenzhen.aliyuncs.com\u002Fintelligent\u002Faudio%3A47f82f1c-02d8-444f-93b7-fb39af8e306c%3A0.wav?Expires=1756986827&OSSAccessKeyId=LTAI5tNvY2RkKjZw4LLWsrPK&Signature=%2BbDcFm%2F2xkUxAC0hNbyPb9DBtA8%3D",23356290,"47f82f1c-02d8-444f-93b7-fb39af8e306c","2025-09-04 18:07","Understand the \"Shield\" of AI Large Models in One Article, All 283 LLM Benchmark Tests in the Entire Industry Are Here","\u003Cimg alt=\"undefined\" src=\"https:\u002F\u002Fimage.51xinwei.com\u002F2025\u002F09\u002F62d5da90c73b489a984757df6c549c97\u002FAA1LH1VW.jpg\" width=\"547\" height=\"undefined\" style=\"display: block; margin: auto;\">\u003Cp class=\"ql-align-center\">\u003Cspan style=\"color: rgb(187, 187, 187);\" class=\"ql-lineHeight-1-75\">Figure | Representative LLM Benchmark Tests (by timeline).\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cstrong style=\"font-size: 18px; color: rgb(255, 153, 0);\" class=\"ql-lineHeight-1-75\">Large models are like a spear, and benchmark tests are like a shield. 
Only when the spear becomes sharper and the shield stronger will the AI industry continue to be pushed higher.\u003C\u002Fstrong>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">If you work in the large model industry or have been affected by large model technology, knowing the 'spears' is not enough; you must also understand the 'shields' in order to grasp the true state of the artificial intelligence (AI) industry.\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">Recently, a team from \u003C\u002Fspan>\u003Cstrong style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">the Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences\u003C\u002Fstrong>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\"> and its collaborators \u003C\u002Fspan>\u003Cstrong style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">conducted the first systematic review of the current status and development of \"LLM benchmarks\"\u003C\u002Fstrong>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">, and grouped\u003C\u002Fspan>\u003Cstrong style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">&nbsp;283\u003C\u002Fstrong>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\"> representative benchmarks into three categories:\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cul>\u003Cli>\u003Cstrong style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">General capabilities\u003C\u002Fstrong>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\"> benchmarks: covering core linguistic ability, knowledge, and reasoning;\u003C\u002Fspan>\u003C\u002Fli>\u003C\u002Ful>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cul>\u003Cli>\u003Cstrong style=\"font-size: 18px;\" 
class=\"ql-lineHeight-1-75\">Domain-specific\u003C\u002Fstrong>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\"> benchmarks: focusing on the natural sciences, the humanities and social sciences, and engineering and technology;\u003C\u002Fspan>\u003C\u002Fli>\u003C\u002Ful>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cul>\u003Cli>\u003Cstrong style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">Target-specific\u003C\u002Fstrong>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\"> benchmarks: focusing on risk, reliability, and agents.\u003C\u002Fspan>\u003C\u002Fli>\u003C\u002Ful>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">They pointed out that current benchmarks suffer from \"inflated scores\" caused by data contamination, \"unfair evaluation\" caused by cultural and linguistic bias, and a lack of assessment of process credibility and dynamic environments, and they offered design paradigms that future benchmarks can draw on.\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cimg alt=\"undefined\" src=\"https:\u002F\u002Fimage.51xinwei.com\u002F2025\u002F09\u002F20a3b2aeca3b442ca38dc6564bbb3c8f\u002FAA1LHbsq.jpg\" width=\"697\" height=\"undefined\" style=\"display: block; margin: auto;\">\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cstrong class=\"ql-lineHeight-1-75\" style=\"font-size: 18px; color: rgb(255, 153, 0);\">The Urgent Need for Breakthroughs in Large Model Examinations\u003C\u002Fstrong>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">Since the introduction of the Transformer architecture in 2017, LLMs have continuously expanded the boundaries of AI capabilities, from basic language understanding and text generation to complex logical reasoning and agent interaction, thereby reshaping human-computer interaction patterns and information 
processing paradigms.\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">Today, LLMs are widely used in intelligent customer service, content creation, education, healthcare, law, and other fields, and have become a core force driving the digital economy and the intelligent transformation of society. However, \u003C\u002Fspan>\u003Cstrong style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">with the rapid evolution of LLM technology, establishing a scientific, systematic, and comprehensive evaluation system has become particularly urgent\u003C\u002Fstrong>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">.\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">Benchmarks, as quantitative tools for measuring model performance, are not only the core means of evaluating model capabilities but also an important driver of technological iteration and model optimization. With benchmarks, researchers can objectively compare the performance of different models, accurately identify technical bottlenecks, and obtain data to support algorithm optimization and architecture design. At the same time, \u003C\u002Fspan>\u003Cstrong style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">standardized evaluation results help build user trust and ensure that models meet social and ethical norms in terms of safety and fairness\u003C\u002Fstrong>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">.\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">Compared with early LLMs, the parameter scale of current LLMs has grown exponentially, and their capabilities have expanded from single-task to multi-task and multi-domain. 
Accordingly, evaluation has shifted from fixed tasks to multi-task, multi-domain settings, placing higher demands on the rigor and adaptability of evaluation methods.\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">Currently, \u003C\u002Fspan>\u003Cstrong style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">the evaluation system for large language models (LLMs) still faces many problems that urgently need to be solved\u003C\u002Fstrong>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">, as follows:\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cul>\u003Cli>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">Some models may have been exposed to evaluation data during training, producing a \"data leakage effect\" in which results no longer reflect the model's true generalization ability.\u003C\u002Fspan>\u003C\u002Fli>\u003C\u002Ful>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cul>\u003Cli>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">Static evaluation methods largely fail to capture the complexity of dynamic real-world environments, and they predict poorly how well a model will adapt to new tasks or domains.\u003C\u002Fspan>\u003C\u002Fli>\u003C\u002Ful>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cul>\u003Cli>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">Existing evaluation metrics cover relatively few dimensions, making it difficult to fully reveal the combined capabilities of LLMs in reasoning, understanding, and generation.\u003C\u002Fspan>\u003C\u002Fli>\u003C\u002Ful>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cul>\u003Cli>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">In key areas such as bias detection, security vulnerability identification, and instruction compliance, there is still a lack of systematic and scalable evaluation 
frameworks.\u003C\u002Fspan>\u003C\u002Fli>\u003C\u002Ful>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cul>\u003Cli>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">The high compute and labor costs of large-scale evaluation have become a key bottleneck limiting the sustainable development of the LLM evaluation system.\u003C\u002Fspan>\u003C\u002Fli>\u003C\u002Ful>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cstrong class=\"ql-lineHeight-1-75\" style=\"font-size: 18px; color: rgb(255, 153, 0);\">The Development History of Large Model Exam Questions\u003C\u002Fstrong>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">The evolution of language ability benchmarks reflects the ongoing \"arms race\" between LLM progress and evaluation methods. The core driving force behind this process is the pursuit of \"general language ability,\" shifting from surface-level pattern matching to the assessment of deeper linguistic understanding involving grammar, semantics, and pragmatics.\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cstrong style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">GLUE, introduced in 2018, was a key advancement\u003C\u002Fstrong>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">, which addressed this issue by incorporating nine different English natural language understanding (NLU) tasks, such as sentiment analysis and textual entailment, into a unified framework. Subsequently, SuperGLUE introduced more challenging tasks, emphasizing complex reasoning abilities.\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">However, research found that models tend to exploit artifacts left by human annotation in the data. Benchmarks such as HellaSwag emerged in response. 
These tasks are easy for humans but difficult for models, thus testing common sense and script knowledge more directly. For Chinese, CLUE is the first representative NLU benchmark, while XTREME extends coverage to 40 languages across 12 language families, systematically evaluating generalization across linguistic properties such as morphology and word order. \u003C\u002Fspan>\u003Cstrong style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">HELM introduced the concept of \"dynamic benchmarks\"\u003C\u002Fstrong>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">, integrating emerging language dimensions by continuously expanding its scenarios.\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cimg alt=\"undefined\" src=\"https:\u002F\u002Fimage.51xinwei.com\u002F2025\u002F09\u002F3bf40073e9914529a75e8ffbe089d6ba\u002FAA1LGXCr.jpg\" width=\"undefined\" height=\"undefined\" style=\"display: block; margin: auto;\">\u003Cp class=\"ql-align-center\">\u003Cspan class=\"ql-lineHeight-1-75\" style=\"color: rgb(187, 187, 187);\">Figure | Representative Language Core Benchmarks.\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">The ability to store and accurately retrieve a large amount of real-world information is one of the foundations of modern LLMs. These models function like knowledge bases, absorbing information from massive training corpora. Therefore, measuring the scope and reliability of their knowledge has become an important dimension of model evaluation. 
These tests typically simulate a \"closed-book exam,\" requiring the model to rely entirely on its internal parametric knowledge.\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">The evolution of knowledge evaluation reflects a shift in how LLMs are viewed, from information retrieval tools to systems with internalized knowledge. The introduction of MMLU marked a breakthrough, establishing an influential new paradigm. \u003C\u002Fspan>\u003Cstrong style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">MMLU-Pro raises task difficulty by increasing the number of options and the proportion of reasoning-intensive questions\u003C\u002Fstrong>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">, while GPQA and other benchmarks designed by domain experts aim to be \"Google-proof,\" directly addressing the problem of questions being answerable through web search rather than internalized knowledge.\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">The most common form of evaluation today is multiple-choice question answering (MCQA), which scales well and can be scored objectively and automatically, with accuracy as the core metric. AGIEval and GAOKAO-Bench take this approach, drawing questions from high-stakes human exams.\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cstrong style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">Frameworks such as HELM and BIG-Bench have incorporated knowledge capability evaluation into a broader set of metrics\u003C\u002Fstrong>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">, covering dimensions such as robustness, fairness, and calibration. 
To break the English-centric, text-oriented evaluation paradigm, the industry has also proposed multilingual benchmarks such as M3Exam, as well as multimodal knowledge tests mainly in Chinese, such as GAOKAO-MM and CMMMU.\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cimg alt=\"undefined\" src=\"https:\u002F\u002Fimage.51xinwei.com\u002F2025\u002F09\u002Fe3f5dce86e834fbcac71aa9bbc52e4c6\u002FAA1LGStN.jpg\" width=\"undefined\" height=\"undefined\" style=\"display: block; margin: auto;\">\u003Cp class=\"ql-align-center\">\u003Cspan class=\"ql-lineHeight-1-75\" style=\"color: rgb(187, 187, 187);\">Figure | Representative Knowledge-Oriented Benchmarks.\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">Although knowledge-oriented benchmarks have become increasingly rigorous and diverse in methodology and form, they still face a series of key challenges. The most common is the risk of data contamination. In addition, closed-ended evaluation methods themselves have limitations.\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cstrong style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">Reasoning ability, covering formal logic, common-sense reasoning, and applied problem-solving, is a key foundation for building advanced intelligence. \u003C\u002Fstrong>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">Evaluating this ability in LLMs is crucial for understanding their cognitive boundaries and practical application potential.\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cstrong style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">Logical reasoning\u003C\u002Fstrong>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\"> is the most mature and intensively studied direction in LLM evaluation. 
The overall trajectory is clear: evaluation started with basic benchmarks that test discrete reasoning steps (such as SimpleLogic) and has gradually moved toward highly complex, multi-step, or even programmatic reasoning (such as LogicPro).\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cstrong style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">Common-sense reasoning and professional reasoning\u003C\u002Fstrong>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\"> further expand the dimensions of evaluation, since intelligence requires more than formal logic. Benchmarks such as Corr2Cause and CLadder were the first to attempt systematic evaluation of causal reasoning, pushing models from recognizing correlation toward understanding causation. At the same time, benchmarks such as AR-Bench (active reasoning) and IOLBENCH (linguistic rule induction) represent a paradigm shift, moving evaluation from passive pattern recognition to active problem solving.\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">In \u003C\u002Fspan>\u003Cstrong style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">applied reasoning and contextual reasoning\u003C\u002Fstrong>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">, HotpotQA requires the model to locate and connect scattered evidence for multi-hop reasoning, while ARC requires it to apply scientific knowledge. 
BIG-Bench Hard focuses on challenging compositional reasoning across 23 diverse tasks, and LiveBench's innovation lies in using real-time, private user queries.\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">Evaluation methods for LLM reasoning have evolved from early formal logic tests to more complex evaluation systems that are closer to real-world applications.\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cimg alt=\"undefined\" src=\"https:\u002F\u002Fimage.51xinwei.com\u002F2025\u002F09\u002Fe70b88ca887f4e229379eb7533a872a5\u002FAA1LHgjG.jpg\" width=\"undefined\" height=\"undefined\" style=\"display: block; margin: auto;\">\u003Cp class=\"ql-align-center\">\u003Cspan class=\"ql-lineHeight-1-75\" style=\"color: rgb(187, 187, 187);\">Figure | Comprehensive overview of various benchmarks used to evaluate LLM reasoning.\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">Shifting the evaluation perspective from general capabilities to specialized fields is another key step in testing the boundaries of LLM capabilities.\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">As one of the most logically rigorous and structurally organized fields in the human knowledge system, \u003C\u002Fspan>\u003Cstrong style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">the natural sciences pose significant challenges to the knowledge base and reasoning capabilities of LLMs. \u003C\u002Fstrong>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">This field covers core disciplines such as mathematics, physics, chemistry, and biology. 
Success in these tasks requires not only solid general capabilities but also strong abstract reasoning, symbolic manipulation, and the ability to track complex causal chains.\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cimg alt=\"undefined\" src=\"https:\u002F\u002Fimage.51xinwei.com\u002F2025\u002F09\u002F53bcefd74056409fa5a38c42c0e67597\u002FAA1LHbsQ(1).jpg\" width=\"undefined\" height=\"undefined\" style=\"display: block; margin: auto;\">\u003Cp class=\"ql-align-center\">\u003Cspan class=\"ql-lineHeight-1-75\" style=\"color: rgb(187, 187, 187);\">Figure | Representative benchmarks in the field of natural sciences.\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">Beyond the rational capabilities that LLMs show in the natural sciences, \u003C\u002Fspan>\u003Cstrong style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">their human-like conversational characteristics make communication with people more natural and efficient, enhancing their potential for interactive applications.\u003C\u002Fstrong>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cstrong style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">As one of the most human-centered fields, \u003C\u002Fstrong>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">the social sciences are particularly important in this context. A key question is whether LLMs can effectively address real-world challenges in areas such as law, intellectual property, education, psychology, and finance. All humanities and social science fields are closely tied to real-world scenarios. 
One of the greatest challenges is how to scientifically evaluate the knowledge level of LLMs in these fields, which involves defining appropriate tasks, constructing relevant datasets, and selecting suitable evaluation methods.\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cimg alt=\"undefined\" src=\"https:\u002F\u002Fimage.51xinwei.com\u002F2025\u002F09\u002F88b4054192fc40c6b074d802fede67b6\u002FAA1LGXCZ.jpg\" width=\"undefined\" height=\"undefined\" style=\"display: block; margin: auto;\">\u003Cp class=\"ql-align-center\">\u003Cspan class=\"ql-lineHeight-1-75\" style=\"color: rgb(187, 187, 187);\">Figure | Representative benchmarks in humanities and social sciences.\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cstrong style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">The engineering and technology field is another testing ground for LLMs, \u003C\u002Fstrong>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">probing the model's capabilities in tasks that require not only fluent language but also rigorous logic, correct functionality, and deep professional knowledge. Unlike general tasks, engineering applications often have a single correct answer or a small number of reasonable solutions under strict physical laws, mathematical principles, or syntax rules. In this field, successful models need to operate like real tools, not just provide language interaction. 
Therefore, the engineering and technology field has produced some of the most complex and mature evaluation frameworks.\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cimg alt=\"undefined\" src=\"https:\u002F\u002Fimage.51xinwei.com\u002F2025\u002F09\u002Ffc8907dd09524041a9ebce1a0bfd5e52\u002FAA1LHgk5.jpg\" width=\"undefined\" height=\"undefined\" style=\"display: block; margin: auto;\">\u003Cp class=\"ql-align-center\">\u003Cspan style=\"color: rgb(187, 187, 187);\" class=\"ql-lineHeight-1-75\">Figure | Representative benchmarks in engineering and technology fields.\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cstrong class=\"ql-lineHeight-1-75\" style=\"font-size: 18px; color: rgb(255, 153, 0);\">Future Evaluation: Safer and More Comprehensive\u003C\u002Fstrong>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">As LLMs move from research prototypes to real-world deployment, especially in high-stakes scenarios such as medical consultation, legal reasoning, financial advisory, or customer support, they also introduce significant risks, such as hallucination, biased output, adversarial vulnerability, and privacy leaks. 
These risks are no longer theoretical but have had tangible impacts on users, organizations, and even society as a whole.\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">Therefore, risk and reliability evaluation has evolved from a marginal issue into a core pillar of modern LLM benchmark testing systems, with the following core motivations:\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cul>\u003Cli>\u003Cstrong style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">Identification and Quantification\u003C\u002Fstrong>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">: Systematically detect various negative impact patterns of LLMs (such as generating harmful content, fabricating facts, leaking private data), and quantify the frequency and severity of these risks. This requires testing under diverse and challenging inputs, including extreme cases, adversarial prompts, and edge cases (such as jailbreak attempts, biased prompts, high-fact-density queries).\u003C\u002Fspan>\u003C\u002Fli>\u003C\u002Ful>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cul>\u003Cli>\u003Cstrong style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">Risk Mitigation\u003C\u002Fstrong>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">: Use weaknesses revealed by benchmark tests to drive developers to improve technology (such as more robust RLHF, fact-enhanced training, and privacy protection training), and provide deployers with more effective protective measures (such as content filtering and usage policies). 
The ultimate goal is to minimize the possibility of the model making mistakes or causing harm.\u003C\u002Fspan>\u003C\u002Fli>\u003C\u002Ful>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cul>\u003Cli>\u003Cstrong style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">Alignment with Expectations\u003C\u002Fstrong>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">: Test whether the model complies with established ethical norms, legal boundaries, and safety standards (i.e., alignment issues) in complex real-world interactions, and in particular whether it remains sufficiently robust when dealing with sensitive topics.\u003C\u002Fspan>\u003C\u002Fli>\u003C\u002Ful>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cul>\u003Cli>\u003Cstrong style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">Building and Maintaining Trust\u003C\u002Fstrong>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">: Provide rigorous and reproducible risk assessment evidence to convey the safety and trustworthiness of a particular LLM to users, regulatory bodies, and society, thereby promoting the healthy development of the ecosystem and responsible, widespread adoption.\u003C\u002Fspan>\u003C\u002Fli>\u003C\u002Ful>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">Essentially, this research direction focuses on one core question: beyond its impressive capabilities, is the model safe, reliable, and trustworthy? It aims to provide empirical evidence for model accountability, serving as a critical \"safety checkpoint\" for LLMs moving from laboratories to the real world.\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">An LLM Agent is an autonomous system built on a base LLM, designed to go beyond static prompt-response interactions and engage in goal-oriented behaviors. 
By integrating components such as planning modules, tool usage capabilities, memory systems, and observation loops, these Agents can break down complex goals into executable steps, interact dynamically with the external environment, and continuously iterate and adjust their strategies until the task is completed.\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">As the application of LLM Agents in real-world scenarios increases, building a systematic and comprehensive evaluation framework has become particularly important. The evaluation framework mainly includes the following four dimensions:\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cul>\u003Cli>\u003Cstrong style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">Specific Ability Evaluation\u003C\u002Fstrong>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">, focusing on fine-grained evaluation of single functions (such as planning, reasoning, game-playing) and execution capabilities (such as tool usage, external control).\u003C\u002Fspan>\u003C\u002Fli>\u003C\u002Ful>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cul>\u003Cli>\u003Cstrong style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">Comprehensive Ability Evaluation\u003C\u002Fstrong>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">, emphasizing the coordination and collaboration of multiple abilities in solving complex tasks.\u003C\u002Fspan>\u003C\u002Fli>\u003C\u002Ful>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cul>\u003Cli>\u003Cstrong style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">Domain Specificity Evaluation\u003C\u002Fstrong>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">, focusing on evaluating the effectiveness of applying specialized knowledge in specific professional fields to complete 
tasks.\u003C\u002Fspan>\u003C\u002Fli>\u003C\u002Ful>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cul>\u003Cli>\u003Cstrong style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">Safety and Risk Evaluation\u003C\u002Fstrong>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">, focusing on the resilience, vulnerability, and protective mechanisms of Agents in adversarial or unsafe scenarios.\u003C\u002Fspan>\u003C\u002Fli>\u003C\u002Ful>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">In summary, to allow models to truly integrate into the socio-technical system, the focus of evaluation must shift from \"what the model can do\" to \"how the model should responsibly perform.\"\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">Future benchmarks need to be dynamic (to match model evolution), causal (to explain results), inclusive (to avoid bias), and robust (to anticipate risks).\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">To achieve this goal, it requires in-depth collaboration across disciplines, maintaining technological scientificity while ensuring high consistency with the social value system.\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp class=\"ql-align-justify\">\u003Cspan style=\"color: rgb(187, 187, 187);\">[News Source] Academic Headlines \u003C\u002Fspan>\u003Ca 
href=\"https:\u002F\u002Fwww.msn.cn\u002Fzh-cn\u002Fnews\u002Fother\u002F%E4%B8%80%E6%96%87%E8%AF%BB%E6%87%82ai%E5%A4%A7%E6%A8%A1%E5%9E%8B%E4%B9%8B-%E7%9B%BE-%E5%85%A8%E8%A1%8C%E4%B8%9A283%E4%B8%AAllm%E5%9F%BA%E5%87%86%E6%B5%8B%E8%AF%95%E9%83%BD%E5%9C%A8%E8%BF%99%E4%BA%86\u002Far-AA1LGXE2?ocid=msedgdhphdr&amp;cvid=08eea14c1cfa46498156db5aea161a51&amp;ei=23\" rel=\"noopener noreferrer\" target=\"_blank\" style=\"color: rgb(187, 187, 187);\">http:\u002F\u002Fu5a.cn\u002FnTORR\u003C\u002Fa>\u003C\u002Fp>\u003Cp class=\"ql-align-justify\">\u003Cspan style=\"color: rgb(187, 187, 187);\">（This article is reprinted by this site to provide readers with more information and news. The content does not constitute investment or consumer advice. If there are any questions about the facts of the article, please verify with the relevant parties. The views expressed in the article are not the views of this site and are for reference only.）\u003C\u002Fspan>\u003C\u002Fp>","https:\u002F\u002Fxinwei-dev-test.oss-cn-shenzhen.aliyuncs.com\u002Fintelligent\u002Faudio%3A7dd598a8-64a7-472d-8b4d-604876cb4ffa%3A0.wav?Expires=1774838477&OSSAccessKeyId=LTAI5tNvY2RkKjZw4LLWsrPK&Signature=fumG7tMU%2B1C9Ch2K0d87%2B4i2fls%3D","7dd598a8-64a7-472d-8b4d-604876cb4ffa",17526636]