[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"$f6p23OnWXRMugLio2Xwk748rNHOOHsmlccFb7fDkMUj8":3},{"code":4,"msg":5,"data":6},200,"操作成功",{"id":7,"title":8,"content":9,"digest":10,"source":10,"coverPath":11,"thumbsCoverPath":12,"isTop":13,"isShow":14,"baseClick":13,"clickCount":15,"createTime":16,"typeId":17,"isNewest":18,"newsInfoTypeRespVo":19,"voiceUrl":22,"voiceSize":23,"taskId":24,"releaseTime":25,"titleEn":26,"contentEn":27,"voiceUrlEn":28,"taskIdEn":29,"voiceSizeEn":30},1484,"MIT等机构研究：让AI做\"游戏评委\"，发现模型评判能力的意外真相","\u003Cimg alt=\"\" src=\"https:\u002F\u002Fimage.51xinwei.com\u002F2026\u002F03\u002Fhistory\u002F4af127f4e3bf4cab8b6e126e3d6ed11d.png\" width=\"null\" height=\"null\" style=\"display: block; margin: auto;\">\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\">\t这项由美国麻省理工学院的凯瑟琳·柯林斯、剑桥大学等多所顶尖院校研究团队联合开展的突破性研究，发表于2025年的arXiv预印本平台（论文编号：arXiv:2510.10930v1），首次系统性地探索了人工智能系统评价游戏好坏的能力。研究团队包括来自MIT、剑桥大学、纽约大学、哈佛大学、普林斯顿大学和斯坦福大学的顶级研究者，这可能是第一次有人认真思考\"AI能否当个称职的游戏评委\"这个看似简单却意义深远的问题。\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\">\t传统上，我们总是测试AI能否击败人类玩家——从国际象棋到围棋，从扑克到电子游戏，AI在\"玩游戏\"方面的表现已经让人刮目相看。但这次研究团队换了个角度：不问AI能否玩好游戏，而是问AI能否判断一个游戏值不值得玩。这就像从考察一个人的厨艺转向考察他们的美食鉴赏能力——两者需要的技能截然不同。\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\">\t研究的核心发现颇为出人意料：当AI模型在游戏技巧上越来越接近理论最优水平时，它们对游戏的评判反而可能越来越偏离人类的直觉。这个现象提醒我们，在AI越来越强大的今天，如何让它们理解人类的价值观和偏好，可能比让它们在技术上超越人类更加重要。\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\">\t一、游戏评判：比玩游戏更难的挑战\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\">\t要理解这项研究的意义，我们首先需要明白\"评判游戏\"和\"玩游戏\"之间的本质区别。玩游戏时，目标很明确——赢得比赛。但评判游戏时，情况就复杂多了。你需要考虑这个游戏是否公平、是否有趣、是否值得花时间，这些问题没有标准答案。\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 
18px;\">\t研究团队设计了一个巧妙的实验框架。他们创造了121个全新的棋盘游戏，每个都是经典井字棋的变种。有些在更大的棋盘上进行，有些改变了获胜条件，还有些给不同玩家设置了不同的规则。这就像是创造了121种不同的\"厨房烹饪挑战\"，每种都有细微但重要的差别。\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\">\t关键在于，这些游戏都是全新的，AI模型在训练时从未见过，人类志愿者也是第一次接触。这样就确保了测试的公平性——没有人（或AI）有任何先验优势。\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\">\t研究团队提出了两个核心问题来测试AI的评判能力。第一个问题相对客观：这个游戏对双方是否公平？换句话说，先手玩家和后手玩家的获胜机会是否大致相等？这个问题虽然复杂，但原则上可以通过数学计算得出准确答案。\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\">\t第二个问题就主观得多：这个游戏好玩吗？这就像问\"这道菜好吃吗？\"一样，答案很大程度上取决于个人品味。有人喜欢简单明快的游戏，有人偏爱复杂策略，还有人重视游戏的创新性。这种主观评判正是人工智能面临的最大挑战之一。\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\">\t二、实验设计：让AI当评委的严格测试\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\">\t研究团队的实验设计可以说是一场精心编排的\"AI审美能力大赛\"。他们邀请了450多名人类志愿者作为\"金标准\"评委，每个游戏大约有20人进行评判。这些人就像美食节目中的专业评委团，为每个游戏的公平性和趣味性打分。\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\">\t与此同时，研究团队测试了多种不同类型的AI模型。这些模型就像不同背景的评委——有些是\"直觉型\"的，能快速给出判断但缺乏深入分析；有些是\"思考型\"的，会仔细分析每个细节后再下结论。\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\">\t具体来说，研究团队比较了两大类AI模型。第一类是传统的语言模型，它们主要基于在互联网文本上学习到的知识来做判断。这就像一个美食评论家主要基于读过的食谱和餐厅评价来评判一道新菜，而不是真正品尝。\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\">\t第二类是新兴的\"推理模型\"，它们能够进行深入的逐步分析。这些模型会在给出最终判断前，先进行详细的思考过程，就像一个专业评委会仔细分析菜品的色香味、营养搭配、创新程度等各个方面。\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 
18px;\">\t为了确保比较的公平性，研究团队还设置了多个\"基准选手\"。其中包括随机选择的模型（相当于完全外行的评委）、基于启发式规则的\"直觉型玩家\"模型、以及使用先进搜索算法的\"专家型玩家\"模型。最重要的是，对于能够精确计算的游戏，研究团队还计算出了理论上的最优解，作为\"完美评委\"的标准。\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\">\t三、令人意外的发现：越聪明的AI越不懂人心\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\">\t实验结果揭示了几个引人深思的现象。最令人意外的发现是，在游戏公平性判断方面，AI模型表现出了一种\"聪明的悖论\"。\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\">\t当研究团队比较不同AI模型的表现时，发现了一个有趣的倒U型关系。最初，随着AI推理能力的增强，它们对游戏公平性的判断确实越来越接近人类的直觉。这就像学习品酒的新手，随着经验增加，品味越来越接近专业品酒师。\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\">\t但当AI的推理能力继续提升，开始接近理论最优水平时，情况发生了逆转。这些超级智能的AI模型虽然能够计算出游戏的理论最优策略，但它们的判断反而开始偏离普通人的直觉。\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\">\t这个现象的背后原因很有启发性。理论上完美的游戏分析往往会得出与人类直觉相反的结论。比如，一个看起来很公平的游戏，在完美分析下可能先手玩家有微弱优势；而一个看起来偏向某一方的游戏，可能在理论上是完全平衡的。\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\">\t具体而言，OpenAI公司的模型系列完美展示了这种现象。从GPT-4到o1再到o3，随着推理能力的增强，模型与人类判断的一致性先升后降。最新的GPT-5模型虽然在计算游戏理论最优解方面表现出色，但在理解人类玩家真实感受方面却不如早期版本。\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\">\t四、趣味性评判：AI面临的更大挑战\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\">\t如果说评判游戏公平性还有客观标准可循，那么评判游戏是否有趣就完全进入了主观领域。这部分实验揭示了AI理解人类偏好的更多局限性。\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\">\t在趣味性评判方面，不同AI模型的表现变得更加\"参差不齐\"。即使是最先进的推理模型，在判断游戏趣味性时也表现出了明显的不一致性。这就像让不同的AI评委品尝同一道菜，它们给出的分数可能差异很大。\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 
18px;\">\t研究团队通过分析AI模型的推理过程发现了原因。当评判游戏趣味性时，AI需要考虑多个因素：游戏是否平衡、是否具有挑战性、游戏时长是否合适、策略深度如何、是否具有新颖性等等。虽然大部分AI模型能够识别出这些重要因素，但它们在综合这些因素做出最终判断时表现出了很大差异。\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\">\t更有趣的是，研究团队发现AI模型在评判趣味性时使用的\"思考时间\"变化很大，而且这种变化往往无法预测。有些看似简单的游戏会让AI\"苦思冥想\"很久，而有些复杂的游戏AI却能快速给出判断。这种不规律性表明，AI模型在处理主观评判任务时缺乏有效的\"资源分配策略\"。\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\">\t五、深入推理过程：AI是如何\"思考\"的\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\">\t研究团队对部分AI模型的推理过程进行了详细分析，就像解剖一个评委的思维过程。这些分析揭示了AI评判游戏时采用的不同策略。\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\">\t一些AI模型主要依靠\"类比推理\"，它们会将新游戏与已知的经典游戏（如井字棋、五子棋、四子棋等）进行比较。这就像一个美食评委通过与经典菜品对比来评判新菜。这种方法的优点是快速直观，缺点是可能忽略游戏的独特之处。\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\">\t另一些AI模型会进行\"显式模拟\"，实际上在脑海中\"玩\"几轮游戏来感受游戏的特点。这种方法更加深入，但也更耗时，而且模拟的质量直接影响最终判断的准确性。\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\">\t还有一些AI模型试图进行\"数学计算\"，通过分析游戏的数学特性来评判公平性和趣味性。这种方法在评判公平性时相当有效，但在评判趣味性时往往显得过于死板。\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\">\t有趣的是，不同AI模型使用这些策略的频率差异很大。一些模型几乎从不进行实际的游戏模拟，主要依靠类比和数学分析；而另一些模型则经常进行详细的游戏模拟。这种差异反映了不同AI架构和训练方法的影响。\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\">\t六、资源使用的迷思：为什么AI会\"浪费\"计算力\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\">\t研究团队还发现了一个令人困惑的现象：AI模型在评判不同游戏时使用的计算资源（以\"推理令牌\"数量衡量）变化极大，而且这种变化往往缺乏明显的规律。\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 
18px;\">\t直觉上，我们可能认为越复杂的游戏需要AI投入更多的思考时间。但实验数据显示，情况远比这复杂。有些看起来很简单的游戏会让AI使用大量计算资源，而有些明显更复杂的游戏AI却能快速处理。\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\">\t更奇怪的是，AI使用的计算资源多少与其最终判断的准确性之间没有明显关系。有时候AI\"深思熟虑\"后给出的答案反而不如\"快速判断\"的结果准确。这就像一个评委花了很长时间品尝和分析，最后给出的评价反而不如第一口的直觉判断准确。\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\">\t这种现象在评判游戏趣味性时尤为明显。不同AI模型在面对同一个游戏时，使用的计算资源可能相差十倍甚至更多，但它们的最终评判结果可能非常相似。这表明当前的AI模型在\"元推理\"方面还有很大改进空间——它们不知道什么时候应该深思，什么时候应该快速判断。\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\">\t七、人机差异：当完美计算遇上人类直觉\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\">\t整个研究最深刻的洞察可能在于揭示了\"计算完美\"与\"人类直觉\"之间的根本性差异。这种差异在游戏评判的两个维度上都有体现，但表现形式不同。\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\">\t在公平性评判方面，差异主要源于视角不同。AI模型（特别是高级推理模型）倾向于从理论最优的角度分析游戏，它们关注的是在双方都采用完美策略时的游戏结果。而人类玩家的判断更多基于实际游戏体验——在现实中，很少有人能达到理论最优水平。\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\">\t这就像专业汽车评测师和普通消费者评价同一辆车。专业评测师可能会从发动机效率、空气动力学等技术角度给出评价，而普通消费者更关心驾驶感受、舒适性、实用性等日常体验。两种评价都有其价值，但针对的受众和目的不同。\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\">\t在趣味性评判方面，差异更加复杂和微妙。人类对游戏趣味性的判断往往受到情感、文化背景、个人经历等多种因素影响。而AI模型虽然能够识别游戏的各种客观特征（平衡性、复杂度、创新性等），但在综合这些特征形成整体印象时显得力不从心。\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\">\t八、意外的模式：简单游戏的复杂判断\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\">\t研究团队还发现了一些有趣的细节模式。比如，某些看起来很简单的游戏变种实际上会引发AI模型的\"深度思考\"，而一些明显更复杂的游戏反而被快速处理。\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 
18px;\">\t通过分析具体案例，研究团队发现这种现象往往与游戏的\"直觉欺骗性\"有关。有些游戏表面看起来简单，但实际的策略空间很大；有些游戏看起来复杂，但策略相对直接。AI模型似乎能够感知到这种\"表象与实质的差异\"，因此在看似简单但实际复杂的游戏上投入更多计算资源。\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\">\t这种能力本身是令人印象深刻的，表明AI模型具备了某种\"直觉\"来识别问题的真实复杂程度。但问题在于，这种资源分配策略并不总是有效——有时候投入大量计算得到的结果并不比快速判断更准确。\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\">\t九、训练数据的隐藏影响\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\">\t虽然测试游戏都是全新创造的，但研究发现AI模型在评判时仍然受到训练数据的显著影响。不同厂商的模型表现出了相似的偏见模式，暗示它们可能从相似的训练数据中学到了类似的游戏评判\"直觉\"。\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\">\t这种现象特别体现在非推理模型上。这些模型主要依靠在训练中学到的统计模式来做判断，而不是进行实际的逻辑推理。结果是，即使面对全新的游戏，它们的评判仍然带有明显的先入为主色彩。\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\">\t相比之下，推理模型虽然也受训练数据影响，但程度较轻。它们更多依靠推理过程中的逻辑分析，因此能够更好地适应全新的游戏类型。\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\">\t十、对未来AI发展的启示\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\">\t这项研究的意义远超游戏领域。它实际上探讨了一个更根本的问题：随着AI系统变得越来越强大，我们如何确保它们仍然能够理解和服务于人类的需求和价值观？\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\">\t研究结果表明，单纯追求技术性能的提升可能会导致AI系统偏离人类的直觉和偏好。这对AI开发提出了新的挑战：如何在提升AI能力的同时，保持其与人类价值观的一致性？\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\">\t在实际应用中，这个问题变得更加重要。比如，如果我们让AI系统帮助设计教育游戏、娱乐产品或者社交平台，我们希望它们的判断基于人类的真实体验，而不是抽象的理论最优。\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 
18px;\">\t研究团队提出了几个可能的解决方向。首先是开发更好的\"资源理性\"推理系统，让AI能够根据任务的重要性和复杂程度动态分配计算资源。其次是在AI训练中更多地融入人类反馈和偏好数据，确保AI的判断能够反映真实的人类体验。\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\">\t说到底，这项研究提醒我们，AI的\"智能\"不仅仅体现在解决复杂问题的能力上，也体现在理解人类需求和价值观的能力上。在AI技术快速发展的今天，确保AI系统能够真正服务于人类福祉，可能比单纯追求技术指标更加重要。\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\">\t这项研究为我们打开了一扇新的窗户，让我们从全新角度审视AI系统的能力和局限。它告诉我们，评判和选择可能比解决问题更加困难，而理解人类的主观体验可能是AI面临的最大挑战之一。随着AI系统在更多领域发挥作用，这些洞察将变得越来越重要。对于每一个关心AI发展方向的人来说，这项研究都提供了宝贵的思考素材。\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\">\t有兴趣深入了解这项研究的读者，可以通过论文编号arXiv:2510.10930v1在相关学术平台查询完整论文，其中包含了更详细的实验数据和技术细节。\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\">\tQ&amp;A\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\">\tQ1：AI模型评判游戏能力与游戏技巧之间有什么关系？\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\">\tA：研究发现了一个意外现象：AI模型的游戏技巧越接近理论最优水平，它们对游戏的评判反而可能越偏离人类直觉。技术上越完美的AI在理解人类真实游戏体验方面可能表现更差，这揭示了\"计算完美\"与\"人类直觉\"之间的根本差异。\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\">\tQ2：为什么AI在评判游戏趣味性时表现不稳定？\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\">\tA：趣味性评判比公平性评判更加主观和复杂。AI需要综合考虑游戏平衡性、挑战性、策略深度、创新性等多个因素，但在整合这些因素形成最终判断时表现出很大差异。不同AI模型使用的计算资源也变化很大，且资源使用量与判断准确性之间没有明显关系。\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\">\tQ3：这项研究对AI发展有什么实际意义？\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 
18px;\">\tA：研究揭示了AI系统面临的一个重要挑战：如何在提升技术能力的同时保持与人类价值观的一致性。这对AI在教育、娱乐、产品设计等需要理解人类主观体验的领域应用具有重要指导意义，提醒我们不能只追求技术指标，还要确保AI能真正服务于人类需求。\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"color: rgb(136, 136, 136);\">【科技行者】南方都市报  \u003C\u002Fspan>\u003Ca href=\"https:\u002F\u002Fwww.msn.cn\u002Fzh-cn\u002Fnews\u002Fother\u002Far-AA1R8leX?ocid=msedgntphdr&amp;cvid=69265783251e4467a4c4ab98c6ba6166&amp;ei=19\" rel=\"noopener noreferrer\" target=\"_blank\" style=\"color: rgb(136, 136, 136);\">https:\u002F\u002Fmp.weixin.qq.com\u002Fs\u002FRH7anFgRokfIkEqKFt6gTg\u003C\u002Fa>\u003C\u002Fp>\u003Cp class=\"ql-align-justify\">\u003Cspan style=\"color: rgb(136, 136, 136);\">（本网转发此文章，旨在为读者提供更多的信息资讯，所涉内容不构成投资、消费建议。文章事实如有疑问，请与有关方核实，文章观点非本网观点，仅供读者参考。）\u003C\u002Fspan>\u003C\u002Fp>","","https:\u002F\u002Fimage.51xinwei.com\u002F2025\u002F12\u002Fc5571cd805d04e11bd4fb9677f26c4fa\u002FAI领域.jpg","https:\u002F\u002Fimage.51xinwei.com\u002F2025\u002F12\u002Fthumbs\u002Fc5571cd805d04e11bd4fb9677f26c4fa\u002FAI领域.jpg",0,1,48,"2025-12-04 14:30",2,false,{"id":17,"name":20,"enName":21},"芯位视野","Xinwei Vision","https:\u002F\u002Fxinwei-dev-test.oss-cn-shenzhen.aliyuncs.com\u002Fintelligent\u002Faudio%3Acac7e460-9221-4b91-a7e4-932225d5d3d2%3A0.wav?Expires=1764842923&OSSAccessKeyId=LTAI5tNvY2RkKjZw4LLWsrPK&Signature=P9akaxBqx30i8JCOjatBX8fK4nM%3D",28227354,"cac7e460-9221-4b91-a7e4-932225d5d3d2","2025-12-04 14:25","MIT Research: Letting AI Act as \"Game Judges,\" Discovering Unexpected Truths About Model Judgment Capabilities","\u003Cimg alt=\"\" 
src=\"https:\u002F\u002Fimage.51xinwei.com\u002F2026\u002F03\u002Fhistory\u002F4af127f4e3bf4cab8b6e126e3d6ed11d.png\" width=\"null\" height=\"null\" style=\"display: block; margin: auto;\">\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\">\tThis groundbreaking research, conducted by a joint team of top institutions including the Massachusetts Institute of Technology (MIT) and the University of Cambridge, was published on the arXiv preprint platform in 2025 (paper number: arXiv:2510.10930v1). It is the first systematic exploration of artificial intelligence systems' ability to evaluate the quality of games. The research team included top researchers from MIT, the University of Cambridge, New York University, Harvard University, Princeton University, and Stanford University. This might be the first time someone has seriously considered the seemingly simple but profound question: \"Can AI be a competent game judge?\"\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\">\tTraditionally, we have always tested whether AI can beat human players—from chess to Go, from poker to video games. AI's performance in \"playing games\" has already amazed us. But this research team took a different approach: instead of asking whether AI can play games well, they asked whether AI can determine if a game is worth playing. This is like shifting from examining someone's cooking skills to evaluating their culinary taste—two entirely different sets of skills.\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\">\tThe core findings of the study were quite unexpected: as AI models become closer to the theoretical optimal level in gaming skills, their judgments about games may increasingly deviate from human intuition. 
This phenomenon reminds us that, as AI becomes more powerful, getting it to understand human values and preferences may matter more than having it surpass humans technically.\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\">\tI. Game Judging: A More Challenging Task Than Playing Games\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\">\tTo understand the significance of this research, we first need to grasp the fundamental difference between \"judging games\" and \"playing games.\" When playing a game, the goal is clear: winning the match. When judging a game, the situation is much more complex. You need to consider whether the game is fair, whether it is fun, and whether it is worth your time, and none of these questions has a standard answer.\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\">\tThe research team designed a clever experimental framework. They created 121 new board games, each a variant of classic tic-tac-toe. Some were played on larger boards, some changed the winning conditions, and others gave different players different rules. This is like creating 121 different \"kitchen cooking challenges,\" each with subtle but important differences.\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\">\tThe key point is that these games are all new: the AI models never saw them during training, and the human volunteers were encountering them for the first time. This keeps the test fair, since no one (human or AI) has any prior advantage.\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\">\tThe research team posed two core questions to test AI's judgment capabilities. The first question is relatively objective: Is the game fair for both sides? 
In other words, do the first and second players have roughly equal chances of winning? Although this question is complex, it can, in principle, be answered accurately through mathematical calculations.\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\">\tThe second question is much more subjective: Is the game fun? This is like asking \"Is this dish delicious?\" The answer largely depends on personal taste. Some people prefer simple and fast-paced games, while others favor complex strategies, and some value the innovation of the game. This kind of subjective judgment is one of the biggest challenges AI faces.\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\">\tII. Experimental Design: A Strict Test for AI as a Judge\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\">\tThe research team's experimental design could be described as a carefully orchestrated \"AI aesthetic ability competition.\" They invited over 450 human volunteers as the \"gold standard\" judges, with approximately 20 people rating each game. These people are like professional judges in food programs, scoring each game's fairness and entertainment value.\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\">\tAt the same time, the research team tested various types of AI models. These models are like judges with different backgrounds—some are \"intuitive\" ones that can quickly give a judgment but lack deep analysis; others are \"thinking\" ones that analyze every detail before making a conclusion.\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\">\tSpecifically, the research team compared two types of AI models. 
The first type consists of traditional language models, which mainly make judgments based on knowledge learned from internet text. This is like a food critic who judges a new dish based on recipes and restaurant reviews they have read, rather than actually tasting it.\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\">\tThe second type consists of emerging \"reasoning models,\" which can perform in-depth step-by-step analysis. These models work through a detailed thinking process before giving their final judgment, much as a professional judge would carefully analyze the color, aroma, taste, nutritional balance, and innovation of a dish.\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\">\tTo ensure the fairness of the comparison, the research team also set up multiple \"benchmark players.\" These include a model that chooses at random (equivalent to a completely untrained judge), \"intuitive player\" models based on heuristic rules, and \"expert player\" models using advanced search algorithms. Most importantly, for games that can be solved exactly, the research team also computed the theoretical optimal solution, which serves as the standard for a \"perfect judge.\"\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\">\tIII. Surprising Findings: The Smarter AI Gets, the Less It Understands Human Feelings\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\">\tThe experimental results revealed several thought-provoking phenomena. The most surprising finding was an \"intelligence paradox\" in AI model judgments regarding game fairness.\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\">\tWhen the research team compared the performance of different AI models, they found an interesting inverted U-shaped relationship. 
Initially, as the models' reasoning abilities improved, their judgments about game fairness did move closer to human intuition. This is like a beginner learning to taste wine, gradually developing a palate similar to that of a professional wine taster.\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\">\tBut as the models' reasoning abilities continued to improve and approached the theoretical optimal level, the trend reversed. Although these highly capable models could calculate the theoretical optimal strategy for a game, their judgments began to deviate from the intuition of ordinary people.\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\">\tThe underlying reason for this phenomenon is enlightening. Theoretically perfect game analysis often leads to conclusions opposite to human intuition. For example, a game that appears very fair may give the first player a slight advantage under perfect analysis, while a game that seems biased toward one side may be perfectly balanced in theory.\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\">\tSpecifically, the model series from OpenAI clearly demonstrates this phenomenon. From GPT-4 to o1 and then to o3, as reasoning abilities increased, the agreement between model and human judgments first rose and then fell. The latest GPT-5 model, although excellent at calculating the theoretical optimal solution for games, performs worse than earlier versions at understanding how human players actually feel.\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\">\tIV. 
Enjoyment Judgment: A Greater Challenge for AI\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\">\tIf judging game fairness has some objective standards, then judging whether a game is enjoyable enters the subjective realm entirely. This part of the experiment reveals more limitations in AI's understanding of human preferences.\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\">\tIn terms of enjoyment judgment, the performance of different AI models becomes even more \"uneven.\" Even the most advanced reasoning models show significant inconsistency in judging game enjoyment. This is like letting different AI judges taste the same dish, and their scores may vary greatly.\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\">\tThe research team discovered the reason by analyzing the reasoning process of AI models. When judging game enjoyment, AI needs to consider multiple factors: whether the game is balanced, whether it is challenging, whether the duration is appropriate, how deep the strategy is, and whether it is innovative, among others. While most AI models can identify these important factors, they show significant differences in synthesizing these factors to form a final judgment.\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\">\tMore interestingly, the research team found that the \"thinking time\" used by AI models in judging enjoyment varies greatly, and this variation is often unpredictable. Some seemingly simple games cause AI to \"think deeply\" for a long time, while some complex games are judged quickly by AI. 
This irregularity indicates that AI models lack effective \"resource allocation strategies\" when handling subjective judgment tasks.\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\">\tV. Deep Analysis of Reasoning Processes: How AI \"Thinks\"\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\">\tThe research team conducted a detailed analysis of the reasoning processes of some AI models, like dissecting a judge's thought process. These analyses revealed the different strategies AI uses when judging games.\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\">\tSome AI models primarily rely on \"analogical reasoning,\" comparing new games with known classic games (such as tic-tac-toe, Gomoku, and Connect Four). This is like a food critic judging a new dish by comparing it with classic dishes. The advantage of this method is its speed and intuitiveness, but the disadvantage is that it may overlook the unique aspects of the game.\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\">\tOthers perform \"explicit simulations,\" in effect \"playing\" a few rounds of the game in their heads to get a feel for how it plays. This method is more in-depth but also more time-consuming, and the quality of the simulation directly affects the accuracy of the final judgment.\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\">\tStill others attempt to perform \"mathematical calculations,\" evaluating fairness and enjoyment by analyzing the mathematical properties of the game. 
This method is quite effective in judging fairness but often appears too rigid when judging enjoyment.\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\">\tIt is interesting that the frequency with which different AI models use these strategies varies greatly. Some models hardly ever perform actual game simulations, relying mainly on analogy and mathematical analysis; while others frequently conduct detailed game simulations. This difference reflects the impact of different AI architectures and training methods.\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\">\tVI. The Mystery of Resource Usage: Why AI \"Wastes\" Computing Power\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\">\tThe research team also discovered a confusing phenomenon: the amount of computing resources (measured by the number of \"reasoning tokens\") used by AI models to judge different games varies greatly, and this variation often lacks obvious patterns.\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\">\tIntuitively, we might think that more complex games require AI to invest more thinking time. However, the experimental data shows that the situation is far more complex. Some games that appear very simple may cause AI to use a large amount of computing resources, while some clearly more complex games may be quickly processed by AI.\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\">\tEven stranger is that there is no obvious relationship between the amount of computing resources used and the accuracy of the final judgment. 
Sometimes the answer an AI gives after \"deep thinking\" is less accurate than the result of a \"quick judgment.\" This is like a judge spending a long time tasting and analyzing, only to give a final evaluation that is less accurate than their first instinctive impression.\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\">\tThis phenomenon is particularly evident in judging game enjoyment. The computing resources that different AI models spend on the same game may differ by a factor of ten or more, yet their final judgments may be very similar. This suggests that current AI models still have a lot of room for improvement in \"meta-reasoning\": they don't know when to think deeply and when to make quick judgments.\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\">\tVII. Human-Machine Differences: When Perfect Computation Meets Human Intuition\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\">\tThe deepest insight of the entire study may lie in revealing the fundamental difference between \"computational perfection\" and \"human intuition.\" This difference manifests in both dimensions of game judgment, but in different forms.\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\">\tIn fairness judgments, the difference mainly comes from different perspectives. AI models (especially advanced reasoning models) tend to analyze games from the perspective of theoretical optimality, focusing on the game's outcome when both sides use perfect strategies. 
Human players, by contrast, base their judgments more on actual gameplay experience, since in reality few people play at the theoretically optimal level.\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\">\tThis is like the difference between a professional car reviewer and a regular consumer evaluating the same car. Professional reviewers may evaluate a car based on technical aspects such as engine efficiency and aerodynamics, while regular consumers care more about driving experience, comfort, and practicality. Both evaluations have their value, but they serve different audiences and purposes.\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\">\tIn enjoyment judgments, the difference is more complex and subtle. Human judgments of game enjoyment are often influenced by emotion, cultural background, and personal experience. While AI models can identify various objective features of games (balance, complexity, innovation, etc.), they struggle to synthesize these features into an overall impression.\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\">\tVIII. Unexpected Patterns: Complex Judgments for Simple Games\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\">\tThe research team also discovered some interesting finer-grained patterns. For example, certain seemingly simple game variants actually triggered \"deep thinking\" in AI models, while some obviously more complex games were processed quickly.\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\">\tBy analyzing specific cases, the research team found that this phenomenon is often related to the \"intuitive deceptiveness\" of the game. 
Some games appear simple on the surface but have a large strategic space; others look complex but have relatively straightforward strategies. AI models seem to be able to sense this \"difference between appearance and reality,\" thus investing more computing resources in seemingly simple but actually complex games.\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\">\tThis ability itself is impressive, indicating that AI models have some sort of \"intuition\" to recognize the true complexity of problems. However, the problem lies in the fact that this resource allocation strategy is not always effective—sometimes the results obtained from heavy computation are not more accurate than quick judgments.\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\">\tIX. The Hidden Influence of Training Data\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\">\tAlthough the test games were all newly created, the research found that AI models were still significantly influenced by their training data. Models from different manufacturers showed similar bias patterns, suggesting that they may have learned similar game judgment \"intuitions\" from similar training data.\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\">\tThis phenomenon is especially evident in non-reasoning models. These models mainly rely on statistical patterns learned during training to make judgments, rather than performing actual logical reasoning. As a result, even when facing new games, their judgments still carry a strong preconceived bias.\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\">\tIn contrast, reasoning models, although also influenced by training data, are affected to a lesser extent. 
They rely more on logical analysis during the reasoning process, so they can better adapt to new types of games.\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\">\tX. Implications for Future AI Development\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\">\tThe significance of this research goes beyond the field of games. It actually explores a more fundamental question: as AI systems become increasingly powerful, how can we ensure they still understand and serve human needs and values?\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\">\tThe research results indicate that merely pursuing improvements in technical performance may lead AI systems to deviate from human intuition and preferences. This poses a new challenge for AI development: how to maintain consistency with human values while enhancing AI capabilities.\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\">\tIn practical applications, this issue becomes even more important. For example, if we let AI systems help design educational games, entertainment products, or social platforms, we want their judgments to be based on real human experiences, not abstract theoretical optimality.\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\">\tThe research team proposed several possible solutions. The first is to develop better \"resource-efficient\" reasoning systems that allow AI to dynamically allocate computing resources based on the importance and complexity of tasks. 
The second is to incorporate more human feedback and preference data into AI training so that AI judgments reflect real human experiences.\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\">\tIn short, this study reminds us that AI's \"intelligence\" is reflected not only in its ability to solve complex problems but also in its ability to understand human needs and values. At a time when AI technology is advancing rapidly, ensuring that AI systems truly serve human well-being may be more important than simply pursuing technical indicators.\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\">\tThis study opens a new window for us, allowing us to examine the capabilities and limitations of AI systems from a brand-new perspective. It tells us that judging and choosing may be more difficult than solving problems, and understanding human subjective experiences may be one of the biggest challenges AI faces. As AI systems play roles in more fields, these insights will become increasingly important. 
For everyone concerned about the direction of AI development, this study provides valuable material for reflection.\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\">\tReaders interested in learning more about this study can look up the full paper under the identifier arXiv:2510.10930v1; it includes more detailed experimental data and technical details.\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\">\tQ&amp;A\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\">\tQ1: What is the relationship between an AI model's ability to judge games and its skill at playing them?\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\">\tA: The study found an unexpected phenomenon: the closer an AI model's game skills get to the theoretically optimal level, the more its judgments of games may deviate from human intuition. A technically near-perfect AI may be worse at understanding how humans actually experience a game, revealing the fundamental difference between \"computational perfection\" and \"human intuition.\"\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\">\tQ2: Why does AI show unstable performance in judging game enjoyment?\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\">\tA: Enjoyment judgment is more subjective and complex than fairness judgment. AI needs to consider multiple factors, such as game balance, challenge, strategy depth, and innovation, but models differ significantly in how they integrate these factors into a final judgment. 
The computing resources used by different AI models also vary greatly, and there is no obvious relationship between resource usage and judgment accuracy.\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\">\tQ3: What practical significance does this study have for AI development?\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\">\tA: The study reveals an important challenge for AI systems: how to maintain consistency with human values while improving technical capabilities. This has important guiding significance for the application of AI in areas such as education, entertainment, and product design that require understanding human subjective experiences, reminding us that we should not only pursue technical indicators but also ensure that AI truly serves human needs.\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"color: rgb(136, 136, 136);\">[Tech Traveler] Southern Metropolis Daily \u003C\u002Fspan>\u003Ca href=\"https:\u002F\u002Fwww.msn.cn\u002Fzh-cn\u002Fnews\u002Fother\u002Far-AA1R8leX?ocid=msedgntphdr&amp;cvid=69265783251e4467a4c4ab98c6ba6166&amp;ei=19\" rel=\"noopener noreferrer\" target=\"_blank\" style=\"color: rgb(136, 136, 136);\">https:\u002F\u002Fmp.weixin.qq.com\u002Fs\u002FRH7anFgRokfIkEqKFt6gTg\u003C\u002Fa>\u003C\u002Fp>\u003Cp class=\"ql-align-justify\">\u003Cspan style=\"color: rgb(136, 136, 136);\">（This article is reprinted by this website to provide readers with more information and news. The content does not constitute investment or consumption advice. 
If you have any doubts about the facts in the article, please verify with the relevant parties. The views in the article are not the views of this website, and are provided for reference only.）\u003C\u002Fspan>\u003C\u002Fp>","https:\u002F\u002Fxinwei-dev-test.oss-cn-shenzhen.aliyuncs.com\u002Fintelligent\u002Faudio%3Ac45726f6-ab0a-4e33-9c79-db374198364c%3A0.wav?Expires=1774838448&OSSAccessKeyId=LTAI5tNvY2RkKjZw4LLWsrPK&Signature=z4i2K9bGDq8JK8DTE%2BBMgRklWdE%3D","c45726f6-ab0a-4e33-9c79-db374198364c",17880470]