[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"$fDP3TPefgHaQ66mxchyfxzcOSjxTMVDPEXRxU68DPOuA":3},{"code":4,"msg":5,"data":6},200,"操作成功",{"id":7,"title":8,"content":9,"digest":10,"source":10,"coverPath":11,"thumbsCoverPath":12,"isTop":13,"isShow":14,"baseClick":13,"clickCount":15,"createTime":16,"typeId":17,"isNewest":18,"newsInfoTypeRespVo":19,"voiceUrl":22,"voiceSize":23,"taskId":24,"releaseTime":25,"titleEn":26,"contentEn":27,"voiceUrlEn":28,"taskIdEn":29,"voiceSizeEn":30},1582,"清华大学团队突破AI视频理解难题：用“反常识”训练让机器看懂真相","\u003Cimg alt=\"\" src=\"https:\u002F\u002Fimage.51xinwei.com\u002F2026\u002F03\u002Fhistory\u002F76b2b2ad8af94940af674bd53df47bb1.png\" width=\"791\" height=\"null\" style=\"display: block; margin: auto;\">\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cstrong class=\"ql-lineHeight-1-75\" style=\"font-size: 18px; color: rgb(255, 153, 0);\">\t这项由清华大学的黄哲、北京航空航天大学的文浩，以及阿里巴巴地图团队的郝爱鸣、宋兵泽等研究者共同完成的研究，发表于2025年12月30日的arXiv预印本平台，论文编号为arXiv:2512.24271v1。有兴趣深入了解的读者可以通过该编号查询完整论文。\u003C\u002Fstrong>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">\t当前的多模态大语言模型就像一个聪明但容易被表象迷惑的学生。当它们看到一段视频时，往往会依赖于之前学到的\"常识\"来做出判断，而不是真正仔细观察视频中发生了什么。这就好比一个人看到农场场景就自动认为收割机的玉米应该向下流入拖车，即使视频中的玉米实际上是向上飞到天空中的。\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">\t这种现象被研究者称为\"视觉无根据幻觉\"。就像一个总是根据剧本行事的演员，即使面前的剧情完全不同，也会按照熟悉的套路来表演。目前的AI模型在处理反常识或者违反物理规律的视频内容时，经常会\"视而不见\"，坚持给出符合常理但与实际画面不符的答案。\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">\t研究团队发现，这个问题的根源在于训练数据的不平衡。文本数据的规模和多样性远远超过视频数据，就像一个孩子读了一万本书但只看过十部电影，当然会更相信书本知识而不是眼前所见。\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" 
class=\"ql-lineHeight-1-75\">\t为了解决这个问题，研究团队开发了一个名为\"DualityForge\"的创新框架。这个系统的核心思想是通过可控的视频编辑技术，将普通的真实世界视频转换为违反常识的反常视频。比如让水往上流、让石头漂浮、让物体突然消失等等。\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">\t这种方法就像是给AI学生安排一场\"颠倒世界\"的训练课程。在这个课程中，学生必须学会相信自己的眼睛而不是脑海中的预设知识。当AI同时看到一个物体正常下落的视频和同一个物体向上飞升的编辑版本时，它必须根据实际观察到的内容给出不同的答案，而不能简单地套用\"物体会下落\"这样的常识。\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">\t研究团队构建了一个名为\"DualityVidQA\"的大规模数据集，包含14.4万个训练样本和600个测试样本。这个数据集的特点是每个样本都包含一对视频：一个是原始的真实视频，另一个是经过编辑的反常视频。对于同一个问题，这两个视频需要不同的答案，这迫使AI模型必须仔细观察视频内容而不是依赖语言先验。\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">\t一、反常视频的智能制造工厂\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">\tDualityForge框架就像一个专门制造\"违反常理\"内容的智能工厂。这个工厂有三条不同的生产线，分别负责创造三种类型的反常现象。\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">\t第一条生产线专门处理视觉层面的异常，就像给照片加上各种滤镜效果。这些异常包括不正常的对比度、饱和度、亮度变化，或者局部的图像扭曲。虽然这些改变主要影响视觉质量，但不会改变场景的基本语义含义。研究团队使用OpenCV这样的计算机视觉工具来实现这些效果，就像用Photoshop给图片添加特效一样。\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">\t第二条生产线负责创造语义层面的异常，这些异常会违反场景的基本逻辑。比如让一个物体突然消失、让不存在的东西突然出现、或者用其他物体替换原来的物体。这就像魔术师的表演，物体会违反我们对现实世界的基本认知。为了实现这种效果，研究团队采用了先进的视频编辑模型VACE，它能够在保持视频其他部分不变的情况下，精确地修改特定区域的内容。\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" 
class=\"ql-lineHeight-1-75\">\t第三条生产线是最复杂的，它专门制造违反常识和物理规律的异常现象。这些异常包括违反物理定律的运动、因果关系的颠倒、材料属性的异常变化，以及不合理的人体动作。为了创造这类异常，研究团队首先使用多模态大语言模型分析图像中的视觉元素，然后生成针对特定异常的编辑指令。\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">接着，他们使用FLUX-Kontext模型根据这些指令编辑图像，最后通过VACE模型进行帧间插值，生成流畅的反常视频。\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">\t整个制造过程就像一个精密的手表工厂，每个环节都有严格的质量控制。研究团队使用多个最先进的多模态大语言模型进行交叉验证，确保生成的反常视频确实包含了预期的异常现象，而且这些异常足够明显，能够被人类观察者识别出来。\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">\t这个智能工厂最终产出了超过13.5万个包含异常现象的视频，为后续的AI训练提供了丰富的\"反常识\"素材。整个生产过程消耗了大约4万个GPU小时的计算资源，相当于一台高性能计算机连续工作4年半的时间。\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">\t二、双重问答训练的巧妙设计\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">\t研究团队设计的训练方法就像教一个学生同时应对正常考试和\"颠倒世界\"考试。这种训练分为两个阶段：监督学习阶段和强化学习阶段。\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">\t在监督学习阶段，AI模型需要学习处理包含真实视频和反常视频的混合数据集。这个阶段的目标是双重的：一方面要保持模型在处理正常视频时的优秀表现，另一方面要让模型开始注意到反常视频中的异常现象。为了确保训练的平衡性，研究团队采用了均衡采样策略，确保每个训练批次中都包含相等数量的真实样本和反常样本。\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">\t这个过程就像教一个学生既要掌握正常的数学规则，又要学会识别数学题目中的\"陷阱\"。学生必须在看到正常题目时给出标准答案，在看到包含反常条件的题目时给出相应的非标准答案。\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" 
class=\"ql-lineHeight-1-75\">\t强化学习阶段采用了一种名为\"对偶标准化优势训练\"的创新方法。这个方法的核心思想是利用成对视频数据的对比特性，让模型学会根据实际观察到的视频内容调整其推理过程。\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">\t在这个阶段，模型面对的是一种特殊的挑战：对于同一个问题，它必须根据看到的是真实视频还是反常视频给出不同的答案。这就像一个侦探必须根据不同的证据得出不同的结论，而不能总是套用同一套推理模式。\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">\t研究团队在强化学习中引入了一个重要的技术创新：对每一对真实-反常视频的优势值进行l1标准化。这种标准化确保了模型在学习过程中对真实视频和反常视频给予同等的关注，避免了模型偏向某一类数据的问题。这就像在天平的两端放置等重的砝码，确保学习过程的平衡性。\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">\t具体来说，优势标准化的过程就像调节音响系统的音量平衡。如果左声道和右声道的音量差距过大，听众就会偏向音量更大的一侧。同样地，如果模型在真实视频上的学习信号过强，它就会忽视反常视频中的重要信息。通过标准化处理，研究团队确保了模型能够平等地从两种类型的数据中学习。\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">\t奖励机制的设计也很巧妙。模型的表现主要通过两个方面来评估：答案的正确性和推理格式的规范性。正确性奖励是一个简单的二元分数——答对了得1分，答错了得0分。格式奖励则鼓励模型遵循特定的推理结构，这有助于提高模型输出的可解释性和一致性。\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">\t整个训练过程就像培养一个既能在正常环境中工作，又能在极端条件下保持清醒判断的专业人员。通过这种双重训练，AI模型学会了在面对反常现象时依然保持客观观察和准确判断的能力。\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">\t三、突破性实验成果揭示训练效果\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">\t研究团队对DNA-Train方法进行了全面的实验验证，结果令人印象深刻。在专门设计的DualityVidQA测试集上，经过训练的7B参数模型相比基础的Qwen2.5-VL-7B模型，在反常视频理解任务上实现了24%的相对提升。这个提升幅度相当显著，就像一个原本只能答对50道题的学生，经过特殊训练后能够答对62道题。\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" 
class=\"ql-lineHeight-1-75\">\t更令人惊喜的是，这种针对反常现象的专门训练不仅没有损害模型在正常视频理解任务上的表现，反而带来了全面的性能提升。在多个通用视频理解基准测试中，DNA-Train模型都表现出了更好的性能，包括TempCompass、MVBench、TOMATO和TVBench等权威评测。\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">\t实验结果显示了当前主流AI模型的一个普遍弱点：几乎所有被测试的模型在处理反常视频时都出现了显著的性能下降。即使是表现最好的商业模型，如GPT-4.1和Gemini-2.5 Pro，在处理真实视频时能达到92%以上的准确率，但在面对反常视频时，准确率就会大幅下降。这就像一个在标准考试中表现优异的学生，在面对\"脑筋急转弯\"类型的问题时就显得手足无措。\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">\t特别值得注意的是，在\"反物理常识\"这个最具挑战性的类别中，大多数模型都表现得非常糟糕。但DNA-Train-7B模型在这个类别中达到了79.2%的准确率，展现出了卓越的抗\"常识干扰\"能力。这表明该模型确实学会了相信自己的\"眼睛\"而不是依赖预设的知识。\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">\t研究团队还进行了详尽的消融实验来验证各个组件的作用。他们发现，使用成对数据进行训练是获得良好效果的关键。如果只使用真实视频进行训练，模型在反常视频理解任务上的表现会大幅下降；如果只使用反常视频进行训练，虽然能提高对异常现象的敏感性，但会损害模型在正常视频上的表现。只有使用真实视频和反常视频的配对数据，才能实现两方面性能的协调提升。\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">\t对偶标准化优势训练方法的有效性也得到了充分验证。与传统的强化学习方法相比，这种方法在幻觉检测任务上平均提升了10.8个百分点，在通用视频理解任务上也有1.0个百分点的提升。这证明了优势标准化策略确实能够带来更稳定、更平衡的学习效果。\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">\t实验还验证了该方法在不同规模模型上的通用性。无论是7B、32B还是72B参数的模型，DNA-Train方法都能带来一致的性能提升。这表明该训练范式具有良好的可扩展性，不局限于特定规模的模型。\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">\t更重要的是，研究团队证明了这种方法不仅适用于Qwen2.5-VL模型，在LLaVA-Next-Video等其他主流多模态模型上也能取得显著的改进效果。这说明DNA-Train是一种通用的训练范式，而不是针对特定模型架构的专门优化。\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" 
class=\"ql-lineHeight-1-75\">\t四、技术创新的深层价值与广泛影响\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">\t这项研究的技术贡献远不止于提高某个特定任务的性能分数，它实际上触及了当前AI系统的一个根本性问题：如何让机器学会真正的视觉推理而不是简单的模式匹配。\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">\t传统的多模态AI训练就像教一个学生通过背诵标准答案来应对考试。学生可能在常规考试中表现优异，但当遇到需要真正理解和分析的新情况时就会暴露出问题。DNA-Train方法的创新之处在于，它教会AI模型进行真正的视觉观察和逻辑推理，而不是依赖记忆中的模式。\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">\t这种训练范式的意义可以类比为从\"死记硬背\"向\"理解学习\"的转变。通过让模型同时学习正常和反常的视频内容，并要求它们根据实际观察到的现象给出相应的答案，研究团队实际上是在培养AI的\"批判性思维\"能力。\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">\tDualityForge框架的另一个重要贡献是解决了反常数据稀缺的问题。在现实世界中，违反物理规律或常识的现象确实很少发生，这使得收集足够的训练数据变得极其困难和昂贵。通过可控的视频编辑技术，研究团队创造了一种可扩展的数据生成方法，这为未来的相关研究开辟了新的道路。\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">\t这个框架的设计也体现了深刻的学习理论洞察。通过在编辑过程中嵌入结构化的上下文信息，系统不仅能够生成高质量的反常视频，还能自动生成相应的问答对。这种\"上下文引导的生成\"方法确保了数据的质量和一致性，同时大大降低了人工标注的成本。\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">\t从更广阔的视角来看，这项研究为多模态AI的发展提供了新的思路。当前很多AI系统在处理多模态信息时，往往会过度依赖某一种模态（通常是文本）的信息，而忽视其他模态提供的关键线索。DNA-Train方法通过对比学习的方式，强制模型必须综合考虑所有可用的信息，这有助于构建更加均衡和可靠的多模态AI系统。\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">\t该研究还对AI安全领域具有重要意义。在实际应用中，AI系统可能会遇到各种异常或恶意构造的输入，如果系统过度依赖训练时学到的模式，就可能被这些异常输入误导。通过提高AI模型对反常现象的识别和处理能力，DNA-Train方法实际上增强了系统的鲁棒性和抗攻击能力。\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" 
class=\"ql-lineHeight-1-75\">\t这种训练方法的影响还可能扩展到其他AI应用领域。比如在自动驾驶系统中，车辆必须能够识别和应对各种异常的道路情况；在医疗诊断系统中，AI必须能够发现那些不符合常见病症模式的罕见疾病。DNA-Train提供的对比学习框架为这些应用场景提供了有价值的参考。\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">\t五、未来发展前景与应用潜力\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">\t这项研究开启了多模态AI训练的新篇章，其影响将远远超出学术研究的范围，为各个行业的实际应用带来革命性的改变。\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">\t在内容审核和事实核查领域，经过DNA-Train训练的AI系统将具备更强的\"火眼金睛\"能力。当前的内容审核系统经常会被精心制作的虚假内容蒙蔽，特别是那些利用深度伪造技术制作的视频。具备反常识识别能力的AI将能够更准确地识别这些经过人工修改的异常内容，为网络安全和信息真实性验证提供更可靠的技术支撑。\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">\t在教育领域，这种技术将催生全新的智能学习系统。传统的AI教学助手往往只能处理标准化的教学内容，而具备反常识理解能力的AI将能够处理更复杂、更具创造性的学习场景。比如在科学教育中，AI可以帮助学生理解那些违反直觉的物理现象，或者在艺术教育中分析那些采用反传统手法的创作作品。\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">\t医疗诊断是另一个具有巨大潜力的应用领域。疾病往往表现为对正常生理状态的偏离，而罕见疾病更是会呈现出完全违反常见症状模式的表现。具备反常识识别能力的AI医疗系统将能够更好地识别这些\"非典型\"病例，为医生提供更准确的诊断支持，特别是在处理那些容易被误诊的罕见疾病时。\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">\t在自动驾驶技术中，这种训练方法的价值更是不言而喻。道路环境中充满了各种异常情况：突然出现的障碍物、违规行驶的车辆、恶劣天气下的特殊路况等等。传统的自动驾驶系统往往在这些\"边缘情况\"下表现不佳，因为它们过于依赖训练数据中的常见模式。DNA-Train方法培养的\"反常识\"敏感性将显著提高自动驾驶系统在复杂环境下的安全性和可靠性。\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">\t从技术发展的角度来看，这项研究还为大模型的训练提供了新的思路。当前的大模型训练主要关注于扩大数据规模和模型参数，但DNA-Train研究表明，数据的多样性和质量可能比单纯的数量更加重要。通过精心设计的对比学习任务，即使使用相对较小的数据集，也能够实现显著的性能提升。\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" 
class=\"ql-lineHeight-1-75\">\t这种方法还为多模态AI的可解释性研究开辟了新的方向。通过分析模型在处理正常和反常视频时的不同表现，研究者可以更好地理解模型的内部工作机制，识别模型的偏见和局限性。这种理解对于构建更加可信和可控的AI系统至关重要。\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">\t研究团队已经承诺将开源他们的数据集和代码，这将为整个研究社区提供宝贵的资源。预期将有更多的研究团队基于这个框架开展进一步的研究，探索不同类型的反常现象、不同的编辑技术、以及不同的训练策略。这种开放式的研究合作将加速相关技术的发展和应用。\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">\t同时，这项研究也提醒我们注意AI系统的局限性。即使是经过专门训练的模型，在面对某些极端的反常情况时仍然可能表现不佳。这说明我们还需要继续努力，不断改进训练方法和评估标准，以构建更加健壮和可靠的AI系统。\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">\t说到底，这项研究的真正价值在于它为AI系统装上了一双更加敏锐的\"眼睛\"。在一个充满变化和意外的真实世界中，只有具备了真正的观察能力和判断能力的AI，才能成为人类真正可靠的伙伴。这项来自清华大学等机构的研究，正是朝着这个目标迈出的重要一步，它不仅提高了AI的技术水平，更重要的是提升了AI理解世界的深度和准确性。\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">\tQ&amp;A\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">\tQ1：什么是DNA-Train训练方法？\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">\tA：DNA-Train是一种针对多模态AI的新型训练方法，包含监督学习和强化学习两个阶段。它通过让AI模型同时学习正常视频和人工编辑的反常视频，迫使模型根据实际观察到的内容而非预设常识来回答问题，从而提高AI的视觉推理能力。\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">\tQ2：DualityForge框架是如何制造反常视频的？\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" 
class=\"ql-lineHeight-1-75\">\tA：DualityForge框架有三条不同的\"生产线\"：第一条处理视觉异常如对比度、饱和度变化；第二条创造语义异常如物体消失、出现或替换；第三条制造违反物理规律的现象如水往上流、石头漂浮等。整个过程使用先进的视频编辑技术，并通过多个AI模型进行质量验证。\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">\tQ3：这项研究对普通人的生活有什么实际影响？\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">\tA：这项技术将提高各种AI应用的可靠性，包括更准确的内容审核系统、更智能的教育助手、更精准的医疗诊断、更安全的自动驾驶等。最重要的是，它让AI具备了更强的\"反常识\"识别能力，在面对异常情况时能做出更准确的判断，从而为人类提供更可信的AI服务。\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"color: rgb(136, 136, 136);\">【新闻来源】MSN \u003C\u002Fspan>\u003Ca href=\"https:\u002F\u002Fwww.msn.cn\u002Fzh-cn\u002Fnews\u002Fother\u002Far-AA1TRzSR?ocid=msedgntphdr&amp;cvid=6960a044903742af84c57e4ee0ce1732&amp;ei=88\" rel=\"noopener noreferrer\" target=\"_blank\" style=\"color: rgb(136, 136, 136);\"> https:\u002F\u002Fwww.msn.cn\u002Fzh-cn\u002Fnews\u002Fother\u002Far-AA1RTDQy?ocid=BingHp01&amp;cvid=6936317f054647a2afcd53fafcde084a&amp;ei\u003C\u002Fa>\u003C\u002Fp>\u003Cp class=\"ql-align-justify\">\u003Cspan style=\"color: rgb(136, 136, 
136);\">（本网转发此文章，旨在为读者提供更多的信息资讯，所涉内容不构成投资、消费建议。文章事实如有疑问，请与有关方核实，文章观点非本网观点，仅供读者参考。）\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>","","https:\u002F\u002Fimage.51xinwei.com\u002F2026\u002F01\u002F3aa283fc4d874ae78c1d7ad1d3007208\u002FAI领域.jpg","https:\u002F\u002Fimage.51xinwei.com\u002F2026\u002F01\u002Fthumbs\u002F3aa283fc4d874ae78c1d7ad1d3007208\u002FAI领域.jpg",0,1,51,"2026-01-12 16:15",2,false,{"id":17,"name":20,"enName":21},"芯位视野","Xinwei Vision","https:\u002F\u002Fxinwei-dev-test.oss-cn-shenzhen.aliyuncs.com\u002Fintelligent\u002Faudio%3Ad3cd85d3-4462-4dd6-896e-9a4b5325fc43%3A0.wav?Expires=1768846017&OSSAccessKeyId=LTAI5tNvY2RkKjZw4LLWsrPK&Signature=sZgwHWSa2I1XV9M%2B3HRfEmD5u%2FI%3D",32495154,"d3cd85d3-4462-4dd6-896e-9a4b5325fc43","2026-01-12 16:11","Tsinghua University team breaks through an AI video understanding challenge: using \"counterintuitive\" training to help machines see the truth","\u003Cimg alt=\"\" src=\"https:\u002F\u002Fimage.51xinwei.com\u002F2026\u002F03\u002Fhistory\u002F76b2b2ad8af94940af674bd53df47bb1.png\" width=\"791\" height=\"null\" style=\"display: block; margin: auto;\">\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cstrong class=\"ql-lineHeight-1-75\" style=\"font-size: 18px; color: rgb(255, 153, 0);\">\tThis study was jointly completed by Huang Zhe of Tsinghua University, Wen Hao of Beihang University, Hao Aiming and Song Bingze of Alibaba's map team, and other researchers. It was posted to the arXiv preprint platform on December 30, 2025, as arXiv:2512.24271v1. Readers who want more detail can look up the full paper by that identifier.\u003C\u002Fstrong>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">\tCurrent multimodal large language models are like bright students who are easily misled by appearances. 
When they see a video, they often fall back on previously learned \"common sense\" instead of carefully observing what actually happens on screen. It is like someone who sees a farm scene and automatically assumes that the corn from the combine harvester flows downward into the trailer, even though the corn in the video is actually flying upward into the sky.\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">\tThe researchers call this phenomenon \"visually ungrounded hallucination.\" The model behaves like an actor who sticks to a memorized script and performs the familiar routine even when the scene in front of them has completely changed. Current AI models often \"turn a blind eye\" to videos with counterintuitive or physically impossible content, insisting on plausible answers that do not match what is actually shown.\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">\tThe research team traced the root of this problem to an imbalance in training data. Text data far exceeds video data in scale and diversity. A child who has read ten thousand books but watched only ten movies will naturally trust book knowledge over what is in front of their eyes.\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">\tTo solve this problem, the research team developed a framework called \"DualityForge.\" Its core idea is to use controllable video editing to turn ordinary real-world videos into counterintuitive, abnormal ones. 
For example, water can be made to flow upward, stones to float, or objects to vanish suddenly.\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">\tThis method is like enrolling the AI student in a \"reversed world\" training course, in which the student must learn to trust its eyes rather than the assumptions in its head. When the model sees both a video of an object falling normally and an edited version in which the same object flies upward, it must give different answers based on what it actually observes, rather than simply applying the common-sense rule that \"objects fall.\"\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">\tThe research team built a large-scale dataset called \"DualityVidQA,\" containing 144,000 training samples and 600 test samples. Each sample contains a pair of videos: the original real video and an edited abnormal version. The same question requires different answers for the two videos, which forces the model to examine the video content carefully instead of relying on language priors.\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">\tOne, the intelligent factory for manufacturing abnormal videos\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">\tThe DualityForge framework is like a specialized intelligent factory for producing \"counterintuitive\" content. 
This factory has three production lines, each responsible for one of three types of anomaly.\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">\tThe first production line handles visual anomalies, much like applying filter effects to photos. These include abnormal contrast, saturation, or brightness changes, as well as local image distortions. Such changes mainly affect visual quality and do not alter the basic semantics of the scene. The research team implements them with computer vision tools such as OpenCV, much as one would add special effects to an image in Photoshop.\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">\tThe second production line creates semantic anomalies, which violate the basic logic of the scene: an object suddenly disappears, something that was not there appears, or one object is replaced by another. The effect resembles a magic trick in which objects defy our basic understanding of the real world. To achieve it, the research team used the video editing model VACE, which can precisely modify a specific region while leaving the rest of the video unchanged.\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">\tThe third production line is the most complex, specializing in anomalies that violate common sense and physical laws. These include physically impossible motion, reversed causality, abnormal changes in material properties, and implausible human movements. 
To create such anomalies, the research team first used multimodal large language models to analyze the visual elements in the image, then generated editing instructions targeting specific anomalies.\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">\tNext, they used the FLUX-Kontext model to edit the images based on these instructions, and finally used the VACE model for frame interpolation to generate smooth abnormal videos.\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">\tThe entire manufacturing process is like a precision watch factory, with strict quality control at each stage. The research team used multiple state-of-the-art multimodal large language models for cross-validation, ensuring that the generated abnormal videos indeed contained the expected anomalies, and these anomalies were significant enough to be identified by human observers.\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">\tThis intelligent factory eventually produced over 135,000 videos containing anomalies, providing rich \"counterintuitive\" materials for subsequent AI training. The entire production process consumed approximately 40,000 GPU hours of computing resources, equivalent to a high-performance computer working continuously for 4.5 years.\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">\tTwo, the clever design of dual-question training\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">\tThe training method designed by the research team is like teaching a student to deal with both normal exams and \"reversed world\" exams. 
The training is divided into two stages: a supervised learning stage and a reinforcement learning stage.\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">\tIn the supervised learning stage, the model learns from a mixed dataset of real and abnormal videos. The goal is twofold: to preserve the model's strong performance on normal videos, and to get the model to start noticing the anomalies in abnormal videos. To keep training balanced, the research team adopted a balanced sampling strategy so that each training batch contains equal numbers of real and abnormal samples.\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">\tThis process is like teaching a student both to master the normal rules of mathematics and to spot the \"traps\" hidden in math problems. The student must give the standard answer to a normal question and the corresponding non-standard answer to a question with abnormal conditions.\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">\tThe reinforcement learning stage adopts a method called \"dual normalized advantage training.\" Its core idea is to exploit the contrast within each video pair so that the model learns to adjust its reasoning to the video content it actually observes.\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">\tIn this stage, the model faces a special challenge: for the same question, it must give different answers depending on whether it sees a real video or an abnormal video. 
This is like a detective who must draw different conclusions from different evidence, rather than always applying the same line of reasoning.\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">\tThe research team introduced an important technical innovation in the reinforcement learning stage: L1 normalization of the advantage values within each real-abnormal video pair. This normalization ensures that the model pays equal attention to real and abnormal videos during learning and does not become biased toward one type of data. It is like placing equal weights on both pans of a scale to keep the learning process balanced.\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">\tSpecifically, advantage normalization works like adjusting the balance on an audio system. If the volume difference between the left and right channels is too large, listeners will favor the louder side. Similarly, if the learning signal from real videos is too strong, the model will ignore important information in the abnormal videos. Through normalization, the research team ensured that the model learns equally from both types of data.\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">\tThe reward mechanism is also cleverly designed. The model's output is evaluated mainly on two aspects: the correctness of the answer and adherence to the required reasoning format. The correctness reward is a simple binary score: 1 point for a correct answer, 0 points for an incorrect one. 
Format rewards encourage the model to follow a specific reasoning structure, which helps improve the interpretability and consistency of the model's output.\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">\tThe entire training process is like training a professional who can work in normal environments and still keep a clear head under extreme conditions. Through this dual training, the AI model learned to keep observing objectively and judging accurately when facing abnormal phenomena.\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">\tThree, breakthrough experimental results reveal the training effectiveness\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">\tThe research team validated the DNA-Train method through comprehensive experiments, and the results are impressive. On the purpose-built DualityVidQA test set, the trained 7B-parameter model achieved a 24% relative improvement on abnormal video understanding over the base Qwen2.5-VL-7B model. That is a substantial gain, comparable to a student who used to answer 50 questions correctly and, after special training, answers 62.\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">\tMore surprisingly, this specialized training on abnormal phenomena did not harm the model's performance on normal video understanding tasks; it improved performance across the board. 
On multiple general video understanding benchmarks, including TempCompass, MVBench, TOMATO, and TVBench, the DNA-Train model performed better.\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">\tThe experimental results revealed a common weakness of current mainstream AI models: almost all tested models showed significant performance degradation on abnormal videos. Even the best commercial models, such as GPT-4.1 and Gemini-2.5 Pro, achieve over 92% accuracy on real videos, yet their accuracy drops sharply on abnormal videos. This is like a student who excels in standard exams but is at a loss when faced with \"brain teaser\" questions.\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">\tIt is particularly worth noting that in the most challenging category, \"counter-physical common sense,\" most models performed very poorly. However, the DNA-Train-7B model achieved an accuracy of 79.2% in this category, demonstrating strong resistance to \"common sense interference.\" This indicates that the model has indeed learned to trust its own \"eyes\" rather than rely on preset knowledge.\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">\tThe research team also conducted detailed ablation experiments to verify the role of each component. They found that training with paired data is key to achieving good results. If only real videos are used for training, performance on abnormal video understanding drops significantly; if only abnormal videos are used, sensitivity to anomalies increases but performance on normal videos suffers. 
Only using paired data of real and abnormal videos can achieve coordinated improvements in both aspects of performance.\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">\tThe effectiveness of the dual normalized advantage training method has been fully verified. Compared to traditional reinforcement learning methods, this method improved the performance by an average of 10.8 percentage points in hallucination detection tasks and by 1.0 percentage points in general video understanding tasks. This proves that the advantage normalization strategy indeed brings more stable and balanced learning effects.\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">\tThe experiments also verified the generality of this method across different model sizes. Whether it is 7B, 32B, or 72B parameter models, the DNA-Train method can bring consistent performance improvements. This indicates that the training paradigm has good scalability and is not limited to specific model sizes.\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">\tMore importantly, the research team proved that this method is not only applicable to the Qwen2.5-VL model, but can also achieve significant improvements on other mainstream multimodal models such as LLaVA-Next-Video. 
This shows that DNA-Train is a generic training paradigm, not a specialized optimization for a specific model architecture.\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">\tFour, the deep value and wide impact of technological innovation\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">\tThe technical contributions of this research go beyond improving the performance score of a specific task; it actually touches a fundamental issue of current AI systems: how to make machines learn true visual reasoning rather than simple pattern matching.\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">\tTraditional multimodal AI training is like teaching a student to pass exams by memorizing standard answers. The student may perform well in regular exams, but when faced with new situations that require true understanding and analysis, problems will be exposed. 
The innovation of the DNA-Train method lies in teaching AI models to perform true visual observation and logical reasoning, rather than relying on remembered patterns.\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">\tThe significance of this training paradigm can be likened to the transition from \"rote memorization\" to \"understanding learning.\" By making models learn both normal and abnormal video content simultaneously and requiring them to provide corresponding answers based on what they actually observe, the research team is actually cultivating the AI's \"critical thinking\" ability.\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">\tAnother important contribution of the DualityForge framework is solving the problem of scarce abnormal data. In the real world, phenomena that violate physical laws or common sense are indeed rare, making it extremely difficult and expensive to collect sufficient training data. Through controllable video editing technology, the research team created a scalable data generation method, opening up new paths for future related research.\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">\tThe design of this framework also reflects profound learning theory insights. By embedding structured contextual information during the editing process, the system not only generates high-quality abnormal videos but also automatically generates corresponding question-answer pairs. 
This \"context-guided generation\" method ensures data quality and consistency while greatly reducing the cost of manual annotation.\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">\tFrom a broader perspective, this research provides new ideas for the development of multimodal AI. Many AI systems currently process multimodal information by over-relying on one modality (usually text), ignoring other modalities' critical clues. The DNA-Train method, through contrastive learning, forces the model to consider all available information, helping to build a more balanced and reliable multimodal AI system.\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">\tThis research also has significant implications for the AI safety field. In practical applications, AI systems may encounter various abnormal or maliciously constructed inputs. If the system overly relies on the patterns learned during training, it may be misled by these abnormal inputs. By improving the AI model's ability to recognize and handle abnormal phenomena, the DNA-Train method actually enhances the system's robustness and anti-attack capabilities.\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">\tThe impact of this training method may also extend to other AI application fields. For example, in autonomous driving systems, vehicles must be able to identify and respond to various abnormal road conditions; in medical diagnostic systems, AI must be able to detect rare diseases that do not conform to common disease patterns. 
The contrastive learning framework provided by DNA-Train offers valuable references for these application scenarios.\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">\tFive, future development prospects and application potential\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">\tThis research opens a new chapter in multimodal AI training, and its influence will far exceed the scope of academic research, bringing revolutionary changes to practical applications in various industries.\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">\tIn the field of content review and fact verification, AI systems trained with DNA-Train will have a keener \"eye.\" Current content review systems are often deceived by meticulously crafted false content, especially content made using deepfake technology. AI that can recognize counterintuitive content will be able to more accurately identify such manually modified abnormal content, providing more reliable technical support for cybersecurity and information authenticity verification.\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">\tIn the education field, this technology will give rise to new intelligent learning systems. Traditional AI teaching assistants can only handle standardized teaching content, while AI that can understand counterintuitive content will be able to handle more complex and creative learning scenarios. 
For example, in science education, AI can help students understand physics phenomena that defy intuition, or in art education, analyze creative works that use unconventional techniques.\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">\tMedical diagnosis is another area with great potential for application. Diseases often manifest as deviations from normal physiological states, and rare diseases can show completely different symptom patterns. AI medical systems that can recognize counterintuitive patterns will be better at identifying these \"atypical\" cases, providing more accurate diagnostic support for doctors, especially for diseases that are easily misdiagnosed.\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">\tIn the field of autonomous driving technology, the value of this training method is self-evident. Road environments are full of various abnormal situations: sudden obstacles, vehicles violating traffic rules, special road conditions in bad weather, etc. Traditional autonomous driving systems often perform poorly in these \"edge cases\" because they rely too much on common patterns in training data. The \"counterintuitive\" sensitivity cultivated by the DNA-Train method could significantly improve the safety and reliability of autonomous driving systems in complex environments.\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">\tFrom a technical development perspective, this research also provides new ideas for large model training. Currently, large model training mainly focuses on expanding data size and model parameters, but the DNA-Train study shows that the diversity and quality of data may be more important than sheer quantity. 
Through carefully designed contrastive learning tasks, even with relatively small datasets, significant performance improvements can be achieved.\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">\tThis method also opens up new directions for the study of multimodal AI interpretability. By analyzing how the model behaves differently when dealing with normal and abnormal videos, researchers can better understand the internal workings of the model and identify its biases and limitations. This understanding is crucial for building more trustworthy and controllable AI systems.\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">\tThe research team has committed to open sourcing their dataset and code, which will provide valuable resources for the entire research community. It is expected that more research teams will conduct further research based on this framework, exploring different types of abnormal phenomena, different editing techniques, and different training strategies. This open research collaboration will accelerate the development and application of related technologies.\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">\tAt the same time, this research also reminds us of the limitations of AI systems. Even models that have been specifically trained may still perform poorly in certain extreme abnormal situations. 
This indicates that we still need to continue working to improve training methods and evaluation standards to build more robust and reliable AI systems.\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">\tIn short, the true value of this research lies in equipping AI systems with a more sensitive \"eye.\" In a real world full of changes and surprises, only AI with true observational and judgment abilities can become a reliable partner for humans. This research from institutions such as Tsinghua University is an important step toward this goal, not only improving AI's technical level, but more importantly, enhancing AI's depth and accuracy in understanding the world.\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">\tQ&amp;A\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">\tQ1: What is the DNA-Train training method?\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">\tA: DNA-Train is a new training method for multimodal AI, consisting of two stages: supervised learning and reinforcement learning. 
It allows AI models to learn normal videos and manually edited abnormal videos simultaneously, forcing the model to answer questions based on actual observations rather than preconceived common sense, thereby improving the AI's visual reasoning ability.\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">\tQ2: How does the DualityForge framework produce abnormal videos?\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">\tA: The DualityForge framework has three different \"production lines\": the first handles visual anomalies such as contrast and saturation changes; the second creates semantic anomalies such as object disappearance, appearance, or replacement; the third produces phenomena that violate physical laws such as water flowing upwards and stones floating. The entire process uses advanced video editing technology and verifies quality through multiple AI models.\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">\tQ3: What practical impact does this research have on ordinary people's lives?\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"font-size: 18px;\" class=\"ql-lineHeight-1-75\">\tA: This technology will improve the reliability of various AI applications, including more accurate content review systems, smarter educational assistants, more precise medical diagnoses, and safer autonomous driving. 
Most importantly, it gives AI a stronger \"counterintuitive\" identification ability, allowing it to make more accurate judgments when facing abnormal situations, thus providing more trustworthy AI services for humans.\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>\u003Cp>\u003Cspan style=\"color: rgb(136, 136, 136);\">【News source】MSN \u003C\u002Fspan>\u003Ca href=\"https:\u002F\u002Fwww.msn.cn\u002Fzh-cn\u002Fnews\u002Fother\u002Far-AA1TRzSR?ocid=msedgntphdr&amp;cvid=6960a044903742af84c57e4ee0ce1732&amp;ei=88\" rel=\"noopener noreferrer\" target=\"_blank\" style=\"color: rgb(136, 136, 136);\"> https:\u002F\u002Fwww.msn.cn\u002Fzh-cn\u002Fnews\u002Fother\u002Far-AA1RTDQy?ocid=BingHp01&amp;cvid=6936317f054647a2afcd53fafcde084a&amp;ei\u003C\u002Fa>\u003C\u002Fp>\u003Cp class=\"ql-align-justify\">\u003Cspan style=\"color: rgb(136, 136, 136);\">（This article is reprinted by this website to provide readers with more information and news, and the content involved does not constitute investment or consumption advice. If there are any doubts about the facts of the article, please verify with the relevant parties. 
The views of the article are not the views of this website and are for reference only.）\u003C\u002Fspan>\u003C\u002Fp>\u003Cp>\u003Cbr>\u003C\u002Fp>","https:\u002F\u002Fxinwei-dev-test.oss-cn-shenzhen.aliyuncs.com\u002Fintelligent\u002Faudio%3A5aabdc0b-660f-41bf-832f-76879271aee5%3A0.wav?Expires=1774838430&OSSAccessKeyId=LTAI5tNvY2RkKjZw4LLWsrPK&Signature=IdGLAqytSrD2tVqaUgs86HHddis%3D","5aabdc0b-660f-41bf-832f-76879271aee5",17498746]