In this paper, we propose a novel method for transferring material transformations across different scenes. Building on disentangled Neural Radiance Field (NeRF) representations, our approach learns to map Bidirectional Reflectance Distribution Functions (BRDFs) from pairs of scenes observed in varying conditions, such as dry and wet. The learned transformations can then be applied to unseen scenes with similar materials, rendering the learned transformation at an arbitrary level of intensity. Extensive experiments on synthetic scenes and real-world objects validate the effectiveness of our approach, showing that it can learn various transformations such as wetness, painting, and coating. Our results highlight not only the versatility of our method but also its potential for practical applications in computer graphics. We publish our method implementation, along with our synthetic/real datasets, on this https URL.
在这篇论文中,我们提出了一种在不同场景之间转移材料变换的新方法。基于分解的神经辐射场(NeRF)表示,我们的方法学习将双向反射分布函数(BRDF)从观察到的不同条件下的成对场景映射出来,例如干燥和湿润状态。学得的变换可以应用于具有相似材质的未见过的场景,从而有效地以任意强度级别渲染所学的变换。我们在合成场景和现实世界物体上的大量实验验证了我们方法的有效性,结果显示它可以学习各种变换,如湿润度、涂漆、涂层等。我们的结果不仅突出了方法的灵活性,还展示了其在计算机图形学中的实际应用潜力。我们将该方法的实现以及合成/真实数据集发布于此 https URL。
https://arxiv.org/abs/2411.08037
In this paper, we argue that iterative computation with diffusion models offers a powerful paradigm not only for generation but also for visual perception tasks. We unify tasks such as depth estimation, optical flow, and segmentation under image-to-image translation, and show how diffusion models benefit from scaling training and test-time compute for these perception tasks. Through a careful analysis of these scaling behaviors, we present various techniques to efficiently train diffusion models for visual perception tasks. Our models achieve performance that improves on or is comparable to state-of-the-art methods while using significantly less data and compute. To use our code and models, see this https URL.
在这篇论文中,我们论证了迭代计算与扩散模型为不仅生成任务而且视觉感知任务提供了一个强大的范式。我们将深度估计、光流和分割等任务统一在图像到图像的转换下,并展示了扩散模型如何从这些感知任务的训练时间和测试时间计算扩展中受益。通过对这些扩展行为的仔细分析,我们提出了各种技术来有效地训练用于视觉感知任务的扩散模型。我们的模型使用显著较少的数据和计算量实现了改进或与最先进方法相当的表现。要使用我们的代码和模型,请参见此 https URL。
https://arxiv.org/abs/2411.08034
While 3D content generation has advanced significantly, existing methods still face challenges with input formats, latent space design, and output representations. This paper introduces a novel 3D generation framework that addresses these challenges, offering scalable, high-quality 3D generation with an interactive Point Cloud-structured Latent space. Our framework employs a Variational Autoencoder (VAE) with multi-view posed RGB-D(epth)-N(ormal) renderings as input, using a unique latent space design that preserves 3D shape information, and incorporates a cascaded latent diffusion model for improved shape-texture disentanglement. The proposed method, GaussianAnything, supports multi-modal conditional 3D generation, allowing for point cloud, caption, and single/multi-view image inputs. Notably, the newly proposed latent space naturally enables geometry-texture disentanglement, thus allowing 3D-aware editing. Experimental results demonstrate the effectiveness of our approach on multiple datasets, outperforming existing methods in both text- and image-conditioned 3D generation.
尽管三维内容生成技术有了显著进步,现有方法仍面临输入格式、潜在空间设计和输出表示方面的挑战。本文介绍了一种新的三维生成框架,该框架解决了这些挑战,提供了具有可扩展性和高质量的三维生成,并引入了交互式的点云结构化潜空间。我们的框架采用变分自编码器(VAE),将多视角姿势RGB-D(深度)-N(法线)渲染作为输入,使用一种独特的潜在空间设计来保留3D形状信息,并加入级联潜在扩散模型以改进形状-纹理解耦。所提出的方法名为GaussianAnything,支持多种条件下的三维生成,允许点云、标题和单/多视角图像的输入。值得注意的是,新提出的潜空间自然实现了几何结构-纹理的分离,从而使得3D感知编辑成为可能。实验结果表明,我们的方法在多个数据集上的表现优于现有的方法,在文本和图像引导的三维生成方面均表现出色。
https://arxiv.org/abs/2411.08033
In real-world NLP applications, Large Language Models (LLMs) offer promising solutions due to their extensive training on vast datasets. However, the large size and high computation demands of LLMs limit their practicality in many applications, especially when further fine-tuning is required. To address these limitations, smaller models are typically preferred for deployment. However, their training is hindered by the scarcity of labeled data. In contrast, unlabeled data is often readily available, and it can be leveraged by using LLMs to generate pseudo-labels for training smaller models. This enables the smaller models (students) to acquire knowledge from LLMs (teachers) while reducing computational costs. This process introduces challenges, such as potentially noisy pseudo-labels. Selecting high-quality and informative data is therefore critical to enhance model performance while improving the efficiency of data utilization. To address this, we propose LLKD, which enables Learning with Less computational resources and less data for Knowledge Distillation from LLMs. LLKD is an adaptive sample selection method that incorporates signals from both the teacher and student. Specifically, it prioritizes samples where the teacher demonstrates high confidence in its labeling, indicating reliable labels, and where the student exhibits a high information need, identifying challenging samples that require further learning. Our comprehensive experiments show that LLKD achieves superior performance across various datasets with higher data efficiency.
在实际的自然语言处理(NLP)应用中,大型语言模型(LLMs)因其在大规模数据集上的广泛训练而提供了有前景的解决方案。然而,LLMs 的庞大尺寸和高计算需求限制了它们在许多应用中的实用性,尤其是在需要进一步微调的情况下。为了解决这些限制,通常更倾向于部署较小的模型。但是,标记数据的稀缺性阻碍了小模型的训练。相比之下,未标记的数据往往容易获得,并且可以通过使用 LLMs 生成伪标签来用于训练小型模型。这使得小型模型(学生)能够从 LLMs(教师)获取知识,同时降低计算成本。这一过程带来了诸如潜在噪声伪标签等挑战。因此,选择高质量和信息丰富的数据对于提高模型性能以及提升数据利用率效率至关重要。为了解决这些问题,我们提出了 LLKD 方法,它能够在使用较少计算资源和较少数据的情况下实现从 LLMs 的知识蒸馏(Knowledge Distillation)。LLKD 是一种自适应样本选取方法,融合了教师和学生的信号。具体来说,它优先选择那些教师对其标签表现出高度自信的样本,这表明可靠的标签,并且学生展示出高信息需求的样本,识别出需要进一步学习的挑战性样本。我们的综合实验显示,LLKD 在各种数据集上实现了更高的数据效率和优越性能。
https://arxiv.org/abs/2411.08028
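The selection rule described in the LLKD abstract above (keep samples where the teacher is confident and the student still needs information) can be illustrated with a minimal sketch. The abstract does not give the exact criteria, so everything here is an assumption: teacher confidence is taken as the maximum pseudo-label probability, the student's information need as the entropy of its prediction, and the fixed thresholds `conf_min` and `ent_min` are hypothetical stand-ins for what the abstract calls an adaptive method.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def select_samples(teacher_probs, student_probs, conf_min=0.8, ent_min=0.5):
    """Return indices where the teacher is confident and the student is uncertain.

    conf_min and ent_min are illustrative fixed thresholds (an assumption);
    the abstract describes the real method as adaptive.
    """
    selected = []
    for i, (t, s) in enumerate(zip(teacher_probs, student_probs)):
        teacher_confidence = max(t)       # proxy for pseudo-label reliability
        student_uncertainty = entropy(s)  # proxy for the student's information need
        if teacher_confidence >= conf_min and student_uncertainty >= ent_min:
            selected.append(i)
    return selected

# Three unlabeled samples with binary class probabilities.
teacher = [[0.95, 0.05], [0.55, 0.45], [0.90, 0.10]]
student = [[0.50, 0.50], [0.50, 0.50], [0.99, 0.01]]
chosen = select_samples(teacher, student)
```

In this toy input only the first sample passes both filters: the second has an unreliable pseudo-label, and the third is already easy for the student.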
Physical reasoning is an important skill for robotic agents operating in the real world. However, solving such reasoning problems often involves hypothesizing and reflecting over complex multi-body interactions under the effect of a multitude of physical forces, and learning all such interactions poses a significant hurdle for state-of-the-art machine learning frameworks, including large language models (LLMs). To study this problem, we propose a new physical reasoning task and a dataset, dubbed TraySim. Our task involves predicting the dynamics of several objects on a tray that is given an external impact. The domino effect of the ensuing object interactions and their dynamics offers a challenging yet controlled setup, with the goal of reasoning being to infer the stability of the objects after the impact. To solve this complex physical reasoning task, we present LLMPhy, a zero-shot black-box optimization framework that leverages the physics knowledge and program synthesis abilities of LLMs, and synergizes these abilities with the world models built into modern physics engines. Specifically, LLMPhy uses an LLM to generate code that iteratively estimates the physical hyperparameters of the system (friction, damping, layout, etc.) via an implicit analysis-by-synthesis approach with a (non-differentiable) simulator in the loop, and uses the inferred parameters to imagine the dynamics of the scene towards solving the reasoning task. To show the effectiveness of LLMPhy, we present experiments on our TraySim dataset to predict the steady-state poses of the objects. Our results show that the combination of the LLM and the physics engine leads to state-of-the-art zero-shot physical reasoning performance, while demonstrating superior convergence against standard black-box optimization methods and better estimation of the physical parameters.
物理推理是机器人在现实世界中操作时所需的一项重要技能。然而,解决这类推理问题通常涉及对多种物理力作用下的复杂多体互动进行假设和反思,因此学习所有这些互动对于最先进的机器学习框架(包括大型语言模型(LLMs))来说是一个巨大的障碍。为了研究这个问题,我们提出了一项新的物理推理任务和一个数据集,称为TraySim。我们的任务涉及到预测托盘上多个物体在受到外部冲击后的动态情况——随之而来的物体质的多米诺效应及其动力学提供了一个具有挑战性但又受控的设置,目标是通过推理来判断冲击后物体质的稳定性。为了解决这个复杂的物理推理问题,我们提出了LLMPhy,这是一种零样本黑盒优化框架,它利用了大型语言模型的物理学知识和程序合成能力,并将这些能力与现代物理引擎内置的世界模型相结合。具体来说,LLMPhy 使用一个 LLM 来生成代码以迭代估计系统的物理超参数(如摩擦、阻尼、布局等),通过使用循环中的非可微模拟器进行隐式分析-综合方法来进行估算,并利用推断出的参数来想象场景的动力学情况,从而解决推理任务。为了展示LLMPhy的有效性,我们在我们的TraySim数据集上进行了实验,预测物体的稳态姿态。结果显示,结合大型语言模型和物理引擎可以实现最先进的零样本物理推理性能,同时与标准黑盒优化方法相比具有优越的收敛性和更好的物理参数估计效果。
https://arxiv.org/abs/2411.08027
Trees continue to fascinate with their natural beauty and as engineering masterpieces that are optimal with respect to several independent criteria. The Pythagorean tree is a well-known fractal design that realistically mimics natural tree branching structures. We study various types of Pythagorean-like fractal trees with different base shapes, branching angles, and relaxed scales in an attempt to identify and explain which variants most closely match the branching structures commonly observed in the natural world. Pursuing both realism and minimalism in the fractal tree model, we have developed a flexibly parameterised and fast algorithm to grow and visually examine deep Pythagorean-inspired fractal trees, with the capability to systematically over- or underestimate Leonardo da Vinci's tree branching rule as well as to control various imbalances and branching angles. We tested the realism of the generated fractal tree images by means of the classification accuracy with which transfer-trained deep Convolutional Neural Networks (CNNs) detect natural trees. Having empirically established the parameters of the fractal trees that maximize the CNN's classification accuracy for the natural tree class, we translated them back to the scales and angles of branches. We arrived at conclusions that support the da Vinci branching rule and golden-ratio-based scaling for both the shape of the branch and the imbalance between the child branches, and we claim that the flexibly parameterized fractal trees can be used to generate artificial examples to train robust detectors of different species of trees.
树木以其自然之美和在多个独立标准下作为工程杰作的最优性质继续吸引着人们的注意。毕达哥拉斯树是一种著名的分形设计,能够真实地模仿自然树木的分支结构。我们研究了各种类型的类似毕达哥拉斯的分形树,这些树具有不同的基底形状、分支角度和放松的比例,试图识别并解释哪些变体最接近于自然界中常见的分支结构。追求分形树模型的真实性和极简性的同时,我们开发了一种灵活参数化且快速的算法来生成和可视化深层次受毕达哥拉斯启发的分形树,并具有有序地高估或低估列奥纳多·达·芬奇树木分支规则的能力,以及控制各种不平衡和分支角度。通过转移训练的深度卷积神经网络(CNNs)检测自然树木的分类准确性,我们测试了生成的分形树图像的真实性。在实证建立使CNN的自然树木类分类准确性最大化的分形树参数后,我们将这些参数转换为枝条的比例和角度,并得出了有趣的结论:支持达·芬奇分支规则以及基于黄金比例对树枝形状和子树枝之间的不平衡进行缩放。我们声称这种灵活参数化的分形树可以用于生成人工示例以训练不同种类树木的鲁棒检测器。
https://arxiv.org/abs/2411.08024
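To make the link between the Pythagorean tree and the da Vinci branching rule in the abstract above concrete, here is a small sketch. It assumes the classic construction, in which a right triangle sits on top of each square so that the two child squares have side ratios cos θ and sin θ, and it assumes the common statement of da Vinci's rule, in which the parent thickness squared equals the sum of the child thicknesses squared. The function names and the `exponent` parameter are illustrative; the abstract does not describe the paper's actual parameterisation.

```python
import math

def child_scales(angle_deg):
    """Side-length ratios of the two child squares in a classic Pythagorean tree."""
    theta = math.radians(angle_deg)
    return math.cos(theta), math.sin(theta)

def da_vinci_residual(parent, children, exponent=2.0):
    """parent**exponent minus the sum of child**exponent; zero when the rule holds.

    exponent=2 corresponds to preserving cross-sectional area (assumed form of the rule).
    """
    return parent ** exponent - sum(c ** exponent for c in children)

# A 30-degree branching angle gives unbalanced children.
left, right = child_scales(30.0)
residual = da_vinci_residual(1.0, [left, right])
```

Because cos²θ + sin²θ = 1, the classic construction satisfies the exponent-2 rule for every angle. Scaling the children up or down from these ratios is one way a generator could over- or underestimate the rule.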
We present a framework for large language model (LLM) based data generation with controllable causal structure. In particular, we define a procedure for turning any language model and any directed acyclic graph (DAG) into a sequence-driven structural causal model (SD-SCM). Broadly speaking, an SD-SCM is a causal model with user-defined structure and LLM-defined structural equations. We characterize how an SD-SCM allows sampling from observational, interventional, and counterfactual distributions according to the desired causal structure. We then leverage this procedure to propose a new type of benchmark for causal inference methods, generating individual-level counterfactual data without needing to manually specify functional relationships between variables. We create an example benchmark consisting of thousands of datasets, and test a suite of popular estimation methods on these datasets for average, conditional average, and individual treatment effect estimation, both with and without hidden confounding. Apart from generating data, the same procedure also allows us to test for the presence of a causal effect that might be encoded in an LLM. This procedure can underpin auditing LLMs for misinformation, discrimination, or otherwise undesirable behavior. We believe SD-SCMs can serve as a useful tool in any application that would benefit from sequential data with controllable causal structure.
我们提出了一种基于大型语言模型(LLM)的数据生成框架,该框架具有可控制的因果结构。特别地,我们定义了一个过程,可以将任意的语言模型和有向无环图(DAG)转换为序列驱动的结构性因果模型(SD-SCM)。广义上讲,SD-SCM是一种用户定义结构、LLM定义结构方程式的因果模型。我们描述了SD-SCM如何根据所需的因果结构从观察分布、干预分布和反事实分布中进行采样。然后,我们利用这一过程提出了一种新的因果推断方法基准测试类型,能够生成个体级别的反事实数据,而无需手动指定变量之间的功能关系。我们创建了一个包含数千个数据集的示例基准,并在这些数据集上对一系列流行的估计方法进行了测试,评估了平均效应、条件平均效应和个体治疗效果的估计,包括存在隐藏混杂因素的情况。除了生成数据之外,同样的过程还允许我们检验LLM中是否编码了因果效应的存在性。这一过程可以作为审计LLM中的错误信息、歧视或其它不希望的行为的基础。我们认为SD-SCMs可以作为一种有用的工具,在任何需要具有可控制因果结构的顺序数据的应用中发挥作用。
https://arxiv.org/abs/2411.08019
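The sampling procedure sketched in the SD-SCM abstract above, with a user-defined DAG and model-defined structural equations, can be illustrated as follows. This is only a sketch under assumptions: `toy_sampler` is a hypothetical stand-in for a language-model call, the variable names are invented, and the abstract does not say how the real method turns the DAG into text sequences.

```python
import random
from graphlib import TopologicalSorter

def sample_sd_scm(parents, sampler, interventions=None, seed=0):
    """Sample every variable in topological order of a user-defined DAG.

    parents maps each variable to the list of its parent variables.
    sampler(var, context, rng) stands in for the LLM-defined structural equation.
    interventions fixes variables to given values, ignoring their parents.
    """
    rng = random.Random(seed)
    interventions = interventions or {}
    values = {}
    # static_order() yields each node only after all of its predecessors.
    for var in TopologicalSorter(parents).static_order():
        if var in interventions:
            values[var] = interventions[var]
        else:
            context = {p: values[p] for p in parents.get(var, ())}
            values[var] = sampler(var, context, rng)
    return values

def toy_sampler(var, context, rng):
    """Hypothetical replacement for an LLM call, used only to keep the sketch runnable."""
    if var == "treatment":
        return rng.random() < 0.5
    if var == "outcome":
        return "recovered" if context["treatment"] else "not recovered"
    raise KeyError(var)

parents = {"treatment": [], "outcome": ["treatment"]}
observational = sample_sd_scm(parents, toy_sampler, seed=1)
interventional = sample_sd_scm(parents, toy_sampler, interventions={"treatment": True})
```

Counterfactual sampling would additionally require reusing the same randomness across the factual and intervened runs, which this sketch does not attempt.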
Large-scale 3D generative models require substantial computational resources yet often fall short in capturing fine details and complex geometries at high resolutions. We attribute this limitation to the inefficiency of current representations, which lack the compactness required to model the generative models effectively. To address this, we introduce a novel approach called Wavelet Latent Diffusion, or WaLa, that encodes 3D shapes into wavelet-based, compact latent encodings. Specifically, we compress a $256^3$ signed distance field into a $12^3 \times 4$ latent grid, achieving an impressive 2427x compression ratio with minimal loss of detail. This high level of compression allows our method to efficiently train large-scale generative networks without increasing the inference time. Our models, both conditional and unconditional, contain approximately one billion parameters and successfully generate high-quality 3D shapes at $256^3$ resolution. Moreover, WaLa offers rapid inference, producing shapes within two to four seconds depending on the condition, despite the model's scale. We demonstrate state-of-the-art performance across multiple datasets, with significant improvements in generation quality, diversity, and computational efficiency. We open-source our code and, to the best of our knowledge, release the largest pretrained 3D generative models across different modalities.
大规模的3D生成模型需要大量的计算资源,但往往在捕捉高分辨率下的精细细节和复杂几何形状方面表现不佳。我们认为这种局限性源于当前表示方法的低效,它们缺乏有效的建模所需的紧凑性。为了解决这个问题,我们引入了一种名为Wavelet Latent Diffusion(WaLa)的新方法,该方法将3D形状编码成基于小波的紧凑潜在编码。具体来说,我们将一个$256^3$的符号距离场压缩到一个$12^3 \times 4$的潜在网格中,实现了高达2427倍的压缩率,并且细节损失极小。这种高水平的压缩使得我们的方法能够有效地训练大规模生成网络而不增加推理时间。我们的模型(包括条件和无条件模型)含有大约十亿个参数,并成功地以$256^3$分辨率生成高质量的3D形状。此外,尽管模型规模较大,WaLa提供了快速推理的能力,在两到四秒内产生形状,具体取决于条件。我们在多个数据集上展示了最先进的性能,生成质量、多样性和计算效率方面都有显著提高。我们开源了代码,并且据我们所知,发布了跨不同模式的最大预训练3D生成模型。
https://arxiv.org/abs/2411.08017
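The compression ratio quoted in the WaLa abstract above can be checked with simple arithmetic. The sketch below counts stored values only, assuming each value takes the same number of bytes in the input grid and in the latent grid.

```python
# Number of values in the 256^3 signed distance field.
input_values = 256 ** 3

# Number of values in the 12^3 x 4 latent grid.
latent_values = 12 ** 3 * 4

# Ratio of input size to latent size, in stored values.
ratio = input_values / latent_values
```

The ratio comes out at roughly 2427, in line with the figure reported in the abstract.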
The works of Gatys et al. demonstrated the capability of Convolutional Neural Networks (CNNs) in creating artistic style images. This process of rendering content images in different styles is called Neural Style Transfer (NST). In this paper, we re-implement image-based NST, fast NST, and arbitrary NST. We also explore using ResNet with activation smoothing in NST. Extensive experimental results demonstrate that the smoothing transformation can greatly improve the quality of stylization results.
Gatys等人的一项研究表明,卷积神经网络(CNN)具备生成艺术风格图像的能力。这种将内容图像转换为不同风格的过程被称为神经样式迁移(NST)。在这篇论文中,我们重新实现了基于图像的NST、快速NST和任意NST。同时,我们也探索了在NST中使用带有激活平滑处理的ResNet的方法。大量的实验结果表明,通过平滑变换可以极大地提升样式的生成质量。
https://arxiv.org/abs/2411.08014
Speech impairments in Parkinson's disease (PD) provide significant early indicators for diagnosis. While models for speech-based PD detection have shown strong performance, their interpretability remains underexplored. This study systematically evaluates several explainability methods to identify PD-specific speech features, aiming to support the development of accurate, interpretable models for clinical decision-making in PD diagnosis and monitoring. Our methodology involves (i) obtaining attributions and saliency maps using mainstream interpretability techniques, (ii) quantitatively evaluating the faithfulness of these maps and their combinations obtained via union and intersection through a range of established metrics, and (iii) assessing the information conveyed by the saliency maps for PD detection from an auxiliary classifier. Our results reveal that, while explanations are aligned with the classifier, they often fail to provide valuable information for domain experts.
帕金森病(PD)中的言语障碍提供了重要的早期诊断指标。虽然基于语音的PD检测模型显示出了强大的性能,但其可解释性仍处于探索初期。本研究系统评估了几种解释方法,以识别特定于PD的语音特征,旨在支持开发准确且具有可解释性的模型,用于PD诊断和监测的临床决策。我们的方法包括:(i) 使用主流的解释技术获得属性分配和显著图;(ii) 通过一系列公认的指标定量评估这些图及其通过并集和交集组合所得结果的忠实度;(iii) 从辅助分类器的角度评估这些显著图对PD检测所传达的信息。我们的研究结果显示,虽然解释与分类器相一致,但它们往往无法为领域专家提供有价值的信息。
https://arxiv.org/abs/2411.08013
While Large Language Models (LLMs) have demonstrated remarkable performance in certain dimensions, their ability to express the implicit language cues that humans use for effective communication remains unclear. This paper presents ExpressivityArena, a Python library for measuring the implicit communication abilities of LLMs. We provide a comprehensive framework to evaluate the expressivity of arbitrary LLMs and explore its practical implications. To this end, we refine the definition and measurements of ``expressivity,'' and use our framework in a set of small experiments. These experiments test LLMs in creative and logical tasks such as poetry, coding, and emotion-based responses. The outputs are then evaluated by an automated grader, through ExpressivityArena, which we verify to be the most pragmatic approach for testing expressivity. Building on these experiments, we deepen our understanding of the expressivity of LLMs by assessing their ability to remain expressive in conversations. Our findings indicate that LLMs are capable of generating and understanding expressive content, albeit with some limitations. These insights will inform the future development and deployment of expressive LLMs. We provide the code for ExpressivityArena alongside our paper.
虽然大型语言模型(LLMs)在某些维度上表现出色,但它们表达人类有效沟通所需的隐性语言线索的能力仍然不明。本文介绍了ExpressivityArena,这是一个用于衡量LLM隐性沟通能力的Python库。我们提供了一个全面的框架来评估任意LLM的表现力,并探讨其实际意义。为此,我们精炼了“表现力”的定义和测量方法,并使用我们的框架进行了一系列小规模实验。这些实验测试了LLMs在创意和逻辑任务中的表现,如诗歌创作、编程以及基于情感的回应等。之后通过ExpressivityArena中的自动化评分器来评估它们的表现力,这是我们验证过的最实用的测试手段。在此基础上,我们进一步了解了LLM在对话中保持表现力的能力。我们的研究结果表明,尽管存在一些限制,LLMs有能力生成和理解富有表现力的内容。这些见解将指导未来具有表现力的LLM的发展和部署。本文同时提供了ExpressivityArena的代码。
https://arxiv.org/abs/2411.08010
Attributing outputs from Large Language Models (LLMs) in adversarial settings, such as cyberattacks and disinformation, presents significant challenges that are likely to grow in importance. We investigate this attribution problem using formal language theory, specifically language identification in the limit as introduced by Gold and extended by Angluin. By modeling LLM outputs as formal languages, we analyze whether finite text samples can uniquely pinpoint the originating model. Our results show that, due to the non-identifiability of certain language classes and under some mild assumptions about overlapping outputs from fine-tuned models, it is theoretically impossible to attribute outputs to specific LLMs with certainty. This also holds when accounting for the expressivity limitations of Transformer architectures. Even with direct model access or comprehensive monitoring, significant computational hurdles impede attribution efforts. These findings highlight an urgent need for proactive measures to mitigate risks posed by adversarial LLM use as their influence continues to expand.
将大型语言模型(LLMs)在对抗场景下的输出(例如网络攻击和虚假信息传播)归因于特定的模型面临显著挑战,这些问题的重要性可能会逐渐增加。我们使用形式语言理论,特别是Gold引入并由Angluin扩展的极限下语言识别方法来研究这个问题。通过将LLM的输出建模为形式语言,我们分析了是否可以通过有限的文本样本唯一地确定其来源模型。结果表明,在某些语言类别的不可识别性条件下,并且在关于微调模型重叠输出的一些温和假设下,理论上无法将输出归因于特定的LLMs。这一结论同样适用于考虑到Transformer架构表达能力限制的情况。即使有直接访问模型或全面监控的能力,显著的计算障碍也阻碍了归因工作的进行。这些发现强调,在大型语言模型的影响持续扩大的情况下,迫切需要采取主动措施来减轻其对抗性使用所带来的风险。
https://arxiv.org/abs/2411.08003
What mechanisms underlie linguistic generalization in large language models (LLMs)? This question has attracted considerable attention, with most studies analyzing the extent to which the language skills of LLMs resemble rules. As of yet, it is not known whether linguistic generalization in LLMs could equally well be explained as the result of analogical processes, which can be formalized as similarity operations on stored exemplars. A key shortcoming of prior research is its focus on linguistic phenomena with a high degree of regularity, for which rule-based and analogical approaches make the same predictions. Here, we instead examine derivational morphology, specifically English adjective nominalization, which displays notable variability. We introduce a new method for investigating linguistic generalization in LLMs: focusing on GPT-J, we fit cognitive models that instantiate rule-based and analogical learning to the LLM training data and compare their predictions on a set of nonce adjectives with those of the LLM, allowing us to draw direct conclusions regarding underlying mechanisms. As expected, rule-based and analogical models explain the predictions of GPT-J equally well for adjectives with regular nominalization patterns. However, for adjectives with variable nominalization patterns, the analogical model provides a much better match. Furthermore, GPT-J's behavior is sensitive to the individual word frequencies, even for regular forms, a behavior that is consistent with an analogical account of regular forms but not a rule-based one. These findings refute the hypothesis that GPT-J's linguistic generalization on adjective nominalization involves rules, suggesting similarity operations on stored exemplars as the underlying mechanism. Overall, our study suggests that analogical processes play a bigger role in the linguistic generalization of LLMs than previously thought.
什么机制驱动了大型语言模型(LLMs)中的语言泛化?这一问题引起了相当多的关注,大多数研究分析的是LLMs的语言技能在多大程度上类似于规则。迄今为止,还不清楚是否可以用类比过程的结果来同样很好地解释LLMs中的语言泛化,这些过程可以形式化为对存储示例的相似性操作。先前研究的一个主要缺点是其专注于高度规律性的语言现象,在这类现象中,基于规则的方法和类比方法作出相同的预测。在这里,我们转而考察派生形态学,特别是英语形容词名词化,这一领域显示出显著的变化性。我们提出了一种新的方法来调查LLMs中的语言泛化:集中于GPT-J模型,我们将基于规则学习的和类比学习的认知模型拟合到LLM训练数据上,并将其对一组新造形容词的预测与LLM进行比较,使我们可以直接得出关于潜在机制的结论。如预期那样,对于具有常规名词化模式的形容词,基于规则的方法和类比方法同样很好地解释了GPT-J的预测。然而,对于具有变化性名词化模式的形容词,类比模型提供了更好的匹配度。此外,即使是对常规形式而言,GPT-J的行为对个别单词频率也非常敏感,这种行为与类比处理方式的一致性高于基于规则的方式。这些发现反驳了GPT-J在形容词名词化上的语言泛化涉及规则的假设,暗示存储示例中的相似性操作是潜在机制。总体来说,我们的研究提示,在LLMs的语言泛化中,类比过程发挥的作用可能大于以前的认识。
https://arxiv.org/abs/2411.07990
We demonstrate that Gini coefficients can be used as unified metrics to evaluate many-versus-many (all-to-all) similarity in vector spaces. Our analysis of various image datasets shows that images with the highest Gini coefficients tend to be the most similar to one another, while images with the lowest Gini coefficients are the least similar. We also show that this relationship holds true for vectorized text embeddings from various corpuses, highlighting the consistency of our method and its broad applicability across different types of data. Additionally, we demonstrate that selecting machine learning training samples that closely match the distribution of the testing dataset is far more important than ensuring data diversity. Selection of exemplary and iconic training samples with higher Gini coefficients leads to significantly better model performance compared to simply having a diverse training set with lower Gini coefficients. Thus, Gini coefficients can serve as effective criteria for selecting machine learning training samples, with our selection method outperforming random sampling methods in very sparse information settings.
我们证明了基尼系数可以作为统一的度量标准,用于评估向量空间中的多对多(全体对全体)相似性。我们对各种图像数据集的分析表明,具有最高基尼系数的图像是最彼此相似的,而具有最低基尼系数的图像是最不相似的。我们还展示了这种关系对于来自不同语料库的矢量化文本嵌入同样成立,这突显了我们的方法的一致性和其在不同类型数据上的广泛应用性。此外,我们证明了选择与测试数据集分布紧密匹配的机器学习训练样本比确保数据多样性更为重要。选取具有较高基尼系数的典型和代表性训练样本相比简单地拥有一个基尼系数较低的多样化训练集能显著提高模型性能。因此,在信息非常稀疏的情况下,基尼系数可以用作有效选择机器学习训练样本的标准,并且我们的选择方法优于随机抽样方法。
https://arxiv.org/abs/2411.07983
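For reference alongside the abstract above, the Gini coefficient of a single vector can be computed as in the sketch below. The abstract does not say how embeddings are preprocessed or how the per-sample coefficient is turned into a selection rule, so this only shows the standard statistic and assumes non-negative inputs (for example, absolute values of embedding components).

```python
def gini(values):
    """Gini coefficient of a non-negative sequence.

    0 means all entries are equal; values near 1 mean the mass is concentrated
    in a few entries. Non-negative input is assumed.
    """
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # Rank-weighted sum over the ascending-sorted values (ranks start at 1).
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2.0 * weighted / (n * total) - (n + 1) / n

even = gini([1, 1, 1, 1])
concentrated = gini([0, 0, 0, 1])
```

A perfectly even vector scores 0, while a vector with all its mass in one of four entries scores 0.75, the maximum for that length.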
Second-order optimization has been shown to accelerate the training of deep neural networks in many applications, often yielding faster progress per iteration on the training loss compared to first-order methods. However, the generalization properties of second-order methods are still being debated. Theoretical investigations have proved difficult to carry out outside the tractable settings of heavily simplified model classes -- thus, the relevance of existing theories to practical deep learning applications remains unclear. Similarly, empirical studies in large-scale models and real datasets are significantly confounded by the necessity to approximate second-order updates in practice. It is often unclear whether the observed generalization behaviour arises specifically from the second-order nature of the parameter updates, or instead reflects the specific structured (e.g., Kronecker) approximations used or any damping-based interpolation towards first-order updates. Here, we show for the first time that exact Gauss-Newton (GN) updates take on a tractable form in a class of deep reversible architectures that are sufficiently expressive to be meaningfully applied to common benchmark datasets. We exploit this novel setting to study the training and generalization properties of the GN optimizer. We find that exact GN generalizes poorly. In the mini-batch training setting, this manifests as rapidly saturating progress even on the \emph{training} loss, with parameter updates found to overfit each mini-batch without producing the features that would support generalization to other mini-batches. We show that our experiments run in the ``lazy'' regime, in which the neural tangent kernel (NTK) changes very little during the course of training. This behaviour is associated with having no significant changes in neural representations, explaining the lack of generalization.
二次优化已被证明可以加速深度神经网络在许多应用中的训练过程,通常每轮迭代在训练损失上的进展比一次方法更快。然而,关于二次方法的一般化性能的讨论仍在继续。理论研究很难在高度简化的模型类之外的可处理设置中进行——因此,现有理论对实际深度学习应用的相关性仍不清楚。类似地,在大规模模型和真实数据集中的经验研究表明,由于需要在实践中近似二次更新,结果受到显著干扰。通常情况下,无法明确区分观察到的一般化行为是特别来自参数更新的二次特性,还是反映所使用的特定结构(如 Kronecker)近似的特征或任何基于阻尼向一次更新插值的结果。 在这里,我们首次展示了在一类足够表达以有意义地应用于常见基准数据集的深度可逆架构中,精确的高斯-牛顿(GN)更新具有可处理的形式。我们利用这一新颖设置来研究 GN 优化器的训练和一般化属性。我们发现精确 GN 的一般化性能不佳。在小批量训练环境中,这表现为即使在 *训练* 损失上进展迅速饱和,参数更新被发现在适应每个小批次方面过度拟合,而没有产生支持向其他小批次推广所需的特征。我们的实验是在“懒惰”模式下运行的,在这种模式中,神经切核(NTK)在整个训练过程中变化很少。这种行为与在神经表示上没有显著的变化相关联,解释了一般化能力缺乏的原因。
https://arxiv.org/abs/2411.07979
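As background for the abstract above, the textbook Gauss-Newton update for a squared-error loss is written out below. This is the standard form, given under the assumption of a least-squares objective; the abstract does not state the tractable form derived for reversible architectures, so the symbols $r$, $J$, and $\lambda$ are introduced here for illustration only.

```latex
% Squared-error loss with residual vector r(\theta) and Jacobian J = \partial r / \partial \theta
L(\theta) = \tfrac{1}{2}\,\lVert r(\theta) \rVert^2, \qquad
\nabla_\theta L = J^\top r

% Exact Gauss-Newton step
\theta_{t+1} = \theta_t - \left(J^\top J\right)^{-1} J^\top r

% Damped variant: as \lambda grows, the step direction approaches the gradient direction
\theta_{t+1} = \theta_t - \left(J^\top J + \lambda I\right)^{-1} J^\top r
```

The damped variant shows why damping acts as an interpolation towards first-order updates: for large $\lambda$ the step is approximately $-\lambda^{-1} J^\top r$, a scaled gradient step.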
Coronary artery disease (CAD) is one of the most common causes of mortality in the world. Coronary artery calcium (CAC) scoring using computed tomography (CT) is key for risk assessment to prevent coronary disease. Previous studies on risk assessment and calcification detection in CT scans primarily use approaches based on the U-Net architecture, frequently implemented on pre-built models. However, these models are limited by the availability of annotated CT scans containing CAC and suffer from imbalanced datasets, which decreases the performance of CAC segmentation and scoring. In this study, we extend this approach by incorporating the self-supervised learning (SSL) technique of DINO (self-distillation with no labels) to eliminate the limitations of scarce annotated data in CT scans. The DINO model's ability to train without requiring CAC area annotations enhances its robustness in generating distinct features. The DINO model is then trained to focus specifically on calcified areas by using labels, aiming to generate features that effectively capture and highlight key characteristics. The label-guided DINO (DINO-LG) enhances classification by distinguishing CT slices that contain calcification from those that do not, performing 57% better than the standard DINO model in this task. CAC scoring and segmentation tasks are performed by a basic U-Net architecture, fed specifically with CT slices containing calcified areas as identified by the DINO-LG model. This targeted identification performed by the DINO-LG model improves CAC segmentation performance by approximately 10% and significantly increases CAC scoring accuracy.
冠状动脉疾病(CAD)是全球最常见的死亡原因之一。使用计算机断层扫描(CT)进行冠状动脉钙化(CAC)评分对于评估心脏病风险和预防冠心病至关重要。此前关于在CT扫描中评估风险和检测钙化的研究主要基于UNET架构的方法,并经常采用预先构建的模型。然而,这些模型受限于带有CAC注释的CT扫描数据的可用性以及不平衡的数据集问题,导致CAC分割和评分性能下降。 在这项研究中,我们通过引入DINO(无标签自蒸馏)的自我监督学习(SSL)技术来扩展这种方法,以消除在CT扫描中标注数据稀缺的限制。DINO模型无需CAC区域标注即可进行训练的能力增强了其生成区分性特征的稳健性。该模型通过使用标注信息专注于钙化区域的训练,旨在生成能有效捕捉和突出关键特性的特征。标签引导的DINO(DINO-LG)通过区分含有钙化的CT切片与不含钙化的切片来提升分类性能,在此任务中比标准DINO模型提高了57%。 CAC评分和分割任务由一个基本的U-NET架构完成,该架构特别使用了由DINO-LG模型识别出含钙化区域的CT切片。这种由DINO-LG模型执行的目标识别提高了大约10%的CAC分割性能,并显著提升了CAC评分的准确性。
https://arxiv.org/abs/2411.07976
We present JanusFlow, a powerful framework that unifies image understanding and generation in a single model. JanusFlow introduces a minimalist architecture that integrates autoregressive language models with rectified flow, a state-of-the-art method in generative modeling. Our key finding demonstrates that rectified flow can be straightforwardly trained within the large language model framework, eliminating the need for complex architectural modifications. To further improve the performance of our unified model, we adopt two key strategies: (i) decoupling the understanding and generation encoders, and (ii) aligning their representations during unified training. Extensive experiments show that JanusFlow achieves comparable or superior performance to specialized models in their respective domains, while significantly outperforming existing unified approaches across standard benchmarks. This work represents a step toward more efficient and versatile vision-language models.
我们提出了JanusFlow,这是一个强大的框架,能够在单一模型中统一图像理解和生成。JanusFlow引入了一种极简的架构,将自回归语言模型与校正流(一种最先进的生成建模方法)相结合。我们的关键发现表明,校正流可以直接在大型语言模型框架内进行训练,无需复杂的架构修改。为了进一步提升我们统一模型的性能,我们采用了两种主要策略:(i) 解耦理解编码器和生成编码器,以及(ii) 在统一训练过程中对其表示进行对齐。广泛的实验显示,JanusFlow在其各自的领域中实现了与专业模型相当或更优的表现,并在标准基准测试上显著超越了现有的统一方法。这项工作代表了向更加高效且多功能的视觉-语言模型迈出的重要一步。
https://arxiv.org/abs/2411.07975
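For readers unfamiliar with rectified flow, which the JanusFlow abstract above builds on, the usual training objective is summarised below. This is the generic formulation with one common time convention (noise at $t = 0$, data at $t = 1$); the abstract does not specify JanusFlow's exact conditioning or loss weighting, so $v_\theta$ and the convention are assumptions.

```latex
% Straight-line interpolation between a noise sample x_0 and a data sample x_1
x_t = (1 - t)\, x_0 + t\, x_1, \qquad
x_0 \sim \mathcal{N}(0, I), \quad t \sim \mathcal{U}[0, 1]

% The velocity network is regressed onto the constant direction x_1 - x_0
\mathcal{L}(\theta) = \mathbb{E}_{x_0,\, x_1,\, t}
  \left\lVert v_\theta(x_t, t) - (x_1 - x_0) \right\rVert^2

% Generation integrates the learned ODE from t = 0 to t = 1
\frac{\mathrm{d} x_t}{\mathrm{d} t} = v_\theta(x_t, t)
```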
The advanced role-playing capabilities of Large Language Models (LLMs) have paved the way for developing Role-Playing Agents (RPAs). However, existing benchmarks, such as HPD, which incorporates manually scored character relationships into the context for LLMs to sort coherence, and SocialBench, which uses specific profiles generated by LLMs in the context of multiple-choice tasks to assess character preferences, face limitations like poor generalizability, implicit and inaccurate judgments, and excessive context length. To address the above issues, we propose an automatic, scalable, and generalizable paradigm. Specifically, we construct a benchmark by extracting relations from a general knowledge graph and leverage RPA's inherent hallucination properties to prompt it to interact across roles, employing ChatGPT for stance detection and defining relationship hallucination along with three related metrics. Extensive experiments validate the effectiveness and stability of our metrics. Our findings further explore factors influencing these metrics and discuss the trade-off between relationship hallucination and factuality.
大型语言模型(LLMs)的高级角色扮演能力为开发角色扮演代理(RPAs)铺平了道路。然而,现有的基准测试,如HPD,将手动评分的角色关系纳入上下文以使LLMs对连贯性进行排序,以及SocialBench,在多选题任务中使用由LLM生成的具体档案来评估角色偏好,都面临着诸如泛化能力差、隐含且不准确的判断和过长的上下文长度等限制。为了解决上述问题,我们提出了一种自动化的、可扩展的和通用的方法范式。具体来说,我们通过从一个通用知识图谱中提取关系来构建基准,并利用RPA内在的幻觉特性使其能够在不同角色间互动,采用ChatGPT进行立场检测并定义了关系幻觉及三个相关指标。大量的实验验证了我们的指标的有效性和稳定性。我们的研究进一步探讨了影响这些指标的因素,并讨论了关系幻觉与事实性之间的权衡。
https://arxiv.org/abs/2411.07965
To date there is little publicly available scientific data on Unidentified Aerial Phenomena (UAP) whose properties and kinematics purportedly reside outside the performance envelope of known phenomena. To address this deficiency, the Galileo Project is designing, building, and commissioning a multi-modal ground-based observatory to continuously monitor the sky and conduct a rigorous long-term aerial census of all aerial phenomena, both natural and human-made. One of the key instruments is an all-sky infrared camera array using eight uncooled long-wave infrared FLIR Boson 640 cameras. Their calibration includes a novel extrinsic calibration method using airplane positions from Automatic Dependent Surveillance-Broadcast (ADS-B) data. We establish a first baseline for the system performance over five months of field operation, using a real-world dataset derived from ADS-B data, synthetic 3-D trajectories, and a hand-labelled real-world dataset. We report acceptance rates (e.g. viewable airplanes that are recorded) and detection efficiencies (e.g. recorded airplanes that are successfully detected) for a variety of weather conditions, ranges, and aircraft sizes. We reconstruct $\sim$500,000 trajectories of aerial objects from this commissioning period. A toy outlier search focused on large sinuosity of the 2-D reconstructed trajectories flags about 16% of trajectories as outliers. After manual review, 144 trajectories remain ambiguous: they are likely mundane objects but cannot be elucidated at this stage of development without distance and kinematics estimation or other sensor modalities. Our observed count of ambiguous outliers, combined with systematic uncertainties, yields an upper limit of 18,271 on the outlier count for the five-month interval at a 95% confidence level. This likelihood-based method to evaluate significance is applicable to all of our future outlier searches.
截至目前,关于那些据称超出已知现象性能范围的不明空中现象(UAP)的公开科学数据仍然很少。为解决这一不足,伽利略项目正在设计、建造并调试一个多模态地面观测站,以持续监测天空,并进行所有空中现象的严格长期普查,包括自然和人造的。该计划的关键仪器之一是由八台未制冷的长波红外FLIR Boson 640相机组成的全空红外相机阵列。它们的校准包括使用自动依赖监视-广播(ADS-B)数据中的飞机位置的一种新颖的外在校准方法。我们利用从ADS-B数据、合成三维轨迹和手标注的真实世界数据中得出的实际数据集,为五个月田间操作期间系统的性能建立了首个基线。我们报告了各种天气条件、范围和飞机尺寸下的接受率(如可视并记录的飞机)和检测效率(如成功检测到的被记录下来的飞机)。从这一调试期重建了约50万条空中物体轨迹。一个以二维重建轨迹的大波动性为焦点的小规模异常搜索将大约16%的轨迹标记为异常值。经过人工审查后,仍有144个轨迹模糊不清:它们很可能是平凡的对象,但在目前的发展阶段无法通过距离和运动学估计或其他传感器模态来阐明。我们观察到的模糊异常数量结合系统不确定度,在95%置信水平下得到五个月期间异常数上限为18,271。这种基于可能性的方法用于评估显著性适用于我们未来的所有异常搜索。
https://arxiv.org/abs/2411.07956
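The outlier search in the abstract above keys on the sinuosity of 2-D trajectories. A minimal sketch of one common definition follows: total path length divided by the straight-line distance between the endpoints. The abstract does not give the project's exact definition or threshold, so this definition and the example trajectories are assumptions.

```python
import math

def sinuosity(points):
    """Path length divided by the straight-line distance between the endpoints.

    points is a sequence of (x, y) pairs. A straight track gives 1.0; a closed
    loop (coincident endpoints) is reported as infinity.
    """
    path = sum(math.dist(a, b) for a, b in zip(points, points[1:]))
    chord = math.dist(points[0], points[-1])
    return math.inf if chord == 0 else path / chord

straight = [(0, 0), (1, 0), (2, 0)]
zigzag = [(0, 0), (1, 1), (2, 0)]
straight_score = sinuosity(straight)
zigzag_score = sinuosity(zigzag)
```

A trajectory would then be flagged when its score exceeds some cut value, which would have to be tuned on data such as the ADS-B-derived airplane tracks mentioned in the abstract.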
Modern software for propositional satisfiability problems gives a powerful automated reasoning toolkit, capable of outputting not only a satisfiable/unsatisfiable signal but also a justification of unsatisfiability in the form of a resolution proof (or a more expressive proof), which is commonly used for verification purposes. Empirically, modern SAT solvers produce relatively short proofs; however, there are no inherent guarantees that these proofs cannot be significantly reduced. This paper proposes a novel branch-and-bound algorithm for finding the shortest resolution proofs; to this end, we introduce a layer list representation of proofs that groups clauses by their level of indirection. As we show, this representation breaks all permutational symmetries, thereby improving upon the state-of-the-art symmetry-breaking and informing the design of a novel workflow for proof minimization. In addition, we design pruning procedures that reason about proof-length lower bounds, clause subsumption, and dominance. Our experiments suggest that the proofs from state-of-the-art solvers could be shortened by 30-60% on the instances from SAT Competition 2002 and by 25-50% on small synthetic formulas. When treated as an algorithm for finding the shortest proof, our approach solves twice as many instances as the previous work based on SAT solving and reduces the time to optimality by orders of magnitude for the instances solved by both approaches.
现代的命题可满足性问题软件提供了一个强大的自动推理工具包,不仅能够输出一个可满足/不可满足的信号,还能够以解析证明(或更复杂的证明)的形式给出不可满足性的理由。这种形式通常用于验证目的。经验表明,现代SAT求解器产生的证明相对短小,但没有内在保证这些证明不能显著缩减。本文提出了一种寻找最短解析证明的新分支定界算法;为此,我们引入了证明的层列表表示法,该方法按子句的间接层次对其进行分组。正如我们将要展示的那样,这种表示法打破了所有置换对称性,从而改进了最先进的对称破缺技术,并指导了一个新的证明最小化工作流的设计。此外,我们还设计了剪枝程序,这些程序基于证明长度下限、子句包含和支配进行推理。我们的实验表明,来自最先进求解器的证明可以在SAT竞赛2002中的实例上缩短30-60%,在小型合成公式上则可缩短25-50%。当作为寻找最短证明的算法使用时,我们的方法解决的问题实例数是以前基于SAT求解的工作的两倍,并将达到最优所需的时间减少了几个数量级,在两种方法都解决的实例中也是如此。
https://arxiv.org/abs/2411.07955
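The resolution proofs discussed in the abstract above are built from a single inference step, sketched below. Clauses are represented as frozensets of non-zero integers in the DIMACS style, where a negative number denotes a negated variable. This representation and the tiny formula are illustrative only; the sketch does not implement the paper's layer list representation or its branch-and-bound search.

```python
def resolve(clause_a, clause_b, pivot):
    """Resolve two clauses on pivot and return the resolvent.

    pivot must occur positively in clause_a and negatively in clause_b.
    """
    if pivot not in clause_a or -pivot not in clause_b:
        raise ValueError("pivot must occur positively in clause_a and negatively in clause_b")
    return (clause_a - {pivot}) | (clause_b - {-pivot})

# Unsatisfiable formula: (x1 or x2), (not x1 or x2), (not x2).
c1 = frozenset({1, 2})
c2 = frozenset({-1, 2})
c3 = frozenset({-2})

step1 = resolve(c1, c2, 1)     # derives the unit clause (x2)
step2 = resolve(step1, c3, 2)  # derives the empty clause, refuting the formula
```

This refutation uses two resolution steps. Proof minimization in the sense of the abstract asks for the smallest number of such derived clauses that still reaches the empty clause.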