NeurIPS 2023 Oral Papers

Generated on 2024-11-18 15:53:49 by PubSummarizer

A Measure-Theoretic Axiomatisation of Causality

Paper URL: https://openreview.net/attachment?id=sPLTQSf6GI&name=pdf

Topics

causality, measure theory, probability, causal spaces, intervention

Summary

This paper proposes a measure-theoretic axiomatization of causality, addressing the lack of a universally accepted framework by introducing the concept of "causal spaces." These spaces integrate probability theory with causal information through "causal kernels," which describe the effects of interventions on systems. The authors argue that using Kolmogorov's measure-theoretic framework as a foundation allows for a more rigorous understanding of causal relationships, particularly in complex scenarios involving cycles, latent variables, and stochastic processes, which existing frameworks struggle to address. The paper further compares causal spaces to traditional models like structural causal models and potential outcomes, highlighting their advantages and limitations while emphasizing the need for future work in embedding counterfactuals and actual causality within this new framework.

A Rigorous Link between Deep Ensembles and (Variational) Bayesian Methods

Paper URL: https://openreview.net/attachment?id=eTHawKFT4h&name=pdf

Topics

Bayesian methods, Deep learning, Variational inference, Ensemble methods, Wasserstein gradient flows

Summary

This paper establishes a rigorous connection between Bayesian methods, variational inference, and deep ensemble techniques in deep learning by reformulating the optimization problems they encounter. By leveraging the framework of Wasserstein gradient flows, the authors unify various approaches for uncertainty quantification and demonstrate that different algorithms arise from choices related to regularizers in a generalized variational inference context. The paper introduces novel algorithms like interacting deep ensembles, which are shown to converge to a global minimizer, and contrasts their efficacy with traditional methods. Through theoretical insights and numerical experiments, the authors highlight the advantages of the proposed methodologies over conventional variational inference approaches, suggesting pathways for future advancements in Bayesian deep learning.

A Single-Loop Accelerated Extra-Gradient Difference Algorithm with Improved Complexity Bounds for Constrained Minimax Optimization

Paper URL: https://openreview.net/attachment?id=wIlmx4bHrO&name=pdf

Topics

minimax optimization, algorithm acceleration, constrained optimization, gradient methods, convergence analysis

Summary

This paper introduces a novel single-loop Extra-Gradient Difference Acceleration (EGDA) algorithm designed for solving constrained nonconvex-nonconcave (NC-NC) minimax problems, achieving a significant improvement in convergence rates. By utilizing a new extra-gradient difference step and incorporating momentum acceleration, the proposed algorithm attains a complexity of O(ε^-2) for finding ε-stationary points, outperforming existing methods that achieve complexities of O(ε^-4) or Õ(ε^-3). Additionally, the algorithm's applicability extends to constrained nonconvex-concave (NC-C) and convex-nonconcave (C-NC) problems, retaining the same optimal complexity of O(ε^-2). The theoretical analysis and numerical experiments validate the enhanced performance and efficiency of the EGDA algorithm in various minimax optimization contexts.
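
For reference, the classic extra-gradient update that the extra-gradient difference step builds on can be sketched in a few lines. The code below is a generic extragradient iteration on a toy bilinear saddle problem, with a simple ball projection standing in for the constraint sets; it is not the paper's EGDA update and omits the difference step and momentum terms.

    import numpy as np

    def project(v, radius=1.0):
        # Euclidean projection onto an l2 ball, standing in for the constraint sets.
        norm = np.linalg.norm(v)
        return v if norm <= radius else v * (radius / norm)

    def extragradient_step(x, y, grad_x, grad_y, eta=0.1):
        """One classic extra-gradient step for min_x max_y f(x, y)."""
        # Extrapolation: probe the gradients at a look-ahead point.
        x_half = project(x - eta * grad_x(x, y))
        y_half = project(y + eta * grad_y(x, y))
        # Update: move from the original iterate using the look-ahead gradients.
        x_new = project(x - eta * grad_x(x_half, y_half))
        y_new = project(y + eta * grad_y(x_half, y_half))
        return x_new, y_new

    # Toy bilinear saddle problem f(x, y) = x^T A y.
    A = np.array([[1.0, 0.5], [-0.5, 1.0]])
    x, y = np.ones(2), np.ones(2)
    for _ in range(200):
        x, y = extragradient_step(
            x, y,
            grad_x=lambda x, y: A @ y,
            grad_y=lambda x, y: A.T @ x,
        )
    print("approximate saddle point:", x, y)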

A U-turn on Double Descent: Rethinking Parameter Counting in Statistical Learning

Paper URL: https://openreview.net/attachment?id=O0Lz8XZT2b&name=pdf

Topics

double descent, model complexity, statistical learning, effective parameters, machine learning

Summary

The paper challenges the conventional understanding of the relationship between model complexity and prediction error, traditionally represented by a U-shaped curve. It addresses the phenomenon of "double descent," where test error decreases again after a peak as model parameters exceed the number of training samples. The authors argue that previous claims of double descent in classical statistical learning methods like linear regression, trees, and boosting do not contradict traditional statistical intuition. They demonstrate that these observed curves can be explained by considering multiple underlying complexity axes and that when effective parameter counts are measured appropriately, the double descent shapes revert to traditional U-curves. By interpreting various machine learning models as smoothers, the study provides a new lens for understanding parameter counting and emphasizes the importance of effective parameters in assessing model performance on unseen data.
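
One standard way to make "effective parameters" concrete for models that act as linear smoothers (predictions of the form y_hat = S y) is to count trace(S). The sketch below illustrates that idea for ridge regression on made-up data; it is my own minimal example of counting effective rather than raw parameters, not the paper's estimators.

    import numpy as np

    rng = np.random.default_rng(0)
    n, p = 50, 200                      # more raw parameters than samples
    X = rng.normal(size=(n, p))
    y = X[:, 0] + 0.1 * rng.normal(size=n)

    for lam in [1e-3, 1.0, 100.0]:
        # Ridge regression is a linear smoother: y_hat = S y with
        # S = X (X^T X + lam I)^{-1} X^T.
        S = X @ np.linalg.solve(X.T @ X + lam * np.eye(p), X.T)
        effective_params = np.trace(S)  # effective degrees of freedom <= min(n, p)
        print(f"lambda={lam:g}  raw params={p}  effective params={effective_params:.1f}")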

Abide by the law and follow the flow: conservation laws for gradient flows

Paper URL: https://openreview.net/attachment?id=kMueEV8Eyy&name=pdf

Topics

gradient flows, conservation laws, implicit bias, machine learning, ReLU networks

Summary

The paper investigates the geometric properties of gradient descent dynamics in large machine learning models, emphasizing the concept of conservation laws—quantities preserved during the optimization process. It rigorously defines these laws, demonstrates how to compute the maximal number of independent conservation laws using Lie algebra techniques, and presents algorithms to identify polynomial conservation laws. The authors showcase their findings through various examples, particularly focusing on ReLU network architectures, confirming that existing conservation laws are complete and no other independent laws exist. This work contributes to understanding the implicit bias of optimization initialization and generalization in over-parameterized models, paving the way for further exploration of optimization dynamics in machine learning.

Additive Decoders for Latent Variables Identification and Cartesian-Product Extrapolation

Paper URL: https://openreview.net/attachment?id=R6KJN1AUAR&name=pdf

Topics

latent variables, additive decoders, representation learning, identifiability, Cartesian-product extrapolation

Summary

This paper presents a novel approach to latent variable identification and out-of-support image generation using a specific class of decoders termed "additive decoders." These decoders are particularly effective for images that can be represented as sums of object-specific images, enabling both the identification of latent variables up to permutation and the generation of new images through a process called Cartesian-product extrapolation. The authors establish theoretical conditions under which these decoders guarantee identifiability and demonstrate empirically that additivity is crucial for both identifiability and extrapolation in simulated datasets. This work contributes to the understanding of object-centric representation learning and nonlinear independent component analysis by providing insights into the mathematical foundations that allow for effective disentanglement of latent factors.
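
A minimal sketch of the additive structure studied in the paper: the decoder is a sum of per-block decoders, so each latent block contributes its own object-specific image, and swapping blocks between samples gives Cartesian-product recombinations. The shapes, tiny decoders, and data below are made up; the paper's architectures and training objective are not reproduced.

    import numpy as np

    rng = np.random.default_rng(0)

    def block_decoder(z_block, W):
        # One object-specific decoder: a tiny map from a latent block to an image.
        return np.tanh(z_block @ W).reshape(8, 8)

    # Latent vector partitioned into B blocks; one decoder (weight matrix) per block.
    B, d_block, img_pixels = 3, 4, 64
    weights = [rng.normal(scale=0.5, size=(d_block, img_pixels)) for _ in range(B)]
    z = rng.normal(size=(B, d_block))

    # Additive decoder: the image is the sum of the per-block object images.
    image = sum(block_decoder(z[b], weights[b]) for b in range(B))
    print("image shape:", image.shape)

    # Cartesian-product extrapolation: recombine latent blocks never seen together
    # by swapping in a block value from another sample.
    z_new = z.copy()
    z_new[0] = rng.normal(size=d_block)
    image_new = sum(block_decoder(z_new[b], weights[b]) for b in range(B))
    print("recombined image shape:", image_new.shape)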

Are Emergent Abilities of Large Language Models a Mirage?

Paper URL: https://openreview.net/attachment?id=ITw9edRDlD&name=pdf

Topics

emergent abilities, large language models, metric choice, model performance, AI scaling

Summary

This paper challenges the notion that large language models (LLMs) exhibit emergent abilities—sudden, unpredictable enhancements in performance as model size increases—arguing instead that these phenomena may stem from the selection of metrics used to evaluate model outputs. The authors propose that nonlinear or discontinuous metrics can create the illusion of emergent abilities, while linear or continuous metrics reveal smoother, more predictable performance improvements. They provide a mathematical framework and conduct multiple analyses using the InstructGPT/GPT-3 family and the BIG-Bench benchmark, demonstrating that changing the evaluation metric can eliminate perceived emergent abilities. Their findings suggest that emergent abilities might not be intrinsic properties of the models but rather artifacts of the measurement techniques employed by researchers.
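
A toy numerical illustration of the paper's point, with made-up numbers: per-token accuracy that improves smoothly with scale looks "emergent" once it is scored with an all-or-nothing exact-match metric over a multi-token answer.

    import numpy as np

    # Hypothetical per-token accuracy improving smoothly (linearly in log-scale).
    model_scales = np.logspace(7, 11, 9)      # parameter counts (made-up)
    per_token_acc = np.linspace(0.50, 0.95, 9)

    answer_length = 10  # all tokens must be right for an exact match

    # Nonlinear/discontinuous metric: exact match of the whole answer looks "emergent".
    exact_match = per_token_acc ** answer_length

    for n, t, e in zip(model_scales, per_token_acc, exact_match):
        print(f"{n:9.0e} params  per-token acc {t:.2f}  exact match {e:.4f}")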

Brain Diffusion for Visual Exploration: Cortical Discovery using Large Scale Generative Models

Paper URL: https://openreview.net/attachment?id=9VqMaSjf7U&name=pdf

Topics

brain imaging, visual cortex, generative models, diffusion models, image synthesis

Summary

The paper introduces Brain Diffusion for Visual Exploration (BrainDiVE), a novel data-driven method that utilizes large-scale generative models to synthesize images aimed at activating specific regions of the human visual cortex, thereby enhancing our understanding of its functional organization. Traditional approaches in neuroscience often rely on manually curated stimuli, which can limit the exploration of brain function. In contrast, BrainDiVE employs diffusion models guided by fMRI data to generate images with high semantic specificity for category-selective regions, allowing for the identification of subtle differences and novel sub-regions within these areas. The results demonstrate that BrainDiVE effectively elucidates fine-grained preferences in the visual system and offers a promising avenue for further investigation into cortical organization.

Bridging Discrete and Backpropagation: Straight-Through and Beyond

Paper URL: https://openreview.net/attachment?id=mayAyPrhJI&name=pdf

Topics

Gradient Estimation, ReinMax, Deep Learning, Discrete Variables, Backpropagation

Summary

The paper presents a new approach called ReinMax to enhance gradient estimation for deep learning models dealing with discrete latent variables, addressing limitations of backpropagation which is traditionally suited for continuous variables. The authors analyze the Straight-Through (ST) estimator, demonstrating its first-order approximation nature, and propose ReinMax as a second-order accurate method by integrating Heun's method without requiring second-order derivatives. Extensive experiments show that ReinMax outperforms existing state-of-the-art methods in various tasks, offering insights into hyperparameter optimization and improving the understanding of gradient estimators for discrete variables.
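
For context, the baseline Straight-Through estimator that the paper analyses can be written in a few lines of PyTorch-style code: sample a hard one-hot code on the forward pass, but let gradients flow through the softmax. This is only the first-order ST trick, not the ReinMax update itself, which adds a Heun-style second-order correction on top.

    import torch
    import torch.nn.functional as F

    def straight_through_sample(logits):
        """Forward: hard one-hot sample. Backward: gradient of the softmax."""
        probs = F.softmax(logits, dim=-1)
        index = torch.multinomial(probs, num_samples=1)
        hard = F.one_hot(index.squeeze(-1), num_classes=logits.shape[-1]).float()
        # hard + probs - probs.detach() equals `hard` in value, but its gradient
        # with respect to the logits is that of `probs` (the ST approximation).
        return hard + probs - probs.detach()

    logits = torch.randn(4, 8, requires_grad=True)
    code = straight_through_sample(logits)       # one-hot codes, differentiable
    loss = (code * torch.arange(8.0)).sum()      # any downstream loss
    loss.backward()
    print(logits.grad.shape)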

Bridging RL Theory and Practice with the Effective Horizon

Paper URL: https://openreview.net/attachment?id=Lr2swAfwff&name=pdf

Topics

reinforcement learning, effective horizon, sample complexity, deep learning, empirical performance

Summary

The paper introduces a new theoretical complexity measure called the effective horizon, which aims to bridge the gap between reinforcement learning (RL) theory and practice. The authors analyze deep RL algorithms like PPO and DQN in conjunction with a newly constructed dataset, BRIDGE, comprising 155 deterministic Markov Decision Processes (MDPs). They discover that a property related to the alignment of Q-values under random and optimal policies significantly predicts the success of deep RL algorithms. The effective horizon serves as a more reliable predictor of empirical performance than traditional sample complexity bounds, revealing its potential to explain the impact of techniques like reward shaping and pre-trained exploration policies. Overall, the findings suggest that understanding the effective horizon can lead to better theoretical insights and practical improvements in deep RL.

Causal normalizing flows: from theory to practice

Paper URL: https://openreview.net/attachment?id=QIFoCI7ca1&name=pdf

Topics

causal inference, normalizing flows, autoregressive models, causal graphs, interventions

Summary

This paper explores the application of causal normalizing flows (NFs) for causal inference, demonstrating that causal models can be identified using autoregressive NFs from observational data when the causal ordering is known. It first establishes a theoretical framework linking non-linear independent component analysis (ICA) to causal inference, followed by a detailed examination of design choices for causal NFs that effectively capture the underlying causal data-generating processes. The authors introduce a do-operator within the causal NF framework to facilitate interventional and counterfactual analyses. Empirical evaluations show that the proposed causal NFs outperform traditional approaches in both accuracy and efficiency, effectively handling real-world scenarios involving mixed discrete-continuous data and partial causal knowledge, as demonstrated through extensive experiments including a real-world use case on fairness auditing in credit assessments.

Characteristic Circuits

Paper URL: https://openreview.net/attachment?id=5W7cXno10k&name=pdf

Topics

characteristic circuits, probabilistic models, uncertainty reasoning, spectral domain, density estimation

Summary

The paper introduces characteristic circuits (CCs), a novel family of tractable probabilistic models designed to effectively manage heterogeneous data by utilizing characteristic functions in the spectral domain. Unlike traditional probabilistic circuits (PCs), which struggle with mixed data types and often lack closed-form density functions, CCs provide a unified framework that allows for efficient learning and inference of high-dimensional distributions without relying on a specific base measure. The authors demonstrate that CCs can outperform state-of-the-art density estimators on various benchmark datasets, showcasing their ability to compute densities, marginals, and moments efficiently. This work highlights the potential of CCs for enhancing probabilistic modeling in complex, real-world scenarios.

Cinematic Mindscapes: High-quality Video Reconstruction from Brain Activity

Paper URL: https://openreview.net/attachment?id=i913TUOvTK&name=pdf

Topics

video reconstruction, brain activity, fMRI, deep learning, cognitive neuroscience

Summary

The paper presents MinD-Video, a novel method for reconstructing high-quality videos from brain activity recorded via functional Magnetic Resonance Imaging (fMRI). This work addresses the challenge of translating continuous visual experiences into video format, which has been less explored compared to static image reconstruction. MinD-Video employs a progressive learning approach, integrating masked brain modeling and multimodal contrastive learning with spatiotemporal attention, along with co-training an augmented Stable Diffusion model. The method demonstrates superior video reconstruction capabilities, achieving 85% accuracy in semantic tasks and a structural similarity index (SSIM) of 0.19, outperforming previous benchmarks by 45%. Additionally, the model shows biological plausibility, reflecting established cognitive processes and offering insights into the neural mechanisms of visual perception.

Clifford Group Equivariant Neural Networks

Paper URL: https://openreview.net/attachment?id=n84bzMrGUD&name=pdf

Topics

Clifford Group, Equivariant Neural Networks, Clifford Algebra, Orthogonal Groups, Machine Learning

Summary

The paper introduces Clifford Group Equivariant Neural Networks (CGENNs), a new framework for constructing equivariant neural networks leveraging the properties of the Clifford algebra and its associated groups. The authors study the Clifford group, whose elements act as orthogonal automorphisms of the entire Clifford algebra while preserving the multivector grading. They demonstrate that this group action preserves both the vector space and multiplicative structures of the algebra, facilitating the development of equivariant neural network layers. CGENNs are shown to generalize effectively to inner-product spaces of arbitrary dimension and achieve state-of-the-art performance on several tasks, including physics experiments and geometric computations, showcasing the advantages of incorporating geometric properties into neural network architectures.

Conformal Meta-learners for Predictive Inference of Individual Treatment Effects

Paper URL: https://openreview.net/attachment?id=IwnINorSZ5&name=pdf

Topics

individual treatment effects, conformal prediction, meta-learners, predictive inference, machine learning

Summary

This paper presents a novel framework called conformal meta-learners for predictive inference of individual treatment effects (ITEs) using machine learning. Unlike traditional methods that primarily yield point estimates of conditional average treatment effects (CATE), conformal meta-learners provide predictive intervals for ITEs by applying conformal prediction (CP) to CATE meta-learners. The authors demonstrate that these conformal meta-learners are valid under certain stochastic dominance conditions and can efficiently estimate ITEs while maintaining desirable properties of CATE estimators. Through numerical experiments, the framework shows effective coverage and efficiency in comparison to existing methods. This work addresses challenges in causal inference by enabling direct inference on ITEs, thereby improving the understanding of treatment effect heterogeneity across individuals.
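
The conformal step itself can be sketched with a generic split-conformal recipe: fit a model to pseudo-outcomes, compute absolute residuals on a held-out calibration split, and use their quantile to widen point estimates into intervals. The snippet below is my own simplification with synthetic stand-in pseudo-outcomes; the paper's pseudo-outcome constructions and stochastic-dominance conditions are not reproduced.

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor

    rng = np.random.default_rng(0)

    # Assume pseudo-outcomes (e.g., from an IPW or doubly-robust transformation)
    # have already been computed for each unit; here they are synthetic.
    X = rng.normal(size=(2000, 5))
    pseudo_y = X[:, 0] + rng.normal(scale=0.5, size=2000)   # stand-in pseudo-outcomes

    # Split into a proper training set and a calibration set.
    X_tr, y_tr = X[:1000], pseudo_y[:1000]
    X_cal, y_cal = X[1000:], pseudo_y[1000:]

    model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_tr, y_tr)

    # Split conformal prediction: quantile of absolute calibration residuals.
    alpha = 0.1
    scores = np.abs(y_cal - model.predict(X_cal))
    q = np.quantile(scores, np.ceil((1 - alpha) * (len(scores) + 1)) / len(scores))

    # Predictive interval for the treatment effect of a new unit.
    x_new = rng.normal(size=(1, 5))
    point = model.predict(x_new)[0]
    print(f"ITE interval: [{point - q:.2f}, {point + q:.2f}]")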

DiffuseBot: Breeding Soft Robots With Physics-Augmented Generative Diffusion Models

Paper URL: https://openreview.net/attachment?id=1zo4iioUEs&name=pdf

Topics

soft robotics, generative models, diffusion models, co-design, physics-based simulation

Summary

The paper introduces DiffuseBot, a novel framework that utilizes physics-augmented generative diffusion models to design and optimize soft robots. By integrating physical simulations into the diffusion process, DiffuseBot generates robot morphologies that excel in various tasks, including locomotion and manipulation. The framework allows for co-optimization of robot design and control by leveraging insights from differentiable simulations, bridging the gap between virtual and physical robot capabilities. The authors demonstrate the efficacy of DiffuseBot through extensive simulations and a proof-of-concept physical robot, highlighting its potential for accelerating design cycles and enhancing robotic performance across diverse applications.

Direct Preference Optimization: Your Language Model is Secretly a Reward Model

Paper URL: https://openreview.net/attachment?id=HPuSIXJaa9&name=pdf

Topics

Direct Preference Optimization, Language Models, Human Preferences, Reinforcement Learning, Policy Training

Summary

The paper introduces Direct Preference Optimization (DPO), an innovative algorithm designed to enhance the alignment of large-scale unsupervised language models (LMs) with human preferences without relying on traditional reinforcement learning from human feedback (RLHF). DPO simplifies the optimization process by directly optimizing a language model's policy based on a binary cross-entropy objective derived from human preference data, thereby eliminating the need for an explicit reward model and complex sampling strategies during training. Experimental results demonstrate that DPO performs comparably or better than existing RLHF methods in tasks such as sentiment modulation, summarization, and single-turn dialogue, while being more stable and computationally efficient, thus significantly lowering the barrier for implementing preference-based tuning of language models.
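
The DPO objective is compact enough to state directly. Below is a minimal sketch of the per-pair loss, assuming the summed log-probabilities of the preferred and dispreferred responses under the trainable policy and the frozen reference model have already been computed; batching, masking, and the full training loop are omitted.

    import torch
    import torch.nn.functional as F

    def dpo_loss(policy_chosen_logp, policy_rejected_logp,
                 ref_chosen_logp, ref_rejected_logp, beta=0.1):
        """Binary cross-entropy on the implicit reward margin.

        Each argument is the summed log-probability of a whole response
        (chosen = human-preferred, rejected = dispreferred) under the
        trainable policy or the frozen reference model.
        """
        chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
        rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
        # Maximise the probability that the chosen response gets the higher reward.
        return -F.logsigmoid(chosen_reward - rejected_reward).mean()

    # Toy tensors standing in for log-probs computed by a language model.
    loss = dpo_loss(torch.tensor([-12.3]), torch.tensor([-15.8]),
                    torch.tensor([-13.0]), torch.tensor([-14.9]))
    print(loss.item())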

EgoEnv: Human-centric environment representations from egocentric video

Paper URL: https://openreview.net/attachment?id=rybsHQ4DXy&name=pdf

Topics

egocentric video, environment representation, predictive modeling, video understanding, augmented reality

Summary

The paper presents EgoEnv, a novel approach for learning human-centric environment representations from egocentric video that enhances standard video understanding techniques by linking visual features to the underlying physical space. By training models on simulated 3D environments, EgoEnv captures the camera-wearer's local surroundings, allowing for predictive modeling of unseen environmental contexts. The approach demonstrates improved performance on human-centric tasks, such as room classification and natural language query localization in real-world video datasets, outperforming traditional clip-based methods. The findings highlight the potential of using simulated data to transfer knowledge to complex real-world scenarios, thereby setting a new state-of-the-art in the field.

Emergence of Shape Bias in Convolutional Neural Networks through Activation Sparsity

Paper URL: https://openreview.net/attachment?id=QzcZb3fWmW&name=pdf

Topics

shape bias, convolutional neural networks, sparse coding, object recognition, image synthesis

Summary

This paper investigates the difference in bias between human visual systems and convolutional neural networks (CNNs), which tend to favor texture over shape in object recognition. The authors propose that enforcing sparse coding, specifically through a non-differentiable Top-K operation, can induce a shape bias in CNNs. By implementing this sparse coding mechanism, they demonstrate that CNNs can better decompose objects into structural parts, leading to improved robustness against texture-based distractions and enhanced coherence in synthetic images. Their experiments reveal that Top-K responses primarily encode structural information, while non-Top-K responses focus on texture, thereby bridging the bias gap between machine and human vision. The findings suggest that sparse coding principles might play a role in the shape bias observed in human visual perception.
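
A minimal sketch of the kind of Top-K activation sparsity described above: at each spatial location, keep only the K largest channel responses and zero the rest, letting gradients flow through the surviving entries. The layer placement and K schedule used in the paper are not reproduced here.

    import torch

    def topk_channel_sparsity(x, k):
        """Keep the k largest responses along the channel dimension.

        x: feature map of shape (batch, channels, height, width).
        The Top-K selection itself is non-differentiable; gradients simply flow
        through the surviving entries (the mask is treated as a constant).
        """
        _, indices = x.topk(k, dim=1)
        mask = torch.zeros_like(x).scatter_(1, indices, 1.0)
        return x * mask

    x = torch.randn(2, 64, 16, 16, requires_grad=True)
    sparse = topk_channel_sparsity(x, k=8)      # only 8 of 64 channels survive per pixel
    sparse.sum().backward()                     # gradients reach the kept activations only
    print((sparse != 0).float().mean().item())  # roughly 8/64 of entries are non-zero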

Entropic Neural Optimal Transport via Diffusion Processes

Paper URL: https://openreview.net/attachment?id=fHyLsfMDIs&name=pdf

Topics

optimal transport, entropic optimal transport, neural networks, Schrödinger bridge, machine learning

Summary

The paper presents a new neural algorithm for computing the entropic optimal transport (EOT) plan between continuous probability distributions using samples. It introduces a saddle-point reformulation of the dynamic EOT, also known as the Schrödinger Bridge problem, enabling an end-to-end learning approach that is computationally efficient and stable even for small values of the entropy regularization coefficient. The authors demonstrate the efficacy of their method through empirical results on various large-scale EOT tasks, showing significant improvement over existing techniques in terms of performance and applicability to real-world problems, particularly in generating diverse outputs for tasks like image super-resolution. The proposed algorithm and its implementation are publicly accessible.

Evaluating Post-hoc Explanations for Graph Neural Networks via Robustness Analysis

Paper URL: https://openreview.net/attachment?id=eD534mPhAg&name=pdf

Topics

Graph Neural Networks, Explainability, Evaluation Metrics, Adversarial Robustness, Out-of-Distribution Generalization

Summary

This paper presents a novel evaluation metric for assessing the explainability of Graph Neural Networks (GNNs) called OOD-resistant Adversarial Robustness (OAR). Traditional evaluation methods often struggle with out-of-distribution (OOD) issues, leading to unreliable assessments of explanations. OAR addresses these limitations by leveraging adversarial robustness principles, evaluating the quality of explanations based on their resistance to adversarial attacks while ensuring adherence to the original data distribution through an OOD reweighting mechanism. Additionally, a simplified version, SimOAR, is proposed to enhance computational efficiency, particularly for large datasets, with minimal performance trade-offs. Extensive empirical experiments demonstrate that both OAR and SimOAR significantly outperform existing evaluation metrics, providing more reliable and consistent assessments of GNN explanations.

Exact Bayesian Inference on Discrete Models via Probability Generating Functions: A Probabilistic Programming Approach

Paper URL: https://openreview.net/attachment?id=FtNruwFEs3&name=pdf

Topics

Bayesian inference, probabilistic programming, generating functions, exact inference, discrete models

Summary

This paper introduces an exact Bayesian inference method for discrete statistical models, leveraging probability generating functions (PGFs) to facilitate exact computation of posterior probabilities, moments, and variances, even with infinite support and continuous priors. The authors present a new probabilistic programming language, SGCL, designed to express complex discrete and continuous statistical models while ensuring that every program can be translated into a generating function for automated inference. They developed a tool called Genfer, which utilizes automatic differentiation for efficient computation without requiring computer algebra, demonstrating superior performance compared to existing exact inference tools on various benchmarks. The approach is shown to be competitive with Monte Carlo methods on real-world problems, achieving exact results while avoiding approximation errors, thereby addressing significant challenges in Bayesian statistics related to posterior distribution computation.

Fine-Tuning Language Models with Just Forward Passes

Paper URL: https://openreview.net/attachment?id=Vota6rFhBQ&name=pdf

Topics

zeroth-order optimization, memory-efficient training, language models, fine-tuning, non-differentiable objectives

Summary

This paper introduces MeZO, a memory-efficient zeroth-order (ZO) optimizer designed for fine-tuning large language models (LMs) without the memory overhead associated with traditional backpropagation. By adapting the ZO-SGD method to operate in-place, MeZO allows the training of models with billions of parameters using the same memory footprint as inference. Experiments demonstrate that MeZO significantly outperforms in-context learning and linear probing, achieving comparable performance to full fine-tuning while reducing memory requirements by up to 12 times. Furthermore, MeZO is compatible with parameter-efficient tuning techniques and can optimize non-differentiable objectives, highlighting its versatility for a variety of downstream tasks. Theoretical insights support the empirical results, showing that adequate pre-training and task prompts facilitate effective optimization with MeZO, even for large models.
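
The core of MeZO is a two-point SPSA gradient estimate computed with in-place parameter perturbations, so only inference-level memory is needed. The following is a simplified single-step sketch in PyTorch, regenerating the random direction from a saved seed rather than storing it; the step sizes are arbitrary placeholders rather than the paper's settings.

    import torch

    def mezo_step(model, loss_fn, eps=1e-3, lr=1e-6, seed=0):
        """One zeroth-order SGD step using a two-point SPSA estimate, in place."""
        params = [p for p in model.parameters() if p.requires_grad]

        def perturb(scale):
            # Re-create the same Gaussian direction from the seed instead of storing it.
            gen = torch.Generator().manual_seed(seed)
            for p in params:
                z = torch.randn(p.shape, generator=gen)
                p.data.add_(scale * eps * z)

        perturb(+1); loss_plus = loss_fn(model)
        perturb(-2); loss_minus = loss_fn(model)   # now at theta - eps * z
        perturb(+1)                                # restore the original parameters

        grad_scale = (loss_plus - loss_minus) / (2 * eps)

        # Update theta <- theta - lr * grad_scale * z, regenerating z once more.
        gen = torch.Generator().manual_seed(seed)
        for p in params:
            z = torch.randn(p.shape, generator=gen)
            p.data.add_(-lr * grad_scale * z)

    model = torch.nn.Linear(10, 1)
    x, y = torch.randn(32, 10), torch.randn(32, 1)
    loss_fn = lambda m: torch.nn.functional.mse_loss(m(x), y).item()
    mezo_step(model, loss_fn)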

Generalizing Nonlinear ICA Beyond Structural Sparsity

Paper URL: https://openreview.net/attachment?id=gI1SOgW3kw&name=pdf

Topics

Nonlinear ICA, Identifiability, Structural Sparsity, Machine Learning, Latent Variables

Summary

This paper addresses the limitations of nonlinear independent component analysis (ICA) by proposing new identifiability results that extend the framework beyond the conventional assumptions of structural sparsity and independence among sources. The authors demonstrate that identifiability can be achieved in cases of undercompleteness (more observed variables than sources), partial sparsity, and source dependence. They introduce flexible grouping structures of sources, allowing for the identification of latent variables even when certain sparsity or independence conditions are violated. Empirical validation is provided through experiments on synthetic and real-world datasets, suggesting the practical applicability of the proposed framework for scientific discovery and disentangled representations in machine learning.

Going beyond persistent homology using persistent homology

Paper URL: https://openreview.net/attachment?id=27TdrEvqLD&name=pdf

Topics

persistent homology, graph neural networks, topological data analysis, color-separating sets, graph classification

Summary

This paper explores the limitations and expressivity of persistent homology (PH) when applied to attributed graphs, particularly in the context of message-passing graph neural networks (MP-GNNs). The authors introduce the concept of color-separating sets to fully characterize the class of graphs that PH can distinguish based on the persistence of connected components derived from vertex and edge colors. They demonstrate that vertex- and edge-level PH have distinct expressive powers and propose a novel method called RePHINE that integrates both levels to enhance graph classification performance. Theoretical results underpin RePHINE's advantages, which are empirically validated across various datasets, showing significant improvements over standard PH methods and existing topological neural networks.

How to Turn Your Knowledge Graph Embeddings into Generative Models

Paper URL: https://openreview.net/attachment?id=RSGNGiB1q4&name=pdf

Topics

knowledge graph embeddings, generative models, link prediction, probabilistic circuits, maximum likelihood estimation

Summary

This paper presents a novel approach to transforming popular knowledge graph embedding (KGE) models, such as ComplEx and RESCAL, into generative models known as generative KGE circuits (GeKCs). By interpreting KGE score functions as structured computational graphs, the authors demonstrate that these models can achieve efficient maximum-likelihood estimation (MLE) and sampling while adhering to logical constraints. The proposed methods, which include non-negative activation restrictions and squaring of outputs, enhance the scalability and performance of KGE models in link prediction tasks across large graphs with millions of entities. Experimental results indicate that the GeKCs maintain competitive link prediction accuracy compared to traditional KGE models while providing better probabilistic interpretations and allowing for the integration of logical constraints.

Human-like Few-Shot Learning via Bayesian Reasoning over Natural Language

Paper URL: https://openreview.net/attachment?id=dVnhdm9MIg&name=pdf

Topics

Few-shot learning, Bayesian reasoning, Concept learning, Natural language processing, Inductive bias

Summary

This paper presents a model for few-shot concept learning that mimics human-like inductive reasoning by employing Bayesian methods over natural language hypotheses. The model generates candidate concepts expressed in natural language, which are evaluated against a learned prior based on human judgments, allowing for efficient inference across a diverse hypothesis space. By leveraging large language models, the approach captures human generalization patterns for abstract concepts, such as numerical sets, and demonstrates improved accuracy in concept-learning tasks compared to traditional Bayesian and program-learning models. The findings suggest that integrating human-like inductive biases into AI systems could enhance their data efficiency and generalization capabilities.

Image Captioners Are Scalable Vision Learners Too

Paper URL: https://openreview.net/attachment?id=A7feCufBhL&name=pdf

Topics

image captioning, contrastive pretraining, vision encoders, multimodal models, representation learning

Summary

This paper presents a comparative analysis of image captioning and contrastive pretraining approaches for developing vision encoders, specifically using a standard encoder-decoder transformer architecture. The authors demonstrate that image captioning as a standalone pretraining strategy yields competitive performance in classification tasks and outperforms contrastive pretraining on vision-and-language tasks. Through careful matching of training data, compute resources, and model capacity, they reveal that captioning exhibits superior scaling behavior and offers significant advantages for downstream multimodal applications. Additionally, they introduce a new pretraining technique called CapPa, which alternates between autoregressive and parallel decoding, further enhancing the performance of vision encoders. Overall, the findings challenge the prevailing notion that captioning is an inferior pretraining strategy, highlighting its potential for effective vision representation learning.

Improved Algorithms for Stochastic Linear Bandits Using Tail Bounds for Martingale Mixtures

Paper URL: https://openreview.net/attachment?id=TXoZiUZywf&name=pdf

Topics

stochastic linear bandits, upper confidence bounds, martingale mixtures, regret guarantees, convex programming

Summary

This paper introduces enhanced algorithms for the stochastic linear bandit problem, focusing on the development of tighter confidence sequences utilizing a novel tail bound for adaptive martingale mixtures. The algorithms, named Convex Martingale Mixture UCB (CMM-UCB) and Analytic Martingale Mixture UCB (AMM-UCB), leverage these confidence sequences to enable efficient action selection through convex programming, leading to competitive worst-case regret guarantees. The authors demonstrate that their confidence sequences outperform existing methods both theoretically and empirically, resulting in improved performance across various hyperparameter tuning tasks. The study highlights the importance of tighter confidence bounds in optimizing bandit algorithms and suggests potential avenues for future research, particularly in extending these results to non-linear reward functions.

Jailbroken: How Does LLM Safety Training Fail?

Paper URL: https://openreview.net/attachment?id=jA235JGM09&name=pdf

Topics

jailbreak, LLM safety, adversarial attacks, competing objectives, mismatched generalization

Summary

The paper investigates vulnerabilities in large language models (LLMs) like GPT-4 and Claude v1.3, focusing on how safety training fails against jailbreak attacks. It identifies two primary failure modes: competing objectives, where safety goals conflict with model capabilities, and mismatched generalization, where safety training does not cover all potential input scenarios the model can encounter. Using these insights, the authors constructed new jailbreak methods that successfully bypassed safety measures, demonstrating that despite extensive safety training, LLMs remain vulnerable to adversarial manipulation. The findings highlight the necessity for safety mechanisms to be as sophisticated as the models themselves and argue that merely scaling up models will not inherently solve these safety issues.

Learning Linear Causal Representations from Interventions under General Nonlinear Mixing

Paper URL: https://openreview.net/attachment?id=q131tA7HCT&name=pdf

Topics

causal representation learning, nonlinear mixing, identifiability, interventions, contrastive learning

Summary

This paper addresses the challenge of learning causal representations from interventions in scenarios where the mixing function is nonlinear and the latent variables are Gaussian. The authors establish strong identifiability results for models with unknown single-node interventions, extending prior work that focused on simpler cases. They present a contrastive learning algorithm designed to identify latent variables effectively and assess its performance across various tasks. The findings reveal that identifiability can be achieved without requiring knowledge of intervention targets or paired data, thus making significant strides in the field of causal representation learning, particularly in complex real-world applications.

Learning Transformer Programs

Paper URL: https://openreview.net/attachment?id=Pe9WxkN8Ff&name=pdf

Topics

Transformer Programs, mechanistic interpretability, machine learning, algorithmic problems, natural language processing

Summary

The paper presents a novel approach to training Transformers that are inherently interpretable by design, termed "Transformer Programs." Building on an existing programming language (RASP), the authors propose a modified Transformer architecture that can be trained via gradient-based optimization and subsequently converted into discrete, human-readable programs in Python. This method facilitates the interpretation of model behavior, enabling the debugging of errors and the identification of the circuits employed for problem-solving. The authors validate their approach through various tasks, including algorithmic challenges and natural language processing applications, demonstrating that Transformer Programs achieve comparable performance to standard Transformers while being significantly easier to interpret. Overall, the work aims to advance the field of mechanistic interpretability in machine learning by creating models that are both effective and understandable.

Linguistic Binding in Diffusion Models: Enhancing Attribute Correspondence through Attention Map Alignment

Paper URL: https://openreview.net/attachment?id=AOKU4nRw1W&name=pdf

Topics

text-to-image generation, linguistic binding, attention maps, diffusion models, attribute correspondence

Summary

This paper presents SynGen, an innovative approach to enhance attribute correspondence in text-to-image generation models, specifically addressing issues of improper binding where visual attributes fail to correctly align with their corresponding linguistic modifiers. By syntactically analyzing text prompts to identify entities and modifiers, SynGen employs a novel loss function that aligns cross-attention maps with the linguistic structure of the input prompt during inference. Through evaluation on three datasets, including a newly designed challenge set, SynGen demonstrates significant improvements over existing state-of-the-art methods in generating faithful images that accurately reflect the input descriptions, emphasizing the effectiveness of integrating linguistic information in the image generation process.

Monarch Mixer: A Simple Sub-Quadratic GEMM-Based Architecture

Paper URL: https://openreview.net/attachment?id=cB0BImqSS9&name=pdf

Topics

Machine Learning, Neural Networks, Sub-quadratic Architecture, Monarch Matrices, Performance Evaluation

Summary

The paper presents MONARCH MIXER (M2), a novel architecture designed to achieve sub-quadratic scaling in both sequence length and model dimensions, overcoming the limitations of existing architectures such as Transformers that scale quadratically. M2 employs Monarch matrices, a class of structured matrices that efficiently captures various linear transformations, ensuring high hardware efficiency on GPUs. The authors demonstrate M2's efficacy through experiments in three domains: non-causal BERT-style language modeling, ViT-style image classification, and causal GPT-style language modeling, where M2 outperforms or matches state-of-the-art models with fewer parameters and increased throughput. The findings suggest M2 could pave the way for more efficient machine learning models, warranting further exploration and optimization.

Nearly Tight Bounds For Differentially Private Multiway Cut

Paper URL: https://openreview.net/attachment?id=QDByreuQyk&name=pdf

Topics

differential privacy, min s-t cut, multiway k-cut, algorithms, graph theory

Summary

This paper presents significant advancements in differentially private algorithms for the min s-t cut and multiway k-cut problems, which are crucial for applications in graph theory and various machine learning tasks. The authors establish nearly tight bounds for both lower and upper error limits in the private min s-t cut algorithm, demonstrating that it can achieve privacy without compromising runtime efficiency. Their algorithm maintains an additive error of O(n) for edge-differential privacy while running at the speed of non-private algorithms. Furthermore, they introduce a novel approach for the multiway k-cut problem that reduces the additive error to O(n/log k), which is significantly more efficient than previous methods. The empirical evaluation supports the theoretical findings, showing that their algorithm's performance closely aligns with non-private counterparts, while also preserving data privacy.

Online RL in Linearly q^π-Realizable MDPs Is as Easy as in Linear MDPs If You Learn What to Ignore

Paper URL: https://openreview.net/attachment?id=HV85SiyrsV&name=pdf

Topics

reinforcement learning, Markov decision processes, linear realizability, online algorithms, sample complexity

Summary

This paper investigates online reinforcement learning (RL) within episodic Markov decision processes (MDPs) under the linear q^π-realizability assumption, which generalizes linear MDPs by allowing for states where all actions have approximately equal values. The authors propose a new algorithm, SkippyEleanor, which identifies states to ignore, effectively transforming the problem into a linear MDP setting. They demonstrate that this approach yields an ε-optimal policy after a polynomial number of interactions with the MDP, thus achieving the first polynomial-sample-complexity result for online RL in linearly q^π-realizable MDPs. The paper includes a thorough theoretical analysis and proves the algorithm's efficiency even in the presence of misspecification errors.

Optimal Learners for Realizable Regression: PAC Learning and Online Learning

Paper URL: https://openreview.net/attachment?id=w116w62fxH&name=pdf

Topics

PAC Learning, Online Learning, Realizable Regression, Statistical Complexity, Optimal Learners

Summary

This paper investigates the statistical complexity of realizable regression within the frameworks of Probably Approximately Correct (PAC) learning and online learning. The authors introduce a minimax instance optimal learner for realizable regression and propose new combinatorial dimensions that characterize learnability in these settings. They establish necessary and sufficient conditions for PAC learnability based on the scaled graph and DS dimensions, and they introduce an optimal online learner that achieves minimax optimal cumulative loss. The results highlight gaps in existing dimensions for regression, contrasting them with binary and multiclass classification, and resolve an open question pertaining to the characterization of online realizable regression. The work aims to deepen the understanding of learning theory, specifically regarding the complexities of real-valued function prediction.

Optimizing Solution-Samplers for Combinatorial Problems: The Landscape of Policy-Gradient Method

Paper URL: https://openreview.net/attachment?id=mmTy1iyU5G&name=pdf

Topics

combinatorial optimization, policy gradient methods, deep learning, optimization landscape, solution samplers

Summary

This paper presents a theoretical framework for analyzing the effectiveness of deep neural networks as solution generators for combinatorial optimization problems, specifically focusing on policy gradient methods. The authors investigate the existence of generative models that can produce approximately optimal solutions while ensuring a polynomial number of parameters and a benign optimization landscape that avoids sub-optimal stationary points. They provide a positive answer to this question for several well-known combinatorial problems, including Max-Cut, Min-Cut, Maximum-Weight-Bipartite-Matching, and the Traveling Salesman Problem. Additionally, the paper introduces novel regularization techniques to enhance the optimization process, demonstrating through theoretical and empirical evidence that these methods can mitigate issues related to vanishing gradients and local minima, thereby improving the performance of solution samplers in practical scenarios.

Ordering-based Conditions for Global Convergence of Policy Gradient Methods

Paper URL: https://openreview.net/attachment?id=sW8yGZ4uVJ&name=pdf

Topics

Policy Gradient Methods, Global Convergence, Linear Function Approximation, Softmax Policy Gradient, Natural Policy Gradient

Summary

The paper investigates the global convergence of policy gradient methods, specifically under linear function approximation for finite-arm bandits. It establishes that global convergence is not solely dependent on approximation error, challenging previous assumptions that it is a key factor. The authors demonstrate that both the standard Softmax policy gradient (PG) and natural policy gradient (NPG) can achieve global convergence even with non-zero approximation errors, contingent upon specific conditions related to the representation of policies and rewards. For NPG, convergence is guaranteed if the projection of the reward preserves the optimal action's rank, while for Softmax PG, a non-domination condition and the ability to maintain reward ranking are critical. Experimental results corroborate these theoretical findings, emphasizing a need to reassess the role of approximation error in characterizing the convergence properties of policy gradient methods.

Privacy Auditing with One (1) Training Run

Paper URL: https://openreview.net/attachment?id=f38EY21lBw&name=pdf

Topics

differential privacy, auditing, machine learning, empirical analysis, single training run

Summary

The paper introduces a novel framework for auditing differentially private (DP) machine learning systems using just one training run, capitalizing on the ability to independently add or remove multiple training examples. By linking differential privacy with statistical generalization, the authors demonstrate that their approach can yield meaningful empirical lower bounds on privacy parameters without the computational burden of running multiple models, which typically requires hundreds of training sessions. The methodology is validated through experiments with DP-SGD on the CIFAR-10 dataset, achieving significant lower bounds on privacy parameters while maintaining model accuracy. This work represents a significant advancement in privacy auditing, making it more feasible for large-scale machine learning applications.

Private Everlasting Prediction

Paper URL: https://openreview.net/attachment?id=y8UAQQHVTX&name=pdf

Topics

private learning, everlasting prediction, sample complexity, differential privacy, PAC learning

Summary

This paper introduces the concept of private everlasting prediction (PEP), which extends the notion of private prediction to accommodate an unlimited stream of classification queries while safeguarding the privacy of both the training set and the adaptive queries. The authors highlight the limitations of traditional private learners, which often exhibit high sample complexity, particularly in the context of learning threshold functions. They propose a generic construction for PEP that is applicable to concept classes with finite VC dimensions, demonstrating that their approach requires an initial training sample size that is quadratic in the VC dimension. The paper also discusses the implications of their findings for private prediction and the potential for efficient implementations in specific contexts, while leaving open questions about the reduction of sample complexity and computational efficiency.

QLoRA: Efficient Finetuning of Quantized LLMs

Paper URL: https://openreview.net/attachment?id=OUIFPHEgJU&name=pdf

Topics

Efficient finetuning, Quantized language models, Low Rank Adapters, Memory optimization, Chatbot performance

Summary

The paper introduces QLORA, a novel approach for efficiently finetuning quantized large language models (LLMs), specifically enabling the finetuning of a 65B parameter model on a single 48GB GPU without losing performance compared to traditional 16-bit methods. QLORA utilizes a frozen, 4-bit quantized pretrained model, which incorporates Low Rank Adapters (LoRA) to facilitate the training process. Key innovations include a new quantization technique called 4-bit NormalFloat (NF4), Double Quantization for further memory reduction, and Paged Optimizers to manage memory spikes. The resulting Guanaco model family achieves state-of-the-art performance on the Vicuna benchmark, closely rivaling ChatGPT while reducing memory requirements significantly. The paper also emphasizes the importance of dataset quality over size in model performance and evaluates chatbot capabilities using both human and GPT-4 assessments, revealing discrepancies in current benchmark evaluations.
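
One common way to wire up this recipe is with the Hugging Face transformers, peft, and bitsandbytes libraries (an assumption on my part; the paper ships its own code): load the frozen base model in 4-bit NF4 with double quantization and attach trainable LoRA adapters. The checkpoint name and LoRA hyperparameters below are placeholders.

    import torch
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig
    from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

    model_id = "some-org/some-65b-model"  # placeholder checkpoint name

    # 4-bit NF4 quantization with double quantization, compute in bfloat16.
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_use_double_quant=True,
        bnb_4bit_compute_dtype=torch.bfloat16,
    )

    model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config)
    model = prepare_model_for_kbit_training(model)

    # Attach low-rank adapters; only these small matrices are trained.
    lora_config = LoraConfig(
        r=16, lora_alpha=32, lora_dropout=0.05,
        target_modules=["q_proj", "v_proj"],   # placeholder module names
        task_type="CAUSAL_LM",
    )
    model = get_peft_model(model, lora_config)
    model.print_trainable_parameters()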

Random Cuts are Optimal for Explainable k-Medians

Paper URL: https://openreview.net/attachment?id=MFWgLCWgUB&name=pdf

Topics

explainable clustering, k-medians, competitive ratio, RANDOM COORDINATE CUT, decision trees

Summary

This paper establishes that the RANDOM COORDINATE CUT algorithm achieves the optimal competitive ratio for explainable k-medians clustering in the ℓ1 norm, matching the lower bound proposed by Dasgupta et al. (2020). The authors analyze the algorithm's performance, demonstrating that its competitive ratio is bounded by 2 ln k + 2, which matches the previously established Ω(log k) lower bound. The study emphasizes the importance of explainability in machine learning clustering techniques by employing threshold decision trees to enhance interpretability, thus allowing algorithmic decisions to be better understood in critical applications. Additionally, the paper provides a straightforward analysis through the concept of a Set Elimination Game, which serves as a foundation for evaluating the algorithm's efficiency.

Rethinking Bias Mitigation: Fairer Architectures Make for Fairer Face Recognition

Paper URL: https://openreview.net/attachment?id=1vzF4zWQ1E&name=pdf

Topics

face recognition, bias mitigation, neural architecture search, fairness, hyperparameter optimization

Summary

The paper "Rethinking Bias Mitigation: Fairer Architectures Make for Fairer Face Recognition" by Dooley et al. addresses the inherent biases in face recognition systems, traditionally attributed to biased training data. The authors argue that biases are also embedded within the neural network architectures themselves. They conduct a large-scale analysis revealing significant impacts of architectural choices and hyperparameters on fairness. By utilizing a novel approach that combines neural architecture search (NAS) and hyperparameter optimization (HPO), the authors develop models that outperform existing architectures in both accuracy and multiple fairness metrics on prominent datasets like CelebA and VGGFace2. These new models demonstrate promising generalization capabilities across various datasets and sensitive attributes, thereby offering a new paradigm for achieving fairness in face recognition systems. The code and models are made publicly available for further research and applications.

Rotating Features for Object Discovery

Paper URL: https://openreview.net/attachment?id=fg7iyNK81W&name=pdf

Topics

binding problem, object discovery, machine learning, Rotating Features, continuous representations

Summary

This paper introduces Rotating Features, an innovative approach to scaling continuous and distributed object-centric representations in machine learning, specifically addressing the binding problem in human cognition. The authors critique existing slot-based methods for their limitations in representing uncertainty and flexibility in object representation, proposing Rotating Features as a generalization of complex-valued features to higher dimensions. This method enhances object discovery capabilities from simple toy datasets to complex real-world data by leveraging a novel evaluation procedure and applying pretrained features. The results demonstrate that Rotating Features can effectively represent multiple objects simultaneously and adapt well to various input complexities, advancing the field of object-centric representation learning.

Sampling from Gaussian Process Posteriors using Stochastic Gradient Descent

Paper URL: https://openreview.net/attachment?id=Sf9goJtTCE&name=pdf

Topics

Gaussian Processes, Stochastic Gradient Descent, Bayesian Optimization, Approximate Inference, Computational Efficiency

Summary

This paper investigates the application of stochastic gradient descent (SGD) to sample from Gaussian process (GP) posteriors, addressing the computational challenges posed by the cubic cost associated with traditional GP methods. It presents a novel approach that reformulates the posterior sampling problem into optimization tasks amenable to SGD, which allows for efficient sampling even in large-scale or ill-conditioned scenarios. The authors demonstrate that, despite slower convergence rates, SGD can yield high-quality predictions and uncertainty estimates comparable to those obtained from more computationally intensive methods, achieving state-of-the-art performance on various regression tasks and a large-scale Bayesian optimization benchmark. Key findings include a spectral analysis of SGD's implicit bias, which suggests that accurate predictions can still be achieved even when convergence to the optimum is not fully realized.
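
The reformulation can be illustrated on the posterior mean alone: instead of solving (K + σ²I) α = y with a cubic-cost factorization, minimize the equivalent quadratic objective over the representer weights α with stochastic (mini-batch coordinate) updates. The NumPy sketch below shows that idea on synthetic data; it is not the paper's full pathwise-sampling algorithm, and the kernel and step sizes are made up.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 200
    X = rng.uniform(-3, 3, size=(n, 1))
    y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=n)

    def rbf_kernel(A, B, lengthscale=1.0):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-0.5 * d2 / lengthscale ** 2)

    K = rbf_kernel(X, X)
    noise = 0.1 ** 2

    # Posterior-mean weights solve (K + noise I) alpha = y, i.e. they minimise
    # 0.5 * alpha^T (K + noise I) alpha - alpha^T y. Run stochastic updates on that.
    alpha = np.zeros(n)
    lr, batch = 2e-3, 32
    for step in range(10000):
        idx = rng.choice(n, size=batch, replace=False)
        # Gradient of the quadratic objective restricted to a random block of weights.
        grad = K[idx] @ alpha + noise * alpha[idx] - y[idx]
        alpha[idx] -= lr * grad

    x_test = np.array([[0.5]])
    posterior_mean = rbf_kernel(x_test, X) @ alpha
    print("SGD posterior mean:", posterior_mean, " true function value:", np.sin(0.5))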

Scaling Data-Constrained Language Models

Paper URL: https://openreview.net/attachment?id=j5BuTrEj35&name=pdf

Topics

language models, data constraints, scaling laws, compute allocation, training efficiency

Summary

The paper investigates scaling large language models (LLMs) under data-constrained conditions, addressing the diminishing returns of data repetition and optimizing compute allocation. Through extensive experiments, the authors demonstrate that training on repeated data for up to four epochs yields minimal loss changes compared to unique data, while excess repetition leads to diminishing returns. They propose a new scaling law that integrates these findings with the existing Chinchilla scaling laws. Additionally, the study explores alternative strategies to mitigate data scarcity, such as augmenting training datasets with code and relaxing filtering criteria, revealing that these methods can enhance model performance in low-data scenarios. The findings suggest a path forward for scaling language models effectively despite data limitations, emphasizing the importance of adjusting training methodologies.

Sharpness Minimization Algorithms Do Not Only Minimize Sharpness To Achieve Better Generalization

Paper URL: https://openreview.net/attachment?id=Dkmpa6wCIx&name=pdf

Topics

generalization, sharpness minimization, neural networks, overparameterization, loss landscape

Summary

This paper investigates the relationship between sharpness minimization algorithms and generalization in overparameterized neural networks, revealing that the connection is complex and depends on model architecture and data distribution. The authors identify three scenarios involving two-layer ReLU networks: (1) flatness guarantees generalization, (2) non-generalizing flattest models exist where sharpness minimization fails, and (3) sharpness minimization can still achieve generalization even with non-generalizing flattest models. These findings challenge the notion that sharpness minimization directly leads to better generalization, suggesting that additional factors must be considered to fully understand generalization in neural networks.

Siamese Masked Autoencoders

Paper URL: https://openreview.net/attachment?id=yC3q7vInux&name=pdf

Topics

video representation learning, self-supervised learning, Siamese networks, visual correspondence, masked autoencoders

Summary

The paper introduces Siamese Masked Autoencoders (SiamMAE), an innovative extension of Masked Autoencoders (MAE) designed to enhance visual correspondence learning from video data. SiamMAE employs a unique asymmetric masking strategy, where 95% of the future frame's patches are masked while the past frame remains intact. This approach encourages the model to focus on object motion and develop object-centric representations. The authors demonstrate that SiamMAE significantly outperforms state-of-the-art self-supervised methods across various tasks, including video object segmentation and pose keypoint propagation, without relying on data augmentation or complex tracking techniques. The findings suggest that leveraging temporal information through asymmetric masking is crucial for effective correspondence learning in video representations, paving the way for future research in this domain.
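
A minimal sketch of the asymmetric masking at the heart of SiamMAE: patchify two frames, keep the past frame fully visible, and keep only about 5% of the future frame's patches as encoder input. The 95% ratio follows the summary above; the 16-pixel patch size is a typical ViT choice assumed here, and the encoder, decoder, and training loop are omitted.

    import torch

    def patchify(frame, patch=16):
        # (C, H, W) -> (num_patches, C * patch * patch)
        c, h, w = frame.shape
        frame = frame.reshape(c, h // patch, patch, w // patch, patch)
        return frame.permute(1, 3, 0, 2, 4).reshape(-1, c * patch * patch)

    past = torch.randn(3, 224, 224)     # frame at time t (kept intact)
    future = torch.randn(3, 224, 224)   # frame at time t + dt (heavily masked)

    past_tokens = patchify(past)
    future_tokens = patchify(future)
    num_patches = future_tokens.shape[0]          # 196 patches for 224 / 16

    mask_ratio = 0.95
    keep = int(num_patches * (1 - mask_ratio))    # only ~9 visible future patches
    perm = torch.randperm(num_patches)
    visible_idx, masked_idx = perm[:keep], perm[keep:]

    encoder_input_future = future_tokens[visible_idx]
    print(past_tokens.shape, encoder_input_future.shape)
    # A decoder that cross-attends to the past-frame tokens then predicts the
    # masked future patches, which encourages motion- and object-centric features.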

Smoothing the Landscape Boosts the Signal for SGD: Optimal Sample Complexity for Learning Single Index Models

Paper URL: https://openreview.net/attachment?id=73XPopmbXH&name=pdf

Topics

sample complexity, stochastic gradient descent, single index models, smoothing techniques, implicit regularization

Summary

This paper investigates the sample complexity required for learning single index models under isotropic Gaussian distributions, particularly focusing on the effectiveness of stochastic gradient descent (SGD). Previous findings indicated a gap between the sample complexity necessary for gradient-based methods and the theoretical lower bounds, with SGD requiring significantly more samples than what was indicated by Correlational Statistical Query (CSQ) lower bounds. The authors address this gap by introducing a smoothing technique applied to the loss landscape, demonstrating that with the smoothed loss, SGD achieves the optimal sample complexity of n ≍ d^{k/2} samples, aligning with CSQ lower bounds for models whose information exponent k exceeds 2. The analysis connects this improvement to the enhanced signal-to-noise ratio resulting from smoothing, which allows the learning process to avoid poor local minima, thus improving convergence properties.
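
The smoothing step itself is easy to state: replace the loss L(w) with an average of L over random perturbations of w and feed the resulting gradient to SGD. Below is a generic sketch of such a smoothed gradient estimate on a toy single-index problem; the perturbation scale, link function, and step sizes are arbitrary choices of mine, not the paper's schedule.

    import numpy as np

    def smoothed_grad(grad_fn, w, sigma=0.3, num_samples=8, rng=None):
        """Monte Carlo gradient of the smoothed loss E_z[L(w + sigma * z)], z ~ N(0, I)."""
        rng = rng or np.random.default_rng()
        g = np.zeros_like(w)
        for _ in range(num_samples):
            z = rng.normal(size=w.shape)
            g += grad_fn(w + sigma * z)
        return g / num_samples

    # Toy single-index objective L(w) = E_x[(f(w.x) - f(w*.x))^2], estimated on fresh batches.
    rng = np.random.default_rng(0)
    d = 50
    w_star = np.zeros(d); w_star[0] = 1.0
    f = lambda t: t ** 3                          # link function with a high information exponent

    def batch_grad(w, n=256):
        X = rng.normal(size=(n, d))
        err = f(X @ w) - f(X @ w_star)
        return (2 * err * 3 * (X @ w) ** 2) @ X / n   # chain rule through f(t) = t^3

    w = rng.normal(size=d); w /= np.linalg.norm(w)
    for step in range(200):
        w -= 0.01 * smoothed_grad(batch_grad, w, rng=rng)
        w /= np.linalg.norm(w)                    # keep w on the unit sphere
    print("overlap with true direction:", abs(w @ w_star))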

Spatial-frequency channels, shape bias, and adversarial robustness

Paper URL: https://openreview.net/attachment?id=KvPwXVcslY&name=pdf

Topics

spatial frequency object recognition neural networks adversarial robustness human vision

Summary

This paper investigates the spatial frequency information utilized by humans and neural networks for object recognition, employing critical band masking to compare their performance in recognizing natural images under noise. The study finds that humans rely on a narrow, one-octave-wide spatial frequency channel for recognition tasks, consistent across various stimuli, while neural networks exhibit significantly broader channels—2-4 times wider than humans. This discrepancy in channel width correlates with differences in shape bias and adversarial robustness, with adversarial training further increasing the networks' channel bandwidth beyond human levels. The findings suggest that aligning the spatial frequency channels of neural networks with those of humans could enhance their robustness against adversarial attacks.
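
The critical-band-masking procedure can be sketched as follows: add noise restricted to a one-octave band around a chosen center frequency and record how recognition accuracy degrades as the band is swept; the bands that hurt performance delineate the observer's channel. The FFT-based filter, image size, and the commented-out evaluation hook below are illustrative assumptions, not the study's stimuli.

```python
import numpy as np

def bandpass_noise(shape, center, octaves=1.0, rng=None):
    """Illustrative critical-band noise: white noise filtered to a band of the
    given width (in octaves) around `center`, measured here in cycles per
    image as a stand-in for cycles per degree."""
    rng = rng or np.random.default_rng(0)
    h, w = shape
    noise = rng.normal(size=shape)
    fy = np.fft.fftfreq(h)[:, None] * h
    fx = np.fft.fftfreq(w)[None, :] * w
    radius = np.sqrt(fx ** 2 + fy ** 2)
    lo, hi = center * 2 ** (-octaves / 2), center * 2 ** (octaves / 2)
    band = (radius >= lo) & (radius < hi)
    return np.real(np.fft.ifft2(np.fft.fft2(noise) * band))

# Sweep noise bands and (hypothetically) record recognition accuracy per band.
image = np.random.rand(224, 224)
for center in [2, 4, 8, 16, 32, 64]:
    noisy = image + 0.5 * bandpass_noise(image.shape, center)
    # accuracy = evaluate(observer, noisy)   # hypothetical evaluation step
```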

Students Parrot Their Teachers: Membership Inference on Model Distillation

Paper URL: https://openreview.net/attachment?id=a2Yg9Za6Rb&name=pdf

Topics

Membership inference Model distillation Privacy attack Machine learning Adversarial training

Summary

This paper investigates the effectiveness of model distillation as a privacy-preserving technique in machine learning, focusing on its vulnerability to membership inference attacks. The authors demonstrate that simply relying on distillation does not adequately protect sensitive training data from being inferred, as their developed attacks reveal that information can leak from teacher models to student models, even without direct access to training examples. They find that the similarity between teacher and student datasets, as well as data poisoning, significantly increases privacy risks. Moreover, they propose mitigating strategies such as deduplication and employing differential privacy, emphasizing the need for comprehensive privacy measures beyond model distillation alone.
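
For intuition, the simplest form of a membership inference query against a distilled student looks like the sketch below: flag records on which the student is unusually confident. The paper's attacks are considerably stronger (calibrated, shadow-model based, and combined with data poisoning); this only illustrates that the attacker needs nothing more than the student's output probabilities.

```python
import numpy as np

def loss_threshold_attack(student_probs, labels, threshold):
    """Minimal membership-inference sketch: predict that an example was in the
    *teacher's* training set if the distilled student assigns it unusually
    low loss. Names and the threshold are hypothetical."""
    nll = -np.log(student_probs[np.arange(len(labels)), labels] + 1e-12)
    return nll < threshold   # True -> predicted member

# Hypothetical usage on two candidate records and their student softmax outputs.
probs = np.array([[0.97, 0.02, 0.01],
                  [0.40, 0.35, 0.25]])
labels = np.array([0, 1])
print(loss_threshold_attack(probs, labels, threshold=0.1))  # [ True False]
```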

Task Arithmetic in the Tangent Space: Improved Editing of Pre-Trained Models

Paper URL: https://openreview.net/attachment?id=0A9f2jZDGW&name=pdf

Topics

task arithmetic weight disentanglement tangent space pre-trained models neural tangent kernel

Summary

The paper investigates task arithmetic in vision-language models, emphasizing its potential for efficient model editing by manipulating weights directly in the tangent space. It identifies weight disentanglement as a critical factor enabling effective task arithmetic, revealing that distinct directions in weight space correspond to localized function space regions for different tasks. The authors demonstrate that fine-tuning in the tangent space enhances weight disentanglement, leading to improved performance on various benchmarks. They also connect task arithmetic to the spatial localization of the neural tangent kernel (NTK) eigenfunctions, establishing that weight disentanglement emerges during pre-training. The findings suggest that linearized fine-tuning can significantly enhance task arithmetic performance, offering insights for developing more effective model editing techniques.
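
Task arithmetic itself is easy to state in code: form task vectors as differences from the pre-trained weights and add a scaled combination back to the base model. The dictionary-of-arrays checkpoint format and coefficients below are illustrative; the paper's contribution is showing that this works markedly better when fine-tuning is performed on the linearized (tangent-space) model.

```python
import numpy as np

def task_vector(finetuned, pretrained):
    """tau = theta_finetuned - theta_0, computed parameter-wise."""
    return {k: finetuned[k] - pretrained[k] for k in pretrained}

def task_arithmetic(pretrained, task_vectors, alphas):
    """Edited model: theta_0 + sum_i alpha_i * tau_i. A negative alpha
    'forgets' the corresponding task."""
    edited = {k: np.array(v, dtype=float).copy() for k, v in pretrained.items()}
    for alpha, tau in zip(alphas, task_vectors):
        for k in edited:
            edited[k] += alpha * tau[k]
    return edited

# Toy weights standing in for real checkpoints (dict of arrays):
rng = np.random.default_rng(0)
theta0 = {"layer.weight": rng.normal(size=(4, 4))}
theta_task1 = {"layer.weight": theta0["layer.weight"] + 0.1}
theta_task2 = {"layer.weight": theta0["layer.weight"] - 0.2}
taus = [task_vector(theta_task1, theta0), task_vector(theta_task2, theta0)]
edited = task_arithmetic(theta0, taus, alphas=[1.0, -1.0])
```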

Tester-Learners for Halfspaces: Universal Algorithms

Paper URL: https://openreview.net/attachment?id=Kv8GJkV19S&name=pdf

Topics

tester-learner halfspaces structured distributions Poincaré inequality log-concave distributions

Summary

The paper introduces the first universal tester-learner for halfspaces that operates efficiently across a broad class of structured distributions, specifically those satisfying the Poincaré inequality. This tester-learner is designed to accept a wide variety of distributions without being tailored to any single target distribution. The proposed algorithm runs in polynomial time and guarantees an error of O(opt) + ε on any labeled distribution it accepts. It utilizes hypercontractivity checks via a sum-of-squares program, marking a significant advancement over previous works that were limited to specific distributions, such as Gaussian or log-concave. Additionally, under the assumption of known Massart noise, it achieves an error rate of opt + ε, thereby extending its applicability and performance for various learning scenarios.

The Clock and the Pizza: Two Stories in Mechanistic Explanation of Neural Networks

Paper URL: https://openreview.net/attachment?id=S5wmbQc1We&name=pdf

Topics

neural networks algorithm discovery modular arithmetic Clock algorithm Pizza algorithm

Summary

This paper explores whether neural networks trained on algorithmic tasks, specifically modular addition, reliably rediscover known algorithms. By examining two algorithms—Clock and Pizza—the authors demonstrate that small changes in model hyperparameters can lead to qualitatively different algorithmic implementations. The Clock algorithm, which aligns with traditional modular arithmetic, is shown to be one of several possible solutions, as some networks implement the less intuitive Pizza algorithm, characterized by averaging embeddings rather than using multiplication. The findings suggest a rich diversity of algorithmic behaviors in neural networks, emphasizing the need for new interpretability tools to navigate the complex algorithmic phase space that these models inhabit.
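
A stylized rendering of the two algorithms on a single Fourier frequency (trained networks use several frequencies and learned readouts, so treat this only as intuition): the Clock route multiplies the two circular embeddings, i.e. adds their angles, whereas the Pizza route first averages the embeddings and must then recover the sum from the average's angle.

```python
import numpy as np

p = 59   # modulus
k = 1    # one Fourier frequency; real networks use several in parallel

def embed(a):
    """Embed a residue as a point on the unit circle."""
    theta = 2 * np.pi * k * a / p
    return complex(np.cos(theta), np.sin(theta))

def clock_add(a, b):
    """Clock (sketch): multiply embeddings, i.e. add angles, then read out
    the residue whose embedding is closest."""
    z = embed(a) * embed(b)                      # angle = 2*pi*k*(a+b)/p
    logits = [(z * embed(c).conjugate()).real for c in range(p)]
    return int(np.argmax(logits))

def pizza_add(a, b):
    """Pizza (sketch): average the embeddings first; the average's angle is
    pi*k*(a+b)/p, so later processing must in effect double the angle."""
    z = (embed(a) + embed(b)) / 2
    z = z * z / (abs(z) ** 2 + 1e-12)            # double the angle, drop magnitude
    logits = [(z * embed(c).conjugate()).real for c in range(p)]
    return int(np.argmax(logits))

print(clock_add(17, 55), pizza_add(17, 55), (17 + 55) % p)   # all 13
```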

The Surprising Effectiveness of Diffusion Models for Optical Flow and Monocular Depth Estimation

Paper URL: https://openreview.net/attachment?id=jDIlzSU8wJ&name=pdf

Topics

diffusion models optical flow monocular depth estimation uncertainty estimation generative models

Summary

This paper demonstrates the effectiveness of denoising diffusion probabilistic models for optical flow and monocular depth estimation, challenging the traditional reliance on specialized architectures and loss functions. The authors introduce the Denoising Diffusion Vision Model (DDVM), which excels in uncertainty capturing and allows for Monte Carlo inference, outperforming existing state-of-the-art methods on benchmark datasets like NYU and KITTI. The model integrates self-supervised pre-training, innovative training techniques for handling noisy data, and a coarse-to-fine refinement approach, achieving significant improvements in performance metrics such as relative depth error and optical flow outlier rates. The findings indicate that diffusion models could serve as a powerful and flexible framework for dense vision tasks, emphasizing their potential for capturing multimodal distributions and handling ambiguities in data.
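
The Monte Carlo inference mentioned above amounts to drawing several samples from the learned conditional distribution and reading off a point estimate plus a per-pixel spread; the wrapper name `sample_depth` and the toy stand-in sampler below are hypothetical.

```python
import numpy as np

def mc_depth_estimate(sample_depth, image, n_samples=8):
    """Monte Carlo inference sketch: because the diffusion model defines a
    distribution over depth maps given the image, several reverse-diffusion
    samples yield both a mean prediction and an uncertainty map."""
    samples = np.stack([sample_depth(image) for _ in range(n_samples)])
    return samples.mean(axis=0), samples.std(axis=0)

# Toy stand-in sampler, just to exercise the interface:
rng = np.random.default_rng(0)
toy_sampler = lambda img: np.clip(img.mean() + 0.1 * rng.normal(size=img.shape), 0, None)
mean_depth, uncertainty = mc_depth_estimate(toy_sampler, rng.random((48, 64)))
print(mean_depth.shape, uncertainty.shape)
```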

Toolformer: Language Models Can Teach Themselves to Use Tools

Paper URL: https://openreview.net/attachment?id=Yacmpz84TH&name=pdf

Topics

Toolformer Language Models API Calls Self-Supervised Learning Natural Language Processing

Summary

The paper introduces Toolformer, a novel language model that enhances its capabilities by learning to utilize external tools through simple APIs in a self-supervised manner. Unlike traditional models, which often require extensive human annotations or are limited to specific tasks, Toolformer autonomously decides when and how to call various APIs, such as calculators, search engines, and translation systems, based on a few demonstrations. This approach not only improves its performance across a range of downstream tasks in zero-shot settings but also maintains its language modeling abilities. Toolformer, based on a 6.7 billion parameter GPT-J model, surpasses even larger models like GPT-3 in several benchmarks, demonstrating its effectiveness in addressing inherent limitations of language models, such as arithmetic skills and factual accuracy.
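
The core self-supervised step can be summarized by the filtering rule sketched below: a sampled API call is kept only if providing the call together with its result lowers the model's loss on the following tokens by at least a margin, compared with either no call or the call without its result. Here `lm_loss` is a hypothetical scoring function, and the toy stand-in exists only to make the snippet runnable.

```python
def keep_api_call(lm_loss, prefix, call, result, continuation, tau=1.0):
    """Toolformer-style filtering sketch: keep a sampled API call only if
    conditioning on the call *and its result* lowers the loss on the
    continuation by at least tau, relative to the best of (no call,
    call without result)."""
    with_result = lm_loss(prefix + f" [{call} -> {result}]", continuation)
    without_result = lm_loss(prefix + f" [{call}]", continuation)
    plain = lm_loss(prefix, continuation)
    return min(plain, without_result) - with_result >= tau

# Toy stand-in for the LM loss, just to exercise the filter:
def toy_loss(context, continuation):
    return 0.5 if "->" in context else 2.0

print(keep_api_call(toy_loss, "The ratio is",
                    "Calculator(400 / 1400)", "0.29", " about 29%."))  # True
```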

ToolkenGPT: Augmenting Frozen Language Models with Massive Tools via Tool Embeddings

Paper URL: https://openreview.net/attachment?id=BHXsb69bSx&name=pdf

Topics

tool embeddings language models tool integration in-context learning problem-solving

Summary

The paper introduces ToolkenGPT, a novel approach that enhances large language models (LLMs) by integrating external tools through the concept of tool embeddings, referred to as "toolkens." Unlike traditional methods that require extensive fine-tuning or are constrained by limited context in in-context learning, ToolkenGPT allows for the dynamic addition of multiple tools and utilizes extensive demonstration data to learn toolken embeddings efficiently. The framework prompts the LLM to switch modes when a tool is called, enabling it to generate arguments for tool execution seamlessly. Experimental results demonstrate that ToolkenGPT significantly outperforms existing baselines in various domains, including numerical reasoning, knowledge-based question answering, and embodied plan generation, showcasing its ability to adeptly utilize a wide range of tools in complex scenarios without the need for costly fine-tuning.
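
A minimal sketch of the toolken idea, under the assumption that each tool contributes one extra row to the output head while the rest of the LM stays frozen: the next-token distribution ranges over the union of word tokens and toolkens, and predicting a toolken switches decoding into an argument-generation mode. Dimensions and names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, vocab_size, n_tools = 64, 1000, 3

W_vocab = rng.normal(size=(vocab_size, d_model))         # frozen LM head
W_toolkens = rng.normal(size=(n_tools, d_model)) * 0.02  # the only trainable part

def next_token(hidden):
    """Score word tokens and toolkens with a single softmax over the
    concatenated heads; an index >= vocab_size means 'call tool i'."""
    logits = np.concatenate([W_vocab @ hidden, W_toolkens @ hidden])
    return int(np.argmax(logits))

tok = next_token(rng.normal(size=d_model))
if tok >= vocab_size:
    tool_id = tok - vocab_size
    # Switch to "tool mode": prompt the frozen LM with a demonstration of the
    # tool's syntax, generate the call's arguments, execute, then resume.
    print(f"call tool {tool_id}")
else:
    print(f"emit token {tok}")
```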

Towards Revealing the Mystery behind Chain of Thought: A Theoretical Perspective

Paper URL: https://openreview.net/attachment?id=qHrADgAdYu&name=pdf

Topics

Chain-of-Thought Large Language Models Theoretical Analysis Decision-Making Mathematical Reasoning

Summary

This paper investigates the theoretical foundations of Chain-of-Thought (CoT) prompting, which has been shown to significantly enhance the performance of Large Language Models (LLMs) on complex tasks, particularly in mathematics and reasoning. The authors employ circuit complexity theory to demonstrate that bounded-depth Transformers struggle to directly solve basic arithmetic and equation tasks without CoT, requiring super-polynomial model sizes. In contrast, they establish that constant-size autoregressive Transformers can effectively utilize CoT to generate step-by-step derivations, enabling them to tackle a broader class of decision-making problems, such as Dynamic Programming. Empirical experiments further confirm that models trained with CoT consistently outperform those trained for direct predictions, emphasizing the critical role of CoT in unlocking the potential of LLMs for solving intricate real-world tasks.

Transformers as Statisticians: Provable In-Context Learning with In-Context Algorithm Selection

Paper URL: https://openreview.net/attachment?id=liMSqUuVg9&name=pdf

Topics

Transformers In-Context Learning Algorithm Selection Machine Learning Statistical Theory

Summary

The paper presents a comprehensive statistical theory for transformers' capabilities in in-context learning (ICL), demonstrating that they can implement various standard machine learning algorithms—such as least squares and Lasso—in context without explicit parameter updates. The authors establish that transformers can perform adaptive in-context algorithm selection, allowing them to choose different algorithms based on input sequences, thus enhancing predictive performance. They construct two mechanisms for algorithm selection: pre-ICL testing and post-ICL validation, providing theoretical guarantees for their approaches. The experimental results affirm the theoretical findings, showcasing strong ICL and algorithm selection capabilities in standard transformer architectures.
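
Outside the transformer, post-ICL validation corresponds to the familiar recipe sketched below: hold out part of the in-context examples, run each candidate algorithm on the remainder, and keep whichever validates best. The paper's point is that transformer weights can be constructed to carry out this selection internally; the ridge-regression candidates and problem sizes here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def ridge(X, y, lam):
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def post_icl_validation(X, y, lams=(0.0, 1.0, 10.0), n_val=10):
    """Hold out n_val in-context examples, fit each candidate on the rest,
    and return the regularization level with the best validation error."""
    Xt, yt, Xv, yv = X[:-n_val], y[:-n_val], X[-n_val:], y[-n_val:]
    errs = [np.mean((Xv @ ridge(Xt, yt, lam) - yv) ** 2) for lam in lams]
    return lams[int(np.argmin(errs))]

X = rng.normal(size=(50, 5))
w = rng.normal(size=5)
y = X @ w + 0.1 * rng.normal(size=50)
print("selected lambda:", post_icl_validation(X, y))
```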

Tree of Thoughts: Deliberate Problem Solving with Large Language Models

Paper URL: https://openreview.net/attachment?id=5Xc1ecxO1h&name=pdf

Topics

language models problem solving Tree of Thoughts decision-making reasoning

Summary

The paper introduces the Tree of Thoughts (ToT) framework, enhancing the problem-solving capabilities of large language models (LMs) by allowing them to engage in deliberate decision-making processes that involve exploring multiple reasoning paths and self-evaluating their choices. Unlike traditional left-to-right inference mechanisms, ToT enables LMs to maintain a tree of coherent text units (thoughts) and apply search algorithms, such as breadth-first and depth-first search, to navigate through potential solutions. Experimental results demonstrate that ToT significantly improves performance in tasks requiring complex reasoning, including the Game of 24, Creative Writing, and Mini Crosswords, achieving markedly higher success rates compared to existing prompting techniques like Chain of Thought. The framework's flexibility and generality position it as a promising approach for tackling diverse problem-solving challenges in the realm of LMs.
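
The search loop itself is compact; a sketch of the breadth-first variant is below, with `propose` and `evaluate` standing in for the LM calls that generate candidate thoughts and score partial solutions (the toy stand-ins are only there to make the snippet runnable).

```python
def tree_of_thoughts_bfs(problem, propose, evaluate, breadth=3, depth=3):
    """Breadth-first search over partial 'thoughts' (sketch of the ToT loop):
    expand each frontier state with proposed next thoughts, score the
    candidates, and keep the `breadth` most promising ones per level."""
    frontier = [""]                                  # start from the empty thought
    for _ in range(depth):
        candidates = [s + t for s in frontier for t in propose(problem, s)]
        candidates.sort(key=lambda s: evaluate(problem, s), reverse=True)
        frontier = candidates[:breadth]
    return frontier[0]

# Toy stand-ins, just to exercise the search:
propose = lambda problem, state: [" step-a", " step-b"]
evaluate = lambda problem, state: state.count("a")
print(tree_of_thoughts_bfs("toy problem", propose, evaluate))
```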

Understanding Diffusion Objectives as the ELBO with Simple Data Augmentation

Paper URL: https://openreview.net/attachment?id=NnMEadcdyD&name=pdf

Topics

diffusion models ELBO data augmentation noise scheduling image generation

Summary

This paper establishes a theoretical connection between diffusion model objectives and the Evidence Lower Bound (ELBO), demonstrating that common diffusion objectives can be viewed as weighted integrals of ELBOs across varying noise levels, with specific weightings influencing the model's performance. The authors show that when the weighting function is monotonic, these objectives correspond to maximizing the ELBO with Gaussian noise perturbation as a form of data augmentation. Through experiments on the ImageNet dataset, they explore various monotonic weightings and demonstrate that their proposed approaches achieve state-of-the-art results in image generation. The findings suggest significant implications for optimizing diffusion models and understanding their relationship to other generative modeling techniques.
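
Schematically, and with notation simplified relative to the paper, the claim reads:

```latex
\mathcal{L}_w(x) \;=\; \int w(\lambda)\, \mathcal{L}_{\mathrm{ELBO}}(x;\lambda)\, d\lambda
\;\;=\;\; \mathbb{E}_{\lambda \sim p_w}\!\big[\mathcal{L}_{\mathrm{ELBO}}(x;\lambda)\big] \;+\; \mathrm{const}
\qquad \text{(when } w \text{ is monotonic),}
```

where L_ELBO(x; λ) denotes the ELBO for the data point perturbed with Gaussian noise at level λ and p_w is a distribution over noise levels induced by the weighting; this is the sense in which the weighted diffusion objective is "the ELBO with simple data augmentation."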

User-Level Differential Privacy With Few Examples Per User

Paper URL: https://openreview.net/attachment?id=PITeSdYQkv&name=pdf

Topics

differential privacy user-level privacy machine learning sample complexity algorithm design

Summary

This paper addresses the challenge of user-level differential privacy (DP) in scenarios where each user contributes only a few examples, as opposed to the previously studied example-rich cases. The authors present a generic method to transform item-level DP algorithms into user-level DP algorithms, yielding significant reductions in the number of users required to achieve similar utility, specifically a multiplicative savings of O(√m), where m is the number of examples per user. They also propose techniques for both approximate and pure DP, adapting existing mechanisms like the exponential mechanism to fit the user-level framework. The results yield new sample complexity bounds for various learning tasks, including PAC learning, while highlighting the computational inefficiencies of the proposed algorithms. Overall, the paper contributes to advancing the understanding of user-level DP, providing algorithms that are useful for practical machine learning applications while outlining open questions for future research.

Visual Instruction Tuning

Paper URL: https://openreview.net/attachment?id=w0H2xGHlkw&name=pdf

Topics

visual instruction tuning multimodal models GPT-4 language-image data AI assistant

Summary

The paper introduces LLaVA, a novel approach to visual instruction tuning that leverages GPT-4 to generate multimodal instruction-following data, aiming to enhance the capabilities of large multimodal models (LMMs) for visual and language tasks. LLaVA effectively connects a vision encoder with a language model to create a general-purpose assistant capable of interpreting and responding to visual instructions. The authors construct two evaluation benchmarks to assess the model's performance across diverse tasks. Experimental results demonstrate that LLaVA exhibits strong multimodal chat capabilities and achieves state-of-the-art accuracy on the Science QA dataset, outperforming existing models. The paper also emphasizes the importance of creating high-quality multimodal instruction-following data and provides open-source access to these resources for further research in the field.
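
The architectural connection is small: frozen vision-encoder patch features are mapped by a trainable projection into the language model's embedding space and prepended to the text tokens. The sketch below assumes a single linear projection and illustrative dimensions; the two-stage training noted in the comments paraphrases the summary above.

```python
import numpy as np

rng = np.random.default_rng(0)
d_vision, d_lm, n_patches = 1024, 4096, 256   # illustrative sizes only

W_proj = rng.normal(size=(d_lm, d_vision)) * 0.01   # the trainable connector

def visual_tokens(image_features):
    """Project frozen vision-encoder patch features into the LM's embedding
    space so they can be prepended to text tokens. Roughly, the first training
    stage updates only this projection on image-caption pairs; the second
    stage tunes the projection and the LM on GPT-4-generated instruction data."""
    return image_features @ W_proj.T              # (n_patches, d_lm)

feats = rng.normal(size=(n_patches, d_vision))
prompt_embeddings = np.concatenate([visual_tokens(feats),
                                    rng.normal(size=(12, d_lm))])  # text embeddings
print(prompt_embeddings.shape)   # (268, 4096)
```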

When Demonstrations meet Generative World Models: A Maximum Likelihood Framework for Offline Inverse Reinforcement Learning

Paper URL: https://openreview.net/attachment?id=oML3v2cFg2&name=pdf

Topics

inverse reinforcement learning maximum likelihood estimation offline learning world model policy optimization

Summary

The paper introduces a novel approach to Offline Inverse Reinforcement Learning (IRL) by framing it as a maximum likelihood estimation problem, addressing the challenges posed by limited expert demonstrations and distribution shifts in the environment dynamics. The authors develop a bi-level optimization framework where the upper level maximizes the likelihood of observed expert actions, while the lower level conservatively estimates the expert's policy and the world model. This method incorporates uncertainty estimation to penalize state-action pairs with high uncertainty, thereby enhancing the reliability of the reward recovery process. The proposed algorithm, termed Offline ML-IRL, demonstrates significant improvements over existing offline IRL and imitation learning benchmarks, particularly in continuous control tasks using the MuJoCo simulator and various datasets from the D4RL benchmark. The authors provide statistical and computational guarantees for the performance of their method, emphasizing its applicability in safety-sensitive domains.
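
In schematic form (the notation here is illustrative, not the paper's exact statement), the bi-level problem pairs a likelihood objective over expert actions with a conservatively penalized policy optimization under the estimated world model:

```latex
\max_{\theta}\;\; \mathbb{E}_{(s,a)\sim \mathcal{D}_{\mathrm{expert}}}\big[\log \pi_{\theta}(a \mid s)\big]
\quad \text{s.t.} \quad
\pi_{\theta} \in \arg\max_{\pi}\;
\mathbb{E}_{\widehat{M},\,\pi}\Big[\textstyle\sum_{t}\gamma^{t}\big(r_{\theta}(s_t,a_t) \;-\; \lambda\, u(s_t,a_t)\big)\Big],
```

where \widehat{M} is the world model fit to the offline data, r_θ is the learned reward, and u(s, a) is an uncertainty penalty that discourages state-action pairs poorly covered by the dataset.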

When Do Transformers Shine in RL? Decoupling Memory from Credit Assignment

Paper URL: https://openreview.net/attachment?id=APGXBNkt6h&name=pdf

Topics

Transformers Reinforcement Learning Memory Credit Assignment Long-term Dependencies

Summary

This paper investigates the effectiveness of Transformers in reinforcement learning (RL) by distinguishing between memory capabilities and credit assignment abilities. The authors define memory length and credit assignment length, and design configurable tasks to empirically evaluate these aspects in RL contexts. The findings reveal that Transformers significantly enhance long-term memory, enabling them to recall observations up to 1500 steps back, but do not provide improvements in long-term credit assignment. The study highlights the need for careful task design in RL benchmarks and suggests that while Transformers are powerful for memory tasks, they do not universally solve all RL challenges.

Why think step by step? Reasoning emerges from the locality of experience

Paper URL: https://openreview.net/attachment?id=rcXXNFVlEn&name=pdf

Topics

reasoning language models chain-of-thought statistical structure inference

Summary

The paper investigates the effectiveness of chain-of-thought reasoning in language models and its connection to the statistical structure of training data. It posits that reasoning through intermediate steps allows models to make more accurate inferences when the training data consists of overlapping local clusters of related variables. Experimental results demonstrate that models trained on locally structured data significantly benefit from reasoning steps, leading to lower bias in estimating conditional probabilities, particularly for pairs of variables not frequently co-occurring in training. The findings highlight the importance of local statistical dependencies in enhancing the reasoning capabilities of both humans and language models, suggesting that such reasoning improves data efficiency.
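
A tiny worked example of the mechanism: with a chain A → B → C and training data that only ever pairs (A, B) and (B, C), the target conditional P(C | A) cannot be estimated directly from co-occurrences, but it is recovered exactly by "reasoning through" the intermediate variable. The numbers below are arbitrary.

```python
import numpy as np

# A chain A -> B -> C of binary variables. Training "experience" is local:
# only (A, B) pairs and (B, C) pairs are ever observed together, never (A, C).
p_b_given_a = np.array([[0.9, 0.1],    # rows: a = 0, 1; cols: b = 0, 1
                        [0.2, 0.8]])
p_c_given_b = np.array([[0.7, 0.3],
                        [0.1, 0.9]])

# Direct estimation of P(C=1 | A=1) is impossible from the local data alone,
# but marginalizing over the intermediate variable recovers it:
p_c1_given_a1 = sum(p_b_given_a[1, b] * p_c_given_b[b, 1] for b in (0, 1))
print(p_c1_given_a1)   # 0.2*0.3 + 0.8*0.9 = 0.78
```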