
Transformers Learn Shortcuts to Automata

Algorithmic reasoning requires capabilities that are most naturally understood through recurrent models of computation, such as the Turing machine. However, Transformer models, while lacking recurrence, are able to perform such reasoning using far …
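
A minimal illustrative sketch (not from the paper) of the recurrent style of computation the abstract contrasts with Transformers: a finite automaton is simulated one input symbol at a time, carrying a state forward; the transition table and parity example below are placeholders.

    # Sequential simulation of a finite automaton: T input symbols take T steps,
    # because each new state depends on the previous one.
    def run_automaton(transition, start_state, inputs):
        """transition: dict mapping (state, symbol) -> next state."""
        state = start_state
        for symbol in inputs:
            state = transition[(state, symbol)]
        return state

    # Example: the parity automaton over bits (state = running parity).
    parity = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}
    print(run_automaton(parity, 0, [1, 0, 1, 1]))  # -> 1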

Recurrent Convolutional Neural Networks Learn Succinct Learning Algorithms

Neural Networks (NNs) struggle to efficiently learn certain problems, such as parity problems, even when there are simple learning algorithms for those problems. Can NNs discover learning algorithms on their own? We exhibit an NN architecture that, in …
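
As an illustrative sketch (not the paper's architecture) of what a "simple learning algorithm" for parity looks like: the hidden parity vector solves a linear system over GF(2), recoverable by Gaussian elimination. The sizes and random data below are placeholders.

    import numpy as np

    # Gaussian elimination over GF(2): recovers w with y = X w (mod 2),
    # assuming X has full column rank over GF(2).
    def solve_gf2(X, y):
        A = np.concatenate([X % 2, (y % 2)[:, None]], axis=1).astype(np.uint8)
        n, d = X.shape
        row = 0
        for col in range(d):
            pivot = next((r for r in range(row, n) if A[r, col]), None)
            if pivot is None:
                continue
            A[[row, pivot]] = A[[pivot, row]]       # move pivot row into place
            for r in range(n):
                if r != row and A[r, col]:
                    A[r] ^= A[row]                  # eliminate the column elsewhere
            row += 1
        return A[:d, d]

    rng = np.random.default_rng(0)
    w_true = rng.integers(0, 2, size=8)
    X = rng.integers(0, 2, size=(64, 8))
    y = (X @ w_true) % 2
    print(solve_gf2(X, y), w_true)                  # the two should match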

Hidden Progress in Deep Learning: SGD Learns Parities Near the Computational Limit

There is mounting empirical evidence of emergent phenomena in the capabilities of deep learning methods as we scale up datasets, model sizes, and training times. While there are some accounts of how these resources modulate statistical …
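
A sketch of the k-sparse parity learning task named in the title, with hypothetical sizes: the label is the XOR of k hidden coordinates of an n-bit input, which under a +/-1 encoding is the product of those coordinates.

    import numpy as np

    rng = np.random.default_rng(0)
    n, k, num_samples = 30, 3, 10_000
    support = rng.choice(n, size=k, replace=False)   # hidden relevant coordinates

    X = rng.choice([-1.0, 1.0], size=(num_samples, n))
    y = np.prod(X[:, support], axis=1)               # +/-1 parity label

    # Every single coordinate is uncorrelated with the label (for k >= 2), which
    # is what makes the problem statistically easy but hard for gradient methods.
    print(np.abs(X.T @ y / num_samples).max())       # close to 0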

Inductive Biases and Variable Creation in Self-Attention Mechanisms

Self-attention, an architectural motif designed to model long-range interactions in sequential data, has driven numerous recent breakthroughs in natural language processing and beyond. This work provides a theoretical analysis of the inductive biases …
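
A minimal sketch of the self-attention motif the abstract refers to, in its standard single-head, scaled dot-product form; the dimensions and random weights are placeholders rather than anything from the paper.

    import numpy as np

    def self_attention(X, Wq, Wk, Wv):
        Q, K, V = X @ Wq, X @ Wk, X @ Wv
        scores = Q @ K.T / np.sqrt(K.shape[1])            # pairwise interaction scores
        weights = np.exp(scores - scores.max(axis=1, keepdims=True))
        weights /= weights.sum(axis=1, keepdims=True)     # row-wise softmax
        return weights @ V                                # each position mixes all others

    rng = np.random.default_rng(0)
    T, d = 6, 8                                           # sequence length, model width
    X = rng.normal(size=(T, d))
    out = self_attention(X, *[rng.normal(size=(d, d)) for _ in range(3)])
    print(out.shape)                                      # (6, 8)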

Understanding Contrastive Learning Requires Incorporating Inductive Biases

Contrastive learning is a popular form of self-supervised learning that encourages augmentations (views) of the same input to have more similar representations compared to augmentations of different inputs. Recent attempts to theoretically explain …
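
A sketch of a standard InfoNCE-style contrastive objective that makes the "similar views vs. other inputs" idea concrete; this is a generic example, not necessarily the exact loss analyzed in the paper, and the temperature value is a placeholder.

    import numpy as np

    def info_nce(z1, z2, temperature=0.1):
        """z1[i], z2[i] are representations of two augmented views of input i."""
        z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
        z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
        logits = z1 @ z2.T / temperature                  # [i, j]: similarity of view pairs
        logits -= logits.max(axis=1, keepdims=True)
        log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))               # positives sit on the diagonal

    rng = np.random.default_rng(0)
    z = rng.normal(size=(16, 32))
    print(info_nce(z + 0.01 * rng.normal(size=z.shape), z))   # small: views nearly identical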

Anti-Concentrated Confidence Bonuses for Scalable Exploration

Intrinsic rewards play a central role in handling the exploration-exploitation trade-off when designing sequential decision-making algorithms, in both foundational theory and state-of-the-art deep reinforcement learning. The LinUCB algorithm, a …
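
For reference, a sketch of the classical LinUCB bonus the abstract builds on (the paper's anti-concentrated approximation is not reproduced here): the bonus for an action with feature x is beta * sqrt(x^T A^{-1} x), where A accumulates outer products of the features seen so far. The beta and regularization values below are placeholders.

    import numpy as np

    class LinUCB:
        def __init__(self, dim, beta=1.0, reg=1.0):
            self.A = reg * np.eye(dim)       # regularized design matrix
            self.b = np.zeros(dim)           # accumulated reward-weighted features
            self.beta = beta

        def score(self, x):
            A_inv = np.linalg.inv(self.A)
            theta_hat = A_inv @ self.b                    # ridge estimate of rewards
            bonus = self.beta * np.sqrt(x @ A_inv @ x)    # optimism for rarely seen directions
            return x @ theta_hat + bonus

        def update(self, x, reward):
            self.A += np.outer(x, x)
            self.b += reward * x

    agent = LinUCB(dim=4)
    agent.update(np.array([1.0, 0, 0, 0]), reward=1.0)
    # The unexplored direction receives the larger bonus.
    print(agent.score(np.array([1.0, 0, 0, 0])), agent.score(np.array([0, 1.0, 0, 0])))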

Investigating the Role of Negatives in Contrastive Representation Learning

Noise contrastive learning is a popular technique for unsupervised representation learning. In this approach, a representation is obtained via reduction to supervised learning, where given a notion of semantic similarity, the learner tries to …
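
A sketch (details hypothetical) of the reduction to supervised learning described here: a binary classifier is trained to distinguish semantically similar (positive) pairs from mismatched (negative) pairs, scoring each pair by the inner product of its representations.

    import numpy as np

    def nce_binary_loss(z_anchor, z_positive, z_negative):
        pos_scores = np.sum(z_anchor * z_positive, axis=1)    # label 1: similar pairs
        neg_scores = np.sum(z_anchor * z_negative, axis=1)    # label 0: mismatched pairs
        log_sigmoid = lambda s: -np.logaddexp(0.0, -s)        # numerically stable log sigmoid
        return -(np.mean(log_sigmoid(pos_scores)) + np.mean(log_sigmoid(-neg_scores)))

    rng = np.random.default_rng(0)
    z = rng.normal(size=(32, 16))
    print(nce_binary_loss(z, z + 0.05 * rng.normal(size=z.shape), rng.normal(size=z.shape)))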

Gone Fishing: Neural Active Learning with Fisher Embeddings

There is an increasing need for effective active learning algorithms that are compatible with deep neural networks. While there are many classic, well-studied sample selection methods, the non-convexity and varying internal representation of neural …

Acceleration via Fractal Learning Rate Schedules

When balancing the practical tradeoffs of iterative methods for large-scale optimization, the learning rate schedule remains notoriously difficult to understand and expensive to tune. We demonstrate the presence of these subtleties even in the …

Statistical Estimation from Dependent Data

We consider a general statistical estimation problem wherein binary labels across different observations are not independent conditioned on their feature vectors, but dependent, capturing settings where, e.g., these observations are collected on a …