Papers
Papers by Oliver Steele and collaborators, across haptics, human–computer interaction, and machine-learning systems.
LLM performance estimators and storage-tier placement across the fidelity spectrum. Start with the spec-only roofline physics paper, which adds framework dispatch as a third performance ceiling alongside compute and bandwidth. The design-space paper then argues that fidelity should be an explicit knob rather than a property baked into each estimator, and that spec-only physics already closes most of the gap to a fully calibrated model. The placement paper deploys a calibrated estimator at the system level: utility-based placement across HBM, HBF, DRAM, and NVMe with wear modeling.
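As a rough illustration of the three-ceiling idea, the sketch below estimates an operation's time as the largest of a compute term, a bandwidth term, and a dispatch term. It is a minimal sketch under stated assumptions, not the papers' estimator: `HardwareSpec`, `estimate_time`, the per-kernel `dispatch_overhead` field, and the choice to combine the terms with a simple `max` are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class HardwareSpec:
    """Spec-sheet numbers only; every field here is an illustrative assumption."""
    peak_flops: float         # peak compute throughput, FLOP/s
    mem_bandwidth: float      # peak memory bandwidth, bytes/s
    dispatch_overhead: float  # assumed fixed framework cost per kernel launch, seconds


def estimate_time(flops, bytes_moved, num_kernels, hw):
    """Return (estimated seconds, name of the binding ceiling).

    Classic roofline takes the max of the compute and bandwidth terms;
    this sketch adds dispatch as a third term in the same max.
    """
    times = {
        "compute": flops / hw.peak_flops,
        "bandwidth": bytes_moved / hw.mem_bandwidth,
        "dispatch": num_kernels * hw.dispatch_overhead,
    }
    bound = max(times, key=times.get)
    return times[bound], bound
```

With made-up numbers, a workload of many tiny kernels comes out dispatch-bound even though neither the compute nor the bandwidth term is large, which is the situation a two-ceiling roofline cannot represent.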
Multi-objective Bayesian optimization with calibrated uncertainty and practitioner-utility decision regret. The first paper attacks Pareto-front coverage and uniformity directly via \(L_p\)-norm cycling and gap-guided sampling. The second feeds a calibrated estimator's per-observation variance into the acquisition and switches the evaluation metric from hypervolume to SLO regret on real GPU job placement. The third deploys the resulting decision policies in an uncertainty-aware LLM-performance estimation stack — pairing the right policy with a well-calibrated base.
Probing what transformers encode and where each signal lives in the residual stream. A structural-probes audit finds the classic distance probe transfers poorly across architectures — the probe's subspace is architecture-specific. The anaphora paper extends the same probe family to coreference and finds binding-as-proximity at Cohen's d = 0.83, with a cross-model section showing binding is even more model-specific than tree distance. The role-encoding paper takes a different angle entirely: a decision-position filter resolves four CE-family contrastive recipes that pooled metrics cannot tell apart. Two negative-result papers close the section, both pushing back on attention-head clustering as a stand-alone interpretability tool.
Eigenvalue structure of transformer weights as a lens on architecture, training dynamics, and pruning. A cross-family survey of 22 LLMs rules out five hypotheses about eigenvalue structure; the surviving three-zone γ pattern appears only above ~400M parameters and reappears across model families. The Pythia case study zooms in on one of the survey's most striking outliers — a scale-point departure at 2.8B and a family-level signature in logit sharpness — and walks through six candidate explanations, none of which survive cross-family controls. Tracing the three-zone pattern through Pythia checkpoints then shows when in training it emerges. The pruning paper turns the diagnostic into a procedure: spectral-guided magnitude pruning sharply outperforms plain L1, with near-zero Jaccard overlap in which weights each method removes.
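The Jaccard comparison between pruning methods can be sketched as follows. This is a minimal sketch with hypothetical helper names (`magnitude_prune_indices`, `jaccard`): it shows only plain magnitude pruning and the overlap metric, and takes the other method's removed set as given, since the spectral-guided procedure itself is not described here.

```python
def magnitude_prune_indices(weights, fraction):
    """Indices of the smallest-magnitude weights, i.e. what plain magnitude pruning removes."""
    k = int(len(weights) * fraction)
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    return set(order[:k])


def jaccard(a, b):
    """Jaccard overlap |a & b| / |a | b| between two sets of removed indices."""
    union = a | b
    if not union:
        return 1.0  # convention assumed here: two empty sets count as identical
    return len(a & b) / len(union)
```

A value near zero means the two methods remove almost entirely different weights at the same sparsity level; a value of one means they remove the same set.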
Implementation studies of entropy coding and gradient compression. An implementation of Han et al.'s 2008 equiprobable-partitioning entropy codec fills two practical gaps the original skipped — overflow-safe block sizing and byte-alignment accounting — and diagnoses the ~50% efficiency ceiling as a structural consequence of small-block metadata overhead. A bit-width-stratified gradient codec ties pure ANS on dense GPT-2 gradients and loses to sparsity-prefilter+ANS on sparse ones; the bucketing complexity is never load-bearing.
Work I did as an undergraduate.