We develop a theoretical model that addresses the economic trade-off between cost per token versus serial token generation speed when deploying LLMs for inference at scale. Our model takes into account arithmetic, memory bandwidth, network bandwidth and latency constraints; and optimizes over different parallelism setups and batch sizes to find the ones that optimize serial inference speed at a given cost per token. We use the model to compute Pareto frontiers of serial speed versus cost per tok...
Quantitative mode stability for the wave equation on the Kerr-Newman spacetime
Risk-Aware Objective-Based Forecasting in Inertia Management
Chainalysis: Geography of Cryptocurrency 2023
Periodicity in Cryptocurrency Volatility and Liquidity
Impact of Geometric Uncertainty on the Computation of Abdominal Aortic Aneurysm Wall Strain
Simulation-based Bayesian inference with ameliorative learned summary statistics -- Part I