Enhao Huang, Pengyu Sun, Zixin Lin, Alex Chen, Joey Ouyang, Haobo Wang, Kaichun Hu, James Yi, Frank Li, Zhiyu Zhang, Tianxiang Xu, Gang Zhao, Ziang Ling, Lowes Yang
Large Language Models (LLMs) have achieved impressive performance across diverse natural language processing tasks, but specialized domains such as Web3 pose new challenges and call for more tailored evaluation. Web3 encompasses smart contracts, decentralized finance (DeFi), non-fungible tokens (NFTs), decentralized autonomous organizations (DAOs), on-chain governance, and novel token economics. Despite its significant user base and capital flows, no comprehensive benchmark has systematically assessed LLM performance in this domain. To address this gap, we introduce the DMind Benchmark, a holistic Web3-oriented evaluation suite covering nine critical subfields: fundamental blockchain concepts, blockchain infrastructure, smart contracts, DeFi mechanisms, DAOs, NFTs, token economics, meme concepts, and security vulnerabilities. Beyond multiple-choice questions, the DMind Benchmark features domain-specific tasks such as contract debugging and on-chain numeric reasoning that mirror real-world scenarios. We evaluated 26 models, including ChatGPT, Claude, DeepSeek, Gemini, Grok, and Qwen, and uncovered notable performance gaps in specialized areas such as token economics and security-critical contract analysis. While some models excel at blockchain infrastructure tasks, advanced subfields remain challenging. Our benchmark dataset and evaluation pipeline are open-sourced at https://huggingface.co/datasets/DMindAI/DMind_Benchmark; the dataset reached number one on Hugging Face's trending dataset charts within a week of release.
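To make the idea of per-subfield performance gaps concrete, the following is a minimal sketch of how multiple-choice results might be aggregated by subfield. It is not the DMind evaluation pipeline: the record fields `subfield`, `predicted`, and `answer`, the function name `subfield_accuracy`, and the sample records are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def subfield_accuracy(records):
    """Return {subfield: fraction of correct answers} for multiple-choice records.

    Each record is assumed (hypothetically) to carry three string fields:
    'subfield', 'predicted' (the model's chosen option), and 'answer' (the key).
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for record in records:
        name = record["subfield"]
        total[name] += 1
        # Compare option letters case-insensitively, ignoring stray whitespace.
        if record["predicted"].strip().upper() == record["answer"].strip().upper():
            correct[name] += 1
    return {name: correct[name] / total[name] for name in total}


# Made-up records for illustration only; these are not benchmark results.
records = [
    {"subfield": "token economics", "predicted": "B", "answer": "C"},
    {"subfield": "token economics", "predicted": "a", "answer": "A"},
    {"subfield": "blockchain infrastructure", "predicted": "D", "answer": "D"},
]

scores = subfield_accuracy(records)
for name, acc in sorted(scores.items()):
    print(f"{name}: {acc:.2f}")
```

Reporting one score per subfield rather than a single aggregate is what would let a gap between, say, infrastructure and token economics show up at all. Open-ended tasks such as contract debugging would need a different scoring rule than the exact-match comparison assumed here.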