A Validated Semantics for LLVM IR

Ph.D. dissertation by Juneyoung Lee (download)

Submitted to the Dept. of Computer Science and Engineering, Seoul National University, Korea
Recipient of Best Thesis Award from College of Engineering, SNU (one thesis chosen per dept.)

Thesis committee

Prof. Bernhard Egger (Chair), Seoul National University, Korea
Prof. Chung-Kil Hur (Advisor), Seoul National University, Korea
Prof. Jae W. Lee, Seoul National University, Korea
Prof. Nuno P. Lopes (Advisor), Microsoft Research, UK
Prof. Kyungmin Bae, POSTECH, Korea

Abstract

Intermediate representation (IR) is a language that is used internally by a compiler to represent programs. Translation to an IR should preserve guarantees from the source language's specification because they enable various optimizations. This naturally makes an IR a language that is rich with high-level information.

In LLVM, the semantics of important high-level features in IR was not rigorously defined. It caused compiler optimizations in LLVM to use different interpretations, and bad interactions between the optimizations resulted in miscompilation bugs that are hard to fix. To solve this problem, the IR’s semantics must be defined precisely. Then, optimizations that are incorrect with respect to the chosen semantics must be fixed. Both processes are challenging because LLVM is a large, fastly evolving software.

This thesis proposes (1) formal semantics of LLVM IR that resolves critical problems that we have found in the old IR semantics, making it consistent (2) a translation validation framework for LLVM's optimizations to validate the new semantics. We show that the old semantics of undefined behavior and memory model in the IR cannot explain important optimizations in LLVM. We propose new semantics that solves this problem. Next, we present Alive2, a translation validation framework for LLVM based on the new semantics. Alive2 relies on an automatic theorem prover to validate optimizations without any hints from LLVM. It supports most of integer and float operations, memory operations, function calls, and branches. To make validation practical, resources used by the tool is bounded.

The new formal semantics of undefined behavior has been adopted by LLVM. The `freeze' instruction that is proposed by us is officially added into LLVM 10.0, and the official document is updated to use our semantics. Also, critical problems in the old memory model we have found were shared with compiler developers, and patches have landed in LLVM to fix it. Alive2 has found more than 50 miscompilation bugs in LLVM so far and is used daily by LLVM developers.