CS224 Program Analysis@Shanghaitech 24 Fall Notes

yjmstr / 2024-10-10 / 原文

1. Introduction

Rice's Theorem

Static Analysis analyzes a program P to reason about its behaviors and determines whether it satisfies some properties before running P.

Rice's Theorem: Any non-trivial property of the behavior of programs in a recursively enumerable language is undecidable.

我们可以把任意程序看成一个从输入到输出上的函数（输入输出对的集合），该函数描述了程序的行为关于该函数/集合的任何非平凡属性，都不存在可以检查该属性的通用算法

Regular languages $\in$ Context-Free Languages $\in$ Recursive Languages $\in$ Recursively Enumerable Languages $\in$ Non-Recursively Enumerable Languages

A language is recursively enumerable (r.e.) if it is the set of strings accepted by some TM.
A language is recursive if it is the set of strings accepted by some TM that halts on every input.
A property is trivial if either it is not satisfied by any r.e. language, or if it satisfies by all r.e. languages; otherwise it is non-trivial（平凡属性，要么对全体程序都为真，要么对全体程序都为假）

Examples of trivial properties:

The programs are either the empty or they contain at least one word.
The length of a program must be finite.

Approximate: Useful Static Analysis

Original decision problem: output Yes or No

Approximate solution to decision problem: Yes, No or Unknown

Must Analysis, lower/under approximate: only output yes or unknown （outputs information that must be true）
May Analysis, upper/over approximate: only output no or unknown （outputs information that may be true）

Objective: Answer Yes or No as much as possible, answer unknown as little as possible

假设正确答案是一个集合 S，must analysis 总是返回 S 的子集，may analysis 总是返回集合 S 的超集

Necessity of Soundness

Soundness (健壮性) in static-analysis refers to the property that if an analysis reports no errors, then the program is guaranteed to be free of the class of errors being checked. In other words, a sound static analysis will not miss any real bugs or defects in the code that it claims to detect. 可以理解为允许误报，但不能漏报。

Completeness (完备性) 可以理解为可能漏报，但不误报的情况。

Static Analysis: ensure (or get close to) soundness, while making good trade-offs between analysis precision and analysis speed.

Abstraction + Over-approximation

Two Words to Conclude Static Analysis (for most analysis): Abstraction + Over-approximation

Abstraction

e.g: ints -> signs (+, -, 0, ⊤(unknown), ⊥(undefined))

Overapproximation

transfer function

In static analysis, transfer functions define how to evaluate different program statements on abstract values.

Transfer functions are defined according to “analysis problem” and the “semantics” of different program statements

e.g: '+' + '-' = '⊤'(unknown)

control flows

As it’s impossible to enumerate all paths in practice, flow merging (as a way of over-approximation) is taken for granted in most static analyses.

2. IR

一般基于 IR 做分析，而不是 AST。

AST

high-level and closed to grammar structure •
usually language dependent
suitable for fast type checking
lack of control flow information

low-level and closed to machine code
usually language independent
compact and uniform
contains control flow information
usually considered as the basic for static analysis

3-Address Code (3AC)

at most one operator on the right side of an instruction.

e.g

t2 = a + b + 3 is not 3AC
t1 = a + b is 3AC

each 3AC contains at most 3 addresses. Address can be one of the following:

Name: such as a, b
Constant: such as 3
Compiler-generated temporary: such as t1, t2

Some Common 3AC Forms:

x = y bop z ( bop: binary arithmetic or logical operation)
x = uop y (uop: unary operation (minus, negation, casting) )
x = y
goto L
if x goto L
if x rop y goto L (rop: relational operator (>,<,==,>=,<=, etc))

Soot

Soot is the most pupular static analysis framework for java

Its IR is Jimple: typed 3-address code

Static Single Assignment (SSA) (Optional)

All assignments in SSA are to variables with distinct names

Give each definition a fresh name
Propagate fresh name to subsequent uses
Every variable has exactly one definition

When a variable use is at control flow merges, a special merge operator, ∅ (called phi-function), is introduced to select the values at merge nodes

Why SSA

Flow information is indirectly incorporated into the unique variable names.

Define-and-Use pairs are explicit

Why not SSA

SSA may introduce too many variables and phi-functions

May introduce inefficiency problem when translating to machine code (due to copy operations)

Control Flow Analysis

Usually refer to building Control Flow Graph (CFG). CFG serves as the basic structure for static analysis.

The node in CFG can be an individual 3-address instruction, or (usually) a Basic Block (BB)

Basic Block

Basic blocks(BB) are maximal sequences of consecutive 3AC with the properties that

it can be entered only at the beginning, i.e., the first instruction in the block.
it can be exited only at the end, i.e., the last instruction in the block.

How to build Basic blocks

INPUT: A sequence of three-address instructions of P

OUTPUT: A list of basic blocks of P

METHOD:

Determine the leaders in P
- The first instruction in P is a leader
- Any target instruction of a conditional or unconditional jump is a leader
- Any instruction that immediately follows a conditional or unconditional jump is a leader
Build BBs for P
- A BB consists of a leader and all its subsequent instructions until the next leader

Control Flow Graph

The nodes of CFG are basic blocks

There is an edge from block A to block B if and only if

There is a conditional or unconditional jump from the end of A to the beginning of B
B immediately follows A in the original order of instructions and A does not end in an unconditional jump

We say that A is a predecessor of B, and B is a successor of A

Usually we add two nodes, Entry and Exit.

They do not correspond to executable IR
A edge from Entry to the BB containing the first instruction of IR
A edge to Exit from any BB containing an instruction that could be the last instruction of IR

3. Data Flow Analysis

Each execution of an IR statement transforms an input state to a new output state，the input (output) state is associated with program point before (after) the statement

Data-flow analysis is to find a solution to a set of safe-approximation-directed constraints on the IN[s]’s and OUT[s]’s, for all statements s.

constraints based on semantics of statements (transfer functions)
- Forward Analysis: OUT[s] = f_s(IN[s])
- Backward Analysis: IN[s] = f_s(OUT[s])
constraints based on the flows of control
- Control flow with a BB
  - $IN[𝑆_{𝑖+1}] = OUT[S_i ], \forall i = 1,2,…,n-1$
- Control flow among BBs
  - $IN[B]=IN[S_1]$
  - $OUT[B]=OUT[S_n]$
  - $OUT[B]=f_B(IN[B]),𝑓_𝐵 = 𝑓_{s_n} ∘ ⋯ ∘ 𝑓𝑠_2 ∘ 𝑓𝑠_1$
  - $IN[B] = ∧\ P \ a \ predecessor \ of \ B \ OUT[P]$
  - 反向的情况类似

Reaching Definitation Analysis

A definition d at program point p reaches a point q if there is a path from p to q such that d is not “killed” along that path

A definition of a variable v is a statement that assigns a value to v
Translated as: definition of variable v at program point p reaches point q if there is a path from p to q such that no new definition of v appears on that path
Reaching definitions can be used to detect possible undefined variables. e.g., introduce a dummy definition for each variable v at the entry of CFG, and if the dummy definition of v reaches a point p where v is used, then v may be used before definition (as undefined reaches v)

The definitions of all the variables in a program can be represented by bit vectors

e.g 用 0000 表示 4 个变量当前都没有定义

$D: v = x \ op \ y$ “generates” a definition D of variable v and “kills” all the other definitions in the program that defines variable v, while leaving the remaining incoming definitions unaffected

Transfer Function:

$OUT[B]=gen_B\cup(IN[B]-kill_B)$

Control Flow:

$IN[B]=\cup _{P\ a\ predecessor \ of \ B} \ OUT[P]$

Algorithm of Reaching Definition Analysis

Input: CFG with $kill_B$ and $gen_B$ computed for each basic block B

Output: IN[B] and OUT[B] for each basic block B

method:

$OUT[entry] = \empty$
$for (each \ basic \ block \ B \ except \ entry), \ OUT[B]=\empty $
$while (changes\ to\ any\ OUT\ occur)$
- $for (each \ basic \ block \ B \ except \ entry)\{$
  - $IN[B]=\cup _{P\ a\ predecessor \ of \ B} \ OUT[P]$
  - $OUT[B]=gen_B\cup(IN[B]-kill_B)$
- $\}$

why it can finally stop?

$gen_S$ and $kill_S$ remain unchanged for each basic block S
When more facts flow in IN[S], the “more facts” either
- is killed, or
- is $IN[S]-kill_s$($survivor_S$)
When a fact is added to OUT[S], through either $gen_s$ , or $survivor_S$ , it stay there forever
Thus OUT[S] never shrinks (e.g., 0 → 1 or 1→ 1)
As the set of facts in finite (e.g., all definitions in the program, there must exist a pass of iteration during which nothing is added to any OUT, and then the algorithm terminates

Live Variables Analysis

Live variables analysis tells whether the value of variable v at program point p could be used along some path in CFG starting at p. If so, v is live at p; otherwise, v is dead at p.

information of live variables can be used for register allocations. e.g., at some point all registers are full and we need to use one, then we should favor using a register with a dead value.

类似地，用 bit vector 来表示 variables

如果要判断 v 在 p 点是否 live，可以从后往前推，看看路径上是否重新定义了 v，如果在 v 被使用前重新定义了 v，那么 v 在 p 点就是 dead 的。

control flow:

OUT[B] = $\cup _{S\ a\ successor\ of B}IN[S]$

transfer function:

IN[B] = $use_B\cup(OUT[B]-def_B)$

Algorithm of Live Variables Analysis

Input: CFG with $use_B$ and $def_B$ computed for each basic block B

Output: IN[B] and OUT[B] for each basic block B

Method:

$IN[exit] = \empty$
$for (each \ basic \ block \ B \ except \ exit), \ IN[B]=\empty $
$while (changes\ to\ any\ IN\ occur)$
- $for (each \ basic \ block \ B \ except \ exit)\{$
  - $OUT[B]=\cup _{S\ a\ successor \ of \ B} \ IN[S]$
  - $IN[B]=use_B\cup(OUT[B]-def_B)$
- $\}$