doc/handbook/verification.tex

   1 \section{Verification}
   2
   3 \newcommand{\cmdline}[1]{\texttt{#1}}
   4 \newcommand{\srcfile}[1]{\texttt{#1}}
   5 \newcommand{\code}[1]{\texttt{#1}}
   6 \newcommand{\asgn}{\sqsubseteq}
   7 \newcommand{\qed}{$\Box$}
   8 \newcommand{\norm}[1]{\textrm{norm}\left(#1\right)}
   9 \newcommand{\reft}[2]{\langle #1, #2\rangle}
  10 \newtheorem{definition}{Definition}
  11 \newtheorem{lemma}{Lemma}
  12 \newtheorem{theorem}{Theorem}
  13 \newenvironment{proof}{\noindent\emph{Proof.} }{}
  14
  15 Cacao verifies class files according to the JVM
  16 Specification \cite{javavm99} with some exceptions. We will first give
  17 an overview of the verifier implementation and then discuss specific
  18 points where the implementation deviates from the official JVM
  19 Specification.
  20
  21 You can turn off class file verification using the \cmdline{-noverify}
  22 option\footnote{The \cmdline{-noverify} option disables verifier
  23 checks which are especially time-consuming. Checks which can be
  24 performed without significant overhead are always enabled.} when
  25 invoking
  26 \cmdline{cacao}.
  27
  28 \subsection{Implementation Overview}
  29
  30 The verification checks which are concerned with the overall structure
  31 of the class file and the contents of the constant pool (logical
  32 passes 1 and 2 in the specification \cite{javavm99}) reside in
  33 \srcfile{loader.c} (with UTF-8 validation in
  34 \srcfile{tables.c}). These passes of the verifier are executed when a
  35 class file is loaded and linked, before the JIT starts to compile any
  36 method within the loaded class.
  37
  38 The checks concerning the bytecode array of a method (logical passes 3
  39 and 4 in the specification \cite{javavm99}) are performed when the JIT
  40 compiles the method\footnote{Usually when the method is called for the
  41 first time.}. These checks are implemented in
  42 \srcfile{jit/parse.c}, \srcfile{jit/stack.c} and
  43 \srcfile{jit/typecheck.c}. The files \srcfile{typeinfo.[ch]}
  44 implement the type system used by \srcfile{jit/typecheck.c}.
  45
  46 \srcfile{jit/parse.c} contains the following checks which are performed
  47 while translating the Java Bytecode to the intermediate
  48 representation:
  49
  50 \begin{itemize}
  51 \item Detect invalid opcodes,
  52 \item detect unexpected end of bytecode,
  53 \item check if branch targets are within bounds,
  54 \item check if branch targets are valid instruction starts,
  55 \item check constant pool indices and types,
  56 \item check local variable indices (no type check),
  57 \item check exception handler ranges.
  58 \end{itemize}
  59
  60 The code in \srcfile{jit/stack.c} builds a model of the operand stack
  61 and checks the intermediate instructions for type safety regarding the
  62 five basic stack slot types: integer\footnote{The basic \emph{integer}
  63 type is the union of the Java types \code{int}, \code{short},
  64 \code{byte}, \code{char} and \code{boolean}.}, float, long, double and
  65 address. (The basic type \emph{address} encompasses all reference and
  66 \code{returnAddress} types.) Checks performed while building the stack
  67 model include:
  68
  69 \begin{itemize}
  70 \item Detect operand stack underflow,
  71 \item detect operand stack overflow,
  72 \item detect undefined stack depth at merging points,
  73 \item detect undefined basic stack slot types at merging points,
  74 \item check type-safe use of stack slots regarding the five basic
  75       types.
  76 \end{itemize}
  77
  78 The stack model and the intermediate instructions then enter the type
  79 checker implemented in \srcfile{jit/typecheck.c}. The type checker
  80 uses a special type system to perform the following checks:
  81
  82 \begin{itemize}
  83 \item Check type-safe use of reference types,
  84 \item check type-safe use of local variables,
  85 \item detect invalid use of uninitialized objects,
  86 \item check type-safe method invocations and field access,
  87 \item check local subroutines and use of \code{returnAddress}es,
  88 \item check type-safe use of \code{return} instructions,
  89 \item check access rights.
  90 \end{itemize}
  91
  92 \subsection{Identifiers}
  93
  94 Cacao currently does not check if the names of classes, methods and
  95 fields are valid Java identifiers since this would require including a
  96 Unicode database in Cacao. Instead the following checks are performed:
  97
  98 \begin{itemize}
  99 \item Names must be valid non-empty UTF-8 strings.
 100 \item Names must not contain ASCII control characters or the
 101       \code{NUL}-character.
 102 \item A method name starting with the ASCII character `\code{<}' must be
 103       either \code{<init>} or \code{<clinit>}.
 104 \item A \code{NameAndType} entry in the constant pool may not
 105       reference a method named \code{<clinit>}.
 106 \item Field names may not start with `\code{<}'.
 107
 108 \end{itemize}
 109
 110 \subsection{Reference Types}
 111
 112 The approach for checking reference types described in the JVM
 113 Specification \cite{javavm99} does only work well for single
 114 inheritance since the rule for merging reference types is taking the
 115 least common ancestor of the types. When multiple inheritance via
 116 interfaces is used this can lead to type-safe programs being rejected
 117 by the verifier.
 118
 119 Cacao's rule for merging reference types is equivalent to using set
 120 union instead of the least common ancestor. To optimize for the common
 121 case of trivial merges\footnote{Merges of a class $C$ with itself or
 122 one of its superclasses or subclasses.}, however, the Cacao
 123 implementation uses a slightly more complex scheme which is detailed
 124 below.
 125
 126 \subsubsection{Array Classes}
 127
 128 For the purpose of verification Cacao assumes that array classes have
 129 subclass relationships induced by their component types. For example
 130 \code{Integer[]} is considered a subclass of \code{Number[]}. The
 131 class \code{Object[]} and all primitive arrays are considered to be
 132 subclasses of a (pseudo) class \code{Arraystub} with the following
 133 declaration:
 134
 135 \begin{verbatim}
 136 class Arraystub extends Object implements Cloneable, java.io.Serializable
 137 {
 138 }
 139 \end{verbatim}
 140
 141 \subsubsection{Formal Treatment}
 142
 143 \begin{definition}
 144 A reference type in Cacao is an ordered pair $\reft{C}{M}$ where $C$
 145 is a class\footnote{In this section the term `class' includes Java
 146 classes, Java interfaces and Java array classes.} and M is a (possibly
 147 empty) set\footnote{Implemented as a sorted list.} of proper
 148 subclasses of $C$. The Cacao type corresponding to a simple Java
 149 class, interface, or array class $C$ is $\reft{C}{\emptyset}$.
 150 \end{definition}
 151
 152 The following rule is used to check if $\reft{C}{M}$ is assignable
 153 (denoted $\sqsubseteq$) to the class $K$:
 154
 155 \begin{eqnarray*}
 156 \reft{C}{M}\asgn K
 157 & \Longleftrightarrow &
 158 (C\asgn K) \vee ((M \neq \emptyset) \wedge
 159 (\forall c \in M: c\asgn K)) \\
 160 & \Longleftrightarrow &
 161 \forall c \in \textrm{cls}(\reft{C}{M}): c\asgn K
 162 \end{eqnarray*}
 163
 164 where
 165
 166 \[\textrm{cls}(\reft{C}{M}) =
 167 \left\{ \begin{array}{ll} \{C\} & \mbox{if $M = \emptyset$} \\
 168                          M     & \mbox{if $M \neq \emptyset$}
 169        \end{array}
 170 \right.\]
 171
 172 The rule for merging two reference types is:
 173
 174 \[\reft{C_1}{M_1} \sqcup \reft{C_2}{M_2} =
 175 \norm{\textrm{lca}(C_1,C_2),\textrm{cls}(\reft{C_1}{M_1}) \cup \textrm{cls}(\reft{C_2}{M_2})}\]
 176
 177 where $\textrm{lca}(C_1,C_2)$ denotes the least common ancestor of
 178 $C_1$ and $C_2$ and
 179
 180 \[\norm{C,M} =
 181 \left\{ \begin{array}{ll} \reft{C}{\emptyset} & \mbox{if $C \in M$} \\
 182                           \reft{C}{M}         & \mbox{if $C \notin M$}
 183        \end{array}
 184 \right.\]
 185
 186 We will now prove that the merging rule is sound in the sense that it
 187 does not lead to the verification of unsafe programs or to the
 188 rejection of type-safe programs.
 189
 190 \begin{lemma}
 191 \label{lemma:sub}
 192 For any class $C$ and any reference type $\reft{C_2}{M_2}$ where $C_2$
 193 is $C$ or a subclass of $C$
 194
 195 \[\reft{C}{\emptyset}\asgn K \Longrightarrow \reft{C_2}{M_2}\asgn K\]
 196
 197 for all classes $K$.
 198 \end{lemma}
 199 \begin{proof}
 200 All classes in $\textrm{cls}(\reft{C_2}{M_2})$ are $C$ or subclasses
 201 of $C$ and thus assignable to $C$. The lemma follows by transitivity.\hfill\qed
 202 \end{proof}
 203
 204 \begin{theorem}
 205 For any two reference types $T_1=\reft{C_1}{M_1}$ and $T_2=\reft{C_2}{M_2}$ the merging rule satisfies
 206
 207 \[((T_1\asgn K) \wedge (T_2\asgn K))
 208 \Longleftrightarrow (T_1 \sqcup T_2) \asgn K\]
 209
 210 for all classes $K$.
 211 \end{theorem}
 212
 213 \begin{proof}
 214 Let $T = T_1 \sqcup T_2 = \reft{C}{M}$ and
 215 $M_V = \textrm{cls}(T_1) \cup \textrm{cls}(T_2)$.
 216
 217 \noindent\textit{Case 1: $C \notin M_V$.} Because of $M = M_V \neq
 218 \emptyset$ we have $\textrm{cls}(T) = M_V$ and the
 219 theorem becomes
 220
 221 \[\forall K:
 222 ((\forall c \in \textrm{cls}(T_1): c\asgn K)
 223 \wedge
 224 (\forall c \in \textrm{cls}(T_2): c\asgn K))
 225 \Longleftrightarrow
 226 \forall c \in \textrm{cls}(T_1) \cup \textrm{cls}(T_2): c\asgn K\]
 227
 228 which is true by definition of $\cup$.
 229
 230 \noindent\textit{Case 2: $C \in M_V$.} We have $C =
 231 \textrm{lca}(C_1,C_2)$ and $M = \emptyset$. Because all elements of
 232 $M_1$ and $M_2$ are proper subclasses of $C_1$ and $C_2$ respectively
 233 we know that $C \notin M_1 \cup M_2$. From this and $C \in M_V$
 234 follows that $C \in \{C_1\}\cup\{C_2\}$ and so at least one of $M_1$,
 235 $M_2$ must be $\emptyset$. Without reduction of generality let $M_1 =
 236 \emptyset$. If $C = C_1$ we get
 237
 238 \[
 239 ((\reft{C}{\emptyset}\asgn K) \wedge (\reft{C_2}{M_2}\asgn K))
 240 \Longleftrightarrow
 241 \reft{C}{\emptyset}\asgn K
 242 \]
 243
 244 \noindent From left to right this is trivial and from right to left it
 245 follows from Lemma~\ref{lemma:sub}.
 246
 247 \noindent If $C \neq C_1$ then $C = C_2$ and $M_2 = \emptyset$
 248 (because of $C \in M_V = \{C_1\}\cup\textrm{cls}(T_2)$) and we get:
 249
 250 \[
 251 ((\reft{C_1}{\emptyset}\asgn K) \wedge (\reft{C}{\emptyset}\asgn K))
 252 \Longleftrightarrow
 253 \reft{C}{\emptyset}\asgn K
 254 \]
 255
 256 \noindent which is proven in the same way.\hfill\qed
 257
 258 \end{proof}
 259
 260
 261 \subsection{Local Subroutines}
 262
 263 Cacao uses a very different scheme for type checking local subroutines
 264 than that implied by the JVM Specification. We will now discuss the
 265 reasons for this deviation and explain why the official specification
 266 should be considered flawed with respect to local subroutines (see
 267 also \cite{Coglio01}).
 268
 269 Type checking local subroutines tends to be the most complex part of
 270 bytecode verification. This arises from the requirement \cite{javavm99}
 271 that subroutines be polymorphic over the types of local variables that
 272 they do not use\footnote{Intuitively this means that if a subroutine
 273 does not use a certain local variable this variable will maintain its
 274 type across any \code{jsr} `call' of that subroutine.}. This
 275 requirement cannot be met by a simple data flow analysis which would
 276 merge the types of all local variables entering a subroutine via
 277 different \code{jsr} instructions.
 278
 279 The JVM Specification suggests an implementation of the verifier which
 280 tries to model a kind of call stack for local subroutines. The problem
 281 with this approach -- apart from the severe restrictions it inflicts
 282 on subroutines -- is that the behavior of exceptions is undefined with
 283 respect to this (fictitious) call stack. When an exception can be
 284 thrown at a certain point in a subroutine it is impossible for the
 285 verifier to determine whether the responsible handler should be
 286 considered to be `inside' or `outside' the subroutine. Thus the
 287 fundamental flaw of this approach is the implicit assumption that
 288 subroutines have well-defined boundaries, which is not the
 289 case\footnote{At least it is not guaranteed by the documented
 290 constraints on the bytecode and the semantics of \code{jsr} and
 291 \code{ret}.}.
 292
 293 Cacao has therefore adopted an alternative algorithm for type checking
 294 local subroutines. The algorithm is formally described and proven to
 295 be sound in \cite{Coglio02}. We will now give an informal explanation
 296 of the algorithm's idea and discuss a certain restriction which the
 297 Cacao implementation introduces in addition to those in
 298 \cite{Coglio02}.
 299
 300 The basic idea is to treat the \code{jsr} instruction like any other
 301 control flow instruction while avoiding the `lossy' merging of types.
 302 We denote as $(s,v)$ a \emph{pair}, where $s$ is a list of types
 303 corresponding to the stack slots and $v$ is a vector of types
 304 corresponding to the local variables at a certain bytecode
 305 instruction. A simple data flow analysis would map every position in
 306 the bytecode to a single pair. The enhanced algorithm maps every
 307 position in the bytecode to a non-empty set of pairs representing the
 308 alternative possible `assignments' of types to the stack slots and
 309 local variables at this position. When control flow is merged at a
 310 certain point we just compute the union of the sets of pairs which
 311 are propagated from the predecessors of the merging point. This way
 312 when we are type checking a subroutine we can maintain all possible
 313 types that a (possibly unused) variable can have `inside' the
 314 subroutine.
 315
 316 The second important measure is typing \code{returnAddress}es by the
 317 bytecode address they return to. This means if we encounter a
 318 \code{jsr} instruction we assign the stack top a type $\code{ret}_p$
 319 where $p$ is the address of the instruction following the \code{jsr}.
 320 Thus when we reach a $\code{ret}\ i$ instruction we just have to look
 321 at the types assigned to the variable $i$ in each pair for the current
 322 instruction\footnote{The type of variable $i$ must be a type of the
 323 form $\code{ret}_x$ in each pair, otherwise the bytecode is invalid.}
 324 and we know the target addresses that the \code{ret} can reach and
 325 which pairs propagate to which target address.
 326
 327 The Cacao implementation differs from the algorithm in \cite{Coglio02}
 328 in one point: Cacao always merges stack slot types (with the exception
 329 of \code{returnAddress}es) when merging control flow. In other words:
 330 Cacao does not keep a set of pairs $(s,v)$, instead it keeps a single
 331 list of types\footnote{Stored directly in the stack model.} $s$ for the
 332 stack and a set of vectors\footnote{See the
 333 \code{typevector\_} and \code{typevectorset\_} functions in
 334 \srcfile{typeinfo.c}.} $\{v_1, v_2, \dots\}$ for the local variables.
 335 \code{returnAddress}es on the stack contain a special list which keeps a
 336 $\code{ret}_{x_i}$ type for every vector $v_i$ in the set of local
 337 variable vectors. This design was chosen in order to keep the impact
 338 of subroutine verification on the stack model and the code dealing
 339 with stack operations minimal. The consequence of this is that
 340 subroutines may not be polymorphic with respect to stack slot
 341 types. This is a restriction when compared to the original algorithm
 342 in \cite{Coglio02} but it is in agreement with the JVM Specification,
 343 which requires that at any instruction in the bytecode the stack depth
 344 and the types of the stack slots must be well-defined.