\section{Verification} \newcommand{\cmdline}[1]{\texttt{#1}} \newcommand{\srcfile}[1]{\texttt{#1}} \newcommand{\code}[1]{\texttt{#1}} \newcommand{\asgn}{\sqsubseteq} \newcommand{\qed}{$\Box$} \newcommand{\norm}[1]{\textrm{norm}\left(#1\right)} \newcommand{\reft}[2]{\langle #1, #2\rangle} \newtheorem{definition}{Definition} \newtheorem{lemma}{Lemma} \newtheorem{theorem}{Theorem} \newenvironment{proof}{\noindent\emph{Proof.} }{} Cacao verifies class files according to the JVM Specification \cite{javavm99} with some exceptions. We will first give an overview of the verifier implementation and then discuss specific points where the implementation deviates from the official JVM Specification. You can turn off class file verification using the \cmdline{-noverify} option\footnote{The \cmdline{-noverify} option disables verifier checks which are especially time-consuming. Checks which can be performed without significant overhead are always enabled.} when invoking \cmdline{cacao}. \subsection{Implementation Overview} The verification checks which are concerned with the overall structure of the class file and the contents of the constant pool (logical passes 1 and 2 in the specification \cite{javavm99}) reside in \srcfile{loader.c} (with UTF-8 validation in \srcfile{tables.c}). These passes of the verifier are executed when a class file is loaded and linked, before the JIT starts to compile any method within the loaded class. The checks concerning the bytecode array of a method (logical passes 3 and 4 in the specification \cite{javavm99}) are performed when the JIT compiles the method\footnote{Usually when the method is called for the first time.}. These checks are implemented in \srcfile{jit/parse.c}, \srcfile{jit/stack.c} and \srcfile{jit/typecheck.c}. The files \srcfile{typeinfo.[ch]} implement the type system used by \srcfile{jit/typecheck.c}. \srcfile{jit/parse.c} contains the following checks which are performed while translating the Java Bytecode to the intermediate representation: \begin{itemize} \item Detect invalid opcodes, \item detect unexpected end of bytecode, \item check if branch targets are within bounds, \item check if branch targets are valid instruction starts, \item check constant pool indices and types, \item check local variable indices (no type check), \item check exception handler ranges. \end{itemize} The code in \srcfile{jit/stack.c} builds a model of the operand stack and checks the intermediate instructions for type safety regarding the five basic stack slot types: integer\footnote{The basic \emph{integer} type is the union of the Java types \code{int}, \code{short}, \code{byte}, \code{char} and \code{boolean}.}, float, long, double and address. (The basic type \emph{address} encompasses all reference and \code{returnAddress} types.) Checks performed while building the stack model include: \begin{itemize} \item Detect operand stack underflow, \item detect operand stack overflow, \item detect undefined stack depth at merging points, \item detect undefined basic stack slot types at merging points, \item check type-safe use of stack slots regarding the five basic types. \end{itemize} The stack model and the intermediate instructions then enter the type checker implemented in \srcfile{jit/typecheck.c}. The type checker uses a special type system to perform the following checks: \begin{itemize} \item Check type-safe use of reference types, \item check type-safe use of local variables, \item detect invalid use of uninitialized objects, \item check type-safe method invocations and field access, \item check local subroutines and use of \code{returnAddress}es, \item check type-safe use of \code{return} instructions, \item check access rights. \end{itemize} \subsection{Identifiers} Cacao currently does not check if the names of classes, methods and fields are valid Java identifiers since this would require including a Unicode database in Cacao. Instead the following checks are performed: \begin{itemize} \item Names must be valid non-empty Java `Utf8' strings. \item Names must not contain ASCII control characters or the \code{NUL}-character. \item A method name starting with the ASCII character `\code{<}' must be either \code{} or \code{}. \item A \code{NameAndType} entry in the constant pool may not refer to a method named \code{}. \item Field names may not start with `\code{<}'. \end{itemize} \subsection{Reference Types} The approach for checking reference types described in the JVM Specification \cite{javavm99} does only work well for single inheritance since the rule for merging reference types is taking the least common ancestor of the types. When multiple inheritance via interfaces is used this can lead to type-safe programs being rejected by the verifier. Cacao's rule for merging reference types is equivalent to using set union instead of the least common ancestor. To optimize for the common case of trivial merges\footnote{Merges of a class $C$ with itself or one of its superclasses or subclasses.}, however, the Cacao implementation uses a slightly more complex scheme which is detailed below. \subsubsection{Array Classes} For the purpose of verification Cacao assumes that array classes have subclass relationships induced by their component types. For example \code{Integer[]} is considered a subclass of \code{Number[]}. The class \code{Object[]} and all primitive arrays are considered to be subclasses of a (pseudo) class \code{Arraystub} with the following declaration: \begin{verbatim} class Arraystub extends Object implements Cloneable, java.io.Serializable { } \end{verbatim} \subsubsection{Formal Treatment} \begin{definition} A reference type in Cacao is an ordered pair $\reft{C}{M}$ where $C$ is a class\footnote{In this section the term `class' includes Java classes, Java interfaces and Java array classes.} and M is a (possibly empty) set\footnote{Implemented as a sorted list.} of proper subclasses of $C$. The Cacao type corresponding to a simple Java class, interface, or array class $C$ is $\reft{C}{\emptyset}$. \end{definition} The following rule is used to check if $\reft{C}{M}$ is assignable (denoted $\sqsubseteq$) to the class $K$: \begin{eqnarray*} \reft{C}{M}\asgn K & \Longleftrightarrow & (C\asgn K) \vee ((M \neq \emptyset) \wedge (\forall c \in M: c\asgn K)) \\ & \Longleftrightarrow & \forall c \in \textrm{cls}(\reft{C}{M}): c\asgn K \end{eqnarray*} where \[\textrm{cls}(\reft{C}{M}) = \left\{ \begin{array}{ll} \{C\} & \mbox{if $M = \emptyset$} \\ M & \mbox{if $M \neq \emptyset$} \end{array} \right.\] The rule for merging two reference types is: \[\reft{C_1}{M_1} \sqcup \reft{C_2}{M_2} = \norm{\textrm{lca}(C_1,C_2),\textrm{cls}(\reft{C_1}{M_1}) \cup \textrm{cls}(\reft{C_2}{M_2})}\] where $\textrm{lca}(C_1,C_2)$ denotes the least common ancestor of $C_1$ and $C_2$ and \[\norm{C,M} = \left\{ \begin{array}{ll} \reft{C}{\emptyset} & \mbox{if $C \in M$} \\ \reft{C}{M} & \mbox{if $C \notin M$} \end{array} \right.\] We will now prove that the merging rule is sound in the sense that it does not lead to the verification of unsafe programs or to the rejection of type-safe programs. \begin{lemma} \label{lemma:sub} For any class $C$ and any reference type $\reft{C_2}{M_2}$ where $C_2$ is $C$ or a subclass of $C$ \[\reft{C}{\emptyset}\asgn K \Longrightarrow \reft{C_2}{M_2}\asgn K\] for all classes $K$. \end{lemma} \begin{proof} All classes in $\textrm{cls}(\reft{C_2}{M_2})$ are $C$ or subclasses of $C$ and thus assignable to $C$. The lemma follows by transitivity.\hfill\qed \end{proof} \begin{theorem} For any two reference types $T_1=\reft{C_1}{M_1}$ and $T_2=\reft{C_2}{M_2}$ the merging rule satisfies \[((T_1\asgn K) \wedge (T_2\asgn K)) \Longleftrightarrow (T_1 \sqcup T_2) \asgn K\] for all classes $K$. \end{theorem} \begin{proof} Let $T = T_1 \sqcup T_2 = \reft{C}{M}$ and $M_V = \textrm{cls}(T_1) \cup \textrm{cls}(T_2)$. \noindent\textit{Case 1: $C \notin M_V$.} Because of $M = M_V \neq \emptyset$ we have $\textrm{cls}(T) = M_V$ and the theorem becomes \[\forall K: ((\forall c \in \textrm{cls}(T_1): c\asgn K) \wedge (\forall c \in \textrm{cls}(T_2): c\asgn K)) \Longleftrightarrow \forall c \in \textrm{cls}(T_1) \cup \textrm{cls}(T_2): c\asgn K\] which is true by definition of $\cup$. \noindent\textit{Case 2: $C \in M_V$.} We have $C = \textrm{lca}(C_1,C_2)$ and $M = \emptyset$. Because all elements of $M_1$ and $M_2$ are proper subclasses of $C_1$ and $C_2$ respectively we know that $C \notin M_1 \cup M_2$. From this and $C \in M_V$ follows that $C \in \{C_1\}\cup\{C_2\}$ and so at least one of $M_1$, $M_2$ must be $\emptyset$. Without reduction of generality let $M_1 = \emptyset$. If $C = C_1$ we get \[ ((\reft{C}{\emptyset}\asgn K) \wedge (\reft{C_2}{M_2}\asgn K)) \Longleftrightarrow \reft{C}{\emptyset}\asgn K \] \noindent From left to right this is trivial and from right to left it follows from Lemma~\ref{lemma:sub}. \noindent If $C \neq C_1$ then $C = C_2$ and $M_2 = \emptyset$ (because of $C \in M_V = \{C_1\}\cup\textrm{cls}(T_2)$) and we get: \[ ((\reft{C_1}{\emptyset}\asgn K) \wedge (\reft{C}{\emptyset}\asgn K)) \Longleftrightarrow \reft{C}{\emptyset}\asgn K \] \noindent which is proven in the same way.\hfill\qed \end{proof} \subsection{Local Subroutines} Cacao uses a very different scheme for type checking local subroutines than that implied by the JVM Specification. We will now discuss the reasons for this deviation and explain why the official specification should be considered flawed with respect to local subroutines (see also \cite{Coglio01}). Type checking local subroutines tends to be the most complex part of bytecode verification. This arises from the requirement \cite{javavm99} that subroutines be polymorphic over the types of local variables that they do not use\footnote{Intuitively this means that if a subroutine does not use a certain local variable this variable will maintain its type across any \code{jsr} `call' of that subroutine.}. This requirement cannot be met by a simple data flow analysis which would merge the types of all local variables entering a subroutine via different \code{jsr} instructions. The JVM Specification suggests an implementation of the verifier which tries to model a kind of call stack for local subroutines. The problem with this approach -- apart from the severe restrictions it inflicts on subroutines -- is that the behavior of exceptions is undefined with respect to this (fictitious) call stack. When an exception can be thrown at a certain point in a subroutine it is impossible for the verifier to determine whether the responsible handler should be considered to be `inside' or `outside' the subroutine. Thus the fundamental flaw of this approach is the implicit assumption that subroutines have well-defined boundaries, which is not the case\footnote{At least it is not guaranteed by the documented constraints on the bytecode and the semantics of \code{jsr} and \code{ret}.}. Cacao has therefore adopted an alternative algorithm for type checking local subroutines. The algorithm is formally described and proven to be sound in \cite{Coglio02}. We will now give an informal explanation of the algorithm's idea and discuss a certain restriction which the Cacao implementation introduces in addition to those in \cite{Coglio02}. The basic idea is to treat the \code{jsr} instruction like any other control flow instruction while avoiding the `lossy' merging of types. We denote as $(s,v)$ a \emph{pair}, where $s$ is a list of types corresponding to the stack slots and $v$ is a vector of types corresponding to the local variables at a certain bytecode instruction. A simple data flow analysis would map every position in the bytecode to a single pair. The enhanced algorithm maps every position in the bytecode to a non-empty set of pairs representing the alternative possible `assignments' of types to the stack slots and local variables at this position. When control flow is merged at a certain point we just compute the union of the sets of pairs which are propagated from the predecessors of the merging point. This way when we are type checking a subroutine we can maintain all possible types that a (possibly unused) variable can have `inside' the subroutine. The second important measure is typing \code{returnAddress}es by the bytecode address they return to. This means if we encounter a \code{jsr} instruction we assign the stack top a type $\code{ret}_p$ where $p$ is the address of the instruction following the \code{jsr}. Thus when we reach a $\code{ret}\ i$ instruction we just have to look at the types assigned to the variable $i$ in each pair for the current instruction\footnote{The type of variable $i$ must be a type of the form $\code{ret}_x$ in each pair, otherwise the bytecode is invalid.} and we know the target addresses that the \code{ret} can reach and which pairs propagate to which target address. The Cacao implementation differs from the algorithm in \cite{Coglio02} in one point: Cacao always merges stack slot types (with the exception of \code{returnAddress}es) when merging control flow. In other words: Cacao does not keep a set of pairs $(s,v)$, instead it keeps a single list of types\footnote{Stored directly in the stack model.} $s$ for the stack and a set of vectors\footnote{See the \code{typevector\_} and \code{typevectorset\_} functions in \srcfile{typeinfo.c}.} $\{v_1, v_2, \dots\}$ for the local variables. \code{returnAddress}es on the stack contain a special list which keeps a $\code{ret}_{x_i}$ type for every vector $v_i$ in the set of local variable vectors. This design was chosen in order to keep the impact of subroutine verification on the stack model and the code dealing with stack operations minimal. The consequence of this is that subroutines may not be polymorphic with respect to stack slot types. This is a restriction when compared to the original algorithm in \cite{Coglio02} but it is in agreement with the JVM Specification, which requires that at any instruction in the bytecode the stack depth and the types of the stack slots must be well-defined.