\documentclass[12pt]{article}

\begin{document}

\title{Array Bound-Check Removal}
\author{Christopher Kruegel\\TU Vienna\\cacao@complang.tuwien.ac.at}
\maketitle

\section{Introduction}

In safe programming languages like Java, all array accesses are checked. It is assured that the index used is greater than or equal to zero and less than the array length; otherwise an ArrayIndexOutOfBoundsException is thrown. This gain in safety causes a run-time overhead that is especially unpleasant in loops, where these checks have to be performed many times. Often it is possible to remove these checks by inserting additional tests at the beginning of the loop and examining the code structure and loop condition, thereby saving run-time at every loop execution. The following algorithm performs this task for a restricted class of array accesses in loops.

It is not always possible to predict the value of an array index by inserting tests at the beginning of the loop. An example would be a global variable that is used as an array index: this variable could be changed by a different thread at virtually any time, so removing a bound check for such an array access is not a good idea. The algorithm therefore only removes bound checks when the induction variable (the variable used as array index) is local and is either left unchanged or modified only by adding or subtracting a constant inside the loop body. If a constant is used as index, the optimization can take place as well. When other variables are used to modify the induction variable, or when other arithmetic operations are involved, no optimization takes place. Nevertheless, the most common use of index variables in loops is an increment or decrement by a constant, so most array accesses are candidates for this optimization.

\section{Initialization}

Before array bound checks in loops can be eliminated, the loops have to be detected. The algorithm performs its analysis on the intermediate code level, so we cannot rely on just looking at while/for statements. A basic block analysis has already been completed, so the algorithm can work on this data structure directly. It uses the Lengauer-Tarjan algorithm to find existing loops, which is based on computing the dominator tree (a reference with extensive documentation of this algorithm can be found in \cite{tig}). It performs a depth-first search on the control flow graph to build a spanning tree; then the semidominator and dominator theorems are used to extract the dominator tree, and by looking at back edges (a back edge is an edge whose target node dominates its source node) all loops can be detected. Before each loop is examined, the case of different loops sharing the same header node has to be considered. The algorithm then processes each loop independently of all others and performs its optimization.

The procedures that analyze the control flow of the program, build a flow graph and extract the loops can be found in the files graph.c (for the control flow graph) and loop.c (for the loop extracting algorithm). The control flow graph is built simply by looking at the last instruction of each basic block and deciding which other blocks can be reached by that instruction. One problem can occur when lookupswitch/tableswitch instructions cause multiple jumps to the same node: it must be prevented that the target node's predecessor list contains that node more than once, so an additional array is needed to prevent double entries in the predecessor array.
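The following is a minimal sketch of this duplicate check, assuming simplified data structures (fixed-size arrays and a per-block marker array); the actual structures in graph.c differ.

\begin{verbatim}
#include <stdbool.h>

#define MAX_BLOCKS 1024

/* Hypothetical basic-block structure; the real one in graph.c differs. */
struct basicblock {
    int  pre[MAX_BLOCKS];   /* predecessor list                           */
    int  pre_count;         /* number of predecessors                     */
    bool seen[MAX_BLOCKS];  /* additional array: seen[s] is true if block
                               s was already entered as a predecessor     */
};

/* Add block `src` as predecessor of `dst`, ignoring duplicates that
   arise when a lookupswitch/tableswitch jumps to `dst` several times. */
void add_predecessor(struct basicblock *dst, int src)
{
    if (dst->seen[src])
        return;             /* duplicate edge from a switch: skip it */
    dst->seen[src] = true;
    dst->pre[dst->pre_count++] = src;
}
\end{verbatim}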
When the necessary flow data structures and the list of all loops have been built, the main algorithm can start.

\section{Algorithm}

The procedures for the main algorithm can be found in analyze.c. Before each loop is processed on its own, two global preprocessing steps are necessary.

The first step deals with loops that share the same header node. This special case can arise when a loop body ends with an if-then/else statement: the loop finding algorithm then reports two different loops which share the same header node. If additional tests were inserted at the header node and another loop sharing the same header were evaluated later, inserting different tests, problems could arise. To prevent this from happening, loops sharing the same header node are simply merged into a bigger one by taking the union of their nodes. Because the nodes of each loop are already sorted by increasing basic block numbers, a simple merge of the node lists can be done (analyze\_double\_headers); a sketch of such a merge is given after the list of special cases below.

The second step before the loop-by-loop analysis commences is the building of a loop hierarchy. Nested loops cause problems when variables that are used as array indexes elsewhere get modified. Because these modifications (e.g.\ an increment by a constant) can happen an unknown number of times, their results are unpredictable, so modifications in nested loops have to be recognized and reacted upon accordingly. The algorithm builds a tree where each node represents a loop, with its parent being the directly surrounding loop. A dummy root node, representing the whole function, has to be inserted as well. When loops have to be duplicated because of optimized array accesses, it is important to extend or duplicate exception entries as well. Because of that, exceptions are inserted into the tree as follows. Every exception is inserted at the node in the hierarchy that represents the loop directly containing this exception. When an exception is part of a nested loop, it is inserted only at the node representing this nested loop, because by following this node's parent pointers we find all other loops that contain the exception (analyze\_nested). Finally, the sequential order of the loops is determined by a topological sort satisfying the condition that all nested loops have to be optimized before their parent loop is processed. This is achieved by a post-order traversal of the hierarchy tree, skipping the root node.

After these global steps have been completed, each loop is processed. The loop is checked for array accesses (analyze\_for\_array\_access), and its processing ends if none are found. In addition, two special cases must be considered.

\begin{enumerate}
\item The algorithm uses the loop condition to guarantee certain bounds for variables. If the loop condition contains an or-statement, these guarantees cannot be upheld. If the loop statement reads
\begin{verbatim}
while ((var > 0) or true)
\end{verbatim}
var's value is not bound to be greater than zero. When an or-statement is found, loop optimization is stopped.
\item Variables that are used as indexes in array accesses might be modified in the exception handling code of try-catch statements inside the loop. Because code in catch blocks is not considered in the control flow graph, and because these modifications happen unpredictably (and should occur rarely), loops containing catch blocks with index variable modifications are not optimized. During the scan for array accesses within the loop, all interesting variables are recorded. A variable is considered interesting when it is used as an index in an array access or when its value is changed by a store instruction (analyze\_or\_exceptions and scan\_global\_list).
\end{enumerate}
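As mentioned above, the merge performed by analyze\_double\_headers can be pictured as the union of two sorted lists. The following is a minimal sketch under assumed data structures (plain arrays of basic block numbers); the real code operates on the compiler's internal node lists.

\begin{verbatim}
/* Union of two sorted lists of basic block numbers, as needed when two
   loops sharing a header node are merged into one. The arrays a and b
   are sorted in increasing order; the result is written to out.
   Returns the number of entries in out. */
int merge_loop_nodes(const int *a, int na, const int *b, int nb, int *out)
{
    int i = 0, j = 0, n = 0;

    while (i < na && j < nb) {
        if (a[i] < b[j])
            out[n++] = a[i++];
        else if (a[i] > b[j])
            out[n++] = b[j++];
        else {                 /* node occurs in both loops: keep once */
            out[n++] = a[i++];
            j++;
        }
    }
    while (i < na) out[n++] = a[i++];
    while (j < nb) out[n++] = b[j++];
    return n;
}
\end{verbatim}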
The next step is responsible for the analysis of the loop condition itself. The loop condition can create a so-called dynamic constraint on a variable. If one side of the loop condition is constant during the execution of the loop, the other side can be safely assumed to be less than, greater than or equal to, or equal to this side (depending on the operator) at the start of each loop pass. It is important to note the difference from so-called static constraints on variables: these are variables that can be safely assumed to stay below or above a certain constant during the whole loop execution after simply testing them prior to loop entry. This is obviously true for constants, but it also holds for variables that are only decremented or only incremented. For these, a static lower or upper bound can be guaranteed for the whole loop execution by inserting a single test prior to loop entry. Dynamic constraints, on the contrary, can vary in different parts of the loop: after the variable is tested in the loop condition, it might be changed to different values on different paths of execution. All variables that are never changed or that only get decremented/incremented have both static and dynamic constraints; all others can only have dynamic ones (init\_constraint).

Now the core bound check removal procedure is started. Beginning directly after the loop condition, all changes of variables that are used as index variables somewhere in the loop are recorded. This is an iterative process, since different paths of execution may yield different changes to these variables. Especially changes in nested loops cause these changes to become unpredictable, so that no upper/lower bound can be upheld any further. After this iterative process has made all changes stable, each array access is examined. For example, a statement like
\begin{verbatim}
if (x == true)
    i++;
else
    i += 2;
\end{verbatim}
must be treated as an increase of 2 for i when control flow joins after the if-statement, since the worst case has to be assumed. If it is possible, by inserting additional static/dynamic tests as needed, to assure that the index variable is greater than or equal to zero and less than the array length, the bound checks are removed (remove\_bound\_checks).

It is possible that more than one static test gets inserted for a single variable, when it is used in different array accesses (possibly with different constants added or subtracted). Because all tests have to hold for this variable to allow the optimized code to be executed, only the strictest tests have to be performed. For example, if an integer array x[] is accessed by i in the statements x[i+1] and x[i-1], it has to be guaranteed that i $\geq$ 1 (for the second statement) and that i $<$ arraylength$-$1 (for the first statement). Parallel to the insertion of new tests, the number of instructions needed for the new loop header node is increased accordingly.

When optimizing the loop, it is important to differentiate between the loop head and the rest of the basic blocks forming the body. The dynamic constraints can only be used for array accesses in the loop body, as the loop comparison is done at the end of the header node. Nevertheless, all static constraints can still be used to optimize array accesses in the header block.
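The computation of the strictest tests can be sketched as follows. The structure and function names are hypothetical, and the real code in remove\_bound\_checks tracks more state (e.g.\ the dynamic constraints); the sketch only covers accesses of one array through one index variable plus constant offsets.

\begin{verbatim}
#include <limits.h>

/* Required pre-loop tests for accesses x[i + off]:
   i >= lower  and  i < arraylength + upper_off. */
struct index_tests {
    int lower;      /* required: i >= lower                   */
    int upper_off;  /* required: i <  arraylength + upper_off */
};

/* offs holds the constant offsets of all accesses, e.g. {+1, -1}
   for x[i+1] and x[i-1]. */
struct index_tests strictest_tests(const int *offs, int n)
{
    int min_off = INT_MAX, max_off = INT_MIN;
    for (int k = 0; k < n; k++) {
        if (offs[k] < min_off) min_off = offs[k];
        if (offs[k] > max_off) max_off = offs[k];
    }
    /* i + min_off >= 0    =>  i >= -min_off
       i + max_off < len   =>  i <  len - max_off */
    struct index_tests t = { -min_off, -max_off };
    return t;
}
\end{verbatim}

For the offsets $\{+1, -1\}$ of the example above, the sketch yields exactly the tests i $\geq$ 1 and i $<$ arraylength$-$1.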
Because it is possible that an array reference is null prior to entering the loop, all array references that are loaded to compute an arraylength have to be checked against null.

After all new tests have been determined, it is necessary to reorder the code and insert the new tests (insert\_static\_checks). The first step is the creation of a new header node. Because all jumps to the beginning of the old loop would otherwise have to be redirected to the new header, it is more efficient to just replace the code in the old header block by the new tests. Then only jumps within the loop need to be patched, and the rest of the code can remain untouched. For each constraint that has been found in the previous step of the algorithm, two values have to be loaded and compared. Because these tests can fail at runtime (no guarantee can be made), a copy of the loop with the original, checked array accesses must exist. Depending on the outcome of the test cascade, a jump to the optimized or to the original loop is made (a sketch of the resulting code shape is given below). To copy the loop, all basic blocks that are part of the loop are copied and appended to the end of the global basic block list. After that, both the original and the copy of the loop need post-processing to redirect certain jump targets. All jumps in the original loop to the old loop head (which now contains the static checks) have to be redirected to the newly inserted block that holds the code of the original loop head. In the copied loop, all jumps inside the loop have to be redirected to the copied blocks.

When loops are duplicated, these changes must be reflected in the node lists of all parent loops as well. So the hierarchy tree is climbed and the new nodes are added to all enclosing loops. Because these node lists are sorted, the new loop head has to be treated specially: it has to be inserted into the correct place of the parent loops' node lists, while all other copied nodes can simply be appended.

Now all exceptions have to be examined. There are three different cases to consider: an exception can be part of the loop body (i.e.\ it is inside the loop), an exception can contain the loop, or an exception is in a different part of the code (and does not have to be considered further). To find all exceptions that are part of the loop, the algorithm uses the hierarchy tree and gathers the exceptions of all child nodes. The exception handlers of these loops have to be copied, and jumps to the original loop have to be redirected to the copied loop. The start and end code pointers of the protected area have to be moved to the copied blocks. The exception handler code is identified by looking at the successor nodes starting from the handler block: as long as control flow does not reach a loop node again, all blocks are considered part of the handler code.

Exceptions that surround the loop have to be handled differently. It is not necessary to copy the exception handler, as it is not part of the loop. Nevertheless, it is important to extend the protected area to the copied blocks. So the first and last block that got copied (including copied exception handler blocks of nested loops) are stored. The exceptions are then extended to contain all these new blocks and jump to the original handler. As no code is duplicated, nothing needs to be patched. When climbing up the hierarchy tree by following the parent pointers, not all exceptions of parent nodes really enclose the loop: it is possible that a parent loop contains an exception and the optimized loop right after each other. Such exceptions must not be touched and are ignored. Because of the layout of the exception table, where matching exceptions (which may be nested) are found by a linear search from the start, newly created exceptions have to be inserted right after their originals.
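As announced above, the following C fragment sketches, at the source level, the overall shape of the code produced by the test cascade and loop duplication. All names are illustrative, and the real transformation operates on the JIT compiler's intermediate code; the explicit checks here merely model Java's implicit null and bound checks.

\begin{verbatim}
#include <stdio.h>
#include <stdlib.h>

/* Models the implicit Java bound check on the intermediate code level. */
static void check_bounds(int i, int len)
{
    if (i < 0 || i >= len) {
        fprintf(stderr, "ArrayIndexOutOfBoundsException: %d\n", i);
        exit(1);
    }
}

/* Corresponds to: for (i = 0; i < n; i++) sum += x[i]; */
int sum_array(const int *x, int len, int n)
{
    int sum = 0, i;

    /* New loop head: null check plus static test cascade. i starts at 0
       and is only incremented, so i >= 0 holds statically; n <= len
       guarantees i < len on every pass. */
    if (x != NULL && n <= len) {
        for (i = 0; i < n; i++)      /* optimized copy: no checks */
            sum += x[i];
    } else {
        for (i = 0; i < n; i++) {    /* original copy: fully checked */
            if (x == NULL) {
                fprintf(stderr, "NullPointerException\n");
                exit(1);
            }
            check_bounds(i, len);
            sum += x[i];
        }
    }
    return sum;
}
\end{verbatim}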
One performance problem still remains after these modifications. The new loop header that replaces the original one is located exactly where the old one was. A fall-through from the previous block (which could be part of the loop) to the loop header must be patched by inserting an additional goto-instruction to the original head of the loop. Because this would insert an additional jump into the loop body, performance could deteriorate. To avoid this, the new loop head is moved before the first node of the loop. This can mean moving the loop head to the very beginning of the global basic block list, whose head pointer points into the initial basic block array and cannot easily be reassigned. Therefore a new pointer is introduced that temporarily holds the beginning of the basic block list and is assigned to the original pointer after all optimization steps have finished. Any fall-through from the predecessor of the loop head now reaches the old loop head, which has been inserted right after the new one. After all loops have been processed, register allocation and code generation proceed as usual and optimized code is generated.

\section{Helper functions}

An important helper function is found in tracing.c. This function is needed to determine the variables and constants that participate in array accesses or comparisons. As all work is done on Java bytecode, the arguments of an interesting operation have to be found by examining the stack. Unfortunately, these values can be temporaries that originate from variables loaded earlier and modified by arithmetic operations. To find the variables that participate in an array access, one has to walk back through the instructions, tracking the stack, until an appropriate load is found. The function tracing performs this task by stepping back instruction by instruction and recording the stack changes these instructions cause, until the correct load or constant push operation has been found or until it becomes clear that the origin cannot be determined (e.g.\ when a function return value is used). The result is then delivered in a special structure and used by the main algorithm; a simplified sketch follows.
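The sketch below illustrates the idea behind tracing.c under a hypothetical, much reduced instruction set; the real code handles full Java bytecode. To trace the operand that lies \texttt{depth} slots below the stack top just before the instruction at \texttt{use\_pc} executes, one would call \texttt{trace(code, use\_pc - 1, depth)}.

\begin{verbatim}
enum opcode      { OP_ILOAD, OP_ICONST, OP_IADD, OP_CALL };
enum origin_kind { ORIGIN_LOCAL, ORIGIN_CONST, ORIGIN_UNKNOWN };

struct instr  { enum opcode op; int arg; };
struct origin { enum origin_kind kind;
                int value; };   /* local variable index or constant */

/* Walk backwards over the instructions, tracking stack effects,
   until the producer of the traced operand is found. */
struct origin trace(const struct instr *code, int pc, int depth)
{
    struct origin o = { ORIGIN_UNKNOWN, 0 };

    while (pc >= 0) {
        switch (code[pc].op) {
        case OP_ILOAD:              /* pushes a local variable */
            if (depth == 0) {
                o.kind = ORIGIN_LOCAL; o.value = code[pc].arg; return o;
            }
            depth--; break;
        case OP_ICONST:             /* pushes a constant */
            if (depth == 0) {
                o.kind = ORIGIN_CONST; o.value = code[pc].arg; return o;
            }
            depth--; break;
        case OP_IADD:               /* pops two values, pushes the sum */
            if (depth == 0) return o;  /* computed value: the real code
                                          traces on into the operands */
            depth++; break;
        case OP_CALL:               /* nullary call pushing a result */
            if (depth == 0) return o;  /* return value: origin unknown */
            depth--; break;
        }
        pc--;
    }
    return o;
}
\end{verbatim}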
\section{Performance -- impact and gain}

The optimization process causes additional compile-time overhead, but can yield performance gains at runtime, when three fewer instructions are executed per array access many times in a loop. The additional overhead is mainly caused by the necessary control flow analysis, which has to be done for every method and is linear in the number of basic blocks. The loop scanning algorithm then tries to find loops in the control flow graph; it has to be run every time a method is compiled and needs $O(n \log n)$ time, where $n$ is the number of basic blocks. After the loops have been scanned for array accesses, the runtime of the final bound check removal and loop copying process mainly depends on the nesting depth of the hierarchy tree. For every additional nesting level, the number of basic blocks that must be copied and patched doubles, resulting in overhead that is exponential in the nesting depth. Additionally, the search for modifications of index variables, which is an iterative process, is slowed down by deeply nested loops. Things get worse when many exceptions are involved, because not only the exceptions in child nodes have to be considered, but all enclosing exceptions as well. Nevertheless, such cases should occur rarely, and in realistic settings the net gain can be significant. Especially in tight loops that initialize arrays or sum up their elements, three fewer instructions can yield up to 30 percent speedup when loops run many times.

\subsection{Tables}

The following table shows the compile time of javac and the run times of the perf and sieve benchmarks, with and without loop optimization.

\vspace{4mm}
\begin{tabular}{|l|c|c|c|}
\hline
 & Compile time (ms) & \multicolumn{2}{c|}{Run time (ms)} \\
\multicolumn{1}{|c|}{Cacao options} & javac & perf & sieve 1000 1000 \\
\hline
with optimization (-oloop) & 176 & 1510 & 98 \\
without optimization & 118 & 1720 & 131 \\
\hline
\end{tabular}

\begin{thebibliography}{9}
\bibitem{tig} A. Appel: Modern Compiler Implementation in C. Cambridge University Press, 1998.
\bibitem{aho} A. Aho, R. Sethi, J. Ullman: Compilers -- Principles, Techniques, and Tools. Addison-Wesley, 1986.
\end{thebibliography}

\end{document}