\documentclass[12pt]{article}

\title{Array Bound-Check Removal}
\author{Christopher Kruegel\\TU Vienna\\cacao@complang.tuwien.ac.at}
In safe programming languages like Java, all array accesses are
checked. It is ensured that the index used is greater than or equal
to zero and less than the array length; otherwise an
ArrayIndexOutOfBoundsException is thrown. This gain in safety causes
a run-time overhead that is especially unpleasant in loops, where
these checks have to be performed many times. Often it is possible
to remove these checks in loops by inserting additional tests at the
beginning of the loop and examining the code structure and loop
condition, thereby saving run time on each loop execution. The
following algorithm performs this task for a special set of array
accesses in loops. It should be obvious that it is not always
possible to predict the value of an array index by inserting tests
at the beginning of the loop. An example would be a global variable
that is used as an array index. This variable could be changed by a
different thread at virtually any time, so removing a bound check
for this array access is not a good idea. The following algorithm
only performs bound-check removal when the induction variable (the
variable used as array index) is local and is either modified only
by adding or subtracting a constant inside the loop body or remains
unchanged (is constant). If a constant is used as index,
optimization can take place as well. When other variables are used
to modify the induction variable, or when different arithmetic
operations are used, no optimization can take place. Nevertheless,
the most common use of index variables in loops is their increment
or decrement by a constant, so most array accesses should be
considered for optimization.
\section{Initialization}
Before array bound checks in loops can be eliminated, the loops have
to be detected. The algorithm performs its analysis at the
intermediate code level, so we cannot rely on just looking at
while/for statements. A basic block analysis has already been
completed, so the algorithm can work on this data structure. It uses
the Lengauer-Tarjan algorithm to find existing loops, which is based
on determining the dominator tree (a reference with extensive
documentation for this algorithm can be found in \cite{tig}). It
uses a depth-first search on the control flow graph to build a
spanning tree. Then the semidominator and dominator theorems are
utilized to extract the dominator tree, and by looking at back edges
(a back edge is an edge whose target node dominates its source node)
all loops can be detected. Before starting to look at each loop, the
case of different loops sharing the same header node has to be
considered. The algorithm then processes each loop independently of
all others and performs its optimization. The procedures that
analyze the control flow of the program, build a flow graph and
extract the loops can be found in the files graph.c (for the control
flow graph) and loop.c (for the loop extraction algorithm). The
control flow graph is built simply by looking at the last
instruction of each basic block and then deciding which other blocks
can be reached by that instruction. One problem can occur when
lookup/table-switch instructions cause multiple jumps to the same
node. It must be prevented that the target node's predecessor list
contains that node more than once, so an additional array is needed
to prevent double entries in the predecessor array. When the
necessary flow data structure and the list of all loops have been
built, the main algorithm can start.
The procedures for the main algorithm can be found in analyze.c.
Before each loop is processed on its own, two global preprocessing
steps are necessary.
The first step deals with loops that share the same header node.
This special case can happen when a loop body ends with an
if-then-else statement. The loop-finding algorithm then reports two
different loops which share the same header node. When additional
tests are inserted at the header node and another loop sharing the
same header is evaluated later, inserting different tests, problems
could arise. To prevent this from happening, loops sharing the same
header node are simply merged into a bigger one by taking the union
of their nodes. Because the nodes of each loop are already sorted by
increasing basic block numbers, a simple merge of the node lists can
be done (analyze\_double\_headers).
The second step before the loop-by-loop analysis commences is the
building of a loop hierarchy. Nested loops cause problems when
variables that are used as array indexes elsewhere get modified.
Because these modifications (e.g. an increment by a constant) can
happen an unknown number of times, their results are unpredictable.
These modifications in nested loops have to be recognized and
reacted upon accordingly. The algorithm builds a tree where each
node represents a loop, with its parent being the directly
surrounding loop. A dummy root node, representing the whole
function, has to be inserted as well. When loops have to be
duplicated because of optimized array accesses, it is important to
extend or duplicate exception entries as well. Because of that,
exceptions are inserted into the tree as follows. Every exception is
inserted at the node in the hierarchy that represents the loop that
directly contains this exception. When an exception is part of a
nested loop, it is inserted only at the node representing this
nested loop, because by following this node's parent pointers, we
find all other loops that contain the exception (analyze\_nested).
Finally, the sequential order of the loops is determined by a
topological sort satisfying the condition that all nested loops have
to be optimized before their parent loop is processed. This can be
achieved by a post-order traversal of the hierarchy tree, skipping
the root node.
After these global steps have been completed, each loop is
processed. The loops are checked for array accesses
(analyze\_for\_array\_access), and the process exits if none are
found. Finally, two special cases must be considered.
\begin{itemize}

\item The algorithm uses the loop condition to guarantee certain
bounds for variables. If the loop condition consists of an
or-statement, these guarantees cannot be upheld. If the loop
condition is, for example,

\begin{verbatim}
while ((var > 0) or true)
\end{verbatim}

then var's value is not guaranteed to be greater than zero. When an
or-statement is found, loop optimization is stopped.

\item Variables that are used as indexes in array accesses might be
modified in the exception handling code of try-catch statements
inside the loop. Because code in catch blocks is not considered in
the control flow graph, and because these modifications happen
unpredictably (and should occur rarely), loops containing catch
blocks with index variable modifications are not optimized. During
the scan for array accesses within the loop, all interesting
variables are recorded. A variable is considered interesting when it
is used as an index in an array access or when its value is changed
(due to a store instruction) (analyze\_or\_exceptions and
scan\_global\_list).

\end{itemize}
The next step is responsible for the analysis of the loop condition
itself. The loop condition can create a so-called dynamic constraint
on a variable. If one side of the loop condition is constant during
the execution of the loop, the other side can be safely assumed to
be less than, less than or equal to, greater than, or greater than
or equal to this side (depending on the operator) at the start of
each loop pass. It is important to notice the difference to
so-called static constraints on variables. These mean that a
variable can be safely assumed to stay below or above a certain
constant during the whole loop execution by simply testing it prior
to loop entry. This is obviously true for constants, but it also
holds for variables that are only decremented or only incremented.
For these, a static lower or upper bound can be guaranteed for the
whole loop execution by inserting a single test prior to loop entry.
Dynamic constraints, on the contrary, can vary in different parts of
the loop. After the variable is tested in the loop condition, it
might be changed to different values on different paths of
execution. All variables that are never changed or that only get
decremented/incremented have both static and dynamic constraints;
all others can only have dynamic ones.
Now the core bound-check removal procedure is started. Beginning
directly after the loop condition, all changes of variables that are
used as index variables somewhere in the loop are recorded. This is
an iterative process, since different paths of execution may yield
different changes to these variables. Especially changes in nested
loops cause these changes to become unpredictable, so that no
upper/lower bound can be upheld any further. After this iterative
process has made all changes stable, each array access is examined
(e.g. when every path through an if-statement increases i by 2, this
must result in a recorded increase of 2 for i where control flow
joins after the if-statement). If it is possible, by inserting
additional static/dynamic tests as needed, to assure that the index
variable is greater than or equal to zero and less than the array
length, the bound checks are removed (remove\_bound\_checks). It is
possible that more than one static test gets inserted for a single
variable when it is used in different array accesses (possibly with
different constants added or subtracted). Because all tests have to
hold for this variable to allow the optimized code to be executed,
only the strictest tests have to be performed (e.g. if an integer
array x[] is accessed by i in the statements x[i+1] and x[i-1], it
has to be guaranteed that i $\geq$ 1 (for the second statement) and
that i $<$ arraylength$-$1 (for the first statement)). Parallel to
the insertion of new tests, the number of instructions needed for
the new loop header node is increased accordingly. When optimizing
the loop, it is important to differentiate between the loop head and
the rest of the basic blocks forming the body. The dynamic
constraints can only be used for array accesses in the loop body, as
the loop comparison is done at the end of the header node.
Nevertheless, all static constraints can still be used to optimize
array accesses in the header block. Because it is possible that an
array reference is null prior to entering the loop, all array
references that are loaded to compute an arraylength have to be
checked against null.
After all new tests have been determined, it is necessary to reorder
the code and insert the new tests (insert\_static\_checks). The
first step is the creation of a new header node. Because all jumps
to the beginning of the old loop would now have to pass through the
new header first, it is more efficient to just replace the code in
the old header block by the new tests. Then only jumps within the
loop need to be patched, and the rest of the code can remain
untouched. For each constraint that has been found in the previous
step of the algorithm, two values have to be loaded and are then
compared. Because it is possible at run time that these tests fail
(and no guarantee can be made), a copy of the loop with the
original, checked array accesses must exist. Depending on the
outcome of the test cascade, a jump to the optimized or the original
loop is made. To copy the loop, all basic blocks that are part of
the loop are copied and appended to the end of the global basic
block list. After that, both the original and the copy of the loop
need post-processing to redirect certain jump targets. All jumps to
the old loop head (which now contains the static checks) in the
original loop have to be redirected to the newly inserted block that
holds the code of the original loop head. In the copied loop, all
jumps inside the loop have to be redirected to the copied blocks.
When loops are duplicated, these changes must be reflected in the
node lists of all parent loops as well. So the hierarchy tree is
climbed and the new nodes are added to all enclosing loops. Because
these node lists are sorted, the new loop head has to be treated
differently: it must be inserted at the correct place in the parent
loops, while all other copied nodes can simply be appended to the
basic block list.
Now all exceptions have to be examined. There are three different
cases that must be considered: an exception can be part of the loop
body (i.e. lie inside the loop), an exception can contain the loop,
or an exception can be in a different part of the code (and does not
have to be considered further). To find all exceptions that are part
of the loop, the algorithm uses the hierarchy tree and collects the
exceptions of all children nodes. The exception handlers of these
loops have to be copied, and jumps to the original loop have to be
redirected to the copied loop. The start and end code pointers of
the protected area have to be moved to the copied blocks. The
exception handler code is identified by looking at the successor
nodes starting from the handler block. As long as control flow does
not reach a loop node again, all blocks are considered part of the
handler code. Exceptions that surround the loop have to be handled
differently. It is not necessary to copy the exception handler, as
it is not part of the loop. Nevertheless, it is important to extend
the protected area to the copied blocks. So the first and last
blocks (including copied exception handler blocks of nested loops)
that got copied are stored. The exceptions are then extended to
contain all these new blocks and jump to the original handler. As no
code is duplicated, nothing needs to be patched. When climbing up
the hierarchy tree by following the parent pointers, not all
exceptions of parent nodes really enclose the loop. It is possible
that a parent loop contains an exception and the loop that is
optimized right after each other. Those exceptions need not be
handled and are ignored. Because of the layout of the exception
table, where the appropriate exceptions (which could be nested) are
found by a linear search from the start, it is necessary to insert
newly created exceptions right after their originals.
One performance problem still remains after these modifications. The
new loop header that replaces the original one is located exactly
where the old one was. A fall-through from the previous block (which
could be part of the loop) to the loop header would have to be
patched by inserting an additional goto instruction to the original
head of the loop. Because this would insert an additional jump into
the loop body, performance may deteriorate. To solve this
shortcoming, the new loop head is moved before the first node of the
loop. This might require moving the loop head to the beginning of
the global basic block list. That pointer points to the initial
basic block array and cannot easily be reassigned, so a new pointer
is needed that temporarily holds the beginning of the basic block
list and is assigned to the original pointer after all optimization
steps have finished. Any fall-through from the predecessor of the
loop head now reaches the old loop head, which has been inserted
right after the new loop head. After all loops have been processed,
register allocation and code generation proceed as usual and
optimized code is generated.
\section{Helper functions}
An important helper function is stored in tracing.c. This function
is needed to determine the variables/constants that participate in
array accesses or comparisons. As all work is done on Java bytecode,
it is necessary to find the arguments of an interesting operation by
examining the stack. Unfortunately, these values can be just
temporaries that originate from variables loaded earlier and
modified by arithmetic operations. To find the variables that
participate in an array access, one has to walk back through the
instructions, looking at the stack, until an appropriate load is
found. The function tracing performs this task by stepping back
instruction by instruction and recording the stack changes these
instructions cause, until the correct load or push-constant
operation has been found or until it becomes clear that it is
impossible to determine the origin (e.g. when a function return
value is used). The value is then delivered in a special structure
and used by the main algorithm.
\section{Performance -- impact and gain}

It is obvious that the optimization process causes additional
compile-time overhead, but it can give performance gains at run
time when three fewer instructions are executed per array access
many times in a loop. The additional overhead is mainly caused by
the necessary control flow analysis, which has to be done for every
method and is linear in the number of basic blocks. Then the loop
scanning algorithm tries to find loops in the control flow graph.
This algorithm has to be started every time a method is compiled and
runs in $O(n \log n)$, where $n$ is the number of basic blocks.
After the loops have been scanned for array accesses, the run time
of the final bound-check removal and loop copying process mainly
depends on the nesting depth of the hierarchy tree. For every
additional nesting level, the number of basic blocks that must be
copied and patched doubles, resulting in overhead that is
exponential in the nesting depth. Additionally, the search for
modifications of index variables, which is an iterative process, is
slowed down by deeply nested loops. Things get worse when many
exceptions are involved, because not only the exceptions in children
nodes have to be considered, but all enclosing exceptions as well.
Nevertheless, those cases should occur rarely, and in real
environments the net gain can be significant. Especially in tight
loops, where arrays are initialized or their elements summed up,
three fewer instructions can yield up to 30 percent speedup when
loops run for many iterations.
\begin{center}
\begin{tabular}{|l|c|c|c|}
\hline
 & Compile time (in ms) & \multicolumn{2}{c|}{Run time (in ms)} \\
\multicolumn{1}{|c|}{\raisebox{1ex}{Cacao options}} & javac & perf & sieve 1000 1000 \\ \hline
\rule{0mm}{5mm}with optimization (-oloop) & 176 & 1510 & 98 \\
without optimization & 118 & 1720 & 131 \\ \hline
\end{tabular}
\end{center}
\begin{thebibliography}{9}

\bibitem{tig} A. Appel; Modern Compiler Implementation in C;
Cambridge University Press; 1998

\bibitem{aho} A. Aho, R. Sethi, J. Ullman; Compilers: Principles,
Techniques, and Tools; Addison-Wesley; 1986

\end{thebibliography}