docs/ssapre.txt

   1 SSAPRE stands for "SSA based Partial Redundancy Elimination".
   2
   3 The algorithm is explained in this paper:
   4
   5 Partial Redundancy Elimination in SSA Form (1999)
   6 Robert Kennedy, Sun Chan, SHIN-MING LIU, RAYMOND LO, PENG TU, FRED CHOW
   7 ACM Transactions on Programming Languages and Systems
   8
   9 http://citeseer.ist.psu.edu/kennedy99partial.html
  10
  11 In this document I give a gentle introduction to the concept of "partial"
  12 redundancies, and I explain the basic design decisions I took in implementing
  13 SSAPRE, but the paper is essential to understand the code.
  14
  15 Partial Redundancy Elimination (or PRE) is an optimization that (guess what?)
  16 tries to remove redundant computations.
  17 It achieves this by saving the result of "not redundant" evaluations of
  18 expressions into appositely created temporary variables, so that "redundant"
  19 evaluations can be replaced by a load from the appropriate variable.
  20
  21 Of course, on register starved architectures (x86) a temporary could cost more
  22 than the evaluation itself... PRE guarantees that the live range of the
  23 introduced variables is the minimal possible, but the added pressure on the
  24 register allocator can be an issue.
  25
  26 The nice thing about PRE is that it not only removes "full" redundancies, but
  27 also "partial" ones.
  28 A full redundancy is easy to spot, and straightforward to handle, like in the
  29 following example (in every example here, the "expression" is "a + b"):
  30
  31 int FullRedundancy1 (int a, int b) {
  32      int v1 = a + b;
  33      int v2 = a + b;
  34      return v1 + v2;
  35 }
  36
  37 PRE would transform it like this:
  38
  39 int FullRedundancy1 (int a, int b) {
  40      int t = a + b;
  41      int v1 = t;
  42      int v2 = t;
  43      return v1 + v2;
  44 }
  45
  46 Of course, either a copy propagation pass or a register allocator smart enough
  47 to remove unneeded variables would be necessary afterwords.
  48
  49 Another example of full redundancy is the following:
  50
  51 int FullRedundancy2 (int a, int b) {
  52      int v1;
  53
  54      if (a >= 0) {
  55           v1 = a + b; // BB1
  56      } else {
  57           a = -a; // BB2
  58           v1 = a + b;
  59      }
  60
  61      int v2 = a + b; // BB3
  62      return v1 + v2;
  63 }
  64
  65 Here the two expressions in BB1 and BB2 are *not* the same thing (a is
  66 modified in BB2), but both are redundant with the expression in BB3, so the
  67 code can be transformed like this:
  68
  69 int FullRedundancy2 (int a, int b) {
  70      int v1;
  71      int t;
  72
  73      if (a >= 0) {
  74           t = a + b; // BB1
  75           v1 = t;
  76      } else {
  77           a = -a; // BB2
  78           t = a + b;
  79           v1 = t;
  80      }
  81
  82      int v2 = t; // BB3
  83      return v1 + v2;
  84 }
  85
  86 Note that there are still two occurrences of the expression, while it can be
  87 easily seen that one (at the beginning of BB3) would suffice.
  88 This, however, is not a redundancy for PRE, because there is no path in the
  89 CFG where the expression is evaluated twice.
  90 Maybe this other kind of redundancy (which affects code size, and not the
  91 computations that are actually performed) would be eliminated by code hoisting,
  92 but I should check it; anyway, it is not a PRE related thing.
  93
  94 An example of partial redundancy, on the other hand, is the following:
  95
  96 int PartialRedundancy (int a, int b) {
  97      int v1;
  98
  99      if (a >= 0) {
 100           v1 = a + b; // BB1
 101      } else {
 102           v1 = 0; // BB2
 103      }
 104
 105      int v2 = a + b; // BB3
 106      return v1 + v2;
 107 }
 108
 109 The redundancy is partial because the expression is computed more than once
 110 along some path in the CFG, not all paths.
 111 In fact, on the path BB1 - BB3 the expression is computed twice, but on the
 112 path BB2 - BB3 it is computed only once.
 113 In this case, PRE must insert new occurrences of the expression in order to
 114 obtain a full redundancy, and then use temporary variables as before.
 115 Adding a computation in BB2 would do the job.
 116
 117 One nice thing about PRE is that loop invariants can be seen as partial
 118 redundancies.
 119 The idea is that you can get into the loop from two locations: from before the
 120 loop (at the 1st iteration), and from inside the loop itself (at any other
 121 iteration). If there is a computation inside the loop that is in fact a loop
 122 invariant, PRE will spot this, and will handle the BB before the loop as a
 123 place where to insert a new computation to get a full redundancy.
 124 At this point, the computation inside the loop would be replaced by an use of
 125 the temporary stored before the loop, effectively performing "loop invariant
 126 code motion".
 127
 128 Now, this is what PRE does to the code.
 129
 130 But how does it work?
 131
 132 In "classic" solutions, PRE is formulated as a data flow analysis problem.
 133 The Muchnick provides a detailed description of the algorithm in this way (it
 134 is by far the most complex problem of this kind in the whole book).
 135 The point is that this algorithm does not exploit the SSA form.
 136 In fact, it has to perform all that amount of data flow analysis exactly
 137 because it does not take advantage of the work already done if you have put
 138 the program into SSA form.
 139
 140 The SSAPRE algorithm, on the other hand, is designed with SSA in mind.
 141 It fully exploits the properties of SSA variables, it also explicitly reuses
 142 some data structures that must have been computed when building the SSA form,
 143 and takes great care to output its code in that form already (which means that
 144 the temporaries it introduces are already "versioned", with all the phi
 145 variables correctly placed).
 146
 147 The main concept used in this algorithm is the "Factored Redundancy Graph" (or
 148 FRG in short). Basically, for each given expression, this graph shows all its
 149 occurrences in the program, and redundant ones are linked to their
 150 "representative occurrence" (the 1st occurrence met in a CFG traversal).
 151 The central observation is that the FRG is "factored" because each expression
 152 occurrence has exactly one representative occurrence, in the same way as the
 153 SSA form is a "factored" use-definition graph (each use is related to exactly
 154 one definition).
 155 And in fact building the FRG is much like building an SSA representation, with
 156 PHI nodes and all.
 157 By the way, I use "PHI" for "PHI expression occurrences", and "phi" for the
 158 usual phi definitions in SSA, because the paper uses an uppercase phi greek
 159 letter for PHI occurrences (while the lowercase one is standard in SSA).
 160
 161 The really interesting point is that the FRG for a given expression has exactly
 162 the same "shape" of the use-definition graph for the temporary var that must
 163 be introduced to remove the redundancy, and this is why SSAPRE can easily emit
 164 its output code in correct SSA form.
 165
 166 One particular characteristic of the SSAPRE algorithm is that it is "sparse",
 167 in the sense that it operates on expressions individually, looking only at the
 168 specific nodes it needs. This is in contrast to the classical way of solving
 169 data flow analysis problems, which is to operate globally, using data
 170 structures that work "in parallel" on all the entities they operate on (in
 171 practice bit sets, with for instance one bit per variable, or one bit per
 172 expression occurrence, you get the idea). This is handy (it exploits the
 173 parallelism of bit operations), but in general setting up those data
 174 structures is time consuming, and if the number of the things represented by
 175 the bits is not fixed in advance the code can get quite messy (bit sets must
 176 become growable).
 177 Moreover, applying special handling to individual expressions becomes a very
 178 tricky thing.
 179
 180 SSAPRE, on the other hand, looks at the whole program (method in our case) only
 181 once, when it scans the code to collect (and classify) all expression
 182 occurrences.
 183 From here on, it operates on one expression at a time, looking only at its
 184 specific data structures (which, for each expression, are much smaller than
 185 the whole program, so the operations are fast).
 186
 187 This approach has another advantage: the data structures used to operate on
 188 one expressions can be recycled when operating on other expressions, making
 189 the memory usage of the compiler lower, and (also important) avoiding losing
 190 time with memory allocations at all.
 191 This reflects directly on the design of those data structures.
 192 We can better see these advantages following which data structures are used
 193 during the application of SSAPRE to a method.
 194
 195 The steps that are performed are the following:
 196
 197     * Initialization: scan the whole program, collect all the occurrences, and
 198       build a worklist of expressions to be processed (each worklist entry
 199       describes all the occurrences of the given expression).
 200 Here the data structures are the following:
 201 - One struct (the working area), containing the worklist and other
 202  "global" data. The worklist itself contains an entry for each expression
 203   which in turn has an entry for each occurrence.
 204 - One "info" struct for each BB, containing interesting dominance and CFG
 205   related properties of the BB.
 206
 207 Then, for each entry in the worklist, these operations are performed:
 208
 209     * PHI placement: find where the PHI nodes of the FRG must be placed.
 210     * Renaming: assign appropriate "redundancy classes" to all occurrences (it
 211       is like assigning variable versions when building an SSA form).
 212     * Analyze: compute various flags in PHI nodes (which are the only places
 213       that define where additional computations may be added).
 214       This conceptually is composed of two data flow analysis passes, which in
 215       practice only scan the PHI nodes in the FRG, not the whole code, so they
 216       are not that heavy.
 217     * Finalize: make so that the FRG is exactly like the use-def graph for the
 218       temporary that will be introduced (it generally must be reshaped a bit
 219       according to the flags computed in the previous step).
 220       This is also made of two steps, but more for implementation reasons than
 221       for conceptual ones.
 222     * Code motion: actually update the code using the FRG.
 223 Here, what's needed is the following:
 224 - PHI occurrences (and some flags sbout them)
 225 - PHI argument occurrences (and some flags sbout them)
 226 - The renaming stack
 227 In practice, one can observe that each BB can have at most one PHI (we work
 228 on one expression at a time), and also one PHI argument (which we consider
 229 occurring at the end of the BB). Therefore, we do not have separate structures
 230 for these, but store them directly in the BB infos (which are kept for the
 231 whole SSAPRE invocation).
 232 The renaming stack is managed directly as a list of occurrences, with
 233 special handling for PHI nodes (which, being represented directly by their
 234 BB, are not "occurrences").
 235
 236
 237 So far, the only two missing things (with respect to SSAPRE in the paper) are
 238 unneeded PHIs eliminantion and the handling of "composite" expressions.
 239 Otherwise, the implementation is complete.
 240
 241 Other interesting issues are:
 242 - SSAPRE has the assumption that:
 243   - each SSA variable is related to one "original" (not SSA) variable, and
 244   - no more than one version of each original variable is live at the same time
 245     in the CFG.
 246   It would be better to relax these assumptions.
 247 - SSAPRE operates on "syntactic" redundancies, not on "values".
 248   GVNPRE (or other means of merging GVN) would be a nice alternative, see
 249   "http://www.cs.purdue.edu/homes/vandrutj/papers/thesis.pdf".