doc/handbook/threads.tex

   1 \section{Multi threading}
   2
   3 \subsection{Introduction}
   4
   5 Threads are an integral part of the Java programming language. A Java
   6 Runtime Environment (JRE) has to implement threads to be compliant. A
   7 thread implementation includes the creation, execution, switching,
   8 killing and synchronization of threads. In Java the latter is provided
   9 by monitors. Monitors ensure that, for a given object, only one thread
  10 at a time can execute certain methods, known as synchronized methods,
  11 and blocks which are marked by the keyword \texttt{synchronized}.
  12
  13 Monitors are usually implemented using mutex (mutual exclusion). A
  14 mutex is a data structure which contains the necessary information to
  15 guarantee that only one unit of execution can perform a critical
  16 section at the same time \cite{Stallings95}.
  17
  18 As we show in section \ref{MonitorTest} a fast implementation of the
  19 synchronization mechanism is crucial for the efficiency of Java. One
  20 implementation produced more than 800\% overhead in one of our tests.
  21
  22
  23 \subsection{Related work}
  24
  25 Java threads can be implemented using threads provided by the
  26 operating system kernel, as user-level libraries, or as a combination
  27 of the two. There exist a number of articles describing different
  28 thread implementation aspects but none captures the problem of
  29 efficient monitor operations for objects.
  30
  31 Sun's first implementation of the JavaVM on Solaris was based on
  32 user-level threads. The current implementation uses a combination of
  33 kernel and user-level threads. Some of the advantages of this approach
  34 are outlined in \cite{SunThreads97}.
  35
  36 The freely available JavaVM implementation {\tt kaffe} by Tim
  37 Wilkinson uses user-level threads \cite{Wilkinson:97}. Until version
  38 0.9, each object contained the complete mutex data structure. This
  39 enabled a fast monitor implementation but used a lot more memory than
  40 necessary.
  41
  42 Apart from thread implementations used in JavaVM's there are many
  43 other thread standards and implementations, the most notable being the
  44 IEEE POSIX extension \cite{PosixThreads96}.
  45
  46 In \cite{Mueller93}, Mueller describes a library implementation of
  47 POSIX threads on a standard UNIX system. This library is a user-level
  48 implementation which need no support from the operating system. A very
  49 popular library of user-level primitives for implementing threads is
  50 {\em QuickThreads} by David Keppel, described in \cite{Keppel93}.
  51
  52 Bershad et al. present a fast mechanism for mutual exclusion on
  53 uniprocessor systems \cite{Bershad+92}. They have a software solution
  54 for the implementation of an atomic test-and-set operation which is
  55 faster than the corresponding hardware instruction. However, their
  56 implementation relies on assistance from the operating system.
  57
  58
  59 \subsection{Implementation basics}
  60
  61 A complete thread implementation for Java has to provide:
  62
  63 \begin{itemize}
  64
  65 \item thread creation and destruction,
  66
  67 \item low level thread switching (usually implemented in assembly
  68       language),
  69
  70 \item thread scheduling,
  71
  72 \item synchronization primitives,
  73
  74 \item a thread safe non-blocking input/output library.
  75
  76 \end{itemize}
  77
  78 Cacao's current implementation of threads is based mainly on the
  79 threading code of {\tt kaffe} version 0.7, which has been released
  80 under a BSD-style license and can thus be used freely
  81 \cite{Wilkinson:97}. As mentioned above, {\tt kaffe}'s threads are
  82 completely user-level, which means, for example, that they cannot take
  83 advantage of a multiprocessor system.
  84
  85 \begin{table*}
  86 \begin{center}
  87 \begin{tabular}{|l|l|c|c|c|c|c|c|} \hline
  88                   &                     & \multicolumn{3}{|c|}{mutex test}
  89                                         & \multicolumn{3}{|c|}{tree test}\\ \hline
  90                   &                     & \multicolumn{2}{|c|}{run time in secs}&call cost
  91                                         & \multicolumn{2}{|c|}{run time in secs}&call cost\\
  92 Machine           & JavaVM              & no sync & sync  & in $\mu$s & no sync & sync   & in $\mu$s \\ \hline\hline
  93 DEC 21064A 289MHz & OSF JDK port (1.0.2)&  1.20   & 4.14  &   9.8     &  8.37   & 34.69  & 9.8    \\ \hline
  94 DEC 21064A 289MHz & DEC JDK interpreter &  1.71   & 12.80 &   36.97   &  12.25  & 143.10 & 39.93  \\ \hline
  95 DEC 21064A 289MHz & DEC JDK JIT         &  1.06   & 11.05 &   33.30   &  7.80   & 130.50 & 37.45  \\ \hline
  96 Pentium-S 166MHz  & Linux JDK 1.1.3     &  0.96   & 1.69  &   2.43    &  7.93   & 16.12  & 2.50   \\ \hline
  97 AMD K6 200MHz     & Sun JPP JIT (WinNT) &  0.57   & 0.94  &   1.23    &  2.8    & 9.8    & 2.13   \\ \hline
  98 AMD K6 200MHz     & Navigator 4.04 (WinNT) & 0.03 & 0.77  &   2.46    &  2.3    & 9.2    & 2.11   \\ \hline
  99 %PowerPC 604e 200MHz & Linux JDK Beta    &  1.16   & 2.05  &   2.96    &         &        &        \\ \hline
 100 %Sun Ultra-1       & JDK 1.1.2           &  0.90   & 1.25  &   1.16    &         &        &        \\ \hline
 101 %DEC 21064A 300MHz & Cacao unoptimized   &  0.24   & 0.48  &   0.80    &  0.24   & 0.48   & 0.80   \\ \hline
 102 DEC 21064A 289MHz & Cacao               &  0.28   & 0.40  &  0.40     &  1.46   & 4.71   & 0.99   \\ \hline
 103 \end{tabular}
 104 \caption{Overhead for calling synchronized methods (one object)}
 105 \label{SyncronizedOverhead}
 106 \end{center}
 107 \end{table*}
 108
 109 There are several reasons why we chose this approach:
 110
 111 \begin{itemize}
 112
 113 \item Thread support differs from operating system to operating
 114 system. Not only do some operating systems provide different APIs to
 115 other systems, but even if they do provide the same interface (most
 116 often in the form of POSIX threads), the costs of various operations
 117 are often very different across platforms.
 118
 119 \item It was a complete implementation, tailored for the use in a
 120 JavaVM and thus made it possible for us to get thread support up and
 121 running quickly.
 122
 123 \item Parts of Cacao are not yet thread-safe (most notably the
 124 compiler and the garbage collector), thus making it difficult to use a
 125 kernel-supported solution.
 126
 127 \end{itemize}
 128
 129 Each thread in a Java program corresponds to an instance of the
 130 \texttt{java.lang.Thread} class. In order for the Java run time
 131 environment (JRE) to associate platform specific information with such
 132 an instance, it has a private member called \texttt{PrivateInfo} of
 133 type \texttt{int}, which can be used by the JRE. Kaffe version 0.7
 134 used this member as a pointer to a context structure. Since pointers
 135 are 64-bit values on the Alpha, we use an array of context structures.
 136 \texttt{PrivateInfo} is then used as an index into this array.
 137
 138 A fixed-size stack is allocated for each thread. The stack size for
 139 ordinary threads can be specified as a command-line parameter. Special
 140 threads (such as the garbage collector) have their own stack sizes. In
 141 order to catch stack overflows without the burden of checking the
 142 stack top at each method entry, we simply guard the stack top with one
 143 or more memory pages. The memory protection bits of these pages are
 144 modified to cause a memory protection violation when accessed. The
 145 number of guard pages can be specified on the command-line.
 146
 147 Thread switching is implemented straightforwardly, namely by saving
 148 callee-save registers to the stack, switching to the new stack,
 149 restoring callee-save registers and performing a subroutine return.
 150
 151 Scheduling is very simple in Cacao: higher priority threads always get
 152 precedence over lower priority threads, and threads of the same
 153 priority are scheduled with a round-robin algorithm.
 154
 155 I/O in user-level threads is a problem since UNIX I/O calls are, by
 156 default, blocking. They would block not just the current thread but
 157 the whole process. This is undesirable. It is thus common practice for
 158 thread libraries to use non-blocking I/O and, in the case of an
 159 operation which would block, suspend the corresponding thread and
 160 perform a multiplexed I/O operation (\texttt{select(2)} in UNIX) on
 161 all such blocked files regularly, especially if there are no runnable
 162 threads available.
 163
 164
 165 \subsection{Motivation}
 166
 167 \label{MonitorTest}
 168
 169 To optimize an implementation it is necessary to find the parts which
 170 are critical to performance. Therefore, we analysed the cost of
 171 monitors with two small test programs.  The {\em mutex test} program
 172 simply invoked a method 300000 times, once with the method being
 173 declared \texttt{synchronized} and once without. The other test, {\em
 174 tree test}, allocated a balanced binary tree with 65535 elements and
 175 recursively traversed it 50 times using a method, again once being
 176 synchronized and once being not.
 177
 178 Table \ref{SyncronizedOverhead} shows the differences in run-time for
 179 the two versions of the programs for several JavaVM's. The table
 180 includes the run times for both versions of the programs in seconds.
 181 The column `call cost' gives the overhead of a call of a synchronized
 182 method compared to a call of a non-synchronized one. From these
 183 numbers it is obvious that calls to synchronized methods, or monitors
 184 in general, are much slower than ordinary method calls.
 185
 186 \begin{table*}
 187 \begin{center}
 188 \begin{tabular}{|l|r|r|r|r|} \hline
 189 Application & Objects allocated & Objects with mutex & Monitor
 190 operations & Parallel Mutexes \\ \hline\hline
 191 javac       & 111504            & 13695              &
 192 840292             & 5        \\ \hline
 193 Biss        & 84939             & 13357              &
 194 1058901            & 12       \\ \hline
 195 Jigsaw      & 215411            & 23804              &
 196 855691             & 25       \\ \hline
 197 \end{tabular}
 198 \caption{Mutual exclusion statistics}
 199 \label{MutexStatistics}
 200 \end{center}
 201 \end{table*}
 202
 203 The question that arises is now whether this has any influence on
 204 common Java programs. To answer this question, we have used a modified
 205 version of \texttt{kaffe} to gather statistics about monitor usage.
 206 The results are summarized in table \ref{MutexStatistics}.
 207
 208 Javac is an invocation of Sun's \texttt{javac} on the Toba source
 209 files \cite{Proebsting+97} and is thus single-threaded. Execution of
 210 this program takes only a few seconds using Cacao with threads
 211 disabled. Biss is a more or less typical working session with the Java
 212 development environment of the Biss-AWT \cite{BissAWT97}. It is
 213 slightly multithreaded. Jigsaw invokes the HTTP server Jigsaw
 214 \cite{Jigsaw97} of the World Wide Web Consortium and lets it serve
 215 identical parallel requests from seven hosts, amounting to about one
 216 megabyte split across 200 files for each request. This application is
 217 highly multithreaded.
 218
 219 The table contains the number of objects allocated during execution
 220 and the number of objects for which a monitor has been entered. The
 221 `Monitor operations' column gives the total number of operations
 222 performed on monitors, while the numbers in the `Parallel Mutexes'
 223 column show the greatest number of mutexes that were locked
 224 simultaneously.
 225
 226 These numbers demonstrate that the performance of monitor operations
 227 is of paramount importance for a fast JavaVM implementation. We did
 228 not include the number of blocking monitor operations because there
 229 were just two of them, namely in the Biss application. It was only
 230 after we modified \texttt{kaffe} to switch to another thread before
 231 each \texttt{monitorenter} operation that the Biss and Jigsaw
 232 applications performed a few thousand blocking \texttt{monitorenter}s.
 233
 234
 235 \subsection{Mutex implementation}
 236
 237 Our mutex data structure includes a pointer to the thread that has
 238 currently locked the mutex (\texttt{holder}). If the mutex is not
 239 locked at all, this is a null pointer. Because one thread can lock the
 240 same mutex multiple times, we need a count of how many lock operations
 241 without corresponding unlocks have been performed on the mutex
 242 (\texttt{count}). When it goes down to zero, the mutex is not locked
 243 by any thread. Furthermore, we need to keep track of the threads which
 244 have tried to lock the mutex but were blocked and are now waiting for
 245 it to become unlocked (\texttt{waiters}).
 246
 247 The data structure is defined as follows:
 248
 249 \begin{verbatim}
 250 struct mutex {
 251   int count;
 252   thread *holder;
 253   thread *waiters;
 254 }
 255 \end{verbatim}
 256
 257 The locking of a mutex can now be specified as in
 258 figure~\ref{lockMutex}.
 259
 260 \begin{figure}
 261 \begin{verbatim}
 262 void lockMutex (struct mutex *mux) {
 263   disablePreemption();
 264
 265   if (mux->holder == NULL) {
 266     mux->holder = currentThread;
 267     mux->count = 1;
 268   } else if (mux->holder == currentThread) {
 269     mux->count++;
 270   } else {
 271     blockUntilMutexIsNotLocked(mux);
 272     mux->holder = currentThread;
 273     mux->count = 1;
 274   }
 275
 276   enablePreemption();
 277 }
 278 \end{verbatim}
 279 \caption{Code for \texttt{lockMutex()}}
 280 \label{lockMutex}
 281 \end{figure}
 282
 283 The macro \texttt{disablePreemption()} simply sets a global flag
 284 indicating that a critical section is currently being executed and
 285 that preemption must not take place. \texttt{enablePreemption()}
 286 unsets the flag and checks whether a thread switch is necessary (see
 287 below). On a multiprocessor system this would need to be implemented
 288 using a test-and-set instruction, or some equivalent.
 289
 290 The inverse operation, namely the unlocking of the mutex, can be
 291 expressed as in figure~\ref{unlockMutex}.
 292
 293 \begin{figure}
 294 \begin{verbatim}
 295 void unlockMutex (struct mutex *mux) {
 296   disablePreemption();
 297
 298   --mux->count;
 299   if (mux->count == 0) {
 300     mux->holder = NULL;
 301     if (mux->waiters != NULL) {
 302       thread *next = mux->waiters;
 303       mux->waiters = next->next;
 304       resumeThread(next);
 305     }
 306   }
 307
 308   enablePreemption();
 309 }
 310 \end{verbatim}
 311 \caption{Code for \texttt{unlockMutex()}}
 312 \label{unlockMutex}
 313 \end{figure}
 314
 315 The function \texttt{resumeThread()} sets a flag which results in a
 316 thread switch to the given thread after preemption has been
 317 re-enabled.
 318
 319 These algorithms are simple, straightforward and reasonably efficient.
 320
 321
 322 \subsection{Object-Mutex relation}
 323 \label{objectmutex}
 324
 325 Since the JavaVM specification states that each object has a mutex
 326 associated with it, the obvious solution seems to be to include the
 327 mutex structure in each object. The mutex structure could be reduced
 328 to a length of eight bytes if we used thread numbers instead of
 329 pointers. But, using such a solution, the javac application would
 330 allocate nearly one megabyte of mutex data, just for a few seconds of
 331 execution. This is unacceptable.
 332
 333 On the other hand, the figures show that it is very seldom that more
 334 than 20 mutexes are locked at the same time. This suggests that using
 335 a hash table, indexed by the address of the object for which a monitor
 336 operation is to be performed could be very efficient. This is exactly
 337 what Cacao does.
 338
 339 We use a hash function which uses the $2 n$ least significant bits of
 340 the address where $2^n$ is the size of the hash table. This function
 341 can be implemented in four RISC instructions. Since we ran into
 342 performance problems with overflow handling by internal chaining, we
 343 now use external chaining with a preallocated pool of overflow entries
 344 managed by a free list.
 345
 346 An entry in the hash table has the following structure:
 347
 348 \begin{verbatim}
 349 struct entry {
 350   object *obj;
 351   struct mutex mux;
 352   struct entry *next;
 353 }
 354 \end{verbatim}
 355
 356 In order to minimize the overhead of the locking/unlocking operations,
 357 we should strive for code sequences small enough to be inlined, yet
 358 powerful enough to handle the common case.  We have chosen to handle
 359 the first entry in the collision chain slightly differently from the
 360 rest by not destroying the associated mutex when the count goes down
 361 to zero. Instead the decision is deferred until when a mutex with the
 362 same hash code gets locked and would thus use this location.  The
 363 major benefit of this approach is that the code to lock the mutex need
 364 not (in the common case) check for the location to be free, since each
 365 hash table location will, during the course of execution, only be free
 366 at the beginning of the program and will then never be freed again.
 367 The unlocking code is simplified by the fact that the code need not
 368 check whether the hash table location should be freed.  Based on the
 369 experience that a recently locked mutex is likely to be locked again
 370 in the near future, this technique also improves overall performance.
 371
 372 \begin{figure}
 373 \begin{verbatim}
 374  1 void monitorenter (object *o)
 375  2 {
 376  3   entry *e;
 377  4   disablePreemption();
 378  5
 379  6   e = firstChainEntry(o);
 380  7   if (e->obj == o
 381  8       && e->mux.holder
 382  9          == currentThread)
 383 10     e->mux.count++;
 384 11   else
 385 12     lockMutexForObject(e,o);
 386 13
 387 14   enablePreemption();
 388 15 }
 389 \end{verbatim}
 390 \caption{Code of \texttt{monitorenter()}}
 391 \label{monitorenter}
 392 \end{figure}
 393
 394 \begin{figure}
 395 \begin{verbatim}
 396  1 void monitorexit (object *o)
 397  2 {
 398  3   entry *e;
 399  4   disablePreemption();
 400  5
 401  6   e = firstChainEntry(o);
 402  7   if (e->obj == o)
 403  8     e->mux.count--;
 404  9   else
 405 10     unlockMutexForObject(e,o);
 406 11
 407 12   enablePreemption();
 408 13 }
 409 \end{verbatim}
 410 \caption{Code of \texttt{monitorexit()}}
 411 \label{monitorexit}
 412 \end{figure}
 413
 414
 415 \begin{table*}
 416 \begin{center}
 417 \begin{tabular}{|l|r|r|r|r|r|}    % monitorEnter
 418 \hline
 419 Program     & Line 6 & Line 10 & count 0 & Line 12 & Ratio $12/6$ \\ \hline\hline
 420 javac       & 420147 & 405978  & 381019  & 14169   & 3.372 \%     \\ \hline
 421 Biss        & 384350 & 370171  & 425038  & 14179   & 3.689 \%     \\ \hline
 422 Jigsaw      & 426695 & 391680  & 294957  & 35015   & 8.206 \%     \\ \hline
 423 \end{tabular}
 424 \caption{Execution statistics for \texttt{monitorenter()}}
 425 \label{monitorEnter}
 426 \end{center}
 427 \end{table*}
 428
 429
 430 \begin{figure}
 431 \begin{verbatim}
 432  1 void lockMutexForObject (entry *e,
 433  2                          object *o) {
 434  3   disablePreemption();
 435  4
 436  5   if (e->obj != NULL)
 437  6     if (e->mux.count == NULL)
 438  7       claimEntry(e,o);
 439  8     else
 440  9       while (e->obj != o) {
 441 10         if (e->next == NULL) {
 442 11           e = e->next = allocEntry(o);
 443 12           break;
 444 13         }
 445 14         e = e->next;
 446 15       }
 447 16   else
 448 17     e->obj = o;
 449 18
 450 19   lockMutex(&e->mux);
 451 20
 452 21   enablePreemption();
 453 22 }
 454 \end{verbatim}
 455 \caption{Code for \texttt{lockMutexForObject()}}
 456 \label{lockMutexForObject}
 457 \end{figure}
 458
 459 \begin{figure}
 460 \begin{verbatim}
 461  1 void unlockMutexForObject (entry *e,
 462  2                            object *o) {
 463  3   disablePreemption();
 464  4
 465  5   if (e->obj == o)
 466  6     unlockMutex(&e->mux);
 467  7   else {
 468  8     /* Assuming entry is there */
 469  9     while (e->next->obj != o)
 470 10       e = e->next;
 471 11     unlockMutex(&e->next->mux);
 472 12     if (e->next->mux.count == 0)
 473 13       e->next = freeEntry(e->next);
 474 14   }
 475 15
 476 16   enablePreemption();
 477 17 }
 478 \end{verbatim}
 479 \caption{Code for \texttt{unlockMutexForObject()}}
 480 \label{unlockMutexForObject}
 481 \end{figure}
 482
 483 See figures~\ref{monitorenter} and~\ref{monitorexit} for the code of
 484 the corresponding JavaVM instructions. These functions handle (as we
 485 show below) the most common case and depend on the two functions for
 486 the general case presented in figures~\ref{lockMutexForObject}
 487 and~\ref{unlockMutexForObject} (we now assume that
 488 \texttt{enablePreemption()}/\texttt{disablePreemption()} pairs can be
 489 nested).
 490
 491 Even if they are not short enough to be inlined in every synchronized
 492 method, these functions are certainly small enough to make the effort
 493 of coding them as specially tailored, assembly language functions
 494 worthwhile. This bypasses the standard subroutine linkage conventions,
 495 gaining a little extra speed.
 496
 497 \subsection{Results}
 498
 499 \begin{table}
 500 \begin{center}
 501 \begin{tabular}{|l|r|r|r|r|}      % monitorExit
 502 \hline
 503 Program & Line 6 & Line 8 & Line 10 & Ratio $10/6$ \\ \hline\hline
 504 javac       & 420146 & 419815 & 331     & 0.078 \% \\ \hline
 505 Biss        & 384368 & 383281 & 1087    & 0.282 \% \\ \hline
 506 Jigsaw      & 428997 & 416890 & 12107   & 2.822 \% \\ \hline
 507 \end{tabular}
 508 \caption{Execution statistics for \texttt{monitorexit()}}
 509 \label{monitorExit}
 510 \end{center}
 511 \end{table}
 512
 513 To demonstrate that nearly all cases are indeed handled by these small
 514 routines, we have written a small application which simulates the
 515 locking and unlocking operations of the three applications we used
 516 above (tables~\ref{monitorEnter} and \ref{monitorExit}). As can be
 517 seen, only a small percentage of cases need to be handled in the
 518 general routines \texttt{lockMutexForObject()} and
 519 \texttt{unlockMutexForObject()}.
 520
 521 The column `count 0' gives the number of mutex locks where the {\tt count}
 522 of the mutex before locking is zero. We present this number to evaluate an
 523 implementation where {\tt monitorenter()} checks the {\tt count} instead of
 524 the {\tt holder} to determine if the mutex can be locked immediately. The
 525 numbers seem to be in favour of checking {\tt holder} but at least indicate
 526 that the difference is neglectable. Furthermore, locking the mutex after
 527 only having checked {\tt count} needs an additional assignment (setting
 528 {\tt holder}).
 529
 530 \begin{table*}
 531 \begin{center}
 532 \begin{tabular}{|l|r|r|r|r|} \hline
 533             & \multicolumn{4}{|c|}{miss rates by size of cache} \\
 534 Application & 1 Element & 2 Elements & 4 Elements & 8 Elements \\
 535 \hline\hline
 536 javac       & 15.076 \% &  9.757 \% &  4.931 \% & 3.193 \% \\ \hline
 537 Biss        & 13.488 \% &  8.349 \% &  4.765 \% & 3.141 \% \\ \hline
 538 Jigsaw      & 43.694 \% & 37.700 \% & 22.680 \% & 5.182 \% \\ \hline
 539 \end{tabular}
 540 \caption{Results of cache simulation}
 541 \label{CacheResults}
 542 \end{center}
 543 \end{table*}
 544
 545 We have also considered the possibility of using a cache of recently
 546 used mutexes to improve performance, similar to a
 547 translation-lookaside buffer in microprocessors which cache the
 548 mapping between virtual and physical memory pages. To evaluate whether
 549 this would be worthwhile, we have simulated caches with one, two, four
 550 and eight elements using the three applications as test candidates. We
 551 have used least-recently-used as the cache replacement strategy.
 552 Though this is not easily implemented in software, it provides a good
 553 estimate of the best hit rate that can be achieved with an efficient
 554 implementation.  Table~\ref{CacheResults} summarizes the results.
 555
 556 \begin{table*}
 557 \begin{center}
 558 \begin{tabular}[b]{|l|c|c|c|c|c|}
 559 \hline
 560                              & JavaLex &  javac & espresso &  Toba   &
 561 java\_cup \\ \hline
 562 run time without threads     &   2.82  &   4.91 &    3.23  &   4.32
 563 &    1.35   \\ \hline
 564 run time with threads        &   3.89  &   5.40 &    3.23  &   5.53
 565 &    1.56   \\ \hline
 566 overhead (optimized impl.)   &   37 \% &  10 \% &    0 \%  &   28 \%
 567 &    15 \%  \\ \hline
 568 number of lock/unlock pairs  & 1818889 & 420146 &   2837   & 1558370 &
 569 124956   \\ \hline
 570 \end{tabular}
 571 \caption{Overhead of monitor operations}
 572 \label{CACAOMutexCost}
 573 \end{center}
 574 \end{table*}
 575
 576 Using an implementation supporting the monitor routines, as discussed
 577 in section~\ref{objectmutex}, and one implementation without thread
 578 support, we have run several applications on a 300MHz DEC 21064A (see
 579 table~\ref{CACAOMutexCost}). For these single threaded applications,
 580 the overhead introduced by monitor operations ranges from 0\% to 37\%,
 581 depending on the number of monitor operations in the applications.
 582 Note, however, that this cannot be compared to the overhead figures
 583 given in table \ref{SyncronizedOverhead}, since these applications do
 584 more than just entering and exiting a monitor.
 585
 586 Using the implementation described, the {\em mutex test} application
 587 for table \ref{SyncronizedOverhead} took 0.40 seconds with a
 588 synchronized and 0.28 seconds with an ordinary method to complete. In
 589 this program the time spent on locking/unlocking a mutex is 0.40
 590 microseconds. The reason for the higher cost of mutex operations in
 591 the {\em tree test} is that this test violates the locality of monitor
 592 operations. Overall, these numbers compare very favorably with the
 593 other implementations.
 594
 595 For most single-threaded applications, the monitor overhead can be
 596 eliminated completely. If it is possible to determine statically that
 597 the dynamic class-loader and the \texttt{java.lang.Thread} class are
 598 not used, synchronization code need not be generated.
 599
 600
 601 \subsection{Thin locks}
 602
 603