\section{Multithreading}

\subsection{Introduction}

Threads are an integral part of the Java programming language. A Java
Runtime Environment (JRE) has to implement threads to be compliant. A
thread implementation comprises the creation, execution, switching,
killing and synchronization of threads. In Java, the latter is provided
by monitors. Monitors ensure that, for a given object, at most one
thread at a time can execute certain methods (known as synchronized
methods) and blocks, which are marked by the keyword
\texttt{synchronized}. Monitors are usually implemented using a mutex
(mutual exclusion lock), a data structure which contains the
information necessary to guarantee that only one unit of execution can
perform a critical section at the same time \cite{Stallings95}. As we
show in section~\ref{MonitorTest}, a fast implementation of the
synchronization mechanism is crucial for the efficiency of Java. One
implementation produced more than 800\% overhead in one of our tests.

\subsection{Related work}

Java threads can be implemented using threads provided by the operating
system kernel, as user-level libraries, or as a combination of the two.
A number of articles describe different aspects of thread
implementation, but none addresses the problem of efficient monitor
operations for objects.

Sun's first implementation of the JavaVM on Solaris was based on
user-level threads. The current implementation uses a combination of
kernel and user-level threads. Some of the advantages of this approach
are outlined in \cite{SunThreads97}.

The freely available JavaVM implementation \texttt{kaffe} by Tim
Wilkinson uses user-level threads \cite{Wilkinson:97}. Until version
0.9, each object contained the complete mutex data structure. This
enabled a fast monitor implementation but used much more memory than
necessary.

Apart from the thread implementations used in JavaVMs, there are many
other thread standards and implementations, the most notable being the
IEEE POSIX extension \cite{PosixThreads96}. In \cite{Mueller93}, Mueller
describes a library implementation of POSIX threads on a standard UNIX
system. This library is a user-level implementation which needs no
support from the operating system. A very popular library of user-level
primitives for implementing threads is {\em QuickThreads} by David
Keppel, described in \cite{Keppel93}.

Bershad et al.\ present a fast mechanism for mutual exclusion on
uniprocessor systems \cite{Bershad+92}. They describe a software
implementation of an atomic test-and-set operation which is faster than
the corresponding hardware instruction. However, their implementation
relies on assistance from the operating system.

\subsection{Implementation basics}

A complete thread implementation for Java has to provide:
\begin{itemize}
\item thread creation and destruction,
\item low-level thread switching (usually implemented in assembly
      language),
\item thread scheduling,
\item synchronization primitives,
\item a thread-safe, non-blocking input/output library.
\end{itemize}

Cacao's current implementation of threads is based mainly on the
threading code of \texttt{kaffe} version 0.7, which has been released
under a BSD-style license and can thus be used freely
\cite{Wilkinson:97}. As mentioned above, \texttt{kaffe}'s threads are
completely user-level, which means, for example, that they cannot take
advantage of a multiprocessor system.
\begin{table*}
\begin{center}
\begin{tabular}{|l|l|c|c|c|c|c|c|} \hline
& & \multicolumn{3}{|c|}{mutex test} & \multicolumn{3}{|c|}{tree test}\\ \hline
& & \multicolumn{2}{|c|}{run time in secs}&call cost & \multicolumn{2}{|c|}{run time in secs}&call cost\\
Machine & JavaVM & no sync & sync & in $\mu$s & no sync & sync & in $\mu$s \\ \hline\hline
DEC 21064A 289MHz & OSF JDK port (1.0.2)& 1.20 & 4.14 & 9.8 & 8.37 & 34.69 & 9.8 \\ \hline
DEC 21064A 289MHz & DEC JDK interpreter & 1.71 & 12.80 & 36.97 & 12.25 & 143.10 & 39.93 \\ \hline
DEC 21064A 289MHz & DEC JDK JIT & 1.06 & 11.05 & 33.30 & 7.80 & 130.50 & 37.45 \\ \hline
Pentium-S 166MHz & Linux JDK 1.1.3 & 0.96 & 1.69 & 2.43 & 7.93 & 16.12 & 2.50 \\ \hline
AMD K6 200MHz & Sun JPP JIT (WinNT) & 0.57 & 0.94 & 1.23 & 2.8 & 9.8 & 2.13 \\ \hline
AMD K6 200MHz & Navigator 4.04 (WinNT) & 0.03 & 0.77 & 2.46 & 2.3 & 9.2 & 2.11 \\ \hline
%PowerPC 604e 200MHz & Linux JDK Beta & 1.16 & 2.05 & 2.96 & & & \\ \hline
%Sun Ultra-1 & JDK 1.1.2 & 0.90 & 1.25 & 1.16 & & & \\ \hline
%DEC 21064A 300MHz & Cacao unoptimized & 0.24 & 0.48 & 0.80 & 0.24 & 0.48 & 0.80 \\ \hline
DEC 21064A 289MHz & Cacao & 0.28 & 0.40 & 0.40 & 1.46 & 4.71 & 0.99 \\ \hline
\end{tabular}
\caption{Overhead for calling synchronized methods (one object)}
\label{SyncronizedOverhead}
\end{center}
\end{table*}

There are several reasons why we chose this approach:
\begin{itemize}
\item Thread support differs from operating system to operating system.
      Not only do operating systems provide different APIs, but even
      when they provide the same interface (most often in the form of
      POSIX threads), the costs of the various operations often differ
      considerably across platforms.
\item It was a complete implementation, tailored for use in a JavaVM,
      and thus made it possible for us to get thread support up and
      running quickly.
\item Parts of Cacao are not yet thread-safe (most notably the compiler
      and the garbage collector), which makes it difficult to use a
      kernel-supported solution.
\end{itemize}

Each thread in a Java program corresponds to an instance of the
\texttt{java.lang.Thread} class. In order for the JRE to associate
platform-specific information with such an instance, the class has a
private member called \texttt{PrivateInfo} of type \texttt{int}, which
can be used by the JRE. Kaffe version 0.7 used this member as a pointer
to a context structure. Since pointers are 64-bit values on the Alpha,
we use an array of context structures instead; \texttt{PrivateInfo}
then serves as an index into this array.

A fixed-size stack is allocated for each thread. The stack size for
ordinary threads can be specified as a command-line parameter; special
threads (such as the garbage collector) have their own stack sizes. In
order to catch stack overflows without the burden of checking the stack
top at each method entry, we simply guard the stack top with one or
more memory pages, as sketched in the code below. The memory protection
bits of these pages are set such that any access causes a memory
protection violation. The number of guard pages can be specified on the
command line.

Thread switching is implemented straightforwardly: the callee-saved
registers are saved to the stack, the stack pointer is switched to the
new thread's stack, the new thread's callee-saved registers are
restored, and a subroutine return is performed. Scheduling is very
simple in Cacao: higher-priority threads always get precedence over
lower-priority threads, and threads of the same priority are scheduled
with a round-robin algorithm.
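To make the guard-page technique concrete, the following sketch shows
how a thread stack might be set up with protected pages at its top. It
is a minimal illustration assuming POSIX \texttt{mmap}/\texttt{mprotect}
(with the common \texttt{MAP\_ANONYMOUS} extension); the function name
and layout are ours, not Cacao's actual allocator:

\begin{verbatim}
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Sketch: allocate a thread stack guarded by
   'nguard' inaccessible pages. Stacks grow
   downwards on most machines (including the
   Alpha), so the pages at the *lowest*
   addresses are protected. Any access to them
   faults, which the VM can then report as a
   stack overflow. */
char *allocThreadStack(size_t stacksize,
                       int nguard)
{
    size_t pagesize =
        (size_t) sysconf(_SC_PAGESIZE);
    size_t guardsize = nguard * pagesize;
    char *base = mmap(NULL,
                      guardsize + stacksize,
                      PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS,
                      -1, 0);
    if (base == MAP_FAILED)
        return NULL;

    /* revoke all access to the guard region */
    if (mprotect(base, guardsize,
                 PROT_NONE) != 0) {
        munmap(base, guardsize + stacksize);
        return NULL;
    }
    /* initial stack pointer */
    return base + guardsize + stacksize;
}
\end{verbatim}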
I/O in user-level threads is a problem since UNIX I/O calls are, by
default, blocking; they would block not just the current thread but the
whole process, which is undesirable. It is thus common practice for
thread libraries to use non-blocking I/O and, in the case of an
operation that would block, suspend the corresponding thread and
regularly perform a multiplexed I/O operation (\texttt{select(2)} in
UNIX) on all such blocked files, especially when no runnable threads
are available.

\subsection{Motivation}
\label{MonitorTest}

To optimize an implementation it is necessary to find the parts which
are critical to performance. Therefore, we analysed the cost of
monitors with two small test programs. The {\em mutex test} program
simply invoked a method 300000 times, once with the method declared
\texttt{synchronized} and once without. The other test, {\em tree
test}, allocated a balanced binary tree with 65535 elements and
recursively traversed it 50 times using a method, again once
synchronized and once not. Table~\ref{SyncronizedOverhead} shows the
run times in seconds of both versions of the programs for several
JavaVMs. The column `call cost' gives the overhead of a call of a
synchronized method compared to a call of a non-synchronized one. From
these numbers it is obvious that calls of synchronized methods, or
monitor operations in general, are much slower than ordinary method
calls.

\begin{table*}
\begin{center}
\begin{tabular}{|l|r|r|r|r|} \hline
Application & Objects allocated & Objects with mutex & Monitor operations & Parallel mutexes \\ \hline\hline
javac & 111504 & 13695 & 840292 & 5 \\ \hline
Biss & 84939 & 13357 & 1058901 & 12 \\ \hline
Jigsaw & 215411 & 23804 & 855691 & 25 \\ \hline
\end{tabular}
\caption{Mutual exclusion statistics}
\label{MutexStatistics}
\end{center}
\end{table*}

The question now is whether this has any influence on common Java
programs. To answer it, we used a modified version of \texttt{kaffe} to
gather statistics about monitor usage. The results are summarized in
table~\ref{MutexStatistics}. Javac is an invocation of Sun's
\texttt{javac} on the Toba source files \cite{Proebsting+97} and is
thus single-threaded. Execution of this program takes only a few
seconds using Cacao with threads disabled. Biss is a more or less
typical working session with the Java development environment of the
Biss-AWT \cite{BissAWT97}. It is slightly multithreaded. Jigsaw invokes
the HTTP server Jigsaw \cite{Jigsaw97} of the World Wide Web Consortium
and lets it serve identical parallel requests from seven hosts,
amounting to about one megabyte split across 200 files per request.
This application is highly multithreaded.

The table contains the number of objects allocated during execution and
the number of objects for which a monitor has been entered. The
`Monitor operations' column gives the total number of operations
performed on monitors, while the `Parallel mutexes' column shows the
greatest number of mutexes that were locked simultaneously. These
numbers demonstrate that the performance of monitor operations is of
paramount importance for a fast JavaVM implementation. We did not
include the number of blocking monitor operations because there were
just two of them, namely in the Biss application.
It was only after we modified \texttt{kaffe} to switch to another
thread before each \texttt{monitorenter} operation that the Biss and
Jigsaw applications performed a few thousand blocking
\texttt{monitorenter}s.

\subsection{Mutex implementation}

Our mutex data structure includes a pointer to the thread that has
currently locked the mutex (\texttt{holder}). If the mutex is not
locked at all, this is a null pointer. Because one thread can lock the
same mutex multiple times, we need a count of how many lock operations
without corresponding unlocks have been performed on the mutex
(\texttt{count}); when it drops to zero, the mutex is not locked by any
thread. Furthermore, we need to keep track of the threads which have
tried to lock the mutex, were blocked and are now waiting for it to
become unlocked (\texttt{waiters}). The data structure is defined as
follows:

\begin{verbatim}
struct mutex {
    int     count;    /* lock nesting depth  */
    thread *holder;   /* owner, NULL if free */
    thread *waiters;  /* blocked threads     */
};
\end{verbatim}

The locking of a mutex can now be specified as in
figure~\ref{lockMutex}.

\begin{figure}
\begin{verbatim}
void lockMutex (struct mutex *mux)
{
    disablePreemption();

    if (mux->holder == NULL) {
        mux->holder = currentThread;
        mux->count = 1;
    } else if (mux->holder
               == currentThread) {
        mux->count++;
    } else {
        blockUntilMutexIsNotLocked(mux);
        mux->holder = currentThread;
        mux->count = 1;
    }

    enablePreemption();
}
\end{verbatim}
\caption{Code for \texttt{lockMutex()}}
\label{lockMutex}
\end{figure}

The macro \texttt{disablePreemption()} simply sets a global flag
indicating that a critical section is being executed and that
preemption must not take place. \texttt{enablePreemption()} unsets the
flag and checks whether a thread switch is necessary (see below). On a
multiprocessor system this would need to be implemented using a
test-and-set instruction, or some equivalent.

The inverse operation, namely the unlocking of the mutex, can be
expressed as in figure~\ref{unlockMutex}.

\begin{figure}
\begin{verbatim}
void unlockMutex (struct mutex *mux)
{
    disablePreemption();

    --mux->count;
    if (mux->count == 0) {
        mux->holder = NULL;
        if (mux->waiters != NULL) {
            thread *next = mux->waiters;
            mux->waiters = next->next;
            resumeThread(next);
        }
    }

    enablePreemption();
}
\end{verbatim}
\caption{Code for \texttt{unlockMutex()}}
\label{unlockMutex}
\end{figure}

The function \texttt{resumeThread()} sets a flag which results in a
thread switch to the given thread after preemption has been re-enabled.
These algorithms are simple, straightforward and reasonably efficient.

\subsection{Object-Mutex relation}
\label{objectmutex}

Since the JavaVM specification states that each object has a mutex
associated with it, the obvious solution seems to be to include the
mutex structure in each object. The mutex structure could be reduced to
a length of eight bytes if we used thread numbers instead of pointers.
But with such a solution the javac application would allocate nearly
one megabyte of mutex data for just a few seconds of execution. This is
unacceptable. On the other hand, the figures show that it is very
seldom that more than 20 mutexes are locked at the same time. This
suggests that a hash table, indexed by the address of the object for
which a monitor operation is to be performed, could be very efficient.
This is exactly what Cacao does. We use a hash function which uses the
$2n$ least significant bits of the object's address, where $2^n$ is the
size of the hash table. This function can be implemented in four RISC
instructions.
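As an illustration, one hash function with these properties might look
as follows, folding the $2n$ low-order address bits into an $n$-bit
index. The exact folding is a guess on our part; the initial shift by
3, which discards the alignment bits of 8-byte-aligned objects, is one
possible choice that yields the four RISC instructions (two shifts, an
XOR and an AND) stated above:

\begin{verbatim}
#define LOG_TABSIZE 7  /* table size = 2^n */
#define TABSIZE (1 << LOG_TABSIZE)

/* Illustrative hash: XOR-fold the 2n
   low-order bits of the object's address
   into an n-bit table index. */
static unsigned int hashObject(void *obj)
{
    unsigned long a =
        (unsigned long) obj >> 3;
    return (a ^ (a >> LOG_TABSIZE))
           & (TABSIZE - 1);
}
\end{verbatim}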
Since we ran into performance problems with overflow handling by
internal chaining, we now use external chaining with a preallocated
pool of overflow entries managed by a free list. An entry in the hash
table has the following structure:

\begin{verbatim}
struct entry {
    object       *obj;
    struct mutex  mux;
    struct entry *next;
};
\end{verbatim}

In order to minimize the overhead of the locking/unlocking operations,
we should strive for code sequences small enough to be inlined, yet
powerful enough to handle the common case. We have chosen to handle the
first entry in the collision chain slightly differently from the rest:
its associated mutex is not destroyed when the count drops to zero.
Instead, the decision is deferred until a mutex with the same hash code
gets locked and would thus use this location. The major benefit of this
approach is that the code to lock the mutex need not (in the common
case) check whether the location is free, since each hash table
location is only free at the beginning of the program and, once used,
is never freed again. The unlocking code is simplified by the fact that
it need not check whether the hash table location should be freed.
Based on the experience that a recently locked mutex is likely to be
locked again in the near future, this technique also improves overall
performance.

\begin{figure}
\begin{verbatim}
 1  void monitorenter (object *o)
 2  {
 3    entry *e;
 4    disablePreemption();
 5
 6    e = firstChainEntry(o);
 7    if (e->obj == o
 8        && e->mux.holder
 9           == currentThread)
10      e->mux.count++;
11    else
12      lockMutexForObject(e,o);
13
14    enablePreemption();
15  }
\end{verbatim}
\caption{Code of \texttt{monitorenter()}}
\label{monitorenter}
\end{figure}

\begin{figure}
\begin{verbatim}
 1  void monitorexit (object *o)
 2  {
 3    entry *e;
 4    disablePreemption();
 5
 6    e = firstChainEntry(o);
 7    if (e->obj == o)
 8      e->mux.count--;
 9    else
10      unlockMutexForObject(e,o);
11
12    enablePreemption();
13  }
\end{verbatim}
\caption{Code of \texttt{monitorexit()}}
\label{monitorexit}
\end{figure}

\begin{table*}
\begin{center}
\begin{tabular}{|l|r|r|r|r|r|} \hline
Program & Line 6 & Line 10 & count 0 & Line 12 & Ratio $12/6$ \\ \hline\hline
javac & 420147 & 405978 & 381019 & 14169 & 3.372 \% \\ \hline
Biss & 384350 & 370171 & 425038 & 14179 & 3.689 \% \\ \hline
Jigsaw & 426695 & 391680 & 294957 & 35015 & 8.206 \% \\ \hline
\end{tabular}
\caption{Execution statistics for \texttt{monitorenter()}}
\label{monitorEnter}
\end{center}
\end{table*}

\begin{figure}
\begin{verbatim}
void lockMutexForObject (entry *e,
                         object *o)
{
    disablePreemption();

    if (e->obj != NULL) {
        if (e->mux.count == 0)
            /* chain head is unlocked:
               reuse it for o */
            claimEntry(e,o);
        else
            /* search the chain, adding
               an entry if o is missing */
            while (e->obj != o) {
                if (e->next == NULL) {
                    e = e->next
                      = allocEntry(o);
                    break;
                }
                e = e->next;
            }
    } else
        e->obj = o;

    lockMutex(&e->mux);

    enablePreemption();
}
\end{verbatim}
\caption{Code for \texttt{lockMutexForObject()}}
\label{lockMutexForObject}
\end{figure}

\begin{figure}
\begin{verbatim}
void unlockMutexForObject (entry *e,
                           object *o)
{
    disablePreemption();

    if (e->obj == o)
        unlockMutex(&e->mux);
    else {
        /* assuming the entry
           is in the chain */
        while (e->next->obj != o)
            e = e->next;
        unlockMutex(&e->next->mux);
        if (e->next->mux.count == 0)
            e->next = freeEntry(e->next);
    }

    enablePreemption();
}
\end{verbatim}
\caption{Code for \texttt{unlockMutexForObject()}}
\label{unlockMutexForObject}
\end{figure}

Figures~\ref{monitorenter} and~\ref{monitorexit} show the code of the
corresponding JavaVM instructions.
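The helpers \texttt{allocEntry()} and \texttt{freeEntry()} used in
figures~\ref{lockMutexForObject} and~\ref{unlockMutexForObject} manage
the preallocated pool of overflow entries. Their code is not shown
above; the following is a minimal sketch under the assumption of a
singly-linked free list over a fixed-size pool (pool size,
initialization and exhaustion handling are illustrative):

\begin{verbatim}
#define POOLSIZE 64  /* assumed pool size */

static struct entry  pool[POOLSIZE];
static struct entry *freelist;

/* Link all pool entries into the free
   list at VM startup. */
static void initEntryPool(void)
{
    int i;
    for (i = 0; i < POOLSIZE - 1; i++)
        pool[i].next = &pool[i + 1];
    pool[POOLSIZE - 1].next = NULL;
    freelist = &pool[0];
}

/* Take an entry from the free list and
   initialize it for object o. (Pool
   exhaustion is not handled here.) */
static struct entry *allocEntry(object *o)
{
    struct entry *e = freelist;
    freelist = e->next;
    e->obj = o;
    e->mux.count = 0;
    e->mux.holder = NULL;
    e->mux.waiters = NULL;
    e->next = NULL;
    return e;
}

/* Return an entry to the free list and
   hand its successor back so the caller
   can re-link the collision chain. */
static struct entry *freeEntry(struct entry *e)
{
    struct entry *succ = e->next;
    e->obj = NULL;
    e->next = freelist;
    freelist = e;
    return succ;
}
\end{verbatim}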
The functions \texttt{monitorenter()} and \texttt{monitorexit()} handle
(as we show below) the most common case and rely, for the general case,
on the two functions presented in figures~\ref{lockMutexForObject}
and~\ref{unlockMutexForObject} (we now assume that
\texttt{disablePreemption()}/\texttt{enablePreemption()} pairs can be
nested). Even if they are not short enough to be inlined in every
synchronized method, these functions are certainly small enough to make
it worthwhile to code them as specially tailored assembly language
functions, bypassing the standard subroutine linkage conventions for a
little extra speed.

\subsection{Results}

\begin{table}
\begin{center}
\begin{tabular}{|l|r|r|r|r|} \hline
Program & Line 6 & Line 8 & Line 10 & Ratio $10/6$ \\ \hline\hline
javac & 420146 & 419815 & 331 & 0.078 \% \\ \hline
Biss & 384368 & 383281 & 1087 & 0.282 \% \\ \hline
Jigsaw & 428997 & 416890 & 12107 & 2.822 \% \\ \hline
\end{tabular}
\caption{Execution statistics for \texttt{monitorexit()}}
\label{monitorExit}
\end{center}
\end{table}

To demonstrate that nearly all cases are indeed handled by these small
routines, we have written a small application which simulates the
locking and unlocking operations of the three applications used above
(tables~\ref{monitorEnter} and~\ref{monitorExit}). As can be seen, only
a small percentage of cases needs to be handled by the general routines
\texttt{lockMutexForObject()} and \texttt{unlockMutexForObject()}.

The column `count 0' gives the number of mutex locks where the
\texttt{count} of the mutex before locking is zero. We present this
number in order to evaluate an implementation where
\texttt{monitorenter()} checks the \texttt{count} instead of the
\texttt{holder} to determine whether the mutex can be locked
immediately. The numbers seem to favour checking \texttt{holder}, but
at least indicate that the difference is negligible. Furthermore,
locking the mutex after having checked only \texttt{count} needs an
additional assignment (setting \texttt{holder}).

\begin{table*}
\begin{center}
\begin{tabular}{|l|r|r|r|r|} \hline
& \multicolumn{4}{|c|}{miss rates by size of cache} \\
Application & 1 element & 2 elements & 4 elements & 8 elements \\ \hline\hline
javac & 15.076 \% & 9.757 \% & 4.931 \% & 3.193 \% \\ \hline
Biss & 13.488 \% & 8.349 \% & 4.765 \% & 3.141 \% \\ \hline
Jigsaw & 43.694 \% & 37.700 \% & 22.680 \% & 5.182 \% \\ \hline
\end{tabular}
\caption{Results of cache simulation}
\label{CacheResults}
\end{center}
\end{table*}

We have also considered the possibility of using a cache of recently
used mutexes to improve performance, similar to the
translation-lookaside buffer of a microprocessor, which caches the
mapping between virtual and physical memory pages. To evaluate whether
this would be worthwhile, we simulated caches with one, two, four and
eight elements, using the three applications as test candidates and
least-recently-used as the cache replacement strategy. Though LRU is
not easily implemented in software, it provides a good estimate of the
best hit rate that can be achieved with an efficient implementation.
Table~\ref{CacheResults} summarizes the results.
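To make the simulation concrete, the following sketch shows a
move-to-front LRU cache of the kind simulated here, probed with the
object of each monitor operation. The names are illustrative and this
is not the exact program used for the measurements:

\begin{verbatim}
#define CACHESIZE 4  /* 1, 2, 4 or 8 */

/* most recently locked object first */
static object *cache[CACHESIZE];
static long hits, misses;

/* Probe the cache for o, counting hits
   and misses; o always ends up in front
   (least-recently-used replacement). */
static void cacheLookup(object *o)
{
    int i;
    for (i = 0; i < CACHESIZE; i++) {
        if (cache[i] == o) {
            hits++;
            /* shift entries 0..i-1 back,
               move o to the front */
            for (; i > 0; i--)
                cache[i] = cache[i - 1];
            cache[0] = o;
            return;
        }
    }
    misses++;
    /* evict the last entry, insert o */
    for (i = CACHESIZE - 1; i > 0; i--)
        cache[i] = cache[i - 1];
    cache[0] = o;
}
\end{verbatim}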
\begin{table*}
\begin{center}
\begin{tabular}[b]{|l|c|c|c|c|c|} \hline
& JavaLex & javac & espresso & Toba & java\_cup \\ \hline
run time without threads (secs) & 2.82 & 4.91 & 3.23 & 4.32 & 1.35 \\ \hline
run time with threads (secs) & 3.89 & 5.40 & 3.23 & 5.53 & 1.56 \\ \hline
overhead (optimized impl.) & 37 \% & 10 \% & 0 \% & 28 \% & 15 \% \\ \hline
number of lock/unlock pairs & 1818889 & 420146 & 2837 & 1558370 & 124956 \\ \hline
\end{tabular}
\caption{Overhead of monitor operations}
\label{CACAOMutexCost}
\end{center}
\end{table*}

Using one implementation supporting the monitor routines discussed in
section~\ref{objectmutex} and one implementation without thread
support, we have run several applications on a 300MHz DEC 21064A (see
table~\ref{CACAOMutexCost}). For these single-threaded applications,
the overhead introduced by monitor operations ranges from 0\% to 37\%,
depending on the number of monitor operations performed. Note, however,
that these figures cannot be compared with the overhead figures given
in table~\ref{SyncronizedOverhead}, since these applications do more
than just entering and exiting monitors.

Using the implementation described, the {\em mutex test} application of
table~\ref{SyncronizedOverhead} took 0.40 seconds to complete with a
synchronized method and 0.28 seconds with an ordinary one; the time
spent on locking/unlocking a mutex is thus
$(0.40 - 0.28)\,\mbox{s} / 300000 = 0.4\,\mu\mbox{s}$ per call. The
reason for the higher cost of mutex operations in the {\em tree test}
is that this test violates the locality of monitor operations. Overall,
these numbers compare very favourably with the other implementations.

For most single-threaded applications, the monitor overhead can be
eliminated completely: if it is possible to determine statically that
the dynamic class-loader and the \texttt{java.lang.Thread} class are
not used, synchronization code need not be generated.

\subsection{Thin locks}