
MACH EXCEPTION HANDLER NOTES
Cyrus Harmon, December 2007

The goal of this work is to use make SBCL use mach exception handlers
instead of so-called BSD-style signal handlers on Mac OS X. Cyrus
Harmon and Alastair Bridgewater have been working on this.

Mac OS X has a mach-based kernel that has its own API for things like
threads and exception handling. Both mach exception handlers and
BSD-style signal handlers are available for use by application
programmers, but the signal handlers, which are implemented as a
compatibility layer on top of mach exceptions, have some problems. The
main problem for SBCL is that when using BSD-style signal handlers to
respond to SIGSEGV for access to protected memory areas, we cannot use
gdb to debug the process. This is problematic for SBCL which sets up,
and reads and writes to, protected memory areas early and
often. Additionally, threaded builds are seeing a number of problems
with SIGILLs being thrown at odd times. It appears that these are
coming from with the OS signal handling libraries directly, so
debugging these is rather tricky, especially in the absence of a
debugger. Thirdly, the protected memory accesses can, under certain
settings, trigger the Mac OS X CrashReporter, either logging
voluminous messages to a log file or, worse yet, triggering a
user-intervention dialog.

To address these three problems, we propose replacing the BSD-style
signal handling facility with a mach exception handling
facility. Preliminary tests with mach exceptions show that GDB is much
happier when using mach exceptions to respond to access to protected
memory. While mach exceptions probably won't directly fix the
threading problems, they remove a potentially problematic section of
code, the portion of the Mac OS X system library that deals with
BSD-style signal handling emulation and delivery of those signals to
multiple threads. Even if using mach exceptions in and of itself
doesn't immediately fix the problem, it should me much easier to
diagnose using a debugger. Finally, the CrashReporter problem appears
to go away as well, as this arises from an unfortunate placement of
the CrashReporting facility in the OS between the mach exception
handling and the BSD-style signal emulation. By catching the mach
exceptions ourselves, we avoid this problem.

* Mach exception handling details and example

Mach exceptions work by creating a thread that listens for mach
exceptions. The (slightly under-documented) OS function mach_msg_server
is passed the exc_server function and exc_server in turn calls our
catch_exception_raise function when an appropriate exception is
triggered. (Note that catch_exception_raise is called by exc_server
directly by that name. I have no idea how to provide multiple such
functions or to call this function by another name, but we should be
OK with a a single exception handling function.)

To set this up we perform the following steps:

1. allocate a mach port for our exceptions.

2. give the process the right to send and receive exceptions on this
   port.

3. create a new thread which calls mach_msg_server, providing
   exc_server as an argument. exc_server in turn calls our
   catch_exception_raise when an exception is raised.

4. finally, each thread for which we would like exceptions to be
   handled must register itself with the exception port by calling
   thread_set_exception_ports with the appropriate port and exception
   mask. Actually, it's a bit more involved than this in order to
   support multiple threads. Please document this fully.

* USE_MACH_EXCEPTION_HANDLER

The conditional compilation directive USE_MACH_EXCEPTION_HANDLER is a
flag to use mach exception handling. We should continue to support the
BSD-style signal handling until long after we are convinced that the
mach exception handling version works better.

* Establishing the mach exception handler

** x86-darwin-os.c

A new function, darwin_init, is added which creates the mach exception
handling thread and establishes the exception port. Currently the
"main" thread sets its exception port here, but when we go to a
multithreaded SBCL, we will need to do similarly for new threads in
arch_os_thread_init. Note that even "non-threaded" SBCL builds will
have two threads, one lisp thread and a mach exception handling
thread.

catch_exception_raise listens for EXC_BAD_ACCESS and
EXC_BAD_INSTRUCTION (and EXC_BREAKPOINT if we were to use INT3 traps
again instead of the SIGILL traps we've set up as a workaround to the
broken INT3 traps). Analogous to the signal handling context, mach
exceptions allow use to get the thread and exception state of the
triggering thread. We build a "fake" signal context, similar to what
would be seen if a SIGSEGV/SIGILL were triggered and pass this on to
SBCL's memory_fault_handler (for SIGSEGV) or sigill_handler (for
SIGILL). when the handlers return, we set the values of the
thread_state using the values from the fake context, allowing the
"signal handler" to modify the state of the calling thread.

** x86-arch.c

sigill_handler and sigtrap_handler are no longer installed as signal
handlers using undoably_install_low_level_interrupt_handler. Instead
sigill_handler is called directly by the mach exception handling
catch_exception_raise. This means that sigill_handler can no longer be
static void and it is changed to just void.

** bsd-os.c

memory_fault_error no longer installed using
undoably_install_low_level_interrupt_handler and changed to not be
static. darwin_init called.

* Handling exceptions

** interrupt.c

The code for general purpose error handling, which is generally done
by a trap instruction followed by an error opcode (although on Mac OS,
we use UD2A instead of INT3 as INT3 trapping is unreliable and the
sigill_handler in turn calls the sigtrap_handler). sigtrap_handler in
turn calls functions like interrupt_internal_error that are found in
interrupt.c.

Using BSD-style signal handling, interrupt_internal_error calls into
lisp via the lisp function INTERNAL-ERROR via funcall2 (the two
argument form of funcall). Since we are executing the exception
handler on the exception handling thread and we don't really want to
be executing lisp code on the exception handlers thread, we want to
return to the lisp thread as quickly as possible. With BSD-style
signal handling, the signal handlers themselves call into lisp using
funcallN. We can't do this as then we would be attempting to execute
lisp code on the exception handling thread. This would be a bad thing
in a multi-threaded lisp. Therefore, we borrow a trick from the
interrupt handling code and hack the stack of the offending thread
such that when the mach exception handling code returns, and returns
control back to the offending thread, it first calls a lisp function,
then (unless otherwise directed) returns control to the lisp
thread. This allows us to run our lisp (or other) code on the
offending thread's stack, but before the offending thread resumes
where it left off. See the arrange_return_to_lisp_function for details
on how this is done.

arrange_return_to_lisp_function was modified to take an additional
parameter specifying the number of additional arguments, and
additional varargs, which are then placed on the stack in the
call_into_lisp_tramp in x86-assem.S.

The problem with this is that both the signal handling/mach exception
handling code, on the one hand, and the lisp code expect access to the
"context" for accessing the state of the thread at the time that the
signal/exception was raised. This means that the old strategy was:

offending thread
  (operating system establishes signal context)
  signal handler
    lisp code
  signal handler
  (operating system restores state from signal context)
offending thread

and both the signal handler and the lisp code have the chance to
examine and modify the signal context. Now the situation looks like
this:

offending thread
 (operating system establishes thread_state)
 mach_exception_handler
   signal handler (which may arrange return to a lisp function)
 mach_exception_handler
 (operating system restores from thread_state)
 (optionally, a lisp function is called here)
offending thread

So we need to figure out how to provide the lisp function with
information about the context of the offending thread and allow the
lisp code to alter this state and to restore that state prior to
resuming control to the offending thread.

There are, presumably, additional problems with exception masking and
when threads are allowed to interrupt other threads or otherwise catch
signals/exceptions, but we can defer those for the moment.

* Providing a "context" for lisp functions called on returning from a
mach exception handler

Given the flow describe above, we need to expand the following step:

 (optionally, a lisp function is called here)

now it needs to look something like:

0. offending thread triggers a mach exception

1. (operating system establishes thread_state)

2. enter mach_exception_handler

  2a. signal handler (which may arrange return to a lisp function

  2b. frob the offending threads stack, allocating a context on the
      stack (if appropriate, or always?)

3. exit mach_exception_handler

4. (operating system restores from thread_state)

6. with the context on the stack, transfer control to the lisp
       function

7. restore the thread state from the context which either returns
   control to original location in the offending thread or to wherever
   the error-handling code has modified the context to point to
 

* x86-darwin-os.c (again)

** call_c_function_in_context and signal_emulation_wrapper

We arrange for a function to be a called by the offending the thread
when the mach exception handler returns. Essentially we have our own
BSD-style signal emulation library that calls memory_fault_handler,
sigtrap_handler and sigill_handler, as appropriate. It does this by
calling a function call_c_function_in_context which sets up the EIP of
the thread context to call the specified C function, which the
specified arguments. In this case, signal_emulation_wrapper which
takes as arguments a thread_state, which is a copy of the thread state
as it existed upon entry into catch_exception_raise, and an emulated
signal and siginfo and the signal handling function that
signal_emulation_wrapper is to call.

signal_emulation_wrapper creates a BSD-style signal context and
populates it from the values in the passed in thread_state. It calls
the specified signal_handler, and then sets the values in the
thread_state from the context and loads the address of the thread
state to restore into eax and then traps with a special trap that the
catch_exception_raise looks for, which then extracts the thread state
from the trap exception's thread_state.eax.

[MORE DETAILS TO FOLLOW]


===== BUGS =====

MEH1: on threaded macos builds, init.test.sh fails with a
      memory-fault-error (NOTE: this is a threaded macos issue, not a
      mach exception handler bug).

MEH2: timer.impure lisp fails on mach-exception-handler builds

MEH3: threads.impure lisp fails on mach-exception-handler builds

