We would like to keep within a worst-case factor of 10 of C in speed, keeping a careful watch to make sure that every cycle spent on the common bytecodes really is necessary. For example, just because we call a prim function does not mean it needs to waste a cycle returning.
On at least some machines, jumps are particularly expensive, since they screw up the prefetch queue and make a mess of pipelining, so we are particularly careful to minimize jumps. Short of compiling native code, we need a logical minimum of one jump per opcode executed. We can achieve this minimum by branching through a dispatch table indexed by opcode, and replicating the dispatch logic at the bottom of each fast primitive, to save a return jump.
It is easy for an interpreter to wind up wasting a ridiculous amount of time checking for stack over/underflow and argument types: This can easily add several conditional jumps to a simple ADD primitive if not watched carefully.
We can tackle this by saving argument types explicitly on the stack, and having our dispatch code fold the types of the top two stack arguments into the table-index value. This implements both argument type-checking for the fast arithmetic prims -- if we jump to them, they know they have the right argument types -- and also stack underflow checking, since we keep two dummy arguments with bad types on the bottom of the stack.
We can keep sufficient spare space above the stack pointer at all times to prevent the fast prims from having to worry about stack overflow.
This line of thought leads us to putting something like

    table[ *++pc | typ0(stack_top) | typ1(stack_top[-1]) ]();

at the bottom of every fast primitive, where typ0 and typ1 are macros producing type info pre-shifted to mesh appropriately with our bytecode. This appears to be a very nearly optimal bytecode dispatch design: Every fast binary op clearly needs an opcode fetch, pc increment, and checks on both arguments. Unary and nullary ops are paying for one or two typechecks they don't need, but attempting to avoid this appears likely to add as much work as we are saving.
The main problem with the above is that it does not implement instruction limits. One could try to implement these via interrupts or the like, to keep them from slowing down the instruction dispatch code, but for now at least we fold the check into our dispatch macro. Sigh.
See the JOB_NEXT #define in `jobprims.h' or `jobbuild.c' for the actual dispatch code.