It's slow (or rather, slower in the OS once address translation is enabled) because the TLB isn't emulated 'properly': every instruction that references memory triggers a fetch from the page tables. See src/mem.c, mmutranslate_read. Walking the page tables in memory for every access to a translated address is a giant performance penalty.
It might be best to have a hash table keyed on the input (virtual) address, with buckets holding the corresponding address translations, and to remove translations from the table whenever they need to be evicted. Alternatively, you can do what MAME does (see http://wiki.mamedev.org/index.php/Virtual_TLB) and allocate one giant chunk of memory where the software-managed TLB entries can live. Either approach should work.
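To make the hash table idea a bit more concrete, here is a rough sketch of a direct-mapped software TLB in C. It is not taken from src/mem.c: every name in it (tlb_translate, tlb_flush, tlb_invalidate_page, demo_walk, the table size) is made up for illustration, and it assumes 32-bit addresses, 4 KiB pages, and a slow-path walker that always succeeds. A real version would also have to handle page faults and access permissions.

```c
#include <stdint.h>

#define PAGE_SHIFT  12u                       /* assumed 4 KiB pages */
#define PAGE_MASK   ((1u << PAGE_SHIFT) - 1u)
#define TLB_BITS    10u                       /* 1024 entries, arbitrary choice */
#define TLB_SIZE    (1u << TLB_BITS)
#define TLB_INVALID 0xFFFFFFFFu               /* never a valid tag: tags are page-aligned */

typedef struct {
    uint32_t vpage;  /* virtual page base, or TLB_INVALID if the slot is empty */
    uint32_t ppage;  /* physical page base */
} tlb_entry_t;

static tlb_entry_t tlb[TLB_SIZE];

/* Hypothetical slow path: stands in for the real page-table walk. */
typedef uint32_t (*page_walk_fn)(uint32_t vaddr);

/* Pick the slot for an address by hashing on the low bits of the page number. */
static tlb_entry_t *tlb_slot(uint32_t vaddr)
{
    return &tlb[(vaddr >> PAGE_SHIFT) & (TLB_SIZE - 1u)];
}

/* Drop every cached translation. Must be called once before first use,
   because a zero-initialised slot would otherwise look like a hit for page 0. */
static void tlb_flush(void)
{
    for (unsigned i = 0; i < TLB_SIZE; i++)
        tlb[i].vpage = TLB_INVALID;
}

/* Evict the translation for a single page, if it is cached. */
static void tlb_invalidate_page(uint32_t vaddr)
{
    tlb_entry_t *e = tlb_slot(vaddr);
    if (e->vpage == (vaddr & ~PAGE_MASK))
        e->vpage = TLB_INVALID;
}

/* Translate vaddr, calling the page-table walker only on a miss. */
static uint32_t tlb_translate(uint32_t vaddr, page_walk_fn walk)
{
    tlb_entry_t *e = tlb_slot(vaddr);
    uint32_t vpage = vaddr & ~PAGE_MASK;

    if (e->vpage != vpage) {                  /* miss: walk and refill the slot */
        e->ppage = walk(vaddr) & ~PAGE_MASK;
        e->vpage = vpage;
    }
    return e->ppage | (vaddr & PAGE_MASK);    /* hit path: no memory fetch */
}

/* Toy walker for experimenting: maps every page 1 MiB higher and counts calls. */
static unsigned walk_count;

static uint32_t demo_walk(uint32_t vaddr)
{
    walk_count++;
    return vaddr + 0x00100000u;
}
```

After a tlb_flush(), two tlb_translate() calls to addresses in the same page should only reach the walker once. The eviction hooks would be called wherever the guest changes its mappings; on an x86 guest that would mean a full flush on CR3 reloads and a single-page invalidate on INVLPG. The MAME-style variant is the same idea with the table made large enough to index directly by virtual page number, so no tag compare is needed.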
This might improve performance, since most memory accesses would no longer need a fetch from the page tables.
Maybe this TLB stuff should be looked into. I have no idea whether this observation is right, because my knowledge of the subject is not that great, but she is an experienced developer and has extensive knowledge of this kind of thing.