Jato - Inline Caching [GSoC 11]: 2011

Monday, 15 August 2011

A plan for method inlining in Jato

As a first step, we implement method inlining only for static method calls. This ensures that we know the method body to be inlined (as opposed to speculating for virtual calls).

High-level Algorithm
- Detect invokestatic instructions (STMT_INVOKE)
- For each argument
- replace pushing the argument with (STMT_STORE temp = arg_expr)
(Store the mapping between arguments and temporaries)
- Inline method body with references to local_0 to local_(nargs-1) replaced by references to the corresponding temporaries. Replace local variables with temporaries in a similar fashion.
- Convert return in method body to the following sequence of instructions
- ret_temp = return_expr; goto inlined_code_end (For STMT_RETURN)
- goto inlined_code_end (For STMT_VOID_RETURN)

Monday, 8 August 2011

Method inlining in Cacao VM

In Cacao, Method inlining is done at IR level. First we have to select the "root" method to be inlined. This method contains callsites which could be inlined for performance. Once the "root" method is selected, we have to decide which call-sites could be inlined. This includes all static calls and all monomorphic virtual calls. Then we have to create an inlining plan which determines whether we should inline callsites in inlined method etc.

inline_inline() is the main driver method for inlining. inline_make_inlining_plan() analyses code and stores candidate callsites which can be inlined. This method depends on the inlining stratgy being used (Breadth First, Depth First, or knapsack). inline_transform() uses the inlining plan and does the actual inlining of IR instructions. We'll see the details of actual inlining in a later post.

Ref:
[1] "Adaptive optimization and On-stack replacement in CACAO VM" - Steiner, Krall, and Thalinger
[2] CACAO source code

Profiling results for inline caching

Read this

Saturday, 18 June 2011

Megamorphic

Finished implementing megamorphic ICs for x86_32. The performance improvement seems to be minimal (performs worse for some benchmarks). Planning to add a stats collector for ICs to see what's going on.

Things to lookout for:-

# of ICed callsites with 1 or 2 calls - Large number implies we are spending too much time on ICing callsites which don't gain anything from IC.

# of callsites with a miss after 1 or 2 hits - Megamorphic callsites. This number should not be too high.

Tuesday, 14 June 2011

[PATCH] x86: Inline cache with clean and monomorphic states for INVOKEVIRTUAL.

goto patch

Saturday, 11 June 2011

Design of IC - emission of callsite

See a discussion regarding IC design here

The discussion is one how to implement the inline cached callsite code emission.

Wednesday, 8 June 2011

Three kinds of classes

Primitive

These are classes which correspond to primitive data types like short. Note that these are different from the classes java.lang.Short and such. These classes are preloaded by Jato and used only by Java Standard Library.

Array

These are classes which correspond to types of references like
int [] ia = new int[10];
There cannot be any subclasses for these classes.

Regular

All other "normal" classes fall into this category.

Monday, 6 June 2011

PATCH: Support for multiple entry points for methods

Support for multiple entry points for methods

The existing code assumes that each method has a unique entry point which can be used as the call target. This assumption is no longer valid when inline caching is enabled. Each virtual method will have two entry points an unverified or normal entry point which is the same as current entry point and a verified entry point which assumes that the function was called from a call-site with inline cache.

The entry points are stored in *_entry_point fields in the struct compilation_unit. For Java methods, these fields are set inside emit_machine_code().

cu->entry_point

This is the normal entry point for a method. For native methods, this points to the native code. For Java methods, this points to the unverified entry point.

cu->ic_entry_point

This points to the verified entry point for Java methods. This field is NULL for methods for which inline caching is not applicable.

cu_entry_point() or cu_ic_entry_point() returns addresses that can be used as call targets. These functions should be favoured compared to buffer_ptr(cu->objcode) for obtaining call targets.

buffer_ptr(cu->objcode) returns the address of the first machine instruction generated for the method. This may or may not be a valid entry point. This should not be used for fetching entry points.

Thursday, 2 June 2011

Plan for inline caching on x86_32

HotSpot passes the first two int/reference args in registers. For non-static methods this means this argument is always available in a register for inline cache check. The number of memory accesses is 1.

Jato passes this argument on stack. So obtaining this->class will require 2 memory accesses. We could use this pointer available at call-site in a register.

The call site code generated looks like this

/* edi contains this pointer */
mov %(edi), %ecx /* This works as null check and also passes this->class to #fn */
mov #imm, %eax
call #fn

The #imm and #fn fields depends on the state of the inline cache.

Clean state

#imm = vm_method *
#fn = inline_cache_setup

Monomorphic state

#imm = expected class
#fn = verified entry point of the method

Megamorphic state

#imm = vtable index of the virtual method
#fn = vtable lookup routine

The inline cache check can be

cmp %eax, %ecx
jne inline_cache_miss_handler

inline_cache_setup implements the transition from clean to monomorphic. inline_cache_miss_handler implements the transition from monomorphic to megamorphic.

Both routines along with the vtable lookup routine can use this->class passed in ecx for vtable lookup.

Saturday, 28 May 2011

Inline Caching in HotSpot VM

The class CompiledIC represents a call-site with an inline cache.

cpu/x86/vm/LIR_Assembler_x86.cpp: emit_call() emits the machine code for a call.

ic_call() generates the native code for a virtual call with inline cache.

Clean state

This is the initial state for a compiled call-site.

The code generated is
mov imm32, eax
call _resolve_virtual_call_C

Initially this imm32 value is a no-op. When the site becomes monomorphic this is set to the expected class.

Monomorphic state

share/vm/runtime/SharedRuntime.cpp:1268
The _resolve_virtual_call_C function gets the receiver (this) and the method handle by inspecting the stack and relevant bytecodes. This code can be found in SharedRuntime:: resolve_sub_helper(). The state of the IC is then made monomorphic by CompiledIC:: set_to_monomorphic(). set_to_monomorphic() patches the mov and call instruction. The imm32 value is a pointer to class handle and the target of the call instruction is the address of the method implementation for that class.

Megamorphic state

When a cache miss occurs the inline cache enters megamorphic state where a full vtable lookup is performed.

The code which checks for a cache miss is generated by check_icache(). This code is present in all non-static methods. This code loads this->class and compares it with value in eax. If they are not equal, it jumps to handle_wrong_method_ic_miss() which calls handle_ic_miss_helper() which performs the vtable lookup, patches the call-site to make it megamorphic and call the actual method.

For a megamorphic call-site, the eax contains method handle and the call target is a vtable lookup routine. The vtable lookup routine can use the "this" argument and the method handle to jump to the correct method.

Notes on Implementation

The remaining discussion sketches the call-site code at various times.

Clean state

mov #method, eax
call setup_ic

Monomorphic state

mov #class, eax
call method

Megamorphic state

mov #method, eax
call vtable_lookup

setup_ic:
Lookup vtable
if method not yet compiled {
Jump to trampoline
/* Leave the call-site untouched */
} else {
Patch the call-site
Jump to the method
}

vtable_lookup:
jump to this->class->vtable->native_ptr[i]

inline_cache_check:
cmp eax, this->class
jne inline_cache_miss

inline_cache_miss:
Patch call-site to megamorphic
Lookup vtable and jump to it

Saturday, 21 May 2011

invokeinterface

invokeinterface bytecode is used to call an interface method. The mechanism used to implement this is slightly different from invokevirtual. We cannot assign a unique virtual_index to an interface method as a class may implement two different interfaces creating a conflict in virtual_index. So virtual_index of an interface method varies from its implementation in one class to another (i.e virtual_index depends on method and class). So how can we find the virtual_index of a method given "this"? This information is stored in a hash table called itable. Each bucket in an itable is a list of itable_entry structures. Each itable_entry has a c_method structure which points to the vm_method structure of the implementation method (the method to be called). So an invokeinterface call boils down to:-

Searching this->itable[method->itable_index] for the required method. The itable_index for a method is the same across classes. The result of this search is c_method->virtual_index.
Call this->vtable[c_method->virtual_index].

The LIR generated for invokeinterface is produced by following code.

/* object class */
select_insn(s, tree, membase_reg_insn(INSN_MOV_MEMBASE_REG,
call_target, offsetof(struct vm_object, class), call_target));

/* itable entry */
select_insn(s, tree, imm_reg_insn(INSN_ADD_IMM_REG,
offsetof(struct vm_class, itable) + method->itable_index * sizeof(void *),
call_target));

/* hidden parameter to the conflict resolution stub */
select_insn(s, tree, imm_reg_insn(INSN_MOV_IMM_REG,
(unsigned long) method, eax));

/* invoke method */
call_insn = reverse_reg_insn(INSN_CALL_REG, call_target);

This call instruction transfers control to itable conflict resolver generated by emit_itable_resolver_stub(). This function emits code to do a binary search on the appropriate list and jump to the correct vtable entry (See emit_itable_bsearch()).

Friday, 20 May 2011

Details of Virtual Calls

A virtual call to a function looks like

invokevirtual class.method()

in bytecode. The class here is the declared type of the reference and not the actual type of the reference. The method to be called depends on the actual type of the reference which can only be resolved at run time.

Jato Internals for a virtual call

The vm maintains a vm_class structure corresponding to each class. Each vm_class structure contains a vtable which is an array of pointers to executable code of methods in that class. The vtable is organized such that the index (into the vtable) of a virtual method is the same in the base class and all derived classes. i.e if name() has index 10 in Fruit class. Its index is 10 in Apple, Orange, or any other class derived from Fruit. This index is maintained in the virtual_index field in vm_method structure. So executing a virtual call involves.

Loading the correct vtable entry using the virtual_index.
Executing a call instruction to that address.

The code for this can be found in arch/x86/insn-selector_32.brg:invokevirtual()

/* object reference */
call_target = state->left->reg1;

/* object class */
select_insn(s, tree, membase_reg_insn(INSN_MOV_MEMBASE_REG, call_target, offsetof(struct vm_object, class), call_target));

/* vtable */
select_insn(s, tree, membase_reg_insn(INSN_MOV_MEMBASE_REG, call_target, offsetof(struct vm_class, vtable), call_target));

/* native ptr */
select_insn(s, tree, imm_reg_insn(INSN_ADD_IMM_REG, method_offset, call_target));

/* invoke method */
call_insn = reverse_reg_insn(INSN_CALL_REG, call_target);

select_safepoint_insn(s, tree, call_insn);

Initially this call goes to the trampoline function jit_magic_trampoline(). It compiles the method body and patches the vtable entry to point directly to the compiled code. This vtable patching can be found in jit/vtable.c:fixup_vtable() [See also emit_trampoline()].

We can trace the output of different phases of Jato by using the option -Xtrace:jit. Below we show the output of different phases for the virtual call.

Fruit f;
.
.
f.name();

HIR

[main] INVOKEVIRTUAL:
[main] target_method: [0xa24d9d0 'jvm/MethodInvokeVirtualTest$Fruit.name()Ljava/lang/String;' (12)]
[main] args_list:
[main] ARG_THIS:
[main] arg_expression: [temporary reference 0xa415a60 (low)]
[main] result: [temporary reference 0xa4172e8 (low)]

The HIR is very "close" to java bytecode. The only information available at this point is a method descriptor.

LIR

[main] [ 1 ] 2: push_reg r11 ; Push this to stack
[main] [ 1 ] 4: mov_membase_reg $0x0(r11), r11 ; r11 <- this->class
[main] [ 1 ] 6: mov_membase_reg $0x7c(r11), r11 ; r11 <- this->class->vtable
[main] [ 1 ] 8: add_imm_reg $0x30, r11 ; r11 <- &this->class->vtable[method_index]
[main] [ 1 ] 10: test_imm_memdisp $0x0, ($0xa0ae000)
[main] [ 1 ] 12: call_reg (r11) ; jump to the virtual method

The LIR is very close to machine code. The actual machine registers used are determined only during code generation. The LIR encodes the logic to lookup the vtable entry and jump to it. The number $0x30 depends on the virtual method to be called (the virtual_index).

Machine Code

[main] [ 1 ] 0xa775226c: 57 push %edi
[main] [ 1 ] 0xa775226d: 8b 3f mov (%edi),%edi
[main] [ 1 ] 0xa775226f: 8b 7f 7c mov 0x7c(%edi),%edi
[main] [ 1 ] 0xa7752272: 81 c7 30 00 00 00 add $0x30,%edi
[main] [ 1 ] 0xa7752278: f6 04 25 00 e0 0a 0a 00 testb $0x0,0xa0ae000(,%eiz,1)
[main] [ 1 ] 0xa7752280: ff 17 call *(%edi)

This code looks very similar to LIR except that the actual machine registers to be used have been determined.

Saturday, 30 April 2011

Jato is a JIT-only JVM. This means that every piece of code executed is converted to native code before executing (i.e no interpretation is done). In the following discussion, we add dummy line numbers to code to make the discussion easier.
Concepts

We look at the how JIT-compiling is done in Jato. We compile a method only when it is called for the first time. When a method foo is called for the first time, we compile it into a call to a trampoline function.

10: a.foo()

1000: call trampoline

2000: trampoline:
2001: foobin = compile(foo)
2002: modify call trampoline at 1000 to call foobin

3000: foobin:
code for foo() generated by the trampoline

The trampoline function compiles the method body and jumps to the compiled method. The trampoline also backpatches the call site to point to the native code of the method. Note that one trampoline works for all functions. The method to be compiled and address of the call site are "parameters" to the trampoline function.

Implementation

The trampoline function is jit/trampoline.c: jit_magic_trampoline(). The function arch/x86/fixup.c: fixup_direct_calls() method implements call site back-patching once the compilation is completed.