Windows Instrumentation Callbacks – Part 2

November 12, 2025

Windows Instrumentation Callbacks – Hooks, Part 2

Introduction

This multi-part blog series will be discussing an undocumented feature of Windows: instrumentation callbacks (ICs).

If you have not yet read the first part of this series, we strongly recommend you read it to find out what ICs are and how to set them.

In this blog post you will learn how to do patchless hooking using ICs without registering or executing any user mode exception handlers.

Disclaimer

  • This series is aimed towards readers familiar with x86_64 assembly, computer concepts such as the stack and Windows internals. Not every term will be explained in this series.
  • This series is aimed at x64 programs on Windows 10 and 11. Neither older Windows versions nor WoW64 processes will be discussed.

Recap

In the first blog post we learned how to install an IC on a process and how to use that callback to interact with specific syscalls. We did this by intercepting the syscall that OpenProcess makes inside the subfunction NtOpenProcess: after intercepting it, we closed the handle that was opened and spoofed a return value of STATUS_ACCESS_DENIED. An IC therefore gives us a callback on every syscall that returns to user mode. However, it does not by itself allow hooking arbitrary code. Also consider this: a program calls NtSetInformationProcess to set its own IC after you have already set an IC. Which IC do you think is called? Your original IC or the new IC passed in NtSetInformationProcess? Give it a try.

Hooking

If you are reading this article, there’s a good chance you know what patchless hooking is. If you don’t, we will explain the patchless part; however, you are assumed to know what hooking in general refers to.

There are many hooking techniques, but they are either patchless or require a patch. Regular inline hooks work by patching the executable memory/the binary to redirect execution to the code of the installed hook. Assuming a person wants to hook a binary file on disk, and changes (aka patches) the binary’s bytes, the signature of the binary is changed, as the binary no longer contains the same bytes.

Patchless hooking

As you might’ve guessed, patchless hooking techniques are techniques that do not require a patch. This means, none of the bytes in the executable memory region that is to be hooked are changed, so the signature of that memory region stays the same, meaning the hook can’t be detected by signature scans.

The most common patchless hooking techniques in Windows user mode are probably vectored exception handler (VEH) hooking and page guard hooking. Both these techniques utilize a core concept of Windows and operating systems in general: exceptions.

Page guard hooking works by setting the PAGE_GUARD memory page protection modifier on a certain memory page. Once that memory page is accessed, the system raises an exception that can be handled by an exception handler.

VEH hooking also requires setting up an exception handler, but instead of page guards, hardware breakpoints are used to trigger the exceptions.

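Unlike software breakpoints, which you get by, for example, adding a __debugbreak() to your C/C++ code (it places a breakpoint instruction in the code itself), hardware breakpoints are raised by the CPU and do not require any changes to the code bytes.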

Hardware breakpoints can be set with specific registers in x86_64 CPUs:

  • Dr0-3: These four registers each hold the address at which a breakpoint should trigger.
  • Dr6: This is the status register that contains information about which breakpoint fired during exception handling.
  • Dr7: This is the control register that, using bit flags, controls which debug registers are active and what type of breakpoint is used: read/write/execute.

Exceptions and vectored exception handling

In short, VEHs allow developers to register their own exception handler. For this, Microsoft provides the function AddVectoredExceptionHandler. Let’s look at the function definition:

PVOID AddVectoredExceptionHandler(
   ULONG                       First,
   PVECTORED_EXCEPTION_HANDLER Handler
);

The function takes a pointer to an exception handler function and a ULONG parameter. Internally, Windows stores the pointers to all the exception handlers in a linked list. If the ULONG parameter, i.e. the parameter called First, is not zero, the exception handler will be added to the start of the linked list instead of the end.

The Handler parameter takes a function pointer to the exception handler that should be added. The function should look as follows according to MSDN:

LONG PvectoredExceptionHandler(
[in] _EXCEPTION_POINTERS *ExceptionInfo
)

The function should take a pointer to an EXCEPTION_POINTERS structure as that will hold the information about the exception which occurred. Most importantly, it will hold a CONTEXT structure of when the exception occurred. The CONTEXT structure holds processor-specific register data such as the member Rip containing the value the CPU register rip had when the exception occurred.

According to documentation, the exception handler should either return EXCEPTION_CONTINUE_EXECUTION (-1) or EXCEPTION_CONTINUE_SEARCH (0). This is used by Windows to decide whether the exception was handled or if the executed exception handler could not/did not want to handle the exception.

The process goes as follows: when an exception is thrown, execution switches to kernel mode, where the kernel captures the information about the thrown exception (an EXCEPTION_RECORD and a CONTEXT structure). The kernel then returns to user mode, where one VEH after another is executed until one of them responds with EXCEPTION_CONTINUE_EXECUTION. If no VEH handles the exception, the frame-based structured exception handlers (SEH) get their turn; if the exception is still unhandled after that, the process terminates.

The exception handling works based on a first-come, first-served principle: if a VEH in the linked list responds with EXCEPTION_CONTINUE_EXECUTION, the VEHs contained in the linked list after the executed VEH will no longer be executed.
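To make the ordering rules concrete, here is a minimal, portable model of the handler list. It is only a sketch: veh_chain_model, run_example_chain and the two constants are hypothetical names of ours, and the class merely imitates the documented behaviour (a non-zero First inserts at the front, dispatch stops at the first handler that returns EXCEPTION_CONTINUE_EXECUTION). It does not touch the real list inside ntdll.

```cpp
#include <functional>
#include <list>
#include <string>
#include <utility>
#include <vector>

// Values mirror the documented return codes.
constexpr long kContinueExecution = -1; // EXCEPTION_CONTINUE_EXECUTION
constexpr long kContinueSearch = 0;     // EXCEPTION_CONTINUE_SEARCH

// Hypothetical stand-in for the handler list that Windows keeps internally.
class veh_chain_model {
public:
    using handler_t = std::function<long(std::vector<std::string>& log)>;

    // Mirrors AddVectoredExceptionHandler: a non-zero `first` inserts at the front.
    void add(unsigned long first, handler_t handler) {
        if (first != 0)
            handlers_.push_front(std::move(handler));
        else
            handlers_.push_back(std::move(handler));
    }

    // Walks the list in order and stops at the first handler that reports
    // the exception as handled. Returns true if any handler did so.
    bool dispatch(std::vector<std::string>& log) const {
        for (const auto& handler : handlers_) {
            if (handler(log) == kContinueExecution)
                return true;
        }
        return false;
    }

private:
    std::list<handler_t> handlers_;
};

// Registers four handlers and returns the names of those that actually ran.
inline std::vector<std::string> run_example_chain() {
    veh_chain_model chain;
    chain.add(0, [](auto& log) { log.push_back("A"); return kContinueSearch; });
    chain.add(0, [](auto& log) { log.push_back("B"); return kContinueExecution; });
    chain.add(0, [](auto& log) { log.push_back("D"); return kContinueExecution; });
    chain.add(1, [](auto& log) { log.push_back("C"); return kContinueSearch; });

    std::vector<std::string> log;
    chain.dispatch(log);
    return log;
}
```

In this model, C runs first because it was added with a non-zero First, A and C pass the exception on, B handles it, and D is never reached. This is exactly the property that makes the position in the list matter for a hook.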

There are ways to avoid calling AddVectoredExceptionHandler to register a VEH, for example by manually locating and manipulating said linked list. However, the same problems and IoCs remain:

  • Our own VEH needs to be part of the linked list.
  • All VEHs before our own VEH in the linked list are executed and can handle the exception first.

Wouldn’t it be nice if we could handle exceptions without adding our exception handler to the linked list while also guaranteeing that our exception handler is executed before any other exception handlers? Or without even calling the other exception handlers at all?

If you were a careful reader of the first part of the series, you might’ve already concluded where this is going: if an exception is a user-mode-to-kernel context switch, which then returns to user mode, can we intercept the return to user mode with our IC?

How convenient that we also created a PoC to log syscall names in the first part. Why don’t we just try using that PoC to see if something shows up when an exception is thrown?

KiUserExceptionDispatch

When an exception is thrown, the KiUserExceptionDispatch function from ntdll is called. As the kernel returns here, we’re guessing that this function most likely calls the registered exception handlers somewhere down the road. Let’s check this theory by opening ntdll!KiUserExceptionDispatch in a decompiler. Luckily, figuring out what the function does is simple because of function names provided by Microsoft:

+0x00    void KiUserExceptionDispatch() __noreturn
+0x00    {
+0x00        int64_t Wow64PrepareForException_1 = Wow64PrepareForException;
+0x0b        void arg_4f0;
+0x0b      
+0x0b        if (Wow64PrepareForException_1)
+0x1a            Wow64PrepareForException_1(&arg_4f0, &__return_addr);
+0x1a      
+0x29        char rax;
+0x29        int64_t r8;
+0x29        rax = RtlDispatchException(&arg_4f0, &__return_addr);
+0x30        int32_t rax_1;
+0x30      
+0x30        if (!rax)
+0x30        {
+0x4b            r8 = 0;
+0x4e            rax_1 = NtRaiseException();
+0x30        }
+0x30        else
+0x37            rax_1 = RtlGuardRestoreContext(&__return_addr, nullptr);
+0x37      
+0x55        RtlRaiseStatus(rax_1);
+0x55        /* no return */
+0x00    }

We can ignore the Wow64 functions because we are only focussing on ICs in non-Wow64 processes as mentioned in the disclaimer.

The code after the Wow64 functions looks interesting; RtlDispatchException is called with two parameters. The parameter names were auto-generated by BinaryNinja.

If we look at the disassembly of the function, we can see that both parameters used for calling RtlDispatchException are taken from the stack. This is also why the second parameter was named as __return_addr by BinaryNinja, as the address is on top of the stack, which is normally the return address.

Further down the decompiled snippet, we see a call to RtlGuardRestoreContext. This function does not have documentation on MSDN; however, RtlRestoreContext does. If we peek into RtlGuardRestoreContext with a disassembler/decompiler, we can see it’s just a wrapper around RtlRestoreContext with some sanity checks. Looking at the documentation, we can see that RtlRestoreContext takes a pointer to a CONTEXT structure and an optional second pointer to a _EXCEPTION_RECORD struct. So, the parameter named __return_addr by BinaryNinja is a pointer to the CONTEXT structure of the exception.

Theoretically, this would already suffice to do some basic hooks, but let’s get access to the other member of the EXCEPTION_POINTERS structure: EXCEPTION_RECORD. If __return_addr is the CONTEXT structure, the first argument is the EXCEPTION_RECORD structure, as that is also retrieved from the stack that was set up by the kernel for the user mode exception handling.

Let’s not overcomplicate things with further static analysis; instead, we can write a program that uses VEH and attach a debugger to it. For this, I’ll use the following program that registers a VEH and then performs a null pointer dereference to cause an exception:

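#include <Windows.h>

// VEH that declines to handle the exception, so the search continues
long exception_handler(EXCEPTION_POINTERS* exception_info) {
   return EXCEPTION_CONTINUE_SEARCH;
}

int main()
{
   // First = 1: insert the handler at the start of the VEH list
   AddVectoredExceptionHandler(1, &exception_handler);
   // Write through a null pointer to cause an access violation
   bool* test = nullptr;
   *test = true;
   return 0;
}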

Following the compilation, the program was opened in the debugger WinDbg.

First, breakpoints on both the exception handler and the call to RtlDispatchException inside the function KiUserExceptionDispatch were set, as RtlDispatchException takes the pointer to the CONTEXT structure and another parameter, which might be a pointer to the EXCEPTION_RECORD structure.

0:000> bp ntdll!KiUserExceptionDispatch+0x29
0:000> bp exception_handler

After resuming execution, the breakpoint in KiUserExceptionDispatch is hit first, as expected. After the breakpoint is triggered, we read out rcx and rdx, because according to the Windows x64 calling convention, these registers hold the first and second function parameters.

Breakpoint 0 hit
ntdll!KiUserExceptionDispatch+0x29:
00007ffe`2f571439 e8d20efbff      call    ntdll!RtlDispatchException (00007ffe`2f522310)
0:000> r rcx
rcx=0000003d38affa30
0:000> r rdx
rdx=0000003d38aff540

Now, we need to cross-reference these values with the values of the EXCEPTION_POINTERS structure that is passed to the exception handler. This can easily be done with a handy feature of WinDbg: the display type command (dt).

0:000> g
Breakpoint 1 hit
veh_hooking_test!exception_handler:
00007ff7`30c41000 50              push    rax
0:000> dt EXCEPTION_POINTERS @rcx
veh_hooking_test!EXCEPTION_POINTERS
   +0x000 ExceptionRecord  : 0x0000003d`38affa30 _EXCEPTION_RECORD
   +0x008 ContextRecord    : 0x0000003d`38aff540 _CONTEXT

As you can see, our assumption was correct: the parameters passed to RtlDispatchException are the EXCEPTION_RECORD and CONTEXT structure. As you can also see, KiUserExceptionDispatch calls RtlGuardRestoreContext on the CONTEXT structure after RtlDispatchException was executed.

RtlRestoreContext, the function internally called by RtlGuardRestoreContext, sets the registers of the current thread to the values in the CONTEXT struct passed to that function. This means rip, the instruction pointer, is also overwritten, so code after the call to RtlRestoreContext is never executed. This also means that the C++ function (named instrumentation_callback in the previous blog post) won’t return to your assembly bridge to execute everything after the C++ function call. The IC flag will thus never be reset.

IC exception handling

We now know how we can get access to the EXCEPTION_RECORD and CONTEXT structures and know how KiUserExceptionDispatch resumes execution – with RtlGuardRestoreContext.

All we now need to do is get our IC to intercept KiUserExceptionDispatch, retrieve the EXCEPTION_RECORD and CONTEXT off the stack and resume execution if we want to handle the exception.

We will reuse the same assembly bridge as in the first part of this blog series.

For now, let’s not add hooking but instead create a regular exception handler that continues execution after an access violation. For this, a modified version of the code snippet previously used for debugging will be used. The following snippet adds a regular exception handler that returns EXCEPTION_CONTINUE_EXECUTION, which means that the exception was handled, and that the execution of the program can continue:

#include <Windows.h>
#include <print>

long exception_handler(EXCEPTION_POINTERS* exception_info) {
   // Skip the 3-byte faulting instruction (hardcoded for this test build)
   exception_info->ContextRecord->Rip += 3;
   return EXCEPTION_CONTINUE_EXECUTION;
}

int main()
{
   AddVectoredExceptionHandler(1, &exception_handler);
   // Write through a null pointer to cause an access violation
   bool* test = nullptr;
   *test = true;
   std::println("Access violation skipped");
   return 0;
}

You might wonder why we are adding a hardcoded value of 3 to the value of rip that is saved in the CONTEXT record. This skips the instruction that causes the access violation at the line *test = true: in our build, it is compiled to the bytes c60001 (mov byte ptr [rax], 1), so 3 bytes need to be skipped to prevent the exception from being triggered again once execution continues.

In non-test code you would not want to do this, as a different compiler or the same compiler with different settings could also produce other instructions to perform the same logic. Normally, you would want to use a disassembler such as Zydis to disassemble the instruction rip points to, to dynamically calculate the length of the instruction. We decided against this to keep the snippet code as minimal as possible.

Let’s now remove the AddVectoredExceptionHandler line and try to replace it with an IC.

First, register an IC using the same logic/code as in the first part of this series. In this part, we will only cover changes to the instrumentation_callback function, as the rest remains the same as in the first blog post.

The following IC can be used to execute the same exception handler that would’ve been called if you added it with AddVectoredExceptionHandler. The code for the function is simple; if you’ve understood the blog posts so far you shouldn’t have a problem understanding it. The only part that was not covered was the offset of 0x4f0 from rsp to get the EXCEPTION_RECORD*. This comes from KiUserExceptionDispatch. We only showed the decompiled version of the code, which of course does not contain the stack offsets. If you disassembled that function and looked at the function call to RtlDispatchException, you would see the 0x4f0 offset.

You might also notice that we are using KiUserExceptionDispatcher instead of KiUserExceptionDispatch with GetProcAddress. That is because the function is exported as KiUserExceptionDispatcher.

extern "C" uint64_t instrumentation_callback(uint64_t original_rsp, uint64_t return_addr, uint64_t return_val) {
   // Resolve KiUserExceptionDispatcher once and cache the address
   static uint64_t user_exception_addr = 0;
   if (!user_exception_addr) {
      // The A variant is used explicitly so the narrow string also compiles in UNICODE builds
      user_exception_addr = reinterpret_cast<uint64_t>(GetProcAddress(GetModuleHandleA("ntdll.dll"), "KiUserExceptionDispatcher"));
   }
   // Only kernel returns into the exception dispatcher are of interest
   if (return_addr != user_exception_addr)
      return return_val;
   // The CONTEXT is at rsp, the EXCEPTION_RECORD at rsp + 0x4f0
   EXCEPTION_POINTERS exception_pointers = {};
   exception_pointers.ContextRecord = reinterpret_cast<CONTEXT*>(original_rsp);
   exception_pointers.ExceptionRecord = reinterpret_cast<EXCEPTION_RECORD*>(original_rsp + 0x4f0);
   auto exception_status = exception_handler(&exception_pointers);
   // Not handled by us: let KiUserExceptionDispatcher run as usual
   if (exception_status == EXCEPTION_CONTINUE_SEARCH)
      return return_val;
   RtlRestoreContext(exception_pointers.ContextRecord, nullptr);
   // This will never be reached if RtlRestoreContext executes successfully
   return return_val;
}

With this code, the Windows exception handlers are never executed if our own exception handler returns EXCEPTION_CONTINUE_EXECUTION, as the code restores the context before the regular exception handlers are even called.

Hooking with ICs

Skipping access violations is cool, but it’s not useful compared to what else we can do with an exception handler. So, let’s return to the main topic of this blog post: how to hook code with ICs. For this, we will create an imaginary scenario: we have an installed IC and want to hinder someone else from overwriting/removing our IC. This will only work within the same process context because ICs are process-local – a different process can overwrite the IC remotely if it has the necessary privilege (SeDebugPrivilege).

We’ve touched on hardware breakpoints and debug registers before, but we haven’t set any. We mentioned that hardware breakpoints are set via CPU registers – the debug registers. This means they are thread-specific: they will only trigger on the specific thread for which they were set. To set the breakpoints for the entire process, the hardware breakpoints need to be set for all threads, and you also need to account for threads that are created afterwards.

Setting hardware breakpoints

To use hardware breakpoints, we first need to set the debug registers accordingly.

For this purpose, we created a function with the following function definition:

bool set_hwbp(debug_register_t reg, void* hook_addr, bp_type_t type, uint8_t len)

The definitions for the two custom enums debug_register_t and bp_type_t look as follows:

enum class debug_register_t {
   Dr0 = 0,
   Dr1,
   Dr2,
   Dr3
};
enum class bp_type_t {
   Execute = 0b00,
   Write = 0b01,
   ReadWrite = 0b11
};

These are not mandatory; however, we use them to make our intentions clearer instead of directly requiring numbers or bit literals to be passed. As mentioned before, there are four debug registers that can contain the address of a breakpoint. Each of these debug registers has separate options that can be set. This allows execution, write, and read/write breakpoints, matching the three values of bp_type_t.

Now Dr7, the control register, needs to be set accordingly.

OSDev wiki has a table explaining the structure of Dr7:

Figure 1: https://wiki.osdev.org/CPU_Registers_x86#Debug_Registers

For each hardware breakpoint we want to set, we need to do four things:

  1. Set Dr0/1/2/3 to the address.
  2. Enable the corresponding local breakpoint for the passed debug_register_t (bit 0, 2, 4 or 6 of Dr7).
  3. Set the correct condition based on the passed bp_type_t.
  4. Set the correct size for the breakpoint. For execute breakpoints, it always needs to be 0.

Step 1 can be done using the following code:

bool set_hwbp(debug_register_t reg, void* hook_addr, bp_type_t type, uint8_t len) {
   CONTEXT context = { .ContextFlags = CONTEXT_DEBUG_REGISTERS };
   if (!GetThreadContext(GetCurrentThread(), &context))
      return false;
   if (reg == debug_register_t::Dr0)
      context.Dr0 = reinterpret_cast<DWORD64>(hook_addr);
   else if (reg == debug_register_t::Dr1)
      context.Dr1 = reinterpret_cast<DWORD64>(hook_addr);
   else if (reg == debug_register_t::Dr2)
      context.Dr2 = reinterpret_cast<DWORD64>(hook_addr);
   else
      context.Dr3 = reinterpret_cast<DWORD64>(hook_addr);

As the debug registers can’t be directly modified from user mode, we need to use the corresponding Windows APIs (GetThreadContext and SetThreadContext). We then set Dr0/1/2/3 to the hook address.

The steps afterwards become a bit more complicated due to bitwise operations being needed. Additionally, the corresponding bit positions need to be calculated in Dr7.

For brevity’s sake, we added comments to the specific passages instead of explaining it via text:

[…]
   // Converts enum type to its underlying type to use it for calculations
   auto reg_index = std::to_underlying(reg);
   // Enables local breakpoint (bit position 0/2/4/6)
   context.Dr7 |= 1ULL << (reg_index * 2);
   // Clear and set condition (execute/write/read and write)
   context.Dr7 &= ~(0b11ULL << (16 + reg_index * 4));
   context.Dr7 |= static_cast<DWORD64>(std::to_underlying(type)) << (16 + reg_index * 4);
   // Execution breakpoints always require the length to be 0
   if (type == bp_type_t::Execute)
      len = 0;
   // Clear and set length; len is the raw 2-bit encoding
   // (0b00 = 1 byte, 0b01 = 2 bytes, 0b11 = 4 bytes, 0b10 = 8 bytes)
   context.Dr7 &= ~(0b11ULL << (18 + reg_index * 4));
   // Mask and widen before shifting: shifting a plain int by up to 30 bits
   // could otherwise overflow and sign-extend into the upper half of Dr7
   context.Dr7 |= static_cast<DWORD64>(len & 0b11) << (18 + reg_index * 4);
   return SetThreadContext(GetCurrentThread(), &context);
}
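If you want to check the Dr7 bit arithmetic without touching a live thread context, the calculation can be isolated into a pure function. The following is a sketch under our own assumptions: dr7_with_breakpoint is a hypothetical helper that is not part of the code above, and it takes the raw 2-bit condition and length encodings instead of the enums so that it compiles without any Windows headers.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: returns `dr7` with the breakpoint slot `index` (0-3)
// enabled and configured. `condition` and `len` are the raw 2-bit encodings
// (condition: 0b00 execute, 0b01 write, 0b11 read/write).
constexpr std::uint64_t dr7_with_breakpoint(std::uint64_t dr7, unsigned index,
                                            unsigned condition, unsigned len) {
    assert(index < 4);

    const unsigned enable_bit = index * 2;           // local enable bit 0/2/4/6
    const unsigned condition_shift = 16 + index * 4; // 2 condition bits per slot
    const unsigned length_shift = 18 + index * 4;    // 2 length bits per slot

    // Execute breakpoints require a length encoding of 0.
    if (condition == 0b00)
        len = 0;

    // Enable the slot, then clear and set its condition and length fields.
    dr7 |= std::uint64_t{1} << enable_bit;
    dr7 &= ~(std::uint64_t{0b11} << condition_shift);
    dr7 |= static_cast<std::uint64_t>(condition & 0b11) << condition_shift;
    dr7 &= ~(std::uint64_t{0b11} << length_shift);
    dr7 |= static_cast<std::uint64_t>(len & 0b11) << length_shift;
    return dr7;
}
```

For example, an execute breakpoint in slot 0 only sets bit 0, while a read/write breakpoint in slot 3 with length encoding 0b11 sets bit 6 and the four bits 28 to 31. Keeping the calculation separate from GetThreadContext/SetThreadContext also makes it easy to unit test.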

Now we’ve got everything set up to install a hardware breakpoint. The following snippet can be added to your main function to install a breakpoint on function calls to NtSetInformationProcess:

set_hwbp(debug_register_t::Dr0, nt_set_info_proc, bp_type_t::Execute, 0);

This should crash your program if you call the specified function and have no exception handler that handles the exception.

Modifying the exception handler

Now we only need to make the exception handler handle the exception caused by the hardware breakpoint. For this, we don’t need to touch the IC as it already correctly calls the exception handler; instead, we need to modify the function exception_handler.

First, we need to detect if the exception was caused by one of the debug registers. For execution breakpoints, this can easily be done by comparing rip with the breakpoint address; however, we also want compatibility with write and read/write breakpoints. For these types, rip does not match the address stored in the debug register; instead, it points into the code that accessed that address. Instead of checking rip, we can use Dr6: the debug status register. When a hardware breakpoint fires, one of the bits 0-3 is set according to which debug register triggered it. For example, when Dr2 fires, bit 2 is set.

The debug registers are luckily included in the ContextRecord member of the EXCEPTION_POINTERS structure passed to VEH handlers. This means, we don’t need to call GetThreadContext again to retrieve it.

Here is an example of how to check which debug register fired:

long exception_handler(EXCEPTION_POINTERS* exception_info) {
   if (exception_info->ContextRecord->Dr6 & 1)
      std::println("Dr0 fired");
   else if (exception_info->ContextRecord->Dr6 & 2)
      std::println("Dr1 fired");
   else if (exception_info->ContextRecord->Dr6 & 4)
      std::println("Dr2 fired");
   else if (exception_info->ContextRecord->Dr6 & 8)
      std::println("Dr3 fired");
[…]

Before implementing the actual logic that prevents someone from overwriting an IC, we need to fix the error you’ve most likely run into if you tried testing that code: the exception keeps firing until the program eventually crashes.

The solution for this is the resume flag; this is a bit in the RFLAGS register. The explanation for this bit can be found in the AMD manual: “[…] The RF bit, when set to 1, temporarily disables instruction breakpoint reporting to prevent repeated debug exceptions (#DB) from occurring. […]”. So, all we need to do is set the resume flag, which is at bit 16 of the RFLAGS register. In user mode, only EFLAGS, i.e. the lower 32 bits of the RFLAGS register, are accessible. The resume flag can be set as follows, with EFLAGS being used instead of RFLAGS because of the aforementioned reasons:

exception_info->ContextRecord->EFlags |= 1 << 16;

After adding that, the code can continue execution even after a hardware breakpoint was triggered.
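The two checks discussed so far, reading Dr6 and setting the resume flag, can be tried out in isolation. The sketch below is portable on purpose: mock_debug_context is a hypothetical, reduced stand-in for the two CONTEXT members involved, and fired_debug_register and prepare_resume are helper names we made up for illustration.

```cpp
#include <cstdint>
#include <optional>

// Hypothetical, reduced stand-in for the CONTEXT members used here.
struct mock_debug_context {
    std::uint64_t Dr6 = 0;
    std::uint32_t EFlags = 0;
};

// Resume flag (RF), bit 16 of EFLAGS.
constexpr std::uint32_t resume_flag = 1u << 16;

// Returns the index (0-3) of the lowest debug register reported in Dr6,
// or std::nullopt if none of the four status bits is set.
inline std::optional<unsigned> fired_debug_register(std::uint64_t dr6) {
    for (unsigned index = 0; index < 4; ++index) {
        if (dr6 & (std::uint64_t{1} << index))
            return index;
    }
    return std::nullopt;
}

// Sets RF if the exception came from a hardware breakpoint, so that the
// instruction at the breakpoint address does not immediately fire again.
// Returns false for exceptions that were not caused by Dr0-Dr3.
inline bool prepare_resume(mock_debug_context& context) {
    if (!fired_debug_register(context.Dr6))
        return false;
    context.EFlags |= resume_flag;
    return true;
}
```

In a real handler, the same logic would operate on exception_info->ContextRecord. Returning false for unrelated exceptions maps naturally to EXCEPTION_CONTINUE_SEARCH.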

Forbidding IC registration

We’ve covered everything that’s needed to prevent someone from registering a new IC. The following exception handler only handles a hardware breakpoint set in Dr0. Then, NtSetInformationProcess-specific actions are performed: first, we check whether 0x28, the information class value required to install an IC, was passed to the function or whether NtSetInformationProcess is supposed to do something other than registering an IC. If a new IC was about to be installed, its address is read out and printed. Afterwards, rax, the register that holds the return value, is set to 0 (STATUS_SUCCESS) to make the function call look successful. We then set rip to the address of a ret instruction, so the body of NtSetInformationProcess isn’t executed. You could also manually set up the return, meaning manually adjusting the stack and loading the return address into rip.

long exception_handler(EXCEPTION_POINTERS* exception_info) {
   if (!(exception_info->ContextRecord->Dr6 & 1))
      return EXCEPTION_CONTINUE_SEARCH;
   exception_info->ContextRecord->EFlags |= 1 << 16;
   // Does the call even want to overwrite the IC?
   if (exception_info->ContextRecord->Rdx != 0x28)
      return EXCEPTION_CONTINUE_EXECUTION;
   const auto instrumentation_info = reinterpret_cast<PROCESS_INSTRUMENTATION_CALLBACK_INFORMATION*>(exception_info->ContextRecord->R8);
   std::println("Following IC was going to get set: {}", instrumentation_info->Callback);
   // Success
   exception_info->ContextRecord->Rax = 0;
   exception_info->ContextRecord->Rip = reinterpret_cast<DWORD64>(ret_operation_addr);
   return EXCEPTION_CONTINUE_EXECUTION;
}
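The alternative mentioned above, setting up the return manually, boils down to emulating what a ret instruction would do. Here is a minimal sketch of that idea. It assumes the breakpoint sits on the very first instruction of the hooked function, so that rsp still points at the return address. mock_call_context and emulate_return are hypothetical names, and the struct only mirrors the three CONTEXT members that are touched.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical, reduced stand-in for the CONTEXT members used here.
struct mock_call_context {
    std::uint64_t Rax = 0;
    std::uint64_t Rsp = 0;
    std::uint64_t Rip = 0;
};

// Emulates a `ret` executed at function entry: the return address on top of
// the stack is loaded into rip, rsp is moved past it, and rax receives the
// value the caller should see as the result of the skipped function.
inline void emulate_return(mock_call_context& context, std::uint64_t return_value) {
    assert(context.Rsp != 0);

    std::uint64_t return_address = 0;
    std::memcpy(&return_address,
                reinterpret_cast<const void*>(static_cast<std::uintptr_t>(context.Rsp)),
                sizeof(return_address));

    context.Rax = return_value;
    context.Rip = return_address;
    context.Rsp += sizeof(return_address);
}
```

In the exception handler, this would replace the two assignments to Rax and Rip, for example emulate_return applied to the ContextRecord members with a return value of 0. The upside is that no address of a ret instruction has to be looked up; the downside is that it is only correct as long as the hooked function has not modified the stack yet.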

If you installed your own IC with an exception handler, registered a hardware breakpoint on NtSetInformationProcess and then tried reregistering an IC, you would see prints by your own exception handler, which shows that the IC registration was blocked. You can verify that your IC wasn’t overwritten by trying to register a new IC multiple times: if the prints still show up, this of course means your IC is still active.

Closing words

In this blog you learned how to do very basic hooking with ICs, but this is by no means all you can do with ICs in terms of hooking. The benefit of the chosen design, i.e. your IC calling an exception handler with a set up EXCEPTION_POINTERS structure, is that it is compatible with the regular format of exception handlers required for VEH. Anything you can get to work with VEH you can get to work with the IC implementation of it, with the main benefit being that no other exception handlers are called due to the VEH being entirely skipped.

You could, for example, also hook data reads and writes by changing the hardware breakpoint options. You can also get PAGE_GUARD hooks to work, as they also throw exceptions.

We recommend keeping the restrictions of hardware breakpoints in mind, especially with multi-threaded programs.

Instead of simply blocking NtSetInformationProcess calls that want to register a new IC, you could block the call, remember the callback that was supposed to be set, and call it from within your own IC. The program that tried to register the IC then believes its IC was added successfully, while your IC is still the one that is actually set and you can filter what is passed on to the other IC.

It is also possible to pass through calls to hooked functions from within your hook, but you need to disable the hardware breakpoints or pass through the exceptions to make it work as normal.

A little hint: think about the restrictions of using a flag to enable and disable your IC – what happens if someone sets a hardware breakpoint in your IC?

In the next part of this series, you will learn how you can use ICs to inject shellcode into other processes. After that, in the last part of this series, we will look at ICs from a more theoretical standpoint: what is possible with them, what isn’t and how can programs detect if an IC is set.

Windows Instrumentation Callbacks – Part 1

November 5, 2025

Windows Instrumentation Callbacks Part 1

Introduction

This multi-part blog series will be discussing an undocumented feature of Windows: instrumentation callbacks (ICs).

In the first part of the blog, you will learn how ICs are implemented and how you can use them to log and spoof syscalls without setting any hooks.

In the second part, you will learn how to use ICs for patchless hooking without registering or executing any exception handlers.

Disclaimer

  • This series is aimed towards readers familiar with x86_64 assembly, computer concepts such as the stack and Windows internals. Not every term will be explained in this series.
  • This blog post will teach you how to set ICs on Windows 10 and 11; for older Windows versions, the API for setting an IC is different.
  • This series is aimed at x64 programs. We will not be discussing setting instrumentation callbacks on WoW64 processes, i.e. processes running through the x86 compatibility layer.

Credits

This blog post is based on the research of multiple people, most notably Alex Ionescu and his Hooking Nirvana presentation at Recon 2015. We recommend watching that presentation as he also shows other interesting hooking techniques.

dx9’s blog post about Hyperion (an anti-cheat) and wave (a cheat), which both utilize instrumentation callbacks, was also very informative.

Additionally, we want to thank ph3r0x for telling us about ICs and about the differences in WoW64 processes.

What are instrumentation callbacks?

A callback is a function that is passed to another function which then executes the callback function at a certain event or condition.

Instrumentation refers to the process of modifying a program to allow analysis of it.

In simple terms, an instrumentation callback instruments a program so that the specified callback function is executed on kernel-to-user-mode returns. According to Alex Ionescu, instrumentation callbacks are used by Microsoft in internal tools such as iDNA, which is apparently used for time travel tracing and for TruScan. We cannot confirm that; however, there is a mention of iDNA and TruScan in this Microsoft research paper.

The more thorough explanation of the inner workings of instrumentation callbacks is as follows: ICs are a process-specific user mode callback to system traps, for example syscalls or exceptions like access violations. Once a trap is triggered, a switch to kernel mode occurs to handle the trap. If an IC is set, the kernel will return to the IC instead of the original return point. This means, the IC is the first execution step back in user mode after the trap was executed. The IC is also responsible for continuing the program flow, as otherwise the program would crash or yield. For this purpose, the kernel passes the original return point in a CPU register as we will find out by reversing later.

For visualization, let’s trace the flow of a typical Windows API call. Please note that the kernel part of this diagram is by no means complete; the diagram is meant to show the execution flow with and without an instrumentation callback; it’s not meant to teach you the inner workings of the kernel. If that interests you, we recommend the explanation of the Windows syscall handler by hammertux.

Figure 1: Exemplary OpenProcess call without IC

With an IC set, this flow would look as follows:

Figure 2: Exemplary OpenProcess call with IC

You might be wondering why we are jumping to r10. We will get to that in the next chapter.

example.exe refers to the memory region of that process; the IC does not need to be a part of the original program’s binary; it can be added dynamically at runtime.

Looking at that diagram, it might become more obvious how powerful ICs are. The kernel returns right to our code, before even the ret instruction after the syscall is executed: our IC is the first code to be executed after the kernel returns to user mode. We will discuss what can be done with that later. Let’s first check out how the IC is handled by the kernel.

Reversing

KiSetupForInstrumentationReturn

ntoskrnl.exe includes a function called KiSetupForInstrumentationReturn. Let’s check out what this function does; as one could guess by the name, it has something to do with ICs. 

mov rax, qword [gs:0x188]
mov rdx, qword [rax+0xb8]
mov r8, qword [rdx+0x3d8]
test r8, r8
jne 0x140482a86
retn

Let’s go through this step by step.

Line 1: In kernel mode, the gs segment base points to the Kernel Processor Control Region (KPCR) structure. At an offset of 0x180 of that structure, a member structure called Kernel Processor Control Block (KPRCB) is located. So, by accessing gs:0x188, we access the KPRCB member at an offset of 8 (0x180 + 0x8). This is the CurrentThread member of type KTHREAD*, which is dereferenced. So, after the first operation, the register rax holds the address of the start of the current thread’s KTHREAD structure.

Line 2: This operation loads the base of the current process’s KPROCESS structure into rdx. The offset 0xb8 might not obviously fit the KTHREAD structure definition mentioned before; however, if we disassemble PsGetCurrentProcess, we will see the same operations.

Lines 3-6: At an offset of 0x3d8 of the KPROCESS structure, the InstrumentationCallback member is located, which gets moved into r8 and tested to check if it is null. If it is null, the function returns. As rax still holds the start of the current thread’s KTHREAD structure, this is what the function returns.

The following disassembly gets executed if an IC is set:

cmp word [rcx+0x170], 0x33
jne 0x14036d228
mov rax, qword [rcx+0x168]
mov qword [rcx+0x58], rax
mov qword [rcx+0x168], r8
retn

Now the parameter passed to KiSetupForInstrumentationReturn in rcx is used: it’s the address of the base of the KTRAP_FRAME structure of the trap – you will just have to believe us on that one 😉

Lines 1-2: This check is done to verify that the trap didn’t originate from a WoW64 program by checking the SegCs member of KTRAP_FRAME. For 64-bit programs, it should equal 0x33; for programs executed through the WoW64 compatibility layer, this is most likely 0x23. We’d recommend you check out this blog article by Marcus Hutchins if you are interested in an explanation.

Lines 3-4: KTRAP_FRAME.r10 is set to KTRAP_FRAME.rip. To clarify, the register members of the trap frame hold the values the thread had when the trap occurred in user mode. This means KTRAP_FRAME.rip does not hold a kernel address but one in userland.

Line 5: KTRAP_FRAME.rip is set to KPROCESS.InstrumentationCallback, which was already moved into r8 before.

Now we know that r10 will hold the actual instruction pointer and saw how the IC is implemented. By checking the cross-references to that function, the following functions show up: KiInitializeUserApc, KiDispatchException, KeRaiseUserException, KiRaiseException. Additionally, an unnamed function shows up. This gives us hints to what we can catch with ICs.
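To make the register shuffle easier to follow, here is a minimal C++ model of what the two disassembly listings do. The names TrapFrameModel, ProcessModel and setup_for_instrumentation_return are made up for this sketch and are not kernel definitions; only the offsets in the comments come from the disassembly. What the kernel does on the non-0x33 branch is not covered here, so the sketch simply leaves the frame untouched in that case, which is an assumption.

```cpp
#include <cstdint>

// Simplified stand-ins for the kernel structures. Only the fields touched by
// KiSetupForInstrumentationReturn are modelled.
struct TrapFrameModel {
    std::uint64_t r10;     // KTRAP_FRAME + 0x58
    std::uint64_t rip;     // KTRAP_FRAME + 0x168
    std::uint16_t seg_cs;  // KTRAP_FRAME + 0x170
};

struct ProcessModel {
    std::uint64_t instrumentation_callback;  // KPROCESS + 0x3d8, 0 if no IC is set
};

constexpr std::uint16_t kUserCs64 = 0x33;  // code segment selector of 64-bit user code

// Returns true if the trap frame was redirected to the IC.
inline bool setup_for_instrumentation_return(const ProcessModel& process,
                                             TrapFrameModel& frame) {
    if (process.instrumentation_callback == 0)
        return false;  // test r8, r8: no IC registered
    if (frame.seg_cs != kUserCs64)
        return false;  // assumption: the other branch is not modelled here
    frame.r10 = frame.rip;                         // original return point ends up in r10
    frame.rip = process.instrumentation_callback;  // user mode resumes at the IC
    return true;
}
```

After the call, the model shows the same state the IC observes: r10 holds the address the thread would normally have resumed at, and rip points to the callback.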

We now know we somehow need to set KPROCESS.InstrumentationCallback; however, this is obviously a kernel structure, which we can’t directly set from user mode.

NtSetInformationProcess

Of course there is a function to set KPROCESS.InstrumentationCallback from user mode, as otherwise this blog post would not exist. As mentioned before, we did not reverse ntoskrnl ourselves to find this function; that credit goes to Alex Ionescu.

NtSetInformationProcess is a common syscall that does multiple things; it receives the same parameters as its kernelbase counterpart SetProcessInformation. The second parameter is an enum called ProcessInformationClass that specifies the operation to execute.

With the knowledge of the Nirvana Hooking presentation by Alex Ionescu, finding the relevant code in NtSetInformationProcess is easy. Within the function, a switch case on the second parameter, the ProcessInformationClass enum, is performed. Case 0x28 is what is relevant for us to set an IC.

For brevity, we will not be going through the entirety of the function. If you are interested in looking at it yourself, you can find it in ntoskrnl.exe at NtSetInformationProcess+0x1b42.

Right after validating the passed handle, calls to PsGetCurrentProcess and to SeSinglePrivilegeCheck, with SeDebugPrivilege passed as parameter, are made.

Then, a big if statement (NtSetInformationProcess+0x1c2b) is opened, which checks if the return value of SeSinglePrivilegeCheck is true or if an unknown variable is equal to PsGetCurrentProcess. This suggests that we require the SeDebugPrivilege to set an IC on other processes, but that we don’t need it to set an IC on our own process.

At NtSetInformationProcess+0x1d09, we see a familiar looking offset: 0x3d8. This is the line where our IC gets set.

This logic can be represented by the following shortened pseudo code:

struct PROCESS_INSTRUMENTATION_CALLBACK_INFORMATION {
    ULONG Version;
    ULONG Reserved;
    PVOID Callback;
};

NTSTATUS NtSetInformationProcess(HANDLE ProcessHandle, PROCESSINFOCLASS ProcessInformationClass, PVOID ProcessInformation, [...]) {
    switch (ProcessInformationClass) {
        // [...]
        case 0x28:
            NTSTATUS status = ObReferenceObjectByHandle(ProcessHandle, PROCESS_SET_INFORMATION, PsProcessType, [...]);
            if (status < 0)
                return status;
            KPROCESS current_process = PsGetCurrentProcess();
            bool has_debug_priv = SeSinglePrivilegeCheck(SeDebugPrivilege, KPRCB[0x232]);
            if (!has_debug_priv && requested_process != current_process)
                return STATUS_PRIVILEGE_NOT_HELD;
            if (IsWow64Process(requested_process))
                return STATUS_NOT_SUPPORTED;
            void* ic_address = ProcessInformation.Callback;
            // IC sanity checks
            // [...]
            // KPROCESS structure
            requested_process.InstrumentationCallback = ic_address;
            // [...]
    }
}

Setting up a basic IC

Now that we have partially reversed KiSetupForInstrumentationReturn and NtSetInformationProcess we know the following things:

  • An IC can be set from user mode with NtSetInformationProcess.
    • ProcessInformationClass needs to be set to 0x28.
    • If we want to set an IC on another process, we need to have the SeDebugPrivilege.
  • When the IC is executed, r10 will hold the original rip.

For a successful NtSetInformationProcess call, the following struct needs to be passed as ProcessInformation parameter. We will also need the type definition of NtSetInformationProcess.

struct PROCESS_INSTRUMENTATION_CALLBACK_INFORMATION {
    ULONG Version;
    ULONG Reserved;
    PVOID Callback;
};
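As a quick sanity check of the layout, the struct can be mirrored with fixed-width types. This is only a sketch: ProcessInstrumentationCallbackInformation and make_ic_info are hypothetical names chosen here, and the static_asserts assume a 64-bit build, which is the only target this series covers.

```cpp
#include <cstddef>
#include <cstdint>

// Portable mirror of the struct above, with fixed-width types standing in
// for ULONG and PVOID.
struct ProcessInstrumentationCallbackInformation {
    std::uint32_t version;   // ULONG Version, must be 0
    std::uint32_t reserved;  // ULONG Reserved, must be 0
    void* callback;          // PVOID Callback, address of the assembly bridge
};

// Expected layout on a 64-bit target: two 4-byte fields, then an 8-byte pointer.
static_assert(sizeof(void*) == 8, "this sketch assumes a 64-bit target");
static_assert(offsetof(ProcessInstrumentationCallbackInformation, callback) == 8,
              "Callback is expected at offset 8");
static_assert(sizeof(ProcessInstrumentationCallbackInformation) == 16,
              "the struct is expected to be 16 bytes");

// Hypothetical helper: value-initialises the struct so that version and
// reserved are 0, then stores the bridge address.
inline ProcessInstrumentationCallbackInformation make_ic_info(void (*bridge)()) {
    ProcessInstrumentationCallbackInformation info{};
    info.callback = reinterpret_cast<void*>(bridge);
    return info;
}
```

Value-initialising with {} is the easiest way to make sure the two unused members really are zero before the struct is handed to NtSetInformationProcess.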

Only the Callback member matters to us; the other two need to be set to 0. You can try setting Callback to a function pointer; however, you will not be very successful, as the stack was not set up for a function call. The Callback member should instead point to some assembly code. This assembly code, which we will call the bridge, needs to do the following:

  1. Save the registers
  2. Set up a function call
  3. Restore stack and registers after function call
  4. Jump to r10 as that holds the actual address the code should resume at.

Depending on what you want to use your IC for, you will most likely trigger syscalls from within the IC itself. This would cause an infinite recursion, as the IC would be called again when the syscall is triggered; thus, we will also need an option to disable the IC for the current thread.

Let’s try setting up a very simple IC that will trigger a breakpoint on a kernel-to-user-mode return.

Setting the IC

The following is our exemplary code to set an IC. You will of course need to have a function definition for NtSetInformationProcess.

#include <print>
#include <Windows.h>

// PROCESS_INSTRUMENTATION_CALLBACK_INFORMATION and NtSetInformationProcess_t
// need to be defined as described above.

// Implemented in assembly, see the next chapter.
extern "C" void instrumentation_adapter();

// Called by the assembly bridge on every kernel-to-user-mode return.
extern "C" void instrumentation_callback() {
  __debugbreak();
}

int main() {
  PROCESS_INSTRUMENTATION_CALLBACK_INFORMATION instrumentation_info{};
  instrumentation_info.Callback = reinterpret_cast<void*>(&instrumentation_adapter);
  const auto nt_set_info_proc = reinterpret_cast<NtSetInformationProcess_t>(GetProcAddress(GetModuleHandleA("ntdll.dll"), "NtSetInformationProcess"));
  if (!nt_set_info_proc) {
    std::println("Could not resolve NtSetInformationProcess");
    return -1;
  }
  // 0x28 is the ProcessInformationClass that sets the IC.
  auto status = nt_set_info_proc(GetCurrentProcess(), static_cast<_PROCESS_INFORMATION_CLASS>(0x28), &instrumentation_info, sizeof(instrumentation_info));
  if (status) {
    std::println("NtSetInformationProcess returned {:x}", status);
  } else {
    std::println("Successfully installed instrumentation callback");
  }
}

extern "C" is used to disable C++ name mangling and instead use C-style linkage.

With the line extern "C" void instrumentation_adapter(); we are declaring our not-yet-written assembly bridge, so that the linker can resolve it later.

instrumentation_callback is the function we want to call through our assembly bridge. For now, we just set a breakpoint there, as we will not be implementing a flag to avoid recursion just yet.

Writing the assembly bridge

For writing the assembly bridge, we’ll be using NASM. If you are using MASM or another assembler, you will of course need to adjust the assembly accordingly.

We will start by pushing the registers, setting up the function call, calling it and then undoing our changes. After that, we will jump to r10 to continue the execution flow. There are multiple ways you can save the current registers: you can push them to the stack, save them to a structure or call Windows functions that do that for you. Please note that the following snippets do not save, for example, the floating-point registers.

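extern instrumentation_callback
section .code
global instrumentation_adapter
instrumentation_adapter:
  ; save the flags and the general-purpose registers
  pushfq
  push rax
  push rbx
  push rcx
  push rdx
  push rdi
  push rsi
  push r8
  push r9
  push r10
  push r11
  push r12
  push r13
  push r14
  push r15
  push rbp
  mov rbp, rsp ; remember rsp so it can be restored after the call
  sub rsp, 0x20 ; shadow space required by the x64 calling convention
  and rsp, -16 ; the calling convention expects a 16-byte aligned stack at the call
  call instrumentation_callback
  mov rsp, rbp ; drop the shadow space and the alignment padding
  ; restore the registers in reverse order
  pop rbp
  pop r15
  pop r14
  pop r13
  pop r12
  pop r11
  pop r10
  pop r9
  pop r8
  pop rsi
  pop rdi
  pop rdx
  pop rcx
  pop rbx
  pop rax
  popfq
  jmp r10 ; continue at the original return point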

By running the program with an attached debugger, you should now trigger the breakpoint in the C++ code. This means our function is correctly called. However, we obviously want to do more with our callback than trigger a breakpoint. For that, we will need to implement a check to avoid infinite recursion, as the IC would be executed for every syscall, even if the syscall was made by the IC itself.

This flag should be thread-local, as otherwise we would not catch syscall executions in other threads while our IC in one thread is executing.

For this purpose, we’ll be misusing the legacy member InstrumentationCallbackDisabled of the Thread Environment Block (TEB). This is, at least in x64 versions, no longer used. There are smarter ways of implementing such a check, for example with Thread Local Storage, as using the InstrumentationCallbackDisabled member is an obvious giveaway to EDRs/ACs that something weird is going on.

If you look at the structure of the TEB, you will see InstrumentationCallbackDisabled is located at 0x1b8. The idea is that once the IC is triggered, InstrumentationCallbackDisabled gets set to 1 (true) and then our C++ function is executed. If that function triggers syscalls, they will not call the function again, because our assembly bridge first checks if InstrumentationCallbackDisabled is set to 1 (true). If it is, the bridge skips the callback and continues execution right away. Once our C++ function is over and the assembly bridge restores the registers, the flag will be cleared.

To do this, the following assembly can be used. The first part before the dots is meant to be added right after the pushfq, and the bottom part is meant to replace everything after pop rax.

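  mov rcx, [gs:0x30] ; TEB
  add rcx, 0x1b8 ; TEB->InstrumentationCallbackDisabled
  cmp byte [rcx], 1
  je _ret ; the IC is already running on this thread: skip the callback
  mov byte [rcx], 1 ; mark the IC as running on this thread
  […]
  mov rcx, [gs:0x30] ; TEB
  add rcx, 0x1b8 ; TEB->InstrumentationCallbackDisabled
  mov byte [rcx], 0 ; clear the flag again
_ret:
  popfq
  jmp r10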

The careful eye might’ve noticed something: with this code we are no longer backing up and restoring rcx. Why’s that?

If you attach a debugger to a program, place a breakpoint on the instruction after a syscall and trigger it, you will see the address of the instruction after the syscall in rcx. If you do the same with an IC, you will see that the address of the IC is in rcx. Either way, the syscall itself overwrites rcx, so the calling code cannot rely on its previous value and the bridge does not need to preserve it. If you wanted to hide the existence of your IC, leaving its address in rcx would obviously be counterproductive. Fixing this is not part of this article and will not be covered here.

We would also recommend checking the value of r10 with and without an IC set.
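The recursion guard described in this chapter can also be modelled in plain C++ with thread-local storage, which was mentioned above as an alternative to the TEB member. The following is only a sketch of the logic, not code that is meant to run inside a real IC: g_ic_active, g_handled, IcGuard and on_kernel_return are hypothetical names, and on_kernel_return merely simulates the kernel returning to user mode.

```cpp
// Per-thread flag that plays the role of TEB->InstrumentationCallbackDisabled.
thread_local bool g_ic_active = false;

// Counts how often the handler body actually ran (for demonstration only).
thread_local int g_handled = 0;

// RAII helper: sets the flag on construction and clears it on destruction,
// so the flag is reset on every exit path of the handler.
class IcGuard {
public:
    IcGuard() { g_ic_active = true; }
    ~IcGuard() { g_ic_active = false; }
    IcGuard(const IcGuard&) = delete;
    IcGuard& operator=(const IcGuard&) = delete;
};

// Stand-in for "the kernel returned to user mode on this thread".
// nested_returns simulates syscalls issued by the handler itself.
inline void on_kernel_return(int nested_returns) {
    if (g_ic_active)
        return;  // triggered by our own handler: skip, like the je _ret path
    IcGuard guard;
    ++g_handled;
    for (int i = 0; i < nested_returns; ++i)
        on_kernel_return(0);  // would run the handler again without the flag
}
```

Calling on_kernel_return(3) runs the handler body exactly once, because the three simulated nested returns see the flag and bail out early. As the flag is thread-local, returns on other threads are still handled while one thread is inside its handler.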

Logging and spoofing syscalls

Let’s recap: by now we can execute our own C/C++ function after every kernel-to-user-mode return and make syscalls from within it. This is cool; however, we can’t do specific things for certain executed syscalls, as we do not have access to the executed syscall’s address in our C++ function. Let’s fix this and, while we are at it, let’s pass even more parameters that will be useful to us. In total we are planning to add three parameters, giving us the original stack pointer, the address the executed syscall returns to and the return value. Why the original stack pointer is interesting will be explained shortly.

As mentioned before, there are different ways of saving the registers and different ways of passing information to your function. If you saved the registers in, for example, a CONTEXT structure, you could just pass that to your IC.

Let’s first change our function definition to add the three parameters. Additionally, it would be nice to change the return value of syscalls.

As specified in the Windows x64 calling convention, return values are passed in the rax register. When a syscall is made and the IC is triggered, rax will hold the return value of the syscall. By changing the return type of the instrumentation_callback function from void to uint64_t, we can easily overwrite the return value of the syscall by returning another value from our C++ code, as rax is overwritten by that. For this to work, the bridge must not restore the saved rax after the call.

After implementing those changes, the instrumentation_callback function looks as follows:

extern "C" uint64_t instrumentation_callback(uint64_t original_rsp, uint64_t return_addr, uint64_t return_val) {
  __debugbreak();
  return return_val; // keep the original return value of the syscall
}

Now we need to adjust the assembly bridge. We can use rcx to store the original stack pointer, as we do not need to back up rcx because of the reasons mentioned before.

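extern instrumentation_callback
section .code
global instrumentation_adapter
instrumentation_adapter:
  mov rcx, rsp ; first parameter: original stack pointer
  pushfq
  push rcx
  mov rcx, [gs:0x30] ; TEB
  add rcx, 0x1b8 ; TEB->InstrumentationCallbackDisabled
  cmp byte [rcx], 1
  pop rcx ; pop does not change the flags set by cmp
  je _ret
  mov byte [gs:0x1b8], 1 ; mark the IC as running (in x64 user mode the gs base is the TEB)
  […]
  push rbp
  mov rbp, rsp
  sub rsp, 0x20
  and rsp, -16 ; keep the stack 16-byte aligned for the call
  ; rcx already contains the stack pointer
  mov rdx, r10 ; second parameter: original return address
  mov r8, rax ; third parameter: return value of the syscall
  call instrumentation_callback
  mov rsp, rbp ; drop the shadow space and the alignment padding
  pop rbp
  […]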

This should trigger the placed breakpoint in our C++ code and show that the parameters contain the correct values.

Logging syscalls

To log syscalls with their function name, we will use the dbghelp library, which you need to link against.

Additionally, the following code needs to get added to the start of main to allocate a console and initialize the symbol handler.

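[…]
if (!AllocConsole())
  return -1;

// Redirect the standard streams to the newly allocated console.
FILE* fp;
freopen_s(&fp, "CONOUT$", "w", stdout);
freopen_s(&fp, "CONIN$", "r", stdin);
freopen_s(&fp, "CONOUT$", "w", stderr); // the console only provides CONIN$ and CONOUT$
SymSetOptions(SYMOPT_UNDNAME);
// (HANDLE)-1 is the pseudo handle of the current process.
if (!SymInitialize(reinterpret_cast<HANDLE>(-1), nullptr, TRUE)) {
  std::println("SymInitialize failed");
  return -1;
}
[…]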

The following instrumentation_callback function then prints out all the called function names, their address, the displacement from the function start and the return value.

extern "C" uint64_t instrumentation_callback(uint64_t original_rsp, uint64_t return_addr, uint64_t return_val) {
  // SYMBOL_INFO is followed by a variable-length buffer for the symbol name.
  std::array<byte, sizeof(SYMBOL_INFO) + MAX_SYM_NAME> buffer{ 0 };
  const auto symbol_info = reinterpret_cast<SYMBOL_INFO*>(buffer.data());
  symbol_info->SizeOfStruct = sizeof(SYMBOL_INFO);
  symbol_info->MaxNameLen = MAX_SYM_NAME;
  uint64_t displacement = 0;
  if (!SymFromAddr(reinterpret_cast<HANDLE>(-1), return_addr, &displacement, symbol_info)) {
    printf("[-] SymFromAddr failed: %lu\n", GetLastError());
    return return_val;
  }
  printf("[+] %s+%llu \n\t- Returns: %llu\n\t- Return address: %llu\n", symbol_info->Name, displacement, return_val, return_addr);
  return return_val;
}

This functionality is obviously the most useful if the project is a DLL and not an EXE, as it can then be injected into a process to see which syscalls the program triggers.
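To illustrate what the logged name and displacement mean, the lookup can be modelled as a search for the nearest symbol that starts at or before the given address. This sketch says nothing about how dbghelp is implemented internally; SymbolTable, ResolvedSymbol and resolve are hypothetical names, and the addresses used with them are made up.

```cpp
#include <cstdint>
#include <map>
#include <optional>
#include <string>

struct ResolvedSymbol {
    std::string name;            // name of the nearest preceding symbol
    std::uint64_t displacement;  // distance from the start of that symbol
};

// Maps the start address of each known symbol to its name.
using SymbolTable = std::map<std::uint64_t, std::string>;

// Finds the symbol with the highest start address that is <= address.
inline std::optional<ResolvedSymbol> resolve(const SymbolTable& table,
                                             std::uint64_t address) {
    auto it = table.upper_bound(address);  // first symbol starting after address
    if (it == table.begin())
        return std::nullopt;  // address lies below every known symbol
    --it;                     // step back to the symbol containing address
    return ResolvedSymbol{it->second, address - it->first};
}
```

With a table that places a symbol named NtOpenProcess at 0x1020, resolving 0x1034 yields NtOpenProcess with a displacement of 20, which is the kind of name+displacement pair the logging callback prints.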

Spoofing syscalls

Let’s now start doing cool stuff with our IC: as ICs are the first code being executed in user mode after a syscall, we can spoof its return values from our IC.

For this example, our test program will be using OpenProcess to open a handle to another process. Our IC will then retrieve the opened handle from the stack, close it and then return STATUS_ACCESS_DENIED.

Our IC only gets a callback to NtOpenProcess, which is called by OpenProcess, not to OpenProcess itself. Let’s look at the function definitions for both functions:

HANDLE OpenProcess(
  [in] DWORD dwDesiredAccess,
  [in] BOOL  bInheritHandle,
  [in] DWORD dwProcessId
);

NTSTATUS NtOpenProcess(
  [out]          PHANDLE            ProcessHandle,
  [in]           ACCESS_MASK        DesiredAccess,
  [in]           POBJECT_ATTRIBUTES ObjectAttributes,
  [in, optional] PCLIENT_ID         ClientId
);

As we can see, rax, the register containing the return value of the syscall, will hold an NTSTATUS value and not the handle. First, we need to check if NtOpenProcess was executed without an error. Then we need to retrieve the handle from the stack, for which we need a stack offset.

As OpenProcess returns a HANDLE, we know the required logic to retrieve the handle is already implemented in OpenProcess after the NtOpenProcess function call.

Let’s reverse OpenProcess in kernelbase to retrieve the offset:

[…]
call qword [rel NtOpenProcess]
nop dword [rax+rax]
test eax, eax
js 0x1800338c5
mov rax, qword [rsp+0x88]
add rsp, 0x68
retn

Most of the function is not important for us; we just need to check how the handle gets loaded into rax. This is done through the operation mov rax, qword [rsp+0x88], so we know that if we have the stack pointer of the OpenProcess function, the handle is at an offset of 0x88. Our original_rsp parameter holds the stack pointer of NtOpenProcess, not OpenProcess. This means that the top of the stack holds the address NtOpenProcess should return to in OpenProcess. Therefore, we need to add eight to that value of 0x88 to access the handle.

You might understand now why we added an original_rsp parameter to our C++ function. We could still access the handle from the function with inline assembly; however, every time we add, for example, a local variable in our C++ function, we would need to recalculate our offset to the handle, as a bigger stack frame would be allocated for our function.
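The offset arithmetic and the status check from the disassembly can be written down as a small C++ sketch. The names nt_success, handle_slot and the two constants are chosen for this example only; the values 0x88 and 8 are the ones derived above.

```cpp
#include <cstdint>

// Mirrors "test eax, eax / js": the call succeeded if the signed 32-bit
// value in eax is not negative.
inline bool nt_success(std::uint64_t rax) {
    return static_cast<std::int32_t>(static_cast<std::uint32_t>(rax)) >= 0;
}

constexpr std::uint64_t kHandleOffsetInOpenProcess = 0x88;  // mov rax, qword [rsp+0x88]
constexpr std::uint64_t kReturnAddressSize = 8;             // pushed by the call to NtOpenProcess

// original_rsp is the stack pointer inside NtOpenProcess, so the return
// address into OpenProcess sits between it and the frame of OpenProcess.
constexpr std::uint64_t handle_slot(std::uint64_t original_rsp) {
    return original_rsp + kReturnAddressSize + kHandleOffsetInOpenProcess;
}

static_assert(handle_slot(0) == 0x90, "0x88 + 8 = 0x90");
```

For an original_rsp of 0x1000, handle_slot returns 0x1090. A status such as 0xC0000022 has its top bit set and is therefore treated as a failure by nt_success, while 0 counts as success.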

Let’s recap what we require to spoof the handle access:

  1. We need to calculate the address of the ret instruction of NtOpenProcess, which is where the syscall returns to.
  2. We need to check if the return address passed to our IC is that of the ret instruction of NtOpenProcess.
  3. We should check the value of rax. If it contains a non-zero value, NtOpenProcess did not succeed and there is nothing to spoof.
  4. We need to close the handle at the offset of 0x90 of the original stack pointer and change it to INVALID_HANDLE_VALUE.
  5. We need to change the return value to STATUS_ACCESS_DENIED (0xC0000022).

As we can now do this in C++, this is very easy and can be done with the following code:

extern "C" uint64_t instrumentation_callback(uint64_t original_rsp, uint64_t return_addr, uint64_t return_val) {
  static uint64_t nt_open_proc;
  if (!nt_open_proc) {
    nt_open_proc = reinterpret_cast<uint64_t>(GetProcAddress(GetModuleHandleA("ntdll.dll"), "NtOpenProcess"));
    if (!nt_open_proc)
      return return_val;
    nt_open_proc += 20; // address of the ret instruction that follows the syscall
  }
  // Only handle returns from the NtOpenProcess syscall.
  if (return_addr != nt_open_proc)
    return return_val;
  // A non-zero status means NtOpenProcess did not succeed.
  if (return_val != 0)
    return return_val;
  // 8 bytes for the return address plus the offset 0x88 used by OpenProcess.
  auto handle_ptr = reinterpret_cast<HANDLE*>(original_rsp + 0x90);
  if (*handle_ptr == INVALID_HANDLE_VALUE)
    return return_val;
  std::println("[+] IC: Detected program NtOpenProcess call: {}", *handle_ptr);
  CloseHandle(*handle_ptr);
  std::println("[+] IC: Closed opened handle and spoofing Access denied");
  *handle_ptr = INVALID_HANDLE_VALUE;
  return 0xC0000022; // Access denied NTSTATUS value
}
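The constant 20 used in the callback is the distance from the start of the NtOpenProcess stub to the ret instruction that follows the syscall. As a rough illustration, the offset can be derived by scanning the stub bytes for the sequence syscall; ret (0F 05 C3). The byte layout in example_stub is an assumed typical x64 stub, and the syscall number in it is only a placeholder that differs between Windows builds; find_syscall_ret_offset and example_stub are hypothetical helpers, not part of any Windows API.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Returns the offset of the ret (C3) that directly follows the first
// syscall instruction (0F 05), or nullopt if the pattern is not present.
// A plain byte scan like this can be fooled by immediates that happen to
// contain the same bytes, so treat it as an illustration only.
inline std::optional<std::size_t> find_syscall_ret_offset(
        const std::vector<std::uint8_t>& stub) {
    for (std::size_t i = 0; i + 2 < stub.size(); ++i) {
        if (stub[i] == 0x0F && stub[i + 1] == 0x05 && stub[i + 2] == 0xC3)
            return i + 2;
    }
    return std::nullopt;
}

// Assumed typical layout of an x64 syscall stub in ntdll.
inline std::vector<std::uint8_t> example_stub() {
    return {
        0x4C, 0x8B, 0xD1,                                // mov r10, rcx
        0xB8, 0x26, 0x00, 0x00, 0x00,                    // mov eax, <syscall number>
        0xF6, 0x04, 0x25, 0x08, 0x03, 0xFE, 0x7F, 0x01,  // test byte [0x7FFE0308], 1
        0x75, 0x03,                                      // jne to the int 0x2e path
        0x0F, 0x05,                                      // syscall
        0xC3,                                            // ret
        0xCD, 0x2E,                                      // int 0x2e
        0xC3                                             // ret
    };
}
```

For this layout the scan returns 20: 3 + 5 + 8 + 2 bytes precede the syscall instruction, which itself is 2 bytes long. If a stub on your system looks different, the hardcoded offset in the callback has to be adjusted accordingly.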

To test this, let’s open a handle to a process with and without an IC set. For this example, we’ll be using notepad.exe as a test program. As OpenProcess requires a process ID, we have also added a basic process ID enumeration function.

#include <tlhelp32.h>
[…]
uint32_t get_process_id(const std::string_view& process_name) {
  PROCESSENTRY32 proc_entry{ .dwSize = sizeof(PROCESSENTRY32) };
  HANDLE snapshot = CreateToolhelp32Snapshot(TH32CS_SNAPPROCESS, 0);
  if (snapshot == INVALID_HANDLE_VALUE)
    return 0;
  if (!Process32First(snapshot, &proc_entry)) {
    CloseHandle(snapshot);
    return 0;
  }
  do {
    if (std::string{ proc_entry.szExeFile } != process_name)
      continue;
    CloseHandle(snapshot);
    return proc_entry.th32ProcessID;
  } while (Process32Next(snapshot, &proc_entry));
  CloseHandle(snapshot);
  return 0;
}

int main()
{
  PROCESS_INSTRUMENTATION_CALLBACK_INFORMATION instrumentation_info{};
  instrumentation_info.Callback = reinterpret_cast<void*>(&instrumentation_adapter);
  const auto nt_set_info_proc = reinterpret_cast<NtSetInformationProcess_t>(GetProcAddress(GetModuleHandleA("ntdll.dll"), "NtSetInformationProcess"));
  if (!nt_set_info_proc) {
    std::println("Could not resolve NtSetInformationProcess");
    return -1;
  }
  const auto pid = get_process_id("notepad.exe");
  if (pid == 0) {
    std::println("Could not find notepad.exe");
    return -1;
  }
  // First attempt without an IC. OpenProcess returns NULL on failure, not INVALID_HANDLE_VALUE.
  auto handle = OpenProcess(GENERIC_ALL, 0, pid);
  if (handle) {
    std::println("Successfully opened process handle: {}", handle);
    CloseHandle(handle);
  } else {
    std::println("Failed opening process handle: {}", handle);
  }
  auto status = nt_set_info_proc(GetCurrentProcess(), static_cast<_PROCESS_INFORMATION_CLASS>(0x28), &instrumentation_info, sizeof(instrumentation_info));
  if (status) {
    std::println("NtSetInformationProcess returned {:x}", status);
  } else {
    std::println("Successfully installed instrumentation callback");
  }
  // Second attempt with the IC installed.
  handle = OpenProcess(GENERIC_ALL, 0, pid);
  if (handle) {
    std::println("Successfully opened process handle: {}", handle);
    CloseHandle(handle);
  } else {
    std::println("Failed opening process handle: {}", handle);
  }
}

Executing the code with a working IC should result in one successful and one failed OpenProcess call if notepad.exe is running.

Of course, OpenProcess was just used as an example. This can be done with every syscall.

Closing words

In this blog post you learnt how ICs work and how they can be used to log and spoof syscalls from user mode. ICs can be utilized for much more; in the upcoming blog posts you will learn how to inject shellcode into other processes and how you can hook function calls with ICs to, for example, prevent users from overwriting your own IC. In a more theoretical part of the series we will discuss other use cases of ICs and possible countermeasures.
