What if you could strap a full desktop GPU to your MacBook Air? Turns out, you can.
Just a quick FTC-required note: when you buy through my links, I may earn a commission.
Never tell me the odds
As much as I hate to admit it, step one in most of my projects now is to ask AI about it. Maybe it’ll tell me something I don’t know.
Fortunately, borderline-impractical is kind of my thing.
What’s a Thunderbolt eGPU?
Ok, so the plan is to plug a big PC gaming GPU, an NVIDIA RTX 5090, into my M4 MacBook Air. To do that, we plug it into a Thunderbolt dock which adapts PCIe to Thunderbolt, and we plug that into a USB-C port.
Thunderbolt tunnels PCIe over a USB-C cable, so from the computer’s perspective a Thunderbolt device really is a PCIe device, not a USB one. You get 4 PCIe lanes at up to 40Gbps on Thunderbolt 4, with a small performance penalty for the tunneling. USB4 includes the same PCIe tunneling as an optional feature, so some non-Thunderbolt USB4 ports can do this too. You can use this to plug a GPU into a laptop with a compatible port.
Thunderbolt from the laptop plugs into the GPU dock. The GPU plugs into the monitor via DisplayPort. Shortly after this was taken, I broke this dock.
From the computer’s perspective, the device looks more or less like a slightly slower PCIe device, so you can usually use the same drivers you’d normally use for those devices. eGPUs work pretty much out of the box on Linux and Windows. It’s even possible to use one on a Raspberry Pi (albeit with OCuLink, not Thunderbolt).
The first hurdle is that macOS does not ship with drivers for NVIDIA or AMD GPUs on Apple Silicon.
What about tinygrad?
tinygrad recently released their own macOS eGPU drivers. It’s a whole new AI stack with its own open source driver pipeline for NVIDIA and AMD hardware.
Sadly, if your main objective is to run AI inference or play games, tinygrad probably isn’t the solution you’re looking for. This video by YouTuber Alex Ziskind shows that using an eGPU via tinygrad for inference is about 10 times slower than running native Metal inference directly on an M4 Pro without an eGPU. You can only use the tinygrad eGPU driver with the tinygrad stack, not for anything else. It also has very limited support for different AI models.
Getting NVIDIA PTX code running on the GPU is one thing. Writing a full general-purpose display driver that works with arbitrary software is a significantly harder problem. So for now, what can you actually do with an eGPU and a Mac?
The existing Linux driver
Linux can run on Apple Silicon Macs now. Regrettably, at this time, the Linux kernel does not support Thunderbolt on Apple Silicon (only internal devices and USB3). But…
You can run Linux in a 64-bit ARM VM on a macOS host. macOS supports Thunderbolt devices. Linux supports NVIDIA GPUs. Let’s put the pieces together and pass through the GPU into the Linux VM.
At a high level, we’re just going to put the GPU in the Linux VM. The VM is the same architecture as the Mac host (arm64), so performance should be comparable. Of course, the devil is in the details.
There is no driver for NVIDIA cards on ARM64 Windows. That’s why we use Linux.
For a quick video demo of the result, take a look:
In the rest of the post, I’ll go through the long and winding road of getting this to actually work. If you just want to see screenshots and benchmarks, you can probably skip to the benchmark section.
Engineering PCI Passthrough on macOS
PCI device basics
Let’s look at two things we need working for the VM to talk to the PCI device:
PCI BAR (Base Address Registers) - Each PCI device communicates through chunks of memory that the computer can read and write to. There’s basically a reserved region of memory on your computer for each device. Those memory regions have to be mirrored into the VM for PCI passthrough to work.
DMA (Direct Memory Access) - This is how the device can read and write information directly in/out of your computer’s memory. Instead of having the CPU burn cycles copying data from the device, the device can copy the memory automatically. For a GPU, it might be used to copy textures directly from the computer’s memory into its own video memory.
Mapping PCI BARs
When QEMU starts a VM, it sets up the guest’s memory layout. For normal RAM, this boils down to a call to hvf_set_phys_mem() in QEMU, which uses the Hypervisor.framework method:
hv_vm_map(mem, guest_physical_address, size, HV_MEMORY_READ | HV_MEMORY_WRITE | HV_MEMORY_EXEC);
Next, we connect to the host PCIDriverKit driver and ask to map the memory from the PCI device into our process. (I’m leaving the driver-side code out for now, but it’s very similar boilerplate.)
// Map BAR0 into the current process and set `addr` to the location
// where it was mapped.
mach_vm_address_t addr = 0;
mach_vm_size_t size = 0;
IOConnectMapMemory64(driverConnection, 0, mach_task_self(), &addr, &size, kIOMapAnywhere);
Ok, so then we have addr, which now points to the BAR0 memory that we can access directly in our process. At this point you can just read and write stuff to it, like any other piece of memory.
volatile uint32_t *bar0 = (volatile uint32_t *)addr;
printf("BAR0[0] = 0x%x\n", bar0[0]);
// this would output: BAR0[0] = 0x1b2000a1
// which is a device-specific constant that describes my RTX 5090
//
// BAR0[0] is the BOOT_0 register. The fields break down as:
//   arch      = 0x1b → GB200 GPU family
//   impl      = 0x2  → GB202 die (RTX 5090)
//   major_rev = 0xa  → stepping A
//   minor_rev = 0x1  → revision 1 (together: stepping A1)
Now we just make sure QEMU calls hvf_set_phys_mem() for our device memory, and we can map that into the guest. When guest code touches that mapping, it talks directly to the GPU with minimal host overhead. This is the best case for performance. At least, in theory.
In practice, as soon as the VM touched the PCI BAR memory, the host kernel crashed.
If you’ve never experienced this before, it’s disorienting. Your entire computer will hang, and because the trackpad feedback is controlled by software, suddenly the trackpad will no longer click. The dogs and cats in your neighborhood start howling. Pictures fall off the walls of your house. Eventually your computer will reboot, and you will be presented with this dialog.
Ok, so we can’t map device memory directly, but we have other tricks up our sleeve. We can trap every access to the memory, exit the guest back into QEMU, and have QEMU forward each read or write to the device. That keeps behavior correct, but it’s brutally slow. For many workloads that’s tolerable, since most of the performance-sensitive work goes through DMA, but some paths still care how fast you can push commands through the BAR.
I started preparing a bug report for Apple and wrote a small reproduction (well, AI-assisted) to demonstrate the issue:
In ~100 lines of C, you can spin up a VM, map the device BAR into the guest, and run code that touches it. I’m still not sure whether that was more frustrating or encouraging, but that version ran without crashing, while QEMU was still panicking the host. I was stumped for a while. Was it the guest page tables? Was the BAR colliding with guest RAM in some subtle way? Why were the dogs and cats still howling?
Eventually, in my desperation, I asked an AI coding assistant to compare my sample and QEMU. It immediately flagged that my mapping used HV_MEMORY_READ | HV_MEMORY_WRITE while QEMU used HV_MEMORY_READ | HV_MEMORY_WRITE | HV_MEMORY_EXEC. Alas, bested again by AI. Not even silly blog projects are safe anymore (mostly kidding).
The workaround in QEMU was a small change:
It works, but it’s not perfect. ARM has several flavors of device memory (the Device-nGnRnE/nGnRE/nGRE/GRE family), with different rules for whether writes can be gathered, reordered, or acknowledged early. It’s roughly analogous to x86 write-combining on the most permissive end.
On real hardware, the prefetchable BARs on my GPU are supposed to allow gathering, which makes them several times faster for bulk writes than BAR0. But hv_vm_map() has no flags to configure this, so every device mapping ends up as the strictest nGnRnE. There’s nothing we can do about it, and it’s still ~30x faster than trapping every access, but it makes writing the BAR ~10x slower than it would be normally.
DMA
This was by far the sketchiest part of the project. To start, let’s go over how this works on a PC running Linux with VM PCI-passthrough, and then we’ll compare to our challenge on macOS.
When there’s just a computer talking to a device (no VM involved), they can talk together directly. The PC will tell the device “hey I got that DMA buffer ready at this memory address” and the device can access that memory directly (AKA DMA). Easy.
When a VM is involved, it’s more complicated. Guest physical addresses don’t correspond to host physical addresses. The VM’s RAM is just some chunk of host memory allocated wherever it was available. So if the guest tells the device “DMA into 0x00000000,” the device will happily scribble over whatever actually lives there on the host. The simplest fix is two things:
Pin all guest memory so it can’t be paged out while the device might touch it.
Put a hardware unit called the IOMMU between the device and host memory. The hypervisor programs it with the guest → host translations, and every DMA request from the device gets remapped on the fly.
[Diagram: a DMA request to read/write 0x00000000 enters the IOMMU. Its translation table maps 0x00000000 – 0x80000000 to 0x20000000 – 0xA0000000, so the request is translated to 0x20000000 and lands in host physical memory at 0x20000000 – 0xA0000000.]
This is a blunt solution. The guest doesn’t have to do anything special, but the host has to keep all guest RAM pinned. There are more advanced approaches (like a virtual IOMMU), but they’re outside the scope of this post.
DMA on Apple Silicon
On Apple Silicon, there’s a hardware unit called DART that’s more or less equivalent to an IOMMU. It’s not specific to VMs; it also acts as a security boundary, preventing devices from accessing arbitrary host memory. Ideally we’d just use DART the same way Linux uses the IOMMU in the simple case above.
Unfortunately, DART (at least via PCIDriverKit for Thunderbolt devices) has some hard constraints:
~1.5GB mapping limit. A VM with 1.5GB of RAM can technically boot, but CUDA runs out of memory and any modern game needs 8 – 16GB.
~64k mapping cap. With many small DMA buffers the mapping table fills up.
No address or alignment control. PCIDriverKit assigns mapped addresses for you. You can’t pick them, or specify alignment constraints. This rules out a virtual IOMMU, which requires the guest to choose its own DMA addresses.
The 1.5GB ceiling was the biggest initial blocker. I tried a few workarounds: pre-mapping ranges where I guessed DMAs might land (obviously didn’t work), and using a restricted-dma-pool device tree attribute to force all DMA through a pre-allocated region. The restricted pool approach actually works for simpler devices, but GPU drivers are too weird to fit into that model. (If you’re curious about the specifics, there’s a qemu-devel thread where I discuss it.)
apple-dma-pci
I ended up designing a new virtual PCI device in QEMU called apple-dma-pci. It gets inserted into the VM alongside the passed-through GPU, and a companion kernel driver in the guest intercepts the NVIDIA driver’s DMA mapping calls. The solution is, frankly, a very upsetting hack, but it works.
Because mappings are created on demand per DMA request and torn down when the buffer is freed, we reduce the amount of mapped memory we need at any given time. Only the working set of live DMA buffers at any given moment has to fit in our 1.5GB limit, as opposed to the entirety of guest memory.
The guest driver is loaded early (via an /etc/modules-load.d/ config), so it can find the GPU at probe time and swap in custom DMA ops before the NVIDIA driver touches it:
static struct dma_map_ops apple_dma_ops = {
    .map_page   = apple_dma_map_page,
    .unmap_page = apple_dma_unmap_page,
    .map_sg     = apple_dma_map_sg,
    .unmap_sg   = apple_dma_unmap_sg,
    .alloc      = apple_dma_alloc,
    .free       = apple_dma_free,
};

static int apple_dma_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
{
    // Find the passed-through NVIDIA GPU (vendor ID from <linux/pci_ids.h>).
    struct pci_dev *gpu = pci_get_device(PCI_VENDOR_ID_NVIDIA, PCI_ANY_ID, NULL);
    if (!gpu)
        return -ENODEV;

    // Route the GPU's DMA mapping calls through our custom ops.
    set_dma_ops(&gpu->dev, &apple_dma_ops);
    pci_dev_put(gpu);
    return 0;
}
Each of the custom ops is a thin wrapper. It marshals its arguments into a small request, writes it into memory for the apple-dma-pci virtual BAR, kicks a doorbell register, and waits for a reply. On the host side, QEMU picks up the request, hands it off to the PCIDriverKit driver, which performs the actual DART mapping, and the resulting DMA address gets written back to guest memory. The NVIDIA driver shouldn’t know the difference.
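[Diagram: in the Linux VM (guest), the NVIDIA driver calls dma_map_page(), which lands in the apple_dma_ops handler. The handler writes to the virtual PCI BAR of the apple-dma-pci virtual device, which triggers a VM exit. On the macOS host, QEMU calls IOConnectCallMethod() into the PCIDriverKit driver, which uses IODMACommand to program the DART hardware. The mapped address is returned back up the stack.]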
NVIDIA alignment quirk
It didn’t immediately work well, though. While the driver initially loaded and initialized the card, I was greeted with this fun kernel log message as soon as I attempted to run a CUDA workload:
[ 456.194883] NVRM: nvAssertOkFailedNoLog: Assertion failed: The offset passed is not valid [NV_ERR_INVALID_OFFSET] (0x00000037) returned from pRmApi->Alloc(pRmApi, device->session->handle, isSystemMemory ? device->handle : device->subhandle, &physHandle, isSystemMemory ? NV01_MEMORY_SYSTEM : NV01_MEMORY_LOCAL_USER, &memAllocParams, sizeof(memAllocParams)) @ nv_gpu_ops.c:4972
[ 456.371282] NVRM: GPU0 nvAssertFailedNoLog: Assertion failed: 0 == (physAddr & (RM_PAGE_SIZE_HUGE - 1)) @ mem_mgr_gm107.c:1312
[ 456.372020] NVRM: nvAssertOkFailedNoLog: Assertion failed: The offset passed is not valid [NV_ERR_INVALID_OFFSET] (0x00000037) returned from pRmApi->Alloc(pRmApi, device->session->handle, isSystemMemory ? device->handle : device->subhandle, &physHandle, isSystemMemory ? NV01_MEMORY_SYSTEM : NV01_MEMORY_LOCAL_USER, &memAllocParams, sizeof(memAllocParams)) @ nv_gpu_ops.c:4972
If you recall the earlier DMA section, we noted that we can’t control the alignment of DMA-mapped buffers. Bummer. At this point, I dug into the driver to try to see if there was something simple we could patch.
Here’s the relevant segment:
if (type == UVM_RM_MEM_TYPE_SYS) {
    if (size >= UVM_PAGE_SIZE_2M)
        alloc_info.pageSize = UVM_PAGE_SIZE_2M;
    else if (size >= UVM_PAGE_SIZE_64K)
        alloc_info.pageSize = UVM_PAGE_SIZE_64K;

    status = uvm_rm_locked_call(nvUvmInterfaceMemoryAllocSys(gpu->rm_address_space, size, &gpu_va, &alloc_info));

    // TODO: Bug 5042223
    if (status == NV_ERR_NO_MEMORY && size >= UVM_PAGE_SIZE_64K) {
        UVM_ERR_PRINT("nvUvmInterfaceMemoryAllocSys alloc failed with big page size, retry with default page size\n");
        alloc_info.pageSize = UVM_PAGE_SIZE_DEFAULT;
        status = uvm_rm_locked_call(nvUvmInterfaceMemoryAllocSys(gpu->rm_address_space, size, &gpu_va, &alloc_info));
    }
}
By adding more debug logging to the module, I could see that it was a 16MB allocation of type UVM_RM_MEM_TYPE_SYS, so it uses the largest (2MB) page size. Ironically, there is already a workaround here for when the allocation fails: it just tries again with the default page size. It just doesn’t take into account the different error code we get for misalignment (NV_ERR_INVALID_OFFSET).