QEMU Source Code Notes
Here are some of my badly organized notes from reading QEMU’s source code. They cover only a few small parts of QEMU.
I’ll update these notes when I run into either of the following situations:
- The implementation of some specific part is complicated
- I forget something I previously understood
I find pictures easier to follow, and I use LucidChart as my drawing tool because I find it the easiest to use. I haven’t found a satisfactory way to represent the logic of code as a graph, though. I’ve tried various forms, such as plain flowcharts and UML sequence diagrams, so the style of my pictures is not consistent. Sorry for that.
CPU emulation
CPU initialization
The CPU initialization process is illustrated in the following picture.
CPU emulation thread
Starting from QEMU 1.0, all emulated CPUs run in a single thread. Here’s an overview of what this thread is doing.
Emulated Memory
Emulated memory allocation
Note that the following description applies to QEMU 0.13. This changed in QEMU 1.0.
The following picture is an overview of how emulated main memory is allocated.
Let’s look into how the main memory is allocated in QEMU. In hw/pc_piix.c, pc_init1 initializes the whole PC; it calls pc_memory_init to allocate emulated memory. The memory is actually allocated by qemu_ram_alloc, which is called in pc_memory_init.
QEMU needs to emulate many kinds of devices, and each kind of device needs its own memory, e.g. RAM, BIOS, VGA. QEMU uses a list to manage all the allocated device memory.
qemu_ram_alloc allocates a new RAMBlock node, which records:
- host: the actual host memory (in QEMU)
- length: the size of this memory region
- offset: the ram_addr (more details below)
To support pluggable devices, emulated memory needs to be releasable. This means the same guest physical memory address (guest is the emulated target) may refer to different devices, and thus to different host memory, at different times. I guess this is why QEMU creates an indirection between guest physical memory addresses and host memory, which is ram_addr. (I’m not sure about this, because I don’t see any problem with mapping directly from guest physical memory to host memory.)
A ram_addr is assigned when memory is allocated. Using ram_list, it’s possible to convert between a ram_addr and a host address. Many places in the softmmu system use this ram_addr to refer to the allocated memory.
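The bookkeeping described above can be sketched as follows. This is a simplified model and not QEMU’s code: the struct only mirrors the three fields listed earlier, the helper names ram_alloc, ram_addr_to_host and host_to_ram_addr are hypothetical, and handing out offsets with a simple growing counter is my assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

typedef uintptr_t ram_addr_t;

/* One node per allocated region, mirroring the fields listed above. */
typedef struct RAMBlock {
    uint8_t *host;          /* backing memory inside the QEMU process */
    ram_addr_t offset;      /* ram_addr of the first byte of the region */
    ram_addr_t length;      /* size of the region in bytes */
    struct RAMBlock *next;
} RAMBlock;

static RAMBlock *ram_list;       /* head of the list of all regions */
static ram_addr_t next_offset;   /* assumption: offsets only grow */

/* Allocate a region and return its ram_addr (hypothetical helper). */
static ram_addr_t ram_alloc(ram_addr_t size)
{
    RAMBlock *b = malloc(sizeof(*b));
    assert(b != NULL);
    b->host = calloc(1, size);
    assert(b->host != NULL);
    b->offset = next_offset;
    b->length = size;
    b->next = ram_list;
    ram_list = b;
    next_offset += size;
    return b->offset;
}

/* ram_addr -> host pointer: find the block that contains the address. */
static void *ram_addr_to_host(ram_addr_t addr)
{
    for (RAMBlock *b = ram_list; b; b = b->next) {
        if (addr >= b->offset && addr - b->offset < b->length) {
            return b->host + (addr - b->offset);
        }
    }
    return NULL;
}

/* host pointer -> ram_addr: the reverse walk. Returns 0 on success. */
static int host_to_ram_addr(const void *ptr, ram_addr_t *out)
{
    uintptr_t p = (uintptr_t)ptr;
    for (RAMBlock *b = ram_list; b; b = b->next) {
        uintptr_t h = (uintptr_t)b->host;
        if (p >= h && p - h < b->length) {
            *out = b->offset + (p - h);
            return 0;
        }
    }
    return -1;
}
```

Both conversions are linear walks over the list, which is fine for a handful of regions (RAM, BIOS, VGA and so on).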
After memory is allocated, it needs to be registered by calling cpu_register_physical_memory_offset so that the softmmu system knows how to access it (using an IO function or a memory read/write function). To register memory, we need to specify which guest physical memory range the allocated memory represents. To speed up finding the ram_addr for a guest physical address (which is done in functions like ldl_phys), the register function sets up a data structure that is very similar to a page table, l1_phys_map.
(TODO: more details)
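To make the “similar to a page table” idea concrete, here is a minimal two-level lookup sketch. It is not QEMU’s implementation: the level sizes, the UNASSIGNED marker, the 32-bit address width and the function names register_physical_memory and phys_to_ram_addr are all assumptions made for illustration, and registered ranges are assumed to be page aligned.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define PAGE_BITS  12
#define L2_BITS    10
#define L1_BITS    10                 /* 10 + 10 + 12 = 32-bit guest physical space */
#define L1_SIZE    (1u << L1_BITS)
#define L2_SIZE    (1u << L2_BITS)
#define UNASSIGNED UINT32_MAX         /* marker: nothing registered for this page */

typedef struct PhysPageDesc {
    uint32_t phys_offset;             /* ram_addr of the page, or UNASSIGNED */
} PhysPageDesc;

/* First level: pointers to second-level tables, allocated lazily. */
static PhysPageDesc *l1_map[L1_SIZE];

static PhysPageDesc *phys_page_find(uint32_t paddr, int alloc)
{
    uint32_t index = paddr >> PAGE_BITS;
    uint32_t i1 = index >> L2_BITS;
    uint32_t i2 = index & (L2_SIZE - 1);

    if (!l1_map[i1]) {
        if (!alloc) {
            return NULL;
        }
        l1_map[i1] = malloc(L2_SIZE * sizeof(PhysPageDesc));
        assert(l1_map[i1] != NULL);
        for (uint32_t i = 0; i < L2_SIZE; i++) {
            l1_map[i1][i].phys_offset = UNASSIGNED;
        }
    }
    return &l1_map[i1][i2];
}

/* Record that guest range [start, start + size) is backed by ram_addr. */
static void register_physical_memory(uint32_t start, uint32_t size,
                                     uint32_t ram_addr)
{
    for (uint32_t off = 0; off < size; off += 1u << PAGE_BITS) {
        phys_page_find(start + off, 1)->phys_offset = ram_addr + off;
    }
}

/* Guest physical address -> ram_addr, or UNASSIGNED. */
static uint32_t phys_to_ram_addr(uint32_t paddr)
{
    PhysPageDesc *pd = phys_page_find(paddr, 0);
    if (!pd || pd->phys_offset == UNASSIGNED) {
        return UNASSIGNED;
    }
    return pd->phys_offset + (paddr & ((1u << PAGE_BITS) - 1));
}
```

A lookup costs two array indexings regardless of how many regions are registered, which is the point of using a table instead of searching a list.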
Softmmu
The softmmu emulation uses C macros to emulate a template system. There are several template header files which are included in other files multiple times to generate functions that work on different access sizes, and functions that access guest memory with different privileges.
The following picture shows the include relationships between the related files.
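Here is a small sketch of the macro-as-template technique. To keep it in one file, it uses a single function-generating macro instead of re-including a header with a different size macro defined each time, so it only illustrates the idea. The names guest_ram, DEFINE_ACCESSORS and guest_ld*/guest_st* are made up for this example.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

static uint8_t guest_ram[64];   /* stand-in for guest memory */

/*
 * The "template": one expansion produces a load and a store function
 * for a single access size. Each expansion plays the role of one
 * inclusion of a template header.
 */
#define DEFINE_ACCESSORS(SUFFIX, TYPE)                          \
    static TYPE guest_ld##SUFFIX(uint32_t addr)                 \
    {                                                           \
        TYPE v;                                                 \
        assert(addr + sizeof(v) <= sizeof(guest_ram));          \
        memcpy(&v, &guest_ram[addr], sizeof(v));                \
        return v;                                               \
    }                                                           \
    static void guest_st##SUFFIX(uint32_t addr, TYPE v)         \
    {                                                           \
        assert(addr + sizeof(v) <= sizeof(guest_ram));          \
        memcpy(&guest_ram[addr], &v, sizeof(v));                \
    }

/* Instantiate the template for 1, 2, 4 and 8 byte accesses. */
DEFINE_ACCESSORS(b, uint8_t)
DEFINE_ACCESSORS(w, uint16_t)
DEFINE_ACCESSORS(l, uint32_t)
DEFINE_ACCESSORS(q, uint64_t)
```

After preprocessing there are eight ordinary functions (guest_ldb, guest_stb, guest_ldw and so on), written once and stamped out per size.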
IRQState and related functions
There are 2 types of hardware IRQ:
- Level sensitive: triggered while the signal level is high
- Edge triggered: triggered when the signal pulses
In hw/irq.c, qemu_set_irq calls the irq handler stored in IRQState. qemu_irq_raise, qemu_irq_lower and qemu_irq_pulse simply call qemu_set_irq to adjust the level of the IRQ.
qemu_allocate_irqs allocates memory for several qemu_irq values, each pointing to a newly allocated IRQState, all with the same handler and opaque value.
i8254 as an example
TODO
Timer
I’m not very familiar with x86 hardware, so the hardware-related descriptions here are very likely to be wrong.
Host alarm
On x86, both the main board and the CPU cores have timers. I call the main board timer the host alarm; it corresponds to the PIT timer. More specifically, QEMU emulates an i8254 timer attached to an i8259 interrupt controller.
QEMUTimer
pit_init creates a QEMUTimer whose callback is pit_irq_timer. The definition for QEMUTimer is:
The clock type for the PIT timer is vm_clock, which only runs when the VM runs. (There are also rt_clock and host_clock; the difference is commented in qemu-timer.h.) A timer must be put onto the active_timers list (using qemu_mod_timer) in order to run. main_loop_wait calls qemu_run_all_timers, which then invokes the callbacks of all expired timers (those whose expire time <= current time).
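A minimal sketch of the active-timer list follows. The struct is a cut-down stand-in, not the real QEMUTimer, and the names Timer, mod_timer, del_timer, run_timers and count_cb are hypothetical. It assumes a single clock and a list kept sorted by expire time.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef void TimerCB(void *opaque);

typedef struct Timer {
    int64_t expire_time;    /* when the callback becomes due */
    TimerCB *cb;
    void *opaque;
    struct Timer *next;
} Timer;

static Timer *active_timers;    /* sorted, earliest expire time first */

/* Unlink a timer if it is on the active list. */
static void del_timer(Timer *t)
{
    for (Timer **pt = &active_timers; *pt; pt = &(*pt)->next) {
        if (*pt == t) {
            *pt = t->next;
            break;
        }
    }
}

/* (Re)arm a timer: remove it, then insert it at its sorted position. */
static void mod_timer(Timer *t, int64_t expire_time)
{
    assert(t->cb != NULL);
    del_timer(t);
    Timer **pt = &active_timers;
    while (*pt && (*pt)->expire_time <= expire_time) {
        pt = &(*pt)->next;
    }
    t->expire_time = expire_time;
    t->next = *pt;
    *pt = t;
}

/* Run every timer with expire_time <= now; return how many fired. */
static int run_timers(int64_t now)
{
    int fired = 0;
    while (active_timers && active_timers->expire_time <= now) {
        Timer *t = active_timers;
        active_timers = t->next;    /* unlink first: cb may re-arm t */
        t->next = NULL;
        t->cb(t->opaque);
        fired++;
    }
    return fired;
}

/* Hypothetical callback used for demonstration: counts invocations. */
static void count_cb(void *opaque)
{
    (*(int *)opaque)++;
}
```

Since the list is sorted, run_timers can stop at the first timer that is not yet due.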
QEMUTimer only describes when the callback should be called. Because QEMU uses block chaining, a virtual CPU may execute a lot of translated code without leaving the execution loop, so timer callbacks may not run promptly. QEMU therefore uses an alarm timer (introduced later) to avoid waiting too long before handling a timer interrupt.
Timer IRQ handler
When the timer fires, it should send an interrupt to the CPU. This is done through the timer irq handler.
In pit_initfn, a new QEMUTimer is created and an irq handler is acquired.
The irq passed to pit_init needs explanation. Here are the initialization function calls:
pc_init1 (pc_piix.c):
    isa_irq = qemu_allocate_irqs(isa_irq_handler, …);
    ...
    pc_basic_device_init(isa_irq, …);
pc_basic_device_init (pc.c):
    pit = pit_init(0x40, isa_reserve_irq(0));
Here, the handler for the reserved irq 0 is allocated in pc_init1, so the handler is isa_irq_handler, which actually calls i8259_set_irq.
The i8259’s parent irq is the CPU irq, whose handler function is pic_irq_request. It modifies CPUState’s interrupt_request field, which finally delivers the irq to the virtual CPU.
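The chain described above (device irq, then interrupt controller, then CPU flag) can be modeled in a few lines. This is a heavily simplified sketch: the PIC here is purely level based with only a request register and a mask. The flag value, the struct layouts and the names cpu_irq_request and pic_set_irq are assumptions for illustration, not QEMU’s definitions.

```c
#include <assert.h>
#include <stdint.h>

#define CPU_INTERRUPT_HARD 0x02   /* flag value assumed for this sketch */

typedef struct CPU {
    uint32_t interrupt_request;   /* checked by the CPU execution loop */
} CPU;

typedef void (*irq_handler)(void *opaque, int n, int level);

typedef struct IRQ {
    irq_handler handler;
    void *opaque;
    int n;
} IRQ;

static void set_irq(IRQ *irq, int level)
{
    irq->handler(irq->opaque, irq->n, level);
}

/* Last stage, in the role of pic_irq_request: update the CPU flag. */
static void cpu_irq_request(void *opaque, int n, int level)
{
    CPU *cpu = opaque;
    (void)n;
    if (level) {
        cpu->interrupt_request |= CPU_INTERRUPT_HARD;
    } else {
        cpu->interrupt_request &= ~(uint32_t)CPU_INTERRUPT_HARD;
    }
}

typedef struct PIC {
    uint8_t irr;      /* lines currently requesting service */
    uint8_t imr;      /* masked lines */
    IRQ *parent;      /* output line, connected to the CPU */
} PIC;

/* Middle stage, in the role of i8259_set_irq: latch and forward. */
static void pic_set_irq(void *opaque, int n, int level)
{
    PIC *pic = opaque;
    assert(n >= 0 && n < 8);
    if (level) {
        pic->irr |= (uint8_t)(1u << n);
    } else {
        pic->irr &= (uint8_t)~(1u << n);
    }
    /* The output is high while any unmasked request is pending. */
    set_irq(pic->parent, (pic->irr & ~pic->imr) != 0);
}
```

The timer only knows the IRQ it was handed. Everything behind that IRQ is decided by how the handlers were wired together at init time.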
Alarm timer
To notify the CPU periodically, an interval timer is provided by qemu_alarm_timer:
struct qemu_alarm_timer {
    char const *name;
    int (*start)(struct qemu_alarm_timer *t);
    void (*stop)(struct qemu_alarm_timer *t);
    void (*rearm)(struct qemu_alarm_timer *t);
    void *priv;
    char expired;
    char pending;
};
Different systems can implement the alarm timer with different mechanisms. On Linux, it is implemented using timer_create(CLOCK_REALTIME) and is named “dynticks”. (Other alarm timers are “hpet”, “rtc” and “unix”.)
Each time the alarm timer fires, a signal is sent. The signal handler sets the alarm timer’s pending field if it has expired, so we know it has fired and can rearm it later. The most important thing is to notify the CPUs, so that they stop executing translated code and handle timers as soon as possible.
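The signal-plus-flag pattern can be sketched with plain POSIX calls. Note the swap: this example uses setitimer and SIGALRM instead of timer_create, only to avoid extra link flags, so it is not what “dynticks” does. The names alarm_pending, alarm_start and run_until_alarm are hypothetical, and the busy loop merely stands in for a CPU executing translated code.

```c
#define _XOPEN_SOURCE 700   /* for sigaction and setitimer under strict -std modes */

#include <assert.h>
#include <signal.h>
#include <string.h>
#include <sys/time.h>

/* Set from the signal handler, polled by the "CPU" loop. */
static volatile sig_atomic_t alarm_pending;

static void alarm_handler(int signum)
{
    (void)signum;
    alarm_pending = 1;      /* only async-signal-safe work happens here */
}

/* Install the handler and arm a one-shot timer. Returns 0 on success. */
static int alarm_start(long interval_us)
{
    struct sigaction act;
    struct itimerval itv;

    assert(interval_us > 0);    /* a zero value would disarm the timer */

    memset(&act, 0, sizeof(act));
    act.sa_handler = alarm_handler;
    sigemptyset(&act.sa_mask);
    if (sigaction(SIGALRM, &act, NULL) != 0) {
        return -1;
    }

    memset(&itv, 0, sizeof(itv));               /* it_interval = 0: one shot */
    itv.it_value.tv_sec = interval_us / 1000000;
    itv.it_value.tv_usec = interval_us % 1000000;
    return setitimer(ITIMER_REAL, &itv, NULL);
}

/* Stand-in for running translated code until the alarm interrupts it. */
static long run_until_alarm(void)
{
    long iterations = 0;
    while (!alarm_pending) {
        iterations++;
    }
    alarm_pending = 0;      /* acknowledged: timers would be run and rearmed here */
    return iterations;
}
```

The handler does nothing but set a flag. All real work, such as running expired timers and rearming, happens later in normal context.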
Module infrastructure
The module infrastructure provides easy initialization for devices, machines and other components in QEMU. The following description is based on QEMU v1.0.1.
Types of modules:
- block
- device
- machine
- qapi
In module.h, the module_init macro generates a function with the constructor attribute; this function registers the module initialization function. Each module type has a corresponding macro that invokes module_init with the module type specified.
Take the i8254 timer as an example
Each module type has a list holding all of its initialization function pointers. register_module_init finds the corresponding list, then adds the function pointer to the end of the list.
When is the initialization function called?
module_call_init gets the specified module type’s list of initialization functions and invokes them one by one. Simply grepping for this function finds the call sites. Here’s a list:
- vl.c: module_call_init(MODULE_INIT_MACHINE)
- vl.c: module_call_init(MODULE_INIT_DEVICE)
- block.c: module_call_init(MODULE_INIT_BLOCK)
- qemu-ga.c: module_call_init(MODULE_INIT_QAPI)
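Here is a self-contained sketch of the whole mechanism: a constructor registers an init function before main runs, and module_call_init runs the registered functions later. It follows the shape described above but is not QEMU’s code. The fixed-size arrays replace whatever list QEMU really uses, and demo_register_device is a hypothetical init function. The constructor attribute is a GCC/Clang extension.

```c
#include <assert.h>
#include <stddef.h>

typedef enum {
    MODULE_INIT_BLOCK,
    MODULE_INIT_DEVICE,
    MODULE_INIT_MACHINE,
    MODULE_INIT_QAPI,
    MODULE_INIT_MAX
} module_init_type;

#define MAX_INITS 16    /* arbitrary limit for this sketch */

/* One list of init functions per module type. */
static void (*init_lists[MODULE_INIT_MAX][MAX_INITS])(void);
static int init_counts[MODULE_INIT_MAX];

/* Append fn to the end of the list for the given type. */
static void register_module_init(void (*fn)(void), module_init_type type)
{
    assert(init_counts[type] < MAX_INITS);
    init_lists[type][init_counts[type]++] = fn;
}

/* Invoke every registered init function of one type, in order. */
static void module_call_init(module_init_type type)
{
    for (int i = 0; i < init_counts[type]; i++) {
        init_lists[type][i]();
    }
}

/* Generates a constructor that registers `function` before main(). */
#define module_init(function, type)                                         \
    static void __attribute__((constructor)) do_qemu_init_##function(void)  \
    {                                                                       \
        register_module_init(function, type);                               \
    }

#define device_init(function) module_init(function, MODULE_INIT_DEVICE)

/* Hypothetical device registration function. */
static int devices_registered;

static void demo_register_device(void)
{
    devices_registered++;
}

device_init(demo_register_device)
```

Registration and invocation are two separate steps: the constructor only records the function pointer, and nothing device specific runs until module_call_init(MODULE_INIT_DEVICE) is called.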
Coroutine in QEMU
Learning the coroutine concept
I recommend Simon Tatham’s article Coroutines in C.
The key is to provide a way for a function to ‘return and continue’. Here, continue means that on the next call, execution resumes at the point where the function last “returned”.
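A tiny example of the ‘return and continue’ trick in the style of that article: a switch statement jumps back into the middle of a loop on the next call. The function name next_number is made up, and the static state means only one sequence can be in progress at a time.

```c
#include <assert.h>

/*
 * Returns 0, 1, ..., limit - 1 on successive calls, then -1.
 * After returning -1 the sequence starts over.
 */
static int next_number(int limit)
{
    static int state = 0;   /* where to resume on the next call */
    static int i;           /* must be static to survive the "return" */

    assert(limit >= 0);

    switch (state) {
    case 0:
        for (i = 0; i < limit; i++) {
            state = 1;
            return i;       /* "return" ... */
    case 1:;                /* ... and "continue" from here next time */
        }
    }
    state = 0;
    return -1;
}
```

All the state that must survive across calls lives in static variables, which is exactly the limitation a coroutine with its own stack removes.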
QEMU’s implementation
Commit 00dccaf1f848 introduced coroutines. Here’s the commit message:
coroutine: introduce coroutines
Asynchronous code is becoming very complex. At the same time
synchronous code is growing because it is convenient to write.
Sometimes duplicate code paths are even added, one synchronous and the
other asynchronous. This patch introduces coroutines which allow code
that looks synchronous but is asynchronous under the covers.
A coroutine has its own stack and is therefore able to preserve state
across blocking operations, which traditionally require callback
functions and manual marshalling of parameters.
Creating and starting a coroutine is easy:
    coroutine = qemu_coroutine_create(my_coroutine);
    qemu_coroutine_enter(coroutine, my_data);
The coroutine then executes until it returns or yields:
    void coroutine_fn my_coroutine(void *opaque) {
        MyData *my_data = opaque;
        /* do some work */
        qemu_coroutine_yield();
        /* do some more work */
    }
Yielding switches control back to the caller of qemu_coroutine_enter().
This is typically used to switch back to the main thread's event loop
after issuing an asynchronous I/O request. The request callback will
then invoke qemu_coroutine_enter() once more to switch back to the
coroutine.
Note that if coroutines are used only from threads which hold the global
mutex they will never execute concurrently. This makes programming with
coroutines easier than with threads. Race conditions cannot occur since
only one coroutine may be active at any time. Other coroutines can only
run across yield.
This coroutines implementation is based on the gtk-vnc implementation
written by Anthony Liguori but it has been
significantly rewritten by Kevin Wolf to use
setjmp()/longjmp() instead of the more expensive swapcontext() and by
Paolo Bonzini for Windows Fibers support.
- On Debian Linux 6, makecontext is defined, so coroutine-ucontext.c is used to provide the coroutine implementation
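To show what a ucontext-based coroutine looks like at its simplest, here is a sketch using only getcontext, makecontext and swapcontext. It is not how coroutine-ucontext.c is written: per the commit message above, QEMU switches with setjmp()/longjmp() instead of the more expensive swapcontext(). The Coroutine struct, the function names and the 64 KiB stack size are assumptions, and the single global makes it single threaded only.

```c
#include <assert.h>
#include <ucontext.h>

#define STACK_SIZE (64 * 1024)

typedef struct Coroutine {
    ucontext_t ctx;                         /* the coroutine's saved context */
    ucontext_t caller;                      /* whoever entered it last */
    void (*entry)(struct Coroutine *co);
    int done;                               /* set once entry has returned */
    int value;                              /* scratch slot for the demo */
    char stack[STACK_SIZE];                 /* the coroutine's own stack */
} Coroutine;

static Coroutine *current;  /* coroutine being entered (single-threaded sketch) */

static void trampoline(void)
{
    Coroutine *co = current;
    co->entry(co);
    co->done = 1;
    /* Returning here resumes uc_link, i.e. the caller's context. */
}

static void coroutine_create(Coroutine *co, void (*entry)(Coroutine *))
{
    int rc = getcontext(&co->ctx);
    assert(rc == 0);
    (void)rc;
    co->ctx.uc_stack.ss_sp = co->stack;
    co->ctx.uc_stack.ss_size = sizeof(co->stack);
    co->ctx.uc_link = &co->caller;
    co->entry = entry;
    co->done = 0;
    makecontext(&co->ctx, trampoline, 0);
}

/* Switch into the coroutine; returns when it yields or finishes. */
static void coroutine_enter(Coroutine *co)
{
    assert(!co->done);
    current = co;
    swapcontext(&co->caller, &co->ctx);
}

/* Switch back to whoever called coroutine_enter. */
static void coroutine_yield(Coroutine *co)
{
    swapcontext(&co->ctx, &co->caller);
}

/* Demo body: does some work, yields, then does some more work. */
static void demo_entry(Coroutine *co)
{
    co->value = 1;
    coroutine_yield(co);
    co->value = 2;
}
```

Entering the coroutine twice runs demo_entry in two halves: the first enter returns at the yield, and the second resumes right after it, with local state preserved on the coroutine’s own stack.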
The coroutine implementation in QEMU is quite complicated. Here’s a graph showing the relationship between the related functions.
TCG related
A useful debugging technique for passing state from translated code to the outside is to store the information in a global variable. This can be helpful for debugging and for ugly hacks.
TODO
IO port emulation
All IO port related devices for PC emulation
Format: device address comment
- vmport 0x5658 (hw/vmport.c)
- firmware 0x511 (hw/fw_cfg.c)
- i8259 0x20,0x4d0,0xa0,0x4d1 Programmable Interrupt Controller (hw/i8259.c)
- i440FX/PIIX3 PCI Bridge 0xcf8, 0xcfc (hw/piix_pci.c)
- Cirrus VGA emulator 0x3c0, 0x3b4, 0x3d4, 0x3ba, 0x3da (hw/cirrus_vga.c)
- RTC emulation 0x70 (hw/mc146818rtc.c)
- i8254 interval timer 0x40 (hw/i8254.c)
- PC speaker 0x61 (hw/pcspk.c)
- UART 0x3f8 (hw/serial.c)
- Parallel port 0x378 (hw/parallel.c)
- PC keyboard 0x60, 0x64 (hw/pckbd.c)
- Port 92 0x92 (hw/pc.c)
- DMA 0x0~0xf,0x81~0x83,0x87,0x89,0x8a,0x8b,0x8f,0xc0~0xce/2,0xd0,0xd2~0xde/2 (hw/dma.c)
- Floppy disk 0x3f1,0x3f7 (hw/fdc.c)
- IDE 0x1f0,0x170,0x3f6,0x376 (hw/ide/core.c)
- ACPI 0xb2,0xb100,0xafe0,0xae00,0xae08 (hw/acpi_piix4.c)
- IDE PCI PIIX 3/4 0xc000,0xc004,0xc008,0xc00c (hw/ide/piix.c bmdma_map)
- Unknown 0xb000 (while executing TC, registers ioport_readx)