Random Tech Thoughts

The title above is not random

QEMU Source Code Notes

Here are some of my badly organized notes from reading QEMU’s source code. They cover only a few small parts of QEMU.

I’ll update this note when I encounter any of the following situations:

  • The implementation of some specific part is complicated
  • I forget something I previously understood

I find pictures easier to follow, and I use LucidChart as my drawing tool because it is the easiest one for me to use. I haven’t found a satisfactory way to represent code logic as a graph, though. I’ve tried various forms, such as plain flowcharts and UML sequence diagrams, so the style of my pictures is not consistent. Sorry for that.

CPU emulation

CPU initialization

The CPU initialization process is illustrated in the following picture.

CPU initialization process

CPU emulation thread

Starting from QEMU 1.0, all emulated CPUs run in a single thread. Here’s an overview of what this thread is doing.

CPU thread

Emulated Memory

Emulated memory allocation

Note: the following description applies to QEMU 0.13. This changed in QEMU 1.0.

The following picture is an overview of how emulated main memory is allocated.

allocating emulated memory

Let’s look into how the main memory is allocated in QEMU. In hw/pc_piix.c, pc_init1 initializes the whole PC and calls pc_memory_init to allocate emulated memory. The memory is actually allocated by qemu_ram_alloc, which is called in pc_memory_init.

QEMU needs to emulate many kinds of devices, and each kind needs its own memory, e.g. RAM, BIOS, VGA. QEMU uses a list to manage all the allocated device memory.

qemu_ram_alloc allocates a new RAMBlock node, which records:

  • host: the actual host memory (in QEMU)
  • length: the size of this memory region
  • offset: the ram_addr (more details follow)

To support pluggable devices, emulated memory needs to be releasable. This means the same guest physical memory address (the guest is the emulated target) may refer to different devices, and thus to different host memory, over time. I guess this is why QEMU creates an indirection between guest physical memory addresses and host memory, namely ram_addr. (I’m not sure about this, because I don’t see any problem with mapping guest physical memory directly to host memory.)

ram_addr is assigned when the memory is allocated. Using ram_list, it’s possible to convert between ram_addr and host address. Many places in the softmmu system use ram_addr to refer to the allocated memory.
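
The list and the two conversions described above can be sketched as follows. This is a simplified illustration, not QEMU’s code: every `sketch_` name and the `SketchRAMBlock` type are made up for this example, and the real RAMBlock has more fields than the three listed above.

```c
#include <stdint.h>
#include <stdlib.h>

typedef uint64_t ram_addr_t;            /* stand-in type for this sketch */

/* Simplified block node: only the three fields described above. */
typedef struct SketchRAMBlock {
    uint8_t *host;                      /* host memory backing this block */
    ram_addr_t offset;                  /* first ram_addr of this block */
    ram_addr_t length;                  /* size of the block in bytes */
    struct SketchRAMBlock *next;
} SketchRAMBlock;

static SketchRAMBlock *sketch_ram_list; /* head of the block list */
static ram_addr_t sketch_next_offset;   /* next unused ram_addr */

/* Allocate host memory and assign it the next free ram_addr range. */
static ram_addr_t sketch_ram_alloc(ram_addr_t size)
{
    SketchRAMBlock *b = malloc(sizeof(*b));
    if (!b) {
        abort();
    }
    b->host = calloc(1, (size_t)size);
    if (!b->host) {
        abort();
    }
    b->offset = sketch_next_offset;
    b->length = size;
    b->next = sketch_ram_list;
    sketch_ram_list = b;
    sketch_next_offset += size;
    return b->offset;
}

/* ram_addr -> host pointer: find the block that contains addr. */
static void *sketch_get_ram_ptr(ram_addr_t addr)
{
    for (SketchRAMBlock *b = sketch_ram_list; b; b = b->next) {
        /* Unsigned subtraction wraps for addr < offset, so one test suffices. */
        if (addr - b->offset < b->length) {
            return b->host + (addr - b->offset);
        }
    }
    return NULL;
}

/* host pointer -> ram_addr: the reverse walk. Returns 0 on success. */
static int sketch_ram_addr_from_host(const void *ptr, ram_addr_t *out)
{
    uintptr_t p = (uintptr_t)ptr;
    for (SketchRAMBlock *b = sketch_ram_list; b; b = b->next) {
        uintptr_t base = (uintptr_t)b->host;
        if (p - base < b->length) {
            *out = b->offset + (ram_addr_t)(p - base);
            return 0;
        }
    }
    return -1;
}
```

Both lookups are linear in the number of blocks, which is acceptable when there are only a handful of blocks (RAM, BIOS, VGA and so on).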

After memory is allocated, it needs to be registered by calling cpu_register_physical_memory_offset so that the softmmu system knows how to access it (through an IO function or a memory read/write function). To register memory, we need to specify which guest physical memory range the allocated memory represents. To speed up finding the ram_addr for a guest physical address (which is needed in functions like ldl_phys), the register function sets up l1_phys_map, a data structure very similar to a page table. (TODO: more details)
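
Until I fill in the details, here is a minimal sketch of what a page-table-like map from guest physical address to ram_addr can look like. It is not QEMU’s l1_phys_map: the two-level layout, the 32-bit guest address, the 4 KiB page size and all `sketch_`/`SK_` names are assumptions made for the example.

```c
#include <stdint.h>
#include <stdlib.h>

#define SK_PAGE_BITS  12                /* 4 KiB pages (assumed) */
#define SK_PAGE_SIZE  (1u << SK_PAGE_BITS)
#define SK_L2_BITS    10
#define SK_L1_BITS    10                /* 12 + 10 + 10 = 32-bit guest address */
#define SK_L1_SIZE    (1u << SK_L1_BITS)
#define SK_L2_SIZE    (1u << SK_L2_BITS)
#define SK_UNASSIGNED UINT64_MAX        /* marker: nothing registered here */

/* One entry per guest page: the ram_addr that backs the page. */
typedef struct SketchPhysPageDesc {
    uint64_t phys_offset;
} SketchPhysPageDesc;

/* Top level: pointers to lazily allocated second-level tables. */
static SketchPhysPageDesc *sketch_l1_map[SK_L1_SIZE];

/* Find the entry for paddr; optionally allocate the second level. */
static SketchPhysPageDesc *sketch_page_find(uint32_t paddr, int alloc)
{
    uint32_t index = paddr >> SK_PAGE_BITS;
    uint32_t i1 = index >> SK_L2_BITS;
    uint32_t i2 = index & (SK_L2_SIZE - 1);

    if (!sketch_l1_map[i1]) {
        if (!alloc) {
            return NULL;
        }
        SketchPhysPageDesc *t = malloc(SK_L2_SIZE * sizeof(*t));
        if (!t) {
            abort();
        }
        for (uint32_t i = 0; i < SK_L2_SIZE; i++) {
            t[i].phys_offset = SK_UNASSIGNED;
        }
        sketch_l1_map[i1] = t;
    }
    return &sketch_l1_map[i1][i2];
}

/* Register [start, start + size) as backed by ram_addr onwards.
 * start and size are assumed to be page aligned. */
static void sketch_register_physical_memory(uint32_t start, uint32_t size,
                                            uint64_t ram_addr)
{
    for (uint32_t off = 0; off < size; off += SK_PAGE_SIZE) {
        sketch_page_find(start + off, 1)->phys_offset = ram_addr + off;
    }
}

/* Guest physical address -> ram_addr, or SK_UNASSIGNED. */
static uint64_t sketch_phys_to_ram_addr(uint32_t paddr)
{
    SketchPhysPageDesc *p = sketch_page_find(paddr, 0);
    if (!p || p->phys_offset == SK_UNASSIGNED) {
        return SK_UNASSIGNED;
    }
    return p->phys_offset + (paddr & (SK_PAGE_SIZE - 1));
}
```

The lookup costs two array indexings regardless of how many ranges are registered, which is the point of using a page-table-like structure instead of scanning a list.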

Softmmu

The softmmu emulation uses C macros to emulate a template system. Several template header files are included multiple times by other files to generate functions for different memory access sizes and functions that access guest memory with different privileges.

The following picture shows the include relationships between the related files.

softmmu header file include relation
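
The macro-as-template idea can be approximated in a single file. Note the substitution: QEMU re-includes a template header with different parameters each time, whereas this sketch uses one function-generating macro so that it fits in one file. The `sketch_ld`/`sketch_st` names, the `SKETCH_TEMPLATE` macro and the flat `mem` buffer are all made up for the illustration.

```c
#include <stdint.h>
#include <string.h>

/* Token pasting with one level of indirection so arguments are expanded. */
#define glue_(a, b) a##b
#define glue(a, b)  glue_(a, b)

/* The "template": one load and one store function per access size. */
#define SKETCH_TEMPLATE(SUFFIX, DATA_TYPE)                                   \
static inline DATA_TYPE glue(sketch_ld, SUFFIX)(const uint8_t *mem,          \
                                                size_t addr)                 \
{                                                                            \
    DATA_TYPE v;                                                             \
    memcpy(&v, mem + addr, sizeof(v));                                       \
    return v;                                                                \
}                                                                            \
static inline void glue(sketch_st, SUFFIX)(uint8_t *mem, size_t addr,        \
                                           DATA_TYPE v)                      \
{                                                                            \
    memcpy(mem + addr, &v, sizeof(v));                                       \
}

/* "Instantiate" the template four times, once per access size. */
SKETCH_TEMPLATE(b, uint8_t)    /* sketch_ldb, sketch_stb */
SKETCH_TEMPLATE(w, uint16_t)   /* sketch_ldw, sketch_stw */
SKETCH_TEMPLATE(l, uint32_t)   /* sketch_ldl, sketch_stl */
SKETCH_TEMPLATE(q, uint64_t)   /* sketch_ldq, sketch_stq */
```

After preprocessing there are eight ordinary functions; for example `sketch_stl(mem, 4, value)` stores a 32-bit value and `sketch_ldl(mem, 4)` reads it back.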

IRQState and related function

There are 2 types of hardware IRQ:

  • Level sensitive: triggered while the signal level is high
  • Edge triggered: triggered when the signal produces a pulse

/* in hw/irq.h */

typedef void (*qemu_irq_handler)(void *opaque, int n, int level);


/* in hw/irq.c */
struct IRQState {
    qemu_irq_handler handler;
    void *opaque;
    int n;
};
/* In qemu-common.h: typedef struct IRQState *qemu_irq; */

void qemu_set_irq(qemu_irq irq, int level)
{
    if (!irq)
        return;

    irq->handler(irq->opaque, irq->n, level);
}

In hw/irq.c, qemu_set_irq calls the irq handler stored in IRQState. qemu_irq_raise, qemu_irq_lower and qemu_irq_pulse simply call qemu_set_irq to adjust the level of the IRQ.

qemu_allocate_irqs allocates memory for several qemu_irq, each pointing to a newly allocated IRQState. All of them share the same handler and opaque value; only n differs.
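
To make the pattern concrete, here is a self-contained sketch that mirrors the structure above with made-up `sketch_` names. The allocation strategy and the `SketchPins` example device are my own simplifications, not QEMU’s code.

```c
#include <stdlib.h>

typedef void (*sketch_irq_handler)(void *opaque, int n, int level);

typedef struct SketchIRQState {
    sketch_irq_handler handler;
    void *opaque;
    int n;
} SketchIRQState;

typedef SketchIRQState *sketch_irq;

static void sketch_set_irq(sketch_irq irq, int level)
{
    if (!irq) {
        return;
    }
    irq->handler(irq->opaque, irq->n, level);
}

/* The three helpers differ only in the level(s) they pass on. */
static void sketch_irq_raise(sketch_irq irq) { sketch_set_irq(irq, 1); }
static void sketch_irq_lower(sketch_irq irq) { sketch_set_irq(irq, 0); }
static void sketch_irq_pulse(sketch_irq irq)
{
    sketch_set_irq(irq, 1);
    sketch_set_irq(irq, 0);
}

/* Allocate count lines sharing handler and opaque; line i gets n == i. */
static sketch_irq *sketch_allocate_irqs(sketch_irq_handler handler,
                                        void *opaque, int count)
{
    sketch_irq *lines = malloc((size_t)count * sizeof(*lines));
    SketchIRQState *states = malloc((size_t)count * sizeof(*states));
    if (!lines || !states) {
        abort();
    }
    for (int i = 0; i < count; i++) {
        states[i].handler = handler;
        states[i].opaque = opaque;
        states[i].n = i;
        lines[i] = &states[i];
    }
    return lines;
}

/* Example device: tracks the level of each input pin in a bitmask. */
typedef struct SketchPins {
    unsigned levels;
    int events;
} SketchPins;

static void sketch_pins_handler(void *opaque, int n, int level)
{
    SketchPins *pins = opaque;
    if (level) {
        pins->levels |= 1u << n;
    } else {
        pins->levels &= ~(1u << n);
    }
    pins->events++;
}
```

A device would call `sketch_allocate_irqs(sketch_pins_handler, &pins, 8)` once and hand the resulting lines to whatever wants to signal it; the sender only ever sees an opaque `sketch_irq`.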

i8254 as an example

TODO

Timer

I’m not very familiar with x86 hardware, so the hardware-related descriptions here are quite likely to be wrong.

Host alarm

On x86, both the main board and the cpu cores have timers. I call the main board timer the host alarm; it corresponds to the PIT timer. More specifically, qemu emulates an i8254 timer attached to an i8259 interrupt controller.

QEMUTimer

pit_init creates a QEMUTimer whose callback is pit_irq_timer. The definition of QEMUTimer is:

struct QEMUTimer {
  QEMUClock *clock;
  int64_t expire_time;
  QEMUTimerCB *cb;
  void *opaque;
  struct QEMUTimer *next;
};

The clock type for the pit timer is vm_clock, which only runs when the vm runs. (There are also rt_clock and host_clock; the differences are described in comments in qemu-timer.h.) A timer must be put onto the active_timers list (using qemu_mod_timer) in order to run. main_loop_wait calls qemu_run_all_timers, which then invokes the callbacks of all expired timers (those whose expire time <= current time).
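
The active list can be pictured as a singly linked list sorted by expire time. The sketch below is a simplified illustration under that assumption; it drops the clock field, and every `sketch_` name is made up rather than taken from QEMU.

```c
#include <stddef.h>
#include <stdint.h>

typedef void SketchTimerCB(void *opaque);

typedef struct SketchTimer {
    int64_t expire_time;
    SketchTimerCB *cb;
    void *opaque;
    struct SketchTimer *next;
} SketchTimer;

/* Active timers, kept sorted by ascending expire_time. */
static SketchTimer *sketch_active_timers;

/* Remove t from the active list if it is there. */
static void sketch_del_timer(SketchTimer *t)
{
    for (SketchTimer **pt = &sketch_active_timers; *pt; pt = &(*pt)->next) {
        if (*pt == t) {
            *pt = t->next;
            break;
        }
    }
}

/* (Re)arm t: unlink it, then insert it at its sorted position. */
static void sketch_mod_timer(SketchTimer *t, int64_t expire_time)
{
    SketchTimer **pt;

    sketch_del_timer(t);
    t->expire_time = expire_time;
    for (pt = &sketch_active_timers; *pt; pt = &(*pt)->next) {
        if ((*pt)->expire_time > expire_time) {
            break;
        }
    }
    t->next = *pt;
    *pt = t;
}

/* Run every timer with expire_time <= now; returns how many fired. */
static int sketch_run_timers(int64_t now)
{
    int fired = 0;

    while (sketch_active_timers && sketch_active_timers->expire_time <= now) {
        SketchTimer *t = sketch_active_timers;
        sketch_active_timers = t->next;   /* unlink before the callback */
        t->next = NULL;
        t->cb(t->opaque);                 /* callback may re-arm the timer */
        fired++;
    }
    return fired;
}

/* Example callback: counts how often it was invoked. */
static void sketch_count_cb(void *opaque)
{
    ++*(int *)opaque;
}
```

Because the list is sorted, running the timers only has to look at the head: as soon as the first timer is not yet due, none of the others are either.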

QEMUTimer only describes when the callback should be called. Because QEMU uses block chaining, a virtual cpu may execute a lot of translated code without returning to the main loop, so timer callbacks would not run promptly. QEMU therefore uses an alarm timer (introduced later) to avoid waiting too long before handling a timer interrupt.

Timer IRQ handler

When the timer fires, it should send an interrupt to the CPU. This is done through the timer irq handler.

In pit_initfn, a new QEMUTimer is created, and an irq handler is acquired from

The irq passed to pit_init needs explanation. Here are the relevant initialization calls:

pc_init1 (pc_piix.c):
  isa_irq = qemu_allocate_irqs(isa_irq_handler, …);
  ...
  pc_basic_device_init(isa_irq, …);

pc_basic_device_init (pc.c):
  pit = pit_init(0x40, isa_reserve_irq(0));

Here, the reserved irq 0’s handler is allocated in pc_init1, so the handler is isa_irq_handler, which actually calls i8259_set_irq.

The parent irq of the i8259 is the cpu irq, whose handler is pic_irq_request. It modifies the interrupt_request field of CPUState, which finally delivers the irq to the virtual cpu.
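
The chain from timer to cpu can be sketched as three handlers calling each other. This is only a toy model of the flow described above: the real i8259 also handles masking and priorities, and the `Sketch` types, the `sketch_` functions and the flag value are all invented for the example.

```c
#include <stdint.h>

#define SKETCH_INTERRUPT_HARD 0x02u   /* flag value assumed for this sketch */

typedef struct SketchCPU {
    uint32_t interrupt_request;       /* pending interrupt flags */
} SketchCPU;

typedef struct SketchPIC {
    uint8_t irr;                      /* one bit per raised input line */
    SketchCPU *cpu;                   /* the parent that receives the output */
} SketchPIC;

/* End of the chain: set or clear the pending hardware interrupt flag. */
static void sketch_cpu_irq_request(void *opaque, int n, int level)
{
    SketchCPU *cpu = opaque;
    (void)n;
    if (level) {
        cpu->interrupt_request |= SKETCH_INTERRUPT_HARD;
    } else {
        cpu->interrupt_request &= ~SKETCH_INTERRUPT_HARD;
    }
}

/* Middle: the interrupt controller records the line, then drives its output. */
static void sketch_pic_set_irq(void *opaque, int n, int level)
{
    SketchPIC *pic = opaque;
    if (level) {
        pic->irr |= (uint8_t)(1u << n);
    } else {
        pic->irr &= (uint8_t)~(1u << n);
    }
    sketch_cpu_irq_request(pic->cpu, 0, pic->irr != 0);
}

/* Start: the ISA irq handler just forwards to the interrupt controller. */
static void sketch_isa_irq_handler(void *opaque, int n, int level)
{
    sketch_pic_set_irq(opaque, n, level);
}

/* What a timer callback would do when it expires: raise irq 0. */
static void sketch_pit_timer_fired(SketchPIC *pic)
{
    sketch_isa_irq_handler(pic, 0, 1);
}
```

The cpu loop would then check `interrupt_request` between translated blocks and deliver the interrupt when the flag is set.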

Alarm timer

To periodically notify the cpu, an interval timer is provided by qemu_alarm_timer:

struct qemu_alarm_timer {
    char const *name;
    int (*start)(struct qemu_alarm_timer *t);
    void (*stop)(struct qemu_alarm_timer *t);
    void (*rearm)(struct qemu_alarm_timer *t);
    void *priv;

    char expired;
    char pending;
};

Different systems implement the alarm timer with different mechanisms. On Linux, it is implemented using timer_create(CLOCK_REALTIME) and is named “dynticks”. (Other alarm timers are “hpet”, “rtc” and “unix”.)

Each time the alarm timer fires, a signal is sent. The signal handler sets the alarm timer’s pending field if it has expired, so we know it fired and can rearm it later. Most importantly, the handler notifies the cpus so that they stop executing translated code and handle timers as soon as possible.
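
The signal-driven handshake can be sketched like this. To keep the example deterministic, the host timer is replaced by an explicit raise(SIGALRM) inside the fake execution loop; a real implementation would arm an actual timer instead. All `sketch_` names are hypothetical, and the struct keeps only the pending flag.

```c
#include <signal.h>

/* Flags shared between the signal handler and normal code. */
static volatile sig_atomic_t sketch_alarm_pending;
static volatile sig_atomic_t sketch_cpu_exit_request;

/* Signal handler: remember that the alarm fired and kick the cpu loop. */
static void sketch_alarm_handler(int signum)
{
    (void)signum;
    sketch_alarm_pending = 1;
    sketch_cpu_exit_request = 1;
}

/* Install the handler. Returns 0 on success, -1 on failure. */
static int sketch_alarm_start(void)
{
    return signal(SIGALRM, sketch_alarm_handler) == SIG_ERR ? -1 : 0;
}

/* Stand-in for running translated code: "executes" blocks until the
 * limit is reached or an exit was requested. fire_at simulates the host
 * timer firing after that many blocks (0 means it never fires). */
static long sketch_cpu_exec(long max_blocks, long fire_at)
{
    long executed = 0;

    while (executed < max_blocks && !sketch_cpu_exit_request) {
        executed++;
        if (executed == fire_at) {
            raise(SIGALRM);           /* simulated alarm timer expiry */
        }
    }
    sketch_cpu_exit_request = 0;
    return executed;
}

/* Main loop side: if the alarm fired, this is where timers would be run
 * and the alarm rearmed. Returns 1 if there was work to do. */
static int sketch_main_loop_iteration(void)
{
    if (!sketch_alarm_pending) {
        return 0;
    }
    sketch_alarm_pending = 0;
    return 1;
}
```

The handler itself only sets flags, which is about all that is safe to do in a signal handler; the actual timer work happens later in the main loop.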

Module infrastructure

The module infrastructure provides easy initialization for devices, machines and other components in QEMU. The following description is based on QEMU v1.0.1.

Types of modules:

  • block
  • device
  • machine
  • qapi

In module.h, the module_init macro generates a function with the constructor attribute; this function registers the module initialization function. Each module type has a corresponding macro that invokes module_init with that type specified.

#define module_init(function, type)                                         \
static void __attribute__((constructor)) do_qemu_init_ ## function(void) {  \
    register_module_init(function, type);                                   \
}

#define block_init(function) module_init(function, MODULE_INIT_BLOCK)
#define device_init(function) module_init(function, MODULE_INIT_DEVICE)
#define machine_init(function) module_init(function, MODULE_INIT_MACHINE)
#define qapi_init(function) module_init(function, MODULE_INIT_QAPI)

Take the i8254 timer as an example:

/* In hw/i8254.c */
static void pit_register(void)
{
    isa_qdev_register(&pit_info);
}
device_init(pit_register)

/* Generated function after macro expansion. */
static void __attribute__((constructor)) do_qemu_init_pit_register(void) {
    register_module_init(pit_register, MODULE_INIT_DEVICE);
}

Each module type has a list holding all of its initialization function pointers. register_module_init finds the corresponding list, then adds the function pointer to the end of it.

When is the initialization function called?

module_call_init gets the initialization function list of the specified module type and invokes the functions one by one. A simple grep for this function finds the call sites. Here’s the list:

  • vl.c: module_call_init(MODULE_INIT_MACHINE)
  • vl.c: module_call_init(MODULE_INIT_DEVICE)
  • block.c: module_call_init(MODULE_INIT_BLOCK)
  • qemu-ga.c: module_call_init(MODULE_INIT_QAPI)
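
The whole mechanism, from constructor registration to the later call, fits in a small sketch. It relies on the GCC/Clang constructor attribute, uses a fixed-size array per type where the description above talks about a list, and all `sketch_`/`SK_` names are made up for the illustration.

```c
#include <stddef.h>

typedef enum {
    SK_INIT_BLOCK,
    SK_INIT_DEVICE,
    SK_INIT_MACHINE,
    SK_INIT_QAPI,
    SK_INIT_MAX
} sketch_module_type;

#define SK_MAX_FNS 16

/* One table of init functions per module type, zero-initialized. */
static void (*sketch_fns[SK_INIT_MAX][SK_MAX_FNS])(void);
static size_t sketch_count[SK_INIT_MAX];

/* Append fn to the table of the given type (dropped if the table is full). */
static void sketch_register_module_init(void (*fn)(void),
                                        sketch_module_type type)
{
    if (sketch_count[type] < SK_MAX_FNS) {
        sketch_fns[type][sketch_count[type]++] = fn;
    }
}

/* Invoke every registered init function of the given type, in order. */
static void sketch_module_call_init(sketch_module_type type)
{
    for (size_t i = 0; i < sketch_count[type]; i++) {
        sketch_fns[type][i]();
    }
}

/* Generates a constructor that runs before main() and only registers. */
#define sketch_module_init(function, type)                                   \
static void __attribute__((constructor)) sketch_do_init_##function(void)    \
{                                                                            \
    sketch_register_module_init(function, type);                             \
}

#define sketch_device_init(function) \
    sketch_module_init(function, SK_INIT_DEVICE)

/* Example module: its init function just counts how often it ran. */
static int sketch_pit_registered;

static void sketch_pit_register(void)
{
    sketch_pit_registered++;
}
sketch_device_init(sketch_pit_register)
```

By the time main() starts, the function is already in the device table but has not run yet; it runs only when `sketch_module_call_init(SK_INIT_DEVICE)` is called. This split lets each source file register itself without any central list of modules.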

Coroutine in QEMU

Learning the coroutine concept

I recommend Simon Tatham’s article Coroutines in C.

The key is to provide a way for a function to “return and continue”. Here, continue means that on the next call, execution resumes at the point where the function last “returned”.
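
Here is a minimal sketch of “return and continue” in the switch-based style that article describes. The function name is made up, and this is not how QEMU implements coroutines (QEMU’s coroutines have their own stacks).

```c
/* Returns 1, 2, 3 on successive calls, then -1 on every later call. */
static int sketch_counter(void)
{
    static int state = 0;   /* records where to resume on the next call */
    static int i;           /* static so it survives across "returns" */

    switch (state) {
    case 0:                 /* first call starts here */
        for (i = 1; i <= 3; i++) {
            state = 1;
            return i;       /* "return" the value ... */
    case 1:;                /* ... and "continue" here on the next call */
        }
    }
    state = 2;              /* finished: no case label matches any more */
    return -1;
}
```

The switch jumps straight into the middle of the loop on re-entry, so the loop picks up where it left off. The price is that all state that must survive a “return” has to live in static (or heap) storage, because the stack frame is gone.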

QEMU’s implementation

Commit 00dccaf1f848 introduces coroutines. Here’s the commit message:

coroutine: introduce coroutines

Asynchronous code is becoming very complex.  At the same time
synchronous code is growing because it is convenient to write.
Sometimes duplicate code paths are even added, one synchronous and the
other asynchronous.  This patch introduces coroutines which allow code
that looks synchronous but is asynchronous under the covers.

A coroutine has its own stack and is therefore able to preserve state
across blocking operations, which traditionally require callback
functions and manual marshalling of parameters.

Creating and starting a coroutine is easy:

  coroutine = qemu_coroutine_create(my_coroutine);
  qemu_coroutine_enter(coroutine, my_data);

The coroutine then executes until it returns or yields:

  void coroutine_fn my_coroutine(void *opaque) {
      MyData *my_data = opaque;

      /* do some work */

      qemu_coroutine_yield();

      /* do some more work */
  }

Yielding switches control back to the caller of qemu_coroutine_enter().
This is typically used to switch back to the main thread's event loop
after issuing an asynchronous I/O request.  The request callback will
then invoke qemu_coroutine_enter() once more to switch back to the
coroutine.

Note that if coroutines are used only from threads which hold the global
mutex they will never execute concurrently.  This makes programming with
coroutines easier than with threads.  Race conditions cannot occur since
only one coroutine may be active at any time.  Other coroutines can only
run across yield.

This coroutines implementation is based on the gtk-vnc implementation
written by Anthony Liguori but it has been significantly rewritten by
Kevin Wolf to use setjmp()/longjmp() instead of the more expensive
swapcontext() and by Paolo Bonzini for Windows Fibers support.

  • On Debian Linux 6, makecontext is defined, so coroutine-ucontext.c is used to provide the coroutine implementation
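
To show what makecontext gives you, here is a small stack-switching sketch. It is deliberately simpler than coroutine-ucontext.c: it switches with swapcontext() throughout, whereas the commit message says QEMU uses setjmp()/longjmp() for the switches. It supports a single coroutine, and all `sketch_` names are hypothetical.

```c
#include <ucontext.h>

static ucontext_t sketch_caller_ctx;      /* context of whoever entered */
static ucontext_t sketch_co_ctx;          /* context of the coroutine */
static char sketch_co_stack[64 * 1024];   /* the coroutine's own stack */

static volatile int sketch_steps;         /* progress made by the body */
static volatile int sketch_finished;

/* Yield: save the coroutine context and switch back to the caller. */
static void sketch_yield(void)
{
    swapcontext(&sketch_co_ctx, &sketch_caller_ctx);
}

/* Coroutine body: local state lives on sketch_co_stack across the yield. */
static void sketch_co_body(void)
{
    sketch_steps = 1;
    sketch_yield();
    sketch_steps = 2;
    sketch_finished = 1;
    /* Returning switches to uc_link, i.e. back to the caller context. */
}

/* Prepare the coroutine context. Returns 0 on success, -1 on failure. */
static int sketch_co_create(void)
{
    if (getcontext(&sketch_co_ctx) != 0) {
        return -1;
    }
    sketch_co_ctx.uc_stack.ss_sp = sketch_co_stack;
    sketch_co_ctx.uc_stack.ss_size = sizeof(sketch_co_stack);
    sketch_co_ctx.uc_link = &sketch_caller_ctx;
    makecontext(&sketch_co_ctx, sketch_co_body, 0);
    return 0;
}

/* Enter (or re-enter) the coroutine; returns when it yields or finishes. */
static void sketch_enter(void)
{
    swapcontext(&sketch_caller_ctx, &sketch_co_ctx);
}
```

The first `sketch_enter()` runs the body up to the yield, and the second one resumes after the yield and runs it to the end. Unlike the switch-based trick, ordinary local variables in the body survive a yield because the body has a stack of its own.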

The coroutine implementation in QEMU is quite complicated. Here’s a graph showing the relationship between the related functions.

QEMU's coroutine implementation

TCG related

A useful debugging technique for passing state from translated code to the outside is to store the information in a global variable. This can be helpful for debugging and for ugly hacks.

TODO

IO port emulation

All IO port related devices for PC emulation

Format: device address comment

  • vmport 0x5658 (hw/vmport.c)
  • firmware 0x511 (hw/fw_cfg.c)
  • i8259 0x20,0x4d0,0xa0,0x4d1 Programmable Interrupt Controller (hw/i8259.c)
  • i440FX/PIIX3 PCI Bridge 0xcf8, 0xcfc (hw/piix_pci.c)
  • Cirrus VGA emulator 0x3c0, 0x3b4, 0x3d4, 0x3ba, 0x3da (hw/cirrus_vga.c)
  • RTC emulation 0x70 (hw/mc146818rtc.c)
  • i8254 interval timer 0x40 (hw/i8254.c)
  • PC speaker 0x61 (hw/pcspk.c)
  • UART 0x3f8 (hw/serial.c)
  • Parallel port 0x378 (hw/parallel.c)
  • PC keyboard 0x60, 0x64 (hw/pckbd.c)
  • Port 92 0x92 (hw/pc.c)
  • DMA 0x0~0xf,0x81~0x83,0x87,0x89,0x8a,0x8b,0x8f,0xc0~0xce/2,0xd0,0xd2~0xde/2 (hw/dma.c)
  • Floppy disk 0x3f1,0x3f7 (hw/fdc.c)
  • IDE 0x1f0,0x170,0x3f6,0x376 (hw/ide/core.c)
  • ACPI 0xb2,0xb100,0xafe0,0xae00,0xae08 (hw/acpi_piix4.c)
  • IDE PCI PIIX3/4 0xc000,0xc004,0xc008,0xc00c (hw/ide/piix.c bmdma_map)
  • Unknown 0xb000 (while executing TC, registers ioport_readx)