Post

Creating an emulator - Dev Log #1

Deep dive into the ELF format

Around a year ago I decided to try to build an emulator to see how they work, how hard it is to emulate a system and maybe to play some really old video games.

I chose Java because I don’t want to recompile everything everytime for all the platforms that I want to use the emulator on. After that, I defined my first steps. Start little and easy. This meant that the first platform to emulate was the one that I understood the most: Linux + x86.

  1. Understand the ELF file format and build a parser for it
  2. Understand how to read the assembly contained in an ELF file and build an x86 assembly parser/decoder
  3. Understand how to execute a file by properly emulating the system around it (and also learn what a “calling convention” is)
  4. Understand “advanced execution features” like: dynamic linking, exceptions, processes, threads, file descriptors
  5. Do the above steps for EXE/PE files (optional)
  6. Do the above steps for other architectures like ARM (optional per se, but may be required to emulate different game consoles in the future)
  7. Select the desired console and apply the knowledge

My dream would be to emulate the PlayStation 2 to run Jak 3, but I think I will need to start from something much simpler like the NES.

Spoiler alert: right now I am at step 3, so this post and the next couple ones will be more of a recap on all that’s happened rather than an actual dev log. If you want to see the current state of the project, you can find it on GitHub.

After this needed premise, let’s jump in today’s topic.

The ELF file format

A bit of history

The Executable and Linkable Format (ELF for short) file format is the format used for executable files on Linux systems. Just like the .exe files on Windows but, unlike those, they don’t have a characteristic file extension because they rely on their magic byte sequence: 0x7f454c46 (0x7f 'E' 'L' 'F' in ASCII).

Actually it is used also for libraries (both shared and static) and for core files for debugging.

And, actually, it is not for Linux systems, it is a standard (just like POSIX, for example) which the Linux kernel adheres to.

It was invented/established around 1993 (the oldest version of the standard I can find is the 1.1 from october 1993 here) as a 32-bit format and later (in 1995) extended to 64 bits.

In case you didn’t realize it yet, Python is older than the ELF file format.

The structure of a file

An ELF file, regardless of its purpose (executable, library or corefile), is composed of three elements:

  • the file header
  • the program header table
  • the section table

The File Header

The file header contains some general information about the whole file like:

  • the type (e_type) of the file: executable, core, library etc.
  • the OS ABI (EI_OSABI) required to use it correctly
  • the ISA (e_machine) of the target machine

But, most importantly, contains:

  • a byte to specify the format (EI_CLASS) of the data in the file: 32-bit or 64-bit
  • a byte to specify the endianness (EI_DATA): little-endian or big-endian

These two fields are really really important for parsing the file correctly since, speaking from personal experience, not only most of the fields in the file go from 4 to 8 bytes but also the layout of some section changes (like the entries of a symbol table section).

The Program Header Table

The Program Header Table (PHT) is a collection of information about memory regions during execution. At least, this is their purpose inside executable files, I still haven’t tried with libraries or corefiles. These memory regions are actually called segments (I haven’t really checked this one, but I guess they are similar or the same as segments in segmented memory). This information about memory mapping can be seen when using GNU’s readelf utility like so:

1
readelf -l <your_elf_file>

which will display something like this for /bin/bash:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
Elf file type is DYN (Position-Independent Executable file)
Entry point 0x32ef0
There are 13 program headers, starting at offset 64

Program Headers:
  Type           Offset             VirtAddr           PhysAddr
                 FileSiz            MemSiz              Flags  Align
  PHDR           0x0000000000000040 0x0000000000000040 0x0000000000000040
                 0x00000000000002d8 0x00000000000002d8  R      0x8
  INTERP         0x0000000000000318 0x0000000000000318 0x0000000000000318
                 0x000000000000001c 0x000000000000001c  R      0x1
      [Requesting program interpreter: /lib64/ld-linux-x86-64.so.2]
  LOAD           0x0000000000000000 0x0000000000000000 0x0000000000000000
                 0x000000000002e188 0x000000000002e188  R      0x1000
  LOAD           0x000000000002f000 0x000000000002f000 0x000000000002f000
                 0x00000000000def6d 0x00000000000def6d  R E    0x1000
  LOAD           0x000000000010e000 0x000000000010e000 0x000000000010e000
                 0x0000000000039b08 0x0000000000039b08  R      0x1000
  LOAD           0x0000000000148a90 0x0000000000149a90 0x0000000000149a90
                 0x000000000000bbc0 0x0000000000016b28  RW     0x1000
  DYNAMIC        0x000000000014b4c0 0x000000000014c4c0 0x000000000014c4c0
                 0x0000000000000200 0x0000000000000200  RW     0x8
  NOTE           0x0000000000000338 0x0000000000000338 0x0000000000000338
                 0x0000000000000030 0x0000000000000030  R      0x8
  NOTE           0x0000000000000368 0x0000000000000368 0x0000000000000368
                 0x0000000000000044 0x0000000000000044  R      0x4
  GNU_PROPERTY   0x0000000000000338 0x0000000000000338 0x0000000000000338
                 0x0000000000000030 0x0000000000000030  R      0x8
  GNU_EH_FRAME   0x00000000001278a8 0x00000000001278a8 0x00000000001278a8
                 0x000000000000472c 0x000000000000472c  R      0x4
  GNU_STACK      0x0000000000000000 0x0000000000000000 0x0000000000000000
                 0x0000000000000000 0x0000000000000000  RW     0x10
  GNU_RELRO      0x0000000000148a90 0x0000000000149a90 0x0000000000149a90
                 0x0000000000003570 0x0000000000003570  R      0x1

 Section to Segment mapping:
  Segment Sections...
   00
   01     .interp
   02     .interp .note.gnu.property .note.gnu.build-id .note.ABI-tag .gnu.hash .dynsym .dynstr .gnu.version .gnu.version_r .rela.dyn .rela.plt
   03     .init .plt .plt.got .plt.sec .text .fini
   04     .rodata .eh_frame_hdr .eh_frame
   05     .init_array .fini_array .data.rel.ro .dynamic .got .data .bss
   06     .dynamic
   07     .note.gnu.property
   08     .note.gnu.build-id .note.ABI-tag
   09     .note.gnu.property
   10     .eh_frame_hdr
   11
   12     .init_array .fini_array .data.rel.ro .dynamic .got

Not all segments actually refer to a specific memory region, some of them (like GNU_STACK) have a memory size field (called p_memsize) of zero, meaning that they are allocated during execution. Not every section needs to be mapped into the memory during execution, some of them are not loadable. Also, the same segment can host multiple sections at once (like the overcrowded segment 2) and the same section can be mapped to different segments at the same time (like .note.gnu.property), which means it will be loaded multiple times in different parts of the memory. There is some cool stuff about segment permissions that I will touch later, for now let’s talk about the most important part of an ELF file.

The Section Table

The Section Table (ST) is a bit different from the PHT because each section has both a header and an optional content in the file (unlike program headers whose content is defined only during execution). All section headers are the same, while the content (for those who have it) may change drastically: some have a fixed size content, some have a big block of binary data, some are actually a table of fixed size entries, some need a little parser of their own because of offsets and overlapping tables. While the PHT contains information for the program’s execution, the ST contains just about anything. This last part was a bit of a shock for me because I thought, naively, that a binary program contained only assembly.

Some sections are defined by the standard, while others are OS-specific or CPU-specific. Let’s look at a really simple example:

1
2
3
4
5
6
#include <stdio.h>

int main() {
    printf("Hello world!\n");
    return 0;
}

Compiled with gcc hello.c -o hello and analyzed with readelf -S hello:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
There are 31 section headers, starting at offset 0x3698:

Section Headers:
  [Nr] Name              Type             Address           Offset
       Size              EntSize          Flags  Link  Info  Align
  [ 0]                   NULL             0000000000000000  00000000
       0000000000000000  0000000000000000           0     0     0
  [ 1] .interp           PROGBITS         0000000000000318  00000318
       000000000000001c  0000000000000000   A       0     0     1
  [ 2] .note.gnu.pr[...] NOTE             0000000000000338  00000338
       0000000000000030  0000000000000000   A       0     0     8
  [ 3] .note.gnu.bu[...] NOTE             0000000000000368  00000368
       0000000000000024  0000000000000000   A       0     0     4
  [ 4] .note.ABI-tag     NOTE             000000000000038c  0000038c
       0000000000000020  0000000000000000   A       0     0     4
  [ 5] .gnu.hash         GNU_HASH         00000000000003b0  000003b0
       0000000000000024  0000000000000000   A       6     0     8
  [ 6] .dynsym           DYNSYM           00000000000003d8  000003d8
       00000000000000a8  0000000000000018   A       7     1     8
  [ 7] .dynstr           STRTAB           0000000000000480  00000480
       000000000000008d  0000000000000000   A       0     0     1
  [ 8] .gnu.version      VERSYM           000000000000050e  0000050e
       000000000000000e  0000000000000002   A       6     0     2
  [ 9] .gnu.version_r    VERNEED          0000000000000520  00000520
       0000000000000030  0000000000000000   A       7     1     8
  [10] .rela.dyn         RELA             0000000000000550  00000550
       00000000000000c0  0000000000000018   A       6     0     8
  [11] .rela.plt         RELA             0000000000000610  00000610
       0000000000000018  0000000000000018  AI       6    24     8
  [12] .init             PROGBITS         0000000000001000  00001000
       000000000000001b  0000000000000000  AX       0     0     4
  [13] .plt              PROGBITS         0000000000001020  00001020
       0000000000000020  0000000000000010  AX       0     0     16
  [14] .plt.got          PROGBITS         0000000000001040  00001040
       0000000000000010  0000000000000010  AX       0     0     16
  [15] .plt.sec          PROGBITS         0000000000001050  00001050
       0000000000000010  0000000000000010  AX       0     0     16
  [16] .text             PROGBITS         0000000000001060  00001060
       0000000000000107  0000000000000000  AX       0     0     16
  [17] .fini             PROGBITS         0000000000001168  00001168
       000000000000000d  0000000000000000  AX       0     0     4
  [18] .rodata           PROGBITS         0000000000002000  00002000
       0000000000000011  0000000000000000   A       0     0     4
  [19] .eh_frame_hdr     PROGBITS         0000000000002014  00002014
       0000000000000034  0000000000000000   A       0     0     4
  [20] .eh_frame         PROGBITS         0000000000002048  00002048
       00000000000000ac  0000000000000000   A       0     0     8
  [21] .init_array       INIT_ARRAY       0000000000003db8  00002db8
       0000000000000008  0000000000000008  WA       0     0     8
  [22] .fini_array       FINI_ARRAY       0000000000003dc0  00002dc0
       0000000000000008  0000000000000008  WA       0     0     8
  [23] .dynamic          DYNAMIC          0000000000003dc8  00002dc8
       00000000000001f0  0000000000000010  WA       7     0     8
  [24] .got              PROGBITS         0000000000003fb8  00002fb8
       0000000000000048  0000000000000008  WA       0     0     8
  [25] .data             PROGBITS         0000000000004000  00003000
       0000000000000010  0000000000000000  WA       0     0     8
  [26] .bss              NOBITS           0000000000004010  00003010
       0000000000000008  0000000000000000  WA       0     0     1
  [27] .comment          PROGBITS         0000000000000000  00003010
       000000000000002b  0000000000000001  MS       0     0     1
  [28] .symtab           SYMTAB           0000000000000000  00003040
       0000000000000360  0000000000000018          29    18     8
  [29] .strtab           STRTAB           0000000000000000  000033a0
       00000000000001db  0000000000000000           0     0     1
  [30] .shstrtab         STRTAB           0000000000000000  0000357b
       000000000000011a  0000000000000000           0     0     1
Key to Flags:
  W (write), A (alloc), X (execute), M (merge), S (strings), I (info),
  L (link order), O (extra OS processing required), G (group), T (TLS),
  C (compressed), x (unknown), o (OS specific), E (exclude),
  D (mbind), l (large), p (processor specific)

There is a lot of stuff in here for just a simple hello world. Let’s start from the beginning:

  • .text is the most important section: it contains the code to be executed and somewhere in the file header there is the entry point field (e_entry) telling the machine where to start executing. Not all the code of your program is here, there are some ways to put some of your code in other sections to get some specific behavior, but it’s safe to say that most of the code you wrote for your program ends up in here.
  • .data is where the initialized global variables are stored. In our case, it’s empty because we do not have anything preallocated or allocated statically. Discovering this was quite a shock for me because I thought that static arrays were allocated and filled with data at the beginning of the program. With actual code. But no, and this is actually more efficient since it boils down to just a simple memcpy at load time.
  • .rodata stands for read-only data. It is exactly the same as the .data section except that it contains only constants and immutable data. You can see this in its permissions: it has only A, meaning it is allocatable, while .data has WA, meaning allocatable and writable.

Also, as you may have seen from the readelf output, there are a lot of different section types:

  • PROGBITS is the simplest one: it’s just a big block of memory that needs to be loaded as-is. The most important sections of your file (.text, .data, .rodata) are of this type.
  • NOBITS is, somehow, even simpler: it’s an uninitialized block of memory. This means that it has a size of 0 bytes inside the file while a certain size in the memory. It’s basically a malloc call at load-time.
  • STRTAB is the type of a string table. A string table contains a block of 1-byte characters (ASCII) with each string ending with a \0 byte. Some sections use these string tables with a simple index from which the corresponding string can be retrieved by iterating until the zero byte. This allows saving memory with a lot of strings with a common prefix, but I haven’t seen any prefix being used in the files I’ve found.
  • SYMTAB represents a symbol table. In case you’ve never heard of it, that’s the best friend of a linker. It contains all the information about the symbols that are exposed in your program such as functions and global variables.
  • DYNSYM is the same thing as the symbol table, but dynamic. This means that it’s actually used during dynamic linking.
  • DYNAMIC is the type of a section that holds information needed for dynamic linking like the names of libraries to link.
  • NOTE is the type of sections holding various info that is not indispensable for execution nor linking but may be useful. For example, the .note.gnu.build-id section holds a 256-bit string generate during compilation, to identify a certain executable, I think. Another example, the .note.gnu-property contains some information about the specific ISA needed for the file.
  • HASH and GNU_HASH are sections which contain hash tables for fast lookup of symbols inside the symbol tables. I’ve never actually measured how much time is spent doing symbol lookup in large projects with thousands of symbols (and I wouldn’t even know how to measure it), but I guess it’s a nice to have.
  • REL and RELA are types of relocations sections. Relocations and relocations with addends, respectively. As far as I know, they tell the linker how and where to relocate a certain symbol in a symbol table, in case it is needed. I don’t actually know how they work and how to use them because I didn’t need them. Maybe I will learn it when implementing dynamic linking.
  • INIT_ARRAY and FINI_ARRAY. These are sections which contain initialization and deinitialization routines/functions to be called in a specific order. There are some other sections with the same purpose but different layout and/or meaning like .init and .fini or .ctors and .dtors. These ones are the legacy versions of the .init_array and .fini_array (and .preinit_array) section
  • VERSYM, VERDEF and VERNEED. These three usually come together, except in our simple hello world for a simple reason: we don’t define any symbol. The sections with these types complement the information contained in the symbol table and the dynamic symbol table by adding versions to it. I’m not sure if the version refer to the single symbol or to the library. The VERDEF defines versions, the VERNEED defines versions needed for some symbols and the VERSYM puts everything together.

I hope this summarization was useful for you because I would have really loved a post like this, instead of interpolating between the ELF specification, the Solaris documentation and StackOverflow questions. The next post will be about the x86 ISA.

Stay tuned.

This post is licensed under CC BY 4.0 by the author.