Creating an emulator - Dev Log #1
Deep dive into the ELF format
Around a year ago I decided to try to build an emulator to see how they work, how hard it is to emulate a system and maybe to play some really old video games.
I chose Java because I don’t want to recompile everything everytime for all the platforms that I want to use the emulator on. After that, I defined my first steps. Start little and easy. This meant that the first platform to emulate was the one that I understood the most: Linux + x86.
- Understand the ELF file format and build a parser for it
- Understand how to read the assembly contained in an ELF file and build an x86 assembly parser/decoder
- Understand how to execute a file by properly emulating the system around it (and also learn what a “calling convention” is)
- Understand “advanced execution features” like: dynamic linking, exceptions, processes, threads, file descriptors
- Do the above steps for EXE/PE files (optional)
- Do the above steps for other architectures like ARM (optional per se, but may be required to emulate different game consoles in the future)
- Select the desired console and apply the knowledge
My dream would be to emulate the PlayStation 2 to run Jak 3, but I think I will need to start from something much simpler like the NES.
Spoiler alert: right now I am at step 3, so this post and the next couple ones will be more of a recap on all that’s happened rather than an actual dev log. If you want to see the current state of the project, you can find it on GitHub.
After this needed premise, let’s jump in today’s topic.
The ELF file format
A bit of history
The Executable and Linkable Format (ELF for short) file format is the format used for executable files on Linux systems. Just like the .exe
files on Windows but, unlike those, they don’t have a characteristic file extension because they rely on their magic byte sequence: 0x7f454c46
(0x7f 'E' 'L' 'F'
in ASCII).
Actually it is used also for libraries (both shared and static) and for core files for debugging.
And, actually, it is not for Linux systems, it is a standard (just like POSIX, for example) which the Linux kernel adheres to.
It was invented/established around 1993 (the oldest version of the standard I can find is the 1.1 from october 1993 here) as a 32-bit format and later (in 1995) extended to 64 bits.
In case you didn’t realize it yet, Python is older than the ELF file format.
The structure of a file
An ELF file, regardless of its purpose (executable, library or corefile), is composed of three elements:
- the file header
- the program header table
- the section table
The File Header
The file header contains some general information about the whole file like:
- the type (
e_type
) of the file: executable, core, library etc. - the OS ABI (
EI_OSABI
) required to use it correctly - the ISA (
e_machine
) of the target machine
But, most importantly, contains:
- a byte to specify the format (
EI_CLASS
) of the data in the file: 32-bit or 64-bit - a byte to specify the endianness (
EI_DATA
): little-endian or big-endian
These two fields are really really important for parsing the file correctly since, speaking from personal experience, not only most of the fields in the file go from 4 to 8 bytes but also the layout of some section changes (like the entries of a symbol table section).
The Program Header Table
The Program Header Table (PHT) is a collection of information about memory regions during execution. At least, this is their purpose inside executable files, I still haven’t tried with libraries or corefiles. These memory regions are actually called segments (I haven’t really checked this one, but I guess they are similar or the same as segments in segmented memory). This information about memory mapping can be seen when using GNU’s readelf
utility like so:
1
readelf -l <your_elf_file>
which will display something like this for /bin/bash
:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
Elf file type is DYN (Position-Independent Executable file)
Entry point 0x32ef0
There are 13 program headers, starting at offset 64
Program Headers:
Type Offset VirtAddr PhysAddr
FileSiz MemSiz Flags Align
PHDR 0x0000000000000040 0x0000000000000040 0x0000000000000040
0x00000000000002d8 0x00000000000002d8 R 0x8
INTERP 0x0000000000000318 0x0000000000000318 0x0000000000000318
0x000000000000001c 0x000000000000001c R 0x1
[Requesting program interpreter: /lib64/ld-linux-x86-64.so.2]
LOAD 0x0000000000000000 0x0000000000000000 0x0000000000000000
0x000000000002e188 0x000000000002e188 R 0x1000
LOAD 0x000000000002f000 0x000000000002f000 0x000000000002f000
0x00000000000def6d 0x00000000000def6d R E 0x1000
LOAD 0x000000000010e000 0x000000000010e000 0x000000000010e000
0x0000000000039b08 0x0000000000039b08 R 0x1000
LOAD 0x0000000000148a90 0x0000000000149a90 0x0000000000149a90
0x000000000000bbc0 0x0000000000016b28 RW 0x1000
DYNAMIC 0x000000000014b4c0 0x000000000014c4c0 0x000000000014c4c0
0x0000000000000200 0x0000000000000200 RW 0x8
NOTE 0x0000000000000338 0x0000000000000338 0x0000000000000338
0x0000000000000030 0x0000000000000030 R 0x8
NOTE 0x0000000000000368 0x0000000000000368 0x0000000000000368
0x0000000000000044 0x0000000000000044 R 0x4
GNU_PROPERTY 0x0000000000000338 0x0000000000000338 0x0000000000000338
0x0000000000000030 0x0000000000000030 R 0x8
GNU_EH_FRAME 0x00000000001278a8 0x00000000001278a8 0x00000000001278a8
0x000000000000472c 0x000000000000472c R 0x4
GNU_STACK 0x0000000000000000 0x0000000000000000 0x0000000000000000
0x0000000000000000 0x0000000000000000 RW 0x10
GNU_RELRO 0x0000000000148a90 0x0000000000149a90 0x0000000000149a90
0x0000000000003570 0x0000000000003570 R 0x1
Section to Segment mapping:
Segment Sections...
00
01 .interp
02 .interp .note.gnu.property .note.gnu.build-id .note.ABI-tag .gnu.hash .dynsym .dynstr .gnu.version .gnu.version_r .rela.dyn .rela.plt
03 .init .plt .plt.got .plt.sec .text .fini
04 .rodata .eh_frame_hdr .eh_frame
05 .init_array .fini_array .data.rel.ro .dynamic .got .data .bss
06 .dynamic
07 .note.gnu.property
08 .note.gnu.build-id .note.ABI-tag
09 .note.gnu.property
10 .eh_frame_hdr
11
12 .init_array .fini_array .data.rel.ro .dynamic .got
Not all segments actually refer to a specific memory region, some of them (like GNU_STACK
) have a memory size field (called p_memsize
) of zero, meaning that they are allocated during execution.
Not every section needs to be mapped into the memory during execution, some of them are not loadable. Also, the same segment can host multiple sections at once (like the overcrowded segment 2) and the same section can be mapped to different segments at the same time (like .note.gnu.property
), which means it will be loaded multiple times in different parts of the memory.
There is some cool stuff about segment permissions that I will touch later, for now let’s talk about the most important part of an ELF file.
The Section Table
The Section Table (ST) is a bit different from the PHT because each section has both a header and an optional content in the file (unlike program headers whose content is defined only during execution). All section headers are the same, while the content (for those who have it) may change drastically: some have a fixed size content, some have a big block of binary data, some are actually a table of fixed size entries, some need a little parser of their own because of offsets and overlapping tables. While the PHT contains information for the program’s execution, the ST contains just about anything. This last part was a bit of a shock for me because I thought, naively, that a binary program contained only assembly.
Some sections are defined by the standard, while others are OS-specific or CPU-specific. Let’s look at a really simple example:
1
2
3
4
5
6
#include <stdio.h>
int main() {
printf("Hello world!\n");
return 0;
}
Compiled with gcc hello.c -o hello
and analyzed with readelf -S hello
:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
There are 31 section headers, starting at offset 0x3698:
Section Headers:
[Nr] Name Type Address Offset
Size EntSize Flags Link Info Align
[ 0] NULL 0000000000000000 00000000
0000000000000000 0000000000000000 0 0 0
[ 1] .interp PROGBITS 0000000000000318 00000318
000000000000001c 0000000000000000 A 0 0 1
[ 2] .note.gnu.pr[...] NOTE 0000000000000338 00000338
0000000000000030 0000000000000000 A 0 0 8
[ 3] .note.gnu.bu[...] NOTE 0000000000000368 00000368
0000000000000024 0000000000000000 A 0 0 4
[ 4] .note.ABI-tag NOTE 000000000000038c 0000038c
0000000000000020 0000000000000000 A 0 0 4
[ 5] .gnu.hash GNU_HASH 00000000000003b0 000003b0
0000000000000024 0000000000000000 A 6 0 8
[ 6] .dynsym DYNSYM 00000000000003d8 000003d8
00000000000000a8 0000000000000018 A 7 1 8
[ 7] .dynstr STRTAB 0000000000000480 00000480
000000000000008d 0000000000000000 A 0 0 1
[ 8] .gnu.version VERSYM 000000000000050e 0000050e
000000000000000e 0000000000000002 A 6 0 2
[ 9] .gnu.version_r VERNEED 0000000000000520 00000520
0000000000000030 0000000000000000 A 7 1 8
[10] .rela.dyn RELA 0000000000000550 00000550
00000000000000c0 0000000000000018 A 6 0 8
[11] .rela.plt RELA 0000000000000610 00000610
0000000000000018 0000000000000018 AI 6 24 8
[12] .init PROGBITS 0000000000001000 00001000
000000000000001b 0000000000000000 AX 0 0 4
[13] .plt PROGBITS 0000000000001020 00001020
0000000000000020 0000000000000010 AX 0 0 16
[14] .plt.got PROGBITS 0000000000001040 00001040
0000000000000010 0000000000000010 AX 0 0 16
[15] .plt.sec PROGBITS 0000000000001050 00001050
0000000000000010 0000000000000010 AX 0 0 16
[16] .text PROGBITS 0000000000001060 00001060
0000000000000107 0000000000000000 AX 0 0 16
[17] .fini PROGBITS 0000000000001168 00001168
000000000000000d 0000000000000000 AX 0 0 4
[18] .rodata PROGBITS 0000000000002000 00002000
0000000000000011 0000000000000000 A 0 0 4
[19] .eh_frame_hdr PROGBITS 0000000000002014 00002014
0000000000000034 0000000000000000 A 0 0 4
[20] .eh_frame PROGBITS 0000000000002048 00002048
00000000000000ac 0000000000000000 A 0 0 8
[21] .init_array INIT_ARRAY 0000000000003db8 00002db8
0000000000000008 0000000000000008 WA 0 0 8
[22] .fini_array FINI_ARRAY 0000000000003dc0 00002dc0
0000000000000008 0000000000000008 WA 0 0 8
[23] .dynamic DYNAMIC 0000000000003dc8 00002dc8
00000000000001f0 0000000000000010 WA 7 0 8
[24] .got PROGBITS 0000000000003fb8 00002fb8
0000000000000048 0000000000000008 WA 0 0 8
[25] .data PROGBITS 0000000000004000 00003000
0000000000000010 0000000000000000 WA 0 0 8
[26] .bss NOBITS 0000000000004010 00003010
0000000000000008 0000000000000000 WA 0 0 1
[27] .comment PROGBITS 0000000000000000 00003010
000000000000002b 0000000000000001 MS 0 0 1
[28] .symtab SYMTAB 0000000000000000 00003040
0000000000000360 0000000000000018 29 18 8
[29] .strtab STRTAB 0000000000000000 000033a0
00000000000001db 0000000000000000 0 0 1
[30] .shstrtab STRTAB 0000000000000000 0000357b
000000000000011a 0000000000000000 0 0 1
Key to Flags:
W (write), A (alloc), X (execute), M (merge), S (strings), I (info),
L (link order), O (extra OS processing required), G (group), T (TLS),
C (compressed), x (unknown), o (OS specific), E (exclude),
D (mbind), l (large), p (processor specific)
There is a lot of stuff in here for just a simple hello world. Let’s start from the beginning:
.text
is the most important section: it contains the code to be executed and somewhere in the file header there is the entry point field (e_entry
) telling the machine where to start executing. Not all the code of your program is here, there are some ways to put some of your code in other sections to get some specific behavior, but it’s safe to say that most of the code you wrote for your program ends up in here..data
is where the initialized global variables are stored. In our case, it’s empty because we do not have anything preallocated or allocated statically. Discovering this was quite a shock for me because I thought that static arrays were allocated and filled with data at the beginning of the program. With actual code. But no, and this is actually more efficient since it boils down to just a simplememcpy
at load time..rodata
stands for read-only data. It is exactly the same as the.data
section except that it contains only constants and immutable data. You can see this in its permissions: it has onlyA
, meaning it is allocatable, while.data
hasWA
, meaning allocatable and writable.
Also, as you may have seen from the readelf
output, there are a lot of different section types:
PROGBITS
is the simplest one: it’s just a big block of memory that needs to be loaded as-is. The most important sections of your file (.text
,.data
,.rodata
) are of this type.NOBITS
is, somehow, even simpler: it’s an uninitialized block of memory. This means that it has a size of 0 bytes inside the file while a certain size in the memory. It’s basically amalloc
call at load-time.STRTAB
is the type of a string table. A string table contains a block of 1-byte characters (ASCII) with each string ending with a\0
byte. Some sections use these string tables with a simple index from which the corresponding string can be retrieved by iterating until the zero byte. This allows saving memory with a lot of strings with a common prefix, but I haven’t seen any prefix being used in the files I’ve found.SYMTAB
represents a symbol table. In case you’ve never heard of it, that’s the best friend of a linker. It contains all the information about the symbols that are exposed in your program such as functions and global variables.DYNSYM
is the same thing as the symbol table, but dynamic. This means that it’s actually used during dynamic linking.DYNAMIC
is the type of a section that holds information needed for dynamic linking like the names of libraries to link.NOTE
is the type of sections holding various info that is not indispensable for execution nor linking but may be useful. For example, the.note.gnu.build-id
section holds a 256-bit string generate during compilation, to identify a certain executable, I think. Another example, the.note.gnu-property
contains some information about the specific ISA needed for the file.HASH
andGNU_HASH
are sections which contain hash tables for fast lookup of symbols inside the symbol tables. I’ve never actually measured how much time is spent doing symbol lookup in large projects with thousands of symbols (and I wouldn’t even know how to measure it), but I guess it’s a nice to have.REL
andRELA
are types of relocations sections. Relocations and relocations with addends, respectively. As far as I know, they tell the linker how and where to relocate a certain symbol in a symbol table, in case it is needed. I don’t actually know how they work and how to use them because I didn’t need them. Maybe I will learn it when implementing dynamic linking.INIT_ARRAY
andFINI_ARRAY
. These are sections which contain initialization and deinitialization routines/functions to be called in a specific order. There are some other sections with the same purpose but different layout and/or meaning like.init
and.fini
or.ctors
and.dtors
. These ones are the legacy versions of the.init_array
and.fini_array
(and.preinit_array
) sectionVERSYM
,VERDEF
andVERNEED
. These three usually come together, except in our simple hello world for a simple reason: we don’t define any symbol. The sections with these types complement the information contained in the symbol table and the dynamic symbol table by adding versions to it. I’m not sure if the version refer to the single symbol or to the library. TheVERDEF
defines versions, theVERNEED
defines versions needed for some symbols and theVERSYM
puts everything together.
I hope this summarization was useful for you because I would have really loved a post like this, instead of interpolating between the ELF specification, the Solaris documentation and StackOverflow questions. The next post will be about the x86 ISA.
Stay tuned.