Fibers, Oh My! #

Written by Dale Weiler, last updated Oct 5th, 2020

It’s been brought to the attention of many people in the game development industry that you can use fibers to parallelize your engines and games. This one presentation in particular: Parallelizing the Naughty Dog engine using fibers has been a talking point since it came out in 2015. However, it doesn’t quite go into the details enough to really explain why fibers are a good fit or how to actually implement them. In my pursuit to educate myself and take a stab at implementing them, I’ve concluded that there’s a distinctive lack of good information online. In response to the lack of literature on this topic, I’ve decided to explain it from multiple angles, and provide less known information on the subject. This document serves as a Sourcebook for those who really want to determine if it’s a right fit for them and how to go about actually doing it.

Disambiguation of terms. #

Before we get started I want to define three distinctive terms.

OS-thread (the thread given to us by the OS)
Hardware-thread (the actual physical thread on a CPU)
Fiber-thread (the thread of execution that is our fiber)

What are fibers? #

Fibers are a lightweight thread of execution similar to OS threads. However, unlike OS threads, they’re cooperatively scheduled as opposed to preemptively scheduled. What this means in plain English is that fibers yield themselves to allow another fiber to run. You may have used something similar to this in your programming language of choice where it’s typically called a coroutine, there’s no real distinction between coroutines and fibers other than that coroutines are usually a language-level construct, while fibers tend to be a systems-level concept.

Other names for fibers you may have heard before include:

green threads
user-space threads
coroutines
tasklets
microthreads

There are very few and minor differences between fibers and the above list. For the purposes of this document, we should consider them equivalent as the distinctions don’t quite matter.

Scheduling #

At any given moment the OS is running multiple processes all with their own OS threads. All of those OS threads need to be making forward progress. There’s two classes of thought when it comes to how you solve this problem.

It’s important to note that while you may observe that all processes and OS threads are running in parallel, scheduling is really providing the illusion of that. Not all threads are running in parallel, the scheduler is just switching between them quickly enough that it appears everything is running in parallel. That is they’re concurrent. Threads start, run, and complete in an interleaved fashion.

It is possible for multiple OS threads to be running in parallel with symmetric multiprocessing (SMP) where they’re mapped to multiple hardware threads, but only as many hardware threads as the CPU physically has.

Premptive scheduling #

Most people familiar with threads know that you don’t have to yield to other threads to allow them to run. This is because most operating systems (OS) schedule threads preemptively.

The points at which the OS may decide to preempt a thread include:

IO
sleeps
waits (seen in locking primitives)
interrupts (hardware events mostly)

The first three in particular are often expressed by an application as a system call. These system calls cause the CPU to cease executing the current code and execute the OS’s code registered for that system call. This allows the OS to service the request then resume execution of your application’s calling thread, or another thread entierly.

This is possible because the OS will decide at one of the points listed above to save all the relevant state of that thread then resume some other thread, the idea being that when this thread can run again, the OS can reinstate that thread and continue executing it like nothing ever happened. These transition points where the OS switches a thread are called context switches.

There’s a cost associated with this context switching and all modern operating systems have made great deals of effort to reduce this cost as much as possible. Unfortunately, that overhead begins to show itself when you have a lot of threads. In addition, recent cache side channel attacks like: Spectre, Meltdown, Spoiler, Foreshadow, and Microarchitectural Data Sampling on modern processors has led to a series of both user-space and kernel-space mitigation strategies, some of which increased the overhead of context switches significantly.

You can read more about context switching overhead in this paper.

Cooperative scheduling #

This idea of fibers yielding to each other is what is known as cooperative scheduling. Fibers effectively move the idea of context switching from kernel-space to user-space and then make those switches a fundamental part of computation, that is, they’re a deliberate and explicitly done thing, by the fibers themselves. The benefit of this is that a lot of the previously mentioned overhead can be entierly eliminated while still permitting an excess count of threads of execution, just in the form of these fibers now.

The problem with multi-threading #

There’s many problems related to multi-threading, most obviously that it’s difficult to get right. Most proponents of fibers make false claims about how this problem goes away when you use fibers because you don’t have parallel threads of execution. Instead, you have these cooperatively scheduled fibers which yield to each other. This means it’s not possible to race data, dead lock, live lock, etc. While this statement is true when you look at fibers as a N:1 proposition, the story is entierly different when you introduce M:N.

N:1 and what it means #

Most documentation, libraries, and tutorials on fibers are almost exclusively based around using a single thread given to you by the OS, then sharing it among multiple fibers that cooperatively yield and run all your asynchronous code. This is called N:1 (“N to one”). N fibers to 1 thread, and it’s the most prevalent form of fibers. This is how Lua coroutines work, how Javascript’s and Python’s async/await work, and it’s not what you’re interested in doing if you actually want to take advantage of hardware threads. What you’re interested in is M:N, (“M to N”) M fibers to N threads.

M:N and what it means #

The idea behind M:N is to take the model given to us by N:1 and map it to multiple actual OS threads. Just like we’re familiar to the concept of thread pools where we execute tasks, here we have a pool of threads where we execute fibers and those fibers get to yield more of themselves on that thread.

I should stress that M:N fibers have all the usual problems of multi-threading. You still need to syncronize access to resources shared between multiple fibers because there’s still multiple threads.

The problem with thread pools #

A lot of you may be wondering how this is different from traditional task based parallelism choices seen in many game engines and applications. The model where you have a fixed-size pool of threads you queue tasks on to be executed at some point in the future by one of those threads.

Locality of reference #

The first problem is locality of reference. The data-oriented / cache-aware programmers reading this will have to mind my overloading of that phrase because what I’m really talking about is the resources that a job needs access to are usually local. The job isn’t going to be executed immediately, but rather when the thread pool has a chance to. Any resource that job needs access to, needs to be available for the job at some point in the future. This means local values need their lifetime’s extended for an undefined amount of time.

There’s many ways to solve this lifetime problem, except they all have overhead. Consider this trivial example where I want to asynchronously upload a local file to a webserver

void upload_file(const String& filename) {
  thread_pool->add_job([&] {
    auto file = open_file(filename);
    auto data = read_file(file);
    post_binary_data(data);
  });
}

Do you see the bug?

The issue is that the string passed to upload_file, i.e the local filename, only has a lifetime of the body of the function. When this function returns, filename no longer is a valid string. However, at some point this lambda function will be executed and try to access filename in it’s local context and it’ll be a dangling reference by then.

This can be surprsing to those more familiar with dynamic languages, since they support this functionality with something called a closure. The closure will actually extend the lifetime of filename. In native languages like C or C++ though, this isn’t the case and all lifetime extension needs to be done explicitly with obvious runtime which introduces overhead.

What you could do is copy the string. The copy will then have the lifetime of the lambda. However, if we had a much larger resource to share with this job that couldn’t be as cheaply copied, we would have to use something like a reference count instead. As you can see, the solutions to the resource problem involve a great deal of overhead in many cases and requires you to really think and reason about lifetimes here when you didn’t have to, nor should you have to.

We’re experiencing a lot of friction here already and it’s only a few lines of code. My personal rule of thumb is if you find yourself experiencing friction like this, it’s usually indicative of bad design and the problem needs to be rethought.

Nested induced deadlocks #

The other significant problem with thread pools is what I call nested induced deadlocks. No matter how you slice and dice it, jobs are going to want to schedule other jobs and need the result before they themselves can continue.

Let me set up a real world example because it’s easier to understand with something concrete. Since this is primarily focused for game developers, I’ll use a familiar example.

The setup #

I want to multi-thread the loading of game levels for faster loading. A level can have multiple models. A model can have multiple meshes. A mesh can have multiple materials. Materials can have multiple textures. Here’s some example code for how I might go about doing that:

Expand code listing ↕

Texture load_texture(const TextureDesc& filename); // exposition-only
Material load_material(const MaterialDesc& material) {
  Vector<Job> jobs;
  Material result(material.name);
  for (const auto& texture : material.textures) {
    Job id = thread_pool->add_job([&] {
      result.textures.push_back(load_texture(texture.desc));
    });
    jobs.push_back(id);
  }
  wait_for_all_jobs(jobs);
  return result;
}

Mesh load_mesh(const MeshDesc& mesh) {
  Vector<Job> jobs;
  Mesh result(mesh.name);
  for (const auto& material : mesh.materials) {
    Job id = thread_pool->add_job([&]{
      result.materials.push_back(load_material(material.desc));
    });
    jobs.push_back(id);
  }
  wait_for_all_jobs(jobs);
  return result;
}

Model load_model(const ModelDesc& model) {
  Vector<Job> jobs;
  Model result(model.name);
  for (const auto& mesh : model.meshes) {
    Job id = thread_pool->add_job([&]{
      result.meshes.push_back(load_mesh(mesh.desc));
    });
    jobs.push_back(id);
  }
  wait_for_all_jobs(jobs);
  return result;
}

void load_level() {
  read_metadata();
  for (const auto& model : metadata.models) {
    models.push_back(load_model(model.desc));
  }
}

The loading of models jobifies the loading of all the meshes. The loading of meshes jobifies the loading of all the materials. The loading of materials jobifies the loading of all textures. It’s pretty straight-forward. This completely saturates all threads of execution.

Note that in this case loading conflates reading, decoding, optimizing, and uploading to the GPU. Nearly all of these operations, with the exception of IO, are compute bound problems which are easily parallelizable.

The deadlock #

The problem with this code is that for a given thread pool of N threads, if all N of those threads end up in a wait_for_all_jobs, they’ll never make forward progress and the entire thread pool will deadlock.

There’s many ways to avoid this situation, but the most obvious and easiest is to remove the waits. It’s true there are ways to restructure this to remove the waits and make this work but it has some problems which I want to outline:

If we remove the waits, the locals captured by the lambdas are no longer valid when the job runs. The reason we can access them at all in this current example is because wait_for_all_jobs prevents the function from returning, keeping the lifetime of the locals alive.

I want to stress that to avoid the problem above, the code is going to take on a very different form. In my personal experience, all of the following occur when trying to solve the above problem:

Code organization will suffer.
Violation of SoC (separation of concerns.)
Code duplication to permit the use of the fuctions outside of level loading.
Fragile interfaces with complicated rules about how they’re called, such as;
- Only being allowed to call interfaces on certain threads and not others.
- Management of resources require intricate and explict care.

This is very fragile, tedious, and showing signs of high friction. We’ll see how fibers can help us avoid this later.

An incorrect and often advocated solution #

There’s some proponents that advocate that one such way to avoid these issues is to do away with the thread pool altogether and spawn threads everytime you make a job. The OS will schedule them and avoid the deadlock, when the jobs are done the threads are joined.

This is actually how std::async works in C++ with the async launch policy. It will certainly avoid the problem but it has problems which I’ll outline here:

Thread construction has overhead. The primary motivation behind thread pools to begin with is to eliminate the cost of thread construction. This overhead will have diminishing returns if you cut your jobs too fine and we’ll see later that cutting jobs real fine isn’t only possible with fibers, but that it’s actually encouraged.
There will be a lot of context switches. This many context switches is only necessary because we’ve delegated all the scheduling to the OS when we could do a much better job ourselfs. This context switching overhead will lead to diminishing returns.
Threads are a finite resource. If you do plan on cutting jobs real fine, it’s very easy to see how a level with a few thousand models, handful of meshes per model, handful of materials per mesh and a handful of textures per material can quickly lead to tens of thousands of threads. With this many threads, certain security features like ulimit on Linux, a similar feature on Windows which has no name that I know of, and even third party monitoring services like AV software may decide to kill your process before it has finished loading the level thinking your process is a denial of service attack.

A correct and often advocated solution #

This nesting of jobs and their dependencies have given rise to a multitude of task graph libraries which require you to explicitly describe a graph of all your jobs and their dependencies. This too avoids the deadlock, provided you don’t form cycles in the graph. However it has problems of its own which make it unsuitable:

The graph is explicit and so you must describe it all yourself.
Requires a complete, global understanding of all tasks, their dependencies and how they should interact.
Mandates you write code in a way counter to what is natural to do.
More likely to trigger priority inversion since the scheduling is out of your control.
Still requires explicit memory management of resources, be it by copy or reference count.

What M:N fibers will solve #

The trick of course here is that we want to write code that appears like it blocks but it doesn’t actually block the thread. In other words, we want to write the exact code in my example but we want wait_for_all_jobs to lie a little about what it does. This one little lie enables us to experience a natural and efficient solution to all the above concerns, in particular:

Write the exact code we have already written without making copies or reference counting resources.
Avoid nested induced deadlocks, even though we’ll still have that thread pool to take advantage of OS threads.
Reduce the amount of context switches to only necessary levels.
Jobifiy just about everything.
Scale without making any changes to our code.

What M:N fibers will not solve #

It almost sounds a too good to be true and you should be highly skeptical. This is not without it’s own unique set of problems it cannot solve. If you recall from earlier, fibers are a form of cooperative scheduling, that is we yield to other fibers to allow them to run instead of waiting. This means we’re always keeping the thread saturated with work to do.

With that fact refreshed, imagine what happens when our fibers cannot yield? What if our fibers invoke operating system primitives that block, or interfaces which indirectly do?

IO
sleeps
waits (seen in locking primitives)

When you block a thread in one of these ways, the OS thread our fiber is on, is going to block until the blocking action completes. As a result, it’s imperative that a fiber avoids such operations.

The other thing fibers cannot solve is Priority Inversion. However, we’ll see later that by implementing a fiber scheduler, there’s at least some things we can do to lower the chance of it happening, something that isn’t possible with traditional threads at all.

You should not use fibers for anything that requires a realtime priority. One such example is audio mixing. If you require quick preemption and high priority scheduling, there’s not much room in a fiber scheduler to do that as fibers are cooperatively scheduling each other. Do not move systems that require constant realtime priority to fibers, it’s too easy to starve them.

How not to implement M:N fibers #

Using the N:1 approach. This approach is seen in countless libraries. The problem with this approach is it does not take advantage of multiple hardware threads. It’s true you can layer this ontop of threads, but in doing so you’ll still lack all the syncronization primitives you’re used to with threads and you cannot use OS syncronization primitives in your fibers for the reasons listed above. They’re not designed to be used by multiple threads and adding support is likely more difficult than writing your own.
Use setjmp and longjmp. There’s a growing number of people that do this. Fibers need their own stack which this does not permit.

The third and more subtle mistake people make when implementing fibers is to use OS provided functionality for implementing the user-space context switching

POSIX provides setcontext, getcontext, and makecontext.
Windows provides SetThreadContext and GetThreadContext.
Windows also provides actual Fibers CreateFiber.

Avoid OS fiber primitives #

One of the challenges we need to overcome in our fiber implementation is making the construction of fibers fast. Imagine you want to spawn thousands of fibers every frame, none of the OS methods allow this because they’re required to save and restore more state than we’re interested in. This is an area where NIH syndrome will pay off.

Even if the interfaces provided by the OS were exactly what we needed in terms of performance, they’re made inadequate because of other more subtle issues which are often overlooked. I’ve compiled a list of all the problems with OS provided primitives.

Problems with POSIX #

The POSIX functions are obsolescent and shouldn’t be used. Implementations of the standard library, like musl do not provide the functions at all since they don’t need to.
They need to make system calls (which are context switches) to save the signal mask.
They need to save and restore a large grab-bag structure of state, most of which we’re not interested in.

Problems with Windows #

The CreateFiber interface is responsible for allocating the stack. This means we cannot provide our own stack memory, which in turns means we cannot reuse stacks. The ability to reuse stacks will impact our performance.
The CreateFiber interface leads you into a false sense of security by allowing you to use Windows syncronization primitives like WaitForMultipleObjects, which have a pathetically small limitation on the maximum number of waiters (i.e. MAXIMUM_WAIT_OBJECTS = 64) due to legacy limitations in the API.
The SetThreadContext and GetThreadContext functions rely on a similar, large grab-bag structure of state, most of which we’re not interested in.

You’ll find many other examples of similar problems for interfaces provided by your OS of choice. The fact of the matter is, because these need to be generic interfaces, they need to do a lot more work.

User space context switching #

To switch contexts in users space, we need to be able to backup certain state about the context and set the appropriate state so we can jump into that context. This is highly ABI, OS and architecture specific. I’m going to focus on the SysV x86_64 ABI on Linux and x64 ABI on Windows, since the former is my development platform and the latter is quite similar. The ideas are the same on other platforms and you can find all the relevant documentation for what needs to be done with a bit of Googling, or you can save yourself the effort and use something like Boost.Context which has working code for a ton of platforms. Just remember that you can beat it for performance since it has to be generic.

Getting the context #

The idea behind getting the context is to save the current registers the ABI says need to be saved and the return address. We can use these then later to come back by restoring them.

The code below is x86_64 assembler in GAS syntax for the SysV x86_64 ABI. The lines that begin with . are called assembler directives and I specify the type of my labels here as global functions, this will be needed so we can externally reference them in C and C++.

Expand code listing ↕

.type get_context, @function
.global get_context
get_context:
  # Save the return address and stack pointer.
  movq (%rsp), %r8
  movq %r8, 8*0(%rdi) // RIP
  leaq 8(%rsp), %r8
  movq %r8, 8*1(%rdi) // RSP

  # Save preserved registers.
  movq %rbx, 8*2(%rdi)
  movq %rbp, 8*3(%rdi)
  movq %r12, 8*4(%rdi)
  movq %r13, 8*5(%rdi)
  movq %r14, 8*6(%rdi)
  movq %r15, 8*7(%rdi)

  # Return.
  xorl %eax, %eax
  ret

Those who really know their stuff will note that I’ve not saved the x87 FPU state as it’s not actually used for x86_64 code. It’s true that the compiler may generate code to it if I use long double which requires the full 80-bit precision, however this is one of the things I can be sure I won’t use in my code and so I do not need to worry about. Existing interfaces cannot make those decisions for you however.

If you look carefully you’ll note that RDI is actually the first argument to this get_context function as defined by the ABI and I’m storing the values of some registers at fixed offsets relative to RDI. Those offsets are 8*n where n is some number, that’s because these registers are 8-bytes in size. The thing passed to RDI here is just a pointer to a structure that has those register, you can define it like this if you want.

struct Context {
  void *rip, *rsp;
  void *rbx, *rbp, *r12, *r13, *r14, *r15;
};

Take note that this Context structure isn’t a large grab-bag of state like the OS provided interfaces. We have a total of 64 bytes to save and restore for context switches, which convienently fits in a singlle cache line. Similarly, we only need 12 instructions to save the context and 11 instructions to restore it. This is as fast as it gets.

Setting the context #

Setting the context is pretty much the reverse of getting the context. It’s made a little more tricky on x86 since you cannot set the instruction pointer directly, it has to be done through the stack. This is what you see here with moving RIP into R8 so we can push it to the stack. You cannot push directly from a memory location, hence loading it into a register first.

Expand code listing ↕

.type set_context, @function
.global set_context
set_context:
  # Should return to the address set with {get, swap}_context.
  movq 8*0(%rdi), %r8

  # Load new stack pointer.
  movq 8*1(%rdi), %rsp

  # Load preserved registers.
  movq 8*2(%rdi), %rbx
  movq 8*3(%rdi), %rbp
  movq 8*4(%rdi), %r12
  movq 8*5(%rdi), %r13
  movq 8*6(%rdi), %r14
  movq 8*7(%rdi), %r15

  # Push RIP to stack for RET.
  pushq %r8

  # Return.
  xorl %eax, %eax
  ret

Example of use #

With that out of the way, we can now do some very interesting things in C.

Expand code listing ↕

extern void get_context(Context *c);
extern void set_context(Context *c);

int main() {
  volatile int x = 0;

  Context c;
  get_context(&c);

  printf("hello, world!\n");

  if (x == 0) {
    x++;
    set_context(&c);
  }

  return 0;
}

This code will print "hello, world!" twice.

The need for volatile #

The set_context call returns us to where get_context was called, the value of x will be 1 the second time through and the set_context cannot be called again since x is no longer 0. This behaviour only works because I used volatile though. If we were to omit it. The compiler will see the if statement as always true, because in a normal program it would be. As a result, we’d get an infinite loop of "hello, world!".

Making a context #

What we’re more interested in doing here is actually making our own context though. This is nothing more than an implementation of setjmp and longjmp with a slightly different interface. We want to run actual functions like threads do. We also want to give it our own stack which we will see becomes important later.

We can do that pretty easily, we just need to set the rsp field (stack pointer) of our Context struct to point to some piece of memory that will behave as our stack. Then we point rip (instruction pointer) to the start of our function we wish to call.

Expand code listing ↕

void foo() {
  printf("you called foo\n");

  // Need to exit since there's no return address on the stack after coming
  // here from |set_context| in main.
  exit(1);
}

int main() {
  // This will behave as our stack when we call foo.
  char data[4096];

  // Remember, stacks grow downwards so we need a pointer to the end of the stack.
  char *sp = (char*)(data + sizeof data);

  Context c = {0};
  c.rip = (void*)foo;
  c.rsp = (void*)sp;

  set_context(&c);
}

There’s a couple problems with this code though. It’s important to check your ABI because stack layout does require consideration. The SysV ABI in particular requires that our stack be 16-byte aligned for SSE. In addition to alignment, the ABI requires we also have 128 bytes of space below the stack called the Red Zone.

The relevant modifications we need to make are included here.

Expand code listing ↕

int main() {
  // This will behave as our stack when we call foo.
  char data[4096];

  // Remember, stacks grow downwards so we need a pointer to the end of the stack.
  char *sp = (char*)(data + sizeof data);

  // Align stack pointer on 16-byte boundary.
  sp = (char*)((uintptr_t)sp & -16L);

  // Make 128 byte scratch space for the Red Zone. This arithmetic will not unalign
  // our stack pointer because 128 is a multiple of 16. The Red Zone must also be
  // 16-byte aligned.
  sp -= 128;

  Context c = {0};
  c.rip = (void*)foo;
  c.rsp = (void*)sp;

  set_context(&c);
}

This prints "you called foo"

You may be wondering why we need custom stacks for our fibers. Functions have locals that live on the stack, when we switch between fibers, we need to preserve the contents of those locals so when we resume they’ll still be there. Similarly, in the case of languages like C++, the objects that live on the stack will have their destructors called upon return of the function. Those destructors reference the object on the stack directly through their this pointer which will point into the stack.

This is the main reason setjmp and longjmp cannot be realistically used in C++, the destructors of your stack allocated objects will never get called for one thing, unless you jump back, but even if you did, the contents of the stack are no longer going to be that of what the objects were when you longjmp out. This means when the destructors do get called, they’re referencing clobbered data and invoking undefined behavior.

Swapping contexts #

Context swapping is a combination of set_context and get_context. It replaces one context for another, backing up the current context.

Expand code listing ↕

.type swap_context, @function
.global swap_context
swap_context:
  # Save the return address.
  movq (%rsp), %r8
  movq %r8, 8*0(%rdi) // RIP
  leaq 8(%rsp), %r8
  movq %r8, 8*1(%rdi) // RSP

  # Save preserved registers.
  movq %rbx, 8*2(%rdi)
  movq %rbp, 8*3(%rdi)
  movq %r12, 8*4(%rdi)
  movq %r13, 8*5(%rdi)
  movq %r14, 8*6(%rdi)
  movq %r15, 8*7(%rdi)

  # Should return to the address set with {get, swap}_context.
  movq 8*0(%rsi), %r8

  # Load new stack pointer.
  movq 8*1(%rsi), %rsp

  # Load preserved registers.
  movq 8*2(%rsi), %rbx
  movq 8*3(%rsi), %rbp
  movq 8*4(%rsi), %r12
  movq 8*5(%rsi), %r13
  movq 8*6(%rsi), %r14
  movq 8*7(%rsi), %r15

  # Push RIP to stack for RET.
  pushq %r8

  # Return.
  xorl %eax, %eax
  ret

Windows #

As mentioned, all the above assembly is for the SysV x86_64 ABI which is great if you only care about macOS, Linux, and FreeBSD platforms.

The good news is extending it for the x64 Windows ABI is quite straight forward. We do need to save and restore some additional registers (RDI, RSI, and XMM6-XMM15) though, since they too are considered “preserved registers” in this ABI. The other thing that is different is the calling convention. Instead of the arguments ending up in RDI and RSI, they end up in RDX and RCX.

Here’s the patch and accompanying Context structure for supporting Windows.

struct Context {
  void *rip, *rsp;
  void *rbx, *rbp, *r12, *r13, *r14, *r15, *rdi, *rsi;
  __m128i xmm6, xmm7, xmm8, xmm9, xmm10, xmm11, xmm12, xmm13, xmm14, xmm15;
};

Expand code listing ↕

 get_context:
   # Save the return address and stack pointer.
   movq (%rsp), %r8
-  movq %r8, 8*0(%rdi) // RIP
+  movq %r8, 8*0(%rdx) // RIP
   leaq 8(%rsp), %r8
-  movq %r8, 8*1(%rdi) // RSP
+  movq %r8, 8*1(%rdx) // RSP

   # Save preserved registers.
-  movq %rbx, 8*2(%rdi)
-  movq %rbp, 8*3(%rdi)
-  movq %r12, 8*4(%rdi)
-  movq %r13, 8*5(%rdi)
-  movq %r14, 8*6(%rdi)
-  movq %r15, 8*7(%rdi)
+  movq %rbx, 8*2(%rdx)
+  movq %rbp, 8*3(%rdx)
+  movq %r12, 8*4(%rdx)
+  movq %r13, 8*5(%rdx)
+  movq %r14, 8*6(%rdx)
+  movq %r15, 8*7(%rdx)
+  movq %rdi, 8*8(%rdx)
+  movq %rsi, 8*9(%rdx)
+  movups %xmm6, 8*10+16*0(%rdx)
+  movups %xmm7, 8*10+16*1(%rdx)
+  movups %xmm8, 8*10+16*2(%rdx)
+  movups %xmm9, 8*10+16*3(%rdx)
+  movups %xmm10, 8*10+16*4(%rdx)
+  movups %xmm11, 8*10+16*5(%rdx)
+  movups %xmm12, 8*10+16*6(%rdx)
+  movups %xmm13, 8*10+16*7(%rdx)
+  movups %xmm14, 8*10+16*8(%rdx)
+  movups %xmm15, 8*10+16*9(%rdx)

   # Return.
   xorl %eax, %eax
@@ -23,18 +25,20 @@

 set_context:
   # Should return to the address set with {get, swap}_context.
-  movq 8*0(%rdi), %r8
+  movq 8*0(%rdx), %r8

   # Load new stack pointer.
-  movq 8*1(%rdi), %rsp
+  movq 8*1(%rdx), %rsp

   # Load preserved registers.
-  movq 8*2(%rdi), %rbx
-  movq 8*3(%rdi), %rbp
-  movq 8*4(%rdi), %r12
-  movq 8*5(%rdi), %r13
-  movq 8*6(%rdi), %r14
-  movq 8*7(%rdi), %r15
+  movq 8*2(%rdx), %rbx
+  movq 8*3(%rdx), %rbp
+  movq 8*4(%rdx), %r12
+  movq 8*5(%rdx), %r13
+  movq 8*6(%rdx), %r14
+  movq 8*7(%rdx), %r15
+  movq 8*8(%rdx), %rdi
+  movq 8*9(%rdx), %rsi
+  movups 8*10+16*0(%rdx), %xmm6
+  movups 8*10+16*1(%rdx), %xmm7
+  movups 8*10+16*2(%rdx), %xmm8
+  movups 8*10+16*3(%rdx), %xmm9
+  movups 8*10+16*4(%rdx), %xmm10
+  movups 8*10+16*5(%rdx), %xmm11
+  movups 8*10+16*6(%rdx), %xmm12
+  movups 8*10+16*7(%rdx), %xmm13
+  movups 8*10+16*8(%rdx), %xmm14
+  movups 8*10+16*9(%rdx), %xmm15

   # Push RIP to stack for RET.
   pushq %r8
@@ -46,31 +50,35 @@
 swap_context:
   # Save the return address.
   movq (%rsp), %r8
-  movq %r8, 8*0(%rdi) // RIP
+  movq %r8, 8*0(%rdx) // RIP
   leaq 8(%rsp), %r8
-  movq %r8, 8*1(%rdi) // RSP
+  movq %r8, 8*1(%rdx) // RSP

   # Save preserved registers.
-  movq %rbx, 8*2(%rdi)
-  movq %rbp, 8*3(%rdi)
-  movq %r12, 8*4(%rdi)
-  movq %r13, 8*5(%rdi)
-  movq %r14, 8*6(%rdi)
-  movq %r15, 8*7(%rdi)
+  movq %rbx, 8*2(%rdx)
+  movq %rbp, 8*3(%rdx)
+  movq %r12, 8*4(%rdx)
+  movq %r13, 8*5(%rdx)
+  movq %r14, 8*6(%rdx)
+  movq %r15, 8*7(%rdx)
+  movq %rdi, 8*8(%rdx)
+  movq %rsi, 8*9(%rdx)
+  movups %xmm6, 8*10+16*0(%rdx)
+  movups %xmm7, 8*10+16*1(%rdx)
+  movups %xmm8, 8*10+16*2(%rdx)
+  movups %xmm9, 8*10+16*3(%rdx)
+  movups %xmm10, 8*10+16*4(%rdx)
+  movups %xmm11, 8*10+16*5(%rdx)
+  movups %xmm12, 8*10+16*6(%rdx)
+  movups %xmm13, 8*10+16*7(%rdx)
+  movups %xmm14, 8*10+16*8(%rdx)
+  movups %xmm15, 8*10+16*9(%rdx)

   # Should return to the address set with {get, swap}_context.
-  movq 8*0(%rsi), %r8
+  movq 8*0(%rcx), %r8

   # Load new stack pointer.
-  movq 8*1(%rsi), %rsp
+  movq 8*1(%rcx), %rsp

   # Load preserved registers.
-  movq 8*2(%rsi), %rbx
-  movq 8*3(%rsi), %rbp
-  movq 8*4(%rsi), %r12
-  movq 8*5(%rsi), %r13
-  movq 8*6(%rsi), %r14
-  movq 8*7(%rsi), %r15
+  movq 8*2(%rcx), %rbx
+  movq 8*3(%rcx), %rbp
+  movq 8*4(%rcx), %r12
+  movq 8*5(%rcx), %r13
+  movq 8*6(%rcx), %r14
+  movq 8*7(%rcx), %r15
+  movq 8*8(%rcx), %rdi
+  movq 8*9(%rcx), %rsi
+  movups 8*10+16*0(%rcx), %xmm6
+  movups 8*10+16*1(%rcx), %xmm7
+  movups 8*10+16*2(%rcx), %xmm8
+  movups 8*10+16*3(%rcx), %xmm9
+  movups 8*10+16*4(%rcx), %xmm10
+  movups 8*10+16*5(%rcx), %xmm11
+  movups 8*10+16*6(%rcx), %xmm12
+  movups 8*10+16*7(%rcx), %xmm13
+  movups 8*10+16*8(%rcx), %xmm14
+  movups 8*10+16*9(%rcx), %xmm15

   # Push RIP to stack for RET.
   pushq %r8

Security #

MSVC for debug builds has something called /RTC ( Runtime-Time Error Checks) which help detect stack corruption and other stack related issues. Our stack switching code does not play nice with it. In particular, I believe it expects the stack pointer address on each function call return be in the range defined by the TIB which is stored in the GS segment register.

We can modify our context switching functions so that we update StackBase and StackLimit to avoid this issue, or you can just outright disable /RTC. /RTC is not enabled in release builds and it precludes optimizations, so it can’t even be realistically turned on in a release build. I left supporting this as an exercise to the reader because I’m not exactly sure that it’s enough to silence RTC.

Not enough information exists on how /RTC works for me to be definitive on how to silence it here. It would be nice if Microsoft would document stuff like this. If someone knows how it works please let me know.

Integrating into your build #

Lets mention the big elephant in the room. This is assembly language. Most build pipelines are not setup to deal with integrating assembly into the build process, and even if they were, the assembly above is the GAS syntax which is only understood by some assemblers. Once you begin talking about MASM, NASM, FASM, etc, this gets annoying quick. You need additional build tools, which vary by platform, etc.

Inline assembler #

One way to avoid this problem is to just use inline assembler which works great if your compiler supports it. MSVC on Windows for x64 does not though. I also don’t trust myself to get inline assembler right in terms of clobbers for inputs and outputs.

Just use C/C++ #

One trick that apparently few people know about is that you can write assembler in almost-standard C and C++ without an inline assembler. I thought of this trick while writing this sourcebook and discovered to my surprise not only is it possible, it’s perfect for this.

We’re not going to be making any real changes to our assembly source, this isn’t something that will require future maintaince. This is a one and done deal, so why don’t we just assemble it offline with what ever assembler we want, get a small byte code listing and embed that into our C/C++ source code directly as a byte array?

Take the get_context function from earlier

Expand code listing ↕

get_context:
  # Save the return address and stack pointer.
  movq (%rsp), %r8
  movq %r8, 8*0(%rdi) // RIP
  leaq 8(%rsp), %r8
  movq %r8, 8*1(%rdi) // RSP

  # Save preserved registers.
  movq %rbx, 8*2(%rdi)
  movq %rbp, 8*3(%rdi)
  movq %r12, 8*4(%rdi)
  movq %r13, 8*5(%rdi)
  movq %r14, 8*6(%rdi)
  movq %r15, 8*7(%rdi)

  # Return.
  xorl %eax, %eax
  ret

Feed it into an assembler. Here I can use gcc as it’s a frontend for many languages, including assembler.

gcc -c file.S

Now dump the contents of it with objdump -d

objdump file.o

0000000000000000 <get_context>:
   0:   4c 8b 04 24             mov    (%rsp),%r8
   4:   4c 89 07                mov    %r8,(%rdi)
   7:   4c 8d 44 24 08          lea    0x8(%rsp),%r8
   c:   4c 89 47 08             mov    %r8,0x8(%rdi)
  10:   48 89 5f 10             mov    %rbx,0x10(%rdi)
  14:   48 89 6f 18             mov    %rbp,0x18(%rdi)
  18:   4c 89 67 20             mov    %r12,0x20(%rdi)
  1c:   4c 89 6f 28             mov    %r13,0x28(%rdi)
  20:   4c 89 77 30             mov    %r14,0x30(%rdi)
  24:   4c 89 7f 38             mov    %r15,0x38(%rdi)
  28:   31 c0                   xor    %eax,%eax
  2a:   c3                      retq

Here we just want those bytes on the left. Add 0x to each and put it in an unsigned char array with static storage duration.

static unsigned char get_context_code[] = {
  0x4c, 0x8b, 0x04, 0x24,
  0x4c, 0x89, 0x07,
  0x4c, 0x8d, 0x44, 0x24, 0x08,
  0x4c, 0x89, 0x47, 0x08,
  0x48, 0x89, 0x5f, 0x10,
  0x48, 0x89, 0x6f, 0x18,
  0x4c, 0x89, 0x67, 0x20,
  0x4c, 0x89, 0x6f, 0x28,
  0x4c, 0x89, 0x77, 0x30,
  0x4c, 0x89, 0x7f, 0x38,
  0x31, 0xc0,
  0xc3
};

We’re almost done. This will end up inside the .data or .rodata section of our binary which isn’t an executable section. If we were to cast this array as a function type and try to call it, we’d just crash.

However, if you have a gcc or clang based compiler, you can add this single attribute before the array:

__attribute__((section(".text#")))

The '#' on the ".text" string is not a typo. It’s a comment token and necessary to silent a warning.

Similarly, if you’re MSVC based, you can add the following instead:

#pragma section(".text")
__declspec(allocate(".text"))

What this now does is move the array to the .text section of the executable instead, which is executable. Now we take a function pointer to this array of byte code.

static void (*get_context)(Context*) = (void (*)(Context*))get_context_code;

And just like that we’re done! No assembler needed. No build process changes needed. Just remember to document that spooky array of bytes. It’s sure to confuse the heck out of anyone who stumbles on it.

Considerations #

When implementing fibers you will run into strange behavior which you cannot explain, this is because fibers violate the flow of control that compilers and debuggers are expecting. The side effects a fiber has on program state is not visible to the compiler or debugger, this can make reasoning and debugging this stuff quite fun.

Barriers #

As already shown above, this user space context switching violates the flow of control compilers are expecting and so they may optimize around it incorrectly. Similarly, when we introduce our fiber scheduler we’ll have threads involved. As a result of this, it’s important that any resources you have that are manipulated through fibers have appropriate compiler barriers. Function calls are full compiler barriers and so most of your application state is fine. However, anything used to control flow based on the result of a fiber resuming needs to be carefully made volatile so the compiler doesn’t optimize it incorrectly.

Debuggers #

Some debugging tools do not enjoy having the stack pointer aimlessly changed out from underneath them. So debugging this stuff while you’re developing it can get a little tedious, fortunately there’s ways to mitigate this.

Address Sanitizer #

Address sanitizer will throw a fit when you switch the stack pointer, however you can teach it our new trick.

Expand code listing ↕

// These macros are defined by GCC and/or clang
#if defined(__SANITIZE_ADDRESS__) || __has_feature(address_sanitizer)
extern "C" {
  // Check out sanitizer/asan-interface.h in compiler-rt for documentation.
  void __sanitizer_start_switch_fiber(void **fake_stack_save, const void *bottom, size_t size);
  void __sanitizer_finish_switch_fiber(void *fake_stack_save, const void **bottom_old, size_t *size_old);
}
#define HAS_ASAN
#endif

#if defined(HAS_ASAN)
// This pointer will live on this stack frame.
void *fake_stack_save = nullptr;
__sanitizer_start_switch_fiber(
  &fake_stack_save,
  stack_data,
  stack_size);
#endif

// The context switch happens here in the middle.
swap_context(old_context, new_context);

#if defined(HAS_ASAN)
// When we return from the context switch we indicate that to ASAN.
void *bottom_old = nullptr;
size_t size_old = 0;
__sanitizer_finish_switch_fiber(
  fake_stack_save,
  &bottom_old,
  &size_old);
#endif

// Here you can assert |bottom_old| and |size_old| are what you expect when
// returning to this context.
//
// You can also pass NULL for the last two arguments if you don't care.

Thread Sanitizer #

Like address sanitizer, thread sanitizer needs to be taught of the stack switching behavior too. Why they don’t share the same interface is a bit annoying, but fortunately it’s not too complicated to deal with.

Expand code listing ↕

// These macros are defined by GCC and/or clang
#if defined(__SANITIZER_THREAD__) || __has_feature(thread_sanitizer)
extern "C" {
  // Check out sanitizer/tsan_interface.h in compiler-rt for documentation.
  void *__tsan_get_current_fiber(void);
  void *__tsan_create_fiber(unsigned flags);
  void __tsan_destroy_fiber(void *fiber);
  void __tsan_switch_to_fiber(void *fiber, unsigned flags);
}
#define HAS_TSAN
#endif

// This needs to be stored somewhere that the context we switch to will have
// access to it so it can switch back to it. This means it cannot be on this
// stack frame. We could store it in the stack frame of the context we switch
// to for example. Or we could just store it on the heap.
#if defined(HAS_TSAN)
void *this_fiber = __tsan_get_current_fiber();

// Here we just make a new fiber and immediately switch to it.
void *next_fiber = __tsan_create_fiber(0);
__tsan_switch_to_fiber(next_fiber, nullptr);
#endif

// The context switch happens here in the middle.
//
// It's important at this point that |old_context| and |new_context| are not
// on the stack, otherwise TSAN will complain since we just told it we've
// switched stacks.
//
// We CANNOT call |__tsan_switch_to_fiber| as the first thing in the context we
// switch to. The documentation explicitly forbids this.
swap_context(old_context, new_context);

// Somewhere in the context we just switch to, when switching back to this
// context we need to restore the fiber stored in |this_fiber|.
//
// So just before the |swap_context| we just call |__tsan_switch_to_fiber| again.
// this time restoring it.
//
// __tsan_switch_to_fiber(this_fiber, nullptr);

// We've returned from the context switch, we can now throw |this_fiber| out.
#if defined(HAS_TSAN)
__tsan_destroy_fiber(this_fiber);
#endif

Valgrind #

Since Valgrind is a full CPU emulator, it does an excellent job tracking stack pointer changes, however it needs to be taught of the stack’s size itself so it can help detect stack overflow and stack-use-after-free memory violations. This is pretty easy to do.

Expand code listing ↕

#if __has_include(<valgrind/valgrind.h>)
#include <valgrind/valgrind.h>
#define HAS_VALGRIND
#endif

// Before context switch, register our stack with Valgrind.
#if defined(HAS_VALGRIND)
unsigned stack_id = VALGRIND_STACK_REGISTER(stack_data, stack_data + stack_size);
#endif

// The context switch happens here in the middle.
swap_context(old_context, new_context);

// We've returned from the context switch, we can now throw that stack out.
#if defined(HAS_VALGRIND)
VALGRIND_STACK_DEREGISTER(stack_id);
#endif

GDB and LLDB #

GDB and LLDB are capable of following context switches but it’s a bit tedious to get automatic step through debugging. Teaching GDB and LLDB how to automatically follow context switches in this way requires we implement our own SystemTap probe. The documentation on SDT probes is fickle at best and I had to piece together a bunch of information from the Initial patches for GDB

Here we’re going to use some builtin non-local goto probes GDB and LLDB automatically recognizes for longjmp which are documented here

In particular, we’re going to teach GDB that our swap_context function is an implementation of longjmp, which isn’t strictly true, but it works. To do this we’re going to artificially extend our Context structure to contain a scratch piece of memory large enough to store some jmp_buf structure, just to appease the debugger. This should only be done for debug builds of our context switching code since we don’t want to make it larger than it needs to be for release.

struct Context {
  // ...
#ifndef NDEBUG
  char scratch[sizeof(jmp_buf)];
#endif
};

Now we begin to modify our swap_context to give it those probes.

Expand code listing ↕

// Note the need for the preprocessor here. This has to be assembled with the
// gcc or clang frontend and cannot be passed to the regular assembler.
#if __has_include(<sys/sdt.h>)
#include <sys/sdt.h>
#define HAS_SDT
#endif

swap_context:
  # Prologue follows
  # ...

  // Mark this function as a "longjmp" with the probe.
  //
  // %rdi contains our Context we're switching to, the +$64 is because the
  // jmp_buf is 64 bytes offset in the structure. The 8@ is because our registers
  // are 8 bytes in size.
  //
  // The $RETURN_ADDRESS is this return address when coming back from the the
  // "longjmp", presumably it's calculated somewhere in the context switch code
  // and stored in a register, this is left as an exercise to the reader.
#if !defined(NDEBUG) && defined(HAS_SDT)
  STAP_PROBE3(libc, longjmp, 8@%rdi+$64, $0, $RETURN_ADDRESS)
#endif

  // Context switch code follows here
  // ...
  // ...
  // ...

  // Now we've returned from the switch, need to inform GDB of that with
  // the other probe.
  //
  // Same arguments as above this time.
#if !defined(NDEBUG) && defined(HAS_SDT)
  STAP_PROBE3(libc, longjmp_target, 8@%rdi+$64, $0, $RETURN_ADDRESS)
#endif

  // Epilogue follows
  // ...

  ret

Signals #

Signals are sent to all threads in a given process. This is problematic for fiber code because having a signal delivered while in the middle of context switching would cause complications. The usual context switching methods seen in OSes and even the stackless context switch methods in standard C (setjmp and longjmp, or specifically in the case of POSIX sigsetjmp and siglongjmp) often need to save and restore signal masks. I don’t see why you would want fiber code to be delivered signals, so it’s easier to just block all signals from ever being delivered to the fiber scheduler threads permanently.

// Don't permit signal delivery to this thread.
sigset_t mask;
sigfillset(&mask);
assert(pthread_sigmask(SIG_BLOCK, &mask, nullptr) == 0 && "failed to block signals");

Note that signal delivery is problematic itself in multi-threaded code, never mind multi-threaded code which is changing it’s stack pointer constantly to switch through different cooperatively scheduled fibers. You likely already block signals because of this.

Blocking in the OS #

For reasons explained earlier, blocking OS interfaces should be avoided. You’ll be responsible for implementing your own syncronization primitives that schedule other fibers as opposed to blocking. The use of OS syncronization primitives in a fiber will lead to deadlocks. Similarly, the use of IO in a fiber is simply inefficient and we’ll discuss ways to avoid this by implementing our own async IO and something called a reactor.

Thread Local Storage #

Compilers aggressively optimize the use of TLS by caching loads. Since fiber code can migrate between threads it’s possible for the load to be stale. To prevent this from happening you should define TLS accessor functions in a separate translation unit and call the function everytime you need the value behind it.

Implementing M:N fibers #

Stack allocator #

For fibers, general purpose memory allocation methods (e.g malloc or new) will be too slow for allocating fiber stacks relatively speaking. Upfront allocation of an arena for stacks and a free-list is not only easy to implement, it’s ideal. You can protect the list with a simple spin-lock or you can go all out and implement a lock-free algorithm.

The free-list is ideal because it naturally gives you the most recently deallocated stack on allocation. This is great for cache purposes even if you intend to replace that data. Consider for example fiber B reuses the stack allocation from fiber A which just completed, and the first thing fiber B does is a four byte write of an integer on the stack. That integer only represents a fraction of a cache-line, a cache-line that is likely still in cache. This avoids the need to go all the way to system memory just to write an integer on the stack.

With very small fibers with small stacks complelting very quickly, this stack reuse has a negligible performance advantage in my benchmarks.

There’s some things to consider when allocating stacks:

You should delay the allocation of the stacks until the scheduler is ready to execute your fiber. If you prepare multiple fibers ahead of time you’ll quickly burn through a lot of memory otherwise, despite the stacks being freed for reuse as fibers complete. Similarly, the above stack reuse optimization couldn’t work otherwise.
You should be cognisant of how much stack memory you’re giving to your fibers. While it may seem like a little, when multiplied over tens of thousand as an example, it adds very quickly, even with the delayed allocation and reuse.
Your fibers are not all going to need the same stack sizes, you should have multiple size classes to pick from which are multiples of each other so you can minimize memory usage and maximize on reuse.

Pinning the threads #

It’s important that the OS threads your scheduler is running on to schedule fibers are pinned to hardware threads, that is we don’t want them migrating between hardware threads. There’s a few reasons for this:

Timing needs to be done with hardware timestamps and not clock interfaces in the OS. Hardware timestamps often drift between CPU cores (depending on the OS), and OS timing interfaces may involve system calls, which themselves invoke context switches which we want to avoid.
It’s less efficient. There’s not much reason for our threads to migrate because we’ll be migrating fibers ourselfs. If it’s not obvious by now, we are the scheduler, the OS is just getting in our way.
Debugging will suffer.

Every OS tackles the process of pinning OS threads to a given hardware thread a bit differently. On Linux there is pthread_setaffinity_np to set affinity masks. Windows has SetThreadAffinityMask, etc.

Sleeping #

If you require timed sleeping primitives, you’ll need to implement those yourself in your fiber scheduler. Dedicate a hardware thread to put asleep-fibers on the ready queue when any timed waits they’re sleeping on expire. I would try to avoid timed sleeping primitives because they necessitate preemption in some form.

Syncronization Primitives #

You’ll have to implement fiber-safe syncronization primitives because in M:N, fibers can migrate between threads.

The basis for most of the syncronization primitives will be a spin lock. There’s too many ways to get a spin lock implementation wrong, so you should avoid implementing one yourself unless you know what you’re doing. There’s excellent implementations in libraries like Concurrency Kit. However, if you do feel like implementing one yourself. Keep in mind that the properties we’re looking for is a lock that does not tell the OS to yield other threads. Not only is this inefficient because it would imply a context switch. It’s wrong because the switch in the contended case needs to be a user-space fiber-thread switch and not a kernel-space OS- thread switch.

Avoid the PAUSE instruction #

One common trick on x86 and x86_64 for spinlocks is to use the PAUSE instruction to hint to the processor we’re in a spinlock. This used to be the preferred method, however modern x86 CPUs have significantly increased the latency of the PAUSE instruction.

The following information can be found in the Intel Optimization Manual in 8.4.7:

The latency of PAUSE instruction in prior generation microarchitecture is about 10 cycles, whereas on Skylake microarchitecture it has been extended to as many as 140 cycles.

As the PAUSE latency has been increased significantly, workloads that are sensitive to PAUSE latency will suffer some performance loss.

The last sentence in particular is us. Our fibers are a workload that is sensitive to PAUSE latency. We do not want to be doing that.

The general approach #

The general approach to implementing the syncronization primitivies once you have a spinlock implementation is to implement mutex and condition variables. Once you have these two primitives, all other primitives can be implemented through them. Barriers, semaphores, latches, etc.

There is no shortage of algorithms for implementing mutexes and condition variables. Use which ever one, just remember you do not want to wait in user or kernel space. You want to put your fiber to sleep and schedule another fiber in it’s place.

Fiber Local Storage #

The context of a fiber can be extended to include a typeless store for fiber local data as a suitable replacement for TLS.

Reactor #

One of the problems fibers have as mentioned, is the inability to make calls to blocking OS interfaces, such as IO. The solution to these problems is to dedicate a thread of execution for performing nothing but blocking operations and IO on instead.

The fundamental idea here can be broken down into a couple steps:

Fiber queues the IO operation to take place.
Fiber is put to sleep and yields to another fiber.
Reactor thread sees the IO operation to take place and executes it.
Reactor polls for completion in some way.
Reactor wakes up the fiber put to sleep in step two when operation completes.

The idea here is while the scheduler schedules other fibers during the IO period, the fiber the IO is done on never gets scheduled because it has been put to sleep. It’s only when the reactor thread satisifes the given IO operation does that fiber get woken up again.

It seems simple in practice to implement, because fundamentally it is. However, step four in particular is made complicated because every platform has a different mechanism to achieve it.

POSIX has poll. Linux has epoll and io_uring. FreeBSD and macOS have kqueue. Windows has poll (but only for files, not sockets) and IOCP.

As you can see by now, implementing this for multiple platforms requires multiple implementations of what is effectively an event loop. You can save yourself a lot of time and effort here if you pick an off-the-shelf event loop solution like libuv.