Sunday, November 14, 2010

RTEMS Modular Task Scheduler

As I mentioned in my last post, this past summer I participated in the Google Summer of Code by working on the RTEMS project. I have hopefully ironed out the last few small details that prevented the basic infrastructure that I developed from being accepted into the RTEMS code base. In this post I describe the design of the new scheduler, and also comment on some of the problems that I faced when merging the work. At the end of the post is a list of remaining work for this project.

Pre-existing RTEMS Scheduler
New Modular Scheduler
  Scheduler Control Structure
  Ready Queue Encapsulation
  Per-Thread Metadata
  Scheduler Operations Table
  Configuring the Scheduler
Merging the New Scheduler
  Short-circuit Evaluation
  PowerPC Overheads
Open Issues
  Improper Encapsulation
  Interrupt Service Routine (ISR) Latency
  Documentation
  Configuration
  New Technologies



Pre-existing RTEMS Scheduler

A structure of chains called the ready chains is used by the scheduler to manage the set of ready tasks and assign the highest priority task to the CPU. The ready queue is implemented as a 256-element array of FIFO lists. Every element (list) of the array represents a priority level, with level zero designating the highest priority. Tasks with equal priority are placed on the FIFO list associated with their priority level. The highest priority task is the first task on the first non-empty FIFO list, scanning from the beginning of the ready queue.

Finding the highest priority task would be time-consuming if it required a linear search for the first non-empty FIFO list. Therefore, a 256-bit bitfield is maintained in which every non-empty FIFO list has a set bit corresponding to its priority level. The search is a two-level lookup requiring two 16-bit scans. The first scan determines the "major" portion of the priority (e.g. the upper nibble), and the second scans the 16 bits associated with that range of priorities to determine the "minor" portion (e.g. the lower nibble). The bitfield scanning is handled by low-level CPU-dependent instructions, or by highly optimized portable C when special instructions are not available. Thus the first non-empty list can be found efficiently and in a fixed length of execution time. Note that the executing task remains in the ready queue; a task is removed from the ready queue only when it terminates or becomes blocked.
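To make the two-level lookup concrete, here is a minimal sketch in C. The helper first_set_bit() and the bit ordering are assumptions for illustration only; the real RTEMS code uses CPU-specific bit-scan instructions or optimized portable C, with its own bit-numbering conventions:

#include <stdint.h>

/* Hypothetical portable bit scan: returns the index (0..15) of the
   highest-order set bit, counting from the most significant bit.
   Assumes at least one bit is set (i.e., at least one ready task). */
static unsigned int first_set_bit( uint16_t value )
{
  unsigned int index = 0;

  while ( (value & 0x8000) == 0 ) {
    value <<= 1;
    index++;
  }
  return index;
}

/* Two-level lookup: the major scan finds the non-empty group of 16
   priority levels; the minor scan finds the exact level within that
   group. Priority 0 is the highest. */
static unsigned int highest_ready_priority(
  uint16_t major_bit_map,
  uint16_t minor_bit_maps[ 16 ]
)
{
  unsigned int major = first_set_bit( major_bit_map );
  unsigned int minor = first_set_bit( minor_bit_maps[ major ] );

  return (major << 4) + minor;   /* 16 * major + minor, in 0..255 */
}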

Manipulation of the ready chains (the array of FIFOs) happens throughout the thread management code. Additionally, functions in the thread manager access the ready chains through the thread control block (TCB). The goal of my project was to encapsulate the ready queue structure within a scheduler subsystem, thus hiding the implementation from the rest of the RTEMS kernel, especially the thread management code. I was mostly successful.

New Modular Scheduler

To provide maximum flexibility while keeping overhead low, the new scheduler is dynamically configurable yet avoids runtime checks.

Scheduler Control Structure

The scheduler is implemented as a stand-alone data structure with associated functions, and also as some additional per-thread and system-wide bookkeeping. The main structure for the scheduler is the Scheduler_Control, defined as:
typedef struct Scheduler_Control_struct {
  /** 
   *  Pointer to the data structure used to manage 
   *  the ready set of tasks. Varies based on 
   *  ready queue required by the scheduler.
   */
  union{
    /** 
     * This is the set of lists (array) 
     * for priority scheduling. 
     */
    Chain_Control            *Priority;
  } ready_queues;

  /** Jump table for scheduler-specific functions */
  Scheduler_Operations                  operations;
} Scheduler_Control;
Scheduler_Operations is explained later. Notice that the Scheduler_Control embeds the scheduler's ready queue via a pointer in a union. This is for extensibility: the same basic Scheduler_Control structure can serve another scheduling algorithm with a different ready queue structure simply by adding another pointer to the ready_queues union. Any number of ready queue types can thus be added to the code base, but at runtime only the ready queue actually used by the configured scheduler is accessible. The pointer is initialized during scheduler initialization.
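As a hypothetical illustration of that extensibility (EDF_Ready_queue is an invented type, not part of the merged code), a new scheduling algorithm would simply contribute another member:

union {
  /** Set of FIFO lists (array) for priority scheduling. */
  Chain_Control     *Priority;

  /** Hypothetical: a deadline-ordered structure that an
      EDF scheduler might add. */
  EDF_Ready_queue   *EDF;
} ready_queues;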

Ready Queue Encapsulation

Each scheduler implementation encapsulates its ready queue behind an opaque Scheduler Interface, consisting of the following abstract methods:

/**
 * Update the Thread_Heir with the
 * highest priority ready thread.
 */
_Scheduler_Schedule( void )

/**
 * Currently executing thread
 * voluntarily gives up the CPU.
 * Might cause a thread dispatch.
 */
_Scheduler_Yield( void )

/**
 * tcb is removed from the set of
 * ready threads.
 * Might cause a thread dispatch.
 */
_Scheduler_Block( Thread_Control *tcb )

/**
 * tcb is added to the set of
 * ready threads.
 * Might cause a thread dispatch.
 */
_Scheduler_Unblock( Thread_Control *tcb )

The scheduler does not care why a thread is blocked or unblocked; thread state management is not the responsibility of the scheduler.

These four functions form the interface between the scheduler and the rest of the kernel for operations that touch the ready queue. I defined this interface by examining how the existing ready queues were used and isolating the common patterns, then hid the ready queue behind this Scheduler interface. In the original RTEMS scheduling mechanism, this was the Chain_Control _Thread_Ready_chain. Other variables that might be subsumed by the scheduler include _Thread_Executing and _Thread_Heir, but these are transitioning to the per-CPU code base, so I left them alone for the time being.
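For example, where the thread manager once manipulated _Thread_Ready_chain directly when blocking a thread, it now calls through the interface. The following is a simplified sketch, not the verbatim RTEMS routine (the real _Thread_Set_state does more bookkeeping and runs under ISR protection):

void _Thread_Set_state(
  Thread_Control *the_thread,
  States_Control  state
)
{
  /* Record why the thread is blocked (simplified). */
  the_thread->current_state = state;

  /* The scheduler only needs to know that the thread is no
     longer ready; the reason is irrelevant to it. */
  _Scheduler_Block( the_thread );
}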

Per-Thread Metadata

Much of the complexity in the new scheduler lies in flexible and efficient management of per-thread scheduler-related data. The current solution embeds a pointer in the TCB to a per-thread scheduling structure that is dynamically allocated at task creation time. In actuality, a union of such pointers is placed in the TCB. More on that later.

To support the pre-existing priority-based scheduler, the following per-thread scheduling data is instantiated for each tcb:

/**
 * Per-thread data related to the
 * _Scheduler_PRIORITY scheduling policy.
 */
typedef struct  {
  /** This field points to the Ready FIFO
      for this thread's priority. */
  Chain_Control *ready_chain;

  /** This field contains precalculated
      priority map indices. */
  Priority_bit_map_Information Priority_map;
} Scheduler_priority_Per_thread;
This structure is used mainly to assist in finding the ready queue for the enclosing tcb.  Management of the per-thread metadata is done through three more abstract Scheduler methods:
/**
 * This routine allocates @a the_thread->scheduler.
 */
void * _Scheduler_Thread_scheduler_allocate(
    Scheduler_Control   *the_scheduler,
    Thread_Control      *the_thread
);

/**
 * This routine frees @a the_thread->scheduler.
 */
void _Scheduler_Thread_scheduler_free(
    Scheduler_Control   *the_scheduler,
    Thread_Control      *the_thread
);

/**
 * This routine updates @a the_thread->scheduler  
 * based on @a the_scheduler and thread state
 */
void _Scheduler_Thread_scheduler_update(
    Scheduler_Control   *the_scheduler,
    Thread_Control      *the_thread
);
The first two functions are invoked only during task creation and deletion. The last, _Scheduler_Thread_scheduler_update, is currently invoked when a thread changes its priority, but it is intended for any situation in which the per-thread scheduler metadata needs to be updated. That may eventually require additional flags to indicate which event caused the update, but for now the only update event is a priority change.
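As a sketch of what the priority scheduler's update might do, based on the Scheduler_priority_Per_thread fields shown above (the field access path and the bit map helper name are assumptions for illustration, not the merged code):

void _Scheduler_priority_Thread_scheduler_update(
  Scheduler_Control   *the_scheduler,
  Thread_Control      *the_thread
)
{
  Scheduler_priority_Per_thread *sched_info =
    the_thread->scheduler.priority;   /* assumed union member name */

  /* Point at the FIFO for the thread's (possibly new) priority. */
  sched_info->ready_chain =
    &the_scheduler->ready_queues.Priority[ the_thread->current_priority ];

  /* Recompute the precalculated major/minor indices used by the
     two-level bit map lookup (hypothetical helper name). */
  _Priority_bit_map_Initialize_information(
    &sched_info->Priority_map,
    the_thread->current_priority
  );
}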

Scheduler Operations Table

A function jump table is embedded in the Scheduler_Control structure to provide flexibility via dynamic binding of the scheduling operations at runtime. The Scheduler_Operations structure looks like:
/**
 * function jump table that holds
 * pointers to the functions that
 * implement specific schedulers.
 */
typedef struct {
  /** Implements the scheduling decision logic. */
  void ( *schedule ) ( Scheduler_Control * );

  /** Voluntarily yields the processor. */
  void ( *yield ) ( Scheduler_Control * );

  /** Removes the given thread from scheduling decisions. */
  void ( *block ) ( Scheduler_Control *, Thread_Control * );

  /** Adds the given thread to scheduling decisions. */
  void ( *unblock ) ( Scheduler_Control *, Thread_Control * );

  /** Allocates the scheduler field of the given thread. */
  void * ( *scheduler_allocate ) ( Scheduler_Control *, Thread_Control * );

  /** Frees the scheduler field of the given thread. */
  void ( *scheduler_free ) ( Scheduler_Control *, Thread_Control * );

  /**
   * Updates the scheduler field of the given thread,
   * primarily used when changing the thread's priority.
   */
  void ( *scheduler_update ) ( Scheduler_Control *, Thread_Control * );
} Scheduler_Operations;
These function pointers correspond to the functions described above; indeed, the function pointers are wrapped by inline functions. A scheduler implementation defines functions that match the function pointer signatures and then installs a Scheduler_Operations table. The real magic of the modular scheduler is in how a scheduler is configured by an application.
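The wrapping looks roughly like this (a sketch assuming the global _Scheduler instance used later in this post; RTEMS_INLINE_ROUTINE is the usual RTEMS inlining macro):

RTEMS_INLINE_ROUTINE void _Scheduler_Block(
  Thread_Control *the_thread
)
{
  /* Dispatch through the operations table of the active scheduler. */
  _Scheduler.operations.block( &_Scheduler, the_thread );
}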

Configuring the Scheduler

There are two main ways to provide configuration in RTEMS:
  • static configuration using configure (autoconf) options
  • dynamic configuration using confdefs.h options
These two approaches have different strengths and weaknesses.

configure is most useful for coarse-grained configuration options that can provide conditional compilation. Its main advantage is that it can drastically reduce the size of the compiled binary image by removing large subsystems that a user does not want included, for example support for POSIX. Further, static configuration can reduce runtime overheads because configuration options are not checked at runtime. However, configure options are less maintainable, pollute the source code with RTEMS-specific #ifdef blocks, and cannot be changed without rebuilding the application to produce a new binary image.

confdefs.h is the main way in which users configure certain aspects of RTEMS at runtime, by changing values in configuration tables. Its main advantage is flexibility: a user can change the configuration tables and reset the application (board) to change a system's configuration without uploading a new binary image. The main disadvantages of confdefs.h are that it cannot provide conditional compilation and that checking the configuration options adds runtime overhead, although that overhead can mostly be confined to the application's initialization phase.

I provide a confdefs.h based configuration for the new scheduler. The scheduler configuration allows an application to select the scheduling policy to use. The supported configurations currently are:
  • CONFIGURE_SCHEDULER_USER
  • CONFIGURE_SCHEDULER_PRIORITY
If no configuration is specified by the application, then CONFIGURE_SCHEDULER_PRIORITY is the default. An application can define its own scheduling policy by defining CONFIGURE_SCHEDULER_USER and pointing CONFIGURE_SCHEDULER_ENTRY_USER at an initialization routine. However, the USER scheduler path has not actually been tested. I also have two additional scheduling algorithms waiting in the wings.
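For example, a hypothetical application supplying its own scheduler would configure it along these lines (my_scheduler_initialize is an application-provided routine, and as noted, this path is untested):

/* Hypothetical application configuration for a user scheduler. */
#define CONFIGURE_SCHEDULER_USER
#define CONFIGURE_SCHEDULER_ENTRY_USER { my_scheduler_initialize }

/* ... other CONFIGURE_* options ... */

#define CONFIGURE_INIT
#include <rtems/confdefs.h>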

The confdefs.h logic is, I hope, easy to follow. The first thing is to ensure the scheduler structures are loaded, and then check for the USER scheduler:
#include <rtems/score/scheduler.h>
#if defined(CONFIGURE_SCHEDULER_USER) && \
    !defined(CONFIGURE_SCHEDULER_ENTRY_USER)
  #error "CONFIGURE_ERROR: CONFIGURE_SCHEDULER_USER \
          without CONFIGURE_SCHEDULER_ENTRY_USER"
#endif
There is a configuration option to include all of the schedulers available in the RTEMS sources in the built application, although like the USER scheduler this feature is untested:
#if defined(CONFIGURE_SCHEDULER_ALL)
  #define CONFIGURE_SCHEDULER_PRIORITY
#endif
Next, the user configuration options are processed. If no scheduler is specified, then the priority-based scheduler is configured:
/* If no scheduler is specified, 
   the priority scheduler is default. */
#if !defined(CONFIGURE_SCHEDULER_USER) && \
    !defined(CONFIGURE_SCHEDULER_PRIORITY)
  #define CONFIGURE_SCHEDULER_PRIORITY
  #define CONFIGURE_SCHEDULER_POLICY _Scheduler_PRIORITY
#endif
As another special check, if a USER scheduler is provided without a policy, the scheduler policy must be defined accordingly:
/*
 * If a user scheduler is specified and no policy is set, 
 * the user scheduler is the default policy.
 */
#if defined(CONFIGURE_SCHEDULER_USER) && \
    !defined(CONFIGURE_SCHEDULER_POLICY)
  #define CONFIGURE_SCHEDULER_POLICY _Scheduler_USER
#endif
Next, each of the configured schedulers is processed; for now, this is just the priority scheduler. Here is where the real magic happens: the priority-based scheduling structures and code are included only if the priority scheduler is configured for the user's application. At the end of this block, the memory requirements for including the priority scheduler are computed:
/* 
 * Check for priority scheduler next, 
 * as it is the default policy if there
 * is no CONFIGURE_SCHEDULER_POLICY set 
 * and no USER scheduler provided.
 */
#if defined(CONFIGURE_SCHEDULER_PRIORITY)
  #include <rtems/score/schedulerpriority.h>
  #define CONFIGURE_SCHEDULER_ENTRY_PRIORITY \
     { _Scheduler_priority_Initialize }
  #if !defined(CONFIGURE_SCHEDULER_POLICY)
    #define CONFIGURE_SCHEDULER_POLICY _Scheduler_PRIORITY
  #endif
  
  /**
   * define the memory used by the priority scheduler
   */
  #define CONFIGURE_MEMORY_SCHEDULER_PRIORITY ( \
    _Configure_From_workspace( \
      ((CONFIGURE_MAXIMUM_PRIORITY+1) * \
        sizeof(Chain_Control)) ))

  #define CONFIGURE_MEMORY_PER_TASK_SCHEDULER_PRIORITY ( \
    _Configure_From_workspace( \
      sizeof(Scheduler_priority_Per_thread)) )
#endif
After adding all of the configured schedulers, the scheduler table is configured. During scheduler initialization, the scheduler table is indexed to select the active scheduler. This configuration mechanism is based on the one used for filesystem configuration in the RTEMS confdefs.h:
/* 
 * Set up the scheduler table.  
 * The scheduling code indexes this table to 
 * invoke the correct scheduling implementation. 
 * The scheduler to use is determined by the 
 * Configuration.scheduler_policy field, which is set
 * by CONFIGURE_SCHEDULER_POLICY. If a particular scheduler 
 * is not enabled, an empty entry is included in its 
 * entry in the scheduler table.
 */

  /**
   * An empty scheduler entry
   */
  #define CONFIGURE_SCHEDULER_NULL { NULL }

#ifdef CONFIGURE_INIT
  /* the table of available schedulers. */
  const Scheduler_Table_t _Scheduler_Table[] = {
    #if defined(CONFIGURE_SCHEDULER_USER) && \
        defined(CONFIGURE_SCHEDULER_ENTRY_USER)
      CONFIGURE_SCHEDULER_ENTRY_USER,
    #else
      CONFIGURE_SCHEDULER_NULL,
    #endif
    #if defined(CONFIGURE_SCHEDULER_PRIORITY) && \
        defined(CONFIGURE_SCHEDULER_ENTRY_PRIORITY)
      CONFIGURE_SCHEDULER_ENTRY_PRIORITY,
    #else
      CONFIGURE_SCHEDULER_NULL,
    #endif
  };
#endif
Finally, the memory overheads for each of the configured schedulers should be added together to compute the scheduling subsystem's memory requirements. Note that this does not currently support estimating any USER scheduler, only the RTEMS-provided schedulers:
/**
 * Define the memory overhead for the scheduler
 */
#define CONFIGURE_MEMORY_FOR_SCHEDULER ( \
  CONFIGURE_MEMORY_SCHEDULER_PRIORITY \
)

#define CONFIGURE_MEMORY_PER_TASK_FOR_SCHEDULER ( \
  CONFIGURE_MEMORY_PER_TASK_SCHEDULER_PRIORITY \
)
The value of CONFIGURE_SCHEDULER_POLICY is propagated into the newly added scheduler_policy field of the RTEMS configuration table:
/** This field specifies the scheduling policy to use. */
uint32_t scheduler_policy;

At runtime, during scheduler initialization, the scheduler table is indexed by the value of scheduler_policy to call the configured scheduler's initialization routine, which is responsible for initializing its operations table and ready queue structures:
void _Scheduler_Handler_initialization( void )
{
  Scheduler_Control *the_scheduler = &_Scheduler;

  (*(_Scheduler_Table[
       Configuration.scheduler_policy
     ].scheduler_init))( the_scheduler );
}

Merging the New Scheduler

RTEMS uses Bugzilla to vet code changes; I submitted the scheduler code as PR 1647. Aside from a few stylistic issues, there were three blockers I needed to resolve: two dealt with introduced performance (latency) overheads, and one was that the scheduler's memory overhead was not being calculated. The memory overhead issue was corrected in confdefs.h and is documented above. The other two issues are described in the following.

Short-circuit Evaluation

There were some fairly substantial latency increases when running the RTEMS tmtests on the SPARC sis simulator. The extra overhead came primarily from doing bit-scan operations in cases where a short-circuit is possible: either the heir (RTEMS terminology for the highest priority ready thread) changed explicitly (e.g. by unblocking a thread of higher priority than the currently executing one), or there were no other threads ready to take over the CPU from a yielding thread. These overheads were eliminated by adding checks for these special cases.
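The unblock case illustrates the idea. This is a hypothetical sketch, not the merged code; the enqueue helper name is invented, and the globals are the _Thread_Heir and _Thread_Executing variables mentioned earlier:

void _Scheduler_priority_Unblock(
  Scheduler_Control *the_scheduler,
  Thread_Control    *the_thread
)
{
  _Scheduler_priority_Ready_queue_enqueue( the_thread );

  /* Short-circuit: the bit-scan is unnecessary unless the unblocked
     thread outranks the current heir (lower number = higher priority). */
  if ( the_thread->current_priority < _Thread_Heir->current_priority ) {
    _Thread_Heir = the_thread;
    if ( _Thread_Executing->is_preemptible )
      _Context_Switch_necessary = true;
  }
}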

Latency on task_create and task_delete is still increased, because of the additional allocation and freeing of per-thread scheduling metadata.

PowerPC Overheads

Testing the scheduler on the PowerPC architecture (psim simulator) exposed a few more issues. Since these issues were not raised during the SPARC tests, I suspected that architectural differences were to blame. The PowerPC test exposed the problem with calculating the scheduler's memory overhead: the PowerPC tests include floating point tasks, which changes the memory requirements and puts more pressure on the kernel's Workspace (heap). The other PowerPC problems were performance latency increases that were not observed on the SPARC.

Because of the SPARC's register windowing feature, function calls are extremely cheap up to a certain depth (about 6 calls), after which calls become more expensive due to pressure on the register windows. In contrast, the PowerPC has a more traditional "flat" register file, which means a function call may need to save caller- and callee-saved registers. The PowerPC therefore pays a higher function call overhead at the small call depths where the SPARC's cost is almost zero.

Originally, I had designed the modular scheduler with a modular ready queue as well, which potentially allows schedulers to share some ready queue operations and structures. However, the extra flexibility was implemented with another function call table. With the use of function pointers, the ready queue (and scheduler) functions were no longer inlined. This was adding at least two function calls on every scheduler invocation.

After making the ready queue implicitly part of the scheduler, I was able to make all of the ready queue functions inline within the scheduler implementation. Inlining the ready queue saved much of the overhead that remained, although there is still some overhead from adding one level of function calls to scheduling events.

Open Issues

There are still a few lingering items to be done.

Improper Encapsulation

There remain two places in the thread management code that are not properly encapsulated: threadsettransient.c and threadchangepriority.c both make direct modifications of the priority scheduler ready queue. Down the line it may become necessary to extend the scheduler interface to the ready queue for one or both of these purposes.

Interrupt Service Routine (ISR) Latency

Most of the scheduler functions assume that they are called with protection enabled (i.e. with interrupts disabled). However, they do not take any arguments that would let them temporarily release protection, for example to do an ISR_Flash(level). So although the functionality is identical, the refactoring introduces some change to the interrupt latency of the scheduling code.
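For reference, the standard RTEMS idiom that a level argument would enable looks like this; the scheduler functions currently have no way to perform the flash step on behalf of their callers:

ISR_Level level;

_ISR_Disable( level );   /* enter the critical section */

/* ... first part of the ready queue manipulation ... */

_ISR_Flash( level );     /* briefly allow pending interrupts,
                            bounding worst-case ISR latency */

/* ... remainder of the critical section ... */

_ISR_Enable( level );    /* leave the critical section */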

For uniprocessor scheduling we can simply pass the nested ISR 'level' as an argument to the scheduler; however, for SMP this will not work. Joel said that his "design thought was to have processor interrupts off locally while you owned an 'interrupt disable lock'". So it may work just fine if the SMP interrupt disable lock is known.

Documentation

This blog post is actually a step in this direction. After I add some more scheduling algorithms, I will need to explain when a user should select specific algorithms. Additionally, other implementers will need to be able to navigate the new scheduling infrastructure, especially when debugging or adding schedulers. I am also thinking about writing up a technical report to cover some of this material, or to demonstrate some useful features of the new scheduler. Unfortunately, this is off the critical path.

Configuration

Joel suggests that I consider a way to "[m]ove scheduler policy out of the Configuration Table and make it so the user is setting a variable that we can directly use in the Score. This ties in to simplifying the initialization of the scheduler below where you set some fields."

There are some other issues with configuration. For one thing, the USER configuration option should be either tested or eliminated. Also, use cases for multiple scheduler algorithms concurrently being available (although NOT concurrently used!) should be identified.

New Technologies

Finally, using the new modular scheduler to explore new ways of scheduling is the motivation for this project. I have already lined up two new scheduling algorithms, the Earliest Deadline First algorithm and a simple First-In First-Out scheduler. Determining how well this new infrastructure will support SMP scheduling remains, in my opinion, the primary direction coming out of this project.
