Difference between revisions of "Memory Type Based Allocation"

From eLinux.org

Revision as of 12:12, 11 December 2006

Introduction

This specification describes the design for a Linux kernel memory manager that can locate a program’s executable code and data in different physical memory devices.

Purpose of Feature

Embedded systems can use this feature to locate a program’s text and data segments in specific memory devices. Shared library text and data segments can also be targeted to specific memory devices. For instance, frequently executed code, such as glibc or "ls", could be located entirely in a single specified memory device or a set of memory devices. Glibc text/data could be targeted to a fast static RAM bank for instance, while other less frequently referenced libraries and programs could be located in slower DRAM.


Feature Requirements

  1. All of a program’s segments must be locatable in specified memory devices: text, initialized data (data), uninitialized data (bss), heap (brk), and stack.
  2. The loadable segments of shared libraries (text and initialized data) must be locatable in specified memory devices.
  3. The ELF binaries of programs and shared libraries must contain memory device information for each of the binaries’ loadable segments (text and initialized data). This must be in the form of mnemonic strings. For instance: "SRAM", "SDRAM", etc.
  4. A tool will be provided to mark the ELF binaries with memory device information for each of the loadable segments.
  5. A kernel API must be provided for kernel code (such as device drivers) to allocate whole page frames from specified memory devices.
  6. A kernel API must be provided for kernel code to allocate memory using the slab allocator (kmalloc()) from specified memory devices.
  7. A user-level API must be provided for user programs to create mappings, using the mmap() system call, that will allocate page frames for the mapping in specified memory devices.
  8. A /proc filesystem interface must be provided that prints the kernel’s node configuration.

High Level Design

Memory devices in Memory Type Based Allocation (MTA) are based on discontiguous memory support. Traditionally, discontiguous memory is meant for platforms whose system memory is not contiguous in the physical memory map. Discontiguous memory in Linux in turn is based on Non-Uniform Memory Access (NUMA) nodes. Each discontiguous memory bank is represented by a NUMA node. Therefore in MTA memory devices are also synonymous with NUMA nodes. Note: to execute a user program directly out of ROM, such as flash, requires a totally different approach from that described here.

To understand MTA, it’s best to first describe the memory device type information contained in the ELF binary of programs and shared libraries. Then we describe how memory nodes are configured in the kernel. We then follow the path and morphing of the memory type data from its source (the ELF binary) until it reaches the lowest level: when it is used to allocate a page frame for the process during a page fault exception.

Memory Type Information in ELF Binaries and the Elfmemtypes Utility

In MTA, memory device type information is added to the ELF binaries of programs and shared libraries using the elfmemtypes utility. This information is then passed down to the mmap() and brk() calls to create new memory regions for the process. The elfmemtypes tool adds memory type information by adding a new NOTE section with the name ".memtypes" to the ELF binary. It does this by forking and running objcopy as follows:

objcopy --add-section .memtypes=[temp binary file] [ELF file]

The memory type mnemonic strings specified to the tool are copied to a temporary file, and that file is passed to objcopy, which copies the temporary file’s contents to the new .memtypes section. Currently, the elfmemtypes tool allows specifying memory types for the text segment and data segment. The text segment includes code and read-only data sections, and therefore all these sections will be allocated to the memory types specified for text. Likewise, the data segment includes initialized data (data) and uninitialized data (bss), so all these sections will be allocated to the memory types specified for data. Also, although there are no heap (brk) and stack sections defined for ELF binaries, heap and stack regions for the new process currently use the memory types specified for data. A future enhancement will be to allow data, bss, brk, and stack regions to have their own memory types. The command line arguments to the tool are as follows to mark an ELF binary:

elfmemtypes [ELF file] [{text|data} [space-separated list of mnemonics]]

An example command line might be:

elfmemtypes /bin/bash text SRAM SDRAM0 ANY data SDRAM1

In the example, /bin/bash is marked so that its text segment will have physical memory allocated to it from the memory node named SRAM. If allocation from SRAM fails, pages are allocated from SDRAM0, and if allocation from SDRAM0 also fails, from any available node. Finally, /bin/bash is marked so that its data segment only allows allocation of physical memory from the memory node named SDRAM1. A more detailed description of the algorithm for allocating physical pages using the above memory node lists is discussed later. Note that the mnemonics ANY, any, text, and data are reserved names, i.e. they cannot be used for memory type mnemonic names. If a .memtypes NOTE section already exists in the ELF file, the memory types specified in the section will be left undisturbed unless they are overridden on the command line. For example, if the existing .memtypes NOTE section lists memory types for both text and data, but the command line specifies only data memory types, the existing text memtypes will be left unchanged, but the data memtypes will be modified. The elfmemtypes tool can also be used to display the current memory type information in an ELF file, or clear out all memory type information from the file. The command line for such cases is as follows:

elfmemtypes [ELF file] [{show|clear}]

(or just elfmemtypes [ELF file] to display the current memory type information). When clearing an ELF file, elfmemtypes simply removes the .memtypes NOTE section by forking and running objcopy like so:

objcopy --remove-section=.memtypes [ELF file]

Note that a non-MTA configured kernel or non-MTA aware ld.so can still load ELF executables and shared libraries that contain a .memtypes NOTE section, since this section will just be ignored. Note also that elfmemtypes does not check whether a memory type name corresponds to any kernel node names. This is because the tool is meant to be a cross tool as well as a native tool. As a cross tool, elfmemtypes has no way of knowing the node names of the target kernel. See the "load_elf_binary()" section below to see how the kernel handles unknown memory mnemonics in the .memtypes NOTE section. As a native tool, it is possible for elfmemtypes to compare memory mnemonic names with kernel node names by reading /proc/nodeinfo (described later), and this could be a future enhancement. The structure of the new .memtypes NOTE section in the ELF file added by the tool is shown below:

typedef struct elf32_memtypes_note {
	Elf32_Nhdr nh;
	char note_name[16];
	Elf32_Word num_text_strings;
	Elf32_Word text_string_size;
	Elf32_Word num_data_strings;
	Elf32_Word data_string_size;
	char memtype_strings[0];
} Elf32_MemTypesNote;

The nh member contains the NOTE header, note_name is the name of the NOTE ("memtypes"), and the rest specify the number and total size of the text and data mnemonic strings. The member memtype_strings then marks the start of the null-terminated mnemonic strings, beginning with text. The data strings immediately follow the text strings, so the data strings begin at &memtype_strings[text_string_size].
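The layout above can be walked with ordinary pointer arithmetic. The following is a minimal user-space sketch, not code from the MTA patch: the type struct memtypes_counts and the helpers packed_string_at() and data_strings_start() are hypothetical names, and uint32_t stands in for Elf32_Word.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the four count/size fields of the NOTE. */
struct memtypes_counts {
	uint32_t num_text_strings;
	uint32_t text_string_size;
	uint32_t num_data_strings;
	uint32_t data_string_size;
};

/*
 * Return the idx-th string in a packed buffer of null-terminated
 * strings, or NULL if the buffer holds fewer strings than that or
 * a string is not terminated inside the buffer.
 */
const char *packed_string_at(const char *buf, size_t size, unsigned int idx)
{
	size_t off = 0;

	assert(buf != NULL);
	while (off < size) {
		const char *end = memchr(buf + off, '\0', size - off);

		if (end == NULL)
			return NULL;	/* unterminated string */
		if (idx == 0)
			return buf + off;
		idx--;
		off = (size_t)(end - buf) + 1;
	}
	return NULL;
}

/* The data strings start right after the text strings. */
const char *data_strings_start(const struct memtypes_counts *c,
			       const char *memtype_strings)
{
	assert(c != NULL && memtype_strings != NULL);
	return memtype_strings + c->text_string_size;
}
```

For the earlier example (text SRAM SDRAM0 ANY, data SDRAM1), text_string_size would be 16 bytes under this layout, so the data strings would start at offset 16.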

The MTA Config File and the Mtaconfig Script

The MTA config file (and its associated parsing script mtaconfig) is used for two purposes: defining nodes for building an MTA-enabled kernel, and marking ELF binaries with memory types for text/data. The MTA config file syntax defines two keywords for these purposes.

define_node keyword

To define nodes for configuring a kernel, use the following MTA config file line:

define_node [name] [start physaddr] [end physaddr] [0|1]
  • name is the mnemonic name for the node.
  • start physaddr is the starting physical address for the node, in hex.
  • end physaddr is the end physical address for the node, in hex.
  • 0|1 is a flag, 1 means allow allocation from this node when no list of nodes to allocate from is provided to the kernel page allocator. This flag is described in more detail later.

An example line in the config file might be:

define_node SRAM 20000000 2002E000 0

which defines a node named SRAM located between physical addresses 0x20000000 and 0x2002E000, and disallows default page allocation from this node.

Node ID numbers are assigned in the order the define_node keywords appear in the config file. So if the above line was the first define_node line in the file, SRAM would be assigned node ID 0.
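To make the define_node syntax concrete, here is a minimal sketch of parsing one such line in C. It is only an illustration: mtaconfig itself is a script, and the names parse_define_node(), struct parsed_node and NODE_NAME_MAX are hypothetical. The checks that the flag is 0 or 1 and that the end address lies above the start address are assumptions, not documented mtaconfig behavior.

```c
#include <assert.h>
#include <stdio.h>

#define NODE_NAME_MAX 32	/* hypothetical limit on mnemonic length */

struct parsed_node {
	char name[NODE_NAME_MAX];
	unsigned long start;		/* start physical address */
	unsigned long end;		/* end physical address */
	int allow_def_page_alloc;	/* the trailing 0|1 flag */
};

/* Parse "define_node [name] [start] [end] [0|1]"; 0 on success, -1 on error. */
int parse_define_node(const char *line, struct parsed_node *out)
{
	int fields;

	assert(line != NULL && out != NULL);
	/* Addresses are hex without a 0x prefix, as in the example line. */
	fields = sscanf(line, "define_node %31s %lx %lx %d",
			out->name, &out->start, &out->end,
			&out->allow_def_page_alloc);
	if (fields != 4)
		return -1;
	if (out->allow_def_page_alloc != 0 && out->allow_def_page_alloc != 1)
		return -1;	/* assumed: flag must be 0 or 1 */
	if (out->end <= out->start)
		return -1;	/* assumed: end must lie above start */
	return 0;
}
```

A caller that parses the lines in file order and appends each result to an array would get node IDs equal to the array indices, matching the ordering rule above.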

The mtaconfig script will output a C header file that can be used when compiling the kernel. For this purpose it is called as follows:

mtaconfig [MTA config file] makehdr

This command is used by the kernel Makefiles when configuring an MTA kernel. If the makehdr argument is not specified, define_node keywords in the config file are ignored and no header file is produced.

The content of the C header file produced by mtaconfig is an array of structures containing the same information as the define_node lines in the MTA config file. Each entry in the array is of type struct mta_node, and is defined as follows:

struct mta_node {
	char * name;
	unsigned long start;
	unsigned long end;
	int allow_def_page_alloc;
};

A macro in the generated header file called INSTANTIATE_MTA_NODES will instantiate the mta_nodes[] array. This is done in mm/numa.c in the kernel source.
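As a rough illustration, a generated header for a config file containing only the SRAM line from the earlier example might look like the sketch below. The exact macro body that mtaconfig emits is not shown in this specification, so the expansion here is an assumption, and the helpers mta_node_count() and mta_node_id() are hypothetical additions for demonstration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct mta_node {
	char * name;
	unsigned long start;
	unsigned long end;
	int allow_def_page_alloc;
};

/* Assumed shape of the generated macro: one initializer per define_node line. */
#define INSTANTIATE_MTA_NODES \
	struct mta_node mta_nodes[] = { \
		{ "SRAM", 0x20000000UL, 0x2002E000UL, 0 }, \
	}

/* In the kernel this instantiation would sit in mm/numa.c. */
INSTANTIATE_MTA_NODES;

/* Hypothetical helper: number of configured nodes. */
size_t mta_node_count(void)
{
	return sizeof mta_nodes / sizeof mta_nodes[0];
}

/* Hypothetical helper: node ID (array index) for a mnemonic, or -1. */
int mta_node_id(const char *name)
{
	size_t i;

	assert(name != NULL);
	for (i = 0; i < mta_node_count(); i++)
		if (strcmp(mta_nodes[i].name, name) == 0)
			return (int)i;
	return -1;
}
```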

tag_elf keyword

The second use for the MTA config file is to mark ELF binaries in a target file system with memory type information. This is simply a convenience: it allows a file system’s memtypes configuration to be described in a single location, instead of having to invoke the elfmemtypes tool many times to configure the file system.

Use the following line to mark an ELF binary with memory type info:

tag_elf [ELF file path] [{text|data} [comma-separated list of mnemonics]]

Notice that the command line is almost identical to the elfmemtypes tool command line, except that the memtypes list is comma-separated rather than space-separated. Also, the text and data lists can be placed on separate lines. An example config file entry might be:

tag_elf /target_root/bin/bash
 text SRAM,SDRAM0,any
 data SDRAM1

The command line to the mtaconfig script to process the tag_elf lines is as follows:

mtaconfig [MTA config file] tag

The script will call the elfmemtypes tool once for every tag_elf line found in the config file. Unlike the elfmemtypes tool, mtaconfig can check if the memory type names correspond to any kernel node names, because the node names are listed in the MTA config file itself. If any memory names listed on the tag_elf line have not been defined in a define_node line up to this point in the config file, mtaconfig prints an error message and skips tagging the ELF file. Finally, a file system’s memtype information can be completely cleared out with the following command line:

mtaconfig [MTA config file] clear

The script will call the elfmemtypes tool with the clear argument once for every tag_elf line found in the config file.
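The name check that mtaconfig performs on a tag_elf list can be sketched in C as follows. This is an illustration only, since mtaconfig is a script: tag_list_is_valid() and name_is_known() are hypothetical names, and accepting the reserved words ANY and any without a matching define_node line is an assumption based on the example entry above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Return 1 if name[0..len) is a defined node name or the wildcard. */
static int name_is_known(const char *name, size_t len,
			 const char *const *defined, size_t n)
{
	size_t i;

	/* Assumed: the reserved wildcard needs no define_node line. */
	if (len == 3 && (memcmp(name, "ANY", 3) == 0 ||
			 memcmp(name, "any", 3) == 0))
		return 1;
	for (i = 0; i < n; i++)
		if (strlen(defined[i]) == len &&
		    memcmp(defined[i], name, len) == 0)
			return 1;
	return 0;
}

/*
 * Check a comma-separated mnemonic list against the node names
 * defined so far. Returns 1 if every name is known, else 0.
 */
int tag_list_is_valid(const char *list, const char *const *defined, size_t n)
{
	const char *p = list;

	assert(list != NULL && defined != NULL);
	for (;;) {
		size_t len = strcspn(p, ",");

		if (len == 0 || !name_is_known(p, len, defined, n))
			return 0;	/* empty or undefined name */
		p += len;
		if (*p == '\0')
			return 1;
		p++;			/* skip the comma */
	}
}
```

A script following this logic would print an error and skip the ELF file whenever the check returns 0.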

load_elf_binary()

The function load_elf_binary() is an implementation of the load_binary() method of the linux_binfmt object, for ELF binaries. It is called by do_execve() when loading a new program for execution.

The job of load_elf_binary() is to read the executable file’s program segments, and pass that segment info to do_mmap() for every loadable segment program header found, which then actually creates the file mapped regions. Loadable program headers are of type PT_LOAD.

For MTA, load_elf_binary() also locates and reads the .memtypes NOTE section containing the memory types list. It then converts the mnemonic names to node ID’s and passes that information to the new functions do_mmap_nodelist() and do_brk_nodelist(). The node ID’s are inserted into a structure of type struct node_list (described later), and a pointer to that structure is passed to do_mmap_nodelist() and do_brk_nodelist().

If any of the mnemonic names listed in the .memtypes NOTE section do not match any of the kernel’s node names, the node list is disabled for that segment (text or data). That is, the text/data memory region will not have node preferences, and will have pages allocated for that region from any available node.

load_elf_interp()

Load_elf_interp() is called by load_elf_binary() when the latter function discovers a program header of type PT_INTERP. This header describes the interpreter program that is to be used to dynamically load the shared libraries that the program requires.

It’s the job of load_elf_interp() to load the segments of the interpreter itself, so that when the program begins executing, the interpreter is actually the first code to execute.

For MTA, load_elf_interp() locates and reads the NOTE section containing the memory types list from the interpreter binary, converts the list to node ID’s, and passes that information to do_mmap_nodelist() and do_brk_nodelist(). Just like load_elf_binary(), the node info is inserted into a structure of type struct node_list (described later).

The Program Interpreter (ld.so)

Ld.so is actually the first piece of code to execute when a new program runs. Ld.so runs in user space, and its job is similar to load_elf_interp(). It loads (maps) the text, data, and bss segments of every shared object listed in the main program.

For MTA, ld.so reads the NOTE section containing the memory types list of every shared object binary, and passes that information to a new mmap_memtypes() system call. The memory types list passed to mmap_memtypes() is a buffer holding the null-terminated memory type mnemonic strings. The mmap_memtypes() system call is described in more detail later.

Because ld.so is part of glibc, a new version of glibc is required to load shared objects in the correct nodes.

memtypes_to_nodelist()

The method that converts memory type mnemonics to a node list is memtypes_to_nodelist(), and it has the following interface:

void memtypes_to_nodelist(struct node_list * nl, char * names, int size);

The names argument is a pointer to a buffer holding a packed list of null-terminated mnemonic strings. That is, each null-terminated string starts immediately after the previous string’s null-termination character in the buffer. The size argument is the total size of the buffer in bytes, including the null characters. The buffer must be a kernel buffer; it cannot be a user-space buffer. If any of the names in the buffer do not match any of the kernel’s node names, the node list is disabled by setting nl->depth to zero (see next).
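The conversion can be modeled in user space as shown below. This is a sketch under stated assumptions rather than the kernel implementation: the node_names[] table and lookup_node() are hypothetical, struct node_list is repeated from the next section so the example stands alone, and the handling of the reserved name ANY is not modeled because this specification does not say how it is encoded in a node list.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_NR_NODES 16

struct node_list {
	unsigned int nid[MAX_NR_NODES];	/* node IDs in order of preference */
	unsigned int depth;		/* number of entries in nid[] */
};

/* Hypothetical kernel node names; the array index is the node ID. */
static const char *const node_names[] = { "SRAM", "SDRAM0", "SDRAM1" };
#define NUMNODES (sizeof node_names / sizeof node_names[0])

static int lookup_node(const char *name)
{
	size_t i;

	for (i = 0; i < NUMNODES; i++)
		if (strcmp(node_names[i], name) == 0)
			return (int)i;
	return -1;
}

void memtypes_to_nodelist(struct node_list * nl, char * names, int size)
{
	int off = 0;

	assert(nl != NULL && names != NULL && size >= 0);
	nl->depth = 0;
	while (off < size) {
		char *name = names + off;
		char *end = memchr(name, '\0', (size_t)(size - off));
		int nid;

		/* Unterminated string, or list would reach MAX_NR_NODES. */
		if (end == NULL || nl->depth + 1 >= MAX_NR_NODES) {
			nl->depth = 0;
			return;
		}
		nid = lookup_node(name);
		if (nid < 0) {		/* unknown mnemonic disables the list */
			nl->depth = 0;
			return;
		}
		nl->nid[nl->depth++] = (unsigned int)nid;
		off = (int)(end - names) + 1;
	}
}
```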

The node_list Object

The struct node_list object is defined as follows:

struct node_list {
	unsigned int nid[MAX_NR_NODES]; /* ID of nodes to alloc pages from,
	in order of preference */
	unsigned int depth; /* number of entries in above list */
};

The number of entries in the node list is limited to MAX_NR_NODES, which is the maximum number of nodes a system could contain, currently set at 16. Therefore depth must be less than MAX_NR_NODES. A depth of zero is valid, meaning the node list is empty or disabled.

In addition, each entry in nid[] must be a valid node ID, i.e. it must be in the range 0 to numnodes-1, where numnodes is the number of nodes in the system.

The following method checks these conditions, and returns -EINVAL if any are false:

check_nodelist(struct node_list * nl);

All of the kernel methods that take a node list as input (such as do_mmap_nodelist() and do_brk_nodelist()) call check_nodelist() to verify that the node list is valid. The section "Kernel API’s" below describes how each method behaves when given an invalid node list.
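A minimal user-space sketch of the two checks described above follows. The int return type, the fixed numnodes value, and the treatment of a null pointer as invalid are assumptions made for illustration; the kernel version may differ.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define MAX_NR_NODES 16

struct node_list {
	unsigned int nid[MAX_NR_NODES];	/* node IDs in order of preference */
	unsigned int depth;		/* number of entries in nid[] */
};

/* Hypothetical: number of nodes configured in the system. */
static unsigned int numnodes = 3;

/* Return 0 if the node list is valid, -EINVAL otherwise. */
int check_nodelist(struct node_list * nl)
{
	unsigned int i;

	if (nl == NULL)
		return -EINVAL;		/* assumed: null is treated as invalid */
	if (nl->depth >= MAX_NR_NODES)
		return -EINVAL;		/* depth must be less than MAX_NR_NODES */
	for (i = 0; i < nl->depth; i++)
		if (nl->nid[i] >= numnodes)
			return -EINVAL;	/* ID outside 0..numnodes-1 */
	return 0;			/* a depth of zero is also valid */
}
```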

do_mmap_nodelist() and do_brk_nodelist()

Load_elf_binary(), load_elf_interp(), and ld.so convert the .memtypes NOTE section from the ELF binary into a node list via memtypes_to_nodelist(), and pass the resultant struct node_list object to the new methods do_mmap_nodelist() and do_brk_nodelist(). From this point on in the data flow of memory type information, the memory types are in the form of node ID’s rather than mnemonic strings.

do_mmap_nodelist() and do_brk_nodelist() have the same arguments as the original do_mmap() and do_brk(), with the addition of the struct node_list pointer.

The primary job of do_mmap_nodelist() and do_brk_nodelist() is to instantiate a new memory region descriptor for the requested range of program addresses. In Linux the memory region descriptor is an object of type struct vm_area_struct, and is commonly referred to as a "VMA" (Virtual Memory Area).

In MTA, the node list information is added to the VMA with a struct vm_node_list vm_nodes member. The struct vm_node_list object contains a node list as well as information important to the VMA, and is defined as follows:

struct vm_node_list {
	struct node_list nl; /* the node list */
	unsigned long pgstart; /* if this node info belongs to a file mapping,
				the start page offset in the file */	   
	unsigned long pgend; 	/* and end page offset */	   
	unsigned long flags; 	/* unused */ 	   
}; 			 

Two struct node_list objects are also added to a process’ memory map descriptor (struct mm_struct), one each for the process’ text and data regions (member names text_nodes and data_nodes in struct mm_struct).

After the new VMA is instantiated, do_mmap_nodelist() and do_brk_nodelist() copy the passed struct node_list object to the VMA, but only if the node list is valid as indicated by check_nodelist() (see Kernel API’s below).

If the passed struct node_list pointer is null, or the list is empty (depth is zero), do_mmap_nodelist() and do_brk_nodelist() check to see if text_nodes or data_nodes in the calling process’ struct mm_struct are enabled (depth is non-zero). If so, do_mmap_nodelist() and do_brk_nodelist() copy to the VMA either text_nodes or data_nodes depending on whether the region being mapped is text or data. This ensures that, even if the mapping doesn’t pass a node list, the new region will still use any node preferences listed by the executable.
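The fallback rule just described can be summarized in a short sketch. The function select_vma_nodes() and the type struct mm_nodes (standing in for the two members added to struct mm_struct) are hypothetical, and the check_nodelist() validity test is left out to keep the example small.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_NR_NODES 16

struct node_list {
	unsigned int nid[MAX_NR_NODES];	/* node IDs in order of preference */
	unsigned int depth;		/* number of entries in nid[] */
};

/* Stand-in for the text_nodes/data_nodes members of struct mm_struct. */
struct mm_nodes {
	struct node_list text_nodes;
	struct node_list data_nodes;
};

/*
 * Choose the node list for a new VMA: the caller's list if one was
 * passed and is non-empty, otherwise the process-wide text or data
 * list if that one is enabled. An empty result means no preference.
 */
void select_vma_nodes(struct node_list *vma_nl,
		      const struct node_list *passed,
		      const struct mm_nodes *mm, int is_text)
{
	const struct node_list *fallback;

	assert(vma_nl != NULL && mm != NULL);
	vma_nl->depth = 0;
	if (passed != NULL && passed->depth != 0) {
		*vma_nl = *passed;	/* validity check omitted in this sketch */
		return;
	}
	fallback = is_text ? &mm->text_nodes : &mm->data_nodes;
	if (fallback->depth != 0)
		*vma_nl = *fallback;
}
```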

With the creation of the VMA, the program is now allowed to reference addresses within the memory region described by the VMA. However, no actual page frames for the region are available yet. The job of allocating page frames for the program’s memory region goes to the Page Fault Exception Handler. This is part of Linux’s demand paging mechanism: memory pages are allocated to the program only as they are needed (referenced) by the program.

The important point here however, is that the memory regions contain the node ID’s needed by the page fault handler, so that it can allocate pages in the correct nodes for the region. This is described later.

setup_arg_pages()

Setup_arg_pages() is called by load_elf_binary() to create the memory region for the program’s stack, which includes the program stack and also the argument strings to the program and environment variables that the program inherited. When setup_arg_pages() instantiates the new VMA for the stack region, it simply copies the struct node_list data_nodes from the memory descriptor to the new VMA.

However, there is one small glitch. Before load_elf_binary() was even called, in do_execve(), pages were already allocated for the argument and environment strings. These pages were allocated using the default node round-robin approach (because no node info was known at that time), so the pages almost certainly were not allocated from the correct node for the stack region. Therefore setup_arg_pages() needs to allocate a new page in the correct node for every page already allocated, copy the page contents from the old to new page, and then release the old page.

Page Fault Exception Handler

When the program references a valid address within one of the program’s memory regions, a page fault exception occurs if the address is not yet listed in any of the process’ page tables. The page fault exception handler goes about allocating pages for the faulting region, and creates the page tables that point to the new page.

For MTA, the exception handler will allocate the page from the correct node as described in the faulting region’s vm_node_list object. This includes allocating pages in all of the following situations: anonymous mappings, private and shared file mappings, and copy-on-write pages for private mappings.

Allocating Pages

At the lowest level of page allocation, the buddy system page allocator, __alloc_pages(), is passed a node descriptor pointer of type pg_data_t. This descriptor contains information related to the NUMA node, such as the number of "memory zones" contained in the node, the pointer to the start of the struct page * list of pages contained in the node, the start physical address of the node memory, and the node ID.

__alloc_pages() is used by both the standard/default page allocator _alloc_pages(), and by the MTA page allocator alloc_pages_nodelist(). Internally, __alloc_pages() attempts to allocate pages atomically (without blocking the calling process). If that fails and the __GFP_WAIT bit is set in gfp_mask, it "rebalances" the memory zone within the node, and attempts the allocation again. If that fails, it blocks the calling process and yields to the kswapd daemon. When __alloc_pages() returns from kswapd, it returns NULL to allow either _alloc_pages() or alloc_pages_nodelist() to try again with a different node (the default non-MTA behavior is to again attempt to allocate pages from the same node in an endless alloc-kswapd loop until it succeeds).

Default Page Allocator

The default page allocator is _alloc_pages(). It attempts to allocate from any available node in a round-robin manner. This method has been slightly modified for MTA. Each configured node in MTA includes a flag specifying whether _alloc_pages() is allowed to allocate pages from that node. This flag can thus be used to reserve an entire node only for MTA allocation. An example use might be a node containing a very small number of physical pages. Reserving the node only for MTA allocation guarantees that it will only be used to allocate pages for process memory regions that specify node lists, or for any caller of alloc_pages_nodelist() (described next).
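The effect of the per-node flag on round-robin selection can be sketched as follows. This models only the idea and is not the _alloc_pages() code: next_default_node() is a hypothetical helper, and how the real allocator tracks the last node used is not described in this specification.

```c
#include <assert.h>
#include <stddef.h>

struct mta_node {
	char * name;
	unsigned long start;
	unsigned long end;
	int allow_def_page_alloc;
};

/*
 * Return the next node after `last` (round-robin) whose flag allows
 * default page allocation, or -1 if every node is reserved for MTA.
 * Pass last = -1 to start from node 0.
 */
int next_default_node(const struct mta_node *nodes, int numnodes, int last)
{
	int step;

	assert(nodes != NULL && numnodes > 0 && last >= -1 && last < numnodes);
	for (step = 1; step <= numnodes; step++) {
		int nid = (last + step) % numnodes;

		if (nodes[nid].allow_def_page_alloc)
			return nid;
	}
	return -1;
}
```

With a reserved node 0 and two unreserved nodes 1 and 2, successive calls would alternate between nodes 1 and 2 and never return node 0.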

Allocating Pages With a Node List

A wrapper function for __alloc_pages() is provided, called alloc_pages_node(), which takes as an argument the node ID of the memory node to allocate the pages from. Its interface is:

struct page * alloc_pages_node(int nid, unsigned int gfp_mask,
		unsigned int order);

Alloc_pages_node() in turn is used by the MTA allocator, alloc_pages_nodelist(). It is this latter method that the page fault exception handler uses to allocate pages using the node information described by the struct vm_node_list object in the faulting VMA. Its prototype is:

struct page * alloc_pages_nodelist(struct node_list * nl, int gfp_mask,
		 unsigned int order);

The function is written so that plenty of opportunity is given for allocation from the first choice node (nl->nid[0]) to succeed if the gfp_mask includes the __GFP_WAIT flag. Note that if __alloc_pages() returns NULL when the __GFP_WAIT flag is set, it means kswapd was allowed to run, and therefore pages may have become free in the first choice node, so we should try again.

alloc_pages_nodelist() accomplishes this behavior with an outer and inner loop (see the flow chart below for an illustration of the algorithm). The outer loop increments from zero to nl->depth, and the inner loop increments from zero to the current outer loop index. The inner loop attempts to allocate a page from nl->nid[j], where j is the inner loop index. The function returns on the first successful page allocation. As described above, the underlying buddy system allocator, __alloc_pages(), will first attempt atomic allocation from the node, and if that fails, will yield to kswapd to free up pages, and then return NULL back to alloc_pages_nodelist().

As an example, suppose we have a node ID list containing {3,1} (nl->depth is 2), and the __GFP_WAIT flag is set in gfp_mask. Assuming alloc_pages_nodelist() ultimately fails, it will attempt allocation from the nodes in the following order: 3 3 1 3 1. In other words:

  1. kswapd runs after allocation from 1st choice node 3 fails.
  2. retry node 3 - fails again (kswapd runs again).
  3. try alloc from node 1 (2nd choice node) - fails (kswapd runs).
  4. retry first choice node 3 - fails again (kswapd runs).
  5. retry node 1 - fails again and give up (return NULL).

It is also possible to attempt allocation from the first choice node many times by repeating the node in the node list. For example, with a node ID list containing {3,3,1}, alloc_pages_nodelist() attempts allocation from the nodes in the following order before finally failing: 3 3 3 3 3 1 3 3 1.

Note that if the __GFP_WAIT flag is not set, the inner loop is collapsed, and each node in the list is tried in sequence with no retries. So given the node list {3,3,1} from the example above, alloc_pages_nodelist() attempts allocation from the nodes in the following order before finally failing: 3 3 1.
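The attempt orders quoted above can be reproduced with a short user-space model. This is a sketch under stated assumptions, not the kernel code: struct order_node_list and sketch_attempt_order() are hypothetical names, every allocation attempt is assumed to fail, and both loop bounds are treated as inclusive with the inner loop stopping at the last list entry, because that reading yields exactly the sequences given in the text.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the kernel's node list structure. */
struct order_node_list {
	int depth;	/* number of valid entries in nid[] */
	int nid[8];	/* node IDs in order of preference */
};

/*
 * Record the order in which nodes would be tried if every attempt failed.
 * wait != 0 models __GFP_WAIT being set in gfp_mask.
 * Writes at most max_out node IDs to out and returns how many were written.
 */
int sketch_attempt_order(const struct order_node_list *nl, int wait,
			 int *out, int max_out)
{
	int n = 0;

	assert(nl != NULL && out != NULL);

	if (!wait) {
		/* Inner loop collapsed: one pass over the list, no retries. */
		for (int j = 0; j < nl->depth && n < max_out; j++)
			out[n++] = nl->nid[j];
		return n;
	}

	/* Outer loop widens the window; inner loop restarts at first choice. */
	for (int i = 0; i <= nl->depth; i++) {
		for (int j = 0; j <= i && j < nl->depth; j++) {
			if (n == max_out)
				return n;
			out[n++] = nl->nid[j];
		}
	}
	return n;
}
```

With the list {3,1} and wait set, the model records 3 3 1 3 1. With {3,3,1} it records 3 3 3 3 3 1 3 3 1 when wait is set and 3 3 1 when it is not, matching the three examples above.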

Kernel APIs

Allocating Whole Pages, alloc_pages_nodelist()

Device drivers or other kernel code that wish to allocate whole memory pages from a specific node can call alloc_pages_nodelist() directly. If the caller has a list of mnemonic strings, it must first convert the strings to a node list with memtypes_to_nodelist() before calling alloc_pages_nodelist().

For the sake of speed in allocating pages during page faults, alloc_pages_nodelist() does not call check_nodelist() to check the validity of the passed node list. Instead, it does the following (refer to the flow chart above):

  • if depth is greater than MAX_NR_NODES, fail immediately (return NULL).
  • in the inner loop, if the current node ID in the list is invalid, skip this entry and move on to the next ID in the list.

Note that the passed node list will never be invalid if alloc_pages_nodelist() was called as a result of a page fault or a slab allocation, because kmalloc_nodelist(), do_mmap_nodelist(), and do_brk_nodelist() all check the validity of the list beforehand.
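The two lightweight checks listed above can be modeled in user space as follows. This is a sketch only: struct checked_node_list and sketch_usable_nodes() are hypothetical names, the value 16 for the node limit is taken from the "beyond 16" remark under Future Enhancements, and "invalid node ID" is assumed here to mean an ID outside the range of configured nodes.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_MAX_NR_NODES 16	/* assumed value, see lead-in */

/* Hypothetical stand-in for the kernel's node list structure. */
struct checked_node_list {
	int depth;
	int nid[SKETCH_MAX_NR_NODES];
};

/*
 * Apply the two fast-path checks to a node list.
 * Returns -1 if the list is rejected outright (depth too large),
 * otherwise the number of usable node IDs copied to out, in list order.
 * out must have room for SKETCH_MAX_NR_NODES entries.
 */
int sketch_usable_nodes(const struct checked_node_list *nl,
			int nr_configured, int *out)
{
	int n = 0;

	assert(nl != NULL && out != NULL);

	if (nl->depth > SKETCH_MAX_NR_NODES)
		return -1;	/* fail immediately, as a NULL return would */

	for (int j = 0; j < nl->depth; j++) {
		int nid = nl->nid[j];

		if (nid < 0 || nid >= nr_configured)
			continue;	/* invalid entry: skip, try the next ID */
		out[n++] = nid;
	}
	return n;
}
```

A list such as {3, 9, 1} on a system with four configured nodes would therefore be treated as {3, 1} rather than rejected.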

Slab Allocator, kmalloc_nodelist()

Device drivers or other kernel code that wish to allocate memory of arbitrary size from a specific node can make use of a new interface to the slab allocator, kmalloc_nodelist(), which takes as an extra argument a pointer to a struct node_list object. Its prototype is as follows:

void * kmalloc_nodelist (struct node_list * nl, size_t size, int flags);

There is also a new slab interface that allows creation of a new cache that includes a node list:

kmem_cache_t *	kmem_cache_create_nodelist (struct node_list * nl,
		const char *name, size_t size, size_t offset, unsigned long flags,
			void (*ctor)(void*, kmem_cache_t *, unsigned long),
			void (*dtor)(void*, kmem_cache_t *, unsigned long));

The new cache can then be used when allocating objects by passing it to kmem_cache_alloc(). The new objects will be allocated from the nodes listed in the cache object's node list.

Both of these new methods perform the following checks on the passed node list:

  • if the node list pointer is NULL, or the list is empty, the new slab object or cache will not have any node preference.
  • if the node list is invalid as indicated by check_nodelist(), both methods fail, returning NULL.

do_mmap_nodelist() and do_brk_nodelist()

Kernel code that wishes to create new mappings for a process can call do_mmap_nodelist() or do_brk_nodelist() directly. The current prototypes are identical to the original do_mmap() and do_brk(), with the addition of a node_list pointer as the last argument.

If the passed node_list pointer is non-NULL and enabled (depth is non-zero), but the list is invalid as indicated by check_nodelist(), the mapping fails, and both methods return -EINVAL.

User APIs

mmap_memtypes() and brk_memtypes()

These new system calls allow memory maps to be created from user space with node information. They essentially provide user-level access to the kernel methods do_mmap_nodelist() and do_brk_nodelist(). The prototypes are the same as the current system calls, with two additional arguments:

void * mmap_memtypes(void *start, size_t length, int prot, 
	int flags, int fd,  off_t offset, char * memtypes, 
	int memtypes_len);
int brk_memtypes(void *end_data_segment, char * memtypes, 
	int memtypes_len);

The memtypes argument is a pointer to a user buffer holding a packed list of null-terminated strings. The strings represent the memory type mnemonics, and their order in the buffer is the order of node preference for the region. The memtypes_len argument is the total size of the user buffer in bytes.

Note that these new libc functions are not reserved by the POSIX standard. Applications that use them have to be compiled with -D_GNU_SOURCE.
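Building the packed memtypes buffer described above might look like the following. This is a sketch: pack_memtypes() is a hypothetical helper that is not part of MTA or glibc, and the mnemonic names used in the usage note are placeholders, since real names come from the kernel's node configuration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Pack count mnemonic strings back to back into buf, each followed by
 * its terminating NUL, in order of node preference.
 * Returns the number of bytes used (the value to pass as memtypes_len),
 * or 0 if the strings do not fit in buf_size bytes or count is 0.
 */
size_t pack_memtypes(const char *const names[], size_t count,
		     char *buf, size_t buf_size)
{
	size_t used = 0;

	assert(buf != NULL);

	for (size_t i = 0; i < count; i++) {
		size_t len = strlen(names[i]) + 1;	/* include the NUL */

		if (len > buf_size - used)
			return 0;	/* buffer too small */
		memcpy(buf + used, names[i], len);
		used += len;
	}
	return used;
}
```

For the hypothetical names "SRAM" and "SDRAM", the buffer would hold the 11 bytes "SRAM\0SDRAM\0". On a system with the MTA-aware glibc, the buffer and the returned length would then be passed as the memtypes and memtypes_len arguments.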

The new syscalls are also used by the dynamic linker (ld.so) in MTA-aware glibc, to create the maps for a program’s shared libraries. The following checks are made on the arguments passed to mmap_memtypes() and brk_memtypes():

  • If the memtypes buffer pointer is NULL, or if memtypes_len is zero, the new mapping created will not have any node list preference, i.e. it will be as if the regular mmap() and brk() syscalls were used.
  • If the copy of the user buffer to kernel space fails (for instance the memtypes pointer is invalid), the mapping fails.
  • There is an upper limit of one page (4096 bytes) on the user buffer size. If memtypes_len is greater than PAGE_SIZE, the mapping fails.
  • If any of the memory type mnemonic names in the memtypes buffer do not match any of the kernel’s node names, the new mapping created will not have any node list preference.
  • The usual conditions exist on the remaining arguments (for instance, for a file mapping the file descriptor must refer to a valid open file).
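The argument checks listed above can be modeled in user space as follows. This is a sketch, not the kernel implementation: sketch_parse_memtypes() and the node_names table are hypothetical, the copy from user space is not modeled, and two cases the text does not cover (a final string with no terminating NUL, and more names than the output array can hold) are rejected here purely as an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 4096	/* upper limit on memtypes_len */

/*
 * Translate a packed memtypes buffer into node IDs.
 * node_names[] is a hypothetical table of kernel node names, indexed by
 * node ID. Returns:
 *   > 0  number of node IDs written to nid_out (a node list was built)
 *     0  no node preference (plain mmap()/brk() behavior)
 *    -1  the mapping fails
 */
int sketch_parse_memtypes(const char *memtypes, size_t memtypes_len,
			  const char *const node_names[], int nr_nodes,
			  int *nid_out, int max_out)
{
	int depth = 0;
	size_t pos = 0;

	if (memtypes == NULL || memtypes_len == 0)
		return 0;	/* no buffer: no node preference */
	if (memtypes_len > SKETCH_PAGE_SIZE)
		return -1;	/* larger than one page: mapping fails */

	assert(node_names != NULL && nid_out != NULL);

	while (pos < memtypes_len) {
		const char *name = memtypes + pos;
		const char *end = memchr(name, '\0', memtypes_len - pos);
		int nid;

		if (end == NULL)
			return -1;	/* assumption: unterminated string rejected */

		for (nid = 0; nid < nr_nodes; nid++)
			if (strcmp(name, node_names[nid]) == 0)
				break;
		if (nid == nr_nodes)
			return 0;	/* unknown mnemonic: no node preference */
		if (depth == max_out)
			return -1;	/* assumption: too many names rejected */

		nid_out[depth++] = nid;
		pos += (size_t)(end - name) + 1;
	}
	return depth;
}
```

Note the asymmetry the checks imply: an oversized buffer makes the mapping fail, while an unknown mnemonic silently falls back to a mapping with no node preference.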

/proc Interface

There are two new entries in the /proc file system.

/proc/nodeinfo

The first is /proc/nodeinfo, which lists the node configuration of the kernel. For each configured node it shows the name, the physical address range, and whether default page allocation is allowed.

/proc/[pid]/nodemap

The second is an extension of the Memory Accounting tool. If the kernel config option CONFIG_MEMORY_ACCOUNTING is enabled along with CONFIG_MEMTYPE_ALLOC, a new proc entry, /proc/[pid]/nodemap, will be available. The information is similar to the Memory Accounting tool's /proc/[pid]/memmap, except that instead of displaying the page usage counter for every resident page in each region, the node ID of each resident page is displayed. Pages for a region that are not yet resident are shown with a dash character "-".

In other words, for every line (region) printed by /proc/[pid]/maps, /proc/[pid]/nodemap also prints a line, showing the node ID of resident pages for that region.

Tracing MTA with Linux Trace Toolkit

Important MTA events are captured by the run-time creation of Linux Trace Toolkit (LTT) custom events for MTA. The following events are defined in include/linux/vmnode.h, and are called at the appropriate locations in the kernel where the corresponding events occur:

  • TRACE_MTA_ELF_MEMTYPES
An ELF executable or ld.so was loaded containing a .memtypes NOTE section.
  • TRACE_MTA_MMAP_MEMTYPES
Entry to mmap_memtypes system call with a non-empty memtypes buffer.
  • TRACE_MTA_BRK_MEMTYPES
Entry to brk_memtypes system call with a non-empty memtypes buffer.
  • TRACE_MTA_MMAP_NODELIST
do_mmap_nodelist() was called with a valid node list.
  • TRACE_MTA_BRK_NODELIST
do_brk_nodelist() was called with a valid node list.
  • TRACE_MTA_KMALLOC_NODELIST
kmalloc_nodelist() was called with a valid node list.
  • TRACE_MTA_KMEM_CACHE_CREATE_NODELIST
kmem_cache_create_nodelist() was called with a valid node list.
  • TRACE_MTA_SLAB_ALLOC
A group of contiguous pages was allocated for a slab cache object containing a node list.
  • TRACE_MTA_VMA_ALLOC
A page was allocated for a copy-on-write, for an anonymous or file mapping containing a node list.
  • TRACE_MTA_PAGE_CACHE_ALLOC
A page was allocated and placed in the page cache, for a file mapping containing a node list.

With these events, it’s possible to trace MTA-related activity from the time a program was loaded, to the creation of its memory map, down to the allocation of memory pages for the program. The events can also trace the creation of new slab caches containing node lists, down to allocation of pages for the cache objects.

Additional Information

Porting MTA to other Architectures

At this time, only the ARM OMAP1510 Innovator platform has MTA support. To port MTA to other architectures:

  • First of all, the architecture must support discontiguous memory.
  • Add the CONFIG_MEMTYPE_ALLOC option to arch/[arch]/config.in if CONFIG_DISCONTIGMEM is defined. See arch/arm/config.in for an example.
  • Add system call entry points for sys_brk_memtypes() and old_mmap_memtypes() and define their syscall numbers. See arch/arm/kernel/calls.S and include/asm-arm/unistd.h for an example.
  • Implement old_mmap_memtypes() (sys_brk_memtypes() is implemented in generic kernel code in mm/mmap.c). See arch/arm/kernel/sys_arm.c for an example implementation.
  • Configure the system’s memory nodes using the start and end physical addresses of each node in the mta_nodes[] array. How discontiguous memory nodes are initially configured is very architecture specific. See include/asm-arm/arch-omap1510/memory.h, arch/arm/mach-omap1510/innovator.c, and arch/arm/mm/init.c for an example of how this is done for ARM and the Innovator platform.

Limitations

  • In ELF binaries, the first file page offset of the initialized data segment is usually the same file page offset as the last page of text (the end of text and start of data share the same page). Because of this, the same allocated page frame in the kernel’s page cache is shared between the last page of text and the first page of initialized data. Therefore, if the program references the last page of text after it references the first page of data (which is usually the case), the last page of the text region will be located in the node of the data region, not in the text’s node.
  • The Innovator’s SRAM is very small, and page allocations from SRAM will begin to fail very quickly. The text segment of ld.so happens to just barely fit in SRAM. Even then, the kernel will attempt to allocate a cluster of pages for a region instead of only one during a file mapping page fault, and if that many pages are not free in SRAM, the cluster allocation will fail.

Future Enhancements

  • Expand maximum allowable nodes beyond 16.
  • Allow separation of data/bss/brk/stack segments into different nodes.
  • For native elfmemtypes tool, check mnemonic names against /proc/nodeinfo.

Notes

  • Copyright 2002, 2003, 2004 Sony Corporation
  • Copyright 2002, 2003, 2004 Matsushita Electric Industrial Co., Ltd.
  • Copyright © 2002-2004 by MontaVista Software.

Source Code

linux-mta-041004.tar.bz2 is a kernel source archive that includes MTA. The MTA changes have not yet been isolated from the tarball; if you can extract them, please do so. The MTA utilities (mta-utils.tar.gz) and the glibc patch (mta-glibc-2.2.5.patch) are also available.