BeagleBoard/GSoC/2010 Projects/C6Run/Documentation

= Project Overview =

DSP-RPC-POSIX is a component of the C6Run project which allows you to do DSP->GPP remote procedure calls - that is, you can invoke functions/code residing somewhere on the GPP side directly from the DSP as if you were accessing a local function (there are, of course, certain requirements and restrictions).

"What functions are available on the GPP side, then?" one might ask. The answer is, broadly, "everything the hardware can do" or "everything you can do in your regular operating system". From basic tasks like accessing the file system to more sophisticated things like sending a file over FTP, there is a myriad of possibilities.

Two reasons why DSP->GPP RPC is desirable are access to otherwise (directly) inaccessible features, and the ability to reuse existing code. C6RunApp exists because it is messy to write and run code for the DSP, especially if all you want to do is experiment or prototype. C6RunApp makes your life easier as a DSP-side developer by offering easy compilation/running and access to console I/O (which is actually a limited form of RPC); DSP-RPC-POSIX expands on this by granting you access to virtually any existing functionality on the GPP side.

= Usage =

Building DSP-RPC-POSIX is no different than building C6Run itself. For example, to build for the BeagleBoard:

1. Check out the sources from the SVN repository

2. Change into the sources directory and issue make beagleboard_config

3. Set up dependency paths in the top-level Rules.mak and in platforms/beagleboard/Platform.mak

4. Issue make build

5. That's all! You can simply add the bin directory to your path and start doing things like c6runapp-cc hello_world.c -o hello_world

It's highly recommended to read the C6Run wiki at http://processors.wiki.ti.com/index.php/C6Run_Project for more insight into C6Run and usage instructions.

Since the RPC layer builds on top of the existing C6Run build system, nothing else is necessary to use DSP -> GPP remote procedure calls. The only thing you need to do to access your GPP side functions from your DSP code is to make sure that your function is "identified" to the RPC system - that is, the stubs for your function(s) must be present. See the section about stubs for more details.

In a general sense, to access RPC functionality using C6RunApp, you must:


 * include the relevant header file in which the function stubs you want to use are declared (under the rpc/include directory) - the DSP-RPC-POSIX prebuilt stubs (including rpc_malloc and rpc_free) are all contained within rpc_stubs_dsp.h


 * provide the --rpc command line option to c6runapp-cc when building your sources

And that's it! You can call your GPP-side function from your DSP code as you regularly would (just with an rpc_ prefix to remind you that this is actually a remote procedure call and that special rules apply). Most standard C library functions already have stubs packaged with DSP-RPC-POSIX - so rpc_printf, rpc_puts, rpc_fopen and so on are all ready to use. The rpc_example.c file under the examples/rpc_example folder gives a tiny demonstration of how this can be done.

'''IMPORTANT NOTICE REGARDING POINTER PARAMETERS'''
Any pointer/buffer parameters passed to RPC calls should ideally be allocated with rpc_malloc, which is identical to the RTS malloc in its calling convention but allocates memory from the shared CMEM region instead. Similarly, memory allocated with rpc_malloc should be freed with rpc_free. RTS functions that work via C6Run C I/O can still be used with DSP stack or heap variables, as well as rpc_malloc'd variables.

For the developer's convenience while working with read-only string parameters, DSP-RPC-POSIX has the ability to copy a fixed number of bytes from the DSP memory space (stack or heap) into a GPP-side buffer using PROC_read. This means that as long as the sent buffer is relatively small (limited by RPC_PROCREAD_MAXPARAM in build/gpp_libs/rpc_server.h) and is expected to be read-only (i.e., not modified on the GPP side), it's safe to pass it from the DSP stack or heap. So rpc_puts("hello world!") or rpc_printf("String: %s \n", "test") is indeed possible.

== Targeting your own functions with RPC ==
To access your own functions via RPC, you need to declare them to DSP-RPC-POSIX. This "declaration" is done via what's called an RPC stub. Stubs are small functions (also written in C) that perform the steps necessary to call your function via RPC. For every function to be accessed via RPC, three files are involved:


 * The DSP side stub, residing inside a C file in the dsp-rpc-posix/rpc/dsp directory
 * The GPP side stub, residing inside a C file in the dsp-rpc-posix/rpc/gpp directory
 * The declaration for the DSP stub (which is actually the function you're calling), residing inside a header file in the dsp-rpc-posix/rpc/include directory

Stubs are quite straightforward to write by hand, as you can see by studying the standard C library stubs already present inside dsp-rpc-posix, and it is highly recommended that you at least examine them to understand what they look like.

DSP-RPC-POSIX contains a stub generator utility (dsp-rpc-posix/bin/c6runapp-rpcgen) to easily generate stubs for any given function. For example, given a C file test.c that contains the function test_fxn, you can invoke

 c6runapp-rpcgen test.c 

and the utility will create three files in the same directory for you:
 * test.dsp_stub.c (containing the DSP stubs)
 * test.gpp_stub.c (containing the GPP stubs)
 * test.include.h (containing the declarations).

For DSP-RPC-POSIX to detect these stub files, they must be present in predetermined directories. You can simply use the --autocopy switch while calling c6runapp-rpcgen to have the files automatically copied into the respective locations, as in:

 c6runapp-rpcgen --autocopy test.c 

If you would like to manually manage your stub files instead of using --autocopy:


 * copy test.dsp_stub.c into dsp-rpc-posix/rpc/dsp
 * copy test.gpp_stub.c into dsp-rpc-posix/rpc/gpp
 * copy test.include.h into dsp-rpc-posix/rpc/include
 * copy test.c (the original source file) into dsp-rpc-posix/rpc/gpp_sources

== Removing functions from the RPC system ==
You can use the --remove switch if you no longer need a group of stubs that you generated using c6runapp-rpcgen:

 c6runapp-rpcgen --remove test.c 

and all four files will be removed from the DSP-RPC-POSIX directories. Alternatively, you can remove the files manually.

= Architectural Documentation =

== Step-by-step RPC Events ==
Let's start with some definitions:


 * DSP-side application - this is what you're assumed to be currently working on; you compile it with the C6Run script and it runs on the DSP.
 * GPP-side application - sets up the DSP app and starts running it, and then "answers" the RPC requests. C6Run actually generates this for you, so you don't have to worry about anything here.
 * RPC target - a function which resides somewhere in the GPP side (could be a library, a shared library, your own code, etc.)
 * DSP-side stub - a little wrapper function which looks identical to the RPC target. It causes the RPC target to be executed with the parameters you passed, and returns the same value the target returns. This is what you actually call from the DSP side. Can be produced by the c6runapp-rpcgen tool, or written manually.
 * GPP-side stub - another little wrapper function on the GPP side; this is what the GPP-side application actually calls. This function "knows" how to call the RPC target itself, so it executes that call and gets the result. Can be produced by the c6runapp-rpcgen tool, or written manually.

The run of events that occur when you want to do a remote procedure call are as follows:


 * 1) From inside the DSP-side application, the DSP-side stub is called (which looks identical to the RPC target)
 * 2) The DSP-side stub is executed. It initializes the RPC request, copies all the parameters into the request package ("marshalling"), and signals for the RPC to be performed.
 * 3) The request package is sent to the GPP-side application using the RPC transport.
 * 4) The GPP-side application receives the package, unpacks it and extracts the parameters into a buffer (called "unmarshalling"). Some extra processing such as address translation for buffer/pointer parameters may be carried out at this step.
 * 5) The GPP-side application locates the relevant GPP-side stub and executes it.
 * 6) The GPP-side stub executes the RPC target, using the provided parameters, then stores the return value into another buffer. Some extra processing regarding structures or non-shared buffer return types may be carried out at this step.
 * 7) The GPP-side application sends back the result to the DSP-side.
 * 8) The DSP-side stub receives the result in the buffer, extracts and returns it to the user code.

== Structure of the RPC Package ==
The buffer carrying an RPC request is structured as follows:


 * NameLen: length of the function name
 * Name: function name of the GPP-side stub to be executed (observe: NOT the name of the RPC target)
 * SignatureLen: length of the function signature
 * Signature: function signature describing how the parameters section will be unpacked
 * Params: the function parameters, packed without any size promotions or alignment
 * 0: the null-terminating zero signalling the end of the package
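The layout above can be sketched as a small host-side packing routine. This is a hypothetical illustration, not the actual C6Run marshalling code: the one-byte length fields and the "i@" signature used below are assumptions, and the real field widths may differ.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical sketch of packing an RPC request buffer following the
   layout described above: NameLen, Name, SignatureLen, Signature,
   Params (packed as-is, no promotions or alignment), terminating 0. */
static size_t rpc_pack(uint8_t *buf, const char *name, const char *sig,
                       const void *params, size_t params_len)
{
    size_t off = 0;
    size_t name_len = strlen(name);
    size_t sig_len  = strlen(sig);

    buf[off++] = (uint8_t)name_len;        /* NameLen               */
    memcpy(buf + off, name, name_len);     /* Name (GPP-side stub)  */
    off += name_len;
    buf[off++] = (uint8_t)sig_len;         /* SignatureLen          */
    memcpy(buf + off, sig, sig_len);       /* Signature             */
    off += sig_len;
    memcpy(buf + off, params, params_len); /* Params, packed as-is  */
    off += params_len;
    buf[off++] = 0;                        /* terminating zero      */
    return off;
}
```

For example, packing a call to a stub named "rpc_puts" with a two-character signature and one 4-byte parameter yields a 17-byte package.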

== GPP Side Architecture ==
Relevant source code files: build/gpp_libs/rpc_server.c build/gpp_libs/rpc_server.h build/gpp_libs/cio_ipc.c rpc/gpp/*.c

=== Overview ===
DSP-RPC-POSIX's GPP-side is "heavier" compared to its DSP-side and almost completely integrated into the C6Run GPP-side library. Aside from C6Run's regular GPP-side duties such as setting up the DSP and serving C I/O requests, it is also responsible for these tasks:


 * extracting/cleaning up the GPP stubs library
 * receiving and responding to RPC requests
 * unmarshalling received packages
 * postprocessing stubs' returned data
 * locating and executing stubs
 * managing RPC memory

=== The GPP Stubs Library ===
All GPP stubs located inside the rpc/gpp directory are compiled into a dynamic link library (librpcstubs.so), which allows the dlfcn.h function dlsym to locate them dynamically by name. This dynamic link library is rebuilt, converted into a C header file and included in the compilation of the final executable. Upon launch, the library is temporarily extracted into the same directory as the executable, used for locating and executing the stubs, then removed upon termination.

=== Receiving and Responding to RPC Requests ===
The important task of servicing RPC requests - that is, carrying out the receive-unmarshal-locate-execute-return steps - is currently done inside the C I/O service routines (since the RPC transport is carried out via the C I/O transport), located inside build/gpp_libs/cio_ipc.c.

=== Address Translations and RPC Memory ===
The GPP-side server application is responsible for managing the shared memory regions used in RPC. Central to its workings is the CMEM kernel module, which allocates contiguous memory regions and provides both physical and virtual addresses for accessing them. Since the CMEM module offers only physical-from-virtual translation of base addresses, DSP-RPC-POSIX maintains an internal list of allocated buffers and their sizes to facilitate bidirectional address translation for any memory region lying within the allocated areas (i.e., not just the allocated base addresses but also incremented pointers referring to somewhere within an allocated block are translatable). The list is kept as a doubly-linked list, which grows with rpc_malloc calls and shrinks with rpc_free calls. Using rpc_translate_address (which is also exposed through RPC on the DSP side), the list can be searched to perform virtual->physical or physical->virtual translations. If an address translation fails (no corresponding allocation entry is found), the input address is returned unchanged, which can be used to detect that something has gone wrong.

== DSP Side Architecture ==
Relevant source code files: rpc/core/dsp_core.c rpc/core/dsp_stubs_base.h rpc/dsp/*.c

=== Overview ===
DSP-RPC-POSIX's DSP-side is very small and rather simple - in fact, none of it currently resides in the C6Run DSP-side libraries; it is compiled alongside the user sources every time (thus, modifications to the DSP-side code never need a rebuild of the C6Run libraries, just a re-run of the c6runapp-cc script). This is mainly because there isn't much to do on the DSP side: there is a buffer of a certain size (see RPC_BUFSZ in dsp_stubs_base.h) into which every DSP-side stub copies its function name, function signature and parameters ("marshalling"). The buffer is then sent to the GPP side using the transport, and the reply is obtained and passed back to the DSP stub.

=== Message identifiers ===
Aside from the information transmitted in the RPC request, one more piece of information is given to the GPP side which is vital to servicing the RPC call: the message identifier, which describes the nature of the RPC message and is defined as follows:


 * RPC_MSG_REQUEST    generic RPC function call request
 * RPC_MSG_RESPONSE   sent by the GPP side as a reply to every RPC request, both generic and specialized
 * RPC_MSG_MALLOC     specialized RPC function call request, for memory allocation
 * RPC_MSG_FREE       specialized RPC function call request, for memory deallocation
 * RPC_MSG_TRANSLATE  specialized RPC function call request, for address translation

Specialized call requests exist because these particular functions should not be called from inside the GPP stubs library, but handled directly inside the RPC server. Since all regular stubs automatically use RPC_MSG_REQUEST, the specialized functions are defined manually in dsp_core.c - they also don't obey the structural conventions of RPC packages (no function name, signature or null terminator is given).

Observe that these identifiers are NOT used as the MSGQ MSG identifier - since the RPC transport is carried out via C6Run's existing C I/O transport, those are always CIO_TRANSFER. These RPC identifiers are carried on the command byte during writemsg for requests and the first byte of parm[] for responses.
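The split between specialized and generic requests described above can be sketched as a dispatch on the message identifier. This is an illustrative model only: the enum values, handler descriptions and function name below are assumptions, not the actual server code.

```c
/* Hypothetical sketch: how the GPP-side server might branch on the RPC
   message identifier.  Specialized requests are serviced directly by the
   server; generic requests are routed to the GPP stubs library. */
enum rpc_msg {
    RPC_MSG_REQUEST,    /* generic: locate stub via dlsym and execute   */
    RPC_MSG_RESPONSE,   /* reply sent by the GPP side, never dispatched */
    RPC_MSG_MALLOC,     /* specialized: shared-memory allocation        */
    RPC_MSG_FREE,       /* specialized: shared-memory deallocation      */
    RPC_MSG_TRANSLATE   /* specialized: address translation             */
};

static const char *rpc_dispatch(enum rpc_msg id)
{
    switch (id) {
    case RPC_MSG_MALLOC:    return "server: allocate CMEM buffer";
    case RPC_MSG_FREE:      return "server: free CMEM buffer";
    case RPC_MSG_TRANSLATE: return "server: translate address";
    case RPC_MSG_REQUEST:   return "stubs library: locate stub and execute";
    default:                return "response: not a request";
    }
}
```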

=== Transport ===
Currently, the RPC transport is completely carried out via C6Run's existing C I/O transport system - that is, the functions writemsg and readmsg.

The definitions of writemsg and readmsg, and the usage conventions for their parameters, are as follows.

For writemsg (sending a request):

 * command - the RPC message identifier
 * parm   - unused, any 8-element char array
 * data   - the RPC buffer
 * length - length of the RPC buffer

For readmsg (receiving a response):

 * parm   - char array whose first element should contain RPC_MSG_RESPONSE
 * data   - buffer that contains the RPC response

== Function Signatures ==
The function signature is a string of characters describing the data types of a function's return value and parameters. This data is needed for three reasons:


 * the unmarshaller needs to know how many bytes each parameter takes while extracting them from the buffer


 * GPP and DSP address spaces aren't the same, and in order to know when to perform address translation the unmarshaller needs to know which parameters are pointers (should be translated) and which parameters are regular values (should be left untouched)


 * special treatment (such as copying into a shared buffer) may be needed for some functions' return values

The signature is composed of ASCII characters, starting with the character representing the return type and continuing with characters representing each parameter in order. Its length always has to be nonzero (the return type is always needed).

=== Table of Function Signature Characters ===
The signature characters documented in this text are:

 * '@' - pointer into a shared (rpc_malloc'd) buffer; automatically address-translated
 * 'a' - opaque pointer passed back and forth as a plain value; never translated
 * '$' - return type only: pointer to a non-shared GPP-side buffer (see "Returning Non-Shared Buffers")
 * '#' - return type only: a structure returned by value (see "Returning Structures")

Observe that indirect pointers (double/triple pointers such as char**, void**) are not fully supported - the contained direct pointers won't be translated, and neither will pointers hidden inside structs.
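One job the unmarshaller performs with the signature can be sketched as follows. This is a hypothetical illustration: the function name is made up, and the 'i' character used in the test as a plain-value parameter is a stand-in, not necessarily part of the official character set.

```c
/* Hypothetical sketch: scan a signature string to find which parameters
   are '@' (shared pointers that need address translation).  sig[0] is the
   return type; the remaining characters describe the parameters in order. */
static int params_needing_translation(const char *sig, int *out, int max)
{
    int n = 0;
    for (int i = 1; sig[i] != '\0'; i++)
        if (sig[i] == '@' && n < max)
            out[n++] = i - 1;   /* zero-based parameter index */
    return n;
}
```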

== Pointers and Shared Memory ==
There are two important issues that need to be kept in mind while working with buffers/pointers in RPC:


 * 1) The GPP and the DSP don't use the same address space: the GPP works with virtual addresses, while the DSP works with physical addresses
 * 2) Due to memory protection issues, a buffer which will be accessed by both the DSP and the GPP must be allocated from a shared memory area (that is, CMEM)

To save you the trouble of manually allocating CMEM buffers and translating addresses back and forth, DSP-RPC-POSIX offers easy allocation of shared buffers via rpc_malloc and automatic translation of memory addresses via the @ character in function signatures. Thus, any rpc_malloc'd buffer you pass as an '@' parameter will be automatically converted to its virtual equivalent, which can be used by any GPP-side function, and any '@' return value lying inside an rpc_malloc'd area will be translated to its physical equivalent, which is accessible by the DSP. Keep in mind that address translations (which are absolutely necessary for accessing pointers from both sides) are performed automatically only if the parameter is specified as @ in the function signature. This means that double/triple pointers, pointers inside structs or pointers hidden inside buffers won't be automatically translated.

As noted in the Usage section, small read-only string parameters (up to RPC_PROCREAD_MAXPARAM bytes) can also be passed directly from the DSP stack or heap; they are copied into a GPP-side buffer using PROC_read, so calls like rpc_puts("hello world!") remain possible.

If the contents of a pointer will never be accessed from the DSP side, it is safe to use the 'a' signature character instead of '@'. In this case the system performs no address translation at all; pointers are passed back and forth as regular values. For example, FILE pointers used by fopen/fread/etc. are never meant to be dereferenced, merely passed as arguments to other members of the same function family, and are thus good candidates for 'a' parameters.

== Returning Non-Shared Buffers ==
If the GPP-side function returns a pointer to a GPP-side allocated memory buffer (i.e., anything that wasn't allocated by rpc_malloc), the function signature return type should be '$', which has this special meaning: the GPP stub must use the RPC_LOAD_RESULT_BUFLEN(result_buffer, length) macro to set the size of the returned buffer. The RPC system will copy that many bytes from the returned buffer into a shared memory region, and return the physical address of the shared memory region instead. Observe that this shared buffer is synchronized only once (upon completion of the GPP-side call); its later contents are not kept in sync with the non-shared GPP buffer.

== Returning Structures ==
If the GPP-side function returns a structure (the structure itself, not a pointer to it), the function signature return type should be '#', which has this special meaning: the GPP stub must allocate sufficient memory with malloc, place the returned structure into this buffer, use the RPC_LOAD_RESULT_BUFLEN macro to set the length of the returned structure, and then return this buffer without freeing it. The GPP-side server will copy the specified number of bytes into the return buffer and free the allocated memory. Observe that the size of the RPC return buffer is limited (defined as RPC_RESPSZ).
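The '#' convention can be illustrated with a host-only sketch. RPC_LOAD_RESULT_BUFLEN is the macro named by this text, but its definition below is a hypothetical stand-in that merely records the length; make_point stands in for the RPC target, and the server-side copy/free step is not shown.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for the real macro: record how many bytes the
   server should copy out of the returned buffer. */
static size_t rpc_result_len;
#define RPC_LOAD_RESULT_BUFLEN(buf, len) (rpc_result_len = (len))

struct point { int x, y; };

/* Stand-in RPC target returning a structure by value. */
static struct point make_point(void) { struct point p = {3, 4}; return p; }

/* Sketch of a '#' GPP stub: allocate, copy the struct in, record its
   length, and return the buffer WITHOUT freeing it - per the text, the
   GPP-side server copies the bytes out and frees the memory. */
static void *rpc_stub_make_point(void)
{
    struct point *buf = malloc(sizeof *buf);
    *buf = make_point();
    RPC_LOAD_RESULT_BUFLEN(buf, sizeof *buf);
    return buf;
}
```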

== Cache Issues ==
As with any modern processor, the C64x+ megamodule has several levels of cache (L1P, L1D and L2) for higher performance. The presence and usage of ARM-side and DSP-side caches poses a cache coherency problem when both processors access a shared area. Consider the following scenario:


 * 1) A CMEM buffer is allocated for shared usage by the GPP side and its physical pointer is passed to the DSP.
 * 2) The DSP wants to read and then write some data into this buffer. Let's say that there are free entry slots in the DSP L2 cache - so the data actually gets written to the DSP cache, instead of making it to the DDR shared region. The DSP then signals the GPP that it's done with the buffer for the time being.
 * 3) The GPP attempts to read the buffer, but what it reads is just the garbage present in the buffer after initialization, since the DSP-written data is still in the DSP cache. The buffer now also gets cached on the ARM side, so when the GPP writes some new data into it, that data stays in the ARM cache and doesn't make it to main memory either.
 * 4) If the DSP tries to read the buffer now, it won't get what the ARM wrote most recently, since it'll be reading from its own cache - and vice versa for the ARM side.

We can see that the "same" buffer actually exists in three different locations (main memory, DSP cache, ARM cache), all of which can contain totally different data - in this case it is said that they are not coherent, and that we have a cache coherency problem.

In most cached systems, cache coherency protocols prevent these situations from occurring. The TMS320C64x+ DSP Cache User's Guide states:

In the following cases, it is your responsibility to maintain cache coherence:


 * DMA or other external entity writes data or code to external memory that is then read by the CPU
 * CPU writes data to external memory that is then read by DMA or another external entity

thus we have to manually maintain cache coherence for mutual access to CMEM regions by the DSP and the GPP. Studying the scenario above, we can observe two underlying problems:


 * 1) If the memory block to be read already exists in the local cache, there's a risk that the local cache is outdated: we need to discard the local cache entries and fill them up with information from the main memory. This process is called cache invalidation.
 * 2) When the memory block is to be written into, there's a risk that the info remains in the local cache and doesn't make it to the main memory: we have to make sure that the new info gets written to the main memory as well. This process is called cache writeback.

Therefore, from an RPC perspective, for a call that involves transferring buffers, the steps to take are as follows:


 * 1) Before passing the marshalled info via DSP/Link, the DSP must do a cache writeback
 * 2) Before passing the params to the GPP side stub, the GPP must do a cache invalidate
 * 3) After the GPP side stub is finished, the GPP must do a cache writeback
 * 4) The DSP side stub must do a cache invalidate before terminating

This is assuming that both processor caches will be active - in case the DSP cache is disabled, the steps 1 and 4 will not be necessary, and likewise with steps 2 and 3 for a disabled GPP cache.
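The four-step ordering above can be sketched as a host-only model. cache_wb() and cache_inv() are hypothetical stand-ins for the real platform cache calls; here they merely record the sequence of operations so it can be checked, and the actual transfer and stub execution are elided.

```c
#include <string.h>

/* Records the sequence of cache operations performed. */
static char trace[128];

static void cache_wb(const char *who)  { strcat(trace, who); strcat(trace, ":wb "); }
static void cache_inv(const char *who) { strcat(trace, who); strcat(trace, ":inv "); }

/* Model of one buffer-carrying RPC call, following the steps above. */
static void rpc_call_with_buffer(void)
{
    cache_wb("dsp");    /* 1) DSP writes back before sending the request */
    cache_inv("gpp");   /* 2) GPP invalidates before reading the params  */
    /* ... GPP-side stub runs here ... */
    cache_wb("gpp");    /* 3) GPP writes back after the stub finishes    */
    cache_inv("dsp");   /* 4) DSP invalidates before using the result    */
}
```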

=== Enabling or Disabling DSP Cache ===
DSP caches are enabled by default, and coherency issues are handled inside the RPC layer. To adjust the DSP-side cache, edit the platforms/{platform_name}/platform.tci configuration file. To disable all DSP-side caching, set the MAR bits to 0, as in:
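A hypothetical example of what such a platform.tci fragment could look like; the exact GBL property names and address ranges depend on the DSP/BIOS version and platform, so treat these as illustrative only:

```javascript
/* Illustrative only: mark external memory ranges as non-cacheable by
 * clearing the C64x+ MAR bits in the DSP/BIOS GBL module.  The property
 * names and ranges are assumptions; check your platform.tci. */
bios.GBL.C64PLUSMAR128to159 = 0x00000000;
bios.GBL.C64PLUSMAR160to191 = 0x00000000;
```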

=== Enabling or Disabling ARM CMEM Cache ===
The ARM-side CMEM caching is disabled by default, since there were coherency handling issues inside the RPC layer at the time of writing. If you want to enable ARM-side caching and handle coherency issues on your own, set the corresponding option in build/gpp_libs/C6Run.h.