RPi Xorg rpi Driver
This is the documentation for the Xorg Raspberry Pi driver developed in this thread 
By default, each Raspberry Pi Linux distro uses the generic framebuffer driver to draw the X display. All rendering of the display is done by the CPU into off- or on-screen buffers which eventually are shown on the output by the scan-out hardware. As the CPU on the Raspberry Pi is reasonably weak this makes for a sluggish user interface that at the same time causes a high CPU load, that slows down other programs.
In the modern 2D X11 desktop environment however there are two major ways that an application can choose to render itself,
- all rendering done by the X server
- nearly all rendering done by the application, with the X server simply presenting the rendered output
The primary goals of this project are to improve the performance of the first case and leave the second to other projects - there are many different user libraries that can do application-side rendering and boosting the performance of each of those would be a huge undertaking. The driver accomplishes this by offloading common tasks onto other hardware on the SoC that can process the work asynchronously, allowing the X server to be pre-empted by the OS yet still allowing progress to be made. This should allow other processes to see more CPU time.
Unfortunately this means that even if the X server runs infinitely faster, applications can still seem unresponsive if extensive application-side rendering is used.
No effort has been made so far to allow applications access to OpenGL/GL ES through the X server - basic 2D has been the priority so far.
Xorg provides a mechanism for drivers to accelerate a number of important rendering tasks. This is called EXA . EXA allows easy overriding of,
- block copy aka blitting
- solid colour fills
- compositing, aka alpha blending
This driver implements the required functionality of EXA by using different parts of the Raspberry Pi SoC.
- 2D A->B block copies are performed using asynchronous DMA. A DMA engine is programmed to copy - in either a linear or 2D fashion - from A(x, y) to B(x2, y2) with an incrementing source address and incrementing destination address.
- Solid colour fills are also performed by asynchronous DMA. A DMA engine is programmed to copy (again either linear or 2D) from a non-moving source address to a moving destination address starting at B(x, y). The source address holds the colour that is to be used in the fill.
DMA commands are enqueued one after the other to ensure correct results. They are constructed as a chain of DMA control blocks (CBs), and passed to the DMA controller to be kicked in one go.
For cases where the DMA set-up time would take longer than a naive CPU copy or fill, a CPU fallback is used instead.
Composition is more complex as the inputs vary much more. The operation allows many different operating modes, eg with different filtering, transformations, pixel formats, blend equations, wrapping modes - but some operations are much more common than others. These are handled in three different ways:
- synchronous acceleration by the VPU's vector unit . This covers the fewest cases but ideally the ones in which the most pixels need to be processed, where a speed-up will be most appreciated. Hand coded in assembly.
- synchronous low-latency CPU implementation using 32-bit ARM SIMD, catching all common cases. This should perform a "good enough" job, as it is primarily designed for low-pixel-count operations, eg rendering small antialiased characters.
- fallback to the generic X implementation: this covers all other composition modes. This is the worst case, as the overhead reaching the first actual image processing instruction is high.
SoC hardware used
As mentioned above, the driver leverages three things that the generic driver wouldn't otherwise use.
The ARM CPU's SIMD instruction set
This is not as comprehensive  as the NEON instruction set found in some v7 ARM implementations but is still useful for the task of composition. Through C++ template metaprogramming, a careful consideration of what a compiler can and cannot optimise, and finally eyeballing the code generated we can have composition functions that are of a comparable speed to those hand-optimised functions in pixman. The templating helps here by generating hundreds of these functions, rather than the handful that are specially implemented in pixman.
The Raspberry Pi SoC includes a decent number of DMA engines which can be used by the ARM for moving data around memory. They can all access the full bandwidth of the memory - more than the ARM could itself. They all share the bandwidth, and one DMA is sufficient to saturate the bus. The DMA engines are not all the same however - some have greater performance or features than the others. For instance, half of them have the ability to perform '2D' DMAs rather than straight linear transfers. Also one of the DMA engines (DMA zero) has a deeper FIFO allowing it to do larger read bursts