RPi Xorg rpi Driver

From eLinux.org
Revision as of 05:48, 2 January 2013 by Teh orph

This is the documentation for the Xorg Raspberry Pi driver developed in this thread [1].

By default, each Raspberry Pi Linux distro uses the generic framebuffer driver to draw the X display. All rendering of the display is done by the CPU into off- or on-screen buffers, which are eventually shown on the output by the scan-out hardware. As the CPU on the Raspberry Pi is reasonably weak, this makes for a sluggish user interface and at the same time causes a high CPU load, which slows down other programs.

In the modern 2D X11 desktop environment, however, there are two major ways that an application can choose to render itself:

  • all rendering done by the X server
  • nearly all rendering done by the application, with the X server simply presenting the rendered output

The primary goals of this project are to improve the performance of the first case and leave the second to other projects - there are many different user libraries that can do application-side rendering and boosting the performance of each of those would be a huge undertaking. The driver accomplishes this by offloading common tasks onto other hardware on the SoC that can process the work asynchronously, allowing the X server to be pre-empted by the OS yet still allowing progress to be made. This should allow other processes to see more CPU time.

Unfortunately this means that even if the X server runs infinitely faster, applications can still seem unresponsive if extensive application-side rendering is used. This is a common problem in optimisation. [2]

No effort has been made so far to allow applications access to OpenGL/GL ES through the X server eg via GLX - basic 2D has been the priority so far.


Xorg provides a mechanism for drivers to accelerate a number of important rendering tasks. This is called EXA [3]. EXA allows easy overriding of:

  • block copy aka blitting
  • solid colour fills
  • compositing, aka alpha blending

This driver implements the required functionality of EXA by using different parts of the Raspberry Pi SoC.

  • 2D A->B block copies are performed using asynchronous DMA. A DMA engine is programmed to copy - in either a linear or 2D fashion - from A(x, y) to B(x2, y2) with an incrementing source address and incrementing destination address.
  • Solid colour fills are also performed by asynchronous DMA. A DMA engine is programmed to copy (again either linear or 2D) from a non-moving source address to a moving destination address starting at B(x, y). The source address holds the colour that is to be used in the fill.
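As a rough illustration of what the two DMA operations above compute, here is a minimal Python sketch that performs the same 2D copy and solid fill on the CPU. It is not the driver's code: the function names, the flat bytearray buffers and the 4-bytes-per-pixel constant (assuming the 32-bit display mode) are all assumptions made for the example.

```python
BPP = 4  # bytes per pixel; assumes the 32-bit display mode


def copy_2d(src, src_stride, sx, sy, dst, dst_stride, dx, dy, w, h):
    """Copy a w x h pixel rectangle from src(sx, sy) to dst(dx, dy).

    Both source and destination addresses increment, row by row.
    Strides are in bytes.
    """
    for row in range(h):
        s = (sy + row) * src_stride + sx * BPP
        d = (dy + row) * dst_stride + dx * BPP
        dst[d:d + w * BPP] = src[s:s + w * BPP]


def fill_2d(dst, dst_stride, dx, dy, w, h, colour):
    """Fill a w x h pixel rectangle at dst(dx, dy) with one colour.

    The 'source' never moves: the same BPP bytes are written repeatedly
    while the destination address increments.
    """
    if len(colour) != BPP:
        raise ValueError("colour must be exactly BPP bytes")
    for row in range(h):
        d = (dy + row) * dst_stride + dx * BPP
        dst[d:d + w * BPP] = colour * w
```

The only difference between the two loops is whether the source offset advances, which mirrors the incrementing versus non-moving source address described above.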

DMA commands are enqueued one after the other to ensure correct results. They are constructed as a chain of DMA control blocks (CBs), and passed to the DMA controller to be kicked in one go.

For cases where the DMA set-up time would take longer than a naive CPU copy or fill, a CPU fallback is used instead.

Composition is more complex as the inputs vary much more. The operation allows many different operating modes, eg with different filtering, transformations, pixel formats, blend equations, wrapping modes - but some operations are much more common than others. These are handled in three different ways:

  • synchronous acceleration by the VPU's vector unit [4]. This covers the fewest cases but ideally the ones in which the most pixels need to be processed, where a speed-up will be most appreciated. These routines are hand-coded in assembly.
  • synchronous low-latency CPU implementation using 32-bit ARM SIMD, catching all common cases. This should perform a "good enough" job, as it is primarily designed for low-pixel-count operations, eg rendering small antialiased characters.
  • fallback to the generic X implementation: this covers all other composition modes. This is the worst case, as the overhead reaching the first actual image processing instruction is high.

SoC hardware used

As mentioned above, the driver leverages three things that the generic driver wouldn't otherwise use.


The ARM core's 32-bit SIMD instruction set is not as comprehensive [5] as the NEON instruction set found in some ARMv7 implementations, but it is still useful for the task of composition. Through C++ template metaprogramming, careful consideration of what a compiler can and cannot optimise, and finally eyeballing the generated code, we can have composition functions that are of comparable speed to the hand-optimised functions in pixman. The templating helps here by generating hundreds of these functions, rather than the handful that are specially implemented in pixman.

DMA engines

The Raspberry Pi SoC includes a decent number of DMA engines which can be used by the ARM for moving data around memory. They can all access the full bandwidth of the memory - more than the ARM could itself. They all share the bandwidth, and one DMA is sufficient to saturate the bus. The DMA engines are not all the same however - some have greater performance or features than the others. For instance, half of them have the ability to perform '2D' DMAs rather than straight linear transfers. Also one of the DMA engines (DMA zero) has a deeper FIFO allowing it to do larger read bursts.

The DMA hardware does not live within the same address space as the ARM CPU. It uses the bus address space instead, so an address must be translated to get from one to the other. Also, the ARM's page tables are not used by this hardware. Virtually contiguous ARM addresses are not necessarily physically contiguous and the DMA hardware won't know about this - as a result a DMA transfer sometimes needs to be broken up into 4 KB blocks to ensure the correct result.
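The splitting can be sketched as follows. This is a minimal Python illustration, not the driver's code: the function name and the tuple layout are hypothetical, and it assumes both the source and the destination are ordinary virtual ranges in which only one 4 KB page at a time is known to be physically contiguous.

```python
PAGE_SIZE = 4096


def split_copy(src, dst, length):
    """Split a linear copy into pieces that never cross a 4 KB page
    boundary on either the source or the destination side.

    Returns a list of (src, dst, nbytes) tuples; each would become one
    DMA control block once its addresses had been translated.
    """
    chunks = []
    while length > 0:
        # Bytes left before each side reaches its next page boundary.
        src_room = PAGE_SIZE - (src % PAGE_SIZE)
        dst_room = PAGE_SIZE - (dst % PAGE_SIZE)
        n = min(length, src_room, dst_room)
        chunks.append((src, dst, n))
        src += n
        dst += n
        length -= n
    return chunks
```

For example, a 300-byte copy whose source starts 96 bytes before a page boundary comes out as two pieces, of 96 and 204 bytes.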

There is a start-up cost associated with DMA, and as a result sometimes it is not efficient to use this hardware. Steps include:

  • breaking a large transfer up into something which respects 4 KB page boundaries
  • entering the kernel
  • translating user virtual addresses into bus addresses for each DMA CB
  • flushing and invalidating parts of the data cache, then kicking off the DMA chain
  • returning to user mode to do more work
  • entering the kernel
  • waiting for the DMA to complete
  • returning to user mode

For reference, a user->kernel->user transition takes roughly 1 us. Each DMA CB appears to take the DMA engine around 6 us to start.
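To give a feel for how these figures add up, here is a minimal Python sketch of a cost estimate. The two timing constants are the approximate figures quoted above; everything else is an assumption for illustration: the function names, the choice of two kernel round trips (one to kick the chain, one to wait for it), and in particular the CPU copy rate, which is a caller-supplied placeholder rather than a measured value. Address translation and cache maintenance costs are ignored.

```python
KERNEL_ROUND_TRIP_US = 1.0  # approximate user->kernel->user transition
CB_START_US = 6.0           # approximate start-up time per DMA control block


def dma_overhead_us(num_cbs, kernel_round_trips=2):
    """Estimated fixed cost, in microseconds, before a chain of
    num_cbs control blocks has all been started."""
    return kernel_round_trips * KERNEL_ROUND_TRIP_US + num_cbs * CB_START_US


def prefer_cpu(num_bytes, num_cbs, cpu_bytes_per_us):
    """True if a plain CPU copy is estimated to finish sooner than the
    DMA overhead alone. cpu_bytes_per_us is a hypothetical copy rate."""
    return num_bytes / cpu_bytes_per_us < dma_overhead_us(num_cbs)
```

With a placeholder rate of 100 bytes per microsecond, a 256-byte copy in a single CB would be kept on the CPU, whereas a 4096-byte copy would not.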


Also on the SoC is a custom processor, the VPU, that appears to be the controlling brains of the GPU. This is what the 'firmware' runs on. See Rpi_Software#Overview.

Within this processor is a 16-way vector unit which is well-suited to image processing operations. Although this processor is ordinarily clocked nearly three times slower than the ARM core, the vector unit and improved memory interface more than make up for it. Some composition functions that commonly operate on thousands of pixels have been coded to run on this unit.

Like the DMA hardware, it lives within the bus address memory space. User virtual ARM addresses need translation, and the 4 KB page boundary/page table issues still apply. Also from the viewpoint of the driver the VPU appears as an asynchronous co-processor: there is a ~56 us overhead at stock clock speeds communicating with it from X so work should really only be sent to it if worthwhile.

Blocking waits

The workload sent to the driver from the running applications is not known in advance. The structure of the work is generally the same though. Allocate some images, upload some data from the user application, perform a handful of operations, synchronisation point. Perform some more operations, synchronisation point.

The point of the frequent synchronisation points is to allow the application to get hold of the rendered pixel data. It is also there to allow the application to release memory. By knowing that all rendering has completed by a given point, it knows that what it sees in the buffer is correct and also that a given image buffer is no longer in use and can be freed.

This behaviour is contrary to a game-style render loop: for the majority of a frame a command buffer is filled with drawing commands. At the end of the frame this command buffer is sent to the GPU for processing. Whilst this was going on, however, the GPU was processing the last frame's command buffer. The X update loop differs from this because:

  • there are many more synchronisation points, and it is unknown when they will appear
  • there is not a double-buffered command buffer, meaning the GPU cannot be processing "last frame's image" whilst the CPU is building this frame's command buffer
  • image data cannot go too far away from the CPU as the application may want to inspect it with the minimum of delay

This means OpenGL and other run-times are not appropriate for an X driver on a system like the Raspberry Pi that lacks horsepower. If GL or a similar run-time were used, the high CPU cost of setting up and tearing down images, command lists etc would dwarf the actual amount of time the GPU would spend rendering. Also, the fact that on the Raspberry Pi textures are not trivially accessible by the CPU means that when an application needs to gain access to pixel data a high cost must be paid stopping the GPU and then downloading the textures. Even the 6 us start-up time of a DMA CB is noticeable in some applications - if instead the whole 3D stack were traversed, many applications would be far slower than the generic CPU route.

That said, applications which expect the driver to be implemented on a run-time like OpenGL will tailor their workload to suit it; the Chromium browser is one example. This is an exception though - most applications expect a synchronous driver with easily-reachable memory.

When to synchronise

As mentioned before, the driver treats the VPU and DMA hardware as asynchronous co-processors. The CPU overhead of reaching them is relatively high and, as CPU time is not in abundance, this overhead needs to be amortised over as many operations as possible. This means the CPU builds up a list of work to send to the DMA hardware and a list of work to send to the VPU hardware. Eventually these lists are sent off for processing. The overhead cost is only paid once.

Yet as it is not known when a synchronisation point may appear it is not clear when a DMA or VPU command list must be started. By waiting longer the overhead is decreased (per unit of work) yet there is a greater chance that the application requests the work is finished, but it won't be as it has yet to start. The application then blocks. The opposite is also true - if work is kicked too frequently, too much overhead results yet the application will wait less as there is a greater chance the work has completed by the time it requests a sync point.

Performing composition with the VPU is tricky too - as there is a 56 us start-up cost to performing work there, are there sufficient pixels to process that it is worth this cost? An estimate needs to be made of the relative speeds of the two processors (ARM and VPU) and based on this a decision is made whether the task is run on the co-processor or not. Something to also consider is that on the ARM CPU the run-time is variable. There are early-out optimisations that can be performed based on the value of the mask (if present). These optimisations can't be used on the VPU.
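The estimate described above can be sketched as a break-even calculation. This Python fragment is only an illustration under stated assumptions: the 56 us figure is the overhead quoted in the text, but the function names and the two throughput parameters are hypothetical, and a fixed pixels-per-microsecond rate ignores the ARM's variable run-time and mask-based early-outs.

```python
VPU_OVERHEAD_US = 56.0  # start-up cost quoted for stock clock speeds


def break_even_pixels(cpu_px_per_us, vpu_px_per_us):
    """Pixel count above which the VPU is estimated to finish first.

    Solves pixels / cpu_rate == VPU_OVERHEAD_US + pixels / vpu_rate.
    Returns None if the VPU is not faster per pixel, because then it
    can never recover its start-up cost.
    """
    if vpu_px_per_us <= cpu_px_per_us:
        return None
    return VPU_OVERHEAD_US / (1.0 / cpu_px_per_us - 1.0 / vpu_px_per_us)


def offload_to_vpu(pixels, cpu_px_per_us, vpu_px_per_us):
    """Decide whether a composite of this many pixels is worth sending."""
    threshold = break_even_pixels(cpu_px_per_us, vpu_px_per_us)
    return threshold is not None and pixels > threshold
```

The rates passed in are whatever the estimate of the two processors' relative speeds turns out to be; the sketch only shows how the fixed start-up cost turns that estimate into a minimum worthwhile pixel count.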

Finally there is also the case where work to be processed with DMA is so small that it would be quicker to process on the CPU. Yet if a million tiny pieces of work come along back-to-back it would still have been faster to do with DMA - but it is not known how many pieces of work will be enqueued. This means there is a decision to be made: when should the CPU perform work that it thinks it could do faster than the DMA hardware?

High-level driver layout


Yellow represents "the driver", code that has been written as part of this project.

Memory layout

In order to suit different situations, there are three main ways of using the driver. Each has advantages and disadvantages. Due to the way EXA is currently being used, it needs to track a large chunk of "offscreen memory". This is memory that the application can't see, yet the graphics accelerator (in whatever form the driver writer is targeting) can see. The problem in the Raspberry Pi's case is that all the targeted hardware can see all the memory, and so this abstraction is unnecessary and wasteful. This limitation incurs a decent performance penalty, but work will be done in the future to address this issue.

Initial work has been done to work around the above issue by use of the SelfManagedOffscreen option.

Here are the three ways of providing EXA and the driver with its offscreen memory.

4 KB page mode

In order for DMA to function, the memory being operated on must not be moved to a different physical address by the Linux page compaction system, nor be paged out to swap. Either can happen to normal user pages, so they are not suitable. One way to get around this is to allocate pages from the kernel and map them in to the user space. These pages are unmoveable, they will never be swapped and their physical addresses can be cached for a more efficient translation.

The advantages for this scheme include:

  • no wasted memory - if X does not use the whole of the offscreen space, then only the pages that are used are allocated. No pre-declaration or reservation of memory is necessary.
  • it is the most robust and secure system, as all address translations are vetted by the kernel, and the hardware that is provided the addresses is simple and predictable.
  • the maximum memory can be scaled up and down at run-time, without a reboot

The disadvantages are based around the fact that memory is not physically contiguous:

  • DMA must be broken at the 4 KB page boundary into separate DMA transfers (still as part of the CB chain though)
  • VPU composition cannot be used
  • 2D DMA cannot be used due to the complexity of breaking at page boundaries
  • more CPU address translation is necessary due to the increase in CBs
  • the address translation is generally more complex, increasing the ARM CPU load

Finally, as the pages are unmoveable they cannot be swapped out - this will increase the pressure on the remaining 'normal' memory.

With all three memory options, all on-screen pixmaps that are not under the control of EXA are allocated with the 4 KB mode in order to allow DMA to pull them into offscreen memory.

Boot-time reservation mode

This mode involves telling the kernel to simply ignore - at boot-time - a chunk of the memory provided to the CPU. It is never used by applications or the kernel as the system simply believes it has less memory attached. This reserved memory can be mapped in as the offscreen buffer to good effect. It is physically contiguous, and this means address translation is very simple. Just one constant needs to be added to the user virtual address to get to the final bus address needed by DMA and the VPU. If the mapping is done smartly, no offset is needed at all.
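The single-constant translation can be sketched like this. It is a minimal Python illustration with hypothetical names and made-up example addresses, not the driver's implementation; it assumes one contiguous region mapped at virt_base in the process and starting at bus_base on the bus.

```python
def make_translator(virt_base, bus_base, size):
    """Return a function mapping a user virtual address inside the
    reserved block to its bus address by adding one constant."""
    offset = bus_base - virt_base  # zero if the mapping is done smartly

    def to_bus(virt):
        if not virt_base <= virt < virt_base + size:
            raise ValueError("address is outside the reserved block")
        return virt + offset

    return to_bus


# Hypothetical example values, chosen only for illustration.
to_bus = make_translator(virt_base=0x10000000, bus_base=0x07500000,
                         size=0x00100000)
```

Contrast this with the 4 KB page mode, where every page needs its own lookup because neighbouring virtual pages can sit anywhere in physical memory.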

Advantages include:

  • DMA does not need to be broken into 4 KB chunks
  • VPU composition can be used
  • 2D DMA can function
  • address translation is very simple

Disadvantages include:

  • memory is wasted if it is not used, either by X not needing it all or the user simply not running X in the first place
  • a reboot is needed to change the reservation size
  • the user needs to compute the address of the reservation before passing it to X
  • there is generally a greater chance of something "going wrong" due to user or programmer error (there's no problem inherent to the technique, though)

VideoCore mailbox reservation mode

This involves asking the firmware for a block of memory from its share, and then mapping it in to the user process. This can function either in the static split mode, where the user declares how much is needed for the GPU in their config.txt (eg 192/64) or via the new "floating split" CMA mode. The size of the split grows and shrinks based on what's happening at that time.

Advantages include:

  • DMA does not need to be broken into 4 KB chunks
  • VPU composition can be used
  • 2D DMA can function
  • address translation is very simple
  • easy user set-up
  • no wasted memory if X is not used
  • memory usage can be changed without a reboot

Disadvantages include:

  • if not in the CMA mode, this may prevent 3D applications from starting as there may be insufficient GPU memory

Installation of the driver


Installation of the pre-prepared binary driver is easy, assuming a few prerequisites.

  • Raspbian
    • as up-to-date as possible
  • xserver-xorg-core and xserver-xorg-video-fbdev need to already be installed: ie you need to be able to load a desktop without any trouble, with no modifications made to the installed code

Pre-install set-up

  • Install Hexxeh's rpi-update, if not previously installed
  • Update to the bleeding edge (as of the time of writing) kernel and firmware
    • sudo rpi-update d0fe451d1e17c1780348d90daa2d45569b09efec
  • Change the display to 32-bit mode
    • Add these lines to your /boot/config.txt file
      • framebuffer_depth=32
      • framebuffer_ignore_alpha=1
  • If you want to use the VideoCore mode and the CMA memory allocation also add,
    • cma_lwm=16
    • cma_hwm=64
    • Make sure to remove any gpu_mem= option
  • If you want to use the VideoCore mode but not CMA then you need to manage the memory split yourself
    • Remove any cma_ options
    • Try gpu_mem=32, 48 or 64
    • Once you reboot, run sudo vcdbg reloc | grep "largest free block"
    • Multiply this size by 1048576 and use as the BlockSize option. See that entry in the configuration section to know what sort of size to aim for.
  • Reboot, and ensure your device comes back up with the new firmware and kernel
    • uname -a should give something like > 3.6.11
    • vcgencmd version should give something like > 359004
  • Check that the display is in fact in 32-bit mode
    • fbset, and look at the end of the 'geometry' line: it should say 32
  • Ensure you can load the desktop like normal
    • do this however you would do, and confirm that everything works (eg your USB input devices) and also that there are no oddities due to being in 32-bit colour mode
  • Quit the desktop and return to the console again

Driver install

Download <xorg-server.tar.gz> and <xserver-xorg-video-fbdev.tar.gz>.

  • Install the X server
    • sudo tar xfvz xorg-server.tar.gz -C /
  • Install the driver, default config file and VPU binary
    • sudo tar xfvz xserver-xorg-video-fbdev.tar.gz -C /

Kernel module install and set-up

Two kernel modules need to be configured: the DMA controller and the VC mailbox.

  • Load the DMA module
    • sudo modprobe dmaer_master
    • major=$(awk '$2=="dmaer" {print $1}' /proc/devices)
    • sudo mknod /dev/dmaer_4k c $major 0
  • Make the mailbox character device (for now in the home dir as that's what the default config file expects)
    • sudo mknod /home/pi/char_dev c 100 0

Running it

Make sure you have backed up by this point. Run 'sync' just because you can. From a console that you will always be able to see in the event of trouble (an alt-F1 VT is not good enough), run the following:

  • sudo gdb Xorg
  • set args -verbose -keeptty
  • handle SIGPIPE nostop
  • run

If you see a hang at "Checking that it works..." then this means VPU composite has failed. You must now ctrl-z, sync and sudo reboot. You will not be able to ctrl-c. When the device reboots, try running again (you will need to modprobe the module and make /dev/dmaer_4k again). If it hangs again double-check you do have the latest kernel and firmware, and that /usr/share/X11/vpu_offload_asm.bin is intact. Its md5sum is 836970f42edb1268087efedceca50ec1.

Configuration of the driver

There are a number of different 'Options' that can be passed to the driver at load-time. These can be added to the default configuration file I supply.

Option name | Suitable inputs | Explanation

AccelMethod | EXA, EXA_NULL, NoAccel
  • Ordinarily you should use EXA.
  • If you want to wholly disable acceleration and fall back almost to the original fbdev driver, use NoAccel. You will still see the improvements made to window dragging performance, as they aren't controlled by this option.
  • EXA_NULL is a special option: this means to 'accelerate' as many things as possible via the EXA framework...but simply do nothing. This will allow the X server to run as fast as absolutely possible - as it's not doing anything! The display will be heavily corrupted - but it is likely you can still make out what is going on. And you will also be able to make out that stuff is still slow. This is the upper bound on performance that I can achieve through EXA.

FaultInImm | true, false | This is really only useful for people who choose to find their EXA offscreen memory via the 4 KB method (see RPi_Xorg_rpi_Driver#Memory_layout). As memory is allocated lazily (ie when it is first used) you will have a slowly increasing memory load. This can cause trouble for the Ethernet/USB driver, plus cause occasional slow-downs.

By choosing true here all pages allocated via the 4 KB method will be allocated and locked up-front. Only use if you're seeing trouble in dmesg related to memory allocations.

BlockBase | an address, in decimal or hexadecimal | If you choose to use the boot-time reservation you will need to use this option to configure where your memory hole is. You can derive this address yourself, or sort of look it up in /proc/iomem. Let's say you reserve 123 MB for the driver and you choose to use a 240/16 MB CPU/GPU split in config.txt. The address will be (240 - 123) * 1048576.

You will be able to see this by looking in /proc/iomem at the first entry: "System RAM". It will say 00000000-<some number in hex>. <some number> will be (240 - 123) * 1048576 - 1. Easy.

To make the actual reservation you must add mem=sizeMB to your kernel command line, where size is the amount that the kernel will manage. In the example above size would be 256 - 16 - 123 = 117, ie mem=117MB.
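The arithmetic for the worked example (256 MB total, 16 MB for the GPU, 123 MB reserved for the driver) can be checked with a few lines of Python. The function and key names are hypothetical; only the numbers come from the example above.

```python
MIB = 1048576


def reservation(total_mb, gpu_mb, reserved_mb):
    """Derive the boot-time reservation values from the memory split."""
    cpu_mb = total_mb - gpu_mb        # memory on the ARM side, eg 240
    kernel_mb = cpu_mb - reserved_mb  # what the kernel keeps, ie mem=
    return {
        "mem_mb": kernel_mb,
        "block_base": kernel_mb * MIB,  # first byte above kernel memory
        "block_size": reserved_mb * MIB,
    }


r = reservation(256, 16, 123)
print("mem=%dMB" % r["mem_mb"])
print("BlockBase 0x%08x" % r["block_base"])
print("System RAM should end at 0x%08x" % (r["block_base"] - 1))
```

Here the hole starts at 0x07500000, so the "System RAM" entry in /proc/iomem should end at 0x074fffff.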

BlockSize | a number of bytes, in decimal or hexadecimal | This is the maximum number of bytes you would like to use as the driver's offscreen memory. You should realistically use over 8 MB - I would say this would be a minimum for 1280x1024x32. 16 MB would be a realistic minimum for 1080p.

Nothing should fail if you make it too small (although the driver will stop and caution you at load-time if you choose a silly-small number); there will simply be more swapping back and forth, ie worse performance. For users who use the boot-time reservation this number should really be the size of the hole that you reserve. It can be smaller (pointless) but no bigger.

Note that if SelfManagedOffscreen is used then things may go funny if it runs out of memory. So these recommendations apply for SelfManagedOffscreen=false. If SelfManagedOffscreen=true then maybe try 24-32 MB.

MemMode | 4k, mem, vc
  • 4k means using the 4 KB page-based allocation mode
  • mem means the boot-time reservation system
  • vc means using the VideoCore mailbox system

VpuOffload | true, false | This toggles whether large composition operations will be offloaded to the VPU, assuming the blending function and pixel formats match. This is a highly-experimental feature and can lock up your Raspberry Pi and/or corrupt your SD card.

This makes use of information derived through reverse engineering and should not be used in countries where this is prohibited.

MboxFile | a filename | This should point to a special character device file, with major 100 minor 0. Only needs to be set if VPU offloading is used.

VpuElf | a filename | This should point to a special binary which is to be run on the VPU and provides the VPU offload functionality. Not necessary if VPU offload is disabled.

VerboseReporting | true, false | Turns on the stats that show what the EXA system is handing to the driver, broken down by each function, the number of pixels being pushed by each part, and which part of the SoC is doing what work.

SelfManagedOffscreen | true, false | This allows the driver to manage the offscreen memory itself, rather than letting EXA manage that memory as a cache. This has the advantage that the download step is no longer necessary (but upload still currently is), and performance improves as a result. The disadvantage is that the scheme is very new, and it is not clear how it behaves in the long run. If this option is used, it is recommended that sufficient memory is allocated with BlockSize such that the driver does not run out of memory.


Here is a list of common problems and hopefully their solutions. This section will likely be expanded as problems arise.

Failed to initialise kernel interface (does /dev/dmaer_4k exist?)

This problem occurs when the /dev/dmaer_4k file cannot be successfully opened. This can happen because:

  • the file does not exist
  • the file exists, but the major/minor dev numbers do not match that of the kernel module
  • the file exists, but the kernel module is not loaded
  • the file exists, the kernel module is loaded, but the permissions on the file do not allow it to be opened

Double-check that you have modprobe'd the kernel module, and that you have created the special device file with the correct major/minor numbers. Note that these numbers can change depending on the order that modules are loaded into the kernel. Finally check that you are running Xorg as root.

My display is entirely corrupt yet it appears 'functional'

If you choose to use the 'null' driver then this will happen; it is intentional. Double-check that you are not using EXA_NULL as AccelMethod in your Xorg config file.

Memory size not specified! (use BlockSize)

You need to tell the driver how much memory you would like to use as "video memory". Memory is physically unified in the Raspberry Pi so this concept does not really apply, but you must tell the driver nonetheless. See RPi_Xorg_rpi_Driver#Memory_layout and RPi_Xorg_rpi_Driver#Configuration_of_the_driver for details.

No memory mode specified

You need to tell the driver how to configure the "video memory" that is used by the driver. See the above question as it is very similar.

Memory base not specified! (use BlockBase)

You have chosen to use the 'mem' memory mode option, but you have not told the driver where it is mapped in the address space. See RPi_Xorg_rpi_Driver#Configuration_of_the_driver for an example of its usage.

Memory size of <whatever> bytes is too small to be useable

You must specify the BlockSize option in bytes, not megabytes or kilobytes. The driver does a basic test to see if the value is below 1048576 and will immediately fail if so.

Failed to open /dev/mem

If using the 'vc' or 'mem' memory modes, the driver uses /dev/mem to map memory in to Xorg's address space. You must ensure that /dev/mem exists, and has the correct permissions to allow opening by the current user - which should ideally be root.

CMA requires memory size to be multiple of page size

The size you choose for BlockSize must be a multiple of 4096. Ensure the number you pass in with this option has no remainder. Note that even though the error says 'CMA' it actually means the 'vc' memory option.
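Before starting X, a candidate BlockSize can be checked against the two rules mentioned in these entries (at least 1048576 bytes, and a multiple of 4096). The following Python helper is a hypothetical convenience, not part of the driver, and its messages are not the driver's own error strings.

```python
MIB = 1048576
PAGE_SIZE = 4096


def block_size_problems(block_size):
    """Return a list of reasons why block_size (in bytes) is likely to
    be rejected; an empty list means both checks pass."""
    problems = []
    if block_size < MIB:
        problems.append("below 1048576 bytes (was it given in MB or KB?)")
    if block_size % PAGE_SIZE != 0:
        problems.append("not a multiple of 4096")
    return problems
```

For instance, 16 * 1048576 passes both checks, while a bare 16 (megabytes written by mistake) fails both.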

Failed to allocate CMA memory from VC

Again, ignore the 'CMA' bit but this means that the VideoCore has rejected the size of the memory allocation. Check dmesg just for confirmation. Try reducing your BlockSize option to match what sudo vcdbg reloc | grep "largest free block" returns, in bytes.

Failed to open binary file <whatever>

This means that the file chosen with the VpuElf option could not be opened for reading. Check that the path to the file is correct, and that it has read permissions for the current user.

Misc VPU binary loading problems

  • failed to read file marker
  • binary does not appear to be loadable
  • failed to read version number
  • incorrect format version number, <whatever>
  • header is at invalid offset, <whatever>
  • header points beyond the end of the file, <whatever>

These all indicate a complete failure to load the VPU binary. Check that the file pointed to by the VpuElf option is not corrupt, and is exactly what comes from the driver release. VPU binaries are versioned and must correspond to a given driver release. Any corruptions to the file can hang the VPU or trash the memory or SD card.

System hangs whilst testing VPU code

Immediately ctrl-z (not ctrl-c) Xorg, sync and reboot your system. It will likely fail to shut down properly but as soon as the ARM kernel has shut down (you'll see a message on the screen) pull the power. On rebooting, try again. If it happens again confirm that the VPU binary is intact and has not been tampered with, and that you have an up-to-date start.elf/GPU firmware that has execute capability.

There is some display corruption

For example, colours are not what they are expected to be, icons look corrupt, speckled pixels etc. This should be reported to me as a bug via [6]. However you need to ensure you have a reproducible example - otherwise it can't be fixed. Restart X and try to figure out the minimum number of steps needed to get the artefacts you can see.

Also try disabling VpuOffload and see if you have the same effect. Finally, before submitting the bug you need to confirm that the stock driver would not have done it anyway. Uninstall the driver and run your test again to confirm if it does or does not happen.


This will mean one of three things:

  • the driver has crashed outright
  • the driver has caught an invalid DMA command
  • the driver has malfunctioned somehow and caught it

All of these are bad and need to be reported via [7]. It may be user error, but if so a more user-friendly message should be displayed.

In order to submit this as a bug you need to have a reproducible example, the output log from Xorg (if appropriate) and the core file. After the crash, from gdb run generate-core-file and ensure this file is submitted with the repro.

The display does not appear to be accelerated

The ARM CPU is a huge bottleneck and even at its best the driver is not substantially faster than the stock driver. You may need to reel in your expectations! Don't forget that previously you would likely have been using the 16-bit colour mode, which is lighter on the memory subsystem. However in the case of genuine lack-of-acceleration, check:

  • the display is running in 32-bit colour mode. The driver is designed for this mode and any other mode will invoke much of the original driver's code as its fallback mechanism. There is no composition support for 16-bit colour on either the CPU or VPU, but DMA copies and blits will work correctly. CPU fallback for small copies/blits does not exist for 16-bit colour mode and will continue to use DMA.
  • you may be running programs which do all of their work with client-side rendering. This driver cannot assist there, and so ends up being a glorified framebuffer. There are many such programs which do all their work like this. Run top and see where your CPU time is being spent - is it all in the application?
  • has enough memory been passed to the driver? If not enough memory is provided (consider the display resolution too) then there will be insufficient memory for the code to manoeuvre and the fallback path will be chosen. If using the SelfManagedOffscreen option check to see that the driver is not running out of memory.
  • the whole system may have run out of memory, and swap memory is being used. You can identify this with top or free.
  • some applications will try to use OpenGL to do display composition, rather than XRender. There is no hardware support for OpenGL, and this means that (at best) software rendering will be used. Check to see if the application or window manager is trying to use GL. Remember, there is currently no support for OpenGL ES via GLX either.

Finally do consider that the current version needs to be tuned (hence the testing, to learn common usage patterns) and that it is full of validation code to ensure that if things go awry they get caught before any damage is done. This validation code does slow things down.

Black square in the top left of the display

This is the VT's cursor showing through onto the framebuffer. It sounds exactly like what is seen in [8]. Note that even the stock fbdev driver, when used in 32-bit colour mode, exhibits the same behaviour. No effort has been made to address this.

rendercheck sometimes fails

It appears that the render targets that use the 'window' can sometimes fail. This is because the window render options use the X root window, which lives in the framebuffer memory. The VT cursor/black square issue in the previous point interferes with this (by writing black), causing those tests to fail.

Disabling / uninstalling

You can disable the driver by changing the AccelMethod option from EXA to NoAccel. This disables VPU and DMA work, but still leaves the X server itself with performance-altering modifications in it - toggling AccelMethod will do nothing here.

To wholly remove this code, run

sudo apt-get install --reinstall xserver-xorg-video-fbdev xserver-xorg-core

This will just reinstall over the top of any modified files.