

#### **Android Platform Optimizations**

ELC-Europe

Prague, October 2011

Ruud Derwig



## **Helping Design the Chips Inside**



## **Agenda**

Market & value drivers

What to optimize?

Synopsys

How to optimize?

Results & conclusion



#### **Android Markets**

- Smartphones
- Tablets
- TV
- STB / multimedia
- Others / new














# **Key Value Drivers & System Architecture Choices**

- Power consumption
  - → optimize performance / mW
- Product cost
  - → optimize performance / area
  - → optimize development efficiency





- Hardware/software trade-offs
  - Maximum flexibility & developer efficiency: "virtual everything"
    - PC model, multi-GHz SMP processor-centric designs
  - Minimal power & optimal performance: "dedicated hardware"
    - dedicated, fixed-function device
  - Sweet spot: "heterogeneous, HW accelerated multi-core"
    - Mix of CPU, DSP, and dedicated HW






#### **Linux Kernel & Library Optimizations**

- Important, but not Android-specific
- Optimization options
  - Optimize hotspots
    - compiler
    - handwritten assembler
  - CPU hardware optimizations
    - MMU
    - special instructions



#### **Dalvik Virtual Machine**

- "Java" \* virtual machine
  - Register-based architecture (Java VMs are stack machines). Dalvik registers are typically stored in memory (on the stack, like local variables in C).
  - Own bytecode
- Three VM implementations
  - Portable: written entirely in C; in effect one large `switch` statement with a `case` for every Dalvik opcode.
  - Fast (a.k.a. MTERP): assembly-coded handlers for every Dalvik opcode, aligned on 64-byte boundaries so that a handler's address can be calculated directly from the opcode, saving a table lookup.
  - JIT: just-in-time compiler; starts out as the fast/MTERP interpreter, identifies 'hot' traces, and passes them to the compiler thread.

<sup>\*</sup>Dalvik is a clean-room implementation of Java for copyright reasons. The syntax is similar.
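
As a rough illustration of the "portable" style described above, the following C sketch interprets a tiny register-based bytecode with a single `switch`. The opcodes, the `insn_t` layout, and the helper `run_add_example` are hypothetical and chosen only for this example; they are not Dalvik's real encodings or source code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical opcodes for illustration only (not real Dalvik encodings). */
enum { OP_ADD_INT, OP_SUB_INT, OP_RETURN };

/* One instruction: an opcode plus three virtual-register indices. */
typedef struct {
    uint8_t op, dst, src1, src2;
} insn_t;

#define NUM_VREGS 16

/* Portable-style interpreter: one switch with a case per opcode.
 * The virtual registers are an ordinary array in memory. */
static int32_t interpret(const insn_t *code, int32_t *vregs)
{
    for (size_t pc = 0;; pc++) {
        const insn_t *i = &code[pc];
        assert(i->dst < NUM_VREGS && i->src1 < NUM_VREGS && i->src2 < NUM_VREGS);
        switch (i->op) {
        case OP_ADD_INT:
            vregs[i->dst] = vregs[i->src1] + vregs[i->src2];
            break;
        case OP_SUB_INT:
            vregs[i->dst] = vregs[i->src1] - vregs[i->src2];
            break;
        case OP_RETURN:
            return vregs[i->dst];
        default:
            assert(!"unknown opcode");
            return 0;
        }
    }
}

/* Runs "v0 = v1 + v2; return v0" with v1 = b and v2 = c. */
static int32_t run_add_example(int32_t b, int32_t c)
{
    int32_t vregs[NUM_VREGS] = { 0 };
    const insn_t program[] = {
        { OP_ADD_INT, 0, 1, 2 },   /* add-int v0, v1, v2 */
        { OP_RETURN,  0, 0, 0 },   /* return v0 */
    };
    vregs[1] = b;
    vregs[2] = c;
    return interpret(program, vregs);
}
```

Calling `run_add_example(2, 3)` executes two bytecodes through the switch and returns the sum. Every bytecode pays for the loop, the opcode fetch, and the switch dispatch, which is the overhead the assembly-coded and JIT variants aim to reduce.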



#### **Android Media Player Architecture**






## Audio Optimization Option: off-load audio processing to DSP





#### **Android Graphics - Architecture**

- 2D
  - Canvas/Skia
  - OpenVG
- 3D
  - OpenGL-ES 1.x
  - OpenGL-ES 2



- Renderscript
  - Expose native GPU/SMP to (portable) applications
  - C99 → LLVM intermediate bitcode → machine code

### **Android Graphics - Compositor**



#### **Graphics Optimization Options**

- Graphics drawing/rendering
  - Software/assembler optimization
    - Skia, PixelFlinger
  - Hardware acceleration
    - GPU (OpenGL-ES 2)
    - 2D accelerator (OpenVG compatible or other)
    - Memory architecture, caching
  - Renderscript
- Surface Composition
  - Scaling, colorspace conversion
    - Custom instructions
    - GPU
    - Dedicated hardware acceleration (bitblit)






#### **Optimized DesignWare ARC Android**



#### Differences between VM Implementations

| Portable | MTerp | JIT |
|----------|-------|-----|
| `switch (opcode) {`<br>`case add: a = b + c; break;`<br>`case sub: a = b - c; break;` | `ld r0, [b]`<br>`ld r1, [c]`<br>`add r0, r0, r1`<br>`st r0, [a]`<br>`ld r0, [next_opcode]`<br>`asl r0, r0, 6`<br>`add r0, r13, r0`<br>`j [r0]` | `ld r0, [b]`<br>`ld r1, [c]`<br>`add r0, r0, r1`<br>`st r0, [a]`<br>OR<br>`add r20, r20, r21` |

For comparison, dispatch through a jump table needs two dependent loads before the jump:

```
ld r0, [next_opcode]
<pipeline stall>
ld.as r1, [jump_table, r0]
<pipeline stall>
j [r1]
```
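
The difference between the two dispatch styles comes down to address arithmetic, which can be sketched in C. With handlers placed in 64-byte slots, the address is the base plus the opcode shifted left by 6 (the `asl r0, r0, 6` / `add r0, r13, r0` pair above), whereas a jump table needs an extra memory load. The function names and the use of plain integers as addresses are assumptions made for illustration.

```c
#include <stdint.h>

#define HANDLER_SHIFT 6u   /* 2^6 = 64 bytes per handler slot */

/* MTerp style: the handler address is computed from the opcode,
 * so no memory access is needed to find it. */
static uintptr_t handler_addr_computed(uintptr_t base, uint8_t opcode)
{
    return base + ((uintptr_t)opcode << HANDLER_SHIFT);
}

/* Jump-table style: the handler address is loaded from a table,
 * which costs one additional dependent load per dispatched opcode. */
static uintptr_t handler_addr_lookup(const uintptr_t *jump_table, uint8_t opcode)
{
    return jump_table[opcode];
}
```

For a base of `0x1000`, opcode 3 maps to `0x1000 + 3 * 64 = 0x10C0`. The trade-off is that every handler has to fit in (or branch out of) its fixed-size slot.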



## Register- and Stack-based VMs

Example:  $\mathbf{a} = \mathbf{b} + \mathbf{c}$ 

| Java | Dalvik | Dalvik for ARC |
|------|--------|----------------|
| `iload b`<br>`iload c`<br>`iadd`<br>`istore a` | `add-int a, b, c` | `add-int a, b, c` |
| `ld r0, [b]`<br>`push r0`<br>`ld r0, [c]`<br>`push r0`<br>`pop r0`<br>`pop r1`<br>`add r0, r0, r1`<br>`push r0`<br>`pop r0`<br>`st r0, [a]` | `ld r0, [b]`<br>`ld r1, [c]`<br>`add r0, r0, r1`<br>`st r0, [a]` | Registers are only saved/restored when changing stack frames or when moving to the interpreter |
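
The instruction counts in the table can be reproduced with a small C model that mimics each native sequence step by step for `a = b + c`. This is only a sketch: the `run_t` type and both function names are hypothetical, and the model counts the native instructions listed in the table rather than measuring a real VM.

```c
#include <stdint.h>

typedef struct {
    int32_t  result;   /* value stored to a */
    unsigned insns;    /* native instructions executed */
} run_t;

/* Stack machine: iload b; iload c; iadd; istore a */
static run_t run_stack(int32_t b, int32_t c)
{
    int32_t stack[2];
    int sp = 0;
    int32_t r0, r1;
    run_t r = { 0, 0 };

    r0 = b;            r.insns++;   /* ld   r0, [b]     */
    stack[sp++] = r0;  r.insns++;   /* push r0          */
    r0 = c;            r.insns++;   /* ld   r0, [c]     */
    stack[sp++] = r0;  r.insns++;   /* push r0          */
    r0 = stack[--sp];  r.insns++;   /* pop  r0          */
    r1 = stack[--sp];  r.insns++;   /* pop  r1          */
    r0 = r0 + r1;      r.insns++;   /* add  r0, r0, r1  */
    stack[sp++] = r0;  r.insns++;   /* push r0          */
    r0 = stack[--sp];  r.insns++;   /* pop  r0          */
    r.result = r0;     r.insns++;   /* st   r0, [a]     */
    return r;
}

/* Register machine: add-int a, b, c */
static run_t run_register(int32_t b, int32_t c)
{
    int32_t r0, r1;
    run_t r = { 0, 0 };

    r0 = b;            r.insns++;   /* ld  r0, [b]      */
    r1 = c;            r.insns++;   /* ld  r1, [c]      */
    r0 = r0 + r1;      r.insns++;   /* add r0, r0, r1   */
    r.result = r0;     r.insns++;   /* st  r0, [a]      */
    return r;
}
```

Both functions compute the same sum, but the stack version executes 10 modelled instructions against 4 for the register version, because every operand makes a round trip through the operand stack.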

#### **Audio Processing on DSP**



- Audio decoding and Post-processing off-loaded to ARC Sound Processor
- Special host Audio Decoder implementation that takes care of off-loading
  - with standard host decoder interfaces, so seamless integration
- Post-processing control through Renderer on host (special Renderer or Renderer plug-in component)





#### **Android & Audio APIs**

- Stagefright supports 2 types of interfaces
  - OpenMax-IL : for re-use of OMX components
  - Stagefright codec interface : for native Stagefright codecs
- AudioFlinger uses dedicated interfaces
  - standard implementation using "ALSA" exists
  - developments ongoing (?) to support OpenSL-ES, a Khronos standard (like OMX)
- SNPS API choice not yet made
  - OMX-IL pro : open standard
  - OMX-IL con: efficiency, complexity: standard by committee...
  - Stagefright pro : efficient integration with Stagefright
  - Stagefright con : not an open standard, no deep tunneling



#### **Alternative: GStreamer**



- GStreamer Android Player
  - see e.g. ELC-E 2010 presentation
- "The goal of the project is to both allow hardware makers to standardize on GStreamer across their software platforms, but also to make the advanced functionality of GStreamer available on the Android platform, like video editing, DLNA Support and Video conferencing."

#### GStreamer replace OpenCore



October 27, 2010





## GStreamer DSP Off-loading with "Deep Tunneling"

- GStreamer-MSF integration makes heterogeneous multi-core SW development transparent to the user
- Instantiation of a GStreamer element → instantiation of a module on one of the ARC cores
- Creation of a link → local connection or core-crossing connection between modules





#### **GStreamer Deep Tunneling**

```
static void connect_msf_outpin(GstPad *pad)
{
    GstPad         *peerpad     = gst_pad_get_peer(pad);
    GstElement     *element     = gst_pad_get_parent_element(pad);
    GstElement     *peerelement = gst_pad_get_parent_element(peerpad);
    GstAudioModule *filter      = GST_AUDIOMODULE(element);
    guint32         result;

    if (!pad_is_deeptunnel(pad)) {
        /* not a deep tunnel */
        /* create sink module */
        msf_api_sink_module_create(filter->msf_coreid, "Sink module", output_fifo_buffer,
                                   sink_pv_data, sizeof(sink_pv_data), &sink_module_id);
        msf_api_connect_pins(filter->msf_moduleid, sink_module_id, 0, 0);
    } else if (pad_is_corecrossing(pad)) {
        /* deep tunnel AND core-crossing */
        /* create sink module on the shared FIFO */
        msf_api_sink_module_create(filter->msf_coreid, "Sink module", filter->msf_sharedfifo,
                                   sink_pv_data, sizeof(sink_pv_data), &sink_module_id);
        msf_api_connect_pins(filter->msf_moduleid, sink_module_id, 0, 0);
    } else {
        /* deep tunnel AND no core-crossing */
        guint32 peer_module_id;
        /* get the module id of the peer MSF module */
        g_object_get(G_OBJECT(peerelement), "msf_moduleid", &peer_module_id, NULL);
        /* connect this module's output pin directly to the peer module */
        msf_api_connect_pins(filter->msf_moduleid, peer_module_id, 0, 0);
    }
}
```

#### **ARC HW Extensions**



#### Leveraging the ARC EIA Capabilities

Example: Colour Space Conversion
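
To make the workload concrete, here is a minimal C sketch of a per-pixel colour space conversion, the kind of inner loop that is a typical candidate for custom instructions. It assumes an 8-bit YCbCr (BT.601, limited range) to RGB conversion using a widely used fixed-point approximation; the slides do not state which conversion is accelerated or how the ARC extension implements it, so the function names and coefficients are assumptions for illustration.

```c
#include <stdint.h>

typedef struct {
    uint8_t r, g, b;
} rgb_t;

/* Clamp an intermediate result to the 0..255 range of an 8-bit channel. */
static uint8_t clamp_u8(int32_t v)
{
    if (v < 0)
        return 0;
    if (v > 255)
        return 255;
    return (uint8_t)v;
}

/* Assumed conversion: BT.601 limited-range YCbCr to RGB with
 * coefficients scaled by 256 (fixed-point, no floating point). */
static rgb_t ycbcr_to_rgb(uint8_t y, uint8_t cb, uint8_t cr)
{
    int32_t c = (int32_t)y  - 16;
    int32_t d = (int32_t)cb - 128;
    int32_t e = (int32_t)cr - 128;
    rgb_t out;

    /* +128 rounds to nearest before the division by 256. */
    out.r = clamp_u8((298 * c           + 409 * e + 128) / 256);
    out.g = clamp_u8((298 * c - 100 * d - 208 * e + 128) / 256);
    out.b = clamp_u8((298 * c + 516 * d           + 128) / 256);
    return out;
}
```

Each pixel costs three subtractions, several multiply-accumulates, and three clamps. Executed for every pixel of every frame, that fixed pattern is what makes the operation attractive to fuse into a dedicated instruction.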






#### **Optimizing Dalvik VM**








| Metric           | CoreMark | CaffeineMark |
|------------------|----------|--------------|
| per MHz          | 1.9      | 4.9          |
| per mW           | 37       | 90           |
| per MHz per mm²  | 14       | 35           |

Without L2 cache. Measurements were done on a 50 MHz FPGA; results exclude performance gains from hardware extensions.



## Optimizing Hardware Custom Instructions & Prefetching



#### **Linux kernel + ARC HW optimizations**



#### **Conclusions**

- There are more markets for Android than high-end smartphones
- More optimizations are possible than relying on Moore's law for GHz multi-cores
- Optimize performance / mW & performance / area
- Sweet spot: "heterogeneous, HW accelerated multi-core"
  - Mix of CPU, DSP, and dedicated HW
  - Highly optimized platform infrastructure SW hides heterogeneous complexities
- A 'simple' ARC processor with a SW-optimized Dalvik VM performs as well as or better than others, thanks to careful SW optimizations and the use of simple HW acceleration
  - Custom instructions tailored for specific tasks
  - Prefetcher instead of a general-purpose 2nd-level cache
  - DSP more efficient in audio processing than CPU






