BeagleBoard/GSoC/2021ProposalGPGPU

From eLinux.org

Latest revision as of 11:45, 12 April 2021


ProposalTemplate


About Student: Steven Schuerstedt
Mentors: Hunyue Yau
Code: current sample code: https://github.com/StevenSchuerstedt/GPGPU_with_OpenGL
Wiki: https://elinux.org/BeagleBoard/GSoC/2021ProposalGPGPU
GSoC: https://elinux.org/BeagleBoard/GSoC/Ideas-2021#GPGPU_with_GLES

Status

This project is currently just a proposal.

Proposal

I have completed the requirements on the ideas page. ARM cross compiling pull request: https://github.com/jadonk/gsoc-application/pull/153

About you

IRC: steven100
Github: https://github.com/StevenSchuerstedt
School: Karlsruhe Institute of Technology
Country: Germany
Primary language: German, English
Typical work hours: 5AM - 3PM US Eastern
Previous GSoC participation: I love the idea of open source and especially open hardware. First time participant.

About your project

Project name: GPGPU with OpenGL ES

Description

The BeagleBoard's ARM Cortex-A8 processor comes with an integrated graphics accelerator from PowerVR (SGX530, 544 or 550). As the name implies, this chip is built mainly for graphics rendering, but time has shown that many other applications profit from the parallel nature of graphics chips, such as deep learning, bitcoin mining or analyzing DNA sequences. This is called GPGPU (general-purpose computing on graphics processing units) and is usually done with APIs like OpenCL or CUDA. The PowerVR SGX only supports the OpenGL ES 2.0 specification (there also exists a proprietary OpenCL driver from Imagination Technologies, see https://university.imgtec.com/fun-with-beagle-video/). This API is heavily targeted towards graphics rendering, but it can also be exploited for general-purpose computations. The goal of this project is to show how to use the mostly unused graphics accelerator for general-purpose computations through the OpenGL ES API. To that end I will create samples and a tutorial that show how to use the GPU, and I will measure the timing difference between running a computation on the CPU and on the GPU. Due to the limited nature of OpenGL ES 2.0, its best fit for GPGPU is image processing. The samples and tutorial target the GPUs of all BeagleBoards (SGX530, 544, 550, ...), so I will research the subtle differences between them and their capabilities in terms of supported texture targets, texture formats and so on.
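As a rough illustration of the CPU side of such a timing comparison, the following sketch shows a plain matrix multiplication together with a small timing helper. The names matmul_cpu and time_ms are made up for this sketch and are not part of the sample repository; the eventual benchmark may look different.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Row-major n x n matrix product C = A * B, computed on the CPU.
// Hypothetical baseline, not taken from the sample repository.
std::vector<float> matmul_cpu(const std::vector<float>& a,
                              const std::vector<float>& b,
                              std::size_t n) {
    assert(a.size() == n * n && b.size() == n * n);
    std::vector<float> c(n * n, 0.0f);
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t k = 0; k < n; ++k) {
            const float aik = a[i * n + k];
            for (std::size_t j = 0; j < n; ++j) {
                c[i * n + j] += aik * b[k * n + j];
            }
        }
    }
    return c;
}

// Runs `work` once and returns the elapsed wall-clock time in milliseconds.
template <typename F>
double time_ms(F&& work) {
    const auto start = std::chrono::steady_clock::now();
    work();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

For a fair comparison, the timed region on the GPU side would presumably have to include the texture upload and the read-back with glReadPixels, since the read-back only returns once rendering has finished.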

Implementation:

The first part of the implementation is to get the GPU drivers up and running and to create an EGL rendering context for offscreen rendering with OpenGL ES 2.0. Hunyue Yau is willing to help me with that. In the next part I will use OpenGL ES 2.0 to access the GPU of the BeagleBoard and run the sample programs.

OpenGL ES 2.0 is a lightweight API targeted towards embedded devices. It is roughly a subset of desktop OpenGL and, being an old specification, does not support many features of modern OpenGL.

The important difference to modern OpenGL is that compute shaders are not supported. This means the computation cannot be divided into work-groups, so there is also no shared memory. Work-groups are a way to separate a computation into smaller chunks, and every work-group has access to very fast shared memory. This shared memory can be used to accelerate computations even further and is standard practice in OpenCL / CUDA. In OpenGL ES 2.0, on the other hand, no such work distribution is possible, so every work-item is independent of all others. Memory barriers can still be simulated with multiple rendering passes, which synchronize the computations when needed.
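The idea of using rendering passes as synchronization points can be sketched on the CPU. In this sketch, which is only an illustration and uses made-up names, each call to reduction_pass stands for one rendering pass in which every output element is computed independently, and the next pass only starts once the previous one is complete.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One "rendering pass": every output element is computed independently from
// the read-only input, like one fragment shader invocation per output texel.
std::vector<float> reduction_pass(const std::vector<float>& input) {
    const std::size_t out_size = (input.size() + 1) / 2;
    std::vector<float> output(out_size);
    for (std::size_t i = 0; i < out_size; ++i) {
        const float left = input[2 * i];
        const float right =
            (2 * i + 1 < input.size()) ? input[2 * i + 1] : 0.0f;
        output[i] = left + right;
    }
    return output;
}

// Sums a vector by chaining passes. The end of each pass acts as the barrier:
// pass k + 1 only reads data that pass k has completely written.
float reduce_sum_multipass(std::vector<float> data) {
    assert(!data.empty());
    while (data.size() > 1) {
        data = reduction_pass(data);  // output of one pass is the next input
    }
    return data[0];
}
```

On the GPU the same pattern would presumably be realized by alternating between two textures, rendering from one into the other in each pass.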

The precision of the texture data is also limited. As far as I can tell from the OpenGL ES 2.0 specification, only RGBA4 (4 bits per channel) is guaranteed to be supported. However, I found a datasheet with information about the available texture formats for the SGX530 (?), which says that the SGX530 supports RGBA8 (8 bits per channel). I'm still looking for datasheets for the SGX550 and SGX544. Since this kind of information is hard to find, it is probably best to simply test what runs on which device. The implementation of the samples could then differ depending on the specific GPU the BeagleBoard uses. In the project I would like to clarify and test which formats run on which device.
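To get a feeling for what 4 versus 8 bits per channel means, the following sketch quantizes a value from [0, 1] to an unsigned normalized channel and reports the worst-case rounding error. The function names are hypothetical, and the sketch assumes the usual round-to-nearest conversion; actual hardware behavior would have to be measured.

```cpp
#include <cassert>
#include <cmath>

// Stores a value from [0, 1] in an unsigned normalized channel with `bits`
// bits and returns the value that would be read back (assumes rounding to
// the nearest representable level).
float quantize_unorm(float value, int bits) {
    assert(bits >= 1 && bits <= 16);
    assert(value >= 0.0f && value <= 1.0f);
    const float levels = static_cast<float>((1 << bits) - 1);
    return std::round(value * levels) / levels;
}

// Largest possible rounding error for a channel with `bits` bits.
float max_quantization_error(int bits) {
    assert(bits >= 1 && bits <= 16);
    return 0.5f / static_cast<float>((1 << bits) - 1);
}
```

With 4 bits there are only 16 distinct levels per channel, with 8 bits there are 256, so the worst-case error shrinks from about 0.033 to about 0.002.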

In general, only the GL_RGBA and GL_RGB formats are supported as color-renderable formats (see OpenGL ES 2.0 specification, section 4.4.5). This makes image processing a favorable way to use the SGX for GPGPU, and convolution is a good example of that. One could also store 32-bit floating point values in textures by splitting their bit pattern across the color channels. In the project I will provide sample code for this and measure whether it is efficient.
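On the CPU side, splitting a 32-bit float into four bytes could look like the following sketch. The function names are made up for illustration, and the byte order simply follows the host memory layout. The shader-side reconstruction is not shown; since the shading language of OpenGL ES 2.0 has no bitwise operators, it would have to be done with arithmetic on the channel values.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <cstring>

static_assert(sizeof(float) == 4, "this sketch assumes a 32-bit float");

// Splits the bit pattern of a 32-bit float into four bytes, one per channel
// (R, G, B, A) of a single RGBA8 texel. Byte order is that of the host.
std::array<std::uint8_t, 4> pack_float_rgba(float value) {
    std::array<std::uint8_t, 4> texel{};
    std::memcpy(texel.data(), &value, sizeof(value));
    return texel;
}

// Reassembles the float from the four bytes read back from the GPU.
float unpack_float_rgba(const std::array<std::uint8_t, 4>& texel) {
    float value = 0.0f;
    std::memcpy(&value, texel.data(), sizeof(value));
    return value;
}
```

The price of this mapping is that one texel now holds a single value instead of four.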


I provide a first example of how to add two vectors using OpenGL (https://github.com/StevenSchuerstedt/GPGPU_with_OpenGL). I will use this as a starting point for the project and test whether it runs on the BeagleBoard GPU. I will also research how to map the data when higher precision than natively supported is required: when the 8 bits of a single color channel are not enough, a value needs to be spread over several color channels, at the cost of more memory. The next samples will be matrix multiplication and convolution. The convolution sample takes an image as input (the good old Lena test image ;)) and convolves it with a filter such as Sobel to detect edges. The output will be the filtered image. I will write a detailed tutorial about all my results and how to apply them to different problems.
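A CPU reference for the planned convolution sample could look like the sketch below, which computes the Sobel gradient magnitude of a grayscale image. It is only meant as a baseline to check GPU results and timings against; the function name and the border handling (border pixels are left at 0) are assumptions of this sketch.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Gradient magnitude of a grayscale image (row-major, width * height values)
// using the 3x3 Sobel kernels. Border pixels are left at 0. The kernels are
// applied without flipping, which only changes the sign of gx and gy and
// therefore not the magnitude.
std::vector<float> sobel_cpu(const std::vector<float>& image,
                             std::size_t width, std::size_t height) {
    assert(image.size() == width * height);
    static const int kx[3][3] = {{-1, 0, 1}, {-2, 0, 2}, {-1, 0, 1}};
    static const int ky[3][3] = {{-1, -2, -1}, {0, 0, 0}, {1, 2, 1}};
    std::vector<float> out(image.size(), 0.0f);
    for (std::size_t y = 1; y + 1 < height; ++y) {
        for (std::size_t x = 1; x + 1 < width; ++x) {
            float gx = 0.0f;
            float gy = 0.0f;
            for (std::size_t j = 0; j < 3; ++j) {
                for (std::size_t i = 0; i < 3; ++i) {
                    const float p = image[(y + j - 1) * width + (x + i - 1)];
                    gx += static_cast<float>(kx[j][i]) * p;
                    gy += static_cast<float>(ky[j][i]) * p;
                }
            }
            out[y * width + x] = std::sqrt(gx * gx + gy * gy);
        }
    }
    return out;
}
```

In the GPU version, the two inner loops would move into the fragment shader, which would run once per output pixel.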

Code Example:

On https://github.com/StevenSchuerstedt/GPGPU_with_OpenGL I provide a first example of how to use OpenGL for general-purpose computations. This example adds two vectors of size N.

Architecture of sample program:

  • GPGPU_with_OpenGL.cpp

main code to setup data on CPU, copy to GPU and run rendering

  • shader.cpp / shader.h

helper class to handle shaders

  • gpgpu.vert

vertex transformation with orthogonal projection matrix

  • gpgpu.frag

actual computation on interpolated data from the vertex shader

The program creates two vectors of size N and fills them with random floating point values. The vectors are then transferred to the GPU as OpenGL textures. This is the most important step, since it is crucial to find a good mapping between the data of the problem one tries to solve and the way this data is laid out and accessed on the GPU. In this simple example the mapping is straightforward. I use the GL_TEXTURE_2D texture target and GL_RGBA8 as internal texture format. This gives the following mapping for a vector of size 16:

Vector on CPU:

 [x1, x2, x3, x4,...x16]

Texture on GPU (size: 2x2, with 4 entries per texel, 2*2*4 = 16):

   R   G   B   A     R   G   B   A
 [{x1, x2, x3, x4}, {x5, x6, x7, x8}]
 [{x9,         ..}, {        ...x16}]
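The mapping above can also be written down as index arithmetic. The following sketch computes the texture size for N values and the texel and channel an element ends up in; the struct and function names are invented for this illustration.

```cpp
#include <cassert>
#include <cstddef>

struct TexelAddress {
    std::size_t row;      // texel row in the texture
    std::size_t column;   // texel column in the texture
    std::size_t channel;  // 0 = R, 1 = G, 2 = B, 3 = A
};

// Side length of the smallest square RGBA texture that holds n values.
std::size_t texture_size_for(std::size_t n) {
    const std::size_t texels = (n + 3) / 4;  // four values per texel
    std::size_t side = 0;
    while (side * side < texels) {
        ++side;
    }
    return side;
}

// Position of vector element `index` (0-based) in a tex_size x tex_size
// texture, filled row by row.
TexelAddress texel_address(std::size_t index, std::size_t tex_size) {
    assert(tex_size > 0 && index < 4 * tex_size * tex_size);
    const std::size_t texel = index / 4;
    return {texel / tex_size, texel % tex_size, index % 4};
}
```

For the vector of size 16, texture_size_for(16) gives 2, and x9 (index 8) lands in the R channel of the first texel in the second row, as in the diagram.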

The memory for the textures is allocated on the GPU with:

 glTexImage2D(texture_target, 0, internal_format, texSize, texSize, 0, texture_format, data_format, 0);

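internal_format specifies the format OpenGL will use internally for the texture, for example GL_RGBA8 for 8 bits per color channel. Note that glTexImage2D in core OpenGL ES 2.0 only accepts unsized internal formats such as GL_RGBA, so the sized GL_RGBA8 used in the desktop sample will probably have to be adapted.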

To copy data from the CPU to the texture allocated on the GPU we will use:

 glTexSubImage2D(texture_target, 0, 0, 0, texSize, texSize, texture_format, data_format, cpu_data1);

Here cpu_data1 is a pointer to a float array.

Three textures are used in total: two for the input vectors (read only) and one for the output (write only), which is attached to an offscreen framebuffer.

In order to invoke the fragment shader, which performs the actual computation, we need to draw something, otherwise no shader will be executed. In this simple example there is a 1:1 mapping from vector entries to texture coordinates, so we simply draw a quad that covers the complete viewport.
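What drawing the quad triggers can be emulated on the CPU. In this sketch, which uses made-up names and does not reproduce the actual shader code, render_pass_add plays the role of the rasterizer, visiting every pixel of the output once, and fragment_add plays the role of the fragment shader.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

using Texel = std::array<float, 4>;  // one RGBA texel

// Nearest-neighbour texture lookup with normalized coordinates in [0, 1).
const Texel& sample_nearest(const std::vector<Texel>& texture,
                            std::size_t tex_size, float u, float v) {
    const std::size_t x = static_cast<std::size_t>(u * tex_size);
    const std::size_t y = static_cast<std::size_t>(v * tex_size);
    return texture[y * tex_size + x];
}

// Stand-in for the fragment shader: adds the two input texels channel-wise.
Texel fragment_add(const Texel& a, const Texel& b) {
    return {a[0] + b[0], a[1] + b[1], a[2] + b[2], a[3] + b[3]};
}

// Stand-in for drawing the quad: one fragment per pixel of the output.
std::vector<Texel> render_pass_add(const std::vector<Texel>& tex_a,
                                   const std::vector<Texel>& tex_b,
                                   std::size_t tex_size) {
    assert(tex_a.size() == tex_size * tex_size);
    assert(tex_b.size() == tex_a.size());
    std::vector<Texel> target(tex_a.size());
    for (std::size_t y = 0; y < tex_size; ++y) {
        for (std::size_t x = 0; x < tex_size; ++x) {
            // Interpolated texture coordinate at the pixel centre.
            const float u = (x + 0.5f) / tex_size;
            const float v = (y + 0.5f) / tex_size;
            target[y * tex_size + x] =
                fragment_add(sample_nearest(tex_a, tex_size, u, v),
                             sample_nearest(tex_b, tex_size, u, v));
        }
    }
    return target;
}
```

Because the output has the same size as the input textures, every fragment reads exactly the texel at its own position in both inputs, which is the 1:1 mapping described above.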

To get the data back from the GPU to the CPU we will use:

 glReadPixels(0, 0, texSize, texSize, texture_format, data_format, result);

This reads the current values from the framebuffer and copies them into the result vector.

Because only 8 bits are available per value, the result will be off by a few bits. In the project I will get rid of this error and find a better mapping that allows an exact addition. This will require a bigger texture.
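One possible direction, shown here only as a sketch under my own assumptions, is to spread each value over two 8-bit channels as 16-bit fixed point. This does not make the result exact, but it shrinks the rounding error considerably at the cost of twice the texture memory. The names and the choice of 16 bits are hypothetical; the mapping finally used in the project may differ.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

struct TwoChannel {
    std::uint8_t high;  // upper 8 bits, e.g. stored in R
    std::uint8_t low;   // lower 8 bits, e.g. stored in G
};

// Encodes a value from [0, 1] as 16-bit fixed point over two 8-bit channels.
TwoChannel encode16(float value) {
    assert(value >= 0.0f && value <= 1.0f);
    const auto fixed =
        static_cast<std::uint32_t>(std::lround(value * 65535.0f));
    return {static_cast<std::uint8_t>(fixed >> 8),
            static_cast<std::uint8_t>(fixed & 0xFFu)};
}

// Reassembles the value from the two channels.
float decode16(TwoChannel t) {
    return static_cast<float>((t.high << 8) | t.low) / 65535.0f;
}

// Single-channel 8-bit round trip, for comparison.
float roundtrip8(float value) {
    assert(value >= 0.0f && value <= 1.0f);
    return static_cast<float>(std::lround(value * 255.0f)) / 255.0f;
}
```

Comparing decode16(encode16(v)) and roundtrip8(v) against the original v gives a quick estimate of how much precision the extra channel buys.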

After GSoC:

I would like to extend the approach and provide a library that lets BeagleBoard users run a predefined set of computations on the GPU. For example, the user provides an image and a filter kernel, and the library automatically performs the convolution on the GPU. This would make it even easier for people to use the GPU, although only for predefined use cases, so I think it is a good addition to the tutorial.

Resources:

reference pages for OpenGL ES 2.0

common profile specification, very detailed information

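  • https://www.datasheetarchive.com/pdf/download.php?id=4f76e136e3a90df88cee468a5c19b37dc44774&type=P&term=PowerVR%2520SGX530

datasheet for the PowerVR SGX 530 (?), containing detailed information about which textures / texture formats are supported etc.

  • https://www.seas.upenn.edu/~cis565/fbo.htm

a good overview of how to do GPGPU with OpenGL, but it needs to be adapted for OpenGL ES 2.0

  • https://mkonrad.net/projects/mastersthesis_mobile_gpgpu.html

master's thesis about GPGPU on mobile devices, also has a chapter about OpenGL ES 2.0 and some sample code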

Timeline

Provide a development timeline with a milestone each of the 11 weeks and any pre-work. (A realistic timeline is critical to our selection process.)

Date                     Milestone
Mar 29                   Applications open, students register with GSoC, work on proposal with mentors
Apr 13                   Proposal complete, submitted to https://summerofcode.withgoogle.com
May 17                   Proposal accepted or rejected
Jun 07                   Pre-work: set up OpenGL ES drivers for the BeagleBoard, coding officially begins!
Jun 17                   Validate OpenGL calls, add two vectors together, introductory YouTube video
Jun 24                   Set up eLinux page for the GPGPU tutorial, validate different texture formats
Jun 30                   Create matrix multiplication sample program
Jul 12, 18:00 UTC        Create convolution sample program (separable and non-separable convolution), mentors and students can begin submitting Phase 1 evaluations
Jul 16, 18:00 UTC        Phase 1 evaluation deadline
Jul 23                   Measure timings between CPU / GPU
Jul 30                   Finish tutorial on eLinux on how to do GPGPU
Aug 06                   Clean up code, add one more sample if time allows (vector reduction, compute histogram, ...)
Aug 10                   Finish everything, completion YouTube video
Aug 16 - 26, 18:00 UTC   Final week: students submit their final work product and their final mentor evaluation
Aug 23 - 30, 18:00 UTC   Mentors submit final student evaluations

Experience and approach

I have solid experience in programming, computer graphics and mathematics. I developed a 2D platformer game with C++ and OpenGL (StevieJump) and a Monte Carlo path tracer with C++ (StevieTrace), and I'm very interested in computer architecture and embedded systems. I followed Ben Eater's excellent YouTube series to build an 8-bit breadboard computer (8-Bit). I currently work as a C++ / OpenGL software developer at my university. I have experience with OpenCL and took several GPGPU courses at my university.

Contingency

I have been stuck many times in my life, especially with programming-related tasks. Programming and computer science can sometimes be a very unforgiving and frustrating experience. There is no easy way around this, so I will just keep trying and do my best. There is no shame in failure, only in giving up, so if I don't give up I will eventually succeed. If I really get stuck, I take a break and do some outdoor exercise, which always helps.

Benefit

Enable more people to use the GPU on a beagleboard. Accelerate computations. Free up the main processor to do other stuff.

 The benefit comes from being able to offload the processing from the A8 core and leave it free for doing other stuff.

~Dr. Iain Hunter

Misc

Link to PR: https://github.com/jadonk/gsoc-application/pull/153