[TYPES] CPS and accumulators

Giuseppe Maggiore giuseppemag at gmail.com
Thu Aug 20 02:24:12 EDT 2009


Of course what you cited from my first mail was a bit hyperbolic :)

To be precise, the thesis focuses on optimizing certain pieces of code that
work on sequences. The idea is roughly to take functions that manipulate
lists and turn them into equivalent code that runs on a GPU. I will use
DirectX compute shaders called from F#, since CUDA works only on nVidia
hardware and, with Larrabee on the horizon, I have no intention of being
locked to a single vendor's hardware.
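
To give a feel for the kind of transformation I have in mind, here is a tiny
illustrative sketch (made-up names, not code from the thesis): a function over
a list would be re-expressed over a flat array, which is the form a compute
shader dispatch can actually consume.

  // A CPU-side function over a list...
  let brighten (pixels : float list) =
      pixels |> List.map (fun p -> min 1.0 (p * 1.2))

  // ...would be turned into the equivalent computation over a flat array,
  // which is what a DirectX compute shader can process element by element.
  let brightenForGpu (pixels : float[]) =
      pixels |> Array.map (fun p -> min 1.0 (p * 1.2))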

The idea is to avoid trying to compile entire programs to the GPU, since
that would be crazy: GPUs are SIMD machines, so there is no point in mapping
independent pieces of code to different cores, because as soon as their
instructions diverge they just get executed sequentially. Whenever a
sequence is processed recursively, though, there is some hope of recognizing
the pattern and translating the code to the GPU (provided it is simple
enough, otherwise it is not worth it): essentially, spotting all those
functions that somehow "look" like map, filter, sort, and so on. This
approach is vastly simpler than "writing a compiler for a GPU", and given
the current state of the hardware it also has a better chance of producing
something that works reasonably well (compiling an entire functional
language to the GPU would mostly generate slower code).
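
Concretely, the sort of pattern the analysis would look for is something like
the following (again a made-up example, just to illustrate the idea): a
hand-written recursion over a list whose body is independent for each element,
so it is really a map in disguise.

  // A hand-written recursive traversal...
  let rec scaleAll k xs =
      match xs with
      | []        -> []
      | x :: rest -> (k * x) :: scaleAll k rest

  // ...is structurally equivalent to
  let scaleAll' k xs = List.map (fun x -> k * x) xs

  // Once the function is recognized as a map, only the per-element body
  // (fun x -> k * x) needs to be compiled to a compute-shader kernel and
  // applied to every element in parallel.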

PS: I have done quite a lot of computer vision on GPUs with a local research
group to get the hang of the process (I have done shaders since, well,
ever): the amount of time you could save by not using those awful and
unreadable GPU languages would be immense, and you would still be able to
tap into the power of the GPU even without all the fine-tuning!

On Wed, Aug 19, 2009 at 10:35 PM, Cunningham, David <
david.cunningham04 at imperial.ac.uk> wrote:

> > Luckily all is well now, since I have realized that functional languages
> > are actually one of the best imaginable tools for writing the highly
> > parallel code that modern graphics cards (GPUs) execute.
>
> I'm not sure this is particularly true, depending on what you mean by
> functional languages.  I take it you are planning to compile a toy
> functional language to OpenCL or CUDA?  You will have to work very hard to
> avoid performance overheads so vast that they will make the GPU look like a
> CPU.  The best I can imagine you could achieve is an eager language without
> typical functional data structures like lists and trees.  Arrays are the name
> of the game on the GPU because memory accesses have to be regular.  You
> would have to implement general recursion yourself using a stack but this is
> tricky because there are very strict memory overheads (as little as 128
> bytes per 'thread' on a fully-loaded G200) which if exceeded will kill
> performance.  You'd have to put the stack in main memory but then there are
> latency issues.  You would have to implement functions-as-values by inlining
> or using some sort of big switch statement, but keeping the environment
> alive becomes very difficult with no 'heap'.  You would have to invent your
> own memory allocator and perhaps garbage collector.  There is no cache; you
> can make your own in software, but this will probably kill performance.
>  Typically GPU algorithms are carefully hand-tuned to make sure they fit in
> the various bits of memory, and careful compromises to the algorithms are
> made to ensure this is possible.  Any high level language for the GPU must
> have enough performance transparency to allow this kind of activity.
>
> I highly recommend writing some parallel algorithms for the GPU and getting
> some experience benchmarking and hand-tuning them to appreciate how
> difficult it can be to exceed the performance of the CPU, even for
> data-parallel problems like dense matrix multiplication and k-means.




-- 
Giuseppe Maggiore
Microsoft Student Partner - Ca' Foscari University of Venice, Italy
B.Sc. in Computer Science
Tel. +393319040031
Website: http://xnalearners.spaces.live.com

