
Kurt Shuler is the VP of marketing at Arteris.

He has held senior roles at Intel, Texas Instruments, ARC International and two startups, Virtio and Tenison. Before working in high technology, Kurt flew as an air commando in the U.S. Air Force Special Operations Forces.

Kurt earned a B.S. in Aeronautical Engineering from the U.S. Air Force Academy and an MBA from the MIT Sloan School of Management.

Arteris Connected Blog


Putting the “Heterogeneous” in the HSA Foundation


In September’s article, SMP, Asymmetric Multiprocessing, and the HSA Foundation, I explained why symmetric multiprocessing (SMP) architectures have been popular in PC and server markets, and why heterogeneous or asymmetric multiprocessing (AMP) has been the norm in mobility and consumer electronics markets. I also explained the trends that are leading PC and server markets to adopt heterogeneous architectures, and introduced the HSA Foundation’s goal of making heterogeneous core chips easy to program.

[Figure: HSA parallel workloads]

In this month’s article I will introduce the HSA Solution Stack and give a longer-term vision of how HSA can scale beyond CPU-GPU computing. (Hint: The hardware/SoC interconnect fabric is a critical ingredient in this!)

How heterogeneous programming is done today

In its initial stages, HSA addresses the need for easy software programming of GPUs to take advantage of their unique capability to crunch parallel workloads much more efficiently than x86 or ARM CPUs. The graphic above summarizes this concept.

Today, CPUs and GPUs do not share a common view of system memory, requiring an application to explicitly copy data between the two devices. In addition, an application running on the CPU that wants to add work to the GPU’s queue must execute system calls that communicate through the CPU OS’s device driver stack, and then communicate with a separate scheduler that manages the GPU’s work.  This adds significant runtime latency, in addition to being very difficult to program.
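To make that cost concrete, here is a toy Python model of today's split-memory dispatch path. All names (DeviceDriver, copy_to_device, enqueue_kernel) are invented for illustration and do not correspond to any real driver API; the point is that every kernel launch pays for an explicit copy plus a hop through a driver-managed queue:

```python
# Toy model of CPU -> GPU dispatch without shared virtual memory.
# Illustrative only; not a real driver interface.

class DeviceDriver:
    """Stand-in for the OS driver stack that mediates CPU-to-GPU work."""
    def __init__(self):
        self.gpu_memory = {}   # the GPU's private view of memory
        self.gpu_queue = []    # drained by a separate GPU-side scheduler

    def copy_to_device(self, name, data):
        # Explicit copy: CPU and GPU do not share a view of system memory.
        self.gpu_memory[name] = list(data)

    def enqueue_kernel(self, kernel, arg_name):
        # Models the system call into the driver stack; a separate
        # scheduler manages this queue on the GPU's behalf.
        self.gpu_queue.append((kernel, arg_name))

    def run(self):
        results = []
        while self.gpu_queue:
            kernel, arg_name = self.gpu_queue.pop(0)
            results.append(kernel(self.gpu_memory[arg_name]))
        return results

driver = DeviceDriver()
host_data = [1, 2, 3, 4]
driver.copy_to_device("buf0", host_data)             # latency cost #1: the copy
driver.enqueue_kernel(lambda buf: sum(buf), "buf0")  # latency cost #2: driver hop
print(driver.run())  # [10]
```

With HSA's shared view of memory, both the copy and the driver round trip drop out of this picture.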

HSAIL: Heterogeneous programming the HSA way

To avoid this situation and enable easier programming, HSA will allow developers to program at a higher abstraction level using mainstream programming languages, with the addition of libraries targeting HSA. The following is a high-level view of the HSA Solution Stack:

[Figure: HSA Solution Stack]

The key to enabling one language for heterogeneous core programming is to have an intermediate runtime layer that abstracts hardware specifics away from the developer, leaving the hardware-specific coding to be done once by the hardware vendor or IP provider. In HSA, the top of this intermediate layer is the HSA Intermediate Language, or “HSAIL”.


The diagram below shows the HSAIL and its path through the HSA runtime stack:

[Figure: HSA runtime stack]

HSAIL is created by compiling a high-level language like C++ with the HSA compilation stack. HSA’s compilation stack is based on the LLVM infrastructure (http://www.llvm.org), which is also used in OpenCL (http://www.khronos.org/opencl/).

Creation of HSAIL can occur before or during runtime: the OpenCL Runtime includes the compiler stack and is called at runtime to execute a program that is already in data-parallel form. Alternatively, Microsoft’s C++ AMP (C++ Accelerated Massive Parallelism) uses the compiler stack during program compilation rather than at execution: the C++ AMP compiler extracts data-parallel code sections and runs them through the HSA compiler stack, while passing non-parallel code through the normal compilation path.
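As a rough sketch of that compile-time split (the section markers and compiler names below are invented, not the actual C++ AMP internals), the routing decision looks something like this:

```python
# Toy sketch of the two compilation paths described above.
# hsa_compile / native_compile are stand-ins, not real tools.

def hsa_compile(section):
    # Stand-in for the LLVM-based HSA compilation stack producing HSAIL.
    return f"HSAIL({section})"

def native_compile(section):
    # Normal host compilation path for non-parallel code.
    return f"x86({section})"

def compile_program(sections):
    """C++ AMP-style split at build time: data-parallel sections go
    through the HSA stack, everything else through the normal compiler."""
    return [hsa_compile(body) if parallel else native_compile(body)
            for parallel, body in sections]

program = [(False, "setup"), (True, "vector_add"), (False, "teardown")]
print(compile_program(program))
# ['x86(setup)', 'HSAIL(vector_add)', 'x86(teardown)']
```

The OpenCL path runs the same lowering step, just deferred to runtime instead of build time.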

The diagram below shows the HSA Compilation Stack, where programming code is compiled into HSAIL using the LLVM compilation infrastructure:

[Figure: HSA Compilation Stack]

The hardware-specific HSA Finalizer

A key role is played by the hardware-specific “finalizer” which converts HSAIL to the computing unit’s native instruction set. Hardware and IP vendors are responsible for creating finalizers that support their hardware. The finalizer is lightweight and can be run at compile time, installation time or run time depending on requirements.

The finalizer is the point at which the specifics of different heterogeneous computing units are addressed. Initial HSA implementations will most likely support GPU compute with finalizers from GPU vendor HSA members like AMD, Imagination and ARM. (And maybe even Qualcomm to support their Adreno graphics cores.)
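A hypothetical sketch of what a finalizer does, reduced to its essence: a per-vendor translation table from IL operations to native instructions. Both the IL ops and the vendor ISAs below are made up for illustration; real finalizers also handle register allocation, scheduling, and other target details:

```python
# Table-driven toy finalizer: imaginary HSAIL-like ops -> made-up native ISAs.

VENDOR_ISA = {
    # vendor -> {IL op -> native mnemonic}; entirely illustrative
    "gpu_vendor_a": {"ld": "V_LOAD", "add": "V_ADD", "st": "V_STORE"},
    "dsp_vendor_b": {"ld": "D_LD",   "add": "D_MAC", "st": "D_ST"},
}

def finalize(hsail_ops, vendor):
    """Convert IL ops to the target unit's native instructions.
    Lightweight, so it can run at compile, install, or run time."""
    isa = VENDOR_ISA[vendor]
    return [isa[op] for op in hsail_ops]

kernel_il = ["ld", "add", "st"]
print(finalize(kernel_il, "gpu_vendor_a"))  # ['V_LOAD', 'V_ADD', 'V_STORE']
print(finalize(kernel_il, "dsp_vendor_b"))  # ['D_LD', 'D_MAC', 'D_ST']
```

The same IL runs unchanged on either unit; only the vendor-supplied table differs, which is exactly the portability HSA is after.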

Heterogeneous: More than CPU and GPU

However, as discussed in last month’s article, many existing heterogeneous architectures have additional discrete processing units for functions like audio (digital signal processing or stream processing), image and video processing (SIMD frame processing), and security. As HSA matures, hardware and IP vendors creating these processing units may want to enable HSA programmability on their hardware by creating hardware-specific finalizers.

From dumb scheduling to smart scheduling

Having multiple heterogeneous processing units will complicate workload scheduling from a system perspective. The harsh reality is that existing workload-scheduling and OS-scheduling algorithms are relatively simple and generally take into account only local activity on a processing unit or a cluster of homogeneous processing units. (See the Linux Completely Fair Scheduler for one example of how scheduling is implemented: http://en.wikipedia.org/wiki/Completely_Fair_Scheduler.)

These algorithms take into account neither the traffic already coursing through the system nor the activity of other processing units. This lack of a global view virtually guarantees contention and stalling as processing units wait for access to precious system resources, especially the DRAM.

One way to enhance workload scheduling will be to probe existing runtime data flows at critical points throughout a system’s SoC interconnect fabric, and use this information to assign priorities to workloads, and workloads to processing units. As heterogeneous processing becomes the norm and more processing units are added to a system, this type of interconnect-assisted scheduling will be required.
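A minimal sketch of that idea, with invented numbers: if the interconnect fabric can report how loaded each unit's path to DRAM is, the scheduler can weigh those readings when placing a workload rather than scheduling blind:

```python
# Hedged sketch of interconnect-assisted scheduling: place work on the
# processing unit with the least-contended path to DRAM, using utilization
# figures that a real system would probe from the fabric. All data invented.

def pick_unit(units, link_utilization):
    """Choose the unit whose DRAM path shows the lowest probed load."""
    return min(units, key=lambda u: link_utilization[u])

units = ["cpu0", "gpu0", "dsp0"]
probed = {"cpu0": 0.80, "gpu0": 0.35, "dsp0": 0.60}  # fabric probe readings
print(pick_unit(units, probed))  # gpu0
```

A production scheduler would fold in many more signals (workload affinity, power state, queue depth), but the probe data from the interconnect is what gives it the global view the paragraph above says today's schedulers lack.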

In other words, the hardware interconnect is a key enabler to putting the heterogeneous into HSA.

Sources

Kyriazis, George (AMD). “Heterogeneous System Architecture: A Technical Review.” Whitepaper, HSA Foundation, August 2012.

HSA Solution Stack diagram is from http://developer.amd.com/Resources/hc/heterogeneous-systems-architecture/Pages/default.aspx.
