Solving Heterogeneous Programming Challenges with Python, Today

By James Reinders
June 30, 2022
In the fourth of a series of guest posts on heterogeneous computing, James Reinders considers how full heterogeneous programming can be realized with Python today.
You may be surprised how ready Python is for heterogeneous programming, and how easy it is to use today. Our first three articles about heterogeneous programming focused primarily on C++ as we ponder “how to enable programming in the face of an explosion of hardware diversity that is coming?” For a refresher on what motivates this question, check out the first installment.
When considering “how do we program a truly heterogeneous machine?,” broadly we need two things: (1) a way to learn at runtime about all the devices that are available to our application, and (2) a way to utilize the devices to help perform work for our application.
Since utilizing devices involves both data and computation, we are left with three key questions: (1) how do we learn, at runtime, what devices are available and select among them? (2) how do we manage data so it is where the computation needs it? and (3) how do we invoke computation on a chosen device?
When a program can do all three well, regardless of vendor and architecture, we have made possible open heterogeneous programming. By seeking such open approaches, we aim to increase application portability and reduce unnecessary barriers to using and supporting new hardware innovations.
First, we need to understand how to get parallelism and compiled code, because we won’t want to offload serial, interpreted code to our accelerator.
Numba is an open source, NumPy-aware optimizing just-in-time (JIT) compiler for Python developed by Anaconda. It uses the LLVM compiler to generate machine code from Python bytecode. Numba can compile a large subset of numerically focused Python, including many NumPy functions. Additionally, Numba supports automatic parallelization of loops, generation of GPU-accelerated code, and creation of Universal Functions (ufuncs) and C callbacks. Numba includes an auto-parallelizer that was contributed by Intel. The auto-parallelizer can be enabled by setting the parallel=True option in the @numba.jit decorator; it analyzes data-parallel code regions in the compiled function and schedules them for parallel execution.
There are two types of operations that Numba can automatically parallelize: implicitly data-parallel regions, such as NumPy array expressions, NumPy ufuncs, and NumPy reduction functions; and explicitly data-parallel loops that are specified using the numba.prange expression.
For example, consider a simple Python loop such as:
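The original snippet is not preserved here; a minimal sketch in the same spirit, using an element-wise vector add as the running example, might look like this:

```python
import numpy as np

def add(a, b, c):
    # Plain Python loop: interpreted and serial.
    for i in range(a.shape[0]):
        c[i] = a[i] + b[i]

a = np.random.rand(50_000_000).astype(np.float32)
b = np.random.rand(50_000_000).astype(np.float32)
c = np.empty_like(a)
add(a, b, c)
```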
Next – I found an easy 12X improvement (24.3 seconds to 1.9 seconds) even without offloading (even more with offloading).
We can make it explicitly parallel by changing the serial range (range) to a parallel range (prange) and adding an @njit decorator (njit = Numba JIT; it compiles a parallel version):
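A sketch of the parallel version, assuming the same vector-add loop as above:

```python
from numba import njit, prange
import numpy as np

@njit(parallel=True)
def add_parallel(a, b, c):
    # prange tells Numba these iterations are independent and may be
    # scheduled across all available CPU cores.
    for i in prange(a.shape[0]):
        c[i] = a[i] + b[i]

a = np.random.rand(50_000_000).astype(np.float32)
b = np.random.rand(50_000_000).astype(np.float32)
c = np.empty_like(a)
add_parallel(a, b, c)  # first call JIT-compiles; later calls run fast
```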
This dropped from 24.3 seconds to 1.9 seconds when I ran it. Results can easily be more or less depending on the parallelism available on a system. To try it yourself, start with ‘git clone https://github.com/oneapi-src/oneAPI-samples’ and then open the notebook AI-and-Analytics/Jupyter/Numba_DPPY_Essentials_training/Welcome.ipynb. An easy way to do that quickly is to get a free account on DevCloud.
The key to understanding how accelerators will help is overcoming the overhead of offloading. For many, the above ability to JIT and parallelize our code is more than enough. Over time, the above could evolve to automatically use tightly connected GPUs (on die, on package, coherently shared memory, etc.).
The techniques we cover in this article can be highly effective if and when our application has sufficient computation in it to overcome the costs of offloading and be worth the programming time.
Extending the prior example, we can use Numba Data Parallel Extensions (numba-dpex) to designate a kernel to be compiled and ready for offload. An essentially equivalent computation can be expressed as a kernel as follows (for more details, refer to the Jupyter notebook training):
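A minimal sketch of such a kernel, assuming the numba-dpex kernel API (the @kernel decorator and get_global_id helper) as documented at the time; the article's exact snippet is not reproduced here:

```python
import numba_dpex as dpex

@dpex.kernel
def add_kernel(a, b, c):
    # Each work item computes one element, indexed by its global id.
    i = dpex.get_global_id(0)
    c[i] = a[i] + b[i]
```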
The kernel code is compiled and parallelized, just as it was previously with @njit to run on the CPU, but this time it is ready for offload to a device. It is compiled into SPIR-V, which the runtime finishes mapping to a device when the kernel is submitted for execution. This gives us a vendor-agnostic solution for offload. It turns out that the first code snippet (using only @njit) can also be offloaded as-is, without writing a kernel explicitly.
The array arguments to the kernel can be NumPy arrays or USM arrays (an array type explicitly placed in Unified Shared Memory) depending on what we feel fits our programming needs best. Our choice will affect how we set up the data and invoke the kernels.
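For illustration, a USM-backed array can be created through dpctl's tensor module (a sketch; usm_type accepts "device", "shared", or "host"):

```python
import numpy as np
import dpctl.tensor as dpt

a_host = np.arange(1024, dtype=np.float32)
# Copy into a Unified Shared Memory, device-resident allocation.
a_usm = dpt.asarray(a_host, usm_type="device")
```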
Since SYCL can answer all three key questions we posed, the most consistent and versatile approach is to provide SYCL bindings for Python and use them directly. This is exactly what the open source Data Parallel Control library (dpctl: C and Python bindings for SYCL) has done. You can learn more from its GitHub docs and the paper “Interfacing SYCL and Python for XPU Programming.” These bindings enable Python programs to access SYCL devices, queues, and memory, and to execute Python array/tensor operations using SYCL resources. This avoids reinventing solutions, reduces how much we have to learn, and allows a high level of compatibility as well.
Connecting to a device is as simple as:
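A sketch using dpctl's documented default selector (the original snippet is not preserved; select_default_device is one documented entry point):

```python
import dpctl

# Ask the SYCL runtime for its default device: any vendor, any architecture.
device = dpctl.select_default_device()
device.print_device_info()
```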
Select any device – regardless of vendor.
The default device can be influenced with the environment variable SYCL_DEVICE_FILTER if we want to control device selection without changing this simple program. The dpctl library also supports programmatic controls to enumerate available devices and select one based on hardware properties.
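For example, a sketch using dpctl's device-enumeration helper:

```python
import dpctl

# Enumerate only GPU devices, from any backend or vendor.
for d in dpctl.get_devices(device_type="gpu"):
    print(d.name, "-", d.vendor)
```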
The kernel can be invoked (offloaded and run) on the device with a couple of lines of Python code:
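A sketch of the launch, assuming the numba-dpex launch syntax contemporary with this article (kernel[global_size, local_size] inside a dpctl.device_context); the kernel is repeated here so the snippet stands alone:

```python
import dpctl
import numba_dpex as dpex
import numpy as np

@dpex.kernel
def add_kernel(a, b, c):
    i = dpex.get_global_id(0)
    c[i] = a[i] + b[i]

a = np.arange(1024, dtype=np.float32)
b = np.arange(1024, dtype=np.float32)
c = np.empty_like(a)

# The context selects the device; the runtime handles the copies of the
# standard NumPy arrays to and from the device around the launch.
with dpctl.device_context("gpu"):
    add_kernel[a.size, dpex.DEFAULT_LOCAL_SIZE](a, b, c)
```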
Our use of device_context has the runtime do all the necessary data copies (our data was still in standard NumPy arrays) to make it all work. The dpctl library also supports the ability for us to allocate and manage USM memory for devices explicitly. That could be valuable when we get deep into optimization, but the simplicity of letting the runtime handle standard NumPy arrays is hard to beat.
Python coding style is easily supported by the synchronous mechanisms shown above. Asynchronous capabilities, and their advantages (reducing or hiding the latencies of data movement and kernel invocation), are also available if we are willing to change our Python code a bit more. That gives us the ability to launch kernels with less latency and to move data asynchronously to help maximize performance. To learn more about the asynchronous capabilities, see the dpctl gemv example.
CuPy is a reimplementation of a large subset of NumPy. The CuPy array library acts as a drop-in replacement that runs existing NumPy/SciPy code on NVIDIA CUDA or AMD ROCm platforms. The massive programming effort that such a reimplementation requires is a considerable barrier to supporting any new platform.
Such an approach also raises questions about supporting multiple vendors from the same Python program, because it does not address our three questions. For device selection, CuPy requires a CUDA-enabled GPU device to be available. For memory, it offers little direct control, but it does automatically perform memory pooling to reduce the number of calls to cudaMalloc. When offloading operations, it offers no control over which device is used and will fail if no CUDA-enabled GPU is present.
While this is indeed effective for CUDA GPU devices, our application can be more portable when we are open to addressing all three of the key questions. There is a strong need for portable, architecture-agnostic ways to write extensions.
Python programming in general is well suited for compute-follows-data, and using enabled routines is beautifully simple. The dpctl library supports a tensor array type that we associate with a specific device. In our program, if we cast our data to a device tensor (e.g., dpctl.tensor.asarray(data, device="gpu:0")), it will be associated with, and placed on, that device. Using a patched version of scikit-learn that recognizes these device tensors, the patched sklearn methods that involve such a tensor are automatically computed on the device. It is a great use of dynamic typing in Python to sense where the data is and direct the computation to be done there. Our Python code changes very little: only the lines where we recast our tensors to a device tensor. Based on experience thus far, we expect compute-follows-data methods to be the most popular model for Python users.
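A sketch of the pattern, assuming Intel Extension for Scikit-learn (sklearnex) as the “patched” scikit-learn and a GPU visible as "gpu:0" (both are assumptions, not spelled out above):

```python
import numpy as np
import dpctl.tensor as dpt
from sklearnex import patch_sklearn

patch_sklearn()                     # swap in the accelerated estimators
from sklearn.cluster import KMeans  # import after patching

x = np.random.rand(100_000, 2).astype(np.float32)
x_gpu = dpt.asarray(x, device="gpu:0")  # the only line that changes

# compute-follows-data: the fit runs on the device that holds x_gpu
model = KMeans(n_clusters=8).fit(x_gpu)
```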
compute-follows-data: Python dynamic typing allows computation to be directed, automatically, to where the data is – it is quite beautiful.
Python can be an instrument to embrace the power of hardware diversity and harness the impending Cambrian Explosion. Data Parallel Python (Numba combined with dpctl to connect us with SYCL) and compute-follows-data patched scikit-learn are worth considering because they are vendor and architecture agnostic.
Openness is driving vendor- and architecture-agnostic solutions that help us all.
While Numba offers great support for NumPy, we can consider what more can be done for SciPy and other Python needs in the future.
The fragmentation of array APIs in Python has generated interest in array-API standardization for Python (read a nice summary), driven by the desire to share workloads with devices other than the CPU. A standard array API goes a long way toward helping efforts like Numba and dpctl increase their scope and impact. NumPy and CuPy have embraced the array API, and work to adopt it is underway in both dpctl and PyTorch. As more libraries head this way, the task of supporting heterogeneous computing (accelerators of all types) becomes more tractable.
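For a feel of what coding against the standard looks like, here is a tiny sketch using NumPy's provisional array-API namespace (experimental as of NumPy 1.22; other conforming libraries expose the same functions):

```python
# Code written against the array-API namespace is portable across
# conforming array libraries (NumPy, CuPy, and, over time, others).
import numpy.array_api as xp

a = xp.asarray([1.0, 2.0, 3.0])
total = xp.sum(a)
print(float(total))
```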
The simple use of dpctl.device_context is not sufficient in more sophisticated Python code with multiple threads or asynchronous tasks (see the github issue). It is likely better to pursue a compute-follows-data policy, at least in more complex threaded Python code; it may become the preferred option over the device_context style of programming.
Python is ready today for open, multivendor, multiarchitecture heterogeneous programming. This enables nearly limitless control, which in turn can be buried in easy-to-use Python code everyone can use. I have no doubt that we will continue to see exciting developments in support for heterogeneous programming in Python, driven by the feedback we gain through usage.
For learning, there is nothing better than jumping in and trying it out yourself. I have some suggestions for online resources to help.
A firm understanding of best practices for using NumPy is highly recommended: the video “Losing Your Loops: Fast Numerical Computing with NumPy” by Jake VanderPlas, author of the book Python Data Science Handbook, is a delightfully useful talk on how to use NumPy effectively.
For Numba and dpctl, there is a 90-minute video talk covering these concepts in more detail, titled “Data Parallel Essentials for Python.” Also, the step-by-step Jupyter-notebook-based training within the oneAPI samples was mentioned earlier (refer back for the git and file information).
These heterogeneous Python capabilities are all open source, and they also come prebuilt with the Intel oneAPI Base and AI Toolkits, which bundle the prebuilt Intel Distribution for Python. A SYCL-enabled NumPy is hosted on GitHub. Numba compiler extensions for kernel programming and automatic offload capabilities are hosted on GitHub. The open source Data Parallel Control library (dpctl: C and Python bindings for SYCL) has GitHub docs and a paper, “Interfacing SYCL and Python for XPU Programming.”
Exceptions are indeed supported, including asynchronous errors from device code. Async errors are intercepted once they are rethrown as synchronous exceptions by an async error-handler function. This behavior is courtesy of the Python extension generators, and community documentation explains it well for both Cython and pybind11.
Prior Installments in this Series

About the Author
James Reinders believes the full benefits of the evolution to full heterogeneous computing will be best realized with an open, multivendor, multiarchitecture approach. Reinders rejoined Intel a year ago, specifically because he believes Intel can meaningfully help realize this open future. Reinders is an author (or co-author and/or editor) of ten technical books related to parallel programming; his latest book is about SYCL (it can be freely downloaded here). 