Creating a Parallel Programming Language for Multicore
Intel and its ecosystem are developing a parallel programming language for multicore chips, but don’t expect miracles anytime soon.
By Ed Sperling
Software development almost always lags behind changes in hardware, but in the case of software for multicore chips the gap is widening.
In hardware, the ability to get increasing performance out of a single-core processor within acceptable power budgets became extraordinarily difficult at 130 nanometers, and totally impractical at 65nm. In portable devices such as a notebook computer, boosting performance by 50 percent for a single-core chip would make it too hot to hold or deplete the battery life to cool it—or both. Even in places where the chips can be cooled effectively, such as data centers, the demand for energy to lower the heat in server racks has become so enormous that it has drawn the wrath of the U.S. Environmental Protection Agency.
Even where they have been successful, application developers have utilized multiple cores by threading different functions or operations across those cores. In the case of database searches, for example, threading works extremely well because a single task can be parsed among the available processors or cores. The more cores available, the faster the application runs. In contrast, that becomes much harder with gaming software because the tasks are both different and randomly ordered.
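The database-search case can be sketched in a few lines. This is a minimal illustration, not anything from Intel or the article: the names `search_chunk` and `parallel_search` and the in-memory list of records are assumptions made for the example. It splits one search into contiguous chunks, hands each chunk to a worker thread, and merges the results.

```python
from concurrent.futures import ThreadPoolExecutor


def search_chunk(records, start, stop, wanted):
    """Return the indexes in [start, stop) whose record equals `wanted`."""
    return [i for i in range(start, stop) if records[i] == wanted]


def parallel_search(records, wanted, workers=4):
    """Split one search over `records` into contiguous chunks, one per task.

    Hypothetical helper for illustration; `workers` must be at least 1.
    """
    if not records:
        return []
    chunk = -(-len(records) // workers)  # ceiling division
    bounds = [(lo, min(lo + chunk, len(records)))
              for lo in range(0, len(records), chunk)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = pool.map(
            lambda b: search_chunk(records, b[0], b[1], wanted), bounds)
        # map() yields results in chunk order, so merged indexes stay sorted.
        return [i for part in parts for i in part]


if __name__ == "__main__":
    print(parallel_search(["a", "b", "a", "c", "a"], "a", workers=2))
```

The sketch shows only how the work is divided. In standard CPython builds the global interpreter lock keeps threads from speeding up pure-Python comparisons like these, so a real speedup would need processes or a runtime without that restriction.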
“Threads are really a low-level way to get performance increases,” says Anwar Ghuloum, principal engineer at Intel. “It’s easy to make mistakes and deadlock the program.”
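One common way such a deadlock arises is when two threads each hold one lock and wait for the other's. The sketch below is a hypothetical example (the `Account` class and `transfer` function are not from the article) showing the usual remedy: every thread acquires the locks in the same fixed order.

```python
import threading


class Account:
    """Hypothetical shared object guarded by its own lock."""

    def __init__(self, ident, balance):
        self.ident = ident
        self.balance = balance
        self.lock = threading.Lock()


def transfer(src, dst, amount):
    """Move `amount` from src to dst, always locking the lower ident first.

    If each thread instead locked `src` first, two opposite transfers could
    each hold one lock and wait forever for the other.
    """
    if src is dst:
        return  # nothing to move, and taking the same lock twice would hang
    first, second = (src, dst) if src.ident < dst.ident else (dst, src)
    with first.lock:
        with second.lock:
            src.balance -= amount
            dst.balance += amount


def run_demo(rounds=1000):
    """Run opposite transfers concurrently and return the final balances."""
    a, b = Account(1, 100), Account(2, 100)

    def move(src, dst):
        for _ in range(rounds):
            transfer(src, dst, 1)

    threads = [threading.Thread(target=move, args=(a, b)),
               threading.Thread(target=move, args=(b, a))]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return a.balance, b.balance
```

Because both threads take the locks in the same order, `run_demo()` finishes and the balances return to their starting values. The bookkeeping needed to get even this small case right is the kind of low-level detail Ghuloum is describing.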
Intel’s research lab is working with its top customers to develop a new programming environment, Intel Ct, which is a key component in Intel’s Tera Scale project (see Figure). At the Intel® Developer Forum in China in April, Zhang Cia, chief technology officer at Neusoft Co., presented a slide showing that the number of lines of code needed for a single command was 36 for a single-threaded application, 29 using a vectorized, multi-threaded approach with forward scalability (the result of working with Ct), and 116 using a single-threaded vectorized approach, which does not scale.
That’s a somewhat ideal scenario. Neusoft, based in Shenyang City, China, develops security software and sees a parallel programming future. For other companies, such as gaming software makers, the course is less obvious. For desktop applications such as Microsoft Word, there are few, if any, advantages to writing the code in parallel.
In developing Ct, Intel has focused on applications that can be built for speed.
“The real question is how we get the productivity of the last generation of object-oriented languages like C++ and the performance benefits of Fortran,” Ghuloum says. “If you take a look at C code versus Fortran, the Fortran had two times better performance.”
Benefits and costs
From a performance standpoint, developing software that can take advantage of more cores is a slam-dunk argument. It’s the only way to build a system cost-effectively using a single die or even multiple embedded cores. But there also is performance overhead associated with multiple cores. Even with highly parallelized applications, adding another core doesn’t double performance.
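One standard way to see why an extra core does not double performance is Amdahl's law, which the article does not name but which describes the same effect: any part of a program that must run serially limits the overall speedup. The sketch below assumes a hypothetical program in which 90 percent of the work can be parallelized. That figure is illustrative only and is not an Intel estimate.

```python
def amdahl_speedup(parallel_fraction, cores):
    """Speedup predicted by Amdahl's law: 1 / ((1 - p) + p / n)."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if cores < 1:
        raise ValueError("cores must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


if __name__ == "__main__":
    # 0.9 is an assumed parallel fraction, chosen only for illustration.
    for cores in (1, 2, 4, 8):
        print(cores, round(amdahl_speedup(0.9, cores), 2))
```

Under that assumption, going from one core to two yields a speedup of about 1.8 rather than 2, and each further doubling of cores returns less than the one before.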
Intel estimates that with a programming language like C++, the performance hit already was 20 percent to 30 percent, which was acceptable given the productivity gain in writing code. With parallel programming, the overhead is probably in the 30 percent range. But with multiple cores, the total performance still can be increased as much as six times.
Ken Karnofsky, director of signal processing and communications at The Mathworks, says his company has been working to parallelize computations in its MATLAB product and to simulate code faster in both MATLAB and Simulink. He says that work includes splitting functions as well as spreading out different functions across processors.
“There are some embarrassingly parallel computations: computations that are done over and over again with different parameters for different data,” he says. “That is relatively straightforward. The harder ones are where you have to consider how the algorithms are structured and how you distribute the data.”
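The embarrassingly parallel case Karnofsky describes can be sketched as a parameter sweep: the same function runs once per parameter value, and no run needs another run's result. This is a minimal illustration, not The Mathworks' implementation. `simulate` is a hypothetical stand-in for a real computation, and `sweep` is an assumed helper name.

```python
from concurrent.futures import ThreadPoolExecutor


def simulate(gain, samples):
    """Hypothetical stand-in for one independent run: scale and sum samples."""
    return sum(gain * s for s in samples)


def sweep(executor, gains, samples):
    """Run `simulate` once per gain; no run depends on another run's result."""
    futures = [executor.submit(simulate, g, samples) for g in gains]
    # Collect in submission order so results line up with `gains`.
    return [f.result() for f in futures]


if __name__ == "__main__":
    with ThreadPoolExecutor(max_workers=4) as pool:
        print(sweep(pool, [1, 2, 3], [1, 2, 3, 4]))
```

Passing the executor in as an argument is a design choice made for this sketch. It lets the same sweep run on a thread pool or, for CPU-bound work, on a `concurrent.futures.ProcessPoolExecutor`, which offers the same `submit` interface. The harder cases Karnofsky mentions are the ones where the runs are not independent and the data has to be divided with care.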
Using more parallel software, in some cases, means more middleware, which also exacts a toll on total performance. Intel is developing its own middleware to work with multiple cores. IBM has been doing the same in its Cell processor, creating a hypervisor that acts like a traffic cop for the chip. And because all of this traffic has to be directed dynamically, that carries a power and performance price tag.
The upside is that more software also means more programmability. While it’s up to the chip’s architect to determine the percentage of functionality in software versus hardware, adding more software—either in embedded code, firmware or externally—allows some flexibility in how a device is built. And from an inventory standpoint, discrete components can be field-upgraded in rapidly changing markets such as consumer electronics to incorporate the latest communications protocols or interfaces.

Figure: Ct is a programming model developed by Intel and its ecosystem for multicore chip development, as demonstrated by the Tera-Scale program
Not all cores are alike
Still, developing software in parallel is immensely more complex, which is why there hasn’t been a focused effort to do it until now. Just as cheap gasoline made alternative energy sources a job for the future, classical scaling made multicore less attractive. Multicore programming is no longer something that can be ignored, despite its complexity. And that complexity grows when you consider that not all cores are alike—some are large, some And in software parallelization, the same application may take advantage of some or all of these different types of cores at different times.
Until now, Ct has worked largely on a shared memory system. Intel is now examining whether to use a distributed computing environment approach so that an application can scale to every node on the system.
All of this will take time, of course. The first step is for libraries and frameworks to be parallel-enabled, which Intel believes will happen in the next one to two years. After that, it could take 5 to 10 years for the development language to become mainstream, something that will require lots of work by Intel and its ecosystem, along with research currently being done at universities around the globe.
“Problem number one is how to make multicore programming easier,” says Ghuloum. “That’s not solved yet.”
Ed Sperling is a regular contributing editor to Chip Design magazine. Ed has spent the past two decades immersed in technology. He is the recipient of numerous awards for journalistic excellence.















