Pixelglow Interview

Originally published on macresearch.org, around 2006. Reproduced from the author's archive; some links may no longer resolve.

Interview with Glen Low of Pixelglow Software

Glen Low probably knows as much about high-performance computing (HPC) on Mac OS X as anyone outside Apple. He is the author of macstl, a C++ library that takes advantage of the vector engines in PowerPC and Intel chips to make almost obscene performance gains over scalar code. In the following interview, conducted by email, Glen tells us about his software, gives his views on the Intel switch, and offers some tips and tricks on winning an Apple Design Award.

Q: I became aware of Pixelglow a few years ago when macstl came on the scene, giving C++ developers a high-performance valarray implementation for the Mac. Can you explain what valarray is for the uninitiated?

In the 1990s, computational physicist Kent Budge wanted to make C++ useful and efficient for numerics work, giving the field's preferred tool, Fortran 90 with its array operations, a run for its money. His valarray made it into the Standard C++ library in 1998. It substitutes for C arrays but adds an important new feature: most operations that apply to the elements of an array apply to the valarray as a whole, in a neat, intuitive syntax.

Thus where in C, you would code:

float f [100];
float g [100];
float h [100];
for (int i = 0; i < 100; ++i)   // add each element in g to each element in h and store it in f
    f [i] = g [i] + h [i];

Using C++ valarrays, you would write instead:

#include <valarray>
using std::valarray;

valarray <float> f (100);
valarray <float> g (100);
valarray <float> h (100);
f = g + h;      // add each element in g to each element in h and store it in f

As Kent put it (link no longer available), “I wanted valarray to provide a mechanism for expressing any one-loop operation as a single expression which could be highly optimized.” By the early 2000s, though, he seemed embarrassed that his little creation had been overtaken in popularity by the sexier STL (Standard Template Library) containers and algorithms. And yet this simple-looking construct has profound implications for modern CPUs and the future of high-performance computing.

Q: Some of the performance results given on your web site are very impressive. How does macstl achieve those results?

Modern CPUs have a weapon of mass destruction called SIMD (single instruction, multiple data) that blasts through multiple pieces of data with a single machine instruction. On the PowerPC, this weapon is called Altivec, and on the Intel side it's called SSE. But as with a certain Middle Eastern country, their existence in real code is somewhat more legendary than widespread!

See, you have to prime the weapon properly to get the best results. Substitute the right SIMD ops for their scalar equivalents. Align data to 16 bytes. Eliminate any branching (no if’s, but’s or function calls) except for the primary loop, so that the CPU doesn’t waste time doing speculative execution. Minimize writing out intermediate results, because every extraneous store and reload taxes the overworked memory access hardware. Schedule code so that pipelines are full and all units are busy doing real work. It’s hard to get it right for code you write for yourself, and it’s even harder to get it right for code you write for others — library code.
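To make that concrete, here is a minimal sketch of a hand-primed SSE loop (an illustration, not macstl source), assuming the three arrays are 16-byte aligned and n is a multiple of 4:

#include <xmmintrin.h>   // SSE intrinsics

// f = g + h, four floats at a time; the only branch is the loop itself
void add_arrays (const float* g, const float* h, float* f, int n)
{
    for (int i = 0; i < n; i += 4)
    {
        __m128 vg = _mm_load_ps (g + i);   // aligned 16-byte load
        __m128 vh = _mm_load_ps (h + i);
        _mm_store_ps (f + i, _mm_add_ps (vg, vh));   // addps: one op, four sums
    }
}

Break the alignment assumption and _mm_load_ps faults; keep it, and each iteration does four elements' worth of work with no intermediate stores.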

However, it turns out that those little valarrays provide an ideal base for SIMD conversion in my macstl library. When you use them, you don't ask for loops explicitly, so macstl can transparently substitute SIMD ops without changing the meaning of your code. Since valarrays own their own memory, macstl can allocate only 16-byte aligned chunks. Aggressive code inlining and careful code choice remove all branches other than the primary loop. macstl also uses a C++ technique called expression templates to evaluate valarray expressions in one loop pass without creating any temporary valarrays for intermediate results. And finally there were lots of sleepless nights hand-tuning the C++ source and poring over the compiled assembly, all to see that the optimizing compiler did its job in properly scheduling the code. The result? A 450x speed-up (link no longer available) in the trigonometric benchmark with Yellow Dog Linux running on a G4.
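To see how expression templates pull that off, here is a toy sketch (vastly simplified, and not macstl's actual code) in which f = g + h records the addition at compile time and evaluates it in one fused loop, with no temporary array:

#include <cstddef>

struct array;

// sum records "l + r" without computing it yet
struct sum
{
    const array& l;
    const array& r;
    float operator[] (std::size_t i) const;   // element computed on demand
};

struct array
{
    float* data;
    std::size_t size;

    float operator[] (std::size_t i) const { return data [i]; }

    // assigning an expression runs a single fused loop: no temporaries
    array& operator= (const sum& e)
    {
        for (std::size_t i = 0; i != size; ++i)
            data [i] = e [i];
        return *this;
    }
};

inline float sum::operator[] (std::size_t i) const { return l [i] + r [i]; }

inline sum operator+ (const array& l, const array& r)
{
    sum s = { l, r };
    return s;
}

A real library templates the expression nodes over arbitrary operations and operand types, and substitutes SIMD loads, ops and stores inside the fused loop.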

Q: Macstl was first developed for PowerPC chips, but you now have support for the new Intel chips. What have you learned about the vector capabilities of the new Macs? How does SSE compare with Altivec?

Altivec sprang out whole from the split skull of the old AIM (Apple, IBM, Motorola) alliance, while SSE grew up over four revisions (and counting) on the Intel platform. Thus Altivec has the more comprehensive, well-thought-out opcodes, but is now a sitting duck for the other SIMD architectures to hit. For example, the second version of SSE finally offered the same SIMD types as Altivec, and did one better with a SIMD double type, but missed some opcodes that would complete the repertoire. Still, Altivec's patented permute functions can't be beat: you can rearrange any of the 16 bytes in a SIMD register at runtime, and most Altivec code is packed with permutes.
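For a taste of what a permute can do, here is a minimal sketch (assuming GCC-style Altivec extensions and <altivec.h>; it compiles only for PowerPC with Altivec enabled) that reverses the 16 bytes of a register with a single vec_perm:

#include <altivec.h>

// each byte of the mask selects one byte (0-31) from the 32-byte
// concatenation of the two source registers
vector unsigned char reverse_bytes (vector unsigned char v)
{
    const vector unsigned char mask =
        { 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0 };
    return vec_perm (v, v, mask);
}

SSE2 has no runtime byte shuffle at all, and even SSSE3's later pshufb works on a single register rather than a concatenated pair.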

SSE is ickier to work with at the assembly level, since it uses the two-register format of the x86 ISA coupled with only 8 to 16 visible SIMD registers, whereas Altivec uses the three-register format of the PowerPC ISA coupled with 32 visible SIMD registers. However, Intel has put a lot of work into covering these deficiencies: their current CPUs have good instruction decoders, and their C intrinsics interface leaves the register juggling to the compiler.

Q: I see that the latest releases of macstl also include a cross-platform vector type. Can you explain what that is, and how it is used?

macstl uses good software design: I layer the valarray implementation over a platform-independent vec type, which is as close to the underlying SIMD metal as possible. That way I can swap in new machinery under the valarray implementation whenever a new SIMD architecture comes along; for example, I did most of the SSE work in 3,000 lines of code and about three weeks' work. But the vec type is useful in its own right for doing low-level SIMD programming in a platform-independent fashion.

So the expression vec p, q, r; p = q + r; compiles to vaddfp on Altivec and addps on SSE, adding the four float values in q to the four float values in r and storing the result in p.

Besides handling common SIMD operations, the vec type also exposes a common, intuitive initialization syntax and a platform-dependent interface. Amazingly, thanks to compiler optimizations, there is no performance gap between the vec type and the underlying SIMD opcodes!
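A minimal sketch of that layering (an illustration, not macstl's actual source) might look like this: one wrapper type, two back ends chosen at compile time.

#ifdef __VEC__
    #include <altivec.h>
    typedef vector float raw4f;
    inline raw4f raw_add (raw4f a, raw4f b) { return vec_add (a, b); }     // vaddfp
#else
    #include <xmmintrin.h>
    typedef __m128 raw4f;
    inline raw4f raw_add (raw4f a, raw4f b) { return _mm_add_ps (a, b); }  // addps
#endif

struct vec4f
{
    raw4f data;
};

inline vec4f operator+ (vec4f a, vec4f b)
{
    vec4f result = { raw_add (a.data, b.data) };
    return result;
}

After inlining, an expression like p = q + r boils down to the single native opcode, which is why the wrapper gives up nothing to raw intrinsics.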

Q: How do you see the future of vectorization (SIMD) in general? It seems to me that the vector pipelines are not being extended, and that the trend is toward multiple cores. Do you think vector is here to stay?

I believe vectorization and parallelization are two sides of the same coin: using concurrency to make the most of future Moore's Law gains. As C++ guru Herb Sutter said, the free lunch is over for getting programs to run faster: clock speeds (and hence pipelines) have stalled, and increased performance will only come from vectorized, parallelized code. And of course, it's easier for CPU vendors to just tack on an extra vector execution unit than a whole new parallel core.

That’s why Intel’s still innovating in the SIMD space — their new Core 2 Duo (link no longer available), featured in recent Macs, adds the 16 new instructions of SSSE3 and doubles the execution rate of all SIMD code over their previous CPUs. SSE4 will feature 50 new instructions in 2007/2008, playing catch up with Altivec in some areas and surpassing Altivec in others.

And then of course there's IBM's Cell, powering the Sony PS3 and IBM/Mercury high-end servers: up to eight cores born and bred to eat and drink SIMD code through a souped-up version of Altivec. I hear the PS3 is in, um, pretty hot demand lately (link no longer available)!

Q: A few years ago you won an Apple Design Award for Graphviz. Can you explain what Graphviz is for, and what was involved in bringing it to the Mac?

Graphviz is an automatic graph layout program. Think OmniGraffle, or Visio for you Windows users, except you don't have to lay out the graph yourself; the program does. You just describe which nodes connect to which other nodes using the simple DOT language, and the program then lays them out in an efficient, even beautiful fashion.
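For a flavor of DOT (a hand-written illustration, not a file from the interview), a three-edge graph is just:

digraph example {
    a -> b;
    a -> c;
    b -> d;
}

Feed that to Graphviz and it decides where every node and edge should go.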

It all started when I tried to document macstl code using Doxygen (link no longer available), a source-code documentation system. Doxygen uses Graphviz to generate graphs; both programs come from a strong Unix background. While Doxygen on the Mac was quite mature, the Graphviz port still had some rough edges: a lot of Linux dependencies, and the output was these jagged bitmaps. So there was the itch, and I applied the usual open-source scratch to it. I'd do my own port of Graphviz, but using only the standard libraries in OS X and all its inbuilt Quartz yumminess: antialiasing and typography. Impressed by the story of TheCodingMonkeys winning the last Apple Design Award for their Hydra editor (now known as SubEthaEdit (link no longer available)), I too would use the Award to motivate me to finish the port.

At first I just wanted to do the command-line tool and write a simple wrapper around the options. A typical GUI wrapper lets you select an input file, tweak the CLI parameters, then click on a button to run the command. One day I was thinking about the open dialog box for selecting which DOT file to render. I thought the user would want to see what the graph looked like when rendered, so how could you render a preview in the open dialog box? Then it hit me: why not make the “preview” the focus of the whole application instead? That's really the Macintosh way of course, putting the document (in my case the graph) front and center, while the process (rendering, exporting to different file formats) fades into the background as a menu or toolbar command.

On the path to the Design Award, open source has been good to me. First, I got to stand on the shoulders of giants: AT&T wrote the original Graphviz with over 14 years of research and more than 118,000 lines of code, and generously offered their package to the world, and to me, with an open-source license. Then I received immense help and feedback from the community; they suggested features I hadn't thought of and even helped refine my submission to Apple. And finally there was the contest itself: I just knew the other categories would be filled to the gills with big, established companies, and the open-source category was perhaps the only place this one-man show from “Down Under” could shine.

The last few weeks were the hardest. After polishing the app with GUI spit and shine, like integrating proper Mac OS X font and color dialogs, I happened to re-read the entry conditions. My “oh no” moment! To qualify for the open-source category, all the source needed to be OSI-certified. While my own GUI code was BSD-licensed and thus certified, the original source was under the uncertified ASCA license. In a flurry of panic, I sent emails to OSI and AT&T, pleading with them to fast-track the license approval, and to Apple, asking them to be more flexible with the conditions. But I had left it too late: OSI had rejected the current ASCA license earlier and wanted changes before they would even reconsider, so AT&T would need to consult their lawyers over the changes: certainly doable, just not doable within the deadline. And of course the Apple guy, while sympathetic to my cause, said there could be no exceptions to the rule.

I still remember sitting in my old white Camry parked at the Barrack Street Jetty, pouring it all out to my wife while waiting for the in-laws to turn up for our Saturday dim-sum lunch. All the sleepless nights and time off from the paying day job, the lost moments with the family and that insane desire to scratch the face of eternity, if only for a while: all gone on a technicality. But she looked into my eyes and said we needed to have faith; sometimes that's all we have. We said a prayer, and I could finally breathe.

On Monday I submitted the finished product and argued that Apple should only evaluate the OSI-certified portions of the submission, effectively paring my submission down to just the GUI wrapper I wrote. On Tuesday I couldn't get any work done at the day job, my sweaty fingers clicking the browser refresh button on Apple's ADA announcement page every few minutes. That's when I saw the impossible: Pixelglow Software had won Best Open Source Product and was runner-up for Best New Product.

Sometimes a dream has to die, so that God can bring it back to life again.

Q: You also sell a set of Automator actions called Shellac. They wrap command-line tools like cat and grep. What is the advantage of using the Shellac actions over just, say, creating a shell script action with the corresponding command included?

Shellac shares the same philosophy as my Graphviz port: letting the user tweak Unix command options with a user-friendly interface. Sure, you can just create a shell script action around grep, but then you're back to the man pages to find, for example, which option dumps out the surrounding context. (It's -C, by the way.)

Q: You seem to have a very broad experience, not only in Mac development, but also in Java and Windows .NET, so you are probably in as good a position as any to evaluate the current state of software development on the Mac. How does Cocoa compare to technologies like .NET? What about Apple’s development tools? Do you see areas that could be improved?

That’s a very broad question too, and it’s hard to even begin to answer. Both Cocoa and .NET are object-oriented frameworks that cover an impressive breadth of functionality, so I’ll just pick on how to write a GUI application as a launch point.

Cocoa is based on message passing: objects talk to each other through messages, which are really just optimized strings. So there's a lot of dynamic behavior that can happen at runtime. When you write “My First GUI App” in Cocoa, you use Interface Builder to create the controls graphically in a nib file. You then write delegate objects in your own code to handle all the actual work: a clean model-view-controller separation of code that takes a little getting used to. Behind the scenes, the runtime deserializes the controls and delegates from the nib file, then dispatches messages to the right object as the user clicks on this and that. The height of this dynamism must be Cocoa Bindings (Mac OS X 10.3 and later), where you can bind your model objects to various controls, even complicated, deep ones like trees and tables, and get all the data synchronized without extra glue code. .NET binding pales in comparison.

.NET is based on interfaces and inheritance: objects declare interfaces at compile time and call methods and properties on these interfaces through pointers. When you write “My First GUI App” in .NET, you use Visual Studio to create the controls graphically, but it actually writes out the source code for an inherited form class; this sort of wizard-generated code is prevalent in .NET. You then add your own event-handling code to the giant class, conceptually mixing your model code with the framework's view-controller code, which is easy to get into but may bite you in the long run. Behind the scenes, the runtime can efficiently instantiate the objects from actual code and call methods and properties through pointers, so there is some efficiency here. Still, .NET achieves a good degree of dynamism through runtime compilation; for example, XSLT stylesheets and XML serialization are handled by generating and compiling custom classes at runtime. There's no neat equivalent of this in Cocoa.

A good IDE is like a comfortable sofa: you want to chill out and not be bothered by a hard lump of something sticking out from under the cushions. Xcode is all that and more. They've built an impressive system around open-source tools like gcc and gdb, and polished it in many ways (syntax and error highlighting, dependency analysis, distributed builds) that don't constantly intrude on the way you do things. However, Xcode could do with better code completion and refactoring tools like you find in the latest Visual Studio. I like Shark too: Apple's profiling tool not only focuses your attention on hot spots in your code but suggests possible changes to improve performance. You won't find a free tool like that in the Visual Studio arsenal.

Q: What is your view of High-Performance Computing on the Mac? Is it healthy, or does it still have a ways to go compared to platforms like Linux?

The jury is out on this one. A lot of HPC gear is either custom-made or is built by specialized HPC companies, and they tend to prefer open-source Linux over proprietary Mac OS X. In the top 500 supercomputer list (link no longer available), Mac OS X holds only a tiny 0.6% share versus Linux at 75%. Still, Apple was winning some converts with the Power Mac G5 and the Xserve G5, so it remains to be seen how they'll continue that drive with the Intel regime (link no longer available). I know Apple continues to plough a lot of thought and hard work into its math and vector libraries, both PowerPC and Intel versions: the macstl trigonometric benchmark only speeds up 70x over the Mac OS X standard libraries, versus 450x over the Yellow Dog Linux standard libraries.

Q: Thanks for sharing with us. I read on your web site that you’re also available as a coder for hire. If anyone out there wants to get in touch, how can they find you?

Easy, just send an email to info@pixelglow.com. Or they can read my Rentacoder profile (link no longer available) or look at a slightly older resume.