
With Little Power Comes Great Responsibility.

Apologies to Voltaire.  But I have to mention power efficiency.  Power efficiency is no longer an option.  We absolutely must be power efficient.  Our challenge is to maximize how much we can get done with ever-decreasing power budgets.  Algorithms belong on the cores where they’ll execute efficiently.  Modern architectures implement complex thermal strategies with very real performance implications.  The poor power citizen will reduce overall performance, not to mention drain valuable battery life.

We have to earn our power.  The CPU’s general purpose cores pay a power price for their goodness.  Still, many algorithms execute more efficiently here.  Software Occlusion Culling with a Depth Buffer Rasterizer is just such an algorithm.  For the price of a fraction of our frame’s CPU budget, we can significantly reduce our CPU and GPU demand.  We can consume our bounty as better performance, or lower power, or a bit of both.

We’ll talk more about this culling algorithm later.  But let’s eat our dessert first: the next post introduces a depth-buffer rasterizer.  Can we write one that’s fast enough to give a net perf+power win when used for culling?


The Time Has Come To Speak Of Rasterizers and Other Things

Introduction

This blog series details a software rasterizer.  Wait.  What?  Why would anyone be interested in a software rasterizer?  That’s exactly what I thought, years ago, when Michael Abrash revealed Pixomatic.  Mental note: when you think Michael Abrash has gone off the deep end, stop and figure out where your own model of the Universe is broken.  Wisdom is bittersweet.  I also mention Michael as a segue to Larrabee.  I joined Intel to work on Larrabee.  It’s still my favorite architecture: an awesome collection of flexible computing – a graphics programmer’s dream.  Graphics programmers are a creative bunch, and a general-purpose machine offers unlimited creative potential.  Larrabee taught me a lot.

Alas, we don’t get the Larrabee land of milk and honey.  What do we get?

Where Are We?

We get a GPU.  GPUs are great too, and getting better every day.  Starting with Sandy Bridge, the minimum performance is respectably high and furiously getting higher.  GPUs are also becoming more general-purpose.  Their integration onto the die with the CPU cores has immediate synergies and future opportunities.  The future really is bright.  But, GPUs do have limitations.  The biggest kick in the head is that they’re designed to be interchangeable.  That makes for wonderful economics and is probably good for long-term GPU architecture evolution, as GPU designers compete to produce the best solution.  But, it can also be very annoying.

We don’t get to directly manipulate the GPU; GPU programming is the art of negotiating with an API.  We only touch the GPU with boxing gloves on.  We can’t rely on consistent performance characteristics across different GPUs.  How expensive is an anisotropic texture fetch vs a dot product on this GPU vs that GPU?  How many ROPs are there?  What’s the penalty for small triangles?  What’s the ratio of FLOPs to GB/s?  How much memory is available to me?  How much will MSAA cost?  Could you even do anything with the answers if you had them, given that they’re different for every GPU?  At the least, supporting these variations adds cost.  The GPU and driver also have functionality that isn’t exposed by the API.  Work scheduling and memory management come to mind.

Hello Mr CPU.  What Have We Here?

We also get CPU cores.  Good ones.  With big caches.  And low latency.  And out-of-order execution.  And lots of memory.  And a low-level ISA.  And virtual memory.  And… you get the picture.  The CPU is great.  But, it has its own issues; it isn’t perfect either.  CPUs are different from each other too.  They wouldn’t be getting better if they were exactly the same every year.  Better requires different.  But, the CPU ISA is a much more direct, low-level interface than a graphics API.  Also, the big CPU differences are exposed: core count, cache size, supported ISA version, etc.

When I moved off the Larrabee project, I adapted some of my experiments for the traditional cores.  I was thrilled to find there’s a lot of juice available if you’re willing to program the whole machine.  Programming a mainstream CPU as a throughput machine works much better than I expected.  I wanted to take what I learned about throughput programming for Larrabee, and apply it to a software rasterizer for traditional cores.

I love flexibility.  I’m not at all happy that the rasterizer is fixed-function.  I realize that the perf/watt value is high.  But, what if we want to innovate at the rasterizer level?  Consider a hypothetical AA approach that’s elegant if we can modify the rasterizer, but hideous otherwise.  How do we even know for sure the idea really works?  How do we know it doesn’t have bizarre temporal artifacts?  How do we know it doesn’t have corner cases, or introduce a glass jaw?  It’s easy to claim a HW change will enable some new elegance.  It’s an entirely different thing to convincingly demonstrate it.  Similarly, what if some future GPU has a cool new rasterizer feature?  How do we productively work with it before getting real HW – that won’t be available for years?

Where Is That Confounded Software Rasterizer?

A software rasterizer is in order.  Note that none of the mentioned motivations require peak performance.  Any performance improvements we make will increase our productivity.  But, I know just how deep the perf well goes.  We can drink for days.  I’m confident that we can be fast enough to be useful to real games, in the real world, right now.

Can we make a rasterizer that’s fast enough to actually be useful?  Yes, we can.