# Friday, August 25, 2006

If you read much of what I write, you know that performance is something I like to always keep in mind, and I'm also a big fan of quantitative analysis.  It all started ages ago with a look at the real time capabilities of Windows CE, but most recently I've been looking at managed code.  This entry is the second in a series of looks at performance of managed code, specifically the .NET Compact Framework.  For the first entry, see this blog entry (which became this MSDN article).

Act II - Method calls are Expensive (or Stay in the Shallow End of the Stack!)

As with many of my diatribes into technical problems, this all started with a newsgroup post.  SOmeone posted that the had done a Towers of Hanoi implementation in VB.NET, C# and C++ and that the C# implementation was faster than VB, but got slower in CF 2.0.  They also said that native code was much faster than both (I think he may have said an order of magnitude, but don't quote me on that).

Well he didn't post any code so we could reproduce his results, and I never take anyone on their word on something like this, so I decided to put it to the test.  Before going any further, you may need some background information on what the Towers of Hanoi problem is.  You're probably familiar with it - you just didn't know what it was called.  Go read this Wikipedia entry and come back.

I decide to do a recursive solution, so the meat of the problem is that we have an exponential number of method calls, making the call stack very, very deep.  Just from a memory point of view this is a bad idea (see my MEDC presentation on memory management if you want to know more on the whys of that).  But you'll also see that the expense of method calls in the CF (see Act I) really bites you here as it bites hard.

First, let's look at the code.  In C#, it looks like this:

public class Hanoi
{
  private int[] pegs = new int[3];

  public Hanoi(int totalDisks)
  {
    // start with all disks on peg 0
    pegs[0] = totalDisks;
    pegs[1] = 0;
    pegs[2] = 0;
  }

  public int Solve()
  {
    int et = Environment.TickCount;
    Move(0, 2, 1, pegs[0]);
    return Environment.TickCount - et;
  }

  private void Move(int fromPeg, int toPeg, int intermediatePeg, int disks)
  {
    if(disks == 0) return;

    // move all but one disk to the intermediate peg
    Move(fromPeg, intermediatePeg, toPeg, disks - 1);

    // move the last remaining disk to the destination - no need for the intermediate
    pegs[fromPeg] -= 1;
    pegs[toPeg] += 1;

    // now move all but one off the intermediate peg to the destination peg
    Move(intermediatePeg, toPeg, fromPeg, disks - 1);
  }
}

You can see that Move calls itself twice.  Now try tracing the code in your head if I call Solve when totalDisks is 30.  Ugly.

Now lets look at the C++ implementation.

class Hanoi
{
    private:
        int pegs[3];
        void Move(int fromPeg, int toPeg, int intermediatePeg, int disks);

    public:
        Hanoi(int totalDisks);
        int Solve();
};

Hanoi::Hanoi(int totalDisks)
{
    // start with all disks on peg 0
    pegs[0] = totalDisks;
    pegs[1] = 0;
    pegs[2] = 0;
}

void Hanoi::Move(int fromPeg, int toPeg, int intermediatePeg, int disks)
{
    if(disks == 0) return;

    // move all but one disk to the intermediate peg
    Move(fromPeg, intermediatePeg, toPeg, disks - 1);

    // move the last remaining disk to the destination - no need for the intermediate
    pegs[fromPeg] -= 1;
    pegs[toPeg] += 1;

    // now move all but one off the intermediate peg to the destination peg
    Move(intermediatePeg, toPeg, fromPeg, disks - 1);
}

int Hanoi::Solve()
{
    int et = GetTickCount();
    Move(0, 2, 1, pegs[0]);
    return GetTickCount() - et;
}

You can see that it's really not much different - in fact it's really, really close.  So what do we see for results when we run these? Look at the table below (my device was an Axim X30 with a PXA270 processor)

disks C# - CF 2.0 C++ Diff % improvement
20 744 461 283 38%
21 1518 1148 370 24%
22 2997 2068 929 31%
23 5964 4006 1958 33%
24 11892 7646 4246 36%
25 23942 14875 9067 38%
26 47945 29423 18522 39%
27 95674 58496 37178 39%
28 190917 116837 74080 39%

The C++ version performed nearly 40% better.  Of course the C# JIT compiler is running on a mobile device, with presumably much less power than a desktop machine, so the JITter is built to optimize for compile speed, not execution speed.  To try to level the playing field, I compiled the C++ version in Debug mode to turn off all compiler optimizations.  I can't actually see what the CF JITter creates for assembly, so we can't be sure if the resulting code is the same, but that's the nearest I can come.

So we know that the implementation code is near identical.  We also know from Act I that identical code in a single method has no performance difference between native and managed code.  We also know from Act I that method calls are quite expensive.  A reasonable conclusion then is that the performance degradation we're seeing here is nearly all in the cost of method calls.  Does this mean that managed code performance is terrible?  The answer is "it can be if you don't fully understand what the CLR is doing." The lesson learned here then is to either refactor the algorithm to keep your call stack short (there are non-recursive solutions to this problem - maybe another day I'll test that), or put heavily recursive stuff into a native library and P/Invoke to it.

If you want to try these out on your own device, you can download the source code here (post your results in the comments if you'd like).  The source actually points out another lesson, this time in UI development speed. The C# version has a nice UI that I put together in about 10 minutes.  Writing a nice UI would have taken quite a bit longer in C++ so I didn't even bother. With the C++ version you have to get the results from by running in eVC or Studio and setting a breakpoint in the calling loop and writing down the number after each iteration.

Friday, August 25, 2006 11:09:58 AM (Central Daylight Time, UTC-05:00)  #     | 
# Wednesday, August 23, 2006

We've had an idea for a new project for some time, and we've finally decided to put out a code seed for it to see if the community wants to get involved.

The idea is this - Create a native coredll.dll library for the desktop that exactly matches the one exposed by Windows CE (same funtions at the same ordinals).  In theory, this would allow you to run Compact Framework applications against the full framework including code that P/Invokes.  The long-term goal is to implement every funtion (there are about 1800 of them, we've seeded the project with 50), but the milestone I'm shooting for is to get this library to a point that the SDF will run on the desktop.

Why would you want this library you ask?  The answer is fairly simple - to help in debugging and unit testing.  I don't envision you shipping a product that runs on CE and XP, but I do see great value in being able to run your CF assemblies through NUnit or Team Suite unit tests, which today cannot be done on a device.  This project is an enabler for that.

THe project is located at CodePlex as the OpenNETCF Advanced Debugging Toolkit.  Look for more pieces to the toolkit as time progresses.

Wednesday, August 23, 2006 1:00:11 PM (Central Daylight Time, UTC-05:00)  #     | 
# Tuesday, August 22, 2006

The Bitmap class in the Compact Framework is a confusing thing, largely because it has abstracted what the OS is doing underneath a little too far.  For example, look at the following code:

Bitmap bmp1 = new Bitmap(fileStream);
Bitmap bmp2 = new Bitmap(200, 200);

Let's assume that fileStream is a valid stream to a resource bitmap file that is 100x100 in size.  So is there any difference between bmp1 and bmp2, other than the fact bmp1 presumably has some color data in it?  The answer is yes - there's a very big difference, and that difference can have a huge impact on application performace as well as cause exceptions.

So let's look at this a little deeper with some examples.  Here's the first:

int iterations = 0;

while (true)
{
  try
  {
    iterations++;
    Bitmap b = new Bitmap(GetImageStream());
    if (iterations % 100 == 0)
    {
      Debug.WriteLine(string.Format("{0} objects", iterations));
    }
  }
  catch
  {
    Debug.WriteLine(string.Format("Failed after {0} objects", iterations));
    Debugger.Break();
  }
}

If I run this code (GetImageStream() just pulls an image from an embedded resource), the app will run forever, occasionally spitting out the debug port how many hundreds of objects it's created.  If you run RPM on it you'll see memory getting allocated, the GC firing occasionally and resources being freed up.  All is well in the world of managed code and everything is working as expected.  Hooray.

Now let's change that ever so slightly to this:

int iterations = 0;

while (true)
{
  try
  {
    iterations++;
    Bitmap b = new Bitmap(200, 200); // <--- CHANGED HERE
    if (iterations % 100 == 0)
    {
      Debug.WriteLine(string.Format("{0} objects", iterations));
    }
  }
  catch
  {
    Debug.WriteLine(string.Format("Failed after {0} objects", iterations));
    Debugger.Break();
  }
}

Note the single change where the bitmap is created.  Try running this and after a few iterations - the exact number depends on available device memory - it will OOM (throw an out of memory exception).  On the device in front of me it was about 40.

So the first thing to do is theorize why this would happen.  Seems like the Bitmap's resources aren't getting freed after it goes out of scope at the end of the while block.  An explicit call to Dispose() may solve it if that's the case, so let's try another test.

int iterations = 0;

Bitmap b = null;
while (true)
{
  if (b != null)  // explicit disposal
    b.Dispose();

  try
  {
    iterations++;
    b = new Bitmap(200, 200);
    if (iterations % 100 == 0)
    {
      Debug.WriteLine(string.Format("{0} objects", iterations));
    }
  }
  catch
  {
    Debug.WriteLine(string.Format("Failed after {0} objects", iterations));
    Debugger.Break();
  }
}

Sure enough, when we run this, it behaves like the first.  Strange that the Bitmap behaves differently depending on which constructor we use - this is contrary to common sense, right? 

So let's think a little more.  A Bitmap has a large area of unmanaged resources and some managed resources.  It seems that when we create a bitmap using the size ctor, the finalizer doesn't get run when an OOM happens.  Let's test again and see if that really is what's going on. We'll remove the explicit Dispose call and wait for the finalizers and try again when we OOM.

int iterations = 0;

while (true)
{
  try
  {
    iterations++;
    Bitmap b = new Bitmap(200, 200);
    if (iterations % 100 == 0)
    {
      Debug.WriteLine(string.Format("{0} objects", iterations));
    }
  }
  catch (OutOfMemoryException)
  {
    Debug.WriteLine("Waiting for finalizers to run..."));
    GC.WaitForPendingFinalizers();
    Bitmap b = new Bitmap(GetImageStream());
  }
}

When we run this one, again all is well in managed code land, though we see it waiting for finalizers to run a lot, and that catch is an expensive one for perf (as all exceptions are). At this point I think "well that surely has to be a bug" but I often like a second opinion, so I went right to the source and asked the CF team about the behavior.  The response from them is actually quite informative.  Their response in in italics below.

I think you are probably seeing is several interactions that can be quite confusing.

  1. Creating a bitmap using the stream constructor will construct a DIB (Device Independent Bitmap).
  2. Creating a bitmap using the width/height constructor will construct a DDB (Device Dependent Bitmap).
  3. DIB's are allocated out of the virtual address space of the application.
  4. DDB's are allocated by the driver. This typically means that they are allocated in the virtual address space of gwes.exe. Alternatively, the driver could allocate these in dedicated video ram.
  5. Creating a bitmap with the stream constructor will generate a fair amount of garbage as it copies data from one buffer to the other.

When we perform a GC because of an OOM in the stream constructor case, we will almost certainly have some amount of garbage that we can free back to the OS immediately. This will also trigger the finalizer to run on another thread as soon as possible. That should help the next call to bitmap creation.

When we perform a GC because of an OOM in the width/height constructor case, it is fairly likely that the OOM is caused because of virtual memory exhaustion in gwes.exe. Thus freeing memory in our process will not help the memory condition in gwes.exe. We need the bitmap finalizer to run before this would actually free memory in a way that would help this scenario. While the finalizer thread would certainly have been triggered to start, it most likely will not get a chance to free bitmaps before we OOM while trying to allocate a bitmap immediately after triggering a GC on the initial thread.

In short, we have 2 different types of Bitmap in our runtime with varying performance and allocation characteristics. DDBs are generally faster to manipulate and draw to the screen than DIBs, but they are constructed in an external memory space that can cause allocation confusion and cause the performance of calls to LockBits or Save to be slow. If a DIB is desired and you wish to construct it based on width and height, we provide a function that constructs a Bitmap with a width, height, and pixelformat specified. This function will construct a DIB instead of a DDB.

I personally still consider this a bug in the implementation - the CF should catch these occasions and handle it for us rather than OOMing all the way back to the app to wait for the Finalizers and retry - that's an implementation that should be done below us. 

Still the answer sheds light on the fact that how we create a Bitmap should be highly dependent on how we intend to use that Bitmap.

Tuesday, August 22, 2006 1:24:50 PM (Central Daylight Time, UTC-05:00)  #     | 
# Friday, June 30, 2006

Check this out!  Software stored on 12-inch vinyl.  That's insane.

Friday, June 30, 2006 12:29:24 PM (Central Daylight Time, UTC-05:00)  #     | 

It occurred to me today that I never posted my presentation on Memory Management in the CF from MEDC 2006, so here it is.

Friday, June 30, 2006 10:42:23 AM (Central Daylight Time, UTC-05:00)  #     | 
# Thursday, June 29, 2006

Alex Feinman found this interesting reference to us.  Evidently University of California at Berkeley "officially" recommends us in their CS160 curriculum.  How cool is that?  Have you seen the SDF on campus anywhere?  Let us know.

Thursday, June 29, 2006 10:45:31 PM (Central Daylight Time, UTC-05:00)  #     | 

Since most people like to try before they buy, we've released Evaluation Versions of our Calendar Controls.  Download the evaluation binaries and a sample project using them here.

Thursday, June 29, 2006 11:37:14 AM (Central Daylight Time, UTC-05:00)  #     |