Wednesday, July 15, 2009

A C++ class for histogramming data (2). Optimized?

In a recent post, I published a simple C++ class for handling histogramming of data. Recently, I have been trying to beat my compiler by writing some assembly code by hand for some of the methods in the class. The assembly parts have been written in NASM and linked in as external "C" functions used in various class methods.

A simple test program that measured elapsed wall clock time was constructed that enabled comparison between the compiler generated executable and the executable containing hand written NASM functions.

When the (C++) compiler was set to do it's job without any optimization turned on, the handwritten assembly functions was much faster. However, with optimization turned on, there was a very small difference in speed. You can see the relative difference in speed in the table below. Note also that the compiler optimizations does not do much with the "wrapper code" when the NASM functions are used, optimization is performed on the C++ code only.

It is interesting to see that using relatively simple assembly functions gives results that are comparable to the optimizations performed by the compiler. Note that REP STOSD and REP MOVSD was used, which is often reported to be slower than unrolling the loops and using simple compare and jumps. The compilation and linking time is much faster when optimization is turned off.

So, what did I do exactly? Well, the class contains essentially three different loops for initialization, copying and summing of data in an array of unsigned integers. So, assembly functions for those essential parts was written.

Now, to some code. First, the modified Histogram class is displayed. It was somehwat modified compared to the original post, you can compare it if you feel inclined to do so. The code now contains series of preprocessor ifdef's that determines if the assembly functions should be used or the original C++ code. The variable USE_NASM was defined on the command line to redirect compilation and linking, depending on whether or not assembly functions was to be used or not.

// Some functions that are implemented in NASM.
#ifdef USE_NASM
#include "memcpy32.h"
#include "memset32.h"
#include "sum32.h"
#endif
// A histogram class.
// The Histogram object can keep a tally of values
// within a range, the range is arranged into some
// number of bins specified during construction.
// Any allocation of a Histogram object may throw
// a bad_alloc exception.
// No range-checking is performed within this object.
class Histogram
{
   public:
      // Construct a histogram that can count
      // within a range of values.
      // All bins of the histogram are set to zero.
      Histogram(
            const double& Start,
            const double& End,
            const unsigned int& nBins):
         Start(Start),
         nBins_by_interval(nBins/(End-Start)),
         nBins(nBins),
         freq( new unsigned int[nBins] )
      {
#ifdef USE_NASM
         memset32( freq, 0U, nBins );
#else
         for(unsigned int i(0); i < nBins; ++i)
            freq[i] = 0U;
#endif
      }
      // Construct a histogram by
      // copying another one.
      Histogram(const Histogram& other):
         Start(other.Start),
         nBins_by_interval(other.nBins_by_interval),
         nBins(other.nBins),
         freq( new unsigned int[nBins] )
      {
#ifdef USE_NASM
         memcpy32( freq, other.freq, nBins );
#else
         for(unsigned int i(0); i < nBins; ++i)
            freq[i] = other.freq[i];
#endif
      }
      // Deallocate the memory that was
      // allocated for the tallied counts.
      ~Histogram() {delete[] freq;}
      // Set this histogram equal to another.
      Histogram& operator=(const Histogram& other)
      {
         if( this != &other )
         {
            Start = other.Start;
            nBins_by_interval
               = other.nBins_by_interval;
            if( nBins != other.nBins )
            {
               nBins = other.nBins;
               delete[] freq;
               freq = new unsigned int[nBins];
            }
#ifdef USE_NASM
            memcpy32( freq, other.freq, nBins );
#else
            for(unsigned int i(0); i < nBins; ++i)
               freq[i] = other.freq[i];
#endif
         }
         return *this;
      }
      // Increase the count for the bin that holds a
      // value that is in range for this histogram.
      void Add(const double& x)
      {
         const unsigned int i(
               static_cast<unsigned int>(
                  (x-Start)*nBins_by_interval) );
         freq[i]++;
      }
      // Get the sum of all counts in the histogram.
      const unsigned int GetTotalCount() const
      {
#ifdef USE_NASM
         return sum32(freq,nBins);
#else
         unsigned int c(0U);
         for( int i(nBins-1); i >= 0; --i )
            c += freq[i];
         return c;
#endif
      }
      // Get the sum of all counts between the
      // bins specified. The bins must not be
      // out-of-range and end must be larger
      // than begin.
      const unsigned int GetSum(
            const unsigned int begin,
            const unsigned int end) const
      {
#ifdef USE_NASM
         return sum32( &freq[begin], end-begin );
#else
         unsigned int c(0U);
         for( unsigned int i(begin); i < end; ++i )
            c += freq[i];
         return c;
#endif
      }
   private:
      double Start,nBins_by_interval;
      unsigned int nBins;
      unsigned int* freq;
};

The header files included at the top of the Histogram class are as follows:

#ifndef TYPES_H
#define TYPES_H
typedef unsigned int uint32_t;
typedef unsigned int size_t;
#endif

#ifndef MEMCPY32_H
#define MEMCPY32_H
#include "types.h"
extern "C"
{
 // Copies n 32 bit chunks of the memory
 // area pointed to by src to the memory
 // pointed to by dest.
 void memcpy32(uint32_t *dest, uint32_t *src, size_t n);
}
#endif

#ifndef MEMSET32_H
#define MEMSET32_H
#include "types.h"
extern "C"
{
 // Fills the n blocks of the memory
 // area pointed to by s with c
 void memset32(uint32_t *s, uint32_t c, size_t n);
}
#endif

#ifndef SUM32_H
#define SUM32_H
#include "types.h"
extern "C"
{
   // Sum n consecutive elements from
   // the memory started at pos.
   uint32_t sum32( const uint32_t* pos, size_t n);
}
#endif

The assembly functions, finally, are given below:

global memcpy32
section .text
memcpy32:
MOV edx,edi
MOV eax,esi
MOV edi,[esp+4]
MOV esi,[esp+8]
MOV ecx,[esp+12]
REP MOVSD
MOV esi,eax
MOV edi,edx
RET
global memset32
section .text
memset32:
MOV edx,edi
MOV edi,[esp+4]
MOV eax,[esp+8]
MOV ecx,[esp+12]
REP STOSD
MOV edi,edx
RET
global sum32
section .text
sum32:
MOV edx,[esp+4]
MOV ecx,[esp+8]
XOR eax,eax
next:
ADD eax,[edx]
ADD edx,4
DEC ecx
JNZ next
RET

The test program used for measuring the speed difference is included below. As you can see, it contains also a small class for actually measuring the time spent called CStopWatch. (Credits goes to David Bolton with About.Com for publishing the timing code.)

#include "Histogram.cpp"
#include "CStopWatch.cpp"
#include <iostream>
using namespace std;
int main()
{
   const unsigned int nBins(10000000);
   CStopWatch sw;

   cout << "Testing allocation and "
      "initialization to zero (memset).";
   sw.startTimer();
   for( int i(0); i < 1000; ++i )
   {
      Histogram h(0.0,i*0.001,nBins);
   }
   sw.stopTimer();
   cout << " Time: " << sw.getElapsedTime() << " s.\n";

   cout << "Testing construction copy (memcpy).";
   Histogram h(0.0,1.0,nBins);
   sw.startTimer();
   for( int i(0); i < 1000; ++i )
   {
      Histogram h2(h);
   }
   sw.stopTimer();
   cout << " Time: " << sw.getElapsedTime() << " s.\n";

   cout << "Testing copy assigment(memcpy).";
   sw.startTimer();
   Histogram h2(0.0,1.0,1000); 
   for( int i(0); i < 1000; ++i )
   {
      h2 = h;
   }
   sw.stopTimer();
   cout << " Time: " << sw.getElapsedTime() << " s.\n";

   cout << "Adding 1000 values to a histogram.\n";
   for( int i(0); i < 1000; ++i )
   {
      h.Add( i * 0.001 );
   }

   cout << "Testing sum of all elements (sum32).";
   unsigned int cc(0U);
   sw.startTimer();
   cc = h.GetTotalCount();
   sw.stopTimer();
   cout << " Time: " << sw.getElapsedTime()
      << " s. (sum = " << cc << ")\n";

   cout << "Testing sum of "<<(5400000-100)
      <<" elements (sum32).";
   unsigned int cc2(0U);
   sw.startTimer();
   cc2 = h.GetSum(100,5400000);
   sw.stopTimer();
   cout << " Time: " << sw.getElapsedTime()
      << " s. (sum = " << cc2 << ")\n";
}