Tuesday, June 30, 2009

A C++ class for histogramming data

Another recurring issue; How to effectively create a histogram of some data? A histogram is also called a frequency distribution.

I have created a small class in C++ that provides a histogram and can count the frequency of data values within a range. The code is supplied below for your amusement. If you have any idea on how to improve on the code, please let me know in a comment. Of course, the standard template library could be used, e.g. std::vector<unsigned int>, if you like to have an even more simple class. But I like code that have no includes :-)

// A histogram class.
// The Histogram object can keep a tally of values
// within a range, the range is arranged into some
// number of bins specified during construction.
// Any allocation of a Histogram object may throw
// a bad_alloc exception.
class Histogram
{
   public:
      // Construct a histogram that can count
      // within a range of values.
      // All bins of the histogram are set to zero.
      Histogram(
            const double& Start,
            const double& End,
            const unsigned int& nBins):
         Start(Start),
         nBins_by_interval(nBins/(End-Start)),
         nBins(nBins),
         freq( new unsigned int[nBins] )
      {
         for(unsigned int i(0); i < nBins; ++i)
            freq[i] = 0U;
      }
      // Construct a histogram by copying another one.
      Histogram(const Histogram& other):
         Start(other.Start),
         nBins_by_interval(other.nBins_by_interval),
         nBins(other.nBins),
         freq( new unsigned int[nBins] )
      {
         for(unsigned int i(0); i < nBins; ++i)
            freq[i] = other.freq[i];
      }
      // Deallocate the memory that was allocated for
      // the tallied counts.
      ~Histogram() {delete[] freq;}
      // Set this histogram equal to another.
      Histogram& operator=(const Histogram& other)
      {
         if( this != &other )
         {
            Start = other.Start;
            nBins_by_interval = other.nBins_by_interval;
            if( nBins != other.nBins )
            {
               nBins = other.nBins;
               delete[] freq;
               freq = new unsigned int[nBins];
            }
            for(unsigned int i(0); i < nBins; ++i)
               freq[i] = other.freq[i];
         }
         return *this;
      }
      // Increase the count for the bin that holds a
      // value that is in range for this histogram.
      void Add(const double& x)
      {
         const unsigned int i(
               static_cast<unsigned int>(
                  (x-Start)*nBins_by_interval) );
         if( i < nBins ) freq[i]++;
      }
      // Get the sum of all counts in the histogram.
      const unsigned int GetTotalCount() const
      {
         unsigned int c(0U);
         for( unsigned int i(0); i < nBins; ++i )
            c += freq[i];
         return c;
      }
   private:
      double Start,nBins_by_interval;
      unsigned int nBins;
      unsigned int* freq;
};

Sunday, June 28, 2009

Lessons learned: Quality of developed software & Programming challenges

Recently, I arranged a programming challenge that attracted some attention. Since I made an announcement on comp.lang.c++ on Usenet, most of the entries so far came from there. I also made an entry on Hacker News. The challenge was about counting frequencies of letters in a large text file, timing the execution time and then I compared the times between the entrants.

It was the first challenge that I arranged like this and of course there are lessons to be learned about how to arrange such a thing. There are noticeable similarities between arranging such a programming challenge and assuring quality of developed software that I would like to write in this post.

So, what follows is simply a list that compares the similarities that I found. All comments and suggestions for improvements are of course welcome. Make your voice heard in a comment!

Arranging Challenges Ensuring Quality
How is an entry judged, i.e. what is benchmarked in the challenge? (E.g. CPU time?, I/O? Startup time?) What are the functional and non-functional requirements from the stakeholders on the developed software?
What compiler, what version and what compiler options should be used when compiling an entry to the contest? What environment is the developed software supposed to work in?
Are there any special circumstances that have an impact on judging the challenge? E.g. Should the computer be freshly rebooted before starting a contest entry? Are there any requirements (from the stakeholders to the developed software) on the state of the machine to contain the developed software?
Be sure to completely and ambiguously specify input and output from an entry to the programming challenge. Specific requirements on the input/output to/from the software to develop are important to specify, together with all stakeholders.

Saturday, June 20, 2009

A modification to an entry to the programming challenge "Letter Frequency"

One of the entries to the programming challenge "Letter Frequency" had used reading from standard input to read the text file. It was discussed on the Usenet group comp.lang.c++ that it could be that some OS-dependent optimization occur in this case.

So, I have rewritten that entry somehat to read from a file instead of standard input, the code is posted below. I will include this into the challenge as well and possibly put the original codeunder a separate heading.

#include "limits.h"
#include "stdio.h"
#include "fstream"
#include "ctype.h"
int main(int argc,char** argv)
{
   size_t f [UCHAR_MAX+1] = { 0 };
   size_t size = 0;
   int c;

   std::ifstream inp(argv[1]);

   while ( inp )
   {
      c = inp.get();
      f [c]++;
      size++;
   }
   inp.close();

   for ( c = 0; c <= UCHAR_MAX; ++c )
      if ( islower( c ) )
         f [toupper( c )] += f [c];

   for ( c = 0; c <= UCHAR_MAX; ++c )
      if ( f [c] && !islower( c ) && isprint( c ) )
         printf( "%c %.3f%%\n", c, f[c]*100.0/size );

   return 0;
}

Wednesday, June 10, 2009

My own entry to programming challenge "Letter Frequency"

Here is my own entry to the programming challenge "Letter Frequency" that is written in another post.
#include "cctype"
#include "fstream"
#include "iostream"
#include "CStopWatch.hpp"
int main( int argc, char** argv)
{
   unsigned long int freq[256];
   for( int i(0); i < 256; ++i )
   {
      freq[i] = 0UL;
   }
   CStopWatch s;
   s.startTimer();
   std::ifstream f(argv[1],std::ios::binary);
   for( unsigned char buf[1024]; f.read((char*)buf,1024); )
   {
      for( std::streamsize i(0), num( f.gcount() );
           i < num; ++i )
      {
         ++freq[ buf[i] ];
      }
   }
   f.close();
   unsigned long int N(0UL);
   for( int i('a'); i <= 'z'; ++i )
   {
      N += freq[i];
   }
   for( int i('A'); i <= 'Z'; ++i )
   {
      N += freq[i];
      freq[ std::tolower((char)i) ] += freq[i];
      freq[i] = 0UL;
   }
   const double scale(100.0/N);
   for( int i('a'); i <= 'z'; ++i )
   {
      std::cout<<(char)i<<' '
               <<freq[i]*scale<<"%\n";
   }
   s.stopTimer();
   std::cout<<"Time [microseconds]: "
            <<s.getElapsedTime()*1.0E6<<'\n';
   return 0;
}
Outputs on my machine:
a 6.91958%
b 1.16229%
c 4.20517%
d 3.31721%
e 11.6445%
f 2.5592%
g 1.89503%
h 3.82255%
i 7.81837%
j 0.101068%
k 0.638897%
l 3.39662%
m 2.36789%
n 6.86543%
o 9.38132%
p 2.80104%
q 0.126336%
r 7.86529%
s 6.06411%
t 8.82183%
u 2.9743%
v 1.18033%
w 1.49798%
x 0.202137%
y 2.33179%
z 0.0397055%
Time [microseconds]: 846880

Tuesday, June 9, 2009

Programming Challenge: Letter frequency

(Results at the end of the post - Updated 2009-06-20 21:52 UTC.)

I have devised a small programming challenge for the world to work on. It has been suggested as a programming challenge for About.Com's C/C++ but it is given here as well in case they reject it. It goes as follows.

Write a program that calculate the frequency of characters in a given large text file as fast as possible. Write the frequency results and how long time it took on standard output in the format given below. Submit your code in a comment to this post.

This challenge goes on forever! I post new results in the end of this post as soon as there are any.

Rules:
  1. Your program should be case insensitive when it counts letters.
  2. The program should not read the text file from standard input. (Addendum 2009-06-20.)
  3. The program should be written using Standard C or C++ so that it can easily be compiled and run on various platforms.
  4. The 1.1 GB text file to use for counting, containing 1 byte characters, can be downloaded in a 339 MB compressed zip archive using one of the following ways.
  5. You must begin timing your code just before the text file is opened and you must end the timing after the results have been written to standard output. For timing, you can use the code given below (it works on POSIX and Windows systems).
Output format:
a 8.167%
b 1.492%
c 2.782%
d 4.253%
e 12.702%
f 2.228%
g 2.015%
h 6.094%
i 6.966%
j 0.153%
k 0.772%
l 4.025%
m 2.406%
n 6.749%
o 7.507%
p 1.929%
q 0.095%
r 5.987%
s 6.327%
t 9.056%
u 2.758%
v 0.978%
w 2.360%
x 0.150%
y 1.974%
z 0.074%
Time [microseconds]: 310
The frequency distribution given above is incidentally the relative frequencies of letters in the English language.

Timing code:

Use as follows in C++ (the code can be modified to be useful in C):
/*
 CStopWatch s;
 s.startTimer();
 The_Stuff_You_Want_To_Time_Executes_Here();
 s.stopTimer();
You may then get the elapsed time in seconds
via the method getElapsedTime.
*/
#ifdef WIN32
#include <windows.h>
class CStopWatch
{
private:
   typedef struct {
      LARGE_INTEGER start;
      LARGE_INTEGER stop;
   } stopWatch;
   stopWatch timer;
   LARGE_INTEGER frequency;
   double LIToSecs( LARGE_INTEGER & L)
   {
      return ((double)L.QuadPart/(double)frequency.QuadPart);
   }
public:
   CStopWatch()
   {
      timer.start.QuadPart=0;
      timer.stop.QuadPart=0;
      QueryPerformanceFrequency( &frequency );
   }
   void startTimer() { QueryPerformanceCounter(&timer.start); }
   void stopTimer() { QueryPerformanceCounter(&timer.stop); }
   double getElapsedTime()
   {
      LARGE_INTEGER time;
      time.QuadPart=timer.stop.QuadPart-timer.start.QuadPart;
      return LIToSecs( time );
   }
};
#else
#include <sys/time.h>
class CStopWatch
{
private:
   typedef struct {
      timeval start;
      timeval stop;
   } stopWatch;
   stopWatch timer;
public:
   void startTimer( ) { gettimeofday(&(timer.start),0); }
   void stopTimer( ) { gettimeofday(&(timer.stop),0); }
   double getElapsedTime()
   {
      timeval res;
      timersub(&(timer.stop),&(timer.start),&res);
      return res.tv_sec + res.tv_usec/1000000.0;
   }
};
#endif

Results

All programs was compiled using GNU g++/gcc on an Ubuntu i386 machine with 500 MB RAM using the -O3 optimization flag. NDEBUG was defined as well. All programs was run using the text file in the challenge description. Some of the programs did not use the (wall clock) timing code provided so I have used the UNIX time command to get their execution real time, marked with (*) below. (I used GNU make for building and executing the code using this Makefile.)

Each program was run 10 times and the average time was calculated using the spreadsheet available here. The list below shows relative times, compared to the fastest.

Standard C or C++ (As per the rules.)
  1. Chris M Thomasson 3 (C): 100,00 % (*)
  2. Jonathan Lee (C++): 100,02 %
  3. Pavel (C++): 100,09 % (*)
  4. Chris M Thomasson 4 (C): 100,12 % (*)
  5. Alf P Steinbach's modification of Chris' 2:nd code (C): 100,31 % (*)
  6. Meathe (C): 100,33 % (*)
  7. Nils (C++): 100,45 %
  8. Giannis T (C++): 100,71 %
  9. Chris M Thomasson 1 (C): 100,91 % (*)
  10. Chris M Thomasson 2 (C): 100,95 % (*)
  11. Agelos Anaptixi (C++): 101,05 % (*)
  12. Peter Jansson (C++): 101,10 %
  13. Ioannis Vranos (C++): 101,18 % (*)
  14. JT (C): 102,10 % (*)
  15. Graham Cox (C++): 102,28 % (*)
  16. Peter's modification on Blarrg's code (to read from file) (C++): 123,46 % (*)
  17. Jerry Coffin (C++): 244,82 % (*)
  18. Bart van Ingen Schenau (C++): 378,50 % (*)
Using other libraries (not following the rules)
  1. Blarrg (C++, reads from stdin): 76,67 % (*)
  2. MrX 2 (C++, Linux pthreads): 98,69 %
  3. MrX 1 (C++, Linux pthreads): 100,09 %
  4. Fifoforlifo (C++, Boost 1.37): 101,74 % (*)

Tuesday, June 2, 2009

BOUML: Build and install on Ubuntu 9.04 - Jaunty Jackalope.

BOUML is an Open Source UML 2 modeling tool. A quote from it's home page:

"BOUML is a free UML 2 tool box (under development) allowing you to specify and generate code in C++, Java, Idl, Php and Python.
BOUML runs under Unix/Linux/Solaris, MacOS X(Power PC and Intel) and Windows.
BOUML is very fast and doesn't require much memory to manage several thousands of classes, see benchmark.
BOUML is extensible, and the external tools named plug-outs can be written in C++ or Java, using BOUML for their definition as any other program. The code generators and reverses are ones of the pre-defined plug-outs included in the BOUML distribution."

In other words, a relatively nice tool to use in many circumstances. At the time of this writing, there was no recent BOUML release included in the 9.04 version of Ubuntu. So, I used the information at the BOUML home page and wrote a little shell script code for downloading, building and installing version 4.12.3 of BOUML including some dependencies needed (BOUML requires qt 3):

sudo apt-get install g++ qt3-dev-tools
wget http://downloads.sourceforge.net/bouml/bouml_4.12.3.tar.gz
tar xfz bouml_4.12.3.tar.gz
make -C bouml_4.12.3
sudo make -C bouml_4.12.3 install
make -C bouml_4.12.3 clean

Please feel free to comment on this post to report any errors or suggestions for improvements.