Tuesday, June 9, 2009

Programming Challenge: Letter frequency

(Results at the end of the post - Updated 2009-06-20 21:52 UTC.)

I have devised a small programming challenge for the world to work on. It has been suggested as a programming challenge for About.Com's C/C++ but it is given here as well in case they reject it. It goes as follows.

Write a program that calculate the frequency of characters in a given large text file as fast as possible. Write the frequency results and how long time it took on standard output in the format given below. Submit your code in a comment to this post.

This challenge goes on forever! I post new results in the end of this post as soon as there are any.

Rules:
  1. Your program should be case insensitive when it counts letters.
  2. The program should not read the text file from standard input. (Addendum 2009-06-20.)
  3. The program should be written using Standard C or C++ so that it can easily be compiled and run on various platforms.
  4. The 1.1 GB text file to use for counting, containing 1 byte characters, can be downloaded in a 339 MB compressed zip archive using one of the following ways.
  5. You must begin timing your code just before the text file is opened and you must end the timing after the results have been written to standard output. For timing, you can use the code given below (it works on POSIX and Windows systems).
Output format:
a 8.167%
b 1.492%
c 2.782%
d 4.253%
e 12.702%
f 2.228%
g 2.015%
h 6.094%
i 6.966%
j 0.153%
k 0.772%
l 4.025%
m 2.406%
n 6.749%
o 7.507%
p 1.929%
q 0.095%
r 5.987%
s 6.327%
t 9.056%
u 2.758%
v 0.978%
w 2.360%
x 0.150%
y 1.974%
z 0.074%
Time [microseconds]: 310
The frequency distribution given above is incidentally the relative frequencies of letters in the English language.

Timing code:

Use as follows in C++ (the code can be modified to be useful in C):
/*
 CStopWatch s;
 s.startTimer();
 The_Stuff_You_Want_To_Time_Executes_Here();
 s.stopTimer();
You may then get the elapsed time in seconds
via the method getElapsedTime.
*/
#ifdef WIN32
#include <windows.h>
class CStopWatch
{
private:
   typedef struct {
      LARGE_INTEGER start;
      LARGE_INTEGER stop;
   } stopWatch;
   stopWatch timer;
   LARGE_INTEGER frequency;
   double LIToSecs( LARGE_INTEGER & L)
   {
      return ((double)L.QuadPart/(double)frequency.QuadPart);
   }
public:
   CStopWatch()
   {
      timer.start.QuadPart=0;
      timer.stop.QuadPart=0;
      QueryPerformanceFrequency( &frequency );
   }
   void startTimer() { QueryPerformanceCounter(&timer.start); }
   void stopTimer() { QueryPerformanceCounter(&timer.stop); }
   double getElapsedTime()
   {
      LARGE_INTEGER time;
      time.QuadPart=timer.stop.QuadPart-timer.start.QuadPart;
      return LIToSecs( time );
   }
};
#else
#include <sys/time.h>
class CStopWatch
{
private:
   typedef struct {
      timeval start;
      timeval stop;
   } stopWatch;
   stopWatch timer;
public:
   void startTimer( ) { gettimeofday(&(timer.start),0); }
   void stopTimer( ) { gettimeofday(&(timer.stop),0); }
   double getElapsedTime()
   {
      timeval res;
      timersub(&(timer.stop),&(timer.start),&res);
      return res.tv_sec + res.tv_usec/1000000.0;
   }
};
#endif

Results

All programs was compiled using GNU g++/gcc on an Ubuntu i386 machine with 500 MB RAM using the -O3 optimization flag. NDEBUG was defined as well. All programs was run using the text file in the challenge description. Some of the programs did not use the (wall clock) timing code provided so I have used the UNIX time command to get their execution real time, marked with (*) below. (I used GNU make for building and executing the code using this Makefile.)

Each program was run 10 times and the average time was calculated using the spreadsheet available here. The list below shows relative times, compared to the fastest.

Standard C or C++ (As per the rules.)
  1. Chris M Thomasson 3 (C): 100,00 % (*)
  2. Jonathan Lee (C++): 100,02 %
  3. Pavel (C++): 100,09 % (*)
  4. Chris M Thomasson 4 (C): 100,12 % (*)
  5. Alf P Steinbach's modification of Chris' 2:nd code (C): 100,31 % (*)
  6. Meathe (C): 100,33 % (*)
  7. Nils (C++): 100,45 %
  8. Giannis T (C++): 100,71 %
  9. Chris M Thomasson 1 (C): 100,91 % (*)
  10. Chris M Thomasson 2 (C): 100,95 % (*)
  11. Agelos Anaptixi (C++): 101,05 % (*)
  12. Peter Jansson (C++): 101,10 %
  13. Ioannis Vranos (C++): 101,18 % (*)
  14. JT (C): 102,10 % (*)
  15. Graham Cox (C++): 102,28 % (*)
  16. Peter's modification on Blarrg's code (to read from file) (C++): 123,46 % (*)
  17. Jerry Coffin (C++): 244,82 % (*)
  18. Bart van Ingen Schenau (C++): 378,50 % (*)
Using other libraries (not following the rules)
  1. Blarrg (C++, reads from stdin): 76,67 % (*)
  2. MrX 2 (C++, Linux pthreads): 98,69 %
  3. MrX 1 (C++, Linux pthreads): 100,09 %
  4. Fifoforlifo (C++, Boost 1.37): 101,74 % (*)

32 kommentarer:

Ioannis Vranos said...

Give a try using -O3 with g++.

Peter said...

I changed it from -O2 to -O3.

Chris M. Thomasson said...

BTW, did you make sure to define NDEBUG (e.g., add -DNDEBUG to command line)? My crappy little application uses some `assert()' calls, and defining NDEBUG will convert them into NOP's.

Peter said...

I will post new results with NDEBUG defined as soon as possible. :)

Anonymous said...

This is pointless, 99.9% of the time will be I/O

JT said...

// Here's my solution, looks pretty much the same
// as the other c solutions.
#include <stdio.h>
#define BUFSIZE 4096
int
main(int argc, char ** argv)
{
    int i=0, r=0;
    char c;
    unsigned long total = 0;
    FILE* fp = NULL;
    char buffer[BUFSIZE] = {0};

    char lookupCase[256] = {0};
    unsigned long counts[265] = {0};

    if( argc <= 1 )
    {
        printf("%s <file>\n", argv[0]);
        return 1;
    }

    fp = fopen(argv[1], "r");
    if ( fp == NULL ) return 2;

    // Could do this a head of time but I'm lazy :)
    for(i=0;i<256;i++)
    {
        if(i >= 'A' && i <= 'Z')
        {
            lookupCase[i] = i - 'A' + 'a';
        }else{
            lookupCase[i] = i;
        }
    }

    // Count each char.
    while( (r=fread(buffer,BUFSIZE,1,fp)) > 0 )
    {
        for(i=0;i<r;i++)
        {
            c=buffer[i];
            counts[lookupCase[c]]++;
        }
    }

    // Get total
    for(i='a';i<='z';i++)
    {
        total += counts[i];
    }
    //printf("total: %lu\n", total);
    // Print output
    for(i='a';i<='z';i++)
    {
        printf("%c %2.3f%%\n", i, (float)(100*counts[i]) / (float)total );
    }
    return 0;
}

Peter said...

JT, I have timed also your code.

grahamcox said...

0.412 seconds according to the time program. The code's not the tidiest but it works. It is *much* slower on a Windows machine than a Linux one though...

http://pastebin.com/f177eecba

Peter said...

Graham, I will include your contribution a.s.a.p into the results collection in the weblog post.

Mr X said...

The code by Chris M. Thomasson only times CPU, not total runtime. Isn't this cheating? The other code that uses the class you provide will always give a slower time because it gives TOTAL runtime.

Mr X said...

Yeah, you're posting two different things here for the results. And I think each program should post both if you're using Linux. Have the execution AND the total runtime. You have both mixed together here. If a program uses clock(), you're only getting the execution time and not the total runtime given by your class above.

Please correct your results! Otherwise, we don't know which one is better.

BTW, what are you interested in? The time it takes to process the data without regard to I/O time. Or do you want to time the WHOLE thing, I/O and everything?

Would it be possible to include a special program I wrote? It only works on Linux, but I'd like to see how it compares to the others.

Meathe said...

Here's yet another C solution that looks very similar. I/O is the bottleneck, choosing an appropriate buffer size improves performance.


#include &ltstdio.h>
#include &lttime.h>

#define BUFFER 2048U // 4096 works beter with NTFS. Ideal size will vary across systems

int main (int argc, char** argv)
{
int tallies[256];
float total=0;
unsigned char i;
for (i='A';i<='Z';i++)
tallies[i]=0, tallies[i|0x20]=0;

clock_t const start = clock();
FILE* file = fopen(argv[1], "rb");

if (file)
{
size_t size;
static unsigned char buffer[BUFFER];

while ((size = fread(buffer, 1, BUFFER, file)))
{
// 99.3% of the runtime...
size_t tmpsize = size;
while (tmpsize)
++tallies[buffer[--tmpsize]];
}

for (i = 'A'; i <= 'Z'; ++i)
total+=tallies[i]+tallies[i | 0x20]; // Bitwise op converts to lowercase
for (i = 'A'; i <= 'Z'; ++i)
printf("%c %.3f%%\r\n", (i | 0x20), (total) ? (float)(tallies[i]+tallies[i | 0x20]) / total * 100.0 : 0.0);

printf("\r\nelapsed time: %lums\n", (unsigned long)((((double)clock() - (double)start) / CLOCKS_PER_SEC) * 1000.0));
}
return 0;
}

MrX said...

Here is my code. I was wondering if you could time it even though it only runs on Linux. I'm curious to see how it holds up (even if it doesn't satisfy requirement #2).

http://pastebin.com/f7fa5f121

Compile with -lpthread

Peter said...

MrX, I agree with you and will change how the timing results are to be displayed. Will update a.s.a.p. Will see what I do with your Linux-only code.

Meathe, I will include your code a.s.a.p.

Peter said...

MrX, I have now included your program as well.

MrX said...

Hi, could you replace my program with this one? I'm simply curious to its time because I took 20% off of the execution time. This will be my last entry.

http://pastebin.com/f4120e96e

As for JT's program, it has a bug. It only counts 1 character per buffer read. That's 1 letter count for every 4K. BUFFER_SIZE should be the third argument, not the second.

Since JT's program does zero work, this means that we've hit a theoretical limit and we are bound solely by I/O timings. This would mean that the top spot gains an advantage by its buffer size alone.

Peter said...

MrX, I will include your recent code a.s.a.p. Can you pinpoint which line in JT's code that has the bug that you mention?

Ioannis Vranos said...

Opening the text file in binary mode is ==> definitely a mistake, so I think you should remove all such answers from the valid answers.

Peter said...

I will keep the binary mode programs in the list since it interesting to see if it makes any speed difference compared to the non-binary mode programs.

Pavel said...

#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[])
{
const char *fileName = argc == 1
? "LetterFrequencyInput.txt"
: argv[1];
FILE *in = fopen(fileName, "rb");
if(!in) {
printf("ERROR: failed to open %s\n", fileName);
return 2;
}

const size_t BUF_SZ = 256u * 256u;
unsigned char buf[BUF_SZ];

const size_t N_COUNTERS = 256u;
int freqCounters[N_COUNTERS];

while(!(feof(in) || ferror(in))) {
size_t res = fread(buf, 1, BUF_SZ, in);
int nRem = res % 8;
int nMult8 = res - nRem;
int nOcts = nMult8 >> 3;
while(nRem--)
++freqCounters[(size_t)buf[nRem]];
while(nOcts--) {
++freqCounters[(size_t)buf[--nMult8]];
++freqCounters[(size_t)buf[--nMult8]];
++freqCounters[(size_t)buf[--nMult8]];
++freqCounters[(size_t)buf[--nMult8]];
++freqCounters[(size_t)buf[--nMult8]];
++freqCounters[(size_t)buf[--nMult8]];
++freqCounters[(size_t)buf[--nMult8]];
++freqCounters[(size_t)buf[--nMult8]];
}
}

const int N_LETTERS = 'z' - 'a' + 1;
int letterFreqs[N_COUNTERS];
int lcIdx = 'a';
int ucIdx = 'A';
double total = 0;
while(lcIdx <= 'z') {
int sum = freqCounters[lcIdx] + freqCounters[ucIdx];
letterFreqs[lcIdx] = sum;
++lcIdx;
++ucIdx;
total += sum;
}
double scale = total ? 100 / total : 0;
for(int i = 0; i < N_LETTERS; ++i) {
size_t idx = (size_t)'a' + i;
printf("%c %f%%\n", (char)(idx), letterFreqs[idx] * scale);
}
return 0;
}

Pavel said...

Opening the text file in a binary mode is not a mistake.

All the project rules say about the file is: "The 1.1 GB text file to use for counting, containing 1 byte characters".

The project rules do not mention the exact encoding of the file but it is clear it is some one particular encoding. You cannot assume it will be same encoding your C library will use if you open the file in "text" mode. You have to study the file. I studied the file and found out it is in ASCII encoding. So, opening it in "binary" mode is the right thing to do. If I knew the encoding were different, opening it in binary mode and decoding would be the right thing to do, too. It is because the project rules do not allow you to assume any particular encoding on the system where your program will be used.

Peter said...

Pavel, I will include your program in the list a.s.a.p.

KHD said...

The JT solution has a bug causing it to count only 70294 of the
287924224 total characters in the file. The following statement

r=fread(buffer,BUFSIZE,1,fp))

should be

r=fread(buffer,1,BUFSIZE,fp))

otherwise r is always 1 (except for the last read) causing the
inner for loop to examine only the 1st character of each block.
That is why the JT solution is so much faster than the others.
It is much slower when this error is corrected.

Here is actual counts (rather than %) for the bugged and fixed
JT code (where * are totals):

bugged fixed
* 70294 * 287924224
a 3834 a 15704064
b 644 b 2637824
c 2330 c 9543680
d 1838 d 7528448
e 6452 e 26427392
f 1418 f 5808128
g 1050 g 4300800
h 2118 h 8675328
i 4332 i 17743872
j 56 j 229376
k 354 k 1449984
l 1882 l 7708672
m 1312 m 5373952
n 3804 n 15581184
o 5198 o 21291008
p 1552 p 6356992
q 70 q 286720
r 4358 r 17850368
s 3360 s 13762560
t 4888 t 20021248
u 1648 u 6750208
v 654 v 2678784
w 830 w 3399680
x 112 x 458752
y 1292 y 5292032
z 22 z 90112

Finally the format string in

printf("%c %2.3f%%\n", i, (float)(100*counts[i]) / (float)total );

is bugged. It should be

printf("%c %7.3f%%\n", i, (float)(100*counts[i]) / (float)total );

ie %7.3f not %2.3f since floating point field width is the total
width not the width of the portion left of the decimal point. So
the JT timing results should be removed until fixed; and outputs
should be compared against a known correct output showing giving
the actual counts rather than normalized percentages.

Without correctness checks (in the form of verified output)
we cannot judge performance. This is a perfect example.

KHD

MrX said...

Could you recheck Blarrg's program. I'm having a difficult time buying that getchar() is beating out directly accessed buffered input. Blarrg's program is close to twice as slow on at least two of my machines. How are you feeding it the data? I tried piping and redirection. Both are slow.

Peter said...

I will recheck a.s.a.p.

Peter said...

Blarrg's program has been re-run. Similar results as posted, around 20-21 seconds on my machine.

Chris M. Thomasson said...

I would be interested to see if the following program:

http://groups.google.com/group/comp.lang.c++/msg/2f8903a47b5c493c

gets a little better performance results over my previous submissions. I think it should do slightly better because I am not processing the buffer backwards; I don't know why I did it that way to begin with... I guess it was temporary dyslexia or something...

;^(...

Peter said...

It has been timed. I would say that the difference between the results is within the uncertainty limit of the statistics.

MrX said...

Sorry, but something is wrong with your benchmarks if Blarrg's program is top spot. I've never heard of getc() being faster than fread(). Not only that, but Blarrg's program actually counts each character one by one instead of taking the sum at the end. On all my machines, it runs slow as mud. However, all the other programs run in the same respective times.

Something's out of whack with your benchmarks.

Peter said...

As someone commented on comp.lang.c++, I believe that Blarrg's time is so fast because his program is fed the text on standard input using some OS-dependent optimization. It could be interesting to see what happens with that time if it is rewritten to read from file instead of standard inout.

Peter said...

I have re-written Blarrg's code to read from file instead of standard input. It does indeed increase the time needed to read the file.

MrX said...

Well, OS optimization can only do so much. I'm troubled that billions of functions calls can be faster than buffered reads. I'm really curious as to what's truly going on.

Moving the benchmark wasn't the objective. If it's faster with stdin, then so be it. I'll stand in awe if everything pans out. Unfortunately, I cannot reproduce it. I'll ask a friend to run it on his machines.

Post a Comment