How much faster is assembly language?

On reading about the philosophy behind the Raspberry Pi and the emphasis on teaching programming I looked for a book I have called Problems For Computer Solution which I have used on occasion to learn. I was also asked, when talking about the ARM processor in my SheevaPlug at my local Linux User Group, how much faster code was when written in assembly language.

As an experiment, I chose to code the seventh problem, to locate all of the Armstrong numbers of 2, 3 or 4 digits.

I coded the different versions in the following order:

  1. Perl - short and clear (armstrong4.pl).
  2. C - using sprintf into a string to separate the digits. A little more involved (armstrong4string.c).
  3. Assembly language - I sketched a flow chart and then coded it (armstrong.s).
  4. Assembly language with a macro - I realised that I was repeating code in the previous version so abstracted it to a macro (armstrong4macro.s).
  5. A version in C which uses division to separate the digits and follows a similar algorithm to the assembly language version (armstrong4divide.c).

The code is listed in the appendix. See also

Timing

Here is the cpuinfo for the machine.

bob@poland:~/src/problems_for_computer_solution/07_armstrong_numbers$ cat /proc/cpuinfo 
Processor	: Feroceon 88FR131 rev 1 (v5l)
BogoMIPS	: 1192.75
Features	: swp half thumb fastmult edsp 
CPU implementer	: 0x56
CPU architecture: 5TE
CPU variant	: 0x2
CPU part	: 0x131
CPU revision	: 1

Hardware	: Marvell SheevaPlug Reference Board
Revision	: 0000
Serial		: 0000000000000000
bob@poland:~/src/problems_for_computer_solution/07_armstrong_numbers$

I extended the search space to 5 and 6 digits to allow for longer runtimes.

Maximum number
of digits
Perl C - string C - divide Assembly code
4
time perl armstrong4.pl
real 0m0.583s
user 0m0.580s
sys 0m0.000s
time ./armstrong4divide
real 0m0.256s
user 0m0.260s
sys 0m0.000s
time ./armstrong4string
real 0m0.267s
user 0m0.270s
sys 0m0.000s
time ./armstrong4macro
real 0m0.007s
user 0m0.020s
sys 0m0.000s
5
time perl armstrong5.pl
real 0m6.202s
user 0m6.180s
sys 0m0.020s
time ./armstrong5string
real 0m3.302s
user 0m3.300s
sys 0m0.000s
time ./armstrong5divide
real 0m3.198s
user 0m3.200s
sys 0m0.000s
time ./armstrong5macro
real 0m0.044s
user 0m0.060s
sys 0m0.000s
6
time perl armstrong6.pl
real 1m10.881s
user 1m10.650s
sys 0m0.010s
time ./armstrong6string
real 0m39.312s
user 0m39.200s
sys 0m0.000s
time ./armstrong6divide
real 0m40.903s
user 0m38.230s
sys 0m0.000s
time ./armstrong6macro
real 0m0.512s
user 0m0.510s
sys 0m0.000s

The assembly language is the first draft, apart from the abstraction of the macro. It could probably be further optimised to shave a few cycles if performance were important. The ARM is a RISC processor and the version I have in the SheevaPlug (5TE) has no divide instruction (though I think that ARMv7 does?). Division can be achieved via repeated subtraction and counting which is the approach followed here.

Engineering is often a tradeoff between different constraints - here coding time and run time. If the code is to be run once - or once a day, then it makes sense to write it in Perl (or some other high-level language); if, however, it is to be run a million times per day then it makes sense to invest the time to make it run efficiently.

I documented some preliminary investigations into assembly language programming on the ARM here.

I have just been reading about THUMB mode - which allows the 32 bit processor to run 16 bit instructions. There are, however, restrictions on what is permissible in this mode, and I am not convinced of the benefits of having smaller instruction (quicker to load and execute?). However, I was curious to see if the switch (.thumb) would work, and if it ran faster. It may, but requires investigation which I may do?

I have no experience of teaching, so if anyone has any ideas as to how I could improve this page, or the code, please email me.

Arnaud tested the code on his Nokia 900 phone, which is an ARMV7 with approx 250 BogoMIPS (c.f. the SheevaPlug with approx 1000 BogoMIPS). The relative performances of Perl, C and assembly language were similar to those seen on the SheevaPlug.

Appendix - The code

  1. Perl version
  2. #!/usr/bin/perl
    use strict;
    use warnings;
    
    foreach my $number (10 .. 9999) {
      my $size = length $number;
      my @digits = split(//, $number);
      my $total = 0;
      for (my $index = 0; $index < $size; $index++) {
        $total += $digits[$index] ** $size;
      }
      print "ARMSTRONG NUMBER is $number\n" if ($total == $number);
    }
    
    
  3. C versions
  4. N.B. These are functionally equivalent.

  5. Assembly language
  6. A Makefile for the armstrong code
  7. AS      := /usr/bin/as
    CC      := /usr/bin/gcc
    LD      := /usr/bin/ld
    
    ASOPTS  := -gstabs
    CCOPTS  := -g
    CLIBS   := -lm
    
    all: armstrong4 armstrong5 armstrong6
    
    #harness: harness.s armstrong4macro.s power.s
    #armstrong: armstrong4main.s armstrong.s power.s
    
    armstrong4: armstrong4macro armstrong4string armstrong4divide 
    armstrong4macro: armstrong4main.s armstrong4macro.s power.s
    armstrong4string: armstrong4string.c
    armstrong4divide: armstrong4divide.c
    
    armstrong5: armstrong5macro armstrong5string armstrong5divide
    armstrong5macro: armstrong5main.s armstrong5macro.s power.s
    armstrong5divide: armstrong5divide.c
    armstrong5divide: armstrong5divide.c
    
    armstrong6: armstrong6macro armstrong6string armstrong6divide
    armstrong6macro: armstrong6main.s armstrong6macro.s power.s
    armstrong6string: armstrong6string.c
    armstrong6divide: armstrong6divide.c
    
    
    %: %.c
    	$(CC) $(CCOPTS) -o $@ $^ $(CLIBS)
    
    clean:
    	rm -f armstrong harness armstrong4macro armstrong4string armstrong4divide armstrong5macro armstrong5string armstrong5divide armstrong6macro armstrong6string armstrong6divide