Machine Language: Count Faster on 6502

By Michael Doornbos

January 11, 2024 - 7 minutes read - 1289 words

I’ve done a lot of silly math, ciphers and asked a lot of my vintage hardware on this site over the years. But I’ve not talked a ton about optimizing code for faster results.

Today, we will count from 0 to 2^24-1 (16,777,215). Nothing else. Count, time it, and see how fast we can get it to go.

Rust

In Rust on my 2020 MacBook Pro:

use std::time::{Instant};

fn main() {
    // Start timing
    let start = Instant::now();
    // Loop from 0 to 16777215
    for _i in 0..=16777215 {
        // The loop body is empty, just like in the BASIC code
    }
    // Calculate elapsed time
    let duration = start.elapsed();
    // Convert elapsed time to seconds
    let elapsed_seconds = duration.as_secs() as f64
        + (duration.subsec_nanos() as f64 / 1_000_000_000.0);
    println!("Counting to 16777215 took {} seconds", elapsed_seconds);
    println!("That's {} additions per second", 16777216.0 / elapsed_seconds);
    println!("On 2 GHz Quad-Core Intel Core i5")
}

Counting to 2^24 in Rust

It’s so fast I gave ZERO thought to optimzing futher. Interesting if you’re into blazing speed I suppose.

Commodore 64 BASIC

Slower machines are great for learning optimization because the results can be dramatic. And when we need slow, the best tool is BASIC!

Better grab some coffee. And lunch. And maybe an afternoon snack. Also, go to the gym and walk the dog…

Counting to 2^24 in BASIC

Well, that took a while.

Let’s turn off the screen. The VIC-II steals time from the CPU quite often, so let’s turn it off and see what happens.

Counting to 2^24 in BASIC with the screen off

It IS 6% faster.

The Commodore 64 uses a variant of the 6502 processor, the 6510. I’m going to use the term 6502 to refer to both processors. Also, I did these two BASIC tests on an Ultimate 64 because I couldn’t bring myself to ask a 40 year old machine to sit in a loop incrementing for 24 hours to get these results. THINK OF THE CHILDREN!

Assembly

Naturally, we must turn to assembly language to make this more palatable. All day isn’t going to work.

Some housekeeping: I wrapped all of the following code sections with a couple of things that I don’t show here to keep this post from being the length of a novel:

Disabled interrupts
Changed the screen colors
Printed the count bytes to make sure they end up at FFFFFF
Printed the Commodore 64 Jiffy count that it took to complete in hexadecimal

The output screen, when done, looked like this:

Counting to 2^24 in Assembly

Nothing fancy. Just a quick and dirty way to get the results for each of these tests.

I used a breadbin C64 with all original parts except the PLA, which is a drop-in replacement on this motherboard as the original is in PLA heaven.

The one one the right is the test machine for all of these.

Breadbin C64

This code is all formatted for Turbo Macro Pro, so if you use a different assembler, you may have to make minor adjustments for your weapon of choice.

Method 1

My first approach is along the lines of the way my brain works:


counter

         clc
         lda count
         adc #$01
         sta count
         lda count+1
         adc #$00
         sta count+1
         lda count+2
         adc #$00
         sta count+2

         lda count+2
         cmp #$ff
         bne counter
         lda count+1
         cmp #$ff
         bne counter
         lda count
         cmp #$ff
         bne counter

Took 751 Seconds. Not bad, but not great.

The VIC-II steals time from the CPU quite often, so let’s turn the screen off and see what happens.

        ; disable screen
         lda $d011
         and #$ef
         sta $d011

         ;------------------------------
         ; Do stuff here


         ;------------------------------
         ; re-enable screen
         lda $d011
         ora #$10
         sta $d011

Took 703 Seconds. Hmm.

The three bytes we’re using here are in “high memory.” The 6502 accesses the zero-page addresses MUCH faster. What if we put those 3 bytes we’re incrementing in zero page?

Move

count    .byte 0,0,0

count = $fb

Took 583 Seconds. Still better…

We’re doing a lot of needless checking. Let’s iterate and try again.

Method 2

We can reduce the number of checks by looking for the carry bit to be clear on each addition. That could eliminate thousands of checks.

counter

         clc
         lda count
         adc #$01
         sta count
         bcc ahead
         lda count+1
         adc #$00
         sta count+1
         bcc ahead
         lda count+2
         adc #$00
         sta count+2

ahead

         lda count+2
         cmp #$ff
         bne counter
         lda count+1
         cmp #$ff
         bne counter
         lda count
         cmp #$ff
         bne counter

It took 441 seconds with the screen on, 412 seconds with the screen off, and 289 seconds with the screen off and zero page.

Much better, but let’s keep going.

Method 3

inc is fewer cpu cycles than adc, so let’s try using that opcode instead:

counter

         inc count
         bne done
         inc count+1
         bne done
         inc count+2

done
         lda count+2
         cmp #$ff
         bne counter
         lda count+1
         cmp #$ff
         bne counter
         lda count
         cmp #$ff
         bne counter

Took 331 seconds with the screen on, 310 seconds with the screen off and 275 seconds with the screen off and zero page.

Method 4

Credit is due to Gregorio Nacu for this one, who pointed out that we really only need to check if we’re done when we increment the high bit

counter3

         inc count
         bne counter3
         inc count+1
         bne counter3
         inc count+2
         lda count+2

         cmp #$ff
         bne counter3

counter2
         inc count
         bne counter2
         inc count+1
         lda count+1
         cmp #$ff
         bne counter2

counter1
         inc count
         lda count
         cmp #$ff
         bne counter1

It took 165 seconds with the screen on, 155 seconds with the screen off, and 138 seconds with the screen off and zero page.

Method 5

This one came from both David Youd and Robin Harbron within minutes of each other. They both pointed out that if you didn’t want to stop at an arbitrary value and only wanted to get right to before all three overflow, you could check for the overflow and be off by one.

Then Robin added the dec idea, which we’re calling the “slam on the brakes and throw her in reverse” method.

This only works for counting to the max value, but keep this in your toolbox for those cases. It might be helpful.

counter

         inc count
         bne counter
         inc count+1
         bne counter
         inc count+2
         bne count24


done

         dec count
         dec count+1
         dec count+2

It took 165 seconds with the screen on, 155 seconds with the screen off, and 110 seconds with the screen off and zero page.

Keep going?

When we started, we were at 751 seconds. We’re now down to 110 seconds. That’s a 583% improvement.

At this point, I’m pretty happy with these results. There are a couple more things we could consider, but I will encourage you to try those. IF you can beat these times, please let me know. I’d love to see what you come up with. And show your work, none of this “back in my day, we did it this way, and it was faster” stuff. If you’re gonna brag, you gotta prove it.

I dare you. Or do it for fun. Or science. Or whatever. Just do it.

Results

Method	Time (seconds)	Time w/Screen Blanked (seconds)	Blanked with Zero Page (seconds)
BASIC	21793	20506	N/A
ASM 1	751	703	583
ASM 2	441	412	289
ASM 3	331	310	275
ASM 4	165	155	138
ASM 5	165	155	110

This is an NTSC C64. PAL could be a little slower.

I was surprised that methods 4 and 5 were the same number of jiffies for the two non-zero page tests. $26bf(9919) and $2447 (9287)

We might have to do a follow-up post to explain why that is.

Thanks to:

Many thanks to Robin, David, Gregorio, and Adam for discussing on X and Signal on this one. I had way more fun on this because of you guys than if I had just done it one my own.

Counting to 2^24 in Assembly