How fast can a 6502 transfer memory?

Marketing Parody Image by Gregorio Naçu at c64os.com

The amazing Gregorio Naçu posted the article title graphic this week to bring attention to the venerable 6502 processor and poke fun at Apple's M2 chip marketing slides. He's doing probably the most ambitious single-person Commodore 64 project I know of and has a fantastic blog.

Apple claims the new M2 chip has the following specs.

M2 features. Image by Apple via YouTube

We all know that these numbers are probably a little fluffy. Maybe a lot fluffy, and in practical applications, they are probably pretty far off. Benchmarking in a lab is fine, but the numbers rarely reflect real-world performance.

Tom's Hardware has an excellent breakdown of this new chip. It does look pretty neato!

How fast is a 6502?

After Gregorio posted this image earlier this week, it sparked a fair amount of discussion on the interwebs about the memory transfer speed of a 6502 processor.

The 6502 on Commodore machines shares the clock with the video chip. Since dual-ported RAM wasn't financially feasible at the time, the designers chose a memory access trick that lets both the video chip and the processor access memory during a single clock cycle. I think it's the same on most Commodores, but on the VIC-20, the processor accesses memory on the low part of the signal and the VIC chip on the high part. Maybe that's backward... anyhoo, you get the point.

VIC-20 PAL Clock signal from the 6561

Memory at 1MB per second

Going back to the slide, this 1MB per second memory bandwidth figure is what folks are questioning.

On every clock cycle, the 6502 performs one memory access somewhere: an opcode fetch, an operand, the stack, zero page, and so on. So at 1 MHz, typical for Commodore machines, this 1MB per second bandwidth is probably accurate in a vacuum, where marketing people hang out.

Image by Gregorio Naçu at c64os.com

It's important to note that Gregorio Naçu's slide was a parody and not intended to be a hard numbers accurate kind of thing. Please remember that because if you don't, the rest of this discussion will ruffle your feathers.

Testing real-world block transfers

We'll try some memory transfers to get an idea of what actual transfer speeds might look like using standard Commodore hardware. Other 6502-based platforms might be faster or slower, so I encourage you to try some tests of your own, and please let me know what you find.

We're going for average user experience, NOT "how fast can this processor perform in a lab."

Think of Suzie, the tech writer, opening a document on her computer. That's more what we're going for with these tests.

Again, remember that transferring memory takes more clock cycles than just reading or writing...

The Commodore 64 Version

Let's give this a go on the most popular 6502-based system of all time, the Commodore 64.

Everyone has a heads-up display for their Commodore 64 these days.

The transfer

We'll take a cue right from the venerable Rodnay Zaks.

Incidentally, Robin did a long video fixing a bug in this book's implementation. I'll be using the revised version, as I think it's a well-established example of doing a real-world block transfer. Sure, there may be faster ways, but this is a realistic way, which is what we're going for.

You can read this excellent chapter on how this works, and Robin's video goes into it in great detail. Here's what we're going to do:

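source   = $0800          ;start of the block to copy
dest     = $4800          ;where the copy goes
len      = $4000          ;number of bytes to copy (16K)
from     = $fb            ;zero-page pointer to the source
to       = $fd            ;zero-page pointer to the destination
tmpx     = $a6            ;scratch byte for the outer loop counter


copyr
         .block

         lda #<source     ;point from/to at the start
         sta from         ;of the source and destination
         lda #>source
         sta from+1
         lda #<dest
         sta to
         lda #>dest
         sta to+1

         ldy #0
         ldx #>len        ;number of whole 256-byte pages
         beq remain       ;none? just do the leftover bytes
next     lda (from),y     ;copy one byte
         sta (to),y
         iny
         bne next         ;until y wraps, i.e. one full page
         inc from+1       ;step both pointers to the next page
         inc to+1
         dex
         bne next         ;more whole pages to go
remain   ldx #<len        ;bytes left over after the whole pages
         beq done
nextr    lda (from),y     ;copy the partial last page
         sta (to),y
         iny
         dex
         bne nextr
done     rts

         .bend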

We can count jiffies (the 1/60-second ticks of the Commodore's system clock) to give us an idea of how long this copy takes. Sure, there's a slight overhead in the setup, but I think it's marginal enough that we can ignore it for our purposes.

$12(18) jiffies 

Okay, that's pretty fast. 18 jiffies is 18/60 = 0.3 seconds, and 16k (16,384 bytes) transferred in 0.3 seconds works out to about 54.6k per second.

Let's do a bunch of them and see what it comes out as.

We can call this pretty quickly 255 times and do the same math.

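         lda #$00         ;zero the three-byte jiffy clock
         sta $a2          ;($a0 is the high byte, $a2 the low byte)
         sta $a1
         sta $a0

         ldx #255         ;number of copies to run
         stx tmpx         ;keep the counter in memory; copyr uses x
lp
         jsr copyr
         dec tmpx
         ldx tmpx
         bne lp

         lda $a0          ;print elapsed jiffies, high byte first
         jsr printbyte    ;printbyte: hex output helper, not shown
         lda $a1
         jsr printbyte
         lda $a2
         jsr printbyte

$1128 (4392) jiffies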

So at $1128 (4392) jiffies, or 73.2 seconds, for 255 transfers of 16,384 bytes (4,177,920 bytes in all), we're seeing around 57K per second.

Grain of salt, yes, but real-world enough.

Yeah, there's some overhead in the setup and running of the transfer. We could probably make this loop a few percentage points faster. Maybe if we make it tight, we could get 15% better out of it. But the point was real-world uses, and this is a pretty good example of a tight but flexible loop to transfer. Let's not get TOO pedantic here.

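What's important to note is that transferring memory takes several clock cycles per byte. If we count them, the inner loop costs 16 cycles per byte. At roughly 1 MHz that caps the copy at a little over 60K per second, so 57K after interrupts and VIC-II cycle stealing tracks with our results.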
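As a rough check on the cycle count, here is the inner loop of the copy routine again with the standard NMOS 6502 instruction timings in the comments. This is my own tally, and it assumes the pointers stay page-aligned, as they do with the source and dest values above, so the indexed read never pays the extra page-crossing cycle.

```asm
next     lda (from),y     ;5 cycles (6 if the index crosses a page)
         sta (to),y       ;6 cycles
         iny              ;2 cycles
         bne next         ;3 cycles when taken, 2 on the final pass
                          ;total: 16 cycles for each byte copied
```

The two inc instructions and the dex/bne that run once per page add only a handful of cycles per 256 bytes, so they barely move the per-byte figure.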

KIM-1 version

The KIM-1 is arguably the simplest and purest 6502 platform, so it will be interesting to try memory transfers on it.

It IS clocked a little slower than a Commodore 64, so I expect it to transfer slightly slower. But it doesn't have to give up bus cycles to VIC-II "badlines," so maybe it'll be pretty close.

Let's find out.

I don't own a "real" KIM-1, but I do own what are considered the two best clones. Today, let's use the Corsham KIM-1 Clone. I'm going to call it a KIM-1 from here forward, mostly because I enjoy getting angry letters about this. You've been warned.

Measuring time

The KIM-1 doesn't have a jiffy clock like the other Commodore machines.

The "Application ports" are easily accessible, so if we set a pin high when we start and set it low again when we finish, we can easily use an oscilloscope to measure the time.

With the expansion bus hooked up on my Corsham KIM board, the Application port A direction is set to output with:

	lda #$ff
	sta $1603
Set all port pins to output

And then, we can toggle pin PA0 by setting it high or low. We'll use $FF and $0 for that for simplicity.

Side note: this is a non-standard location for this port; your KIM-1 or clone probably has it in the $1700 range. Check your documentation.

16k in 262 milliseconds is around 62.5k per second. That's slightly faster than a Commodore 64, even though an NTSC Commodore 64 runs at a slightly higher clock speed (1.023MHz) than our KIM here.

Let's do this 255 times in a tight loop, ignoring the overhead of things like JSR, which takes a few clock cycles each loop. We're going for a ballpark here.

So our loop code then looks something like

         lda #$ff
         sta $1603
         sta $1601 ;technically setting all pins high here
                   ;could just use #$01


         ldx #255 
         stx tmpx
lp
         jsr copyr
         dec tmpx
         ldx tmpx
         bne lp
         
         lda #$00
         sta $1601


         brk

Then if we probe it with an oscilloscope, we can measure the 1+ minute square wave.

So 255 transfers of 16,384 bytes take 67 seconds, or about 62k per second.

One more for fun: how about a 2021 6502 processor clocked at 8MHz?

I happen to have a Cerberus 2080 board. As far as I know, mine is the only green one in the world.

This has dual-ported RAM and can clock the brand new (yes, they still make them) WDC 65C02S processor at a blazing 8MHz. Let's see what kind of results we get from it.

Again, we have no jiffy clock, so I'm going to skip right to the 4MB transfer, have it show "done" on the screen when it finishes, and time it from the video capture. Unlike the KIM-1, I don't have a straightforward way to time it with an I/O pin. It'll give us a good enough idea of where we are.

About 6.29 seconds

16,384 bytes 255 times took 6.29 seconds, so maxed out, a modern 6502 at 8MHz can do about 664.2k per second. Not too bad!

Thoughts

Sure, this was not a comprehensive set of tests. But in the real world, a 6502 can copy the entire contents of a Commodore 64's memory from one place to another in about a second. Pretty respectable, and it was pretty fast for the time.

Unrolling

You could certainly use self-modifying code and unroll this copy routine to get better performance, at the price of flexibility and, arguably, readability for the average casual 6502 assembly coder.
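To give a flavor of that trade-off, here is a minimal sketch of a two-way unrolled, self-modifying variant. It is not the routine measured above and I haven't timed it. The labels fastcopy, page, floop, src0, src1, dst0, dst1, and fdone are made up for illustration. It assumes source and dest are page-aligned, len is a whole number of 256-byte pages, and the code runs from RAM so it can patch its own operands.

```asm
fastcopy
         lda #>source     ;reset the patched operands so the
         sta src0+2       ;routine can be called more than once
         sta src1+2
         lda #>dest
         sta dst0+2
         sta dst1+2
         ldx #>len        ;whole pages to copy
         beq fdone        ;nothing to do for less than a page
page     ldy #0
floop
src0     lda source,y     ;4 cycles, first half of the page
dst0     sta dest,y       ;5 cycles
src1     lda source+$80,y ;4 cycles, second half of the page
dst1     sta dest+$80,y   ;5 cycles
         iny              ;2 cycles
         bpl floop        ;3 cycles taken; falls through at y=$80
         inc src0+2       ;self-modify: bump the high byte of all
         inc src1+2       ;four operands to the next page
         inc dst0+2
         inc dst1+2
         dex
         bne page
fdone    rts
```

By the same cycle counting, the inner loop is 23 cycles for two bytes, or about 11.5 cycles per byte instead of 16. The cost is exactly the flexibility mentioned above: no odd lengths, no unaligned blocks, and a routine that rewrites itself as it runs.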

Again, this was not a "how fast can we absolutely make it" but an everyday use examination.

This copy can handle any length from one byte to 2^16 - 1 bytes. And as my favorite YouTuber is fond of saying, "I know, I know, but I didn't do that. Let the angry emails begin."

REU

If you have an REU on your Commodore, it can theoretically move memory at one byte per clock cycle. A true 1MB per second. I've heard that games like Sam's Journey make use of this feature quite a bit.
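For the curious, here is an untested sketch of what the same 16K copy might look like when bounced through an REU: stash the block into expansion RAM, then fetch it back to the destination. It assumes a 17xx-compatible REU with its registers at $DF00 and reuses source, dest, and len from the copy routine. The labels reu and reucopy are mine, and the routine overwrites the first 16K of REU bank 0.

```asm
reu      = $df00          ;REU register base (assumed standard location)

reucopy
         lda #0
         sta reu+4        ;REU address low
         sta reu+5        ;REU address high
         sta reu+6        ;REU bank
         sta reu+10       ;address control: both addresses count up
         lda #<len
         sta reu+7        ;transfer length low
         lda #>len
         sta reu+8        ;transfer length high
         lda #<source
         sta reu+2        ;C64 address low
         lda #>source
         sta reu+3        ;C64 address high
         lda #$b0         ;execute now, autoload, C64 -> REU (stash)
         sta reu+1        ;CPU is halted until the DMA finishes
         lda #<dest       ;autoload restored the REU address and
         sta reu+2        ;length, so only the C64 address changes
         lda #>dest
         sta reu+3
         lda #$b1         ;execute now, autoload, REU -> C64 (fetch)
         sta reu+1
         rts
```

Each byte is moved twice here, so if the one-byte-per-cycle figure holds, a memory-to-memory copy done this way would land closer to two cycles per byte than one. That is still far ahead of anything the CPU loop can manage.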

Sam's Journey First Level

I'd love to hear your thoughts on how you'd approach this, pedantic, nit-picky, and otherwise. Bonus points if you demonstrate methods that show dramatically better results.

Whatever you do, be sure to have fun and don't take marketing slides too seriously.