We did an implementation of a once popular hashing function in Part 8 on RC4 without really going into why.
A hash function is any function that is used to map data of an arbitrary size (input) to a fixed size output. An algorithm is applied to the input and results in a fixed length output called the hash.
In cryptography it's important to get a repeatable output for a given input, but ideally NO information about the input that created it.
How important this lack of backwards "what created this?" information is, depends on the application. We'll be looking at use case wish a simple hash where security isn't important, to demonstrate the qualities of hash functions in general.
There are three key properties of hashes that make them suitable for security. Let's try and break these down to make them a little easier to view in the big picture.
- Preimage resistance - Given a small thing, the hash output, it should be difficult to get the big thing(doesn't have to be big actually) which is the input. Ideally this would be impossible. So hard to reverse.
- Second-preimage resistance - If you have a hash (small thing) and the input (big thing), it should be hard to find another input that produces the same hash.
- Collision resistance - It should be hard to find two inputs (big things) that create the exact same hash
Let's concentrate on collisions for this article since there is a fun way to practically demonstrate this on the Commodore.
Collisions Simple XOR checksum vs SHA-256
Not to spoil it too early, but SHA-256 is a modern hash function that at least at the time of this writing has no reported issues with preimage resistance, second-preimage resistance, or collision. It's used all over the world in thousands of applications, including Bitcoin.
Knowing this, we can't practically demonstrate SHA-256 collisions on ANY computer, let alone a 1Mhz machine with 64k of RAM.
Enter the simple XOR checksum. We've used exclusive OR in several of the previous articles in this series, and it's still useful for some applications. It has never been suitable for cryptography on it's own because of it has a lack of collision resistance.
A practical example from the 80s
"Back in the day", it was very common for books and magazines to have entire programs listed in them that you could actually type into your computer. Some were short, but many were very very long. Let's look at a really fun one, Salvage Diver, from Ahoy! Sept 1984.
This goes on for 4 more pages...
As you might imagine, typing in 200+ lines of code from a magazine was prone to errors. Sometimes a lot of them, and depending on how it was written, trying to find those typos could be challenging.
It became a common practice after a while for publications to provide a program that you typed in and saved that could then be used to check your work.
Bug Repellent (cue trumpet sounds)
The output was simple, a line number and a checksum byte. In the listing, this information was in a table following the program and you could just run down the list and check for line checksums that didn't match yours. Then you look at that line and fix the error.
It worked pretty well actually. Let's look at this in practice. We'll load the Bug Repellent Program, type in (or load) our Salvage diver program. Then run the Bug Repellent program with a
SYS 49152 command and it will output the codes for the version we typed in.
Now we just have to compare the codes to the codes in the magazine. Any incorrect lines can then be looked at again to be corrected.
You'll notice at the 1:20 Mark, I changed line 820 just slightly and now it doesn't match the magazine listing. It should be
FG but now Bug checks as
This is an example of a simple hash algorithm at work. This one is not sophisticated because its intended purpose is to be easy to use and "good enough" to help you find a mistyped program line. Doing a complicated thing like SHA-256 would be overkill.
What does this have to do with collisions?
I did a little disassembly of this Bug Repellent program and it's more or less a XOR checksum with a few other twists thrown in. If there's some interest, maybe some other time we'll do a complete disassembly of it and compare it to what some other magazines did for this problem.
Something worth noting is that all of the checksums are in uppercase A-Z characters only. This means that with less than 700 total possible hash combinations, the chances of two lines of code producing the same hash output are fairly high. In fact lines 708 and 820 of our Salvage Diver program should both output FG as their checksum.
This is a hash collision.
Does it matter here? Not really.
Just for fun lets look for word collisions
Since we can, let's do a simple XOR checksum in the same spirit of the Bug Repellent program and look for word collisions. We'll constrain our output to A-Z just the same and look for english word collisions.
ARE from "ARE YOU KEEPING UP WITH THE COMMODORE? BECAUSE THE COMMODORE IS KEEPING UP WITH YOU."
ARE in ASCII hexadecimal is
$41, $52, $45
A crazy simple checksum hash of this would be to XOR them together which would give us
If we try all of the possible combinations of inputs that XOR together to produce
$56 as the checksum:
What was initially a little surprising to me was not just how many letter combinations produce this output, but look at how many are English words.
The chances of CAT and ARE xoring to the same value are actually not that small since "XOR only" checksums have poor avalanche properties.
How much a given hash changes the resulting hash bits when the input changes is called the avalanche property.
Let's try this on our XOR example. We're still using ARE:
$41, $52, $45 which XORed together gives us
But let's look at this in binary:
$41 01000001 $52 01010010 xor -------------------- 00010011 $45 01000101 xor -------------------- checksum 01010110
Now let's just change a single input bit. If we change ARE to CRE the bytes are
$43 01000011 $52 01010010 xor -------------------- 00010001 $45 01000101 xor -------------------- checksum 01010100
Changing a single bit in our 3 byte simple checksum only changes 1 bit in the result.
$54. Only the second from the last bit changed.
This is an example of poor avalanche properties. Does this matter in our line checker? Not really. We just need to be able to see the lines that are wrong. XOR checksums are useful when you need speed and simplicity but care less about cryptography uses.
In cryptography this is not a desirable property. At. All.
Avalanche in SHA-256
Let's look at changing ARE to CRE in SHA-256, a modern and still unbroken hash function.
ARE: 16aec4cc0cd3dca7df02abddcec6f18917115dbbd69f57058bfbfb9a78db1bc0 CRE: 1c6f084212f85c53cabe6130a053bb1135c13364fe120064a87247cd4c6c7493
That's very different. I didn't count the different bits, but it looks like only 2 bytes are the same. SHA-256 is considered secure and in widespread use because this is one of its core properties. A small change in the input has a large change in the output.
Collision resistance is important sometimes
I hope this helps clear up what collision resistance is and where you might and might not actually care about this property. When in doubt, use more modern and secure hashes. They are free and well documented.
Later in the series we'll examine SHA-256 in detail on the Commodore 64, but here's a teaser. It clocks in at 2.857 hashes per second on an unaccelerated Commodore 64 (NTSC).
Salvage diver gameplay
That's it, I'll leave you with a taste of this very playable game. Happy ciphering!