Unipeak is a free anonymous proxy, it encodes the URLs like this:
http://www.unipeak.com/gethtml.php?_u_r_l_=aHR0cDovL3d3dy5nb29nbGUuY29t (this is http://www.google.com).
Suppose you had to reverse engineer the encoding scheme, how could you proceed? You are in a comfortable position, because you can execute a Chosen Plaintext Attack.
First we need to find out if the encoding scheme is reversible, because it could also be a hash or another key used to access the cache of the proxy (if it’s a caching proxy).
So we add a letter ‘a’ to the encoded URL and see what Unipeak replies:
http://www.unipeak.com/gethtml.php?_u_r_l_=aHR0cDovL3d3dy5nb29nbGUuY29ta
and we see the Google website.
So it’s not a hash, it’s reversible.
We add another ‘a’:
http://www.unipeak.com/gethtml.php?_u_r_l_=aHR0cDovL3d3dy5nb29nbGUuY29taa
and now we get an error message:
unable to connect to http://www.google.comi:80/
It’s definitely reversible.
Searching with Google via Unipeak gives another URL:
http://www.unipeak.com/gethtml.php?_u_r_l_=aHR0cDovL3d3dy5nb29nbGUuY29tOjgwL3NlYXJjaA%3D%3D&hl=en&q=unipeak&btnG=Google+Search
This URL starts with the same sequence as our first URL, so it’s probably a simple encoding scheme where the characters are processed from left to right.
So let’s start another experiment, we enter this URL: aaaaaaaaaa
The encoded URL is:
http://www.unipeak.com/gethtml.php?_u_r_l_=YWFhYWFhYWFhYQ==
Very interesting, we also get a repeating pattern, but the cycle is 4 characters long (YWFh).
Ok, now let’s use a trick: we enter a series of characters Us. The character U is special, its ASCII encoding written in binary is 01010101. Thus UU is 0101010101010101, UUU is 010101010101010101010101, …
Entering UUUUUUUUUU gives us:
http://www.unipeak.com/gethtml.php?_u_r_l_=VVVVVVVVVVVVVQ==
Another nice sequence!
This is a strong indication that the encoding is done at the bit level: the input is seen as a stream of bits, the bits are grouped in groups of X bits (where X is unknown). Each group is transformed to another sequence of bits by a function F, and the same function F is used for each group. We can also assume that X is even, otherwise we wouldn’t get a sequence of identical characters, but a sequence of identical pairs.
We perform some extra tests to prove (or disprove) our hypothesis.
We encode sequences of different lengths and compare the length of the cleartext and the cyphertext: the ratio is about 3 to 4, 3 input characters generate 4 output characters (BTW, the fact that we get a cycle of 4 characters for aaaaa… is also a strong indication for this ratio).
So X can be 3, 6, 9, 12, … . Except we assume X is even: 6, 12, …
Let’s test X = 6.
We try URL 000, this gives us MDAw (http://www.unipeak.net/gethtml.php?_u_r_l_=MDAw)
Now 000 is 30 30 30 (in hexadecimal ASCII)
or 00110000 00110000 00110000 in binary, grouped in 8 bits (1 byte)
or 001100 000011 000000 110000 in binary, but grouped in 6 bits (X = 6)
Now increment the first group:
001101 000011 000000 110000
or 00110100 00110000 00110000 in binary, grouped in 8 bits (1 byte)
or 34 30 30 (in hexadecimal ASCII)
or 400
So 000 becomes 400 when you increment the first group of 6 bits.
Testing URL 400 gives NDAw: changing the first 6 bits changes only the first character!
We do the same for the remaining groups:
000 -> 0@0 -> MEAw
000 -> 00p -> MDBw
000 -> 001 -> MDAx
So X is indeed 6, because changing a group of 6 bits at a time changes only one encoded character.
And we can also assume that function F is linear, because incrementing the input with 1 increments the output with 1 (M -> N, D -> E, A -> B and w -> x).
Now we could try every possible permutation of 6 bits, and see what the corresponding encoded character is.
We would discover that F maps 0..63 to ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/
And this is a very common encoding scheme: base64