Plotting keyspaces
I recently noticed the absence of visual representation on keyspaces and dictionary quality. Below is an approach to keyspace visual representation, and plotting several informations regarding the entropy of password dictionnaries, keyspaces, and keyspace enumeration.
I assume my readers already know the content beyond the following hyperlinks:
The several approaches to hash cracking (incremental bruteforce, dictionary, dictionary + rules, dictionary + mixing (prince), etc.) can be considered as different approaches to the same problem:
For a given hash h, enumerate all the numbers in ℕ, to find some n ∈ ℕ such as hash(n) = h. As the hashes are surjectives, there might be several n ∈ ℕ matching this condition.
Strings are numbers.
A dictionary is a list of strings.
As an example, here are the first 10 entries of the very famous "rockyou.txt" dictionary:
(pardon my Windows, blame the gaming industry)
PS C:\Users\username\Downloads> Get-Content .\rockyou.txt -First 10 | py -3 -c "import sys;print('\n'.join([repr(bytes(l.strip(),'utf-8')) for l in sys.stdin.readlines()]))"
b'123456'
b'12345'
b'123456789'
b'password'
b'iloveyou'
b'princess'
b'1234567'
b'rockyou'
b'12345678'
b'abc123'
Those passwords can be seen as huge numbers :
PS C:\Users\username\Downloads> Get-Content .\rockyou.txt -First 10 | py -3 -c "import sys;p=[bytes(l.strip(),'utf-8') for l in sys.stdin.readlines()];print('\n'.join(['{:d}'.format(sum(x[i]<<(8*i) for i in range(len(x)))) for x in p]))"
59602136937009
228509037105
1055515178193424429617
7237970109966541168
8462115701538122857
8319104414412010096
15540725856023089
33055139558551410
4050765991979987505
56290669978209
But it would be clearer if they were printed in a hexadecimal format, justified to the right:
PS C:\Users\username\Downloads> Get-Content .\rockyou.txt -First 10 | py -3 -c "import sys;p=[bytes(l.strip(),'utf-8') for l in sys.stdin.readlines()];print('\n'.join(['{:>20x}'.format(sum(x[i]<<(8*i) for i in range(len(x)))) for x in p]
))"
363534333231
3534333231
393837363534333231
64726f7773736170
756f7965766f6c69
737365636e697270
37363534333231
756f796b636f72
3837363534333231
333231636261
So, now, we can plot rockyou wordlist as a buch of numbers, which happen to match those volatile secret thoughts coined as "passwords" :
PS C:\Users\username\Downloads> Get-Content .\rockyou.txt -First 200 | py -3 -c "import sys;p=[bytes(l.strip(),'utf-8') for l in sys.stdin.readlines()];print('\n'.join(['{:d}'.format(sum(x[i]<<(8*i) for i in range(len(x)))) for x in p]))" | gnuplot -e "set terminal pngcairo;set output 'rockyou.10d.png';plot '< py stdin.py' using 1 title ''"

That's ugly. Let's sort those numbers, and use a logarithmic scale. Guess what, those levels are corresponding to the number of bytes in the passwords :
PS C:\Users\username\Downloads> Get-Content .\rockyou.txt -First 200 | py -3 -c "import sys;p=[bytes(l.strip(),'utf-8') for l in sys.stdin.readlines()];print('\n'.join(['{:d}'.format(sum(x[i]<<(8*i) for i in range(len(x)))) for x in p]))" | C:\cygwin64\bin\sort.exe -n | gnuplot -e "set terminal pngcairo;set logscale y;set output 'rockyou.10d.log.png';plot '< py stdin.py' using 1 title ''"

This is the password "complexity" . The more bits, the more complex, the more time needed to go straight from 0 to n using a dumb brute-force attack.
If you think that "aaaaaaaaaaaaaaaaaaaaaaaaaa" is not a secure password, that could be because you think there is a quick path from 0 to 1.5e+62
PS C:\Users\username\Downloads> py -3 -c "a=bytes('aaaaaaaaaaaaaaaaaaaaaaaaaa','utf-8');print('{:e}'.format(sum(a[i]<<(8*i) for i in range(len(a)))))"
1.564843e+62
You might be getting my point there. There are several representation of the quality of passwords, and the nature of the rules being used to map the « passwords » set to ℕ are the roots of efficient password cracking. There are paths in the keyspace (ℕ), we are going to explore them, and to plot them.
The quality of a keyspace path depends on the number of actual human passwords met along the way.
Therefore, password dictionaries are just some tourist maps of ℕ showing the most visited places such as :
PS C:\Users\username> py -3 -c "k=[bytes(z,'utf-8') for z in ['password','letmein','123456']];print('\n'.join(['{:12s} {:>20d} {:>20x}'.format(repr(a),sum(a[i]<<(8*i) for i in range(len(a))),sum(a[i]<<(8*i) for i in range(len(a)))) for a in k]))"
b'password' 7237970109966541168 64726f7773736170
b'letmein' 31078131787130220 6e69656d74656c
b'123456' 59602136937009 363534333231
One last thing regarding the casting back from integers to byte arrays: you have align the data stream with a multiple of 8 bits, padding with zeroes.
cat rockyou.txt | python3 -c "import sys;p=[l for l in sys.stdin.buffer.readlines()];print('\n'.join(['{:d}'.format(sum(x[i]<<(8*i) for i in range(len(x)))) for x in p]))" > rockyou.numbers.txt
sort -n rockyou.numbers.txt > rockyou.numbers.sorted.txt
cat rockyou.numbers.sorted.txt | gnuplot -e 'set terminal pngcairo;set output "rockyou.fulld.png";set logscale y;set xtic rotate;plot "< cat" using 1 with dots title "rockyou.txt"'
And here is a full representation of the "rockyou.txt" wordlist, which should look like any other wordlist, as we can only percieve the overall length of contained passwords :

As this is quite unreadable, please find below a nice visual representation of the rockyou.txt lengths :
cat rockyou.txt | awk '{print length($0)}' | sort | uniq -c | gnuplot -e 'set terminal pngcairo;set output "rockyou.lengths.png";plot "< cat" using 1:2 with boxes title "rockyou.txt lengths"'

Amongst the features that can be extracted from passwords, length is one of the simplest properties that can be mapped to some kind of robustness.
You could also pick one of the following:
- length
- number of different bytes
- number of different byte tuples
- number of byte classes (digits, letters, etc.)
Keyspace paths
There exists several families of paths in ℕ, corresponding to various properties of hash alrogithms and dictionaries :
- Paths that can be crossed quickly from an attacker point of view (Optimisations)
- Paths that can be stored in an efficient way (Applying rules to a dictionary)
- Paths that always give the same hash output (a.k.a hash collisions)
- Paths that can't be crossed without consuming too much resources (Like, a full star energy just by flipping bits)
As an example of paths, you could consider the following optimisation related to Merkle-Damgård hashes : hash("password"+"1") is just one round away from hash("password"). Thus, taking passwords as starting points to small explorations in ℕ :
- Starting from n, obtaining the hashes from n + (2length × suffix) (equivalent to adding a byte) only cost one round.
- Also, meet-in-the-middle
The problem we have when plotting keyspaces, and moreover keyspace paths is : how to plot the numbers that are going to be enumerated by an attacker (hopefully, by chunks of several thousands of numbers, thanks to GPGPU parallelism).
Drawing ℕ subsets
In order to plot such huge sets of numbers, we need a function to map [0..N] to [(0,0)..(H,W)].
A trivial approach is to pick a width W and to follow the following conditions:
h = N % W
w = (N - (N % W)) / W
% ./wordlist-scrambler.exe a b c 7 8 '' | head
separator : «»
length : 8
nwords : 4
aaaaaaaa
baaaaaaa
caaaaaaa
7aaaaaaa
abaaaaaa
bbaaaaaa
cbaaaaaa
7baaaaaa
acaaaaaa
bcaaaaaa
% ./wordlist-scrambler.exe a b c 7 8 '' | wc -l
separator : «»
length : 8
nwords : 4
65536
% ./wordlist-scrambler.exe a b c 7 8 '' | ./handle.py str2int | head
separator : «»
length : 8
nwords : 4
7016996765293437281
7016996765293437282
7016996765293437283
7016996765293437239
7016996765293437537
7016996765293437538
7016996765293437539
7016996765293437495
7016996765293437793
7016996765293437794
% # fun fact ! there might exist numbers which decimal representation decoded as hexadecimal would have another meaning :) :)
% echo 7016996765293437539 | xxd -r -p | hd
00000000 70 16 99 67 65 29 34 37 53 |p..ge)47S|
% ./wordlist-scrambler.exe a b c 7 8 5 '' | ./handle.py str2int | gnuplot -e 'set terminal pngcairo;set output "trivial.linear.png";plot "< cat " using 1 with dots title ""'
separator : «»
length : 5
nwords : 5

Cool ! There is a visual representation of the fractal we are exploring !
% ./wordlist-scrambler.exe a b c 7 8 5 '' | ./handle.py str2int_mod --modulo 256 | gnuplot -e 'set terminal pngcairo;set xtics rotate;set yrange [0 to 256];set output "trivial.modulo.png";plot "< cat " using 1:2 title ""'
separator : «»
length : 5
nwords : 5

Hmm. We might have done some product between our two projections.
Let's try to plot two datasets :
% ./wordlist-scrambler.exe Y T H O ? '~' 4 '' > dico.ytho.txt separator : «»
length : 4
nwords : 6
% ./wordlist-scrambler.exe a b c 7 8 5 '' | cat - dico.ytho.txt | ./handle.py str2int_mod --modulo 256 | gnuplot -e 'set terminal pngcairo;set xtics rotate;set logscale x;set xrange [1:];set output "trivial.mixed.png";plot "< cat " using 1:2 title ""'
separator : «»
length : 5
nwords : 5

Yeah ! we are getting somewhere.
Let's try to plot rockyou :

Don't you think it's pretty ? Yeah, neither do I, those grids are hurting my brain. Also, we can't are missing two informations :
- The density in the crowded areas
- The actual number of bytes in the lowest part.
Let's fix the easiest part, the number of bytes:
% time cat rockyou.txt | ./handle.py str2int_mod --modulo 256 | gnuplot -e 'set terminal pngcairo;set xtics rotate;set logscale x; set xrange [2:];set output "rockyou.modulo.power.png";plot "< cat" using 1:2 with dots title ""'
real 1m43.468s # I'll work on a subset now :D
user 1m54.892s
sys 0m2.572s

You might wonder if those vertical values are corresponding to the first or the last letter. I did too, and had to check my "it's the last" assumption :
% ./pattern-walker.exe abcde 6 ZZZ | head
ZZZaaa
ZZZaab
ZZZaac
ZZZaad
ZZZaae
ZZZaba
ZZZabb
ZZZabc
ZZZabd
ZZZabe
% ./pattern-walker.exe abcde 6 ZZZ | wc -l
125
% ./pattern-walker.exe abcde 6 ZZZ | ./handle.py str2int_mod --modulo 256 | gnuplot -e 'set terminal pngcairo;set xtics rotate;set logscale x; set xrange [2:];set output "testfirstlast.png";plot "< cat" using 1:2 pt 7 title ""'
Warning: empty y range [90:90], adjusting to [89.1:90.9]

Answer : the last.
So, what about our byte count ? We just need to use a logarithmic scale of base 256 instead of 10 or 2, and the numbers will fit nicely as long as we use the "%L" format specifier:
% head -n 200000 rockyou.txt | ./handle.py str2int_mod --modulo 256 | gnuplot -e 'set terminal pngcairo;set xtics rotate;set logscale x 256;set format x "%L" ; set xrange [2:];set output "rockyou.modulo.length.png";plot "< cat" using 1:2 with points title ""'
Ok. Just before fixing the density information, let's define a fixed height ranging from 0 to 0xff :
% head -n 200000 rockyou.txt | ./handle.py str2int_mod --modulo 256 | gnuplot -e 'set terminal pngcairo;set xtics rotate;set logscale x 256;set format x "%L" ;set xrange [2:];set ytics 0x20;set ytics add (0xff);set yrange [0:255];set output "rockyou.modulo.dotheight.png";plot "< cat" using 1:2 with dots title ""'
<- do you see the fractals ?
Now, let's work on this loss of information caused by the density of plots.
There is no automatic dentisy plot for surfaces in gnuplot (Even though ther is a nice 2D counterpart : 'cumulative').
I would have to produce myself some image generation code, like windyoona and her visually stunning oversampling perl file.
As I'm not yet in the mood for this, you'll instead have a 3D plot with a third information extracted from each password.
- The first is the length of the password (~bitlength)
- The second is the first letter of the password (%256)
- The third is the password character classes
To compute this third information, we need to define "character classes".
pclasses = {
'subspecial' : list(range(0x00,0x20)),
'special' : list(range(0x20,0x30))+list(range(0x3a,0x41))+list(range(0x5b,0x61))+list(range(0x7b,0x7f)),
'digits' : list(range(0x30,0x3a)),
'uppercase' : list(range(0x41,0x5b)),
'lowercase' : list(range(0x61,0x7b)),
'highbyteA' : list(range(0x7f,0xc1)),
'highbyteB' : list(range(0xc1,0xff)),
}
classes_values = {
'subspecial' : 5,
'special' : 4,
'digits' : 3,
'lowercase' : 1,
'uppercase' : 2,
'highbyteA' : 6,
'highbyteB' : 7,
}
Thus, a simple sum of the used classes can add information regarding the characters used :
% cat sample.txt
abcde
AbdhbE
helo123
feahué!
% ./handle.py str2int_mod_class sample.txt
435475931745 97 1
76288960520769 65 3
14410411716404584 104 5
2425684783345526118 102 105
Let's run this :
% cat gnuplot.pm3d.gpi
set terminal pngcairo
set xtics rotate
set logscale x 256
set format x "%L"
set xrange [2:]
set ytics 0x20
set ytics add (0xff)
set zrange [0:256]
#set pm3d map
#set yrange [0:255]
set output "rockyou.3Dclasses.png"
#set pm3d #interpolate 0,0
splot "< cat" using 1:2:3 title ""
% cat rockyou.txt | ./handle.py str2int_mod_class | gnuplot gnuplot.pm3d.gpi
Fontconfig warning: ignoring C.UTF-8: not a valid language tag
real 0m0.390s
user 0m0.202s
sys 0m0.355s

Well, not so pretty. To get better graphs, let's free ourself of this "password is a number" principle, and let's reduce passwords to arrays of bytes. Thus, we will be able to have integer data on them :
- length
- letters (first, second, etc.)
- letter classes
Passwords are now byte arrays