Plotting keyspaces

I recently noticed the absence of visual representation on keyspaces and dictionary quality. Below is an approach to keyspace visual representation, and plotting several informations regarding the entropy of password dictionnaries, keyspaces, and keyspace enumeration.
I assume my readers already know the content beyond the following hyperlinks: The several approaches to hash cracking (incremental bruteforce, dictionary, dictionary + rules, dictionary + mixing (prince), etc.) can be considered as different approaches to the same problem: For a given hash h, enumerate all the numbers in ℕ, to find some n ∈ ℕ such as hash(n) = h. As the hashes are surjectives, there might be several n ∈ ℕ matching this condition.

Strings are numbers.

A dictionary is a list of strings. As an example, here are the first 10 entries of the very famous "rockyou.txt" dictionary:
(pardon my Windows, blame the gaming industry)
PS C:\Users\username\Downloads> Get-Content .\rockyou.txt -First 10 | py -3 -c "import sys;print('\n'.join([repr(bytes(l.strip(),'utf-8')) for l in sys.stdin.readlines()]))"
b'123456'
b'12345'
b'123456789'
b'password'
b'iloveyou'
b'princess'
b'1234567'
b'rockyou'
b'12345678'
b'abc123'
Those passwords can be seen as huge numbers :
PS C:\Users\username\Downloads> Get-Content .\rockyou.txt -First 10 | py -3 -c "import sys;p=[bytes(l.strip(),'utf-8') for l in sys.stdin.readlines()];print('\n'.join(['{:d}'.format(sum(x[i]<<(8*i) for i in range(len(x)))) for x in p]))"

59602136937009
228509037105
1055515178193424429617
7237970109966541168
8462115701538122857
8319104414412010096
15540725856023089
33055139558551410
4050765991979987505
56290669978209
But it would be clearer if they were printed in a hexadecimal format, justified to the right:
PS C:\Users\username\Downloads> Get-Content .\rockyou.txt -First 10 | py -3 -c "import sys;p=[bytes(l.strip(),'utf-8') for l in sys.stdin.readlines()];print('\n'.join(['{:>20x}'.format(sum(x[i]<<(8*i) for i in range(len(x)))) for x in p]
))"
        363534333231
          3534333231
  393837363534333231
    64726f7773736170
    756f7965766f6c69
    737365636e697270
      37363534333231
      756f796b636f72
    3837363534333231
        333231636261
				
So, now, we can plot rockyou wordlist as a buch of numbers, which happen to match those volatile secret thoughts coined as "passwords" :
PS C:\Users\username\Downloads> Get-Content .\rockyou.txt -First 200 | py -3 -c "import sys;p=[bytes(l.strip(),'utf-8') for l in sys.stdin.readlines()];print('\n'.join(['{:d}'.format(sum(x[i]<<(8*i) for i in range(len(x)))) for x in p]))" | gnuplot -e "set terminal pngcairo;set output 'rockyou.10d.png';plot '< py stdin.py' using 1 title ''"

That's ugly. Let's sort those numbers, and use a logarithmic scale. Guess what, those levels are corresponding to the number of bytes in the passwords :
PS C:\Users\username\Downloads> Get-Content .\rockyou.txt -First 200 | py -3 -c "import sys;p=[bytes(l.strip(),'utf-8') for l in sys.stdin.readlines()];print('\n'.join(['{:d}'.format(sum(x[i]<<(8*i) for i in range(len(x)))) for x in p]))" | C:\cygwin64\bin\sort.exe -n | gnuplot -e "set terminal pngcairo;set logscale y;set output 'rockyou.10d.log.png';plot '< py stdin.py' using 1 title ''"

This is the password "complexity" . The more bits, the more complex, the more time needed to go straight from 0 to n using a dumb brute-force attack.
If you think that "aaaaaaaaaaaaaaaaaaaaaaaaaa" is not a secure password, that could be because you think there is a quick path from 0 to 1.5e+62
PS C:\Users\username\Downloads> py -3 -c "a=bytes('aaaaaaaaaaaaaaaaaaaaaaaaaa','utf-8');print('{:e}'.format(sum(a[i]<<(8*i) for i in range(len(a)))))"
1.564843e+62
You might be getting my point there. There are several representation of the quality of passwords, and the nature of the rules being used to map the « passwords » set to ℕ are the roots of efficient password cracking. There are paths in the keyspace (ℕ), we are going to explore them, and to plot them.
The quality of a keyspace path depends on the number of actual human passwords met along the way.
Therefore, password dictionaries are just some tourist maps of ℕ showing the most visited places such as :
PS C:\Users\username> py -3 -c "k=[bytes(z,'utf-8') for z in ['password','letmein','123456']];print('\n'.join(['{:12s} {:>20d} {:>20x}'.format(repr(a),sum(a[i]<<(8*i) for i in range(len(a))),sum(a[i]<<(8*i) for i in range(len(a)))) for a in k]))"
b'password'   7237970109966541168     64726f7773736170
b'letmein'      31078131787130220       6e69656d74656c
b'123456'          59602136937009         363534333231
One last thing regarding the casting back from integers to byte arrays: you have align the data stream with a multiple of 8 bits, padding with zeroes.
cat rockyou.txt | python3 -c "import sys;p=[l for l in sys.stdin.buffer.readlines()];print('\n'.join(['{:d}'.format(sum(x[i]<<(8*i) for i in range(len(x)))) for x in p]))" > rockyou.numbers.txt
sort -n rockyou.numbers.txt > rockyou.numbers.sorted.txt
cat rockyou.numbers.sorted.txt | gnuplot -e 'set terminal pngcairo;set output "rockyou.fulld.png";set logscale y;set xtic rotate;plot "< cat" using 1 with dots title "rockyou.txt"'
And here is a full representation of the "rockyou.txt" wordlist, which should look like any other wordlist, as we can only percieve the overall length of contained passwords :
As this is quite unreadable, please find below a nice visual representation of the rockyou.txt lengths :
cat rockyou.txt | awk '{print length($0)}' | sort | uniq -c | gnuplot -e 'set terminal pngcairo;set output "rockyou.lengths.png";plot "< cat" using 1:2 with boxes title "rockyou.txt lengths"'

Amongst the features that can be extracted from passwords, length is one of the simplest properties that can be mapped to some kind of robustness. You could also pick one of the following:

Keyspace paths

There exists several families of paths in ℕ, corresponding to various properties of hash alrogithms and dictionaries : As an example of paths, you could consider the following optimisation related to Merkle-Damgård hashes : hash("password"+"1") is just one round away from hash("password"). Thus, taking passwords as starting points to small explorations in ℕ :
The problem we have when plotting keyspaces, and moreover keyspace paths is : how to plot the numbers that are going to be enumerated by an attacker (hopefully, by chunks of several thousands of numbers, thanks to GPGPU parallelism).

Drawing ℕ subsets

In order to plot such huge sets of numbers, we need a function to map [0..N] to [(0,0)..(H,W)].
A trivial approach is to pick a width W and to follow the following conditions:
h = N % W
w = (N - (N % W)) / W
% ./wordlist-scrambler.exe a b c 7 8 '' | head
           separator : «»
              length : 8
              nwords : 4
aaaaaaaa
baaaaaaa
caaaaaaa
7aaaaaaa
abaaaaaa
bbaaaaaa
cbaaaaaa
7baaaaaa
acaaaaaa
bcaaaaaa
% ./wordlist-scrambler.exe a b c 7 8 '' | wc -l
           separator : «»
              length : 8
              nwords : 4
65536
% ./wordlist-scrambler.exe a b c 7 8 '' | ./handle.py str2int | head
           separator : «»
              length : 8
              nwords : 4
7016996765293437281
7016996765293437282
7016996765293437283
7016996765293437239
7016996765293437537
7016996765293437538
7016996765293437539
7016996765293437495
7016996765293437793
7016996765293437794
% # fun fact ! there might exist numbers which decimal representation decoded as hexadecimal would have another meaning :) :)
% echo 7016996765293437539 | xxd -r -p | hd
00000000  70 16 99 67 65 29 34 37  53                       |p..ge)47S|
% ./wordlist-scrambler.exe a b c 7 8 5 '' | ./handle.py str2int | gnuplot -e 'set terminal pngcairo;set output "trivial.linear.png";plot "< cat " using 1 with dots title ""'
           separator : «»
              length : 5
              nwords : 5

Cool ! There is a visual representation of the fractal we are exploring !
% ./wordlist-scrambler.exe a b c 7 8 5 '' | ./handle.py str2int_mod --modulo 256 | gnuplot -e 'set terminal pngcairo;set xtics rotate;set yrange [0 to 256];set output "trivial.modulo.png";plot "< cat " using 1:2 title ""'
           separator : «»
              length : 5
              nwords : 5

Hmm. We might have done some product between our two projections.
Let's try to plot two datasets :
% ./wordlist-scrambler.exe Y T H O ? '~' 4 '' > dico.ytho.txt                                  separator : «»
              length : 4
              nwords : 6
% ./wordlist-scrambler.exe a b c 7 8 5 '' | cat - dico.ytho.txt | ./handle.py str2int_mod --modulo 256 | gnuplot -e 'set terminal pngcairo;set xtics rotate;set logscale x;set xrange [1:];set output "trivial.mixed.png";plot "< cat " using 1:2 title ""'
           separator : «»
              length : 5
              nwords : 5

Yeah ! we are getting somewhere.
Let's try to plot rockyou :

Don't you think it's pretty ? Yeah, neither do I, those grids are hurting my brain. Also, we can't are missing two informations : Let's fix the easiest part, the number of bytes:
% time cat rockyou.txt | ./handle.py str2int_mod --modulo 256 | gnuplot -e 'set terminal pngcairo;set xtics rotate;set logscale x; set xrange [2:];set output "rockyou.modulo.power.png";plot "< cat" using 1:2 with dots title ""'

real    1m43.468s # I'll work on a subset now :D
user    1m54.892s
sys     0m2.572s

You might wonder if those vertical values are corresponding to the first or the last letter. I did too, and had to check my "it's the last" assumption :
% ./pattern-walker.exe abcde 6 ZZZ | head
ZZZaaa
ZZZaab
ZZZaac
ZZZaad
ZZZaae
ZZZaba
ZZZabb
ZZZabc
ZZZabd
ZZZabe
% ./pattern-walker.exe abcde 6 ZZZ | wc -l
125
% ./pattern-walker.exe abcde 6 ZZZ |  ./handle.py str2int_mod --modulo 256 | gnuplot -e 'set terminal pngcairo;set xtics rotate;set logscale x; set xrange [2:];set output "testfirstlast.png";plot "< cat" using 1:2 pt 7 title ""'
Warning: empty y range [90:90], adjusting to [89.1:90.9]

Answer : the last.
So, what about our byte count ? We just need to use a logarithmic scale of base 256 instead of 10 or 2, and the numbers will fit nicely as long as we use the "%L" format specifier:
% head -n 200000 rockyou.txt |  ./handle.py str2int_mod --modulo 256 | gnuplot -e 'set terminal pngcairo;set xtics rotate;set logscale x 256;set format x "%L" ; set xrange [2:];set output "rockyou.modulo.length.png";plot "< cat" using 1:2 with points title ""'

Ok. Just before fixing the density information, let's define a fixed height ranging from 0 to 0xff :
     % head -n 200000 rockyou.txt |  ./handle.py str2int_mod --modulo 256 | gnuplot -e 'set terminal pngcairo;set xtics rotate;set logscale x 256;set format x "%L" ;set xrange [2:];set ytics 0x20;set ytics add (0xff);set yrange [0:255];set output "rockyou.modulo.dotheight.png";plot "< cat" using 1:2 with dots title ""'
     
<- do you see the fractals ?
Now, let's work on this loss of information caused by the density of plots.
There is no automatic dentisy plot for surfaces in gnuplot (Even though ther is a nice 2D counterpart : 'cumulative').
I would have to produce myself some image generation code, like windyoona and her visually stunning oversampling perl file.
As I'm not yet in the mood for this, you'll instead have a 3D plot with a third information extracted from each password.
To compute this third information, we need to define "character classes".
pclasses = {
    'subspecial' : list(range(0x00,0x20)),
    'special'    : list(range(0x20,0x30))+list(range(0x3a,0x41))+list(range(0x5b,0x61))+list(range(0x7b,0x7f)),
    'digits'     : list(range(0x30,0x3a)),
    'uppercase'  : list(range(0x41,0x5b)),
    'lowercase'  : list(range(0x61,0x7b)),
    'highbyteA'  : list(range(0x7f,0xc1)),
    'highbyteB'  : list(range(0xc1,0xff)),
}
classes_values = {
    'subspecial' : 5,
    'special'    : 4,
    'digits'     : 3,
    'lowercase'  : 1,
    'uppercase'  : 2,
    'highbyteA'  : 6,
    'highbyteB'  : 7,
}
Thus, a simple sum of the used classes can add information regarding the characters used :
% cat sample.txt
abcde
AbdhbE
helo123
feahué!
% ./handle.py str2int_mod_class sample.txt
435475931745 97 1
76288960520769 65 3
14410411716404584 104 5
2425684783345526118 102 105
Let's run this :
% cat gnuplot.pm3d.gpi
set terminal pngcairo
set xtics rotate
set logscale x 256
set format x "%L"
set xrange [2:]
set ytics 0x20
set ytics add (0xff)
set zrange [0:256]
#set pm3d map
#set yrange [0:255]
set output "rockyou.3Dclasses.png"
#set pm3d #interpolate 0,0
splot "< cat" using 1:2:3 title ""
% cat rockyou.txt | ./handle.py str2int_mod_class | gnuplot gnuplot.pm3d.gpi
Fontconfig warning: ignoring C.UTF-8: not a valid language tag

real    0m0.390s
user    0m0.202s
sys     0m0.355s

Well, not so pretty. To get better graphs, let's free ourself of this "password is a number" principle, and let's reduce passwords to arrays of bytes. Thus, we will be able to have integer data on them :

Passwords are now byte arrays