ebook letter frequencies
I wrote this one roughly a year ago, but figured I may as well add it to the blog. Given a set of ebooks (mostly from Project Gutenberg), find their letter frequencies. Not super interesting, but let's add it anyway.
Here is the code, and the resulting sw file.
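For readers without the linked code to hand, here is a minimal Python sketch of the counting step. It is not the code linked above. The function name count_letters and the choice to count only ASCII a-z, ignoring case, are assumptions made for illustration.

```python
from collections import Counter
import string


def count_letters(text):
    """Count occurrences of each ASCII letter a-z, ignoring case."""
    counts = Counter(ch for ch in text.lower() if ch in string.ascii_lowercase)
    # Return all 26 letters (zeros included) so books can be compared row by row.
    return {letter: counts.get(letter, 0) for letter in string.ascii_lowercase}


# Hypothetical sample text; a real run would read a whole ebook file instead.
sample = "Alice was beginning to get very tired"
counts = count_letters(sample)
print(counts["e"], counts["i"])
```

Running this over each ebook's full text would give one column of the letter-count matrix per book.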
Now a couple of matrices in the console:
sa: load ebook-letter-counts.sw
sa: matrix[letter-count]
[ a ] = [ 9083 26317 142241 23325 76232 35669 260565 35285 23871 ] [ Alice-in-Wonderland ]
[ b ] [ 1621 4766 25476 4829 15699 6847 50138 6117 4763 ] [ Frankenstein ]
[ c ] [ 2817 9055 37297 7379 21938 11349 72409 10725 6942 ] [ Gone-with-Wind ]
[ d ] [ 5228 16720 85897 12139 37966 18763 144619 18828 15168 ] [ I-Robot ]
[ e ] [ 15084 45720 228415 37293 117608 59029 440119 54536 37230 ] [ Moby-Dick ]
[ f ] [ 2248 8516 34779 5940 20363 9936 73859 9105 6270 ] [ nineteen-eighty-four ]
[ g ] [ 2751 5762 38283 6037 20489 9113 61948 8023 6822 ] [ Shakespeare ]
[ h ] [ 7581 19400 119901 16803 61947 28093 234301 28284 19130 ] [ Sherlock-Holmes ]
[ i ] [ 7803 21411 101987 20074 62942 30304 214275 27361 18380 ] [ Tom-Sawyer ]
[ j ] [ 222 431 1501 346 915 310 2955 421 465 ]
[ k ] [ 1202 1722 18290 2370 8011 3512 32029 3590 3136 ]
[ l ] [ 5053 12603 79783 12870 42338 18395 156371 17276 12426 ]
[ m ] [ 2245 10295 39595 6534 22871 10513 101507 11391 7255 ]
[ n ] [ 7871 24220 123989 21302 65429 31516 231652 29337 20858 ]
[ o ] [ 9245 25050 130230 24555 69648 34287 299732 34452 24251 ]
[ p ] [ 1796 5939 23979 5148 16553 8058 50638 6987 4766 ]
[ q ] [ 135 323 1270 321 1244 397 2998 416 182 ]
[ r ] [ 6400 20708 105074 17003 52446 25861 224994 25378 16262 ]
[ s ] [ 6980 20808 107430 18044 62734 28382 232317 27105 17852 ]
[ t ] [ 11631 29706 157163 28316 86983 42127 311911 39232 28389 ]
[ u ] [ 3867 10340 50453 9483 26933 12903 121631 13527 9376 ]
[ v ] [ 911 3788 15224 3062 8540 4252 36692 4471 2451 ]
[ w ] [ 2696 7335 43623 6761 21174 11225 78929 10754 7735 ]
[ x ] [ 170 675 1700 508 1037 779 4867 567 326 ]
[ y ] [ 2442 7743 37639 6552 16849 9071 90162 9267 6830 ]
[ z ] [ 79 243 1045 208 598 303 1418 150 155 ]
sa: norm |*> #=> normalize[100] letter-count |_self>
sa: map[norm,normalized-letter-count] rel-kets[letter-count]
sa: matrix[normalized-letter-count]
[ a ] = [ 7.75 7.75 8.12 7.85 8.11 7.91 7.38 8.16 7.92 ] [ Alice-in-Wonderland ]
[ b ] [ 1.38 1.4 1.45 1.62 1.67 1.52 1.42 1.41 1.58 ] [ Frankenstein ]
[ c ] [ 2.4 2.67 2.13 2.48 2.34 2.52 2.05 2.48 2.3 ] [ Gone-with-Wind ]
[ d ] [ 4.46 4.92 4.9 4.08 4.04 4.16 4.09 4.35 5.03 ] [ I-Robot ]
[ e ] [ 12.87 13.46 13.04 12.55 12.52 13.09 12.46 12.61 12.36 ] [ Moby-Dick ]
[ f ] [ 1.92 2.51 1.98 2.0 2.17 2.2 2.09 2.1 2.08 ] [ nineteen-eighty-four ]
[ g ] [ 2.35 1.7 2.18 2.03 2.18 2.02 1.75 1.85 2.26 ] [ Shakespeare ]
[ h ] [ 6.47 5.71 6.84 5.65 6.59 6.23 6.63 6.54 6.35 ] [ Sherlock-Holmes ]
[ i ] [ 6.66 6.3 5.82 6.75 6.7 6.72 6.06 6.32 6.1 ] [ Tom-Sawyer ]
[ j ] [ 0.19 0.13 0.09 0.12 0.1 0.07 0.08 0.1 0.15 ]
[ k ] [ 1.03 0.51 1.04 0.8 0.85 0.78 0.91 0.83 1.04 ]
[ l ] [ 4.31 3.71 4.55 4.33 4.51 4.08 4.43 3.99 4.12 ]
[ m ] [ 1.92 3.03 2.26 2.2 2.43 2.33 2.87 2.63 2.41 ]
[ n ] [ 6.72 7.13 7.08 7.17 6.96 6.99 6.56 6.78 6.92 ]
[ o ] [ 7.89 7.38 7.43 8.26 7.41 7.6 8.48 7.96 8.05 ]
[ p ] [ 1.53 1.75 1.37 1.73 1.76 1.79 1.43 1.62 1.58 ]
[ q ] [ 0.12 0.1 0.07 0.11 0.13 0.09 0.08 0.1 0.06 ]
[ r ] [ 5.46 6.1 6.0 5.72 5.58 5.73 6.37 5.87 5.4 ]
[ s ] [ 5.96 6.13 6.13 6.07 6.68 6.29 6.58 6.27 5.93 ]
[ t ] [ 9.93 8.75 8.97 9.53 9.26 9.34 8.83 9.07 9.42 ]
[ u ] [ 3.3 3.04 2.88 3.19 2.87 2.86 3.44 3.13 3.11 ]
[ v ] [ 0.78 1.12 0.87 1.03 0.91 0.94 1.04 1.03 0.81 ]
[ w ] [ 2.3 2.16 2.49 2.27 2.25 2.49 2.23 2.49 2.57 ]
[ x ] [ 0.15 0.2 0.1 0.17 0.11 0.17 0.14 0.13 0.11 ]
[ y ] [ 2.08 2.28 2.15 2.2 1.79 2.01 2.55 2.14 2.27 ]
[ z ] [ 0.07 0.07 0.06 0.07 0.06 0.07 0.04 0.03 0.05 ]
sa: save ebook-letter-counts--normalized.sw
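As a cross-check on the normalized matrix, here is a small Python sketch that rescales one column so it sums to 100. The function name normalize is my own, not the console's implementation; the counts are the Alice-in-Wonderland column copied from the first matrix.

```python
import string


def normalize(counts, total=100.0):
    """Rescale the values so they sum to `total` (100 gives percentages)."""
    s = sum(counts.values())
    if s == 0:
        return dict(counts)  # nothing to rescale
    return {key: total * value / s for key, value in counts.items()}


# Alice-in-Wonderland column of the letter-count matrix, a through z.
alice_counts = [9083, 1621, 2817, 5228, 15084, 2248, 2751, 7581, 7803, 222,
                1202, 5053, 2245, 7871, 9245, 1796, 135, 6400, 6980, 11631,
                3867, 911, 2696, 170, 2442, 79]
alice = dict(zip(string.ascii_lowercase, alice_counts))

norm = normalize(alice)
print(round(norm["a"], 2), round(norm["e"], 2))
```

The rounded values for a and e agree with the first column of the normalized matrix (7.75 and 12.87).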
And I guess that is it.
Update: while we are here, I may as well give the similarity (simm) matrix:
sa: simm |*> #=> 100 self-similar[letter-count] |_self>
sa: map[simm,simm-matrix] rel-kets[letter-count]
sa: matrix[simm-matrix]
[ Alice-in-Wonderland ] = [ 100.0 94.94 96.52 97.32 96.76 97.11 95.57 97.09 97.49 ] [ Alice-in-Wonderland ]
[ Frankenstein ] [ 94.94 100.0 95.97 96.01 95.22 96.48 95.24 96.52 95.54 ] [ Frankenstein ]
[ Gone-with-Wind ] [ 96.52 95.97 100.0 96.0 96.98 97.01 95.91 97.12 97.17 ] [ Gone-with-Wind ]
[ I-Robot ] [ 97.32 96.01 96.0 100.0 97.3 97.87 96.06 97.35 97.12 ] [ I-Robot ]
[ Moby-Dick ] [ 96.76 95.22 96.98 97.3 100.0 98.05 96.07 97.39 96.85 ] [ Moby-Dick ]
[ nineteen-eighty-four ] [ 97.11 96.48 97.01 97.87 98.05 100.0 95.55 97.88 97.1 ] [ nineteen-eighty-four ]
[ Shakespeare ] [ 95.57 95.24 95.91 96.06 96.07 95.55 100 97.08 95.89 ] [ Shakespeare ]
[ Sherlock-Holmes ] [ 97.09 96.52 97.12 97.35 97.39 97.88 97.08 100 97.54 ] [ Sherlock-Holmes ]
[ Tom-Sawyer ] [ 97.49 95.54 97.17 97.12 96.85 97.1 95.89 97.54 100 ] [ Tom-Sawyer ]
So we see that English text has largely the same letter frequencies across different ebooks: every pair scores at least 94.94 out of 100. That makes sense of course, but it is nice to see it visually.
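To make the numbers less of a black box, here is a Python sketch of a rescaled similarity: normalize both count vectors to sum to 1, then add up the element-wise minimum. This is my reading of what the matrix reports rather than the project's actual function, and the name rescaled_simm is an assumption. The two lists are the Alice-in-Wonderland and Frankenstein columns of the letter-count matrix.

```python
def rescaled_simm(f, g):
    """Overlap of two non-negative vectors after each is rescaled to sum to 1."""
    sf, sg = sum(f), sum(g)
    if sf == 0 or sg == 0:
        return 0.0  # an empty vector is similar to nothing
    return sum(min(a / sf, b / sg) for a, b in zip(f, g))


# Letter counts a through z, copied from the letter-count matrix.
alice = [9083, 1621, 2817, 5228, 15084, 2248, 2751, 7581, 7803, 222,
         1202, 5053, 2245, 7871, 9245, 1796, 135, 6400, 6980, 11631,
         3867, 911, 2696, 170, 2442, 79]
frankenstein = [26317, 4766, 9055, 16720, 45720, 8516, 5762, 19400, 21411, 431,
                1722, 12603, 10295, 24220, 25050, 5939, 323, 20708, 20808, 29706,
                10340, 3788, 7335, 675, 7743, 243]

print(round(100 * rescaled_simm(alice, frankenstein), 2))
```

The simm matrix reports 94.94 for this pair, the lowest score in the table, so the printed value should land very close to that if the reading above is right.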
And it would be nice to have an "unscaled-similar[op]" operator. The problem is that it would require an entirely new function in the new_context class, which I am reluctant to add, since unscaled-simm is a rare use case. For now it can be done on special occasions by changing the simm function in new_context.pattern_recognition() to unscaled_simm(A,B).
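To show why the distinction matters, here is a toy Python comparison of the two variants. The definition of the unscaled version used here (sum of element-wise minimums divided by the larger of the two totals) is an assumption on my part, not taken from the project's unscaled_simm(A,B); both function names below are hypothetical.

```python
def rescaled_simm_sketch(f, g):
    """Similarity after rescaling both vectors to sum to 1 (shape only)."""
    sf, sg = sum(f), sum(g)
    if sf == 0 or sg == 0:
        return 0.0
    return sum(min(a / sf, b / sg) for a, b in zip(f, g))


def unscaled_simm_sketch(f, g):
    """ASSUMED definition: raw overlap divided by the larger total (size matters)."""
    biggest = max(sum(f), sum(g))
    if biggest == 0:
        return 0.0
    return sum(min(a, b) for a, b in zip(f, g)) / biggest


short_book = [1, 2, 3]  # toy letter counts
long_book = [2, 4, 6]   # same proportions, twice the length

print(rescaled_simm_sketch(short_book, long_book))
print(unscaled_simm_sketch(short_book, long_book))
```

Under this assumed definition, the rescaled version compares only the shape of the distributions, while the unscaled version also penalizes the difference in book length.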
updated: 19/12/2016
by Garry Morrison
email: garry -at- semantic-db.org