introducing a new tool wikivec and wikivec similarity
OK. A really cool one today! Previously I had discovered that outgoing links on Wikipedia pages don't share the semantic similarity that you might expect. But usefully for us, incoming links do. This is a very strong effect, as we will see below, and it is the whole point of this post. Unfortunately, doing this all in the console was painfully slow, even after speeding up the superposition simm by a factor of 50 (yeah, cool, right!). So the console told me roughly what algorithm I needed, but I had to implement it elsewhere. Hence: create-wikivec.py and wikivec-similarity.py.
The create-wikivec.py code corresponds closely to this code:
sa: load 30k--wikipedia-links.sw
sa: find-inverse[links-to]
sa: hash-op |*> #=> hash[6] inverse-links-to |_self>
sa: map[hash-op,wikivec] rel-kets[inverse-links-to]
sa: save 30k--wikivec.sw
Interestingly, this is 5 lines of console code versus roughly 110 in create-wikivec.py. But create-wikivec.py repays that extra work by being efficient enough to build even 300k--wikivec.sw:
$ ./create-wikivec.py sw-examples/300k--wikipedia-links.sw
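To make the five console lines earlier a bit more concrete, here is a minimal Python sketch of the same idea: invert the links-to relation, then hash the title of every page that links in. It is not the actual create-wikivec.py. The names `hash6`, `create_wikivec` and `toy_links` are made up for illustration, and the truncated MD5 digest is only an assumed stand-in for whatever hash[6] really does.

```python
import hashlib
from collections import defaultdict

def hash6(label):
    # Assumption: a truncated MD5 hex digest stands in for the console's hash[6].
    return hashlib.md5(label.encode("utf-8")).hexdigest()[:6]

def create_wikivec(links_to):
    """links_to maps a page title to the list of titles it links to."""
    inverse = defaultdict(set)
    for source, targets in links_to.items():
        for target in targets:
            inverse[target].add(source)  # same role as find-inverse[links-to]
    # A page's wikivec is the set of hashed titles of the pages that link to it.
    return {page: {hash6(src) for src in sources}
            for page, sources in inverse.items()}

# Hypothetical toy link data, not taken from the real .sw files.
toy_links = {
    "Richard_Feynman": ["quantum_electrodynamics", "Julian_Schwinger"],
    "Julian_Schwinger": ["quantum_electrodynamics", "Richard_Feynman"],
    "Paul_Dirac": ["quantum_electrodynamics"],
}
wikivec = create_wikivec(toy_links)
```

Pages with no incoming links (here `Paul_Dirac`) get no wikivec at all in this sketch, which matches the console code only mapping over kets that have an inverse-links-to rule.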
Here are a couple of example learn rules in 300k--wikivec.sw:
wikivec |Adelaide_University> => |a6cf5f> + |be49ce> + |35cd3f> + |d96f29> + |ca656e> + |42f440> + |62b8b7>
wikivec |Adelaide,_South_Australia> => |63c9f6> + |c8e53e> + |9c3bfb> + |ae043c> + |4dcb61> + |d225b1> + |b09b3c> + |e66ecf> + |416586> + |7db43d> + |ab0cf1> + |9f14fe> + |9fdf79> + |315616> + |e6e424> + |4044ae> + |a20820> + |e48b2d> + |7a653a> + |6ac327> + |9b981d> + |2d6521> + |8cc4c4> + |914bf6> + |242091>
wikivec |Flinders_University> => |b8d999> + |788f0d> + |e40813> + |c88e1f> + |f04795> + |4cdeb5> + |6f7024> + |f324ae> + |410fc8> + |6adddf> + |315616> + |be9aaa> + |23c003> + |2c0a36> + |46f87c> + |00bef3> + |35cd3f> + |9c55f4> + |71fd70> + |d96f29> + |e971e6> + |f7fb89> + |458567> + |5fe8a1> + |250dc5> + |8e082a> + |049619>
Now, what is wikivec useful for? Well, it has the interesting property of capturing semantic similarity: you enter a wikipage, and you get out semantically similar wikipages.
Let's give some examples:
$ ./wikivec-similarity.py 'Richard Feynman'
----------------
wikipage: Richard_Feynman
pattern: |bc5f69> + |21188d> + |598f2b> + |4d7b53> + |d0bb87> + |4d30e1> + |e1c298> + |c50e18> + |535935> + |035c48> + |8a5907> + |dbe144> + |5e9c7f> + |8f25ec> + |0166e5> + |89780b> + |eb246d> + |cb6c0b> + |6a31a1> + |cf200e> + |3c1344> + |3dc31c> + |c31b08> + |6f6073> + |1f771c> + |3c8903> + |0bbb4a> + |c01cf3> + |c1b9b6> + |3dfbb4> + |c6c9ea> + |d85395> + |1266a1> + |0ff36f> + |83c6e6> + |6267c4> + |ff4a40> + |3a535a> + |ede342> + |b5cb0d> + |b34209> + |daaf19> + |880c95> + |5286a9> + |3cb0ba> + |93fe8e> + |d8cd9f> + |6582a8> + |44f9a7> + |f97758> + |7ef2ca> + |411fde> + |89385c> + |ff3f25> + |95a2c2> + |2a092d> + |bea185> + |8d9817> + |1093a0> + |00554e> + |a2b605> + |362c5f> + |3c9704> + |38ed16> + |6c1d88> + |42f6b0> + |4a7e62> + |bbcfea> + |c1d4ac> + |a3cf59> + |0a228a> + |9357f7> + |cb12a0> + |581cff> + |2a9cc1> + |4552fc> + |21eb3c> + |77caf9> + |7ed689>
pattern length: 79
----------------
1 Richard_Feynman 100.0
2 Werner_Heisenberg 24.39
3 special_relativity 20.79
4 Niels_Bohr 20.25
5 Paul_Dirac 20.25
6 particle_physics 20.22
7 classical_mechanics 20.0
8 fermion 18.99
9 spin_(physics) 18.99
10 Standard_Model 17.72
11 quantum_field_theory 17.72
12 Schrödinger_equation 17.72
13 electromagnetism 17.24
14 Pauli_exclusion_principle 16.46
15 Julian_Schwinger 16.46
16 Erwin_Schrödinger 16.46
17 quark 16.46
18 Stephen_Hawking 16.46
19 quantum_electrodynamics 16.46
20 Category:Concepts_in_physics 16.28
21 statistical_mechanics 15.19
22 Enrico_Fermi 15.19
23 Dirac_equation 15.19
24 Max_Planck 15.19
25 matter 15.09
26 photon 14.19
27 general_relativity 14.17
28 thermodynamics 14.13
29 photoelectric_effect 13.92
30 Max_Born 13.92
Enter table row number, or wikipage: Stephen_Hawking
----------------
wikipage: Stephen_Hawking
pattern: |ea507a> + |10f49f> + |ea5ab5> + |c5b981> + |a7822a> + |4b4d7e> + |31d840> + |811950> + |0f5ef5> + |f3ceb4> + |3a2d42> + |a97ca0> + |e06e72> + |4cf7a5> + |7b8d5f> + |ff3f25> + |6d8404> + |3dfbb4> + |4d1323> + |f9ebe8> + |54060b> + |b0de39> + |6f6073> + |745049> + |a077a1> + |f4e699> + |8978b1> + |c1b9b6> + |0166e5> + |4a7e62> + |5853ee> + |4552fc> + |7bd69d> + |c266af> + |abfdf5> + |c27486> + |5ba8b0> + |d21237> + |cc5274> + |b6baba> + |483336> + |b5cb0d> + |5eca63> + |ab8f77> + |d43833> + |d17d5d> + |635adb> + |9c11b5> + |ffd306> + |7ef2ca> + |8ca453> + |8f25ec> + |dbe144> + |6187c8> + |df0825> + |9795ef> + |8d5d02> + |880c95> + |3e34b6> + |6681d0> + |b9de4b> + |ad5b5d> + |c54116> + |cdc8cb> + |bbcfea> + |1209bc> + |798fe0> + |3f31b3> + |92603f> + |cffb3d> + |f41208> + |8a4178> + |2b6101>
pattern length: 73
----------------
1 Stephen_Hawking 100.0
2 Roger_Penrose 27.4
3 quantum_gravity 23.29
4 black_hole 22.08
5 string_theory 20.55
6 Big_Bang 20.39
7 spacetime 18.18
8 Universe 17.81
9 general_relativity 17.5
10 Richard_Feynman 16.46
11 cosmology 16.46
12 quantum_field_theory 16.44
13 Niels_Bohr 15.19
14 Standard_Model 15.07
15 John_Archibald_Wheeler 15.07
16 Paul_Dirac 15.07
17 Werner_Heisenberg 14.63
18 Michael_Faraday 14.63
19 particle_physics 14.61
20 electromagnetism 13.79
21 dark_matter 13.7
22 second_law_of_thermodynamics 13.7
23 space 13.7
24 Enrico_Fermi 13.7
25 Erwin_Schrödinger 13.7
26 causality 13.7
27 classical_mechanics 12.94
28 universe 12.5
29 supersymmetry 12.33
30 Steven_Weinberg 12.33
Enter table row number, or wikipage: 4
----------------
wikipage: black_hole
pattern: |c4c39b> + |6a9e58> + |bea185> + |7ef2ca> + |cc5274> + |dbe144> + |291a64> + |205a51> + |8f25ec> + |634f08> + |894281> + |be6128> + |7bd69d> + |a97ca0> + |a8e57c> + |d1445e> + |4d1323> + |1da685> + |be410c> + |b0de39> + |6f6073> + |7ab386> + |91da93> + |d43833> + |81d9b5> + |3a2d42> + |3dfbb4> + |e06e72> + |d8f06f> + |b005aa> + |a12d9e> + |17e6c4> + |db32bc> + |548a4b> + |d23d08> + |2a9cc1> + |cf96ec> + |277acc> + |ec9e16> + |870dd1> + |d51b2d> + |75030e> + |445551> + |ffd306> + |68aced> + |e85f02> + |581cff> + |3e5d6e> + |c62201> + |b4e81b> + |aab465> + |a2fd45> + |70b142> + |89385c> + |343f48> + |df0825> + |cbf57c> + |362c5f> + |3e34b6> + |a5ffc2> + |50345c> + |c403e1> + |bd8415> + |a94489> + |1a3d5b> + |96346d> + |3647b8> + |0cd2d4> + |dba224> + |b9c56b> + |f41208> + |298068> + |067848> + |44ae31> + |71ebb5> + |8bcf88> + |84782a>
pattern length: 77
----------------
1 black_hole 100.0
2 spacetime 32.47
3 neutron_star 31.17
4 general_relativity 24.17
5 galaxy 23.75
6 event_horizon 23.38
7 white_dwarf 22.08
8 Stephen_Hawking 22.08
9 dark_matter 20.78
10 gravitation 20.78
11 neutrino 20.78
12 quantum_gravity 19.48
13 pulsar 19.48
14 astrophysics 19.48
15 Big_Bang 19.42
16 quantum_field_theory 18.18
17 matter 17.92
18 supernova 17.48
19 Monthly_Notices_of_the_Royal_Astronomical_Society 17.2
20 dark_energy 16.88
21 solar_mass 16.88
22 elementary_particle 16.88
23 special_relativity 16.83
24 star 16.04
25 string_theory 15.58
26 Standard_Model 15.58
27 gravitational_constant 15.58
28 photon 15.54
29 gravity 15.5
30 universe 15.44
Enter table row number, or wikipage: 25
----------------
wikipage: string_theory
pattern: |10f49f> + |5bf012> + |21188d> + |7ef2ca> + |bf1c18> + |b9dd13> + |2a092d> + |0166e5> + |e06e72> + |10f129> + |745049> + |ff2ac9> + |bc58c1> + |5c3d64> + |471066> + |3e4bfb> + |c266af> + |81249b> + |b032fd> + |1266a1> + |80153f> + |483336> + |d43833> + |1f252d> + |75030e> + |c50e18> + |0ff36f> + |84cbb6> + |2a9cc1> + |8a4178> + |45b55a> + |8f25ec> + |dbe144> + |008ea6> + |be6128> + |3e34b6> + |a4fc1e> + |9cf44e> + |f84a1a> + |cc5274> + |f41208> + |1da685> + |22bdaa> + |2b6101> + |c4e210>
pattern length: 45
----------------
1 string_theory 100.0
2 quantum_gravity 40.0
3 Standard_Model 36.73
4 quark 32.08
5 supersymmetry 31.11
6 dark_energy 28.89
7 gluon 28.89
8 quantum_field_theory 28.17
9 theoretical_physics 27.66
10 spacetime 27.27
11 Roger_Penrose 26.67
12 theory_of_everything 26.67
13 quantum_electrodynamics 26.67
14 particle_physics 25.84
15 space 25.0
16 strong_interaction 24.44
17 John_Archibald_Wheeler 24.44
18 M-theory 24.44
19 dark_matter 22.92
20 boson 22.22
21 Steven_Weinberg 22.22
22 loop_quantum_gravity 22.22
23 fundamental_interaction 22.22
24 gravitation 22.03
25 fermion 21.15
26 general_relativity 20.83
27 Stephen_Hawking 20.55
28 gauge_theory 20
29 Minkowski_space 20
30 quantum_physics 20
Enter table row number, or wikipage: california
----------------
wikipage: California
pattern: |b0c37e> + |0a36a9> + |b97200> + |4b4d7e> + |240c6b> + |b7204b> + |48fbc6> + |a6ff5a> + ...
pattern length: 793
----------------
1 California 100.0
2 Los_Angeles 18.79
3 New_York 18.54
4 Texas 16.77
5 President_of_the_United_States 16.02
6 Massachusetts 14.88
7 San_Francisco 14.5
8 Washington,_D.C. 13.87
9 New_York_City 13.27
10 Illinois 13.11
11 Republican_Party_(United_States) 12.74
12 Florida 12.61
13 Pennsylvania 12.23
14 Democratic_Party_(United_States) 12.11
15 United_States_Senate 11.6
16 The_New_York_Times 11.22
17 Michigan 10.97
18 American_Civil_War 10.84
19 Mexico 10.84
20 Virginia 10.84
21 Nobel_Prize_in_Physiology_or_Medicine 10.59
22 Ohio 10.59
23 New_Jersey 10.47
24 Arizona 10.34
25 Chicago 10.21
26 Oregon 10.09
27 Nobel_Prize_in_Chemistry 10.09
28 Minnesota 9.84
29 Nobel_Prize_in_Physics 9.84
30 Ronald_Reagan 9.71
Enter table row number, or wikipage: Alan_Turing
----------------
wikipage: Alan_Turing
pattern: |045fbc> + |4efc96> + |c4b5ac> + |cdb6ba> + |645e2a> + |7b8d5f> + |d266e8> + |b2151a> + |c55099> + |c26306> + |762dc9> + |84782a> + |4ae48b> + |3f4cd2> + |72ee56> + |d7fa06> + |82fdb5> + |52d2f2> + |1ef969> + |64363c> + |61a732> + |b404e7> + |628dac> + |2b081b> + |923ec9> + |5853ee> + |41bd03> + |c54116> + |798d9f> + |2e9ea3> + |afe718> + |bb5349> + |4cb705> + |1209bc> + |d38636> + |286eaf> + |997643> + |c7ee9d> + |66c387> + |6a31a1> + |26cdc2> + |d21237> + |c939a2> + |f3fac2> + |20eb27> + |9a8319> + |82d1e3> + |33fe27> + |247805> + |0cc34a> + |5aa405> + |f88a44> + |fe443c> + |921715> + |3c10cd> + |7c0b6c> + |0c2b0a> + |a85db0> + |2ab320> + |ff3f25> + |c5e594> + |11786b> + |e14fde> + |b8866c> + |d2a848> + |933aa2> + |50316a> + |4f6b9a> + |c02ba1> + |c09ec5> + |96d4da> + |335e2a> + |210256> + |a077a1> + |cb12a0> + |a487fe> + |3a535a> + |544f2c> + |e33db4> + |468b43> + |9bb8a2> + |77756e> + |bf581a> + |70414f>
pattern length: 84
----------------
1 Alan_Turing 100.0
2 Kurt_Gödel 26.19
3 John_von_Neumann 20.69
4 Alonzo_Church 19.05
5 Bletchley_Park 17.86
6 Charles_Babbage 15.48
7 Enigma_machine 14.29
8 artificial_intelligence 14.07
9 Claude_Shannon 13.1
10 theoretical_computer_science 13.1
11 Turing_machine 13.1
12 mathematical_logic 13.1
13 David_Hilbert 13.1
14 Marvin_Minsky 11.9
15 Colossus_computer 11.9
16 cognitive_science 11.9
17 cryptography 11.7
18 formal_language 10.71
19 halting_problem 10.71
20 Alfred_North_Whitehead 10.71
21 Ada_Lovelace 10.71
22 Universal_Turing_machine 10.71
23 cryptanalysis 10.71
24 John_Searle 10.71
25 theory_of_computation 10.71
26 set_theory 10.11
27 Georg_Cantor 9.52
28 lambda_calculus 9.52
29 philosophy_of_mind 9.52
30 National_Security_Agency 9.52
Enter table row number, or wikipage: 3
----------------
wikipage: John_von_Neumann
pattern: |39ea50> + |95e32c> + |280f40> + |0cffd2> + |133177> + |352c38> + |fe1ab4> + |7bd69d> + |12dbd2> + |72f5aa> + |5001d5> + |f33c86> + |4f6b9a> + |a1dd12> + |bbf17a> + |3689f8> + |cd8d71> + |545c7c> + |40edd4> + |6c5b91> + |d2a848> + |759ba4> + |3038c6> + |880c95> + |247805> + |921715> + |35f6bd> + |f54210> + |71771d> + |a0a9c1> + |ff3f25> + |3c8b72> + |6c1d88> + |c70e25> + |c54116> + |01ae36> + |33c924> + |a077a1> + |cb12a0> + |a38825> + |22bdaa> + |2b6101> + |1c9755> + |50bf91> + |04335a> + |c55099> + |cbe282> + |f84a1a> + |72ee56> + |2c4d42> + |a38649> + |eb246d> + |cb6c0b> + |64363c> + |7975b9> + |923ec9> + |5071a0> + |8c811e> + |409b48> + |0166e5> + |a16ffc> + |b2e663> + |e14fde> + |26cdc2> + |cdb6ba> + |f80b3d> + |d43833> + |672229> + |3ad69a> + |df7e9a> + |5a9c38> + |008ea6> + |d1ce05> + |0c2b0a> + |8540a1> + |7f6407> + |3357cc> + |89b2da> + |7ed689> + |c2c6f1> + |9b29fd> + |5ba8b0> + |575d02> + |70414f> + |2161b5> + |634c4f> + |c02ba1>
pattern length: 87
----------------
1 John_von_Neumann 100.0
2 Kurt_Gödel 21.84
3 Alan_Turing 20.69
4 Werner_Heisenberg 17.24
5 game_theory 17.24
6 Niels_Bohr 16.09
7 David_Hilbert 14.94
8 Max_Born 13.79
9 Erwin_Schrödinger 13.79
10 Max_Planck 13.79
11 information_theory 12.64
12 Edward_Teller 12.64
13 Paul_Dirac 12.64
14 Manhattan_Project 11.76
15 Robert_Oppenheimer 11.49
16 Enrico_Fermi 11.49
17 probability 10.68
18 Category:Foreign_Members_of_the_Royal_Society 10.34
19 statistical_mechanics 10.34
20 Gottfried_Wilhelm_Leibniz 10.34
21 Wolfgang_Pauli 10.34
22 Los_Alamos_National_Laboratory 10.34
23 Claude_Shannon 10.34
24 mathematical_logic 10.34
25 Hans_Bethe 10.34
26 Institute_for_Advanced_Study 10.34
27 Stephen_Hawking 10.34
28 Hermann_Weyl 10.34
29 operations_research 10.34
30 Henri_Poincaré 10.34
Enter table row number, or wikipage: knowledge
----------------
wikipage: knowledge
pattern: |57328a> + |1cade0> + |a2c164> + |3f45f2> + |ded1b1> + |e88da4> + |00ed19> + |f27a1e> + |4e5143> + |5b8fba> + |dd808d> + |ce7f83> + |2dd2d7> + |0597a6> + |8e72ff> + |bed3d8> + |05eb39> + |dc0ea6> + |f0bfb6> + |90f967> + |a56bee> + |d75192> + |cfe9cb> + |9d6a01> + |5cbb89> + |ef61be> + |d8cd9f> + |dc79de> + |d2e0b1> + |5238d3> + |2e00b5> + |148ecf> + |cf23e9> + |8a4178> + |d21237> + |fed2d8> + |092ffc> + |5f08dc> + |46b618> + |294cf5> + |2a6072> + |83b8cc> + |e14e23> + |7e3504> + |c808ec> + |96a0e3> + |cf8f52> + |0c2b0a> + |a85db0> + |ff3f25> + |fd9ffa> + |1477d8> + |d2a848> + |fcd820> + |f9252f> + |6fd899> + |668c81> + |0bd005> + |d3f8ad> + |ed2624> + |4d6f6d> + |a077a1> + |8c6d7c> + |7586a4> + |3a535a> + |a35319> + |99dfdc> + |188841> + |6af893> + |f8c970>
pattern length: 70
----------------
1 knowledge 100.0
2 epistemology 19.83
3 nature 18.92
4 reason 18.67
5 truth 18.57
6 mind 15.71
7 William_James 14.29
8 ontology 14.29
9 empiricism 14.29
10 Descartes 14.29
11 perception 14.29
12 Empiricism 14.29
13 Karl_Popper 14.1
14 Charles_Sanders_Peirce 13.89
15 Søren_Kierkegaard 13.7
16 John_Dewey 12.86
17 experiment 12.86
18 experience 12.86
19 theory 12.86
20 empirical 12.86
21 rationalism 12.86
22 reality 12.86
23 René_Descartes 11.89
24 consciousness 11.63
25 Epistemology 11.43
26 fact 11.43
27 philosophy_of_mind 11.43
28 cognitive_science 11.43
29 neuroscience 11.43
30 idealism 11.43
Enter table row number, or wikipage: love
----------------
wikipage: love
pattern: |886dc0> + |b61f68> + |f0bfb6> + |abe3e9> + |e80af4> + |e1a113> + |a2558d> + |85c746> + |dc8653> + |c00299> + |041580> + |b97749> + |c488db> + |95471f> + |048171> + |a13607> + |e71ed4> + |9a4e5a> + |245b0d> + |bb8a1b> + |6a39ef> + |0fddfd> + |69197a> + |a35319> + |a2be34> + |ab5578> + |7cbcd7> + |3220b7> + |2877db> + |8fcda4> + |b2f462>
pattern length: 31
----------------
1 love 100.0
2 fear 12.9
3 happiness 12.9
4 motivation 11.76
5 Daniel_Dennett 10
6 Al-Farabi 9.68
7 anger 9.68
8 friendship 9.68
9 John_the_Evangelist 9.68
10 olfaction 9.68
11 hope 9.68
12 logos 9.68
13 awareness 9.68
14 limbic_system 9.68
15 feeling 9.68
16 Personality_psychology 9.68
17 emotion 8.7
18 Brahman 8.57
19 neuroscience 8.51
20 Atonement_in_Christianity 7.69
21 idealism 7.5
22 cognitive_science 7.32
23 reason 6.67
24 cognition 6.67
25 Averroes 6.52
26 secular_humanism 6.45
27 Hackett_Publishing 6.45
28 Heaven_(Christianity) 6.45
29 Theory_of_Forms 6.45
30 anomie 6.45
Enter table row number, or wikipage: 17
----------------
wikipage: emotion
pattern: |a387d6> + |0f7858> + |eaee0f> + |4ade23> + |2e1fb7> + |fdc3fd> + |6c0414> + |e95f90> + |996558> + |fdf2e7> + |12223e> + |c4167a> + |0d744a> + |dc0ab0> + |e12a12> + |0f5be9> + |b62879> + |e5ee5a> + |69197a> + |3ffea4> + |85c746> + |7e8cd5> + |883680> + |f0bfb6> + |96dba7> + |35c74d> + |f1d687> + |dfab09> + |19f554> + |2d2b41> + |ddf188> + |834b10> + |d7125e> + |dbfcca> + |049195> + |0c2b0a> + |7911e1> + |6261f9> + |cdd6f9> + |8429b8> + |31f076> + |737ab3> + |a35319> + |5d273d> + |312b37> + |cef78a>
pattern length: 46
----------------
1 emotion 100.0
2 perception 24.07
3 cognition 21.74
4 mind 17.74
5 thought 17.39
6 memory 16.67
7 behavior 15.22
8 motivation 15.22
9 cognitive_science 13.04
10 reality 13.04
11 psychotherapy 13.04
12 virtue 10.87
13 evolutionary_psychology 10.87
14 imagination 10.87
15 anger 10.87
16 Brahman 10.87
17 lesion 10.87
18 human_behavior 10.87
19 attention 10.87
20 cognitive_neuroscience 10.87
21 creativity 10.87
22 hippocampus 10.87
23 intelligence 10.87
24 awareness 10.87
25 Category:Emotions 10.87
26 feeling 10.87
27 happiness 10.87
28 anxiety 10.61
29 social_science 9.68
30 ritual 9.62
Enter table row number, or wikipage: science
----------------
wikipage: science
pattern: |d5c0cb> + |005999> + |a4dd79> + |4b4d7e> + |957ba7> + |01f3a7> + |c50366> + ...
pattern length: 221
----------------
1 science 100.0
2 biology 21.68
3 philosophy 20.63
4 engineering 17.19
5 scientific_method 17.19
6 religion 16.37
7 psychology 15.84
8 physics 15.03
9 economics 14.93
10 medicine 14.03
11 chemistry 13.62
12 technology 13.57
13 Isaac_Newton 13.56
14 logic 13.12
15 Aristotle 12.81
16 ethics 12.22
17 Immanuel_Kant 11.76
18 history 11.31
19 astronomy 11.16
20 evolution 11.11
21 Plato 10.89
22 sociology 10.86
23 theology 10.86
24 Charles_Darwin 10.57
25 art 10.41
26 quantum_mechanics 10.08
27 Karl_Popper 9.95
28 epistemology 9.95
29 Albert_Einstein 9.63
30 René_Descartes 9.5
Enter table row number, or wikipage: DNA
----------------
wikipage: DNA
pattern: |5822da> + |ea8418> + |26c813> + |c06420> + |431b1f> + |2072d0> + |4cf7a5> + |fcd7d4> + ...
pattern length: 377
----------------
1 DNA 100.0
2 protein 31.56
3 RNA 27.32
4 bacteria 19.63
5 amino_acid 18.3
6 enzyme 17.77
7 oxygen 16.12
8 gene 15.92
9 genome 15.38
10 Nature_(journal) 14.85
11 evolution 14.85
12 cell_(biology) 14.59
13 eukaryote 13.79
14 hydrogen 12.2
15 molecular_biology 11.94
16 species 11.94
17 biochemistry 11.14
18 Science_(journal) 11.14
19 organism 11.14
20 genetics 11.14
21 virus 10.88
22 nucleic_acid 10.61
23 metabolism 10.61
24 mutation 10.61
25 polymer 10.08
26 adenine 9.81
27 nucleotide 9.81
28 carbon_dioxide 9.81
29 molecule 9.55
30 chromosome 9.28
Enter table row number, or wikipage: Tim Berners-Lee
----------------
wikipage: Tim_Berners-Lee
pattern: |ee463a> + |0a36a9> + |909162> + |35f462> + |556fc5> + |553b20> + |40ace1> + |d266e8> + |770f49> + |57b9bb> + |d8d95d> + |9e72df> + |5a576a> + |6ba526> + |7b8d5f> + |00d231> + |ef2628> + |189b17> + |434f12> + |d8fc0d> + |d0bb87> + |90f130> + |6d3e37> + |8831ba> + |e23565> + |be9708> + |09d59b> + |395a2d> + |1626d2> + |d2aac8> + |26cdc2> + |1209bc> + |0f7c6c> + |520178> + |bd1488> + |b9ea7a> + |c96252> + |33fe27> + |b03400> + |6a57a4> + |f898c2> + |445704> + |d7beb8> + |6a3f0c> + |2e5f4c> + |b9de4b> + |4a7e62> + |21d730> + |6e14c2> + |949fd1> + |519ba7> + |263003> + |d9cead> + |fce081> + |70414f>
pattern length: 55
----------------
1 Tim_Berners-Lee 100.0
2 hypertext 29.09
3 World_Wide_Web_Consortium 29.09
4 World_Wide_Web 20.93
5 Cascading_Style_Sheets 20.0
6 CERN 20.0
7 Vannevar_Bush 18.18
8 hyperlink 16.36
9 Hypertext_Transfer_Protocol 16.36
10 XHTML 16.36
11 web_browser 15.84
12 HTML 15.65
13 File_Transfer_Protocol 14.55
14 WorldWideWeb 14.55
15 W3C 14.55
16 Netscape 14.55
17 ARPANET 14.55
18 web_server 14.55
19 HTTP 13.56
20 Internet_Engineering_Task_Force 12.86
21 wiki 12.73
22 Request_for_Comments 12.73
23 TCP/IP 12.73
24 Internet_Explorer 12.73
25 Robert_Cailliau 12.73
26 Web_search_engine 12.73
27 Ted_Nelson 12.73
28 IP_address 12.73
29 Domain_Name_System 12.73
30 markup_language 12.73
Enter table row number, or wikipage: q
Perhaps I should try to explain a little of what the code is doing. You start with the exact title of a Wikipedia page (frequently I have to use Google to find the one I'm after). Then the code converts that title to its wikivec using a straight key lookup in a hash table/dictionary:
pattern = sw_dict[wikipage]
Then it uses my similarity metric to search the entire dictionary worth of wikivec patterns, and returns the most similar:
# clean superposition version:
def faster_pattern_recognition(dict,pattern,t=0):
    result = []
    for label,sp in dict.items():
        value = faster_simm(pattern,sp)
        if value > t:
            result.append((label,value))
    return result
# find matching patterns:
result = faster_pattern_recognition(sw_dict,pattern)
Finally, it sorts the result, keeps the best 30 results, and then pretty prints it in table form.
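As a rough illustration of the search, sort and trim steps, here is a self-contained sketch. The real faster_simm is not shown in this post, so `set_simm` below is an assumption: for clean superpositions (all coefficients equal to 1) it takes the number of shared kets divided by the length of the longer pattern. The scores in the tables above are at least consistent with that form, for example 16.46 is what 13 shared kets out of 79 would give. `top_matches`, `print_table` and `toy_dict` are hypothetical names.

```python
def set_simm(a, b):
    # Assumed form of the similarity for clean superpositions:
    # shared kets divided by the length of the longer pattern.
    if not a or not b:
        return 0.0
    return len(a & b) / max(len(a), len(b))

def top_matches(sw_dict, pattern, k=30, t=0.0):
    # Score every stored pattern, drop anything at or below threshold t,
    # then keep the k best, highest similarity first.
    scored = [(label, set_simm(pattern, sp)) for label, sp in sw_dict.items()]
    scored = [(label, value) for label, value in scored if value > t]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

def print_table(rows):
    # Rank, label, similarity as a percentage.
    for rank, (label, value) in enumerate(rows, start=1):
        print(rank, label, round(100 * value, 2))

# Hypothetical toy patterns, represented as sets of ket labels.
toy_dict = {
    "A": {"a1", "a2", "a3", "a4"},
    "B": {"a1", "a2", "b1"},
    "C": {"c1"},
}
rows = top_matches(toy_dict, toy_dict["A"])
print_table(rows)
```

Using Python sets only works because every coefficient is 1; the letter-ngram patterns later in the post need a weighted version.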
But what if you don't know the exact wikipage, and don't want to google it all the time? Well, we use our pattern recognition code again, this time against letter ngrams. First, we have this code that converts a string to a letter-ngram superposition:
def process_string(s):
    def create_letter_n_grams(s,N):
        for i in range(len(s)-N+1):
            yield s[i:i+N]
    r = superposition()
    for k in [1,2,3]:
        for w in create_letter_n_grams(s.lower(),k):
            r.add(w)
    return r
Then we create a guess-dictionary, which is a mapping of the wikivec keys to these letter ngram superpositions:
def load_sw_dict_into_guess_dict(sw_dict):
    dict = {}
    for key in sw_dict:
        dict[key] = process_string(key)
    return dict
Now, what does this thing look like? Here are a couple of entries in the guess-dict:
|dog_paddle> => |a> + |_p> + |pa> + |e> + |o> + |pad> + |dl> + |dd> + |_pa> + |do> + |le> + |og> + |ddl> + |dle> + |g_> + |g_p> + |add> + |p> + |og_> + 3|d> + |_> + |ad> + |g> + |l> + |dog>
|podium> => |um> + |pod> + |di> + |i> + |iu> + |o> + |od> + |p> + |po> + |odi> + |m> + |diu> + |u> + |d> + |ium>
|Lists_of_mammals_by_region> => 2|a> + |e> + |ts_> + |by> + |_ma> + |f_> + |_m> + |ts> + |y> + 2|ma> + |st> + |_b> + |by_> + |ls> + |_by> + 2|i> + |al> + |y_r> + |als> + |reg> + |mm> + |_r> + 3|s> + |egi> + 2|s_> + |ls_> + |y_> + 4|_> + 3|m> + 2|l> + |ist> + |b> + |gi> + |of_> + |is> + 2|o> + |mma> + |mal> + |of> + |s_b> + |lis> + |gio> + |_re> + |am> + |ion> + |t> + |li> + |r> + |_o> + |eg> + |on> + |n> + |sts> + |f> + |amm> + |io> + |_of> + |mam> + |g> + |s_o> + |re> + |f_m>
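If you don't have the superposition class to hand, the same letter-ngram counts can be reproduced with a plain `collections.Counter`. This is only a stand-in sketch (the function name `letter_ngrams` is made up here), but it gives the same kets and coefficients as the entries above: 15 distinct ngrams for podium, and a coefficient of 3 on d for dog_paddle.

```python
from collections import Counter

def letter_ngrams(s, sizes=(1, 2, 3)):
    # Count every letter 1-gram, 2-gram and 3-gram of the lower-cased string.
    s = s.lower()
    counts = Counter()
    for n in sizes:
        for i in range(len(s) - n + 1):
            counts[s[i:i + n]] += 1
    return counts

podium = letter_ngrams("podium")
dog_paddle = letter_ngrams("dog_paddle")
```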
Finally we have the guess-ket function. You feed it a string, and it outputs the best matching key in the guess-dictionary. In this case, if you feed in a string, it will output the title of the wikipage that most closely matches that string:
# find the key in the dictionary that is closest to s:
def guess_ket(guess_dict,s):
    pattern = process_string(s)
    result = ''
    best_simm = 0
    for label,sp in guess_dict.items():
        similarity = fast_simm(pattern,sp) # can't use faster_simm, since some coeffs are not equal 1
        if similarity > best_simm:
            result = label
            best_simm = similarity
    return result
So we have two levels here. If the input to wikivec-similarity.py is not an exact wikipage title, we first run it through guess-ket (a tweak on our pattern-recognition code), which spits out the best matching wikipage title. That title is then fed into the pattern recognition code, which returns the top 30 closest wikipages based on their wikivec similarity.
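The first of those two levels can be sketched as follows. Again, this is a hedged stand-in and not the code from wikivec-similarity.py: `weighted_simm` assumes that fast_simm is the sum of coefficient-wise minima divided by the larger coefficient total, and `letter_ngrams`, `resolve_title` and the toy titles are illustrative names only.

```python
from collections import Counter

def letter_ngrams(s, sizes=(1, 2, 3)):
    # Letter 1-, 2- and 3-gram counts of the lower-cased string.
    s = s.lower()
    counts = Counter()
    for n in sizes:
        for i in range(len(s) - n + 1):
            counts[s[i:i + n]] += 1
    return counts

def weighted_simm(f, g):
    # Assumed form of fast_simm: sum of coefficient-wise minima,
    # divided by the larger of the two coefficient totals.
    if not f or not g:
        return 0.0
    shared = sum(min(count, g[key]) for key, count in f.items())
    return shared / max(sum(f.values()), sum(g.values()))

def resolve_title(sw_dict, guess_dict, text):
    # Level one: an exact title needs no guessing.
    if text in sw_dict:
        return text
    # Otherwise pick the known title whose letter ngrams best match the input.
    pattern = letter_ngrams(text)
    return max(guess_dict,
               key=lambda label: weighted_simm(pattern, guess_dict[label]),
               default="")

# Hypothetical toy data: only the keys matter for title guessing.
sw_dict = {"California": set(), "Richard_Feynman": set(), "black_hole": set()}
guess_dict = {title: letter_ngrams(title) for title in sw_dict}

guessed = resolve_title(sw_dict, guess_dict, "Richard Feynman")
```

The resolved title would then be looked up in the wikivec dictionary and passed to the pattern recognition step, exactly as for an exact title.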
Unfortunately guess-ket is a bit on the slow side, taking roughly a minute. Presumably if I used a standard edit distance, as spell-check engines do, it could be made faster. But that would kind of defeat the point of this post, which is to make use of our superposition pattern recognition code over two distinctly different superposition types.
A final note is that this post was motivated by the word2vec cosine similarity results, though our wikivecs don't seem to have the nice vector composition properties of word2vec. I quote:
It was recently shown that the word vectors capture many linguistic regularities,
for example vector operations vector('Paris') - vector('France') + vector('Italy')
results in a vector that is very close to vector('Rome'),
and vector('king') - vector('man') + vector('woman') is close to vector('queen')
Whew! That's it for this post.
updated: 19/12/2016
by Garry Morrison
email: garry -at- semantic-db.org