revisiting wikipedia inverse-links-to semantic similarities
Nothing new here, just some more wikipedia semantic similarity examples. The motivation was partly word2vec, which ships some examples of semantic similarity using its word vectors. I had planned to write my own word2sp, but so far my idea has failed! And I couldn't use their word vectors directly, because they have negative coefficients, while my similarity metric requires non-negative coefficients.
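For concreteness, the metric I have in mind is, roughly, the shared weight of two non-negative sparse vectors divided by the larger total weight. Here is a minimal Python sketch (the name simm and the dict representation are just for illustration):

def simm(f, g):
    # f, g: dicts mapping label -> non-negative coefficient.
    # Shared weight over the larger total weight. With negative
    # coefficients the min() logic below no longer makes sense,
    # which is why word2vec vectors can't be dropped straight in.
    shared = sum(min(f[k], g[k]) for k in f.keys() & g.keys())
    norm = max(sum(f.values()), sum(g.values()))
    return shared / norm if norm > 0 else 0.0

simm({'a': 1, 'b': 2}, {'b': 1, 'c': 3})   # 1/4 = 0.25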
So in the meantime I decided to re-run my wikipedia code. I tried to use the 300,000-link wikipedia sw file, but that failed too: it needed too much RAM and took too long to run. I thought I had used it in the past, in which case I don't know why it failed this time!
Here is the first word2vec example (distance to "france"):
Word Cosine distance
-------------------------------------------
spain 0.678515
belgium 0.665923
netherlands 0.652428
italy 0.633130
switzerland 0.622323
luxembourg 0.610033
portugal 0.577154
russia 0.571507
germany 0.563291
catalonia 0.534176
Here it is using my code:
sa: load 30k--wikipedia-links.sw
sa: find-inverse[links-to]
sa: T |*> #=> table[page,coeff] select[1,200] 100 self-similar[inverse-links-to] |_self>
sa: T |WP: France>
+--------------------------------+--------+
| page | coeff |
+--------------------------------+--------+
| France | 100.0 |
| Germany | 31.771 |
| United_Kingdom | 30.537 |
| Italy | 27.452 |
| Spain | 23.566 |
| United_States | 20.152 |
| Japan | 19.556 |
| Netherlands | 19.309 |
| Russia | 18.877 |
| Canada | 18.384 |
| Europe | 17.273 |
| India | 17.212 |
| China | 16.78 |
| Paris | 16.595 |
| England | 16.286 |
| World_War_II | 15.923 |
| Australia | 15.238 |
| Soviet_Union | 14.867 |
| Belgium | 14.189 |
| Poland | 14.127 |
| Portugal | 13.819 |
| World_War_I | 13.757 |
| Austria | 13.695 |
| Sweden | 13.572 |
| Switzerland | 13.51 |
| Egypt | 12.647 |
| European_Union | 12.4 |
| Brazil | 12.338 |
| United_Nations | 12.091 |
| Greece | 11.906 |
| London | 11.906 |
| Israel | 11.783 |
| Turkey | 11.783 |
| Denmark | 11.598 |
| French_language | 11.536 |
| Norway | 11.413 |
| Latin | 10.611 |
| Rome | 10.364 |
| Mexico | 10.364 |
| English_language | 9.994 |
| South_Africa | 9.747 |
...
which works pretty well, I must say.
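For anyone who wants the gist without the BKO notation, here is a rough Python rendering of what that pipeline does. The names links, find_inverse and similarity_table are hypothetical; links is assumed to map each page to the set of pages it links to:

from collections import defaultdict

def find_inverse(links):
    # Build inverse-links-to: page -> set of pages that link to it.
    inverse = defaultdict(set)
    for page, targets in links.items():
        for target in targets:
            inverse[target].add(page)
    return inverse

def similarity_table(inverse, target, top=40):
    # Rank pages by similarity of their inverse-link patterns against
    # target, scaled to 100. With 0/1 coefficients, simm reduces to
    # |f & g| / max(|f|, |g|). The target itself scores 100, matching
    # the first row of the tables above.
    f = inverse[target]
    results = []
    for page, g in inverse.items():
        norm = max(len(f), len(g))
        if norm:
            results.append((page, 100.0 * len(f & g) / norm))
    return sorted(results, key=lambda r: -r[1])[:top]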
Here is the next word2vec example (distance to "san francisco"):
Word Cosine distance
-------------------------------------------
los_angeles 0.666175
golden_gate 0.571522
oakland 0.557521
california 0.554623
san_diego 0.534939
pasadena 0.519115
seattle 0.512098
taiko 0.507570
houston 0.499762
chicago_illinois 0.491598
Here it is using my code:
+---------------------------------------+--------+
| page | coeff |
+---------------------------------------+--------+
| San_Francisco | 100.0 |
| Los_Angeles | 16.704 |
| Chicago | 15.919 |
| 1924 | 15.522 |
| 1916 | 14.566 |
| California | 14.502 |
| 1915 | 14.286 |
| 2014 | 14.217 |
| 1933 | 14.031 |
| 1913 | 14.006 |
| 1918 | 14.0 |
| 1930 | 14.0 |
| Philadelphia | 13.99 |
| 1925 | 13.984 |
| 1931 | 13.904 |
| 1920 | 13.802 |
| 1932 | 13.776 |
| 1942 | 13.744 |
| 1999 | 13.725 |
...
Hrmm... that didn't work so great. I wonder why. Perhaps the year pages are linked from so many articles that their inverse-link patterns overlap with nearly everything.
Here is the next word2vec example:
Enter word or sentence (EXIT to break): /en/geoffrey_hinton
Word Cosine distance
--------------------------------------------------
/en/marvin_minsky 0.457204
/en/paul_corkum 0.443342
/en/william_richard_peltier 0.432396
/en/brenda_milner 0.430886
/en/john_charles_polanyi 0.419538
/en/leslie_valiant 0.416399
/en/hava_siegelmann 0.411895
/en/hans_moravec 0.406726
/en/david_rumelhart 0.405275
/en/godel_prize 0.405176
And here it is using my code:
+------------------------------------------------------------+--------+
| page | coeff |
+------------------------------------------------------------+--------+
| Geoffrey_Hinton | 100 |
| perceptron | 66.667 |
| Tom_M._Mitchell | 66.667 |
| computational_learning_theory | 66.667 |
| Nils_Nilsson_(researcher) | 66.667 |
| beam_search | 66.667 |
| Raj_Reddy | 50 |
| AI_effect | 40 |
| ant_colony_optimization | 40 |
| List_of_artificial_intelligence_projects | 33.333 |
| AI-complete | 33.333 |
| Cyc | 33.333 |
| Hugo_de_Garis | 33.333 |
| Joyce_K._Reynolds | 33.333 |
| Kleene_closure | 33.333 |
| Mondegreen | 33.333 |
| Supervised_learning | 33.333 |
...
And now a couple more examples:
sa: T |WP: Linux>
+------------------------------------------------+--------+
| page | coeff |
+------------------------------------------------+--------+
| Linux | 100.0 |
| Microsoft_Windows | 46.629 |
| operating_system | 37.333 |
| Unix | 28.956 |
| Mac_OS_X | 26.936 |
| C_(programming_language) | 24.242 |
| Microsoft | 22.535 |
| GNU_General_Public_License | 22.222 |
| Mac_OS | 19.529 |
| Unix-like | 19.192 |
| IBM | 19.048 |
| open_source | 17.845 |
| FreeBSD | 17.845 |
| Apple_Inc. | 16.498 |
| Java_(programming_language) | 15.825 |
| OS_X | 15.488 |
| free_software | 15.488 |
| Sun_Microsystems | 15.152 |
| C++ | 15.152 |
| source_code | 15.152 |
| Macintosh | 14.815 |
| MS-DOS | 13.468 |
| Solaris_(operating_system) | 13.468 |
| PowerPC | 13.131 |
| DOS | 13.131 |
| Android_(operating_system) | 13.131 |
| Windows_NT | 12.795 |
| Intel | 12.458 |
| programming_language | 12.121 |
| personal_computer | 12.121 |
| OpenBSD | 11.785 |
| Unicode | 11.111 |
| graphical_user_interface | 10.774 |
| video_game | 10.774 |
| Cross-platform | 10.774 |
| Internet | 10.574 |
| OS/2 | 10.438 |
...
sa: T |WP: Ronald_Reagan>
+---------------------------------------------------------+--------+
| page | coeff |
+---------------------------------------------------------+--------+
| Ronald_Reagan | 100.0 |
| John_F._Kennedy | 22.951 |
| Bill_Clinton | 22.404 |
| Barack_Obama | 22.283 |
| George_H._W._Bush | 22.131 |
| Jimmy_Carter | 22.131 |
| Richard_Nixon | 22.131 |
| George_W._Bush | 22.131 |
| Republican_Party_(United_States) | 21.785 |
| Democratic_Party_(United_States) | 20.779 |
| United_States_Senate | 19.444 |
| President_of_the_United_States | 17.538 |
| White_House | 15.574 |
| Franklin_D._Roosevelt | 15.301 |
| Vietnam_War | 15.242 |
| United_States_House_of_Representatives | 14.754 |
| United_States_Congress | 14.085 |
| Supreme_Court_of_the_United_States | 13.388 |
| Lyndon_B._Johnson | 13.388 |
| Margaret_Thatcher | 13.115 |
| Cold_War | 13.093 |
| Dwight_D._Eisenhower | 12.568 |
| Nobel_Peace_Prize | 12.368 |
| The_Washington_Post | 12.295 |
| Gerald_Ford | 12.022 |
...
sa: T |WP: Los_Angeles>
+--------------------------------------------------+--------+
| page | coeff |
+--------------------------------------------------+--------+
| Los_Angeles | 100.0 |
| Chicago | 20.852 |
| California | 18.789 |
| Los_Angeles_Times | 17.833 |
| San_Francisco | 16.704 |
| New_York_City | 15.536 |
| Philadelphia | 14.221 |
| NBC | 12.641 |
| Washington,_D.C. | 11.484 |
| Boston | 11.061 |
| USA_Today | 10.609 |
| Texas | 10.384 |
| Academy_Award | 10.158 |
| Seattle | 9.932 |
| New_York | 9.88 |
| Time_(magazine) | 9.851 |
| Mexico_City | 9.707 |
| The_New_York_Times | 9.685 |
| Rolling_Stone | 9.481 |
| CBS | 9.481 |
| Toronto | 9.481 |
...
Now for a final couple of comments. First, my code is painfully slow. But to be fair, I'm a terrible programmer, and this is research code, not production code. The point I'm trying to make is that a real programmer could probably make the speed acceptable, and hence usable.
For the second point, let me quote from word2vec:
"It was recently shown that the word vectors capture many linguistic regularities, for example vector operations vector('Paris') - vector('France') + vector('Italy') results in a vector that is very close to vector('Rome'), and vector('king') - vector('man') + vector('woman') is close to vector('queen')"
I haven't tested it, but I'm pretty sure my inverse-links-to sparse vectors do not have this fun property. That is why I want to create my own word2sp code, though it would take a slightly different form in BKO. Something along the lines of:
vector |some object> => exclude(vector |France>, vector |Paris>) + vector |Italy>
table[object,coeff] select[1,20] 100 self-similar[vector] |some object>
should, if it works, return |Rome> as one of the top similarity matches.
Likewise:
vector |some object> => exclude(vector |man>, vector |king>) + vector |woman>
table[object,coeff] select[1,20] 100 self-similar[vector] |some object>
should return |queen> as a top hit.
Definitely something I would like to test, but I need working code first.
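If I were to sketch the idea in Python, it might look something like the following. Note that the exclude() semantics here are my assumption: I'm reading exclude(A, B) as "B with every ket that appears in A removed", a positive-coefficient stand-in for vector subtraction:

def simm(f, g):
    # Same non-negative similarity as earlier in the post.
    norm = max(sum(f.values()), sum(g.values()))
    return sum(min(f[k], g[k]) for k in f.keys() & g.keys()) / norm if norm else 0.0

def exclude(a, b):
    # Drop from b any label present in a; coefficients stay non-negative.
    return {k: v for k, v in b.items() if k not in a}

def add(f, g):
    out = dict(f)
    for k, v in g.items():
        out[k] = out.get(k, 0) + v
    return out

def analogy(vectors, a, b, c, top=20):
    # exclude(vector |a>, vector |b>) + vector |c>, then rank by simm.
    # e.g. analogy(vectors, 'France', 'Paris', 'Italy') should, if the
    # idea works, put 'Rome' near the top of the returned list.
    probe = add(exclude(vectors[a], vectors[b]), vectors[c])
    scored = [(k, 100 * simm(probe, v)) for k, v in vectors.items()]
    return sorted(scored, key=lambda r: -r[1])[:top]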
updated: 19/12/2016
by Garry Morrison
email: garry -at- semantic-db.org