revisiting wikipedia inverse-links-to semantic similarities
Nothing new here, just some more wikipedia semantic similarity examples. The motivation was partly word2vec, which ships some examples of semantic similarity using its word vectors. I had planned to write my own word2sp, but so far my idea has failed! And I couldn't use their word vectors directly, because they have negative coefficients, while my similarity metric requires non-negative coefficients.
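For concreteness, the metric I have in mind is, roughly, the shared weight of two non-negative sparse vectors divided by the larger total weight. Here is a minimal Python sketch (the name simm and the dict representation are just for illustration):

def simm(f, g):
    # f, g: dicts mapping label -> non-negative coefficient.
    # Shared weight over the larger total weight. With negative
    # coefficients the min() logic below no longer makes sense,
    # which is why word2vec vectors can't be dropped straight in.
    shared = sum(min(f[k], g[k]) for k in f.keys() & g.keys())
    norm = max(sum(f.values()), sum(g.values()))
    return shared / norm if norm > 0 else 0.0

simm({'a': 1, 'b': 2}, {'b': 1, 'c': 3})   # 1/4 = 0.25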
So in the meantime I decided to re-run my wikipedia code. I tried to use the 300,000-link wikipedia sw file, but that failed too: it needed too much RAM and took too long to run. I thought I had used it in the past, in which case I don't know why it failed this time!
Here is the first word2vec example (distance to "france"):
Word Cosine distance
-------------------------------------------
spain 0.678515
belgium 0.665923
netherlands 0.652428
italy 0.633130
switzerland 0.622323
luxembourg 0.610033
portugal 0.577154
russia 0.571507
germany 0.563291
catalonia 0.534176
Here it is using my code:
sa: load 30k--wikipedia-links.sw
sa: find-inverse[links-to]
sa: T |*> #=> table[page,coeff] select[1,200] 100 self-similar[inverse-links-to] |_self>
sa: T |WP: France>
+--------------------------------+--------+
| page | coeff |
+--------------------------------+--------+
| France | 100.0 |
| Germany | 31.771 |
| United_Kingdom | 30.537 |
| Italy | 27.452 |
| Spain | 23.566 |
| United_States | 20.152 |
| Japan | 19.556 |
| Netherlands | 19.309 |
| Russia | 18.877 |
| Canada | 18.384 |
| Europe | 17.273 |
| India | 17.212 |
| China | 16.78 |
| Paris | 16.595 |
| England | 16.286 |
| World_War_II | 15.923 |
| Australia | 15.238 |
| Soviet_Union | 14.867 |
| Belgium | 14.189 |
| Poland | 14.127 |
| Portugal | 13.819 |
| World_War_I | 13.757 |
| Austria | 13.695 |
| Sweden | 13.572 |
| Switzerland | 13.51 |
| Egypt | 12.647 |
| European_Union | 12.4 |
| Brazil | 12.338 |
| United_Nations | 12.091 |
| Greece | 11.906 |
| London | 11.906 |
| Israel | 11.783 |
| Turkey | 11.783 |
| Denmark | 11.598 |
| French_language | 11.536 |
| Norway | 11.413 |
| Latin | 10.611 |
| Rome | 10.364 |
| Mexico | 10.364 |
| English_language | 9.994 |
| South_Africa | 9.747 |
...
which works pretty well, I must say.
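For anyone who wants the gist without the BKO notation, here is a rough Python rendering of what that pipeline does. The names links, find_inverse and similarity_table are hypothetical; links is assumed to map each page to the set of pages it links to:

from collections import defaultdict

def find_inverse(links):
    # Build inverse-links-to: page -> set of pages that link to it.
    inverse = defaultdict(set)
    for page, targets in links.items():
        for target in targets:
            inverse[target].add(page)
    return inverse

def similarity_table(inverse, target, top=40):
    # Rank pages by similarity of their inverse-link patterns against
    # target, scaled to 100. With 0/1 coefficients, simm reduces to
    # |f & g| / max(|f|, |g|). The target itself scores 100, matching
    # the first row of the tables above.
    f = inverse[target]
    results = []
    for page, g in inverse.items():
        norm = max(len(f), len(g))
        if norm:
            results.append((page, 100.0 * len(f & g) / norm))
    return sorted(results, key=lambda r: -r[1])[:top]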
Here is the next word2vec example (distance to "san francisco"):
Word Cosine distance
-------------------------------------------
los_angeles 0.666175
golden_gate 0.571522
oakland 0.557521
california 0.554623
san_diego 0.534939
pasadena 0.519115
seattle 0.512098
taiko 0.507570
houston 0.499762
chicago_illinois 0.491598
Here it is using my code:
+---------------------------------------+--------+
| page | coeff |
+---------------------------------------+--------+
| San_Francisco | 100.0 |
| Los_Angeles | 16.704 |
| Chicago | 15.919 |
| 1924 | 15.522 |
| 1916 | 14.566 |
| California | 14.502 |
| 1915 | 14.286 |
| 2014 | 14.217 |
| 1933 | 14.031 |
| 1913 | 14.006 |
| 1918 | 14.0 |
| 1930 | 14.0 |
| Philadelphia | 13.99 |
| 1925 | 13.984 |
| 1931 | 13.904 |
| 1920 | 13.802 |
| 1932 | 13.776 |
| 1942 | 13.744 |
| 1999 | 13.725 |
...
Hrmm... that didn't work so great. I wonder why. Perhaps the year pages are linked from so many articles that their inverse-link patterns overlap with nearly everything.
Here is the next word2vec example:
Enter word or sentence (EXIT to break): /en/geoffrey_hinton
Word Cosine distance
--------------------------------------------------
/en/marvin_minsky 0.457204
/en/paul_corkum 0.443342
/en/william_richard_peltier 0.432396
/en/brenda_milner 0.430886
/en/john_charles_polanyi 0.419538
/en/leslie_valiant 0.416399
/en/hava_siegelmann 0.411895
/en/hans_moravec 0.406726
/en/david_rumelhart 0.405275
/en/godel_prize 0.405176
And here it is using my code:
+------------------------------------------------------------+--------+
| page | coeff |
+------------------------------------------------------------+--------+
| Geoffrey_Hinton | 100 |
| perceptron | 66.667 |
| Tom_M._Mitchell | 66.667 |
| computational_learning_theory | 66.667 |
| Nils_Nilsson_(researcher) | 66.667 |
| beam_search | 66.667 |
| Raj_Reddy | 50 |
| AI_effect | 40 |
| ant_colony_optimization | 40 |
| List_of_artificial_intelligence_projects | 33.333 |
| AI-complete | 33.333 |
| Cyc | 33.333 |
| Hugo_de_Garis | 33.333 |
| Joyce_K._Reynolds | 33.333 |
| Kleene_closure | 33.333 |
| Mondegreen | 33.333 |
| Supervised_learning | 33.333 |
...
And now a couple more examples:
sa: T |WP: Linux>
+------------------------------------------------+--------+
| page | coeff |
+------------------------------------------------+--------+
| Linux | 100.0 |
| Microsoft_Windows | 46.629 |
| operating_system | 37.333 |
| Unix | 28.956 |
| Mac_OS_X | 26.936 |
| C_(programming_language) | 24.242 |
| Microsoft | 22.535 |
| GNU_General_Public_License | 22.222 |
| Mac_OS | 19.529 |
| Unix-like | 19.192 |
| IBM | 19.048 |
| open_source | 17.845 |
| FreeBSD | 17.845 |
| Apple_Inc. | 16.498 |
| Java_(programming_language) | 15.825 |
| OS_X | 15.488 |
| free_software | 15.488 |
| Sun_Microsystems | 15.152 |
| C++ | 15.152 |
| source_code | 15.152 |
| Macintosh | 14.815 |
| MS-DOS | 13.468 |
| Solaris_(operating_system) | 13.468 |
| PowerPC | 13.131 |
| DOS | 13.131 |
| Android_(operating_system) | 13.131 |
| Windows_NT | 12.795 |
| Intel | 12.458 |
| programming_language | 12.121 |
| personal_computer | 12.121 |
| OpenBSD | 11.785 |
| Unicode | 11.111 |
| graphical_user_interface | 10.774 |
| video_game | 10.774 |
| Cross-platform | 10.774 |
| Internet | 10.574 |
| OS/2 | 10.438 |
...
sa: T |WP: Ronald_Reagan>
+---------------------------------------------------------+--------+
| page | coeff |
+---------------------------------------------------------+--------+
| Ronald_Reagan | 100.0 |
| John_F._Kennedy | 22.951 |
| Bill_Clinton | 22.404 |
| Barack_Obama | 22.283 |
| George_H._W._Bush | 22.131 |
| Jimmy_Carter | 22.131 |
| Richard_Nixon | 22.131 |
| George_W._Bush | 22.131 |
| Republican_Party_(United_States) | 21.785 |
| Democratic_Party_(United_States) | 20.779 |
| United_States_Senate | 19.444 |
| President_of_the_United_States | 17.538 |
| White_House | 15.574 |
| Franklin_D._Roosevelt | 15.301 |
| Vietnam_War | 15.242 |
| United_States_House_of_Representatives | 14.754 |
| United_States_Congress | 14.085 |
| Supreme_Court_of_the_United_States | 13.388 |
| Lyndon_B._Johnson | 13.388 |
| Margaret_Thatcher | 13.115 |
| Cold_War | 13.093 |
| Dwight_D._Eisenhower | 12.568 |
| Nobel_Peace_Prize | 12.368 |
| The_Washington_Post | 12.295 |
| Gerald_Ford | 12.022 |
...
sa: T |WP: Los_Angeles>
+--------------------------------------------------+--------+
| page | coeff |
+--------------------------------------------------+--------+
| Los_Angeles | 100.0 |
| Chicago | 20.852 |
| California | 18.789 |
| Los_Angeles_Times | 17.833 |
| San_Francisco | 16.704 |
| New_York_City | 15.536 |
| Philadelphia | 14.221 |
| NBC | 12.641 |
| Washington,_D.C. | 11.484 |
| Boston | 11.061 |
| USA_Today | 10.609 |
| Texas | 10.384 |
| Academy_Award | 10.158 |
| Seattle | 9.932 |
| New_York | 9.88 |
| Time_(magazine) | 9.851 |
| Mexico_City | 9.707 |
| The_New_York_Times | 9.685 |
| Rolling_Stone | 9.481 |
| CBS | 9.481 |
| Toronto | 9.481 |
...
Now for a final couple of comments. First, my code is painfully slow. But to be fair, I'm a terrible programmer, and this is research code, not production code. The point I'm trying to make is that a real programmer could probably make the speed acceptable, and hence usable.
For the second point, let me quote from word2vec:
"It was recently shown that the word vectors capture many linguistic regularities, for example vector operations vector('Paris') - vector('France') + vector('Italy') results in a vector that is very close to vector('Rome'), and vector('king') - vector('man') + vector('woman') is close to vector('queen')"
I haven't tested it, but I'm pretty sure my inverse-links-to sparse vectors do not have this fun property. That is why I want to create my own word2sp code, though it would take a slightly different form in BKO. Something along the lines of:
vector |some object> => exclude(vector |France>, vector |Paris>) + vector |Italy>
table[object,coeff] select[1,20] 100 self-similar[vector] |some object>
should, if it works, return |Rome> as one of the top similarity matches.
Likewise:
vector |some object> => exclude(vector |man>, vector |king>) + vector |woman>
table[object,coeff] select[1,20] 100 self-similar[vector] |some object>
should return |queen> as a top hit.
Definitely something I would like to test, but I need working code first.
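If I were to sketch the idea in Python, it might look something like the following. Note that the exclude() semantics here are my assumption: I'm reading exclude(A, B) as "B with every ket that appears in A removed", a positive-coefficient stand-in for vector subtraction:

def simm(f, g):
    # Same non-negative similarity as earlier in the post.
    norm = max(sum(f.values()), sum(g.values()))
    return sum(min(f[k], g[k]) for k in f.keys() & g.keys()) / norm if norm else 0.0

def exclude(a, b):
    # Drop from b any label present in a; coefficients stay non-negative.
    return {k: v for k, v in b.items() if k not in a}

def add(f, g):
    out = dict(f)
    for k, v in g.items():
        out[k] = out.get(k, 0) + v
    return out

def analogy(vectors, a, b, c, top=20):
    # exclude(vector |a>, vector |b>) + vector |c>, then rank by simm.
    # e.g. analogy(vectors, 'France', 'Paris', 'Italy') should, if the
    # idea works, put 'Rome' near the top of the returned list.
    probe = add(exclude(vectors[a], vectors[b]), vectors[c])
    scored = [(k, 100 * simm(probe, v)) for k, v in vectors.items()]
    return sorted(scored, key=lambda r: -r[1])[:top]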
updated: 19/12/2016
by Garry Morrison
email: garry -at- semantic-db.org