wikipedia fragment to frequency list
Previously, we had a lot of success mapping Wikipedia to its link structure and then finding semantic similarities using "similar[op]". This time, we map Wikipedia (well, a small piece of it) to word frequency lists, and see how well "find-topic[op]" works on them.
Here is my code to convert Wikipedia XML to single-word frequencies:
$ ./play_with_wikipedia_freq_list.py data/fragments/0.xml
10,604 minutes later, we have this.
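The script itself is not reproduced in this post. As a rough illustration of the conversion step, here is a minimal sketch, assuming the fragment follows the usual MediaWiki export layout (page elements containing a title and a text element). The function names and the tokenising regex are hypothetical choices for illustration, not taken from play_with_wikipedia_freq_list.py:

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter


def local_name(tag):
    """Strip any '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit('}', 1)[-1]


def page_word_frequencies(xml_text):
    """Return {page title: Counter of lowercase words}, one entry per <page>.

    Assumption: each <page> holds a <title> and (inside <revision>) a <text>.
    """
    root = ET.fromstring(xml_text)
    result = {}
    for page in root.iter():
        if local_name(page.tag) != 'page':
            continue
        title, text = None, ''
        for node in page.iter():
            name = local_name(node.tag)
            if name == 'title':
                title = node.text
            elif name == 'text':
                text = node.text or ''
        if title is None:
            continue
        # Hypothetical tokeniser: runs of letters, digits and apostrophes.
        words = re.findall(r"[a-z0-9']+", text.lower())
        result[title.replace(' ', '_')] = Counter(words)
    return result


# Tiny made-up sample, not real Wikipedia data.
sample = """<mediawiki>
  <page>
    <title>River Torrens</title>
    <revision><text>The River Torrens is a river in Adelaide. The river flows west.</text></revision>
  </page>
</mediawiki>"""

freq = page_word_frequencies(sample)
print(freq['River_Torrens'].most_common(3))
```

Real wiki markup (templates, links, references) would need stripping before counting, which this sketch does not attempt.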
Now, some examples:
sa: T |*> #=> table[wikipage,coeff] select[1,300] 100 intn-find-topic[words-1] |_self>
sa: T |river torrens>
+-----------------------------+----------+
| wikipage | coeff |
+-----------------------------+----------+
| Murray_River | 2210.857 |
| The_Bronx | 1450.875 |
| South_Australia | 1243.607 |
| Adelaide | 1130.552 |
| Prince_Edward_Island | 746.164 |
| Gypsum | 710.633 |
| Port_Adelaide_Football_Club | 678.331 |
| June_14 | 552.714 |
| Trade | 552.714 |
| October_25 | 497.443 |
| Dinosaur | 226.11 |
+-----------------------------+----------+
Time taken: 27 minutes, 19 seconds, 709 milliseconds
sa: T |adelaide university>
+-------------------------------------------+--------+
| wikipage | coeff |
+-------------------------------------------+--------+
| Macquarie_University | 90.953 |
| Immanuel_Kant | 74.416 |
| Robert_Menzies | 71.625 |
| David_Hume | 71.625 |
| Theology | 68.214 |
| Adelaide | 65.114 |
| Austin,_Texas | 65.114 |
| Yoga | 65.114 |
| Gregor_Mendel | 63.951 |
| Mike_Moore_(New_Zealand_politician) | 63.951 |
| New_South_Wales | 61.393 |
| Perth | 61.393 |
| Aristophanes | 61.393 |
| Bob_Hawke | 61.393 |
| Culture_of_Canada | 61.393 |
| John_Milton | 61.393 |
| West_Bengal | 61.393 |
| Brewing | 61.393 |
| Fyodor_Dostoyevsky | 61.393 |
| Hunter_College | 61.393 |
| John_Stuart_Mill | 61.393 |
...
Time taken: 27 minutes, 8 seconds, 965 milliseconds
sa: T |apple juice>
+---------------------------------------+---------+
| wikipage | coeff |
+---------------------------------------+---------+
| Vinegar | 402.189 |
| McIntosh_(apple) | 367.835 |
| Fruit | 361.97 |
| Cuisine_of_the_United_States | 329.064 |
| Drink | 321.751 |
| Vietnamese_cuisine | 321.751 |
| List_of_cocktails | 321.751 |
| Hungarian_language | 294.268 |
| Arsenic | 289.576 |
| Chardonnay | 289.576 |
| Pear | 271.478 |
| Swedish_cuisine | 271.478 |
| Cuisine_of_the_Southern_United_States | 271.478 |
| Food_preservation | 241.314 |
| Turkish_cuisine | 241.314 |
| Mead | 241.314 |
| French_cuisine | 217.182 |
| Mojito | 206.84 |
...
Time taken: 27 minutes, 25 seconds, 378 milliseconds
sa: T |russia china japan australia new zealand egypt>
+-----------------------------------------------------------+--------+
| wikipage | coeff |
+-----------------------------------------------------------+--------+
| Tram | 77.349 |
| List_of_national_capitals_and_largest_cities_by_country | 75.967 |
| General_Motors | 74.448 |
| 2000s_(decade) | 70.903 |
| History_of_painting | 70.903 |
| 2010s | 67.68 |
| British_Empire | 67.68 |
| Foreign_relations_of_China | 67.68 |
| Self-determination | 67.68 |
| Foreign_relations_of_Taiwan | 67.68 |
| Toyota | 67.68 |
| Dwight_D._Eisenhower | 65.991 |
| Psychology | 65.991 |
| 2008 | 63.813 |
| List_of_former_sovereign_states | 63.813 |
| Foreign_relations_of_Indonesia | 63.813 |
| Foreign_relations_of_Japan | 63.813 |
| Foreign_relations_of_North_Korea | 63.813 |
| Peninsula | 63.813 |
| Pandemic | 63.813 |
| United_Nations_Security_Council | 63.813 |
| 1996 | 63.813 |
| List_of_mountains | 63.813 |
...
Time taken: 1 hour, 39 minutes, 16 seconds, 194 milliseconds
Anyway, largely rubbish results! That doesn't mean find-topic[op] is completely useless. For example, it seems to work well for finding the type of a name (male, female, last). It just doesn't work that well on Wikipedia.
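For readers who want a feel for this kind of query, here is a toy sketch of ranking pages by word frequency. It is not the implementation of find-topic[op] or intn-find-topic[op], which this post does not show. It assumes a simple stand-in scheme: score each page by a word's share of that page's word count, rescale so the scores for each word sum to 100, then keep only pages that contain every query word and take the smallest per-word score. All names and the sample counts below are hypothetical:

```python
from collections import Counter


def topic_scores(word, freq_lists):
    """Score pages for one word, rescaled so the scores sum to 100.

    Assumption: the raw score is the word's share of the page's total count.
    """
    raw = {}
    for page, counts in freq_lists.items():
        total = sum(counts.values())
        count = counts.get(word, 0)
        if total > 0 and count > 0:
            raw[page] = count / total
    norm = sum(raw.values())
    return {page: 100 * score / norm for page, score in raw.items()}


def rank_pages(query, freq_lists, top=10):
    """Rank pages that contain every query word, using the min per-word score."""
    per_word = [topic_scores(w, freq_lists) for w in query.lower().split()]
    if not per_word:
        return []
    common = set(per_word[0])
    for scores in per_word[1:]:
        common &= set(scores)
    ranked = [(page, min(s[page] for s in per_word)) for page in common]
    # Highest score first; ties broken alphabetically for a stable order.
    ranked.sort(key=lambda item: (-item[1], item[0]))
    return ranked[:top]


# Made-up frequency lists, not real Wikipedia counts.
freq_lists = {
    'Murray_River': Counter(river=8, torrens=1, water=3),
    'Adelaide': Counter(river=2, torrens=2, city=6),
    'Dinosaur': Counter(fossil=9, river=1),
}

ranked = rank_pages('river torrens', freq_lists)
for page, coeff in ranked:
    print(page, round(coeff, 3))
```

In this toy data, Dinosaur drops out because it never mentions "torrens". A page can still rank highly just by mentioning each query word in passing, and with real Wikipedia text that kind of incidental overlap is very common.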
updated: 19/12/2016
by Garry Morrison
email: garry -at- semantic-db.org