find unique op applied to webpage superpositions
Recall that back here we mapped sample websites to superpositions and then did pattern recognition on them. Well, using find-unique[op] we can get a vastly better result!
-- load up the data:
sa: load improved-fragment-webpages.sw
sa: load create-average-website-fragments.sw
Here is the tweaked BKO that makes use of find-unique[op] (this is inside this file):
-- define the list of average websites:
|ave list> => |average abc> + |average adelaidenow> + |average slashdot> + |average smh> + |average wikipedia> + |average youtube>
-- we want average hash to be distinct from the other hashes:
|null> => map[hash-4B,average-hash-4B] "" |ave list>
-- find unique kets for our average superpositions:
|null> => find-unique[average-hash-4B] |>
-- now, let's see how well these patterns recognize the pages we left out of our average:
result |abc 11> => 100 similar[hash-4B,unique-average-hash-4B] |abc 11>
result |adelaidenow 11> => 100 similar[hash-4B,unique-average-hash-4B] |adelaidenow 11>
result |slashdot 11> => 100 similar[hash-4B,unique-average-hash-4B] |slashdot 11>
result |smh 11> => 100 similar[hash-4B,unique-average-hash-4B] |smh 11>
result |wikipedia 11> => 100 similar[hash-4B,unique-average-hash-4B] |wikipedia 11>
result |youtube 11> => 100 similar[hash-4B,unique-average-hash-4B] |youtube 11>
-- tidy results:
tidy-result |abc 11> => normalize[100] result |_self>
tidy-result |adelaidenow 11> => normalize[100] result |_self>
tidy-result |slashdot 11> => normalize[100] result |_self>
tidy-result |smh 11> => normalize[100] result |_self>
tidy-result |wikipedia 11> => normalize[100] result |_self>
tidy-result |youtube 11> => normalize[100] result |_self>
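To make the idea a little more concrete, here is a rough Python sketch of what find-unique[op], similar[op] and normalize[100] are doing here. This is just my illustration, not the actual semantic-db code: the toy kets and coefficients are made up, and the simple overlap score is only a stand-in for the engine's similarity metric.

from collections import Counter

def find_unique(averages):
    # keep only the kets (keys) that appear in exactly one of the superpositions
    membership = Counter()
    for sp in averages.values():
        membership.update(sp.keys())
    return {name: {k: v for k, v in sp.items() if membership[k] == 1}
            for name, sp in averages.items()}

def simm(f, g):
    # simple overlap score: sum of min coefficients over max of total coefficients
    # (a stand-in for similar[op]; the engine's metric may differ in detail)
    overlap = sum(min(f[k], g.get(k, 0)) for k in f)
    return overlap / max(sum(f.values()), sum(g.values()), 1e-12)

def classify(page, unique_averages):
    # score the page against each pruned average, then rescale so the scores
    # sum to 100 (the normalize[100] tidy step)
    scores = {name: 100 * simm(sp, page) for name, sp in unique_averages.items()}
    total = sum(scores.values()) or 1.0
    return {name: round(100 * s / total, 2) for name, s in scores.items()}

# toy stand-ins for the hash-4B superpositions (made-up kets and coefficients):
averages = {
    'average abc':      Counter({'</a>': 40, 'abc-menu': 7, 'abc-footer': 5}),
    'average slashdot': Counter({'</a>': 55, 'slash-poll': 9, 'slash-nav': 4}),
}
test_page = Counter({'</a>': 50, 'abc-menu': 6, 'abc-footer': 4})

print(classify(test_page, find_unique(averages)))
# {'average abc': 100.0, 'average slashdot': 0.0}
# the shared '</a>' ket is dropped, so only 'average abc' matches

The point is that the universal kets, which dominate the raw similarity, are stripped out before we compare anything, so only the website-specific kets get a say.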
And here are the results:
sa: matrix[result]
[ average abc         ] = [ 36.0   0      0      0      0      0     ] [ abc 11         ]
[ average adelaidenow ]   [ 0      38.66  0      0      0      0     ] [ adelaidenow 11 ]
[ average slashdot    ]   [ 0      0      35.48  0.04   0      0     ] [ slashdot 11    ]
[ average smh         ]   [ 0      0.02   0.02   36.99  0      0     ] [ smh 11         ]
[ average wikipedia   ]   [ 0      0.01   0.03   0      36.54  0     ] [ wikipedia 11   ]
[ average youtube     ]   [ 0      0.02   0      0      0      36.72 ] [ youtube 11     ]
sa: matrix[tidy-result]
[ average abc         ] = [ 100  0      0      0     0    0   ] [ abc 11         ]
[ average adelaidenow ]   [ 0    99.87  0      0     0    0   ] [ adelaidenow 11 ]
[ average slashdot    ]   [ 0    0      99.86  0.1   0    0   ] [ slashdot 11    ]
[ average smh         ]   [ 0    0.05   0.05   99.9  0    0   ] [ smh 11         ]
[ average wikipedia   ]   [ 0    0.03   0.1    0     100  0   ] [ wikipedia 11   ]
[ average youtube     ]   [ 0    0.05   0      0     0    100 ] [ youtube 11     ]
Some notes:
1) that is a seriously good result! Yes, without the normalize[100] the similarity is down from about 90% to about 36%, but the gap between the best result and the next best is now rather large, which we see clearly in the tidy-result matrix (the one that does apply normalize[100]). And we don't need drop-below[t] any more either!
2) it is interesting that we can get such a big improvement using only one new line of code (the find-unique[average-hash-4B] line) and a few tweaks to the existing BKO.
3) this technique of dropping back to considering only unique kets works only some of the time. For a start you need large superpositions, and a lot of kets that are unique from superposition to superposition. For example, it would not work for the Iris example, the wage prediction example, or the document-type example. I'm wondering if there is a way to borrow the general idea of suppressing duplicate kets, but less harshly than keeping only the unique ones. Maybe something as simple as: if a ket is in n superpositions, map coeff => coeff/n? Or do we need something smarter than that? (There is a rough sketch of this coeff/n idea near the end of this post.)
4) let's take a look at how many unique kets we have:
sa: how-many-hash |*> #=> to-comma-number how-many average-hash-4B |_self>
sa: how-many-unique-hash |*> #=> to-comma-number how-many unique-average-hash-4B |_self>
sa: delta |*> #=> arithmetic(how-many average-hash-4B |_self>,|->,how-many unique-average-hash-4B |_self>)
sa: table[website,how-many-hash,how-many-unique-hash,delta] "" |ave list>
+---------------------+---------------+----------------------+-------+
| website | how-many-hash | how-many-unique-hash | delta |
+---------------------+---------------+----------------------+-------+
| average abc | 1,492 | 1,391 | 101 |
| average adelaidenow | 11,869 | 11,636 | 233 |
| average slashdot | 5,462 | 5,275 | 187 |
| average smh | 10,081 | 9,784 | 297 |
| average wikipedia | 3,182 | 3,084 | 98 |
| average youtube | 6,390 | 6,310 | 80 |
+---------------------+---------------+----------------------+-------+
I didn't really expect that. I thought there would be a lot more duplicate kets, but instead we only see a couple of hundred per website. But since removing them improved our results so much, presumably the duplicate kets had relatively large coefficients. eg, the ket generated from </a> in the html will be universal across our webpages, and have a large coeff.
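And while I'm at it, here is what the coeff => coeff/n idea from note 3 might look like, again as an illustrative Python sketch over the same kind of toy data, not something currently in the engine. It is close in spirit to the inverse document frequency weighting used in information retrieval: kets shared by many websites are strongly damped rather than thrown away outright.

from collections import Counter

def damp_duplicates(averages):
    # if a ket appears in n of the average superpositions, divide its
    # coefficient by n, instead of dropping it outright like find-unique does
    membership = Counter()
    for sp in averages.values():
        membership.update(sp.keys())
    return {name: {k: v / membership[k] for k, v in sp.items()}
            for name, sp in averages.items()}

# eg, a '</a>' ket present in all 6 averages keeps only 1/6 of its coefficient,
# while a ket unique to a single website keeps its full coefficient.

Whether that would beat plain find-unique on examples like Iris or the document-type one is an open question.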
I guess that is it for this post. Back to find-topic[op] in the next couple of posts.
previous: new function find unique op
next: mapping sw files to frequency lists
updated: 19/12/2016
by Garry Morrison
email: garry -at- semantic-db.org