-- load up the data:
sa: load improved-fragment-webpages.sw
sa: load create-average-website-fragments.sw

Here is the tweaked BKO that makes use of find-unique[op]:

(this is inside this file)

-- define the list of average websites:
|ave list> => |average abc> + |average adelaidenow> + |average slashdot> + |average smh> + |average wikipedia> + |average youtube>

-- we want average hash to be distinct from the other hashes:
|null> => map[hash-4B,average-hash-4B] "" |ave list>

-- find unique kets for our average superpositions:
|null> => find-unique[average-hash-4B] |>

-- now, let's see how well these patterns recognize the pages we left out of our average:
result |abc 11> => 100 similar[hash-4B,unique-average-hash-4B] |abc 11>
result |adelaidenow 11> => 100 similar[hash-4B,unique-average-hash-4B] |adelaidenow 11>
result |slashdot 11> => 100 similar[hash-4B,unique-average-hash-4B] |slashdot 11>
result |smh 11> => 100 similar[hash-4B,unique-average-hash-4B] |smh 11>
result |wikipedia 11> => 100 similar[hash-4B,unique-average-hash-4B] |wikipedia 11>
result |youtube 11> => 100 similar[hash-4B,unique-average-hash-4B] |youtube 11>

-- tidy results:
tidy-result |abc 11> => normalize[100] result |_self>
tidy-result |adelaidenow 11> => normalize[100] result |_self>
tidy-result |slashdot 11> => normalize[100] result |_self>
tidy-result |smh 11> => normalize[100] result |_self>
tidy-result |wikipedia 11> => normalize[100] result |_self>
tidy-result |youtube 11> => normalize[100] result |_self>

And here are the results:
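For readers without the semantic-db console handy, here is a rough Python sketch of the pieces this pipeline relies on, with superpositions modelled as dicts mapping ket labels to coefficients. The exact definitions of simm, find-unique and normalize below are my reading of those operators, so treat this as an approximation rather than the reference implementation:

```python
from typing import Dict

Superposition = Dict[str, float]  # ket label -> coefficient

def simm(f: Superposition, g: Superposition) -> float:
    """Rescaled similarity of two superpositions, in [0, 1]:
    sum of coefficient-wise minimums over the larger total."""
    overlap = sum(min(f.get(k, 0), g.get(k, 0)) for k in set(f) | set(g))
    return overlap / max(sum(f.values()), sum(g.values()))

def find_unique(named_sps: Dict[str, Superposition]) -> Dict[str, Superposition]:
    """For each named superposition, keep only the kets that occur in
    exactly one of the superpositions (cf. find-unique[op])."""
    counts: Dict[str, int] = {}
    for sp in named_sps.values():
        for k in sp:
            counts[k] = counts.get(k, 0) + 1
    return {name: {k: c for k, c in sp.items() if counts[k] == 1}
            for name, sp in named_sps.items()}

def normalize(sp: Superposition, t: float = 100) -> Superposition:
    """Rescale the coefficients so they sum to t (cf. normalize[100])."""
    total = sum(sp.values())
    return {k: t * c / total for k, c in sp.items()}
```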

sa: matrix[result]
[ average abc         ] = [ 36.0  0      0      0      0      0     ] [ abc 11         ]
[ average adelaidenow ]   [ 0     38.66  0      0      0      0     ] [ adelaidenow 11 ]
[ average slashdot    ]   [ 0     0      35.48  0.04   0      0     ] [ slashdot 11    ]
[ average smh         ]   [ 0     0.02   0.02   36.99  0      0     ] [ smh 11         ]
[ average wikipedia   ]   [ 0     0.01   0.03   0      36.54  0     ] [ wikipedia 11   ]
[ average youtube     ]   [ 0     0.02   0      0      0      36.72 ] [ youtube 11     ]

sa: matrix[tidy-result]
[ average abc         ] = [ 100   0      0      0      0      0     ] [ abc 11         ]
[ average adelaidenow ]   [ 0     99.87  0      0      0      0     ] [ adelaidenow 11 ]
[ average slashdot    ]   [ 0     0      99.86  0.1    0      0     ] [ slashdot 11    ]
[ average smh         ]   [ 0     0.05   0.05   99.9   0      0     ] [ smh 11         ]
[ average wikipedia   ]   [ 0     0.03   0.1    0      100    0     ] [ wikipedia 11   ]
[ average youtube     ]   [ 0     0.05   0      0      0      100   ] [ youtube 11     ]

Some notes:

1) that is a seriously good result! Yeah, without the normalize[100] we are down from 90% to around 36% similarity, but the gap between the best result and the next best is now rather large, which we see clearly in the tidy-result matrix (which does make use of normalize[100]). Heh, and we don't need drop-below[t] anymore either!

2) it is interesting that we can get such a big improvement with only one new line of code (the find-unique[average-hash-4B] line) and a few tweaks to the existing BKO.

3) this technique of restricting attention to unique kets only works some of the time. For a start you need large superpositions, and a lot of kets that differ from superposition to superposition. For example, this technique would not work for the Iris example, the wage-prediction example, or the document-type example. I'm wondering if there is a way to borrow the general idea of suppressing duplicate kets, but less harshly than keeping only the unique ones. Maybe something as simple as: if a ket is in n superpositions, map coeff => coeff/n? Or do we need something smarter than that?
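The coeff => coeff/n idea in note 3 can be sketched directly. This is just my guess at what such an operator might look like (it is close in spirit to inverse-document-frequency weighting from information retrieval), not anything currently in the project:

```python
from typing import Dict

def discount_duplicates(named_sps: Dict[str, Dict[str, float]]) -> Dict[str, Dict[str, float]]:
    """Soften rather than delete duplicates: if a ket appears in n of the
    superpositions, divide its coefficient by n. A ket unique to one
    superposition is untouched; a universal ket is heavily suppressed
    but, unlike with find-unique, still carries some weight."""
    counts: Dict[str, int] = {}
    for sp in named_sps.values():
        for k in sp:
            counts[k] = counts.get(k, 0) + 1
    return {name: {k: c / counts[k] for k, c in sp.items()}
            for name, sp in named_sps.items()}
```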

4) let's take a look at how many unique kets we have:

sa: how-many-hash |*> #=> to-comma-number how-many average-hash-4B |_self>
sa: how-many-unique-hash |*> #=> to-comma-number how-many unique-average-hash-4B |_self>
sa: delta |*> #=> arithmetic(how-many average-hash-4B |_self>,|->,how-many unique-average-hash-4B |_self>)
sa: table[website,how-many-hash,how-many-unique-hash,delta] "" |ave list>
+---------------------+---------------+----------------------+-------+
| website             | how-many-hash | how-many-unique-hash | delta |
+---------------------+---------------+----------------------+-------+
| average abc         | 1,492         | 1,391                | 101   |
| average adelaidenow | 11,869        | 11,636               | 233   |
| average slashdot    | 5,462         | 5,275                | 187   |
| average smh         | 10,081        | 9,784                | 297   |
| average wikipedia   | 3,182         | 3,084                | 98    |
| average youtube     | 6,390         | 6,310                | 80    |
+---------------------+---------------+----------------------+-------+

I didn't really expect that. I thought there would be a lot more duplicate kets, but instead we only see a couple of hundred per website. But since removing them improved our results so much, presumably the duplicate kets had relatively large coefficients. e.g., the ket generated from </a> in HTML will be universal across our webpages, and have a large coefficient.
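To make that closing point concrete, here is a toy example (entirely made-up ket names and numbers) showing how one shared, large-coefficient ket such as the hash of </a> can dominate the similarity between otherwise unrelated pages, assuming simm is sum-of-minimums over the larger total:

```python
def simm(f, g):
    # rescaled similarity: sum of coefficient-wise minimums,
    # divided by the larger of the two totals
    overlap = sum(min(f.get(k, 0), g.get(k, 0)) for k in set(f) | set(g))
    return overlap / max(sum(f.values()), sum(g.values()))

# two "pages" that share nothing except one high-coefficient ket:
page_a = {"</a>": 90, "a1": 5, "a2": 5}
page_b = {"</a>": 90, "b1": 5, "b2": 5}

with_dup = simm(page_a, page_b)  # -> 0.9: the shared ket alone pushes similarity to 90%
without_dup = simm({k: c for k, c in page_a.items() if k != "</a>"},
                   {k: c for k, c in page_b.items() if k != "</a>"})  # -> 0.0
```

So deleting even a handful of such universal kets, as find-unique[op] does, can separate the patterns far more than the small delta column above would suggest.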

I guess that is it for this post. Back to find-topic[op] in the next couple of posts.


previous: new function find unique op

next: mapping sw files to frequency lists

updated: 19/12/2016

by Garry Morrison

email: garry -at- semantic-db.org