some bigger sw examples

Now that we have most of the basics out of the way, we can look at some bigger examples.
So here are several public databases that I have mapped to sw format:

GeoNames
"The GeoNames geographical database covers all countries and contains over eight million placenames that are available for download free of charge."
http://www.geonames.org/
And my rough version of the Australian data is here (yeah, I want to redo it at some stage):
geonames-au-id-version.sw (50 MB, 7 op types and 871,186 learn rules)
improved sw versions:
improved-geonames-au.sw (109 MB, 11 op types and 1,544,996 learn rules)
improved-geonames-cities-1000.sw (94 MB, 11 op types and 1,375,844 learn rules)
improved-geonames-cities-15000.sw (17 MB, 11 op types and 236,184 learn rules)
improved-geonames-de.sw (99 MB, 11 op types and 1,454,997 learn rules)
improved-geonames-fr.sw (83 MB, 11 op types and 1,203,510 learn rules)
improved-geonames-gb.sw (31 MB, 11 op types and 441,431 learn rules)
improved-geonames-us.sw (1.3 GB, 11 op types and 18,950,003 learn rules)

Moby Thesaurus
"Moby Thesaurus is the largest and most comprehensive thesaurus data source in English available for commercial use. This second edition has been thoroughly revised adding more than 5,000 root words (to total more than 30,000) with an additional million synonyms and related terms (to total more than 2.5 million synonyms and related terms)."
http://icon.shef.ac.uk/Moby/mthes.html
Again, my rough version here:
moby-thesaurus.sw (32 MB, 1 op type and 30,244 learn rules)
improved sw version:
improved-moby-thesaurus.sw (50 MB, 1 op type and 30,260 learn rules)

Moby Part-of-Speech
"This second edition is a particularly thorough revision of the original Moby Part-of-Speech. Beyond the fifteen thousand new entries, many thousand more entries have been scrutinized for correctness and modernity. This is unquestionably the largest P-O-S list in the world."
http://icon.shef.ac.uk/Moby/mpos.html
The sw version:
part-of-speech.sw (16 MB, 1 op type and 233,090 learn rules)

Frequently Occurring Surnames from Census 1990
http://www.census.gov/topics/population/genealogy/data/1990_census/1990_census_namefiles.html#
The sw version:
names.sw (1.7 MB, 1 op type and 4 learn rules)

IMDB database
ftp://ftp.fu-berlin.de/pub/misc/movies/database/
The sw versions:
improved-imdb.sw (588 MB, 2 op types and 2,591,132 learn rules)
imdb-ratings.sw (1.6 MB, 4 op types and 21,820 learn rules)
improved-imdb-year.sw (470 MB, 4 op types and 3,146,301 learn rules)

A year's worth of historical share data (I forget the source!)
shares.sw (17 MB, 5 op types and 2,622 learn rules)

And I guess that is about it. The only note I want to make is that for each of these examples I had to write a custom script to parse the original data! Once the data is in sw format, parsing is trivial and identical in each case. Yeah, I'm trying to push my sw notation! I certainly think it is superior to XML.

Update: in a future phase of the project I would like to extend this, and map even more data sets to sw format.

Update: I now have a script to produce the stats summary of sw files.
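
To give a feel for why parsing sw is the same job for every file, here is a minimal Python sketch of a stats summary in the "(N op types and M learn rules)" style used above. This is not the actual script. It assumes a learn rule is a single line of the form "op |ket> => superposition", and it assumes "supported-ops" lines and the context line are not counted. The names sw_stats and format_stats, and the toy data, are made up for illustration.

```python
import re

# Assumed shape of a learn rule: "op-name |some ket> => right-hand side".
# This is a guess at the format for illustration, not the real sw parser.
LEARN_RULE = re.compile(r"^\s*([\w\-]+)\s+\|[^>]*>\s*=>\s*\S")


def sw_stats(lines):
    """Return (number of distinct op types, number of learn rules)."""
    ops = set()
    rules = 0
    for line in lines:
        match = LEARN_RULE.match(line)
        # Skip blank lines, the context line, and (by assumption)
        # the "supported-ops" bookkeeping rules.
        if match is None or match.group(1) == "supported-ops":
            continue
        ops.add(match.group(1))
        rules += 1
    return len(ops), rules


def format_stats(name, lines):
    """Format the counts in the style of the file list on this page."""
    op_count, rule_count = sw_stats(lines)
    op_word = "op type" if op_count == 1 else "op types"
    return f"{name} ({op_count} {op_word} and {rule_count:,} learn rules)"


# Hypothetical toy data, not taken from any of the files listed above.
sample = [
    "|context> => |context: toy example>",
    "",
    "supported-ops |Fred> => |op: age> + |op: friends>",
    "age |Fred> => |age: 27>",
    "friends |Fred> => |Sam> + |Mary>",
    "age |Sam> => |age: 31>",
]
print(format_stats("toy.sw", sample))
```

For a real file you would pass an open file handle instead of the list, e.g. format_stats("names.sw", open("names.sw")). The same function works unchanged on any of the files, which is the point: no per-data-set parsing code once the data is in sw format.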



updated: 19/12/2016
by Garry Morrison
email: garry -at- semantic-db.org