|
|||||
Recent ArticlesClojure is a relatively new language to appear on the Java Virtual Machine (JVM), although it draws on very mature roots in the form of the LISP langu ... Should You Care About Requirements Engineering? Recently, I (Adil) was invited to participate in a one day seminar on the subject of Requirements Engineering. Whilst I have no direct experience of t ... Tips for Setting Up Your First Business Website To attract all potential customers to your business you need a presence on the web. The problem is that if you haven't set up a website before, you p ... LISP is a general-purpose programming language and is the second-oldest programming language still in use, but how much do you know about it? Did you ... Open Source Tools for Developers: Why They Matter From a developer's point of view use of open-source tools has advantages beyond the obvious economic ones. With the open-source database MySQL in mind ... |
Lord of the Strings (Part 2)The Story So Far...When I saw the latest in the Lord of the Rings trilogy of movies a short while ago, I wondered how Tolkien had invented the artificial languages of Middle Earth, such as Dwarvish and Elvish. In my previous article, Lord of the Strings (Part 1), I told of my desire to discover which real language had been the biggest influence on Tolkien for his invented ones. As a software developer, I wanted to discover this information algorithmically. My idea was to use my own string similarity algorithm (see my article 'How to Strike a Match') to compare each word from a list of Tolkien words to words from 14 other real languages. For each Tolkien word, I would find and record the language with the word that is (lexically) most similar. The set of most-similar words and the languages from which they came would provide new insights into the influences on Tolkien. The Part 1 article described how I obtained word lists for 14 real languages, as well as a list of Tolkien words, cleaned them all up in preparation for processing, and represented them in a MySQL relational database. At the end of the article, I had over 1.3 million words in the database, a Java implementation of my similarity metric (from the 'How to Strike a Match' article), and JDBC database access drivers at my disposal that would provide access to the word lists from a Java program. I also used some SQL queries to analyse the sizes of the word lists, as these will be important for any conclusions made. Preparing to Store Similarity ResultsNow that the word lists were ready, my attention turned to the algorithm that would be applied to the word lists and the results that would be generated. The question in my mind was 'What should I do with a result once I have discovered it?' That is, as soon as I know the most similar word to a Tolkien word, do I print it out to the screen, only for that information to become lost when the screen buffer size is exceeded? Do I write it out to a file, so that the result is persisted for later analysis? Or do I write the results back to the database, so that they are immediately available for analysis through SQL queries? In fact, I opted to write the results back to the database, but not for the reason given above. Actually, I was concerned about the size of the problem. There were 470 Tolkien words in the database, each of which could potentially be compared against 1.3 million other strings. That's a lot of computation, and even for a single given word I thought it would surely take a while to compute the most similar string. My idea was to use the persistence of the database to help address the combinatorics of the problem. If I designed the program such that it wrote results to the database as it computed them, and chose words to analyze that are not already in the results list in the database, then the computer could be shut down and restarted and the program would still pick up where it left off. In other words, very little processing time would be lost through a system shutdown, even if the analysis was not complete. This is very desirable behavior, so I created another database table, called 'matches' for storing the results of the similarity comparisons. The table needed to store the word (or its id), its best matching word, the language of the best matching word, and the similarity score. I used the following SQL command to create it: CREATE TABLE matches ( word varchar(60), word_id int(10) NOT NULL, best varchar(60), lang enum("DANISH", "DUTCH", "ENGLISH", "FINNISH", "FRENCH", "GERMAN", "HUNGARIAN", "JAPANESE", "LATIN", "NORWEGIAN", "POLISH", "SPANISH", "SWAHILI", "SWEDISH", "TOLKIEN"), similarity float, primary key(word_id), index word_i (word_id) ); (Although not strictly necessary, I have included the word and best word in this table as well as in the words table. I realize this duplicates information and runs the risk of the data becoming inconsistent, but it keeps things simple in this article. The normalization and further refinement of the database schema is left as an exercise for the reader... :-)) Discovering String SimilaritiesI wrote a Java program that, for each Tolkien word, computes the most similar word and stores that value back into the database. I will explain how the program works without providing details about the JDBC database access (if you are interested in such details, try this short introduction, or a book such as Beginning Java Databases.) In fact, the details of the database access are also hidden from the main program, as it uses a database access class which hides its inner workings, and instead exposes the following interface: public interface QueryRunner { public void openConnection() throws SQLException; public ResultSet runQuery(String query) throws SQLException; public int runUpdate(String sql) throws SQLException; public void closeConnection() throws SQLException; }
As you can see there's a method for opening a database connection, methods for executing a query and performing a database update, and finally, a method for closing the connection to the database. It is a very simple interface, but sufficient for the task at hand. The program also makes use of a small class that represents a word, its identifier, and the language from which it came: class Word { private String word; private int id; private String lang;
public Word(String wd, int wdId, String language) { word = wd; id = wdId; lang = language; }
public String toString() { StringBuffer buf = new StringBuffer("Word["); buf.append(word); buf.append(","); buf.append(id); buf.append(","); buf.append(lang); buf.append("]"); return buf.toString(); } }
Now I return to the main logic of the program, which is implemented in five methods of a class called 'Compare'. The top-level structure of the class is as follows (and you can see the full program listing here): public class Compare { protected QueryRunner queryRunner;
public Compare() {} protected Word findBestMatch(Word baseWord, Collection collection) {} protected void storeBestMatch(Word w, Word bestMatch, double bestScore) {} protected Word getWord(String lang) throws SQLException {} protected Collection getCandidateStrings(String str) throws SQLException {} } As you can see, the class uses an instance of a QueryRunner to access the database, and also makes use of the Word class, both as a method parameter and a result. The following pseudo-code explains the algorithm and the interaction between the methods: OpenDatabaseConnection; // Get a Tolkien word that has not been considered While (w = getWord("Tolkien") { // Run a first pass query to return promising strings Collection candidates = getCandidateStrings(w); // Find the best match of those candidates findBestMatch(w, candidates); } CloseDatabaseConnection;
First, I open a database connection, and then repeatedly retrieve Tolkien words that have not yet been considered. (The getWord() method runs an SQL query, which returns a Tolkien word that does not yet appear in the comparison results table.) For each Tolkien word, I run a 'first pass' query to retrieve a subset of promising 'candidate' words from all those in the database. I used the database's pattern matching ability to return all those (non-Tolkien) words that start with the same two letters as the Tolkien word. The program then calls findBestMatch() on the Tolkien word and the candidate words to find and store the best matching word. When this has been done for all the Tolkien words, the database connection is closed.
findBestMatch(Word w, Collection candidates): bestScore = -1; for each c in candidates { similarity = compareStrings(c, w); if (similarity > bestScore) { bestScore = similarity; bestMatch = c; } } storeBestMatch(w, bestMatch, bestScore); In the findBestMatch() method, the program iterates through each candidate word and computes the similarity metric. If the similarity metric is better than the previous best, then the program stores the new value as its current best. Once all the candidate strings have been considered, the best match is saved to the database using an SQL update query. Analyzing The ResultsWith the first pass mechanism incorporated into the algorithm as described, I found that the program runs in quite acceptable time. On my 1.6 GHz PC with 768 MB Ram using Sun's Java J2SE 1.4.2 and MySQL 3.23.49, it runs to completion in about four minutes. The java program and database both run on the same machine, so there is no network latency in this configuration. The main reason for this relatively quick execution time is the first pass mechanism that returns a subset of the whole dictionary of words. Without this mechanism, I am certain that the program would take orders of magnitude longer to finish. (And if it took longer to finish, then my idea of storing results back into the database as they are found would gain significance.) Anyway, let's have a look at the results obtained. First of all, let's see how many exact matches were found and for which languages: mysql> select count(*), lang from matches where similarity=1.0 group by lang; +----------+-----------+ | count(*) | lang | +----------+-----------+ | 2 | DANISH | | 7 | DUTCH | | 3 | ENGLISH | | 13 | FINNISH | | 1 | FRENCH | | 1 | GERMAN | | 1 | HUNGARIAN | | 7 | JAPANESE | | 2 | LATIN | | 2 | NORWEGIAN | | 2 | POLISH | | 4 | SPANISH | | 1 | SWEDISH | +----------+-----------+ 13 rows in set (0.00 sec)
The most exact matches were found for Finnish. By executing the following query, we can inspect the words of those exact matches: select word from matches where similarity=1 and lang='Finnish';
The exact matches were found for the Finnish words 'andor', 'aragorn', 'arwen', 'balin', 'hobbit', 'ilmarin', 'lindon', 'maia', 'nahar', 'ori', 'rohan', 'sylvan', and 'velar'; whilst English scored exact matches for the words 'goblin', 'hobgoblin' and 'thrush'. The following query shows the frequency with which each language was chosen as being most similar, over the whole set of 470 Tolkien words. mysql> select lang, count(*) as hits from matches group by lang order by hits desc; +-----------+------+ | lang | hits | +-----------+------+ | FINNISH | 100 | | ENGLISH | 64 | | SPANISH | 62 | | JAPANESE | 52 | | DUTCH | 40 | | LATIN | 26 | | POLISH | 25 | | DANISH | 19 | | NORWEGIAN | 19 | | FRENCH | 18 | | HUNGARIAN | 18 | | GERMAN | 16 | | SWAHILI | 7 | | SWEDISH | 4 | +-----------+------+ 14 rows in set (0.00 sec)
If a language is selected as the best match for a Tolkien word, I call it a 'hit'. As you can see from the results of the query, Finnish scores the highest by a clear margin, with 100 hits. The languages with the next-most hits are English, Spanish and Japanese, with 64, 62 and 52 hits, respectively. At first sight, this seems to support the argument that Finnish is the language that had the biggest influence on Tolkien (as suggested by the Finnish speaker, Harri Perälä). However, if you recall the sizes of the word lists, analyzed at the end of the Part 1 article, you will remember that the Finnish word list is also the largest word list of all the languages, comprising 21.4% of the words in the database. So perhaps the reason why Finnish scores so many hits is simply because I had so many Finnish words in the database? Clearly, to conclude anything from the results, we need to take account of the relative sizes of the word lists for each of the languages. Perhaps the best way of doing this is to compute an expected number of hits for each language (based on the size of its word list), and compare that number to the actual number of hits that we found. The following pair of queries computes the number of expected hits for each of the languages, under the assumption that any of the languages is equally likely to contain the most similar word to any given Tolkien word (a so-called 'null hypothesis'). mysql> select @total:=count(*) from words where lang<>'tolkien'; +------------------+ | @total:=count(*) | +------------------+ | 1342940 | +------------------+ 1 row in set (1.20 sec)
mysql> select lang, round(count(*)*470/@total,1) as expected_hits from words where lang<>'tolkien' group by lang order by expected_hits desc; +-----------+---------------+ | lang | expected_hits | +-----------+---------------+ | FINNISH | 100.5 | | DUTCH | 62.4 | | GERMAN | 56.0 | | FRENCH | 48.4 | | JAPANESE | 40.3 | | POLISH | 38.3 | | SPANISH | 30.1 | | LATIN | 27.0 | | NORWEGIAN | 21.6 | | ENGLISH | 19.7 | | DANISH | 8.9 | | SWAHILI | 6.4 | | HUNGARIAN | 6.2 | | SWEDISH | 4.2 | +-----------+---------------+ 14 rows in set (1.06 sec)
Computing the expected number of hits is the statistical equivalent of a control experiment in the physical sciences - in other words, what would be the outcome if there were no interesting behaviour to study? You might think such an exercise to be of little use, as we're pretty sure there is something interesting going on. The point is that we can now compare the actual results that we obtained to the expected results, and look at the differences. The following table orders the languages according to the size of the difference between actual and expected number of hits.
Now we're getting somewhere. The languages with differences that are greater than zero may have had an influence on Tolkien. Furthermore, the size of the difference is also an indication of the level of influence. So we're beginning to see that Tolkien's mother tongue of English seems to have had the most profound influence on him. My reaction to this was, at first, one of surprise, then of reassurance. I was surprised because of the apparent dissimilarity of Tolkien's invented words to English, and the fact that the Tolkien words matched only three English words exactly. But I was reassured because Tolkien was, after all, English, and you would expect him to be heavily influenced by his native language. Note also the particularly strong result for Hungarian, which received nearly three times as many hits as expected. Finnish performed almost exactly as expected, indicating no appreciable influence on Tolkien, whilst French and German performed well under expectations, perhaps indicating that Tolkien was deliberately avoiding the influences of these languages. ConclusionsWhen I started this investigation, I had no idea what the result would be. I just clung firmly onto the belief that my string similarity metric, together with a simple algorithm to iterate over the set of possible word pair comparisons, would provide an interesting result. In fact, the results are very satisfying. I found that English had a profound effect on Tolkien's invented languages, with perhaps further influences from Hungarian and Spanish. This is satisfying because it is entirely reasonable (at least the part about English!), though not exactly what I expected after reading about the (apparently unfounded) claims for the influences of Finnish. It is also satisfying because it increases my confidence in the string similarity method. And as developers, we like to have confidence in our methods. Simon White |