
Lord of the Strings (Part 1)


Introduction

I recently enjoyed the latest film in the 'Lord of the Rings' trilogy at the cinema. I hadn't read any of Tolkien's books, so watching the films was my first exposure to 'Middle Earth', with all its strange creatures and languages. I was especially intrigued by Tolkien's invented languages (such as Elvish and Dwarvish) and was curious to know where they came from, or, more precisely, which real language was the biggest influence on Tolkien's inventions. As I have been thinking about issues of string similarity recently (see my previous articles 'Taming the Beast by Matching Similar Strings' and 'How to Strike a Match'), I wondered whether I could extend my ideas of string similarity to language similarity. In other words, could I discover to which real language Tolkien's artificial languages are most similar?

Apparently, this is a much discussed topic. The article 'Are High Elves Finno-Ugric?' suggests that Finnish had the greatest influence on the development of the Elvish language Quenya. Tolkien first came across a Finnish grammar while he was studying at Oxford, and admitted that it made a strong (even 'intoxicating'!) impression on him. Indeed, early versions of Quenya contain many Finnish or near-Finnish words, although the meanings of the words are not those of Finnish. Tolkien himself wrote that Quenya was based on Latin, but with the added 'phonaesthetic ingredients' of Finnish and Greek. It has also been argued that some aspects of Tolkien's invention are more like Uralic languages outside the Baltic-Finnic group, whilst other aspects more closely resemble Hungarian.

An Algorithmic Approach

As a developer, I was thinking about an algorithmic approach to the problem. My idea was to write a program that takes each Tolkien word in turn and finds which real language contains the most similar word. By counting the number of times each language is chosen, we should be able to decide which language was Tolkien's biggest influence. Of course, I would need to look on the Web to find lists of Tolkien words, as well as word lists for other languages, but I assumed that wouldn't be a problem. My own string similarity metric (described in 'How to Strike a Match') could be used for the word-by-word comparison, and is a good choice because it acknowledges similarity for common substrings of any size, and is robust to differences in string length. Of course, this would be a comparison of lexical similarity, as my string similarity algorithm makes only lexical comparisons. It is still possible that the inspiration for the grammar and the lexical structure of Tolkien's languages came from entirely different sources.
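
To make the word-by-word comparison concrete, here is a minimal Java sketch of the letter-pair similarity metric described in 'How to Strike a Match', together with the voting loop outlined above. The class and method names are my own, chosen purely for illustration; an overview of the actual Java source code follows in the next article.

import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class WordMatcher {

    // Splits a word into its adjacent letter pairs,
    // e.g. "mithril" -> [mi, it, th, hr, ri, il]
    private static List<String> letterPairs(String word) {
        List<String> pairs = new ArrayList<>();
        for (int i = 0; i < word.length() - 1; i++) {
            pairs.add(word.substring(i, i + 2));
        }
        return pairs;
    }

    // Similarity in the range 0..1: twice the number of shared letter
    // pairs, divided by the total number of pairs in both words.
    public static double similarity(String s1, String s2) {
        List<String> pairs1 = letterPairs(s1.toLowerCase());
        List<String> pairs2 = letterPairs(s2.toLowerCase());
        int totalPairs = pairs1.size() + pairs2.size();
        if (totalPairs == 0) return 0.0;
        int shared = 0;
        for (String pair : pairs1) {
            if (pairs2.remove(pair)) {  // each pair may only be matched once
                shared++;
            }
        }
        return (2.0 * shared) / totalPairs;
    }

    // Finds the language whose word list contains the closest match to the
    // given Tolkien word; tallying these votes over the whole Tolkien word
    // list suggests which language was the biggest influence.
    public static String bestLanguage(String tolkienWord,
                                      Map<String, List<String>> wordLists) {
        String bestLang = null;
        double bestScore = -1.0;
        for (Map.Entry<String, List<String>> entry : wordLists.entrySet()) {
            for (String candidate : entry.getValue()) {
                double score = similarity(tolkienWord, candidate);
                if (score > bestScore) {
                    bestScore = score;
                    bestLang = entry.getKey();
                }
            }
        }
        return bestLang;
    }
}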

Although I had an existing implementation of the string similarity metric and a good idea of the basic approach, this was a truly investigative project. I didn't know what the outcome was going to be, and I knew there would be some problems to solve along the way. But then, that's what makes it so pleasing when you do get a result. In this article, I explain the first part of my investigation - how I obtained the word lists that enabled me to do the analysis, how I processed them to clean them up, and how I represented the word lists in a database.

Acquiring and Cleaning the Word Lists

A quick Google search led me to believe that I should be able to get the data sources I needed to do the investigation - I found suitable word lists at phreak.org and cotse.com, including a list of Tolkien's invented words.

After downloading a number of these word lists, I found that they needed some 'cleaning' before I could use them: I wanted each file to be a list of words, formatted as one word per line, and this was not the case with several of the downloaded files. For these basic file manipulation and formatting tasks, I found the speed and flexibility of the Unix-style bash shell invaluable. The tasks were as follows:

  1. Using a text editor, I deleted any explanatory text or comments from the tops of the files.
  2. I found that the list of Hungarian words had a word followed by a number on each line. I stripped the numbers from this file using the simple awk script '{print $1}'.
  3. I sorted the files, and then used a text editor to remove non-alphabetic words, such as '>='.
  4. I combined different word lists for the same language. For example, there were multiple word lists for English, so I simply appended one list onto the end of the other:

cat englex-dict.txt >> english.txt

  5. I removed duplicates from all of the word lists. It is easy to do this programmatically in the bash shell. For example:

cat english.txt | sort | uniq > new-english.txt

  6. I made sure that all the files used the same line termination sequence. (Text files created under Windows use two characters, Carriage Return and Line Feed, to signify the end of a line, whereas Unix uses just a Line Feed.) As I was using the bash shell, it was easiest to convert all the files to Unix file format using:

dos2unix *.txt

At this point, I had 15 word lists of different sizes, and a total of over 1.3 million words (the wc command shows the number of lines, words and characters in each file):

$ wc *.txt
  25485   25485  259551 danish.txt
 178429  178430 1998881 dutch.txt
  56553   56553  509773 english.txt
 287698  287698 3500749 finnish.txt
 138257  138257 1524757 french.txt
 160086  160086 2060734 german.txt
  18028   18028  172943 hungarian.txt
 115506  115506  934652 japanese.txt
  77107   77107  850131 latin.txt
  61843   61843  589234 norwegian.txt
 109861  109861 1022137 polish.txt
  86061   86061  850532 spanish.txt
  18417   18417  181973 swahili.txt
  12146   12146  105192 swedish.txt
    470     470    3768 tolkien.txt
1345947 1345948 14565007 total

Storing the Word Lists in a Database

Now that the word lists had been cleaned, my next aim was to access them from a computer program. Although I could have written a program to access the word lists directly as files, I felt a database would offer considerable flexibility to query the data and analyse the results. I was also worried about the volume of data, and reasoned that a database would help in accessing and managing the word lists efficiently. I didn't look around much when choosing a database to store the word lists - MySQL was the natural choice because it is fast, flexible and, above all, free. And besides, it was already installed on my computer!

I knew I would need only a single table to store all the word lists in the database. Each row of the table could hold one word together with the language to which it belongs. However, to devise the schema precisely, I needed to find out how many characters to allow per word. A quick bash shell command against the text files told me the lengths of the words in the word lists:

$ cat *.txt|awk '{print length($0)}'|sort -n|uniq

The command first runs an awk script over the text files to get the lengths of the lines, then performs a numeric sort, and finally removes duplicate lines from the output. Using this command, I found that the longest word in the input was 57 characters, so I decided to make the database column that holds the words 60 characters long.

The table for storing the words is created as follows:

CREATE TABLE words (
  word        varchar(60),
  lang        enum("DANISH", "DUTCH", "ENGLISH", "FINNISH", "FRENCH", "GERMAN", "HUNGARIAN", "JAPANESE", "LATIN", "NORWEGIAN", "POLISH", "SPANISH", "SWAHILI", "SWEDISH", "TOLKIEN"),
  word_id     int(10)     NOT NULL auto_increment,
  primary key (word_id),
  index lang_i (lang),
  index word_i (word)
);

In addition to the word and its language, there is an identifier (word_id), which is the primary key for the table. Note that I used an enum type for the lang column, since we know we are only going to use a limited set of languages. By using an enum, only one byte need be used to store the value of that column - far less data than if I'd used a varchar. I also added indexes for the lang and word columns to improve execution times for queries that constrain these columns.

Now for each language, I loaded the words from the text file into the database table. The following two SQL commands load the word list for Danish; the other languages were loaded similarly.

load data infile 'C:\\temp\\danish.txt' into table words(word);
update words set lang='danish' where lang is null;

Once the word lists were in the database, I carried out one further data cleansing action. I removed any words that were less than three characters long:

mysql> delete from words where length(word)<3;
Query OK, 2112 rows affected (4.70 sec)

Checking the Data

At this point it is reassuring to run a query to get an overview of the data that we have stored. First, let's check how many words are in the database. As we have one word per row of the database table, that is the same as counting the number of rows in the table. The following query counts the number of rows, and also stores that value in a variable called @total.

mysql> select @total:=count(*) as wordcount from words;
+-----------+
| wordcount |
+-----------+
|   1343410 |
+-----------+
1 row in set (0.00 sec)

Now let's look at the breakdown of the word lists into languages:

mysql> select lang as language, count(word) as wordcount
       from words group by lang;
+-----------+-----------+
| language  | wordcount |
+-----------+-----------+
| DANISH    |     25291 |
| DUTCH     |    178341 |
| ENGLISH   |     56355 |
| FINNISH   |    287231 |
| FRENCH    |    138168 |
| GERMAN    |    159989 |
| HUNGARIAN |     17818 |
| JAPANESE  |    115291 |
| LATIN     |     77049 |
| NORWEGIAN |     61679 |
| POLISH    |    109343 |
| SPANISH   |     85965 |
| SWAHILI   |     18363 |
| SWEDISH   |     12057 |
| TOLKIEN   |       470 |
+-----------+-----------+
15 rows in set (3.55 sec)

Given that we stored the total number of rows in a variable, it is now quite easy to run a query that expresses the word counts as percentages of the total number of words. The query rounds each percentage to one decimal place.

mysql> select lang as language, round(100*count(word)/@total,1) as percent
       from words group by lang;
+-----------+---------+
| language  | percent |
+-----------+---------+
| DANISH    |     1.9 |
| DUTCH     |    13.3 |
| ENGLISH   |     4.2 |
| FINNISH   |    21.4 |
| FRENCH    |    10.3 |
| GERMAN    |    11.9 |
| HUNGARIAN |     1.3 |
| JAPANESE  |     8.6 |
| LATIN     |     5.7 |
| NORWEGIAN |     4.6 |
| POLISH    |     8.1 |
| SPANISH   |     6.4 |
| SWAHILI   |     1.4 |
| SWEDISH   |     0.9 |
| TOLKIEN   |     0.0 |
+-----------+---------+
15 rows in set (3.56 sec)

By converting the output of this query to a comma-separated values file, and then loading it into Microsoft Excel, we can generate the following pie chart:

[Pie chart: percentage of the combined word lists contributed by each language]
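
There are many ways to produce such a comma-separated values file; here is a minimal Java sketch using the standard JDBC API. The connection URL, user name, password and output file name are placeholders for illustration, and the percentage calculation is written with a subquery so that the snippet does not depend on the @total session variable.

import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class ExportCsv {
    public static void main(String[] args) throws SQLException, IOException {
        // Connection details are placeholders - substitute your own.
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:mysql://localhost/wordlists", "user", "password");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                 "select lang, round(100*count(word)/" +
                 "(select count(*) from words),1) from words group by lang");
             PrintWriter out = new PrintWriter(new FileWriter("percentages.csv"))) {
            // Write one line per language: LANGUAGE,percent
            while (rs.next()) {
                out.println(rs.getString(1) + "," + rs.getString(2));
            }
        }
    }
}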

It is important that we understand how well represented the different languages are in the set of word lists, as this will affect our interpretation of the results of the lexical similarity analysis. Clearly Finnish is well represented in our word lists, as are Dutch, French and German. With Finnish having the largest number of words, we have a good starting point for testing the belief that Finnish was the biggest influence on Tolkien's languages of Middle Earth.

In my next article, I will explain the algorithm that I used to analyse the word lists, present an overview of the Java source code, and reveal the language that, according to my findings, most influenced Tolkien.



Simon White