Monday, October 07, 2013

Language stuff - type/token ratio

I just finished the Open University module E303 - English Grammar in Context - it sounds pretty deadly, but I loved it. Language is inherent to who we are as human beings - we all communicate. And yet, it's only relatively recently that the resources have been available to examine language in a systematic, large-scale way. A lot of the underlying theory is actually newer than the computer science theory that felt pretty new when I was doing my first degree.

I've promised blog series before, and they rarely amount to much, but I'd like to see whether I can write about some of the ideas we covered, and maybe get across some of the reasons that I found the material so fascinating.

The first concept is type/token ratio. "Tokens" are the individual words in a piece of text - if I do a word count, then it tells me the number of tokens. But not all of them are unique, and the distinct words are the "types". The most common word, "the", I have now used ten times so far in this text (don't hold me to that - it's likely to have been edited in a highly non-linear manner - but you get the idea). You can get some insight into a text by dividing the number of types by the number of tokens, and expressing it as a percentage. So for the text up to the start of this sentence, there were 229 tokens and 134 types - giving a type/token ratio of 59%.
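As a rough sketch, this is how the calculation might look in Python. The function name and the rule for splitting text into words (lowercase everything, treat runs of letters with optional internal straight apostrophes as one word) are assumptions made for illustration - a different rule will give slightly different counts.

```python
import re

def type_token_ratio(text):
    """Return (tokens, types, ratio), where ratio = types / tokens."""
    # Assumed tokenising rule: lowercase the text, then take runs of
    # letters, allowing internal straight apostrophes (e.g. "don't").
    words = re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())
    if not words:
        return 0, 0, 0.0
    distinct = set(words)  # each unique word counts once as a type
    return len(words), len(distinct), len(distinct) / len(words)

tokens, types, ratio = type_token_ratio(
    "The cat sat on the mat and the dog sat too"
)
print(tokens, types, f"{ratio:.0%}")  # prints: 11 8 73%
```

The same arithmetic applied to the figures above gives 134 / 229, which rounds to 59%.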

A couple of things about type/token ratio. The first is that as a piece of text gets longer, the type/token ratio is likely to fall. The number of tokens is clearly increasing, but the number of types is increasing more slowly - it's more and more likely that you will be using the same words again. What that means is that if you want to compare the type/token ratios of two different texts, they need to be about the same size.
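One simple way to make the comparison fair, sketched below, is to cut both texts down to the length of the shorter one before calculating the ratio. This is only one possible approach and the helper names are made up for the example. The tokenising rule is the same assumed one as before (lowercased runs of letters).

```python
import re

def tokenize(text):
    # Assumed rule: lowercase, runs of letters with optional apostrophes.
    return re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())

def ttr(words):
    """Type/token ratio of a list of words (0.0 for an empty list)."""
    return len(set(words)) / len(words) if words else 0.0

def compare_at_equal_length(text_a, text_b):
    """Truncate both texts to the shorter length, then return both ratios."""
    a, b = tokenize(text_a), tokenize(text_b)
    n = min(len(a), len(b))
    return ttr(a[:n]), ttr(b[:n]), n

# The ratio falls as a repetitive text gets longer:
words = tokenize("a b c a b c a b c")
print(ttr(words[:3]), ttr(words[:6]))  # prints: 1.0 0.5
```

Truncating throws away some of the longer text, so it is a blunt tool, but it keeps the two numbers comparable.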

The next thing is that different sorts of text will have different type/token ratios, as the ratio is a measure of the diversity of the vocabulary being used. For my final assignment, I looked at pop song lyrics. I had a database of around 34,000 words, and this had a type/token ratio of just under 10%. I compared this with a slightly larger database of words from a work of fiction, and this had a higher type/token ratio - just over 12%. A slightly smaller database of words from transcribed conversations had a lower type/token ratio - about 6.5%.

One might assume that the language used in pop music was pretty narrow in its range. But it turns out that it is quite diverse - almost as diverse as fictional writing, and much more so than the sort of language that's used in everyday conversation.
