Wednesday, July 16, 2008

Vocabulary Profiling

A lot of people have discovered Wordle lately and have been using it to make attractive word 'clouds' of biblical texts. (Our church is considering using it to create occasional bulletin covers.) While Wordle is attractive and gives a quick glimpse of important words in a text, for more helpful analysis of a text, you want something like the VocabProfiler. To understand what this tool does, it helps to understand a bit of corpus linguistics, but this blog posting will give you a quick background. I've talked about this kind of stuff before with reference to strategies for minimizing the amount of Greek vocabulary one needs to know. I'll summarize:

  • A person needs to know about 95% of the vocab in a text to comprehend it without frustration.
  • Analyzing large amounts of text allows one to construct reliable frequency lists.
  • In English, learning the 1000 most common words and their families will give you 74% comprehension. (The K1 list)
  • Learning the next 1000 words and their families will add another 5% bringing one up to 79% comprehension. (The K2 list)
  • Rather than trying to learn the next 1000 words, which add only 1%, it is better to identify the most common field-specific words, i.e., words used in a particular field of study or reference. For example, adding the 570 word families common in academic texts increases comprehension by 8.5%. This group of words is called the Academic Word List. (AWL - In non-academic writing, it will provide much lower improvement.)
What VocabProfiler does is provide this kind of analysis. As an example, look at this representation of the text of 3 John from The Message.
The words in blue are on the K1 list, those in green are on the K2 list, those in yellow are on the AWL, and the red words are the remaining words. One of the things we can do is use this tool to compare translations. Below is 3 John in the NRSV. In this case, there is not much difference. Another possible way of using this tool is to compare different passages using the same version. Here is Jude in the NRSV, and it is easy to see that there is a much higher percentage of red words than in the 3 John graphic.
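For readers who like to see the idea in code, here is a minimal Python sketch of this kind of banding. It is not how VocabProfiler itself is implemented. The three word sets are tiny, made-up stand-ins for the real K1, K2, and AWL lists, and the function names `band_of` and `profile` are my own. The sketch also matches exact word forms only, whereas the real lists are built on word families.

```python
import re
from collections import Counter

# Hypothetical miniature word lists, for illustration only.
# The real K1, K2, and AWL lists are far larger and group words into families.
K1 = {"the", "a", "to", "i", "you", "is", "in", "and", "friend", "good"}
K2 = {"pray", "faithful"}
AWL = {"contribute", "evidence"}

BANDS = ("K1", "K2", "AWL", "off-list")


def band_of(word):
    """Return the first band whose list contains the word, else 'off-list'."""
    if word in K1:
        return "K1"
    if word in K2:
        return "K2"
    if word in AWL:
        return "AWL"
    return "off-list"


def profile(text):
    """Return the percentage of running words (tokens) that fall in each band."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return {band: 0.0 for band in BANDS}
    counts = Counter(band_of(token) for token in tokens)
    return {band: 100.0 * counts[band] / len(tokens) for band in BANDS}


if __name__ == "__main__":
    for band, percent in profile("The evidence is good").items():
        print(f"{band}: {percent:.1f}%")
```

Running the same function over two translations of one passage, or over two passages in one translation, gives the kind of comparison described above: the larger the off-list share (the red words), the more challenging the vocabulary is likely to be.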
[UPDATE: As Iyov helpfully notes in the comments, this tool only works for English and French texts.]
Some practical applications of this tool--and you really should use the tool and see all the data it returns in addition to these highlighted texts--include the following:
  • Identify the best words to memorize if one is learning English as a second language and is interested in biblical texts.
  • Compare various translations to gauge their reading levels.
  • Run your own writing through the tool. Preachers, check out the likely-more-challenging words in your sermons! I suspect this tool will turn up a lot of 'churchy' words in red.
  • You can see how James Tauber is trying to apply this kind of linguistics work to Greek (check this post and follow the links) and the development of a graded reader (check this post by James).
(HT: Downes)


  1. It is a pity that the tool is only available in English and French. (I suspect that most readers are fluent in English.)

  2. Sorry, I meant to type "most readers of your blog are fluent in English."