EdRLS

The New Edinburgh Edition of the Collected Works of Robert Louis Stevenson

Language statistics


WordSmith

WordSmith Tools is a program for statistically analysing the vocabulary of large samples of language, to see how words are patterned; it was developed by a gentle genius, Dr Mike Scott. I thought it might be interesting to compare RLS’s essays with some other essays, just to get an inkling of what might make them distinctive. To do this, I made a word-frequency list of all of RLS’s essays and compared it with two control corpuses via a ‘keywords’ analysis. For WordSmith, ‘key words’ are those whose frequency is unusually high in comparison with a reference corpus. (You can also look at the words that are unusually infrequent in comparison with that corpus.) I did this all very quickly, so it’s intended here only as an entertainment that might provoke thought.
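For readers who want to see what a ‘keywords’ comparison involves, here is a minimal sketch in Python. It is not WordSmith’s own code: the post does not say which statistic or settings were used, so the sketch assumes the log-likelihood measure that is common in corpus linguistics. The function name `keyness` and the toy counts are invented for illustration.

```python
import math
from collections import Counter


def keyness(study: Counter, reference: Counter) -> dict[str, float]:
    """Signed log-likelihood keyness for every word in either corpus.

    Positive scores mark words over-represented in the study corpus;
    negative scores mark words under-represented in it.
    (Assumed statistic: the post does not say what WordSmith was set to use.)
    """
    n_study = sum(study.values())
    n_ref = sum(reference.values())
    scores = {}
    for word in set(study) | set(reference):
        a = study[word]       # occurrences in the study corpus
        b = reference[word]   # occurrences in the reference corpus
        # Expected counts if the word were equally common in both corpora.
        e_study = n_study * (a + b) / (n_study + n_ref)
        e_ref = n_ref * (a + b) / (n_study + n_ref)
        ll = 0.0
        if a:
            ll += a * math.log(a / e_study)
        if b:
            ll += b * math.log(b / e_ref)
        ll *= 2
        # Attach a sign so that 'negative keyness' comes from the same score.
        scores[word] = ll if a / n_study >= b / n_ref else -ll
    return scores


# Toy counts, invented purely for illustration (not figures from the essays).
study = Counter({"i": 50, "upon": 20, "which": 5, "the": 100})
reference = Counter({"i": 10, "upon": 5, "which": 40, "the": 120})

scores = keyness(study, reference)
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)  # most 'key' word first, most negatively 'key' word last
```

Sorting by the signed score gives both lists at once: the top of the ranking holds the words characteristic of the study corpus, the bottom those characteristic of the reference corpus.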

1. Comparison of RLS’s essays with Sampson’s 1912 anthology ‘Nineteenth-century Essays’

Sampson’s Nineteenth-century Essays is a one-volume collection (so quite small) including: Carlyle, “On History”; Macaulay, “Ranke’s History of the Popes”; Bagehot, “Shakespeare — the Man”; Newman, “Literature”; Ruskin, “Sir Joshua and Holbein”; Arnold, “Marcus Aurelius”; and Stevenson, “A Penny Plain and Twopence Coloured” (which I omitted from the corpus file).

The most characteristic words in RLS (I took all his essays) compared with Sampson’s selection are (in descending ‘keyness’):

I    MY    UPON    A    ME    YOU    SOME    YOUR    AND    SOMEWHAT    ROAD    MOMENT    LAST    ALTHOUGH    ABOUT    HIMSELF

and the words most characteristic of the Sampson corpus that are little used by RLS in his essays are (in ascending negative ‘keyness’):

PROTESTANTS    SCRIPTURE    DOCTRINE    THEE    CHRISTIANS    THEREFORE    LANGUAGE    ROMAN    SCIENCE    POWER    SPAIN    CHRISTIANITY    EUROPE    THY    PROTESTANT    GREEK    CHURCH    CATHOLIC    HISTORY    THOU    ROME    WHICH

Who would have thought that ‘which’ was so little used by RLS in comparison with the six other essayists?

The two lists make an interesting random poem: RLS’s key words focus on subjectivity, partiality (some, somewhat), concession (although), simply perceived phenomena (and), movement (road – one of only two nouns!) and experience (moment), while the Victorian sages have those terribly heavy nouns and heavy links (therefore, which).

2. Comparison with Modern English Essays, edited by Ernest Rhys

Rhys’s substantial five-volume collection from 1922 contains RLS’s “Walking Tours” (in vol. 2), which I removed from the corpus file. I then made a word-frequency list of the remaining essays and compared it with the wordlist of Stevenson’s essays.
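The preparation step described here (drop one essay, then count the words in the rest) can be sketched as follows. This is only an illustration under stated assumptions: the function `word_frequencies`, the crude tokenizer and the sample sentences are all invented, and WordSmith’s own tokenizing rules may differ.

```python
import re
from collections import Counter


def word_frequencies(texts: dict[str, str], exclude: set[str]) -> Counter:
    """Count word forms across all texts except the excluded titles."""
    counts = Counter()
    for title, text in texts.items():
        if title in exclude:
            continue  # e.g. Stevenson's own essay in the control anthology
        # Crude tokenizer (an assumption): runs of letters, allowing internal
        # apostrophes, upper-cased to match the keyword lists in this post.
        counts.update(re.findall(r"[A-Z]+(?:['’][A-Z]+)*", text.upper()))
    return counts


# Placeholder sentences, invented for illustration; not quotations.
texts = {
    "Walking Tours": "You should walk alone upon the road.",
    "Hypothetical essay A": "The doctrine which he held, and which he taught.",
    "Hypothetical essay B": "History, which is the record of power.",
}

control = word_frequencies(texts, exclude={"Walking Tours"})
print(control.most_common(3))
```

Because the excluded essay never reaches the counter, words that occur only in it (UPON and ROAD in this toy example) have a count of zero in the control list.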

Interestingly the words that stood out as most unusual in comparison with the other texts were again suggestive of subjectivity and interpersonal relations, and once more we find ‘and’:

YOU    I    YOUR    MY    AND

The most characteristic Stevenson words include some proper nouns (Knox, Burns, Arethusa – I had included An Inland Voyage as an essay-like text), but also UPON (RLS tends to use this rather than ‘on’), SOME (13th position) and SOMEWHAT (20th position), as in the previous list. Other interesting words near the top of the list include: PLEASURE (14), PLEASURES (23); YET (15), the only conjunction in the top group; and, once again, ROAD (24).

What about the words that were, instead, significantly more frequent in the five volumes of ‘Modern English Essays’? Here, the list contains a lot of proper names (MONTAIGNE, JAMES, GEORGE…), as Rhys’s selection tends towards critical essays, but once again the word that stands out most in the control corpus in comparison with Stevenson is WHICH. Curious.


Written by rdury

20/11/2012 at 6:27 pm

4 Responses


  1. I worked with WordSmith Tools version 3 for a long time. But WS versions 4 and 5 are not as “translator-friendly” as the previous ones. Now I have been using AntConc and I like it very much, for both linguistic research and terminology (customized corpus).

    • Thanks for this – I didn’t know about AntConc, which I see has the advantage of being freeware and available in both Mac and Windows versions.

      In a separate message Anthony Mandal indicates two web-based language-analysis sites: http://hermeneuti.ca/voyeur and http://voyant-tools.org/.

      rdury

      21/11/2012 at 5:06 am

  2. The infrequency of ‘which’ may be related to the frequency of ‘and’ and to what we know independently of RLS’s preference for the semicolon: he likes to observe and place things side by side rather than insert them in a structure of explanation. Ne cherchez pas à comprendre – la vie c’est une série de flash (“Don’t try to understand – life is a series of flashes”).

    rdury

    21/11/2012 at 5:40 am

  3. […] a recent post, I described two statistical comparisons of Stevenson’s essays with other nineteenth-century […]

