The New Edinburgh Edition of the Collected Works of Robert Louis Stevenson

Reading transcriptions of manuscripts


A technical post about how we present manuscripts in our volumes

Reading transcriptions

Some of the texts in the EdRLS volumes will be based on manuscripts, and this leads us to the problem of how to present them. One way to do this would be to ‘reproduce’ the manuscript as a full diplomatic transcription with all deletions and insertions. EdRLS has decided not to do this, but to ‘publish’ MSS in a reading transcription, with the volume editor acting as a respectful intermediary: ignoring deletions, adding insertions, changing underlinings to italics, ‘&’ to ‘and’, correcting clear slips of the pen, supplying clearly missing punctuation and correcting all other language-processing errors.

An unavoidable problem comes with non-standard spelling, and for our edition we have decided to ‘correct’ all spelling that we feel sure a contemporary printer would have changed and that RLS would have accepted (and provably did so, accepting the change in dozens of cases where we have manuscript with one spelling and printed editions with the other). This means that the ubiquitous ‘niether’ is changed to ‘neither’, ‘it’s cause’ becomes ‘its cause’ etc. No problem.

Problem cases

However, we don’t want to iron out any spelling variants that seem to have been acceptable at the time and had a chance of being accepted by a printer or editor: examples in this grey area are ‘to develope’, ‘at the bakers’, ‘to recal’, ‘cloke’, ‘carreer’. Google Advanced Book Searches (GABS) show evidence of these forms being used in the nineteenth century, so how do we decide in these cases?

A proposal: test with Google N-Grams

One way to decide is to look in the ‘Forms’ given by the OED: any form marked as ’19’ (i.e. 19th century), or with an open range of centuries (century number followed by dash and space), is indicated by the OED as a variant spelling in the 19th century; this helps us decide about ‘develope’, which is marked ’16–’ (i.e. 16th cent. onwards). However, GABS often shows forms in print that are not listed as variant forms by the OED. In this case, I propose that we test the two forms using Google N-grams.

Google N-Grams shows relative frequency of words and phrases in a huge number of books. Let’s take an example, ‘carreer’, which RLS uses in the MS of ‘Essays, Reflections and Remarks on Human Life’ (1880; ‘at other periods of my carreer’) and again in the MS of Kidnapped (1886; ‘I was in full carreer’, a spelling kept in Barry Menikoff’s edition). Do we keep this as an interesting personal way of writing, a touch of individual savouring, or can we be sure that RLS would have accepted its correction without batting an eyelid and even thanked the printer for helping him with his uncertain spelling? Well, let’s put ‘carreer,career’ in N-Grams, select British English and date range 1870-90…. Press Enter and we get:

[Screenshot of the Google N-Grams result for ‘carreer,career’]

This convinces me that ‘carreer’ had a snowball in hell’s chance of getting past a printer in 1886, and that RLS himself would have sensed it as strange if he saw it in print.

But what about if the tested form was around at the time but maybe just happened to get past a printer only a few times? Let’s try ‘cloke’, an interesting case because in Webster’s Dictionary of 1828 it is the one and only spelling given for the word, so it had certainly existed as a respectable spelling in the 19th century. Here’s the result with N-Grams:

[Screenshot of the Google N-Grams result for ‘cloke’ and ‘cloak’]

‘Cloke’ is there but surviving on the level of the flat-fish. Now here’s my proposal: put the cursor anywhere on the vertical line that marks 1880 (not here: in Google N-Grams) and you get a reading of the frequency for books published in that year (N.B. it includes any historical books published then, which is where I suspect the occurrences of ‘cloke’ come from): in this case it is ‘cloak 0.00115%; cloke 0.00004%’. I’ve underlined the zeros, because I propose that, counting the number of zeros after the decimal point, wherever the minority spelling is within one zero point away (on average 10 times less frequent) we consider it as a variant that would have been reasonably familiar in print; but wherever it is two zero points (or more) away (on average 100(+) times less frequent) we ‘correct’ it. Here, we have four zeros against two, a difference of two zero points, so ‘cloke’ doesn’t pass the test.
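The zero-counting test described above can be sketched in a few lines of Python. This is only an illustration, assuming the readings are typed in by hand as the percentage strings that N-Grams displays; the function names are hypothetical and nothing here queries Google.

```python
def zeros_after_point(reading: str) -> int:
    """Count the zeros between the decimal point and the first non-zero digit."""
    # "0.00115%" -> "00115" (the digits after the decimal point)
    digits = reading.strip().rstrip("%").partition(".")[2]
    return len(digits) - len(digits.lstrip("0"))


def passes_zero_test(dominant: str, minority: str) -> bool:
    """True if the minority spelling is within one 'zero point' of the dominant one."""
    gap = zeros_after_point(minority) - zeros_after_point(dominant)
    return gap <= 1


# The 1880 readings quoted in the post: cloak 0.00115%, cloke 0.00004%.
print(zeros_after_point("0.00115%"))             # 2
print(zeros_after_point("0.00004%"))             # 4
print(passes_zero_test("0.00115%", "0.00004%"))  # False
```

With the readings for ‘cloak’ and ‘cloke’ the gap is two zero points, so the sketch agrees with the reasoning above that ‘cloke’ fails.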

[Additional comment (January 2016): the ‘one zero’ measure is I think too rough: one zero point away goes down to .009 vs .0001, which is 99% of occurrences vs 1%. I suggest that ‘rare’ occurrences could be counted as those below 10% with respect to 90% of the dominant form, e.g. .009 vs .0009. The links to the examples of the minority form should also be inspected: some of these may be from books published in the selected period, but in editions of older authors. The actual facsimile pages should also be inspected as sometimes the snippet views show wrongly transcribed forms.]
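The revised measure can be sketched in the same way. The reading of ‘below 10%’ used here is an assumption: the minority form’s share is taken as its frequency divided by the combined frequency of the two spellings. The function names and the default threshold parameter are hypothetical, and the frequencies are again typed in by hand.

```python
def minority_share(dominant: float, minority: float) -> float:
    """Share of the minority spelling among all occurrences of the two spellings."""
    return minority / (dominant + minority)


def is_rare(dominant: float, minority: float, threshold: float = 0.10) -> bool:
    """True if the minority spelling falls below the threshold share (assumed 10%)."""
    return minority_share(dominant, minority) < threshold


# 'One zero point away' at its extreme, as in the comment: .009 vs .0001
print(f"{minority_share(0.009, 0.0001):.1%}")

# The 1880 readings for cloak (0.00115) and cloke (0.00004)
print(is_rare(0.00115, 0.00004))
```

On this reading ‘cloke’ has a share of about 3% in 1880 and still counts as rare. A form that passes would still need the manual checks mentioned above (editions of older authors, mistranscribed snippets).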

OK, that helps us change ‘cloke’ to ‘cloak’, what about the other examples? ‘to develope’ and ‘at the bakers’ pass the test – frequent enough in print to possibly be accepted; ‘recal’ doesn’t, suggesting it should be changed to ‘recall’.

Any comments?


11 Responses


  1. Wow…


    29/05/2013 at 3:03 pm

    • It’s a pity that we can’t (easily) tell how many of those N-Gram clokes were from books quoting the Bible (“… not using your liberty for a cloke of maliciousness, but as the servants of God.” (Peter 2:13-16)), or from Blake (“Shame is Prides cloke.” (The Marriage of Heaven and Hell)). Quotations must skew the statistics a little.


      29/05/2013 at 3:22 pm

      • I agree, and I said I suspected that such cases were where they came from. We could go through all the examples given by Google N-grams (or a sample of them) and check, but in this case it’s not necessary as ‘cloke’ doesn’t pass the test anyway, so we would only succeed in showing that it was even less ‘current’.

        If a spelling unexpectedly did pass the test, then we should check that the statistics aren’t being skewed by the kind of instances you give.


        29/05/2013 at 3:55 pm

  2. The test you propose seems fair and also a powerful use of N-Grams as an editorial tool. I presume, however, this doesn’t bypass discretion where, for example, a non-standard spelling arises in the context of a passage in Scots, or where an archaism may have been intended?


    29/05/2013 at 4:07 pm

    • Good points: each case has to be decided individually, bearing these factors in mind too. And any Scots spellings outside of passages in Scots I think should stay in, for example the repeated spelling ‘sorcerar’ in one of the Fables. So we regularly consult the online DSL and the SND.
      It is a difficult tightrope to walk: you can’t press a button (= write a series of rules that cover every case), and are being continually tempted by what you would prefer. In the end it’s a compromise: remove the most distracting forms, but try to leave the language with a personal flavour.


      29/05/2013 at 5:05 pm

  3. This is an excellent post, Richard. Really enjoyed reading it: a good break from grading papers. I’ll be in touch about various things, once I’ve surmounted the next sixty Walter Scott essays. In the meanwhile, you might also want to experiment with these tools, which I’ve found incredibly useful. I think I mentioned them in Bergamo, but here’s the link: http://voyant-tools.org/

    Anthony Mandal

    29/05/2013 at 8:23 pm

  4. Well, OK – and in the end of the day (it always comes to this) ‘each case has to be decided individually’.

    In the present instance, though, ask yourself what is the data in the Google database from which these statistical ingenuities derive? Do we know?

    Our conundrum is that on the one hand we have a bunch of words in a manuscript by Stevenson – a literary manuscript, moreover, in that it is written wholly or in part to give pleasure through its handling of the language itself – written by a fellow born in Scotland, too.

    On the other hand, who knows where the data in Google being taken as ‘normal’ is from? From my own experience of Google the whole range seems to be there. That’s part of the point.

    The language ‘data’ which it is proposed to use, deciding about Stevenson by using a majority vote of Google, consists of everything from government reports to the poems of Keats, scientific textbooks to popular magazines, original works and translations from other languages.

    The nationalities – and word choices and spellings and house styles – whose language(s) Google sorts blindly as ‘English’ are not even limited to the British Isles. There are Americans and Canadians and countless others in lands from China to Peru, or anyway from South Africa to New Zealand, where something called ‘English’ is written.

    We may be forcing literary prose by a master craftsman, with a one-of-a-kind subject in hand, down to the level of the common herd. Once again we are letting the majority tyrannize – or are tempted to do so – because we think that there’s safety in numbers. Whatever they may look like to other fish, ten thousand sardines schooling together still do not make up a shark.

    I would rather have this information than not have it – although I must say that compiling and extracting and studying it hardly seems to me the best use of anyone’s time who is interested in Stevenson. There are jobs more in need of doing: extra proofreadings and collations, for instance, or checking what really supports some assertions.

    But let the statistic, if it is used at all, be just one detail among many sounder, better ones that the editor also produces – maybe only privately, for the edification of the spouse and children and maybe the general editors – to explain the whole collection of choices that he or she has made. Record them when they are debatable or borderline. And then move on, letting the editorial approach in aggregate and as a whole carry the day, if it can. There is no sense ‘reasoning’ about it: the proof is in the successful literary results. Is this what Stevenson would be proud of? Is this what he would like us to see and judge him by? Only readers can decide that.

    All editorial choices are arbitrary and personal. All that we as readers ask is that there be thoughtful, considered, plausible reasons for the general approach and the choices that have been made.

    Running the text through the Google machine isn’t enough. In my opinion it is close to being nothing at all. It gives no results much worth having (we don’t know what they mean) and it actually moves thinking in a bad direction, not a good one, by taking minnows for sharks. Lots of nobodies are still nobody.

    Another way to put this is to ask a simple question. Do we really want to make choices whose net effect, if we make any large number of them in this way, will be to make Stevenson sound like the average of writers of city reports in San Francisco, travel books from India, and whatever else Google decides is ‘English’?

    As Stevenson himself once said: Not I.

    Roger Swearingen

    30/05/2013 at 2:42 am

    • I totally agree that all editorial choices are in the end arbitrary and personal, and that there should be ‘thoughtful, considered, plausible reasons for the general approach and the choices that have been made’.

      At the end of the post I said that spellings that pass the ‘test’ are ‘frequent enough in print to possibly be accepted’ and a spelling’s failure to pass the ‘test’ ‘suggest[s] it should be changed’. In other words, I wasn’t presenting it as a mechanical technique to decide definitively one way or another.

      It could be, I think, a quick procedure to help in deciding the handful of uncertain cases in each text, along with (i) consultation of OED and SND, (ii) investigation of how the spelling fared in print for other MS examples we have of it, and (iii) internalized understanding and common sense. Like all tests and measurements it’s imperfect but it’s a help.


      30/05/2013 at 6:24 am

  5. Continuing to try out this test, it has helped me correct common misspellings (that I at first thought might have gained a certain independence of their own): dillettante, principle (‘principle virtue’), pavillion. It also helped me decide to keep ‘to wile away’.


    01/06/2013 at 6:53 am

  6. Many thanks for this, Richard. Emerging from a mass of less interesting MS problems (undergraduate exam scripts), I welcome your suggestion of counting the zeros in N-Grams as a useful way of resolving some of the tricky issues arising from working with the MS. However, although textual decisions shouldn’t be left entirely to a feeling in the gut, in the end of the day I do think it’s often what you as an editor feel least uncomfortable with imposing on the reader. For example, I was delighted to stumble across the old Scots spelling ‘sorcerar’ when puzzling over what I had thought was RLS’s mis-spelling of ‘sorcerer’ as ‘sorceror’. However, there is one case in the MS of ‘The Isle of Voices’ (para 26) where Stevenson, who frequently writes above the line a neat version of what he feels might be tricky words for the printer (often Hawaiian names, but not always), quite clearly writes ‘sorceror’. So in the cases where the penultimate letter of ‘sorceror’ in the Fables MS might be an ‘a’ (Stevenson’s ‘a’ and ‘o’ being often difficult to distinguish), I’d of course much prefer Stevenson to be using an old Scots term, but my instinct tells me it’s probably (if less interestingly) just a mistake (I make ‘sorcerer’ beat ‘sorceror’ 3 zeros to 7 in 1890 in N-gram terms). So ‘sorcerar’ might be worth a textual note, perhaps, but I’d hesitate to impose it on a modern readership—mainly because I suspect it distracts the reader’s attention from Stevenson’s story, and foregrounds my work as an editor.

    Bill Gray

    02/06/2013 at 2:23 pm

    • My apologies for my over-hasty comment before: I hadn’t realized that the ‘sorceror’ spelling was a neat ‘spelling-out’ or ‘printing’–that clearly is a clincher.
      Clearly then what looks like ‘sorcerar’ is intended as ‘sorceror’ (why else would RLS write it one way and then give a clear indication of spelling it another way?). Typically RLS goes down to the line to start an up-stroke–this often makes ‘o’ look identical to ‘a’. In this case RLS’s ‘r’ (formed like an inverted ‘v’) starts from the base line, so the descending link-line from the ‘o’ looks like the final stroke of an ‘a’.
      If this interpretation is accepted, then ‘sorcerar’ should be changed to ‘sorceror’ in the diplomatic transcription. Probably GNG and OED will then confirm that this is a personal spelling that would always have been corrected in print.


      02/06/2013 at 7:05 pm
