Reading transcriptions of manuscripts
A technical post about how we present manuscripts in our volumes
Some of the texts in the EdRLS volumes will be based on manuscripts, and this leads us to the problem of how to present them. One way to do this would be to ‘reproduce’ the manuscript as a full diplomatic transcription with all deletions and insertions. EdRLS has decided not to do this, but to ‘publish’ MSS in a reading transcription, with the volume editor acting as a respectful intermediary: ignoring deletions, adding insertions, changing underlinings to italics, ‘&’ to ‘and’, correcting clear slips of the pen, supplying clearly missing punctuation and correcting all other language-processing errors.
An unavoidable problem comes with non-standard spelling, and for our edition we have decided to ‘correct’ all spelling that we feel sure a contemporary printer would have changed and that RLS would have accepted (and provably did so, accepting the change in dozens of cases where we have a manuscript with one spelling and printed editions with the other). This means that the ubiquitous ‘niether’ is changed to ‘neither’, ‘it’s cause’ becomes ‘its cause’ etc. No problem.
However, we don’t want to iron out any spelling variants that seem to have been acceptable at the time and had a chance of being accepted by a printer or editor: examples in this grey area are ‘to develope’, ‘at the bakers’, ‘to recal’, ‘cloke’, ‘carreer’. Google Advanced Book Searches (GABS) show evidence of these forms being used in the nineteenth century, so how do we decide in these cases?
A proposal: test with Google N-Grams
One way to decide is to look in the ‘Forms’ given by the OED: any form marked as ‘19’ (i.e. 19th century), or with an open range of centuries (century number followed by dash and space), is indicated by the OED as a variant spelling in the 19th century; this helps us decide about ‘develope’, which is marked ‘16–’ (i.e. 16th cent. onwards). However, GABS often shows forms in print that are not listed as variant forms by the OED. In this case, I propose that we test the two forms using Google N-grams.
Google N-Grams shows relative frequency of words and phrases in a huge number of books. Let’s take an example, ‘carreer’, which RLS uses in the MS of ‘Essays, Reflections and Remarks on Human Life’ (1880; ‘at other periods of my carreer’) and again in the MS of Kidnapped (1886; ‘I was in full carreer’, a spelling kept in Barry Menikoff’s edition). Do we keep this as an interesting personal way of writing, a touch of individual savouring, or can we be sure that RLS would have accepted its correction without batting an eyelid and even thanked the printer for helping him with his uncertain spelling? Well, let’s put ‘carreer,career’ in N-Grams, select British English and date range 1870-90…. Press Enter and we get:
This convinces me that ‘carreer’ had a snowball in hell’s chance of getting past a printer in 1886, and that RLS himself would have sensed it as strange if he saw it in print.
But what about if the tested form was around at the time but maybe just happened to get past a printer only a few times? Let’s try ‘cloke’, an interesting case because in Webster’s Dictionary of 1828 it is the one and only spelling given for the word, so it had certainly existed as a respectable spelling in the 19th century. Here’s the result with N-Grams:
‘Cloke’ is there but surviving on the level of the flat-fish. Now here’s my proposal: put the cursor anywhere on the vertical line that marks 1880 (not here: in Google N-Grams) and you get a reading of the frequency for books published in that year (N.B. it includes any historical books published then, which is where I suspect the occurrences of ‘cloke’ come from): in this case it is ‘cloak 0.00115%; cloke 0.00004%’. I’ve underlined the zeros, because I propose that, counting the number of zeros after the decimal point, wherever the minority spelling is within one zero point away (on average 10 times less frequent) we consider it as a variant that would have been reasonably familiar in print; but wherever it is two zero points (or more) away (on average 100(+) times less frequent) we ‘correct’ it. Here, we have four zeros against two, a difference of two zero points, so ‘cloke’ doesn’t pass the test.
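For anyone who would rather not count zeros by eye, the zero-point test can be sketched in a few lines of Python. This is only an illustration of the rule as proposed here: the function names (`leading_zeros`, `zero_point_gap`, `keep_variant`) are made up for the example, and it assumes the percentages are typed in by hand exactly as N-Grams displays them (it does not query N-Grams itself).

```python
def leading_zeros(percent: str) -> int:
    """Count the zeros between the decimal point and the first non-zero digit."""
    whole, _, fraction = percent.strip().rstrip("%").partition(".")
    # A reading of 1% or more has no leading zeros for our purposes.
    if whole.strip("0"):
        return 0
    return len(fraction) - len(fraction.lstrip("0"))


def zero_point_gap(majority: str, minority: str) -> int:
    """How many 'zero points' the minority spelling lies below the majority one."""
    return leading_zeros(minority) - leading_zeros(majority)


def keep_variant(majority: str, minority: str) -> bool:
    """Keep the minority spelling only if it is within one zero point."""
    return zero_point_gap(majority, minority) <= 1


# The 1880 reading quoted above: cloak 0.00115%, cloke 0.00004%
print(zero_point_gap("0.00115%", "0.00004%"))  # four zeros against two
print(keep_variant("0.00115%", "0.00004%"))    # False: 'cloke' is corrected
```

The readings are kept as strings on purpose, since the rule is about the digits as displayed rather than about the exact ratio between the two frequencies.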
[Additional comment (January 2016): the ‘one zero’ measure is I think too rough: one zero point away goes down to .009 vs .0001, which is 99% of occurrences vs 1%. I suggest that ‘rare’ occurrences could be counted as those below 10% with respect to 90% of the dominant form, e.g. .009 vs .0009. The links to the examples of the minority form should also be inspected: some of these may be from books published in the selected period, but in editions of older authors. The actual facsimile pages should also be inspected as sometimes the snippet views show wrongly transcribed forms.]
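The revised percentage measure can be sketched in the same way. Again this is just an illustration under stated assumptions: the names `minority_share` and `is_rare` are invented for the example, the share is taken as the minority frequency divided by the two frequencies added together, and ‘below 10%’ is read strictly.

```python
def minority_share(majority_freq: float, minority_freq: float) -> float:
    """Share of all occurrences (both spellings) that use the minority form."""
    return minority_freq / (majority_freq + minority_freq)


def is_rare(majority_freq: float, minority_freq: float,
            threshold: float = 0.10) -> bool:
    """Treat the minority form as 'rare' when its share is below the threshold."""
    return minority_share(majority_freq, minority_freq) < threshold


# The extreme 'one zero point' case mentioned above: .009 vs .0001
print(round(minority_share(0.009, 0.0001) * 100, 1))  # about 1% of occurrences

# The 1880 reading for cloak / cloke
print(is_rare(0.00115, 0.00004))
```

Note that this only automates the arithmetic: checking the linked examples and the facsimile pages still has to be done by hand.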
OK, that helps us change ‘cloke’ to ‘cloak’; what about the other examples? ‘to develope’ and ‘at the bakers’ pass the test – frequent enough in print to possibly be accepted; ‘recal’ doesn’t, suggesting it should be changed to ‘recall’.