mainly the Google Plus Corpus Linguistics Archive

puzzling language teaching

Elementary, my dear Watson

I was lucky to receive a copy of Corpus Linguistics for Grammar [] today and thought it might be interesting to jot down things here as I read it. A live review, or "liview" if you will :).

The first sentence in Part 1 of the book points out that "elementary, my dear Watson" never appears in any of the Sherlock Holmes books.

I downloaded the texts that the authors mention they used to form their Sherlock Holmes corpus as well as a more complete canon version.

Using AntConc we find that of the 8 instances of "elementary" in the canon, the closest are 2 uses of "elementary, said he".

Of "my dear Watson", the closest could be said to be 3 instances of "exactly, my dear Watson".
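The same sort of counts can be reproduced outside AntConc with a quick shell pipeline. This is just a sketch: canon.txt is a stand-in for the downloaded canon text, and here a tiny two-line sample is fabricated so the commands run on their own.

```shell
# Sketch: count phrase occurrences in a plain-text corpus file.
# canon.txt stands in for the downloaded Sherlock Holmes canon.
printf '"Elementary," said he.\n"Exactly, my dear Watson," said Holmes.\n' > canon.txt

grep -oi 'elementary' canon.txt | wc -l    # total occurrences of "elementary"
grep -ic 'my dear watson' canon.txt       # lines containing "my dear Watson"
```

Note that `grep -c` counts matching lines rather than total hits, so for phrases that might occur twice in one line, `grep -o … | wc -l` is the safer count.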

Here are the files the book used – []

and here are the more complete files – []

Both were taken from [] with just the end notices removed.

One could go on to investigate how film/TV representations of Holmes differ from the written versions, since getting hold of subtitles is fairly easy nowadays.

I may or may not post more about the book. So stay tuned, or don't 🙂

Related – Sherlock Holmes Graded readers []

Alex Boulton e-mail interview

You may have seen the video I posted up of Alex Boulton giving a talk on a meta-analysis of #DDL at #engcorpora2015; link here in case you haven't [].

Alex was kind enough to answer a few questions about #corpora and DDL:

1. Please share with us a little of your background.

Alex: British in a former life, I’ve been in France for 25 years or more now and have no intention of moving if I can help it. I did a BA in French and linguistics at the UEA in Norwich, a sandwich course which entailed a year as a teaching assistant at schools in Alsace – a year which started out extremely scary and ended up as one of the best experiences I could have wished for.

I then worked in a language school in Paris for a couple of years, followed by an MA in TEFL back in Britain. I did a DEA and a PhD in Nancy with Henri Holec at the Crapel, and was recruited to a maître de conférences post teaching in distance degrees in English (linguistics, teacher training, corpus linguistics, translation, research methodology, etc.).

I was eventually persuaded to do an HDR and was recruited to a professor's post here in 2013. My background inevitably means I'm a bit different from most of my colleagues in English or linguistics, and perhaps I do have a more international outlook in some ways. But they also tend to think that being a native speaker of English makes things easy, which is by no means true of institutional life in France and the various responsibilities that go with it.

2. Do you remember the first uses you made of corpora in your teaching?

Alex: When we revamped our distance MA in English about 15 years ago, we took on a lot of students from very different backgrounds, some of whom had never done any linguistics at all before; so I wanted the linguistics component to be new to everyone, both challenging and accessible at the same time. Corpus linguistics seemed to fit the bill, though I had to learn at the same time as my first students. As language majors, I expected them to build and analyse their own corpora for literary, cultural, professional or personal purposes as well as purely linguistic ones.

After a couple of years it suddenly occurred to me that corpus linguistics could have major implications for language learning per se; in other words, I independently hit on the same idea that others (e.g. Tim Johns) had had over 20 years previously. This often happens (to me anyway), which can be depressing initially, but I prefer to be positive about convergent thinking: it means I may actually be doing something right if I'm in such good company. Since then I've been interested in pretty much all aspects of corpus linguistics in language teaching and learning.

3. From your experiences of using DDL in class what key issues would you highlight?

Alex: There’s no one right way to use corpora in language teaching and learning. Some teachers or researchers think that corpora aren’t much use at lower levels, or for non-language majors, or for general purposes, or without extensive training; I’d say that their multiple affordances mean that there’s something there for everyone if introduced appropriately. In many cases this might be little more than using corpora to create materials that aren’t much different from traditional ones (just better informed), or helping learners to use Google more effectively as a way to query the internet (which when you think about it is very similar to DDL, i.e. getting software to help find answers to questions in large collections of texts).

In other cases, learners may assume more responsibility at various stages and even create their own specialised corpora to help with writing or translation, for example. It depends on their current and future needs, both linguistic and professional, as well as their personalities, time available, confidence with ICT, etc.

4. I was interested in some of the theoretical underpinnings you listed in your engcorpora2015 talk; do you have a preference for any? If so, why or why not?

Alex: Certainly DDL would seem to be in line with much of what we know about language and processing. Usage-based theories (e.g. Tomasello 2003) suggest that we need massive exposure to language, but naturalistic contact is simply too rare, especially in foreign-language contexts (e.g. Schmitt et al. in preparation). Zahar et al. (2001) calculate that with an hour of reading a week, their learners would need 29 years to acquire 2000 words incidentally from that reading; DDL can help to organise and focus the exposure (Gaskell & Cobb 2004).

Language is not rule-driven but fuzzy and probabilistic in nature (Hanks 2013), with grammar and meaning both emerging from use (Beckner et al. 2009); and the mind works with exemplars beyond the level of word in line with dynamic systems theory (Larsen-Freeman & Cameron 2008), Sinclair’s (1991) idiom principle, Hoey’s (2005) lexical priming or Taylor’s (2012) model of the mental corpus, and finds support in recent psycholinguistic work on ‘chunking’ (e.g. Millar 2011), among others. It’s interesting how much of this is a more-or-less direct product of corpus linguistics – or maybe it’s circular, since obviously corpus linguists are going to find ways to justify their work.

5. Anything else you would like to add?

Alex: If you watch a professional athlete, they make it look so easy; you realise it isn’t as soon as you try it yourself – how much training, time and talent are needed to get to that result. At the same time, you don’t need to be a professional to do sport and get enjoyment or other benefits from it. It’s the same with corpus linguistics: many people seem to think that it’s beyond the reach of ordinary teachers or learners, but in my experience that’s confusing hard-end research with everyday uses for ordinary people.

Really all it entails is to take a bit of initiative and explore language for yourself with a bit of help from a computer, as opposed to relying exclusively on other people (e.g. teachers, coursebook writers or lexicographers) pre-digesting everything for you. Anyone can chew on language for themselves.


Beckner, C., R. Blythe, J. Bybee, M. Christiansen, W. Croft, N. Ellis, J. Holland, J. Ke, D. Larsen-Freeman & T. Schoenemann (The ‘Five Graces Group’). 2009. Language is a complex adaptive system: Position paper. In N. Ellis & D. Larsen-Freeman (eds), Language as a Complex Adaptive System. Language Learning, 59 (supplement): 1-26.

Gaskell, D. & Cobb, T. 2004. Can learners use concordance feedback for writing errors? System, 32(3): 301-319.

Hanks, P. 2013. Lexical Analysis: Norms and Exploitations. Cambridge MA: MIT Press.

Hoey, M. 2005. Lexical Priming: A New Theory of Words and Language. London: Routledge.

Larsen-Freeman, D. & L. Cameron. 2008. Complex Systems and Applied Linguistics. Oxford: Oxford University Press.

Millar, N. 2011. The processing of malformed formulaic language. Applied Linguistics, 32(2): 129-148.

Schmitt, N., T. Cobb, M. Horst & D. Schmitt. In preparation. How much vocabulary is needed to use English? Replication of Van Zeeland & Schmitt (2012), Nation (2006), and Cobb (2007). Language Teaching.

Sinclair, J. 1991. Corpus, Concordance, Collocation. Oxford: Oxford University Press.

Taylor, J. 2012. The Mental Corpus: How language is Represented in the Mind. Oxford: Oxford University Press.

Tomasello, M. 2003. Constructing a Language: A Usage-Based Theory of Language Acquisition. Cambridge, MA: Harvard University Press.

Zahar, R., T. Cobb & N. Spada. 2001. Acquiring vocabulary through reading: Effects of frequency and contextual richness. Canadian Modern Language Review, 57(3): 541-572.

xml tagging UK 2015 elections

So as the #UK #elections are fast approaching, there have been a number of corpus linguistics analyses; for example, Paul Rayson has a collection of Wmatrix analyses of the three main parties' manifestos []

Inspired by this, and wanting to go through the xml tutorial I posted earlier [], I thought it would be interesting to look at some of the rhetorical devices in the foreword sections of the parties' manifestos.

I have only so far coded the Labour foreword as it was the shortest with 313 words.

The graph [] shows the distribution of words per sentence, revealing an alternating structure of short and long sentences which builds to a peak near the end. The peak sentence has 33 words and ends in "people".
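The words-per-sentence distribution itself can be sketched with standard shell tools. A rough sketch only: labour_foreword.txt is a hypothetical plain-text copy of the foreword (a one-line sample is fabricated here), and naive splitting on .!? will mishandle abbreviations and decimals.

```shell
# Sketch: words per sentence for a plain-text foreword.
# labour_foreword.txt stands in for the real manifesto foreword text.
printf 'A Britain we can be proud of. We will build a better future for working people.\n' > labour_foreword.txt

# Break on sentence-final punctuation, drop empty lines, count words per sentence.
tr '.!?' '\n' < labour_foreword.txt | sed '/^ *$/d' | awk '{print NF}'
```

Piping the resulting numbers into a plotting tool gives the distribution graph.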

In rhetoric, anaphora refers to repeating lexical items, usually at the beginning of a sentence (e.g. "I have a dream" – Martin Luther King). Isocola are repetitions of the same or similar syntactic structures, and tend to occur near the end of sentences (e.g. "veni, vidi, vici" – Julius Caesar).

Cola can be di- (two), tri- (three) or tetra- (four).

In the foreword to the 2015 Labour manifesto there are 5 anaphora and 12 isocola. Of the 12 isocola, 10 are dicola and 2 are tetracola.

The anaphora are – it means, it means, we are a great country, I have heard, this manifesto

Examples of dicola are – great country/great people, great ambitions/great anxieties

The 2 tetracola are –

the countless people/the young people/the dedicated staff/all those who have served our country

your stories/your hopes/your dreams/your frustrations

Of course my understanding and hence coding of these two rhetoric structures may well be off. Use with caution!

I attach the XML for Labour [] and will update with the Conservative and Liberal Democrat forewords when I get some time.

Some more notes on AMALGrAM 2.0

Whilst trying to figure out how best to extract multi-word expressions (MWE) from the output files, which, with my poor coding skills, will take a while, I found a workaround.

First use the cut command on the .sst file to extract all the text:

cut -f2 file.pred.sst > file_cut.txt

Then load up the resulting text file in AntConc and use the following search in concordance window to find all the strong multi-word expressions:

"*_*" – that is, asterisk underscore asterisk

So for example using the Brown M corpus of science fiction texts (with a token count of approx 12000) we get a count of 580 strong multi-word expressions.

of course is the most frequent MWE with 11 counts

had to/have to appears 12 times (6 for each form)

at all appears 5 times

a few/a little appears 8 times (4 per form)
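The same `*_*` search can be approximated outside AntConc with grep. A sketch under one assumption: that strong MWEs appear as underscore-joined tokens in the extracted text, so the fabricated file_cut.txt below stands in for the real output of the cut command above.

```shell
# Sketch: find underscore-joined (strong) MWE tokens and tally them by frequency.
# file_cut.txt stands in for the output of the cut command above.
printf 'He had_to leave of_course .\nIt was a_little odd of_course .\n' > file_cut.txt

grep -oE '[^ ]+_[^ ]+' file_cut.txt | sort | uniq -c | sort -rn
```

`wc -l` on the grep output gives the total strong-MWE count (the 580 figure above for the Brown M corpus), while `sort | uniq -c` gives the per-type frequencies.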

See also: part 1 notes []

Notes for AMALGrAM 2.0 a multi-word expression and semantic tagger

The tagger expects the input file to be part-of-speech (POS) tagged and in a tab-separated format. There also needs to be a blank line between sentences.

1. So first we need to split the words in a file into separate lines:

tr -s '[:space:]' '\n' < file

Note this is for a file that already has spaces around punctuation; for proper tokenisation use the Stanford tokenizer:

java edu.stanford.nlp.process.PTBTokenizer file

Update: I have been having problems running the Stanford tokenizer, so use the following instead:

grep -Eo '\w+|[^\w ]' input.txt

2. POS tag using the Stanford tagger with the -tokenize false flag:

java -mx300m -classpath stanford-postagger.jar edu.stanford.nlp.tagger.maxent.MaxentTagger -model models/english-bidirectional-distsim.tagger -tokenize false -textFile file

3. Insert newline after tagged full stop:

sed 's/\._\./&\n/g' file

4. Convert underscores into tab delimiters:

sed 's/_/\t/g' file

5. Finally run the file through pysupersensetagger (AMALGrAM):

./ file
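Steps 1, 3 and 4 can be chained into a single pipeline. A sketch that skips the Java tagger: the toy tagged.txt below stands in for the tagger's word_TAG output, and the \n in the sed replacement assumes GNU sed.

```shell
# Sketch: prepare AMALGrAM input from POS-tagged word_TAG text.
# tagged.txt stands in for the Stanford tagger's output.
printf 'The_DT dog_NN ran_VBD ._.\n' > tagged.txt

tr -s '[:space:]' '\n' < tagged.txt \
  | sed 's/\._\./&\n/g' \
  | sed 's/_/\t/g' > amalgram_input.txt

cat amalgram_input.txt   # one tab-separated token per line, blank line after the sentence
```

The resulting amalgram_input.txt is in the one-token-per-line, tab-separated format the tagger expects.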

See also: part 2 notes []

Ivor Timmis email interview

A tweet by Pérez-Paredes (@perezparedes) alerted me to the release of a new book titled Corpus Linguistics for ELT: Research & Practice by Dr. Ivor Timmis, who kindly responded to some questions about the new publication.

1. Can you share a little of your background?

I’ve been involved in ELT for about 30 years as a teacher, teacher educator, PhD supervisor and researcher, more or less in that order.  I have a particular interest in spoken corpora and in the relationship between corpus linguistics and language teaching.  Currently I am working on developing an historical corpus of conversations recorded in a working class community in the 1930s (the Bolton/Worktown corpus).  I also have an interest in materials development for ELT and have just co-written a book with Freda Mishan on that topic.

2. Why did you decide to write the book?

Well, I was invited to do it!  However, I did think there was a real need for it.  For too long there has been talk of the ‘potential’ of corpus linguistics to contribute to language teaching, but I don’t think that potential has been realised to any great extent.  As I say in the book, it was once said of an English footballer, ‘He has a great future behind him’.  His youthful promise never came to fruition – I don’t want that to be said about corpora and ELT.

3. How much of your book would you say is about practice rather than theory?

That is difficult to say – the whole point of the book is to encourage teachers to use corpora and corpus tools (and to encourage their learners to do so).  For that reason, every chapter has hands-on practical tasks.  There is, however, enough theoretical background to ensure informed practice.

4. What kind of corpus skills do you assume your readers will have?

None.  There is a chapter on how to build a corpus and the tasks gradually familiarise readers with corpus skills.

5. Are there any similar books you would recommend?

The closest in spirit is:

O'Keeffe, A., McCarthy, M. and Carter, R. (2007) From Corpus to Classroom. CUP.

Their book, however, self-avowedly, focuses on spoken language.

6. Anything else you would like to add?

I think an acquaintance with corpus research and an ability to exploit corpora should be a part of every English language teacher’s repertoire (among the many other skills teachers need).  I hope the book contributes to that aim.

Many thanks to Ivor for the interview. It's great to see a new corpus publication focusing on ELT; I look forward to getting it on my bookshelf.

Related – a couple more interviews with authors of recent books on corpora and teaching:

James Thomas’s Discovering English with SketchEngine []

Christian Jones and Daniel Waller on Corpus Linguistics for Grammar: A guide for research []

Andrew Caines spoken corpus project

I asked Andrew Caines about the spoken corpus he is compiling. He kindly responded:

What is the purpose of this spoken corpus?

We’re collecting this spoken corpus for several reasons: firstly on the basis of ‘the more data the better’ — it does no harm to have new corpora for up-to-date language models, especially of spoken language, as these are harder to collect and therefore harder to come by. Secondly, we’re running the corpus collection exercise as a novel use of crowdsourcing for a special session at the INTERSPEECH conference this year.

And finally, we’re collecting two corpus types: the first is a corpus of English native speakers undertaking similar tasks as are found in EFL exams, so that we can start to address the question ‘what would native speakers do in this context?’ The second is a corpus of bilingual German-English speakers undertaking the same tasks as in the first corpus, but this time allowing us to address the question of first language transfer effects in the second language.

What size are you aiming for?

We’re looking for more than one hundred individual speakers and asking them to talk for anything between 2.5 and 5 minutes each. We don’t yet know how many words this will give us in total: it depends on people’s speaking rates!

How does it differ from the new BNC2014  spoken corpus project?

The new BNC2014 project will be a wonderful update to the original BNC conversation section: containing face-to-face spontaneous dialogues between British English speakers, on whatever topic occurs to them. In contrast, our corpus involves monologues, with people answering our specified questions, and the speakers being native speakers of (British/American/Canadian/etc) English or German.

What format(s) will the corpus be available in to the public?

We’ll make the recordings and transcripts available to other researchers via the Speech and Language Data Repository.

When is a version likely to be available to the public?

We hope it will be made available before the end of this year.

Any other info you would like to add?

We welcome contributions from native speakers of English, and German-English bilinguals.

If you have an Android mobile/tablet device, all you will need is the (free) Crowdee app, 10-15 spare minutes, and a quiet place to record.

We’ll even pay you! (note, there’s a minimum pay-out of €5 on Crowdee).

Further info is available here:

Bigger is not necessarily better

It was interesting to read that a large #corpus such as #COCA-BYU can actually bias the language collected. Before I go on, how do you think a large corpus can bias the data?

This paper – On Using Corpus Frequency, Dispersion and Chronological Data to Help Identify Useful Collocations [] – reports that a lot of the collocations it found were related to food and cooking. That is because the magazines and newspapers used by COCA regularly include recipes.

Another bias the study discusses is in the spoken section of COCA, which uses a lot of television news and talk shows; hence the language reflects that of newscasters and talk show hosts.

Finally, the study highlights the lack of a specific business sub-corpus such as there is for academic language; at the moment business language is spread across spoken, magazine and newspaper sources, but not with particularly high frequencies.

Google as a corpus with students

Although using Google as a #corpus has limitations, I think that with features like customised search, autocomplete and basic search syntax, and above all its dominance in #students' lives, I am more and more inclined to use it in class.

A recent paper called How EFL students can use Google to correct their “untreatable” written errors by Luc Geiller describes the use of a Custom Google Search consisting of 28 news sites with French English learners. (h/t corpuscall @corpuscall)

A custom Google search allows one to tailor the search to sites of interest, thus cutting down on irrelevant results. The paper reports on six types of search the students performed.

Though many students were able to successfully self-correct their errors using Google, some found the process overwhelming, a common issue when using corpus search results.

I wanted to examine work that has been done on using Google as a corpus, so as part of that process I thought it would be useful to compile a list of papers one can read online that look at the teaching aspects of Google as a corpus.

If you know of any others do please add:


How EFL students can use Google to correct their “untreatable” written errors (2014) Luc Geiller

Google Scholar as a linguistic tool: new possibilities in EAP (2012) by Vaclav Brezina

Google Drafting (2012) by Joe Geluso

How Can Search Engines Improve Your Writing? (2011) by Adam Acar, Joe Geluso, Tadaki Shiki

Internet tools for language learning: University students taking control of their writing (2010) by Mark A. Conroy

From Body to Web: An Introduction to the Web as Corpus by Maristella Gatto (2008)