mainly the Google Plus Corpus Linguistics Archive

puzzling language teaching

SKELL examples for a review quiz

Everyday corpus use

I wanted to do a quick review quiz for a class that has to learn a set of education-related vocabulary such as take an exam, internships, syllabus, etc.

A gap fill is one way to do such a quiz, and to find good, authentic examples we can use SKELL.

SKELL (Sketch Engine for Language Learning) lets you look for example sentences based on the search word, as well as its common grammatical collocations and similar words.

Notably, it uses an automatic procedure to list “good” examples, i.e.

filtering out effectively all sentences with special terminology, typos and rare words (rare names). By default, short sentences are preferred, sentences containing inappropriate or spam words are scored lower.

So in about 30 mins I was able to prepare 10 questions for the quiz, nice 🙂
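For anyone who wants to automate the blanking step, here is a minimal sketch in Python. It assumes you have already copied suitable sentences out of SKELL by hand. The function name make_gap_fill and the two example sentences are made up for illustration and are not SKELL output.

```python
import re

def make_gap_fill(sentence, target, blank="_____"):
    """Replace whole-word occurrences of `target` (case-insensitive) with a blank."""
    pattern = re.compile(r"\b" + re.escape(target) + r"\b", flags=re.IGNORECASE)
    gapped, replaced = pattern.subn(blank, sentence)
    if replaced == 0:
        # Fail loudly so a sentence without the target never ends up in the quiz.
        raise ValueError(f"{target!r} not found in sentence")
    return gapped

# Invented example sentences standing in for ones chosen from SKELL.
examples = [
    ("You must take an exam at the end of the course.", "take an exam"),
    ("The syllabus lists every topic we will cover.", "syllabus"),
]

for number, (sentence, target) in enumerate(examples, start=1):
    print(f"{number}. {make_gap_fill(sentence, target)}")
```

Picking the sentences is still the part that needs a teacher's eye; the script only saves the retyping.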

Jeremy Corbyn phrasal verbs

So, just putting this out there on the off chance that some may find it of use.

It is a list of all the phrasal verbs in Jeremy Corbyn’s final rally speech, which took place on the Friday before he was announced winner of the Labour leadership contest on the Saturday.

I may well have missed some verbs 🙂

the PHaVE list refers to

Also, some more CL-related Labour leadership material: a venncloud of speeches by the candidates.

Spoken Corpora

Publicly available spoken corpora are rarer than the rarest thing you can think of.

A couple of people have been enquiring, and my latest look has turned up the ORTOLANG Speech and Language Data Repository (SLDR/ORTOLANG), hat tip Yutaka Ishii @yishii_0207.

In particular there is the Open ANC, which includes the Charlotte Narrative corpus.

This may be of use to people looking for a ‘conversation’ corpus or looking to do ‘backchannel’ language research (those were the two recent queries).

Business idioms and business corpora

Here’s a little exercise some might be interested in doing together.

People often post lists of vocabulary without any explicit reference to corpora; for example, here is a recent list of business idioms.

How many in this list can be found in, say, these two business corpora: the 1Million Business Letter Corpus and the 2Million Business English Corpus (use test for both username and password to access this corpus)? Of course, both are from the written genre, so bear this in mind.

Across the board is a promising start with 3 instances in the Business Letter Corpus and 4 instances in the Business English Corpus.

Ahead of the curve produces no instances in either corpus.

to be continued… by you? 🙂
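If you have a corpus as a plain-text file rather than behind a web interface, the same check can be scripted. The sketch below is only an illustration: count_phrase is a hypothetical helper, and the sample text is invented rather than taken from either business corpus.

```python
import re

def count_phrase(text, phrase):
    """Count case-insensitive occurrences of a multi-word phrase in text."""
    words = [re.escape(word) for word in phrase.split()]
    # \s+ between words lets a phrase match even when it is split across a line break.
    pattern = re.compile(r"\b" + r"\s+".join(words) + r"\b", flags=re.IGNORECASE)
    return len(pattern.findall(text))

# Invented sample text; in practice read the corpus file into `sample` instead.
sample = "Prices rose across the board. We cut costs across\nthe board."

for idiom in ["across the board", "ahead of the curve"]:
    print(idiom, count_phrase(sample, idiom))
```

This only matches the exact form, so inflected idioms (e.g. a verb in a different tense) would need a looser pattern.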

Elementary, my dear Watson

I was lucky to receive a copy of Corpus Linguistics for Grammar today and thought it might be interesting to jot down things here as I read it. A live review, or liview if you will :).

The first sentence in Part 1 of the book points out that elementary, my dear Watson was never written in any of the Sherlock Holmes books.

I downloaded the texts that the authors mention they used to form their Sherlock Holmes corpus as well as a more complete canon version.

Using AntConc we find that of the 8 instances of elementary in the canon the closest are 2 uses of elementary, said he.

Of my dear Watson the closest could be said to be 3 instances of exactly, my dear Watson.
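For those without AntConc to hand, a rough keyword-in-context (KWIC) search can be sketched in Python. This is not how AntConc works internally, just a simple approximation. The kwic function is hypothetical, and the sample string is a made-up stand-in for the downloaded texts.

```python
import re

def kwic(text, keyword, width=30):
    """Return keyword-in-context lines with `width` characters either side."""
    text = " ".join(text.split())  # collapse line breaks and repeated spaces
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", flags=re.IGNORECASE)
    lines = []
    for match in pattern.finditer(text):
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        # Right-align the left context so the keywords line up in a column.
        lines.append(f"{left:>{width}}[{match.group(0)}]{right}")
    return lines

# Made-up sample; in practice read the corpus files into one string.
sample = "It was elementary, said he. Nothing elementary about it."
for line in kwic(sample, "elementary", width=10):
    print(line)
```

Running the same function with "my dear Watson" as the keyword gives the second search.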

Here are the files the book used – []

and here are the more complete files – []

Both taken from with just the end notices removed.

One could go on to investigate how film/TV representations of Holmes differ from the written versions, since getting hold of subtitles is fairly easy nowadays.

I may or may not post more about the book. So stay tuned or don’t 🙂

Related – Sherlock Holmes Graded readers

Alex Boulton e-mail interview

You may have seen the video I posted of Alex Boulton giving a talk on a meta-analysis of #DDL at #engcorpora2015.

Alex was kind enough to answer a few questions about #corpora and DDL:

1. Please share with us a little of your background.

Alex: British in a former life, I’ve been in France for 25 years or more now and have no intention of moving if I can help it. I did a BA in French and linguistics at the UEA in Norwich, a sandwich course which entailed a year as a teaching assistant at schools in Alsace – a year which started out extremely scary and ended up as one of the best experiences I could have wished for.

I then worked in a language school in Paris for a couple of years, followed by an MA in TEFL back in Britain. I did a DEA and a PhD in Nancy with Henri Holec at the Crapel, and was recruited to a maître de conférences post teaching in distance degrees in English (linguistics, teacher training, corpus linguistics, translation, research methodology, etc.).

I was eventually persuaded to do an HDR and was recruited to a professor’s post here in 2013. My background inevitably means I’m a bit different from most of my colleagues in English or linguistics, and perhaps I do have a more international outlook in some ways.  But they tend also to think that being a native speaker of English makes things easy, which is by no means true of institutional life in France and the various responsibilities that go with it.

2. Do you remember the first uses you made of corpora in your teaching?

Alex: When we revamped our distance MA in English about 15 years ago, we took on a lot of students from very different backgrounds, some of whom had never done any linguistics at all before; so I wanted the linguistics component to be new to everyone, both challenging and accessible at the same time. Corpus linguistics seemed to fit the bill, though I had to learn at the same time as my first students. As language majors, I expected them to build and analyse their own corpora for literary, cultural, professional or personal purposes as well as purely linguistic ones.

After a couple of years it suddenly occurred to me that corpus linguistics could have major implications for language learning per se; in other words, I independently hit on the same idea that others (e.g. Tim Johns) had had over 20 years previously. This often happens (to me anyway), which can be depressing initially, but I prefer to be positive about convergent thinking: it means I may actually be doing something right if I’m in such good company. Since then I’ve been interested in pretty much all aspects of corpus linguistics in language teaching and learning.

3. From your experiences of using DDL in class what key issues would you highlight?

Alex: There’s no one right way to use corpora in language teaching and learning. Some teachers or researchers think that corpora aren’t much use at lower levels, or for non-language majors, or for general purposes, or without extensive training; I’d say that their multiple affordances mean that there’s something there for everyone if introduced appropriately. In many cases this might be little more than using corpora to create materials that aren’t much different from traditional ones (just better informed), or helping learners to use Google more effectively as a way to query the internet (which when you think about it is very similar to DDL, i.e. getting software to help find answers to questions in large collections of texts).

In other cases, learners may assume more responsibility at various stages and even create their own specialised corpora to help with writing or translation, for example. It depends on their current and future needs, both linguistic and professional, as well as their personalities, time available, confidence with ICT, etc.

4. I was interested in some of the theoretical underpinnings you listed in your engcorpora2015 talk, do you have a preference for any? If so why/why not?

Alex: Certainly DDL would seem to be in line with much of what we know about language and processing. Usage-based theories (e.g. Tomasello 2003) suggest that we need massive exposure to language, but naturalistic contact is simply too rare, especially in foreign-language contexts (e.g. Schmitt et al. in preparation). Zahar et al. (2001) calculate that with an hour of reading a week, their learners would need 29 years to acquire 2000 words incidentally from that reading; DDL can help to organise and focus the exposure (Gaskell & Cobb 2004).

Language is not rule-driven but fuzzy and probabilistic in nature (Hanks 2013), with grammar and meaning both emerging from use (Beckner et al. 2009); and the mind works with exemplars beyond the level of word in line with dynamic systems theory (Larsen-Freeman & Cameron 2008), Sinclair’s (1991) idiom principle, Hoey’s (2005) lexical priming or Taylor’s (2012) model of the mental corpus, and finds support in recent psycholinguistic work on ‘chunking’ (e.g. Millar 2011), among others. It’s interesting how much of this is a more-or-less direct product of corpus linguistics – or maybe it’s circular, since obviously corpus linguists are going to find ways to justify their work.

5. Anything else you would like to add?

Alex: If you watch a professional athlete, they make it look so easy; you realise it isn’t as soon as you try it yourself – how much training, time and talent are needed to get to that result. At the same time, you don’t need to be a professional to do sport and get enjoyment or other benefits from it. It’s the same with corpus linguistics: many people seem to think that it’s beyond the reach of ordinary teachers or learners, but in my experience that’s confusing hard-end research with everyday uses for ordinary people.

Really all it entails is to take a bit of initiative and explore language for yourself with a bit of help from a computer, as opposed to relying exclusively on other people (e.g. teachers, coursebook writers or lexicographers) pre-digesting everything for you. Anyone can chew on language for themselves.


Beckner, C., R. Blythe, J. Bybee, M. Christiansen, W. Croft, N. Ellis, J. Holland, J. Ke, D. Larsen-Freeman & T. Schoenemann (The ‘Five Graces Group’). 2009. Language is a complex adaptive system: Position paper. In N. Ellis & D. Larsen-Freeman (eds), Language as a Complex Adaptive System. Language Learning, 59 (supplement): 1-26.

Gaskell, D. & Cobb, T. 2004. Can learners use concordance feedback for writing errors? System, 32(3): 301-319.

Hanks, P. 2013. Lexical Analysis: Norms and Exploitations. Cambridge MA: MIT Press.

Hoey, M. 2005. Lexical Priming: A New Theory of Words and Language. London: Routledge.

Larsen-Freeman, D. & L. Cameron. 2008. Complex Systems and Applied Linguistics. Oxford: Oxford University Press.

Millar, N. 2011. The processing of malformed formulaic language. Applied Linguistics, 32(2): 129-148.

Schmitt, N., T. Cobb, M. Horst & D. Schmitt. In preparation. How much vocabulary is needed to use English? Replication of Van Zeeland & Schmitt (2012), Nation (2006), and Cobb (2007). Language Teaching.

Sinclair, J. 1991. Corpus, Concordance, Collocation. Oxford: Oxford University Press.

Taylor, J. 2012. The Mental Corpus: How Language is Represented in the Mind. Oxford: Oxford University Press.

Tomasello, M. 2003. Constructing a Language: A Usage-Based Theory of Language Acquisition. Cambridge MA: Harvard University Press.

Zahar, R., T. Cobb & N. Spada. 2001. Acquiring vocabulary through reading: Effects of frequency and contextual richness. Canadian Modern Language Review, 57(3): 541-572.

xml tagging UK 2015 elections

So, as the #UK #elections are fast approaching, there have been a number of corpus linguistics analyses; for example, Paul Rayson has a collection of Wmatrix analyses of the three main parties’ manifestos.

Inspired by this, and wanting to go through the xml tutorial I posted earlier, I thought it would be interesting to look at some of the rhetorical devices in the foreword sections of the parties’ manifestos.

I have only so far coded the Labour foreword as it was the shortest with 313 words.

The graph shows the distribution of words per sentence: short and long sentences alternate, building to a peak near the end, where we have a 33-word sentence ending in people.
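A words-per-sentence distribution like this can be approximated with a few lines of Python. The sketch below uses a naive sentence splitter (it breaks on ., ! or ? followed by whitespace), so abbreviations would throw it off. The function name and the sample text are invented for illustration and are not from the manifesto.

```python
import re

def words_per_sentence(text):
    """Return a list with the word count of each sentence, in order."""
    # Naive split: sentence-final punctuation followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    return [len(re.findall(r"[\w’']+", sentence)) for sentence in sentences]

# Invented sample text; paste the foreword here instead.
sample = "Short one. This one is a little bit longer. End."

# Crude text bar chart, one row per sentence.
for index, count in enumerate(words_per_sentence(sample), start=1):
    print(f"{index:2d} {'#' * count} {count}")
```

Plotting the returned list in a spreadsheet or charting tool would give a graph of the same kind.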

In rhetoric, anaphora refers to repeating lexical items, usually at the beginning of a sentence (e.g. I have a dream – Martin Luther King). Isocola are repeated identical or similar syntactic structures and tend to occur near the end of sentences (e.g. veni, vidi, vici – Julius Caesar).

Cola can be di (two), tri (three) or tetra (four).

In the foreword to the 2015 Labour manifesto there are 5 anaphora and 12 isocola. Of the 12 isocola 10 are dicola and 2 tetracola.

The anaphora are – it means, it means, we are a great country, I have heard, this manifesto

Examples of dicola are – great country/great people, great ambitions/great anxieties

The 2 tetracola are –

the countless people/the young people/the dedicated staff/all those who have served our country

your stories/your hopes/your dreams/your frustrations

Of course, my understanding, and hence my coding, of these two rhetorical structures may well be off. Use with caution!

I attach the xml for Labour and will update with Conservative and Liberal Democrat when I get some time.
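Once a text is coded, the tallies can be produced automatically. The sketch below assumes a hypothetical tagging scheme, with an anaphora element and an isocolon element carrying a type attribute. The actual tag names in the attached xml may well differ, and the surrounding sample sentences are invented.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical markup: the tag and attribute names are assumptions, not the real scheme.
SAMPLE = """<foreword>
  <s><anaphora>It means</anaphora> something.</s>
  <s>We have <isocolon type="dicolon">great ambitions / great anxieties</isocolon>.</s>
  <s><isocolon type="tetracolon">your stories / your hopes / your dreams / your frustrations</isocolon></s>
</foreword>"""

def count_devices(xml_text):
    """Tally anaphora and each type of isocolon in a coded text."""
    root = ET.fromstring(xml_text)
    counts = Counter()
    for element in root.iter():
        if element.tag == "anaphora":
            counts["anaphora"] += 1
        elif element.tag == "isocolon":
            # Fall back to a generic label when no type attribute is given.
            counts[element.get("type", "isocolon")] += 1
    return counts

for device, total in count_devices(SAMPLE).items():
    print(device, total)
```

Swapping in the real tag names is the only change needed to run this over a coded foreword.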

Some more notes on AMALGrAM 2.0

Whilst trying to figure out how best to extract multi-word expressions (MWE) from the output files, which, with my poor coding skills, will take a while, I found a workaround.

First use the cut command with the .sst file to extract all the text:

cut -f2 file.pred.sst > file_cut.txt

Then load the resulting text file in AntConc and use the following search in the concordance window to find all the strong multi-word expressions:

“*_*” (that is, asterisk underscore asterisk)

So for example using the Brown M corpus of science fiction texts (with a token count of approx 12000) we get a count of 580 strong multi-word expressions.

of course is the most frequent MWE with 11 counts

had to/have to appears 12 times (6 for each form)

at all appears 5 times

a few/a little appears 8 times (4 per form)
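The AntConc step can also be approximated in Python once the text has been extracted with cut. The sketch below assumes, as in the workaround above, that strong multi-word expressions appear as tokens joined by underscores. The function name and the sample line are invented, and lowercasing the tokens is my own choice so that Of_course and of_course are counted together.

```python
from collections import Counter

def count_strong_mwes(text):
    """Count whitespace-separated tokens that contain an internal underscore."""
    tokens = text.lower().split()
    # strip("_") ignores tokens that are only underscores or merely start/end with one.
    return Counter(token for token in tokens if "_" in token.strip("_"))

# Invented sample; in practice use open("file_cut.txt", encoding="utf-8").read()
sample = "of_course he had_to go . Of_course , a_few did ."

for mwe, frequency in count_strong_mwes(sample).most_common():
    print(mwe, frequency)
```

The sum of the counter's values gives the overall number of strong multi-word expression tokens.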

See also: part 1 notes