mainly the Google Plus Corpus Linguistics Archive

puzzling language teaching

Dogme and manual concordancing

So I was working 1-to-1 with a student on listening to Scottish accents using a recording taken from the International Dialects of English Archive.

It was a simple procedure of listening to extracts of the recording a few times to identify what he could, doing a number of micro-listens, then turning to the transcript to check.

After this we went through the transcript to look at any interesting lexis. My student noticed that the speaker was using a lot of ‘guess’. I had not noticed this as I was still looking for parts related to decoding issues.

Whilst we were having a discussion on why the speaker was using guess and how, it struck me to try a manual #concordancing of the word guess from the text.

The photo is the result. I explained to the student that we can call this a concordance and that it helps us look at how words work – in this case use of pronoun I, co-occurrence with so, but, and positions at start or middle of sentence.
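The same kind of concordance can be roughed out in a few lines of Python. This is only a sketch: the sample text is invented for the example (it is not the IDEA transcript), and the function name kwic and the context width are my own assumptions.

```python
import re

def kwic(text, keyword, width=30):
    """Return keyword-in-context lines for every whole-word match of keyword."""
    lines = []
    # \b stops 'guess' matching inside 'guessed'; IGNORECASE also catches 'Guess'
    pattern = r"\b" + re.escape(keyword) + r"\b"
    for m in re.finditer(pattern, text, flags=re.IGNORECASE):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        # right-align the left context so the keyword lines up in a column
        lines.append(f"{left:>{width}} [{m.group(0)}] {right}")
    return lines

# Invented sample text, not the real transcript
sample = ("I guess I was about ten then. So I guess we moved after that, "
          "but I guess it was hard at first.")

for line in kwic(sample, "guess"):
    print(line)
```

Lining the keyword up in a column is what makes the patterns to its left (I, so, but) easy to spot, just as on the board.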

I also took the opportunity to mention as a way to see more uses of concordances.

So I claim this as a use of #dogme concordancing. 🙂

For an interesting account of blackboard concordancing or hand concordancing, have a read of this report of two Spanish teachers in France – Corpus work with ordinary teachers: Data-driven learning activities by Henry Tyne.

An idea: use very short scenes from films as a way to #teach #corpus use #literacy.

A #dialogue from the latest Hobbit movie features the word jiffy. Ask students to listen to the scene and transcribe the dialogue. After some listens they may get the following:

Thorin: Balin, can you still mix a flash flame?

Balin: Aye, it’ll only take a jiffy, c’mon

Dwalin: We don’t have a jiffy.

Ask students what they think jiffy means.

Then ask them to go to #COCA and enter the following search:

it 'll only take a *

They should get a list of results.

Ask them what kind of word they see in the list, and ask them to click on few and couple to see what follows them.

Then see if it holds true for the following search:

it will only take a *

Additionally, tell students that:

  1. an asterisk represents any single word (ask them to add another asterisk to the first search term);
  2. punctuation marks need to be separated by a space when they are in the search term;
  3. to combine both 'll and will in one search, they can use the pipe character.
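To make the meaning of the asterisk and the pipe concrete, here is a rough Python sketch that applies the same idea to a short list of tokens. It is not how COCA works internally, and the combined query form with the pipe is my assumption about what students would type; the function name find_matches and the sample sentences are invented.

```python
def find_matches(tokens, query):
    """Return every token sequence in tokens that matches the query.

    In the query, '*' stands for any single token and 'a|b' for either a or b.
    """
    slots = [part.split("|") for part in query.split()]
    hits = []
    for i in range(len(tokens) - len(slots) + 1):
        window = tokens[i:i + len(slots)]
        if all(s == ["*"] or w.lower() in s for s, w in zip(slots, window)):
            hits.append(" ".join(window))
    return hits

# Invented sentences, already split into tokens; note that 'll is a token
# of its own, which is why the search term needs the space before it
tokens = "it 'll only take a jiffy . it will only take a few minutes .".split()

print(find_matches(tokens, "it 'll|will only take a *"))
```

Adding a second asterisk to the query simply adds one more slot that any token can fill.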



Using Sketch Engine with BAWE

This tutorial shows you how to use the #BAWE #corpus on the Sketch Engine Open corpora site.

The last section, on regex using corpus query language, is very useful.

I found that one can use the BAWE as a learner corpus. With a CQL search you need to use this command:


So I searched for uses of the word important by French users, like this:

[lemma="important"] within

I got 117 results, out of which I spotted 5 “incorrect” uses of important:


a) Also, Kathmandu being known as the most important gambling city of the Indian subcontinent

b) entry in Nepal are low, which creates an important threat of new entrants.

c) There seem to be an important British demand for tourism, since in 2002

d) or Caribbean immigrants who actually form important communities in the capital.

e) Nepalese technological environment present important deficiencies.


Looking for #authentic audio with pauses, hesitations, ums and errs? 

Looking for #authentic audio with pauses, hesitations, ums and errs? (e.g. read this rationale)

Want to help improve the #Lancaster interface to the #BNC audio #corpus ?

Sign up here. If you don’t get your automatic email authorization, contact Sebastian Hoffmann (hoffmann at uni-trier dot de). Once signed up, head on over to

The 1st screenshot shows a keyword analysis comparing spoken BNC to whole BNC (spoken+written). Check those ers and erms 🙂

The second screenshot shows an example of searching for the word contract. The transcription is in pink at the top, then you see audio player and below the audio player is where you can give feedback on how aligned the audio is with the transcript.

Making a graded reader corpus

A common problem with using #corpora like #COCA is the number of surrounding words that can cause problems for students; of course, one way around that is to choose the concordances carefully and/or modify them.

Another route is to use a #gradedreader #corpus; the lextutor site provides some graded readers here

They total about 1.3 million words, so not a shabby corpus size. 🙂

The lextutor graded reader interface doesn’t allow wildcard searches (though you could use the main lextutor concordancer); downloading the files and using #antconc will give you more options and control.

You can of course copy paste/download each file by hand but a much quicker way is to use the wget command (this is standard on most Linux systems; for OSX and Windows it is available as a separate download).

In a terminal, first build a file of the URLs from the graded reader site. After typing the command below, paste in the URLs one per line and press Ctrl-D to finish:

cat > listofurls.txt

Then run:

wget -i listofurls.txt
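If wget is not to hand, roughly the same job can be sketched in Python. This is an assumption-laden alternative rather than what I actually ran: the function names read_url_list and download_all and the folder name graded_readers are made up, and it does none of the retrying that wget does.

```python
import os
import urllib.request
from urllib.parse import urlparse

def read_url_list(path):
    """Read one URL per line, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]

def download_all(urls, dest_dir=".", fetch=urllib.request.urlretrieve):
    """Save each URL into dest_dir under the last part of its path."""
    os.makedirs(dest_dir, exist_ok=True)
    saved = []
    for url in urls:
        # fall back to a generic name if the URL ends with a slash
        name = os.path.basename(urlparse(url).path) or "index.html"
        target = os.path.join(dest_dir, name)
        fetch(url, target)
        saved.append(target)
    return saved
```

It would be called with something like download_all(read_url_list("listofurls.txt"), "graded_readers").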

If you are not familiar/comfortable with running command lines let me know to get a copy of the texts.

So for example I was looking for sentences with the word contract (for my TOEIC class).

As you can see from the 1st screenshot the word contract comes from a level 5 text which is B2 or upper intermediate; this still produces a relatively “simple” sentence with some nice collocations e.g. In our business, contracts are made and cancelled routinely.

With less specialized items you can get a wider range of levels e.g. advice (the 2nd screenshot):

a level 3 text gives – She always gave us good advice.

A graded reader corpus is also a way for students to get into concordances; for some info on whether a graded corpus is ‘authentic’ see “Can a graded reader corpus provide ‘authentic’ input?” by Rachel Allan (2008)

Related – COCA-BYU’s Fiction-Juvenile section

Corpus-based exercise formats

In Using Corpora to Help Teach Difficult-to-distinguish English Words (2013) by Dilin Liu, section 3 gives a nice format for #teachers to present #corpus derived (from #COCA) exercises.

Ex1 is a fill-in-the-gap type, Ex2 asks for a replacement that keeps the original sentence meaning, and Ex3 is an error-detection type with a translation of the corrected sentence.

See the following, which is extracted from exercises on differentiating the synonyms incorrectly and wrongly from pages 37 and 38:

(1) Exercise 1: Decide whether incorrectly or wrongly fills in each blank better semantically; write either if you believe either adverb works equally well.

a. The United Nations TV had ——— identified Mr. Smith as the ambassador (either)

(2) Exercise 2: Decide whether incorrectly or wrongly or either adverb may replace each underlined word while keeping the original meaning and tone of the sentence.

a. It has been found that some of the prisoners were unjustly convicted. (wrongly)

(3) Exercise 3. Some of the underlined uses of incorrectly and wrongly are inappropriate. Identify and correct them. Then translate the sentences into Korean.

a. I incorrectly identified Brit Hume as being located at the White House right now. (correct)

Some of the sentences have been adapted e.g.:

sentence clauses into full sentences: The tape was acquired from United Nations TV which had incorrectly identified the ambassador —> The United Nations TV had incorrectly/wrongly identified Mr. Smith as the ambassador;

filling in ellipsis: David Hilliard’s title –> David Hilliard’s job title.

more or less wholesale rewrites: Some of the prisoners USA TODAY contacted — and their lawyers — were stunned to find out that they were imprisoned for something that turned out not to be a federal crime —> It has been found that some of the prisoners were unjustly convicted.

The article does not mention whether sentences were changed to suit student level and/or readability.

Wrangling regular expressions and commandeering the command line

At some point when dealing with texts you will have to use the #commandline and #regularexpressions as they make life much easier, although of course there is a steep learning curve.

Needs must, and I recently bit the bullet and wrote my first #regex commands.

I am working with text from The Setup – a series of interviews which poses 4 simple questions about the equipment setup that people use in their work.

Now this website is great not only for the interesting interviews but for the fact that since the project is hosted on github one can very easily have a copy of all the interviews (i.e. you need a github account, then you can copy (fork) the project and download files).

Once you have the text files they need a little cleaning since the words which are linked are given in square brackets and the http link is put into round brackets e.g. :

I’m [Alex Payne]( “Alex’s website.”). I go by [al3x]( “Alex’s Twitter account.”) around the Interwho. I work at [Twitter]( “Micro-blogging FTW.”) in San Francisco as their [API]( “The Twitter API Wiki.”) Lead.

I want to remove only the square brackets [ ] and everything between the round brackets, including the round brackets.

A command for removing only the square brackets using the #sed command is:

sed -i.bak 's/[][]//g' file.txt

The -i.bak is really two instructions: the -i means edit in place, i.e. replace the original file with the output of the sed program (every line is written back, whether or not it matched), and the .bak means keep a backup of the original file with that extension.

The 's/[][]//g' is the interesting part as this contains the regular expression.

The normal structure of the substitute sed s command is

s/pattern/replacement/flags

In this case the characters I want to match are [ and ]. Since [ already has a special meaning (e.g. [abc] will match a or b or c), you can put the closing square bracket first and then the opening square bracket inside the outer pair, i.e. [][] will match ] or [, which is the same as matching [ or ].

The replacement is simply empty. The g means replace all occurrences of the regex pattern on a line, not just the first.
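For comparison, the same substitution can be tried out in Python before touching any files. This is just a sketch: the sample line is modelled on the interview text and the URL in it is a placeholder I made up. I have escaped both brackets inside the class here, which does the same job as the bracket-ordering trick and is perhaps easier to read.

```python
import re

# Sample line modelled on the interview text; the URL is a placeholder
line = 'I work at [Twitter](http://example.com/ "Micro-blogging FTW.") in San Francisco.'

# [\[\]] is a character class containing just the two square brackets,
# escaped so there is no doubt about where the class ends
no_square = re.sub(r"[\[\]]", "", line)
print(no_square)
```

Like the g flag in sed, re.sub replaces every match in the string unless told otherwise.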

Now I want to delete the round brackets and everything in between them. The command is:

sed -i.bak2 's/([^()]*)//g' file.txt

Notice I called the new backup bak2 so I can distinguish easily between the results of the two commands. It is possible to combine commands but I have not figured that out yet!

The regex looks for the opening bracket (.

Then comes anything that is not an opening bracket or closing bracket; the caret ^ at the start of the square brackets means not any of the following characters, i.e. not ( or ).

The star means match zero or more of such characters which are not an opening bracket or closing bracket.

Then finally match a closing bracket ).
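The same pattern can be checked in Python, with one difference worth knowing: in Python a bare ( starts a group, so the literal round brackets have to be escaped as \( and \), whereas in sed's basic regex they are literal as they stand. The sketch below uses an invented sample line with a placeholder URL, and the last pattern is only one possible way of doing both clean-ups in a single pass.

```python
import re

# Sample line modelled on the interview text; the URL is a placeholder
line = 'I work at [Twitter](http://example.com/ "Micro-blogging FTW.") in San Francisco.'

# \( and \) are literal round brackets; [^()]* is any run of characters
# that are not round brackets
no_round = re.sub(r"\([^()]*\)", "", line)
print(no_round)

# Both clean-ups at once: keep the text captured inside [ ] and drop the ( ) part.
# Unlike the two separate steps, this only touches [text](link) pairs.
both = re.sub(r"\[([^\]]*)\]\([^()]*\)", r"\1", line)
print(both)
```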

Not sure how useful my particular example will be to people here but I wanted to highlight the fact that understanding regular expressions and using the command line saves a lot of time.

For example at the time of writing this there are 395 interview files; imagine having to open them up in a text editor and do find and replace! In the command line it is a simple case of using a unix wildcard (the star) on the file names, in my case 2*.txt as the files are labelled year-month-day-nameofinterviewee.txt
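The wildcard idea carries over to Python as well. Below is a minimal sketch, assuming the files sit in the current folder and are UTF-8 text; the function name clean_files is invented, and the backup copies imitate what -i.bak does in sed.

```python
import glob
import re
import shutil

def clean_files(pattern):
    """Strip link markup from every file matching pattern, keeping .bak copies."""
    cleaned = []
    for path in sorted(glob.glob(pattern)):
        shutil.copyfile(path, path + ".bak")    # backup first, like sed's -i.bak
        with open(path, encoding="utf-8") as f:
            text = f.read()
        text = re.sub(r"\([^()]*\)", "", text)  # round brackets and their contents
        text = re.sub(r"[\[\]]", "", text)      # the square brackets themselves
        with open(path, "w", encoding="utf-8") as f:
            f.write(text)
        cleaned.append(path)
    return cleaned
```

Calling clean_files("2*.txt") would then work through every interview file in one go.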

There are numerous resources on using regex and sed command online; I learnt a bit because I had the need not to go through 395 text files!

Thanks for reading.