mainly the Google Plus Corpus Linguistics Archive

puzzling language teaching

Exploring audio clip tools using the BNC spoken corpus

There are now a number of audio corpus interfaces available that allow you to get examples of spoken English. These range from movie and TV show clips, such as Playphrase and Yarn (h/t Sandy Millin @sandymillin), to YouTube videos, such as Youglish and Divii.

A question arises as to what search terms to use, i.e. what are frequent terms that we could use in class? To get a list of frequent terms, one can use the Spokes interface to the Spoken BNC (British National Corpus).

It has a feature that lists all the common formulae such as the following top 3:

thank you very much

i don’t know

isn’t it

One now has a nice base from which to explore the tools mentioned at the start. Of course, one must bear in mind that the BNC is dated, so current conversational English will not be well represented.

Thanks to Cara Leopold @eltinfrance for prompting this note.

BYU-COCA Corpus Query – prepositions of place

I had a class recently where a question arose as to whether it is ‘I’m on the street’ or ‘I’m in the street’.

How would you look for relevant info on this using BYU corpora tools? What functions and what search queries would you make?

Do post any search results.

I’ll post my findings in a week or so.


hey all somewhat forgot about this : )

so yes I did initially what Marc suggests (see comment below), then I looked at comparing US use with UK use, i.e. comparing COCA with BNC using this search term – [p*] [be] * the street

although overall frequencies are low, it is clear that UK English has significant uses of ‘in the street’ while US English has significant uses of ‘on the street’
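since the two corpora are very different sizes, raw hit counts need normalising before you compare them. Here is a minimal Python sketch of the usual per-million-words calculation. The function name, the corpus labels, the hit counts and the corpus sizes are all placeholders I made up for illustration; they are not my actual search results, so plug in whatever the interface reports.

```python
def per_million(hits: int, corpus_size: int) -> float:
    """Normalised frequency: hits per million words."""
    if corpus_size <= 0:
        raise ValueError("corpus_size must be positive")
    return hits / corpus_size * 1_000_000


# Placeholder numbers for illustration only (NOT real search results).
example_counts = {
    "corpus_a": {"hits": 30, "size": 100_000_000},
    "corpus_b": {"hits": 90, "size": 450_000_000},
}

for name, c in example_counts.items():
    # Round for display; the raw value is what you would compare.
    print(name, round(per_million(c["hits"], c["size"]), 2))
```

with counts this low a per-million figure is only a rough guide, so it is worth reading the concordance lines as well.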

added to the (speculative) fact that the video maker [see previous comment about context] is probably more used to UK English, the issue that the video flags ‘on the street’ as an error is somewhat more understandable

here is the video in question – []


Between genres: science journalism and science research

Julie Moore has a nice post on the use of science journalism articles in EAP settings and how important it is to make students aware that these types of texts are very different to journal research articles.

She points out that part of the appeal of magazine articles on science is that abstracts in journals “are incredibly densely packed and require a certain degree of skill to decode.”

The PLOS (Public Library of Science) website asks authors to write an author summary which is “Distinct from the scientific abstract, the Author Summary is included in the article to make findings accessible to an audience of both scientists and non-scientists.”

This presents a possible halfway house for EAP students. The PLOS abstracts are restricted to mainly biology and medical domains and not all papers have author summaries.

One could simply copy and paste abstracts and author summaries from the web pages. Or one could semi-automate this.

There is a nice scraper called quickscrape which allows you to download articles from various journals. Follow the instructions on the GitHub site to set it up and to understand the quickscrape commands. The configuration for PLOS journals can be modified so that you download only the abstracts.

Modify the journal-scrapers/scrapers/plos.json file like so:


{
  "url": "plos.*\\.org",
  "elements": {
    "abstract_html": {
      "selector": "(//div[contains(@class,'abstract')])[2]"
    }
  }
}

The number in square brackets at the end of the selector picks which matching block is downloaded: 2 gives the author summary, and changing it to 1 gives the original abstract.
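To see what that number is doing, here is a rough Python sketch (not part of quickscrape) that mimics the selector: it collects every div whose class contains "abstract", in document order, and returns the n-th one, counting from 1 as XPath does. The class and function names are my own, and the sample HTML is made up for illustration; real PLOS pages will have different markup.

```python
from html.parser import HTMLParser


class AbstractDivCollector(HTMLParser):
    """Collect the text of every <div> whose class contains 'abstract'."""

    def __init__(self):
        super().__init__()
        self.blocks = []  # text of each matching div, in document order
        self._open = []   # one entry per open div: index into blocks, or None

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        css_class = dict(attrs).get("class") or ""
        if "abstract" in css_class:  # substring match, like contains()
            self.blocks.append("")
            self._open.append(len(self.blocks) - 1)
        else:
            self._open.append(None)

    def handle_endtag(self, tag):
        if tag == "div" and self._open:
            self._open.pop()

    def handle_data(self, data):
        # Add the text to every matching div that is currently open.
        for idx in self._open:
            if idx is not None:
                self.blocks[idx] += data + " "


def nth_abstract(html: str, n: int) -> str:
    """Return the text of the n-th matching div (1-based, like XPath)."""
    parser = AbstractDivCollector()
    parser.feed(html)
    parser.close()
    return " ".join(parser.blocks[n - 1].split())


# Made-up sample markup, only to show the effect of the index.
SAMPLE = """
<div class="abstract toc-section"><h2>Abstract</h2><p>Dense technical abstract.</p></div>
<div class="abstract toc-section"><h2>Author Summary</h2><p>Plain-language summary.</p></div>
"""

print(nth_abstract(SAMPLE, 2))
```

Calling nth_abstract(SAMPLE, 1) instead returns the first block, which is the same switch as changing [2] to [1] in the config.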

The journal server seems to limit you if you send too many requests in a short time, so be wary of that.
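If you script the downloads yourself, one simple precaution is to pause between requests. The following is a generic Python sketch, not a quickscrape feature: fetch_politely, its parameters and the 20-second default are my own assumptions, and fetch stands for whatever function actually downloads a page.

```python
import time
from typing import Callable, Iterable, List


def fetch_politely(
    urls: Iterable[str],
    fetch: Callable[[str], str],
    delay_seconds: float = 20.0,  # assumed value; adjust to what the server tolerates
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Call fetch(url) for each URL, pausing between consecutive requests."""
    results = []
    for i, url in enumerate(urls):
        if i > 0:
            sleep(delay_seconds)  # wait before every request except the first
        results.append(fetch(url))
    return results


if __name__ == "__main__":
    # Stand-in fetch function so the example runs without touching any server.
    demo = fetch_politely(
        ["url-1", "url-2"],
        fetch=lambda u: f"<html for {u}>",
        delay_seconds=0.1,
    )
    print(demo)
```

The sleep parameter is only there so the pause can be swapped out when trying the function; in normal use you would leave it alone.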

Here are the files for 10 abstracts and 10 author summaries []

SKELL examples for a review quiz

Everyday corpus use

I wanted to do a quick review quiz for a class that has to learn a set of education-related vocabulary such as take an exam, internships, syllabus etc.

A gap fill is one way to do such a quiz, and to find good, authentic examples we can use SKELL.

SKELL (Sketch Engine for Language Learning) allows you to look for sentences based on the search word as well as common grammatical collocations and similar words.

Notably it uses an automatic procedure where “good” examples are listed, i.e.

filtering out effectively all sentences with special terminology, typos and rare words (rare names). By default, short sentences are preferred, sentences containing inappropriate or spam words are scored lower

So in about 30 mins I was able to prepare 10 questions for the quiz, nice 🙂
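For anyone who would rather not blank out the target words by hand, here is a minimal Python sketch of the gap-fill step. The function name make_gap is my own, and the two sentences are ones I invented for illustration, not actual SKELL output; you would paste in the sentences you picked from SKELL.

```python
import re


def make_gap(sentence: str, target: str, blank: str = "_____") -> str:
    """Replace the first whole-word occurrence of target with a blank."""
    pattern = re.compile(r"\b" + re.escape(target) + r"\b", re.IGNORECASE)
    gapped, n = pattern.subn(blank, sentence, count=1)
    if n == 0:
        raise ValueError(f"{target!r} not found in sentence")
    return gapped


# Invented example sentences (NOT taken from SKELL).
items = [
    ("Students must take an exam at the end of the course.", "take an exam"),
    ("The syllabus is available online.", "syllabus"),
]

for number, (sentence, target) in enumerate(items, start=1):
    print(f"{number}. {make_gap(sentence, target)}")
```

Note that this only matches the exact form you give it, so an inflected form such as "took an exam" needs its own target string.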

Jeremy Corbyn phrasal verbs

so just putting this out there in the possibility that some may find it of use

it is a list of all the phrasal verbs in Jeremy Corbyn’s final rally speech that took place on the Friday before he was announced winner of the Labour leadership contest on the Saturday

i may well have missed some verbs 🙂

the PHaVE list refers to

also some more CL-related Labour leadership material, a venncloud of speeches by the candidates –

Spoken Corpora

Publicly available spoken corpora are rarer than the rarest thing you can think of.

A couple of people have been enquiring, and my latest look has turned up the ORTOLANG Speech and Language Data Repository (SLDR/ORTOLANG), hat tip Yutaka Ishii @yishii_0207.

In particular there is the Open ANC, which includes the Charlotte Narrative corpus.

This may be of use to people looking for a ‘conversation’ corpus and looking to do ‘backchannel’ language research. (those were the two queries recently)

Business idioms and business corpora

Here’s a little exercise some might be interested in doing together.

People often post lists of vocabulary without any explicit reference to corpora, for example this recent list of business idioms.

How many in this list can be found in, say, these two business corpora: the 1Million Business Letter Corpus and the 2Million Business English Corpus (use test for both username & password to access this corpus)? Of course both are from the written genre, so bear this in mind.

Across the board is a promising start with 3 instances in the Business Letter Corpus and 4 instances in the Business English Corpus.

Ahead of the curve produces no instances in either corpus.
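If you happen to have corpus text saved locally (an assumption, since the two corpora above are searched through their web interfaces), the checking can be sketched in Python along these lines. The function name count_phrases and the sample text are made up for illustration; the counts it prints are for the made-up text, not for either corpus.

```python
import re
from typing import Dict, Iterable


def count_phrases(text: str, phrases: Iterable[str]) -> Dict[str, int]:
    """Count case-insensitive whole-phrase matches of each phrase in text."""
    normalised = " ".join(text.split())  # collapse line breaks and runs of spaces
    counts = {}
    for phrase in phrases:
        pattern = re.compile(r"\b" + re.escape(phrase) + r"\b", re.IGNORECASE)
        counts[phrase] = len(pattern.findall(normalised))
    return counts


# Made-up sample text standing in for a corpus file.
sample_text = """We apply the discount across the board.
Prices rose across
the board last year."""

idioms = ["across the board", "ahead of the curve"]

for idiom, count in count_phrases(sample_text, idioms).items():
    print(f"{idiom}: {count}")
```

Like any exact-string search this misses variants (for example an idiom with a word inserted in the middle), so a zero here is a prompt to look again rather than proof of absence.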

to be continued… by you? 🙂