If you use BootCaT, here is a command to help you split the collected corpus into individual files, using the CURRENT URL line as a separator in a regex:

awk '/CURRENT URL/{g++} {print $0 > (g ".txt")}' corpus.txt

When copy-pasting this command into your command line, check that the apostrophes and quotation marks are straight (' and ") and not curly (‘ ’ and “ ”), otherwise awk will not parse the command.
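
If awk is not available (on a Windows machine, say), the same split can be sketched in Python. This is a minimal sketch, not a BootCaT feature: the function names split_corpus and write_chunks are my own, and it assumes every page in the corpus starts with a line containing CURRENT URL. Unlike the awk command, any text before the first CURRENT URL line simply becomes the first file.

```python
from pathlib import Path


def split_corpus(text, marker="CURRENT URL"):
    """Split corpus text into chunks, starting a new chunk at each marker line."""
    chunks = []
    current = []
    for line in text.splitlines(keepends=True):
        # A line containing the marker opens a new chunk, like the awk pattern.
        if marker in line and current:
            chunks.append("".join(current))
            current = []
        current.append(line)
    if current:
        chunks.append("".join(current))
    return chunks


def write_chunks(chunks, out_dir):
    """Write each chunk to a numbered file (1.txt, 2.txt, ...) inside out_dir."""
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    for number, chunk in enumerate(chunks, start=1):
        (out_path / f"{number}.txt").write_text(chunk, encoding="utf-8")
```

To use it, read corpus.txt into a string with Path("corpus.txt").read_text(encoding="utf-8"), pass that to split_corpus, and hand the result to write_chunks together with an output folder name of your choice.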


Exploring audio clip tools using the BNC spoken corpus

There are now a number of audio corpus interfaces available that allow you to get examples of spoken English. These range from movies and TV shows, such as Playphrase and Yarn (h/t Sandy Millin @sandymillin), to YouTube videos, such as Youglish and Divii.

This raises the question of which search terms to use, i.e. which frequent terms could we use in class? To get a list of frequent terms, one can use the Spokes interface to the Spoken BNC (British National Corpus).

It has a feature that lists all the common formulae such as the following top 3:

thank you very much

i don’t know

isn’t it

One now has a nice base from which to explore the tools mentioned at the start. Of course, one must bear in mind that the BNC is dated, so current conversational English will not be well represented.

Thanks to Cara Leopold @eltinfrance for prompting this note.

BYU-COCA Corpus Query – prepositions of place

I had a class recently where a question arose as to whether it is “I’m on the street” or “I’m in the street”.

How would you look for relevant info on this using BYU corpora tools? What functions and what search queries would you make?

Do post any search results.

I’ll post my findings in a week or so.


hey all somewhat forgot about this : )

so yes I did initially what Marc suggests (see comment below), then I looked at comparing US use with UK use, i.e. comparing COCA with BNC, using this search term – [p*] [be] * the street

although overall frequencies are low, it is clear that UK English has significant uses of “in the street” while US English has significant uses of “on the street”

add to this the (speculative) point that the video maker [see previous comment about context] is probably more used to UK English, and the fact that the video flags “on the street” as an error becomes somewhat more understandable

here is the video in question –
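
For anyone who wants to try something similar on their own plain-text files, the search term [p*] [be] * the street can be roughly approximated with a regular expression. This is only a sketch under loud assumptions: the pronoun and verb lists below are hand-picked stand-ins for the [p*] and [be] parts of the query, whereas the BYU interface works from part-of-speech tags and lemmas, so the results will not match it exactly. The function name count_slot_words is my own.

```python
import re
from collections import Counter

# Assumption: short hand-picked lists standing in for [p*] (pronoun) and
# [be] (any form of "be"); a tagged corpus would not need these.
PRONOUNS = r"(?:i|you|he|she|it|we|they)"
BE_FORMS = r"(?:\s+(?:am|is|are|was|were)|'(?:m|s|re))"

# The captured group is the wildcard slot (*) just before "the street".
PATTERN = re.compile(
    rf"\b{PRONOUNS}{BE_FORMS}\s+(\w+)\s+the\s+street\b",
    re.IGNORECASE,
)


def count_slot_words(text):
    """Count the words that fill the wildcard slot before 'the street'."""
    # Normalise curly apostrophes so that I’m and I'm are treated alike.
    text = text.replace("\u2019", "'")
    return Counter(match.group(1).lower() for match in PATTERN.finditer(text))
```

Running count_slot_words over a US text collection and a UK one, then comparing the counts for "in" and "on", gives a rough local version of the COCA versus BNC comparison. Raw counts from corpora of different sizes should be normalised (per million words, for example) before comparing them.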


Between genres: science journalism and science research

Julie Moore has a nice post on the use of science journalism articles in EAP settings and on how important it is to make students aware that these types of texts are very different to journal research articles.

She points out that part of the appeal of magazine articles on science is that abstracts in journals “are incredibly densely packed and require a certain degree of skill to decode.”

The PLOS (Public Library of Science) website asks authors to write an author summary which is “Distinct from the scientific abstract, the Author Summary is included in the article to make findings accessible to an audience of both scientists and non-scientists.”

This presents a possible halfway house for EAP students. Note that PLOS journals are restricted mainly to the biology and medical domains, and that not all papers have author summaries.

One could simply copy and paste abstracts and author summaries from the web pages, or one could semi-automate this.

There is a nice scraper called quickscrape which allows you to download articles from various journals. Follow the instructions on the GitHub site to set it up and to understand the quickscrape commands. The configuration for PLOS journals can be modified so that you only download the abstracts.

Modify the journal-scrapers/scrapers/plos.json file like so:


{
  "url": "plos.*\\.org",
  "elements": {
    "abstract_html": {
      "selector": "(//div[contains(@class,'abstract')])[2]"
    }
  }
}

The number in square brackets at the end of the selector picks out which matching element to download: with 2 you get just the author summary; to download the original abstract instead, change the number to 1.
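
To see what that number does, here is a small Python sketch of the same selection logic using only the standard library. The function name nth_abstract_div and the sample markup in the note below are my own illustrations, not the actual PLOS page structure, and the sketch assumes well-formed markup (real web pages often are not, which is one reason to let quickscrape do the work).

```python
import xml.etree.ElementTree as ET


def nth_abstract_div(xhtml, n):
    """Return the text of the n-th (1-based) div whose class contains 'abstract'."""
    root = ET.fromstring(xhtml)
    # ElementTree has no XPath contains(), so filter on the class attribute by hand.
    matches = [
        div for div in root.iter("div")
        if "abstract" in div.get("class", "")
    ]
    # XPath positions start at 1, so [2] in the selector is Python index 1.
    return "".join(matches[n - 1].itertext()).strip()
```

Given a page with two div elements whose class contains 'abstract', nth_abstract_div(page, 1) returns the text of the first and nth_abstract_div(page, 2) the text of the second, which mirrors switching between 1 and 2 in the config.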

The journal server seems to limit requests if you hit it too often, so be wary of that.

Here are the files for 10 abstracts and 10 author summaries.

Corpusmooc 2015 round 3 Gems 1

There have been some interesting links shared by #corpusmoocers which members may find of interest. Note this is just a sample of what I find interesting; at the close of the MOOC I will put up the full list I am collecting.

First up is a learner’s language corpus of Japanese that was shared by Amelia Joulain-Jay –

Next is a source of management language texts shared by Arthur McKeown –

Lastly, an analysis by Elena Semino of the book The Curious Incident of the Dog in the Night-Time –

Stay tuned for more corpusmooc gems.

See also: A SKELL intro