How can I use corpora to improve my academic writing in English?

A quick, straightforward intro to using corpora to help with academic writing. For a more comprehensive take, see member Monika Sobejko’s post Teaching writing with the aid of COCA []

Fiction genre and learner collocational knowledge

In both COCA and the BNC, the Fiction genre best predicts (i.e. has the highest correlation with) learner collocation knowledge (Durrant, 2014). This can help when selecting collocations to use with lower proficiency students. For higher proficiency students, filter that list by mutual information (MI) score in COCA, since learners are known to be weak on the low-frequency words that MI scores pick out. Alternatively, use the KWIC function of COCA to spot idiomatic uses that higher proficiency learners may want to learn.

For example, searching for collocations of all uses of get in the Fiction section of COCA shows that highly frequent collocations include get out (of) + location and get up. These may be appropriate for low proficiency learners.

Filtering by MI score brings up gotta, the contraction of (have) got to, and get rid (of). Looking more closely with a KWIC search in COCA, we see uses of the idiom get out of hand. These may be appropriate for higher proficiency students to learn.
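To see why an MI filter tends to favour collocates that are rare words overall, here is a minimal sketch of the textbook pointwise MI calculation. All counts below are made up for illustration and are not real COCA figures; the function name is my own, and COCA's own MI formula also takes the search span into account, which this sketch leaves out.

```python
import math

def mutual_information(pair_freq, node_freq, collocate_freq, corpus_size):
    """Pointwise MI: log2 of observed co-occurrences over those expected by chance."""
    expected = node_freq * collocate_freq / corpus_size
    return math.log2(pair_freq / expected)

# Hypothetical counts (NOT taken from COCA), chosen only to show the effect.
CORPUS_SIZE = 1_000_000
NODE_FREQ = 10_000  # frequency of the node word

# A very frequent collocate that co-occurs with the node 800 times.
mi_frequent = mutual_information(800, NODE_FREQ, 20_000, CORPUS_SIZE)

# A rare collocate that co-occurs with the node only 80 times,
# but almost never appears without it.
mi_rare = mutual_information(80, NODE_FREQ, 100, CORPUS_SIZE)

print(f"frequent collocate: MI = {mi_frequent:.2f}")  # 2.00
print(f"rare collocate:     MI = {mi_rare:.2f}")      # 6.32
```

The rare collocate gets the higher score even though the pair is ten times less frequent, which is the behaviour that makes MI useful for picking out items that raw frequency lists bury.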

Durrant, P. (2014). Corpus frequency and second language learners’ knowledge of collocations: A meta-analysis. International Journal of Corpus Linguistics, 19, 443–477.

BYU-COCA Corpus Query – on the one hand

Here’s another challenge for you. What happens if you search for the term on the one hand and compare the Spoken section in BYU-COCA with each of the other 4 sections (Fiction, Magazine, Newspaper, Academic)?

1. What do you predict given you know that on the one hand is often associated with explanations?

Now do the same with on the other hand.

The above query came about because I noticed that in speech people tended to use both halves of the pair, on the one hand and on the other hand, whereas in written online texts I noticed only the second half, on the other hand, being used.

See comments below.


BYU-COCA Corpus Query – Prepositions of place []

BYU Corpora Digs 1 – Rock up []

Is this the longest term in English language teaching?

Check out this regex (regular expression) for searching any CLAWS7-tagged corpus for the present perfect: “It accounts for both contracted and full forms and also allows a number of intervening words”[]
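To give a flavour of how such a search can work, here is a rough sketch in Python. It is not the regex from the linked post. It assumes the corpus is plain text in word_TAG format, uses the CLAWS7 tags VH0 and VHZ for present-tense have (which also covers contracted 've and 's when they are tagged that way), and allows up to three intervening adverbs or negators before a past participle. The function name and the limit of three are my own choices.

```python
import re

# Present-tense HAVE, then 0-3 adverbs (R...) or negators (XX),
# then a past participle (VVN, VBN, VDN or VHN).
PRESENT_PERFECT = re.compile(
    r"\S+_VH[0Z]\b"
    r"(?:\s+\S+_(?:XX|R[A-Z0-9]*)\b){0,3}"
    r"\s+\S+_V[VBDH]N\b"
)

def find_present_perfect(tagged_text):
    """Return every present perfect sequence found in word_TAG text."""
    return [m.group(0) for m in PRESENT_PERFECT.finditer(tagged_text)]

# Hand-tagged sample for illustration only.
sample = (
    "I_PPIS1 've_VH0 never_RR seen_VVN it_PPH1 ._. "
    "She_PPHS1 has_VHZ not_XX been_VBN here_RL ._. "
    "He_PPHS1 had_VHD gone_VVN ._."
)

for hit in find_present_perfect(sample):
    print(hit)
```

The past perfect in the last sentence is skipped because had carries the VHD tag. Questions such as Have you seen it? are missed by this sketch, since the intervening slot does not allow pronouns.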

Corpusmooc 2016 round 4 aids

Here I will occasionally put up helpers/explainers/aids related to queries people have had on the #corpusmooc.

For example there were a number of questions about collocation statistics so I knocked up a chart based on a video and one of the readings – Identifying Collocations []

I did some similar charts for previous corpusmoocs, which you can look up in the corpusmooc section.

Another set of queries was about distinguishing collocations from colligations (see the previous corpusmooc for a related chart). This time round I found a nice example from John Sinclair:

For example, the English letter string second can be claimed to have two primary meanings based on results of a collocation analysis: (i) ‘next to the first’ when it is used together with words such as the, world, war, year, child and wife, and (ii) ‘a unit of time’ when it is preceded by words such as per, radians and cycles. In addition to being attracted to different lexical contexts, the two primary meanings of second also prefer different grammatical contexts: second as ‘next to the first’ is often a part of a definite noun phrase, while second as ‘a unit of time’ can be usually found in an indefinite noun phrase (Sinclair 1991: 107). Hence, the collocational preferences of the two senses correlate with the colligational preferences.

which was cited in the following pdf – []

Hope these are of use : )

CorpusMooc 2016 round 4 Blog Posts

CorpusMooc 2016 has kicked off, so it is time to collect participants’ blog posts : )

First one on the radar is by Marc Jones

#CorpusMOOC Week 1 notes


Marc asks whether certain frequent language items are acquired later rather than earlier. No doubt some are, and the question is not easy to answer, but it raises an important point: we can’t depend solely on frequency when deciding what to teach based on corpus descriptions.

Michael Brown describes his experience of Week 1 of #corpusmooc – []

Vedrana Vojkovic Estatiev, who did the mooc in 2014, reflects on changes []

Marc Jones gets down to week 2 musings []

Michael Brown reports his week 2 experiences []

Michael Brown talks about week 3 []

Michael Brown gives his lowdown on weeks 4 and 5 []

Michael Brown reports on weeks 6 and 7 []

Final blog by Michael Brown on week 8 []

If you look in the #corpusmooc section you can read participants’ thoughts on previous iterations of the mooc.

BYU-Corpora Digs 1 – Rock up

The corpora one can access through the BYU interface [] range from US Soap Operas to British Parliament Speeches, from historical English in the US in the 1800s right up to yesterday’s news on the web in 20 countries round the world.

This allows interested parties a number of ways to look at some language in use.

A recent story about UK politics reports on a politician using the following language:

No, I just rocked up and then waved at the CCTV.


This use of rock up seems worthy of a little attention. So my question is: what information can we get about the use of rock up using the corpora at BYU?

I’ll post my thoughts in a few weeks. Thanks for any consideration : )

Email corpora


1) The most famous is no doubt the ENRON corpus e.g. [] or the ENRON Sent corpus []

2) There is the Business Letter Corpus [], though it is a shame that the interface to it is limited.

3) British Columbia Conversation Corpora (BC3): Email corpus []

4) SPAM and non-SPAM email datasets []

5) Hillary Clinton email archive by Wikileaks [].

6) US Democratic National Convention email archive [].

Thanks to Laura Adele Soracco for prompting this post.


Enron corpus primer tutorial []