mainly the Google Plus Corpus Linguistics Archive

puzzling language teaching

Notes for AMALGrAM 2.0, a multi-word expression and semantic tagger

The tagger expects the input file to be part-of-speech (POS) tagged and in a tab-separated format, with a blank line between sentences.

1. So first we need to split the words in a file into separate lines:

tr -s '[[:space:]]' '\n' < file

Note this is for a file that already has spaces around punctuation; for proper tokenisation use the Stanford tokenizer:

java edu.stanford.nlp.process.PTBTokenizer file

Update: I have been having problems running the Stanford tokenizer, so use the following instead:

grep -Eo '\w+|[^\w ]' input.txt

2. POS tag using the Stanford tagger with the -tokenize false flag:

java -mx300m -classpath stanford-postagger.jar edu.stanford.nlp.tagger.maxent.MaxentTagger -model models/english-bidirectional-distsim.tagger -tokenize false -textFile file

3. Insert newline after tagged full stop:

sed 's/\._\./&\n/g' file

4. Convert the underscores to tab delimiters:

sed 's/_/\t/g' file

5. Finally, run the file through pysupersensetagger (AMALGrAM):

./ file
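To see steps 3 and 4 in action, here is a minimal sketch on a made-up sample of tagger output (the \n and \t escapes in the sed replacement are GNU sed extensions; file names are just for illustration):

```shell
# Sample tagger output: one word_TAG token per line, ._. ends a sentence
printf 'The_DT\ncat_NN\nsat_VBD\n._.\nIt_PRP\npurred_VBD\n._.\n' > tagged.txt

# Steps 3 and 4 in one pass (GNU sed): insert a blank line after each
# sentence-final ._. , then convert the underscore separators to tabs
sed 's/\._\./&\n/g; s/_/\t/g' tagged.txt > amalgram_input.txt

cat amalgram_input.txt
```

The result is one tab-separated word/tag pair per line with a blank line between sentences, which matches the input format the tagger expects.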

See also: part 2 notes []

Ivor Timmis email interview

A tweet by Pérez-Paredes (@perezparedes) alerted me to the release of a new book titled Corpus Linguistics for ELT: Research & Practice by Dr. Ivor Timmis, who kindly responded to some questions about the new publication.

1. Can you share a little of your background?

I’ve been involved in ELT for about 30 years as a teacher, teacher educator, PhD supervisor and researcher, more or less in that order.  I have a particular interest in spoken corpora and in the relationship between corpus linguistics and language teaching.  Currently I am working on developing an historical corpus of conversations recorded in a working class community in the 1930s (the Bolton/Worktown corpus).  I also have an interest in materials development for ELT and have just co-written a book with Freda Mishan on that topic.

2. Why did you decide to write the book?

Well, I was invited to do it!  However, I did think there was a real need for it.  For too long there has been talk of the ‘potential’ of corpus linguistics to contribute to language teaching, but I don’t think that potential has been realised to any great extent.  As I say in the book, it was once said of an English footballer, ‘He has a great future behind him’.  His youthful promise never came to fruition – I don’t want that to be said about corpora and ELT.

3. How much of your book would you say is about practice rather than theory?

That is difficult to say – the whole point of the book is to encourage teachers to use corpora and corpus tools (and to encourage their learners to do so).  For that reason, every chapter has hands-on practical tasks.  There is, however, enough theoretical background to ensure informed practice.

4. What kind of corpus skills do you assume your readers will have?

None.  There is a chapter on how to build a corpus and the tasks gradually familiarise readers with corpus skills.

5. Are there any similar books you would recommend?

The closest in spirit is:

O’Keeffe, A., McCarthy, M. and Carter, R. (2007) From Corpus to Classroom. CUP.

Their book, however, self-avowedly focuses on spoken language.

6. Anything else you would like to add?

I think an acquaintance with corpus research and an ability to exploit corpora should be a part of every English language teacher’s repertoire (among the many other skills teachers need).  I hope the book contributes to that aim.

Many thanks to Ivor for the interview. It’s great to see a new corpus publication focusing on ELT; I look forward to getting it on my bookshelf.

Related – a couple more interviews with authors of recent books on corpora and teaching:

James Thomas’s Discovering English with SketchEngine []

Christian Jones and Daniel Waller on Corpus Linguistics for Grammar: A guide for research []

Andrew Caines spoken corpus project

I asked Andrew Caines about the spoken corpus he is compiling. He kindly responded:

What is the purpose of this spoken corpus?

We’re collecting this spoken corpus for several reasons: firstly on the basis of ‘the more data the better’ — it does no harm to have new corpora for up-to-date language models, especially of spoken language, as these are harder to collect and therefore harder to come by. Secondly, we’re running the corpus collection exercise as a novel use of crowdsourcing for a special session at the INTERSPEECH conference this year.

And finally, we’re collecting two corpus types: the first is a corpus of English native speakers undertaking similar tasks as are found in EFL exams, so that we can start to address the question ‘what would native speakers do in this context?’ The second is a corpus of bilingual German-English speakers undertaking the same tasks as in the first corpus, but this time allowing us to address the question of first language transfer effects in the second language.

What size are you aiming for?

We’re looking for more than one hundred individual speakers and asking them to talk for anything between 2.5 and 5 minutes each. We don’t yet know how many words this will give us in total: it depends on people’s speaking rates!

How does it differ from the new BNC2014 spoken corpus project?

The new BNC2014 project will be a wonderful update to the original BNC conversation section: containing face-to-face spontaneous dialogues between British English speakers, on whatever topic occurs to them. In contrast, our corpus involves monologues, with people answering our specified questions, and the speakers being native speakers of (British/American/Canadian/etc) English or German.

What format(s) will the corpus be available in to the public?

We’ll make the recordings and transcripts available to other researchers via the Speech and Language Data Repository.

When is a version likely to be available to the public?

We hope it will be made available before the end of this year.

Any other info you would like to add?

We welcome contributions from native speakers of English, and German-English bilinguals.

If you have an Android mobile/tablet device, all you will need is the (free) Crowdee app, 10-15 spare minutes, and a quiet place to record.

We’ll even pay you! (note, there’s a minimum pay-out of €5 on Crowdee).

Further info is available here:

Bigger is not necessarily better

It was interesting to read that a large #corpus such as #COCA-BYU can actually bias the language collected. Before I go on, how do you think a large corpus can bias the data?

This paper – On Using Corpus Frequency, Dispersion and Chronological Data to Help Identify Useful Collocations [] reports that a lot of the collocations it found were related to food and cooking. That is because the magazines and newspapers used by COCA regularly had recipes.

Another bias the study discusses is in the spoken section of COCA, which uses a lot of television news and talk shows; hence the language reflects that of newscasters and talk-show hosts.

Finally, the study highlights the lack of a specific business sub-corpus such as exists for academic language; at the moment business language is spread across spoken, magazine and newspaper sources, but not with particularly high frequencies.

Google as a corpus with students

Although using Google as a #corpus has limitations, I think that with features like customised search, autocomplete and basic search syntax, and above all its dominance in #students’ lives, I am more and more inclined to use it in class.

A recent paper called How EFL students can use Google to correct their “untreatable” written errors by Luc Geiller describes the use of a custom Google search, consisting of 28 news sites, with French learners of English. (h/t corpuscall @corpuscall)

A custom google search allows one to tailor the search to sites of interest, thus cutting down on irrelevant search results. The paper reports on six types of search the students performed.

Though many students were able to successfully self-correct their errors using Google, some found the process overwhelming, a common issue when using corpus search results.

I wanted to examine work that has been done in this area of using Google as a corpus, and so thought it would be useful to compile a list of papers available online that look at the teaching aspects of Google as a corpus.

If you know of any others, do please add them:


How EFL students can use Google to correct their “untreatable” written errors (2014) Luc Geiller

Google Scholar as a linguistic tool: new possibilities in EAP (2012) by Vaclav Brezina

Google Drafting (2012) by Joe Geluso

How Can Search Engines Improve Your Writing? (2011) by Adam Acar, Joe Geluso, Tadaki Shiki

Internet tools for language learning: University students taking control of their writing (2010) by Mark A. Conroy

From Body to Web: An Introduction to the Web as Corpus by Maristella Gatto (2008)

How to develop effective concordance materials using online corpus – a slideshare by Professor Sooin Chun

How to develop effective concordance materials using online corpus, a slideshare by Professor Sooin Chun, outlines two essential things to do when constructing DDL (data-driven learning) materials.

Note the difficulties with DDL mentioned in slide 5 i.e.

– text difficulty

– skills in using corpora

– time consuming activities

The two things to do are:

1. Concordance lines should be representative of the frequency of text types and parts of speech. She gives the examples of win and beat, using concordance lines that represent the uses of these words across the text types Spoken, Fiction, Magazine, Newspaper and Academic, and in the two parts of speech, verb and noun; see slide 19.

2. The concordance lines so produced (labelled a specialised corpus) can be used for general purposes; if students have further particular issues, a sample of these concordance lines can be used to investigate them (labelled a micro-specialised corpus); see slide 26.

I am not so clear on what is meant by a micro-specialised corpus, but I imagine that when looking at uses of win, say, students might spot a feature particular to its use in academic texts; further examples of its use in such texts would then be pulled and examined.

Anyway, the slides are worth a look if you are considering DDL.

Re-writing texts for graded readers

The following is related to an #iTDi course on ELT Reading materials design.

A slideshare presentation by Ryota Ito (@i_narrator)[] shows some fascinating differences found in #graded #readers of #Sherlock #Holmes stories that go beyond simple lexical coverage. Using factor analysis, he reports three factors that differentiate the original Sherlock Holmes texts from their graded reader equivalents.

The three factors are

1. keywords and explicitness

2. century (19C vs 20C)

3. subjectivity

The first screen shot shows keywords in Original Readers (OR) and Graded Readers (GR):


In GR there is an increase in explicitness (see the next screenshot), an increase in colloquialisms, and a reduction in post-modifiers (e.g. prepositional phrases, relative clauses).

The following is a screenshot illustrating explicitness i.e. the graded readers are re-written to explicitly indicate who is talking:

[Screenshot2 –]

The next screenshot shows first person rewrites of third person language:

[Screenshot3 –]

And the following again relates to the subjectivity factor, this time in what might be described as a conversational/spoken-language rewrite using But:

[Screenshot4 –]

This was a great slide presentation for me with regard to graded readers, as previously I had only been thinking about token issues rather than wider text issues such as explicitness, first person/third person, and subjective/objective.

All errors and omissions mine and not Ryota Ito’s.

AntWord Profiler and specialised vocabulary profiling

I am following an #iTDi course on ELT Reading materials design. The Week 2 session included the use of vocabulary profilers such as the #lextutor vocab profile to check whether a text reaches the 95–98 per cent coverage of vocabulary needed to understand it well.
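For a rough sense of what coverage means here: it is simply the percentage of a text’s tokens that appear in your word list(s). A toy sketch with hypothetical one-token-per-line files (real profilers also group word families, which this does not):

```shell
# Hypothetical inputs: text.txt = one lowercase token per line,
# list.txt = one known word per line
printf 'the\ncat\nsat\non\nthe\nxylophone\n' > text.txt
printf 'the\ncat\nsat\non\n' > list.txt

total=$(wc -l < text.txt)
# -F fixed strings, -i case-insensitive, -x whole-line match,
# -c count matches, -f read patterns from file
known=$(grep -Ficx -f list.txt text.txt)
echo "coverage: $((100 * known / total))%"   # 5 of 6 tokens known
```

With 5 of the 6 tokens on the list, this text has 83 per cent coverage, well short of the 95–98 per cent target.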

An issue that arose is how to handle specialised texts, that is, how to create a vocab profile that takes into account technical/specialised vocabulary.

One rough way to do that is to first collect the specialised texts you want.

Then open this collection using #AntWordProfiler [].

Next, make sure the Include words not in list(s) checkbox is ticked. Then press Start.

Scroll through the results to find the Groups NOT Found In Base Lists section. Copy and paste these into Excel.

In Excel, select just the first column (the words column), copy and paste it into a text file, then name it appropriately and save it.

You can now load this text file into AntWordProfiler in addition to the default lists of the General Service List and the Academic Word List.

Finally load up a sample text you want to profile and you will now see any specialized vocabulary being profiled due to the new word list you created previously.
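If you prefer to skip the Excel step, the first column can be extracted at the command line instead. A sketch with hypothetical file names, assuming the results were saved as tab-separated text:

```shell
# Hypothetical paste of the Groups NOT Found In Base Lists results:
# word<TAB>frequency, one entry per line
printf 'genome\t12\nallele\t7\nphenotype\t5\n' > notfound.tsv

# Keep only the first (word) column for the new base list
cut -f1 notfound.tsv > specialised_list.txt

cat specialised_list.txt
```

The resulting one-word-per-line file can be loaded into AntWordProfiler as described above.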

Note 1: if you are using Mac OS X Mavericks, when you paste the word column into a text file make sure to save it as a Windows (MS-DOS) text file; otherwise AntWordProfiler will put all the words onto one line when loading it. Though Laurence Anthony notes that simply saving as UTF-8 should suffice.

Note 2: the words in your new list won’t be grouped into word families; each word will be treated as its own word family. I am not yet sure of the best way to group word families automatically.