mainly the Google Plus Corpus Linguistics Archive

puzzling language teaching

How to develop effective concordance materials using online corpus – a slideshare by Professor Sooin Chun

The presentation outlines two essential things to do when constructing DDL (data-driven learning) materials.

Note the difficulties with DDL mentioned in slide 5, i.e.:

– text difficulty

– skills in using corpora

– time consuming activities

The two things to do are:

1. Concordance lines should be representative of the word's frequency across text types and parts of speech. For example, she looks at win and beat using concordance lines that represent the uses of these words across the text types Spoken, Fiction, Magazine, Newspaper and Academic, and in the two parts of speech, verb and noun (see slide 19).

2. The concordance lines so produced (labelled a specialised corpus) can be used for general purposes; if students then have further particular issues, a sample of these concordance lines can be used to investigate further (labelled a micro-specialised corpus), see slide 26.

I am not entirely clear on what is meant by a micro-specialised corpus, but I imagine that if, say, when looking at uses of win students spot a feature particular to its use in academic texts, then further examples of its use in such texts are pulled and examined?
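Point 1 amounts to stratified sampling: the number of concordance lines drawn from each text type should mirror the word's frequency there. A minimal sketch of the idea in Python (the frequency figures below are invented for illustration, not taken from the slides):

```python
# Allocate concordance lines in proportion to a word's frequency per text type.
# The frequencies here are invented for illustration only.
freq = {"Spoken": 50, "Fiction": 20, "Magazine": 15, "Newspaper": 10, "Academic": 5}

def lines_per_type(freq, total_lines=20):
    """Split a fixed number of concordance lines proportionally across text types."""
    grand = sum(freq.values())
    return {t: round(total_lines * f / grand) for t, f in freq.items()}

print(lines_per_type(freq))
# {'Spoken': 10, 'Fiction': 4, 'Magazine': 3, 'Newspaper': 2, 'Academic': 1}
```

A real materials writer would then pull that many random lines per text type from the corpus interface.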

Anyway, the slides are worth a look if you are considering DDL.

Re-writing texts for graded readers

The following is related to an #iTDi course on ELT Reading materials design.

A slideshare presentation by Ryota Ito (@i_narrator)[] shows some fascinating differences found in #graded #readers of #Sherlock #Holmes stories that go beyond simple lexical coverage. Using factor analysis, he reports three factors that differentiate original readers of Sherlock Holmes from their graded reader equivalents.

The three factors are:

1. keywords and explicitness

2. century (19C vs 20C)

3. subjectivity

The first screenshot shows keywords in Original Readers (OR) and Graded Readers (GR):

[Screenshot1 –]
In the GR there is an increase in explicitness (see next screenshot), an increase in colloquialisms, and a reduction in post-modifiers (e.g. prepositional phrases, relative clauses).

The following screenshot illustrates explicitness, i.e. the graded readers are re-written to explicitly indicate who is talking:

[Screenshot2 –]

The next screenshot shows first person rewrites of third person language:

[Screenshot3 –]

And the following again shows the subjectivity factor, this time perhaps best described as a conversational/spoken-language rewrite using But:

[Screenshot4 –]

This was a great slide presentation for me with regard to graded readers, as I had previously been thinking only about token issues rather than wider text issues such as explicitness, first person/third person, and subjective/objective.

All errors and omissions mine and not Ryota Ito’s.

AntWord Profiler and specialised vocabulary profiling

I am following an #iTDi course on ELT Reading materials design. The Week 2 session included use of vocabulary profilers such as the #lextutor vocab profile to check whether a text reaches the 95–98 per cent coverage of vocabulary needed to understand it well.

An issue that arose is how to handle specialised texts. That is, how can we create a vocab profile that takes into account technical/specialised vocabulary?

One rough way to do that is to first collect the specialised texts you want.

Then, using #AntWordProfiler [], open this collection.

Next, make sure the Include words not in list(s) checkbox is ticked. Then press Start.

Scroll through the results to find the Groups NOT Found In Base Lists section. Copy and paste these into Excel.

In Excel, select just the first column (the words column), copy and paste it into a text file, name it appropriately and save it.

You can now load this text file into AntWordProfiler in addition to the default lists of the General Service List and the Academic Word List.

Finally, load a sample text you want to profile and you will now see any specialised vocabulary being profiled thanks to the new word list you created.
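The copy-and-paste steps above can also be scripted. A rough sketch, assuming the base lists (e.g. GSL/AWL) and the specialised texts are available as plain text; the sample texts and words below are invented for illustration:

```python
import re

def specialised_words(texts, base_lists):
    """Return words from specialised texts found in none of the base lists,
    ready to save one per line as an AntWordProfiler word list."""
    base = {w.lower() for lst in base_lists for w in lst}
    words = {w.lower() for t in texts for w in re.findall(r"[a-zA-Z]+", t)}
    return sorted(words - base)

# Toy stand-ins for real specialised texts and the GSL base list.
texts = ["The kerning of this typeface is tight", "Adjust the kerning and leading"]
gsl = ["the", "of", "this", "is", "and"]
new_list = specialised_words(texts, [gsl])
print("\n".join(new_list))  # one word per line, ready to load as a word list
```

Writing the result out with a plain `"\n".join(...)` sidesteps the Excel step (and the line-ending issue mentioned below) entirely.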

Note 1: if you are using Mac OS X Mavericks, when you paste the word column into a text file make sure to save it as a Windows (MS-DOS) text file, otherwise AntWordProfiler will put all the words onto one line when loading it. Though Laurence Anthony notes simply saving as UTF-8 should suffice.

Note 2: the words in your new list won't be grouped into word families; each word will be treated as its own word family. I am not sure yet of the best way to group word families automatically.

Videogrep is a script that has loads of potential for language learning

So I have been playing around with #videogrep, and it works a treat.

For example, here is a cut of adjective-noun pairs using one #episode of the #bigbangtheory – supercut bigbangS07E22 adj noun

An obvious activity is to get students to identify the pairs.
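Under the hood, a supercut like this just scans the POS-tagged transcript for adjacent adjective-noun tokens. A minimal sketch of that matching step, with a hard-coded tagged sample standing in for real tagger output:

```python
# Extract adjective-noun pairs from POS-tagged tokens (Penn Treebank tags).
# The sample tokens below are hypothetical stand-ins for a tagger's output.
tagged = [("that", "DT"), ("is", "VBZ"), ("a", "DT"),
          ("brilliant", "JJ"), ("idea", "NN"),
          ("with", "IN"), ("great", "JJ"), ("potential", "NN")]

def adj_noun_pairs(tokens):
    """Return consecutive (adjective, noun) word pairs."""
    return [(w1, w2)
            for (w1, t1), (w2, t2) in zip(tokens, tokens[1:])
            if t1.startswith("JJ") and t2.startswith("NN")]

print(adj_noun_pairs(tagged))  # [('brilliant', 'idea'), ('great', 'potential')]
```

Students could be given the supercut and asked to reconstruct exactly this list of pairs by ear.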

videogrep also allows you to search for hypernyms, e.g. all food-related vocabulary.

Here is another cut of the word well (see comment) that can be used with this vocabulary prompt.

Note that the POS tagger that videogrep uses does not seem as accurate as, say, the #TreeTagger parser that #TagAnt uses.


Building your own corpus – TagAnt

Laurence Anthony does it again, bringing difficult-to-set-up programs to the masses with a #tagger called #TagAnt (

This is based on TreeTagger ( – if you follow the link you will see how involved it is to set up!

I ran TagAnt on my multimedia #corpus, then used the tagged corpus in AntConc.

Before working with a tagged corpus in AntConc make sure to check all the boxes in the Global Settings>Token Definitions as in Screenshot1.

Also here is a link to a list of all the tags that TreeTagger uses:

Then use the #Clusters tool in AntConc (gleaned from reading the Google group for AntConc!forum/antconc).

For example, Screenshot2 shows a search for verb + noun (inspect Screenshot2 for the exact search term). Note that I have set the cluster size to 2.

This shows me that verb + support is quite common:

add support

adds support

adding support

include support

including support

bringing support

drop support

introduces support

I did not notice this when using the untagged corpus; although support was in the top 60 of the wordlist, it would have taken me longer to discern interesting patterns.
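The same verb + noun cluster search can be approximated with a regular expression over TreeTagger-style word_TAG output (TreeTagger tags lexical verbs VV, VVZ, VVG, etc.). The tagged lines below are a hypothetical sample, not my actual corpus:

```python
import re
from collections import Counter

# A few lines of TreeTagger-style output in word_TAG form
# (an invented sample mimicking a tagged multimedia corpus).
tagged_text = """the_DT update_NN adds_VVZ support_NN for_IN tags_NNS
we_PP are_VBP adding_VVG support_NN for_IN video_NN
they_PP drop_VVP support_NN for_IN flash_NN"""

# Mimic AntConc's Clusters tool: lexical verb followed by a singular noun.
pairs = re.findall(r"(\w+?)_VV\w*\s+(\w+?)_NN\b", tagged_text)
counts = Counter(noun for verb, noun in pairs)
print(pairs)
print(counts.most_common(1))  # [('support', 3)] – support dominates the clusters
```

This is essentially what the Clusters search with cluster size 2 surfaces, minus AntConc's frequency sorting and display.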

Happy Tagging!

p.s. this continues my series on DIY corpora that I started on my blog, which you can read here if interested:

UCREL semantic tagger

#corpusmooc introduced me to the #UCREL #semantic tagger which you can find here

I apologise in advance for bringing up a #sugatamitra-related post, but hear me out!

I thought the #iatefl Mitra debates would be a nice way to get to know semantic tagging, so I ran the texts I collected from the net on Wednesday 9 April through the #lancaster tagger (note I did only minimal cleaning of the text, so some duplicates, for example, remain in my current corpus).

Note also that I have the texts divided into original blog posts and the texts of comments on those posts.

The graphics you see are from the comments to the blog posts and are a way for me to get my head around which concordances are likely worth investigating manually – the words in brackets in the figure labels are potential keywords, and the ~ sign represents the most interesting collocation.

Let me know what you think: is this a good way to pre-analyse a semantically tagged corpus or not?

Costas Gabrielatos interview

Costas Gabrielatos, a linguist from Edge Hill University, has kindly answered some questions I posed to him.

1. Tell us a little about your background.

I started as a language teacher, and then moved to teacher education. Almost from the beginning, I got more interested in the ‘language’ side of ‘language teaching’ – a main contributing reason being the high frequency of overgeneralisations and inaccuracies in the information (‘rules’) provided in coursebooks and pedagogical grammars. This led me to corpus-based linguistics, but with an eye to pedagogical implications. My current focus, as far as LT is concerned, is on corpus-based pedagogical grammar and analysis of learner language.

2. Why should teachers consider #corpora in their classroom?

I think ‘should’ might be a bit too strong. More to the point, I think encouraging teachers to adopt corpus-based teaching approaches irrespective of their knowledge/skills is simply misguided. Before teachers (or researchers, for that matter) attempt to use corpus-based techniques, they need to have acquired relevant knowledge and skills. I’ve observed enough lessons based on misunderstood notions of ‘communicative teaching’ to shudder at the idea of a hasty adoption of CL techniques in language teaching. However, given the knowledge/skills, then access to corpora can enrich any teaching approach – provided, of course, that the approach does not allow ‘rules’ to trump evidence of actual language use.

3. Why do you think take up of it has been slow if not non-existent?

Lack of knowledge and skills, and perhaps lack of time and/or interest. Another possibility is the perception of corpus-based approaches as rather ‘academic’ (cue in the stereotypical aloof lecturer and lab-coated researcher). In fact, in light of my previous answer, I don’t find the current low level of adoption something to be unhappy about.

4. Do you think that is changing with wider availability of corpus interfaces such as COCA?

Judging from the increasing number of relevant journal papers, conference presentations, discussion groups, and websites/blogs, the interest in the utility of CL techniques for language teaching appears to be rising. However, it seems to me that the increase in interest is not so much on the part of what we might call traditional classroom teachers, but those who teach in universities, or are involved in online language teaching, or are in the process of moving away from the (virtual) classroom and towards academia.

5. Do you know to what extent the Lancaster corpus MOOC would interest teachers?

I can’t tell if it would interest teachers, but I think it offers a very good way in for those considering adding corpus-based elements to their teaching. Not only because there is a component on corpora and language teaching, but also, and more importantly, because the MOOC introduces participants to the core concepts, constructs and techniques in CL.

6. Can you recommend just one reference/resource to help language teachers with corpora?

The short answer is: ‘I refuse to do that’. The longer answer is that I think it is limiting and educationally detrimental to only derive information from a single source. CL is not monolithic, and there are currently a lot of disputes regarding some of its core theoretical notions and analytical approaches – even the very nature of CL is being debated (for example, see ). My advice would be to read as many introductory chapters/books by different authors as possible and, in typical CL fashion, try to identify patterns in approaches and practices – keeping in mind that newer introductions are not necessarily better than older ones.

7. Any comments you would like to add not yet covered?

Just a few things I always stress at the beginning of a CL module/seminar:

·      Corpus linguistics is something you learn primarily by doing: working with corpus tools and reflecting critically on both the results you get and the techniques that yielded these results.

·      If you expect the ease and automaticity that would be afforded by a StarTrek-type computer, you’ll be disappointed – CL involves a lot of manual work.

·      Corpus linguistics is very easy to do badly.

For a detailed account of my views on corpora and language teaching, see here:

BootCat Seeding

There could be some disappointment with #corpusmooc week 4, billed as building your own corpus. The lectures are more of a general discussion of corpus building, along with very useful lectures on CLAWS part-of-speech tagging and USAS semantic tagging.

In my opinion, a section on the web as a corpus would have been good – for example, the use of #BootCat. You can read about how to use BootCat here – What I would like to note is the process of finding good seed words.

How BootCat works is this:

1. seed words – e.g. you can use keywords derived from comparing a sample corpus (that you build manually) to a reference corpus

2. tuples – the seed words are combined; the default tuple length is three, so for example if the seeds are one, two, three, four, five (you need a minimum of 5 seed words), a tuple could be one, two, four; the default number of tuples is 10

3. collect URLs – collect URLs whose pages contain these tuples; you need to go through these carefully to ensure the text is what you want. BootCat helpfully links to the URLs for you to check, but of course if you have a lot of URLs it is a time-consuming process; you can alleviate this somewhat by specifying domains to leave out of the URL collection stage

4. build corpus – build the corpus from the collected URLs

If you are not satisfied with the resulting corpus you can redo the BootCat process using, say, keywords and/or n-grams from the (unsatisfying) corpus to build a new one. You can of course keep repeating this.
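The seed-to-tuple step above can be sketched in a few lines; tuple length 3 and 10 tuples mirror the defaults mentioned, and the seed words are the placeholder ones from the example:

```python
import random
from itertools import combinations

# Sketch of BootCat's tuple step (defaults: tuple length 3, 10 tuples).
seeds = ["one", "two", "three", "four", "five"]  # minimum of 5 seed words

def make_tuples(seeds, length=3, n=10, rng=None):
    """Randomly sample up to n distinct seed combinations of the given length."""
    rng = rng or random.Random(0)  # fixed seed for a reproducible sketch
    pool = list(combinations(seeds, length))  # with 5 seeds: C(5,3) = 10 tuples
    return rng.sample(pool, min(n, len(pool)))

for t in make_tuples(seeds):
    print(" ".join(t))  # each tuple becomes one search-engine query
```

Each printed tuple is what BootCat then submits as a query in the collect-URLs step.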

When I was looking to build web-design-related corpora the results I was getting were disappointing. Then I hit upon the idea of using already existing categories from A List Apart, a well-known site for web designers. They have a topics section; for example, their top-level topic of Design includes Brand Identity, Graphic Design, Layout & Grids, Mobile/Multidevice, Responsive Design, and Typography & Web Fonts.

So I used these as seed words and the resulting corpus was much better. I intend to take a similar approach to build corpora reflecting the two other A List Apart top-level topics, Code and User Experience.

Apparently there is a new version of BootCat coming with some neat new features.

FYI, if you already have a website that you want to collect texts from, have a read of my post here –

see also – BootCat Custom URL []

The Disabled Access Friendly Campaign uses English language teaching to raise awareness of disability issues

I submitted a lesson plan for a competition which has been shortlisted (yeah me!); the lesson plan is on #corpus use literacy using the #wordandphrase.info site. You can check the lesson here,%20Muralee.%20Corpus%20use%20literacy.pdf

Also an interesting lesson is provided by one Willy Cardoso, who uses what he calls an inquiry from ignorance approach, which has many of the hallmarks of a corpus approach but uses Google as the search landscape; worth checking out:,%20Willy.%20Mobility%20disability.%20An%20inquiry%20based%20lesson%20(1).pdf