mainly the Google Plus Corpus Linguistics Archive

videogrep is a script that has loads of potential for language learning.

so i have been playing around with #videogrep, and it works a treat.

for example, here is a supercut of adjective-noun pairs from one #episode of the #bigbangtheory (bigbangS07E22 adj noun)

an obvious activity is to get students to identify the pairs

videogrep also allows you to search by hypernym, e.g. all food-related vocabulary

here is another cut of the word well (see comment) that can be used with this vocabulary prompt

note that the POS tagger that videogrep uses does not seem as accurate as, say, the #TreeTagger tagger that #TagAnt uses


Building your own corpus – TagAnt

Laurence Anthony does it again, bringing difficult-to-set-up programs to the masses with a #tagger called #TagAnt.

This is based on TreeTagger; if you follow the link you will see how involved it is to set up!

I ran TagAnt on my multimedia #corpus then used the tagged corpus in AntConc.

Before working with a tagged corpus in AntConc make sure to check all the boxes in the Global Settings>Token Definitions as in Screenshot1.

Also here is a link to a list of all the tags that TreeTagger uses:

Then use the #Clusters tool in AntConc (a tip gleaned from reading the AntConc Google group).

For example, Screenshot2 shows a search for verb + noun (inspect Screenshot2 for the exact search term). Note that I have set the cluster size to 2.

This shows me that verb + support is quite common:

add support

adds support

adding support

include support

including support

bringing support

drop support

introduces support

I did not notice this when using the non-tagged corpus. Although support was in the top 60 in the wordlist, it would have taken me longer to discern interesting patterns.
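For anyone who wants to try the same verb + noun search outside AntConc, here is a minimal Python sketch. It assumes the tagged corpus is plain text with tokens written as word_TAG (e.g. adds_VVZ), and that TreeTagger's English verb tags start with V and noun tags start with N. The function name and the sample sentence are made up for illustration; this is not how AntConc's Clusters tool works internally.

```python
import re
from collections import Counter

# Assumption: each token looks like "word_TAG", e.g. "adds_VVZ", separated by whitespace.
TOKEN = re.compile(r"^(?P<word>.+)_(?P<tag>[A-Z$]+)$")

def verb_noun_pairs(tagged_text):
    """Count adjacent verb + noun pairs (cluster size 2) in word_TAG text."""
    tokens = []
    for raw in tagged_text.split():
        match = TOKEN.match(raw)
        if match:
            tokens.append((match.group("word").lower(), match.group("tag")))
        else:
            # Keep tokens without a recognisable tag so adjacency is not distorted.
            tokens.append((raw.lower(), ""))
    pairs = Counter()
    for (w1, t1), (w2, t2) in zip(tokens, tokens[1:]):
        # Verb tags start with V (VV, VVZ, VVG ...), noun tags with N (NN, NNS, NP ...).
        if t1.startswith("V") and t2.startswith("N"):
            pairs[(w1, w2)] += 1
    return pairs

# Made-up sample in the assumed format.
sample = (
    "This_DT release_NN adds_VVZ support_NN for_IN grids_NNS "
    "and_CC drops_VVZ support_NN for_IN tables_NNS"
)
counts = verb_noun_pairs(sample)
for (verb, noun), n in counts.most_common():
    print(verb, noun, n)
```

On a real tagged corpus you would read the file contents into a string and pass that in instead of the sample.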

Happy Tagging!

p.s. this is continuing my series of diy corpus I started on my blog which you can read here if so interested:

UCREL semantic tagger

#corpusmooc introduced me to the #UCREL #semantic tagger which you can find here

i apologize in advance for bringing a #sugatamitra related post but hear me out!

i thought the #iatefl mitra debates would be a nice way to get to know about semantic tagging. so i ran the texts i collected on Wednesday 09 April from the net through the #lancaster tagger (note i only did minimal cleaning up of the text, so some duplicates, for example, are in my current corpus).

note also i have texts divided into original blog posts and texts of comments to the blog posts.

the graphics you see are from the comments to the blog posts and are a way for me to get my head around what could be likely concordances to investigate manually – the words in brackets in the figure labels are potential keywords, the ~ sign represents the most interesting collocation.

let me know what you think, is this a good way to pre-analyse a semantically tagged corpus or not?

Costas Gabrielatos interview

Costas Gabrielatos, a linguist from Edge Hill University, has kindly answered some questions I posed to him.

1. Tell us a little about your background.

I started as a language teacher, and then moved to teacher education. Almost from the beginning, I got more interested in the ‘language’ side of ‘language teaching’ – a main contributing reason being the high frequency of overgeneralisations and inaccuracies in the information (‘rules’) provided in coursebooks and pedagogical grammars. This led me to corpus-based linguistics, but with an eye to pedagogical implications. My current focus, as far as LT is concerned, is on corpus-based pedagogical grammar and analysis of learner language.

2. Why should teachers consider #corpora in their classroom?

I think ‘should’ might be a bit too strong. More to the point, I think encouraging teachers to adopt corpus-based teaching approaches irrespective of their knowledge/skills is simply misguided. Before teachers (or researchers, for that matter) attempt to use corpus-based techniques, they need to have acquired relevant knowledge and skills. I’ve observed enough lessons based on misunderstood notions of ‘communicative teaching’ to shudder at the idea of a hasty adoption of CL techniques in language teaching. However, given the knowledge/skills, then access to corpora can enrich any teaching approach – provided, of course, that the approach does not allow ‘rules’ to trump evidence of actual language use.

3. Why do you think take up of it has been slow if not non-existent?

Lack of knowledge and skills, and perhaps lack of time and/or interest. Another possibility is the perception of corpus-based approaches as rather ‘academic’ (cue in the stereotypical aloof lecturer and lab-coated researcher). In fact, in light of my previous answer, I don’t find the current low level of adoption something to be unhappy about.

4. Do you think that is changing with wider availability of corpus interfaces such as COCA?

Judging from the increasing number of relevant journal papers, conference presentations, discussion groups, and websites/blogs, the interest in the utility of CL techniques for language teaching appears to be rising. However, it seems to me that the increase in interest is not so much on the part of what we might call traditional classroom teachers, but those who teach in universities, or are involved in online language teaching, or are in the process of moving away from the (virtual) classroom and towards academia.

5. Do you know to what extent the Lancaster corpus MOOC would interest teachers?

I can’t tell if it would interest teachers, but I think it offers a very good way in for those considering adding corpus-based elements to their teaching. Not only because there is a component on corpora and language teaching, but also, and more importantly, because the MOOC introduces participants to the core concepts, constructs and techniques in CL.

6. Can you recommend just one reference/resource to help language teachers with corpora?

The short answer is: ‘I refuse to do that’. The longer answer is that I think it is limiting and educationally detrimental to only derive information from a single source. CL is not monolithic, and there are currently a lot of disputes regarding some of its core theoretical notions and analytical approaches – even the very nature of CL is being debated (for example, see ). My advice would be to read as many introductory chapters/books by different authors as possible and, in typical CL fashion,  try to identify patterns in approaches and practices – keeping in mind that newer introductions are not necessarily better than older ones.

7. Any comments you would like to add not yet covered?

Just a few things I always stress at the beginning of a CL module/seminar:

·      Corpus linguistics is something you learn primarily by doing: working with corpus tools and reflecting critically on both the results you get and the techniques that yielded these results.

·      If you expect the ease and automaticity that would be afforded by a StarTrek-type computer, you’ll be disappointed – CL involves a lot of manual work.

·      Corpus linguistics is very easy to do badly.

For a detailed account of my views on corpora and language teaching, see here:

BootCat Seeding

There could be some disappointment with #corpusmooc week 4 billed as building your own corpus. The lectures are more of a general discussion of corpus building along with very useful lectures on the CLAWS part of speech tagging and USAS semantic tagging.

In my opinion a section on the web as a corpus would have been good, for example the use of #BootCat. You can read about how to use BootCat here – What I would like to note is the process of finding good seed words.

How BootCat works is this:

1. seed (words) – e.g. you can use keywords derived from comparing a sample corpus (that you build manually) to a reference corpus

2. tuples – the seed words are combined into tuples; the default tuple size is three, so if the seeds are one, two, three, four, five (you need a minimum of 5 seed words), a tuple could be one, two, four; the default number of tuples is 10

3. collect urls – collect urls that contain these tuples, you need to go through this carefully to ensure the text is what you want; BootCat helpfully links to the urls for you to check but of course if you have a lot of URLs it is a time consuming process; you can alleviate this somewhat by specifying domains to leave out of the url collection stage

4. build corpus – build corpus with the collected urls
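To make the tuples step a bit more concrete, here is a rough Python sketch of combining seed words into random tuples. It only illustrates the idea described in the steps above and is not BootCat's own code; the function name make_tuples and its parameters are hypothetical, with defaults copied from the numbers mentioned (tuple size 3, 10 tuples, minimum 5 seeds).

```python
import itertools
import random

def make_tuples(seeds, tuple_size=3, n_tuples=10, rng=None):
    """Pick n_tuples distinct combinations of tuple_size seed words at random."""
    if len(seeds) < 5:
        raise ValueError("need at least 5 seed words")
    rng = rng or random.Random()
    # Every possible unordered combination of the seeds, without repeats.
    all_combos = list(itertools.combinations(seeds, tuple_size))
    # Cannot draw more tuples than there are combinations.
    return rng.sample(all_combos, min(n_tuples, len(all_combos)))

seeds = ["one", "two", "three", "four", "five"]
tuples = make_tuples(seeds, rng=random.Random(0))
for t in tuples:
    print(" ".join(t))
```

With exactly five seeds there are only ten possible three-word combinations, so the default settings use every one of them; adding more seeds gives more varied tuples.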

If you are not satisfied with the resulting corpus you can redo BootCat process using say keywords and/or n-grams from the (unsatisfying) corpus to build a new BootCat corpus. You can of course keep repeating this.

When I was looking to build web design related corpora the results I was getting were disappointing. Then I hit upon the idea of using already existing categories from A List Apart, a well known site for web designers. They have a topics section, so for example their top level topic of Design includes Brand Identity, Graphic Design, Layout & Grids, Mobile/Multidevice, Responsive Design, Typography & Web Fonts.

So I used these as seed words and the resulting corpus was much better. I intend to take a similar approach to build corpora reflecting two other A List Apart top level topics, Code and User Experience.

Apparently there is a new version of BootCat coming with some neat new features.

FYI if you already have a website that you want to collect texts from have a read of my post here –

see also – BootCat Custom URL

corpusmooc 2014 round 1 blogs

Some blog posts on #corpusmooc have appeared, check them out and if you know of others do post here:

one by aProfessor Moravec

and one by Carol Goodey

A post by  Patrick Andrews that highlights the differences between #corpusmooc and an online learning environment like the Open University.

I would agree that the value of moocs for educating the young is limited; there are much greater benefits for professional development.

Michael Harrison has posted first impressions of #corpusmooc

a post by PhD student Julie Voce ‏@julievoce

a week2 post by Michael Harrison – mike i would jump to week 4 for corpus building (and all of the #antconc videos) and some of week 6 and all of week 7 for language learning.

Carol Goodey has added another entry answering some common concerns about completing a mooc –

Sam Shepherd has written about some limits of the FutureLearn setup, i.e. the discussion forums and getting lost in them – One way round this is to follow people and filter by their names; another is to use this activity feed

Of course a further way is to bring in your discussion to this community 🙂 so do post any questions/thoughts here Sam 🙂

a week 3 post by Julie Voce ‏@julievoce noting the difficulties with the commenting system

a post by TeacherPants questioning how to make sense of the data from corpus searches

Ann Priestley ‏@annindk is posting her thoughts here and here –

Mary Carr ‏@marymcarr1 has written some thoughts, apparently not a Chomsky fan.

Patrick Andrews shares some of his thoughts on one of the readings in #corpusmooc

really nice week6 musings from aProfessor Moravec on looking at -ly adverbs

end of course thoughts by Duygu Çandarlı ‏@duygucandarli and Karen Carlson ‏@sloopie72

There is some interesting analysis happening in the first practical activity 1.26 in #corpusmooc.

I used the Lextutor Text Compare feature and screenshotted the first 30-odd results. (I combined the corpus files using command line cat and copy-pasted the result into Lextutor Text Compare.)

You can see immediately differences in proper nouns – names and places between the Brown corpus and the LOB corpus.

Common noun differences also appear – e.g. gyro in Brown, grid in LOB. There are also some artifacts, e.g. pc in LOB only appears once when using AntConc and not 38 times as in the screenshot. The word number seems to be an artifact as well, since it occurs about as often in Brown as in LOB.
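If you want to sanity check this kind of comparison yourself, here is a small Python sketch that counts words in two texts and lists the biggest frequency differences. It is only a rough stand-in under my own assumptions, not what Lextutor Text Compare actually computes: it uses raw count differences (reasonable only when the two corpora are about the same size), a very crude letters-only tokenizer, and two made-up sample sentences in place of Brown and LOB.

```python
import re
from collections import Counter

def word_counts(text):
    """Lowercase the text and count runs of letters as words (a crude tokenizer)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def compare(text_a, text_b, top=30):
    """Return (word, count_a - count_b) pairs, largest absolute difference first."""
    a, b = word_counts(text_a), word_counts(text_b)
    # A Counter returns 0 for words it has not seen, so missing words are handled.
    diffs = {word: a[word] - b[word] for word in set(a) | set(b)}
    return sorted(diffs.items(), key=lambda kv: (-abs(kv[1]), kv[0]))[:top]

# Made-up stand-ins for the two corpora.
brown_like = "the gyro spun while the gyro hummed in Chicago"
lob_like = "the grid failed while the grid hummed in London"

result = compare(brown_like, lob_like)
for word, diff in result:
    print(word, diff)
```

Positive numbers mean the word is more frequent in the first text, negative in the second, and words near zero are the ones to be suspicious of when a tool reports them as distinctive.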

The Disabled Access Friendly Campaign uses English language teaching to raise awareness of disability issues

I submitted a lesson plan for a competition and it has been shortlisted (yeah me!). The lesson plan is on #corpus use literacy using the #wordandphrase .info site. You can check the lesson here (file: Muralee. Corpus use literacy.pdf)

Another interesting lesson is provided by one Willy Cardoso, who uses what he calls an inquiry from ignorance approach. It has a lot of the hallmarks of a corpus approach but uses Google as the search landscape, worth checking out (file: Willy. Mobility disability. An inquiry based lesson (1).pdf)