There could be some disappointment with #corpusmooc week 4, which was billed as building your own corpus. The lectures are more of a general discussion of corpus building, along with very useful lectures on CLAWS part-of-speech tagging and USAS semantic tagging.
In my opinion a section on the web as a corpus would have been good, for example on the use of #BootCat. You can read about how to use BootCat here – http://bootcat.sslmit.unibo.it/?section=home. What I would like to note is the process of finding good seed words.
How BootCat works is this:
1. seeds (words) – e.g. you can use keywords derived from comparing a sample corpus (that you build manually) to a reference corpus
2. tuples – the seed words are combined into tuples; the default tuple length is three, so if the seeds are one, two, three, four, five (you need a minimum of 5 seed words), a tuple could be one, two, four; the default number of tuples is 10
3. collect urls – collect URLs of pages that contain these tuples; you need to go through them carefully to ensure the texts are what you want. BootCat helpfully links to the URLs for you to check, but of course if you have a lot of URLs this is a time-consuming process; you can alleviate this somewhat by specifying domains to leave out of the URL collection stage
4. build corpus – build corpus with the collected urls
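To make steps 2 and 3 a bit more concrete, here is a minimal Python sketch of what combining seeds into tuples and leaving out unwanted domains could look like. It is only an illustration and not BootCat's own code: the function names make_tuples and drop_excluded, and the example URLs, are my own assumptions.

```python
import random
from itertools import combinations
from urllib.parse import urlparse


def make_tuples(seeds, tuple_size=3, n_tuples=10, rng=None):
    """Return up to n_tuples distinct random combinations of the seed words.

    Hypothetical helper: the defaults mirror the ones described above
    (tuples of three words, ten tuples), nothing more.
    """
    rng = rng or random.Random()
    all_combos = list(combinations(seeds, tuple_size))
    # Never ask for more tuples than there are possible combinations.
    return rng.sample(all_combos, min(n_tuples, len(all_combos)))


def drop_excluded(urls, excluded_domains):
    """Keep only URLs whose host is not an excluded domain or one of its subdomains."""
    kept = []
    for url in urls:
        host = (urlparse(url).hostname or "").lower()
        if not any(host == d or host.endswith("." + d) for d in excluded_domains):
            kept.append(url)
    return kept


seeds = ["one", "two", "three", "four", "five"]
for words in make_tuples(seeds):
    # Each tuple would be sent to a search engine as one query.
    print(" ".join(words))
```

Note that five seeds give exactly ten distinct three-word combinations, so with the minimum number of seeds and the default settings every possible tuple gets used; adding more seeds is what brings variety into the queries.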
If you are not satisfied with the resulting corpus you can redo the BootCat process using, say, keywords and/or n-grams from the (unsatisfying) corpus as new seeds to build a new BootCat corpus. You can of course keep repeating this.
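If you want to pull candidate seed words out of a corpus yourself, one common approach is to score each word by how much more frequent it is in your corpus than in a reference corpus. The sketch below uses the log-likelihood keyness measure, which is my choice for illustration and not something BootCat does for you; the tokenizer is deliberately crude and the function names are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Very rough tokenizer: lowercase runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def log_likelihood(a, b, c, d):
    """Keyness of a word seen a times in a study corpus of c tokens
    and b times in a reference corpus of d tokens."""
    e1 = c * (a + b) / (c + d)  # expected frequency in the study corpus
    e2 = d * (a + b) / (c + d)  # expected frequency in the reference corpus
    ll = 0.0
    if a > 0:
        ll += a * math.log(a / e1)
    if b > 0:
        ll += b * math.log(b / e2)
    return 2 * ll


def keywords(study_text, reference_text, top_n=5):
    """Return the top_n words that are relatively overused in study_text."""
    study = Counter(tokenize(study_text))
    ref = Counter(tokenize(reference_text))
    c, d = sum(study.values()), sum(ref.values())
    scored = []
    for word, a in study.items():
        b = ref.get(word, 0)
        # Skip words that are relatively more frequent in the reference corpus.
        if a / c > b / d:
            scored.append((log_likelihood(a, b, c, d), word))
    scored.sort(reverse=True)
    return [word for _, word in scored[:top_n]]
```

With real data you would pass in the full text of your corpus and of a much larger reference corpus, then look through the top of the list by hand before using any of the words as seeds.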
When I was looking to build web design related corpora the results I was getting were disappointing. Then I hit upon the idea of using already existing categories from A List Apart, a well-known site for web designers. They have a topics section, so for example their top level topic of Design includes Brand Identity, Graphic Design, Layout & Grids, Mobile/Multidevice, Responsive Design, Typography & Web Fonts.
So I used these as seed words and the resulting corpus was much better. I intend to take a similar approach to build corpora reflecting two other A List Apart top level topics, Code and User Experience.
Apparently there is a new version of BootCat coming with some neat new features.
FYI if you already have a website that you want to collect texts from, have a read of my post here – http://eflnotes.wordpress.com/2013/03/06/building-your-own-corpus-textstat-antconc/
See also – BootCat Custom URL [https://eflnotes.wordpress.com/2014/10/08/building-your-own-corpus-bootcat/]