BootCat Seeding

Anyone expecting #corpusmooc week 4, billed as building your own corpus, to be hands-on may be disappointed. The lectures are more of a general discussion of corpus building, along with very useful lectures on CLAWS part-of-speech tagging and USAS semantic tagging.

In my opinion a section on the web as a corpus would have been good, for example on the use of #BootCat. You can read about how to use BootCat here – What I would like to note is the process of finding good seed words.

How BootCat works is this:

1. seed (words) – e.g. you can use keywords derived from comparing a sample corpus (that you build manually) to a reference corpus

2. tuples – the seed words are combined into tuples; the default tuple size is three, so if the seeds are one, two, three, four, five (you need a minimum of five seed words), a tuple could be one, two, four; the default number of tuples is 10

3. collect URLs – collect URLs of pages that contain these tuples; you need to go through the list carefully to ensure the texts are what you want. BootCat helpfully links to the URLs for you to check, but if you have a lot of URLs this is a time-consuming process; you can alleviate this somewhat by specifying domains to leave out of the URL collection stage

4. build corpus – build the corpus from the collected URLs
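To make steps 2 and 3 a bit more concrete, here is a minimal Python sketch of how seeds might be combined into tuples and how unwanted domains might be filtered out of a URL list. This is not BootCat's own code: the function names `make_tuples` and `drop_excluded` are made up for illustration, and drawing the tuples at random without repeats is my assumption about how the combining works.

```python
import itertools
import random
from urllib.parse import urlparse


def make_tuples(seeds, tuple_size=3, n_tuples=10, rng=None):
    """Draw up to n_tuples distinct, unordered combinations of the seeds.

    Illustrative only: random sampling without repeats is an assumption.
    """
    rng = rng or random.Random()
    combos = list(itertools.combinations(sorted(set(seeds)), tuple_size))
    return rng.sample(combos, min(n_tuples, len(combos)))


def drop_excluded(urls, excluded_domains):
    """Keep only URLs whose host is not an excluded domain or a subdomain of one."""
    kept = []
    for url in urls:
        host = (urlparse(url).hostname or "").lower()
        if not any(host == d or host.endswith("." + d) for d in excluded_domains):
            kept.append(url)
    return kept


seeds = ["one", "two", "three", "four", "five"]
tuples = make_tuples(seeds, rng=random.Random(0))
for t in tuples:
    # Each tuple would be sent to the search engine as one query.
    print(" ".join(t))
```

Note that five seeds give exactly ten possible three-word combinations, so with the defaults every combination gets used; more seeds give more variety in the queries.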

If you are not satisfied with the resulting corpus you can redo the BootCat process, using, say, keywords and/or n-grams from the (unsatisfying) corpus as seeds to build a new BootCat corpus. You can of course keep repeating this.
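As a rough illustration of deriving keyword seeds by comparing one corpus to a reference corpus, here is a small Python sketch that ranks words by log-likelihood keyness. The choice of log-likelihood, the simple tokenizer and the function names are my assumptions for the example; a corpus tool such as AntConc will do the same job with more options.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and pull out alphabetic word tokens."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def log_likelihood(a, b, c, d):
    """Keyness of a word: a, b = its frequency in sample and reference;
    c, d = total tokens in sample and reference."""
    e1 = c * (a + b) / (c + d)  # expected frequency in the sample
    e2 = d * (a + b) / (c + d)  # expected frequency in the reference
    ll = 0.0
    if a:
        ll += a * math.log(a / e1)
    if b:
        ll += b * math.log(b / e2)
    return 2 * ll


def keywords(sample_text, reference_text, top_n=5):
    """Return the top_n words that are unusually frequent in the sample."""
    sample = Counter(tokenize(sample_text))
    ref = Counter(tokenize(reference_text))
    c, d = sum(sample.values()), sum(ref.values())
    if c == 0 or d == 0:
        raise ValueError("both corpora must contain at least one token")
    scored = []
    for word, a in sample.items():
        b = ref.get(word, 0)
        # Only keep words relatively more frequent in the sample.
        if a / c > b / d:
            scored.append((log_likelihood(a, b, c, d), word))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [word for _, word in scored[:top_n]]
```

The words returned by `keywords` could then be fed back in as the seed list for the next BootCat run.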

When I was looking to build corpora related to web design, the results I was getting were disappointing. Then I hit upon the idea of using already existing categories from A List Apart, a well-known site for web designers. They have a topics section, so for example their top-level topic of Design includes Brand Identity, Graphic Design, Layout & Grids, Mobile/Multidevice, Responsive Design, and Typography & Web Fonts.

So I used these as seed words and the resulting corpus was much better. I intend to take a similar approach to build corpora reflecting two other A List Apart top-level topics, Code and User Experience.

Apparently there is a new version of BootCat coming with some neat new features.

FYI, if you already have a website that you want to collect texts from, have a read of my post here –

See also – BootCat Custom URL

