At some point when dealing with texts you will have to use the #commandline and #regularexpressions, as they make life much easier, although of course there is a steep learning curve.
Needs must, so I recently bit the bullet and wrote my first #regex commands.
I am working with text from The Setup http://usesthis.com/, a series of interviews which poses four simple questions about the equipment setup that people use in their work.
Now this website is great not only for the interesting interviews but also because the project is hosted on GitHub, so one can very easily get a copy of all the interviews (i.e. you need a GitHub account, then you can copy (fork) the project and download the files).
Once you have the text files they need a little cleaning since the words which are linked are given in square brackets and the http link is put into round brackets e.g. :
I'm [Alex Payne](http://al3x.net/ "Alex's website."). I go by [al3x](http://twitter.com/al3x "Alex's Twitter account.") around the Interwho. I work at [Twitter](http://twitter.com/ "Micro-blogging FTW.") in San Francisco as their [API](http://apiwiki.twitter.com/ "The Twitter API Wiki.") Lead.
I want to remove only the square brackets [ ] and everything between the round brackets, including the round brackets.
A command for removing only the square brackets using the #sed command is:
sed -i.bak 's/[][]//g' file.txt
The -i.bak is really two things in one: the -i means edit the file in place, i.e. overwrite the original file with the output of the sed program (both the lines that were changed and the lines that were left alone), and the .bak means first keep a backup of the original file with .bak added to its name (file.txt.bak).
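To see what -i.bak does before trusting it with real files, a small sketch like the one below may help. The directory, file contents and the s/old/new/ substitution are all made up for the illustration; only the -i.bak option comes from the command above.

```shell
# Make a scratch directory and a one-line test file (both hypothetical).
demo_dir=$(mktemp -d)
printf '%s\n' 'old text' > "$demo_dir/file.txt"

# Edit in place; the untouched original is kept as file.txt.bak.
sed -i.bak 's/old/new/' "$demo_dir/file.txt"

cat "$demo_dir/file.txt"       # prints: new text
cat "$demo_dir/file.txt.bak"   # prints: old text
```

If the result is not what you wanted, the .bak file can simply be copied back over the edited one.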
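The 's/[][]//g' is the interesting part as this contains the regular expression.
The normal structure of the substitute sed s command is s/pattern/replacement/flags
In this case the characters I want to match are [ and ], but [ already has a special meaning, i.e. [abc] will match a or b or c. A closing square bracket placed first inside such a set is treated as an ordinary character, so you can put the closing square bracket first and then the opening square bracket, i.e. [][] will match ] or [, which is the same as matching [ or ].
The replacement is simply empty (there is nothing between the last two slashes), so whatever matches is deleted. The g means replace all occurrences of the regex pattern on each line, not just the first one.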
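A safe way to try the pattern out is to leave off -i and pipe a single line through sed, so that nothing on disk is changed. This is only a sketch; the variable names are my own and the sample line is cut down from the interview text above.

```shell
# One sample line in the same link style as the interview files.
line='I work at [Twitter](http://twitter.com/ "Micro-blogging FTW.") in San Francisco.'

# Without -i, sed reads from the pipe and prints the result to the screen.
no_square=$(printf '%s\n' "$line" | sed 's/[][]//g')

printf '%s\n' "$no_square"
# prints: I work at Twitter(http://twitter.com/ "Micro-blogging FTW.") in San Francisco.
```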
Now I want to delete the round brackets and everything in between them. The command is:
sed -i.bak2 's/([^()]*)//g' file.txt
Notice I called the new backup bak2 so I can easily distinguish between the results of the two commands. It is possible to combine commands, but I have not figured that out yet!
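The regex looks for the opening bracket (. In sed's default regex syntax a plain round bracket is just an ordinary character, so it needs no escaping.
Then comes anything in between that is not an opening or closing bracket: a caret ^ as the first character inside square brackets means "not any of the following characters", i.e. not ( or ).
The star means match zero or more of such characters which are not an opening bracket or closing bracket.
Then finally match a closing bracket ). Bear in mind that this also deletes any ordinary remarks in round brackets in the text, not just the links.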
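The same trick of piping one line through sed without -i works for checking this second pattern. The sketch below also tries one way of combining the two substitutions, by giving sed the -e option once per expression; I have only tried this on small samples, so treat it as a starting point. The variable names are my own.

```shell
# One sample line in the same link style as the interview files.
line='I work at [Twitter](http://twitter.com/ "Micro-blogging FTW.") in San Francisco.'

# Second pattern on its own: removes ( ... ) including the brackets.
no_round=$(printf '%s\n' "$line" | sed 's/([^()]*)//g')
printf '%s\n' "$no_round"
# prints: I work at [Twitter] in San Francisco.

# Both substitutions in one go: each -e adds one expression, applied in order.
both=$(printf '%s\n' "$line" | sed -e 's/[][]//g' -e 's/([^()]*)//g')
printf '%s\n' "$both"
# prints: I work at Twitter in San Francisco.
```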
Not sure how useful my particular example will be to people here but I wanted to highlight the fact that understanding regular expressions and using the command line saves a lot of time.
For example, at the time of writing this there are 395 interview files; imagine having to open each of them in a text editor and doing find and replace! On the command line it is a simple case of using a Unix wildcard for the file names, in my case 2*.txt, as the files are labelled year-month-day-nameofinterviewee.txt
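As a rough sketch of the wildcard in action, the example below builds two throwaway files named in the same year-month-day style and cleans them with one command. The directory, the file names and their contents are invented for the illustration, and the two substitutions are given together using sed's -e option.

```shell
# Scratch directory with two made-up files named like the interviews.
workdir=$(mktemp -d)
printf '%s\n' 'I use [Vim](http://www.vim.org/ "A text editor.").' > "$workdir/2009-01-01-jane.doe.txt"
printf '%s\n' 'I use a [MacBook](http://apple.com/macbook/ "A laptop.").' > "$workdir/2010-02-02-john.doe.txt"

# The shell expands 2*.txt into the list of matching files before sed runs;
# with -i.bak each file is edited in place and gets its own .bak copy.
sed -i.bak -e 's/[][]//g' -e 's/([^()]*)//g' "$workdir"/2*.txt

cat "$workdir"/2*.txt
# prints:
# I use Vim.
# I use a MacBook.
```

The backups end in .txt.bak, so they do not match 2*.txt and are left alone if the command is run again.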
There are numerous resources online on using regex and the sed command; I learnt a bit because I needed to avoid going through 395 text files by hand!
Thanks for reading.