Commit 257c6a2f authored by Toby Hodges

Merge branch 'THchangesJan2017' into 'master'

Sections and exercises for `rev`, `fmt`, and `xargs`

A few new sections, plus some typo and formatting fixes, in preparation for the January 2017 course.

See merge request !5
parents 7b91b3f8 5e1261d6
......@@ -18,6 +18,7 @@ Special thanks go to contributors / helping hands (alphabetical order):
* Christian Arnold
* Jean-Karim Hériché
* Nicolai Karcher
* Yan Ping Yuan
* Bora Uyar
* Thomas Zichner
......@@ -59,9 +59,10 @@ Running Linux Commands in Windows
Babun
""""""
The easiest way t oget a linux-like console on a Windows host is probably `babun <http://babun.github.io/>`!
The easiest way to get a linux-like console on a Windows host is probably `babun <http://babun.github.io/>`!
Babun features the following:
- Pre-configured Cygwin with a lot of addons
- Command-line installer, no admin rights required
- advanced package manager (like apt-get or yum)
......
......@@ -28,9 +28,10 @@ script.
Running a Script
================
There are basically three ways to run a script:
There are basically three ways to run a script, regardless of the language in which the
script is written:
a) the location to your script is not in your ``$PATH`` variable, then you have to specify the full path to the script:
a) where the location to your script is not in your ``$PATH`` variable, then you have to specify the full path to the script:
::
......@@ -38,7 +39,7 @@ a) the location to your script is not in your ``$PATH`` variable, then you have
[...]
$
b) the location to the script is in the ``$PATH`` variable, then you can simply type its name:
b) where the location to the script is in the ``$PATH`` variable, then you can simply type its name:
::
......@@ -49,7 +50,7 @@ b) the location to the script is in the ``$PATH`` variable, then you can simply
In both situations, the script will need to have execute permissions to be run. If for some
reason you can only read but not execute the script, then it can still be run in the following way:
c) specifying the :index:`interpreter` (i.e. the program required to run the script). For shellscripts this is the appropriate shell). The full path (relative or absolute) to the script has to be provided in this case, no matter whether the script location is already contained in ``$PATH`` or not:
c) by specifying the :index:`interpreter` (i.e. the program required to run the script; for shellscripts this is the appropriate shell). The full path (relative or absolute) to the script has to be provided in this case, no matter whether the script location is already contained in ``$PATH`` or not:
::
......@@ -290,7 +291,7 @@ A) Evaluating the exit status of a command: Simply use the command as condition.
.. Note:: In `csh/tcsh`
a) To evaluate the exit status of a command in it must be
a) To evaluate the exit status of a command it must be
placed within curly brackets with blanks separating the brackets from the
command: ``if ({ grep -q root /etc/passwd }) then [...]``
b) Redirection of commands in conditions does not work
......@@ -302,11 +303,33 @@ B) Evaluating of conditions or comparisons:
Conditions and comparisons are evaluated using a special :index:`command <test>` ``test`` which is
usually written :index:`as <[>` "``[``" (no joke!). As "``[``" is a command, it must be followed by
a blank. As a speciality the "``[``" command must be :index:`ended <]>` with "``]``" (note the
a blank. As a speciality the "``[``" command must be :index:`ended <]>` with "`` ]``" (note the
preceding blank here)
.. Note:: In csh/tcsh the ``test`` (or ``[``) command is not needed. Conditions and comparisons are directly placed within the round braces.
Watch Out For The Exit Code!
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
It's important to consider the exit status of conditional blocks. An ``if-then-else``
block will return exit code 0, indicating success, as long as no errors were
encountered during execution. This means that, if you use an ``if-then-elif``
block (i.e. without an ``else`` statement), your script will run successfully
regardless of whether any of the conditions were actually met.
This might be what you want to happen, but in most circumstances it is good practise
to include an ``else`` statement, to specify the desired behaviour when none of the
expected conditions have been met. You could use this ``else`` block to exit the script
with a non-zero code, print an error message, or anything else that could be useful
for debugging in future.
Remember that it is often difficult to foresee every possible input/use case when
you first write a script, and being diligent now will probably save you a lot of
time and head-scratching in the future!
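The behaviour described above can be sketched as follows. This is a minimal illustration assuming a bash-compatible shell; the function names ``classify_no_else`` and ``classify_with_else`` and the file types are hypothetical and not part of the course material:

```shell
#!/bin/bash
# Hypothetical helper: no else branch, so unexpected input falls through silently.
classify_no_else() {
    if [ "$1" = "fasta" ]; then
        echo "sequence file"
    elif [ "$1" = "pdb" ]; then
        echo "structure file"
    fi
}

# Hypothetical helper: the else branch reports the problem and returns non-zero.
classify_with_else() {
    if [ "$1" = "fasta" ]; then
        echo "sequence file"
    elif [ "$1" = "pdb" ]; then
        echo "structure file"
    else
        echo "unknown file type: $1" >&2
        return 1
    fi
}

# Neither condition is met for "xyz", yet the first version still reports success.
classify_no_else "xyz"
status_no_else=$?

classify_with_else "xyz" 2>/dev/null
status_with_else=$?

echo "without else: $status_no_else, with else: $status_with_else"
```

Only the second version lets a calling script notice that the input was not handled.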
+--------------------+------------------------------------------------------+--------------------+
| **sh/bash** | | **csh/tcsh** |
+--------------------+------------------------------------------------------+--------------------+
......@@ -454,6 +477,7 @@ Example:
echo '/opt and /usr are not contained in $PATH'
;;
esac
.. Note:: Just like ``if-then-else`` blocks (see "Watch Out For The Exit Code!" in the previous section), a ``case`` block will return exit code 0 regardless of whether any of its options were matched during execution. Always try to design an "in all other circumstances" option that is guaranteed to match, so that your script will sensibly handle situations where the value(s) passed to ``case`` don't fall into any of your expected categories. Remember that cases are given priority by the order in which they appear in the block, so make your "catch-all" case non-specific and place it last in the block to match anything that wasn't picked up by the other options.
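A catch-all option of this kind might look like the sketch below, assuming a bash-compatible shell. The function name ``check_answer``, the accepted answers and the exit code 2 are hypothetical choices made for illustration:

```shell
#!/bin/bash
# Hypothetical helper: the final pattern, *, matches anything not handled above it.
check_answer() {
    case "$1" in
        y|yes)
            echo "proceeding"
            ;;
        n|no)
            echo "aborting"
            ;;
        *)
            # Catch-all: report the unexpected value and signal failure.
            echo "unexpected answer: $1" >&2
            return 2
            ;;
    esac
}

check_answer "maybe" 2>/dev/null
catch_all_status=$?
echo "exit status for unexpected input: $catch_all_status"
```

Because patterns are tried in order, placing ``*`` anywhere but last would hide the more specific options below it.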
Loops
-----
......@@ -848,8 +872,8 @@ Three variants for the same (print out who you are in English text): ::
Create Temporary Files
----------------------
You can create temporary files with mktemp. By default it will create a new
file in /tmp and print its name: ::
You can create :index:`temporary files <temporary files>` with ``mktemp``.
By default it will create a new file in /tmp and print its name: ::
$ mktemp
/tmp/tmp.Yaafh19370
......
......@@ -149,7 +149,7 @@ Count the number of fasta sequences (they start with a ">") in a file:
::
# grep -c '>' twofiles.fasta
# grep -c '>' twoseqs.fasta
2
List all files containing the term "Ensembl":
......@@ -171,7 +171,7 @@ Search a file compressed with ``gzip`` using ``zgrep``:
REV
---
:index:`rev` is a tool that reverses lines of input.
:index:`rev <rev>` is a tool that reverses lines of input.
**Usage**: ``rev file``
......@@ -199,7 +199,7 @@ be reversed to restore the original orientation of the input file.
FMT
---
:index:`fmt` is used to control the format of text input.
:index:`fmt <fmt>` is used to control the format of text input.
**Usage**: ``fmt [options] file(s)``
......@@ -222,7 +222,7 @@ of values into a single column:
XARGS
-----
:index:`xargs` can be used to provide file contents or output of one command as arguments
:index:`xargs <xargs>` can be used to provide file contents or output of one command as arguments
to the next.
**Usage**: ``xargs [options] [ tool [options] [arguments] ]``
......@@ -235,10 +235,9 @@ By default, ``xargs`` passes the strings given to it onto the ``echo`` command.
KPLGVALTNRFGEDADERID
RPIGPEIQNRFGENAEERIP
RSVATQVFNRFGDDTESKLP
RAIGAELQNRFSNDAEQRIP
# cat motifs.txt | xargs
KPLGVALTNRFGEDADERID RPIGPEIQNRFGENAEERIP RSVATQVFNRFGDDTESKLP RAIGAELQNRFSNDAEQRIP
KPLGVALTNRFGEDADERID RPIGPEIQNRFGENAEERIP RSVATQVFNRFGDDTESKLP
In this way we can achieve the reverse of the row vector -> column operation performed in
the ``fmt`` example above. But ``xargs`` can be used for much more powerful things than
......@@ -265,10 +264,9 @@ tool/command that we want ``xargs`` to pass the strings to as arguments.
EMBL
One of the most common uses of ``xargs`` is in combination with the ``find`` command, allowing
the user to operate on multiple files across multiple locations at once. For example, to
search for the word 'protein' in all ``.txt`` files underneath the 'Documents' directory, we
could use the approach below:
Use ``xargs`` in combination with the ``find`` command, allowing you to operate on multiple
files across multiple locations at once. For example, to search for the word 'protein' in
all ``.txt`` files underneath the 'Documents' directory, we could use the approach below:
::
......@@ -288,11 +286,9 @@ throughout the filesystem.
# find /tmp -name '*.tmp' | xargs rm
The command above will find any files with '.tmp' extension and pass them to ``rm`` for
deletion. Of course, care should always be taken when using commands that alter the
filesystem, such as ``rm`` and ``mv``, so you need to be sure that you know what's going to
happen before you execute a command like the one above. Helpfully, ``xargs`` provides an
option ``-p`` that will prompt the user before executing commands.
Take care whenever you use commands like ``rm`` and ``mv`` that overwrite/remove files
permanently. Helpfully, ``xargs`` provides an option ``-p`` that will prompt the user
before executing commands.
::
......@@ -307,24 +303,27 @@ these large files.
If you need to control where exactly the strings passed to ``xargs`` are placed in the
command that it subsequently calls, use the ``-I`` option:
::
# find /home/toby/alignments -name "*.fasta" | xargs -I OLDFASTA mv OLDFASTA OLDAFASTA.old
# find /home/toby/alignments -name "*.fasta" | xargs -I OLDFASTA mv OLDFASTA OLDFASTA.old
Useful options:
========== ===================================
Option: Effect:
========== ===================================
``-n INT`` pass INT strings as arguments to each invocation of tool
``-0`` use NULL as separator (good for handling strings/filenames containing spaces)
``-t`` echo commands to STDERR as they are executed
========== ===================================
============= ===================================
Option: Effect:
============= ===================================
``-n INT`` pass INT strings as arguments to each invocation of tool
``-0`` use NULL as separator (good for handling strings/filenames containing spaces)
``-t`` echo commands to STDERR as they are executed
``-p`` prompt with command before execution
``-I STRING`` specify placeholder name for arg
============= ===================================
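The ``-0`` option is typically paired with the ``-print0`` option of ``find``, which separates filenames with a NUL character instead of a newline. The sketch below assumes a ``find`` and ``xargs`` that support these options (both the GNU and BSD versions do); the temporary directory and the filenames are made up for the example:

```shell
#!/bin/bash
# Work in a throwaway directory so that nothing important is touched.
workdir=$(mktemp -d)
touch "$workdir/my notes.tmp" "$workdir/data.tmp"

# -print0 ends each filename with NUL; -0 makes xargs split on NUL only,
# so "my notes.tmp" is passed on as one argument despite the space.
found_count=$(find "$workdir" -name '*.tmp' -print0 | xargs -0 -n1 echo | wc -l)

# Delete the files in the same way, then count what is left.
find "$workdir" -name '*.tmp' -print0 | xargs -0 rm
remaining=$(find "$workdir" -name '*.tmp' | wc -l)

rm -r "$workdir"
echo "found: $found_count, remaining after rm: $remaining"
```

Without ``-print0`` and ``-0``, ``xargs`` would split "my notes.tmp" at the space and ask ``rm`` to delete two files that do not exist.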
SED
---
:index:`sed` is a Stream EDitor, it modifies text (text can be a file or a pipe) on the fly.
:index:`sed <sed>` is a Stream EDitor, it modifies text (text can be a file or a pipe) on the fly.
**Usage**: ``sed command file``,
......@@ -384,7 +383,7 @@ the Linux command ``rev`` to reverse the output of the ``sed`` command:
# echo "AGTGGCTAAGTCCCTTTAATCAGG" | sed 'y/ACGT/UGCA/' | rev
CCUGAUUAAAGGGACUUAGCCACU
When used on a file, sed prints the file to standard output, replacing text as it goes
When used on a file, ``sed`` prints the file to standard output, replacing text as it goes
along:
::
......@@ -395,7 +394,7 @@ along:
This is stuff
This is even more stuff
sed can also be used to print certain lines (not replacing text) that match a pattern.
``sed`` can also be used to print certain lines (not replacing text) that match a pattern.
For this you leave out the leading 's' and just provide a pattern: '/PATTERN/p'. The
trailing letter determines, what sed should do with the text that matches the pattern
('p': print, 'd': delete)
......@@ -407,7 +406,7 @@ trailing letter determines, what sed should do with the text that matches the pa
This is even more text
This is even more text
As sed by default prints each line, you see the line that matched the pattern,
As ``sed`` by default prints each line, you see the line that matched the pattern,
printed twice. Use option '-n' to suppress default printing of lines.
::
......@@ -422,7 +421,7 @@ Delete lines matching the pattern:
# sed '/more/d' textfile
This is text
Multiple sed statements can be applied to the same input stream by prepending
Multiple ``sed`` statements can be applied to the same input stream by prepending
each by option '-e' (edit):
::
......@@ -431,7 +430,7 @@ each by option '-e' (edit):
That is good stuff
That is even more good stuff
Normally, sed prints the text from a file to standard output. But you can also edit
Normally, ``sed`` prints the text from a file to standard output. But you can also edit
files in place. Be careful - this will change the file! The '-i' (in-place editing) won't
print the output. As a safety measure, this option will ask for an extension that will
be used to rename the original file to. For instance, the following option '-i.bak'
......@@ -452,9 +451,9 @@ AWK
---
:index:`awk` is more than just a command, it is a complete text processing language (the
name is an abbreviation of the author's names).
name is an acronym of the authors' names).
Each line of the input (file or pipe) is treated as a record and is broken into fields.
Generally, awk commands are of the form: ::
Generally, ``awk`` commands are of the form: ::
awk condition { action }
......@@ -475,7 +474,7 @@ lines that match the condition.
# awk '/more/ {print}' textfile
This is even more text
awk reads each line of input and automatically splits the line into columns. These
``awk`` reads each line of input and automatically splits the line into columns. These
columns can be addressed via $1, $2 and so on ($0 represents the whole line).
So an easy way to print or rearrange columns of text is:
......@@ -487,7 +486,7 @@ So an easy way to print or rearrange columns of text is:
# echo "Master Obi-Wan has lost a planet" | awk '{print $4,$5,$6,$1,$2,$3}'
lost a planet Master Obi-Wan has
awk splits text by default on whitespace (spaces or tabs), which might not be ideal in all situations. To change the
``awk`` splits text by default on whitespace (spaces or tabs), which might not be ideal in all situations. To change the
field separator (FS), use option '-F' (remember to quote the field separator):
::
......@@ -518,7 +517,7 @@ pattern 'PDBsum' (case sensitive):
...
awk really is powerful in filtering out columns, you can for instance print only
``awk`` really is powerful in filtering out columns, you can for instance print only
certain columns of certain lines. Here we print the third column of those lines
where the second column is 'PDBsum':
......@@ -628,8 +627,8 @@ variables>`. By convention, environment variables are written in uppercase
letters.
**Shell variables** are **only available to the current shell** and not inherited when
you start an other shell or script from the commandline. Consequently, these
variables will not be available for your shellscripts.
you start another shell or script from the commandline. Consequently, these variables
will not be available for your shellscripts.
**Environment variables** are **passed on** to shells and scripts started from your
current shell.
......@@ -752,7 +751,7 @@ Tips and Tricks
Quoting
-------
In Programming it is often necessary to "glue together" certain words. Usually, a program or
In programming it is often necessary to "glue together" certain words. Usually, a program or
the shell splits sentences by whitespace (space or tabulators) and treats each word
individually. In order to tell the computer that certain words belong together, you need to
":index:`quote <quoting>`" them, using either single (') or double (") quotes. The difference between these two is
......@@ -840,7 +839,7 @@ annoying errors due to typos.
Tab-Completion: A Reminder
^^^^^^^^^^^^^^^^^^^^^^^^^^
You're probably already aware of tab-completion, where you push the ``tab`` key to
You're probably already aware of tab-completion, where you push the ``TAB`` key to
complete the name of a command, file, directory, etc. This is a huge time-saver and great
tool for preventing the accidental inclusion of errors.
......
>sequence1
AGTGTTGGATTTAAAGCTGGTGTTAAAGATTACAGATTGACTTATTATACTCCTGATTACGAAACCAAAG
ATACTGATATCTTGGCAGCATTCCGAGTAACTCCTCAACCTGGGGTTCCCCCTGAAGAGGCAGGGGCTGC
GGTAGCTGCGGAATCTTCTACTGGTACATGGACAACTGTGTGGACTGATGGACTTACCAGTCTTGATCGT
TACAAAGGACGATGCTACCACATTGAGGCCGTTGTTGGGGAAGAAAATCAATACATTGCTTATGTAGCTT
ATCCTTTAGACCTTTTTGAAGAAGGTTCTGTTACTAACATGTTTACTTCCATTGTAGGTAATGTATTTGG
TTTCAAAGCCCTACGAGCTCTACGTCTGGAGGATCTGCGAATTCCCCCTGCTTATTCCAAAACTTTCCAA
GGCCCGCCTCACGGCATCCAAGTTGAAAGAGATAAATTGAACAAATATGGTCGTCCCCTATTGGGATGTA
CTATTAAACCAAAATTGGGATTATCTGCAAAAAACTACGGTAGAGCGGTTTATGAATGTCTA
>sequence2
CATCTAGAAAAACCCCAATAGACGCATCAGCCACCCTTCCTCACTTTAATACTCGCGATTGCTTATATTG
CCTCCTGTGCCACTCGAGCCATTCCACCCATTGGTTATTCTGATTATACTCATGCGAGGCATGACGGGCA
TACCTTGCATCTCTGCCTCTACATCGCATCGTCAAAGGGGTCAAAAGTGCAATTCGGCTAGTTCCTTTAA
AGCCATCGAACAGCCCAACCGCTGCAAGCTTATTGCATAAATGCACAGAACGGAACCTCGGTTTAACGGA
TGAACACTTGCCACAAACCAATAAAACCTAT
......@@ -33,6 +33,25 @@ GREP
5. Does this number agree with the annotated number of atoms? The PDB file has a comment which tells you how many atoms there are annotated in this file. This comment can be found by searching for the term "protein atoms" (use quotes and case insensitive search here!).
REV
---
1. By combining ``rev`` with the ``cut`` tool, print the last word of each line in DNA.txt. Make sure that the words are readable when they are printed out.
XARGS
-----
1. Use ``xargs`` to print the first line of the files listed in to_be_previewed.txt
2. Create a copy of each of these files by passing the lines in to_be_copied.txt two-at-a-time to ``cp``
3. A better way to back up these files might be to keep the original names while copying them. Make another copy of each file listed in to_be_previewed.txt, adding ".backup" onto the end of each filename. (Hint: remember the "-I" option!)
4. ADVANCED: we've made a bit of a mess now, and it's time to clean up. Make a new directory called 'garbage', into which you will move all these new files by combining the ``find`` tool with ``xargs`` and ``mv``. Use ``find`` to find all files in the current directory that were last modified less than ten minutes ago, and ``xargs`` with ``mv`` to change their location. BE CAREFUL!
.. Hint:: you'll need to check out the options available for ``find``, and you might consider using the "-p" option with ``xargs`` to help avoid accidentally deleting something that you might regret! Once you've moved the files, check the contents of the 'garbage' directory and, if you're sure that you don't want any of those files anymore, delete them and the directory.
SED
---
......@@ -59,7 +78,7 @@ AWK
b. Now use awk to show all lines containing "17".
c. Next try show only those lines where column three equals 17 (Hint: The file is semicolon-separated...).
c. Next try to show only those lines where column three equals 17 (Hint: The file is semicolon-separated...).
d. Finally print the PMIDs (column 6) of all lines that contain "17" in column 3.
......
......@@ -112,6 +112,101 @@ GREP
3600
REV
---
1. By combining ``rev`` with the ``cut`` tool, print the last word of each line in DNA.txt. Make sure that the words are readable when they are printed out.
::
$ rev DNA.txt | cut -d' ' -f1 | rev
the
known
of
life.
adenine,
DNA
[...]
XARGS
-----
1. Use ``xargs`` to print the first line of the files listed in to_be_previewed.txt
::
$ cat to_be_previewed.txt | xargs head -n1
==> 3UA7.pdb <==
HEADER TRANSFERASE/VIRAL PROTEIN 21-OCT-11 3UA7
==> ENST00000380152.fasta <==
>ENSG00000139618:ENST00000380152 cds:KNOWN_protein_coding
==> ENST00000530893.fasta <==
>ENSG00000139618:ENST00000530893 cds:KNOWN_protein_coding
[...]
2. Create a copy of each of these files by passing the lines in to_be_copied.txt two-at-a-time to ``cp``
::
$ ls -1t *.{pdb,txt,fasta}
to_be_copied.txt
to_be_previewed.txt
tabular_data.txt
twoseqs.fasta
files.txt
motifs.txt
1Y57.pdb
3UA7.pdb
DNA.fasta
[...]
$ cat to_be_copied.txt | xargs -n2 cp
$ ls -1t *.{pdb,txt,fasta}
sequenceA.fasta
sequenceB.fasta
sequenceC.fasta
sequenceD.fasta
sequenceE.fasta
structure.pdb
text.txt
to_be_copied.txt
to_be_previewed.txt
[...]
3. A better way to back up these files might be to keep the original names while copying them. Make another copy of each file listed in to_be_previewed.txt, adding ".backup" onto the end of each filename. (Hint: remember the "-I" option!)
::
$ cat to_be_previewed.txt | xargs -I FILENAME cp FILENAME FILENAME.backup
$ ls -1 *.backup
3UA7.pdb.backup
ENST00000380152.fasta.backup
ENST00000530893.fasta.backup
ENST00000544455.fasta.backup
P04062.fasta.backup
P05480.fasta.backup
P12931.fasta.backup
PROTEINS.txt.backup
4. ADVANCED: we've made a bit of a mess now, and it's time to clean up. Make a new directory called 'garbage', into which you will move all these new files by combining the ``find`` tool with ``xargs`` and ``mv``. Use ``find`` to find all files in the current directory that were last modified less than ten minutes ago, and ``xargs`` with ``mv`` to change their location. BE CAREFUL!
.. Hint:: you'll need to check out the options available for ``find``, and you might consider using the "-p" option with ``xargs`` to help avoid accidentally deleting something that you might regret! Once you've moved the files, check the contents of the 'garbage' directory and, if you're sure that you don't want any of those files anymore, delete them and the directory.
::
$ mkdir garbage
$ find . -type f -mmin -10 | xargs -I FILENAME -p mv FILENAME garbage/
$ ls garbage
# either (risky but quicker)
$ rm -r garbage
# or (safer but slower)
$ rm -ri garbage/*
$ rmdir garbage
SED
---
......