Commandline Exercises
======================

TAR & GZIP
----------

1. Use :index:`gzip <gzip>` to compress the file P12931.txt

2. Decompress the resulting file P12931.txt.gz (revert previous command)

3. Use :index:`tar <tar>` to create an archive called "fastafiles.tar" containing all FASTA files in the current directory

4. Use gzip to compress the archive "fastafiles.tar"

5. How can you achieve the two previous steps "using tar to create archive" and "gzip the archive" in one command? 

6. Test (list the contents of) the compressed archive "fastafiles.tar.gz"

7. Download the compressed PDB file for entry 1Y57 from rcsb.org (e.g. ``wget "http://www.rcsb.org/pdb/files/1Y57.pdb.gz"``) and decompress it.
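
The options involved in these exercises can be tried out safely on throwaway data first. The following is a minimal sketch, assuming GNU or BSD ``tar`` and ``gzip``; the directory and the file names ``a.fasta``, ``b.fasta`` and ``demo.tar.gz`` are made up for illustration and are not part of the course material.

```shell
# Work in a temporary directory so that no course files are touched.
workdir=$(mktemp -d)
cd "$workdir"

# Two hypothetical FASTA files to experiment with.
printf '>seq1\nACGT\n' > a.fasta
printf '>seq2\nTTGA\n' > b.fasta

# -c creates an archive, -z filters it through gzip, -f names the archive.
tar -czf demo.tar.gz *.fasta

# -t lists the contents of the archive without extracting anything.
tar -tzf demo.tar.gz > listing.txt
cat listing.txt

# gzip replaces a file with a compressed .gz version;
# gunzip (equivalent to gzip -d) reverses this.
gzip a.fasta
gunzip a.fasta.gz
```

Listing an archive with ``-t`` before extracting it is a cheap way to check that it contains what you expect.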

 
GREP
----

1. Which of the DNA files ENST0* contains "TATATCTAA" as part of the sequence? 

2. List only the names of the DNA files ENST0* that contain "CAACAAA" as part of the sequence.

3. Considering the previous example, would you consider grep a suitable tool to perform motif searches? Why not? Try to find the pattern "CAACAAA" by manual inspection of the first three lines of each sequence.

4. Count the number of ATOMs in the file 1Y57.pdb. 

5. Does this number agree with the annotated number of atoms? The PDB file has a comment which tells you how many atoms are annotated in this file. This comment can be found by searching for the term "protein atoms" (use quotes and a case-insensitive search here!).
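
Because grep works line by line, it can help to see how it behaves on a sequence that is wrapped over several lines. The sketch below uses a made-up file ``demo.fa`` (not one of the ENST0* files) in which a motif is split across a line break.

```shell
# Work in a temporary directory so that no course files are touched.
workdir=$(mktemp -d)
cd "$workdir"

# Hypothetical sequence: the motif CAACAAA spans the line break.
printf '>demo\nGGGCAAC\nAAATTT\n' > demo.fa

# -c counts matching lines; grep inspects each line on its own.
# "|| true" is only there because grep exits non-zero when nothing matches.
linewise=$(grep -c 'CAACAAA' demo.fa || true)

# Drop the header line, join the sequence lines, then search again.
joined=$(grep -v '^>' demo.fa | tr -d '\n' | grep -c 'CAACAAA')

echo "line-wise: $linewise, joined: $joined"
```

Comparing the two counts shows what a purely line-based search can miss.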

REV
---

1. By combining ``rev`` with the ``cut`` tool, print the last word of each line in DNA.txt. Make sure that the words are readable when they are printed out.
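
The idea of reversing, cutting and reversing again can be sketched on toy input. The file ``words.txt`` below is hypothetical and assumes words separated by single spaces; DNA.txt may be laid out differently.

```shell
# Work in a temporary directory so that no course files are touched.
workdir=$(mktemp -d)
cd "$workdir"

# Hypothetical input with a different number of words per line.
printf 'alpha beta gamma\none two\n' > words.txt

# Reverse each line so the last word comes first, take the first
# space-separated field, then reverse again to make it readable.
rev words.txt | cut -d ' ' -f 1 | rev > last_words.txt
cat last_words.txt
```

``cut`` can only count fields from the start of a line, which is why the line is reversed first.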


XARGS
-----

1. Use ``xargs`` to print the first line of the files listed in to_be_previewed.txt

2. Create a copy of each of these files by passing the lines in to_be_copied.txt two-at-a-time to ``cp``

3. A better way to back up these files might be to keep the original names while copying them. Make another copy of each file listed in to_be_previewed.txt, adding ".backup" onto the end of each filename. (Hint: remember the "-I" option!)

4. ADVANCED: We've made a bit of a mess now, and it's time to clean up. Make a new directory called 'garbage' and move all these new files into it by combining the ``find`` tool with ``xargs`` and ``mv``. Use ``find`` to find all files in the current directory that were last modified less than ten minutes ago, and ``xargs`` with ``mv`` to change their location. BE CAREFUL! (Hint: you'll need to check out the options available for ``find``, and you might consider using the "-p" option with ``xargs`` to help avoid accidentally deleting something that you might regret!) Once you've moved the files, check the contents of the 'garbage' directory and, if you're sure that you don't want any of those files anymore, delete them and the directory.
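
The ``xargs`` options used in this section can be rehearsed on throwaway files before touching real data. This is a sketch only: the files ``one.txt``, ``two.txt`` and ``list.txt`` are invented, ``-maxdepth``, ``-mmin`` and ``-print0`` are extensions found in GNU and BSD ``find`` rather than POSIX, and the interactive ``-p`` option is left out so that the example runs unattended.

```shell
# Work in a temporary directory so that no course files are touched.
workdir=$(mktemp -d)
cd "$workdir"

# Two hypothetical files and a list naming them, one per line.
printf 'first line of one\nsecond\n' > one.txt
printf 'first line of two\nsecond\n' > two.txt
printf 'one.txt\ntwo.txt\n' > list.txt

# -n 1 runs the command once per input item.
xargs -n 1 head -n 1 < list.txt > previews.txt

# -I {} substitutes each input line wherever {} appears.
xargs -I {} cp {} {}.backup < list.txt

# Find recently modified backup files and move them.
# -print0 and -0 keep file names with spaces or newlines intact.
mkdir garbage
find . -maxdepth 1 -type f -name '*.backup' -mmin -10 -print0 |
    xargs -0 -I {} mv {} garbage/
ls garbage
```

Restricting ``find`` with ``-name`` here is a deliberate safety choice for the demo; running the ``find`` part on its own first shows exactly which files would be passed on to ``mv``.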


SED
---

1. Use sed to print only those lines that contain "version" in the files P05480.txt and P04062.txt

2. Use sed to change the text "sequence version 3" to "sequence version 4" in the files P05480.txt and P04062.txt (without actually changing the files, just printing) 

3. Use sed to update the text "sequence version 3" to "sequence version 4" in the files P05480.txt and P04062.txt (this time, make the changes directly in the files) 

4. Replace (transliterate) all occurrences of "r" by "l" and "l" by "r" (at the same time) in the file PROTEINS.txt (so that "structural" becomes "stluctular") 
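
The four sed features these exercises rely on (printing matches, substituting, editing in place and transliterating) can be sketched on a small made-up file. The file ``demo.txt`` and its contents are assumptions for illustration, and the ``-i`` form shown is the GNU sed syntax (BSD/macOS sed expects ``-i ''``).

```shell
# Work in a temporary directory so that no course files are touched.
workdir=$(mktemp -d)
cd "$workdir"

# Hypothetical input file.
printf 'entry version 10\nsequence version 3\nstructural role\n' > demo.txt

# -n suppresses the default output; p prints only the matching lines.
sed -n '/version/p' demo.txt > matches.txt

# s/old/new/ writes the edited text to stdout; demo.txt is unchanged.
sed 's/sequence version 3/sequence version 4/' demo.txt > preview.txt

# -i applies the same substitution directly to the file (GNU sed).
sed -i 's/sequence version 3/sequence version 4/' demo.txt

# y/rl/lr/ maps characters one-to-one in a single pass.
echo 'structural' | sed 'y/rl/lr/' > swapped.txt
cat swapped.txt
```

Running the substitution without ``-i`` first, as above, is a simple way to preview a change before committing it to the file.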


AWK
---

1. Use awk to print only those lines that contain "version" in the files P12931.txt and P05480.txt and think about how this procedure differs from using sed.

2. For all FASTA files that begin with "P" ("P*.fasta") print only the second item of the header (split on "|"), e.g. for ">sp|P12931|SRC_HUMAN Proto-oncogene", print only "P12931".

3. The file "P12931.csv" contains phosphorylation sites in the protein P12931. (If the file "P12931.csv" does not exist, use ``wget http://phospho.elm.eu.org/byAccession/P12931.csv`` to download it.)

   a. Column three of this file lists the amino acid position of the phosphorylation site. You are only interested in position 17 of the protein. Try to use "grep" to select all lines containing "17".
  
   b. Now use awk to show all lines containing "17".
  
   c. Next try to show only those lines where column three equals 17 (Hint: The file is semicolon-separated...).
  
   d. Finally print the PMIDs (column 6) of all lines that contain "17" in column 3. 
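
The difference between matching anywhere on a line and testing a single column can be sketched on a tiny table. The file ``sites.csv`` and its column layout below are invented for illustration and are not the real contents of P12931.csv.

```shell
# Work in a temporary directory so that no course files are touched.
workdir=$(mktemp -d)
cd "$workdir"

# Hypothetical semicolon-separated table with six columns.
printf 'P1;Y;17;x;y;111\nP1;S;170;x;y;222\nP1;S;75;x;y;317\n' > sites.csv

# A bare pattern matches anywhere on the line, like grep.
awk '/17/' sites.csv > any17.txt

# -F ';' sets the field separator; $3 == 17 tests column three only.
awk -F ';' '$3 == 17' sites.csv > col3.txt

# Print just column six for the lines that pass the same test.
awk -F ';' '$3 == 17 { print $6 }' sites.csv > pmids.txt

# The same field splitting works on a FASTA header with "|" as separator.
echo '>sp|P12931|SRC_HUMAN Proto-oncogene' | awk -F '|' '{ print $2 }' > acc.txt

cat any17.txt col3.txt pmids.txt acc.txt
```

In this toy table the bare pattern also picks up "170" and "317", whereas the column test keeps only the row whose third field is exactly 17.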


Quoting and Escaping
--------------------

1. Familiarize yourself with quoting and escaping.

 a. Run the following commands to see the difference between single and double quotes when expanding variables:

  ::

    $ echo "$HOSTNAME"
    ...
    $ echo '$HOSTNAME'

 b. Next, use ssh to log in to a different machine and run the same command there, again using both quoting methods:

  ::

    $ ssh pc-atcteach01 'echo $HOSTNAME'
    ...
    $ ssh pc-atcteach01 "echo $HOSTNAME"

2. Closely inspect the results; is that what you were expecting? Discuss this with your neighbour.