# Debugging and Coding Style in Python

__Materials for an exercise on debugging and coding in Python.__

The file `script1.py` is a Python script that has been written to calculate
statistics about DNA sequences described in a [FASTA file*](#fasta-format).

Download or `git clone` this repository and then follow the instructions below.

#### Your Tasks

1. __Before__ you look at the sequence files in this repository, open the script in your favourite editor and discuss ways in which it could be improved. Things to think about might include:
 - How easy is it to understand what the script does?
 - How robust is the script?
 - Does it follow good coding standards?
 - Does it do what it is supposed to?
 - What problems can you foresee, if the script were to be shared with others or applied to a different sequence file?
2. Now run the script on `exampleSequences1.fasta`
 - Has this made you notice any more improvements that could be made?
3. What about if you run the script on `exampleSequences2.fasta`?
4. Make a copy of the script (or start from scratch if you prefer!) and make as many improvements to the code as you think are necessary to make it
 - robust
 - portable
 - shareable
 - easy to maintain/adapt
 - do what it is supposed to do!

(__Note:__ You may be aware that the Biopython library includes functions and object classes to work with sequence objects. Please avoid using the library for these exercises.)
5. If you have time, try to further adapt the script to expand its functionality such that, given a file of protein sequences instead, it will produce counts of the different amino acids.

#### *FASTA Format

FASTA is a file format designed to hold information about biological sequence
molecules. Generally speaking, three different types of molecule can be
represented in a FASTA file: DNA, RNA, and protein. All of these molecules
are polymers: strings of repeated units in a particular order. These subunits
are drawn from a finite pool, with each specific type of subunit referred to by
a letter. For this reason, the pool of possible units is often referred to as an
_alphabet_.

DNA and RNA molecules are each constructed from four different units (_nucleotides_):
_A_, _C_, _G_, and _T_ for DNA; _A_, _C_, _G_, and _U_ for RNA.
The protein alphabet is larger, made up of 20 different possible units (_amino acids_).
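
As a rough illustration, the alphabets described above could be written down as Python sets and used to spot characters that do not belong in a sequence. This is only a sketch: the names `DNA_ALPHABET`, `RNA_ALPHABET`, `PROTEIN_ALPHABET`, and `unexpected_symbols` are made up for this example and do not appear in `script1.py`.

```python
# Hypothetical names for illustration; none of these come from script1.py.
DNA_ALPHABET = set("ACGT")
RNA_ALPHABET = set("ACGU")
# One-letter codes for the 20 standard amino acids.
PROTEIN_ALPHABET = set("ACDEFGHIKLMNPQRSTVWY")


def unexpected_symbols(sequence, alphabet):
    """Return the set of characters in `sequence` that are not in `alphabet`.

    The sequence is upper-cased first, so lower-case letters are accepted.
    """
    return set(sequence.upper()) - alphabet
```

For example, `unexpected_symbols("acgtn", DNA_ALPHABET)` returns `{"N"}`. Real sequence files may contain extra symbols such as ambiguity codes, so a check like this is probably better used to warn than to reject a file outright.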

The file format itself is constructed as follows:

```
>id1 additional_info
<sequence1>
[<sequence1>]
[<sequence1>]
[...]
[<sequence1>]

[>id2 additional_info
<sequence2>
[<sequence2>]
[<sequence2>]
[...]
[<sequence2>] ]

[...]
```

A single FASTA file may contain records for one or more sequences. Each sequence
record is constructed from the following two elements:

1. A header line, beginning with a `>` symbol. This header line can contain the
following parts:
  - an identifier for the sequence (__required__). This should be unique (at least)
within the file
  - more information about the sequence (__optional__). This additional information
is separated from the identifier by a space
2. The sequence of the molecule, described in the appropriate alphabet. Long
sequences can be split across multiple lines.
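
The record structure described above can be sketched as a small generator that reads lines and yields one record at a time. This is one possible approach under simple assumptions, not how `script1.py` works: the function name `parse_fasta` is hypothetical, blank lines are skipped, and any sequence text that appears before the first header is ignored.

```python
def parse_fasta(lines):
    """Yield (identifier, description, sequence) for each record in `lines`.

    `lines` can be any iterable of strings, such as an open file handle.
    """
    identifier = None
    description = ""
    chunks = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if identifier is not None:
                yield identifier, description, "".join(chunks)
            # Split the header at the first whitespace: id, then optional info.
            parts = line[1:].split(None, 1)
            identifier = parts[0] if parts else ""
            description = parts[1] if len(parts) > 1 else ""
            chunks = []
        elif identifier is not None:
            # Sequence lines are joined, since long sequences may be wrapped.
            chunks.append(line)
    # Emit the final record once the input is exhausted.
    if identifier is not None:
        yield identifier, description, "".join(chunks)


# Made-up example data following the template shown earlier.
example = """>id1 additional_info
ACGT
ACGT

>id2
GGCC
"""
records = list(parse_fasta(example.splitlines()))
```

Here `records` holds two tuples: `("id1", "additional_info", "ACGTACGT")` and `("id2", "", "GGCC")`. Because the function accepts any iterable of lines, the same code could be applied to a file opened with `open()`.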

The Wikipedia page for FASTA format is well written and has more information:
https://en.wikipedia.org/wiki/FASTA_format