Computational Bioinformatics Assignment 3 Solution

$35.00

Description

Introduction

In this assignment you will use the phase of the Fourier Transform to get information about DNA sequences. In particular, you will see that the the phase of the N=3 coe cient can be used to detect deletions of 1 or 2 nucleotides. The algorithm is given below. You will implement the algorithm in R and display the results on a synthetic and a real sequence.

Question 1

In this question you will verify a property reported in the paper D. Kotlar and Y. Lavner, \Gene pre-diction by spectral rotation measure: a new method for identifying protein-coding regions”, Genome research, vol. 13, no. 8, pp. 19301937, 2003.

This paper observed that the phase of the Fourier spectral component at N=3 in each window from within a protein-coding region is tightly clustered around a single value, whereas the same quantity is uniformly distributed between 0 and 2 in non-coding regions.

Verify this by using sequences from the last assignment (you can choose one exon and one non-coding region of size about 5000. Use a window size of 351 and plot for each window position the phase of N=3 Fourier coe cient.

Question 2

Next, implement the following algorithm for window size W = 351.

  1. Compute the binary indicator function for nucleotide G the rst W nucleotides of the query DNA sequence.

  1. Calculate the DFTs of the binary indicator sequence obtained in step 1, and record the phase of G[W=3].

  1. The window is moved forward by 3 nucleotides, and repeat steps 1 and 2 until the entire sequence is read.

  1. Compute a histogram from the phases of G[W=3] that you recorded. The largest peak in these histograms is labeled as .

  1. The phase in each window is classi ed into 3 classes C 2 =3; C ; C +2 =3, and labeled -1,0,1. This classi cation is done using the nearest-neighbor heuristic, and the class assigned to a

window is henceforth referred to as its phase label. The phase label thus de nes a function f : [1; nW ] ! 1; 0; +1, where nW is the number of windows used by the algorithm.

1

F(j) − Synthetic Sequence − No Deletion

F(j) − Synthetic Sequence − Deletion at 5347

0

200

0

500

200

1000

400

1500

600

800

2000

1000

2500

1200

1400

3000

1600

4969

3500

1000

2000

3000

4000

5000

6000

7000

8000

9000

10000

11000

1800

1000

2000

3000

4000

5000

6000

7000

8000

9000

10000

11000

0

0

Figure 1: Sample graphs

6. We then de ne the cumulative function F : [1; Nn + 1] ! [N; +N] as

j

X

F (j) = f(k)

k=1

The function F (j), when plotted against j, gives a visualization of the changes in phase. A sequence of windows with phases in the same class will yield a straight line. Any change in phase results in a change of slope (see gure 1.

To the above add a step to predict the location of a phase change. Test the algorithm on a coding region and a non-coding region by testing 1 or 2 nucleotides at 5 random locations for each and list the predictions from your algorithm.

2