In this assignment you will use the phase of the Fourier Transform to get information about DNA sequences. In particular, you will see that the the phase of the N=3 coe cient can be used to detect deletions of 1 or 2 nucleotides. The algorithm is given below. You will implement the algorithm in R and display the results on a synthetic and a real sequence.
In this question you will verify a property reported in the paper D. Kotlar and Y. Lavner, \Gene pre-diction by spectral rotation measure: a new method for identifying protein-coding regions”, Genome research, vol. 13, no. 8, pp. 19301937, 2003.
This paper observed that the phase of the Fourier spectral component at N=3 in each window from within a protein-coding region is tightly clustered around a single value, whereas the same quantity is uniformly distributed between 0 and 2 in non-coding regions.
Verify this by using sequences from the last assignment (you can choose one exon and one non-coding region of size about 5000. Use a window size of 351 and plot for each window position the phase of N=3 Fourier coe cient.
Next, implement the following algorithm for window size W = 351.
Compute the binary indicator function for nucleotide G the rst W nucleotides of the query DNA sequence.
Calculate the DFTs of the binary indicator sequence obtained in step 1, and record the phase of G[W=3].
The window is moved forward by 3 nucleotides, and repeat steps 1 and 2 until the entire sequence is read.
Compute a histogram from the phases of G[W=3] that you recorded. The largest peak in these histograms is labeled as .
The phase in each window is classi ed into 3 classes C 2 =3; C ; C +2 =3, and labeled -1,0,1. This classi cation is done using the nearest-neighbor heuristic, and the class assigned to a
window is henceforth referred to as its phase label. The phase label thus de nes a function f : [1; nW ] ! 1; 0; +1, where nW is the number of windows used by the algorithm.
F(j) − Synthetic Sequence − No Deletion
F(j) − Synthetic Sequence − Deletion at 5347
Figure 1: Sample graphs
6. We then de ne the cumulative function F : [1; Nn + 1] ! [N; +N] as
F (j) = f(k)
The function F (j), when plotted against j, gives a visualization of the changes in phase. A sequence of windows with phases in the same class will yield a straight line. Any change in phase results in a change of slope (see gure 1.
To the above add a step to predict the location of a phase change. Test the algorithm on a coding region and a non-coding region by testing 1 or 2 nucleotides at 5 random locations for each and list the predictions from your algorithm.