gggatttagctcagtt gggagagcgccagact gaa gat ttg...


Post on 12-Mar-2020


Page 1: gggatttagctcagtt gggagagcgccagact gaa gat ttg gagsites.fas.harvard.edu/~bphys101/lecturenotes/03e_oct14_r...Expanded sequence dependence of thermodynamic parameters improves prediction

1

1

DNA2: Aligning ancient diversity(Last week)

Comparing types of alignments & algorithms
Dynamic programming (DP)
Multi-sequence alignment
Space-time-accuracy tradeoffs
Finding genes -- motif profiles
Hidden Markov Model (HMM) for CpG Islands

2

RNA1: Structure & Quantitation

Integration with previous topics (HMM & DP for RNA structure)
Goals of molecular quantitation (maximal fold-changes, clustering & classification of genes & conditions/cell types, causality)
Genomics-grade measures of RNA and protein and how we choose and integrate (SAGE, oligo-arrays, gene-arrays)
Sources of random and systematic errors (reproducibility of RNA source(s), biases in labeling, non-polyA RNAs, effects of array geometry, cross-talk)
Interpretation issues (splicing, 5' & 3' ends, gene families, small RNAs, antisense, apparent absence of RNA)
Time series data: causality, mRNA decay, time-warping

3

Discrete & continuous bell-curves

[Figure: five curves plotted over a horizontal axis from 0 to 50 and a vertical axis from .00 to .10]

Normal (m=20, s=4.47)
Poisson (m=20, s^2=m)
Binomial (N=2020, p=.01, m=Np)
t-dist (m=20, s=4.47, dof=2)
ExtrVal (u=20, L=1/4.47)
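The legend above pairs a Poisson curve with mean 20 against a normal curve with s=4.47, which is the square root of 20. The following is a minimal sketch of that comparison using only the standard library; the function names are illustrative and not from the lecture.

```python
import math

def poisson_pmf(k, mean):
    """P(X = k) for a Poisson distribution with the given mean."""
    return math.exp(-mean) * mean ** k / math.factorial(k)

def normal_pdf(x, mean, sd):
    """Density of a normal distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2 * math.pi))

mean = 20
sd = math.sqrt(mean)  # Poisson variance equals the mean, so s = sqrt(20)

# Tabulate both curves across the same 0 to 50 range as the plot.
for k in range(0, 51, 10):
    print(k, round(poisson_pmf(k, mean), 4), round(normal_pdf(k, mean, sd), 4))
```

Near the mean the two curves nearly coincide; the discrete Poisson values are defined only at integer counts.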

4

gggatttagctcagttgggagagcgccagactgaa gatttg gaggtcctgtgttcgatccacagaattcgcacca

Primary to tertiary structure

5

Non-Watson-Crick bps

ref

-CH3

6

Modified bases & bps in RNA

" "

ref

1 72


7

Covariance

$M_{ij} = \sum_{x_i x_j} f_{x_i x_j} \log_2 \left[ \frac{f_{x_i x_j}}{f_{x_i} f_{x_j}} \right]$

M = 0 to 2 bits; x = base type. See Durbin et al. p. 266-8.

[Figure labels: D-stem, anticodon, TψC, 3' acc]

8

Mutual Information

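Aligned sequences (columns i=1 and j=6):

ACUUAU
CCUUAG
GCUUGC
UCUUGA

$M_{1,6} = \sum_{x_1 x_6} f_{x_1 x_6} \log_2 \left[ \frac{f_{x_1 x_6}}{f_{x_1} f_{x_6}} \right] = f_{AU} \log_2 \left[ \frac{f_{AU}}{f_A f_U} \right] + \ldots = 4 \times 0.25 \log_2 \left[ \frac{0.25}{0.25 \times 0.25} \right] = 2$

$M_{1,2} = 4 \times 0.25 \log_2 \left[ \frac{0.25}{0.25 \times 1} \right] = 0$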

$M_{ij} = \sum_{x_i x_j} f_{x_i x_j} \log_2 \left[ \frac{f_{x_i x_j}}{f_{x_i} f_{x_j}} \right]$

M = 0 to 2 bits; x = base type. See Durbin et al. p. 266-8.

See Shannon entropy, multinomial Grendar
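The mutual information calculation on this slide can be checked with a short script. This is a minimal sketch: the function name and the 1-based column indexing are assumptions chosen to match the slide's i=1, j=6 notation.

```python
import math
from collections import Counter

def mutual_information(seqs, i, j):
    """Mutual information in bits between alignment columns i and j (1-based)."""
    n = len(seqs)
    col_i = [s[i - 1] for s in seqs]
    col_j = [s[j - 1] for s in seqs]
    f_i = Counter(col_i)               # single-column base counts
    f_j = Counter(col_j)
    f_ij = Counter(zip(col_i, col_j))  # joint counts of base pairs (x_i, x_j)
    total = 0.0
    for (a, b), count in f_ij.items():
        p_ab = count / n
        total += p_ab * math.log2(p_ab / ((f_i[a] / n) * (f_j[b] / n)))
    return total

# The four aligned sequences from the slide.
alignment = ["ACUUAU", "CCUUAG", "GCUUGC", "UCUUGA"]
print(mutual_information(alignment, 1, 6))  # 2.0
print(mutual_information(alignment, 1, 2))  # 0.0
```

Columns 1 and 6 covary perfectly (AU, CG, GC, UA), giving the 2-bit maximum, while column 2 is invariant and therefore carries no information about column 1.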

9

RNA secondary structure prediction

Mathews DH, Sabina J, Zuker M, Turner DH J Mol Biol 1999 May 21;288(5):911-40

Expanded sequence dependence of thermodynamic parameters improves prediction of RNA secondary structure.

Each set of 750 generated structures contains one structure that, on average, has 86% of the known base pairs.

10

Stacked bp & ss

11

Initial 1981 $O(N^2)$ DP methods: Circular Representation of RNA Structure

Did not handle pseudoknots

5’

3’
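To illustrate how dynamic programming handles nested (pseudoknot-free) structures, here is a minimal sketch of base-pair maximization in the style of Nussinov. It is a simpler stand-in, not the energy-based method cited in these slides; the function name, the allowed pair set (including G-U), and the minimum loop length of 3 are assumptions.

```python
def nussinov_max_pairs(seq, min_loop=3):
    """Maximum number of nested (non-crossing) base pairs in an RNA sequence."""
    pairs = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}
    n = len(seq)
    # best[i][j] holds the best pair count for the subsequence seq[i..j].
    best = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            score = best[i][j - 1]  # case 1: position j is unpaired
            # case 2: j pairs with some k, leaving at least min_loop bases between
            for k in range(i, j - min_loop):
                if (seq[k], seq[j]) in pairs:
                    left = best[i][k - 1] if k > i else 0
                    score = max(score, left + 1 + best[k + 1][j - 1])
            best[i][j] = score
    return best[0][n - 1] if n else 0

print(nussinov_max_pairs("GGGAAAUCC"))  # 3
```

The table is N by N, and the inner loop over k adds another factor of N to the running time of this sketch. Because every pair (k, j) splits the problem into an inside and a left-hand piece, crossing pairs such as pseudoknots are never considered.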

12

RNA pseudoknots, important biologically, but challenging for structure searches


13

Dynamic programming finally handles RNA pseudoknots too.

Rivas E, Eddy SR J Mol Biol 1999 Feb 5;285(5):2053-68 A dynamic programming algorithm for RNA structure prediction including pseudoknots. (ref)

Worst case complexity of $O(N^6)$ in time and $O(N^4)$ in memory space.

Bioinformatics 2000 Apr;16(4):334-40 (ref)

14

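CpG Island + in an ocean of -
First order Markov Model

MM = 16, HMM = 64 transition probabilities (adjacent bp)

[Figure: eight states, C+ A+ G+ T+ (island) and C- A- G- T- (ocean), with the +/- label marked "Hidden"; transition labels shown: P(G+|C+) >, P(A+|A+), P(C-|A+) >]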
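The transition counts quoted on this slide follow from squaring the number of states. A minimal sketch that enumerates them (the variable names are illustrative):

```python
from itertools import product

bases = "ACGT"
mm_states = list(bases)                                  # plain Markov model: 4 states
hmm_states = [b + sign for sign in "+-" for b in bases]  # island (+) and ocean (-) copies

# A first-order model has one transition probability per ordered state pair.
mm_transitions = list(product(mm_states, repeat=2))
hmm_transitions = list(product(hmm_states, repeat=2))
print(len(mm_transitions), len(hmm_transitions))  # 16 64
```

Doubling the state space to carry the hidden +/- label is what raises the count from 16 to 64.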

15

Small nucleolar (sno)RNA structure & function

Lowe et al. Science (ref)

16

SnoRNA Search

17

Performance of RNA-fold matching algorithms

Algorithm         CPU bp/sec   True pos.   False pos.
TRNASCAN '91      400          95.1%       0.4x10^-6
TRNASCAN-SE '97   30,000       99.5%       <7x10^-11
SnoRNAs '99                    >93%        <10^-7

(See p. 258, 297 of Durbin et al.; Lowe et al 1999)

18

Putative Sno RNA gene disruption effects on rRNA modification

Primer extension pauses at 2'O-Me positions forming bands at low dNTP.

Lowe et al. Science 1999 283:1168-71 (ref)



20

RNA (array) & Protein/metabolite (MS) quantitation

RNA measures are closer to genomic regulatory motifs & transcriptional control

Protein/metabolite measures are closer to Flux & growth phenotypes.

21

Integrating 8 regulon measures

Coregulated sets of genes:
• Protein fusions (A-B; A, B)
• Microarray data
• Phylogenetic profiles (proteins P1 to P7 across EC, SC, BS, HI; rows shown: 1101101, 0110111, 1010110)
• Metabolic pathways (TCA cycle)
• Conserved operons (B. subtilis: purM purN purH purD; E. coli: purH purD, purM purN)
• Known regulons in other organisms
• In vivo crosslinking & selection (1-hybrid)
• In vitro array binding or selection

22

Check regulons from conserved operons (chromosomal proximity)

B. subtilis: purE purK purB purC purL purF purM purN purH purD
C. acetobutylicum: purE purC purF purM purN purH purD

In E. coli, each color above is a separate but coregulated operon:
purE purK
purB purC purL purF
purM purN
purH purD

E. coli PurR motif

Predicting regulons and their cis-regulatory motifs by comparative genomics. McGuire & Church (2000) Nucleic Acids Research 28:4523-30.

23

Predicting the PurR regulon by piecing together smaller operons

Organisms: M. jannaschii, P. furiosus, C. jejuni, P. horikoshii, M. tuberculosis, E. coli

Operons shown:
purE purK; purM purN; purH purD
purM purF purH purN
purF purC
purQ purC purL purH
purM purF purC purY
purQ purL purY

The above predicts regulon connections among these genes: FC, MN, HD, EKL, QY

24

(Whole genome) RNA quantitation objectives

RNAs showing maximum change
  minimum change detectable/meaningful

RNA absolute levels (compare protein levels)
  minimum amount detectable/meaningful

Network -- direct causality -- motifs

Classify (e.g. stress, drug effects, cancers)


25

(Sub)cellular inhomogeneity

(see figure)

Dissected tissues have mixed cell types.

Cell-cycle differences in expression.

XIST RNA localized on inactive X-chromosome

26

Fluorescent in situ hybridization (FISH)

• Time resolution: 1 msec
• Sensitivity: 1 molecule
• Multiplicity: >24
• Space: 10 nm (3-dimensional, in vivo)

10 nm accuracy with far-field optics, energy-transfer fluorescent beads, nanocrystal quantum dots, closed-loop piezo-scanner (ref)


28

Steady-state population-average RNA quantitation methodology

Microarrays (1): ~1000 bp hybridization, experiment vs. control
• R/G ratios
• R, G values
• quality indicators

Affymetrix (2): 25-bp hybridization, PM and MM probes for each ORF
• Averaged PM-MM
• "presence"

SAGE (3): sequence counting, SAGE tags joined in concatamers
• Counts of SAGE 14 to 22-mer sequence tags for each ORF

MPSS (4)

1 DeRisi et al., Science 278:680-686 (1997)
2 Lockhart et al., Nat Biotech 14:1675-1680 (1996)
3 Velculescu et al., Serial Analysis of Gene Expression, Science 270:484-487 (1995)
4 Brenner et al,
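Two of the summary measures named on this slide, the averaged PM-MM difference for oligonucleotide arrays and the R/G ratio for two-color arrays, can be sketched as below. The function names and all intensity values are hypothetical and serve only to illustrate the arithmetic; expressing the ratio as log2 is a common convention assumed here, not something the slide specifies.

```python
import math

def average_difference(pm, mm):
    """Mean of PM - MM over the probe pairs of one ORF."""
    if len(pm) != len(mm) or not pm:
        raise ValueError("need equal-length, non-empty PM and MM lists")
    return sum(p - m for p, m in zip(pm, mm)) / len(pm)

def log2_ratio(red, green):
    """log2 of the R/G ratio for one spot (experiment over control)."""
    return math.log2(red / green)

# Hypothetical intensities for four probe pairs of a single ORF.
pm = [1200.0, 900.0, 1500.0, 1100.0]
mm = [300.0, 250.0, 400.0, 350.0]
print(average_difference(pm, mm))  # 850.0
print(log2_ratio(800.0, 200.0))    # 2.0
```

On the log2 scale a two-fold increase and a two-fold decrease are symmetric about zero (+1 and -1).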

29

Biotinylated RNA from experiment

GeneChip expression analysis probe array

Image of hybridized probe array

Each probe cell contains millions of copies of a specific oligonucleotide probe

Streptavidin-phycoerythrin conjugate

30

Yeast RNA 25-mer array
Wodicka, Lockhart, et al. (1997) Nature Biotech 15:1359-67

Most RNAs < 1 molecule per cell. (ref)

Reproducibility confidence intervals to find significant deviations.


31

(Public) RNA array analyses (1, 2)

Image analysis: dCHIP Wong/Affy, ArrayTools NCI-BRB, ArrayViewer TIGR, F/P-SCAN NIH, ScanAlyze Eisen, GAPS HMS

Statistics: AFM, AMiADA, Churchill, R-SMA, SAM, VERA-SAM ISB, R-Bioconductor

Database & management: GeneX, MAPS, AMAD

Cluster & Visualize: CLUSFAVOR, Cluster-TreeView Eisen, GeneCluster WI, J-Express Bergen, PaGE, Plaid, SVDMAN LANL, TreeArrange Waterloo, XCluster Stanford

Free vs Open: OSI FSF ISCB

32

Statistical models for repeated array data (RNA vs. experiment repeats)

Li & Wong (2001) Model-based analysis of oligonucleotide arrays: model validation, design issues and standard error application. Genome Biol 2(8):0032

Kuo et al. (2002) Analysis of matched mRNA measurements from two different microarray technologies. Bioinformatics 18(3):405-12

Tusher, Tibshirani and Chu (2001) Significance analysis of microarrays applied to the ionizing radiation response. PNAS 98(9):5116-21.

Selinger, et al. (2000) RNA expression analysis using a 30 base pair resolution Escherichia coli genome array. Nature Biotech. 18, 1262-7.

33

“Significant” distributions

t-test: $t = (\mathrm{Mean} / \mathrm{SD}) \cdot \sqrt{N}$, degrees of freedom $= N - 1$.
H0: The mean value of the difference = 0.
If the difference distribution is not normal, use the Wilcoxon Matched-Pairs Signed-Ranks Test.

[Figure: three curves plotted over a horizontal axis from -30 to 30 and a vertical axis from .00 to .10]

Normal (m=0, s=4.47)
t-dist (m=0, s=4.47, dof=2)
ExtrVal (u=0, L=1/4.47)
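The paired t statistic on this slide, t = (Mean / SD) * sqrt(N) computed on the per-pair differences, can be sketched as follows. The function name and the sample values are hypothetical; SD is taken as the sample standard deviation (N-1 denominator), which is an assumption consistent with the N-1 degrees of freedom.

```python
import math
import statistics

def paired_t(before, after):
    """Return (t, degrees of freedom) for matched measurements."""
    diffs = [a - b for a, b in zip(after, before)]
    n = len(diffs)
    mean = statistics.mean(diffs)
    sd = statistics.stdev(diffs)  # sample SD, N-1 denominator
    return mean / sd * math.sqrt(n), n - 1

# Hypothetical matched measurements for five repeats.
before = [0.0, 0.0, 0.0, 0.0, 0.0]
after = [1.0, 2.0, 3.0, 4.0, 5.0]
t, dof = paired_t(before, after)
print(round(t, 4), dof)  # 4.2426 4
```

The resulting t is compared against a t distribution with N-1 degrees of freedom; with few repeats that distribution has much heavier tails than the normal curve, as the plot shows for dof=2.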

34

Independent Experiments

Microarray analysis of the transcriptional network controlled by the photoreceptor homeobox gene Crx. Livesey, et al. (2000) Current Biology

35

RNA quantitation

Is less than a 2-fold RNA-ratio ever important?
Yes; 1.5-fold in trisomies.

Why oligonucleotides rather than cDNAs?
Alternative splicing, 5' & 3' ends; gene families.

What about using a subset of the genome or ratios to a variety of control RNAs?
It makes trouble for later (meta) analyses.

36


37

(Whole genome) RNA quantitation methods

Method                                Advantages
Genes immobilized, labeled RNA        Chip manufacture
RNAs immobilized, labeled genes       -
Northern gel blot                     RNA sizes
QRT-PCR                               Sensitivity 1e-10
Reporter constructs                   No crosshybridization
Fluorescent In Situ Hybridization     Spatial relations
Tag counting (SAGE)                   Gene discovery
Differential display & subtraction    "Selective" discovery

38

Microarray to Northern

39

Genomic oligonucleotide microarrays

E. coli 25-mer array

[Figure labels: Protein coding 25-mers; Non-coding sequences (12% of genome); tRNAs, rRNAs]

295,936 oligonucleotides (including controls)
Intergenic regions: ~6 bp spacing
Genes: ~70 bp spacing
Not polyA (or 3' end) biased

Strengths: Gene family paralogs, RNA fine structure (adjacent promoters), untranslated & antisense RNAs, DNA-protein interactions.

Affymetrix: Mei, Gentalen, Johansen, Lockhart (Novartis Inst)
HMS: Church, Bulyk, Cheung, Tavazoie, Petti, Selinger

40

Random & Systematic Errors in RNA quantitation

• Secondary structure
• Position on array (mixing, scattering)
• Amount of target per spot
• Cross-hybridization
• Unanticipated transcripts

41

Spatial Variation in Control Intensity

[Figure: two surface plots of Intensity versus X and Y position on the array (X and Y axes run from 0 to 600 and 0 to 500; intensity axes run from 0 to 22000 and from 0 to 3000), labeled Experiment 1 and Experiment 2]

Selinger et al

42

b0671 - ORF of unknown function, tiled in the opposite orientation

Expression Chip Reverse Complement Chip

“intergenic region 1725” - is actually a small untranslated RNA (csrB)

Crick Strand Watson Strand (same chip)

Detection of Antisense and Untranslated RNAs


43

Mapping deviations from expected repeat ratios

Li & Wong

45

Independent oligos analysis of RNA structure

[Figure: Intensity, (PM - MM) / Smax, from -1 to 2, versus Bases from Translation Start, from -300 to 400; series: Log, Stationary, Genomic DNA; annotations: Known Hairpin, Known Transcription Start (position -33), Translation Stop (237 bases)]

Selinger et al

46

Predicting RNA-RNA interactions

Human RNA-splice junctions sequence matrix

http://www-lmmb.ncifcrf.gov/~toms/sequencelogo.html

47

of the human genome using microarray technology. Shoemaker, et al. (2001) Nature 409:922-7.


49

Time courses

• To discriminate primary vs. secondary effects we need conditional gene knockouts.

• Conditional control via transcription/translation is slow (>60 sec for up-regulation and much longer for down-regulation).

• Chemical knockouts can be more specific than temperature (ts-mutants).

50

Beyond steady state: mRNA turnover rates (rifampicin time-course)

[Figure: Fraction of Initial (16S normalized), from 0 to 1.4, versus Time (min), from 0 to 18; series: cspE Chip, lpp Chip, cspE Northern, lpp Northern]

cspE half life: chip 2.4 min, Northern 2.9 min
lpp half life: chip >20 min, Northern >300 min

Chip metric = Smax
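A half-life can be estimated from such a time course by assuming first-order decay, so that ln(fraction remaining) falls linearly with time and the half-life is ln(2) divided by the decay rate. The sketch below fits that line by least squares; the function name is illustrative, the data are synthetic (generated with a 2.4 min half-life, not measured values), and the slides do not state which fitting procedure was actually used.

```python
import math

def half_life(times, fractions):
    """Fit ln(fraction) = c - k * t by least squares and return ln(2) / k."""
    logs = [math.log(f) for f in fractions]
    n = len(times)
    t_mean = sum(times) / n
    l_mean = sum(logs) / n
    slope = (sum((t - t_mean) * (l - l_mean) for t, l in zip(times, logs))
             / sum((t - t_mean) ** 2 for t in times))
    return math.log(2) / -slope

# Synthetic noise-free time course with a 2.4 min half-life.
times = [0, 2, 4, 8, 16]
fractions = [0.5 ** (t / 2.4) for t in times]
print(round(half_life(times, fractions), 2))  # 2.4
```

For a very stable transcript the slope is close to zero over an 18-minute window, so only a lower bound on the half-life can be read off, as with the ">20 min" chip estimate for lpp.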

51

TimeWarp: pairs of expression series, discrete or interpolative

[Figure: panels a to f showing two expression series, series a (time points t0 to t6) and series b (time points u0 to u4), aligned on a grid with series a indexed by i, i+1 and series b indexed by j, j+1]

Aach & Church

52

TimeWarp: cell-cycle experiments

53

TimeWarp: alignment example
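Aligning two time series by dynamic programming can be sketched with the textbook dynamic time warping recurrence below. This is a generic illustration loosely corresponding to the discrete case, not the TimeWarp algorithm of Aach & Church itself; the function name and the absolute-difference cost are assumptions.

```python
def dtw_distance(a, b):
    """Minimum total |a_i - b_j| over monotone alignments of series a and b."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] is the best cost of aligning the first i points of a
    # with the first j points of b.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j - 1],  # advance both series
                                 cost[i - 1][j],      # advance a only
                                 cost[i][j - 1])      # advance b only
    return cost[n][m]

# Series b repeats one time point of series a, so warping aligns them exactly.
print(dtw_distance([1, 2, 3], [1, 2, 2, 3]))  # 0.0
```

The recurrence has the same shape as the sequence-alignment DP from the previous lecture, with time points in place of residues.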
