MSc IAMZ - UAB - UPV 2007 - 2008

Description
MSc IAMZ - UAB - UPV 2007 - 2008. Essential Bioinformatics for Animal Breeders Miguel Pérez-Enciso miguel.perez @ uab.es www.icrea.es. Outline. Why bioinformatics Complex and simple traits: Why statistics Quantitative Trait Locus (QTL) detection Microarrays. Ultimate goal.
Transcript
MSc IAMZ - UAB - UPV 2007 - 2008
Essential Bioinformatics for Animal Breeders
Miguel Pérez-Enciso
miguel.perez @ uab.es
www.icrea.es

Outline
  • Why bioinformatics
  • Complex and simple traits: Why statistics
  • Quantitative Trait Locus (QTL) detection
  • Microarrays

Ultimate goal
Determine the genetic basis of 'complex' traits (there are many other applications of DNA markers: conservation, traceability, marker assisted selection...).

Genetics and animal breeding have become a data-rich science, where the limiting step is already the analysis of the data rather than obtaining the data themselves.

Three main streams of data:
  • DNA sequence
  • DNA polymorphism (markers)
  • Expression data (functional genomics)

Trait classification
  • 'Simple' or Mendelian traits
  • 'Complex' or quantitative traits
  • This classification is rather artificial: there is a continuum rather than a clear-cut division

'Simple' traits in Animal Genetics
  • Double muscling in cattle
  • Halothane gene in pigs
  • RN mutation in pigs
  • Feather pecking in chicken
  • Most color mutations

Double muscling in cattle
[Figure labels: Belgian Blue; Asturiana; Performance, Asturiana Valles]
Grobet et al. (1997) Nat Genet 17:71
  • Results in an increase in the number of muscle fibres (hyperplasia), and fibre enlargement (hypertrophy), lower fat and collagen.
  • Caused by a stop codon in myostatin.
  • Myostatin is a member of the transforming growth factor (TGF)-β superfamily that actively represses skeletal muscle growth.
  • About 10 mutations disrupt the gene; different breeds carry different mutations.
  • Halothane gene in pigs
  • Pigs sensitive to halothane and to stress.
  • Higher percentage of lean.
  • Higher growth.
  • Higher mortality.
  • Lower meat quality (loss of water).
  • Fujii et al. (1991) Science 253:448
  • Nonsynonymous mutation in Ryanodine Receptor 1 (Ryr1).
  • It changes a key amino acid, Arg -> Cys.
  • Ca release gate membrane protein.

What are 'complex' (quantitative) traits?
  • Sensitive to the environment
  • Affected by several genes
  • Traits showing a continuous distribution

Most traits of interest are 'complex'
  • Milk and meat production
  • Litter size
  • Disease resistance
  • ...

What are the consequences of complexity?
In a 'simple' trait the phenotype is predominantly determined by the genotype. In a complex trait ... Uncertainty!
Uncertainty ≠ Ignorance!
Uncertainty ≠ Error
WE REQUIRE STATISTICS

EXAMPLE
Suppose
  • A trait is normally distributed.
  • There are two alleles at a given locus.
  • An additive mutation that increases growth.

[Figures: phenotype densities p(y|G=qq), p(y|G=Qq) and p(y|G=QQ) for a 'simple' trait and for a complex trait; mixture visualization showing the overall density p(y) together with the three genotype densities]

EXAMPLE
  • What is the expected genotype of an individual whose phenotype is the mean?
  • What is the expected genotype of an individual whose phenotype is 1 SD above the mean?
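The two questions above can be explored numerically. The following is a minimal sketch, not part of the original slides, that applies Bayes' rule to a mixture of three normal densities. It assumes Hardy-Weinberg genotype frequencies, an allele frequency p = 0.5, no dominance (d = 0) and a common residual standard deviation; the function names and all numeric values are illustrative assumptions.

```python
import math


def normal_pdf(y, mean, sd):
    """Density of a normal distribution with the given mean and sd."""
    z = (y - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def genotype_posterior(y, a, d=0.0, p=0.5, sd=1.0):
    """Return P(G | y) for G in {qq, Qq, QQ}.

    Assumptions (illustrative): Hardy-Weinberg proportions with
    frequency p for allele Q, genotype means -a, d, a and a common
    residual standard deviation sd.
    """
    q = 1.0 - p
    priors = {"qq": q * q, "Qq": 2.0 * p * q, "QQ": p * p}
    means = {"qq": -a, "Qq": d, "QQ": a}
    joint = {g: priors[g] * normal_pdf(y, means[g], sd) for g in priors}
    total = sum(joint.values())  # this is the mixture density p(y)
    return {g: joint[g] / total for g in joint}


# 'Simple' trait: genotype means far apart relative to the residual sd.
simple_at_mean = genotype_posterior(y=0.0, a=5.0)
# Complex trait: genotype densities overlap almost completely.
complex_at_mean = genotype_posterior(y=0.0, a=0.25)
# Phenotype one residual sd above the mean (an approximation of
# "1 SD above the mean", since the phenotypic SD also includes
# the variance caused by the locus).
complex_above = genotype_posterior(y=1.0, a=0.25)

for label, post in [("simple, y = mean", simple_at_mean),
                    ("complex, y = mean", complex_at_mean),
                    ("complex, y = mean + 1 sd", complex_above)]:
    print(label, {g: round(v, 3) for g, v in post.items()})
```

With well separated genotype means the phenotype almost identifies the genotype; with overlapping densities the posterior stays close to the prior genotype frequencies, which is the uncertainty the slides refer to.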
  • Statistics
  • Science to deal with uncertainty.
  • In Animal Breeding, it provides the link between molecular genetics and the applied world, between genotype and phenotype.
  • A very important area of research now: the limit lies more in analyzing the data than in obtaining them (Bioinformatics).
  • Two key aspects
  • Description
  • Inference or Prediction
  • Inference: the concept of model
  • Simplification of reality
  • Links data to some abstract useful concept
  • A model is not supposed to be 'TRUE'
  • A model is meant to be USEFUL
  • Desirable characteristics of models
  • Adjust to data
  • Parsimonious (austere)
  • Interpretable

Examples of models in Genetics
Milk_Production = Herd + Genetic_Effect + Error
Meat_Quality = Age + Sex + Halothane_Genotype + Error
Growth = Sex + Breed + Myostatin_Genotype + Error

Usual Steps
  • Define a model
  • Estimate parameters
  • Carry out significance test
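As an illustration of these three steps, here is a minimal sketch for the simplest possible model, phenotype = genotype effect + error, fitted by least squares and tested with an F statistic that compares the full model against a reduced model with only a general mean. The function name, the genotype labels and the six phenotypes are made up for illustration and do not come from the slides.

```python
def f_test_one_factor(y, groups):
    """Least squares fit of y = level effect + e, with an F test.

    Full model: one mean per level of `groups`.
    Reduced model: a single overall mean.
    Returns the estimated level means and the F statistic.
    """
    n = len(y)
    levels = sorted(set(groups))

    # Reduced model: residual sum of squares around the overall mean.
    grand_mean = sum(y) / n
    rss_reduced = sum((v - grand_mean) ** 2 for v in y)

    # Full model: least squares estimates are the level means.
    level_means = {}
    for lev in levels:
        vals = [v for v, g in zip(y, groups) if g == lev]
        level_means[lev] = sum(vals) / len(vals)
    rss_full = sum((v - level_means[g]) ** 2 for v, g in zip(y, groups))

    df_model = len(levels) - 1
    df_resid = n - len(levels)
    f_stat = ((rss_reduced - rss_full) / df_model) / (rss_full / df_resid)
    return level_means, f_stat


# Hypothetical growth records for two genotype classes.
growth = [10.0, 12.0, 11.0, 15.0, 17.0, 16.0]
genotype = ["AA", "AA", "AA", "AB", "AB", "AB"]

means, f_stat = f_test_one_factor(growth, genotype)
print(means, f_stat)
```

The F statistic would then be compared with an F distribution with (levels - 1, n - levels) degrees of freedom. Real analyses add the other terms of the models above (herd, sex, breed), which requires a general linear model rather than this one-factor shortcut.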
  • Usual Approach in Genetics
  • Define a model
  • Carry out a scan across marker or genome positions
  • Estimate parameters
  • Carry out significance test
  • This is called a Quantitative Trait Locus (QTL) analysis

Quantitative trait locus experiments
  • Principles
  • Crosses between inbred lines
  • Outbred line

GENE / LOCUS: a stretch of DNA whose variants (alleles) produce a change in a trait.
QTL: in principle, a gene whose polymorphism affects a quantitative trait; in practice, a large genome region statistically associated with the trait.
MARKER (M): a 'neutral' polymorphism. Microsatellites, SNPs and AFLPs are markers.
GENOTYPE (G): the paternal and maternal alleles define the genotype at a locus.
PHENOTYPE (y): the observed characteristic (trait) of each individual.
HAPLOTYPE (H): the paternal (or maternal) set of alleles of each individual.
PHASE: two alleles are in the same phase if they were inherited from the same parent. They are then in cis; otherwise, they are in trans.

Sources of information
  • y: phenotypes
  • M: markers
  • P: pedigree

The model
phenotype = fixed effects + locus effects (QTL) + infinitesimal genetic effect + residual

Usual approaches
  • Simple experimental design:
  • F2
  • BC
  • Isolated families
  • Simplify genetic model:
  • One single locus
  • Alternative alleles fixed in each parental line
  • Statistical Techniques
  • Regression (Least squares)
  • Maximum likelihood
  • Bayesian methods
  • Non parametric methods
  • Usual genetic decomposition (biallelic gene):

Genotype        QQ    Qq    qq
Genetic value   a     d     -a

or, with values expressed as deviations from the midpoint between the two homozygotes,
a = [ E(y | G=QQ) - E(y | G=qq) ] / 2
d = E(y | G=Qq)

Crosses between inbred lines
[Figure: crosses between inbred lines, showing marker (M/m) and QTL (Q/q) alleles in the parental (P), F1, backcross (BC) and F2 generations]
[Figure: F2 cross scheme. aa,bb x AA,BB gives Aa,Bb; Aa,Bb x Aa,Bb gives AA,BB; Aa,BB; aa,BB; AA,Bb; Aa,Bb; aa,bb; AA,bb; Aa,bb]

r and QTL effects are confounded in a single marker analysis
(r = recombination fraction between the marker and the QTL)

Genotype   Freq.      E(y|G)
MQ         (1-r)/2     a
Mq         r/2        -a
mQ         r/2         a
mq         (1-r)/2    -a

E(y|M) = a (1-r) - a r
E(y|m) = a r - a (1-r)
D = E(y|M) - E(y|m) = 2 a (1-2r)
Var(D) = 2 [σ² + 4 a² r (1-r)] / n

Interval mapping
By using intervals delimited by two markers:
  - we can distinguish between r and a (and d)
  - we use more information, and we reduce error
Regression approach: Haley and Knott (1992)
Maximum likelihood: Lander and Botstein (1990)

Interval mapping: backcross
[Figure: backcross between an F1 carrying haplotypes M-Q-N and m-q-n and the parental line m-q-n; the QTL genotype of each offspring is unknown (?)]
a = E(y|Qq)
-a = E(y|qq)
r = recombination fraction between markers M and N (known)
r1 (r2) = recombination fraction between marker 1 (2) and the QTL
ρ = r1 / r
r1, a: unknown

Genotype   Freq.      P(G=Q|M)   P(G=q|M)   E(y|M)
MN         (1-r)/2    1          0          a
Mn         r/2        r2/r       r1/r       a (1-ρ) - a ρ
mN         r/2        r1/r       r2/r       a ρ - a (1-ρ)
mn         (1-r)/2    0          1          -a

(Double recombinants are ignored, so that r2/r = 1 - ρ.)

Interval mapping: BC regression approach (Haley and Knott, 1992)
The model is:
y = b + c_a a + e
where y are the phenotypes, b the fixed effects, a the QTL effect, and c_a = P(G=Qq|M) - P(G=qq|M) is the difference between the two QTL genotype probabilities given the markers.
The reduced model is:
y = b + e
The strategy is:
1) Compute c_a at predetermined positions.
2) Compute the test statistic F (full model / reduced model) at each position.
3) Choose the estimates (ρ and a) that correspond to Fmax.

Example: F2 cross between Iberian and Landrace pigs
The IBMAP consortium (Spain): UdL-IRTA, INIA, UAB, CTC-IRTA, UMurcia
[Photographs: the Landrace line; the F1 offspring; some F1s ...; the variability in the F2]

IBMAP experimental protocol
[Diagram labels, in order: 313; F0: Ibérico Guadyerbas x Landrace Nova Genètica; 1 litter; F1; 716; 100 markers; F2; 577]

Traits measured
Carcass:
  • Weight
  • Backfat thickness
  • Carcass length
  • Cutting weights
Histochemistry:
  • % muscle fibers
  • Diameter of fibers
Quality:
  • pH at 45' and 24 h
  • Conductivity
  • Pigments
  • Minolta color
  • % intramuscular fat
  • % fatty acids

[Figure: QTL scan on SSC4 (Mercadé et al. 2005); microsatellite density from the USDA map, positions in cM; y axis: -log10(p-value); traits: fat (Grasa34, Grasa1), shoulder weight, carcass length, live weight, ham weight; markers: S0097, SW839, SW524, SW445, Sw2404, S0001, Sw317, S0214, S0301, DECR, SW58, S0073, Sw35, FABP4; panels: growth and shape traits; fat and shape traits]

Outbred populations
  • Completely inbred lines are available only in a limited number of species, i.e., mice and some plants.
  • It is of interest to compare whether the QTL found in crosses are also segregating in outbred populations.
  • Selection is carried out within outbred lines so we need to be able to analyze these populations.
  • Within family analysis
  • Disequilibrium assumed only within families
  • Most typical designs
  • Daughter design
  • Granddaughter design
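
Within family analysis: scheme
[Figure: a heterozygous sire (Aa,Bb) and the haplotypes transmitted to its offspring: A-,B-; a-,B-; A-,b-; a-,b-]

Daughter design
[Figure: a sire heterozygous at the marker (M/m) and at the QTL (Q/q), and its offspring]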
  • Only offspring phenotyped
  • Only offspring from heterozygous sires are analyzed
  • The performance of offspring having received one or other allele is compared
  • Dam information is discarded
  • One test per family, a global test combining all families
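The within-family test can be sketched as follows. This is a minimal illustration under stated assumptions, not the method of any particular software package: within each sire family, offspring are split by the marker allele received from the sire, a two-sample t statistic is computed, and the families are combined by summing the squared statistics, so the sign of the contrast in each family does not matter. The function names, the choice of a sum of squared t statistics as the global test, and the phenotypes are all hypothetical.

```python
import math


def family_contrast(y_M, y_m):
    """Contrast between offspring that received sire allele M or m.

    Returns (difference of means, two-sample t statistic with a
    pooled within-class variance).
    """
    n1, n2 = len(y_M), len(y_m)
    mean1 = sum(y_M) / n1
    mean2 = sum(y_m) / n2
    ss = (sum((v - mean1) ** 2 for v in y_M)
          + sum((v - mean2) ** 2 for v in y_m))
    pooled_var = ss / (n1 + n2 - 2)
    se = math.sqrt(pooled_var * (1.0 / n1 + 1.0 / n2))
    diff = mean1 - mean2
    return diff, diff / se


def global_statistic(families):
    """Combine families by summing squared t statistics (an assumed,
    simple way of pooling evidence across families)."""
    return sum(family_contrast(y_M, y_m)[1] ** 2 for y_M, y_m in families)


# Hypothetical data: two half-sib families with opposite marker-QTL
# phase in the sire, so the contrasts have opposite signs.
family_1 = ([5.0, 6.0, 7.0], [2.0, 3.0, 4.0])
family_2 = ([2.0, 3.0, 4.0], [5.0, 6.0, 7.0])

diff_1, t_1 = family_contrast(*family_1)
diff_2, t_2 = family_contrast(*family_2)
total = global_statistic([family_1, family_2])
print(diff_1, diff_2, total)
```

Pooling the raw differences would cancel the two families out, whereas the squared statistics add up; this is why a separate test is needed per family.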

Daughter design

Genotype   Freq.      E(y|G)
MQ         (1-r)/2     a
Mq         r/2        -a
mQ         r/2         a
mq         (1-r)/2    -a

Note: phases may differ between families, so a test statistic is computed for each family. The global statistic is obtained by combining the statistics of all families.

Software
  • QTLexpress: http://qtl.cap.ed.ac.uk/
  • QTL Cartographer: http://statgen.ncsu.edu/qtlcart/
  • Qxpak: http://www.icrea.es/pag.asp?id=Miguel.Perez
  • R/qtl: http://www.biostat.jhsph.edu/~kbroman/qtl/
  • Solar: http://www.sfbr.org/solar/
  • Merlin: http://www.sph.umich.edu/csg/abecasis/Merlin/

Microarray technology
  • Technology: expression and genotyping
  • Experimental design
  • Analysis: cluster and differential expression
  • Microarray Technology
  • Offers the possibility of large scale data production
  • SNPs : Up to 650,000 / sample
  • mRNA expression: All expressed genes

Microarrays History
  • 1991 - Photolithographic printing (Affymetrix)
  • 1994 - First cDNA collections are developed at Stanford
  • 1995 - Quantitative monitoring of gene expression patterns with a complementary DNA microarray
  • 1996 - Commercialization of arrays (Affymetrix)
  • 1997 - Genome-wide expression monitoring in S. cerevisiae (yeast)
  • 2000 - Portraits/signatures of cancer
  • 2002 - The Pig Microarray from Qiagen is produced. It contains 10,665 70-mer probes representing 10,665 Sus scrofa gene sequences
  • 2004 - Whole human genome on one microarray
  • 2004 - The AmpliChip CYP450 from Roche is FDA-approved
  • 2004 - First Affymetrix microarrays for domestic animals available

Different microarray systems
All rely on:
1. Making very high density arrays of DNA or RNA on silica or glass.
2. Probing these arrays in situ, e.g. with a fluorescently labeled sequence.
3. Scanning the array and detecting binding / non-binding.

Array types
  • Probe with DNA to detect sequence difference
  • Probe with cDNA to detect gene expression
  • cDNA microarrays
  • oligo microarrays
  • There are presently two methods of making arrays:
  • cDNA: Robotically spot-out DNA on a solid (glass) support
  • Oligo: Make an oligo in situ using technology closely analogous to electronic silicon chip technology.

[Figure: example oligonucleotide probe sequences on the array]

Genotyping: SNP Microarray
  • Immobilized allele specific oligo probes
  • Hybridize with labeled PCR product
  • Assay multiple SNPs on a single array

Transcriptome
Set of transcribed mRNAs from a sample.
Techniques to quantify:
1 - Microarrays
2 - Serial analysis of gene expression (SAGE)
3 - Massively parallel signature sequencing
4 - Differential display
5 - Ribonuclease protection assay and Northern blot
6 - Quantitative PCR
[Figure: transcriptome techniques (TaqMan, expression microarrays, Northern hybridization, SAGE) placed by number of samples (10 to 1,000) against number of genes queried (1 to 10,000)]

cDNA Microarray
  • Probes are chosen from cDNA libraries
  • 'Printed' onto a slide
  • Visualization is by competitive hybridization between two samples dyed differently.

cDNA microarray principle
[Figure: cDNA microarray principle (NHGRI)]

Sources of Probe Sequences
1. Collections of Expressed Sequence Tags (ESTs); cDNA libraries. Both are derived from expressed genes (i.e. mRNAs)
2. Open reading frames from genomic sequence
3. Non-coding fragments of genomic sequence

Note
The quantity analyzed is the ratio of color intensities, which is proportional to the ratio of the amount of mRNA in one sample to the amount of mRNA in the other sample.

Oligo microarrays (Affymetrix)
  • In situ synthesis of 25-nt oligos, designed in silico with bioinformatic algorithms.
  • Every gene is represented by several different probes to avoid cross-hybridization
  • Allows for larger gene numbers (~ 25000 genes)
  • Absolute measurement (not competitive ratios)
  • Much more expensive than cDNA microarrays
  • Oligo synthesis principle (e.g. Affymetrix)
  • Probes synthetised in vitro
  • 'Absolute' measurement
  • SNP chip assay

cDNA vs Oligo
  • Affymetrix
  • Only certain species
  • Very expensive (~ 400$)
  • Depend on external technology (software, hybridization...)
  • More reliable
  • High number of genes
  • cDNA
  • Can be applied to any organism
  • Cheap (relatively)
  • Less reliable
  • Not available commercially
  • Fewer genes
  • Some questions that can be addressed by microarrays
  • Is a gene expressed differentially in two or more treatments (tissues, time, disease status, etc)?
  • How different are several treatments / genes in terms of their expression profile?
  • Phenotype prediction: disease status, disease subtype, survival time.
  • How does evolution affect gene expression?
  • What is the genetic basis in the variation of gene expression?
  • Analyze alternative splicing
  • A Typical DNA Microarray Experiment
  • Design experiment
  • Isolate RNA (or DNA) from multiple samples of cells (total or mRNA)
  • Convert mRNA to cDNA using reverse transcriptase
  • Label cDNA with fluorescent or radioactive nucleotides
  • Hybridize labelled cDNAs to array
  • Detect and quantify fluorescence (or radioactivity) using confocal laser scanner (or phosphoimager)
  • Analyze results
  • Experimental design for cDNA microarrays
  • Dye – swap designs:
  • Cross designs
  • Loop designs
  • Reference sample designs
  • Remember: competitive ratios (no absolute measurement)

Dye-swap designs
  • Two fluorochromes (colors), Cy3 and Cy5
  • Same sample is hybridized twice with different colors
  • Corrects for differential signal intensity between both colors
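A small numeric sketch shows why swapping the dyes corrects for the difference in signal intensity between colors. It assumes, for illustration only, that the dye effect is multiplicative (additive on the log2 scale) and identical on both arrays; the function name and the intensities are hypothetical.

```python
import math


def dye_swap_estimate(cy5_a, cy3_b, cy5_b, cy3_a):
    """Estimate log2(A/B) from a pair of dye-swapped arrays.

    Array 1: sample A labeled with Cy5, sample B with Cy3.
    Array 2: sample B labeled with Cy5, sample A with Cy3.
    Returns (log2 ratio of A to B, log2 dye bias of Cy5 vs Cy3).
    """
    m1 = math.log2(cy5_a / cy3_b)   # = log2(A/B) + dye bias
    m2 = math.log2(cy5_b / cy3_a)   # = -log2(A/B) + dye bias
    log_ratio = (m1 - m2) / 2.0     # dye bias cancels
    dye_bias = (m1 + m2) / 2.0      # expression difference cancels
    return log_ratio, dye_bias


# Hypothetical gene: true amounts A = 400 and B = 100 (4-fold, log2 = 2),
# with Cy5 reading twice as bright as Cy3 (log2 bias = 1).
log_ratio, dye_bias = dye_swap_estimate(cy5_a=800.0, cy3_b=100.0,
                                        cy5_b=200.0, cy3_a=400.0)
print(log_ratio, dye_bias)
```

A single array would report a log2 ratio of 3 for this gene; averaging the two orientations recovers the assumed true value of 2.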

[Figures: dye-swap cross design and dye-swap loop design, each with samples 1 to 6; reference design with samples 1 to 6 each compared with a reference R]

What is best?
Overall, dye-swap designs are more powerful than reference designs but more difficult to analyze and interpret.

Microarray analysis
Typical cDNA microarray data consist of measurements of laser intensity, assumed to be proportional to the original amount of mRNA in the tissue, for the i-th individual / sample and the j-th gene, {Gij}.

The large p, small n paradigm: # samples x # genes

Typical approach
Look for similarities (or differences) in patterns, e.g. compare rows to find evidence for co-regulation of genes.
1) We need ways to measure similarity (distance) among the objects being compared.
2) Then, group together objects (genes or samples) with similar properties.
Eisen et al. (1998) http://www.pnas.org/cgi/content/full/95/25/14863

Clustering techniques
  • The idea is to group genes (or microarrays) that show a similar behavior, thus identifying patterns of gene expression (or samples).
  • There exist dozens of variants, that can be grouped in:
  • Hierarchical / Non hierarchical
  • Agglomerative / Divisive
  • Self organizing maps
  • ...
  • All require a definition of distance or 'proximity'

Euclidean distance: d(x, y) = sqrt( Σk (xk - yk)² )
  • WARNING!
  • Results depend on distance chosen
  • Difficult to justify any given distance measurement
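The warning can be made concrete with a toy example. The sketch below compares Euclidean distance with Pearson's correlation on three made-up expression profiles; the profiles and function names are illustrative assumptions, chosen so that the two measures disagree about which gene is most similar to the first one.

```python
import math


def euclidean(x, y):
    """Euclidean distance between two expression profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def pearson(x, y):
    """Pearson's correlation between two expression profiles."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical expression of three genes across four samples.
gene_1 = [1.0, 2.0, 3.0, 4.0]
gene_2 = [10.0, 20.0, 30.0, 40.0]   # same pattern, ten times the level
gene_3 = [4.0, 3.0, 2.0, 1.0]       # similar level, opposite pattern

print("Euclidean 1-2:", euclidean(gene_1, gene_2))
print("Euclidean 1-3:", euclidean(gene_1, gene_3))
print("Pearson   1-2:", pearson(gene_1, gene_2))
print("Pearson   1-3:", pearson(gene_1, gene_3))
```

Euclidean distance places gene 3 closest to gene 1 because their levels are similar, while correlation places gene 2 closest because it follows the same pattern. Any clustering built on these measures inherits that choice.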

Pearson's correlation

Hierarchical Clustering
Unweighted Pair-Group Method Average (UPGMA)
Applied to microarray data by Eisen et al. (1998)
Measure of distance = ri,j (correlation in expression between genes i and j, or tissues i and j)
Iterate on:
1) Maximal r ==> next node.
2) New observation computed as the average expression levels of the joined genes.
3) Recompute r for the remaining pairs.
The UPGMA method was widely used in phylogeny ==> rooted tree.
The nice appearance of the result (dendrogram) is one of the main reasons for its success.

Example: Perou et al. 2000
Molecular portraits of human breast tumours
Charles M. Perou, Therese Sorlie, Michael B. Eisen, Matt van de Rijn, Stefanie S. Jeffrey, Christian A. Rees, Jonathan R. Pollack, Douglas T. Ross, Hilde Johnsen, Lars A. Akslen, Oystein Fluge, Alexander Pergamenschikov, Cheryl Williams, Shirley X. Zhu, Per E. Lonning, Anne-Lise Borresen-Dale, Patrick O. Brown & David Botstein
Nature 406, 747-752 (17 August 2000)
Affiliations:
  • Department of Genetics, Stanford University School of Medicine, Stanford, California 94305, USA
  • Department of Genetics, The Norwegian Radium Hospital, N-0310 Montebello Oslo, Norway
  • Department of Pathology, Stanford University School of Medicine, Stanford, California 94305, USA
  • Department of Surgery, Stanford University School of Medicine, Stanford, California 94305, USA
  • Department of Biochemistry, Stanford University School of Medicine, Stanford, California 94305, USA
  • Department of Pathology, The Gade Institute, Haukeland University Hospital, N-5021 Bergen, Norway
  • Department of Molecular Biology, University of Bergen, N-5020 Bergen, Norway
  • Department of Oncology, Haukeland University Hospital, N-5021 Bergen, Norway
  • Howard Hughes Medical Institute, Stanford University School of Medicine, Stanford, California 94305, USA

Human breast tumours are diverse in their natural history and in their responsiveness to treatments.
Variation in transcriptional programs accounts for much of the biological diversity of human cells and tumours. In each cell, signal transduction and regulatory systems transduce information from the cell's identity to its environmental status, thereby controlling the level of expression of every gene in the genome. Here we have characterized variation in gene expression patterns in a set of 65 surgical specimens of human breast tumours from 42 different individuals, using complementary DNA microarrays representing 8,102 human genes. These patterns provided a distinctive molecular portrait of each tumour. Twenty of the tumours were sampled twice, before and after a 16-week course of doxorubicin chemotherapy, and two tumours were paired with a lymph node metastasis from the same patient. Gene expression patterns in two tumour samples from the same individual were almost always more similar to each other than either was to any other sample. Sets of co-expressed genes were identified for which variation in messenger RNA levels could be related to specific features of physiological variation. The tumours could be classified into subtypes distinguished by pervasive differences in their gene expression patterns.

Figure 1. Variation in expression of 1,753 genes in 84 experimental samples. Data are presented in a matrix format: each row represents a single gene, and each column an experimental sample. In each sample, the ratio of the abundance of transcripts of each gene to the median abundance of the gene's transcript among all the cell lines (left panel), or to its median abundance across all tissue samples (right panel), is represented by the colour of the corresponding cell in the matrix. a, Dendrogram representing similarities in the expression patterns between experimental samples.
All 'before and after' chemotherapy pairs that were clustered on terminal branches are highlighted in red; the two primary tumour/lymph node metastasis pairs in light blue; the three clustered normal breast samples in light green. Branches representing the four breast luminal epithelial cell lines are shown in dark blue; breast basa