John Wiley & Sons - 2004 - Analysis of Genes and Genomes
9 Genome Sequencing Projects
are designed to identify ORFs within human DNA and other individual species, respectively. These programmes assign ORFs not only on the basis of initiator and terminator codons, but also using codon bias to identify likely coding regions, and the identification of both intron–exon boundaries and transcriptional control elements (e.g. the TATA box). Unfortunately, these latter sequences can be quite variable, and precise gene identification remains problematical. An alternative approach to gene identification is to use previously identified genes as a guide. Is the gene we are trying to assign similar (homologous) to any existing genes? If so, it is likely that assignment has been made correctly. One danger with this type of approach is that pseudogenes (generally non-transcribed genomic DNA with a high degree of sequence similarity to a real gene) may be assigned as real genes.
• cDNA comparison. The simplest way to identify a gene within a segment of genomic DNA is to compare the sequence to a copy of the corresponding cDNA. Readers will remember that cDNA (Chapter 5) is produced from mRNA and contains just the exon sequences of the ORF joined together. This can be achieved either through the hybridization of genomic DNA fragments to mRNA separated on an agarose gel (northern blotting, see Chapter 2) or through the comparison with databases of sequenced cDNA fragments. Expressed sequence tags (ESTs) are small pieces of cDNA sequence (usually 200 to 500 bases long) that are generated by sequencing either one or both ends of an expressed gene. Random cDNA clones are sequenced to generate sections of sequence that represent genes expressed in certain cells, tissues or organs from different organisms. These tags can then be used to identify the gene encoding them from genomic DNA by sequence comparison. Because ESTs represent a copy of just the interesting part of a genome – that which is expressed – they are powerful tools in the hunt for genes. ESTs also have a number of practical advantages: the sequences can be generated rapidly and inexpensively; only one sequencing experiment is needed for each cDNA generated; and they do not have to be checked for sequencing errors, as mistakes do not prevent identification of the gene from which the EST was derived using similarity searches. Databases of EST sequences are publicly available, e.g. dbEST (http://www.ncbi.nlm.nih.gov/dbEST/), which contains over 12 million sequences from different organisms including 4.5 million human sequences. Many of these sequences are, of course, repetitious (Banfi, Guffanti and Borsani, 1998), with highly expressed genes being represented many times. Additionally, it should be noted that genes that are expressed at a low level, or those whose expression pattern is highly tissue or developmental stage specific, might not be present within an EST database.
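The first step of the computational approach described above, assigning ORFs from initiator and terminator codons, can be illustrated with a short sketch. This is only a minimal illustration and not the method used by any particular gene-finding programme: it scans the three forward reading frames for an ATG followed in frame by a stop codon, and ignores codon bias, introns and the reverse strand. The function name, the `min_codons` parameter and the toy sequence are assumptions made for the example.

```python
# Minimal ORF scan (illustrative sketch only): find ATG...stop stretches
# in the three forward reading frames of a DNA sequence.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=3):
    """Return (start, end, frame) for every ATG-to-stop ORF on the forward strand.

    start and end are 0-based and end-exclusive; the span includes the stop codon.
    min_codons counts the ATG plus the following coding codons, not the stop.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None  # position of the first ATG seen since the last stop
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None
    return orfs


# Hypothetical toy sequence, not taken from any real genome.
toy = "CCATGGCTGAATTTTAAGG"
print(find_orfs(toy))  # [(2, 17, 2)]
```

A real gene finder would add the further evidence mentioned in the text (codon bias, intron–exon boundaries, control elements) on top of a scan of this kind, which is why a simple ORF list on its own tends to over- or under-predict genes.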
9.8 Gene Assignment
Genome sequencing projects have thrown up some interesting, and somewhat unexpected, results. For example, even though E. coli and S. cerevisiae had been studied extensively in the laboratory for many decades, when their genomic sequence became available, it was realized that only between 30 and 40 per cent of the genes they contained had been previously characterized. In less experimentally amenable organisms, especially in humans, comparatively few genes were known before large-scale sequencing projects were undertaken. There are, however, several methods that are currently used to assign the function of a gene based only on its sequence.
• Similarity searches. Just as computational methods play an important role in defining those portions of the genome that may encode genes, the availability of large databases of known gene sequences can also be used to assign function to unknown ones. Similarity searches like this are usually performed using amino acid sequences because comparisons of the four DNA bases will often yield similar sequences even though the encoded proteins are very different. Many genes that encode proteins with the same function in different organisms will be similar. For example, almost all organisms have the ability to convert the sugar galactose into glucose-6-phosphate so that it can be fed into the glycolytic pathway. The first step of this pathway is the conversion of galactose into galactose-1-phosphate – a reaction that is catalysed by the enzyme galactokinase. All organisms possess their own galactokinase enzyme, and the galactokinases from different organisms each have their own unique sequence. However, most likely as a result of having to perform the same chemical reaction, galactokinase enzymes are related to each other. That is, the amino acid sequence of the galactokinase from one organism shares similarity with the galactokinase from another organism (Figure 9.12). Although only 30 per cent of the approximately 6000 yeast genes had a previously ascribed function (see above), the function of an additional 30 per cent could be ascribed based on similarity searches. This still leaves 40 per cent of the identified yeast genes having no known function. Of course, some of these genes may not be real, perhaps being incorrectly assigned as genes, but many will need to have their function assigned by other mechanisms.
• Experimental gene assignment. In experimental organisms, such as E. coli or yeast, one of the most popular ways of ascribing a function to an unknown gene is to make a gene knockout. As we will see in later chapters, homologous recombination is used in both yeast and in higher-eukaryotic cells to disrupt the functional copy of a gene within a genome. The phenotype of the disrupted mutant can then be assessed in order to attempt to identify the natural function of the wild-type gene. This approach works well for many genes. For example, the previously uncharacterized yeast gene SNU17 shows little similarity to other proteins when compared using database searches. A yeast strain knocked out for SNU17, however, shows a slow-growth phenotype and is defective in pre-mRNA splicing (Gottschalk et al., 2001), indicating that the protein is involved in the splicing process. The difficulty with this approach is that, often, the deleted strain is either non-viable or is indistinguishable from the wild-type. Neither of these outcomes makes functional assignment possible – the non-viable state suggests that the protein may be playing a vital role in the cell, but may not yield any further clues to that role. An alternative approach to gene assignment is to overproduce a protein, by carrying the gene on a high-copy-number plasmid, to attempt to observe a phenotype.

Figure 9.12. Sequence comparison of galactokinases from different species. (a) The galactokinase reaction: galactose and ATP are converted to galactose-1-phosphate and ADP. (b) Comparison of a region of human galactokinase (amino acids 35–54 of the 392 amino acid protein). Sequences shown: Hs-GAL1, Hs-GAL2, Sc-GAL1, Ec-galK, Bs-galK, Ca-GAL1, Hi-galK, St-galK, Kl-GAL1 and At-GAL1. Key: Hs – Homo sapiens, Sc – Saccharomyces cerevisiae, Ec – Escherichia coli, Bs – Bacillus subtilis, Ca – Candida albicans, Hi – Haemophilus influenzae, St – Salmonella typhimurium, Kl – Kluyveromyces lactis, At – Arabidopsis thaliana. Amino acids have been coloured according to their properties. Blue indicates positively charged amino acids (H, K, R), red indicates negatively charged residues (D, E), green indicates polar neutral residues (S, T, N, Q), grey indicates non-polar aliphatics (A, V, L, I, M) and purple indicates non-polar aromatic residues (F, Y, W). Brown is used to indicate proline and glycine, while yellow indicates cysteine.
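The similarity comparisons described in this section can be made concrete with a small calculation. The sketch below computes the percentage identity between two amino acid sequences that have already been aligned; it is a simplified stand-in for what a database search reports, not the algorithm used by any particular search tool. The function name and both peptide fragments are hypothetical and are not taken from Figure 9.12.

```python
def percent_identity(seq_a, seq_b):
    """Percentage of aligned positions that carry the same residue.

    Both sequences must already be aligned and of equal length.
    Positions where either sequence has a gap ('-') are skipped.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    pairs = [(a, b) for a, b in zip(seq_a, seq_b) if a != "-" and b != "-"]
    if not pairs:
        return 0.0
    matches = sum(1 for a, b in pairs if a == b)
    return 100.0 * matches / len(pairs)


# Hypothetical aligned fragments, invented for illustration only.
query = "GEHTDYNDGL"
subject = "GEHIDYCDGL"
print(percent_identity(query, subject))  # 80.0
```

With only four possible bases, two unrelated DNA sequences of uniform composition would be expected to match at roughly one position in four by chance, whereas two unrelated protein sequences drawn uniformly from 20 amino acids would match at roughly one position in 20. This is one way to see why protein comparisons give a cleaner signal than DNA comparisons.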
Despite the availability of the techniques described above, much of the assignment of gene function must be performed on an individual gene basis. Even in an experimentally tractable organism such as yeast, this remains a large task for the 2000 or so genes of unknown function, and the complete functional assignment of the 30 000 or so human genes seems daunting.
9.9 Bioinformatics
The availability of huge amounts of sequence information from an increasingly large number of fully characterized genomes has led to problems in the way in which the data is stored and accessed. Bioinformatics is the study of this biological information. It brings together the avalanche of biological data (genome sequence and other experiments) with the analytical theory and practical tools of mathematics and computer science. Bioinformatics aims to
• develop new algorithms and statistics with which to assess the relationships among members of large data sets,
• analyse and interpret various types of data including DNA and amino acid sequences, protein domains and protein structures, and
• develop and implement tools that enable efficient access and management of different types of information.
Table 9.1. Curated genome sequencing projects

Organism (type)                            Web site(s)
Escherichia coli (bacterium)               www.genome.wisc.edu
Bacillus subtilis (bacterium)              genolist.pasteur.fr/SubtiList
Saccharomyces cerevisiae (yeast)           genome-www.stanford.edu/Saccharomyces
Caenorhabditis elegans (nematode worm)     www.wormbase.org
Drosophila melanogaster (fruit fly)        flybase.bio.indiana.edu
Arabidopsis thaliana (plant)               www.arabidopsis.org
Mus musculus (mouse)                       www.informatics.jax.org
Homo sapiens (human)                       www.ncbi.nlm.nih.gov/genome/guide/human/
We have already touched upon the use of computers to align DNA sequences to form contigs and in the search for similar genes, but their role does not stop there. Raw sequence information, e.g. the entire sequence of a chromosome, deposited into a database is important for the analysis of gene and gene function. Perhaps more important, and certainly more useful to the majority of researchers, is to have an integrated collection of genes, proteins and experimental evidence relating to the function of both. Curated databases (Table 9.1) attempt to collate the available information and present it in a format that is more user friendly than a list of DNA sequences. These databases generally allow users to search for gene or protein names or sequences and will often also guide users to published literature relating to their search topic. As we will see in Chapter 10, the analysis of the relationship between gene products under a variety of experimental conditions provides another layer of complexity to understanding gene function. The ability to integrate and analyse this data is vital if we are to gain real benefits in a post-genome age.
10 Post-Genome Analysis
a low frequency to produce much less protein than the same amount of transcript containing commonly used codons.
• Proteome – the protein content of a cell. This could be thought of as the translated component of the genome, but in many cases the protein products produced by a cell may differ from those predicted from the transcriptome. Post-translational modifications can radically alter the function of many proteins.
• Metabolome – the small molecule metabolites present within the cell. The quantity and identity of primary and secondary metabolites in a cell will vary greatly depending upon its physiological state. Changes in the metabolome may, however, reflect the function of the proteins required for metabolism.
The ability to measure changes in each of the above can help to define the precise cellular processes that occur under particular circumstances. For example, what changes occur within a cell during its conversion from a normal to a cancerous state? What genes are turned on or off? What proteins are made, and how do these differ from the normal complement of proteins? A number of techniques have been devised to address these issues. Many of the experiments designed to address the global effects on gene function have been performed using the yeast Saccharomyces cerevisiae as a model eukaryotic organism. Yeast has the advantage of a relatively small genome (approximately 6300 genes) with compact intergenic regions and few introns. This, combined with the ability to perform rapid and powerful genetic analyses, makes it an ideal system to study the interactions between genes and gene products. Most of the experiments described below were performed first on yeast prior to moving to the larger genomes of more complex higher eukaryotes.
10.1 Global Changes in Gene Expression
The expression levels of individual genes can be modulated in response to a variety of extra-cellular and intra-cellular signals. The complement of genes that are expressed within a cell at a particular time gives a ‘snap-shot’ of the proteins that it is currently producing. For example, the treatment of human cells with a particular drug may induce changes in expression of genes required for the response to that drug; e.g. those proteins required for drug metabolism may be produced at a higher level. We have already discussed a number of techniques that are aimed at monitoring changes in gene expression, e.g. Northern blotting (Chapter 2) and RT-PCR (Chapter 4). These methods, however, require that alterations in the expression levels of specific genes be observed. This approach
both limits the number of genes that can be analysed and biases the results obtained just to the genes whose expression pattern is observed. The researcher performing the experiment will have to judge in advance which genes are likely to show altered expression following a particular treatment. A far more systematic approach is to test the expression levels of all genes within the genome and to see how these levels are altered under particular circumstances. This allows for ‘unexpected’ gene expression alterations to be observed. A number of approaches have been designed to monitor gene expression changes on a genome-wide level.
10.1.1 Differential Display
Differential display is a method for monitoring global changes in gene expression levels that does not require prior knowledge of the sequence of the genome. It is based on the systematic amplification of the 3′-ends of mRNA molecules (Liang and Pardee, 1992). As we have seen previously, the 3′-ends of most mRNA molecules contain a poly(A) tail. Anchored primers are designed to bind to the 5′ boundary of the poly(A) tail and act as starting points for a reverse transcription reaction (Figure 10.1). The single cDNA strands produced are then PCR amplified using the anchored primer and an upstream primer of arbitrary, but known, sequence. Different arbitrary primers are used to amplify different sets of cDNA molecules derived from a population of mRNA molecules. The population of PCR products produced by this method is then separated by denaturing polyacrylamide gel electrophoresis – like the DNA sequencing gels we saw in Chapter 9. The amount of PCR product generated from each mRNA molecule in the original sample should be proportional to the amount of RNA from which it was derived. Consequently, the relative abundance of individual mRNA molecules can be compared directly in different RNA samples from related sources. Using multiple primer combinations, differential display is able to visualize all the expressed genes in a cell in a systematic and sequence-dependent manner. One of the first reported uses of differential display was to compare the mRNAs from normal and tumour derived human mammary epithelial cells, cultured under the same conditions. The identification of genes specifically expressed in tumour cells but not in normal cells (potential oncogenes), or those expressed in normal cells only (potential tumour suppressor genes), is important for understanding the molecular basis of cancer (Liang et al., 1992).
Sequences that are identified by differential display as being either up- or down-regulated under particular conditions may be excised from the gel in which they are separated, re-amplified by PCR, cloned and sequenced.
Figure 10.1. Differential display to detect changes in gene expression. (a) Conversion to cDNA is achieved by first using an anchored oligo-dT primer to create a single cDNA strand corresponding to the 3′-end of the mRNA. Each of the three anchored primers (whose 3′-end is either G, A or C) will produce a population of single-stranded cDNA molecules based upon the presence and abundance of individual mRNA molecules within the sample. Second-strand synthesis is then performed using a set of arbitrary primers of known sequence. Different combinations of arbitrary primers and anchored oligo-dT primers will amplify all possible permutations of the first cDNA strand. The PCR fragments produced are separated using a polyacrylamide gel and differences in expression of genes within the tissue samples can be detected through the analysis of the intensity of individual bands. (b) Differential display of four RNA samples (one normal and three cancerous) using three different anchored primers (G, A and C) in combination with three arbitrary primers (1, 2 and 3). The red box indicates some of the gene products that are highly expressed in the cancer cells and not in the normal cells, and the green box indicates genes expressed only in normal cells. Reproduced, with permission, from GenHunter Corporation, www.genhunter.com
The fragments made in this way are biased toward the 3′-end of genes and are therefore unlikely to represent full-length cDNA clones. The differentially expressed sequences can, however, be used as probes to isolate full-length cDNA and genomic DNA – either through library screening or computer searches. Differential display is a powerful technique for analysing gene expression changes. It does, however, suffer from the problem that even seemingly modest changes in cellular conditions can be accompanied by alterations in the levels of massive numbers of genes. Additionally, multiple primer combinations (>300) are required to analyse effectively all potential mRNA molecules that may be produced within a cell (Crawford et al., 2002).
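The way anchored primers divide an mRNA population into subsets can be sketched in a few lines. The example below assumes the three-primer scheme described for Figure 10.1, in which the 3′-terminal base of the oligo-dT primer (G, A or C) must pair with the mRNA base lying immediately 5′ of the poly(A) tail. The function names and the three toy mRNA sequences are hypothetical and chosen only for illustration.

```python
# The primer's 3'-anchor base is the DNA complement of the mRNA base that
# sits immediately 5' of the poly(A) tail (illustrative sketch only).
ANCHOR_FOR = {"C": "G", "G": "C", "U": "A"}


def anchor_base(mrna):
    """Return the 3'-anchor base (G, C or A) of the primer that would prime this mRNA.

    mrna is written 5'->3' in RNA letters and must end in a poly(A) tail.
    """
    body = mrna.upper().rstrip("A")  # strip the poly(A) tail
    if not body or len(body) == len(mrna):
        raise ValueError("expected a non-A base followed by a poly(A) tail")
    return ANCHOR_FOR[body[-1]]


def partition_by_anchor(mrnas):
    """Group mRNA names by the anchored primer (G, A or C) that would prime them."""
    pools = {"G": [], "A": [], "C": []}
    for name, seq in mrnas.items():
        pools[anchor_base(seq)].append(name)
    return pools


# Hypothetical toy transcripts, invented for illustration only.
toy_mrnas = {
    "mRNA_1": "GGAUCCAAAAAAAA",
    "mRNA_2": "ACGUGAAAAAAA",
    "mRNA_3": "CCGAUAAAAAA",
}
print(partition_by_anchor(toy_mrnas))
# {'G': ['mRNA_1'], 'A': ['mRNA_3'], 'C': ['mRNA_2']}
```

Each pool is then amplified with a series of arbitrary upstream primers, so the total number of reactions is the number of anchored primers multiplied by the number of arbitrary primers. This is why the large number of primer combinations mentioned above is needed for full coverage.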
10.1.2 Microarrays
We have already seen that the pattern of genes expressed within a cell is characteristic of its current state. The realization that genomes may not contain as many genes as was once thought – for example 6000 in yeast and perhaps as few as 30 000 in humans – opened the possibility of individually analysing the expression of all genes within an organism. 30 000 individual experiments is still a huge number, but advances in automation and sample processing mean that this is now achievable. Virtually all changes in cell state or type can be correlated with alterations in the mRNA levels of genes. In some cases, alterations in massive numbers of genes occur. For example, in yeast, the process of sporulation is associated with a change in the expression of at least 1000 different genes – representing almost 20 per cent of the total number of genes (Chu et al., 1998). In other cases, changes in cellular environment may only alter the expression of a small subset of genes – e.g. treatment of yeast cells with copper sulphate significantly alters the expression of only five genes (Gross et al., 2000). Knowledge of the expression patterns of many previously uncharacterized genes may also provide vital clues to their function. The analysis of changes in the expression of all of the thousands or tens of thousands of genes within a genome is essential if we are to understand the interplay between genes and gene products.
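The idea of scanning every gene for a change in expression between two cell states can be illustrated with a simple calculation. The sketch below compares per-gene expression measurements from two conditions and reports the genes whose level changes by at least twofold. The twofold cut-off, the pseudocount, the function names and all of the numbers are assumptions made for the example; they are not taken from the studies cited above.

```python
import math


def log2_ratios(control, treated, pseudocount=1.0):
    """log2(treated / control) for every gene measured in both conditions.

    The pseudocount (an assumption of this sketch) avoids division by zero
    for genes with no detectable signal.
    """
    return {
        gene: math.log2((treated[gene] + pseudocount) / (control[gene] + pseudocount))
        for gene in control
        if gene in treated
    }


def changed_genes(control, treated, threshold=1.0):
    """Sorted names of genes whose |log2 ratio| is at least threshold (1.0 = twofold)."""
    ratios = log2_ratios(control, treated)
    return sorted(gene for gene, ratio in ratios.items() if abs(ratio) >= threshold)


# Hypothetical signal intensities, invented for illustration only.
control = {"geneA": 100, "geneB": 50, "geneC": 200}
treated = {"geneA": 410, "geneB": 52, "geneC": 45}
print(changed_genes(control, treated))  # ['geneA', 'geneC']
```

Applied to every gene in a genome, a comparison of this kind yields the lists of altered genes discussed in the text, from a handful after a specific treatment to a thousand or more during a large change in cell state.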
DNA microarrays have been developed as a method for rapidly analysing the expression of all genes within a genome (Shalon, Smith and Brown, 1996). They work by providing a fixed single strand of DNA to which labelled cDNA fragments can bind (Figure 10.2). The DNA fragments are physically attached to an inert support (called a chip). Several different technologies are currently used to perform microarray experiments. These differ in the way in which the DNA sequences are attached to the chip, and the length of the DNA sequence itself. The two most commonly used systems are that
