|Genomes 2 Studying Genomes Sequencing Genomes
6.2. Assembly of a Contiguous DNA Sequence
The next question to address is how the master sequence of a chromosome, possibly several tens of Mb in length, can be assembled from the multitude of short sequences generated by chain termination sequencing. We addressed this issue at the start of Chapter 5 and established that the relatively short genomes of prokaryotes can be assembled by shotgun sequencing, but that this approach might lead to errors if applied to larger eukaryotic genomes. The whole-genome shotgun method, which uses a map to aid assembly of the master sequence, has been used with the fruit-fly and human genomes, but it is generally accepted that a greater degree of accuracy is achieved with the clone contig approach, in which the genome is broken down into segments, each with a known position on the genome map, before sequencing is carried out (see Figure 5.3). We will start by examining how shotgun sequencing has been applied to prokaryotic genomes.
6.2.1. Sequence assembly by the shotgun approach
The straightforward approach to sequence assembly is to build up the master sequence directly from the short sequences obtained from individual sequencing experiments, simply by examining the sequences for overlaps (see Figure 5.1). This is called the shotgun approach. It does not require any prior knowledge of the genome and so can be carried out in the absence of a genetic or physical map.
The potential of the shotgun approach was proven by the Haemophilus influenzae sequence
During the early 1990s there was extensive debate about whether the shotgun approach would work in practice, many molecular biologists being of the opinion that the amount of data handling needed to compare all the mini-sequences and identify overlaps, even with the smallest genomes, would be beyond the capabilities of existing computer systems. These doubts were laid to rest in 1995 when the sequence of the 1830 kb genome of the bacterium Haemophilus influenzae was published (Fleischmann et al., 1995).
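The pairwise comparison that worried early critics can be illustrated with a minimal sketch. This is a toy greedy assembler, not the software used for the H. influenzae project: the function names `overlap` and `greedy_assemble`, the `min_len` threshold, and the example reads are all assumptions made for illustration, and the sketch ignores sequencing errors, reverse complements, and reads contained within other reads.

```python
from itertools import permutations

def overlap(a, b, min_len=3):
    """Length of the longest suffix of `a` that equals a prefix of `b`."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of sequences with the longest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_len, best_pair = 0, None
        # Every ordered pair is compared, which is why the cost grows
        # roughly with the square of the number of reads.
        for a, b in permutations(contigs, 2):
            k = overlap(a, b, min_len)
            if k > best_len:
                best_len, best_pair = k, (a, b)
        if best_pair is None:
            return contigs  # no remaining overlaps: these are the contigs
        a, b = best_pair
        contigs.remove(a)
        contigs.remove(b)
        contigs.append(a + b[best_len:])

# Hypothetical mini-sequences that overlap by four nucleotides each
reads = ["ATGCGTAC", "GTACCTGA", "CTGATTGC"]
print(greedy_assemble(reads))
```

With these three reads the sketch returns a single contig. Reads that share no overlap of at least `min_len` nucleotides are left as separate contigs, which is the situation that gap closure has to resolve.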
The H. influenzae genome was sequenced entirely by the shotgun approach and without recourse to any genetic or physical map information. The strategy used to obtain the sequence is shown in Figure 6.10. The first step was to break the genomic DNA into fragments by sonication, a technique which uses high-frequency sound waves to make random cuts in DNA molecules. The fragments were then electrophoresed and those in the range 1.6–2.0 kb purified from the agarose gel and ligated into a plasmid vector. From the resulting library, 19 687 clones were taken at random and 28 643 sequencing experiments carried out, the number of sequencing experiments being greater than the number of plasmids because both ends of some inserts were sequenced. Of these sequencing experiments, 16% were considered to be failures because they resulted in fewer than 400 bp of sequence. The remaining 24 304 sequences gave a total of 11 631 485 bp, corresponding to six times the length of the H. influenzae genome, this amount of redundancy being deemed necessary to ensure complete coverage. Sequence assembly required 30 hours on a computer with 512 MB of RAM, and resulted in 140 lengthy contiguous sequences, each of these sequence contigs representing a different, non-overlapping portion of the genome.
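The redundancy figure can be checked with a few lines of arithmetic. The numbers below are those quoted above; the final step assumes that reads fall at random so that per-base coverage follows a Poisson distribution, which is a common simplification and not something stated in the text.

```python
import math

GENOME_BP = 1_830_000     # genome size quoted in the text (1830 kb)
TOTAL_BP = 11_631_485     # total bases in the usable sequences
USABLE_READS = 24_304     # sequences remaining after failures were discarded

coverage = TOTAL_BP / GENOME_BP        # average redundancy
mean_read = TOTAL_BP / USABLE_READS    # average usable read length in bp

# Assumption: random read placement, so the chance that a given base
# is not covered by any read is approximately exp(-coverage).
uncovered_fraction = math.exp(-coverage)

print(f"coverage: {coverage:.2f}x, mean read length: {mean_read:.0f} bp")
print(f"expected uncovered fraction: {uncovered_fraction:.4f}")
```

Under this assumption, roughly sixfold redundancy leaves only a small fraction of a percent of the genome unsequenced, which is consistent with the need for a separate gap-closure phase.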
The next step was to join up pairs of contigs by obtaining sequences from the gaps between them ( Figure 6.11 ). First, the library was checked to see if there were any clones whose two end sequences were located in different contigs. If such a clone could be identified, then additional sequencing of its insert would close the 'sequence gap' between the two contigs ( Figure 6.11A ). In fact, there were 99 clones in this category, so 99 of the gaps could be closed without too much difficulty.
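The library check described above amounts to a simple lookup. The sketch below assumes that each clone's two end sequences have already been assigned to contigs; the function name `bridging_clones`, the data layout, and the clone and contig names are hypothetical.

```python
from collections import defaultdict

def bridging_clones(end_hits):
    """Map each pair of contigs to the clones whose two ends lie in them.

    `end_hits` maps a clone name to a tuple:
    (contig holding one end sequence, contig holding the other end sequence).
    """
    bridges = defaultdict(list)
    for clone, (left, right) in end_hits.items():
        if left != right:  # ends in different contigs: the insert spans a gap
            bridges[tuple(sorted((left, right)))].append(clone)
    return dict(bridges)

# Hypothetical end-sequence placements
end_hits = {
    "clone_001": ("contig_A", "contig_A"),
    "clone_002": ("contig_A", "contig_B"),
    "clone_003": ("contig_C", "contig_B"),
    "clone_004": ("contig_B", "contig_A"),
}
print(bridging_clones(end_hits))
```

Each key in the result names a sequence gap, and any clone listed against it is a candidate for the additional sequencing that closes that gap.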
This left 42 gaps, which probably consisted of DNA sequences that were unstable in the cloning vector and therefore not present in the library. To close these 'physical gaps' a second clone library was prepared, this one with a different type of vector. Rather than using another plasmid, in which the uncloned sequences would probably still be unstable, the second library was prepared in a bacteriophage λ vector (Section 4.2.1). This new library was probed with 84 oligonucleotides, one at a time, these 84 oligonucleotides having sequences identical to the sequences at the ends of the unlinked contigs (Figure 6.11B). The rationale was that if two oligonucleotides hybridized to the same λ clone then the ends of the contigs from which they were derived must lie within that clone, and sequencing the DNA in the λ clone would therefore close the gap. Twenty-three of the 42 physical gaps were dealt with in this way.
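The logic of the probing experiment can be sketched as follows: collect the positive hybridization signals for each clone of the second library, then report clones that were hit by oligonucleotides from two different contigs. The function name `gaps_spanned`, the input format, and all probe, clone, and contig names are illustrative assumptions.

```python
from collections import defaultdict
from itertools import combinations

def gaps_spanned(signals, probe_contig):
    """Find clones that hybridize to end-probes from two different contigs.

    `signals` is a list of (oligonucleotide, clone) positive hybridizations.
    `probe_contig` maps each oligonucleotide to the contig it was taken from.
    """
    by_clone = defaultdict(set)
    for probe, clone in signals:
        by_clone[clone].add(probe)
    spanned = {}
    for clone, probes in by_clone.items():
        pairs = [(p, q) for p, q in combinations(sorted(probes), 2)
                 if probe_contig[p] != probe_contig[q]]
        if pairs:
            spanned[clone] = pairs  # this clone should contain the gap
    return spanned

# Hypothetical probes (two per contig, one from each end) and signals
probe_contig = {"oligo_01": "contig_A", "oligo_02": "contig_A",
                "oligo_03": "contig_B", "oligo_04": "contig_B"}
signals = [("oligo_02", "lambda_17"), ("oligo_03", "lambda_17"),
           ("oligo_01", "lambda_05"), ("oligo_04", "lambda_22")]
print(gaps_spanned(signals, probe_contig))
```

In this toy data set only one clone carries the ends of two different contigs, so it is the only one worth sequencing for gap closure.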
A second strategy for gap closure was to use pairs of oligonucleotides, from the set of 84 described above, as primers for PCRs of H. influenzae genomic DNA. Some oligonucleotide pairs were selected at random and those spanning a gap identified simply from whether or not they gave a PCR product (see Figure 6.11B). Sequencing the resulting PCR products closed the relevant gaps. Other primer pairs were chosen on a more rational basis. For example, oligonucleotides were tested as probes with a Southern blot of H. influenzae DNA cut with a variety of restriction endonucleases, and pairs that hybridized to similar sets of restriction fragments were identified. The two members of an oligonucleotide pair identified in this way must be contained within the same restriction fragments and so are likely to lie close together on the genome. This suggests that the two contigs from which the oligonucleotides are derived are adjacent, so the gap between them can be spanned by a PCR of genomic DNA using the two oligonucleotides as primers. The PCR product then provides the template DNA for gap closure.
The demonstration that a small genome can be sequenced relatively rapidly by the shotgun approach led to a sudden plethora of completed microbial genomes. These projects demonstrated that shotgun sequencing can be set up on a production-line basis, with each team member having his or her individual task in DNA preparation, carrying out the sequencing reactions, or analyzing the data. This strategy resulted in the 580 kb genome of Mycoplasma genitalium being sequenced by five people in just eight weeks (Fraser et al., 1995), and it is now accepted that a few months should be ample time to generate the complete sequence of any genome less than about 5 Mb, even if nothing is known about the genome before the project begins. The strengths of the shotgun approach are therefore its speed and its ability to work in the absence of a genetic or physical map.
6.2.2. Sequence assembly by the clone contig approach
The clone contig approach is the conventional method for obtaining the sequence of a eukaryotic genome and has also been used with those microbial genomes that have previously been mapped by genetic and/or physical means. In the clone contig approach, the genome is broken into fragments of up to 1.5 Mb, usually by partial restriction (Section 5.3.1), and these are cloned in a high-capacity vector such as a BAC or a YAC (Section 4.2.1). A clone contig is built up by identifying clones containing overlapping fragments, which are then individually sequenced by the shotgun method. Ideally the cloned fragments are anchored onto a genetic and/or physical map of the genome, so that the sequence data from the contig can be checked and interpreted by looking for features (e.g. STSs, SSLPs, genes) known to be present in a particular region.
Clone contigs can be built up by chromosome walking, but the method is laborious
The simplest way to build up an overlapping series of cloned DNA fragments is to begin with one clone from a library, identify a second clone whose insert overlaps with the insert in the first clone, then identify a third clone whose insert overlaps with the second clone, and so on. This is the basis of chromosome walking, which was the first method devised for assembly of clone contigs.
Chromosome walking was originally used to move relatively short distances along DNA molecules, using clone libraries prepared with λ or cosmid vectors. The most straightforward approach is to use the insert DNA from the starting clone as a hybridization probe to screen all the other clones in the library. Clones whose inserts overlap with the probe give positive hybridization signals, and their inserts can be used as new probes to continue the walk (Figure 6.12).
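The walk can be simulated with a small sketch. Here each insert is represented by hypothetical start and end coordinates (in kb) that stand in for the real DNA, and the function `hybridizes` stands in for the hybridization experiment. In practice the coordinates are unknown and only the positive or negative signal is observed. All names and values are assumptions for illustration.

```python
def hybridizes(insert_a, insert_b):
    """Stand-in for a hybridization test: True if two inserts share sequence."""
    return insert_a[0] < insert_b[1] and insert_b[0] < insert_a[1]

def chromosome_walk(library, start):
    """Collect every clone reachable from `start` through successive overlaps."""
    contig = [start]
    frontier = [start]          # clones still to be used as probes
    while frontier:
        probe = frontier.pop()
        for name, insert in library.items():
            if name not in contig and hybridizes(library[probe], insert):
                contig.append(name)
                frontier.append(name)  # its insert becomes a new probe
    # Ordering by coordinate is only possible in this simulation.
    return sorted(contig, key=lambda name: library[name][0])

# Hypothetical library: clone name -> (start, end) of its insert in kb
library = {"c1": (0, 40), "c2": (30, 75), "c3": (70, 110), "c4": (150, 190)}
print(chromosome_walk(library, "c2"))
```

Clone `c4` is never reached because nothing in the library overlaps it, which mirrors how a walk stops when no positive signal is obtained. The sketch does not model false positives caused by repeated sequences.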
The main problem that arises is that if the probe contains a genome-wide repeat sequence then it will hybridize not only to overlapping clones but also to non-overlapping clones whose inserts also contain copies of the repeat. The extent of this non-specific hybridization can be reduced by blocking the repeat sequences by prehybridization with unlabeled genomic DNA (see Figure 5.30 ). But this does not completely solve the problem, especially if the walk is being carried out with long inserts from high-capacity vectors such as BACs or YACs. For this reason, intact inserts are rarely used for chromosome walks with human DNA and similar DNAs which have a high frequency of genome-wide repeats. Instead, a fragment from the end of an insert is used as the probe, there being less chance of a genome-wide repeat occurring in a short end-fragment compared with the insert as a whole. If complete confidence is required then the end-fragment can be sequenced before use to ensure that no repetitive DNA is present.
If the end-fragment has been sequenced then the walk can be speeded up by using PCR rather than hybridization to identify clones with overlapping inserts. Primers are designed from the sequence of the end-fragment and used in attempted PCRs with all the other clones in the library. A clone that gives a PCR product of the correct size must contain an overlapping insert ( Figure 6.13 ). To speed the process up even more, rather than performing a PCR with each individual clone, groups of clones are mixed together in such a way that unambiguous identification of overlapping ones can still be made. The method is illustrated in Figure 6.14 , in which a library of 960 clones has been prepared in ten microtiter trays, each tray comprising 96 wells in an 8 × 12 array, with one clone per well. PCRs are carried out as follows:
1. Samples of each clone in row A of the first microtiter tray are mixed together and a single PCR carried out. This is repeated for every row of every tray - 80 PCRs in all.
2. Samples of each clone in column 1 of the first microtiter tray are mixed together and a single PCR carried out. This is repeated for every column of every tray - 120 PCRs in all.
3. Clones from well A1 of each of the ten microtiter trays are mixed together and a single PCR carried out. This is repeated for every well - 96 PCRs in all.
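The three pooling steps above can be sketched as set operations. The code assumes an idealized PCR with no false positives or negatives, identifies each clone by a hypothetical (tray, row, column) address, and uses illustrative function names `pooled_pcr` and `decode`.

```python
from itertools import product

TRAYS, ROWS, COLS = range(1, 11), "ABCDEFGH", range(1, 13)

def pooled_pcr(positives):
    """Return the pools that give a product under each pooling scheme."""
    row_pools = {(t, r) for t, r, c in positives}    # step 1: rows per tray
    col_pools = {(t, c) for t, r, c in positives}    # step 2: columns per tray
    well_pools = {(r, c) for t, r, c in positives}   # step 3: wells across trays
    return row_pools, col_pools, well_pools

def decode(row_pools, col_pools, well_pools):
    """Clones that are consistent with every positive pool."""
    return {(t, r, c) for t, r, c in product(TRAYS, ROWS, COLS)
            if (t, r) in row_pools and (t, c) in col_pools
            and (r, c) in well_pools}

# Total number of PCRs: 80 row pools + 120 column pools + 96 well pools
n_pcrs = len(TRAYS) * len(ROWS) + len(TRAYS) * len(COLS) + len(ROWS) * len(COLS)
print(n_pcrs)

# Two hypothetical overlapping clones in the same tray are still resolved
positives = {(3, "B", 7), (3, "D", 2)}
print(decode(*pooled_pcr(positives)) == positives)
```

With only the row and column pools, two positives in one tray would give four candidate wells. The third set of pools removes the two spurious candidates, although extra candidates can reappear when more clones are positive.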
As explained in the legend to Figure 6.14, these 296 PCRs provide enough information to identify which of the 960 clones give products and which do not. Ambiguities arise only if a substantial number of clones turn out to be positive.
Newer more rapid methods for clone contig assembly
Even when the screening step is carried out by the combinatorial PCR approach shown in Figure 6.14, chromosome walking is a slow process and it is rarely possible to assemble contigs of more than 15–20 clones by this method. The procedure has been extremely valuable in positional cloning, where the objective is to walk from a mapped site to an interesting gene that is known to be no more than a few Mb distant. It has been less valuable for assembling clone contigs across entire genomes, especially with the complex genomes of higher eukaryotes. So what alternative methods are there?
The main alternative is to use a clone fingerprinting technique. Clone fingerprinting provides information on the physical structure of a cloned DNA fragment, this physical information or 'fingerprint' being compared with equivalent data from other clones, enabling those with similarities - possibly indicating overlaps - to be identified. One or a combination of the following techniques is used ( Figure 6.15 ):
As with chromosome walking, efficient application of these fingerprinting techniques requires combinatorial screening of gridded clones, ideally with computerized methodology for analyzing the resulting data.
6.2.3. Whole-genome shotgun sequencing
The whole-genome shotgun approach was first proposed by Craig Venter and colleagues as a means of speeding up the acquisition of contiguous sequence data for large genomes such as the human genome and those of other eukaryotes (Venter et al., 1998; Marshall, 1999). Experience with conventional shotgun sequencing (Section 6.2.1) had shown that if the total length of sequence that is generated is between 6.5 and 8 times the length of the genome being studied, then the resulting sequence contigs will span over 99.8% of the genome sequence (Fraser, 1997), with a few gaps that can be closed by methods such as those developed during the H. influenzae project (see Figure 6.11 ). This implies that 70 million individual sequences, each 500 bp or so in length, corresponding to a total of 35 000 Mb, would be sufficient if the random approach were taken with the human genome. Seventy million sequences is not an impossibility: in fact, with 75 automatic sequencers, each performing 1000 sequences per day, the task could be achieved in 3 years.
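The throughput estimate can be reproduced directly from the figures quoted above. The only added assumption is that the sequencers run every day of the year.

```python
READS = 70_000_000                    # individual sequences required
READ_LEN_BP = 500                     # approximate length of each sequence
SEQUENCERS = 75
READS_PER_SEQUENCER_PER_DAY = 1000

total_mb = READS * READ_LEN_BP / 1_000_000
days = READS / (SEQUENCERS * READS_PER_SEQUENCER_PER_DAY)
years = days / 365                    # assumes uninterrupted daily operation

print(f"{total_mb:,.0f} Mb of raw sequence in about {years:.1f} years")
```

The calculation gives a little over two and a half years, consistent with the three-year figure once some allowance is made for downtime and failed reactions.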
The big question was whether the 70 million sequences could be assembled correctly. If the conventional shotgun approach is used with such a large number of fragments, and no reference is made to a genome map, then the answer is certainly no. The huge amount of computer time needed to identify overlaps between the sequences, and the errors, or at best uncertainties, caused by the extensive repetitive DNA content of most eukaryotic genomes (see Figure 5.2), would make the task impossible. But with reference to a map, Venter argued, it should be possible to assemble the mini-sequences in the correct way.
Key features of whole-genome shotgun sequencing
The most time-consuming part of a shotgun sequencing project is the 'finishing' phase when individual sequence contigs are joined by closure of sequence gaps and physical gaps (see Figure 6.11 ). To minimize the amount of finishing that is needed, the whole-genome shotgun approach makes use of at least two clone libraries, prepared with different types of vector. Two libraries are used because with any cloning vector it is anticipated that some fragments will not be cloned because of incompatibility problems that prevent vectors containing these fragments from being propagated. Different types of vector suffer from different problems, so fragments that cannot be cloned in one vector can often be cloned if a second vector is used. Generating sequence from fragments cloned in two different vectors should therefore improve the overall coverage of the genome.
What about the problems that repeat elements pose for sequence assembly? We highlighted this issue in Chapter 5 as the main argument against the use of shotgun sequencing with eukaryotic genomes, because of the possibility that jumps between repeat units will lead to parts of a repetitive region being left out, or an incorrect connection being made between two separate pieces of the same or different chromosomes (see Figure 5.2). Several possible solutions to this problem have been proposed (Weber and Myers, 1997), but the most successful strategy is to ensure that one of the clone libraries contains fragments that are longer than the longest repeat sequences in the genome being studied. For example, one of the plasmid libraries used when the shotgun approach was applied to the Drosophila genome contained inserts with an average size of 10 kb, because most Drosophila repeat sequences are 8 kb or shorter. Sequence jumps, from one repeat sequence to another, are avoided by ensuring that the two end-sequences of each 10-kb insert are at their appropriate positions in the master sequence (Figure 6.16).
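The end-sequence constraint can be expressed as a simple distance check. In this sketch the two end-sequences of each clone have been placed at coordinates in a draft assembly, and a clone is flagged if the distance between them differs too much from the expected insert size. The function name `check_mate_pairs`, the 20% tolerance, and the example coordinates are assumptions, and read orientation is ignored for simplicity.

```python
def check_mate_pairs(placements, insert_size=10_000, tolerance=0.2):
    """Return clones whose end-sequences are not about one insert apart.

    `placements` maps a clone name to the assembly coordinates (in bp)
    of its two end-sequences.
    """
    suspect = []
    for clone, (left, right) in placements.items():
        span = abs(right - left)
        if abs(span - insert_size) > tolerance * insert_size:
            suspect.append(clone)  # possible jump between repeat copies
    return suspect

# Hypothetical placements in a draft assembly
placements = {
    "clone_a": (12_000, 21_500),
    "clone_b": (40_000, 95_000),
    "clone_c": (130_000, 140_800),
}
print(check_mate_pairs(placements))
```

A flagged clone indicates that the assembly between its two end-sequences is probably wrong, for example because sequence between two copies of a repeat has been omitted or wrongly joined.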
The initial result of sequence assembly is a series of scaffolds, each one comprising a set of sequence contigs separated by sequence gaps - ones which lie between the mini-sequences from the two ends of a single cloned fragment and so can be closed by further sequencing of that fragment ( Figure 6.17 ). The scaffolds themselves are separated by physical gaps, which are more difficult to close because they represent sequences that are not in the clone libraries. The marker content of each scaffold is used to determine its position on the genome map. For example, if the locations of STSs in the genome map are known then a scaffold can be positioned by determining which STSs it contains. If a scaffold contains STSs from two non-contiguous parts of the genome then an error has occurred during sequence assembly. The accuracy of sequence assembly can be further checked by obtaining end-sequences from fragments of 100 kb or more that have been cloned in a high-capacity vector. If a pair of end-sequences do not fall within a single scaffold at their anticipated positions relative to each other, then again an error in assembly has occurred.
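Positioning a scaffold by its marker content can be sketched as a lookup against the map followed by a consistency check. The function name `place_scaffold`, the map layout (marker name to chromosome and position in Mb), the `max_gap_mb` threshold used to decide that two markers are non-contiguous, and all marker names are hypothetical.

```python
def place_scaffold(scaffold_sts, sts_map, max_gap_mb=1.0):
    """Position a scaffold from the mapped STSs it contains.

    Returns ("placed", (chromosome, start, end)), ("suspect", None) if the
    markers come from non-contiguous regions, or ("unplaced", None) if the
    scaffold contains no mapped markers.
    """
    hits = sorted(sts_map[s] for s in scaffold_sts if s in sts_map)
    if not hits:
        return ("unplaced", None)
    chromosomes = {chrom for chrom, _ in hits}
    if len(chromosomes) > 1:
        return ("suspect", None)      # markers from different chromosomes
    positions = [pos for _, pos in hits]
    if any(b - a > max_gap_mb for a, b in zip(positions, positions[1:])):
        return ("suspect", None)      # markers too far apart on the map
    return ("placed", (hits[0][0], positions[0], positions[-1]))

# Hypothetical STS map: marker -> (chromosome, position in Mb)
STS_MAP = {"sts1": ("chr2", 10.2), "sts2": ("chr2", 10.6),
           "sts3": ("chr2", 48.0), "sts4": ("chrX", 5.1)}

print(place_scaffold(["sts1", "sts2"], STS_MAP))
print(place_scaffold(["sts1", "sts3"], STS_MAP))
```

The threshold would need to reflect the expected scaffold length and the resolution of the map, so the value used here is only a placeholder.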
The feasibility of the whole-genome shotgun approach has been demonstrated by its application to the fruit-fly and human genomes (Adams et al., 2000; Venter et al., 2001). The question that remains, and which has been hotly debated (Patterson, 1998), is whether the sequences obtained by the whole-genome shotgun approach have the desired degree of accuracy. Part of the problem is that the random nature of sequence generation means that some parts of the genome are covered by several of the mini-sequences that are obtained, whereas other parts are represented just once or twice (Figure 6.18). It is generally accepted that every part of a genome should be sequenced at least four times to ensure an acceptable level of accuracy, and that this coverage should be increased to 8–10 times before the sequence can be looked upon as being complete. A sequence obtained by the whole-genome shotgun approach is likely to exceed this requirement in many regions, but may fall short in other areas. If those areas include genes, then the lack of accuracy could cause major problems when attempts are made to locate the genes and understand their functions (see Chapter 7). These problems have been highlighted by studies of the Drosophila genome sequence, which have suggested that as many as 6500 of the 13 600 genes might contain significant sequence errors (Karlin et al., 2001).
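The uneven coverage produced by random sequencing can be estimated with a simple model. The sketch below assumes that the number of reads covering any given base follows a Poisson distribution with a mean equal to the average coverage. This is a standard simplification that ignores cloning bias and repeats, so real projects are likely to be less even than it suggests. The function name `fraction_below` is an illustrative choice.

```python
import math

def fraction_below(depth, mean_coverage):
    """Probability that a base is covered fewer than `depth` times,
    assuming per-base coverage is Poisson with the given mean."""
    return sum(math.exp(-mean_coverage) * mean_coverage ** k / math.factorial(k)
               for k in range(depth))

for c in (4, 8, 10):
    share = fraction_below(4, c)
    print(f"mean coverage {c}x: {share:.1%} of bases covered fewer than 4 times")
```

Under this model a substantial share of the genome falls below fourfold coverage even when the average is four, and the shortfall only becomes small as the average approaches the 8–10 times range quoted above.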