Josephine Hoh, Jurg Ott
Rockefeller University, New York
jhoh@linkage.rockefeller.edu, ott@linkage.rockefeller.edu
2 May 2002
Instructions
for the use of scan statistics in gene mapping
(scanstat
program)
Legal notice: The methods and computer programs described here are copyrighted by The Rockefeller University. They are distributed freely and at no cost to academic researchers for use on a not for profit basis. However, any for profit use requires a license from The Rockefeller University. Please contact us for such a license. Currently only the executable program versions are made available.
Please note that the version currently distributed (dated 25 May 2002) works fine to the best of our knowledge. We'll keep improving it and will correct errors that may be detected. Please send email to request the program package.
Methodology
The general approach is described in Hoh and Ott (2000). Briefly, consider a number N of marker loci on the genome. At each marker, genotypes are available for two types of observations, for example:
For the j-th marker, a (single-locus) statistic, Xj, is defined that should discriminate between the two types of observations. For example, Xj might be the difference in lod scores between AA and AU sib pairs.
Let YL(t) be a moving sum of single-locus statistics over L consecutive markers, with the first item in the sum being at the t-th marker and the last element at the (t + L – 1)-st marker. For example, for L = 5 and t = 7, YL(t) is the sum of elements Xi for markers 7 through 11. The linear unconditional scan statistic of length L (Glaz and Balakrishnan 1999) is then defined as the maximum of all possible such moving sums over the genome, that is, as
SL = max{YL(1), YL(2), ..., Y L(N – L + 1)}.
What we want to do is determine scan statistics of various lengths for a given body of observations and find that scan statistic that is most significant.
The scan statistic of length L = 1 is just the largest single-locus statistic over the genome, that is, the statistic conventionally used in genome screens. Scan statistics of lengths L > 1 make use of the most significant marker and markers around it. In this sense, scan statistics are reminiscent of Bayesian analysis. However, we apply scan statistics in a strict likelihood framework.
Permutation tests
When the maximum lod score in a genome screen exceeds a critical value such as 3.3, such a result is considered significant at the a = 0.05 level even though the empirical significance level for the data at hand has not been determined. For our scan statistics, determining empirical significance levels, p, is mathematically intractable, which is why we resort to permutation (randomization) tests (Manly 1999). Under the hypothesis of no linkage or no association, any permutation of the labels for the two outcome types is expected to be equally likely. Because of the potentially very large number of possible permutations, we take a random sample of permutations and approximate empirical significance levels.
Two levels of permutations are performed. Assume that we evaluate scan statistics of lengths L = 1 through L = Lmax, where we might choose Lmax = 10. (1) For each scan statistic of a given length, the associated empirical significance level, pL, is determined. The smallest of these,
pmin = min{p1, p2, ..., pLmax },
is then our statistic of interest. (2) Because we deliberately look for the smallest p-value, we must determine the empirical significance level associated with pmin, which is done on the basis of the permutations samples obtained in level (1) (Manly 1999, page 109).
Implementation
The approach described above has been implemented in two computer programs, scanstat and statpval, where scanstat computes scan statistics of given lengths and their associated significance levels and statpval evaluates the significance level of pmin. When all markers on the genome are included in this analysis, the p-values so obtained refer to genome-wide significance levels. The user may restrict attention to a subset of markers, for example, the markers on a given chromosome.
Input and output for scanstat program
The scanstat program works with the following files, all in text (ASCII) format:
(1) Input file called scanstat.dat , consisting of a matrix with M rows and N + 1 columns, where M = number of observations (e.g., sib pairs) and N = number of markers. Each cell in the matrix contains an observation, for example, the lod score, uij, for the i-th sib pair and j-th marker. The (N + 1)st column must contain codes of 0 and 1 (or 1 and 2) indicating the two types of observations, where the higher number corresponds to the category of interest, for example, 0 = AU sib pair and 1 = AA sib pair (below, the two categories are numbered 1 and 2).
Based on the observations {uij}, the scanstat program will compute single-locus statistics, Xj = u(2)j – u(1)j, for the j-th marker, where u(2) and u(1) refer to the means in categories 2 and 1, respectively. Optionally, sums instead of means or other statistics as indicated below may be computed. The program will then compute scan statistics for the observed data and for each permutation sample. The empirical significance level will be estimated by the proportion of permutation samples exhibiting a scan statistic as large or larger than the corresponding scan statistic in the observed data.
(2) Input of parameter values: The scanstat program reads parameter values interactively from the standard input (i.e. the keyboard). In practice, it may be convenient to collect these values in a file named, for example, inputscan.txt , and call the program with the command,
scanstat <inputscan.txt >out,
which will redirect all standard input to the inputscan.txt file, with the out file collecting interactive output, which is not needed in this case.
An example for such a parameter file is as follows:
Scan
statistic, all chrom
n y for codes 1+2, n for
codes 0+1
n Subset of variables (y)?
(if so, enter y mbegin mend)
4 Statistic 1=mean2
2=mean2-1 3=sum2 4=sum2-1 5=t-test 2-1 6=mean2/se2
10 max. number of terms in sum
10000 number of random samples
The various lines of input have the following meaning:
|
Line |
Example |
Explanations |
|
1 |
(text) |
Any text, truncated to 80 characters |
|
2 |
n |
n for observation types coded as (0, 1), y for (1, 2) |
|
3 |
n |
n to use all N marker data, y for restriction to a subset of markers. If y, then this line must contain 3 items, y Nbegin Nend, where Nbegin and Nend indicate the first and last markers to be used. |
|
4 |
4 |
1 to test mean (against 0) in observation type 2; 2 to test for difference in means between types 2 and 1; 3 to test sum in observation type 2; 4 to test difference in sums between types 2 and 1; 5 t-test; 6 mean divided by standard error in type 2 observations (against 0) |
|
5 |
10 |
Lmax = maximum number of terms in scan statistic. L max = 0 specifies the default value of 10. A negative value for L max, e.g. –8, indicates that scan statistics of lengths L = 1, 2, ... should be considered until the corresponding p-values increase, but for lengths of at most (–Lmax) = 8. Currently Lmax is limited to 90 in the statpval program (no limit in scanstat program). |
|
6 |
10000 |
Number of permutation samples to be taken, currently limited to 100,000 |
(3) Input file called sumseed.txt holding a seed for the random number generator. This seed must be a negative integer number. At program termination, this file will be overwritten with the ending seeds so that seeds are always updated from one to the next program run. The random number generator implemented is ran1 and has a period of 108. It has been highly recommended by Press et al. (1992) "Numerical recipes," Cambridge University Press, New York.
(4) Output file, scanstat.out, containing results.
(5) Output file, statout.txt, containing values of permutation samples. This file can potentially be very large. It is required as input to the statpval program but will not generally be of interest to the user.
Input and output for statpval program
This program works with two text files:
(1) Input file, statout.txt, as created by the scanstat program,
(2) Output file, statpval.out, reporting the resulting significance level associated with pmin.
Computer platforms
The two programs, written in Pascal, are available for Windows and Unix. In Windows, they should be run in a command window. It is convenient to run the two programs with a batch (script) file as outlined below.
Windows
Assume
that program and data files reside in the folder (directory), c:\scanstat.
The user opens a command window and issues the commands, c:,
and cd \scanstat. To run the scanstat
program, one issues the command, for example,
scanstat
(for interactive input)
or
scanstat
<inputscan.txt >out (for parameter input
delivered in the inputscan.txt
file).
Subsequently, the statpval
program may be run. To run both program sequentially with one command, that
command is
goscan
xx,
where xx indicates that the parameter
file is called inputxx.txt.
For example, with the parameter input file called inputscan.txt, the appropriate command
is goscan scan. Or when markers on
chromosome 7 only should be considered and the corresponding input file is
called input07.txt,
the appropriate command is goscan 07.
Unix
The two programs are available for Sun Sparcstations running under Sun OS (Solaris), and for PCs running under Linux. To run both programs consecutively with the goscan script file, first copy the input data to the file, scanstat.dat . After completion of the statpval program, it is prudent to rename the resulting output file, statpval.out, to something else to avoid that the file will be overwritten the next time the programs run. When you receive the files and copy them to your directory, you may have to make the goscan file executable with the command, chmod +x goscan. To run the goscan script in the background, the command, (goscan >out)&, may be issued.
Practical considerations, Outlook
In permutation (randomization) tests, the choice of test statistic, X i, is important. The optimal choice may depend on the kind of data at hand (Manly 1999). Currently, only two simple test statistics are implemented, i.e., means and sums or their differences between the two observation types. Other possible test statistics might be the difference between means weighted by the inverse of their variances (t-test). When weights are used, only means can be computed, not sums.
Depending on the size of the data, the total number of possible permutations can be astronomical. It is then important that the number of random samples be large enough to ensure a somewhat good coverage of the space of all permutations. In our experience, 20,000 is a good number but 50,000 to 100,000 is better.
There is also a question of what maximum length, Lmax, should be chosen. A large value of Lmax tends to increase the overall p-value. On the other hand, too small a value of Lmax may not find the scan statistic with associated smallest p-value. One of our original recommendations was to choose Lmax at the point of L when the p-value starts increasing. However, this no longer seems to be a beneficial approach. For linkage analyses with a marker spacing of 10 cM, Lmax = 10 appears useful. For smaller marker spacings, Lmax may have to be increased.
Scan statistics should also be applied to each chromosome separately. There is currently a way to consider a subset of marker loci (see section on input, above) but the programs have to be run from scratch for each chromosome. We are planning on automating this process so that the programs run all chromosomes consecutively and combine the results in a suitable manner.
References
Glaz J, Balakrishnan N, eds (1999) Scan Statistics and Applications . Birkhäuser, Basel
Hoh J, Ott J (2000) Scan statistics to scan markers for susceptibility genes. Proc Natl Acad Sci USA 97, 9615-9617
Manly BJF (1998) Randomization, Bootstrap and Monte Carlo Methods in Biology, 2nd edition. Chapman & Hall, New York