8.4.3 Titan test

The Titan test was also developed and is scored by Dr. Hoeflin. Like the Mega, it is a 48-item take-at-home test, and it is modeled closely on the Mega.

Certainly, we would have liked to provide an item analysis, or at least to review norming data, for the Titan. Even without that analysis and data, some of us would have been comfortable continuing to use the Titan for admissions at present, on the basis of the following considerations, had it not been for known compromises.

Matched-pair data provide the scores of 114 subjects on both the Mega and Titan tests. See figure 15 below. The mean of the Titan raw scores in this set is 20.1 and the mean of the Mega raw scores is 22.3. The difference between means was highly significant (p<0.001) according to a t-test. So across the full range of scores, the Titan is, perhaps, two problems tougher. The correlation between tests was 0.82.

Examining the raw scores of the subjects with combined Mega and Titan raw scores of 48 (n=46) -- people near the Prometheus Society membership criteria interest range -- reveals that the means of the two tests for that group were Mega = 31.4, Titan = 31.3. The difference between the means is statistically insignificant, as one might expect.

Figure 15 shows a correlation between scores of individuals taking both the Mega and Titan. Using score pairing equipercentile equating methods for calibration, the fourteenth Titan test score was a 36 and the fourteenth Mega score was a 35. See figure 16. The 46th Titan score was a 24 and 46th Mega score was also a 24 -- a fairly close pairing.
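The rank pairing described above can be sketched in a few lines. This is only one simple reading of equipercentile equating for matched-pair data (equal sample sizes, so equal rank means equal percentile); the function name and the sample scores are made up for illustration and are not the committee's data or procedure.

```python
def equate_by_rank(scores_a, scores_b):
    """Pair the k-th highest score on test A with the k-th highest on test B."""
    if len(scores_a) != len(scores_b):
        raise ValueError("matched-pair data must have equal lengths")
    ranked_a = sorted(scores_a, reverse=True)
    ranked_b = sorted(scores_b, reverse=True)
    return list(zip(ranked_a, ranked_b))


# Hypothetical raw scores for five subjects (NOT the actual Mega/Titan data).
mega = [35, 12, 24, 30, 18]
titan = [20, 36, 10, 24, 29]

pairs = equate_by_rank(mega, titan)
for rank, (m, t) in enumerate(pairs, start=1):
    print(f"rank {rank}: Mega {m} <-> Titan {t}")
```

Reading down the list gives the score on one test that sits at the same rank as a given score on the other, which is the kind of comparison quoted for the fourteenth and 46th scores.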

The consensus opinion of those on the committee who have done both tests is that the Titan is 2 to 3 problems harder than the Mega. The statistical evidence, however, seems to indicate that the Mega may be a bit more difficult, but at the higher ranges we are trying to measure, they are almost identical. It is interesting that Ron Hoeflin has also characterized the Titan as more difficult at the lower range and equivalent at the upper end.

Figure 15: Titan vs. Mega (48-item) Correlation of Score Pairs
Figure 16: Titan vs. Mega (48-item) Equipercentile Equating

 

The Titan appears to be less compromised at this point in time than the Mega -- our impression is that most people who examine both tests opt to use the Mega because the Titan appears more difficult at first glance and, perhaps, "less fun". Answers to the Titan problems have on occasion appeared on the Internet over the last couple of years. A serious problem in this regard is that we cannot perform item response (IRT) or other analyses necessary to develop a sub-test. We do not even have enough data to effectively check its characteristics.

According to data supplied by the membership officer, very few people have been admitted to Prometheus by the Titan, so evidently people aren't "leaking in" due to this test being too easy or answer leakage being too severe as of yet.

We feel that it is most unfortunate to have to recommend suspension of this test from our qualification list at this time and hope that sufficient data will be provided in the near future so that the test can again be certified for use by the Society. Ron has assured us that he will provide the data so that we will be able to add an addendum to our recommendation if the data warrant the Titan's retention in some form. However, as of this time there is insufficient data to work around the known compromises to this test and we must stop the leak.

8.4.4 LAIT (scored before Dec. 31, 1993)

The norming data on the LAIT has not been made available to this committee by the test developer. However, since the LAIT is no longer being scored, having been retired some time ago when its answers were published, we are not concerned about continued Prometheus Society criteria erosion vulnerabilities due to this test. Many members have been accepted into the Society based on scores on this test in the past and members of record at two dates in the past have been assured entry to the Society so it seems reasonable to retain LAIT scores obtained prior to Dec. 31, 1993 as satisfying entry criteria.

There have been legal problems and some controversy with regard to the legitimacy of this test, but we do not believe that these are of much concern since the test is no longer being scored.

Figure 17: LAIT vs. Mega score pairs

 
Cursory review of Kevin Langdon's 2nd norming of the LAIT together with more recent data relating LAIT scores to Mega scores as shown in the following figure 17 has persuaded us that it is reasonable to retain a LAIT-IQ score of 164 as satisfying the 1-in-30,000 of the general population criterion, though it would have been nice to have had more data.

 

g loading of the LAIT:

The following excerpts are from Grady Towers's "Letters to Kevin Langdon" (Noesis 131 -- Special Issue on Psychometric Issues, 11, September 1998). Grady discussed LAIT/Mega analyses in the "3rd" letter dated 4/28/98, and factor analysis in his "4th" and "5th" letters dated 7/27/98 and 8/24/98. He wrote:

"I worked them out many years ago but was reluctant to publish them because of the small sample size (N=46).

There are two kinds of factor analyses extant in psychometrics: Principal Components Analysis and Common Factor Analysis. Common factor analysis is the preferred method.

What I did was to factor analyze the correlations between the LAIT and 24 Verbal items on the Mega Test, with 12 Spatial items, and 12 Numerical items. I found two important factors: the first column represents g loadings, and the second is a verbal/non-verbal bifactor.

 
 
          I      II
LAIT      .76   -.36
Verbal    .44    .47
Spatial   .84   -.09
Number    .74    .18
 
Rotating these factors to orthogonal simple structure, we get 'fluid intelligence' and 'crystallized intelligence.' "  
 
          I      II
LAIT      .83    .11
Verbal    .12    .63
Spatial   .75    .38
Number    .52    .55
 

Kevin's reply is an article entitled "Reply to Grady Towers" (Noesis 131 -- Special Issue on Psychometric Issues, 16, September 1998).

 


8.5 Scholastic Aptitude Test (SAT) -- the data and its application to the norming of other tests

 

We have decided that the SAT deserves its own heading in this Membership Committee Report since the analysis of its data is central to our task. Correlation of paired scores with the SAT is the major basis of the norming of the Mega test, which has satisfied the criteria for membership in the Society (and which we recommend continue to do so in the form of the Mega27 subset test). In addition, the SAT has been analyzed to determine the appropriateness of using a cutoff SAT score for qualification to the Society, as described further on.

8.5.1 Background data

A couple of caveats are in order. First of all, the SAT has changed fairly substantially over the years. The analyses that we have performed and the use to which the SAT has been put in norming other tests in this report involves exclusively what we call the "old" SAT. To distinguish this version, it is essential to note that: The "new" SAT has been deployed since April 1, 1995. The "old" SAT was administered prior to that date.

The maximum score of 1600 on the new SAT V+M appears to map to the score range of 1510 to 1600 on the old SAT. Given the shape of the score frequency distribution in general, we believe that most 1600's on the new SAT would fall below 1560 on the old SAT. For example, 453 out of 1,127,021 students who actually took the test in 1996-7 (probably representing some 3.5 million total 17 year olds) scored 1600 on this new SAT. This is about 1 out of 7,726 that would correspond to about a 158 IQ. We have yet to see sufficient statistically reliable data on the numbers of participants receiving these high scores from one year to the next on the new SAT, but until and unless these reveal something other than we anticipate from what we have seen, the new SAT is definitely not suitable for our purposes.
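The rarity-to-IQ arithmetic in the preceding paragraph can be checked with a short sketch. It assumes a normal distribution and a 16-point standard deviation (the scale used elsewhere in this report); the function name is illustrative only, and the 3.5 million cohort size is the estimate quoted above.

```python
from statistics import NormalDist


def rarity_to_iq(one_in_n, sd=16.0, mean=100.0):
    """IQ at which only 1 in `one_in_n` people score higher (normal model)."""
    z = NormalDist().inv_cdf(1.0 - 1.0 / one_in_n)
    return mean + sd * z


perfect_scores = 453        # 1600s on the new SAT, 1996-7
cohort = 3_500_000          # estimated total 17 year olds (from the text)

one_in_n = cohort / perfect_scores
iq = rarity_to_iq(one_in_n)

print(f"about 1 in {one_in_n:,.0f}")
print(f"equivalent IQ (SD 16, normal model): {iq:.1f}")
```

Under the same assumptions, a rarity of 1-in-30,000 comes out at roughly IQ 164, which is why a perfect score on the new SAT falls short of the Society's criterion.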

8.5.2 The SAT data correlations with IQ

The SAT does correlate highly with g. This is discussed by Arthur Jensen in The g Factor. Jensen reports on pages 559-560 that data obtained from 339 college students support the notion that much of the variance in SAT scores can be attributed to g (it is unclear from the text whether pre- or post-recentering SAT scores were used). College students are a somewhat restricted sample, so it would be expected that if the sample were the entire population, the correlations could be even higher. The g-loading of the SAT-M is shown as .698, and the g-loading of the SAT-V is .804. The g-loading of most IQ tests is around .80. Another source, Nicholas Lemann, estimates in an article, "The Great Sorting" (Atlantic Monthly, Sept. 1995), that the correlation between the verbal score and IQ is .60 to .80.

8.5.3 Cautionary notes and considerations

There are cautionary notes to be added, though: g-loading is a function of both the test involved and the population being measured. Jensen's data were obtained from a small sample of college students. (It is reasonable to view this as a controlled condition, since the population consisted entirely of college students; this could provide a control for other significant factors that affect SAT scores.) The size of the population used in the ETS data has not been specified. According to Thomas J. Bouchard (a widely recognized researcher at the University of Minnesota who studies IQ correlations between monozygotic twins), research correlating IQ with SAT scores has been inconsistent. The Stanford-Binet and SAT have been found to correlate anywhere between .445 and .8. The WAIS and SAT correlations fall in about the same range according to Bouchard. The SAT and other college admissions tests may be adequate measures of g for small homogeneous populations, e.g., a group of native-English-speaking US students who have had almost identical academic backgrounds (including learning vocabulary lists and four years of high school math; the test uses no higher than 9th grade math) and who have also had similar lifestyles and academic motivations. These limitations, however, clearly preclude the SAT from ever becoming the sole test from which to select members worldwide.

While most cognitive abilities tests are influenced by education and cultural factors, SAT tests, because of their more specific academic focus, are probably less effective in measuring "g" for people who fall into categories that one finds in more diverse populations (e.g., unsuitable education, lack of motivation to learn required subjects -- verbal/mathematical, or those suffering from math phobia, attention deficit disorder (ADD), depression, dyslexia, adverse effects of exam pressure, young children, foreign examinees, etc.). However, these conditions probably also significantly reduce the possibility of interest in membership in Prometheus.

Finally, it is possible that scores can be increased without a corresponding increase in g through long-term study undertaken with the specific goal of raising test scores (as yet there is insufficient data on this). Individuals may be able to put in extra study and practice relative to the normal comparable population and considerably improve their mathematical and verbal aptitudes. In this regard, long-term coaching should be distinguished from short-term coaching; research on the latter by the College Board indicates that short-term coaching produces score changes that are within the standard error of the test. See http://www.collegeboard.org/press/html9899/html/981123a.html. It is also worth noting that some minimal study and coaching are fairly typical of SAT participation, so that such preparation may be the norm and already taken into account in the general population distribution.

The discussion by Messick and Jungblut in "Time and method in coaching for the SAT" (Psychological Bulletin, Vol. 89, 1981) provides an argument against the efficacy of coaching to obtain uncharacteristically high scores. Discussion of the issue on pages 400-402 in The Bell Curve cites this paper; there is an excellent graph on p. 401 showing score increments for the SAT-V and SAT-M plotted in separate curves vs. hours of study.

Some facts from the text and the graph:

hours of study Verbal Math Total
30 +16 +25 +41
100 +24 +39 +63
 

300 hours of study might be expected to reap a 70 point increment on the combined score, 600 hours 85 points.
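The diminishing return implied by these figures can be made explicit by computing the gain per additional hour between successive points. The sketch below uses only the four (hours, total gain) values quoted above and assumes, for illustration, zero gain at zero hours.

```python
# (hours of study, total V+M gain) as quoted in the text above.
gains = [(30, 41), (100, 63), (300, 70), (600, 85)]


def marginal_rates(points):
    """Points gained per extra hour of study over each successive interval."""
    rates = []
    prev_hours, prev_gain = 0, 0   # assumption: no study, no gain
    for hours, gain in points:
        rates.append((gain - prev_gain) / (hours - prev_hours))
        prev_hours, prev_gain = hours, gain
    return rates


rates = marginal_rates(gains)
for (hours, _), rate in zip(gains, rates):
    print(f"up to {hours:>3} hours: {rate:.3f} points per additional hour")
```

On these figures the first 30 hours yield more than a point per hour, while every interval beyond 100 hours yields well under a tenth of a point per hour.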

The cited article is a review of all studies done to that date on this issue. These documented improvements involve the average increments at all levels and are therefore weighted for differences occurring at the average level; increments at the high end of the scale must certainly be less. One would do well to remember that coaching for the SAT is a profitable mini-industry in the U.S. Extravagant claims are to be expected on a routine basis from this industry (as for any other).

Rebuttals to this study are available, for example from The Princeton Review, which claims to provide unbiased studies proving that significant improvement (well over 200 points) is possible. (These studies are intra-institutional, like those by ETS; information about them can be obtained by contacting The Princeton Review directly or found in books published by The Princeton Review.) Other material exploring this issue is available from Samuel J. Messick in "Effectiveness of Coaching for the SAT" and "Individuality in Learning". Criticisms similar to those leveled at claims of extravagant gains have been made about the claims put forward by Herrnstein and Murray. See, for example, Measured Lies: The Bell Curve Examined; Cracks in the Bell Curve; Intelligence, Genes, and Success: Scientists Respond to the Bell Curve 'Statistics for Social Science and Public Policy'; Inequality by Design: Cracking the Bell Curve Myth; The Bell Curve Debate: History, Documents, Opinions; The Bell Curve Wars. Also, ETS has sometimes been accused of biased statistical approaches that may significantly influence the conclusions obtained. See, for example, Stephen Levy's "ETS and the Coaching Cover-up," in the March 1979 issue of New Jersey Monthly.

While all members of the Membership Committee acknowledge that there are valid criticisms of the SAT, we are in general agreement that these criticisms are insufficient to preclude its use for our purposes.

8.5.4 Intelligence filter operative in selection of SAT participants

It is well known that the SAT is administered selectively to high school age students in the US. On page 35 of The Bell Curve it is stated that, "By 1960, a student who was really smart -- at or near the 100th percentile in IQ -- had a chance of going to college of nearly 100%." There is a graph on the same page showing three curves for percentile IQ vs. percent of college attendance. The curves are for the 1920s, early 1960s and early 1980s. From the graph, it appears that in the 1980s and in the 1960s, a student at the 96th percentile in IQ had about a 92% chance of attending college (and, by implication, of taking the SAT).

From the notes in The Bell Curve on page 692, note 7: "...from top quartile [of PSAT scores], 79% went to college; of those in the top 5%, more than 95% went to college." The data in the first example used IQ scores, not SAT scores.

There is another graph on p. 37 showing two curves, one for students entering college, one for completing the B.A. as a percentage vs. percentile IQ. Quote from p. 36: "...Meanwhile about 70% of the top decile of ability were completing a B.A."

For the graph on p. 35 of The Bell Curve, the curve for the 1980s is drawn from data from the National Longitudinal Survey of Youth. This study, the backbone of much data in The Bell Curve, used IQ not SAT for its cognitive ability estimate.

As the curves in these graphs show no signs of "bending over" at the higher IQ ranges, this ought to allay fears about appreciable numbers of people at the top not taking the test. See for example, figures 19 & 20 below.

We have examined the effects of selective intelligence filtering to assess the extent to which participants differ from the general population. Only about one in three seventeen to eighteen year-olds in the US take this test although virtually all "college bound" students do take it. Filter assessment has been assisted by the availability of the National High School (NHS) survey that assessed the distribution of all students independent of whether they would have taken the SAT otherwise.

Figure 18 shows the frequency distribution of college bound students for a given year.

The scores are again quite obviously not normally distributed, although the skewing is less than for the Mega. There are again many more nominally high scores than a normal distribution would predict. In figure 19, which is described in more detail in the selective filter methodology description of section X, the effective filter is shown on an enlarged scale as the roughly diagonal curve indicating progressively more intense selection based on intelligence. The deviation at the bottom evidently arises because students with excessively low IQs do not even attend high school and therefore were not included in random samples. See Kjeld Hvatum's table presented in section 8.3.3, where the range of retardation is shown to extend well into the score levels on the SAT that are effectively missing.

The degree to which this composite filter fits the SAT data is shown particularly well in the plot on a log scale shown in figure 20. The similarity in form of this filter and that which is evident in the Mega data suggests that many of the same type of pressures must exist and again, that individuals are capable of very accurate assessments of their own cognitive abilities.

 
Figure 18: SAT (Verbal Plus Mathematical Parts) Frequency
 
Figure 19: General population distribution, actual and predicted SAT scoring distributions and the effective selective filter with raw scores going from 200 on left, 1600 on right.

It is interesting that Kjeld Hvatum in his "Letter to Ron Hoeflin" (In-Genius, Vol. 15, August 1990) says,

"Incidentally, the PSAT/NMSQT data provides a way to estimate the selectivity of SAT takers at various levels, because the PSAT is more of a 'forced' test in many schools, and the PSAT and SAT scales are equated (via a factor of 10). The ETS provides PSAT estimates 'that would be obtained if ALL students at these grade levels took the test.' A quick check indicates a factor of 3 is approximately the selectivity at the higher score levels for the SAT."

 

Figure 20: Actual and predicted SAT scoring distributions -- log scale

 

This is essentially what we have found, but one cannot just assume that the top 1/3 of the overall US high school population takes the SAT as shown in the figure above -- it is more complicated, and the filtering more effective, than that.

 

8.5.5 The ability of the SAT to discriminate at the high end of its scale

 

The graphs in figures 21 and 22 below show that the SAT has the ability to discriminate throughout its complete range of raw scores. Figure 21 shows a slight non-linearity between raw vs. scaled scores starting near a total score of 1540. On other administrations of the test (see figure 22) the questions are evidently more difficult and the raw vs. scaled graph is linear all the way to the top, suggesting that the test is indeed discriminating through its complete range.

 

Figures 21 & 22: SAT discrimination capabilities
 

The difference between 1600 and 1560 is typically 2 to 4 problems on the "old" (pre-recentered) SAT. However, when figuring percentile equivalents for the SAT, it should be remembered that they are based upon a sample of approximately 1 million actual test takers selectively sampled from a general population in excess of 3 million. It isn't unreasonable to assume that the general population percentiles that we assign to the SAT at the top end (for which selection is the highest) are accurate for the test group as a whole. In fact, however, in a population of 3 million there should be about 100 individuals scoring at the 1-in-30,000 level. In any given year fewer than ten individuals obtained a perfect score on the old SAT, with on the order of 100 or fewer scoring 1560 or more; it is therefore safe to say that the 1-in-30,000 level is achieved by these individuals.

8.5.6 Establishing a credible 1-in-30,000 of the general population raw score cutoff

As indicated throughout this report, we have chosen not to accept theoretical positions on what the distributions of test scores will be at the high end of the psychometric range nor even if it is intelligence that is being discriminated at the extreme tails of distributions, preferring actual data to accepted notions and legitimate claims of rarity to unverified claims of "super intelligence." In keeping with this philosophy, we note that of three million people in the general population for which a single SAT applies, 100 would satisfy the rarity condition. Therefore, for a given year, looking down the top 100 scores, we find for example for 1984 combined V+M for College-Bound Seniors:

SAT high range data distribution in 1984
Score Number
1600 5
1590 0
1580 27
1570 19
1560 39
1550 75
1540 96
1530 108
1520 188
1510 217
1500 278
This data is typical of data available for various years on the "old" SAT. In this case 90 individuals scored 1560 or above. 1560 is also the score that Ron Hoeflin used in his sixth norming of the Mega so this value is highly compatible with analyses performed elsewhere in this report. In Paul Maxim's article "Renorming Ron Hoeflin's Mega Test" (Gift of Fire, Issue 79, 8 - 12, October 1996), Ron Hoeflin is said to have had breakdowns of 5,157,642 SAT scores from 1984 to 1989. The top scorers for those six years were said to be distributed as follows:
SAT high range data distribution in 1984 -1989
Score Range Number
1591-1600 35
1581-1590 8
1571-1580 149
1561-1570 71
This gives an average of less than 44 per year, so we are very confident that our assessment has been, if anything, a conservative estimate of the cutoff score. We are, therefore, quite comfortable with a cutoff of 1560 as indicative of a rarity of no more than 1-in-30,000 and as a qualifying score for the Prometheus Society.
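The "count down from the top" reasoning in this section can be reproduced directly from the two tables above. The sketch below is only a check of the arithmetic; the function and variable names are illustrative, and the limit of 100 is the 3,000,000 / 30,000 figure used in the text.

```python
ELIGIBLE = 3_000_000 // 30_000   # 100 people per yearly cohort

# 1984 College-Bound Seniors, combined V+M (from the table above).
scores_1984 = {
    1600: 5, 1590: 0, 1580: 27, 1570: 19, 1560: 39, 1550: 75,
    1540: 96, 1530: 108, 1520: 188, 1510: 217, 1500: 278,
}


def lowest_cutoff(counts, limit):
    """Lowest score whose cumulative count from the top stays within limit."""
    running, cutoff, cutoff_count = 0, None, 0
    for score in sorted(counts, reverse=True):
        if running + counts[score] > limit:
            break
        running += counts[score]
        cutoff, cutoff_count = score, running
    return cutoff, cutoff_count


cutoff, count = lowest_cutoff(scores_1984, ELIGIBLE)
print(f"1984: {count} people scored {cutoff} or above")

# 1984-1989 totals for the ranges above 1560 (from the second table).
six_year_counts = [35, 8, 149, 71]
six_year_average = sum(six_year_counts) / 6
print(f"1984-1989: {six_year_average:.1f} per year above 1560")
```

Including the 1550 row would push the 1984 cumulative count to 165, well past the limit of 100, which is why the count stops at 1560.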

 


8.6 Consideration of Additional/Alternative Tests to Satisfy Prometheus Society Membership criteria

 

Wherever possible we have used Otfried Spreen's A Compendium of Neuropsychological Tests: Administration, Norms, and Commentary and the book of norms from 1991 (Comprehensive Norms for an Expanded Halstead-Reitan Battery, Heaton et al., commonly referred to as the "Heaton norms"), which is widely used in neuropsychological testing. This information may occasionally conflict with other available data. That is to be expected given the nature of normative data at the current state of the art in this field, particularly at the upper extremity. But these norms are widely used and accepted as authoritative, so we have used them for comparisons and other purposes.

8.6.1 Mensa testing approaches

Because of much greater membership, Mensa can afford quite extensive testing programs. Facilities and psychometric instruments are available throughout the world. In much the way that this committee is attempting to assist the Prometheus Society in establishing tests that it can warrant with credibility, Mensa accepts scores on various tests -- which change from time to time.

It is understood in this regard that Mensa's discrimination problems are much less demanding than ours because of their considerably lower qualifying standard. They do provide a paradigm, however, and if it were possible to tap into their resources and global support, it would have considerable merit. Greg Scott addressed this possibility in his article, "For Acceptance of Mensa Supervised Tests" (Gift of Fire, Issue 99, September 1998). We have, therefore, considered tests whereby individuals may be qualified for entry to Mensa. We have also considered counter arguments as put forth by Kevin Langdon in his article "Mensa Tests and Other Standard Tests" (Gift of Fire, Issue 81, January 1997) that was in response to Greg Scott's article as well as other issues that we have encountered.

You will see these various lines of reasoning pursued in the following sections.

8.6.2 Cattell Culture Fair III

Cattell Culture Fair III (A+B) has a history of use since the early 1920s, but the present edition is dated 1960 and was revised in 1963. Mensa used this test prior to its latest adoption of the Raven Advanced (both tests are still used by Mensa in the UK although now dropped in the US).

The features of this test are as follows:

  1. Scale III is for above average youth through adult.
  2. The norms tables include both 16 standard deviation and 24 standard deviation statistics.
  3. Age range norms exist for each of the following ages: 13, 13.5, 14, 15, 16 (adult)
  4. IQ's on Scale III range from 55 to 183 on a 16 standard deviation basis; from 20 to 219 on a 24 standard deviation basis.
Accepted conversions from raw to standard scores for the 16 standard deviation normed A+B form are as follows:

Raw score   IQ
87          163
88          165
89          167
90          168
91          169
93          173
95          176
97          179
99          183
100         187 (extrapolated)

For the 24 standard deviation scale, a combined raw 85 = IQ 190, 88 = 197, 92 = 207, 97 = 219.
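The two scales differ only in the number of IQ points assigned to one standard deviation, so a score on one can be re-expressed on the other by keeping its distance from the mean in standard deviation units fixed. The following sketch assumes a mean of 100 on both scales; the function name is illustrative.

```python
def rescale_iq(iq, from_sd=16.0, to_sd=24.0, mean=100.0):
    """Re-express an IQ on a scale with a different standard deviation."""
    return mean + (iq - mean) * to_sd / from_sd


# Raw scores that appear in both the 16-SD list and the 24-SD list above.
for raw, iq16, iq24_listed in [(88, 165, 197), (97, 179, 219)]:
    converted = rescale_iq(iq16)
    print(f"raw {raw}: IQ {iq16} (SD 16) -> {converted} (SD 24), listed {iq24_listed}")
```

For both raw scores the converted value is within half a point of the listed 24 standard deviation figure, so the two sets of norms are consistent with each other.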

The following are features of the test:

  1. Each form is 50 questions and total test time is 12.5 minutes excluding time to give directions for each of the 4 parts.
  2. The test is entirely non-verbal. Editions of the test are available in 23 foreign countries and include a Spanish edition. The IPAT (publisher) can give details about all translations.
  3. The four parts of the test are: series, classification, matrices, and conditions.
  4. Validities for Scale III include: Concept validity (direct correlations with the pure intelligence factor) at .92 (702 males and females), concrete validity (GRE, WAIS, Otis, Raven APM, Stanford-Binet, etc.) at .69 (673 males and females, students and adults), consistency over items (split-half) at .85, consistency over parts (interform correlations corrected) at .82, consistency over time (test-retest, immediate to one week) at .82.
This test is accepted by respected psychometricians throughout the world, who accept its scores up into the range of the Prometheus Society cutoff. We certainly do not lose credibility in accepting scores obtained on this test. Whereas we are skeptical of scores that are listed without indicating that they are "extrapolations" up to IQ 183 (16 points per standard deviation), we believe allowing a raw score of 88 (corresponding to an IQ of 165) on the 16 standard deviation A+B form is reasonable. It would open the global window for the Prometheus Society. It would also support our goal of being a truly international Society.
 

8.6.3 Raven's Advanced Progressive Matrices (RAPM)

Raven's Advanced Progressive Matrices is one of a series of nonverbal tests of intelligence developed by J.C. Raven (1962). Following Spearman's theory of intelligence, it was designed to measure the ability to educe relations and correlates among abstract pictorial forms and it is widely regarded as one of the best available measures of Spearman's g, or of general intelligence (e.g., Jensen, 1980; Anastasi, 1982). As its name suggests, and of particular significance to the Prometheus Society, it was developed primarily for use with persons of advanced or above average intellectual ability.

Like the other Raven's matrices tests, the APM is composed of a series of perceptual analytic reasoning problems, each in the form of a matrix. The problems involve both horizontal and vertical transformations: Figures may increase or decrease in size, and elements may be added or subtracted, flipped, rotated, or show other progressive changes in the pattern. In each case, the lower right corner of the matrix is missing and the subject's task is to determine which of eight possible alternatives fits into the missing space such that row and column rules are satisfied. The APM battery consists of two separate groups of problems. Set I consists of 12 problems that cover the full range of difficulty sampled from the Standard Progressive Matrices test. Standard timing for Set I is 5 minutes. This set is generally used only as a practice test for those who will be completing Set II. Set II consists of 36 problems with a greater average difficulty than those in Set I. Set II can be administered in one of two ways: either with or without a time limit of 40 minutes. Administering Set II without a time limit is said specifically to assess a person's capacity for clear thinking, whereas imposing a time limit is said to produce an assessment of intellectual efficiency (Raven, Court, & Raven, 1988).

Phillip A. Vernon, in his review of the APM (Test Critiques, 1984) writes that "the quality of the APM as a test is offset by the totally inadequate manual which accompanies it. For interpretive purposes, the manual provides 'estimated norms' for the 1962 APM which allow raw scores to be converted into percentiles (but only 50, 75, 90, and 95) and another table for converting percentiles into IQ scores." John Johansen, a graduate student at the University of Minnesota and former regular poster to the Brain Board, came into possession of the 1962 version of the test for use in his research (this form is no longer used for testing) along with 27 pages of written text about the implementation, scoring and standardization of the test. In a post to the Brain Board at (http://www.brain.com/bboard/read/iq-archive3/1599), he provided the following information applicable to the untimed 1962 version of the test:

Untimed intraday (go until you give up) 1962 distribution for 20 year olds, 30 year olds and 40 year olds. Scores balanced for guessing.
 

General population    Number correct by age group
percentile ranking    20 years   30 years   40 years
50                    9          7          -
75                    14         12         9
90                    21         20         17
95                    24         23         21
99                    26         25         23
99.9                  30         29         26
Norms are not accurate above this point for the untimed version due to limited population taking test in this condition. 

Ignoring the above caveat about inaccurate norms above the 99.9th percentile, the above data indicates that there is about a 4 point raw score difference between 2 and 3 sigma on this test. If this difference carries on to the next "sigma," this would give associated scores of:

 

General population    Number correct by age group
percentile ranking    20 years   30 years   40 years
99.997                34         33         30
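The extrapolation behind the 99.997th percentile row is a single constant step added to the 99.9th percentile scores. A minimal sketch follows, assuming the 4-point step per "sigma" stated in the text and the Set II ceiling of 36 items; the variable names are illustrative.

```python
# Scores at the 99.9th percentile from the 1962 untimed norms above.
norms_99_9 = {"20 years": 30, "30 years": 29, "40 years": 26}

STEP_PER_SIGMA = 4   # assumed raw-score difference per "sigma" (from the text)
CEILING = 36         # number of problems in Set II

extrapolated = {age: score + STEP_PER_SIGMA for age, score in norms_99_9.items()}

for age, score in extrapolated.items():
    headroom = CEILING - score
    print(f"{age}: extrapolated score {score}, {headroom} items below ceiling")
```

On this assumption every age group's extrapolated score stays below the ceiling, which is what makes these norms look adequate before the conflicting studies are considered.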
Although this data would seem to suggest sufficient ceiling for discriminating at the 1-in-30,000 level, there have been other normative studies which provide conflicting data. In an article in Educational and Psychological Measurement (Bors and Stokes, 1998), the authors mentioned two studies of interest besides Raven's 1962 group -- Paul's study and their own: S. M. Paul's 1985 study of 300 University of California, Berkeley students (190 women, 110 men): Tested under the untimed condition, the students' scores ranged from 7 to 36 with a mean of 27 and a standard deviation of 5.14. This was significantly higher than the mean of Raven's 1962 normative group (M=21.0, SD=4.0).

Bors and Stokes administered the timed version of the APM to 506 students (326 women, 180 men) from the Introduction to Psychology course at the University of Toronto at Scarborough. Subjects ranged in age from 17 to 30 years, with a mean of 19.96 (standard deviation=1.83). Enrollment in the Introduction to Psychology course was considered roughly representative of first-year students at this university. The scores on Set II for the 506 students ranged from 6 to 35 with a mean of 22.17 (standard deviation=5.60). This performance is somewhat higher than that of the Raven's 1962 normative group but considerably lower than Paul's 1985 University of California, Berkeley sample.

Additional data supporting the conclusion that the RAPM (either timed or untimed) does not discriminate at the 1/30,000 level is taken from Spreen & Strauss (Compendium of Neuropsychological Tests, 2nd Edition, 1998), and shown in the tables below.

A middle-of-the-road approach would be to use the recent University of Toronto at Scarborough data, to assume that the mean of the test group lies about 1 SD above the mean of the general population, and to further assume that the SD of the general population is about the same as the standard deviation of the test group. The 1-in-30,000 level lies about 4 SD above the general population mean, and therefore about 3 SD above the test-group mean. Finally, assuming a normal distribution in the test group, the 1-in-30,000 level would correspond to 22.17 + 3 * (5.60) = 39, which is 3 raw points above the test's ceiling of 36.
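The estimate above can be checked with a short calculation. This is a minimal sketch under the same assumptions as the text (test group 1 SD above the general population, equal SDs, normality). It uses the exact normal quantile for 1 in 30,000 instead of the rounded value of 4 SD, and the variable names are ours.

```python
from statistics import NormalDist

# University of Toronto at Scarborough sample (Bors and Stokes, 1998).
sample_mean = 22.17
sample_sd = 5.60
APM_CEILING = 36

# Assumption from the text: the sample mean lies about 1 SD above the
# general population mean, and both groups share the same SD.
SAMPLE_OFFSET_SD = 1.0

# z-score in the general population at the 1-in-30,000 level (close to 4).
z_general = NormalDist().inv_cdf(1 - 1 / 30_000)

# The same point expressed in SDs above the sample mean (close to 3).
z_sample = z_general - SAMPLE_OFFSET_SD

required_raw_score = sample_mean + z_sample * sample_sd

print(f"z in general population: {z_general:.2f}")
print(f"required raw score:      {required_raw_score:.1f}")
print(f"points above ceiling:    {required_raw_score - APM_CEILING:.1f}")
```

Using the exact quantile instead of a flat 4 SD changes the result only slightly; the required score still rounds to 39 and still exceeds the ceiling of 36.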

Advanced Progressive Matrices Set II: Occupational Norms

Occupations of various groups

%ile  UK         US         UK TA      UK         UK         UK         UK         UK         UK         UK
rank  General    Navy       Officer    Retail     Police     Senior     Accnt.     Oxford     Local      Rsrch.
      populatn.  25-28      Aplcnt     Mngrs.     Officer    Mngrs      Staff                 Athrty.    Scntsts.
      23 yr.olds yr. olds                                    (Htls.)
      untmd.     40 min     40 min     40 min     40 min     40 min     40 min     40 min     40 min     40 min
      (n=71)     (n=195)    (n=104)    (n=104)    (n=157)    (n=49)     (n=52)     (n=104)    (n=61)     (n=34)
95    33         29         34         30         34         28         32         34         30         33
90    31         27         32         28         32         26         31         32         28         31
75    27         23         29         25         30         22         28         30         25         28
50    22         18         25         22         27         19         25         27         21         24
25    17         13         21         19         25         15         23         25         17         21
10    12         10         18         16         22         12         20         22         13         18
5     9          8          16         14         21         10         19         21         11         16
UK general population data derived from the 1993 Standardization of the SPM and APM (TABLE APM XIII). US Navy data extracted from data supplied by Alderlon (see Knapp & Court, 1992) (TABLE APM XVII). UK Police Officers' data extracted from Feltham (1988) (TABLE APM XXVI). Other data collected by Oxford Psychologists Press. Source: J. Raven (1994).
 

The data above does bring up the issue of age variation of IQ data, which is not typically addressed by other instruments that we've used for Prometheus Society entry requirements and is perhaps something that should be considered. (In the case of the SAT and GRE tests, there is not typically much variation in the ages of those taking the test, and no such data was used in norming any of the take-at-home tests we've used.) Spreen and Strauss have provided the information for the table below:

Advanced Progressive Matrices Set II (Untimed) Smoothed Summary Norms for the USA

%ile  Age of test taker in years
rank  18-22    23-27    28-32    33-37    38-42    43-47    48-52    53-57    58-62    63-67    68+
      (n=28)   (n=53)   (n=72)   (n=77)   (n=121)  (n=69)   (n=33)   (n=36)   (n=27)   (n=33)   (n=54)
95    32       32       32       32       32       32       31       30       29       27       25
90    30       30       30       30       30       30       29       28       27       25       23
75    27       27       27       26       26       26       26       25       24       22       19
50    20       20       20       19       19       19       19       18       16       14       12
25    15       15       15       15       15       14       14       13       12       10       8
10    10       10       10       10       10       10       9        8        7        6        4
5     7        7        7        7        7        7        6        5        4        3        2
Based on the 1993 standardization of the APM in Des Moines, Iowa.

Tests completed at leisure. Source: J. Raven (1994)

Curiously, American Mensa does not list the RAPM among its currently accepted tests, although UK Mensa does. Perhaps this is a more "international" test than the others we have reviewed. Considering its quality, we should probably continue to consider its possible use, especially as an "auxiliary" test to be submitted in conjunction with other tests that are deemed capable of discriminating at the 1-in-30,000 level.

 

8.6.4 California Test of Mental Maturity (CTMM)

Bert Goldman, Dean of Academic Advising at the University of North Carolina, reviewing the "California Short-Form Test of Mental Maturity, 1963 Revision" in The Seventh Mental Measurements Yearbook, says that the reliability coefficients indicate adequate reliability. He says further that:

"Levels 0 and 1 present the weakest coefficients and when coefficients for the five factors are compared across all levels, it is noted that Spatial Relationships has the poorest reliability. The K-R 21 reliabilities reported for each type of score follow: the five factor scores, .48 to .94, median .77; language total, .71 to .95, median .80; nonlanguage total, .79 to .93, median .86; and total, .86 to .96, median .93. Considerable validity data for the Short Form of the CTMM are presented, but no data are provided for the Long Form. As an earlier reviewer pointed out, there is need for evidence of the Long Form's use for "educational selection, prediction, and guidance at each of the several age and grade levels" (Freeman, 5:314). Also lacking are validity and reliability data indicating use with the intellectual extremes (i.e., mentally deficient and superior).

No rationale is given for using eight school levels with the Short Form and only six school levels with the Long Form. Further, five factors are included in the Long Form and only four in the Short Form. No reason is given for eliminating the Spatial Relationships factor from the Short Form. However, earlier in this review it was pointed out that among the five factors this one provided the poorest reliability coefficients.

In sum, as far as group tests of intelligence are concerned, the CTMM appears to rate among the best. Its format is clear and easy to follow, its material appears durable, the norms appear representative, and its reliability while being weaker at the lower levels generally seems satisfactory. Data on validity are lacking, but if its shorter version is comparable, then considerable evidence suggests that the Long Form is valid. This leads to a question that has long stood in this reviewer's mind. Why both tests? Why not just the CTMM-SF? The Short Form takes less time to administer than the Long Form, research is available concerning its validity, and in terms of reliability it does not contain the Long Form's weakest factor (Spatial Relationships)."

There are several interesting pieces of data that would seem to suggest the CTMM may be an appropriate test for inclusion on our list. For example, the following score pair data is available on Darryl Miyaguchi's web site for the "OMNI Sample":

LAIT vs. CTMM: 5 cases -- CTMM substantially lower score in every case. Average difference = 12.8 IQ points.

Cattell vs. CTMM: 24 cases -- CTMM substantially lower score in every case. Average difference = 12.6 IQ points.

In neither of the situations described above did the difference seem to be IQ (Mega raw score) dependent! In fact, in the data included for that norming, roughly the same number of individuals reported LAIT, Cattell and CTMM scores, as follows:

CTMM high scores: 179, 162, 154, 154, 150...total of 30 scores

Cattell high scores: 191, 178, 172, 169, 164...total of 35 scores

LAIT high scores: 171, 170, 169, 167, 166...total of 35 scores

It is noted that in "Mensa Tests and Other Standard Tests" (Gift of Fire, Issue 81, January 1997), Langdon has suggested that the CTMM is inappropriate for admission to our Society because it has "a ceiling of 3.5 sigma," which is in accord with Grove's mention of a ceiling of 158. In no case was a 4-sigma LAIT or Mega score confirmed in the OMNI Sample by a CTMM score. The CTMM scores tend in general to be much lower than the other two, as can be seen in figure 4 above. This impression is further confirmed by inspection of figure 6 above where, if CTMM scores were used for norming the Mega, standard scores on the Mega would have to be dropped (as against raised!) by as much as ten points, since the CTMM score of 155 corresponds to the Mega cutoff score of 36! Clearly, if anything, the CTMM seems to underestimate IQ at these high scores. Nevertheless, we have to reject the CTMM because its ceiling of 158 is too low for our entry criterion.
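The ceiling argument can be illustrated with a short calculation. This is a rough sketch only: the report does not state the standard deviation of the CTMM scale, so the value of 16 points used here is our assumption, as are the variable names.

```python
from statistics import NormalDist

CTMM_CEILING_IQ = 158      # ceiling cited from Grove
MEAN_IQ = 100
SD_IQ = 16                 # ASSUMPTION: the SD is not given in the report

ENTRY_RARITY = 30_000      # entry criterion: 1 in 30,000

# Ceiling expressed in standard deviations above the mean.
ceiling_sigma = (CTMM_CEILING_IQ - MEAN_IQ) / SD_IQ

# Rarity of a score at the ceiling under a normal distribution.
ceiling_rarity = 1 / (1 - NormalDist().cdf(ceiling_sigma))

# z-score required by the entry criterion, and the matching IQ on this scale.
entry_sigma = NormalDist().inv_cdf(1 - 1 / ENTRY_RARITY)
entry_iq = MEAN_IQ + entry_sigma * SD_IQ

print(f"ceiling: {ceiling_sigma:.3f} sigma, about 1 in {ceiling_rarity:,.0f}")
print(f"entry criterion: {entry_sigma:.2f} sigma, IQ about {entry_iq:.0f}")
```

With a 16-point SD the ceiling falls near the "3.5 sigma" figure quoted from Langdon and well short of the roughly 4 sigma that the entry criterion requires; a different assumed SD would shift the exact numbers.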