Saturday, April 30


COMPUTERS IN LANGUAGE TESTING:
PRESENT RESEARCH AND SOME FUTURE DIRECTIONS

James Dean Brown
University of Hawai'i at Manoa


ABSTRACT

This article begins by exploring recent developments in the use of computers in language testing in four areas: (a) item banking, (b) computer-assisted language testing, (c) computerized-adaptive language testing, and (d) research on the effectiveness of computers in language testing.

The article then examines the educational measurement literature in an attempt to forecast the directions future research on computers in language testing might take and suggests addressing the following issues: (a) piloting practices in computer-adaptive language tests (CALTs), (b) standardizing or varying CALT lengths, (c) sampling CALT items, (d) changing the difficulty of CALT items, (e) dealing with CALT item sets, (f) scoring CALTs, (g) dealing with CALT item omissions, (h) making decisions about CALT cut-points, (i) avoiding CALT item exposure, (j) providing CALT item review opportunities, and (k) complying with legal disclosure laws when using CALTs.

The literature on computer-assisted language learning indicates that language learners have generally positive attitudes toward using computers in the classroom (Reid, 1986; Neu & Scarcella, 1991; Phinney, 1991), and a fairly large literature has developed examining the effectiveness of computer-assisted language learning (for a review, see Dunkel, 1991). But less is known about the more specific area of computers in language testing. The purpose of this article is to examine recent developments in language testing that directly involve computer use, including what we have learned in the process. The article will also examine the dominant issue of computer-adaptive testing in the educational measurement literature in an attempt to forecast some of the directions future research on computers in language testing might take.

CURRENT STATE OF KNOWLEDGE ON COMPUTERS IN LANGUAGE TESTING

In reviewing the literature on computers in language testing, I have found four recurring sets of issues: (a) item banking, (b) computer-assisted language testing, (c) computer-adaptive language testing, and (d) the effectiveness of computers in language testing. The discussion in this section will be organized under those four headings.

Item Banking

Item banking covers any procedures that are used to create, pilot, analyze, store, manage, and select test items so that multiple test forms can be created from subsets of the total "bank" of items. With a large item bank available, new forms of tests can be created whenever they are needed. Henning (1986) provides a description of how item banking was set up for the ESL Placement Examination at UCLA. (For further explanation and examples of item banking in educational testing, see Baker, 1989, pp. 412-414, or Flaugher, 1990.)

While the underlying aims of item banking can be accomplished by using traditional item analysis procedures (usually item facility and item discrimination indexes; for a detailed description of these traditional item analysis procedures, see Brown, 1996), a problem often occurs because of differences in abilities among the groups of people who are used in piloting the items, especially when they are compared to the population of students with whom the test is ultimately to be used. However, a relatively new branch of test analysis theory, called item response theory (IRT), eliminates the need to have exactly equivalent groups of students when piloting items because IRT analysis yields estimates of item difficulty and item discrimination that are "sample-free." IRT can also provide "item-free" estimates of students' abilities. Naturally, a full discussion of IRT is beyond the scope of this article. However, Henning (1987) discusses the topic in terms of the steps involved in item banking for language tests and provides recipe-style descriptions of how to calculate the appropriate IRT statistics.
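To make the IRT notion of "sample-free" item statistics a little more concrete, the sketch below shows the two-parameter logistic (2PL) response function that underlies much of the analysis discussed here; the discrimination and difficulty values, and the ability levels printed, are invented purely for illustration.

```python
import math

def prob_correct(theta, a, b):
    """Two-parameter logistic (2PL) IRT model: the probability that a student
    of ability theta answers an item with discrimination a and difficulty b
    correctly."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# A hypothetical item of moderate discrimination (a = 1.2) and average difficulty (b = 0.0).
for theta in (-2.0, 0.0, 2.0):
    print(f"theta = {theta:+.1f}   P(correct) = {prob_correct(theta, 1.2, 0.0):.2f}")
```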
Several other references may prove helpful for readers interested in more information on IRT. In language testing, Madsen and Larson (1986) use computers and IRT to study item bias, while de Jong (1986) demonstrates the use of IRT for item selection purposes. (For readers who want more technical information on applications of IRT to practical testing problems in general education, see Lord, 1980; Hambleton & Swaminathan, 1985; Andrich, 1988; Suen, 1990; Wainer & Mislevy, 1990; and Hambleton, Swaminathan, & Rogers, 1991.)


I am not saying that item banking is without potential problems. Green (1988) outlines some of the problems that might be encountered in using IRT in general, and Henning (1991) discusses specific problems that may be encountered with the validity of item banking techniques in language testing settings. Another serious limitation of IRT is the large number of students that must be tested before it can responsibly be applied. Typically, IRT is only applicable for full item analysis (that is, for analysis of two or three parameters) when the numbers of students being tested are very large by the standards of most language programs, that is to say, in excess of one thousand. Smaller samples in the hundreds can be used only if the item difficulty parameter is studied.

Minimal item banking can be done without computers by using file cards, and, of course, the traditional item analysis statistics can be done (using the sizes of groups typically found in language programs) with no more sophisticated equipment than a hand-held calculator. Naturally, a personal computer can make both item banking and item analysis procedures much easier and much faster. For example, standard database software can be used to do the item banking (e.g., Microsoft Access, 1996; or Corel Paradox, 1996). For IRT analyses, more specialized software will be needed. The following are examples of computer programs that can be used for IRT analysis: TESTAT (Stenson, 1988), BIGSTEPS (Wright, Linacre, & Schulz, 1990), and PC-BILOG (Mislevy & Bock, 1986). Alternatively, the MicroCAT Testing System (1984) program can help with both item banking and IRT analyses. BestTest (1990) is another, less sophisticated, program that can be used in both item banking and test creation. An example of a software program specifically designed for item banking is the PARTest (1990) program. If PARTest is used in conjunction with PARScore (1990) and PARGrade (1990), a completely integrated item banking, test analysis, and record-keeping system can be set up and integrated with a machine scoring system. Indications have also surfaced that computers may effectively be used to assemble pre-equated language tests from a bank of items (see Henning, Johnson, Boutin, & Rice, 1994).
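As a rough illustration of what a minimal computerized item bank might store, the sketch below sets up a single hypothetical table; the table layout, field names, and the sample item are my own inventions and are not taken from any of the packages just cited.

```python
import sqlite3

# A minimal, hypothetical item-bank table: one row per item, holding the
# traditional and IRT statistics needed to assemble new test forms.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE item_bank (
        item_id        TEXT PRIMARY KEY,
        stem           TEXT,            -- the item text
        answer_key     TEXT,            -- correct option
        specification  TEXT,            -- which test specification it samples
        item_facility  REAL,            -- proportion correct in piloting
        discrimination REAL,            -- IRT a parameter (or point-biserial)
        difficulty     REAL,            -- IRT b parameter
        times_exposed  INTEGER DEFAULT 0
    )
""")
conn.execute(
    "INSERT INTO item_bank VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
    ("R-017", "Choose the best paraphrase of ...", "C",
     "reading/main-idea", 0.62, 1.1, 0.3, 0),
)
# Select unexposed items of middling difficulty for a new form.
rows = conn.execute(
    "SELECT item_id FROM item_bank "
    "WHERE times_exposed = 0 AND difficulty BETWEEN -0.5 AND 0.5"
).fetchall()
print(rows)
```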

Computer-Assisted Language Testing

Tests that are administered at computer terminals, or on personal computers, are called computer-assisted tests. Receptive-response items (including multiple-choice, true-false, and matching items) are fairly easy to adapt to the computer-assisted testing medium. Relatively cheap authoring software like Testmaster (1988) can be used to create such tests. Even productive-response item types, including fill-in and cloze, can be created using authoring software like Testmaster. Unfortunately, the more interesting types of language tasks (e.g., role plays, interviews, compositions, oral presentations) prove much more difficult to develop for computer-assisted testing.

However, advancing technologies have many potential ramifications for computer-assisted language testing. Brown (1992a) outlined some of the technological advances that may have an impact on language teaching and testing:

Consider the multi-media combinations that will be available in the very near future: CD-ROM players working with video-image projectors, and computers controlling the whole interactive process between students and machines for situations and language tailored to each student's specific needs....Consider the uses to which computer communications networks could be put. What about scanners and hand-writing recognition devices? Won't voice sensitive computers and talking computers be valuable tools in the language media services of the future? (p. 2)
The new technologies such as the CD-ROM and interactive video discussed in Brown (1992a) do make it possible for students to interact with a computer. Hence, no technical reason remains why interactive testing like role plays, interviews, compositions, and presentations cannot be done in a computer-assisted mode. Naturally, the expense involved may impose some limits, and the scoring will probably continue to involve rater judgments (thus, further increasing the expense involved). But at least, the logistics of gathering the language samples can now be simplified by the use of computer-assisted testing procedures.


Two consequences may evolve from the current advances in technology: (a) the sophistication of existing computer hardware and software tools will continue to grow, and (b) the cost of the technology will continue to drop (eventually to within reach of all language programs). Hence, the possibilities for developing productive-response computer-assisted language tests will definitely increase.

But, why should we bother to create computer-assisted language tests at all? Aren't they really just a sophisticated version of the paper-and-pencil tests that they will probably be modeled on? Two primary benefits can be gained from computer-assisted language testing:
1. Computer-assisted language tests can be individually administered, even on a walk-in basis. Thus group-administered tests and all of the organizational constraints that they impose will no longer be necessary.
2. Traditional time limits are not necessary. Students can be given as much time as they need to finish a given test because no human proctor needs to wait around for them to finish the test.

No doubt, cheating will arise, but such problems can be surmounted if a little thought and planning are used.

Given the advantages of individual, time-independent language testing, computer-assisted testing will no doubt prove to be a positive development. Consider the benefits of a writing test administered in a computer laboratory as the final examination for an ESL writing course. Such a computer-assisted test would be especially suitable for students who had been required to do all of their writing assignments in the course on a PC. In such a course, it would make eminent sense to allow the students to do their final examination writing samples on a computer and turn in the diskette at the end of the testing period (or send the file by modem or network to the teacher). Under such circumstances, the testing period could be quite long to allow time for multiple revisions. Of course, logistical problems will crop up, but they can no doubt be overcome with careful planning. In fact, the literature indicates that computers can be an effective tool for teaching writing (Neu & Scarcella, 1991; Phinney, 1991). Why not also use them as an effective tool for testing writing (see, for example, Reid, 1986)?

Computer-Adaptive Language Testing

Computer-adaptive language tests are a subtype of computer-assisted language tests because they are administered at computer terminals or on personal computers. The computer-adaptive subtype of computer-assisted tests has three additional characteristics: (a) the test items are selected and fitted to the individual students involved, (b) the test is ended when the student's ability level is located, and, as a consequence, (c) computer-adaptive tests are usually relatively short in terms of the number of items involved and the time needed. As Madsen (1991) put it, "The computer-adaptive language test (CALT) is uniquely tailored to each individual. In addition, CALT is automatically terminated when the examinee's ability level has been determined....The result is a test that is more precise yet generally much shorter than conventional paper-and-pencil tests" (p. 237).

A clear description of how to develop computer-adaptive language tests (CALTs) is provided in Tung (1986). (For descriptions of more general computer-adaptive test [CAT] development in educational measurement, see Kaya-Carton, Carton, & Dandonoli, 1991, as well as Laurier, 1991; 1996.) CALT development relies very much on item response theory. While the computer-adaptive language test is taking place, the computer typically uses a combination of item response theory and the concept of flexilevel tests (Lord, 1980) to create a test specifically designed for the individual student taking it.
The flexilevel procedures roughly determine the general ability level of the student within the first few test questions. Then, based on item response statistics, the computer selects items which are suitable for the student's particular level and administers those items in order to get a more finely tuned estimate of the student's ability level. This flexilevel strategy eliminates the need (usually present in traditional fixed-length paper-and-pencil tests) for students to answer numerous questions that are too difficult or too easy for them. In fact, in a CALT, all students take tests that are suitable to their own particular ability levels: tests that may be very different for each student. (Readers interested in further information on computer-adaptive language testing should see Larson & Madsen, 1985, and for a discussion of both positive and negative aspects, see Canale, 1986.)
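A bare-bones sketch of the adaptive logic just described might look like the following. The Rasch-style response function, the shrinking-step ability update, and the stopping rule are deliberate simplifications chosen for illustration; operational CALTs estimate ability with maximum-likelihood or Bayesian procedures and stop on a standard-error criterion.

```python
import math
import random

def p_correct(theta, b):
    """Rasch-style probability of a correct answer to an item of difficulty b."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def adaptive_test(true_theta, item_difficulties, max_items=30, start_step=1.0):
    """Flexilevel-style adaptive loop: give the unused item closest to the
    current ability estimate, move the estimate up after a correct answer and
    down after a wrong one, shrink the step each time, and stop when the step
    becomes small (a crude stand-in for a standard-error stopping rule)."""
    theta_hat = 0.0
    step = start_step
    remaining = list(item_difficulties)
    administered = 0
    while remaining and administered < max_items and step > 0.05:
        item_b = min(remaining, key=lambda b: abs(b - theta_hat))
        remaining.remove(item_b)
        correct = random.random() < p_correct(true_theta, item_b)  # simulated answer
        theta_hat += step if correct else -step
        step *= 0.75
        administered += 1
    return theta_hat, administered

random.seed(1)
bank = [x / 4.0 for x in range(-12, 13)]   # 25 items, difficulties from -3.0 to +3.0
estimate, n_items = adaptive_test(true_theta=0.8, item_difficulties=bank)
print(f"Estimated ability {estimate:+.2f} after {n_items} items")
```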


One example of the actual development of a CALT is the Montgomery County Public Schools project which is described in Stevenson and Gross (1991). Madsen (1991) describes another example of a CALT which was applied to students at Brigham Young University in Utah for testing reading and listening abilities. The Madsen (1991) study indicates that many fewer items are necessary in administering computer-adaptive language tests than are necessary in pencil-and-paper tests and that the testing time is correspondingly shorter. For example, the CALT in Madsen (1991) used an average of 22.8 items to adequately test the students in an average of 27.2 minutes. The comparable conventional reading test used in the study required 60 items and 40 minutes.

Educational Testing Service (ETS) is providing considerable leadership in the area of what they are calling computer-based tests. That organization is already offering the GRE and PRAXIS as computer-based tests in 180 countries. In 1998, a computer-based version of the TOEFL examination will be released in North America and selected countries abroad, though paper-and-pencil versions will continue to be used until computer delivery is available.

Because of recent efforts to develop computer-based versions of the TOEFL, some of the research at ETS has focused on the effects of computer familiarity on TOEFL test performance. For instance, Kirsch, Jamieson, Taylor, and Eignor (1997) (based on a computer-familiarity scale discussed in Eignor, Taylor, Kirsch, & Jamieson, 1997) indicates that, in general, computer familiarity is related to individuals' TOEFL scores. At the same time, Taylor, Jamieson, Eignor, and Kirsch (1997) indicates that, after students participate in a computer-based testing tutorial, there is no meaningful relationship between computer familiarity and individuals' TOEFL scores. (Readers interested in more information on the new TOEFL developments can contact TOEFL Programs and Services, P.O. Box 6155, Princeton, NJ 08541-6155, use e-mail: toefl@ets.org, or visit their web site: http://www.toefl.org. Readers interested in more details about computer-adaptive testing in general would benefit from reading the educational measurement "primer" on the topic: Wainer, Dorans, Flaugher, Green, Mislevy, Steinberg, & Thissen, 1990. For a fairly technical book on the subject, see Weiss, 1983.)

Effectiveness of Computers in Language Testing

Educational Testing Service (1996) claims the following advantages for their new computer-based TOEFL:
Further enhancements to test design before 2000
Greater flexibility in scheduling test administrations
Greater standardization of test administration conditions
Portions of test individualized to examinee ability level
Inclusion of writing with every test administration
Examinee choice of handwriting or typing essay
Ability to record multiple aspects of examinee test-taking behavior
Platform for future innovations in test design and services (p. 5).

Judging by what they are claiming, at least portions of this new computer-based test will eventually be computer-adaptive.
Brown (1992b) looked in more detail at both the advantages and disadvantages of using computers in language testing. That discussion will be expanded next.


Advantages. The advantages of using computers in language testing can be further subdivided into two categories: testing considerations and human considerations.

Among the testing considerations, the following are some of the advantages of using computers in language testing:
Computers are much more accurate at scoring selected-response tests than human beings are.
Computers are more accurate at reporting scores.
Computers can give immediate feedback in the form of a report of test scores, complete with a printout of basic testing statistics.
IRT and computer-adaptive testing allow testers to target the specific ability levels of individual students and can therefore provide more precise estimates of those abilities (see Bock & Mislevy, 1982).
The use of different tests for each student should minimize any practice effects, studying for the test, and cheating (for discussion of an IRT strategy to help spot such cheating, see Drasgow, Levine, & McLaughlin, 1987).
Diagnostic feedback can be provided very quickly to each student on those items answered incorrectly if that is the purpose of the test. Such feedback can even be fairly descriptive if artificial intelligence is used (for more on such uses of artificial intelligence, see Baker, 1989, pp. 423-425, or Bunderson, Inouye, & Olsen, 1989, pp. 398-402).

Among the human considerations, the following are some advantages of using computers in language testing:
The use of computers allows students to work at their own pace.
CALTs generally take less time to finish than traditional paper-and-pencil tests and are therefore more efficient (as found for CALTs in Madsen, 1991, and for CATs in Kaya-Carton, Carton, & Dandonoli, 1991, and Laurier, 1996).
In CALTs, students should experience less frustration than on paper-and-pencil tests because they will be working on test items that are appropriate for their own ability levels.
Students may find that CALTs are less overwhelming (as compared to equivalent paper-and-pencil tests) because the questions are presented one at a time on the screen rather than in an intimidating test booklet with hundreds of test items.
Many students like computers and even enjoy the testing process (Stevenson & Gross, 1991).

Disadvantages. The disadvantages of using computers in language testing can also be further subdivided into two categories: physical considerations and performance considerations.

Among the physical considerations, the following are some of the disadvantages of using computers in language testing:
Computer equipment may not always be available or in working order, and reliable sources of electricity are not universally available.
Screen capacity is another physical consideration. While most computers today have overcome the 80-character-by-25-line restriction of a few years ago, the amount of material that can be presented on a computer screen is still limited. Such screen size limitations could be a problem, for example, for a group of teachers who wanted to develop a reading test based on relatively long passages.
In addition, the graphics capabilities of many computers (especially older ones) may be limited, and even those machines that do have graphics may be slow (especially the cheaper machines). Thus, tests involving even basic graphs or animation may not be feasible at the moment in many language teaching situations.

Among the performance considerations, the following are some of the disadvantages of using computers in language testing:
The presentation of a test on a computer may lead to different results from those that would be obtained if the same test were administered in a paper-and-pencil format (Henning, 1991). Some limited research indicates that there is little difference for math or verbal items presented on computer as compared with a paper-and-pencil version (Green, 1988) or on a medical technology examination (Lunz & Bergstrom, 1994), but much more research needs to be done on various types of language tests and items.
Differences in the degree to which students are familiar with using computers or typewriter keyboards may lead to discrepancies in their performances on computer-assisted or computer-adaptive tests (Hicks, 1989; Henning, 1991; Kirsch, Jamieson, Taylor, & Eignor, 1997).
Computer anxiety (i.e., the potential debilitating effects of computer anxiety on test performance) is another potential disadvantage (Henning, 1991).

What Issues Have Already Received Attention?

Judging by the first half of this paper, language testers know a great deal about computers in language testing. For instance, we know:
How to use an item bank to create, pilot, analyze (with item response theory), store, manage, and select test items for purposes of making multiple test forms.
How to create more imaginative computer-assisted language tests that take advantage of the new technologies.
How to build computer-adaptive tests that will (a) help us select and fit items to individual students' abilities, (b) help us know when to end the test when the student's ability level is found, and hence, (c) help us build computer-adaptive tests that are relatively short in terms of number of items and time.
How effective computers are in language testing, including the advantages (in terms of testing and human considerations) and disadvantages (in terms of physical and performance considerations).

At the same time, language testers have a great deal to learn about computers in language testing, as I will explain next.

FUTURE DIRECTIONS: COMPUTER-ADAPTIVE LANGUAGE TESTING

Most of the research cited so far has been practical in nature. In my view, research on computers in language testing will inevitably become more technical, complex, and detailed. Drawing on the wider literature of the educational testing field, I found that the educational measurement literature on using computers in testing developed very much like the language testing literature has (for an overview of these developments, see Bunderson, Inouye, & Olsen, 1989). However, at this point, researchers in educational measurement have developed well beyond the practical questions of how to bank items, how to define and distinguish between computer-assisted and computer-adaptive testing, and how to measure the effectiveness of computers in testing. In short, they are now addressing considerably more technical, complex, and detailed issues.

Examining the types of concerns that have arisen recently for educational testers may therefore point the way for future research on computers in language testing and provide a basis for researchers in our field who would like to begin studying these issues. Because of length constraints, I will narrow my exploration to recent concerns about computer-adaptive testing in order to illustrate how at least one strand of research points to the future for language testers. The discussion will be organized into three categories (CALT design issues, CALT scoring issues, and CALT logistical issues) under subheadings that pose the central questions that seem to be the focus of research into computer use in educational testing. All of these questions will be directed at computer-adaptive tests, as are the studies in the educational testing literature.

CALT Design Issues

A number of CALT design issues need to be addressed by language testers before we can be fairly sure we are using such tests responsibly, including at least the following questions: How should we pilot CALTs? Should a CALT be standard length or vary across students? How should we sample CALT items? What are the effects of changing the difficulty of CALT items? How can we deal with item sets on a CALT?
How should we pilot CALTs? The problem that must be addressed here is caused by the fact that CALTs, because of their adaptive nature, may be different for every student. In classical test theory terms, the examination was the same for every student. Hence, a single examination could be piloted, item analyzed, revised, and then validated during the operational administration. In contrast, a new CALT is created each time a student takes the test (for a quick overview of this process, see Baker, 1989, pp. 425-426).


In educational testing circles, the strategy that is used to overcome the piloting problem is called simulation. Using item response theory, testers can simulate the answers of typical students based on the pilot responses of real students. By simulating such responses in various combinations and for different test lengths, researchers can study the probable characteristics of the test under many different conditions and lengths (as shown in Eignor, Way, Stocking, & Steffen, 1993). To my knowledge, no such simulation studies have been done in the language testing context.
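Under many simplifying assumptions, a simulation of the kind used for CAT piloting might look like the sketch below: simulated examinees with known abilities "answer" a pool of items according to the 2PL model, and the resulting response patterns can then be examined for different test lengths and designs. The pool size, parameter ranges, and examinee distribution are invented for the example.

```python
import math
import random

def p_correct(theta, a, b):
    """2PL probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def simulate_administration(n_examinees, items, seed=0):
    """Generate simulated response patterns for examinees drawn from a standard
    normal ability distribution, given a list of (a, b) item parameters."""
    rng = random.Random(seed)
    data = []
    for _ in range(n_examinees):
        theta = rng.gauss(0.0, 1.0)
        responses = [int(rng.random() < p_correct(theta, a, b)) for a, b in items]
        data.append((theta, responses))
    return data

# A hypothetical 30-item pool with varied discrimination and difficulty values.
pool_rng = random.Random(42)
pool = [(pool_rng.uniform(0.6, 1.8), pool_rng.uniform(-2.0, 2.0)) for _ in range(30)]
simulated = simulate_administration(1000, pool)
mean_raw = sum(sum(resp) for _, resp in simulated) / len(simulated)
print(f"Mean simulated raw score on the 30-item pool: {mean_raw:.1f}")
```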

Should a CALT be standard length or vary across students? One of the advantages of computer-adaptive testing is that the test can be tailored to each student's ability levels. Such tailoring results in a test that may be more efficient at some levels of ability, requiring fewer items to reliably estimate the students' abilities, but less efficient at other levels, requiring more items to do a reliable job. The issue that arises then is whether the test should be the same length for each student or of different lengths tailored to each student's ability.

Stocking (1987) indicates that the tailored strategy resulting in different lengths may produce biased scores for students who end up taking short versions. If the test is made relatively long for all students, even for those who would otherwise have taken the "short" versions, no such biases may surface. However, if the test must be kept short, it may be preferable to use the same fixed length for all students. These important length issues have yet to be investigated for CALTs. Perhaps they should be.

How should we sample CALT items? Traditional tests, in order to be theoretically valid, are often designed on the basis of clearly developed test specifications to sample the different areas of knowledge or skill that make up a particular domain. Because CALTs will typically be shorter than traditional tests, testing all of the specifications may be impossible. Items could be randomly selected from all the possible specifications. However, if CALTs are short for only some students, a better strategy might be to develop a sampling algorithm that takes into account the issue of specification sampling by keeping track of which specifications have been sampled for a given student and selecting the remaining items in a way that best fulfills the rest of the test specifications. Such a scheme might even take into account the relative importance of the various specifications. Naturally, any such scheme should be based on research, research that has yet to be conducted in relation to CALTs (for related research in CAT, see Stocking & Swanson, 1993; Swanson & Stocking, 1993).
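One way to make the specification-sampling idea concrete is sketched below: each item is tagged with the specification it measures, and the routine always draws the next item from whichever specification is currently furthest behind its target share. The specification names and target weights are hypothetical, not drawn from any operational CALT.

```python
from collections import Counter

def next_specification(targets, counts):
    """Return the specification that is currently most under-represented,
    relative to its target proportion (a simple content-balancing rule)."""
    administered = sum(counts.values()) or 1
    def shortfall(spec):
        return targets[spec] - counts[spec] / administered
    return max(targets, key=shortfall)

# Hypothetical target shares for four specifications in a reading CALT.
targets = {"main idea": 0.4, "inference": 0.3, "vocabulary": 0.2, "reference": 0.1}
counts = Counter()

for _ in range(10):                      # choose specifications for a 10-item test
    spec = next_specification(targets, counts)
    counts[spec] += 1                    # an item from this specification would be given here

print(dict(counts))
```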

What are the effects of changing the difficulty of CALT items? Bergstrom, Lunz, and Gershon (1992), studying the responses of 225 students on a medical technology examination, found that altering the difficulty of CAT items slightly raises the number of items that are necessary to test students with adequate precision, but otherwise has little effect on the estimation of their abilities. No such study has been conducted on language tests to my knowledge. Does changing item difficulty have the same effect on test length, and the same lack of effect on ability estimation, in language testing? And if both sets of results are the same in language testing as in the education literature, what difficulty level would be ideal in terms of language test length and precision of measurement in relation to the level of language proficiency of the students?

How can we deal with item sets on a CALT? A problem that occurs whenever item banking is done, even for paper-and-pencil tests, is how to deal with item sets (sets of items determined by reading or listening passages, or sets of items based on item formats, topics, item contents, etc.). Good reasons may exist for keeping certain items together and even for ordering them in a certain way. Wainer and Kiely (1987) call these sets of items "testlets." (Wainer et al., 1990, also discuss testlets, as do Sheehan & Lewis, 1992; Bunderson, Inouye, & Olsen, 1989, pp. 393-394, 398, refer to them as "reference tasks," while Boekkooi-Timminga, 1990, explores a related concept she calls item clustering.)


Unfortunately, using item sets, or testlets, may result in a selection of test items that are not maximally efficient from an IRT perspective. Since such item sets often occur in language tests, especially in academic reading and lecture listening tests, the issue of item sets is an important one in language testing. Hence, investigations should be conducted into the best strategies to use in dealing with CALT item sets, as well as on the relative effectiveness of item sets as compared to independent items. Research might also be profitable on the ordering of items within sets, and indeed, the ordering of items throughout a CALT.

CALT Scoring Issues

A number of CALT scoring issues also need to be addressed by language testers before we can be fairly sure we are using such tests responsibly, including at least the following questions: How should we score CALTs? How should we deal with CALT item omissions? How should we make decisions about cut-points on CALTs?

How should we score CALTs? A number of alternative methods have been explored in the educational literature including at least the following:
1. Raw scores (a simple sum of the number of correct items)
2. Weighted raw scores (in which some items count more than others)
3. Scaled scores (e.g., the TOEFL, which is equated across forms; for an explanation of scaling and equating CATs, see Dorans, 1990; for studies using the techniques, see Mazzeo, Druesne, Raffeld, Checketts, & Muhlstein, 1991; O'Neill, Folk, & Li, 1993; Schaeffer, Steffen, & Golub-Smith, 1993)
4. Any of the above (1-3) corrected for guessing (as discussed for CATs in Stocking, 1994; a small scoring sketch follows this list)
5. Any of the above (1-3) based on polytomous (as opposed to dichotomous) scoring (for a study of this issue on CATs, see Dodd, De Ayala, & Koch, 1995)
6. Any of the above (1-4) scores referenced to a conventional test (as described for CATs in Ward, 1988; Wainer et al., 1990), especially when a CALT is first introduced to replace an existing paper-and-pencil test
7. Any of the above (1-4) scores referenced to a set of anchor items (anchor items are typically a set of items that all students take along with the rest of the test).
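As a small illustration of options 1, 2, and 4 in the list above, the sketch below computes a raw score, a weighted raw score, and a classical correction-for-guessing score (rights minus wrongs divided by the number of distractors). The response string, item weights, and four-option format are assumptions made for the example.

```python
def raw_score(responses):
    """responses: list of 1 (right), 0 (wrong), or None (omitted)."""
    return sum(r for r in responses if r)

def weighted_score(responses, weights):
    """Weighted raw score: each correct answer earns that item's weight."""
    return sum(w for r, w in zip(responses, weights) if r)

def guessing_corrected(responses, n_options=4):
    """Classical correction for guessing: R - W / (k - 1),
    where omitted items are neither rewarded nor penalized."""
    rights = sum(1 for r in responses if r == 1)
    wrongs = sum(1 for r in responses if r == 0)
    return rights - wrongs / (n_options - 1)

answers = [1, 0, 1, None, 1, 0]          # hypothetical 6-item response string
weights = [1, 1, 2, 2, 3, 1]             # hypothetical item weights
print(raw_score(answers), weighted_score(answers, weights), guessing_corrected(answers))
```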

Research on which method works best for which purposes when interpreting CALT scores would be very beneficial. Also, when using IRT for the development of CALTs, language testers must grapple with explaining the scoring methods in non-technical terms so that students and teachers can understand the results. Hence, CALT research might profitably focus on how to best convey such IRT scoring information to lay people.

How should we deal with CALT item omissions? On traditional tests, especially those scored with a correction for guessing, students can and do omit items if they are not able to answer them. However, on a CALT, problems arise if the students omit one or more items. How should testers deal with such omissions? Wainer et al. (1990, pp. 236-237) chose not to allow omissions on their CATs; the students simply could not go on to the next item until they had answered the one in front of them. Another strategy would be to allow omissions but assume that omitted items are wrong. If that strategy were followed, what would be the effect of such a scored-as-wrong item on the estimation and selection of the items that follow?

Other problems may arise if students are allowed to omit items. For instance, if omitted items are not scored, students can manipulate the test by skipping items until they find items they can answer correctly. According to Lunz and Bergstrom (1994), such students would receive undeservedly high scores. Another problem is that students can simply omit all items and get a good look at all of the items in the item bank. This would have important implications for item exposure (as discussed below). Another possibility is that students at different language proficiency levels might omit items in different ways. Hence, omission patterns would be linked to language proficiency as measured on such a test and would become a source of measurement error. All of these issues related to item omissions on CALTs need to be researched and resolved in one way or another.
How should we make decisions about cut-points on CALTs? The literature on deciding cut-points is often referred to as standard setting (for a review, see Jaeger, 1989, pp. 491-500). Reckase (1983) has suggested the use of the sequential probability ratio test (SPRT) for making pass-fail decisions in adaptive and other related tests. The SPRT was originally developed by Wald (1947) as a quality control device, but both Reckase (1983) and Kingsbury and Weiss (1983) show how the SPRT can be applied to adaptive tests as well. Lewis and Sheehan (1990) suggested using Bayesian decision theory for making mastery/non-mastery decisions on a computerized sequential mastery test. Du, Lewis, and Pashley (1993) demonstrated the use of a fuzzy set approach for the same purpose. The SPRT and the other methods just described should all be considered in terms of how well they help with test construction and decision making on CALTs.
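To illustrate how the SPRT might be adapted to a pass-fail CALT, here is a minimal sketch under assumed values: the likelihood of the response pattern at an ability just above the cut-point is compared with the likelihood just below it, and testing stops as soon as the ratio crosses Wald's decision thresholds. The cut-point abilities, error rates, and item parameters are all invented for the example.

```python
import math

def p_correct(theta, a, b):
    """2PL probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def sprt_decision(responses, items, theta_fail=-0.25, theta_pass=0.25,
                  alpha=0.05, beta=0.05):
    """Wald's sequential probability ratio test for a mastery decision.
    responses: list of 0/1 answers; items: matching list of (a, b) parameters.
    Returns 'pass', 'fail', or 'continue testing'."""
    upper = math.log((1 - beta) / alpha)     # declare 'pass' above this
    lower = math.log(beta / (1 - alpha))     # declare 'fail' below this
    log_ratio = 0.0
    for r, (a, b) in zip(responses, items):
        p_hi = p_correct(theta_pass, a, b)
        p_lo = p_correct(theta_fail, a, b)
        log_ratio += math.log(p_hi / p_lo) if r else math.log((1 - p_hi) / (1 - p_lo))
        if log_ratio >= upper:
            return "pass"
        if log_ratio <= lower:
            return "fail"
    return "continue testing"

items = [(1.0, 0.0)] * 12                  # twelve hypothetical items at the cut-point
print(sprt_decision([1] * 12, items))      # a run of correct answers -> 'pass'
```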


CALT Logistical Issues

In addition to the important logistical issues raised by Green (1990), which included system, hardware, human, and software issues, at least three CALT logistical issues need to be addressed by language testers: How can we avoid CALT item exposure? Should we provide item review opportunities on a CALT? How can we comply with legal disclosure laws when administering CALTs?

How can we avoid CALT item exposure? Item exposure occurs when students see any given item. Since exposure means that other students who might take the test in the future may know about the exact content of those items which have been exposed, such items should not be used again. In traditional tests, large numbers of students are tested on one day with a relatively small number of items, all in one exposure. Thus, even though the items could not be used again, they were used rather efficiently to test a large number of students. However, on a CALT, even though each student gets a different test form, unless there is an infinite number of items, whatever items are in the item bank will be exposed rather quickly. If the test is administered on a daily walk-in basis at a computer lab, for instance, all items in the bank could be exposed within a matter of days without having tested very many students. In addition, as discussed above, particularly in a situation where item omissions are permitted on a CALT, students can simply omit all items and get a good look at all of the items in the item bank. Naturally, such wholesale item exposure would have important implications for test security. The steps that can be taken to slow down the process of item exposure are to:
Have a very large item bank with a wide variety of difficulty levels to meet all item specifications desired in the test
Have the computer select a number of items which might come next (rather than a single item) and then randomly select from among those possibilities (as in McBride & Martin, 1983; a minimal sketch of this strategy follows this list).
Have the computer select the next item based on complex probabilistic models like those discussed in Stocking (1992) and Stocking and Lewis (1995a, 1995b).
Use simulation studies to estimate the efficiency of different sized item banks in minimizing exposure, then stay within whatever size limits those studies suggest.
Circulate item banks, or sub-banks, through different testing sites.
Constantly change the items in an item bank by adding new items and eliminating old ones (especially those most likely to have been exposed).
Monitor the functioning of items within the item pool by keeping track of students' item performances, and identify items that appear to have been exposed.
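The second strategy in the list above, selecting at random from among the several most suitable items in the spirit of McBride and Martin (1983), might be sketched as follows; the candidate-set size of five, the 2PL information function, and the mini-bank are assumptions made for the example.

```python
import math
import random

def item_information(theta, a, b):
    """Fisher information of a 2PL item at ability theta."""
    p = 1.0 / (1.0 + math.exp(-a * (theta - b)))
    return a * a * p * (1.0 - p)

def randomesque_pick(theta, available_items, candidates=5, rng=random):
    """Exposure control: rank the available items by information at theta,
    then choose at random from the top few rather than always the single best."""
    ranked = sorted(available_items,
                    key=lambda item: item_information(theta, item["a"], item["b"]),
                    reverse=True)
    return rng.choice(ranked[:candidates])

# Hypothetical mini-bank of items with 2PL parameters.
bank = [{"id": f"I{i:02d}", "a": random.uniform(0.7, 1.6), "b": random.uniform(-2, 2)}
        for i in range(20)]
print(randomesque_pick(theta=0.0, available_items=bank)["id"])
```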

Clearly, studies should be conducted into the relative effectiveness of strategies for dealing with CALT item exposure.

Should we provide item review opportunities on a CALT? In traditional paper-and-pencil testing situations, students who have time remaining at the end of the test can go back and review earlier items. Given the fact that CALT algorithms select the next item based on previous item performances, allowing students to go back and review items would undermine the theoretical nature of the test. Lunz, Bergstrom, and Wright (1992) indicate that this is not a big problem for CATs, but what about CALTs?
Wainer (1992) suggests that very testwise students could use reviewing as a way of manipulating the test and gaining an unfair edge. The degree to which these item reviewing issues are important to CALTs is yet to be determined. (For a fuller explanation of testing algorithms in CATs, see Thissen & Mislevy, 1990.)


How can we comply with legal disclosure laws when administering CALTs? New York state has so-called "truth in testing" disclosure laws dating back to 1979 that require that standardized tests administered in the state be made available to examinees for inspection within 30 days of the administration. Naturally, with today's communications technology, if a test must be disclosed in one state, it might as well be disclosed in all states and around the world. That requirement does not cause undue problems for paper-and-pencil tests which can be administered to tens or even hundreds of thousands of students in a one month period. However, a problem arises from the fact that a relatively large item bank is necessary for CALTs. To disclose the entire item bank of a CALT every 30 days, or even to disclose to students only those items they took, would be very costly in terms of development time and manpower to produce, pilot, analyze, and distribute new items. Currently, the New York legislature is considering laws (SO3292-C or A5648-C) that will continue the basic requirements of the 1979 law, but be updated to more reasonably fit with the new CATs and the logistics involved (see Lissitz, 1997). However, in the interim, research should be conducted into the best strategies for developing new item banks and for phasing out old ones and disclosing them publicly.

CONCLUSION

The purpose of this paper has been to examine recent developments in language testing that directly involve computers. In the process, I have looked at what language testers have learned in the specific area of CALT and found substantial information on:
How to use an item bank
How to use new technologies
How to build computer-adaptive tests
How effective computers are in language testing.

Next, I examined the educational testing literature for ideas on what directions future research on computers in language testing might take, focusing on the dominant issue of computer-adaptive testing and found that future research might benefit from answering the following questions:
How should we pilot CALTs?
Should a CALT be standard length or vary across students?
How should we sample CALT items?
What are the effects of changing the difficulty of CALT items?
How can we deal with item sets on a CALT?
How should we score CALTs?
How should we deal with CALT item omissions?
How should we make decisions about cut-points on CALTs?
How can we avoid CALT item exposure?
Should we provide item review opportunities on a CALT?
How can we comply with legal disclosure laws when administering CALTs?

However, I would like to stress that the computer-adaptive testing involved in the above 11 questions is only one stream of computer-related research in education, psychology, and related fields. Important development and research are also going on in areas like: (a) testing in intelligent teaching systems, (b) testing using the Internet, (c) handwriting and speech recognition, (d) analysis and scoring of open-ended responses (like compositions and speech samples), and (e) alternative psychometric models for analyzing the results of the more complex information that can be gathered using computer-assisted response analysis. (Readers interested in exploring these issues further might start by reading Alderson, 1991; Bejar & Braun, 1994; Burstein, Frase, Ginther, & Grant, 1996; Corbel, 1993; Jamieson, Campbell, Norfleet, & Berbisada, 1993; Mislevy, 1993, 1994; Powers, Fowles, Farnum, & Ramsey, 1994.)

Naturally, through all of our computer-related research efforts, professional quality language testing should continue to be our goal, and such testing should continue to adhere to the Standards for educational and psychological testing (American Psychological Association, 1985) agreed to by the American Educational Research Association, American Psychological Association, and the National Council on Measurement in Education for all educational and psychological tests. Special guidelines have also been published (American Psychological Association, 1986), which interpret the above Standards in terms of how they should be applied to computer-based testing and score interpretations (see also Green, Bock, Humphreys, Linn, & Reckase, 1984). In keeping with those two sets of guidelines, ongoing research must also be conducted on the reliability and validity of each and every CALT that is developed in the future.[1]

Notes

[1] The reliability issues for CATs are addressed in Thissen, 1990; the validity issues are addressed in Steinberg, Thissen, & Wainer, 1990; and reliability and validity were studied together in McBride & Martin, 1983, and Bennett & Rock, 1995.

Source
Friday, April 29

The Mystery of Three Hours Trapped in the Past



The Mystery of Three Hours Trapped in the Past That Remains Unexplained to This Day
Puys, a seaside village near the port of Dieppe in France
On August 4, 1951, dawn had not yet broken. The sea rumbled with waves crashing against the rocks along the coast of Puys, France. It was a calm, peaceful early morning. But that day turned into a terrifying experience for two English women tourists holidaying in Puys.

Puys, a seaside village near the port of Dieppe in Normandy, France, is an alternative tourist spot offering views of the beach, the shoals, and the rocky cliffs. Romantic, for those who love the sea. That is what led the two women to choose Puys for their autumn holiday. But the holiday became an unforgettable memory for them.

Before dawn that day, the two tourists were awakened by the din of heavy gunfire. The noise grew louder, volley after volley, followed by chaotic screams and crying, then the drone of bomber aircraft, exploding bombs, mortar and rifle fire, shouting... The two were shocked beyond words. It was as if they were in the middle of a ferocious battle.

Sound after sound of battle kept echoing, clearly audible to them. Yet they did not dare leave their room. They simply lay flat and hid, terrified, in a corner, their bodies shivering at gunfire and explosions that sometimes sounded very close, at military commands barked in English and German, at screams of pain and sobbing.

For roughly three hours they heard every sound of the battle outside. Then, at last, the terrifying noises grew fainter... fainter... and were gone. The faint crash of waves against the rocks could be heard once more. Dawn had broken.

After calming themselves, the two worked up the courage to leave their room. Fearfully, they peeked out the window. The scene outside was normal. There was no trace of any recent battle at all. Only houses, rocks, the beach, the trees... an ordinary day in Puys.


Operation Jubilee (1942)

The two then asked several people nearby whether they had heard the sounds of a battle just now. Everyone simply shook their heads, puzzled. There had been no commotion at all, let alone gunfire and exploding bombs. One elderly local resident said there had been no new fighting in Normandy since the D-Day landings of "Operation Overlord" (1944) and Operation Jubilee (1942). The old man explained that the port of Dieppe, Puys, and Pourville had been the landing points for the combined Allied forces (Britain, Canada, the US, and Poland) in Operation Jubilee on August 19, 1942.



So what had really happened? The two English tourists could not understand it. They were certain that what they had heard was a battle, one they felt they could almost see. In their confusion, they filed a report with the local authorities about the phenomenon. At first the report was ignored, but eventually a specialized organization in England took an interest in it.

Astonishing Details

It was the British Society of Psychical Research that then investigated the phenomenon. They were convinced that what the two English women had experienced was part of an unsolved natural mystery. They did, however, have a working assumption: the two may have been caught in a "time slip," a phenomenon in which a kind of energy portal opens at a particular place, allowing people to perceive what happened there in the past. Could it be true?

Perhaps it is. Investigation of the women's report did show that the events they described matched what actually took place at Puys during Operation Jubilee, the Allied landing operation in Normandy mounted against the Germans entrenched in France on August 19, 1942. The operation failed, and its lessons later became an important consideration for the next major operation, the D-Day landings of "Operation Overlord" on June 6, 1944, which succeeded in breaking German dominance in France.

The accuracy of the two tourists' account was verified by cross-checking it against classified military archives that had never been published. The result was a large number of matching details that astonished everyone involved.

Even though the two women could have known the story of Operation Jubilee at Dieppe from the abundant literature of the time, they could not have obtained the kind of crucial details recorded in those secret military archives. Yet in fact they described details almost exactly matching the archives.

Source

The Snake Stone (Mustika Ular)




The Benuaq and Tunjung Dayak believe that they descend from an ancestor known as Tamerikukng. Because his descendants committed a transgression, they eventually changed form and scattered to various places across the island of Borneo. These are the beings often referred to as "spirits," unseen creatures who each have their own tasks and roles and who dwell throughout the natural world: in the sky, on the earth, in the water, and so on.
Although they live in a realm invisible to the eye, their needs are largely the same as those of ordinary human beings. In this old belief, the two worlds stand side by side, separated only by a veil of mystery, and are closely intertwined; that closeness is felt only by people who remain close to nature, who still treat others humanely and still respect the universe. Undeniably, it is the expression of that respect which can lift the veil on this mysterious dimension, regarded by most Dayak communities as the world of magical knowledge.
The Dayak believe that obedience and loyalty to the "spirits" will bring blessings and rewards in many forms, while disobedience will drag them toward ruin. They therefore constantly strive to communicate with these "spirits" through means that often cannot be squared with common sense.
According to Dalmasius Madrah T, an expert on the culture of Borneo, magical knowledge is essentially divided into two kinds. "Hot" magic is knowledge used to harm others; examples are rasutn and bongkaaq eqaau, which are extremely deadly, while invulnerability magic, though it endangers no one, is also classed as hot magic. "Cold" magic, by contrast, serves to anticipate, ward off, and cure magic planted or sent by an opponent, and can even be used to treat medical illnesses.
As usual, anyone who intends to acquire such knowledge must find the right source (that is, a teacher) suited to what he wants. Most strikingly, even though various scholarly studies have been carried out and much seemingly concrete evidence appears in everyday life, the concept of magic, so hard to accept with common sense, has never been laid out with full clarity.
Besides seeking out a teacher, some people try to obtain magical knowledge through betapa (ascetic meditation), as did the ancestor of Bung Dani-i-Dani, who received an inheritance in the form of an egg-shaped stone coiled by a snake. To this day they believe that this is what is called the mustika ular, the snake stone.
It began when the Tumbang Samba area was struck by a deadly epidemic. In so remote a village there was nothing the people could do but hope and pray; it was this situation that made Bung Dani's grandfather resolve to meditate in the Kahayan River in order to gain enlightenment for overcoming the disease that grew more rampant by the day.
In due course the grandfather immersed himself in the Kahayan River. Time passed until, one day, he was visited by the ruler of the Kahayan River, who introduced himself as Datu Amin Kelaru. From that meeting between two beings of different realms he received an egg-like stone entwined by a snake. In short, with the stone's magical power the grandfather finally succeeded in curing the people of his village who had been stricken by the strange disease.
Although the snake stone was obtained through meditation, the object, which the Dayak believe to hold supernatural power, at certain times asks for tribute in the form of food and drink, what we know as offerings (sesaji).
Of course, opinions about this practice are always divided within the community. The Dayak, however, regard it as (to borrow Khanjeng Joko's term) a "bond of affection" among fellow creatures of God. Unfortunately, in everyday life the relationship is sometimes reversed: it should be humans, created more perfect than His other creatures, who give "offerings" as alms to lower beings, not the other way around.
After the grandfather passed from this mortal world, the snake stone was handed down to his grandson, Bung Dani-i-Dani. It is this young man who has carried on his grandfather's work, providing both medical and non-medical healing services in his home area of Tumbang Samba.
To this day, every Thursday night, Bung Dani-i-Dani lays out offerings of three or seven kinds of flowers, one of which must be jasmine, along with a glass each of sweetened and unsweetened coffee, while the snake stone is placed on a plate that has first been sprinkled with a handful of rice.
Now, in Dani-i-Dani's hands, the snake stone with its potent supernatural power has been put to many uses. Besides healing, it is also credited with boosting economic fortunes: attracting customers, smoothing business dealings, and supplying the supernatural guardianship known as a kamaat (a loyal unseen protector). This last can indeed be obtained through nemaai (acquisition by payment and certain rites). In short, buying a kamaat is no easy task; it requires earnestness, and the owner of the kamaat must be persuaded to share. A kamaat is not, at heart, merchandise; it is simply that anyone serious about obtaining one must be willing to give something in return.
Such is a small piece of legend, yet a real one, which could still be encountered in the village of Tumbang Samba when this article was written.

Source
Sunday, April 24

Smarter Children Through Brain Gym

There are many ways to improve the brain's performance, one of which is brain gym exercises. Brain gym can help raise the intelligence of schoolchildren and can even be done with babies.

The brain is the part of the body that serves as the control center for the other organs, and it is always associated with a person's intelligence.

Brain gym has actually been known since the 1980s, but at that time it was limited to adults. Since the 2000s, however, it has been developed to help improve the intelligence of schoolchildren and even of babies.

The light exercises, performed with the hands and feet, provide stimulation to the brain. It is that stimulation which can improve cognitive abilities such as alertness, concentration, speed in learning, memory, problem solving, and creativity.

"A child's intelligence is influenced by genetic or hereditary factors and by stimulation. Brain gym helps raise a child's intelligence through various kinds of movement-based stimulation," said Tri Gunadi, S.Psi, at the talk show Berbagai Gerakan Senam Otak Untuk Mencerdaskan Anak at STIKOM LSPR, Jakarta, on Saturday (3/10/2009).

Tri Gunadi added that brain gym can in fact be used by all age groups, from babies to the elderly, although its function differs. For the elderly, brain gym can help delay the effects of aging, in the sense of postponing senility or the feelings of loneliness that often haunt them.

For children, brain gym can help raise intelligence, build self-confidence, and support children who have trouble in the classroom. It is also often used as therapy for several childhood conditions such as hyperactivity, attention deficit disorders, emotional disturbances, infant syndromes, and learning difficulties.

The consultant, familiarly known as Pak Gun, added that before doing brain gym a child should go through several steps known as PACE (Positive, Active, Clear, and Energetic), namely:

  1. Energetic: to be energetic, the supporting intake of at least 125 cc of plain water is needed; it helps carry oxygen to the brain and dissolve salts, optimizing the flow of electrical energy in the body.
  2. Clear: to clear the mind, massage the "brain buttons," the points two finger-widths below the collarbone (clavicle), with one hand while the other hand rubs the area around the navel.
  3. Active: performed through cross-crawl movements, moving the right hand together with the left leg and vice versa.
  4. Positive: performed with the relaxed "hook-ups" position, crossing the hands with the thumbs pointing downward, then rotating them inward while the legs are crossed.

"Gerakan PACE ini membantu mengurangi kecemasan anak dan membuat anak berada dalam kondisi yang santai," ujar dosen rehab medik FK-UI ini.

Selanjutnya dilakukan pre-activity lalu learning menu yang disesuaikan dengan masalah atau hal yang ingin dioptimalkan dari si anak. Setelah itu dilakukan post-activity untuk melihat seberapa besar peningkatan yang bisa dilakukan anak setelah melakukan senam otak dan diakhiri dengan celebrate goal misalnya anak mengucapkan 'Hore saya bisa menyelesaikan soal tribonometri'

"Pada anak yang memiliki kelainan seperti autis, hal penting yang harus diingat sebelum anak melakukan senam otak adalah anak tersebut sudah bisa meniru apa yang dilakukan oleh orang lain," ujar Tri Gunadi yang juga Direktur Pusat Terapi Tumbuh Kembang Anak YAMET.

Senam otak selain berfungsi untuk membantu segala hal yang berhubungan dengan kecerdasan juga bisa membatu mengatasi keterlambatan bayi dalam berjalan atau berlari, atau membantu anak yang tidak bisa lepas dari orangtuanya serta meningkatkan motivasi dan semangat diri anak.

Source
- 0 comments

Male Students Should Be Allowed to Walk Around in Class to Keep Their Brains Active

Education experts in Britain suggest that male students be allowed to walk around during class. Boys learn differently from girls, and letting them move keeps their brains more active and switched on.

A study by Abigail Norfleet James of the University of Virginia, conducted at a boys-only school, shows that male students learn better when engaged in activities based on movement and visuals.

"Boys have lower verbal or language abilities than girls. Their hearing is also poorer, and they are less able to restrain their emotions and control impulses," said Dr James, as reported by the Dailymail, Wednesday (20/1/2010).

Boys, however, have good spatial ability. They also have sharp vision, respond better to touch, and are physically more active. Experts therefore suggest that the education given to boys be tailored to these strengths.

The study, presented at the International Boys' Schools Coalition Conference in London, states that boys should do more physical activity while in class.

Each subject should also be designed so that male students move more and work in groups.

"Teachers should also move around a lot to catch the students' eyes. Most boys will choose to watch something that is moving. If a teacher just sits still for a few minutes, the boys' attention will drift to something else," explained Dr James.

So if a boy cannot sit still and keeps walking around the classroom, that is normal. "There is nothing wrong with that, because it makes his brain more active and his learning ability better," said Dr James.

- 0 comments

Women Become Smarter After Giving Birth

The birth of a child is a blessing in its own right for a woman. One such blessing is becoming smarter, because a woman's brain volume turns out to increase after childbirth.

The size of the increase in brain volume is influenced by the mother's reaction to the birth. The more grateful she is for the presence of her baby, the greater the increase.

A study published by the American Psychological Association demonstrated this through observations of 19 pregnant women. At set intervals after giving birth, the participants underwent brain scans.

The first scan was done 2-4 weeks after delivery, while the second was done after 3-4 months. Although not actually very large, the increase in volume was considered quite significant.

As quoted from the Telegraph, Thursday (21/10/2010), the increase in brain volume occurred in the following areas:
  1. The hypothalamus, which processes emotion
  2. The amygdala, which processes emotion and reward
  3. The parietal lobe, which processes the senses
  4. The prefrontal cortex, which processes logic and judgment.
According to the researchers, brain volume in adults very rarely changes except under certain conditions, for example intensive learning, injury or disease in the brain, or environmental influences.

The researchers suspect the change is triggered by hormonal shifts that occur in the early period of caring for a child. The positive change occurs when a woman regards the arrival of her baby as something special, beautiful, and perfect.
- 0 comments

Teach Children to Play Music So They Won't Become Senile Too Soon

Introducing music at an early age not only builds an artistic spirit in children but also helps preserve brain function. According to research, children who play music from an early age are slower to become senile when they enter old age.

Even if they no longer play an instrument as adults, the experience of learning music as a child still provides benefits decades later. Cognitive function is better preserved than in elderly people with no musical experience.

This benefit was revealed in a study by Dr Brenda Hanna-Pladdy, an intelligence expert from the University of Kansas. The study involved 70 healthy elderly people aged 60-83, divided into 3 groups based on their musical experience.

The first group consisted of elderly people who had never studied music at all. The second group had studied music for 1-9 years starting at age 10, while the last group had studied music for more than 10 years starting at the same age as the second group.

More than 50 percent of the participants spent their childhood learning music on the piano, while others played the flute and clarinet. Only a small portion used other instruments, including drums, violin, and guitar.

Tests of the participants' cognitive or brain function showed a close relationship between musical experience and memory capacity. Participants who had studied music for more than 10 years in childhood remembered things better than the other groups.

"Childhood is a crucial period for brain cells, which makes it easy to learn music. The complex process over many years forms particular structures that offset the decline in brain function when senility begins," said Dr Hanna-Pladdy, as quoted from the Telegraph, Sunday (24/4/2011).

The study was published in the latest edition of the journal Neuropsychology, recently issued by the American Psychological Association.
Saturday, April 16 - 0 comments

Testing and Evaluation_1

Teacher Test Accountability:
From Alabama to Massachusetts

Larry H. Ludlow
Boston College

Abstract
Given the high stakes of teacher testing, there is no doubt that every teacher test should meet the industry guidelines set forth in the Standards for Educational and Psychological Testing. Unfortunately, however, there is no public or private business or governmental agency that serves to certify or in any other formal way declare that any teacher test does, in fact, meet the psychometric recommendations stipulated in the Standards. Consequently, there are no legislated penalties for faulty products (tests) nor are there opportunities for test takers simply to raise questions about a test and to have their questions taken seriously by an impartial panel. The purpose of this article is to highlight some of the psychometric results reported by National Evaluation Systems (NES) in their 1999 Massachusetts Educator Certification Test (MECT) Technical Report, and more specifically, to identify those technical characteristics of the MECT that are inconsistent with the Standards. A second purpose of this article is to call for the establishment of a standing test auditing organization with investigation and sanctioning power. The significance of the present analysis is twofold: a) psychometric results for the MECT are similar in nature to psychometric results presented as evidence of test development flaws in an Alabama class-action lawsuit dealing with teacher certification (an NES-designed testing system); and b) there was no impartial enforcement agency to whom complaints about the Alabama tests could be brought, other than the court, nor is there any such agency to whom complaints about the Massachusetts tests can be brought. I begin by reviewing NES's role in Allen v. Alabama State Board of Education, 81-697-N. Next I explain the purpose and interpretation of standard item analysis procedures and statistics. Finally, I present results taken directly from the 1999 MECT Technical Report and compare them to procedures, results, and consequences of procedures followed by NES in Alabama.

Teacher Test Accountability: From Alabama to Massachusetts

         From its inception and continuing through present administrations, the Massachusetts Educator Certification Test (MECT) has attracted considerable public attention both regionally and around the world (Cochran-Smith & Dudley-Marling, in press). This attention is due in part to two disturbing facts: 1) educators seeking certification in Massachusetts have generally performed poorly on the test, and 2) in many instances politicians have used these test results to assert, among other things, that candidates who failed are “idiots” (Pressley, 1998).
         The purpose of the MECT is “to ensure that each certified educator has the knowledge and some of the skills essential to teach in Massachusetts public schools” (National Evaluation Systems, 1999, p. 22). The Massachusetts Board of Education has raised the stakes on the MECT by enacting plans to sanction institutions of higher education (IHEs) with less than an 80% pass rate for their teacher candidates (Massachusetts Department of Education, 2000). One consequence of this proposal is that most IHEs are considering requirements that the MECT be passed before students are admitted to their teacher education programs. In addition, Title II (Section 207) of the Higher Education Act of 1998 requires the compilation of state “report cards” for teacher education programs, which must include performance on certification examinations (U.S. Department of Education, 2000).
         What all of this means is that poor performance on the MECT could prevent federal funding for professional development programs, limit federal financial aid to students, allow some IHEs to be labeled publicly as “low performing”, and prove damaging at the state level when states are inevitably compared to one another upon release of the Title II report cards in October 2001. Given the personal, institutional, and national ramifications of the test results, there is no question that the MECT should be expected to meet the industry benchmarks for good test development practice as set forth in the Standards for Educational and Psychological Testing (AERA, APA, NCME, 1999). At this time, however, there is no public or private business or governmental agency either within the Commonwealth of Massachusetts or nationally that can certify or in any other formal way declare that the MECT does (or does not), in fact, meet the psychometric recommendations stipulated in the Standards. The National Board on Educational Testing and Public Policy (NBETPP) serves as an “independent organization that monitors testing in the US” but even it does not function as a regulatory agency (NBETPP, 2000).
         In addition to the absence of a national regulatory agency, many state departments of education do not have the professionally trained staff to answer technical psychometric questions directly. Nor do they usually have the expertise on staff to confront a testing company with which they have contracted and demand a sufficient response to a technical question raised by outside psychometricians. Furthermore, even when a database with the candidates' item-level responses is available for internal analysis, a state department of education does not typically conduct rigorous disconfirming analyses, e.g., analyses for evidence of adverse impact. Thus, most state departments are largely dependent on whatever information testing companies decide to release. The public is then left with an inadequate accountability process.
         One purpose of this article is to highlight some of the psychometric results reported by National Evaluation Systems in their 1999 MECT Technical Report (NES, 1999). Specifically, this article identifies technical characteristics of the MECT that are inconsistent with the Standards. A second purpose of this article is to voice one more call for the establishment of a standing test auditing organization with powers to investigate and sanction (National Commission on Testing and Public Policy, 1990; Haney, Madaus & Lyons, 1993).
         The significance of the present analysis is twofold. First, psychometric results reported by NES for the MECT are similar in nature to psychometric results entered as evidence of test development flaws in an Alabama class-action lawsuit dealing with teacher certification (Allen v. Alabama State Board of Education, 81-697-N). That suit was brought by several African-American teachers who charged, among other things, that “the State of Alabama's teacher certification tests impermissibly discriminate[d] against black persons seeking teacher certification;” the tests “[were] culturally biased;” and the tests “[had] no relationship to job performance” (Allen, 1985, p. 1048). Second, there was no impartial enforcement agency to whom complaints about the Alabama tests could be brought, other than the court, nor is there any such agency to whom complaints about the Massachusetts tests can be brought. These two points are linked in an interesting and troubling way--NES, the Massachusetts Educator Certification Tests contractor, was also the contractor for the Alabama Initial Teacher Certification Testing Program (AITCTP).
         Some of the criticism of debates about teacher testing, teacher standards, teacher quality, and accountability suggests that the arguments are, in part, ideologically rather than empirically based (Cochran-Smith, in press). This may or may not be the case. This article, however, takes the stance that regardless of one's political ideology or philosophy about testing, the MECT is technically flawed. Furthermore, because of the lack of an enforceable accountability process, the public is powerless in its efforts to question the quality or challenge the use of this state-administered set of teacher certification examinations. In this article I argue that the consequences of high-stakes teacher certification examinations are too great to leave questions about technical quality solely in the hands of state agency personnel, who are often ill-prepared and under-resourced, or in the hands of test contractors, who may face obvious conflicts-of-interest in any aggressive analyses of their own tests.
         In the sections that follow, I begin by reviewing NES's role in Allen v Alabama. Then I explain the purpose and interpretation of standard item analysis procedures and statistics. Finally I compare results taken directly from the 1999 MECT Technical Report with statistical results entered as evidence of test development flaws in Allen v Alabama.

NES and the AITCTP

Allen, et al. v. Alabama State Board of Education, et al.

         In January 1980, National Evaluation Systems was awarded a contract on a non-competitive basis for the development of the Alabama Initial Teacher Certification Testing Program (AITCTP). Item writing for these tests began in the spring of 1981, and the first administration of the tests took place on June 6, 1981. Allen v Alabama was brought just six months later, on December 15, 1981. The Allen complaint challenged the Alabama State Board of Education's requirement that applicants for state teacher certification pass certain standardized tests administered under the AITCTP. On October 14, 1983, class certification (Note 1) was granted, and the first trial was set for April 22, 1985. Subsequent to a pre-trial hearing on December 19, 1984, and “after substantial discovery was done” (Note 2), an out-of-court settlement was reached on April 4, 1985. A Consent Decree was presented to the U.S. District Court on April 8, 1985 (Note 3). The Attorney General for the State of Alabama immediately “publicly attacked the settlement” (Allen, 1985, p. 1050), claiming that it was illegal. Nonetheless, the consent decree was accepted by the court on October 25, 1985 (Allen, Oct. 25, 1985). A succession of challenges and appeals on the legality and enforceable status of the settlement resulted (Note 4). For example, on February 5, 1986, the district court vacated its October 25th order approving the consent decree (Allen, February 5, 1986, p. 76). While the plaintiffs' appeal of the February 5th decision was pending at the 11th Circuit Court of Appeals, trial began in district court on May 5, 1986.
         The AITCTP consisted of an English language proficiency examination, a basic professional studies examination, and 45 content-area examinations. The purpose of the examinations was to measure “specific competencies which are considered necessary to successfully teach in the Alabama schools” (Allen, Defendants' Pre-Trial Memorandum, 1986, p. 21). A pool of 120 items for each exam was generated--100 of which were scorable and mostly remained unchanged across the first eight administrations. Extensive revisions were incorporated into most of the tests at the ninth administration. By the start of the May 1986 trial the tests had been administered 15 times in all.
         A team of technical experts (Note 5) for the plaintiffs was hired in November 1983 (prior to the ninth administration of the exams) to examine test development, administration, and implementation procedures. The team was initially unsure about the form of the sophisticated statistical analyses they assumed would have to be conducted to test for the presence of “bias” and “discrimination”, the bases of the case. That is, the methodology for investigating what was then called “bias” and is now called “differential item functioning” was far from well established at that time (Baldus & Cole, 1980). Nevertheless, when the plaintiffs' team received the student-level item response data from the defendants, their first steps were to perform an “item analysis.” Such an analysis produces various item statistics and test reliability estimates. These initial analyses produced negative point-biserial correlations. Although point-biserial correlations are explained in detail below, suffice it to say at this point that it was a surprise to find negative point-biserial correlations between the responses that examinees provided on individual items and their total test scores. Such correlations are not an intended outcome from a well-designed testing program.
         These statistical results prompted a detailed inspection of the content, format, and answers for all the individual items on the AITCTP tests. Content analyses yielded discrepancies in the keyed correct responses in the NES test documents and the keyed correct responses in the NES-supplied machine scorable answer keys (i.e., miskeyed items were on the answer keys). This finding led to an inspection of the original NES in-house analyses which revealed that negative point-biserials for scorable items existed in their own records from the beginning of the testing program and continuing throughout the eighth administration without correction.
         What this meant for the plaintiffs was that NES had item analysis results in their own possession which indicated that there were mis-keyed items. Nonetheless, they implemented no significant changes in the exams until they were faced with a lawsuit and plaintiffs' hiring of the testing experts to do their own analyses. The defendants argued that it was normal for some problems to go undetected or uncorrected in a large-scale testing program because the overall effect is trivial for the final outcome. The problem with that argument was that many candidates were denied credit for test items on which they should have received credit, and some of those candidates failed the exam by only one point. In fact, as the plaintiffs argued, as many as 355 candidates over eight administrations of the basic professional skills exam alone should have passed but were denied that opportunity simply because of faulty items that remained on the tests (Millman, 1986, p. 285). It should be noted here that these were items that even one of the state's expert witnesses for the defense admitted were faulty (Millman, 1986, p. 280).
         Establishing that there were flawed items with negative point-biserial correlations was critical to the plaintiffs' case. The plaintiffs presented as evidence page after page of so-called “failure tables” (Note 6) with the names of candidates for each test whose answers were mis-scored on these faulty items. Based upon these failure tables, any argument from defendants that the mis-keyed items did not change the career expectations for some candidates would most likely have failed.
         In the face of this evidence, the defendants argued at trial that
“...the real disagreement is between two different testing philosophies. One of these philosophies would require virtual perfection under its proponents' rigid definition of that word. The other looks at testing as a constantly-developing art in which professional judgment ultimately determines what is appropriate in a particular case”
(Allen, Defendants' Pre-Trial Memorandum, 1986, pp. 121-22).
Plaintiffs counter-argued
“This case…is not a philosophical case at all. This case is a case on professional competence….this was an incompetent job, unprofessional, and as I said before, sloppy and shoddy, and in the case of the miskeyed items, unethical.” (Madaus, 1986, p. 185).
         Judge Thompson, in the subsequent Richardson decision which also involved the AITCTP, specifically agreed with plaintiffs on this point (Richardson, 1989, p. 821, 823, 825). Excellent reviews of the diametrically opposed plaintiff and defendant positions may be found in Walden & Deaton (1988) and Madaus (1990).
         At the same time that this case was proceeding, the plaintiffs' appeal to reverse the vacating of the original settlement was granted prior to a decision in this trial (Allen, Feb. 5, 1986, p. 75). The U.S. Court of Appeals decided the district court should have enforced the consent decree (Allen, April 22, 1987)—which the district court so ordered on May 14, 1987 (Allen, May 14, 1987). Although the decision to uphold the original settlement was a positive ruling for the plaintiffs, it also was somewhat counter-productive for them because it was unexpectedly beneficial to NES at this stage in the proceedings. That is because the evidence presented above in Allen v Alabama was critical of the state and NES (NES was explicitly referred to in the court documents). Thus, NES's best hope for avoiding a written opinion critical of their test development procedures was if plaintiffs' appeal were to be upheld and the original settlement enforced, as it was. Then there would be no evidentiary record, no court ruling, and no legal opinion that would reflect badly upon the NES procedures. Richardson v Lamar County Board of Education (87-T-568-N) commenced, however, and the actions of NES and the Alabama State Board of Education were openly discussed and critiqued in the court's opinion of November 30, 1989 (though NES was not mentioned by name in the Richardson, 1989 decision).

Richardson v Lamar County Board of Education, et al.

         Like Allen v Alabama, Richardson v Lamar County also addressed issues of the “racially disparate impact” of the AITCTP (Richardson, 1989, p. 808). The Honorable Myron H. Thompson again presided, and testimony from Allen v Alabama was admitted as evidence (Richardson, 1989). Although the defendants denied in the Allen v Alabama consent decree that the AITCTP tests were psychometrically invalid, and even though no decision was reached in the abbreviated Allen v Alabama trial, the State Board of Education did not attempt to defend the validity of the tests in Richardson v Lamar and, “in fact, it conceded at trial that plaintiff need not relitigate the issue of test validity” (Richardson v Alabama State Board of Education, 1991, p. 1240, 1246).
         Judge Thompson's position on the test development process of NES was clearly stated: “In order to fully appreciate the invalidity of the two challenged examinations, one must understand just how bankrupt the overall methodology used by the State Board and the test developer was” (Richardson, 1989, p. 825, n. 37). While sensitive to the fact that “close scrutiny of any testing program of this magnitude will inevitably reveal numerous errors,” the court concluded that these errors were not “of equal footing” and “the error rate per examination was simply too high” (Richardson, 1989, pp. 822-24). Thus, none of the examinations that comprised the certification test possessed content validity, owing to five major errors by the test developer; in addition, the test developer had made six major errors in establishing cut scores (Richardson, 1989, pp. 821-25).

Case Outcomes in Alabama

         The Allen v Alabama consent decree required Alabama to pay $500,000 in liquidated damages and issue permanent teaching certificates to a large portion of the plaintiff class (Allen, Consent Decree, Oct. 25, 1985, pp. 9-11). The decree also provided for a new teacher certification process. However, no new test was developed or implemented and the Alabama State Board of Education suspended the teacher certification testing program on July 12, 1988. In 1995 the Alabama State Legislature enacted a law requiring that teacher candidates pass an examination as a condition for graduation. Subsequently, another trial was held February 23, 1996 to decide the state's motions to modify or vacate the 1985 consent decree (Allen, 1997, p. 1414). Those motions were denied on September 8, 1997 (Allen, Sept. 8, 1997). Given the rigorous test development and monitoring conditions of the Amended Consent Decree, it was estimated by the court that the State of Alabama would not gain complete control of its teacher testing program “until the year 2015” (Allen, Jan. 5, 2000, p. 23). Only recently has a testing company stepped forward with a proposal for a new Alabama teacher certification test (Rawls, 2000).
         Plaintiff Richardson was awarded re-employment, backpay, and various other employment benefits (Richardson, 1989, pp. 825-26). Defendants (the State of Alabama and its agencies) in both cases were ordered to pay court costs and attorney fees (Richardson, 1989, pp. 825-26). However, even though NES was responsible for the development of the tests, NES was not named as one of the defendants in these cases and was not held liable for any damages (Note 7).

Psychometric and Statistical Background

         At this point it is appropriate to discuss some of the psychometric concepts and statistics that are fundamental to any question about test quality. The purpose of this discussion is to illustrate that excruciatingly complex analyses are not necessarily required in order to reveal flaws in a test or individual test items. The first steps in test development simply involve common sense practice combined with sound statistical interpretations. If those first steps are flawed, then no complex psychometric analysis will provide a remedy for the mistakes.
         One of the simplest statistics reported in the reliability analysis of a test like the MECT is the “item-test point-biserial correlation.” This statistic goes by other names such as the “item-total correlation” and the “item discrimination index.” It is called the point-biserial correlation specifically because it represents the relationship between a truly dichotomous variable (i.e., an item scored as either right or wrong) and a continuous variable (i.e., the total test score for a person). A total test score, here, is the simple sum of the number of correctly answered items on a test.
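         To make the computation concrete, the following short Python sketch (written for this discussion with invented data; it is not drawn from the NES report or the court record) computes the uncorrected item-test point-biserial correlation for each item in a small 0/1 response matrix.

import numpy as np

# Hypothetical 0/1 response matrix: rows are examinees, columns are items.
# The values are invented purely for illustration.
responses = np.array([
    [1, 1, 0, 1, 1],
    [1, 0, 0, 1, 0],
    [0, 1, 1, 0, 0],
    [1, 1, 1, 1, 1],
    [0, 0, 1, 0, 0],
])

# Total test score for each examinee: the number of items answered correctly.
total_scores = responses.sum(axis=1)

def point_biserial(item_scores, totals):
    # The point-biserial is simply the Pearson correlation between the
    # dichotomous 0/1 item scores and the continuous total test scores.
    return np.corrcoef(item_scores, totals)[0, 1]

for j in range(responses.shape[1]):
    r = point_biserial(responses[:, j], total_scores)
    print(f"Item {j + 1}: uncorrected point-biserial = {r:+.2f}")

A positive value indicates that examinees who answered the item correctly also tended to earn higher total scores; a negative value signals one of the problems discussed below.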
         The biserial correlation has a long history of statistical use (Pearson, 1909). One of its earliest measurement uses was as an item-level index of validity (Thorndike, et al., 1929, p. 129). The “point”-biserial correlation appeared specifically for individual dichotomous items in an item analysis because of concerns over the assumptions implicit in the more general biserial-correlation (Richardson & Stalnaker, 1933). It was again used as a validity index. It subsequently came to acquire diagnostic value and was re-labeled as a discrimination index (Guilford, 1936, p. 426).
         The purpose of this statistic is to determine the extent to which an individual item contributes useful information to a total test score. Useful information may be defined as the extent to which variation in the total test scores has spread examinees across a continuum of low scoring persons to high scoring persons. In the present situation, this refers to the extent to which well qualified candidates can be distinguished from less capable candidates.
         Generally, the greater the variation in the test scores, the greater the magnitude of a reliability estimate. Reliability may be defined many ways through the body of definitions and assumptions known as Classical Test Theory or CTT (Lord & Novick, 1968). According to CTT, an examinee's observed score (X) is assumed to consist of two independent components, a true score component (T) and an error component (E). One relevant definition of reliability may be expressed as the ratio of true-score variance to observed-score variance. Thus, the closer the ratio is to 1.0, the greater the proportion of observed-score variance that is attributed to true-score variance.
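         In the usual CTT notation (a standard textbook formulation, not quoted from the Technical Report), this definition can be written as

$$X = T + E, \qquad \sigma^2_X = \sigma^2_T + \sigma^2_E, \qquad \rho_{XX'} = \frac{\sigma^2_T}{\sigma^2_X} = 1 - \frac{\sigma^2_E}{\sigma^2_X}$$

where \rho_{XX'} denotes the reliability of the test.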
         The KR-20 reliability estimate is often reported for achievement tests (Kuder & Richardson, 1937, Eq. 20, p. 158). Although reliability as defined above is necessarily positive, the KR-20 can be negative under certain extraordinary conditions (Dressel, 1940) but typically ranges from 0 to +1. Nevertheless, the higher the value, the more “internally consistent” the items on a test. The magnitude of the KR-20, however, is affected by the direction and magnitude of the point-biserial correlations. Specifically, total test score reliability is decreased by the inclusion of items with near-zero point-biserial correlations and is worsened further by the inclusion of items with negative point-biserial correlations. This is because each additional faulty item increases the error variance in the scores at a faster rate than the increase in true-score variance.
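         For reference, the KR-20 estimate referred to here is conventionally written (again, a standard textbook form, not reproduced from the NES report) as

$$KR\text{-}20 = \frac{k}{k-1}\left(1 - \frac{\sum_{j=1}^{k} p_j q_j}{\sigma^2_X}\right)$$

where k is the number of items, p_j is the proportion of examinees answering item j correctly, q_j = 1 - p_j, and \sigma^2_X is the variance of the total test scores. Adding an item that barely covaries with the rest of the test raises the item-variance sum in the numerator faster than it raises \sigma^2_X, which is one way to see why faulty items pull the estimate down.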
         Technically, the point-biserial correlation represents the magnitude and direction of the relationship between the set of incorrect (scored as “0”) and correct (scored as “1”) responses to an individual item and the set of total test scores for a given group of examinees. In other words, it is a variation of the common Pearson product-moment correlation (Lord & Novick, 1968, p. 341). It can range from -1 to +1. An estimate near zero is a poorly discriminating item that contributes no useful information. An estimate of +1 would indicate a perfectly discriminating item in the sense that no other items are necessary on the test for differentiating between high scoring and low scoring persons. A value of 1.0 is never attained in practice nor is it sought (Loevinger, 1954). Negative estimates are addressed below.
         Ideally the test item point-biserial correlation should be moderately positive. Although various authors differ on what precisely constitutes “moderately positive”, a long-standing general rule of thumb among experts is that a correlation of .20 is the minimum to be considered satisfactory (Nunnally, 1967, p. 242; Donlon, 1984, p. 48) (Note 8). There is, however, no disagreement among psychometricians on the direction of the relationship—it has to be positive.
         The direction of the correlation is critical. A positive correlation means that examinees who got an item right also tended to score above the mean total test score and those who got the item wrong tended to score below the mean total test score. This is intuitively reasonable and is an intended psychometric outcome. Such an item is accepted as a good “discriminator” because it differentiates between high and low scoring examinees. This is one of the fundamental objectives of classical test theory, the theory underlying the development and use of the MECT.
         A negative point-biserial correlation, however, occurs when examinees who got an item correct tended to score below the mean total test score while those who got the item wrong tended to score above the mean total test score. This situation is contrary to all standard test practice and is not an intended psychometric outcome (Angoff, 1971, p. 27). A negative point-biserial correlation for an item can occur because of a variety of problems (Crocker & Algina, 1986). These include:
  1. chance response patterns due to a very small sample of people having been tested,
  2. no correct answers to an item,
  3. multiple correct answers to an item,
  4. the item was written in such a way that “high ability” persons read more into the item than was intended and thus chose an unintended distracter while the “low ability” people were not distracted by a subtlety in the item and answered it as intended,
  5. the item had nothing to do with the topic being tested, or
  6. the item was mis-keyed, that is, a wrong answer was mistakenly keyed as the correct one on the scoring key.
         When an item yields a negative point-biserial correlation, the test developer is obligated to remove the item from the test so that it does not enter into the total test score calculations. In fact, the typical commercial testing situation is one where the test contractor administers the test in at least one field trial, discovers problematic items, either fixes the problems or discards the items entirely, and then readministers the test prior to making the test fully operational. The presence of a flawed item on a high-stakes examination can never be defended psychometrically.
         One additional point must be made. The point-biserial correlation can be computed two ways. The first way is to correlate the set of 0/1 (incorrect/correct) responses with the total scores as described above. In this way of computing the statistic, the item for which the correlation is being computed contributes variance to the total score, hence, the correlation is necessarily magnified. That is, the statistical estimate of the extent to which an item is internally consistent with the other items “tends to be inflated” (Guilford, 1954, p. 439).
         The second way in which the correlation may be computed is to compute it between the 0/1 responses on an item and the total scores for everyone but with the responses to that particular item removed from the total score (Henrysson, 1963). This is called the “corrected point-biserial correlation.” It is a more accurate estimate of the extent to which an individual item is correlated to all the other items. It is easily calculated and reported by most statistical software packages used to perform reliability analyses (e.g., SPSS's Reliability procedure).
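         A minimal sketch of the difference between the two computations, again using invented data rather than anything from the MECT, follows. The “corrected” value correlates each item with the total score after that item's own contribution has been removed.

import numpy as np

# Hypothetical 0/1 response matrix (rows = examinees, columns = items),
# invented purely for illustration.
responses = np.array([
    [1, 1, 0, 1, 1],
    [1, 0, 0, 1, 0],
    [0, 1, 1, 0, 0],
    [1, 1, 1, 1, 1],
    [0, 0, 1, 0, 0],
])
total_scores = responses.sum(axis=1)

for j in range(responses.shape[1]):
    item = responses[:, j]
    rest = total_scores - item          # total score with this item's contribution removed
    uncorrected = np.corrcoef(item, total_scores)[0, 1]
    corrected = np.corrcoef(item, rest)[0, 1]
    print(f"Item {j + 1}: uncorrected = {uncorrected:+.2f}, corrected = {corrected:+.2f}")

Because the item's own variance no longer inflates the relationship, the corrected coefficient is typically lower than the uncorrected one, and an item that looks marginally acceptable uncorrected can turn negative once corrected.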
         Various concerns have been raised over the interpretation of the point-biserial correlation because the magnitude of the coefficient is affected by the difficulty of the item. The fact is, however, that all the various discrimination indices are highly positively correlated (Nunnally, 1936; Crocker & Algina, 1986). Furthermore, even though the magnitude of the point-biserial correlation tends to be less than that of the biserial correlation, all writers agree on the interpretation of negative discriminations. “No test item, regardless of its intended purpose, is useful if it yields a negative discrimination index” (Ebel & Frisbie, 1991, p. 237). Such an item “lowers test reliability and, no doubt, validity as well” (Hopkins, 1998, p. 261). Furthermore, “on subsequent versions of the test, these items [with negative point-biserial correlations] should be revised or eliminated” (Hopkins, 1998, p. 259).

NES AND THE MECT

The 1999 MECT Technical Report

         In July 1999 NES released their five-volume Technical Report on the Massachusetts Educator Certification Tests. Volume I describes the test design, the item development process, and the psychometric results. Volume II describes the subject matter knowledge and test objectives. Volume III consists of “correlation matrices by test field.” Volume IV consists of various content validation materials and reports. Volume V consists of pilot material, bias review material, and qualifying score material. The report was immediately hailed by Massachusetts Commissioner of Education David P. Driscoll: “I have said all along that I stand by the reliability and validity of the tests, and this report supports it” (Massachusetts Department of Education, 1999).

Field Trial

         Technical Report Volume I contains the psychometric results for the first four administrations of the MECT (April, July, and October 1998, and January 1999). It does not, however, contain any results from a full-scale field trial, nor are any “pilot” test results reported (Note 9). There is no information on how many different items were tested, where the items came from, how many items were revised or rejected, what the revisions were to any revised items, or what the psychometric item-level results were. In fact, there is no field trial evidence in support of the initial inclusion of any of the individual items on the operational exams because there was no field trial.
         Interestingly, the Department of Education released a brochure in January 1998 stating that the first two test administrations would not count for certification—implying that the tests would serve as a field trial. Chairman of the Board of Education John Silber, however, declared in March 1998 that the public had been misinformed and that the first two tests would indeed count for certification. This policy reversal was unfortunate because of the confusion and anxiety it created among the first group of examinees and because it prevented the gathering of statistical results that could have improved the quality of the test.
         NES had considered a field trial of their teacher test in Alabama but did not conduct one and presumably came to regret that decision. In Allen v Alabama they argued, “As the evidence will show, there was no need to conduct a separate large-scale field tryout in this case, since the first test administration served that purpose” (Allen, Defendants' Pre-Trial Memorandum, 1986, p. 113). That decision was unwise because it directly affected the implementation and validity of their procedures. For example, “The court has no doubt that, after the results from the first administration of those 35 examinations were tallied, the test developer knew that its cut-score procedures had failed” (Richardson, 1989, p. 823). In fact, the original settlement in Allen v Alabama stipulated that in any new operational examination, the items “shall be field tested using a large scale field test” (Allen, Consent Decree, Oct. 25, 1985, p. 3).
         The first two administrations of the MECT would have served an important purpose as a full-scale field trial for the new tests, thus avoiding the mistake made in Alabama. However, that opportunity to detect and correct problems in administration, scoring, and interpretation was lost. The impact of the lack of a field trial is further magnified when it is noted that the time period between when NES was awarded the Massachusetts contract (October 1997) and when the first tests were administered (April 1998) was even shorter than the time period NES had to develop the tests in Alabama—a time frame that the court referred to as “quite short” (Richardson, 1989, p. 817). Furthermore, even though NES may have drawn many of the MECT items from existing test item banks, items written and used elsewhere still must be field tested on each new population of teacher candidates.

Point-biserial correlations

         In the NES Technical Report Volume I, Chapter 8, p. 140, there is a description of when an item is flagged for further scrutiny. One of the conditions is when an item displays an “item-to-test point-biserial correlation less than 0.10 (if the percent of examinees who selected the correct response is less than 50)”. After such an item is found, “The accuracy of each flagged item is reverified before examinees are scored.” The Technical Report, however, does not report the percent of examinees who selected the correct response on each item. Nor is there an explanation of what the reverification process consisted of, how many items were flagged, or what was subsequently modified on flagged items. Thus, there is no way to determine the extent to which NES actually followed its own stated guidelines and procedures in the development of the MECT. The relevance of the gap between what NES states as its review procedures and what it actually performed is that in Alabama, under the topic of content validity, the defense argued that items rated as “content invalid” were revised by NES and that these “revisions were approved by Alabama panelists before they appeared on a test.” The court, however, found that “no such process occurred” (Richardson, 1989, p. 822).
         The following table summarizes the point-biserial estimates reported for the MECT. Note that these are not the results prior to NES conducting the item review process. These are the results for the “scorable items” after the NES review.

Table 1
Problematic Point Biserial Correlations
from the 1999 MECT Technical Report

Date       Number     N of        Items with point-biserials <= 0.20             % of total
           tested     M/C items   <.00   .00-.05   .06-.10   .11-.15   .16-.20   items
Apr-98      4,891        315         1       7        15        24        46     29.5%
Jul-98      5,716        443         0       2        14        17        39     16.3%
Oct-98      5,286        379         2       5        10        15        32     16.9%
Jan-99      9,471        507         1       4        14        35        49     20.3%
Total      25,364      1,644         4      18        53        91       166     332/1,644 = 20.2%


Test              Number     N of        Items with point-biserials <= 0.20             % of total
                  tested     M/C items   <.00   .00-.05   .06-.10   .11-.15   .16-.20   items
Writing            9,750         92         0       0         0         1         1      2.2%
Reading            9,455        144         0       0         1         1         6      5.6%
Early Childhood      936        256         0       3        18        30        46     37.9%
Elementary         3,125        256         0       2         0         3        27     12.5%
Social Studies       259        128         1       0         1         6        14     17.2%
History              108         64         0       0         2         6         5     20.3%
English              695        256         0       3        11        12        29     21.5%
Mathematics          345        192         1       0         4         4         7      8.3%
Special Needs        691        256         2      10        16        28        31     34.0%
Total                         1,644         4      18        53        91       166
Source: Massachusetts Educator Certification Tests: Technical Report, 1999
         A number of observations may be made from the information in this table. First, of the 1644 total number of items administered over the first four dates, 332 items (20.19%) had point-biserial correlations that are lower than the industry minimum standard criterion of .20. That is a huge percent of poorly performing items for a high-stakes examination. Second, while there are relatively few suspect items on the Reading and Writing tests, there are large numbers of items with poor statistics on many of the subject matter tests. The Early Childhood, English, and Special Needs tests, in particular, consisted of extraordinarily large percentages of poorly performing items (37.9%, 21.5%, and 34%, respectively). Overall, of the 332 items with low point-biserials, 322 (97%) occurred on the subject matter tests. On the face of it, the results for the subject matter tests are terrible. There is, unfortunately, no authoritative source in the literature (including the Standards) that tells us unequivocally whether or not this overall 20.19% of poorly performing items on a licensure examination with high-stakes consequences is acceptable, not acceptable, or even terrible. Given the steps that NES claims were followed in selecting items from existing item banks and in writing new items, there simply should not be this many technically poor items on these tests.

Reliability

         In Volume I, Chapter 9, p. 188 of the Technical Report, the following statement appears. “It is further generally agreed that reliability estimates lower than .70 may call for the exercise of considerable caution.” The practical significance of this statement lies in the fact that when reliability is less than .70, it means that at least 30% of the variance in an examinee's test score is attributable to something other than the subject matter that is being tested. In other words, an examinee's test score consists of less than 70% true-score variance and more than 30% error variance. This ratio of true-score variance to error-variance is not desirable in high-stakes examinations (Haney, et al., 1999). Nearly 40 years ago, Nunnally went so far as to describe as “frightening” the extent to which measurement error is present in high-stakes examinations even with reliability estimates of .90 (1967, p. 226).
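         In terms of the variance decomposition given earlier, the arithmetic is direct:

$$\rho_{XX'} = .70 \;\;\Rightarrow\;\; \frac{\sigma^2_E}{\sigma^2_X} = 1 - .70 = .30,$$

that is, 30% of the observed-score variance is attributable to measurement error rather than to the knowledge or skill being tested, and any reliability below .70 implies an even larger error share.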
         NES, however, suggests that their reported item statistics and reliability estimates should not greatly influence one's judgment about the overall quality of the tests because the multiple-choice items make up only part of the exam format (NES, 1999, p. 189). The problem with that argument, as noted by Judge Thompson in Richardson (1989, pp. 824-25), is that small errors do accumulate and can invalidate the use for which the test was developed. This issue of simply dismissing troubling statistics as inconsequential is particularly ironic when the MECT has been described by the non-profit Education Trust as “the best [teacher test] in the country” (Daley, Vigue & Zernike, 1999).
         The Special Needs test deserves closer attention because it had problems at each reported administration.
  1. The sample sizes for the tests were 131, 206, 154, and 200, respectively. Based on NES's own criteria (NES, 1999, p. 187), these sample sizes are sufficient for the generation of statistical estimates that would be relatively unaffected by sampling error.
  2. The KR-20 reliability coefficients for the four administrations were .67, .76, .76, and .74, respectively. These are minimally tolerable for the last three administrations. The reliability is not acceptable, however, for the first administration. This means that people were denied certification in Special Needs based on their performance on a test that was deficient even by NES's own guidelines.
  3. For the April 1998 administration, eleven Special Needs items had point-biserials of .10 or less (again, one of NES's stated criteria for “flagging” an item). For the July 1998 administration it was five items, for October 1998 it was four items, and for January 1999 it was eight items. In fact, in two of the administrations there was an item with a negative point-biserial. (Given the previous discussion about the way the point-biserials were likely to have been calculated (uncorrected), the frequency of negative point-biserials would likely increase if the corrected coefficients had been reported.) Given that there is no specific information about flagging, deleting, or replacing items, it is possible that these same faulty items were, and continue to be, carried over from one administration to the next.

The Linkage between Alabama and Massachusetts: A modus operandi

         At this point the reasonable reader might ask why I am expending so much effort upon what appears to be a relatively minor problem—some items had negative point-biserial correlations. NES, for example, would likely call this analysis “item-bashing”, as this type of analysis was referred to in Alabama. The significance of these findings lies in the apparent connection between NES's work in Alabama and their present work on the MECT in Massachusetts.
         In Alabama, defendants claimed that
Before any item was allowed to contribute to a candidate's score, and before the final 100 scorable items were selected, the item statistics for all the items of the test were reviewed and any items identified as questionable were checked for content and a decision was made about each such item (Allen, Defendants' Pre-Trial Memorandum, 1986, pp. 113-14).
         In fact, in Alabama there were negative point-biserial correlations in the original reliability reports generated by NES (their own documents reported negative point-biserial correlations as large as -0.70) and those negative point-biserial correlations for the same scorable items remained after multiple administrations of the examinations. Simply taking out the worst 20 items in each test did not remove all the faulty items since each exam had to have 100 scorable items. As seen above in Table 1, the MECT has statistically flawed items on many tests, these items have been there since the first administration, and they may be the same items still being used in current administrations.
         In Alabama, the negative point-biserial correlations led to the discovery of items for which there was no correct answer. Also discovered were items for which there were multiple correct answers, as well as items for objectives that had been rated “not as job related.” Additionally, items were found to have been mis-keyed on the item analysis scoring forms. Furthermore, those flawed items existed unchanged for the first eight administrations of the tests. They were not revised, deleted, or changed to “experimental” non-scorable status until the ninth administration--one month after the plaintiffs' team agreed to take the case. Defendants argued that “problems with the testing instrument—such as mis-keyed answers” were simply one component of many that is taken into account by the “error of measurement” (Allen, Defendants' Pre-Trial Memorandum, 1986, pp. 108-113). (Note 10)
         As noted earlier, poor item statistics may result for many reasons. Of those reasons the only acceptable one is that they may be due to sampling error (chance). That explanation is unlikely with respect to the MECT, however, because the sample sizes are sufficiently large, and the pattern of faulty item statistics persists over time. The extent to which flawed items may exist in the Massachusetts tests can only be determined by release of the student-level item response data and the content of the actual items, something that has not been done to date. Furthermore, such a release of additional technical information, or item response data, or item content is highly unlikely. (Note 11) In Alabama, the statistical results and in-house documents were not produced by NES until the plaintiffs seriously discussed contempt of court actions against NES personnel. Consequently, there is little reason to expect that NES will voluntarily release MECT data or results not explicitly covered in their original confidential contract.
         In Alabama there were no independent testing experts appointed or contracted to monitor the test developer's work. This fact led the court to conclude that “The developer's work product was accepted by the state largely on the basis of faith” (Richardson, 1989, p. 817). In Massachusetts the original MECT contract called for the contractor to recommend a technical review committee of nationally recognized experts who were external to their organization (MDOE, 1997, Task 2.14.i, p. 11). The committee was to review the test items, test administration, and scoring procedures for validity and reliability and was to report its findings to the Department of Education. NES did not form such an independent technical advisory committee for the MECT nor has a formal independent review of the MECT been undertaken by anyone else.
         It is not in the short-term business interests of a testing company to conduct disconfirming studies on the technical quality of their commercial product. The MECT is, of course, a product that NES markets as an example of what they can build for other states who might be interested in certification examinations. It is, however, in the best interests of a state for such studies to be conducted. For example, the Commonwealth of Massachusetts has a statutory responsibility to “protect the health, safety and welfare of citizens” who seek services from licensed professionals (NES, 1999, p. 16). In the present situation “citizens” are defined by the Board of Education as “the children in our schools” (MDOE, Special Meeting Minutes, 1998). What has apparently been lost in all of this is the fact that prospective educators are “citizens” and deserve protection too--protection from a faulty product that can damage the profession of teaching and can alter drastically the career paths of individuals. Educators and the public at large deserve the highest quality certification examinations that the industry is capable of providing. There is ample evidence that the MECT may not be such an examination.

Conclusion

         A technical review of the psychometric characteristics of the MECT has been called for in this journal (Haney et al., 1999; Wainer, 1999). The year 2000 and 2001 budgets passed by the Legislature of the Commonwealth also called for such an independent audit of the MECT. Those budget provisions, however, were vetoed by Governor Cellucci, and the legislature failed to override the vetoes. Until an independent review committee with full investigative authority is convened by the Commonwealth, the only technical material publicly available for independent analysis is the 1999 MECT Technical Report generated by NES (NES, 1999). (Note 12) One of the important points made by Haney et al. (1999) was that the Massachusetts Department of Education is not the appropriate agency for conducting such a review. Part of my point here is that the only review of the MECT the Commonwealth may ever see is the one prepared by NES of its own test. Such a review clearly raises a concern over conflict-of-interest (Madaus, 1990; Downing & Haladyna, 1996).
         Given the national interest in “higher standards” for achievement and assessment, it must be recognized that there are no “gold” standards by which a testing program such as the MECT can be evaluated (Haney & Madaus, 1990; Haney, 1996). This is ironic given how technically sophisticated the testing profession has become. Consequently, without “gold” standards to define test development practice, there are no legislated penalties for faulty products (tests) and there is no enforced protection for the public. Testing companies may lose business if the details of shoddy practice are made known and the public may appeal to the judicial system for damages. But the opportunity for a test taker simply to raise a question about a test that can shape his or her career and to have that question taken seriously by an impartial panel should be the right of every test-taking citizen. (Note 13)
         Contrary to former Chairman John Silber's statement to the Massachusetts Board of Education, “there is nothing wrong with this test” (Minutes of the Board, Nov. 11, 1998) and the statement by the chief of staff for the MDOE, Alan Safran, “[the test] does not show who will become a great teacher, but it does reliably and validly rule out those who would not” (Associated Press, 1998), there is ample evidence that there may be significant psychometric problems with the MECT. These problems, in turn, have significant practical ramifications for certification candidates and the institutions responsible for their training.
         Is the MECT sound enough to support assertions that the candidates are “idiots”? No. Is there evidence that poor performance may, in part, reflect a flawed test containing defective items? Yes. Should the Massachusetts Commissioner of Education independently follow through on the twice-rejected Senate bill to "select a panel of three experts from out-of-state from a list of nationally qualified experts in educational and employment testing, provided by the National Research Council of the National Academy of Sciences, to perform a study of the validity and reliability of the Massachusetts educator certification test as used in the certification of new teachers and as used in the elimination of certification approval of teacher preparation programs and institutions to endorse candidates for teacher certification?" (Massachusetts, 1999, Section 326. (S191K)). Absolutely. Should such a panel serve as a blueprint for the formation of a standing national organization for test review and consumer protection? Yes.
         As we enter the 21st century, high stakes tests are becoming increasingly powerful determinants of students' and teachers' lives and life chances. Title II of the 1998 Higher Education Act, in particular, has encouraged a kind of de facto national program of teacher testing. Given the extraordinarily high stakes of these tests, the personal and institutional consequences of poorly designed teacher tests have become too great simply to allow test developers to serve as their own (and lone) quality control and their own (and often non-existent) dispute resolution boards.
         Now is the time for the community of professional educators and psychometricians to take a stand and demand that test developers be held accountable for their products in the test marketplace. At the very least, this would require (1) a mechanism for an independent external audit of the technical characteristics of any test used for high-stakes decisions, and (2) a mechanism for the resolution of disputed scores, results, and cases.
         Only then will taxpayers, educators, and test candidates have confidence that teacher tests are actually providing the information intended by legislative actions to raise educational standards and enhance teacher quality. Title II legislation certainly did not cause the high-stakes testing juggernaut that is rolling through all aspects of educational reform in the U.S. and elsewhere. With mandatory teacher test reporting now tied to federal funding, however, Title II legislation has added to the size, weight, and power of that juggernaut and strengthened its hold on reform. For this reason, federal policy makers are now responsible for providing legislative assurances that the public will be protected from the shoddy craftsmanship of some tests and some testing companies and that there will be remedies in place to right the mistakes that result from negligence. This article therefore ends with a call to action: policy makers must now incorporate into the federal legislation that requires state teacher test reporting concomitant requirements for the establishment of independent audits and dispute resolution boards.

Notes

I wish to thank Marilyn Cochran-Smith, Walt Haney, Joseph Herlihy, Craig Kowalski, George Madaus, and Diana Pullin for their advice and editorial comments.
  1. The class consisted of “all black persons who have been or will be denied any level teaching certificate because of their failure to pass the tests by the Alabama Initial Teacher Certification Testing Program.” (Order On Pretrial Hearing, 1984).
  2. This specific wording does not appear until the Amended Consent Decree of Jan. 5, 2000.
  3. Among other things, conditions were set on the development of new tests, an independent monitoring and oversight panel was established, grade point averages were ordered to be considered in the certification process, and defendants would pay compensatory damages to the plaintiffs and plaintiffs' attorneys' fees and costs (Consent Decree, 1985).
  4. That decision has been upheld numerous times since. The latest Amended Consent Decree was approved on January 5, 2000 (Allen, Jan. 5, 2000).
  5. George Madaus, Joseph Pedulla, John Poggio, Lloyd Bond, Ayres D'Costa, Larry Ludlow.
  6. “Failure tables” consisted of each applicant's name, raw scores on the exams, the exam cut-scores, actual responses to suspect items, and recomputed raw scores if the applicant should have been credited with a correct response to a suspect item. Examinees were identified in court who had failed an examination by one point (i.e., missed the cut-score by one item) but had actually responded correctly to a miskeyed item. For example, on the fifth administration of the Elementary Education exam there were six people who should have been scored correct on scorable item #43 (the so-called “carrot” item) but were not. Their total scores were 72. The cut-score was 73. These individuals should have passed the examination. There was even a candidate who took an exam multiple times and failed but who should have passed on each occasion. (A hypothetical sketch of this kind of score recomputation appears after these notes.)
  7. The standard contract for test development will include some specification of indemnification. In the case of a state agency like the MDOE, the Request For Responses will typically specify protection for the state, holding the contractor responsible for damages (MDOE, 1997, V. (G), 1, p.17). Contractors, understandably, are reluctant to enter into such an agreement and have been successful in striking this language from the contract.
  8. The rationale is that .20 is approximately the minimum correlation required to achieve statistical significance at alpha = .05 for a sample size of 100. Under the null hypothesis of a zero correlation, the standard error of r for a sample of 100 is roughly 1/√100 = .10, and a correlation must be about twice its standard error to differ significantly from zero, hence the .20 criterion. (This arithmetic is sketched in code after these notes.)
  9. The difference between piloting test items, as NES did, and conducting a field-trial is that the field-trial simulates the actual operational test-taking conditions. Its value is that problems can be detected that are otherwise difficult to uncover. For example, non-standardized testing conditions created numerous sources of measurement error on the first administration of the MECT (Haney et al., 1999).
  10. This interpretation of measurement error goes considerably beyond conventional practice, in which “Errors of measurement are generally viewed as random and unpredictable” (Standards, 1999, p. 26). A miskeyed answer key is not a random error. It is a mistake, and its effect is felt most strongly by those near the cut-score. Although false-positive passes may benefit from the mistake, it is the false-negative failures who suffer and, as a consequence, seek a legal remedy.
  11. To date, the MDOE has routinely ignored questions requesting technical information, for example: How many items originally came from item banks? Who developed the item banks? How many items have been replaced? What are the reliabilities of new items? What are the technical characteristics of the present tests? Will the Technical Report be updated? What “disparate impact” analyses have been conducted?
  12. From the start of testing to the present time, individual IHEs have not been able to initiate any systematic analysis of their own student summary scores, let alone any statewide reliability and validity analyses. The primary reason for this paucity of within- and across-institution analysis is that NES only provides IHEs with student summary scores printed on paper; no electronic medium is provided for accessing and using one's own institutional data. Thus, each IHE faces the formidable task of hand-entering each set of scores for each student for each test date. This results in a unique and incompatible database for each of the Commonwealth's IHEs.
  13. I assert that the right to question any aspect of a high-stakes examination should take precedence over the waiver required when one takes the MECT: “I waive rights to all further claims, specifically including, but not limited to, claims for negligence arising out of any acts or omissions of the Massachusetts Department of Education and the Contractor for the Massachusetts Educator Certification Tests (including their respective employees, agents, and contractors)” (MDOE, 2001, p. 28).
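
The following is a minimal, hypothetical sketch (in Python) of the kind of score recomputation described in Note 6: credit the miskeyed item and see whose pass/fail decision flips. Only the cut-score of 73, item #43, and the one-point shortfall come from the note; the candidate records and the code itself are invented for illustration.

    # Hypothetical illustration of the "failure table" recomputation in Note 6.
    # Only the cut-score (73), the miskeyed item (#43), and the one-point
    # shortfall are taken from the note; the candidate records are invented.

    CUT_SCORE = 73
    MISKEYED_ITEM = 43  # the so-called "carrot" item

    candidates = [
        # (candidate id, reported raw score, deserved credit on item 43?)
        ("A", 72, True),   # failed by one point, but answered item 43 correctly
        ("B", 72, False),  # failed by one point, no credit due
        ("C", 75, True),   # passed either way
    ]

    for cand_id, raw, deserved_credit in candidates:
        corrected = raw + 1 if deserved_credit else raw
        before = "pass" if raw >= CUT_SCORE else "fail"
        after = "pass" if corrected >= CUT_SCORE else "fail"
        flag = "  <-- decision flips" if before != after else ""
        print(f"candidate {cand_id}: {raw} -> {corrected} ({before} -> {after}){flag}")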
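
The arithmetic behind Note 8 can be checked with a short sketch as well, assuming a two-tailed test of a zero correlation with a pilot sample of N = 100; the exact critical value comes from the t distribution (via scipy), and the shortcut is the "twice the standard error" rule cited in the note.

    # Critical correlation at alpha = .05 for N = 100 (two-tailed),
    # compared with the 2 x standard-error shortcut described in Note 8.
    from math import sqrt
    from scipy.stats import t

    N = 100
    alpha = 0.05

    # Exact value: from the t test of H0: rho = 0 with df = N - 2
    t_crit = t.ppf(1 - alpha / 2, df=N - 2)
    r_crit = t_crit / sqrt(t_crit**2 + N - 2)

    # Shortcut: twice the approximate standard error of r, SE ~ 1/sqrt(N)
    r_shortcut = 2 / sqrt(N)

    print(f"exact critical r: {r_crit:.3f}")     # about 0.197
    print(f"2 x SE shortcut : {r_shortcut:.2f}")  # 0.20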

References

Allen v. Alabama State Board of Education, 612 F. Supp. 1046 (M.D. Ala. 1985).
Allen v. Alabama State Board of Education, 636 F. Supp. 64 (M.D. Ala. Feb. 5, 1986).
Allen v. Alabama State Board of Education, 816 F. 2d 575 (11th Cir. April 22, 1987).
Allen v. Alabama State Board of Education, 976 F. Supp. 1410 (M.D. Ala. Sept. 8, 1997).
Allen v. Alabama State Board of Education, 190 F.R.D. 602 (M.D. Ala. Jan. 5, 2000).
American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (1999). Standards for Educational and Psychological Testing. Washington, D.C.: American Educational Research Association.
Angoff, W. (Ed.). (1971). The College Board Admissions Testing Program: A Technical Report on Research and Development Activities Relating to the Scholastic Aptitude Test and Achievement Tests. NY: College Entrance Examination Board.
Associated Press Archives. (October 4, 1998). State Administers Teacher Certification Test Amid Ongoing Complaints.
Baldus, D.C. & Cole, J.W.L. (1980). Statistical Proof of Discrimination. NY: McGraw-Hill.
Cochran-Smith, M. (in press). The outcomes question in teacher education. Teaching and Teacher Education.
Cochran-Smith, M. & Dudley-Marling, C. (in press). The flunk heard round the world. Teaching Education.
Consent Decree, Allen v. Alabama State Board of Education, No. 81-697-N (M.D. Ala. Oct. 25, 1985).
Crocker, L. & Algina, J. (1986). Introduction to Classical and Modern Test Theory. NY: Holt, Rinehart and Winston.
Daley, B. (1999). “Teacher exam authors put to the test”. Boston Globe, 10/7/98, B3.
Daley, B.; Vigue, D.I. & Zernike, K. (1999) “Survey says Massachusetts Teacher Test is best in US”. Boston Globe, 6/22/99, B02.
Defendant's Pre-trial Memorandum, Allen v. Alabama State Board of Education, No. 81-697-N (M.D. Ala. May 1, 1986).
Donlon, T. (ed.) (1984). The College Board Technical Handbook for the Scholastic Aptitude Test and Achievement Tests. NY: College Entrance Examination Board.
Downing, S. & Haladyna, T. (1996). A model for evaluating high stakes testing programs: Why the fox should not guard the chicken coop. Educational Measurement: Issues and Practice, 15(1), 5-12.
Dressel, P.L. (1940). Some remarks on the Kuder-Richardson reliability coefficient. Psychometrika, 5, 305-310.
Ebel, R.L. & Frisbie, D.A. (1991) (5th ed.). Essentials of Educational Measurement. NJ: Prentice Hall.
Guilford, J.P. (1936) (1st ed.). Psychometric Methods. NY: McGraw-Hill.
Guilford, J.P. (1954) (2nd ed.). Psychometric Methods. NY: McGraw-Hill.
Haney, W., & Madaus, G. F. (1990). Evolution of ethical and technical standards. In R. K. Hambleton & J. N. Zaal (Eds.), Advances in Educational and Psychological Testing (pp. 395-425).
Haney, W.M., Madaus, G.F. & Lyons, R. (1993). The Fractured Marketplace for Standardized Testing. Boston: Kluwer.
Haney, W. (1996). Standards, Schmandards: The need for bringing test standards to bear on assessment practice. Paper presented at the annual meeting of the American Educational Research Association, New York, NY.
Haney, W., Fowler, C., Wheelock, A., Bebell, D. & Malec, N. (1999). Less truth than error?: An independent study of the Massachusetts Teacher Tests. Education Policy Analysis Archives, 7(4). Available online at http://epaa.asu.edu/epaa/v7n4/.
Henrysson, S. (1963). Correction for item-total correlations in item analysis. Psychometrika, 28, 211-218.
Hopkins, K.D. (1998) (8th ed.). Educational and Psychological Measurement and Evaluation. Boston: Allyn and Bacon.
Kuder, G.F. & Richardson, M.W. (1937). The theory of the estimation of test reliability. Psychometrika, 2, 151-160.
Loevinger, J. (1954). The attenuation paradox in test theory. Psychological Bulletin, 51, 493-504.
Lord, F.M. & Novick, M.R. (1968). Statistical Theories of Mental Test Scores. Reading, MA: Addison-Wesley.
Madaus, G. (May 19-20, 1986). Testimony in Allen v Alabama (81-697-N).
Madaus, G. (1990). Legal and professional issues in teacher certification testing: A psychometric snark hunt. In J.V. Mitchell, S. Wise, & B. Plake (Eds.), Assessment of teaching: Purposes, practices, and implications for the profession (pp. 209-260). Hillsdale, NJ: Lawrence Erlbaum Associates.
Massachusetts. (1999). FY 2000-2001 Budget.
Massachusetts Department of Education (February 24, 1997). Massachusetts Teacher Certification Tests of Communication and Literacy Skills and Subject Matter Knowledge: Request for Responses (RFR).
Massachusetts Department of Education (July 1, 1998). Board of Education Special Meeting Minutes. http://www.doe.mass.edu/boe/minutes/98/min070198.html.
Massachusetts Department of Education (July 27, 1999). Department of Education Press Release. http://www.doe.mass.edu/news/archive99/pr072799.html.
Massachusetts Department of Education (November 28, 2000). Board of Education Regular Meeting Minutes. http://www.doe.mass.edu/boe/minutes/00/1128reg.pdf.
Massachusetts Department of Education (February 16, 2001). Massachusetts Educator Certification Tests: Registration Bulletin. http://www.doe.mass.edu/teachertest/bulletin00/00bulletin.pdf
Melnick, S. & Pullin, D. (1999, April). Teacher education & testing in Massachusetts: The issues, the facts, and conclusions for institutions of higher education. Boston: Association of Independent Colleges and Universities of Massachusetts.
Millman, J. (June 17, 1986). Testimony in Allen v Alabama (81-697-N).
National Board on Educational Testing & Public Policy. (2000). Policy statement. Chestnut Hill, MA: Lynch School of Education, Boston College.
National Commission on Testing and Public Policy. (1990). From Gatekeeper to Gateway: Transforming Testing in America. Chestnut Hill, MA: Lynch School of Education, Boston College.
National Evaluation Systems. (1999). Massachusetts Educator Certification Tests Technical Report. Amherst, MA: National Evaluation Systems.
Nunnally, J. (1967). Psychometric Theory. NY: McGraw-Hill.
Order On Pretrial Hearing, Allen v. Alabama State Board of Education, No. 81-697-N (M.D. Ala. Dec. 19, 1984).
Pearson, K. (1909). On a new method of determining correlation between a measured character A and a character B, of which only the percentage of cases wherein B exceeds or falls short of a given intensity is recorded for each grade of A. Biometrika, Vol. VII.
Pressley, D.S. (1998). “Dumb struck: Finneran slams 'idiots' who failed teacher tests.” Boston Herald, 6/26/98, pp. 1, 28.
Rawls, P. (2000). “ACT may design test for Alabama's future teachers.” The Associated Press, 7/11/00
Richardson v. Lamar County Board of Education, 729 F. Supp. 806. (M.D. Ala 1989) aff'd, 935 F. 2d 1240 (11th Cir. 1991).
Richardson, M.W. & Stalnaker, J.M. (1933). A note on the use of bi-serial r in test research. Journal of General Psychology, 8, 463-465.
Thorndike, E.L., Bregman, E.O., Cobb, M.V., Woodyard, E., et al. (1929). The Measurement of Intelligence. NY: Teachers College, Columbia University.
U.S. Department of Education, National Center for Education Statistics. (2000). Reference and Reporting Guide for Preparing State and Institutional Reports on the Quality of Teacher Preparation: Title II, Higher Education Act (NCES 2000-089). Washington, DC.
Wainer, H. (1999). Some comments on the Ad Hoc Committee's critique of the Massachusetts Teacher Tests. Education Policy Analysis Archives, 7(5). Available online at http://epaa.asu.edu/epaa/v7n5.html.
Walden, J.C. & Deaton, W.L. (1988). Alabama's teacher certification test fails. 42 Ed. Law Rep. 1.

About the Author

Larry H. Ludlow
Associate Professor
Boston College
Lynch School of Education
Educational Research, Measurement, and Evaluation Department
Email: Ludlow@bc.edu
Larry Ludlow is an Associate Professor in the Lynch School of Education at Boston College. He teaches courses in research methods, statistics, and psychometrics. His research interests include teacher testing, faculty evaluations, applied psychometrics, and the history of statistics. 
Source: http://epaa.asu.edu/epaa/v9n6.html