This topic introduces a few particularly popular themes related to the application of Lertap 5.

 

Visual item analysis with quintile plots

 

Assessing the quality of cognitive test items has traditionally been a matter of poring over tables of numeric data, embroidered with global summary measures such as estimates of test reliability.

Lertap's quintile plots provide an alternative method: pictures.

Have a look at some examples, and see if you too might not be found wearing smiles for quintiles. Click here to have an initial look. See "packed plots" in action here. See fancy new capabilities introduced in January 2015.

Then, pour yerself another cuppa something, sit back, and take in a short practical example from some quality achievement tests developed and used in Central Java ('Jateng').

Make a leap for Lelp (Lertap help) to see even more about quintiles and associated Lertap 5 options.

[Figure: packed quintile plots for items I12 and I3]
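
For readers curious about the mechanics, the idea behind a quintile plot is easy to sketch: rank the examinees on total test score, split them into five ability groups, and chart the percentage of each group endorsing each of an item's options. The Python sketch below is a minimal illustration of that idea, not Lertap's own code; the data layout and function name are hypothetical.

import pandas as pd

def quintile_table(scores, answers, options="ABCD"):
    # scores: total test scores, one per examinee
    # answers: the same examinees' responses to one item ("A", "B", ...)
    df = pd.DataFrame({"score": scores, "answer": answers})
    # Rank on total score, then cut the ranks into five equal-sized groups.
    df["quintile"] = pd.qcut(df["score"].rank(method="first"), 5,
                             labels=["Q1", "Q2", "Q3", "Q4", "Q5"])
    # Rows: quintiles (weakest to strongest); columns: item options.
    table = pd.crosstab(df["quintile"], df["answer"], normalize="index")
    return (100 * table.reindex(columns=list(options), fill_value=0.0)).round(1)

For a well-behaved item, the keyed option's line climbs from Q1 to Q5 while each distractor's line falls; a distractor whose line climbs is a red flag.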

 

Mastery tests, certification, licensing

 

The Standards for Educational and Psychological Testing (1999), published by the American Educational Research Association, recommend the use of 'special' statistics when the measurement process involves the use of cut scores. Mastery, licensing, and certification tests are examples of applications which typically use cut scores, often on a pass-fail basis.

Lertap 5 produces all of the 'special' statistics recommended in the Standards, and then some. We've got a whiz-bang, top-flight paper which you'll not want to miss if you hope to make the cut.

NCCA, the National Commission for Certifying Agencies, has a special report form used to summarize results from mastery (or pass/fail) exams. A paper which indicates how Lertap's output links in with the information requested by NCCA is here.

Lelp (Lertap help) has these topics covered too.

The screen snapshot below displays part of Lertap 5's report for a certification test.
 

[Figure: variance components section of Lertap 5's report for a certification test]
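
By way of illustration only (the Standards and the papers above are the authorities on what Lertap reports), one of the best-known 'special' statistics for cut-score settings is Livingston's coefficient, which adjusts a conventional reliability estimate for the distance between the mean score and the cut score. A minimal sketch:

import numpy as np

def livingston_k2(scores, alpha, cut):
    # Livingston's criterion-referenced coefficient:
    # k2 = (alpha*var + (mean - cut)^2) / (var + (mean - cut)^2).
    # The further the group mean sits from the cut score, the higher k2.
    scores = np.asarray(scores, dtype=float)
    var = scores.var(ddof=1)           # sample variance of the test scores
    dev = (scores.mean() - cut) ** 2   # squared distance of mean from cut
    return (alpha * var + dev) / (var + dev)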

 

DIF: differential item functioning

 

DIF refers to procedures used to determine whether test items may have unexpectedly favored one test-taking group over another.

The top plot on the right traces, for students at each total-score level from 9 to 36, the proportion who got item "I1" right. For example, among those with a test score of 9, 31% of the students in the "Ing." group got the item right, compared with 9% in the "Nat." group. Over the range of scores plotted, however, there is no consistent pattern: one group is on top at one score level, the other at the next. This plot (and its associated statistics) is an example where no DIF was detected.

The bottom plot on the right differs: it shows some evidence of DIF as the Nat. group rather consistently outperformed the Ing. group, especially at score levels from 14 to 23.

Lertap 5 uses the Mantel-Haenszel method for assessing DIF. A technical paper on this is available, as are related topics in "Lelp", Lertap help.
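
In outline, the Mantel-Haenszel method stratifies examinees by total test score, forms a 2x2 table (group by right/wrong) at each score level, and pools the tables into a common odds ratio. The bare-bones sketch below is generic, not Lertap's implementation; see the technical paper for the full treatment, including the accompanying chi-square test.

import numpy as np

def mh_odds_ratio(ref_right, ref_wrong, foc_right, foc_wrong):
    # Each argument: an array of counts, one entry per score level, for
    # the reference and focal groups answering the item right or wrong.
    a, b = np.asarray(ref_right, float), np.asarray(ref_wrong, float)
    c, d = np.asarray(foc_right, float), np.asarray(foc_wrong, float)
    n = a + b + c + d                 # examinees at each score level
    ok = n > 0                        # skip empty score levels
    num = np.sum(a[ok] * d[ok] / n[ok])
    den = np.sum(b[ok] * c[ok] / n[ok])
    return num / den                  # near 1.00 suggests no DIF

ETS practice re-expresses the ratio as MH D-DIF = -2.35 * ln(alpha_MH), with negative values flagging items that favor the reference group.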

 

There are many times when test developers and users are not concerned with DIF per se, but want plots which trace item responses in order to discern how groups may have differed in their "endorsements" of each of an item's options. For example, the table below, and its corresponding graph, indicate that many of the 844 students in the "Ing." group were distracted by option B: 24% endorsed B, compared to just 9% in the "Nat." group. This almost certainly accounts for the considerable group disparity seen at option C, the correct answer to I35.

 

[Figure: response breakdown table and plot for item I35]

 

Tables and plots such as the one above are discussed in another Lelp topic.
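
Incidentally, a table like the one above is just a two-way breakdown of item responses by group, so it is easy to approximate outside Lertap. A small pandas illustration with made-up records:

import pandas as pd

# Hypothetical data: one row per student, giving the student's group
# and his or her response to item I35.
df = pd.DataFrame({
    "group": ["Ing.", "Ing.", "Ing.", "Nat.", "Nat.", "Nat."],
    "I35":   ["B",    "C",    "B",    "C",    "C",    "A"],
})
# Rows: groups; columns: percentage of the group endorsing each option.
endorsements = 100 * pd.crosstab(df["group"], df["I35"], normalize="index")
print(endorsements.round(1))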

 

 

 

[Figure: DIF plot for item I1 (no DIF detected)]

 

 

 

[Figure: DIF plot for item I35 (evidence of DIF)]

 

 

Assessing test score differences among groups

 

The table and plot for I35 seen in the box above exemplify Lertap 5's capability for displaying group differences at the item level. A similar capability is available for expressing the extent to which two or more groups differ with regard to a score, such as a total test score.

For example, the boxplot to the right indicates how students at seven age levels differed on a rigorously-developed test of science achievement. It shows that the greatest spread of test scores was found in group A10 (ten-year-olds), while older students had the lowest median test scores. The test appears to have been on the difficult side: no group's median score exceeded 10, against a maximum possible score of 25.

Lertap 5's "Breakout scores by groups" options are in charge of making boxplots, other graphs, and a corresponding detailed analysis of variance.

 

 

[Figure: boxplots of science test scores for seven age groups]

 

Response similarity analysis, RSA

 

Lertap 5 has a cheat-checker referred to as "RSA". It's used to determine if the answers a pair of students gave on a multiple-choice test were overly similar.

When an RSA analysis is run, Lertap gets Excel to create a variety of tables with descriptive and inferential statistics.


[Figure: part of an RSA report, listing the most similar student pairs]

In this example, a real-life case, 384 students sat a 59-item test. RSA indicated that, out of 73,536 possible pairs of students, 3 pairs had response patterns which could be regarded as "suspicious". Seating charts from the exam venue indicated that these pairs were seated in close proximity, and thus would have had the chance to share answers. The first pair of students in the table above had only a single response difference over all items (the D column), and gave the same incorrect answer on 25 items (EEIC). RSA's "Sigma" statistic found this to be a very unlikely result by chance alone -- the two students were interviewed, confessed to cheating, and were suspended for one semester.
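
The two counts featured in this story are straightforward: D is the number of items on which the pair's answers differ, and EEIC is the number of items on which both students gave the same incorrect answer. A toy version (the Sigma statistic, which gauges how improbable a given EEIC count is, is developed in the paper below):

def pair_counts(resp1, resp2, key):
    # resp1, resp2: two students' answer strings, e.g. "ABCDA...".
    # key: the string of correct answers.
    d = eeic = 0
    for r1, r2, k in zip(resp1, resp2, key):
        if r1 != r2:
            d += 1        # D: the responses differ
        elif r1 != k:
            eeic += 1     # EEIC: identical, and incorrect
    return d, eeic

print(pair_counts("ABCDA", "ABCDB", "ABCCA"))   # -> (1, 1)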

 

To read up on RSA, see this paper, this Lelp topic, and the dataset used in this example.

 

Coefficient alpha, eigenvalues, scree tests, omega reliability

 

This plot displays the eigenvalues from a ten-item rating scale (eigenvalues are also known as "latent roots").

The first eigenvalue was almost 3.83, after which there is a sharp drop to the "scree": the second eigenvalue has a magnitude just a bit over 1.00, and from there on the eigenvalues decline in a regular, nearly linear manner. (Some researchers would accept this as an indicator of a "single-factor" scale.)

There is an interesting, useful relationship between the magnitude of a test's or scale's first eigenvalue and coefficient alpha. This is discussed further in a technical paper.
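
One way to see the connection numerically (the technical paper gives the real account): when working from the inter-item correlation matrix, the quantity (p/(p-1)) * (1 - 1/lambda1) is an upper bound on standardized alpha, with equality when all inter-item correlations are equal. A quick check with made-up data:

import numpy as np

rng = np.random.default_rng(0)
# Invented responses: 200 people by 10 items, sharing one common factor.
X = rng.normal(size=(200, 1)) + rng.normal(scale=1.5, size=(200, 10))

R = np.corrcoef(X, rowvar=False)          # inter-item correlation matrix
lam1 = np.linalg.eigvalsh(R)[-1]          # first (largest) eigenvalue
p = R.shape[0]

r_bar = (R.sum() - p) / (p * (p - 1))     # mean inter-item correlation
alpha_std = p * r_bar / (1 + (p - 1) * r_bar)   # standardized alpha
bound = (p / (p - 1)) * (1 - 1 / lam1)    # eigenvalue-based bound

print(f"lambda1 = {lam1:.2f}; alpha = {alpha_std:.3f} <= {bound:.3f}")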

A practical example based on using the first eigenvalue and its related eigenvector in the development of a rating scale is presented in this topic. Turn to Lelp (Lertap help) for more comments on eigenvalues, principal components, and how to get Lertap 5 to produce them.

Lertap 5 will form a complete inter-item correlation matrix, putting SMCs (squared multiple correlations) on the diagonal if wanted. Tetrachoric correlations may also be output. Comments on these options may be reviewed in this topic, and assistance with getting SPSS to pick up Lertap's correlation matrix is covered here.
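
For the record, the SMCs that can go on that diagonal drop straight out of the inverse of the correlation matrix: the squared multiple correlation of item i with the remaining items is 1 - 1/r_ii, where r_ii is the i-th diagonal element of the inverse. In Python:

import numpy as np

def smc_diagonal(R):
    # Squared multiple correlation of each item with all the others,
    # taken from the diagonal of the inverted correlation matrix.
    return 1.0 - 1.0 / np.diag(np.linalg.inv(R))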

An empirical comparison of alpha and coefficient omega, with more about eigenvalues and test reliability, is to be found in this research report.

 

[Figure: scree plot of the eigenvalues from a ten-item rating scale]

 

IRT, item response theory

 

Lertap 5 is based on "CTT", classical test theory. But it has support for IRT users, to be sure: see this 2015 document.

An option in the System worksheet to turn on "experimental features" will get Lertap 5 to add three columns to one of its standard reports, "Stats1b". These are the "bis.", "b(i)", and "a(i)" columns seen in the screen snapshot below:

[Figure: the Stats1b report with the experimental bis., b(i), and a(i) columns]

a(i) and b(i) are estimates of the 2PL IRT parameters, while bis. is the biserial correlation coefficient. The derivation of these statistics is the focus of this paper.
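
The paper is the authority on Lertap's derivation; for orientation only, a classical normal-ogive approximation (often credited to Lord) maps item difficulty and the biserial correlation into 2PL-style parameters. A hedged sketch:

from math import sqrt
from statistics import NormalDist

def ctt_to_2pl(p, r_bis):
    # p: item difficulty (proportion answering correctly);
    # r_bis: item-total biserial correlation.
    a = r_bis / sqrt(1.0 - r_bis ** 2)          # discrimination a(i)
    b = NormalDist().inv_cdf(1.0 - p) / r_bis   # difficulty b(i)
    return a, b

print(ctt_to_2pl(p=0.70, r_bis=0.55))   # an easy, discriminating item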

Lertap 5 has extensive, well-developed support for users of ASC's Xcalibre IRT program, and it will also output data ready for input to Bilog-MG. (The data file made for use with Bilog-MG has a format also useful to quite a number of other programs, such as ConQuest and RUMM.)

A webpage which demonstrates the strong relationship between CTT and IRT statistics is available at this link, and a technical paper, somewhat critical of the Rasch model, may be found by following this hyperlink.

 

Scoring open-ended and constructed-response items

 

Lertap 5 is particularly adept at processing multiple-choice quizzes and tests, and rating scales. However, it may also be used in situations where the number of responses to an item or question is not fixed in advance. We have two examples to recommend.

One example comes from a graduate student at the University of Minnesota who developed and administered a test having a mixture of multiple-choice and short-answer questions. Her work is described in this document.

Another example derives from a major project to develop a new high-school student aptitude inventory for use on a national scale. A mixture of multiple-choice and constructed-response items was used. An interesting aspect of this project had to do with the development of two equivalent forms of the instrument, making it possible to examine parallel-forms reliability. This project is discussed in another technical paper; the paper, along with authentic data from the project, is available as one of our sample datasets.

 

Check out the Features topic for links to other technical references covering Lertap 5's capabilities.