The main item statistic generated in classical item analysis is a P value, not to be confused with the p-value generated in inferential statistical analysis. In this context, P refers to difficulty, and it is abbreviated as P because it is the proportion or percentage of examinees who get the item correct. If almost no one gets the item correct, it is a difficult item. If almost everyone gets the item correct, it is an easy item.
The problem here is that the P value is entirely sample-dependent. If an exceptionally capable sample takes your test, your items will all look easy, even if relatively speaking they are not. Item response theory (IRT) and Rasch, on the other hand, provide item difficulty estimates that are not sample-dependent. If you look at the math behind IRT and Rasch, you can see exactly where sample is being controlled for and therefore partialled out. (Sadly, you'll have to take all of that at face value, because going into how IRT and Rasch are not sample-dependent goes beyond the scope of this post/series. But maybe I need to do an A to Z of Rasch next year!)
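To make the sample dependence concrete, here is a minimal simulation sketch. It assumes a simple logistic (Rasch-like) response model purely for illustration; the helper simulate_p, the three item difficulties, and the two ability distributions are all made up and are not part of the analysis in this post.

```r
set.seed(42)

# Hypothetical helper: simulate n examinees with normally distributed ability
# and return the classical P value (proportion correct) for each item.
# Assumes P(correct) = plogis(ability - difficulty), a Rasch-like model.
simulate_p <- function(mean_ability, difficulty, n = 5000) {
  theta <- rnorm(n, mean = mean_ability)
  sapply(difficulty, function(b) mean(rbinom(n, 1, plogis(theta - b))))
}

# The same three made-up items are given to both samples
difficulty <- c(easy = -1, medium = 0, hard = 1)

p_average <- simulate_p(mean_ability = 0, difficulty = difficulty)
p_capable <- simulate_p(mean_ability = 1.5, difficulty = difficulty)

# Same items, different samples: compare the two rows
round(rbind(p_average, p_capable), 2)
```

The items never change between the two runs, yet every P value shifts upward in the more capable sample, which is the sense in which the statistic describes the item and the sample together.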
Basically, when doing classical item analysis, where the capability of your sample can completely change your item statistics, it becomes even more important to validate content and have experts on hand to help determine what items are appropriate for different ability groups. It also highlights the importance of sampling when piloting the test or measure.
In my previous job, I began working on a cognitive ability test developed with classical test theory. This was a fixed form test, or rather, a set of fixed form tests that were written for specific ages. So an 8-year-old would take the form created for 8-year-olds, which would contain items appropriate for that age level as well as the age levels on either side (7 and 9). Later ages were often combined - for instance, 13/14, 15/16, and 17/18. This was not an adaptive test, but it was one way to see if children and adolescents are performing at their age level with a paper-and-pencil, multiple-choice test that could be given to a group of pretty much any size. As I've said before, adaptive testing is sometimes the best way, but not always. There are other considerations, such as ease of administration, cost, and what you plan to do with the information.
The other psychometrician on the project was using a piece of software called ITEMAN. But because quality control was one of our top priorities, I was asked to perform my own psychometric analysis with a different piece of software that would give us the same analysis. Fortunately, I found the ITEMAN R package. This allowed us to validate each other's results, and showed that the ITEMAN package produced nearly identical results to the ITEMAN software.
So my plan was to do a tutorial today on ITEMAN, a rather simple package that comes with two sample datasets and two functions, ITEMAN1 and ITEMAN2. ITEMAN1 is for use with multiple choice tests that have single correct answers; the sample dataset, dichotomous, is for use with ITEMAN1. ITEMAN2 is for polytomous items - items that use rating scales, such as attitude measures; the sample dataset, timss2011_usa, works with this function. As I was beginning to write this post, I discovered ITEMAN was no longer available, and I couldn't even find old package files on GitHub. Hence the subtitle for today, I Must Be Flexible. After some searching, I found another package, CTT, that gives similar results to ITEMAN.
So let's demonstrate scoring and classical item analysis with CTT. I pulled an exam key for an old exam I gave in a course I taught about 9 years ago. This was a 30 item test that had 25 multiple choice items, plus 5 short answer and essay items. I generated some random data as responses to the 25 multiple choice items so we can apply the answer key and run item statistics. The dataset contains responses (A, B, C, or D) for all 25 items by 20 students. I'll read in that data, then create an object containing the exam key.
exam_data<-read.delim("madeup_testdata.txt",header=TRUE)
items_only<-exam_data[,2:26]
exam_key<-c("B","C","C","D","A","C","C","B","C","C","A","A","C","C","A","B",
            "B","B","D","A","C","A","A","C","C")
Now I need to have CTT score my test using the key I provided. I requested output.scored, so it gives me a matrix of scored results I can then reference with the name of the score object (exam_score) + $scored. The resulting scored data is then used for item analysis. Be sure to load the CTT package (install first if you haven't yet).
install.packages("CTT")
library(CTT)
exam_score<-score(items_only, key=exam_key, output.scored=TRUE, rel=TRUE)
## Warning in cor(items[, i], Xd): the standard deviation is zero
## Warning in cor(items[, i], Xd): the standard deviation is zero
## Warning in cor(items[, i], Xd): the standard deviation is zero
## Warning in cor(items[, i], Xd): the standard deviation is zero
## Warning in cor(items[, i], Xd): the standard deviation is zero
## Warning in cor(items[, i], Xd): the standard deviation is zero
## Warning in cor(items[, i], Xd): the standard deviation is zero
## Warning in cor(items[, i], Xd): the standard deviation is zero
## Warning in cor(items[, i], Xd): the standard deviation is zero
report<-itemAnalysis(exam_score$scored, itemReport=TRUE)
## Warning in cor(items[, i], Xd): the standard deviation is zero
## Warning in cor(items[, i], Xd): the standard deviation is zero
## Warning in cor(items[, i], Xd): the standard deviation is zero
## Warning in cor(items[, i], Xd): the standard deviation is zero
## Warning in cor(items[, i], Xd): the standard deviation is zero
## Warning in cor(items[, i], Xd): the standard deviation is zero
## Warning in cor(items[, i], Xd): the standard deviation is zero
## Warning in cor(items[, i], Xd): the standard deviation is zero
## Warning in cor(items[, i], Xd): the standard deviation is zero
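You'll notice I'm getting some warnings (not errors), likely because of some very easy items that everyone got correct: an item with no variance has a standard deviation of zero, so its correlation with the total score is undefined and shows up as NA in the report. We'll just ignore those for now and move on to examining our results. As part of the itemAnalysis function, CTT creates a data frame called itemReport, which we can access like this: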
report[["itemReport"]]
##    itemName  itemMean        pBis         bis alphaIfDeleted
## 1        X1 0.6666667  0.39053161  0.50632146      0.7420752
## 2        X2 0.8571429  0.49165409  0.76245000      0.7361525
## 3        X3 0.4761905  0.02570432  0.03223647      0.7747674
## 4        X4 1.0000000          NA          NA      0.7583576
## 5        X5 0.9047619  0.56092808  0.97239993      0.7346370
## 6        X6 1.0000000          NA          NA      0.7583576
## 7        X7 0.6666667  0.28566255  0.37035946      0.7512439
## 8        X8 1.0000000          NA          NA      0.7583576
## 9        X9 0.4285714  0.63662447  0.80260597      0.7184512
## 10      X10 0.9523810 -0.12122355 -0.26026053      0.7661433
## 11      X11 1.0000000          NA          NA      0.7583576
## 12      X12 0.6666667  0.28566255  0.37035946      0.7512439
## 13      X13 0.9047619  0.56092808  0.97239993      0.7346370
## 14      X14 1.0000000          NA          NA      0.7583576
## 15      X15 0.9523810  0.44556792  0.95661070      0.7445828
## 16      X16 0.4761905  0.73184243  0.91782302      0.7084914
## 17      X17 0.6190476 -0.10360238 -0.13203544      0.7841816
## 18      X18 1.0000000          NA          NA      0.7583576
## 19      X19 1.0000000          NA          NA      0.7583576
## 20      X20 0.4761905  0.73184243  0.91782302      0.7084914
## 21      X21 0.5714286  0.30228059  0.38109155      0.7503664
## 22      X22 0.8571429  0.39819222  0.61751070      0.7423402
## 23      X23 1.0000000          NA          NA      0.7583576
## 24      X24 1.0000000          NA          NA      0.7583576
## 25      X25 0.7142857  0.52730093  0.70081320      0.7301665
itemMean gives us our P value; pBis is the point-biserial correlation (the correlation between score on that item and total score on the other items); bis is the biserial correlation (almost the same as the point-biserial - correlation between item score and total score without that item - but it treats the item differently: as ordinal with an underlying continuous variable, instead of as a discrete, 0 or 1, value); and alphaIfDeleted is coefficient alpha if that item were deleted.
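The relationship between the two correlations can be sketched with the usual textbook conversion, which rescales the point-biserial by the normal density at the cut point implied by the P value. The function name biserial_from_pbis is made up for this example, and I'm not claiming this is how CTT computes bis internally; it is only a check on two rows copied from the report above.

```r
# Hypothetical helper: textbook conversion from point-biserial to biserial.
# p is the proportion correct (itemMean), pbis the point-biserial correlation.
biserial_from_pbis <- function(pbis, p) {
  q <- 1 - p
  y <- dnorm(qnorm(p))  # normal density at the point that splits p from q
  pbis * sqrt(p * q) / y
}

# itemMean and pBis values copied from the X1 and X9 rows of itemReport
bis_check <- biserial_from_pbis(
  pbis = c(X1 = 0.39053161, X9 = 0.63662447),
  p    = c(0.6666667, 0.4285714)
)
bis_check
```

For these two rows the result should land close to the bis column. The conversion also shows why the biserial is always larger in absolute value than the point-biserial, and why the gap grows as the P value moves toward 0 or 1.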
This report tells us that there are many easy items - P values of 0.9 and greater - but there are also some moderately difficult to difficult items, which is not a bad thing. What is problematic is that some of the items don't seem to relate to overall performance on the test - which can be seen in the low and sometimes negative point-biserial correlations. This gives me some guidance on what items I might want to potentially drop. But deleting any items isn't going to have too much of an impact on reliability of the measure, which sits in the 0.7 range.
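For anyone curious what an alpha-if-deleted column involves, here is a minimal base R sketch using the standard coefficient alpha formula. The function cronbach_alpha and the tiny scored matrix are made up for illustration (this is not the exam data), and CTT's own implementation may differ in its details.

```r
# Hypothetical helper: coefficient alpha from a 0/1 scored matrix
# (rows = examinees, columns = items), using the standard formula
# k / (k - 1) * (1 - sum of item variances / variance of total score).
cronbach_alpha <- function(scored) {
  k <- ncol(scored)
  item_var <- apply(scored, 2, var)
  total_var <- var(rowSums(scored))
  (k / (k - 1)) * (1 - sum(item_var) / total_var)
}

# Made-up scored data: X4 runs against the other three items
scored <- cbind(
  X1 = c(1, 1, 0, 1, 0, 1),
  X2 = c(1, 1, 0, 1, 0, 0),
  X3 = c(1, 0, 0, 1, 0, 1),
  X4 = c(0, 1, 1, 0, 1, 1)
)

alpha_all <- cronbach_alpha(scored)

# Recompute alpha with each item left out in turn
alpha_if_deleted <- sapply(seq_len(ncol(scored)), function(j) {
  cronbach_alpha(scored[, -j, drop = FALSE])
})
names(alpha_if_deleted) <- colnames(scored)

alpha_all
alpha_if_deleted
```

An item whose removal raises alpha noticeably, like the deliberately contrary X4 in this toy data, is a candidate for review; when the values barely move, as in the exam report above, dropping items buys little in reliability.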
CTT has more to offer that I won't go into now, but you can read more about it in the manual available here. Could you do this kind of analysis without a package like CTT? Absolutely. One of the benefits of classical test theory and item analysis approaches is that many of these analyses can be done by hand or with simple software, while IRT and Rasch approaches are nearly impossible without computer assistance. If your data is set up as 0=incorrect and 1=correct, you could calculate P values by simply taking the mean. Point-biserial correlations would take a bit more work but are still completely doable on your own. But the nice thing about this package is that it automates the process, so you don't need to write your own functions, and produces a report of all item statistics.
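As a rough idea of what doing it yourself looks like, here is a base R sketch that scores responses against a key, takes column means for the P values, and correlates each item with the total of the remaining items. The responses, the key, and the object names are all made up for this example; it is a sketch of the general approach, not a replacement for the CTT report.

```r
# Made-up responses from 5 students to 3 multiple-choice items
responses <- data.frame(
  X1 = c("B", "B", "A", "B", "C"),
  X2 = c("C", "C", "C", "D", "A"),
  X3 = c("A", "B", "A", "A", "D"),
  stringsAsFactors = FALSE
)
key <- c("B", "C", "A")

# Score: compare each column to its keyed answer, then convert TRUE/FALSE to 1/0
scored <- sweep(as.matrix(responses), 2, key, FUN = "==") * 1

# P value: proportion of examinees answering each item correctly
p_values <- colMeans(scored)

# Corrected point-biserial: correlate each item with the total score
# on the other items, so the item does not correlate with itself
pbis <- sapply(seq_len(ncol(scored)), function(j) {
  cor(scored[, j], rowSums(scored[, -j, drop = FALSE]))
})
names(pbis) <- colnames(scored)

data.frame(itemMean = p_values, pBis = pbis)
```

If everyone answers an item correctly, cor() returns NA with the same "standard deviation is zero" warning seen earlier, so a real version would want to flag constant items first.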
Hi, there is a read-only mirror of the ITEMAN package https://github.com/cran/ITEMAN and this same page has the homepage URL as http://sites.education.miami.edu/zopluoglu/.