Friday, May 24, 2019

I'm More Sad About This Show Ending than Game of Thrones

Like many, I eagerly waited to see how the game of thrones would end. I tore through the books available at the time shortly before the first season of Game of Thrones aired, and looked forward to reading how George R.R. Martin himself would write the ending of the story.

And like many, I was disappointed in the turns taken by Game of Thrones that felt inauthentic to the characters. In particular, this was a show that failed many of its female characters. They took Brienne, who we watched grow into a strong, independent, and honorable knight, and reduced her to Carrie F***ing Bradshaw. They justified the horrible things that had happened to Sansa as character-building. (No one can make you be someone you're not. Sansa, the strength was inside you all the time. Littlefinger and Ramsay don't get credit for that. If anyone does, it's the strong women in your life, like Brienne and Arya.)

But while I'm disappointed in how the show ended, and a little sad that it's gone, I'm honestly more sad that this show is over:


Who would have guessed that a musical comedy TV show would take on some very important issues with such authenticity? Here are just a few of them (some spoilers ahead, so read on only if you've watched the show or don't care about being spoiled):

Women's Issues
Just as a short list, this show tackled periods, abortion, women's sexuality, motherhood, and body image in a way that never felt cheap, judgmental, or cliché. It was the first network show to use the word "clitoris." The relationships between the women on the show felt real and the conversations were about more than simply the men in their lives. It didn't glamorize women's bodies - in fact, it pulled back the curtain on many issues related to women's appearance and projection of themselves to the world.



Men's Issues
The show didn't just represent women authentically - the men were fully realized characters too, and never props or plot devices. Crazy Ex-Girlfriend explored men's relationships, fatherhood, and toxic masculinity and how it affects men.


Mental Health
I could probably write an entire blog post just on how this show represents mental health issues. The main character, Rebecca Bunch, is diagnosed with borderline personality disorder in season 3. And in fact, the show was building up to and establishing that diagnosis from the very beginning. The show constantly made us rethink the word "crazy" and helped to normalize many mental health issues - and when I say normalize, I mean show us that these issues are common and experienced by many people, while still encouraging those struggling with mental health issues to seek help.


The show also tackled issues like low self-esteem, self-hatred, suicide, and alcoholism, without ever glamorizing them. Instead, it encouraged us to take better care of ourselves, and recognize when we have a problem we can't handle ourselves.



Bisexuality
When bisexuals show up in other movies or TV shows, they're often portrayed as promiscuous - people who are bi because they want to have sex with everyone. Either that, or bisexuality is portrayed, especially among men, as a cover for someone who is actually gay but not comfortable with coming fully out of the closet. Not Crazy Ex-Girlfriend.


Race and Ethnicity
This show has a diverse cast. And unlike many shows with "diversity," none of the characters are tokens. In fact, race and ethnicity aren't referenced so much as heritage. Further, the show pokes fun at the token concept. One great episode deals with Heather's ethnicity. Her boss, Kevin, encourages her to join a management training program because she is "diverse." Later, he gives her a gift to apologize for his insensitivity: a sari, because he assumes she is Indian. She corrects him; her father is African-American and her mother is White. The extra layer here is that the actress who plays Heather, Vella Lovell, has been mistakenly called Indian in the media, when she, like her character, is African-American and White. So this episode not only makes fun of the concept of the token, it also makes fun of the media trying so hard to ascertain and define an actor by her race.

Crazy Ex-Girlfriend, I'm really going to miss you.

Wednesday, May 22, 2019

New Color Palette for R

As I was preparing some graphics for a presentation recently, I started digging into some of the different color palette options. My motivation was entirely about creating graphics that weren't too visually overwhelming - something I found the default "rainbow" palette to be.

But as the creators of the viridis R package point out, we also need to think about how people with colorblindness might struggle with understanding graphics. If you create figures in R, I highly recommend checking it out at the link above!

Monday, April 15, 2019

J is for Journal of Applied Measurement (and Other Resources)

As with many fields and subfields, Rasch has its own journal - the Journal of Applied Measurement, which publishes a variety of articles either using or describing how to use Rasch measurement. You can read the table of contents for JAM going back to its inaugural issue here.

But JAM isn't the only resource available for Rasch users. First off, JAM Press publishes multiple books on different aspects of Rasch measurement.

But the most useful resource by far is Rasch Measurement Transactions, which goes back to 1987 and is freely available. These are shorter articles dealing with hands-on topics. If I don't know how to do something regarding Rasch, this is a great place to check. And you can always find articles on a given topic via Google by restricting the search to the rasch.org site.

Finally, there is a Rasch message board, which is still active, where you can post questions (and answer them if you feel so inclined!).

As you can see, I'm a bit behind on A to Z posts. I'll be playing catch up this week!


Wednesday, April 10, 2019

I is for Item Fit

Rasch gives you lots of item-level data. In addition to difficulties, a Rasch analysis will also produce fit indices, for both items and persons. Just like the log-likelihood chi-square statistic that tells you how well your data fit the Rasch model overall, you also receive item fit indices, which compare observed responses to those expected under the Rasch model. These indices are also based on chi-square statistics.

There are two types of fit indices: INFIT and OUTFIT.

OUTFIT is sensitive to outliers: responses that fall outside of the targeted ability level, such as a high-ability respondent missing an item targeted to their ability level, or a low-ability respondent getting a difficult item correct. This could reflect a problem with the item - perhaps it's poorly worded and is throwing off people who actually know the information. Or perhaps there's a cue that is leading people to the correct answer who wouldn't otherwise get it right. These statistics can cue you in to problems with the item.

INFIT (Information weighted) is sensitive to responses that are too predictable. These items don't tell you anything you don't already know from other items. Every item should contribute to the estimate. More items is not necessarily better - this is one way Rasch differs from Classical Test Theory, where adding more items increases reliability. The more items you give a candidate, the greater your risk of fatigue, which will lead reliability (and validity) to go down. Every item should contribute meaningful, and unique, data. These statistics cue you in on items that might not be necessary.

The expected value for both of these statistics is 1.0. Any items that deviate from that value might be problematic. Linacre recommends a cut-off of 2.0, where any items that have an INFIT or OUTFIT of 2.0 or greater should be dropped from the measure. Test developers will sometimes adopt their own cut-off values, such as 1.5 or 1.7. If you have a large bank, you can probably afford to be more conservative and drop items above 1.5. If you're developing a brand new test or measure, you might want to be more lenient and use the 2.0 cut-off. Whatever you do, just be consistent and cite the literature whenever you can to support your selected cut-off.
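For the curious, the mean-square versions of these statistics are straightforward to compute for dichotomous data. Here's an illustrative sketch (the function names and the tiny dataset are mine, not from any particular package): OUTFIT is the plain average of squared standardized residuals, while INFIT weights each residual by the model variance, which is what makes it less sensitive to off-target outliers.

```python
import numpy as np

def rasch_p(theta, b):
    """Rasch probability of a correct response (dichotomous model)."""
    return 1.0 / (1.0 + np.exp(-(theta - b)))

def item_fit(responses, theta, b):
    """Mean-square OUTFIT and INFIT for each item.

    responses: persons x items matrix of 0/1 scores
    theta: person abilities, b: item difficulties (both in logits)
    """
    p = rasch_p(theta[:, None], b[None, :])   # expected scores
    w = p * (1 - p)                           # model variance per response
    z2 = (responses - p) ** 2 / w             # squared standardized residuals
    outfit = z2.mean(axis=0)                  # unweighted: sensitive to outliers
    infit = ((responses - p) ** 2).sum(axis=0) / w.sum(axis=0)  # variance-weighted
    return outfit, infit

# Toy data: four persons, two items
theta = np.array([-1.0, 0.0, 1.0, 2.0])
b = np.array([-0.5, 0.5])
x = np.array([[0, 0], [1, 0], [1, 1], [1, 1]])
outfit, infit = item_fit(x, theta, b)
```

Values near the expected 1.0 indicate good fit; you'd then apply whatever cut-off (1.5, 1.7, 2.0) your testing program has adopted.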

Though this post is about item fit, these same statistics also exist for each person in your dataset. A misfitting person means the measure is not functioning the same for them as it does for others. This could mean the candidate got lazy and just responded at random. Or it could mean the measure isn't valid for them for some reason. (Or it could just be chance.) Many Rasch purists see no issue with dropping people who don't fit the model, but as I've discovered when writing up the results of Rasch analysis for publication, reviewers don't take kindly to dropping people unless you have other evidence to support it. (And since Rasch is still not a well-known approach, they mean evidence outside of Rasch analysis, like failing a manipulation check.)

The best approach I've seen is once again recommended by Linacre: persons with very high OUTFIT statistics are removed and ability estimates from the smaller sample are cross-plotted against the estimates from the full sample. If removal of these persons has little effect on the final estimates, these persons can be retained, because they don't appear to have any impact on the results. That is, they're not driving the results. 

If there is a difference, Linacre recommends next examining persons with smaller (but still greater than 2.0) OUTFIT statistics and cross-plotting again. Though there is little guidance on how to define very high and high, in my research, I frequently use an OUTFIT of 3.0 for ‘very high’ and 2.0 for ‘high.’ In my experience, the results of such sensitivity analyses never show any problem, and I'm able to justify keeping everyone in the sample. This seems to make both reviewers and Rasch purists happy.


Tuesday, April 9, 2019

H is for How to Set Up Your Data File

The exact way you set up your data of course depends on the software you use. But my focus today is to give you things to think about when setting up your data for Rasch analysis.

First, know how your software needs you to format missing values. Many programs will let you simply leave a blank space or cell. Winsteps is fine with a blank space to notate a missing value or skipped question. Facets, on the other hand, will flip out at a blank space and needs a missing-value code set up (I usually use 9).
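For example, if your responses live in a spreadsheet with blanks for skipped items, a quick recode before a Facets run might look like this (pandas assumed; the column names are hypothetical, and 9 is just the missing-value code I tend to use):

```python
import pandas as pd

# Hypothetical item responses, with skipped questions left blank (NaN)
df = pd.DataFrame({
    "item1": [1, 0, None, 1],
    "item2": [None, 1, 1, 0],
})

# Winsteps tolerates the blanks as-is; for Facets, recode missing to 9
facets_df = df.fillna(9).astype(int)
```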

Second, ordering of the file is very important, especially if you're working with data from a computer adaptive test, which means the handling of missing values is also important. When someone takes a computer adaptive test, their first item is drawn at random from a set of moderately difficult items. The difficulty of the next item depends on how they did on the first item, but even so, the item is randomly drawn from a set or range of items. So when you set up your data file, you need to be certain that all people who responded to a specific item have that response in the same column (not necessarily the column matching where the item fell numerically in the exam).

This is why you need to be meticulously organized with your item bank and give each item an identifier. When you assemble responses for computer adaptive tests, you'll need to reorder people's responses. That is, you'll set up an order for every item in the bank by identifier. When data are compiled, each person's responses are put in that order, and if a particular item in the bank wasn't administered, there will be a space or missing value in that column.
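That reordering step can be sketched in a few lines, assuming a long-format log of CAT responses (the person IDs, item identifiers, and scores below are all hypothetical). Pivoting to wide format puts every response in the column for its item identifier, with missing values wherever an item wasn't administered:

```python
import pandas as pd

# Long-format CAT log: one row per administered item
log = pd.DataFrame({
    "person": ["P1", "P1", "P2", "P2", "P3"],
    "item_id": ["MATH001", "MATH007", "MATH001", "MATH003", "MATH007"],
    "score": [1, 0, 0, 1, 1],
})

# One column per item; unadministered items become NaN (missing)
wide = log.pivot(index="person", columns="item_id", values="score")
# Enforce the bank's identifier order, including never-seen items
wide = wide.reindex(columns=["MATH001", "MATH003", "MATH007"])
```

Every response to MATH001 now sits in the same column regardless of when it was administered in each person's exam.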

Third, be sure you differentiate between item variables and other variables, like person identifiers, demographics, and so on. Once again, know your software. You may find that a piece of software just runs an entire dataset as though all variables are items, meaning you'll get weird results if you have a demographic variable mixed in. Others might let you select certain variables for the analysis and/or categorize variables as items and non-items.

I tend to keep a version of my item set in Excel, with a single variable at the beginning with participant ID number. Excel is really easy to import into most software, and I can simply delete the first column if a particular program doesn't allow non-item variables. If I drop any items (which I'll talk more about tomorrow), I do it from this dataset. A larger dataset, with all items, demographic variables, and so on is kept usually in SPSS, since that's the preferred software at my company (I'm primarily an R user, but I'm the only one and R can read SPSS files directly) in case I ever need to pull in any additional variables for group comparisons. This dataset is essentially the master and any smaller files I need are built from it.


Monday, April 8, 2019

G is for Global Fit Statistics

One of the challenges of Blogging A to Z is that the posts have to go in alphabetical order, even if it would make sense to start with a topic from the middle of the alphabet. It's more like creating the pieces of a puzzle that could (and maybe should) be put together in a different order than they were created. But like I said, it's a challenge!

So now that we're on letter G, it's time to talk about a topic that I probably would have started with otherwise: what exactly is the Rasch measurement model? Yes, it is a model that allows you to order people according to their abilities (see letter A) and items according to their difficulties. But more than that, it's a prescriptive model of measurement - it contains within it the mathematical properties of a good measure (or at least, one definition of a good measure). This is how Rasch differs from Item Response Theory (IRT) models, though Rasch is often grouped into IRT despite its differences. You see, mathematically, Rasch is not very different from the IRT 1-parameter model, which focuses on item difficulty (and by extension, person ability). But philosophically, it is very different, because while Rasch is prescriptive, IRT is descriptive. If an IRT 1-parameter model doesn't adequately describe the data, you could just select a different IRT model. But Rasch says that the data must fit its model, and it gives you statistics to tell you how well it does. If your data don't fit the model, the deficiency is with the data (and your measure), not the model itself.

Note: This is outside of the scope of this blog series, but in IRT, the second parameter is item discrimination (how well the item differentiates between high and low ability candidates) and the third is the pseudo-guessing parameter (the likelihood you'd get an answer correct based on chance alone). The Rasch model assumes that the item discrimination for all items is 1.0 and does not make any corrections for potential guessing. You know how the SAT penalizes you for wrong answers? It's to discourage guessing. They don't want you answering a question if you don't know the answer; a lucky guess is not a valid measure. What can I say, man? We psychometricians are a**holes.
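To make the relationship between the models concrete, here's a sketch of the 3-parameter response function; setting discrimination to 1 and the guessing parameter to 0 recovers the Rasch/1PL model (the specific numbers below are just illustrations):

```python
import math

def p_3pl(theta, b, a=1.0, c=0.0):
    """Probability of a correct response under the 3PL model.

    theta: person ability, b: item difficulty (logits),
    a: item discrimination, c: pseudo-guessing parameter.
    With a=1 and c=0 this reduces to the Rasch model.
    """
    return c + (1 - c) / (1 + math.exp(-a * (theta - b)))

# Rasch: when ability equals difficulty, the chance of success is 50%
p_rasch = p_3pl(theta=0.0, b=0.0)

# 3PL: even a very low-ability examinee has a floor near the
# guessing parameter (here, a 4-option multiple choice item)
p_guess = p_3pl(theta=-3.0, b=0.0, c=0.25)
```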

When you use Rasch, you're saying that you have a good measure when it gives you data that fits the Rasch model. Poor fit to the Rasch model means you need to rework your measure - perhaps dropping items, collapsing response scales, or noting inconsistencies in person scores that mean their data might not be valid (and could be dropped from the analysis).

For Blogging A to Z 2017, I went through the alphabet of statistics, and for the letter G, I talked about goodness of fit. In Rasch, we look at our global fit statistics to see how well our data fit the prescribed Rasch model. If our data don't fit, we start looking at why and retooling our measure so it does.

The primary global fit statistic we should look at is the log-likelihood chi-square statistic, which, as the name implies, is based on the chi-square distribution. A significant chi-square statistic in this case means the data differ significantly from the model. Just like in structural equation modeling, it is a measure of absolute fit.

There are other fit statistics you can look at, such as the Akaike Information Criterion (AIC) and Schwarz Bayesian Information Criterion (BIC). These statistics are used for model comparison (relative fit), where you might test out different Rasch approaches to see what best describes the data (such as a Rating Scale Model versus a Partial Credit Model) or see if changes to the measure (like dropping items) result in better fit. These values are derived from the log-likelihood statistic and either degrees of freedom for the AIC or number of non-extreme cases (in Rasch, extreme cases would be those that got every item right or every item wrong) for the BIC. (You can find details and the formulas for AIC and BIC here.)

BIC seems to be the preferred metric here, since it accounts for extreme cases; a measure with lots of extreme cases is not as informative as a measure with few extreme cases, so this metric can help you determine if dropping too easy or too difficult items improves your measure (it probably will, but this lets you quantify that).
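As a quick sketch of those formulas (the log-likelihoods, parameter counts, and sample size below are made-up numbers, just to show how the comparison works):

```python
import math

def aic(log_likelihood, n_params):
    """Akaike Information Criterion: relative fit, lower is better."""
    return -2 * log_likelihood + 2 * n_params

def bic(log_likelihood, n_params, n_nonextreme):
    """Bayesian Information Criterion, penalized by the count of
    non-extreme cases (persons with neither a zero nor perfect score)."""
    return -2 * log_likelihood + n_params * math.log(n_nonextreme)

# Hypothetical comparison: Rating Scale Model vs Partial Credit Model.
# The PCM fits a bit better but spends many more parameters doing it.
rsm = bic(log_likelihood=-1250.0, n_params=12, n_nonextreme=480)
pcm = bic(log_likelihood=-1230.0, n_params=30, n_nonextreme=480)
# The model with the smaller BIC is preferred
```

Here the RSM would win: its small loss in log-likelihood doesn't justify the PCM's extra parameters.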

Tomorrow, I'll talk about setting up your data for Rasch analysis.

Sunday, April 7, 2019

F is for Facets

So far this month, I've talked about the different things that affect the outcome of measurement - that is, that determine how someone will respond. Those things so far are item difficulty and person ability. How much ability a person has, or how much of the trait they possess, will affect how they respond to items of varying difficulty. Each of the things that interact to affect the outcome of Rasch measurement is called a "facet."

But these don't have to be the only facets in a Rasch analysis. You can have additional facets, making for a more complex model. In our content validation studies, we administer a job analysis survey asking people to rate different job-related tasks. As is becoming standard in this industry, we use two different scales for each item, one rating how frequently this task is performed (so we can place more weight on the more frequently performed items) and one rating how critical it is to perform this task competently to protect the public (so we can place more weight on the highly critical items). In this model, we can differentiate between the two scales and see how the scale used changes how people respond. This means that rating scale also becomes a facet, one with two levels: frequency scale and criticality scale.

When conducting a more complex model like this, we need software that can handle these complexities. The people who brought us Winsteps, the software I use primarily for Rasch analysis, also have a program called Facets, which can handle these more complex models.

In a previous blog post, I talked about a facets model I was working with, one with four facets: person ability, item difficulty, rating scale, and timing (whether they received frequency first or second). But one could use a facets model for other types of data, like judge rating data. The great thing about using facets to examine judge data is that one can also partial out concepts like judge leniency; that is, some judges go "easier" on people than others, and a facets model lets you model that leniency. You would just need to have your judges rate more than one person in the set and have some overlap with other judges, similar to the overlap I introduced in the equating post.
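In logit form, a judge facet simply subtracts one more term from the log-odds. A hedged sketch for the dichotomous case (the specific ability, difficulty, and severity values are invented):

```python
import math

def p_success(ability, difficulty, severity):
    """Many-facet Rasch sketch: the log-odds of success are person
    ability minus item difficulty minus judge severity (all in logits).
    A lenient judge has negative severity, a harsh judge positive."""
    logit = ability - difficulty - severity
    return 1 / (1 + math.exp(-logit))

# The same person and task, scored by a lenient vs a severe judge
lenient = p_success(ability=1.0, difficulty=0.5, severity=-0.5)
severe = p_success(ability=1.0, difficulty=0.5, severity=0.8)
```

Because severity is estimated as its own facet, the person's ability estimate is no longer confounded with which judge happened to rate them.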

This is the thing I love about Rasch measurement: it is a unique approach to measurement that can expand in complexity to whatever measurement situation you're presented with. It's all based on the Rasch measurement model, a mathematical model that represents the characteristics a "good measure" should possess - that's what we'll talk about tomorrow when we examine global fit statistics!


Friday, April 5, 2019

E is for Equating

In the course of any exam, new items and even new forms have to be written. The knowledge base changes and new topics become essential. How do we make sure these new items contribute to the overall test score? Through a process called equating.

Typically in Rasch, we equate test items by sprinkling new items in with old items. When we run our Rasch analysis on items and people, we "anchor" the old items to their already established item difficulties. When item difficulties are calculated for the new items, they are now on the exact same metric as the previous items, and new difficulties are established relative to the ones that have already been set through pretesting.

It's not even necessary for everyone to receive all pretest items - or even all of the old items. You just need enough overlap to create links between old and new items. In fact, when you run data from a computer adaptive test, there are a lot of "holes" in the data, creating a sparse matrix.


In the example above, few examinees received the exact same combination of items, but with an entire dataset that looks like this (and more examinees, of course), we could estimate item difficulties for new items in the set.
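One simplified way to see how common items place new items on the old metric is mean-mean linking: shift the new calibration so the common items' average difficulty matches the bank's. (This is just a stand-in for illustration - Winsteps' anchoring actually fixes the anchored difficulties during estimation. The item names and difficulties below are made up.)

```python
import numpy as np

# Anchored (bank) difficulties for the common items, in logits
bank = {"ITEM_A": -1.2, "ITEM_B": 0.3, "ITEM_C": 1.0}

# Freely estimated difficulties from a new calibration run,
# which landed on its own arbitrary scale
new_run = {"ITEM_A": -0.9, "ITEM_B": 0.6, "ITEM_C": 1.3,
           "ITEM_NEW1": -0.2, "ITEM_NEW2": 0.8}

common = [k for k in bank if k in new_run]
# Shift that maps the new run's scale onto the bank's scale
shift = np.mean([bank[k] for k in common]) - np.mean([new_run[k] for k in common])

# New items, now expressed on the same metric as the established bank
equated = {k: v + shift for k, v in new_run.items() if k not in bank}
```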

But you may also ask about the "old" items - do we always stick with the same difficulties or do we change those from time to time? Whenever you anchor items to existing item difficulties, the program will still estimate fresh item difficulties and let you examine something called "displacement": how much the newly estimated difficulty differs from the anchored one. You want to look at these and make sure you're not experiencing what's called "item drift," which happens when an item becomes easier or harder over time.

This definitely happens. Knowledge that might have previously been considered difficult can become easier over time. Math is a great example. Many of the math classes my mom took in high school and college (and that may have been electives) were offered to me in middle school or junior high (and were mandatory). Advances in teaching, as well as better understanding of how these concepts work, can make certain concepts easier.

On the flipside, some items could get more difficult over time. My mom was required to take Latin in her college studies. I know a little "church Latin" but little else. Items about Latin would have been easier for my mom's generation, when the language was still taught, than mine. And if we could hop in a time machine and go back to when scientific writing was in Latin (and therefore, to be a scientist, you had to speak/write Latin fluently), these items would be even easier for people.

Essentially, even though Rasch offers an objective method of measurement and a way to quantify item difficulty, difficulty is still a relative concept and can change over time. Equating is just one part of the type of testing you must regularly do with your items to maintain a test or measure.

Thursday, April 4, 2019

D is for Dimensionality

We're now 4 posts into Blogging A to Z, and I haven't really talked much about the assumptions of Rasch. That is, like any statistical test, there are certain assumptions you must meet in order for results to be valid. The same is true of Rasch. One of the key assumptions of Rasch is that the items all measure the same thing - the same latent variable. Put another way, your measure should be unidimensional: only assessing one dimension.

This is because, in order for the resulting item difficulties and person abilities to be valid - and comparable to each other - they have to be assessing the same thing. It wouldn't make sense to compare one person's math test score to a reading test score. And it wouldn't make sense to combine math and reading items into the same test; what exactly does the exam measure?

But the assumption of unidimensionality goes even further than that. You also want to be careful that each individual item is only measuring one thing. This is harder than it sounds. While a certain reading ability is needed for just about any test question, and some items will need to use jargon, a math item written at an unnecessarily high reading level is actually measuring two dimensions: math ability and reading ability. The same is true for poorly written items that give clues as to the correct answer, or trick questions that trip people up even when they have the knowledge. Your test then measures not only ability on the test topic, but a second ability: test savviness. The only thing that impacts whether a person gets an item correct should be their ability level in that domain. That is an assumption we have to make for a measurement to be valid. So if a person with high math ability gets an item incorrect because of low reading ability, we've violated that assumption. And if a person with low ability gets an item correct because there was an "all of the above" option, we've violated that assumption.

How do we assess dimensionality? One way is with principal components analysis (PCA), which I've blogged about before. As a demonstration, I combined two separate measures (the Satisfaction with Life Scale and the Center for Epidemiologic Studies Depression measure) and ran them through a Rasch analysis as though they were a single measure. The results couldn't have been more perfect if I tried. The PCA results showed 2 clear factors - one made up of the 5 SWLS items and one made up of the 16 CESD items. Here's some of the output Winsteps gave me for the PCA. The first looks at explained variance and Eigenvalues:


There are two things to look at when running a PCA for a Rasch measure. 1. The variance explained by the measures should be at least 40%. 2. The Eigenvalue for the first contrast should be less than 2. As you can see, the Eigenvalue right next to the text "Unexplned variance in 1st contrast" is 7.76, while the values for the 2nd contrast (and on) are less than 2. That means we have two dimensions in the data - and those dimensions are separated out when we scroll down to look at the results of the first contrast. Here are those results in visual form - the letters in the plot refer to the items.


You're looking for two clusters of items, which we have in the upper left-hand corner and the lower right-hand corner. The next part identifies which items the letters refer to and gives the factor loadings:


The 5 items on the left side of the table are from the SWLS. The 16 on the right are all from the CESD. Like I said, perfect results.
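If you want to see the mechanics behind this kind of output, here's a rough sketch of a PCA of standardized Rasch residuals on simulated unidimensional data - not Winsteps' exact algorithm, just the general idea of looking for structure in what's left after the model is removed:

```python
import numpy as np

rng = np.random.default_rng(42)

# Simulate persons and items that genuinely fit a unidimensional Rasch model
theta = rng.normal(0, 1, size=500)   # person abilities
b = rng.normal(0, 1, size=20)        # item difficulties
p = 1 / (1 + np.exp(-(theta[:, None] - b[None, :])))
x = (rng.random(p.shape) < p).astype(float)

# Standardized residuals: what's left after the Rasch model is removed
z = (x - p) / np.sqrt(p * (1 - p))

# Eigenvalues of the residual correlation matrix, largest first;
# these correspond to the "contrasts" in the Winsteps output
eigvals = np.linalg.eigvalsh(np.corrcoef(z, rowvar=False))[::-1]
first_contrast = eigvals[0]  # rule of thumb: < 2 suggests unidimensionality
```

With truly unidimensional data like this, the first contrast hovers near 1 (noise); the 7.76 in my combined SWLS/CESD run is what a real second dimension looks like.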

But remember that PCA is entirely data-driven, so it could identify additional factors that don't have any meaning beyond that specific sample. So the PCA shouldn't be the only piece of evidence for unidimensionality, and there might be cases where you forgo that analysis altogether. If you've conducted a content validation study, that can be considered evidence that these items all belong on the same measure and assess the same thing. That's because content validation studies are often based on expert feedback, surveys of people who work in the topic area being assessed, job descriptions, and textbooks. Combining and cross-validating these different sources can be much stronger evidence than an analysis that is entirely data dependent.

But the other question is, how do we ensure all of our items are unidimensional? Have clear rules for how items should be written. Avoid "all of the above" or "none of the above" as answers - they're freebie points, because if they're an option, they're usually the right one. And if you just want to give away freebie points, why even have a test at all? Also make clear rules about what terms and jargon can be used in the test, and run readability analysis on your items - ignore those required terms and jargon and try to make the reading level of everything else as consistent across the exam as possible (and as low as you can get away with). When I was working on creating trait measures in past research, the guidelines from our IRB were usually that measures should be written at a 6th-8th grade reading level. And unless a measure is of reading level, that's probably a good guideline to try to follow.

Tomorrow, we'll talk about equating test items!

Wednesday, April 3, 2019

C is for Category Function

Up to now, I’ve been talking mostly about Rasch with correct/incorrect or yes/no data. But Rasch can also be used with measures using rating scales or where multiple points can be awarded for an answer. If all of your items have the same scale – that is, they all use a Likert scale of Strongly Disagree to Strongly Agree or they’re all worth 5 points – you can use the Rasch Rating Scale Model.

Note: If your items have differing scales, you could use a Partial Credit Model, which fits each item separately, or if you have sets of items worth the same number of points, you could use a Grouped Rating Scale model, which is sort of a hybrid of the RSM and PCM. I’ll try to touch on these topics later.

Again, in Rasch, every item has a difficulty and every person an ability. But for items worth multiple points or with rating scales, there’s a third thing, which is on the same scale as item difficulty and person ability – the difficulty level for each point on the rating scale. How much ability does a person need to earn all 5 points on a math item? (Or 4 points? Or 3 points? …) How much of the trait is needed to select “Strongly Agree” on a satisfaction with life item? (Or Agree? Neutral? ...) Each point on the scale is given a difficulty. When you examine these values, you’re looking at Category Function.

When you look at these category difficulties, you want to examine two things. First, you want to make certain that higher points on the scale require more ability or more of the trait. Your category difficulties should stairstep up. When your scale points do this, we say the scale is “monotonic” (or “proceeds monotonically”).

Let’s start by looking at a scale that does not proceed monotonically, where the category difficulties are disordered. There are two types of category function data you’ll look at. The first is the “observed measure,” which is the average ability of the people who selected that category. The second are category thresholds – how much more of the trait is needed to select that particular category. When I did my Facebook study, I used the Facebook Questionnaire (Ross et al., 2009), which is a 4-item measure assessing intensity of use and attitudes toward Facebook. All 4 items use a 7-point scale from Strongly Disagree to Strongly Agree. Just for fun, I decided to run this measure through a Rasch analysis in Winsteps, and see how the categories function. Specifically, I looked at the thresholds. (I also looked at the observed measures, but they were monotonic, which is good. But the thresholds were not, which can happen, where one looks good and the other looks bad.) Because these are thresholds between categories, there isn’t one for the first category, Strongly Disagree. But there is one for each category after that, which reflects how much ability or the trait they need to be more likely to select that category than the one below it. Here’s what those look like for the Facebook Questionnaire.


The threshold for the neutral category is lower than for slightly disagree. People are not using that category as I intended them to – perhaps they’re using it when they generally have no opinion, for instance, rather than when they’re caught directly between agreement and disagreement. If I were developing this measure, I might question whether to drop this category, or perhaps find a better descriptor for it. Regardless, I would probably collapse this category into another one (which I usually determine based on frequencies), or possibly drop it, and rerun my analysis with a new 6-point scale to see if category function improves.

The second thing you want to look for is a good spread on those thresholds; you want them to be at least a certain number of logits apart. When you have more options on a rating scale, this adds additional cognitive effort to answer the question. So you want to make sure that each additional point on the rating scale actually gives you useful information – information that allows you to differentiate between people at one point on the ability scale and others. If two categories have basically the same threshold, it means people are having trouble differentiating the two; maybe they’re having trouble parsing the difference between “much of the time” and “most of the time,” leading people of approximately the same ability level to select these two categories about equally.

I’ve heard different guidelines on how big a “spread” is needed. Linacre, who created Winsteps, recommends at least 1.4 logits between thresholds, and suggests collapsing categories until you’re able to attain this spread. That’s not always possible. I’ve also heard smaller values, such as 0.5 logits. But either way, you definitely don’t want two categories to have the exact same observed measure or category threshold.

Also as part of the Facebook study, I administered the 5-item Satisfaction with Life Scale (Diener et al., 1985). Like the Facebook Questionnaire, this measure uses a 7-point scale (Strongly Disagree to Strongly Agree).


The middle categories are all closer together, and certainly don’t meet Linacre’s 1.4 logits guideline. I’m not as concerned about that, but I am concerned that Neither Agree nor Disagree and Slightly Agree are so close together. Just like above, where the category thresholds didn’t advance, there might be some confusion about what this “neutral” category really means. Perhaps this measure doesn’t need a 7-point scale. Perhaps it doesn’t need a neutral option. These are some issues to explore with the measure.

As a quick note, I don’t want it to appear I’m criticizing either measure. They were not developed with Rasch and this idea of category function is a Rasch-specific one. It might not be as important for these measures. But if you’re using the Rasch approach to measurement, these are ideas you need to consider. And clearly, these category function statistics can tell you a lot about whether there seems to be confusion about how a point on a rating scale is used or what it means. If you’re developing a scale, it can help you figure out what categories to combine or even drop.

Tomorrow’s post – dimensionality!

References

Diener, E., Emmons, R. A., Larsen, R. J., & Griffin, S. (1985). The Satisfaction with Life Scale. Journal of Personality Assessment, 49, 71-75.

Ross, C., Orr, E. S., Sisic, M., Arseneault, J. M., Simmering, M. G., & Orr, R. R. (2009). Personality and motivations associated with Facebook use. Computers in Human Behavior, 25, 578-586.

Tuesday, April 2, 2019

B is for Bank

As I alluded to yesterday, in Rasch, every item gets a difficulty and every person taking (some set of) those items gets an ability. They're both on the same scale, so you can compare the two: you can determine which item is right for a person based on their ability, and estimate a person's ability based on how they perform on the items. Rasch uses a form of maximum likelihood estimation to create these values, going back and forth with different values until it arrives at a set of item difficulties and person abilities that fit the data.
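The comparison between the two works because the model boils down to a simple logistic function of the gap between ability and difficulty. A minimal sketch of the dichotomous Rasch model in R (the model itself, not any particular software's estimation routine):

```r
# Dichotomous Rasch model: probability of a correct response depends
# only on the difference between person ability and item difficulty,
# both expressed in logits
p_correct <- function(ability, difficulty) {
  1 / (1 + exp(-(ability - difficulty)))
}

p_correct(1, 1)  # ability equals difficulty: 0.5
p_correct(2, 1)  # one logit above difficulty: ~0.73
p_correct(0, 1)  # one logit below difficulty: ~0.27
```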

Once you have a set of items that have been thoroughly tested and have item difficulties, you can begin building an item bank. A bank could be all the items on a given test (so the measure itself is the bank), but usually, when we refer to a bank, we're talking about a large pool of items from which to draw for a particular administration. This is how computer adaptive tests work. No person is going to see every single item in the bank.

Maintaining an item bank is a little like being a project manager. You want to be very organized and make sure you include all of the important information about the items. Item difficulty is one of the statistics you'll want to save about the items in your bank. When you administer a computer adaptive test, the difficulty of the item the person receives next is based on whether they got the previous item correct or not. If they got the item right, they get a harder item. If they got it wrong, they get an easier item. They keep going back and forth like this until we've administered enough items to be able to estimate that person's ability.
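That back-and-forth can be sketched in a few lines of R. This is a toy simulation with invented values, not a real CAT algorithm (a production system would use maximum-information item selection and proper ability estimation rather than the simple step rule here):

```r
set.seed(42)
bank <- data.frame(id = 1:20, difficulty = seq(-2, 2, length.out = 20))
true_ability <- 0.8   # unknown in practice; used here only to simulate responses
ability <- 0          # provisional estimate, starting at the scale midpoint
step <- 1
used <- integer(0)

for (i in 1:10) {
  # administer the unused item whose difficulty is closest to the estimate
  available <- setdiff(bank$id, used)
  item <- available[which.min(abs(bank$difficulty[available] - ability))]
  used <- c(used, item)
  # simulate a response from the Rasch model
  p <- 1 / (1 + exp(-(true_ability - bank$difficulty[item])))
  correct <- runif(1) < p
  # harder item after a right answer, easier after a wrong one
  ability <- ability + ifelse(correct, step, -step)
  step <- step / 1.5  # shrink the step so the estimate settles down
}
ability  # provisional ability after 10 items
```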

You want to maintain the text of the item: the stem, the response options, and which option is correct (the key). The bank should also include the topics each item covers. On ability tests, topics are the content areas people should know - an algebra test might have topics like linear equations, quadratic equations, word problems, and so on.

With adaptive tests, you'll want to note which items are enemies of each other. These are items that are so similar (they cover the same topic and may even have the same response options) that you'd never want to administer both in the same test. Not only do enemy items give you no new information on that person's ability to answer that type of question, they may even provide clues about each other that lead someone to the correct answer. This is bad, because the ability estimate based on this performance won't be valid - the ability you get becomes a measure of how savvy the test taker is, rather than their actual ability on the topic of the test.

On the flip side, there might be items that go together, such as a set of items that all deal with the same reading passage. So if an examinee randomly receives a certain passage of text, you'd want to make sure they then receive all items associated with that passage.

Your bank might also include items you're testing out. While some test developers will pretest items through a special event, where examinees receive only new items with no established difficulties, once a test has been developed, new items are usually mixed in with old items to gather item difficulty data, and to calibrate the difficulties of the new items against the difficulties of known items. (I'll talk more about this when I talk about equating.) The new items are pretest items and unscored - you don't want performance on them to affect the person's score. The old items are operational and scored. Some tests put all pretest items in a single section; this is how the GRE does it. Others will mix them in throughout the test.

Over time, items are seen more and more. You'll want to track how long an item has been in the bank and how many times it's been seen by examinees. You might reach a point where an item has been seen so much that you have to assume it's been compromised - shared between examinees. In that case, you'll retire the item. You'll often keep retired items in the bank, just in case you want to reuse or revisit them later. Of course, if the item tests on something that is no longer relevant to practice, it may be removed from the bank completely.

Basically, your bank holds all items that could or should appear on a particular test. If you've ever been a teacher or professor and received supplementary materials with a textbook, you may have gotten items that go along with the text. These are item banks. As you look through them, you might notice similarly worded items (enemy items), the topics in the textbook covered, and so on. Sadly, these types of banks don't tend to have item difficulties attached to them. (Wouldn't that be nice, though? Even if you're not going to Rasch grade your students, you could make sure you have a mix of easy, medium, and hard items. But these banks don't tend to have been pretested, a fact I learned when I applied for a job to write many of these supplementary materials for textbooks.)

If you're in the licensing or credentialing world, you probably also need to track references for each item. And this will usually need to be a very specific reference, down to the page of a textbook or industry document. You may be called upon - possibly in a deposition or court - to prove that a particular item is part of practice, meaning you have to demonstrate where this item comes from and how it is required knowledge for the field. Believe me, people do challenge particular items. In the comments section at the end of a computer adaptive test, people will often reference specific items. It's rare to have to defend an item to someone, but when you do, you want to make certain you have all the information you need in one place.

Most of the examples I've used in this post have been for ability tests, and that's traditionally where we've seen item banks. But there have been some movements in testing to use banking for trait and similar tests. If I'm administering a measure of mobility to a person with a spinal cord injury, I may want to select certain items based on where that person falls on mobility. Spinal cord injury can range in severity, so while one person with a spinal cord injury might be wheelchair-bound, another might be able to walk short distances with the help of a walker (especially if they have a partial injury, meaning some signals are still getting through). You could have an item bank so that these two patients get different items; the person who is entirely wheelchair-bound wouldn't get any items about their ability to walk, while the person with the partial injury would. The computer adaptive test would just need a starting item to figure out approximately where the person falls on the continuum of mobility, then direct them to the questions most relevant to their area of the continuum.

Tomorrow, we'll talk more about the rating scale model of Rasch, and how we look at category function!


Monday, April 1, 2019

A is for Ability

Welcome to April A to Z, where I'll go through the A to Z of Rasch! This is the measurement model I use most frequently at work, and is a great way to develop a variety of measures. It started off in educational measurement, and many Rasch courses are housed in educational psychology and statistics departments. But Rasch is slowly making its way into other disciplines as well, and I hope to see more people using this measurement model. It provides some very powerful and useful statistics on measures and isn't really that difficult to learn.

A few disclaimers:

  1. There are entire courses on Rasch measurement, and in fact, some programs divide Rasch up across several courses because there's much that can be learned on the topic. This blog series is really an introduction to the concepts, to help you get started and decide if Rasch is right for you. I won't get into the math behind it as much, but will try to use some data examples to demonstrate the concepts. 
  2. I try to be very careful about what data I present on the blog. My data examples are usually: data I personally own (that is, collected as part of my masters or doctoral study), publicly available data, or (most often) simulated data. None of the data I present will come from any of the exams I work on at my current position or past positions as a psychometrician. I don't own those data and can't share them. 
  3. Finally, these posts are part of a series in which I build on past posts to introduce new concepts. I won't always be able to say everything I want on a topic for a given post, because it ties into another post better. So keep reading - I'll try to make it clear when I'll get back to something later.

For my first post in the series, ability!

When we use measurement with people, we're trying to learn more about some underlying quality about them. We call this underlying quality we want to measure a "latent variable." It's something that can't be captured directly, but rather through proxies. In psychometrics, those proxies are the items. We go through multiple steps to design and test those items to ensure they get as close as we can to tapping into that latent variable. In science/research terms, the items are how we operationalize the latent variable: define it in a way that it can be measured.

In the Rasch approach to psychometrics, we tend to use the term "ability" rather than "latent variable." First, Rasch was originally designed for educational assessment - tests that had clear right and wrong answers - so it makes sense to frame performance on these tests as ability. Second, Rasch deals with probabilities that a person is able to answer a certain question correctly or endorse a certain answer on a test. So even for measures of traits, like attitudes and beliefs or personality inventories, people with more of a trait have a different ability to respond to different questions. Certain answers are easier for them to give because of that underlying trait.

In Rasch, we calibrate items in terms of difficulty (either how hard the question is or, for trait measures, how much of the trait is needed to respond in a certain way) and people in terms of ability. These two concepts are calibrated on the same scale, so once we have a person's ability, we can immediately determine how they're likely to respond to a given item. That scale is in a unit called a "logit," or a log odds ratio. This conversion gives us a distribution that is nearly linear. (Check out the graphs in the linked log odds ratio post to see what I mean.) Typically, when you look at your distribution of person ability measures, you'll see numbers ranging from positive to negative. And while, theoretically, your logits can range from negative infinity to positive infinity, more likely, you'll see values from -2 to +2 or something like that.
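Since a logit is just the natural log of the odds, converting between probabilities and logits is one line in each direction (plain R, no packages needed):

```r
# logit = log(p / (1 - p)); the inverse is the logistic function
p_to_logit <- function(p) log(p / (1 - p))
logit_to_p <- function(l) 1 / (1 + exp(-l))

p_to_logit(0.5)    # 0: even odds
p_to_logit(0.88)   # roughly +2 logits
logit_to_p(-2)     # roughly 0.12
```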

That ability metric is calculated based on the items the person responded to (and especially how they responded) - for exams with right and wrong answers, their ability is the difficulty of the item at which they have a 50% chance of responding correctly. The actual analysis involved in creating these ability estimates uses some form of maximum likelihood estimation (there are different types, like Joint Maximum Likelihood Estimation, JMLE, or Pairwise Maximum Likelihood Estimation, PMLE, to name a couple), so it goes back and forth with different values until it gets estimates that best fit the data. This is both a strength and weakness of Rasch: it's an incredibly powerful analysis technique that makes full use of the data available, and handles missing data beautifully, but it couldn't possibly be done without a computer and capable software. In fact, Rasch has been around almost as long as classical test theory approaches - it just couldn't really be adopted until technology caught up.

I'll come back to this concept later this month, but before I close on this post, I want to talk about one more thing: scores. When you administer a Rasch exam and compute person ability scores, you have logits that are very useful for psychometric purposes. But rarely would you ever show those logit scores to someone else, especially an examinee. A score of -2 on a licensing exam or the SAT isn't going to make a lot of sense to an examinee. So we convert those values to a scaled score, using some form of linear equation. That precise equation differs by exam. The SAT, for instance, ranges from 400 to 1600, the ACT from 1 to 36, and the CPA exam from 0 to 99. Each of these organizations has an equation that takes their ability scores and converts them to their scaled score metric. (And in case you're wondering if there's a way to back convert your scaled score on one of these exams to the logit, you'd need to know the slope and constant of the equation they use - not something they're willing to share, nor something you could determine from a single data point. They don't really want you looking under the hood on these types of things.)
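To make that conversion concrete, here's a sketch with an entirely made-up slope and intercept (no real testing program publishes theirs), mapping logits onto an SAT-like 400-1600 reporting range:

```r
# Hypothetical linear scaling from logits to a reporting scale.
# Slope, intercept, and range are invented for illustration.
logit_to_scaled <- function(logit, slope = 150, intercept = 1000,
                            min_score = 400, max_score = 1600) {
  raw <- slope * logit + intercept
  pmin(pmax(round(raw), min_score), max_score)  # clamp to the reporting range
}

logit_to_scaled(c(-5, -2, 0, 2, 5))  # 400  700 1000 1300 1600
```

The clamping in the last line is why an extreme ability estimate still reports at the published minimum or maximum score.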

Some questions you may have:

  1. The linear equation mentioned above takes into account the possible range of scaled scores (such as 400 to 1600 on the SAT), as well as the actual abilities of the pilot or pretest sample you used. What if you administer the test (with the set difficulties of the items from pretesting), and you get someone outside of the range of abilities of the sample you used to create the equation? Could you get someone with a score below 400 or above 1600? Yes, you could. This is part of why you want a broad range of abilities in your pilot sample, to make sure you're getting all possible outcomes and get the most accurate equation you can. This is also where simulated data could be used, which we do use in psychometrics. Once you have those item difficulties set with pilot testing, you could then create cases that get, for instance, all questions wrong or all questions right. This would give you the lowest possible logit and the highest possible logit. You can then include those values when setting the equation. As long as the difficulties are based on real performance, it's okay to use simulated cases. Finally, that range of possible scores sets the minimum and maximum for reporting. Even if an ability score means a person could have an SAT below 400, their score will be automatically set at the minimum.
  2. What about passing and failing for licensing and credentialing exams? How do ability scores figure into that? Shouldn't you just use questions people should know and forget about all this item difficulty and person ability nonsense? This is one way that Rasch and similar measurement models differ from other test approaches. The purpose of a Rasch test is to measure the person's ability, so you need a good range of item difficulties to determine what a person's true ability is. These values are not intended to tell you whether a person should pass or fail - that's a separate part of the psychometric process called standard setting, where a committee of subject matter experts determines how much ability is necessary to say a person has the required knowledge to be licensed or credentialed. This might be why Rasch makes more sense to people when you talk about it in the educational assessment sense: you care about what the person's ability is. But Rasch is just as useful in pass/fail exams, because of the wealth of information it gives you about your exam and items. You just need that extra step of setting a standard to give pass/fail information. And even in those educational assessments, there are generally standards, such as what a person's ability should be at their age or grade level. Rasch measurement gives you a range of abilities in your sample, which you can then use to make determinations about what those abilities should be.


Tomorrow, we'll dig more into item difficulties when we discuss item banks!

Sunday, March 24, 2019

Statistics Sunday: Blogging A to Z Theme Reveal

I'm excited to announce this year's theme for the Blogging A to Z challenge:


I'll be writing through the alphabet of psychometrics with the Rasch Measurement Model approach. I've written a bit about Rasch previously. You can find those posts here:

Looking forward to sharing these posts! First up is A for Ability on Monday, April 1!

Sunday, March 17, 2019

Statistics Sunday: Standardized Tests in Light of Public Scandal

No doubt, by now, you've heard about the large-scale investigation into college admissions scandals among the wealthy - a scandal that suggests SAT scores, among other things, can in essence be bought. Eliza Shapiro and Dana Goldstein of the NY Times ask if this scandal is "the last straw" for tests like the SAT.

To clarify in advance, I do not now, nor have I ever, worked for the Educational Testing Service or for any organization involved in admissions testing. But as a psychometrician, I have a vested interest in this industry. And I became a psychometrician because of my philosophy: that many things, including ability, achievement, and college preparedness, can be objectively measured if certain procedures and methods are followed. If the methods and procedures are not followed properly in a particular case, the measurement in that case is invalid. That is what happens when a student (or more likely, their parent) pays someone else to take the SAT for them, bribes a proctor, or finds an "expert" willing to sign off on a disability the student does not have to get extra accommodations.

But the fact that a particular instance of measurement is invalid doesn't damn the entire field to invalidity. It just means we have to work harder. Better vetting of proctors, advances in testing like computerized adaptive testing and new item types... all of this helps counteract outside variables that threaten the validity of measurement. And expansions in the field of data forensics now include examining anomalous patterns in testing, to identify whether some form of dishonesty has taken place - allowing scores to be rescinded or otherwise declared invalid after the fact.

This is a field I feel strongly about, and as I said, really sums up my philosophy in life for the value of measurement. Today, I'm on my way to the Association of Test Publishers Innovations in Testing 2019 meeting in Orlando. I'm certain this recent scandal will be a frequent topic at the conference, and a rallying cry for better protection of exam material and better methods for identifying suspicious testing behavior. Public trust in our field is on the line. It is our job to regain that trust.


Tuesday, March 12, 2019

Are Likert Scales Superior to Yes/No? Maybe

I stumbled upon this great post from the Personality Interest Group and Espresso (PIG-E) blog about which is better - Likert scales (such as those 5-point Agree to Disagree scales you often see) or Yes/No (see also True/False)? First, they polled people on Twitter. 66% of respondents thought that going from a 7-point to 2-point scale would decrease reliability on a Big Five personality measure; 71% thought that move would decrease validity. But then things got interesting:
Before I could dig into my data vault, M. Brent Donnellan (MBD) popped up on the twitter thread and forwarded amazingly ideal data for putting the scale option question to the test. He’d collected a lot of data varying the number of scale options from 7 points all the way down to 2 points using the BFI2. He also asked a few questions that could be used as interesting criterion-related validity tests including gender, self-esteem, life satisfaction and age. The sample consisted of folks from a Qualtrics panel with approximately 215 people per group.

Here are the average internal consistencies (i.e., coefficient alphas) for 2-point (Agree/Disagree), 3-point, 5-point, and 7-point scales:

And here's what they found in terms of validity evidence - the correlation between the BFI2 and another Big Five measure, the Mini-IPIP:


FYI, when I'm examining item independence in scales I'm creating or supporting, I often use 0.7 as a cut-off - that is, items that correlate at 0.7 or higher (meaning 49% shared variance) are essentially measuring the same thing and violate the assumption of independence. The fact that all but Agreeableness correlate at or above 0.7 is pretty strong evidence that the scales, regardless of number of response options, are measuring the same thing.

The post includes a discussion of these issues by personality researchers, and includes some interesting information not just on number of response options, but also on the Big Five personality traits.

Monday, March 11, 2019

Statistics Sunday: Scatterplots and Correlations with ggpairs

As I conduct some analysis for a content validation study, I wanted to quickly blog about a fun plot I discovered today: ggpairs, which displays scatterplots and correlations in a grid for a set of variables.

To demonstrate, I'll return to my Facebook dataset, which I used for some of last year's R analysis demonstrations. You can find the dataset, a minicodebook, and code on importing into R here. Then use the code from this post to compute the following variables: RRS, CESD, Extraversion, Agree, Consc, EmoSt, Openness. These correspond to measures of rumination, depression, and the Big Five personality traits. We could easily request correlations for these 7 variables. But if I wanted scatterplots plus correlations for all 7, I can easily request it with ggpairs then listing out the columns from my dataset I want included on the plot:

library(GGally) # ggpairs() comes from the GGally package, which builds on ggplot2
ggpairs(Facebook[,c(112,116,122:126)])

(Note: I also computed the 3 RRS subscales, which is why the column numbers above skip from 112 (RRS) to 116 (CESD). You might need to adjust the column numbers when you run the analysis yourself.)

The results look like this:


Since the grid is the number of variables squared, I wouldn't recommend this type of plot for a large number of variables.

Thursday, March 7, 2019

Time to Blog More

My blogging has been pretty much non-existent this year. Without getting too personal, I've been going through some pretty major life changes, and it's been difficult to focus on a variety of things, especially writing. As I work through this big transition, I'm thinking about what things I want to make time for and what things I should step away from.

Writing - especially about science, statistics, and psychometrics - remains very important to me. So I'm going to keep working to get back into some good blogging habits. Statistics Sunday posts may remain sporadic for a bit longer, but look for more statistics-themed posts very soon because...


That's right, it's time to sign up for the April A to Z blogging challenge! I'll officially announce my theme later this month, but for now I promise it will be stats-related.

Thursday, February 28, 2019

A New Trauma Population for the Social Media Age

Even if you aren't a Facebook user, you're probably aware that there are rules about what you can and cannot post. Images or videos that depict violence or illegal behavior would of course be taken down. But who decides that? You as a user can always report an image or video (or person or group) if you think it violates community standards. But obviously, Facebook doesn't want to traumatize its users if it can be avoided.

That's where the employees of companies like Cognizant come in. It's their job to watch some of the most disturbing content on the internet - and it's even worse than it sounds. In this fascinating article for The Verge, Casey Newton describes just how traumatic doing such a job can be. (Content warning - this post has lots of references to violence, suicide, and mental illness.)

The problem with the way these companies do business is that, not only do employees see violent and disturbing content; they also don't have the opportunity to talk about what they see with their support networks:
Over the past three months, I interviewed a dozen current and former employees of Cognizant in Phoenix. All had signed non-disclosure agreements with Cognizant in which they pledged not to discuss their work for Facebook — or even acknowledge that Facebook is Cognizant’s client. The shroud of secrecy is meant to protect employees from users who may be angry about a content moderation decision and seek to resolve it with a known Facebook contractor. The NDAs are also meant to prevent contractors from sharing Facebook users’ personal information with the outside world, at a time of intense scrutiny over data privacy issues.

But the secrecy also insulates Cognizant and Facebook from criticism about their working conditions, moderators told me. They are pressured not to discuss the emotional toll that their job takes on them, even with loved ones, leading to increased feelings of isolation and anxiety.

The moderators told me it’s a place where the conspiracy videos and memes that they see each day gradually lead them to embrace fringe views. One auditor walks the floor promoting the idea that the Earth is flat. A former employee told me he has begun to question certain aspects of the Holocaust. Another former employee, who told me he has mapped every escape route out of his house and sleeps with a gun at his side, said: “I no longer believe 9/11 was a terrorist attack.”
It's a fascinating read on an industry I really wasn't aware existed, and a population that could be diagnosed with PTSD and other responses to trauma.

Thursday, February 21, 2019

Replicating Research and "Peeking" at Data

Today on one of my new favorite blogs, EJ Wagenmakers dissects a recent interview with Elizabeth Loftus on when it is okay to peek at data being collected:
Claim 4: I should not feel guilty when I peek at data as it is being collected

This is the most interesting claim, and one with the largest practical repercussions. I agree with Loftus here. It is perfectly sound methodological practice to peek at data as it is being collected. Specifically, guilt-free peeking is possible if the research is exploratory (and this is made unambiguously clear in the published report). If the research is confirmatory, then peeking is still perfectly acceptable, just as long as the peeking does not influence the sampling plan. But even that is allowed as long as one employs either a frequentist sequential analysis or a Bayesian analysis (e.g., Rouder, 2014; we have a manuscript in preparation that provides five intuitions for this general rule). The only kind of peeking that should cause sleepless nights is when the experiment is designed as a confirmatory test, the peeking affects the sampling plan, the analysis is frequentist, and the sampling plan is disregarded in the analysis and misrepresented in the published report. This unfortunate combination invokes what is known as “sampling to a foregone conclusion”, and it invalidates the reported statistical inference.
Loftus also has many opinions on replicating research, which may in part be driven by the fact that recent replications have not been able to recreate some of the major findings in social psychology. Wagenmakers shares his thoughts on that as well:
I believe that we have a duty towards our students to confirm that the work presented in our textbooks is in fact reliable (see also Bakker et al., 2013). Sometimes, even when hundreds of studies have been conducted on a particular phenomenon, the effect turns out to be surprisingly elusive — but only after the methodological screws have been turned. That said, it can be more productive to replicate a later study instead of the original, particularly when that later study removes a confound, is better designed, and is generally accepted as prototypical.
The whole post is worth a read and also has a response from Loftus at the end.

Wednesday, February 13, 2019

Sunday, January 27, 2019

Statistics Sunday: Creating a Stacked Bar Chart for Rank Data

At work on Friday, I was trying to figure out the best way to display some rank data. What I had were rankings from 1-5 for 10 factors considered most important in a job (such as Salary, Insurance Benefits, and the Opportunity to Learn), meaning each respondent chose and ranked the top 5 from those 10, and the remaining 5 were unranked by that respondent. Without even thinking about the missing data issue, I computed a mean rank and called it a day. (Yes, I know that ranks are ordinal and means are for continuous data, but my goal was simply to differentiate importance of the factors and a mean seemed the best way to do it.) Of course, then we noticed one of the factors had a pretty high average rank, even though few people ranked it in the top 5. Oops.

So how could I present these results? One idea I had was a stacked bar chart, and it took a bit of data wrangling to do it. That is, the rankings were all in separate variables, but I wanted them all on the same chart. Basically, I needed to create a dataset with:
  • 1 variable to represent the factor being ranked
  • 1 variable to represent the ranking given (1-5, or 6 that I called "Not Ranked")
  • 1 variable to represent the number of people giving that particular rank to that particular factor

What I ultimately did was run frequencies for the factor variables, turn those frequency tables into data frames, and merge them together with rbind. I then created the chart with ggplot2. Here's some code for a simplified example, which only uses 6 factors and asks people to rank the top 3.

First, let's read in our sample dataset - note that these data were generated only for this example and are not real data:

library(tidyverse)
## -- Attaching packages --------------------------------------------------------------------------------------------------------------------- tidyverse 1.2.1 --
## ✔ ggplot2 3.0.0     ✔ purrr   0.2.4
## ✔ tibble  1.4.2     ✔ dplyr   0.7.4
## ✔ tidyr   0.8.0     ✔ stringr 1.3.1
## ✔ readr   1.1.1     ✔ forcats 0.3.0
## Warning: package 'ggplot2' was built under R version 3.5.1
## -- Conflicts ------------------------------------------------------------------------------------------------------------------------ tidyverse_conflicts() --
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
ranks <- read_csv("C:/Users/slocatelli/Desktop/sample_ranks.csv", col_names = TRUE)
## Parsed with column specification:
## cols(
##   RespID = col_integer(),
##   Salary = col_integer(),
##   Recognition = col_integer(),
##   PTO = col_integer(),
##   Insurance = col_integer(),
##   FlexibleHours = col_integer(),
##   OptoLearn = col_integer()
## )

This dataset contains 7 variables - 1 respondent ID and 6 variables with ranks on factors considered important in a job: salary, recognition from employer, paid time off, insurance benefits, flexible scheduling, and opportunity to learn. I want to run frequencies for these variables, and turn those frequency tables into a data frame I can use in ggplot2. I'm sure there are much cleaner ways to do this (and please share in the comments!), but here's one not so pretty way:

salary <- as.data.frame(table(ranks$Salary))
salary$Name <- "Salary"
recognition <- as.data.frame(table(ranks$Recognition))
recognition$Name <- "Recognition by \nEmployer"
PTO <- as.data.frame(table(ranks$PTO))
PTO$Name <- "Paid Time Off"
insurance <- as.data.frame(table(ranks$Insurance))
insurance$Name <- "Insurance"
flexible <- as.data.frame(table(ranks$FlexibleHours))
flexible$Name <- "Flexible Schedule"
learn <- as.data.frame(table(ranks$OptoLearn))
learn$Name <- "Opportunity to \nLearn"

rank_chart <- rbind(salary, recognition, PTO, insurance, flexible, learn)
# table() stores the rank as a factor; as.numeric() converts it to the
# level index (1 through 4 here), which scale_fill_continuous expects below
rank_chart$Var1 <- as.numeric(rank_chart$Var1)

With my not-so-pretty data wrangling, the chart itself is actually pretty easy:

ggplot(rank_chart, aes(fill = Var1, y = Freq, x = Name)) +
  geom_bar(stat = "identity") +
  labs(title = "Ranking of Factors Most Important in a Job") +
  ylab("Frequency") +
  xlab("Job Factors") +
  scale_fill_continuous(name = "Ranking",
                      breaks = c(1:4),
                      labels = c("1","2","3","Not Ranked")) +
  theme_bw() +
  theme(plot.title=element_text(hjust=0.5))

Based on this chart, we can see the top factor is Salary. Insurance is slightly more important than paid time off, but these are definitely the top 2 and 3 factors. Recognition wasn't ranked by most people, but those who did considered it their #2 factor; ditto for flexible scheduling at #3. Opportunity to learn didn't make the top 3 for most respondents.
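One variation worth trying: since the raw frequencies make the bars for rarely-ranked factors short and hard to compare, position = "fill" in geom_bar rescales every bar to proportions on a common 0-1 scale. Here's a sketch using a small made-up stand-in for the rank_chart data frame, so the snippet runs on its own:

```r
library(ggplot2)

# Toy stand-in for the rank_chart data frame built above (made-up counts)
rank_chart_toy <- data.frame(
  Name = rep(c("Salary", "Paid Time Off"), each = 2),
  Var1 = rep(1:2, times = 2),
  Freq = c(60, 40, 30, 20)
)

# position = "fill" stacks each bar to a total of 1, so bars show the
# proportion of respondents giving each rank rather than raw counts
p <- ggplot(rank_chart_toy, aes(fill = Var1, y = Freq, x = Name)) +
  geom_bar(stat = "identity", position = "fill") +
  ylab("Proportion of Respondents") +
  xlab("Job Factors") +
  theme_bw()
```

The trade-off is that you lose the sense of how many people ranked each factor at all, which is part of the story here, so I'd probably show both.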

Friday, January 25, 2019

Natural Graph

Via Not Awful and Boring, this reddit post discusses a really cool natural graph, measuring the amount of sunlight per day, created with a tree and a magnifying glass.


Apparently, this device is a Campbell-Stokes recorder.

Thursday, January 24, 2019

Long Time, No Write

Wow, it's been way too long since I've posted anything! Lots of life changes recently, including moving to a new place and fighting with Comcast to get internet there. I still need to set up my office, and I plan on doing lots of writing in my new dedicated space. I'm planning on more statistics posts and a few more surprises this year.

Work has also been busy. At the moment:

  • I'm working on three content validation studies, including analyzing data for two job analysis surveys and gearing up for a third
  • I've wrapped up the first phase of analysis on our salary and satisfaction survey, and have some cool analysis planned for phase 2
  • I finished a time study on our national and state exams
  • I'm awaiting feedback on the first draft of a chapter about standard setting I coauthored with some coworkers
  • I'm learning how to be a supervisor, now that I have someone working for me! That's right, I'm no longer a department of one.
Once I get through some of the most pressing project work, I'm going to take some of my work time to teach myself data forensics as it applies to testing. In fact, this book has been on my to-read shelf since my annual employee evaluation back in November. Look for blog posts on that!