Choosing Assessment Item Types
Writing good assessments is of course a major part of learning design, but before anyone writes any questions, it’s worth putting some thought into which question types you want to include. You may want multiple choice questions, but other item types can serve as well, and sometimes something completely different is called for. To make sense of this decision, we’ve constructed this overview of the process of choosing item types, and in subsequent posts we’ll provide an analysis of the strengths and weaknesses of the most common (and even some of the less common) item types.
First, though, let’s define terms. In this context, “item types” are defined by the action they ask the learner to perform. Are they choosing one answer out of four? Entering a number? Sorting from least to greatest? Deciding what the learners will do to demonstrate their achievement of the learning objectives is a key question that is often neglected, but always worth asking. After all, mismatches between the learning objectives and the assessments make test results meaningless. At Second Avenue, we always start with the objectives, and then think about what would count as evidence that those objectives have been accomplished.
Here are some of the most important considerations that go into selecting item types:
Discriminatory power refers to the degree to which the item tells you something about the learner. Highly discriminatory questions allow for highly accurate and efficient assessments. Questions with extremely low discriminatory power may be a waste of time.
In general, the discriminatory power of an item type is positively related to the number of potential responses it allows. For example, a true/false question has two potential responses, true and false. Guessing randomly yields the correct response half the time, and so performance on a single true/false question tells you very little about the learner. A battery of such questions would need to be administered before any solid conclusions could be drawn. In contrast, a matching question with 60 potential responses but only one correct answer provides much more helpful information. The chances of randomly getting it correct are less than two percent, so fewer of these items would need to be administered before a solid conclusion could be drawn.
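The guessing math above is easy to make concrete. Here is a quick sketch (our own illustration, not from the post; the function names and the two-percent threshold are ours, chosen to match the figures above):

```python
from math import ceil, log

def guess_probability(num_options: int) -> float:
    """Chance of answering one item correctly by random guessing."""
    return 1 / num_options

def items_needed(num_options: int, threshold: float = 0.02) -> int:
    """Number of items to administer before the chance of acing them
    all by guessing alone drops below `threshold`."""
    return ceil(log(threshold) / log(1 / num_options))

print(guess_probability(2))    # true/false: 0.5
print(guess_probability(60))   # 60-option matching: ~0.017, under 2%
print(items_needed(2))         # 6 true/false items to rule out lucky guessing
print(items_needed(60))        # a single 60-option matching item suffices
```

Six true/false items versus one matching item, for the same confidence that the learner isn’t just guessing, is the discriminatory-power tradeoff in a nutshell.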
The flexibility of an item type measures the variety of uses to which it can be put. Consider sequencing questions. When they are appropriate, they can provide a welcome break from multiple choice questions. They can also be wonderful discriminators, since they have so many potential responses (e.g. a 7-part sequencing question has thousands of potential responses). Still, how often do you have 7 things you need to put in order? Sometimes, but not often. There simply are not many instances in which an order exists and is worthy of testing.
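The “thousands of potential responses” figure follows from simple counting: an n-part sequencing item has n! possible orderings, only one of which is correct. A quick check (illustrative only):

```python
from math import factorial

# An n-part sequencing item has n! possible orderings.
orderings = factorial(7)
print(orderings)      # 5040 possible orderings for a 7-part sequence
print(1 / orderings)  # chance of guessing the right one: ~0.0002
```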
In contrast, multiple choice questions can be asked in a wide variety of contexts. Individually, they have less discriminatory power than other question types, but they are almost always applicable, and that’s part of the reason why they get used as often as they do.
Cost refers to the expenditure (usually time and money) required to produce the item. In general, there is a tradeoff between the sophistication of an item and its cost. Simple items that ask for the definition of a key term can be churned out rapidly, but questions that call for subtle inferences or careful analysis of data can take an hour or more to write, even when the writer is experienced.
Sophistication is a factor, but it is not the only (or even the most important) driver of cost. Graphics, charts, tables, and similar elements can help make items more realistic and are often necessary for some learning objectives, but they can be expensive to produce, making the strategic use of art an important skill in assessment design. Similarly, providing complex scenarios to analyze tends to cost more than testing simple definitions, but sophistication is often worth the cost. Simulating the complexity of the real world can be essential to measuring higher-order skills and can be very helpful in terms of predictive validity.
Auto-gradability is another key consideration. Most of the standard question types (such as multiple choice, numeric, and multiple answer) can be graded automatically. That’s often really important, because requiring human intervention to evaluate responses is time-consuming, expensive, and potentially subjective. At Second Avenue, we’ve also developed item types that can automatically grade complex tasks, such as drawing chemical structures and creating force diagrams. When a task can be auto-graded, feedback can be immediate, and practice becomes much easier.
Still, sometimes there are nuances to the learner’s responses that the computer can’t be expected to catch. We may be comfortable with using automatic grading to measure quantitative skills, but that doesn’t mean that we trust the computer to evaluate creative writing. If automatic grading isn’t required, though, lots of innovative approaches are possible.
We hope you find this framework useful when you think about your assessment projects. Check back soon for more analysis of particular item types, starting with the traditional, but controversial, multiple choice.