Join our weekly blog

Assessments 101 – differentiating good assessments from bad

When it comes to screening candidates against requirements, psychometric assessments can be a powerful tool. Assessments can tell you who has what skills, knowledge, or traits, and to what level. They can pick up details that could drive your team and organization to the height of success, or send it crashing down in a chaotic mess. They can give you a window into how people think and how they thrive, and can expose potential challenges or areas of weakness. Or they can do nothing but lead you astray and erode the validity, equity, and compliance of your process – if you use the wrong ones.

The key is to select the correct assessments. To do this, first we must answer the fundamental question – what exactly are psychometric assessments?

Assessments are tests that measure the level of the test taker’s knowledge, skills, abilities, and other characteristics.

For example, think back to school when you took a test – that test was designed to measure your knowledge of the topic (math or science or your understanding of the last book you were assigned to read). However, unlike those tests in school, psychometric assessments can measure much more than one’s memory of what happened in a book.

So how do you know if an assessment is a good one?

Separate the good assessments from the unicorn poop

There are thousands of assessments and hundreds of assessment providers all excited to provide you with their version of the “best assessment ever created”. There are long assessments full of colorful results, short, fast assessments that will tell you “everything you need to know” about a candidate in ten minutes, digital assessments that look more like games, and assessments that look more like the SATs from when we were teenagers – complete with number two pencil and oval bubbles. But… let me let you in on a little secret:

Not all of those assessments do as they claim.

I know, crazy, right?

In fact, some of them are downright bunk. Yup; they are nothing but colorful, dazzling, unicorn poop! (can I use ‘unicorn poop’ in a blog? I’m going with yes.)

For example: if the result tells you which Harry Potter character you are… it’s probably unicorn poop.

Telling the difference between good and bad starts with these three questions.

Question 1: Does the assessment measure what it’s supposed to measure?

Not all assessments measure what they claim to measure in their flashy marketing material. When evaluating what the assessment is measuring, identify the following:

  • Definition of what is being measured: whether traits, skills, knowledge, or behaviors, whatever is being measured must come with a strong, clear definition.

  • The defined scale: what is a high vs. a low score, and what does it mean to be high or low?

  • The variability: How much differentiation does the scale provide? If everyone scores pretty much the same, then there is nothing to distinguish one candidate from another.

  • The method of validation: This is the evidence that the data is valid. Common methods include comparing assessment results against a known and validated assessment measuring the same thing, using control groups with known outcomes (for example, a test measuring ‘do you know Java’ administered to a group of Java developers and a group of non-Java developers), and having recognized experts verify the outcomes.
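To make the variability and validation checks above a bit more concrete, here is a minimal Python sketch. It assumes you have scores for the same group of people on the assessment you are evaluating and on an already-validated assessment of the same thing. The score lists, the variable names, and the choice of a Pearson correlation as the comparison are illustrative assumptions, not any provider's actual method.

```python
from statistics import mean, pstdev


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of scores."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical data: the same eight people on both assessments.
new_scores = [52, 61, 70, 48, 85, 77, 66, 59]
reference_scores = [50, 65, 72, 45, 88, 74, 63, 60]

# Variability: a spread near zero means everyone scores about the same,
# so the scale cannot distinguish one candidate from another.
spread = pstdev(new_scores)

# Validation: a correlation near 1 means the new assessment ranks people
# much like the known, validated one does.
agreement = pearson(new_scores, reference_scores)

print(f"spread: {spread:.1f}, agreement: {agreement:.2f}")
```

What counts as "enough" spread or a "high enough" correlation depends on the scale and the job, so treat the numbers as conversation starters with the provider rather than pass/fail cutoffs.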

When evaluating what is measured, if you don’t understand the responses – ASK! If it still doesn’t make sense after the assessment provider explains it to you, treat it as a big yellow flag. Either get a third-party expert to help you evaluate the assessment or walk away. When you use assessments, you take on the liability for actions resulting from them, so if the provider can’t clearly explain what the assessment is doing, they are not the provider for you.

Question 2: Is the assessment accurate?

Once you know what the assessment is measuring, evaluate the accuracy of that measurement.

This includes:

  • How reliable are the measurements? Reliability is another term for “consistency”. The more reliable an assessment, the more consistently it will produce the same results (someone taking the assessment today should get much the same results if they take it again tomorrow).

  • How big was the data set? Was the assessment validated with a hundred data points? A thousand? Ten thousand? The larger the data set, the better, but there is a minimum required before the results become meaningful (i.e. showing a clear pattern and not just lucky guesses):

    - For normative data (i.e. general data): several hundred, if not thousands.

    - For job-specific data: it depends on the job. It could be dozens, or it could be hundreds.

  • Who is represented in the data? Who took the assessment is just as important as how many people took it. For example, an assessment that measures cognitive ability (intelligence) given only to Harvard graduate students will not have an accurate distribution because it is only measuring relatively smart people. Look for a representative sample – i.e. a distribution that reflects who you want to measure in terms of location (the same country you plan to use it in), demographics, types of roles, education, etc.

  • How are the results protected? Assessments are no good if there is too much cheating affecting the results. Understand how the assessments are being protected from cheating. Common techniques include monitoring for leaked questions, monitoring results for anomalies, and randomizing question order. More sophisticated techniques also include randomly selecting questions from a larger library for each test taker or remote proctoring.
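The reliability check at the top of this list can be sketched in a few lines of Python. This assumes the same people took the assessment twice and uses a simple test-retest correlation, which is only one of several ways to estimate reliability. The scores and the function name `retest_correlation` are made up for illustration.

```python
from statistics import mean


def retest_correlation(first, second):
    """Pearson correlation between two sittings by the same people."""
    m1, m2 = mean(first), mean(second)
    cov = sum((a - m1) * (b - m2) for a, b in zip(first, second))
    var1 = sum((a - m1) ** 2 for a in first)
    var2 = sum((b - m2) ** 2 for b in second)
    return cov / (var1 * var2) ** 0.5


# Hypothetical scores: six people, tested on two different days.
day_one = [72, 65, 88, 54, 91, 60]
day_two = [70, 68, 85, 57, 93, 58]

reliability = retest_correlation(day_one, day_two)

# Values near 1 mean people land in about the same place both times;
# values near 0 mean the results are mostly noise.
print(f"test-retest correlation: {reliability:.2f}")
```

With only six people this number would not mean much, which is the same data-set-size point made above: ask the provider how many test takers their reliability figure is based on.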

Just as with what is measured, don’t be shy about asking a lot of questions when determining assessment accuracy, especially if you don’t have an expert handy.

Question 3: Does the assessment have minimal inappropriate adverse impact?

Many assessments have differences in performance between two defined groups. This is not in and of itself a bad thing.

For example, women are, on average, shorter than men – it is not bad, it just is. A tape measure is not inherently biased against women just because it measures women as generally shorter.

When the results are used to determine who is selected for a job, this difference is called adverse impact.

Adverse impact is when members of a protected class (such as women) are selected at a lower rate than another class (such as men).
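Comparing selection rates like this is simple arithmetic, sketched below in Python. The applicant counts are invented, and the 0.8 cutoff is the “four-fifths” rule of thumb from US employment guidelines, which this article does not itself prescribe. Treat it as one common screening heuristic, not a legal test.

```python
def selection_rate(selected, applicants):
    """Share of a group's applicants who were selected."""
    return selected / applicants


def impact_ratio(group_rate, comparison_rate):
    """Selection rate of one group relative to the comparison group."""
    return group_rate / comparison_rate


# Hypothetical counts: 12 of 60 women and 30 of 100 men were selected.
women_rate = selection_rate(12, 60)
men_rate = selection_rate(30, 100)

ratio = impact_ratio(women_rate, men_rate)

# Assumption: flag ratios below 0.8 (the four-fifths rule of thumb).
flagged = ratio < 0.8

print(f"impact ratio: {ratio:.2f}, flagged: {flagged}")
```

A flagged ratio only tells you that a difference exists. Whether that difference is appropriate depends on whether what is being measured actually pertains to the job.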

Inappropriate adverse impact is when members of a protected class are selected at a lower rate than another class for reasons that do not pertain to the job.

Inappropriate adverse impact happens when either:

A. The assessment is not accurately measuring what it claims to measure, which biases the results and gives one group an unfair advantage over another.

Example: an assessment that uses references to popular ’80s movies to measure ethics (which has nothing to do with ’80s movie knowledge) will put younger applicants at a disadvantage because they are less likely to have seen them.

B. What the assessment measures shows performance differences between two groups, and those results are unrelated to what is needed for the job, which gives one group an unfair advantage over another.

Example: engineering candidates are ranked by height, where men are generally taller and height has nothing to do with the job.

Bringing it all together

Finding the right assessment starts by distinguishing a good assessment from unicorn poop. Choose assessments that clearly measure what you want, accurately, and with minimal inappropriate adverse impact.

Want to know more about assessment selection? Check out the Assessment Selection whitepaper for a complete walkthrough of the process of evaluating and selecting assessments that are good, good for your job, and by providers that will be the right partners.

#NoBias #diversityandinclusion #assessments


