Understanding the Rasch Model with Python

Ever wonder how standardized tests figure out which questions are "hard" and which students are "good"? The Rasch model does exactly that.

I've used it to analyze survey data and build adaptive tests. It's simpler than you'd think, and way more useful than traditional scoring methods.

What's Item Response Theory?

Classical test scoring is crude. Add up the points, divide by total. Done.

But that approach has problems:

  • A hard test makes everyone look bad
  • An easy test makes everyone look good
  • You can't compare results across different tests
  • You can't tell which questions actually measure ability

Item Response Theory (IRT) fixes this. Instead of just counting points, it models:

  • How good each student is
  • How hard each question is
  • The relationship between ability and getting questions right

The Rasch Model: One Parameter to Rule Them All

Georg Rasch (Danish mathematician, 1960s) created the simplest useful IRT model. It's called the one-parameter model because each item gets exactly one parameter. Just two things matter:

  1. Person ability (theta, θ)
  2. Item difficulty (b)

The probability someone answers correctly:

P(correct) = 1 / (1 + e^(-(θ - b)))

That's it. If your ability is way higher than the question difficulty, you'll probably get it right. If the question is way harder than your ability, you'll probably get it wrong.

Why This Matters

Traditional scoring:

  • Student A gets 7/10 on test X
  • Student B gets 7/10 on test Y
  • Are they equally skilled? Who knows.

Rasch model:

  • Student A has ability θ = 1.5
  • Student B has ability θ = 1.5
  • They're equally skilled, even if they took different tests

The measurements are on the same scale. Like using meters instead of "about 5 shoe lengths."

Building It in Python

Let's implement this. Not with some black-box library, but from scratch so you see how it works.

The Basic Model

import numpy as np

def rasch_probability(ability, difficulty):
    """
    Calculate probability of correct response.

    Args:
        ability: Person's ability level (θ)
        difficulty: Item's difficulty level (b)

    Returns:
        Probability of correct response (0 to 1)
    """
    return 1.0 / (1.0 + np.exp(-(ability - difficulty)))

# Example
person_ability = 1.5
question_difficulty = 1.0

prob = rasch_probability(person_ability, question_difficulty)
print(f"Probability of correct answer: {prob:.2%}")
# => Probability of correct answer: 62.25%

Higher ability than difficulty = good chance of getting it right.

Multiple Questions

Real tests have multiple questions:

# Test with 5 questions
difficulties = np.array([0.5, 1.0, 1.5, 2.0, 2.5])

# Student with ability 1.5
student_ability = 1.5

# Probability for each question
probabilities = rasch_probability(student_ability, difficulties)

for i, (diff, prob) in enumerate(zip(difficulties, probabilities), 1):
    print(f"Question {i} (difficulty {diff}): {prob:.2%} chance")

# Output:
# Question 1 (difficulty 0.5): 73.11% chance
# Question 2 (difficulty 1.0): 62.25% chance
# Question 3 (difficulty 1.5): 50.00% chance
# Question 4 (difficulty 2.0): 37.75% chance
# Question 5 (difficulty 2.5): 26.89% chance

Notice question 3? When ability equals difficulty, you've got a coin flip (50%).

Estimating Parameters from Real Data

That's great, but how do you get these ability and difficulty numbers in the first place?

You have test responses. You need to estimate parameters. This requires optimization.

Joint Maximum Likelihood Estimation

from scipy.optimize import minimize
import numpy as np

def rasch_likelihood(params, responses):
    """
    Calculate negative log-likelihood for Rasch model.

    Args:
        params: Concatenated abilities and difficulties
        responses: Matrix of responses (people × items)

    Returns:
        Negative log-likelihood
    """
    n_people, n_items = responses.shape

    # Split params into abilities and difficulties
    abilities = params[:n_people]
    difficulties = params[n_people:]

    # Calculate probabilities for all person-item pairs
    log_likelihood = 0

    for i in range(n_people):
        for j in range(n_items):
            if not np.isnan(responses[i, j]):
                prob = rasch_probability(abilities[i], difficulties[j])

                # Add to log-likelihood
                if responses[i, j] == 1:
                    log_likelihood += np.log(prob + 1e-10)
                else:
                    log_likelihood += np.log(1 - prob + 1e-10)

    return -log_likelihood  # Negative because we minimize

def estimate_rasch_parameters(responses):
    """
    Estimate abilities and difficulties from response data.

    Args:
        responses: Matrix of 1s (correct) and 0s (incorrect)

    Returns:
        Tuple of (abilities, difficulties)
    """
    n_people, n_items = responses.shape

    # Initial guess: everyone average ability, items average difficulty
    initial_params = np.zeros(n_people + n_items)

    # Optimize
    result = minimize(
        rasch_likelihood,
        initial_params,
        args=(responses,),
        method='BFGS'
    )

    # Extract results
    abilities = result.x[:n_people]
    difficulties = result.x[n_people:]

    # The scale is only identified up to a constant shift: adding the
    # same number to every ability and every difficulty leaves theta - b
    # unchanged. Anchor it by centering item difficulties at zero.
    shift = difficulties.mean()
    abilities = abilities - shift
    difficulties = difficulties - shift

    return abilities, difficulties

# Example: 4 people, 5 questions
responses = np.array([
    [1, 1, 1, 0, 0],  # Person 1: got first 3 right
    [1, 1, 0, 0, 0],  # Person 2: got first 2 right
    [1, 1, 1, 1, 0],  # Person 3: got first 4 right
    [0, 1, 0, 0, 0],  # Person 4: only got #2 right
])

abilities, difficulties = estimate_rasch_parameters(responses)

print("Person abilities:")
for i, ability in enumerate(abilities, 1):
    print(f"  Person {i}: {ability:.2f}")

print("\nItem difficulties:")
for i, difficulty in enumerate(difficulties, 1):
    print(f"  Question {i}: {difficulty:.2f}")

This estimates both abilities and difficulties from the response pattern. Magic! One wrinkle: only the differences θ - b are identified, which is why the function anchors the scale by centering item difficulties at zero before returning.

Real-World Use Case: Survey Analysis

I used this to analyze Likert scale surveys (Strongly Disagree to Strongly Agree).

# Convert 5-point Likert to binary
# 4-5 = agree (1), 1-3 = disagree (0)
def likert_to_binary(responses, threshold=4):
    return (responses >= threshold).astype(int)

# Survey responses (5 people, 10 questions, 1-5 scale)
survey_data = np.array([
    [5, 5, 4, 3, 3, 4, 5, 4, 2, 1],
    [4, 4, 4, 3, 2, 3, 4, 3, 2, 2],
    [5, 5, 5, 4, 4, 5, 5, 5, 3, 2],
    [3, 3, 3, 3, 3, 3, 3, 3, 3, 3],
    [2, 2, 2, 2, 2, 2, 3, 2, 4, 4],
])

# Convert to binary
binary_responses = likert_to_binary(survey_data)

# Estimate parameters
abilities, difficulties = estimate_rasch_parameters(binary_responses)

# Interpret results
print("Most agreeable person:", np.argmax(abilities) + 1)
print("Least agreeable person:", np.argmin(abilities) + 1)
print("\nEasiest question to agree with:", np.argmin(difficulties) + 1)
print("Hardest question to agree with:", np.argmax(difficulties) + 1)

This tells you which statements respondents find easiest or hardest to agree with, and which respondents are most or least agreeable overall. One caveat: a respondent whose row becomes all 0s or all 1s after the binary conversion (Person 4's row turns into all 0s) has no well-defined maximum-likelihood estimate; their ability just drifts toward the end of the scale. In practice you drop those rows or switch to a Bayesian estimator.
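
A minimal guard, as a sketch reusing binary_responses and estimate_rasch_parameters from above, drops those degenerate rows before estimating:

# Keep respondents who agreed with at least one item but not all of them
row_scores = binary_responses.sum(axis=1)
n_items = binary_responses.shape[1]
keep = (row_scores > 0) & (row_scores < n_items)

filtered = binary_responses[keep]
abilities_f, difficulties_f = estimate_rasch_parameters(filtered)
print(f"Dropped {len(binary_responses) - len(filtered)} degenerate respondent(s)")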

Computerized Adaptive Testing

The cool part: you can use this for adaptive tests. Pick the next question based on current ability estimate.

def adaptive_test(item_bank, n_questions=10):
    """
    Simulate an adaptive test.

    Args:
        item_bank: Array of item difficulties
        n_questions: Number of questions to ask

    Returns:
        Final ability estimate
    """
    ability_estimate = 0.0  # Start at average
    asked_items = []

    for i in range(n_questions):
        # Find item closest to current ability estimate
        # (most informative item)
        available_items = [
            (idx, diff) for idx, diff in enumerate(item_bank)
            if idx not in asked_items
        ]

        if not available_items:
            break

        # Pick item closest to current ability
        best_item = min(
            available_items,
            key=lambda x: abs(x[1] - ability_estimate)
        )
        item_idx, item_diff = best_item

        # Simulate response based on true ability (for demo)
        true_ability = 1.2
        prob = rasch_probability(true_ability, item_diff)
        response = np.random.random() < prob

        # Update ability estimate (simple method)
        if response:
            ability_estimate += 0.3
        else:
            ability_estimate -= 0.3

        asked_items.append(item_idx)

        print(f"Q{i+1}: Difficulty {item_diff:.2f}, "
              f"Response: {'Correct' if response else 'Wrong'}, "
              f"Ability est: {ability_estimate:.2f}")

    return ability_estimate

# Item bank with various difficulties
item_bank = np.linspace(-2, 2, 20)

final_ability = adaptive_test(item_bank, n_questions=10)
print(f"\nFinal ability estimate: {final_ability:.2f}")

Each question adapts to the student's level. More efficient than giving everyone the same test.
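
The ±0.3 step above is just a placeholder. A more principled update re-estimates ability by maximum likelihood over the items answered so far, treating item difficulties as known. Here's a sketch reusing rasch_probability and numpy from earlier; estimate_ability_mle is a made-up helper name, and the (-4, 4) search range is an arbitrary choice:

from scipy.optimize import minimize_scalar

def estimate_ability_mle(difficulties_asked, responses_so_far):
    """ML ability estimate given answered items with known difficulties."""
    diffs = np.asarray(difficulties_asked, dtype=float)
    resp = np.asarray(responses_so_far, dtype=float)

    def neg_log_likelihood(theta):
        p = rasch_probability(theta, diffs)
        p = np.clip(p, 1e-10, 1 - 1e-10)  # guard against log(0)
        return -np.sum(resp * np.log(p) + (1 - resp) * np.log(1 - p))

    # A bounded search keeps all-correct / all-wrong patterns from
    # pushing the estimate toward +/- infinity
    result = minimize_scalar(neg_log_likelihood, bounds=(-4, 4), method='bounded')
    return result.x

# After three answered items: got the two easier ones, missed the harder one
print(estimate_ability_mle([0.5, 1.0, 1.5], [1, 1, 0]))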

When to Use the Rasch Model

Good for:

  • Educational testing
  • Survey analysis
  • Psychometric assessments
  • Adaptive testing systems
  • When you need comparable scores across different test forms

Not good for:

  • Very short tests (need 10+ items)
  • When items don't measure the same construct
  • Multiple choice where random guessing matters (use the 3PL model instead)
  • When discrimination varies wildly between items (use the 2PL model; both are sketched after this list)
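
For reference, here's a sketch of those two extensions: the 2PL gives each item a discrimination parameter a, and the 3PL adds a guessing floor c. Set a = 1 and c = 0 and you're back to the Rasch model.

def two_pl_probability(ability, difficulty, discrimination):
    """2PL: items can differ in how sharply they separate low and high ability."""
    return 1.0 / (1.0 + np.exp(-discrimination * (ability - difficulty)))

def three_pl_probability(ability, difficulty, discrimination, guessing):
    """3PL: adds a lower asymptote for lucky guesses on multiple choice."""
    return guessing + (1.0 - guessing) * two_pl_probability(ability, difficulty, discrimination)

print(two_pl_probability(1.5, 1.0, 1.0))          # same as rasch_probability(1.5, 1.0)
print(three_pl_probability(1.5, 1.0, 1.0, 0.25))  # 25% floor from guessing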

Libraries That Do This Better

For production, use established libraries:

# py-irt - Pure Python IRT (interface here is illustrative; check its docs for the exact API)
from py_irt import rasch

# Fit model
model = rasch(data=response_matrix)
abilities = model.abilities
difficulties = model.difficulties

# Or use R's mirt package (more features)
# Or use R's TAM package (very comprehensive)

But building it yourself teaches you what's actually happening.

The Math, Plain English

The logistic function:

  • Output is always between 0 and 1 (perfect for probability)
  • When ability = difficulty, output = 0.5
  • Higher ability OR lower difficulty = higher probability
  • The curve is S-shaped (gradual at extremes, steep in middle)

That's why it works. Simple, elegant, effective.
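
A quick numeric check of those properties, reusing rasch_probability from earlier:

print(rasch_probability(1.0, 1.0))     # ability == difficulty -> exactly 0.50
print(rasch_probability(10.0, -10.0))  # huge advantage -> approaches 1.0
print(rasch_probability(-10.0, 10.0))  # huge deficit -> approaches 0.0

# Only the gap (theta - b) matters, not the absolute values
print(rasch_probability(2.0, 1.0), rasch_probability(0.5, -0.5))  # same number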

Bottom Line

The Rasch model turns messy test data into clean measurements. It separates person ability from item difficulty.

Traditional scoring: "You got 15 out of 20."

Rasch model: "Your ability is 1.8 logits above average."

The second one you can actually compare across different tests, different populations, different times.

Not magic. Just good math.