Deducing correct answers from multiple choice exams

Question

I am looking for an algorithmic way to solve the following problem.

Problem

Say we are given a multiple choice test with 100 questions and 4 answers per question, exactly one of which is correct. Each correct answer is worth one point; wrong answers are worth zero points. Suppose we have a database D of many answer sheets and their corresponding scores, e.g.

D:= { ('ABAA...', 80), ('ABAB...', 80), ('ABAC...', 80), ('ABAD...', 81), ... }

How can we find out which answers are correct? I am not looking for something probabilistic, but for answers which are definitely correct.

Some ideas

There are some obvious strategies like:

Look for someone who scored 100 points; their sheet gives you all the answers.
Look for tests whose answers differ in exactly one question (in the example database given, we can deduce that the answer to question 4 is "D").
Look for someone who answered everything wrong; all of their answers can be ruled out.
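The three strategies above can be sketched in a few lines of Python. This is only an illustration: the function name `deduce_from_pairs` and the 4-question database are made up for the example (the 100-question sheets in the question are elided, so they are not reproduced here).

```python
from itertools import combinations

def deduce_from_pairs(database, choices="ABCD"):
    """Return, per question, the set of answers not yet ruled out."""
    n = len(database[0][0])
    options = [set(choices) for _ in range(n)]
    for sheet, score in database:
        if score == n:                     # perfect sheet: every answer is correct
            for i, a in enumerate(sheet):
                options[i] &= {a}
        elif score == 0:                   # everything wrong: rule all answers out
            for i, a in enumerate(sheet):
                options[i].discard(a)
    for (s, ps), (t, pt) in combinations(database, 2):
        diff = [i for i in range(n) if s[i] != t[i]]
        if len(diff) != 1:                 # only pairs differing in one question
            continue
        i = diff[0]
        if ps == pt:                       # same score: both answers are wrong
            options[i] -= {s[i], t[i]}
        elif ps > pt:                      # s gained the point, so s[i] is correct
            options[i] &= {s[i]}
        else:
            options[i] &= {t[i]}
    return options

# Hypothetical 4-question database, consistent with the key "ABCD".
db = [("ABAA", 2), ("ABAB", 2), ("ABAC", 2), ("ABAD", 3)]
print(deduce_from_pairs(db))
```

For this toy database only question 4 is pinned down (to "D"); the other three questions keep all four options, which is exactly why more general reasoning is needed.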

But what information can we get from other combinations of answers?

Viewing the answer sheets as a metric space we get

100 - S(test) = d(test, correct)

where d(.,.) is the Hamming distance, S(.) is the score, and correct is the correct answer sheet.

Maybe someone could give me a reformulation of the problem, which yields a more obvious implementation. Any contribution is appreciated.

Edit:

Not considering computational complexity, couldn't I achieve something by intersecting the balls $$ \bigcap_i B_{d(t_i,\textrm{correct})}(t_i), $$ with tests $t_i$ and balls $B_d(x) := \{y: d(x,y)\leq d\}$?
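Ignoring complexity, the intersection can be computed by brute force for very small tests. The sketch below is an assumption-laden illustration, not a practical method: it enumerates all candidate keys and keeps those consistent with every sheet. The names `hamming` and `consistent_keys`, the `exact` flag, and the toy database are introduced here for the example. With `exact=True` it intersects spheres (distance equal to the missing points), with `exact=False` the balls defined above.

```python
from itertools import product

def hamming(a, b):
    """Number of positions in which two equal-length sheets differ."""
    return sum(x != y for x, y in zip(a, b))

def consistent_keys(database, choices="ABCD", exact=True):
    """Enumerate all keys and keep those compatible with every (sheet, score)."""
    n = len(database[0][0])
    keys = []
    for cand in product(choices, repeat=n):      # 4**n candidates: small n only
        if exact:
            ok = all(hamming(cand, s) == n - p for s, p in database)
        else:
            ok = all(hamming(cand, s) <= n - p for s, p in database)
        if ok:
            keys.append("".join(cand))
    return keys

# Hypothetical 4-question database, consistent with the key "ABCD".
db = [("ABAA", 2), ("ABAB", 2), ("ABAC", 2), ("ABAD", 3)]
print(consistent_keys(db, exact=True))
print(consistent_keys(db, exact=False))
```

Since the true key always has distance exactly 100 - S(t_i) from each test, the sphere version never returns more candidates than the ball version.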

I think you mean a multiple choice exam; each question has multiple choices (4). — Andrew Kelley, May 26 '14 at 17:08
Hm. From what I'm used to, 'multiple choice' means more than one answer can be correct, which gives 2^4(-1) possible ways of answering a question. And single choice means: pick one answer per question. [Coming from a german speaking country.] — knedlsepp, May 26 '14 at 17:11
I think this is similar to the board game Mastermind. There is a lot of research on algorithms for this game. — Nate Eldredge, May 28 '14 at 00:27
@Nate, I totally agree. The problem Number Mind on the well-known projecteuler site was based on Mastermind and is essentially equivalent. — Andrew Kelley, Aug 28 '14 at 00:00

score 3 · Accepted Answer · edited Jun 12 '20 at 10:38

Update:

Today, as I was looking through some interesting math/programming problems on projecteuler, I noticed Number Mind (problem number 185), and it immediately reminded me of this math.SE question; they are practically equivalent. Searching for a solution to the projecteuler problem, I found a solution (written in Python) from a sister site: codereview.SE. (I haven't actually read it though.)

What follows is what I originally posted.

Some Observations:

1. Depending on the database D, we may not be able to determine the correct answers. For example, if question number 100 is very tricky and as a result everyone chooses C or D, even though the correct answer is A, then we cannot determine the correct answer.
2. Let's fix some notation: Let $\mathcal{A}$ be the correct answers and $\mathcal{\hat{A}}$ be some guess of $\mathcal{A}$. (So both are strings, 100 letters long.) Let $t_i$ be a test whose actual score is 85; say $S_{\mathcal{A}}(t_i) = 85$. Further, let's assume that if we rescore $t_i$ according to $\mathcal{\hat{A}}$, we get a new score $S_{\mathcal{\hat{A}}}(t_i) = 90$. Then we have lower and upper bounds for $d(\mathcal{A},\mathcal{\hat{A}})$, the number of positions in which they differ: $$5=|90-85| \leq d(\mathcal{A},\mathcal{\hat{A}}) \leq (100-85) + (100-90) = 25.$$ The first inequality holds because the test $t_i$ is graded incorrectly by $\mathcal{\hat{A}}$ for at least 5 questions. The second is the triangle inequality: $d(\mathcal{A},\mathcal{\hat{A}}) \leq d(\mathcal{A},t_i) + d(t_i,\mathcal{\hat{A}})$. (Notice that $d(\mathcal{A},t_i) = 100 - S_{\mathcal{A}}(t_i)$ etc.) The second inequality is sharp, because we can imagine all 15 wrong answers of $t_i$ being counted as correct (by $\mathcal{\hat{A}}$) and 10 of its correct answers being counted as incorrect (by $\mathcal{\hat{A}}$).
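To make the two bounds concrete, here is a small Python check. It is a sketch only: the helper names and the particular strings are invented for illustration, chosen so that the upper bound of 25 is actually attained.

```python
def hamming(a, b):
    """Number of positions in which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def score(sheet, key):
    """Number of questions on which the sheet agrees with the key."""
    return sum(x == y for x, y in zip(sheet, key))

def distance_bounds(n, actual_score, rescored):
    """Lower and upper bound on d(key, guess) from a single test."""
    lower = abs(rescored - actual_score)            # at least this many misgradings
    upper = (n - actual_score) + (n - rescored)     # triangle inequality
    return lower, upper

# Illustrative strings (not from the question): key, one test, and a guess.
key   = "A" * 100
sheet = "B" * 15 + "A" * 85              # 15 wrong answers, actual score 85
guess = "B" * 15 + "C" * 10 + "A" * 75   # counts the 15 wrong as right, 10 right as wrong

s_actual = score(sheet, key)      # 85
s_guess  = score(sheet, guess)    # 90
print(distance_bounds(100, s_actual, s_guess), hamming(key, guess))
```

Here the guess differs from the key in exactly 25 positions, matching the upper bound.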

I guess one could also view the first inequality as the reverse triangle inequality |d(x,z)-d(y,z)|<=d(x,y) using a completely wrong z with S(z)=0. — knedlsepp, May 26 '14 at 19:31

score 1 · Answer 2 · answered May 27 '14 at 23:17

Here's what I've got so far. I gave it some thought, and it seems to me that the problem is very similar to a trilateration problem.

My first approach is to intersect all the spheres $S_d(x) := \{y: d(x,y)= d\}$, with $x$ being a test and $d$ being the number of points it is missing from a perfect score.

For obvious reasons it doesn't perform well for the number of questions I was originally going for, but for a small number of questions it seems to work.
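To see why this does not scale, one can count the points on a Hamming sphere: choose which $d$ of the $n$ positions change, then one of the $k-1$ other letters for each. The helper `sphere_size` below is a name made up for this sketch, and `math.comb` requires Python 3.8 or later.

```python
from math import comb

def sphere_size(n, d, k=4):
    """Number of sheets at Hamming distance exactly d from a given sheet."""
    return comb(n, d) * (k - 1) ** d

# 8 questions, a sheet with score 2 (distance 6 from the key):
print(sphere_size(8, 6))

# 100 questions, a sheet with score 80 (distance 20 from the key):
print(sphere_size(100, 20))

# Sanity check: the spheres of all radii partition the whole space.
assert sum(sphere_size(8, d) for d in range(9)) == 4 ** 8
```

For the 8-question case this is 20,412 candidates per sphere, which is easy to enumerate; for 100 questions the count is far too large to list explicitly.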

Code

Here is the Python code I wrote to compute $$ \bigcap_i S_{d(t_i,\textrm{correct})}(t_i): $$

import itertools
import random

def hammingDistance(s1, s2):
    assert len(s1) == len(s2)
    return sum(ch1 != ch2 for ch1, ch2 in zip(s1, s2))

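def hammingSphere(word, distance, alphabet):
    # Yield every sheet at Hamming distance exactly `distance` from `word`.
    positions = itertools.combinations(range(len(word)), distance)
    for pos in positions:  # which positions get changed
        letterCombinations = itertools.product(alphabet, repeat=distance)
        for letters in letterCombinations:  # which letters are placed there
            tmp = list(word)
            for i, l in zip(pos, letters):
                if tmp[i] == l:
                    break  # letter unchanged, so the distance would be smaller
                else:
                    tmp[i] = l
            else:
                # for-else: reached only if the inner loop did not break
                yield tuple(tmp)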


def findPossiblyCorrectAnswers(tests, choices):
    bestTest = max(tests, key=lambda x: x[1])
    maxPoints = len(bestTest[0])
    possibleAnswers = set(hammingSphere(bestTest[0], maxPoints-bestTest[1], choices))
    for test in tests:
        possibleAnswers = possibleAnswers.intersection(set(hammingSphere(test[0], maxPoints-test[1], choices)))
    return possibleAnswers

def remainingOptions(tests, choices):
    return [set(x) for x in zip(*findPossiblyCorrectAnswers(tests, choices))]

def buildTests(correct, alphabet, numTest):
    tests = []
    maxPoints = len(correct)
    for i in range(numTest):
        tests.append([random.choice(alphabet) for _ in range(len(correct))])
    return [(t, maxPoints-hammingDistance(t, correct)) for t in tests]

### TESTING:

correct = "AAAAAAAA" # The correct answer
choices = "ABCD"
numTests = 8
tests = buildTests(correct, choices, numTests) # Build some tests with known correct answer sheet
print(tests) # Display the test database
print(remainingOptions(tests, choices)) # Display remaining choices

Results

It yields the following output (shown as printed by Python 2):

[(['B', 'A', 'C', 'C', 'C', 'B', 'D', 'A'], 2),
(['C', 'D', 'C', 'B', 'C', 'B', 'A', 'C'], 1),
(['B', 'C', 'D', 'D', 'C', 'D', 'C', 'C'], 0),
(['C', 'D', 'C', 'A', 'D', 'D', 'D', 'C'], 1),
(['A', 'C', 'D', 'C', 'D', 'B', 'A', 'B'], 2),
(['C', 'C', 'D', 'B', 'A', 'B', 'A', 'C'], 2),
(['A', 'A', 'D', 'D', 'B', 'C', 'D', 'B'], 2),
(['C', 'B', 'D', 'B', 'A', 'A', 'D', 'D'], 2)]

[set(['A']), set(['A']), set(['A', 'B']), set(['A']), set(['A']), set(['A', 'B']), set(['A', 'B']), set(['A', 'D'])]

These are the filled-out tests with their scores, followed by the answers that remain possible for each question.

So for eight people randomly answering a multiple choice test with eight questions, quite a lot of information can be squeezed from it.

score 0 · Answer 3 · answered May 27 '14 at 02:11

The grade for each paper is a linear function of the selected alternatives (the coefficient is 0 for a wrong alternative and 1 for the right one). Given the right mix of graded papers, you can determine all coefficients, and thus all right answers. Viewed this way, it is a question of linear independence of the set of papers.

Perhaps attacking this as an ANOVA (multivariable linear regression) proves useful.
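As a rough illustration of the linear view, each paper can be encoded as a 0/1 indicator vector (one entry per question and alternative), and the score is then a dot product with the indicator vector of the key. The names `one_hot` and `score_linear` and the 4-question strings are assumptions made for this sketch. Note that it only demonstrates linearity of the score; recovering the key additionally requires the coefficients to be 0/1 with exactly one 1 per question, which plain linear algebra does not enforce.

```python
def one_hot(sheet, choices="ABCD"):
    """0/1 vector with one entry per (question, alternative) pair."""
    return [1 if a == c else 0 for a in sheet for c in choices]

def score_linear(sheet, key, choices="ABCD"):
    """Score written as a dot product of indicator vectors."""
    x = one_hot(sheet, choices)   # what the student selected
    w = one_hot(key, choices)     # coefficients: 1 for right, 0 for wrong
    return sum(xi * wi for xi, wi in zip(x, w))

# Hypothetical 4-question example.
print(score_linear("ABAD", "ABCD"))
```

Each graded paper then contributes one linear equation in the unknown coefficient vector.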

vonbrand

I'm having trouble filling in the details. How do we have the coefficients for even a single test? Also, what exactly is the vector space in question? (The coefficients are multiplying vectors in what space?) – Andrew Kelley May 27 '14 at 02:58
In addition to the problems that @AndrewKelley listed, I'm quite sure that linear regression with respect to the score would only yield a probabilistic best approach, from which one can't deduce correct answers for certain. (Imagine the case of his "observation 1.") – knedlsepp May 27 '14 at 08:01
@AndrewKelley, the vector space is just the 1/0 (selected/not selected) set of answers. Sure, this allows for multiple answers. – vonbrand May 27 '14 at 10:54