Mean average precision computed at k (for top-k elements in the answer), according to wiki, ml metrics at kaggle, and this answer: Confusion about (Mean) Average Precision should be computed as mean of average precisions at k, where average precision at k is computed as:
Where: P(i) is the precision at cut-off i in the list; rel(i) is an indicator function equaling 1 if the item at rank i is a relevant document, zero otherwise.
The divider min(k, number of relevant documents) has the meaning of maximum possible number of relevant entries in the answer.
Is this understanding correct?
Is MAP@k always less than MAP computed for all ranked list?
My concern is that, this is not how MAP@k is computed in many works.
It is typical, that the divider is not min(k, number of relevant documents), but the number of relative documents in the top-k. This approach will give higher value of MAP@k.
HashNet: Deep Learning to Hash by Continuation" (ICCV 2017)
Code: https://github.com/thuml/HashNet/blob/master/pytorch/src/test.py#L42-L51
for i in range(query_num):
label = validation_labels[i, :]
label[label == 0] = -1
idx = ids[:, i]
imatch = np.sum(database_labels[idx[0:R], :] == label, axis=1) > 0
relevant_num = np.sum(imatch)
Lx = np.cumsum(imatch)
Px = Lx.astype(float) / np.arange(1, R+1, 1)
if relevant_num != 0:
APx.append(np.sum(Px * imatch) / relevant_num)
Where relevant_num is not the min(k, number of relevant documents), but number of relevant documents in the result, which is not the same as total number of relative documents or k.
Am I reading wrong the code?
Deep Visual-Semantic Quantization of Efficient Image Retrieval CVPR 2017
Code: https://github.com/caoyue10/cvpr17-dvsq/blob/master/util.py#L155-L178
def get_mAPs_by_feature(self, database, query):
ips = np.dot(query.output, database.output.T)
#norms = np.sqrt(np.dot(np.reshape(np.sum(query.output ** 2, 1), [query.n_samples, 1]), np.reshape(np.sum(database.output ** 2, 1), [1, database.n_samples])))
#self.all_rel = ips / norms
self.all_rel = ips
ids = np.argsort(-self.all_rel, 1)
APx = []
query_labels = query.label
database_labels = database.label
print "#calc mAPs# calculating mAPs"
bar = ProgressBar(total=self.all_rel.shape[0])
for i in xrange(self.all_rel.shape[0]):
label = query_labels[i, :]
label[label == 0] = -1
idx = ids[i, :]
imatch = np.sum(database_labels[idx[0: self.R], :] == label, 1) > 0
rel = np.sum(imatch)
Lx = np.cumsum(imatch)
Px = Lx.astype(float) / np.arange(1, self.R+1, 1)
if rel != 0:
APx.append(np.sum(Px * imatch) / rel)
bar.move()
print "mAPs: ", np.mean(np.array(APx))
return np.mean(np.array(APx))
Where divider is rel, which is computed as np.sum(imatch), where imatch is a binary vector that indicates if the entry is relevant or not. The problem is that it takes only first R: imatch = np.sum(database_labels[idx[0: self.R], :] == label, 1) > 0. So np.sum(imatch) will give number of relevant entries in the returned list of size R, but not min(R, number of relevant entries). And note that values of R used in the paper are less than number of entries in DB.
Deep Learning of Binary Hash Codes for Fast Image Retrieval (CVPR 2015)
Code: https://github.com/kevinlin311tw/caffe-cvprw15/blob/master/analysis/precision.m#L30-L55
buffer_yes = zeros(K,1);
buffer_total = zeros(K,1);
total_relevant = 0;
for j = 1:K
retrieval_label = trn_label(y2(j));
if (query_label==retrieval_label)
buffer_yes(j,1) = 1;
total_relevant = total_relevant + 1;
end
buffer_total(j,1) = 1;
end
% compute precision
P = cumsum(buffer_yes) ./ Ns';
if (sum(buffer_yes) == 0)
AP(i) = 0;
else
AP(i) = sum(P.*buffer_yes) / sum(buffer_yes);
end
Here the divider is sum(buffer_yes) which is number of the relative documents in the returned list of size k, not min(k, number of relevant documents).
"Supervised Learning of Semantics-Preserving Deep Hashing" (TPAMI 2017)
Code: https://github.com/kevinlin311tw/Caffe-DeepBinaryCode/blob/master/analysis/precision.m
Code is the same as in the previouse paper.
Learning Compact Binary Descriptors with Unsupervised Deep Neural Networks (CVPR 2016)
Same code: https://github.com/kevinlin311tw/cvpr16-deepbit/blob/master/analysis/precision.m#L32-L55
Am I missing something? Is the code in the papers above correct? Why it does not coincide with https://github.com/benhamner/Metrics/blob/master/Python/ml_metrics/average_precision.py#L25-L39 ?
Update
I found this closed issue, referring the same problem: https://github.com/thuml/HashNet/issues/2
Is claim the following claim correct?
AP is a ranking metric. If the top 2 retrievals in the ranked list are relevant (and only the top 2), AP is 100%. You're talking about Recall, which in this case is indeed 0.2%.
From my understanding, if we treat AP as area under PR curve, the claim above is not correct.
P.S. I was in doubt if this should go to Cross Validated or to StackOverflow. If you think that it is better to place it to Cross Validated I don't mind. My reasoning was that it is not a theoretical question, but implementation one with reference to actual code.
