by Wu, Xiong, Yu and Lin (CVPR 2018)
Can a good feature representation be learnt that captures similarity among instances rather than classes, by asking features to be discriminative of individual instances?
Question: what is an instance? A single image.
Each image is treated as its own class under a nonparametric softmax over feature similarities. Then minimize the negative log likelihood over the training set.
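A minimal sketch of this objective, assuming the nonparametric softmax takes the form P(i | v) = exp(v_i·v/τ) / Σ_j exp(v_j·v/τ) over unit-norm features. The function names, the temperature parameter `tau` and its default value are assumptions made for illustration, not details taken from these notes.

```python
import math

def normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def instance_log_prob(i, v, bank, tau=0.07):
    """log P(i | v): softmax over inner products of v with every stored feature.

    `bank` is a list of unit-norm feature vectors, one per training instance.
    `tau` is an assumed temperature hyperparameter.
    """
    logits = [sum(a * b for a, b in zip(v_j, v)) / tau for v_j in bank]
    peak = max(logits)  # subtract the max before exponentiating, for stability
    log_z = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return logits[i] - log_z

def negative_log_likelihood(features, bank, tau=0.07):
    """Sum of -log P(i | f_i) over all training instances."""
    return -sum(instance_log_prob(i, f, bank, tau) for i, f in enumerate(features))
```

The cost of one evaluation grows with the number of instances, since every logit needs an inner product against the whole set of stored features.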
Instead of recomputing the features of every instance (and all pairwise inner products) at every gradient step, they keep a memory bank of stored features and overwrite an instance's entry only when that instance is processed. Values in the bank are initialized as random unit vectors.
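A memory bank along these lines might look like the sketch below. The class name `MemoryBank`, its method names, and the choice to draw Gaussian vectors and normalize them are all assumptions for illustration. The notes only say that the entries start as random unit vectors.

```python
import math
import random

class MemoryBank:
    """Hypothetical store of one unit-norm feature vector per training instance."""

    def __init__(self, num_instances, dim, seed=0):
        rng = random.Random(seed)
        # Assumed initialization: Gaussian samples rescaled to unit length.
        self.features = [
            self._unit([rng.gauss(0.0, 1.0) for _ in range(dim)])
            for _ in range(num_instances)
        ]

    @staticmethod
    def _unit(v):
        norm = math.sqrt(sum(x * x for x in v))
        return [x / norm for x in v]

    def similarities(self, v):
        """Inner product of v with every stored feature (no network pass needed)."""
        return [sum(a * b for a, b in zip(row, v)) for row in self.features]

    def update(self, index, new_feature):
        """Overwrite one instance's entry with its freshly computed feature."""
        self.features[index] = self._unit(new_feature)
```

Only the entries of instances in the current batch are touched by `update`, so the other entries may lag behind the current network weights.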
Replace the full softmax with noise contrastive estimation (NCE). The idea is to recast the problem as binary classification between data samples and noise samples. Since there's only one datum per instance, training is unstable.
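The binary classification view can be sketched as follows, assuming a uniform noise distribution over the n instances and a normalizing constant `z` that the caller supplies. The function name `nce_loss`, the parameters `num_noise`, `z` and `tau`, and the uniform noise choice are assumptions for illustration.

```python
import math
import random

def nce_loss(i, v, bank, num_noise, z, tau=0.07, rng=random):
    """NCE-style loss for feature v, whose true instance index is i.

    bank      : list of unit-norm feature vectors, one per instance
    num_noise : number of noise samples drawn per data sample
    z         : assumed fixed estimate of the softmax normalizing constant
    """
    n = len(bank)
    noise_mass = num_noise / n  # num_noise * P_noise, with P_noise = 1/n

    def model_prob(j):
        dot = sum(a * b for a, b in zip(bank[j], v))
        return math.exp(dot / tau) / z

    # Data term: the true instance should be classified as "data".
    p_pos = model_prob(i)
    loss = -math.log(p_pos / (p_pos + noise_mass))

    # Noise term: randomly drawn instances should be classified as "noise".
    # A draw may coincide with i; this sketch does not exclude that case.
    for _ in range(num_noise):
        j = rng.randrange(n)
        p_neg = model_prob(j)
        loss -= math.log(noise_mass / (p_neg + noise_mass))
    return loss
```

Each evaluation touches only `num_noise + 1` bank entries rather than all n, which is the point of the substitution.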
Add proximal optimization, i.e. penalize features for moving away from their previous values.
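One way to write such a penalty is a weighted squared distance between an instance's new feature and its previously stored one. The function name `proximal_penalty`, the squared L2 form and the weight `lam` are assumptions for illustration.

```python
def proximal_penalty(new_feature, old_feature, lam=1.0):
    """lam * ||new - old||^2, added to the per-instance loss.

    `lam` is an assumed hyperparameter controlling how strongly a feature
    is discouraged from drifting away from its previous value.
    """
    return lam * sum((a - b) ** 2 for a, b in zip(new_feature, old_feature))
```

The penalty is zero when the feature has not moved and grows quadratically with the size of the change, which damps large swings between updates.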
On CIFAR-10 (50K instances from 10 classes), the nonparametric softmax can be computed exactly. Then train a linear classifier and a nearest-neighbor classifier on the learned features to predict test data.
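The nearest-neighbor evaluation can be illustrated with a plain majority-vote k-NN over inner-product similarity. The function name `knn_predict`, the parameter `k` and the unweighted vote are assumptions. These notes do not record the exact voting scheme used in the experiments.

```python
from collections import Counter

def knn_predict(query, train_features, train_labels, k=1):
    """Label a query feature by majority vote among its k most similar training features."""
    sims = [sum(a * b for a, b in zip(f, query)) for f in train_features]
    # Indices of the k training features with the largest inner product.
    top = sorted(range(len(sims)), key=lambda j: sims[j], reverse=True)[:k]
    votes = Counter(train_labels[j] for j in top)
    return votes.most_common(1)[0][0]
```

No extra parameters are trained here, so accuracy reflects how well similarity in feature space tracks class membership.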
Note that NCE performs poorly unless the number of noise samples is very large.
Compare on ImageNet ILSVRC.