Combining text and image features with different scales

Question

I have computed text features using [SBERT][1] and image features using VGG-16. The text features range from -1.58 to 1.58, whereas the image features range from 0 to 521. I want to concatenate the text and image features and use them to compute cosine similarities. However, as you've probably noticed, the difference in scale means that the image features would completely dominate the text ones.

My idea was to use something like sklearn's MinMaxScaler and scale down the image features to the same scale as the SBERT computed features; however, I'm not sure if this is the best solution for my case since other [answers][2] here suggest normalizing both features. In my case, I would say that the text features are more important than the image ones.

[1]: https://github.com/UKPLab/sentence-transformers
[2]: Creating a feature by combining 2 features with different units?

dark horse · Answer 1 · 2023-02-24T19:36:51.707

In my view, the answer you found is appropriate, because it covers both regular normalization and weighting.

That answer normalizes both features, but I think this is of limited use in your project, because computing a cosine similarity already divides by the norms of the vectors being compared.
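To make concrete what the normalization inside cosine similarity does and does not cover, here is a minimal sketch. The vectors are made-up toy values, not real SBERT or VGG-16 outputs, and the per-block L2 normalization at the end is only one possible way of balancing the two blocks, not something taken from the linked answer.

```python
import numpy as np

def cos_sim(a, b):
    """Cosine similarity of two 1-D arrays."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical toy features: the texts point in opposite directions,
# the images are nearly identical.
text1, text2 = np.array([1.0, -0.5]), np.array([-1.0, 0.5])
image1, image2 = np.array([300.0, 100.0]), np.array([310.0, 90.0])

v1 = np.concatenate([text1, image1])
v2 = np.concatenate([text2, image2])

# Scaling a whole vector by a positive constant does not change the similarity.
same = bool(np.isclose(cos_sim(v1, v2), cos_sim(10 * v1, v2)))

# The relative scale of the blocks still matters: with raw values the
# image block decides the result although the texts are opposite.
raw = cos_sim(v1, v2)

# One option (an assumption of this sketch): L2-normalize each block first,
# so that text and image contribute equally.
b1 = np.concatenate([text1 / np.linalg.norm(text1), image1 / np.linalg.norm(image1)])
b2 = np.concatenate([text2 / np.linalg.norm(text2), image2 / np.linalg.norm(image2)])
balanced = cos_sim(b1, b2)

print(same, raw, balanced)
```

In this toy case the raw similarity is close to 1 while the balanced one is close to 0, so the scale of each block relative to the other has to be handled before concatenation.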

Instead, you can map the text features onto the range of the image features. I suggest something like this:

text_feature_v2 = [ele / 1.58 * 260.5 + 260.5 for ele in text_feature]
concated_feature = [*text_feature_v2, *text_feature_v2, *image_feature]

Here I concatenate two identical copies of the text feature to increase its importance.
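The two lines above can be wrapped into a small runnable sketch. The helper names `rescale_text`, `combine`, and `combine_weighted` are hypothetical, and the random inputs only stand in for real features. The sketch also checks a property of cosine similarity: repeating the text block k times gives the same result as multiplying it once by sqrt(k), so the importance of the text can be tuned with a continuous weight instead of whole copies.

```python
import numpy as np

def cos_sim(a, b):
    """Cosine similarity of two 1-D arrays."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def rescale_text(text_feature, text_max=1.58, image_max=521.0):
    """Map values from [-text_max, text_max] linearly onto [0, image_max]."""
    half = image_max / 2.0
    return np.asarray(text_feature, dtype=float) / text_max * half + half

def combine(text_feature, image_feature, repeats=2):
    """Concatenate `repeats` copies of the rescaled text block and the image block."""
    text_v2 = rescale_text(text_feature)
    return np.concatenate([np.tile(text_v2, repeats), image_feature])

def combine_weighted(text_feature, image_feature, weight=2.0):
    """Same idea, but scale the text block by sqrt(weight) instead of repeating it."""
    text_v2 = rescale_text(text_feature)
    return np.concatenate([np.sqrt(weight) * text_v2, image_feature])

# Random stand-ins for real SBERT and VGG-16 features (assumption of this sketch).
rng = np.random.default_rng(0)
text_a, text_b = rng.uniform(-1.58, 1.58, size=(2, 4))
image_a, image_b = rng.uniform(0, 521, size=(2, 4))

sim_repeated = cos_sim(combine(text_a, image_a), combine(text_b, image_b))
sim_weighted = cos_sim(combine_weighted(text_a, image_a),
                       combine_weighted(text_b, image_b))
print(sim_repeated, sim_weighted)  # the two values agree up to rounding
```

With `weight=2.0` the weighted variant reproduces the duplicated-text version; other values such as 1.5 or 3.0 give finer control over how much the text counts.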

Here is my Python code.

from numpy import dot
from numpy.linalg import norm
from random import randint
def rand_text_feature(dimension=4):
    """Returns a random text feature of the given dimension, already rescaled to [0, 521]."""
    res = [randint(0, 521) for _ in range(dimension)]
    return res
def rand_image_feature(dimension=4):
    """Returns dimension-sized array between [0, 521]."""
    res = [randint(0, 521) for _ in range(dimension)]
    return res
def cos_sim(arr1, arr2):
    """Returns Cosine similarity of two arrays."""
    return dot(arr1, arr2)/(norm(arr1)*norm(arr2))
# Prepare two pairs of features.
text_feature1 = rand_text_feature()
image_feature1 = rand_image_feature()
text_feature2 = rand_text_feature()
image_feature2 = rand_image_feature()
# Print the similarity of the texts and of the images.
print('similarity of two texts')
print(cos_sim(text_feature1, text_feature2))
print('similarity of two images')
print(cos_sim(image_feature1, image_feature2))
# Compute the cosine similarity of the plain concatenation.
# The lists are unpacked so that each feature is one flat vector.
feature1 = [*text_feature1, *image_feature1]
feature2 = [*text_feature2, *image_feature2]
print('similarity of concatenated feature')
print(cos_sim(feature1, feature2))
# Compute the cosine similarity according to my proposal
# (the text feature is included twice to give it more weight).
enhanced_feature1 = [*text_feature1, *text_feature1, *image_feature1]
enhanced_feature2 = [*text_feature2, *text_feature2, *image_feature2]
print('similarity of concatenated feature enhancing text')
print(cos_sim(enhanced_feature1, enhanced_feature2))

And this was the result.

similarity of two texts
0.8618949874358144
similarity of two images
0.598022653964154
similarity of concatenated feature
0.7335241784245647
similarity of concatenated feature enhancing text
0.7767832080432862

If the texts are more similar than the images, my approach gives a higher similarity than the plain concatenation; otherwise, it gives a lower one.
