How to Select Effective Features from Color-Based Image Statistics for Glucose Prediction?

Question

I am working on a regression task to estimate glucose concentration from image data. The images are of reagent test strips, where a chemical reagent reacts with a blood sample and changes colour (ideally brown, but contaminated by red due to blood spillover). I have only about $108$ images of the strips (possibly a few more in the future), each with the corresponding glucose value of the patient. From each strip I extract the ROI (where the reaction takes place) and compute statistical features from it in different colour spaces (RGB, HSV, LAB, LUV, YCbCr).

Each sample image results in $32$ tabular features such as $r_{mean}, g_{std}, value_{mean}, saturation_{std}, u_{mean}, cr_{std}$, etc. These are fed into a hybrid deep learning model that takes both the raw image and these extracted features.

These images were taken under the same lighting and background conditions.

However, I’ve encountered a few issues:

Only around $14$ of the features show even moderate correlation $(\ge 0.3)$ with the glucose value.

After removing highly collinear features $(\rho > 0.9)$, only $2$ features remain.

The ROI is largely circular and color-based, so spatial features (from CNN) may not be capturing much.

Despite using RGB and other color spaces, red-channel dominance due to blood contamination biases the input ($r_{mean}$ has the largest value of all the features).

My Questions:

Should I keep moderately correlated or repeating features in this context, especially since I'm using a CNN + tabular hybrid model?
Is it better to aggressively drop highly collinear features $(> 0.9)$, or allow redundancy in this case?
Would PCA or another dimensionality reduction method be preferable to simple correlation-based filtering?
Should I rely more on specific colour spaces (e.g., LAB or HSV) due to the nature of colour change from reagent?
Should I drop the idea of a hybrid model and use only the tabular features (and in that case, which features would be helpful)?
How can I best reduce the error due to blood contamination?
Should I use a pretrained CNN model such as MobileNetV2 or something similar?

Any advice on selecting the right features, changing the model, using something else entirely, or improving interpretability/predictive power in such a constrained setup would be greatly appreciated.

Edit: Here is a sample picture

The blood from a person is applied to the top semi-circular portion of the strip. From there, capillary action should carry only the plasma part of the blood (which contains the glucose molecules) to the circle in the middle, where the reagent has been applied. Only the glucose in the plasma reacts with this reagent, producing a brownish colour whose intensity corresponds to the blood glucose level. Ideally only plasma should reach the circular region, but in practice some whole blood also gets into the reaction area.

The ROI is the reaction area (circle) in the centre.

I am extracting only the image in the centre (excluding the blood contamination as much as possible) and computing $32$ features from it (present in the correlation image I attached).

I tried taking all the $32$ features and here is the correlation between them.

Digitallis · Accepted Answer · 2025-06-25T16:48:46.533

As stated in the comments, I would suggest that you do not use a CNN model for this task. There are several reasons for this, in my opinion:

Always start with a simple baseline model before moving on to a more advanced model.
If you are still overfitting (as mentioned in the comments), then you should try an even simpler model (for example, by reducing the maximum number of splits if you are using a decision tree).

Simple models which could do the trick are a KNN model or just a linear regression model.
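As a rough illustration of such a baseline, here is a minimal sketch of a k-nearest-neighbour regressor scored with leave-one-out cross-validation, next to a "predict the training mean" reference. It uses only NumPy; the function names `loo_knn_mae` and `loo_mean_mae`, the choice of `k=5`, and the use of mean absolute error are my own assumptions, not something taken from the question.

```python
import numpy as np

def loo_knn_mae(X, y, k=5):
    """Leave-one-out mean absolute error of a k-nearest-neighbour regressor."""
    X, y = np.asarray(X, float), np.asarray(y, float)
    errors = []
    for i in range(len(y)):
        train = np.arange(len(y)) != i
        # Standardise features using the training rows only.
        mu, sd = X[train].mean(axis=0), X[train].std(axis=0)
        sd[sd == 0] = 1.0
        Z = (X - mu) / sd
        # Predict the held-out sample as the mean target of its k nearest
        # training neighbours (Euclidean distance in standardised space).
        d = np.linalg.norm(Z[train] - Z[i], axis=1)
        nearest = np.argsort(d)[:k]
        errors.append(abs(y[train][nearest].mean() - y[i]))
    return float(np.mean(errors))

def loo_mean_mae(y):
    """Leave-one-out MAE of always predicting the training mean (reference)."""
    y = np.asarray(y, float)
    n = len(y)
    loo_means = (y.sum() - y) / (n - 1)
    return float(np.mean(np.abs(loo_means - y)))
```

If `loo_knn_mae(X, y)` is not clearly below `loo_mean_mae(y)` on your real features, that would suggest the features carry little usable signal, and a more complex model is unlikely to fix that.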

From my understanding of the scientific experiment, the only thing that matters is the intensity of the color of the reactant, so as a baseline I would pick a model with color-based features only (again, don't use a CNN for this).

To accurately extract the color of the reactant without having to deal with the blood contamination, I suggest the following ideas:

Consider only pixels in the bottom half of the picture.
You could also sample a random subset of the pixels uniformly and take the mean of their colors instead of taking the mean of the whole picture. The underlying assumption is that the contaminant covers only a small portion of the image, so the odds of contaminated pixels landing in the random sample are small.
Cluster the pixels by color and consider that the largest cluster is the reactant.
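The clustering idea could be sketched roughly as follows. This is a small NumPy-only k-means, written under the assumption that the ROI has already been flattened into an `(n_pixels, 3)` array of color values; the function name `dominant_cluster_color`, the default `k=2`, and the farthest-first initialisation are illustrative choices of mine.

```python
import numpy as np

def dominant_cluster_color(pixels, k=2, n_iter=20):
    """Return the mean color of the largest k-means cluster.

    pixels: array of shape (n_pixels, 3), e.g. the ROI flattened to RGB rows.
    """
    pixels = np.asarray(pixels, dtype=float)
    # Farthest-first initialisation (deterministic): start from the pixel
    # farthest from the overall mean, then repeatedly add the pixel that is
    # farthest from all centroids chosen so far.
    first = np.linalg.norm(pixels - pixels.mean(axis=0), axis=1).argmax()
    centroids = [pixels[first]]
    for _ in range(k - 1):
        d = np.min([np.linalg.norm(pixels - c, axis=1) for c in centroids], axis=0)
        centroids.append(pixels[d.argmax()])
    centroids = np.array(centroids)
    for _ in range(n_iter):
        # Assign every pixel to its nearest centroid.
        dists = np.linalg.norm(pixels[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each centroid to the mean of its members (skip empty clusters).
        for j in range(k):
            members = pixels[labels == j]
            if len(members) > 0:
                centroids[j] = members.mean(axis=0)
    counts = np.bincount(labels, minlength=k)
    return centroids[counts.argmax()]
```

The returned color can then replace the plain ROI mean as a feature. This only helps if the reactant really is the largest cluster, so it is worth checking visually on a few strips before relying on it.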

For creating features, I cannot help you beyond the obvious RGB, HSV, etc. On top of those features you can create some custom ones that you think are relevant to the problem, based on your understanding of the task and your exploration of the dataset.

Once you have a subset of features, you may implement some feature selection techniques of your choice. There is no absolute right answer here other than "You should pick the features which maximise the ability of the model to generalise which you can estimate using cross validation."

For example, select the top $k$ most correlated features, use cross-validation to estimate the generalisation performance for each $k$, and pick the best value.
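A minimal sketch of that procedure, assuming a plain linear regression as the model and RMSE as the score (both are my choices for illustration, as are the names `cv_rmse_top_k` and `best_k`). The one detail worth getting right is that the correlation ranking is recomputed inside every fold from the training rows only; ranking on the full dataset first would leak information from the held-out samples.

```python
import numpy as np

def cv_rmse_top_k(X, y, k, n_folds=5, seed=0):
    """Cross-validated RMSE of linear regression on the k features most
    correlated with y, with the selection redone inside every fold."""
    X, y = np.asarray(X, float), np.asarray(y, float)
    idx = np.random.default_rng(seed).permutation(len(y))
    sq_errors = []
    for test in np.array_split(idx, n_folds):
        train = np.setdiff1d(idx, test)
        Xtr, ytr = X[train], y[train]
        # Rank features by |Pearson r| using training rows only (no leakage).
        Xc = Xtr - Xtr.mean(axis=0)
        yc = ytr - ytr.mean()
        denom = np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc)
        r = np.abs(Xc.T @ yc) / np.where(denom == 0, np.inf, denom)
        top = np.argsort(r)[::-1][:k]
        # Ordinary least squares with an intercept column.
        A = np.column_stack([np.ones(len(train)), Xtr[:, top]])
        coef, *_ = np.linalg.lstsq(A, ytr, rcond=None)
        pred = np.column_stack([np.ones(len(test)), X[test][:, top]]) @ coef
        sq_errors.extend((pred - y[test]) ** 2)
    return float(np.sqrt(np.mean(sq_errors)))

def best_k(X, y, max_k=None):
    """Evaluate k = 1..max_k and return (best k, {k: CV RMSE})."""
    X = np.asarray(X, float)
    max_k = max_k or X.shape[1]
    scores = {k: cv_rmse_top_k(X, y, k) for k in range(1, max_k + 1)}
    return min(scores, key=scores.get), scores
```

With only about 108 samples the scores for neighbouring values of $k$ will be noisy, so it may be more sensible to prefer the smallest $k$ whose score is close to the minimum than to trust the exact argmin.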

Or you could try another approach. Whatever you do, your priority should be maximising the model's ability to generalise to unseen data.
