I want to find a 'nice' drawing of the lipids and genes in my database. Each lipid belongs to one of several classes, while each gene belongs to one of several regions. Each lipid/gene pair has an associated strength value, which indicates how closely related the two are.
I have tried to define my dataset formally as follows. A gene-lipid embedding problem is a $5$-tuple $(\texttt{lipids}, \texttt{genes},\texttt{strength},\texttt{region},\texttt{class})$ where
- $\texttt{lipids}$ and $\texttt{genes}$ are finite, non-empty, non-overlapping sets,
- $\texttt{strength}$ is a map $\texttt{lipids} \times \texttt{genes} \to [0, 1]$,
- $\texttt{region}$ is a map $\texttt{genes} \to \mathbb N$, and
- $\texttt{class}$ is a map $\texttt{lipids} \to \mathbb N$.

Essentially, my data is a weighted bipartite graph: the vertices are the lipids and genes, the edges and edge weights are given by the $\texttt{strength}$ function, and each vertex is either a lipid, which belongs to a class, or a gene, which belongs to a region.
A candidate solution for such a problem is a map $\texttt{genes} \cup \texttt{lipids} \to \mathbb R^2$.
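In case it helps to see the definition as data, here is a minimal Python sketch of how I picture the tuple and a candidate solution. The names `GeneLipidProblem`, `lipid_class`, `validate` and `is_candidate`, and the identifiers in the example, are made up for illustration; `class` is a reserved word in Python, so the field is called `lipid_class`.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Tuple

Point = Tuple[float, float]  # a point in R^2


@dataclass(frozen=True)
class GeneLipidProblem:
    """The 5-tuple (lipids, genes, strength, region, class) from the definition."""
    lipids: FrozenSet[str]
    genes: FrozenSet[str]
    strength: Dict[Tuple[str, str], float]  # (lipid, gene) -> value in [0, 1]
    region: Dict[str, int]                  # gene -> region index
    lipid_class: Dict[str, int]             # lipid -> class index

    def validate(self) -> None:
        """Raise ValueError if any condition of the definition is violated."""
        if not self.lipids or not self.genes:
            raise ValueError("lipids and genes must be non-empty")
        if self.lipids & self.genes:
            raise ValueError("lipids and genes must not overlap")
        expected_pairs = {(l, g) for l in self.lipids for g in self.genes}
        if set(self.strength) != expected_pairs:
            raise ValueError("strength must be defined on every lipid/gene pair")
        if any(not 0.0 <= s <= 1.0 for s in self.strength.values()):
            raise ValueError("strength values must lie in [0, 1]")
        if set(self.region) != self.genes:
            raise ValueError("region must be defined exactly on the genes")
        if set(self.lipid_class) != self.lipids:
            raise ValueError("class must be defined exactly on the lipids")


def is_candidate(problem: GeneLipidProblem, layout: Dict[str, Point]) -> bool:
    """A candidate solution assigns a point to every gene and every lipid."""
    return set(layout) == problem.lipids | problem.genes


# Tiny made-up instance: two lipids, two genes.
example = GeneLipidProblem(
    lipids=frozenset({"lipid_a", "lipid_b"}),
    genes=frozenset({"gene_x", "gene_y"}),
    strength={
        ("lipid_a", "gene_x"): 0.9, ("lipid_a", "gene_y"): 0.1,
        ("lipid_b", "gene_x"): 0.2, ("lipid_b", "gene_y"): 0.8,
    },
    region={"gene_x": 0, "gene_y": 1},
    lipid_class={"lipid_a": 0, "lipid_b": 1},
)
example.validate()
```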
I want to find a 'best' map. By 'best' I mean that:
- Lipids of the same class should be close together, and lipids of different classes should be far apart.
- Similarly, genes of the same region should be drawn close together, and far apart otherwise.
- Lipid/gene pairs with a high strength value should be drawn close to one another, and far apart otherwise.
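
To make these three wishes concrete, here is a naive sketch of one score I could imagine, where lower is better. It is only an illustration and not necessarily a good choice: the squared-distance penalty, the `margin` parameter (the distance beyond which "far apart" pairs stop being penalised), the equal weighting of the three criteria, and the function names are all assumptions of mine.

```python
import math
from itertools import combinations


def pair_cost(d, attract, margin=1.0):
    """Cost of one pair at distance d.

    attract in [0, 1]: 1 means "should be close", 0 means "should be far".
    Close pairs pay d^2; far pairs pay only while they are nearer than margin.
    """
    return attract * d ** 2 + (1.0 - attract) * max(0.0, margin - d) ** 2


def layout_cost(layout, lipid_class, region, strength, margin=1.0):
    """Sum the pair costs over lipid/lipid, gene/gene and lipid/gene pairs."""
    cost = 0.0
    # Same class (or same region) attracts, different class (or region) repels.
    for labels in (lipid_class, region):
        for a, b in combinations(sorted(labels), 2):
            same = 1.0 if labels[a] == labels[b] else 0.0
            cost += pair_cost(math.dist(layout[a], layout[b]), same, margin)
    # Lipid/gene pairs attract in proportion to their strength.
    for (lipid, gene), s in strength.items():
        cost += pair_cost(math.dist(layout[lipid], layout[gene]), s, margin)
    return cost


# Made-up example: each lipid is strongly related to exactly one gene.
lipid_class = {"lipid_a": 0, "lipid_b": 1}
region = {"gene_x": 0, "gene_y": 1}
strength = {
    ("lipid_a", "gene_x"): 1.0, ("lipid_a", "gene_y"): 0.0,
    ("lipid_b", "gene_x"): 0.0, ("lipid_b", "gene_y"): 1.0,
}
layout = {
    "lipid_a": (0.0, 0.0), "gene_x": (0.0, 0.0),
    "lipid_b": (2.0, 0.0), "gene_y": (2.0, 0.0),
}
print(layout_cost(layout, lipid_class, region, strength))
```

A score of this shape at least ranks a layout that respects the strengths below one that swaps the two genes, but I do not know whether it is a sensible way to balance the three criteria.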
My first question is: How can I mathematically define what it means for a candidate solution to be good? That is, how can I come up with a fitness function that assigns each candidate solution a score?
My second question is: Once I have defined such a function, how can I use it to find the 'best' solution?