
Disclaimer: I'm not an expert in math; the solution to this problem may be trivial.

I have a process based on a deterministic computer simulation where two continuous input variables $x_1$ and $x_2$ (under my control) produce a continuous output variable $y$.
My objective is to determine which of the two variables affects the variability of $y$ the most, and I would like the answer to generalize to any number of independent variables.

Mathematically speaking, my model is $y=f(x_1,x_2)$, with $f:X \rightarrow \mathbb{R}$ an unknown continuous function on $X$, a simply connected domain.

The lack of random variation in $y$ suggests to me that a statistical approach is not the right way to address the problem.

If I knew $f$, then $\frac{\partial f}{\partial x_1}$ and $\frac{\partial f}{\partial x_2}$ would answer my question, but only if $\frac{\partial f}{\partial x_1}$ depended solely on $x_1$ (and likewise for $\frac{\partial f}{\partial x_2}$). If that is not the case, I do not know how to properly answer the question even with $f$ in hand.
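Even without a closed form for $f$, I can estimate the partial derivatives numerically from simulator runs. A minimal sketch in Python, using central differences (the function `f` below is only a hypothetical stand-in for my simulator):

```python
import numpy as np

def f(x1, x2):
    # Hypothetical stand-in for the deterministic simulator.
    return np.sin(x1) + 0.5 * x1 * x2 + x2**2

def partials(f, x1, x2, h=1e-5):
    """Central-difference estimates of df/dx1 and df/dx2 at (x1, x2)."""
    df_dx1 = (f(x1 + h, x2) - f(x1 - h, x2)) / (2 * h)
    df_dx2 = (f(x1, x2 + h) - f(x1, x2 - h)) / (2 * h)
    return df_dx1, df_dx2

print(partials(f, 1.0, 2.0))  # local sensitivities at the point (1, 2)
```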


My naive approach:

I was thinking about "meshing" the domain $X$ into regions where $f$ is approximated by a linear or by a quadratic model.
Let us define:

$L(x_1,x_2)=\beta_0+\beta_1x_1 + \beta_2x_2$

$Q(x_1,x_2)=\beta_0+\beta_1x_1 + \beta_2x_2 + \beta_3x_1x_2 +\beta_4x_1^2 + \beta_5x_2^2 $

In regions of the domain where the linear approximation holds, $\frac{\partial L}{\partial x_1}=\beta_1$ and $\frac{\partial L}{\partial x_2}=\beta_2$, which is convenient because I can directly compare the coefficients to assess which of the two variables influences $y$ the most.

In those regions where the linear approximation does not hold, the presence of the interaction term disrupts my interpretation, because $\frac{\partial Q}{\partial x_1} = \beta_1 + \beta_3 x_2 + 2\beta_4 x_1$ depends on $x_2$ (and similarly for $\frac{\partial Q}{\partial x_2}$).
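To make the meshing idea concrete, here is a minimal sketch that fits both $L$ and $Q$ by ordinary least squares on a small grid of simulator runs inside one sub-region (again with a hypothetical stand-in for the simulator), so the coefficients can be compared region by region:

```python
import numpy as np

def f(x1, x2):
    # Hypothetical stand-in for the deterministic simulator.
    return np.sin(x1) + 0.5 * x1 * x2 + x2**2

# Grid of simulator runs inside one sub-region of the domain X.
g1, g2 = np.meshgrid(np.linspace(0.0, 0.5, 6), np.linspace(1.0, 1.5, 6))
x1, x2 = g1.ravel(), g2.ravel()
y = f(x1, x2)

# Linear model L: design matrix with columns 1, x1, x2.
XL = np.column_stack([np.ones_like(x1), x1, x2])
beta_L, *_ = np.linalg.lstsq(XL, y, rcond=None)

# Quadratic model Q: columns 1, x1, x2, x1*x2, x1^2, x2^2.
XQ = np.column_stack([np.ones_like(x1), x1, x2, x1 * x2, x1**2, x2**2])
beta_Q, *_ = np.linalg.lstsq(XQ, y, rcond=None)

print("L coefficients:", beta_L)  # compare |beta_1| vs |beta_2| in this region
print("Q coefficients:", beta_Q)  # beta_3 is the interaction term
```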

Some thoughts: the linear approximation may be poor in sub-regions that contain minima, maxima or saddle points. Could avoiding them, by partitioning the domain around a finite number of "holes", solve the issue? (I might use an optimization algorithm such as gradient descent to find them.)
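A sketch of how such critical points could be located numerically, by minimising the squared norm of a finite-difference gradient with `scipy.optimize.minimize` (again with a hypothetical stand-in for the simulator):

```python
import numpy as np
from scipy.optimize import minimize

def f(x1, x2):
    # Hypothetical stand-in for the deterministic simulator.
    return (x1 - 1.0)**2 + (x2 + 0.5)**2 + 0.3 * x1 * x2

def grad_norm_sq(p, h=1e-5):
    # Squared norm of the central-difference gradient of f at p;
    # it vanishes exactly at critical points.
    x1, x2 = p
    g1 = (f(x1 + h, x2) - f(x1 - h, x2)) / (2 * h)
    g2 = (f(x1, x2 + h) - f(x1, x2 - h)) / (2 * h)
    return g1**2 + g2**2

res = minimize(grad_norm_sq, x0=[0.0, 0.0], method="Nelder-Mead")
print("approximate critical point:", res.x)
```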

I think the main issue here might not be in finding a solution but in a sharper framing of the problem. Specifically, in what ways does $\frac{\partial f }{\partial x_i}$ (which you can probably approximate numerically and would be the standard way of understanding "how does $f$ vary with $x_i$?") not work for you? You mention that these partial derivatives might depend on other variables, but that's going to be the case in general (unless $f$ is of a particular form). – mhum Dec 02 '20 at 20:04

1 Answer


I believe your question is both utterly meaningless and highly relevant! The relevant part is clear to you, because you wouldn't ask otherwise. For all others: one of the basic challenges in understanding complicated computer models is to understand what is driving the results. A popular buzzword in this context is model "explainability".

So I agree, the problem itself and your thoughts on it are perfectly reasonable. That said, the question as stated is meaningless as a mathematical statement and hopeless in this generality from an applied perspective. You will need to sharpen your thinking around (at least) three issues:

  1. The operational definition of your question
  2. Assumptions on your function
  3. Clarity about your inputs and coordinates

The operational definition of your question:

As the commenter already stated, "which one of the two variables affects the variability of y the most" is too fuzzy to be workable. For the sake of argument I will sharpen this (somewhat) to: "Is $x_1$ or $x_2$ more important in causing $f$ to be large?" Mathematically the answer is now clear: take a "large" threshold $\tau$, define a measure of "importance" and apply it to the set $f^{-1}([\tau,\infty[)\subset\mathbb{R}^2$. The problem is that even if $f$ is really "nice", e.g. infinitely differentiable, $f^{-1}([\tau,\infty[)$ can have any crazy shape. This makes it impossible to come up with a natural notion of importance. So, unless you have a clear, externally defined and operational notion of importance, your problem will be infeasible.

Assumptions on your function:

As you noticed yourself, point 1 above makes restricting your class of functions as much as possible a key concern. So what do you know about your function? Properties available to you may include "smooth", "convex", "monotonic in $x_1$ or $x_2$", "where the critical points are", and so on. Your idea of approximating $f$ with simple functions is a very good and valid one. Prior knowledge about $f$ then helps in finding good approximations.

Clarity about your inputs and coordinates:

You seem to treat your $x_1$ and $x_2$ coordinates as fixed. This is most likely because they are externally given by the process you simulate. But is this really true? Take for example a physical process where stuff ($x_1$) is ground at a certain temperature ($x_2$) to produce output $L(x_1,x_2)=\beta_0+\beta_1 x_1+\beta_2 x_2$. You would say $x_1$ is more important than $x_2$ iff $\beta_1 > \beta_2$. But the beta factors depend on the units of measurement. If you stop measuring the stuff you grind in tons and state $x_1$ in milligrams instead, the new factor $\tilde{\beta}_1$ will be smaller than $\beta_1$ by a factor of $10^{9}$. In applications, another natural requirement is often invariance under monotonic reparametrizations.
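To spell out that unit change: the same physical quantity is $x_1^{\mathrm{t}}$ in tons and $x_1^{\mathrm{mg}} = 10^{9}\, x_1^{\mathrm{t}}$ in milligrams, so

$\beta_1\, x_1^{\mathrm{t}} = \left(10^{-9}\beta_1\right) x_1^{\mathrm{mg}} = \tilde{\beta}_1\, x_1^{\mathrm{mg}},$

and the coefficient shrinks by exactly the factor by which the numerical input grows.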

This can be driven further and might actually help a lot, depending on the requirements of your problem. Take your linear $L$ example. Arguably it is not enough for either $x_1$ or $x_2$ to be large alone to achieve a large $L$; both must be large at the same time. So the "really important" variable is $z=\beta_1 x_1+\beta_2 x_2$. Now you can answer "What drives $L$?" easily, because the result is driven clearly and only by $z$. This idea works more generally: as long as the gradient of your function does not vanish, you can reparametrize it by its level surfaces. The coordinate orthogonal to those level surfaces is then your single variable $z$.
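In the linear case this reparametrization can be written down explicitly. Choosing one coordinate along the gradient and one along the level lines,

$z=\beta_1 x_1+\beta_2 x_2, \qquad w=\beta_2 x_1-\beta_1 x_2,$

gives $L=\beta_0+z$, so $L$ depends on $(x_1,x_2)$ only through $z$; moving along $w$ leaves it unchanged.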

Further remarks:

Even though your problem is deterministic, probabilistic approaches may be very useful to you, for two reasons. First, putting probability distributions on your inputs is often a very intuitive model of variability, and by averaging/integrating over the inputs you get rid of some of the complexity involved. In your case you could use normally distributed inputs $X_1$ and $X_2$ with variances $\sigma^2_i$ and calculate the variance of the random variable $f(X_1,X_2)$ as a function of the input variances.
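A minimal sketch of that variance-propagation idea in Python (hypothetical simulator stand-in, independent normal inputs with assumed means and standard deviations, plain Monte Carlo); freezing one input at its mean while the other varies gives a crude per-variable contribution:

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x1, x2):
    # Hypothetical stand-in for the deterministic simulator.
    return np.sin(x1) + 0.5 * x1 * x2 + x2**2

n = 100_000
mu1, mu2 = 1.0, 2.0   # nominal input values (assumed)
s1, s2 = 0.1, 0.2     # assumed input standard deviations

X1 = rng.normal(mu1, s1, n)
X2 = rng.normal(mu2, s2, n)

var_total = np.var(f(X1, X2))   # both inputs vary
var_x1 = np.var(f(X1, mu2))     # only x1 varies, x2 frozen at its mean
var_x2 = np.var(f(mu1, X2))     # only x2 varies, x1 frozen at its mean

print(var_total, var_x1, var_x2)
```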

The second reason is that the people working on Uncertainty Quantification have been analysing this kind of problem for a long time.
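For reference, the standard UQ-style answer to "which input drives the output variance?" is a Sobol sensitivity analysis. A sketch, assuming the third-party SALib package and again a hypothetical simulator stand-in:

```python
import numpy as np
from SALib.sample import saltelli
from SALib.analyze import sobol

def f(x1, x2):
    # Hypothetical stand-in for the deterministic simulator.
    return np.sin(x1) + 0.5 * x1 * x2 + x2**2

problem = {
    "num_vars": 2,
    "names": ["x1", "x2"],
    "bounds": [[0.0, 1.0], [0.0, 3.0]],  # assumed input ranges
}

X = saltelli.sample(problem, 1024)   # quasi-random design of simulator runs
Y = f(X[:, 0], X[:, 1])

Si = sobol.analyze(problem, Y)
print(Si["S1"])  # first-order indices: share of Var(y) explained by each input alone
print(Si["ST"])  # total-order indices: include interaction effects
```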
