Approximating Distribution of a Data Set

Question

If I have a set of data points and want to approximate the distribution of that data set, what methods can be used to find the best-fitting distribution, whether it be gamma, normal, log-normal, exponential, etc.? I am trying to find the best distribution and the parameters that optimize the fit. What methods are out there to do so?

Here is the data I am trying to approximate with a distribution. I generated it by running 3,466 binary simulations (1,0) and summing the number of 1's in each simulation. According to probability theory, the sum of the outcomes of independent Bernoulli trials is binomial. But suppose I didn't know this was binomial: how could I build a function that approximates the data? My end goal is to build an Excel function that draws on the inverse of the cumulative distribution function and spits out a random number from the distribution.

x       # occurrences   P(x)
1636    1   0%
1646    2   0%
1656    2   0%
1666    6   1%
1676    13  2%
1686    20  2%
1696    44  5%
1706    61  7%
1716    79  10%
1726    115 14%
1736    120 14%
1746    97  12%
1756    88  11%
1766    81  10%
1776    48  6%
1786    31  4%
1796    13  2%
1806    7   1%
1816    3   0%
1826    0   0%

What does your data look like? Do you want to approximate the frequency distribution, like a histogram? Estimate a density function? Something else entirely? — Nameless, Nov 16 '13 at 15:30
Find the density function that best fits the distribution of the data. I am asking a general question; I don't have a particular data set to model. But the idea occurred to me: if you have a distribution, how do you find the density function that minimizes the errors? How do you find out what the parameters are? So if I have a data distribution that looks gamma, how do you generate the parameters? — jessica, Nov 18 '13 at 02:44

score 4 · Answer 1 · edited Dec 22 '13 at 18:39

Discovering the underlying distribution of data is a typical application of neural networks. If it is not clear what kind of function you are looking for, a neural net can build a function (in general multivariate and non-linear) through its weights and the activation function chosen for its units. After training, the net can give you the probability for unseen values of x. Training consists of minimizing some type of error (typically the quadratic error).

If you have an idea of what kind of function to look for, a genetic algorithm can minimize the error against the data over the parameters of that function (for example, a gamma density). The population consists of several chromosomes, each containing one candidate solution (a set of parameters). The score of a chromosome is the quadratic error, taken over all x values in the data set, between the probability in the data and the probability (output value) of the chosen distribution evaluated with that chromosome's parameters. By applying selection pressure with a fixed population size, you improve the global score of the population and reach an approximation of the parameters you are looking for (a good one if the data set is big enough relative to the number of parameters).
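
As a rough sketch of the genetic-algorithm idea described above (not the answerer's own code), the following fits a normal density to the table from the question. The choice of a normal model, the bin width of 10, the population size, the mutation scales and every function name are assumptions made for illustration only.

```python
import math
import random

# Table from the question: x value and number of occurrences.
xs = [1636 + 10 * i for i in range(20)]
counts = [1, 2, 2, 6, 13, 20, 44, 61, 79, 115,
          120, 97, 88, 81, 48, 31, 13, 7, 3, 0]
total = sum(counts)
probs = [c / total for c in counts]   # observed P(x)
BIN_WIDTH = 10.0                      # assumption: each x stands for a bin of width 10


def normal_pdf(x, mu, sigma):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def squared_error(chrom):
    """Quadratic error between observed P(x) and the model's bin probability."""
    mu, sigma = chrom
    return sum((p - BIN_WIDTH * normal_pdf(x, mu, sigma)) ** 2
               for x, p in zip(xs, probs))


def evolve(pop_size=40, generations=300, keep=10, seed=0):
    rng = random.Random(seed)
    # Each chromosome is one candidate parameter set (mu, sigma).
    pop = [(rng.uniform(1636, 1826), rng.uniform(5, 100))
           for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=squared_error)
        parents = pop[:keep]              # selection pressure: keep the best
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            w = rng.random()
            # Crossover blends two parents; mutation adds small Gaussian noise.
            mu = w * a[0] + (1 - w) * b[0] + rng.gauss(0, 1.0)
            sigma = w * a[1] + (1 - w) * b[1] + rng.gauss(0, 0.5)
            children.append((mu, max(sigma, 1e-3)))
        pop = parents + children          # population size stays fixed
    return min(pop, key=squared_error)


best_mu, best_sigma = evolve()
print(best_mu, best_sigma, squared_error((best_mu, best_sigma)))
```

Swapping in another density (gamma, log-normal) would only mean replacing `normal_pdf` and the ranges used to initialise the chromosomes; the selection loop stays the same.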

Neural nets are more flexible because they can approximate any function, so you avoid having to choose a distribution before applying the method, as is generally required by other methods such as Maximum Likelihood Estimation. That said, I think a genetic algorithm can be designed to use several density functions without problems.

Have you considered approximating with a polynomial, using for example the Levenberg-Marquardt algorithm? It cannot generalize the way a neural net can, but it gives good results inside the domain of the data.
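
A minimal sketch of the polynomial idea above, assuming the question's table as data and a hand-rolled Levenberg-Marquardt loop (the function names, the rescaling of x and the damping schedule are all illustrative choices, not taken from the answer). Because a polynomial is linear in its coefficients, ordinary least squares would reach the same fit; the damped iteration is shown only because it is the method named above.

```python
# Table from the question: x value and number of occurrences.
xs = [1636 + 10 * i for i in range(20)]
counts = [1, 2, 2, 6, 13, 20, 44, 61, 79, 115,
          120, 97, 88, 81, 48, 31, 13, 7, 3, 0]
total = sum(counts)
probs = [c / total for c in counts]
# Rescale x to [-1, 1] so that powers of t stay well conditioned.
ts = [(x - 1731.0) / 95.0 for x in xs]


def poly(coef, t):
    """Evaluate c0 + c1*t + c2*t^2 + ... with Horner's rule."""
    y = 0.0
    for c in reversed(coef):
        y = y * t + c
    return y


def sse(coef):
    """Sum of squared errors between observed P(x) and the polynomial."""
    return sum((p - poly(coef, t)) ** 2 for t, p in zip(ts, probs))


def solve(a, b):
    """Solve a small dense linear system by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def levenberg_marquardt(degree, iterations=50, lam=1e-3):
    n = degree + 1
    coef = [0.0] * n
    for _ in range(iterations):
        resid = [p - poly(coef, t) for t, p in zip(ts, probs)]
        jac = [[t ** k for k in range(n)] for t in ts]  # d(model)/d(c_k) = t^k
        jtj = [[sum(row[i] * row[j] for row in jac) for j in range(n)]
               for i in range(n)]
        jtr = [sum(row[i] * r for row, r in zip(jac, resid)) for i in range(n)]
        # Damped normal equations: (J^T J + lam * diag(J^T J)) step = J^T r
        damped = [[jtj[i][j] + (lam * jtj[i][i] if i == j else 0.0)
                   for j in range(n)] for i in range(n)]
        step = solve(damped, jtr)
        trial = [c + s for c, s in zip(coef, step)]
        if sse(trial) < sse(coef):
            coef, lam = trial, lam / 10.0   # accept the step, damp less
        else:
            lam *= 10.0                     # reject the step, damp harder
    return coef


for degree in (2, 4, 6):
    print(degree, sse(levenberg_marquardt(degree)))
```

Comparing the printed errors across degrees is a quick way to see how much each extra term buys on this data set.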

Hi Francisco. Thank you for your response. I didn't really consider neural networks; to be honest, I don't know what they are, so I will have to research them. I actually populated a normal distribution in Excel using norm.dist(x,mean,stdev) and fitted it with a regression. I ran a polynomial regression y = c_0 + c_1 x + c_2 x^2 + ... + c_n x^n and was surprised by how closely the fitted equation approximated the data. The errors were mostly in the tails of the distribution. Do you know of any conventional numerical techniques to fit data, e.g. Chebyshev polynomials? — jessica, Dec 22 '13 at 18:13
I can't help you with Chebyshev approximation. Generally, in my applications, a quadratic minimization is enough for me. For non-linear approximation, the Levenberg-Marquardt method with polynomials is good, but the results in any given part of the domain depend on how many examples in the data set cover that sub-domain. Try changing the degree of the polynomial. A polynomial of too high a degree may approximate the domain worse than one of lower degree if the number of parameters is too high relative to the size of the data set. — Francisco Yepes Barrera, Dec 22 '13 at 22:57

score 1 · Answer 2 · answered Dec 17 '13 at 04:42

Maximum Likelihood Estimation should give you the parameter values for the given data set (of course, you need to decide beforehand which distribution you are going to use).

http://research.microsoft.com/en-us/um/people/minka/papers/minka-gamma.pdf
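
To make this concrete, here is a small sketch that assumes a normal model for the table in the question, where the maximum-likelihood estimates have a closed form (the weighted mean and the divide-by-n standard deviation). The names `mu_hat`, `sigma_hat`, `fitted` and `draw` are made up for illustration. For a gamma model the shape parameter has no closed-form estimate and has to be found numerically.

```python
import math
import random
from statistics import NormalDist

# Table from the question: x value and number of occurrences.
xs = [1636 + 10 * i for i in range(20)]
counts = [1, 2, 2, 6, 13, 20, 44, 61, 79, 115,
          120, 97, 88, 81, 48, 31, 13, 7, 3, 0]
n = sum(counts)

# Normal MLE in closed form: weighted mean and (biased) standard deviation.
mu_hat = sum(x * c for x, c in zip(xs, counts)) / n
sigma_hat = math.sqrt(
    sum(c * (x - mu_hat) ** 2 for x, c in zip(xs, counts)) / n
)

fitted = NormalDist(mu_hat, sigma_hat)


def draw(rng=random):
    """Inverse-transform sampling: map a uniform number through the inverse CDF."""
    u = rng.random()
    while u == 0.0:          # inv_cdf is only defined for 0 < u < 1
        u = rng.random()
    return fitted.inv_cdf(u)


print(mu_hat, sigma_hat, draw())
```

With the fitted values plugged in, the Excel counterpart of `draw` would be something like =NORM.INV(RAND(), mean, stdev).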

Marc Fletcher
