
I'm currently applying data science to a High Performance Computing cluster by analyzing the log files it generates, trying to find patterns that lead to a system failure (specifically STALE FILE HANDLE errors in the GPFS file system, for now). I am categorizing the log messages and clustering them based on their counts per time interval. Since some messages are far more frequent than others in any given time frame, I don't want the clustering to be biased toward the feature with the maximum variance.
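One common way to remove that bias is to standardize each message-count column to zero mean and unit variance before clustering. A minimal sketch in R, where the counts matrix and the message-category names are hypothetical stand-ins for the real log data:

```r
# Hypothetical matrix: rows = time intervals, columns = message categories,
# values = number of occurrences of each message in that interval
set.seed(42)
counts <- cbind(
  stale_fh  = rpois(100, lambda = 2),    # rare message type
  disk_warn = rpois(100, lambda = 500)   # dominant, high-variance message type
)

# Standardize each column to mean 0, sd 1 so no single
# high-variance message dominates the distance computation
scaled <- scale(counts)

# Cluster the standardized intervals
km <- kmeans(scaled, centers = 3, nstart = 25)
table(km$cluster)
```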


1 Answer


It's unclear exactly what the OP is asking (so this response is somewhat general), but the table below lists common contexts and the transformations typically applied:

| Context | Typical transformation |
| --- | --- |
| sales, revenue, income, price | log(x) |
| distance | 1/x, 1/x^2, log(x) |
| market share, preference share | e^x / (1 + e^x) |
| right-tailed distribution | sqrt(x), log(x) (caution: log(x) is undefined for x <= 0) |
| left-tailed distribution | x^2 |
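For example, right-skewed, nonnegative count data (such as message counts per interval) often contains zeros, where log(x) fails; a zero-safe variant is common. A small R sketch with made-up counts:

```r
# log1p(x) = log(1 + x) is a zero-safe alternative to log(x) for counts
x <- c(0, 1, 3, 7, 120, 950)   # hypothetical right-skewed counts
log1p(x)

# sqrt() is a milder option for right-tailed data
sqrt(x)
```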

You can also use John Tukey's three-point method as discussed in this post. When specific transformations don't work, use the Box-Cox transformation. With the R package car, compute lambda with lambda <- coef(powerTransform(x)) and then call bcPower(x, lambda) to transform. Consider Box-Cox transformations of all variables with skewed distributions before computing correlations or creating scatterplots.
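A minimal sketch of that car workflow; the data here is simulated for illustration:

```r
library(car)

# Simulated right-skewed, strictly positive data
set.seed(1)
x <- rlnorm(200, meanlog = 0, sdlog = 1)

# Estimate the Box-Cox lambda by maximum likelihood
lambda <- coef(powerTransform(x))

# Apply the Box-Cox power transformation with the estimated lambda
x_bc <- bcPower(x, lambda)

hist(x_bc)  # should look far more symmetric than hist(x)
```

Note that Box-Cox requires strictly positive input; shift count data by a constant (or use the Yeo-Johnson family via powerTransform(x, family = "yjPower")) if zeros are present.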

Brandon Loudermilk