I am completely redrafting this question following the advice of @MrFlick.
Assume I have a data.frame like the following:
set.seed(1)
group<-rep(1:10, sample(50:200, 10, replace=TRUE))
n<-length(group)  # total number of rows (1328 in my session); avoids hard-coding it
gender<-factor(sample(0:1, n, replace=TRUE, prob=c(0.55, 0.45)))
country<-factor(sample(6030:6098, n, replace=TRUE))
ethnicity<-factor(sample(7040:7101, n, replace=TRUE))
yearbirth<-sample(1950:1986, n, replace=TRUE)
df<-data.frame(group, gender, country, ethnicity, yearbirth)
For each group, I would like to calculate the Silhouette Width (SW) corresponding to the 'optimal' number of clusters. To do so, I prepared the following function, which I would like to apply to every group:
library(cluster)
library(fpc)
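ASW<-function(x){
  x<-as.data.frame(x)
  id<-as.integer(x[1,1])  # group id, taken from the first column
  people<-nrow(x)         # number of individuals in the group
  if (people==1){
    p<-0                  # SW is not defined for a single individual
  } else {
    x<-x[,-1]             # drop the group column before computing dissimilarities
    diss<-daisy(x, metric="gower")
    # cap the number of clusters at one-third of the group size (at least 2)
    if (people/3<2) {
      maxclus<-2
    } else {
      maxclus<-round(people/3)
    }
    asw<-numeric(maxclus)
    for (k in 2:maxclus) asw[k]<-pam(diss, k, diss=TRUE)$silinfo$avg.width
    k.best<-which.max(asw)  # number of clusters with the largest average SW
    p<-asw[k.best]
  }
  # return the group id and the best SW as a numeric vector of length 2
  swg<-numeric(2)
  swg[1]<-id
  swg[2]<-p
  swg
}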
As a final output, I would like ASW to produce a data.frame with the group number (id) in the first column and the Silhouette Width value corresponding to the optimal number of clusters in the second. If the group contains only one individual, I would like the Silhouette Width to be 0, since SW is not defined for fewer than 2 clusters.
Using all variables except for group, I would like to compute a dissimilarity matrix using daisy from the cluster package. To my knowledge, daisy is the only function able to compute a dissimilarity matrix from mixed variables. Next, I would pass the resulting dissimilarity matrix to pam and calculate the Silhouette Width for various cluster configurations. To shorten the computing time, especially with large groups, I am imposing a maximum number of clusters equal to one-third of the number of individuals in the group.
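In isolation, the steps described above could look roughly like the following sketch. The `toy` data.frame and its columns are made up purely for illustration (they are not my real data), and the sketch assumes the group has at least 3 individuals so that pam can be run with k = 2.

```r
library(cluster)

# Hypothetical toy group with mixed variable types (illustration only)
set.seed(42)
toy <- data.frame(
  gender    = factor(sample(0:1, 30, replace = TRUE)),
  country   = factor(sample(c("A", "B", "C"), 30, replace = TRUE)),
  yearbirth = sample(1950:1986, 30, replace = TRUE)
)

# Gower dissimilarity handles factor and numeric columns together
diss <- daisy(toy, metric = "gower")

# Cap the number of clusters at one-third of the group size (at least 2)
maxclus <- max(2, round(nrow(toy) / 3))

# Average silhouette width for k = 2..maxclus; position 1 stays at 0
asw <- numeric(maxclus)
for (k in 2:maxclus) {
  asw[k] <- pam(diss, k, diss = TRUE)$silinfo$avg.width
}

k.best  <- which.max(asw)  # 'optimal' number of clusters
sw.best <- asw[k.best]     # corresponding Silhouette Width
```

Here `sw.best` is the single value I would like to keep for the group.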
At this point, I would like the function to take the SW value corresponding to the optimal number of clusters (determined by the maximum Silhouette Width value) and store it, together with the corresponding group id, in a data.frame, here called aswout.
Unfortunately, the function does not seem to work properly (I tried it on the first group only), and it is not clear to me how to make it 'cycle' over all the groups.
I hope the question is clear. Let me know if there is something you don't understand and I will add more information. I am really thankful for any help on this!
All the best, Riccardo
EDIT:
The ASW function now works. I am trying to make it cycle over all groups in a data frame. I learned from another post that it is bad practice to have a function grow a data.frame as it executes. This, however, was exactly what I intended for my aswout data.frame. I am now looking for a way to achieve the same result, that is, to have the function loop over the groups and give me an output data.frame, without including the data.frame within the function.
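To make the goal concrete, here is a minimal sketch of the kind of pattern I have in mind: split the data by group, apply a function to each piece, and assemble the output data.frame once at the end. The helper `best_asw` is a hypothetical, condensed stand-in for my ASW function, and `toy` is made-up data. Returning 0 for groups with fewer than 3 individuals is an assumption on my part, since as far as I understand pam requires k to be smaller than the number of observations.

```r
library(cluster)

# Hypothetical helper: best average silhouette width for one group's variables
best_asw <- function(vars) {
  n <- nrow(vars)
  if (n < 3) return(0)  # assumption: k = 2 needs at least 3 individuals
  diss <- daisy(vars, metric = "gower")
  ks <- 2:max(2, round(n / 3))  # at most one-third of the group size
  max(sapply(ks, function(k) pam(diss, k, diss = TRUE)$silinfo$avg.width))
}

# Made-up data: three groups of sizes 12, 9 and 1
set.seed(1)
toy <- data.frame(
  group     = rep(1:3, times = c(12, 9, 1)),
  gender    = factor(sample(0:1, 22, replace = TRUE)),
  yearbirth = sample(1950:1986, 22, replace = TRUE)
)

# Split the non-group columns by group; the list is named by group id
pieces <- split(toy[, -1], toy$group)

# Build the result in one step instead of growing it inside the function
aswout <- data.frame(
  group = as.integer(names(pieces)),
  sw    = vapply(pieces, best_asw, numeric(1)),
  row.names = NULL
)
```

The function itself only returns one number per group; the data.frame is created outside of it, which is what I understood the other post to recommend.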