pam                 package:cluster                 R Documentation

_P_a_r_t_i_t_i_o_n_i_n_g _A_r_o_u_n_d _M_e_d_o_i_d_s (_P_A_M)

_D_e_s_c_r_i_p_t_i_o_n:

     Returns a partitioning (clustering) of the data into `k' clusters.

_U_s_a_g_e:

     pam(x, k, diss = FALSE, metric = "euclidean", stand = FALSE)

_A_r_g_u_m_e_n_t_s:

       x: data matrix or dataframe, or dissimilarity matrix, depending
          on the value of the `diss' argument.

          In case of a matrix or dataframe, each row corresponds to an
          observation, and each column corresponds to a variable.  All
          variables must be numeric. Missing values (`NA's) are
          allowed.

          In case of a dissimilarity matrix, `x' is typically the
          output of `daisy' or `dist'.  Also a vector of length
          n*(n-1)/2 is allowed (where n is the number of observations),
          and will be interpreted in the same way as the output of the
          above-mentioned functions. Missing values (`NA's) are not
          allowed.

       k: positive integer specifying the number of clusters, less than
          the number of observations.

    diss: logical flag: if TRUE, `x' will be treated as a
          dissimilarity matrix; if FALSE, as a matrix of observations
          by variables.

  metric: character string specifying the metric to be used for
          calculating dissimilarities between observations.
          The currently available options are "euclidean" and
          "manhattan".  Euclidean distances are root sum-of-squares of
          differences, and manhattan distances are the sum of absolute
          differences.  If `x' is already a dissimilarity matrix, then
          this argument will be ignored. 

   stand: logical; if TRUE, the measurements in `x' are standardized
          before calculating the dissimilarities.  Measurements are
          standardized for each variable (column), by subtracting the
          variable's mean value and dividing by the variable's mean
          absolute deviation.  If `x' is already a dissimilarity
          matrix, then this argument will be ignored.
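
     The standardization applied when `stand = TRUE' can be sketched in
     a few lines of base R (an illustration only, not the package's
     internal code; the helper name `std.by.meanAD' is invented here):

```r
## Sketch: center each column by its mean and scale by its MEAN
## absolute deviation (note: not the median-based mad()).
std.by.meanAD <- function(x) {
  apply(x, 2, function(col) {
    m <- mean(col, na.rm = TRUE)
    s <- mean(abs(col - m), na.rm = TRUE)   # mean absolute deviation
    (col - m) / s
  })
}

x  <- cbind(rnorm(20), rnorm(20, sd = 10))
xs <- std.by.meanAD(x)
colMeans(xs)       # approximately 0 for each variable
colMeans(abs(xs))  # exactly 1 for each variable, by construction
```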

_D_e_t_a_i_l_s:

     `pam' is fully described in chapter 2 of Kaufman and Rousseeuw
     (1990). Compared to the k-means approach in `kmeans', the function
     `pam' has the following features: (a) it also accepts a
     dissimilarity matrix; (b) it is more robust because it minimizes a
     sum of dissimilarities instead of a sum of squared Euclidean
     distances; (c) it provides a novel graphical display, the
     silhouette plot (see `plot.partition'), which also allows
     selection of the number of clusters.
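
     As a small illustration of using silhouette information to choose
     the number of clusters (this assumes the returned object contains
     the `silinfo' component documented in `pam.object'):

```r
library(cluster)

## Two well-separated groups; compare average silhouette widths for
## several candidate values of k (larger is better).
set.seed(7)
x <- rbind(cbind(rnorm(10, 0, 0.5), rnorm(10, 0, 0.5)),
           cbind(rnorm(15, 5, 0.5), rnorm(15, 5, 0.5)))
avg.width <- sapply(2:5, function(k) pam(x, k)$silinfo$avg.width)
names(avg.width) <- 2:5
avg.width   # k = 2 should score highest for these data
```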

     The `pam'-algorithm is based on the search for `k' representative
     objects or medoids among the observations of the dataset. These
     observations should represent the structure of the data. After
     finding a set of `k' medoids, `k' clusters are constructed by
     assigning each observation to the nearest medoid. The goal is to
     find `k' representative objects which minimize the sum of the
     dissimilarities of the observations to their closest
     representative object. The algorithm first looks for a good
     initial set of medoids (this is called the BUILD phase). Then it
     finds a local minimum for the objective function, that is, a
     solution such that there is no single switch of an observation
     with a medoid that will decrease the objective (this is called the
     SWAP phase).
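
     The BUILD/SWAP idea can be sketched in base R as follows (a
     deliberately naive illustration; the real `pam' implementation is
     more careful and far faster, and the name `pam.sketch' is made up
     here):

```r
## Naive BUILD + SWAP sketch on a dissimilarity matrix d (n x n).
pam.sketch <- function(d, k) {
  d <- as.matrix(d)
  n <- nrow(d)
  ## total cost: sum of each observation's distance to its nearest medoid
  cost <- function(med) sum(apply(d[, med, drop = FALSE], 1, min))

  ## BUILD phase: greedily add the object that most reduces the cost
  med <- integer(0)
  for (i in seq_len(k)) {
    cand <- setdiff(seq_len(n), med)
    med  <- c(med, cand[which.min(sapply(cand,
                  function(j) cost(c(med, j))))])
  }

  ## SWAP phase: exchange a medoid with a non-medoid while the cost
  ## still decreases (stop at a local minimum)
  repeat {
    best <- cost(med); swap <- NULL
    for (i in seq_along(med)) {
      for (j in setdiff(seq_len(n), med)) {
        m2 <- med; m2[i] <- j
        if (cost(m2) < best) { best <- cost(m2); swap <- m2 }
      }
    }
    if (is.null(swap)) break
    med <- swap
  }
  list(medoids    = med,
       clustering = apply(d[, med, drop = FALSE], 1, which.min))
}

set.seed(1)
x   <- rbind(matrix(rnorm(20, 0, 0.3), ncol = 2),
             matrix(rnorm(20, 5, 0.3), ncol = 2))
res <- pam.sketch(dist(x), 2)
res$medoids     # indices of the 2 representative objects
```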

_V_a_l_u_e:

     an object of class `"pam"' representing the clustering.  See
     `pam.object' for details.
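
     For instance, assuming the components documented in `pam.object',
     the medoids and the cluster membership can be extracted directly:

```r
library(cluster)

set.seed(2)
x <- rbind(cbind(rnorm(10, 0, 0.5), rnorm(10, 0, 0.5)),
           cbind(rnorm(15, 5, 0.5), rnorm(15, 5, 0.5)))
pamx <- pam(x, 2)
pamx$medoids      # coordinates of the 2 representative objects
pamx$id.med       # their row indices in x
pamx$clustering   # cluster membership for each observation
```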

_B_A_C_K_G_R_O_U_N_D:

     Cluster analysis divides a dataset into groups (clusters) of
     observations that are similar to each other. Partitioning methods
     like `pam', `clara', and `fanny' require that the number of
     clusters be given by the user. Hierarchical methods like `agnes',
     `diana', and `mona' construct a hierarchy of clusterings, with the
     number of clusters ranging from one to the number of observations.

_N_o_t_e:

     For datasets larger than (say) 200 observations, `pam' will
     require considerable computation time; the function `clara' is
     then preferable.
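
     A minimal example of switching to `clara' for a larger dataset
     (here with simulated data; `clara' takes the data matrix, not a
     dissimilarity, plus the number of clusters):

```r
library(cluster)

## 1000 observations in two well-separated groups: too large for a
## comfortable pam() run, but fine for clara(), which clusters
## sub-samples of the data.
set.seed(3)
xl <- rbind(cbind(rnorm(500, 0, 0.5), rnorm(500, 0, 0.5)),
            cbind(rnorm(500, 5, 0.5), rnorm(500, 5, 0.5)))
clarax <- clara(xl, 2)
clarax$medoids   # representative objects of the 2 clusters
```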

_R_e_f_e_r_e_n_c_e_s:

     Kaufman, L. and Rousseeuw, P.J. (1990) Finding Groups in Data: An
     Introduction to Cluster Analysis. Wiley, New York.

     Struyf, A., Hubert, M. and Rousseeuw, P.J. (1996) Clustering in
     an Object-Oriented Environment. Journal of Statistical Software,
     1. <URL: http://www.stat.ucla.edu/journals/jss/>

     Struyf, A., Hubert, M. and Rousseeuw, P.J. (1997) Integrating
     Robust Clustering Techniques in S-PLUS, Computational Statistics
     and Data Analysis, 26, 17-37.

_S_e_e _A_l_s_o:

     `pam.object', `clara', `daisy', `partition.object',
     `plot.partition', `dist'.

_E_x_a_m_p_l_e_s:

     ## generate 25 objects, divided into 2 clusters.
     x <- rbind(cbind(rnorm(10,0,0.5), rnorm(10,0,0.5)),
                cbind(rnorm(15,5,0.5), rnorm(15,5,0.5)))
     pamx <- pam(x, 2)
     pamx
     summary(pamx)
     plot(pamx)

     pam(daisy(x, metric = "manhattan"), 2, diss = TRUE)

     data(ruspini)
     ## Plot similar to Figure 4 in Struyf et al. (1996)
     plot(pam(ruspini, 4), ask = TRUE)

