clara                package:cluster                R Documentation

_C_l_u_s_t_e_r_i_n_g _L_a_r_g_e _A_p_p_l_i_c_a_t_i_o_n_s

_D_e_s_c_r_i_p_t_i_o_n:

     Returns a list representing a clustering of the data into `k'
     clusters.

_U_s_a_g_e:

     clara(x, k, metric = "euclidean", stand = FALSE, samples = 5, 
           sampsize = 40 + 2 * k)

_A_r_g_u_m_e_n_t_s:

       x: data matrix or data frame; each row corresponds to an
          observation and each column to a variable. All variables
          must be numeric. Missing values (NAs) are allowed.

       k: integer, the number of clusters. It is required that 0 < k <
          n where n is the number of observations. 

  metric: character string specifying the metric to be used for
          calculating dissimilarities between observations. The
          currently available options are "euclidean" and "manhattan".
          The Euclidean distance is the square root of the sum of
          squared differences; the Manhattan distance is the sum of
          absolute differences.

   stand: logical flag: if TRUE, the measurements in `x' are
          standardized before the dissimilarities are calculated.
          Each variable (column) is standardized by subtracting the
          variable's mean value and dividing by the variable's mean
          absolute deviation.

 samples: integer, number of samples to be drawn from the dataset.

sampsize: integer, number of observations in each sample. `sampsize'
          should be larger than the number of clusters (`k') and at
          most the number of observations (nrow(`x')).
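
     The `metric' and `stand' arguments can be illustrated with a few
     lines of base R. This is only a sketch of the arithmetic described
     above, not the code `clara' runs internally; the helper name
     `standardize_mad' and the toy data are made up for illustration,
     and missing values are not handled.

```r
# Toy data: 4 observations (rows), 2 variables (columns); illustrative values only.
x <- matrix(c(1, 2, 4, 9,
              10, 20, 20, 30), ncol = 2)

# Hypothetical helper: subtract each column's mean, then divide by that
# column's mean absolute deviation (the standardization described for `stand').
standardize_mad <- function(x) {
  centered <- sweep(x, 2, colMeans(x), "-")
  mean_abs_dev <- colMeans(abs(centered))
  sweep(centered, 2, mean_abs_dev, "/")
}
z <- standardize_mad(x)

# Dissimilarity between observations 1 and 2 under each metric.
diff12 <- z[1, ] - z[2, ]
d_euclid <- sqrt(sum(diff12^2))   # square root of the sum of squared differences
d_manhat <- sum(abs(diff12))      # sum of absolute differences

# Base R's dist() computes the same quantities for all pairs at once.
d_euclid_all <- as.matrix(dist(z, method = "euclidean"))
d_manhat_all <- as.matrix(dist(z, method = "manhattan"))
```

     After this standardization every column has mean 0 and mean
     absolute deviation 1, so no single variable dominates the
     dissimilarities merely because of its measurement scale.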

_D_e_t_a_i_l_s:

     `clara' is fully described in chapter 3 of Kaufman and Rousseeuw
     (1990). Compared to other partitioning methods such as `pam', it
     can deal with much larger datasets. Internally, this is achieved
     by considering sub-datasets of fixed size, so that the time and
     storage requirements become linear in nrow(`x') rather than
     quadratic.

     Each sub-dataset is partitioned into `k' clusters using the same
     algorithm as in the `pam' function. Once `k' representative
     objects (medoids) have been selected from the sub-dataset, each
     observation of the entire dataset is assigned to the nearest
     medoid. The sum of the dissimilarities of the observations to
     their closest medoid is used as a measure of the quality of the
     clustering. The sub-dataset for which this sum is minimal is
     retained, and a further analysis is carried out on the final
     partition.

     Each sub-dataset is forced to contain the medoids obtained from
     the best sub-dataset found so far. Randomly drawn observations
     are added to this set until `sampsize' has been reached.
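
     The sampling scheme described above can be sketched in R as
     follows. This is a simplified illustration, not the
     implementation used by `clara': the function name `clara_sketch'
     and its return value are hypothetical, only the Euclidean metric
     is used, and standardization and missing values are ignored. It
     assumes the `cluster' package is installed so that `pam' is
     available.

```r
library(cluster)  # provides pam()

# Hypothetical, simplified re-creation of the sampling loop described above.
clara_sketch <- function(x, k, samples = 5, sampsize = 40 + 2 * k) {
  x <- as.matrix(x)
  n <- nrow(x)
  best <- list(objective = Inf)
  for (s in seq_len(samples)) {
    # Start from the best medoids found so far (none on the first pass),
    # then add randomly drawn observations until sampsize is reached.
    forced <- best$medoid_rows
    pool <- setdiff(seq_len(n), forced)
    idx <- c(forced, sample(pool, sampsize - length(forced)))

    # Partition the sub-dataset into k clusters with pam().
    fit <- pam(x[idx, , drop = FALSE], k)
    medoid_rows <- idx[fit$id.med]          # medoid positions in the full data
    medoids <- x[medoid_rows, , drop = FALSE]

    # Euclidean distance from every observation to every medoid (n x k matrix).
    d <- sapply(seq_len(k), function(j) sqrt(colSums((t(x) - medoids[j, ])^2)))

    # Quality measure: sum of distances to the closest medoid.
    objective <- sum(apply(d, 1, min))
    if (objective < best$objective) {
      best <- list(objective = objective,
                   medoid_rows = medoid_rows,
                   clustering = apply(d, 1, which.min))
    }
  }
  best
}

# Usage on two well-separated groups of simulated points.
set.seed(1)
x <- rbind(cbind(rnorm(200, 0, 8), rnorm(200, 0, 8)),
           cbind(rnorm(300, 50, 8), rnorm(300, 50, 8)))
res <- clara_sketch(x, 2)
table(res$clustering)
```

     Only `sampsize' observations are ever passed to `pam', so the
     quadratic cost of building a dissimilarity matrix applies to the
     sample alone; the pass over the full dataset costs n times k
     distance evaluations per sample.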

_V_a_l_u_e:

     An object of class `"clara"' representing the clustering. See
     `clara.object' for details.
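
     A short sketch of how the returned object might be inspected.
     Only `clusinfo' appears elsewhere on this page; the other
     component names used here (`medoids', `clustering', `objective')
     are assumed from the `clara.object' help page and should be
     checked against the installed version of the package.

```r
library(cluster)

set.seed(42)
x <- rbind(cbind(rnorm(200, 0, 8), rnorm(200, 0, 8)),
           cbind(rnorm(300, 50, 8), rnorm(300, 50, 8)))
fit <- clara(x, 2)

class(fit)              # the object inherits from class "clara"
fit$medoids             # assumed component: coordinates of the k medoids
head(fit$clustering)    # assumed component: cluster number of each observation
fit$objective           # assumed component: objective value of the final clustering
fit$clusinfo            # per-cluster summary, as used in the Examples section
```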

_B_A_C_K_G_R_O_U_N_D:

     Cluster analysis divides a dataset into groups (clusters) of
     observations that are similar to each other. Partitioning methods
     like `pam', `clara', and `fanny' require that the number of
     clusters be given by the user. Hierarchical methods like `agnes',
     `diana', and `mona' construct a hierarchy of clusterings, with the
     number of clusters ranging from one to the number of observations.

_N_o_t_e:

     For small datasets (say with fewer than 200 observations), the
     function `pam' can be used directly.

_R_e_f_e_r_e_n_c_e_s:

     Kaufman, L. and Rousseeuw, P.J. (1990).  Finding Groups in Data:
     An Introduction to Cluster Analysis.  Wiley, New York.

     Struyf, A., Hubert, M. and Rousseeuw, P.J. (1997).  Integrating
     Robust Clustering Techniques in S-PLUS.  Computational Statistics
     and Data Analysis, 26, 17-37.

_S_e_e _A_l_s_o:

     `clara.object', `pam', `partition.object', `plot.partition'.

_E_x_a_m_p_l_e_s:

     library(cluster)

     ## Generate 500 objects, divided into 2 clusters.
     x <- rbind(cbind(rnorm(200, 0, 8), rnorm(200, 0, 8)),
                cbind(rnorm(300, 50, 8), rnorm(300, 50, 8)))
     clarax <- clara(x, 2)
     clarax            # print the clustering
     clarax$clusinfo   # per-cluster summary information
     plot(clarax)

     ## `xclara' is an artificial data set with 3 clusters of 1000 bivariate
     ## objects each.
     data(xclara)
     ## Plot similar to Figure 5 in Struyf et al. (1996)
     plot(clara(xclara, 3), ask = TRUE)

