hierclust               package:multiv               R Documentation

_H_i_e_r_a_r_c_h_i_c_a_l _C_l_u_s_t_e_r_i_n_g

_D_e_s_c_r_i_p_t_i_o_n:

     Agglomerates the closest pair of observations, and replaces them
     with a cluster.  Agglomerates the next closest pair of
     observations or clusters. Continues through `n-1' agglomerations,
     assuming there are `n' observations.

_U_s_a_g_e:

     hierclust(a, method = 1, bign = FALSE, diagnostics = FALSE, sim = " ",
                  movie = FALSE, option = "thru", alpha = 1.0, repres = "chull", 
                  show = "change")

     hierclust(dst, method="compact", sim=)

_A_r_g_u_m_e_n_t_s:

       a: data matrix to be analyzed, the rows representing
          observations and the columns variables; or, alternatively,
          lower-diagonal dissimilarity matrix in vector or list format.
           In latter case, the number of values is `n(n-1)/2'  where
          `n' is the number of observations.

          (See below for second variant of usage of this function.)

          First variant:

  method: First possibility: integer taking the value between 1 and 7. 
          These methods  correspond to: Ward's minimum variance or
          error sum of squares method (1); single linkage or nearest
          neighbor method (2); complete linkage or diameter (3);
          average linkage, group average, or UPGMA method (4);
          McQuitty's or WPGMA method (5); median, Gower's or WPGMC
          method (6); and centroid or UPGMC method (7). If the `movie'
          option is used, only `method's 1 to 4 are supported. Second
          possibility: string `compact' (default; complete linkage), 
          `average' (average linkage), or `connected' (single linkage). 

     sim: structure giving similarities rather than dissimilarities. 
          This can either be a symmetric matrix or a vector with a
          `Size' attribute.  Only one of  dissimilarities or
          similarities may be specified (or, of course, as an
          alternative, a data matrix. 

    bign: Is `n' big?  If storage is problemsome, a different
          implementation of the  Ward criterion may be tried.  This
          determines dissimilarities on the fly, and hence requires
          `O(n)' storage. 

diagnostics: show, or suppress, some information directed to the
          command window.  Suppression is useful if functional
          programming capabilities are availed of, leading to lazy
          evaluation; this often results in multiple evaluations,  with
          consequent repetition of diagnostic information. 

   movie: whether interactive display of construction of hierarchy is
          wanted.  This necessitates an open plot window.  It does not
          produce an output object.   All remaining arguments apply
          only in the case of the `movie=TRUE'. 

  option: (Only for `movie'.) "thru" (default) or "prompt": Go right
          through all n-1  agglomerations, or ask the user whether to
          continue at each agglomeration?  Here is what we recommend:
          On the first occasion, go right through.  Check out the
          cluster criterion values reported on in the command window. 
          Then using the 'prompt' option, go as far as the desired
          partition, and print it out. 

   alpha: (Only for `movie'.) 1.0 (default) or 0.0 < alpha <= 1.0. Only
          used by the minimum variance agglomerative criterion. Alpha
          downweights the variance component on the principal axis of
          the cluster, and thereby redefines the cluster criterion such
          that linear clusters are preferred.  A value of alpha = 1.0
          gives exactly the traditional Ward's minimum variance method.
           See Murtagh and Raftery (1984) for more on this altered
          criterion. 

  repres: (Only for `movie'.) repres = "chull" (default), "lines",
          "gauss": Representation of clusters: For input data which has
          dimensionality greater than 2, only "chull" is allowed (since
          the representations were found to be too problematic, given
          that the clustering takes place in m-dimensions, and the the
          representation is necessarily a 2-dimensional projection of
          this). Representation "chull" = convex hull, or straight line
          in case of collinear points.  (We test for collinearity using
          principal components analysis.) Representation "lines" =
          lines of best fit (least squares fit is used, and the
          extremities are defined by the intersection of the min. and
          max. x values on this LS fit line). "gauss" = 1-sigma
          circles.  Side-effect of latter: warning messages indicate
          whenever circles overlap plot boundaries. 

    show: (Only for `movie'.) "change" or "all": In this method's
          present implementation, the clusters at each stage are
          `highlighted' by means of a different symbol.  Option show =
          "change" has this highlighting changed at each agglomeration,
          so that only the most recent agglomerands are shown.  Option
          show = "all" does not turn off the highlighting in this way.

          Second variant (which re-wraps S function `hclust': cf. help
          file on `hclust'):

          Exactly one of dst or sim must be specified.

     dst: a distance structure or distance matrix.  Normally this will
          be the result of the function `dist', but it can be any data
          of the form returned by `dist', or  a  full,  symmetric
          matrix.  Missing values are not allowed. 

  method: character string giving  the  clustering  method.   The three
           methods  currently implemented are "average", "con- nected"
          (single linkage) and "compact" (complete linkage). (The first
          three characters of the method are sufficient.) 

    sim=: structure giving similarities  rather  than  distances. This 
          can  either be a symmetric matrix or a vector with a "Size"
          attribute.  Missing values are not allowed. 

_V_a_l_u_e:

     When `movie' is not specified in the first variant, this is a 
     list describing the hierarchical clustering of the observations.
     The `movie' option is considerably slower than alternative
     options, due to the processing involved.  Hence its use is
     recommended only on small data sets (less than 100 observations).

   merge: an `n-1' by 2 matrix. Row `i' of `merge' describes the
          merging of clusters at step `i' of the clustering.  If an
          element `j' in the row is negative, then observation `-j' was
          merged at this stage.  If `j' is positive  then the merge was
          with the cluster formed at the (earlier) stage `j' of the
          algorithm. Thus negative entries in `merge' indicate
          agglomerations of singletons, and  positive entries indicate
          agglomerations of non-singletons. 

  height: a set of `n-1' non-decreasing real values.  The clustering
          "height": that is, the value of the criterion associated with
          the clustering `method' for the particular agglomeration. 

   order: a vector giving the permutation of the original observations
          suitable for  plotting, in the sense that a cluster plot
          using this ordering and matrix `merge' will not have
          crossings of the branches.

          In hierarchical cluster displays, a decision is needed at
          each merge to specify which subtree should go on the left and
          which on the right.  Since,  for  `n' observations there are
          `n-1' merges, there are `2^(n-1)' possible orderings for the
          leaves in a cluster tree, or dendrogram.  The default
          algorithm in `hc' is to order the subtree so that the tighter
          cluster is on the left (the last, i.e. most recent, merge of
          the left subtree is at a lower  value than the last merge of
          the right subtree).  Observations are the tightest clusters
          possible, and merges involving two observations place them in
          order by their  observation sequence number. 

_N_o_t_e:

     The squared Euclidean distance is used in the `movie' case, and in
     the case when a data matrix is input. Note that the `movie' option
     returns  invisibly.

_M_e_t_h_o_d:

     In the case of all methods when `bign = FALSE' (default),  the
     Lance-Williams dissimilarity update formula is used to allow the 7
     methods to be implemented.   

     In the direct data matrix case,  a list of nearest neighbors is
     maintained, which  allows the closest pair of clusters or
     observations to be determined by scanning this list to find the
     pair of agglomerands on each iteration. Hence `O(n)' operations
     are required for each of `n-1' iterations, i.e. `O(n^2)'
     computation time in total.  The  initial setting up of the `n'
     nearest neighbors also requires `O(n^2)' time. The processing to
     be carried out following each agglomeration is `O(n)'. In total,
     all methods implemented in `hc' are of time complexity `O(n^2)'.
     Unless `bign = TRUE' (see below), storage  complexity is `O(n^2)'.
      Typical times on a Sun SPARCstation 1 for 400x4 and 800x4 arrays:
     12 and 48 secs.

     Ward's minimum variance method aims at finding compact, spherical
     clusters. The complete link method finds similar clusters. The
     single link method, which is closely related to the minimal
     spanning tree, adopts a  `friends of friends' clustering strategy.
       The other methods can be regarded as aiming for clusters with
     characteristics somewhere between the single and complete link
     methods. 

     In the case of the Ward method, and `bign = TRUE', the
     nearest-neighbor  chain algorithm is used, which gives an `O(n^2)'
     computation time algorithm. The input matrix is stored, but not
     the dissimilarities produced.  Computation time is marginally
     inferior to the implementation of Ward's method when `bign =
     FALSE' (default).  Sample  times for a 400x4 and an 800x4 array on
     a Sun SPARCstation 1: 12.5 and 50 secs.

_R_e_f_e_r_e_n_c_e_s:

     A small sample: 

     B. Everitt,  Cluster Analysis Heinemann Educ. Books, London 1974; 

     P.H.A. Sneath and R.R. Sokal,  Numerical Taxonomy Freeman, San
     Francisco, 1973; 

     M.R. Anderberg,  Cluster Analysis for Applications Academic Press,
     New York, 1973;

     A.D. Gordon,  Classification Chapman and Hall, London, 1981; and

     F. Murtagh,  Multidimensional Clustering Algorithms COMPSTAT
     Lectures 4, Physica-Verlag, Wuerzburg, 1985 (for algorithmic
     details of algorithms used).

_S_e_e _A_l_s_o:

     For finding assignments of objects to clusters, use `members'.
     'Dist' is used for creating (dis)similarities.  Functions
     `cutree', and `dist' are already available in R.  Function
     `members' is available in this function suite.

_E_x_a_m_p_l_e_s:

     data(iris)
     iris <- as.matrix(iris[,1:4])
     # Default Ward criterion
     hierclust(iris, movie = TRUE)
     h <- hierclust(iris)
     k <- members(h)

