Multivariate analysis refers to the analysis of data sets in which more than two observations or variables are obtained for each individual or unit studied. In genetic diversity studies, for example, gene frequencies can be determined at several loci in several breeds or populations. Multiple regression and multiple correlation are the multivariate techniques that have had the greatest application in animal breeding research. However, these techniques are not suitable when the number of observations or variables is large. Cluster analysis and principal components analysis are two multivariate methods that have been used to analyse data generated by molecular genetics studies [CS 1.10 by Okomo]; [CS 1.11 by Gwakisa].
Cluster analysis
Clustering is a technique for grouping individuals (e.g. livestock populations) into groups that are not known in advance, in order to assess the relationships between the groups. With cluster analysis, the number and characteristics of the groups are derived from the data and are not usually known before the analysis. In animal diversity studies, cluster analysis has been used to classify breeds or strains into groups on the basis of their genetic characteristics. Some initial analysis is usually recommended before clustering. Common initial analyses include scatter diagrams, profile analysis and distance measures. Scatter diagrams and profile analysis become impractical when the number of observations is large. For a large data set, distance measures are more appropriate. They define some measure of closeness or similarity between two observations. In animal breeding, such measures are called genetic distances.
- Genetic distance estimates
Genetic distances express the extent of gene differences between populations, and hence the genetic relationships among them, as a numerical quantity, usually a function of gene frequencies. There are several measures of genetic distance. In most situations, different measures yield different distance matrices, which in turn lead to different clusters. Examples include the standard genetic distance developed by Nei (1972) and the distance measure developed by Goldstein et al. (1995). The efficiencies of the various measures are compared in Takezaki and Nei (1996). Several computer programs are now available for estimating genetic distances, for example DISPAN (Ota, 1993) (see Section 12, this module).
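To illustrate how a distance is obtained from gene frequencies, the following minimal Python sketch computes Nei's (1972) standard genetic distance, D = -ln(Jxy / sqrt(Jx Jy)), where Jx, Jy and Jxy are the sums, over loci and alleles, of the products of allele frequencies within and between two populations. The function name, the data layout and the allele frequencies are assumptions made for this example and are not taken from any study cited here. No sample-size correction is applied.

```python
import math

def nei_standard_distance(freqs_x, freqs_y):
    """Nei's (1972) standard genetic distance between two populations.

    freqs_x, freqs_y: one list of allele frequencies per locus, with
    loci and alleles listed in the same order for both populations.
    """
    if len(freqs_x) != len(freqs_y):
        raise ValueError("both populations must be scored at the same loci")
    # Sums over loci and alleles of the frequency products
    j_x = sum(p * p for locus in freqs_x for p in locus)
    j_y = sum(q * q for locus in freqs_y for q in locus)
    j_xy = sum(p * q
               for locus_x, locus_y in zip(freqs_x, freqs_y)
               for p, q in zip(locus_x, locus_y))
    identity = j_xy / math.sqrt(j_x * j_y)  # normalized genetic identity
    # Undefined (math domain error) if the populations share no alleles
    return -math.log(identity)

# Hypothetical allele frequencies at two loci (illustrative values only)
populations = {
    "A": [[0.6, 0.4], [0.2, 0.5, 0.3]],
    "B": [[0.5, 0.5], [0.3, 0.4, 0.3]],
    "C": [[0.1, 0.9], [0.7, 0.1, 0.2]],
}
names = sorted(populations)

# Pairwise distance matrix, the usual input to a clustering method
distance_matrix = [
    [nei_standard_distance(populations[r], populations[c]) for c in names]
    for r in names
]

for name, row in zip(names, distance_matrix):
    print(name, [round(d, 4) for d in row])
```

In this made-up example, populations A and B have similar frequencies and therefore a small distance, whereas C lies far from both. A different distance measure applied to the same frequencies would generally give a different matrix.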
- Phylogenetic analysis
The commonly used methods of clustering fall into two general categories: hierarchical and non-hierarchical. Hierarchical procedures are the most commonly used in animal diversity studies. When there are more than two variables and the data set is large, the results are usually displayed as a dendrogram. In a dendrogram, the horizontal axis lists the observations in a particular order and the vertical axis shows the successive clustering steps or cluster numbers.
In animal diversity studies, hierarchical procedures are called phylogenetic analysis. The genetic distance measures are used to construct the dendrograms, also called phylogenetic trees. The two most commonly used methods for constructing the trees are the unweighted pair group method with arithmetic mean (UPGMA) and the neighbour-joining (NJ) method (Saitou and Nei, 1987). The operational taxonomic units (OTUs) in breeding are livestock populations or breeds. The phylogenetic trees therefore summarize evolutionary relationships among breeds or populations and categorize the populations (cattle, for example) into distinct genetic groups. The trees consist of nodes and branches. The terminal nodes represent the breeds, and the branch lengths between breeds are graphical estimates of the genetic distances between them, giving an indication of the genetic relationships between breeds. Because UPGMA assumes a constant rate of divergence, UPGMA trees also give an indication of the time of separation (divergence) of breeds: the longer the branch, the longer the period of separation between breeds [CS 1.10 by Okomo]; [CS 1.11 by Gwakisa].
Bootstrapping is usually done to provide confidence statements about the groupings of the breeds revealed by the dendrograms, and hence to test the validity of the clusters obtained. Bootstrap values are given as percentages; the higher the value, the greater the confidence in the grouping. Programs such as SAS (Statistical Analysis System) and SPSS can produce dendrograms.
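The way UPGMA builds a tree from a distance matrix can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the procedure used by any particular program: the function name, the breed names and the distances are hypothetical, each node height is taken as half the average distance between the two clusters joined, and neither neighbour-joining nor bootstrapping is shown.

```python
def upgma(names, dist):
    """Cluster items by UPGMA from a symmetric distance matrix.

    Returns the merges in order as (cluster_a, cluster_b, height), where
    each cluster is a tuple of names and height is half the average
    distance between the two clusters at the moment they are joined.
    """
    index = {name: i for i, name in enumerate(names)}
    clusters = [(name,) for name in names]  # start with one cluster per item

    def average_distance(a, b):
        # Unweighted mean of all pairwise distances between members
        total = sum(dist[index[x]][index[y]] for x in a for y in b)
        return total / (len(a) * len(b))

    merges = []
    while len(clusters) > 1:
        # Find the closest pair of current clusters
        d, i, j = min(
            (average_distance(a, b), i, j)
            for i, a in enumerate(clusters)
            for j, b in enumerate(clusters)
            if i < j
        )
        a, b = clusters[i], clusters[j]
        merges.append((a, b, d / 2))
        # Replace the two clusters with their union
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(a + b)
    return merges

# Hypothetical genetic distances among four breeds (illustrative only)
breeds = ["Breed1", "Breed2", "Breed3", "Breed4"]
distances = [
    [0.00, 0.10, 0.40, 0.50],
    [0.10, 0.00, 0.44, 0.46],
    [0.40, 0.44, 0.00, 0.20],
    [0.50, 0.46, 0.20, 0.00],
]

merges = upgma(breeds, distances)
for a, b, height in merges:
    print("join", a, "and", b, "at height", round(height, 3))
```

Each merge corresponds to one internal node of the dendrogram, and its height to the position of that node on the vertical axis. For real data, an established implementation is preferable; for instance, SciPy's scipy.cluster.hierarchy.linkage with method="average" performs the same average-linkage clustering.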
There are some problems with hierarchical procedures. An undesirable early combination can persist throughout the analysis and may lead to artificial results. It may then become necessary to perform the analysis several times after deleting certain suspect observations. For large sample sizes, the printed dendrograms become too large and unwieldy to read. Another important problem is how to select the number of clusters. No standard objective procedure exists for making the selection. The distance between clusters at successive steps may serve as a guide. In addition, the underlying situation may suggest a natural number of clusters.
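One simple way of using the distance between clusters at successive steps as a guide is to look for the largest jump between consecutive merge distances and to stop clustering just before it. The Python sketch below shows this heuristic; the function name and the example distances are assumptions made for illustration, and the result is only a suggestion, not an objective rule.

```python
def suggest_cluster_count(merge_distances):
    """Suggest a number of clusters from the largest jump in merge distance.

    merge_distances: the distance at which each successive merge took
    place, in the order the merges were made (n items give n - 1 merges).
    """
    if len(merge_distances) < 2:
        raise ValueError("need at least two merges to compare steps")
    n_items = len(merge_distances) + 1
    # Increase in merge distance from each step to the next
    gaps = [later - earlier
            for earlier, later in zip(merge_distances, merge_distances[1:])]
    # Number of merges carried out before the largest jump
    merges_kept = gaps.index(max(gaps)) + 1
    # Every merge reduces the number of clusters by one
    return n_items - merges_kept

# Hypothetical merge distances from clustering four breeds
print(suggest_cluster_count([0.10, 0.20, 0.45]))
```

With these made-up values the largest jump occurs before the last merge, so the heuristic suggests stopping at two clusters. Knowledge of the populations involved should still be weighed against such a suggestion.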
Principal components analysis
Principal components analysis (PCA) provides a method of explaining the covariance structure among a large system of measurements by generating a smaller number of artificial variates. In this manner, principal components can be used objectively to evaluate variation in measurements and to increase understanding of structural relationships as an entity rather than as a series of individual and independent relationships. In PCA, the variables are treated equally rather than being divided into dependent and independent variables, as is done in regression analysis. The original variables are transformed into new uncorrelated variables called principal components (PCs). Each PC is a linear combination of the original variables. The initial variates are replaced with a smaller number of latent variates (the PCs), allowing the data to be summarized more concisely with minimal loss of information. Thus, instead of analysing a large number of original variables with complex interrelationships, the investigator can analyse a smaller number of uncorrelated PCs (Morrison, 1976).
One of the measures used to determine the amount of information conveyed by each PC is its variance, which is given by the corresponding eigenvalue. For this reason, the PCs are arranged in order of decreasing variance: the most informative PC is the first and the least informative is the last. A variable with zero variance does not distinguish between the members of the population. To reduce the dimensionality of a problem, only the first few PCs are analysed. The PCs not analysed convey only a small amount of information, since their variances are small. The number of components to retain may be determined by examining the proportion of total variance explained by each component. The cumulative proportion of total variance shows the investigator how much information is retained by selecting a specified number of components. Ideally, a small number of PCs should explain a large percentage of the total variance. Once the number of PCs is selected, the investigator should examine the coefficients defining each of them in order to assign an interpretation to the components. A high coefficient of a PC on a given variable indicates a high correlation between that variable and the PC. PC scatter graphs are drawn by plotting the PC coefficients, in two or three dimensions. Related breeds cluster together on these graphs.
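The quantities described above (eigenvalues, proportion of variance explained, and coefficients of each PC) can be illustrated with a small Python sketch. It is a minimal illustration under stated assumptions: the function names and the allele frequencies are hypothetical, populations are placed in rows and variables in columns, and the eigenvalues of the covariance matrix are found by simple power iteration with deflation, whereas statistical packages rely on more robust numerical routines.

```python
import math

def covariance_matrix(rows):
    """Return the sample covariance matrix and the mean-centred data."""
    n, p = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(p)]
    centred = [[row[j] - means[j] for j in range(p)] for row in rows]
    cov = [[sum(c[a] * c[b] for c in centred) / (n - 1) for b in range(p)]
           for a in range(p)]
    return cov, centred

def leading_eigenpair(matrix, iterations=500):
    """Largest eigenvalue and its unit eigenvector, by power iteration."""
    p = len(matrix)
    start = [1.0 / (j + 1) for j in range(p)]  # arbitrary non-zero start
    scale = math.sqrt(sum(x * x for x in start))
    vec = [x / scale for x in start]
    for _ in range(iterations):
        nxt = [sum(matrix[i][j] * vec[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm == 0.0:
            break  # no variance left to extract
        vec = [x / norm for x in nxt]
    # Rayleigh quotient gives the eigenvalue for the converged vector
    value = sum(vec[i] * matrix[i][j] * vec[j]
                for i in range(p) for j in range(p))
    return value, vec

def principal_components(rows, n_components):
    """Return (eigenvalue, coefficients) pairs, explained proportions, scores."""
    cov, centred = covariance_matrix(rows)
    p = len(cov)
    total_variance = sum(cov[i][i] for i in range(p))
    work = [row[:] for row in cov]
    components = []
    for _ in range(n_components):
        value, vec = leading_eigenpair(work)
        components.append((value, vec))
        # Deflation: remove the variance already accounted for
        work = [[work[i][j] - value * vec[i] * vec[j] for j in range(p)]
                for i in range(p)]
    explained = [value / total_variance for value, _ in components]
    # Coordinates of each row (population) on each component
    scores = [[sum(c[j] * vec[j] for j in range(p)) for _, vec in components]
              for c in centred]
    return components, explained, scores

# Hypothetical frequencies of three alleles in five populations
frequencies = [
    [0.10, 0.80, 0.30],
    [0.15, 0.75, 0.35],
    [0.60, 0.30, 0.40],
    [0.65, 0.20, 0.30],
    [0.40, 0.50, 0.90],
]

components, explained, scores = principal_components(frequencies, 2)
for k, ((value, coeffs), share) in enumerate(zip(components, explained), 1):
    print("PC%d" % k, "variance", round(value, 4),
          "proportion", round(share, 3),
          "coefficients", [round(c, 3) for c in coeffs])
print("cumulative proportion", round(sum(explained), 3))
```

The printed proportions show how much of the total variance the first two components retain, and the coefficients show which variables dominate each component. The sign of a component is arbitrary, so only the relative positions of populations on a component are meaningful.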
The PCA procedures in genetic studies were described by Cavalli-Sforza et al. (1994). In animal genetic diversity studies, PCs have been used to determine relationships among populations, supplementing relationships determined using phylogenetic analyses (e.g. Okomo, 1997). PCs can be more convenient than phylogenetic trees if clusters of populations are more visible. They are also more flexible than trees since they can use a greater number of parameters. It is usually easier to compare PC maps than it is to compare trees.