Often, rather than estimating specific differences between treatments, one may be interested in estimating variances (phenotypic, genetic and environmental) due to the effects. For example, in milk production the interest may not be to estimate differences between cows, but rather in estimating the variation among the cows as an estimate of the variation from a ‘larger’ population from which they were sampled. The cows can be considered as random effects and the data can be analysed according to a mixed effects model [Biometrics example 3].
Variance component estimation
The data described in Section 2.1 are used to estimate genetic and environmental variances needed to calculate genetic parameters (e.g. heritability) and the tests of significance for both genetic and non-genetic parameters estimated from the data.
When estimating variance components, the total variation for a trait under study is split into constituent components: genetic (additive and non-additive) and environmental. Depending on the data, different types of random effects models can be fitted. For example, dairy production data from collateral relatives (e.g. full-sibs and half-sibs) could be analysed by fitting a sire model. Covariances generated by these relationships provide the information required for estimation of additive genetic variance. Alternatively, linear models containing both genetic and environmental effects for each animal (animal model) could be used.
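As a rough illustration of how total variation is split into components under a sire model, the sketch below uses a balanced paternal half-sib design and the classical analysis-of-variance (method of moments) estimator. This is a simpler technique than likelihood-based estimation, and the records, the function name and the assumption that the sire component equals one quarter of the additive genetic variance are illustrative only.

```python
def sire_model_anova(groups):
    """Estimate variance components from a balanced paternal half-sib design.

    groups: list of lists, one list of progeny records per sire (equal sizes).
    Returns (sire_variance, residual_variance, heritability).
    """
    s = len(groups)                 # number of sires
    n = len(groups[0])              # progeny per sire (balanced design assumed)
    if any(len(g) != n for g in groups):
        raise ValueError("this sketch assumes equal progeny numbers per sire")

    grand_mean = sum(sum(g) for g in groups) / (s * n)
    sire_means = [sum(g) / n for g in groups]

    # Between-sire and within-sire sums of squares
    ss_between = n * sum((m - grand_mean) ** 2 for m in sire_means)
    ss_within = sum((y - m) ** 2 for g, m in zip(groups, sire_means) for y in g)

    ms_between = ss_between / (s - 1)
    ms_within = ss_within / (s * (n - 1))

    # Expected mean squares: E[MSB] = var_e + n * var_s and E[MSW] = var_e
    var_e = ms_within
    var_s = max((ms_between - ms_within) / n, 0.0)   # truncate negative estimates

    # Assumption: half-sibs share one quarter of the additive genetic variance
    h2 = 4 * var_s / (var_s + var_e)
    return var_s, var_e, h2


# Hypothetical daily milk yields (kg) for the progeny of three sires
records = [[12.0, 14.0, 13.0, 15.0],
           [13.0, 16.0, 14.0, 15.0],
           [11.0, 15.0, 13.0, 13.0]]
var_s, var_e, h2 = sire_model_anova(records)
print(f"sire variance={var_s:.3f} residual variance={var_e:.3f} h2={h2:.3f}")
```

With so few sires the estimates are very imprecise; real analyses use many more families and, usually, the likelihood-based methods described in this section.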
The most widely used methods in variance component estimation are maximum likelihood (ML) procedures. These procedures estimate the fixed effects and variance components simultaneously. Animal breeders are increasingly confronted with data sets that have arisen from either selection experiments or from farm testing in which selection has been practised. If there is a lack of records because of selection based on some criterion that is correlated to trait(s) under analysis, the resultant estimates are likely to be biased by selection. In addition, following selection, variances of breeding values are reduced, breeding values of unrelated animals could become correlated, errors become correlated and breeding values become correlated with errors. ML statistical procedures can accommodate any structure of genetic relationship in the data, suitably weighted, do not require balanced designs and can account for selection of parents (Harville, 1977; Meyer, 1989; Falconer and Mackay, 1996).
A modified ML procedure, i.e. restricted maximum likelihood (REML) (Patterson and Thompson, 1971), has become the preferred method of analysis in animal breeding, not least for its ability to reduce selection bias. It accounts for the loss in degrees of freedom due to fixed effects in the model of analysis. In other words, it accounts for the fact that, for a given data size, more information is lost and cannot be used for estimation of variance components when one wants to estimate more levels of fixed effects.
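The degrees-of-freedom point can be seen in the simplest possible case, a model with fixed group means and a single residual variance. The sketch below is an illustration under that assumption (the herd records and function name are hypothetical): ML divides the residual sum of squares by the number of records, whereas REML divides by the number of records minus the number of fixed-effect levels.

```python
def residual_variance_estimates(groups):
    """Return (ml, reml) residual variance estimates for the fixed one-way
    model y_ij = mu_i + e_ij, where each group mean mu_i is a fixed effect."""
    n = sum(len(g) for g in groups)        # number of records
    p = len(groups)                        # number of fixed-effect levels
    rss = 0.0
    for g in groups:
        mean = sum(g) / len(g)
        rss += sum((y - mean) ** 2 for y in g)
    ml = rss / n            # ML ignores the df used to estimate the fixed effects
    reml = rss / (n - p)    # REML divides by the residual degrees of freedom
    return ml, reml


# Hypothetical records from two herds (the fixed effect)
herds = [[10.0, 12.0, 14.0], [20.0, 22.0, 24.0, 26.0]]
ml, reml = residual_variance_estimates(herds)
print(f"ML estimate={ml:.2f} REML estimate={reml:.2f}")
```

The gap between the two estimates widens as more fixed-effect levels are fitted to the same number of records.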
There are several numerical procedures to find the variance components that maximize the (restricted) likelihood function. They differ in whether they compute only the likelihood function (‘derivative-free’ algorithms), first derivatives as well (EM or quasi-Newton algorithms) or both first and second derivatives with respect to the variance components (Newton, Fisher scoring or average information algorithms). Impressive progress has been made in developing efficient computing algorithms for REML estimation. This, together with increasing computing power, has enabled the analysis of quite complex statistical models in large data sets [Biometrics example 3]. Several suites of programmes for estimation of variance components are available to the scientific community free of charge, e.g. VCE (developed by Eildert Groeneveld), DMU (the Danish team in Foulum), REMLF90 (Misztal) and WOMBAT (developed by Karin Meyer) [Web pages, Section 12, this module].
Prediction of genetic merit
There are various methods available to estimate breeding values. The quality of data will determine what method is chosen. Complete data sets will have information on performance and identity of animals. When identity and relationships are known, pedigrees can be compiled. Availability of pedigree data allows modern methods of prediction of breeding values to be used. However, to collect complete records requires that infrastructure such as identity and performance recording schemes be in place and that these schemes be well managed [CS 1.15 by Dzama]. Such schemes do not exist in most developing countries yet, and where present, financial and management constraints result in data that has a lot of missing information.
Realized values of the random variables that have been sampled from a population can be estimated if the variance–covariance structure of the population is known. The estimation of realized values of a random variable is called prediction. There are various types of predictors: best predictor (BP), best linear predictor (BLP, e.g. selection index) and best linear unbiased predictor (BLUP) (Henderson, 1984). The differences between BP, BLP and BLUP are subtle yet statistically important [van der Werf in ICAR Tech. Series No. 3].
BLUP is the most commonly used predictor to evaluate the genetic merit of livestock and to support selection decisions. Several programmes that can be used for prediction of BLUP breeding values are available to the scientific community free of charge, e.g. [PEST], [WOMBAT] and BLUPF90 (Misztal) (see Web pages, Section 11, this module). BLUP can accommodate non-random mating and reduce bias due to selection, provided that the data on which selection was practised are included in the analysis. In BLUP analysis, one equation for each level of each fixed or random factor is required so that effects can be estimated simultaneously (Henderson, 1975). If there are sufficient connections between herds, as is usually the case with the use of artificial insemination, selection on BLUP values can be done on a breed (rather than herd) basis [Computer exercises: BLUP].
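The idea of one equation per level of each fixed or random factor can be sketched with a very small set of mixed model equations. The example below is a minimal sketch, not the implementation used by any of the programmes named above: it assumes a model with herd as the only fixed effect, unrelated sires as the random effect, and a known variance ratio (residual variance divided by sire variance, here 15, which corresponds to a heritability of 0.25 under a sire model). All data and function names are hypothetical.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def sire_model_blup(records, n_herds, n_sires, lam):
    """Build and solve mixed model equations for y = herd (fixed) + sire (random) + e.

    records: list of (herd_index, sire_index, y); sires are assumed unrelated.
    lam: assumed ratio of residual variance to sire variance.
    """
    size = n_herds + n_sires               # one equation per herd and per sire
    lhs = [[0.0] * size for _ in range(size)]
    rhs = [0.0] * size
    for herd, sire, y in records:
        h, s = herd, n_herds + sire        # equation numbers touched by this record
        for i in (h, s):
            rhs[i] += y
            for j in (h, s):
                lhs[i][j] += 1.0
    for s in range(n_herds, size):
        lhs[s][s] += lam                   # add I * lam to the sire block
    sol = solve(lhs, rhs)
    return sol[:n_herds], sol[n_herds:]


# Hypothetical records: (herd, sire, yield)
records = [(0, 0, 30.0), (0, 1, 28.0), (0, 2, 25.0), (0, 0, 32.0),
           (1, 0, 35.0), (1, 1, 31.0), (1, 2, 30.0)]
herd_sol, sire_sol = sire_model_blup(records, n_herds=2, n_sires=3, lam=15.0)
print("herd estimates:", [round(v, 3) for v in herd_sol])
print("sire predictions:", [round(v, 3) for v in sire_sol])
```

The sire predictions are shrunk towards zero relative to the herd-corrected sire means, and the amount of shrinkage depends on the variance ratio and on the number of progeny per sire.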
The sources of information that can be used to calculate BLUP breeding values include the performance of the animal itself and of its parents and progeny, which are linked to the animal through the pedigree.
Models for calculating BLUP breeding values
The animal model is now the standard method for calculating breeding values. In an animal model, the performance of an individual animal and all known pedigree relationships are used to estimate its breeding value. The model is characterized by the fitting of a random component for the breeding value of each animal (Mrode, 2005). Use of an animal model results in a set of simultaneous equations with an order equal to the number of animals included in the analysis (with performance of their descendants), plus an additional equation for each fixed effect (Hill and Meyer, 1988). The animal model accounts for all the genetic relationships among the individuals whose breeding values are to be estimated and can account for repeated records, multiple traits, non-additive genetic effects, litter effects and a number of environmental effects, both fixed and random (Henderson, 1988). The implementation of animal models improves the correlation between proofs and true genetic values because all information is considered (Jansen, 1990; Banos et al., 1991) [Computer exercises: BLUP].
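The pedigree relationships used by an animal model are usually summarised in the numerator relationship matrix. The sketch below builds that matrix with the tabular method for a small hypothetical pedigree; it assumes animals are numbered so that parents precede their offspring and that 0 denotes an unknown parent. It is an illustration only and does not reflect how any particular BLUP programme stores or inverts the matrix.

```python
def relationship_matrix(pedigree):
    """Numerator relationship matrix by the tabular method.

    pedigree: list of (sire, dam) tuples for animals 1..n in order, with parents
    listed before their offspring and 0 denoting an unknown parent.
    """
    n = len(pedigree)
    a = [[0.0] * (n + 1) for _ in range(n + 1)]   # row/column 0 = unknown parent
    for i in range(1, n + 1):
        sire, dam = pedigree[i - 1]
        # Relationship with earlier animals: average of the parents' relationships
        for j in range(1, i):
            a[i][j] = a[j][i] = 0.5 * (a[j][sire] + a[j][dam])
        # Diagonal: 1 + inbreeding coefficient (half the relationship between parents)
        a[i][i] = 1.0 + 0.5 * a[sire][dam]
    return [row[1:] for row in a[1:]]             # drop the dummy row and column


# Hypothetical pedigree: animals 3 and 4 are full-sibs, animal 5 is their offspring
pedigree = [(0, 0), (0, 0), (1, 2), (1, 2), (3, 4)]
A = relationship_matrix(pedigree)
for row in A:
    print([round(v, 3) for v in row])
```

In this example the diagonal element for animal 5 exceeds one because its parents are full-sibs, so the animal is inbred.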
Due to computing constraints and data limitations or peculiarities, approximations or other models simpler than the animal model have been used. These include:
- Sire models, where records are grouped according to the sire’s identity. When using a sire model, the dams are not represented, that is they are implicitly assumed to be non-related, non-inbred and unselected. Sons of sires are accounted for in the relationship matrix between sires. Use of sire models thus leads to a downward bias in parameter estimates as only half-sib relationships are acknowledged (Henderson, 1986; Meyer, 1987).
- Sire maternal grandsire models, where in addition to effects in a sire model, the effect of the dam of an animal is considered through its maternal grandsire. Here the maternal granddams are assumed unrelated, non-inbred and unselected.
Longitudinal data analysis
Some measured traits, such as weights or milk production, are repeated over the life of the animal. It is often not adequate to consider that two such observations obtained at different ages or stages of lactation are phenotypic expressions of the same (genetic) trait. In many cases, one wants to take into account the fact that two consecutive observations are more similar than two observations far apart in time. Furthermore, the interval between measurements on the same animal may vary greatly. Therefore ‘traditional discrete’ multivariate models are not efficient. Records of this kind are called longitudinal data.
Random regression models
Random regression models (RRM) can be used to analyse longitudinal data. These models provide a means to estimate genetic parameters for all ages without correcting the observations to certain landmark ages (Lewis and Brotherstone, 2002; Nobre et al., 2003). The models use fixed regression coefficients to account for overall and within fixed class trends while fitting random regression coefficients for each individual to allow for individual variations in the trajectory. For example, the genetic component of the model will be described as a polynomial function (linear, quadratic or higher order) of time. The usual assumptions (multivariate normality using a relationship matrix) are extended to all (random) coefficients of this function. This modelling defines a particular genetic covariance between any two points in time. The continuous function that represents the variance and covariance of traits measured at different times is called a covariance function (CF) (Meyer, 1998; van der Werf et al., 1998; Schaeffer, 2004). CFs are an infinite dimensional equivalent of a covariance matrix for a given number of records taken at different ages (Meyer and Hill, 1997; Huisman et al., 2002). For RRMs, the covariance function coefficients can be estimated directly by restricted maximum likelihood (REML) (Meyer and Hill, 1997; Albuquerque and Meyer, 2001).
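To make the link between random regression coefficients and a covariance function concrete, the sketch below evaluates the genetic covariance between any two ages as phi(t1)' K phi(t2). It assumes a quadratic fit with normalised Legendre polynomials as the basis (a common but not the only choice), and the coefficient matrix K, the age range and the function names are hypothetical values chosen for illustration.

```python
import math


def legendre_basis(age, age_min, age_max):
    """Normalised Legendre polynomials of order 0 to 2 at a standardised age."""
    x = -1.0 + 2.0 * (age - age_min) / (age_max - age_min)   # map age to [-1, 1]
    return [math.sqrt(0.5),
            math.sqrt(1.5) * x,
            math.sqrt(2.5) * 0.5 * (3.0 * x * x - 1.0)]


def covariance_function(age1, age2, k, age_min, age_max):
    """Genetic covariance between two ages: phi(age1)' K phi(age2)."""
    p1 = legendre_basis(age1, age_min, age_max)
    p2 = legendre_basis(age2, age_min, age_max)
    return sum(p1[i] * k[i][j] * p2[j] for i in range(3) for j in range(3))


# Hypothetical (co)variances of the random regression coefficients
K = [[10.0, 2.0, -0.5],
     [2.0, 1.5, 0.1],
     [-0.5, 0.1, 0.3]]
AGE_MIN, AGE_MAX = 5.0, 305.0     # assumed range, e.g. days in milk

var_50 = covariance_function(50, 50, K, AGE_MIN, AGE_MAX)
var_250 = covariance_function(250, 250, K, AGE_MIN, AGE_MAX)
cov_50_250 = covariance_function(50, 250, K, AGE_MIN, AGE_MAX)
corr_50_250 = cov_50_250 / math.sqrt(var_50 * var_250)
print(f"genetic correlation between day 50 and day 250: {corr_50_250:.3f}")
```

Because the function is continuous, variances and covariances can be obtained for any pair of ages in the range, including ages at which no animal was recorded.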
Test day models in dairy production
Genetic evaluations for dairy cattle in many countries are obtained by analysing 305-day yields (or equivalent cumulative yield records) predicted from a few test-day yields (i.e. from longitudinal measurements). Predicting 305-day yields from monthly test-day records assumes that such records within a single lactation measure the same trait for the whole duration of lactation. The error of genetic evaluation may further increase if 305-day yields are obtained by projecting partial lactations with factors that assume, contrary to reality, a constant shape of the lactation curve for all cows. Test-day records, however, are repeated observations measured along a trajectory (days in milk) and the mean and covariance between measures change gradually along the trajectory. Genetic evaluations based directly on test-day records can overcome the need to predict 305-day yields or project incomplete lactations.
Test-day models can facilitate a cheaper and more flexible recording scheme. The advantages of using these models as outlined by various authors (Stanton et al., 1992; Ptak and Schaeffer, 1993; Wiggans and Goddard, 1996; van Raden, 1997; Swalve, 1998) are:
- They can account for variable amounts of information from different lactations. By having four or more test-day yields per cow per lactation, the accuracy of a cow’s genetic evaluation may be better.
- They permit estimates of fixed effects to vary across herds and stages of lactation.
- The models can describe biology and define management groups more precisely and can account for differences in the shape of the lactation curve.
- They adjust for differing effects of sampling date. The models can account for short-term seasonal effects associated with actual time of production.
- No assumptions about the ‘normal’ length of a single lactation have to be made.
Test-day models therefore offer an opportunity to improve the genetic evaluation of dairy cattle in tropical production situations, where infrastructure to support sophisticated or detailed recording systems is limited and production conditions are constrained by environment and resources, often resulting in data sets too small to allow accurate genetic evaluation of bulls (Swalve, 1998). Random regression analytical techniques are now the norm for evaluating test-day yields.
Estimation of genotype by environment interactions
Tropical countries seeking to improve production levels have often imported exotic germplasm and then carried out selection in the imported population and their progeny under local conditions. This strategy is effective if production and marketing environments and selection objectives are similar for both the original and the recipient countries or production systems. However, unfavourable G × E interaction would reduce potential benefits from a strategy based entirely on continuous importation of superior germplasm from elsewhere [CS 1.16 by Mpofu]. G × E interactions are of two forms: firstly, correlations for the same trait in two environments may be significantly less than one, implying that the genetic basis for the trait differs between environments (Falconer and Mackay, 1996). The ranking of additive genetic values and hence optimal choices of selected animals may not be the same in alternative environments (Stanton et al., 1992; Calus, 2006). The second form of G × E interaction occurs when the scale of differences among breeding values for a specific trait is unequal between environments, termed ‘pseudo’ G × E interaction (Dickerson, 1962). In this case, the correlation between environments for true genetic value is one and the animal’s ranking is the same in all environments. However, additive genetic values are lower in the more restrictive environment resulting in less response to selection [CS 1.39 Okeyo and Baker].
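The two forms of G × E can be contrasted with a small numerical sketch. The breeding values below are hypothetical true values for five animals, invented purely for illustration: in the first case the values in the second environment are a scaled-down copy of those in the first (correlation of one, same ranking), while in the second case the animals re-rank (correlation below one).

```python
def pearson(x, y):
    """Pearson correlation between two equally long lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def ranking(values):
    """Animal indices ordered from highest to lowest breeding value."""
    return sorted(range(len(values)), key=lambda i: values[i], reverse=True)


# Hypothetical breeding values (kg milk) of five animals in a favourable environment
bv_favourable = [120.0, 80.0, 40.0, -60.0, -180.0]

# Scale effect only: differences are halved in the restrictive environment
bv_scaled = [0.5 * v for v in bv_favourable]

# Re-ranking: the genetic basis of the trait differs between environments
bv_reranked = [30.0, 90.0, -20.0, 10.0, -110.0]

r_scaled = pearson(bv_favourable, bv_scaled)
r_reranked = pearson(bv_favourable, bv_reranked)
print(f"scale effect: r={r_scaled:.2f}, same ranking: "
      f"{ranking(bv_favourable) == ranking(bv_scaled)}")
print(f"re-ranking:   r={r_reranked:.2f}, same ranking: "
      f"{ranking(bv_favourable) == ranking(bv_reranked)}")
```

In the scale-effect case the same animals would be selected in either environment, but the smaller spread of breeding values in the restrictive environment implies a smaller response to selection there.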
Cattle genotypes in diverse environments
Methods of estimating G × E are presented by Mathur and Horst (1994), Chagunda (2000), Calus (2006) and Strandberg (2006). The methods include:
- Orthogonal comparison of subclasses
This method is normally used in factorial experiments. An example is when there are two genotypes raised in two environments. The interaction effect may be estimated as the difference between the sums of diagonal subclasses. The interaction is tested for significance using an F-test.
- Factorial analysis of variance
For this method a linear model, with an environmental factor, a genetic factor and interaction effect between the two factors, is fitted with genetic and interaction effects as random effects.
- Intraclass genetic correlations
This procedure is based on the estimation of genetic correlations between traits measured in two environments. The requirement is that the animals in the two environments should be genetically related (Ojango and Pollott, 2002).
- Estimation through selection in two environments
G × E can also be determined indirectly from direct and correlated response to selection (Falconer and Mackay, 1996). This procedure considers the problem of carry-over of improvement from one environment to the other. Selection in environment Y is based on selection in environment X. The correlated response is compared to the direct response possible through selection in environment Y. The ratio of correlated response and direct response is computed and used to calculate G × E. This method, although likely to give a reliable measure of G × E, can only be applied after selection has been practised.
- Using reaction norm models
Estimating G × E in breeding value estimation can be done with a reaction norm model when the production environment can be described as a continuous variable. A norm of reaction describes the pattern of phenotypic expression of a single genotype across a range of environments. For every genotype, phenotypic trait and environmental variable a different norm of reaction can exist. Studies of heritability carried out in a single environment cannot accurately estimate the norm of reaction, and often may not predict phenotypic response in a different environment. The reaction norm model, analysed using random regressions, has the advantage that no arbitrary grouping of environments is required and it can be extended to handle multiple environmental scales and multiple traits (Calus, 2006; Strandberg, 2006).
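Two of the methods listed above reduce to simple arithmetic, sketched below with hypothetical numbers. The first function computes the interaction contrast for two genotypes in two environments as the difference between the sums of the diagonal subclasses. The second recovers the genetic correlation between environments from the ratio of correlated to direct response, using the standard result that this ratio equals the genetic correlation multiplied by h_X / h_Y; it assumes equal selection intensity in both environments. Function names and all input values are illustrative, and significance testing is not shown.

```python
def interaction_contrast(means):
    """Interaction contrast for a 2 x 2 layout.

    means[g][e] is the subclass mean of genotype g in environment e.
    """
    return (means[0][0] + means[1][1]) - (means[0][1] + means[1][0])


def genetic_correlation_from_responses(correlated, direct, h2_x, h2_y):
    """Genetic correlation between environments X and Y from selection responses.

    Uses CR_Y / R_Y = r_G * h_X / h_Y, assuming equal selection intensities,
    so r_G = (CR_Y / R_Y) * h_Y / h_X.
    """
    return (correlated / direct) * (h2_y ** 0.5) / (h2_x ** 0.5)


# Hypothetical lactation yields (kg): rows = exotic, local; columns = good, poor
subclass_means = [[4500.0, 2000.0],
                  [3000.0, 1900.0]]
contrast = interaction_contrast(subclass_means)

# Hypothetical responses (kg) in environment Y and heritabilities in X and Y
r_g = genetic_correlation_from_responses(correlated=60.0, direct=100.0,
                                         h2_x=0.36, h2_y=0.25)
print(f"interaction contrast={contrast:.0f} kg, genetic correlation={r_g:.2f}")
```

A contrast of zero would indicate that the difference between genotypes is the same in both environments; a genetic correlation well below one points to re-ranking of animals across environments.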
Estimating heterosis effects
Cross breeding is a popular method of genetic improvement of livestock, especially in developing countries where previously such practices have been mostly inappropriately designed or executed [CS 1.34 Panandam and Raymond]. The basis of the effects and benefits derived from systematic cross breeding can broadly be classified into additive and non-additive. The additive component is that which is due to the averaging of the additive merit in the parental breeds with simple weighting according to level of gene representation of each parental breed in the crossbred genotype (Swan and Kinghorn, 1992). Heterosis is the non-additive effect of cross breeding. It is the amount by which merit in crossbreds deviates from the additive component. Heterosis is usually attributed to genetic interactions within loci (dominance) and between loci (epistasis). Individual heterosis is the deviation in performance in an individual relative to the average of the parental breeds, whereas maternal heterosis refers to heterosis attributed to using crossbred instead of purebred dams and occurs due to the dam itself possessing heterosis [CS KDPG].
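The additive and heterotic components can be illustrated with a minimal sketch. It assumes a simple dominance model in which heterosis is proportional to expected breed heterozygosity, ignores epistasis and maternal heterosis, and uses hypothetical breed means; the function names are illustrative.

```python
def individual_heterosis(f1_mean, parent_a_mean, parent_b_mean):
    """Return heterosis in trait units and as a percentage of the mid-parent mean."""
    mid_parent = 0.5 * (parent_a_mean + parent_b_mean)
    heterosis = f1_mean - mid_parent
    return heterosis, 100.0 * heterosis / mid_parent


def predicted_cross(p_sire, p_dam, mean_a, mean_b, heterosis):
    """Predict the mean of a cross between breeds A and B.

    p_sire, p_dam: proportion of breed A genes in the sire and the dam.
    heterosis: individual heterosis expressed by the F1 (full heterozygosity).
    """
    p = 0.5 * (p_sire + p_dam)                      # breed A proportion in progeny
    additive = p * mean_a + (1.0 - p) * mean_b      # weighted parental average
    # Expected probability that the two alleles at a locus come from different breeds
    heterozygosity = p_sire * (1.0 - p_dam) + p_dam * (1.0 - p_sire)
    return additive + heterozygosity * heterosis


# Hypothetical lactation yields (kg) for breed A, breed B and their F1
h_units, h_percent = individual_heterosis(2500.0, 3000.0, 1500.0)
f1 = predicted_cross(1.0, 0.0, 3000.0, 1500.0, h_units)
backcross = predicted_cross(1.0, 0.5, 3000.0, 1500.0, h_units)   # A sire x F1 dam
print(f"heterosis={h_units:.0f} kg ({h_percent:.1f}%), "
      f"F1={f1:.0f} kg, backcross to A={backcross:.0f} kg")
```

Under these assumptions the backcross retains half of the F1 heterosis while its additive component moves towards breed A.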
The performance of crosses can be predicted using estimates of genetic parameters from cross breeding experiments. Models for estimating cross breeding parameters based on a two-locus factorial model of gene effects were developed first by Dickerson (1973) and later by Küttner and Nitter (1997). A case study by Kahi [CS 1.5 by Kahi] illustrates an example of data analysis for estimating cross breeding parameters for milk production traits under the humid coastal regions of East Africa, while another by Aboagye [CS 1.9 by Aboagye] gives such parameters for milk production, reproductive, growth and carcass traits in cattle under the humid West African tropical conditions. Software such as CBE (cross breeding effects) is also available and can be used to estimate cross breeding effects from a larger variety of data structures or experimental designs.
Analysis of ordered categorical traits
Traits such as calving ease or litter size are expressed and recorded in categories. For example, in the case of calving ease, births may be assigned to one of several distinct classes such as difficult, assisted and easy calving. Usually, these categories are ordered along a gradient. In the case of calving ease, for example, the responses are ordered along a continuum measuring the ease with which birth occurred. These traits are therefore termed ordered categorical traits. Such traits are not normally distributed, and animal breeders have usually attributed the phenotypic expression of categorical traits to an underlying continuous unobservable trait which is normally distributed, referred to as the liability (Falconer and Mackay, 1996). The observed categorical responses are therefore due to animals exceeding particular threshold levels of the underlying trait.
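The liability concept can be sketched numerically. The example below assumes a normally distributed liability with unit variance and two hypothetical thresholds that separate three calving-ease classes; the threshold values, class labels and group means are illustrative and not estimates from any data set.

```python
from statistics import NormalDist


def category_probabilities(mean_liability, thresholds):
    """Probability of each ordered category for a unit-variance normal liability.

    An animal falls in category k when its liability lies between
    threshold k-1 and threshold k.
    """
    dist = NormalDist(mu=mean_liability, sigma=1.0)
    cumulative = [0.0] + [dist.cdf(t) for t in sorted(thresholds)] + [1.0]
    return [cumulative[k + 1] - cumulative[k] for k in range(len(cumulative) - 1)]


# Hypothetical thresholds on the liability scale: easy | assisted | difficult
THRESHOLDS = [0.0, 1.0]
LABELS = ["easy", "assisted", "difficult"]

baseline = category_probabilities(0.0, THRESHOLDS)
shifted = category_probabilities(0.5, THRESHOLDS)   # group with higher mean liability

for label, p0, p1 in zip(LABELS, baseline, shifted):
    print(f"{label:9s} baseline={p0:.3f} shifted={p1:.3f}")
```

A shift in mean liability changes the incidence of every category at once, which is why a change on the underlying scale does not translate linearly into a change in observed frequencies.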
Linear and non-linear models have been applied for the genetic analysis of categorical traits with the assumption of the underlying normally distributed liability. Usually, the non-linear (threshold) models are more complex and have higher computing requirements. The advantage of the linear model is the ease of implementation, as programs used for analysis of usual quantitative traits can be utilized. However, Fernando et al. (1983) indicated that some of the properties of BLUP do not hold with categorical traits. In a simulation study, Meijering and Gianola (1985) demonstrated that with no fixed effects and constant or variable number of offspring per sire, an analysis of a binary trait with either a linear or non-linear model gives similar sire rankings. This was independent of the heritability of the liability or incidence of the binary trait. However, with the inclusion of fixed effects and variable number of progeny per sire, the non-linear model gave breeding values more similar to the true breeding values compared with those estimated using the linear model. The advantage of the threshold model increased as the incidence of the binary trait and its heritability decreased. Thus for traits with low heritability and low incidence, a threshold model might be the method of choice. Further information on these can be found in Mrode (2005).