Lecture 11. Multivariate analysis (2/2)
Biostatistics, Xinhai Li

Multivariate analysis
- Multiple regression
- Multiple/partial correlation
- Cluster analysis
- Discriminant analysis
- Principal component analysis
- Factor analysis
- Correspondence analysis

Ordination

History of ordination methods
- In 1930, Ramensky began to use informal ordination techniques for vegetation. Such informal and largely subjective methods became widespread in the early 1950s (Whittaker 1967).
- In 1951, Curtis and McIntosh (1951) developed the 'continuum index', which later led to conceptual links between species responses to gradients and multivariate methods. Shortly thereafter, Goodall (1954) introduced the term 'ordination' in an ecological context for Principal Components Analysis.
- Bray and Curtis (1957) developed polar ordination, which became the first widely used ordination technique in ecology.
- Austin (1968) used canonical correlation to assess plant-environment relationships, in what may have been the first example of multivariate direct gradient analysis in ecology.
- In 1973, Hill introduced correspondence analysis, a technique originating in the 1930s, to ecologists. Correspondence analysis gradually supplanted polar ordination, which today has few practitioners.
- Fasham (1977) and Prentice (1977) independently discovered and demonstrated the utility of Kruskal's (1964) nonmetric multidimensional scaling, originally intended as a psychometric technique, for community ecology.
- Hill (1979) corrected some of the flaws of correspondence analysis and thereby created Detrended Correspondence Analysis, which is the most widely used indirect gradient analysis technique today. The software implementing it, DECORANA, became the backbone of many later software packages.
- Gauch's (1982) book "Multivariate Analysis in Community Ecology" described ordination in non-technical terms for the average practitioner, and allowed ordination techniques to enter the mainstream.
- Fuzzy set theory, introduced to ecologists by Roberts (1986), is a promising approach with ties to polar ordination, but has yet to gain many adherents.
- Ter Braak (1986) ushered in the biggest modern revolution in ordination methods with Canonical Correspondence Analysis. This technique coupled correspondence analysis with regression methodologies, and provides for hypothesis testing.
- Ter Braak and Prentice (1988) developed a theoretical unification of ordination techniques, hence placing gradient analysis on a firm theoretical foundation.
(From Wikipedia)

Cluster analysis

What is Cluster Analysis?
- Cluster: a collection of data objects
  - similar to one another within the same cluster
  - dissimilar to the objects in other clusters
- Cluster analysis: grouping a set of data objects into clusters
- Clustering is unsupervised classification: no predefined classes
- Typical applications
  - as a stand-alone tool to gain insight into the data distribution
  - as a preprocessing step for other algorithms

What Is Good Clustering?
- A good clustering method will produce high-quality clusters with
  - high intra-class similarity
  - low inter-class similarity
- The quality of a clustering result depends on the similarity measure.
- The quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns.

Measure the Quality of Clustering
- Dissimilarity/similarity metric: similarity is expressed in terms of a distance function, which is typically a metric d(i, j).
- The definitions of distance functions are usually very different for boolean, categorical, ordinal, interval-scaled, and ratio variables.
- Weights should be associated with different variables based on applications and data semantics.
- It is hard to define "similar enough" or "good enough": the answer is typically highly subjective.

Clustering Approaches
- Hierarchical: agglomerative, divisive
- Partitional
- Categorical
- Large databases: sampling, compression

Data Structures
- Data matrix (n objects by p variables):

  X = \begin{pmatrix} x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\ \vdots & & \vdots & & \vdots \\ x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\ \vdots & & \vdots & & \vdots \\ x_{n1} & \cdots & x_{nf} & \cdots & x_{np} \end{pmatrix}

- Dissimilarity matrix (distances between all pairs of objects):

  D = \begin{pmatrix} 0 & & & & \\ d(2,1) & 0 & & & \\ d(3,1) & d(3,2) & 0 & & \\ \vdots & \vdots & & \ddots & \\ d(n,1) & d(n,2) & \cdots & \cdots & 0 \end{pmatrix}

(From Han and Kamber's slides on clustering)

Partitioning algorithms: basic concept
- Partitioning method: construct a partition of a database D of n objects into a set of k clusters.
- Given k, find the partition into k clusters that optimizes the chosen partitioning criterion.
  - Global optimum: exhaustively enumerate all partitions.
  - Heuristic methods: the k-means and k-medoids algorithms.
- k-means (MacQueen 1967): each cluster is represented by the center of the cluster.
- k-medoids or PAM (Partitioning Around Medoids; Kaufman & Rousseeuw 1987): each cluster is represented by one of the objects in the cluster.

K-Means Clustering
- Basic idea: use cluster centres (means) to represent clusters.
- Assign data elements to the closest cluster (centre).
- Goal: minimize the square error (intra-class dissimilarity):

  E = \sum_i d(x_i, C(x_i))

  where C(x_i) is the centre of the cluster to which x_i is assigned and d is the squared distance.

The K-Means Clustering Method
Given k, the k-means algorithm is implemented in four steps:
1. Partition the objects into k nonempty subsets.
2. Compute seed points as the centroids of the clusters of the current partition (the centroid is the mean point of the cluster):

   C_S = \frac{1}{n} \sum_{i=1}^{n} X_i, \quad X_1, \dots, X_n \in S

3. Assign each object to the cluster with the nearest seed point.
4. Go back to step 2; stop when there are no more new assignments.

Example
[Figure: six scatterplots tracing one run of k-means with K = 2 - arbitrary initial centres, assignment of each object to the most similar centre, updating of the cluster means, and reassignment until convergence.]
(From Han and Kamber's slides on clustering)
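As a concrete illustration of these four steps (ours, not from the original slides), here is a minimal R sketch using the built-in kmeans() function on simulated two-cluster data:

set.seed(1)
# simulate two well-separated clusters in two dimensions
x <- rbind(matrix(rnorm(50, mean = 0), ncol = 2),
           matrix(rnorm(50, mean = 4), ncol = 2))

km <- kmeans(x, centers = 2)  # k = 2; alternates assignment and mean-update steps
km$centers                    # the final cluster means (the "seed points")
km$cluster                    # cluster membership of each object
km$tot.withinss               # the minimized within-cluster sum of squares

plot(x, col = km$cluster)     # objects coloured by cluster
points(km$centers, pch = 8, cex = 2)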
k-means Clustering: Procedure
1. Initialization 1: specify the number of clusters k; for example, k = 4.
2. Initialization 2: points are randomly assigned to one of the k clusters.
3. Calculate the mean of each cluster (e.g., (1,4), (1,2), (3,4), (3,2)).
4. Each point is reassigned to the nearest cluster.
5. Iterate until the means have converged.
(From Han and Kamber's slides on clustering)

Comments on the K-Means Method
- Strength: relatively efficient: O(tkn), where n is the number of objects, k the number of clusters, and t the number of iterations. Normally k, t << n.
- Comment: often terminates at a local optimum. The global optimum may be found using techniques such as deterministic annealing and genetic algorithms.
- Weakness:
  - applicable only when the mean is defined, so not applicable to categorical data
  - the number of clusters k must be specified in advance
  - unable to handle noisy data and outliers
  - not suitable for discovering clusters with non-convex shapes

Variations of the K-Means Method
- A few variants of k-means differ in
  - selection of the initial k means
  - dissimilarity calculations
  - strategies to calculate cluster means
- Handling categorical data: k-modes (Huang 1998)
  - replacing means of clusters with modes
  - using new dissimilarity measures to deal with categorical objects
  - using a frequency-based method to update modes of clusters
  - for a mixture of categorical and numerical data: the k-prototype method

The K-Medoids Clustering Method
- Find representative objects, called medoids, in clusters.
- PAM (Partitioning Around Medoids, 1987)
  - starts from an initial set of medoids and iteratively replaces one of the medoids by one of the non-medoids if doing so improves the total distance of the resulting clustering
  - works effectively for small data sets, but does not scale well to large data sets
- CLARA (Kaufmann & Rousseeuw, 1990)
- CLARANS (Ng & Han, 1994): randomized sampling

PAM (Partitioning Around Medoids) (1987)
- PAM (Kaufman and Rousseeuw, 1987), built into S-Plus.
- Uses real objects to represent the clusters:
  1. Select k representative objects arbitrarily.
  2. For each pair of a non-selected object h and a selected object i, calculate the total swapping cost TC_ih.
  3. For each pair of i and h, if TC_ih < 0, replace i by h; then assign each non-selected object to the most similar representative object.
  4. Repeat steps 2-3 until there is no change.
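PAM is implemented in R's cluster package; the following minimal sketch (ours, not from the slides) runs it on the same kind of simulated data as the k-means example above:

library(cluster)
set.seed(1)
x <- rbind(matrix(rnorm(50, mean = 0), ncol = 2),
           matrix(rnorm(50, mean = 4), ncol = 2))

pm <- pam(x, k = 2)  # k-medoids: each cluster is represented by a real object
pm$medoids           # the two representative objects (rows of x)
pm$clustering        # cluster membership of each object
plot(pm)             # cluster plot and silhouette plot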
Typical k-medoids algorithm (PAM)
[Figure: scatterplots of one PAM run with K = 2 - arbitrarily choose k objects as initial medoids (total cost 20), assign each remaining object to the nearest medoid, then loop: randomly select a non-medoid object O_random, compute the total cost of swapping (here 26), and swap O and O_random if the quality is improved, until no change.]
(From Han and Kamber's slides on clustering)

Comments on PAM
- PAM is more robust than k-means in the presence of noise and outliers, because a medoid is less influenced by outliers or other extreme values than a mean.
- PAM works efficiently for small data sets but does not scale well to large data sets: O(k(n-k)^2) per iteration, where n is the number of data points and k the number of clusters.

CLARA (Clustering LARge Applications) (1990)
- CLARA (Kaufmann and Rousseeuw, 1990), built into statistical analysis packages such as S-Plus.
- It draws multiple samples of the data set, applies PAM to each sample, and gives the best clustering as the output.
- Strength: deals with larger data sets than PAM.
- Weakness:
  - efficiency depends on the sample size
  - a good clustering based on samples will not necessarily represent a good clustering of the whole data set if the sample is biased

Hierarchical Clustering
- Uses a distance matrix as the clustering criterion. This method does not require the number of clusters k as an input, but needs a termination condition.
[Figure: five objects a-e merged step by step into clusters ab, de, cde, and abcde - read left to right for agglomerative clustering (AGNES), right to left for divisive clustering (DIANA).]
(From Han and Kamber's slides on clustering)

Hierarchical Clustering
Given a set of N items to be clustered and an N x N distance (or similarity) matrix, the basic process of hierarchical clustering is:
1. Start by assigning each item to its own cluster, so that if you have N items, you now have N clusters, each containing just one item.
2. Find the closest (most similar) pair of clusters and merge them into a single cluster, so that you now have one cluster less.
3. Compute the distances (similarities) between the new cluster and each of the old clusters.
4. Repeat steps 2 and 3 until all items are clustered into a single cluster of size N.
Amalgamation or Linkage Rules
Single linkage (nearest neighbor). The distance between two clusters is determined by the distance of the two closest objects (nearest neighbors) in the different clusters. This rule will, in a sense, string objects together to form clusters, and the resulting clusters tend to represent long "chains."
Complete linkage (furthest neighbor). The distance between clusters is determined by the greatest distance between any two objects in the different clusters (i.e., by the "furthest neighbors"). This method usually performs quite well when the objects actually form naturally distinct "clumps." If the clusters tend to be elongated or of a "chain" type nature, then this method is inappropriate.
Unweighted pair-group average. The distance between two clusters is calculated as the average distance between all pairs of objects in the two different clusters. This method is also very efficient when the objects form natural distinct "clumps"; however, it performs equally well with elongated, "chain" type clusters. Note that in their book, Sneath and Sokal (1973) introduced the abbreviation UPGMA to refer to this method as unweighted pair-group method using arithmetic averages.
Weighted pair-group average. This method is identical to the unweighted pair-group average method, except that in the computations the size of each cluster (i.e., the number of objects contained in it) is used as a weight. Thus, this method (rather than the previous one) should be used when the cluster sizes are suspected to be greatly uneven. Sneath and Sokal (1973) introduced the abbreviation WPGMA to refer to this method as weighted pair-group method using arithmetic averages.
Unweighted pair-group centroid. The centroid of a cluster is the average point in the multidimensional space defined by the dimensions; in a sense, it is the center of gravity of the cluster. In this method, the distance between two clusters is the distance between their centroids. Sneath and Sokal (1973) use the abbreviation UPGMC to refer to this method as unweighted pair-group method using the centroid average.
Weighted pair-group centroid (median). This method is identical to the previous one, except that weighting is introduced into the computations to take differences in cluster sizes (i.e., the number of objects contained in them) into consideration. Thus, when there are (or one suspects there to be) considerable differences in cluster sizes, this method is preferable to the previous one. Sneath and Sokal (1973) use the abbreviation WPGMC to refer to this method as weighted pair-group method using the centroid average.
Ward's method. This method is distinct from all the others because it uses an analysis-of-variance approach to evaluate the distances between clusters. In short, it attempts to minimize the sum of squares (SS) of any two (hypothetical) clusters that can be formed at each step. Refer to Ward (1963) for details. In general, this method is regarded as very efficient; however, it tends to create clusters of small size.

Distance Between Clusters
- Single link: smallest distance between points
- Complete link: largest distance between points
- Average link: average distance between points
- Centroid: distance between centroids
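In R's hclust(), these amalgamation rules correspond to the method argument; a small illustrative sketch (ours, not from the slides) on the built-in USArrests data:

d <- dist(USArrests)  # Euclidean distance matrix

hc.single   <- hclust(d, method = "single")    # nearest neighbour: tends to chain
hc.complete <- hclust(d, method = "complete")  # furthest neighbour: compact clumps
hc.average  <- hclust(d, method = "average")   # UPGMA
hc.ward     <- hclust(d, method = "ward.D2")   # Ward's minimum-variance method

par(mfrow = c(1, 2))
plot(hc.single, hang = -1, main = "Single linkage")
plot(hc.ward,   hang = -1, main = "Ward's method")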
Distance Measures
Euclidean distance. This is probably the most commonly chosen type of distance. It is simply the geometric distance in multidimensional space, computed as:

  d(x, y) = \sqrt{\sum_i (x_i - y_i)^2}

Note that Euclidean (and squared Euclidean) distances are usually computed from raw data, not from standardized data. This has certain advantages (e.g., the distance between any two objects is not affected by the addition of new objects to the analysis, which may be outliers). However, the distances can be greatly affected by differences in scale among the dimensions from which they are computed. For example, if one of the dimensions denotes a measured length in centimeters, and you then convert it to millimeters (by multiplying the values by 10), the resulting Euclidean or squared Euclidean distances (computed from multiple dimensions) can be greatly affected (i.e., biased by those dimensions which have a larger scale), and consequently the results of cluster analyses may be very different. Generally, it is good practice to transform the dimensions so they have similar scales.
Squared Euclidean distance. You may want to square the standard Euclidean distance in order to place progressively greater weight on objects that are further apart. This distance is computed as (see also the note in the previous paragraph):

  d(x, y) = \sum_i (x_i - y_i)^2

City-block (Manhattan) distance. This distance is simply the sum of the absolute differences across dimensions. In most cases it yields results similar to the simple Euclidean distance. However, note that in this measure the effect of single large differences (outliers) is dampened (since they are not squared). The city-block distance is computed as:

  d(x, y) = \sum_i |x_i - y_i|

Chebychev distance. This distance measure may be appropriate when one wants to define two objects as "different" if they differ on any one of the dimensions. The Chebychev distance is computed as:

  d(x, y) = \max_i |x_i - y_i|

Power distance. Sometimes one may want to increase or decrease the progressive weight placed on dimensions on which the respective objects are very different. This can be accomplished via the power distance:

  d(x, y) = \left( \sum_i |x_i - y_i|^p \right)^{1/r}

where r and p are user-defined parameters. Parameter p controls the progressive weight placed on differences on individual dimensions; parameter r controls the progressive weight placed on larger differences between objects. If r and p are both equal to 2, this distance equals the Euclidean distance.
Percent disagreement. This measure is particularly useful if the data for the dimensions included in the analysis are categorical in nature. This distance is computed as:

  d(x, y) = (\text{number of } x_i \neq y_i) / i

where i is the number of dimensions.

Distance Measures: Minkowski Metric
Suppose two objects x and y both have p features:

  x = (x_1, x_2, \dots, x_p), \quad y = (y_1, y_2, \dots, y_p)

The Minkowski metric is defined by

  d(x, y) = \left( \sum_{i=1}^{p} |x_i - y_i|^r \right)^{1/r}

Commonly Used Minkowski Metrics
1. r = 2 (Euclidean distance):  d(x, y) = \sqrt{\sum_{i=1}^{p} |x_i - y_i|^2}
2. r = 1 (Manhattan distance):  d(x, y) = \sum_{i=1}^{p} |x_i - y_i|
3. r = +\infty ("sup" distance):  d(x, y) = \max_{1 \le i \le p} |x_i - y_i|

An Example
For two points x and y whose coordinates differ by 4 on one feature and by 3 on the other:
1. Euclidean distance: \sqrt{4^2 + 3^2} = 5
2. Manhattan distance: 4 + 3 = 7
3. "sup" distance: \max\{4, 3\} = 4

Manhattan distance is called Hamming distance when all features are binary.
Gene expression levels under 17 conditions (1 = high, 0 = low):

         1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
  GeneA  0 1 1 0 0 1 0 0 1  0  0  1  1  1  0  0  1
  GeneB  0 1 1 1 0 0 0 0 1  1  1  1  1  1  0  1  1

Hamming distance: #(0,1) + #(1,0) = 4 + 1 = 5.
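These metrics are all available through R's dist() function; a minimal sketch (ours) reproducing the worked example above, with coordinates differing by 4 and 3:

m <- rbind(x = c(0, 0), y = c(4, 3))   # two points differing by 4 and 3

dist(m, method = "euclidean")          # 5
dist(m, method = "manhattan")          # 7
dist(m, method = "maximum")            # 4, the "sup" (Chebychev) distance
dist(m, method = "minkowski", p = 3)   # general Minkowski metric with r = 3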
Measuring Similarity
- Euclidean (L2) distance
- Manhattan (L1) distance
- Lm: (|x1-x2|^m + |y1-y2|^m)^{1/m}
- L∞: max(|x1-x2|, |y1-y2|)
- Inner product: x1x2 + y1y2
- Correlation coefficient (Pearson)
- Spearman rank correlation coefficient

Similarity measures: correlation coefficients

  s(x, y) = \frac{\sum_{i=1}^{p} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{p} (x_i - \bar{x})^2 \times \sum_{i=1}^{p} (y_i - \bar{y})^2}}, \quad |s(x, y)| \le 1

Dendrogram
- Dendrogram: a tree data structure which illustrates hierarchical clustering techniques.
- Each level shows the clusters for that level.
  - leaves: individual clusters
  - root: one cluster
- A cluster at level i is the union of its children clusters at level i+1.

Levels of Clustering
[Figure: dendrogram illustrating the successive levels of a hierarchical clustering.]

Agglomerative example
Distance matrix for five objects A-E:

      A  B  C  D  E
  A   0  1  2  2  3
  B   1  0  2  4  3
  C   2  2  0  1  5
  D   2  4  1  0  3
  E   3  3  5  3  0

[Figure: dendrogram for A-E; with a threshold of 1, the pairs A-B and C-D merge first.]

Single-Link, Complete-Link and Average-Link Clustering
[Figure: the same data clustered under the three linkage rules.]

Issues in Cluster Analysis
- A lot of clustering algorithms
- A lot of distance/similarity metrics
- Which clustering algorithm runs faster and uses less memory?
- How many clusters, after all?
- Are the clusters stable?
- Are the clusters meaningful?

Statistical Significance Testing
- Cluster analysis is a "collection" of different algorithms that "put objects into clusters according to well defined similarity rules"; it is not a typical statistical test.
- Cluster analysis methods are mostly used when we do not have any a priori hypotheses, but are still in the exploratory phase of our research. In a sense, cluster analysis finds the "most significant solution possible."
- Statistical significance testing is therefore not appropriate here, even in cases where p-levels are reported (as in k-means clustering).

R code: Hierarchical cluster analysis

library(RODBC)
channel = odbcConnectAccess('D:/text/pheasant/pheasants_points_feature.mdb')
Galliform63 = sqlFetch(channel, 'species63')
EnvironmentVariables = Galliform63[, c('elevation_mean', 'footprint_mean', 'veg_mean')]
row.names(EnvironmentVariables) = Galliform63[, 2]

# standardization: centre and scale each variable
for(i in 1:length(EnvironmentVariables)) {
  EnvironmentVariables[, i] = (EnvironmentVariables[, i] - mean(EnvironmentVariables[, i], na.rm = T)) /
    sd(EnvironmentVariables[, i], na.rm = T)
}

# Hierarchical cluster analysis
hi.cluster = hclust(dist(EnvironmentVariables))  # using the default configuration
x11()                        # create a new window
plot(hi.cluster, hang = -1)  # plot the cluster dendrogram

SAS example
In the following example, the observations are states. Binary-valued variables correspond to various grounds for divorce and indicate whether those grounds apply in each state. A DATA step is used to compute the Jaccard coefficient between each pair of states. The Jaccard coefficient is defined as the number of variables coded 1 for both states divided by the number of variables coded 1 for either or both states. The Jaccard coefficient is converted to a distance measure by subtracting it from 1.
SAS code

options ls=120 ps=60;
title2 'Grounds for Divorce';

data divorce;
   input state $15.
         (incompat cruelty desertn non_supp alcohol
          felony impotenc insanity separate) (1.) @@;
   if mod(_n_,2) then input +4 @@;
   /* each state name is followed by nine 0/1 indicators */
   datalines;
ALABAMA
CALIFORNIA
CONNECTICUT
MASSACHUSETTS
MISSISSIPPI
NEW HAMPSHIRE
NEW JERSEY
NEW MEXICO
NORTH CAROLINA
NORTH DAKOTA
PENNSYLVANIA
RHODE ISLAND
SOUTH CAROLINA
SOUTH DAKOTA
WASHINGTON
WEST VIRGINIA
;

/* compute distance matrix containing (1.0 - Jaccard coefficient) */
data distjacc(type=distance);
   array dj(*) dj1-dj50;            /* variables to contain 1-Jaccard        */
   retain dj1-dj50 .;               /* initialize to missing values          */
   do row = 1 to 50;                /* loop over rows of distance matrix     */
      set divorce point=row;        /* read row state                        */
      array grounds(*) incompat--separate;  /* declare arrays after          */
      array save(*) save1-save9;            /* the SET statement             */
      do g = 1 to 9;
         save(g) = grounds(g);      /* save data for row state               */
      end;
      do col = 1 to 50;             /* loop over columns of distance matrix  */
         set divorce(drop=state) point=col; /* read column state             */
         num = 0;    /* number of grounds that apply to both states          */
         den = 0;    /* number of grounds that apply to either state         */
         do g = 1 to 9;             /* loop over grounds for divorce         */
            num = num + (grounds(g) & save(g));
            den = den + (grounds(g) | save(g));
         end;
         if den then dj(col) = 1 - num/den;  /* convert to distance          */
         else dj(col) = 1;
      end;
      output;                       /* output a row of the distance matrix   */
   end;
   stop;     /* stop is needed because SET uses the POINT= option            */
   keep state dj1-dj50;             /* keep only the state and distances     */
run;

proc print data=distjacc(obs=10);
   var dj1-dj10;
   title2 'First 10 states';
run;
title2;

proc cluster data=distjacc method=centroid pseudo outtree=tree;
   var dj1-dj50;
run;

proc tree data=tree noprint n=9 out=out;
   copy incompat--separate;
run;
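For comparison, the same 1 - Jaccard distance is available in R as the "binary" method of dist(); a sketch assuming a 0/1 matrix divorce with one row per state (a hypothetical object standing in for the SAS data set above):

# the Jaccard coefficient is (shared 1s) / (positions where either object has a 1);
# dist(..., method = "binary") returns 1 minus that coefficient for 0/1 data
d.jacc <- dist(divorce, method = "binary")

# centroid-linkage clustering of the states, mirroring the SAS example
plot(hclust(d.jacc, method = "centroid"), hang = -1)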
Discriminant analysis

Example
Discriminant analysis of remote sensing data on five crops.
[Table: 36 observations - CORN, SOYBEANS, COTTON, SUGARBEETS, and CLOVER - each with four remote-sensing measurements x1-x4.]

Discriminant analysis
- Similar to regression, except that the criterion (or dependent variable) is categorical rather than continuous.
- Alternatively, discriminant function analysis is multivariate analysis of variance (MANOVA) reversed. In MANOVA, the independent variables are the groups and the dependent variables are the continuous measures. In DA, the independent variables are the continuous measures and the dependent variables are the groups.
- DA is used to identify boundaries between groups of objects by a measure of distance. For example: (a) does a person have the disease or not? (b) is someone a good credit risk or not? (c) should a student be admitted to college?

R code: Linear Discriminant Analysis

table(remote.sensing$crop)
#     CLOVER       CORN     COTTON   SOYBEANS SUGARBEETS
#         11          7          6          6          6

nrow(remote.sensing)  # 36

library(MASS)
lda.result <- lda(crop ~ band1 + band2 + band3 + band4, remote.sensing)
plot(lda.result, cex = 1)

[Figure: pairwise scatterplots of the four linear discriminants LD1-LD4, each point labelled with its crop.]
R: Linear Discriminant Analysis

lda.result  # prints prior probabilities, group means, and coefficients

R: Linear Discriminant Analysis

lda.predict <- predict(lda.result, remote.sensing)
table(lda.predict$class)  # number of observations assigned to each crop
lda.predict$class         # predicted crop types
lda.predict$posterior     # posterior probability of each crop for each observation
lda.predict$x             # linear discriminant scores for each observation

R: Linear Discriminant Analysis

plot(lda.predict$x[, 1], lda.predict$x[, 2])
compare = data.frame(remote.sensing$crop, lda.predict$class)

[Table: observed crop versus predicted class for the 36 observations; a few observations, e.g. some CORN and SOYBEANS, are misclassified.]
[Figure: scatterplot of the first two discriminant scores, lda.predict$x[, 1] against lda.predict$x[, 2].]

SAS code

title 'Discriminant Analysis of Remote Sensing Data on Five Crops';
data crops;
   input crop $ 1-10 x1-x4 xvalues $ 11-21;
   datalines;
CORN       16 27 31 33
...
;

proc discrim data=crops out=qdaout1 outstat=qdaout2
             method=normal pool=no;
   var x1-x4;
   title2 'Using Quadratic Discriminant Function';
run;

data quad;
   set qdaout2;
   if _type_='QUAD';
run;

proc print data=quad;
run;

SAS Result

 Obs  crop      _TYPE_  _NAME_         x1        x2        x3        x4
   1  CLOVER    QUAD    x1         -0.001     0.000     0.001     0.000
   2  CLOVER    QUAD    x2          0.000    -0.002     0.001    -0.000
   3  CLOVER    QUAD    x3          0.001     0.001    -0.002    -0.001
   4  CLOVER    QUAD    x4          0.000    -0.000    -0.001    -0.001
   5  CLOVER    QUAD    _LINEAR_    0.022     0.096     0.061     0.112
   6  CLOVER    QUAD    _CONST_   -18.010   -18.010   -18.010   -18.010
   7  CORN      QUAD    x1         -0.582    -0.046    -0.090    -0.077
   8  CORN      QUAD    x2         -0.046    -0.025    -0.005    -0.010
   9  CORN      QUAD    x3         -0.090    -0.005    -0.066    -0.024
  10  CORN      QUAD    x4         -0.077    -0.010    -0.024    -0.016
  11  CORN      QUAD    _LINEAR_   29.882     3.491     8.189     5.201
  12  CORN      QUAD    _CONST_  -473.542  -473.542  -473.542  -473.542
  13  COTTON    QUAD    x1         -0.273    -0.009     0.141     0.038
  14  COTTON    QUAD    x2         -0.009    -0.243     0.114     0.048
  15  COTTON    QUAD    x3          0.141     0.114    -0.124    -0.041
  16  COTTON    QUAD    x4          0.038     0.048    -0.041    -0.015
  17  COTTON    QUAD    _LINEAR_    6.575     4.776    -5.326    -1.641
  18  COTTON    QUAD    _CONST_   -74.308   -74.308   -74.308   -74.308
  19  SOYBEANS  QUAD    x1         -0.212     0.019     0.123    -0.046
  20  SOYBEANS  QUAD    x2          0.019    -0.019     0.010    -0.005
  21  SOYBEANS  QUAD    x3          0.123     0.010    -0.116     0.038
  22  SOYBEANS  QUAD    x4         -0.046    -0.005     0.038    -0.019
  23  SOYBEANS  QUAD    _LINEAR_    4.856     0.073    -2.535     1.565
  24  SOYBEANS  QUAD    _CONST_   -53.227   -53.227   -53.227   -53.227

Discriminant analysis vs. clustering
Discriminant analysis:
- known number of classes
- based on a training set
- used to classify future observations
- classification is a form of supervised learning
- models a response: Y = X1 + X2 + X3 ...
Clustering:
- unknown number of classes
- no prior knowledge
- used to understand (explore) data
- clustering is a form of unsupervised learning
- no response variable: X1 + X2 + X3 ...

Linear discriminant analysis
- Linear discriminant analysis attempts to find the linear combination of the selected measures that best separates the populations:

  D = b_1 x_1 + b_2 x_2 + \dots + b_k x_k

  where the b's are the discriminant coefficients and the x's are the input variables.

Procedure
Discriminant function analysis is broken into a two-step process:
1. Testing the significance of a set of discriminant functions.
- The first step is computationally identical to MANOVA. There is a matrix of total variances and covariances; likewise, there is a matrix of pooled within-group variances and covariances. The two matrices are compared via multivariate F tests in order to determine whether or not there are any significant differences (with regard to all variables) between groups. One first performs the multivariate test and, if it is statistically significant, proceeds to see which of the variables have significantly different means across the groups.
2. Classification.
- Once group means are found to be statistically significant, classification of variables is undertaken.
- Discriminant analysis automatically determines some optimal combination of variables, so that the first function provides the most overall discrimination between groups, the second provides the second most, and so on.
- Moreover, the functions will be independent or orthogonal; that is, their contributions to the discrimination between groups will not overlap.

Assumptions
Sample size: Unequal sample sizes are acceptable. The sample size of the smallest group needs to exceed the number of predictor variables. As a rule of thumb, the smallest sample size should be at least 20 for a few (4 or 5) predictors. The maximum number of independent variables is n - 2, where n is the sample size. While a low sample size may work, it is not encouraged, and generally it is best to have 4 or 5 times as many observations as independent variables.
Normal distribution: It is assumed that the data (for the variables) represent a sample from a multivariate normal distribution. You can examine whether or not variables are normally distributed with histograms of frequency distributions. However, note that violations of the normality assumption are not "fatal", and the resulting significance tests are still reliable as long as non-normality is caused by skewness and not outliers (Tabachnick and Fidell 1996).
Homogeneity of variances/covariances: Discriminant analysis is very sensitive to heterogeneity of variance-covariance matrices. Before accepting final conclusions for an important study, it is a good idea to review the within-group variances and correlation matrices. Homoscedasticity is evaluated through scatterplots and corrected by transformation of variables.

Assumptions
Outliers:
- Discriminant analysis is highly sensitive to the inclusion of outliers.
- Run a test for univariate and multivariate outliers for each group, and transform or eliminate them.
- If one group in the study contains extreme outliers that impact the mean, they will also increase variability. Overall significance tests are based on pooled variances, that is, the average variance across all groups. Thus the significance tests of the relatively larger means (with the large variances) would be based on the relatively smaller pooled variances, resulting erroneously in statistical significance.
Non-multicollinearity:
- If one of the independent variables is very highly correlated with another, or one is a function (e.g., the sum) of other independents, then the matrix will not have a unique discriminant solution.
- To the extent that the independents are correlated, the standardized discriminant function coefficients will not reliably assess the relative importance of the predictor variables.
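Returning to the R example above: the fitted coefficients b of this linear combination are stored in the scaling component of a MASS::lda fit; a short sketch (the component name is standard, the comments are ours):

lda.result$scaling  # matrix of discriminant coefficients b, one column per
                    # discriminant function (LD1-LD4 for the five crops)

# The discriminant scores returned as lda.predict$x are the centred
# data multiplied by these coefficients.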
Principal Component Analysis (PCA)

R code: PCA

ibis = read.csv('D:/database/ibis2010.csv', header = T)
head(ibis)
ibis.pre = ibis[ibis$use == 1, c(3:6, 8, 9, 11, 12)]
head(ibis.pre)  # eight variables: latitude, aspect, elevation, footprint, year, GDP, pop, slope

############### PCA ###############
## The variances of the variables in the ibis data
## vary by orders of magnitude, so scaling is appropriate
pca1 <- princomp(ibis.pre)              # inappropriate: unscaled
pca2 <- princomp(ibis.pre, cor = TRUE)  # equivalent to prcomp(ibis.pre, scale = TRUE)

R code: PCA

plot(pca2)    # shows a scree plot
biplot(pca2)
summary(pca2)
pca2$loadings
pca2$scores   # principal component scores

[Figure: scree plot of the component variances, and a biplot of the sites on Comp.1 and Comp.2 with arrows for year, slope, aspect, footprint, pop, GDP, and elevation.]

R code: PCA

pca2$loadings
[Table: loadings of the eight variables (latitude, aspect, elevation, footprint, year, GDP, pop, slope) on components Comp.1-Comp.8.]

R code: PCA

pca2$scores

        Comp.1  Comp.2  Comp.3  Comp.4  Comp.5
1      -11.732  -6.269  -1.487  -0.016  -0.722
42     -13.055  -8.397  -1.325   0.868  -2.854
86      -9.641  -4.353  -0.257  -0.166  -0.521
104     -2.673  -2.746   0.447   0.537  -0.290
105     -2.762  -2.646   0.466   0.839  -0.160
116     -6.773  -2.977   1.060  -0.025  -1.138
117     -6.814  -2.932   1.071   0.128  -1.072

Example: study on climate change consequences to the crested ibis
[Figure: the loadings of 20 variables - five climate variables, i.e. annual total precipitation (A), annual minimum temperature (B), annual maximum temperature (C), seasonal variance of temperature (D), and seasonal variance of precipitation (E), at four time periods from the present through 2080 - in the space of the first and second principal components. The grey circles are the scores of 5751 sites in Yang county on the first and second principal components.]

Principal component analysis
Principal component analysis (PCA) is a technique that is useful for the compression and classification of data. The purpose is to reduce the dimensionality of a data set (sample) by finding a new set of variables, smaller than the original set, that nonetheless retains most of the sample's information. By information we mean the variation present in the sample, given by the correlations between the original variables. The new variables, called principal components (PCs), are uncorrelated, and are ordered by the fraction of the total information each retains.

Principal Component Analysis
- Given m points in an n-dimensional space, for large n, how does one project onto a one-dimensional space?
- Choose a line that fits the data so the points are spread out well along the line.
- Formally, minimize the sum of squares of the distances to the line. Why the sum of squares? Because it allows fast minimization.
- For one data point and a line through the point (0,0), minimizing the sum of squared distances to the line is the same as maximizing the sum of squared projections onto that line (Pythagoras, long ago).

PCA: General methodology
From k original variables x1, x2, ..., xk, produce k new variables y1, y2, ..., yk:
  y1 = a11 x1 + a12 x2 + ... + a1k xk
  y2 = a21 x1 + a22 x2 + ... + a2k xk
  ...
  yk = ak1 x1 + ak2 x2 + ... + akk xk
The yk's are the principal components, chosen such that the yk's are uncorrelated (orthogonal), y1 explains as much as possible of the original variance in the data set, and y2 explains as much as possible of the remaining variance.
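The coefficients a_ij are the eigenvectors of the covariance (or correlation) matrix, as the next slides spell out; this can be checked directly in R. A small sketch of ours, assuming ibis.pre from the earlier PCA slides is still in the workspace:

X <- scale(ibis.pre)       # standardize, as recommended before PCA
e <- eigen(cov(X))         # eigendecomposition of the covariance matrix
e$vectors[, 1]             # coefficients a11, a12, ..., a1k of y1
e$values / sum(e$values)   # proportion of variance per component

# the built-in routine gives the same coefficients (up to sign):
prcomp(ibis.pre, scale. = TRUE)$rotation[, 1]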
Principal Components Analysis
[Figure: scatterplot of a two-variable data cloud with the 1st principal component y1 along the direction of greatest spread and the 2nd principal component y2 perpendicular to it.]

Principal Components Analysis
{a11, a12, ..., a1k} is the 1st eigenvector of the covariance matrix, and the coefficients of the first principal component;
{a21, a22, ..., a2k} is the 2nd eigenvector of the covariance matrix, and the coefficients of the 2nd principal component;
...
{ak1, ak2, ..., akk} is the kth eigenvector of the covariance matrix, and the coefficients of the kth principal component.

Scores
Score of the ith unit on the jth principal component:

  y_{i,j} = a_{j1} x_{i1} + a_{j2} x_{i2} + \dots + a_{jk} x_{ik}

[Figure: the scores y_{i,1} and y_{i,2} of one point (x_{i1}, x_{i2}) as its coordinates along the two principal component axes.]

Principal Components Analysis
Amount of variance accounted for by:
  1st principal component: λ1, the 1st eigenvalue
  2nd principal component: λ2, the 2nd eigenvalue
  ...
with λ1 > λ2 > λ3 > λ4 > ...; for a correlation-matrix PCA, the average λj = 1.
[Figure: the eigenvalues λ1 and λ2 shown as the spread along the two principal axes.]

PCA: Terminology
- The jth principal component is the jth eigenvector of the covariance matrix.
- The coefficients a_jk are elements of the eigenvectors and relate the original variables (standardized if using the correlation matrix) to the components.
- Scores are the values of the units on the components (produced using the coefficients).
- The amount of variance accounted for by a component is given by its eigenvalue λj.
- The proportion of variance accounted for by a component is given by λj / Σ λj.
- The loading of the kth original variable on the jth component is given by a_jk \sqrt{λj}, the correlation between the variable and the component.

PCA: potential problems
- Lack of independence: no problem.
- Lack of normality: normality is desirable but not essential.
- Lack of precision: precision is desirable but not essential.
- Many zeroes in the data matrix: a problem (use correspondence analysis).

Note
- The principal components depend on the units used to measure the original variables as well as on the range of values they assume.
- We usually standardize the data prior to PCA.

Hourly records of sperm whale behaviour
- Variables: mean cluster size, max. cluster size, mean speed, heading consistency, fluke-up rate, breach rate, lobtail rate, spyhop rate, sidefluke rate, coda rate, creak rate, high click rate.
- Data collected off the Galapagos Islands, 1985 and 1987.
- Units: hours spent following sperm whales; 440 hours.

Principal Components

  Principal component:         1      2      3      4
  % of variance accounted:  31.09  13.41  12.08  10.52
  Loadings:
  Mean cluster size           0.82   0.35   0.01  -0.14
  Max. cluster size           0.83   0.24   0.17  -0.12
  Mean speed                 -0.38   0.30   0.44  -0.09
  Heading consistency        -0.48   0.39   0.19  -0.04
  Fluke-up rate              -0.65  -0.19   0.30   0.25
  Breach rate                 0.24   0.24  -0.13   0.74
  Lobtail rate                0.29   0.30  -0.09   0.71
  Spyhop rate                 0.46  -0.60  -0.23   0.01
  Sidefluke rate              0.49  -0.57  -0.20   0.07
  Coda rate                   0.68  -0.08   0.53   0.03
  Creak rate                  0.57  -0.11   0.70   0.02
  High click rate            -0.41  -0.55   0.47   0.31

(Whitehead, Hal and Weilgart, Linda. 1991. Patterns of Visually Observable Behaviour and Vocalizations in Groups of Female Sperm Whales. Behaviour 118(3): 275-296.)
Principal Components meanings
The same loadings suggest interpretations for the four components: PC1 "Socializing/foraging", PC2 "Directed movement", PC3 "Vocal", PC4 "Aerial".
(Whitehead, Hal and Weilgart, Linda. 1991. Patterns of Visually Observable Behaviour and Vocalizations in Groups of Female Sperm Whales. Behaviour 118(3): 275-296.)

US Crime Statistics
- Variables: murder, rape, robbery, assault, burglary, larceny, autotheft.
- Units: states.

Crime Statistics

  Component loadings:      1       2
  MURDER                 0.557  -0.771
  RAPE                   0.851  -0.139
  ROBBERY                0.782   0.055
  ASSAULT                0.784  -0.546
  BURGLARY               0.881   0.308
  LARCENY                0.728   0.480
  AUTOTHFT               0.714   0.438

[Figure: scree plot of the eigenvalues against the number of factors.]

Crime Statistics: Component Loadings
[Figure: factor loadings plots before and after varimax rotation. After rotation, BURGLARY, LARCENY, AUTOTHFT, and ROBBERY load on Factor 1 ("crimes against property"), while MURDER, ASSAULT, and RAPE load on Factor 2 ("crimes against people").]
From k original variables x1, x2, ..., xk, produce k principal components y1, y2, ..., yk:
  y1 = a11 x1 + a12 x2 + ... + a1k xk
  y2 = a21 x1 + a22 x2 + ... + a2k xk
  ...
  yk = ak1 x1 + ak2 x2 + ... + akk xk

Crime Statistics: Scores Plot
[Figure: scores of the US states on the two rotated factors, crimes against property versus crimes against people.]

Example
In this example wildlife (moose) population density was measured over time (once a year) in three areas.

  Year    1    2    3    4    5    6    7    8    9   10   11   12
  Area 1 11.3 10.4  9.9  8.2 10.1 10.7 11.0  7.1 14.7  5.4  7.3 10.2
  Area 2 14.1 14.0 13.0 11.4 11.9 13.8 14.9  8.5 14.5  9.0  7.6 10.9
  Area 3  6.9 11.2  8.7  3.3  8.7 12.5  8.9  3.7 12.1  4.1  5.6  7.3

  Year   13   14   15   16   17   18   19   20   21   22   23
  Area 1  6.1  9.7  8.1 11.3  8.8  9.4  7.5  8.8  7.5  9.1  6.8
  Area 2  9.9 13.2  9.4 11.8 11.5 11.6 11.4 10.7 11.1 13.2  9.8
  Area 3  6.8  6.6  4.0  4.9  8.8  5.7  4.9  7.2  7.0  8.9  7.6

(From Laverty's presentation "Techniques for studying correlation and covariance structure")

Habitats
[Figure: map of the three areas.]
(From Laverty's presentation "Techniques for studying correlation and covariance structure")
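A minimal R sketch of the computations on the next slides, assuming the table above has been entered as a data frame moose with columns Area1, Area2, Area3 (a hypothetical object name):

xbar <- colMeans(moose)  # the mean vector
S    <- cov(moose)       # the covariance matrix
R    <- cor(moose)       # the correlation matrix

e <- eigen(S)            # eigenvalues and eigenvectors of S
e$values
e$vectors

# scores on the first principal component C1:
C1 <- as.matrix(scale(moose, center = TRUE, scale = FALSE)) %*% e$vectors[, 1]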
The Sample Statistics
The mean vector, the covariance matrix, and the correlation matrix:

  \bar{x} = \begin{pmatrix} 9.10 \\ 11.62 \\ 7.19 \end{pmatrix}, \quad
  S = \begin{pmatrix} 4.297 & 3.307 & 3.295 \\ 3.307 & 3.527 & 3.527 \\ 3.295 & 3.527 & 6.566 \end{pmatrix}, \quad
  R = \begin{pmatrix} 1 & .796 & .620 \\ .796 & 1 & .687 \\ .620 & .687 & 1 \end{pmatrix}

(From Laverty's presentation "Techniques for studying correlation and covariance structure")

Principal component Analysis
The eigenvalues of S:

  λ1 = 11.85974, λ2 = 2.204232, λ3 = 0.814249

The eigenvectors of S:

  a_1 = \begin{pmatrix} .522 \\ .523 \\ .674 \end{pmatrix}, \quad
  a_2 = \begin{pmatrix} .582 \\ .359 \\ -.730 \end{pmatrix}, \quad
  a_3 = \begin{pmatrix} .624 \\ -.733 \\ .117 \end{pmatrix}

The principal components:

  C1 = .522 x1 + .523 x2 + .674 x3
  C2 = .582 x1 + .359 x2 - .730 x3
  C3 = .624 x1 - .733 x2 + .117 x3

Example 1/3: C1 = .522 x1 + .523 x2 + .674 x3
[Figure: the first component plotted over time for the three areas.]
Example 2/3: C2 = .582 x1 + .359 x2 - .730 x3
[Figure: the second component plotted over time for the three areas.]
Example 3/3: C3 = .624 x1 - .733 x2 + .117 x3
[Figure: the third component plotted over time for the three areas.]
(From Laverty's presentation "Techniques for studying correlation and covariance structure")

SAS example
These data consist of a series of attitudinal and value scores on 49 individuals. The variables are:
  SC = self-confidence
  POS = positiveness
  SOC = sociability
  TRADITN = one's sense of values that explains why one would keep a traditional way of life and ethics
  PRO = propulsion: commitment to go ahead (move forward) with a plan
  POL = political view: a high score means that the person has a conservative political view
  GEN = general knowledge score
  ACH = motive of achievement

data attitude;
   input SC POS SOC TRADITN PRO POL GEN ACH;
   datalines;
3.755  6.711 15.179  9.178 5.121 4.948  7.952  7.473
8.095  7.047 13.030 12.045 7.479 6.331 10.723 10.724
3.797  6.515 15.764  9.544 5.177 3.792  8.911  8.817
...   [49 rows of scores in all]
;
SAS code

proc princomp COVARIANCE out = out1;
proc print data = out1;
run;

Eigenvalues of the Covariance Matrix
[Table: eigenvalues 1-8 with their differences, proportions, and cumulative proportions.]

SAS Results

  Eigenvectors (first principal component):
            Prin1
  SC        0.579662
  POS      -0.276978
  SOC       0.590183
  TRADITN   0.027356
  PRO       0.150217
  POL       0.031990
  GEN       0.340984
  ACH       0.313613
  [Columns Prin2-Prin8 follow in the full output.]

Factor Analysis

R code: Factor Analysis

head(ibis.pre)  # latitude, aspect, elevation, footprint, year, GDP, pop, slope

# Exploratory Factor Analysis (Maximum Likelihood)
# extracting 3 factors, with varimax rotation
fit <- factanal(ibis.pre, 3, rotation = "varimax")
print(fit, digits = 2, cutoff = .3, sort = TRUE)

Call:
factanal(x = ibis.pre, factors = 3, rotation = "varimax")

Uniquenesses: [values for the eight variables]

Loadings:
            Factor1 Factor2 Factor3
  footprint  0.53
  GDP        0.93
  pop        0.95   -0.30
  latitude           0.89
  elevation          0.89
  slope              0.90
  aspect
  year      -0.30            0.90

                 Factor1 Factor2 Factor3
  SS loadings       2.09    1.95    0.11
  Proportion Var    0.26    0.24    0.01
  Cumulative Var    0.26    0.50    0.52

Test of the hypothesis that 3 factors are sufficient.
The chi-square statistic is 50.9 on 7 degrees of freedom.
The p-value is 9.78e-09

R: Result

# plot factor 1 by factor 2
load <- fit$loadings[, 1:2]
plot(load, type = "n")                         # set up plot
text(load, labels = names(ibis.pre), cex = 1)  # add variable names
abline(h = -1:1, v = -1:1, col = "lightgray", lty = 1)

[Figure: the eight variables plotted by their loadings on Factor 1 (high for GDP, pop, footprint) and Factor 2 (high for elevation, slope, latitude).]

Factor Analysis
- Data reduction tool.
- Removes redundancy or duplication from a set of correlated variables.
- Represents correlated variables with a smaller set of "derived" variables.
Factor Analysis

- A data reduction tool.
- Removes redundancy or duplication from a set of correlated variables.
- Represents correlated variables with a smaller set of "derived" variables.
- Factors are formed that are relatively independent of one another.
- Two types of "variables":
  - latent variables: factors
  - observed variables

Some Applications of Factor Analysis

1. Identification of underlying factors:
   - clusters variables into homogeneous sets
   - creates new variables (i.e. factors)
   - allows us to gain insight into categories
2. Screening of variables:
   - identifies groupings, allowing us to select one variable to represent many
   - useful in regression (recall collinearity)
3. Summary:
   - allows us to describe many variables using a few factors
4. Clustering of objects:
   - helps us put objects (people) into categories depending on their factor scores

Data Matrix

- Factor analysis is totally dependent on correlations between variables.
- Factor analysis summarizes the correlation structure.

[Diagram: the data matrix (observations O1, ..., On by variables v1, ..., vk) is condensed into a correlation matrix (v1, ..., vk by v1, ..., vk), which is in turn summarized by a factor matrix (v1, ..., vk by factors F1, ..., Fj)]

Choosing Number of Factors

- Intuitively: the number of uncorrelated constructs that are jointly measured by the X's.
- Only useful if the number of factors is less than the number of X's (recall "data reduction"). Use "principal components" to help decide:
  - a type of factor analysis
  - the number of factors initially equals the number of variables
  - each factor is a weighted combination of the input variables: F1 = a11 X1 + a12 X2 + ...

Choosing Number of Factors: Eigenvalues

- To select how many factors to use, consider the eigenvalues from a principal components analysis.
- Two interpretations:
  - eigenvalue ≈ the equivalent number of variables which the factor represents
  - eigenvalue ≈ the amount of variance in the data described by the factor
- Rules to go by:
  - number of eigenvalues > 1
  - scree plot
  - % variance explained
  - comprehensibility
- Note: the sum of the eigenvalues equals the number of items.

Steps in Factor Analysis

Factor analysis usually proceeds in four steps:
1. The correlation matrix for all variables is computed.
2. Factor extraction.
3. Factor rotation.
4. Final decisions are made about the number of underlying factors.

The Correlation Matrix

Step 1: the correlation matrix.
- Generate a correlation matrix for all variables.
- Identify variables that are not related to the other variables.
- If the correlations between variables are small, it is unlikely that they share common factors (variables must be related to each other for the factor model to be appropriate).
- Correlation coefficients greater than 0.3 in absolute value indicate acceptable correlations.
- Examine visually the appropriateness of the factor model.

- Bartlett test of sphericity:
  - used to test the hypothesis that the correlation matrix is an identity matrix (all diagonal terms are 1 and all off-diagonal terms are 0)
  - if the value of the test statistic for sphericity is large and the associated significance level is small, it is unlikely that the population correlation matrix is an identity
- If the hypothesis that the population correlation matrix is an identity cannot be rejected because the observed significance level is large, the use of the factor model should be reconsidered.
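In R, this first step amounts to inspecting cor() output and, if desired, running Bartlett's sphericity test. A minimal sketch, reusing the ibis.pre data from the earlier example; cortest.bartlett() is in the psych package:

# Step 1 sketch: correlation matrix and Bartlett's test of sphericity
R <- cor(ibis.pre, use = "complete.obs")  # pairwise correlations
round(R, 2)                               # look for |r| > 0.3
library(psych)
cortest.bartlett(R, n = nrow(ibis.pre))   # chi-square test that R is an identity matrix

A small p-value here supports going ahead with factor extraction; a large one suggests the factor model may not be appropriate for these variables.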
Factor Extraction

Step 2: Factor extraction.
- Principal components analysis is the most commonly used extraction method.
- Other factor extraction methods include:
  - maximum likelihood method
  - principal axis factoring
  - alpha method
  - unweighted least squares method
  - generalized least squares method
  - image factoring

- In principal components analysis, linear combinations of the observed variables are formed.
- The 1st principal component is the combination that accounts for the largest amount of variance in the sample (the 1st extracted factor).
- The 2nd principal component accounts for the next largest amount of variance and is uncorrelated with the first (the 2nd extracted factor).
- Successive components explain progressively smaller portions of the total sample variance, and all are uncorrelated with each other.

- To decide how many factors are needed to represent the data, two statistical criteria are used: eigenvalues and the scree plot.
- The number of factors is usually determined by considering only factors with eigenvalues greater than 1.
- Factors with a variance less than 1 are no better than a single variable, since each standardized variable is expected to have a variance of 1.

- The scree plot provides a visual of the total variance associated with each factor: the steep slope shows the large factors, and the gradual trailing off (the "scree") shows the remaining factors, usually below an eigenvalue of 1.
- At this stage the decision about the number of factors is not final; initial decisions should also rest on conceptual and theoretical grounds.

Factor Rotation

Step 3: Factor rotation.
- In this step, factors are rotated.
- Unrotated factors are typically not very interpretable (most factors are correlated with many variables).
- Factors are rotated to make them more meaningful and easier to interpret (each variable is associated with a minimal number of factors).
- Different rotation methods may result in the identification of somewhat different factors.

Rotating Factors (Intuitively)

[Figure: the four variables x1-x4 plotted in the F1-F2 plane, before and after the axes are rotated]

Loadings before rotation:

      Factor 1  Factor 2
x1       0.5       0.5
x2       0.8       0.8
x3      -0.7       0.7
x4      -0.5      -0.5

Loadings after rotation (table truncated in the source):

      Factor 1  Factor 2
x1       0
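The rotation itself can be reproduced with varimax() from the stats package. A minimal sketch, applying it to the small unrotated loading matrix above (the matrix is typed in by hand, so this is purely illustrative):

# Sketch: varimax rotation of the 4 x 2 loading matrix above
L <- matrix(c( 0.5,  0.5,
               0.8,  0.8,
              -0.7,  0.7,
              -0.5, -0.5),
            ncol = 2, byrow = TRUE,
            dimnames = list(paste0("x", 1:4), c("Factor1", "Factor2")))
vm <- varimax(L)   # stats::varimax, available in base R
vm$loadings        # rotated loadings: each variable now loads mainly on one factor
vm$rotmat          # the 2 x 2 orthogonal rotation matrix

Here rotation amounts to spinning the F1-F2 axes by roughly 45 degrees, so x1, x2 and x4 line up with one rotated factor and x3 with the other, which is exactly the interpretability gain described above.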
