# Principal Component Analysis - Part III


In this post, we will reproduce the results of a popular paper on PCA. The paper, titled ‘Principal component analysis’ and authored by Herve Abdi and Lynne J. Williams, was published in 2010, and its popularity has only grown since: it had more than 4800 citations on Google Scholar when this post was last revised.

This post is Part-III of a three part series on PCA. Other parts of the series can be found at the links below.

This post contains code snippets in R. Equivalent MATLAB code can be written using the commands from Part-II; for the figures, readers will have to write their own MATLAB code.

### Structure of the paper

Along with basic theory, the paper contains three examples on PCA, one example on correspondence analysis, and one example on multiple factor analysis. We will only focus on PCA examples in this post.

To run the following R code seamlessly, readers have to load the packages below. If a package has not been installed previously, install it with `install.packages("package_name")`.

```
library(ggplot2)
library(ggrepel)
```

### How to get data

Data for the examples have been taken from the paper [1]. The datasets are pretty small, so one way to read the data is to create a dataframe directly in R using the values given in the paper. Alternatively, the values can first be stored in a CSV file and then read into R. To keep this post self-contained, we adopt the former approach.

**Note**: Throughout this article, additional comments have been made beside code segments. It is a good idea to read those comments along with the code.

```
# Table 1
# Create a dataframe
Words = c("Bag", "Across", "On", "Insane", "By", "Monastery", "Relief", "Slope", "Scoundrel", "With", "Neither", "Pretentious", "Solid", "This", "For", "Therefore", "Generality", "Arise", "Blot", "Infectious")
Word_length = c(3, 6, 2, 6, 2, 9, 6, 5, 9, 4, 7, 11, 5, 4, 3, 9, 10, 5, 4, 10)
Lines_in_dict = c(14, 7, 11, 9, 9, 4, 8, 11, 5, 8, 2, 4, 12, 9, 8, 1, 4, 13, 15, 6)
words = data.frame(Words, Word_length, Lines_in_dict, stringsAsFactors = F)
words
```

```
Words Word_length Lines_in_dict
1 Bag 3 14
2 Across 6 7
3 On 2 11
4 Insane 6 9
5 By 2 9
6 Monastery 9 4
7 Relief 6 8
8 Slope 5 11
9 Scoundrel 9 5
10 With 4 8
11 Neither 7 2
12 Pretentious 11 4
13 Solid 5 12
14 This 4 9
15 For 3 8
16 Therefore 9 1
17 Generality 10 4
18 Arise 5 13
19 Blot 4 15
20 Infectious 10 6
```

`(words_centered = scale(words[,2:3],scale = F)) # Centering after removing the first column`

```
Word_length Lines_in_dict
[1,] -3 6
[2,] 0 -1
[3,] -4 3
[4,] 0 1
[5,] -4 1
[6,] 3 -4
[7,] 0 0
[8,] -1 3
[9,] 3 -3
[10,] -2 0
[11,] 1 -6
[12,] 5 -4
[13,] -1 4
[14,] -2 1
[15,] -3 0
[16,] 3 -7
[17,] 4 -4
[18,] -1 5
[19,] -2 7
[20,] 4 -2
attr(,"scaled:center")
Word_length Lines_in_dict
6 8
```

### Covariance PCA

Covariance PCA uses the centered (but not scaled) data matrix. `prcomp()` centers the data by default.

```
pca_words_cov = prcomp(words[,2:3],scale = F) # cov stands for Covariance PCA
factor_scores_words = pca_words_cov$x
round(factor_scores_words,2)
```

```
PC1 PC2
[1,] -6.67 0.69
[2,] 0.84 -0.54
[3,] -4.68 -1.76
[4,] -0.84 0.54
[5,] -2.99 -2.84
[6,] 4.99 0.38
[7,] 0.00 0.00
[8,] -3.07 0.77
[9,] 4.14 0.92
[10,] -1.07 -1.69
[11,] 5.60 -2.38
[12,] 6.06 2.07
[13,] -3.91 1.30
[14,] -1.92 -1.15
[15,] -1.61 -2.53
[16,] 7.52 -1.23
[17,] 5.52 1.23
[18,] -4.76 1.84
[19,] -6.98 2.07
[20,] 3.83 2.30
```

Observe that the factor scores for PC1 are the negatives of those given in the paper. This is not a problem: principal directions are determined only up to sign, so an implementation may return a component multiplied by −1.
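For readers who prefer Python, the sign indeterminacy is easy to check with numpy (an illustrative sketch with toy data, not the post's R code): flipping the sign of a principal direction negates the factor scores but changes none of the variances or distances.

```python
import numpy as np

# Toy centered data (illustrative, not the full words dataset)
X = np.array([[-3., 6.], [0., -1.], [-4., 3.], [3., -4.], [4., -4.]])
X = X - X.mean(axis=0)

# PCA via SVD: the columns of Vt.T are the principal directions
U, s, Vt = np.linalg.svd(X, full_matrices=False)
scores = X @ Vt.T             # factor scores
scores_flipped = X @ (-Vt.T)  # flipping the directions' signs...

# ...leaves the variances (and all distances) unchanged
print(np.allclose(scores.var(axis=0), scores_flipped.var(axis=0)))  # True
```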

### Principal directions are orthogonal

```
# It can also be checked that both the principal components are orthogonal.
sum(factor_scores_words[,1]*factor_scores_words[,2]) # PCs are orthogonal
```

`[1] -4.773959e-15`

### Contribution of each factor

The contribution of an observation to a component is defined as its squared factor score divided by the sum of squared factor scores in that column.

`round(factor_scores_words[,1]^2/sum(factor_scores_words[,1]^2)*100,2)`

```
[1] 11.36 0.18 5.58 0.18 2.28 6.34 0.00 2.40 4.38 0.29 8.00 9.37
[13] 3.90 0.94 0.66 14.41 7.78 5.77 12.43 3.75
```

`round(factor_scores_words[,2]^2/sum(factor_scores_words[,2]^2)*100,2)`

```
[1] 0.92 0.55 5.98 0.55 15.49 0.28 0.00 1.13 1.63 5.48 10.87 8.25
[13] 3.27 2.55 12.32 2.90 2.90 6.52 8.25 10.18
```

The calculations in the above two lines can be done in a single line:

`round(factor_scores_words^2/matrix(rep(colSums(factor_scores_words^2),nrow(words)),ncol = 2,byrow = T)*100,2)`

```
PC1 PC2
[1,] 11.36 0.92
[2,] 0.18 0.55
[3,] 5.58 5.98
[4,] 0.18 0.55
[5,] 2.28 15.49
[6,] 6.34 0.28
[7,] 0.00 0.00
[8,] 2.40 1.13
[9,] 4.38 1.63
[10,] 0.29 5.48
[11,] 8.00 10.87
[12,] 9.37 8.25
[13,] 3.90 3.27
[14,] 0.94 2.55
[15,] 0.66 12.32
[16,] 14.41 2.90
[17,] 7.78 2.90
[18,] 5.77 6.52
[19,] 12.43 8.25
[20,] 3.75 10.18
```
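In Python the same contribution formula is a one-liner thanks to broadcasting; a numpy sketch with a few illustrative factor scores (by construction, each column of contributions sums to 100):

```python
import numpy as np

# Illustrative factor-score matrix (n observations x 2 PCs)
F = np.array([[-6.67, 0.69], [0.84, -0.54], [-4.68, -1.76], [7.52, -1.23]])

# Contribution of each observation to each PC, in percent
contrib = F**2 / (F**2).sum(axis=0) * 100
print(contrib.sum(axis=0))  # each column sums to 100
```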

### Squared distance to center of gravity

`(dist = rowSums(factor_scores_words^2))`

` [1] 45 1 25 1 17 25 0 10 18 4 37 41 17 5 9 58 32 26 53 20`

### Squared cosine of observations

`(sq_cos = round(factor_scores_words^2/rowSums(factor_scores_words^2)*100))`

```
PC1 PC2
[1,] 99 1
[2,] 71 29
[3,] 88 12
[4,] 71 29
[5,] 53 47
[6,] 99 1
[7,] NaN NaN
[8,] 94 6
[9,] 95 5
[10,] 29 71
[11,] 85 15
[12,] 90 10
[13,] 90 10
[14,] 74 26
[15,] 29 71
[16,] 97 3
[17,] 95 5
[18,] 87 13
[19,] 92 8
[20,] 74 26
```

NaNs are produced because of division by zero: observation 7 has zero factor scores, i.e., it sits exactly at the center of gravity.
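The squared cosines divide each squared factor score by the observation's squared distance to the center, so each row sums to 100, and an observation at the center yields 0/0 = NaN. A numpy sketch (illustrative scores, including one observation at the center):

```python
import numpy as np

F = np.array([[-6.67, 0.69], [0.84, -0.54], [0.0, 0.0]])  # last row: at the center

with np.errstate(invalid="ignore"):  # silence the 0/0 warning
    sq_cos = F**2 / (F**2).sum(axis=1, keepdims=True) * 100

print(np.round(sq_cos))    # rows sum to 100; last row is NaN
print(np.isnan(sq_cos[2]))
```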

```
# Figure 1
p = ggplot(words,aes(x = Lines_in_dict,y = Word_length,label = Words))+
geom_point()+ geom_text_repel()+
geom_hline(yintercept = 6)+geom_vline(xintercept = 8)+
labs(x = "Lines in dictionary",y = "Word length")
print(p)
```

```
# Show directions of PCs
# Note that the intercept argument in geom_abline assumes the line passes through the origin. In our case the data are mean-shifted,
# so we have to adjust the intercept to account for the new origin. These adjustments have been made below.
slope1 = pca_words_cov$rotation[1,1]/pca_words_cov$rotation[2,1] # Slope of first PC
slope2 = pca_words_cov$rotation[1,2]/pca_words_cov$rotation[2,2] # Slope of second PC
(new_origin = c(mean(words$Lines_in_dict),mean(words$Word_length)))
```

`[1] 8 6`

```
intercept1 = 6 - slope1*8
intercept2 = 6 - slope2*8
p+geom_abline(slope = slope1,intercept = intercept1,linetype = "dashed",size = 1.2,col = "red")+
geom_abline(slope = slope2,intercept = intercept2,linetype = "dashed",size = 1.2,col = "blue")
```

In the above figure red dashed line is the 1st principal component (PC) and blue dashed line is the 2nd PC.

### Rotated PCs

This figure is obtained by plotting factor scores. Note that we will plot negative of the factor scores of 1st PC to make the figure consistent with the paper.

```
ggplot(as.data.frame(pca_words_cov$x),aes(-pca_words_cov$x[,1],pca_words_cov$x[,2],label = words$Words))+
geom_point()+geom_text_repel()+geom_hline(yintercept = 0)+geom_vline(xintercept = 0)+
xlab("Factor score along PC1")+ylab("Factor score along PC2")
```

### With supplementary data

Given a supplementary point (a point not used in finding the principal components), we first center it using the column means of the original data. Its factor scores are then obtained by multiplying it with the loading matrix.

### Factor score of the new word ‘sur’

```
sur = c(3,12) # It has 3 letters and 12 lines of dictionary entry
(sur_centered = sur - colMeans(words[,2:3]))
```

```
Word_length Lines_in_dict
-3 4
```

`(factor_scores_sur = round(sur_centered %*% pca_words_cov$rotation,2))`

```
PC1 PC2
[1,] -4.99 -0.38
```
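The same projection ("center with the training means, then multiply by the loading matrix") can be sketched in Python with numpy, using the column means and loading matrix printed in this example:

```python
import numpy as np

# Column means and loading matrix from the words example (values as printed in this post)
col_means = np.array([6.0, 8.0])                 # Word_length, Lines_in_dict
P = np.array([[ 0.5368755, 0.8436615],
              [-0.8436615, 0.5368755]])          # columns: PC1, PC2

sur = np.array([3.0, 12.0])                      # the new word 'sur'
factor_scores_sur = (sur - col_means) @ P
print(np.round(factor_scores_sur, 2))            # -> [-4.99 -0.38]
```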

### Eigenvalues and variance

See Part-II for details.

### Total variance before transformation

`(total_var_before = round(sum(diag(var(words_centered))),3))`

`[1] 23.368`

### Total variance after transformation

`(total_var_after = round(sum(diag(var(pca_words_cov$x))),3))`

`[1] 23.368`
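The total variance is preserved because PCA rotates the data by an orthogonal matrix, which leaves the trace of the covariance matrix unchanged. A numpy sketch on illustrative random data:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
Xc = X - X.mean(axis=0)

# Principal directions from the SVD of the centered data
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = Xc @ Vt.T

# Total variance (trace of the covariance matrix) before and after the rotation
before = np.trace(np.cov(Xc, rowvar=False))
after = np.trace(np.cov(scores, rowvar=False))
print(np.isclose(before, after))  # True
```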

Next we compute the correlation between the principal components and the original variables. (In the paper, this correlation is also termed a loading, but we will strictly reserve the term loading for the loading matrix \(\textbf{P}\); see Part-I.)

The sum of squared correlation coefficients between a variable and the principal components is 1. Intuitively, this means that the variables are orthogonally projected onto the principal components.

### Correlation matrix

```
# Correlation between PCs and original variables
(cor(pca_words_cov$x,words_centered))
```

```
Word_length Lines_in_dict
PC1 0.8679026 -0.9741764
PC2 0.4967344 0.2257884
```

**Note** that these correlation coefficients don't match those given in the paper. Readers who obtain the paper's values are encouraged to send me an email using my contact details. However, our procedure is correct, and it does give the correct answer for the supplementary data described below.

### Squared correlation

`(cor(pca_words_cov$x,words_centered)^2)`

```
Word_length Lines_in_dict
PC1 0.7532549 0.94901961
PC2 0.2467451 0.05098039
```

The sum of squared correlation coefficients between each variable and the principal components is 1.

`colSums((cor(pca_words_cov$x,words_centered)^2))`

```
Word_length Lines_in_dict
1 1
```

### Loading matrix

`(loading_matrix = pca_words_cov$rotation)`

```
PC1 PC2
Word_length 0.5368755 0.8436615
Lines_in_dict -0.8436615 0.5368755
```

### Correlation score for supplementary variables

```
# Supplementary variable (Table 4)
Frequency = c(8,230,700,1,500,1,9,2,1,700,7,1,4,500,900,3,1,10,1,1)
Num_entries = c(6,3,12,2,7,1,1,6,1,5,2,1,5,9,7,1,1,4,4,2)
supp_data = data.frame(Frequency,Num_entries) # Supplementary data
supp_data
```

```
Frequency Num_entries
1 8 6
2 230 3
3 700 12
4 1 2
5 500 7
6 1 1
7 9 1
8 2 6
9 1 1
10 700 5
11 7 2
12 1 1
13 4 5
14 500 9
15 900 7
16 3 1
17 1 1
18 10 4
19 1 4
20 1 2
```

### Centered supplementary data

`supp_data_cent = scale(supp_data,scale = F) # Centered supplementary data`

### Correlation score for supplementary data

`(corr_score_supp = round(cor(pca_words_cov$x,supp_data),4))`

```
Frequency Num_entries
PC1 -0.3012 -0.6999
PC2 -0.7218 -0.4493
```

Note that the correlation score doesn't depend on whether the supplementary data are centered or not.

`(round(cor(pca_words_cov$x,supp_data_cent),4))`

```
Frequency Num_entries
PC1 -0.3012 -0.6999
PC2 -0.7218 -0.4493
```

### Squared correlation

`(round(cor(pca_words_cov$x,supp_data_cent)^2,4))`

```
Frequency Num_entries
PC1 0.0907 0.4899
PC2 0.5210 0.2019
```

### Column sums of squared correlation for supplementary data

`(round(colSums(cor(pca_words_cov$x,supp_data_cent)^2),4))`

```
Frequency Num_entries
0.6118 0.6918
```

### Correlation circle plot

```
# First plot correlation circle
x = seq(0,2*pi,length.out = 300)
circle = ggplot() + geom_path(data = data.frame(a = cos(x),b = sin(x)),
aes(cos(x),sin(x)),alpha = 0.3, size = 1.5)+
geom_hline(yintercept = 0)+geom_vline(xintercept = 0)+
annotate("text",x = c(1.08,0.05),y = c(0.05,1.08),label = c("PC1","PC2"),angle = c(0,90))+
xlab(NULL)+ylab(NULL)
# Plotting original variables
cor_score = as.data.frame(cor(words_centered,pca_words_cov$x))
variable_plot_original = circle + geom_point(data = cor_score, aes(cor_score[,1],cor_score[,2]))+
geom_text_repel(aes(cor_score[,1],cor_score[,2],
label = c("Length of words","Number of lines in Dict.")))
print(variable_plot_original)
```

### Plotting supplementary variables

```
variable_plot_original+
geom_point(data = as.data.frame(corr_score_supp),
aes(corr_score_supp[,1],corr_score_supp[,2]))+
geom_text_repel(aes(corr_score_supp[,1],corr_score_supp[,2],
label = c("Frequency","Number of entries")))
```

Observe that our correlation circle plot is flipped about the y-axis (i.e., PC2) compared to the plot given in the paper. This is because our first principal component is the negative of the one in the paper; the negated component produces negated correlation scores, and hence the plot flips about the y-axis.

## Example 2 (Wine example)

### Correlation PCA with wine data

```
# Table 6
wine_type = c(paste("wine", 1:5, sep = "_"))
hedonic = c(14, 10, 8, 2, 6)
for_meat = c(7, 7, 5, 4, 2)
for_dessert = c(8, 6, 5, 7, 4)
price = c(7, 4, 10, 16, 13)
sugar = c(7, 3, 5, 7, 3)
alcohol = c(13, 14, 12, 11, 10)
acidity = c(7, 7, 5, 3, 3)
wine = data.frame(wine_type, hedonic, for_meat, for_dessert, price, sugar, alcohol, acidity, stringsAsFactors = F)
wine
```

```
wine_type hedonic for_meat for_dessert price sugar alcohol acidity
1 wine_1 14 7 8 7 7 13 7
2 wine_2 10 7 6 4 3 14 7
3 wine_3 8 5 5 10 5 12 5
4 wine_4 2 4 7 16 7 11 3
5 wine_5 6 2 4 13 3 10 3
```

```
pca_wine_cor = prcomp(wine[2:8],scale = T)
ggplot(as.data.frame(pca_wine_cor$x),aes(x = pca_wine_cor$x[,1],y = pca_wine_cor$x[,2], label = paste0("wine ",1:5)))+
geom_point()+geom_text_repel()+ geom_vline(xintercept = 0)+ geom_hline(yintercept = 0)+
xlab("Factor score along PC1")+ylab("Factor score along PC2")
```

Again our figure appears upside down compared to that of the paper. This is a minor discrepancy: our 2nd eigenvector is the negative of the one considered in the paper. We could match the paper's plot by flipping the second principal component, but we will not do that here.

### Factor scores along 1st and 2nd PC

```
# Table 7
(pca_wine_cor$x[,1:2])
```

```
PC1 PC2
[1,] -2.3301649 1.095284
[2,] -2.0842419 -1.223185
[3,] 0.1673228 -0.370258
[4,] 1.7842392 1.712563
[5,] 2.4628448 -1.214405
```

### Contribution of each observation to principal component

`round(pca_wine_cor$x[,1:2]^2/matrix(rep(colSums(pca_wine_cor$x[,1:2]^2),nrow(wine)),ncol = 2,byrow = T)*100,2)`

```
PC1 PC2
[1,] 28.50 16.57
[2,] 22.80 20.66
[3,] 0.15 1.89
[4,] 16.71 40.51
[5,] 31.84 20.37
```

### Squared cosines of observations for the first two PCs

`(sq_cos = round(pca_wine_cor$x[,1:2]^2/rowSums(pca_wine_cor$x^2)*100))`

```
PC1 PC2
[1,] 77 17
[2,] 69 24
[3,] 7 34
[4,] 50 46
[5,] 78 19
```

### Loading scores corresponding to first two principal components

`(round(pca_wine_cor$rotation[,1:2],2))`

```
PC1 PC2
hedonic -0.40 -0.11
for_meat -0.45 0.11
for_dessert -0.26 0.59
price 0.42 0.31
sugar -0.05 0.72
alcohol -0.44 -0.06
acidity -0.45 -0.09
```

### Correlation score of variables with first two principal components

`(corr_score_wine = round(cor(pca_wine_cor$x,wine[,2:8])[1:2,],2))`

```
hedonic for_meat for_dessert price sugar alcohol acidity
PC1 -0.87 -0.97 -0.58 0.91 -0.11 -0.96 -0.99
PC2 -0.15 0.15 0.79 0.42 0.97 -0.07 -0.12
```

### Correlation circle for wine data

```
# Figure 6
corr_score_wine = t(corr_score_wine)
circle +
geom_point(data = as.data.frame(corr_score_wine),
aes(corr_score_wine[,1],corr_score_wine[,2]))+
geom_text_repel(aes(corr_score_wine[,1],corr_score_wine[,2],
label = c("Hedonic","For Meat","For dessert","Price","Sugar","Alcohol","Acidity")))
```

### Varimax rotation

Rotation is applied to the loading matrix so that, after rotation, the principal components become easier to interpret. By interpretable we mean that some loading scores will have higher values and others lower values, so the variables with high loading scores can be said to contribute significantly to a principal component compared with variables having lower loading scores. Though rotation works in certain cases, it must be remembered that it is no magic wand for principal component interpretability. One of the most popular rotations is varimax, and R has a built-in command to perform it.

Varimax rotation can be performed on the whole loading matrix or on a few components only. In the paper, varimax has been applied to first two principal components.
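R's built-in `varimax()` does the work here; purely for intuition, below is a minimal numpy sketch of Kaiser's varimax iteration (an illustrative implementation, not R's; signs and column order of the rotated loadings may differ from `varimax()`'s output). The rotation matrix it finds is orthogonal, so the communalities (row sums of squared loadings) are preserved.

```python
import numpy as np

def varimax(Phi, gamma=1.0, max_iter=100, tol=1e-8):
    """Rotate the loading matrix Phi (p x k) to maximize the varimax criterion."""
    p, k = Phi.shape
    R = np.eye(k)  # accumulated rotation matrix
    d = 0.0
    for _ in range(max_iter):
        L = Phi @ R
        # Gradient-based update via an SVD (Kaiser's algorithm)
        u, s, vt = np.linalg.svd(
            Phi.T @ (L**3 - (gamma / p) * L @ np.diag((L**2).sum(axis=0))))
        R = u @ vt
        d_old, d = d, s.sum()
        if d_old != 0 and d / d_old < 1 + tol:
            break
    return Phi @ R, R

# Loadings of the first two PCs from the wine example (values as printed in this post)
Phi = np.array([[-0.40, -0.11], [-0.45, 0.11], [-0.26, 0.59], [0.42, 0.31],
                [-0.05, 0.72], [-0.44, -0.06], [-0.45, -0.09]])
rotated, R = varimax(Phi)
print(np.allclose(R @ R.T, np.eye(2)))                              # True: R is orthogonal
print(np.allclose((rotated**2).sum(axis=1), (Phi**2).sum(axis=1)))  # True: communalities preserved
```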

### Loading scores of first two principal components

`(round(pca_wine_cor$rotation[,1:2],2))`

```
PC1 PC2
hedonic -0.40 -0.11
for_meat -0.45 0.11
for_dessert -0.26 0.59
price 0.42 0.31
sugar -0.05 0.72
alcohol -0.44 -0.06
acidity -0.45 -0.09
```

### Varimax applied to first two principal components

`rotated_loading_scores = varimax(pca_wine_cor$rotation[,1:2])`

### Loading scores after rotation

```
# Table 10
(round(rotated_loading_scores$loadings[,1:2],2))
```

```
PC1 PC2
hedonic -0.41 -0.02
for_meat -0.41 0.21
for_dessert -0.12 0.63
price 0.48 0.21
sugar 0.12 0.72
alcohol -0.44 0.05
acidity -0.46 0.02
```

The same result can also be obtained by multiplying the original loading matrix by the rotation matrix obtained from varimax.

`(round(pca_wine_cor$rotation[,1:2] %*% rotated_loading_scores$rotmat,2))`

```
[,1] [,2]
hedonic -0.41 -0.02
for_meat -0.41 0.21
for_dessert -0.12 0.63
price 0.48 0.21
sugar 0.12 0.72
alcohol -0.44 0.05
acidity -0.46 0.02
```

### Plot of loading scores before rotation

```
#Figure 7
ggplot(as.data.frame(pca_wine_cor$rotation[,1:2]),aes(x = pca_wine_cor$rotation[,1],y = pca_wine_cor$rotation[,2],
label = c("Hedonic","For Meat","For dessert","Price","Sugar","Alcohol","Acidity")))+
geom_point()+geom_text_repel()+geom_hline(yintercept = 0)+geom_vline(xintercept = 0)+
xlab("Loading score along PC1")+ylab("Loading score along PC2")
```

### Plot of loading scores after rotation

```
ggplot(as.data.frame(rotated_loading_scores$loadings[,1:2]),
aes(x = rotated_loading_scores$loadings[,1],
y = rotated_loading_scores$loadings[,2],
label = c("Hedonic","For Meat","For dessert","Price","Sugar","Alcohol","Acidity")))+
geom_point()+geom_text_repel()+geom_hline(yintercept = 0)+geom_vline(xintercept = 0)+
xlab("Loading score along PC1 after rotation")+
ylab("Loading score along PC2 after rotation")
```

## Example 3

### French food example (Covariance PCA example)

```
# Table 11
class = rep(c("Blue_collar", "White_collar", "Upper_class"), times = 4)
children = rep(c(2,3,4,5), each = 3)
bread = c(332, 293, 372, 406, 386, 438, 534, 460, 385, 655, 584, 515)
vegetables = c(428, 559, 767, 563, 608, 843, 660, 699, 789, 776, 995, 1097)
fruit = c(354, 388, 562, 341, 396, 689, 367, 484, 621, 423, 548, 887)
meat = c(1437, 1527, 1948, 1507, 1501, 2345, 1620, 1856, 2366, 1848, 2056, 2630)
poultry = c(526, 567, 927, 544, 558, 1148, 638, 762, 1149, 759, 893, 1167)
milk = c(247, 239, 235, 324, 319, 243, 414, 400, 304, 495, 518, 561)
wine = c(427, 258, 433, 407, 363, 341, 407, 416, 282, 486, 319, 284)
food = data.frame(class, children, bread, vegetables, fruit, meat, poultry, milk, wine, stringsAsFactors = F)
food
```

```
class children bread vegetables fruit meat poultry milk wine
1 Blue_collar 2 332 428 354 1437 526 247 427
2 White_collar 2 293 559 388 1527 567 239 258
3 Upper_class 2 372 767 562 1948 927 235 433
4 Blue_collar 3 406 563 341 1507 544 324 407
5 White_collar 3 386 608 396 1501 558 319 363
6 Upper_class 3 438 843 689 2345 1148 243 341
7 Blue_collar 4 534 660 367 1620 638 414 407
8 White_collar 4 460 699 484 1856 762 400 416
9 Upper_class 4 385 789 621 2366 1149 304 282
10 Blue_collar 5 655 776 423 1848 759 495 486
11 White_collar 5 584 995 548 2056 893 518 319
12 Upper_class 5 515 1097 887 2630 1167 561 284
```

`pca_food_cov = prcomp(food[,3:9],scale = F)`

### Factor scores

```
# Table 12
(factor_scores_food = round(pca_food_cov$x[,1:2],2))
```

```
PC1 PC2
[1,] -635.05 120.89
[2,] -488.56 142.33
[3,] 112.03 139.75
[4,] -520.01 -12.05
[5,] -485.94 -1.17
[6,] 588.17 188.44
[7,] -333.95 -144.54
[8,] -57.51 -42.86
[9,] 571.32 206.76
[10,] -39.38 -264.47
[11,] 296.04 -235.92
[12,] 992.83 -97.15
```

### Contribution of each observation to principal component

`round(pca_food_cov$x[,1:2]^2/matrix(rep(colSums(pca_food_cov$x[,1:2]^2),nrow(food)),ncol = 2,byrow = T)*100,2)`

```
PC1 PC2
[1,] 13.34 5.03
[2,] 7.90 6.97
[3,] 0.42 6.72
[4,] 8.94 0.05
[5,] 7.81 0.00
[6,] 11.44 12.22
[7,] 3.69 7.19
[8,] 0.11 0.63
[9,] 10.80 14.71
[10,] 0.05 24.07
[11,] 2.90 19.15
[12,] 32.61 3.25
```

`dist = pca_food_cov$x[,1]^2+pca_food_cov$x[,2]^2`

### Squared cosine of observations

`(sq_cos = round(pca_food_cov$x[,1:2]^2/rowSums(pca_food_cov$x^2)*100))`

```
PC1 PC2
[1,] 95 3
[2,] 86 7
[3,] 26 40
[4,] 100 0
[5,] 98 0
[6,] 89 9
[7,] 83 15
[8,] 40 22
[9,] 86 11
[10,] 2 79
[11,] 57 36
[12,] 97 1
```

### Squared loading scores

```
# Table 13
(round(pca_food_cov$rotation[,1:2]^2,2))
```

```
PC1 PC2
bread 0.01 0.33
vegetables 0.11 0.17
fruit 0.09 0.01
meat 0.57 0.01
poultry 0.22 0.06
milk 0.01 0.40
wine 0.00 0.02
```

**Note** that this table doesn't match the paper's. We will stick to our analysis.

### Correlation score

`(corr_score_food = round((cor(pca_food_cov$x,food[,3:9])[1:2,]),2))`

```
bread vegetables fruit meat poultry milk wine
PC1 0.36 0.91 0.96 1.00 0.98 0.41 -0.43
PC2 -0.87 -0.35 0.10 0.04 0.16 -0.88 -0.33
```

### Squared correlation score

`(round((cor(pca_food_cov$x,food[,3:9])[1:2,])^2,2))`

```
bread vegetables fruit meat poultry milk wine
PC1 0.13 0.83 0.92 1 0.96 0.17 0.18
PC2 0.76 0.12 0.01 0 0.03 0.77 0.11
```

### Correlation circle for food data

```
# Figure 9
corr_score_food = t(corr_score_food)
circle + geom_point(data = as.data.frame(corr_score_food),
aes(x = corr_score_food[,1],y = corr_score_food[,2]))+
geom_text_repel(data = as.data.frame(corr_score_food),
aes(x = corr_score_food[,1],y = corr_score_food[,2],
label = c("Bread","Vegetables","Fruit","Meat","Poultry","Milk","Wine")))
```

Now observe that our correlation circle plot is quite close to the paper's (though in opposite quadrants; as mentioned previously, this is not a problem).

### Eigenvalues

The eigenvalues reported in the paper are the squares of the singular values of the centered data matrix (equivalently, the eigenvalues of the cross-product matrix of the centered data). Hence they can be obtained as below.

```
## Table 14
cent_food = food[,3:9]-matrix(rep(colMeans(food[,3:9]),times = 12),nrow = 12,
byrow = T)
svd_food = svd(cent_food)
(Eigenvalues = (svd_food$d)^2)
```

```
[1] 3023141.2354 290575.8390 68795.2333 25298.9496 22992.2474
[6] 3722.3214 723.9238
```

**Important Note:** These eigenvalues are not the same as variance of factor scores in principal components. Variance of principal component factor scores can be obtained by dividing the eigenvalues by \((n-1)\), where \(n\) is number of data points (in this case \(n = 12\)). If this point is still not clear, refer to Part-II.
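This relation between singular values, eigenvalues, and factor-score variances can be checked directly; a numpy sketch on illustrative centered data:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(12, 4))
Xc = X - X.mean(axis=0)
n = Xc.shape[0]

U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
eigenvalues = s**2       # squared singular values = eigenvalues of the cross-product matrix
scores = Xc @ Vt.T       # factor scores

# The variances of the factor scores are the eigenvalues divided by (n - 1)
print(np.allclose(scores.var(axis=0, ddof=1), eigenvalues / (n - 1)))  # True
```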

### Proportional contribution of each principal component

`(round(Eigenvalues/sum(Eigenvalues),2))`

`[1] 0.88 0.08 0.02 0.01 0.01 0.00 0.00`

### Cumulative sum of eigenvalues

`(round(cumsum(Eigenvalues),2))`

`[1] 3023141 3313717 3382512 3407811 3430804 3434526 3435250`

### Cumulative proportional contribution

`(round(cumsum(Eigenvalues)/sum(Eigenvalues),2))`

`[1] 0.88 0.96 0.98 0.99 1.00 1.00 1.00`

### RESS (Refer to the paper for a description)

```
RESS = array(rep(0,7))
for (i in 1:7){
RESS[i] = sum(Eigenvalues)-sum(Eigenvalues[1:i])
}
RESS
```

```
[1] 412108.5146 121532.6756 52737.4423 27438.4927 4446.2453 723.9238
[7] 0.0000
```

### Ratio of RESS and sum of eigenvalues

`round(RESS/sum(Eigenvalues),2)`

`[1] 0.12 0.04 0.02 0.01 0.00 0.00 0.00`
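The RESS loop above simply subtracts a cumulative sum from the total, so it can be vectorized as RESS_i = sum(λ) − cumsum(λ)_i. A numpy sketch using the eigenvalues printed in this example:

```python
import numpy as np

eigenvalues = np.array([3023141.2354, 290575.8390, 68795.2333,
                        25298.9496, 22992.2474, 3722.3214, 723.9238])

# RESS after keeping the first i components: residual sum of the remaining eigenvalues
ress = eigenvalues.sum() - np.cumsum(eigenvalues)
print(np.round(ress, 4))
```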

We will not calculate PRESS in this post, as it requires considering random models, which we will not pursue here.

`sessionInfo()`

```
R version 4.1.2 (2021-11-01)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 22000)
Matrix products: default
locale:
[1] LC_COLLATE=English_United States.1252
[2] LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] ggrepel_0.9.1 ggplot2_3.3.5
loaded via a namespace (and not attached):
[1] Rcpp_1.0.7 highr_0.9 bslib_0.3.1 compiler_4.1.2
[5] pillar_1.6.3 jquerylib_0.1.4 tools_4.1.2 digest_0.6.28
[9] jsonlite_1.7.2 evaluate_0.14 lifecycle_1.0.1 tibble_3.1.2
[13] gtable_0.3.0 pkgconfig_2.0.3 rlang_0.4.11 DBI_1.1.1
[17] yaml_2.2.1 blogdown_1.7 xfun_0.29 fastmap_1.1.0
[21] withr_2.4.2 stringr_1.4.0 dplyr_1.0.6 knitr_1.36
[25] generics_0.1.0 sass_0.4.0 vctrs_0.3.8 tidyselect_1.1.1
[29] grid_4.1.2 glue_1.4.2 R6_2.5.1 fansi_0.4.2
[33] rmarkdown_2.8 bookdown_0.22 farver_2.1.0 purrr_0.3.4
[37] magrittr_2.0.1 scales_1.1.1 htmltools_0.5.2 ellipsis_0.3.2
[41] assertthat_0.2.1 colorspace_2.0-1 labeling_0.4.2 utf8_1.2.1
[45] stringi_1.6.2 munsell_0.5.0 crayon_1.4.1
```

Though unusually long, I hope this post will be of help to the (courageous) readers who work their way through it till the end. Comments regarding any errors or omissions may be sent to the author's email.

## Reference

- Abdi, H., & Williams, L. J. (2010). Principal component analysis. Wiley interdisciplinary reviews: computational statistics, 2(4), 433-459.

Last updated: 19th January, 2020