Principal Component Analysis - Part III


In this post, we will reproduce the results of a popular paper on PCA. The paper is titled ‘Principal component analysis’ and is authored by Herve Abdi and Lynne J. Williams. It was published in 2010 and since then its popularity has only grown. Its number of citations is more than 4800 as per Google Scholar (this was the count when this post was last revised).

This post is Part-III of a three-part series on PCA. Other parts of the series can be found at the links below.

This post contains code snippets in R. Equivalent MATLAB code can be written using the commands of Part-II. For figures, readers will have to write their own MATLAB code.

Structure of the paper

Along with the basic theory, the paper contains three examples on PCA, one example on correspondence analysis, and one example on multiple factor analysis. We will focus only on the PCA examples in this post.

To run the following R code seamlessly, readers have to load the following packages. If a package has not been installed previously, use install.packages("package_name") to install it.
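For example, both packages can be installed at once:

install.packages(c("ggplot2", "ggrepel"))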

library(ggplot2)
library(ggrepel)

How to get data

Data for the examples have been taken from the paper [1]. The datasets are quite small, so one way to read the data is to create the data frames directly in R using the values given in the paper. Alternatively, the values can first be stored in a CSV file and then read into R. To make this post self-contained, we will adopt the former approach.
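For reference, the CSV route would look roughly like this (a sketch, assuming the values of Table 1 have been saved to a hypothetical file table1.csv with the same column names):

# Hypothetical file; not used in the rest of this post
words = read.csv("table1.csv", stringsAsFactors = FALSE)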

Note: Throughout this article, additional comments have been made beside code segments. It is a good idea to read those commented lines along with the code.

# Table 1
# Create a dataframe
Words = c("Bag", "Across", "On", "Insane", "By", "Monastery", "Relief", "Slope", "Scoundrel", "With", "Neither", "Pretentious", "Solid", "This", "For", "Therefore", "Generality", "Arise", "Blot", "Infectious")
Word_length = c(3, 6, 2, 6, 2, 9, 6, 5, 9, 4, 7, 11, 5, 4, 3, 9, 10, 5, 4, 10)
Lines_in_dict = c(14, 7, 11, 9, 9, 4, 8, 11, 5, 8, 2, 4, 12, 9, 8, 1, 4, 13, 15, 6)
words = data.frame(Words, Word_length, Lines_in_dict, stringsAsFactors = F)
words
         Words Word_length Lines_in_dict
1          Bag           3            14
2       Across           6             7
3           On           2            11
4       Insane           6             9
5           By           2             9
6    Monastery           9             4
7       Relief           6             8
8        Slope           5            11
9    Scoundrel           9             5
10        With           4             8
11     Neither           7             2
12 Pretentious          11             4
13       Solid           5            12
14        This           4             9
15         For           3             8
16   Therefore           9             1
17  Generality          10             4
18       Arise           5            13
19        Blot           4            15
20  Infectious          10             6
(words_centered = scale(words[,2:3],scale = F)) # Centering after removing the first column
      Word_length Lines_in_dict
 [1,]          -3             6
 [2,]           0            -1
 [3,]          -4             3
 [4,]           0             1
 [5,]          -4             1
 [6,]           3            -4
 [7,]           0             0
 [8,]          -1             3
 [9,]           3            -3
[10,]          -2             0
[11,]           1            -6
[12,]           5            -4
[13,]          -1             4
[14,]          -2             1
[15,]          -3             0
[16,]           3            -7
[17,]           4            -4
[18,]          -1             5
[19,]          -2             7
[20,]           4            -2
attr(,"scaled:center")
  Word_length Lines_in_dict 
            6             8 

Covariance PCA

Covariance PCA uses the centered data matrix, but the data matrix is not scaled. prcomp() centers the data by default, so we only have to make sure scaling is turned off (scale = F).

pca_words_cov = prcomp(words[,2:3],scale = F) # cov stands for Covariance PCA
factor_scores_words = pca_words_cov$x
round(factor_scores_words,2)
        PC1   PC2
 [1,] -6.67  0.69
 [2,]  0.84 -0.54
 [3,] -4.68 -1.76
 [4,] -0.84  0.54
 [5,] -2.99 -2.84
 [6,]  4.99  0.38
 [7,]  0.00  0.00
 [8,] -3.07  0.77
 [9,]  4.14  0.92
[10,] -1.07 -1.69
[11,]  5.60 -2.38
[12,]  6.06  2.07
[13,] -3.91  1.30
[14,] -1.92 -1.15
[15,] -1.61 -2.53
[16,]  7.52 -1.23
[17,]  5.52  1.23
[18,] -4.76  1.84
[19,] -6.98  2.07
[20,]  3.83  2.30
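As a quick check that prcomp() has indeed worked on the centered (but unscaled) data, the factor scores should equal the centered data matrix multiplied by the loading (rotation) matrix:

# Factor scores are the centered data projected onto the loadings
all.equal(unname(words_centered %*% pca_words_cov$rotation),
          unname(pca_words_cov$x)) # Should return TRUE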

Observe that the factor scores for PC1 are the negatives of those given in the paper. This is not a problem, as principal directions are determined only up to a sign.

Principal directions are orthogonal

#  It can also be checked that both the principal components are orthogonal.
sum(factor_scores_words[,1]*factor_scores_words[,2]) # PCs are orthogonal
[1] -4.773959e-15

Contribution of each observation to a component

It is defined as the square of the factor score divided by the sum of squares of the factor scores in that column.
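In the paper's notation, for observation \(i\) and component \(\ell\) with factor score \(f_{i,\ell}\), this is

\[ ctr_{i,\ell} = \frac{f_{i,\ell}^{2}}{\sum_{i} f_{i,\ell}^{2}}. \]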

round(factor_scores_words[,1]^2/sum(factor_scores_words[,1]^2)*100,2)
 [1] 11.36  0.18  5.58  0.18  2.28  6.34  0.00  2.40  4.38  0.29  8.00  9.37
[13]  3.90  0.94  0.66 14.41  7.78  5.77 12.43  3.75
round(factor_scores_words[,2]^2/sum(factor_scores_words[,2]^2)*100,2)
 [1]  0.92  0.55  5.98  0.55 15.49  0.28  0.00  1.13  1.63  5.48 10.87  8.25
[13]  3.27  2.55 12.32  2.90  2.90  6.52  8.25 10.18

The calculations in the above two lines can be done in a single line:

round(factor_scores_words^2/matrix(rep(colSums(factor_scores_words^2),nrow(words)),ncol = 2,byrow = T)*100,2)
        PC1   PC2
 [1,] 11.36  0.92
 [2,]  0.18  0.55
 [3,]  5.58  5.98
 [4,]  0.18  0.55
 [5,]  2.28 15.49
 [6,]  6.34  0.28
 [7,]  0.00  0.00
 [8,]  2.40  1.13
 [9,]  4.38  1.63
[10,]  0.29  5.48
[11,]  8.00 10.87
[12,]  9.37  8.25
[13,]  3.90  3.27
[14,]  0.94  2.55
[15,]  0.66 12.32
[16,] 14.41  2.90
[17,]  7.78  2.90
[18,]  5.77  6.52
[19,] 12.43  8.25
[20,]  3.75 10.18
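The same result can also be obtained a bit more idiomatically with sweep(), which divides each column by the corresponding column sum:

round(sweep(factor_scores_words^2, 2, colSums(factor_scores_words^2), "/")*100, 2)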

Squared distance to center of gravity

(dist = rowSums(factor_scores_words^2))
 [1] 45  1 25  1 17 25  0 10 18  4 37 41 17  5  9 58 32 26 53 20

Squared cosine of observations
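In the paper's notation, writing \(d_{i}^{2} = \sum_{\ell} f_{i,\ell}^{2}\) for the squared distance computed above, the squared cosine of observation \(i\) with component \(\ell\) is

\[ \cos^{2}_{i,\ell} = \frac{f_{i,\ell}^{2}}{d_{i}^{2}}. \]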

(sq_cos = round(factor_scores_words^2/rowSums(factor_scores_words^2)*100))
      PC1 PC2
 [1,]  99   1
 [2,]  71  29
 [3,]  88  12
 [4,]  71  29
 [5,]  53  47
 [6,]  99   1
 [7,] NaN NaN
 [8,]  94   6
 [9,]  95   5
[10,]  29  71
[11,]  85  15
[12,]  90  10
[13,]  90  10
[14,]  74  26
[15,]  29  71
[16,]  97   3
[17,]  95   5
[18,]  87  13
[19,]  92   8
[20,]  74  26

NaNs are produced because of division by zero (the 7th observation has both factor scores equal to zero).

# Figure 1
p = ggplot(words,aes(x = Lines_in_dict,y = Word_length,label = Words))+
  geom_point()+ geom_text_repel()+ 
  geom_hline(yintercept = 6)+geom_vline(xintercept = 8)+
  labs(x = "Lines in dictionary",y = "Word length")
print(p)

# Show directions of PCs
# Note that geom_abline draws the line y = intercept + slope*x, with the intercept measured from the origin. In our case the data are mean shifted.
# So we have to adjust the intercepts so that both lines pass through the new origin (the centroid). These adjustments have been made below.
slope1 = pca_words_cov$rotation[1,1]/pca_words_cov$rotation[2,1] # Slope of first PC
slope2 = pca_words_cov$rotation[1,2]/pca_words_cov$rotation[2,2] # Slope of second PC
(new_origin = c(mean(words$Lines_in_dict),mean(words$Word_length)))
[1] 8 6
intercept1 = 6 - slope1*8
intercept2 = 6 - slope2*8
p+geom_abline(slope = slope1,intercept = intercept1,linetype = "dashed",size = 1.2,col = "red")+
  geom_abline(slope = slope2,intercept = intercept2,linetype = "dashed",size = 1.2,col = "blue")

In the above figure red dashed line is the 1st principal component (PC) and blue dashed line is the 2nd PC.

Rotated PCs

This figure is obtained by plotting the factor scores. Note that we will plot the negative of the factor scores along the 1st PC to make the figure consistent with the paper.

ggplot(as.data.frame(pca_words_cov$x),aes(-pca_words_cov$x[,1],pca_words_cov$x[,2],label = words$Words))+
  geom_point()+geom_text_repel()+geom_hline(yintercept = 0)+geom_vline(xintercept = 0)+
  xlab("Factor score along PC1")+ylab("Factor score along PC2")

With supplementary data

Given a supplementary point (a point previously not used in finding the principal components), we first have to center it. Its factor scores can then be obtained by multiplying it with the loading matrix.
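In matrix form, if \(\textbf{x}_{sup}\) denotes the centered supplementary observation (as a row vector) and \(\textbf{P}\) the loading matrix, then its factor scores are \(\textbf{f}_{sup} = \textbf{x}_{sup}\textbf{P}\), which is exactly what the code below computes.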

Factor score of the new word ‘sur’

sur = c(3,12) # It has 3 letters and 12 lines of dictionary entry
(sur_centered = sur - colMeans(words[,2:3]))
  Word_length Lines_in_dict 
           -3             4 
(factor_scores_sur = round(sur_centered %*% pca_words_cov$rotation,2))
       PC1   PC2
[1,] -4.99 -0.38

Eigenvalues and variance

See Part-II for details.

Total variance before transformation

(total_var_before = round(sum(diag(var(words_centered))),3))
[1] 23.368

Total variance after transformation

(total_var_after = round(sum(diag(var(pca_words_cov$x))),3))
[1] 23.368

Correlation between principal components and original variables (in the paper, this correlation is also termed loading, but we will strictly reserve the term loading to mean the loading matrix \(\textbf{P}\); see Part-I)

The sum of squared correlation coefficients between a variable and all the principal components is 1. Intuitively, this means that the variables are orthogonally projected onto the principal components.

Correlation matrix

# Correlation between PCs and original variables
(cor(pca_words_cov$x,words_centered))
    Word_length Lines_in_dict
PC1   0.8679026    -0.9741764
PC2   0.4967344     0.2257884

Note that our correlation coefficients don't match those given in the paper. Readers who obtain the exact values given in the paper are encouraged to send me an email using my contact details. However, our procedure is correct, and it does indeed give the correct answer for the supplementary data, as described below.

Squared correlation

(cor(pca_words_cov$x,words_centered)^2)
    Word_length Lines_in_dict
PC1   0.7532549    0.94901961
PC2   0.2467451    0.05098039

The sum of squared correlation coefficients between each variable and the principal components is indeed 1.

colSums((cor(pca_words_cov$x,words_centered)^2))
  Word_length Lines_in_dict 
            1             1 

Loading matrix

(loading_matrix = pca_words_cov$rotation)
                     PC1       PC2
Word_length    0.5368755 0.8436615
Lines_in_dict -0.8436615 0.5368755

Correlation score for supplementary variables

# Supplementary variable (Table 4)
Frequency = c(8,230,700,1,500,1,9,2,1,700,7,1,4,500,900,3,1,10,1,1)
Num_entries = c(6,3,12,2,7,1,1,6,1,5,2,1,5,9,7,1,1,4,4,2)
supp_data = data.frame(Frequency,Num_entries) # Supplementary data
supp_data
   Frequency Num_entries
1          8           6
2        230           3
3        700          12
4          1           2
5        500           7
6          1           1
7          9           1
8          2           6
9          1           1
10       700           5
11         7           2
12         1           1
13         4           5
14       500           9
15       900           7
16         3           1
17         1           1
18        10           4
19         1           4
20         1           2

Centered supplementary data

supp_data_cent = scale(supp_data,scale = F) # Centered supplementary data

Correlation score for supplementary data

(corr_score_supp = round(cor(pca_words_cov$x,supp_data),4))
    Frequency Num_entries
PC1   -0.3012     -0.6999
PC2   -0.7218     -0.4493

Note that the correlation score doesn't depend on whether the supplementary data are centered or not.

(round(cor(pca_words_cov$x,supp_data_cent),4))
    Frequency Num_entries
PC1   -0.3012     -0.6999
PC2   -0.7218     -0.4493

Squared correlation

(round(cor(pca_words_cov$x,supp_data_cent)^2,4))
    Frequency Num_entries
PC1    0.0907      0.4899
PC2    0.5210      0.2019

Column sums of squared correlations for the supplementary data

(round(colSums(cor(pca_words_cov$x,supp_data_cent)^2),4))
  Frequency Num_entries 
     0.6118      0.6918 

Correlation circle plot

# First plot correlation circle
x = seq(0,2*pi,length.out = 300)
circle = ggplot() + geom_path(data = data.frame(a = cos(x),b = sin(x)),
                     aes(a,b),alpha = 0.3, size = 1.5)+
            geom_hline(yintercept = 0)+geom_vline(xintercept = 0)+
  annotate("text",x = c(1.08,0.05),y = c(0.05,1.08),label = c("PC1","PC2"),angle = c(0,90))+
            xlab(NULL)+ylab(NULL)
# Plotting original variables
cor_score = as.data.frame(cor(words_centered,pca_words_cov$x))
variable_plot_original = circle + geom_point(data = cor_score,  aes(cor_score[,1],cor_score[,2]))+
  geom_text_repel(aes(cor_score[,1],cor_score[,2],
                      label = c("Length of words","Number of lines in Dict."))) 
print(variable_plot_original)

Plotting supplementary variables

variable_plot_original+
  geom_point(data = as.data.frame(corr_score_supp),
             aes(corr_score_supp[,1],corr_score_supp[,2]))+
  geom_text_repel(aes(corr_score_supp[,1],corr_score_supp[,2],
                      label = c("Frequency","Number of entries"))) 

Observe that our correlation circle plot is flipped about the y-axis (i.e., PC2) when compared to the plot given in the paper. This is because our first principal component is the negative of the one given in the paper, so the correlation scores along PC1 come out with the opposite sign. Hence, our plot flips about the y-axis.

Example 2 (Wine example)

Correlation PCA with wine data

# Table 6
wine_type = c(paste("wine", 1:5, sep = "_"))
hedonic = c(14, 10, 8, 2, 6)
for_meat = c(7, 7, 5, 4, 2)
for_dessert = c(8, 6, 5, 7, 4)
price = c(7, 4, 10, 16, 13)
sugar = c(7, 3, 5, 7, 3)
alcohol = c(13, 14, 12, 11, 10)
acidity = c(7, 7, 5, 3, 3)
wine = data.frame(wine_type, hedonic, for_meat, for_dessert, price, sugar, alcohol, acidity, stringsAsFactors = F)
wine
  wine_type hedonic for_meat for_dessert price sugar alcohol acidity
1    wine_1      14        7           8     7     7      13       7
2    wine_2      10        7           6     4     3      14       7
3    wine_3       8        5           5    10     5      12       5
4    wine_4       2        4           7    16     7      11       3
5    wine_5       6        2           4    13     3      10       3
pca_wine_cor = prcomp(wine[2:8],scale = T)
ggplot(as.data.frame(pca_wine_cor$x),aes(x = pca_wine_cor$x[,1],y =  pca_wine_cor$x[,2], label = paste0("wine ",1:5)))+
  geom_point()+geom_text_repel()+ geom_vline(xintercept = 0)+ geom_hline(yintercept = 0)+
  xlab("Factor score along PC1")+ylab("Factor score along PC2")

Again, our figure appears upside down compared to that of the paper. This is a minor discrepancy: our 2nd eigenvector is the negative of the one considered in the paper. We could match the plot with that of the paper by simply flipping the second principal component, but we will not do that here.

Factor scores along 1st and 2nd PC

# Table 7
(pca_wine_cor$x[,1:2])
            PC1       PC2
[1,] -2.3301649  1.095284
[2,] -2.0842419 -1.223185
[3,]  0.1673228 -0.370258
[4,]  1.7842392  1.712563
[5,]  2.4628448 -1.214405

Contribution of each observation to principal component

round(pca_wine_cor$x[,1:2]^2/matrix(rep(colSums(pca_wine_cor$x[,1:2]^2),nrow(wine)),ncol = 2,byrow = T)*100,2)
       PC1   PC2
[1,] 28.50 16.57
[2,] 22.80 20.66
[3,]  0.15  1.89
[4,] 16.71 40.51
[5,] 31.84 20.37

Squared cosines of observations with the first two PCs

(sq_cos = round(pca_wine_cor$x[,1:2]^2/rowSums(pca_wine_cor$x^2)*100))
     PC1 PC2
[1,]  77  17
[2,]  69  24
[3,]   7  34
[4,]  50  46
[5,]  78  19

Loading scores corresponding to first two principal components

(round(pca_wine_cor$rotation[,1:2],2))
              PC1   PC2
hedonic     -0.40 -0.11
for_meat    -0.45  0.11
for_dessert -0.26  0.59
price        0.42  0.31
sugar       -0.05  0.72
alcohol     -0.44 -0.06
acidity     -0.45 -0.09

Correlation scores of variables with the first two principal components

(corr_score_wine = round(cor(pca_wine_cor$x,wine[,2:8])[1:2,],2))
    hedonic for_meat for_dessert price sugar alcohol acidity
PC1   -0.87    -0.97       -0.58  0.91 -0.11   -0.96   -0.99
PC2   -0.15     0.15        0.79  0.42  0.97   -0.07   -0.12

Correlation circle for wine data

# Figure 6
corr_score_wine = t(corr_score_wine)
circle + 
  geom_point(data = as.data.frame(corr_score_wine),
             aes(corr_score_wine[,1],corr_score_wine[,2]))+
  geom_text_repel(aes(corr_score_wine[,1],corr_score_wine[,2],
                      label = c("Hedonic","For Meat","For dessert","Price","Sugar","Alcohol","Acidity")))

Varimax rotation

Rotation is applied to the loading matrix so that, after rotation, the principal components become easier to interpret. By interpretable we mean that some loading scores end up with higher values and others with lower values, so the variables with higher loading scores can be said to contribute significantly to a principal component compared to the variables with lower loading scores. Though rotation works in certain cases, it must be remembered that it is no magic wand for principal component interpretability. One of the most popular rotations is varimax rotation, and R has a built-in command, varimax(), to perform it.
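Concretely, varimax seeks an orthogonal rotation of the loading columns that maximizes the variance of the squared loadings within each rotated component. In a standard form (a sketch of the criterion, not the paper's exact notation),

\[ V = \sum_{\ell}\left[\frac{1}{J}\sum_{j} q_{j,\ell}^{4} - \left(\frac{1}{J}\sum_{j} q_{j,\ell}^{2}\right)^{2}\right], \]

where \(q_{j,\ell}\) is the loading of variable \(j\) on rotated component \(\ell\) and \(J\) is the number of variables. Pushing squared loadings towards either large or small values is what makes the rotated components easier to read.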

Varimax rotation can be performed on the whole loading matrix or on a few components only. In the paper, varimax has been applied to the first two principal components.

Loading scores of first two principal components

(round(pca_wine_cor$rotation[,1:2],2))
              PC1   PC2
hedonic     -0.40 -0.11
for_meat    -0.45  0.11
for_dessert -0.26  0.59
price        0.42  0.31
sugar       -0.05  0.72
alcohol     -0.44 -0.06
acidity     -0.45 -0.09

Varimax applied to first two principal components

rotated_loading_scores = varimax(pca_wine_cor$rotation[,1:2])

Loading scores after rotation

# Table 10
(round(rotated_loading_scores$loadings[,1:2],2))
              PC1   PC2
hedonic     -0.41 -0.02
for_meat    -0.41  0.21
for_dessert -0.12  0.63
price        0.48  0.21
sugar        0.12  0.72
alcohol     -0.44  0.05
acidity     -0.46  0.02

The same result can also be obtained by multiplying the original loading matrix by the rotation matrix obtained from varimax.

(round(pca_wine_cor$rotation[,1:2] %*% rotated_loading_scores$rotmat,2))
             [,1]  [,2]
hedonic     -0.41 -0.02
for_meat    -0.41  0.21
for_dessert -0.12  0.63
price        0.48  0.21
sugar        0.12  0.72
alcohol     -0.44  0.05
acidity     -0.46  0.02

Plot of loading scores before rotation

#Figure 7
ggplot(as.data.frame(pca_wine_cor$rotation[,1:2]),aes(x = pca_wine_cor$rotation[,1],y = pca_wine_cor$rotation[,2],
                                                      label = c("Hedonic","For Meat","For dessert","Price","Sugar","Alcohol","Acidity")))+
  geom_point()+geom_text_repel()+geom_hline(yintercept = 0)+geom_vline(xintercept = 0)+
  xlab("Loading score along PC1")+ylab("Loading score along PC2")

Plot of loading scores after rotation

ggplot(as.data.frame(rotated_loading_scores$loadings[,1:2]),
                     aes(x = rotated_loading_scores$loadings[,1],
                         y = rotated_loading_scores$loadings[,2],
                         label = c("Hedonic","For Meat","For dessert","Price","Sugar","Alcohol","Acidity")))+
  geom_point()+geom_text_repel()+geom_hline(yintercept = 0)+geom_vline(xintercept = 0)+
    xlab("Loading score along PC1 after rotation")+
    ylab("Loading score along PC2 after rotation")

Example 3

French food example (Covariance PCA example)

# Table 11 
class = rep(c("Blue_collar", "White_collar", "Upper_class"), times = 4)
children = rep(c(2,3,4,5), each = 3)
bread = c(332, 293, 372, 406, 386, 438, 534, 460, 385, 655, 584, 515)
vegetables = c(428, 559, 767, 563, 608, 843, 660, 699, 789, 776, 995, 1097)
fruit = c(354, 388, 562, 341, 396, 689, 367, 484, 621, 423, 548, 887)
meat = c(1437, 1527, 1948, 1507, 1501, 2345, 1620, 1856, 2366, 1848, 2056, 2630)
poultry = c(526, 567, 927, 544, 558, 1148, 638, 762, 1149, 759, 893, 1167)
milk = c(247, 239, 235, 324, 319, 243, 414, 400, 304, 495, 518, 561)
wine = c(427, 258, 433, 407, 363, 341, 407, 416, 282, 486, 319, 284) # Note: this overwrites the wine data frame of Example 2, which is no longer needed
food = data.frame(class, children, bread, vegetables, fruit, meat, poultry, milk, wine, stringsAsFactors = F)
food
          class children bread vegetables fruit meat poultry milk wine
1   Blue_collar        2   332        428   354 1437     526  247  427
2  White_collar        2   293        559   388 1527     567  239  258
3   Upper_class        2   372        767   562 1948     927  235  433
4   Blue_collar        3   406        563   341 1507     544  324  407
5  White_collar        3   386        608   396 1501     558  319  363
6   Upper_class        3   438        843   689 2345    1148  243  341
7   Blue_collar        4   534        660   367 1620     638  414  407
8  White_collar        4   460        699   484 1856     762  400  416
9   Upper_class        4   385        789   621 2366    1149  304  282
10  Blue_collar        5   655        776   423 1848     759  495  486
11 White_collar        5   584        995   548 2056     893  518  319
12  Upper_class        5   515       1097   887 2630    1167  561  284
pca_food_cov = prcomp(food[,3:9],scale = F)

Factor scores

# Table 12
(factor_scores_food = round(pca_food_cov$x[,1:2],2))
          PC1     PC2
 [1,] -635.05  120.89
 [2,] -488.56  142.33
 [3,]  112.03  139.75
 [4,] -520.01  -12.05
 [5,] -485.94   -1.17
 [6,]  588.17  188.44
 [7,] -333.95 -144.54
 [8,]  -57.51  -42.86
 [9,]  571.32  206.76
[10,]  -39.38 -264.47
[11,]  296.04 -235.92
[12,]  992.83  -97.15

Contribution of each observation to principal component

round(pca_food_cov$x[,1:2]^2/matrix(rep(colSums(pca_food_cov$x[,1:2]^2),nrow(food)),ncol = 2,byrow = T)*100,2)
        PC1   PC2
 [1,] 13.34  5.03
 [2,]  7.90  6.97
 [3,]  0.42  6.72
 [4,]  8.94  0.05
 [5,]  7.81  0.00
 [6,] 11.44 12.22
 [7,]  3.69  7.19
 [8,]  0.11  0.63
 [9,] 10.80 14.71
[10,]  0.05 24.07
[11,]  2.90 19.15
[12,] 32.61  3.25
dist = pca_food_cov$x[,1]^2+pca_food_cov$x[,2]^2 # Squared distance based on the first two PCs only (not used further below)

Squared cosine of observations

(sq_cos = round(pca_food_cov$x[,1:2]^2/rowSums(pca_food_cov$x^2)*100))
      PC1 PC2
 [1,]  95   3
 [2,]  86   7
 [3,]  26  40
 [4,] 100   0
 [5,]  98   0
 [6,]  89   9
 [7,]  83  15
 [8,]  40  22
 [9,]  86  11
[10,]   2  79
[11,]  57  36
[12,]  97   1

Squared loading scores

# Table 13
(round(pca_food_cov$rotation[,1:2]^2,2))
            PC1  PC2
bread      0.01 0.33
vegetables 0.11 0.17
fruit      0.09 0.01
meat       0.57 0.01
poultry    0.22 0.06
milk       0.01 0.40
wine       0.00 0.02

Note that this table doesn't match the one given in the paper. We will stick to our analysis.

Correlation score

(corr_score_food = round((cor(pca_food_cov$x,food[,3:9])[1:2,]),2))
    bread vegetables fruit meat poultry  milk  wine
PC1  0.36       0.91  0.96 1.00    0.98  0.41 -0.43
PC2 -0.87      -0.35  0.10 0.04    0.16 -0.88 -0.33

Squared correlation score

(round((cor(pca_food_cov$x,food[,3:9])[1:2,])^2,2))
    bread vegetables fruit meat poultry milk wine
PC1  0.13       0.83  0.92    1    0.96 0.17 0.18
PC2  0.76       0.12  0.01    0    0.03 0.77 0.11

Correlation circle for food data

# Figure 9
corr_score_food = t(corr_score_food)
circle + geom_point(data = as.data.frame(corr_score_food), 
                    aes(x = corr_score_food[,1],y = corr_score_food[,2]))+
  geom_text_repel(data = as.data.frame(corr_score_food), 
                  aes(x = corr_score_food[,1],y = corr_score_food[,2],
                      label = c("Bread","Vegetables","Fruit","Meat","Poultry","Milk","Wine")))

Now observe that our correlation circle plot is quite close to that of the paper (though in opposite quadrants, which, as previously mentioned, is not a problem).

Eigenvalues

The eigenvalues used here are those of \(\textbf{X}^{T}\textbf{X}\), where \(\textbf{X}\) is the centered data matrix. These are the squares of the singular values of \(\textbf{X}\), so they can be obtained as below.

## Table 14
cent_food = food[,3:9]-matrix(rep(colMeans(food[,3:9]),times = 12),nrow = 12,
                              byrow = T)
svd_food = svd(cent_food)
(Eigenvalues = (svd_food$d)^2)
[1] 3023141.2354  290575.8390   68795.2333   25298.9496   22992.2474
[6]    3722.3214     723.9238

Important Note: These eigenvalues are not the same as the variances of the factor scores along the principal components. The variance of the factor scores along a principal component is obtained by dividing the corresponding eigenvalue by \((n-1)\), where \(n\) is the number of data points (in this case \(n = 12\)). If this point is still not clear, refer to Part-II.
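This relationship is easy to verify (a quick sketch; prcomp() stores the standard deviations of the factor scores in sdev):

# Variance of factor scores = eigenvalue / (n - 1); both lines should print the same numbers
round(Eigenvalues/(nrow(food) - 1), 3)
round(pca_food_cov$sdev^2, 3)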

Proportion of total variance explained by each principal component

(round(Eigenvalues/sum(Eigenvalues),2))
[1] 0.88 0.08 0.02 0.01 0.01 0.00 0.00

Cumulative sum of eigenvalues

(round(cumsum(Eigenvalues),2))
[1] 3023141 3313717 3382512 3407811 3430804 3434526 3435250

Cumulative proportion of total variance explained

(round(cumsum(Eigenvalues)/sum(Eigenvalues),2))
[1] 0.88 0.96 0.98 0.99 1.00 1.00 1.00

RESS (residual sum of squares; refer to the paper for a full description). When the first few components are retained, RESS equals the sum of the remaining eigenvalues, which is how it is computed below.

RESS = array(rep(0,7))
for (i in 1:7){
  RESS[i] = sum(Eigenvalues)-sum(Eigenvalues[1:i])
}
RESS
[1] 412108.5146 121532.6756  52737.4423  27438.4927   4446.2453    723.9238
[7]      0.0000
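The same vector can be obtained without an explicit loop:

(RESS_alt = sum(Eigenvalues) - cumsum(Eigenvalues))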

Ratio of RESS and sum of eigenvalues

round(RESS/sum(Eigenvalues),2)
[1] 0.12 0.04 0.02 0.01 0.00 0.00 0.00

We will not calculate the value of PRESS in this post, as it requires us to consider random models, and we will not pursue that here.

sessionInfo()
R version 4.1.0 (2021-05-18)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19042)

Matrix products: default

locale:
[1] LC_COLLATE=English_India.1252  LC_CTYPE=English_India.1252   
[3] LC_MONETARY=English_India.1252 LC_NUMERIC=C                  
[5] LC_TIME=English_India.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] ggrepel_0.9.1 ggplot2_3.3.3

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.6        knitr_1.33        magrittr_2.0.1    munsell_0.5.0    
 [5] colorspace_2.0-1  R6_2.5.0          rlang_0.4.11      fansi_0.4.2      
 [9] highr_0.9         stringr_1.4.0     tools_4.1.0       grid_4.1.0       
[13] gtable_0.3.0      xfun_0.23         utf8_1.2.1        withr_2.4.2      
[17] htmltools_0.5.1.1 ellipsis_0.3.2    yaml_2.2.1        digest_0.6.27    
[21] tibble_3.1.2      lifecycle_1.0.0   crayon_1.4.1      bookdown_0.22    
[25] farver_2.1.0      vctrs_0.3.8       glue_1.4.2        evaluate_0.14    
[29] rmarkdown_2.8     blogdown_1.3      labeling_0.4.2    stringi_1.6.2    
[33] compiler_4.1.0    pillar_1.6.1      scales_1.1.1      pkgconfig_2.0.3  

Though unusually long, I hope this post will be of help to (courageous) readers who work their way through it till the end. Comments regarding any errors or omissions may be sent to the author's email.

Reference

  1. Abdi, H., & Williams, L. J. (2010). Principal component analysis. Wiley Interdisciplinary Reviews: Computational Statistics, 2(4), 433-459.

Last updated: 19th January, 2020
