PCA (principal component analysis)

Author

Brandy Carr, Lee, Tavish

Published

December 7, 2022

1 Introduction

Principal Component Analysis (PCA) is an unsupervised machine learning method (Figure 1) that is often used as a first step in data analysis. The main goal of PCA is to reduce the number of dimensions/features while retaining most of the original variability. Reducing the number of features makes it easier to graphically visualize hidden patterns in the data, and it is also useful when a regression model involves a large number of predictor variables or highly correlated variables (Lang and Qiu 2021). After the data have been transformed with PCA, a clustering algorithm such as k-means can add power to the analysis of high dimensional data containing linearly correlated features (Alkhayrat, Aljnidi, and Aljoumaa 2020); a brief sketch of this pairing is shown after Figure 1. Because PCA projects high dimensional data into lower dimensions (often 2 or 3), it is one of the most common (and useful) methods for exploratory analysis. Plotting the data in 2 dimensions, we can visualize trends, jumps, clusters, patterns, and outliers that would otherwise remain hidden in the original (high dimensional) data.

Figure 1: Machine Learning
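
As a brief, hedged illustration of the PCA-then-k-means pairing mentioned above (using R's built-in iris data, which is not part of this project), a minimal sketch might look like this:

Code
# Minimal sketch (illustrative only): PCA first, then k-means on the reduced scores
pc.iris <- prcomp(iris[, 1:4], scale. = TRUE)        # PCA on the 4 numeric measurements
scores  <- pc.iris$x[, 1:2]                          # keep the first two principal components
km      <- kmeans(scores, centers = 3, nstart = 25)  # cluster in the reduced 2-D space

plot(scores, col = km$cluster, pch = 20,
     xlab = "PC1", ylab = "PC2", main = "k-means clusters on the first two PCs")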

PCA is a broadly used statistical method whose applications stretch across many scientific disciplines, and many different adaptations of PCA have been developed to suit the goals and data types of each respective discipline. PCA was first developed by Pearson (1901) and Hotelling (1933) (Camargo 2022). The technique transforms a number of possibly correlated variables into a smaller number of new variables, referred to as the principal components (PCs). This is achieved using a vector space transform (see Figure 2). By mathematical projection we can interpret the original data set with just a few variables, namely the principal components obtained through calculation. Reducing the dimension of large data sets makes it easier to spot trends, patterns, and outliers that would otherwise be hidden by the sheer size of the data (Richardson 2009). The information preserved in the reduction is the variability in the data (i.e., the statistical information present in the data). To preserve as much variability as possible we should “…find new variables that are linear functions of those in the original data set, that successively maximize variance and that are uncorrelated with each other” (Jolliffe and Cadima 2016). In its descriptive use, PCA assumes no particular distribution for the data, one of the key features that makes it so widely adaptable as an exploratory method across disciplines and data types (Jolliffe and Cadima 2016).

Figure 2: Simplified PCA Process (Makesh Manral 2022)

To list all of PCA’s applications would be tedious and excessive, but a few examples include image analysis, analysis of web data, and cyber security analysis. Essentially anywhere large data sets are found, PCA can be used to help discover trends among the variables. PCA can also be useful when studying mental disorders, where the data consist of symptoms and the connections between them. When many symptoms are being observed, it can be difficult to visually represent the connections between them, both the strength of the connections and their proximity to each other. Plotting using the first 2 principal components allows the placement on the x or y axis to become interpretable; that is, observations on the far left differ in some dimension (the first principal component) from those on the far right, and the same can be said in the y direction (Jones, Mair, and McNally 2018). In medicine, PCA aids in detecting microaneurysms, which has been critical for the diagnosis and treatment of diabetic retinopathy (Cao et al. 2018). Research papers describe a framework for coronary artery disease risk assessment in intravascular ultrasound, reflecting a novel strategy for risk stratification based on plaque morphology that embeds principal component analysis (PCA) for plaque feature dimensionality reduction and dominant feature selection (Gorgoglione et al. 2021). Camargo found that PCA, alongside Support Vector Machine (SVM) methods, is an effective way to compare and evaluate facial recognition results with speed and accuracy (Camargo 2022).

Taking a look at a concrete example, say we have a data set of 1,000 students with exam scores from 5 different courses: Advanced Statistics, Probability Theory, Intro to Dance, World Religions, and Religion in America. We could group Advanced Statistics and Probability Theory into a new variable called Stats, group World Religions and Religion in America into a new variable called Religion, and keep Intro to Dance by itself. We have reduced the data set from 5 variables to 3 without much loss of variation. This is the main concept behind PCA, except that the variables are not manually regrouped; instead the new variables (principal components) are derived from certain linear combinations of the original variables (Lang and Qiu 2021).
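
To make this concrete, here is a small simulated sketch of the exam-score idea (all names and numbers are hypothetical): two correlated statistics scores, two correlated religion scores, and an unrelated dance score reduce to roughly three meaningful components.

Code
# Hypothetical simulation of the exam-score example
set.seed(1)
n     <- 1000
stats <- rnorm(n, mean = 75, sd = 10)   # shared "stats ability"
relig <- rnorm(n, mean = 70, sd = 10)   # shared "religion knowledge"

exams <- data.frame(AdvStats   = stats + rnorm(n, sd = 3),
                    ProbTheory = stats + rnorm(n, sd = 3),
                    WorldRelig = relig + rnorm(n, sd = 3),
                    ReligAmer  = relig + rnorm(n, sd = 3),
                    IntroDance = rnorm(n, mean = 80, sd = 10))

summary(prcomp(exams, scale. = TRUE))   # the first 3 PCs capture nearly all of the variance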

Although PCA is commonly used as a first step in exploratory analysis, it does have a few limitations. From Figure 3 we can see that a visualization of the first 2 principal components is interpretable in either the x or the y direction separately; the two directions cannot be interpreted simultaneously, so the distance between a point’s x and y positions carries no meaning (Section 3.3.2). Another drawback is that, because the new variables (principal components) are linear combinations of the original variables, you may now be able to visualize hidden patterns, but reading off which of the original variables contribute to each principal component can be difficult. One of the assumptions (Section 2.1) when using PCA is that the original variables have some linear relationship, since the calculations rely on either the covariance or the correlation matrix. PCA does not perform as well as other dimension reduction tools when the variables are independent (not linearly related), since it will then simply order the variables by their variances as the principal components, as illustrated in the short sketch following Figure 3.

Figure 3: Comparing Visualization Methods: What to Use When (Jones, Mair, and McNally 2018)
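
To illustrate that last limitation, here is a small hedged sketch with simulated independent variables (the variable names are arbitrary): the rotation matrix that PCA returns loads each component almost entirely on a single original variable, ordered by variance.

Code
# Sketch: PCA on independent variables essentially re-orders them by variance
set.seed(2)
indep <- data.frame(a = rnorm(500, sd = 3),
                    b = rnorm(500, sd = 2),
                    c = rnorm(500, sd = 1))

round(prcomp(indep)$rotation, 2)   # each PC loads (almost) entirely on one variable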

2 Methods

PCA forms the basis of multivariate data analysis based on projection methods. The variability in the original (correlated) variables is explained through a smaller set of new, uncorrelated variables, i.e., the principal components (PCs). PCA results depend on the scale or units of the variables, so for unscaled data the calculations should either be performed on the correlation matrix (instead of the covariance matrix) or the data should first be standardized to mean 0 and variance 1 (z-scores) (Lang and Qiu 2021). When performing PCA, a new set of orthogonal coordinate axes is identified from the original data set by finding the direction of maximal variance; this is equivalent to using the least squares method to find the line of best fit. This new axis is the first principal component of the data set. Next, we use orthogonal projection to project the coordinates onto the new axis. We then obtain a second principal component (and principal coordinate axis) by finding the direction of the second largest variance in the data; this axis is orthogonal to the first PC. These two PCs define a plane onto which we can project further coordinates (Richardson 2009).

Code
%%{init: {'theme': 'base', 'themeVariables': { 'background': '#FAFAF5', 'primaryColor': '#EAEAD6', 'nodeBorder': '#8B814C', 'lineColor': '#8B814C', 'primaryTextColor': '#191970', 'textColor': '#191970', 'fontSize': '12px', 'width': '100%'}}}%%

flowchart LR
A[(Clean<br/>Data)] --> B((Are all<br/>vars the same<br/>scale/unit?))
B --> C((Yes))
B --> D((No))
D -.- |Standardize<br/>Data| E(Estimate<br/>Sample<br/>Mean<br/>Vector)
C --> E
D --> F(Estimate<br/>Sample<br/>Mean<br/>Vector)
E --> G(Estimate<br/>Sample<br/>Covariance<br/>Matrix)
F --> H(Estimate<br/>Sample<br/>Correlation<br/>Matrix)
G --> I(Eigenvalues<br/>Eigenvectors)
H --> I

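As the flow chart suggests, running PCA on standardized data is equivalent to working from the correlation matrix of the raw data. A minimal sketch of that equivalence, using the built-in mtcars data rather than the project data:

Code
# Sketch: the "standardize" route and the "correlation matrix" route agree
X <- mtcars[, c("mpg", "disp", "hp", "wt")]   # variables on very different scales

eig.cor <- eigen(cor(X))                      # eigen decomposition of the correlation matrix
eig.std <- eigen(cov(scale(X)))               # covariance matrix of the standardized data

all.equal(eig.cor$values, eig.std$values)     # TRUE: identical eigenvalues either way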

Assumptions

  • Variables should be continuous (interval or ratio level of measurement); ordinal variables, such as Likert-scale items, are also often used
  • Variables should all be on the same scale/units, otherwise standardize them (Equation 1)
  • Variables are linearly related - visually check scatter plot matrices (see Section 3.2.2 for an example)
  • Outliers & missing or impossible values should be removed (see Section 3.2.1 for an example)

Sample Data

The sample data consist of \(n\) observations (rows \(x_1, ..., x_n\)) measured on \(p\) variables (columns \(X_1, ..., X_p\)):

\[X_{n \times p}\ =\ \begin{pmatrix} x_{11} & x_{12} & \cdots & x_{1j} & \cdots & x_{1p} \\ x_{21} & x_{22} & \cdots & x_{2j} & \cdots & x_{2p} \\ \vdots & \vdots & & \vdots & & \vdots \\ x_{i1} & x_{i2} & \cdots & x_{ij} & \cdots & x_{ip} \\ \vdots & \vdots & & \vdots & & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{nj} & \cdots & x_{np} \end{pmatrix}\]

Equations

Standardized Data: \[{z_{ij}}\ =\ \frac{{x_{ij}} - {\overline{x_j}}} {\sqrt{\hat{\sigma_{jj}}}}, \ \ \ \ \ where\ \ \ \ i\ =\ \{1,2,...,n\}, \ \ \ \ j\ =\ \{1,2,...,p\} \tag{1}\]

Random Vector: \[{x_i} = {({x_{i1}}, ... , {x_{ip}})}^T,\ \ \ \ \ \ \ \ \ where\ \ i\ =\ \{1,2,...,n\} \tag{2}\]

Sample Mean Vector: \[\hat{\mu}\ \ =\ \ \overline{x}\ \ =\ \ \frac{1}{n} \sum_{i=1}^{n} x_i \tag{3}\]

Sample Covariance Matrix: \[\hat{\Sigma}\ \ =\ \ S\ \ =\ \ {(\hat{\sigma}_{ij})}_{p \times p}\ \ =\ \ \frac{1}{n-1} \sum_{i=1}^{n} {(x_i - \overline{x})(x_i - \overline{x})}^T \tag{4}\]

The sample mean vector represents the center of the random vector (\(x_i\))

The sample covariance matrix represents the variations (diagonal elements) & correlations (off-diagonal elements) of the random vector (\(x_i\))

Eigenvectors (Loadings): \[{\hat{a}_{k}}\ =\ k\text{-th eigenvector of } \hat{\Sigma},\ \ \ \ \ \ \ k\ =\ \{1,2,...,p\},\ \ \ \ \ \text{with eigenvalues}\ \ \hat{\lambda}_1\ \geq\ \hat{\lambda}_2\ \geq\ ...\ \geq\ \hat{\lambda}_p\ \geq\ 0 \tag{5}\]

Principal Component Scores: \[{\hat{y}_{ik}}\ =\ {\hat{a}^T_{k}}(x_i\ -\ \overline{x}),\ \ \ \ \ \ \ i\ =\ \{1,2,...,n\},\ \ \ \ \ \ \ k\ =\ \{1,2,...,p\} \tag{6}\]
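
The equations above can be verified directly in R. The sketch below (using the built-in USArrests data purely for illustration) computes the sample mean vector, covariance matrix, eigenvectors, and principal component scores by hand and compares them with prcomp().

Code
X     <- as.matrix(USArrests)

Z     <- scale(X)              # Equation 1: standardized data (z-scores), if scales differed
x.bar <- colMeans(X)           # Equation 3: sample mean vector
S     <- cov(X)                # Equation 4: sample covariance matrix

eig   <- eigen(S)              # eigenvalues & eigenvectors of S
a.hat <- eig$vectors           # Equation 5: eigenvectors (loadings)

# Equation 6: principal component scores, y_ik = a_k' (x_i - x.bar)
scores <- sweep(X, 2, x.bar) %*% a.hat

# agrees with prcomp() up to the arbitrary sign of each eigenvector
max(abs(abs(scores) - abs(prcomp(X)$x)))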


3 Analysis and Results

Packages

Code
library(dplyr)
library(ggplot2)
library(data.table)
library(ggfortify)
library(MASS)
library(AER)
library(tidyr)
library(paletteer)
library(knitr)
library(DescTools)
library(gt)
library(gridExtra)

Data

Responses from 2,000 participants rating their belief in various conspiracy theories on a 5-point Likert scale, sampled from a nationally representative survey conducted in 2011 (Oliver and Wood 2013).

SCALE LEVELS:

Label               Level
Strongly Disagree   1
Disagree            2
Neither             3
Agree               4
Strongly Agree      5

VARIABLES:

  • truther911: Certain U.S. government officials planned the attacks of September 11, 2001, because they wanted the United States to go to war in the Middle East.
  • obamabirth: President Barack Obama was not really born in the United States and does not have an authentic Hawaiian birth certificate.
  • fincrisis: The current financial crisis was secretly orchestrated by a small group of Wall Street bankers to extend the power of the Federal Reserve and further their control of the world’s economy.
  • fluorolights: The U.S. government is mandating the switch to compact fluorescent light bulbs because such lights make people more obedient and easier to control.
  • endtimes: We are currently living in End Times as foretold by Biblical prophecy.
  • sorosplot: Billionaire George Soros is behind a hidden plot to destabilize the American government, take control of the media, and put the world under his control.
  • iraqjews: The U.S. invasion of Iraq was not part of a campaign to fight terrorism, but was driven by oil companies and Jews in the U.S. and Israel.
  • vaportrail: Vapor trails left by aircraft are actually chemical agents deliberately sprayed in a clandestine program directed by government officials.

GROUPING FACTOR (y = political ideology):

  • Very Liberal
  • Liberal
  • Somewhat Liberal
  • Middle of the Road
  • Somewhat Conservative
  • Conservative
  • Very Conservative

Read & Clean Data

Code
# READ DATA FROM A GITHUB CSV FILE
conspiracy<- (read.csv("https://raw.githubusercontent.com/bjcarr08/sampleData/main/kaggleConspiracyTheoriesData.csv", stringsAsFactors = T))[,-1]

# REMOVE ROWS WITH NAs & IMPOSSIBLE VALUES (removed rows where participant marked 'not sure' as political ideology)
conspiracy<- conspiracy[complete.cases(conspiracy),] %>% filter(y!="Not Sure")

Visualize Data

Visually checking whether the assumption of linearity between variables is met (see Section 2.1 for assumptions). This is not a strictly held assumption, especially when using ordinal data.

The scatter plot matrix shows enough of a relationship between the variables to meet the assumption of linearity; the pairwise correlations are also checked numerically after the plot.

Code
# SCATTER PLOT MATRICES: TO CHECK LINEARITY ASSUMPTION
par(col.axis="#8B814C",col.lab="#8B814C",col.main="#8B814C",col.sub="#8B814C",pch=20, col="#8B814C", bg="transparent")
DescTools::PlotPairs(sapply(conspiracy[,-9], function(x) jitter(x, 5)), 
                     g=conspiracy$y,
                     col=alpha("#8B814C", 0.1), 
                     col.smooth="#8B814C")
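
As a complementary numeric check of the same assumption, the pairwise correlations between the eight items can be inspected directly; moderate positive correlations throughout support the use of PCA here.

Code
# NUMERIC CHECK: PAIRWISE CORRELATIONS BETWEEN THE 8 CONSPIRACY ITEMS
round(cor(conspiracy[,-9]), 2)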

Code
# TRANSFORM TO LONG DATA FOR PLOTS
conspiracy.Long<- conspiracy %>% pivot_longer(!y, names_to="conspiracy", values_to="score", values_transform=list(score=as.numeric))

# HISTOGRAMS
ggplot(conspiracy.Long, aes(score, fill=conspiracy, color=conspiracy)) +
  geom_histogram(alpha=0.2, breaks=seq(0,5,1)) +
  lemon::facet_rep_wrap(.~conspiracy, nrow=2, labeller="label_both", repeat.tick.labels=T) +
  labs(title="Distributions of Raw Score") +
  theme_bw() +
  theme(legend.position = "none",
        panel.border = element_rect(color = "#8B814C"),
        strip.background = element_rect(fill = "#EAEAD6", color = "#8B814C"),
        strip.text = element_text(color = "#8B814C", size=14),
        plot.background = element_rect(fill = "#FAFAF5"),
        axis.text = element_text(color = "#8B814C"),
        axis.title = element_text(color = "#8B814C", size=14),
        plot.title = element_text(color = "#8B814C", size=14),
        axis.ticks = element_line(color = "#8B814C"))

Looking at the distribution plots above, we can see that all variables are measured on the same 1-5 scale and do not need to be standardized (see Section 2.1 for assumptions).
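
This can also be confirmed numerically with a quick check that every item sits on the same 1-5 scale:

Code
# QUICK CHECK THAT ALL ITEMS SHARE THE SAME 1-5 SCALE
sapply(conspiracy[,-9], range)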

PCA

Code
options(width = 100)

# STANDARDIZE DATA [SKIP]
#conspiracy<- conspiracy %>% mutate(across(.cols=truther911:vaportrail, scale))

# RE-LEVEL POLITICAL IDEOLOGY (Very Liberal - Very Conservative)
conspiracy$y<- factor(conspiracy$y, levels=c("Very Liberal", "Liberal", "Somewhat Liberal", "Middle of the Road", "Somewhat Conservative", "Conservative", "Very Conservative"))

# RE-NAMED VARIABLE 'y'
names(conspiracy)[9]<- "PoliticalIdeology"

# DATA FOR PCA FUNCTION (only keep numeric variables)
df<- conspiracy[,-9]

# PCA
#pc1<- prcomp(df, scale.=T)
pc1<- prcomp(df)

summary(pc1)
Importance of components:
                          PC1    PC2     PC3    PC4     PC5     PC6     PC7     PC8
Standard deviation     2.2922 1.5820 1.05878 0.9088 0.88231 0.85821 0.80225 0.69443
Proportion of Variance 0.4256 0.2027 0.09081 0.0669 0.06306 0.05966 0.05214 0.03907
Cumulative Proportion  0.4256 0.6284 0.71917 0.7861 0.84913 0.90880 0.96093 1.00000

From the table directly above (Importance of components), we are mainly interested in the first 2 columns, i.e., the first two principal components. Looking at the second row of the table, Proportion of Variance, the first PC accounts for about 43% of the total variation and the second PC for about 20%, with each of the remaining PCs accounting for about 9% or less. When deciding how many PCs to keep, a common guideline is a cumulative proportion of at least 70%-80%, but this is only a rule of thumb; there are no set rules.
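
These proportions can also be recomputed by hand from the component standard deviations returned by prcomp():

Code
# PROPORTION & CUMULATIVE PROPORTION OF VARIANCE FROM THE COMPONENT STANDARD DEVIATIONS
prop.var <- pc1$sdev^2 / sum(pc1$sdev^2)
round(rbind(Proportion = prop.var, Cumulative = cumsum(prop.var)), 4)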

Scree-Plot

This plot shows the variance of each component, i.e., the ‘Proportion of Variance’ row from the previous table up to scaling. Here it is easy to see the drop-off after the second PC. This type of plot can be used to help decide how many PCs should be kept.

Code
par(col.axis="#8B814C",col.lab="#8B814C",col.main="#8B814C",col.sub="#8B814C",pch=20, col="#8B814C", bg="transparent")
screeplot(princomp(df), type="lines", bg="transparent", col="#8B814C", main="")

Biplot

In this plot the points are the observations plotted on their scores for the first two principal components (Equation 6), and the arrows are the variable loadings, i.e., the eigenvectors (Equation 5). The plot can only be interpreted in either the x direction (horizontal distances) or the y direction (vertical distances), but not both at once. Although the first PC accounts for about 43% of the variance, there is no distinction between the political ideology groups in the horizontal direction. We do, however, see a pattern in the vertical direction: obamabirth and sorosplot are associated with more conservative ideology, while iraqjews and truther911 are associated with more liberal ideology.

Code
autoplot(pc1,
  # AUTOPLOT OPTIONS
  data=conspiracy, 
  colour="PoliticalIdeology", 
  loadings=T, loadings.colour=alpha("#191970", 0.5), 
  loadings.label=T, loadings.label.colour="#191970", loadings.label.size=5, loadings.label.hjust=0) + 
  # CUSTOM COLORS FOR POLITICAL IDEOLOGY GROUPS
  scale_colour_manual(values = alpha(paletteer_d("rcartocolor::Temps"), 0.5)) +
  # GGPLOT THEME OPTIONS
  theme_bw() +
  theme(legend.key = element_rect(fill = "#FAFAF5"),
        legend.background = element_rect(fill = "#FAFAF5"),
        legend.text = element_text(color = "#8B814C", size = 14),
        legend.title = element_text(color = "#8B814C", size = 16),
        panel.border = element_rect(color = "#8B814C"),
        plot.background = element_rect(fill = "#FAFAF5"),
        axis.text = element_text(color = "#8B814C", size = 14),
        axis.title = element_text(color = "#8B814C", size = 16),
        axis.ticks = element_line(color = "#8B814C"))

Loadings

The matrix of variable loadings (i.e., a matrix whose columns contain the eigenvectors).

Code
(princomp(df))$loadings

Loadings:
             Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8
truther911    0.334  0.460  0.103  0.462  0.169  0.239  0.598       
obamabirth    0.393 -0.545  0.415  0.519 -0.198 -0.217 -0.135       
fincrisis     0.340  0.270  0.194 -0.166 -0.496  0.588 -0.383       
fluorolights  0.354               -0.134  0.650        -0.420  0.499
endtimes      0.396 -0.245 -0.855  0.102 -0.170                     
sorosplot     0.371 -0.361  0.211 -0.645                0.508       
iraqjews      0.288  0.436        -0.211 -0.355 -0.688         0.287
vaportrail    0.340  0.191                0.326 -0.242 -0.204 -0.799

               Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8
SS loadings     1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000
Proportion Var  0.125  0.125  0.125  0.125  0.125  0.125  0.125  0.125
Cumulative Var  0.125  0.250  0.375  0.500  0.625  0.750  0.875  1.000

Other Data

Let’s take a brief look at some data regarding affairs. This is a data set on infidelity from a survey conducted by Psychology Today in 1969 (taken from the AER package in R - Table F.22.2 - https://pages.stern.nyu.edu/~wgreene/Text/tables/tablelist5.htm).

VARIABLES:

  • affairs: numeric. How often engaged in extramarital sexual intercourse during the past year? 0 = none, 1 = once, 2 = twice, 3 = 3 times, 7 = 4–10 times, 12 = monthly, 12 = weekly, 12 = daily.

  • gender: factor indicating gender.

  • age: numeric variable coding age in years: 17.5 = under 20, 22 = 20–24, 27 = 25–29, 32 = 30–34, 37 = 35–39, 42 = 40–44, 47 = 45–49, 52 = 50–54, 57 = 55 or over.

  • yearsmarried: numeric variable coding number of years married: 0.125 = 3 months or less, 0.417 = 4–6 months, 0.75 = 6 months–1 year, 1.5 = 1–2 years, 4 = 3–5 years, 7 = 6–8 years, 10 = 9–11 years, 15 = 12 or more years.

  • children: factor. Are there children in the marriage?

  • religiousness: numeric variable coding religiousness: 1 = anti, 2 = not at all, 3 = slightly, 4 = somewhat, 5 = very.

  • education: numeric variable coding level of education: 9 = grade school, 12 = high school graduate, 14 = some college, 16 = college graduate, 17 = some graduate work, 18 = master’s degree, 20 = Ph.D., M.D., or other advanced degree.

  • occupation: numeric variable coding occupation according to Hollingshead classification (reverse numbering).

  • rating: numeric variable coding self rating of marriage: 1 = very unhappy, 2 = somewhat unhappy, 3 = average, 4 = happier than average, 5 = very happy.

From these variables we exclude gender and children when calculating the PCs because, as factors, they do not fit the assumptions of PCA. The following biplot is, however, color coded by gender.

Code
# Load data
data(Affairs)

# PCA
affairs.pca = prcomp(Affairs[,c(1,3:4,6:9)], center = TRUE, scale. = TRUE, rank. = 7)
summary(affairs.pca)
Importance of components:
                          PC1    PC2    PC3    PC4     PC5     PC6     PC7
Standard deviation     1.4339 1.2445 1.1301 0.8714 0.83306 0.67735 0.45370
Proportion of Variance 0.2937 0.2212 0.1825 0.1085 0.09914 0.06554 0.02941
Cumulative Proportion  0.2937 0.5150 0.6974 0.8059 0.90505 0.97059 1.00000
Code
# Scree plot (variances) of the PCs
plot(affairs.pca, col='pink', lwd = 4, main = "Variances of the Principal Components", col.main = 'black', xlab= "PC's 1 - 7")

Code
# Rename
Affairs$affairs2<- ifelse(Affairs$affairs==0, "No Affair", "Affair")

# biplot
df<- Affairs[,-c(2,5,10)]
autoplot(prcomp(df, scale.=T), data=Affairs, colour="gender", 
         loadings=T, loadings.label=T, loadings.colour="darkgray", loadings.label.colour="black")

From the biplot and the scree plot above we can see that the first PC accounts for only 29.37% of the variance in the data, and the scatter of points does not form any clear groups associated with the variables. One might be tempted to say that occupation and education form a trend in the data, but the second PC accounts for only 22.12% of the variance, which is much less than is needed to support such claims.

4 Conclusion

PCA is a very powerful tool for data analysis. It allows us to see patterns and trends hidden in large data sets that would otherwise be very difficult to make out. PCA is also very flexible in its application, which is why it sees use in so many different fields, and this flexibility has led to many adaptations of the method, each with its own variations suited to the task at hand. Large data sets are extremely common across many fields of study, and principal component analysis requires no distributional assumption and can be applied to many types of numerical data. Regarding the assumptions, we ensured that the variables were continuous or ordinal (here, Likert-scale items) and measured on the same scale and units, we checked that the variables were linearly related using a scatter plot matrix, and we removed outliers and missing or impossible values. In our analysis, the proportions of variance were 0.4256, 0.2027, 0.0908, 0.0669, 0.0631, 0.0597, 0.0521, and 0.0391, and the scree plot showed a clear drop-off after the second component. Lastly, we noted that a PCA plot can only be interpreted in either the x direction (horizontal distances) or the y direction (vertical distances), not both at once and not diagonally.

PCA can map the principal components to a 2-D plane and produce clusters we can visualize and use to analyze the data set. We can also focus on the principal components themselves and draw conclusions from them when visualization is not a valid option. PCA’s strength is that it retains as much variance in the data as possible while increasing the interpretability of the data; although it is sensitive to scaling, it remains a very useful method of analysis. PCA has also been found useful as a preprocessing step for k-means clustering, which is itself a method with countless applications. With all of the flexibility and usefulness of PCA taken into consideration, it is easy to see why it is such a popular way to analyze large data sets across so many fields. As long as there are large data sets there will be a demand to reduce their dimension and make valid interpretations about the trends present in the data, and to this end we have PCA (Richardson 2009).

5 References

Camargo, A. 2022. “PCAtest: Testing the Statistical Significance of Principal Component Analysis in R.” PeerJ. https://doi.org/10.7717/peerj.12967.
Alkhayrat, Maha, Mohamad Aljnidi, and Kadan Aljoumaa. 2020. “A Comparative Dimensionality Reduction Study in Telecom Customer Segmentation Using Deep Learning and PCA.” Journal of Big Data 7 (February): 9. https://doi.org/10.1186/s40537-020-0286-0.
Cao, Wen, Nicholas Czarnek, Juan Shan, and Lin Li. 2018. “Microaneurysm Detection Using Principal Component Analysis and Machine Learning Methods.” IEEE Transactions on NanoBioscience 17 (3): 191–98. https://doi.org/10.1109/TNB.2018.2840084.
Gorgoglione, Angela, Alberto Castro, V. Iacobellis, and Gioia Andrea. 2021. “A Comparison of Linear and Non-Linear Machine Learning Techniques (PCA and SOM) for Characterizing Urban Nutrient Runoff.” Sustainability 13 (February): 2054. https://doi.org/10.3390/su13042054.
Jolliffe, I. T., and J. Cadima. 2016. “Principal Component Analysis: A Review and Recent Developments.” Philosophical Transactions of the Royal Society A 374 (2065): 20150202. https://doi.org/10.1098/rsta.2015.0202.
Jones, Payton, Patrick Mair, and Richard McNally. 2018. “Visualizing Psychological Networks: A Tutorial in R.” Frontiers in Psychology 9 (September). https://doi.org/10.3389/fpsyg.2018.01742.
Lang, Wu, and Jin Qiu. 2021. Applied Multivariate Statistical Analysis and Related Topics with R. EDP Sciences.
Makesh Manral. 2022. “Principal Component Analysis | Dimension Reduction (1).”
Oliver, J. Eric, and Thomas J. Wood. 2013. “Replication Data for: Conspiracy Theories and the Paranoid Style(s) of Mass Opinion.” Harvard Dataverse. https://doi.org/10.7910/DVN/22976.
Richardson, M. 2009. “Principal Component Analysis.” http://aurora.troja.mff.cuni.cz/nemec/idl/09bonus/pca.pdf.