PCA Slides

Brandy Carr

Lee

Tavish

What is it?

  • Principal component analysis (PCA) is an unsupervised machine learning method, often used as a first step in exploratory data analysis
  • Main goal: reduce the number of dimensions/features while retaining most of the original variability
  • This is achieved using a vector space transform

Why use PCA?


  • Makes it easier to graphically visualize hidden patterns in the data
  • Useful when a regression model has a large number of predictor variables or highly correlated predictors
  • To observe hidden trends, jumps, clusters, patterns, and outliers
  • As a descriptive method, PCA makes no assumptions about the distribution of the data
  • Widely adaptable as an exploratory method across disciplines and data types

Where has PCA been applied?


  • PCA is a broadly used statistical method whose use stretches across many scientific disciplines
  • Image analysis, analysis of web data, cybersecurity analysis, mental disorders, microaneurysm detection, facial recognition, etc.
  • Many different adaptations of PCA have been developed based on the variation in goals and data types associated with each respective discipline

How does it work?


  • By mathematical projection we can interpret the original data set with just a few variables, namely the principal components
  • The process of reduction preserves the maximum variability in the data (i.e. the statistical information present in the data)
  • “…find new variables that are linear functions of those in the original data set, that successively maximize variance and that are uncorrelated with each other” (Jolliffe and Cadima 2016)

Methods


  • PCA is an unsupervised machine learning tool used to find hidden patterns in multivariate datasets.
  • It is often used as the first step when performing other multivariate methods such as:
    • multiple regression
    • cluster analysis
    • discriminant analysis
  • The main goal is to project the original correlated variables onto a new, smaller set of uncorrelated variables, the principal components (dimension reduction)
  • PCA is most beneficial when the first two PCs combined explain over 80% of the variance
  • Most often only the first two PCs are of interest because they are the easiest to represent visually, with the 1st PC as the x-axis, the 2nd PC as the y-axis, and the eigenvectors (loadings) shown as arrows originating from the center of the plot
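As a quick illustration of the variance-explained rule of thumb above, here is a minimal scikit-learn sketch on synthetic data (the dataset, seed, and threshold check are illustrative, not from the slides):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic data: 200 observations of 5 correlated variables
latent = rng.normal(size=(200, 2))                    # 2-D hidden structure
X = latent @ rng.normal(size=(2, 5)) + 0.1 * rng.normal(size=(200, 5))

Xz = StandardScaler().fit_transform(X)                # put variables on one scale
pca = PCA().fit(Xz)

print(pca.explained_variance_ratio_)                  # proportion of variance per PC
print(pca.explained_variance_ratio_[:2].sum())        # PC1 + PC2: aim for > 0.80
```

Because the synthetic data are driven by a two-dimensional latent signal, the first two PCs capture well over 80% of the variance, so a 2-D biplot would lose little information.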

Assumptions & Process


  1. Variables should be continuous, at the interval or ratio level of measurement (ordinal variables, such as Likert-scale items, are also often used)
  2. Variables should all be on the same scale/units; if not, they should be standardized
  3. Variables should be linearly related
  4. Outliers and missing or impossible values should be removed

%%{init: {'theme': 'base', 'themeVariables': {'mainBkg': '#FAFAF5', 'background': '#FAFAF5', 'primaryColor': '#EAEAD6', 'nodeBorder': '#8B814C', 'lineColor': '#8B814C', 'primaryTextColor': '#191970', 'textColor': '#191970', 'fontSize': '12px', 'width': '100%'}}}%%

flowchart LR
A[(Clean<br/>Data)] --> B((Are all<br/>vars the same<br/>scale/unit?))
B --> C((Yes))
B --> D((No))
D -.->|Standardize<br/>Data| E(Estimate<br/>Sample<br/>Mean<br/>Vector)
C --> E
D --> F(Estimate<br/>Sample<br/>Mean<br/>Vector)
E --> G(Estimate<br/>Sample<br/>Covariance<br/>Matrix)
F --> H(Estimate<br/>Sample<br/>Correlation<br/>Matrix)
G --> I(Eigenvalues<br/>Eigenvectors)
H --> I

Sample data

  • COLUMNS: Variables
  • ROWS: Observations

Figure 1: Sample Data

Equations



\[{\sf Standardize\ Data:}\ \ \ \ \ \ \ \ {z_{ij}}\ =\ \frac{{x_{ij}} - {\overline{x_j}}} {\sqrt{\hat{\sigma_{jj}}}}, \ \ \ \ \ where\ \ \ \ i\ =\ \{1,2,...,n\}, \ \ \ \ j\ =\ \{1,2,...,p\} \qquad(1)\]

\[{\sf Random\ Vector:}\ \ \ \ \ \ \ \ {x_i} = {({x_{i1}}, ... , {x_{ip}})}^T,\ \ \ \ \ \ \ \ \ where\ \ i\ =\ \{1,2,...,n\} \qquad(2)\]

\[{\sf Sample\ Mean\ Vector:}\ \ \ \ \ \ \ \ \hat{\mu}\ \ =\ \ \overline{x}\ \ =\ \ \frac{1}{n} \sum_{i=1}^{n} x_i \qquad(3)\]

\[{\sf Sample\ Covariance\ Matrix:}\ \ \ \ \ \ \ \ \hat{\Sigma}\ \ =\ \ S\ \ =\ \ {(\hat{\sigma}_{ij})}_{p \times p}\ \ =\ \ \frac{1}{n-1} \sum_{i=1}^{n} {(x_i - \overline{x})(x_i - \overline{x})}^T \qquad(4)\]

\[{\sf Eigenvalues\ \&\ Eigenvectors\ of\ }S{\sf :}\ \ \ \ \ \ \ \ \hat{\lambda}_{1}\ \geq\ \hat{\lambda}_{2}\ \geq\ ...\ \geq\ \hat{\lambda}_{p},\ \ \ \ \ \ \ \ {\hat{a}_{k}} \qquad(5)\]

\[{\sf Principal\ Component\ Scores:}\ \ \ \ \ \ \ \ {\hat{y}_{ik}}\ =\ {\hat{a}^T_{k}}(x_i\ -\ \overline{x}),\ \ \ \ \ \ \ i\ =\ \{1,2,...,n\},\ \ \ \ \ \ \ k\ =\ \{1,2,...,p\} \qquad(6)\]
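A quick numerical check of the equations above, on synthetic data: projecting the centered observations onto the eigenvectors of the sample covariance matrix yields scores that are uncorrelated, with variances equal to the eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(2)
# Synthetic data: 500 observations of a 3-variable random vector
X = rng.multivariate_normal([0.0, 0.0, 0.0],
                            [[4.0, 2.0, 0.0],
                             [2.0, 3.0, 1.0],
                             [0.0, 1.0, 2.0]], size=500)

xbar = X.mean(axis=0)              # Eq. (3): sample mean vector
S = np.cov(X, rowvar=False)        # Eq. (4): sample covariance matrix
evals, A = np.linalg.eigh(S)       # Eq. (5): eigenvectors a_k are the columns of A
Y = (X - xbar) @ A                 # Eq. (6): scores y_ik = a_k^T (x_i - xbar)

# The scores are uncorrelated, with variances equal to the eigenvalues
print(np.allclose(np.cov(Y, rowvar=False), np.diag(evals)))   # True
```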

Data

Replication Data & Code

(Oliver and Wood 2013)

  • 2,000 respondents answering a 5-point Likert scale on their belief in various conspiracies
  • Grouped by political ideology:
    • Very Liberal
    • Liberal
    • Somewhat Liberal
    • Middle of the Road
    • Somewhat Conservative
    • Conservative
    • Very Conservative
  • Sampled from a nationally representative survey in 2011

All variables were measured on a 5-point Likert scale: \[Strongly\ Disagree\ (1)\ <\ Disagree\ (2)\ <\ Neutral\ (3)\ <\ Agree\ (4)\ <\ Strongly\ Agree\ (5)\]

  1. truther911: Certain U.S. government officials planned the attacks of September 11, 2001, to incite war
  2. obamabirth: President Barack Obama was not really born in the US and does not have an authentic Hawaiian birth certificate
  3. fincrisis: The current financial crisis was secretly orchestrated by a small group of Wall Street bankers to extend the power of the Federal Reserve and further their control of the world’s economy
  4. flourolights: The U.S. government is mandating the switch to compact fluorescent light bulbs because such lights make people more obedient and easier to control
  5. endtimes: We are currently living in End Times as foretold by Biblical prophecy
  6. sorosplot: Billionaire George Soros is behind a hidden plot to destabilize the American government, take control of the media, and put the world under his control
  7. iraqjews: The U.S. invasion of Iraq was driven by oil companies and Jews in the U.S. and Israel
  8. vaportrail: Vapor trails left by aircraft are actually chemical agents deliberately sprayed in a clandestine program directed by government officials

Check Assumptions

Figure 2: Distributions

Figure 3: Correlation Matrices

Screeplot

PCA Biplot with Political Ideology Clusters

Figure 4: Loadings

Figure 5: Biplot

Good Examples

Figure 6: Good Example 1 (Martíni et al. 2021)

Figure 7: Good Example 2 (Grabska, Beć, and Huck 2021)

Bad Examples

Figure 8: Bad Example 1 (Juan Pablo 2021)

Figure 9: Bad Example 2

Conclusion

  • PCA is a very powerful tool for data analysis. It allows us to see patterns and trends hidden in large sets of data that would otherwise be very difficult to make out.
  • In our data example, the scree plot showed proportions of variance of 0.4256, 0.2027, 0.0908, 0.0669, 0.0631, 0.0597, 0.0521, and 0.0391; the proportion explained drops below 10% from the third component onward.
  • We focused on the assumptions: we ensured the variables are continuous (interval or ratio level of measurement) and on the same scale and units.
  • We also checked that the variables are linearly related, using scatterplots, and we removed outliers and noted and removed missing (null) values.
  • PCA is also useful as a preprocessing step for k-means clustering, itself a method with countless applications. Given this flexibility and usefulness, it is easy to see why PCA is such a popular way to analyze large data sets across so many fields.
  • As long as there are large data sets, there will be a demand to reduce their dimension and make valid interpretations about the trends they contain; to this end we have PCA.
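The PCA-then-k-means workflow mentioned above can be sketched as follows, on synthetic data with two well-separated groups (the group structure and seeds are illustrative assumptions):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(4)
# Two well-separated synthetic groups in 6 dimensions
X = np.vstack([rng.normal(0.0, 1.0, (100, 6)),
               rng.normal(5.0, 1.0, (100, 6))])

# Reduce to the first two PCs, then cluster in the reduced space
scores = PCA(n_components=2).fit_transform(StandardScaler().fit_transform(X))
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(scores)

print(np.bincount(labels))   # cluster sizes; the two groups separate cleanly
```

Clustering on the PC scores rather than the raw variables reduces noise and makes the cluster structure easy to plot on the PC1/PC2 axes.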

End

Grabska, Justyna, Krzysztof B Beć, and Christian W Huck. 2021. “Novel Near-Infrared and Raman Spectroscopic Technologies for Print and Photography Identification, Classification, and Authentication.” NIR News 32 (1-2): 11–16. https://doi.org/10.1177/09603360211003757.
Jolliffe, Ian T., and Jorge Cadima. 2016. “Principal Component Analysis: A Review and Recent Developments.” Philosophical Transactions of the Royal Society A 374 (2065). https://doi.org/10.1098/rsta.2015.0202.
Juan Pablo. 2021. “Principal Component Analysis (PCA) from Scratch.”
Martíni, Aline Fachin, Gustavo Pereira Valani, Laura Fernanda Simões da Silva, Denizart Bolonhezi, Simone Di Prima, and Miguel Cooper. 2021. “Long-Term Trial of Tillage Systems for Sugarcane: Effect on Topsoil Hydrophysical Attributes.” Sustainability 13 (6). https://doi.org/10.3390/su13063448.
Oliver, J. Eric, and Thomas J. Wood. 2013. “Replication Data for: Conspiracy Theories and the Paranoid Style(s) of Mass Opinion.” Harvard Dataverse. https://doi.org/10.7910/DVN/22976.