In statistics, principal components analysis (PCA) is a technique for simplifying a dataset, by reducing multidimensional datasets to lower dimensions for analysis.

Technically speaking, PCA is an orthogonal linear transformation that transforms the data to a new coordinate system such that the greatest variance by any projection of the data comes to lie on the first coordinate (called the first principal component), the second greatest variance on the second coordinate, and so on. PCA can be used for dimensionality reduction in a dataset while retaining those characteristics of the dataset that contribute most to its variance, by keeping lower-order principal components and ignoring higher-order ones. Such low-order components often contain the "most important" aspects of the data. But this is not necessarily the case, depending on the application.
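
The claim that the first coordinate carries the greatest projected variance can be checked by brute force in two dimensions. The sketch below is only an illustration under stated assumptions: the nine points on the line y = x are hypothetical data, and `projected_variance` is a helper name introduced here, not part of any PCA library.

```python
import math

# Hypothetical 2-D data: nine points lying exactly on the line y = x.
points = [(float(v), float(v)) for v in range(-4, 5)]

def projected_variance(data, angle_deg):
    """Variance of the data projected onto the unit vector at angle_deg."""
    theta = math.radians(angle_deg)
    ux, uy = math.cos(theta), math.sin(theta)
    n = len(data)
    mean_x = sum(p[0] for p in data) / n
    mean_y = sum(p[1] for p in data) / n
    scores = [(p[0] - mean_x) * ux + (p[1] - mean_y) * uy for p in data]
    return sum(s * s for s in scores) / n

# Scan directions in 1-degree steps; 0..179 covers every line through the origin.
best_angle = max(range(180), key=lambda a: projected_variance(points, a))

print(best_angle)
print(projected_variance(points, best_angle))
print(projected_variance(points, (best_angle + 90) % 180))
```

For this data the scan should pick the 45-degree direction, and the variance along the perpendicular direction should be essentially zero, which is the two-dimensional picture of a first and second principal component.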

PCA is also called the (discrete) Karhunen-Loève transform (KLT, named after Kari Karhunen and Michel Loève) or the Hotelling transform (in honor of Harold Hotelling). PCA is the optimal linear transformation for keeping the subspace that has the largest variance. This advantage, however, comes at the price of greater computational requirements compared with, for example, the discrete cosine transform. Unlike other linear transforms, PCA does not have a fixed set of basis vectors; its basis vectors depend on the data set.

Organize the data set
Suppose you have data comprising a set of observations of M variables, and you want to reduce the data so that each observation can be described with only L variables, L < M. Suppose further that the data are arranged as a set of N data vectors, each representing a single grouped observation of the M variables.

Write the data vectors as column vectors, each of which has M rows.
Place the column vectors into a single matrix X of dimensions M × N.

the number of elements in each column vector (the dimension) : M
the number of column vectors in the data set (the number of observations) : N
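
The arrangement above can be sketched in a few lines. This is a minimal illustration, not a definitive implementation: the four observations are hypothetical, subtracting each variable's mean is the usual PCA preprocessing step and is assumed here, and dividing by N - 1 (sample covariance) is likewise an assumption.

```python
# Hypothetical observations: N = 4 data vectors, each holding M = 3 variables.
observations = [
    [1.0, 2.0, 3.0],
    [2.0, 4.0, 1.0],
    [3.0, 6.0, 2.0],
    [4.0, 8.0, 6.0],
]
N = len(observations)
M = len(observations[0])

# X is M x N: column j of X is the j-th observation.
X = [[observations[j][i] for j in range(N)] for i in range(M)]

# Subtract each variable's mean (the mean of each row of X).
row_means = [sum(row) / N for row in X]
B = [[value - row_means[i] for value in row] for i, row in enumerate(X)]

# M x M sample covariance matrix C = B B^T / (N - 1).
C = [[sum(B[i][k] * B[j][k] for k in range(N)) / (N - 1) for j in range(M)]
     for i in range(M)]

for row in C:
    print(row)
```

The result is symmetric and M × M regardless of how many observations N there are, which is why the eigenvalue problem in PCA has size M.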

M × M matrices: the covariance matrix, the correlation matrix, and the diagonal matrix (of eigenvalues)
principal_components.pdf
pca_lecture2.pdf
pca_lecture.pdf
PCA is a multivariate method of analysis that has been used widely with large multidimensional data sets. It allows the number of variables in a multivariate data set to be reduced while retaining as much as possible of the variation present in the data set. This reduction is achieved by taking p variables X1, X2, …, Xp and finding linear combinations of them that produce principal components (PCs) PC1, PC2, …, PCp, which are uncorrelated. The coefficient vectors that define these PCs are the eigenvectors of the covariance (or correlation) matrix of the X variables, which is why the PCs are also termed eigenvectors. The lack of correlation is a useful property, as it means that the PCs measure different "dimensions" in the data. The PCs are ordered so that PC1 exhibits the greatest amount of the variation, PC2 the second greatest, PC3 the third greatest, and so on. That is, var(PC1) ≥ var(PC2) ≥ var(PC3) ≥ … ≥ var(PCp), where var(PCi) is the variance of PCi in the data set being considered. var(PCi) is also called the eigenvalue of PCi. When using PCA, it is hoped that the eigenvalues of most of the PCs will be so low as to be virtually negligible. Where this is the case, the variation in the data set can be adequately described by the few PCs whose eigenvalues are not negligible. Accordingly, some degree of economy is achieved, as the variation in the original X variables can be described using a smaller number of new variables (PCs).
----------------------------------
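
How "negligible" eigenvalues translate into a choice of how many PCs to keep can be sketched as follows. The eigenvalues are copied from the perturbed 9 × 3 correlation example further down this page; the function names and the 99% threshold are hypothetical choices made for illustration.

```python
# Eigenvalues copied from the perturbed 9 x 3 correlation example on this page.
eigenvalues = [2.9909465, 0.0072201653, 0.0018336014]

def explained_variance(values):
    """Return (percentages, cumulative percentages) for sorted eigenvalues."""
    total = sum(values)
    percentages = [100.0 * v / total for v in values]
    cumulative = []
    running = 0.0
    for p in percentages:
        running += p
        cumulative.append(running)
    return percentages, cumulative

def components_needed(values, threshold_percent):
    """Smallest number of leading PCs whose cumulative share reaches the threshold."""
    _, cumulative = explained_variance(values)
    for count, share in enumerate(cumulative, start=1):
        if share >= threshold_percent:
            return count
    return len(values)

percentages, cumulative = explained_variance(eigenvalues)
print([round(p, 4) for p in percentages])
print(components_needed(eigenvalues, 99.0))
```

The rounded percentages should agree with the "As Percentages" column of that example, and with a 99% threshold a single PC is enough for that data.
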
3 variables, 12 measurements: # Fionn Murtagh's Multivariate Data Analysis Software and Resources Page

It seems best to work through a concrete example.
9 3 = (n,m)
-4. -4. -4.0
-3. -3.0 -3.0
-2. -2. -2.
-1. -1. -1.0
0. 0. 0.0
1. 1. 1.
2. 2.0 2.0
3. 3. 3.
4. 4. 4.0
Result computed for the input above (all calculations performed in single precision):
CORRELATION MATRIX FOLLOWS.
0.10000000E+01
0.10000000E+01 0.10000000E+01
0.10000000E+01 0.10000000E+01 0.10000000E+01
EIGENVALUES FOLLOW.
Eigenvalues As Percentages Cumul. Percentages
----------- -------------- ------------------
0.30000002E+01 100.0000 100.0000
0.23841858E-06 0.0000 100.0000
0.00000000E+00 0.0000 100.0000
0EIGENVECTORS FOLLOW.

VBLE. EV-1 EV-2 EV-3
------ ------ ------
1 -0.57735026E+00 -0.40824828E+00 0.70710683E+00
2 -0.57735032E+00 -0.40824831E+00 -0.70710677E+00
3 -0.57735026E+00 0.81649661E+00 0.00000000E+00
0PROJECTIONS OF ROW-POINTS FOLLOW.

OBJECT PROJ-1 PROJ-2 PROJ-3
------ ------ ------
1 0.89442730E+00 -0.29802322E-07 -0.29802322E-07
2 0.67082042E+00 0.00000000E+00 -0.29802322E-07
3 0.44721365E+00 -0.14901161E-07 -0.14901161E-07
4 0.22360682E+00 -0.74505806E-08 -0.74505806E-08
5 0.00000000E+00 0.00000000E+00 0.00000000E+00
6 -0.22360682E+00 0.74505806E-08 0.74505806E-08
7 -0.44721365E+00 0.14901161E-07 0.14901161E-07
8 -0.67082042E+00 0.00000000E+00 0.29802322E-07
9 -0.89442730E+00 0.29802322E-07 0.29802322E-07
0PROJECTIONS OF COLUMN-POINTS FOLLOW.

VBLE. PROJ-1 PROJ-2 PROJ-3
------ ------ ------
1 -0.99999994E+00 0.00000000E+00 0.00000000E+00
2 -0.99999994E+00 0.00000000E+00 0.00000000E+00
3 -0.99999994E+00 0.00000000E+00 0.00000000E+00
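
The leading eigenvalue and eigenvector in the output above can be reproduced with a short power iteration. This is a sketch of one standard technique, not necessarily the method the program itself uses; `power_iteration`, its starting vector, and the iteration count are assumptions made here.

```python
import math

def power_iteration(matrix, iterations=100):
    """Estimate the dominant eigenvalue and unit eigenvector of a symmetric matrix."""
    size = len(matrix)
    # Hypothetical starting vector; any vector not orthogonal to the answer works.
    vector = [1.0] + [0.0] * (size - 1)
    for _ in range(iterations):
        product = [sum(matrix[i][j] * vector[j] for j in range(size))
                   for i in range(size)]
        norm = math.sqrt(sum(p * p for p in product))
        vector = [p / norm for p in product]
    # Rayleigh quotient v^T A v gives the eigenvalue for a unit vector v.
    av = [sum(matrix[i][j] * vector[j] for j in range(size)) for i in range(size)]
    eigenvalue = sum(vector[i] * av[i] for i in range(size))
    return eigenvalue, vector

# Correlation matrix of three perfectly correlated variables (all entries 1).
correlation = [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]
eigenvalue, eigenvector = power_iteration(correlation)
print(eigenvalue, eigenvector)
```

The eigenvalue should come out as 3 and every eigenvector component as about 0.577 in magnitude. The sign may differ from the program's EV-1 column because an eigenvector is only defined up to sign.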

When the matrix constructed is the covariance matrix rather than the correlation matrix, the computation gives the following.

--------------
COVARIANCE MATRIX FOLLOWS.
0.60000000E+02
0.60000000E+02 0.60000000E+02
0.60000000E+02 0.60000000E+02 0.60000000E+02
EIGENVALUES FOLLOW.
Eigenvalues As Percentages Cumul. Percentages
----------- -------------- ------------------
0.18000000E+03 100.0000 100.0000
0.00000000E+00 0.0000 100.0000
0.00000000E+00 0.0000 100.0000
0EIGENVECTORS FOLLOW.

VBLE. EV-1 EV-2 EV-3
------ ------ ------
1 -0.57735026E+00 -0.40824828E+00 0.70710683E+00
2 -0.57735026E+00 -0.40824831E+00 -0.70710677E+00
3 -0.57735026E+00 0.81649655E+00 0.00000000E+00
0PROJECTIONS OF ROW-POINTS FOLLOW.

OBJECT PROJ-1 PROJ-2 PROJ-3
------ ------ ------
1 0.69282031E+01 0.23841858E-06 -0.23841858E-06
2 0.51961522E+01 0.00000000E+00 -0.23841858E-06
3 0.34641016E+01 0.11920929E-06 -0.11920929E-06
4 0.17320508E+01 0.59604645E-07 -0.59604645E-07
5 0.00000000E+00 0.00000000E+00 0.00000000E+00
6 -0.17320508E+01 -0.59604645E-07 0.59604645E-07
7 -0.34641016E+01 -0.11920929E-06 0.11920929E-06
8 -0.51961522E+01 0.00000000E+00 0.23841858E-06
9 -0.69282031E+01 -0.23841858E-06 0.23841858E-06
0PROJECTIONS OF COLUMN-POINTS FOLLOW.

VBLE. PROJ-1 PROJ-2 PROJ-3
------ ------ ------
1 -0.77459664E+01 0.00000000E+00 0.00000000E+00
2 -0.77459664E+01 0.00000000E+00 0.00000000E+00
3 -0.77459664E+01 0.00000000E+00 0.00000000E+00
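
The numbers in the covariance run can be checked by hand. The sketch below rebuilds the matrix and the row projections for the first 9 × 3 input. One assumption is worth stating loudly: the printed value 60 equals the plain sum of squared deviations, so the code does not divide by n or n - 1. That is an inference from the output, not something the page states about the program.

```python
import math

# The first 9 x 3 input: each row is one observation of three variables.
data = [[float(v)] * 3 for v in range(-4, 5)]
n, m = len(data), len(data[0])

# Center each column on its mean.
means = [sum(row[j] for row in data) / n for j in range(m)]
centered = [[row[j] - means[j] for j in range(m)] for row in data]

# Sums of squares and cross-products of the centered columns
# (assumption: no division by n, to match the printed 0.60000000E+02).
sscp = [[sum(r[i] * r[j] for r in centered) for j in range(m)] for i in range(m)]

# First eigenvector as printed in the output (the sign is arbitrary).
ev1 = [-1.0 / math.sqrt(3.0)] * 3

# Row projections: dot product of each centered observation with the eigenvector.
projections = [sum(r[j] * ev1[j] for j in range(m)) for r in centered]

print(sscp[0])
print(projections[0], projections[-1])
```

The first and last projections should match the PROJ-1 column for objects 1 and 9 (about ±6.928), which suggests the row projections in this mode are simply the centered data multiplied by the eigenvectors.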

Changing the input slightly gives the following.
9 3 =(n,m)
-4. -4.1 -4.2
-3. -3.1 -3.4
-2. -2.8 -2.8
-1. -1.7 -1.5
0. 0. 0.0
1. 1.4 1.
2. 2.1 2.0
3. 3.3 3.
4. 4.9 4.0

For the input above:
CORRELATION MATRIX FOLLOWS.

0.10000000E+01
0.99279141E+00 0.10000000E+01
0.99702907E+00 0.99659711E+00 0.10000000E+01
EIGENVALUES FOLLOW.
Eigenvalues As Percentages Cumul. Percentages
----------- -------------- ------------------
0.29909465E+01 99.6982 99.6982
0.72201653E-02 0.2407 99.9389
0.18336014E-02 0.0611 100.0000
0EIGENVECTORS FOLLOW.

VBLE. EV-1 EV-2 EV-3
------ ------ ------
1 -0.57713306E+00 0.68738067E+00 0.44093683E+00
2 -0.57704931E+00 -0.72531402E+00 0.37541103E+00
3 -0.57786798E+00 0.37780199E-01 -0.81525528E+00
0PROJECTIONS OF ROW-POINTS FOLLOW.

OBJECT PROJ-1 PROJ-2 PROJ-3
------ ------ ------
1 0.84291685E+00 -0.37306525E-01 -0.97039640E-02
2 0.64754790E+00 -0.26830021E-01 0.11034280E-01
3 0.51171762E+00 0.40069986E-01 0.21742210E-01
4 0.27502084E+00 0.44631910E-01 -0.23924559E-02
5 -0.14698235E-01 0.96094544E-03 -0.20736242E-01
6 -0.25005686E+00 -0.20414315E-01 -0.26857480E-02
7 -0.43980157E+00 0.15544230E-01 -0.14310300E-01
8 -0.66212767E+00 0.10550087E-01 -0.47384202E-02
9 -0.91051888E+00 -0.27206317E-01 0.21790713E-01
0PROJECTIONS OF COLUMN-POINTS FOLLOW.

VBLE. PROJ-1 PROJ-2 PROJ-3
------ ------ ------
1 -0.99811417E+00 0.58408562E-01 0.18880583E-01
2 -0.99796939E+00 -0.61629649E-01 0.16074387E-01
3 -0.99938524E+00 0.32110037E-02 -0.34910426E-01
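
The correlation matrix printed for the perturbed input can be recomputed directly from the data. This sketch applies the same Pearson formula as the `pearson_def` function at the bottom of this page, in a more compact form; the helper name `pearson` is introduced here for illustration, and the code runs in double precision while the program output is single precision, so only the leading digits are expected to agree.

```python
import math

# The perturbed 9 x 3 input from above, one row per observation.
data = [
    [-4.0, -4.1, -4.2],
    [-3.0, -3.1, -3.4],
    [-2.0, -2.8, -2.8],
    [-1.0, -1.7, -1.5],
    [0.0, 0.0, 0.0],
    [1.0, 1.4, 1.0],
    [2.0, 2.1, 2.0],
    [3.0, 3.3, 3.0],
    [4.0, 4.9, 4.0],
]

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Transpose so that each entry of columns is one variable.
columns = [list(col) for col in zip(*data)]
corr = [[pearson(a, b) for b in columns] for a in columns]

for row in corr:
    print([round(v, 6) for v in row])
```

The off-diagonal entries should agree with the lower triangle printed under CORRELATION MATRIX FOLLOWS to about five or six decimal places.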

3 4
1. 4. 7. 10.
2. 5. 8. 11.
3. 6. 9. 12.
Given the input above:

CORRELATION MATRIX FOLLOWS.

0.10000000E+01
0.99999994E+00 0.10000000E+01
0.99999994E+00 0.99999994E+00 0.10000000E+01
0.99999994E+00 0.99999994E+00 0.99999994E+00 0.10000000E+01
EIGENVALUES FOLLOW.
Eigenvalues As Percentages Cumul. Percentages
----------- -------------- ------------------
0.39999998E+01 100.0000 100.0000
0.18371391E-06 0.0000 100.0000
0.59604659E-07 0.0000 100.0000
0EIGENVECTORS FOLLOW.

VBLE. EV-1 EV-2 EV-3
------ ------ ------
1 -0.49999997E+00 -0.43885323E+00 0.70710689E+00
2 -0.49999997E+00 -0.43885353E+00 -0.70710671E+00
3 -0.49999988E+00 0.77769768E+00 -0.15463391E-06
4 -0.50000012E+00 0.10000902E+00 -0.25764868E-07
0PROJECTIONS OF ROW-POINTS FOLLOW.

OBJECT PROJ-1 PROJ-2 PROJ-3
------ ------ ------
1 0.14142134E+01 0.37252903E-07 0.83519094E-08 -0.11920929E-06
2 0.00000000E+00 0.00000000E+00 0.00000000E+00 0.00000000E+00
3 -0.14142134E+01 -0.37252903E-07 -0.83519094E-08 0.11920929E-06
0PROJECTIONS OF COLUMN-POINTS FOLLOW.

VBLE. PROJ-1 PROJ-2 PROJ-3
------ ------ ------
1 -0.99999994E+00 0.00000000E+00 0.00000000E+00 0.00000000E+00
2 -0.99999994E+00 0.00000000E+00 0.00000000E+00 0.00000000E+00
3 -0.99999994E+00 0.00000000E+00 0.00000000E+00 0.00000000E+00
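4 -0.10000000E+01 0.00000000E+00 0.00000000E+00 0.00000000E+00

import math

def average(x):
    """Arithmetic mean of a non-empty sequence."""
    assert len(x) > 0
    return float(sum(x)) / len(x)

def pearson_def(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    assert len(x) == len(y)
    n = len(x)
    assert n > 0
    avg_x = average(x)
    avg_y = average(y)
    diffprod = 0  # sum of products of deviations
    xdiff2 = 0    # sum of squared deviations of x
    ydiff2 = 0    # sum of squared deviations of y
    for idx in range(n):
        xdiff = x[idx] - avg_x
        ydiff = y[idx] - avg_y
        diffprod += xdiff * ydiff
        xdiff2 += xdiff * xdiff
        ydiff2 += ydiff * ydiff

    # Raises ZeroDivisionError if either sequence is constant.
    return diffprod / math.sqrt(xdiff2 * ydiff2)

print(pearson_def([1, 2, 3], [1, 5, 7]))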

