Principal Components Analysis (PCA) by 바죠


http://en.wikipedia.org/wiki/Karhunen-Lo%C3%A8ve_transform

In statistics, principal components analysis (PCA) is a technique for simplifying a dataset by reducing multidimensional data to lower dimensions for analysis.

Technically speaking, PCA is an orthogonal linear transformation that transforms the data to a new coordinate system such that the greatest variance by any projection of the data comes to lie on the first coordinate (called the first principal component), the second greatest variance on the second coordinate, and so on. PCA can be used for dimensionality reduction in a dataset while retaining those characteristics of the dataset that contribute most to its variance, by keeping lower-order principal components and ignoring higher-order ones. Such low-order components often contain the "most important" aspects of the data. But this is not necessarily the case, depending on the application.
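As a rough illustration (not part of the original post), the whole transformation can be sketched in a few lines of Python, assuming NumPy is available; the function name pca_project and the small test matrix are purely illustrative. The data are centered, the covariance matrix is eigendecomposed, the eigenvectors are sorted by decreasing eigenvalue, and the data are projected onto the leading ones.

import numpy as np

def pca_project(X, n_components):
    # X holds one observation per row and one variable per column.
    Xc = X - X.mean(axis=0)              # center each variable
    C = np.cov(Xc, rowvar=False)         # covariance matrix of the variables
    eigval, eigvec = np.linalg.eigh(C)   # eigendecomposition (ascending eigenvalues)
    order = np.argsort(eigval)[::-1]     # reorder by decreasing variance
    W = eigvec[:, order[:n_components]]  # keep the leading principal axes
    return Xc @ W                        # principal component scores

X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9],
              [1.9, 2.2], [3.1, 3.0], [2.3, 2.7]])
print(pca_project(X, 1))                 # one score per observation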

PCA is also called the (discrete) Karhunen-Loève transform (or KLT, named after Kari Karhunen and Michel Loève) or the Hotelling transform (in honor of Harold Hotelling). PCA has the distinction of being the optimal linear transformation for keeping the subspace that has largest variance. This advantage, however, comes at the price of greater computational requirement if compared, for example, to the discrete cosine transform. Unlike other linear transforms, PCA does not have a fixed set of basis vectors. Its basis vectors depend on the data set.

Organize the data set
Suppose you have data comprising a set of observations of M variables, and you want to reduce the data so that each observation can be described with only L variables, L < M. Suppose further that the data are arranged as a set of N data vectors, each representing a single grouped observation of the M variables.

Write the observations as column vectors, each of which has M rows.
Place the column vectors into a single matrix X of dimensions M × N.

the number of elements in each column vector (variables per observation) : M
the number of column vectors (observations) in the data set : N

M × M : the size of the covariance matrix, the correlation matrix, and the diagonal matrix of eigenvalues obtained from them (a short sketch follows below)
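A short sketch of this arrangement (assuming NumPy; the variable names are illustrative, and the data are those of the first worked example further below): X is stored as an M × N matrix with one observation per column, and the M × M covariance matrix is formed from the centered columns. Programs differ in whether they divide by N, by N − 1, or not at all; that choice rescales the eigenvalues but leaves the eigenvectors and the variance percentages unchanged.

import numpy as np

col = np.arange(-4.0, 5.0)                # -4, -3, ..., 4
X = np.vstack([col, col, col])            # shape (M, N): one observation per column
M, N = X.shape                            # M = 3 variables, N = 9 observations
B = X - X.mean(axis=1, keepdims=True)     # subtract each variable's (row) mean
C = B @ B.T / (N - 1)                     # M x M sample covariance matrix
print(C)                                  # every entry equals 60 / 8 = 7.5 here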

-------------------------------
cf. linear correlation: http://incredible.egloos.com/1992323
-------------------------------
principal_components.pdf
pca_lecture2.pdf
pca_lecture.pdf
http://neon.otago.ac.nz/chemlect/chem306/pca/index.html
http://www.ks.uiuc.edu/Research/Method/long_time/
-------------------------------
PCA is one of the multivariate methods of analysis and has been used widely with large multidimensional data sets. The use of PCA allows the number of variables in a multivariate data set to be reduced, whilst retaining as much as possible of the variation present in the data set. This reduction is achieved by taking p variables X1, X2, …, Xp and finding the linear combinations of these that produce principal components (PCs) PC1, PC2, …, PCp, which are uncorrelated; the coefficient vectors that define these PCs are the eigenvectors of the covariance (or correlation) matrix. The lack of correlation is a useful property, as it means that the PCs measure different "dimensions" in the data. Moreover, the PCs are ordered so that PC1 exhibits the greatest amount of the variation, PC2 the second greatest, PC3 the third greatest, and so on. That is, var(PC1) ≥ var(PC2) ≥ var(PC3) ≥ … ≥ var(PCp), where var(PCi) denotes the variance of PCi in the data set being considered; var(PCi) is also called the eigenvalue of PCi. When using PCA, it is hoped that the eigenvalues of most of the PCs will be so low as to be virtually negligible. Where this is the case, the variation in the data set can be adequately described by the few PCs whose eigenvalues are not negligible, and some degree of economy is achieved, since the variation in the original number of variables (the X variables) can be described using a smaller number of new variables (the PCs).
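As a small numerical illustration of this ordering (a sketch only, assuming NumPy; the eigenvalues are taken from the perturbed 9 × 3 example further below), the percentage of the total variation carried by each PC and the cumulative percentage can be computed directly from the eigenvalues:

import numpy as np

eigval = np.array([2.9909465, 0.0072202, 0.0018336])   # var(PC1) >= var(PC2) >= var(PC3)
pct = 100.0 * eigval / eigval.sum()                     # percentage of total variance per PC
cumpct = np.cumsum(pct)                                 # cumulative percentages
for ev, p, c in zip(eigval, pct, cumpct):
    print(f"{ev:14.7f} {p:10.4f} {c:10.4f}")            # ~99.70, 0.24, 0.06 percent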


----------------------------------
http://www.anchem.su.se/downloads/diss_pdf/k_wiberg_ths2004.pdf
Three variables, twelve measurements:


Fionn Murtagh's Multivariate Data Analysis Software and Resources page:
http://astro.u-strasbg.fr/~fmurtagh/mda-sw/

Let us work through an actual example.
 9 3    = (n, m)
-4.0 -4.0 -4.0
-3.0 -3.0 -3.0
-2.0 -2.0 -2.0
-1.0 -1.0 -1.0
 0.0  0.0  0.0
 1.0  1.0  1.0
 2.0  2.0  2.0
 3.0  3.0  3.0
 4.0  4.0  4.0
The results computed for the input above (all computations carried out in single precision):
CORRELATION MATRIX FOLLOWS.
0.10000000E+01
0.10000000E+01 0.10000000E+01
0.10000000E+01 0.10000000E+01 0.10000000E+01
EIGENVALUES FOLLOW.
Eigenvalues As Percentages Cumul. Percentages
----------- -------------- ------------------
0.30000002E+01 100.0000 100.0000
0.23841858E-06 0.0000 100.0000
0.00000000E+00 0.0000 100.0000
EIGENVECTORS FOLLOW.

VBLE. EV-1 EV-2 EV-3
------ ------ ------
1 -0.57735026E+00 -0.40824828E+00 0.70710683E+00
2 -0.57735032E+00 -0.40824831E+00 -0.70710677E+00
3 -0.57735026E+00 0.81649661E+00 0.00000000E+00
PROJECTIONS OF ROW-POINTS FOLLOW.

OBJECT PROJ-1 PROJ-2 PROJ-3
------ ------ ------
1 0.89442730E+00 -0.29802322E-07 -0.29802322E-07
2 0.67082042E+00 0.00000000E+00 -0.29802322E-07
3 0.44721365E+00 -0.14901161E-07 -0.14901161E-07
4 0.22360682E+00 -0.74505806E-08 -0.74505806E-08
5 0.00000000E+00 0.00000000E+00 0.00000000E+00
6 -0.22360682E+00 0.74505806E-08 0.74505806E-08
7 -0.44721365E+00 0.14901161E-07 0.14901161E-07
8 -0.67082042E+00 0.00000000E+00 0.29802322E-07
9 -0.89442730E+00 0.29802322E-07 0.29802322E-07
PROJECTIONS OF COLUMN-POINTS FOLLOW.

VBLE. PROJ-1 PROJ-2 PROJ-3
------ ------ ------
1 -0.99999994E+00 0.00000000E+00 0.00000000E+00
2 -0.99999994E+00 0.00000000E+00 0.00000000E+00
3 -0.99999994E+00 0.00000000E+00 0.00000000E+00
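The correlation-based results above can be checked with a few lines of NumPy (an illustrative sketch; the sign of each eigenvector and the scaling of the projections follow the program's own conventions and may differ):

import numpy as np

col = np.arange(-4.0, 5.0)
X = np.column_stack([col, col, col])       # the 9 x 3 data set above, one observation per row

R = np.corrcoef(X, rowvar=False)           # 3 x 3 correlation matrix (all entries 1 here)
eigval, eigvec = np.linalg.eigh(R)         # ascending eigenvalues
eigval, eigvec = eigval[::-1], eigvec[:, ::-1]

print(R)
print(eigval)    # approximately [3, 0, 0]: 100 % of the variance on the first PC
print(eigvec)    # first column approximately (0.577, 0.577, 0.577), up to sign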

If the matrix constructed is the covariance matrix rather than the correlation matrix, the computation gives the results below. (Note that the entries of 60 in this listing are the sums of cross-products of the centered columns, not values divided by N or N − 1; see the sketch after the listing.)

--------------
COVARIANCE MATRIX FOLLOWS.
0.60000000E+02
0.60000000E+02 0.60000000E+02
0.60000000E+02 0.60000000E+02 0.60000000E+02
EIGENVALUES FOLLOW.
Eigenvalues As Percentages Cumul. Percentages
----------- -------------- ------------------
0.18000000E+03 100.0000 100.0000
0.00000000E+00 0.0000 100.0000
0.00000000E+00 0.0000 100.0000
EIGENVECTORS FOLLOW.

VBLE. EV-1 EV-2 EV-3
------ ------ ------
1 -0.57735026E+00 -0.40824828E+00 0.70710683E+00
2 -0.57735026E+00 -0.40824831E+00 -0.70710677E+00
3 -0.57735026E+00 0.81649655E+00 0.00000000E+00
PROJECTIONS OF ROW-POINTS FOLLOW.

OBJECT PROJ-1 PROJ-2 PROJ-3
------ ------ ------
1 0.69282031E+01 0.23841858E-06 -0.23841858E-06
2 0.51961522E+01 0.00000000E+00 -0.23841858E-06
3 0.34641016E+01 0.11920929E-06 -0.11920929E-06
4 0.17320508E+01 0.59604645E-07 -0.59604645E-07
5 0.00000000E+00 0.00000000E+00 0.00000000E+00
6 -0.17320508E+01 -0.59604645E-07 0.59604645E-07
7 -0.34641016E+01 -0.11920929E-06 0.11920929E-06
8 -0.51961522E+01 0.00000000E+00 0.23841858E-06
9 -0.69282031E+01 -0.23841858E-06 0.23841858E-06
PROJECTIONS OF COLUMN-POINTS FOLLOW.

VBLE. PROJ-1 PROJ-2 PROJ-3
------ ------ ------
1 -0.77459664E+01 0.00000000E+00 0.00000000E+00
2 -0.77459664E+01 0.00000000E+00 0.00000000E+00
3 -0.77459664E+01 0.00000000E+00 0.00000000E+00
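The scaling point noted above can be checked directly (a sketch assuming NumPy): the matrix of sums of cross-products (entries of 60, as in the listing) and the sample covariance matrix (entries of 7.5) differ only by a constant factor, so their eigenvalues differ by that same factor while the eigenvectors and the variance percentages are identical.

import numpy as np

col = np.arange(-4.0, 5.0)
X = np.column_stack([col, col, col])
B = X - X.mean(axis=0)                     # centered data

S = B.T @ B                                # sums of cross-products: every entry is 60
C = S / (len(col) - 1)                     # sample covariance: every entry is 7.5
for M in (S, C):
    w = np.linalg.eigvalsh(M)[::-1]        # descending eigenvalues
    print(w)                               # approx. [180, 0, 0] for S, [22.5, 0, 0] for C
    print(100.0 * w / w.sum())             # approx. [100, 0, 0] percent in both cases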


If we perturb the input slightly, we get the following.
 9 3    = (n, m)
-4.0 -4.1 -4.2
-3.0 -3.1 -3.4
-2.0 -2.8 -2.8
-1.0 -1.7 -1.5
 0.0  0.0  0.0
 1.0  1.4  1.0
 2.0  2.1  2.0
 3.0  3.3  3.0
 4.0  4.9  4.0

For this input, the results are:
CORRELATION MATRIX FOLLOWS.

0.10000000E+01
0.99279141E+00 0.10000000E+01
0.99702907E+00 0.99659711E+00 0.10000000E+01
EIGENVALUES FOLLOW.
Eigenvalues As Percentages Cumul. Percentages
----------- -------------- ------------------
0.29909465E+01 99.6982 99.6982
0.72201653E-02 0.2407 99.9389
0.18336014E-02 0.0611 100.0000
EIGENVECTORS FOLLOW.

VBLE. EV-1 EV-2 EV-3
------ ------ ------
1 -0.57713306E+00 0.68738067E+00 0.44093683E+00
2 -0.57704931E+00 -0.72531402E+00 0.37541103E+00
3 -0.57786798E+00 0.37780199E-01 -0.81525528E+00
PROJECTIONS OF ROW-POINTS FOLLOW.

OBJECT PROJ-1 PROJ-2 PROJ-3
------ ------ ------
1 0.84291685E+00 -0.37306525E-01 -0.97039640E-02
2 0.64754790E+00 -0.26830021E-01 0.11034280E-01
3 0.51171762E+00 0.40069986E-01 0.21742210E-01
4 0.27502084E+00 0.44631910E-01 -0.23924559E-02
5 -0.14698235E-01 0.96094544E-03 -0.20736242E-01
6 -0.25005686E+00 -0.20414315E-01 -0.26857480E-02
7 -0.43980157E+00 0.15544230E-01 -0.14310300E-01
8 -0.66212767E+00 0.10550087E-01 -0.47384202E-02
9 -0.91051888E+00 -0.27206317E-01 0.21790713E-01
PROJECTIONS OF COLUMN-POINTS FOLLOW.

VBLE. PROJ-1 PROJ-2 PROJ-3
------ ------ ------
1 -0.99811417E+00 0.58408562E-01 0.18880583E-01
2 -0.99796939E+00 -0.61629649E-01 0.16074387E-01
3 -0.99938524E+00 0.32110037E-02 -0.34910426E-01


3 4
1. 4. 7. 10.
2. 5. 8. 11.
3. 6. 9. 12.
When the input above is given, the results are:

CORRELATION MATRIX FOLLOWS.

0.10000000E+01
0.99999994E+00 0.10000000E+01
0.99999994E+00 0.99999994E+00 0.10000000E+01
0.99999994E+00 0.99999994E+00 0.99999994E+00 0.10000000E+01
EIGENVALUES FOLLOW.
Eigenvalues As Percentages Cumul. Percentages
----------- -------------- ------------------
0.39999998E+01 100.0000 100.0000
0.18371391E-06 0.0000 100.0000
0.59604659E-07 0.0000 100.0000
EIGENVECTORS FOLLOW.

VBLE. EV-1 EV-2 EV-3
------ ------ ------
1 -0.49999997E+00 -0.43885323E+00 0.70710689E+00
2 -0.49999997E+00 -0.43885353E+00 -0.70710671E+00
3 -0.49999988E+00 0.77769768E+00 -0.15463391E-06
4 -0.50000012E+00 0.10000902E+00 -0.25764868E-07
PROJECTIONS OF ROW-POINTS FOLLOW.

OBJECT PROJ-1 PROJ-2 PROJ-3
------ ------ ------
1 0.14142134E+01 0.37252903E-07 0.83519094E-08 -0.11920929E-06
2 0.00000000E+00 0.00000000E+00 0.00000000E+00 0.00000000E+00
3 -0.14142134E+01 -0.37252903E-07 -0.83519094E-08 0.11920929E-06
PROJECTIONS OF COLUMN-POINTS FOLLOW.

VBLE. PROJ-1 PROJ-2 PROJ-3
------ ------ ------
1 -0.99999994E+00 0.00000000E+00 0.00000000E+00 0.00000000E+00
2 -0.99999994E+00 0.00000000E+00 0.00000000E+00 0.00000000E+00
3 -0.99999994E+00 0.00000000E+00 0.00000000E+00 0.00000000E+00
4 -0.10000000E+01 0.00000000E+00 0.00000000E+00 0.00000000E+00



The Pearson correlation coefficient, from which each off-diagonal entry of the correlation matrices above is built, can be computed directly from its definition:

import math

def average(x):
    # Arithmetic mean of a non-empty sequence.
    assert len(x) > 0
    return float(sum(x)) / len(x)

def pearson_def(x, y):
    # Pearson correlation coefficient of two equal-length sequences.
    assert len(x) == len(y)
    n = len(x)
    assert n > 0
    avg_x = average(x)
    avg_y = average(y)
    diffprod = 0.0   # sum of (x_i - mean_x) * (y_i - mean_y)
    xdiff2 = 0.0     # sum of squared deviations of x
    ydiff2 = 0.0     # sum of squared deviations of y
    for idx in range(n):
        xdiff = x[idx] - avg_x
        ydiff = y[idx] - avg_y
        diffprod += xdiff * ydiff
        xdiff2 += xdiff * xdiff
        ydiff2 += ydiff * ydiff
    return diffprod / math.sqrt(xdiff2 * ydiff2)

print(pearson_def([1, 2, 3], [1, 5, 7]))   # approximately 0.98198
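
As a small illustrative check (not part of the original program), pearson_def can be used to rebuild the 3 × 3 correlation matrix of the perturbed 9 × 3 data set above; the off-diagonal values should agree with the listing (about 0.9928, 0.9970 and 0.9966).

# Rebuild the correlation matrix of the perturbed 9 x 3 data set above.
data = [
    [-4.0, -4.1, -4.2], [-3.0, -3.1, -3.4], [-2.0, -2.8, -2.8],
    [-1.0, -1.7, -1.5], [ 0.0,  0.0,  0.0], [ 1.0,  1.4,  1.0],
    [ 2.0,  2.1,  2.0], [ 3.0,  3.3,  3.0], [ 4.0,  4.9,  4.0],
]
cols = list(zip(*data))    # transpose: three columns of nine values each
for i in range(3):
    print([round(pearson_def(cols[i], cols[j]), 6) for j in range(3)])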




Comments

  • 키키 2007/02/20 21:45

    Why would someone doing theory be interested in PCA? PCA is used when a measured quantity (an output) varies because of many unknown factors, to determine which factor contributes most to the variation of the measurement. Of course, the size of each factor is expressed only by the size of its eigenvalue; in many cases we do not know what it corresponds to physically. In any case, under the assumption that one or a few dominant factors drive most of the variation in the measurement, PCA separates out the effect of those dominant factors.
  • 바죠 2007/02/21 10:30

    키키>> You are well informed, as always. When a protein folds, the way it folds is very complicated, so many people use PCA to separate the major and minor contributions. It can also identify the most active atoms and amino acids, and the most flexible amino-acid regions. Many people have already applied this method. There are parts I do not fully understand myself: PCA is a linear transformation, but there are said to be methods that perform a similar analysis nonlinearly. Sammon mapping? I have not yet understood that one; it may not even be necessary. If you can extract the main components of a very complicated process such as a chemical reaction, it is clearly a useful tool. Often the first and second principal components account for about 60 % and 10 % of the variance, respectively; if roughly 70 % of the variance can be interpreted with just two coordinates, that is quite a helpful method of analysis.

    In any case, PCA is certainly a mathematical device with a wide range of applications; it has accumulated an enormous list of them, from audio/video and image compression to the analysis of experimental measurements. The range of applications is truly vast.