R [programming language] by 바죠

R (programming language)
https://en.wikipedia.org/wiki/R_(programming_language)
https://en.wikipedia.org/wiki/Programming_with_Big_Data_in_R
http://www.r-graph-gallery.com/all-graphs/

R is a programming language widely used by statisticians and data scientists. It is object-oriented and interpreted.
It was created by Robert Gentleman and Ross Ihaka and is now developed by the R Core Team.
In practice it can replace SPSS (Statistical Package for the Social Sciences) and MATLAB,
which matters once you consider the purchase price and ongoing maintenance cost of expensive commercial software.
The language is optimized for statistical software development, data analysis, and data processing, and it also covers graphics.
Modern statistical packages implemented by many developers are available, and R is widely used in machine learning and financial management.
Proficiency in R can be an advantage in a wide range of fields.
It is especially useful in social-science work such as studies of social and corporate activity and opinion polling, the traditional domain of the SPSS package mentioned above.
Companies such as Google, Facebook, and The New York Times use R.
If you want to fit survey or experimental data to a particular theoretical model by regression, R is an excellent choice.
Statistical packages covering a very wide range of theoretical backgrounds are well developed and well tested.
Users can therefore try several models in sequence and compare them on the spot, in a variety of ways. Working from the examples, you can build a model suited to your own data.
It is also very easy to construct your own models and compare them.
Most packages are written in R itself.
http://healthstat.snu.ac.kr/CRAN/
Where computation speed is a serious issue, some parts are implemented in C, C++, or Fortran.
To take advantage of the many statistical methods available, and for convenience, making active use of packages is essential.

A great deal of learning material for R is available, and with it R is easy to pick up.
YouTube video tutorials are also very well developed.

R cookbook                                     
R graphics cookbook                        
Hands-On Programming with R         
R in a Nutshell
The R book
http://www.statmethods.net/r-tutorial/index.html
http://rgraphgallery.blogspot.kr/search?updated-max=2013-05-22T08:54:00-07:00&max-results=1
http://forensics.umass.edu/classes/cs691BL/Data_Structures.pdf

Features of R
language for statistical computing and data science
100% free, open source
visualization
~5000 R packages

Downloading/installing the software
download R for Windows              free, open source; Linux, Mac, Windows
download RStudio for Windows     free, open source; Linux, Mac, Windows
Download R first, then RStudio. RStudio, built for the user's convenience, is practically indispensable.
It makes file input/output convenient, as well as browsing existing data.

www.r-project.org > CRAN
LINUX
MAC
WINDOWS
www.rstudio.com > download > desktop > OS

How to use it: launch RStudio.
Launch RStudio --> write a script --> run R commands

Script  (keep your set of commands in script form.)
Save with the
.R
extension

Open file
New file

Write commands in the script. Highlight a particular section with the mouse and press Run (or Enter); that part is executed immediately in the R console.

Use the TAB key for auto-completion.
Both names and commands can be auto-completed.

FILE > Save

-----------------------------------------------------------------------------------------------------------------
Working directory (each project needs its own working directory; with RStudio you can create, change, and use working directories without typing any commands.)

getwd()
setwd("/User...../")
setwd("~/Desktop/project1")
projectWD <-  "/....."
setwd(projectWD)

Alternatively:
Session > Set working directory >.....

getwd()
z =  summary(data1)
To save everything you have done so far so you can continue later:
save.image("firstproject1.Rdata")

Alternatively:
Session > Save workspace

rm(list=ls())
Alternatively:
Session >  Clear workspace

q()
--
To resume work:
ls()
setwd("/User....")
getwd()

load("firstproject1.Rdata")
Alternatively:
load(file.choose())
Yet another way:
Session > Load workspace

--------------------------------------------------------------------------------------------------------------------
Working directory

getwd()
setwd("abcd")

z=summary(data1)
z
save.image("firstProject.Rdata")
Session
Save Workspace As
Set Working Directory
Clear workspace
Load workspace

rm(list=ls())
q()

ls()
setwd("~/Desktop/project1")
getwd()
load("firstproject.Rdata")
load(file.choose())

-------------------------------------------------------------------------------------------------------------------
Customizing RStudio

Preferences  (Mac)

Tools > Options  (Windows)

------------------------------------------------------------------------------------------------------------------
/usr/bin/R
/usr/bin/Rscript


#!/usr/bin/Rscript --slave
Rscript test.R

BMI <- data.frame( gender=c("Male", "Male", "Female"),
                            height=c(152, 171.5, 165),
                            weight=c(81, 93, 78),
                            Age=c(42, 38, 26)
)
BMI

------------------------------------------------------------------------------------------------------------------
Variables, vectors, matrices

Assigning a number to a variable is the basic operation. As in other languages you can use =, and you can use <- instead of =.
<- or =
-> also works, in fact.
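A minimal sketch of the three assignment forms (all equivalent here):

```r
x <- 10      # the idiomatic assignment operator
y = 10       # = also works at the top level
10 -> z      # rightward assignment is valid too
identical(x, y) && identical(y, z)
```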

x<- 1:5
y<- 6:10
plot(x,y)
To check which variables are currently in use:
ls()
ls.str()
print(x)
Simply typing x instead gives the same result: it prints the variable's values.

x=1
print(x)

y <- 7
y <- 9
Inspect the current workspace:
ls()
rm(y)
Use rm() to delete a variable.
x.1 <- 14
xx <- "marin"

11+14
7*9
y^2
x^2+y^2
sqrt(x)
y^(1/2)
log(y)
exp(x)
log2(y)   :  base 2
abs(-14)

Console
Variables

height <- 2
height

width <- 4

Workspace
ls()

height * width
area <- height * width
ls()

typeof(TRUE)
typeof(13)
typeof(13L)
typeof("abcd")
is.character("ABCD")
is.double(13)
is.double(13L)
is.logical(F)
is.vector(x)
is.list(x)

There is a feature for easily reusing earlier commands, and of course you can edit them on the spot.
arrow keys
arrow up
arrow down

Everything after a # character is a comment.
A leading + sign can indicate an incomplete command; it is a continuation marker.

File >  New file > Rscript
Save as
File > Project

Using the c() function: concatenation, combination

x <- c(10.4, 5.6, 3.1, 6.4, 21.7)
assign("x", c(10.4, 5.6, 3.1, 6.4, 21.7))
 c(10.4, 5.6, 3.1, 6.4, 21.7) -> x

x1 <- c(1, 3, 5, 7, 9)
gender <- c("male","female")

2:7

seq(1,10)
seq(from=1, to=7, by=1)
seq(from=1, to=7, by=1/3)
seq(from=1, to=7, by=0.25)

rep(1, times=10)
rep(1:3, times=5)
rep(seq(from=2, to=5, by=0.2), times=5)
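A quick check of what seq() and rep() produce:

```r
s <- seq(from = 1, to = 7, by = 2)   # 1 3 5 7
r <- rep(1:3, times = 2)             # 1 2 3 1 2 3
length(s)                            # 4
length(r)                            # 6
```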

x <- 1:5
y <- c(1, 3, 5, 7, 9)

y[1:3]
y[3]

matrix()
matrix(1:6, nrow=2)
1 3 5
2 4 6
matrix(1:6, ncol =3)
1 3 5
2 4 6
matrix(1:6, nrow =2, byrow = TRUE)
1  2 3
4 5  6

Recycling
matrix(1:3, nrow =2, ncol =3)
1 3 2
2 1 3
rbind()
cbind()
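The column-wise fill order and the rbind()/cbind() calls above can be sketched as:

```r
m <- matrix(1:6, nrow = 2)    # filled column by column: 1 3 5 / 2 4 6
m[2, 3]                       # 6
m2 <- rbind(m, c(7, 8, 9))    # append a row    -> 3 x 3
m3 <- cbind(m, c(7, 8))       # append a column -> 2 x 4
```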

-----------------------------------------------------------------------------------------------------------------
Reading existing experimental data into R

.csv   (comma-separated values; can be created in Excel)
https://en.wikipedia.org/wiki/Comma-separated_values
.txt     (tab-delimited text; can be created in Excel)

help(read.csv)
?read.csv

data1 <- read.csv(file.choose(), header=TRUE)
data1
Check the data (print it out to verify)

data2 <- read.table(file.choose(), header=TRUE, sep=",")
data2

data3 <- read.delim(file.choose(), header=T)
data3
data4 <- read.table(file.choose(), header=T, sep="\t")
data4

help(read.table)
?read.table


In RStudio:
Import  Dataset > Data from text file

dim(data1)
head(data1)
tail(data1)

data1[c(5,6,7,8,9),]
names(data1)
# Manually assign the header names
names(data1) <- c("Column1","Column2","Column3")

mean(data1$Age)
data1$Age
attach(data1)
mean(Age)
Age
detach(data1)

class(Age)
summary(data1)
length(Age)
data1[11:14, ]
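data1 comes from a file chosen interactively, so it is not available here; the same inspection steps can be sketched with the built-in mtcars data set as a stand-in:

```r
d <- mtcars      # stand-in for data1
dim(d)           # number of rows and columns
head(d, 3)       # first three rows
names(d)         # column names
mean(d$mpg)      # mean of a single column, accessed with $
```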

------------------------------------------------------------------------------------------------------------------
Packages (make active use of the many useful packages others have written.)

~5000 packages related to modern statistics
help(install.packages)
http://healthstat.snu.ac.kr/CRAN/

Using the menu.
Using commands.
install.packages("epiR")
install.packages() with no arguments
shows the list of available packages.

library(epiR)
You must load a package with library() every time you want to use it.

www.r-project.org
packages
available
list of packages

help(package= epiR)

remove.packages("epiR")

Tools > Install Packages
CRAN

-------------------------------------------------------------------------------------------------------------------
Barcharts, Piecharts

help(barplot)
?barplot


table(Gender)
count <- table(Gender)
count
table(Gender)/725
percent <- table(Gender)/725
barplot(count)
barplot(percent)
barplot(percent, main="Title", xlab="Gender", ylab="%")
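Gender belongs to a data set not loaded in this document; a self-contained sketch of the same counting and plotting steps (the sample() call just fabricates 725 labels for illustration):

```r
Gender  <- sample(c("female", "male"), 725, replace = TRUE)  # made-up labels
count   <- table(Gender)
percent <- table(Gender) / 725
barplot(percent, main = "Title", xlab = "Gender", ylab = "%")
```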

657 colors
x <-  c(10,12,13, 18, 20)
barplot(x, col="slategray3")
barplot(x, col=colors() [602])
?rgb
?col2rgb
?palette


n <- 5
x <- c(rep(10,n))
barplot(x, col=rainbow(n))
n <- 8
x <- c(rep(10,n))
barplot(x, col=heat.colors(n))
                      cm.colors(n)
library(RColorBrewer)
display.brewer.all()

barplot(x, col=rgb(.54, 0., .0))
barplot(x, col=rgb(159, 182, 205, max=255))
barplot(x, col=c("red", "blue", "green", "yellow"))

pie(count)
pie(count, main="title")
box()

library(ggplot2)
bunnies <- 1:10
carrots <- c(1, 2, 4, 6, 8, 12, 16, 20, 32, 36)
df <- data.frame(bunnies, carrots)
ggplot(data = df, aes(x = bunnies, y = carrots)) + geom_point()

--------------------------------------------------------------------------------------------------------------------
Boxplot

help(boxplot)
?boxplot
boxplot(data1)
quantile(data1, probs=c(0, 0.25, 0.5, 0.75, 1))
boxplot(data1, main="boxplot", ylab="LC", ylim=c(0,16), las=1)
boxplot(data1 ~ Gender)
boxplot(data1 ~ Gender, main="BX by gender")
boxplot(data1[Gender == "female"], data1[Gender == "male"])

--------------------------------------------------------------------------------------------------------------------
Sorting
attach(mtcars)

# sort by mpg
newdata <- mtcars[order(mpg),]

# sort by mpg and cyl
newdata <- mtcars[order(mpg, cyl),]

#sort by mpg (ascending) and cyl (descending)
newdata <- mtcars[order(mpg, -cyl),]

detach(mtcars)

--------------------------------------------------------------------------------------------------------------------
Stratified boxplot


--------------------------------------------------------------------------------------------------------------------
Plot

plot()
generic

plot(countries$continent)
plot(countries$population)
plot(countries$area, countries$population)
plot(log(countries$area), log(countries$population))
plot(countries$religion, countries$continent)

plot(mercury$temperature, mercury$pressure, xlab="T", ylab="P", main="T vs P", type="o", col="orange", cex.axis=1.5, lty=5, pch=4)

type="l"

par()
?par

par(col="blue")
plot(mercury$temperature,mercury$pressure)

col.main="darkgray"

label size
cex.axis = 1.5

line type
lty = 5

plot symbol
pch = 4

barplot()
boxplot()
pairs()

--------------------------------------------------------------------------------------------------------------------
Data frame (choosing how data is stored)

create data frame
csv
sql
excel, spss

data.frame()
name <- c("A", "B", "C", "D")
age <- c(1, 2, 3, 4)
child <- c(FALSE, FALSE, TRUE, TRUE)
df <- data.frame(name, age, child)
df
str(df)

add column
add row

height <-  c(170, 180, 165, 188, 199)
cbind(people,height)
tom<- data.frame(name="Tom", age=37, child=FALSE, height =183)
rbind(people, tom)

Sorting data by size:
sort(people$age)
ranks <- order(people$age)
people[order(people$age, decreasing=TRUE), ]
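The people data frame is not defined in runnable form above; a minimal stand-in shows the difference between sort() and order():

```r
people <- data.frame(name = c("A", "B", "C"), age = c(30, 10, 20))
sort(people$age)                                # 10 20 30 (values only)
ord <- order(people$age)                        # 2 3 1 (row indices)
people[order(people$age, decreasing = TRUE), ]  # whole rows, oldest first
```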

Data frame

df <- data.frame(name = c("Elvis", "Lou",  "Joni" ), yob= c(1954, 1942, 1932))
df

2017-df$yob
mean(2017-df$yob)

x <- seq(0, 9)
y <- seq(100, 109)
z <- seq(50, 80, length.out = 10)
data.frame(x,y,z)

--------------------------------------------------------------------------------------------------------------------
Correlation and Covariance in R

help(cor.test)
?cor.test


plot(Age, LungCap, main="scatterplot", las=1)
cor(Age, LungCap, method="pearson")
cor(Age, LungCap)
cor(LungCap, Age)
cor(Age, LungCap, method="spearman")
cor(Age, LungCap, method="kendall")
cor.test(Age, LungCap, method="pearson")
cor.test(Age, LungCap, method="spearman")
cor.test(Age, LungCap, method="spearman", exact=F)
cor.test(Age, LungCap, method="pearson", alt="greater", conf.level=0.99)
cov(Age, LungCap)
pairs(LungCapData)
pairs(LungCapData[,1:3])
cor(LungCapData[, 1:3])

mean(x)
median(x)
sd(x)
var(x)
cor(x,y)
cov(x,y)

mean(x, na.rm=TRUE)
sd(x, na.rm=TRUE)
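The effect of na.rm (and of the use= argument to cor()) on missing values:

```r
x <- c(1, 2, 3, 4, NA)
y <- c(2, 4, 6, 8, 10)
mean(x)                          # NA: missing values propagate
mean(x, na.rm = TRUE)            # 2.5
cor(x, y, use = "complete.obs")  # drops the pair containing the NA
```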

--------------------------------------------------------------------------------------------------------------------
Linear Regression

plot(Age, LungCap, main="Ti")
cor(Age, LungCap)

help(lm)
?lm

                 yvariable ~ xvariable
mod <- lm(LungCap ~ Age)
summary(mod)
attributes(mod)
mod$coefficients
mod$coef
coef(mod)
plot(Age, LungCap, main="Scatt")
abline(mod)
abline(mod, col=2, lwd=3)
confint(mod, level=0.99)
anova(mod)

plot(Age, LungCap)
mod <- lm(LungCap ~ Age)
summary(mod)
abline(mod)
Yhat  = Muhat  = b0 + b1 * x
error = Yobserved - Yhat
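LungCapData is not loaded here, so the fitted-value/residual identity above can be checked on simulated data instead:

```r
set.seed(1)
x <- 1:50
y <- 3 + 2 * x + rnorm(50)      # true intercept 3, slope 2, plus noise
mod  <- lm(y ~ x)
yhat <- fitted(mod)             # Yhat = b0 + b1 * x
err  <- resid(mod)              # error = Yobserved - Yhat
max(abs(err - (y - yhat)))      # essentially zero
```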

Checking linear regression assumptions in R
Four things to test:
plot(mod)
1  Residuals vs Fitted values
2  Normal Q-Q plot, quantiles-quantiles plot
3  Scale - Location 
4  Residuals vs Leverage

Plotting all four at once:
par(mfrow=c(2,2))
plot(mod)

par(mfrow=c(1,1))

predicting y from x
plot(x,y)
mod2 <- lm(y ~ x)
abline(mod2)
plot(mod2)

For nonlinear data:
plot(xx,yy)
mod3 <- lm(yy ~ xx)
abline(mod3)
plot(mod3)

Yhat = b0  + b1 X1 + b2 X2  + b3 X3 + ..... + bk Xk

-------------------------------------------------------------------------------------------------------------------
Multiple linear regression

help(lm) or the help search window

model1 <- lm(LungCap ~ Age + Height)
summary(model1)
cor(Age, Height, method="pearson")
confint(model1, level=0.95)

model2 <- lm(LungCap ~ Age + Height + Smoke + Gender + Caesarean)
summary(model2)
plot(model2)
check the four diagnostic plots

-------------------------------------------------------------------------------------------------------------------
Polynomial Regression

linearity and model  assumption

model1 <- lm(LungCap ~ Height)
summary(model1)
abline(model1, lwd=3, col="red")
                                                 Linear regression assumption

model2 <- lm(LungCap ~ Height + I(Height^2))
summary(model2)

HeightSquare <- Height^2
model2again <- lm(LungCap ~ Height + HeightSquare)
summary(model2again)

model2againagain <- lm(LungCap ~ poly(Height, degree=2, raw=T))
summary(model2againagain)
                                                                                       raw=F  : orthogonal polynomials
lines(smooth.spline(Height, predict(model2)), col="blue", lwd=3)
anova(model1,model2)

model3 <-  lm(LungCap ~ Height + I(Height^2) + I(Height^3))
summary(model3)
lines(smooth.spline(Height, predict(model3)), col="green", lwd=3, lty=3)
anova(model2,model3)

You can test the two models against each other and confirm whether the difference in fit is significant.
https://en.wikipedia.org/wiki/F-test
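Since Height/LungCap are not available here, the nested-model F-test can be sketched on simulated data with a genuine quadratic component:

```r
set.seed(7)
x  <- seq(1, 10, length.out = 80)
y  <- 1 + 0.5 * x + 0.3 * x^2 + rnorm(80)
m1 <- lm(y ~ x)
m2 <- lm(y ~ x + I(x^2))
anova(m1, m2)    # small Pr(>F): the quadratic term improves the fit
```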

-------------------------------------------------------------------------------------------------------------------
Chi-square test, Fisher's exact test, and cross tabulations in R
https://en.wikipedia.org/wiki/Chi-squared_test

help(chisq.test)
?chisq.test


table(Gender, Smoke)
TAB =table(Gender, Smoke)
TAB
barplot(TAB, beside=T, legend=T)
chisq.test(TAB, correct=T)
CHI=chisq.test(TAB, correct=T)
attributes(CHI)
CHI$expected
fisher.test(TAB, conf.int=T, conf.level=0.99)
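Gender and Smoke belong to a data set not loaded here; a self-contained sketch with a hand-built 2x2 table:

```r
TAB <- matrix(c(30, 10, 20, 40), nrow = 2,
              dimnames = list(Gender = c("female", "male"),
                              Smoke  = c("no", "yes")))
CHI <- chisq.test(TAB, correct = TRUE)
CHI$expected              # expected counts under independence
CHI$p.value               # small: reject independence
fisher.test(TAB)$p.value  # the exact test agrees
```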

-------------------------------------------------------------------------------------------------------------------
Converting a numeric variable to a categorical variable in R

?cut

CatHeight <- cut(Height, breaks=c(0,50,55,60,65,70,100), labels=c("A", "B", "C", "D", "E", "F"), right=FALSE)
CatHeight[1:10]
levels(CatHeight)

CatHeight <- cut(Height, breaks=4, right=FALSE)
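Height above comes from a data set not loaded here; the same binning works on made-up heights:

```r
Height <- c(48, 52, 57, 63, 68, 72)
CatHeight <- cut(Height, breaks = c(0, 50, 55, 60, 65, 70, 100),
                 labels = c("A", "B", "C", "D", "E", "F"), right = FALSE)
CatHeight          # A B C D E F: one value per interval
levels(CatHeight)
```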

-------------------------------------------------------------------------------------------------------------------
Function

Definition:
twosam <- function(y1, y2) {
n1 <- length(y1); n2 <- length(y2)
yb1 <- mean(y1); yb2 <- mean(y2)
s1 <- var(y1); s2 <- var(y2)
s <- ((n1-1)*s1 + (n2-1)*s2)/(n1+n2-2)
tst <- (yb1 - yb2)/sqrt(s*(1/n1 + 1/n2))
tst
}
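The pooled-variance t statistic computed by twosam() should match t.test() with var.equal=TRUE; a quick check (the function is repeated so the block is self-contained):

```r
twosam <- function(y1, y2) {
  n1 <- length(y1); n2 <- length(y2)
  yb1 <- mean(y1); yb2 <- mean(y2)
  s1 <- var(y1); s2 <- var(y2)
  s <- ((n1 - 1) * s1 + (n2 - 1) * s2) / (n1 + n2 - 2)
  (yb1 - yb2) / sqrt(s * (1 / n1 + 1 / n2))
}
set.seed(42)
a <- rnorm(20, mean = 5)
b <- rnorm(25, mean = 6)
tst <- twosam(a, b)
tt  <- t.test(a, b, var.equal = TRUE)
all.equal(unname(tt$statistic), tst)
```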

Function

add <-  function(x, y=1){x+y}
formals(add)
body(add)
environment(add)

f <- function(cond){
if (cond) {}
}

mean2 <- mean

HousePrice <- read.table("houses.data")
HousePrice <- read.table("houses.data", header=TRUE)

pairs(xmatrix)
coplot(avector ~ bvector | cvector)

Rscript
rectangle.R

install.packages(c("ggplot2", "gcookbook"))
library(ggplot2)
library(gcookbook)

data <- read.csv("datafile.csv")
data <- read.csv("datafile.csv", header=FALSE)
names(data) <- c("Column1","Column2","Column3")
data <- read.csv("datafile.csv", sep="\t")

plot(pressure$temperature, pressure$pressure, type="l")
points(pressure$temperature, pressure$pressure)
lines(pressure$temperature, pressure$pressure/2, col="red")
points(pressure$temperature, pressure$pressure/2, col="red")

mysummary <- function(x, npar=TRUE, print=TRUE) {
  if (!npar) {
    center <- mean(x); spread <- sd(x)
  } else {
    center <- median(x); spread <- mad(x)
  }
  if (print & !npar) {
    cat("Mean=", center, "\n", "SD=", spread, "\n")
  } else if (print & npar) {
    cat("Median=", center, "\n", "MAD=", spread, "\n")
  }
  result <- list(center=center,spread=spread)
  return(result)
}

# invoking the function
set.seed(1234)
x <- rpois(500, 4)
y <- mysummary(x)

--------------------------------------------------------------------------------------------------------------------
PCA (Principal Component Analysis)

?stats
?prcomp

dim(data)
library(caret)
nzv <- nearZeroVar(data, saveMetrics=TRUE)

--------------------------------------------------------------------------------------------------------------------
Basic data types

class
class(TRUE)
class(NA)

NA: not available

numeric
2L
2
class(2)
class(2L)
is.numeric(2)
is.numeric(2L)
is.integer(2)
is.integer(2L)

character
"I love data"

double
complex
raw

Coercion

as.numeric(TRUE)
as.numeric(FALSE)
as.character(4)
as.numeric("4.5")
as.numeric("Hello")

--------------------------------------------------------------------------------------------------------------------
Vectors

same basic type

c()
c("c", "d")

drawn_suits <- c("c", "d")

is.vector(drawn_suits)

remain <- c(11, 12, 11, 13)
remain
suits <- c("c", "d", "e", "f")
names(remain) <-  suits
remain

remain <- c()
str(remain)

is.vector()
length()
Only elements of the same type

Coercion for vectors

list

--------------------------------------------------------------------------------------------------------------------
Vector calculus

element-wise fashion

^, *, +, -

sum()
and
>

-------------------------------------------------------------------------------------------------------------------
Subsetting vectors, Subsetting matrices

subset by index
subset all but some

m[1,3]
m[3,]
m[,3]
m[4]
Fortran-style (column-major) linear indexing
m[2,c(2,3)]

m[c(1,2),c(2,3)]

matrix(10:21, ncol=3)
array(10:21, dim=c(2,3,2))

First store the data in one-dimensional form.
obj <- 1:12
Then tell R that this matrix has 3 rows and 4 columns, as below, using dim().
dim(obj) <- c(3,4)
obj

attributes(obj)
str(obj)

obj <- matrix(1:9, nrow=3)
cbind(obj, matrix(c(10:12)))
rbind(obj, matrix(c(10:12), nrow=1))

rowSums(obj)
colSums(obj)
colMeans(obj)
diag(obj)

colSums(), rowSums()

--------------------------------------------------------------------------------------------------------------------
Categorical variables

factor
str()
sorting
levels
labels

--------------------------------------------------------------------------------------------------------------------
Lists

different R objects
No coercion
Loss of some functionality
is.list()
str()
List in list

--------------------------------------------------------------------------------------------------------------------
Data frame

Rows
Columns
import from data source
csv
sql
spss, excel
data.frame()

--------------------------------------------------------------------------------------------------------------------
Graphics in R

publication quality
ggplot2, ggvis, lattice
plot()
hist()
barplot()
boxplot()
pairs()
ggplot(pressure, aes(x=temperature, y=pressure)) + geom_line() + geom_point()

par()
?par

multiple plot
layout()

?barplot
help(barplot)
table(Gender)
count <- table(Gender)
count
table(Gender)/725
percent <- table(Gender)/725
barplot(count)
barplot(percent)
barplot(percent, main="title", xlab="Gender", ylab="%")

pie(count)
pie(count, main="Title")
box()

help(boxplot)
?boxplot


boxplot(data1)
quantile(data1, probs=c(0, 0.25, 0.5, 0.75, 1.))
boxplot(data1 ~ gender, main="boxplot")

plot(Age, Height, main="Scatter")
plot(Age, Height, main="Scatter", cex=0.5, cex.main=2, cex.lab=1.5, cex.axis=0.7)

font.main=3
font.lab=2
font.axis=3
col=5
col=5, col.main=4, col.lab=2, col.axis=3

pch=2
pch="w"
abline(lm(Height~Age), col=4, lty=2, lwd=6)

xlab="\u5c"

xlab="\u212b"

xlab="Å"

--------------------------------------------------------------------------------------------------------------------
Graphics in R (examples)
http://www.sr.bham.ac.uk/~ajrs/R/r-gallery.html
http://www.sr.bham.ac.uk/~ajrs/R/gallery/plot_blackbody_curves.txt
#--Define blackbody function:
blackbody <- function(lambda, Temp=1e3) {
  h <- 6.626068e-34 ; c <- 3e8; kb <- 1.3806503e-23                                    # constants
  lambda <- lambda * 1e-6                                                                      # Convert from microns to metres
  ( 2*pi*h*c^2 ) / ( lambda^5*( exp( (h*c)/(lambda*kb*Temp) ) - 1 ) )
}
#--Define title and axis labels:
main <- "Planck blackbody curves"
xlab <- expression(paste(Wavelength, " (", mu, "m)"))
ylab <- expression(paste(Intensity, " ", (W/m^3)))
#--Line styles and colours:
## (see colorbrewer2.org)
## require(RColorBrewer)
## col <- brewer.pal(3, "Dark2")
col <- c("#1B9E77", "#D95F02", "#7570B3")
lty <- 1:3                                                                                                # Line type
lwd <- 2.5                                                                                              # Line width
#--Set up plot device:
pdf("blackbody.pdf", height=5.5, width=7, pointsize=14)
#--Plot curves:
curve(blackbody(lambda=x), from=1, to=15, main=main, xlab=xlab, ylab=ylab, col=col[1], lwd=lwd)
curve(blackbody(lambda=x, T=900), add=TRUE, col=col[2], lty=lty[2], lwd=lwd)
curve(blackbody(lambda=x, T=800), add=TRUE, col=col[3], lty=lty[3], lwd=lwd)
#--Add legend:
legtext <- paste(c(1000, 900, 800), "K", sep="")
legend("topright", inset=0.05, legend=legtext, lty=lty, col=col, text.col=col, lwd=lwd, bty="n")


--------------------------------------------------------------------------------------------------------------------
Scripts
.R 
File > New > R script
Open File >  abc.R
Select > Run
command & enter
Run
Edit
Find and Replace
Code
Comment/Uncomment lines
Tab
Run
File > Save as
Save
..
How to install packages in R
~5,000 packages
help(install.packages)
install.packages("epiR")
install.packages()
click > OK
library(epiR)
R-project.org
help(package= epiR)
remove.packages("epiR")
Tools> Install Packages
CRAN
epiR
Default
x<-1:5
y<-6:10
plot(x,y)
ls()
.csv
.txt
help(read.csv)
?read.csv
data1 <- read.csv(file.choose(), header=TRUE)
data2 <- read.table(file.choose(), header=T, sep=",")
data3 <- read.delim(file.choose(), header=T)
data4 <- read.table(file.choose(), header=T, sep="\t")
help(read.table)
?read.table
data1 <- read.table(file="aaa", header=TRUE, sep="\t")
"  "
","
data2 <- read.table(file.choose(),header=TRUE, sep="\t")
import dataset
from text file
rm(data1)
rm(data2)
dim(data1)
head(data1)
tail(data1)
data1[5:9,]
names(data1)
import local data
R script
R markdown
http://www.ggobi.org/
MarinStatsLectures (Youtube) M. Marin 2013

---------------------------------------------------------------------------------------------------------------------
require(MASS)
options(prompt="R> ")

source("~/.Rprofile")

help(options)
---------------------------------------------------------------------------------------------------------------------
con <- file("analysisReport.out", "w")
cat(data, file=con)
cat(results, file=con)
cat(conclusion, file=con)
close(con)

---------------------------------------------------------------------------------------------------------------------
w <- c(5,4,7,2,7,1)
sort(w)
sort(w,decreasing=TRUE)
length(w)

v <- c(11, 12, 13, 15, 14)
order(v)
v[order(v)]

---------------------------------------------------------------------------------------------------------------------
pch      A vector of plotting characters.
cex      A vector of character expansion sizes.
bg        A vector of background colors for plot symbols.
type      A character vector specifying the types of plots to generate. Use
type="p" for points,
type="l" for lines,
type="b" for both,
type="c" for the lines part alone of "b",
type="o" for both overplotted points and lines,
type="h" for histogram-like (or high-density) vertical lines,
type="s" for stair steps,
type="S" for other steps, or
type="n" for no plotting.

---------------------------------------------------------------------------------------------------------------------
t(A)
Matrix transposition of A

solve(A)
Matrix inverse of A

A %*% B
Matrix multiplication of A and B

diag(n)
An n-by-n diagonal (identity) matrix
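The four operations above, on a small example:

```r
A <- matrix(c(2, 1, 0, 4), nrow = 2)  # a 2x2 matrix
t(A)                                  # transpose
Ainv <- solve(A)                      # inverse
A %*% Ainv                            # identity matrix (up to rounding)
diag(2)                               # 2x2 identity for comparison
```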

--------------------------------------------------------------------------------------------------------------------
paste("Everybody", "loves", "stats.")
[1] "Everybody loves stats."

paste("Everybody", "loves", "stats.", sep="-")
[1] "Everybody-loves-stats."

paste("Everybody", "loves", "stats.", sep="")
[1] "Everybodylovesstats."

path <- "/home/mike/data/trials.csv"
strsplit(path, "/")
[[1]]
[1] "" "home" "mike" "data" "trials.csv"

> choose(5,3) # How many ways can we select 3 items from 5 items?
[1] 10
> choose(50,3) # How many ways can we select 3 items from 50 items?
[1] 19600
> choose(50,30) # How many ways can we select 30 items from 50 items?
[1] 4.712921e+13

> set.seed(165) # Initialize the random number generator to a known state
> runif(10)

---------------------------------------------------------------------------------------------------------------------
plot(samp)
m <- mean(samp)
abline(h=m)
stdevs <-  m+c(-2, -1, 0, 1, 2)*sd(samp)
abline(h=stdevs, lty="dotted")

--------------------------------------------------------------------------------------------------------------------
Random Forest

CTG data, header
21 features + 1 response = 22 variables
str(data1)
data1$NSP <- as.factor(data1$NSP)
table(data1$NSP)
data1$NSP
1  2  3

sample(1:3)
sample(1:3, size=3, replace=FALSE)

set.seed(123)
ind  <- sample(2, nrow(data1), replace=TRUE, prob=c(0.7, 0.3))
ind
train  <-  data1[ind==1, ]
test  <-  data1[ind==2, ]
rows with ind==1 form the training set
rows with ind==2 form the test set
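The CTG data file is not included here; the same 70/30 split can be sketched with the built-in iris data:

```r
set.seed(123)
ind   <- sample(2, nrow(iris), replace = TRUE, prob = c(0.7, 0.3))
train <- iris[ind == 1, ]
test  <- iris[ind == 2, ]
nrow(train) + nrow(test)   # 150: every row lands in exactly one set
```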

Two parameters for randomForest (avoid overfitting)
ntree=500, mtry

library(randomForest)
set.seed(222)
rf <-  randomForest(NSP ~ . , data=train)
rf <-  randomForest(NSP ~ . , data=train,  ntree=300, mtry=8, importance=TRUE, proximity=TRUE)
print(rf)
21 predictor variables vs NSP
Because NSP is a factor variable, a classification forest is built.
                 mtry   = 4   ~ sqrt(21) = sqrt(the number of features)
                 ntree  = 500 ~ default
OOB estimate of error
Confusion matrix

attributes(rf)
rf$confusion

Prediction with train data
library(caret)
p1 <-  predict(rf,  train)
head(p1)                                           : predicted classes
head(train$NSP)                                : actual classes
confusionMatrix(p1,  train$NSP)

Reference vs Prediction
Accuracy
95% CI
sensitivity
specificity
pos pred value

Out-of-bag (OOB) error

Prediction with test data
p2 <- predict(rf,  test)
confusionMatrix(p2,  test$NSP)
Reference vs Prediction
Accuracy
sensitivity
specificity
pos pred value

Error rate
plot(rf)
Error     vs    number of  trees

Tune random forest model
Tune mtry
column 22 : NSP (the response)
-22 : exclude the response column from the inputs
t <- tuneRF(train[,-22], train[,22],  stepFactor=0.5,  plot=TRUE,  ntreeTry=300,  trace=TRUE,  improve=0.05)
stepFactor : 1 ---> 0.5
OOB error    vs     mtry

ntree=300, mtry=8, importance=TRUE, proximity=TRUE
Adjust these parameters, then run again.
rf <-  randomForest(NSP ~ . , data=train,  ntree=300, mtry=8, importance=TRUE, proximity=TRUE)

number of nodes for the trees
hist(treesize(rf), main="No. of Nodes for the Trees", col="green")
varImpPlot(rf, sort=T, n.var=10, main="title")
importance(rf)
varUsed(rf)
21 variables
partialPlot(rf, train, ASTV, "1")
partialPlot(rf, train, ASTV, "3")
partialPlot(rf, train, ASTV, "2")

getTree(rf, 1, labelVar=TRUE)
MDSplot(rf, train$NSP)

install.packages('randomForest')
library(randomForest)

1. sepal length in cm
2. sepal width in cm
3. petal length in cm
4. petal width in cm
5. class:
-- Iris Setosa
-- Iris Versicolour
-- Iris Virginica

Sepal.Length,   Sepal.Width,      Petal.Length,      Petal.Width,         Species

dim(iris)
150     5
View(iris)
s_d=sample(150,100)
?sample

train_data = iris[s_d, ]
train_data
dim(train_data)
100   5
test_data = iris[-s_d, ]
rf_model = randomForest(Species ~ ., train_data, ntree=500 )
rf_model
?predict
p_model = predict(rf_model, test_data)
p_model
table(test_data[,5], p_model)
mean(test_data[,5]  ==  p_model)

--------------------------------------------------------------------------------------------------------------------
Support Vector Machines

library(e1071)
mymodel <- svm(Species ~ ., data=iris)
summary(mymodel)
plot(mymodel, data=iris, Petal.Width ~ Petal.Length, slice=list(Sepal.Width=3, Sepal.Length=4))
pred <- predict(mymodel, iris)
tab <- table(Predicted = pred, Actual = iris$Species)
tab
1.-sum(diag(tab))/sum(tab)
misclassification rate

mymodel <- svm(Species ~ ., data=iris, kernel="linear")
mymodel <- svm(Species ~ ., data=iris, kernel="polynomial")
mymodel <- svm(Species ~ ., data=iris, kernel="sigmoid")

set.seed(123)
tmodel <- tune(svm, Species ~ . , data=iris, ranges=list(epsilon =seq(0, 1, 0.1), cost = 2^(2:9) ))
plot(tmodel)
summary(tmodel)

SVM
library(e1071)
plot(iris)
plot(iris$Sepal.Length, iris$Sepal.Width, col=iris$Species)
plot(iris$Petal.Length,  iris$Petal.Width,  col=iris$Species)
s <- sample(150,100)
col <- c( "Petal.Length",  "Petal.Width",  "Species")
iris_train <- iris[s,col]
iris_test <- iris[-s,col]

svmfit <- svm(Species ~. , data=iris_train,  kernel="linear", cost= 100, scale=FALSE)
print(svmfit)
plot(svmfit, iris_train[,col]) 
tuned <- tune(svm, Species~.,  data=iris_train, kernel="linear",  ranges=list(cos=c(0.001,0.01, 0.1, 1, 10, 100)))
summary(tuned)

p<-predict(svmfit, iris_test[,col], type="class")
plot(p)
table(p, iris_test[,3])
mean(p== iris_test[,3])

--------------------------------------------------------------------------------------------------------------------
LASSO
https://en.wikipedia.org/wiki/Lasso_(statistics)
https://en.wikipedia.org/wiki/Tikhonov_regularization

install.packages("glmnet")
library(glmnet)

CV <- cv.glmnet(x=trainX, y=trainY, family="binomial",  type.measure="class",  alpha=1,  nlambda= 100)
plot(CV)
fit <-  glmnet(x=trainX, y=trainY, family="binomial",  alpha=1, lambda=CV$lambda.1se) 
fit$beta[,1]

--------------------------------------------------------------------------------------------------------------------
The glmnet package
Sparse Linear Models

gbm
glmnet
caret

library(caret)
library(pROC)

titanicDF <- read.csv(
http://math....
, sep='\t')

titanicDF <-  titanicDF[c('PClass', 'Age', 'Sex', 'Title', 'Survived')]

prop.table(table(titanicDF$Survived))
names(getModelInfo())

Regression vs Classification

objControl <- trainControl(method='cv',  number=3, returnResamp='none',
summaryFunction=twoClassSummary, classProbs=TRUE )

objModel <- train(trainDF[, predictorNames], as.factor(trainDF[, outcomeName]),
method='gbm', trControl=objControl, metric="ROC", preProc=c("center", "scale"))
summary(objModel)

--------------------------------------------------------------------------------------------------------------------
https://stat.ethz.ch/R-manual/R-devel/library/base/html/sample.html
https://cran.r-project.org/web/packages/sampling/sampling.pdf

x <- 1:12
# a random permutation
sample(x)
# bootstrap resampling -- only if length(x) > 1 !
sample(x, replace = TRUE)
#                                100 Bernoulli trials
sample(c(0,1), 100, replace = TRUE)
## More careful bootstrapping --  Consider this when using sample()
## programmatically (i.e., in your function or simulation)!
# sample()'s surprise -- example
x <- 1:10
    sample(x[x >  8]) # length 2
    sample(x[x >  9]) # oops -- length 10!
    sample(x[x > 10]) # length 0
## safer version:
resample <- function(x, ...) x[sample.int(length(x), ...)]
resample(x[x >  8]) # length 2
resample(x[x >  9]) # length 1
resample(x[x > 10]) # length 0
## R 3.x.y only
sample.int(1e10, 12, replace = TRUE)
sample.int(1e10, 12) # not that there is much chance of duplicates

--------------------------------------------------------------------------------------------------------------------
Decision tree       vs     Random forest

library(rpart)
library(rpart.plot)
library(randomForest)

s <- sample(150,100)      #  100  : training
iris_train <- iris[s,]
iris_test  <- iris[-s,]

dtm <-  rpart(Species ~ ., data=iris_train, method="class")
summary(dtm)

p <- predict(dtm, iris_test, type="class")
table(iris_test[,5],p)
mean(p ==  iris_test[,5])

rfm<- randomForest(Species ~ ., iris_train)
rfm
p <-  predict(rfm,iris_test)
table(iris_test[,5],p)
mean(iris_test[,5] == p)
importance(rfm)
varImpPlot(rfm)

--------------------------------------------------------------------------------------------------------------------
Decision tree in R for classification
decision trees

iris
Sepal.Length       Sepal.Width     Petal.Length          Petal.Width        Species

dim(iris)
150   5

s <- sample(150,50)

iris_train <-  iris[s,]
iris_test  <-  iris[-s,]
dim(iris_train)
50    5
dim(iris_test)
100   5

library(rpart)
dtm <-  rpart(Species ~.,  iris_train, method="class")
dtm
plot(dtm)
text(dtm)
library(rpart.plot)
rpart.plot(dtm, type=4,  extra= 101 )
?rpart.plot

p <-  predict(dtm, iris_test, type="class")
table(iris_test[,5], p)

Decision tree classification

rpart, evtree, tree

library(rpart)
library(rpart.plot)
library(tree)
library(evtree)

s <-  sample(150,100)
iris_train <-  iris[s,]
iris_test  <-  iris[-s, ]

dtm <-  rpart(Species ~ ., data=iris_train,  method="class")
rpart.plot(dtm, type=4,  extra=101)
p <-  predict(dtm, iris_test,  type="class")
table(iris_test[,5],p)
mean(p== iris_test[,5])

dtm <- tree(Species ~., data=iris_train)
plot(dtm); text(dtm)
p <- predict(dtm, iris_test, type= "class")
table(iris_test[,5], p)
mean(p == iris_test[,5])

dtm <-  evtree(Species ~.,   data=iris_train)
plot(dtm)
p <- predict(dtm, iris_test)
table(iris_test[,5],p)
mean(p== iris_test[,5])

m1 <-  rpart(sleep_total ~., data=df,  method="anova")              #  numerical value case, anova 
m1
rpart.plot(m1,  type=3,  digits=3,  fallen.leaves=TRUE)
p1 <-  predict(m1, df)
p1
MAE <-   function(actual, predicted) { mean(abs(actual- predicted))}
MAE
MAE(df$sleep_total,p1)

--------------------------------------------------------------------------------------------------------------------
Principal component analysis

princomp

head(iris)
d <- iris[,1:4]
d <- iris[,-5] # a more concise expression
head(d)

pc <- princomp(d, cor=TRUE, score=TRUE)
summary(pc)
plot(pc)
plot(pc, type="l")
biplot(pc)
dim(d)
attributes(pc)
pc$loadings
pc$scores

--------------------------------------------------------------------------------------------------------------------
Rattle - Data mining in R

library(rattle)
rattle()

R data miner [Rattle]

--------------------------------------------------------------------------------------------------------------------

RStudio



-----------------------------------------------------------------------------------------------------------------
Functions
Source: Hands-On Programming with R
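A minimal sketch of defining and calling a function in R, in the spirit of the dice example from Hands-On Programming with R (the function name and defaults here are illustrative):

```r
# A user-defined function with default arguments
roll <- function(sides = 6, n = 2) {
  # draw n values from 1..sides with replacement and sum them
  sum(sample(1:sides, size = n, replace = TRUE))
}

roll()                   # roll two six-sided dice
roll(sides = 20, n = 1)  # roll one twenty-sided die
roll                     # functions are objects: typing the name prints the body
```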

-----------------------------------------------------------------------------------------------------------------
data <- read.csv("datafile.csv")
data <- read.csv("datafile.csv", stringsAsFactors=FALSE)
data <- read.csv("datafile.csv", header=FALSE)
# Manually assign the header names
names(data) <- c("Column1","Column2","Column3")
str(data)
names(data) <- NULL

> print(dframe)
small medium big
1 0.6739635 10.526448 99.83624
2 1.5524619 9.205156 100.70852
3 0.3250562 11.427756 99.73202
4 1.2143595 8.533180 98.53608
5 1.3107692 9.763317 100.74444
6 2.1739663 9.806662 98.58961
7 1.6187899 9.150245 100.46707
8 0.8872657 10.058465 99.88068
9 1.9170283 9.182330 100.46724
10 0.7767406 7.949692 100.49814
> colMeans(dframe)
small medium big
1.245040 9.560325 99.946003
> sapply(dframe, sd)
small medium big
0.5844025 0.9920281 0.8135498
> var(dframe)
small medium big
small 0.34152627 -0.21516416 -0.04005275
medium -0.21516416 0.98411974 -0.09253855
big -0.04005275 -0.09253855 0.66186326
> cor(dframe)
small medium big
small 1.00000000 -0.3711367 -0.08424345
medium -0.37113670 1.0000000 -0.11466070
big -0.08424345 -0.1146607 1.00000000
> cov(dframe)
small medium big
small 0.34152627 -0.21516416 -0.04005275
medium -0.21516416 0.98411974 -0.09253855
big -0.04005275 -0.09253855 0.66186326

----------------------------------------------------------------------------------------------------------------
Fisher R.A. 1890 1962
Pearson Karl 1857 1936
Cox Gertrude 1900 1978
Yates Frank 1902 1994
Smith Kirstine 1878 1939

> dfrm <- read.table("statisticians.txt")
> print(dfrm)
V1 V2 V3 V4
1 Fisher R.A. 1890 1962
2 Pearson Karl 1857 1936
3 Cox Gertrude 1900 1978
4 Yates Frank 1902 1994
5 Smith Kirstine 1878 1939

lastname firstname born died
Fisher R.A. 1890 1962
Pearson Karl 1857 1936
Cox Gertrude 1900 1978
Yates Frank 1902 1994
Smith Kirstine 1878 1939

> dfrm <- read.table("statisticians.txt", header=TRUE, stringsAsFactors=FALSE)
> print(dfrm)
lastname firstname born died
1 Fisher R.A. 1890 1962
2 Pearson Karl 1857 1936
3 Cox Gertrude 1900 1978
4 Yates Frank 1902 1994
5 Smith Kirstine 1878 1939

head(dfrm)
tail(dfrm)

------------------------------------------------------------------------------------------------------------------
> fibmat
[,1] [,2]
[1,] 0 1
[2,] 1 1
> eigen(fibmat)
$values
[1] 1.618034 -0.618034
$vectors
[,1] [,2]
[1,] 0.5257311 -0.8506508
[2,] 0.8506508 0.5257311
eigen(fibmat)$values
eigen(fibmat)$vectors

------------------------------------------------------------------------------------------------------------------
> r <- prcomp( ~ x + y)
> print(r)
Standard deviations:
[1] 0.3724471 0.1306414
Rotation:
PC1 PC2
x -0.5312726 -0.8472010
y -0.8472010 0.5312726
> summary(r)
Importance of components:
PC1 PC2
Standard deviation 0.372 0.131
Proportion of Variance 0.890 0.110
Cumulative Proportion 0.890 1.000

------------------------------------------------------------------------------------------------------------------

#!/usr/bin/Rscript --slave

------------------------------------------------------------------------------------------------------------------

x <- seq(from=0, to=6, length.out=100)                                                                 # Define the density domains
ylim <- c(0, 0.6)
par(mfrow=c(2,2))                                                                                                  # Create a 2x2 plotting area
plot(x, dunif(x,min=2,max=4), main="Uniform",  type='l', ylim=ylim)                         # Plot a uniform density
plot(x, dnorm(x,mean=3,sd=1), main="Normal",  type='l', ylim=ylim)                       # Plot a Normal density
plot(x, dexp(x,rate=1/2), main="Exponential",   type='l', ylim=ylim)                         # Plot an exponential density
plot(x, dgamma(x,shape=2,rate=1), main="Gamma",   type='l', ylim=ylim)               # Plot a gamma density

http://www.sr.bham.ac.uk/~ajrs/R/gallery/plot_blackbody_curves.txt


------------------------------------------------------------------------------------------------------------------

lty="solid" or lty=1 (default)
lty="dashed" or lty=2
lty="dotted" or lty=3
lty="dotdash" or lty=4
lty="longdash" or lty=5
lty="twodash" or lty=6
lty="blank" or lty=0 (inhibits drawing)
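The line types listed above can be compared side by side on one plot; a small sketch:

```r
# Draw the six visible line types as horizontal lines, one per lty value
plot(NULL, xlim = c(0, 1), ylim = c(0.5, 6.5), xlab = "", ylab = "lty")
for (i in 1:6) {
  abline(h = i, lty = i)   # lty = 1 (solid) ... lty = 6 (twodash)
}
```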

------------------------------------------------------------------------------------------------------------------

builtins() # List all built-in functions
options()  # Set options to control how R computes & displays results

?NA        # Help page on handling of missing data values
abs(x)     # The absolute value of "x"
append()   # Add elements to a vector
c(x)       # A generic function which combines its arguments
cat(x)     # Prints the arguments
cbind()    # Combine vectors by row/column (cf. "paste" in Unix)
diff(x)    # Returns suitably lagged and iterated differences
gl()       # Generate factors with the pattern of their levels
grep()     # Pattern matching
identical()  # Test if 2 objects are *exactly* equal
jitter()     # Add a small amount of noise to a numeric vector
julian()     # Return Julian date
length(x)    # Return no. of elements in vector x
ls()         # List objects in current environment
mat.or.vec() # Create a matrix or vector
paste(x)     # Concatenate vectors after converting to character
range(x)     # Returns the minimum and maximum of x
rep(1,5)     # Repeat the number 1 five times
rev(x)       # List the elements of "x" in reverse order
seq(1,10,0.4)  # Generate a sequence (1 -> 10, spaced by 0.4)
sequence()     # Create a vector of sequences
sign(x)        # Returns the signs of the elements of x
sort(x)        # Sort the vector x
order(x)       # list sorted element numbers of x
tolower(),toupper()  # Convert string to lower/upper case letters
unique(x)      # Remove duplicate entries from vector
system("cmd")  # Execute "cmd" in operating system (outside of R)
vector()       # Produces a vector of given length and mode

formatC(x)     # Format x using 'C' style formatting specifications
floor(x), ceiling(x), round(x), signif(x), trunc(x)   # rounding functions

Sys.getenv(x)  # Get the value of the environment variable "x"
Sys.setenv()   # Set the value of an environment variable (replaces the defunct Sys.putenv)
Sys.time()     # Return system time
Sys.Date()     # Return system date
getwd()        # Return working directory
setwd()        # Set working directory
?files         # Help on low-level interface to file system
list.files()   # List files in a given directory
file.info()    # Get information about files

# Built-in constants:
pi, letters, LETTERS       # Pi, lower & uppercase letters, e.g. letters[7] = "g"
month.abb, month.name  # Abbreviated & full names for months

log(x),logb(),log10(),log2(),exp(),expm1(),log1p(),sqrt()   # Fairly obvious
cos(),sin(),tan(),acos(),asin(),atan(),atan2()       # Usual stuff
cosh(),sinh(),tanh(),acosh(),asinh(),atanh()         # Hyperbolic functions
union(),intersect(),setdiff(),setequal()             # Set operations
+,  -,  *,   /,   ^,   %%,   %/%                                     # Arithmetic operators
<,   >,   <=,   >=,   ==,    !=                         # Comparison operators
eigen()      # Computes eigenvalues and eigenvectors

deriv()      # Symbolic and algorithmic derivatives of simple expressions
integrate()  # Adaptive quadrature over a finite or infinite interval.

sqrt(),sum()
?Control     # Help on control flow statements (e.g. if, for, while)
?Extract     # Help on operators acting to extract or replace subsets of vectors
?Logic       # Help on logical operators
?Mod         # Help on functions which support complex arithmetic in R
?Paren       # Help on parentheses
?regex       # Help on regular expressions used in R
?Syntax      # Help on R syntax and giving the precedence of operators
?Special     # Help on special functions related to beta and gamma functions

help(package=graphics) # List all graphics functions

plot()                # Generic function for plotting of R objects
par()                 # Set or query graphical parameters
curve(5*x^3,add=T)    # Plot an equation as a curve
points(x,y)           # Add another set of points to an existing graph
arrows()              # Draw arrows [see errorbar script]
abline()              # Adds a straight line to an existing graph
lines()               # Join specified points with line segments
segments()            # Draw line segments between pairs of points
hist(x)               # Plot a histogram of x
pairs()               # Plot matrix of scatter plots
matplot()             # Plot columns of matrices

?device               # Help page on available graphical devices
postscript()          # Plot to postscript file
pdf()                 # Plot to pdf file
png()                 # Plot to PNG file
jpeg()                # Plot to JPEG file
X11()                 # Plot to X window
persp()               # Draws perspective plot
contour()             # Contour plot
image()               # Plot an image


lm # Fit linear model
glm # Fit generalised linear model
nls # non-linear (weighted) least-squares fitting
lqs # "library(MASS)" resistant regression

optim # general-purpose optimisation
optimize # 1-dimensional optimisation
constrOptim # Constrained optimisation
nlm # Non-linear minimisation
nlminb # More robust (non-)constrained non-linear minimisation

help(package=stats)   # List all stats functions

?Chisquare            # Help on chi-squared distribution functions
?Poisson              # Help on Poisson distribution functions
help(package=survival) # Survival analysis

cor.test()            # Perform correlation test
cumsum(); cumprod(); cummin(); cummax()   # Cumulative functions for vectors
density(x)            # Compute kernel density estimates
ks.test()             # Performs one or two sample Kolmogorov-Smirnov tests
loess(), lowess()     # Scatter plot smoothing
mad()                 # Calculate median absolute deviation
mean(x), weighted.mean(x), median(x), min(x), max(x), quantile(x)
rnorm(), runif()  # Generate random data with Gaussian/uniform distribution
splinefun()           # Perform spline interpolation
smooth.spline()       # Fits a cubic smoothing spline
sd()                  # Calculate standard deviation
summary(x)            # Returns a summary of x: mean, min, max etc.
t.test()              # Student's t-test
var()                 # Calculate variance
sample()              # Random samples & permutations
ecdf()                # Empirical Cumulative Distribution Function
qqplot()              # quantile-quantile plot

------------------------------------------------------------------------------------------------------------------
R Package diagram: visualising simple graphs,
flowcharts, and webs

https://cran.r-project.org/web/packages/diagram/vignettes/diagram.pdf
https://www.stat.auckland.ac.nz/~paul/R/Diagram/diagram.pdf

------------------------------------------------------------------------------------------------------------------
ggplot2

Grammar of Graphics by L. Wilkinson
ggplot2.org

plot
text, lines, points, axis

xyplot, bwplot

library(ggplot2)
mpg : data frame
qplot(displ, hwy, data=mpg)
qplot(displ, hwy, data=mpg, color=4)
qplot(displ, hwy, data=mpg, geom=c("point", "smooth"))

Histograms
qplot(hwy, data=mpg, fill=4)

Facets
qplot(hwy, data=mpg, facets=.~drv)
qplot(hwy, data=mpg, facets= drv~., binwidth=2)

qplot(log(eno), data=maacs,  fill = mopos)
qplot(log(eno), data=maacs, geom = "density")
qplot(log(eno), data=maacs, geom = "density", color = mopos)

qplot(log(pm25), log(eno), data=maacs)
qplot(log(pm25), log(eno), data=maacs, shape = mopos)
qplot(log(pm25), log(eno), data=maacs, shape = mopos, color = mopos)
qplot(log(pm25), log(eno), data=maacs, color = mopos, geom = c("point", "smooth"), method = "lm")
qplot(log(pm25), log(eno), data=maacs, color = mopos, geom = c("point", "smooth"), method = "lm", facets = .~mopos)

ggplot(mammals, aes(x = vore, y = sleep)) +  geom_boxplot(varwidth = TRUE)
ggplot(mammals, aes(x = sleep, fill = vore))  +  geom_density(col = NA, alpha = 0.35)
ggplot(mammals, aes(x = sleep, fill = vore))  +  geom_density(aes(weight = n), col = NA, alpha = 0.35)
ggplot(mammals, aes(x = vore, y = sleep)) + geom_violin()
ggplot(mammals, aes(x = vore, y = sleep)) + geom_violin(aes(weight = n), col = NA)

First look
ggplot(faithful, aes(x = waiting, y = eruptions))  + geom_point()

2D density plot
ggplot(faithful, aes(x = waiting, y = eruptions))  + stat_density_2d(geom = "tile", aes(fill =  ..density..), contour = FALSE)
library(viridis)
ggplot(faithful, aes(x = waiting, y = eruptions))  + stat_density_2d(geom = "tile", aes(fill =  ..density..), contour = FALSE) + scale_fill_viridis()
ggplot(faithful, aes(x = waiting, y = eruptions)) + stat_density_2d(geom = "point", aes(size = ..density..), n=20, contour = FALSE) + scale_size(range = c(0, 9))

library(ggplot2)
ggplot(mammals, aes(x = body, y = brain)) + geom_point()
ggplot(mammals, aes(x = body, y = brain)) + geom_point(alpha = 0.6) + stat_smooth(method = "lm", col = "red", se = FALSE)
ggplot(mammals, aes(x = body, y = brain)) + geom_point(alpha = 0.6) + coord_fixed() + scale_x_log10() + scale_y_log10() + stat_smooth(method = "lm", col = "#C42126",  se = FALSE, size = 1)
 
------------------------------------------------------------------------------------------------------------------
network visualization with ggplot2
https://briatte.github.io/ggnet/

------------------------------------------------------------------------------------------------------------------

curve(x^2, from=1, to=50, xlab="x", ylab="y")

eq = function(x){x*x}
curve(eq, from=1, to=50, xlab="x", ylab="y")

library("ggplot2")
eq = function(x){x*x}
qplot(c(1,50), fun=eq, stat="function", geom="line", xlab="x", ylab="y")

library("ggplot2")
eq = function(x){x*x}
ggplot(data.frame(x=c(1, 50)), aes(x=x)) + stat_function(fun=eq, geom="line") + xlab("x") + ylab("y")

library(lattice)

eq <- function(x) {x*x}
X <- 1:1000
xyplot(eq(X)~X, type="l")


------------------------------------------------------------------------------------------------------------------

fit <- lm(mpg ~ cyl + hp, data = mtcars)

summary(fit)
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) 36.90833    2.19080  16.847  < 2e-16 ***
## cyl         -2.26469    0.57589  -3.933  0.00048 ***
## hp          -0.01912    0.01500  -1.275  0.21253

plot(mpg ~ cyl, data = mtcars, xlab = "Cylinders", ylab = "Miles per gallon")
abline(coef(fit)[1:2])

## rounded coefficients for better output
cf <- round(coef(fit), 2)
## sign check to avoid having plus followed by minus for negative coefficients
eq <- paste0("mpg = ", cf[1], ifelse(sign(cf[2])==1, " + ", " - "), abs(cf[2]), " cyl ",
             ifelse(sign(cf[3])==1, " + ", " - "), abs(cf[3]), " hp")
## printing of the equation
mtext(eq, 3, line=-2)

------------------------------------------------------------------------------------------------------------------
knitr
https://en.wikipedia.org/wiki/Knitr

fit <- lm(dist ~ speed, data = cars) # linear regression
par(mar = c(4, 4, 1, 0.1), mgp = c(2, 1, 0))
with(cars, plot(speed, dist, panel.last = abline(fit)))
text(10, 100, "$Y = \\beta_0 + \\beta_1x + \\epsilon$")
library(ggplot2)
qplot(speed, dist, data = cars) + geom_smooth()

par(mar = c(3, 3, 0.1, 0.1))
plot(1:10, ann = FALSE, las = 1)
text(5, 9, "mass $\\rightarrow$ energy\n$E=mc^2$")

library(animation)
demo("Mandelbrot", echo = FALSE, package = "animation")

------------------------------------------------------------------------------------------------------------------
shiny.rstudio.com/gallery

------------------------------------------------------------------------------------------------------------------
plot3D in R

https://cran.r-project.org/web/packages/plot3D/vignettes/plot3D.pdf


x = seq(-10, 10, length= 30)
y = x
f = function(x,y) { r = sqrt(x^2+y^2); 10 * sin(r)/r }
z = outer(x, y, f)                                                               # Forms matrix z using function f
par(mfrow = c(1,2))
persp(x,y,z)
persp(x,y,z,theta = 30, phi = 30)

------------------------------------------------------------------------------------------------------------------

height <- c(176, 154, 138, 196, 132, 176, 181, 169, 150, 175)
bodymass <- c(82, 49, 53, 112, 47, 69, 77, 71, 62, 78)

plot(bodymass, height)
plot(bodymass, height, pch = 16, cex = 1.3, col = "blue", main = "HEIGHT PLOTTED AGAINST BODY MASS", xlab = "BODY MASS (kg)", ylab = "HEIGHT (cm)")
abline(lm(height ~ bodymass))

To add a vertical line:
abline(v=3, col="purple")

-----------------------------------------------------------------------------------------------------------------

http://www.statmethods.net/advgraphs/layout.html
http://www.harding.edu/fmccown/r/

# 4 figures arranged in 2 rows and 2 columns
attach(mtcars)
par(mfrow=c(2,2))
plot(wt,mpg, main="Scatterplot of wt vs. mpg")
plot(wt,disp, main="Scatterplot of wt vs disp")
hist(wt, main="Histogram of wt")
boxplot(wt, main="Boxplot of wt")

# 3 figures arranged in 3 rows and 1 column
attach(mtcars)
par(mfrow=c(3,1))
hist(wt)
hist(mpg)
hist(disp)

# One figure in row 1 and two figures in row 2
attach(mtcars)
layout(matrix(c(1,1,2,3), 2, 2, byrow = TRUE))
hist(wt)
hist(mpg)
hist(disp)

# One figure in row 1 and two figures in row 2
# row 1 is 1/3 the height of row 2
# column 2 is 1/4 the width of the column 1
attach(mtcars)
layout(matrix(c(1,1,2,3), 2, 2, byrow = TRUE), widths=c(3,1), heights=c(1,2))
hist(wt)
hist(mpg)
hist(disp)

# Add boxplots to a scatterplot
par(fig=c(0,0.8,0,0.8), new=TRUE)
plot(mtcars$wt, mtcars$mpg, xlab="Car Weight", ylab="Miles Per Gallon")
par(fig=c(0,0.8,0.55,1), new=TRUE)
boxplot(mtcars$wt, horizontal=TRUE, axes=FALSE)
par(fig=c(0.65,1,0,0.8),new=TRUE)
boxplot(mtcars$mpg, axes=FALSE)
mtext("Enhanced Scatterplot", side=3, outer=TRUE, line=-3)

------------------------------------------------------------------------------------------------------------------

lm_model <- lm(y ~ x1 + x2, data=as.data.frame(cbind(y,x1,x2)))
summary(lm_model)


glm_mod <- glm(y ~ x1+x2, family=binomial(link="logit"), data=as.data.frame(cbind(y,x1,x2)))

kmeans_model <- kmeans(x=X, centers=m)

class package
knn_model <- knn(train=X_train, test=X_test, cl=as.factor(labels), k=K)

e1071 package
nB_model <- naiveBayes(y ~ x1 + x2, data=as.data.frame(cbind(y,x1,x2)))

rpart package
cart_model <- rpart(y ~ x1 + x2, data=as.data.frame(cbind(y,x1,x2)), method="class")

ada package
boost_model <- ada(x=X, y=labels)

e1071 package
svm_model <- svm(x=X, y=as.factor(labels), kernel ="radial", cost=C)
summary(svm_model)

---------------------------------------------------------------------------------------------------------------------

require(MASS)
options(prompt="R> ")

source("~/.Rprofile")

help(options)
---------------------------------------------------------------------------------------------------------------------

con <- file("analysisReport.out", "w")
cat(data, file=con)
cat(results, file=con)
cat(conclusion, file=con)
close(con)

---------------------------------------------------------------------------------------------------------------------

w <- c(5,4,7,2,7,1)
sort(w)
sort(w,decreasing=TRUE)
length(w)

v <- c(11, 12, 13, 15, 14)
order(v)
v[order(v)]

---------------------------------------------------------------------------------------------------------------------

pch      A vector of plotting characters.
cex      A vector of character expansion sizes.
bg        A vector of background colors for plot symbols.
type      A character vector specifying the types of plots to generate. Use
type="p" for points,
type="l" for lines,
type="b" for both,
type="c" for the lines part alone of "b",
type="o" for both overplotted points and lines,
type="h" for histogram-like (or high-density) vertical lines,
type="s" for stair steps,
type="S" for other steps, or
type="n" for no plotting.
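A quick way to see what these options do is to draw the same data once with each value (a sketch; any small vector works):

```r
x <- 1:10
types <- c("p", "l", "b", "o", "h", "s")   # a sample of the type values above
par(mfrow = c(2, 3))                       # 2x3 grid, one panel per type
for (t in types) {
  plot(x, x^2, type = t, main = paste0('type="', t, '"'))
}
```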

---------------------------------------------------------------------------------------------------------------------

t(A)
Matrix transposition of A

solve(A)
Matrix inverse of A

A %*% B
Matrix multiplication of A and B

diag(n)
An n-by-n diagonal (identity) matrix
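These operations can be checked together: multiplying a matrix by its inverse should recover the identity. A small sketch:

```r
A <- matrix(c(2, 0, 1, 3), nrow = 2)  # a 2x2 matrix (column-major fill)
t(A)                                  # transpose
Ainv <- solve(A)                      # inverse
A %*% Ainv                            # matrix product: equals diag(2) up to rounding
all.equal(A %*% Ainv, diag(2))        # TRUE
```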

--------------------------------------------------------------------------------------------------------------------

paste("Everybody", "loves", "stats.")
[1] "Everybody loves stats."

paste("Everybody", "loves", "stats.", sep="-")
[1] "Everybody-loves-stats."

paste("Everybody", "loves", "stats.", sep="")
[1] "Everybodylovesstats."

path <- "/home/mike/data/trials.csv"
strsplit(path, "/")
[[1]]
[1] "" "home" "mike" "data" "trials.csv"

> choose(5,3) # How many ways can we select 3 items from 5 items?
[1] 10
> choose(50,3) # How many ways can we select 3 items from 50 items?
[1] 19600
> choose(50,30) # How many ways can we select 30 items from 50 items?
[1] 4.712921e+13

> set.seed(165) # Initialize the random number generator to a known state
> runif(10)

---------------------------------------------------------------------------------------------------------------------

plot(samp)
m <- mean(samp)
abline(h=m)
stdevs <-  m+c(-2, -1, 0, 1, 2)*sd(samp)
abline(h=stdevs, lty="dotted")


--------------------------------------------------------------------------------------------------------------------

mush <- apriori(as.matrix(dataset), parameter=list(supp=0.8, conf=0.9))
summary(mush)
inspect(mush)

glmmodel <- glm(y ~ x1+x2, family=binomial(link="logit"), data=as.data.frame(cbind(y, x1, x2)))

kmeansmodel <- kmeans(x=Xmatrix, centers=10)

knnmodel <- knn(train=Xtrainmatrix, test=Xtestmatrix, cl=as.factor(labels), k=K)


nBmodel <- naiveBayes(y ~ x1+x2, data=as.data.frame(cbind(y,x1,x2)))

cartmodel <- rpart(y ~ x1+ x2, data=as.data.frame(cbind(y,x1,x2)), method="class")
plot.rpart
text.rpart

boostmodel <- ada(x=Xmatrixfeatures, y=labels)

svmmodel <- svm(x=Xmatrixfeatures, y=as.factor(labels), kernel="radial", cost=C)
summary(svmmodel)

A matrix is used for linear algebra; internally it is just a vector with dimension attributes. It can be converted to a data.frame.
A data.frame is essentially a list (of columns).
A more general data format, one that allows $ access: as.data.frame(matrix)
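The relationships above, sketched in code:

```r
m <- matrix(1:6, nrow = 3)   # a matrix is a vector with a dim attribute
dim(m)                       # 3 2
is.vector(as.vector(m))      # TRUE: dropping dim leaves a plain vector
df <- as.data.frame(m)       # convert to a data.frame; columns become V1, V2
df$V1                        # $ access is now available: 1 2 3
is.list(df)                  # TRUE: a data.frame is essentially a list of columns
```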

nnetmodel <- nnet(Species ~., data=iris, size=10)
# load the package
library(nnet)
data(iris)
# fit model
fit <- nnet(Species~., data=iris, size=4, decay=0.0001, maxit=500)
# summarize the fit
summary(fit)
# make predictions
predictions <- predict(fit, iris[,1:4], type="class")
# summarize accuracy
table(predictions, iris$Species)

http://machinelearningmastery.com/non-linear-classification-in-r/
Flexible discriminant analysis

# load the package
library(mda)
data(iris)
# fit model
fit <- fda(Species~., data=iris)
# summarize the fit
summary(fit)
# make predictions
predictions <- predict(fit, iris[,1:4])
# summarize accuracy
table(predictions, iris$Species)

http://machinelearningmastery.com/non-linear-classification-in-r/
# load the package
library(rpart)
# load data
data(iris)
# fit model
fit <- rpart(Species~., data=iris)
# summarize the fit
summary(fit)
# make predictions
predictions <- predict(fit, iris[,1:4], type="class")
# summarize accuracy
table(predictions, iris$Species)

# load the package
library(RWeka)
# load data
data(iris)
# fit model
fit <- J48(Species~., data=iris)
# summarize the fit
summary(fit)
# make predictions
predictions <- predict(fit, iris[,1:4])
# summarize accuracy
table(predictions, iris$Species)

# load the package
library(RWeka)
# load data
data(iris)
# fit model
fit <- PART(Species~., data=iris)
# summarize the fit
summary(fit)
# make predictions
predictions <- predict(fit, iris[,1:4])
# summarize accuracy
table(predictions, iris$Species)

# load the package
library(ipred)
# load data
data(iris)
# fit model
fit <- bagging(Species~., data=iris)
# summarize the fit
summary(fit)
# make predictions
predictions <- predict(fit, iris[,1:4], type="class")
# summarize accuracy
table(predictions, iris$Species)

# load the package
library(randomForest)
# load data
data(iris)
# fit model
fit <- randomForest(Species~., data=iris)
# summarize the fit
summary(fit)
# make predictions
predictions <- predict(fit, iris[,1:4])
# summarize accuracy
table(predictions, iris$Species)

# load the package
library(gbm)
# load data
data(iris)
# fit model
fit <- gbm(Species~., data=iris, distribution="multinomial")
# summarize the fit
print(fit)
# make predictions
predictions <- predict(fit, iris)
# summarize accuracy
table(predictions, iris$Species)

# load the package
library(C50)
# load data
data(iris)
# fit model
fit <- C5.0(Species~., data=iris, trials=10)
# summarize the fit
print(fit)
# make predictions
predictions <- predict(fit, iris)
# summarize accuracy
table(predictions, iris$Species)

# load the package
library(glmnet)
# load data
data(longley)
x <- as.matrix(longley[,1:6])
y <- as.matrix(longley[,7])
# fit model
fit <- glmnet(x, y, family="gaussian", alpha=0, lambda=0.001)
# summarize the fit
summary(fit)
# make predictions
predictions <- predict(fit, x, type="link")
# summarize accuracy
mse <- mean((y - predictions)^2)
print(mse)

# load the package
library(lars)
# load data
data(longley)
x <- as.matrix(longley[,1:6])
y <- as.matrix(longley[,7])
# fit model
fit <- lars(x, y, type="lasso")
# summarize the fit
summary(fit)
# select a step with a minimum error
best_step <- fit$df[which.min(fit$RSS)]
# make predictions
predictions <- predict(fit, x, s=best_step, type="fit")$fit
# summarize accuracy
mse <- mean((y - predictions)^2)
print(mse)

# load the package
library(glmnet)
# load data
data(longley)
x <- as.matrix(longley[,1:6])
y <- as.matrix(longley[,7])
# fit model
fit <- glmnet(x, y, family="gaussian", alpha=0.5, lambda=0.001)
# summarize the fit
summary(fit)
# make predictions
predictions <- predict(fit, x, type="link")
# summarize accuracy
mse <- mean((y - predictions)^2)
print(mse)

mice
rpart
ctree
train
randomForest
nnet
e1071
rbfdot
ksvm
https://www.datacamp.com/community/tutorials/machine-learning-in-r#six
http://www.kdnuggets.com/2017/02/top-r-packages-machine-learning.html

http://www.kdnuggets.com/2015/06/top-20-r-packages.html

--------------------------------------------------------------------------------------------------------------------




--------------------------------------------------------------------------------------------------------------------



