R-Modeling (Step 4)

Introduction: R is a platform for data analysis and visualization.

[I] Regression

OLS regression

Description | Function
simple linear regression | lm(Y ~ X1, data)
polynomial regression | lm(Y ~ X1 + I(X1^2), data)
multiple linear regression | lm(Y ~ X1 + X2 + ... + Xk, data)
multiple linear regression with interaction terms | lm(Y ~ X1 + X2 + X1:X2, data)
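For example, a minimal sketch on the built-in mtcars data (the variable choices are illustrative, not from the original notes):

> # simple linear regression: mpg as a function of weight
> fit <- lm(mpg ~ wt, data = mtcars)
> summary(fit)
> # polynomial regression: wrap the squared term in I()
> fit2 <- lm(mpg ~ wt + I(wt^2), data = mtcars)
> # multiple regression with an interaction term
> fit3 <- lm(mpg ~ wt + hp + wt:hp, data = mtcars)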

selecting the optimal regression model

Function | Description
anova(fit1,fit2) | compare nested models
AIC(fit1,fit2) | compare models by AIC
stepAIC(fit,direction=) | stepwise selection: "forward" or "backward"
regsubsets() | all-subsets regression
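A sketch of these four on mtcars (the nested models are illustrative; stepAIC() is in MASS, regsubsets() in leaps):

> fit1 <- lm(mpg ~ wt, data = mtcars)
> fit2 <- lm(mpg ~ wt + hp, data = mtcars)
> anova(fit1, fit2)      # test whether the added predictor improves fit
> AIC(fit1, fit2)        # the smaller AIC indicates the better model
> library(MASS)
> stepAIC(fit2, direction = "backward")
> library(leaps)
> leaps <- regsubsets(mpg ~ wt + hp + disp, data = mtcars, nbest = 4)
> plot(leaps, scale = "adjr2")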

regression diagnostics

Function | Description
plot() | standard diagnostic plots
qqPlot() | quantile comparison plot
durbinWatsonTest() | Durbin-Watson test for autocorrelated errors
crPlots() | component-plus-residual plots
ncvTest() | score test for non-constant error variance
spreadLevelPlot() | spread-level plot
outlierTest() | Bonferroni outlier test
avPlots() | added-variable plots
influencePlot() | influence plot
scatterplot() | enhanced scatter plot
scatterplotMatrix() | enhanced scatter plot matrix
vif() | variance inflation factors
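Most of these diagnostics come from the car package; a minimal sketch on an illustrative model:

> library(car)                      # qqPlot(), durbinWatsonTest(), vif(), ...
> fit <- lm(mpg ~ wt + hp, data = mtcars)
> par(mfrow = c(2, 2)); plot(fit)   # standard diagnostic plots
> qqPlot(fit)                       # normality of studentized residuals
> durbinWatsonTest(fit)             # independence of errors
> ncvTest(fit)                      # constant error variance
> vif(fit)                          # multicollinearity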

unusual observations

  • identifying unusual observations (see the sketch below)

1. outliers: outlierTest()
2. high-leverage points: hat.plot()
3. influential observations: Cook's distance

  • corrective measures

1. delete the observations
2. transform variables
3. add or delete variables
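Note that hat.plot() above is a small helper defined in the source material, not a base R function; a minimal sketch of the three checks with standard tools (the model is illustrative):

> library(car)
> fit <- lm(mpg ~ wt + hp, data = mtcars)
> outlierTest(fit)               # Bonferroni test for outliers
> hatvalues(fit)                 # leverage; values well above the mean are suspect
> cooksD <- cooks.distance(fit)  # influence
> plot(cooksD, type = "h")
> abline(h = 4/nrow(mtcars), lty = 2)   # rough 4/n cutoff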

Generalized Linear Models

the glm() function

  • glm() (see the sketch after these tables)
    Distribution Family | Default Link Function
binomial | (link = "logit")
gaussian | (link = "identity")
gamma | (link = "inverse")
inverse.gaussian | (link = "1/mu^2")
poisson | (link = "log")
quasi | (link = "identity", variance = "constant")
quasibinomial | (link = "logit")
quasipoisson | (link = "log")
  • functions used with glm()
    Function | Description
summary() | show details of the fitted model
coefficients()/coef() | list the parameters of the fitted model
confint() | give confidence intervals for the model parameters
residuals() | list the residuals of the fitted model
anova() | generate an analysis-of-variance table comparing two fitted models
plot() | generate diagnostic plots for evaluating the fitted model
predict() | make predictions on new data with the fitted model
deviance() | return the deviance of the fitted model
df.residual() | return the residual degrees of freedom of the fitted model
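The family argument selects the distribution, and the link can be overridden explicitly. A small illustrative sketch (mtcars$am is a 0/1 outcome; the model is not from the original notes):

> # binomial family with its default logit link
> fit <- glm(am ~ wt + hp, data = mtcars, family = binomial())
> # the same model with a non-default probit link
> fit.probit <- glm(am ~ wt + hp, data = mtcars,
                    family = binomial(link = "probit"))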

logistic regression

> data(Affairs, package="AER")
> summary(Affairs)
> table(Affairs$affairs)
> Affairs$ynaffair[Affairs$affairs > 0] <- 1
> Affairs$ynaffair[Affairs$affairs == 0] <- 0
> Affairs$ynaffair <- factor(Affairs$ynaffair, levels=c(0,1), labels=c("No","Yes"))
> table(Affairs$ynaffair)
> fit.full <- glm(ynaffair ~ gender + age + yearsmarried + children +
                  religiousness + education + occupation + rating,
                  family = binomial(), data = Affairs)
> fit.reduced <- glm(ynaffair ~ age + yearsmarried + religiousness + rating,
                     family = binomial(), data = Affairs)
> anova(fit.reduced, fit.full, test="Chisq")
> coef(fit.reduced)
> exp(coef(fit.reduced))
> testdata <- data.frame(rating=c(1,2,3,4,5), age=mean(Affairs$age),
                         yearsmarried=mean(Affairs$yearsmarried),
                         religiousness=mean(Affairs$religiousness))
> testdata
> testdata$prob <- predict(fit.reduced, newdata=testdata, type="response")
> testdata <- data.frame(rating=mean(Affairs$rating),
                         age=seq(17, 57, 10),
                         yearsmarried=mean(Affairs$yearsmarried),
                         religiousness=mean(Affairs$religiousness))
> testdata
> testdata$prob <- predict(fit.reduced, newdata=testdata, type="response")
> testdata
> deviance(fit.reduced)/df.residual(fit.reduced)
> fit <- glm(ynaffair ~ age + yearsmarried + religiousness + rating,
             family = binomial(), data = Affairs)
> fit.od <- glm(ynaffair ~ age + yearsmarried + religiousness + rating,
                family = quasibinomial(), data = Affairs)
> pchisq(summary(fit.od)$dispersion * fit$df.residual,
         fit$df.residual, lower=F)

Poisson regression

> data(breslow.dat, package="robust")
> names(breslow.dat)
> summary(breslow.dat[c(6,7,8,10)])
> opar <- par(no.readonly=TRUE)
> par(mfrow=c(1,2))
> attach(breslow.dat)
> hist(sumY, breaks=20, xlab="Seizure Count", main="Distribution of Seizures")
> boxplot(sumY ~ Trt, xlab="Treatment", main="Group Comparisons")
> par(opar)
> fit <- glm(sumY ~ Base + Age + Trt, data=breslow.dat, family=poisson())
> summary(fit)
> coef(fit)
> exp(coef(fit))
> deviance(fit)/df.residual(fit)
> library(qcc)
> qcc.overdispersion.test(breslow.dat$sumY, type="poisson")
> fit.od <- glm(sumY ~ Base + Age + Trt, data=breslow.dat,
                family=quasipoisson())
> summary(fit.od)
> # if observation periods differed, a log(time) offset would adjust for them
> fit <- glm(sumY ~ Base + Age + Trt, data=breslow.dat,
             offset=log(time), family=poisson)

[II] Cluster

1. cluster analysis steps

1. choose appropriate variables
2. scale the data
3. screen for outliers
4. calculate distances
5. select a clustering algorithm
6. obtain one or more cluster solutions
7. determine the number of clusters present
8. obtain the final cluster solution
9. visualize the results
10. interpret the clusters
11. validate the results

2. calculating distances

> data(nutrient, package="flexclust")
> head(nutrient, 4)
> d <- dist(nutrient)
> as.matrix(d)[1:4,1:4]

3. hierarchical cluster analysis

> data(nutrient, package="flexclust")
> row.names(nutrient) <- tolower(row.names(nutrient))
> nutrient.scaled <- scale(nutrient)

> d <- dist(nutrient.scaled)

> fit.average <- hclust(d, method="average")
> plot(fit.average, hang=-1, cex=.8, main="Average Linkage Clustering")
> library(NbClust)
> devAskNewPage(ask=TRUE)
> nc <- NbClust(nutrient.scaled, distance="euclidean",
                min.nc=2, max.nc=15, method="average")
> table(nc$Best.n[1,])
> barplot(table(nc$Best.n[1,]),
          xlab="Number of Clusters", ylab="Number of Criteria",
          main="Number of Clusters Chosen by 26 Criteria")
> clusters <- cutree(fit.average, k=5)
> table(clusters)
> aggregate(nutrient, by=list(cluster=clusters), median)
> aggregate(as.data.frame(nutrient.scaled), by=list(cluster=clusters), median)
> plot(fit.average, hang=-1, cex=.8,
       main="Average Linkage Clustering\n5 Cluster Solution")
> rect.hclust(fit.average, k=5)

4. partitioning cluster analysis

  • K-means clustering
> # within-groups sum of squares against number of clusters
> wssplot <- function(data, nc=15, seed=1234){
      wss <- (nrow(data)-1)*sum(apply(data, 2, var))
      for (i in 2:nc){
          set.seed(seed)
          wss[i] <- sum(kmeans(data, centers=i)$withinss)}
      plot(1:nc, wss, type="b", xlab="Number of Clusters",
           ylab="Within groups sum of squares")}
> data(wine, package="rattle")
> head(wine)
> df <- scale(wine[-1])
> wssplot(df)
> library(NbClust)
> set.seed(1234)
> devAskNewPage(ask=TRUE)
> nc <- NbClust(df, min.nc=2, max.nc=15, method="kmeans")
> table(nc$Best.n[1,])
> barplot(table(nc$Best.n[1,]),
          xlab="Number of Clusters", ylab="Number of Criteria",
          main="Number of Clusters Chosen by 26 Criteria")
> set.seed(1234)
> fit.km <- kmeans(df, 3, nstart=25)
> fit.km$size
> fit.km$centers
> aggregate(wine[-1], by=list(cluster=fit.km$cluster), mean)
  • partitioning around medoids (PAM)
> library(cluster)
> set.seed(1234)
> fit.pam <- pam(wine[-1], k=3, stand=TRUE)
> fit.pam$medoids
> clusplot(fit.pam, main="Bivariate Cluster Plot")
> ct.pam <- table(wine$Type, fit.pam$clustering)
> library(flexclust)    # provides randIndex()
> randIndex(ct.pam)

5. avoiding non-existent clusters

> library(fMultivar)
> set.seed(1234)
> df <- rnorm2d(1000, rho=.5)
> df <- as.data.frame(df)
> plot(df, main="Bivariate Normal Distribution with rho=0.5")
> wssplot(df)
> library(NbClust)
> nc <- NbClust(df, min.nc=2, max.nc=15, method="kmeans")
> dev.new()
> barplot(table(nc$Best.n[1,]),
          xlab="Number of Clusters", ylab="Number of Criteria",
          main="Number of Clusters Chosen by 26 Criteria")
> library(ggplot2)
> library(cluster)
> fit <- pam(df, k=2)
> df$clustering <- factor(fit$clustering)
> ggplot(data=df, aes(x=V1, y=V2, color=clustering, shape=clustering)) +
      geom_point() + ggtitle("Clustering of Bivariate Normal Data")
> plot(nc$All.index[,4], type="o", ylab="CCC", xlab="Number of clusters", col="blue")

[III] Classification

data preparation

> loc <- "http://archive.ics.uci.edu/ml/machine-learning-databases/"
> ds <- "breast-cancer-wisconsin/breast-cancer-wisconsin.data"
> url <- paste(loc, ds, sep="")
> breast <- read.table(url, sep=",", header=FALSE, na.strings="?")
> names(breast) <- c("ID", "clumpThickness", "sizeUniformity",
                     "shapeUniformity", "marginalAdhesion",
                     "singleEpithelialCellSize", "bareNuclei",
                     "blandChromatin", "normalNucleoli", "mitosis", "class")
> df <- breast[-1]
> df$class <- factor(df$class, levels=c(2,4),
                     labels=c("benign","malignant"))
> set.seed(1234)
> train <- sample(nrow(df), 0.7*nrow(df))
> df.train <- df[train,]
> df.validate <- df[-train,]
> table(df.train$class)
> table(df.validate$class)

logistic regression

> fit.logit <- glm(class~., data=df.train, family=binomial())
> summary(fit.logit)
> prob <- predict(fit.logit, df.validate, type="response")
> logit.pred <- factor(prob > .5, levels=c(FALSE, TRUE),
                       labels=c("benign","malignant"))
> logit.perf <- table(df.validate$class, logit.pred,
                      dnn=c("Actual","Predicted"))
> logit.perf

decision tree

  • classic decision tree
> library(rpart)
> set.seed(1234)
> dtree <- rpart(class~., data=df.train, method="class",
                 parms=list(split="information"))
> dtree$cptable
> plotcp(dtree)
> dtree.pruned <- prune(dtree, cp=.0125)
> library(rpart.plot)
> prp(dtree.pruned, type=2, extra=104,
      fallen.leaves=TRUE, main="Decision Tree")
> dtree.pred <- predict(dtree.pruned, df.validate, type="class")
> dtree.perf <- table(df.validate$class, dtree.pred,
                      dnn=c("Actual","Predicted"))
> dtree.perf
  • conditional inference tree
> library(party)
> fit.ctree <- ctree(class~., data=df.train)
> plot(fit.ctree, main="Conditional Inference Tree")
> ctree.pred <- predict(fit.ctree, df.validate, type="response")
> ctree.perf <- table(df.validate$class, ctree.pred,
                      dnn=c("Actual","Predicted"))
> ctree.perf

random forest

> library(randomForest)
> set.seed(1234)
> fit.forest<-randomForest(class~.,data=df.train,na.action=na.roughfix,importance=TRUE)
> fit.forest
> importance(fit.forest,type=2)
> forest.pred<-predict(fit.forest,df.validate)
> forest.perf<-table(df.validate$class,forest.pred,
                              dnn=c("Actual","Predicted"))
> forest.perf

support vector machines

  • svm
> library(e1071)
> set.seed(1234)
> fit.svm<-svm(class~.,data=df.train)
> fit.svm
> svm.pred <- predict(fit.svm, na.omit(df.validate))
> svm.perf <- table(na.omit(df.validate)$class,
                    svm.pred, dnn=c("Actual","Predicted"))
> svm.perf
  • SVM with an RBF kernel
> set.seed(1234)
> tuned<-tune.svm(class~.,data=df.train,gamma=10^(-6:1),cost=10^(-10:10))
> tuned
> fit.svm<-svm(class~.,data=df.train,gamma=.01,cost=1)
> svm.pred<-predict(fit.svm,na.omit(df.validate))
> svm.perf<-table(na.omit(df.validate)$class,
                            svm.pred,dnn=c("Actual","Predicted"))
> svm.perf

choosing the best predictive solution

> performance <- function(table, n=2){
      if(!all(dim(table) == c(2,2)))
          stop("Must be a 2 × 2 table")
      tn = table[1,1]
      fp = table[1,2]
      fn = table[2,1]
      tp = table[2,2]
      sensitivity = tp/(tp+fn)
      specificity = tn/(tn+fp)
      ppp = tp/(tp+fp)
      npp = tn/(tn+fn)
      hitrate = (tp+tn)/(tp+tn+fp+fn)
      result <- paste("Sensitivity = ", round(sensitivity, n),
              "\nSpecificity = ", round(specificity, n),
              "\nPositive Predictive Value = ", round(ppp, n),
              "\nNegative Predictive Value = ", round(npp, n),
              "\nAccuracy = ", round(hitrate, n), "\n", sep="")
      cat(result)
  }
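Applied to each classifier's confusion matrix built above:

> performance(logit.perf)
> performance(dtree.perf)
> performance(ctree.perf)
> performance(forest.perf)
> performance(svm.perf)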

data mining with the rattle package

> install.packages("rattle")
> library(rattle)
> rattle()
> loc<-"http://archive.ics.uci.edu/ml/machine-learning-databases/"
> ds<-"pima-indians-diabetes/pima-indians-diabetes.data"
> url <-paste(loc,ds,sep="")
> diabetes<-read.table(url,sep=",",header=FALSE)
> names(diabetes)<-c("npregant","plasma","bp","triceps",
                                  "insulin","bmi","pedigree","age","class")
> diabetes$class<-factor(diabetes$class,levels=c(0,1),
                                      labels=c("normal","diabetic"))
> library(rattle)
> rattle()
> cv<-matrix(c(145,50,8,27),nrow=2)
> performance(as.table(cv))

[IV] Time Series

Function | Package | Description
ts() | stats | create a time-series object
plot() | graphics | plot a line graph of the time series
start() | stats | return the start time of the time series
end() | stats | return the end time of the time series
frequency() | stats | return the number of observations per unit time
window() | stats | subset a time-series object
ma() | forecast | fit a simple moving-average model
stl() | stats | decompose a time series into seasonal, trend, and irregular components using loess smoothing
monthplot() | stats | plot the seasonal components of a time series
seasonplot() | forecast | generate a season plot
HoltWinters() | stats | fit an exponential smoothing model
forecast() | forecast | predict future values of a time series
accuracy() | forecast | report goodness-of-fit measures for a time-series model
ets() | forecast | fit an exponential smoothing model, automatically selecting the optimal model
lag() | stats | return a lagged version of the time series
Acf() | forecast | estimate the autocorrelation function
Pacf() | forecast | estimate the partial autocorrelation function
diff() | base | return lagged differences of the series
ndiffs() | forecast | find the number of differences needed to remove trend from the series
adf.test() | tseries | perform an ADF test of whether the series is stationary
arima() | stats | fit an ARIMA model
Box.test() | stats | perform a Ljung-Box test of whether the model's residuals are independent
bds.test() | tseries | perform a BDS test of whether the series is independently and identically distributed
auto.arima() | forecast | automatically select an ARIMA model
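The table above lists the tools without code; a minimal sketch using the built-in AirPassengers monthly series (model choices are illustrative):

> library(forecast)
> plot(AirPassengers)                     # monthly airline passengers, 1949-1960
> start(AirPassengers); end(AirPassengers); frequency(AirPassengers)
> fit.stl <- stl(log(AirPassengers), s.window="periodic")   # seasonal decomposition
> plot(fit.stl)
> fit.arima <- auto.arima(AirPassengers)  # automatic ARIMA selection
> accuracy(fit.arima)                     # goodness-of-fit measures
> forecast(fit.arima, h=12)               # forecast the next 12 months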

END!
