[Repost] Iteratively Reweighted Least Squares (IRLS)

1. Theory
http://en.wikipedia.org/wiki/Iteratively_reweighted_least_squares (Wikipedia: Iteratively Reweighted Least Squares)
http://support.sas.com/documentation/cdl/en/statug/63347/HTML/default/viewer.htm#statug_logistic_sect033.htm (SAS Online Help)
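For reference, this is the update that the R code below implements: for a generalized linear model with link function $g$ and variance function $V$, each IRLS iteration builds a working response $z$ and working weights $w$ from the current fitted means $\mu_i$, then solves a weighted least-squares problem:

$$z_i = g(\mu_i) + g'(\mu_i)\,(y_i - \mu_i), \qquad w_i = \frac{1}{g'(\mu_i)^2\,V(\mu_i)}, \qquad \beta \leftarrow (X^\top W X)^{-1} X^\top W z .$$

For logistic regression, $g$ is the logit link $g(\mu) = \log\{\mu/(1-\mu)\}$ and $V(\mu) = \mu(1-\mu)$.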
2. Implementation in R
The following R code uses the IRLS algorithm to estimate the parameters of a logistic regression model. The sample data are from Machine Learning Ex4 - Logistic Regression: a dataset of 40 students who were admitted to college and 40 students who were not, together with their grades on two exams. Our mission is to build a binary classification model that estimates a student's chance of college admission from the two exam scores (test1 and test2).
To run the code in R, follow these steps:
1. Download the csv file by clicking the URL:
http://spreadsheets.google.com/pub?key=0AnypY27pPCJydC1vRVEzM1V
2. Run the R scripts to build the model
# reading data
mydata = read.csv("c://stat/ex4.csv", header = TRUE)
# plots
plot(mydata$test1[mydata$admitted == 0], mydata$test2[mydata$admitted == 0],
     xlab = "test1", ylab = "test2", col = "red")
points(mydata$test1[mydata$admitted == 1], mydata$test2[mydata$admitted == 1],
       col = "blue", pch = 2)
legend("bottomright", c("not admitted", "admitted"), pch = c(1, 2),
       col = c("red", "blue"))
[Figure: scatter plot of test2 against test1; not admitted in red circles, admitted in blue triangles]
# build the model
g <- function(x) {        # logit link: log-odds of x
  return(log(x/(1 - x)))
}
g.inv <- function(y) {    # inverse link: logistic function
  return(exp(y)/(1 + exp(y)))
}
g.prime <- function(x) {  # derivative of the link
  return(1/(x * (1 - x)))
}
V <- function(mu) {       # binomial variance function
  return(mu * (1 - mu))
}
niter = 10
mu = rep(mean(mydata$admitted), length(mydata$admitted))  # initial fitted values
for (i in 1:niter) {
  Z = g(mu) + g.prime(mu) * (mydata$admitted - mu)  # working response
  W = 1/(g.prime(mu)^2 * V(mu))                     # working weights
  Z.lm = lm(Z ~ test1 + test2, weights = W, data = mydata)  # weighted least squares
  eta = predict(Z.lm)  # updated linear predictor
  mu = g.inv(eta)      # updated fitted probabilities
  beta = Z.lm$coef
}
# print estimated parameters
print(beta)
(Intercept)       test1       test2
-16.3787434   0.1483408   0.1589085
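As a quick sanity check (not in the original post), R's built-in glm fits the same model, also via an IRLS-type algorithm (Fisher scoring), so its coefficients should agree with beta:
# sanity check: fit the same logistic regression with glm()
fit = glm(admitted ~ test1 + test2, family = binomial, data = mydata)
print(coef(fit))  # should match the IRLS estimates above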
# get 2 points (that will define a line)
x1 = c(min(mydata$test1), max(mydata$test1))
# the decision boundary is where beta[1] + beta[2]*x1 + beta[3]*x2 = 0,
# so given x1 from the sample data, solving for x2 gives:
x2 = (-1/beta[3]) * ((beta[2] * x1) + beta[1])
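The boundary sits where the fitted log-odds equal zero, i.e. where the fitted admission probability is exactly one half, since the inverse logit maps zero to 1/2:

$$g^{-1}(0) = \frac{e^{0}}{1 + e^{0}} = \frac{1}{2}, \qquad x_2 = -\,\frac{\hat\beta_0 + \hat\beta_1 x_1}{\hat\beta_2} .$$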
# plot
plot(x1, x2, type = 'l', xlab = "test1", ylab = "test2")
points(mydata$test1[mydata$admitted == 0], mydata$test2[mydata$admitted == 0],
       col = "red")
points(mydata$test1[mydata$admitted == 1], mydata$test2[mydata$admitted == 1],
       col = "blue", pch = 2)
legend("bottomright", c("not admitted", "admitted"), pch = c(1, 2),
       col = c("red", "blue"))
[Figure: fitted decision boundary line overlaid on the test1 vs. test2 scatter plot]
The original R code is from http://www.stanford.edu/class/stats191/logistic.html, and I changed it a little.
3. Implementation in SAS
proc import datafile="c:\stat\ex4.csv" out=ex4 dbms=csv replace;
run;
proc logistic data=ex4 des; /* des: model the probability that admitted = 1 */
  model admitted = test1 test2;
run; quit;
Analysis of Maximum Likelihood Estimates

| Parameter | DF | Estimate | Standard Error | Wald Chi-Square | Pr > ChiSq |
|---|---|---|---|---|---|
| Intercept | 1 | -16.3787 | 3.6559 | 20.0717 | <.0001 |
| test1 | 1 | 0.1483 | 0.0408 | 13.2009 | 0.0003 |
| test2 | 1 | 0.1589 | 0.0416 | 14.5613 | 0.0001 |
Association of Predicted Probabilities and Observed Responses

| Statistic | Value | Statistic | Value |
|---|---|---|---|
| Percent Concordant | 89.5 | Somers' D | 0.791 |
| Percent Discordant | 10.4 | Gamma | 0.792 |
| Percent Tied | 0.1 | Tau-a | 0.401 |
| Pairs | 1600 | c | 0.896 |
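As an aside (not in the original post), the c statistic that SAS reports is the proportion of the 40 x 40 = 1600 (admitted, not admitted) pairs in which the admitted student has the higher fitted probability, counting ties as one half. A small R sketch, reusing the fitted mu from the IRLS loop above:
# c statistic: concordance over all (admitted, not admitted) pairs
p1 = mu[mydata$admitted == 1]  # fitted probabilities, admitted students
p0 = mu[mydata$admitted == 0]  # fitted probabilities, non-admitted students
d  = outer(p1, p0, "-")        # 1600 pairwise differences
c.stat = mean((d > 0) + 0.5 * (d == 0))
print(c.stat)                  # should be close to the SAS value of 0.896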