R学习资料

标签:
杂谈 |
分类: R软件学习 |
ISU | |
http://math.illinoisstate.edu/dhkim/rstuff/home.gif |
0. R Basics |
0.1. What is R?
R is a software package especially suitable for data analysis and
graphical representation. Functions and results of analysis are all
stored as objects, allowing easy function modification and model
building. R provides the language, tool, and environment in one
convenient package.
It is very flexible and highly customizable. Excellent graphical tools make R an ideal environment for EDA (Exploratory Data Analysis). Since most high level functions are written in R language itself, you can learn the language by studying the function code.
On the other hand, R has a few weaknesses. For example, R is not particularly efficient in handling large data sets. Also, it is rather slow in executing a large number of for loops, compared to compiler languages such as C/C++. Learning curve is somewhat steep compared to "point and click" software.
0.2
There are versions for Unix, Windows, and Macintosh. All of them
are free, and Windows version is
and follow the download instructions.
http://math.illinoisstate.edu/dhkim/rstuff/image4UP.JPG
0.3 Invoking R
If properly installed, usually R has a shortcut icon on the desktop
screen and/or you can find it under Start|Programs|R menu. If not,
search and run the executable file rgui.exe by double clicking from
the search result window.
To quit R, type
http://math.illinoisstate.edu/dhkim/rstuff/imageLRJ.JPG
Commands you entered can be easily recalled and modified. Just by
hitting the arrow keys in the keyboard, you can navigate through
the recently entered commands.
> objects() # list the names of all
objects
>
rm(data1)
1. Graphics: a few examples |
http://math.illinoisstate.edu/dhkim/rstuff/2-1.gif
http://math.illinoisstate.edu/dhkim/rstuff/2-2.gif
http://math.illinoisstate.edu/dhkim/rstuff/2-3.gif
http://math.illinoisstate.edu/dhkim/rstuff/stars.gif
http://math.illinoisstate.edu/dhkim/rstuff/2-4.gif
http://math.illinoisstate.edu/dhkim/rstuff/2-5.gif
Interactive graphics can serve as a great learning tool. Students
can quickly grasp the role of outliers and influential points in a
simple linear regression by the following
example.
> library(tcltk) > demo(tkcanvas) |
http://math.illinoisstate.edu/dhkim/rstuff/2-6.gif
Effect of kernel choice, sample size and bandwidth can be conveniently illustrated by the following demonstration:
> library(tcltk) > demo(tkdensity) |
http://math.illinoisstate.edu/dhkim/rstuff/2-7.gif
2. Basic Operations |
First of all, R can be used as an ordinary calculator. There are a few examples:
> 2 + 3 *
5
> log
(10)
>
4^2
>
3/2
> sqrt
(16)
> abs
(3-7)
>
pi
>
exp(2)
> 15 %/%
4
>
Assignment operator (<-) stores the value
(object) on the right side of (<-) expression in the
left side. Once assigned, the object can be used just as an
ordinary component of the computation. To find out what the object
looks like, simply type its name. Note that R is case sensitive,
e.g., object
names
> x<- log(2.843432)
*pi
> x
[1] 3.283001
> sqrt(x)
[1] 1.811905
>
floor(x)
[1] 3
>
ceiling(x)
[1] 4
R can handle complex numbers, too.
>
x<-3+2i
>
Re(x)
[1] 3
>
Im(x)
[1] 2
> y<-
-1+1i
> x+y
[1] 2+3i
> x*y
[1] -5+1i
Important note: since there are many built-in functions in
R, make sure that the new object names you assign are not already
used by the system. A simple way of checking this is to type in the
name you want to use. If the system returns an error message
telling you that such object is not found, it is safe to use the
name. For example,
2.2 Vector
R handles vector objects quite easily and intuitively.
>
x<-c(1,3,2,10,5)
> x
[1]
>
y<-1:5
> y
[1] 1 2 3 4 5
>
y+2
[1] 3 4 5 6 7
>
2*y
[1]
>
y^2
[1]
>
2^y
[1]
>
y
[1] 1 2 3 4 5
>
y<-y*2
>
y
[1]
More examples of vector arithmetic:
> x<-c(1,3,2,10,5);
y<-1:5 #two or more statements are separated by
semicolons
> x+y
[1]
> x*y
[1]
> x/y
[1] 1.0000000 1.5000000 0.6666667 2.5000000
1.0000000
> x^y
[1]
>
sum(x)
[1] 21
>
cumsum(x)
[1]
>
diff(x)
[1]
>
diff(x,2)
[1] 1 7 3
>
max(x)
[1] 10
>
min(x)
[1] 1
Sorting can be done
using
> x
[1]
>
sort(x)
[1]
> sort(x, decreasing=T)
[1] 10
Component extraction is a very important part of vector
calculation.
> x
[1]
>
length(x)
[1] 5
>
x[3]
[1] 2
>
x[3:5]
[1]
>
x[-2]
[1]
>
x[x>3]
[1] 10
Logical vector can be handy:
>
x>3
[1] FALSE FALSE FALSE
>
as.numeric(x>3)
[1] 0 0 0 1 1
>
sum(x>3)
[1] 2
> (1:length(x))[x<=2] # indices
of x whose components are less than or equal to
2
[1] 1 3
> z<-as.logical(c(1,0,0,1)) #
numeric to logical vector conversion
> z
[1]
Character vector:
> colors<-c("green", "blue",
"orange", "yellow", "red")
> colors
[1] "green"
Individual components can be named and referenced by their
names.
>
names(x)
NULL
>
names(x)<-colors
> names(x)
[1] "green"
> x
>
x["green"]
green
>
names(x)<-NULL
> x
[1]
seq()
and
> seq(10)
>
seq(0,1,length=10)
> seq(0,1,by=0.1)
> rep(1,3)
[1] 1 1 1
>
c(rep(1,3),rep(2,2),rep(-1,4))
[1]
> rep("Small",3)
[1] "Small" "Small" "Small"
>
c(rep("Small",3),rep("Medium",4))
[1] "Small"
>
rep(c("Low","High"),3)
[1] "Low"
2.3 Matrices
A matrix refers to a numeric array of rows and columns. One of the
easiest ways to create a matrix is to combine vectors of equal
length using cbind(), meaning "column
bind":
> x
[1]
> y
[1] 1 2 3 4 5
>
m1<-cbind(x,y);m1
[1,]
[2,]
[3,]
[4,] 10 4
[5,]
>
t(m1)
x
y
>
m1<-t(cbind(x,y))
>
dim(m1)
[1] 2 5
>
m1<-rbind(x,y)
Of course you can directly list the elements and specify the
matrix:
>
m2<-matrix(c(1,3,2,5,-1,2,2,3,9),nrow=3);m2
[1,]
[2,]
[3,]
Note that the elements are used to fill the first column, then the
second column and so on. To fill row-wise, we
specify
>
m2<-matrix(c(1,3,2,5,-1,2,2,3,9),ncol=3,byrow=T);m2
[1,]
[2,]
[3,]
Extracting the component of a matrix involves one or two
indices.
> m2
[1,]
[2,]
[3,]
>
m2[2,3]
[1] 2
>
m2[2,]
[1]
>
m2[,3]
[1] 2 2 9
>
m2[-1,]
[1,]
[2,]
>
m2[,-1]
[1,]
[2,]
[3,]
>
m2[-1,-1]
[1,]
[2,]
Matrix computation is usually done
component-wise.
> m1<-matrix(1:4, ncol=2);
m2<-matrix(c(10,20,30,40),ncol=2)
>
2*m1
[1,]
[2,]
>
m1+m2
[1,]
[2,]
>
m1*m2
[1,]
[2,]
Note that m1*m2 is NOT the usual matrix multiplication. To do the
matrix multiplication, you should
use
> m1 %*% m2
[1,]
[2,]
>
solve(m1)
[1,]
[2,]
>
solve(m1)%*%m1
>
diag(3)
[1,]
[2,]
[3,]
>
diag(c(2,3,3))
[1,]
[2,]
[3,]
Eigenvalues and eigenvectors of a matrix is handled
by
> eigen(m2)
$values
[1] 53.722813 -3.722813
$vectors
[1,] -0.5657675 -0.9093767
[2,] -0.8245648
2.4 Finding roots: a simple example
A built-in R
function
> y.fun<-function
(x) {y<-(log(x))^2-x*exp(-x^3) } > root.fun<- function
() > root.fun() |
http://math.illinoisstate.edu/dhkim/rstuff/rootfind.gif
2.5 Data frame
Data frame is an array consisting of columns of various mode
(numeric, character, etc). Small to moderate size data frame can be
constructed
by
Make | Model | Cylinder | Weight | Mileage | Type |
Honda | Civic | V4 | 2170 | 33 | Sporty |
Chevrolet |
Beretta | V4 | 2655 | 26 | Compact |
Ford | Escort | V4 | 2345 | 33 | Small |
Eagle | Summit | V4 | 2560 | 33 | Small |
Volkswagen | Jetta | V4 | 2330 | 26 | Small |
Buick | Le Sabre | V6 | 3325 | 23 | Large |
Mitsubishi | Galant | V4 | 2745 | 25 | Compact |
Dodge | Grand Caravan | V6 | 3735 | 18 | Van |
Chrysler | New Yorker | V6 | 3450 | 22 | Medium |
Acura | Legend | V6 | 3265 | 20 | Medium |
>
Make<-c("Honda","Chevrolet","Ford","Eagle","Volkswagen","Buick","Mitsbusihi",
+ "Dodge","Chrysler","Acura")
>
Model<-c("Civic","Beretta","Escort","Summit","Jetta","Le
Sabre","Galant",
+ "Grand Caravan","New Yorker","Legend")
Note that the plus sign (+) in the above commands are automatically
inserted when the carriage return is pressed without completing the
list. Save some typing by
using
>
Cylinder<-c(rep("V4",5),"V6","V4",rep("V6",3))
> Cylinder
>
Weight<-c(2170,2655,2345,2560,2330,3325,2745,3735,3450,3265)
>
Mileage<-c(33,26,33,33,26,23,25,18,22,20)
>
Type<-c("Sporty","Compact",rep("Small",3),"Large","Compact","Van",rep("Medium",2))
Now
>
Car<-data.frame(Make,Model,Cylinder,Weight,Mileage,Type)
> Car
1
2
3
4
5
6
7
8
9
10
> names(Car)
[1]
"Make"
Just as in matrix objects, partial information can be easily
extracted from the data frame:
> Car[1,]
1 Honda
Civic
In addition, individual columns can be referenced by their
labels:
> Car$Mileage
>
Car[,5]
>
mean(Car$Mileage)
[1] 25.9
> min(Car$Weight)
[1] 2170
table() command gives a frequency table:
> table(Car$Type)
Compact
If the proportion is desired, type the following command
instead:
> table(Car$Type)/10
Compact
Note that the values were divided by 10 because there are that many
vehicles in total. If you don't want to count them each time, the
following does the trick:
> table(Car$Type)/length(Car$Type)
Cross tabulation is very easy, too:
> table(Car$Make, Car$Type)
What if you want to arrange the data set by vehicle
weight?
>
i<-order(Car$Weight);i
> Car[i,]
1
5
3
4
2
7
10
6
9
8
2.6 Creating/editing data objects
> y
[1] 1 2 3 4 5
If you want to modify the data object, use edit() function and
assign it to an object. For example, the following command opens
notepad for editing. After editing is done, choose File | Save and
Exit from Notepad.
> y<-edit(y)
If you prefer entering the data.frame in a spreadsheet style data
editor, the following command invokes the built-in editor with an
empty spreadsheet.
>
data1<-edit(data.frame())
After entering a few data points, it looks like
this:
http://math.illinoisstate.edu/dhkim/rstuff/imageJAK.JPG
You can also change the variable name by clicking once on the cell
containing it. Doing so opens a dialog box:
http://math.illinoisstate.edu/dhkim/rstuff/imageC5K.JPG
When finished, click
> data1
3. More on R Graphics |
3.1 Histogram
We will use a data set
>
fuel.frame<-read.table("c:/fuel-frame.txt",
header=T, sep=",")
>
names(fuel.frame)
[1] "row.names"
"Weight"
>
attach(fuel.frame)
attach()
In general, graphic functions are very flexible and intuitive to
use. For
example,
> hist(Mileage)
> hist(Mileage,
freq=F)
http://math.illinoisstate.edu/dhkim/rstuff/imageC2M.JPG
Let us look at the Old Faithful geyser data, which is a built-in R
data set.
> data(faithful)
> attach(faithful)
> names(faithful)
[1] "eruptions" "waiting"
> hist(eruptions, seq(1.6, 5.2, 0.2),
prob=T)
> lines(density(eruptions,
bw=0.1))
> rug(eruptions,
side=1)
http://math.illinoisstate.edu/dhkim/rstuff/faithful.gif
3.2 Boxplot
>
boxplot(Weight)
> boxplot(Weight,
horizontal=T)
> rug(Weight,
side=2)
http://math.illinoisstate.edu/dhkim/rstuff/boxplot.gif
If you want to get the statistics involved in the boxplots, the
following commands show them. In this
example,
> a<-boxplot(Weight,
plot=F)
> a$stats
[1,] 1845.0
[2,] 2567.5
[3,] 2885.0
[4,] 3242.5
[5,] 3855.0
>
a
>
fivenum(Weight)
[1] 1845.0 2567.5 2885.0 3242.5 3855.0
Boxplot is more useful when comparing grouped data. For example,
side-by-side boxplots of weights grouped by vehicle types are shown
below:
> boxplot(Weight
~Type)
> title("Weight by Vehicle Types")
http://math.illinoisstate.edu/dhkim/rstuff/imageMHQ.JPG
On-line help is available for the commands:
> help(hist)
> help(boxplot)
3.3 plot()
plot()
> plot(Weight)
The following command produce a scatterplot with Weight on the
x-axis and Mileage on the y-axis.
> plot(Weight, Mileage, main="Weight vs.
Mileage")
A fitted straight line is shown in the plot by executing two more
commands.
>
fit<-lm(Mileage~Weight)
> abline(fit)
http://math.illinoisstate.edu/dhkim/rstuff/linefit.gif
3.4 matplot()
matplot()
> y60<-c(316.27, 316.81, 317.42,
318.87, 319.87, 319.43, 318.01, 315.74, 314.00, 313.68, 314.84,
316.03)
> y70<-c(324.89, 325.82, 326.77,
327.97, 327.91, 327.50, 326.18, 324.53, 322.93, 322.90, 323.85,
324.96)
> y80<-c(337.84, 338.19, 339.91,
340.60, 341.29, 341.00, 339.39, 337.43, 335.72, 335.84, 336.93,
338.04)
> y90<-c(353.50, 354.55, 355.23,
356.04, 357.00, 356.07, 354.67, 352.76, 350.82, 351.04, 352.69,
354.07)
> y97<-c(363.23, 364.06, 364.61,
366.40, 366.84, 365.68, 364.52, 362.57, 360.24, 360.83, 362.49,
364.34)
> CO2<-data.frame(y60, y70, y80,
y90, y97)
> row.names(CO2)<-c("Jan", "Feb",
"Mar","Apr","May","Jun","Jul","Aug","Sep","Oct","Nov","Dec")
> CO2
Jan 316.27 324.89 337.84 353.50
363.23
Feb 316.81 325.82 338.19 354.55
364.06
Mar 317.42 326.77 339.91 355.23
364.61
Apr 318.87 327.97 340.60 356.04
366.40
May 319.87 327.91 341.29 357.00
366.84
Jun 319.43 327.50 341.00 356.07
365.68
Jul 318.01 326.18 339.39 354.67
364.52
Aug 315.74 324.53 337.43 352.76
362.57
Sep 314.00 322.93 335.72 350.82
360.24
Oct 313.68 322.90 335.84 351.04
360.83
Nov 314.84 323.85 336.93 352.69
362.49
Dec 316.03 324.96 338.04 354.07
364.34
> matplot(CO2)
Note that the observations labeled 1 represents the monthly CO2
levels for 1960, 2 represents those for 1970, and so on. We can
enhance the plot by changing the line types and adding axis labels
and titles:
>
matplot(CO2,axes=F,frame=T,type='b',ylab="")
> #axes=F: initially do not draw
axis
> #frame=T: box around the plot is
drawn;
> #type=b: both line and character represent a
seris;
> #ylab="": No label for y-axis is
shown;
> #ylim=c(310,400): Specify the y-axis
range
> axis(2) # put numerical annotations at the
tickmarks in y-axis;
> axis(1, 1:12,
row.names(CO2))
> # use the Monthly names for the tickmarks in
x-axis; length is 12;
>
title(xlab="Month")
> title(ylab="CO2 (ppm)")#label for
y-axis;
> title("Monthly CO2 Concentration \n for 1960,
1970, 1980, 1990 and 1997")
> # two-line title for the
matplot
http://math.illinoisstate.edu/dhkim/rstuff/image5Q6.JPG
4. Plot Options |
You can have more than one plot in a graphic window. For example,
4.2 Adjusting graphical parameters
4.2.1 Labels and title; axis limits
Any plot benefits from clear and concise labels which greatly
enhances the readability.
>
If the main title is too long, you can split it into two and adding
a subtitle below the horizontal axis label is
easy:
> title(main="Title is too
long \n so split it into two",sub="subtitle goes
here")
http://math.illinoisstate.edu/dhkim/rstuff/plottitle.gif
By default, when you issue a plot command R inserts variable
name(s) if it is available and figures out the range of x axis and
y axis by itself. Sometimes you may want to change
these:
> plot(Fuel, Weight,
ylab="Weight in pounds", ylim=c(1000,6000))
Similarly, you can specify xlab and xlim to change x-axis. If you
do not want the default labels to appear,
specify
> plot(Mileage, Weight,
xlab="Miles per gallon", ylab="Weight in pounds",
xlim=c(20,30),ylim=c(2000,4000))
> title(main="Weight
versus Mileage \n data=fuel.frame;", sub="Figure
4.1")
4.2.2 Types for plots and lines
In a series plot (especially time series plot), type provides
useful options:
>
par(mfrow=c(2,2))
> plot(Fuel, type="l");
title("lines")
> plot(Fuel, type="b");
title("both")
> plot(Fuel, type="o");
title("overstruck")
> plot(Fuel, type="h");
title("high density")
http://math.illinoisstate.edu/dhkim/rstuff/plottype.gif
Also you can specify the line types using lty argument within
plot() command:
> plot(Fuel, type="l",
lty=1)
> plot(Fuel, type="l",
lty=2)
> plot(Fuel, type="l",
lty=1); title(main="Fuel data",
sub="lty=1")
> plot(Fuel, type="l",
lty=2); title(main="Fuel data",
sub="lty=2")
> plot(Fuel, type="l",
lty=3); title(main="Fuel data",
sub="lty=3")
> plot(Fuel, type="l",
lty=4); title(main="Fuel data",
sub="lty=4")
http://math.illinoisstate.edu/dhkim/rstuff/linetype.gif
Note that we can control the thickness of the lines
by
4.3 Colors and characters
You can change the color by specifying
> plot(Fuel,
col=2)
which shows a plot with different color. The default
is
> plot(Fuel,
pch="*")
> plot(Fuel,
pch="M")
4.4 Controlling axis line
bty ="n"; No box is drawn around the plot, although the x and y
axes are still drawn.
bty="o"; The default box type; draws a four-sided box around the
plot.
bty="c"; Draws a three-sided box around the plot in the shape of an
uppercase "C."
bty="l"; Draws a two-sided box around the plot in the shape of an
uppercase "L."
bty="7"; Draws a two-sided box around the plot in the shape of a
square numeral "7."
> par(mfrow =
c(2,2))
>
plot(Fuel)
> plot(Fuel,
bty="l")
> plot(Fuel,
bty="7")
> plot(Fuel,
bty="c")
4.5 Controlling tick marks
tck
> plot(Fuel,
main="Default")
> plot(Fuel, tck=0.05,
main="tck=0.05")
> plot(Fuel, tck=1,
main="tck=1")
> plot(Fuel, axes=F,
main="Different tick marks for each
axis")
> #axes=F suppresses the
drawing of axis
> axis(1)# draws
x-axis.
> axis(2, tck=1, lty=2) #
draws y-axis with horizontal grid of dotted
line
> box()# draws box around
the remaining sides.
4.6 Legend
legend()
In the following example, the legend() command
says
(1) put a box whose upper left corner coordinates
are
(2) write the two texts Fuel and Smoothed Fuel within the box
together with corresponding symbols described in pch and lty
arguments.
>par(mfrow =
c(1,1))
>plot(Fuel)
>lines(lowess(Fuel))
>legend(30,3.5,
c("Fuel","Smoothed Fuel"), pch="* ",
lty=c(0,1))
http://math.illinoisstate.edu/dhkim/rstuff/legend.gif
If you want to keep the legend box from appearing, add bty="n" to
the legend command.
4.7 Putting text to the plot; controlling the text size
mtext()
> plot(Fuel, xlab=" ",
ylab=" ", type="n")
> mtext("Text on side 1,
cex=1", side=1,cex=1)
> mtext("Text on side 2,
cex=1.2", side=2,cex=1.2)
> mtext("Text on side 3,
cex=1.5", side=3,cex=1.5)
> mtext("Text on side 4,
cex=2", side=4,cex=2)
> text(15, 4.3, "text(15,
4.3)")
> text(35, 3.5, adj=0,
"text(35, 3.5), left aligned")
> text(40, 5, adj=1,
"text(40, 5), right aligned")
http://math.illinoisstate.edu/dhkim/rstuff/graph13.jpg
4.8 Adding symbols to plots
abline()
abline(a,b)
abline(h=30)
abline(v=12)
4.9 Adding arrow and line segment
> plot(Mileage,
Fuel)
>
arrows(23,3.5,24.5,3.9)
>
segments(31.96713,3.115541,
29.97309,3.309592)
> title("arrow and
segment")
> text(23,3.4,"Chrysler Le
Baron V6", cex=0.7)
http://math.illinoisstate.edu/dhkim/rstuff/arrow.gif
4.10 Identifying plotted points
While examining a plot, identifying a data point such as possible
outliers can be achieved
using
>
plot(Fuel)
> identify(Fuel,
n=3)
After pressing return, R waits for you to identify (n=3) points
with the mouse. Moving the mouse cursor over the graphics window
and click on a data point. Then the observation number appears next
to the point, thus making the point identifiable.
4.11 Managing graphics windows
Normally high level graphic commands (hist(), plot(),
boxplot(),
> for (i in 1:3)
win.graph()
> dev.list()
windows windows windows
> dev.cur()
windows
>
dev.set(3)
windows
>
dev.cur()
windows
>
dev.off()
windows
> dev.list()
windows windows
>
graphics.off()
> dev.list()
NULL
5. Statistical Analysis |
summary()
>
fuel.frame<-read.table("c:/fuel_frame.txt",
header=T, sep=",")
> names(fuel.frame)
[1] "row.names"
"Weight"
>
attach(fuel.frame)
> summary(Mileage)
>
summary(fuel.frame)
var()
> var(Mileage)
[1] 22.95904
> sd(Mileage)
[1] 4.791559
>
cor(Mileage,Weight)
[1] -0.8478541
5.2 Empirical distribution function
>
library(stepfun)
>
F20<-rnorm(20)
> plot.ecdf(F20,main="Empirical distribution
function")
http://math.illinoisstate.edu/dhkim/rstuff/image1OD.JPG
5.3 One sample and two sample t
tests
Recall the CO2 data (CO2 concentration in the atmosphere). The
following command performs a one-sample t-test whether the average
CO2 level for the year 1960 is 320 ppm. By default, it does a
two-sided test and extremely small p-value indicates that the null
hypothesis is rejected for any reasonable choice of alpha.
> t.test(CO2$y60, mu=320)
data:
t = -5.5183, df = 11, p-value =
0.0001812
alternative hypothesis: true mean is not equal to
320
95 percent confidence interval:
sample estimates:
mean of x
Now we perform a two-sample independent t-test of equal mean for the CO2 level of 1960 and 1970. We assume that the variances for the two populations are equal. The average concentrations are significantly different, just as the test shows.
> t.test(CO2$y60, CO2$y70,var.equal=T)
data:
t = -11.1522, df = 22, p-value =
1.602e-10
alternative hypothesis: true difference in means is not equal
to 0
95 percent confidence interval:
sample estimates:
mean of x mean of y
Paired t-test is also available. All you have to do is to
include
5.4 Checking normality
Quite a few statistical tests are based on the normality of the
underlying population. Here we illustrate normal plot and
Kolmogorov-Smirnov test to check the normality assumption.
> #generate 500 observations from uniform (0,1)
distribution
>
F500<-runif(500);a<-c(mean(F500),sd(F500))
>
qqnorm(F500)
>
qqline(F500)
http://math.illinoisstate.edu/dhkim/rstuff/image8RF.JPG
Obviously the curve is far from the straight line so we strongly
suspect the normality (if we didn't know that the generated data
came from uniform). We formally test the normality by performing
Kolmogorov-Smirnov test, comparing the empirical distribution of
F500 to a comparable normal distribution with the mean and standard
deviation same as that of F500.
> ks.test(F500, "pnorm", mean=a[1],
sd=a[2])
data:
D = 0.0655, p-value = 0.02742
alternative hypothesis: two.sided
5.5 Analysis of variance
ANOVA is an extension of a two-sample t test, testing the equality
of means of more than two groups. In the example below, we
use
>
a<-aov(Weight~Type)
> summary(a)
Type
Residuals
---
Signif. codes:
> boxplot(Weight~Type) #side-by-side
boxplot
5.6 Linear regression model
lm()
>
attach(fuel.frame)
> names(fuel.frame)
[1] "row.names"
"Weight"
>
fit1<-lm(Mileage~Weight+Disp.)
>
fit1
Call:
lm(formula = Mileage ~ Weight + Disp.)
Coefficients:
(Intercept)
>
summary(fit1)
Call:
lm(formula = Mileage ~ Weight + Disp.)
Residuals:
-4.5726 -1.5814 -0.2569
Coefficients:
(Intercept) 44.379702
Weight
Disp.
---
Signif. codes:
Residual standard error: 2.485 on 57 degrees of
freedom
Multiple R-Squared:
0.7402,
F-statistic: 81.21 on 2 and 57 DF,
> names(fit1)
> plot(fit1$fitted.values,fit1$residuals) #residual plot
If no intercept term is desired, use the following
command:
> fit2<-lm(Mileage~
-1+Weight+Disp.)
Usual diagnostic plots can be obtained by reguesting
plot(fit1):
> par(mfrow=c(2,2))
> plot(fit1)
http://math.illinoisstate.edu/dhkim/rstuff/plotfit.gif
6. Miscellanies |
Small to moderate size data sets can be easily handled using tools presented so far. However, quite often we have a garden variety of data sources from data handling programs. By far the easiest way to import and export the data in R is using text files. Save the data in plain text format which may be imported to a different software. That way, you can easily view the data using any of the capable text editor even when the original software that produced the data is no longer available.
write.table()
>
CO2
> write.table(CO2, file="c:/CO2.txt", sep="
")
http://math.illinoisstate.edu/dhkim/rstuff/imageUVK.JPG
On the other
hand,
>
data1<-read.table("c:/file.dat",
header=TRUE)
getwd()
> getwd()
[1] "C:\\Program Files\\R\\rw1070"
>
setwd("c:/")
> getwd()
[1] "c:\\"
>
read.table(file="CO2.txt")
> # now pathname is not required to read data
files in the root directory
6.2 Saving graphical output
http://math.illinoisstate.edu/dhkim/rstuff/imageUSC.JPG
Right clicking anywhere inside the active graphics window shows a context sensitive menu, allowing either saving the plot as metafile (EMF) or postscript format (PS). On the other hand, Copy as metafile or Copy as bitmap (BMP) puts the information in the clipboard, a temporary memory area used by Windows. In the latter, you need to immediately paste it in some applications which understand the graphics format, e.g., MS Word. More graphical formats are available from the main menu. While the graphic window is active, click File| Save As from the menu and it lists six file formats (metafile, postscript, PDF, PNG, BMP, and JPG at three quality levels) in total so you have plenty of choices.
Some comments on the choice of graphic formats are in order. In general metafile format retains graphic quality even when it is resized in the application. On the other hand, JPG is a very popular choice on the Internet and file size is usually much smaller than metafile. Except for rare circumstances, I would not recommend BMP file format because it is usually very large and shows very poor picture quality when resized. Postscript file format is useful when including the graphic file in another postscript file or when postscript printer is available. Picture quality does not deteriorate when resized, and it is the default file format to be included in TeX documents.
6.3 Missing values
NA (Not Applicable) is used to denote missing values. Since many
functions returns NA if missing values are present, we illustrate
how to handle them.
>
x
[1]
>
mean(x)
[1] NA
>
is.na(x)
[1] FALSE FALSE FALSE FALSE FALSE
>
sum(is.na(x))
[1] 1
>
x1<-x[!is.na(x)];x1
[1] 1 2 3 4 5
>
a<-mean(x[!is.na(x)]);a
[1] 3
>
x2<-x
>
x2[is.na(x)]<-a;x2
[1] 1 2 3 4 5 3
The following example shows how to select those listwise nonmissing
cases.
> data2
1
2
3
4
5
>
a1<-!is.na(data2$var1);a1
[1]
>
a2<-!is.na(data2$var2);a2
[1] TRUE TRUE TRUE TRUE TRUE
>
a3<-!is.na(data2$var3);a3
[1]
>
data3<-data2[a1*a2*a3==1,]
> #select those rows if all of the elements are
nonmissing.
> data3
1
3
5
6.4 Getting help
By default, R has a couple of excellent manuals in PDF format. "An
Introduction to R" is almost a required reading to begin using R.
To access the manual, click Help | Manuals and the list of
available documents will be shown. Also use help() to get
command-specific information.
> help(read.table)
|
http://math.illinoisstate.edu/dhkim/rstuff/home.gif |