Big Data, R and SAP HANA: Analyze 200 Million Data Points and Later Visualize in HTML5 Using D3 – Part III
29 days ago by rahuldave
(This article was first published on All Things R, and kindly contributed to R-bloggers)
Mash-up Airlines Performance Data with Historical Weather Data to Pinpoint Weather Related DelaysFor this exercise, I combined following four separate blogs that I did on BigData, R and SAP HANA. Historical airlines and weather data were used for the underlying analysis. The aggregated output of this analysis was outputted in JSON which was visualized in HTML5, D3 and Google Maps. The previous blogs on this series are:Big Data, R and SAP HANA: Analyze 200 Million Data Points and Later Visualize in HTML5 Using D3 - Part IIBig Data, R and HANA: Analyze 200 Million Data Points and Later Visualize Using Google MapsGetting Historical Weather Data in R and SAP HANA Tracking SFO Airport's Performance Using R, HANA and D3In this blog, I wanted to mash-up disparate data sources in R and HANA by combining airlines data with weather data to understand the reasons behind the airport/airlines delay. Why weather - because weather is one of the commonly cited reasons in the airlines industry for flight delays. Fortunately, the airlines data breaks up the delay by weather, security, late aircraft etc., so weather related delays can be isolated and then the actual weather data can be mashed-up to validate the airlines' claims. However, I will not be doing this here, I will just be displaying the mashed-up data.I have intentionally focused on the three bay-area airports and have used last 4 years of historical data to visualize the airport's performance using a HTML5 calendar built from scratch using D3.js. One can use all 20 years of data and for all the airports to extend this example. I had downloaded historical weather data for the same 2005-2008 period for SFO and SJC airports as shown in my previous blog (For some strange reasons, there is no weather data for OAK, huh?). Here is how the final result will look like in HTML5:Click here to interact with the live example. Hover over any cell in the live example and a tool tip with comprehensive analytics will show the break down of the performance delay for the selected cell including weather data and correct icons* - result of a mash-up. Choose a different airport from the drop-down to change the performance calendar. * Weather icons are properties of Weather Underground.As anticipated, SFO airport had more red on the calendar than SJC and OAK. SJC definitely is the best performing airport in the bay-area. Contrary to my expectation, weather didn't cause as much havoc on SFO as one would expect, strange?Creating a mash-up in R for these two data-sets was super easy and a CSV output was produced to work with HTML5/D3. Here is the R code and if it not clear from all my previous blogs: I just love data.table package.########################################################################################### # Percent delayed flights from three bay area airports, a break up of the flights delay by various reasons, mash-up with weather data########################################################################################### baa.hp.daily.flights <- baa.hp[,list( TotalFlights=length(DepDelay), CancelledFlights=sum(Cancelled, na.rm=TRUE)), by=list(Year, Month, DayofMonth, Origin)]setkey(baa.hp.daily.flights,Year, Month, DayofMonth, Origin)baa.hp.daily.flights.delayed <- baa.hp[DepDelay>15, list(DelayedFlights=length(DepDelay), WeatherDelayed=length(WeatherDelay[WeatherDelay>0]), AvgDelayMins=round(sum(DepDelay, na.rm=TRUE)/length(DepDelay), digits=2), CarrierCaused=round(sum(CarrierDelay, na.rm=TRUE)/sum(DepDelay, na.rm=TRUE), digits=2), WeatherCaused=round(sum(WeatherDelay, na.rm=TRUE)/sum(DepDelay, na.rm=TRUE), digits=2), NASCaused=round(sum(NASDelay, na.rm=TRUE)/sum(DepDelay, na.rm=TRUE), digits=2), SecurityCaused=round(sum(SecurityDelay, na.rm=TRUE)/sum(DepDelay, na.rm=TRUE), digits=2), LateAircraftCaused=round(sum(LateAircraftDelay, na.rm=TRUE)/sum(DepDelay, na.rm=TRUE), digits=2)), by=list(Year, Month, DayofMonth, Origin)]setkey(baa.hp.daily.flights.delayed, Year, Month, DayofMonth, Origin)# Merge two data-tablesbaa.hp.daily.flights.summary <- baa.hp.daily.flights.delayed[baa.hp.daily.flights,list(Airport=Origin, TotalFlights, CancelledFlights, DelayedFlights, WeatherDelayed, PercentDelayedFlights=round(DelayedFlights/(TotalFlights-CancelledFlights), digits=2), AvgDelayMins, CarrierCaused, WeatherCaused, NASCaused, SecurityCaused, LateAircraftCaused)]setkey(baa.hp.daily.flights.summary, Year, Month, DayofMonth, Airport)# Merge with weather databaa.hp.daily.flights.summary.weather <-baa.weather[baa.hp.daily.flights.summary]baa.hp.daily.flights.summary.weather$Date <- as.Date(paste(baa.hp.daily.flights.summary.weather$Year, baa.hp.daily.flights.summary.weather$Month, baa.hp.daily.flights.summary.weather$DayofMonth, sep="-"),"%Y-%m-%d")# remove few columnsbaa.hp.daily.flights.summary.weather <- baa.hp.daily.flights.summary.weather[, which(!(colnames(baa.hp.daily.flights.summary.weather) %in% c("Year", "Month", "DayofMonth", "Origin"))), with=FALSE]#Write the output in both JSON and CSV file formatsobjs <- baa.hp.daily.flights.summary.weather[, getRowWiseJson(.SD), by=list(Airport)]# You have now (Airportcode, JSONString), Once again, you need to attach them together.row.json <- apply(objs, 1, function(x) paste('{\"AirportCode\":"', x[1], '","Data\":', x[2], '}', sep=""))json.st <- paste('[', paste(row.json, collapse=', '), ']')writeLines(json.st, "baa-2005-2008.summary.json") write.csv(baa.hp.daily.flights.summary.weather, "baa-2005-2008.summary.csv", row.names=FALSE)Happy Coding!
To leave a comment for the author, please follow the link and comment on his blog: All Things R.
R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series,ecdf, trading) and more...
R_bloggers
business_analytics
business_intelligence
cloud
D3
D3.js
Google
google_maps
HANA
HTML5
predictive
R
r-project
RApache
SAP
SFO
SJC
TechEd
twitter
from google
Mash-up Airlines Performance Data with Historical Weather Data to Pinpoint Weather Related DelaysFor this exercise, I combined following four separate blogs that I did on BigData, R and SAP HANA. Historical airlines and weather data were used for the underlying analysis. The aggregated output of this analysis was outputted in JSON which was visualized in HTML5, D3 and Google Maps. The previous blogs on this series are:Big Data, R and SAP HANA: Analyze 200 Million Data Points and Later Visualize in HTML5 Using D3 - Part IIBig Data, R and HANA: Analyze 200 Million Data Points and Later Visualize Using Google MapsGetting Historical Weather Data in R and SAP HANA Tracking SFO Airport's Performance Using R, HANA and D3In this blog, I wanted to mash-up disparate data sources in R and HANA by combining airlines data with weather data to understand the reasons behind the airport/airlines delay. Why weather - because weather is one of the commonly cited reasons in the airlines industry for flight delays. Fortunately, the airlines data breaks up the delay by weather, security, late aircraft etc., so weather related delays can be isolated and then the actual weather data can be mashed-up to validate the airlines' claims. However, I will not be doing this here, I will just be displaying the mashed-up data.I have intentionally focused on the three bay-area airports and have used last 4 years of historical data to visualize the airport's performance using a HTML5 calendar built from scratch using D3.js. One can use all 20 years of data and for all the airports to extend this example. I had downloaded historical weather data for the same 2005-2008 period for SFO and SJC airports as shown in my previous blog (For some strange reasons, there is no weather data for OAK, huh?). Here is how the final result will look like in HTML5:Click here to interact with the live example. Hover over any cell in the live example and a tool tip with comprehensive analytics will show the break down of the performance delay for the selected cell including weather data and correct icons* - result of a mash-up. Choose a different airport from the drop-down to change the performance calendar. * Weather icons are properties of Weather Underground.As anticipated, SFO airport had more red on the calendar than SJC and OAK. SJC definitely is the best performing airport in the bay-area. Contrary to my expectation, weather didn't cause as much havoc on SFO as one would expect, strange?Creating a mash-up in R for these two data-sets was super easy and a CSV output was produced to work with HTML5/D3. Here is the R code and if it not clear from all my previous blogs: I just love data.table package.########################################################################################### # Percent delayed flights from three bay area airports, a break up of the flights delay by various reasons, mash-up with weather data########################################################################################### baa.hp.daily.flights <- baa.hp[,list( TotalFlights=length(DepDelay), CancelledFlights=sum(Cancelled, na.rm=TRUE)), by=list(Year, Month, DayofMonth, Origin)]setkey(baa.hp.daily.flights,Year, Month, DayofMonth, Origin)baa.hp.daily.flights.delayed <- baa.hp[DepDelay>15, list(DelayedFlights=length(DepDelay), WeatherDelayed=length(WeatherDelay[WeatherDelay>0]), AvgDelayMins=round(sum(DepDelay, na.rm=TRUE)/length(DepDelay), digits=2), CarrierCaused=round(sum(CarrierDelay, na.rm=TRUE)/sum(DepDelay, na.rm=TRUE), digits=2), WeatherCaused=round(sum(WeatherDelay, na.rm=TRUE)/sum(DepDelay, na.rm=TRUE), digits=2), NASCaused=round(sum(NASDelay, na.rm=TRUE)/sum(DepDelay, na.rm=TRUE), digits=2), SecurityCaused=round(sum(SecurityDelay, na.rm=TRUE)/sum(DepDelay, na.rm=TRUE), digits=2), LateAircraftCaused=round(sum(LateAircraftDelay, na.rm=TRUE)/sum(DepDelay, na.rm=TRUE), digits=2)), by=list(Year, Month, DayofMonth, Origin)]setkey(baa.hp.daily.flights.delayed, Year, Month, DayofMonth, Origin)# Merge two data-tablesbaa.hp.daily.flights.summary <- baa.hp.daily.flights.delayed[baa.hp.daily.flights,list(Airport=Origin, TotalFlights, CancelledFlights, DelayedFlights, WeatherDelayed, PercentDelayedFlights=round(DelayedFlights/(TotalFlights-CancelledFlights), digits=2), AvgDelayMins, CarrierCaused, WeatherCaused, NASCaused, SecurityCaused, LateAircraftCaused)]setkey(baa.hp.daily.flights.summary, Year, Month, DayofMonth, Airport)# Merge with weather databaa.hp.daily.flights.summary.weather <-baa.weather[baa.hp.daily.flights.summary]baa.hp.daily.flights.summary.weather$Date <- as.Date(paste(baa.hp.daily.flights.summary.weather$Year, baa.hp.daily.flights.summary.weather$Month, baa.hp.daily.flights.summary.weather$DayofMonth, sep="-"),"%Y-%m-%d")# remove few columnsbaa.hp.daily.flights.summary.weather <- baa.hp.daily.flights.summary.weather[, which(!(colnames(baa.hp.daily.flights.summary.weather) %in% c("Year", "Month", "DayofMonth", "Origin"))), with=FALSE]#Write the output in both JSON and CSV file formatsobjs <- baa.hp.daily.flights.summary.weather[, getRowWiseJson(.SD), by=list(Airport)]# You have now (Airportcode, JSONString), Once again, you need to attach them together.row.json <- apply(objs, 1, function(x) paste('{\"AirportCode\":"', x[1], '","Data\":', x[2], '}', sep=""))json.st <- paste('[', paste(row.json, collapse=', '), ']')writeLines(json.st, "baa-2005-2008.summary.json") write.csv(baa.hp.daily.flights.summary.weather, "baa-2005-2008.summary.csv", row.names=FALSE)Happy Coding!
To leave a comment for the author, please follow the link and comment on his blog: All Things R.
R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series,ecdf, trading) and more...
29 days ago by rahuldave
Measuring time series characteristics
29 days ago by rahuldave
(This article was first published on Research tips » R, and kindly contributed to R-bloggers)
A few years ago, I was working on a project where we measured various characteristics of a time series and used the information to determine what forecasting method to apply or how to cluster the time series into meaningful groups. The two main papers to come out of that project were:
Wang, Smith and Hyndman (2006) Characteristic-based clustering for time series data. Data Mining and Knowledge Discovery, 13(3), 335–364.
Wang, Smith-Miles and Hyndman (2009) “Rule induction for forecasting method selection: meta-learning the characteristics of univariate time series”, Neurocomputing, 72, 2581–2594.
I’ve since had a lot of requests for the code which one of my coauthors has been helpfully emailing to anyone who asked. But to make it easier, we thought it might be helpful if I post some updated code here. This is not the same as the R code we used in the paper, as I’ve improved it in several ways (so it will give different results). If you just want the code, skip to the bottom of the post.
Finding the period of the data
Usually in time series work, we know the period of the data (if the observations are monthly, the period is 12, for example). But in this project, some of our data was of unknown period and we wanted a method to automatically determine the appropriate period. The method we used was based on local peaks and troughs in the ACF. But I’ve since devised a better approach (prompted on crossvalidated.com) using an estimate of the spectral density:
find.freq <- function(x)
{
n <- length(x)
spec <- spec.ar(c(na.contiguous(x)),plot=FALSE)
if(max(spec$spec)>10) # Arbitrary threshold chosen by trial and error.
{
period <- round(1/spec$freq[which.max(spec$spec)])
if(period==Inf) # Find next local maximum
{
j <- which(diff(spec$spec)>0)
if(length(j)>0)
{
nextmax <- j[1] + which.max(spec$spec[j[1]:500])
if(nextmax <= length(spec$freq))
period <- round(1/spec$freq[nextmax])
else
period <- 1
}
else
period <- 1
}
}
else
period <- 1
return(period)
}
The function is called find.freq because time series people often call the period of seasonality the “frequency” (which is of course highly confusing).
Decomposing the data into trend and seasonal components
We needed a measure of the strength of trend and the strength of seasonality, and to do this we decomposed the data into trend, seasonal and error terms.
Because not all data could be decomposed additively, we first needed to apply an automated Box-Cox transformation. We tried a range of Box-Cox parameters on a grid, and selected the one which gave the most normal errors. That worked ok, but I’ve since found some papers that provide quite good automated Box-Cox algorithms that I’ve implemented in the forecast package. So this code uses Guerrero’s (1993) method instead.
For seasonal time series, we decomposed the transformed data using an stl decomposition with periodic seasonality.
For non-seasonal time series, we estimated the trend of the transformed data using penalized regression splines via the mgcv package.
decomp <- function(x,transform=TRUE)
{
require(forecast)
# Transform series
if(transform & min(x,na.rm=TRUE) >= 0)
{
lambda <- BoxCox.lambda(na.contiguous(x))
x <- BoxCox(x,lambda)
}
else
{
lambda <- NULL
transform <- FALSE
}
# Seasonal data
if(frequency(x)>1)
{
x.stl <- stl(x,s.window="periodic",na.action=na.contiguous)
trend <- x.stl$time.series[,2]
season <- x.stl$time.series[,1]
remainder <- x - trend - season
}
else #Nonseasonal data
{
require(mgcv)
tt <- 1:length(x)
trend <- rep(NA,length(x))
trend[!is.na(x)] <- fitted(gam(x ~ s(tt)))
season <- NULL
remainder <- x - trend
}
return(list(x=x,trend=trend,season=season,remainder=remainder,
transform=transform,lambda=lambda))
}
Putting everything on a [0,1] scale
We wanted to measure a range of characteristics such as strength of seasonality, strength of trend, level of nonlinearity, skewness, kurtosis, serial correlatedness, self-similarity, level of chaoticity (is that a word?) and the periodicity of the data. But we wanted all these on the same scale which meant mapping the natural range of each measure onto [0,1]. The following two functions were used to do this.
# f1 maps [0,infinity) to [0,1]
f1 <- function(x,a,b)
{
eax <- exp(a*x)
if (eax == Inf)
f1eax <- 1
else
f1eax <- (eax-1)/(eax+b)
return(f1eax)
}
# f2 maps [0,1] onto [0,1]
f2 <- function(x,a,b)
{
eax <- exp(a*x)
ea <- exp(a)
return((eax-1)/(eax+b)*(ea+b)/(ea-1))
}
The values of and in each function were chosen so the measure had a 90th percentile of 0.10 when the data were iid standard normal, and a value of 0.9 using a well-known benchmark time series.
Calculating the measures
Now we are ready to calculate the measures on the original data, as well as on the adjusted data (after removing trend and seasonality).
measures <- function(x)
{
require(forecast)
N <- length(x)
freq <- find.freq(x)
fx <- c(frequency=(exp((freq-1)/50)-1)/(1+exp((freq-1)/50)))
x <- ts(x,f=freq)
# Decomposition
decomp.x <- decomp(x)
# Adjust data
if(freq > 1)
fits <- decomp.x$trend + decomp.x$season
else # Nonseasonal data
fits <- decomp.x$trend
adj.x <- decomp.x$x - fits + mean(decomp.x$trend, na.rm=TRUE)
# Backtransformation of adjusted data
if(decomp.x$transform)
tadj.x <- InvBoxCox(adj.x,decomp.x$lambda)
else
tadj.x <- adj.x
# Trend and seasonal measures
v.adj <- var(adj.x, na.rm=TRUE)
if(freq > 1)
{
detrend <- decomp.x$x - decomp.x$trend
deseason <- decomp.x$x - decomp.x$season
trend <- ifelse(var(deseason,na.rm=TRUE) < 1e-10, 0,
max(0,min(1,1-v.adj/var(deseason,na.rm=TRUE))))
season <- ifelse(var(detrend,na.rm=TRUE) < 1e-10, 0,
max(0,min(1,1-v.adj/var(detrend,na.rm=TRUE))))
}
else #Nonseasonal data
{
trend <- ifelse(var(decomp.x$x,na.rm=TRUE) < 1e-10, 0,
max(0,min(1,1-v.adj/var(decomp.x$x,na.rm=TRUE))))
season <- 0
}
m <- c(fx,trend,season)
# Measures on original data
xbar <- mean(x,na.rm=TRUE)
s <- sd(x,na.rm=TRUE)
# Serial correlation
Q <- Box.test(x,lag=10)$statistic/(N*10)
fQ <- f2(Q,7.53,0.103)
# Nonlinearity
p <- terasvirta.test(na.contiguous(x))$statistic
fp <- f1(p,0.069,2.304)
# Skewness
s <- abs(mean((x-xbar)^3,na.rm=TRUE)/s^3)
fs <- f1(s,1.510,5.993)
# Kurtosis
k <- mean((x-xbar)^4,na.rm=TRUE)/s^4
fk <- f1(k,2.273,11567)
# Hurst=d+0.5 where d is fractional difference.
H <- fracdiff(na.contiguous(x),0,0)$d + 0.5
# Lyapunov Exponent
if(freq > N-10)
stop("Insufficient data")
Ly <- numeric(N-freq)
for(i in 1:(N-freq))
{
idx <- order(abs(x[i] - x))
idx <- idx[idx < (N-freq)]
j <- idx[2]
Ly[i] <- log(abs((x[i+freq] - x[j+freq])/(x[i]-x[j])))/freq
if(is.na(Ly[i]) | Ly[i]==Inf | Ly[i]==-Inf)
Ly[i] <- NA
}
Lyap <- mean(Ly,na.rm=TRUE)
fLyap <- exp(Lyap)/(1+exp(Lyap))
m <- c(m,fQ,fp,fs,fk,H,fLyap)
# Measures on adjusted data
xbar <- mean(tadj.x, na.rm=TRUE)
s <- sd(tadj.x, na.rm=TRUE)
# Serial
Q <- Box.test(adj.x,lag=10)$statistic/(N*10)
fQ <- f2(Q,7.53,0.103)
# Nonlinearity
p <- terasvirta.test(na.contiguous(adj.x))$statistic
fp <- f1(p,0.069,2.304)
# Skewness
s <- abs(mean((tadj.x-xbar)^3,na.rm=TRUE)/s^3)
fs <- f1(s,1.510,5.993)
# Kurtosis
k <- mean((tadj.x-xbar)^4,na.rm=TRUE)/s^4
fk <- f1(k,2.273,11567)
m <- c(m,fQ,fp,fs,fk)
names(m) <- c("frequency", "trend","seasonal",
"autocorrelation","non-linear","skewness","kurtosis",
"Hurst","Lyapunov",
"dc autocorrelation","dc non-linear","dc skewness","dc kurtosis")
return(m)
}
Here is a quick example applied to Australian monthly gas production:
library(forecast)
measures(gas)
frequency trend seasonal autocorrelation
0.1096 0.9989 0.9337 0.9985
non-linear skewness kurtosis Hurst
0.4947 0.1282 1.0000 0.9996
Lyapunov dc autocorrelation dc non-linear dc skewness
0.5662 0.1140 0.0538 0.1743
dc kurtosis
1.0000
The function is far from perfect, and it is not hard to find examples where it fails. For example, it doesn’t work with multiple seasonality — try measure(taylor) and check the seasonality. Also, I’m not convinced the kurtosis provides anything useful here, or that the skewness measure is done in the best way possible. But it was really a proof of concept, so we will leave it to others to revise and improve the code.
In our papers, we took the measures obtained using R, and produced self-organizing maps using Viscovery. There is now a som package in R for that, so it might be possible to integrate that step into R as well. The dendogram was generated in matlab, although that could now also be done in R using the ggdendro package for example.
Download the code in a single file.
To leave a comment for the author, please follow the link and comment on his blog: Research tips » R.
R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series,ecdf, trading) and more...
R_bloggers
forecasting
R
Research_tips
statistics
from google
A few years ago, I was working on a project where we measured various characteristics of a time series and used the information to determine what forecasting method to apply or how to cluster the time series into meaningful groups. The two main papers to come out of that project were:
Wang, Smith and Hyndman (2006) Characteristic-based clustering for time series data. Data Mining and Knowledge Discovery, 13(3), 335–364.
Wang, Smith-Miles and Hyndman (2009) “Rule induction for forecasting method selection: meta-learning the characteristics of univariate time series”, Neurocomputing, 72, 2581–2594.
I’ve since had a lot of requests for the code which one of my coauthors has been helpfully emailing to anyone who asked. But to make it easier, we thought it might be helpful if I post some updated code here. This is not the same as the R code we used in the paper, as I’ve improved it in several ways (so it will give different results). If you just want the code, skip to the bottom of the post.
Finding the period of the data
Usually in time series work, we know the period of the data (if the observations are monthly, the period is 12, for example). But in this project, some of our data was of unknown period and we wanted a method to automatically determine the appropriate period. The method we used was based on local peaks and troughs in the ACF. But I’ve since devised a better approach (prompted on crossvalidated.com) using an estimate of the spectral density:
find.freq <- function(x)
{
n <- length(x)
spec <- spec.ar(c(na.contiguous(x)),plot=FALSE)
if(max(spec$spec)>10) # Arbitrary threshold chosen by trial and error.
{
period <- round(1/spec$freq[which.max(spec$spec)])
if(period==Inf) # Find next local maximum
{
j <- which(diff(spec$spec)>0)
if(length(j)>0)
{
nextmax <- j[1] + which.max(spec$spec[j[1]:500])
if(nextmax <= length(spec$freq))
period <- round(1/spec$freq[nextmax])
else
period <- 1
}
else
period <- 1
}
}
else
period <- 1
return(period)
}
The function is called find.freq because time series people often call the period of seasonality the “frequency” (which is of course highly confusing).
Decomposing the data into trend and seasonal components
We needed a measure of the strength of trend and the strength of seasonality, and to do this we decomposed the data into trend, seasonal and error terms.
Because not all data could be decomposed additively, we first needed to apply an automated Box-Cox transformation. We tried a range of Box-Cox parameters on a grid, and selected the one which gave the most normal errors. That worked ok, but I’ve since found some papers that provide quite good automated Box-Cox algorithms that I’ve implemented in the forecast package. So this code uses Guerrero’s (1993) method instead.
For seasonal time series, we decomposed the transformed data using an stl decomposition with periodic seasonality.
For non-seasonal time series, we estimated the trend of the transformed data using penalized regression splines via the mgcv package.
decomp <- function(x,transform=TRUE)
{
require(forecast)
# Transform series
if(transform & min(x,na.rm=TRUE) >= 0)
{
lambda <- BoxCox.lambda(na.contiguous(x))
x <- BoxCox(x,lambda)
}
else
{
lambda <- NULL
transform <- FALSE
}
# Seasonal data
if(frequency(x)>1)
{
x.stl <- stl(x,s.window="periodic",na.action=na.contiguous)
trend <- x.stl$time.series[,2]
season <- x.stl$time.series[,1]
remainder <- x - trend - season
}
else #Nonseasonal data
{
require(mgcv)
tt <- 1:length(x)
trend <- rep(NA,length(x))
trend[!is.na(x)] <- fitted(gam(x ~ s(tt)))
season <- NULL
remainder <- x - trend
}
return(list(x=x,trend=trend,season=season,remainder=remainder,
transform=transform,lambda=lambda))
}
Putting everything on a [0,1] scale
We wanted to measure a range of characteristics such as strength of seasonality, strength of trend, level of nonlinearity, skewness, kurtosis, serial correlatedness, self-similarity, level of chaoticity (is that a word?) and the periodicity of the data. But we wanted all these on the same scale which meant mapping the natural range of each measure onto [0,1]. The following two functions were used to do this.
# f1 maps [0,infinity) to [0,1]
f1 <- function(x,a,b)
{
eax <- exp(a*x)
if (eax == Inf)
f1eax <- 1
else
f1eax <- (eax-1)/(eax+b)
return(f1eax)
}
# f2 maps [0,1] onto [0,1]
f2 <- function(x,a,b)
{
eax <- exp(a*x)
ea <- exp(a)
return((eax-1)/(eax+b)*(ea+b)/(ea-1))
}
The values of and in each function were chosen so the measure had a 90th percentile of 0.10 when the data were iid standard normal, and a value of 0.9 using a well-known benchmark time series.
Calculating the measures
Now we are ready to calculate the measures on the original data, as well as on the adjusted data (after removing trend and seasonality).
measures <- function(x)
{
require(forecast)
N <- length(x)
freq <- find.freq(x)
fx <- c(frequency=(exp((freq-1)/50)-1)/(1+exp((freq-1)/50)))
x <- ts(x,f=freq)
# Decomposition
decomp.x <- decomp(x)
# Adjust data
if(freq > 1)
fits <- decomp.x$trend + decomp.x$season
else # Nonseasonal data
fits <- decomp.x$trend
adj.x <- decomp.x$x - fits + mean(decomp.x$trend, na.rm=TRUE)
# Backtransformation of adjusted data
if(decomp.x$transform)
tadj.x <- InvBoxCox(adj.x,decomp.x$lambda)
else
tadj.x <- adj.x
# Trend and seasonal measures
v.adj <- var(adj.x, na.rm=TRUE)
if(freq > 1)
{
detrend <- decomp.x$x - decomp.x$trend
deseason <- decomp.x$x - decomp.x$season
trend <- ifelse(var(deseason,na.rm=TRUE) < 1e-10, 0,
max(0,min(1,1-v.adj/var(deseason,na.rm=TRUE))))
season <- ifelse(var(detrend,na.rm=TRUE) < 1e-10, 0,
max(0,min(1,1-v.adj/var(detrend,na.rm=TRUE))))
}
else #Nonseasonal data
{
trend <- ifelse(var(decomp.x$x,na.rm=TRUE) < 1e-10, 0,
max(0,min(1,1-v.adj/var(decomp.x$x,na.rm=TRUE))))
season <- 0
}
m <- c(fx,trend,season)
# Measures on original data
xbar <- mean(x,na.rm=TRUE)
s <- sd(x,na.rm=TRUE)
# Serial correlation
Q <- Box.test(x,lag=10)$statistic/(N*10)
fQ <- f2(Q,7.53,0.103)
# Nonlinearity
p <- terasvirta.test(na.contiguous(x))$statistic
fp <- f1(p,0.069,2.304)
# Skewness
s <- abs(mean((x-xbar)^3,na.rm=TRUE)/s^3)
fs <- f1(s,1.510,5.993)
# Kurtosis
k <- mean((x-xbar)^4,na.rm=TRUE)/s^4
fk <- f1(k,2.273,11567)
# Hurst=d+0.5 where d is fractional difference.
H <- fracdiff(na.contiguous(x),0,0)$d + 0.5
# Lyapunov Exponent
if(freq > N-10)
stop("Insufficient data")
Ly <- numeric(N-freq)
for(i in 1:(N-freq))
{
idx <- order(abs(x[i] - x))
idx <- idx[idx < (N-freq)]
j <- idx[2]
Ly[i] <- log(abs((x[i+freq] - x[j+freq])/(x[i]-x[j])))/freq
if(is.na(Ly[i]) | Ly[i]==Inf | Ly[i]==-Inf)
Ly[i] <- NA
}
Lyap <- mean(Ly,na.rm=TRUE)
fLyap <- exp(Lyap)/(1+exp(Lyap))
m <- c(m,fQ,fp,fs,fk,H,fLyap)
# Measures on adjusted data
xbar <- mean(tadj.x, na.rm=TRUE)
s <- sd(tadj.x, na.rm=TRUE)
# Serial
Q <- Box.test(adj.x,lag=10)$statistic/(N*10)
fQ <- f2(Q,7.53,0.103)
# Nonlinearity
p <- terasvirta.test(na.contiguous(adj.x))$statistic
fp <- f1(p,0.069,2.304)
# Skewness
s <- abs(mean((tadj.x-xbar)^3,na.rm=TRUE)/s^3)
fs <- f1(s,1.510,5.993)
# Kurtosis
k <- mean((tadj.x-xbar)^4,na.rm=TRUE)/s^4
fk <- f1(k,2.273,11567)
m <- c(m,fQ,fp,fs,fk)
names(m) <- c("frequency", "trend","seasonal",
"autocorrelation","non-linear","skewness","kurtosis",
"Hurst","Lyapunov",
"dc autocorrelation","dc non-linear","dc skewness","dc kurtosis")
return(m)
}
Here is a quick example applied to Australian monthly gas production:
library(forecast)
measures(gas)
frequency trend seasonal autocorrelation
0.1096 0.9989 0.9337 0.9985
non-linear skewness kurtosis Hurst
0.4947 0.1282 1.0000 0.9996
Lyapunov dc autocorrelation dc non-linear dc skewness
0.5662 0.1140 0.0538 0.1743
dc kurtosis
1.0000
The function is far from perfect, and it is not hard to find examples where it fails. For example, it doesn’t work with multiple seasonality — try measure(taylor) and check the seasonality. Also, I’m not convinced the kurtosis provides anything useful here, or that the skewness measure is done in the best way possible. But it was really a proof of concept, so we will leave it to others to revise and improve the code.
In our papers, we took the measures obtained using R, and produced self-organizing maps using Viscovery. There is now a som package in R for that, so it might be possible to integrate that step into R as well. The dendogram was generated in matlab, although that could now also be done in R using the ggdendro package for example.
Download the code in a single file.
To leave a comment for the author, please follow the link and comment on his blog: Research tips » R.
R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series,ecdf, trading) and more...
29 days ago by rahuldave
NSF BIGDATA webinar
4 weeks ago by rahuldave
(This article was first published on Getting Genetics Done, and kindly contributed to R-bloggers)
If you're doing any kind of big data analysis - genomics, transcriptomics, proteomics, bioinformatics - then unless you've been on vacation the last few weeks you've no doubt heard about the NSF/NIH BIGDATA Initiative (here's the NSF solicitation and here's the New York Times article about the funding opportunity). The solicitation "aims to advance core scientific and technological means of managing, analyzing, visualizing, and extracting useful information from large, diverse, distributed and heterogeneous data sets so as to: accelerate the progress of scientific discovery and innovation; lead to new fields of inquiry that would not otherwise be possible; encourage the development of new data analytic tools and algorithms; facilitate scalable, accessible, and sustainable data infrastructure; increase understanding of human and social processes and interactions; and promote economic growth and improved health and quality of life."NSF is holding a webinar to describe the goals and focus of the BIGDATA solicitation, help investigators understand its scope, and answer any questions potential PIs might have.The Webinar will be held from 11am-noon EST on May 8, 2012. Register here. The webinar will also be archived here a few days later.NSF BIGDATA Webinar - May 8 2012Getting Genetics Done by Stephen Turner is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.
To leave a comment for the author, please follow the link and comment on his blog: Getting Genetics Done.
R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series,ecdf, trading) and more...
R_bloggers
Announcements
Bioinformatics
R
from google
If you're doing any kind of big data analysis - genomics, transcriptomics, proteomics, bioinformatics - then unless you've been on vacation the last few weeks you've no doubt heard about the NSF/NIH BIGDATA Initiative (here's the NSF solicitation and here's the New York Times article about the funding opportunity). The solicitation "aims to advance core scientific and technological means of managing, analyzing, visualizing, and extracting useful information from large, diverse, distributed and heterogeneous data sets so as to: accelerate the progress of scientific discovery and innovation; lead to new fields of inquiry that would not otherwise be possible; encourage the development of new data analytic tools and algorithms; facilitate scalable, accessible, and sustainable data infrastructure; increase understanding of human and social processes and interactions; and promote economic growth and improved health and quality of life."NSF is holding a webinar to describe the goals and focus of the BIGDATA solicitation, help investigators understand its scope, and answer any questions potential PIs might have.The Webinar will be held from 11am-noon EST on May 8, 2012. Register here. The webinar will also be archived here a few days later.NSF BIGDATA Webinar - May 8 2012Getting Genetics Done by Stephen Turner is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.
To leave a comment for the author, please follow the link and comment on his blog: Getting Genetics Done.
R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series,ecdf, trading) and more...
4 weeks ago by rahuldave
What does this package look like?
4 weeks ago by rahuldave
(This article was first published on tuxettechix » R, and kindly contributed to R-bloggers)
In this post, I give a very simple trick to understand the way a package is organized, which functions are included in and how these functions depend from each others. The idea has been brought by one of my student, Soraya, who is currently working in a very hostile environment, surrounded by true geeks. However, she handles the situation pretty well and manages to take the best from them. As I asked her to insert in her report a chart representing the dependency structure between her functions, she learned from them that tools existed to produce such charts automatically and was able (thanks to R blogger) to find one in an R package.
Suppose that you want to know which functions are included in the package GeneNet (that is of interest for another of my students) and to understand the dependency structure between these functions. First, download the package source, decompress it and use the subdirectory R as the working directory of your R session. If you’re on linux, the following command lines will do the job:
?View Code BASHtar zxvf GeneNet_1.2.5.tar.gz
cd GeneNet/R
R
Then, collect and source all files located in this subdirectory and use the function foodweb (from the package mvbutils) to display the dependency structure between functions in this package:
?View Code RSPLUSthefiles = list.files()
sapply(thefiles,source)
library(mvbutils)
par(mar=rep(0.1,4))
foodweb(border=TRUE,boxcolor="pink",lwd=1.5,cex=0.8)
which gives the following picture:
To spare my student’s time, I did the same with simone
and igraph for which I had to exclude a configuration file:
?View Code RSPLUSthefiles = setdiff(list.files(),"config.R.in")
sapply(thefiles,source)
library(mvbutils)
par(mar=rep(0.1,4))
foodweb(border=TRUE,boxcolor="lightgreen",lwd=1.5,cex=0.8)
I’m not sure whether the last one is very useful
Now, Soraya is just waiting for her “warhol-R-rabbit” that Paul has promised her to include in her report. It’s gonna be a pop art variant of:
I’m really looking forward to post the R script for it!
To leave a comment for the author, please follow the link and comment on his blog: tuxettechix » R.
R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series,ecdf, trading) and more...
R_bloggers
genenet
igraph
mvbutils
R
R_package
simone
from google
In this post, I give a very simple trick to understand the way a package is organized, which functions are included in and how these functions depend from each others. The idea has been brought by one of my student, Soraya, who is currently working in a very hostile environment, surrounded by true geeks. However, she handles the situation pretty well and manages to take the best from them. As I asked her to insert in her report a chart representing the dependency structure between her functions, she learned from them that tools existed to produce such charts automatically and was able (thanks to R blogger) to find one in an R package.
Suppose that you want to know which functions are included in the package GeneNet (that is of interest for another of my students) and to understand the dependency structure between these functions. First, download the package source, decompress it and use the subdirectory R as the working directory of your R session. If you’re on linux, the following command lines will do the job:
?View Code BASHtar zxvf GeneNet_1.2.5.tar.gz
cd GeneNet/R
R
Then, collect and source all files located in this subdirectory and use the function foodweb (from the package mvbutils) to display the dependency structure between functions in this package:
?View Code RSPLUSthefiles = list.files()
sapply(thefiles,source)
library(mvbutils)
par(mar=rep(0.1,4))
foodweb(border=TRUE,boxcolor="pink",lwd=1.5,cex=0.8)
which gives the following picture:
To spare my student’s time, I did the same with simone
and igraph for which I had to exclude a configuration file:
?View Code RSPLUSthefiles = setdiff(list.files(),"config.R.in")
sapply(thefiles,source)
library(mvbutils)
par(mar=rep(0.1,4))
foodweb(border=TRUE,boxcolor="lightgreen",lwd=1.5,cex=0.8)
I’m not sure whether the last one is very useful
Now, Soraya is just waiting for her “warhol-R-rabbit” that Paul has promised her to include in her report. It’s gonna be a pop art variant of:
I’m really looking forward to post the R script for it!
To leave a comment for the author, please follow the link and comment on his blog: tuxettechix » R.
R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series,ecdf, trading) and more...
4 weeks ago by rahuldave
How to create R extensions / packages [resources]
4 weeks ago by rahuldave
(This article was first published on Pairach Piboonrungroj » R, and kindly contributed to R-bloggers)
One of the advantages of R is its add-on packages which are now 3,759 packages freely available in CRAN (May 1, 2012). I have been using R for my research since 2010. These packages have brought me to R. It is also the main reason why I am mainly use R (90%) for my research and [...]
To leave a comment for the author, please follow the link and comment on his blog: Pairach Piboonrungroj » R.
R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series,ecdf, trading) and more...
R_bloggers
R
from google
One of the advantages of R is its add-on packages which are now 3,759 packages freely available in CRAN (May 1, 2012). I have been using R for my research since 2010. These packages have brought me to R. It is also the main reason why I am mainly use R (90%) for my research and [...]
To leave a comment for the author, please follow the link and comment on his blog: Pairach Piboonrungroj » R.
R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series,ecdf, trading) and more...
4 weeks ago by rahuldave
French Global Factors
4 weeks ago by rahuldave
(This article was first published on Timely Portfolio, and kindly contributed to R-bloggers)
I have said it already in multiple posts, but Kenneth French’s data library is one of the most generous and powerful contributions to the financial community. To build on Systematic Investor’s series on factors, I thought I should run some basic analysis on the Global Factors maintained by Kenneth French. I cannot imagine how long this would take without the data library and the incredible set of R packages available.
From TimelyPortfolio From TimelyPortfolio From TimelyPortfolio From TimelyPortfolio From TimelyPortfolio From TimelyPortfolio R code from GIST:
To leave a comment for the author, please follow the link and comment on his blog: Timely Portfolio.
R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series,ecdf, trading) and more...
R_bloggers
factors
fPortfolio
French
R
stocks
systematic_investor
from google
I have said it already in multiple posts, but Kenneth French’s data library is one of the most generous and powerful contributions to the financial community. To build on Systematic Investor’s series on factors, I thought I should run some basic analysis on the Global Factors maintained by Kenneth French. I cannot imagine how long this would take without the data library and the incredible set of R packages available.
From TimelyPortfolio From TimelyPortfolio From TimelyPortfolio From TimelyPortfolio From TimelyPortfolio From TimelyPortfolio R code from GIST:
To leave a comment for the author, please follow the link and comment on his blog: Timely Portfolio.
R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series,ecdf, trading) and more...
4 weeks ago by rahuldave
R, Julia and genome wide selection
5 weeks ago by rahuldave
(This article was first published on Quantum Forest » rblogs, and kindly contributed to R-bloggers)
— “You are a pussy” emailed my friend.
— “Sensu cat?” I replied.
— “No. Sensu chicken” blurbed my now ex-friend.
What was this about? He read my post on R, Julia and the shiny new thing, which prompted him to assume that I was the proverbial old dog unwilling (or was it unable?) to learn new tricks. (Incidentally, with friends like this who needs enemies? Hi, Gus.)
Having a look at different—statistical—horses (Photo: Luis).
I decided to tackle a small—but hopefully useful—piece of code: fitting/training a Genome Wide Selection model, using the Bayes A approach put forward by Meuwissen, Hayes and Goddard in 2001. In that approach the breeding values of the individuals (response) are expressed as a function of a very large number of random predictors (2000, our molecular markers). The dataset (csv file) is a simulation of 2000 bi-allelic markers (aa = 0, Aa = 1, AA = 2) for 250 individuals, followed by the phenotypes (column 2001) and breeding values (column 2002). These models are frequently adjusted using MCMC.
In 2010 I attended this course in Ames, Iowa where Rohan Fernando passed us the following R code (pretty much a transliteration from C code; notice the trailing semicolons, for example):
nmarkers = 2000; # number of markers
startMarker = 1981; # set to 1 to use all
numiter = 2000; # number of iterations
vara = 1.0/20.0;
# input data
data = matrix(scan("trainData.out0"),ncol=nmarkers+2,byrow=TRUE);
nrecords = dim(data)[1];
beg = Sys.time()
# x has the mean followed by the markers
x = cbind(1,data[,startMarker:nmarkers]);
y = data[,nmarkers+1];
a = data[,nmarkers+2];
# inital values
nmarkers = nmarkers - startMarker + 1;
mean2pq = 0.5; # just an approximation
scalea = 0.5*vara/(nmarkers*mean2pq); # 0.5 = (v-2)/v for v=4
size = dim(x)[2];
b = array(0.0,size);
meanb = b;
b[1] = mean(y);
var = array(0.0,size);
# adjust y
ycorr = y - x%*%b;
# MCMC sampling
for (iter in 1:numiter){
# sample vare
vare = ( t(ycorr)%*%ycorr )/rchisq(1,nrecords + 3);
# sample intercept
ycorr = ycorr + x[,1]*b[1];
rhs = sum(ycorr)/vare;
invLhs = 1.0/(nrecords/vare);
mean = rhs*invLhs;
b[1] = rnorm(1,mean,sqrt(invLhs));
ycorr = ycorr - x[,1]*b[1];
meanb[1] = meanb[1] + b[1];
# sample variance for each locus
for (locus in 2:size){
var[locus] = (scalea*4+b[locus]*b[locus])/rchisq(1,4.0+1)
}
# sample effect for each locus
for (locus in 2:size){
# unadjust y for this locus
ycorr = ycorr + x[,locus]*b[locus];
rhs = t(x[,locus])%*%ycorr/vare;
lhs = t(x[,locus])%*%x[,locus]/vare + 1.0/var[locus];
invLhs = 1.0/lhs;
mean = invLhs*rhs;
b[locus]= rnorm(1,mean,sqrt(invLhs));
#adjust y for the new value of this locus
ycorr = ycorr - x[,locus]*b[locus];
meanb[locus] = meanb[locus] + b[locus];
}
}
Sys.time() - beg
meanb = meanb/numiter;
aHat = x %*% meanb;
Thus, we just need defining a few variables, reading the data (marker genotypes, breeding values and phenotypic data) into a matrix, creating loops, matrix and vector multiplication and generating random numbers (using a Gaussian and Chi squared distributions). Not much if you think about it, but I didn’t have much time to explore Julia’s features as to go for something more complex.
nmarkers = 2000 # Number of markers
startmarker = 1981 # Set to 1 to use all
numiter = 2000 # Number of iterations
data = dlmread("markers.csv", ',')
(nrecords, ncols) = size(data)
tic()
#this is the mean and markers matrix
X = hcat(ones(Float64, nrecords), data[:, startmarker:nmarkers])
y = data[:, nmarkers + 1]
a = data[:, nmarkers + 2]
nmarkers = nmarkers - startmarker + 1
vara = 1.0/nmarkers
mean2pq = 0.5
scalea = 0.5*vara/(nmarkers*mean2pq) # 0.5 = (v-2)/v for v=4
ndesign = size(X, 2)
b = zeros(Float64, ndesign)
meanb = zeros(Float64, ndesign)
b[1] = mean(y)
varian = zeros(Float64, ndesign)
# adjust y
ycorr = y - X * b
# MCMC sampling
for i = 1:numiter
# sample vare
vare = dot(ycorr, ycorr )/randchi2(nrecords + 3)
# sample intercept
ycorr = ycorr + X[:, 1] * b[1];
rhs = sum(ycorr)/vare;
invlhs = 1.0/(nrecords/vare);
mn = rhs*invlhs;
b[1] = randn() * sqrt(invlhs) + mn;
ycorr = ycorr - X[:, 1] * b[1];
meanb[1] = meanb[1] + b[1];
# sample variance for each locus
for locus = 2:ndesign
varian[locus] = (scalea*4 + b[locus]*b[locus])/randchi2(4.0 + 1);
end
# sample effect for each locus
for locus = 2:ndesign
# unadjust y for this locus
ycorr = ycorr + X[:, locus] * b[locus];
rhs = dot(X[:, locus], ycorr)/vare;
lhs = dot(X[:, locus], X[:, locus])/vare + 1.0/varian[locus];
invlhs = 1.0/lhs;
mn = invlhs * rhs;
b[locus] = randn() * sqrt(invlhs) + mn;
#adjust y for the new value of this locus
ycorr = ycorr - X[:, locus] * b[locus];
meanb[locus] = meanb[locus] + b[locus];
end
end
toc()
meanb = meanb/numiter;
aHat = X * meanb;
The code looks remarkably similar and there are four main sources of differences:
The first trivial one is that the original code read a binary dataset and I didn’t know how to do it in Julia, so I’ve read a csv file instead (this is why I start timing after reading the file too).
The second trivial one is to avoid name conflicts between variables and functions; for example, in R the user is allowed to have a variable called var that will not interfere with the variance function. Julia is picky about that, so I needed renaming some variables.
Julia pases variables by reference, while R does so by value when assigning matrices, which tripped me because in the original R code there was something like: b = array(0.0,size); meanb = b;. This works fine in R, but in Julia changes to the b vector also changed meanb.
The definition of scalar vs array created some problems in Julia. For example y' * y (t(y) %*% y in R) is numerically equivalent to dot(y, y). However, the first version returns an array, while the second one a scalar. I got an error message when trying to store the ‘scalar like an array’ in to an array. I find that confusing.
One interesting point in this comparison is using rough code, not really optimized for speed; in fact, the only thing that I can say of the Julia code is that ‘it runs’ and it probably is not very idiomatic. Testing runs with different numbers of markers we get that R needs roughly 2.8x the time used by Julia. The Julia website claims better results in benchmarks, but in real life we work with, well, real problems.
In 1996-7 I switched from SAS to ASReml for genetic analyses because it was 1-2 orders of magnitude faster and opened a world of new models. Today a change from R to Julia would deliver (in this particular case) a much more modest speed up (~3x), which is OK but not worth changing languages (yet). Together with the embryonic graphical capabilities and the still-to-develop ecosystem of packages, means that I’m still using R. Nevertheless, the Julia team has achieved very impressive performance in very little time, so it is worth to keep an eye on their progress.
P.S.1 Readers are welcome to suggesting ways of improving the code.
P.S.2 WordPress does not let me upload the binary version of the simulated data.
P.S.3 Hey WordPress guys; it would be handy if the sourcecode plugin supported Julia!
To leave a comment for the author, please follow the link and comment on his blog: Quantum Forest » rblogs.
R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series,ecdf, trading) and more...
R_bloggers
Bayesian
Julia
R
rblogs
Simulation
stats
from google
— “You are a pussy” emailed my friend.
— “Sensu cat?” I replied.
— “No. Sensu chicken” blurbed my now ex-friend.
What was this about? He read my post on R, Julia and the shiny new thing, which prompted him to assume that I was the proverbial old dog unwilling (or was it unable?) to learn new tricks. (Incidentally, with friends like this who needs enemies? Hi, Gus.)
Having a look at different—statistical—horses (Photo: Luis).
I decided to tackle a small—but hopefully useful—piece of code: fitting/training a Genome Wide Selection model, using the Bayes A approach put forward by Meuwissen, Hayes and Goddard in 2001. In that approach the breeding values of the individuals (response) are expressed as a function of a very large number of random predictors (2000, our molecular markers). The dataset (csv file) is a simulation of 2000 bi-allelic markers (aa = 0, Aa = 1, AA = 2) for 250 individuals, followed by the phenotypes (column 2001) and breeding values (column 2002). These models are frequently adjusted using MCMC.
In 2010 I attended this course in Ames, Iowa where Rohan Fernando passed us the following R code (pretty much a transliteration from C code; notice the trailing semicolons, for example):
nmarkers = 2000; # number of markers
startMarker = 1981; # set to 1 to use all
numiter = 2000; # number of iterations
vara = 1.0/20.0;
# input data
data = matrix(scan("trainData.out0"),ncol=nmarkers+2,byrow=TRUE);
nrecords = dim(data)[1];
beg = Sys.time()
# x has the mean followed by the markers
x = cbind(1,data[,startMarker:nmarkers]);
y = data[,nmarkers+1];
a = data[,nmarkers+2];
# inital values
nmarkers = nmarkers - startMarker + 1;
mean2pq = 0.5; # just an approximation
scalea = 0.5*vara/(nmarkers*mean2pq); # 0.5 = (v-2)/v for v=4
size = dim(x)[2];
b = array(0.0,size);
meanb = b;
b[1] = mean(y);
var = array(0.0,size);
# adjust y
ycorr = y - x%*%b;
# MCMC sampling
for (iter in 1:numiter){
# sample vare
vare = ( t(ycorr)%*%ycorr )/rchisq(1,nrecords + 3);
# sample intercept
ycorr = ycorr + x[,1]*b[1];
rhs = sum(ycorr)/vare;
invLhs = 1.0/(nrecords/vare);
mean = rhs*invLhs;
b[1] = rnorm(1,mean,sqrt(invLhs));
ycorr = ycorr - x[,1]*b[1];
meanb[1] = meanb[1] + b[1];
# sample variance for each locus
for (locus in 2:size){
var[locus] = (scalea*4+b[locus]*b[locus])/rchisq(1,4.0+1)
}
# sample effect for each locus
for (locus in 2:size){
# unadjust y for this locus
ycorr = ycorr + x[,locus]*b[locus];
rhs = t(x[,locus])%*%ycorr/vare;
lhs = t(x[,locus])%*%x[,locus]/vare + 1.0/var[locus];
invLhs = 1.0/lhs;
mean = invLhs*rhs;
b[locus]= rnorm(1,mean,sqrt(invLhs));
#adjust y for the new value of this locus
ycorr = ycorr - x[,locus]*b[locus];
meanb[locus] = meanb[locus] + b[locus];
}
}
Sys.time() - beg
meanb = meanb/numiter;
aHat = x %*% meanb;
Thus, we just need defining a few variables, reading the data (marker genotypes, breeding values and phenotypic data) into a matrix, creating loops, matrix and vector multiplication and generating random numbers (using a Gaussian and Chi squared distributions). Not much if you think about it, but I didn’t have much time to explore Julia’s features as to go for something more complex.
nmarkers = 2000 # Number of markers
startmarker = 1981 # Set to 1 to use all
numiter = 2000 # Number of iterations
data = dlmread("markers.csv", ',')
(nrecords, ncols) = size(data)
tic()
#this is the mean and markers matrix
X = hcat(ones(Float64, nrecords), data[:, startmarker:nmarkers])
y = data[:, nmarkers + 1]
a = data[:, nmarkers + 2]
nmarkers = nmarkers - startmarker + 1
vara = 1.0/nmarkers
mean2pq = 0.5
scalea = 0.5*vara/(nmarkers*mean2pq) # 0.5 = (v-2)/v for v=4
ndesign = size(X, 2)
b = zeros(Float64, ndesign)
meanb = zeros(Float64, ndesign)
b[1] = mean(y)
varian = zeros(Float64, ndesign)
# adjust y
ycorr = y - X * b
# MCMC sampling
for i = 1:numiter
# sample vare
vare = dot(ycorr, ycorr )/randchi2(nrecords + 3)
# sample intercept
ycorr = ycorr + X[:, 1] * b[1];
rhs = sum(ycorr)/vare;
invlhs = 1.0/(nrecords/vare);
mn = rhs*invlhs;
b[1] = randn() * sqrt(invlhs) + mn;
ycorr = ycorr - X[:, 1] * b[1];
meanb[1] = meanb[1] + b[1];
# sample variance for each locus
for locus = 2:ndesign
varian[locus] = (scalea*4 + b[locus]*b[locus])/randchi2(4.0 + 1);
end
# sample effect for each locus
for locus = 2:ndesign
# unadjust y for this locus
ycorr = ycorr + X[:, locus] * b[locus];
rhs = dot(X[:, locus], ycorr)/vare;
lhs = dot(X[:, locus], X[:, locus])/vare + 1.0/varian[locus];
invlhs = 1.0/lhs;
mn = invlhs * rhs;
b[locus] = randn() * sqrt(invlhs) + mn;
#adjust y for the new value of this locus
ycorr = ycorr - X[:, locus] * b[locus];
meanb[locus] = meanb[locus] + b[locus];
end
end
toc()
meanb = meanb/numiter;
aHat = X * meanb;
The code looks remarkably similar and there are four main sources of differences:
The first trivial one is that the original code read a binary dataset and I didn’t know how to do it in Julia, so I’ve read a csv file instead (this is why I start timing after reading the file too).
The second trivial one is to avoid name conflicts between variables and functions; for example, in R the user is allowed to have a variable called var that will not interfere with the variance function. Julia is picky about that, so I needed renaming some variables.
Julia pases variables by reference, while R does so by value when assigning matrices, which tripped me because in the original R code there was something like: b = array(0.0,size); meanb = b;. This works fine in R, but in Julia changes to the b vector also changed meanb.
The definition of scalar vs array created some problems in Julia. For example y' * y (t(y) %*% y in R) is numerically equivalent to dot(y, y). However, the first version returns an array, while the second one a scalar. I got an error message when trying to store the ‘scalar like an array’ in to an array. I find that confusing.
One interesting point in this comparison is using rough code, not really optimized for speed; in fact, the only thing that I can say of the Julia code is that ‘it runs’ and it probably is not very idiomatic. Testing runs with different numbers of markers we get that R needs roughly 2.8x the time used by Julia. The Julia website claims better results in benchmarks, but in real life we work with, well, real problems.
In 1996-7 I switched from SAS to ASReml for genetic analyses because it was 1-2 orders of magnitude faster and opened a world of new models. Today a change from R to Julia would deliver (in this particular case) a much more modest speed up (~3x), which is OK but not worth changing languages (yet). Together with the embryonic graphical capabilities and the still-to-develop ecosystem of packages, means that I’m still using R. Nevertheless, the Julia team has achieved very impressive performance in very little time, so it is worth to keep an eye on their progress.
P.S.1 Readers are welcome to suggesting ways of improving the code.
P.S.2 WordPress does not let me upload the binary version of the simulated data.
P.S.3 Hey WordPress guys; it would be handy if the sourcecode plugin supported Julia!
To leave a comment for the author, please follow the link and comment on his blog: Quantum Forest » rblogs.
R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series,ecdf, trading) and more...
5 weeks ago by rahuldave
Ggplot2 notes part 2
7 weeks ago by rahuldave
(This article was first published on Sharp Statistics » R, and kindly contributed to R-bloggers)
Here is part 2 of my guide to using ggplot2. Scales Following on directly from the previous notes you can manually adjust the colours and shapes used in the chart if you don’t like the defaults, as shown in figure 1. … Continue reading →
To leave a comment for the author, please follow the link and comment on his blog: Sharp Statistics » R.
R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series,ecdf, trading) and more...
R_bloggers
charts
R
from google
Here is part 2 of my guide to using ggplot2. Scales Following on directly from the previous notes you can manually adjust the colours and shapes used in the chart if you don’t like the defaults, as shown in figure 1. … Continue reading →
To leave a comment for the author, please follow the link and comment on his blog: Sharp Statistics » R.
R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series,ecdf, trading) and more...
7 weeks ago by rahuldave
Latex Allergy Cured by knitr
7 weeks ago by rahuldave
(This article was first published on Timely Portfolio, and kindly contributed to R-bloggers)
I have always known that at some point I would have to succumb to the power of Latex, but Latex has been uncharacteristically intimidating to me. I finally found the remedy to my Latex allergy with the amazing and fantastic knitr package from Yihui Xie. With very minimal effort, I ran my first experiment and now am extremely excited to incorporate it in production-quality performance reports (I plan to document the steps to get there in future posts).
For those starting from scratch on Windows, I think the easiest method to get up and running is to install LyX, which will also install MikTex. If these are successfully installed, then you should be ready to experiment with knitr in R.
I will use knitr’s stich function, which is clearly not designed for the robust production use of knitr, but makes for a very easy first test. stitch will open a very short script, apply a template, and generate a Sweave style .Rnw (can be changed). knit2pdf converts the .Rnw file into a pdf, and with a couple lines of code you get a remarkable result.
To leave a comment for the author, please follow the link and comment on his blog: Timely Portfolio.
R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series,ecdf, trading) and more...
R_bloggers
knitr
LaTex
R
reporting
from google
I have always known that at some point I would have to succumb to the power of Latex, but Latex has been uncharacteristically intimidating to me. I finally found the remedy to my Latex allergy with the amazing and fantastic knitr package from Yihui Xie. With very minimal effort, I ran my first experiment and now am extremely excited to incorporate it in production-quality performance reports (I plan to document the steps to get there in future posts).
For those starting from scratch on Windows, I think the easiest method to get up and running is to install LyX, which will also install MikTex. If these are successfully installed, then you should be ready to experiment with knitr in R.
I will use knitr’s stich function, which is clearly not designed for the robust production use of knitr, but makes for a very easy first test. stitch will open a very short script, apply a template, and generate a Sweave style .Rnw (can be changed). knit2pdf converts the .Rnw file into a pdf, and with a couple lines of code you get a remarkable result.
To leave a comment for the author, please follow the link and comment on his blog: Timely Portfolio.
R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series,ecdf, trading) and more...
7 weeks ago by rahuldave
Getting Started with JAGS, rjags, and Bayesian Modelling
7 weeks ago by rahuldave
(This article was first published on Jeromy Anglim's Blog: Psychology and Statistics, and kindly contributed to R-bloggers)
This post provides links to various resources on getting started with Bayesianmodelling using JAGS and R.It discusses: (1) what is JAGS;(2) why you might want to perform Bayesian modelling using JAGS;(3) how to install JAGS;(4) where to find further information on JAGS;(5) where to find examples of JAGS scripts in action;(6) where to ask questions; and(7) some interesting psychological applications of Bayesian modelling.
What is JAGS?JAGS stands for Just Another Gibbs Sampler.To quote the program author, Martyn Plummer, "It is a program for analysis ofBayesian hierarchical models using Markov Chain Monte Carlo (MCMC) simulation..." It uses a dialect of the BUGS language, similar but a little different to OpenBUGS and WinBUGS.
Why JAGS?The question of why you might want to use JAGS can be approached in several different ways:
Why Bayesian rather than Null Hypothesis Significance Testing (NHST) approaches?
To quote John D. Cook quoting Anthony O'Hagan, the benefits of "the bayesianapproach are that it is 1. fundamentally sound, 2. very flexible, 3.produces clear and direct inferences, and 4. makes use of all availableinformation." (see John's blog post forelaboration)John K. Kruschke made a similar argument in an Open Letter extolling thebenefits of the bayesianapproach summarisedas: "(1) Scientific disciplines from astronomy to zoology are moving toBayesian data analysis. We should be leaders of the move, not followers. (2) Modern Bayesian methods provide richer information,with greater flexibility and broader applicability than 20th centurymethods. Bayesian methods are intellectually coherent and intuitive.Bayesian analyses are readily computed with modern software and hardware.(3) Null-hypothesis significance testing (NHST), with its reliance on pvalues, has many problems. There is little reason to persist with NHST nowthat Bayesian methods are accessible to everyone."Why JAGS/BUGS rather than coding in a low-level language?
It's simpler; for models that BUGS can handle, BUGS can shield you fromsome of the thorny details related to numeric integration.There are simple interfaces with R.Why JAGS rather than WinBUGS or OpenBUGS?
I'm using JAGS because it works well on Ubuntu. WinBUGS is broadly Windowsspecific, although I've read that it may work with the emulation software Wine.JAGS interfaces well with R. I'm comfortable writingscripts. Thus, I don't personally see the benefits of using a dedicatedGUI like WinBUGS. I can leverage what I know about R. However, ultimately converting code between different flavours of BUGSis not that difficult. For further discussion of the issue, see this r-helpdiscussion andthis discussion on CrossValidated.More than anything I found that JAGS provided a useful entry point into the world ofBayesian modelling. This in turn appealed to me for several reasons:
Even when I perform analyses using an NHST approach I often intuitively thinkof empirical research questions in terms of probability densities on aparameter of interest that changes as empirical and theoretical evidence isaccumulated. See for example Thompson's (2002) concept ofmeta-analytic thinking.Bayesian analysis provides tools for formalising this orientation. More broadly, I appreciate the explicitness that a Bayesian approachrequires and encourages. E.g., specifying the distribution of the error term,specifying a prior, specifying the distribution of parameters in a mixedeffects model, and so on.There are several modelling challenges that I'm currently working throughwhere a Bayesian approach offers substantial flexibility and applicability. In particular, I'm interested in modelling individual differences in theeffect of practice on strategy use and task performance and then relatingthese individual differences to factors like intelligence, prior experience,and personality.JAGS InstallationJAGS runs on Linux, Mac, and Windows.I run JAGS on Ubuntu through an interface with R called rjags.
The following sets out a basic installation process:
If necessary Download and install R andpotentially a user interface to R like R Studio(see here for tips on getting started withR).Download and install JAGS as peroperating system requriements. Install additional R packages: e.g., in R install.packages("rjags") .In particular, I use the packages rjags to interface with JAGS andcoda to process MCMC output.Information on JAGSThe manual for different versions of JAGS is locatedhere.e.g., the pdf of the manual for 3.1.0.Several particularly relevant sections include:the list of supported distributions and how they are parameterised. This isoften important given that the code looks similar to R but often uses different parameterisation (e.g., precision is used instead ofstandard deviation for a normal distribution).It summarises differences between WinBUGS and JAGS. It sets out available functions and operators.The rjags help pdffor information about how to interface with JAGS from R.Martin Plummer has a blog called JAGS NEWSThe Bayesian Task View on CRAN lists and brieflydescribes the many R packages related to Bayesian statistics. Lunn and colleagues have a 2009 article calledThe BUGS project: Evolution, critique and future directions.It provides a useful historical perspective on the broader BUGS project,although it does not mention much about JAGS specifically.Examples JAGS ScriptsI find it easier to pick up a new language by playing with examples.The following provides links to example JAGS code, often with accompanyingexplanations:
John Myles White A course on statistical models that is under development with JAGS scripts on githubA model of Cannabalt scores using a gammadistributionSimple introductory examples of fitting a normal distribution, linearregression, and logisticregressionA follow-up post demonstrating the use of the coda package with rjagsto perform MCMCdiagnostics.John K. KruschkeJohn Krushke wrote a book called Doing Bayesian Data Analysis: A Tutorial with Rand BUGS. It's an excellent entry point into the world of Bayesianstatistics for the social and behavioural scientist who hasreasonable quantiative training, but is not necessarily ready to absorbthe kinds of books that are used in graduate-level statistics courses.The book has awebsite thatprovides all the examples used in the book all the examples used in thebook. See this blog post for a link to the zip file containing the JAGScode.BUGS ProjectBUGS is well known for the large set of examples that accompany the project.The PDF providing documentation for Volume 1 and 2 of the examples is availablehere.You can see the JAGS code used to run these exampleshere.Patrick J Mineault
An example from Gelman et al examining the effect of training programs onSATscoresMiguel Lobo
A short tutorialSimon JackmanSimon Jackman wrote the book Bayesian Analysis for the Social Sciencesthat has accompanying JAGS code.The book's website has severaluseful resources including example papers using Bayesian methods.An associated coursethat uses the book as a text book has slides and many examples of usingand R and JAGS.Johannes Karreth
A course on applied bayesian modelling with examples of data, and code using the R2jags interface.Myself
I also plan to post a few examples in upcoming blog posts. I typicallywill share the code for these on my github account:jeromyanglim. If you are reading thisthrough syndication you may wish to subscribe to the RSS feed of thesource blog jeromyanglim.blogspot.com.More broadly, examples and tutorials designed for WinBUGS can generally beadapted to be useful for JAGS. So for example, you can explore these WinBUGSexamples:
Michael Lee and Eric-Jan Wagnemakers have a free online book called A Course in Bayesian Graphical Modelingfor Cognitive Science: see PDF andwebsite.The website for the book Markov Chain MonteCarlo has several WinBUGS examples.There is an extensive list of BUGSresources onthe BUGS project website.Asking questionsThere are several places to ask questions about JAGS, R, and Bayesianstatistics.
JAGS, BUGS, and bayesian questionson stats.stackexchange.com (aka CrossValidated). JAGS discussion forumThere's also a BUGS discussionlistIn general, I prefer the Stack Exchange model for asking and answering questionson the internet, although the most important issue is typically where theexperts are located.
Interesting Psychological Applications of Bayesian ModellingIf you want to see some examples of Bayesian modelling applied to psychologicaldata, I found the following articles quite interesting. PDFs are available online.
Shiffrin, Lee, Kim, and Wagenmakers (2008, PDF) present a tutorial on hierarchical bayesian methods in the context of cognitive science.Michael Lee (2011, PDF) in Journal of Mathematical Psychologydiscusses the benefits of hiearchical Bayesian methods to modellingpsychological data and provides several example applications.Lee Averell and Andrew Heathcote (2010,PDF) in Journal ofMathematical Psychology analyse individual differences in the forgetting curveusing a hierarchical Bayesian approach.If you know of any other interesting JAGS resources or have any comments about mychoice of software for Bayesian data analysis, feel free to post a comment.
To leave a comment for the author, please follow the link and comment on his blog: Jeromy Anglim's Blog: Psychology and Statistics.
R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series,ecdf, trading) and more...
R_bloggers
Bayesian
JAGS
R
from google
This post provides links to various resources on getting started with Bayesianmodelling using JAGS and R.It discusses: (1) what is JAGS;(2) why you might want to perform Bayesian modelling using JAGS;(3) how to install JAGS;(4) where to find further information on JAGS;(5) where to find examples of JAGS scripts in action;(6) where to ask questions; and(7) some interesting psychological applications of Bayesian modelling.
What is JAGS?JAGS stands for Just Another Gibbs Sampler.To quote the program author, Martyn Plummer, "It is a program for analysis ofBayesian hierarchical models using Markov Chain Monte Carlo (MCMC) simulation..." It uses a dialect of the BUGS language, similar but a little different to OpenBUGS and WinBUGS.
Why JAGS?The question of why you might want to use JAGS can be approached in several different ways:
Why Bayesian rather than Null Hypothesis Significance Testing (NHST) approaches?
To quote John D. Cook quoting Anthony O'Hagan, the benefits of "the bayesianapproach are that it is 1. fundamentally sound, 2. very flexible, 3.produces clear and direct inferences, and 4. makes use of all availableinformation." (see John's blog post forelaboration)John K. Kruschke made a similar argument in an Open Letter extolling thebenefits of the bayesianapproach summarisedas: "(1) Scientific disciplines from astronomy to zoology are moving toBayesian data analysis. We should be leaders of the move, not followers. (2) Modern Bayesian methods provide richer information,with greater flexibility and broader applicability than 20th centurymethods. Bayesian methods are intellectually coherent and intuitive.Bayesian analyses are readily computed with modern software and hardware.(3) Null-hypothesis significance testing (NHST), with its reliance on pvalues, has many problems. There is little reason to persist with NHST nowthat Bayesian methods are accessible to everyone."Why JAGS/BUGS rather than coding in a low-level language?
It's simpler; for models that BUGS can handle, BUGS can shield you fromsome of the thorny details related to numeric integration.There are simple interfaces with R.Why JAGS rather than WinBUGS or OpenBUGS?
I'm using JAGS because it works well on Ubuntu. WinBUGS is broadly Windowsspecific, although I've read that it may work with the emulation software Wine.JAGS interfaces well with R. I'm comfortable writingscripts. Thus, I don't personally see the benefits of using a dedicatedGUI like WinBUGS. I can leverage what I know about R. However, ultimately converting code between different flavours of BUGSis not that difficult. For further discussion of the issue, see this r-helpdiscussion andthis discussion on CrossValidated.More than anything I found that JAGS provided a useful entry point into the world ofBayesian modelling. This in turn appealed to me for several reasons:
Even when I perform analyses using an NHST approach I often intuitively thinkof empirical research questions in terms of probability densities on aparameter of interest that changes as empirical and theoretical evidence isaccumulated. See for example Thompson's (2002) concept ofmeta-analytic thinking.Bayesian analysis provides tools for formalising this orientation. More broadly, I appreciate the explicitness that a Bayesian approachrequires and encourages. E.g., specifying the distribution of the error term,specifying a prior, specifying the distribution of parameters in a mixedeffects model, and so on.There are several modelling challenges that I'm currently working throughwhere a Bayesian approach offers substantial flexibility and applicability. In particular, I'm interested in modelling individual differences in theeffect of practice on strategy use and task performance and then relatingthese individual differences to factors like intelligence, prior experience,and personality.JAGS InstallationJAGS runs on Linux, Mac, and Windows.I run JAGS on Ubuntu through an interface with R called rjags.
The following sets out a basic installation process:
If necessary Download and install R andpotentially a user interface to R like R Studio(see here for tips on getting started withR).Download and install JAGS as peroperating system requriements. Install additional R packages: e.g., in R install.packages("rjags") .In particular, I use the packages rjags to interface with JAGS andcoda to process MCMC output.Information on JAGSThe manual for different versions of JAGS is locatedhere.e.g., the pdf of the manual for 3.1.0.Several particularly relevant sections include:the list of supported distributions and how they are parameterised. This isoften important given that the code looks similar to R but often uses different parameterisation (e.g., precision is used instead ofstandard deviation for a normal distribution).It summarises differences between WinBUGS and JAGS. It sets out available functions and operators.The rjags help pdffor information about how to interface with JAGS from R.Martin Plummer has a blog called JAGS NEWSThe Bayesian Task View on CRAN lists and brieflydescribes the many R packages related to Bayesian statistics. Lunn and colleagues have a 2009 article calledThe BUGS project: Evolution, critique and future directions.It provides a useful historical perspective on the broader BUGS project,although it does not mention much about JAGS specifically.Examples JAGS ScriptsI find it easier to pick up a new language by playing with examples.The following provides links to example JAGS code, often with accompanyingexplanations:
John Myles White A course on statistical models that is under development with JAGS scripts on githubA model of Cannabalt scores using a gammadistributionSimple introductory examples of fitting a normal distribution, linearregression, and logisticregressionA follow-up post demonstrating the use of the coda package with rjagsto perform MCMCdiagnostics.John K. KruschkeJohn Krushke wrote a book called Doing Bayesian Data Analysis: A Tutorial with Rand BUGS. It's an excellent entry point into the world of Bayesianstatistics for the social and behavioural scientist who hasreasonable quantiative training, but is not necessarily ready to absorbthe kinds of books that are used in graduate-level statistics courses.The book has awebsite thatprovides all the examples used in the book all the examples used in thebook. See this blog post for a link to the zip file containing the JAGScode.BUGS ProjectBUGS is well known for the large set of examples that accompany the project.The PDF providing documentation for Volume 1 and 2 of the examples is availablehere.You can see the JAGS code used to run these exampleshere.Patrick J Mineault
An example from Gelman et al examining the effect of training programs onSATscoresMiguel Lobo
A short tutorialSimon JackmanSimon Jackman wrote the book Bayesian Analysis for the Social Sciencesthat has accompanying JAGS code.The book's website has severaluseful resources including example papers using Bayesian methods.An associated coursethat uses the book as a text book has slides and many examples of usingand R and JAGS.Johannes Karreth
A course on applied bayesian modelling with examples of data, and code using the R2jags interface.Myself
I also plan to post a few examples in upcoming blog posts. I typicallywill share the code for these on my github account:jeromyanglim. If you are reading thisthrough syndication you may wish to subscribe to the RSS feed of thesource blog jeromyanglim.blogspot.com.More broadly, examples and tutorials designed for WinBUGS can generally beadapted to be useful for JAGS. So for example, you can explore these WinBUGSexamples:
Michael Lee and Eric-Jan Wagnemakers have a free online book called A Course in Bayesian Graphical Modelingfor Cognitive Science: see PDF andwebsite.The website for the book Markov Chain MonteCarlo has several WinBUGS examples.There is an extensive list of BUGSresources onthe BUGS project website.Asking questionsThere are several places to ask questions about JAGS, R, and Bayesianstatistics.
JAGS, BUGS, and bayesian questionson stats.stackexchange.com (aka CrossValidated). JAGS discussion forumThere's also a BUGS discussionlistIn general, I prefer the Stack Exchange model for asking and answering questionson the internet, although the most important issue is typically where theexperts are located.
Interesting Psychological Applications of Bayesian ModellingIf you want to see some examples of Bayesian modelling applied to psychologicaldata, I found the following articles quite interesting. PDFs are available online.
Shiffrin, Lee, Kim, and Wagenmakers (2008, PDF) present a tutorial on hierarchical bayesian methods in the context of cognitive science.Michael Lee (2011, PDF) in Journal of Mathematical Psychologydiscusses the benefits of hiearchical Bayesian methods to modellingpsychological data and provides several example applications.Lee Averell and Andrew Heathcote (2010,PDF) in Journal ofMathematical Psychology analyse individual differences in the forgetting curveusing a hierarchical Bayesian approach.If you know of any other interesting JAGS resources or have any comments about mychoice of software for Bayesian data analysis, feel free to post a comment.
To leave a comment for the author, please follow the link and comment on his blog: Jeromy Anglim's Blog: Psychology and Statistics.
R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series,ecdf, trading) and more...
7 weeks ago by rahuldave
An Introduction to Social Network Analysis with R and NetDraw
7 weeks ago by rahuldave
With the rise in the use of social media, data related to social networks is ripe for analysis using techniques from social network analysis and graph theory. According to International Network for Social Network Analysis, ‘Social network analysis is...
R_bloggers
R
R_code
social_network_analysis
from google
7 weeks ago by rahuldave
Piggybacking and Hopefully Publicizing R Experts
7 weeks ago by rahuldave
(This article was first published on Timely Portfolio, and kindly contributed to R-bloggers)
I was inspired by the Revolution Analytics blog post http://blog.revolutionanalytics.com/2009/11/charting-time-series-as-calendar-heat-maps-in-r.html on the d3.js style calendar heat map that Paul Bleicher from Humedica developed in R. In an effort to publicize such fine work, I wanted to put a slightly different spin on it to visualize a system’s time in the market. The system is ridiculously basic (not investment advice and will lose money), but the visualization is very helpful.
From TimelyPortfolio To continue with the theme, I would like to continue to highlight some fine R work from ªªhttp://systematicinvestor.wordpress.com. ºº Although his work is far more sophisticated than this, I thought I would use his plota function to plot the German Dax (used in this example) with RSI below.
From TimelyPortfolio Thanks so much to the amazing R coders who provided the code and functions for this post.
R code from GIST:
To leave a comment for the author, please follow the link and comment on his blog: Timely Portfolio.
R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series,ecdf, trading) and more...
R_bloggers
lattice
R
systematic_investor
from google
I was inspired by the Revolution Analytics blog post http://blog.revolutionanalytics.com/2009/11/charting-time-series-as-calendar-heat-maps-in-r.html on the d3.js style calendar heat map that Paul Bleicher from Humedica developed in R. In an effort to publicize such fine work, I wanted to put a slightly different spin on it to visualize a system’s time in the market. The system is ridiculously basic (not investment advice and will lose money), but the visualization is very helpful.
From TimelyPortfolio To continue with the theme, I would like to continue to highlight some fine R work from ªªhttp://systematicinvestor.wordpress.com. ºº Although his work is far more sophisticated than this, I thought I would use his plota function to plot the German Dax (used in this example) with RSI below.
From TimelyPortfolio Thanks so much to the amazing R coders who provided the code and functions for this post.
R code from GIST:
To leave a comment for the author, please follow the link and comment on his blog: Timely Portfolio.
R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series,ecdf, trading) and more...
7 weeks ago by rahuldave
Writing reproducibly in the open with knitr
7 weeks ago by rahuldave
Sweave is something of a gold standard in reproducible research. It creates a dynamic document, written in a mix of LaTeX and R code where the results of the analysis (numbers, figures, tables) are automatically generated from the code and inserted into the resulting pdf document, making them easy to update if the data or methods change. It’s a nice idea, in principle.
However, the practical troubles are many. Coauthors don’t know LaTeX, publishers who don’t accept LaTeX or pdfs. The LaTeX myth that you are freed from thinking about formatting, when in fact you have to fill your document with LaTeX specific markup that makes it a burden both to type and to read the source-code. Compiling and debugging your text. And then the reproducibility comes from sharing that Sweave file — a mix of LaTeX and R that almost no one can read easily. Where’s the elegance in that? 1 Sure, none of these are show-stoppers — I’ve been content with LaTeX for years — but suddenly there’s a better way.
Thanks to knitr, a successor of Sweave, I can write my publications in markdown. Unlike LaTeX, HTML, or other markup languages, markdown is designed to be easily read as plain text, but can also be interpreted into pretty HTML, and now into almost any other format thanks to pandoc. All of which is to say that writing and sharing just got a lot easier.
As I have written previously, I already use this markdown format for my notes and code, so there’s no re-typing required. When working on the paper, I can just write. I can edit the code without flipping back and forth between files. Knitr can run the code blocks, caching parts that have already run for efficiency, and upload the resulting figures in png format automatically to the Internet. Github displays the resulting document and the /source, while also tracking the versions as my writing progresses.
Pandoc allows me to transform these notes into a LaTeX file that can generate professional-looking pdfs with given journal .cls files by using a custom latex template. Pandoc can also generate the less pretty but often required word documents. A separate Rscript combines with a Makefile to control the relevant formatting — for LaTeX output I want high-quality pdf graphics, for Word-doc output I want eps graphics which are created but not pasted into the Word file, for the drafts I want png graphics stored online for easy sharing. Pandoc allows citations to be extracted from my Mendeley library (via Bibtex files) and inserted into each of these output formats (doc, pdf, github markdown).
Getting the LaTeX template, Makefile, and knit script set up for this pipeline takes a little care — mostly to ensure figures and tables look appropriate in all outputs. Once these files are created though, they can be easily reused on other manuscripts. A simple make pdf builds the pdf copy, make docx builds a MS Word copy,2 and make github the copy that displays with images on Github.
The links in this post point to what is an active draft of a little manuscript at the time of this writing. In addition to making the final result reproducible, Github captures the provenance or history of the research and writing process. It’s not a perfect system, but it’s a nice step.
I’m glossing over the additional challenges of highlighting, caching, and formating on the R code side, which have been largely addressed by additional packages and are elegantly solved in knitr.though these binary files aren’t stored in the git repository
Open_Notebook_Thoughts
R
from google
However, the practical troubles are many. Coauthors don’t know LaTeX, publishers who don’t accept LaTeX or pdfs. The LaTeX myth that you are freed from thinking about formatting, when in fact you have to fill your document with LaTeX specific markup that makes it a burden both to type and to read the source-code. Compiling and debugging your text. And then the reproducibility comes from sharing that Sweave file — a mix of LaTeX and R that almost no one can read easily. Where’s the elegance in that? 1 Sure, none of these are show-stoppers — I’ve been content with LaTeX for years — but suddenly there’s a better way.
Thanks to knitr, a successor of Sweave, I can write my publications in markdown. Unlike LaTeX, HTML, or other markup languages, markdown is designed to be easily read as plain text, but can also be interpreted into pretty HTML, and now into almost any other format thanks to pandoc. All of which is to say that writing and sharing just got a lot easier.
As I have written previously, I already use this markdown format for my notes and code, so there’s no re-typing required. When working on the paper, I can just write. I can edit the code without flipping back and forth between files. Knitr can run the code blocks, caching parts that have already run for efficiency, and upload the resulting figures in png format automatically to the Internet. Github displays the resulting document and the /source, while also tracking the versions as my writing progresses.
Pandoc allows me to transform these notes into a LaTeX file that can generate professional-looking pdfs with given journal .cls files by using a custom latex template. Pandoc can also generate the less pretty but often required word documents. A separate Rscript combines with a Makefile to control the relevant formatting — for LaTeX output I want high-quality pdf graphics, for Word-doc output I want eps graphics which are created but not pasted into the Word file, for the drafts I want png graphics stored online for easy sharing. Pandoc allows citations to be extracted from my Mendeley library (via Bibtex files) and inserted into each of these output formats (doc, pdf, github markdown).
Getting the LaTeX template, Makefile, and knit script set up for this pipeline takes a little care — mostly to ensure figures and tables look appropriate in all outputs. Once these files are created though, they can be easily reused on other manuscripts. A simple make pdf builds the pdf copy, make docx builds a MS Word copy,2 and make github the copy that displays with images on Github.
The links in this post point to what is an active draft of a little manuscript at the time of this writing. In addition to making the final result reproducible, Github captures the provenance or history of the research and writing process. It’s not a perfect system, but it’s a nice step.
I’m glossing over the additional challenges of highlighting, caching, and formating on the R code side, which have been largely addressed by additional packages and are elegantly solved in knitr.though these binary files aren’t stored in the git repository
7 weeks ago by rahuldave
Dynamite plots in R
7 weeks ago by rahuldave
(This article was first published on The Praise of Insects, and kindly contributed to R-bloggers)
For some time I've contemplated creating a function for creating the dynamite plots beloved by many of the applied sciences. There's a lot of criticism regarding their utility, and there are several ways that present data in a more intelligible way. A search on the subject brings up pages with such emotive titles as "Dynamite plots: unmitigated evil?" and "Why dynamite plots are BAD". The "Beware of Dynamite" poster sums up the main problem with dynamite plots by concluding "Intentionally or not, a dynamite plot hides more than it reveals".All that said, I'm an R advocate. If being able to to create dynamite plots is going to encourage people to use R, I can cope with their inadequacies. Here's hoping that the paucity of information required to create them makes people reconsider.dynamitePlot <- function(height, error, names = NA, significance = NA, ylim = c(0,maxLim), ...){ maxLim <- 1.1* max(mapply(sum, height, error)) bp <- barplot(height, names.arg = names, ylim = ylim, ...) arrows(x0 = bp, y0 = height, y1 = height + error, angle = 90) text(x = bp, y = 0.2 + height + error, labels = significance)}Values <- c(1,2,5,4)Errors <- c(0.25, 0.5, 0.33, 0.12)Names <- paste("Trial", 1:4)Sig <- c("a", "a", "b", "b")dynamitePlot(Values, Errors, names = Names, significance = Sig) The result looks like this: The code is also available on gitHub.
To leave a comment for the author, please follow the link and comment on his blog: The Praise of Insects.
R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series,ecdf, trading) and more...
R_bloggers
R
from google
For some time I've contemplated creating a function for creating the dynamite plots beloved by many of the applied sciences. There's a lot of criticism regarding their utility, and there are several ways that present data in a more intelligible way. A search on the subject brings up pages with such emotive titles as "Dynamite plots: unmitigated evil?" and "Why dynamite plots are BAD". The "Beware of Dynamite" poster sums up the main problem with dynamite plots by concluding "Intentionally or not, a dynamite plot hides more than it reveals".All that said, I'm an R advocate. If being able to to create dynamite plots is going to encourage people to use R, I can cope with their inadequacies. Here's hoping that the paucity of information required to create them makes people reconsider.dynamitePlot <- function(height, error, names = NA, significance = NA, ylim = c(0,maxLim), ...){ maxLim <- 1.1* max(mapply(sum, height, error)) bp <- barplot(height, names.arg = names, ylim = ylim, ...) arrows(x0 = bp, y0 = height, y1 = height + error, angle = 90) text(x = bp, y = 0.2 + height + error, labels = significance)}Values <- c(1,2,5,4)Errors <- c(0.25, 0.5, 0.33, 0.12)Names <- paste("Trial", 1:4)Sig <- c("a", "a", "b", "b")dynamitePlot(Values, Errors, names = Names, significance = Sig) The result looks like this: The code is also available on gitHub.
To leave a comment for the author, please follow the link and comment on his blog: The Praise of Insects.
R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series,ecdf, trading) and more...
7 weeks ago by rahuldave
Gaussian process regression with R
8 weeks ago by rahuldave
(This article was first published on James Keirstead » R, and kindly contributed to R-bloggers)
I’m currently working my way through Rasmussen and Williams’s book on Gaussian processes. It’s another one of those topics that seems to crop up a lot these days, particularly around control strategies for energy systems, and thought I should be able to at least perform basic analyses with this method.
While the book is sensibly laid-out and pretty comprehensive in its choice of topics, it is also a very hard read. My linear algebra may be rusty but I’ve heard some mathematicians describe the conventions used in the book as “an affront to notation”. So just be aware that if you try to work through the book, you will need to be patient. It’s not a cookbook that clearly spells out how to do everything step-by-step.
That said, I have now worked through the basics of Gaussian process regression as described in Chapter 2 and I want to share my code with you here. As always, I’m doing this in R and if you search CRAN, you will find a specific package for Gaussian process regression: gptk. The implementation shown below is much slower than the gptk functions, but by doing things manually I hope you will find it easier to understand what’s actually going on. The full code is given below and is available Github. (PS anyone know how to embed only a few lines from a gist?)
Step 1: Generating functions
With a standard univariate statistical distribution, we draw single values. By contrast, a Gaussian process can be thought of as a distribution of functions. So the first thing we need to do is set up some code that enables us to generate these functions. The code at the bottom shows how to do this and hopefully it is pretty self-explanatory. The result is basically the same as Figure 2.2(a) in Rasmussen and Williams, although with a different random seed and plotting settings.
Example of functions from a Gaussian process
Step 2: Fitting the process to noise-free data
Now let’s assume that we have a number of fixed data points. In other words, our Gaussian process is again generating lots of different functions but we know that each draw must pass through some given points. For now, we will assume that these points are perfectly known. In the code, I’ve tried to use variable names that match the notation in the book. In the resulting plot, which corresponds to Figure 2.2(b) in Rasmussen and Williams, we can see the explicit samples from the process, along with the mean function in red, and the constraining data points.
Example of Gaussian process trained on noise-free data
Step 3: Fitting the process to noisy data
The next extension is to assume that the constraining data points are not perfectly known. Instead we assume that they have some amount of normally-distributed noise associated with them. This case is discussed on page 16 of the book, although an explicit plot isn’t shown. The code and resulting plot is shown below; again, we include the individual sampled functions, the mean function, and the data points (this time with error bars to signify 95% confidence intervals).
Example of Gaussian process trained on noisy data
Next steps
Hopefully that will give you a starting point for implementating Gaussian process regression in R. There are several further steps that could be taken now including:
Changing the squared exponential covariance function to include the signal and noise variance parameters, in addition to the length scale shown here.
Speed up the code by using the Cholesky decomposition, as described in Algorithm 2.1 on page 19.
Try to implement the same regression using the gptk package. Sadly the documentation is also quite sparse here, but if you look in the source files at the various demo* files, you should be able to figure out what’s going on.
The code
To leave a comment for the author, please follow the link and comment on his blog: James Keirstead » R.
R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series,ecdf, trading) and more...
R_bloggers
modelling
R
rstats
from google
I’m currently working my way through Rasmussen and Williams’s book on Gaussian processes. It’s another one of those topics that seems to crop up a lot these days, particularly around control strategies for energy systems, and thought I should be able to at least perform basic analyses with this method.
While the book is sensibly laid-out and pretty comprehensive in its choice of topics, it is also a very hard read. My linear algebra may be rusty but I’ve heard some mathematicians describe the conventions used in the book as “an affront to notation”. So just be aware that if you try to work through the book, you will need to be patient. It’s not a cookbook that clearly spells out how to do everything step-by-step.
That said, I have now worked through the basics of Gaussian process regression as described in Chapter 2 and I want to share my code with you here. As always, I’m doing this in R and if you search CRAN, you will find a specific package for Gaussian process regression: gptk. The implementation shown below is much slower than the gptk functions, but by doing things manually I hope you will find it easier to understand what’s actually going on. The full code is given below and is available Github. (PS anyone know how to embed only a few lines from a gist?)
Step 1: Generating functions
With a standard univariate statistical distribution, we draw single values. By contrast, a Gaussian process can be thought of as a distribution of functions. So the first thing we need to do is set up some code that enables us to generate these functions. The code at the bottom shows how to do this and hopefully it is pretty self-explanatory. The result is basically the same as Figure 2.2(a) in Rasmussen and Williams, although with a different random seed and plotting settings.
Example of functions from a Gaussian process
Step 2: Fitting the process to noise-free data
Now let’s assume that we have a number of fixed data points. In other words, our Gaussian process is again generating lots of different functions but we know that each draw must pass through some given points. For now, we will assume that these points are perfectly known. In the code, I’ve tried to use variable names that match the notation in the book. In the resulting plot, which corresponds to Figure 2.2(b) in Rasmussen and Williams, we can see the explicit samples from the process, along with the mean function in red, and the constraining data points.
Example of Gaussian process trained on noise-free data
Step 3: Fitting the process to noisy data
The next extension is to assume that the constraining data points are not perfectly known. Instead we assume that they have some amount of normally-distributed noise associated with them. This case is discussed on page 16 of the book, although an explicit plot isn’t shown. The code and resulting plot is shown below; again, we include the individual sampled functions, the mean function, and the data points (this time with error bars to signify 95% confidence intervals).
Example of Gaussian process trained on noisy data
Next steps
Hopefully that will give you a starting point for implementating Gaussian process regression in R. There are several further steps that could be taken now including:
Changing the squared exponential covariance function to include the signal and noise variance parameters, in addition to the length scale shown here.
Speed up the code by using the Cholesky decomposition, as described in Algorithm 2.1 on page 19.
Try to implement the same regression using the gptk package. Sadly the documentation is also quite sparse here, but if you look in the source files at the various demo* files, you should be able to figure out what’s going on.
The code
To leave a comment for the author, please follow the link and comment on his blog: James Keirstead » R.
R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series,ecdf, trading) and more...
8 weeks ago by rahuldave
Add a frame to a map
8 weeks ago by rahuldave
(This article was first published on me nugget, and kindly contributed to R-bloggers)
Here is a function that adds a frame of alternating colors to a map (un-projected). One defines the extension of each bar (in degrees) and an optional width of the bars (in inches). It uses the "joinPolys" function of the package to trim the bars near the map corners where the axes meet.the map.frame function:#bar.width is the width of the frame in inches#deg.ext is the extention of the frame segments in degrees#other parameters from polygon can be passed to the frame#requires PBSmapping package (function "joinPolys")map.frame <- function(bar.width=NULL, deg.ext=1, ...){ if(missing(bar.width)) bar.width <- mean(par()$pin)*0.02 usr <- par()$usr bar.width.x <- bar.width/par()$pin[1] * (usr[2]-usr[1]) bar.width.y <- bar.width/par()$pin[2] * (usr[4]-usr[3]) bar.lims.x <- seq(-180,180,deg.ext) bar.lims.y <- seq(-90,90,deg.ext) is.wholenumber <- function(x, tol = .Machine$double.eps^0.5) abs(x - round(x)) < tol bar.bottom <- data.frame(PID=rep(1, 4), POS=1:4, X=c(usr[1],usr[1]+bar.width.y,usr[2]-bar.width.y,usr[2]), Y=c(usr[3],usr[3]+bar.width.x,usr[3]+bar.width.x,usr[3])) bar.top <- data.frame(PID=rep(1, 4), POS=1:4, X=c(usr[1],usr[1]+bar.width.y,usr[2]-bar.width.y,usr[2]), Y=c(usr[4],usr[4]-bar.width.x,usr[4]-bar.width.x,usr[4])) bar.left <- data.frame(PID=rep(1, 4), POS=1:4, X=c(usr[1],usr[1],usr[1]+bar.width.y,usr[1]+bar.width.y), Y=c(usr[3],usr[4], usr[4]-bar.width.x,usr[3]+bar.width.x)) bar.right <- data.frame(PID=rep(1, 4), POS=1:4, X=c(usr[2],usr[2]-bar.width.y,usr[2]-bar.width.y,usr[2]), Y=c(usr[3],usr[3]+bar.width.x,usr[4]-bar.width.x,usr[4])) #X axis for(i in seq(length(bar.lims.x)-1)){ xs <- c(bar.lims.x[i], bar.lims.x[i], bar.lims.x[i+1], bar.lims.x[i+1]) #bottom ys <- c(usr[3], usr[3]+bar.width.y, usr[3]+bar.width.y, usr[3]) bottom <- data.frame(PID=rep(1, 4), POS=1:4, X=xs, Y=ys) bottom.join <- joinPolys(bottom,bar.bottom) #top ys <- c(usr[4]-bar.width.y, usr[4], usr[4], usr[4]-bar.width.y) top <- data.frame(PID=rep(1, 4), POS=1:4, X=xs, Y=ys) top.join <- joinPolys(top,bar.top) tmp.col <- ifelse(is.wholenumber(i/2), "black", "white") polygon(bottom.join$X, bottom.join$Y, col=tmp.col, ...) polygon(top.join$X, top.join$Y, col=tmp.col, ...) } #Y axis for(i in seq(length(bar.lims.y)-1)){ ys <- c(bar.lims.y[i], bar.lims.y[i], bar.lims.y[i+1], bar.lims.y[i+1]) #left xs <- c(usr[1], usr[1]+bar.width.x, usr[1]+bar.width.x, usr[1]) left <- data.frame(PID=rep(1, 4), POS=1:4, X=xs, Y=ys) left.join <- joinPolys(left,bar.left) #right xs <- c(usr[2], usr[2]-bar.width.x, usr[2]-bar.width.x, usr[2]) right <- data.frame(PID=rep(1, 4), POS=1:4, X=xs, Y=ys) right.join <- joinPolys(right,bar.right) tmp.col <- ifelse(is.wholenumber(i/2), "black", "white") polygon(left.join$X, left.join$Y, col=tmp.col, ...) polygon(right.join$X, right.join$Y, col=tmp.col, ...) } box()}Created by Pretty R at inside-R.orgthe code to reproduce the map:#required packagesrequire(maps)require(PBSmapping) #required functions (from "www.menugget.blogspot.com")source("map.frame.R") #example plotpng("worldmap_w_frame.png", width=8, height=4, units="in", res=400)par(mar=c(4,4,1,1))plot(0,0,t="n", xlim=c(-180, 180),ylim=c(-80,80), xlab="", ylab="", xaxs="i", yaxs="i", xaxt="n", yaxt="n")map("world", add=TRUE, fill=TRUE, col="grey90", lwd=0.5)axis(1, at=seq(-150, 150, 30), line=-0.5, lwd = 0)axis(2, at=seq(-60, 60, 30), line=-0.5, lwd = 0)abline(h=seq(-90,90,10), lty=3, col="grey")abline(v=seq(-180,180,10), lty=3, col="grey")map.frame(deg.ext=30)dev.off()Created by Pretty R at inside-R.org
To leave a comment for the author, please follow the link and comment on his blog: me nugget.
R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series,ecdf, trading) and more...
R_bloggers
map
R
r-project
spatial
from google
Here is a function that adds a frame of alternating colors to a map (un-projected). One defines the extension of each bar (in degrees) and an optional width of the bars (in inches). It uses the "joinPolys" function of the package to trim the bars near the map corners where the axes meet.the map.frame function:#bar.width is the width of the frame in inches#deg.ext is the extention of the frame segments in degrees#other parameters from polygon can be passed to the frame#requires PBSmapping package (function "joinPolys")map.frame <- function(bar.width=NULL, deg.ext=1, ...){ if(missing(bar.width)) bar.width <- mean(par()$pin)*0.02 usr <- par()$usr bar.width.x <- bar.width/par()$pin[1] * (usr[2]-usr[1]) bar.width.y <- bar.width/par()$pin[2] * (usr[4]-usr[3]) bar.lims.x <- seq(-180,180,deg.ext) bar.lims.y <- seq(-90,90,deg.ext) is.wholenumber <- function(x, tol = .Machine$double.eps^0.5) abs(x - round(x)) < tol bar.bottom <- data.frame(PID=rep(1, 4), POS=1:4, X=c(usr[1],usr[1]+bar.width.y,usr[2]-bar.width.y,usr[2]), Y=c(usr[3],usr[3]+bar.width.x,usr[3]+bar.width.x,usr[3])) bar.top <- data.frame(PID=rep(1, 4), POS=1:4, X=c(usr[1],usr[1]+bar.width.y,usr[2]-bar.width.y,usr[2]), Y=c(usr[4],usr[4]-bar.width.x,usr[4]-bar.width.x,usr[4])) bar.left <- data.frame(PID=rep(1, 4), POS=1:4, X=c(usr[1],usr[1],usr[1]+bar.width.y,usr[1]+bar.width.y), Y=c(usr[3],usr[4], usr[4]-bar.width.x,usr[3]+bar.width.x)) bar.right <- data.frame(PID=rep(1, 4), POS=1:4, X=c(usr[2],usr[2]-bar.width.y,usr[2]-bar.width.y,usr[2]), Y=c(usr[3],usr[3]+bar.width.x,usr[4]-bar.width.x,usr[4])) #X axis for(i in seq(length(bar.lims.x)-1)){ xs <- c(bar.lims.x[i], bar.lims.x[i], bar.lims.x[i+1], bar.lims.x[i+1]) #bottom ys <- c(usr[3], usr[3]+bar.width.y, usr[3]+bar.width.y, usr[3]) bottom <- data.frame(PID=rep(1, 4), POS=1:4, X=xs, Y=ys) bottom.join <- joinPolys(bottom,bar.bottom) #top ys <- c(usr[4]-bar.width.y, usr[4], usr[4], usr[4]-bar.width.y) top <- data.frame(PID=rep(1, 4), POS=1:4, X=xs, Y=ys) top.join <- joinPolys(top,bar.top) tmp.col <- ifelse(is.wholenumber(i/2), "black", "white") polygon(bottom.join$X, bottom.join$Y, col=tmp.col, ...) polygon(top.join$X, top.join$Y, col=tmp.col, ...) } #Y axis for(i in seq(length(bar.lims.y)-1)){ ys <- c(bar.lims.y[i], bar.lims.y[i], bar.lims.y[i+1], bar.lims.y[i+1]) #left xs <- c(usr[1], usr[1]+bar.width.x, usr[1]+bar.width.x, usr[1]) left <- data.frame(PID=rep(1, 4), POS=1:4, X=xs, Y=ys) left.join <- joinPolys(left,bar.left) #right xs <- c(usr[2], usr[2]-bar.width.x, usr[2]-bar.width.x, usr[2]) right <- data.frame(PID=rep(1, 4), POS=1:4, X=xs, Y=ys) right.join <- joinPolys(right,bar.right) tmp.col <- ifelse(is.wholenumber(i/2), "black", "white") polygon(left.join$X, left.join$Y, col=tmp.col, ...) polygon(right.join$X, right.join$Y, col=tmp.col, ...) } box()}Created by Pretty R at inside-R.orgthe code to reproduce the map:#required packagesrequire(maps)require(PBSmapping) #required functions (from "www.menugget.blogspot.com")source("map.frame.R") #example plotpng("worldmap_w_frame.png", width=8, height=4, units="in", res=400)par(mar=c(4,4,1,1))plot(0,0,t="n", xlim=c(-180, 180),ylim=c(-80,80), xlab="", ylab="", xaxs="i", yaxs="i", xaxt="n", yaxt="n")map("world", add=TRUE, fill=TRUE, col="grey90", lwd=0.5)axis(1, at=seq(-150, 150, 30), line=-0.5, lwd = 0)axis(2, at=seq(-60, 60, 30), line=-0.5, lwd = 0)abline(h=seq(-90,90,10), lty=3, col="grey")abline(v=seq(-180,180,10), lty=3, col="grey")map.frame(deg.ext=30)dev.off()Created by Pretty R at inside-R.org
To leave a comment for the author, please follow the link and comment on his blog: me nugget.
R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series,ecdf, trading) and more...
8 weeks ago by rahuldave
Bootstrap example
8 weeks ago by rahuldave
(This article was first published on Eran Raviv » R, and kindly contributed to R-bloggers)
Bootstrap your way into robust inference. Wow, that was fun to write..
Introduction
Say you made a simple regression, now you have your \( \widehat{\beta} \). You wish to know if it is significantly different from (say) zero. In general, people look at the statistic or p.value reported by their software of choice, (heRe). Thing is, this p.value calculation relies on the distribution of your dependent variable. Your software assumes normal distribution if not told differently, how so? for example, the (95%) confidence interval is \( \widehat{\beta} \pm 1.96 \times sd( \widehat{\beta}) \), the 1.96 comes from the normal distribution.
It is advisable not to do that, the beauty in bootstrapping* is that it is distribution untroubled, it’s valid for dependent which is Gaussian, Cauchy, or whatever. You can defend yourself against misspecification, and\or use the tool for inference when the underlying distribution is unknown.
I was not here 40 years ago but they say calculations were slow back then, not any more, take advantage of your Dell or your “meant for you” MacBook. We will see that it is better to add the non in the nonparametric, specifically in financial context, but that holds in general. You can still keep your distribution assumption but at least have a look at what happens when you relax it. The way to do that is by using bootstrap, the idea is intuitive and simple.
The idea
John Fox wrote: “The population is to the sample as the sample is to the bootstrap samples”.. Tadaaam.. but what does this mean? your estimate of \( \beta \) which came from the sample, is suppose to estimate the “real” \( \beta \), the population \( \beta \), which is unknown. Now take a sample from the sample, we call that sample a bootstrap sample, estimate your \( \beta \) according to this (bootstrap)sample, now this new estimate is an estimate for your original \( \widehat{\beta} \), the one coming from the original data. For clarity, say you have 3 observations, first is {x = 0.7,y = 0.6}, second is {whatever}, third is {whatever}, now, an example of sample from the sample would be the shuffled ordering: first is {whatever}, second is {x = 0.7,y = 0.6}, third is {whatever}. This shuffling, or permuted sample is what we call the bootstrap sample. Now we estimate the same statistic (in our case \( \beta \)) again.
Repeat this sample and estimate many times, you have many bootstrapped estimates, now you can check how do these guys behave. One thing you can do with it is to bootstrap confidence intervals (CI) for your estimate, without the need for underlying distributional assumption.
In R
In R, “boot” package will do the trick:
?View Code RSPLUSlibrary(boot) #load the package
# Now we need the function we would like to estimate
# In our case the beta:
betfun = function(data,b,formula){
# b is the random indexes for the bootstrap sample
d = data[b,]
return(lm(d[,1]~d[,2], data = d)$coef[2])
# thats for the beta coefficient
}
# now you can bootstrap:
bootbet = boot(data="your data here", statistic=betfun, R=5000)
# R is how many bootstrap samples
names(bootbet)
plot(bootbet)
hist(bootbet$t, breaks = 100)
You can zoom in on which indices were picked at each bootstrap take, what was the exact permutation, you do that using the function boot.array:
?View Code RSPLUSzoombot = boot.array(bootbet, indices = T)
dim(zoombot)
hist(zoombot[1,], breaks = 100)
# this is the frequency of each index, [1,] for the first
#bootstrap run
More transparent
You can better understand it if you can code it yourself, for a simple problem like bootstrapping confidence intervals it is even simpler and faster:
?View Code RSPLUSnb = 5000 ; bet = NULL ; n = NROW("your data here")
ptm <- proc.time() # have a look at the time it takes
for (i in 1:nb){
unifnum = sample(c(1:n),n,replace = T) # pick random indices
bet[i] = lm(data[unifnum,1]~data[unifnum,2])$coef[2]
}
proc.time() - ptm # about 80 seconds on my end
How bad are your current confidence interval?
Does it really matter? maybe it’s not worth the trouble. As an illustration I use stock returns which are known to have fat tails, meaning more observations far from the center. Take a look at the following figure:
That is the Morgan Stanley \(\widehat{\beta}\) with the market. The estimate is centered at 1.87. The green vertical lines are the (95%) confidence interval reported by the the “lm” function, the red vertical lines are the equivalent nonparametric confidence intervals, the light blue curve is the normal density.
Notice how different is the interval from the nonparametric bootstrap, which is more accurate in this case. For example, it may be that the parameter is actually 2, you can see the software output rejects this possibility since it assumes normality, yet the bootstrap confidence interval indeed covers the value 2. So, an investor who thinks that “in my portfolio all beta’s are smaller than 2 with a 95% CI” is mistaken to hold Morgan Stanley as such. This check takes about 80 seconds, so I would think twice before plugging it in a double “for” loop. However, if you are social scientist or similar, beef up your standard (normal) output with this robust analysis.
Wrap up
Here you can see that when you use the bootstrapped confidence interval, when the normal distribution assumption IS valid, you are not that worse off (though you will need to wait these 80 seconds nontheless ). I created a fake normal return distribution using same center and standard deviation as reported, and made the exact same analysis:
Again, I simulated from a normal distribution using same center and standard deviation as reported by the “lm”, you can see that the intervals are close to each other which concludes this post, using the parametric confidence interval, from the assumed normal distribution is in some sense suboptimal, since you don’t lose much even if it IS normal.
You can see the remaining code and some references below. Thanks for reading.
Comments
* Naturally we only cover the idea here, there is lots of bootstrapping literature, we only tasted nonparametric bootstrap.
Related literature
Bootstrap Methods and their Application (Cambridge Series in Statistical and Probabilistic Mathematics)
An Introduction to the Bootstrap (Chapman & Hall/CRC Monographs on Statistics & Applied Probability)
Resampling Methods: A Practical Guide to Data Analysis
?View Code RSPLUSlibrary(quantmod) ; library(xtable) ; library(tseries)
library(ggplot2) ; library(fields) ; library(grDevices)
tckr = c('MS', 'SPY')
end<- format(Sys.Date(),"%Y-%m-%d") # yyyy-mm-dd
start<-format(Sys.Date() - 365*10,"%Y-%m-%d") # 10 years of data
dat1 = (getSymbols(tckr[1], src="yahoo", from=start, to=end, auto.assign = FALSE))
dat2 = (getSymbols(tckr[2], src="yahoo", from=start, to=end, auto.assign = FALSE))
ret1 = as.numeric((dat1[,4] - dat1[,1])/dat1[,1] )
ret2 = as.numeric((dat2[,4] - dat2[,1])/dat2[,1])
plot(ret1)
length(as.numeric(ret1)) ; length(as.numeric(ret1))
plot(as.numeric(ret1)~as.numeric(ret2))
t1 = as.data.frame(cbind(as.numeric(ret1),as.numeric(ret2)))
names(t1) <- tckr
lm1 = lm(MS~SPY, data = t1) ; summary(lm1)
nb = 5000 ; bet = NULL ; n = NROW(t1)
ptm <- proc.time()
for (i in 1:nb){
unifnum = sample(c(1:n),n,replace = T)
bet[i] = lm(t1[unifnum,1]~t1[unifnum,2])$coef[2]
}
proc.time() - ptm
den = density(bet)
par(mfrow = c(1,1))
bethat <- bquote(Histogram~of ~bold(widehat(beta)))
h = hist(bet, breaks = 100, freq = NULL, probability = F, ylim = c(0,280), xlab = expression(bold(widehat(beta))),
, cex.lab = 1.6, main = bethat )
lwd1 = 2.5
xline(mean(bet), col = 4, lwd = lwd1) ; xline(lm1$coef[2], col = 6, lwd = lwd1)
xline(mean(bet)+1.96*summary(lm1)$coef[2,2], col = 3, lwd = lwd1)
xline(mean(bet)-1.96*summary(lm1)$coef[2,2], col = 3, lwd = lwd1)
xline(quantile(bet,.025), col = 2, lwd = lwd1) ; xline(quantile(bet,.975), col = 2, lwd = lwd1)
xfit<-seq(min(bet),max(bet),length=length(bet))
yfit<-dnorm(xfit,mean=mean(bet),sd=sd(bet))
yfit <- yfit*diff(h$mids[1:2])*length(bet)
lines(xfit, yfit, col="blue", lwd=2)
yfit2<-dnorm(xfit,mean=lm1$coef[2],sd=summary(lm1)$coef[2,2])
yfit2 <- yfit2*diff(h$mids[1:2])*length(bet)
lines(xfit, yfit2, col=5, lwd=2)
###################################################
### Now I generate actually from the normal distribution
###################################################
eps = rnorm(n, mean(ret1), sd(ret1))
fakey = summary(lm1)$coef[1,1]+summary(lm1)$coef[2,1]*ret1 +eps
# hist(fakey, offset=0.16, width=0.13, add=TRUE, col = "green", breaks = 200)
lm2 = lm(fakey~ret1)
fakebet = NULL
ptm <- proc.time()
for (i in 1:nb){
unifnum = sample(c(1:n),n,replace = T)
fakebet[i] = lm(fakey[unifnum]~ret1[unifnum])$coef[2]
}
proc.time() - ptm
fakebethat <- bquote(Histogram~of~fake~bold(widehat(beta)))
h2 = hist(fakebet, breaks = 100, freq = NULL, probability = F, ylim = c(0,470), xlab = expression(bold(widehat(beta))),
, cex.lab = 1.6, main = fakebethat )
xline(mean(fakebet), col = 4, lwd = lwd1)
xline(lm2$coef[2], col = 6, lwd = lwd1)
xline(mean(fakebet)+1.96*summary(lm2)$coef[2,2], col = 3, lwd = lwd1)
xline(mean(fakebet)-1.96*summary(lm2)$coef[2,2], col = 3, lwd = lwd1)
xline(quantile(fakebet,.025), col = 2, lwd = lwd1) ; xline(quantile(fakebet,.975), col = 2, lwd = lwd1)
xfit<-seq(min(fakebet),max(fakebet),length=length(fakebet))
yfit<-dnorm(xfit,mean=mean(fakebet),sd=sd(fakebet))
yfit <- yfit*diff(h2$mids[1:2])*length(fakebet)
lines(xfit, yfit, col="blue", lwd=2)
yfit2<-dnorm(xfit,mean=lm2$coef[2],sd=summary(lm2)$coef[2,2])
yfit2 <- yfit2*diff(h2$mids[1:2])*length(fakebet)
lines(xfit, yfit2, col=5, lwd=2)
To leave a comment for the author, please follow the link and comment on his blog: Eran Raviv » R.
R-bloggers.com offers daily e-mail updates about R news and tutorials on top[…]
R_bloggers
blog
R
from google
Bootstrap your way into robust inference. Wow, that was fun to write..
Introduction
Say you made a simple regression, now you have your \( \widehat{\beta} \). You wish to know if it is significantly different from (say) zero. In general, people look at the statistic or p.value reported by their software of choice, (heRe). Thing is, this p.value calculation relies on the distribution of your dependent variable. Your software assumes normal distribution if not told differently, how so? for example, the (95%) confidence interval is \( \widehat{\beta} \pm 1.96 \times sd( \widehat{\beta}) \), the 1.96 comes from the normal distribution.
It is advisable not to do that, the beauty in bootstrapping* is that it is distribution untroubled, it’s valid for dependent which is Gaussian, Cauchy, or whatever. You can defend yourself against misspecification, and\or use the tool for inference when the underlying distribution is unknown.
I was not here 40 years ago but they say calculations were slow back then, not any more, take advantage of your Dell or your “meant for you” MacBook. We will see that it is better to add the non in the nonparametric, specifically in financial context, but that holds in general. You can still keep your distribution assumption but at least have a look at what happens when you relax it. The way to do that is by using bootstrap, the idea is intuitive and simple.
The idea
John Fox wrote: “The population is to the sample as the sample is to the bootstrap samples”.. Tadaaam.. but what does this mean? your estimate of \( \beta \) which came from the sample, is suppose to estimate the “real” \( \beta \), the population \( \beta \), which is unknown. Now take a sample from the sample, we call that sample a bootstrap sample, estimate your \( \beta \) according to this (bootstrap)sample, now this new estimate is an estimate for your original \( \widehat{\beta} \), the one coming from the original data. For clarity, say you have 3 observations, first is {x = 0.7,y = 0.6}, second is {whatever}, third is {whatever}, now, an example of sample from the sample would be the shuffled ordering: first is {whatever}, second is {x = 0.7,y = 0.6}, third is {whatever}. This shuffling, or permuted sample is what we call the bootstrap sample. Now we estimate the same statistic (in our case \( \beta \)) again.
Repeat this sample and estimate many times, you have many bootstrapped estimates, now you can check how do these guys behave. One thing you can do with it is to bootstrap confidence intervals (CI) for your estimate, without the need for underlying distributional assumption.
In R
In R, “boot” package will do the trick:
?View Code RSPLUSlibrary(boot) #load the package
# Now we need the function we would like to estimate
# In our case the beta:
betfun = function(data,b,formula){
# b is the random indexes for the bootstrap sample
d = data[b,]
return(lm(d[,1]~d[,2], data = d)$coef[2])
# thats for the beta coefficient
}
# now you can bootstrap:
bootbet = boot(data="your data here", statistic=betfun, R=5000)
# R is how many bootstrap samples
names(bootbet)
plot(bootbet)
hist(bootbet$t, breaks = 100)
You can zoom in on which indices were picked at each bootstrap take, what was the exact permutation, you do that using the function boot.array:
?View Code RSPLUSzoombot = boot.array(bootbet, indices = T)
dim(zoombot)
hist(zoombot[1,], breaks = 100)
# this is the frequency of each index, [1,] for the first
#bootstrap run
More transparent
You can better understand it if you can code it yourself, for a simple problem like bootstrapping confidence intervals it is even simpler and faster:
?View Code RSPLUSnb = 5000 ; bet = NULL ; n = NROW("your data here")
ptm <- proc.time() # have a look at the time it takes
for (i in 1:nb){
unifnum = sample(c(1:n),n,replace = T) # pick random indices
bet[i] = lm(data[unifnum,1]~data[unifnum,2])$coef[2]
}
proc.time() - ptm # about 80 seconds on my end
How bad are your current confidence interval?
Does it really matter? maybe it’s not worth the trouble. As an illustration I use stock returns which are known to have fat tails, meaning more observations far from the center. Take a look at the following figure:
That is the Morgan Stanley \(\widehat{\beta}\) with the market. The estimate is centered at 1.87. The green vertical lines are the (95%) confidence interval reported by the the “lm” function, the red vertical lines are the equivalent nonparametric confidence intervals, the light blue curve is the normal density.
Notice how different is the interval from the nonparametric bootstrap, which is more accurate in this case. For example, it may be that the parameter is actually 2, you can see the software output rejects this possibility since it assumes normality, yet the bootstrap confidence interval indeed covers the value 2. So, an investor who thinks that “in my portfolio all beta’s are smaller than 2 with a 95% CI” is mistaken to hold Morgan Stanley as such. This check takes about 80 seconds, so I would think twice before plugging it in a double “for” loop. However, if you are social scientist or similar, beef up your standard (normal) output with this robust analysis.
Wrap up
Here you can see that when you use the bootstrapped confidence interval, when the normal distribution assumption IS valid, you are not that worse off (though you will need to wait these 80 seconds nontheless ). I created a fake normal return distribution using same center and standard deviation as reported, and made the exact same analysis:
Again, I simulated from a normal distribution using same center and standard deviation as reported by the “lm”, you can see that the intervals are close to each other which concludes this post, using the parametric confidence interval, from the assumed normal distribution is in some sense suboptimal, since you don’t lose much even if it IS normal.
You can see the remaining code and some references below. Thanks for reading.
Comments
* Naturally we only cover the idea here, there is lots of bootstrapping literature, we only tasted nonparametric bootstrap.
Related literature
Bootstrap Methods and their Application (Cambridge Series in Statistical and Probabilistic Mathematics)
An Introduction to the Bootstrap (Chapman & Hall/CRC Monographs on Statistics & Applied Probability)
Resampling Methods: A Practical Guide to Data Analysis
?View Code RSPLUSlibrary(quantmod) ; library(xtable) ; library(tseries)
library(ggplot2) ; library(fields) ; library(grDevices)
tckr = c('MS', 'SPY')
end<- format(Sys.Date(),"%Y-%m-%d") # yyyy-mm-dd
start<-format(Sys.Date() - 365*10,"%Y-%m-%d") # 10 years of data
dat1 = (getSymbols(tckr[1], src="yahoo", from=start, to=end, auto.assign = FALSE))
dat2 = (getSymbols(tckr[2], src="yahoo", from=start, to=end, auto.assign = FALSE))
ret1 = as.numeric((dat1[,4] - dat1[,1])/dat1[,1] )
ret2 = as.numeric((dat2[,4] - dat2[,1])/dat2[,1])
plot(ret1)
length(as.numeric(ret1)) ; length(as.numeric(ret1))
plot(as.numeric(ret1)~as.numeric(ret2))
t1 = as.data.frame(cbind(as.numeric(ret1),as.numeric(ret2)))
names(t1) <- tckr
lm1 = lm(MS~SPY, data = t1) ; summary(lm1)
nb = 5000 ; bet = NULL ; n = NROW(t1)
ptm <- proc.time()
for (i in 1:nb){
unifnum = sample(c(1:n),n,replace = T)
bet[i] = lm(t1[unifnum,1]~t1[unifnum,2])$coef[2]
}
proc.time() - ptm
den = density(bet)
par(mfrow = c(1,1))
bethat <- bquote(Histogram~of ~bold(widehat(beta)))
h = hist(bet, breaks = 100, freq = NULL, probability = F, ylim = c(0,280), xlab = expression(bold(widehat(beta))),
, cex.lab = 1.6, main = bethat )
lwd1 = 2.5
xline(mean(bet), col = 4, lwd = lwd1) ; xline(lm1$coef[2], col = 6, lwd = lwd1)
xline(mean(bet)+1.96*summary(lm1)$coef[2,2], col = 3, lwd = lwd1)
xline(mean(bet)-1.96*summary(lm1)$coef[2,2], col = 3, lwd = lwd1)
xline(quantile(bet,.025), col = 2, lwd = lwd1) ; xline(quantile(bet,.975), col = 2, lwd = lwd1)
xfit<-seq(min(bet),max(bet),length=length(bet))
yfit<-dnorm(xfit,mean=mean(bet),sd=sd(bet))
yfit <- yfit*diff(h$mids[1:2])*length(bet)
lines(xfit, yfit, col="blue", lwd=2)
yfit2<-dnorm(xfit,mean=lm1$coef[2],sd=summary(lm1)$coef[2,2])
yfit2 <- yfit2*diff(h$mids[1:2])*length(bet)
lines(xfit, yfit2, col=5, lwd=2)
###################################################
### Now I generate actually from the normal distribution
###################################################
eps = rnorm(n, mean(ret1), sd(ret1))
fakey = summary(lm1)$coef[1,1]+summary(lm1)$coef[2,1]*ret1 +eps
# hist(fakey, offset=0.16, width=0.13, add=TRUE, col = "green", breaks = 200)
lm2 = lm(fakey~ret1)
fakebet = NULL
ptm <- proc.time()
for (i in 1:nb){
unifnum = sample(c(1:n),n,replace = T)
fakebet[i] = lm(fakey[unifnum]~ret1[unifnum])$coef[2]
}
proc.time() - ptm
fakebethat <- bquote(Histogram~of~fake~bold(widehat(beta)))
h2 = hist(fakebet, breaks = 100, freq = NULL, probability = F, ylim = c(0,470), xlab = expression(bold(widehat(beta))),
, cex.lab = 1.6, main = fakebethat )
xline(mean(fakebet), col = 4, lwd = lwd1)
xline(lm2$coef[2], col = 6, lwd = lwd1)
xline(mean(fakebet)+1.96*summary(lm2)$coef[2,2], col = 3, lwd = lwd1)
xline(mean(fakebet)-1.96*summary(lm2)$coef[2,2], col = 3, lwd = lwd1)
xline(quantile(fakebet,.025), col = 2, lwd = lwd1) ; xline(quantile(fakebet,.975), col = 2, lwd = lwd1)
xfit<-seq(min(fakebet),max(fakebet),length=length(fakebet))
yfit<-dnorm(xfit,mean=mean(fakebet),sd=sd(fakebet))
yfit <- yfit*diff(h2$mids[1:2])*length(fakebet)
lines(xfit, yfit, col="blue", lwd=2)
yfit2<-dnorm(xfit,mean=lm2$coef[2],sd=summary(lm2)$coef[2,2])
yfit2 <- yfit2*diff(h2$mids[1:2])*length(fakebet)
lines(xfit, yfit2, col=5, lwd=2)
To leave a comment for the author, please follow the link and comment on his blog: Eran Raviv » R.
R-bloggers.com offers daily e-mail updates about R news and tutorials on top[…]
8 weeks ago by rahuldave
Updates to the Deducer family of packages
9 weeks ago by rahuldave
(This article was first published on Fells Stats » R, and kindly contributed to R-bloggers)
Over the past month there have been a number of package updates in the deducer ecosystem. Deducer is a general purpose, extensible, data analysis GUI. It is designed to be a free easy to use alternative to proprietary data analysis software such as SPSS, JMP, and Minitab. It has a menu system to do common data manipulation and analysis tasks, and an excel-like spreadsheet in which to view and edit data frames.
More information is available in the online manual:
http://www.deducer.org/pmwiki/pmwiki.php?n=Main.DeducerManual
And there is an intro video in youtube:
Deducer 0.6-3
The main change in Deducer 0.6-3 is an update to the (award winning) Plot Builder GUI to make use of the new features in ggplot2 0.9-0.
New plot builder features: Part 1
New plot builder features: Part 2
DeducerSpatial 0.4
A Deducer plug-in for spatial data analysis. Includes The ability to plot and explore open street map and Bing satellite images. Currently there is not much here in terms of heavy data analysis, but there are some great tools for importing and exploring spatial data.
DeducerPlugInScaling 0.1-0
A Deducer plug-in for factor analysis, reliability analysis and discriminant analysis.
Version 0.1-0 includes some general improvements as well as a dialog for linear discriminant analysis (thanks to Helios De Rosario-Martinez).
http://www.deducer.org/pmwiki/pmwiki.php?n=Main.LinearDiscriminantAnalysis
DeducerExtras 1.5
Added functionality for Deducer. This package includes additional dialogs for calculating distribution function values, cluster analysis, and more.
Version 1.5 includes some general improvements along with a new dialog implementing Friedman's test, and Kendall's W test (thanks to Helios De Rosario-Martinez).
http://www.deducer.org/pmwiki/pmwiki.php?n=Main.RankingAnalysis
To leave a comment for the author, please follow the link and comment on his blog: Fells Stats » R.
R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series,ecdf, trading) and more...
R_bloggers
deducer
R
from google
Over the past month there have been a number of package updates in the deducer ecosystem. Deducer is a general purpose, extensible, data analysis GUI. It is designed to be a free easy to use alternative to proprietary data analysis software such as SPSS, JMP, and Minitab. It has a menu system to do common data manipulation and analysis tasks, and an excel-like spreadsheet in which to view and edit data frames.
More information is available in the online manual:
http://www.deducer.org/pmwiki/pmwiki.php?n=Main.DeducerManual
And there is an intro video in youtube:
Deducer 0.6-3
The main change in Deducer 0.6-3 is an update to the (award winning) Plot Builder GUI to make use of the new features in ggplot2 0.9-0.
New plot builder features: Part 1
New plot builder features: Part 2
DeducerSpatial 0.4
A Deducer plug-in for spatial data analysis. Includes The ability to plot and explore open street map and Bing satellite images. Currently there is not much here in terms of heavy data analysis, but there are some great tools for importing and exploring spatial data.
DeducerPlugInScaling 0.1-0
A Deducer plug-in for factor analysis, reliability analysis and discriminant analysis.
Version 0.1-0 includes some general improvements as well as a dialog for linear discriminant analysis (thanks to Helios De Rosario-Martinez).
http://www.deducer.org/pmwiki/pmwiki.php?n=Main.LinearDiscriminantAnalysis
DeducerExtras 1.5
Added functionality for Deducer. This package includes additional dialogs for calculating distribution function values, cluster analysis, and more.
Version 1.5 includes some general improvements along with a new dialog implementing Friedman's test, and Kendall's W test (thanks to Helios De Rosario-Martinez).
http://www.deducer.org/pmwiki/pmwiki.php?n=Main.RankingAnalysis
To leave a comment for the author, please follow the link and comment on his blog: Fells Stats » R.
R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series,ecdf, trading) and more...
9 weeks ago by rahuldave
Simplify working with times and dates in R
10 weeks ago by rahuldave
(This article was first published on Revolutions, and kindly contributed to R-bloggers)
R has some very powerful built-in features for working with dates, times, and time-zones. But power and flexibility rarely correlate with ease-of-use, and this is no exception. The lubridate package comes to the rescue, make things a bit easier when working with chronological data in R.
The paper Dates and Times Made Easy with lubridate provides in-depth explanations of its functionality, but a recent post on the r-statistics blog provides a handy guide to the latest version, with tips on:
how to convert string-formatted dates like "13-04-2011" or "04/13/2011" to date objects
how to extract time components (weekday, month, hour, second) etc. from chronological data
how to set and convert time zones for data with a time (hour-minute-second) component
how to calculate time intervals and other arithmetic operations on times and dates
To see these tips, read the post at the r-statistics blog linked below.
r-statistics blog: Do more with dates and times in R with lubridate 1.1.0
To leave a comment for the author, please follow the link and comment on his blog: Revolutions.
R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series,ecdf, trading) and more...
R_bloggers
advanced_tips
packages
R
from google
R has some very powerful built-in features for working with dates, times, and time-zones. But power and flexibility rarely correlate with ease-of-use, and this is no exception. The lubridate package comes to the rescue, make things a bit easier when working with chronological data in R.
The paper Dates and Times Made Easy with lubridate provides in-depth explanations of its functionality, but a recent post on the r-statistics blog provides a handy guide to the latest version, with tips on:
how to convert string-formatted dates like "13-04-2011" or "04/13/2011" to date objects
how to extract time components (weekday, month, hour, second) etc. from chronological data
how to set and convert time zones for data with a time (hour-minute-second) component
how to calculate time intervals and other arithmetic operations on times and dates
To see these tips, read the post at the r-statistics blog linked below.
r-statistics blog: Do more with dates and times in R with lubridate 1.1.0
To leave a comment for the author, please follow the link and comment on his blog: Revolutions.
R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series,ecdf, trading) and more...
10 weeks ago by rahuldave
R and Hadoop: Step-by-step tutorials
11 weeks ago by rahuldave
(This article was first published on Revolutions, and kindly contributed to R-bloggers)
At the recent Big Data Workshop held by the Boston Predictive Analytics group, airline analyst and R user Jeffrey Breen gave a step-by-step guide to setting up an R and Hadoop infrastructure. Firstly, as a local virtual instance of Hadoop with R, using VMWare and Cloudera's Hadoop Demo VM. (This is a great way to get familiar with Hadoop.) Then, as single-machine cloud-based instance with lots of RAM and CPU, using Amazon EC2. (Good for more Hadoop experimentation, now with more realistic data sizes.) And finally, as a true distributed Hadoop cluster in the cloud, using Apache whirr to spin up multiple nodes running Hadoop and R.
With that infrastructure set up, you're ready to start using RHadoop to write map-reduce jobs in R. The final part of Jeffrey's workshop is a tutorial on the rmr package, with a worked example of loading a large data airline data set of airline departures and arrivals into HDFS and using an R-based map-reduce task to calculate the scheduled (orange) and actual (yellow) total flight hours in the US over the last decade or so. (Actual time spent in the air is also shown, in blue.)
By the way, if you're not familiar with RHadoop, project lead Antonio Piccolboni presented an overview at last month's Strata conference; I've included his slides from that presentation below.
Jeffry Breen: Slides from today’s Big Data Step-by-Step Tutorials: Infrastructure series and Intro to R+Hadoop with RHadoop’s rmr
To leave a comment for the author, please follow the link and comment on his blog: Revolutions.
R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series,ecdf, trading) and more...
R_bloggers
Big_Data
R
from google
At the recent Big Data Workshop held by the Boston Predictive Analytics group, airline analyst and R user Jeffrey Breen gave a step-by-step guide to setting up an R and Hadoop infrastructure. Firstly, as a local virtual instance of Hadoop with R, using VMWare and Cloudera's Hadoop Demo VM. (This is a great way to get familiar with Hadoop.) Then, as single-machine cloud-based instance with lots of RAM and CPU, using Amazon EC2. (Good for more Hadoop experimentation, now with more realistic data sizes.) And finally, as a true distributed Hadoop cluster in the cloud, using Apache whirr to spin up multiple nodes running Hadoop and R.
With that infrastructure set up, you're ready to start using RHadoop to write map-reduce jobs in R. The final part of Jeffrey's workshop is a tutorial on the rmr package, with a worked example of loading a large data airline data set of airline departures and arrivals into HDFS and using an R-based map-reduce task to calculate the scheduled (orange) and actual (yellow) total flight hours in the US over the last decade or so. (Actual time spent in the air is also shown, in blue.)
By the way, if you're not familiar with RHadoop, project lead Antonio Piccolboni presented an overview at last month's Strata conference; I've included his slides from that presentation below.
Jeffry Breen: Slides from today’s Big Data Step-by-Step Tutorials: Infrastructure series and Intro to R+Hadoop with RHadoop’s rmr
To leave a comment for the author, please follow the link and comment on his blog: Revolutions.
R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series,ecdf, trading) and more...
11 weeks ago by rahuldave
Creating a Stratified Sample of a Dataframe
11 weeks ago by rahuldave
(This article was first published on theBioBucket*, and kindly contributed to R-bloggers)
Expanding on a question on Stack Overflow I'll show how to make a 10 % subsample of each stratum within a given dataframe: d <- expand.grid(id = 1:35000, stratum = letters[1:10])p = 0.1dsample <- data.frame()system.time(for(i in levels(d$stratum)) { dsub <- subset(d, d$stratum == i) B = ceiling(nrow(dsub) * p) dsub <- dsub[sample(1:nrow(dsub), B), ] dsample <- rbind(dsample, dsub) })# size per stratum in resulting df is 10 % of original size:table(dsample$stratum)
To leave a comment for the author, please follow the link and comment on his blog: theBioBucket*.
R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series,ecdf, trading) and more...
R_bloggers
R
Stratified_Sampling
from google
Expanding on a question on Stack Overflow I'll show how to make a 10 % subsample of each stratum within a given dataframe: d <- expand.grid(id = 1:35000, stratum = letters[1:10])p = 0.1dsample <- data.frame()system.time(for(i in levels(d$stratum)) { dsub <- subset(d, d$stratum == i) B = ceiling(nrow(dsub) * p) dsub <- dsub[sample(1:nrow(dsub), B), ] dsample <- rbind(dsample, dsub) })# size per stratum in resulting df is 10 % of original size:table(dsample$stratum)
To leave a comment for the author, please follow the link and comment on his blog: theBioBucket*.
R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series,ecdf, trading) and more...
11 weeks ago by rahuldave
R-Function to Read Data from Google Docs Spreadsheets
11 weeks ago by rahuldave
(This article was first published on theBioBucket*, and kindly contributed to R-bloggers)
I used this idea posted on Stack Overflow to plug together a function for reading data from Google Docs spreadsheets into R. Read more »
To leave a comment for the author, please follow the link and comment on his blog: theBioBucket*.
R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series,ecdf, trading) and more...
R_bloggers
google_docs
Google_Spreadsheets
R
from google
I used this idea posted on Stack Overflow to plug together a function for reading data from Google Docs spreadsheets into R. Read more »
To leave a comment for the author, please follow the link and comment on his blog: theBioBucket*.
R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series,ecdf, trading) and more...
11 weeks ago by rahuldave
Experience on using R to build prediction models in business applications
12 weeks ago by rahuldave
(This article was first published on RDataMining, and kindly contributed to R-bloggers)
By Yanchang zhao, RDataMining.com
Building prediction/classification models is one of the most widely-seen data mining tasks in business applications. To share experience on building prediction models with R, I have started a discussion at RDataMining group on LinkedIn with the following questions. And my experience can be found at the end of question list. Pls join our discussion if you are interested.
1. What is your application about?
2. What techniques did you use? E.g., linear/logistic/non-linear regression, decision tree, random forest, neural networks, SVM, k-NN classification.
3. Which tool/function/package did you use? E.g., package rpart or party in R.
4. Why did you choose the above technique/tool/package?
5. Was your data a mixture of numerical and categorical attributes? If yes, what did you do to preprocess data before feeding them into the above techniques/functions?
6. Was your data of imbalanced classes? If yes, what did you do to achieve good prediction results?
7. Did your data have many missing values or extreme values? If yes, what did you do with them?
8. Was your table very wide, i.e., having many attributes/variables? If yes, how did you do variable/feature selection or dimensionality reduction before building predictive models?
Below are my experiences in a business application.
It was an application to model risk of customers. There were two classes, good and bad, labelled as 0 and 1. A model was to build to predict classes of new cases.
I used decision tree, because the tree is easy to understand by business people and managers, and the rules are simple and easily to be accepted by business, as compared to SVM or neural networks. We finally built a tree with less than 30 leaf nodes, that is, less than 30 rules. Although random forest can make a similar model, but it would end up with too many rules. To sum up, decision trees are easy to understand, have good performance, accommodate both categorical and numerical data and also missing values, and produce simple models.
I used ctree() in package party. The reason for this choice is not technical at all. I first tried rpart, but when I plotted rpart tree, I got a tree without any labels, and the tree does not look like a tree at all. Then I tried ctree(), and fell in love with its plot, a nice looking tree with all labels I needed. I believe both packages produce similar results, and my choice of ctreeI() is simply personal preference.
The reason I didn’t use package randomForest is that it cannot handle missing values, and in our application, we didn’t believe filling those missing values would produce good result, because many missing values were introduced when joining tables (e.g., there were no records for a customer in some tables). However, it may work in other applications. I did try cforest() in package party, it produced similar result as ctree(), so I finally chose ctree() because it produces much simpler trees than a random forest. When performance is similar, the simpler, the better.
The data was a mixture of numerical and categorical attributes, and ctree() handles that very well.
There were many missing values. We didn’t fill them. However, for attributes with too many missing values, e.g., if above 60% are missing, we simply excluded the attribute from modelling. SAS EM uses a similar strategy when importing data into EM, and I think the default threshold for missing values are 20% or 30% in EM.
The data were large and also composed of over 300 variables from dozens of tables. We didn’t do any feature selection first, but found that it took too long to train a model, especially with some categorical variables having many value levels. An option is to use a small sample to train models. I took a different way to use as many training data as possible. To make it work, I draw 20 random samples of training data, and built 20 decision trees, one tree for each sample. There are around 20-30 variables in each tree, and many trees share similar set of variables. Then I collected all variables appearing in those trees, and got around 60 variables. After that, I used all original training data for training without any sampling, but with the above 60 variables only. That’s my way to select features in that application, where all training cases were used to build a final model, but with only those attributes having appeared in the 20 trees on sampled data.
More information on R and data mining is available at:
RDataMining: http://www.rdatamining.com
Twitter: http://www.twitter.com/RDataMining
Group on Linkedin: http://group.rdatamining.com
Group on Google: http://group2.rdatamining.com
To leave a comment for the author, please follow the link and comment on his blog: RDataMining.
R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series,ecdf, trading) and more...
R_bloggers
data_mining
prediction
R
Uncategorized
from google
By Yanchang zhao, RDataMining.com
Building prediction/classification models is one of the most widely-seen data mining tasks in business applications. To share experience on building prediction models with R, I have started a discussion at RDataMining group on LinkedIn with the following questions. And my experience can be found at the end of question list. Pls join our discussion if you are interested.
1. What is your application about?
2. What techniques did you use? E.g., linear/logistic/non-linear regression, decision tree, random forest, neural networks, SVM, k-NN classification.
3. Which tool/function/package did you use? E.g., package rpart or party in R.
4. Why did you choose the above technique/tool/package?
5. Was your data a mixture of numerical and categorical attributes? If yes, what did you do to preprocess data before feeding them into the above techniques/functions?
6. Was your data of imbalanced classes? If yes, what did you do to achieve good prediction results?
7. Did your data have many missing values or extreme values? If yes, what did you do with them?
8. Was your table very wide, i.e., having many attributes/variables? If yes, how did you do variable/feature selection or dimensionality reduction before building predictive models?
Below are my experiences in a business application.
It was an application to model risk of customers. There were two classes, good and bad, labelled as 0 and 1. A model was to build to predict classes of new cases.
I used decision tree, because the tree is easy to understand by business people and managers, and the rules are simple and easily to be accepted by business, as compared to SVM or neural networks. We finally built a tree with less than 30 leaf nodes, that is, less than 30 rules. Although random forest can make a similar model, but it would end up with too many rules. To sum up, decision trees are easy to understand, have good performance, accommodate both categorical and numerical data and also missing values, and produce simple models.
I used ctree() in package party. The reason for this choice is not technical at all. I first tried rpart, but when I plotted rpart tree, I got a tree without any labels, and the tree does not look like a tree at all. Then I tried ctree(), and fell in love with its plot, a nice looking tree with all labels I needed. I believe both packages produce similar results, and my choice of ctreeI() is simply personal preference.
The reason I didn’t use package randomForest is that it cannot handle missing values, and in our application, we didn’t believe filling those missing values would produce good result, because many missing values were introduced when joining tables (e.g., there were no records for a customer in some tables). However, it may work in other applications. I did try cforest() in package party, it produced similar result as ctree(), so I finally chose ctree() because it produces much simpler trees than a random forest. When performance is similar, the simpler, the better.
The data was a mixture of numerical and categorical attributes, and ctree() handles that very well.
There were many missing values. We didn’t fill them. However, for attributes with too many missing values, e.g., if above 60% are missing, we simply excluded the attribute from modelling. SAS EM uses a similar strategy when importing data into EM, and I think the default threshold for missing values are 20% or 30% in EM.
The data were large and also composed of over 300 variables from dozens of tables. We didn’t do any feature selection first, but found that it took too long to train a model, especially with some categorical variables having many value levels. An option is to use a small sample to train models. I took a different way to use as many training data as possible. To make it work, I draw 20 random samples of training data, and built 20 decision trees, one tree for each sample. There are around 20-30 variables in each tree, and many trees share similar set of variables. Then I collected all variables appearing in those trees, and got around 60 variables. After that, I used all original training data for training without any sampling, but with the above 60 variables only. That’s my way to select features in that application, where all training cases were used to build a final model, but with only those attributes having appeared in the 20 trees on sampled data.
More information on R and data mining is available at:
RDataMining: http://www.rdatamining.com
Twitter: http://www.twitter.com/RDataMining
Group on Linkedin: http://group.rdatamining.com
Group on Google: http://group2.rdatamining.com
To leave a comment for the author, please follow the link and comment on his blog: RDataMining.
R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series,ecdf, trading) and more...
12 weeks ago by rahuldave
Ggplot2 Notes
12 weeks ago by rahuldave
(This article was first published on Sharp Statistics » R, and kindly contributed to R-bloggers)
During the time I have used R the base graphics package has met my needs, although I have been aware of ggplot2 but found learning it a bit of a struggle so have pretty much ignored it until now. Most … Continue reading →
To leave a comment for the author, please follow the link and comment on his blog: Sharp Statistics » R.
R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series,ecdf, trading) and more...
R_bloggers
charts
R
from google
During the time I have used R the base graphics package has met my needs, although I have been aware of ggplot2 but found learning it a bit of a struggle so have pretty much ignored it until now. Most … Continue reading →
To leave a comment for the author, please follow the link and comment on his blog: Sharp Statistics » R.
R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series,ecdf, trading) and more...
12 weeks ago by rahuldave
Plot maps like a boss
12 weeks ago by rahuldave
(This article was first published on Fells Stats » R, and kindly contributed to R-bloggers)
A new package OpenStreetMap has been released to CRAN this week which is designed to allow you to easily add satellite imagery, or open street maps to your plots. Raster maps are a great way to add context to your spatial data with a minimum outlay of effort.
The syntax in OpenStreetMap is fairly simple, just give it a bounding box in lat/long and it will download a high quality raster image ready for plotting
library(OpenStreetMap)
library(rgdal)
map <- openmap(c(70,-179), c(-70,179))
plot(map)
(click for higher quality image)
The above code downloads multiple map tiles and stitches together, with the level of zoom determined automatically. Unlike RGoogleMaps no files are created or stored on the hard drive. The plot gets is images from the mapnik server, which can provide street level information. In my opinion, there rendering looks pretty clean as well.
We can also access satellite imagery though Bing.
map <- openmap(c(70,-179), c(-70,179),type='bing')
plot(map)
Now, that is all fine and dandy, but kind of useless unless you are able to combine it with your own data. Open street map, and Bing (and Google maps) use a particular form of the Mercator projection which has the properties that angles are preserved, and tiles on multiple zoom levels can be stitched together. You can access this projection with the 'osm' function.
In terms of combining maps with your data there are two options. The data can be transformed to the osm() projection, or the raster map can be translated to whatever projection the data are in. Here is an example of the first option:
library(UScensus2000)
data(california.tract)
lat <- c(43.834526782236814,30.334953881988564)
lon <- c(-131.0888671875 ,-107.8857421875)
southwest <- openmap(c(lat[1],lon[1]),c(lat[2],lon[2]),zoom=6,'osm')
california.tract <- spTransform(california.tract,osm())
plot(southwest)
plot(california.tract,add=TRUE,col=(california.tract@data$med.age>40)+4)
Here we take a map from the UScensus2000 package, transform it to mercator coordinates, and make a choropleth. The spTransform function is a great easy way to project your data into different map coordinate systems.
We may also want to go the other way and transform the image. The openproj function can transform open street maps to different projections. Here is an example combining OpenStreetMap with the maps library by projecting the map into longlat coordinates.
map <- openmap(c(70,-179), c(-70,179),type='bing')
map_longlat <- openproj(map, projection = "+proj=longlat")
plot(map_longlat,raster=TRUE)
map("world",col="red",add=TRUE)
but, we are not just limited to the longlat projection, we can also do weirder ones like the lambert conic conformal.
map <- openmap(c(70,-179),
c(40,179),zoom=2,type='bing')
map_longlat <- openproj(map)
#Lambert Conic Conformal (takes some time...)
map_llc <- openproj(map_longlat, projection=
"+proj=lcc +lat_1=33 +lat_2=45 +lat_0=39 +lon_0=-96")
plot(map_llc,raster=TRUE)
#add choropleth
data(states)
st_llc <- spTransform(states,CRS("+proj=lcc +lat_1=33 +lat_2=45 +lat_0=39 +lon_0=-96"))
plot(st_llc,add=T,col=heat.colors(48,.4)[slot(st_llc,"data")[["ORDER_ADM"]]])
Now, I have no idea why you would want to use this projection, but it is pretty cool none the less.
One of the best uses for raster maps is in street level data, where it is particularly important to give the reader contextual information. Here is a plot of some locations in the Long Beach harbor, using the LA_places dataset.
data(LA_places)
xy <- coordinates(LA_places)
map <- openmap(c(33.760525217369974,-118.22052955627441),
c(33.73290566922855,-118.17521095275879))
png(width = 1000, height = 1000)
plot(map,raster=TRUE)
plot(LA_places,add=TRUE,col="red")
text(xy[,1],xy[,2],slot(LA_places,"data")[,'NAME'],adj=1)
If you are a Deducer user, the DeducerSpatial package provides a GUI for spatial plotting using the OpenStreetMap package.
To leave a comment for the author, please follow the link and comment on his blog: Fells Stats » R.
R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series,ecdf, trading) and more...
R_bloggers
R
spatial
from google
A new package OpenStreetMap has been released to CRAN this week which is designed to allow you to easily add satellite imagery, or open street maps to your plots. Raster maps are a great way to add context to your spatial data with a minimum outlay of effort.
The syntax in OpenStreetMap is fairly simple, just give it a bounding box in lat/long and it will download a high quality raster image ready for plotting
library(OpenStreetMap)
library(rgdal)
map <- openmap(c(70,-179), c(-70,179))
plot(map)
(click for higher quality image)
The above code downloads multiple map tiles and stitches together, with the level of zoom determined automatically. Unlike RGoogleMaps no files are created or stored on the hard drive. The plot gets is images from the mapnik server, which can provide street level information. In my opinion, there rendering looks pretty clean as well.
We can also access satellite imagery though Bing.
map <- openmap(c(70,-179), c(-70,179),type='bing')
plot(map)
Now, that is all fine and dandy, but kind of useless unless you are able to combine it with your own data. Open street map, and Bing (and Google maps) use a particular form of the Mercator projection which has the properties that angles are preserved, and tiles on multiple zoom levels can be stitched together. You can access this projection with the 'osm' function.
In terms of combining maps with your data there are two options. The data can be transformed to the osm() projection, or the raster map can be translated to whatever projection the data are in. Here is an example of the first option:
library(UScensus2000)
data(california.tract)
lat <- c(43.834526782236814,30.334953881988564)
lon <- c(-131.0888671875 ,-107.8857421875)
southwest <- openmap(c(lat[1],lon[1]),c(lat[2],lon[2]),zoom=6,'osm')
california.tract <- spTransform(california.tract,osm())
plot(southwest)
plot(california.tract,add=TRUE,col=(california.tract@data$med.age>40)+4)
Here we take a map from the UScensus2000 package, transform it to mercator coordinates, and make a choropleth. The spTransform function is a great easy way to project your data into different map coordinate systems.
We may also want to go the other way and transform the image. The openproj function can transform open street maps to different projections. Here is an example combining OpenStreetMap with the maps library by projecting the map into longlat coordinates.
map <- openmap(c(70,-179), c(-70,179),type='bing')
map_longlat <- openproj(map, projection = "+proj=longlat")
plot(map_longlat,raster=TRUE)
map("world",col="red",add=TRUE)
but, we are not just limited to the longlat projection, we can also do weirder ones like the lambert conic conformal.
map <- openmap(c(70,-179),
c(40,179),zoom=2,type='bing')
map_longlat <- openproj(map)
#Lambert Conic Conformal (takes some time...)
map_llc <- openproj(map_longlat, projection=
"+proj=lcc +lat_1=33 +lat_2=45 +lat_0=39 +lon_0=-96")
plot(map_llc,raster=TRUE)
#add choropleth
data(states)
st_llc <- spTransform(states,CRS("+proj=lcc +lat_1=33 +lat_2=45 +lat_0=39 +lon_0=-96"))
plot(st_llc,add=T,col=heat.colors(48,.4)[slot(st_llc,"data")[["ORDER_ADM"]]])
Now, I have no idea why you would want to use this projection, but it is pretty cool none the less.
One of the best uses for raster maps is in street level data, where it is particularly important to give the reader contextual information. Here is a plot of some locations in the Long Beach harbor, using the LA_places dataset.
data(LA_places)
xy <- coordinates(LA_places)
map <- openmap(c(33.760525217369974,-118.22052955627441),
c(33.73290566922855,-118.17521095275879))
png(width = 1000, height = 1000)
plot(map,raster=TRUE)
plot(LA_places,add=TRUE,col="red")
text(xy[,1],xy[,2],slot(LA_places,"data")[,'NAME'],adj=1)
If you are a Deducer user, the DeducerSpatial package provides a GUI for spatial plotting using the OpenStreetMap package.
To leave a comment for the author, please follow the link and comment on his blog: Fells Stats » R.
R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series,ecdf, trading) and more...
12 weeks ago by rahuldave
Data visualization
12 weeks ago by rahuldave
(This article was first published on Research tips » R, and kindly contributed to R-bloggers)
For those who have not read the seminal works of Tufte and Cleveland, please hang your heads in shame. To salvage some sense of self-worth, you can then head over to Solomon Messing’s blog where he is starting a series on data visualization based on the principles developed by Tufte and Cleveland (with R examples).
The classics are also worth reading, and remain relevant despite the 20 or 30 years that have elapsed since they appeared.
To leave a comment for the author, please follow the link and comment on his blog: Research tips » R.
R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series,ecdf, trading) and more...
R_bloggers
graphics
R
statistics
Uncategorized
from google
For those who have not read the seminal works of Tufte and Cleveland, please hang your heads in shame. To salvage some sense of self-worth, you can then head over to Solomon Messing’s blog where he is starting a series on data visualization based on the principles developed by Tufte and Cleveland (with R examples).
The classics are also worth reading, and remain relevant despite the 20 or 30 years that have elapsed since they appeared.
To leave a comment for the author, please follow the link and comment on his blog: Research tips » R.
R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series,ecdf, trading) and more...
12 weeks ago by rahuldave
Visualization series: Insight from Cleveland and Tufte on plotting numeric data by groups
12 weeks ago by rahuldave
After my post on making dotplots with concise code using plyr and ggplot, I got an email from my dad who practices immigration law and runs a website with a variety of immigration resources and tools. He pointed out that the post was written for folks who already know that they want to make dot plots, and who already know about bootstrapped standard errors. That’s not many people.
In an attempt to appeal to a broader audience, I’m starting a series in which I’ll outline the key principles I use when developing a visualization. In this post, I’ll articulate these principles, which combine some of Tuft’s aesthetic guidelines with Cleveland’s scientific approach to visualization, which is based on the psychological processes involved in making sense of visualizations, and has been rigorously tested via randomized controlled experiments. Based on these principles, I’ll argue that dotplots and scatterplots are better than other types of plots (especially pie charts) in most situations. In later posts, I’ll demonstrate another innovation whose widespread use I’ll credit to Cleveland and Tufte: the use of multiple panels (aka small multiples, trellis graphics, facets, generalized draftsman’s displays, multivar charts) to clearly convey the same information embedded in more complex and difficult to read visualizations, including multiple line plots and mosaic plots. In future posts I’ll also emphasize why it is important to provide some indication of the noise present in the underlying data using error bars or bands. Along the way, I’ll put you to the test–I’ll present some visualizations of the same data using different visualization techniques and ask you to try to get as much information as you can in 2 seconds from each type of visualization.
A good visualization conveys key information to those who may have trouble interpreting numbers and/or statistics, which can make your findings accessible to a wider audience (more on this below). Visualizations also give your audience a break from lexical processing, which is especially useful when you are presenting your findings–people can listen to you and process the findings from a well-designed visual at the same time, but most people have trouble listening while reading your PowerPoint bullet points. Visualizations also convey key information embedded in massive amounts of data, which can aid your own exploratory analysis of data, no matter how massive.
Yet most visualizations are flawed, drawn using elements that make it unnecessarily difficult for the human visual system to make sense of things. I see a lot of these visualizations attending research presentations, screening incoming draft manuscripts as the assistant editor for Political Communication, and as a consumer of media info-graphics (CNN is especially bad, have a look at this monstrosity). A big part of the problem is that Microsoft makes it easy to draw flashy but ultimately confusing visualizations in Excel. If you are too busy to read this post in full, follow this short list of guidelines and you’ll be on your way to producing elegant visualizations that impose a minimal cognitive burden on your audience:
Never represent something in 2 or worse yet 3 dimensions if it can be represented in one—NEVER use pie charts, 3-D pie charts, stacked bar charts, or 3-D bar charts.
Remove as much chart junk as possible–unnecessary gridlines, shading, borders, etc.
Give your audience a sense of the noise present in your data–draw error bars or confidence bands if you are plotting estimates.
If you want to plot multiple types of groups on a single outcome (the visual analog of cross-tabulations/marginals), use multi-paneled plots. These can also help if overploting looks too cluttered.
Avoid mosaic plots. Instead use paneled histograms.
Ditch the legend if you can (you almost always can).
The rest of the content in this series emphasizes why it makes sense to follow these guidelines. In this post I’ll look at the first point in detail and touch on the sixth. These two guidelines are most relevant when you want to look at a quantitative variable (e.g., earnings, vote-share, temperature, etc.) across different qualitative groupings (e.g., industry segment, candidate, party, racial group, season, etc.). This is one of the most common visualization tasks in business, media, and social science, and for this task people often use pie charts and/or bar charts, and occasionally dot plots.
The science of graphical perception
When most people think about visualization, they think first of Edward Tufte. Tufte emphasizes integrity to the data, showing relationships between phenomena, and above all else aesthetic minimalism. I appreciate his ruthless crusade against chart junk and pie charts (nice quote from Data without Borders). We share an affinity for multipanel plotting approaches, which he calls “small multiples,” (thanks to Rebecca Weiss for pointing this out) though I think people give Tufte too much credit for their invention—both juiceanalytics and infovis-wiki write that Tufte introduced the concept/principle. However, both Cleveland and Tufte published books in 1983 discussing the use of multipanel displays; David Smith over at Revolutions writes that “the “small-multiples” principle of data visualization [was] pioneered by Cleveland and popularized in Tufte’s first book”; and the earliest reference to a work containing multipanel displays I could find was published *long* before Tufte’s 1983 work–Seder, Leonard (1950), “Diagnosis with Diagrams—Part I”, Industrial Quality Control (New York, New York: American Society for Quality Control) 7 (1): 11–19.
I’m less sure about Tufte’s advice to always show axes starting at zero, which can make comparison between two groups difficult, and to “show causality,” which can end up misleading your readers. Of course, the visualizations on display in the glossy pages of Tufte’s books are beautiful–they belong in a museum. But while his books are full of general advice that we should all keep in mind when creating plots, he does not put forth a theory of what works and what doesn’t when trying to visualize data.
Cleveland (with Robert McGill) develops such a theory and subjects it to rigorous scientific testing. In my last post I linked to one of Cleveland’s studies showing that dots (or bars) aligned on the same scale are indeed the best visualization to convey a series of numerical estimates. In this work, Cleveland examined how accurately our visual system can process visual elements or “perceptual units” representing underlying data. These elements include markers aligned on the same scale (e.g., dot plots, scatterplots, ordinary bar charts), the length of lines that are not aligned on the same scale (e.g., stacked bar plots), area (pie charts and mosaic plots), angles (also pie charts), shading/color, volume, curvature, and direction.
He runs two experiments: the first compares judgements about relative position (grouped bar charts) to judgements based only on length (stacked bar charts); the second compares judgements about relative position (ordinary bar charts) to judgements about angles/area (pie charts). Here are the materials he uses, courtesy of the Stanford Computer Graphics Lab:
The results are resoundingly clear—judgements about position relative to a baseline are dramatically more accurate than judgements about angles, area, or length (with no baseline). Hence, he suggests that we replace pie charts with bar charts or dot plots and that we substitute stacked bar charts for grouped bar charts.
A striking and often overlooked finding in this work is the fact that the group of participants without technical training, “mostly ordinary housewives” as Cleveland describes them, performed just as well as the group of mostly men with substantial technical training and experience. This finding provides evidence for something that I’ve long suspected: that visualizations make it easier for people lacking quantitative experience to understand your results, serving to level the playing field. If you want your findings to be broadly accessible, it’s probably better to present a visualization rather than a bunch of numbers. It also suggests that if someone is having trouble interpreting your visualizations, it’s probably your fault.
Dotplots versus pie charts and stacked barplots
Now let’s put this to the test. Take a look at each visualization below for two seconds, looking for the percent of the vote that Mitt Romney, Ron Paul, and Jon Huntsman got.
Which is easiest to read? Which conveys information most accurately? Let’s first take a look at the most critical information–the order in which the candidates placed. In all plots, the candidates are arrayed in order from highest to least vote share, and it’s easy to see that Mitt won. But once we start looking at who came in second, third, and so on, differences emerge. It’s slightly harder to process order in the pie chart because your eye has to go around the plot rather than up and down in a straight line. In the stacked bar chart, we need to look up which color corresponds to which candidate’s in the legend (as Tufte told us not to use), adding a layer of cognitive processing.
Second, which conveys estimates most accurately? The dot plot is the clear winner here. We can quickly see that Romney got about 37%, Paul got about 24%, and Huntsman got about 16%, just by looking at dots relative to the axis. When we look at the pie chart, it’s really tough to estimate the exact percent each candidate got. Same with the stacked bar chart. We could add numbers to the pie and bar charts, which would even things out to some extent, but then why not just display a table with exact percents?
One argument I used to hear […]
R
from google
In an attempt to appeal to a broader audience, I’m starting a series in which I’ll outline the key principles I use when developing a visualization. In this post, I’ll articulate these principles, which combine some of Tuft’s aesthetic guidelines with Cleveland’s scientific approach to visualization, which is based on the psychological processes involved in making sense of visualizations, and has been rigorously tested via randomized controlled experiments. Based on these principles, I’ll argue that dotplots and scatterplots are better than other types of plots (especially pie charts) in most situations. In later posts, I’ll demonstrate another innovation whose widespread use I’ll credit to Cleveland and Tufte: the use of multiple panels (aka small multiples, trellis graphics, facets, generalized draftsman’s displays, multivar charts) to clearly convey the same information embedded in more complex and difficult to read visualizations, including multiple line plots and mosaic plots. In future posts I’ll also emphasize why it is important to provide some indication of the noise present in the underlying data using error bars or bands. Along the way, I’ll put you to the test–I’ll present some visualizations of the same data using different visualization techniques and ask you to try to get as much information as you can in 2 seconds from each type of visualization.
A good visualization conveys key information to those who may have trouble interpreting numbers and/or statistics, which can make your findings accessible to a wider audience (more on this below). Visualizations also give your audience a break from lexical processing, which is especially useful when you are presenting your findings–people can listen to you and process the findings from a well-designed visual at the same time, but most people have trouble listening while reading your PowerPoint bullet points. Visualizations also convey key information embedded in massive amounts of data, which can aid your own exploratory analysis of data, no matter how massive.
Yet most visualizations are flawed, drawn using elements that make it unnecessarily difficult for the human visual system to make sense of things. I see a lot of these visualizations attending research presentations, screening incoming draft manuscripts as the assistant editor for Political Communication, and as a consumer of media info-graphics (CNN is especially bad, have a look at this monstrosity). A big part of the problem is that Microsoft makes it easy to draw flashy but ultimately confusing visualizations in Excel. If you are too busy to read this post in full, follow this short list of guidelines and you’ll be on your way to producing elegant visualizations that impose a minimal cognitive burden on your audience:
Never represent something in 2 or worse yet 3 dimensions if it can be represented in one—NEVER use pie charts, 3-D pie charts, stacked bar charts, or 3-D bar charts.
Remove as much chart junk as possible–unnecessary gridlines, shading, borders, etc.
Give your audience a sense of the noise present in your data–draw error bars or confidence bands if you are plotting estimates.
If you want to plot multiple types of groups on a single outcome (the visual analog of cross-tabulations/marginals), use multi-paneled plots. These can also help if overploting looks too cluttered.
Avoid mosaic plots. Instead use paneled histograms.
Ditch the legend if you can (you almost always can).
The rest of the content in this series emphasizes why it makes sense to follow these guidelines. In this post I’ll look at the first point in detail and touch on the sixth. These two guidelines are most relevant when you want to look at a quantitative variable (e.g., earnings, vote-share, temperature, etc.) across different qualitative groupings (e.g., industry segment, candidate, party, racial group, season, etc.). This is one of the most common visualization tasks in business, media, and social science, and for this task people often use pie charts and/or bar charts, and occasionally dot plots.
The science of graphical perception
When most people think about visualization, they think first of Edward Tufte. Tufte emphasizes integrity to the data, showing relationships between phenomena, and above all else aesthetic minimalism. I appreciate his ruthless crusade against chart junk and pie charts (nice quote from Data without Borders). We share an affinity for multipanel plotting approaches, which he calls “small multiples,” (thanks to Rebecca Weiss for pointing this out) though I think people give Tufte too much credit for their invention—both juiceanalytics and infovis-wiki write that Tufte introduced the concept/principle. However, both Cleveland and Tufte published books in 1983 discussing the use of multipanel displays; David Smith over at Revolutions writes that “the “small-multiples” principle of data visualization [was] pioneered by Cleveland and popularized in Tufte’s first book”; and the earliest reference to a work containing multipanel displays I could find was published *long* before Tufte’s 1983 work–Seder, Leonard (1950), “Diagnosis with Diagrams—Part I”, Industrial Quality Control (New York, New York: American Society for Quality Control) 7 (1): 11–19.
I’m less sure about Tufte’s advice to always show axes starting at zero, which can make comparison between two groups difficult, and to “show causality,” which can end up misleading your readers. Of course, the visualizations on display in the glossy pages of Tufte’s books are beautiful–they belong in a museum. But while his books are full of general advice that we should all keep in mind when creating plots, he does not put forth a theory of what works and what doesn’t when trying to visualize data.
Cleveland (with Robert McGill) develops such a theory and subjects it to rigorous scientific testing. In my last post I linked to one of Cleveland’s studies showing that dots (or bars) aligned on the same scale are indeed the best visualization to convey a series of numerical estimates. In this work, Cleveland examined how accurately our visual system can process visual elements or “perceptual units” representing underlying data. These elements include markers aligned on the same scale (e.g., dot plots, scatterplots, ordinary bar charts), the length of lines that are not aligned on the same scale (e.g., stacked bar plots), area (pie charts and mosaic plots), angles (also pie charts), shading/color, volume, curvature, and direction.
He runs two experiments: the first compares judgements about relative position (grouped bar charts) to judgements based only on length (stacked bar charts); the second compares judgements about relative position (ordinary bar charts) to judgements about angles/area (pie charts). Here are the materials he uses, courtesy of the Stanford Computer Graphics Lab:
The results are resoundingly clear—judgements about position relative to a baseline are dramatically more accurate than judgements about angles, area, or length (with no baseline). Hence, he suggests that we replace pie charts with bar charts or dot plots and that we substitute stacked bar charts for grouped bar charts.
A striking and often overlooked finding in this work is the fact that the group of participants without technical training, “mostly ordinary housewives” as Cleveland describes them, performed just as well as the group of mostly men with substantial technical training and experience. This finding provides evidence for something that I’ve long suspected: that visualizations make it easier for people lacking quantitative experience to understand your results, serving to level the playing field. If you want your findings to be broadly accessible, it’s probably better to present a visualization rather than a bunch of numbers. It also suggests that if someone is having trouble interpreting your visualizations, it’s probably your fault.
Dotplots versus pie charts and stacked barplots
Now let’s put this to the test. Take a look at each visualization below for two seconds, looking for the percent of the vote that Mitt Romney, Ron Paul, and Jon Huntsman got.
Which is easiest to read? Which conveys information most accurately? Let’s first take a look at the most critical information–the order in which the candidates placed. In all plots, the candidates are arrayed in order from highest to least vote share, and it’s easy to see that Mitt won. But once we start looking at who came in second, third, and so on, differences emerge. It’s slightly harder to process order in the pie chart because your eye has to go around the plot rather than up and down in a straight line. In the stacked bar chart, we need to look up which color corresponds to which candidate’s in the legend (as Tufte told us not to use), adding a layer of cognitive processing.
Second, which conveys estimates most accurately? The dot plot is the clear winner here. We can quickly see that Romney got about 37%, Paul got about 24%, and Huntsman got about 16%, just by looking at dots relative to the axis. When we look at the pie chart, it’s really tough to estimate the exact percent each candidate got. Same with the stacked bar chart. We could add numbers to the pie and bar charts, which would even things out to some extent, but then why not just display a table with exact percents?
One argument I used to hear […]
12 weeks ago by rahuldave
Visualization series: Insight from Cleveland and Tufte on plotting numeric data by groups
12 weeks ago by rahuldave
(This article was first published on Solomon Messing » R, and kindly contributed to R-bloggers)
After my post on making dotplots with concise code using plyr and ggplot, I got an email from my dad who practices immigration law and runs a website with a variety of immigration resources and tools. He pointed out that the post was written for folks who already know that they want to make dot plots, and who already know about bootstrapped standard errors. That’s not many people.
In an attempt to appeal to a broader audience, I’m starting a series in which I’ll outline the key principles I use when developing a visualization. In this post, I’ll articulate these principles, which combine some of Tuft’s aesthetic guidelines with Cleveland’s scientific approach to visualization, which is based on the psychological processes involved in making sense of visualizations, and has been rigorously tested via randomized controlled experiments. Based on these principles, I’ll argue that dotplots and scatterplots are better than other types of plots (especially pie charts). In later posts, I’ll demonstrate another innovation that I’ll credit to Cleveland: the use of multiple panels (aka trellis graphics and/or facets) to clearly convey the same information embedded in more complex and difficult to read visualizations, including multiple line plots and mosaic plots. In future posts I’ll also emphasize why it is important to provide some indication of the noise present in the underlying data using error bars or bands. Along the way, I’ll put you to the test–I’ll present some visualizations of the same data using different visualization techniques and ask you to try to get as much information as you can in 2 seconds from each type of visualization.
A good visualization conveys key information to those who may have trouble interpreting numbers and/or statistics, which can make your findings accessible to a wider audience (more on this below). Visualizations also give your audience a break from lexical processing, which is especially useful when you are presenting your findings–people can listen to you and process the findings from a well-designed visual at the same time, but most people have trouble listening while reading your PowerPoint bullet points. Visualizations also convey key information embedded in massive amounts of data, which can aid your own exploratory analysis of data, no matter how massive.
Yet most visualizations are flawed, drawn using elements that make it unnecessarily difficult for the human visual system to make sense of things. I see a lot of these visualizations attending research presentations, screening incoming draft manuscripts as the assistant editor for Political Communication, and as a consumer of media info-graphics (CNN is especially bad, have a look at this monstrosity). A big part of the problem is that Microsoft makes it easy to draw flashy but ultimately confusing visualizations in Excel. If you are too busy to read this post in full, follow this short list of guidelines and you’ll be on your way to producing elegant visualizations that impose a minimal cognitive burden on your audience:
Never represent something in 2 or (god forbid) 3 dimensions if it can be represented in one—NEVER use pie charts, 3-D pie charts, stacked bar charts, or 3-D bar charts.
Remove as much chart junk as possible–unnecessary gridlines, shading, borders, etc.
Give your audience a sense of the noise present in your data–draw error bars or confidence bands if you are plotting estimates.
If you want to plot multiple types of groups on a single outcome (the visual analog of cross-tabulations/marginals), use multi-paneled plots. These can also help if overploting looks too cluttered.
Avoid mosaic plots. Instead use paneled histograms.
Ditch the legend if you can (you almost always can).
The rest of the content in this series emphasizes why it makes sense to follow these guidelines. In this post I’ll look at the first point in detail and touch on the sixth. These two guidelines are most relevant when you want to look at a quantitative variable (e.g., earnings, vote-share, temperature, etc.) across different qualitative groupings (e.g., industry segment, candidate, party, racial group, season, etc.). This is one of the most common visualization tasks in business, media, and social science, and for this task people often use pie charts and/or bar charts, and occasionally dot plots.
The science of graphical perception
When most people think about visualization, they think first of Edward Tufte. Tufte emphasizes integrity to the data, showing relationships between phenomena, and above all else aesthetic minimalism. I appreciate his ruthless crusade against chart junk and pie charts (nice quote from Data without Borders). But I’m not so sure about his advice to always show axes starting at zero, which can make comparison between two groups difficult, and to “show causality,” which can end up misleading your readers. Of course, the visualizations on display in the glossy pages of Tufte’s books are beautiful–they belong in a museum. But while his books are full of general advice that we should all keep in mind when creating plots, he does not put forth a theory of what works and what doesn’t when trying to visualize data.
Cleveland develops such a theory and subjects it to rigorous scientific testing. In my last post I linked to one of Cleveland’s studies showing that dots (or bars) aligned on the same scale are indeed the best visualization to convey a series of numerical estimates. In this work, Cleveland examined how accurately our visual system can process visual elements or “perceptual units” representing underlying data. These elements include markers aligned on the same scale (e.g., dot plots, scatterplots, ordinary bar charts), the length of lines that are not aligned on the same scale (e.g., stacked bar plots), area (pie charts and mosaic plots), angles (also pie charts), shading/color, volume, curvature, and direction.
He runs two experiments: the first compares judgements about relative position (grouped bar charts) to judgements based only on length (stacked bar charts); the second compares judgements about relative position (ordinary bar charts) to judgements about angles/area (pie charts). Here are the materials he uses, courtesy of the Stanford Computer Graphics Lab:
The results are resoundingly clear—judgements about position relative to a baseline are dramatically more accurate than judgements about angles, area, or length (with no baseline). Hence, he suggests that we replace pie charts with bar charts or dot plots and that we substitute stacked bar charts for grouped bar charts.
A striking and often overlooked finding in this work is the fact that the group of participants without technical training, “mostly ordinary housewives” as Cleveland describes them, performed just as well as the group of mostly men with substantial technical training and experience. This finding provides evidence for something that I’ve long suspected: that visualizations make it easier for people lacking quantitative experience to understand your results, serving to level the playing field. If you want your findings to be broadly accessible, it’s probably better to present a visualization rather than a bunch of numbers. It also suggests that if someone is having trouble interpreting your visualizations, it’s probably your fault.
Dotplots versus pie charts and stacked barplots
Now let’s put this to the test. Take a look at each visualization below for two seconds, looking for the percent of the vote that Mitt Romney, Ron Paul, and Jon Huntsman got.
Which is easiest to read? Which conveys information most accurately? Let’s first take a look at the most critical information–the order in which the candidates placed. In all plots, the candidates are arrayed in order from highest to least vote share, and it’s easy to see that Mitt won. But once we start looking at who came in second, third, and so on, differences emerge. It’s slightly harder to process order in the pie chart because your eye has to go around the plot rather than up and down in a straight line. In the stacked bar chart, we need to look up which color corresponds to which candidate’s in the legend (as Tufte told us not to use), adding a layer of cognitive processing.
Second, which conveys estimates most accurately? The dot plot is the clear winner here. We can quickly see that Romney got about 37%, Paul got about 24%, and Huntsman got about 16%, just by looking at dots relative to the axis. When we look at the pie chart, it’s really tough to estimate the exact percent each candidate got. Same with the stacked bar chart. We could add numbers to the pie and bar charts, which would even things out to some extent, but then why not just display a table with exact percents?
One argument I used to hear all the time when I worked in industry is that pie charts “convey a sense of proportion.” Well, sure, I guess I can kind of guestimate that Ron Paul’s vote share is about 1/4. What about Jon Huntsman? Hmm, it looks like about 15 percent, which is 3/20. But wait, why do I want to convert things into fractions anyway? I don’t think in terms of fractions, I think in terms of percents. And if I really care about proportion, I suppose I could extend the axis from 0-100.
Suppose I want to plot results for the top 15 candidates, not just the top 6? Here’s what happens:
No contest, the pie chart fails completely. We’d need to add a legend with colors for each candidate, which adds another layer of cognitive processing–we’d need to look up each color in the lengend as we go. And even after adding the legend, you wouldn’t be able to distinguish the lower perf[…]
R_bloggers
R
from google
After my post on making dotplots with concise code using plyr and ggplot, I got an email from my dad who practices immigration law and runs a website with a variety of immigration resources and tools. He pointed out that the post was written for folks who already know that they want to make dot plots, and who already know about bootstrapped standard errors. That’s not many people.
In an attempt to appeal to a broader audience, I’m starting a series in which I’ll outline the key principles I use when developing a visualization. In this post, I’ll articulate these principles, which combine some of Tuft’s aesthetic guidelines with Cleveland’s scientific approach to visualization, which is based on the psychological processes involved in making sense of visualizations, and has been rigorously tested via randomized controlled experiments. Based on these principles, I’ll argue that dotplots and scatterplots are better than other types of plots (especially pie charts). In later posts, I’ll demonstrate another innovation that I’ll credit to Cleveland: the use of multiple panels (aka trellis graphics and/or facets) to clearly convey the same information embedded in more complex and difficult to read visualizations, including multiple line plots and mosaic plots. In future posts I’ll also emphasize why it is important to provide some indication of the noise present in the underlying data using error bars or bands. Along the way, I’ll put you to the test–I’ll present some visualizations of the same data using different visualization techniques and ask you to try to get as much information as you can in 2 seconds from each type of visualization.
A good visualization conveys key information to those who may have trouble interpreting numbers and/or statistics, which can make your findings accessible to a wider audience (more on this below). Visualizations also give your audience a break from lexical processing, which is especially useful when you are presenting your findings–people can listen to you and process the findings from a well-designed visual at the same time, but most people have trouble listening while reading your PowerPoint bullet points. Visualizations also convey key information embedded in massive amounts of data, which can aid your own exploratory analysis of data, no matter how massive.
Yet most visualizations are flawed, drawn using elements that make it unnecessarily difficult for the human visual system to make sense of things. I see a lot of these visualizations attending research presentations, screening incoming draft manuscripts as the assistant editor for Political Communication, and as a consumer of media info-graphics (CNN is especially bad, have a look at this monstrosity). A big part of the problem is that Microsoft makes it easy to draw flashy but ultimately confusing visualizations in Excel. If you are too busy to read this post in full, follow this short list of guidelines and you’ll be on your way to producing elegant visualizations that impose a minimal cognitive burden on your audience:
Never represent something in 2 or (god forbid) 3 dimensions if it can be represented in one—NEVER use pie charts, 3-D pie charts, stacked bar charts, or 3-D bar charts.
Remove as much chart junk as possible–unnecessary gridlines, shading, borders, etc.
Give your audience a sense of the noise present in your data–draw error bars or confidence bands if you are plotting estimates.
If you want to plot multiple types of groups on a single outcome (the visual analog of cross-tabulations/marginals), use multi-paneled plots. These can also help if overploting looks too cluttered.
Avoid mosaic plots. Instead use paneled histograms.
Ditch the legend if you can (you almost always can).
The rest of the content in this series emphasizes why it makes sense to follow these guidelines. In this post I’ll look at the first point in detail and touch on the sixth. These two guidelines are most relevant when you want to look at a quantitative variable (e.g., earnings, vote-share, temperature, etc.) across different qualitative groupings (e.g., industry segment, candidate, party, racial group, season, etc.). This is one of the most common visualization tasks in business, media, and social science, and for this task people often use pie charts and/or bar charts, and occasionally dot plots.
The science of graphical perception
When most people think about visualization, they think first of Edward Tufte. Tufte emphasizes integrity to the data, showing relationships between phenomena, and above all else aesthetic minimalism. I appreciate his ruthless crusade against chart junk and pie charts (nice quote from Data without Borders). But I’m not so sure about his advice to always show axes starting at zero, which can make comparison between two groups difficult, and to “show causality,” which can end up misleading your readers. Of course, the visualizations on display in the glossy pages of Tufte’s books are beautiful–they belong in a museum. But while his books are full of general advice that we should all keep in mind when creating plots, he does not put forth a theory of what works and what doesn’t when trying to visualize data.
Cleveland develops such a theory and subjects it to rigorous scientific testing. In my last post I linked to one of Cleveland’s studies showing that dots (or bars) aligned on the same scale are indeed the best visualization to convey a series of numerical estimates. In this work, Cleveland examined how accurately our visual system can process visual elements or “perceptual units” representing underlying data. These elements include markers aligned on the same scale (e.g., dot plots, scatterplots, ordinary bar charts), the length of lines that are not aligned on the same scale (e.g., stacked bar plots), area (pie charts and mosaic plots), angles (also pie charts), shading/color, volume, curvature, and direction.
He runs two experiments: the first compares judgements about relative position (grouped bar charts) to judgements based only on length (stacked bar charts); the second compares judgements about relative position (ordinary bar charts) to judgements about angles/area (pie charts). Here are the materials he uses, courtesy of the Stanford Computer Graphics Lab:
The results are resoundingly clear—judgements about position relative to a baseline are dramatically more accurate than judgements about angles, area, or length (with no baseline). Hence, he suggests that we replace pie charts with bar charts or dot plots and that we substitute stacked bar charts for grouped bar charts.
A striking and often overlooked finding in this work is the fact that the group of participants without technical training, “mostly ordinary housewives” as Cleveland describes them, performed just as well as the group of mostly men with substantial technical training and experience. This finding provides evidence for something that I’ve long suspected: that visualizations make it easier for people lacking quantitative experience to understand your results, serving to level the playing field. If you want your findings to be broadly accessible, it’s probably better to present a visualization rather than a bunch of numbers. It also suggests that if someone is having trouble interpreting your visualizations, it’s probably your fault.
Dotplots versus pie charts and stacked barplots
Now let’s put this to the test. Take a look at each visualization below for two seconds, looking for the percent of the vote that Mitt Romney, Ron Paul, and Jon Huntsman got.
Which is easiest to read? Which conveys information most accurately? Let’s first take a look at the most critical information–the order in which the candidates placed. In all plots, the candidates are arrayed in order from highest to least vote share, and it’s easy to see that Mitt won. But once we start looking at who came in second, third, and so on, differences emerge. It’s slightly harder to process order in the pie chart because your eye has to go around the plot rather than up and down in a straight line. In the stacked bar chart, we need to look up which color corresponds to which candidate’s in the legend (as Tufte told us not to use), adding a layer of cognitive processing.
Second, which conveys estimates most accurately? The dot plot is the clear winner here. We can quickly see that Romney got about 37%, Paul got about 24%, and Huntsman got about 16%, just by looking at dots relative to the axis. When we look at the pie chart, it’s really tough to estimate the exact percent each candidate got. Same with the stacked bar chart. We could add numbers to the pie and bar charts, which would even things out to some extent, but then why not just display a table with exact percents?
One argument I used to hear all the time when I worked in industry is that pie charts “convey a sense of proportion.” Well, sure, I guess I can kind of guestimate that Ron Paul’s vote share is about 1/4. What about Jon Huntsman? Hmm, it looks like about 15 percent, which is 3/20. But wait, why do I want to convert things into fractions anyway? I don’t think in terms of fractions, I think in terms of percents. And if I really care about proportion, I suppose I could extend the axis from 0-100.
Suppose I want to plot results for the top 15 candidates, not just the top 6? Here’s what happens:
No contest, the pie chart fails completely. We’d need to add a legend with colors for each candidate, which adds another layer of cognitive processing–we’d need to look up each color in the lengend as we go. And even after adding the legend, you wouldn’t be able to distinguish the lower perf[…]
12 weeks ago by rahuldave
Spurious Regression illustrated
12 weeks ago by rahuldave
(This article was first published on Eran Raviv » R, and kindly contributed to R-bloggers)
Spurious Regression problem dates back to Yule (1926): “Why Do We Sometimes Get Nonsense Correlations between Time-series?”. Lets see what is the problem, and how can we fix it. I am using Morgan Stanley (MS) symbol for illustration, pre-crisis time span. Take a look at the following figure, generated from the regression of MS on the S&P, actual prices of the stock, actual prices of the S&P, when we use actual prices we term it regression in levels, as in price levels, as oppose to log transformed or returns.
Regression in levels, Morgan Stanley price level and fitted values from the regression MS~SPY.
The results from the regression are:
Estimate
Std. Error
t.value
P.value
(Intercept)
-46.4234
2.1827
-21.27
0.0000
beta.hat
0.8534
0.0178
47.90
0.0000
R^2 = 0.76
The graph looks fine, and the results make sense, but utterly wrong!
Thing is, the two series are upward drifting, so.. they drift together, it seems as if they are related. As a matter of fact, they are related, but what we just did is the wrong way to check it. Here is similar results from x and y random walks!!
?View Code RSPLUSy = cumsum(rnorm(250*10,0.05)) # random normal, with small (0.05) drift.
x = cumsum(rnorm(250*10,0.05))
lm2 = lm(y~x) ; summary(lm2)
plot(y, ty = "l", main = "Fitted (in blue) over Actual --
Random WALK this time", xlab = "x") ; lines(lm2$fit, col = 4)
Estimate
Std.Error
t.value
P.value
(Intercept)
7.0474
0.4651
15.15
0.0000
x
0.5862
0.0062
94.29
0.0000
R^2 = 0.78
Note the resemblance with the previous figure and table.
So.., analysis of two Random Walks which are clearly independent from each other by construction, and the analysis of two time series in levels can have same qualitative result, as if there is a significant positive correlation, that can’t be good right?
In real life, how would I know if what I see is an actual relation or the result of two unrelated series that, just so happen, are drifting in the same direction.
Here we step into the domain of the highly important yet amazingly boring of Unit Roots. This post is not about unit roots, and I want to keep it short not to lose the remaining 5% out of the 100% who started reading. Being abusive, it is suffice to say we need to remove the drift in the series, check here and reference therein for more information.
Once the drift is removed, we can verify that indeed there is a real relation, meaning Morgan Stanley stock movement is actually affected by the market movement. Removing the drift is easy, use returns or first differences. Feel important by telling your classmates that the series are not stationary, hence the transformation.
We can transform the data from levels to returns and re-execute the regression as follows:
?View Code RSPLUSlibrary(quantmod) ; library(xtable) ; library(tseries)
tckr = c('MS', 'SPY')
end <- "2007-01-01"
start<-format(Sys.Date() - 365*8,"%Y-%m-%d") # 8 years of data
dat1 = (getSymbols(tckr[1], src="yahoo", from=start, to=end, auto.assign = FALSE))
dat2 = (getSymbols(tckr[2], src="yahoo", from=start, to=end, auto.assign = FALSE))
ret1 = (dat1[,4] - dat1[,1])/dat1[,1] # Convert to returns
ret2 = (dat2[,4] - dat2[,1])/dat2[,1]
lmret = lm(ret1~ret2)
summary(lmret)
plot(as.numeric(ret1)~as.numeric(lmret$fit))
abline(lmret, col = 2, lwd = 2.5)
Regression using returns
Now we can see that even after analyzing using returns, not levels, we still get a good fit.
You can use the “adf.test” function in package “tseries” to check if your series drift (stationary*) or not.
?View Code RSPLUSadf.test(as.numeric(dat1[,1])) # --> P.value is 0.6481 --> has Unit Root
adf.test(as.numeric(ret1)) # --> P.value < 0.01 --> no Unit Root
As a final note, fact that we cannot make any inference using price levels does not render the regression completely useless. Both “MS” and “S&P” series are NOT stationary, but together they ARE co-integrated, which is the main justification behind pairs trading. Co-integrated means that y-series may drift, x-series may drift, but the residual from the regression will not!
Residuals from Regression on levels
See how the residuals from the regression fluctuate around zero.
Comments
1. * — stationary process does not only mean “no drift”, we have weak definition and strong definition, see here for more information.
2. according to the graph it seems that it was a good time to short MS and hedge with the market at the end end of the time span I used, which is start of 2007. I leave it to the reader to check what would have been the loss on such a trade.
Thanks for reading.
References
Regression Modeling Strategies: With Applications to Linear Models, Logistic Regression, and Survival Analysis (Springer Series in Statistics)
Financial Econometrics: From Basics to Advanced Modeling Techniques (Frank J. Fabozzi Series)
A Companion to Theoretical Econometrics (Blackwell Companions to Contemporary Economics)
Financial Econometrics: Problems, Models, and Methods.
Time Series Analysis
To leave a comment for the author, please follow the link and comment on his blog: Eran Raviv » R.
R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series,ecdf, trading) and more...
R_bloggers
blog
R
from google
Spurious Regression problem dates back to Yule (1926): “Why Do We Sometimes Get Nonsense Correlations between Time-series?”. Lets see what is the problem, and how can we fix it. I am using Morgan Stanley (MS) symbol for illustration, pre-crisis time span. Take a look at the following figure, generated from the regression of MS on the S&P, actual prices of the stock, actual prices of the S&P, when we use actual prices we term it regression in levels, as in price levels, as oppose to log transformed or returns.
Regression in levels, Morgan Stanley price level and fitted values from the regression MS~SPY.
The results from the regression are:
Estimate
Std. Error
t.value
P.value
(Intercept)
-46.4234
2.1827
-21.27
0.0000
beta.hat
0.8534
0.0178
47.90
0.0000
R^2 = 0.76
The graph looks fine, and the results make sense, but utterly wrong!
Thing is, the two series are upward drifting, so.. they drift together, it seems as if they are related. As a matter of fact, they are related, but what we just did is the wrong way to check it. Here is similar results from x and y random walks!!
?View Code RSPLUSy = cumsum(rnorm(250*10,0.05)) # random normal, with small (0.05) drift.
x = cumsum(rnorm(250*10,0.05))
lm2 = lm(y~x) ; summary(lm2)
plot(y, ty = "l", main = "Fitted (in blue) over Actual --
Random WALK this time", xlab = "x") ; lines(lm2$fit, col = 4)
Estimate
Std.Error
t.value
P.value
(Intercept)
7.0474
0.4651
15.15
0.0000
x
0.5862
0.0062
94.29
0.0000
R^2 = 0.78
Note the resemblance with the previous figure and table.
So.., analysis of two Random Walks which are clearly independent from each other by construction, and the analysis of two time series in levels can have same qualitative result, as if there is a significant positive correlation, that can’t be good right?
In real life, how would I know if what I see is an actual relation or the result of two unrelated series that, just so happen, are drifting in the same direction.
Here we step into the domain of the highly important yet amazingly boring of Unit Roots. This post is not about unit roots, and I want to keep it short not to lose the remaining 5% out of the 100% who started reading. Being abusive, it is suffice to say we need to remove the drift in the series, check here and reference therein for more information.
Once the drift is removed, we can verify that indeed there is a real relation, meaning Morgan Stanley stock movement is actually affected by the market movement. Removing the drift is easy, use returns or first differences. Feel important by telling your classmates that the series are not stationary, hence the transformation.
We can transform the data from levels to returns and re-execute the regression as follows:
?View Code RSPLUSlibrary(quantmod) ; library(xtable) ; library(tseries)
tckr = c('MS', 'SPY')
end <- "2007-01-01"
start<-format(Sys.Date() - 365*8,"%Y-%m-%d") # 8 years of data
dat1 = (getSymbols(tckr[1], src="yahoo", from=start, to=end, auto.assign = FALSE))
dat2 = (getSymbols(tckr[2], src="yahoo", from=start, to=end, auto.assign = FALSE))
ret1 = (dat1[,4] - dat1[,1])/dat1[,1] # Convert to returns
ret2 = (dat2[,4] - dat2[,1])/dat2[,1]
lmret = lm(ret1~ret2)
summary(lmret)
plot(as.numeric(ret1)~as.numeric(lmret$fit))
abline(lmret, col = 2, lwd = 2.5)
Regression using returns
Now we can see that even after analyzing using returns, not levels, we still get a good fit.
You can use the “adf.test” function in package “tseries” to check if your series drift (stationary*) or not.
?View Code RSPLUSadf.test(as.numeric(dat1[,1])) # --> P.value is 0.6481 --> has Unit Root
adf.test(as.numeric(ret1)) # --> P.value < 0.01 --> no Unit Root
As a final note, fact that we cannot make any inference using price levels does not render the regression completely useless. Both “MS” and “S&P” series are NOT stationary, but together they ARE co-integrated, which is the main justification behind pairs trading. Co-integrated means that y-series may drift, x-series may drift, but the residual from the regression will not!
Residuals from Regression on levels
See how the residuals from the regression fluctuate around zero.
Comments
1. * — stationary process does not only mean “no drift”, we have weak definition and strong definition, see here for more information.
2. according to the graph it seems that it was a good time to short MS and hedge with the market at the end end of the time span I used, which is start of 2007. I leave it to the reader to check what would have been the loss on such a trade.
Thanks for reading.
References
Regression Modeling Strategies: With Applications to Linear Models, Logistic Regression, and Survival Analysis (Springer Series in Statistics)
Financial Econometrics: From Basics to Advanced Modeling Techniques (Frank J. Fabozzi Series)
A Companion to Theoretical Econometrics (Blackwell Companions to Contemporary Economics)
Financial Econometrics: Problems, Models, and Methods.
Time Series Analysis
To leave a comment for the author, please follow the link and comment on his blog: Eran Raviv » R.
R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series,ecdf, trading) and more...
12 weeks ago by rahuldave
ABC in Roma [R lab #2]
march 2012 by rahuldave
(This article was first published on Xi'an's Og » R, and kindly contributed to R-bloggers)
Here are the R codes of the second R lab organised by Serena Arima in supplement of my lectures (now completed!). This morning I covered ABC model choice and the following example is the benchmark used in the course (and in the paper) about the impact of summary statistics. (Warning! It takes a while to run…)
n.iter=10000
n=c(10,100,1000)
n.sims=100
prob.m1=matrix(0,nrow=n.sims,ncol=length(n))
prob.m2=prob.m1
### Simulation from Normal model
for(sims in 1:n.sims){
## True data generation from the Normal distribution and summary statistics
for(i in 1:length(n)){
y.true=rnorm(n[i],0,1)
stat.true=c(mean(y.true),median(y.true),var(y.true))
## ABC algorithm
# Sample from the prior
mu=rnorm(n.iter,0,2)
dist.m1=rep(NA,n.iter)
dist.m2=rep(NA,n.iter)
for(j in 1:n.iter){
# Data generation under both models
# Normal model
y.m1=rnorm(n[i],mu[j])
stat.m1=c(mean(y.m1),median(y.m1),var(y.m1))
# computing the distance
dist.m1[j]=sum((stat.m1-stat.true)^2)
# Laplace model
y.m2=mu[j]+sample(c(-1,1),n[i],rep=TRUE)*rexp(n[i],rate=sqrt(2))
stat.m2=c(mean(y.m2),median(y.m2),var(y.m2))
# computing the distance
dist.m2[j]=sum((stat.m2-stat.true)^2)
}
# select epsilon as 1% distance quantile
epsilon=quantile(c(dist.m1,dist.m2),prob=0.01)
# compute the posterior probability that data come from
# the model
prob.m1[sims,i]=sum(dist.m1<epsilon)/(2*n.iter*0.01)
}}
### Simulation from the Laplace model
for(sims in 1:n.sims){
## True data generation from the Laplace distribution and summary statistics
for(i in 1:length(n)){
y.true=sample(c(-1,1),n[i],rep=TRUE)*rexp(n[i],rate=sqrt(2))
stat.true=c(mean(y.true),median(y.true),var(y.true))
## ABC algorithm
# Sample from the prior
mu=rnorm(n.iter,0,2)
dist.m1=rep(NA,n.iter)
dist.m2=rep(NA,n.iter)
for(j in 1:n.iter){
# Data generation under both models
# Normal model
y.m1=rnorm(n[i],mu[j])
stat.m1=c(mean(y.m1),median(y.m1),var(y.m1))
# computing the distance
dist.m1[j]=sum((stat.m1-stat.true)^2)
# Laplace model
y.m2=mu[j]+sample(c(-1,1),n[i],rep=TRUE)*rexp(n[i],rate=sqrt(2))
stat.m2=c(mean(y.m2),median(y.m2),var(y.m2))
# computing the distance
dist.m2[j]=sum((stat.m2-stat.true)^2)
}
# select epsilon as 1% distance quantile
epsilon=quantile(c(dist.m1,dist.m2),prob=0.01)
# compute the posterior probability that data come from
# the model
prob.m2[sims,i]=sum(dist.m2<epsilon)/(2*n.iter*0.01)
}
}
# Visualize the results
y.true=sample(c(-1,1),n[i],rep=TRUE)*rexp(n[i],rate=sqrt(2))
stat.true=c(mean(y.true),median(y.true),var(y.true))
## ABC algorithm
# Sample from the prior
mu=rnorm(n.iter,0,2)
dist.m1=rep(NA,n.iter)
dist.m2=rep(NA,n.iter)
for(j in 1:n.iter){
# Data generation under both models
# Normal model
y.m1=rnorm(n[i],mu[j])
stat.m1=c(mean(y.m1),median(y.m1),var(y.m1))
# computing the distance
dist.m1[j]=sum((stat.m1-stat.true)^2)
# Laplace model
y.m2=mu[j]+sample(c(-1,1),n[i],rep=TRUE)*rexp(n[i],rate=sqrt(2))
stat.m2=c(mean(y.m2),median(y.m2),var(y.m2))
# computing the distance
dist.m2[j]=sum((stat.m2-stat.true)^2)
}
# select epsilon as 1% distance quantile
epsilon=quantile(c(dist.m1,dist.m2),prob=0.01)
# compute the posterior probability that data come from
# the model
prob.m2[sims,i]=sum(dist.m2<epsilon)/(2*n.iter*0.01)
}
}
# Visualize the results
y.true=sample(c(-1,1),n[i],rep=TRUE)*rexp(n[i],rate=sqrt(2))
stat.true=c(mean(y.true),median(y.true),var(y.true))
## ABC algorithm
# Sample from the prior
mu=rnorm(n.iter,0,2)
dist.m1=rep(NA,n.iter)
dist.m2=rep(NA,n.iter)
for(j in 1:n.iter){
# Data generation under both models
# Normal model
y.m1=rnorm(n[i],mu[j])
stat.m1=c(mean(y.m1),median(y.m1),var(y.m1))
# computing the distance
dist.m1[j]=sum((stat.m1-stat.true)^2)
# Laplace model
y.m2=mu[j]+sample(c(-1,1),n[i],rep=TRUE)*rexp(n[i],rate=sqrt(2))
stat.m2=c(mean(y.m2),median(y.m2),var(y.m2))
# computing the distance
dist.m2[j]=sum((stat.m2-stat.true)^2)
}
# select epsilon as 1% distance quantile
epsilon=quantile(c(dist.m1,dist.m2),prob=0.01)
# compute the posterior probability that data come from
# the model
prob.m2[sims,i]=sum(dist.m2<epsilon)/(2*n.iter*0.01)
}}
# Visualize the results
boxplot(data.frame(prob.m1[,1],prob.m2[,1]),
names=c("M1","M2"),main="N=10")
boxplot(data.frame(prob.m1[,2],prob.m2[,2]),
names=c("M1","M2"),main="N=100")
boxplot(data.frame(prob.m1[,3],prob.m2[,3]),
names=c("M1","M2"),main="N=1000")
Once again, I had a terrific time teaching this “ABC in Roma” course, and not only for the immense side benefit of enjoy Roma in a particularly pleasant weather (for late February).
Filed under: R, Statistics, University life Tagged: ABC, La Sapienza, PhD course, R, Roma
To leave a comment for the author, please follow the link and comment on his blog: Xi'an's Og » R.
R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series,ecdf, trading) and more...
R_bloggers
ABC
La_Sapienza
PhD_course
R
Roma
statistics
University_life
from google
Here are the R codes of the second R lab organised by Serena Arima in supplement of my lectures (now completed!). This morning I covered ABC model choice and the following example is the benchmark used in the course (and in the paper) about the impact of summary statistics. (Warning! It takes a while to run…)
n.iter=10000
n=c(10,100,1000)
n.sims=100
prob.m1=matrix(0,nrow=n.sims,ncol=length(n))
prob.m2=prob.m1
### Simulation from Normal model
for(sims in 1:n.sims){
## True data generation from the Normal distribution and summary statistics
for(i in 1:length(n)){
y.true=rnorm(n[i],0,1)
stat.true=c(mean(y.true),median(y.true),var(y.true))
## ABC algorithm
# Sample from the prior
mu=rnorm(n.iter,0,2)
dist.m1=rep(NA,n.iter)
dist.m2=rep(NA,n.iter)
for(j in 1:n.iter){
# Data generation under both models
# Normal model
y.m1=rnorm(n[i],mu[j])
stat.m1=c(mean(y.m1),median(y.m1),var(y.m1))
# computing the distance
dist.m1[j]=sum((stat.m1-stat.true)^2)
# Laplace model
y.m2=mu[j]+sample(c(-1,1),n[i],rep=TRUE)*rexp(n[i],rate=sqrt(2))
stat.m2=c(mean(y.m2),median(y.m2),var(y.m2))
# computing the distance
dist.m2[j]=sum((stat.m2-stat.true)^2)
}
# select epsilon as 1% distance quantile
epsilon=quantile(c(dist.m1,dist.m2),prob=0.01)
# compute the posterior probability that data come from
# the model
prob.m1[sims,i]=sum(dist.m1<epsilon)/(2*n.iter*0.01)
}}
### Simulation from the Laplace model
for(sims in 1:n.sims){
## True data generation from the Laplace distribution and summary statistics
for(i in 1:length(n)){
y.true=sample(c(-1,1),n[i],rep=TRUE)*rexp(n[i],rate=sqrt(2))
stat.true=c(mean(y.true),median(y.true),var(y.true))
## ABC algorithm
# Sample from the prior
mu=rnorm(n.iter,0,2)
dist.m1=rep(NA,n.iter)
dist.m2=rep(NA,n.iter)
for(j in 1:n.iter){
# Data generation under both models
# Normal model
y.m1=rnorm(n[i],mu[j])
stat.m1=c(mean(y.m1),median(y.m1),var(y.m1))
# computing the distance
dist.m1[j]=sum((stat.m1-stat.true)^2)
# Laplace model
y.m2=mu[j]+sample(c(-1,1),n[i],rep=TRUE)*rexp(n[i],rate=sqrt(2))
stat.m2=c(mean(y.m2),median(y.m2),var(y.m2))
# computing the distance
dist.m2[j]=sum((stat.m2-stat.true)^2)
}
# select epsilon as 1% distance quantile
epsilon=quantile(c(dist.m1,dist.m2),prob=0.01)
# compute the posterior probability that data come from
# the model
prob.m2[sims,i]=sum(dist.m2<epsilon)/(2*n.iter*0.01)
}
}
# Visualize the results
y.true=sample(c(-1,1),n[i],rep=TRUE)*rexp(n[i],rate=sqrt(2))
stat.true=c(mean(y.true),median(y.true),var(y.true))
## ABC algorithm
# Sample from the prior
mu=rnorm(n.iter,0,2)
dist.m1=rep(NA,n.iter)
dist.m2=rep(NA,n.iter)
for(j in 1:n.iter){
# Data generation under both models
# Normal model
y.m1=rnorm(n[i],mu[j])
stat.m1=c(mean(y.m1),median(y.m1),var(y.m1))
# computing the distance
dist.m1[j]=sum((stat.m1-stat.true)^2)
# Laplace model
y.m2=mu[j]+sample(c(-1,1),n[i],rep=TRUE)*rexp(n[i],rate=sqrt(2))
stat.m2=c(mean(y.m2),median(y.m2),var(y.m2))
# computing the distance
dist.m2[j]=sum((stat.m2-stat.true)^2)
}
# select epsilon as 1% distance quantile
epsilon=quantile(c(dist.m1,dist.m2),prob=0.01)
# compute the posterior probability that data come from
# the model
prob.m2[sims,i]=sum(dist.m2<epsilon)/(2*n.iter*0.01)
}
}
# Visualize the results
y.true=sample(c(-1,1),n[i],rep=TRUE)*rexp(n[i],rate=sqrt(2))
stat.true=c(mean(y.true),median(y.true),var(y.true))
## ABC algorithm
# Sample from the prior
mu=rnorm(n.iter,0,2)
dist.m1=rep(NA,n.iter)
dist.m2=rep(NA,n.iter)
for(j in 1:n.iter){
# Data generation under both models
# Normal model
y.m1=rnorm(n[i],mu[j])
stat.m1=c(mean(y.m1),median(y.m1),var(y.m1))
# computing the distance
dist.m1[j]=sum((stat.m1-stat.true)^2)
# Laplace model
y.m2=mu[j]+sample(c(-1,1),n[i],rep=TRUE)*rexp(n[i],rate=sqrt(2))
stat.m2=c(mean(y.m2),median(y.m2),var(y.m2))
# computing the distance
dist.m2[j]=sum((stat.m2-stat.true)^2)
}
# select epsilon as 1% distance quantile
epsilon=quantile(c(dist.m1,dist.m2),prob=0.01)
# compute the posterior probability that data come from
# the model
prob.m2[sims,i]=sum(dist.m2<epsilon)/(2*n.iter*0.01)
}}
# Visualize the results
boxplot(data.frame(prob.m1[,1],prob.m2[,1]),
names=c("M1","M2"),main="N=10")
boxplot(data.frame(prob.m1[,2],prob.m2[,2]),
names=c("M1","M2"),main="N=100")
boxplot(data.frame(prob.m1[,3],prob.m2[,3]),
names=c("M1","M2"),main="N=1000")
Once again, I had a terrific time teaching this “ABC in Roma” course, and not only for the immense side benefit of enjoy Roma in a particularly pleasant weather (for late February).
Filed under: R, Statistics, University life Tagged: ABC, La Sapienza, PhD course, R, Roma
To leave a comment for the author, please follow the link and comment on his blog: Xi'an's Og » R.
R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series,ecdf, trading) and more...
march 2012 by rahuldave
Modeling Trick: the Signed Pseudo Logarithm
march 2012 by rahuldave
(This article was first published on Win-Vector Blog » R, and kindly contributed to R-bloggers)
Much of the data that the analyst uses exhibits extraordinary range. For example: incomes, company sizes, popularity of books and any “winner takes all process”; (see: Living in A Lognormal World). Tukey recommended the logarithm as an important “stabilizing transform” (a transform that brings data into a more usable form prior to generating exploratory statistics, analysis or modeling). One benefit of such transforms is: data that is normal (or Gaussian) meets more of the stated expectations of common modeling methods like least squares linear regression. So data from distributions like the lognormal is well served by a log() transformation (that transforms the data closer to Gaussian) prior to analysis. However, not all data is appropriate for a log-transform (such as data with zero or negative values). We discuss a simple transform that we call a signed pseudo logarithm that is particularly appropriate to signed wide-range data (such as profit and loss).Log-transforming data is essential when analyzing systems that operate in relative terms or are “scale invariant” (such as financial returns). For example Geometric Brownian motion is a stochastic process central to a number of financial models (such as option pricing). Geometric Brownian motion is actually the exponential of a standard linear Brownian motion (where increments are normal or Gaussian). In this case a logarithmic transformation actually moves from the observed data to a more natural frame of reference (where increments are additive instead of being multiplicative). However, a major shortcoming of the log transform is its inability to deal with zero values and negative values.
A signed value we have often been asked to characterize or predict is: profit and loss (often called P&L). One natural model for P&L would be as the difference between a revenue and an expense. If the revenue and expense were both normally distributed then their difference would also be normal (and we would not need any stabilizing transform). However, if the revenue and expense were both log-normally distributed (say both proportional to task size or some other log-normal parameter) then the difference is not normal (it retains the propensity for extreme values or heavy tails of the original distributions). And for many financial size measures (company size, contract size and so on) the log-normal distribution is a much more realistic model than the normal distribution. In some situations P&L’s are formed from completely observed revenues and expenses (so we can model everything without sign problems), in other situations the signed P&L from an unobserved (or unrecorded) underling process and we are forced to deal with signed quantities.
For signed data we suggest the following transformation (code in R):
pseudoLog10 <- function(x) { asinh(x/2)/log(10) }
asinh() is a somewhat ugly function that is the inverse of sinh(). sinh() is defined as:
sinh(x) = (e^x - e^(-x))/2
The important point is for x such that |x| is large 2*sinh(x) rapidly approaches sign(x)*e^(|x|). Thus we should expect asinh(x/2) to look a lot like sign(x)*log(|x|) (which is why we call it a signed pseudo logarithm). For pseudoLog10() we take the previous function divided by log(10) to ensure that we are in log-10 like units (i.e. pseudoLog10(100) is nearly 2, pseudoLog10(1000) is nearly 3 and so on). Business audiences tend to have an easier time with log-10 (or dB) units (which can be explained as counting the number of decimal digits) than natural log or log-e units.
So for large positive numbers pseudoLog10() pretty much behaves like log10() (itself a standard transform). In fact pseudoLog10() has the following nice properties:
pseudoLog10(x) is defined for all real x.
pseudoLog10(0) = 0.
pseudoLog10(-x) = -pseudoLog10(x).
pseudoLog10(x) is monotone in x.
For x such that |x| is large: pseudoLog10(x) is very near sign(x)*log10(|x|).
We strongly recomend trying this transformation before feeding heavy tail data into a linear or logistic model.
However, we can not recommend the transformation for presentation. Consider the simple case of plotting the distribution or density of normal data with mean zero and standard-deviation 10 (see My Favorite Graphs for description of a density plot):
library(ggplot2)
pseudoLog10 <- function(x) { asinh(x/2)/log(10) }
d <- data.frame(x=rnorm(n=1000,sd=100))
ggplot(d) + geom_density(aes(x=x))
The density plot shows what we would expect- a near normal distribution (most points towards the center and mass falling off quickly as we move away). However, the plot of the pseudoLog10 transformed data is not what we would hope:
d$pseudoLog10x <- pseudoLog10(d$x)
ggplot(d) + geom_density(aes(x=pseudoLog10x))
The data density (falsely) appears bimodal! This is because the pseudoLog10() transform is compressing ranges more and more violently as we move away from the origin (and not compressing near the origin). So as we move away from origin: the product of the real data density times the degree of range compression climbs, achieves a maximum and then falls. This phenomena (which is just a “change of variables” for densities) gives us the bimodal appearance for unimodal distributions that have significant mass outside of the range [-10,10]. The bimodal appearance is mostly a fact about the transform not really a feature of the underlying data.
We see value in examining at the relative sizes and centers of these two modes for asymmetric distributions (such as the profit and loss statement for a set of accounts that are mostly losing money). The position and relative sizes of the modes gives us an initial hint what to look for (helps with questions like: “are total losses driven by many accounts or by few accounts” and so on). We can not, however, recommend the pseudoLog10() transform for presentation. The most striking feature of the graph is almost always the bimodal appearance of the data; and the bimodal appearance is almost always an artifact of the transform (not a real feature of the data). You can not in good conscious push a presentation where the most prominent and exciting observation is not in fact in the data.
We do still recommend trying the pseudoLog10() transform when building a linear or logistic model with wide ranged data. The transformation usefully compresses range which allows the modeled coefficients to be a function of most of the data and not a function of a few extreme values. Models that depend on most of their data (or on central estimates from their data) tend to be safer, achieve higher statistical significance and cross-validate more reliably. Models that are dominated by a few extreme values tend to be unsafe, not achieve statistical significance and not cross-validate reliably. The bimodal artifact can work in the favor of modeling as it tends to compress a transformed variable into “typical positive example” and “typical negative example” while still allowing magnitudes to enter the model in some form.
Used with care the pseudoLog10() or arcsinh() transform can be an important data preparation step for signed data with large range. Many financial summaries (such as P&L) meet these conditions and often profit from the transform.
Related posts:
Your Data is Never the Right Shape
To leave a comment for the author, please follow the link and comment on his blog: Win-Vector Blog » R.
R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series,ecdf, trading) and more...
R_bloggers
arcsinh
R
singed_pseudo_logarithm
stabilizing_transform
statistics
from google
Much of the data that the analyst uses exhibits extraordinary range. For example: incomes, company sizes, popularity of books and any “winner takes all process”; (see: Living in A Lognormal World). Tukey recommended the logarithm as an important “stabilizing transform” (a transform that brings data into a more usable form prior to generating exploratory statistics, analysis or modeling). One benefit of such transforms is: data that is normal (or Gaussian) meets more of the stated expectations of common modeling methods like least squares linear regression. So data from distributions like the lognormal is well served by a log() transformation (that transforms the data closer to Gaussian) prior to analysis. However, not all data is appropriate for a log-transform (such as data with zero or negative values). We discuss a simple transform that we call a signed pseudo logarithm that is particularly appropriate to signed wide-range data (such as profit and loss).Log-transforming data is essential when analyzing systems that operate in relative terms or are “scale invariant” (such as financial returns). For example Geometric Brownian motion is a stochastic process central to a number of financial models (such as option pricing). Geometric Brownian motion is actually the exponential of a standard linear Brownian motion (where increments are normal or Gaussian). In this case a logarithmic transformation actually moves from the observed data to a more natural frame of reference (where increments are additive instead of being multiplicative). However, a major shortcoming of the log transform is its inability to deal with zero values and negative values.
A signed value we have often been asked to characterize or predict is: profit and loss (often called P&L). One natural model for P&L would be as the difference between a revenue and an expense. If the revenue and expense were both normally distributed then their difference would also be normal (and we would not need any stabilizing transform). However, if the revenue and expense were both log-normally distributed (say both proportional to task size or some other log-normal parameter) then the difference is not normal (it retains the propensity for extreme values or heavy tails of the original distributions). And for many financial size measures (company size, contract size and so on) the log-normal distribution is a much more realistic model than the normal distribution. In some situations P&L’s are formed from completely observed revenues and expenses (so we can model everything without sign problems), in other situations the signed P&L from an unobserved (or unrecorded) underling process and we are forced to deal with signed quantities.
For signed data we suggest the following transformation (code in R):
pseudoLog10 <- function(x) { asinh(x/2)/log(10) }
asinh() is a somewhat ugly function that is the inverse of sinh(). sinh() is defined as:
sinh(x) = (e^x - e^(-x))/2
The important point is for x such that |x| is large 2*sinh(x) rapidly approaches sign(x)*e^(|x|). Thus we should expect asinh(x/2) to look a lot like sign(x)*log(|x|) (which is why we call it a signed pseudo logarithm). For pseudoLog10() we take the previous function divided by log(10) to ensure that we are in log-10 like units (i.e. pseudoLog10(100) is nearly 2, pseudoLog10(1000) is nearly 3 and so on). Business audiences tend to have an easier time with log-10 (or dB) units (which can be explained as counting the number of decimal digits) than natural log or log-e units.
So for large positive numbers pseudoLog10() pretty much behaves like log10() (itself a standard transform). In fact pseudoLog10() has the following nice properties:
pseudoLog10(x) is defined for all real x.
pseudoLog10(0) = 0.
pseudoLog10(-x) = -pseudoLog10(x).
pseudoLog10(x) is monotone in x.
For x such that |x| is large: pseudoLog10(x) is very near sign(x)*log10(|x|).
We strongly recomend trying this transformation before feeding heavy tail data into a linear or logistic model.
However, we can not recommend the transformation for presentation. Consider the simple case of plotting the distribution or density of normal data with mean zero and standard-deviation 10 (see My Favorite Graphs for description of a density plot):
library(ggplot2)
pseudoLog10 <- function(x) { asinh(x/2)/log(10) }
d <- data.frame(x=rnorm(n=1000,sd=100))
ggplot(d) + geom_density(aes(x=x))
The density plot shows what we would expect- a near normal distribution (most points towards the center and mass falling off quickly as we move away). However, the plot of the pseudoLog10 transformed data is not what we would hope:
d$pseudoLog10x <- pseudoLog10(d$x)
ggplot(d) + geom_density(aes(x=pseudoLog10x))
The data density (falsely) appears bimodal! This is because the pseudoLog10() transform is compressing ranges more and more violently as we move away from the origin (and not compressing near the origin). So as we move away from origin: the product of the real data density times the degree of range compression climbs, achieves a maximum and then falls. This phenomena (which is just a “change of variables” for densities) gives us the bimodal appearance for unimodal distributions that have significant mass outside of the range [-10,10]. The bimodal appearance is mostly a fact about the transform not really a feature of the underlying data.
We see value in examining at the relative sizes and centers of these two modes for asymmetric distributions (such as the profit and loss statement for a set of accounts that are mostly losing money). The position and relative sizes of the modes gives us an initial hint what to look for (helps with questions like: “are total losses driven by many accounts or by few accounts” and so on). We can not, however, recommend the pseudoLog10() transform for presentation. The most striking feature of the graph is almost always the bimodal appearance of the data; and the bimodal appearance is almost always an artifact of the transform (not a real feature of the data). You can not in good conscious push a presentation where the most prominent and exciting observation is not in fact in the data.
We do still recommend trying the pseudoLog10() transform when building a linear or logistic model with wide ranged data. The transformation usefully compresses range which allows the modeled coefficients to be a function of most of the data and not a function of a few extreme values. Models that depend on most of their data (or on central estimates from their data) tend to be safer, achieve higher statistical significance and cross-validate more reliably. Models that are dominated by a few extreme values tend to be unsafe, not achieve statistical significance and not cross-validate reliably. The bimodal artifact can work in the favor of modeling as it tends to compress a transformed variable into “typical positive example” and “typical negative example” while still allowing magnitudes to enter the model in some form.
Used with care the pseudoLog10() or arcsinh() transform can be an important data preparation step for signed data with large range. Many financial summaries (such as P&L) meet these conditions and often profit from the transform.
Related posts:
Your Data is Never the Right Shape
To leave a comment for the author, please follow the link and comment on his blog: Win-Vector Blog » R.
R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series,ecdf, trading) and more...
march 2012 by rahuldave
Statistics project ideas for students
february 2012 by rahuldave
(This article was first published on Simply Statistics, and kindly contributed to R-bloggers)
Here are a few ideas that might make for interesting student projects at all levels (from high-school to graduate school). I’d welcome ideas/suggestions/additions to the list as well. All of these ideas depend on free or scraped data, which means that anyone can work on them. I’ve given a ballpark difficulty for each project to give people some idea.
Happy data crunching!
Data Collection/Synthesis
Creating a webpage that explains conceptual statistical issues like randomization, margin of error, overfitting, cross-validation, concepts in data visualization, sampling. The webpage should not use any math at all and should explain the concepts so a general audience could understand. Bonus points if you make short 30 second animated youtube clips that explain the concepts. (Difficulty: Lowish; Effort: Highish)
Building an aggregator for statistics papers across disciplines that can be the central resource for statisticians. Journals ranging from PLoS Genetics to Neuroimage now routinely publish statistical papers. But there is no one central resource that aggregates all the statistics papers published across disciplines. Such a resource would be hugely useful to statisticians. You could build it using blogging software like Wordpress so articles could be tagged/you could put the resource in your RSS feeder. (Difficulty: Lowish; Effort: Mediumish)
Data Analyses
Scrape the LivingSocial/Groupon sites for the daily deals and develop a prediction of how successful the deal will be based on location/price/type of deal. You could use either the RCurl R package or the XML R package to scrape the data. (Difficulty: Mediumish; Effort: Mediumish)
You could use the data from your city (here are a few cities with open data) to: (a) identify the best and worst neighborhoods to live in based on different metrics like how many parks are within walking distance, crime statistics, etc. (b) identify concrete measures your city could take to improve different quality of life metrics like those described above - say where should the city put a park, or (c) see if you can predict when/where crimes will occur (like these guys did). (Difficulty: Mediumish; Effort: Highish)
Download data on state of the union speeches from here and use the tm package in R to analyze the patterns of word use over time (Difficulty: Lowish; Effort: Lowish)
Use this data set from Donors Choose to determine the characteristics that make the funding of projects more likely. You could send your results to the Donors Choose folks to help them improve the funding rate for their projects. (Difficulty: Mediumish; Effort: Mediumish)
Which basketball player would you want on your team? Here is a really simple analysis done by Rafa. But it doesn’t take into account things like defense. If you want to take on this project, you should take a look at this Denis Rodman analysis which is the gold standard. (Difficulty: Mediumish; Effort: Highish).
Data visualization
Creating an R package that wraps the svgAnnotation package. This package can be used to create dynamic graphics in R, but is still a bit too flexible for most people to use. Writing some wrapper functions that simplify the interface would be potentially high impact. Maybe something like svgPlot() to create simple, dynamic graphics with only a few options (Difficulty: Mediumish; Effort: Mediumish).
The same as project 1 but for D3.js. The impact could potentially be a bit higher, since the graphics are a bit more professional, but the level of difficulty and effort would also both be higher. (Difficulty: Highish; Effort: Highish)
To leave a comment for the author, please follow the link and comment on his blog: Simply Statistics.
R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series,ecdf, trading) and more...
R_bloggers
DIY
education
Projects
R
from google
Here are a few ideas that might make for interesting student projects at all levels (from high-school to graduate school). I’d welcome ideas/suggestions/additions to the list as well. All of these ideas depend on free or scraped data, which means that anyone can work on them. I’ve given a ballpark difficulty for each project to give people some idea.
Happy data crunching!
Data Collection/Synthesis
Creating a webpage that explains conceptual statistical issues like randomization, margin of error, overfitting, cross-validation, concepts in data visualization, sampling. The webpage should not use any math at all and should explain the concepts so a general audience could understand. Bonus points if you make short 30 second animated youtube clips that explain the concepts. (Difficulty: Lowish; Effort: Highish)
Building an aggregator for statistics papers across disciplines that can be the central resource for statisticians. Journals ranging from PLoS Genetics to Neuroimage now routinely publish statistical papers. But there is no one central resource that aggregates all the statistics papers published across disciplines. Such a resource would be hugely useful to statisticians. You could build it using blogging software like Wordpress so articles could be tagged/you could put the resource in your RSS feeder. (Difficulty: Lowish; Effort: Mediumish)
Data Analyses
Scrape the LivingSocial/Groupon sites for the daily deals and develop a prediction of how successful the deal will be based on location/price/type of deal. You could use either the RCurl R package or the XML R package to scrape the data. (Difficulty: Mediumish; Effort: Mediumish)
You could use the data from your city (here are a few cities with open data) to: (a) identify the best and worst neighborhoods to live in based on different metrics like how many parks are within walking distance, crime statistics, etc. (b) identify concrete measures your city could take to improve different quality of life metrics like those described above - say where should the city put a park, or (c) see if you can predict when/where crimes will occur (like these guys did). (Difficulty: Mediumish; Effort: Highish)
Download data on state of the union speeches from here and use the tm package in R to analyze the patterns of word use over time (Difficulty: Lowish; Effort: Lowish)
Use this data set from Donors Choose to determine the characteristics that make the funding of projects more likely. You could send your results to the Donors Choose folks to help them improve the funding rate for their projects. (Difficulty: Mediumish; Effort: Mediumish)
Which basketball player would you want on your team? Here is a really simple analysis done by Rafa. But it doesn’t take into account things like defense. If you want to take on this project, you should take a look at this Denis Rodman analysis which is the gold standard. (Difficulty: Mediumish; Effort: Highish).
Data visualization
Creating an R package that wraps the svgAnnotation package. This package can be used to create dynamic graphics in R, but is still a bit too flexible for most people to use. Writing some wrapper functions that simplify the interface would be potentially high impact. Maybe something like svgPlot() to create simple, dynamic graphics with only a few options (Difficulty: Mediumish; Effort: Mediumish).
The same as project 1 but for D3.js. The impact could potentially be a bit higher, since the graphics are a bit more professional, but the level of difficulty and effort would also both be higher. (Difficulty: Highish; Effort: Highish)
To leave a comment for the author, please follow the link and comment on his blog: Simply Statistics.
R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series,ecdf, trading) and more...
february 2012 by rahuldave
XYZ geographic data interpolation, part 2
february 2012 by rahuldave
(This article was first published on me nugget, and kindly contributed to R-bloggers)
Having recently received a comment on a post regarding geographic xyz data interpolation, I decided to return to my original "xyz.map" function and open it up for easier interpretation. This should make the method easier to adapt and follow.The above graph shows the distance to Mecca as interpolated from 1000 randomly generated lat/lon data using the "interp" function of the akima package. Several functions, found within this blog, are needed to reproduce the plot (pos2coord, earth.dist, new.lon.lat, color.palette, val2col, image.scale). One thing you will notice is the strip of uninterpolated area within the stereographic projection. This is a problem that I have yet to resolve and has to do with the fact that the interpolation is not considering the connection along the 180° longitude line. This will probably require some other type of interpolation based on geographic distances rather than Cartesian coordinates. R code to produce the above graph...Read more »
To leave a comment for the author, please follow the link and comment on his blog: me nugget.
R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series,ecdf, trading) and more...
R_bloggers
code
graphics
interpolation
map
R
r-project
spatial
from google
Having recently received a comment on a post regarding geographic xyz data interpolation, I decided to return to my original "xyz.map" function and open it up for easier interpretation. This should make the method easier to adapt and follow.The above graph shows the distance to Mecca as interpolated from 1000 randomly generated lat/lon data using the "interp" function of the akima package. Several functions, found within this blog, are needed to reproduce the plot (pos2coord, earth.dist, new.lon.lat, color.palette, val2col, image.scale). One thing you will notice is the strip of uninterpolated area within the stereographic projection. This is a problem that I have yet to resolve and has to do with the fact that the interpolation is not considering the connection along the 180° longitude line. This will probably require some other type of interpolation based on geographic distances rather than Cartesian coordinates. R code to produce the above graph...Read more »
To leave a comment for the author, please follow the link and comment on his blog: me nugget.
R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series,ecdf, trading) and more...
february 2012 by rahuldave
People voice about Lynas Malaysia through Twitter Analysis with R CloudStat
february 2012 by rahuldave
(This article was first published on CloudStat, and kindly contributed to R-bloggers)
People voice about Lynas Malaysia through Twitter Analysis with R CloudStat: CloudStat Analysis: This is a twitter analysis report for “Lynas” from 21 till 28 February 2012, generated by CloudStat Twitter Application.
Lynas was a hot topic, especially Malaysian. This was due to Lynas Malaysia Sdn Bhd (Lynas) will start importing rare earth ore from its Mount Weld mine in Western Australia and process it at the Gebeng Industrial Estate in Pahang, Malaysia.
Rare earth minerals are often found in ores which contain small amounts of radioactive elements such as uranium and thorium. So extracting them from these ores raises a number of health and safety issues. Many members of the public – including residents, non-governmental committees and professional bodies – expressed concern that the Lynas project was not safe, and was a threat to public health and safety.
The determination of government’s to continue implementing this Lynas project caused the anger of Malaysians. And the largest anti-Lynas rally, “Himpunan Hijau 2.0” at Padang MPK 4, Kuantan aimed to pressure government into aborting Lynas project in 26 February 2012.
In this analysis, we will track the tweets from 21 till 28 February 2012. Below, you will get the insights of:
Lynas’s Total tweets, original tweets, retweets and contributors.
“Lynas” Tweet counts and trending.
Tweet behavior over time by users.
Who is tweeting about Lynas.
How a person retweets (RT) a message about Lynas.
Who is retweeting whom’s.
Top 50 of tweeters, retweet, tweets and created time.
Sentiment Analysis and Opinion Mining.
The overview of Lynas twitter analysis for past 7 days (21 - 28 Feb).
Total tweets: 8397
Original tweets: 6605
Retweets: 1792
Contributors: 4305
Average 2 tweets per contributor
More details tracked daily are the bottom of this analysis.
This chart shows how many tweets had been occurred over past 7 days, reaching the maximum in 25 February, a day before the Anti-Lynas rally.
This plot show how a person tweets and retweets about Lynas.
This is the Word Square of the tweets containing “lynas” as a keyword. The size is varied according to the frequency of the keyword.
This is the sentiment analysis of the tweets. Each tweets will be scanned and matched to our positive and negative word database. The score is calculated using the formula :
Score = # Positive words - # Negative words
The plot Scores over days is as below:
Next is the Top 50 for several categories.
View More…
To leave a comment for the author, please follow the link and comment on his blog: CloudStat.
R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series,ecdf, trading) and more...
R_bloggers
cloudstat
data_mining
Lynas
R
R_Language
sentiment_analysis
Social_Media
twitter
from google
People voice about Lynas Malaysia through Twitter Analysis with R CloudStat: CloudStat Analysis: This is a twitter analysis report for “Lynas” from 21 till 28 February 2012, generated by CloudStat Twitter Application.
Lynas was a hot topic, especially Malaysian. This was due to Lynas Malaysia Sdn Bhd (Lynas) will start importing rare earth ore from its Mount Weld mine in Western Australia and process it at the Gebeng Industrial Estate in Pahang, Malaysia.
Rare earth minerals are often found in ores which contain small amounts of radioactive elements such as uranium and thorium. So extracting them from these ores raises a number of health and safety issues. Many members of the public – including residents, non-governmental committees and professional bodies – expressed concern that the Lynas project was not safe, and was a threat to public health and safety.
The determination of government’s to continue implementing this Lynas project caused the anger of Malaysians. And the largest anti-Lynas rally, “Himpunan Hijau 2.0” at Padang MPK 4, Kuantan aimed to pressure government into aborting Lynas project in 26 February 2012.
In this analysis, we will track the tweets from 21 till 28 February 2012. Below, you will get the insights of:
Lynas’s Total tweets, original tweets, retweets and contributors.
“Lynas” Tweet counts and trending.
Tweet behavior over time by users.
Who is tweeting about Lynas.
How a person retweets (RT) a message about Lynas.
Who is retweeting whom’s.
Top 50 of tweeters, retweet, tweets and created time.
Sentiment Analysis and Opinion Mining.
The overview of Lynas twitter analysis for past 7 days (21 - 28 Feb).
Total tweets: 8397
Original tweets: 6605
Retweets: 1792
Contributors: 4305
Average 2 tweets per contributor
More details tracked daily are the bottom of this analysis.
This chart shows how many tweets had been occurred over past 7 days, reaching the maximum in 25 February, a day before the Anti-Lynas rally.
This plot show how a person tweets and retweets about Lynas.
This is the Word Square of the tweets containing “lynas” as a keyword. The size is varied according to the frequency of the keyword.
This is the sentiment analysis of the tweets. Each tweets will be scanned and matched to our positive and negative word database. The score is calculated using the formula :
Score = # Positive words - # Negative words
The plot Scores over days is as below:
Next is the Top 50 for several categories.
View More…
To leave a comment for the author, please follow the link and comment on his blog: CloudStat.
R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series,ecdf, trading) and more...
february 2012 by rahuldave
R integrated throughout the enterprise analytics stack
february 2012 by rahuldave
(This article was first published on Revolutions, and kindly contributed to R-bloggers)
The past couple of years have seen a dramatic growth in the use of the R language in the enterprise. R has always been pervasive in academia for research and teaching in statistics and data science, and as new graduates trained in R have migrated to the workplace the demand for R in corporations has become more and more intense.
Database vendor Oracle estimates that "R has attracted over two million users since its introduction". James Kobielus, noted Forrester analyst and predictive analytics expert, recently said in PCWorld that "R has become a real ubiquitous force in advanced analytics. It's everywhere. Enterprise adoption of it has been growing steadily. When we ask our customers what they're using for statistical modeling they'll say SAS or [IBM's] SPSS, but they increasingly say R in the same breath."
With rapid growth of the use of R in the enterprise comes a corresponding increase in demand for enterprise support for R from its users, and demand for integration of R into corporate systems from IT (both areas in which Revolution Analytics provides software and expertise). Most large organizations have a sophisticated infrastructure devoted to data analysis, with an "analytics stack" of software to provide data warehousing and query, predictive analytics, reporting, presentation, and Business Intelligence (BI). As a result software vendors at every layer of this stack have added functionality to integrate R, accommodating demand from both users and IT, and to serve the needs of data-driven business decision makers. Let's take a look at some of the applications within the analytics stack which now provide integration with R.
The Data Layer
The data layer is where the lifeblood of the analysis — the data — is stored and prepared. Especially for high-performance and big-data applications, analytics based in R can benefit from the infrastructure that the data layer provides. The IBM blog has a great post with an in-depth discussion about integrating R with the data layer.
IBM Netezza, the high-performance data-warehousing appliance, is integrated with R (in partnership with Revolution Analytics). R users can use Revolution R Enterprise to run massively-parallel computations in R within the IBM Netezza appliance and implement high-performance, big-data analytics (with high-frequency financial data, for example). [Update: a free webinar on February 29 will describe the integration between IBM Netezza and Revolution R Enterprise in detail.]
Oracle announced last year a forthcoming connection between R and Oracle, which was made available in February 2012. Oracle R Enterprise is aimed at statisticians who are "don't know SQL" and are "not familiar with DBA tasks". It is available as part of the Oracle Advanced Analytics option (priced at around $23,000 per core), and provides a "transparency layer" for with functions to connect to Oracle and use R functionality in the Oracle database. Oracle also maintains the open-source ROracle package which provides similar functionality for open-source R.
Cloudera's Distribution Including Apache Hadoop provides support for R in partnership with Revolution Analytics. This connection makes it possible to manipulate Hadoop data stores in R directly from HDFS and HBASE, and give R programmers the ability to write MapReduce jobs in R using Hadoop Streaming.
IBM BigInsights, the Hadoop platform from IBM, is also integrated with R and Revolution R Enterprise. BigInsight queries can make use of the Map-Reduce construct while running R computations in parallel.
Teradata's Enterprise Data Warehousing platform provides in-database analytics using R via the free teradataR package. This package allows R users to connect to Teradata, create data frames linked to Teradata and to call in-database analytic functions. The Teradata Aster MapReduce Platform also provides integration with R.
Sybase RAP, the edition of the Sybase database for financial data, provides integration with R. Providing the R language alongside Sybase RAP allows for faster algorithm development and extensive backward testing on historical data. Sybase also regularly highlights R integration in its financial newsletters and webinars.
The Analytics Layer
The Analytics layer is where the magic happens: statistical modeling, predictive analytics and custom data visualization. Fed with (usually) structured data sourced from the Data Layer, R is widely used here to categorize, predict, and generally provide insight into corporate data stores. In many organizations, older data analysis tools remain in use, and so interfaces to R have been added provide support for analysts and data scientists who prefer to use R and to fill in the gaps of these legacy tools with modern, high-performance analytics.
Revolution Analytics is the leading commercial organization focused on software and support for R. Its Revolution R Enterprise software extends open-source R with productivity interfaces, high-performance statistical computing, big-data analytics, and enterprise integration of R.
SAS has been a statistical analysis workhorse since the early 70's. Now, with so many graduates in statistics trained in R instead of SAS, SAS has introduced the ability to call R from SAS/IML. (It's also possible to call R directly from base SAS thanks to a free package developed at Roche Pharmaceuticals.) SAS JMP, the point-and-click data analysis package, now also provides support for R.
IBM SPSS Statistics, the popular desktop data analysis software known simply as SPSS before being acquired by IBM in 2010, provides integration to R via the Statistics Programmability Extension module.
RStudio, an open-source software company, provides an integrated development environment for developing code in the R language.
Matlab, a numerical computing language used by engineers, also offers the ability to call R from Matlab on Windows.
Zementis software allows models created with R to be scored on massive data sets using the ADAPA Decision Engine and Revolution R Enterprise.
The Presentation Layer
Data analysis makes the most impact in the enterprise when it can be readily acted upon by decision makers: often, business executives not steeped in the arcana of data warehousing or statistical analysis. As a result, many reporting and business intelligence tools now make it possible to make it possible to incorporate the resuts of analyses generated in R in the presentation layer, in a format tuned to the needs of a business audiecne.
Jaspersoft's business intelligence software makes it possible to incorporate the results of R-based analytics into BI dashboards and reports, via integration with Revolution R Enterprise.
TIBCO Spotfire's interactive business intelligence dashboards make it possible to share results and models from R.
You might not think of Microsoft Excel as more of a spreadsheet than a presentation tool, but it is very widely used on the desktop as a "container" for static and interactive reports based on statistical analysis. While Excel does not have out-of-the-box integration with R, is is possible to integrate R-based computation into Excel spreadsheets via RExcel and Revolution Analytics' RevoDeployR web services API.
R: Integrated throughout the analytics stack
As you can see, for organizations who need to create advanced analytics applications, R is integrated throughout the analytics stack: for data access, for presentation of results, and of course for the statistical analysis process itself. This degree of integration by so many companies is indicative of the level of demand for R throughout the enterprise. As the leading provider of commercial software and support for R, Revolution Analytics supports R users througout the organization, helps IT departments integrate Revolution R Enterprise throughout the analytics stack, for high-performance and big data applications based on the R language.
Revolution R Enterprise: production-grade analytics software built upon the powerful open source R statistics language.
To leave a comment for the author, please follow the link and comment on his blog: Revolutions.
R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series,ecdf, trading) and more...
R_bloggers
Big_Data
other_industry
R
REvolution
Rmedia
from google
The past couple of years have seen a dramatic growth in the use of the R language in the enterprise. R has always been pervasive in academia for research and teaching in statistics and data science, and as new graduates trained in R have migrated to the workplace the demand for R in corporations has become more and more intense.
Database vendor Oracle estimates that "R has attracted over two million users since its introduction". James Kobielus, noted Forrester analyst and predictive analytics expert, recently said in PCWorld that "R has become a real ubiquitous force in advanced analytics. It's everywhere. Enterprise adoption of it has been growing steadily. When we ask our customers what they're using for statistical modeling they'll say SAS or [IBM's] SPSS, but they increasingly say R in the same breath."
With rapid growth of the use of R in the enterprise comes a corresponding increase in demand for enterprise support for R from its users, and demand for integration of R into corporate systems from IT (both areas in which Revolution Analytics provides software and expertise). Most large organizations have a sophisticated infrastructure devoted to data analysis, with an "analytics stack" of software to provide data warehousing and query, predictive analytics, reporting, presentation, and Business Intelligence (BI). As a result software vendors at every layer of this stack have added functionality to integrate R, accommodating demand from both users and IT, and to serve the needs of data-driven business decision makers. Let's take a look at some of the applications within the analytics stack which now provide integration with R.
The Data Layer
The data layer is where the lifeblood of the analysis — the data — is stored and prepared. Especially for high-performance and big-data applications, analytics based in R can benefit from the infrastructure that the data layer provides. The IBM blog has a great post with an in-depth discussion about integrating R with the data layer.
IBM Netezza, the high-performance data-warehousing appliance, is integrated with R (in partnership with Revolution Analytics). R users can use Revolution R Enterprise to run massively-parallel computations in R within the IBM Netezza appliance and implement high-performance, big-data analytics (with high-frequency financial data, for example). [Update: a free webinar on February 29 will describe the integration between IBM Netezza and Revolution R Enterprise in detail.]
Oracle announced last year a forthcoming connection between R and Oracle, which was made available in February 2012. Oracle R Enterprise is aimed at statisticians who are "don't know SQL" and are "not familiar with DBA tasks". It is available as part of the Oracle Advanced Analytics option (priced at around $23,000 per core), and provides a "transparency layer" for with functions to connect to Oracle and use R functionality in the Oracle database. Oracle also maintains the open-source ROracle package which provides similar functionality for open-source R.
Cloudera's Distribution Including Apache Hadoop provides support for R in partnership with Revolution Analytics. This connection makes it possible to manipulate Hadoop data stores in R directly from HDFS and HBASE, and give R programmers the ability to write MapReduce jobs in R using Hadoop Streaming.
IBM BigInsights, the Hadoop platform from IBM, is also integrated with R and Revolution R Enterprise. BigInsight queries can make use of the Map-Reduce construct while running R computations in parallel.
Teradata's Enterprise Data Warehousing platform provides in-database analytics using R via the free teradataR package. This package allows R users to connect to Teradata, create data frames linked to Teradata and to call in-database analytic functions. The Teradata Aster MapReduce Platform also provides integration with R.
Sybase RAP, the edition of the Sybase database for financial data, provides integration with R. Providing the R language alongside Sybase RAP allows for faster algorithm development and extensive backward testing on historical data. Sybase also regularly highlights R integration in its financial newsletters and webinars.
The Analytics Layer
The Analytics layer is where the magic happens: statistical modeling, predictive analytics and custom data visualization. Fed with (usually) structured data sourced from the Data Layer, R is widely used here to categorize, predict, and generally provide insight into corporate data stores. In many organizations, older data analysis tools remain in use, and so interfaces to R have been added provide support for analysts and data scientists who prefer to use R and to fill in the gaps of these legacy tools with modern, high-performance analytics.
Revolution Analytics is the leading commercial organization focused on software and support for R. Its Revolution R Enterprise software extends open-source R with productivity interfaces, high-performance statistical computing, big-data analytics, and enterprise integration of R.
SAS has been a statistical analysis workhorse since the early 70's. Now, with so many graduates in statistics trained in R instead of SAS, SAS has introduced the ability to call R from SAS/IML. (It's also possible to call R directly from base SAS thanks to a free package developed at Roche Pharmaceuticals.) SAS JMP, the point-and-click data analysis package, now also provides support for R.
IBM SPSS Statistics, the popular desktop data analysis software known simply as SPSS before being acquired by IBM in 2010, provides integration to R via the Statistics Programmability Extension module.
RStudio, an open-source software company, provides an integrated development environment for developing code in the R language.
Matlab, a numerical computing language used by engineers, also offers the ability to call R from Matlab on Windows.
Zementis software allows models created with R to be scored on massive data sets using the ADAPA Decision Engine and Revolution R Enterprise.
The Presentation Layer
Data analysis makes the most impact in the enterprise when it can be readily acted upon by decision makers: often, business executives not steeped in the arcana of data warehousing or statistical analysis. As a result, many reporting and business intelligence tools now make it possible to make it possible to incorporate the resuts of analyses generated in R in the presentation layer, in a format tuned to the needs of a business audiecne.
Jaspersoft's business intelligence software makes it possible to incorporate the results of R-based analytics into BI dashboards and reports, via integration with Revolution R Enterprise.
TIBCO Spotfire's interactive business intelligence dashboards make it possible to share results and models from R.
You might not think of Microsoft Excel as more of a spreadsheet than a presentation tool, but it is very widely used on the desktop as a "container" for static and interactive reports based on statistical analysis. While Excel does not have out-of-the-box integration with R, is is possible to integrate R-based computation into Excel spreadsheets via RExcel and Revolution Analytics' RevoDeployR web services API.
R: Integrated throughout the analytics stack
As you can see, for organizations who need to create advanced analytics applications, R is integrated throughout the analytics stack: for data access, for presentation of results, and of course for the statistical analysis process itself. This degree of integration by so many companies is indicative of the level of demand for R throughout the enterprise. As the leading provider of commercial software and support for R, Revolution Analytics supports R users througout the organization, helps IT departments integrate Revolution R Enterprise throughout the analytics stack, for high-performance and big data applications based on the R language.
Revolution R Enterprise: production-grade analytics software built upon the powerful open source R statistics language.
To leave a comment for the author, please follow the link and comment on his blog: Revolutions.
R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series,ecdf, trading) and more...
february 2012 by rahuldave
Code for Machine Learning for Hackers
february 2012 by rahuldave
With the release of the eBook version of Machine Learning for Hackers this week, many people have been asking for the code. With good reason—as it turns out—because O’Reilly still (at the time of this writing) has not updated the book page to include a link to the code.
For those interested, my co-author John Myles White is hosting the code at his Github, which can be accessed at:
https://github.com/johnmyleswhite/ML_for_Hackers
Please feel free to clone, fork, and hack the repository as much as you like. As we mention in the README, some of the code will not appear exactly as it does in the text. This happens for two reasons; first, because some minor formatting changes had to be made to fit the code into the book; and second, some of the code has been updated or edited to remove typos and minor errors.
We hope you find the code a useful supplement to the text!
Learning
machinelearning
R
from google
For those interested, my co-author John Myles White is hosting the code at his Github, which can be accessed at:
https://github.com/johnmyleswhite/ML_for_Hackers
Please feel free to clone, fork, and hack the repository as much as you like. As we mention in the README, some of the code will not appear exactly as it does in the text. This happens for two reasons; first, because some minor formatting changes had to be made to fit the code into the book; and second, some of the code has been updated or edited to remove typos and minor errors.
We hope you find the code a useful supplement to the text!
february 2012 by rahuldave
An essential vocabulary for the R language
may 2011 by rahuldave
The Oxford English Dictionary includes more than 600,000 words, yet most of us get by in our day-to-day lives with a vocabulary of just a few thousand. In a similar vein, the R language includes thousands of functions: when you start up R 2.13, you have 2832 functions at your disposal:
> length(apropos(".", mode="function"))
[1] 2382
This includes only the base packages attached by default: add in the recommended packages and all the packages on CRAN and you're easily into five figures.
But like English, you can make do with a much smaller vocabulary of R functions and still express yourself quite effectively. Hadley Wickham has published his essential vocabulary of R functions, listing just those functions you need for basic object operations, statistical analysis, working with the R system, data and text I/O, and dealing with specialized data types like dates and strings. There are about 350 functions on the list. This is a great idea: like many open-source projects, R has amassed many new functions (and indeed packages) that supersede (but not deprecate) older functionality, and it helps new users especially just to focus on those functions that are best for the job. Developing a small vocabulary of essential R functions is an important step forward.
github (Hadley Wickham): Vocabulary
beginner_tips
R
from google
> length(apropos(".", mode="function"))
[1] 2382
This includes only the base packages attached by default: add in the recommended packages and all the packages on CRAN and you're easily into five figures.
But like English, you can make do with a much smaller vocabulary of R functions and still express yourself quite effectively. Hadley Wickham has published his essential vocabulary of R functions, listing just those functions you need for basic object operations, statistical analysis, working with the R system, data and text I/O, and dealing with specialized data types like dates and strings. There are about 350 functions on the list. This is a great idea: like many open-source projects, R has amassed many new functions (and indeed packages) that supersede (but not deprecate) older functionality, and it helps new users especially just to focus on those functions that are best for the job. Developing a small vocabulary of essential R functions is an important step forward.
github (Hadley Wickham): Vocabulary
may 2011 by rahuldave
Revolution R Enterprise now free to academics
may 2010 by rahuldave
Unlike Revolution R Community which is 100% free to everyone, our commercial-grade Revolution R Enterprise distribution bundles R with proprietary components from our development team, which are normally available only to paying subscribers. (Those subscriptions are the way we get income to keep the company going.) Those components include our full ParallelR libraries for parallel programming, enhanced for 64-bit Windows, and our R Productivity Environment with full code editing and visual debugging on Windows ... and there's more to come soon in our roadmap.
Today, we're making the full Revolution R Enterprise distribution available to anyone in the academic community, free of charge. You can read all the details in this page about the free academic download program. All you need to do is fill out a form attesting that you're a professor or a student at an accredited degree-granting institution, and you can download the Windows 32-bit, Windows 64-bit or RHEL 5 version, free of charge.
The only restrictions are that it's for use on a single-user workstation, and you won't have access to live technical support. (For universities/schools that full support on a server or cluster, subscriptions are still available.)
Since there's so much overlap between the R community and the academic community, this is just one more way for us to give back to the community. We want to make sure everyone in academia gets access to both R and all the extensions we've made to it, and will make in the future.
Revolution Analytics: Free Single-User Subscription to Revolution R Enterprise for the Academic Community
academia
announcements
R
REvolution
from google
Today, we're making the full Revolution R Enterprise distribution available to anyone in the academic community, free of charge. You can read all the details in this page about the free academic download program. All you need to do is fill out a form attesting that you're a professor or a student at an accredited degree-granting institution, and you can download the Windows 32-bit, Windows 64-bit or RHEL 5 version, free of charge.
The only restrictions are that it's for use on a single-user workstation, and you won't have access to live technical support. (For universities/schools that full support on a server or cluster, subscriptions are still available.)
Since there's so much overlap between the R community and the academic community, this is just one more way for us to give back to the community. We want to make sure everyone in academia gets access to both R and all the extensions we've made to it, and will make in the future.
Revolution Analytics: Free Single-User Subscription to Revolution R Enterprise for the Academic Community
may 2010 by rahuldave
Is R an ‘epic fail’?
april 2010 by rahuldave
Is R an ‘epic fail’?
Something as popular and widespread as R can hardly be called a ‘failure’ in any meaningful sense, so of course the question is really in which aspects R is inferior to alternatives.
For most users who need a bit of data analysis, it is probably a poor first choice. R is a programming language with a lot of statistical and data visualisation support, but it is a programming language. If you don’t want to do any programming, don’t muck about with R! There are lots of visualisation tools and statistical tools that are much easier to use.
Of course, without a bit of programming, you are limited to what those tools can do, so if you need analysis that is not provided, you need to either find a programmer or learn how to program, and for the latter, R isn’t a bad choice.
You can get pretty far with very little effort in R, once you have learned how to program. Now learning how to program does require quite a bit of effort, but if you need to there really isn’t any way around it. Just like there isn’t any Royal Road to mathematics (as Euclid is supposed to have said).
Sure, as a programming language R has its idiosyncrasies, but which programming languages do not?
Work
programming
R
statistics
from google
Something as popular and widespread as R can hardly be called a ‘failure’ in any meaningful sense, so of course the question is really in which aspects R is inferior to alternatives.
For most users who need a bit of data analysis, it is probably a poor first choice. R is a programming language with a lot of statistical and data visualisation support, but it is a programming language. If you don’t want to do any programming, don’t muck about with R! There are lots of visualisation tools and statistical tools that are much easier to use.
Of course, without a bit of programming, you are limited to what those tools can do, so if you need analysis that is not provided, you need to either find a programmer or learn how to program, and for the latter, R isn’t a bad choice.
You can get pretty far with very little effort in R, once you have learned how to program. Now learning how to program does require quite a bit of effort, but if you need to there really isn’t any way around it. Just like there isn’t any Royal Road to mathematics (as Euclid is supposed to have said).
Sure, as a programming language R has its idiosyncrasies, but which programming languages do not?
april 2010 by rahuldave
Because the professionals are using it!
april 2010 by rahuldave
Why use R? Because that is what the professionals use!
The moral of the story is not that one should blindly use professionals’ tools; that can be just as bad as ignoring them. You don’t need to use Google or Facebook’s infrastructure to run your personal blog, just as developers shouldn’t employ the GPL simply because it’s the same license Linux or MySQL uses. But if you want a no-bullshit take on which technologies actually deliver, you could do a lot worse than watching what the professionals use.
Work
R
from google
The moral of the story is not that one should blindly use professionals’ tools; that can be just as bad as ignoring them. You don’t need to use Google or Facebook’s infrastructure to run your personal blog, just as developers shouldn’t employ the GPL simply because it’s the same license Linux or MySQL uses. But if you want a no-bullshit take on which technologies actually deliver, you could do a lot worse than watching what the professionals use.
april 2010 by rahuldave
Data I/O performance tips
april 2010 by rahuldave
The R tag on StackOverflow recently topped 1000 questions, and continues to be a great community resource for practical tips on using the R language for data analysis and visualization. To take one example, "Efficiency of operations on R data structures" has been answered with some great tips on efficiently getting data in and out of the R system. Here's three quick tips excerpted from user "doug"'s response:
For reading in flat files, the performance of read.table can be improved 5x (or more) just by opting out of a few of read.table's default arguments
With only a little more hassle, you can make reading flat files even faster by using 'scan' instead of 'read.table', and
Paying attention to data types can often give you a performance boost and reduce your memory footprint.
See the post linked below for the complete details on how to implement these tips and power up the process of reading data into R.
StackOverflow.com: Efficiency of operations on R data structures
advanced_tips
R
from google
For reading in flat files, the performance of read.table can be improved 5x (or more) just by opting out of a few of read.table's default arguments
With only a little more hassle, you can make reading flat files even faster by using 'scan' instead of 'read.table', and
Paying attention to data types can often give you a performance boost and reduce your memory footprint.
See the post linked below for the complete details on how to implement these tips and power up the process of reading data into R.
StackOverflow.com: Efficiency of operations on R data structures
april 2010 by rahuldave
Video: Hadley Wickham gives a short course on graphics with R
march 2010 by rahuldave
Hadley Wickham (the creator of the popular ggplot2 graphics package for R) has posted video of a 2-hour short course on Visualisation in R at his blip.tv channel. The video is split into four thirty-minute segments:
Basic Graphics
Displaying Large Data
Data manipulation and transformations
Polishing your plots for publication
The course is peppered with self-guided exercises, for which data, code and the supporting slides are all available. So if you want to be able to create beautiful, informative visualizations of data sets large and small, like this:
then it's well worth two hours of your time to watch these videos. (And don't forget Hadley's book, ggplot2: Elegant Graphics for Data Analysis, for a complete reference.)
Hadley Wickham: data vis mini course
beginner_tips
graphics
R
from google
Basic Graphics
Displaying Large Data
Data manipulation and transformations
Polishing your plots for publication
The course is peppered with self-guided exercises, for which data, code and the supporting slides are all available. So if you want to be able to create beautiful, informative visualizations of data sets large and small, like this:
then it's well worth two hours of your time to watch these videos. (And don't forget Hadley's book, ggplot2: Elegant Graphics for Data Analysis, for a complete reference.)
Hadley Wickham: data vis mini course
march 2010 by rahuldave
related tags
ABC ⊕ academia ⊕ advanced_tips ⊕ announcements ⊕ arcsinh ⊕ Bayesian ⊕ beginner_tips ⊕ Big_Data ⊕ Bioinformatics ⊕ blog ⊕ business_analytics ⊕ business_intelligence ⊕ charts ⊕ cloud ⊕ cloudstat ⊕ code ⊕ D3 ⊕ D3.js ⊕ data_mining ⊕ deducer ⊕ DIY ⊕ education ⊕ factors ⊕ forecasting ⊕ fPortfolio ⊕ French ⊕ genenet ⊕ Google ⊕ google_docs ⊕ google_maps ⊕ Google_Spreadsheets ⊕ graphics ⊕ HANA ⊕ HTML5 ⊕ igraph ⊕ interpolation ⊕ JAGS ⊕ Julia ⊕ knitr ⊕ LaTex ⊕ lattice ⊕ La_Sapienza ⊕ Learning ⊕ Lynas ⊕ machinelearning ⊕ map ⊕ modelling ⊕ mvbutils ⊕ Open_Notebook_Thoughts ⊕ other_industry ⊕ packages ⊕ PhD_course ⊕ prediction ⊕ predictive ⊕ programming ⊕ Projects ⊕ R ⊕ r-project ⊕ RApache ⊕ rblogs ⊕ reporting ⊕ Research_tips ⊕ REvolution ⊕ Rmedia ⊕ Roma ⊕ rstats ⊕ R_bloggers ⊕ R_code ⊕ R_Language ⊕ R_package ⊕ SAP ⊕ sentiment_analysis ⊕ SFO ⊕ simone ⊕ Simulation ⊕ singed_pseudo_logarithm ⊕ SJC ⊕ Social_Media ⊕ social_network_analysis ⊕ spatial ⊕ stabilizing_transform ⊕ statistics ⊕ stats ⊕ stocks ⊕ Stratified_Sampling ⊕ systematic_investor ⊕ TechEd ⊕ twitter ⊕ Uncategorized ⊕ University_life ⊕ Work ⊕Copy this bookmark: