Bootstrap example
8 weeks ago by rahuldave
(This article was first published on Eran Raviv » R, and kindly contributed to R-bloggers)
Bootstrap your way into robust inference. Wow, that was fun to write..
Introduction
Say you made a simple regression, now you have your \( \widehat{\beta} \). You wish to know if it is significantly different from (say) zero. In general, people look at the statistic or p.value reported by their software of choice, (heRe). Thing is, this p.value calculation relies on the distribution of your dependent variable. Your software assumes normal distribution if not told differently, how so? for example, the (95%) confidence interval is \( \widehat{\beta} \pm 1.96 \times sd( \widehat{\beta}) \), the 1.96 comes from the normal distribution.
It is advisable not to do that, the beauty in bootstrapping* is that it is distribution untroubled, it’s valid for dependent which is Gaussian, Cauchy, or whatever. You can defend yourself against misspecification, and\or use the tool for inference when the underlying distribution is unknown.
I was not here 40 years ago but they say calculations were slow back then, not any more, take advantage of your Dell or your “meant for you” MacBook. We will see that it is better to add the non in the nonparametric, specifically in financial context, but that holds in general. You can still keep your distribution assumption but at least have a look at what happens when you relax it. The way to do that is by using bootstrap, the idea is intuitive and simple.
The idea
John Fox wrote: “The population is to the sample as the sample is to the bootstrap samples”.. Tadaaam.. but what does this mean? your estimate of \( \beta \) which came from the sample, is suppose to estimate the “real” \( \beta \), the population \( \beta \), which is unknown. Now take a sample from the sample, we call that sample a bootstrap sample, estimate your \( \beta \) according to this (bootstrap)sample, now this new estimate is an estimate for your original \( \widehat{\beta} \), the one coming from the original data. For clarity, say you have 3 observations, first is {x = 0.7,y = 0.6}, second is {whatever}, third is {whatever}, now, an example of sample from the sample would be the shuffled ordering: first is {whatever}, second is {x = 0.7,y = 0.6}, third is {whatever}. This shuffling, or permuted sample is what we call the bootstrap sample. Now we estimate the same statistic (in our case \( \beta \)) again.
Repeat this sample and estimate many times, you have many bootstrapped estimates, now you can check how do these guys behave. One thing you can do with it is to bootstrap confidence intervals (CI) for your estimate, without the need for underlying distributional assumption.
In R
In R, “boot” package will do the trick:
?View Code RSPLUSlibrary(boot) #load the package
# Now we need the function we would like to estimate
# In our case the beta:
betfun = function(data,b,formula){
# b is the random indexes for the bootstrap sample
d = data[b,]
return(lm(d[,1]~d[,2], data = d)$coef[2])
# thats for the beta coefficient
}
# now you can bootstrap:
bootbet = boot(data="your data here", statistic=betfun, R=5000)
# R is how many bootstrap samples
names(bootbet)
plot(bootbet)
hist(bootbet$t, breaks = 100)
You can zoom in on which indices were picked at each bootstrap take, what was the exact permutation, you do that using the function boot.array:
?View Code RSPLUSzoombot = boot.array(bootbet, indices = T)
dim(zoombot)
hist(zoombot[1,], breaks = 100)
# this is the frequency of each index, [1,] for the first
#bootstrap run
More transparent
You can better understand it if you can code it yourself, for a simple problem like bootstrapping confidence intervals it is even simpler and faster:
?View Code RSPLUSnb = 5000 ; bet = NULL ; n = NROW("your data here")
ptm <- proc.time() # have a look at the time it takes
for (i in 1:nb){
unifnum = sample(c(1:n),n,replace = T) # pick random indices
bet[i] = lm(data[unifnum,1]~data[unifnum,2])$coef[2]
}
proc.time() - ptm # about 80 seconds on my end
How bad are your current confidence interval?
Does it really matter? maybe it’s not worth the trouble. As an illustration I use stock returns which are known to have fat tails, meaning more observations far from the center. Take a look at the following figure:
That is the Morgan Stanley \(\widehat{\beta}\) with the market. The estimate is centered at 1.87. The green vertical lines are the (95%) confidence interval reported by the the “lm” function, the red vertical lines are the equivalent nonparametric confidence intervals, the light blue curve is the normal density.
Notice how different is the interval from the nonparametric bootstrap, which is more accurate in this case. For example, it may be that the parameter is actually 2, you can see the software output rejects this possibility since it assumes normality, yet the bootstrap confidence interval indeed covers the value 2. So, an investor who thinks that “in my portfolio all beta’s are smaller than 2 with a 95% CI” is mistaken to hold Morgan Stanley as such. This check takes about 80 seconds, so I would think twice before plugging it in a double “for” loop. However, if you are social scientist or similar, beef up your standard (normal) output with this robust analysis.
Wrap up
Here you can see that when you use the bootstrapped confidence interval, when the normal distribution assumption IS valid, you are not that worse off (though you will need to wait these 80 seconds nontheless ). I created a fake normal return distribution using same center and standard deviation as reported, and made the exact same analysis:
Again, I simulated from a normal distribution using same center and standard deviation as reported by the “lm”, you can see that the intervals are close to each other which concludes this post, using the parametric confidence interval, from the assumed normal distribution is in some sense suboptimal, since you don’t lose much even if it IS normal.
You can see the remaining code and some references below. Thanks for reading.
Comments
* Naturally we only cover the idea here, there is lots of bootstrapping literature, we only tasted nonparametric bootstrap.
Related literature
Bootstrap Methods and their Application (Cambridge Series in Statistical and Probabilistic Mathematics)
An Introduction to the Bootstrap (Chapman & Hall/CRC Monographs on Statistics & Applied Probability)
Resampling Methods: A Practical Guide to Data Analysis
?View Code RSPLUSlibrary(quantmod) ; library(xtable) ; library(tseries)
library(ggplot2) ; library(fields) ; library(grDevices)
tckr = c('MS', 'SPY')
end<- format(Sys.Date(),"%Y-%m-%d") # yyyy-mm-dd
start<-format(Sys.Date() - 365*10,"%Y-%m-%d") # 10 years of data
dat1 = (getSymbols(tckr[1], src="yahoo", from=start, to=end, auto.assign = FALSE))
dat2 = (getSymbols(tckr[2], src="yahoo", from=start, to=end, auto.assign = FALSE))
ret1 = as.numeric((dat1[,4] - dat1[,1])/dat1[,1] )
ret2 = as.numeric((dat2[,4] - dat2[,1])/dat2[,1])
plot(ret1)
length(as.numeric(ret1)) ; length(as.numeric(ret1))
plot(as.numeric(ret1)~as.numeric(ret2))
t1 = as.data.frame(cbind(as.numeric(ret1),as.numeric(ret2)))
names(t1) <- tckr
lm1 = lm(MS~SPY, data = t1) ; summary(lm1)
nb = 5000 ; bet = NULL ; n = NROW(t1)
ptm <- proc.time()
for (i in 1:nb){
unifnum = sample(c(1:n),n,replace = T)
bet[i] = lm(t1[unifnum,1]~t1[unifnum,2])$coef[2]
}
proc.time() - ptm
den = density(bet)
par(mfrow = c(1,1))
bethat <- bquote(Histogram~of ~bold(widehat(beta)))
h = hist(bet, breaks = 100, freq = NULL, probability = F, ylim = c(0,280), xlab = expression(bold(widehat(beta))),
, cex.lab = 1.6, main = bethat )
lwd1 = 2.5
xline(mean(bet), col = 4, lwd = lwd1) ; xline(lm1$coef[2], col = 6, lwd = lwd1)
xline(mean(bet)+1.96*summary(lm1)$coef[2,2], col = 3, lwd = lwd1)
xline(mean(bet)-1.96*summary(lm1)$coef[2,2], col = 3, lwd = lwd1)
xline(quantile(bet,.025), col = 2, lwd = lwd1) ; xline(quantile(bet,.975), col = 2, lwd = lwd1)
xfit<-seq(min(bet),max(bet),length=length(bet))
yfit<-dnorm(xfit,mean=mean(bet),sd=sd(bet))
yfit <- yfit*diff(h$mids[1:2])*length(bet)
lines(xfit, yfit, col="blue", lwd=2)
yfit2<-dnorm(xfit,mean=lm1$coef[2],sd=summary(lm1)$coef[2,2])
yfit2 <- yfit2*diff(h$mids[1:2])*length(bet)
lines(xfit, yfit2, col=5, lwd=2)
###################################################
### Now I generate actually from the normal distribution
###################################################
eps = rnorm(n, mean(ret1), sd(ret1))
fakey = summary(lm1)$coef[1,1]+summary(lm1)$coef[2,1]*ret1 +eps
# hist(fakey, offset=0.16, width=0.13, add=TRUE, col = "green", breaks = 200)
lm2 = lm(fakey~ret1)
fakebet = NULL
ptm <- proc.time()
for (i in 1:nb){
unifnum = sample(c(1:n),n,replace = T)
fakebet[i] = lm(fakey[unifnum]~ret1[unifnum])$coef[2]
}
proc.time() - ptm
fakebethat <- bquote(Histogram~of~fake~bold(widehat(beta)))
h2 = hist(fakebet, breaks = 100, freq = NULL, probability = F, ylim = c(0,470), xlab = expression(bold(widehat(beta))),
, cex.lab = 1.6, main = fakebethat )
xline(mean(fakebet), col = 4, lwd = lwd1)
xline(lm2$coef[2], col = 6, lwd = lwd1)
xline(mean(fakebet)+1.96*summary(lm2)$coef[2,2], col = 3, lwd = lwd1)
xline(mean(fakebet)-1.96*summary(lm2)$coef[2,2], col = 3, lwd = lwd1)
xline(quantile(fakebet,.025), col = 2, lwd = lwd1) ; xline(quantile(fakebet,.975), col = 2, lwd = lwd1)
xfit<-seq(min(fakebet),max(fakebet),length=length(fakebet))
yfit<-dnorm(xfit,mean=mean(fakebet),sd=sd(fakebet))
yfit <- yfit*diff(h2$mids[1:2])*length(fakebet)
lines(xfit, yfit, col="blue", lwd=2)
yfit2<-dnorm(xfit,mean=lm2$coef[2],sd=summary(lm2)$coef[2,2])
yfit2 <- yfit2*diff(h2$mids[1:2])*length(fakebet)
lines(xfit, yfit2, col=5, lwd=2)
To leave a comment for the author, please follow the link and comment on his blog: Eran Raviv » R.
R-bloggers.com offers daily e-mail updates about R news and tutorials on top[…]
R_bloggers
blog
R
from google
Bootstrap your way into robust inference. Wow, that was fun to write..
Introduction
Say you made a simple regression, now you have your \( \widehat{\beta} \). You wish to know if it is significantly different from (say) zero. In general, people look at the statistic or p.value reported by their software of choice, (heRe). Thing is, this p.value calculation relies on the distribution of your dependent variable. Your software assumes normal distribution if not told differently, how so? for example, the (95%) confidence interval is \( \widehat{\beta} \pm 1.96 \times sd( \widehat{\beta}) \), the 1.96 comes from the normal distribution.
It is advisable not to do that, the beauty in bootstrapping* is that it is distribution untroubled, it’s valid for dependent which is Gaussian, Cauchy, or whatever. You can defend yourself against misspecification, and\or use the tool for inference when the underlying distribution is unknown.
I was not here 40 years ago but they say calculations were slow back then, not any more, take advantage of your Dell or your “meant for you” MacBook. We will see that it is better to add the non in the nonparametric, specifically in financial context, but that holds in general. You can still keep your distribution assumption but at least have a look at what happens when you relax it. The way to do that is by using bootstrap, the idea is intuitive and simple.
The idea
John Fox wrote: “The population is to the sample as the sample is to the bootstrap samples”.. Tadaaam.. but what does this mean? your estimate of \( \beta \) which came from the sample, is suppose to estimate the “real” \( \beta \), the population \( \beta \), which is unknown. Now take a sample from the sample, we call that sample a bootstrap sample, estimate your \( \beta \) according to this (bootstrap)sample, now this new estimate is an estimate for your original \( \widehat{\beta} \), the one coming from the original data. For clarity, say you have 3 observations, first is {x = 0.7,y = 0.6}, second is {whatever}, third is {whatever}, now, an example of sample from the sample would be the shuffled ordering: first is {whatever}, second is {x = 0.7,y = 0.6}, third is {whatever}. This shuffling, or permuted sample is what we call the bootstrap sample. Now we estimate the same statistic (in our case \( \beta \)) again.
Repeat this sample and estimate many times, you have many bootstrapped estimates, now you can check how do these guys behave. One thing you can do with it is to bootstrap confidence intervals (CI) for your estimate, without the need for underlying distributional assumption.
In R
In R, “boot” package will do the trick:
?View Code RSPLUSlibrary(boot) #load the package
# Now we need the function we would like to estimate
# In our case the beta:
betfun = function(data,b,formula){
# b is the random indexes for the bootstrap sample
d = data[b,]
return(lm(d[,1]~d[,2], data = d)$coef[2])
# thats for the beta coefficient
}
# now you can bootstrap:
bootbet = boot(data="your data here", statistic=betfun, R=5000)
# R is how many bootstrap samples
names(bootbet)
plot(bootbet)
hist(bootbet$t, breaks = 100)
You can zoom in on which indices were picked at each bootstrap take, what was the exact permutation, you do that using the function boot.array:
?View Code RSPLUSzoombot = boot.array(bootbet, indices = T)
dim(zoombot)
hist(zoombot[1,], breaks = 100)
# this is the frequency of each index, [1,] for the first
#bootstrap run
More transparent
You can better understand it if you can code it yourself, for a simple problem like bootstrapping confidence intervals it is even simpler and faster:
?View Code RSPLUSnb = 5000 ; bet = NULL ; n = NROW("your data here")
ptm <- proc.time() # have a look at the time it takes
for (i in 1:nb){
unifnum = sample(c(1:n),n,replace = T) # pick random indices
bet[i] = lm(data[unifnum,1]~data[unifnum,2])$coef[2]
}
proc.time() - ptm # about 80 seconds on my end
How bad are your current confidence interval?
Does it really matter? maybe it’s not worth the trouble. As an illustration I use stock returns which are known to have fat tails, meaning more observations far from the center. Take a look at the following figure:
That is the Morgan Stanley \(\widehat{\beta}\) with the market. The estimate is centered at 1.87. The green vertical lines are the (95%) confidence interval reported by the the “lm” function, the red vertical lines are the equivalent nonparametric confidence intervals, the light blue curve is the normal density.
Notice how different is the interval from the nonparametric bootstrap, which is more accurate in this case. For example, it may be that the parameter is actually 2, you can see the software output rejects this possibility since it assumes normality, yet the bootstrap confidence interval indeed covers the value 2. So, an investor who thinks that “in my portfolio all beta’s are smaller than 2 with a 95% CI” is mistaken to hold Morgan Stanley as such. This check takes about 80 seconds, so I would think twice before plugging it in a double “for” loop. However, if you are social scientist or similar, beef up your standard (normal) output with this robust analysis.
Wrap up
Here you can see that when you use the bootstrapped confidence interval, when the normal distribution assumption IS valid, you are not that worse off (though you will need to wait these 80 seconds nontheless ). I created a fake normal return distribution using same center and standard deviation as reported, and made the exact same analysis:
Again, I simulated from a normal distribution using same center and standard deviation as reported by the “lm”, you can see that the intervals are close to each other which concludes this post, using the parametric confidence interval, from the assumed normal distribution is in some sense suboptimal, since you don’t lose much even if it IS normal.
You can see the remaining code and some references below. Thanks for reading.
Comments
* Naturally we only cover the idea here, there is lots of bootstrapping literature, we only tasted nonparametric bootstrap.
Related literature
Bootstrap Methods and their Application (Cambridge Series in Statistical and Probabilistic Mathematics)
An Introduction to the Bootstrap (Chapman & Hall/CRC Monographs on Statistics & Applied Probability)
Resampling Methods: A Practical Guide to Data Analysis
?View Code RSPLUSlibrary(quantmod) ; library(xtable) ; library(tseries)
library(ggplot2) ; library(fields) ; library(grDevices)
tckr = c('MS', 'SPY')
end<- format(Sys.Date(),"%Y-%m-%d") # yyyy-mm-dd
start<-format(Sys.Date() - 365*10,"%Y-%m-%d") # 10 years of data
dat1 = (getSymbols(tckr[1], src="yahoo", from=start, to=end, auto.assign = FALSE))
dat2 = (getSymbols(tckr[2], src="yahoo", from=start, to=end, auto.assign = FALSE))
ret1 = as.numeric((dat1[,4] - dat1[,1])/dat1[,1] )
ret2 = as.numeric((dat2[,4] - dat2[,1])/dat2[,1])
plot(ret1)
length(as.numeric(ret1)) ; length(as.numeric(ret1))
plot(as.numeric(ret1)~as.numeric(ret2))
t1 = as.data.frame(cbind(as.numeric(ret1),as.numeric(ret2)))
names(t1) <- tckr
lm1 = lm(MS~SPY, data = t1) ; summary(lm1)
nb = 5000 ; bet = NULL ; n = NROW(t1)
ptm <- proc.time()
for (i in 1:nb){
unifnum = sample(c(1:n),n,replace = T)
bet[i] = lm(t1[unifnum,1]~t1[unifnum,2])$coef[2]
}
proc.time() - ptm
den = density(bet)
par(mfrow = c(1,1))
bethat <- bquote(Histogram~of ~bold(widehat(beta)))
h = hist(bet, breaks = 100, freq = NULL, probability = F, ylim = c(0,280), xlab = expression(bold(widehat(beta))),
, cex.lab = 1.6, main = bethat )
lwd1 = 2.5
xline(mean(bet), col = 4, lwd = lwd1) ; xline(lm1$coef[2], col = 6, lwd = lwd1)
xline(mean(bet)+1.96*summary(lm1)$coef[2,2], col = 3, lwd = lwd1)
xline(mean(bet)-1.96*summary(lm1)$coef[2,2], col = 3, lwd = lwd1)
xline(quantile(bet,.025), col = 2, lwd = lwd1) ; xline(quantile(bet,.975), col = 2, lwd = lwd1)
xfit<-seq(min(bet),max(bet),length=length(bet))
yfit<-dnorm(xfit,mean=mean(bet),sd=sd(bet))
yfit <- yfit*diff(h$mids[1:2])*length(bet)
lines(xfit, yfit, col="blue", lwd=2)
yfit2<-dnorm(xfit,mean=lm1$coef[2],sd=summary(lm1)$coef[2,2])
yfit2 <- yfit2*diff(h$mids[1:2])*length(bet)
lines(xfit, yfit2, col=5, lwd=2)
###################################################
### Now I generate actually from the normal distribution
###################################################
eps = rnorm(n, mean(ret1), sd(ret1))
fakey = summary(lm1)$coef[1,1]+summary(lm1)$coef[2,1]*ret1 +eps
# hist(fakey, offset=0.16, width=0.13, add=TRUE, col = "green", breaks = 200)
lm2 = lm(fakey~ret1)
fakebet = NULL
ptm <- proc.time()
for (i in 1:nb){
unifnum = sample(c(1:n),n,replace = T)
fakebet[i] = lm(fakey[unifnum]~ret1[unifnum])$coef[2]
}
proc.time() - ptm
fakebethat <- bquote(Histogram~of~fake~bold(widehat(beta)))
h2 = hist(fakebet, breaks = 100, freq = NULL, probability = F, ylim = c(0,470), xlab = expression(bold(widehat(beta))),
, cex.lab = 1.6, main = fakebethat )
xline(mean(fakebet), col = 4, lwd = lwd1)
xline(lm2$coef[2], col = 6, lwd = lwd1)
xline(mean(fakebet)+1.96*summary(lm2)$coef[2,2], col = 3, lwd = lwd1)
xline(mean(fakebet)-1.96*summary(lm2)$coef[2,2], col = 3, lwd = lwd1)
xline(quantile(fakebet,.025), col = 2, lwd = lwd1) ; xline(quantile(fakebet,.975), col = 2, lwd = lwd1)
xfit<-seq(min(fakebet),max(fakebet),length=length(fakebet))
yfit<-dnorm(xfit,mean=mean(fakebet),sd=sd(fakebet))
yfit <- yfit*diff(h2$mids[1:2])*length(fakebet)
lines(xfit, yfit, col="blue", lwd=2)
yfit2<-dnorm(xfit,mean=lm2$coef[2],sd=summary(lm2)$coef[2,2])
yfit2 <- yfit2*diff(h2$mids[1:2])*length(fakebet)
lines(xfit, yfit2, col=5, lwd=2)
To leave a comment for the author, please follow the link and comment on his blog: Eran Raviv » R.
R-bloggers.com offers daily e-mail updates about R news and tutorials on top[…]
8 weeks ago by rahuldave
Spurious Regression illustrated
12 weeks ago by rahuldave
(This article was first published on Eran Raviv » R, and kindly contributed to R-bloggers)
Spurious Regression problem dates back to Yule (1926): “Why Do We Sometimes Get Nonsense Correlations between Time-series?”. Lets see what is the problem, and how can we fix it. I am using Morgan Stanley (MS) symbol for illustration, pre-crisis time span. Take a look at the following figure, generated from the regression of MS on the S&P, actual prices of the stock, actual prices of the S&P, when we use actual prices we term it regression in levels, as in price levels, as oppose to log transformed or returns.
Regression in levels, Morgan Stanley price level and fitted values from the regression MS~SPY.
The results from the regression are:
Estimate
Std. Error
t.value
P.value
(Intercept)
-46.4234
2.1827
-21.27
0.0000
beta.hat
0.8534
0.0178
47.90
0.0000
R^2 = 0.76
The graph looks fine, and the results make sense, but utterly wrong!
Thing is, the two series are upward drifting, so.. they drift together, it seems as if they are related. As a matter of fact, they are related, but what we just did is the wrong way to check it. Here is similar results from x and y random walks!!
?View Code RSPLUSy = cumsum(rnorm(250*10,0.05)) # random normal, with small (0.05) drift.
x = cumsum(rnorm(250*10,0.05))
lm2 = lm(y~x) ; summary(lm2)
plot(y, ty = "l", main = "Fitted (in blue) over Actual --
Random WALK this time", xlab = "x") ; lines(lm2$fit, col = 4)
Estimate
Std.Error
t.value
P.value
(Intercept)
7.0474
0.4651
15.15
0.0000
x
0.5862
0.0062
94.29
0.0000
R^2 = 0.78
Note the resemblance with the previous figure and table.
So.., analysis of two Random Walks which are clearly independent from each other by construction, and the analysis of two time series in levels can have same qualitative result, as if there is a significant positive correlation, that can’t be good right?
In real life, how would I know if what I see is an actual relation or the result of two unrelated series that, just so happen, are drifting in the same direction.
Here we step into the domain of the highly important yet amazingly boring of Unit Roots. This post is not about unit roots, and I want to keep it short not to lose the remaining 5% out of the 100% who started reading. Being abusive, it is suffice to say we need to remove the drift in the series, check here and reference therein for more information.
Once the drift is removed, we can verify that indeed there is a real relation, meaning Morgan Stanley stock movement is actually affected by the market movement. Removing the drift is easy, use returns or first differences. Feel important by telling your classmates that the series are not stationary, hence the transformation.
We can transform the data from levels to returns and re-execute the regression as follows:
?View Code RSPLUSlibrary(quantmod) ; library(xtable) ; library(tseries)
tckr = c('MS', 'SPY')
end <- "2007-01-01"
start<-format(Sys.Date() - 365*8,"%Y-%m-%d") # 8 years of data
dat1 = (getSymbols(tckr[1], src="yahoo", from=start, to=end, auto.assign = FALSE))
dat2 = (getSymbols(tckr[2], src="yahoo", from=start, to=end, auto.assign = FALSE))
ret1 = (dat1[,4] - dat1[,1])/dat1[,1] # Convert to returns
ret2 = (dat2[,4] - dat2[,1])/dat2[,1]
lmret = lm(ret1~ret2)
summary(lmret)
plot(as.numeric(ret1)~as.numeric(lmret$fit))
abline(lmret, col = 2, lwd = 2.5)
Regression using returns
Now we can see that even after analyzing using returns, not levels, we still get a good fit.
You can use the “adf.test” function in package “tseries” to check if your series drift (stationary*) or not.
?View Code RSPLUSadf.test(as.numeric(dat1[,1])) # --> P.value is 0.6481 --> has Unit Root
adf.test(as.numeric(ret1)) # --> P.value < 0.01 --> no Unit Root
As a final note, fact that we cannot make any inference using price levels does not render the regression completely useless. Both “MS” and “S&P” series are NOT stationary, but together they ARE co-integrated, which is the main justification behind pairs trading. Co-integrated means that y-series may drift, x-series may drift, but the residual from the regression will not!
Residuals from Regression on levels
See how the residuals from the regression fluctuate around zero.
Comments
1. * — stationary process does not only mean “no drift”, we have weak definition and strong definition, see here for more information.
2. according to the graph it seems that it was a good time to short MS and hedge with the market at the end end of the time span I used, which is start of 2007. I leave it to the reader to check what would have been the loss on such a trade.
Thanks for reading.
References
Regression Modeling Strategies: With Applications to Linear Models, Logistic Regression, and Survival Analysis (Springer Series in Statistics)
Financial Econometrics: From Basics to Advanced Modeling Techniques (Frank J. Fabozzi Series)
A Companion to Theoretical Econometrics (Blackwell Companions to Contemporary Economics)
Financial Econometrics: Problems, Models, and Methods.
Time Series Analysis
To leave a comment for the author, please follow the link and comment on his blog: Eran Raviv » R.
R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series,ecdf, trading) and more...
R_bloggers
blog
R
from google
Spurious Regression problem dates back to Yule (1926): “Why Do We Sometimes Get Nonsense Correlations between Time-series?”. Lets see what is the problem, and how can we fix it. I am using Morgan Stanley (MS) symbol for illustration, pre-crisis time span. Take a look at the following figure, generated from the regression of MS on the S&P, actual prices of the stock, actual prices of the S&P, when we use actual prices we term it regression in levels, as in price levels, as oppose to log transformed or returns.
Regression in levels, Morgan Stanley price level and fitted values from the regression MS~SPY.
The results from the regression are:
Estimate
Std. Error
t.value
P.value
(Intercept)
-46.4234
2.1827
-21.27
0.0000
beta.hat
0.8534
0.0178
47.90
0.0000
R^2 = 0.76
The graph looks fine, and the results make sense, but utterly wrong!
Thing is, the two series are upward drifting, so.. they drift together, it seems as if they are related. As a matter of fact, they are related, but what we just did is the wrong way to check it. Here is similar results from x and y random walks!!
?View Code RSPLUSy = cumsum(rnorm(250*10,0.05)) # random normal, with small (0.05) drift.
x = cumsum(rnorm(250*10,0.05))
lm2 = lm(y~x) ; summary(lm2)
plot(y, ty = "l", main = "Fitted (in blue) over Actual --
Random WALK this time", xlab = "x") ; lines(lm2$fit, col = 4)
Estimate
Std.Error
t.value
P.value
(Intercept)
7.0474
0.4651
15.15
0.0000
x
0.5862
0.0062
94.29
0.0000
R^2 = 0.78
Note the resemblance with the previous figure and table.
So.., analysis of two Random Walks which are clearly independent from each other by construction, and the analysis of two time series in levels can have same qualitative result, as if there is a significant positive correlation, that can’t be good right?
In real life, how would I know if what I see is an actual relation or the result of two unrelated series that, just so happen, are drifting in the same direction.
Here we step into the domain of the highly important yet amazingly boring of Unit Roots. This post is not about unit roots, and I want to keep it short not to lose the remaining 5% out of the 100% who started reading. Being abusive, it is suffice to say we need to remove the drift in the series, check here and reference therein for more information.
Once the drift is removed, we can verify that indeed there is a real relation, meaning Morgan Stanley stock movement is actually affected by the market movement. Removing the drift is easy, use returns or first differences. Feel important by telling your classmates that the series are not stationary, hence the transformation.
We can transform the data from levels to returns and re-execute the regression as follows:
?View Code RSPLUSlibrary(quantmod) ; library(xtable) ; library(tseries)
tckr = c('MS', 'SPY')
end <- "2007-01-01"
start<-format(Sys.Date() - 365*8,"%Y-%m-%d") # 8 years of data
dat1 = (getSymbols(tckr[1], src="yahoo", from=start, to=end, auto.assign = FALSE))
dat2 = (getSymbols(tckr[2], src="yahoo", from=start, to=end, auto.assign = FALSE))
ret1 = (dat1[,4] - dat1[,1])/dat1[,1] # Convert to returns
ret2 = (dat2[,4] - dat2[,1])/dat2[,1]
lmret = lm(ret1~ret2)
summary(lmret)
plot(as.numeric(ret1)~as.numeric(lmret$fit))
abline(lmret, col = 2, lwd = 2.5)
Regression using returns
Now we can see that even after analyzing using returns, not levels, we still get a good fit.
You can use the “adf.test” function in package “tseries” to check if your series drift (stationary*) or not.
?View Code RSPLUSadf.test(as.numeric(dat1[,1])) # --> P.value is 0.6481 --> has Unit Root
adf.test(as.numeric(ret1)) # --> P.value < 0.01 --> no Unit Root
As a final note, fact that we cannot make any inference using price levels does not render the regression completely useless. Both “MS” and “S&P” series are NOT stationary, but together they ARE co-integrated, which is the main justification behind pairs trading. Co-integrated means that y-series may drift, x-series may drift, but the residual from the regression will not!
Residuals from Regression on levels
See how the residuals from the regression fluctuate around zero.
Comments
1. * — stationary process does not only mean “no drift”, we have weak definition and strong definition, see here for more information.
2. according to the graph it seems that it was a good time to short MS and hedge with the market at the end end of the time span I used, which is start of 2007. I leave it to the reader to check what would have been the loss on such a trade.
Thanks for reading.
References
Regression Modeling Strategies: With Applications to Linear Models, Logistic Regression, and Survival Analysis (Springer Series in Statistics)
Financial Econometrics: From Basics to Advanced Modeling Techniques (Frank J. Fabozzi Series)
A Companion to Theoretical Econometrics (Blackwell Companions to Contemporary Economics)
Financial Econometrics: Problems, Models, and Methods.
Time Series Analysis
To leave a comment for the author, please follow the link and comment on his blog: Eran Raviv » R.
R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series,ecdf, trading) and more...
12 weeks ago by rahuldave
Scriptogr.am Turns Your Dropbox Into a Blog [Web Apps]
january 2012 by rahuldave
Want to start a free blog using nothing but text files and a service you already know and love? Scriptogr.am can turn a folder in your Dropbox account into a simple blog. You just sign up, write some posts in Markdown format, and you're done. More »
Web_apps
Blog
dropbox
Web_Site
Writing
from google
january 2012 by rahuldave
How will the elmcity service scale? Like the web!
december 2010 by rahuldave
During a recent talk at Harvard's Berkman Center, Scott MacLeod asked (via the IRC backchannel): "How does the elmcity service scale?" He wondered, in particular, whether the service could support an online university like the World University and School that might produce an unlimited number of class schedules.
My short answer was that the elmcity service scales like the web. But what does that really mean? I promised Scott that I'd spell it out here. We'll start with an analogy. As I mentioned in The power of informal contracts, the elmcity project envisions a web of calendar feeds that's analogous to the blogosphere's web of RSS and Atom feeds. We take for granted that the blogosphere scales like the web. A blog feed is just a special kind of web page. Anybody can create a blog and publish its feed at some URL. Why not calendars too? We haven't thought about them in the same way, but the ICS (iCalendar) files that our calendar programs export are the moral equivalents of the RSS and Atom feeds that our blog publishing tools export. Anybody can create a calendar and publish its feed at some URL.
These webs -- of HTML pages, of blog feeds, of calendar feeds -- are notionally webs of peers. We can all publish, and we can all read, without relying on a central authority or privileged hub. There are, to be sure, powerful centralized services. My blog, for example, is one of millions hosted at wordpress.com, aggregated by Bloglines and Google Reader, and indexed by Google and Bing. But these services, while convenient, are optional. So long as we can publish our blogs somewhere online, advertise their URLs, and get the DNS to resolve their domain names, we can have a working blogosphere. The necessary and sufficient condition is that we can all publish resources (e.g., pages and feeds), and that we can all access those resources.
For the calendarsphere that I envision, a service like elmcity is likewise optional. Let's suppose that the World University and School succeeds wildly. At any given moment there are tens of thousands of courses on offer, each with its own course page and also with its own calendar. Instructors publish course pages using any web publishing tool, and also publish calendars using any calendar publishing tool -- Google Calendar, or Outlook, or Apple iCal, or another calendar program. Students pick schedules of courses, bookmark the course pages, and load the course calendars into any of these same calendar programs. The calendar software merges the separate course calendars and combines them with the students' personal calendars. These calendar programs are thus aggregators of calendar feeds in the same way that feedreaders like NetNewsWire or Google Reader are aggregators of blog feeds.
Given a baseline web of peers, it's useful to be able to merge our individual views of them into pooled spaces. NetNewsWire is a personal feedreader, but Google Reader is social. In the pool created by Google Reader, data finds data and people find people. The elmcity service aims to create that same kind of effect in the realm of public calendar events. When we pool our separate calendars, we publicize the events that we are promoting, we discover events that others are promoting, and we see all our public events on common timelines.
What constrains our ability to scale out pools of calendars? Let's continue the analogy to the blogosphere. Google Reader constitutes one pooled space for blog feeds, Bloglines another. Because the data aggregated by these services conforms to open standards (i.e., RSS and Atom), other services can create blog pools too. Likewise in the calendarsphere, Google Calendar is one way to pool calendars, the elmcity service is another, Calagator is a third. Others can play too.
How can we scale these providers of calendar pools? Along one axis, each provider needs to be able to grow its computing power. Google Calendar scales on this axis by using Google's cloud platform. The elmcity service uses Azure, the Microsoft cloud platform. Note that elmcity, unlike Google Calendar, is an open source service. That means you could run your own instance of it, using your own Azure account, but you'd still be relying on the Azure compute fabric.
Calagator, based on Ruby on Rails, could be deployed either to a conventional hosting environment or to a cloud platform. It would thus scale, along the compute axis, as either environment allows. The elmcity service could be used in this way too. The service is written for Azure, but the core aggregation engine is independent of Azure and could be deployed to a conventional hosting environment.
For feed aggregators, another axis of scale is the number of feeds that can be processed. When that number grows, the time required to connect to many feeds and ingest their contents becomes a constraint. The elmcity service currently supports 50 calendar hubs. Thrice daily, each hub pulls data from Eventful, Upcoming, Eventbrite, Facebook, and a list of iCalendar feeds. So far a single Azure worker role can easily do all this work. I'll dial up the number of workers if needed, but first I want to squeeze as much parallelism as I can out of each worker. To that end, I recently upgraded to the 4.0 version of the .NET Framework in order to exploit its dramatically simplified parallel processing. In this week's companion article I show how the elmcity service uses that new capability to optimize the time required to gather feeds from many sources.
Pub/sub networks can also scale by coalescing feeds. Consider a calendar hub operated, for some city, by the online arm of that city's newspaper. One model is flat. The newspaper runs a hub whose registry lists all the calendar feeds in town. But another model is hierarchical. In that model, there's a hub for arts and culture, a hub for sports and recreation, a hub for city government, and so on. Each hub gathers events from many feeds, and publishes the merged result on its own website for its own constituency. If the newspaper wants to include all those feeds, it can list them individually in its own registry. But why aggregate arts, sports, or recreation feeds more than once? The newspaper's uber-hub can, instead, reuse the arts, sports, and recreation feeds curated by those respective hubs, adding their merged outputs to its own set of curated feeds. Such reuse can cut down the computational time and effort required to propagate feeds throughout the network.
None of these mechanisms will matter, though, until a vibrant ecosystem of calendar feeds requires them. That's the ultimate constraint. Scaling the calendarsphere isn't a problem yet, but it would be a good problem to have. First, though, we've got to light up a whole bunch of feeds.
Related:
The iCalendar chicken-and-egg conundrum
Developing intuitions about data
Personal data stores and pub/sub networks
The principle of indirection
See all Radar elmcity stories
See all Answers elmcity stories
Programming
blog
calendar
elmcity
feed
syndication
from google
My short answer was that the elmcity service scales like the web. But what does that really mean? I promised Scott that I'd spell it out here. We'll start with an analogy. As I mentioned in The power of informal contracts, the elmcity project envisions a web of calendar feeds that's analogous to the blogosphere's web of RSS and Atom feeds. We take for granted that the blogosphere scales like the web. A blog feed is just a special kind of web page. Anybody can create a blog and publish its feed at some URL. Why not calendars too? We haven't thought about them in the same way, but the ICS (iCalendar) files that our calendar programs export are the moral equivalents of the RSS and Atom feeds that our blog publishing tools export. Anybody can create a calendar and publish its feed at some URL.
These webs -- of HTML pages, of blog feeds, of calendar feeds -- are notionally webs of peers. We can all publish, and we can all read, without relying on a central authority or privileged hub. There are, to be sure, powerful centralized services. My blog, for example, is one of millions hosted at wordpress.com, aggregated by Bloglines and Google Reader, and indexed by Google and Bing. But these services, while convenient, are optional. So long as we can publish our blogs somewhere online, advertise their URLs, and get the DNS to resolve their domain names, we can have a working blogosphere. The necessary and sufficient condition is that we can all publish resources (e.g., pages and feeds), and that we can all access those resources.
For the calendarsphere that I envision, a service like elmcity is likewise optional. Let's suppose that the World University and School succeeds wildly. At any given moment there are tens of thousands of courses on offer, each with its own course page and also with its own calendar. Instructors publish course pages using any web publishing tool, and also publish calendars using any calendar publishing tool -- Google Calendar, or Outlook, or Apple iCal, or another calendar program. Students pick schedules of courses, bookmark the course pages, and load the course calendars into any of these same calendar programs. The calendar software merges the separate course calendars and combines them with the students' personal calendars. These calendar programs are thus aggregators of calendar feeds in the same way that feedreaders like NetNewsWire or Google Reader are aggregators of blog feeds.
Given a baseline web of peers, it's useful to be able to merge our individual views of them into pooled spaces. NetNewsWire is a personal feedreader, but Google Reader is social. In the pool created by Google Reader, data finds data and people find people. The elmcity service aims to create that same kind of effect in the realm of public calendar events. When we pool our separate calendars, we publicize the events that we are promoting, we discover events that others are promoting, and we see all our public events on common timelines.
What constrains our ability to scale out pools of calendars? Let's continue the analogy to the blogosphere. Google Reader constitutes one pooled space for blog feeds, Bloglines another. Because the data aggregated by these services conforms to open standards (i.e., RSS and Atom), other services can create blog pools too. Likewise in the calendarsphere, Google Calendar is one way to pool calendars, the elmcity service is another, Calagator is a third. Others can play too.
How can we scale these providers of calendar pools? Along one axis, each provider needs to be able to grow its computing power. Google Calendar scales on this axis by using Google's cloud platform. The elmcity service uses Azure, the Microsoft cloud platform. Note that elmcity, unlike Google Calendar, is an open source service. That means you could run your own instance of it, using your own Azure account, but you'd still be relying on the Azure compute fabric.
Calagator, based on Ruby on Rails, could be deployed either to a conventional hosting environment or to a cloud platform. It would thus scale, along the compute axis, as either environment allows. The elmcity service could be used in this way too. The service is written for Azure, but the core aggregation engine is independent of Azure and could be deployed to a conventional hosting environment.
For feed aggregators, another axis of scale is the number of feeds that can be processed. When that number grows, the time required to connect to many feeds and ingest their contents becomes a constraint. The elmcity service currently supports 50 calendar hubs. Thrice daily, each hub pulls data from Eventful, Upcoming, Eventbrite, Facebook, and a list of iCalendar feeds. So far a single Azure worker role can easily do all this work. I'll dial up the number of workers if needed, but first I want to squeeze as much parallelism as I can out of each worker. To that end, I recently upgraded to the 4.0 version of the .NET Framework in order to exploit its dramatically simplified parallel processing. In this week's companion article I show how the elmcity service uses that new capability to optimize the time required to gather feeds from many sources.
Pub/sub networks can also scale by coalescing feeds. Consider a calendar hub operated, for some city, by the online arm of that city's newspaper. One model is flat. The newspaper runs a hub whose registry lists all the calendar feeds in town. But another model is hierarchical. In that model, there's a hub for arts and culture, a hub for sports and recreation, a hub for city government, and so on. Each hub gathers events from many feeds, and publishes the merged result on its own website for its own constituency. If the newspaper wants to include all those feeds, it can list them individually in its own registry. But why aggregate arts, sports, or recreation feeds more than once? The newspaper's uber-hub can, instead, reuse the arts, sports, and recreation feeds curated by those respective hubs, adding their merged outputs to its own set of curated feeds. Such reuse can cut down the computational time and effort required to propagate feeds throughout the network.
None of these mechanisms will matter, though, until a vibrant ecosystem of calendar feeds requires them. That's the ultimate constraint. Scaling the calendarsphere isn't a problem yet, but it would be a good problem to have. First, though, we've got to light up a whole bunch of feeds.
Related:
The iCalendar chicken-and-egg conundrum
Developing intuitions about data
Personal data stores and pub/sub networks
The principle of indirection
See all Radar elmcity stories
See all Answers elmcity stories
december 2010 by rahuldave
Mac blog editor MarsEdit 3 finally gains rich text editor
may 2010 by rahuldave
Fans of Red Sweater Software's blog publishing tool MarsEdit got a surprise Tuesday morning with the release of MarsEdit 3. The most significant update to the software is the addition of a rich text editor, though those who fiddle with the HTML for their blog posts got an updated syntax highlighter. A new media manager rounds out this solid update, one that the company hopes will attract new users and get old ones writing again.
According to Red Sweater founder and developer Daniel Jalkut, some of the features in MarsEdit 3 have been in the works for roughly 2.5 years—basically since MarsEdit 2 was released. Many of the enhancements in the new version respond to long-standing requests from users, Jalkut told Ars, particularly rich text editing. "Most of the [blog] Web interfaces and desktop competitors have a rich mode but, until now, MarsEdit has focused exclusively on HTML/markdown source," Jalkut said.
Read the comments on this post
News
News
News
Apple
Software
blog
blogging
developers
macosx
marsedit
redsweatersoftware
from google
According to Red Sweater founder and developer Daniel Jalkut, some of the features in MarsEdit 3 have been in the works for roughly 2.5 years—basically since MarsEdit 2 was released. Many of the enhancements in the new version respond to long-standing requests from users, Jalkut told Ars, particularly rich text editing. "Most of the [blog] Web interfaces and desktop competitors have a rich mode but, until now, MarsEdit has focused exclusively on HTML/markdown source," Jalkut said.
Read the comments on this post
may 2010 by rahuldave
Interview about TechStars on Mixergy
march 2010 by rahuldave
Here’s an interview with me about TechStars that was just published on Mixergy. In it, Andrew asks some great questions about how we actually help companies, the differences between TechStars and other programs, a little bit about our results so far, and what’s special about Boulder, Seattle, and Boston.
Thanks Mixergy. Good timing, since applications to the Boulder program close tonight at midnight mountain time. Don’t worry though, because applications to our Seattle program open Tuesday.
blog
from google
Thanks Mixergy. Good timing, since applications to the Boulder program close tonight at midnight mountain time. Don’t worry though, because applications to our Seattle program open Tuesday.
march 2010 by rahuldave
Our Full Historical Results
march 2010 by rahuldave
You might have heard us say that TechStars is “open-source” in the past. Last year we shared our Series AA Financing docs and now we’re now publishing our actual historical company results in full detail. This includes information on funding, acquisitions, failures, number of employees, etc.
Applying to and participating in a Seed Accelerator program is a big decision for nascent companies. We wanted to arm them with as much information as we could about what the program is like, as well as what the actual historical results have been.
I’m incredibly grateful to all the past companies who help us collect and allowed us to publish this data in one place for the first time. We hope that you find it useful.
If anyone has suggestions on how to improve the format (this one was inspired by Awesome Zombie), we’re all ears!
UPDATE: ReadWriteWeb has some analysis and thoughts on this data.
Featured
blog
from google
Applying to and participating in a Seed Accelerator program is a big decision for nascent companies. We wanted to arm them with as much information as we could about what the program is like, as well as what the actual historical results have been.
I’m incredibly grateful to all the past companies who help us collect and allowed us to publish this data in one place for the first time. We hope that you find it useful.
If anyone has suggestions on how to improve the format (this one was inspired by Awesome Zombie), we’re all ears!
UPDATE: ReadWriteWeb has some analysis and thoughts on this data.
march 2010 by rahuldave
Book By Its Cover
march 2010 by rahuldave
Book By Its Cover is a glorious new blog devoted to the beauty of books.
Design
Publications
Publishing
links
cover
glorious
book
devoted
beauty
books
blog
from google
march 2010 by rahuldave
related tags
Apple ⊕ beauty ⊕ blog ⊖ blogging ⊕ book ⊕ books ⊕ calendar ⊕ cover ⊕ data ⊕ Design ⊕ developers ⊕ devoted ⊕ dropbox ⊕ elmcity ⊕ Featured ⊕ feed ⊕ glorious ⊕ links ⊕ macosx ⊕ marsedit ⊕ News ⊕ Programming ⊕ Publications ⊕ Publishing ⊕ R ⊕ redsweatersoftware ⊕ R_bloggers ⊕ Software ⊕ syndication ⊕ Web_apps ⊕ Web_Site ⊕ Writing ⊕Copy this bookmark: