rahuldave + statistics   27

cumplyr: Extending the plyr Package to Handle Cross-Dependencies
Introduction
For me, Hadley Wickham‘s reshape and plyr packages are invaluable because they encapsulate omnipresent design patterns in statistical computing: reshape handles switching between the different possible representations of the same underlying data, while plyr automates what Hadley calls the Split-Apply-Combine strategy, in which you split up your data into several subsets, perform some computation on each of these subsets and then combine the results into a new data set. Many of the computations implicit in traditional statistical theory are easily described in this fashion: for example, comparing the means of two groups is computationally equivalent to splitting a data set of individual observations up into subsets based on the group assignments, applying mean to those subsets and then pooling the results back together again.

The Split-Apply-Combine Strategy is Broader than plyr
The only weakness of plyr, which automates so many of the computations that instantiate the Split-Apply-Combine strategy, is that plyr implements one very specific version of the Split-Apply-Combine strategy: plyr always splits your data into disjoint subsets. By disjoint, I mean that any row of the original data set can occur in only one of the subsets created by the splitting function. For computations that involve cross-dependencies between observations, this makes plyr inapplicable: cumulative quantities like running means and broadly local quantities like kernelized means cannot be computed using plyr. To highlight that concern, let’s consider three very simple data analysis problems.

Computing Forward-Running Means
Suppose that you have the following data set:

Time
Value

1
1

2
3

3
5

To compute a forward-running mean, you need to split this data into three subsets:

Time
Value

1
1

Time
Value

1
1

2
3

Time
Value

1
1

2
3

3
5

In each of these clearly non-disjoint subsets, you would then compute the mean of Value and combine the results to give:

Time
Value

1
1

2
2

3
3

This sort of computation occurs often enough in a simpler form that R provides tools like cumsum and cumprod to deal with cumulative quantities. But the splitting problem in our example is not addressed by those tools, nor by plyr, because the cumulative quantities have to computed on subsets that are not disjoint.

Computing Backward-Running Means
Consider performing the same sort of calculation as described above, but moving in the opposite direction. In that case, the three non-disjoint subsets are:

Time
Value

3
5

Time
Value

2
3

3
5

Time
Value

1
1

2
3

3
5

And the final result is:

Time
Value

1
3

2
4

3
5

Computing Local Means (AKA Kernelized Means)
Imagine that, instead of looking forward or backward, we only want to know something about data that is close to the current observation being examined. For example, we might want to know the mean value of each row when pooled with its immediately proceeding and succeeding neighbors. This computation must create the following subsets of data:

Time
Value

1
1

2
3

Time
Value

1
1

2
3

3
5

Time
Value

2
3

3
5

Within these non-disjoint subsets, means are computed and the result is:

Time
Value

1
2

2
3

3
4

A Strategy for Handling Non-Disjoint Subsets
How can we build a general purpose tool to handle these sorts of computations? One way is to rethink how plyr works and then extend it with some trivial variations on its core principles. We can envision plyr as a system that uses a splitting operation that partitions our data into subsets in which each subset satisfies a group of equality constraints: you split the data into groups in which Variable 1 = Value 1 AND Variable 2 = Value 2, etc. Because you consider the conjunction of several equality constraints, the resulting subsets are disjoint.

Seen in this fashion, there is a simple relaxation of the equality constraints that allows us to solve the three problems described a moment ago: instead of looking at the conjunction of equality constraints, we use a conjunction of inequality constraints. For the time being, I’ll describe just three instantiations of this broader strategy.

Using Upper Bounds
Here, we divide data into groups in which Variable 1 <= Value 1 AND Variable 2 <= Value 2, etc. We will also allow equality constraints, so that the operations of plyr are a strict subset of the computations in this new model. For example, we might use the constraint Variable = Value 1 AND Variable 2 <= Value 2. If the upper bound is the Time variable, these contraints will allow us to compute the forward-moving mean we described earlier.

Using Lower Bounds
Instead of using upper bounds, we can use lower bounds to divide data into groups in which Variable >= Value 1 AND Variable 2 >= Value 2, etc. This allows us to implement the backward-moving mean described earlier.

Using Norm Balls
Finally, we can consider a combination of upper and lower bounds. For simplicity, we'll assume that these bounds have a fixed tightness around the "center" of each subset of our split data. To articulate this tightness formally, we look at a specific hypothetical equality constraint like Variable 1 = Value 1 and then loosen it so that norm(Variable 1 - Value 1) <= r. When r = 0, this system gives the original equality constraint. But when r > 0, we produce a "ball" of data around the constraint whose tightness is r. This lets us estimate the local means from our third example.

Implementation
To demo these ideas in a usable fashion, I've created a draft package for R called cumplyr. Here is an extended example of its usage in solving simple variants of the problems described in this post:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
library('cumplyr')
 
data <- data.frame(Time = 1:5, Value = seq(1, 9, by = 2))
 
iddply(data,
equality.variables = c('Time'),
lower.bound.variables = c(),
upper.bound.variables = c(),
norm.ball.variables = list(),
func = function (df) {with(df, mean(Value))})
 
iddply(data,
equality.variables = c(),
lower.bound.variables = c('Time'),
upper.bound.variables = c(),
norm.ball.variables = list(),
func = function (df) {with(df, mean(Value))})
 
iddply(data,
equality.variables = c(),
lower.bound.variables = c(),
upper.bound.variables = c('Time'),
norm.ball.variables = list(),
func = function (df) {with(df, mean(Value))})
 
iddply(data,
equality.variables = c(),
lower.bound.variables = c(),
upper.bound.variables = c(),
norm.ball.variables = list('Time' = 1),
func = function (df) {with(df, mean(Value))})
 
iddply(data,
equality.variables = c(),
lower.bound.variables = c(),
upper.bound.variables = c(),
norm.ball.variables = list('Time' = 2),
func = function (df) {with(df, mean(Value))})
 
iddply(data,
equality.variables = c(),
lower.bound.variables = c(),
upper.bound.variables = c(),
norm.ball.variables = list('Time' = 5),
func = function (df) {with(df, mean(Value))})

You can download this package from GitHub and play with it to see whether it helps you. Please submit feedback using GitHub if you have any comments, complaints or patches.

Comparing plyr with cumplyr
In the long run, I'm hoping to make the functions in cumplyr robust enough to submit a patch to plyr. I see these tools as one logical extension of plyr to encompass more of the framework described in Hadley's paper on the Split-Apply-Combine strategy.

For the time being, I would advise any users of cumplyr to make sure that you do not use cumplyr for anything that plyr could already do. cumplyr is very much demo software and I am certain that both its API and implementation will change. In contrast, plyr is fast and stable software that can be trusted to perform its job.

But, if you have a problem that cumplyr will solve and plyr will not, I hope you'll try cumplyr out and submit patches when it breaks.

Happy hacking!
Programming  Statistics  from google
29 days ago by rahuldave
Measuring time series characteristics
(This article was first published on Research tips » R, and kindly contributed to R-bloggers)

A few years ago, I was working on a project where we measured various characteristics of a time series and used the information to determine what forecasting method to apply or how to cluster the time series into meaningful groups. The two main papers to come out of that project were:

Wang, Smith and Hyndman (2006) Characteristic-​​based clustering for time series data. Data Mining and Knowledge Discovery, 13(3), 335–364.
Wang, Smith-Miles and Hyndman (2009) “Rule induction for forecasting method selection: meta-​​learning the characteristics of univariate time series”, Neurocomputing, 72, 2581–2594.

I’ve since had a lot of requests for the code which one of my coauthors has been helpfully emailing to anyone who asked. But to make it easier, we thought it might be helpful if I post some updated code here. This is not the same as the R code we used in the paper, as I’ve improved it in several ways (so it will give different results). If you just want the code, skip to the bottom of the post.

Finding the period of the data
Usually in time series work, we know the period of the data (if the observations are monthly, the period is 12, for example). But in this project, some of our data was of unknown period and we wanted a method to automatically determine the appropriate period. The method we used was based on local peaks and troughs in the ACF. But I’ve since devised a better approach (prompted on crossvalidated.com) using an estimate of the spectral density:

find.freq <- function(x)
{
n <- length(x)
spec <- spec.ar(c(na.contiguous(x)),plot=FALSE)
if(max(spec$spec)>10) # Arbitrary threshold chosen by trial and error.
{
period <- round(1/spec$freq[which.max(spec$spec)])
if(period==Inf) # Find next local maximum
{
j <- which(diff(spec$spec)>0)
if(length(j)>0)
{
nextmax <- j[1] + which.max(spec$spec[j[1]:500])
if(nextmax <= length(spec$freq))
period <- round(1/spec$freq[nextmax])
else
period <- 1
}
else
period <- 1
}
}
else
period <- 1
 
return(period)
}

 
The function is called find.freq because time series people often call the period of seasonality the “frequency” (which is of course highly confusing).

Decomposing the data into trend and seasonal components
We needed a measure of the strength of trend and the strength of seasonality, and to do this we decomposed the data into trend, seasonal and error terms.

Because not all data could be decomposed additively, we first needed to apply an automated Box-Cox transformation. We tried a range of Box-Cox parameters on a grid, and selected the one which gave the most normal errors. That worked ok, but I’ve since found some papers that provide quite good automated Box-Cox algorithms that I’ve implemented in the forecast package. So this code uses Guerrero’s (1993) method instead.

For seasonal time series, we decomposed the transformed data using an stl decomposition with periodic seasonality.

For non-seasonal time series, we estimated the trend of the transformed data using penalized regression splines via the mgcv package.

decomp <- function(x,transform=TRUE)
{
require(forecast)
# Transform series
if(transform & min(x,na.rm=TRUE) >= 0)
{
lambda <- BoxCox.lambda(na.contiguous(x))
x <- BoxCox(x,lambda)
}
else
{
lambda <- NULL
transform <- FALSE
}
# Seasonal data
if(frequency(x)>1)
{
x.stl <- stl(x,s.window="periodic",na.action=na.contiguous)
trend <- x.stl$time.series[,2]
season <- x.stl$time.series[,1]
remainder <- x - trend - season
}
else #Nonseasonal data
{
require(mgcv)
tt <- 1:length(x)
trend <- rep(NA,length(x))
trend[!is.na(x)] <- fitted(gam(x ~ s(tt)))
season <- NULL
remainder <- x - trend
}
return(list(x=x,trend=trend,season=season,remainder=remainder,
transform=transform,lambda=lambda))
}

 

Putting everything on a [0,1] scale
We wanted to measure a range of characteristics such as strength of seasonality, strength of trend, level of nonlinearity, skewness, kurtosis, serial correlatedness, self-similarity, level of chaoticity (is that a word?) and the periodicity of the data. But we wanted all these on the same scale which meant mapping the natural range of each measure onto [0,1]. The following two functions were used to do this.

# f1 maps [0,infinity) to [0,1]
f1 <- function(x,a,b)
{
eax <- exp(a*x)
if (eax == Inf)
f1eax <- 1
else
f1eax <- (eax-1)/(eax+b)
return(f1eax)
}
 
# f2 maps [0,1] onto [0,1]
f2 <- function(x,a,b)
{
eax <- exp(a*x)
ea <- exp(a)
return((eax-1)/(eax+b)*(ea+b)/(ea-1))
}

 
The values of and in each function were chosen so the measure had a 90th percentile of 0.10 when the data were iid standard normal, and a value of 0.9 using a well-known benchmark time series.

Calculating the measures
Now we are ready to calculate the measures on the original data, as well as on the adjusted data (after removing trend and seasonality).

measures <- function(x)
{
require(forecast)
 
N <- length(x)
freq <- find.freq(x)
fx <- c(frequency=(exp((freq-1)/50)-1)/(1+exp((freq-1)/50)))
x <- ts(x,f=freq)
 
# Decomposition
decomp.x <- decomp(x)
 
# Adjust data
if(freq > 1)
fits <- decomp.x$trend + decomp.x$season
else # Nonseasonal data
fits <- decomp.x$trend
adj.x <- decomp.x$x - fits + mean(decomp.x$trend, na.rm=TRUE)
 
# Backtransformation of adjusted data
if(decomp.x$transform)
tadj.x <- InvBoxCox(adj.x,decomp.x$lambda)
else
tadj.x <- adj.x
 
# Trend and seasonal measures
v.adj <- var(adj.x, na.rm=TRUE)
if(freq > 1)
{
detrend <- decomp.x$x - decomp.x$trend
deseason <- decomp.x$x - decomp.x$season
trend <- ifelse(var(deseason,na.rm=TRUE) < 1e-10, 0,
max(0,min(1,1-v.adj/var(deseason,na.rm=TRUE))))
season <- ifelse(var(detrend,na.rm=TRUE) < 1e-10, 0,
max(0,min(1,1-v.adj/var(detrend,na.rm=TRUE))))
}
else #Nonseasonal data
{
trend <- ifelse(var(decomp.x$x,na.rm=TRUE) < 1e-10, 0,
max(0,min(1,1-v.adj/var(decomp.x$x,na.rm=TRUE))))
season <- 0
}
 
m <- c(fx,trend,season)
 
# Measures on original data
xbar <- mean(x,na.rm=TRUE)
s <- sd(x,na.rm=TRUE)
 
# Serial correlation
Q <- Box.test(x,lag=10)$statistic/(N*10)
fQ <- f2(Q,7.53,0.103)
 
# Nonlinearity
p <- terasvirta.test(na.contiguous(x))$statistic
fp <- f1(p,0.069,2.304)
 
# Skewness
s <- abs(mean((x-xbar)^3,na.rm=TRUE)/s^3)
fs <- f1(s,1.510,5.993)
 
# Kurtosis
k <- mean((x-xbar)^4,na.rm=TRUE)/s^4
fk <- f1(k,2.273,11567)
 
# Hurst=d+0.5 where d is fractional difference.
H <- fracdiff(na.contiguous(x),0,0)$d + 0.5
 
# Lyapunov Exponent
if(freq > N-10)
stop("Insufficient data")
Ly <- numeric(N-freq)
for(i in 1:(N-freq))
{
idx <- order(abs(x[i] - x))
idx <- idx[idx < (N-freq)]
j <- idx[2]
Ly[i] <- log(abs((x[i+freq] - x[j+freq])/(x[i]-x[j])))/freq
if(is.na(Ly[i]) | Ly[i]==Inf | Ly[i]==-Inf)
Ly[i] <- NA
}
Lyap <- mean(Ly,na.rm=TRUE)
fLyap <- exp(Lyap)/(1+exp(Lyap))
 
m <- c(m,fQ,fp,fs,fk,H,fLyap)
 
# Measures on adjusted data
xbar <- mean(tadj.x, na.rm=TRUE)
s <- sd(tadj.x, na.rm=TRUE)
 
# Serial
Q <- Box.test(adj.x,lag=10)$statistic/(N*10)
fQ <- f2(Q,7.53,0.103)
 
# Nonlinearity
p <- terasvirta.test(na.contiguous(adj.x))$statistic
fp <- f1(p,0.069,2.304)
 
# Skewness
s <- abs(mean((tadj.x-xbar)^3,na.rm=TRUE)/s^3)
fs <- f1(s,1.510,5.993)
 
# Kurtosis
k <- mean((tadj.x-xbar)^4,na.rm=TRUE)/s^4
fk <- f1(k,2.273,11567)
 
m <- c(m,fQ,fp,fs,fk)
names(m) <- c("frequency", "trend","seasonal",
"autocorrelation","non-linear","skewness","kurtosis",
"Hurst","Lyapunov",
"dc autocorrelation","dc non-linear","dc skewness","dc kurtosis")
 
return(m)
}

 

Here is a quick example applied to Australian monthly gas production:

library(forecast)
measures(gas)
frequency trend seasonal autocorrelation
0.1096 0.9989 0.9337 0.9985
non-linear skewness kurtosis Hurst
0.4947 0.1282 1.0000 0.9996
Lyapunov dc autocorrelation dc non-linear dc skewness
0.5662 0.1140 0.0538 0.1743
dc kurtosis
1.0000

 
The function is far from perfect, and it is not hard to find examples where it fails. For example, it doesn’t work with multiple seasonality — try measure(taylor) and check the seasonality. Also, I’m not convinced the kurtosis provides anything useful here, or that the skewness measure is done in the best way possible. But it was really a proof of concept, so we will leave it to others to revise and improve the code.

In our papers, we took the measures obtained using R, and produced self-organizing maps using Viscovery. There is now a som package in R for that, so it might be possible to integrate that step into R as well. The dendogram was generated in matlab, although that could now also be done in R using the ggdendro package for example.

Download the code in a single file.

To leave a comment for the author, please follow the link and comment on his blog: Research tips » R.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series,ecdf, trading) and more...
R_bloggers  forecasting  R  Research_tips  statistics  from google
4 weeks ago by rahuldave
Big data is easy
Big data is easy; big models are hard.

If you just wanted to use simple models with tons of data, that would be easy. You could resample the data, throwing some of it away until you had a quantity of data you could comfortably manage.

But when you have tons of data, you want to take advantage of it and ask questions that simple models cannot answer. (”Big” data is often indirect data.) So the problem isn’t that you have a lot of data, it’s that you’re using models that require a lot of data. And that can be very hard.

I am not saying people should just use simple models. No, people are right to want to take advantage of their data, and often that does require complex models. (See Brad Efron’s explanation why.) But the primary challenge is intellectual, not physical.

Related post: Big data and humility
Statistics  Probability_and_Statistics  from google
4 weeks ago by rahuldave
Comparing Julia and R’s Vocabularies
(This article was first published on John Myles White » Statistics, and kindly contributed to R-bloggers)

While exploring the Julia manual recently, I realized that it might be helpful to put the basic vocabularies of Julia and R side-by-side for easy comparison. So I took Hadley Wickham’s R Vocabulary section from the book he’s putting together on the devtools wiki, put all of the functions Hadley listed into a CSV file, and proceeded to fill in entries where I knew of an obvious Julia equivalent to an R function.

The results are on GitHub and, as they stand today, are shown below:

R
Julia
Category
Subcategory

https://

github.com/

hadley/devtools/

wiki/vocabulary
http://

julialang.org/

manual/

standard-

library-reference/
Resources
Vocabulary

?
help
Basics
First Functions

str

Basics
First Functions

%in%

Basics
Operators

match

Basics
Operators

=
=
Basics
Operators

<-
=
Basics
Operators

<<-

Basics
Operators

assign

Basics
Operators

$
[]
Basics
Operators

[]
[]
Basics
Operators

[[]]
[]
Basics
Operators

replace

Basics
Operators

head

Basics
Operators

tail

Basics
Operators

subset

Basics
Operators

with

Basics
Operators

within

Basics
Operators

all.equal

Basics
Comparison

identical

Basics
Comparison

!=
!=
Basics
Comparison

==
==
Basics
Comparison

>
>
Basics
Comparison

>=
>=
Basics
Comparison

<
<
Basics
Comparison

<=
<=
Basics
Comparison

is.na

Basics
Comparison

is.nan

Basics
Comparison

is.finite

Basics
Comparison

complete.cases

Basics
Comparison

*
*
Basics
Basic Math

+
+
Basics
Basic Math

-
-
Basics
Basic Math

/
/
Basics
Basic Math

^
^
Basics
Basic Math

%%
mod (%%)
Basics
Basic Math

%/%
div
Basics
Basic Math

abs
abs
Basics
Basic Math

sign
sign
Basics
Basic Math

acos
acos
Basics
Basic Math

acosh
acosh
Basics
Basic Math

asin
asin
Basics
Basic Math

asinh
asinh
Basics
Basic Math

atan
atan
Basics
Basic Math

atan2
atan2
Basics
Basic Math

atanh
atanh
Basics
Basic Math

sin
sin
Basics
Basic Math

sinh
sinh
Basics
Basic Math

cos
cos
Basics
Basic Math

cosh
cosh
Basics
Basic Math

tan
tan
Basics
Basic Math

tanh
tanh
Basics
Basic Math

ceiling
ceil
Basics
Basic Math

floor
floor
Basics
Basic Math

round
round
Basics
Basic Math

trunc
trunc
Basics
Basic Math

signif

Basics
Basic Math

exp
exp
Basics
Basic Math

log
log
Basics
Basic Math

log10
log10
Basics
Basic Math

log1p
log1p
Basics
Basic Math

log2
log2
Basics
Basic Math

logb

Basics
Basic Math

sqrt
sqrt
Basics
Basic Math

cummax

Basics
Basic Math

cummin

Basics
Basic Math

cumprod
cumprod
Basics
Basic Math

cumsum
cumsum
Basics
Basic Math

diff
diff
Basics
Basic Math

max
max
Basics
Basic Math

min
min
Basics
Basic Math

prod
prod
Basics
Basic Math

sum
sum
Basics
Basic Math

range

Basics
Basic Math

mean
mean
Basics
Basic Math

median
median
Basics
Basic Math

cor
cor_pearson
Basics
Basic Math

cov
cov_pearson
Basics
Basic Math

sd
std
Basics
Basic Math

var
var
Basics
Basic Math

pmax

Basics
Basic Math

pmin

Basics
Basic Math

rle

Basics
Basic Math

function
function
Basics
Functions

missing

Basics
Functions

on.exit

Basics
Functions

return
return
Basics
Functions

invisible

Basics
Functions

&
&
Basics
Logical & Set Operations

|
|
Basics
Logical & Set Operations

!
!
Basics
Logical & Set Operations

xor

Basics
Logical & Set Operations

all
all
Basics
Logical & Set Operations

any
any
Basics
Logical & Set Operations

intersect
intersect
Basics
Logical & Set Operations

union
union
Basics
Logical & Set Operations

setdiff

Basics
Logical & Set Operations

setequal

Basics
Logical & Set Operations

which
find
Basics
Logical & Set Operations

c
[] ({})
Basics
Vectors and Matrices

matrix
[] ({})
Basics
Vectors and Matrices

length
size (length)
Basics
Vectors and Matrices

dim
size
Basics
Vectors and Matrices

ncol
size(x, 1)
Basics
Vectors and Matrices

nrow
size(x, 2)
Basics
Vectors and Matrices

cbind
hcat
Basics
Vectors and Matrices

rbind
vcat
Basics
Vectors and Matrices

names

Basics
Vectors and Matrices

colnames

Basics
Vectors and Matrices

rownames

Basics
Vectors and Matrices

t

Basics
Vectors and Matrices

diag
eye
Basics
Vectors and Matrices

sweep

Basics
Vectors and Matrices

as.matrix

Basics
Vectors and Matrices

data.matrix

Basics
Vectors and Matrices

c
[] ({})
Basics
Making Vectors

rep

Basics
Making Vectors

seq
[from:by:to]
Basics
Making Vectors

seq_along

Basics
Making Vectors

seq_len
[1:len]
Basics
Making Vectors

rev
reverse
Basics
Making Vectors

sample

Basics
Making Vectors

choose
factorial
Basics
Making Vectors

factorial
factorial
Basics
Making Vectors

combn

Basics
Making Vectors

(is/as).(character/numeric/logical)

Basics
Making Vectors

list
HashTable ([])
Basics
Lists & Data Frames

unlist

Basics
Lists & Data Frames

data.frame

Basics
Lists & Data Frames

as.data.frame

Basics
Lists & Data Frames

split

Basics
Lists & Data Frames

expand.grid

Basics
Lists & Data Frames

if
if
Basics
Control Flow

&&
&&
Basics
Control Flow

||
||
Basics
Control Flow

for
for
Basics
Control Flow

while
while
Basics
Control Flow

next
continue
Basics
Control Flow

break
break
Basics
Control Flow

switch

Basics
Control Flow

ifelse

Basics
Control Flow

fitted

Statistics
Linear Models

predict

Statistics
Linear Models

resid

Statistics
Linear Models

rstandard

Statistics
Linear Models

lm

Statistics
Linear Models

glm

Statistics
Linear Models

hat

Statistics
Linear Models

influence.measures

Statistics
Linear Models

logLik

Statistics
Linear Models

df

Statistics
Linear Models

deviance

Statistics
Linear Models

formula

Statistics
Linear Models

~

Statistics
Linear Models

I

Statistics
Linear Models

anova

Statistics
Linear Models

coef

Statistics
Linear Models

confint

Statistics
Linear Models

vcov

Statistics
Linear Models

contrasts

Statistics
Linear Models

apropos(‘\\.test$’)

Statistics
Miscellaneous Statistical Tests

beta
beta
Statistics
Random Numbers

binom
binom
Statistics
Random Numbers

cauchy
cauchy
Statistics
Random Numbers

chisq
chisq
Statistics
Random Numbers

exp
exp
Statistics
Random Numbers

f
f
Statistics
Random Numbers

gamma
gamma
Statistics
Random Numbers

geom
geom
Statistics
Random Numbers

hyper
hyper
Statistics
Random Numbers

lnorm
lnorm
Statistics
Random Numbers

logis
logis
Statistics
Random Numbers

multinom
multinom
Statistics
Random Numbers

nbinom
nbinom
Statistics
Random Numbers

norm
norm
Statistics
Random Numbers

pois
pois
Statistics
Random Numbers

signrank
signrank
Statistics
Random Numbers

t
t
Statistics
Random Numbers

unif
unif (rand)
Statistics
Random Numbers

weibull
weibull
Statistics
Random Numbers

wilcox
wilcox
Statistics
Random Numbers

birthday
birthday
Statistics
Random Numbers

tukey
tukey
Statistics
Random Numbers

crossprod
*
Statistics
Matrix Algebra

tcrossprod
*
Statistics
Matrix Algebra

eigen
eig
Statistics
Matrix Algebra

qr
qr
Statistics
Matrix Algebra

svd
svd
Statistics
Matrix Algebra

%*%
*
Statistics
Matrix Algebra

%o%

Statistics
Matrix Algebra

outer

Statistics
Matrix Algebra

rcond

Statistics
Matrix Algebra

solve
\
Statistics
Matrix Algebra

duplicated

Statistics
Ordering and Tabulating

unique

Statistics
Ordering and Tabulating

merge

Statistics
Ordering and Tabulating

order

Statistics
Ordering and Tabulating

rank

Statistics
Ordering and Tabulating

quantile
quantile
Statistics
Ordering and Tabulating

sort
sort
Statistics
Ordering and Tabulating

table

Statistics
Ordering and Tabulating

ftable

Statistics
Ordering and Tabulating

ls
whos
Working with R
Workspace

exists

Working with R
Workspace

get

Working with R
Workspace

rm

Working with R
Workspace

getwd
getcwd
Working with R
Workspace

setwd
setcwd
Working with R
Workspace

q
Ctrl-D
Working with R
Workspace

source
load
Working with R
Workspace

install.packages

Working with R
Workspace

library

Working with R
Workspace

require

Working with R
Workspace

help
help
Working with R
Help

?
help
Working with R
Help

help.search

Working with R
Help

apropos

Working with R
Help

RSiteSearch

Working with R
Help

citation

Working with R
Help

demo

Working with R
Help

example

Working with R
Help

vignette

Working with R
Help

traceback

Working with R
Debugging

browser

Working with R
Debugging

recover

Working with R
Debugging

options(error =)

Working with R
Debugging

stop

Working with R
Debugging

warning

Working with R
Debugging

message

Working with R
Debugging

tryCatch
try/catch
Working with R
Debugging

try
try
Working with R
Debugging

print
print (println)
I/O
Output

cat

I/O
Output

message

I/O
Output

warning

I/O
Output

dput

I/O
Output

format

I/O
Output

sink

I/O
Output

data

I/O
Reading and Writing Data

count.fields

I/O
Reading and Writing Data

read.csv
csvread
I/O
Reading and Writing Data

read.delim
dlmread
I/O
Reading and Writing Data

read.fwf

I/O
Reading and Writing Data

read.table

I/O
Reading and Writing Data

library(foreign)

I/O
Reading and Writing Data

write.table
dlmwrite
I/O
Reading and Writing Data

readLines
readlines
I/O
Reading and Writing Data

writeLines

I/O
Reading and Writing Data

load

I/O
Reading and Writing Data

save

I/O
Reading and Writing Data

readRDS

I/O
Reading and Writing Data

saveRDS

I/O
Reading and Writing Data

dir

I/O
Files and Directories

basename

I/O
Files and Directories

dirname

I/O
Files and Directories

file.path

I/O
Files and Directories

path.expand

I/O
Files and Directories

file.choose

I/O
Files and Directories

file.copy

I/O
Files and Directories

file.create

I/O
Files and Directories

file.remove

I/O
Files and Directories

path.rename

I/O
Files and Directories

dir.create

I/O
Files and Directories

file.exists

I/O
F[…]
R_bloggers  programming  statistics  from google
7 weeks ago by rahuldave
Simulated Annealing in Julia
Building Optimization Functions for Julia
In hopes of adding enough statistical functionality to Julia to make it usable for my day-to-day modeling projects, I’ve written a very basic implementation of the simulated annealing (SA) algorithm, which I’ve placed in the same JuliaVsR GitHub repository that I used for the code for my previous post about Julia. For easier reading, my current code for SA is shown below:

The Simulated Annealing Algorithm

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
# simulated_annealing()
# Arguments:
# * cost: Function from states to the real numbers. Often called an energy function, but this algorithm works for both positive and negative costs.
# * s0: The initial state of the system.
# * neighbor: Function from states to states. Produces what the Metropolis algorithm would call a proposal.
# * temperature: Function specifying the temperature at time i.
# * iterations: How many iterations of the algorithm should be run? This is the only termination condition.
# * keep_best: Do we return the best state visited or the last state visisted? (Should default to true.)
# * trace: Do we show a trace of the system's evolution?
 
function simulated_annealing(cost,
s0,
neighbor,
temperature,
iterations,
keep_best,
trace)
 
# Set our current state to the specified intial state.
s = s0
 
# Set the best state we've seen to the intial state.
best_s = s0
 
# We always perform a fixed number of iterations.
for i = 1:iterations
t = temperature(i)
s_n = neighbor(s)
if trace println("$i: s = $s") end
if trace println("$i: s_n = $s_n") end
y = cost(s)
y_n = cost(s_n)
if trace println("$i: y = $y") end
if trace println("$i: y_n = $y_n") end
if y_n <= y
s = s_n
if trace println("Accepted") end
else
p = exp(- ((y_n - y) / t))
if trace println("$i: p = $p") end
if rand() <= p
s = s_n
if trace println("Accepted") end
else
s = s
if trace println("Rejected") end
end
end
if trace println() end
if cost(s) < cost(best_s)
best_s = s
end
end
 
if keep_best
best_s
else
s
end
end

I’ve tried to implement the algorithm as it was presented by Bertsimas and Tsitsiklis in their 1993 review paper in Statistical Science. The major differences between my implementation and their description of the algorithm is that (1) I’ve made it possible to keep the best solution found during the search rather than always use the last solution found and (2) I’ve made no effort to select a value for their d parameter carefully: I’ve simply set it to 1 for all of my examples, though my code will allow you to specify another rule for setting the temperature of the annealer at time t other than the 1 / log(t) cooling scheme I’ve been using. (In fact, the code currently forces you to select a cooling scheme, since there are no default arguments in Julia yet.)

I chose simulated annealing as my first optimization algorithm to implement in Julia because it’s a natural relative of the Metropolis algorithm that I used in the previous post. Indeed, coding up an implementation of SA made me appreciate the fact that the Metropolis algorithm as used in Bayesian statistics can be considered a special case of the SA algorithm in which the temperature is always 1 and in which the cost function for a state with posterior probability p is -log(p).

Coding up the SA algorithm for myself also me made realize why it’s important that it uses an additive comparison of cost functions rather than a ratio (as in the Metropolis algorithm): the ratio goes haywire when the cost function can take on both positive and negative values (which, of course, doesn’t matter for Bayesian methods because probabilities are strictly non-negative). I discovered this when I initially tried to code up SA from my inaccurate memory without first consulting the literature and discovered that I couldn’t get a ratio-based algorithm to work no matter how many times I tried changing the cooling schedule.

To test out the SA implementation I’ve written, I’ve written two tests cases that attempt to minimize the Rosenbrock and Himmelbrau functions, which I found listed as examples of hard-to-minimize functions in the Wikipedia description of the Nelder-Mead method. You can see those two examples below this paragraph. In addition, I’ve used R to generate plots showing how the SA algorithm works under repeated application on the same optimization problem; in these plots, I’ve used a heatmap to show the cost functions value at each (x, y) position, colored crosshairs to indicate the position of a true minimum of the function in question, and red dots to indicate the purported solutions found by my implementation of SA.

Finding the Minimum of the Rosenbrock Function

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
# Find a solution of the Rosenbrock function using SA.
load("simulated_annealing.jl")
load("../rng.jl")
 
function rosenbrock(x, y)
(1 - x)^2 + 100(y - x^2)^2
end
 
function neighbors(z)
[rand_uniform(z[1] - 1, z[1] + 1), rand_uniform(z[2] - 1, z[2] + 1)]
end
 
srand(1)
 
solution = simulated_annealing(z -> rosenbrock(z[1], z[2]),
[0, 0],
neighbors,
log_temperature,
10000,
true,
false)

Finding the Minima of the Himmelbrau Function

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
# Find a solution of the Himmelbrau function using SA.
load("simulated_annealing.jl")
load("../rng.jl")
 
function himmelbrau(x, y)
(x^2 + y - 11)^2 + (x + y^2 - 7)^2
end
 
function neighbors(z)
[rand_uniform(z[1] - 1, z[1] + 1), rand_uniform(z[2] - 1, z[2] + 1)]
end
 
srand(1)
 
solution = simulated_annealing(z -> himmelbrau(z[1], z[2]),
[0, 0],
neighbors,
log_temperature,
10000,
true,
false)

Moving Forward
Now that I’ve got a form of SA working, I’m interested in coding up a suite of optimization functions for Julia so that I can start to do maximum likelihood estimation in pure Julia. Once that’s possible, I can use Julia to do real science, e.g. when I need to fit simple models for which finding the MLE is appropriate. (I will leave the development of cleaner statistical functions for special cases of maximum likelihood estimation to more capable people, like Douglas Bates, who has already produced some GLM code.)

At present my code is meant simply to demonstrate how one could write an implementation of simulated annealing in Julia. I’m sure that the code can be more efficient and I suspect that I’ve violated some of the idioms of the language. In addition, I’d much prefer that this function use default values for many of the arguments, as there is no reason that an end-user needs to be concerned with finding the best cooling schedule if SA seems to work out of the box on their problem with the cooling schedule I’ve been using.

With those disclaimers about my code in place, I’d like to think that I haven’t made any mathematical errors and that this function can be used by others. So, I’d ask that those interested please tear apart my code so that I can make it usable as a general purpose function for optimization in Julia. Alternatively, please tell me that there’s no need for a pure Julia implementation of SA because, for example, there are nice C libraries that would be much easier to link to than to re-implement.

With an implementation of SA in place, I’ll probably start working on implementing L-BFGS-S soon, which is the other optimization algorithm I use often in R. (To be honest, I use L-BFGS-S almost exclusively, but SA was much easier to implement.)

Incidentally, this code base demonstrates how I view the relationship between R and Julia: Julia is a beautiful new language that is still missing many important pieces. We can all work together to build the best pieces of R that are missing from Julia. While we’re working on improving Julia, we’ll need to keep using R to handle things like visualization of our results. For this post, I turned back to ggplot2 for all of the graphics generation.
Statistics  from google
8 weeks ago by rahuldave
Monkeying with Bayes’ theorem
In Peter Norvig’s talk The Unreasonable Effectiveness of Data, starting at 37:42, he describes a translation algorithm based on Bayes’ theorem. Pick the English word that has the highest posterior probability as the translation. No surprise here. Then at 38:16 he says something curious.

So this is all nice and theoretical and pure, but as well as being mathematically inclined, we are also realists. So we experimented some, and we found out that when you raise that first factor [in Bayes' theorem] to the 1.5 power, you get a better result.

In other words, if we change Bayes’ theorem (!) the algorithm works better. He goes on to explain

Now should we dig up Bayes and notify him that he was wrong? No, I don’t think that’s it. …

I imagine most statisticians would respond that this cannot possibly be right. While it appears to work, there must be some underlying reason why and we should find that reason before using an algorithm based on an ad hoc tweak.

While such a reaction is understandable, it’s also a little hypocritical. Statisticians are constantly drawing inference from empirical data without understanding the underlying mechanisms that generate the data. When analyzing someone else’s data, a statistician will say that of course we’d rather understand the underlying mechanism than fit statistical models, that just not always possible. Reality is too complicated and we’ve got to do the best we can.

I agree, but that same reasoning applied at a higher level of abstraction could be used to accept Norvig’s translation algorithm. Here’s this model (derived from spurious math, but we’ll ignore that). Let’s see empirically how well it works.
Statistics  Bayesian  Probability_and_Statistics  from google
11 weeks ago by rahuldave
Data visualization
(This article was first published on Research tips » R, and kindly contributed to R-bloggers)

For those who have not read the seminal works of Tufte and Cleveland, please hang your heads in shame. To salvage some sense of self-worth, you can then head over to Solomon Messing’s blog where he is starting a series on data visualization based on the principles developed by Tufte and Cleveland (with R examples).

The classics are also worth reading, and remain relevant despite the 20 or 30 years that have elapsed since they appeared.

To leave a comment for the author, please follow the link and comment on his blog: Research tips » R.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series,ecdf, trading) and more...
R_bloggers  graphics  R  statistics  Uncategorized  from google
12 weeks ago by rahuldave
ABC in Roma [R lab #2]
(This article was first published on Xi'an's Og » R, and kindly contributed to R-bloggers)

Here are the R codes of the second R lab organised by Serena Arima in supplement of my lectures (now completed!). This morning I covered ABC model choice and the following example is the benchmark used in the course (and in the paper) about the impact of summary statistics. (Warning! It takes a while to run…)

n.iter=10000
n=c(10,100,1000)
n.sims=100
prob.m1=matrix(0,nrow=n.sims,ncol=length(n))
prob.m2=prob.m1

### Simulation from Normal model
for(sims in 1:n.sims){

## True data generation from the Normal distribution and summary statistics
for(i in 1:length(n)){
y.true=rnorm(n[i],0,1)
stat.true=c(mean(y.true),median(y.true),var(y.true))

## ABC algorithm
# Sample from the prior
mu=rnorm(n.iter,0,2)
dist.m1=rep(NA,n.iter)
dist.m2=rep(NA,n.iter)

for(j in 1:n.iter){
# Data generation under both models
# Normal model
y.m1=rnorm(n[i],mu[j])
stat.m1=c(mean(y.m1),median(y.m1),var(y.m1))
# computing the distance
dist.m1[j]=sum((stat.m1-stat.true)^2)
# Laplace model
y.m2=mu[j]+sample(c(-1,1),n[i],rep=TRUE)*rexp(n[i],rate=sqrt(2))
stat.m2=c(mean(y.m2),median(y.m2),var(y.m2))
# computing the distance
dist.m2[j]=sum((stat.m2-stat.true)^2)
}

# select epsilon as 1% distance quantile
epsilon=quantile(c(dist.m1,dist.m2),prob=0.01)

# compute the posterior probability that data come from
# the model
prob.m1[sims,i]=sum(dist.m1<epsilon)/(2*n.iter*0.01)
}}

### Simulation from the Laplace model
for(sims in 1:n.sims){

## True data generation from the Laplace distribution and summary statistics
for(i in 1:length(n)){
y.true=sample(c(-1,1),n[i],rep=TRUE)*rexp(n[i],rate=sqrt(2))
stat.true=c(mean(y.true),median(y.true),var(y.true))

## ABC algorithm
# Sample from the prior
mu=rnorm(n.iter,0,2)
dist.m1=rep(NA,n.iter)
dist.m2=rep(NA,n.iter)

for(j in 1:n.iter){
# Data generation under both models
# Normal model
y.m1=rnorm(n[i],mu[j])
stat.m1=c(mean(y.m1),median(y.m1),var(y.m1))
# computing the distance
dist.m1[j]=sum((stat.m1-stat.true)^2)
# Laplace model
y.m2=mu[j]+sample(c(-1,1),n[i],rep=TRUE)*rexp(n[i],rate=sqrt(2))
stat.m2=c(mean(y.m2),median(y.m2),var(y.m2))
# computing the distance
dist.m2[j]=sum((stat.m2-stat.true)^2)
}

# select epsilon as 1% distance quantile
epsilon=quantile(c(dist.m1,dist.m2),prob=0.01)

# compute the posterior probability that data come from
# the model
prob.m2[sims,i]=sum(dist.m2<epsilon)/(2*n.iter*0.01)
}
}

# Visualize the results
y.true=sample(c(-1,1),n[i],rep=TRUE)*rexp(n[i],rate=sqrt(2))
stat.true=c(mean(y.true),median(y.true),var(y.true))

## ABC algorithm
# Sample from the prior
mu=rnorm(n.iter,0,2)
dist.m1=rep(NA,n.iter)
dist.m2=rep(NA,n.iter)

for(j in 1:n.iter){
# Data generation under both models
# Normal model
y.m1=rnorm(n[i],mu[j])
stat.m1=c(mean(y.m1),median(y.m1),var(y.m1))
# computing the distance
dist.m1[j]=sum((stat.m1-stat.true)^2)
# Laplace model
y.m2=mu[j]+sample(c(-1,1),n[i],rep=TRUE)*rexp(n[i],rate=sqrt(2))
stat.m2=c(mean(y.m2),median(y.m2),var(y.m2))
# computing the distance
dist.m2[j]=sum((stat.m2-stat.true)^2)
}

# select epsilon as 1% distance quantile
epsilon=quantile(c(dist.m1,dist.m2),prob=0.01)

# compute the posterior probability that data come from
# the model
prob.m2[sims,i]=sum(dist.m2<epsilon)/(2*n.iter*0.01)
}
}

# Visualize the results
y.true=sample(c(-1,1),n[i],rep=TRUE)*rexp(n[i],rate=sqrt(2))
stat.true=c(mean(y.true),median(y.true),var(y.true))

## ABC algorithm
# Sample from the prior
mu=rnorm(n.iter,0,2)
dist.m1=rep(NA,n.iter)
dist.m2=rep(NA,n.iter)

for(j in 1:n.iter){
# Data generation under both models
# Normal model
y.m1=rnorm(n[i],mu[j])
stat.m1=c(mean(y.m1),median(y.m1),var(y.m1))
# computing the distance
dist.m1[j]=sum((stat.m1-stat.true)^2)
# Laplace model
y.m2=mu[j]+sample(c(-1,1),n[i],rep=TRUE)*rexp(n[i],rate=sqrt(2))
stat.m2=c(mean(y.m2),median(y.m2),var(y.m2))
# computing the distance
dist.m2[j]=sum((stat.m2-stat.true)^2)
}

# select epsilon as 1% distance quantile
epsilon=quantile(c(dist.m1,dist.m2),prob=0.01)

# compute the posterior probability that data come from
# the model
prob.m2[sims,i]=sum(dist.m2<epsilon)/(2*n.iter*0.01)
}}

# Visualize the results
boxplot(data.frame(prob.m1[,1],prob.m2[,1]),
names=c("M1","M2"),main="N=10")
boxplot(data.frame(prob.m1[,2],prob.m2[,2]),
names=c("M1","M2"),main="N=100")
boxplot(data.frame(prob.m1[,3],prob.m2[,3]),
names=c("M1","M2"),main="N=1000")

Once again, I had a terrific time teaching this “ABC in Roma” course, and not only for the immense side benefit of enjoy Roma in a particularly pleasant weather (for late February).

Filed under: R, Statistics, University life Tagged: ABC, La Sapienza, PhD course, R, Roma

To leave a comment for the author, please follow the link and comment on his blog: Xi'an's Og » R.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series,ecdf, trading) and more...
R_bloggers  ABC  La_Sapienza  PhD_course  R  Roma  statistics  University_life  from google
march 2012 by rahuldave
Modeling Trick: the Signed Pseudo Logarithm
(This article was first published on Win-Vector Blog » R, and kindly contributed to R-bloggers)

Much of the data that the analyst uses exhibits extraordinary range. For example: incomes, company sizes, popularity of books and any “winner takes all process”; (see: Living in A Lognormal World). Tukey recommended the logarithm as an important “stabilizing transform” (a transform that brings data into a more usable form prior to generating exploratory statistics, analysis or modeling). One benefit of such transforms is: data that is normal (or Gaussian) meets more of the stated expectations of common modeling methods like least squares linear regression. So data from distributions like the lognormal is well served by a log() transformation (that transforms the data closer to Gaussian) prior to analysis. However, not all data is appropriate for a log-transform (such as data with zero or negative values). We discuss a simple transform that we call a signed pseudo logarithm that is particularly appropriate to signed wide-range data (such as profit and loss).Log-transforming data is essential when analyzing systems that operate in relative terms or are “scale invariant” (such as financial returns). For example Geometric Brownian motion is a stochastic process central to a number of financial models (such as option pricing). Geometric Brownian motion is actually the exponential of a standard linear Brownian motion (where increments are normal or Gaussian). In this case a logarithmic transformation actually moves from the observed data to a more natural frame of reference (where increments are additive instead of being multiplicative). However, a major shortcoming of the log transform is its inability to deal with zero values and negative values.

A signed value we have often been asked to characterize or predict is: profit and loss (often called P&L). One natural model for P&L would be as the difference between a revenue and an expense. If the revenue and expense were both normally distributed then their difference would also be normal (and we would not need any stabilizing transform). However, if the revenue and expense were both log-normally distributed (say both proportional to task size or some other log-normal parameter) then the difference is not normal (it retains the propensity for extreme values or heavy tails of the original distributions). And for many financial size measures (company size, contract size and so on) the log-normal distribution is a much more realistic model than the normal distribution. In some situations P&L’s are formed from completely observed revenues and expenses (so we can model everything without sign problems), in other situations the signed P&L from an unobserved (or unrecorded) underling process and we are forced to deal with signed quantities.

For signed data we suggest the following transformation (code in R):

pseudoLog10 <- function(x) { asinh(x/2)/log(10) }

asinh() is a somewhat ugly function that is the inverse of sinh(). sinh() is defined as:

sinh(x) = (e^x - e^(-x))/2

The important point is for x such that |x| is large 2*sinh(x) rapidly approaches sign(x)*e^(|x|). Thus we should expect asinh(x/2) to look a lot like sign(x)*log(|x|) (which is why we call it a signed pseudo logarithm). For pseudoLog10() we take the previous function divided by log(10) to ensure that we are in log-10 like units (i.e. pseudoLog10(100) is nearly 2, pseudoLog10(1000) is nearly 3 and so on). Business audiences tend to have an easier time with log-10 (or dB) units (which can be explained as counting the number of decimal digits) than natural log or log-e units.

So for large positive numbers pseudoLog10() pretty much behaves like log10() (itself a standard transform). In fact pseudoLog10() has the following nice properties:

pseudoLog10(x) is defined for all real x.
pseudoLog10(0) = 0.
pseudoLog10(-x) = -pseudoLog10(x).
pseudoLog10(x) is monotone in x.
For x such that |x| is large: pseudoLog10(x) is very near sign(x)*log10(|x|).

We strongly recomend trying this transformation before feeding heavy tail data into a linear or logistic model.

However, we can not recommend the transformation for presentation. Consider the simple case of plotting the distribution or density of normal data with mean zero and standard-deviation 10 (see My Favorite Graphs for description of a density plot):

library(ggplot2)
pseudoLog10 <- function(x) { asinh(x/2)/log(10) }
d <- data.frame(x=rnorm(n=1000,sd=100))
ggplot(d) + geom_density(aes(x=x))

The density plot shows what we would expect- a near normal distribution (most points towards the center and mass falling off quickly as we move away). However, the plot of the pseudoLog10 transformed data is not what we would hope:

d$pseudoLog10x <- pseudoLog10(d$x)
ggplot(d) + geom_density(aes(x=pseudoLog10x))

The data density (falsely) appears bimodal! This is because the pseudoLog10() transform is compressing ranges more and more violently as we move away from the origin (and not compressing near the origin). So as we move away from origin: the product of the real data density times the degree of range compression climbs, achieves a maximum and then falls. This phenomena (which is just a “change of variables” for densities) gives us the bimodal appearance for unimodal distributions that have significant mass outside of the range [-10,10]. The bimodal appearance is mostly a fact about the transform not really a feature of the underlying data.

We see value in examining at the relative sizes and centers of these two modes for asymmetric distributions (such as the profit and loss statement for a set of accounts that are mostly losing money). The position and relative sizes of the modes gives us an initial hint what to look for (helps with questions like: “are total losses driven by many accounts or by few accounts” and so on). We can not, however, recommend the pseudoLog10() transform for presentation. The most striking feature of the graph is almost always the bimodal appearance of the data; and the bimodal appearance is almost always an artifact of the transform (not a real feature of the data). You can not in good conscious push a presentation where the most prominent and exciting observation is not in fact in the data.

We do still recommend trying the pseudoLog10() transform when building a linear or logistic model with wide ranged data. The transformation usefully compresses range which allows the modeled coefficients to be a function of most of the data and not a function of a few extreme values. Models that depend on most of their data (or on central estimates from their data) tend to be safer, achieve higher statistical significance and cross-validate more reliably. Models that are dominated by a few extreme values tend to be unsafe, not achieve statistical significance and not cross-validate reliably. The bimodal artifact can work in the favor of modeling as it tends to compress a transformed variable into “typical positive example” and “typical negative example” while still allowing magnitudes to enter the model in some form.

Used with care the pseudoLog10() or arcsinh() transform can be an important data preparation step for signed data with large range. Many financial summaries (such as P&L) meet these conditions and often profit from the transform.

Related posts:
Your Data is Never the Right Shape

To leave a comment for the author, please follow the link and comment on his blog: Win-Vector Blog » R.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series,ecdf, trading) and more...
R_bloggers  arcsinh  R  singed_pseudo_logarithm  stabilizing_transform  statistics  from google
march 2012 by rahuldave
Comparing R to smoking
Francois Pinard comparing the R programming language to smoking:

Using R is a bit akin to smoking. The beginning is difficult, one may get headaches and even gag the first few times. But in the long run, it becomes pleasurable and even addictive. Yet, deep down, for those willing to be honest, there is something not fully healthy in it.

I’ve never gotten to the point that I would call using R pleasurable.

Quote via Stats in the Wild

Related posts:

Reviews of R in Action and The Art of R Programming
R quirks
Statistics  Rstats  from google
february 2012 by rahuldave
Dear statisticians: Please start using your powers for good not evil
And here I waste all my time predicting outbreaks of violence, when I could be doing this:
As Pole’s computers crawled through the data, he was able to identify about 25 products that, when analyzed together, allowed him to assign each shopper a “pregnancy prediction” score. More important, he could also estimate her due date to within a small window, so Target could send coupons timed to very specific stages of her pregnancy.
…About a year after Pole created his pregnancy-prediction model, a man walked into a Target outside Minneapolis and demanded to see the manager. He was clutching coupons that had been sent to his daughter, and he was angry…
“My daughter got this in the mail!” he said. “She’s still in high school, and you’re sending her coupons for baby clothes and cribs? Are you trying to encourage her to get pregnant?”
The manager didn’t have any idea what the man was talking about. He looked at the mailer. Sure enough, it was addressed to the man’s daughter and contained advertisements for maternity clothing, nursery furniture and pictures of smiling infants. The manager apologized and then called a few days later to apologize again.
On the phone, though, the father was somewhat abashed. “I had a talk with my daughter,” he said. “It turns out there’s been some activities in my house I haven’t been completely aware of. She’s due in August. I owe you an apology.”
Actually, sir, I don’t think you do.
Full article. h/t @justinwolfers and Forbes
statistics  forecasting  prediction  from google
february 2012 by rahuldave
Linking correlation to causation with power laws and scale free systems
An essential part of science involves finding correlations between two sets of measurements and seeking explanations for those correlations. However, relationships can be suggested by data even when they don't actually exist, and correlations may occur due to random fluctuations rather than a deep underlying principle (as the infamous "correlation does not equal causation" cliché suggests). These errors are easy to make, and the scientific literature is full of them.

So how can researchers establish if a correlation is both real and meaningful? In a Perspective in the February 10 issue
of Science, Michael P.H. Stumpf and Mason A. Porter
examine the type of correlation known as a power law, where one set of measurements is related to a second via an exponent. They argue that two things must be in place for a power law
to be valid as a predictive model: it must hold over a wide range of data to eliminate chance associations, and it must have a plausible mechanism to explain why the correlation showed up in the data.






Read the comments on this post
News  News  Science  correlations  powerlaw  statistics  from google
february 2012 by rahuldave
The universal solvent of statistics
Andrew Gelman just posted an interesting article on the philosophy of Bayesian statistics. Here’s my favorite passage.

This reminds me of a standard question that Don Rubin … asks in virtually any situation: “What would you do if you had all the data?” For me, that “what would you do” question is one of the universal solvents of statistics.

Emphasis added.

I had not heard Don Rubin’s question before, but I think I’ll be asking it often. It reminds me of Alice’s famous dialog with the Cheshire Cat:

“Would you tell me, please, which way I ought to go from here?”

“That depends a good deal on where you want to get to,” said the Cat.

“I don’t much care where–” said Alice.

“Then it doesn’t matter which way you go,” said the Cat.

Related post: Irrelevant uncertainty
Statistics  Uncategorized  Bayesian  Probability_and_Statistics  from google
february 2012 by rahuldave
R in Action
No Starch Press sent me a copy of The Art of R Programming last Fall and I wrote a review of it here. Then a couple weeks ago, Manning sent me a copy of R in Action. Here I’ll give a quick comparison of the two books, then focus specifically on R in Action.

Comparing R books

Norman Matloff, author of The Art of R Programming, is a statistician-turned-computer scientist. As the title may imply, Matloff’s book has more of a programmer’s perspective on R as a language.

Robert Kabacoff, author of R in Action, is a psychology professor-turned-statistical consultant. And as its title may imply, Kabacoff’s book is more about using R to analyze data. That is, the book is organized by analytical task rather than by language feature.

Many R books are organized like a statistical text. In fact, many are statistics texts, organized according to the progression of statistical theory with R code sprinkled in. R in Action is organized roughly in the order of steps one would take to analyze data, starting with importing data and ending with producing reports.

In short, The Art of R Programming is for programmers, R in Action is for data analysts, and most other R books I’ve seen are for statisticians. Of course a typical R user is to some extent a programmer, an analyst, and a statistician. But this comparison gives you some idea which book you might want to reach for depending on which hat you’re wearing at the moment. For example, I’d pick up The Art of R Programming if I had a question about interfacing R and C, but I’d pick up R in Action if I wanted to read about importing SAS data or using the ggplot2 graphics package.

R in Action

Kabacoff begins his book off with two appropriate quotes.

What is the use of a book, without pictures or conversations? — Alice, Alice in Wonderland

It’s wonderous, with treasures to satiate desires both subtle and gross; but it’s not for the timid. — Q, “Q Who?” Star Trek: The Next Generation

R in Action is filled with pictures and conversations. It is also a treasure chest of practical information.

The first third of the book concerns basic data management and graphics. This much of the book would be accessible to someone with no background in statistics. The middle third of the book is devoted to basic statistics: correlation, linear regression, etc. The final third of the book contains more advanced statistics and graphics. (I was pleased to see the book has an appendix on using Sweave and odfWeave to produce reports.)

R in Action includes practical details that I have not seen in other books on R. Perhaps this is because the book is focused on analyzing and graphing data rather than exploring the dark corners of R or rounding out statistical theory.

Kabacoff says that he wrote the book that he wishes he’d had years ago. I also wish I’d had his book years ago.

Related links:

R programming for those coming from other languages (referenced in R in Action)

Calling C++ from R

Better R console fonts
Statistics  Books  Probability_and_Statistics  Rstats  from google
january 2012 by rahuldave
Teaching Bayesian stats backward
Most presentations of Bayesian statistics I’ve seen start with elementary examples of Bayes’ Theorem. And most of these use the canonical example of testing for rare diseases. But the connection between these examples and Bayesian statistics is not obvious at first. Maybe this isn’t the best approach.

What if we begin with the end in mind? Bayesian calculations produce posterior probability distributions on parameters. An effective way to teach Bayesian statistics might be to start there. Suppose we had probability distributions on our parameters. Never mind where they came from. Never mind classical objections that say you can’t do this. What if you could? If you had such distributions, what could you do with them?

For starters, point estimation and interval estimation become trivial. You could, for example, use the distribution mean as a point estimate and the area between two quantiles as an interval estimate. The distributions tell you far more than  point estimates or interval estimates could; these estimates are simply summaries of the information contained in the distributions.

It makes logical sense to start with Bayes’ Theorem since that’s the tool used to construct posterior distributions. But I think it makes pedagogical sense to start with the posterior distribution and work backward to how one would come up with such a thing.

Bayesian statistics is so named because Bayes’ Theorem is essential to its calculations. But that’s a little like classical statistics Central Limitist statistics because it relies heavily on the Central Limit Theorem.

The key idea of Bayesian statistics is to represent all uncertainty by probability distributions. That idea can be obscured by an early emphasis on calculations.

Related posts:

Interview with David Spiegelhalter
Occam’s razor and Bayes’ theorem
Four reasons to use Bayesian inference
Statistics  Bayesian  Education  from google
april 2011 by rahuldave
Significance testing and Congress
The US Supreme Court’s criticism of significance testing has been in the news lately. Here’s a criticism of significance testing involving the US Congress. Consider the following syllogism.

If a person is an American, he is not a member of Congress.
This person is a member of Congress.
Therefore he is not American.

The initial premise is false, but the reasoning is correct if we assume the initial premise is true.

The premise that Americans are never members of Congress is clearly false. But it’s almost true! The probability of an American being a member of Congress is quite small, about 535/309,000,000. So what happens if we try to salvage the syllogism above by inserting “probably” in the initial premise and conclusion?

If a person is an American, he is probably not a member of Congress.
This person is a member of Congress.
Therefore he is probably not American.

What went wrong? The probability is backward. We want to know the probability that someone is American given he is a member of Congress, not the probability he is a member of Congress given he is American.

Science continually uses flawed reasoning analogous to the example above. We start with a “null hypothesis,” a hypothesis we seek to disprove. If our data are highly unlikely assuming this hypothesis, we reject that hypothesis.

If the null hypothesis is correct, then these data are highly unlikely.
These data have occurred.
Therefore, the null hypothesis is highly unlikely.

Again the probability is backward. We want to know the probability of the hypothesis given the data, not the probability of the data given the hypothesis.

We can’t reject a null hypothesis just because we’ve seen data that are rare under this hypothesis. Maybe our data are even more rare under the alternative. It is rare for an American to be in Congress, but it is even more rare for someone who is not American to be in the US Congress!

I found this illustration in The Earth is Round (p < 0.05) by Jacob Cohen (1994). Cohen in turn credits Pollard and Richardson (1987) in his references.

Related posts:

How insignificant is significance testing?
Five criticisms of significance testing
Most published research results are false
Classical statistics in a nutshell
Statistics  Bayesian  Probability_and_Statistics  Science  from google
april 2011 by rahuldave
How insignificant is statistical significance?
Luis Pericchi sent me a brief note commenting on the recent US Supreme Court decision involving statistical significance and medical reporting. Here is his paper, about a page and a half.

How insignificant is statistical significance? (PDF)

Related post: Significance testing and Congress
Statistics  Probability_and_Statistics  from google
april 2011 by rahuldave
Simple approximation to normal distribution
Here’s a simple approximation to the normal distribution I just ran across. The density function is

f(x) = (1 + cos(x))/2π

over the interval (-π, π). The plot below graphs this density with a solid blue line. For comparison, the density of a normal distribution with the same variance is plotted with a dashed orange line.

The approximation is good enough to use for teaching. Students may benefit from doing an exercise twice, once with this approximation and then again with the normal distribution. Having an approximation they can integrate in closed form may help take some of the mystery out of the normal distribution.

The approximation may have practical uses. The agreement between the PDFs isn’t great. However, the agreement between the CDFs (which is more important) is surprisingly good. The maximum difference between the two CDFs is only 0.018. (The differences between the PDFs oscillate, and so their integrals, the CDFs, are closer together.)

I ran across this approximation here. It goes back to the 1961 paper “A cosine approximation to the normal distribution” by D. H. Raab and E. H. Green, Psychometrika, Volume 26, pages 447-450.

Update 1: See the paper referenced in the first comment. It gives a much more accurate approximation using a logistic function. The cosine approximation is a little simpler and may be better for teaching. However, the logistic approximation has infinite support. That could be an advantage since students might be distracted by the finite support of the cosine approximation.

The logistic approximation for the standard normal CDF is

F(x) = 1/(1 + exp(-0.07056 x3 – 1.5976 x))

and has a maximum error of 0.00014 at x = ± 3.16.

Update 2:

How might you use this approximation the other way around, approximating a cosine by a normal density? See Always invert.

Related posts:

Rolling dice for normal samples
Sums of uniform random variables
Math  Statistics  Probability_and_Statistics  from google
april 2010 by rahuldave
Is R an ‘epic fail’?
Is R an ‘epic fail’?

Something as popular and widespread as R can hardly be called a ‘failure’ in any meaningful sense, so of course the question is really in which aspects R is inferior to alternatives.

For most users who need a bit of data analysis, it is probably a poor first choice. R is a programming language with a lot of statistical and data visualisation support, but it is a programming language.  If you don’t want to do any programming, don’t muck about with R!  There are lots of visualisation tools and statistical tools that are much easier to use.

Of course, without a bit of programming, you are limited to what those tools can do, so if you need analysis that is not provided, you need to either find a programmer or learn how to program, and for the latter, R isn’t a bad choice.

You can get pretty far with very little effort in R, once you have learned how to program. Now learning how to program does require quite a bit of effort, but if you need to there really isn’t any way around it.  Just like there isn’t any Royal Road to mathematics (as Euclid is supposed to have said).

Sure, as a programming language R has its idiosyncrasies, but which programming languages do not?
Work  programming  R  statistics  from google
april 2010 by rahuldave
Estimating the chances of something that hasn’t happened yet
Suppose you’re proofreading a book. If you’ve read 20 pages and found 7 typos, you might reasonably estimate that the chances of a page having a typo are 7/20. But what if you’ve read 20 pages and found no typos. Are you willing to conclude that the chances of a page having a typo are 0/20, i.e. the book has absolutely no typos?

To take another example, suppose you are testing children for perfect pitch. You’ve tested 100 children so far and haven’t found any with perfect pitch. Do you conclude that children don’t have perfect pitch? You know that some do because you’ve heard of instances before. But your data suggest perfect pitch in children is at least rare. But how rare?

The rule of three gives a quick and dirty way to estimate these kinds of probabilities. It says that if you’ve tested N cases and haven’t found what you’re looking for, a reasonable estimate is that the probability is less than 3/N. So in our proofreading example, if you haven’t found any typos in 20 pages, you could estimate that the probability of a page having a typo is less than 15%. In the perfect pitch example, you could conclude that fewer than 3% of children have perfect pitch.

Note that the rule of three says that your probability estimate goes down in proportion to the number of cases you’ve studied. If you’d read 200 pages without finding a typo, your estimate would drop from 15% to 1.5%. But it doesn’t suddenly drop to zero. I imagine most people would harbor a suspicion that that there may be typos even though they haven’t seen any in the first few pages. But at some point they might say “I’ve read so many pages without finding any errors, there must not be any.” The situation is a little different with the perfect pitch example, however, because you may know before you start that the probability cannot be zero.

If the sight of math makes you squeamish, you might want to stop reading now. Just remember that if you haven’t seen something happen in N observations, a good estimate is that the chances of it happening are less than 3/N.

What makes the rule of three work? Suppose the probability of what you’re looking for is p. If we want a 95% confidence interval, we want to find the largest p so that the probability of no successes out of n trials is 0.05, i.e. we want to solve (1-p)n = 0.05 for p. Taking logs of both sides, n log(1-p) = log(0.05) ≈ -3. Since log(1-p) is approximately -p for small values of p, we have p ≈ 3/n.

The derivation above gives the frequentist perspective. I’ll now give the Bayesian derivation of the same result. Then you can say “p is probably less than 3/N” in clear conscience since Bayesians are allowed to make such statements.

Suppose you start with a uniform prior on p. The posterior distribution on p after having seen 0 successes and N failures has a beta(1, N+1) distribution. If you calculate the posterior probability of p being less than 3/N you get an expression that approaches 1 – exp(-3) as N gets large, and 1 – exp(-3) ≈ 0.95.

Related posts:

What is a confidence interval?
Probability mistake can give a good approximation
Four reasons to use Bayesian inference
Statistics  Bayesian  Probability_and_Statistics  from google
march 2010 by rahuldave

Copy this bookmark:



description:


tags: