rahuldave + programming   31

cumplyr: Extending the plyr Package to Handle Cross-Dependencies
Introduction
For me, Hadley Wickham’s reshape and plyr packages are invaluable because they encapsulate omnipresent design patterns in statistical computing: reshape handles switching between the different possible representations of the same underlying data, while plyr automates what Hadley calls the Split-Apply-Combine strategy, in which you split up your data into several subsets, perform some computation on each of these subsets and then combine the results into a new data set. Many of the computations implicit in traditional statistical theory are easily described in this fashion: for example, comparing the means of two groups is computationally equivalent to splitting a data set of individual observations up into subsets based on the group assignments, applying mean to those subsets and then pooling the results back together again.
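
As a rough illustration of the disjoint version of Split-Apply-Combine described above, here is a minimal sketch in plain Python rather than R. The function name group_means and the toy observations are made up for this example; nothing here is taken from plyr itself.

```python
from collections import defaultdict
from statistics import mean

def group_means(rows, key, value):
    """Split rows by `key`, apply mean to `value`, combine into a dict."""
    groups = defaultdict(list)
    for row in rows:
        # Split: every row lands in exactly one group, so the subsets are disjoint.
        groups[row[key]].append(row[value])
    # Apply mean to each subset and combine the results into one mapping.
    return {group: mean(vals) for group, vals in groups.items()}

# Hypothetical observations with a group assignment and a measurement.
observations = [
    {"group": "a", "score": 1.0},
    {"group": "a", "score": 3.0},
    {"group": "b", "score": 5.0},
]
means = group_means(observations, "group", "score")
print(means)
```

The point to notice is that each row contributes to one and only one group mean.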

The Split-Apply-Combine Strategy is Broader than plyr
The only weakness of plyr, which automates so many of the computations that instantiate the Split-Apply-Combine strategy, is that plyr implements one very specific version of the Split-Apply-Combine strategy: plyr always splits your data into disjoint subsets. By disjoint, I mean that any row of the original data set can occur in only one of the subsets created by the splitting function. For computations that involve cross-dependencies between observations, this makes plyr inapplicable: cumulative quantities like running means and broadly local quantities like kernelized means cannot be computed using plyr. To highlight that concern, let’s consider three very simple data analysis problems.

Computing Forward-Running Means
Suppose that you have the following data set:

Time  Value
1     1
2     3
3     5

To compute a forward-running mean, you need to split this data into three subsets:

Time  Value
1     1

Time  Value
1     1
2     3

Time  Value
1     1
2     3
3     5

In each of these clearly non-disjoint subsets, you would then compute the mean of Value and combine the results to give:

Time  Value
1     1
2     2
3     3

This sort of computation occurs often enough in a simpler form that R provides tools like cumsum and cumprod to deal with cumulative quantities. But the splitting problem in our example is not addressed by those tools, nor by plyr, because the cumulative quantities have to be computed on subsets that are not disjoint.
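
To make the overlap concrete, here is a small Python sketch (not R, and not part of cumplyr) that builds each non-disjoint subset explicitly and averages it. It also shows the cumulative-sum shortcut, which happens to work for means when the rows are already sorted by time; a function such as a median would still need the explicit subsets.

```python
from itertools import accumulate

times = [1, 2, 3]
values = [1, 3, 5]

# Explicit version: for each time t, the subset is every row with time <= t.
forward_means = []
for t in times:
    subset = [v for s, v in zip(times, values) if s <= t]
    forward_means.append(sum(subset) / len(subset))

# Shortcut for this special case: running total divided by the running count.
# This relies on the rows being sorted by time.
shortcut = [total / n for n, total in enumerate(accumulate(values), start=1)]

print(forward_means)
print(shortcut)
```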

Computing Backward-Running Means
Consider performing the same sort of calculation as described above, but moving in the opposite direction. In that case, the three non-disjoint subsets are:

Time  Value
3     5

Time  Value
2     3
3     5

Time  Value
1     1
2     3
3     5

And the final result is:

Time  Value
1     3
2     4
3     5

Computing Local Means (AKA Kernelized Means)
Imagine that, instead of looking forward or backward, we only want to know something about data that is close to the current observation being examined. For example, we might want to know the mean value of each row when pooled with its immediately preceding and succeeding neighbors. This computation must create the following subsets of data:

Time  Value
1     1
2     3

Time  Value
1     1
2     3
3     5

Time  Value
2     3
3     5

Within these non-disjoint subsets, means are computed and the result is:

Time  Value
1     2
2     3
3     4

A Strategy for Handling Non-Disjoint Subsets
How can we build a general purpose tool to handle these sorts of computations? One way is to rethink how plyr works and then extend it with some trivial variations on its core principles. We can envision plyr as a system that uses a splitting operation that partitions our data into subsets in which each subset satisfies a group of equality constraints: you split the data into groups in which Variable 1 = Value 1 AND Variable 2 = Value 2, etc. Because you consider the conjunction of several equality constraints, the resulting subsets are disjoint.

Seen in this fashion, there is a simple relaxation of the equality constraints that allows us to solve the three problems described a moment ago: instead of looking at the conjunction of equality constraints, we use a conjunction of inequality constraints. For the time being, I’ll describe just three instantiations of this broader strategy.

Using Upper Bounds
Here, we divide data into groups in which Variable 1 <= Value 1 AND Variable 2 <= Value 2, etc. We will also allow equality constraints, so that the operations of plyr are a strict subset of the computations in this new model. For example, we might use the constraint Variable 1 = Value 1 AND Variable 2 <= Value 2. If the upper bound is placed on the Time variable, these constraints will allow us to compute the forward-running mean we described earlier.

Using Lower Bounds
Instead of using upper bounds, we can use lower bounds to divide data into groups in which Variable 1 >= Value 1 AND Variable 2 >= Value 2, etc. This allows us to implement the backward-running mean described earlier.

Using Norm Balls
Finally, we can consider a combination of upper and lower bounds. For simplicity, we'll assume that these bounds have a fixed tightness around the "center" of each subset of our split data. To articulate this tightness formally, we look at a specific hypothetical equality constraint like Variable 1 = Value 1 and then loosen it so that norm(Variable 1 - Value 1) <= r. When r = 0, this system gives the original equality constraint. But when r > 0, we produce a "ball" of data around the constraint whose tightness is r. This lets us estimate the local means from our third example.
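
The three relaxations can be sketched together in a few lines of Python. This is only an illustration of the idea, not cumplyr's implementation: the function name constrained_apply and its arguments are hypothetical, every row is treated as the "center" of its own subset, and the norm is assumed to be an absolute difference on a single numeric variable.

```python
from statistics import mean

def constrained_apply(rows, func, equality=(), lower=(), upper=(), norm_ball=None):
    """For each row, gather every row satisfying the constraints and apply func."""
    norm_ball = norm_ball or {}
    results = []
    for center in rows:
        subset = [
            r for r in rows
            if all(r[v] == center[v] for v in equality)    # plyr-style equality
            and all(r[v] >= center[v] for v in lower)      # lower bounds
            and all(r[v] <= center[v] for v in upper)      # upper bounds
            and all(abs(r[v] - center[v]) <= radius        # ball of radius r
                    for v, radius in norm_ball.items())
        ]
        results.append(func(subset))
    return results

# Same shape of data as the example in the Implementation section.
rows = [{"Time": t, "Value": v} for t, v in zip(range(1, 6), range(1, 10, 2))]

def mean_value(subset):
    return mean(r["Value"] for r in subset)

forward = constrained_apply(rows, mean_value, upper=["Time"])
backward = constrained_apply(rows, mean_value, lower=["Time"])
local = constrained_apply(rows, mean_value, norm_ball={"Time": 1})
exact = constrained_apply(rows, mean_value, equality=["Time"])
print(forward, backward, local, exact)
```

With a radius of 0 the ball collapses to the equality constraint, which is why the equality case can be read as a special case of the norm ball. Rows that share the same constraint values would each produce their own (identical) result in this sketch.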

Implementation
To demo these ideas in a usable fashion, I've created a draft package for R called cumplyr. Here is an extended example of its usage in solving simple variants of the problems described in this post:

library('cumplyr')

data <- data.frame(Time = 1:5, Value = seq(1, 9, by = 2))

# Equality constraint on Time: the disjoint split that plyr already performs
iddply(data,
       equality.variables = c('Time'),
       lower.bound.variables = c(),
       upper.bound.variables = c(),
       norm.ball.variables = list(),
       func = function (df) {with(df, mean(Value))})

# Lower bound on Time: the backward-running mean described above
iddply(data,
       equality.variables = c(),
       lower.bound.variables = c('Time'),
       upper.bound.variables = c(),
       norm.ball.variables = list(),
       func = function (df) {with(df, mean(Value))})

# Upper bound on Time: the forward-running mean described above
iddply(data,
       equality.variables = c(),
       lower.bound.variables = c(),
       upper.bound.variables = c('Time'),
       norm.ball.variables = list(),
       func = function (df) {with(df, mean(Value))})

# Norm ball of radius 1 around Time: local means over immediate neighbors
iddply(data,
       equality.variables = c(),
       lower.bound.variables = c(),
       upper.bound.variables = c(),
       norm.ball.variables = list('Time' = 1),
       func = function (df) {with(df, mean(Value))})

# Norm ball of radius 2: a wider local window
iddply(data,
       equality.variables = c(),
       lower.bound.variables = c(),
       upper.bound.variables = c(),
       norm.ball.variables = list('Time' = 2),
       func = function (df) {with(df, mean(Value))})

# Norm ball of radius 5: wide enough to cover all five rows
iddply(data,
       equality.variables = c(),
       lower.bound.variables = c(),
       upper.bound.variables = c(),
       norm.ball.variables = list('Time' = 5),
       func = function (df) {with(df, mean(Value))})

You can download this package from GitHub and play with it to see whether it helps you. Please submit feedback using GitHub if you have any comments, complaints or patches.

Comparing plyr with cumplyr
In the long run, I'm hoping to make the functions in cumplyr robust enough to submit a patch to plyr. I see these tools as one logical extension of plyr to encompass more of the framework described in Hadley's paper on the Split-Apply-Combine strategy.

For the time being, I would advise any users of cumplyr to make sure that you do not use cumplyr for anything that plyr could already do. cumplyr is very much demo software and I am certain that both its API and implementation will change. In contrast, plyr is fast and stable software that can be trusted to perform its job.

But, if you have a problem that cumplyr will solve and plyr will not, I hope you'll try cumplyr out and submit patches when it breaks.

Happy hacking!
Programming  Statistics  from google
29 days ago by rahuldave
Editorial Radar: Functional languages
Functional Languages are driving a broader set of choices for programmers. O'Reilly editors Mike Loukides and Mike Hendrickson sat down recently to talk about the advantages of functional programming languages and how functional language techniques can be deployed with almost any language. (The full conversation is embedded below.)

Andy Hunt and Dave Thomas have long recommended learning a new language each year, especially those languages that teach new concepts [discussed at the 02:02 mark]. Functional languages have made that easier. They behave in a different way than the languages many of us grew up on: procedural languages like C or languages derived from C. Plus, the polyglot programming movement has driven the interest in functional languages as one of the languages you might want to learn.

Programmers need to understand the advantages of using a functional language, such as productivity, power of expressiveness, reliability, stateful objects, natural concurrency, modularity, and composability [05:37]. The search continues, though, for a magic bullet [06:29] that makes it easier for programmers to solve the problem of concurrency. CPU speeds have been stuck at roughly the same level for the last four to five years. What programmers have been given instead is more transistors on a chip, hence more CPUs and more cores to work with, making concurrency one of the most difficult issues facing computer scientists today. Enter functional programming, with improved debugging and the ability to write more reliable code in a concurrent environment.

Additional highlights from this conversation include:

Print book sales of functional languages are growing, especially books on R programming. And while Loukides doesn't consider R to be a functional language, some debate exists about its classification. Though it's clear the data science movement has driven the use of R because it's well designed for statistics and dealing with data. [Discussed at the 00:29 mark]

We'll see F# grow in the Microsoft development environment while Scala and Clojure are dominating the open source space. Erlang will also be around for a long time for building highly reliable concurrent systems. [Discussed at the 03:01 mark]

Since the publication of Doug Crockford's JavaScript: The Good Parts, coders have discovered the functional language abilities of JavaScript and Java. Google's release of Maps and Gmail revolutionized how JavaScript is used. Some of today's best examples include Node for high-performance websites and D3 for creating exotic and beautiful data visualizations. [Discussed at the 08:15 mark]

While JavaScript isn't a functional language, it's designed loosely, so it's easy to use as a functional language. You might also be interested in how functional programming techniques can be used in C++ — a blog post written by John Carmack. [Discussed at the 10:36 mark]

Java isn't intended as a functional language. Though Dean Wampler's Functional Programming for Java Developers provides an approachable introduction to functional programming for anyone using an object-oriented language. [Discussed at the 11:41 mark]

The use of a functional language or functional language techniques can make your code more robust and easier to debug. [Discussed at the 12:09 mark]

You can view the entire conversation in the following video:

Tune in next month for a discussion of NoSQL and web databases.

Programming  clojure  codepodcast  concurrency  d3  f  functionalprogramming  java  javascript  node  rprogramming  scala  from google
29 days ago by rahuldave
Comparing Julia and R’s Vocabularies
(This article was first published on John Myles White » Statistics, and kindly contributed to R-bloggers)

While exploring the Julia manual recently, I realized that it might be helpful to put the basic vocabularies of Julia and R side-by-side for easy comparison. So I took Hadley Wickham’s R Vocabulary section from the book he’s putting together on the devtools wiki, put all of the functions Hadley listed into a CSV file, and proceeded to fill in entries where I knew of an obvious Julia equivalent to an R function.

The results are on GitHub and, as they stand today, are shown below:

R,Julia,Category,Subcategory
https://github.com/hadley/devtools/wiki/vocabulary,http://julialang.org/manual/standard-library-reference/,Resources,Vocabulary
?,help,Basics,First Functions
str,,Basics,First Functions
%in%,,Basics,Operators
match,,Basics,Operators
=,=,Basics,Operators
<-,=,Basics,Operators
<<-,,Basics,Operators
assign,,Basics,Operators
$,[],Basics,Operators
[],[],Basics,Operators
[[]],[],Basics,Operators
replace,,Basics,Operators
head,,Basics,Operators
tail,,Basics,Operators
subset,,Basics,Operators
with,,Basics,Operators
within,,Basics,Operators
all.equal,,Basics,Comparison
identical,,Basics,Comparison
!=,!=,Basics,Comparison
==,==,Basics,Comparison
>,>,Basics,Comparison
>=,>=,Basics,Comparison
<,<,Basics,Comparison
<=,<=,Basics,Comparison
is.na,,Basics,Comparison
is.nan,,Basics,Comparison
is.finite,,Basics,Comparison
complete.cases,,Basics,Comparison
*,*,Basics,Basic Math
+,+,Basics,Basic Math
-,-,Basics,Basic Math
/,/,Basics,Basic Math
^,^,Basics,Basic Math
%%,mod (%%),Basics,Basic Math
%/%,div,Basics,Basic Math
abs,abs,Basics,Basic Math
sign,sign,Basics,Basic Math
acos,acos,Basics,Basic Math
acosh,acosh,Basics,Basic Math
asin,asin,Basics,Basic Math
asinh,asinh,Basics,Basic Math
atan,atan,Basics,Basic Math
atan2,atan2,Basics,Basic Math
atanh,atanh,Basics,Basic Math
sin,sin,Basics,Basic Math
sinh,sinh,Basics,Basic Math
cos,cos,Basics,Basic Math
cosh,cosh,Basics,Basic Math
tan,tan,Basics,Basic Math
tanh,tanh,Basics,Basic Math
ceiling,ceil,Basics,Basic Math
floor,floor,Basics,Basic Math
round,round,Basics,Basic Math
trunc,trunc,Basics,Basic Math
signif,,Basics,Basic Math
exp,exp,Basics,Basic Math
log,log,Basics,Basic Math
log10,log10,Basics,Basic Math
log1p,log1p,Basics,Basic Math
log2,log2,Basics,Basic Math
logb,,Basics,Basic Math
sqrt,sqrt,Basics,Basic Math
cummax,,Basics,Basic Math
cummin,,Basics,Basic Math
cumprod,cumprod,Basics,Basic Math
cumsum,cumsum,Basics,Basic Math
diff,diff,Basics,Basic Math
max,max,Basics,Basic Math
min,min,Basics,Basic Math
prod,prod,Basics,Basic Math
sum,sum,Basics,Basic Math
range,,Basics,Basic Math
mean,mean,Basics,Basic Math
median,median,Basics,Basic Math
cor,cor_pearson,Basics,Basic Math
cov,cov_pearson,Basics,Basic Math
sd,std,Basics,Basic Math
var,var,Basics,Basic Math
pmax,,Basics,Basic Math
pmin,,Basics,Basic Math
rle,,Basics,Basic Math
function,function,Basics,Functions
missing,,Basics,Functions
on.exit,,Basics,Functions
return,return,Basics,Functions
invisible,,Basics,Functions
&,&,Basics,Logical & Set Operations
|,|,Basics,Logical & Set Operations
!,!,Basics,Logical & Set Operations
xor,,Basics,Logical & Set Operations
all,all,Basics,Logical & Set Operations
any,any,Basics,Logical & Set Operations
intersect,intersect,Basics,Logical & Set Operations
union,union,Basics,Logical & Set Operations
setdiff,,Basics,Logical & Set Operations
setequal,,Basics,Logical & Set Operations
which,find,Basics,Logical & Set Operations
c,[] ({}),Basics,Vectors and Matrices
matrix,[] ({}),Basics,Vectors and Matrices
length,size (length),Basics,Vectors and Matrices
dim,size,Basics,Vectors and Matrices
ncol,"size(x, 1)",Basics,Vectors and Matrices
nrow,"size(x, 2)",Basics,Vectors and Matrices
cbind,hcat,Basics,Vectors and Matrices
rbind,vcat,Basics,Vectors and Matrices
names,,Basics,Vectors and Matrices
colnames,,Basics,Vectors and Matrices
rownames,,Basics,Vectors and Matrices
t,,Basics,Vectors and Matrices
diag,eye,Basics,Vectors and Matrices
sweep,,Basics,Vectors and Matrices
as.matrix,,Basics,Vectors and Matrices
data.matrix,,Basics,Vectors and Matrices
c,[] ({}),Basics,Making Vectors
rep,,Basics,Making Vectors
seq,[from:by:to],Basics,Making Vectors
seq_along,,Basics,Making Vectors
seq_len,[1:len],Basics,Making Vectors
rev,reverse,Basics,Making Vectors
sample,,Basics,Making Vectors
choose,factorial,Basics,Making Vectors
factorial,factorial,Basics,Making Vectors
combn,,Basics,Making Vectors
(is/as).(character/numeric/logical),,Basics,Making Vectors
list,HashTable ([]),Basics,Lists & Data Frames
unlist,,Basics,Lists & Data Frames
data.frame,,Basics,Lists & Data Frames
as.data.frame,,Basics,Lists & Data Frames
split,,Basics,Lists & Data Frames
expand.grid,,Basics,Lists & Data Frames
if,if,Basics,Control Flow
&&,&&,Basics,Control Flow
||,||,Basics,Control Flow
for,for,Basics,Control Flow
while,while,Basics,Control Flow
next,continue,Basics,Control Flow
break,break,Basics,Control Flow
switch,,Basics,Control Flow
ifelse,,Basics,Control Flow
fitted,,Statistics,Linear Models
predict,,Statistics,Linear Models
resid,,Statistics,Linear Models
rstandard,,Statistics,Linear Models
lm,,Statistics,Linear Models
glm,,Statistics,Linear Models
hat,,Statistics,Linear Models
influence.measures,,Statistics,Linear Models
logLik,,Statistics,Linear Models
df,,Statistics,Linear Models
deviance,,Statistics,Linear Models
formula,,Statistics,Linear Models
~,,Statistics,Linear Models
I,,Statistics,Linear Models
anova,,Statistics,Linear Models
coef,,Statistics,Linear Models
confint,,Statistics,Linear Models
vcov,,Statistics,Linear Models
contrasts,,Statistics,Linear Models
apropos('\\.test$'),,Statistics,Miscellaneous Statistical Tests
beta,beta,Statistics,Random Numbers
binom,binom,Statistics,Random Numbers
cauchy,cauchy,Statistics,Random Numbers
chisq,chisq,Statistics,Random Numbers
exp,exp,Statistics,Random Numbers
f,f,Statistics,Random Numbers
gamma,gamma,Statistics,Random Numbers
geom,geom,Statistics,Random Numbers
hyper,hyper,Statistics,Random Numbers
lnorm,lnorm,Statistics,Random Numbers
logis,logis,Statistics,Random Numbers
multinom,multinom,Statistics,Random Numbers
nbinom,nbinom,Statistics,Random Numbers
norm,norm,Statistics,Random Numbers
pois,pois,Statistics,Random Numbers
signrank,signrank,Statistics,Random Numbers
t,t,Statistics,Random Numbers
unif,unif (rand),Statistics,Random Numbers
weibull,weibull,Statistics,Random Numbers
wilcox,wilcox,Statistics,Random Numbers
birthday,birthday,Statistics,Random Numbers
tukey,tukey,Statistics,Random Numbers
crossprod,*,Statistics,Matrix Algebra
tcrossprod,*,Statistics,Matrix Algebra
eigen,eig,Statistics,Matrix Algebra
qr,qr,Statistics,Matrix Algebra
svd,svd,Statistics,Matrix Algebra
%*%,*,Statistics,Matrix Algebra
%o%,,Statistics,Matrix Algebra
outer,,Statistics,Matrix Algebra
rcond,,Statistics,Matrix Algebra
solve,\,Statistics,Matrix Algebra
duplicated,,Statistics,Ordering and Tabulating
unique,,Statistics,Ordering and Tabulating
merge,,Statistics,Ordering and Tabulating
order,,Statistics,Ordering and Tabulating
rank,,Statistics,Ordering and Tabulating
quantile,quantile,Statistics,Ordering and Tabulating
sort,sort,Statistics,Ordering and Tabulating
table,,Statistics,Ordering and Tabulating
ftable,,Statistics,Ordering and Tabulating
ls,whos,Working with R,Workspace
exists,,Working with R,Workspace
get,,Working with R,Workspace
rm,,Working with R,Workspace
getwd,getcwd,Working with R,Workspace
setwd,setcwd,Working with R,Workspace
q,Ctrl-D,Working with R,Workspace
source,load,Working with R,Workspace
install.packages,,Working with R,Workspace
library,,Working with R,Workspace
require,,Working with R,Workspace
help,help,Working with R,Help
?,help,Working with R,Help
help.search,,Working with R,Help
apropos,,Working with R,Help
RSiteSearch,,Working with R,Help
citation,,Working with R,Help
demo,,Working with R,Help
example,,Working with R,Help
vignette,,Working with R,Help
traceback,,Working with R,Debugging
browser,,Working with R,Debugging
recover,,Working with R,Debugging
options(error =),,Working with R,Debugging
stop,,Working with R,Debugging
warning,,Working with R,Debugging
message,,Working with R,Debugging
tryCatch,try/catch,Working with R,Debugging
try,try,Working with R,Debugging
print,print (println),I/O,Output
cat,,I/O,Output
message,,I/O,Output
warning,,I/O,Output
dput,,I/O,Output
format,,I/O,Output
sink,,I/O,Output
data,,I/O,Reading and Writing Data
count.fields,,I/O,Reading and Writing Data
read.csv,csvread,I/O,Reading and Writing Data
read.delim,dlmread,I/O,Reading and Writing Data
read.fwf,,I/O,Reading and Writing Data
read.table,,I/O,Reading and Writing Data
library(foreign),,I/O,Reading and Writing Data
write.table,dlmwrite,I/O,Reading and Writing Data
readLines,readlines,I/O,Reading and Writing Data
writeLines,,I/O,Reading and Writing Data
load,,I/O,Reading and Writing Data
save,,I/O,Reading and Writing Data
readRDS,,I/O,Reading and Writing Data
saveRDS,,I/O,Reading and Writing Data
dir,,I/O,Files and Directories
basename,,I/O,Files and Directories
dirname,,I/O,Files and Directories
file.path,,I/O,Files and Directories
path.expand,,I/O,Files and Directories
file.choose,,I/O,Files and Directories
file.copy,,I/O,Files and Directories
file.create,,I/O,Files and Directories
file.remove,,I/O,Files and Directories
path.rename,,I/O,Files and Directories
dir.create,,I/O,Files and Directories
file.exists,,I/O,F[…]
R_bloggers  programming  statistics  from google
7 weeks ago by rahuldave
Profile of the Data Journalist: The Human Algorithm
Around the globe, the bond between data and journalism is growing stronger. In an age of big data, the growing importance of data journalism lies in the ability of its practitioners to provide context, clarity and, perhaps most important, find truth in the expanding amount of digital content in the world. In that context, data journalism has profound importance for society.

To learn more about the people who are doing this work and, in some cases, building the newsroom stack for the 21st century, I conducted a series of email interviews during the 2012 NICAR Conference.

Ben Welsh (@palewire) is a Web developer and journalist based in Los Angeles. Our interview follows.


Where do you work now? What is a day in your life like?

I work for the Los Angeles Times, a daily newspaper and 24-hour Web site based in Southern California. I'm a member of the Data Desk, a team of reporters and Web developers that specializes in maps, databases, analysis and visualization. We both build Web applications and conduct analysis for reporting projects.

I like to compare The Times to a factory, a factory that makes information. Metaphorically speaking, it has all sorts of different assembly lines. Just to list a few, one makes beautifully rendered narratives, another makes battleship-like investigative projects.

A typical day involves juggling work on different projects, mentally moving from one assembly line to the other. Today I patched an embryonic open-source release, discussed our next move on a pending public records request, guided the real-time publication of results from the GOP primaries in Michigan and Arizona, and did some preparation for how we'll present a larger dump of results on Super Tuesday.

How did you get started in data journalism? Did you get any special degrees or certificates?

I'm thrilled to see new-found interest in "data journalism" online. It's drawing young, bright people into the field and involving people from different domains. But it should be said that the idea isn't new.

I was initiated into the field as a graduate student at the Missouri School of Journalism. There I worked at the National Institute for Computer-Assisted Reporting, also known as NICAR. Decades before anyone called it "data journalism," a disparate group of misfit reporters discovered that the data analysis made possible by computers enabled them to do more powerful investigative reporting. In 1989, they founded NICAR, which has, for decades, been teaching data skills to journalists and nurturing a tribe of journalism geeks. In the time since, computerized data analysis has become a dominant force in investigative reporting, responsible for a large share of the field's best work.

To underscore my point, here's a 1986 Time magazine article about how "newsmen are enlisting the machine."

Did you have any mentors? Who? What were the most important resources they shared with you?

My first journalism job was in Chicago. I got a gig working for two great people there, Carol Marin and Don Moseley, who have spent most of their careers as television journalists. I worked as their assistant. Carol and Don are warm people who are good teachers, but they are also excellent at what they do. There was a moment when I realized, "Hey, I can do this!" It wasn't just something I heard about in class, but something I could actually see myself doing.

At Missouri, I had a great classmate named Brian Hamman, who is now at the New York Times. I remember seeing how invested Brian was in the Web, totally committed to Web development as a career path. When an opportunity opened up to be a graduate assistant at NICAR, Brian encouraged me to pursue it. I learned enough SQL to help do farmed-out investigative work for TV stations. And, more importantly, I learned that if you had technical skills you could get the job to work on a cool story.

After that I got a job doing data analysis at the Center for Public Integrity in Washington DC. I had the opportunity to work on investigative projects, but also the chance to learn a lot of computer programming along the way. I had the guidance of my talented coworkers, Daniel Lathrop, Agustin Armendariz, John Perry, Richard Mullins and Helena Bengtsson. I learned that computer programming wasn't impossible. They taught me that if you have a manageable task, a few friends to help you out and a door you can close, you can figure out a lot.

What does your personal data journalism "stack" look like? What tools could you not live without?

I do my daily development in gedit text editor, Byobu's slick implementation of the screen terminal and the Chromium browser. And, this part may be hard to believe, but I love Ubuntu Unity. I don't understand what everybody is complaining about.

I do almost all of my data management in the Python Web development framework Django and PostgreSQL's database, even if the work is an exploratory reporting project that will never be published. I find that the structure of the framework can be useful for organizing just about any data-driven project.

I use GitHub for both version-control and project management. Without it, I'd be lost.

What data journalism project are you the most proud of working on or creating?

As we all know, there's a lot of data out there. And, as anyone who works with it knows, most of it is crap. The projects I'm most proud of have taken large, ugly data sets and refined them into something worth knowing: a nut graf in an investigative story, or a data-driven app that gives the reader some new insight into the world around them. It's impossible to pick one. I like to think the best is still, as they say in the newspaper business, TK.

Where do you turn to keep your skills updated or learn new things?

Twitter is a great way to keep up with what is getting other programmers excited. I know a lot of people find social media overwhelming or distracting, but I feel plugged in and inspired by what I find there. I wouldn't want to live without it.

GitHub is another great source. I've learned so much just exploring other people's code. It's invaluable.

Why are data journalism and "news apps" important, in the context of the contemporary digital environment for information?

Computers offer us an opportunity to better master information, better understand each other and better watchdog those who would govern us. At last week's NICAR conference, in a talk called "Human-Assisted Reporting," I tried to talk about some of the ways that simply thinking about the process of journalism as an algorithm can point the way. In my opinion, we should aspire to write code that embodies the idealistic principles and investigative methods of the previous generation. There's all this data out there now, and journalistic algorithms, "robot reporters," can help us ask it tougher questions.
Data  Gov_2.0  Publishing  dataconference  datajournalism  dataproduct  datascience  nicarinterview  opensource  programming  from google
march 2012 by rahuldave
Julia random number generation
Julia is a new programming language for scientific computing. From the Julia site:

Julia is a high-level, high-performance dynamic programming language for technical computing, with syntax that is familiar to users of other technical computing environments. It provides a sophisticated compiler, distributed parallel execution, numerical accuracy, and an extensive mathematical function library. …

I just started playing around with it. I didn’t see functions for non-uniform random number generation so I wrote some as a way to get started.

[Update: there are non-uniform random number generators in Julia, but they have not been added to the documentation yet. See details in this comment.]

Here’s a random number generator for normal (Gaussian) random values:

## return a random sample from a normal (Gaussian) distribution
function rand_normal(mean, stdev)
    if stdev <= 0.0
        error("standard deviation must be positive")
    end
    u1 = rand()
    u2 = rand()
    r = sqrt( -2.0*log(u1) )
    theta = 2.0*pi*u2
    mean + stdev*r*sin(theta)
end

From this you can see Julia is a low-ceremony language: Python-like syntax, you can call common mathematical functions without having to do anything special, etc. You can have explicit return statements, but the preferred style seems to be to let the last line of the function be the implicit return statement.
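
For readers who want to compare, here is a rough Python counterpart of the same function, which uses what is usually called the Box-Muller transform (the post itself does not name the method). The rng argument, the seed and the sample size are choices made for this sketch only; unlike the Julia version, it shifts the first uniform draw into (0, 1] so that the logarithm is always finite.

```python
import math
import random

def rand_normal(mean, stdev, rng=random):
    """Return one normal (Gaussian) sample via the Box-Muller transform."""
    if stdev <= 0.0:
        raise ValueError("standard deviation must be positive")
    u1 = 1.0 - rng.random()   # random() is in [0, 1), so u1 is in (0, 1]
    u2 = rng.random()
    r = math.sqrt(-2.0 * math.log(u1))
    theta = 2.0 * math.pi * u2
    return mean + stdev * r * math.sin(theta)

# Sanity check with a seeded generator: the sample mean and standard
# deviation should land close to the requested 10 and 2.
rng = random.Random(0)
samples = [rand_normal(10.0, 2.0, rng) for _ in range(100_000)]
sample_mean = sum(samples) / len(samples)
sample_sd = math.sqrt(
    sum((x - sample_mean) ** 2 for x in samples) / (len(samples) - 1)
)
print(round(sample_mean, 2), round(sample_sd, 2))
```

Note the explicit return and the colon-plus-indentation blocks, where the Julia version relies on end and an implicit return value.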

My most common mistake so far has been forgetting to close code blocks with end; Julia’s syntax is similar enough to Python that I suppose I think indentation should be sufficient.

I’ve written random number generators for the following probability distributions:

Beta
Cauchy
Chi square
Exponential
Inverse gamma
Laplace (double exponential)
Normal
Student t
Uniform
Weibull

You can find the code here: Non-uniform random number generation in Julia.
Software_development  Julia  Programming  from google
february 2012 by rahuldave
What I Learned After 3 Weeks of Writing Mobile Apps
Towards the end of last year, I realized I was about to bump up against the “use
it or lose it” vacation policy at work which basically means I either had to take
about two weeks of paid vacation or forfeit the vacation. Since I hadn’t planned the
time off I immediately became worried about what to do with all that idle time especially
since if left to my own devices I’d play 80 straight hours of Modern
Warfare 3 without pause.

To make sure the time was productively used I decided to write a mobile app as a learning
exercise about the world of mobile development since I’ve read so much about it and
part of my day job is building
APIs for developers of mobile apps. I ended up enjoying the experience so much
I added an extra week of vacation and wrote two apps for Windows Phone. I’d originally
planned to write one app for Windows Phone then port it to iOS or Android but gave
up on that due to time constraints after some investigation of both.

I learned a bunch about mobile development from this exercise and a few friends have
asked me to share my thoughts on mobile development in general and building for
Windows Phone using Microsoft platforms in particular. If you are already a mobile
developer then some of this is old hat to you but I did find a bunch of what I learned
to be counterintuitive and fairly eye opening so you might too.

Thoughts on Building Mobile Apps on Any Platform

This section is filled with items I believe are generally applicable if building iOS,
Android or Windows Phone apps. These are mostly things I discovered as part of my
original plan to write one app for all three platforms.

A consistent hardware ecosystem is a force multiplier

After realizing the only option for doing iPhone development on Windows was the Dragon
Fire SDK, which only supports games, I focused on learning as much as I could about Android
development options. The Xamarin guys have MonoTouch which sounded very appealing
to me as a way to leverage C# skills across Android and Windows Phone until I saw
the $400 price tag. :)

One of the things I noticed upon downloading the Android SDK as compared to installing
the Windows Phone SDK is that the Android one came with a bunch of emulators and SDKs
for various specific devices. As I started development on my apps, there were many
times I was thankful for the consistent
set of hardware specifications for Windows Phone. Knowing that the resolution
was always going to be WVGA, so that if something looked good in the emulator it
would look good on my device and those of my beta testers, not only gave peace of mind
but made UX development a breeze.

Compare this to an ecosystem like Android, where the diversity of hardware devices
with varying screen resolutions has
made developers effectively throw up their hands, as in this article quoted by
Jeffrey Zeldman:

If … you have built your mobile site using fixed widths (believing
that you’ve designed to suit the most ‘popular’ screen size), or are planning to serve
specific sites to specific devices based on detection of screen size, Android’s settings
should serve to reconfirm how counterproductive a practice this can be. Designing
to fixed screen sizes is in fact never a good idea…there is just too much variation,
even amongst ‘popular’ devices. Alternatively, attempting to track, calculate, and
adjust layout dimensions dynamically to suit user-configured settings or serendipitous
conditions is just asking for trouble.

Basically, you’re just screwed if you think you can build a UI that will work on all
Android devices. This is clearly not the case if you target Windows Phone or iOS development.
This information combined with my experiences building for Windows Phone convinced
me that it is more likely I’ll buy a Mac and start iOS development than it is that
I’d ever do Android development.

No-name Web Hosting vs. name brands like Amazon Web Services and Windows Azure

One of my apps had a web service requirement and I initially spent some time investigating
both Windows Azure and Amazon Web Services. Since this was a vacation side project
I didn’t want expenses to get out of hand so I was fairly price sensitive. Once I
discovered AWS charged less for Linux servers I spent a day or two getting my Linux
chops up to speed given I hadn’t used it much since the early 2000s. This is where
I found out about yum and
discovered the interesting paradox that discovering and installing software on modern
Linux distros is simultaneously much easier and much harder than doing so on Windows
7. Anyway, that’s a discussion for another day.

I soon realized I had been penny wise and pound foolish when focusing on the cost
of Linux hosting when it turns out what breaks the bank is database hosting. Amazon
charges about $0.11 an hour ($80 a month)
for RDS hosting at the low end. Windows Azure seemed to charge around the same
ballpark when I looked two months ago but it seems they’ve revamped
their pricing site since I did my investigation.

Once I realized database hosting would be the big deciding factor in cost, it made
it easier for me to stick with the familiar and go with WISC instead of LAMP as a
server stack. If I had stuck with LAMP, I could have gone with a provider like Blue Host to
get the entire web platform + database stack for less than $10 with perks like free
credits for Google ads thrown in. With the WISC stack, hosters like Discount ASP and Webhost
4 Life charge in the ballpark of $15 which is about $10 if you swap out SQL Server
for MySQL.

These prices were more my speed. I was quite surprised that even though all the blogs
talk about AWS and Azure, it made the most sense for my bootstrapped apps to start
with a vanilla web host and pay up to ten times less for service than using one of
the name brand cloud computing services. Paying almost $100 a month for services
with elastic scaling properties may make sense if my apps stick around and become
super successful but not at the start.

Another nice side effect of going with a web hosting provider is the reduced complexity
compared to going with a cloud services provider. Anyone who's gone through the AWS
getting started guides after coming from vanilla web hosting knows what I mean.

Facebook advertising beats search ads for multiple app categories

As mentioned above, one of the perks of some of the vanilla hosting providers is that
they throw in free credits for ads on Google AdSense/Adwords and Facebook ads as part
of the bundle. I got to experiment with buying ads on both platforms and I came away
very impressed with what Facebook has built as an advertising platform.

I remember reading a few years ago that MySpace
had taught us social networks are bad for advertisers. Things are very different
in today’s world. With search ads, I can choose to show ads alongside results when
people search for a term that is relevant to my app. With Facebook ads, I get to narrowly
target demographics based on detailed profile attributes such as Georgia Tech alumni
living in New York who have expressed an interest in DC or Marvel comics. The latter
seems absurd at first until you think about an app like Instagram.

No one is searching for "best photo sharing app for the iphone" on Google and even
if you are one of the few people who have, there aren’t a lot of you. On the other
hand, at launch the creators of Instagram could go to Facebook and say we'd like to
show ads to people who have liked or use an and who also have shown an affiliation
for photo sharing apps or sites like Flickr, Camera+, etc then craft specific pitches
for those demographics. I don’t know about you but I know which sounds like it would
be more effective and relevant.

This also reminded me that I'd actually clicked on more ads on Facebook than I've
ever clicked on search ads.

Lots of unfilled niches still exist

I remember being in college back in the day, flipping through my copy of Yahoo!
Internet Life and thinking that we were oversaturated with websites and all the
good ideas were already taken. This was before YouTube, Flickr, SkyDrive, Facebook
or Twitter. Silly me.

The same can be said about mobile apps today. I hear a lot about there being 500,000
apps in the Apple app store and the
same number being in Android Market. To some this may seem overwhelming but there
are clearly still niches that are massively underserved on those platforms and especially
on Windows Phone which just
hit 50,000 apps.

There are a lot of big and small problems in people's lives that can be addressed
by bringing the power of the web to the devices in their pockets in a tailored way.
The one thing I was most surprised by is how many apps haven't been written that you'd
expect to exist just from extrapolating what we have on the Web and the offline world
today. I don't just mean geeky things like a
non-propeller head way to share bookmarks from my desktop to my phone and vice versa
without emailing myself but instead applications that would enrich the lives of
millions of regular people out there that they'd be gladly willing to pay $1 for (less
than the price of most brands of bubble gum these days).

If you are a developer, don't be intimidated by the size of the market nor be attracted
to the stories of the folks who've won the lottery by gambling on being in the right
place at the right time with the right gimmick (fart
apps, sex position guides and yet
another photo sharing app). There are a lot of problems that can be solved or
pleasant ways to pass the time on a mobile device that haven’t yet been built. Look
around at your own life and talk to your non-technical friends about their days. There
is lots of inspiration out there if you just look for it.

Look for Platforms that Favor User Experience over Developer Experience

One of the topics I’ve wanted to write about in this blog is how my ge[…]
Programming  Web_Development  from google
january 2012 by rahuldave
Four short links: 28 December 2011
Terrier IR -- open source (Mozilla) text search engine, now with Hadoop support.
s3ql -- open source (GPLv3) Linux filesystem which stores its data on Google Storage, Amazon S3, or OpenStack. (via Adam Shand)
Esprima -- open source (BSD) fast Javascript parser in Javascript. (via Javascript Weekly)
Hogan.js -- open source (Apache) Javascript templating engine from Twitter. If it proves anywhere near as good as Bootstrap, it'll be heavily used.
cloud  javascript  opensource  programming  search  storage  textanalysis  web  from google
december 2011 by rahuldave
Why Was Hypercard Killed?
theodp writes "Steve Jobs took the secret to his grave, but Stanislav Datskovskiy offers some interesting and illustrated speculation on why HyperCard had to die. 'Jobs was almost certainly familiar with HyperCard and its capabilities,' writes Datskovskiy. 'And he killed it anyway. Wouldn't you love to know why? Here's a clue: Apple never again brought to market anything resembling HyperCard. Despite frequent calls to do so. Despite a more-or-less guaranteed and lively market. And I will cautiously predict that it never will again. The reason for this is that HyperCard is an echo of a different world. One where the distinction between the "use" and "programming" of a computer has been weakened and awaits near-total erasure. A world where the personal computer is a mind-amplifier, and not merely an expensive video telephone. A world in which Apple's walled garden aesthetic has no place.' Slashdotters have bemoaned the loss of HyperCard over the past decade, but Datskovskiy ends his post on a keep-hope-alive note, saying: 'Contemplate the fact that what has been built once could probably be built again.' Where have you gone, Bill Atkinson, a nation of potential programmers turns its lonely eyes to you."


Read more of this story at Slashdot.
programming  from google
november 2011 by rahuldave
Fundamental theorem of code readability
In The Art of Readable Code, the authors call the following the “Fundamental Theorem of Readability”:

Code should be written to minimize the time it would take for someone else to understand it.

They go on to explain

And when we say “understand,” we have a very high bar … they should be able to make changes to it, spot bugs, and understand how it interacts with the rest of your code.
Software_development  Books  Programming  from google
november 2011 by rahuldave
Separating presentation from content
In the late ’90s I went to a fair number of Microsoft presentations. One presentation would say “The problem with Technology X is that it mixes presentation and content. We’ve introduced Technology Y to make your code cleaner, separating presentation and content.” A few months later I’d be at another presentation that would announce “The problem with Technology Y is that it mixes presentation and content. We’ve introduced Technology Z …” (Does this remind anyone else of The Cat in the Hat Comes Back?)

When I first learned LaTeX, I was told that one of its strengths is that it separates presentation and content. Then a few years later I hear complaints that the problem with LaTeX is that it mingles presentation and content, unlike XHTML. A few years later, guess what? XHTML mixes presentation and content, so we need something else.

I shut down when I hear someone announce that everything before their product was bad because it mixed presentation and content, and now with their solution, presentation and content will be completely separate.

Sometimes one technology really does make a cleaner separation of presentation and content. But at best the separation is relative. LaTeX separates presentation and content more than Word, though not as much as well-written HTML and CSS, maybe. But presentation and content cannot be entirely separated. Nor is there unanimous agreement on what exactly the dividing line is between the two.

Many people don’t want to separate their presentation and content. They don’t understand why this would be desirable, and they’ll fight against anything designed to encourage separation. Maybe they need to learn the advantages, or maybe they’re just doing the best they can to get their job done and they can’t be bothered with long term advantages that may not materialize.

The principle of separating presentation and content is admirable. It really does have advantages, but it’s easier said than done.
Software_development  LaTeX  Programming  from google
november 2011 by rahuldave
Microsoft Roslyn: Reinventing the Compiler As We Know It
snydeq writes "Fatal Exception's Neil McAllister sees Microsoft's Project Roslyn potentially reinventing how we view compilers and compiled languages. 'Roslyn is a complete reengineering of Microsoft's .NET compiler toolchain in a new way, such that each phase of the code compilation process is exposed as a service that can be consumed by other applications,' McAllister writes. 'The most obvious advantage of this kind of "deconstructed" compiler is that it allows the entire compile-execute process to be invoked from within .NET applications. With the Roslyn technology, C# may still be a compiled language, but it effectively gains all the flexibility and expressiveness that dynamic languages such as Python and Ruby have to offer.'"


Read more of this story at Slashdot.
programming  from google
october 2011 by rahuldave
Sed one-liners
A few weeks ago I reviewed Peteris Krumins’ book Awk One-Liners Explained. This post looks at his sequel, Sed One-Liners Explained.

The format of both books is the same: one-line scripts followed by detailed commentary. However, the sed book takes more effort to read because the content is more subtle. The awk book covers the most basic features of awk, but the sed book goes into the more advanced features of sed.

Sed One-Liners Explained provides clear explanations of features I found hard to understand from reading the sed documentation. If you want to learn sed in depth, this is a great book. But you may not want to learn sed in depth; the oldest and simplest parts of sed offer the greatest return on time invested. Since the book is organized by task — line numbering, selective printing, etc — rather than by language feature, the advanced and basic features are mingled.

On the other hand, there are two appendices organized by language feature. Depending on your learning style, you may want to read the appendices first or jump into the examples and refer to the appendices only as needed.

For a sample of the book, see the table of contents, preface, and first chapter here.

Related links:

Learn one sed command
Daily tips on sed and awk
Software_development  Books  Programming  Sed  from google
september 2011 by rahuldave
Client-side Web REPL For 15+ Languages
In his first accepted submission, MaxShaw writes "repl.it is an online REPL that supports running code in 15+ languages, from Ruby to Scheme to QBasic, in the browser. It is intended as a tool for learning new languages and experimenting with code on the go. All the code is open sourced under the MIT license and available from GitHub."

A few of the languages are supported by reusing existing "Foolang in Javascript" interpreters, but a number of them are built using Emscripten (previously used to build Doom for the browser). All evaluation occurs client side, but saved sessions are stored on their server.


Read more of this story at Slashdot.
programming  from google
september 2011 by rahuldave
Learn one sed command
You may have seen sed programs even if you didn’t know that’s what they were. In online discussions it’s common to hear someone say

s/foo/bar/
as a shorthand to mean “replace foo with bar.” The line s/foo/bar/ is a complete sed program to do such a replacement.

sed comes with every Unix-like operating system and is available for Windows here. It has a range of features for editing files, but sed is worth using even if you only know how to do one thing with it:

sed "s/pattern1/pattern2/g" file.txt > newfile.txt
This will replace every instance of pattern1 with pattern2 in the file file.txt and will write the result to newfile.txt. The original file file.txt is unchanged.

I used to think there was no reason to use sed when other languages like Python will do everything sed does and much more. Suppose you agree with that. Now suppose you find you often have to make global search-and-replace operations and so you write a script to do this, say a Python script. You’ve got to call your script something, remember what you called it, and put it in your path. How about calling it sed? Or better, don’t write your script, but pretend that you did. If you’re on Linux, it’s already in your path. One advantage of the real sed over your script named sed is that the former can do a lot more, should you ever need it to.

Now for a few details regarding the sed command above. The “s” on the front stands for “substitute” and the “g” on the end stands for “global.” Without the “g” on the end, sed would only replace the first instance of the pattern on each line. If that’s what you want, then remove the “g.”
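To make the role of the trailing “g” concrete, here is a minimal Python sketch of the kind of search-and-replace script imagined above. It is not how sed is implemented: the function name substitute and its global_replace parameter are made up for illustration, and Python's re syntax differs from sed's default regular expression syntax.

```python
import re

def substitute(text, pattern, replacement, global_replace=True):
    """Apply a sed-style s/pattern/replacement/ to each line of text.

    With global_replace=False, only the first match on each line is
    replaced, mirroring sed's behavior without the trailing "g".
    """
    count = 0 if global_replace else 1  # re.sub treats count=0 as "replace all"
    lines = text.split("\n")
    return "\n".join(
        re.sub(pattern, replacement, line, count=count) for line in lines
    )

sample = "foo foo\nfoo bar foo"
print(substitute(sample, "foo", "baz"))
print(substitute(sample, "foo", "baz", global_replace=False))
```

The first call rewrites every foo on every line; the second leaves the later matches on each line alone, which is what dropping the “g” does.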

The patterns inside a sed command are regular expressions, so it’s best to get in the habit of always quoting sed commands. This isn’t necessary for simple string substitutions, but regular expressions often contain characters that you’ll need to prevent the shell from interpreting.

You may find the default regular expression support in sed odd or restrictive. If you’re used to regular expressions in Perl, Python, JavaScript, etc. and you’re using a GNU implementation of sed, you can add the -r option for more familiar regular expression syntax.

I got the idea for this post from Greg Grothaus’ post Why you should learn just a little Awk. He makes a good case that you can benefit from learning just a few commands of a language like Awk with no intention to learn more of the language.

Related posts:

Good old regular expressions
Tips for learning regular expressions
A little awk
Software_development  Programming  Regular_expressions  from google
april 2011 by rahuldave
Kod is a Free Text Editor Designed for Programmers [Downloads]
Mac OS X: Kod is a simple and free OS X text editor geared toward programmers that offers easy navigation and terminal integration. More »
Downloads  Mac_OS_X  Mac_OS_X_Featured_Download  Programming  Text_Editors  from google
january 2011 by rahuldave
How will the elmcity service scale? Like the web!
During a recent talk at Harvard's Berkman Center, Scott MacLeod asked (via the IRC backchannel): "How does the elmcity service scale?" He wondered, in particular, whether the service could support an online university like the World University and School that might produce an unlimited number of class schedules.

My short answer was that the elmcity service scales like the web. But what does that really mean? I promised Scott that I'd spell it out here. We'll start with an analogy. As I mentioned in The power of informal contracts, the elmcity project envisions a web of calendar feeds that's analogous to the blogosphere's web of RSS and Atom feeds. We take for granted that the blogosphere scales like the web. A blog feed is just a special kind of web page. Anybody can create a blog and publish its feed at some URL. Why not calendars too? We haven't thought about them in the same way, but the ICS (iCalendar) files that our calendar programs export are the moral equivalents of the RSS and Atom feeds that our blog publishing tools export. Anybody can create a calendar and publish its feed at some URL.

These webs -- of HTML pages, of blog feeds, of calendar feeds -- are notionally webs of peers. We can all publish, and we can all read, without relying on a central authority or privileged hub. There are, to be sure, powerful centralized services. My blog, for example, is one of millions hosted at wordpress.com, aggregated by Bloglines and Google Reader, and indexed by Google and Bing. But these services, while convenient, are optional. So long as we can publish our blogs somewhere online, advertise their URLs, and get the DNS to resolve their domain names, we can have a working blogosphere. The necessary and sufficient condition is that we can all publish resources (e.g., pages and feeds), and that we can all access those resources.

For the calendarsphere that I envision, a service like elmcity is likewise optional. Let's suppose that the World University and School succeeds wildly. At any given moment there are tens of thousands of courses on offer, each with its own course page and also with its own calendar. Instructors publish course pages using any web publishing tool, and also publish calendars using any calendar publishing tool -- Google Calendar, or Outlook, or Apple iCal, or another calendar program. Students pick schedules of courses, bookmark the course pages, and load the course calendars into any of these same calendar programs. The calendar software merges the separate course calendars and combines them with the students' personal calendars. These calendar programs are thus aggregators of calendar feeds in the same way that feedreaders like NetNewsWire or Google Reader are aggregators of blog feeds.

Given a baseline web of peers, it's useful to be able to merge our individual views of them into pooled spaces. NetNewsWire is a personal feedreader, but Google Reader is social. In the pool created by Google Reader, data finds data and people find people. The elmcity service aims to create that same kind of effect in the realm of public calendar events. When we pool our separate calendars, we publicize the events that we are promoting, we discover events that others are promoting, and we see all our public events on common timelines.

What constrains our ability to scale out pools of calendars? Let's continue the analogy to the blogosphere. Google Reader constitutes one pooled space for blog feeds, Bloglines another. Because the data aggregated by these services conforms to open standards (i.e., RSS and Atom), other services can create blog pools too. Likewise in the calendarsphere, Google Calendar is one way to pool calendars, the elmcity service is another, Calagator is a third. Others can play too.

How can we scale these providers of calendar pools? Along one axis, each provider needs to be able to grow its computing power. Google Calendar scales on this axis by using Google's cloud platform. The elmcity service uses Azure, the Microsoft cloud platform. Note that elmcity, unlike Google Calendar, is an open source service. That means you could run your own instance of it, using your own Azure account, but you'd still be relying on the Azure compute fabric.

Calagator, based on Ruby on Rails, could be deployed either to a conventional hosting environment or to a cloud platform. It would thus scale, along the compute axis, as either environment allows. The elmcity service could be used in this way too. The service is written for Azure, but the core aggregation engine is independent of Azure and could be deployed to a conventional hosting environment.

For feed aggregators, another axis of scale is the number of feeds that can be processed. When that number grows, the time required to connect to many feeds and ingest their contents becomes a constraint. The elmcity service currently supports 50 calendar hubs. Thrice daily, each hub pulls data from Eventful, Upcoming, Eventbrite, Facebook, and a list of iCalendar feeds. So far a single Azure worker role can easily do all this work. I'll dial up the number of workers if needed, but first I want to squeeze as much parallelism as I can out of each worker. To that end, I recently upgraded to the 4.0 version of the .NET Framework in order to exploit its dramatically simplified parallel processing. In this week's companion article I show how the elmcity service uses that new capability to optimize the time required to gather feeds from many sources.
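As a rough illustration of squeezing parallelism out of a single worker, here is a minimal Python sketch that fetches many feeds concurrently. It is not the elmcity code, which is written for .NET; fetch_feed is a hypothetical stand-in that fabricates a result instead of touching the network, and gather_feeds and max_workers are names chosen for this example.

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_feed(url):
    """Hypothetical stand-in for downloading and parsing one calendar feed."""
    return {"url": url, "events": [f"event from {url}"]}

def gather_feeds(urls, max_workers=8, fetch=fetch_feed):
    """Fetch many feeds concurrently; results come back in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch, urls))

feeds = gather_feeds(["http://example.org/a.ics", "http://example.org/b.ics"])
print([feed["url"] for feed in feeds])
```

Threads suit this kind of job because the time goes to waiting on network connections rather than to computation.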

Pub/sub networks can also scale by coalescing feeds. Consider a calendar hub operated, for some city, by the online arm of that city's newspaper. One model is flat. The newspaper runs a hub whose registry lists all the calendar feeds in town. But another model is hierarchical. In that model, there's a hub for arts and culture, a hub for sports and recreation, a hub for city government, and so on. Each hub gathers events from many feeds, and publishes the merged result on its own website for its own constituency. If the newspaper wants to include all those feeds, it can list them individually in its own registry. But why aggregate arts, sports, or recreation feeds more than once? The newspaper's uber-hub can, instead, reuse the arts, sports, and recreation feeds curated by those respective hubs, adding their merged outputs to its own set of curated feeds. Such reuse can cut down the computational time and effort required to propagate feeds throughout the network.
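The hierarchical model can be sketched in a few lines of Python. This is only an illustration under assumed data structures: registries is a hypothetical mapping from a hub name to a list of entries, each entry being either a feed URL or the name of another hub, and the sketch assumes hubs do not refer to each other in a cycle.

```python
def merged_feeds(hub, registries, cache=None):
    """Return the set of feeds reachable from a hub, merging each hub only once."""
    if cache is None:
        cache = {}
    if hub in cache:
        return cache[hub]
    feeds = set()
    for entry in registries[hub]:
        if entry in registries:
            # Entry names another hub: reuse its merged output.
            feeds |= merged_feeds(entry, registries, cache)
        else:
            # Entry is an individual feed URL.
            feeds.add(entry)
    cache[hub] = feeds
    return feeds

registries = {
    "arts": ["opera.ics", "gallery.ics"],
    "sports": ["soccer.ics"],
    "newspaper": ["arts", "sports", "cityhall.ics"],
}
print(sorted(merged_feeds("newspaper", registries)))
```

The newspaper hub lists two other hubs and one feed of its own, so the arts and sports feeds are curated and merged in one place and simply reused.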

None of these mechanisms will matter, though, until a vibrant ecosystem of calendar feeds requires them. That's the ultimate constraint. Scaling the calendarsphere isn't a problem yet, but it would be a good problem to have. First, though, we've got to light up a whole bunch of feeds.

Related:

The iCalendar chicken-and-egg conundrum
Developing intuitions about data
Personal data stores and pub/sub networks
The principle of indirection
See all Radar elmcity stories
See all Answers elmcity stories
Programming  blog  calendar  elmcity  feed  syndication  from google
december 2010 by rahuldave
What Every Programmer Should Know About Floating-Point Arithmetic
-brazil- writes "Every programmer forum gets a steady stream of novice questions about numbers not 'adding up.' Apart from repetitive explanations, SOP is to link to a paper by David Goldberg which, while very thorough, is not very accessible for novices. To alleviate this, I wrote The Floating-Point Guide, as a floating-point equivalent to Joel Spolsky's excellent introduction to Unicode. In doing so, I learned quite a few things about the intricacies of the IEEE 754 standard, and just how difficult it is to compare floating-point numbers using an epsilon. If you find any errors or omissions, you can suggest corrections."
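A short Python illustration of the two problems mentioned in the summary, numbers not "adding up" and the difficulty of epsilon comparisons. This sketch uses only the standard library and is not taken from The Floating-Point Guide.

```python
import math

a = 0.1 + 0.2
print(a == 0.3)              # False: neither value is exactly representable in binary
print(abs(a - 0.3) < 1e-9)   # True, but a fixed epsilon is wrong for very large or very small values
print(math.isclose(a, 0.3))  # True: compares using a relative tolerance

# A relative tolerance alone fails near zero, so an absolute tolerance is needed too.
print(math.isclose(1e-12, 0.0))                # False
print(math.isclose(1e-12, 0.0, abs_tol=1e-9))  # True
```

No single tolerance works everywhere, which is why the comparison has to be chosen with the expected magnitude of the values in mind.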


Read more of this story at Slashdot.
programming  from google
may 2010 by rahuldave
Is R an ‘epic fail’?
Something as popular and widespread as R can hardly be called a ‘failure’ in any meaningful sense, so of course the question is really in which aspects R is inferior to alternatives.

For most users who need a bit of data analysis, it is probably a poor first choice. R is a programming language with a lot of statistical and data visualisation support, but it is a programming language.  If you don’t want to do any programming, don’t muck about with R!  There are lots of visualisation tools and statistical tools that are much easier to use.

Of course, without a bit of programming, you are limited to what those tools can do, so if you need analysis that is not provided, you need to either find a programmer or learn how to program, and for the latter, R isn’t a bad choice.

You can get pretty far with very little effort in R, once you have learned how to program. Now learning how to program does require quite a bit of effort, but if you need to, there really isn’t any way around it, just as there isn’t any Royal Road to mathematics (as Euclid is supposed to have said).

Sure, as a programming language R has its idiosyncrasies, but which programming languages do not?
Work  programming  R  statistics  from google
april 2010 by rahuldave
feature: Tutorial: consuming Twitter's real-time stream API in Python
Twitter is preparing to launch several impressive new features, including a new streaming API that will give desktop client applications real-time access to the user's message timeline. The new streaming API was announced last week at Twitter's Chirp conference, where it was made available to conference attendees on-site for some preliminary experimentation. Twitter opened it up to the broader third-party developer community on Monday so that programmers can begin testing it to offer informed feedback.

This tutorial will show you how to consume and process data from Twitter's new streaming API. The code examples, which are written in the Python programming language, demonstrate how to establish a long-lived HTTP connection with PyCurl, buffer the incoming data, and process it to perform the basic message display functions of a Twitter client application. We will also take a close look at how the new streaming API differs from the existing polling-based REST API.
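The buffering step can be sketched without any network code. The following is a minimal Python illustration, not the tutorial's own code: the StreamBuffer class and its method names are invented here, and it assumes that each message arrives as one JSON object terminated by "\r\n", with blank lines sent as keep-alives.

```python
import json

class StreamBuffer:
    """Accumulate raw chunks and emit each complete, delimiter-terminated JSON message."""

    def __init__(self, on_message):
        self.buffer = ""
        self.on_message = on_message

    def feed(self, chunk):
        self.buffer += chunk
        # Assumption: messages are separated by "\r\n"; a chunk may end mid-message.
        while "\r\n" in self.buffer:
            line, self.buffer = self.buffer.split("\r\n", 1)
            if line.strip():  # skip blank keep-alive lines
                self.on_message(json.loads(line))

received = []
stream = StreamBuffer(received.append)
stream.feed('{"text": "hel')
stream.feed('lo"}\r\n\r\n{"text": "world"}\r\n')
print([message["text"] for message in received])
```

On a long-lived connection the data arrives in arbitrary chunks, so a message may be split across two calls to feed. The buffer holds the incomplete tail until its delimiter shows up.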





Features  Guides  Open-source  Web  programming  python  tutorial  twitter  from google
april 2010 by rahuldave
On code and comments…
I’ve never been a big fan of comments in code.  Mainly because I too often have seen comments explaining the trivial and ignoring the complex…

In most cases, clear code eliminates the need for comments, as discussed here.

I used to think commenting my code was the responsible thing to do. I used to think that I should have a comment for just about every line of code that I wrote. After my first read of Code Complete, my views changed pretty drastically.

I began to value good names over comments. As my experience has increased, I have realized more and more that comments are actually bad.

Actually, Code Complete has a more nuanced discussion on commenting code, but still…

Comments are often not needed, because they just rephrase what you can already read in the code. If at all possible, make the code easier to read rather than explain it in comments.

When comments are needed, they explain design decisions that are not obvious from the code. Then there is too often the risk that the design has changed since the comment was written and that is really worse than no comment.

Still, it is when it comes to design decisions that I often miss documentation. Especially when it comes to complex class hierarchies and object interactions where there are clearly some underlying design decisions about how the objects are supposed to interact and how new classes should be added to the hierarchy to extend the code.

I rarely find that stuff documented, though.  At best I am told that for function add(a,b), “a and b are input” and “add(a,b) returns a+b” or something obvious like that…  or that the class “AbstractVisitor” is an abstract visitor class.  Duh!

I would love it if people would stop commenting the obvious but start explaining their design decisions…
Rants  Work  programming  from google
april 2010 by rahuldave
85% functional language purity
James Hague offers this assessment of functional programming:

My real position is this: 100% pure functional programing doesn’t work. Even 98% pure functional programming doesn’t work. But if the slider between functional purity and 1980s BASIC-style imperative messiness is kicked down a few notches — say to 85% — then it really does work. You get all the advantages of functional programming, but without the extreme mental effort and unmaintainability that increases as you get closer and closer to perfectly pure.

I found James Hague’s blog via a link from Greg Wilson. I’ve gone back through several posts on Hague’s blog Programming in the 21st Century and look forward to reading more.

Software_development  Functional_programming  Programming  from google
april 2010 by rahuldave
Four short links: 5 April 2010
Wrong about the iPad (Tim Bray) -- I am actively ignoring the iPad drivel, but this line caught my eye: Intelligence is a text-based application.
Fertile Medium -- online community consultancy, from the first and former Flickr community coordinator. One to watch: Heather and Derek really know their community. Again I say it: understanding of how open source and other collaborative communities can function is rare and valuable. (via waxy)
pigz -- parallel gzip implementation. Voom voom, so fast! (via kellan on Delicious)
Prefab: What If We Could Modify Any Interface? -- screen-scraping for GUIs to bolt on new functionality to user interfaces. This is incredible. Watch the demo, it's impressive!
brains  community  hacks  opensource  programming  ui  from google
april 2010 by rahuldave
Chris Howie: git-svn in the workplace
At work, we use Subversion for source control. This is quite the popular VCS, but I’ve grown accustomed to (and much prefer) Git. Don’t get me wrong, SVN has its advantages, but since using Git my workflow has changed quite radically, and it’s difficult to revert to the rather inflexible and tedious SVN [...]
Git  Programming  from google
april 2010 by rahuldave
The Next Ten One-Liners from CommandLineFu Explained
Here are the next ten top one-liners from the commandlinefu website. The first post about the topic became massively popular and received over 100,000 views in the first two days.

Before I dive into the next ten one-liners, I want to take the chance and promote the other three article series on one-liners that I have written:

Awk One-Liners Explained (4 part article).
Sed One-Liners Explained (3 part article).
Perl One-Liners Explained (9 part article, work in progress).

Alright, so here are today’s one-liners:

#11. Edit the command you typed in your favorite editor
$ command <CTRL-x CTRL-e>
This one-liner opens the command you have typed so far in your favorite text editor for further editing. This is handy if you are typing a lengthier shell command. After you are done editing the command, quit your editor successfully to execute it. To cancel execution, just erase it. If you quit unsuccessfully, the command you had typed before diving into the editor will be executed.

Actually, I have to educate you, it’s not a feature of the shell per se but a feature of the readline library that most shells use for command line processing. This particular binding CTRL-x CTRL-e only works in readline emacs editing mode. The other mode is readline vi editing mode, in which the same can be accomplished by pressing ESC and then v.

The emacs editing mode is the default in all the shells that use the readline library. The usual command to change between the modes is set -o vi to change to vi editing mode and set -o emacs to change back to emacs editing mode.

To change the editor, export the $EDITOR shell variable to your preference. For example, to set the default editor to pico, type export EDITOR=pico.

Another way to edit commands in a text editor is to use the fc shell builtin (at least bash has this builtin). The fc command opens the previously executed command in your favorite text editor. It’s easy to remember the fc command because it stands for “fix command.”

Remember the ^foo^bar^ command from the first top ten one-liners? You can emulate this behavior by typing fc -s foo=bar. It will replace foo with bar in the previous command and execute it.

#12. Empty a file or create a new file
$ > file.txt
This one-liner either wipes the file called file.txt empty or creates a new file called file.txt.

The shell first checks if the file file.txt exists. If it does, the shell opens it and wipes it clean. If it doesn’t exist, the shell creates the file and opens it. Next the shell proceeds to redirecting standard output to the opened file descriptor. Since there is nothing on the standard output, the command succeeds, closes the file descriptor, leaving the file empty.

Creating a new empty file is also called touching and can be done with the touch file.txt command. The touch command can also be used for changing the timestamps of files. Touch, however, won’t wipe the file clean; it will only change the access and modification timestamps to the current time.
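To make the difference between redirecting into a file and touching it concrete, here is a minimal Python sketch. It is not what the shell does internally; the function names truncate, touch and demo are mine, and file.txt lives in a throwaway temporary directory.

```python
import os
import tempfile
from pathlib import Path


def truncate(path):
    """Roughly what `> file` does: create the file if missing, else empty it."""
    with open(path, "w"):
        pass  # opening in "w" mode truncates; nothing is written


def touch(path):
    """Roughly what `touch file` does: create if missing, else update timestamps."""
    Path(path).touch()


def demo():
    """Return the file size after touching and after truncating."""
    with tempfile.TemporaryDirectory() as tmp:
        target = os.path.join(tmp, "file.txt")
        Path(target).write_text("some data\n")
        touch(target)
        size_after_touch = os.path.getsize(target)
        truncate(target)
        size_after_truncate = os.path.getsize(target)
    return size_after_touch, size_after_truncate
```

Calling demo() should show that the contents survive touch but not truncate.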

#13. Create a tunnel from localhost:2001 to somemachine:80
$ ssh -N -L2001:localhost:80 somemachine
This one-liner creates a tunnel from your computer’s port 2001 to somemachine’s port 80. Each time you connect to port 2001 on your machine, your connection gets tunneled to somemachine:80.

The -L option can be summarized as -L port:host:hostport. Whenever a connection is made to localhost:port, the connection is forwarded over the secure channel, and a connection is made to host:hostport from the remote machine.

The -N option makes sure you don’t run a shell as you connect to somemachine.

To make things more concrete, here is another example:

$ ssh -f -N -L2001:www.google.com:80 somemachine
This one-liner creates a tunnel from your computer’s port 2001 to www.google.com:80 via somemachine. Each time you connect to localhost:2001, ssh tunnels your request via somemachine, where it tries to open a connection to www.google.com.

Notice the additional -f flag: it makes ssh daemonize (go into the background) so it doesn’t tie up a terminal.
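If you launch tunnels from a script, it can help to assemble the ssh arguments from their parts. The helper below is a hypothetical sketch (tunnel_command is my own name, not part of any library). It only builds the argument list from the flags discussed above and does not run ssh.

```python
def tunnel_command(local_port, host, host_port, gateway, background=False):
    """Build the argv for `ssh -N -L port:host:hostport gateway`.

    Hypothetical helper: it only assembles the list and never starts ssh.
    """
    args = ["ssh"]
    if background:
        args.append("-f")  # daemonize, as in the second example
    # -N: no remote shell; -L: forward local_port to host:host_port
    args += ["-N", "-L", f"{local_port}:{host}:{host_port}", gateway]
    return args
```

The resulting list could be handed to subprocess.run if you actually want to open the tunnel.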

#14. Reset terminal
$ reset
This command resets the terminal. You know, when you have accidentally output binary data to the console, it becomes messed up. The reset command usually cleans it up. It does that by sending a bunch of special byte sequences to the terminal. The terminal interprets them as special commands and executes them.

Here is what BusyBox’s reset command does:

printf("\033c\033(K\033[J\033[0m\033[?25h");
It sends a bunch of escape codes and a bunch of CSI commands. Here is what they mean:

\033c: “ESC c” - sends reset to the terminal.
\033(K: “ESC ( K” - reloads the screen output mapping table.
\033[J: “ESC [ J” - erases display.
\033[0m: “ESC [ 0 m” - resets all display attributes to their defaults.
\033[?25h: “ESC [ ? 25 h” - makes cursor visible.
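The same byte sequence can be assembled in Python. This is only a sketch that mirrors the BusyBox printf call quoted above; RESET_STEPS, reset_sequence and reset_terminal are names I made up for the illustration.

```python
import sys

# Each escape code from the BusyBox printf, paired with its meaning.
RESET_STEPS = [
    ("\033c", "reset the terminal"),
    ("\033(K", "reload the screen output mapping table"),
    ("\033[J", "erase display"),
    ("\033[0m", "reset all display attributes"),
    ("\033[?25h", "make the cursor visible"),
]


def reset_sequence():
    """Concatenate the escape codes in the order BusyBox sends them."""
    return "".join(code for code, _meaning in RESET_STEPS)


def reset_terminal(stream=sys.stdout):
    """Write the reset sequence to a terminal-like stream."""
    stream.write(reset_sequence())
    stream.flush()
```

Only call reset_terminal() when the stream really is a terminal; otherwise the raw bytes end up in your output.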

#15. Tweet from the shell
$ curl -u user:pass -d status='Tweeting from the shell' http://twitter.com/statuses/update.xml
This one-liner tweets your message from the terminal. It uses the curl program to HTTP POST your tweet via Twitter’s API.

The -u user:pass argument sets the login and password to use for authentication. If you don’t wish your password to be saved in the shell history, omit the :pass part and curl will prompt you for the password as it tries to authenticate. Oh, and while we are at shell history, another way to omit password from being saved in the history is to start the command with a space! For example, <space>curl ... won’t save the curl command to the shell history.

The -d status='...' instructs curl to use the HTTP POST method for the request and send status=... as POST data.

Finally, http://twitter.com/statuses/update.xml is the API URL to POST the data to.
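For comparison, here is roughly what such a request looks like when built with Python's standard library. It is a sketch under assumptions: build_status_request is a hypothetical helper, the URL is simply the one from the one-liner (it may no longer accept requests), and nothing is sent over the network.

```python
import base64
import urllib.parse
import urllib.request


def build_status_request(user, password, status,
                         url="http://twitter.com/statuses/update.xml"):
    """Build (but do not send) a POST request with HTTP basic authentication."""
    # Like curl -d: supplying a body makes this a POST request.
    # Unlike curl -d, urlencode percent-encodes the value (spaces become "+").
    body = urllib.parse.urlencode({"status": status}).encode("ascii")
    # Like curl -u user:pass: base64-encoded "user:pass" in the Authorization header.
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    request = urllib.request.Request(url, data=body)
    request.add_header("Authorization", f"Basic {token}")
    return request
```

Passing the request to urllib.request.urlopen would actually send it.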

Talking about Twitter, I’d love if you followed me on Twitter! :)

#16. Execute a command at midnight
$ echo cmd | at midnight
This one-liner sends the shell command cmd to the at-daemon (atd) for execution at midnight.

The at command is flexible about the execution-time argument: you may write things like 4pm tomorrow to execute it at 4pm tomorrow, 9pm next year to run it on the same date at 9pm the next year, 6pm + 10 days to run it at 6pm after 10 days, or now +1minute to run it after a minute.

Use atq command to list all the jobs that are scheduled for execution and atrm to remove a job from the queue.

Compared to the universally known cron, at is suitable for one-time jobs. For example, you’d use cron to execute a job every day at midnight but you would use at to execute a job only today at midnight.

Also be aware that if the load is greater than some number (for one processor systems the default is 0.8), then atd will not execute the command! That can be fixed by specifying a greater max load to atd via -l argument.

#17. Output your microphone to other computer’s speaker
$ dd if=/dev/dsp | ssh username@host dd of=/dev/dsp
The default sound device on Linux is /dev/dsp. It can be both written to and read from. If it’s read from then the audio subsystem will read the data from the microphone. If it’s written to, it will send audio to your speaker.

This one-liner reads audio from your microphone via the dd if=/dev/dsp command (if stands for input file) and pipes it as standard input to ssh. Ssh, in turn, opens a connection to a computer at host and runs the dd of=/dev/dsp (of stands for output file) on it. Dd of=/dev/dsp receives the standard input that ssh received from dd if=/dev/dsp. The result is that your microphone gets output on host computer’s speaker.

Want to scare your colleague? Dump /dev/urandom to his speaker by dd if=/dev/urandom.

#18. Create and mount a temporary RAM partition
# mount -t tmpfs -o size=1024m tmpfs /mnt
This command creates a temporary RAM filesystem of 1GB (1024m) and mounts it at /mnt. The -t flag to mount specifies the filesystem type, and -o size=1024m passes the size option, which sets the filesystem size.

If it doesn’t work, make sure your kernel was compiled to support the tmpfs. If tmpfs was compiled as a module, make sure to load it via modprobe tmpfs. If it still doesn’t work, you’ll have to recompile your kernel.

To unmount the ram disk, use the umount /mnt command (as root). But remember that mounting at /mnt is not the best practice. Better mount your drive to /mnt/tmpfs or a similar path.

If you wish your filesystem to grow dynamically, use ramfs filesystem type instead of tmpfs. Another note: tmpfs may use swap, while ramfs won’t.

#19. Compare a remote file with a local file
$ ssh user@host cat /path/to/remotefile | diff /path/to/localfile -
This one-liner diffs the file /path/to/localfile on local machine with a file /path/to/remotefile on host machine.

It first opens a connection via ssh to host and executes the cat /path/to/remotefile command there. The shell then takes the output and pipes it to diff /path/to/localfile - command. The second argument - to diff tells it to diff the file /path/to/localfile against standard input. That’s it.
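The idea of diffing a file against whatever arrives on a stream can be sketched with Python's difflib. The function name and the default labels are my own, and the output is in unified format, which differs from diff's default output format.

```python
import difflib


def diff_against_stream(local_text, stream, local_name="localfile", stream_name="-"):
    """Compare local text with the contents read from a file-like stream.

    Returns unified-diff lines; an empty list means the two are identical.
    """
    local_lines = local_text.splitlines(keepends=True)
    remote_lines = stream.read().splitlines(keepends=True)
    return list(
        difflib.unified_diff(
            local_lines, remote_lines, fromfile=local_name, tofile=stream_name
        )
    )
```

In the one-liner, the stream would be the standard input fed by ssh; in a script you could pass sys.stdin or any other file-like object.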

#20. Find out which programs listen on which TCP ports
# netstat -tlnp
This is an easy one. Netstat is the standard utility for listing information about the Linux networking subsystem. In this particular one-liner it’s called with -tlnp arguments:

-t causes netstat to only list information about TCP sockets.
-l causes netstat to only list information about listening sockets.
-n causes netstat not to do reverse lookups on the IPs.
-p causes netstat to print the PID and name of the program to which the socket belongs (requires root).
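To get a feel for where this information comes from, here is a rough Python sketch that picks the listening sockets out of text in the format of Linux's /proc/net/tcp. The assumptions: a little-endian machine (addresses appear byte-swapped in hex), state code 0A meaning LISTEN, and helper names that are my own. It does not resolve PIDs or program names the way -p does.

```python
import socket
import struct

LISTEN = "0A"  # state code for listening sockets in /proc/net/tcp


def decode_address(hex_addr):
    """Turn '0100007F:1F90' into ('127.0.0.1', 8080); assumes little-endian."""
    ip_hex, port_hex = hex_addr.split(":")
    ip = socket.inet_ntoa(struct.pack("<I", int(ip_hex, 16)))
    return ip, int(port_hex, 16)


def listening_sockets(proc_net_tcp_text):
    """Return (ip, port) pairs for rows whose state column is LISTEN."""
    result = []
    for line in proc_net_tcp_text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) > 3 and fields[3] == LISTEN:
            result.append(decode_address(fields[1]))
    return result
```

On a Linux box you could feed it the contents of /proc/net/tcp read with open().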

To find more detailed info about open sockets on your computer, use the lsof utility. See my article “A Unix Utility You Should Know About: lsof” for more information.

That’s it for today.
Tune in the next time for “Another Ten One-Liners from CommandLineFu Explained”. There are many more nifty commands to write about. But for now, have fun and see ya!

PS. Follow me on twitter for updates!
Programming  at  atd  atq  atrm  audio  commandlinefu  cron  csi_command  curl  daemon  dd  diff  dsp  echo  editor  emacs  escape_code  fc  http  if  localhost  microphone  mount  netstat  of  pico  post  ram  ramfs  readline  redirect  reset  shell  ssh  standard_output  tcp  terminal  tmpfs  tunnel  tweet  twitter  vi  from google
march 2010 by rahuldave
It’s Ada Lovelace Day
This is my contribution to Ada Lovelace Day

Ada Lovelace is, perhaps, the world’s first programmer of an actual computer. Others wrote about algorithms much earlier (think Euclid and the famous GCD algorithm), but she wrote a program for a specific computing machine. The machine was Charles Babbage’s Analytical Engine, and her notes look like a program to most.

Today I plan on joining over a million other bloggers in discussing women in science, and more specifically computing. The event is named after Ada Lovelace, and is happening all over the web.

Okay, I exaggerated about the number of bloggers, it is closer to 100,000 than a million—actually it is closer to 1,700. The number is not important; what is important is: we need more women administrators, educators, and researchers in all areas of computing. Further, more women who already have done great work in computing need to be recognized and given the awards and accolades they deserve. This has not always happened.

I am honored to be a tiny part of this special day, and I hope I can help in some way to make the event a success.

What To Do?

I am honestly unsure what I should do. For starters I am not a woman, and cannot really understand their issues. But, I have been in the computing field for over thirty years and perhaps I can add some small insights. I will try.

When I was first at Princeton we worked very hard to hire Andrea LaPaugh from MIT, where she got her Ph.D. We were successful, and for quite a while she was the only woman in all of engineering at Princeton. Later, she became the first tenured woman in engineering at Princeton. The balance is not perfect now, but I am happy to say that today she is not the only tenured woman professor in engineering.

One day I was talking with a colleague from another engineering department. He asked me, “How many women did we have in Computer Science?” I immediately answered one—Andrea. Then, I asked the obvious question, “How many do you have in your department?” My colleague thought a long time—I guessed he must be adding up women faculty. Finally he said, “None.”

I am telling this story to show how subtle the issues can be concerning women in science. I gave him a very hard time: I said, you can take a long time to add up the cardinality of a big set, but you cannot take any time to figure out the cardinality of the empty set. What was he thinking?

The Two Rule

One rule is the two rule. I learned this rule from my wife, Judith Norback, who is a Ph.D in psychology from Princeton. Often in an attempt to create balance—especially in academia—one woman will be placed on each committee. A woman. One. It is good to have women on committees, but putting one on does not usually work well.

The difficulty is a lone person on any committee is hard pressed to speak out and really make a difference. A lone person of any minority—the principle is the same for other minorities—is not in general the right choice. There are exceptions to this rule, but studies show one person, from a minority, is not nearly as effective as two. This is the rule of two. If possible always place two women on a group or a committee. They will be immensely more effective, if there are two.

Of course in order to make this happen the academic organization needs to have at least two women—another argument for more diversity. I do not claim to completely understand the reason the “two rule” works, but it does. Try it.

The Out Rule

Another rule is the out rule. This I learned from long experience watching fellow computer scientists operate—especially in academia. In the old days when wagon trains were attacked, they were taught to “circle the wagons.” In computer science we still do this, as do most other areas of science and academia.

However, in computer science the joke—unfortunately all too true—is we shoot the wrong way. We shoot in, not out. Hence, the rule of out: when attacked remember to shoot out, not in toward each other.

With all due respect, I have long noticed women on various committees often ignore this simple rule. They shoot in toward their fellow women. I have been on many committees of all kinds (award, hiring, program, and other types) and have noticed the women on the committee are often the hardest on women candidates. I have often argued for a woman candidate for something, and noticed the other faculty were generally supportive. However, the women faculty in the room would frequently agree with me on the big points, yet attack the candidate on some minor points. Do not shoot in, shoot out.

I am not arguing for a decrease in standards. Never. I am arguing for both male and female faculty to be sure they are as objective as possible. I certainly am far from perfect, but I do think more attention should be paid to being aware of the out rule.

The Zero Rule

I am trying to be constructive and not writing a “moral with a tale,” but one last rule is critical in my mind. The zero rule is just this: there must be zero—no—tolerance for any jokes, comments, stories, of any kind that put down women. I have heard many of them over the years, and have always immediately complained about them. I believe such statements cause many women to go into other areas of science. We must be intolerant of any comments of this kind.

Ada As The First Programmer

It seems to me clear Lady Lovelace was more than the first programmer: she had great insight into what a computing device could or could not do. Here is a direct quote from her—it could have been written the other day. It would be interesting to see what she would think about computing today—she wrote this in 1842.

It is desirable to guard against the possibility of exaggerated ideas that might arise as to the powers of the Analytical Engine. In considering any new subject, there is frequently a tendency, first, to overrate what we find to be already interesting or remarkable; and, secondly, by a sort of natural reaction, to undervalue the true state of the case, when we do discover that our notions have surpassed those that were really tenable.

The Analytical Engine has no pretensions whatever to originate anything. It can do whatever we know how to order it to perform. It can follow analysis; but it has no power of anticipating any analytical relations or truths. Its province is to assist us in making available what we are already acquainted with. This it is calculated to effect primarily and chiefly of course, through its executive faculties; but it is likely to exert an indirect and reciprocal influence on science itself in another manner. For, in so distributing and combining the truths and the formula of analysis, that they may become most easily and rapidly amenable to the mechanical combinations of the engine, the relations and the nature of many subjects in that science are necessarily thrown into new lights, and more profoundly investigated. This is a decidedly indirect, and a somewhat speculative, consequence of such an invention. It is however pretty evident, on general principles, that in devising for mathematical truths a new form in which to record and throw themselves out for actual use, views are likely to be induced, which should again react on the more theoretical phase of the subject. There are in all extensions of human power, or additions to human knowledge, various collateral influences, besides the main and primary object attained.

To really appreciate her brilliant mind, read all her comments here. This is the frontispiece to the document:

Open Problems

The main open problem is to continue trying to increase the number of women in all aspects of science, especially computing. I think there are already many good ideas on how to do this; perhaps what we need is to execute the best of these ideas. In any event, have a happy Ada Lovelace Day. It would have been a great privilege to have met her.
History  Ada_Lovelace  computing  programming  women  from google
march 2010 by rahuldave
Top Ten One-Liners from CommandLineFu Explained
I love working in the shell. Mastery of the shell lets you get things done in seconds, rather than the minutes or hours it would take if you chose to write a program instead.

In this article I’d like to explain the top one-liners from commandlinefu.com. It’s a user-driven website where people get to choose the best and most useful shell one-liners.

But before I do that, I want to take the opportunity and link to a few of my articles that I wrote some time ago on working efficiently in the command line:

Working Efficiently in Bash (Part I).
Working Efficiently in Bash (Part II).
The Definitive Guide to Bash Command Line History.
A fun article on Set Operations in the Shell.
Another fun article on Solving Google Treasure Hunt in the Shell.

And now the explanation of top one-liners from commandlinefu.

Update: Russian translation available.

#1. Run the last command as root
$ sudo !!
We all know what the sudo command does - it runs the command as another user, in this case, it runs the command as superuser because no other user was specified. But what’s really interesting is the bang-bang !! part of the command. It’s called the event designator. An event designator references a command in shell’s history. In this case the event designator references the previous command. Writing !! is the same as writing !-1. The -1 refers to the last command. You can generalize it, and write !-n to refer to the n-th previous command. To view all your previous commands, type history.

This one-liner is actually really bash-specific, as event designators are a feature of bash.

I wrote about event designators in much more detail in my article “The Definitive Guide to Bash Command Line History.” The article also comes with a printable cheat sheet for working with the history.

#2. Serve the current directory at http://localhost:8000/
$ python -m SimpleHTTPServer
This one-liner starts a web server on port 8000 with the contents of current directory on all the interfaces (address 0.0.0.0), not just localhost. If you have “index.html” or “index.htm” files, it will serve those, otherwise it will list the contents of the currently working directory.

It works because python comes with a standard module called SimpleHTTPServer. The -m argument makes python search for a module named SimpleHTTPServer.py in all the possible system locations (listed in sys.path and the $PYTHONPATH shell variable). Once found, it executes it as a script. If you look at the source code of this module, you’ll find that it tests whether it’s being run as a script with if __name__ == '__main__', and if it is, it runs the test() function, which starts a web server in the current directory.

To use a different port, specify it as the next argument:

$ python -m SimpleHTTPServer 8080
This command runs an HTTP server on all local interfaces on port 8080.
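In Python 3 the module was renamed to http.server, so the equivalent command is python3 -m http.server. If you want the same behavior from inside a program, a minimal sketch could look like this; start_server and demo are hypothetical names, and demo serves a throwaway directory on a free loopback port rather than your current directory.

```python
import functools
import http.client
import http.server
import tempfile
import threading
from pathlib import Path


def start_server(directory=".", port=8000, host="0.0.0.0"):
    """Serve `directory` over HTTP in a background thread; return the server."""
    handler = functools.partial(
        http.server.SimpleHTTPRequestHandler, directory=directory
    )
    server = http.server.ThreadingHTTPServer((host, port), handler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def demo():
    """Serve a temporary directory with an index.html and fetch it once."""
    with tempfile.TemporaryDirectory() as tmp:
        Path(tmp, "index.html").write_text("hello")
        server = start_server(tmp, port=0, host="127.0.0.1")  # port 0: pick a free port
        try:
            conn = http.client.HTTPConnection(*server.server_address)
            conn.request("GET", "/")
            response = conn.getresponse()
            result = (response.status, response.read())
            conn.close()
            return result
        finally:
            server.shutdown()
            server.server_close()
```

As with the one-liner, the default host 0.0.0.0 exposes the directory on all interfaces, so pass host="127.0.0.1" if you only want local access.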

#3. Save a file you edited in vim without the needed permissions
:w !sudo tee %
This happens to me way too often. I open a system config file in vim and edit it, just to find out that I don’t have permissions to save it. This one-liner saves the day. Instead of writing the file to a temporary location with :w /tmp/foobar and then moving the temporary file to the right destination with mv /tmp/foobar /etc/service.conf, you now just type the one-liner above in vim and it will save the file.

Here is how it works. If you look at the vim documentation (by typing :he :w in vim), you’ll find a reference to the command :w !{cmd}, which says that vim runs {cmd} and passes it the contents of the file as standard input. In this one-liner the {cmd} part is the sudo tee % command. It runs tee % as superuser. But wait, what is %? Well, it’s a read-only register in vim that contains the filename of the current file! Therefore the command that vim executes becomes tee current_filename, with the current directory being whatever directory current_filename is in. Now what does tee do? The tee command takes standard input and writes it to a file (and also echoes it to standard output)! Rephrasing, it takes the contents of the file edited in vim and writes it to the file (while being root)! All done!
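If tee itself is unfamiliar, its core behavior fits in a few lines of Python. This is a simplified sketch rather than the real utility: it reads everything at once, handles a single output file, and the names tee and demo plus the service.conf file in a temporary directory are just for illustration.

```python
import io
import os
import sys
import tempfile


def tee(source, path, echo=sys.stdout):
    """Copy everything from `source` into the file at `path` and to `echo`."""
    data = source.read()
    with open(path, "w") as handle:
        handle.write(data)
    echo.write(data)
    return len(data)


def demo():
    """Feed a small buffer through tee; return the file contents and the echo."""
    with tempfile.TemporaryDirectory() as tmp:
        target = os.path.join(tmp, "service.conf")
        echoed = io.StringIO()
        tee(io.StringIO("port = 80\n"), target, echo=echoed)
        with open(target) as handle:
            return handle.read(), echoed.getvalue()
```

In the vim one-liner, the source is the buffer contents that vim pipes in, and the file is the one named by %.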

#4. Change to the previous working directory
$ cd -
Everyone knows this, right? The dash “-” is short for “previous working directory.” The previous working directory is stored in the $OLDPWD shell variable. Each time you use the cd command, it sets $OLDPWD to the directory you are leaving, and then, if you type the short version cd -, it effectively becomes cd $OLDPWD and changes to the previous directory.

To change to a directory named “-“, you have to either cd to the parent directory and then do cd ./- or do cd /full/path/to/-.
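The bookkeeping behind cd - can be modeled with a plain dictionary standing in for the shell's variables. This is a toy sketch only: the cd function below is my own, it never changes the real working directory, and it does not resolve relative paths.

```python
def cd(target, env):
    """Model `cd`: "-" means "go to OLDPWD"; afterwards the two swap roles."""
    if target == "-":
        target = env["OLDPWD"]
    # The directory we are leaving becomes the new OLDPWD.
    env["OLDPWD"], env["PWD"] = env["PWD"], target
    return env["PWD"]
```

Calling cd("-", env) twice in a row brings you back to where you started, just like in the shell.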

#5. Run the previous shell command but replace string “foo” with “bar”
$ ^foo^bar^
This is another event designator. This one is for quick substitution. It replaces foo with bar and repeats the last command. It’s actually a shortcut for !!:s/foo/bar/. This one-liner applies the s modifier to the !! event designator. As we learned from one-liner #1, the !! event designator stands for the previous command. Now the s modifier stands for substitute (greetings to sed) and it substitutes the first word with the second word.

Note that this one-liner replaces just the first occurrence of foo in the previous command. To replace all occurrences, add the g modifier (g for global):

$ !!:gs/foo/bar
This one-liner is also bash-specific, as event designators are a feature of bash.
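The difference between the plain and the global substitution is the same as between a limited and an unlimited string replace. Here is a small Python sketch; quick_substitute is a hypothetical helper, and it only transforms the string without executing anything.

```python
def quick_substitute(command, old, new, everywhere=False):
    """Mimic ^old^new^ (first occurrence) or !!:gs/old/new (all occurrences)."""
    if old not in command:
        raise ValueError("substitution failed")
    if everywhere:
        return command.replace(old, new)
    return command.replace(old, new, 1)  # replace only the first occurrence
```

For instance, with the previous command echo foo foo, the plain form changes only the first foo.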

Again, see my article “The Definitive Guide to Bash Command Line History.” I explain all this stuff in great detail.

#6. Quickly backup or copy a file
$ cp filename{,.bak}
This one-liner copies the file named filename to a file named filename.bak. Here is how it works. It uses brace expansion to construct a list of arguments for the cp command. Brace expansion is a mechanism by which arbitrary strings may be generated. In this one-liner filename{,.bak} gets brace expanded to filename filename.bak, which is put in place of the brace expression. The command becomes cp filename filename.bak and the file gets copied.

Talking more about brace expansion, you can do all kinds of combinatorics with it. Here is a fun application:

$ echo {a,b,c}{a,b,c}{a,b,c}
It generates all the possible 3-letter strings from the set {a, b, c}:

aaa aab aac aba abb abc aca acb acc
baa bab bac bba bbb bbc bca bcb bcc
caa cab cac cba cbb cbc cca ccb ccc

And here is how to generate all the possible 2-letter strings from the set of {a, b, c}:

$ echo {a,b,c}{a,b,c}

It produces:

aa ab ac ba bb bc ca cb cc
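Brace expansion of this kind is a Cartesian product, so the same lists can be reproduced with itertools.product. This is a sketch of the idea, not of how bash implements it; expand is my own name.

```python
from itertools import product


def expand(*groups):
    """Join one item from each group, in the order brace expansion produces."""
    return ["".join(parts) for parts in product(*groups)]
```

Here expand("abc", "abc", "abc") corresponds to {a,b,c}{a,b,c}{a,b,c}, and expand(["filename"], ["", ".bak"]) corresponds to filename{,.bak}.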

If you liked this, you may also like my article where I defined a bunch of set operations (such as intersection, union, symmetry, powerset, etc) by using just shell commands. The article is called “Set Operations in the Unix Shell.” (And since I have sets in the shell, I will soon write articles on “Combinatorics in the Shell” and “Algebra in the Shell”. Fun topics to explore. Perhaps even “Topology in the Shell” :))

#7. mtr - traceroute and ping combined
$ mtr google.com
MTR, better known as “Matt’s Traceroute,” combines both the traceroute and ping commands. After each successful hop, it sends a ping request to the machine it found; this way it produces the output of both traceroute and ping, which helps you better understand the quality of the link. If it finds out a packet took an alternative route, it displays it, and by default it keeps updating the statistics so you know what is going on in real time.

#8. Find the last command that begins with “whatever,” but avoid running it
$ !whatever:p
Another use of event designators. The !whatever designator searches the shell history for the most recently executed command that starts with whatever. But instead of executing it, it prints it. The :p modifier makes it print instead of executing.
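The lookup that !whatever performs can be pictured as a backwards scan through the history list. The following is a toy Python model with a hypothetical function name, not bash's actual implementation.

```python
def last_starting_with(history, prefix):
    """Return the most recent history entry that starts with `prefix`."""
    for command in reversed(history):  # newest entries are at the end
        if command.startswith(prefix):
            return command
    raise LookupError(f"{prefix}: event not found")
```

Printing the result instead of running it corresponds to the :p modifier.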

This one-liner is bash-specific, as event designators are a feature of bash.

Once again, see my article “The Definitive Guide to Bash Command Line History.” I explain all this stuff in great detail.

#9. Copy your public-key to remote-machine for public-key authentication
$ ssh-copy-id remote-machine
This one-liner copies your public key, which you generated with ssh-keygen (either the SSHv1 file identity.pub or the SSHv2 file id_rsa.pub), to the remote-machine and places it in the ~/.ssh/authorized_keys file. This ensures that the next time you try to log into that machine, public-key authentication (commonly referred to as “passwordless authentication”) will be used instead of the regular password authentication.

If you wished to do it yourself, you’d have to take the following steps:

your-machine$ scp ~/.ssh/identity.pub remote-machine:
your-machine$ ssh remote-machine
remote-machine$ cat identity.pub >> ~/.ssh/authorized_keys

This one-liner saves a great deal of typing. Actually I just found out that there was a shorter way to do it:

your-machine$ ssh remote-machine 'cat >> .ssh/authorized_keys' < .ssh/identity.pub

#10. Capture video of a linux desktop
$ ffmpeg -f x11grab -s wxga -r 25 -i :0.0 -sameq /tmp/out.mpg
A pure coincidence, I have done so much video processing with ffmpeg that I know what most of this command does without looking much in the manual.

The ffmpeg command can generally be described as a command that takes a bunch of options, with the last argument being the output file. In this case the options are -f x11grab -s wxga -r 25 -i :0.0 -sameq and the output file is /tmp/out.mpg.

Here is what the options mean:

-f x11grab makes ffmpeg set the input video format to x11grab. The X11 framebuffer presents data in a specific format, and this option makes ffmpeg decode it correctly.
-s wxga makes ffmpeg set the size of the video to wxga, which is a shortcut for 1366×768. This is a strange resolution to use, I’d just write -s 800x600.
-r 25 sets the framerate of the video to 25fps.
-i :0.0 sets the video input file to X11 display 0.0 at localhost.
-sameq preserves the quality of input stream. It’s best to preserve the quality and post-process it later.

You can also specify ffmp[…]
Programming  authorized_keys  bash  cd  combinatorics  commandlinefu  cp  desktop  display  event_designators  ffmpeg  history  identity.pub  id_rsa.pub  linux  mtr  oldpwd  one_liners  passwordless_authentication  ping  public_key_authentication  python  pythonpath  root  sets  shell  simplehttpserver  ssh  ssh_copy_id  ssh_keygen  sshv1  sshv2  sudo  tee  traceroute  vim  x11  from google
march 2010 by rahuldave
Simpler "Hello World" Demonstrated In C
An anonymous reader writes "Wondering where all that bloat comes from, causing even the classic 'Hello world' to weigh in at 11 KB? An MIT programmer decided to make a Linux C program so simple, she could explain every byte of the assembly. She found that gcc was including libc even when you don't ask for it. The blog shows how to compile a much simpler 'Hello world,' using no libraries at all. This takes me back to the days of programming bare-metal on DOS!"


programming  from google
march 2010 by rahuldave
