Vaguery + open-science 32
Review of 2011 Data Scientist Summit | (R news & tutorials)
may 2011 by Vaguery
This was the first annual Data Scientist Summit, and I will no doubt be back. That said, the technical discussion had an introductory flavor, which made the technology seem dated. For example, “vanilla” Hadoop was introduced as a tool for processing vast amounts of data. I would expect most data scientists to have worked with Hadoop, or at least to know what it is; it is old news as “cutting-edge technology” goes. Tools like Pig, Cascalog, HBase, Hive, and Cascading would have been a better discussion topic. I was also disappointed by how little coverage of tools there was (apart from Hadoop, NoSQL, and enterprise databases). It seemed as if R had gone M.I.A., and I was surprised that there was so little discussion of visualization tools like Tableau, Processing, Gephi, D3, and Polymaps.
data-science
conference
academic-culture
cultural-assumptions
corporatism
open-science
Walking Randomly » Natural Scientists: their very big output files – and a tale of diffs
april 2011 by Vaguery
"A few years back, when a user at the University of Manchester asked for help with the ‘diff – files too big/ out of memory’ problem, I wrote a modern version that I called idiffh (for Ian’s diffh). My ground rules were:
Work on any text files on any operating system with a C compiler;
have no limits on, e.g., line lengths or file size;
never ‘give up’ if the going gets tough (i.e. when the files are very different)."
diff
text-mining
dataset
open-science
tools
from delicious
Beekeeper Who Leaked EPA Documents: "I Don't Think We Can Survive This Winter" | Fast Company
december 2010 by Vaguery
"“They told me that EPA scientists had reviewed the original lifecycle study and determined it wasn't scientifically sound, and I asked if it had been documented, if there was a hard copy,” he says. “The [employee] said yes, and I asked if I could get a copy.” And just like that, he had the proof he needed that the EPA had overlooked something that could be killing America's bees."
astroturf
corporatism
pesticides
ecology
science
open-science
lawsuit
» Open Data citation advantage Circle of Complexity
august 2010 by Vaguery
"Because sharing data resulted in a citation, I wonder how long will it take for Open Data advocates to start using this “open data citation advantage” as an argument for sharing data?"
citation-etiquette
economics
open-access
open-science
open-data
social-engineering
academic-culture
Getting Started Guide - Google Prediction API - Google Code
may 2010 by Vaguery
"The Prediction API allows you to get more from your data and makes its patterns more accessible. Specifically, the Prediction API leverages Google's machine learning infrastructure to give you the tools to better analyze your data and reveal patterns that are often difficult to manually discover. The API also enables you to use those patterns to predict new outcomes, which facilitates the development of all types of software, from textual analysis systems to recommendation systems. Because the Prediction API is a RESTful HTTP service, you can easily access it from Google App Engine, Apps Script, and other Internet-connected desktop applications."
nudge
machine-learning
models
google
prediction
clustering
learning-from-data
AI
API
open-science
An article attacking R gets responses from the R blogosphere – some reflections | (Articles about R)
april 2010 by Vaguery
"But Dr. De Mars's post is (very) important for a different reason. Not because her claims are true or false, but because her writing angered people who love and care for R (whether legitimately or not, it doesn’t matter). Anger, being a very powerful emotion, can reveal interesting things. In our case, it just showed that R bloggers are connected to each other."
R
community
open-science
statistics
criticism-is-the-best-medicine
Deluge of scientific data needs to be curated for long-term use
february 2010 by Vaguery
"Most organizations have serious problems with data management because it's expensive to do systematic curation, which includes documenting the context in which data were generated or derived, including the instruments involved, the protocols and such," Palmer said. "But that also requires caring for the data and making them available to other scientists. It takes serious commitment and investment."
curation
data
data-warehousing
openness
open-science
challenges
Keeping computers from ending science's reproducibility
january 2010 by Vaguery
"The idea is that the researchers that rely on computational techniques as part of their day-to-day activities need an entire "reproducible research system" that will make it easier for them to document the sources of their data and the analyses performed on it. The system they've designed shares features with rapid application development environments, as it graphically represents modular computational tools, which can be ordered to create an analysis pipeline, and the individual settings for each can be tweaked. Once complete, the user can trigger the analysis to run; the system documents all of the relevant settings and software information."
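The idea in the excerpt above (modular tools ordered into a pipeline, with every setting and the software environment documented at run time) can be sketched roughly as follows. This is only an illustration, not the system the article describes: the step functions `scale` and `clip`, the `run_pipeline` helper, and the shape of the record are all assumptions made up for the example.

```python
import platform
import sys


def scale(values, factor=1.0):
    """Hypothetical analysis step: multiply every value by a constant."""
    return [v * factor for v in values]


def clip(values, upper=10.0):
    """Hypothetical analysis step: cap every value at an upper bound."""
    return [min(v, upper) for v in values]


def run_pipeline(steps, data):
    """Run (name, function, settings) steps in order and log what was done."""
    record = {
        # Software information captured alongside the analysis settings.
        "python_version": sys.version.split()[0],
        "platform": platform.platform(),
        "steps": [],
    }
    for name, func, settings in steps:
        data = func(data, **settings)
        record["steps"].append({"name": name, "settings": dict(settings)})
    return data, record


# The pipeline is plain data: reordering steps or tweaking a setting
# changes this list, and the change shows up in the record.
pipeline = [
    ("scale", scale, {"factor": 2.0}),
    ("clip", clip, {"upper": 5.0}),
]
result, record = run_pipeline(pipeline, [1.0, 2.0, 4.0])
```

Storing `record` next to `result` (for instance as JSON) would give a later reader the step order, settings, and interpreter version needed to attempt a rerun.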
agility
open-science
reproducibility
academic-culture
academics-shouldn't-design-interfaces
arguments-against-interns
[0911.0454] The Financial Bubble Experiment: advanced diagnostics and forecasts of bubble terminations
december 2009 by Vaguery
"We continue this protocol until the future date (1 May 2010) at which time we upload our final version of the master document. For this final version, we include the URL of a web site where the .pdf documents of all of our past forecasts can be downloaded and independently checked for consistent MD5 and SHA-2 hashes. For convenience, we will include a summary of all of our forecasts in this final document."
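Independently checking a downloaded forecast against published hashes, as the excerpt above describes, might look like the sketch below. It uses Python's standard `hashlib`; SHA-256 is assumed here as the SHA-2 variant (the excerpt does not say which one the authors used), and the helper names `file_digests` and `matches_published` are hypothetical.

```python
import hashlib


def file_digests(path, chunk_size=65536):
    """Return the MD5 and SHA-256 hex digests of a file, read in chunks."""
    md5 = hashlib.md5()
    sha256 = hashlib.sha256()
    with open(path, "rb") as fh:
        # Reading in chunks keeps memory use flat for large documents.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            md5.update(chunk)
            sha256.update(chunk)
    return {"md5": md5.hexdigest(), "sha256": sha256.hexdigest()}


def matches_published(path, published):
    """Compare a file against published digests, e.g. {"sha256": "..."}.

    Only the "md5" and "sha256" keys are understood; hex case is ignored.
    """
    actual = file_digests(path)
    return all(
        actual[name] == value.lower() for name, value in published.items()
    )
```

A reader would call `matches_published("forecast.pdf", {"md5": "...", "sha256": "..."})` with the digests announced in advance; the file name here is a placeholder. A mismatch in either digest means the document differs from the one that was originally hashed.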
prediction
economics
financial-crisis
finance
science
open-science
competition
public-policy
About the Open Cloud Consortium
october 2009 by Vaguery
"The Open Cloud Consortium (OCC) is a member driven organization that:
Supports the development of standards for cloud computing and frameworks for interoperating between clouds;
develops benchmarks for cloud computing;
supports reference implementations for cloud computing, preferably open source reference implementations;
manages a testbed for cloud computing called the Open Cloud Testbed;
sponsors workshops and other events related to cloud computing."
cloud-computing
nudge
standards
openness
open-science
grid-computing
"Essentials of Metaheuristics"
august 2009 by Vaguery
"About the Book: This is an open set of lecture notes on metaheuristics algorithms, intended for undergraduate students, practitioners, programmers, and other non-experts. It was developed as a series of lecture notes for an undergraduate course I taught at GMU. The chapters are designed to be printable separately if necessary. As it's lecture notes, the topics are short and light on examples and theory. It's best when complementing other texts. With time, I might remedy this."
metaheuristics
genetic-programming
book
open-source
open-science
creative-commons
computer-science
search
optimization
genetic-algorithm
stochastic
Infochimps.org: Free Redistributable Data Sets of Every Kind
april 2009 by Vaguery
"There are many sources to find out something about everything. Until now, there’s been no good place for you to find out everything about something.
The infochimps.org community is assembling and interconnecting the world's best repository for raw data -- a sort of giant free almanac, with tables on everything you can put in a table. Built by data nerds, used by data nerds, it's a central source for the information you need to power the projects the world needs."
data
data-analysis
openness
open-science
public-domain
information
visualization
archive
database
free
raw-data-now
myGrid » What is a workflow?
january 2009 by Vaguery
"In a scientific context what does this mean? The overall project referred to is your analysis. The activities are simple operations within your analysis. All these operations have a certain number of inputs and outputs. In the case of fetching a DNA sequence, an input may be an identifier of the sequence, whilst the output is a string representing the nucleotide sequence represented by this identifier.
The triggering of activities by other activities is where an operation feeds data into a subsequent operation. For example, the ‘fetch sequence’ operation may feed its output (the string containing sequence ‘ACTG’) into a ‘transcribe’ operation. This would subsequently change the DNA sequence into an RNA sequence. We would then have a simple workflow with one operation, and a link, which looks something like the following:..."
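The ‘fetch sequence’ to ‘transcribe’ link described above can be sketched in a few lines. This is a toy illustration, not myGrid or Taverna code: the lookup table, the identifier `"example-id"`, and the `run_workflow` helper are assumptions, and transcription is simplified to replacing T with U on the coding strand.

```python
def fetch_sequence(identifier):
    """Hypothetical stand-in for a sequence database lookup."""
    sequences = {"example-id": "ACTG"}
    return sequences[identifier]


def transcribe(dna):
    """Turn a DNA coding-strand string into RNA by replacing T with U."""
    return dna.replace("T", "U")


def run_workflow(operations, initial_input):
    """Feed each operation's output into the next one (the 'link')."""
    data = initial_input
    for operation in operations:
        data = operation(data)
    return data


rna = run_workflow([fetch_sequence, transcribe], "example-id")
print(rna)  # prints "ACUG"
```

Each operation takes one input and returns one output, so the link between two operations is simply the value passed from one call to the next.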
open-science
science
collaboration
modeling
work
communication
formalization
The Back Page
december 2008 by Vaguery
"Wikipedia is a second example where scientists have missed an opportunity to innovate online. Wikipedia has a vision statement to warm a scientist’s heart: “Imagine a world in which every single human being can freely share in the sum of all knowledge. That’s our commitment.” You might guess Wikipedia was started by scientists eager to collect all of human knowledge into a single source. In fact, Wikipedia’s founder, Jimmy Wales, had a background in finance and as a web developer. In the early days few established scientists were involved. To contribute would arouse suspicion from colleagues that you were wasting time that could be spent writing papers and grants."
openness
open-science
publishing
cultural-norms
collaboration
transparency
wikinomics
Opinion - My View: What's so wasteful about funding discovery? - sacbee.com
october 2008 by Vaguery
"Not all science needs to have a purpose. The nature of humans is that, sometimes, they simply want to know. Everything else is just a bonus.
Srinivasa Ramanujan and Albert Einstein, the two scientific geniuses of the 20th century, made their earliest discoveries while working as clerks, not as professors working on taxpayer-funded projects; but why risk, in the 21st century, that some diamond might remain forever unearthed for want of a government grant?"
science
politics
academia
basic-science
funding
government
grants
anti-intellectualism
open-science
cultural-norms
Open Reading Frame
july 2007 by Vaguery
Discovery is the addiction that drives research -- it's the crackpipe hit, the rush, the thrill, that keeps us going through the down times and the plodding; but one of the best ways to alleviate the boredom and despondency that sets in between fixes is t
collaboration
science
open-access
open-science
academia
cultural-norms
learning-by-doing
blogs
community
Open Reading Frame
july 2007 by Vaguery
"The real killer is ego: what if someone else gets there first?"
open-access
open-science
commentary
academia
cultural-norms
fear-uncertainty-doubt
FUD
blogging
competitiveness
Open Reading Frame
july 2007 by Vaguery
Catching up on old posts of a newly discovered blog: open-access peer reviewers' comments. Good idea.
openness
open-science
collaboration
peer-review
academia
publishing
authority
comments
UsefulChem » Alicia Holsey
july 2007 by Vaguery
Wiki-editing a Masters Thesis, live.
transparency
science
communication
publishing
personal-brand
openness
open-access
open-science
wiki
writing
academia
Synthesis - That’s what I do, I synthesize. » Open Science
july 2007 by Vaguery
Another live thesis editing experiment.
science
openness
open-access
transparency
blogging
writing
academia
cultural-norms
communication
open-science