November 2009 Archives

Help me to get PaperTrail out the door

I've spent the better part of my recreational programming time in the last two years working on an enhanced bibliographic project. The purpose was to build a citation manager that would also track references, so that as you build a literature review you can keep track of common sources, "important" papers, etc. There were also a few bigger goals to the project itself that I hoped would solve some problems that academics have in general.

I named this project PaperTrail, and I've been trying to get it ready for other users to test out. The only problem is that this is my first effort at real application building since high school, and even that one didn't work out too well. I built it using the gtkmm interface in C++ on my Ubuntu Linux machine, which means that in theory it should be cross-platform, so that all my Windows- and OS X-using friends can use it as well. Putting this together in practice -- auto-installation, etc. -- is much trickier.

Here are the goals that PaperTrail is meant to help meet:

  1. To standardize citation scripting language in a way that would incorporate author identity. Those of us with the William H. Macy problem know that establishing such a system would be wonderful; I figure we're about 90% of the way there by noting that authors reference themselves in their work, so that a database that includes citations for each paper would ease the need for a scientific author database.
  2. To better process downloaded PDFs. Academics know the pain of sorting their physical paper collection, let alone their digital volumes. I've currently got it rigged so that downloaded PDFs open in PaperTrail, where the relevant bibliographic information can be collected (or imported from the document itself); this info is then saved to the file, which is archived to its own directory, much like iTunes can do with music files.
  3. To quickly grab citation information from a paper. I've got a regular-expressions set-up to grab the citations from a paper's reference list (complete with index numbers) and turn them into PaperTrail entries. It clearly needs some work but it's at least built to be expanded upon.
  4. To have better tracking of multiple versions of papers that might be cited -- drafts, conference proceedings, final versions -- within a single entry.
  5. To process and nest comments and rejoinders to journal discussion papers.
  6. To export a data file to bibtex for use in LaTeX documents. I wouldn't mind adding EndNote compatibility if anyone wanted to use it.
  7. More ideas that I'm forgetting to mention.

What I need is help building the installation procedures for Windows, Mac and *nix; I've almost got it for the last of these, except for locating supporting files and directories. Because the audience for this is small (poor academics, mainly!), I have no interest in trying to find profitability in this idea, only in making a product that people would want to use and share.
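
As a rough illustration of goals 3 and 6 above, here is a minimal Python sketch of pulling numbered citations out of a reference list with a regular expression and writing them out as BibTeX. It is not PaperTrail's actual code (which is C++); the pattern, the function names `parse_references` and `to_bibtex`, and the two sample entries are all invented for the example, and the pattern assumes one simple citation layout only.

```python
import re

# Hypothetical pattern for entries laid out as
#   [1] Authors. Title. Venue, 2001.
# Real reference lists vary far more than this; it is only a starting point.
ENTRY = re.compile(
    r"^\[(?P<index>\d+)\]\s+"   # index number in square brackets
    r"(?P<authors>.+?)\.\s+"    # authors, up to the first period + space
    r"(?P<title>.+?)\.\s+"      # title, up to the next period + space
    r"(?P<venue>.+?),\s+"       # venue, up to the comma before the year
    r"(?P<year>\d{4})\.$"       # four-digit year and closing period
)

# Invented sample text, not real references.
SAMPLE = """References
[1] Smith J, Jones K. A made-up paper title. Journal of Examples, 2001.
[2] Doe A. Another invented title. Proceedings of Nothing, 2005.
"""


def parse_references(text):
    """Return one dict per line that matches ENTRY; other lines are skipped."""
    entries = []
    for line in text.splitlines():
        match = ENTRY.match(line.strip())
        if match:
            entry = match.groupdict()
            entry["index"] = int(entry["index"])
            entries.append(entry)
    return entries


def to_bibtex(entry):
    """Format one parsed entry as a BibTeX @article record."""
    key = f"ref{entry['index']}"
    # BibTeX separates authors with " and " rather than commas.
    authors = " and ".join(a.strip() for a in entry["authors"].split(","))
    return (
        f"@article{{{key},\n"
        f"  author = {{{authors}}},\n"
        f"  title = {{{entry['title']}}},\n"
        f"  journal = {{{entry['venue']}}},\n"
        f"  year = {{{entry['year']}}}\n"
        f"}}"
    )


if __name__ == "__main__":
    for parsed in parse_references(SAMPLE):
        print(to_bibtex(parsed))
```

Author names with initials followed by periods, multi-line entries and missing years would all break this pattern, which is exactly why the real thing needs to be built to be expanded upon.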

The source code is posted here; please contact me if you're interested in helping out, have friends who know this stuff, or have suggestions on features that should be included.

Over what I'm sure will be some howls of objection, I maintain that Breaking Bad is the best show on AMC, better than that other one that everyone else talks about. The main reason is that there doesn't seem to be a greater dramatic actor with comedy instincts than Bryan Cranston (splitting hairs a bit, as I think Hugh Laurie is the best comedic actor with dramatic instincts), but the show also raises at least three issues, and handles them well:

  • The law of unintended consequences is ultimately what runs the show. Almost every action taken by a character has a later reaction, predictable or otherwise.
  • Bankruptcy from health-related causes is a serious problem, and it's the lack of a strong insurance system that keeps people from picking their own (quality) doctors.
  • Drug addicts are people too -- any kind of approach to dealing with the problems of addiction must take that into account.

Leaving aside the fact that Walter White is apparently as screwed-up a man as any of us, having made more than his fair share of bad life decisions, his precarious position could have been mitigated by an insurance plan that didn't burden him with an expensive treatment.

The worst part of this is that we likely can't get back to the real meat of the discussion, namely the consequences, intended and otherwise, of each proposed change to the healthcare system in America, since the debate is buried under verifiably false scare claims.

In short this is another example of what I think of as the regression-to-the-mean of policy effects: consequences that appear large are most likely overblown, and those that appear small are likely bigger.

P.S. If you don't believe me about Bryan Cranston's dramatic chops, see him as Buzz Aldrin first.

A Short Note on Breast Cancer Screening: Really Less Effective?

There is plenty in the news on recommendations for breast cancer screening, but one detail jumped out at me -- namely, the suggestion that more women aged 40-49 would need to be screened regularly to prevent one death due to breast cancer (1904) than women aged 50-59 (1339). This gets prominence in news reports because it's an easy way of summarizing effectiveness, even though it's a completely misleading interpretation of the recommendation. From the source material:

Total number to screen to prevent one fatality from cancer:
Ages 39-49: Mean 1904, CI (929, 6378)
Ages 50-59: Mean 1339, CI (322, 7455)


That is a less-than-compelling difference in effectiveness when one confidence interval lies completely within the other.
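
To make the comparison concrete, here is a small Python sketch using only the four interval endpoints and two means quoted above. It checks that the interval for ages 39-49 sits inside the one for ages 50-59, and re-expresses each mean as deaths prevented per 1000 women screened. The helper names `contains` and `deaths_prevented_per_1000` are made up for the example, and the nesting check is only an informal sanity check, not a formal test of the difference between the two groups.

```python
# Figures quoted above: (mean, lower bound, upper bound) for the number of
# women who must be screened to prevent one death from breast cancer.
estimates = {
    "39-49": (1904, 929, 6378),
    "50-59": (1339, 322, 7455),
}


def contains(outer, inner):
    """True if the interval of `inner` lies entirely inside that of `outer`."""
    return outer[1] <= inner[1] and inner[2] <= outer[2]


def deaths_prevented_per_1000(number_to_screen):
    """Convert a number-needed-to-screen into deaths prevented per 1000 screened."""
    return 1000 / number_to_screen


# Does the 39-49 interval sit inside the 50-59 interval?
nested = contains(estimates["50-59"], estimates["39-49"])

if __name__ == "__main__":
    for ages, (mean, low, high) in estimates.items():
        rate = deaths_prevented_per_1000(mean)
        print(f"Ages {ages}: {mean} ({low}, {high}), "
              f"about {rate:.2f} deaths prevented per 1000 screened")
    print("39-49 interval nested inside 50-59 interval:", nested)
```

On the point estimates alone the two groups differ by a fraction of one death per 1000 women screened, and the width of both intervals swamps that gap.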

The intended point of the recommendation was that screens and operations have risks -- false positive results leading to unnecessary biopsies and unintended consequences -- though on first inspection I couldn't find any data on the mortality risk from overtreatment to compare.

P.S. There's clearly a lot more to say about the implications of this analysis, for the health care debate in the U.S. in particular, but I'm at least in a position to dispute one misinterpretation.

Cochran at 100

I spent this past Saturday at a symposium for the centennial of William G. Cochran, one of my erstwhile department's co-founders, and I wasn't disappointed in the least. The impact he's had, both on the discipline and the world, appears to be vast; the suggestion that his work on the effects of smoking has saved millions of lives is an idea I'll eventually follow up on in detail.

After all I've learned about the diversity of his background, he also appears to be a highly positive case of Doctor No.

The Harvard Gazette has a nice write-up of the event. For those who want a long look at what he did, I strongly recommend The Planning of Observational Studies of Human Populations, a paper I'm ready to classify as timeless.

As Long As I'm Talking About Life Ambitions...

I'd very much like to be this guy. I'm fairly sure that Xiao-Li Meng has a long head start on me for this in the world of applied statistics; good to have had him on my committee.

Role Models

In trying to pin down exactly what kind of professional I want to be, I'm finding it helpful to put two archetypes into play:

Putting it simply, my ideal scientist should be part Doc Brown, part Doctor No (though in the sense of "Senator No" Jesse Helms).

Introducing Bayes and Computation in Eight Short Weeks

I'll be teaching Applied Bayesian and Computational Methods to the Master's-level students here at CMU in the spring, though as the course is scheduled for a "mini-term", I'll only have a short time to present many of these ideas to the group. My plan is to teach from Gelman and Hill ("ARM", or as AG wishes he'd called it, "Regression") since it's filled with useful tips and it's inexpensive.

At this point, my plan is to do what most practitioners do when teaching: show the students how to avoid the mistakes you've made. At a minimum, this involves the following lessons:

  • Think about the model before you code it.
  • MCMC may be flashy, and powerful, and the root of a lot of my work, but it's also really, really easy to screw up.
  • Posing a stochastic model is the beginning of scientific wisdom, not the end of it. Therefore, think about what other approaches may be equally valid.

One of my least favourite classroom experiences was with a lecturer whose approach to teaching problem solving was to say "well, you could try this", as if teaching were a laundry list. So making sure to put these methods in perspective is at the top of my agenda.

For coding, my plan is to go with R and WinBUGS using the R2WinBUGS package; as much as I'd like to go pure R, I'm sure I'd spend too much time worrying about the little details and missing the point of the modelling, which in an eight-week course I can't afford.

For those students who come into the course with really good skills in R already, I'd consider introducing C or C++ coding of routines if it were a longer course or one purely focused on the details of stochastic model programming, but right now I hesitate to bring it in.

Ubuntu 9.10 and REvolution-R

I stopped using Windows as my primary operating system about two years ago, making the switch to Ubuntu on all my work machines. My coding environment of choice for R is GTK Emacs with ESS, which both install almost by themselves through the Ubuntu setup.

Because I've been off Windows systems, I haven't given much thought to using REvolution's version of R, which had as its biggest selling point the use of fancier compilers to speed up run time (not an issue when it builds from source on a Linux machine). However, the newest release of Ubuntu ("Karmic Koala") has revolution-r as an option, complete with their mechanism for parallelizing R code. So it's finally worth a go from my end, especially on my four-core-eight-thread work machine. The next project I code, I'll try in revolution-r.