My favorite resources

An ever-expanding list of resources I’ve liked, still like, and recommend.

Software & Systems

Favorites:

  • R: Steep learning curve, frequently weird, but its myriad packages and powerful graphics make it essential. It runs on everything. Plus, because it is open source, you can run it on every node of a cluster for free.
  • SQL: Frequently essential partner to data mining. If your data are structured, this is the place to put them. I prefer PostgreSQL, but MySQL is as widespread as sin and can be used, if treated with respect.
  • Python (NumPy, SciPy, scikit-learn, pandas, Matplotlib): What R should be in terms of speed and rigor.
  • Perl: It’s on its way out, but it remains incredibly flexible and frequently essential glue between components.
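
To make the SQL point concrete, here is a minimal sketch of letting the database do the grouping before the numbers reach your analysis code. It uses Python's built-in sqlite3 module only so that it runs anywhere. The table, the column names, the helper function, and the data are all made up for illustration, and the same query should work in PostgreSQL or MySQL.

```python
import sqlite3

# Hypothetical toy data: (compound, assay, value) measurements.
rows = [
    ("aspirin", "solubility", 3.0),
    ("aspirin", "solubility", 5.0),
    ("caffeine", "solubility", 20.0),
    ("caffeine", "solubility", 22.0),
]


def mean_by_compound(records):
    """Load records into an in-memory table and average the values per compound."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE measurements (compound TEXT, assay TEXT, value REAL)"
        )
        # Parameterized inserts keep the data out of the SQL string itself.
        conn.executemany("INSERT INTO measurements VALUES (?, ?, ?)", records)
        cursor = conn.execute(
            "SELECT compound, AVG(value) "
            "FROM measurements "
            "GROUP BY compound "
            "ORDER BY compound"
        )
        return cursor.fetchall()
    finally:
        conn.close()


summary = mean_by_compound(rows)
print(summary)
```

The result is a plain list of (compound, mean) tuples, which can be handed straight to pandas or NumPy for the actual mining.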

Second base:

  • Scala: Nice language for doing data analysis in a Java-like fashion without Java's heaviness.
  • Java: So many things run on Java, you have to learn it: Spark, Weka, ChemAxon, OpenNLP, you name it. Though it is true that there is an awful lot of “IDE vomit”, to quote the “hilarious Indian” guy in this YouTube Scala tutorial (highly recommended!).
  • Spark, and particularly the Spark Machine Learning Library (MLlib): When you need power and your problem fits the no-communication-between-worker-nodes model, this is the first place to turn.
  • Julia: The next iteration in data mining languages? Sure looks like it.
  • Weka: I rarely use it these days, but it’s a great platform for getting exposed to algorithms without the tears of R, and there are some great algorithms I’ve only found in that system, e.g., logistic regression trees.

Books

If you’re like me, you love figuring things out, and one thing I learned a long time ago is that documentation makes a big difference. Unfortunately, we are now at the point where there is frequently too much documentation out there, ergo this list of my favorite books.

All books are linked to Amazon unless otherwise indicated. Personally, I typically access them via Kindle (the prices and convenience are hard to beat).


Level: introductory

A great introduction to R:

The R Book

Awesome comprehensive tour of data mining that combines theory and application in just the right proportions, centered on the Weka system: 

Data Mining: Practical Machine Learning Tools and Techniques

Excellent introduction to generating graphs in R: 

R Graph Cookbook


Level: intermediate

One of my all-time favorites for digging into meaty analyses with a superb balance of theory and application: 

A Handbook of Statistical Analyses Using R

My favorite go-to for understanding frequently murky statistical terminology:

Cambridge Dictionary of Statistics

Good source for understanding more sophisticated R graphs: 

R Graphics Cookbook

A superb survey of meta-analysis, with just the right blend of theory and application. 

An excellent source of advice on how to make Excel, a frequently under-appreciated tool, sing for you. Lots of practical info and examples on how to make reports and dashboards that talk directly to data sources (including databases), and do it in style:

Excel Dashboards and Reports, 2nd Edition
