My favorite resources

An ever-expanding list of resources I’ve liked, still like, and recommend. Images are hyperlinked to Amazon, btw.

Software & Systems


  • R: Steep learning curve, frequently weird, but its myriad packages and powerful graphics make it essential. It runs on everything. Plus, because it is open source, you can run it on every node of a cluster for free.
  • SQL: Frequently essential partner to data mining. If your data are structured, this is the place to put them. I prefer PostgreSQL, but MySQL is as widespread as sin and can be used, if treated with respect. And of course, SQLite is integral to several key libraries in R and Python. Bottom line: learning SQL is mucho helpful.
  • Python (NumPy, SciPy, sklearn, Pandas, Matplotlib, seaborn): What R should be in terms of speed and rigor. Having said this, R keeps the top rank when it comes to plotting, IMO.
  • RStudio. Don’t code R without it. Just don’t.
  • Jupyter Notebook, an excellent way to run Python or R exploratory data analyses (less so for just writing code). It took some doing for me to like it, but I have seen the light, and the light is good.
  • Rodeo from yhat (thanks for doing this, yhat! It’s appreciated!). Python’s answer to RStudio. Don’t like Jupyter Notebooks? Rodeo may be your answer.
  • EMS’ SQL Manager for MySQL (also Postgres). It’s been my favorite SQL query browser and DB manager for years. Very powerful, reasonably easy to use, and great support. Well worth the purchase. And they have free versions that have decent capabilities.
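To make the SQL point a bit more concrete, here is a minimal sketch using the sqlite3 module that ships with Python's standard library. The table name, column names, and values are made up purely for illustration; the idea is only that a few lines of plain SQL can do the grouping and averaging for you.

```python
import sqlite3

# Hypothetical table and values, purely for illustration.
conn = sqlite3.connect(":memory:")  # in-memory database, nothing is written to disk
conn.execute("CREATE TABLE measurements (sample TEXT, value REAL)")
conn.executemany(
    "INSERT INTO measurements VALUES (?, ?)",
    [("a", 1.0), ("a", 3.0), ("b", 10.0)],
)

# Let SQL do the aggregation: one average per sample.
rows = conn.execute(
    "SELECT sample, AVG(value) FROM measurements GROUP BY sample ORDER BY sample"
).fetchall()
print(rows)

conn.close()
```

The same query should carry over to PostgreSQL or MySQL with little or no change, which is a big part of why SQL is worth learning once.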

Second base:

  • Scala: Nice language for doing data analysis in a Java-like fashion without its heaviness.
  • Java: So many things run on Java that you have to learn it: Spark, Weka, ChemAxon, OpenNLP, you name it. Though it is true that there is an awful lot of “IDE vomit”, to quote the “hilarious Indian” guy in this YouTube Scala tutorial (highly recommended!).
  • Spark, and particularly the Spark Machine Learning Library (MLlib): When you need power and your problem fits the no-communication-between-slave-nodes model, this is the first place to turn.
  • Julia: The next iteration in data mining languages? Sure looks like it.
  • Weka: I rarely use it these days, but it’s a great platform for getting exposed to algorithms without the tears of R, and there are some great algorithms I’ve only found in that system, e.g., logistic regression trees.


If you’re like me, you love figuring things out, and one thing I learned a long time ago is that documentation makes a big difference. Unfortunately, we are now at the point where there is frequently too much documentation out there, ergo this list of my favorite books.

All books are linked to Amazon unless indicated. Personally, I typically access them via Kindle (book prices and convenience are hard to beat).

LEVEL: introductory

A great introduction to R:

The R Book.

Awesome comprehensive tour of data mining that combines theory and application in just the right proportions, centered on the Weka system: 

Data Mining: Practical Machine Learning Tools and Techniques

Excellent introduction to generating graphs in R: 

R Graph Cookbook

LEVEL: intermediate

One of my all-time favorites for digging into meaty analyses with a superb balance of theory and application: 

A Handbook of Statistical Analyses Using R

My favorite go-to for understanding frequently murky statistical terminology:

Cambridge Dictionary of Statistics

Good source for understanding more sophisticated R graphs: 

R Graphics Cookbook

A superb survey of meta-analysis, with just the right blend of theory and application. 

An excellent source of advice on how to make Excel, a frequently under-appreciated tool, sing for you. Lots of practical info and examples on how to make reports and dashboards that talk directly to data sources (including databases), and do it in style.

Excel Dashboards and Reports 2nd Edition

LEVEL: advanced

Probably the best coverage of the theory of Deep Learning possible, complete with a powerful review of machine learning to provide the background necessary to understand this technology. One of very few textbooks that bothers to explain why things are named the way they are, especially when those names are counter-intuitive or sometimes frankly wrong! Only people who really care to convey a good understanding of their field take such pains, and it shows in the quality of this book.

Deep Learning

A superb, highly practical introduction to actually running Deep Learning networks. A great complement to Goodfellow’s Deep Learning book.

Deep Learning with Python

