An ever-expanding list of resources I’ve liked, still like, and recommend.
Software & Systems
- R: Steep learning curve, frequently weird, but its myriad packages and powerful graphics make it essential. It runs on everything. Plus, being open source, you can run it on every node of a cluster for free.
- SQL: Frequently essential partner to data mining. If your data are structured, this is the place to put them. I prefer PostgreSQL, but MySQL is as widespread as sin and can be used, if treated with respect.
- Python (NumPy, SciPy, scikit-learn, pandas, Matplotlib): What R should be in terms of speed and rigor.
- Perl: It’s on its way out, but it remains incredibly flexible and frequently essential glue between components.
- Scala: Nice language for doing data analysis in a Java-like fashion without Java's heaviness.
- Java: So many things run on Java that you have to learn it: Spark, Weka, ChemAxon, OpenNLP, you name it. That said, there is an awful lot of “IDE vomit”, to quote the “hilarious Indian” guy in this YouTube Scala tutorial (highly recommended!).
- Spark, and particularly the Spark Machine Learning Library (MLlib): When you need power and your problem fits the no-communication-between-slave-nodes model, this is the first place to turn.
- Julia: The next iteration in data mining languages? Sure looks like it.
- Weka: I rarely use it these days, but it’s a great platform for getting exposed to algorithms without the tears of R, and there are some great algorithms I’ve only found in that system, e.g., logistic regression trees.
If you’re like me, you love figuring things out, and one thing I learned a long time ago is that documentation makes a big difference. Unfortunately, we are now at the point where there is frequently too much documentation out there, ergo this list of my favorite books.
All books are linked to Amazon unless otherwise indicated. Personally, I typically read them on Kindle (the prices and convenience are hard to beat).
A great introduction to R:
Awesome comprehensive tour of data mining that combines theory and application in just the right proportions, centered on the Weka system:
Excellent introduction to generating graphs in R:
One of my all-time favorites for digging into meaty analyses with a superb balance of theory and application:
My favorite go-to for understanding frequently murky statistical terminology:
Good source for understanding more sophisticated R graphs:
A superb survey of meta-analysis, with just the right blend of theory and application.
An excellent source of advice on how to make this frequently under-appreciated tool sing for you. Lots of practical info and examples on how to make reports and dashboards that talk directly to data sources (including databases), and do it in style.