An ever-expanding list of resources I’ve liked, still like, and recommend. Images are hyperlinked to Amazon, btw.
Software & Systems
- R: Steep learning curve, frequently weird, but its myriad packages and powerful graphics make it essential. It runs on everything. Plus, because it is open source, you can run it on every node of a cluster for free.
- SQL: Frequently an essential partner to data mining. If your data are structured, this is the place to put them. I prefer PostgreSQL, but MySQL is as widespread as sin and can be used, if treated with respect. And of course, SQLite is integral to several key libraries in R and Python. Bottom line: learning SQL is mucho helpful.
- Python (NumPy, SciPy, sklearn, pandas, Matplotlib, seaborn): What R should be in terms of speed and rigor. Having said this, R retains the top rank when it comes to plotting, IMO.
- RStudio. Don’t code R without it. Just don’t.
- Jupyter Notebook, an excellent way to run Python or R exploratory data analyses (less so for just writing code). It took some doing for me to like it, but I have seen the light, and the light is good.
- Rodeo from yhat (thanks for doing this, yhat! It’s appreciated!). Python’s answer to RStudio. Don’t like Jupyter Notebooks? Rodeo may be your answer.
- EMS’ SQL Manager for MySQL (also Postgres). It’s been my favorite SQL query browser and DB manager for years. Very powerful, reasonably easy to use, and great support. Well worth the purchase. And they have free versions that have decent capabilities.
- Scala: Nice language for doing data analysis in a Java-like fashion without Java’s heaviness.
- Java: So many things run on Java that you have to learn it: Spark, Weka, ChemAxon, OpenNLP, you name it. It is true, though, that there is an awful lot of “IDE vomit”, to quote the “hilarious Indian” guy in this YouTube Scala tutorial (highly recommended!).
- Spark, and particularly the Spark Machine Learning Library (MLlib): When you need power and your problem fits the no-communication-between-worker-nodes model, this is the first place to turn.
- Julia: The next iteration in data mining languages? Sure looks like it.
- Weka: I rarely use it these days, but it’s a great platform for getting exposed to algorithms without the tears of R, and there are some great algorithms I’ve only found in that system, e.g., logistic regression trees.
If you’re like me, you love figuring things out, and one thing I learned a long time ago is that documentation makes a big difference. Unfortunately, we are now at the point where there is frequently too much documentation out there, ergo this list of my favorite books.
All books are linked to Amazon unless otherwise indicated. Personally, I typically read them on Kindle (the book prices and convenience are hard to beat).
A great introduction to R:
Awesome comprehensive tour of data mining that combines theory and application in just the right proportions, centered on the Weka system:
Excellent introduction to generating graphs in R:
One of my all-time favorites for digging into meaty analyses with a superb balance of theory and application:
My favorite go-to for understanding frequently murky statistical terminology:
Good source for understanding more sophisticated R graphs:
A superb survey of meta-analysis, with just the right blend of theory and application.
An excellent source of advice on how to make this frequently under-appreciated tool sing for you. Lots of practical info and examples on how to make reports and dashboards that talk directly to data sources (including databases), and do it in style.
Probably the best coverage of the theory of Deep Learning possible, complete with a thorough review of machine learning to provide the background necessary to understand this powerful technology. One of very few textbooks that bothers to explain why things are named the way they are, especially when those names are counter-intuitive or sometimes frankly wrong! Only people who really care about conveying a good understanding of their field take such pains, and it shows in the quality of this book.