This is going to be an ongoing article series about various aspects of Machine Learning. In the first post of the series I’m going to explain why I decided to learn and use R, and why it is probably the best statistical software for Machine Learning at this time.
R vs. popular programming languages like Java
Implementing Machine Learning algorithms is not an easy task because it requires a deep understanding of the inner workings of the algorithm. Furthermore, the details of the implementation can make a big difference, because applying advanced mathematical tricks can improve the performance of an algorithm substantially. R already provides sophisticated implementations of various Machine Learning algorithms and therefore relieves you of the tedious and error-prone task of implementing your own. In addition, R generally allows for much faster development: unlike a general-purpose programming language such as Java, it was designed around data analysis, which makes its code very compact. One line of code in R can sometimes do more work than 10 lines of code in Java. Note that a couple of excellent libraries for Machine Learning algorithms are available online, e.g. libsvm. They are usually written for a particular programming language, but most of them are also accessible from R (libsvm, for example, through the e1071 package), where you can use them in a less verbose way.
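To give a rough idea of what this compactness looks like, here is a minimal sketch that uses only functions shipped with base R. The data set (the built-in mtcars) and the formula are my own arbitrary choices for illustration and are not taken from any particular project.

```r
# Fit a logistic regression on the built-in mtcars data set in a single line.
# The formula (am ~ wt + hp) is an arbitrary choice for illustration only.
fit <- glm(am ~ wt + hp, data = mtcars, family = binomial)

# Predicted probabilities on the training data and the resulting accuracy.
prob <- predict(fit, type = "response")
accuracy <- mean((prob > 0.5) == mtcars$am)
print(accuracy)
```

The model fit itself is one line. Doing the same in Java would typically mean either pulling in a library and writing the surrounding boilerplate, or implementing the optimization yourself.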
R vs. Weka
I personally used to work a lot with Weka (Waikato Environment for Knowledge Analysis) since it is free and very easy to use. Within minutes even beginners should be able to apply the desired Machine Learning algorithm to a data set. But as is often the case, there is a tradeoff between usability and other software characteristics. Firstly, R is far more powerful and flexible than Weka. This means that you can do numerous things with R which are not possible with Weka. An example would be the very customizable way of plotting data in R. Secondly, once you feel comfortable with R you can save plenty of time. Since Weka’s interface is primarily GUI-based, you end up clicking and adjusting settings many times whenever you want to use its more powerful functions. So, even if a rather advanced procedure, like plotting the learning curve of a model, is available in Weka, applying it can be quite cumbersome. In contrast, R’s shell is terminal-like and its compact language allows for very short and powerful instructions.
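As an illustration of the learning curve example, the following is a minimal sketch in base R. The data set (mtcars), the model (a linear regression of mpg on wt and hp), the holdout size, and the helper name rmse are all assumptions I made for this example. A real analysis would use a larger data set and a proper resampling scheme.

```r
set.seed(42)  # make the random train/test split reproducible

# Hold out 10 rows of the built-in mtcars data for testing.
test_idx <- sample(nrow(mtcars), 10)
train <- mtcars[-test_idx, ]
test  <- mtcars[test_idx, ]

# Hypothetical helper: root mean squared error of a model on a data set.
rmse <- function(model, data) sqrt(mean((data$mpg - predict(model, data))^2))

# Fit a linear model on growing training subsets and record both errors.
sizes <- seq(5, nrow(train))
errors <- t(sapply(sizes, function(n) {
  fit <- lm(mpg ~ wt + hp, data = train[seq_len(n), ])
  c(train = rmse(fit, train[seq_len(n), ]), test = rmse(fit, test))
}))

# Plot training and test error against the training set size.
matplot(sizes, errors, type = "l", lty = 1, col = c("blue", "red"),
        xlab = "Training set size", ylab = "RMSE")
legend("topright", legend = colnames(errors), lty = 1, col = c("blue", "red"))
```

Changing the model or the error measure only requires editing the corresponding line and re-running the script, instead of repeating a sequence of clicks in a GUI.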
R vs. SAS
So far I haven’t used SAS extensively, but to my understanding SAS and R both provide a similar amount of functionality for Machine Learning purposes, with maybe even slight advantages for R. The main difference in any case is that while R is free and open-source, SAS is commercial and can be expensive. Thus, some companies, especially startups, might not be able to provide their employees with this software. For a detailed and well-written comparison of R vs. SAS, I recommend this blog post.
R vs. Matlab
First of all, Matlab is neither free nor open-source. On top of that, the software package is mostly geared towards mathematical applications, like solving equations. Yes, you can perform statistical computations with Matlab as well, but R was developed specifically for that niche and pretty much excels in it. The tremendous number of statistical packages and the better graphics tend to make R superior to Matlab (look here for more information). There is also a free open-source alternative called Octave, which is mostly compatible with Matlab. It is very similar to Matlab but inferior to it in some areas, e.g. usability. More information on the differences between Matlab and Octave can be found here.
Other reasons for R
A great advantage of R is that scientists have adopted it as their de facto standard. As a consequence, the latest cutting-edge techniques are often available in R first. Maybe that’s the reason why most Kaggle competition winners used the software as their main statistical tool, which suggests that even the most experienced Machine Learning practitioners prefer R. If those people work with it, then it shouldn’t be a bad idea to do the same, right? Finally, R is rapidly becoming the standard for developing statistical software. So learning R definitely is a good investment for the future.
Getting started with R
It’s no secret that R lacks user-friendliness and cannot be described as easy to learn. I had the same experience myself. But if you are already familiar with programming languages and take a detailed look at the following references, then learning R is not an issue at all and even turns out to be fun.
R Cookbook: great for beginners, very practical due to its format, useful as a quick reference guide
R in a Nutshell: great introduction to R which goes more into detail than the R Cookbook
The Art of R Programming: takes a look at R as a programming language, good for people who want to get serious about R, starts with the basics but it helps if you already have some experience with R
Online video resources: in case you want to learn R through online videos, this blog post provides an overview
Finally, I should mention that I don’t have a lot of experience with SAS and Matlab (or Octave). The paragraphs comparing those software packages to R are mostly based on online research and discussions with friends and colleagues. If you know more about that topic, I would love to hear your opinion.