This is going to be an ongoing article series about various aspects of Machine Learning. In the first post of the series I’m going to explain why I decided to learn and use R, and why it is probably the best statistical software for Machine Learning at this time.
R vs. popular programming languages like Java
Implementing Machine Learning algorithms is not an easy task because it requires a deep understanding of the inner workings of the algorithm. Furthermore, the details of an implementation can make a big difference, because applying advanced mathematical tricks can enhance the performance of the algorithm substantially. R already provides sophisticated implementations of various Machine Learning algorithms and therefore relieves you of the tedious and error-prone task of implementing your own. In addition, R generally allows for much faster development: unlike Java, it is not a general-purpose programming language, which allows its code to be very compact. One line of code in R can sometimes do more work than 10 lines of code in Java. Note that a couple of excellent libraries for Machine Learning algorithms are available online, e.g. libsvm. They are bound to a particular programming language, but they are usually also accessible from R, where you can use them in a less verbose way.
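To make the point about ready-made implementations a bit more concrete, here is a minimal sketch. It is written in Python rather than R or Java, purely for illustration, and it uses only the standard library. The function name manual_stdev and the sample numbers are made up for this example. It computes a sample standard deviation once by hand and once with a single library call.

```python
import math
import statistics


def manual_stdev(values):
    """Sample standard deviation, written out step by step (illustrative helper)."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two values")
    mean = sum(values) / n
    # Sum of squared deviations from the mean, divided by n - 1 (sample variance).
    squared_deviations = [(v - mean) ** 2 for v in values]
    return math.sqrt(sum(squared_deviations) / (n - 1))


# Made-up sample data, used only for this illustration.
data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

hand_written = manual_stdev(data)       # several lines of our own code
library_call = statistics.stdev(data)   # one call to an existing implementation

print(hand_written, library_call)
```

Both values should agree up to floating-point rounding. A standard deviation is of course trivial compared to a real Machine Learning algorithm, where the gap between a hand-written version and a mature library tends to be far larger.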
R vs. Weka
I personally used to work a lot with Weka (Waikato Environment for Knowledge Analysis) since it is free and very easy to use. Within minutes, even beginners should be able to apply the desired Machine Learning algorithm to a data set. But, as is often the case, there is a tradeoff between usability and other software characteristics. Firstly, R is far more powerful and flexible than Weka, which means that you can do numerous things with R that are not possible with Weka. One example is the highly customizable way of plotting data in R. Secondly, once you feel comfortable with R, you can save plenty of time. Since the interface of Weka is primarily GUI-based, you end up clicking and adjusting settings many times whenever you want to use its more powerful functions. So even if a rather advanced procedure, like plotting the learning curve of a model, is available in Weka, applying it can be quite cumbersome. In contrast, R’s shell is terminal-like and its compact language allows for very short and powerful instructions.
R vs. SAS
So far I haven’t used SAS extensively, but to my understanding SAS and R provide a similar amount of functionality for Machine Learning purposes, with maybe even slight advantages for R. The main difference in any case is that R is free and open-source, while SAS is commercial and can be expensive. Thus, some companies, especially startups, might not be able to provide their employees with this software. For a detailed and well-written comparison of R vs. SAS, I recommend this blog post.
R vs. Matlab
First of all, Matlab isn’t available for free either, nor is it open-source. On top of that, the software package is mostly geared towards mathematical applications, like solving equations. Yes, you can perform statistical computations with Matlab as well, but R was developed specifically for that niche and pretty much excels in it. The tremendous number of statistical packages and the better graphics tend to make R superior to Matlab (look here for more information). There is also free, open-source software called Octave, which is mostly compatible with Matlab. It is very similar to Matlab but inferior to it in some areas, e.g. usability. More information on the differences between Matlab and Octave can be found here.
Other reasons for R
A great advantage of R is that scientists have adopted it as their de facto standard. As a consequence, the latest cutting-edge techniques are often available in R first. Maybe that’s the reason why most Kaggle competition winners used the software as their main statistical tool, which suggests that even the most experienced Machine Learning practitioners prefer R. If those people work with it, then it shouldn’t be a bad idea to do the same, right? Finally, R is rapidly becoming the standard for developing statistical software. So learning R is definitely a good investment in the future.
Getting started with R
It’s no secret that R lacks user-friendliness and cannot be described as easy to learn. I had the same experience myself. But if you are already familiar with programming languages and take a detailed look at the following references, then learning R is not an issue at all and even turns out to be fun.
R Cookbook: great for beginners, very practical due to its format, useful as a quick reference guide
R in a Nutshell: great introduction to R which goes more into detail than the R Cookbook
The Art of R Programming: takes a look at R as a programming language, good for people who want to get serious about R, starts with the basics but it helps if you already have some experience with R
Online video resources: in case you want to learn R through online videos, this blog post provides an overview
Finally, I should mention that I don’t have a lot of experience with SAS and Matlab (or Octave). The paragraphs comparing those software packages to R are mostly based on online research and discussions with friends and colleagues. If you know more about that topic, I would love to hear your opinion.
Nice overview! But what about Python? There are packages like scikit-learn (http://scikit-learn.sourceforge.net/dev/index.html), PyML, PyMC, a DataFrame (pandas.pydata.org), and if something should be missing there is still rpy2.
Thank you Arthur for the additional information. I read about Python on a couple of Machine Learning blogs, so it really seems to be a reasonable alternative to R.
First, thank you for this good article.
What do you think of Octave? Good or …?
Hi Soroush, Octave is definitely a good tool for Machine Learning purposes and an alternative to R. For example, the Stanford Professor Andrew Ng uses it in his online ML-course (https://www.coursera.org/course/ml). However, as described in the paragraph “R vs. Matlab”, Octave is slightly inferior to Matlab, which is why I recommend R. But overall, Octave and Matlab are, as well as R, great pieces of software and being proficient in one of them can only be useful for you.
Interesting post. I too would like to ask Arthur’s question: “Why not Python”?
Also, I know that scripting in R or Python goes a lot faster than writing code in C/C++, but wouldn’t the advantage in running speed of the final product that results from using the latter languages make up for that?
Looking forward to your reply,
Regarding your first question, I can say that many Machine Learning practitioners seem to use Python and its rich libraries, especially in the USA. Rapid prototyping and trying out ideas are common use cases for it. I don’t have much experience with Python personally, but Hilary Mason, for example, uses Python a lot. So it obviously can’t be a bad choice!
Addressing your second question: In Machine Learning it is important to be able to rapidly try out new ideas and test your hypotheses. Using R or Python helps you there. Also, most R packages are themselves written in C/C++, which means that the performance gains would be very small. If you want to implement the whole algorithm yourself instead of using premade libraries, then good luck. Those libraries are optimized and written by experts, so there is a good chance that your own code will be vastly slower. Additionally, in most Machine Learning applications running speed is not the important factor; the accuracy or quality of the final model is. And this is where you should invest a lot of time.
Hope that helps, Florian
I’m taking my first steps in Data Mining (my thesis is about predicting student desertion at a high school)… I’m now searching for information about Weka and R. But actually I’m not sure how to read and interpret the knowledge.
First of all, thank you for providing a detailed review of different mining platforms. I have experience in VB .NET; however, in my current major I have opted for data mining because I found it interesting. My university is using Weka as a tool for learning purposes. But after reading this article I’m in a bit of a dilemma as to whether to invest time in Weka or whether it’s better to learn R as my first mining language, which would be beneficial in the long term.
Hi Sourabh, I think for your first steps in the data mining world Weka should be your tool of choice. It is very user friendly, has visualizations automatically built in and gives you a lot of functionality out of the box. Once you are more familiar with the concepts of the data mining field and have the desire to go “one level deeper” when it comes down to manipulating data and applying the newest and fanciest research findings, then move on to R.
As a person who has worked with R, Python, and Matlab/Octave extensively both in research and in industry, I have to state that I see Octave as the poor man’s Matlab and R as the poor man’s Octave. The one and only advantage of R over Matlab is that it’s free. The reasons R has gained popularity are that (a) it’s free, (b) it’s easy to install packages, (c) it has a good number of packages for a free product, and (d) it comes with a decent GUI on Windows (RStudio). Those might be good reasons depending on the tasks you want to achieve, but R certainly lags behind when it comes to more demanding tasks, and overall I find that it lacks quality in terms of performance, intuitiveness, and code readability. Python (with Anaconda) and Octave have a lot of quality but lack the fancy interface and one-click package installation. Other than that, they can do anything R does, only more efficiently (faster and with better quality of code). Personally I never understood the claim that R is “good for statistics”, simply because I don’t see what more it can offer. Python is a real programming language with object-oriented capabilities, and with NumPy and Spyder it becomes a lot like Matlab, where vectorised code is possible and efficient. The problem with Python is that module installation is somewhat tedious. Octave is basically a free interpreter of the Matlab language and has the beauty of mathematical, vectorized, high-level code plus gnuplot and many libraries. Its downside is that it has a bad GUI on Windows and a very decent one on Unix. However, when it comes to the actual programming language and its capabilities, I find that Python and Octave both outperform R in every way.
I’m not comparing R to Matlab because for me they’re not even in the same league. Matlab is FAST, and by that I mean really fast. It uses optimized libraries and is only comparable to C and Fortran code. It has a massive number of free toolboxes that do most things that MathWorks’ commercial toolboxes can do and, of course, what R’s packages can do as well. The language is beautiful, mathematically oriented with few data types, and ideal for any kind of numerical calculation. Personally, my data-analysis efficiency reaches its peak when I’m working in Matlab. It is also the tool of choice in Engineering/Bioinformatics/Biostatistics research as well as at colossal companies such as NASA, Airbus, McLaren, and Ferrari, just to name a few. I guess that too must mean something.
PS: It should be noted that this was my honest subjective opinion and that I have no agenda with any of the mentioned products.