Monday, February 24, 2014

Mining Titanic with Python at Kaggle

If you are looking for a newbie competition on Kaggle, you should focus on Titanic: Machine Learning from Disaster. It belongs to the Getting Started category and is intended to let competitors get familiar with the submission system, basic data analysis, and tools like Excel, Python and R. Since I'm new to data science, I will try to show you my approach to this competition step by step, using Python with the Pandas module.

First of all, some initial imports:
 import pandas as pd  
 import numpy as np  
We are importing pandas for general work and numpy for a single function (log2), which we will need later.

The file that interests us is called "train.csv". After a quick examination, we can use the column called "PassengerId" as the Pandas data frame index:
 trainData = pd.read_csv("train.csv")  
 trainData = trainData.set_index("PassengerId")  

Let's look at some basic properties:
 trainData.describe()                                        # summary statistics for numerical columns  
 numericalColumns = trainData.describe().columns             # names of the numerical columns  
 correlationMatrix = trainData[numericalColumns].corr()      # Pearson correlation matrix  
 correlationMatrixSurvived = correlationMatrix["Survived"]   # correlations with "Survived"  
 trainData[numericalColumns].hist()                          # histogram of each numerical column  
The describe function gives a summary (mean, standard deviation, minimum, maximum and quantiles) of the numerical columns in our data frame. Those are: "Survived", "Pclass", "Age", "SibSp", "Parch" and "Fare". So let's take those columns and calculate their correlation (Pearson method) with the "Survived" column. We receive two results indicating two promising columns: "Pclass" (-0.338481) and "Fare" (0.257307). This fits intuition, because we would expect that rich people somehow organized themselves better in terms of survival. Maybe they were more ruthless? Let's see the histograms for those numerical columns (last line in the above code):
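(A small practical note: pandas draws these histograms with matplotlib under the hood, so if you are following along in a plain Python script you may need to import it and call show() explicitly. A minimal sketch, assuming matplotlib is installed:)
 import matplotlib.pyplot as plt  
 trainData[numericalColumns].hist(figsize=(10, 8))  # one histogram per numerical column  
 plt.show()  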
Let's examine the other, non-numerical data. We have "Name", "Sex", "Ticket", "Cabin" and "Embarked". The "Name" column contains the unique names of passengers. We will skip it, because we can't reasonably correlate survival with this value. Of course, based on the name we could determine whether someone was a V.I.P. of some kind, and then use that in a model. But we don't have this data, so without additional research it is worthless. Then we have the "Ticket" and "Cabin" values. Some of them are unique, some are not, and they are encoded in different ways. Knowing the layout of cabins on the ship and the encoding system, we could try to work with these columns, but we don't have such data either. The situation is the same with "Embarked". If people were placed on the Titanic according to their place of embarkation, this could have had a major impact on survivability: people in the front parts of the ship had significantly less time to react when the crash occurred. But we don't know how embarkation affected placement on the ship. Maybe I will try to examine this column in later posts. The last non-numerical column is the most interesting one. It is called "Sex" (my blog will probably be banned in the UK for using this magical keyword) and describes the gender of the persons on the Titanic. Using the simple value_counts function we can determine the structure of this column (as well as of the others mentioned earlier):
 trainData["Sex"].value_counts()  
We receive "male 577" and "female 314". So that's was the problem. Male and female are strings, not numerical values. But we can easily change it:
 trainData["Sex"][trainData["Sex"] == "male"] = 0  
 trainData["Sex"][trainData["Sex"] == "female"] = 1  
 trainData["Sex"] = trainData["Sex"].astype(int)  

What can we expect from gender and the chances of surviving on the Titanic? As we remember from the movie "Titanic", the people organizing the evacuation from the sinking ship called for women and children to board the lifeboats first. I'm not sure how accurate the movie was, but it showed that the lifeboats were in fact filled mostly with women. And we get a hint of that when we calculate the correlation again: 0.543351. This is a nicer result than with "Pclass" and "Fare".
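As a quick sketch of where that 0.543351 comes from: once "Sex" is numeric, it appears among the columns returned by describe(), so we can simply repeat the earlier correlation calculation:
 numericalColumns = trainData.describe().columns   # now includes "Sex"  
 trainData[numericalColumns].corr()["Survived"]    # correlation of each column with "Survived"  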

The first idea that comes to my mind is to build a classification tree with the first split according to the "Sex" parameter. Let's calculate the overall survival ratio and then the survival ratios for males and females:
 totalSurvivalRatio = trainData["Survived"].value_counts()[1] / float(trainData["Survived"].count())  
 totalDeathRatio = 1 - totalSurvivalRatio  
 maleSurvivalRatio = trainData[trainData["Sex"]==0]["Survived"].value_counts()[1] / float(trainData[trainData["Sex"]==0]["Survived"].count())  
 maleDeathRatio = 1 - maleSurvivalRatio  
 femaleSurvivalRatio = trainData[trainData["Sex"]==1]["Survived"].value_counts()[1] / float(trainData[trainData["Sex"]==1]["Survived"].count())  
 femaleDeathRatio = 1 - femaleSurvivalRatio  
So we have: totalSurvivalRatio = 0.38383838383838381, maleSurvivalRatio = 0.18890814558058924 and femaleSurvivalRatio = 0.7420382165605095. It looks like being male or female did affect your chances on the sinking ship. Of course, this makes sense when there is time to evacuate and a limited number of seats in the boats; in a more sudden disaster, under different conditions, the odds might be reversed.
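As a side note, the same ratios can be obtained more compactly with a groupby; this is just an alternative sketch, not a different result:
 trainData["Survived"].mean()                 # overall survival ratio  
 trainData.groupby("Sex")["Survived"].mean()  # survival ratio per gender (0 = male, 1 = female)  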

To be more formal, let's calculate the Information Gain. In simple words, information gain is the amount of entropy reduction after splitting a group by a parameter. For a binary outcome with survival probability p, the entropy is H = -p*log2(p) - (1-p)*log2(1-p), and the information gain of a split is the total entropy minus the entropies of the resulting subsets, weighted by the probability of landing in each subset. So we need to estimate the total entropy of the set, then the entropy of each subset that emerges after dividing the main set by "Sex", and subtract. Again, simple code in Python:
 males = len(trainData[trainData["Sex"]==0].index)  
 females = len(trainData[trainData["Sex"]==1].index)  
 persons = males + females  
 totalEntropy = - totalDeathRatio * np.log2(totalDeathRatio) - totalSurvivalRatio * np.log2(totalSurvivalRatio)  
 maleEntropy = - maleDeathRatio * np.log2(maleDeathRatio) - maleSurvivalRatio * np.log2(maleSurvivalRatio)  
 femaleEntropy = - femaleDeathRatio * np.log2(femaleDeathRatio) - femaleSurvivalRatio * np.log2(femaleSurvivalRatio)  
 informationGainSex = totalEntropy - ((float(males)/persons) * maleEntropy + (float(females)/persons)* femaleEntropy)  
So, the total entropy (calculated for the "Survived" value) equals 0.96070790187564692. Since we expect it to be between 0 (no entropy, a perfectly pure set) and 1 (a maximally impure set), this means more or less that similar numbers of people died and survived, so plain guessing will not be very effective. After dividing into males and females, the entropy looks different: maleEntropy = 0.69918178912084072 and femaleEntropy = 0.8236550739295192. And informationGainSex = 0.21766010666061419, which is a pretty nice result. To be precise, we should also calculate the information gain for the other columns, but I'm skipping this step in this tutorial.
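If you do want to compare other columns, a small helper makes this easy. This is just a sketch of such a function (the name informationGain and its exact structure are my own addition, not part of the original tutorial):
 def informationGain(df, column, target="Survived"):  
     def entropy(series):  
         # entropy of an outcome: -sum(p * log2(p)) over the observed values  
         probs = series.value_counts(normalize=True)  
         return -(probs * np.log2(probs)).sum()  
     # entropy of each subset, weighted by the probability of being in that subset  
     weightedEntropy = sum(  
         (len(group) / float(len(df))) * entropy(group[target])  
         for _, group in df.groupby(column)  
     )  
     return entropy(df[target]) - weightedEntropy  
 
 informationGain(trainData, "Sex")     # ~0.2177, matches informationGainSex above  
 informationGain(trainData, "Pclass")  # same idea for another column  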

Since we calculated the male and female survival ratios, we can use them for a simple classification: females survive and males die. Such a simple classifier should give you 0.76555 accuracy on the Kaggle public leaderboard. I will try to achieve a better result in the next part of this tutorial.
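For reference, here is a minimal sketch of how such a submission could be generated (assuming the test set is in "test.csv" and the submission needs "PassengerId" and "Survived" columns; the output file name is my own choice):
 testData = pd.read_csv("test.csv")  
 submission = pd.DataFrame({  
     "PassengerId": testData["PassengerId"],  
     "Survived": (testData["Sex"] == "female").astype(int),  # females predicted to survive  
 })  
 submission.to_csv("gender_model.csv", index=False)  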

Monday, February 17, 2014

Book review: Data Science for Business

Since I'm interested in data science but am a newbie in this field, I decided to read some introductory book dedicated to the topic. The first book I read was Data Science for Business, written by Foster Provost and Tom Fawcett. Below are my descriptions of each main chapter of this book:
  • The first chapter is dedicated to the overall definitions of data science, big data and related concepts.
  • The second chapter introduces the "canonical data mining tasks".
  • Chapter 3 shows the first steps with supervised segmentation and decision trees.
  • The next chapter adds linear regression, support vector machines and logistic regression.
  • Chapter 5 - in my opinion the most useful one - defines overfitting. The authors show examples of how one can hit the overfitting problem, but also how to avoid it and deal with the resulting issues.
  • Chapter 6 introduces additional data science tools: similarity, nearest neighbors and clustering methods.
  • Chapter 7 focuses on aspects strictly related to applying the previously mentioned tools to business: expected profit. This well-written chapter shows that there is almost always a second layer beneath the pure data tools, the business layer.
  • A great data scientist has to present his results and hypotheses to stakeholders at some point. He can use lots of complicated mathematical formulas, but he can also use simple plots with additional information to visualize his ideas nicely. Chapter 8 describes some fundamental "curves" which are often used in data science.
  • In chapter 9, the authors describe Bayes' rule and discuss its advantages and disadvantages.
  • Chapter 10 is dedicated to "text mining". The authors know that they are only scratching the surface of this issue, but on the other hand the reader can find here some basic ideas on how to work with text and how to start researching different methods.
  • The final evaluation of the example problem used throughout the book is done in chapter 11.
  • Chapter 12 discusses other techniques for approaching analytical tasks: co-occurrence and associations (example usage: determining items which are bought together). Profiling, link prediction and data reduction are also discussed, with a nice example of the Netflix Prize. The authors also clearly explain why an ensemble of models can give better results in some cases.
  • In chapter 13, the authors show how to think about data science in a business context, but also point out how to work as a data scientist in a business environment.
  • The last chapter is dedicated to an overall summary. The authors give hints on how we should ask data-science-related questions and how to think about data science in general.
Reading this book was very satisfying. I wasn't hit by an enormous quantity of new definitions, equations and examples. For a newbie in data science, reading this book chapter after chapter is like going step by step after your mentor. Using one main example throughout the whole book was a great idea: the reader can observe different techniques, and the problems related to them, applied to the same business situation. Also, business awareness is raised from chapter to chapter. I recommend this book both to data scientist wannabes and to "suits" who want to hire some geeks to examine the business possibilities hidden in their gathered data.

Actually, I can't say anything bad about this book. Of course, I would just love to see a complementary handbook with code in Python or R, but I guess there are plenty of such books.

Sunday, February 9, 2014

Test your data science skills - Kaggle competitions

I recently started directing my interests towards Data Science. I started reading related books and doing related MOOCs. It is a very interesting area of science, with particularly good applications in business. But dry learning from books and MOOCs can be disconnected from real-life problems and can also get boring at some point. For programming, the solution is easy: coding competitions. Here is a list of many pages dedicated more or less to coding competitions. But what about data science?

Luckily, there is one website hosting data science competitions: kaggle.com. There are several categories of contest organized on Kaggle:
  • Featured: often complex problems, but heavily sponsored in terms of prizes; organized by big companies.
  • Masters: limited-access competitions. You have to earn access to Kaggle's Master tier by achieving great results in previous competitions.
  • Recruiting: competitions dedicated to recruitment.
  • Prospect: competitions without leaderboards. Their goal is usually to explore various data sets and discuss the results with other Kagglers.
  • Research: problems related to strict research areas.
  • Playground: interesting problems which are solved for fun.
  • Getting Started: tutorials
At the time I'm writing this post, there are 14 competitions (4 of them long-lasting tutorial competitions), with a total prize pool of $334K.

How do those competitions work? When you register for a competition, you receive access to train and test data plus some additional information. The train data contains the known values of what you have to predict. The test data is the actual data on which your model will be evaluated. After running your model on the test data, you submit your results to an automated web application and receive an accuracy score. Those scores build the public leaderboard. It is called public because only part of your results is taken into account when calculating it; the other part is taken into account after the competition closes, and the final leaderboard is constructed from it. This split makes it harder to overfit a model by examining the results.

So how do you start with data science competitions? I recommend starting with the Titanic competition. It has nice tutorials in Excel, Python and R, and looks quite easy, at least at the beginning.

Anyway, I wish you GL & HF. See you on the leaderboards!