Friday, March 31, 2017

Bike Sharing Demand #01


A bike sharing system is one of the coolest features of a modern city. Such a system lets citizens and tourists easily rent bikes and return them at a different place than where they were rented. In Wrocław, the first 20 minutes of every ride are even free, so if your route is short enough, or you can hop between automated bike stations within that time window, you can use rented bikes practically for free. It is a very tempting alternative to crowded public transportation or a private car on jammed roads.

On 28 May 2014 Kaggle started a knowledge competition whose goal was to predict the number of bike rentals in a city bike sharing system. The system is operated by Capital Bikeshare, which describes itself as follows: "Capital Bikeshare puts over 3500 bicycles at your fingertips. You can choose any of the over 400 stations across Washington, D.C., Arlington, Alexandria and Fairfax, VA and Montgomery County, MD and return it to any station near your destination."

The problem with a bike sharing system is that it needs to be stocked with bikes ready to borrow. The owner of such a system has to estimate demand for bikes and prepare an appropriate supply. If there are not enough bikes, the system generates disappointment and won't be popular. If there are too many unused bikes, they generate unnecessary maintenance costs on top of the initial investment. So a good estimate of rental demand could improve customer satisfaction and reduce unnecessary spending.

How to approach such a problem? Usually, the first step should be dedicated to getting some initial, general knowledge about the available data. This is called Exploratory Data Analysis (EDA). I will perform EDA on the data available in this competition.

As we can see, the train data has three additional columns: {'casual', 'count', 'registered'}. Our goal is to predict the 'count' value for each hour of the missing days. We know that 'casual' and 'registered' should sum exactly to the total 'count'. We can also observe their relations on scatter plots: 'registered' seems to be nicely correlated with 'count', but 'casual' is also somewhat related. This suggests that instead of predicting 'count' directly, one may predict 'registered' and 'casual' separately and submit their sum as the total 'count'.
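A minimal sketch of those checks, assuming the competition's train.csv has been downloaded into an input directory (the path is my guess):

 import pandas as pd
 import matplotlib.pyplot as plt

 train = pd.read_csv("../input/train.csv", parse_dates=["datetime"])

 # 'casual' + 'registered' should add up to 'count' in every row
 print((train["casual"] + train["registered"] == train["count"]).all())

 # scatter plots of both components against the target
 train.plot.scatter(x="registered", y="count")
 train.plot.scatter(x="casual", y="count")
 plt.show()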

Every data point is indexed by a full hour in datetime format, so after splitting it into date and time components we have sixteen columns. We can easily generate a histogram for each of them. By visually examining these histograms we can point to some potentially interesting features: '(a)temp', 'humidity', 'weather', 'windspeed', and 'workingday'. Are they important? We don't know yet.
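The splitting itself can look roughly like this, continuing from the frame loaded above (the component column names are my own choice):

 train["year"] = train["datetime"].dt.year
 train["month"] = train["datetime"].dt.month
 train["day"] = train["datetime"].dt.day
 train["hour"] = train["datetime"].dt.hour
 train["dayofweek"] = train["datetime"].dt.dayofweek  # 0 = Monday, ..., 6 = Sunday

 # one histogram per column for a quick visual scan
 train.hist(figsize=(14, 10))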

We can examine those features further and pick the ones with rather discrete values, i.e. with at most 10 unique values (my arbitrary cut-off). For each such feature I will sum the rentals for each hour of day, separately for each unique value. Sounds complicated? Maybe plots will bring some clarity ;). The first plot shows the aggregation that keeps the 'season' information. It clearly shows that at some hours there were twice as many bike rentals for season '3' as for season '1'. That would not be surprising if season '3' were summer, which would automatically make '1' winter. But according to the data description that is not the case: season - 1 = spring, 2 = summer, 3 = fall, 4 = winter. Fascinating.
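Such an aggregation is a single groupby; a sketch for 'season', assuming the 'hour' column added above:

 # total rentals per hour of day, kept separate for each season
 hourlyBySeason = train.groupby(["hour", "season"])["count"].sum().unstack("season")
 hourlyBySeason.plot(figsize=(10, 5), title="Rentals per hour by season")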

The second feature which might be interesting in this analysis is 'weather'. It is described as: weather - 1: Clear, Few clouds, Partly cloudy, Partly cloudy; 2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist; 3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds; 4: Heavy Rain + Ice Pallets + Thunderstorm + Mist, Snow + Fog. It seems that the 'weather' value corresponds to the severity of the weather conditions: the higher the value, the worse the weather. We can see that on the histogram and on the aggregated hourly plot.
 
Another interesting piece of information can be extracted from the calculated feature 'dayofweek'. For each date the day of the week was computed and encoded as 0 = Monday up to 6 = Sunday. As we can see on the related image, there are two different trends in bike sharing depending on the day of week. Two days form the first trend: '5' and '6', which mean Saturday and Sunday. In western countries these days are the weekend and are considered non-working days for most of the population. The remaining days, which all follow the second trend, are considered working days. We can easily spot peaks on working days, which are, by my guess, related to traveling to and from the workplace. In the "weekend" trend we observe smooth curves which probably reflect general human activity over weekend days.
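The corresponding plot is the same groupby pattern applied to 'dayofweek' (a sketch):

 # total rentals per hour of day, one line per day of week; lines 5 and 6 form the "weekend" trend
 hourlyByDay = train.groupby(["hour", "dayofweek"])["count"].sum().unstack("dayofweek")
 hourlyByDay.plot(figsize=(10, 5), title="Rentals per hour by day of week")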
OK, it seems to be a good time to examine correlations in this data set. Let's start with numerical and categorical correlations against our target feature 'count'. It is not surprising that 'registered' and 'casual' are nicely correlated with it; we saw that earlier. 'atemp' and 'temp' also seem to correlate to some degree. The rest of the features have rather low correlation, but 'humidity' stands out among them as possibly worth further investigation.
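Those numbers come from a single corr() call against the target; a minimal sketch:

 # correlation of every numeric column with 'count', strongest first
 print(train.corr()["count"].sort_values(ascending=False))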

What about the feature-to-feature correlations overall? We can examine them visually on a correlation heatmap. On the plot we can spot the correlation between 'atemp' and 'temp'. In fact 'atemp' is defined as the "feels like" temperature in Celsius, so it is subjective, but it correlates nicely with the objective temperature in Celsius. Another correlation is between 'month' and 'season', which is not surprising at all. We can observe a similar situation with 'workingday' and 'dayofweek'.
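The heatmap itself is one call, assuming seaborn is available (plain matplotlib would also work):

 import seaborn as sns

 sns.heatmap(train.corr(), annot=True, fmt=".2f", cmap="coolwarm")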

So how much non-redundant information do we have in this data set? After removing the target features, there are thirteen numerical or numerically encoded features. We can run a Principal Component Analysis on them and calculate how much of the data set each resulting component covers. After applying a cumulative sum to the results we get a plot which tells us that the first 6 components after PCA explain over 99% of the variance in this data set.
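A sketch of that PCA check with scikit-learn; I am guessing which columns get dropped, and whether the features were scaled beforehand is not stated above, so the exact percentages may differ:

 import numpy as np
 from sklearn.decomposition import PCA

 features = train.drop(["datetime", "count", "casual", "registered"], axis=1)
 pca = PCA().fit(features)

 # cumulative share of variance explained by the first k components
 print(np.cumsum(pca.explained_variance_ratio_))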

As you can see, even though the Bike Sharing Demand dataset is rather simple, it allows for some data exploration and for checking some "common sense" guesses about human behavior. Will this data work well in a machine learning context? I don't know yet, but maybe I will have time to check it. If you would like to look at my step by step analysis, you can check my github repository with the EDA Jupyter notebook. I hope you will enjoy it!

Sunday, March 26, 2017

Air Quality In Poland #07 - Is my data frame valid?

OK, so we have a Jupyter Notebook which handles all the data preprocessing. Since it got quite large, I decided to save the resulting data frame to disk and leave the notebook as is. I perform further analysis in a separate notebook, so the whole flow is cleaner now. Both notebooks are located in the same workspace, so the way of using them didn't change.

While working with one big data frame I encountered unwanted behavior of my process. The data frame I created takes 1 GB after saving to the hard drive, while all the raw data I processed takes 0.5 GB, so my preprocessing effectively added 0.5 GB. On my 4 GB ThinkPad X200s, when I try to read a file that is already loaded, the laptop hangs. To avoid that, I check whether the variable already exists, and if it does, I skip loading it:

 if 'bigDataFrame' in globals():  
   print("Exist, do nothing!")  
 else:  
   print("Read data.")  
   bigDataFrame = pd.read_pickle("../output/bigDataFrame.pkl")  

Since I have this data frame, it should be easy to recreate the result of searching for the most polluted stations over the year 2015. And in fact it is:

 bigDataFrame['2015-01-01 00:00:00':'2015-12-31 23:00:00'].idxmax()  

As a byproduct of such a search we also receive the date and time of each extreme measurement:


It looks like we got the same values as in the previous approach, so I'm assuming that I merged and concatenated the data frames correctly. If we would like to know the actual values, we can swap idxmax() for max().

That's all I prepared for this short post. Don't get overwhelmed by your data and keep doing science! Till the next post!

Friday, March 24, 2017

Air Quality In Poland #06 - Big Data Frame part 2

Since we know how to restructure our data frames and how to concatenate them properly, we may start building one big data frame. To select the interesting pollutants and years of measurements we have to build two lists:

 pollutants = importantPollutants  
 years = sorted(list(dataFiles["year"].unique()))  

and then we have to run two nested loops which walk over the relevant files and concatenate or merge the generated data frames:

1:  bigDataFrame = pd.DataFrame()  
2:  for dataYear in years:   
3:    print(dataYear)  
4:    yearDataFrame = pd.DataFrame()  
5:    for index, dataRow in tqdm(pollutantsYears[pollutantsYears["year"] == dataYear].iterrows(), total=len(pollutantsYears[pollutantsYears["year"] == dataYear].index)):  
6:      data = pd.read_excel("../input/" + dataRow["filename"] + ".xlsx", skiprows=[1,2])  
7:      data = data.rename(columns={"Kod stacji":"Hour"})  
8:    
9:      year = int(dataRow["year"])  
10:      rng = pd.date_range(start = str(year) + '-01-01 01:00:00', end = str(year+1) + '-01-01 00:00:00', freq='H')  
11:    
12:      # workaround for 2006_PM2.5_1g, 2012_PM10_1g, 2012_O3_1g  
13:      try:  
14:        data["Hour"] = rng  
15:      except ValueError:  
16:        print("File {} has some mess with timestamps".format(dataRow["filename"]))  
17:        continue  
18:    
19:      data = data.set_index("Hour")  
20:      data = data.stack()  
21:      data = pd.DataFrame(data, columns=[dataRow["pollutant"]])  
22:      data.index.set_names(['Hour', 'Station'], inplace=True)  
23:    
24:      yearDataFrame = pd.concat([yearDataFrame, data], axis=1)  
25:      
26:    bigDataFrame = bigDataFrame.append(yearDataFrame)  

This code is more or less the same as in the previous post, but you can see some differences. Line (10) generates the time index for a different year each time, so we don't have to worry about leap years. Lines (12-17), on the other hand, handle problems with 3 data files which do not start on the first hour of the new year. Perhaps I will take care of them later; for now they are ignored.

Why do we need nested loops? In the previous post I worked towards merging data frames containing different pollutants into one per year. Once we have such a data frame, we just need to append it to our target big data frame.

After creating the data frame with all the interesting data points we should save it to disk, so later we only have to read it back in order to start the analysis.

  bigDataFrame.to_pickle("../output/bigDataFrame.pkl")  

That's all for today, thanks for reading!

Sunday, March 19, 2017

Air Quality In Poland #05 - Big Data Frame part 1

In the last post I showed how to find an example piece of information - the maximal value of each pollutant across the year 2015. I believe that the example was good, but poorly executed. I basically iterated over the relevant files and saved the calculated values. I didn't keep the content of the files in memory, so if I wanted to find the minimal values I would need to execute the loop again and basically waste time.

A better approach would be to iterate over the files once and store their content in a properly organized data frame. Actually, some people claim that such organizing and preprocessing of data takes up to 80% of their usual analytics process - and once you have a nice and clean data frame you can start to feel at home.

So my target now is to prepare one big data frame which will contain all measurements from all measuring stations for all main pollutants across the years 2000-2015. Since the procedure to create such a data frame takes some non-trivial steps, I decided to split it into two blog posts.

OK, so we start with reading the data and renaming a column which is wrongly labeled:

 data1 = pd.read_excel("../input/2015_PM10_1g.xlsx", skiprows=[1,2])  
 data1 = data1.rename(columns={"Kod stacji":"Hour"})  

After reading the xlsx data into a pandas data frame we can observe some kind of anomaly in how the datetime fields are read. It seems that from row 3 onward a constant 0.005 s is added to the previous row, which accumulates to 43.790 s over the whole file.


It looks like Microsoft uses its own timestamp representation in xlsx files, different from the common Unix timestamp. There are probably methods for dealing with it, but I decided to recreate the index by hand:

 rng = pd.date_range(start = '2015-01-01 01:00:00', end = '2016-01-01 00:00:00', freq='H')  
 data1["Hour"] = rng  
 data1 = data1.set_index("Hour")  
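For completeness, one such method would presumably be to snap the drifting timestamps back to full hours instead of rebuilding the index, something along these lines (untested against these files):

 data1["Hour"] = data1["Hour"].dt.round("H")  # round e.g. 23:00:43.790 back to 23:00:00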

Now we have a nice and clean data frame with measurements for one pollutant. How do we merge it with the data frames for other pollutants? Answering that question was probably the most difficult part of this problem so far. As you can see in the raw xlsx files, each measurement lives in a three-dimensional space. The first dimension is the pollutant, and we get it from the filename. The second dimension is date and time, and we get it from the spreadsheet index. The third dimension is the measuring station, and it sits in the spreadsheet columns. Since the target data frame is two-dimensional, I had to decide whether I want multilevel columns or a multilevel index, and where to put each dimension. Date and time is the obvious pick for the index, since that is the natural way to analyze such data. Next I had to pick one or more "features". I plan to work on the main pollutants only, so they seem to be a good pick for the features/columns. The measuring station is what was left. Stations are built and decommissioned at different times, so I decided to treat them as an additional "spatial" level of the index; this way, even a station that worked only for some weeks or months generates fewer "NaN" cells than it would as a column level. I hope this makes sense and will not cause problems in the future. What is most funny, to do all that I just need to use the stack() function, recreate a data frame from the resulting series and set the proper multi-index names:

 data1 = data1.stack()  
 data1 = pd.DataFrame(data1, columns=["PM10"])  
 data1.index.set_names(['Hour', 'Station'], inplace=True)  

Only programmers know the pain of hours of thinking compressed into a few lines of code. Thank god no one is paying me per line of code produced. Not that anyone is paying me anything for this analysis ;). If we apply the same transformations by hand to another pollutant and create a second data frame, we can easily merge the two (dataMerged = pd.concat([data1, data2], axis=1)) and receive the (head of the) foundation for the target data frame:


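One nice payoff of this index layout is that either level can be sliced cheaply with xs(); a sketch with a made-up station code:

 # all PM10 measurements from one (hypothetical) station
 print(dataMerged.xs("SomeStationCode", level="Station")["PM10"])

 # all stations' measurements at a single timestamp
 print(dataMerged.xs("2015-01-01 01:00:00", level="Hour"))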
That's all for today. In the next post I will try to wrap this code in loops over years and pollutants, so hopefully after running them I will receive my target big data frame. Thanks.

Wednesday, March 15, 2017

Air Quality In Poland #04

Hey! Welcome to the 4th post dedicated to messing with the air quality in Poland data. In the last post we prepared the data files for easy manipulation. Now it is time to actually analyze something.

For the first analysis I decided to look for the most extreme values of each of the most important pollutants across the year 2015. So basically I will look for dead zones across Poland ;). And what are those important pollutants? In Poland, the air quality index is calculated as the worst class among the classes for "PM10", "PM25", "O3", "NO2", "SO2", "C6H6" and "CO".

In order to find such places we need to locate the proper files. To do that we need a simple selection from the data frame:

 importantPollutants = ["PM10", "PM25", "O3", "NO2", "SO2", "C6H6", "CO"]  
 pollutants2015 = dataFiles[(dataFiles["year"] == "2015") & (dataFiles["resolution"] == "1g") &   
               (dataFiles["pollutant"].isin(importantPollutants))]  

which gives us a much smaller data frame with the list of interesting files:


Since we have the relevant list of files, we can write a simple (and probably not super efficient) loop over them which finds the maximum value of each pollutant and the corresponding measuring station. It looks like this:

 worstStation = {}  
 for index, dataRow in tqdm(pollutants2015.iterrows(), total=len(pollutants2015.index)):  
   dataFromFile = pd.read_excel("../input/" + dataRow["filename"] + ".xlsx", skiprows=[1,2])  
   dataFromFile = dataFromFile.rename(columns={"Kod stacji":"Godzina"})  
   dataFromFile = dataFromFile.set_index("Godzina")  
   worstStation[dataRow["pollutant"]] = dataFromFile.max().sort_values(ascending = False).index[0]  

This loop takes 2 minutes on my ThinkPad X200s and produces only a dictionary with pollutants as keys and station codenames as values. We can easily count the value occurrences and see the worst "dead zone":
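A sketch of that count with collections.Counter:

 from collections import Counter

 Counter(worstStation.values())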

 Counter({u'LuZarySzyman': 1,  
      u'MzLegZegrzyn': 1,  
      u'MzPlocKroJad': 1,  
      u'OpKKozBSmial': 1,  
      u'PmStaGdaLubi': 1,  
      u'SlRybniBorki': 2})  

Since "SlRybniBorki" doesn't says much, we must consult "Metadane_wer20160914.xlsx" file which allows us to decode this station as "Rybnik, ul. Borki 37 d". Whoever lives there, I feel sorry for you!

That's all for today, thanks for reading! ;)

Sunday, March 12, 2017

Air Quality In Poland #03

We now have a nice data frame which contains the list of data files and a description of the content of each. It looks like this (first 5 rows):


We can now easily check how much data we have for each year,


what pollutants were measured


and how many files are available for each resolution:


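All three checks boil down to value_counts() on the respective columns; a minimal sketch:

 print(dataFiles["year"].value_counts().sort_index())
 print(dataFiles["pollutant"].value_counts())
 print(dataFiles["resolution"].value_counts())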
As we can see in the resolution counts, this is the place where something isn't exactly as it is supposed to be: 9 files have some mess in the resolution column. To fix that, we need to find the rows with invalid resolution and replace their pollutant and resolution values by hand:

 dataFiles.ix[dataFiles["resolution"] == "(PM2.5)_24g", 'pollutant'] = "SO42_(PM2.5)"  
 dataFiles.ix[dataFiles["resolution"] == "(PM2.5)_24g", 'resolution'] = "24g"  

After figuring out the proper names for all the messed-up files (details in the notebook on github) we can check the overall status of the files data frame by issuing dataFiles.describe():


As we can see, the count value in the resolution column doesn't add up to the expected 359. That is because there is one data file called "2015_depozycja" which contains data about deposition experiments, and these are out of my scope for now. I decided to remove this row from the data frame.
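Removing it is a one-liner, assuming its base filename is exactly "2015_depozycja":

 dataFiles = dataFiles[dataFiles["filename"] != "2015_depozycja"]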

So now we have a clean data frame with filenames and file content descriptions in separate columns. Thanks to this, we will be able to easily access the needed data and use it for further analysis.

Tuesday, March 7, 2017

Air Quality In Poland #02

In order to do the analysis, we have to obtain the relevant data. In Poland, the best place for air data is the official website of the Chief Inspectorate for Environmental Protection, the main official Polish government agency responsible for measuring and analyzing changes in the natural environment. The agency provides packages with archival measurement data. Currently those packages cover the years 2000-2015. They are not updated in real time, but it seems that this will be enough for my analysis.

Before reproducing research, everyone should check whether the available data is exactly the same as the data originally used. For this purpose I calculated md5sums of the downloaded files and archives. Strictly speaking I should calculate a sum for every file inside the archives, but I believe that decompressing them without errors is enough to assume we are working on the same files. Here's my list:

 $ md5sum *  
 b6f86aec3bee46d87db95f0e5e93ea70 2000.zip  
 a7c045e40179b297c282d745d9cbc053 2001.zip  
 ba05c06c7a2681f1aaa54c6e9dd88a34 2002.zip  
 3f4215d89a64a5a6e52b205eec848a83 2003.zip  
 4053dcc35f228bd8233eb308d67f2995 2004.zip  
 9e23571c25bf8bb6ad77fc006007a047 2005.zip  
 b37ff6a8f0d12539a8d026b882ecbb49 2006.zip  
 5fe5b74264d1d301190446ed13b5ffa0 2007.zip  
 d63f9e4fcc9672b1136eb54188e12d2f 2008.zip  
 b437a9d17e774671a334796789489d9f 2009.zip  
 3a3cd0db3d14501d07db5f882225d584 2010.zip  
 d0e0e19f7517ed0b1a67260e9840bd89 2011.zip  
 58ebcdd2c36c5ef0f7117a42e648822a 2012.zip  
 36eefbd5ae62651807fa108c64ac155e 2013.zip  
 47836093ac1d4aa1b71edc6964a53a3c 2014.zip  
 4030e4d5b1e5ba6c1c5876b89b7aaa55 2015.zip  
 71665e79bf0a6a2f3765b0fcbb424b70 Kopia Kody_stacji_pomiarowych.xlsx  
 b7ff94632d6c60842980ea882ae1b091 Metadane_wer20160914.xlsx  
 bfa2680d5fbb08f9067f467c8a864235 Statystyki_2000-2015_wer20160914.xlsx  

After obtaining the same files we can start unzipping them into the input directory, which is located at the same level as the workspace directory. I'm not including the input data files in the repository for the following reasons: 1) I'm not sure the license allows redistributing them; it might be that the only valid way to obtain them is through the mentioned website. 2) They have significant size - 490 MB unpacked. It would be a waste of bandwidth if anyone interested only in the source code had to download them. 3) They are xlsx files, which are binary, and it is not good practice to put binary files into a source code version control system.

So what data do we have exactly? After unpacking all the zip files, we should obtain 359 data files plus the 3 additional files that were already unpacked. The data files follow the naming convention "xx_yy_zz.xlsx". xx is the year; we have data from 2000-2015, so we expect a number in this range in the first section of the filename. The yy part denotes the pollutant, for example "NO2". The last part (zz) tells us about the averaging of the measurement values - "1g" means the data was averaged over one hour for each hour, and "24g" means it was averaged over 24 hours, once per day.

To read all the filenames I run a nested list comprehension:

 import glob, os
 from os.path import basename
 filenames = [ os.path.splitext(wholeFilename)[0] for wholeFilename in
        [ basename(wholePath) for wholePath in glob.glob("../input/2*.xlsx") ] ]

After creating the list of data filenames, I build a data frame with the filename and columns for the year, pollutant and resolution:

 dataFiles = pd.DataFrame({"filename": filenames})  
 dataFiles["year"], dataFiles["pollutant"], dataFiles["resolution"] = dataFiles["filename"].str.split('_', 2).str  

Since we created a data frame with columns describing the file contents, we can easily access the data that will be interesting in further analysis.

But as usual when dealing with data, there is an additional problem with this approach. I will describe it in the next post. Stay tuned.

Saturday, March 4, 2017

Air Quality In Poland #01

Dear reader! Welcome to my first post tagged "Get Noticed 2017!". As you can see, this is not the first post on this blog. But as you can also see, I'm not publishing regularly. This is my attempt to change that. By change, I mean publishing at least two posts weekly: one dedicated to my open source project and another related to IT. By doing so, I will fulfill the requirements of the "Get Noticed!" competition. So I will try to kill two birds with one stone. Or something like that. I was inspired to compete in "Get Noticed!" by Michał and encouraged by Wojtek, who are also competing in it. We will see how it rolls.

It looks like there is a difference in air quality between two central points ...
What open source project will I develop here? Since there were no restrictions on technology, programming language, topic or purpose, I decided to perform exploratory data analysis on air quality data for Poland. What exactly do I mean by that? I would like to build a Jupyter Notebook in which I do a step by step research analysis. My goal is a usable analysis which is worthwhile, scientifically correct and valid from an engineering point of view. And it will be fully reproducible, of course!

... but the yellow point shows only data for PM10 ...
My first step will be downloading and preparing the GIOŚ air pollution data. I will try to estimate missing values, so that every point on the map has estimates for the various pollutants. Then I will work on visualizing those estimates. The next step will be dedicated to gathering and discussing various facts related to the performed analysis. Once I complete those steps, I will try to build a predictive model for the best time for outdoor physical activities at various places in Poland. Those are my initial ideas for the project's development.

... and the orange one also shows PM2.5. Is the air really better at the yellow one?
Which technologies will I use? I'm a big fan of Python, so I will use it exclusively. I'm thinking about using the Pandas, NumPy, Matplotlib and Folium modules, but this list will probably change over time. My product will take the form of a Jupyter Notebook, though I may not restrict myself to only one notebook. If my work is effective, I might consider building standalone Python scripts to perform some parts of the analysis. Time will show what ideas pop up.

What was my motivation to pick such an idea for the project? Recently in Poland there have been discussions about poor air quality around a couple of big cities known for their heavy industrial profile. I found that some people draw wrong conclusions from the data points. The worst case was a reporter comparing the air quality index (AQI) at two not-so-distant points. One point was "safe" and the other "unsafe", and the reporter clearly highlighted that. But the problem with the "safe" point was that the measuring station there wasn't measuring all pollutants, so the system probably filled the missing values with zeros. This leads to a situation where someone thinks they are chilling in a "safe" air quality zone, while actually they know nothing about it. This problem pushed me to think about a better way to estimate air quality at points where no direct measurements are available. I don't have any experience working with air quality or geospatial data, but I think I will be able to perform at least some basic analysis and produce some useful pieces of code.

I also hope that participating in the "Get Noticed 2017!" competition will be a lot of fun and will possibly bring some feedback on my work. Michał is very enthusiastic about it, so I need to at least try to maintain the development and writing momentum and see what I build. Last but not least: the source code for my project is located here.

Ideas to implement:
  • Fill missing values for measuring stations
  • Interpolate pollutants values over Poland
Random ideas:
  • Scrape current data from the GIOŚ server
Ideas graveyard:
  • Build mobile application based on my work