Sunday, March 26, 2017

Air Quality In Poland #07 - Is my data frame valid?

OK, so we have a Jupyter notebook which handles all the data preprocessing. Since it got quite large, I decided to save the resulting data frame to disk and leave the notebook as is. I perform further analysis in a separate notebook, so the whole flow is cleaner now. Both notebooks live in the same workspace, so working with them hasn't changed.
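For reference, persisting the frame from the preprocessing notebook is a one-liner; a minimal sketch, assuming the frame is called bigDataFrame as later in this post:

 # in the preprocessing notebook, after all merging/concatenating is done
 bigDataFrame.to_pickle("../output/bigDataFrame.pkl")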

While working with one big data frame I ran into some unwanted behavior. The data frame I created takes 1 GB after saving to disk, while the whole raw data I processed takes about 0.5 GB, so my preprocessing effectively added another 0.5 GB. On my 4 GB ThinkPad X200s, re-reading a file that is already loaded makes the laptop hang. To avoid that, I check whether the variable already exists, and if it does, I skip loading it:

 import pandas as pd

 # load the pickled frame only if it is not already in memory
 if 'bigDataFrame' in globals():
     print("Exists, do nothing!")
 else:
     print("Read data.")
     bigDataFrame = pd.read_pickle("../output/bigDataFrame.pkl")
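If you want to see how much RAM the loaded frame actually takes (the pickle size on disk is only a rough proxy), pandas can report it; a minimal sketch:

 # deep=True also counts memory held by object (e.g. string) columns
 print(bigDataFrame.memory_usage(deep=True).sum() / 1024**3, "GB")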

Since I have this data frame, it should be easy to recreate the result of searching for the most polluted stations over the year 2015. And in fact it is:

 # for every station (column) return the timestamp of its 2015 maximum
 bigDataFrame['2015-01-01 00:00:00':'2015-12-31 23:00:00'].idxmax()

As a byproduct of this search we also get the date and time of each extreme measurement:


It looks like we have the same values as in the previous approach, so I assume I merged and concatenated the data frames correctly. If we want to know the actual values, we can swap idxmax() for max().
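If you want both at once, the two calls can be combined into one small summary frame; a minimal sketch, assuming the columns of bigDataFrame are the station codes:

 # timestamp and value of each station's 2015 peak, highest first
 year2015 = bigDataFrame['2015-01-01 00:00:00':'2015-12-31 23:00:00']
 peaks = pd.DataFrame({'when': year2015.idxmax(), 'value': year2015.max()})
 print(peaks.sort_values('value', ascending=False).head())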

That's all I prepared for this short post. Don't get overwhelmed by your data and keep doing science! Till the next post!
