Thursday, April 27, 2017

Why should you participate in machine learning competitions?

What are machine learning competitions?


Machine learning competitions are contests in which the goal is to predict either a class or a value based on data. The data can be tabular, time series, audio, images or something similar, and it is often tied to the real-life problem the organizers are trying to solve. The rules are formulated so that every participant can immediately compare their results with those of every other participant. The data used in competitions is usually fairly clean and preprocessed, so it is quite easy to start working on your own solution.

Why should I bother?


Machine learning competitions are prepared by machine learning specialists for machine learning specialists. This means that the problems used there represent current problems found in social, academic or business environments. These problems can already be solved with machine learning methods, but sponsors are looking for novel approaches that competitors may propose. This gives us the first reason - these kinds of problems exist right now and are being solved with machine learning.

When you register for a competition and accept its rules, you will be able to download the train and test data. This data comes from real-life processes and has all their flaws. There are unexpected values, data leaks, broken files and duplicates. On the other hand, the process by which the data was obtained is usually well described, and so is the data itself. In machine learning research such real but usable data is very valuable. This is the second reason - access to real-world data.

As I mentioned, every participant receives train and test data. The train data is used, as usual, to train a prediction model. The test data, however, is not used to test your model in the usual way. You receive it without the target class or value; its purpose is to be predicted. The predicted classes or values can then be submitted to the competition system, which scores them with a defined function against the true, hidden classes or values. You don't know the targets, and every team or individual competitor is limited in the number of submissions per day, so it is very hard to cheat. But since you know the loss function, you can estimate the effectiveness of your model before submitting results, for example by scoring it on a part of the train data that you held out from training.

After the score is calculated, it is placed on the public leaderboard, so every competitor can see how well their model performs compared to the others. After the competition ends, the leaderboard is recalculated with the inclusion of another hidden data set, which wasn't used to calculate the scores on the public leaderboard. This means that even if someone submitted random results and got a great score by luck, their result after recalculation will be closer to random than to a high score.
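To make the leaderboard recalculation more concrete, here is a small simulation of a "lucky" random submission. It is only a sketch under my own assumptions: the error rate stands in for whatever scoring function a real competition defines, the 30% public share is an arbitrary choice, and the names `error_rate`, `split_leaderboard` and `simulate_lucky_submission` are made up for this post and do not come from any competition platform.

```python
import random


def error_rate(y_true, y_pred):
    """Fraction of wrong class predictions (assumed stand-in for the real metric)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    wrong = sum(1 for t, p in zip(y_true, y_pred) if t != p)
    return wrong / len(y_true)


def split_leaderboard(n_rows, public_fraction, rng):
    """Randomly assign hidden test rows to the public part or the held-back part."""
    indices = list(range(n_rows))
    rng.shuffle(indices)
    cut = int(n_rows * public_fraction)
    return indices[:cut], indices[cut:]


def simulate_lucky_submission(seed=0, n_rows=10_000, n_submissions=50):
    """Send many random submissions and keep the one that looks best in public."""
    rng = random.Random(seed)

    # True classes of the test rows; in a real competition only organizers see them.
    hidden = [rng.randint(0, 1) for _ in range(n_rows)]
    # Assumption: 30% of the test rows feed the public leaderboard.
    public_idx, _held_back_idx = split_leaderboard(n_rows, 0.3, rng)
    public_truth = [hidden[i] for i in public_idx]

    best_public, best_submission = 1.0, None
    for _ in range(n_submissions):
        submission = [rng.randint(0, 1) for _ in range(n_rows)]
        public_score = error_rate(public_truth, [submission[i] for i in public_idx])
        if public_score < best_public:
            best_public, best_submission = public_score, submission

    # Final standing: recalculated with the held-back rows included.
    final_score = error_rate(hidden, best_submission)
    return best_public, final_score


best_public, final_score = simulate_lucky_submission()
print(f"best public error rate: {best_public:.3f}")
print(f"final error rate:       {final_score:.3f}")
```

Picking the best of many random guesses makes the public score look a little better than chance, but the rows that were held back were never part of that selection, so the recalculated score drifts back toward 0.5, which is what pure guessing gives on two classes.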

After the competition, top contestants are interviewed and share how they tackled the problem. All of this gives us the third reason - you are often competing against, and comparing your solutions to, the top solutions in industry and academia, which teaches you humility about your own “brilliant” solutions.

Where can I compete?


I have participated in some competitions and it was always entertaining and educational. I didn’t have much leaderboard success, but I learned patience and careful, A-to-Z thinking. I really recommend participating. Currently there are at least two companies organizing such competitions: Kaggle and DrivenData. The first is bigger and rather business-driven; the second is definitely smaller but aims at solving social, greater-good problems, so it might be better suited to morally motivated competitors. Either way, both use the flow described above. Good luck!
