Those not familiar with the contest can find the details here. After three weeks of work i managed to score 0.9422 rmse, which brings positions just above the 1000th place. how? Not too hard, but not too easy either.
I’ve put the k-nearest neighbours algorithm on work. The algorithm finds the nearest k samples to the given one (the one we’d like to predict), and then tries to assume its class/value somehow. So there are three tasks to make it work:
- define distance that will tell the algorithm which are the closest samples
- set k – neighbourhood size
- choose combination of the k values from the closest samples to obtain the final results
For a (user, movie) tuple I found the k closest users that watched the movie, and combined their rates to get a prediction.
Distance
There are many different distance measures to consider. One can try pearson’s correlation coefficient. Vectors’ sparsity brought cosine similarity to my mind. It works fine for clustering documents represented by sparse vectors. But none of those produced satisfactory results, so I tried another method that tries to capture some common sense.
So imagine a person walking into the room full of unknown people. Initially everybody has the same (some average) distance to you. Then you talk to them and the first one likes the same music band as you. So distance to this person diminished a little bit. Then another person says the same band is crap, so his/her distance increases. But there are more bands, movies, actors, moral issues, work related issues etc. you talk about. The more ‘slots’ you have in common and if you agree on each slot above some average agreement, the closer you are to the person.
So that’s exactly what I used to calculate the distance between users. First, i calculated the average distance between all the rates. Actually i took around one million randomly picked rates, and got an average rate-to-rate distance to be 1.53465. When comparing two users their distance was sum of the r1 – r2 – average_distance for the movies in common. And when the movie was not seen by one or both of them, average distance was assumed. That’s how the problem of non-overlapping movies was solved, when users had only a few movies in common.
K
With the distance set, initially i took 100 neighbours, and was a bit lucky, because later i tried also with 25, 50, 75, 125, 150, and 200 but none improved the score. Note also that for other distance the number is not necessarily the same.
Combination
I tried only two of them. one is averaging the 100 rates obtained from the closest samples, and the other is weighted, where weights were dropping linearly from 100 for the closest sample toward 1 for the farthest one. The later gave better results.
Implementation
Built two c++ classes, namely user and movie with all the necessary in their interfaces, such as user::get_closest_users(). Initially everything was written in c, but it was a messy solution. Actually one month later I couldn’t read the code anymore, I got lost in the arrays and pointers…
Running the qualifying dataset took me about five days. Some friends did some caching of the neighbourhood, but didn’t work well. Then one day i got lucky. At work they asked me if i have some software to do some heavy weight testing of the 7 brand new hp blades (each armed with eight cores) to be tested. That’s 56 cores, heaven. Using python script I split the dataset in small pieces and distributed them around the blades and later picked up the results. Using ssh i managed to run it remotely. It took 4 to 6 hours to finish.
Wishes
I had more time to play with it.






