I (David Slate) am a computer scientist with over 48 years of programming experience and more than 25 years doing machine learning and predictive analytics. Now that I am retired from full-time employment, I have endeavored to keep my skills sharp by participating in machine learning and data mining contests, usually with Peter Frey as team “Old Dogs With New Tricks”. Peter decided to sit this one out, so I went into it alone as “One Old Dog”.
For this contest I used essentially the same core forecasting technology that I’ve employed in other contests: a home-grown variant of “Ensemble Recursive Binary Partitioning”. This is a robust algorithm that can handle large numbers of records and large numbers of feature (predictor) variables. Both outcome and feature variables can be boolean (2 classes), categoric (multiple classes), or numeric (real numbers plus a missing value). For the R contest the outcome was boolean, and the features provided in the training set were a mix of all three variable types.
To help tune modeling parameters and select feature variables, I relied on both a cross-validation procedure and feedback from the leaderboard. For most of the cross-validation runs I partitioned the training data into 5 subsets of roughly equal population, trained a model on the data in 4 of the 5 subsets, tested it on the 5th to produce an AUC score, and then repeated this process 4 more times, rotating the subsets so that each subset got to play the role of test set once. I then repeated this 5-fold procedure one more time, after scrambling the data to ensure a different partitioning, so as to produce 10 AUC scores altogether. These were averaged together into a composite score for the run. I also computed a standard deviation and standard error of the mean for the 10 scores to get some idea of their statistical variability. In the course of the competition I performed a total of 628 of these cross-validation runs. By the time of my first submission on Dec 11, I had already done 115 of them.
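The cross-validation scheme just described (two scrambled rounds of 5-fold testing, giving 10 AUC scores) can be sketched roughly as follows. This is a minimal illustration in Python, not the actual contest code: the function names, the `fit_predict` callback that stands in for the forecasting engine, and the rank-based AUC calculation are all assumptions made for the example.

```python
import random
import statistics
from math import sqrt


def auc(labels, scores):
    """AUC via the rank-sum (Mann-Whitney) formula; tied scores share an average rank."""
    pairs = sorted(zip(scores, labels))
    n = len(pairs)
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and pairs[j + 1][0] == pairs[i][0]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # average 1-based rank of the tied group
        for k in range(i, j + 1):
            ranks[k] = avg_rank
        i = j + 1
    n_pos = sum(label for _, label in pairs)
    n_neg = n - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC needs at least one record of each class")
    pos_rank_sum = sum(r for r, (_, label) in zip(ranks, pairs) if label == 1)
    return (pos_rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def repeated_kfold_auc(records, labels, fit_predict, k=5, repeats=2, seed=0):
    """Run `repeats` scrambled rounds of k-fold CV and summarize the AUC scores.

    fit_predict(train_records, train_labels, test_records) is a hypothetical
    callback that trains a model and returns one score per test record.
    """
    rng = random.Random(seed)
    indices = list(range(len(records)))
    scores = []
    for _ in range(repeats):
        rng.shuffle(indices)  # scramble so each round uses a different partition
        folds = [indices[i::k] for i in range(k)]  # k subsets of roughly equal size
        for test_idx in folds:
            held_out = set(test_idx)
            train_idx = [i for i in indices if i not in held_out]
            preds = fit_predict(
                [records[i] for i in train_idx],
                [labels[i] for i in train_idx],
                [records[i] for i in test_idx],
            )
            scores.append(auc([labels[i] for i in test_idx], preds))
    mean = statistics.mean(scores)
    sd = statistics.stdev(scores)
    sem = sd / sqrt(len(scores))  # standard error of the mean
    return mean, sd, sem, scores
```

With the defaults (k=5, repeats=2) the returned list holds 10 scores, and the mean plays the role of the composite score for the run.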
Although my tests involved a large number of feature variable selections and parameter settings, testing was not systematic enough to conclude that the winning model was in any way optimal. There were too many moving parts for that.
To produce my first submission I used only the feature variables provided in the training set, but I enhanced the results in two ways. One was to exploit the fact that some records occurred in both the training and test sets, so that their forecasts could simply be copied from the training labels. The other was to use the package dependency information in the depends.csv file from the supplementary archive johnmyleswhite-r_recommendation_system-36f8569.tar.gz, which, as suggested on the contest “Data” page, I downloaded from http://github.com/johnmyleswhite/r_recommendation_system. For each record whose Package depended on a Package known to be not Installed by this User, I produced the forecast 0, and for each record whose Package was depended on by a Package known to be Installed by this User, I produced the forecast 1.
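The two enhancements above amount to a small set of override rules applied before falling back on the model. The sketch below shows one way to express them in Python. It rests on several assumptions: the tuple layouts for the training rows, test rows, and dependency pairs are invented for illustration (they are not the actual column layout of the contest files), only direct dependencies are followed, and the 0 rule is checked before the 1 rule.

```python
from collections import defaultdict


def rule_based_forecasts(train_rows, depends_pairs, test_rows):
    """Return a forecast (0 or 1) per test row, or None where no rule applies.

    train_rows:    iterable of (package, user, installed) with installed in {0, 1}
    depends_pairs: iterable of (package, dependency), assumed direct dependencies
    test_rows:     iterable of (package, user)
    """
    known = {(user, pkg): installed for pkg, user, installed in train_rows}

    depends_on = defaultdict(set)      # package -> packages it depends on
    depended_on_by = defaultdict(set)  # package -> packages that depend on it
    for pkg, dep in depends_pairs:
        depends_on[pkg].add(dep)
        depended_on_by[dep].add(pkg)

    forecasts = []
    for pkg, user in test_rows:
        if (user, pkg) in known:
            # Record also appears in the training set: copy its label.
            forecasts.append(known[(user, pkg)])
        elif any(known.get((user, dep)) == 0 for dep in depends_on.get(pkg, ())):
            # Depends on a package this user is known not to have installed.
            forecasts.append(0)
        elif any(known.get((user, parent)) == 1
                 for parent in depended_on_by.get(pkg, ())):
            # Is depended on by a package this user is known to have installed.
            forecasts.append(1)
        else:
            forecasts.append(None)  # no rule fires; leave it to the model
    return forecasts
```

Rows that come back as None would keep whatever forecast the trained model produces for them.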
Although this first submission received the lowest final score (0.983419) of all my 55 submissions, it turned out, unbeknownst to me at the time, that it would have been just sufficient to win the contest.
In the course of the contest I produced and tested a variety of additional variables, many of them based on other files in the github archive, such as imports.csv, suggests.csv, and views.csv. I also made use of the one-line package descriptions on the “Available Packages” list at cran.r-project.org. Finally, I created variables from the text in the package index pages acquired by downloading all the pages http://cran.r-project.org/web/packages/PKGNAME/index.html, where PKGNAME stands for each package name.
I failed to include in my final 5 selections the submission that received the highest final score, 0.988189. But I did include my 2nd best (0.988157), and I’ll describe that submission in some detail. Note that both of these submissions were made the day before the contest ended.
The winning submission model utilized 43 features. These included the 15 provided in the training file plus 28 synthesized feature variables. Although my model-building algorithm will naturally give greater weight to highly-predictive variables, it is also possible to assign an “a priori” weight to each variable, and I tried various values of these. Here is a table of feature variable names, together with their types (B = boolean/binary, C = categoric/class, N = numeric), their assigned or default relative weights, and, in the case of each B or N variable, a crude indication of its utility in the form of its correlation coefficient with the outcome (Installed). The final column contains a brief description of the variable.
Several of the synthesized features involve some crude text analysis. In the description of those features, a “word” refers to a contiguous sequence of alphanumeric characters, and a “name” is an upper case letter followed by a contiguous sequence of alphanumeric characters.
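For concreteness, the “word” and “name” definitions above can be approximated with a short Python sketch. The function name is hypothetical, and two readings are assumed: that characters are ASCII letters and digits, and that a “name” is a word starting with an upper case letter and having at least one further character (so a lone initial such as “J” does not count).

```python
import re

# A "word": a contiguous run of alphanumeric characters (ASCII assumed here).
WORD_RE = re.compile(r"[A-Za-z0-9]+")


def words_and_names(text):
    """Split text into words, and pick out the words that qualify as names."""
    words = WORD_RE.findall(text)
    # A "name": upper case first letter followed by one or more alphanumerics.
    names = [w for w in words if len(w) >= 2 and "A" <= w[0] <= "Z"]
    return words, names
```

Counts or indicator flags derived from these two lists could then serve as the raw material for text-based feature variables.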
Various other feature variables were tried, but for whatever reasons did not make the final cut.
My computing platform consisted of two workstations powered by multi-core Intel Xeon processors and running the Linux OS. The core forecasting engine was written in C, but was controlled by a front-end program written in the scripting language Lua using LuaJIT (just-in-time Lua compiler version 2 Beta 5) for efficiency.
Originally published at blog.kaggle.com on February 16, 2011.