Causes Tech: A recommendation engine that could
Posted Jul 25, 2011 by Sydney Fleischer
This week in Causes Tech we hear from software engineer Adam Derewecki. As his team bio suggests, Adam is an adventurer with tales of exploration and intrigue from across the globe. Today he regales us with the creation story of the new Causes recommendation feature, the Suggestomatic.
For better or worse, the personalized web is here and you probably interact with it every day without knowing it: “Other items you may like” on Amazon, “People Who Viewed This Also Viewed” on Yelp, “Who to follow” on Twitter, etc. Automated suggestions based on information about you, or information mined from groups of users, have become a staple of content discovery. So how hard is it to build a rudimentary recommendation engine from scratch? For our most recent Hackathon, this is what I set out to do: build a modest “You might like these Causes” feature. It’s worth mentioning that the Cause membership data set is 911,000,000 records spanning about 600,000 unique causes and 150,000,000 unique registered users. It is by FAR the largest data set that I’ve ever worked with, which is also what made it an interesting problem.
There are two different types of collaborative filters that would accomplish what is needed: item-based (purely “this is how a bunch of other people are related to this item”) and user-based (built on top of item-based, but giving extra weight to users who act similarly to you). Because it was a Hackathon and I’d have to build the item-based filter first anyway, I scoped the project to just that. The crux of an item-based collaborative filter is a similarity matrix in which each set member is compared against every other set member and a score is calculated. For Causes, the scoring function needed to show how related two Causes are, and the input data set is the list of all Cause memberships. “Members in common” seemed like a good enough scoring function, so I made a proof-of-concept Ruby script that queried the cause_membership database with two cause_ids and did a self join to calculate the member set intersection. While it produced good results (topic and language matched at first glance), it was really slow — on the order of two years slow.
So I got a bigger hammer — the C hammer (I was really hoping I could work an MC Hammer joke in here somehow). I knew that a lot of the overhead in the proof of concept was spent in the database doing the self joins and paging membership data in and out of memory. That became my starting point — write the fastest Cause intersection function that I could and build around it. The fastest way I could think of to find the intersection of two sets was to sort each set into an array of integers and walk down both arrays in step, comparing as I go.*
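For illustration, here’s a minimal sketch of that kind of merge-style walk (not the actual Suggestomatic source; the names are mine):

```c
#include <stddef.h>
#include <stdint.h>

/* Count the members two Causes have in common, given each Cause's
 * membership as a sorted array of user ids. Sketch only. */
size_t intersection_count(const uint32_t *a, size_t a_len,
                          const uint32_t *b, size_t b_len) {
    size_t i = 0, j = 0, count = 0;
    while (i < a_len && j < b_len) {
        if (a[i] < b[j]) {
            i++;              /* a is behind; advance it */
        } else if (a[i] > b[j]) {
            j++;              /* b is behind; advance it */
        } else {
            count++;          /* same user appears in both Causes */
            i++;
            j++;
        }
    }
    return count;
}
```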
My input was a CSV dump of membership data from the database in the format (user_id, cause_id), on the order of 9 GB — more memory than I had. While I could have just spun up a big-memory instance on EC2 to handle it, I thought it would be prudent to try to trim down the size: a smaller footprint meant not only less memory required, but also that functions operating on that memory would be quicker. The data massaging could afford to be slower than C, so I did it in Python. The first step was pretty straightforward: read a line, parse the two integers, and write them to a file as binary integers (8 bytes per record, as opposed to the ASCII representation of the numbers). This got my total input size down to about 5.4 GB — much better, but still in (cause_id, user_id) tuples. To build membership arrays made up of just user_ids, I needed another data massaging step. Because 5.4 GB was ALSO more memory than I had, I wrote a loop that iterated over segments of 25,000 cause_ids at a time and built membership arrays for just that segment. I was also able to throw away some data: about 15% of our Causes have only the creating member, and I dropped those because they wouldn’t add much value to recommendations, while dropping them significantly cuts down on CPU cost (for those of you who haven’t done the mental math, 600,000 ^ 2 * (1/2) is 180,000,000,000 comparisons, and I’ll do anything I can to cut that number down).
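The conversion itself was a Python script, but purely to illustrate the layout, each record in the packed intermediate file amounts to two 32-bit integers back to back (field order as shown is illustrative):

```c
#include <stdint.h>

/* One membership row in the packed intermediate file: 8 bytes,
 * versus however many ASCII digits plus a comma and newline in the CSV. */
struct membership_record {
    uint32_t user_id;
    uint32_t cause_id;
};
```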
At the end of the data preparation step, we generate:
* data file — 2.54 GB, array of arrays of user ids
* index file — 2.36 MB, array where the key is the cause_id and the value is the offset into the data file where that Cause’s user_id list starts
* ids file — 1.57 MB, array of cause_ids (our cause_ids aren’t contiguous)
The total memory requirement to hold this data is manageable on modern systems. Creating the C engine to use this data was fairly straightforward: load all the data files into memory with mmap(2) and start comparing every pair of Causes.
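Here’s a hedged sketch of what that load step can look like; the file names, the 32-bit offset width, and error handling are my assumptions, not details from the actual repo:

```c
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Map an entire file into memory read-only and return a pointer to it.
 * Error handling is mostly omitted; this is a sketch. */
static void *map_file(const char *path, size_t *len_out) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) { perror(path); exit(1); }
    struct stat st;
    fstat(fd, &st);
    void *p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);
    *len_out = (size_t)st.st_size;
    return p;
}

int main(void) {
    size_t data_len, index_len, ids_len;
    /* File names here are hypothetical. */
    uint8_t  *data  = map_file("suggestomatic.data",  &data_len);
    uint32_t *index = map_file("suggestomatic.index", &index_len);
    uint32_t *ids   = map_file("suggestomatic.ids",   &ids_len);

    /* Per the layout above, the user_id list for a given cause_id starts at
     * data + index[cause_id]; how each list's length is recovered is an
     * assumption (e.g. from the next Cause's offset). */
    const uint32_t *members = (const uint32_t *)(data + index[ids[0]]);
    printf("cause %u, first member %u\n", (unsigned)ids[0], (unsigned)members[0]);
    return 0;
}
```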
After an initial run, I noticed a few bottlenecks. First, at n^2 comparisons, if I stored every result, I’d need on the order of (4 bytes/int * 3 ints) * (1/2) * n^2 bytes, which works out to over two terabytes. Way, way too much! Completeness isn’t even desirable, because you only ever want the top 25–100 recommendations. So the first deviation from completeness was to only write results if the score was at least 100, and to stop doing comparisons once 100 “good” results had been found.
The second bottleneck was that a large `cause_a` (the Cause in the outer loop) would bring the process to a near standstill. A 3,500,000-member Cause needs to walk down a significant portion of that member list against every other Cause, and 3.5 million * 600,000 is yet another number that is way, way, way too big. The “save 100 results with scores of at least 100” rule remedied this problem pretty handily, since large Causes also hit 100 good scores pretty quickly.
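Putting those two cutoffs together, the comparison loop ends up looking roughly like this (a sketch: get_members(), member_count(), and record_result() are hypothetical helpers standing in for lookups into the mapped files and the results writer):

```c
#include <stddef.h>
#include <stdint.h>

#define MIN_SCORE   100  /* only keep pairs with at least this many members in common */
#define MAX_RESULTS 100  /* stop scanning a Cause once it has this many good matches  */

/* Hypothetical helpers: fetch a Cause's sorted member array and its length
 * from the mapped data files, and persist one (a, b, score) result. */
const uint32_t *get_members(size_t cause);
size_t member_count(size_t cause);
void record_result(size_t cause_a, size_t cause_b, size_t score);
size_t intersection_count(const uint32_t *a, size_t a_len,
                          const uint32_t *b, size_t b_len);

void score_all(size_t n_causes) {
    for (size_t a = 0; a < n_causes; a++) {
        size_t good = 0;
        for (size_t b = 0; b < n_causes && good < MAX_RESULTS; b++) {
            if (b == a) continue;
            size_t score = intersection_count(get_members(a), member_count(a),
                                              get_members(b), member_count(b));
            if (score >= MIN_SCORE) {
                record_result(a, b, score);  /* only "good" pairs get written */
                good++;
            }
        }
    }
}
```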
At this point, I call it version 1. It’s able to do 30 Causes per minute, so a complete refresh of our data takes about 9 days. It’s not perfect, but it’s a good start, and we’ll be able to fix problems as we find them and make other improvements as we go. You can test it out yourself by cloning the git repo (http://github.com/causes/suggestomatic), generating test data, and following the instructions in the README. On our end, this engine is alive and serving “Causes you might like,” with hopefully more to come (related Causes give us a lot of neat opportunities to improve the user experience). A few things I learned along the way:
* n^2 gets big really quickly. If the number of unique Causes were an order of magnitude bigger, this process probably wouldn’t be feasible without massive parallelization. Which is completely possible, by the way — this is a very parallelizable problem. There are already begin_at and end_at parameters, so it should be easy to spin up a few EC2 instances and run in parallel.
* Python has pretty good binary support. array.array is very easy to work with; all I needed was .fromlist() and .tofile().
* C is really, really fast. We all run really fast computers, and our modern Pythons and Rubies still run pretty quickly — it’s easy to forget that C is really that much faster and you can squeeze more out of your CPU by using it. However…
* Only do the heavy crunching in C. I think I would have driven myself batty if I had done the entire thing in C instead of Python! Sure, I could probably have made the data massaging step 10x faster in C. It would absolutely have taken me 10x longer to write, though, since C is not my day-to-day language. And cutting a 10-hour step down to a few hours isn’t really worth it when the whole process takes on the order of 10 days. If the rest of the program gets a big speedup and the data massaging step becomes the bottleneck, then I’ll look at making it run faster.
If any of this sounds interesting to you, we’re hiring!
* It turns out there are plenty of cleverer algorithms that can intersect sets more quickly; we’ve got an open issue for investigating faster alternatives.