Twitter FollowingRank with Lapack

Note: This article was originally published at Planet PHP on 17 April 420.

At the recent PHP UK Conference 2012 I had the opportunity to chat about machine learning and IR with a bunch of very smart people. One of the conversations included the always enlightening Rowan Merewood, and was about ranking Twitter friends. It's reasonably well known that Google used to use a variant of PageRank based on who-follows-who to rank its Twitter search results (back when it had them). The question is, could the same kind of thing work over a much smaller set - say, using it to rank the influence of the users I follow, in order, perhaps, to prioritise tweets?

In part the experiment is also an excuse to play with the PHP Lapack extension, which offers a few useful linear algebra algorithms, including calculating eigenvalues and eigenvectors. This is pertinent here because, while PageRank can be (and usually is) calculated iteratively, mathematically it can be determined by taking the principal eigenvector of an appropriate matrix.
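To make the iterative route concrete, here is a minimal sketch of power iteration in plain PHP. It is not the article's code and does not use the Lapack extension: the function name powerIteration, the 0.85 damping factor and the three-node example matrix are all assumptions made for illustration. Each row of the matrix is assumed to sum to 1.

```php
<?php
// Hypothetical helper (not from the original article): approximate the
// PageRank vector by repeatedly multiplying a rank vector by the matrix.
// Assumes $matrix is square and every row sums to 1.
function powerIteration(array $matrix, float $damping = 0.85, int $maxIter = 100, float $tol = 1e-9): array
{
    $n = count($matrix);
    // Start with every node equally ranked.
    $rank = array_fill(0, $n, 1.0 / $n);

    for ($iter = 0; $iter < $maxIter; $iter++) {
        // Assumed damping term: a small chance of jumping to any node.
        $next = array_fill(0, $n, (1.0 - $damping) / $n);
        for ($row = 0; $row < $n; $row++) {
            for ($col = 0; $col < $n; $col++) {
                // Rank flows from the row node to the column node.
                $next[$col] += $damping * $rank[$row] * $matrix[$row][$col];
            }
        }

        // Stop early once the vector has stopped changing.
        $delta = 0.0;
        for ($i = 0; $i < $n; $i++) {
            $delta += abs($next[$i] - $rank[$i]);
        }
        $rank = $next;
        if ($delta < $tol) {
            break;
        }
    }
    return $rank;
}

// Made-up example: node 0 links to 1 and 2, node 1 links to 2, node 2 links to 0.
$matrix = [
    [0.0, 0.5, 0.5],
    [0.0, 0.0, 1.0],
    [1.0, 0.0, 0.0],
];
$rank = powerIteration($matrix);
```

In this toy graph node 2 ends up with the highest score, because it is linked from both other nodes. The vector the loop settles on is the one an eigenvector routine would return for the damped matrix, up to scaling.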

In PageRank, a grid (or matrix) models the random walk of a user clicking from page to page. It maps every page to every other page: for a given page (row), there is a non-zero value in each column whose page the row-page links to. The value of the entry depends on how many links there are on the row-page, and represents the chance of a randomly clicking visitor leaving the row-page for the column-page. In the basic formulation, a page with four outbound links has 1/4 in each of those four columns.
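The grid described above can be sketched in a few lines of PHP. This is an illustrative assumption rather than the article's implementation: buildTransitionMatrix and the $links adjacency list are hypothetical names, and treating a page with no outbound links as linking to every page equally is one common convention, not something the article specifies.

```php
<?php
// Hypothetical helper (not from the original article): turn an adjacency
// list into a matrix where entry [row][col] is the chance of moving from
// the row node to the column node.
// $links maps a node index to the list of node indices it links to.
function buildTransitionMatrix(array $links, int $n): array
{
    $matrix = [];
    for ($row = 0; $row < $n; $row++) {
        // Ignore duplicate links from the same node.
        $out = array_unique($links[$row] ?? []);

        if (count($out) === 0) {
            // Assumption: a node with no links jumps to any node uniformly.
            $matrix[$row] = array_fill(0, $n, 1.0 / $n);
            continue;
        }

        $matrix[$row] = array_fill(0, $n, 0.0);
        foreach ($out as $col) {
            // Each outbound link gets an equal share of the probability.
            $matrix[$row][$col] = 1.0 / count($out);
        }
    }
    return $matrix;
}

// Made-up example: node 0 links to 1 and 2, node 1 links to 2, node 2 links to 0.
$links = [0 => [1, 2], 1 => [2], 2 => [0]];
$matrix = buildTransitionMatrix($links, 3);
```

Every row sums to 1, so each row can be read as a probability distribution over where the random walker goes next.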

In our case, we're using following as our link - on the basis that if a person we're following also follows another person we're following, we're doubly likely to see tweets by them, as they may be retweeted or replied to. This does more than just count up followers, as we want to factor in the importance of each follower - so if you are followed by an important person, you are more likely to be important. Therefore, we'll work on the basis that the people who score highest under PageRank are the most visible, and therefore the most important or influential.

So the first thing we need to do is calculate the grid (or matrix) of all users - which is just an array of arrays. Before that, we'll write a small caching function to avoid constantly running into the Twitter access limits (which are higher if you're using the authenticated API, which this isn't):

function getFromCache($id) {
    $cacheFile = 'cache/' . $id . ".json";
    if(!file_exists($cacheFile)) {
        $access = is_numeric($id) ? 'user_id=' . $id : "screen_name=" . $id;
        // Using cURL so we can get the error code
        $ch = curl_init();
        curl_setopt ($ch, CURLOPT_URL,
                "" . $access);
        curl_setopt ($ch, CURLOPT_RETURNTRANSFER, 1);
        $file = curl_exec($ch

Truncated by Planet PHP, read more at the original (another 15156 bytes)