PageRank's dependence on the damping factor

Briefly, it's bad math.

Recall how the decay/damping factor came to exist: "No PageRank would ever escape from the loop, and as incoming PageRank continued to flow into the loop, eventually the PageRank in that loop would reach infinity... so Larry and Sergey introduced a decay factor–you could think of it as 10-15% of the PageRank on any given page disappearing before the PageRank flows along the outlinks."
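In the standard formulation (writing $r$ for the damping factor, i.e., the fraction of rank that does flow along the outlinks, and $N$ for the total number of pages), each page $p$ gets a small teleportation share plus a damped share from the pages $q$ linking to it:

$$PR(p) \;=\; \frac{1-r}{N} \;+\; r \sum_{q \to p} \frac{PR(q)}{\mathrm{outdeg}(q)}.$$

The quoted 10-15% "disappearing" corresponds to $1-r \approx 0.15$, i.e., $r \approx 0.85$.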

This number is a made-up parameter in the sense that there is no justification for choosing this particular value over any other.

Of course, if the ordering that PageRank produces is independent of the choice of $r$, then we are OK.

A mathematician would try to prove or find such a theorem. Is there one?

First, if there were such a theorem, the only reason to choose one value over another would be the speed of convergence of the algorithm ($r=.84$ converges slightly faster than $r=.85$, and so on).
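The reason convergence depends on $r$: each power-iteration step contracts the error by a factor of at most $r$ (assuming the usual uniform handling of dangling pages), so with $\pi$ the PageRank vector and $x_n$ the $n$-th iterate,

$$\|x_n - \pi\|_1 \;\le\; r^n\, \|x_0 - \pi\|_1.$$

Shaving $r$ from $.85$ to $.84$ therefore buys only a marginal speed-up.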

Second, the "naive" PageRank (no damping at all, i.e., $r=1$) could serve as a counterexample.

A quick literature search gives this paper (and others): Bressan and Peserico, Choose the damping, choose the ranking?, J. Discrete Algorithms 8 (2010), no. 2, 199–213. To quote:

"We prove that, at least on some graphs, the top $k$ nodes assume all possible $k!$ orderings as the damping factor varies, even if it varies within an arbitrarily small interval (e.g. [0.84999, 0.85001])."

A devastating result!
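Checking how sensitive a given graph's ranking is to the damping factor is easy to do by hand. Below is a minimal Python sketch of the power iteration with an adjustable $r$ (uniform teleportation, dangling pages spread uniformly); the toy graph is arbitrary, and on a graph this small the ordering will most likely be stable. It is the special graphs constructed in the paper that realize all $k!$ orderings of the top $k$ nodes.

  # Minimal PageRank power iteration with an adjustable damping factor r.
  # A sketch for experimentation, not a reproduction of any search engine's
  # implementation. Every node that appears as a link target must also
  # appear as a key of the graph dictionary.

  def pagerank(graph, r=0.85, tol=1e-12, max_iter=10_000):
      """graph: dict node -> list of out-neighbors. Returns (ranks, iterations)."""
      nodes = sorted(graph)
      n = len(nodes)
      ranks = {v: 1.0 / n for v in nodes}
      for iteration in range(1, max_iter + 1):
          # every page starts each round with the teleportation share (1-r)/N
          new = {v: (1.0 - r) / n for v in nodes}
          for v in nodes:
              out = graph[v]
              if out:                       # distribute damped rank along outlinks
                  share = r * ranks[v] / len(out)
                  for w in out:
                      new[w] += share
              else:                         # dangling page: spread its rank uniformly
                  for w in nodes:
                      new[w] += r * ranks[v] / n
          if sum(abs(new[v] - ranks[v]) for v in nodes) < tol:
              return new, iteration
          ranks = new
      return ranks, max_iter

  # An arbitrary toy graph; substitute your own link graph to probe sensitivity.
  web = {
      "A": ["B", "C"],
      "B": ["C"],
      "C": ["A"],
      "D": ["C", "A"],
      "E": ["D"],
  }

  for r in (0.84999, 0.85, 0.85001):
      ranks, its = pagerank(web, r=r)
      order = sorted(ranks, key=ranks.get, reverse=True)
      print(f"r = {r}: order {order}, converged in {its} iterations")

Swapping in a crawl of a real site's link graph shows how much (or how little) its ordering moves as $r$ varies.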

One of the responses to this criticism was that the graph used in the paper is "fairly contrived, and we're mostly interested in what the algorithm does on real-world data."

This seems like a fair point, but to justify it one would need a study in which a large number of websites is analyzed and it is shown that changing the damping factor DOES NOT significantly affect the ranking of these sites.

Further, such a study wouldn't resolve the bad math issue. What would? You’d need to

  1. define a class of graphs, let’s call them non-contrived;
  2. prove that the PageRank ranking of a non-contrived graph is independent of the damping factor (or at least that the order of the top results is);
  3. provide evidence that the Internet, now and in the future, is non-contrived.

PageRank is often described as "elegant". However, pulling a parameter out of your ear is not only in very poor taste mathematically, it's also bad science. It's similar to creating a mathematical theory of, say, planetary motion and deciding that the gravitational constant should be equal to, well, $.85$. And why not? You do get a valid mathematical theory!

Or, how about $\pi$ equal to $.85$? Any choice provides a valid example of non-Euclidean geometry...

There are, of course, other problems with PageRank. For example, there are graphs in which changing a single link completely reverses the ranking (Ronny Lempel, Shlomo Moran, Rank-Stability and Rank-Similarity of Link-Based Web Ranking Algorithms in Authority-Connected Graphs, Inf. Retr. 8(2): 245-264 (2005)).

Wikipedia: "the PageRank concept has proven to be vulnerable to manipulation, and extensive research has been devoted to identifying falsely inflated PageRank and ways to ignore links from documents with falsely inflated PageRank."

In particular, (semi-)closed communities represent a problem. Link farms are the main example. Also, groups of hotels have a lot of links to each other but very few links to the rest of the web. The issue is that once the random surfer gets into such a community, the only way out is the teleportation step, so rank keeps piling up inside the community.
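A back-of-the-envelope computation makes the accumulation precise (this is the standard "spider trap" estimate, not anything specific to Google's implementation). Suppose a community $C$ of pages has no links pointing outside of it (and, for simplicity, no dangling pages), and let $I$ be the total rank that outside pages send into $C$ along their links. Summing the PageRank equation above over the pages of $C$ and solving gives

$$PR(C) \;=\; \frac{|C|}{N} \;+\; \frac{r}{1-r}\, I.$$

So a closed community keeps its "fair share" $|C|/N$ of the total rank and, on top of that, multiplies every bit of incoming rank by $r/(1-r) \approx 5.7$ at $r=.85$.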

Finding such communities of web pages, if they try to avoid detection, is an NP-hard problem (read "hard"). The answer from Google and others is to employ secret heuristics that identify these communities and then penalize their PageRank.

The result is numerous false positives that hurt legitimate sites and false negatives that let link farms rank higher in the search results. In addition, the secrecy and the constant tweaking make the risks unmanageable for web-based businesses. It's a mess.

To summarize, this is the sequence of events:

  • the initial model produces meaningless results;
  • they come up with an ad hoc fix;
  • the model is still mathematically bad;
  • the model becomes dominant nonetheless;
  • now we're all in trouble.

There will always be bad math out there. What makes this a real abomination is the magnitude of the problem, i.e., Google. It is disturbing that one piece of (bad) mathematics causes so many problems for so many people!