This site is being phased out.

PageRank as a flow

From Mathematics Is A Science
Jump to navigationJump to search

After more than 10 years, PageRank's paradigm is clear. The people who care about their PageRank – millions of web site owners – understand what it is about. It isn't about probability.

PageRank is about passing something valuable from a page to the next along its links. This something has been called popularity, significance, authority, “link juice”, etc.

So, PageRank shows who, after all this “redistribution of wealth”, ends up on top. This seems to make quite a lot of sense, doesn't it?

Unfortunately, mathematical analysis reveals various problems.

In particular, in the presence of circular sequences of links, the algorithm diverges to infinity. This has lead to the introduction of the damping factor, which doesn't prevent the gaming of the algorithm. Google's answers have been: surreptitious tweaking, attempts to evaluate the “quality” of sites, manual interventions, on and on.

A lot of people are very happy with PageRank. These are actual quotes from engineers and programmers:

  • “PR-based algorithms seem to work (very) well in the real world”;
  • “yield very good search results”;
  • “Google's search algorithm works amazingly well”;
  • “Google's algorithm works extremely well”;
  • “The page rank algorithm is actually extremely impressive, and obviously works well.”

This is what PageRank looks like to them:

Square wheels 1.png

This is its mathematical reality:

Square wheels 2.png

Let's look at the mathematics again.

Suppose we represent the web as a directed graph. Let $E$ be the set of edges of the graph and $N$ the total number of nodes in the graph. This is the formula for the PageRank: $$P_i=(1-\epsilon )\sum _{(i,j)\in E}\frac{P_j}{d_j}+\frac{\epsilon }{N},$$ where $P_i$ is the PageRank of the $i$-th node, $\epsilon$ is the damping coefficient, and $d_j$ is the degree of node $j$.

This formula can be interpreted as a PDE on a graph:

An example of diffusion is heat transfer.

We can think of advection as a fluid flow but with an additional feature -- the fluid carries some kind of substance, such as sand, around and gradually deposits it in different locations.

This is a very well known model. The problem is with how it is used. The accumulation of sand "at the end of the day" is used to rank the locations. The percentage of the amount of sand lost per unit of time is clearly an important characteristic of the process. Then it's not surprising that -- in the presence of whirls -- the ranking will depend on that number.

How should we approach the problem then?

The simplest mathematical interpretation of the web site linking is this. We have a discrete differential $1$-form $\phi$ on a graph and we want to see if it is produced by a global ranking $f$ of vertices. In other words: $$\phi = df,$$ where $d$ stands for the exterior derivative.

The question is then, is $\phi$ exact? We know that, in general, the answer is No. So, there is no solution!

We can't get around this problem.

Therefore, we have to do something about $\phi$ itself. There is only one way: we use the Hodge decomposition to find the "component" $\psi$ of $\phi$ that is exact.

PROBLEM SOLVED

If you've read the rest of the book and you know how to make your first billion!