Google It: Markov Chains in Modern Search Engines

As the Internet has grown, so has the need for a system that can easily navigate all the information it contains. However, the obvious logistical concerns make creating a comprehensive system that can efficiently organize this information extremely difficult. The Internet is an unregulated, organic entity, and on top of that, it contains nearly 20 billion web pages.

Given the monumental task at hand, what Google has accomplished is all the more impressive. With near-ubiquitous recognition across the Western world and over 83% of the US market share, Google represents the incredible power of linear algebra applied on an unprecedented scale.

Ultimately, what separates Google from the competition is that it ranks pages in a way that is not only efficient, but also ensures trustworthy and relevant sites appear at or near the top of the results for every search query.

Earlier search engine systems used an index that searched web pages and ranked them according to how frequently the queried keywords appeared on each page. However, this proved to be an inefficient and inconsistent way to gauge the relative importance of the resulting pages. While initially helpful, it was clear that this method was tedious and would only become more dysfunctional as the web grew.

It is for this reason that the innovative techniques pioneered by Google founders Sergey Brin and Lawrence Page are so integral. Rather than focusing on frequency, search results were ranked based on relevance; this meant no more sifting through volumes of hyperlinks. Instead, these hyperlinks were utilized as the means to determine relevant results. Behind this revolutionary idea: an algorithm called PageRank.

As we have explored eigenvectors and eigenvalues over the past few weeks, we have seen many long-term population modeling applications; however, we have not delved into many other real-life examples. As such, we will be exploring the model and algorithms that underlie Google’s monumental success. In addition, we will explore how Page and Brin modeled the web using modified stochastic matrices, Markov chains, and their intimate connection to eigenvectors.

To model how PageRank is used, we will scale the web down to a simpler system of 5 pages. Each page has a certain number of links to the other four, and each page passes on its "importance" equally among the pages to which it links. For example, if page 1 links to pages 2, 3, and 4, then it passes ⅓ of its "importance" to each of those pages. This is what determines which pages are ranked higher, and in turn, show up higher in the results than others.

Using this process, we can construct a stochastic matrix that models the links between the pages, called the "link matrix." From here, it can be seen that if one starts out with any vector x representing the initial relative importances of the pages, multiplying it by the link matrix repeatedly yields a steady-state importance vector: certain pages emerge as more important than others. To obtain this steady state directly, we find the eigenvalues of the link matrix and determine the eigenvector whose eigenvalue equals 1, or more recognizably, the vector q for which Pq = q.
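The process above can be sketched in a few lines of Python. The 5-page link structure below is invented purely for illustration; the code builds the column-stochastic link matrix P, then multiplies an initial uniform importance vector by P repeatedly until it settles toward the eigenvector with eigenvalue 1.

```python
# A minimal sketch of the 5-page model. The link structure is hypothetical:
# links[i] lists the pages that page i links to (0-indexed).
links = {
    0: [1, 2, 3],     # page 1 passes 1/3 of its importance to pages 2, 3, 4
    1: [0, 4],
    2: [0, 1, 3, 4],
    3: [4],
    4: [0, 2],
}
n = 5

# Column-stochastic link matrix: P[j][i] = 1/(outlinks of i) if i links to j,
# so every column sums to 1.
P = [[0.0] * n for _ in range(n)]
for i, outs in links.items():
    for j in outs:
        P[j][i] = 1.0 / len(outs)

# Start with a uniform importance vector and apply P repeatedly.
x = [1.0 / n] * n
for _ in range(100):
    x = [sum(P[j][i] * x[i] for i in range(n)) for j in range(n)]

# x now approximates the steady state satisfying Px = x.
print([round(v, 3) for v in x])
```

Repeated multiplication works here because every column of P sums to 1 and the pages form one connected system; the next section covers what Google does when those conditions fail.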

In addition, we will examine how Google's innovations played a role in making this model workable. Examining the damping factor, which accounts for irregularities in the columns of the stochastic matrix caused by dangling nodes (pages with no outgoing links) and by disconnected subsystems within the network, is essential to understanding the evolution of Google.
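The damping-factor fix can be sketched as follows. The 3-page link matrix below is invented for illustration and deliberately includes a dangling node; the damping value d = 0.85 is the commonly cited choice, assumed here rather than taken from the text above.

```python
# Hypothetical link matrix with a dangling node: page 3 (column 2) has no
# outgoing links, so its column is all zeros and P is not stochastic.
n = 3
P = [[0.0, 0.5, 0.0],
     [1.0, 0.0, 0.0],
     [0.0, 0.5, 0.0]]

# Step 1: repair dangling nodes by replacing each zero column with a
# uniform column (as if the page linked equally to every page).
for j in range(n):
    if sum(P[i][j] for i in range(n)) == 0:
        for i in range(n):
            P[i][j] = 1.0 / n

# Step 2: blend with the uniform matrix using damping factor d, giving
# G = d*P + (1-d)/n * J. Every entry of G is now positive, so a unique
# steady state exists even for disconnected subsystems.
d = 0.85
G = [[d * P[i][j] + (1 - d) / n for j in range(n)] for i in range(n)]

# Power iteration on G converges to the PageRank vector.
x = [1.0 / n] * n
for _ in range(200):
    x = [sum(G[i][j] * x[j] for j in range(n)) for i in range(n)]
print([round(v, 3) for v in x])
```

Because every entry of G is strictly positive, the Perron–Frobenius theorem guarantees a unique eigenvector with eigenvalue 1, which is what makes the ranking well defined for the whole web.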

Simply put, the Internet is a complex, chaotic, and disconnected entity. Through our simplified model, we hope to give the class an idea of how Google’s application of linear algebra brought a sense of organization to the Internet.