Web and Social Network Analytics

Week 2: Search Engines & Web Crawler

Dr. Zexun Chen

Jan-2026

Table of Contents

Welcome Back!

Last week: We explored web analytics, tracking technologies, web fundamentals, and how websites measure user behaviour.

This week: We go behind the scenes of search engines - how does Google actually find and rank billions of web pages?

ShopSocial Question: How do customers find ShopSocial products through Google? How can we improve our search visibility?

Learning Objectives

By the end of this week, you will be able to:

  1. Explain how search engines work (crawl, index, rank)
  2. Describe web crawler processes and duplicate detection
  3. Calculate PageRank scores using the random surfer model
  4. Compare PageRank with the HITS algorithm (hubs vs authorities)
  5. Apply these concepts to improve ShopSocial’s search visibility

Assessment link: Understanding search algorithms is essential for your coursework on digital platform analysis.

Search Engines

Introduction

Code
import matplotlib.pyplot as plt

# UK data (Dec 2025)
uk_data = {
    'Google': 93.35,
    'Bing': 4.00,
    'Yahoo': 1.10,
    'DuckDuckGo': 0.69,
    'Other': 0.86
}
uk_sorted = dict(sorted(uk_data.items(), key=lambda item: item[1], reverse=True))

# Worldwide data (2025)
world_data = {
    'Google': 89.54,
    'Bing': 3.94,
    'Yandex': 2.45,
    'Yahoo': 1.31,
    'DuckDuckGo': 0.72,
    'Baidu': 0.64,
    'Other': 1.40
}
world_sorted = dict(sorted(world_data.items(), key=lambda item: item[1], reverse=True))

# Plot setup
fig, axes = plt.subplots(1, 2)

# UK bar chart
uk_bars = axes[0].bar(uk_sorted.keys(), uk_sorted.values(), color='#1f77b4')
axes[0].set_title("UK - Dec 2025")
axes[0].set_ylabel("Market Share (%)")
axes[0].tick_params(axis='x', rotation=45)
for bar in uk_bars:
    height = bar.get_height()
    axes[0].text(bar.get_x() + bar.get_width()/2, height + 1, f'{height:.1f}%',
                 ha='center', va='bottom', fontsize=9)

# Worldwide bar chart
world_bars = axes[1].bar(world_sorted.keys(), world_sorted.values(), color='#2ca02c')
axes[1].set_title("Worldwide - Dec 2025")
axes[1].tick_params(axis='x', rotation=45)
for bar in world_bars:
    height = bar.get_height()
    axes[1].text(bar.get_x() + bar.get_width()/2, height + 1, f'{height:.1f}%',
                 ha='center', va='bottom', fontsize=9)

plt.tight_layout()
plt.show()
Figure 1: Market Share of Search Engines in the UK and Worldwide, 2025

Google Search in 2025

Search Engines

ShopSocial Challenge

How do ShopSocial’s 50,000+ products get discovered by potential customers searching on Google?

Web Search Basics

Architecture

Key Components

  • Web Crawlers: Automated scripts that browse the internet to collect data.
  • Indexer: Processes and stores information gathered by crawlers.
  • Search Algorithm: Determines the relevance and ranking of results.

How Google Search Works

Principles

  • Finding: “Crawl” the web by following links from page to page.
  • Organizing: Index the crawled pages in Google’s database.
  • Serving: Rank the indexed pages in response to each query.

Comparing Search Engines

Methods

  • Fraction of web indexed
    • Pick random documents from one index
    • Check availability in the other
  • Overlap estimation (sketched in code after this list)
    • \(|E_i|\): Pages indexed by engine \(i\)
    • \(x\): Proportion of \(E_1\) found in \(E_2\)
    • \(y\): Proportion of \(E_2\) found in \(E_1\)
    • Verify: \(\frac{|E_1|}{|E_2|} \approx \frac{y}{x}\)
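
A minimal sketch of this overlap check, with two invented mini-indexes standing in for samples from each engine (in practice you would draw random pages from one index and test whether the other engine has them):

# Hypothetical illustration of the overlap estimate (all "pages" invented).
E1 = {"a", "b", "c", "d", "e", "f"}                   # pages indexed by engine 1
E2 = {"a", "b", "c", "g", "h", "i", "j", "k", "l"}    # pages indexed by engine 2

x = len(E1 & E2) / len(E1)   # proportion of E1 also found in E2
y = len(E1 & E2) / len(E2)   # proportion of E2 also found in E1

# Consistency check: |E1| / |E2| should be roughly y / x
print(len(E1) / len(E2))     # 6 / 9 = 0.666...
print(y / x)                 # (3/9) / (3/6) = 0.666...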

Limitations

  • Random search
    • Biased towards queried terms
  • Random IP address
    • Biased towards shared, grouped IPs
  • Random walks
    • Web is not strongly connected
    • Computationally expensive

Other challenges

Other challenges for search engine comparison:

  • Term spam: Adding unrelated words to manipulate rankings
  • Near-duplicates: Content that is similar but not identical

Why Does This Matter?

~40% of the web is duplicate content!

  • Copy-pasted articles
  • Slightly modified spam
  • Same product descriptions on multiple sites

How do search engines detect duplicates efficiently?

Shingling: Breaking Text into Overlapping Pieces

The Idea: Compare documents by breaking them into small overlapping pieces.

Why “Shingles”?

Think of roof shingles - they overlap! Similarly, text shingles overlap with each other.

Example: “The quick brown fox” with k=2 words:

  Piece 1: “The quick”
  Piece 2: “quick brown”
  Piece 3: “brown fox”
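
A minimal sketch of generating these k-word shingles (the helper name is illustrative):

def word_shingles(text, k=2):
    """Return the set of overlapping k-word shingles of a text."""
    words = text.split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

print(word_shingles("The quick brown fox", k=2))
# {'The quick', 'quick brown', 'brown fox'}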

Shingling

Comparing Documents: Jaccard Similarity

Document 1: “The quick brown fox”

  • Shingles: {The quick, quick brown, brown fox}

Document 2: “The quick red fox”

  • Shingles: {The quick, quick red, red fox}

How similar are they?

  • Shared shingles: {The quick} = 1
  • Total unique shingles: 5
  • Similarity = 1/5 = 20%

Jaccard Similarity Formula

\[J(S(d_1), S(d_2)) = \frac{|S(d_1)\cap S(d_2)|}{|S(d_1) \cup S(d_2)|}\]

In plain English: \[\frac{\text{Shared pieces}}{\text{All unique pieces}}\]

If similarity exceeds a threshold (e.g., 80%), pages are considered duplicates.
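
A minimal sketch of this calculation on the two example documents, assuming a shingling helper like the one shown earlier:

def word_shingles(text, k=2):
    words = text.split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(s1, s2):
    """Shared pieces over all unique pieces."""
    return len(s1 & s2) / len(s1 | s2)

d1 = word_shingles("The quick brown fox")
d2 = word_shingles("The quick red fox")
print(jaccard(d1, d2))   # 1 shared shingle out of 5 unique -> 0.2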

🔍 Hands-On: Calculate Jaccard Similarity

Activity (3 minutes)

Calculate the Jaccard similarity between these two documents using k=2 word shingles:

  • D1 = “I am Zexun”
  • D2 = “Zexun I am”

Steps:

  1. List the k=2 shingles for each document
  2. Find the intersection (shared shingles)
  3. Find the union (all unique shingles)
  4. Calculate: intersection / union

Solution

  • D1 shingles: {“I am”, “am Zexun”}
  • D2 shingles: {“Zexun I”, “I am”}
  • Intersection: {“I am”} = 1
  • Union: {“I am”, “am Zexun”, “Zexun I”} = 3
  • Jaccard similarity = 1/3 ≈ 33%

Web Crawlers

Basics

Web Crawler

A web crawler aims to quickly and efficiently gather as many useful web pages as possible. Crawlers are also called spiders, bots, or web robots. The basic loop (a code sketch follows the list below):

  1. Starting Point: Begin with a list of web addresses (URLs).
  2. Crawling: Visit each URL and read the content of the page.
  3. Extracting Links: Identify all hyperlinks on the page and add them to the list of URLs to visit.
  4. Repeating the Process: Continue crawling these new URLs.
  5. Respecting Rules: Adhere to robots.txt files and meta tags.
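
A minimal sketch of this loop using only Python’s standard library; the seed URL, page limit, and delay are placeholders, and a real crawler needs far more care with politeness, error handling, and scale:

import time
import urllib.request
import urllib.robotparser
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

class LinkExtractor(HTMLParser):
    """Collects href attributes from <a> tags."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def allowed(url, agent="*"):
    """Step 5: check robots.txt before fetching anything."""
    rp = urllib.robotparser.RobotFileParser()
    rp.set_url("{0.scheme}://{0.netloc}/robots.txt".format(urlparse(url)))
    try:
        rp.read()
    except OSError:
        return False
    return rp.can_fetch(agent, url)

def crawl(seeds, max_pages=10, delay=1.0):
    frontier = deque(seeds)           # step 1: start from a list of URLs
    seen = set(seeds)
    visited = 0
    while frontier and visited < max_pages:
        url = frontier.popleft()
        if not allowed(url):          # respect the site's rules
            continue
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:   # step 2: fetch the page
                html = resp.read().decode("utf-8", errors="ignore")
        except OSError:
            continue
        visited += 1
        parser = LinkExtractor()
        parser.feed(html)             # step 3: extract hyperlinks
        for href in parser.links:
            absolute = urljoin(url, href)
            if absolute.startswith("http") and absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)   # step 4: queue new URLs to visit
        time.sleep(delay)             # be polite between requests
    return seen

# Example (hypothetical seed - replace with a site you are allowed to crawl):
# crawl(["https://example.com/"], max_pages=5)

Note how the robots.txt check happens before any page is fetched: the crawler never requests a URL the site has disallowed.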

Web Crawler

Web Crawling vs Web Scraping

  • Both are techniques used for extracting data from the web.
  • Web Crawling: Focuses on navigating the web and indexing content.
  • Web Scraping: Involves extracting specific data from web pages.
Aspect  | Web Crawling           | Web Scraping
Purpose | Indexing web content   | Extracting specific data
Process | Navigating and mapping | Parsing HTML for data
Usage   | Search engines         | Data analysis, research
Data    | Metadata, links        | Targeted data, text

🔍 Hands-On: Explore robots.txt

Activity (3 minutes)

Every website tells crawlers what they can and cannot access via a robots.txt file.

Try it yourself:

  1. Open your browser and go to: google.com/robots.txt
  2. Then try: amazon.co.uk/robots.txt
  3. Compare: What pages are blocked? Why might they block certain pages?

ShopSocial Consideration

What pages should ShopSocial block from crawlers? (Hint: checkout pages, user accounts, admin panels)

Indexing and Storing Data

Steps

Step 1: Indexing

  • Indexing is the process of organizing data in a way that makes it efficiently retrievable.
  • Key to the functioning of search engines and large databases.
  • Involves creating references to data for quick access (a toy sketch follows).
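
A toy sketch of such a reference structure, an inverted index mapping each term to the documents that contain it (the documents are invented):

from collections import defaultdict

docs = {                      # hypothetical document collection
    1: "vegan leather tote bag",
    2: "leather wallet handmade",
    3: "vegan snacks gift box",
}

index = defaultdict(set)      # term -> set of document ids
for doc_id, text in docs.items():
    for term in text.lower().split():
        index[term].add(doc_id)

print(sorted(index["leather"]))                    # [1, 2]
print(sorted(index["vegan"] & index["leather"]))   # [1]  (simple AND query)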

Step 2: Storing

  • Data storage involves saving indexed data in a structured format.
  • Includes databases, data warehouses, and cloud storage.
  • Ensures data security, integrity, and availability.

Algorithms

Searching and Presentation

🤔 The Random Surfer Model

Imagine This…

You’re bored and clicking random links on Wikipedia:

  • You start on “Edinburgh”
  • You click a random link → “Scotland”
  • You click another random link → “Whisky”
  • And so on…

Question: If you did this for hours, which pages would you visit most often?

This is exactly how Google thinks about web importance!

  • Pages you’d visit often = important pages
  • This random clicking behaviour = Markov Chain

📊 Quick Poll: What Makes a Page Important?

Wooclap Question

Go to wooclap.com and enter code: QNRTKJ

What makes a web page “important” in your view?

  • A. Number of visitors
  • B. Number of links pointing to it
  • C. Quality of content
  • D. Age of the page

Google’s Answer

Links pointing to it! But not just any links - links from other important pages matter more.

💡 PageRank: The Big Idea

The Core Insight

A page is important if important pages link to it.

It’s like academic citations:

  • A paper cited by Nature is more impressive than one cited by an unknown blog
  • Importance flows through links

Named after Larry Page (co-founder of Google) - though it’s also about ranking pages!

Markov Chains: The Simple Idea

The Rule: Where you go next depends ONLY on where you are now, not where you’ve been.

Technical term: This “memoryless” property is what makes it a Markov Chain.

Markov Chains: Formal Definition

A Markov chain is a stochastic model where the probability of each event depends only on the current state (not the history).

Key Components:

  • States: Each web page is a state (N total)
  • Transition Matrix P: An N×N table of probabilities
    • Each row adds up to 1 (must go somewhere!)
    • \(P_{ij}\) = probability of going from page i to page j

Markov Chain

Markov Chains: Surfer’s Position

  • If \(x_0\) is the surfer’s initial probability vector, the distribution after one step is \(x_0 P\), after two steps \((x_0 P)P = x_0 P^2\), and after \(t\) steps \(x_0 P^t\).
  • As \(t\) grows, this converges to a steady-state distribution (a tiny numerical illustration follows).
  • Each entry of \(P\) depends only on the current state.
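
A tiny numerical illustration, assuming an invented 2-state chain: repeatedly multiplying any starting vector by \(P\) drives it to the same steady state.

import numpy as np

# Invented 2-state transition matrix (each row sums to 1)
P = np.array([[0.9, 0.1],
              [0.5, 0.5]])

x = np.array([1.0, 0.0])      # start in state 1 with certainty
for t in range(50):
    x = x @ P                 # x_{t+1} = x_t P
print(x)                      # -> approximately [0.8333, 0.1667]

# The steady state satisfies pi P = pi; here pi = (5/6, 1/6)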

Markov Chain Steady-State

Converting to Probabilities

If a random surfer is on page A with 2 outgoing links, what’s the chance they click each one?

Answer: 50% each! (1 out of 2 links)

The Recipe:

  1. Take each row in the adjacency matrix
  2. Divide by the number of links (1s) in that row
  3. Now each row shows the probability of clicking each link
From/To | A | B   | C   | D
A       | 0 | 0.5 | 0.5 | 0
B       | 0 | 0   | 0   | 1
D       | 0 | 0   | 1   | 0

(Row C is left out here because page C has no outgoing links - see the dead end problem below.)

This is the Transition Matrix - probabilities of where you’ll go next

⚠️ The Dead End Problem

We Have a Problem!

What happens when you reach a page with NO outgoing links?

Example: Page C has no links out. The random surfer gets stuck!

From/To | A | B | C | D
C       | 0 | 0 | 0 | 0

Google’s Solution: Teleportation!

When stuck (or randomly, with probability α):

  • Beam yourself to a completely random page
  • Like pressing Google’s “I’m Feeling Lucky”
  • Typically α = 0.15 (15% chance of teleporting)

Teleportation

Markov Chains: Adjacency Matrix

  • Transform \(A\) to \(P\) (a code sketch of these steps follows the list):
    1. Divide each 1 by the number of ones in the row.
    2. If a row has no 1s, replace by \(1/N\).
    3. Multiply the matrix by \(1 - \alpha\).
    4. Add \(\alpha / N\) to every entry.
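
A minimal sketch of these four steps; applied to the Example 1 graph below with \(\alpha = 0.3\), it reproduces (after rounding) the transition matrix used later in the PageRank exercise.

import numpy as np

def to_transition(A, alpha):
    """Turn an adjacency matrix A into a teleporting transition matrix."""
    A = np.asarray(A, dtype=float)
    N = A.shape[0]
    P = np.empty_like(A)
    for i, row in enumerate(A):
        if row.sum() == 0:
            P[i] = 1.0 / N              # no out-links: jump uniformly at random
        else:
            P[i] = row / row.sum()      # divide each 1 by the number of ones in the row
    return (1 - alpha) * P + alpha / N  # mix in teleportation

A = [[1, 0, 1],
     [0, 0, 0],
     [0, 1, 0]]
print(to_transition(A, alpha=0.3).round(2))
# [[0.45 0.1  0.45]
#  [0.33 0.33 0.33]
#  [0.1  0.8  0.1 ]]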

Markov Chains: Example 1

  • Converting this web graph from an adjacency matrix to a transition matrix for \(\alpha = 0.3\): \[ A = \begin{pmatrix} 1 & 0 & 1\\ 0 & 0 & 0\\ 0 & 1 & 0 \end{pmatrix} \]
  • Transition matrix accounts for teleportation probability.

Transition Matrix for α=0.3

Exercise: Go to Wooclap

Markov Chains: Example 2

  • Converting this web graph from an adjacency matrix to a transition matrix for \(\alpha = 0.7\): \[ A = \begin{pmatrix} 1 & 0 & 1\\ 0 & 0 & 0\\ 1 & 1 & 1 \end{pmatrix} \]
  • Transition matrix includes teleportation adjustments with a higher \(\alpha\) value.

Transition Matrix for α=0.7

How PageRank Works (No Math Version)

  1. Start: Give every page equal importance (say, 1 point each)

  2. Share: Each page divides its points equally among pages it links to

    • If Page A (1 point) links to B and C → B and C each get 0.5 points from A
  3. Collect: Each page adds up all the points it received

  4. Repeat: Do steps 2-3 many times until scores stabilize

  5. Result: Pages that receive points from many important pages get high scores!

That’s It!

PageRank is essentially “repeated sharing and collecting” of importance scores.
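
A bare-bones sketch of this share-and-collect loop on an invented four-page graph with no dead ends (teleportation is left out here so the core idea stays visible):

# Invented link graph: page -> pages it links to (no dead ends)
links = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A", "D"],
    "D": ["A"],
}

scores = {page: 1.0 for page in links}           # 1. start: equal points

for _ in range(20):                              # 4. repeat until scores stabilise
    new_scores = {page: 0.0 for page in links}
    for page, targets in links.items():
        share = scores[page] / len(targets)      # 2. split points over out-links
        for target in targets:
            new_scores[target] += share          # 3. collect incoming points
    scores = new_scores

print({p: round(s, 2) for p, s in sorted(scores.items())})
# A and C end up with the highest scores in this made-up graph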

PageRank: The Principles

  • At each time step, the random surfer makes a decision
  • In the long run, some pages are visited more often than others
  • The PageRank score \(\pi(\nu)\) combines:
    • Random Walk: Following the structure of the web graph
    • Teleportation: Jumping to random pages (controlled by α)
  • Key assumption: Important websites receive more links from other important websites

ShopSocial Application

To improve PageRank, ShopSocial should:

  • Get featured in trusted review sites and blogs
  • Create valuable content that others want to link to
  • Build partnerships with complementary brands

PageRank Computation

  • Left eigenvectors of \(P\) are vectors \(\pi\) of size \(N\):
    • \(\pi P = \lambda \pi\), where \(\lambda\) are the eigenvalues of \(P\).
    • If \(\pi\) is the steady-state distribution: \[ \pi P = 1 \pi \]
    • Compute left eigenvector of \(P\) for eigenvalue 1 to obtain PageRank values \(\pi\).
  • Calculated in practice through power iteration (a sketch follows):
    1. Start with an initial distribution \(x_0\).
    2. Repeatedly multiply by \(P\): compute \(x_0 P^t\) for large \(t\), until the frequencies converge to \(\pi\).
    3. This simulates the long-run surfing behaviour of a random user.
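
A compact sketch of power iteration, run here on the illustrative 3×3 transition matrix constructed earlier (\(\alpha = 0.3\)):

import numpy as np

P = np.array([[0.45, 0.10, 0.45],
              [1/3,  1/3,  1/3 ],
              [0.10, 0.80, 0.10]])

x = np.array([1.0, 0.0, 0.0])      # any starting distribution works
for _ in range(100):               # x P^t for large t
    x = x @ P
print(x.round(3))                  # approximate PageRank vector pi

# Check: pi satisfies pi P = pi (left eigenvector for eigenvalue 1)
print(np.allclose(x @ P, x))       # True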

PageRank Illustration

Exercise: Go to Wooclap

PageRank Computation: Exercise

Question

  • Find the “coolest page” if \(x_0 = (1 \, 0 \, 0)\) over 3 iterations: \[ P = \begin{pmatrix} 0.45 & 0.1 & 0.45\\ 0.33 & 0.33 & 0.33\\ 0.1 & 0.8 & 0.1 \end{pmatrix} \]

Solution

Solution Graph

🔍 Hands-On: Trace the Random Surfer

Activity (5 minutes)

Experience PageRank yourself!

Instructions:

  1. Draw a simple 4-node web graph on paper (A, B, C, D with some links)
  2. Start at node A
  3. Trace 10 random steps:
    • Roll a die: if you get 1, teleport to a random node
    • Otherwise, follow a random outgoing link
  4. Keep a tally: which node did you visit most often?

Discussion

Compare your results with your neighbours. Did you all end up visiting the same nodes most often? That’s PageRank in action!

Topic-Specific PageRank

Personalized PageRank

🔍 Beyond PageRank: Two Types of Important Pages

Think About It

Not all important pages are the same:

  • Wikipedia is important because everyone links TO it (many incoming links)
  • A “Best Restaurants Edinburgh” blog is important because it links to many good places (many outgoing links)

These are different kinds of importance!

📊 Quick Poll: Hub or Authority?

Wooclap Question

Which is Wikipedia: a hub or an authority?

  • A. Hub (links to many pages)
  • B. Authority (many pages link to it)
  • C. Both
  • D. Neither

The Answer: Both!

Wikipedia is an authority (everyone links to it) AND a hub (it links to sources). This shows why we need both concepts!

Hubs vs Authorities

HITS: The Mutual Reinforcement Idea

Key Insight: Hubs and authorities boost each other!

  • A good hub links to many good authorities
  • A good authority is linked to by many good hubs

The Algorithm (Simple Version; a code sketch follows the list):

  1. Start: Everyone gets hub score = 1, authority score = 1
  2. Update authority scores: Add up the hub scores of pages linking TO you
  3. Update hub scores: Add up the authority scores of pages you link TO
  4. Repeat until scores stabilize
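
A rough sketch of this mutual reinforcement on an invented link graph; scores are normalised each round so they settle rather than grow without bound:

import math

# Invented link graph: page -> pages it links to
links = {
    "hubA": ["site1", "site2", "site3"],
    "hubB": ["site2", "site3"],
    "site1": [],
    "site2": ["site1"],
    "site3": [],
}
pages = list(links)

hub = {p: 1.0 for p in pages}        # 1. everyone starts at 1
auth = {p: 1.0 for p in pages}

for _ in range(30):                  # 4. repeat until stable
    # 2. authority: sum the hub scores of pages linking TO you
    auth = {p: sum(hub[q] for q in pages if p in links[q]) for p in pages}
    # 3. hub: sum the authority scores of pages you link TO
    hub = {p: sum(auth[q] for q in links[p]) for p in pages}
    # normalise so the numbers stay bounded
    na = math.sqrt(sum(v * v for v in auth.values()))
    nh = math.sqrt(sum(v * v for v in hub.values()))
    auth = {p: v / na for p, v in auth.items()}
    hub = {p: v / nh for p, v in hub.items()}

print({p: round(v, 2) for p, v in auth.items()})   # site2 and site3 come out as the strongest authorities
print({p: round(v, 2) for p, v in hub.items()})    # hubA is the strongest hub

At convergence these score vectors are the principal eigenvectors of \(AA^T\) and \(A^T A\), which is the matrix view shown next.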

HITS: The Math (Optional Detail)

  • For a web page/node \(\nu\):
    • \(h(\nu)\): Hub score, \(a(\nu)\): Authority score
    • Initially: \(h(\nu) = a(\nu) = 1\)
  • Iterative update rules:
    • Authority: \(a(\nu) \leftarrow \sum_{y \to \nu} h(y)\) (sum hub scores of pages pointing to you)
    • Hub: \(h(\nu) \leftarrow \sum_{\nu \to y} a(y)\) (sum authority scores of pages you point to)
  • In matrix form (with adjacency matrix \(A\)): \(\textbf{a} \leftarrow A^T \textbf{h}\) and \(\textbf{h} \leftarrow A \textbf{a}\); substituting gives \(\textbf{h} \leftarrow AA^T \textbf{h}\) and \(\textbf{a} \leftarrow A^T A \textbf{a}\).
    • At convergence: \(\lambda_h \textbf{h} = AA^T \textbf{h}\) and \(\lambda_a \textbf{a} = A^T A \textbf{a}\).
    • \(\lambda_h, \lambda_a\): The largest eigenvalues of \(AA^T\) and \(A^T A\); \(\textbf{h}\) and \(\textbf{a}\) are the corresponding principal eigenvectors (equivalently, the leading left and right singular vectors of \(A\) in its SVD).

Don’t worry about the matrix notation - the intuition is what matters!

HITS Algorithm

  • Hyperlink-Induced Topic Search (HITS) (Kleinberg 1999):
    1. Assemble the query-specific subset of web pages and compute \(AA^T\) and \(A^T A\).
    2. Calculate their principal eigenvectors to obtain \(\textbf{h}\) and \(\textbf{a}\), then output the top-scoring hubs and authorities.

Note: In the accompanying example, links containing the query term (“jaguar”) receive double weighting.

📝 Quick Quiz: Test Your Understanding

Question                                                           | Your Answer
1. A page with no outlinks is called a _____                       |
2. PageRank teleportation probability is typically denoted _____   |
3. In HITS, a page with high hub score points to many _____        |
4. Jaccard similarity measures _____ over _____                    |
5. The “memoryless” property defines a _____                       |

Answers:

  1. Dead end
  2. α (alpha)
  3. Authorities
  4. Intersection over Union
  5. Markov Chain

Summary and Next Steps

Key Takeaways

Search Engine Architecture:

  • Crawl → Index → Rank
  • Web crawlers respect robots.txt
  • Shingling detects duplicates (Jaccard similarity)

Markov Chains:

  • Random surfer model
  • Transition probability matrix
  • “Memoryless” property
  • Steady-state distribution

PageRank:

  • Links = votes of importance
  • Important pages link to important pages
  • Teleportation handles dead ends (α ≈ 0.15)
  • Power iteration for computation

HITS Algorithm:

  • Hubs point to authorities
  • Authorities are pointed to by hubs
  • Mutual reinforcement
  • Query-dependent ranking

Looking Ahead

Next Week: Social Network Fundamentals

  • Network structure and properties
  • Degree distribution and centrality measures
  • Community detection basics
  • Small world phenomenon

Assessment reminder:

  • Lab exercises build practical web scraping skills
  • PageRank calculations may appear in the assessment
  • ShopSocial analysis continues throughout the course

References

Kleinberg, Jon M. 1999. “Authoritative Sources in a Hyperlinked Environment.” Journal of the ACM (JACM) 46 (5): 604–32.

Questions?

Thank you!

Dr. Zexun Chen

📧 Zexun.Chen@ed.ac.uk

Office Hours: By appointment

Next Week: Social Network Fundamentals