Web and Social Network Analytics

Week 2: Search Engines & Web Crawler

Dr. Zexun Chen

Jan-2025

Table of Contents

Search Engines

Introduction

Search Engines: One of the most widely-used sources of information

Google Search in 2024

Search Engines

  • Make the World Wide Web navigable
    • Examples: Google, Yahoo, Bing, Baidu, Yandex, DuckDuckGo
    • AI-powered search engines, e.g., Perplexity
    • Some also search the Deep Web/Dark Web:
      • Part of the web that is not indexed by well-known search engines
      • Typically hidden behind paywalls, scripts, etc.
      • Dark web: websites that can only be reached by encryption-enabled browsers, e.g., Tor (The Onion Router)
    • Privacy-oriented search engines, e.g., DuckDuckGo, TorSearch (indexes pages to help find content in websites located on the Tor network), etc.
    • Search Engine is NOT the same as Web Browser:
      • Search Engine: Google, DuckDuckGo, TorSearch, Bing
      • Web Browser: Chrome (default engine: Google), DuckDuckGo Browser (default engine: DuckDuckGo), Tor Browser (default engine: DuckDuckGo), Edge (default engine: Bing)

Web Search Basics

Architecture

The basic architecture of a search engine

Key Components

  • Web Crawlers: Automated scripts that browse the internet to collect data.
  • Indexer: Processes and stores information gathered by crawlers.
  • Search Algorithm: Determines the relevance and ranking of results.

How Google Search Works

Principles

  • Finding: To “crawl” all of the links on the web.
  • Organizing: To index all of these links into their database.
  • Serving: To rank these links.

Comparison

  • How to compare search engines technically?
    • Estimates of fractions of the web indexed:
      • Independently estimated by picking random documents from one index.
    • Randomly sample each other:
      • \(|E_i|\): Number of indexed pages for engine \(i \in \{1,2\}\).
      • \(x\): Proportion of pages sampled from \(E_1\) that are also found in \(E_2\).
      • \(y\): Proportion of pages sampled from \(E_2\) that are also found in \(E_1\).
      • Both products estimate the overlap, \(x|E_1| \approx |E_1 \cap E_2| \approx y|E_2|\), so check whether \(\frac{|E_1|}{|E_2|} \approx \frac{y}{x}\).
    • Random tests:
      • Random search: Bias towards what is queried.
      • Random IP address: Bias towards shared, grouped IPs.
      • Random walks: Web is not strongly connected; this is expensive.
  • Other difficulties:
    • Term spam: Adding words that are unrelated to the site.
    • Duplicates: near-duplicate pages (roughly 40% of the web).

Shingling for Duplicates

  • Compare documents and determine how similar they are.
  • Document comparison.
  • Plagiarism detection.

Shingling
  • Shingle: String of length \(k\) in a document \(d\).
    • Example: \(k = 5\) in “University of Edinburgh.”
    • Shingles: “Unive”, “niver”, …, “burgh.”
    • \(k\) is typically chosen not too small, so that unrelated documents rarely share shingles by chance.
  • Degree of overlap between two documents:
    • \(S(d_j)\): The shingles in \(d_j\).
    • Jaccard index (intersection over union): \[ J(S(d_1), S(d_2)) = \frac{|S(d_1)\cap S(d_2)|}{|S(d_1) \cup S(d_2)|} \]
    • If a threshold is exceeded: the documents are regarded as similar.
      • In practice, hashing is applied to the documents’ shingles, and the comparison is done on the hashed values to save space and time.
  • Challenges:
    • High computational resources for large documents or web pages.
    • Optimal size of shingles.

Shingling

  • A k-shingle is a consecutive set of \(k\) words. For example: \[ D_1 = \text{"I am Zexun"}, \quad D_2 = \text{"Zexun I am"}. \]
    • The (k=1)-shingles of \(D_1 \cup D_2\) = { “I”, “am”, “Zexun” }.
    • The (k=2)-shingles of \(D_1 \cup D_2\) = { “I am”, “am Zexun”, “Zexun I” }.
  • Let’s look at (k=2)-shingles for each \(D_1\) and \(D_2\): \[ D_1 : \text{"I am", "am Zexun"} \quad D_2 : \text{"Zexun I", "I am"} \]
  • Therefore, the Jaccard index: \[ JS(D_1, D_2) = \frac{1}{3} \]
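
The computation above is easy to reproduce; below is a minimal Python sketch (the function names are illustrative) that rebuilds the shingle sets and the Jaccard index of 1/3 for \(D_1\) and \(D_2\).

```python
# Minimal word-level k-shingling and Jaccard similarity.
def shingles(text: str, k: int) -> set:
    """Return the set of k-word shingles of a text."""
    words = text.split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(s1: set, s2: set) -> float:
    """Jaccard index: intersection size over union size."""
    return len(s1 & s2) / len(s1 | s2)

d1, d2 = "I am Zexun", "Zexun I am"
print(shingles(d1, 2))   # {'I am', 'am Zexun'}
print(shingles(d2, 2))   # {'Zexun I', 'I am'}
print(jaccard(shingles(d1, 2), shingles(d2, 2)))   # 0.333... = 1/3
```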

Web Crawlers

Basics

Web Crawler

It aims to quickly and efficiently gather as many useful web pages as possible. Also called spiders, bots, or web robots.

  1. Starting Point: Begin with a list of web addresses (URLs).
  2. Crawling: Visit each URL and read the content of the page.
  3. Extracting Links: Identify all hyperlinks on the page and add them to the list of URLs to visit.
  4. Repeating the Process: Continue crawling these new URLs.
  5. Respecting Rules: Adhere to robots.txt files and meta tags.
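
A minimal sketch of steps 1–4 in Python, assuming the third-party requests and beautifulsoup4 packages are available. Step 5 (checking robots.txt, e.g., via urllib.robotparser) and politeness delays are deliberately omitted here; a real crawler must implement them.

```python
# Minimal breadth-first crawler: fetch pages, extract links, repeat.
from collections import deque
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

def crawl(seed_urls, max_pages=10):
    frontier = deque(seed_urls)   # step 1: list of starting URLs
    seen, fetched = set(seed_urls), 0
    while frontier and fetched < max_pages:
        url = frontier.popleft()
        try:
            html = requests.get(url, timeout=5).text   # step 2: visit the page
        except requests.RequestException:
            continue
        fetched += 1
        for a in BeautifulSoup(html, "html.parser").find_all("a", href=True):
            link = urljoin(url, a["href"])   # step 3: extract hyperlinks
            if link not in seen:
                seen.add(link)
                frontier.append(link)        # step 4: queue for later crawling
    return seen
```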


Web Crawling vs Web Scraping

  • Both are techniques used for extracting data from the web.
  • Web Crawling: Focuses on navigating the web and indexing content.
  • Web Scraping: Involves extracting specific data from web pages.
| Aspect  | Web Crawling           | Web Scraping             |
|---------|------------------------|--------------------------|
| Purpose | Indexing web content   | Extracting specific data |
| Process | Navigating and mapping | Parsing HTML for data    |
| Usage   | Search engines         | Data analysis, research  |
| Data    | Metadata, links        | Targeted data, text      |

Indexing and Storing Data

Steps

Step 1: Indexing

  • Indexing is the process of organizing data in a way that makes it efficiently retrievable.
  • Key to the functioning of search engines and large databases.
  • Involves creating references to data for quick access.
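
As a toy illustration, a minimal inverted index in Python: for every term it stores references to the documents containing it, so a lookup never scans the full collection (the two documents are made up for the example).

```python
# Minimal inverted index: map each term to the set of documents containing it.
from collections import defaultdict

docs = {1: "web search engines", 2: "web crawlers crawl the web"}

index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.lower().split():
        index[term].add(doc_id)   # a reference to the data for quick access

print(index["web"])        # {1, 2}
print(index["crawlers"])   # {2}
```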

Step 2: Storing

  • Data storage involves saving indexed data in a structured format.
  • Includes databases, data warehouses, and cloud storage.
  • Ensures data security, integrity, and availability.

Algorithms

Searching and Presentation

  • Key Role: To rank web pages based on relevance and authority.
    • Keyword Analysis: The primary step in understanding the query intent.
    • PageRank: Google’s initial algorithm based on Link Analysis.
    • Current Trends: Incorporation of AI and machine learning for context understanding.
  • Relevance in Search Results:
    • Defining relevance: Matching user queries with accurate and useful information.
    • Factors influencing relevance: Keywords, site authority, content quality, user engagement metrics.
    • The challenge of subjectivity in relevance.
  • Personalization:
    • Same search, but different results based on user context.

Markov Chains: Definition

A Markov chain (or process) is a stochastic model describing a sequence of possible events in which the probability of each event depends only on the state attained in the previous event.

Key Concepts

  • In each step, a random choice is made.
  • Consists of \(N\) states (e.g., each web page).
  • \(N \times N\) transition probability matrix \(P\):
    • Each row of \(P\) adds up to 1.
    • \(P_{ij}\) tells the probability that the next state is \(j\) when in state \(i\).
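
A small NumPy illustration of these properties, using the 3-state matrix from the exercise later in this deck:

```python
# A 3-state Markov chain: rows of P sum to 1, P[i, j] is the
# probability of moving from state i to state j in one step.
import numpy as np

P = np.array([[0.45, 0.10, 0.45],
              [0.33, 0.33, 0.33],
              [0.10, 0.80, 0.10]])

print(P.sum(axis=1))             # each row sums to (roughly) 1
x0 = np.array([1.0, 0.0, 0.0])   # the surfer starts in state 1
print(x0 @ P)                    # distribution after one step: row 1 of P
```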

Markov Chain

Markov Chains: Surfer’s Position

  • If \(x_0\) is the surfer’s initial distribution, the distribution after one step is \(x_0 P\), after two steps \((x_0 P) P = x_0 P^2\), and after \(t\) steps \(x_0 P^t\).
  • Converges to a steady-state distribution.
  • The next state depends only on the current state (the Markov property).

Markov Chain Steady-State

Markov Chains: Adjacency Matrix

A random surfer’s behavior on the web graph can be modeled with an adjacency matrix: \[ A = (a_{ij}) = \text{Number of edges from vertex $i$ to $j$}. \]

Adjacency Matrix
  • Dead Ends:
    • No outlinks; rows in \(A\) sum to 0.
    • After a few iterations, entries in \(x\) become 0.
  • Solutions:
    • Drop dead ends (cascading effect: removing them can create new dead ends).
    • Add “teleportation” links:
      • Randomly jump to another page with probability \(0 < \alpha < 1\).

Markov Chains: Adjacency Matrix

  • Transform \(A\) to \(P\):
    1. Divide each 1 by the number of ones in the row.
    2. If a row has no 1s, replace by \(1/N\).
    3. Multiply the matrix by \(1 - \alpha\).
    4. Add \(\alpha / N\) to every entry.
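
The four steps map directly onto code; below is a minimal NumPy sketch (the function name to_transition is illustrative, assuming a 0/1 adjacency matrix):

```python
# Turn an adjacency matrix A into a teleporting transition matrix P.
import numpy as np

def to_transition(A, alpha):
    N = A.shape[0]
    P = A.astype(float)
    for i in range(N):
        s = P[i].sum()
        if s == 0:
            P[i] = 1.0 / N   # step 2: dead-end row becomes uniform
        else:
            P[i] /= s        # step 1: divide by the number of ones
    return (1 - alpha) * P + alpha / N   # steps 3 and 4: teleportation

A = np.array([[1, 0, 1], [0, 0, 0], [0, 1, 0]])
print(to_transition(A, 0.3))   # reproduces Example 1 below
```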

Markov Chains: Example 1

  • Converting this web graph from an adjacency matrix to a transition matrix for \(\alpha = 0.3\): \[ A = \begin{pmatrix} 1 & 0 & 1\\ 0 & 0 & 0\\ 0 & 1 & 0 \end{pmatrix} \]
  • Transition matrix accounts for teleportation probability.
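
Working it through: rows 1 and 3 normalise to \((\frac{1}{2}, 0, \frac{1}{2})\) and \((0, 1, 0)\), and the dead-end row 2 is replaced by \((\frac{1}{3}, \frac{1}{3}, \frac{1}{3})\); multiplying by \(1-\alpha = 0.7\) and adding \(\alpha/N = 0.1\) gives \[ P = \begin{pmatrix} 0.45 & 0.1 & 0.45\\ 1/3 & 1/3 & 1/3\\ 0.1 & 0.8 & 0.1 \end{pmatrix} \] (for a dead-end row, \((1-\alpha)/N + \alpha/N = 1/N\) for any \(\alpha\)).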

Transition Matrix for α=0.3

Markov Chains: Example 2

  • Converting this web graph from an adjacency matrix to a transition matrix for \(\alpha = 0.7\): \[ A = \begin{pmatrix} 1 & 0 & 1\\ 0 & 0 & 0\\ 1 & 1 & 1 \end{pmatrix} \]
  • Transition matrix includes teleportation adjustments with a higher \(\alpha\) value.
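
Working it through with \(1-\alpha = 0.3\) and \(\alpha/N = 0.7/3\): row 1 normalises to \((\frac{1}{2}, 0, \frac{1}{2})\), and rows 2 (dead end) and 3 both become \((\frac{1}{3}, \frac{1}{3}, \frac{1}{3})\), so \[ P = \begin{pmatrix} 0.38\overline{3} & 0.23\overline{3} & 0.38\overline{3}\\ 1/3 & 1/3 & 1/3\\ 1/3 & 1/3 & 1/3 \end{pmatrix} \] The larger \(\alpha\) pulls every row closer to the uniform distribution.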

Transition Matrix for α=0.7

PageRank: Algorithm Overview

  • Basic algorithm underlying Google’s search engine.
  • Named after Larry Page (Co-founder of Google).
  • A general approach for scoring nodes in a network.

PageRank

PageRank: Principles

  • At each time step, the surfer makes a decision.
  • In the long run, some pages are visited more often than others.
  • Challenges:
    • What if there are no out-links?
      • Add teleportation: with probability \(\alpha\), jump to a uniformly random page (each page with probability \(1/N\)).
  • \(\pi(\nu)\) (PageRank score) combines:
    • Random Walk: Reflects the structure of the web graph.
    • Teleportation: Controlled by \(\alpha\).
  • Assumption: Important websites receive more links from other websites.

PageRank Computation

  • Left eigenvectors of \(P\) are vectors \(\pi\) of size \(N\):
    • \(\pi P = \lambda \pi\), where \(\lambda\) are the eigenvalues of \(P\).
    • If \(\pi\) is the steady-state distribution: \[ \pi P = 1 \pi \]
    • Compute left eigenvector of \(P\) for eigenvalue 1 to obtain PageRank values \(\pi\).
  • Calculated through power iteration:
    1. Start with an initial distribution \(x\).
    2. Find \(xP^t\) for large \(t\), as frequencies converge.
    3. Simulates a user’s surfing behavior.
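
A minimal power-iteration sketch in NumPy (the stopping tolerance is an illustrative choice; \(P\) is assumed to already include teleportation, so the iteration converges):

```python
# Power iteration: repeatedly apply P until the distribution stabilises.
import numpy as np

def pagerank(P, tol=1e-10):
    N = P.shape[0]
    x = np.full(N, 1.0 / N)   # any initial distribution works
    while True:
        x_next = x @ P        # one step of the random surfer
        if np.abs(x_next - x).sum() < tol:
            return x_next     # approximate left eigenvector for eigenvalue 1
        x = x_next
```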

PageRank Illustration

PageRank Computation: Exercise

Question

  • Find the “coolest page” if \(x_0 = (1 \, 0 \, 0)\) over 3 iterations: \[ P = \begin{pmatrix} 0.45 & 0.1 & 0.45\\ 0.33 & 0.33 & 0.33\\ 0.1 & 0.8 & 0.1 \end{pmatrix} \]

Question Graph

Solution
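
One way to carry out the iteration \(x_{t+1} = x_t P\), with values rounded: \[ x_1 = x_0 P = (0.45 \;\; 0.10 \;\; 0.45) \] \[ x_2 = x_1 P \approx (0.28 \;\; 0.44 \;\; 0.28) \] \[ x_3 = x_2 P \approx (0.30 \;\; 0.40 \;\; 0.30) \] Page 2 accumulates the largest share of the probability, so it is the “coolest page.”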

Solution Graph

Topic-Specific PageRank

  • Query-Independent Measure:
    • PageRank is computed once for the whole web graph, independently of any query.
    • The final ordering of results combines query relevance with the PageRank score.
  • Topic-Specific PageRank:
    • Teleporting is not uniformly distributed.
    • Example: A sports fanatic’s browsing preferences: \[ S: \text{Sports-related pages, with sports PageRank } \pi_S. \]
      • The teleport distribution is nonzero only for pages in \(S\), so teleports land only on sports-related pages.

Personalized PageRank

  • Linear combination of a small number of topic-specific distributions.
  • Example: A user with a 10% total teleport probability, split as 6% teleports to sports pages and 4% to politics pages (a 60/40 split), has the personalized PageRank \[ 0.6\, \pi_S + 0.4\, \pi_P \]

Personalized PageRank Graph

Hubs and Authorities

  • PageRank: Simplistic view of a network (one score per node).
  • Hubs and Authorities: Richer view of a network with different node types:
    • Hub: A page that points to many others.
    • Authority: A page that many others refer to.
  • Reflecting Types of Websites:
    • Broad-topic searches often land on authority pages:
      • Example: Searching “leukaemia” leads to the National Cancer Institute’s page.
    • Hub pages point to authoritative pages:
      • Example: Searching “sell property” leads to Rightmove.

Hubs and Authorities

Hubs and Authorities: Calculations

  • For a web page/node \(\nu\):
    • \(h(\nu)\): Hub score.
    • \(a(\nu)\): Authority score.
    • Initially: \(h(\nu) = a(\nu) = 1\).
    • Write \(\nu \to y\) if there is a link from \(\nu\) to \(y\).
  • Iterative Calculations:
    • \(h(\nu) \leftarrow \sum_{\nu \to y} a(y)\).
    • \(a(\nu) \leftarrow \sum_{y \to \nu} h(y)\).
    • \(\textbf{h} \leftarrow A \textbf{a}\), \(\textbf{a} \leftarrow A^T \textbf{h}\) (adjacency matrix \(A\)).
    • Substituting one update into the other: \(\textbf{h} \leftarrow AA^T \textbf{h}\), \(\textbf{a} \leftarrow A^T A \textbf{a}\); a sketch follows below.
      • At convergence: \(\lambda_h \textbf{h} = AA^T \textbf{h}\), \(\lambda_a \textbf{a} = A^T A \textbf{a}\).
      • \(\lambda_h, \lambda_a\): Principal eigenvalues of \(AA^T\) and \(A^T A\) respectively (the SVD view of \(A\)).
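
A minimal NumPy sketch of these iterative updates (the per-step normalisation is an illustrative choice; only relative scores matter, and the directions converge to the principal eigenvectors named above):

```python
# Iterative hub/authority updates with rescaling after each round.
import numpy as np

def hits(A, iters=5):
    h = np.ones(A.shape[0])   # initially h(v) = a(v) = 1
    a = np.ones(A.shape[0])
    for _ in range(iters):    # ~5 iterations are customary
        a = A.T @ h              # a(v) <- sum of h(y) over links y -> v
        h = A @ a                # h(v) <- sum of a(y) over links v -> y
        a /= np.linalg.norm(a)   # rescale so scores stay bounded
        h /= np.linalg.norm(h)
    return h, a
```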

HITS Algorithm

  • Hyperlink-Induced Topic Search (HITS) (Kleinberg, 1997):
    1. Assemble the target subset of web pages, then compute \(AA^T\) and \(A^T A\).
    2. Calculate their principal eigenvectors to find \(\textbf{h}\) and \(\textbf{a}\), and output the top-scoring hubs and authorities.

Note: In the running example, links containing the query term “jaguar” receive double weighting.

Choosing the Subset of the Web

  • Collection of Sites:
    • Use a text index to get all related pages – root set.
    • Build the base set: the root set plus pages that link to it or are linked from it.
    • Use the base set to compute hub and authority scores.
  • Authority Sites May Not Contain Keywords:
    • Example: IBM for computer hardware.
    • Hubs are likely to include the terms.
  • HITS Algorithm:
    • 5 iterations on average are customary.
    • Downside: Inclusion of pages in a foreign language.
      • Cause: the base set is built by following links, ignoring the query.
  • Importance:
    • Enhances the quality of search results by identifying expert sources.
    • Used in combination with other algorithms like PageRank for comprehensive analysis.
    • Critical for academic and topic-specific searches.

Take Home Messages

  • Search Engines
  • Web Search Basics
  • Web Crawler
  • Indexing and Storing Data
  • Searching Algorithms:
    • Link Analysis
    • Markov Chains (***)
    • PageRank (**)
    • Hubs and Authorities (**)