Web and Social Network Analytics

Week 2: Search Engines & Web Crawler

Dr. Zexun Chen

Jan-2025

Table of Contents

Search Engines

Introduction

Search Engines: One of the most widely-used sources of information

Google Search in 2024

Search Engines

  • Make the World Wide Web navigable
    • Examples: Google, Yahoo, Bing, Baidu, Yandex, DuckDuckGo
    • AI-powered search engines, e.g., Perplexity
    • Some also search the Deep Web/Dark Web:
      • Part of the web that is not indexed by well-known search engines
      • Typically hidden behind paywalls, scripts, etc.
      • Dark web: websites that can only be reached by encryption-enabled browsers, e.g., Tor (The Onion Router)
    • Privacy-oriented search engines, e.g., DuckDuckGo, TorSearch (indexes pages to help find content in websites located on the Tor network), etc.
    • Search Engine is NOT the same as Web Browser:
      • Search Engine: Google, DuckDuckGo, TorSearch, Bing
      • Web Browser: Chrome (default engine: Google), DuckDuckGo Browser (default engine: DuckDuckGo), Tor Browser (default engine: DuckDuckGo), Edge (default engine: Bing)

Web Search Basics

Architecture

The basic architecture of a search engine

Key Components

  • Web Crawlers: Automated scripts that browse the internet to collect data.
  • Indexer: Processes and stores information gathered by crawlers.
  • Search Algorithm: Determines the relevance and ranking of results.

How Google Search Works

Principles

  • Finding: To “crawl” all of the links on the web.
  • Organizing: To index all of these links into their database.
  • Serving: To rank these links.

Comparison

  • How to compare search engines technically?
    • Estimates of fractions of the web indexed:
      • Independently estimated by picking random documents from one index.
    • Randomly sample each other:
      • \(|E_i|\): Number of indexed pages for engine \(i \in \{1,2\}\).
      • \(x\): Proportion of pages sampled from \(E_1\) that are also found in \(E_2\).
      • \(y\): Proportion of pages sampled from \(E_2\) that are also found in \(E_1\).
      • Both products estimate the overlap, \(x|E_1| \approx |E_1 \cap E_2| \approx y|E_2|\), so check whether \(\frac{|E_1|}{|E_2|} \approx \frac{y}{x}\).
    • Random tests:
      • Random search: Bias towards what is queried.
      • Random IP address: Bias towards shared, grouped IPs.
      • Random walks: Web is not strongly connected; this is expensive.
  • Other difficulties:
    • Term spam: Adding words that are unrelated to the site.
    • Duplicates: near-duplicate pages (roughly 40% of the web).

Shingling for Duplicates

  • Compare documents and determine how similar they are.
  • Document comparison.
  • Plagiarism detection.

Shingling
  • Shingle: String of length \(k\) in a document \(d\).
    • Example: \(k = 5\) in “University of Edinburgh.”
    • Shingles: “Unive”, “niver”, …, “burgh.”
    • \(k\) is typically chosen not too small, so that unrelated documents rarely share shingles by chance.
  • Degree of overlap between two documents:
    • \(S(d_j)\): The shingles in \(d_j\).
    • Jaccard index (intersection over union): \[ J(S(d_1), S(d_2)) = \frac{|S(d_1)\cap S(d_2)|}{|S(d_1) \cup S(d_2)|} \]
    • If a threshold is exceeded: the documents are regarded as similar.
      • In practice, hashing is applied to the documents’ shingles, and the comparison is done on the hashed values to save space and time.
  • Challenges:
    • High computational resources for large documents or web pages.
    • Optimal size of shingles.

Shingling

  • A k-shingle is a consecutive set of \(k\) words. For example: \[ D_1 = \text{"I am Zexun"}, \quad D_2 = \text{"Zexun I am"}. \]
    • The (k=1)-shingles of \(D_1 \cup D_2\) = { “I”, “am”, “Zexun” }.
    • The (k=2)-shingles of \(D_1 \cup D_2\) = { “I am”, “am Zexun”, “Zexun I” }.
  • Let’s look at (k=2)-shingles for each \(D_1\) and \(D_2\): \[ D_1 : \text{"I am", "am Zexun"} \quad D_2 : \text{"Zexun I", "I am"} \]
  • Therefore, the Jaccard index: \[ JS(D_1, D_2) = \frac{1}{3} \]
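
The computation above is easy to reproduce; below is a minimal Python sketch (the function names are illustrative) that rebuilds the shingle sets and the Jaccard index of 1/3 for \(D_1\) and \(D_2\).

```python
# Minimal word-level k-shingling and Jaccard similarity.
def shingles(text: str, k: int) -> set:
    """Return the set of k-word shingles of a text."""
    words = text.split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(s1: set, s2: set) -> float:
    """Jaccard index: intersection size over union size."""
    return len(s1 & s2) / len(s1 | s2)

d1, d2 = "I am Zexun", "Zexun I am"
print(shingles(d1, 2))   # {'I am', 'am Zexun'}
print(shingles(d2, 2))   # {'Zexun I', 'I am'}
print(jaccard(shingles(d1, 2), shingles(d2, 2)))   # 0.333... = 1/3
```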

Web Crawlers

Basics

Web Crawler

It aims to quickly and efficiently gather as many useful web pages as possible. Also called spiders, bots, or web robots.

  1. Starting Point: Begin with a list of web addresses (URLs).
  2. Crawling: Visit each URL and read the content of the page.
  3. Extracting Links: Identify all hyperlinks on the page and add them to the list of URLs to visit.
  4. Repeating the Process: Continue crawling these new URLs.
  5. Respecting Rules: Adhere to robots.txt files and meta tags.
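
A minimal sketch of steps 1–4 in Python, assuming the third-party requests and beautifulsoup4 packages are available. Step 5 (checking robots.txt, e.g., via urllib.robotparser) and politeness delays are deliberately omitted here; a real crawler must implement them.

```python
# Minimal breadth-first crawler: fetch pages, extract links, repeat.
from collections import deque
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

def crawl(seed_urls, max_pages=10):
    frontier = deque(seed_urls)   # step 1: list of starting URLs
    seen, fetched = set(seed_urls), 0
    while frontier and fetched < max_pages:
        url = frontier.popleft()
        try:
            html = requests.get(url, timeout=5).text   # step 2: visit the page
        except requests.RequestException:
            continue
        fetched += 1
        for a in BeautifulSoup(html, "html.parser").find_all("a", href=True):
            link = urljoin(url, a["href"])   # step 3: extract hyperlinks
            if link not in seen:
                seen.add(link)
                frontier.append(link)        # step 4: queue for later crawling
    return seen
```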


Web Crawling vs Web Scraping

  • Both are techniques used for extracting data from the web.
  • Web Crawling: Focuses on navigating the web and indexing content.
  • Web Scraping: Involves extracting specific data from web pages.
| Aspect  | Web Crawling           | Web Scraping             |
|---------|------------------------|--------------------------|
| Purpose | Indexing web content   | Extracting specific data |
| Process | Navigating and mapping | Parsing HTML for data    |
| Usage   | Search engines         | Data analysis, research  |
| Data    | Metadata, links        | Targeted data, text      |

Indexing and Storing Data

Steps

Step 1: Indexing

  • Indexing is the process of organizing data in a way that makes it efficiently retrievable.
  • Key to the functioning of search engines and large databases.
  • Involves creating references to data for quick access.
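
As a toy illustration, a minimal inverted index in Python: for every term it stores references to the documents containing it, so a lookup never scans the full collection (the two documents are made up for the example).

```python
# Minimal inverted index: map each term to the set of documents containing it.
from collections import defaultdict

docs = {1: "web search engines", 2: "web crawlers crawl the web"}

index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.lower().split():
        index[term].add(doc_id)   # a reference to the data for quick access

print(index["web"])        # {1, 2}
print(index["crawlers"])   # {2}
```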

Step 2: Storing

  • Data storage involves saving indexed data in a structured format.
  • Includes databases, data warehouses, and cloud storage.
  • Ensures data security, integrity, and availability.

Algorithms

Searching and Presentation

  • Key Role: To rank web pages based on relevance and authority.
    • Keyword Analysis: The primary step in understanding the query intent.
    • PageRank: Google’s initial algorithm based on Link Analysis.
    • Current Trends: Incorporation of AI and machine learning for context understanding.
  • Relevance in Search Results:
    • Defining relevance: Matching user queries with accurate and useful information.
    • Factors influencing relevance: Keywords, site authority, content quality, user engagement metrics.
    • The challenge of subjectivity in relevance.
  • Personalization:
    • Same search, but different results based on user context.

Markov Chains: Definition

A Markov chain (or process) is a stochastic model describing a sequence of possible events in which the probability of each event depends only on the state attained in the previous event.

Key Concepts

  • In each step, a random choice is made.
  • Consists of \(N\) states (e.g., each web page).
  • \(N \times N\) transition probability matrix \(P\):
    • Each row of \(P\) adds up to 1.
    • \(P_{ij}\) tells the probability that the next state is \(j\) when in state \(i\).
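
A small NumPy illustration of these properties, using the 3-state matrix from the exercise later in this deck:

```python
# A 3-state Markov chain: rows of P sum to 1, P[i, j] is the
# probability of moving from state i to state j in one step.
import numpy as np

P = np.array([[0.45, 0.10, 0.45],
              [0.33, 0.33, 0.33],
              [0.10, 0.80, 0.10]])

print(P.sum(axis=1))             # each row sums to (roughly) 1
x0 = np.array([1.0, 0.0, 0.0])   # the surfer starts in state 1
print(x0 @ P)                    # distribution after one step: row 1 of P
```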

Markov Chain

Markov Chains: Surfer’s Position

  • If \(x_0\) is the surfer’s initial distribution, the distribution after one step is \(x_0 P\), after two steps \((x_0 P) P = x_0 P^2\), and after \(t\) steps \(x_0 P^t\).
  • Converges to a steady-state distribution.
  • The next state depends only on the current state (the Markov property).

Markov Chain Steady-State

Markov Chains: Adjacency Matrix

A random surfer’s behavior on the web graph can be modeled with an adjacency matrix: \[ A = (a_{ij}) = \text{Number of edges from vertex $i$ to $j$}. \]

Adjacency Matrix
  • Dead Ends:
    • No outlinks; rows in \(A\) sum to 0.
    • After a few iterations, entries in \(x\) become 0.
  • Solutions:
    • Drop dead ends (cascading effect: removing them can create new dead ends).
    • Add “teleportation” links:
      • Randomly jump to another page with probability \(0 < \alpha < 1\).

Markov Chains: Adjacency Matrix

  • Transform \(A\) to \(P\):
    1. Divide each 1 by the number of ones in the row.
    2. If a row has no 1s, replace by \(1/N\).
    3. Multiply the matrix by \(1 - \alpha\).
    4. Add \(\alpha / N\) to every entry.
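
The four steps map directly onto code; below is a minimal NumPy sketch (the function name to_transition is illustrative, assuming a 0/1 adjacency matrix):

```python
# Turn an adjacency matrix A into a teleporting transition matrix P.
import numpy as np

def to_transition(A, alpha):
    N = A.shape[0]
    P = A.astype(float)
    for i in range(N):
        s = P[i].sum()
        if s == 0:
            P[i] = 1.0 / N   # step 2: dead-end row becomes uniform
        else:
            P[i] /= s        # step 1: divide by the number of ones
    return (1 - alpha) * P + alpha / N   # steps 3 and 4: teleportation

A = np.array([[1, 0, 1], [0, 0, 0], [0, 1, 0]])
print(to_transition(A, 0.3))   # reproduces Example 1 below
```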

Markov Chains: Example 1

  • Converting this web graph from an adjacency matrix to a transition matrix for \(\alpha = 0.3\): \[ A = \begin{pmatrix} 1 & 0 & 1\\ 0 & 0 & 0\\ 0 & 1 & 0 \end{pmatrix} \]
  • Transition matrix accounts for teleportation probability.
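
Working it through: rows 1 and 3 normalise to \((\frac{1}{2}, 0, \frac{1}{2})\) and \((0, 1, 0)\), and the dead-end row 2 is replaced by \((\frac{1}{3}, \frac{1}{3}, \frac{1}{3})\); multiplying by \(1-\alpha = 0.7\) and adding \(\alpha/N = 0.1\) gives \[ P = \begin{pmatrix} 0.45 & 0.1 & 0.45\\ 1/3 & 1/3 & 1/3\\ 0.1 & 0.8 & 0.1 \end{pmatrix} \] (for a dead-end row, \((1-\alpha)/N + \alpha/N = 1/N\) for any \(\alpha\)).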

Transition Matrix for α=0.3

Markov Chains: Example 2

  • Converting this web graph from an adjacency matrix to a transition matrix for \(\alpha = 0.7\): \[ A = \begin{pmatrix} 1 & 0 & 1\\ 0 & 0 & 0\\ 1 & 1 & 1 \end{pmatrix} \]
  • Transition matrix includes teleportation adjustments with a higher \(\alpha\) value.
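
Working it through with \(1-\alpha = 0.3\) and \(\alpha/N = 0.7/3\): row 1 normalises to \((\frac{1}{2}, 0, \frac{1}{2})\), and rows 2 (dead end) and 3 both become \((\frac{1}{3}, \frac{1}{3}, \frac{1}{3})\), so \[ P = \begin{pmatrix} 0.38\overline{3} & 0.23\overline{3} & 0.38\overline{3}\\ 1/3 & 1/3 & 1/3\\ 1/3 & 1/3 & 1/3 \end{pmatrix} \] The larger \(\alpha\) pulls every row closer to the uniform distribution.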

Transition Matrix for α=0.7

PageRank: Algorithm Overview

  • Basic algorithm underlying Google’s search engine.
  • Named after Larry Page (Co-founder of Google).
  • A general approach for scoring nodes in a network.

PageRank

PageRank: Principles

  • At each time step, the surfer makes a decision.
  • In the long run, some pages are visited more often than others.
  • Challenges:
    • What if there are no out-links?
      • Add teleportation: with probability \(\alpha\), jump to a uniformly random page (each page with probability \(1/N\)).
  • \(\pi(\nu)\) (PageRank score) combines:
    • Random Walk: Reflects the structure of the web graph.
    • Teleportation: Controlled by \(\alpha\).
  • Assumption: Important websites receive more links from other websites.

PageRank Computation

  • Left eigenvectors of \(P\) are vectors \(\pi\) of size \(N\):
    • \(\pi P = \lambda \pi\), where \(\lambda\) are the eigenvalues of \(P\).
    • If \(\pi\) is the steady-state distribution: \[ \pi P = 1 \pi \]
    • Compute left eigenvector of \(P\) for eigenvalue 1 to obtain PageRank values \(\pi\).
  • Calculated through power iteration:
    1. Start with an initial distribution \(x\).
    2. Find \(xP^t\) for large \(t\), as frequencies converge.
    3. Simulates a user’s surfing behavior.
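
A minimal power-iteration sketch in NumPy (the stopping tolerance is an illustrative choice; \(P\) is assumed to already include teleportation, so the iteration converges):

```python
# Power iteration: repeatedly apply P until the distribution stabilises.
import numpy as np

def pagerank(P, tol=1e-10):
    N = P.shape[0]
    x = np.full(N, 1.0 / N)   # any initial distribution works
    while True:
        x_next = x @ P        # one step of the random surfer
        if np.abs(x_next - x).sum() < tol:
            return x_next     # approximate left eigenvector for eigenvalue 1
        x = x_next
```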

PageRank Illustration

PageRank Computation: Exercise

Question

  • Find the “coolest page” if \(x_0 = (1 \, 0 \, 0)\) over 3 iterations: \[ P = \begin{pmatrix} 0.45 & 0.1 & 0.45\\ 0.33 & 0.33 & 0.33\\ 0.1 & 0.8 & 0.1 \end{pmatrix} \]

Question Graph

Solution
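
One way to carry out the iteration \(x_{t+1} = x_t P\), with values rounded: \[ x_1 = x_0 P = (0.45 \;\; 0.10 \;\; 0.45) \] \[ x_2 = x_1 P \approx (0.28 \;\; 0.44 \;\; 0.28) \] \[ x_3 = x_2 P \approx (0.30 \;\; 0.40 \;\; 0.30) \] Page 2 accumulates the largest share of the probability, so it is the “coolest page.”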

Solution Graph

Topic-Specific PageRank

  • Query-Independent Measure:
    • PageRank is computed once for the whole web graph, independently of any query.
    • The final ordering of results combines query relevance with the PageRank score.
  • Topic-Specific PageRank:
    • Teleporting is not uniformly distributed.
    • Example: A sports fanatic’s browsing preferences: \[ S: \text{Sports-related pages, with sports PageRank } \pi_S. \]
      • The teleport distribution is nonzero only for pages in \(S\), so teleports land only on sports-related pages.

Personalized PageRank

  • Linear combination of a small number of topic-specific distributions.
  • Example: A user with a 10% total teleport probability, split as 6% teleports to sports pages and 4% to politics pages (a 60/40 split), has the personalized PageRank \[ 0.6\, \pi_S + 0.4\, \pi_P \]

Personalized PageRank Graph

Hubs and Authorities

  • PageRank: Simplistic view of a network (one score per node).
  • Hubs and Authorities: Richer view of a network with different node types:
    • Hub: A page that points to many others.
    • Authority: A page that many others refer to.
  • Reflecting Types of Websites:
    • Broad-topic searches often land on authority pages:
      • Example: Searching “leukaemia” leads to the National Cancer Institute’s page.
    • Hub pages point to authoritative pages:
      • Example: Searching “sell property” leads to Rightmove.

Hubs and Authorities

Hubs and Authorities: Calculations

  • For a web page/node \(\nu\):
    • \(h(\nu)\): Hub score.
    • \(a(\nu)\): Authority score.
    • Initially: \(h(\nu) = a(\nu) = 1\).
    • Write \(\nu \to y\) if there is a link from \(\nu\) to \(y\).
  • Iterative Calculations:
    • \(h(\nu) \leftarrow \sum_{\nu \to y} a(y)\).
    • \(a(\nu) \leftarrow \sum_{y \to \nu} h(y)\).
    • \(\textbf{h} \leftarrow A \textbf{a}\), \(\textbf{a} \leftarrow A^T \textbf{h}\) (adjacency matrix \(A\)).
    • Substituting one update into the other: \(\textbf{h} \leftarrow AA^T \textbf{h}\), \(\textbf{a} \leftarrow A^T A \textbf{a}\); a sketch follows below.
      • At convergence: \(\lambda_h \textbf{h} = AA^T \textbf{h}\), \(\lambda_a \textbf{a} = A^T A \textbf{a}\).
      • \(\lambda_h, \lambda_a\): Principal eigenvalues of \(AA^T\) and \(A^T A\) respectively (the SVD view of \(A\)).
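
A minimal NumPy sketch of these iterative updates (the per-step normalisation is an illustrative choice; only relative scores matter, and the directions converge to the principal eigenvectors named above):

```python
# Iterative hub/authority updates with rescaling after each round.
import numpy as np

def hits(A, iters=5):
    h = np.ones(A.shape[0])   # initially h(v) = a(v) = 1
    a = np.ones(A.shape[0])
    for _ in range(iters):    # ~5 iterations are customary
        a = A.T @ h              # a(v) <- sum of h(y) over links y -> v
        h = A @ a                # h(v) <- sum of a(y) over links v -> y
        a /= np.linalg.norm(a)   # rescale so scores stay bounded
        h /= np.linalg.norm(h)
    return h, a
```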

HITS Algorithm

  • Hyperlink-Induced Topic Search (HITS) (Kleinberg, 1997):
    1. Assemble the target subset of web pages, then compute \(AA^T\) and \(A^T A\).
    2. Calculate their principal eigenvectors to find \(\textbf{h}\) and \(\textbf{a}\), and output the top-scoring hubs and authorities.

Note: In the running example, links containing the query term “jaguar” receive double weighting.

Choosing the Subset of the Web

  • Collection of Sites:
    • Use a text index to get all related pages – root set.
    • Build the base set: the root set plus pages that link to it or are linked from it.
    • Use the base set to compute hub and authority scores.
  • Authority Sites May Not Contain Keywords:
    • Example: IBM for computer hardware.
    • Hubs are likely to include the terms.
  • HITS Algorithm:
    • 5 iterations on average are customary.
    • Downside: Inclusion of pages in a foreign language.
      • Cause: the base set is built by following links, ignoring the query.
  • Importance:
    • Enhances the quality of search results by identifying expert sources.
    • Used in combination with other algorithms like PageRank for comprehensive analysis.
    • Critical for academic and topic-specific searches.

Take Home Messages

  • Search Engines
  • Web Search Basics
  • Web Crawler
  • Indexing and Storing Data
  • Searching Algorithms:
    • Link Analysis
    • Markov Chains (***)
    • PageRank (**)
    • Hubs and Authorities (**)