Web and Social Network Analytics

Week 3: Social Media & Graph Analytics

Dr. Zexun Chen

Zexun.Chen@ed.ac.uk

Jan-2025

Introduction

Last decade explosion of social media websites
- E.g., Twitter, LinkedIn, Facebook, Instagram
Creation of vast amount of content
Content is also linked
- Interpersonal/entity relations from graph
  - Network of interactions
- Data linked to people
  - Preferences, videos, images, sensor data

Two kinds of data:
- Linkage-based for structural analysis:
  - Links are explicit, e.g., Facebook friendship
  - Analyse linkage behaviour to determine important nodes, communities, links, …
- Content-based:
  - Not explicitly linked, but becomes more informative when combined with linkage data
  - Often very rich: e-mails, blogs, message boards, reviews
    - Linked through participants, customers, companies
Types of analyses:
- Static: Social network changes slowly over time, e.g., bibliographic network.
- Dynamic: Rapidly generated and changing content and user base, e.g., instant messaging.

Typical analyses:
- Community detection
- Analysis of interaction dynamics
- Link inference
- Transfer learning
- Adversary detection
- Ranking

Questions
What do Social Networks look like? How do they behave over time? What distribution of patterns do they maintain?

Analytics Tools

Preliminaries:

Graphs: \(G = (V, E, W)\)
- \(V\): vertices (nodes), \(E\): edges (links), \(W\): weights
- Edges (links):
  - Directed: phone call, follower on Twitter
  - Undirected: collaboration, friendship on Facebook
  - Weighted: edge \(e_{ij}\) has weight \(w_{ij}\)
  - \(E_{max} = \binom{N}{2} = \frac{N(N-1)}{2}, N = |V|\)
- Vertices (nodes):
  - Node degree \(k_i\): number of edges adjacent
  - In-degree \(k^{in}\)
  - Out-degree \(k^{out}\)

Example:

Graph Basics

Graphs: \(G = (V, E, W)\)
- Unipartite, multipartite (typically bipartite)
  - Homogeneous nodes vs nodes of different classes
- Adjacency matrix \(A\), \(|V| \times |V|\) matrix
  - Sparsity is preferable (for calculations)
  - \(|E| \ll E_{max}\) or \(\bar{k} \ll N-1\)

Preliminaries: Components

(Weakly) Connected Component: Set of nodes and edges where there exists a path between any two nodes in set.
- Strongly Connected (SCC): Directed path between any two nodes.
  - The elements from every pair of nodes in \(S\) can reach each other.
  - There is no larger set containing \(S\) with the same property.
Disconnected Graph: Consists of two or more connected components.
Bridge Edge: If deleted, the graph becomes disconnected.
Directed Acyclic Graph (DAG): Has no cycles.

Characteristics

Degree Distribution \(P(k)\):
- Probability that a node has degree \(k\).
Path:
- Sequence of nodes traversed (e.g., \(<N_1, N_2, N_3, N_2>\)).
Distance Between Nodes \(h_{ij}\):
- Length of the shortest path connecting them.
Diameter of Network:
- The shortest path between any pair of nodes.
Average Path Length:
- \(\bar{h} = \frac{1}{2E_{max}}\sum_{j \neq i} h_{ij}\)

Examples 1

Examples 2

Vertex/Node-Based

Centrality - Degree

A central actor is involved in many ties:
- Degree centrality/prestige centrality: Referrals
  - In-links
  - PageRank
- Degree centrality (undirected):
  - Number of immediate connections
  - \(C_{D}(i) = \frac{k_i}{|V| - 1}\)
- Prestige centrality (directed):
  - Everyone points to this node
  - \(C_{P}(i) = \frac{k_{i}^{in}}{|V| - 1}\)

Centrality: Betweenness

A central actor is indispensable:
- Lots of connections pass through it.
- Choke-point for information.
- How many pairs of individuals would have to go through you in order to reach one another in the minimum number of hops?
- If a node is on the path between two non-adjacent actors, it can influence the interaction.
Betweenness centrality:
- Calculation: Number of the shortest paths that go through \(V\) vs total number of the shortest paths that exist between all pairs of nodes. \[C_V = \frac{\sum_{s \neq v \neq t} \sigma_{st}(v)}{\sum_{s \neq v \neq t} \sigma_{st}}\]
- \(\sigma_{st}\): The number of shortest paths between \(s\) and \(t\).
- \(\sigma_{st}(v)\): The shortest paths that go through \(v\).

Centrality: Example

Centrality: Exercise

Graph

Calculate:
- Degree centrality of A-D:
  - A: \(1/3\)
  - B: \(3/3\)
  - C: \(2/3\)
  - D: \(2/3\)
- Betweenness centrality of B:
  - Number of Shortest Paths:
    - A-C: 1
    - A-D: 1
    - C-D: 1
  - Number of Shortest Paths through B:
    - A-C: 1
    - A-D: 1
    - C-D: 0
  - Centrality: \(2/3\)

Clustering Coefficient

Portion of connected neighbours of node \(i\):
- Number of connections vs maximum number of connections.
- Local clustering coefficient of node \(i\): \[ C_i^{Directed} = \frac{\sum_{j, k \in N_i} e_{jk}}{k_i (k_i - 1)}, \quad C_i^{Undirected} = \frac{\sum_{j, k \in N_i} e_{jk}}{k_i (k_i - 1)/2} \]
  - \(N_i = \{v_j \in V| e_{ij} \in E \vee e_{ji} \in E\}\): the neighbourhood of \(i\).
  - Denominator: Possible connections between adjacent nodes.
- Average clustering coefficient:
  - Average local density.
  - \(C = \frac{1}{|N|} \sum_i^{|N|} C_i\)

Clustering Coefficient: Exercise

What is the local clustering coefficient of node B?
- \(N = \{A, C, D\}\)
- \(k_i = 3\)
- \(C_B = \frac{1}{(3 \times 2) / 2}\)

Clustering Coefficient

Captures the tendency of networks to cluster together.
More nodes are connected than (uniform randomly) expected.
Signifies that nodes in a neighbourhood tend to stick together:
- E.g., groups of friends in a social network.
- Higher clustering coefficient:
  - Indicates less importance in a global network sense, as such nodes are embedded within tightly-knit clusters and may not act as bridges between clusters.
  - Indicates more importance in a local sense, as the node plays a central role within its immediate neighborhood.

Statistical Properties

Distribution

Distribution of Node Connectivity:
- Heavy tails (right skew): Strong nodes have many connections, weak nodes still have a few.
- Power law: \(P(k) = k^{-\alpha}\).
- Straight line on a log-log plot with slope \(-\alpha\).

Distribution of node connectivity

Power Law

Model of Preferential Attachment:
- Stems from citation analysis: New citations are proportionate to the number a paper already has.
- “Rich get richer” model.

Source: http://www.ladamic.com/

Small Worlds

Experiment

Typical shortest path between any two people?
- Small-world experiment (Milgram, 1967, Link):
  - 300 people from Omaha, Nebraska, and Kansas.
  - Asked to send a letter to a stockbroker in Boston by passing the letter through friends.
  - Recorded the number of steps taken.

Bacon number

Six Degrees of Kevin Bacon:
- Create a network of Hollywood actors.
- Connect actors if they appeared in the same movie.
- Bacon number: Number of steps to Kevin Bacon.
- https://www.sixdegrees.org/

Example of Small World

Smaller Number on Digital Plataform

Facebook:
- 99.6% of users are connected by degree 5 (6 hops).
- 92% are connected by only 4 degrees.

Reference Link

Causes for Small World

High Clustering: Social networks tend to have groups of friends who are more interconnected.
Short Path Lengths (Diameter): Despite clustering, there are typically short paths that connect different groups.
Random Connections: Random links between distant nodes significantly reduce the average path length.

Source: http://www.ladamic.com/

Cause (Continuous)

Cause of Small Worlds (Duncan Watts):
- A network with low clustering can still be a small world.
- Only a few random links are enough to create a small-world effect.

Source: http://www.ladamic.com/

Smaller Worlds on Digital Platforms

Dense Connectivity:
- Platforms like Facebook encourage adding friends from diverse parts of life, creating tightly-knit clusters.
Global Reach:
- Digital platforms bridge geographical barriers, connecting people across the globe.
Algorithmic Recommendations:
- Platforms suggest connections, increasing the chances of linking with mutual acquaintances.
Implications:
- Efficient spread of information or influence.
- Quick navigation across the network to reach anyone.
- Vulnerability to the rapid spread of misinformation or harmful content.

Link/Edge-Based

Link Properties

Triadic Closure:
- If two people in a SN have a friend in common, more likelihood they will become friends themselves.
- This principle explains the evolution of networks over time in many situations.

Strength of Ties:
- Number of interactions:
  - Reciprocal services (e.g., #calls).
- Time spent together.
- Strong Ties:
  - Represent dependable links, corresponding to friends who provide social or emotional support.
- Weak Ties:
  - Represent acquaintances or less close relations that do not involve close friendship.

Friend of a Friend is a Friend

Triadic Closure

Triadic closure is a concept in social network theory, first suggested by German sociologist Georg Simmel in his 1908 book Soziologie (Sociology: Investigations on the Forms of Sociation).

Strong Ties and Triadic Closure:
- Weak Claim: A new edge B-C is more likely when A-B and A-C are ties.
- Strong Claim: If there exists strong ties between A and B/C, there must be an edge B-C.

Triadic Closure: Clustering Coefficient

Consider Clustering Coefficient \(C_A\) of Node A:
- Probability that two random friends of A are friends themselves.
- Before New Edge: \(C_A = \frac{1}{4 \times 3 / 2}\)
- After New Edge: \(C_A = \frac{2}{4 \times 3 / 2}\)
- Higher clustering coefficient after the new edge.

Triadic Closure: Causes

Meet Friends Through Friends:
- Have opportunities to meet through common friends.
- Mutual trust in shared friends.
- Incentive to pair friends to avoid stress in relationships.

Bridge

Bridges:
- If removed, the components of the network become disconnected.

Local Bridges

Local Bridges (Equivalent Definitions):
- Any edge whose endpoints have no friends in common.
- Any edge with zero neighbour overlap is called a local bridge.
- Any edge whose deletion results in increasing the distance between the endpoints to a value strictly more than two.

Example:
- A-B is the only local bridge in the above graph.

Strong Triadic Closure

Strong Triadic Closure Property

A node \(A\) violates the Strong Triadic Closure Property if it has strong ties to two non-linked nodes \(B\) and \(C\).

Observations:
- No node in the left figure violates the Strong Triadic Closure Property.
- If we change A-F and A-B to strong ties, then it violates the Strong Triadic Closure Property due to the absence of links B-F, B-E, and A-G.

Local Bridges and Weak Ties

Proposition

If a node \(A\) in a network satisfies the Strong Triadic Closure Property and is involved in at least two strong ties, then any local bridge it is involved in must be a weak tie.

Illustration:

A connection between:
- Local bridge (a structural view).
- Strong and weak ties (an individual view).

Ties

Triadic Closure:
- Useful for analyzing people that can be addressed.
- E.g., in campaigns for sport shoes.
Visualization of Concepts

Weak Ties:
- Tend to carry different information compared to close contacts.
  - In experiments, acquaintanceship ties were more effective at passing information.
  - E.g., in recommending jobs –> untapped/undiscovered community.

Weak Ties: Example

Mobile Call Graph with Edge Strength

Graph Structure

Community Discovery

Community Discovery:
- Identifies groups or clusters within a network that are more densely connected internally.
- Networks are strongly modular.
- Cliques exist, e.g., groups of friends, user bases, etc.

Basic Idea

Partition a graph to minimize connections between different groups while maximizing connections within the same group.

Normalised Cut (Ncut)

Ncut:
- A criterion for measuring the quality of a network division into clusters or communities.
Normalised Cut of a group of nodes \(S, \bar{S} \subset V\): \[ Ncut(S) = \frac{\sum_{i \in S, j \in \bar{S}} A(i, j)}{\sum_{i \in S} k_i} + \frac{\sum_{i \in S, j \in \bar{S}} A(i, j)}{\sum_{j \in \bar{S}} k_j} \]
Interpretation:
- Low Ncut signifies good communities as they are well-connected internally but sparsely connected externally.

Normalised Cut (Ncut): Continues

Quality Functions:
- Kernighan-Lin Objective:
  - Minimise the number of inter-cluster edges (or weights) of the same size.
  - \(KLObj(V_1, \cdots, V_k) = \sum_{i \neq j} A(V_i, V_j)\)
- Modularity:
  - Measures whether subgraphs are far from random subgraphs.
  - Independent of the number of clusters.
  - \(m\): Number of edges in the graph. \[ Q = \sum_{c=1}^k \left[\frac{A(V_c, V_j)}{m} - \left(\frac{k_{V_c}}{2m}\right)^2\right] \]

Normalised Cut (Ncut): Directed Network

Directed Networks:
- Normalised Cut: \[ Ncut_{dir}(S) = \frac{\sum_{i \in S, j \in \bar{S}} \pi(i)P(i, j)}{\sum_{i \in S} \pi(i)} + \frac{\sum_{i \in S, j \in \bar{S}} \pi(i)P(i, j)}{\sum_{j \in \bar{S}} \pi(j)} \]
  - \(P\): Transition matrix of a random walk.
  - \(\pi\): Stationary distribution vector (see PageRank).
  - Typically minimised using spectral clustering.
- Modularity: \[ Q = \frac{1}{m} \sum_{i, j \in E} \left[A_{ij} - \frac{k_i^{in} k_j^{out}}{m}\right] \delta_{c_i, c_j} \]
  - \(\delta_{c_i, c_j}\): 1 if community of \(i\) equals community of \(j\), 0 otherwise.
  - Measures community quality in directed networks.

Graph Partition Algorithm

Kernighan-Lin (KL) Algorithm (Partition a graph into two blocks):
- Classic graph partitioning.
- For each iteration, search for nodes that can be swapped to reduce the number of inter-cluster edges (edge cut).
- Gain: Reduction in edge cut when moving nodes.
- Nodes with the largest gain are repeatedly selected from larger partition.

Example KL Algorithm:
- Swapping A-D: 1 fewer cuts (4:3).
- Swapping B-D: 1 fewer cuts (4:3).
- Swapping C-D: 2 fewer cuts (4:2).

Agglomerative / Divisive Algorithms

Agglomerative: Start from single-node communities (bottom-up approach).
Divisive: Start from all-nodes community (top-down approach).

Girvan and Newman (Divisive) Algorithm

Edge Betweenness

The number of the shortest paths that go through a specific edge vs the total number of the shortest paths.

For every edge in a graph, calculate the edge betweenness centrality.
Remove the edge with the highest betweenness centrality.
Recalculate the betweenness centrality for every remaining edge.
Repeat steps 2-4 until there are no more edges left.

Girvan and Newman (Divisive) Algorithm

Spectral and Other Approaches

Spectral Algorithms:
- Use top \(k\) eigenvectors of the adjacency matrix to represent nodes in a \(k\)-dimensional space.
- Apply regular clustering techniques.
- Computational cost of Singular Vector Decomposition (SVD) is high.
Other Approaches:
- Local graph clustering.
- Multi-Level Regularised Markov Clustering.

Further Questions

Dynamic Networks:
- How to incorporate temporal information?
  - Use events to describe the change in communities:
    - Appear, disappear, join, when?
- Treating graphs as snapshots is not effective:
  - Use temporal slices of the network.
- Find meta-groups:
  - Sequence of groups which are similar.
- Using snapshot quality and history cost:
  - How well is the clustering, and how different are the clusters?
Challenges:
- Incorporating content:
  - Not only topological information to be analysed.
  - Perform semantically coherent and meaningful analysis of the network.
  - E.g., network + email content.

Case Study

Twitter (X)

Microblogging Platform [Source: Cybernews]:
- Character Limit: Standard users: 280 characters.
- Users: ~650 million monthly active users, ~240–300 million daily active users.
- No reciprocation necessary for following.
- Hashtags create a natural way of tracking topics.
- Retweet (RT) functionality enables content amplification.
Popularity in Research:
- API is more open and accessible compared to other platforms.
- Easy to find and follow conversations.
- Strong hashtag culture facilitates gathering, sorting, and expanding searches for data collection.
- Many researchers favour it due to positive personal experiences.
Significance:
- A global reach and real-time information-sharing capabilities make it a critical platform for communication and research.

Observations:
- Power law \(\alpha\) is roughly 5 (high).
- Bump at 20: Reflects initial suggestions.
- Bump at 2000: Maximum follower limit before 2009.

Twitter: Followers and Tweets

Many users never tweet (low median).
The average number of tweets per follower exceeds the median; users tweet far more than expected.
Observations include:
- Dips at 20/2000.
- Spam accounts commonly seen follower #250/500/2,000/5,000.

Twitter: Degrees of Separation

Degrees of Separation:
- Average path length is 4.12.
Ranking Users:
- PageRank: A measure of importance based on connections.
- Retweets: Reflects influence through content amplification.

Graph:

Top Reviewer Analysis

1,000 Top Reviewers:
Data collected on products reviewed from Amazon.com.

Summary

Take Home Messages

Social Media & Social Network
Analytics Tools on Graph:
- Centrality ***
- Clustering Coefficient ***
- Statistical Properties:
  - Link Property ***
  - Community Discovery *
Case Study:
- Twitter
- Top Reviewer Analysis

Web and Social Network Analytics

Table of Contents

Social Media

Introduction

Analytics Tools

Preliminaries:

Graph Basics

Preliminaries: Components

Characteristics

Examples 1

Examples 2

Vertex/Node-Based

Centrality - Degree

Centrality: Betweenness

Centrality: Example

Centrality: Exercise

Clustering Coefficient

Clustering Coefficient: Exercise

Clustering Coefficient

Statistical Properties

Distribution

Distribution of node connectivity

Power Law

Small Worlds

Experiment

Bacon number

Example of Small World

Smaller Number on Digital Plataform

Causes for Small World

Cause (Continuous)

Smaller Worlds on Digital Platforms

Link/Edge-Based

Link Properties

Friend of a Friend is a Friend

Triadic Closure: Clustering Coefficient

Triadic Closure: Causes

Bridge

Local Bridges

Strong Triadic Closure

Local Bridges and Weak Ties

Ties

Weak Ties: Example

Graph Structure

Community Discovery

Normalised Cut (Ncut)

Normalised Cut (Ncut): Continues

Normalised Cut (Ncut): Directed Network

Graph Partition Algorithm

Agglomerative / Divisive Algorithms

Girvan and Newman (Divisive) Algorithm

Girvan and Newman (Divisive) Algorithm

Girvan and Newman (Divisive) Algorithm

Spectral and Other Approaches

Further Questions

Case Study

Twitter (X)

Twitter: Observations from Social Network

Twitter: Followers and Tweets

Twitter: Degrees of Separation

Top Reviewer Analysis

Summary

Take Home Messages