Week 2: Search Engines & Web Crawler
Jan-2025
Search Engines: One of the most widely-used sources of information
Web Crawler
A web crawler (also called a spider, bot, or web robot) aims to quickly and efficiently gather as many useful web pages as possible.
Crawlers are expected to honor the access rules that site owners publish in robots.txt files and in page-level meta tags.

| Aspect | Web Crawling | Web Scraping |
|---|---|---|
| Purpose | Indexing web content | Extracting specific data |
| Process | Navigating and mapping | Parsing HTML for data |
| Usage | Search engines | Data analysis, research |
| Data | Metadata, links | Targeted data, text |
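A minimal sketch of how a crawler can check robots.txt rules before fetching a page, using Python's standard `urllib.robotparser`. The rules and URLs here are hypothetical examples, not from any real site:

```python
import urllib.robotparser

def allowed(robots_lines, user_agent, url):
    """Return True if robots.txt rules permit `user_agent` to fetch `url`."""
    rp = urllib.robotparser.RobotFileParser()
    rp.parse(robots_lines)  # parse rules supplied as a list of lines
    return rp.can_fetch(user_agent, url)

# Hypothetical robots.txt content: everything under /private/ is off-limits.
rules = [
    "User-agent: *",
    "Disallow: /private/",
]

print(allowed(rules, "MyBot", "https://example.com/index.html"))   # True
print(allowed(rules, "MyBot", "https://example.com/private/x"))    # False
```

In a real crawler the rules would be downloaded once per host (e.g. with `RobotFileParser.set_url` and `read`) and cached, rather than parsed from a hard-coded list.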
A Markov chain (or process) is a stochastic model describing a sequence of possible events in which the probability of each event depends only on the state attained in the previous event.
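The "depends only on the previous state" property can be made concrete with a tiny example. The two-state weather chain below is a hypothetical illustration: each row of `P` gives the probabilities of the next state given the current one, and evolving a distribution forward is just repeated matrix-vector multiplication:

```python
# Hypothetical two-state Markov chain: state 0 = Sunny, state 1 = Rainy.
# P[i][j] = probability of moving from state i to state j; rows sum to 1.
P = [
    [0.9, 0.1],  # from Sunny: stay Sunny 0.9, turn Rainy 0.1
    [0.5, 0.5],  # from Rainy: turn Sunny 0.5, stay Rainy 0.5
]

def step(dist, P):
    """One transition: new_dist[j] = sum_i dist[i] * P[i][j]."""
    n = len(P)
    return [sum(dist[i] * P[i][j] for i in range(n)) for j in range(n)]

dist = [1.0, 0.0]  # start with certainty in the Sunny state
for _ in range(3):
    dist = step(dist, P)
print(dist)  # distribution over states after 3 days
```

Note that `step` uses only the current distribution and `P`; the history of how the chain got there never enters the computation, which is exactly the Markov property.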
Key Concepts
A random surfer’s behavior on the web graph can be modeled with an adjacency matrix: \[ A = (a_{ij}) = \text{number of edges from vertex $i$ to $j$}. \] Dividing each row of \(A\) by its row sum turns edge counts into probabilities, giving the transition matrix of the surfer’s Markov chain.
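A minimal sketch of the random-surfer model on a toy three-page graph (the graph itself is a made-up example, and every page is assumed to have at least one out-link, so no dangling-node handling is included). Row-normalizing the adjacency matrix gives the transition matrix, and power iteration finds the stationary distribution, i.e. the long-run fraction of time the surfer spends on each page:

```python
# Hypothetical web graph: A[i][j] = number of links from page i to page j.
A = [
    [0, 1, 1],  # page 0 links to pages 1 and 2
    [0, 0, 1],  # page 1 links to page 2
    [1, 0, 0],  # page 2 links back to page 0
]

def transition_matrix(A):
    """Row-normalize edge counts into probabilities (assumes no empty rows)."""
    return [[a / sum(row) for a in row] for row in A]

def stationary(P, iters=1000):
    """Power iteration: repeatedly push a uniform distribution through P."""
    n = len(P)
    dist = [1.0 / n] * n
    for _ in range(iters):
        dist = [sum(dist[i] * P[i][j] for i in range(n)) for j in range(n)]
    return dist

pi = stationary(transition_matrix(A))
print(pi)  # long-run visit frequencies for pages 0, 1, 2
```

PageRank refines this model with a damping factor (random teleportation) so the chain converges on any graph; this sketch omits that and relies on the toy graph being strongly connected and aperiodic.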