Web Crawling
- The first step in making the web's information searchable is web crawling.
- Web crawlers (also known as spiders or bots) are automated programs that:
- Start from a set of seed URLs.
- Visit these pages.
- Follow the hyperlinks to discover new pages.
This process continues recursively, building a comprehensive map of the web.
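To make the crawl loop concrete, here is a minimal breadth-first crawler sketch in Python. The `requests` and `beautifulsoup4` packages, the seed URLs, and the page limit are illustrative assumptions, not part of the original notes; real crawlers also respect robots.txt, rate limits, and politeness policies.

```python
# Minimal breadth-first crawler sketch (assumes `requests` and `beautifulsoup4`).
from collections import deque
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

def crawl(seed_urls, max_pages=50):
    """Start from seed URLs, visit pages, and follow hyperlinks to new pages."""
    frontier = deque(seed_urls)   # URLs waiting to be visited
    visited = set()               # URLs already fetched
    pages = {}                    # url -> raw HTML, kept for later indexing

    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        if url in visited:
            continue
        try:
            response = requests.get(url, timeout=5)
        except requests.RequestException:
            continue  # skip unreachable pages
        visited.add(url)
        pages[url] = response.text

        # Extract hyperlinks and add unseen ones to the frontier (recursion via the queue).
        soup = BeautifulSoup(response.text, "html.parser")
        for anchor in soup.find_all("a", href=True):
            link = urljoin(url, anchor["href"])
            if urlparse(link).scheme in ("http", "https") and link not in visited:
                frontier.append(link)

    return pages
```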
Indexing
- As crawlers visit pages, they collect and store information about the content and structure of each page.
- This data is then indexed, creating a searchable database that allows search engines to quickly retrieve relevant results for user queries.
- When a crawler visits a webpage, it’s like a librarian receiving a new book.
- The librarian reads the book’s contents (title, author, keywords, chapters).
- This information is then entered into a catalogue.
- When a reader asks for “books on climate change,” the librarian doesn’t flip through every book on the shelf; they quickly check the catalogue (the index) and retrieve the relevant titles.
- In the same way, search engines don’t scan the entire web each time a query is made.
- Instead, they rely on the index, a structured database of terms and their locations, to find results in milliseconds.
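The catalogue in the analogy corresponds to an inverted index: a mapping from each term to the pages that contain it. The sketch below is a deliberate simplification with hypothetical page contents; real engines add tokenization, stemming, stop-word removal, and ranking on top of this structure.

```python
# Minimal inverted-index sketch: term -> set of pages containing that term.
from collections import defaultdict

def build_index(pages):
    """pages: dict mapping url -> page text. Returns term -> set of urls."""
    index = defaultdict(set)
    for url, text in pages.items():
        for term in text.lower().split():  # naive tokenizer for illustration
            index[term].add(url)
    return index

def search(index, query):
    """Return pages containing every query term (simple AND semantics)."""
    terms = query.lower().split()
    if not terms:
        return set()
    results = index.get(terms[0], set()).copy()
    for term in terms[1:]:
        results &= index.get(term, set())
    return results

# The "librarian" consults the catalogue instead of rereading every book.
pages = {
    "a.html": "climate change and global warming",
    "b.html": "cooking recipes for winter",
    "c.html": "policy responses to climate change",
}
index = build_index(pages)
print(search(index, "climate change"))  # {'a.html', 'c.html'}
```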
PageRank Algorithm
- One of the most influential algorithms developed for search engines is PageRank, created by Google's founders, Larry Page and Sergey Brin.
- PageRank uses the web graph to evaluate the importance of web pages based on their link structure.
How PageRank Works
- Basic Idea:
- A page is considered important if it is linked to by other important pages.
- Link Analysis:
- Each link from one page to another is treated as a vote of confidence.
- However, not all votes are equal: votes from highly ranked pages carry more weight.
- Iterative Calculation:
- PageRank is calculated iteratively.
- Starting with an initial rank for each page, the algorithm updates the ranks based on the incoming links and their respective ranks.
- Damping Factor:
- To simulate the behavior of a random web surfer, PageRank incorporates a damping factor (typically set to 0.85).
- This factor represents the probability that a user will continue following links, while the remaining probability accounts for the user randomly jumping to another page (a minimal sketch of the calculation follows this list).
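The sketch below shows the iterative update with a damping factor d = 0.85: each page's new rank is (1 − d)/N plus d times the sum of contributions from pages linking to it, where each linking page splits its rank evenly among its outlinks. The three-page link graph and fixed iteration count are hypothetical; production systems use sparse-matrix formulations and handle dangling pages more carefully.

```python
# Minimal iterative PageRank sketch with a damping factor of 0.85.
def pagerank(graph, damping=0.85, iterations=50):
    """graph: dict mapping page -> list of pages it links to."""
    pages = list(graph)
    n = len(pages)
    ranks = {page: 1.0 / n for page in pages}  # start with a uniform rank

    for _ in range(iterations):
        new_ranks = {}
        for page in pages:
            # Sum the rank contributed by every page that links to this one;
            # each linking page splits its rank evenly among its outlinks.
            incoming = sum(
                ranks[other] / len(graph[other])
                for other in pages
                if page in graph[other]
            )
            # With probability `damping` the surfer follows a link;
            # otherwise they jump to a random page.
            new_ranks[page] = (1 - damping) / n + damping * incoming
        ranks = new_ranks
    return ranks

# Hypothetical web graph: A links to B and C, B links to C, C links to A.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
print(pagerank(graph))  # C, which receives links from both A and B, ranks highest
```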
- PageRank revolutionized web search by prioritizing pages based on their authority in the link graph, rather than just keyword matching.
- This approach ensured that users received higher-quality results, making search engines more effective and reliable.
Limitations and Evolution
- While PageRank was a groundbreaking innovation, it is not without limitations:
- Manipulation: Techniques like link farms were developed to artificially boost PageRank.
- Content Relevance: PageRank focuses on link structure and does not account for the actual content of the page.
- Scalability: As the web grows, calculating PageRank for billions of pages becomes computationally intensive.
- To address these challenges, modern search engines combine PageRank with other algorithms that analyze content relevance, user behavior, and machine learning models.