Web Crawling
- The first step in making the web's information searchable is web crawling.
- Web crawlers (also known as spiders or bots) are automated programs that:
- Start from a set of seed URLs.
- Visit these pages.
- Follow the hyperlinks to discover new pages.
This process continues recursively, building a comprehensive map of the web.
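To make the crawl loop concrete, here is a minimal breadth-first crawler sketch in Python. The `requests` and `beautifulsoup4` packages, the seed URLs, and the page limit are illustrative assumptions, not part of the original notes; real crawlers also respect robots.txt, rate limits, and politeness policies.

```python
# Minimal breadth-first crawler sketch (assumes `requests` and `beautifulsoup4`).
from collections import deque
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

def crawl(seed_urls, max_pages=50):
    """Start from seed URLs, visit pages, and follow hyperlinks to new pages."""
    frontier = deque(seed_urls)   # URLs waiting to be visited
    visited = set()               # URLs already fetched
    pages = {}                    # url -> raw HTML, kept for later indexing

    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        if url in visited:
            continue
        try:
            response = requests.get(url, timeout=5)
        except requests.RequestException:
            continue  # skip unreachable pages
        visited.add(url)
        pages[url] = response.text

        # Extract hyperlinks and add unseen ones to the frontier (recursion via the queue).
        soup = BeautifulSoup(response.text, "html.parser")
        for anchor in soup.find_all("a", href=True):
            link = urljoin(url, anchor["href"])
            if urlparse(link).scheme in ("http", "https") and link not in visited:
                frontier.append(link)

    return pages
```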
Indexing
- As crawlers visit pages, they collect and store information about the content and structure of each page.
- This data is then indexed, creating a searchable database that allows search engines to quickly retrieve relevant results for user queries.
- When a crawler visits a webpage, it’s like a librarian receiving a new book.
- The librarian reads the book’s contents (title, author, keywords, chapters).
- This information is then entered into a catalogue.
- When a reader asks for “books on climate change,” the librarian doesn’t flip through every book on the shelf; they quickly check the catalogue (the index) and retrieve the relevant titles.
- In the same way, search engines don’t scan the entire web each time a query is made.
- Instead, they rely on the index, a structured database of terms and their locations, to find results in milliseconds.
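The catalogue in the analogy corresponds to an inverted index: a mapping from each term to the pages that contain it. The sketch below is a deliberate simplification with hypothetical page contents; real engines add tokenization, stemming, stop-word removal, and ranking on top of this structure.

```python
# Minimal inverted-index sketch: term -> set of pages containing that term.
from collections import defaultdict

def build_index(pages):
    """pages: dict mapping url -> page text. Returns term -> set of urls."""
    index = defaultdict(set)
    for url, text in pages.items():
        for term in text.lower().split():  # naive tokenizer for illustration
            index[term].add(url)
    return index

def search(index, query):
    """Return pages containing every query term (simple AND semantics)."""
    terms = query.lower().split()
    if not terms:
        return set()
    results = index.get(terms[0], set()).copy()
    for term in terms[1:]:
        results &= index.get(term, set())
    return results

# The "librarian" consults the catalogue instead of rereading every book.
pages = {
    "a.html": "climate change and global warming",
    "b.html": "cooking recipes for winter",
    "c.html": "policy responses to climate change",
}
index = build_index(pages)
print(search(index, "climate change"))  # {'a.html', 'c.html'}
```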
PageRank Algorithm
- One of the most influential algorithms developed for search engines is PageRank, created by Google's founders, Larry Page and Sergey Brin.
- PageRank uses the web graph to evaluate the importance of web pages based on their link structure.
How PageRank Works
- Basic Idea:
- A page is considered important if it is linked to by other important pages.
- Link Analysis:
- Each link from one page to another is treated as a vote of confidence.
- However, not all votes are equal: votes from highly ranked pages carry more weight.
- Iterative Calculation:
- PageRank is calculated iteratively.
- Starting with an initial rank for each page, the algorithm updates the ranks based on the incoming links and their respective ranks.
- Damping Factor:
- To simulate the behavior of a random web surfer, PageRank incorporates a damping factor (typically set to 0.85).
- This factor represents the probability that a user will continue following links, while the remaining probability accounts for the user randomly jumping to another page (a minimal sketch of the calculation follows this list).
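The sketch below shows the iterative update with a damping factor d = 0.85: each page's new rank is (1 − d)/N plus d times the sum of contributions from pages linking to it, where each linking page splits its rank evenly among its outlinks. The three-page link graph and fixed iteration count are hypothetical; production systems use sparse-matrix formulations and handle dangling pages more carefully.

```python
# Minimal iterative PageRank sketch with a damping factor of 0.85.
def pagerank(graph, damping=0.85, iterations=50):
    """graph: dict mapping page -> list of pages it links to."""
    pages = list(graph)
    n = len(pages)
    ranks = {page: 1.0 / n for page in pages}  # start with a uniform rank

    for _ in range(iterations):
        new_ranks = {}
        for page in pages:
            # Sum the rank contributed by every page that links to this one;
            # each linking page splits its rank evenly among its outlinks.
            incoming = sum(
                ranks[other] / len(graph[other])
                for other in pages
                if page in graph[other]
            )
            # With probability `damping` the surfer follows a link;
            # otherwise they jump to a random page.
            new_ranks[page] = (1 - damping) / n + damping * incoming
        ranks = new_ranks
    return ranks

# Hypothetical web graph: A links to B and C, B links to C, C links to A.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
print(pagerank(graph))  # C, which receives links from both A and B, ranks highest
```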
- PageRank revolutionized web search by prioritizing pages based on their authority in the link graph, rather than just keyword matching.
- This approach ensured that users received higher-quality results, making search engines more effective and reliable.
Limitations and Evolution
- While PageRank was a groundbreaking innovation, it is not without limitations:
- Manipulation: Techniques like link farms were developed to artificially boost PageRank.
- Content Relevance: PageRank focuses on link structure and does not account for the actual content of the page.
- Scalability: As the web grows, calculating PageRank for billions of pages becomes computationally intensive.
- To address these challenges, modern search engines combine PageRank with other algorithms that analyze content relevance, user behavior, and machine learning models.