Example of a crawler spider with Scrapy

Python -- Posted on July 7, 2023

This is a Scrapy spider script that crawls a website and extracts information from the web pages it visits. Here's an overview of what the script does:

1. It imports the necessary modules and classes from Scrapy and other Python libraries.

2. It defines a custom `ExternalLinkExtractor` class that inherits from `LinkExtractor`. This class is currently commented out and not used by the spider; a sketch of how a working version could look follows this list.

3. The `SearchSpider` class is defined, which inherits from `CrawlSpider`. This spider will crawl the website and follow links based on the specified rules.

4. The `name` attribute is set to "search", which will be used to identify the spider when running it.

5. The `allowed_domains` attribute restricts the spider to crawl only within the "example.com" domain.

6. The `start_urls` attribute lists the URLs where the spider will begin crawling. In this case, it starts with "https://example.com/index.html".

7. The `rules` attribute defines a tuple with one rule. The rule uses a `LinkExtractor` with an empty `allow` parameter, so every extracted link is followed (subject to the `allowed_domains` restriction). Each page reached this way is passed to the `parse_item` callback to extract information.

8. The `parse_item` method is defined, which handles the extraction of information from each page the spider visits.
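
Before moving on to `parse_item`, here is a rough sketch of how the commented-out extractor from step 2 could be made to work. The apparent intent is to collect links that point off-site; the `internal_domains` argument is an illustrative addition for this sketch, not part of Scrapy's `LinkExtractor` API:

```python
from urllib.parse import urlparse

from scrapy.linkextractors import LinkExtractor


class ExternalLinkExtractor(LinkExtractor):
    """Keep only links whose domain is outside a given set of internal domains."""

    def __init__(self, internal_domains=(), **kwargs):
        # `internal_domains` is a hypothetical argument added for this sketch
        super().__init__(**kwargs)
        self.internal_domains = set(internal_domains)

    def extract_links(self, response):
        # run the stock extraction, then drop links that stay on the internal domains
        for link in super().extract_links(response):
            netloc = urlparse(link.url).netloc
            if netloc and netloc not in self.internal_domains:
                yield link
```

Note that plugging such an extractor into a `Rule` would not follow the external links by itself: requests to domains outside `allowed_domains` are still dropped by Scrapy's offsite filtering, so `allowed_domains` would need to be relaxed or removed for that.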

Here's what the `parse_item` method does:

- It retrieves the content type of the response using the "Content-Type" header.

- It takes the decoded response body, collapses runs of whitespace into single spaces, and strips leading and trailing whitespace.

- It obtains the URL of the page being parsed.

- It retrieves the referer (the page that led to the current page) from the request headers.

- It extracts all link URLs (`<a href>`) from the page using an XPath expression and resolves each one against the page URL to get absolute URLs.

- It extracts information about each HTML element on the page, including its name, value, and attributes.

- Finally, it yields a dictionary containing the extracted information, which Scrapy will save as output data; a sample of such an item is sketched below.
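
To make the output shape concrete, a single yielded item might look roughly like this (all values are made up for illustration):

```python
{
    'url': 'https://example.com/about.html',
    'domain': 'example.com',
    'response': '<!DOCTYPE html> <html> <head> <title>About</title> ...',
    'urls': ['https://example.com/index.html', 'https://example.com/contact.html'],
    'content_type': 'text/html; charset=UTF-8',
    'elements': [
        {'name': 'html', 'value': '<html>...</html>', 'attributes': []},
        {'name': 'a', 'value': '<a href="/contact.html">Contact</a>',
         'attributes': [{'href': '/contact.html'}]},
    ],
    'referer': 'https://example.com/index.html',
}
```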

Currently, the spider is set to crawl the "example.com" domain, starting from the "https://example.com/index.html" URL and following all links. The information extracted from each page includes the URL, domain, response body, URLs on the page, content type, referer, and information about HTML elements.

To run this spider, you need to have Scrapy installed; from inside a Scrapy project you can start the crawl with `scrapy crawl search` (add `-o items.json` to save the yielded items to a file). You can customize the spider further by modifying the `allowed_domains`, `start_urls`, and `rules` attributes to target different websites or adjust the crawling behavior. A minimal standalone runner for using the spider without a full project is sketched below.
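
This sketch runs the spider programmatically with Scrapy's `CrawlerProcess`. It assumes the spider class shown below is saved as `search_spider.py` next to the runner; the `FEEDS` output file, user agent string, and download delay are just example choices:

```python
from scrapy.crawler import CrawlerProcess

from search_spider import SearchSpider  # assumes the spider below is saved as search_spider.py

process = CrawlerProcess(settings={
    # write every yielded item to a JSON file
    "FEEDS": {"items.json": {"format": "json"}},
    # identify the crawler and throttle requests a little
    "USER_AGENT": "example-crawler (+https://example.com)",
    "DOWNLOAD_DELAY": 1.0,
})

process.crawl(SearchSpider)
process.start()  # blocks until the crawl is finished
```

The full spider script is shown below.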

```python
import scrapy
import re
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from urllib.parse import urljoin, urlparse


# class ExternalLinkExtractor(LinkExtractor):
#     def extract_links(self, response):
#         links = super().extract_links(response)
#         allowed_domains = self.allow_domains
#         for link in links:
#             parsed_url = urlparse(link.url)
#             if parsed_url.netloc and parsed_url.netloc not in allowed_domains:
#                 yield link



class SearchSpider(CrawlSpider):
    name = "search"
    allowed_domains = ["example.com"]
    start_urls = ["https://example.com/index.html"]

    rules = (
        Rule(
            LinkExtractor(allow=()),
            callback="parse_item",
            follow=True
        ),
    )


    def parse_item(self, response):
        # get the content type of the response (empty string if the header is missing)
        content_type = (response.headers.get('Content-Type') or b'').decode('utf-8')
        # get the decoded body of the response and collapse runs of whitespace
        body = re.sub(r'\s+', ' ', response.text).strip()
        
        # the url of the page
        page_url = response.url
        
        # the referer of the page (the page that linked to this one)
        referer = (response.request.headers.get('Referer') or b'').decode('utf-8')
        # the domain of the response url
        domain = urlparse(response.url).netloc
        
        # get all link urls on the page and resolve them against the page url
        urls = response.xpath('//a/@href').getall()
        cleaned_urls = [urljoin(response.url, url) for url in urls]
        elements = []
        
        # get all elements of the page 
        for element in response.xpath('//*'):
            element_attributes = []

            # get all attributes of an element
            for attr in element.attrib:
                attribute = {}
                attr_value = element.attrib[attr]
                if attr_value.strip() != '':
                    attribute[attr] = attr_value
                    element_attributes.append(attribute)
            element_text = element.get()
            element_name = element.xpath('name()').get()
            elements.append({
                'name': element_name,
                'value': element_text,
                'attributes': element_attributes
            })
        yield {
            'url': page_url,
            'domain': domain,
            'response': body,
            'urls': cleaned_urls,
            'content_type': content_type,
            'elements': elements,
            'referer': referer,
        }
```