How to follow specific links and scrape content using scrapy?

Say I have one main page, index.html, and four child pages. All pages are linked on the main page in the same way.

How can I follow these specific links with Python's Scrapy and scrape content that follows a repetitive pattern?

Here is the setup:


<div class="one"><p>Text</p><a href="">Link 1</a></div>
<div class="one"><p>Text</p><a href="">Link 4</a></div>

<div class="one"><p>Text to be scraped</p></div>

How would I set up the spider in scrapy to just follow the links extracted from index.html?

I feel like the example from the tutorial does not help me much here:

from scrapy.spider import Spider

class IndexSpider(Spider):
    name = "index"
    allowed_domains = ["???"]
    start_urls = [
        # ...
    ]
Note: This is a simplified example. In the original, all URLs are live web URLs and index.html contains a lot more links than just 1….

The question is how to follow the exact links, which can be provided as a list, but will eventually come from an XPath selector – select the last column of a table, but only every other row.
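To make "last column, every other row" concrete, here is a minimal sketch using only the standard library (the table markup is an assumption invented for the example). Inside Scrapy, the same selection could be expressed with an XPath such as `//table//tr[position() mod 2 = 1]/td[last()]/a/@href`:

```python
# Sketch: pick the link from the last column of every other table row.
# The table below is made-up example data, not from the real site.
import xml.etree.ElementTree as ET

TABLE = """
<table>
  <tr><td>row 1</td><td><a href="1.html">Link 1</a></td></tr>
  <tr><td>row 2</td><td><a href="skip.html">skip</a></td></tr>
  <tr><td>row 3</td><td><a href="3.html">Link 3</a></td></tr>
  <tr><td>row 4</td><td><a href="skip.html">skip</a></td></tr>
</table>
"""

def extract_links(table_html):
    root = ET.fromstring(table_html)
    rows = root.findall(".//tr")
    # every other row, last <td>, the href of its <a>
    return [row.findall("td")[-1].find("a").get("href")
            for row in rows[::2]]

print(extract_links(TABLE))  # ['1.html', '3.html']
```

The resulting list of hrefs is exactly what you would feed to the spider, whether by building `Request` objects yourself or by matching them in a link extractor rule.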


Use CrawlSpider and specify a rule for the SgmlLinkExtractor:

from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

class MySpider(CrawlSpider):
    name = "mydomain"
    allowed_domains = ["www.mydomain"]
    start_urls = ["http://www.mydomain/index.html",]

    rules = (Rule(SgmlLinkExtractor(allow=(r'\d+\.html$',)), callback="parse_items", follow=True),)

    def parse_items(self, response):
        hxs = HtmlXPathSelector(response)
        # get the data
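As for the `# get the data` part: given the markup in the question, the callback would select the paragraph inside `<div class="one">`, e.g. `hxs.select('//div[@class="one"]/p/text()').extract()`. A standalone sketch of that extraction with the standard library (the page snippet is taken from the question's setup):

```python
# Sketch of what parse_items would extract from a child page.
import xml.etree.ElementTree as ET

PAGE = '<div class="one"><p>Text to be scraped</p></div>'

def extract_text(html):
    root = ET.fromstring(html)
    # the <p> inside the div carries the content we want
    return root.find("p").text

print(extract_text(PAGE))  # Text to be scraped
```

In the spider itself you would yield the extracted text as an item field rather than printing it.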

