Say I have one main page, index.html, and four child pages, … . All pages are linked from the main page in the same way.
How can I follow these specific links with Python's Scrapy and scrape content that follows a repetitive pattern? Here is the setup:
index.html:

    <body>
      <div class="one"><p>Text</p><a href="">Link 1</a></div>
      …
      <div class="one"><p>Text</p><a href="">Link 4</a></div>
    </body>

Each child page:

    <body>
      <div class="one"><p>Text to be scraped</p></div>
    </body>
How would I set up the spider in Scrapy to follow just the links extracted from index.html? I feel like the example from the tutorial does not help me much here:
    from scrapy.spider import Spider

    class IndexSpider(Spider):
        name = "index"
        allowed_domains = ["???"]
        start_urls = ["index.html"]
Note: This is a simplified example. In the original, all URLs point to the web, and index.html contains many more links than just the four child pages.
The question is how to follow the exact links, which can be provided as a list here, but will eventually come from an XPath selector (select the last column from a table, but only every other row).