How can I make Scrapy download only the header data of a website (e.g. for checking purposes)?

I've tried disabling some downloader middlewares, but that doesn't seem to work.

  • Are you asking about making "HEAD" requests?
    – alecxe
    Mar 11, 2015 at 14:38

1 Answer

As @alecxe said, you can issue HEAD requests instead of the default GET:

Request(url, method="HEAD")

UPDATE: If you want to use HEAD requests for your start_urls you will need to override the make_requests_from_url method:

# Spider method: send each start_urls entry as a HEAD request
def make_requests_from_url(self, url):
    return Request(url, method="HEAD", dont_filter=True)

UPDATE: make_requests_from_url was removed in Scrapy 2.6.

  • Thank you! One further question: how do I set this in the Spider class itself (e.g. class TestSpider(scrapy.Spider): start_urls = ... )? Just writing method = "HEAD" there doesn't work.
    Mar 16, 2015 at 8:49
  • If you want to use HEAD requests for your start_urls, you will need to override the make_requests_from_url method (answer updated).
    Mar 18, 2015 at 16:48
  • The link to make_requests.. seems to be down :(
    – Alex
    Aug 17, 2022 at 14:02
