Scrapy
If you have an existing Scrapy project, you don't need to rewrite it to enjoy Scrapling's parsing API. The Scrapy integration converts Scrapy responses to Scrapling Response objects right inside your spider callbacks, so Scrapy keeps handling the crawling while Scrapling handles the parsing.
Installation
This integration works with the default Scrapling installation (pip install scrapling); no extras are needed. The only other requirement is Scrapy itself, which a Scrapy project already has installed.
Usage
Put the scrapling_response decorator on any spider callback, and the response argument it receives becomes a Scrapling Response:
import scrapy
from scrapling.integrations.scrapy import scrapling_response


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com"]

    @scrapling_response
    def parse(self, response):  # `response` is now a Scrapling Response
        first_quote = response.find_by_text("The world as we have created it", partial=True)
        for quote in [first_quote, *first_quote.find_similar()]:
            card = quote.parent
            yield {
                "text": quote.get_all_text(strip=True),
                "author": card.find("small", class_="author").text,
                "tags": [tag.text for tag in card.find_all("a", class_="tag")],
            }

        next_page = response.css("li.next a::attr(href)").get()
        if next_page:
            yield scrapy.Request(response.urljoin(next_page), callback=self.parse)
The decorator works on all the callback kinds Scrapy supports: regular functions, generators, coroutines, and async generators. The wrapper keeps the callback's kind, name, and docstring, so Scrapy's callback introspection and contracts keep working.
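To make the "keeps the callback's kind" point concrete, here is a minimal standard-library sketch of how a decorator can wrap each of the four callback kinds without changing what inspect reports about them. It is an illustration of the general technique, not Scrapling's actual implementation: the names convert, converting, and DemoSpider are hypothetical, and convert is only a stand-in for the real response conversion.

```python
import asyncio
import functools
import inspect


def convert(response):
    # Hypothetical stand-in for the real response conversion step.
    return f"converted:{response}"


def converting(callback):
    """Wrap `callback` so its response argument is converted first.

    The wrapper is built with the same kind as the callback, and
    functools.wraps copies the name and docstring across.
    """
    if inspect.isasyncgenfunction(callback):
        @functools.wraps(callback)
        async def wrapper(self, response, *args, **kwargs):
            async for item in callback(self, convert(response), *args, **kwargs):
                yield item
    elif inspect.iscoroutinefunction(callback):
        @functools.wraps(callback)
        async def wrapper(self, response, *args, **kwargs):
            return await callback(self, convert(response), *args, **kwargs)
    elif inspect.isgeneratorfunction(callback):
        @functools.wraps(callback)
        def wrapper(self, response, *args, **kwargs):
            yield from callback(self, convert(response), *args, **kwargs)
    else:
        @functools.wraps(callback)
        def wrapper(self, response, *args, **kwargs):
            return callback(self, convert(response), *args, **kwargs)
    return wrapper


class DemoSpider:
    @converting
    def parse(self, response):
        """Yield one item per response."""
        yield {"seen": response}

    @converting
    async def parse_async(self, response):
        return {"seen": response}


if __name__ == "__main__":
    spider = DemoSpider()
    print(inspect.isgeneratorfunction(DemoSpider.parse))
    print(list(spider.parse("raw")))
    print(asyncio.run(spider.parse_async("raw")))
```

Because each wrapper has the same kind as the function it wraps, a framework that checks the callback with inspect sees a generator where it expects a generator and a coroutine where it expects a coroutine.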
You can also pass Selector configuration to the decorator, and it will be forwarded to the generated Response.
If you have a Scrapy response at hand outside a callback (middlewares, pipelines, and so on), use the converter directly:
from scrapling.integrations.scrapy import convert_response
scrapling_response = convert_response(scrapy_response, keep_comments=False, keep_cdata=False)
Notes
- Yield scrapy.Request(response.urljoin(href)) for the next pages as in the example above. Scrapling's Response.follow() method builds requests for Scrapling's spider system, which Scrapy doesn't understand.
- The response's meta dictionary is shallow-copied, so objects stored by other middlewares stay reachable. For example, with scrapy-playwright, the page is still at response.meta["playwright_page"].
- Cookies are parsed from the raw Set-Cookie headers into the response's cookies dictionary.
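The behaviours described in these notes can be illustrated with the standard library alone. The sketch below is not Scrapling's code: FakePage, parse_set_cookie_headers, and the sample header values are hypothetical, chosen only to show what joining a relative link, shallow-copying a dictionary, and parsing Set-Cookie headers look like in plain Python.

```python
from http.cookies import SimpleCookie
from urllib.parse import urljoin


def parse_set_cookie_headers(raw_headers):
    """Collect cookie names and values from raw Set-Cookie header values."""
    cookies = {}
    for raw in raw_headers:
        jar = SimpleCookie()
        jar.load(raw)  # attributes such as Path or HttpOnly are parsed, not stored as cookies
        for name, morsel in jar.items():
            cookies[name] = morsel.value
    return cookies


class FakePage:
    """Hypothetical stand-in for an object a middleware stored in meta."""


# Resolving a relative href against the page URL, as response.urljoin(href) does.
next_url = urljoin("https://quotes.toscrape.com/page/1/", "/page/2/")

# A shallow copy creates a new dictionary that points at the same value objects.
original_meta = {"playwright_page": FakePage(), "depth": 1}
copied_meta = dict(original_meta)
copied_meta["depth"] = 2  # rebinding a key in the copy leaves the original untouched

# Two sample Set-Cookie header values (made up for this sketch).
headers = [
    "sessionid=abc123; Path=/; HttpOnly",
    "theme=dark; Max-Age=3600",
]
cookies = parse_set_cookie_headers(headers)

if __name__ == "__main__":
    print(next_url)
    print(copied_meta["playwright_page"] is original_meta["playwright_page"])
    print(cookies)
```

The identity check is the practical consequence of a shallow copy: the stored page object is shared rather than duplicated, which is why it stays reachable through the copied dictionary.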