Spider Classes
Here's the reference information for the spider framework classes' parameters, attributes, and methods.
You can import each class directly from the module path shown in its heading, for example `scrapling.spiders.Spider`.
scrapling.spiders.Spider
Bases: ABC
An abstract base class for creating web spiders.
Check the documentation website for more information.
Initialize the spider.
| PARAMETER | DESCRIPTION |
|---|---|
| crawldir | Directory for checkpoint files. If provided, enables pause/resume. |
| interval | Seconds between periodic checkpoint saves (default 5 minutes). |
Source code in scrapling/spiders/spider.py
concurrent_requests_per_domain
class-attribute
instance-attribute
logging_format
class-attribute
instance-attribute
start_requests
async
Generate initial requests to start the crawl.
By default, this generates Request objects for each URL in start_urls
using the session manager's default session and parse() as callback.
Override this method for more control over initial requests, for example to add custom headers or use different callbacks.
Source code in scrapling/spiders/spider.py
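The default behaviour and an override can be sketched with stand-in classes. This is a minimal illustration, not the library's code: `SketchRequest`, `SketchSpider`, `HeaderSpider`, the `default_sid` attribute, and the `kwargs["headers"]` slot are all hypothetical names chosen for the example.

```python
import asyncio
from dataclasses import dataclass, field
from typing import Any, AsyncIterator, Callable, Optional


@dataclass
class SketchRequest:
    """Hypothetical stand-in for a request object."""
    url: str
    sid: str = ""
    callback: Optional[Callable[..., Any]] = None
    kwargs: dict = field(default_factory=dict)


class SketchSpider:
    """Hypothetical stand-in for a spider base class."""
    start_urls = ["https://example.com/", "https://example.com/about"]
    default_sid = "default"  # assumed name for the default session ID

    async def start_requests(self) -> AsyncIterator[SketchRequest]:
        # Default behaviour as described: one request per start URL,
        # default session, parse() as the callback.
        for url in self.start_urls:
            yield SketchRequest(url, sid=self.default_sid, callback=self.parse)

    async def parse(self, response: Any):
        yield {"url": response}


class HeaderSpider(SketchSpider):
    async def start_requests(self) -> AsyncIterator[SketchRequest]:
        # Override: attach a custom header to every initial request.
        async for request in super().start_requests():
            request.kwargs["headers"] = {"Accept-Language": "en"}
            yield request


async def collect(spider: SketchSpider) -> list:
    return [request async for request in spider.start_requests()]


initial = asyncio.run(collect(HeaderSpider()))
```

Reusing the parent generator keeps the default session and callback wiring while changing only what the override cares about.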
parse
abstractmethod
async
Default callback for processing responses.
Source code in scrapling/spiders/spider.py
on_start
async
Called before crawling starts. Override for setup logic.
| PARAMETER | DESCRIPTION |
|---|---|
| resuming | True if the spider is resuming from a checkpoint. Provided for use in your own setup logic. |
Source code in scrapling/spiders/spider.py
on_close
async
on_error
async
Handle request errors for all spider requests.
Override for custom error handling.
on_scraped_item
async
A hook for users to override to process scraped items. Return None to drop the item silently.
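The return-None-to-drop contract can be sketched as follows. This is a hedged illustration: the hook is written as a free function for brevity (in a spider it would be a method), its exact signature is an assumption, and `run_hook` is a hypothetical stand-in for whatever loop the engine uses.

```python
import asyncio
from typing import Any, Optional


async def on_scraped_item(item: dict) -> Optional[dict]:
    """Hypothetical hook body: clean the title, drop items without a price."""
    if item.get("price") is None:
        return None  # returning None drops the item silently
    return {**item, "title": item["title"].strip()}


async def run_hook(items: list) -> list:
    """Stand-in for the engine: keep only the items the hook returns."""
    kept = []
    for item in items:
        result = await on_scraped_item(item)
        if result is not None:
            kept.append(result)
    return kept


kept_items = asyncio.run(run_hook([
    {"title": "  Widget ", "price": 9.5},
    {"title": "No price", "price": None},
]))
```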
is_blocked
async
Check if the response is blocked. Users should override this for custom detection logic.
retry_blocked_request
async
Users should override this to prepare the blocked request before retrying, if needed.
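A minimal sketch of how the two hooks `is_blocked` and `retry_blocked_request` might work together, using stand-in types. `FakeResponse`, `FakeRequest`, the status set, the "captcha" marker, and the header being set are all assumptions for illustration. The real method signatures are not shown in this reference.

```python
import asyncio
from dataclasses import dataclass, field

BLOCKED_STATUSES = {403, 429, 503}  # assumed set of "blocked" status codes


@dataclass
class FakeResponse:
    """Hypothetical response stand-in."""
    status: int
    body: str = ""


@dataclass
class FakeRequest:
    """Hypothetical request stand-in."""
    url: str
    headers: dict = field(default_factory=dict)


async def is_blocked(response: FakeResponse) -> bool:
    # Custom detection: status code first, then a marker in the body.
    if response.status in BLOCKED_STATUSES:
        return True
    return "captcha" in response.body.lower()


async def retry_blocked_request(request: FakeRequest) -> FakeRequest:
    # Prepare the request before it is retried, here by bypassing caches.
    request.headers["Cache-Control"] = "no-cache"
    return request


blocked_by_status = asyncio.run(is_blocked(FakeResponse(429)))
blocked_by_body = asyncio.run(is_blocked(FakeResponse(200, "Please solve the CAPTCHA")))
not_blocked = asyncio.run(is_blocked(FakeResponse(200, "<html>ok</html>")))
prepared = asyncio.run(retry_blocked_request(FakeRequest("https://example.com/")))
```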
__repr__
configure_sessions
Configure sessions for this spider.
Override this method to add custom sessions. The default implementation creates a FetcherSession.
The first session added becomes the default for start_requests() unless specified otherwise.
| PARAMETER | DESCRIPTION |
|---|---|
| manager | SessionManager to configure |
Source code in scrapling/spiders/spider.py
pause
start
Run the spider and return results.
This is the main entry point for running a spider. Handles async execution internally via anyio.
Pressing Ctrl+C initiates a graceful shutdown (waits for active tasks to complete). Pressing Ctrl+C a second time forces an immediate stop.
If crawldir is set, a checkpoint will also be saved on graceful shutdown, allowing you to resume the crawl later by running the spider again.
| PARAMETER | DESCRIPTION |
|---|---|
| use_uvloop | Whether to use the faster uvloop/winloop event loop implementation, if available. |
| backend_options | Asyncio backend options to be used with |
Source code in scrapling/spiders/spider.py
stream
async
Stream items as they're scraped. Ideal for long-running spiders or building applications on top of the spiders.
Must be called from an async context. Yields items one by one as they are scraped.
Access spider.stats during iteration for real-time statistics.
Note: SIGINT handling for pause/resume is not available in stream mode.
Source code in scrapling/spiders/spider.py
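The streaming pattern described above (an async generator that yields items as they are produced) can be sketched with a queue between a producer task and the consumer. This is an illustration of the pattern only: `stream_items` and its `pages` argument are hypothetical and do not reflect how the library implements `stream`.

```python
import asyncio
from typing import Any, AsyncIterator


async def stream_items(pages: list) -> AsyncIterator[Any]:
    """Yield items one by one while a background task keeps producing them."""
    queue: asyncio.Queue = asyncio.Queue()
    done = object()  # sentinel marking the end of the crawl

    async def producer() -> None:
        for page in pages:
            await asyncio.sleep(0)  # stand-in for fetching a page
            for item in page:
                await queue.put(item)
        await queue.put(done)

    task = asyncio.create_task(producer())
    try:
        while True:
            item = await queue.get()
            if item is done:
                break
            yield item
    finally:
        task.cancel()  # no-op if the producer already finished


async def collect() -> list:
    seen = []
    # Must run inside an async context, as noted above.
    async for item in stream_items([[{"id": 1}], [{"id": 2}, {"id": 3}]]):
        seen.append(item["id"])
    return seen


streamed_ids = asyncio.run(collect())
```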
scrapling.spiders.Request
Request(
url,
sid="",
callback=None,
priority=0,
dont_filter=False,
meta=None,
_retry_count=0,
**kwargs
)
Source code in scrapling/spiders/request.py
copy
Create a copy of this request.
Source code in scrapling/spiders/request.py
update_fingerprint
Generate a unique fingerprint for deduplication.
Caches the result in self._fp after first computation.
Source code in scrapling/spiders/request.py
__repr__
__str__
__lt__
__gt__
__eq__
Requests are equal if they have the same fingerprint.
Source code in scrapling/spiders/request.py
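Fingerprint caching and fingerprint-based equality can be sketched like this. It is a hedged stand-in: `SketchRequest` is hypothetical, and the fields hashed here (session ID and URL, via SHA-1) are an assumption, since this reference does not say what the real fingerprint covers.

```python
import hashlib
from typing import Optional


class SketchRequest:
    """Hypothetical stand-in, not the library's Request class."""

    def __init__(self, url: str, sid: str = "") -> None:
        self.url = url
        self.sid = sid
        self._fp: Optional[bytes] = None  # cached fingerprint

    def update_fingerprint(self) -> bytes:
        # Compute once, then reuse the cached value.
        if self._fp is None:
            payload = f"{self.sid}|{self.url}".encode("utf-8")
            self._fp = hashlib.sha1(payload).digest()
        return self._fp

    def __eq__(self, other: object) -> bool:
        if not isinstance(other, SketchRequest):
            return NotImplemented
        return self.update_fingerprint() == other.update_fingerprint()

    def __hash__(self) -> int:
        return hash(self.update_fingerprint())


# Deduplicate a small URL list by fingerprint.
seen = set()
queued = []
for url in ["https://example.com/a", "https://example.com/b", "https://example.com/a"]:
    request = SketchRequest(url)
    if request.update_fingerprint() in seen:
        continue
    seen.add(request.update_fingerprint())
    queued.append(request.url)
```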
__getstate__
Prepare state for pickling - store callback as name string for pickle compatibility.
Source code in scrapling/spiders/request.py
__setstate__
Restore state from pickle - callback restored later via _restore_callback().
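The callback-by-name round trip can be sketched as below. The stand-in classes are hypothetical, and the body of `_restore_callback` is an assumption: this reference names the method but does not show its signature or implementation.

```python
import pickle


class SketchSpider:
    """Hypothetical spider stand-in with one callback."""

    def parse(self, response):
        return response


class SketchRequest:
    """Hypothetical stand-in, not the library's Request class."""

    def __init__(self, url, callback=None):
        self.url = url
        self.callback = callback
        self._callback_name = None

    def __getstate__(self):
        # Store the callback as a plain name string so the state pickles cleanly.
        state = self.__dict__.copy()
        state["_callback_name"] = getattr(self.callback, "__name__", None)
        state["callback"] = None
        return state

    def __setstate__(self, state):
        # The callback stays None here and is re-bound later.
        self.__dict__.update(state)

    def _restore_callback(self, spider):
        # Assumed behaviour: look the method up on the live spider by name.
        if self._callback_name is not None:
            self.callback = getattr(spider, self._callback_name)


spider = SketchSpider()
original = SketchRequest("https://example.com/", callback=spider.parse)

# The state dict holds only plain data, so it survives a pickle round trip.
blob = pickle.dumps(original.__getstate__())

restored = SketchRequest.__new__(SketchRequest)
restored.__setstate__(pickle.loads(blob))
callback_before_restore = restored.callback
restored._restore_callback(spider)
```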
Result Classes
scrapling.spiders.result.CrawlStats
dataclass
CrawlStats(
requests_count=0,
concurrent_requests=0,
concurrent_requests_per_domain=0,
failed_requests_count=0,
offsite_requests_count=0,
robots_disallowed_count=0,
cache_hits=0,
cache_misses=0,
response_bytes=0,
items_scraped=0,
items_dropped=0,
start_time=0.0,
end_time=0.0,
download_delay=0.0,
blocked_requests_count=0,
custom_stats=dict(),
response_status_count=dict(),
domains_response_bytes=dict(),
sessions_requests_count=dict(),
proxies=list(),
log_levels_counter=dict(),
)
Statistics for a crawl run.
concurrent_requests_per_domain
class-attribute
instance-attribute
response_status_count
class-attribute
instance-attribute
domains_response_bytes
class-attribute
instance-attribute
sessions_requests_count
class-attribute
instance-attribute
log_levels_counter
class-attribute
instance-attribute
increment_status
increment_response_bytes
increment_requests_count
to_dict
Source code in scrapling/spiders/result.py
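A stats dataclass with per-instance dict counters can be sketched as follows. `SketchStats` is a hypothetical, trimmed-down stand-in, and the argument lists of its `increment_*` methods are assumptions, since the real signatures are not shown here.

```python
from dataclasses import asdict, dataclass, field


@dataclass
class SketchStats:
    """Hypothetical, reduced stand-in for a crawl statistics dataclass."""
    requests_count: int = 0
    response_bytes: int = 0
    # default_factory gives every instance its own dict.
    response_status_count: dict = field(default_factory=dict)
    domains_response_bytes: dict = field(default_factory=dict)

    def increment_status(self, status: int) -> None:
        self.response_status_count[status] = self.response_status_count.get(status, 0) + 1

    def increment_response_bytes(self, domain: str, count: int) -> None:
        self.response_bytes += count
        self.domains_response_bytes[domain] = self.domains_response_bytes.get(domain, 0) + count

    def to_dict(self) -> dict:
        return asdict(self)


first, second = SketchStats(), SketchStats()
first.increment_status(200)
first.increment_status(200)
first.increment_response_bytes("example.com", 512)
snapshot = first.to_dict()
```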
scrapling.spiders.result.ItemList
Bases: list
A list of scraped items with export capabilities.
to_json
Export items to a JSON file.
| PARAMETER | DESCRIPTION |
|---|---|
| path | Path to the output file |
| indent | Pretty-print with 2-space indentation (slightly slower) |
Source code in scrapling/spiders/result.py
to_jsonl
Export items as JSON Lines (one JSON object per line).
| PARAMETER | DESCRIPTION |
|---|---|
| path | Path to the output file |
Source code in scrapling/spiders/result.py
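The two export formats differ only in layout, which a small list subclass can illustrate. `SketchItemList` is a hypothetical stand-in built on the standard `json` module. It mirrors the parameter names documented above but is not the library's implementation.

```python
import json
import tempfile
from pathlib import Path


class SketchItemList(list):
    """Hypothetical stand-in for a list of scraped items with export helpers."""

    def to_json(self, path, indent: bool = False) -> None:
        # One JSON array for the whole list, optionally pretty-printed.
        text = json.dumps(list(self), indent=2 if indent else None)
        Path(path).write_text(text, encoding="utf-8")

    def to_jsonl(self, path) -> None:
        # JSON Lines: one JSON object per line.
        with open(path, "w", encoding="utf-8") as handle:
            for item in self:
                handle.write(json.dumps(item) + "\n")


items = SketchItemList([{"id": 1}, {"id": 2}])
with tempfile.TemporaryDirectory() as tmp:
    json_path = Path(tmp) / "items.json"
    jsonl_path = Path(tmp) / "items.jsonl"
    items.to_json(json_path, indent=True)
    items.to_jsonl(jsonl_path)
    loaded = json.loads(json_path.read_text(encoding="utf-8"))
    lines = jsonl_path.read_text(encoding="utf-8").splitlines()
```

JSON Lines suits large or append-only outputs because each line parses on its own, while a single JSON array must be read in full.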
Session Management
scrapling.spiders.session.SessionManager
Manages pre-configured session instances.
Source code in scrapling/spiders/session.py
add
Register a session instance.
| PARAMETER | DESCRIPTION |
|---|---|
| session_id | Name to reference this session in requests |
| session | Your pre-configured session instance |
| default | If True, this becomes the default session |
| lazy | If True, the session will be started only when a request uses its ID. |
Source code in scrapling/spiders/session.py
remove
pop
Remove and return a session.
| PARAMETER | DESCRIPTION |
|---|---|
| session_id | ID of session to remove |
Source code in scrapling/spiders/session.py
get
start
async
Start all sessions that aren't already alive.
Source code in scrapling/spiders/session.py
close
async
Close all registered sessions.
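The registry behaviour described in this section (first session becomes the default, lazy sessions are skipped at start-up) can be sketched with stand-ins. `SketchSession`, `SketchSessionManager`, the `default_id` attribute, and the duplicate-ID error are assumptions for illustration. The real `SessionManager` may differ in signatures and details.

```python
import asyncio


class SketchSession:
    """Hypothetical session stand-in that only tracks whether it is alive."""

    def __init__(self) -> None:
        self.alive = False

    async def start(self) -> None:
        self.alive = True

    async def close(self) -> None:
        self.alive = False


class SketchSessionManager:
    """Hypothetical stand-in for a registry of pre-configured sessions."""

    def __init__(self) -> None:
        self._sessions = {}
        self._lazy = set()
        self.default_id = None

    def add(self, session_id, session, *, default=False, lazy=False):
        if session_id in self._sessions:
            raise ValueError(f"Session {session_id!r} is already registered")
        self._sessions[session_id] = session
        # The first session added becomes the default unless told otherwise.
        if default or self.default_id is None:
            self.default_id = session_id
        if lazy:
            self._lazy.add(session_id)
        return self

    def get(self, session_id=""):
        return self._sessions[session_id or self.default_id]

    def pop(self, session_id):
        self._lazy.discard(session_id)
        session = self._sessions.pop(session_id)
        if self.default_id == session_id:
            self.default_id = next(iter(self._sessions), None)
        return session

    async def start(self) -> None:
        # Start every non-lazy session that is not already alive.
        for sid, session in self._sessions.items():
            if sid not in self._lazy and not session.alive:
                await session.start()

    async def close(self) -> None:
        for session in self._sessions.values():
            await session.close()


manager = SketchSessionManager()
manager.add("http", SketchSession())
manager.add("browser", SketchSession(), lazy=True)
asyncio.run(manager.start())
```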