Response Class
The Response class wraps HTTP responses returned by all fetchers, providing access to status, headers, body, cookies, and a Selector for parsing.
You can import the Response class from the module path below (`from scrapling.engines.toolbelt.custom import Response`):
scrapling.engines.toolbelt.custom.Response
Response(
url,
content,
status,
reason,
cookies,
headers,
request_headers,
encoding="utf-8",
method="GET",
history=None,
meta=None,
**selector_config
)
Bases: Selector
Inheritance diagram: scrapling.core.mixins.SelectorsGeneration --> scrapling.parser.Selector --> scrapling.engines.toolbelt.custom.Response
This class is returned by all engines as a way to unify the response type between different libraries.
| PARAMETER | DESCRIPTION |
|---|---|
| status | HTTP status code. |
| reason | HTTP status message. |
| cookies | Response cookies. |
| headers | Response headers. |
| request_headers | Request headers sent with the request. |
| history | List of redirect responses, if any. |
| meta | Metadata dictionary (e.g., proxy used). |
| request | Associated spider Request object (set by crawler, in the spiders framework). |
| captured_xhr | List of captured XHR/fetch |
The main class that works as a wrapper for the HTML input data. Using this class, you can search for elements with CSS expressions, XPath expressions, or simply by text. Check the docs for more info.
This class extends lxml.html.HtmlElement's functionality while maintaining a simpler interface. It does not inherit from lxml.html.HtmlElement because that class is not pickleable, which makes many reference jobs impossible. You can test this yourself and see the code fail with AssertionError: invalid Element proxy at....
It's an old issue with lxml; see this entry: <https://bugs.launchpad.net/lxml/+bug/736708>
| PARAMETER | DESCRIPTION |
|---|---|
| content | HTML content as either string or bytes. |
| url | Stores a URL with the HTML data so it can be retrieved later. |
| encoding | The encoding type that will be used in HTML parsing; the default is utf-8, as shown in the signature above. |
| huge_tree | Enabled by default; should always be enabled when parsing large HTML documents. This controls the libxml2 feature that forbids parsing certain large documents to protect from possible memory exhaustion. |
| root | Used internally to pass etree objects instead of text/body arguments; it takes the highest priority. Don't use it unless you know what you are doing! |
| keep_comments | Whether to keep comments while parsing the HTML body. Disabled by default, so comments are dropped. |
| keep_cdata | Whether to keep CDATA while parsing the HTML body. Disabled by default for cleaner HTML. |
| adaptive | Globally turn off the adaptive feature in all functions; this argument takes priority over all adaptive-related arguments/functions in the class. |
| storage | The storage class to be passed for adaptive functionalities, see |
| storage_args | A dictionary of |
Source code in scrapling/engines/toolbelt/custom.py
body
property
Return the raw body of the response as bytes, without any processing. Useful for binary and non-HTML requests.
generate_css_selector
property
Generate a CSS selector for the current element.
| RETURNS | DESCRIPTION |
|---|---|
| str | A string of the generated selector. |
generate_full_css_selector
property
Generate a complete CSS selector for the current element.
| RETURNS | DESCRIPTION |
|---|---|
| str | A string of the generated selector. |
generate_xpath_selector
property
Generate an XPath selector for the current element.
| RETURNS | DESCRIPTION |
|---|---|
| str | A string of the generated selector. |
generate_full_xpath_selector
property
Generate a complete XPath selector for the current element.
| RETURNS | DESCRIPTION |
|---|---|
| str | A string of the generated selector. |
__slots__
class-attribute
instance-attribute
__slots__ = (
"url",
"encoding",
"__adaptive_enabled",
"_root",
"_storage",
"__keep_comments",
"__huge_tree_enabled",
"__attributes",
"__text",
"__tag",
"__keep_cdata",
"_raw_body",
)
below_elements
property
Return all elements under the current element in the DOM tree.
children
property
Return the child elements of the current element, or an empty list otherwise.
siblings
property
Return the other children of the current element's parent, or an empty list otherwise.
path
property
Return a Selectors list that contains the path leading to the current element from the root.
next
property
Return the next element after the current element among the parent's children, or None otherwise.
previous
property
Return the previous element before the current element among the parent's children, or None otherwise.
follow
follow(
url,
sid="",
callback=None,
priority=None,
dont_filter=False,
meta=None,
referer_flow=True,
**kwargs
)
Create a Request to follow a URL.
This is a helper method for spiders to easily follow links found in pages.
IMPORTANT: If any of the arguments below is left empty, the corresponding value from the previous request is used. The only exception is dont_filter.
| PARAMETER | DESCRIPTION |
|---|---|
| url | The URL to follow (can be relative; it will be joined with the current URL). |
| sid | The session id to use. |
| callback | Spider callback method to use. |
| priority | The priority number to use; the higher the number, the sooner the request is processed. |
| dont_filter | If this request has been done before, disable the filter to allow it again. |
| meta | Additional metadata to include in the request. |
| referer_flow | Enabled by default; sets the current response URL as the referer for the new request. |
| kwargs | Additional Request arguments. |

| RETURNS | DESCRIPTION |
|---|---|
| Any | Request object ready to be yielded. |
Source code in scrapling/engines/toolbelt/custom.py
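The URL joining and fallback rule described for follow can be sketched with the standard library alone. This is not Scrapling's implementation: `build_follow_args`, the `previous` dictionary, and the `headers` key are hypothetical names used only to illustrate how a relative URL is resolved against the current response URL and how empty arguments fall back to the previous request's values.

```python
from urllib.parse import urljoin


def build_follow_args(current_url, previous, url, sid="", callback=None,
                      priority=None, dont_filter=False, referer_flow=True):
    """Hypothetical helper: resolve the arguments of a follow-up request."""
    args = {
        # Relative links are resolved against the URL of the current response.
        "url": urljoin(current_url, url),
        # Empty values fall back to whatever the previous request used.
        "sid": sid or previous["sid"],
        "callback": callback or previous["callback"],
        "priority": priority if priority is not None else previous["priority"],
        # dont_filter is the exception: it is never inherited.
        "dont_filter": dont_filter,
        "headers": {},
    }
    if referer_flow:
        # The current response URL becomes the Referer of the new request.
        args["headers"]["Referer"] = current_url
    return args


previous_request = {"sid": "default", "callback": "parse", "priority": 5}
follow_args = build_follow_args(
    "https://example.com/catalog/page-1.html", previous_request, "../item/42"
)
print(follow_args["url"])
```

Passing `referer_flow=False` in this sketch leaves the `headers` dictionary empty, which mirrors the documented switch without claiming anything about how the real Request object stores headers.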
__str__
__getitem__
__contains__
__getstate__
get_all_text
Get all child strings of this element, concatenated using the given separator.
| PARAMETER | DESCRIPTION |
|---|---|
| separator | Strings will be concatenated using this separator. |
| strip | If True, strings will be stripped before being concatenated. |
| ignore_tags | A tuple of all tag names you want to ignore. |
| valid_values | If enabled, elements whose text content is empty or only whitespace will be ignored. |

| RETURNS | DESCRIPTION |
|---|---|
| TextHandler | A TextHandler |
Source code in scrapling/parser.py
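How the four get_all_text parameters interact can be illustrated with a small standard-library sketch. It is an approximation under stated assumptions, not the library's code: `TextCollector` and `all_text` are hypothetical names, the default values are placeholders chosen for the example, and the result is a plain str rather than a TextHandler.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collect text nodes, skipping everything inside ignored tags."""

    def __init__(self, ignore_tags):
        super().__init__()
        self.ignore_tags = ignore_tags
        self.ignored_depth = 0  # greater than 0 while inside an ignored tag
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.ignore_tags:
            self.ignored_depth += 1

    def handle_endtag(self, tag):
        if tag in self.ignore_tags and self.ignored_depth:
            self.ignored_depth -= 1

    def handle_data(self, data):
        if not self.ignored_depth:
            self.chunks.append(data)


def all_text(html, separator="\n", strip=False,
             ignore_tags=("script", "style"), valid_values=True):
    collector = TextCollector(ignore_tags)
    collector.feed(html)
    collector.close()
    chunks = collector.chunks
    if valid_values:
        # Drop strings that are empty or whitespace-only.
        chunks = [chunk for chunk in chunks if chunk.strip()]
    if strip:
        chunks = [chunk.strip() for chunk in chunks]
    return separator.join(chunks)


sample = "<div><p> Hello </p><script>var x = 1;</script><p>world</p>  </div>"
text = all_text(sample, separator=" ", strip=True)
print(text)
```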
urljoin
prettify
Return a prettified version of the element's inner HTML code.
Source code in scrapling/parser.py
has_class
iterancestors
Return a generator that loops over all ancestors of the element, starting with the element's parent.
Source code in scrapling/parser.py
find_ancestor
Loop over all ancestors of the element until one matches the passed function.
| PARAMETER | DESCRIPTION |
|---|---|
| func | A function that takes each ancestor as an argument and returns True/False. |

| RETURNS | DESCRIPTION |
|---|---|
| Optional[Selector] | The first ancestor that matches the function or |
Source code in scrapling/parser.py
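The ancestor walk behind iterancestors and find_ancestor can be sketched on top of `xml.etree.ElementTree`. This is only an illustration of the traversal order and the predicate contract; the standalone functions and the `parent_map` argument are assumptions of the sketch, since the standard-library tree has no parent pointers.

```python
import xml.etree.ElementTree as ET


def iter_ancestors(element, parent_map):
    """Yield the ancestors of `element`, starting with its parent."""
    parent = parent_map.get(element)
    while parent is not None:
        yield parent
        parent = parent_map.get(parent)


def find_ancestor(element, parent_map, func):
    """Return the first ancestor for which func(ancestor) is true, else None."""
    for ancestor in iter_ancestors(element, parent_map):
        if func(ancestor):
            return ancestor
    return None


root = ET.fromstring(
    '<html><body><section class="products"><div><a href="/p/1">One</a></div>'
    "</section></body></html>"
)
# ElementTree has no parent pointers, so build a child -> parent lookup.
parent_map = {child: parent for parent in root.iter() for child in parent}
link = root.find(".//a")

match = find_ancestor(link, parent_map, lambda el: el.get("class") == "products")
missing = find_ancestor(link, parent_map, lambda el: el.tag == "table")
print(match.tag, missing)
```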
get
Serialize this element to a string. For text nodes, returns the text value. For HTML elements, returns the outer HTML.
Source code in scrapling/parser.py
getall
__repr__
Source code in scrapling/parser.py
relocate
Search the page tree again for the element; used automatically when the page structure changes.
| PARAMETER | DESCRIPTION |
|---|---|
| element | The element we want to relocate in the tree. |
| percentage | The minimum percentage to accept; results scoring lower are not returned. Be aware that the percentage calculation depends solely on the page structure, so don't change this number unless you know what you are doing! |
| selector_type | If True, the return result will be converted to |

| RETURNS | DESCRIPTION |
|---|---|
| Union[List[HtmlElement], Selectors] | List of pure HTML elements that got the highest matching score or 'Selectors' object |
Source code in scrapling/parser.py
css
Search the current tree with CSS3 selectors.
Important: It's recommended to use the identifier argument if you plan to use a different selector later and want to relocate the same element(s).
| PARAMETER | DESCRIPTION |
|---|---|
| selector | The CSS3 selector to be used. |
| adaptive | If enabled, the function will try to relocate the element if it was 'saved' before. |
| identifier | A string that will be used to save/retrieve the element's data for the adaptive feature; otherwise, the selector will be used. |
| auto_save | Automatically save new elements for |
| percentage | The minimum percentage to accept while |

| RETURNS | DESCRIPTION |
|---|---|
| Selectors | |
Source code in scrapling/parser.py
xpath
Search the current tree with XPath selectors.
Important: It's recommended to use the identifier argument if you plan to use a different selector later and want to relocate the same element(s).
Note: Additional keyword arguments will be passed as XPath variables in the XPath expression!
| PARAMETER | DESCRIPTION |
|---|---|
| selector | The XPath selector to be used. |
| adaptive | If enabled, the function will try to relocate the element if it was 'saved' before. |
| identifier | A string that will be used to save/retrieve the element's data for the adaptive feature; otherwise, the selector will be used. |
| auto_save | Automatically save new elements for |
| percentage | The minimum percentage to accept while |

| RETURNS | DESCRIPTION |
|---|---|
| Selectors | |
Source code in scrapling/parser.py
find_all
Find elements using filters of your own creation.
| PARAMETER | DESCRIPTION |
|---|---|
| args | Tag name(s), iterable of tag names, regex patterns, function, or a dictionary of elements' attributes. Leave empty for selecting all. |
| kwargs | The attributes you want to filter elements by. |

| RETURNS | DESCRIPTION |
|---|---|
| Selectors | The |
Source code in scrapling/parser.py
find
Find elements using filters of your own creation, then return the first result. Otherwise return None.
| PARAMETER | DESCRIPTION |
|---|---|
| args | Tag name(s), iterable of tag names, regex patterns, function, or a dictionary of elements' attributes. Leave empty for selecting all. |
| kwargs | The attributes you want to filter elements by. |

| RETURNS | DESCRIPTION |
|---|---|
| Optional[Selector] | The |
Source code in scrapling/parser.py
save
Saves the element's unique properties to the storage for retrieval and relocation later.
| PARAMETER | DESCRIPTION |
|---|---|
| element | The element itself that we want to save to storage, it can be a |
| identifier | This is the identifier that will be used to retrieve the element later from the storage. See the docs for more info. |
Source code in scrapling/parser.py
retrieve
Using the identifier, we search the storage and return the unique properties of the element.
| PARAMETER | DESCRIPTION |
|---|---|
| identifier | This is the identifier that will be used to retrieve the element from the storage. See the docs for more info. |

| RETURNS | DESCRIPTION |
|---|---|
| Optional[Dict[str, Any]] | A dictionary of the unique properties |
Source code in scrapling/parser.py
json
Return the parsed JSON if the response body is valid JSON; otherwise an error is raised.
Source code in scrapling/parser.py
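The parse-or-raise behaviour described for json can be sketched with the standard `json` module. `parse_json_body` is a hypothetical stand-in, and the exception type shown here (ValueError, the base class of `json.JSONDecodeError`) is what the standard library raises; the source does not say which error the library itself surfaces.

```python
import json


def parse_json_body(body: bytes, encoding: str = "utf-8"):
    """Decode and parse a JSON body; raises ValueError if it is not valid JSON."""
    return json.loads(body.decode(encoding))


data = parse_json_body(b'{"items": [1, 2, 3], "next": null}')

try:
    parse_json_body(b"<html>not json</html>")
    failed = False
except ValueError:
    # json.JSONDecodeError is a subclass of ValueError.
    failed = True

print(data["items"], failed)
```

Wrapping the call in a try/except, as above, is one way to handle endpoints that sometimes answer with an HTML error page instead of JSON.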
re
Apply the given regex to the current text and return a list of strings with the matches.
| PARAMETER | DESCRIPTION |
|---|---|
| regex | Can be either a compiled regular expression or a string. |
| replace_entities | If enabled, character entity references are replaced by their corresponding characters. |
| clean_match | If enabled, this will ignore all whitespaces and consecutive spaces while matching. |
| case_sensitive | If disabled, the function will set the regex to ignore letter case while compiling it. |
Source code in scrapling/parser.py
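The three flags of re can be approximated with the standard `re` and `html` modules. This sketch rests on assumptions: `regex_all` is a hypothetical function, its defaults are placeholders, clean_match is interpreted here as collapsing runs of whitespace before matching, and whole matches are returned rather than capture groups. The library may differ on each of these points.

```python
import re
from html import unescape


def regex_all(text, regex, replace_entities=True, clean_match=False,
              case_sensitive=True):
    if clean_match:
        # Collapse runs of whitespace so layout differences don't block a match.
        text = " ".join(text.split())
    if isinstance(regex, str):
        flags = 0 if case_sensitive else re.IGNORECASE
        regex = re.compile(regex, flags)
    matches = [match.group(0) for match in regex.finditer(text)]
    if replace_entities:
        # Turn references such as &amp; into the characters they stand for.
        matches = [unescape(match) for match in matches]
    return matches


raw = "Tom &amp;  Jerry\nTOM &amp; JERRY"
loose = regex_all(raw, r"tom &amp; jerry", clean_match=True, case_sensitive=False)
strict = regex_all(raw, r"tom &amp; jerry", clean_match=True)
print(loose, strict)
```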
re_first
Apply the given regex to text and return the first match if found, otherwise return the default value.
| PARAMETER | DESCRIPTION |
|---|---|
| regex | Can be either a compiled regular expression or a string. |
| default | The default value to be returned if there is no match. |
| replace_entities | If enabled, character entity references are replaced by their corresponding characters. |
| clean_match | If enabled, this will ignore all whitespaces and consecutive spaces while matching. |
| case_sensitive | If disabled, the function will set the regex to ignore letter case while compiling it. |
Source code in scrapling/parser.py
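The first-match-or-default contract of re_first is easy to see in a standard-library sketch. `regex_first` is a hypothetical function written for illustration; it returns the whole match and omits the replace_entities and clean_match options for brevity.

```python
import re


def regex_first(text, regex, default=None, case_sensitive=True):
    flags = 0 if case_sensitive else re.IGNORECASE
    pattern = re.compile(regex, flags) if isinstance(regex, str) else regex
    match = pattern.search(text)
    # Fall back to the caller-supplied default when nothing matches.
    return match.group(0) if match else default


price = regex_first("Total: 19.99 USD", r"\d+\.\d+")
fallback = regex_first("Total: n/a", r"\d+\.\d+", default="0.00")
print(price, fallback)
```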
find_similar
Find elements at the same tree depth in the page with the same tag name, the same parent tag, etc., then return the ones whose attributes match the current element's attributes with a percentage higher than the input threshold.
This function is inspired by AutoScraper and made for cases where you, for example, found a product div inside a products-list container and want to find other products using that element as a starting point, except that this function works in any case without depending on the element type.
| PARAMETER | DESCRIPTION |
|---|---|
| similarity_threshold | The percentage to use while comparing element attributes. Note: Elements found before attribute matching/comparison already share the same depth, tag name, parent tag name, and grandparent tag name, so they are 99% likely to be correct unless you are extremely unlucky. Attribute matching only comes into play after that, so don't change this number unless you are getting results you don't want. Also, if neither the current element nor the similar element has attributes, it's a 100% match. |
| ignore_attributes | Attribute names passed will be ignored while matching the attributes in the last step. The default value is to ignore |
| match_text | If True, element text content will be taken into calculation while matching. Not recommended to use in normal cases, but it depends. |

| RETURNS | DESCRIPTION |
|---|---|
| Selectors | A |
Source code in scrapling/parser.py
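The final attribute-matching step of find_similar can be pictured with a toy scoring function. Everything here is an assumption made for illustration: `attribute_similarity` is hypothetical, the score (average `difflib` ratio over the union of attribute names) is not the library's formula, and the ignored attribute name is passed explicitly because the documented default is not reproduced. Only two rules come from the description above: ignored attributes are skipped, and two attribute-less elements count as a 100% match.

```python
from difflib import SequenceMatcher


def attribute_similarity(current, candidate, ignore_attributes=()):
    """Hypothetical score in [0, 1] comparing two attribute dictionaries."""
    a = {k: v for k, v in current.items() if k not in ignore_attributes}
    b = {k: v for k, v in candidate.items() if k not in ignore_attributes}
    if not a and not b:
        # No attributes on either side counts as a perfect match.
        return 1.0
    names = set(a) | set(b)
    total = sum(
        SequenceMatcher(None, a.get(name, ""), b.get(name, "")).ratio()
        for name in names
    )
    return total / len(names)


current = {"class": "product", "href": "/p/1"}
candidates = [
    {"class": "product", "href": "/p/2"},
    {"class": "banner", "href": "/ads"},
    {},
]
threshold = 0.8
similar = [
    candidate for candidate in candidates
    if attribute_similarity(current, candidate, ignore_attributes=("href",)) >= threshold
]
print(similar)
```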
find_by_text
Find elements whose text content fully or partially matches the input.
| PARAMETER | DESCRIPTION |
|---|---|
| text | Text query to match. |
| first_match | Return the first element that matches the conditions; enabled by default. |
| partial | If enabled, the function returns elements that contain the input text. |
| case_sensitive | If enabled, letter case will be taken into consideration. |
| clean_match | If enabled, this will ignore all whitespaces and consecutive spaces while matching. |
Source code in scrapling/parser.py
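The way the find_by_text options combine can be sketched on plain strings. The functions below are hypothetical and operate on a list of texts rather than on elements; defaults other than first_match are placeholders, and clean_match is interpreted as collapsing whitespace before comparing.

```python
def text_matches(content, query, partial=False, case_sensitive=False,
                 clean_match=False):
    if clean_match:
        # Collapse runs of whitespace and trim the ends before comparing.
        content = " ".join(content.split())
    if not case_sensitive:
        content, query = content.lower(), query.lower()
    # Partial mode checks containment; otherwise the whole text must match.
    return query in content if partial else query == content


def find_by_text(texts, query, first_match=True, **options):
    matches = [text for text in texts if text_matches(text, query, **options)]
    if first_match:
        return matches[0] if matches else None
    return matches


texts = ["Add to cart", "  add   to cart ", "Remove from cart"]
first = find_by_text(texts, "add to cart")
every_cart = find_by_text(texts, "cart", first_match=False, partial=True)
cleaned = find_by_text(texts, "add to cart", first_match=False, clean_match=True)
print(first, len(every_cart), cleaned)
```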
find_by_regex
Find elements whose text content matches the input regex pattern.
| PARAMETER | DESCRIPTION |
|---|---|
| query | Regex query/pattern to match. |
| first_match | Return the first element that matches the conditions; enabled by default. |
| case_sensitive | If enabled, letter case will be taken into consideration in the regex. |
| clean_match | If enabled, this will ignore all whitespaces and consecutive spaces while matching. |