Selector Class
The Selector class is the core parsing engine in Scrapling that provides HTML parsing and element selection capabilities.
Here's the reference information for the Selector class, with all its parameters, attributes, and methods.
You can import the Selector class directly from scrapling with: from scrapling.parser import Selector
scrapling.parser.Selector
Selector(
content=None,
url="",
encoding="utf-8",
huge_tree=True,
root=None,
keep_comments=False,
keep_cdata=False,
adaptive=False,
_storage=None,
storage=SQLiteStorageSystem,
storage_args=None,
**_
)
Bases: SelectorsGeneration
The main class that works as a wrapper for the HTML input data. Using this class, you can search for elements with CSS expressions, XPath expressions, or simply with text. Check the docs for more info.
Here we try to extend lxml.html.HtmlElement while maintaining a simpler interface. We are not inheriting from lxml.html.HtmlElement because it's not picklable, which makes a lot of reference jobs impossible. If you try it, the code fails with AssertionError: invalid Element proxy at....
It's an old issue with lxml; see this entry: <https://bugs.launchpad.net/lxml/+bug/736708>
| PARAMETER | DESCRIPTION |
|---|---|
| content | HTML content as either string or bytes. |
| url | Stores a URL alongside the HTML data so it can be retrieved later. |
| encoding | The encoding used when parsing the HTML; the default is utf-8. |
| huge_tree | Enabled by default; should always be enabled when parsing large HTML documents. It controls the libxml2 feature that forbids parsing certain large documents to protect against possible memory exhaustion. |
| root | Used internally to pass etree objects instead of text/body arguments; it takes the highest priority. Don't use it unless you know what you are doing! |
| keep_comments | Whether to keep comments while parsing the HTML body. Disabled by default. |
| keep_cdata | Whether to keep CDATA while parsing the HTML body. Disabled by default for cleaner HTML. |
| adaptive | Controls the adaptive feature globally: when disabled (the default), it is turned off in all functions. This argument takes priority over all adaptive-related arguments/functions in the class. |
| storage | The storage class to be passed for adaptive functionalities. |
| storage_args | A dictionary of arguments to pass to the storage class. |
Source code in scrapling/parser.py
__slots__ (class-attribute, instance-attribute)
__slots__ = (
"url",
"encoding",
"__adaptive_enabled",
"_root",
"_storage",
"__keep_comments",
"__huge_tree_enabled",
"__attributes",
"__text",
"__tag",
"__keep_cdata",
"_raw_body",
)
body (property)
Return the raw body of the current Selector without any processing. Useful for binary and non-HTML requests.
below_elements (property)
Return all elements under the current element in the DOM tree.
children (property)
Return the child elements of the current element, or an empty list otherwise.
siblings (property)
Return the other children of the current element's parent, or an empty list otherwise.
path (property)
Return a Selectors list that contains the path leading to the current element from the root.
next (property)
Return the element that follows the current element among its parent's children, or None otherwise.
previous (property)
Return the element that precedes the current element among its parent's children, or None otherwise.
generate_css_selector (property)
Generate a CSS selector for the current element.
| RETURNS | DESCRIPTION |
|---|---|
| str | A string of the generated selector. |

generate_full_css_selector (property)
Generate a complete CSS selector for the current element.
| RETURNS | DESCRIPTION |
|---|---|
| str | A string of the generated selector. |

generate_xpath_selector (property)
Generate an XPath selector for the current element.
| RETURNS | DESCRIPTION |
|---|---|
| str | A string of the generated selector. |

generate_full_xpath_selector (property)
Generate a complete XPath selector for the current element.
| RETURNS | DESCRIPTION |
|---|---|
| str | A string of the generated selector. |
__getitem__
__contains__
__getstate__
get_all_text
Get all child strings of this element, concatenated using the given separator.
| PARAMETER | DESCRIPTION |
|---|---|
| separator | Strings will be concatenated using this separator. |
| strip | If True, strings will be stripped before being concatenated. |
| ignore_tags | A tuple of all tag names you want to ignore. |
| valid_values | If enabled, elements whose text content is empty or only whitespace will be ignored. |

| RETURNS | DESCRIPTION |
|---|---|
| TextHandler | A TextHandler |
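To make the four get_all_text parameters concrete, here is a minimal sketch built on the standard library's xml.etree.ElementTree rather than on Scrapling itself. The helper name collect_text and its default values are assumptions made for illustration; this is not the library's implementation.

```python
import xml.etree.ElementTree as ET


def collect_text(element, separator="\n", strip=False,
                 ignore_tags=("script", "style"), valid_values=True):
    """Concatenate the text found under `element` (hypothetical helper)."""
    pieces = []

    def walk(node):
        if node.tag in ignore_tags:
            return  # skip this tag and everything nested inside it
        if node.text is not None:
            pieces.append(node.text)
        for child in node:
            walk(child)
            # Text that follows a child belongs to the parent, so it is kept
            # even when the child itself was ignored.
            if child.tail is not None:
                pieces.append(child.tail)

    walk(element)
    if strip:
        pieces = [p.strip() for p in pieces]
    if valid_values:
        # Drop strings that are empty or whitespace-only.
        pieces = [p for p in pieces if p.strip()]
    return separator.join(pieces)


root = ET.fromstring(
    "<div><p> Hello </p><script>var x = 1;</script><p>   </p><p>world</p></div>"
)
print(collect_text(root, separator=" ", strip=True))  # Hello world
```

The script content is dropped by ignore_tags and the whitespace-only paragraph by valid_values, which is the behavior the table above describes.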
urljoin
prettify
Return a prettified version of the element's inner HTML code.
has_class
iterancestors
Return a generator that loops over all ancestors of the element, starting with the element's parent.
find_ancestor
Loop over all ancestors of the element until one matches the passed function.
| PARAMETER | DESCRIPTION |
|---|---|
| func | A function that takes each ancestor as an argument and returns True/False. |

| RETURNS | DESCRIPTION |
|---|---|
| Optional[Selector] | The first ancestor that matches the function, or None. |
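The relationship between iterancestors and find_ancestor can be sketched with the standard library alone. The functions below are hypothetical stand-ins written against xml.etree.ElementTree, which has no parent pointers, so a child-to-parent map is built first; Scrapling's own methods need no such map.

```python
import xml.etree.ElementTree as ET


def iter_ancestors(element, parent_of):
    """Yield the ancestors of `element`, starting with its parent."""
    current = parent_of.get(element)
    while current is not None:
        yield current
        current = parent_of.get(current)


def find_ancestor(element, parent_of, func):
    """Return the first ancestor for which func(ancestor) is true, else None."""
    for ancestor in iter_ancestors(element, parent_of):
        if func(ancestor):
            return ancestor
    return None


root = ET.fromstring(
    '<html><body><div class="card"><p><a href="#">link</a></p></div></body></html>'
)
# ElementTree elements do not store their parent, so build a child -> parent map.
parent_of = {child: parent for parent in root.iter() for child in parent}

link = root.find(".//a")
card = find_ancestor(link, parent_of, lambda el: el.get("class") == "card")
print(card.tag)                                                   # div
print([a.tag for a in iter_ancestors(link, parent_of)])           # ['p', 'div', 'body', 'html']
print(find_ancestor(link, parent_of, lambda el: el.tag == "ul"))  # None
```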
get
Serialize this element to a string. For text nodes, returns the text value. For HTML elements, returns the outer HTML.
getall
__str__
__repr__
relocate
Search again for the element in the page tree. Used automatically when the page structure changes.
| PARAMETER | DESCRIPTION |
|---|---|
| element | The element we want to relocate in the tree. |
| percentage | The minimum percentage to accept; matches scoring lower are rejected. Be aware that the percentage calculation depends solely on the page structure, so don't change this number unless you know what you are doing! |
| selector_type | If True, the returned result will be converted to a Selectors object. |

| RETURNS | DESCRIPTION |
|---|---|
| Union[List[HtmlElement], Selectors] | A list of pure HTML elements that got the highest matching score, or a Selectors object. |
css
Search the current tree with CSS3 selectors.
Important: It's recommended to use the identifier argument if you plan to use a different selector later and want to relocate the same element(s).
| PARAMETER | DESCRIPTION |
|---|---|
| selector | The CSS3 selector to be used. |
| adaptive | When enabled, the function tries to relocate the element if it was saved before. |
| identifier | A string that will be used to save/retrieve the element's data in adaptive; otherwise the selector will be used. |
| auto_save | Automatically save new elements for adaptive use later. |
| percentage | The minimum percentage to accept while relocating elements. |

| RETURNS | DESCRIPTION |
|---|---|
| Selectors | |
xpath
Search the current tree with XPath selectors.
Important: It's recommended to use the identifier argument if you plan to use a different selector later and want to relocate the same element(s).
Note: Additional keyword arguments will be passed as XPath variables in the XPath expression!
| PARAMETER | DESCRIPTION |
|---|---|
| selector | The XPath selector to be used. |
| adaptive | When enabled, the function tries to relocate the element if it was saved before. |
| identifier | A string that will be used to save/retrieve the element's data in adaptive; otherwise the selector will be used. |
| auto_save | Automatically save new elements for adaptive use later. |
| percentage | The minimum percentage to accept while relocating elements. |

| RETURNS | DESCRIPTION |
|---|---|
| Selectors | |
find_all
Find elements using filters of your own making, for convenience.
| PARAMETER | DESCRIPTION |
|---|---|
| args | Tag name(s), an iterable of tag names, regex patterns, a function, or a dictionary of elements' attributes. Leave empty to select all. |
| kwargs | The attributes you want to filter elements by. |

| RETURNS | DESCRIPTION |
|---|---|
| Selectors | The matching elements. |
find
Find elements using filters of your own making, for convenience, then return the first result. Otherwise return None.
| PARAMETER | DESCRIPTION |
|---|---|
| args | Tag name(s), an iterable of tag names, regex patterns, a function, or a dictionary of elements' attributes. Leave empty to select all. |
| kwargs | The attributes you want to filter elements by. |

| RETURNS | DESCRIPTION |
|---|---|
| Optional[Selector] | The first matching element, or None. |
save
Save the element's unique properties to the storage for retrieval and relocation later.
| PARAMETER | DESCRIPTION |
|---|---|
| element | The element itself that we want to save to storage. |
| identifier | The identifier that will be used to retrieve the element later from the storage. See the docs for more info. |
retrieve
Using the identifier, search the storage and return the unique properties of the element.
| PARAMETER | DESCRIPTION |
|---|---|
| identifier | The identifier that will be used to retrieve the element from the storage. See the docs for more info. |

| RETURNS | DESCRIPTION |
|---|---|
| Optional[Dict[str, Any]] | A dictionary of the unique properties |
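The save/retrieve pair amounts to a key-value round trip keyed by the identifier. The sketch below shows that idea with the standard library's sqlite3 and json modules. The class TinyElementStore, its table layout, and the shape of the properties dictionary are all assumptions made for illustration; they do not describe SQLiteStorageSystem or the properties Scrapling actually stores.

```python
import json
import sqlite3


class TinyElementStore:
    """Hypothetical identifier -> properties store (not Scrapling's class)."""

    def __init__(self, path=":memory:"):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS elements "
            "(identifier TEXT PRIMARY KEY, properties TEXT NOT NULL)"
        )

    def save(self, properties, identifier):
        # INSERT OR REPLACE keeps only the latest snapshot per identifier.
        self.conn.execute(
            "INSERT OR REPLACE INTO elements VALUES (?, ?)",
            (identifier, json.dumps(properties)),
        )
        self.conn.commit()

    def retrieve(self, identifier):
        row = self.conn.execute(
            "SELECT properties FROM elements WHERE identifier = ?", (identifier,)
        ).fetchone()
        # Mirror the Optional[Dict[str, Any]] return type: None when unknown.
        return json.loads(row[0]) if row else None


store = TinyElementStore()
store.save({"tag": "a", "attributes": {"class": "buy"}, "text": "Buy now"}, "buy-button")
print(store.retrieve("buy-button"))
print(store.retrieve("missing"))  # None
```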
json
Return the response parsed as JSON if it is valid JSON; otherwise an error is raised.
re
Apply the given regex to the current text and return a list of strings with the matches.
| PARAMETER | DESCRIPTION |
|---|---|
| regex | Can be either a compiled regular expression or a string. |
| replace_entities | If enabled, character entity references are replaced by their corresponding character. |
| clean_match | If enabled, this will ignore all whitespaces and consecutive spaces while matching. |
| case_sensitive | If disabled, the function will set the regex to ignore letter case while compiling it. |
re_first
Apply the given regex to the text and return the first match if found; otherwise return the default value.
| PARAMETER | DESCRIPTION |
|---|---|
| regex | Can be either a compiled regular expression or a string. |
| default | The default value to be returned if there is no match. |
| replace_entities | If enabled, character entity references are replaced by their corresponding character. |
| clean_match | If enabled, this will ignore all whitespaces and consecutive spaces while matching. |
| case_sensitive | If disabled, the function will set the regex to ignore letter case while compiling it. |
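One way to picture the flags shared by re and re_first is the standard-library sketch below. The function names, the default values, and the reading of clean_match as "collapse whitespace runs before matching" are assumptions; the real methods operate on the element's text and may differ in detail, for example in how capture groups are handled.

```python
import html
import re


def regex_matches(text, regex, replace_entities=True, clean_match=False,
                  case_sensitive=True):
    """Return every whole match of `regex` in `text` (hypothetical helper)."""
    if isinstance(regex, str):
        # case_sensitive only matters when the pattern still has to be compiled.
        regex = re.compile(regex, 0 if case_sensitive else re.IGNORECASE)
    if clean_match:
        # Assumed reading of clean_match: collapse whitespace runs before matching.
        text = " ".join(text.split())
    matches = [m.group(0) for m in regex.finditer(text)]
    if replace_entities:
        # Turn references such as &amp; back into the characters they stand for.
        matches = [html.unescape(m) for m in matches]
    return matches


def regex_first(text, regex, default=None, **options):
    """Return the first match, or `default` when nothing matches."""
    matches = regex_matches(text, regex, **options)
    return matches[0] if matches else default


print(regex_matches("Tom &amp; Jerry", r"\w+ &amp; \w+"))          # ['Tom & Jerry']
print(regex_matches("SKU:   ab-12", r"sku: \w+-\d+",
                    clean_match=True, case_sensitive=False))       # ['SKU: ab-12']
print(regex_first("no digits here", r"\d+", default="n/a"))        # n/a
```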
find_similar
Find elements that are at the same tree depth in the page, with the same tag name, the same parent tag, and so on, then return the ones that match the current element's attributes with a percentage higher than the input threshold.
This function is inspired by AutoScraper and made for cases where you, for example, found a product div inside a products-list container and want to find other products using that element as a starting point, except that this function works in any case without depending on the element type.
| PARAMETER | DESCRIPTION |
|---|---|
| similarity_threshold | The percentage to use while comparing element attributes. Note: Elements found before attribute matching will share the same depth, tag name, parent tag name, and grandparent tag name, so they are 99% likely to be correct unless you are extremely unlucky. Attribute matching only comes into play after that, so don't change this number unless you are getting results you don't want. Also, if neither the current element nor the similar element has attributes, it's a 100% match. |
| ignore_attributes | Attribute names passed will be ignored while matching the attributes in the last step. The default value is to ignore |
| match_text | If True, element text content will be taken into account while matching. Not recommended in normal cases, but it depends. |

| RETURNS | DESCRIPTION |
|---|---|
| Selectors | A Selectors list of the similar elements. |
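The final attribute-matching step can be pictured with a small scoring function. The sketch below uses difflib.SequenceMatcher from the standard library as one plausible similarity measure. The function names, the averaging scheme, and the inclusive threshold comparison are assumptions chosen for illustration and are not confirmed to match Scrapling's scoring.

```python
from difflib import SequenceMatcher


def attribute_similarity(original, candidate, ignore_attributes=()):
    """Score two attribute dicts between 0.0 and 1.0 (hypothetical scoring)."""
    a = {k: v for k, v in original.items() if k not in ignore_attributes}
    b = {k: v for k, v in candidate.items() if k not in ignore_attributes}
    if not a and not b:
        return 1.0  # no attributes on either side counts as a full match
    keys = set(a) | set(b)
    # A missing attribute is compared as an empty string, which scores 0.0.
    scores = [SequenceMatcher(None, a.get(k, ""), b.get(k, "")).ratio() for k in keys]
    return sum(scores) / len(scores)


def keep_similar(original, candidates, similarity_threshold, ignore_attributes=()):
    """Keep the candidates whose score reaches the threshold."""
    return [
        c for c in candidates
        if attribute_similarity(original, c, ignore_attributes) >= similarity_threshold
    ]


product = {"class": "product", "href": "/p/1"}
candidates = [{"class": "product", "href": "/p/2"}, {"class": "banner"}, {}]
print(keep_similar(product, candidates, 0.9, ignore_attributes=("href",)))
# [{'class': 'product', 'href': '/p/2'}]
```

Ignoring href here keeps two products with different links from being scored as dissimilar, which is the role the ignore_attributes parameter plays in the table above.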
find_by_text
Find elements whose text content fully or partially matches the input.
| PARAMETER | DESCRIPTION |
|---|---|
| text | Text query to match. |
| first_match | Return the first element that matches the conditions; enabled by default. |
| partial | If enabled, the function returns elements that contain the input text. |
| case_sensitive | If enabled, letter case will be taken into consideration. |
| clean_match | If enabled, this will ignore all whitespaces and consecutive spaces while matching. |
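How the find_by_text switches interact can be sketched on plain strings. The helpers below are hypothetical and work on a list of strings instead of elements; every default except first_match (documented above as enabled) is an assumption, as is treating clean_match as whitespace normalization.

```python
def text_matches(element_text, query, partial=False, case_sensitive=False,
                 clean_match=True):
    """Decide whether one piece of text matches the query (hypothetical helper)."""
    if clean_match:
        # Assumed reading of clean_match: collapse whitespace runs and trim.
        element_text = " ".join(element_text.split())
    if not case_sensitive:
        element_text, query = element_text.lower(), query.lower()
    return query in element_text if partial else query == element_text


def find_by_text(texts, query, first_match=True, **options):
    """Return the first match (or None), or all matches when first_match is False."""
    matches = [t for t in texts if text_matches(t, query, **options)]
    if first_match:
        return matches[0] if matches else None
    return matches


texts = ["  Add   to cart ", "Add to wishlist", "ADD TO CART"]
print(repr(find_by_text(texts, "add to cart")))                     # '  Add   to cart '
print(len(find_by_text(texts, "add to", first_match=False, partial=True)))  # 3
print(find_by_text(texts, "checkout"))                              # None
```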
find_by_regex
Find elements whose text content matches the input regex pattern.
| PARAMETER | DESCRIPTION |
|---|---|
| query | Regex query/pattern to match. |
| first_match | Return the first element that matches the conditions; enabled by default. |
| case_sensitive | If enabled, letter case will be taken into consideration in the regex. |
| clean_match | If enabled, this will ignore all whitespaces and consecutive spaces while matching. |
scrapling.parser.Selectors
Bases: List[Selector]
The Selectors class is a subclass of the builtin List class, which provides a few additional methods.
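As a rough picture of what "a list with a few additional methods" means, the sketch below subclasses List and adds first, last, search, and filter members with the semantics documented below. The class name ResultList is hypothetical and the code is an illustration of the pattern, not the Selectors source.

```python
from typing import Callable, List, Optional, TypeVar

T = TypeVar("T")


class ResultList(List[T]):
    """Hypothetical list subclass mirroring a few Selectors conveniences."""

    @property
    def first(self) -> Optional[T]:
        # None instead of IndexError when the list is empty.
        return self[0] if self else None

    @property
    def last(self) -> Optional[T]:
        return self[-1] if self else None

    def search(self, func: Callable[[T], bool]) -> Optional[T]:
        # Return the first item accepted by func, or None.
        for item in self:
            if func(item):
                return item
        return None

    def filter(self, func: Callable[[T], bool]) -> "ResultList[T]":
        # Return the same list type so calls can be chained.
        return self.__class__(item for item in self if func(item))


results = ResultList(["a", "bb", "ccc"])
print(results.first, results.last)                  # a ccc
print(results.search(lambda s: len(s) > 1))         # bb
print(results.filter(lambda s: len(s) > 1))         # ['bb', 'ccc']
print(ResultList().first)                           # None
```

Because the subclass is still a list, indexing, slicing, iteration, and len() keep working unchanged.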
first (property)
Returns the first Selector item of the current list, or None if the list is empty.
last (property)
Returns the last Selector item of the current list, or None if the list is empty.
__getitem__
xpath
Call the .xpath() method for each element in this list and return their results as another Selectors class.
Important: It's recommended to use the identifier argument if you plan to use a different selector later and want to relocate the same element(s).
Note: Additional keyword arguments will be passed as XPath variables in the XPath expression!
| PARAMETER | DESCRIPTION |
|---|---|
| selector | The XPath selector to be used. |
| identifier | A string that will be used to retrieve the element's data in adaptive; otherwise the selector will be used. |
| auto_save | Automatically save new elements for adaptive use later. |
| percentage | The minimum percentage to accept while relocating elements. |

| RETURNS | DESCRIPTION |
|---|---|
| Selectors | |
css
Call the .css() method for each element in this list and return their results flattened as another Selectors class.
Important: It's recommended to use the identifier argument if you plan to use a different selector later and want to relocate the same element(s).
| PARAMETER | DESCRIPTION |
|---|---|
| selector | The CSS3 selector to be used. |
| identifier | A string that will be used to retrieve the element's data in adaptive; otherwise the selector will be used. |
| auto_save | Automatically save new elements for adaptive use later. |
| percentage | The minimum percentage to accept while relocating elements. |

| RETURNS | DESCRIPTION |
|---|---|
| Selectors | |
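"Flattened" here means the per-element result lists are merged into one list rather than returned as a list of lists. A minimal standard-library sketch of that behavior follows, with ElementTree's findall standing in for .css(); the helper name select_each is hypothetical.

```python
import xml.etree.ElementTree as ET
from itertools import chain


def select_each(elements, path):
    """Run the same query on every element and merge the results in order."""
    return list(chain.from_iterable(el.findall(path) for el in elements))


root = ET.fromstring("<ul><li><a>1</a><a>2</a></li><li><a>3</a></li></ul>")
items = root.findall("li")

links = select_each(items, "a")
print([a.text for a in links])  # ['1', '2', '3']
```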
re
Call the .re() method for each element in this list and return their results flattened as a List of TextHandler.
| PARAMETER | DESCRIPTION |
|---|---|
| regex | Can be either a compiled regular expression or a string. |
| replace_entities | If enabled, character entity references are replaced by their corresponding character. |
| clean_match | If enabled, this will ignore all whitespaces and consecutive spaces while matching. |
| case_sensitive | If disabled, the function will set the regex to ignore letter case while compiling it. |
re_first
Call the .re_first() method for each element in this list and return the first result, or the default value otherwise.
| PARAMETER | DESCRIPTION |
|---|---|
| regex | Can be either a compiled regular expression or a string. |
| default | The default value to be returned if there is no match. |
| replace_entities | If enabled, character entity references are replaced by their corresponding character. |
| clean_match | If enabled, this will ignore all whitespaces and consecutive spaces while matching. |
| case_sensitive | If disabled, the function will set the regex to ignore letter case while compiling it. |
search
Loop over all current elements and return the first element that matches the passed function.
| PARAMETER | DESCRIPTION |
|---|---|
| func | A function that takes each element as an argument and returns True/False. |

| RETURNS | DESCRIPTION |
|---|---|
| Optional[Selector] | The first element that matches the function, or None. |
filter
Filter the current elements based on the passed function.
| PARAMETER | DESCRIPTION |
|---|---|
| func | A function that takes each element as an argument and returns True/False. |

| RETURNS | DESCRIPTION |
|---|---|
| Selectors | The new Selectors object. |
get
Returns the serialized string of the first element, or the default if the list is empty.
| PARAMETER | DESCRIPTION |
|---|---|
| default | The default value to return if the current list is empty. |