Skip to content

MCP Server API Reference

The Scrapling MCP Server provides nine powerful tools for web scraping through the Model Context Protocol (MCP). This server integrates Scrapling's capabilities directly into AI chatbots and agents, allowing conversational web scraping with advanced anti-bot bypass features.

You can start the MCP server by running:

scrapling mcp

Or import the server class directly:

from scrapling.core.ai import ScraplingMCPServer

server = ScraplingMCPServer()
server.serve(http=False, host="0.0.0.0", port=8000)

Response Model

The standardized response structure that's returned by all MCP server tools:

scrapling.core.ai.ResponseModel

Bases: BaseModel


              flowchart TD
              scrapling.core.ai.ResponseModel[ResponseModel]

              

              click scrapling.core.ai.ResponseModel href "" "scrapling.core.ai.ResponseModel"
            

Request's response information structure.

status class-attribute instance-attribute

status = Field(
    description="The status code returned by the website."
)

content class-attribute instance-attribute

content = Field(
    description="The content as Markdown/HTML or the text content of the page."
)

url class-attribute instance-attribute

url = Field(
    description="The URL given by the user that resulted in this response."
)

Session Models

Model classes for session management:

scrapling.core.ai.SessionInfo

Bases: BaseModel


              flowchart TD
              scrapling.core.ai.SessionInfo[SessionInfo]

              

              click scrapling.core.ai.SessionInfo href "" "scrapling.core.ai.SessionInfo"
            

Information about an open browser session.

session_id class-attribute instance-attribute

session_id = Field(
    description="The unique identifier of the session."
)

session_type class-attribute instance-attribute

session_type = Field(
    description="The type of the session: 'dynamic' or 'stealthy'."
)

created_at class-attribute instance-attribute

created_at = Field(
    description="ISO timestamp of when the session was created."
)

is_alive class-attribute instance-attribute

is_alive = Field(
    description="Whether the session is still alive and usable."
)

scrapling.core.ai.SessionCreatedModel

Bases: SessionInfo


              flowchart TD
              scrapling.core.ai.SessionCreatedModel[SessionCreatedModel]
              scrapling.core.ai.SessionInfo[SessionInfo]

                              scrapling.core.ai.SessionInfo --> scrapling.core.ai.SessionCreatedModel
                


              click scrapling.core.ai.SessionCreatedModel href "" "scrapling.core.ai.SessionCreatedModel"
              click scrapling.core.ai.SessionInfo href "" "scrapling.core.ai.SessionInfo"
            

Response returned when a new session is created.

message class-attribute instance-attribute

message = Field(description='A confirmation message.')

session_id class-attribute instance-attribute

session_id = Field(
    description="The unique identifier of the session."
)

session_type class-attribute instance-attribute

session_type = Field(
    description="The type of the session: 'dynamic' or 'stealthy'."
)

created_at class-attribute instance-attribute

created_at = Field(
    description="ISO timestamp of when the session was created."
)

is_alive class-attribute instance-attribute

is_alive = Field(
    description="Whether the session is still alive and usable."
)

scrapling.core.ai.SessionClosedModel

Bases: BaseModel


              flowchart TD
              scrapling.core.ai.SessionClosedModel[SessionClosedModel]

              

              click scrapling.core.ai.SessionClosedModel href "" "scrapling.core.ai.SessionClosedModel"
            

Response returned when a session is closed.

session_id class-attribute instance-attribute

session_id = Field(
    description="The unique identifier of the closed session."
)

message class-attribute instance-attribute

message = Field(description='A confirmation message.')

MCP Server Class

The main MCP server class that provides all web scraping tools:

scrapling.core.ai.ScraplingMCPServer

ScraplingMCPServer()
Source code in scrapling/core/ai.py
def __init__(self):
    self._sessions: Dict[str, _SessionEntry] = {}

open_session async

open_session(
    session_type,
    session_id=None,
    headless=True,
    google_search=True,
    real_chrome=False,
    wait=0,
    proxy=None,
    timezone_id=None,
    locale=None,
    extra_headers=None,
    useragent=None,
    cdp_url=None,
    timeout=30000,
    disable_resources=False,
    wait_selector=None,
    cookies=None,
    network_idle=False,
    wait_selector_state="attached",
    max_pages=5,
    hide_canvas=False,
    block_webrtc=False,
    allow_webgl=True,
    solve_cloudflare=False,
    additional_args=None,
)

Open a persistent browser session that can be reused across multiple fetch calls. This avoids the overhead of launching a new browser for each request. Use close_session to close the session when done, and list_sessions to see all active sessions.

PARAMETER DESCRIPTION
session_type

The type of session to open. Use "dynamic" for standard Playwright browser, or "stealthy" for anti-bot bypass with fingerprint spoofing.

TYPE: SessionType

session_id

Optional custom session ID. If not provided, a random 12-character hex ID will be generated. Useful for naming sessions for easier management.

TYPE: Optional[str] DEFAULT: None

headless

Run the browser in headless/hidden (default), or headful/visible mode.

TYPE: bool DEFAULT: True

google_search

Enabled by default, Scrapling will set a Google referer header.

TYPE: bool DEFAULT: True

real_chrome

If you have a Chrome browser installed on your device, enable this, and the Fetcher will launch an instance of your browser and use it.

TYPE: bool DEFAULT: False

wait

The time (milliseconds) the fetcher will wait after everything finishes before closing the page and returning the Response object.

TYPE: int | float DEFAULT: 0

proxy

The proxy to be used with requests, it can be a string or a dictionary with the keys 'server', 'username', and 'password' only.

TYPE: Optional[str | Dict[str, str]] DEFAULT: None

timezone_id

Changes the timezone of the browser. Defaults to the system timezone.

TYPE: str | None DEFAULT: None

locale

Specify user locale, for example, en-GB, de-DE, etc.

TYPE: str | None DEFAULT: None

extra_headers

A dictionary of extra headers to add to the request.

TYPE: Optional[Dict[str, str]] DEFAULT: None

useragent

Pass a useragent string to be used. Otherwise the fetcher will generate a real Useragent of the same browser and use it.

TYPE: Optional[str] DEFAULT: None

cdp_url

Instead of launching a new browser instance, connect to this CDP URL to control real browsers through CDP.

TYPE: Optional[str] DEFAULT: None

timeout

The timeout in milliseconds that is used in all operations and waits through the page. The default is 30,000.

TYPE: int | float DEFAULT: 30000

disable_resources

Drop requests for unnecessary resources for a speed boost.

TYPE: bool DEFAULT: False

wait_selector

Wait for a specific CSS selector to be in a specific state.

TYPE: Optional[str] DEFAULT: None

cookies

Set cookies for the session. It should be in a dictionary format that Playwright accepts.

TYPE: Sequence[SetCookieParam] | None DEFAULT: None

network_idle

Wait for the page until there are no network connections for at least 500 ms.

TYPE: bool DEFAULT: False

wait_selector_state

The state to wait for the selector given with wait_selector. The default state is attached.

TYPE: SelectorWaitStates DEFAULT: 'attached'

max_pages

Maximum number of concurrent pages/tabs in the browser. Defaults to 5. Higher values allow more parallel fetches.

TYPE: int DEFAULT: 5

hide_canvas

(Stealthy only) Add random noise to canvas operations to prevent fingerprinting.

TYPE: bool DEFAULT: False

block_webrtc

(Stealthy only) Forces WebRTC to respect proxy settings to prevent local IP address leak.

TYPE: bool DEFAULT: False

allow_webgl

(Stealthy only) Enabled by default. Disabling WebGL is not recommended as many WAFs now check if WebGL is enabled.

TYPE: bool DEFAULT: True

solve_cloudflare

(Stealthy only) Solves all types of the Cloudflare's Turnstile/Interstitial challenges.

TYPE: bool DEFAULT: False

additional_args

(Stealthy only) Additional arguments to be passed to Playwright's context as additional settings.

TYPE: Optional[Dict] DEFAULT: None

Source code in scrapling/core/ai.py
async def open_session(
    self,
    session_type: SessionType,
    session_id: Optional[str] = None,
    headless: bool = True,
    google_search: bool = True,
    real_chrome: bool = False,
    wait: int | float = 0,
    proxy: Optional[str | Dict[str, str]] = None,
    timezone_id: str | None = None,
    locale: str | None = None,
    extra_headers: Optional[Dict[str, str]] = None,
    useragent: Optional[str] = None,
    cdp_url: Optional[str] = None,
    timeout: int | float = 30000,
    disable_resources: bool = False,
    wait_selector: Optional[str] = None,
    cookies: Sequence[SetCookieParam] | None = None,
    network_idle: bool = False,
    wait_selector_state: SelectorWaitStates = "attached",
    max_pages: int = 5,
    # Stealthy-only params (ignored for dynamic sessions)
    hide_canvas: bool = False,
    block_webrtc: bool = False,
    allow_webgl: bool = True,
    solve_cloudflare: bool = False,
    additional_args: Optional[Dict] = None,
) -> SessionCreatedModel:
    """Open a persistent browser session that can be reused across multiple fetch calls.
    This avoids the overhead of launching a new browser for each request.
    Use close_session to close the session when done, and list_sessions to see all active sessions.

    :param session_type: The type of session to open. Use "dynamic" for standard Playwright browser, or "stealthy" for anti-bot bypass with fingerprint spoofing.
    :param session_id: Optional custom session ID. If not provided, a random 12-character hex ID will be generated. Useful for naming sessions for easier management.
    :param headless: Run the browser in headless/hidden (default), or headful/visible mode.
    :param google_search: Enabled by default, Scrapling will set a Google referer header.
    :param real_chrome: If you have a Chrome browser installed on your device, enable this, and the Fetcher will launch an instance of your browser and use it.
    :param wait: The time (milliseconds) the fetcher will wait after everything finishes before closing the page and returning the Response object.
    :param proxy: The proxy to be used with requests, it can be a string or a dictionary with the keys 'server', 'username', and 'password' only.
    :param timezone_id: Changes the timezone of the browser. Defaults to the system timezone.
    :param locale: Specify user locale, for example, `en-GB`, `de-DE`, etc.
    :param extra_headers: A dictionary of extra headers to add to the request.
    :param useragent: Pass a useragent string to be used. Otherwise the fetcher will generate a real Useragent of the same browser and use it.
    :param cdp_url: Instead of launching a new browser instance, connect to this CDP URL to control real browsers through CDP.
    :param timeout: The timeout in milliseconds that is used in all operations and waits through the page. The default is 30,000.
    :param disable_resources: Drop requests for unnecessary resources for a speed boost.
    :param wait_selector: Wait for a specific CSS selector to be in a specific state.
    :param cookies: Set cookies for the session. It should be in a dictionary format that Playwright accepts.
    :param network_idle: Wait for the page until there are no network connections for at least 500 ms.
    :param wait_selector_state: The state to wait for the selector given with `wait_selector`. The default state is `attached`.
    :param max_pages: Maximum number of concurrent pages/tabs in the browser. Defaults to 5. Higher values allow more parallel fetches.
    :param hide_canvas: (Stealthy only) Add random noise to canvas operations to prevent fingerprinting.
    :param block_webrtc: (Stealthy only) Forces WebRTC to respect proxy settings to prevent local IP address leak.
    :param allow_webgl: (Stealthy only) Enabled by default. Disabling WebGL is not recommended as many WAFs now check if WebGL is enabled.
    :param solve_cloudflare: (Stealthy only) Solves all types of the Cloudflare's Turnstile/Interstitial challenges.
    :param additional_args: (Stealthy only) Additional arguments to be passed to Playwright's context as additional settings.
    """
    session_id = session_id or uuid4().hex[:12]
    if session_id in self._sessions:
        raise ValueError(
            f"Session '{session_id}' already exists. Use a different ID or close the existing session first."
        )

    common_kwargs: Dict[str, Any] = dict(
        wait=wait,
        proxy=proxy,
        locale=locale,
        timeout=timeout,
        cookies=cookies,
        cdp_url=cdp_url,
        headless=headless,
        block_ads=True,
        max_pages=max_pages,
        useragent=useragent,
        timezone_id=timezone_id,
        real_chrome=real_chrome,
        network_idle=network_idle,
        wait_selector=wait_selector,
        google_search=google_search,
        extra_headers=extra_headers,
        disable_resources=disable_resources,
        wait_selector_state=wait_selector_state,
    )

    session: Union[AsyncDynamicSession, AsyncStealthySession]
    if session_type == "stealthy":
        session = AsyncStealthySession(
            **common_kwargs,
            hide_canvas=hide_canvas,
            block_webrtc=block_webrtc,
            allow_webgl=allow_webgl,
            solve_cloudflare=solve_cloudflare,
            additional_args=additional_args,
        )
    else:
        session = AsyncDynamicSession(**common_kwargs)

    await session.start()

    entry = _SessionEntry(session=session, session_type=session_type)
    self._sessions[session_id] = entry

    return SessionCreatedModel(
        session_id=session_id,
        session_type=session_type,
        created_at=entry.created_at,
        is_alive=True,
        message=f"Session '{session_id}' ({session_type}) created successfully.",
    )

close_session async

close_session(session_id)

Close a persistent browser session and free its resources.

PARAMETER DESCRIPTION
session_id

The unique identifier of the session to close. Use list_sessions to see active sessions.

TYPE: str

Source code in scrapling/core/ai.py
async def close_session(
    self,
    session_id: str,
) -> SessionClosedModel:
    """Close a persistent browser session and free its resources.

    :param session_id: The unique identifier of the session to close. Use list_sessions to see active sessions.
    """
    entry = self._sessions.pop(session_id, None)
    if entry is None:
        raise ValueError(f"Session '{session_id}' not found. Use list_sessions to see active sessions.")

    await entry.session.close()
    return SessionClosedModel(
        session_id=session_id,
        message=f"Session '{session_id}' closed successfully.",
    )

list_sessions async

list_sessions()

List all active browser sessions with their details.

Source code in scrapling/core/ai.py
async def list_sessions(self) -> List[SessionInfo]:
    """List all active browser sessions with their details."""
    return [
        SessionInfo(
            session_id=sid,
            session_type=entry.session_type,
            created_at=entry.created_at,
            is_alive=entry.session._is_alive,
        )
        for sid, entry in self._sessions.items()
    ]

screenshot async

screenshot(
    url,
    session_id,
    image_type="png",
    full_page=False,
    quality=None,
    wait=0,
    wait_selector=None,
    wait_selector_state="attached",
    network_idle=False,
    timeout=30000,
)

Capture a screenshot of a web page using an existing browser session and return it as an image. A browser session must be opened first with open_session (either dynamic or stealthy); the session ID is then passed here.

PARAMETER DESCRIPTION
url

The URL to navigate to and capture.

TYPE: str

session_id

ID of an open browser session created with open_session.

TYPE: str

image_type

Image format. Defaults to "png". Use "jpeg" for smaller file sizes.

TYPE: ScreenshotType DEFAULT: 'png'

full_page

When True, captures the full scrollable page instead of just the viewport. Defaults to False.

TYPE: bool DEFAULT: False

quality

Image quality (0-100) for JPEG only. Raises if passed with image_type="png".

TYPE: Optional[int] DEFAULT: None

wait

Time in milliseconds to wait after page load before capturing. Defaults to 0.

TYPE: int | float DEFAULT: 0

wait_selector

Optional CSS selector to wait for before capturing.

TYPE: Optional[str] DEFAULT: None

wait_selector_state

State to wait for the selector. Defaults to "attached".

TYPE: SelectorWaitStates DEFAULT: 'attached'

network_idle

Wait for the page until there are no network connections for at least 500 ms.

TYPE: bool DEFAULT: False

timeout

Timeout in milliseconds for page operations. Defaults to 30,000.

TYPE: int | float DEFAULT: 30000

Source code in scrapling/core/ai.py
async def screenshot(
    self,
    url: str,
    session_id: str,
    image_type: ScreenshotType = "png",
    full_page: bool = False,
    quality: Optional[int] = None,
    wait: int | float = 0,
    wait_selector: Optional[str] = None,
    wait_selector_state: SelectorWaitStates = "attached",
    network_idle: bool = False,
    timeout: int | float = 30000,
) -> List[ImageContent | TextContent]:
    """Capture a screenshot of a web page using an existing browser session and return it as an image.
    A browser session must be opened first with `open_session` (either `dynamic` or `stealthy`); the session ID is then passed here.

    :param url: The URL to navigate to and capture.
    :param session_id: ID of an open browser session created with `open_session`.
    :param image_type: Image format. Defaults to "png". Use "jpeg" for smaller file sizes.
    :param full_page: When True, captures the full scrollable page instead of just the viewport. Defaults to False.
    :param quality: Image quality (0-100) for JPEG only. Raises if passed with `image_type="png"`.
    :param wait: Time in milliseconds to wait after page load before capturing. Defaults to 0.
    :param wait_selector: Optional CSS selector to wait for before capturing.
    :param wait_selector_state: State to wait for the selector. Defaults to "attached".
    :param network_idle: Wait for the page until there are no network connections for at least 500 ms.
    :param timeout: Timeout in milliseconds for page operations. Defaults to 30,000.
    """
    if quality is not None and image_type != "jpeg":
        raise ValueError("'quality' is only valid when 'image_type' is 'jpeg'.")

    entry = self._get_session(session_id, expected_type=None)

    screenshot_kwargs: Dict[str, Any] = {"type": image_type, "full_page": full_page}
    if quality is not None:
        screenshot_kwargs["quality"] = quality

    captured: Dict[str, Any] = {}

    async def _capture(page: Any) -> None:
        try:
            captured["bytes"] = await page.screenshot(**screenshot_kwargs)
            captured["url"] = page.url
        except Exception as exc:
            captured["error"] = exc

    await entry.session.fetch(
        url,
        wait=wait,
        timeout=timeout,
        network_idle=network_idle,
        wait_selector=wait_selector,
        wait_selector_state=wait_selector_state,
        page_action=_capture,
    )

    if "error" in captured:
        raise captured["error"]
    if "bytes" not in captured:
        raise RuntimeError(f"Failed to capture screenshot for {url}")

    image = Image(data=captured["bytes"], format=image_type).to_image_content()
    return [image, TextContent(type="text", text=captured["url"])]

get async staticmethod

get(
    url,
    impersonate="chrome",
    extraction_type="markdown",
    css_selector=None,
    main_content_only=True,
    params=None,
    headers=None,
    cookies=None,
    timeout=30,
    follow_redirects="safe",
    max_redirects=30,
    retries=3,
    retry_delay=1,
    proxy=None,
    proxy_auth=None,
    auth=None,
    verify=True,
    http3=False,
    stealthy_headers=True,
)

Make GET HTTP request to a URL and return a structured output of the result. Note: This is only suitable for low-mid protection levels. For high-protection levels or websites that require JS loading, use the other tools directly. Note: If the css_selector resolves to more than one element, all the elements will be returned.

PARAMETER DESCRIPTION
url

The URL to request.

TYPE: str

impersonate

Browser version to impersonate its fingerprint. It's using the latest chrome version by default.

TYPE: ImpersonateType DEFAULT: 'chrome'

extraction_type

The type of content to extract from the page. Defaults to "markdown". Options are: - Markdown will convert the page content to Markdown format. - HTML will return the raw HTML content of the page. - Text will return the text content of the page.

TYPE: extraction_types DEFAULT: 'markdown'

css_selector

CSS selector to extract the content from the page. If main_content_only is True, then it will be executed on the main content of the page. Defaults to None.

TYPE: Optional[str] DEFAULT: None

main_content_only

Whether to extract only the main content of the page. Defaults to True. The main content here is the data inside the <body> tag.

TYPE: bool DEFAULT: True

params

Query string parameters for the request.

TYPE: Optional[Dict] DEFAULT: None

headers

Headers to include in the request.

TYPE: Optional[Mapping[str, Optional[str]]] DEFAULT: None

cookies

Cookies to use in the request.

TYPE: Optional[Dict[str, str]] DEFAULT: None

timeout

Number of seconds to wait before timing out.

TYPE: Optional[int | float] DEFAULT: 30

follow_redirects

Whether to follow redirects. Defaults to "safe", which follows redirects but rejects those targeting internal/private IPs (SSRF protection). Pass True to follow all redirects without restriction.

TYPE: FollowRedirects DEFAULT: 'safe'

max_redirects

Maximum number of redirects. Default 30, use -1 for unlimited.

TYPE: int DEFAULT: 30

retries

Number of retry attempts. Defaults to 3.

TYPE: Optional[int] DEFAULT: 3

retry_delay

Number of seconds to wait between retry attempts. Defaults to 1 second.

TYPE: Optional[int] DEFAULT: 1

proxy

Proxy URL to use. Format: "http://username:password@localhost:8030". Cannot be used together with the proxies parameter.

TYPE: Optional[str] DEFAULT: None

proxy_auth

HTTP basic auth for proxy in dictionary format with username and password keys.

TYPE: Optional[Dict[str, str]] DEFAULT: None

auth

HTTP basic auth in dictionary format with username and password keys.

TYPE: Optional[Dict[str, str]] DEFAULT: None

verify

Whether to verify HTTPS certificates.

TYPE: Optional[bool] DEFAULT: True

http3

Whether to use HTTP3. Defaults to False. It might be problematic if used it with impersonate.

TYPE: Optional[bool] DEFAULT: False

stealthy_headers

If enabled (default), it creates and adds real browser headers. It also sets a Google referer header.

TYPE: Optional[bool] DEFAULT: True

Source code in scrapling/core/ai.py
@staticmethod
async def get(
    url: str,
    impersonate: ImpersonateType = "chrome",
    extraction_type: extraction_types = "markdown",
    css_selector: Optional[str] = None,
    main_content_only: bool = True,
    params: Optional[Dict] = None,
    headers: Optional[Mapping[str, Optional[str]]] = None,
    cookies: Optional[Dict[str, str]] = None,
    timeout: Optional[int | float] = 30,
    follow_redirects: FollowRedirects = "safe",
    max_redirects: int = 30,
    retries: Optional[int] = 3,
    retry_delay: Optional[int] = 1,
    proxy: Optional[str] = None,
    proxy_auth: Optional[Dict[str, str]] = None,
    auth: Optional[Dict[str, str]] = None,
    verify: Optional[bool] = True,
    http3: Optional[bool] = False,
    stealthy_headers: Optional[bool] = True,
) -> ResponseModel:
    """Make GET HTTP request to a URL and return a structured output of the result.
    Note: This is only suitable for low-mid protection levels. For high-protection levels or websites that require JS loading, use the other tools directly.
    Note: If the `css_selector` resolves to more than one element, all the elements will be returned.

    :param url: The URL to request.
    :param impersonate: Browser version to impersonate its fingerprint. It's using the latest chrome version by default.
    :param extraction_type: The type of content to extract from the page. Defaults to "markdown". Options are:
        - Markdown will convert the page content to Markdown format.
        - HTML will return the raw HTML content of the page.
        - Text will return the text content of the page.
    :param css_selector: CSS selector to extract the content from the page. If main_content_only is True, then it will be executed on the main content of the page. Defaults to None.
    :param main_content_only: Whether to extract only the main content of the page. Defaults to True. The main content here is the data inside the `<body>` tag.
    :param params: Query string parameters for the request.
    :param headers: Headers to include in the request.
    :param cookies: Cookies to use in the request.
    :param timeout: Number of seconds to wait before timing out.
    :param follow_redirects: Whether to follow redirects. Defaults to "safe", which follows redirects but rejects those targeting internal/private IPs (SSRF protection).
        Pass True to follow all redirects without restriction.
    :param max_redirects: Maximum number of redirects. Default 30, use -1 for unlimited.
    :param retries: Number of retry attempts. Defaults to 3.
    :param retry_delay: Number of seconds to wait between retry attempts. Defaults to 1 second.
    :param proxy: Proxy URL to use. Format: "http://username:password@localhost:8030".
                 Cannot be used together with the `proxies` parameter.
    :param proxy_auth: HTTP basic auth for proxy in dictionary format with `username` and `password` keys.
    :param auth: HTTP basic auth in dictionary format with `username` and `password` keys.
    :param verify: Whether to verify HTTPS certificates.
    :param http3: Whether to use HTTP3. Defaults to False. It might be problematic if used it with `impersonate`.
    :param stealthy_headers: If enabled (default), it creates and adds real browser headers. It also sets a Google referer header.
    """
    results = await ScraplingMCPServer.bulk_get(
        urls=[url],
        impersonate=impersonate,
        extraction_type=extraction_type,
        css_selector=css_selector,
        main_content_only=main_content_only,
        params=params,
        headers=headers,
        cookies=cookies,
        timeout=timeout,
        follow_redirects=follow_redirects,
        max_redirects=max_redirects,
        retries=retries,
        retry_delay=retry_delay,
        proxy=proxy,
        proxy_auth=proxy_auth,
        auth=auth,
        verify=verify,
        http3=http3,
        stealthy_headers=stealthy_headers,
    )
    return results[0]

bulk_get async staticmethod

bulk_get(
    urls,
    impersonate="chrome",
    extraction_type="markdown",
    css_selector=None,
    main_content_only=True,
    params=None,
    headers=None,
    cookies=None,
    timeout=30,
    follow_redirects="safe",
    max_redirects=30,
    retries=3,
    retry_delay=1,
    proxy=None,
    proxy_auth=None,
    auth=None,
    verify=True,
    http3=False,
    stealthy_headers=True,
)

Make GET HTTP request to a group of URLs and for each URL, return a structured output of the result. Note: This is only suitable for low-mid protection levels. For high-protection levels or websites that require JS loading, use the other tools directly. Note: If the css_selector resolves to more than one element, all the elements will be returned.

PARAMETER DESCRIPTION
urls

A list of the URLs to request.

TYPE: List[str]

impersonate

Browser version to impersonate its fingerprint. It's using the latest chrome version by default.

TYPE: ImpersonateType DEFAULT: 'chrome'

extraction_type

The type of content to extract from the page. Defaults to "markdown". Options are: - Markdown will convert the page content to Markdown format. - HTML will return the raw HTML content of the page. - Text will return the text content of the page.

TYPE: extraction_types DEFAULT: 'markdown'

css_selector

CSS selector to extract the content from the page. If main_content_only is True, then it will be executed on the main content of the page. Defaults to None.

TYPE: Optional[str] DEFAULT: None

main_content_only

Whether to extract only the main content of the page. Defaults to True. The main content here is the data inside the <body> tag.

TYPE: bool DEFAULT: True

params

Query string parameters for the request.

TYPE: Optional[Dict] DEFAULT: None

headers

Headers to include in the request.

TYPE: Optional[Mapping[str, Optional[str]]] DEFAULT: None

cookies

Cookies to use in the request.

TYPE: Optional[Dict[str, str]] DEFAULT: None

timeout

Number of seconds to wait before timing out.

TYPE: Optional[int | float] DEFAULT: 30

follow_redirects

Whether to follow redirects. Defaults to "safe", which follows redirects but rejects those targeting internal/private IPs (SSRF protection). Pass True to follow all redirects without restriction.

TYPE: FollowRedirects DEFAULT: 'safe'

max_redirects

Maximum number of redirects. Default 30, use -1 for unlimited.

TYPE: int DEFAULT: 30

retries

Number of retry attempts. Defaults to 3.

TYPE: Optional[int] DEFAULT: 3

retry_delay

Number of seconds to wait between retry attempts. Defaults to 1 second.

TYPE: Optional[int] DEFAULT: 1

proxy

Proxy URL to use. Format: "http://username:password@localhost:8030". Cannot be used together with the proxies parameter.

TYPE: Optional[str] DEFAULT: None

proxy_auth

HTTP basic auth for proxy in dictionary format with username and password keys.

TYPE: Optional[Dict[str, str]] DEFAULT: None

auth

HTTP basic auth in dictionary format with username and password keys.

TYPE: Optional[Dict[str, str]] DEFAULT: None

verify

Whether to verify HTTPS certificates.

TYPE: Optional[bool] DEFAULT: True

http3

Whether to use HTTP3. Defaults to False. It might be problematic if used it with impersonate.

TYPE: Optional[bool] DEFAULT: False

stealthy_headers

If enabled (default), it creates and adds real browser headers. It also sets a Google referer header.

TYPE: Optional[bool] DEFAULT: True

Source code in scrapling/core/ai.py
@staticmethod
async def bulk_get(
    urls: List[str],
    impersonate: ImpersonateType = "chrome",
    extraction_type: extraction_types = "markdown",
    css_selector: Optional[str] = None,
    main_content_only: bool = True,
    params: Optional[Dict] = None,
    headers: Optional[Mapping[str, Optional[str]]] = None,
    cookies: Optional[Dict[str, str]] = None,
    timeout: Optional[int | float] = 30,
    follow_redirects: FollowRedirects = "safe",
    max_redirects: int = 30,
    retries: Optional[int] = 3,
    retry_delay: Optional[int] = 1,
    proxy: Optional[str] = None,
    proxy_auth: Optional[Dict[str, str]] = None,
    auth: Optional[Dict[str, str]] = None,
    verify: Optional[bool] = True,
    http3: Optional[bool] = False,
    stealthy_headers: Optional[bool] = True,
) -> List[ResponseModel]:
    """Make GET HTTP request to a group of URLs and for each URL, return a structured output of the result.
    Note: This is only suitable for low-mid protection levels. For high-protection levels or websites that require JS loading, use the other tools directly.
    Note: If the `css_selector` resolves to more than one element, all the elements will be returned.

    :param urls: A list of the URLs to request.
    :param impersonate: Browser version to impersonate its fingerprint. It's using the latest chrome version by default.
    :param extraction_type: The type of content to extract from the page. Defaults to "markdown". Options are:
        - Markdown will convert the page content to Markdown format.
        - HTML will return the raw HTML content of the page.
        - Text will return the text content of the page.
    :param css_selector: CSS selector to extract the content from the page. If main_content_only is True, then it will be executed on the main content of the page. Defaults to None.
    :param main_content_only: Whether to extract only the main content of the page. Defaults to True. The main content here is the data inside the `<body>` tag.
    :param params: Query string parameters for the request.
    :param headers: Headers to include in the request.
    :param cookies: Cookies to use in the request.
    :param timeout: Number of seconds to wait before timing out.
    :param follow_redirects: Whether to follow redirects. Defaults to "safe", which follows redirects but rejects those targeting internal/private IPs (SSRF protection).
        Pass True to follow all redirects without restriction.
    :param max_redirects: Maximum number of redirects. Default 30, use -1 for unlimited.
    :param retries: Number of retry attempts. Defaults to 3.
    :param retry_delay: Number of seconds to wait between retry attempts. Defaults to 1 second.
    :param proxy: Proxy URL to use. Format: "http://username:password@localhost:8030".
                 Cannot be used together with the `proxies` parameter.
    :param proxy_auth: HTTP basic auth for proxy in dictionary format with `username` and `password` keys.
    :param auth: HTTP basic auth in dictionary format with `username` and `password` keys.
    :param verify: Whether to verify HTTPS certificates.
    :param http3: Whether to use HTTP3. Defaults to False. It might be problematic if used it with `impersonate`.
    :param stealthy_headers: If enabled (default), it creates and adds real browser headers. It also sets a Google referer header.
    """
    normalized_proxy_auth = _normalize_credentials(proxy_auth)
    normalized_auth = _normalize_credentials(auth)

    async with FetcherSession() as session:
        tasks: List[Any] = [
            session.get(
                url,
                auth=normalized_auth,
                proxy=proxy,
                http3=http3,
                verify=verify,
                params=params,
                headers=headers,
                cookies=cookies,
                timeout=timeout,
                retries=retries,
                proxy_auth=normalized_proxy_auth,
                retry_delay=retry_delay,
                impersonate=impersonate,
                max_redirects=max_redirects,
                follow_redirects=follow_redirects,
                stealthy_headers=stealthy_headers,
            )
            for url in urls
        ]
        responses = await gather(*tasks)
        return [_translate_response(page, extraction_type, css_selector, main_content_only) for page in responses]

fetch async

fetch(
    url,
    extraction_type="markdown",
    css_selector=None,
    main_content_only=True,
    headless=True,
    google_search=True,
    real_chrome=False,
    wait=0,
    proxy=None,
    timezone_id=None,
    locale=None,
    extra_headers=None,
    useragent=None,
    cdp_url=None,
    timeout=30000,
    disable_resources=False,
    wait_selector=None,
    cookies=None,
    network_idle=False,
    wait_selector_state="attached",
    session_id=None,
)

Use playwright to open a browser to fetch a URL and return a structured output of the result. Note: This is only suitable for low-mid protection levels. Note: If the css_selector resolves to more than one element, all the elements will be returned. Note: If a session_id is provided (from open_session), the browser session will be reused instead of creating a new one. When using a session, browser-level params (headless, proxy, locale, etc.) are ignored since they were set at session creation time.

PARAMETER DESCRIPTION
url

The URL to request.

TYPE: str

extraction_type

The type of content to extract from the page. Defaults to "markdown". Options are: - Markdown will convert the page content to Markdown format. - HTML will return the raw HTML content of the page. - Text will return the text content of the page.

TYPE: extraction_types DEFAULT: 'markdown'

css_selector

CSS selector to extract the content from the page. If main_content_only is True, then it will be executed on the main content of the page. Defaults to None.

TYPE: Optional[str] DEFAULT: None

main_content_only

Whether to extract only the main content of the page. Defaults to True. The main content here is the data inside the <body> tag.

TYPE: bool DEFAULT: True

headless

Run the browser in headless/hidden (default), or headful/visible mode.

TYPE: bool DEFAULT: True

disable_resources

Drop requests for unnecessary resources for a speed boost. Requests dropped are of type font, image, media, beacon, object, imageset, texttrack, websocket, csp_report, and stylesheet.

TYPE: bool DEFAULT: False

useragent

Pass a useragent string to be used. Otherwise the fetcher will generate a real Useragent of the same browser and use it.

TYPE: Optional[str] DEFAULT: None

cookies

Set cookies for the next request. It should be in a dictionary format that Playwright accepts.

TYPE: Sequence[SetCookieParam] | None DEFAULT: None

network_idle

Wait for the page until there are no network connections for at least 500 ms.

TYPE: bool DEFAULT: False

timeout

The timeout in milliseconds that is used in all operations and waits through the page. The default is 30,000

TYPE: int | float DEFAULT: 30000

wait

The time (milliseconds) the fetcher will wait after everything finishes before closing the page and returning the Response object.

TYPE: int | float DEFAULT: 0

wait_selector

Wait for a specific CSS selector to be in a specific state.

TYPE: Optional[str] DEFAULT: None

timezone_id

Changes the timezone of the browser. Defaults to the system timezone.

TYPE: str | None DEFAULT: None

locale

Specify user locale, for example, en-GB, de-DE, etc. Locale will affect navigator.language value, Accept-Language request header value as well as number and date formatting rules. Defaults to the system default locale.

TYPE: str | None DEFAULT: None

wait_selector_state

The state to wait for the selector given with wait_selector. The default state is attached.

TYPE: SelectorWaitStates DEFAULT: 'attached'

real_chrome

If you have a Chrome browser installed on your device, enable this, and the Fetcher will launch an instance of your browser and use it.

TYPE: bool DEFAULT: False

cdp_url

Instead of launching a new browser instance, connect to this CDP URL to control real browsers through CDP.

TYPE: Optional[str] DEFAULT: None

google_search

Enabled by default, Scrapling will set a Google referer header.

TYPE: bool DEFAULT: True

extra_headers

A dictionary of extra headers to add to the request. The referer set by google_search takes priority over the referer set here if used together.

TYPE: Optional[Dict[str, str]] DEFAULT: None

proxy

The proxy to be used with requests, it can be a string or a dictionary with the keys 'server', 'username', and 'password' only.

TYPE: Optional[str | Dict[str, str]] DEFAULT: None

session_id

Optional session ID from open_session. If provided, reuses the existing browser session instead of creating a new one.

TYPE: Optional[str] DEFAULT: None

Source code in scrapling/core/ai.py
async def fetch(
    self,
    url: str,
    extraction_type: extraction_types = "markdown",
    css_selector: Optional[str] = None,
    main_content_only: bool = True,
    headless: bool = True,  # noqa: F821
    google_search: bool = True,
    real_chrome: bool = False,
    wait: int | float = 0,
    proxy: Optional[str | Dict[str, str]] = None,
    timezone_id: str | None = None,
    locale: str | None = None,
    extra_headers: Optional[Dict[str, str]] = None,
    useragent: Optional[str] = None,
    cdp_url: Optional[str] = None,
    timeout: int | float = 30000,
    disable_resources: bool = False,
    wait_selector: Optional[str] = None,
    cookies: Sequence[SetCookieParam] | None = None,
    network_idle: bool = False,
    wait_selector_state: SelectorWaitStates = "attached",
    session_id: Optional[str] = None,
) -> ResponseModel:
    """Use playwright to open a browser to fetch a URL and return a structured output of the result.
    Note: This is only suitable for low-mid protection levels.
    Note: If the `css_selector` resolves to more than one element, all the elements will be returned.
    Note: If a `session_id` is provided (from open_session), the browser session will be reused instead of creating a new one.
        When using a session, browser-level params (headless, proxy, locale, etc.) are ignored since they were set at session creation time.

    :param url: The URL to request.
    :param extraction_type: The type of content to extract from the page. Defaults to "markdown". Options are:
        - Markdown will convert the page content to Markdown format.
        - HTML will return the raw HTML content of the page.
        - Text will return the text content of the page.
    :param css_selector: CSS selector to extract the content from the page. If main_content_only is True, then it will be executed on the main content of the page. Defaults to None.
    :param main_content_only: Whether to extract only the main content of the page. Defaults to True. The main content here is the data inside the `<body>` tag.
    :param headless: Run the browser in headless/hidden (default), or headful/visible mode.
    :param disable_resources: Drop requests for unnecessary resources for a speed boost.
        Requests dropped are of type `font`, `image`, `media`, `beacon`, `object`, `imageset`, `texttrack`, `websocket`, `csp_report`, and `stylesheet`.
    :param useragent: Pass a useragent string to be used. Otherwise the fetcher will generate a real Useragent of the same browser and use it.
    :param cookies: Set cookies for the next request. It should be in a dictionary format that Playwright accepts.
    :param network_idle: Wait for the page until there are no network connections for at least 500 ms.
    :param timeout: The timeout in milliseconds that is used in all operations and waits through the page. The default is 30,000
    :param wait: The time (milliseconds) the fetcher will wait after everything finishes before closing the page and returning the ` Response ` object.
    :param wait_selector: Wait for a specific CSS selector to be in a specific state.
    :param timezone_id: Changes the timezone of the browser. Defaults to the system timezone.
    :param locale: Specify user locale, for example, `en-GB`, `de-DE`, etc. Locale will affect navigator.language value, Accept-Language request header value as well as number and date formatting
        rules. Defaults to the system default locale.
    :param wait_selector_state: The state to wait for the selector given with `wait_selector`. The default state is `attached`.
    :param real_chrome: If you have a Chrome browser installed on your device, enable this, and the Fetcher will launch an instance of your browser and use it.
    :param cdp_url: Instead of launching a new browser instance, connect to this CDP URL to control real browsers through CDP.
    :param google_search: Enabled by default, Scrapling will set a Google referer header.
    :param extra_headers: A dictionary of extra headers to add to the request. _The referer set by `google_search` takes priority over the referer set here if used together._
    :param proxy: The proxy to be used with requests, it can be a string or a dictionary with the keys 'server', 'username', and 'password' only.
    :param session_id: Optional session ID from open_session. If provided, reuses the existing browser session instead of creating a new one.
    """
    results = await self.bulk_fetch(
        urls=[url],
        extraction_type=extraction_type,
        css_selector=css_selector,
        main_content_only=main_content_only,
        headless=headless,
        google_search=google_search,
        real_chrome=real_chrome,
        wait=wait,
        proxy=proxy,
        timezone_id=timezone_id,
        locale=locale,
        extra_headers=extra_headers,
        useragent=useragent,
        cdp_url=cdp_url,
        timeout=timeout,
        disable_resources=disable_resources,
        wait_selector=wait_selector,
        cookies=cookies,
        network_idle=network_idle,
        wait_selector_state=wait_selector_state,
        session_id=session_id,
    )
    return results[0]

bulk_fetch async

bulk_fetch(
    urls,
    extraction_type="markdown",
    css_selector=None,
    main_content_only=True,
    headless=True,
    google_search=True,
    real_chrome=False,
    wait=0,
    proxy=None,
    timezone_id=None,
    locale=None,
    extra_headers=None,
    useragent=None,
    cdp_url=None,
    timeout=30000,
    disable_resources=False,
    wait_selector=None,
    cookies=None,
    network_idle=False,
    wait_selector_state="attached",
    session_id=None,
)

Use playwright to open a browser, then fetch a group of URLs at the same time, and for each page return a structured output of the result. Note: This is only suitable for low-mid protection levels. Note: If the css_selector resolves to more than one element, all the elements will be returned. Note: If a session_id is provided (from open_session), the browser session will be reused instead of creating a new one. When using a session, browser-level params (headless, proxy, locale, etc.) are ignored since they were set at session creation time.

PARAMETER DESCRIPTION
urls

A list of the URLs to request.

TYPE: List[str]

extraction_type

The type of content to extract from the page. Defaults to "markdown". Options are: - Markdown will convert the page content to Markdown format. - HTML will return the raw HTML content of the page. - Text will return the text content of the page.

TYPE: extraction_types DEFAULT: 'markdown'

css_selector

CSS selector to extract the content from the page. If main_content_only is True, then it will be executed on the main content of the page. Defaults to None.

TYPE: Optional[str] DEFAULT: None

main_content_only

Whether to extract only the main content of the page. Defaults to True. The main content here is the data inside the <body> tag.

TYPE: bool DEFAULT: True

headless

Run the browser in headless/hidden (default), or headful/visible mode.

TYPE: bool DEFAULT: True

disable_resources

Drop requests for unnecessary resources for a speed boost. Requests dropped are of type font, image, media, beacon, object, imageset, texttrack, websocket, csp_report, and stylesheet.

TYPE: bool DEFAULT: False

useragent

Pass a useragent string to be used. Otherwise the fetcher will generate a real Useragent of the same browser and use it.

TYPE: Optional[str] DEFAULT: None

cookies

Set cookies for the next request. It should be in a dictionary format that Playwright accepts.

TYPE: Sequence[SetCookieParam] | None DEFAULT: None

network_idle

Wait for the page until there are no network connections for at least 500 ms.

TYPE: bool DEFAULT: False

timeout

The timeout in milliseconds that is used in all operations and waits through the page. The default is 30,000

TYPE: int | float DEFAULT: 30000

wait

The time (milliseconds) the fetcher will wait after everything finishes before closing the page and returning the Response object.

TYPE: int | float DEFAULT: 0

wait_selector

Wait for a specific CSS selector to be in a specific state.

TYPE: Optional[str] DEFAULT: None

timezone_id

Changes the timezone of the browser. Defaults to the system timezone.

TYPE: str | None DEFAULT: None

locale

Specify user locale, for example, en-GB, de-DE, etc. Locale will affect navigator.language value, Accept-Language request header value as well as number and date formatting rules. Defaults to the system default locale.

TYPE: str | None DEFAULT: None

wait_selector_state

The state to wait for the selector given with wait_selector. The default state is attached.

TYPE: SelectorWaitStates DEFAULT: 'attached'

real_chrome

If you have a Chrome browser installed on your device, enable this, and the Fetcher will launch an instance of your browser and use it.

TYPE: bool DEFAULT: False

cdp_url

Instead of launching a new browser instance, connect to this CDP URL to control real browsers through CDP.

TYPE: Optional[str] DEFAULT: None

google_search

Enabled by default, Scrapling will set a Google referer header.

TYPE: bool DEFAULT: True

extra_headers

A dictionary of extra headers to add to the request. The referer set by google_search takes priority over the referer set here if used together.

TYPE: Optional[Dict[str, str]] DEFAULT: None

proxy

The proxy to be used with requests, it can be a string or a dictionary with the keys 'server', 'username', and 'password' only.

TYPE: Optional[str | Dict[str, str]] DEFAULT: None

session_id

Optional session ID from open_session. If provided, reuses the existing browser session instead of creating a new one.

TYPE: Optional[str] DEFAULT: None

Source code in scrapling/core/ai.py
async def bulk_fetch(
    self,
    urls: List[str],
    extraction_type: extraction_types = "markdown",
    css_selector: Optional[str] = None,
    main_content_only: bool = True,
    headless: bool = True,  # noqa: F821
    google_search: bool = True,
    real_chrome: bool = False,
    wait: int | float = 0,
    proxy: Optional[str | Dict[str, str]] = None,
    timezone_id: str | None = None,
    locale: str | None = None,
    extra_headers: Optional[Dict[str, str]] = None,
    useragent: Optional[str] = None,
    cdp_url: Optional[str] = None,
    timeout: int | float = 30000,
    disable_resources: bool = False,
    wait_selector: Optional[str] = None,
    cookies: Sequence[SetCookieParam] | None = None,
    network_idle: bool = False,
    wait_selector_state: SelectorWaitStates = "attached",
    session_id: Optional[str] = None,
) -> List[ResponseModel]:
    """Use playwright to open a browser, then fetch a group of URLs at the same time, and for each page return a structured output of the result.
    Note: This is only suitable for low-mid protection levels.
    Note: If the `css_selector` resolves to more than one element, all the elements will be returned.
    Note: If a `session_id` is provided (from open_session), the browser session will be reused instead of creating a new one.
        When using a session, browser-level params (headless, proxy, locale, etc.) are ignored since they were set at session creation time.

    :param urls: A list of the URLs to request.
    :param extraction_type: The type of content to extract from the page. Defaults to "markdown". Options are:
        - Markdown will convert the page content to Markdown format.
        - HTML will return the raw HTML content of the page.
        - Text will return the text content of the page.
    :param css_selector: CSS selector to extract the content from the page. If main_content_only is True, then it will be executed on the main content of the page. Defaults to None.
    :param main_content_only: Whether to extract only the main content of the page. Defaults to True. The main content here is the data inside the `<body>` tag.
    :param headless: Run the browser in headless/hidden (default), or headful/visible mode.
    :param disable_resources: Drop requests for unnecessary resources for a speed boost.
        Requests dropped are of type `font`, `image`, `media`, `beacon`, `object`, `imageset`, `texttrack`, `websocket`, `csp_report`, and `stylesheet`.
    :param useragent: Pass a useragent string to be used. Otherwise the fetcher will generate a real Useragent of the same browser and use it.
    :param cookies: Set cookies for the next request. It should be in a dictionary format that Playwright accepts.
    :param network_idle: Wait for the page until there are no network connections for at least 500 ms.
    :param timeout: The timeout in milliseconds that is used in all operations and waits through the page. The default is 30,000
    :param wait: The time (milliseconds) the fetcher will wait after everything finishes before closing the page and returning the ` Response ` object.
    :param wait_selector: Wait for a specific CSS selector to be in a specific state.
    :param timezone_id: Changes the timezone of the browser. Defaults to the system timezone.
    :param locale: Specify user locale, for example, `en-GB`, `de-DE`, etc. Locale will affect navigator.language value, Accept-Language request header value as well as number and date formatting
        rules. Defaults to the system default locale.
    :param wait_selector_state: The state to wait for the selector given with `wait_selector`. The default state is `attached`.
    :param real_chrome: If you have a Chrome browser installed on your device, enable this, and the Fetcher will launch an instance of your browser and use it.
    :param cdp_url: Instead of launching a new browser instance, connect to this CDP URL to control real browsers through CDP.
    :param google_search: Enabled by default, Scrapling will set a Google referer header.
    :param extra_headers: A dictionary of extra headers to add to the request. _The referer set by `google_search` takes priority over the referer set here if used together._
    :param proxy: The proxy to be used with requests, it can be a string or a dictionary with the keys 'server', 'username', and 'password' only.
    :param session_id: Optional session ID from open_session. If provided, reuses the existing browser session instead of creating a new one.
    """
    if session_id:
        entry = self._get_session(session_id, "dynamic")
        tasks = [
            entry.session.fetch(
                url,
                wait=wait,
                timeout=timeout,
                google_search=google_search,
                extra_headers=extra_headers,
                disable_resources=disable_resources,
                wait_selector=wait_selector,
                wait_selector_state=wait_selector_state,
                network_idle=network_idle,
                proxy=proxy,
            )
            for url in urls
        ]
        responses = await gather(*tasks)
    else:
        async with AsyncDynamicSession(
            wait=wait,
            proxy=proxy,
            locale=locale,
            timeout=timeout,
            cookies=cookies,
            cdp_url=cdp_url,
            headless=headless,
            block_ads=True,
            max_pages=len(urls),
            useragent=useragent,
            timezone_id=timezone_id,
            real_chrome=real_chrome,
            network_idle=network_idle,
            wait_selector=wait_selector,
            google_search=google_search,
            extra_headers=extra_headers,
            disable_resources=disable_resources,
            wait_selector_state=wait_selector_state,
        ) as session:
            tasks = [session.fetch(url) for url in urls]
            responses = await gather(*tasks)

    return [_translate_response(page, extraction_type, css_selector, main_content_only) for page in responses]

stealthy_fetch async

stealthy_fetch(
    url,
    extraction_type="markdown",
    css_selector=None,
    main_content_only=True,
    headless=True,
    google_search=True,
    real_chrome=False,
    wait=0,
    proxy=None,
    timezone_id=None,
    locale=None,
    extra_headers=None,
    useragent=None,
    hide_canvas=False,
    cdp_url=None,
    timeout=30000,
    disable_resources=False,
    wait_selector=None,
    cookies=None,
    network_idle=False,
    wait_selector_state="attached",
    block_webrtc=False,
    allow_webgl=True,
    solve_cloudflare=False,
    additional_args=None,
    session_id=None,
)

Use the stealthy fetcher to fetch a URL and return a structured output of the result. Note: This is the only suitable fetcher for high protection levels. Note: If the css_selector resolves to more than one element, all the elements will be returned. Note: If a session_id is provided (from open_session), the browser session will be reused instead of creating a new one. When using a session, browser-level params (headless, proxy, locale, etc.) are ignored since they were set at session creation time.

PARAMETER DESCRIPTION
url

The URL to request.

TYPE: str

extraction_type

The type of content to extract from the page. Defaults to "markdown". Options are: - Markdown will convert the page content to Markdown format. - HTML will return the raw HTML content of the page. - Text will return the text content of the page.

TYPE: extraction_types DEFAULT: 'markdown'

css_selector

CSS selector to extract the content from the page. If main_content_only is True, then it will be executed on the main content of the page. Defaults to None.

TYPE: Optional[str] DEFAULT: None

main_content_only

Whether to extract only the main content of the page. Defaults to True. The main content here is the data inside the <body> tag.

TYPE: bool DEFAULT: True

headless

Run the browser in headless/hidden (default), or headful/visible mode.

TYPE: bool DEFAULT: True

disable_resources

Drop requests for unnecessary resources for a speed boost. Requests dropped are of type font, image, media, beacon, object, imageset, texttrack, websocket, csp_report, and stylesheet.

TYPE: bool DEFAULT: False

useragent

Pass a useragent string to be used. Otherwise the fetcher will generate a real Useragent of the same browser and use it.

TYPE: Optional[str] DEFAULT: None

cookies

Set cookies for the next request.

TYPE: Sequence[SetCookieParam] | None DEFAULT: None

solve_cloudflare

Solves all types of the Cloudflare's Turnstile/Interstitial challenges before returning the response to you.

TYPE: bool DEFAULT: False

allow_webgl

Enabled by default. Disabling WebGL is not recommended as many WAFs now check if WebGL is enabled.

TYPE: bool DEFAULT: True

network_idle

Wait for the page until there are no network connections for at least 500 ms.

TYPE: bool DEFAULT: False

wait

The time (milliseconds) the fetcher will wait after everything finishes before closing the page and returning the Response object.

TYPE: int | float DEFAULT: 0

timeout

The timeout in milliseconds that is used in all operations and waits through the page. The default is 30,000

TYPE: int | float DEFAULT: 30000

wait_selector

Wait for a specific CSS selector to be in a specific state.

TYPE: Optional[str] DEFAULT: None

timezone_id

Changes the timezone of the browser. Defaults to the system timezone.

TYPE: str | None DEFAULT: None

locale

Specify user locale, for example, en-GB, de-DE, etc. Locale will affect navigator.language value, Accept-Language request header value as well as number and date formatting rules. Defaults to the system default locale.

TYPE: str | None DEFAULT: None

wait_selector_state

The state to wait for the selector given with wait_selector. The default state is attached.

TYPE: SelectorWaitStates DEFAULT: 'attached'

real_chrome

If you have a Chrome browser installed on your device, enable this, and the Fetcher will launch an instance of your browser and use it.

TYPE: bool DEFAULT: False

hide_canvas

Add random noise to canvas operations to prevent fingerprinting.

TYPE: bool DEFAULT: False

block_webrtc

Forces WebRTC to respect proxy settings to prevent local IP address leak.

TYPE: bool DEFAULT: False

cdp_url

Instead of launching a new browser instance, connect to this CDP URL to control real browsers through CDP.

TYPE: Optional[str] DEFAULT: None

google_search

Enabled by default, Scrapling will set a Google referer header.

TYPE: bool DEFAULT: True

extra_headers

A dictionary of extra headers to add to the request. The referer set by google_search takes priority over the referer set here if used together.

TYPE: Optional[Dict[str, str]] DEFAULT: None

proxy

The proxy to be used with requests, it can be a string or a dictionary with the keys 'server', 'username', and 'password' only.

TYPE: Optional[str | Dict[str, str]] DEFAULT: None

additional_args

Additional arguments to be passed to Playwright's context as additional settings, and it takes higher priority than Scrapling's settings.

TYPE: Optional[Dict] DEFAULT: None

session_id

Optional session ID from open_session. If provided, reuses the existing browser session instead of creating a new one.

TYPE: Optional[str] DEFAULT: None

Source code in scrapling/core/ai.py
async def stealthy_fetch(
    self,
    url: str,
    extraction_type: extraction_types = "markdown",
    css_selector: Optional[str] = None,
    main_content_only: bool = True,
    headless: bool = True,  # noqa: F821
    google_search: bool = True,
    real_chrome: bool = False,
    wait: int | float = 0,
    proxy: Optional[str | Dict[str, str]] = None,
    timezone_id: str | None = None,
    locale: str | None = None,
    extra_headers: Optional[Dict[str, str]] = None,
    useragent: Optional[str] = None,
    hide_canvas: bool = False,
    cdp_url: Optional[str] = None,
    timeout: int | float = 30000,
    disable_resources: bool = False,
    wait_selector: Optional[str] = None,
    cookies: Sequence[SetCookieParam] | None = None,
    network_idle: bool = False,
    wait_selector_state: SelectorWaitStates = "attached",
    block_webrtc: bool = False,
    allow_webgl: bool = True,
    solve_cloudflare: bool = False,
    additional_args: Optional[Dict] = None,
    session_id: Optional[str] = None,
) -> ResponseModel:
    """Use the stealthy fetcher to fetch a URL and return a structured output of the result.
    Note: This is the only suitable fetcher for high protection levels.
    Note: If the `css_selector` resolves to more than one element, all the elements will be returned.
    Note: If a `session_id` is provided (from open_session), the browser session will be reused instead of creating a new one.
        When using a session, browser-level params (headless, proxy, locale, etc.) are ignored since they were set at session creation time.

    :param url: The URL to request.
    :param extraction_type: The type of content to extract from the page. Defaults to "markdown". Options are:
        - Markdown will convert the page content to Markdown format.
        - HTML will return the raw HTML content of the page.
        - Text will return the text content of the page.
    :param css_selector: CSS selector to extract the content from the page. If main_content_only is True, then it will be executed on the main content of the page. Defaults to None.
    :param main_content_only: Whether to extract only the main content of the page. Defaults to True. The main content here is the data inside the `<body>` tag.
    :param headless: Run the browser in headless/hidden (default), or headful/visible mode.
    :param disable_resources: Drop requests for unnecessary resources for a speed boost.
        Requests dropped are of type `font`, `image`, `media`, `beacon`, `object`, `imageset`, `texttrack`, `websocket`, `csp_report`, and `stylesheet`.
    :param useragent: Pass a useragent string to be used. Otherwise the fetcher will generate a real Useragent of the same browser and use it.
    :param cookies: Set cookies for the next request.
    :param solve_cloudflare: Solves all types of the Cloudflare's Turnstile/Interstitial challenges before returning the response to you.
    :param allow_webgl: Enabled by default. Disabling WebGL is not recommended as many WAFs now check if WebGL is enabled.
    :param network_idle: Wait for the page until there are no network connections for at least 500 ms.
    :param wait: The time (milliseconds) the fetcher will wait after everything finishes before closing the page and returning the ` Response ` object.
    :param timeout: The timeout in milliseconds that is used in all operations and waits through the page. The default is 30,000
    :param wait_selector: Wait for a specific CSS selector to be in a specific state.
    :param timezone_id: Changes the timezone of the browser. Defaults to the system timezone.
    :param locale: Specify user locale, for example, `en-GB`, `de-DE`, etc. Locale will affect navigator.language value, Accept-Language request header value as well as number and date formatting
        rules. Defaults to the system default locale.
    :param wait_selector_state: The state to wait for the selector given with `wait_selector`. The default state is `attached`.
    :param real_chrome: If you have a Chrome browser installed on your device, enable this, and the Fetcher will launch an instance of your browser and use it.
    :param hide_canvas: Add random noise to canvas operations to prevent fingerprinting.
    :param block_webrtc: Forces WebRTC to respect proxy settings to prevent local IP address leak.
    :param cdp_url: Instead of launching a new browser instance, connect to this CDP URL to control real browsers through CDP.
    :param google_search: Enabled by default, Scrapling will set a Google referer header.
    :param extra_headers: A dictionary of extra headers to add to the request. _The referer set by `google_search` takes priority over the referer set here if used together._
    :param proxy: The proxy to be used with requests, it can be a string or a dictionary with the keys 'server', 'username', and 'password' only.
    :param additional_args: Additional arguments to be passed to Playwright's context as additional settings, and it takes higher priority than Scrapling's settings.
    :param session_id: Optional session ID from open_session. If provided, reuses the existing browser session instead of creating a new one.
    """
    results = await self.bulk_stealthy_fetch(
        urls=[url],
        extraction_type=extraction_type,
        css_selector=css_selector,
        main_content_only=main_content_only,
        headless=headless,
        google_search=google_search,
        real_chrome=real_chrome,
        wait=wait,
        proxy=proxy,
        timezone_id=timezone_id,
        locale=locale,
        extra_headers=extra_headers,
        useragent=useragent,
        hide_canvas=hide_canvas,
        cdp_url=cdp_url,
        timeout=timeout,
        disable_resources=disable_resources,
        wait_selector=wait_selector,
        cookies=cookies,
        network_idle=network_idle,
        wait_selector_state=wait_selector_state,
        block_webrtc=block_webrtc,
        allow_webgl=allow_webgl,
        solve_cloudflare=solve_cloudflare,
        additional_args=additional_args,
        session_id=session_id,
    )
    return results[0]

bulk_stealthy_fetch async

bulk_stealthy_fetch(
    urls,
    extraction_type="markdown",
    css_selector=None,
    main_content_only=True,
    headless=True,
    google_search=True,
    real_chrome=False,
    wait=0,
    proxy=None,
    timezone_id=None,
    locale=None,
    extra_headers=None,
    useragent=None,
    hide_canvas=False,
    cdp_url=None,
    timeout=30000,
    disable_resources=False,
    wait_selector=None,
    cookies=None,
    network_idle=False,
    wait_selector_state="attached",
    block_webrtc=False,
    allow_webgl=True,
    solve_cloudflare=False,
    additional_args=None,
    session_id=None,
)

Use the stealthy fetcher to fetch a group of URLs at the same time, and for each page return a structured output of the result. Note: This is the only suitable fetcher for high protection levels. Note: If the css_selector resolves to more than one element, all the elements will be returned. Note: If a session_id is provided (from open_session), the browser session will be reused instead of creating a new one. When using a session, browser-level params (headless, proxy, locale, etc.) are ignored since they were set at session creation time.

PARAMETER DESCRIPTION
urls

A list of the URLs to request.

TYPE: List[str]

extraction_type

The type of content to extract from the page. Defaults to "markdown". Options are: - Markdown will convert the page content to Markdown format. - HTML will return the raw HTML content of the page. - Text will return the text content of the page.

TYPE: extraction_types DEFAULT: 'markdown'

css_selector

CSS selector to extract the content from the page. If main_content_only is True, then it will be executed on the main content of the page. Defaults to None.

TYPE: Optional[str] DEFAULT: None

main_content_only

Whether to extract only the main content of the page. Defaults to True. The main content here is the data inside the <body> tag.

TYPE: bool DEFAULT: True

headless

Run the browser in headless/hidden (default), or headful/visible mode.

TYPE: bool DEFAULT: True

disable_resources

Drop requests for unnecessary resources for a speed boost. Requests dropped are of type font, image, media, beacon, object, imageset, texttrack, websocket, csp_report, and stylesheet.

TYPE: bool DEFAULT: False

useragent

Pass a useragent string to be used. Otherwise the fetcher will generate a real Useragent of the same browser and use it.

TYPE: Optional[str] DEFAULT: None

cookies

Set cookies for the next request.

TYPE: Sequence[SetCookieParam] | None DEFAULT: None

solve_cloudflare

Solves all types of the Cloudflare's Turnstile/Interstitial challenges before returning the response to you.

TYPE: bool DEFAULT: False

allow_webgl

Enabled by default. Disabling WebGL is not recommended as many WAFs now check if WebGL is enabled.

TYPE: bool DEFAULT: True

network_idle

Wait for the page until there are no network connections for at least 500 ms.

TYPE: bool DEFAULT: False

wait

The time (milliseconds) the fetcher will wait after everything finishes before closing the page and returning the Response object.

TYPE: int | float DEFAULT: 0

timeout

The timeout in milliseconds that is used in all operations and waits through the page. The default is 30,000

TYPE: int | float DEFAULT: 30000

wait_selector

Wait for a specific CSS selector to be in a specific state.

TYPE: Optional[str] DEFAULT: None

timezone_id

Changes the timezone of the browser. Defaults to the system timezone.

TYPE: str | None DEFAULT: None

locale

Specify user locale, for example, en-GB, de-DE, etc. Locale will affect navigator.language value, Accept-Language request header value as well as number and date formatting rules. Defaults to the system default locale.

TYPE: str | None DEFAULT: None

wait_selector_state

The state to wait for the selector given with wait_selector. The default state is attached.

TYPE: SelectorWaitStates DEFAULT: 'attached'

real_chrome

If you have a Chrome browser installed on your device, enable this, and the Fetcher will launch an instance of your browser and use it.

TYPE: bool DEFAULT: False

hide_canvas

Add random noise to canvas operations to prevent fingerprinting.

TYPE: bool DEFAULT: False

block_webrtc

Forces WebRTC to respect proxy settings to prevent local IP address leak.

TYPE: bool DEFAULT: False

cdp_url

Instead of launching a new browser instance, connect to this CDP URL to control real browsers through CDP.

TYPE: Optional[str] DEFAULT: None

google_search

Enabled by default, Scrapling will set a Google referer header.

TYPE: bool DEFAULT: True

extra_headers

A dictionary of extra headers to add to the request. The referer set by google_search takes priority over the referer set here if used together.

TYPE: Optional[Dict[str, str]] DEFAULT: None

proxy

The proxy to be used with requests, it can be a string or a dictionary with the keys 'server', 'username', and 'password' only.

TYPE: Optional[str | Dict[str, str]] DEFAULT: None

additional_args

Additional arguments to be passed to Playwright's context as additional settings, and it takes higher priority than Scrapling's settings.

TYPE: Optional[Dict] DEFAULT: None

session_id

Optional session ID from open_session. If provided, reuses the existing browser session instead of creating a new one.

TYPE: Optional[str] DEFAULT: None

Source code in scrapling/core/ai.py
async def bulk_stealthy_fetch(
    self,
    urls: List[str],
    extraction_type: extraction_types = "markdown",
    css_selector: Optional[str] = None,
    main_content_only: bool = True,
    headless: bool = True,  # noqa: F821
    google_search: bool = True,
    real_chrome: bool = False,
    wait: int | float = 0,
    proxy: Optional[str | Dict[str, str]] = None,
    timezone_id: str | None = None,
    locale: str | None = None,
    extra_headers: Optional[Dict[str, str]] = None,
    useragent: Optional[str] = None,
    hide_canvas: bool = False,
    cdp_url: Optional[str] = None,
    timeout: int | float = 30000,
    disable_resources: bool = False,
    wait_selector: Optional[str] = None,
    cookies: Sequence[SetCookieParam] | None = None,
    network_idle: bool = False,
    wait_selector_state: SelectorWaitStates = "attached",
    block_webrtc: bool = False,
    allow_webgl: bool = True,
    solve_cloudflare: bool = False,
    additional_args: Optional[Dict] = None,
    session_id: Optional[str] = None,
) -> List[ResponseModel]:
    """Use the stealthy fetcher to fetch a group of URLs at the same time, and for each page return a structured output of the result.
    Note: This is the only suitable fetcher for high protection levels.
    Note: If the `css_selector` resolves to more than one element, all the elements will be returned.
    Note: If a `session_id` is provided (from open_session), the browser session will be reused instead of creating a new one.
        When using a session, browser-level params (headless, proxy, locale, etc.) are ignored since they were set at session creation time.

    :param urls: A list of the URLs to request.
    :param extraction_type: The type of content to extract from the page. Defaults to "markdown". Options are:
        - Markdown will convert the page content to Markdown format.
        - HTML will return the raw HTML content of the page.
        - Text will return the text content of the page.
    :param css_selector: CSS selector to extract the content from the page. If main_content_only is True, then it will be executed on the main content of the page. Defaults to None.
    :param main_content_only: Whether to extract only the main content of the page. Defaults to True. The main content here is the data inside the `<body>` tag.
    :param headless: Run the browser in headless/hidden (default), or headful/visible mode.
    :param disable_resources: Drop requests for unnecessary resources for a speed boost.
        Requests dropped are of type `font`, `image`, `media`, `beacon`, `object`, `imageset`, `texttrack`, `websocket`, `csp_report`, and `stylesheet`.
    :param useragent: Pass a useragent string to be used. Otherwise the fetcher will generate a real Useragent of the same browser and use it.
    :param cookies: Set cookies for the next request.
    :param solve_cloudflare: Solves all types of the Cloudflare's Turnstile/Interstitial challenges before returning the response to you.
    :param allow_webgl: Enabled by default. Disabling WebGL is not recommended as many WAFs now check if WebGL is enabled.
    :param network_idle: Wait for the page until there are no network connections for at least 500 ms.
    :param wait: The time (milliseconds) the fetcher will wait after everything finishes before closing the page and returning the ` Response ` object.
    :param timeout: The timeout in milliseconds that is used in all operations and waits through the page. The default is 30,000
    :param wait_selector: Wait for a specific CSS selector to be in a specific state.
    :param timezone_id: Changes the timezone of the browser. Defaults to the system timezone.
    :param locale: Specify user locale, for example, `en-GB`, `de-DE`, etc. Locale will affect navigator.language value, Accept-Language request header value as well as number and date formatting
        rules. Defaults to the system default locale.
    :param wait_selector_state: The state to wait for the selector given with `wait_selector`. The default state is `attached`.
    :param real_chrome: If you have a Chrome browser installed on your device, enable this, and the Fetcher will launch an instance of your browser and use it.
    :param hide_canvas: Add random noise to canvas operations to prevent fingerprinting.
    :param block_webrtc: Forces WebRTC to respect proxy settings to prevent local IP address leak.
    :param cdp_url: Instead of launching a new browser instance, connect to this CDP URL to control real browsers through CDP.
    :param google_search: Enabled by default, Scrapling will set a Google referer header.
    :param extra_headers: A dictionary of extra headers to add to the request. _The referer set by `google_search` takes priority over the referer set here if used together._
    :param proxy: The proxy to be used with requests, it can be a string or a dictionary with the keys 'server', 'username', and 'password' only.
    :param additional_args: Additional arguments to be passed to Playwright's context as additional settings, and it takes higher priority than Scrapling's settings.
    :param session_id: Optional session ID from open_session. If provided, reuses the existing browser session instead of creating a new one.
    """
    if session_id:
        entry = self._get_session(session_id, "stealthy")
        tasks = [
            entry.session.fetch(
                url,
                wait=wait,
                timeout=timeout,
                google_search=google_search,
                extra_headers=extra_headers,
                disable_resources=disable_resources,
                wait_selector=wait_selector,
                wait_selector_state=wait_selector_state,
                network_idle=network_idle,
                proxy=proxy,
                solve_cloudflare=solve_cloudflare,
            )
            for url in urls
        ]
        responses = await gather(*tasks)
    else:
        async with AsyncStealthySession(
            wait=wait,
            proxy=proxy,
            locale=locale,
            cdp_url=cdp_url,
            timeout=timeout,
            cookies=cookies,
            headless=headless,
            block_ads=True,
            useragent=useragent,
            timezone_id=timezone_id,
            real_chrome=real_chrome,
            hide_canvas=hide_canvas,
            allow_webgl=allow_webgl,
            network_idle=network_idle,
            block_webrtc=block_webrtc,
            wait_selector=wait_selector,
            google_search=google_search,
            extra_headers=extra_headers,
            additional_args=additional_args,
            solve_cloudflare=solve_cloudflare,
            disable_resources=disable_resources,
            wait_selector_state=wait_selector_state,
        ) as session:
            tasks = [session.fetch(url) for url in urls]
            responses = await gather(*tasks)

    return [_translate_response(page, extraction_type, css_selector, main_content_only) for page in responses]

serve

serve(http, host, port)

Serve the MCP server.

Source code in scrapling/core/ai.py
def serve(self, http: bool, host: str, port: int):
    """Serve the MCP server."""
    server = FastMCP(name="Scrapling", host=host, port=port)
    # Session management tools
    server.add_tool(self.open_session, title="open_session", structured_output=True)
    server.add_tool(self.close_session, title="close_session", structured_output=True)
    server.add_tool(self.list_sessions, title="list_sessions", structured_output=True)
    # HTTP tools
    server.add_tool(self.get, title="get", description=self.get.__doc__, structured_output=True)
    server.add_tool(self.bulk_get, title="bulk_get", description=self.bulk_get.__doc__, structured_output=True)
    # Dynamic browser tools
    server.add_tool(self.fetch, title="fetch", description=self.fetch.__doc__, structured_output=True)
    server.add_tool(
        self.bulk_fetch, title="bulk_fetch", description=self.bulk_fetch.__doc__, structured_output=True
    )
    # Stealthy browser tools
    server.add_tool(
        self.stealthy_fetch,
        title="stealthy_fetch",
        description=self.stealthy_fetch.__doc__,
        structured_output=True,
    )
    server.add_tool(
        self.bulk_stealthy_fetch,
        title="bulk_stealthy_fetch",
        description=self.bulk_stealthy_fetch.__doc__,
        structured_output=True,
    )
    # Screenshot tool (returns image + url content blocks, not structured JSON)
    server.add_tool(self.screenshot, title="screenshot", description=self.screenshot.__doc__)
    server.run(transport="stdio" if not http else "streamable-http")