Documentation Index

Fetch the complete documentation index at: https://docs.spinnable.ai/llms.txt

Use this file to discover all available pages before exploring further.

What This Guide Covers

Some websites your business relies on don’t offer APIs — government portals, legacy business directories, industry registries, or internal tools with web-only interfaces. Your team might spend hours manually searching these sites, filling in forms, and copying results. This guide introduces a technique your AI worker can use to access these websites programmatically — without opening a browser. Instead of automating clicks in a visual browser, the worker talks directly to the website’s server using HTTP requests.
This is an advanced, experimental technique. It may or may not work depending on the target website. Websites change their protections and structure without notice, and there’s no guarantee of ongoing compatibility. This guide is a suggestion for making certain websites more accessible to your AI workers — not a supported, guaranteed integration.

For You: Understanding the Value

What problem does this solve?

When a website has no API, the traditional options are:
  • Manual work — someone on your team does the lookups by hand
  • Browser automation — your AI worker tries to use the browser
Browser automation is slow, resource-intensive, and fragile. It breaks when websites update their layout, and it struggles with CAPTCHAs and anti-bot protections. The approach in this guide is fundamentally different: your worker reverse-engineers how the website works under the hood and communicates directly with the server. Think of it as learning to speak the website’s language rather than pretending to be a human clicking buttons.

When should you consider this?

  • Repetitive lookups or searches on a specific website (e.g., checking a business registry daily)
  • Data extraction from form-based portals (e.g., government databases, legal publications)
  • Monitoring pages for changes or new entries
  • Any task where your team currently fills in web forms manually and copies results

What does your worker need?

  1. A CAPTCHA solving service API key — services like 2Captcha solve CAPTCHAs programmatically for approximately $0.003 per solve. Store this key in your Worker Vault.
  2. Your guidance on the target — tell your worker which website to target and what data you need extracted.
  3. Time for reconnaissance — your worker will need to inspect the website first to understand its structure before automating it.

What to expect

  • First-time setup takes effort — your worker needs to reverse-engineer each specific website (but they’ll do the heavy lifting)
  • Once working, the process is fast and reliable (until the website changes)
  • Save it as a skill — once your worker cracks a specific site, ask them to save the workflow as a reusable skill

Ethical guidelines

  • Only scrape publicly accessible data that is meant to be viewed by anyone
  • Respect the website’s terms of service and robots.txt
  • Implement rate limiting — don’t overwhelm target servers with rapid-fire requests
  • Use this for legitimate business purposes only
Using third-party CAPTCHA solving services operates in a legal gray area in some jurisdictions. Ensure your use case is legitimate and permissible in your region before proceeding.

For Your AI Worker: Technical Methodology

AI Worker Reference — This section is a technical guide designed for AI workers to learn the autonomous web scraping methodology. It covers the full pattern from reconnaissance to result parsing.

The “Package 2” Pattern

This methodology uses direct HTTP requests (via libraries like requests or httpx) combined with a third-party CAPTCHA solving service. There is no browser involved — no Selenium, no Playwright, no headless Chrome. You communicate directly with the web server. Advantages over browser automation:
  • Drastically faster execution
  • Minimal memory/resource usage
  • No browser driver version mismatches
  • Scales easily for high-concurrency workloads
  • No UI rendering context to manage

Phase 1: Reconnaissance — Understanding the Target

Before writing any code, inspect the website’s architecture and understand its request flow.
Step 1: Observe the Request Flow
  • Open the browser’s Developer Tools (F12), navigate to the Network tab
  • Ensure “Preserve log” is checked
  • Submit the form manually and observe the initial GET request and subsequent POST request
  • Note the request URL, headers, and payload structure
Step 2: Identify the Tech Stack
  • Look at URLs and page source for clues:
    • .aspx extensions and WebResource.axd paths → ASP.NET WebForms
    • .php extensions → PHP
    • JSON API calls in the background → JavaScript SPA with API backend
  • ASP.NET WebForms is particularly common in government/enterprise portals and maintains state via hidden fields: __VIEWSTATE, __EVENTVALIDATION, __VIEWSTATEGENERATOR
Step 3: Identify Anti-Bot Protections
  • reCAPTCHA: Look for iframes loading from google.com/recaptcha or grecaptcha elements
  • JavaScript Challenges: Look for inline scripts evaluating math expressions or string manipulations (e.g., NoBot controls that embed expressions like eval('43+40'))
  • Rate Limits: Note if there are strict rate limits or IP blocking behaviors
Step 4: Map the Form Fields
  • Use the Elements tab to inspect the <form>
  • Note the name attributes of all <input> elements
  • For ASP.NET WebForms, inputs inside server controls often use $ separators (e.g., ctl00$ContentPlaceHolder$txtSearchField)
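As a sketch of this mapping step, you can dump every input name and default value from the first form on a page with BeautifulSoup (the field names in the sample below are hypothetical):

```python
from bs4 import BeautifulSoup

def list_form_fields(html):
    """Return {name: default_value} for every named <input> in the first <form>."""
    soup = BeautifulSoup(html, "html.parser")
    form = soup.find("form")
    if form is None:
        return {}
    return {
        inp["name"]: inp.get("value", "")
        for inp in form.find_all("input")
        if inp.get("name")
    }

sample = """
<form action="Search.aspx" method="post">
  <input type="hidden" name="__VIEWSTATE" value="dDwt..." />
  <input type="text" name="ctl00$ContentPlaceHolder$txtSearchField" />
  <input type="submit" name="ctl00$ContentPlaceHolder$btnSearch" value="Search" />
</form>
"""
fields = list_form_fields(sample)
```

Running this against the real target page gives you the exact payload keys the server expects, including hidden fields you might otherwise miss.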

Phase 2: Replaying the Request Flow

Your script must replicate exactly what the browser does, step by step.
Step 1: Establish Session and Extract State
import requests
from bs4 import BeautifulSoup

# Use Session to automatically persist cookies between GET and POST
session = requests.Session()
session.headers.update({
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
})

# GET the page to establish session and extract hidden state
resp = session.get(TARGET_URL)
soup = BeautifulSoup(resp.text, "html.parser")

# Extract ASP.NET hidden state fields
viewstate = soup.find("input", {"name": "__VIEWSTATE"})["value"]
event_validation = soup.find("input", {"name": "__EVENTVALIDATION"})["value"]
viewstate_gen = soup.find("input", {"name": "__VIEWSTATEGENERATOR"})["value"]
Step 2: Handle Server-Side JavaScript Challenges
Many sites use JavaScript challenges to block simple bots. You must solve these server-side.
import re

# Decode Unicode escapes FIRST (e.g., \u0027 -> ')
page_decoded = resp.text.encode().decode('unicode_escape')

# Extract math expression from patterns like eval('43+40')
match = re.search(r"eval\('(\d+[+\-*/]\d+)'\)", page_decoded)
if match:
    expression = match.group(1)
    
    # Strip leading zeros to avoid Python syntax errors
    # (Python 3 treats leading zeros as invalid octal literals)
    clean_expression = re.sub(r'\b0+(\d)', r'\1', expression)
    challenge_answer = str(eval(clean_expression))
Critical pitfalls:
  • Unicode escape sequences: HTML source may contain \u0027 instead of '. Always decode Unicode escapes before parsing with regex.
  • Leading zeros: Expressions like 0275+85 will fail in Python’s eval(). Strip leading zeros first using re.sub(r'\b0+(\d)', r'\1', expression).
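As an alternative sketch that sidesteps both pitfalls, you can parse the expression and compute it with the operator module instead of eval() — int() tolerates leading zeros, and no code is ever executed (the function name here is illustrative):

```python
import operator
import re

# Map the four arithmetic operators these challenges use to safe callables
OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul, "/": operator.truediv}

def solve_challenge(expression):
    """Evaluate a simple 'NN<op>NN' challenge string without eval()."""
    match = re.fullmatch(r"(\d+)([+\-*/])(\d+)", expression)
    if not match:
        raise ValueError(f"Unexpected challenge format: {expression!r}")
    left, op, right = match.groups()
    # int() accepts leading zeros, so '0275' parses cleanly as 275
    result = OPS[op](int(left), int(right))
    # Render integer results without a trailing '.0'
    return str(int(result)) if result == int(result) else str(result)
```

This avoids passing server-controlled strings to eval() entirely, which is also safer if the site ever changes what it embeds in the page.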

Phase 3: Solving CAPTCHAs Programmatically

For sites protected by reCAPTCHA v2, use a CAPTCHA solving service (e.g., 2Captcha). The concept: you don’t solve the CAPTCHA visually. Instead, you extract the site key, send it to a solving API, and receive a bypass token.
Step-by-step API flow:
import time

def solve_recaptcha(api_key, site_key, page_url):
    # 1. Submit the task to 2Captcha
    resp = requests.post("https://2captcha.com/in.php", data={
        "key": api_key,
        "method": "userrecaptcha",
        "googlekey": site_key,
        "pageurl": page_url,
    })
    
    if "|" not in resp.text:
        raise Exception(f"Failed to submit captcha: {resp.text}")
    
    task_id = resp.text.split("|")[1]
    
    # 2. Poll for completion every 5 seconds; give up after ~3 minutes
    deadline = time.time() + 180
    while time.time() < deadline:
        time.sleep(5)
        resp = requests.get(
            f"https://2captcha.com/res.php?key={api_key}&action=get&id={task_id}"
        )
        # "CAPCHA_NOT_READY" (sic) is the literal string the 2Captcha API returns
        if resp.text != "CAPCHA_NOT_READY":
            if "OK|" in resp.text:
                return resp.text.split("|")[1]  # The token
            raise Exception(f"Captcha solve failed: {resp.text}")
    raise Exception("Captcha solve timed out")
Finding the site key:
  • Look for data-sitekey attribute in the HTML
  • Or find it inside a grecaptcha.render() function call
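Both lookups can be sketched with two regexes (the helper name is mine; data-sitekey is the standard attribute on reCAPTCHA v2 widgets):

```python
import re

def extract_site_key(html):
    """Pull the reCAPTCHA site key out of a page's HTML, or return None."""
    # Case 1: declarative widget, e.g. <div class="g-recaptcha" data-sitekey="...">
    match = re.search(r'data-sitekey\s*=\s*["\']([\w-]+)["\']', html)
    if match:
        return match.group(1)
    # Case 2: explicit render, e.g. grecaptcha.render(el, {"sitekey": "..."})
    match = re.search(r'sitekey["\']?\s*:\s*["\']([\w-]+)["\']', html)
    return match.group(1) if match else None
```

If neither pattern matches, inspect the page manually — some sites load the widget from a separate script, in which case the key appears in that script's source instead.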
Important notes:
  • The CAPTCHA token must be submitted within the same session (matching cookies) that loaded the page
  • Typical solve time: 15-30 seconds
  • Cost: ~$0.003 per solve
  • Alternative services (Anti-Captcha, CapSolver) follow the same architectural pattern

Phase 4: Assembling & Submitting the Request

Combine the state, solved challenge, CAPTCHA token, and search parameters into a single POST payload:
payload = {
    # ASP.NET hidden state
    "__VIEWSTATE": viewstate,
    "__EVENTVALIDATION": event_validation,
    "__VIEWSTATEGENERATOR": viewstate_gen,
    # Search parameters (field names from reconnaissance)
    "ctl00$ContentPlaceHolder$txtSearchField": search_term,
    # Submit button name/value pair
    "ctl00$ContentPlaceHolder$btnSearch": "Search",
    # Solved JavaScript challenge
    "NoBotControl$NoBotExtender_ClientState": challenge_answer,
    # CAPTCHA token
    "g-recaptcha-response": captcha_token,
}

# POST to the same URL (ASP.NET WebForms posts back to itself)
result = session.post(TARGET_URL, data=payload)
Key details:
  • Set a legitimate User-Agent header and correct Content-Type
  • Include the submit button’s name-value pair (often overlooked)
  • ASP.NET WebForms always posts back to the same URL

Phase 5: Parsing Results

Extract structured data from the response HTML:
soup = BeautifulSoup(result.text, "html.parser")

# Locate the results table (ASP.NET GridView components have specific IDs)
table = soup.find("table", {"id": "ctl00_ContentPlaceHolder_gvResults"})

if table:
    rows = table.find_all("tr")[1:]  # Skip header row
    results = []
    for row in rows:
        cells = row.find_all("td")
        results.append({
            "column_1": cells[0].text.strip(),
            "column_2": cells[1].text.strip(),
            # ... map to meaningful field names
        })
Pagination: For ASP.NET, pagination uses __EVENTTARGET and __EVENTARGUMENT hidden fields. To navigate to page 2, populate these fields and make another POST request simulating the page link click.
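A minimal sketch of building that pagination postback (the GridView control name is hypothetical — take the real one from reconnaissance):

```python
from bs4 import BeautifulSoup

# Hidden state fields that must be carried over on every postback
STATE_FIELDS = ("__VIEWSTATE", "__EVENTVALIDATION", "__VIEWSTATEGENERATOR")

def build_page_payload(result_html, grid_id, page_number):
    """Build a postback payload that simulates clicking a GridView page link."""
    soup = BeautifulSoup(result_html, "html.parser")
    payload = {
        "__EVENTTARGET": grid_id,
        "__EVENTARGUMENT": f"Page${page_number}",
    }
    # Re-extract the hidden state from the *current* response --
    # ASP.NET regenerates it on every request
    for name in STATE_FIELDS:
        field = soup.find("input", {"name": name})
        if field:
            payload[name] = field.get("value", "")
    return payload

# Usage sketch:
# next_page = session.post(
#     TARGET_URL,
#     data=build_page_payload(result.text, "ctl00$ContentPlaceHolder$gvResults", 2),
# )
```

Repeat this POST, incrementing the page number, until the response contains no further page links or rows.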

Common Pitfalls & Debugging

Issue | Symptom | Fix
Unicode escapes in JS | Regex fails to match expressions | .encode().decode('unicode_escape') before parsing
Leading zeros in math | eval('04+2') throws SyntaxError | Strip leading zeros via re.sub(r'\b0+(\d)', r'\1', expr)
Session mismatch | CAPTCHA solved but form rejected | Use requests.Session() for all requests
ViewState expiry | Form rejected after long CAPTCHA solve (>5 min) | Retry with fresh GET if CAPTCHA takes too long
Missing submit button | ASP.NET Event Validation error | Include the submit button’s name-value pair in payload
Missing hidden field | Server returns validation error | Check all hidden inputs from the form, not just ViewState

Turning This Into a Skill

Once you’ve successfully automated a specific website:
  1. Test it reliably — run the process multiple times to confirm stability
  2. Save it as a skill — this ensures you can repeat the workflow without re-engineering the site each time
  3. Add error handling — websites change; build in graceful failure and retry logic
  4. Implement rate limiting — add time.sleep() between requests to avoid overwhelming the target server
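A simple rate-limiting sketch you can wrap around every request (this is one possible pattern, not a required API):

```python
import time

class RateLimiter:
    """Block until at least min_interval seconds have passed since the last call."""

    def __init__(self, min_interval):
        self.min_interval = min_interval
        self._last = 0.0

    def wait(self):
        elapsed = time.monotonic() - self._last
        if elapsed < self.min_interval:
            time.sleep(self.min_interval - elapsed)
        self._last = time.monotonic()

# Usage sketch: call limiter.wait() immediately before each session.get/post
limiter = RateLimiter(min_interval=2.0)
```

Two seconds between requests is a conservative starting point for a small government portal; adjust based on how the site responds.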

Custom Integrations

Connect tools that have APIs but no native Spinnable integration

Worker Vault

Store API keys and credentials securely

Skills

Save repeatable workflows for reuse

Security Best Practices

Keep your worker integrations secure