Documentation Index

Fetch the complete documentation index at: https://docs.spinnable.ai/llms.txt

Use this file to discover all available pages before exploring further.

What This Guide Covers

Some websites your business relies on don’t offer APIs — government portals, legacy business directories, industry registries, or internal tools with web-only interfaces. Your team might spend hours manually searching these sites, filling in forms, and copying results. This guide introduces a technique your AI worker can use to access these websites programmatically — without opening a browser. Instead of automating clicks in a visual browser, the worker talks directly to the website’s server using HTTP requests.
This is an advanced, experimental technique. It may or may not work depending on the target website. Websites change their protections and structure without notice, and there’s no guarantee of ongoing compatibility. This guide is a suggestion for making certain websites more accessible to your AI workers — not a supported, guaranteed integration.

For You: Understanding the Value

What problem does this solve?

When a website has no API, the traditional options are:
  • Manual work — someone on your team does the lookups by hand
  • Browser automation — your AI worker tries to use the browser
Browser automation is slow, resource-intensive, and fragile. It breaks when websites update their layout, and it struggles with CAPTCHAs and anti-bot protections. The approach in this guide is fundamentally different: your worker reverse-engineers how the website works under the hood and communicates directly with the server. Think of it as learning to speak the website’s language rather than pretending to be a human clicking buttons.

When should you consider this?

  • Repetitive lookups or searches on a specific website (e.g., checking a business registry daily)
  • Data extraction from form-based portals (e.g., government databases, legal publications)
  • Monitoring pages for changes or new entries
  • Any task where your team currently fills in web forms manually and copies results

What does your worker need?

  1. A CAPTCHA solving service API key — services like 2Captcha solve CAPTCHAs programmatically for approximately $0.003 per solve. Store this key in your Worker Vault.
  2. Your guidance on the target — tell your worker which website to target and what data you need extracted.
  3. Time for reconnaissance — your worker will need to inspect the website first to understand its structure before automating it.

What to expect

  • First-time setup takes effort — your worker needs to reverse-engineer each specific website (but they’ll do the heavy lifting)
  • Once working, the process is fast and reliable (until the website changes)
  • Save it as a skill — once your worker cracks a specific site, ask them to save the workflow as a reusable skill

Ethical guidelines

  • Only scrape publicly accessible data that is meant to be viewed by anyone
  • Respect the website’s terms of service and robots.txt
  • Implement rate limiting — don’t overwhelm target servers with rapid-fire requests
  • Use this for legitimate business purposes only
Using third-party CAPTCHA solving services operates in a legal gray area in some jurisdictions. Ensure your use case is legitimate and permissible in your region before proceeding.

For Your AI Worker: Technical Methodology

AI Worker Reference — This section is a technical guide designed for AI workers to learn the autonomous web scraping methodology. It covers the full pattern from reconnaissance to result parsing.

The “Package 2” Pattern

This methodology uses direct HTTP requests (via libraries like requests or httpx) combined with a third-party CAPTCHA solving service. There is no browser involved — no Selenium, no Playwright, no headless Chrome. You communicate directly with the web server. Advantages over browser automation:
  • Drastically faster execution
  • Minimal memory/resource usage
  • No browser driver version mismatches
  • Scales easily for high-concurrency workloads
  • No UI rendering context to manage

Phase 1: Reconnaissance — Understanding the Target

Before writing any code, inspect the website’s architecture and understand its request flow.
Step 1: Observe the Request Flow
  • Open the browser’s Developer Tools (F12), navigate to the Network tab
  • Ensure “Preserve log” is checked
  • Submit the form manually and observe the initial GET request and subsequent POST request
  • Note the request URL, headers, and payload structure
Step 2: Identify the Tech Stack
  • Look at URLs and page source for clues:
    • .aspx extensions and WebResource.axd paths → ASP.NET WebForms
    • .php extensions → PHP
    • JSON API calls in the background → JavaScript SPA with API backend
  • ASP.NET WebForms is particularly common in government/enterprise portals and maintains state via hidden fields: __VIEWSTATE, __EVENTVALIDATION, __VIEWSTATEGENERATOR
Step 3: Identify Anti-Bot Protections
  • reCAPTCHA: Look for iframes loading from google.com/recaptcha or grecaptcha elements
  • JavaScript Challenges: Look for inline scripts evaluating math expressions or string manipulations (e.g., NoBot controls that embed expressions like eval('43+40'))
  • Rate Limits: Note if there are strict rate limits or IP blocking behaviors
Step 4: Map the Form Fields
  • Use the Elements tab to inspect the <form>
  • Note the name attributes of all <input> elements
  • For ASP.NET WebForms, inputs inside server controls often use $ separators (e.g., ctl00$ContentPlaceHolder$txtSearchField)
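As a sketch of this mapping step, you can dump every input name and default value from the first form on a page with BeautifulSoup (the field names in the sample below are hypothetical):

```python
from bs4 import BeautifulSoup

def list_form_fields(html):
    """Return {name: default_value} for every named <input> in the first <form>."""
    soup = BeautifulSoup(html, "html.parser")
    form = soup.find("form")
    if form is None:
        return {}
    return {
        inp["name"]: inp.get("value", "")
        for inp in form.find_all("input")
        if inp.get("name")
    }

sample = """
<form action="Search.aspx" method="post">
  <input type="hidden" name="__VIEWSTATE" value="dDwt..." />
  <input type="text" name="ctl00$ContentPlaceHolder$txtSearchField" />
  <input type="submit" name="ctl00$ContentPlaceHolder$btnSearch" value="Search" />
</form>
"""
fields = list_form_fields(sample)
```

Running this against the real target page gives you the exact payload keys the server expects, including hidden fields you might otherwise miss.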

Phase 2: Replaying the Request Flow

Your script must replicate exactly what the browser does, step by step.
Step 1: Establish Session and Extract State
import requests
from bs4 import BeautifulSoup

# Use Session to automatically persist cookies between GET and POST
session = requests.Session()
session.headers.update({
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
})

# GET the page to establish session and extract hidden state
resp = session.get(TARGET_URL)
soup = BeautifulSoup(resp.text, "html.parser")

# Extract ASP.NET hidden state fields
viewstate = soup.find("input", {"name": "__VIEWSTATE"})["value"]
event_validation = soup.find("input", {"name": "__EVENTVALIDATION"})["value"]
viewstate_gen = soup.find("input", {"name": "__VIEWSTATEGENERATOR"})["value"]
Step 2: Handle Server-Side JavaScript Challenges
Many sites use JavaScript challenges to block simple bots. You must solve these server-side.
import re

# Decode Unicode escapes FIRST (e.g., \u0027 -> ')
page_decoded = resp.text.encode().decode('unicode_escape')

# Extract math expression from patterns like eval('43+40')
match = re.search(r"eval\('(\d+[+\-*/]\d+)'\)", page_decoded)
if match:
    expression = match.group(1)
    
    # Strip leading zeros to avoid Python syntax errors
    # (Python 3 treats leading zeros as invalid octal literals)
    clean_expression = re.sub(r'\b0+(\d)', r'\1', expression)
    challenge_answer = str(eval(clean_expression))
Critical pitfalls:
  • Unicode escape sequences: HTML source may contain \u0027 instead of '. Always decode Unicode escapes before parsing with regex.
  • Leading zeros: Expressions like 0275+85 will fail in Python’s eval(). Strip leading zeros first using re.sub(r'\b0+(\d)', r'\1', expression).
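As an alternative sketch that sidesteps both pitfalls, you can parse the expression and compute it with the operator module instead of eval() — int() tolerates leading zeros, and no code is ever executed (the function name here is illustrative):

```python
import operator
import re

# Map the four arithmetic operators these challenges use to safe callables
OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul, "/": operator.truediv}

def solve_challenge(expression):
    """Evaluate a simple 'NN<op>NN' challenge string without eval()."""
    match = re.fullmatch(r"(\d+)([+\-*/])(\d+)", expression)
    if not match:
        raise ValueError(f"Unexpected challenge format: {expression!r}")
    left, op, right = match.groups()
    # int() accepts leading zeros, so '0275' parses cleanly as 275
    result = OPS[op](int(left), int(right))
    # Render integer results without a trailing '.0'
    return str(int(result)) if result == int(result) else str(result)
```

This avoids passing server-controlled strings to eval() entirely, which is also safer if the site ever changes what it embeds in the page.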

Phase 3: Solving CAPTCHAs Programmatically

For sites protected by reCAPTCHA v2, use a CAPTCHA solving service (e.g., 2Captcha). The concept: you don’t solve the CAPTCHA visually. Instead, you extract the site key, send it to a solving API, and receive a bypass token.
Step-by-step API flow:
import time

def solve_recaptcha(api_key, site_key, page_url):
    # 1. Submit the task to 2Captcha
    resp = requests.post("https://2captcha.com/in.php", data={
        "key": api_key,
        "method": "userrecaptcha",
        "googlekey": site_key,
        "pageurl": page_url,
    })
    
    if "|" not in resp.text:
        raise Exception(f"Failed to submit captcha: {resp.text}")
    
    task_id = resp.text.split("|")[1]
    
    # 2. Poll for completion every 5 seconds; give up after ~3 minutes
    deadline = time.time() + 180
    while time.time() < deadline:
        time.sleep(5)
        resp = requests.get(
            f"https://2captcha.com/res.php?key={api_key}&action=get&id={task_id}"
        )
        # "CAPCHA_NOT_READY" (sic) is the literal string the 2Captcha API returns
        if resp.text != "CAPCHA_NOT_READY":
            if "OK|" in resp.text:
                return resp.text.split("|")[1]  # The token
            raise Exception(f"Captcha solve failed: {resp.text}")
    raise Exception("Captcha solve timed out")
Finding the site key:
  • Look for data-sitekey attribute in the HTML
  • Or find it inside a grecaptcha.render() function call
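Both lookups can be sketched with two regexes (the helper name is mine; data-sitekey is the standard attribute on reCAPTCHA v2 widgets):

```python
import re

def extract_site_key(html):
    """Pull the reCAPTCHA site key out of a page's HTML, or return None."""
    # Case 1: declarative widget, e.g. <div class="g-recaptcha" data-sitekey="...">
    match = re.search(r'data-sitekey\s*=\s*["\']([\w-]+)["\']', html)
    if match:
        return match.group(1)
    # Case 2: explicit render, e.g. grecaptcha.render(el, {"sitekey": "..."})
    match = re.search(r'sitekey["\']?\s*:\s*["\']([\w-]+)["\']', html)
    return match.group(1) if match else None
```

If neither pattern matches, inspect the page manually — some sites load the widget from a separate script, in which case the key appears in that script's source instead.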
Important notes:
  • The CAPTCHA token must be submitted within the same session (matching cookies) that loaded the page
  • Typical solve time: 15-30 seconds
  • Cost: ~$0.003 per solve
  • Alternative services (Anti-Captcha, CapSolver) follow the same architectural pattern

Phase 4: Assembling & Submitting the Request

Combine the state, solved challenge, CAPTCHA token, and search parameters into a single POST payload:
payload = {
    # ASP.NET hidden state
    "__VIEWSTATE": viewstate,
    "__EVENTVALIDATION": event_validation,
    "__VIEWSTATEGENERATOR": viewstate_gen,
    # Search parameters (field names from reconnaissance)
    "ctl00$ContentPlaceHolder$txtSearchField": search_term,
    # Submit button name/value pair
    "ctl00$ContentPlaceHolder$btnSearch": "Search",
    # Solved JavaScript challenge
    "NoBotControl$NoBotExtender_ClientState": challenge_answer,
    # CAPTCHA token
    "g-recaptcha-response": captcha_token,
}

# POST to the same URL (ASP.NET WebForms posts back to itself)
result = session.post(TARGET_URL, data=payload)
Key details:
  • Set a legitimate User-Agent header and correct Content-Type
  • Include the submit button’s name-value pair (often overlooked)
  • ASP.NET WebForms always posts back to the same URL

Phase 5: Parsing Results

Extract structured data from the response HTML:
soup = BeautifulSoup(result.text, "html.parser")

# Locate the results table (ASP.NET GridView components have specific IDs)
table = soup.find("table", {"id": "ctl00_ContentPlaceHolder_gvResults"})

if table:
    rows = table.find_all("tr")[1:]  # Skip header row
    results = []
    for row in rows:
        cells = row.find_all("td")
        results.append({
            "column_1": cells[0].text.strip(),
            "column_2": cells[1].text.strip(),
            # ... map to meaningful field names
        })
Pagination: For ASP.NET, pagination uses __EVENTTARGET and __EVENTARGUMENT hidden fields. To navigate to page 2, populate these fields and make another POST request simulating the page link click.
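A minimal sketch of building that pagination postback (the GridView control name is hypothetical — take the real one from reconnaissance):

```python
from bs4 import BeautifulSoup

# Hidden state fields that must be carried over on every postback
STATE_FIELDS = ("__VIEWSTATE", "__EVENTVALIDATION", "__VIEWSTATEGENERATOR")

def build_page_payload(result_html, grid_id, page_number):
    """Build a postback payload that simulates clicking a GridView page link."""
    soup = BeautifulSoup(result_html, "html.parser")
    payload = {
        "__EVENTTARGET": grid_id,
        "__EVENTARGUMENT": f"Page${page_number}",
    }
    # Re-extract the hidden state from the *current* response --
    # ASP.NET regenerates it on every request
    for name in STATE_FIELDS:
        field = soup.find("input", {"name": name})
        if field:
            payload[name] = field.get("value", "")
    return payload

# Usage sketch:
# next_page = session.post(
#     TARGET_URL,
#     data=build_page_payload(result.text, "ctl00$ContentPlaceHolder$gvResults", 2),
# )
```

Repeat this POST, incrementing the page number, until the response contains no further page links or rows.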

Common Pitfalls & Debugging

Issue | Symptom | Fix
Unicode escapes in JS | Regex fails to match expressions | .encode().decode('unicode_escape') before parsing
Leading zeros in math | eval('04+2') throws SyntaxError | Strip leading zeros via re.sub(r'\b0+(\d)', r'\1', expr)
Session mismatch | CAPTCHA solved but form rejected | Use requests.Session() for all requests
ViewState expiry | Form rejected after long CAPTCHA solve (>5 min) | Retry with fresh GET if CAPTCHA takes too long
Missing submit button | ASP.NET Event Validation error | Include the submit button’s name-value pair in payload
Missing hidden field | Server returns validation error | Check all hidden inputs from the form, not just ViewState

Turning This Into a Skill

Once you’ve successfully automated a specific website:
  1. Test it reliably — run the process multiple times to confirm stability
  2. Save it as a skill — this ensures you can repeat the workflow without re-engineering the site each time
  3. Add error handling — websites change; build in graceful failure and retry logic
  4. Implement rate limiting — add time.sleep() between requests to avoid overwhelming the target server
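A simple rate-limiting sketch you can wrap around every request (this is one possible pattern, not a required API):

```python
import time

class RateLimiter:
    """Block until at least min_interval seconds have passed since the last call."""

    def __init__(self, min_interval):
        self.min_interval = min_interval
        self._last = 0.0

    def wait(self):
        elapsed = time.monotonic() - self._last
        if elapsed < self.min_interval:
            time.sleep(self.min_interval - elapsed)
        self._last = time.monotonic()

# Usage sketch: call limiter.wait() immediately before each session.get/post
limiter = RateLimiter(min_interval=2.0)
```

Two seconds between requests is a conservative starting point for a small government portal; adjust based on how the site responds.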

Custom Integrations

Connect tools that have APIs but no native Spinnable integration

Worker Vault

Store API keys and credentials securely

Skills

Save repeatable workflows for reuse

Security Best Practices

Keep your worker integrations secure