The AI-Crawler is an experimental data extraction app by Oxylabs AI Studio that crawls a given domain, uses AI to identify the pages relevant to a natural language prompt, and returns the extracted data as structured JSON or Markdown.
This low-code tool is designed to simplify complex data acquisition tasks, allowing developers and data scientists to focus on analysis rather than building and maintaining custom web scrapers. The AI web crawler offers advanced filtering, schema-based parsing, and seamless integration with various automation pipelines.
- Start a crawl from any given URL: Begin your data extraction from any valid web address using the AI Crawler as a starting point.
- Natural language prompt: Define your data needs in plain English, and the crawl agent will interpret the prompt to find relevant content.
- AI-assisted URL selection: The AI web crawler intelligently explores the site, identifying and prioritizing pages most aligned with your prompt.
- Multiple output formats: Choose between structured JSON or Markdown output for seamless integration into automation or AI workflows.
- Schema-based parsing: For JSON output, you can define a parsing schema in natural language to ensure the extracted data is structured to fit your application.
To get started with the AI Crawler, follow this four-step process:
- Provide a starting URL of the website you want the web crawler to explore.
- Describe the content you want to retrieve using a natural language prompt for the crawl agent.
- Select the output format. Choose between structured JSON or Markdown.
- If using JSON output, provide a schema to guide the AI web crawler in parsing and structuring the extracted data.
To begin, make sure you have access to an API key (or get a free trial with 1,000 credits) and Python 3.10+ installed. You can install the oxylabs-ai-studio package using pip:
pip install oxylabs-ai-studio
The following example demonstrates how to use the AiCrawler to perform common crawling tasks.
from oxylabs_ai_studio.apps.ai_crawler import AiCrawler
# Initialize the AI Crawler with your API key
crawler = AiCrawler(api_key="your_api_key")
# Generate a schema automatically from natural language
schema = crawler.generate_schema(prompt="want to parse name, platform, price")
print(f"Generated schema: {schema}")
# Crawl a website and extract structured data
url = "https://sandbox.oxylabs.io/products"
result = crawler.crawl(
    url=url,
    user_prompt="Find all Halo games for Xbox",
    output_format="json",
    schema=schema,
    render_javascript=False,
    return_sources_limit=3,
    geo_location="US",
)
# Print the crawl output
print("Results:")
for item in result.data:
    print(item, "\n")
Learn more about AI-Crawler and Oxylabs AI Studio Python SDK in our PyPI repository. You can also check out our AI Studio JavaScript SDK guide for JS users.
Parameter | Description | Default Value |
---|---|---|
url * | Starting URL to crawl | – |
user_prompt * | Natural language prompt to guide extraction | – |
output_format | Output format (json, markdown) | markdown |
schema | OpenAPI schema for structured extraction (mandatory for JSON) | – |
render_javascript | Enable JavaScript rendering | False |
return_sources_limit | Max number of sources to return | 25 |
geo_location | Proxy location in ISO2 format | – |

* – mandatory parameters
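The rules in the table can be checked on the client side before a crawl is started. The sketch below is a hypothetical helper, not part of the SDK: `build_crawl_kwargs` is an invented name, and it assumes that optional parameters without a documented default (`schema`, `geo_location`) can simply be left out of the call.

```python
from typing import Any, Optional

# Defaults copied from the parameter table.
DEFAULT_OUTPUT_FORMAT = "markdown"
DEFAULT_RETURN_SOURCES_LIMIT = 25


def build_crawl_kwargs(
    url: str,
    user_prompt: str,
    output_format: str = DEFAULT_OUTPUT_FORMAT,
    schema: Optional[dict[str, Any]] = None,
    render_javascript: bool = False,
    return_sources_limit: int = DEFAULT_RETURN_SOURCES_LIMIT,
    geo_location: Optional[str] = None,
) -> dict[str, Any]:
    """Hypothetical helper: validate the documented rules and return keyword arguments."""
    if not url or not user_prompt:
        raise ValueError("url and user_prompt are mandatory")
    if output_format not in ("json", "markdown"):
        raise ValueError("output_format must be 'json' or 'markdown'")
    if output_format == "json" and schema is None:
        raise ValueError("schema is mandatory when output_format is 'json'")

    kwargs: dict[str, Any] = {
        "url": url,
        "user_prompt": user_prompt,
        "output_format": output_format,
        "render_javascript": render_javascript,
        "return_sources_limit": return_sources_limit,
    }
    # Assumption: parameters with no documented default are only sent when set.
    if schema is not None:
        kwargs["schema"] = schema
    if geo_location is not None:
        kwargs["geo_location"] = geo_location
    return kwargs


kwargs = build_crawl_kwargs(
    url="https://sandbox.oxylabs.io/products",
    user_prompt="Find all Halo games for Xbox",
)
print(kwargs["output_format"], kwargs["return_sources_limit"])
```

With a crawler instance at hand, the result could then be passed on as `crawler.crawl(**kwargs)`.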
The AI-Crawler can return parsed, ready-to-use output that is easy to integrate into your applications.
The structured JSON output of the crawl above looks like this:
Results:
{"data": {"items": [{"name": "Halo: Reach", "platform": "Xbox platform", "price": 84.99}]}, "src": "https://sandbox.oxylabs.io/products/141"}
{"data": {"items": [{"name": "Halo 3", "platform": "Xbox", "price": 81.99}]}, "src": "https://sandbox.oxylabs.io/products/28"}
{"data": {"items": [{"name": "Halo: Combat Evolved", "platform": "Xbox platform", "price": 87.99}]}, "src": "https://sandbox.oxylabs.io/products/6"}
Alternatively, you can use output_format="markdown" to receive Markdown results instead of parsed JSON.
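For JSON results, downstream code often wants one flat record per extracted item. The following is a minimal sketch that assumes each result entry is either a dict or a JSON string shaped like the sample output above (a `data.items` list plus a `src` URL); the actual shape depends on the schema you pass, and `flatten_results` is a hypothetical helper name.

```python
import json
from typing import Any, Iterable


def flatten_results(results: Iterable[Any]) -> list[dict[str, Any]]:
    """Return one flat row per extracted item, tagged with its source URL."""
    rows: list[dict[str, Any]] = []
    for entry in results:
        # Assumption: an entry is a dict or a JSON string shaped like the sample output.
        if isinstance(entry, str):
            entry = json.loads(entry)
        for item in entry.get("data", {}).get("items", []):
            rows.append({**item, "src": entry.get("src")})
    return rows


# Two entries taken from the sample output, one as a dict and one as a JSON string.
sample = [
    {
        "data": {"items": [{"name": "Halo: Reach", "platform": "Xbox platform", "price": 84.99}]},
        "src": "https://sandbox.oxylabs.io/products/141",
    },
    '{"data": {"items": [{"name": "Halo 3", "platform": "Xbox", "price": 81.99}]}, '
    '"src": "https://sandbox.oxylabs.io/products/28"}',
]

for row in flatten_results(sample):
    print(row["name"], row["price"], row["src"])
```

Rows in this form can be written to CSV or loaded into a data frame without further parsing.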
The AI-Crawler is a versatile tool for a wide range of applications, including:
- Finding terms of service pages: Quickly locate legal and policy pages across a domain.
- Gathering pricing pages: Collect pricing details for competitor analysis or market research.
- Retrieving all “About” pages: Automatically find and extract company information from a list of websites.
- Listing AI-related news articles: Scrape a news site to gather and archive articles on a specific topic.
Unlike traditional scrapers that rely on static selectors (CSS/XPath) and custom scripts, AI-Crawler uses natural language prompts and AI-assisted URL selection to dynamically identify and extract relevant content. It returns Markdown results and also supports schema-based parsing for structured JSON outputs, reducing the need for manual parsing logic.
You can crawl most publicly accessible websites. AI-Crawler is designed to handle both static and JavaScript-rendered pages, and it can be configured with geo-targeting. However, be sure your use case complies with the website’s terms of service and local laws.
Oxylabs AI Studio AI-Crawler is free to try by signing up for a free trial that includes 1,000 credits. After the trial, the monthly plans start at $12/month with 3,000 credits and 1 request/s, with higher plans offering more credits and higher request rates.
You can define a custom schema: either provide your own schema in OpenAPI format or let AI-Crawler generate one automatically from a natural language prompt. This allows your extracted data to match the exact structure your application needs.
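As an illustration, a hand-written schema for the name, platform, and price fields used earlier might look like the sketch below. It assumes the JSON-Schema-style object notation that OpenAPI uses; the exact shape the SDK expects is not confirmed here, so it is worth comparing against the output of generate_schema. The `missing_required` helper is hypothetical and only shows how a result item could be checked against the schema.

```python
import json

# Assumed OpenAPI-style (JSON Schema) object; compare with generate_schema output.
schema = {
    "type": "object",
    "properties": {
        "items": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "name": {"type": "string"},
                    "platform": {"type": "string"},
                    "price": {"type": "number"},
                },
                "required": ["name", "platform", "price"],
            },
        }
    },
    "required": ["items"],
}


def missing_required(record: dict, object_schema: dict) -> list[str]:
    """Return the required property names that are absent from a record."""
    return [key for key in object_schema.get("required", []) if key not in record]


item_schema = schema["properties"]["items"]["items"]
print(json.dumps(schema, indent=2))
print(missing_required({"name": "Halo 3", "platform": "Xbox"}, item_schema))  # → ['price']
```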
For a deeper dive into available parameters, advanced integrations, and additional examples, check out the AI Studio documentation.
If you have questions or need support, reach out to us at hello@oxylabs.io or through live chat.