Welcome to Parsera

Parsera is a lightweight Python library for scraping websites with LLMs.
You can clone and run it locally, or use the API, which scales better and adds extra features such as a built-in proxy.

If you want to use Parsera in your TypeScript application, check out our Parsera SDK.

Installation

pip install parsera
playwright install

Basic usage

First, set the PARSERA_API_KEY environment variable (if you want to run a custom LLM, see Custom Models). You can do this from Python with:

import os

os.environ["PARSERA_API_KEY"] = "YOUR_PARSERA_API_KEY_HERE"

Next, you can run a basic version:

from parsera import Parsera

url = "https://news.ycombinator.com/"
elements = {
    "Title": "News title",
    "Points": "Number of points",
    "Comments": "Number of comments",
}

scraper = Parsera()
result = scraper.run(url=url, elements=elements)

The result variable will contain JSON with a list of records:

[
   {
      "Title":"Hacking the largest airline and hotel rewards platform (2023)",
      "Points":"104",
      "Comments":"24"
   },
    ...
]
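In the sample above, Points and Comments come back as strings. A minimal sketch of converting them in plain Python, assuming the result is a list of dicts shaped like the sample; `to_int_fields` is a hypothetical helper for illustration, not part of Parsera:

```python
def to_int_fields(records, fields=("Points", "Comments")):
    """Return copies of records with the given fields converted to int where possible."""
    cleaned = []
    for record in records:
        row = dict(record)  # copy so the original record is not modified
        for field in fields:
            try:
                row[field] = int(row.get(field))
            except (TypeError, ValueError):
                # Leave missing or non-numeric values untouched.
                pass
        cleaned.append(row)
    return cleaned


# Sample record copied from the output shown above.
sample = [
    {
        "Title": "Hacking the largest airline and hotel rewards platform (2023)",
        "Points": "104",
        "Comments": "24",
    }
]
print(to_int_fields(sample))
```

If you would rather have the model return typed values directly, see the typed schema in the next section.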

There is also an async method, arun, available:

result = await scraper.arun(url=url, elements=elements)
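Outside a notebook, `await` has to run inside a coroutine. The sketch below shows one way to drive arun from a regular script and scrape several URLs concurrently with `asyncio.gather`. `StubScraper` and `scrape_many` are hypothetical names used only so the example runs offline; in real use you would pass a `Parsera()` instance instead, on the assumption that its arun takes the `url` and `elements` keywords shown above.

```python
import asyncio


class StubScraper:
    """Offline stand-in exposing arun(url=..., elements=...); swap in Parsera()."""

    async def arun(self, url, elements):
        await asyncio.sleep(0)  # yield control, as a real network call would
        return [{name: f"<{name} from {url}>" for name in elements}]


async def scrape_many(scraper, urls, elements):
    """Call arun for every URL concurrently and map each URL to its result."""
    results = await asyncio.gather(
        *(scraper.arun(url=url, elements=elements) for url in urls)
    )
    return dict(zip(urls, results))


if __name__ == "__main__":
    elements = {"Title": "News title"}
    urls = ["https://news.ycombinator.com/"]
    # asyncio.run starts an event loop and runs the coroutine to completion.
    print(asyncio.run(scrape_many(StubScraper(), urls, elements)))
```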

Specify output types

You can specify the output types using the following schema:

from parsera import Parsera

url = "https://news.ycombinator.com/"
elements = {
    "Title": {
        "description": "News title",
        "type": "string",
    },
    "Points": {
        "description": "Number of points",
        "type": "integer",
    },
    "Comments": {
        "description": "Number of comments",
        "type": "integer",
    }
}

scraper = Parsera(typed=True)
result = scraper.run(url=url, elements=elements)

List of the supported types:

Schema Type	Python Type
`string`	`str`
`integer`	`int`
`number`	`float`
`bool`	`bool`
`list`	`list`
`object`	`dict`
`any`	Model can return any type

When typed is set to True, Parsera switches to the Structured Extractor.
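To make the type table concrete, here is a minimal sketch that checks a returned record against a typed elements schema in plain Python. `SCHEMA_TO_PYTHON` and `matches_schema` are hypothetical names for illustration, not part of Parsera, and the check follows the table strictly (for example, `number` accepts only `float`).

```python
# Mapping from schema types to Python types; "any" is handled separately.
SCHEMA_TO_PYTHON = {
    "string": str,
    "integer": int,
    "number": float,
    "bool": bool,
    "list": list,
    "object": dict,
}


def matches_schema(record, elements):
    """Return True if every field in elements has the declared type in record."""
    for name, spec in elements.items():
        schema_type = spec["type"]
        if schema_type == "any":
            continue  # any value is acceptable
        expected = SCHEMA_TO_PYTHON[schema_type]
        value = record.get(name)
        # bool is a subclass of int in Python, so reject it for non-bool fields.
        if isinstance(value, bool) and expected is not bool:
            return False
        if not isinstance(value, expected):
            return False
    return True


elements = {
    "Title": {"description": "News title", "type": "string"},
    "Points": {"description": "Number of points", "type": "integer"},
}
print(matches_schema({"Title": "Example", "Points": 104}, elements))
```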

Running with CLI

Before you run Parsera as a command-line tool, don't forget to add your OPENAI_API_KEY to your environment variables or a .env file.

Usage

You can configure the elements to parse using a JSON string or a FILE. Optionally, you can provide a FILE to write the output to.

python -m parsera.main URL {--scheme '{"title":"h1"}' | --file FILENAME} [--output FILENAME]
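Quoting a JSON string by hand on the command line is easy to get wrong. A minimal sketch of building the same command from Python, using only the flags shown above; `build_command` is a hypothetical helper, and `result.json` is an arbitrary example filename:

```python
import json
import shlex


def build_command(url, scheme, output=None):
    """Build the argument list for `python -m parsera.main` with an inline scheme."""
    args = ["python", "-m", "parsera.main", url, "--scheme", json.dumps(scheme)]
    if output is not None:
        args += ["--output", output]
    return args


command = build_command(
    "https://news.ycombinator.com/", {"title": "h1"}, output="result.json"
)
# shlex.join quotes each argument so the line can be pasted into a POSIX shell.
print(shlex.join(command))
```

The argument list can also be passed directly to `subprocess.run(command)`, which avoids shell quoting altogether.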

More features

Check out further documentation to explore more features: