Welcome to Parsera
Parsera is a lightweight Python library for scraping websites with LLMs.
You can clone and run it locally or use the API, which scales better and adds extra features such as a built-in proxy.
If you want to use Parsera in your TypeScript application, check out our Parsera SDK.
Installation
Install the package with pip: pip install parsera
Basic usage
First, set the PARSERA_API_KEY environment variable (to run a custom LLM instead, see Custom Models).
You can do this from Python with:
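A minimal sketch using the standard library's os.environ. The key value is a placeholder assumed for illustration, not a real key:

```python
import os

# Placeholder value (an assumption for illustration); substitute your real key.
os.environ["PARSERA_API_KEY"] = "YOUR_PARSERA_API_KEY_HERE"
```

Setting the variable this way only affects the current Python process and any child processes it starts.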
Next, you can run a basic version:
from parsera import Parsera
url = "https://news.ycombinator.com/"
elements = {
"Title": "News title",
"Points": "Number of points",
"Comments": "Number of comments",
}
scraper = Parsera()
result = scraper.run(url=url, elements=elements)
The result variable will contain JSON with a list of records:
[
{
"Title":"Hacking the largest airline and hotel rewards platform (2023)",
"Points":"104",
"Comments":"24"
},
...
]
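In the sample above the numeric fields come back as strings. A small sketch of post-processing such records, assuming the output has the shape shown (the single record is copied from the example; the conversion step is an illustration, not part of Parsera):

```python
# Record copied from the sample output above.
records = [
    {
        "Title": "Hacking the largest airline and hotel rewards platform (2023)",
        "Points": "104",
        "Comments": "24",
    },
]

# Convert the numeric string fields to integers, leaving other fields untouched.
converted = [
    {**record, "Points": int(record["Points"]), "Comments": int(record["Comments"])}
    for record in records
]
```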
There is also an async method, arun, available:
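A hedged sketch of calling it, assuming arun accepts the same url and elements keyword arguments as run. The wrapper function name scrape_hacker_news is hypothetical:

```python
import asyncio


async def scrape_hacker_news():
    # Imported here so this module can be loaded before parsera is installed.
    from parsera import Parsera

    url = "https://news.ycombinator.com/"
    elements = {
        "Title": "News title",
        "Points": "Number of points",
        "Comments": "Number of comments",
    }
    scraper = Parsera()
    # Assumption: arun mirrors run's keyword arguments.
    return await scraper.arun(url=url, elements=elements)


# To execute the coroutine from synchronous code:
# result = asyncio.run(scrape_hacker_news())
```

Inside an already running event loop (for example in a notebook), await the coroutine directly instead of calling asyncio.run.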
Specify output types
You can specify the output types using the following schema:
from parsera import Parsera
url = "https://news.ycombinator.com/"
elements = {
"Title": {
"description": "News title",
"type": "string",
},
"Points": {
"description": "Number of points",
"type": "integer",
},
"Comments": {
"description": "Number of comments",
"type": "integer",
}
}
scraper = Parsera(typed=True)
result = scraper.run(url=url, elements=elements)
| Schema Type | Python Type |
|---|---|
| string | str |
| integer | int |
| number | float |
| bool | bool |
| list | list |
| object | dict |
| any | Model can return any type |
When typed is set to True, Parsera switches to the Structured Extractor.
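To make the type mapping concrete, here is a sketch of a checker that verifies a returned record against the schema. The helper matches_schema and the SCHEMA_TO_PYTHON table are hypothetical names written for this illustration; they are not part of Parsera:

```python
# Hypothetical helper, not part of Parsera: maps schema types to Python types.
SCHEMA_TO_PYTHON = {
    "string": str,
    "integer": int,
    "number": float,
    "bool": bool,
    "list": list,
    "object": dict,
}


def matches_schema(record, elements):
    """Return True if every declared field is present with its declared type."""
    for name, spec in elements.items():
        expected = SCHEMA_TO_PYTHON.get(spec["type"])
        if expected is None:
            # "any" (or an unknown type) places no constraint on the value.
            continue
        value = record.get(name)
        # bool is a subclass of int in Python, so reject it for "integer".
        if expected is int and isinstance(value, bool):
            return False
        if not isinstance(value, expected):
            return False
    return True


elements = {
    "Title": {"description": "News title", "type": "string"},
    "Points": {"description": "Number of points", "type": "integer"},
}

typed_ok = matches_schema({"Title": "Example", "Points": 104}, elements)      # True
untyped_ok = matches_schema({"Title": "Example", "Points": "104"}, elements)  # False
```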
Running with CLI
Before you run Parsera as a command-line tool, put your OPENAI_API_KEY in your environment variables or in a .env file.
Usage
You can configure the elements to parse using a JSON string or a FILE.
Optionally, you can provide a FILE to write the output to.
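Both input forms can be prepared from Python. The sketch below assumes the CLI expects the same name-to-description mapping used in the Python examples; the file name elements.json is a hypothetical choice:

```python
import json
from pathlib import Path

elements = {
    "Title": "News title",
    "Points": "Number of points",
    "Comments": "Number of comments",
}

# Option 1: a JSON string that can be passed on the command line.
elements_json = json.dumps(elements)

# Option 2: a file holding the same JSON (the file name is a hypothetical choice).
elements_path = Path("elements.json")
elements_path.write_text(json.dumps(elements, indent=2), encoding="utf-8")
```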
More features
Check out further documentation to explore more features: