Welcome to Parsera
Parsera is a lightweight Python library for scraping websites with LLMs.
You can clone and run it locally, or use the API, which scales better and adds some extra features such as a built-in proxy.
If you want to use Parsera in your TypeScript application, check out our Parsera SDK.
Installation
Basic usage
First, set the PARSERA_API_KEY environment variable (if you want to run a custom LLM instead, see Custom Models).
You can do this from Python with:
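A minimal sketch of setting the variable through the standard library's `os.environ`. The key value shown is a placeholder chosen for illustration, not a real key:

```python
import os

# Placeholder value (an assumption for illustration); substitute your own key.
os.environ["PARSERA_API_KEY"] = "YOUR_PARSERA_API_KEY_HERE"
```

A value set this way only applies to the current Python process and any child processes it starts.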
Next, you can run a basic version:
from parsera import Parsera

url = "https://news.ycombinator.com/"
elements = {
    "Title": "News title",
    "Points": "Number of points",
    "Comments": "Number of comments",
}

scraper = Parsera()
result = scraper.run(url=url, elements=elements)
The result variable will contain JSON with a list of records:
[
    {
        "Title": "Hacking the largest airline and hotel rewards platform (2023)",
        "Points": "104",
        "Comments": "24"
    },
    ...
]
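In the sample output above, Points and Comments arrive as strings. The sketch below shows one way to convert them before sorting. The `to_int` helper and the second record are hypothetical and only serve the illustration; they are not part of Parsera:

```python
# Sample records shaped like the output above (values are strings).
records = [
    {
        "Title": "Hacking the largest airline and hotel rewards platform (2023)",
        "Points": "104",
        "Comments": "24",
    },
    # Hypothetical second record, added only to make the sort visible.
    {"Title": "Example story", "Points": "37", "Comments": "5"},
]


def to_int(value, default=0):
    """Convert a scraped string such as "104" to int, falling back to default."""
    try:
        return int(str(value).strip())
    except ValueError:
        return default


# Order records from most to fewest points.
ranked = sorted(records, key=lambda r: to_int(r["Points"]), reverse=True)
```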
There is also an async method, arun, available:
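A minimal sketch of calling `arun` from a coroutine. It assumes `arun` accepts the same `url` and `elements` keyword arguments as `run`; the `scrape` wrapper is a hypothetical name used only here:

```python
import asyncio

url = "https://news.ycombinator.com/"
elements = {
    "Title": "News title",
    "Points": "Number of points",
    "Comments": "Number of comments",
}


async def scrape(url, elements):
    # Imported inside the function so the module loads even before parsera is installed.
    from parsera import Parsera

    scraper = Parsera()
    # Assumption: arun takes the same keyword arguments as run.
    return await scraper.arun(url=url, elements=elements)


# From synchronous code, the coroutine could be driven with:
# result = asyncio.run(scrape(url, elements))
```

Inside an already running event loop (for example, in a notebook), `await scrape(url, elements)` would be used instead of `asyncio.run`.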
Specify output types
You can specify the output types using the following schema:
from parsera import Parsera

url = "https://news.ycombinator.com/"
elements = {
    "Title": {
        "description": "News title",
        "type": "string",
    },
    "Points": {
        "description": "Number of points",
        "type": "integer",
    },
    "Comments": {
        "description": "Number of comments",
        "type": "integer",
    },
}

scraper = Parsera(typed=True)
result = scraper.run(url=url, elements=elements)
Schema Type | Python Type |
---|---|
string | str |
integer | int |
number | float |
bool | bool |
list | list |
object | dict |
any | Model can return any type |
When typed is set to True, Parsera switches to the Structured Extractor.
Running with CLI
Before you run Parsera as a command-line tool, don't forget to put your OPENAI_API_KEY in your environment variables or in a .env file.
Usage
You can configure the elements to parse using a JSON string or a FILE.
Optionally, you can provide a FILE to write the output to.
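As a sketch of preparing the two input forms mentioned above, the elements dictionary from the basic example can be serialized with the standard `json` module. The file name `elements.json` is a hypothetical choice, and no command-line flags are assumed here:

```python
import json
from pathlib import Path

elements = {
    "Title": "News title",
    "Points": "Number of points",
    "Comments": "Number of comments",
}

# JSON string form of the elements.
elements_json = json.dumps(elements)

# File form; "elements.json" is a hypothetical file name.
elements_path = Path("elements.json")
elements_path.write_text(json.dumps(elements, indent=2), encoding="utf-8")
```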
More features
Check out further documentation to explore more features: