Extractors
Different extractor types
There are different types of extractors that provide output in different formats:
- ChunksTabularExtractor - for tables, capable of processing larger pages with chunking
- TabularExtractor - for tables, without chunking (fails when the page doesn't fit into the model's context)
- ListExtractor - for separate lists of values.
- ItemExtractor - for specific values.
By default, a ChunksTabularExtractor is used.
Tabular Extractor
from parsera import Parsera
from parsera.engine.simple_extractor import TabularExtractor
extractor = TabularExtractor()
scraper = Parsera(extractor=extractor)
The result is a list of dictionaries, one per extracted row, for example:
[
{"name": "name1", "price": "100"},
{"name": "name2", "price": "150"},
{"name": "name3", "price": "300"},
]
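Because each row is a plain dictionary, the tabular output is easy to post-process with the standard library. The following is a minimal sketch, assuming the output has exactly the shape shown above; the `rows` sample is copied from that example rather than produced by a real scraping run, and the variable names are illustrative only.

```python
import csv
import io

# Sample rows in the same shape as the tabular output shown above.
rows = [
    {"name": "name1", "price": "100"},
    {"name": "name2", "price": "150"},
    {"name": "name3", "price": "300"},
]

# Prices are strings in this example, so convert them before doing arithmetic.
typed_rows = [{"name": r["name"], "price": int(r["price"])} for r in rows]
total = sum(r["price"] for r in typed_rows)

# Serialize the rows to CSV in memory.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["name", "price"], lineterminator="\n")
writer.writeheader()
writer.writerows(typed_rows)
csv_text = buffer.getvalue()
```

Real pages may return values that are not clean integers (currency symbols, thousands separators), so the `int(...)` conversion would need to be adapted to the data.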
Chunks Tabular Extractor
Provides the same output format as TabularExtractor, but can process larger pages by splitting them into chunks.
For example, if your model has a 16k context size, you can limit chunks to 12k tokens (keeping a 4k buffer for other parts of the prompt):
from parsera import Parsera
from parsera.engine.chunks_extractor import ChunksTabularExtractor
extractor = ChunksTabularExtractor(chunk_size=12000)
scraper = Parsera(extractor=extractor)
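To illustrate the general idea of chunking under a token budget, here is a minimal sketch. It is not Parsera's actual implementation: `split_into_chunks` and `approx_token_count` are hypothetical helpers written for this example, and the word-based counter is only a stand-in for a real tokenizer.

```python
from typing import Callable, List


def approx_token_count(text: str) -> int:
    """Stand-in counter: treats each whitespace-separated word as one token."""
    return len(text.split())


def split_into_chunks(
    text: str,
    chunk_size: int,
    token_counter: Callable[[str], int] = approx_token_count,
) -> List[str]:
    """Greedily pack lines into chunks of at most chunk_size tokens each.

    A single line that exceeds chunk_size on its own is kept as one oversized chunk.
    """
    chunks: List[str] = []
    current: List[str] = []
    current_tokens = 0
    for line in text.splitlines():
        line_tokens = token_counter(line)
        # Start a new chunk when adding this line would exceed the budget.
        if current and current_tokens + line_tokens > chunk_size:
            chunks.append("\n".join(current))
            current, current_tokens = [], 0
        current.append(line)
        current_tokens += line_tokens
    if current:
        chunks.append("\n".join(current))
    return chunks


# Budget from the example above: 16k context minus a 4k buffer for the rest of the prompt.
context_size = 16_000
prompt_buffer = 4_000
chunk_size = context_size - prompt_buffer

# Synthetic page content, five words per line.
page = "\n".join(f"item {i} costs {i * 10} dollars" for i in range(5000))
chunks = split_into_chunks(page, chunk_size)
```

With a real tokenizer, the token count of a joined chunk can differ slightly from the sum of its per-line counts, so leaving some headroom in the buffer is a reasonable precaution.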
By default, the number of tokens is counted with the OpenAI tokenizer for the gpt-4o model, but you can provide a custom function for counting tokens:
import tiktoken

def count_tokens(text):
    # cl100k_base is the GPT-4 / GPT-3.5-turbo encoding; GPT-4o models use o200k_base
    encoding = tiktoken.get_encoding("cl100k_base")
    # Count tokens
    tokens = encoding.encode(text)
    return len(tokens)

# Assumes Parsera and ExtractorType are already imported
scraper = Parsera(extractor=ExtractorType.CHUNKS_TABULAR, chunk_size=12000, token_counter=count_tokens)
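If the model in use has no tiktoken encoding, a dependency-free estimate may be enough for sizing chunks. The sketch below is an assumption-laden example rather than part of Parsera: `approx_count_tokens` is a hypothetical name, and the 4-characters-per-token ratio is only a rough rule of thumb for English text. It follows the same shape as `count_tokens` above (one string in, one integer out).

```python
import math


def approx_count_tokens(text: str) -> int:
    """Estimate the token count using a rough 4-characters-per-token heuristic."""
    return math.ceil(len(text) / 4)
```

Since the estimate can undercount for markup-heavy or non-English pages, pairing it with a smaller chunk size than the strict budget allows is a safer choice.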
List Extractor
from parsera import Parsera
from parsera.engine.simple_extractor import ListExtractor
extractor = ListExtractor()
scraper = Parsera(extractor=extractor)