Module bastionlab.tokenizers
Sub-modules
Classes
BastionLabTokenizers()
Methods
from_hugging_face_pretrained(self, model_name: str, revision: Optional[str] = 'main', auth_token: Optional[str] = None) ‑> [bastionlab.tokenizers.remote_tokenizers](remote_tokenizers.md).RemoteTokenizer
Loads a Hugging Face tokenizer from the given checkpoint name.
Args:
- model_name (str): Model name.
- revision (str, optional, default='main'): A branch name or commit ID.
- auth_token (str, optional, default=None): An optional auth token used to access private repositories on the Hugging Face Hub.
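A minimal sketch of how this method might be called, using only the parameters documented above. It assumes `tokenizers` is an already-constructed `BastionLabTokenizers` instance; how that instance is obtained from a client connection is not covered in this section. The helper `load_tokenizer` and the class `StubTokenizers` are hypothetical names introduced here so the snippet runs without a server.

```python
from typing import Any, Optional


def load_tokenizer(
    tokenizers: Any,
    model_name: str,
    revision: str = "main",
    auth_token: Optional[str] = None,
) -> Any:
    """Forward the documented arguments to from_hugging_face_pretrained."""
    return tokenizers.from_hugging_face_pretrained(
        model_name, revision=revision, auth_token=auth_token
    )


class StubTokenizers:
    """Hypothetical local stand-in for BastionLabTokenizers (illustration only)."""

    def from_hugging_face_pretrained(self, model_name, revision="main", auth_token=None):
        # The real method returns a RemoteTokenizer; the stub just echoes its inputs.
        return {"model_name": model_name, "revision": revision, "auth_token": auth_token}


if __name__ == "__main__":
    handle = load_tokenizer(StubTokenizers(), "distilbert-base-uncased")
    print(handle["model_name"], handle["revision"])
```

With a real `BastionLabTokenizers` instance in place of the stub, the return value would be a `RemoteTokenizer` rather than a dictionary.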
RemoteTokenizer()
Represents a tokenizer on the server.
Instance variables
decoder
model
normalizer
padding
post_processor
pre_tokenizer
truncation
Methods
enable_padding(_self, *args, **kwargs)
enable_truncation(_self, *args, **kwargs)
encode(self, rdf: RemoteLazyFrame, add_special_tokens: bool = True) ‑> Tuple[bastionlab.polars.RemoteArray, bastionlab.polars.RemoteArray]
Encodes a RemoteLazyFrame into a pair of tokenized RemoteArrays.
Args:
- rdf (RemoteLazyFrame): The RemoteLazyFrame containing the string sequences to be tokenized.
- add_special_tokens (bool, default=True): Whether to add the special tokens.

Returns:
- Tuple[RemoteArray, RemoteArray]: The tokenized entries. The first RemoteArray contains the input_ids and the second contains the attention_mask.
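The return shape described above can be illustrated with a short sketch. It assumes `tokenizer` exposes the documented `encode(rdf, add_special_tokens=...)` method and returns a two-element tuple. `StubRemoteTokenizer` is a hypothetical local stand-in that works on plain Python lists instead of a `RemoteLazyFrame`, and its whitespace splitting and ID assignment are invented for illustration; the real tokenizer runs on the server and returns `RemoteArray` objects.

```python
from typing import Any, Dict, List, Tuple


def encode_column(tokenizer: Any, rdf: Any, add_special_tokens: bool = True) -> Tuple[Any, Any]:
    """Call encode and unpack the documented (input_ids, attention_mask) tuple."""
    input_ids, attention_mask = tokenizer.encode(rdf, add_special_tokens=add_special_tokens)
    return input_ids, attention_mask


class StubRemoteTokenizer:
    """Hypothetical local stand-in for RemoteTokenizer (illustration only)."""

    def __init__(self, pad_to: int = 4) -> None:
        self.pad_to = pad_to
        self.vocab: Dict[str, int] = {}

    def encode(self, rdf: List[str], add_special_tokens: bool = True):
        # add_special_tokens is accepted but ignored by this stub.
        ids, mask = [], []
        for text in rdf:
            # Assign each new word the next free ID; ID 0 is reserved for padding.
            row = [self.vocab.setdefault(w, len(self.vocab) + 1) for w in text.split()]
            row = row[: self.pad_to]
            pad = self.pad_to - len(row)
            ids.append(row + [0] * pad)
            # 1 marks a real token, 0 marks a padding position.
            mask.append([1] * len(row) + [0] * pad)
        return ids, mask


if __name__ == "__main__":
    input_ids, attention_mask = encode_column(StubRemoteTokenizer(pad_to=3), ["hello world", "hello"])
    print(input_ids)
    print(attention_mask)
```

Both returned arrays have one row per input string, so they can be passed together to whatever consumes the tokenized batch.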
get_vocab(_self, *args, **kwargs)
get_vocab_size(_self, *args, **kwargs)
no_padding(_self, *args, **kwargs)
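The padding and truncation methods are documented only as `(*args, **kwargs)`, so the sketch below rests on an assumption: that they accept the same keyword arguments as the Hugging Face `tokenizers` methods of the same name (`length` for `enable_padding`, `max_length` for `enable_truncation`). `configure_fixed_length` and `RecordingTokenizer` are hypothetical names; the recorder only logs calls so the snippet runs without a server.

```python
from typing import Any, Dict, List, Tuple


def configure_fixed_length(tokenizer: Any, max_length: int) -> None:
    """Pad and truncate every sequence to max_length.

    Assumption: keyword names mirror the Hugging Face tokenizers API.
    """
    tokenizer.enable_padding(length=max_length)
    tokenizer.enable_truncation(max_length=max_length)


class RecordingTokenizer:
    """Hypothetical stand-in that records which methods were called."""

    def __init__(self) -> None:
        self.calls: List[Tuple[str, Dict[str, Any]]] = []

    def enable_padding(self, *args: Any, **kwargs: Any) -> None:
        self.calls.append(("enable_padding", kwargs))

    def enable_truncation(self, *args: Any, **kwargs: Any) -> None:
        self.calls.append(("enable_truncation", kwargs))

    def no_padding(self, *args: Any, **kwargs: Any) -> None:
        self.calls.append(("no_padding", kwargs))


if __name__ == "__main__":
    tok = RecordingTokenizer()
    configure_fixed_length(tok, 32)
    tok.no_padding()  # turn padding back off
    for name, kwargs in tok.calls:
        print(name, kwargs)
```

Fixing both settings to the same length is one way to get rectangular `input_ids` and `attention_mask` outputs from `encode`; whether that suits a given model depends on its maximum sequence length.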