Vocabulary object to be used with nvtext::wordpiece_tokenizer.
#include <wordpiece_tokenize.hpp>
Public Member Functions
wordpiece_vocabulary (cudf::strings_column_view const &input, rmm::cuda_stream_view stream=cudf::get_default_stream(), rmm::device_async_resource_ref mr=cudf::get_current_device_resource_ref())
Vocabulary object constructor.
Vocabulary object to be used with nvtext::wordpiece_tokenizer.
Use nvtext::load_wordpiece_vocabulary to create this object.
Definition at line 36 of file wordpiece_tokenize.hpp.
nvtext::wordpiece_vocabulary::wordpiece_vocabulary (
    cudf::strings_column_view const & input,
    rmm::cuda_stream_view stream = cudf::get_default_stream(),
    rmm::device_async_resource_ref mr = cudf::get_current_device_resource_ref()
)
Vocabulary object constructor.
Token ids are the row indices within the vocabulary column. Each vocabulary entry is expected to be unique; otherwise, the behavior is undefined.
Exceptions
std::invalid_argument | if vocabulary contains nulls or is empty
Parameters
input | Strings for the vocabulary
stream | CUDA stream used for device memory operations and kernel launches
mr | Device memory resource used to allocate the returned column's device memory
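The contract described above (token ids equal row indices, and a vocabulary that is empty or contains nulls is rejected with std::invalid_argument) can be illustrated with a small host-side sketch. This is not the nvtext implementation and does not use cudf: the class name host_vocabulary_sketch, its token_id method, the -1 "not found" value, and the use of std::optional to model null rows are all assumptions made for illustration only.

```cpp
#include <cstddef>
#include <optional>
#include <stdexcept>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical host-side stand-in for the documented contract.
// It is NOT nvtext::wordpiece_vocabulary and runs entirely on the CPU.
class host_vocabulary_sketch {
 public:
  // Each element models one row of the vocabulary column;
  // std::nullopt models a null row (an assumption of this sketch).
  explicit host_vocabulary_sketch(std::vector<std::optional<std::string>> const& input)
  {
    if (input.empty()) { throw std::invalid_argument("vocabulary is empty"); }
    for (std::size_t row = 0; row < input.size(); ++row) {
      if (!input[row].has_value()) {
        throw std::invalid_argument("vocabulary contains nulls");
      }
      // The token id is the row index. The documented behavior for
      // duplicate entries is undefined; this sketch simply keeps the
      // first row in which an entry appears.
      ids_.try_emplace(*input[row], static_cast<int>(row));
    }
  }

  // Returns the row index of `token`, or -1 if it is not in the
  // vocabulary (the -1 convention is specific to this sketch).
  int token_id(std::string const& token) const
  {
    auto const it = ids_.find(token);
    return it == ids_.end() ? -1 : it->second;
  }

 private:
  std::unordered_map<std::string, int> ids_;
};
```

With rows "[UNK]", "hello", "##ing" in that order, token_id returns 0, 1, and 2 respectively in this sketch. In actual use, the object is created from a cudf strings column on the device via nvtext::load_wordpiece_vocabulary, as noted in the class description.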