8.7.2.3. sklearn.feature_extraction.text.CharNGramAnalyzer
- class sklearn.feature_extraction.text.CharNGramAnalyzer(charset='utf-8', preprocessor=RomanPreprocessor(), min_n=3, max_n=6, charset_error='strict')
Compute character n-gram features of a text document.
This analyzer is language agnostic: it works well even for languages where word segmentation is less trivial than in English, such as Chinese and German.
Because of this, it can be considered a basic morphological analyzer.
Parameters :
charset: string :
If bytes are given to analyze, this charset is used to decode.
min_n: integer :
The lower boundary of the range of n-values for different n-grams to be extracted.
max_n: integer :
The upper boundary of the range of n-values for different n-grams to be extracted. All values of n such that min_n <= n <= max_n will be used.
preprocessor: callable :
A callable that preprocesses the text document before tokens are extracted.
charset_error: {‘strict’, ‘ignore’, ‘replace’} :
Instruction on what to do if a byte sequence given to analyze contains bytes that cannot be decoded with the given charset. By default, it is ‘strict’, meaning that a UnicodeDecodeError will be raised. Other values are ‘ignore’ and ‘replace’.
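The three charset_error values have the same names as Python's standard codec error handlers. The sketch below shows what each handler does with plain bytes.decode. It assumes the analyzer passes charset and charset_error straight to the decoder, which this page does not state. The sample bytes are invented for illustration.

```python
# "café" encoded as Latin-1 yields b'caf\xe9', which is not valid UTF-8.
raw = "café".encode("latin-1")

# 'strict' raises UnicodeDecodeError on the undecodable byte.
try:
    raw.decode("utf-8", "strict")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# 'ignore' silently drops the undecodable byte.
ignored = raw.decode("utf-8", "ignore")

# 'replace' substitutes U+FFFD (the Unicode replacement character).
replaced = raw.decode("utf-8", "replace")

print(strict_failed, repr(ignored), repr(replaced))
```

With ‘ignore’ the n-grams are computed over a shortened string, and with ‘replace’ they may contain the replacement character, so either choice changes the extracted features.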
Methods
analyze(text_document) : From documents to token
get_params([deep]) : Get parameters for the estimator
set_params(**params) : Set the parameters of the estimator.
- __init__(charset='utf-8', preprocessor=RomanPreprocessor(), min_n=3, max_n=6, charset_error='strict')
- analyze(text_document)
From documents to tokens (here, character n-grams).
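To make the role of min_n and max_n concrete, here is a minimal standalone sketch of character n-gram extraction with a sliding window. The function char_ngrams is a hypothetical helper, not part of scikit-learn, and it deliberately omits the decoding and preprocessing steps the analyzer applies first, so its output may differ from that of analyze.

```python
def char_ngrams(text, min_n=3, max_n=6):
    """Return every character n-gram of `text` with min_n <= n <= max_n."""
    ngrams = []
    # An n-gram cannot be longer than the text, so cap the upper bound.
    for n in range(min_n, min(max_n, len(text)) + 1):
        # Slide a window of width n across the text, one character at a time.
        for i in range(len(text) - n + 1):
            ngrams.append(text[i:i + n])
    return ngrams


tokens = char_ngrams("hello", min_n=2, max_n=3)
print(tokens)  # ['he', 'el', 'll', 'lo', 'hel', 'ell', 'llo']
```

A text shorter than min_n yields no n-grams at all in this sketch.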
- get_params(deep=True)
Get parameters for the estimator
Parameters :
deep: boolean, optional :
If True, will return the parameters for this estimator and contained subobjects that are estimators.
- set_params(**params)
Set the parameters of the estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.
Returns : self :
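The <component>__<parameter> naming convention can be illustrated with a small toy model. The classes Component and Wrapper below are hypothetical and are not scikit-learn code. They only show, under simplified assumptions, how a double-underscore key can be routed to a contained object and why returning self allows chaining.

```python
class Component:
    """Hypothetical leaf object with one tunable parameter."""

    def __init__(self, min_n=3):
        self.min_n = min_n

    def set_params(self, **params):
        for name, value in params.items():
            setattr(self, name, value)
        return self


class Wrapper:
    """Hypothetical nested object holding a component named 'analyzer'."""

    def __init__(self, analyzer):
        self.analyzer = analyzer

    def set_params(self, **params):
        for key, value in params.items():
            # Split "<component>__<parameter>" at the first double underscore.
            name, sep, sub_key = key.partition("__")
            if sep:
                # Delegate the remainder of the key to the named component.
                getattr(self, name).set_params(**{sub_key: value})
            else:
                setattr(self, name, value)
        return self


wrapper = Wrapper(Component())
result = wrapper.set_params(analyzer__min_n=2)
print(wrapper.analyzer.min_n)  # 2
```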