8.7.2.2. sklearn.feature_extraction.text.WordNGramAnalyzer

class sklearn.feature_extraction.text.WordNGramAnalyzer(charset='utf-8', min_n=1, max_n=1, preprocessor=RomanPreprocessor(), stop_words='english', token_pattern=ur'\b\w\w+\b', charset_error='strict')

Simple analyzer: transforms a text document into a sequence of word tokens.

This simple implementation does:
  • lower case conversion
  • unicode accent removal
  • token extraction, using unicode regexp word boundaries, for tokens of a minimum size of 2 symbols (by default)
  • output of token n-grams (unigrams only by default)

The stop_words argument may be “english” for a built-in list of English stop words, or a collection of strings. Note that stop word filtering is performed after preprocessing, which may include accent stripping.
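For illustration, here is a minimal sketch of typical usage under the 0.11-era API documented on this page (the exact token ordering shown is an assumption based on the description above; this class does not exist in later scikit-learn releases):

>>> from sklearn.feature_extraction.text import WordNGramAnalyzer
>>> # unigrams and bigrams; stop_words=None keeps every token so the
>>> # output is easy to predict
>>> analyzer = WordNGramAnalyzer(min_n=1, max_n=2, stop_words=None)
>>> analyzer.analyze(u"Cats And Dogs")
[u'cats', u'and', u'dogs', u'cats and', u'and dogs']

The unigrams come first, followed by the space-joined bigrams.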

Parameters :

charset: string :

If bytes are given to analyze, this charset is used to decode them.

min_n: integer :

The lower boundary of the range of n-values for different n-grams to be extracted.

max_n: integer :

The upper boundary of the range of n-values for different n-grams to be extracted. All values of n such that min_n <= n <= max_n will be used.

preprocessor: callable :

A callable that preprocesses the text document before tokens are extracted.

stop_words: string, list, or None :

If a string, it is passed to _check_stop_list and the appropriate stop list is returned. The default is “english” and is currently the only supported string value. If a list, that list is assumed to contain stop words, all of which will be removed from the resulting tokens. If None, no stop words will be used.

token_pattern: string :

Regular expression denoting what constitutes a “token”. The default, ur'\b\w\w+\b' (a raw string, so \b is a regexp word boundary rather than a backspace), matches runs of two or more word characters.

charset_error: {‘strict’, ‘ignore’, ‘replace’} :

Instruction on what to do if a byte sequence is given to analyze that contains characters not of the given charset. By default, it is ‘strict’, meaning that a UnicodeDecodeError will be raised. Other values are ‘ignore’ and ‘replace’.
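As a sketch of how the decoding parameters interact (the exact tokens shown are an assumption following the pipeline described above): bytes input is decoded with charset before preprocessing, and charset_error decides what happens to undecodable bytes.

>>> analyzer = WordNGramAnalyzer(charset='ascii', charset_error='ignore',
...                              stop_words=None)
>>> analyzer.analyze('caf\xc3\xa9 au lait')  # undecodable bytes are dropped
[u'caf', u'au', u'lait']

With charset_error='strict' (the default) the same call would instead raise a UnicodeDecodeError.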

Methods

analyze(text_document) Transform a document into a sequence of word tokens.
get_params([deep]) Get parameters for the estimator.
set_params(**params) Set the parameters of the estimator.

__init__(charset='utf-8', min_n=1, max_n=1, preprocessor=RomanPreprocessor(), stop_words='english', token_pattern=ur'\b\w\w+\b', charset_error='strict')
analyze(text_document)

Transform a document into a sequence of word tokens.

get_params(deep=True)

Get parameters for the estimator

Parameters :

deep: boolean, optional :

If True, will return the parameters for this estimator and contained subobjects that are estimators.
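For example (a sketch; get_params returns the constructor parameters as a dict):

>>> analyzer = WordNGramAnalyzer(min_n=1, max_n=2)
>>> analyzer.get_params()['max_n']
2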

set_params(**params)

Set the parameters of the estimator.

The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it is possible to update each component of a nested object.

Returns :

self :
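A sketch of both forms (the vectorizer below is hypothetical and only illustrates the nested <component>__<parameter> syntax):

>>> analyzer = WordNGramAnalyzer()
>>> _ = analyzer.set_params(max_n=2)  # simple estimator form; returns self
>>> analyzer.max_n
2
>>> # nested form, for an object exposing this analyzer as a component
>>> # named 'analyzer' (hypothetical):
>>> # vectorizer.set_params(analyzer__max_n=2)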