Erase punctuation from text and documents
newDocuments = erasePunctuation(documents) erases punctuation and symbols from documents. If a word is empty after removing punctuation and symbol characters, then the function removes it. For tokenized document input, the function erases punctuation from tokens with type 'punctuation' and 'other'. For example, the function does not erase punctuation and symbol characters from URLs and email addresses.
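As a minimal sketch of the basic syntax, the following applies erasePunctuation to a string and to a tokenized document. The sample text is made up for illustration, and the variable names (str, newStr) are assumptions, not part of the reference. No output is shown because the result depends on how the tokenizer classifies each token.

```matlab
% String input: punctuation and symbol characters are erased from the text.
str = "it's one and/or two.";
newStr = erasePunctuation(str)

% Tokenized input: by default, only tokens of type 'punctuation' and 'other'
% are affected, so a token detected as a web address should keep its characters.
documents = tokenizedDocument("Read the docs at https://www.mathworks.com, then start!");
newDocuments = erasePunctuation(documents)
```

Comparing newDocuments with documents is one way to check which tokens were changed or removed.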
newDocuments = erasePunctuation(documents,'TokenTypes',types) erases punctuation and symbols from only the specified token types.
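The sketch below illustrates the 'TokenTypes' option on a made-up sentence. It uses tokenDetails to inspect the type assigned to each token first; the choice of 'punctuation' as the only type is an assumption made for the example, not a recommendation from the reference.

```matlab
% Hypothetical sample text containing ordinary punctuation and a hashtag.
documents = tokenizedDocument("Great talk today!!! #textanalytics (slides soon)");

% Inspect which type the tokenizer assigned to each token.
tdetails = tokenDetails(documents)

% Erase punctuation only from tokens of type 'punctuation';
% tokens of type 'other' are left untouched in this call.
newDocuments = erasePunctuation(documents,'TokenTypes','punctuation')
```

Restricting the token types this way can be useful when tokens of type 'other' carry symbols that should be kept.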
For string input, erasePunctuation removes punctuation characters from URLs and HTML tags. This behavior can prevent the functions eraseTags, eraseURLs, and decodeHTMLEntities from working as expected. If you want to use these functions to preprocess your text, then use these functions before using erasePunctuation.
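The ordering advice above can be sketched as a short preprocessing pipeline for string input. The sample HTML snippet is hypothetical, and the relative order of eraseTags, eraseURLs, and decodeHTMLEntities shown here is one reasonable choice rather than a documented requirement; the only point taken from the text is that erasePunctuation comes last.

```matlab
% Hypothetical raw text containing an HTML tag, a URL, and an HTML entity.
str = "<p>Fish &amp; chips: see https://www.example.com for details.</p>";

% Run the tag, URL, and entity functions while the markup is still intact.
str = eraseTags(str);
str = eraseURLs(str);
str = decodeHTMLEntities(str);

% Erase punctuation last, so it cannot break the patterns used above.
str = erasePunctuation(str)
```

If erasePunctuation ran first, characters such as "<", ">", "&", ";", ":" and "/" would already be gone, and the other functions would have nothing recognizable to match.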
See also: decodeHTMLEntities | eraseTags | eraseURLs | lower | tokenizedDocument | upper