Tokenization is the process of dividing a string into smaller parts called tokens.
Tokenization is the first step toward solving problems such as text classification, sentiment analysis, and smart chatbots using the Natural Language Toolkit (NLTK).
NLTK provides a tokenize module (built around its tokenizer interface), which supports two main kinds of tokenization:
- word tokenization
- sentence tokenization
NLTK Tokenizer Package
The tokenizer package is used to find the words, sentences, and punctuation in a given string.
To import this package in Python we use
import nltk.tokenize
Note that word_tokenize and sent_tokenize rely on the pre-trained Punkt model, which can be downloaded once with nltk.download('punkt').
1. Word Tokenize Using nltk.tokenize
In this step we split the input text into words.
from nltk.tokenize import word_tokenize
input_string = '''Random-access memory (RAM /ræm/) is a form of computer memory that can be read and changed in any order,
typically used to store working data and machine code. It cost around $40.99 of 8gb.'''
# applying the word_tokenize method
output = word_tokenize(input_string)
print(output)
Result:
['Random-access', 'memory', '(', 'RAM', '/ræm/', ')', 'is', 'a', 'form', 'of', 'computer', 'memory', 'that', 'can', 'be', 'read', 'and', 'changed', 'in', 'any', 'order', ',', 'typically', 'used', 'to', 'store', 'working', 'data', 'and', 'machine', 'code', '.', 'It', 'cost', 'around', '$', '40.99', 'of', '8gb', '.']
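For comparison, Python's built-in str.split() only splits on whitespace, so punctuation stays glued to the neighbouring words; word_tokenize separates it into tokens of its own. A minimal sketch using the last sentence of the sample text:

```python
text = "It cost around $40.99 of 8gb."

# Plain whitespace splitting keeps punctuation attached to words,
# unlike word_tokenize, which emits '$', '40.99' and '.' separately.
naive = text.split()
print(naive)  # ['It', 'cost', 'around', '$40.99', 'of', '8gb.']
```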
Word + Punctuation Tokenize Using nltk.tokenize
In this step we split the input text into words and punctuation. Unlike word_tokenize, wordpunct_tokenize splits on every non-alphanumeric character, so hyphenated words like 'Random-access' are broken apart.
from nltk.tokenize import wordpunct_tokenize
input_string = '''Random-access memory (RAM /ræm/) is a form of computer memory that can be read and changed in any order,
typically used to store working data and machine code. It cost around $40.99 of 8gb.'''
output = wordpunct_tokenize(input_string)
print(output)
Result:
['Random', '-', 'access', 'memory', '(', 'RAM', '/', 'ræm', '/)', 'is', 'a', 'form', 'of', 'computer', 'memory', 'that', 'can', 'be', 'read', 'and', 'changed', 'in', 'any', 'order', ',', 'typically', 'used', 'to', 'store', 'working', 'data', 'and', 'machine', 'code', '.', 'It', 'cost', 'around', '$', '40', '.', '99', 'of', '8gb', '.']
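Under the hood, wordpunct_tokenize is a regular-expression tokenizer based on the pattern \w+|[^\w\s]+ (runs of word characters, or runs of non-space punctuation). As a sketch, the same split can be reproduced with the re module alone:

```python
import re

# Approximates wordpunct_tokenize: match runs of word characters
# OR runs of punctuation (anything that is neither word nor space).
pattern = r"\w+|[^\w\s]+"
tokens = re.findall(pattern, "Random-access memory (RAM /ræm/)")
print(tokens)
# ['Random', '-', 'access', 'memory', '(', 'RAM', '/', 'ræm', '/)']
```

Note how '/' and ')' fuse into a single '/)'token: adjacent punctuation characters form one run, matching the result above.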
2. Sentence Tokenize Using nltk.tokenize
In this step we split input text into sentences.
from nltk.tokenize import sent_tokenize
input_string = '''Random-access memory (RAM /ræm/) is a form of computer memory that can be read and changed in any order,
typically used to store working data and machine code. It cost around $40.99 of 8gb.'''
output = sent_tokenize(input_string)
print(output)
Result:
['Random-access memory (RAM /ræm/) is a form of computer memory that can be read and changed in any order, \ntypically used to store working data and machine code.', 'It cost around $40.99 of 8gb.']
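sent_tokenize uses the pre-trained Punkt model, which has learned about abbreviations and other sentence-boundary quirks. A naive regex split on sentence-ending punctuation looks similar at first, but breaks on abbreviations that Punkt would typically handle. A hypothetical comparison (the sample sentence below is made up for illustration):

```python
import re

text = "Dr. Smith bought 8gb of RAM. It cost around $40.99."

# Naive rule: split after '.', '!' or '?' when followed by whitespace.
# This wrongly treats the period in the abbreviation "Dr." as a
# sentence boundary, which sent_tokenize would typically avoid.
naive = re.split(r"(?<=[.!?])\s+", text)
print(naive)
# ['Dr.', 'Smith bought 8gb of RAM.', 'It cost around $40.99.']
```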