Tokenization is the process of dividing a string into smaller parts called tokens.
Tokenization is the first step toward solving problems such as text classification, sentiment analysis, and smart chatbots using the Natural Language Toolkit (NLTK).
NLTK provides a tokenize module (built around its tokenizer interface), which supports two main kinds of tokenization:
- word tokenization
- sentence tokenization
NLTK Tokenizer Package
The tokenizer package is used to find the words, sentences, and punctuation in a given string.
To import this package in Python we use
import nltk.tokenize
Note that word_tokenize and sent_tokenize rely on the pre-trained Punkt model, which can be downloaded once with nltk.download('punkt').
1. Word Tokenize Using nltk.tokenize
In this step we split the input text into words.
from nltk.tokenize import word_tokenize
input_string = '''Random-access memory (RAM /ræm/) is a form of computer memory that can be read and changed in any order,
typically used to store working data and machine code. It cost around $40.99 of 8gb.'''
# applying the word_tokenize method
output = word_tokenize(input_string)
print(output)
Result:
['Random-access', 'memory', '(', 'RAM', '/ræm/', ')', 'is', 'a', 'form', 'of', 'computer', 'memory', 'that', 'can', 'be', 'read', 'and', 'changed', 'in', 'any', 'order', ',', 'typically', 'used', 'to', 'store', 'working', 'data', 'and', 'machine', 'code', '.', 'It', 'cost', 'around', '$', '40.99', 'of', '8gb', '.']
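For comparison, Python's built-in str.split() only splits on whitespace, so punctuation stays glued to the neighbouring words; word_tokenize separates it into tokens of its own. A minimal sketch using the last sentence of the sample text:

```python
text = "It cost around $40.99 of 8gb."

# Plain whitespace splitting keeps punctuation attached to words,
# unlike word_tokenize, which emits '$', '40.99' and '.' separately.
naive = text.split()
print(naive)  # ['It', 'cost', 'around', '$40.99', 'of', '8gb.']
```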
Word + Punctuation Tokenize Using nltk.tokenize
In this step we split the input text into words and punctuation. Unlike word_tokenize, wordpunct_tokenize splits on every non-alphanumeric character, so hyphenated words like 'Random-access' are broken apart.
from nltk.tokenize import wordpunct_tokenize
input_string = '''Random-access memory (RAM /ræm/) is a form of computer memory that can be read and changed in any order,
typically used to store working data and machine code. It cost around $40.99 of 8gb.'''
output = wordpunct_tokenize(input_string)
print(output)
Result:
['Random', '-', 'access', 'memory', '(', 'RAM', '/', 'ræm', '/)', 'is', 'a', 'form', 'of', 'computer', 'memory', 'that', 'can', 'be', 'read', 'and', 'changed', 'in', 'any', 'order', ',', 'typically', 'used', 'to', 'store', 'working', 'data', 'and', 'machine', 'code', '.', 'It', 'cost', 'around', '$', '40', '.', '99', 'of', '8gb', '.']
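Under the hood, wordpunct_tokenize is a regular-expression tokenizer based on the pattern \w+|[^\w\s]+ (runs of word characters, or runs of non-space punctuation). As a sketch, the same split can be reproduced with the re module alone:

```python
import re

# Approximates wordpunct_tokenize: match runs of word characters
# OR runs of punctuation (anything that is neither word nor space).
pattern = r"\w+|[^\w\s]+"
tokens = re.findall(pattern, "Random-access memory (RAM /ræm/)")
print(tokens)
# ['Random', '-', 'access', 'memory', '(', 'RAM', '/', 'ræm', '/)']
```

Note how '/' and ')' fuse into a single '/)'token: adjacent punctuation characters form one run, matching the result above.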
2. Sentence Tokenize Using nltk.tokenize
In this step we split input text into sentences.
from nltk.tokenize import sent_tokenize
input_string = '''Random-access memory (RAM /ræm/) is a form of computer memory that can be read and changed in any order,
typically used to store working data and machine code. It cost around $40.99 of 8gb.'''
output = sent_tokenize(input_string)
print(output)
Result:
['Random-access memory (RAM /ræm/) is a form of computer memory that can be read and changed in any order, \ntypically used to store working data and machine code.', 'It cost around $40.99 of 8gb.']
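sent_tokenize uses the pre-trained Punkt model, which has learned about abbreviations and other sentence-boundary quirks. A naive regex split on sentence-ending punctuation looks similar at first, but breaks on abbreviations that Punkt would typically handle. A hypothetical comparison (the sample sentence below is made up for illustration):

```python
import re

text = "Dr. Smith bought 8gb of RAM. It cost around $40.99."

# Naive rule: split after '.', '!' or '?' when followed by whitespace.
# This wrongly treats the period in the abbreviation "Dr." as a
# sentence boundary, which sent_tokenize would typically avoid.
naive = re.split(r"(?<=[.!?])\s+", text)
print(naive)
# ['Dr.', 'Smith bought 8gb of RAM.', 'It cost around $40.99.']
```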