DATA PRE-PROCESSING -2 (TEXT VECTORIZATION-1)
In this story, we are going to learn about another data pre-processing problem and a technique for solving it.
WE ALL KNOW THAT CALCULATIONS CAN BE DONE ONLY ON NUMBERS, NOT ON SENTENCES OR WORDS.
So what happens if we have “words” or “sentences” as values in our dataset? The main point is this: “we want to give those feature values as input to build our model”.
DON’T WORRY. We have a way to convert “sentences” and “words” into numerical data. That way is called “TEXT VECTORIZATION”.
WHAT YOU WILL LEARN:
WHAT IS TEXT VECTORIZATION.
STEPS TO DO TEXT VECTORIZATION.
SO, LET’S START...
Here we can see that the data (features) present are “words” and “sentences”. If we want to build a course recommendation system, we are going to need these columns: (course_title, course_organization, course_rating, course_difficult).
TEXT VECTORIZER:
It is a “tool” used to convert text into numerical data. It is used in NATURAL LANGUAGE PROCESSING (NLP) to turn text into numerical representations so that our machine can learn from it and we can do computation on it.
SO, THESE ARE THE STEPS INVOLVED IN TEXT VECTORIZATION:
TOKENIZATION: — It means breaking down the group of words in a document into individual words, called tokens. This is done so that each word can be “indexed” and used by the subsequent processes.
EX: - In the above image you can see that the sentence is converted into individual tokens and indexed.
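As a quick preview of the code coming in the next story, the tokenization step can be sketched in plain Python. This is a minimal regex-based version; in practice libraries like NLTK provide a more robust `word_tokenize`.

```python
import re

def tokenize(text):
    # Break a sentence into individual word tokens.
    # (A real tokenizer, e.g. NLTK's word_tokenize, handles punctuation
    # and contractions more carefully; this sketch just grabs word runs.)
    return re.findall(r"[A-Za-z']+", text)

tokens = tokenize("I love Machine Learning")
print(tokens)  # ['I', 'love', 'Machine', 'Learning']

# Each unique token can then be "indexed" for the later steps:
index = {tok: i for i, tok in enumerate(dict.fromkeys(tokens))}
print(index)   # {'I': 0, 'love': 1, 'Machine': 2, 'Learning': 3}
```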
NORMALIZATION: - It is used to put the words in the dataset into one consistent form (for example, all lowercase), without any distinction, so that the machine can perform its task correctly.
EX: - You can see that the above picture shows the conversion of text into normalized text. If you watch closely, you can see that the abbreviation ML is also converted into lowercase. This can be avoided by using certain libraries (such as NLTK - NATURAL LANGUAGE TOOLKIT) and creating our own rule.
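A custom rule like the one just described can be sketched as below. This is only an illustration (the abbreviation list `keep_as_is` is a made-up example, not part of any library):

```python
def normalize(text, keep_as_is=("ML", "NLP")):
    # Lowercase every word, but leave listed abbreviations untouched --
    # a simple hand-made rule, like the custom rules mentioned above.
    return " ".join(
        w if w in keep_as_is else w.lower()
        for w in text.split()
    )

print(normalize("I Love ML"))  # 'i love ML'  (ML keeps its capitals)
```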
STOP WORD REMOVAL: - This involves removing unnecessary or unhelpful words like (the, that, this, what, why) ETC. Removing these values keeps the dataset’s features simple and minimal so that the computation can be done effectively.
In the above image you can see that words like (a, for, can, be, i, so) are removed because they are not needed by the machine. This prevents unnecessary data from being given as input to the machine.
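Stop word removal is just a filter over the token list. Here is a minimal sketch with a small hand-written stop word set (real libraries like NLTK ship much larger lists):

```python
# A tiny illustrative stop word set; NLTK's stopwords corpus is far larger.
STOP_WORDS = {"a", "an", "the", "for", "can", "be", "i", "so",
              "that", "this", "what", "why"}

def remove_stop_words(tokens):
    # Drop common words that carry little meaning for the model.
    return [t for t in tokens if t.lower() not in STOP_WORDS]

print(remove_stop_words(["i", "can", "be", "a", "data", "scientist"]))
# ['data', 'scientist']
```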
STEMMING: - Here we remove the “prefix” and “suffix” of a word and convert it into its root form. This is necessary because when we tokenize the words, the same word should not be “indexed” twice, and words with the same meaning should not be tokenized twice.
In the above example you can see that the word “dogs” is converted into “dog”. This is done because otherwise the machine would think that the words “dog” and “dogs” have different meanings and would give separate indexes to the features (dogs and dog). That makes the dataset more complex and increases the error in the machine’s work.
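A very crude stemmer can be written as simple suffix stripping. This sketch is only for intuition; real stemming uses something like NLTK's `PorterStemmer`, which implements the full Porter algorithm:

```python
def simple_stem(word):
    # Strip a few common suffixes to reach a rough root form.
    # (Much cruder than the Porter stemmer: e.g. it leaves
    # "running" as "runn", where Porter would give "run".)
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

print(simple_stem("dogs"))  # 'dog' -- so "dog" and "dogs" share one index
```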
GUYS, PLEASE HOLD ON, I WILL FINISH IN A MINUTE.
VECTORIZATION: - This is the awaited part, in which we convert the “words” into a “numerical representation". There are various methods to do this, but the one we are going to discuss is the BAG OF WORDS (BOW) approach.
EX: In the above example, we can see that the count of each word is recorded. This is the output of the BAG OF WORDS (BOW) approach.
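To make the idea concrete before the next story, here is a minimal bag-of-words sketch using only Python's standard library (libraries like scikit-learn offer a ready-made `CountVectorizer` for this):

```python
from collections import Counter

def bag_of_words(sentences):
    # Build a shared vocabulary across all sentences,
    # then count how often each vocabulary word appears in each one.
    vocab = sorted({w for s in sentences for w in s.lower().split()})
    vectors = []
    for s in sentences:
        counts = Counter(s.lower().split())
        vectors.append([counts[w] for w in vocab])
    return vocab, vectors

vocab, vectors = bag_of_words(["the dog saw the cat", "the cat ran"])
print(vocab)    # ['cat', 'dog', 'ran', 'saw', 'the']
print(vectors)  # [[1, 1, 0, 1, 2], [1, 0, 1, 0, 1]]
```

Each sentence becomes a vector of word counts over the shared vocabulary, which is exactly the numerical representation the model needs.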
THIS IS THE LAST STEP. I WILL GIVE MORE INFO ABOUT THIS IN MY NEXT STORY.
WHOA!! THAT’S A LOT OF INFO, BUT I AM SURE YOU LEARNED SOMETHING VALUABLE.
WHAT YOU LEARNED:
WHAT IS TEXT VECTORIZATION
WHAT ARE THE STEPS IN TEXT VECTORIZATION
IN THE NEXT STORY I WILL SHARE THE CODE AND TEACH YOU ABOUT THE “BAG OF WORDS (BOW)” APPROACH. LINK TO THE NEXT STORY: - TEXT VECTORIZATION-2
THANK YOU FOR SPENDING YOUR TIME WITH ME….