1/29/2022

Text Prediction

In this project I analyzed three sources of text data and built a text prediction model using Natural Language Processing algorithms. The trained model predicts the five most probable next words based on the input text (one or more words).

Algorithms of this type are commonly used in mobile phones, tablets, and other devices with digital keyboards.

The goal of text prediction is to increase typing speed while maintaining high accuracy. Higher accuracy requires more training data, but a larger model takes longer to run.

Data Summary

This project utilized three sources of text data with various sizes.

Source    Number of lines   Max line length (chars)   Size
twitter   2,360,148         140                       318.99 MB
blogs     899,288           40,833                    255.35 MB
news      1,010,242         11,384                    257.34 MB


N-gram frequency table of a 1% random sample

##   one_gram two_grams    three_grams
## 1      the    of the     one of the
## 2       to    in the       a lot of
## 3      and    to the thanks for the
## 4        a   for the    going to be
## 5       of    on the     out of the
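The frequency tables above were built in R; purely as an illustration (not the project's actual code), counting n-grams from whitespace-tokenized text can be sketched in Python like this:

```python
from collections import Counter

def ngram_counts(text, n):
    """Count n-grams (as word tuples) in a whitespace-tokenized text."""
    words = text.lower().split()
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))

# Example: the three most frequent bigrams in a tiny sample
sample = "one of the best days and one of the worst days"
top_bigrams = ngram_counts(sample, 2).most_common(3)
```

Running the same counter for n = 1, 2, 3 over the 1% sample yields tables like the one shown above.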

Prediction Algorithm

I trained the prediction model on a training set consisting of 70% of a 10% sample of the data, and tested the model's accuracy on the remaining test set.

I used the sbo package in R to train the prediction model. The sbo library provides utilities for building and evaluating text predictors based on Stupid Back-off N-gram models. (Technical note)

The model takes the last three words of the user's input, looks up that word triplet in the 3-gram frequency table, and assigns it a frequency score. If no match is found, it backs off to shorter (N-1)-grams, down to unigrams; each back-off step applies a penalty to the score. Finally, the model displays the five words with the highest scores.
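The actual model uses the sbo package in R; as an illustration only (not the package's implementation), the back-off scoring described above can be sketched in Python. The penalty constant 0.4 is the standard Stupid Back-off value and an assumption here, as are the function names:

```python
LAMBDA = 0.4  # standard Stupid Back-off penalty applied at each back-off step

def sbo_score(word, context, counts):
    """Score `word` following `context` (a tuple of words).

    `counts` maps word tuples of every order (1-grams up to 4-grams)
    to their corpus frequencies.
    """
    if not context:
        # No context left: fall back to the unigram relative frequency
        total = sum(c for k, c in counts.items() if len(k) == 1)
        return counts.get((word,), 0) / total if total else 0.0
    full = context + (word,)
    if counts.get(full, 0) > 0:
        return counts[full] / counts[context]
    # No match: drop the leftmost context word and apply the penalty
    return LAMBDA * sbo_score(word, context[1:], counts)

def predict(context, counts, vocab, k=5):
    """Return the k highest-scoring next-word candidates."""
    scored = [(w, sbo_score(w, context, counts)) for w in vocab]
    return sorted(scored, key=lambda p: -p[1])[:k]
```

With the frequency tables loaded into `counts`, `predict(("one", "of"), counts, vocab)` would return the five most likely continuations.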

Application

A simple demo of how the prediction model works is available as a Shiny Web-App.

The user types a few words or a partial sentence and clicks the submit button. The prediction algorithm then displays five one-word suggestions to continue the sentence.

Code is available at the Github Repo