How to Make a Text Summarizer – Intro to Deep Learning #10

November 24, 2019 By Stanley Isaacs


(siraj) Hello, world!
It’s Siraj, and
we’re going to make
an app that reads
an article of text
and creates a one
sentence summary out
of it using the power of
natural language processing.
Language is in many ways
the seat of intelligence.
It’s the original
communication protocol
that we invented to
describe all the incredibly
complex processes
happening in our neocortex.
Do you ever feel
like you’re getting
flooded with an increasing
amount of articles and links
and videos to choose from?
As this data grows, the
importance of semantic density
does as well.
How can you say the
most important things
in the shortest amount of time?
Having a generated
summary lets you
decide whether you want to
deep dive further or not.
And the better it
gets, the more we’ll
be able to apply it to
more complex language,
like that in a scientific
paper or even an entire book.
The future of NLP is
a very bright one.
Interestingly enough, one of the
earliest use cases for machine
summarization was by
the Canadian government
in the early 90s
for a weather system
they invented called FoG.
Instead of sifting through
all the meteorological data
they had access
to manually, they
let FoG read it and generate
a weather forecast from it
on a recurring basis.
It had a set textual
template and it
would fill in the values
for the current weather
given the data,
something like this.
It was just an
experiment, but they
found that sometimes
people actually
prefer the computer generated
forecasts to the human ones,
partly because the
generated ones use
more consistent terminology.
A similar approach has
been applied in fields
with lots of data that
needs human readable
summaries, like finance.
And in medicine, summarizing
a patient’s medical data
has proven to be a
great decision support
tool for doctors.
Most summarization tools in
the past were extractive:
they selected an existing
subset of words or numbers
from some data to
create a summary.
But you and I do something a
little more complex than that.
When we summarize,
our brain builds
an internal semantic
representation
of what we’ve just
read and from that, we
can generate a summary.
This is instead an
abstractive method
and we can do this
with deep learning.
What can’t we do with it?
So let’s build a
text summarizer that
can generate a headline from
a short article using Keras.
We’re going to use this
collection of news articles
as our training data.
We’ll convert it
to pickle format,
which essentially
means converting it
into a raw byte stream.
Pickling is a way of
converting a Python
object into a byte stream
so we can easily reconstruct
that object in another Python
script.
Modularity for the win.
We’re saving the data as a tuple
with the heading, description,
and keywords.
The heading and description
are the lists of headings
and their respective
articles in order.
And the keywords
are akin to tags,
but we won’t be using
those in this example.
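Here is a minimal sketch of that pickling step using Python's standard pickle module. The two toy articles and the variable names are made up for illustration; the real dataset is much larger.

```python
import pickle

# Toy stand-ins for the real dataset (illustrative values, not from the video).
heads = ["Rain expected on the coast", "Markets close higher"]
desc = [
    "Forecasters say heavy rain will reach the coast by Friday.",
    "Stocks rose on Tuesday after strong earnings reports.",
]
keywords = None  # kept in the tuple, unused in this example

# Serialize the tuple to bytes. pickle.dump(obj, file) writes the same bytes
# to a file opened in binary mode ("wb").
payload = pickle.dumps((heads, desc, keywords))

# Any other Python script can rebuild the original objects from those bytes.
loaded_heads, loaded_desc, loaded_keywords = pickle.loads(payload)
```

Only unpickle data you trust, since loading a pickle can run arbitrary code.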
We’re going to first tokenize,
or split up the text,
into individual
words, because that’s
the level at which we’ll
deal with this data.
Our headline will be
generated one word at a time.
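As a rough sketch, tokenizing can be as simple as lowercasing and splitting on whitespace, then giving every word an integer id. The `<unk>` placeholder and the helper names below are assumptions for illustration, not the exact preprocessing used in the video.

```python
from collections import Counter

UNKNOWN = "<unk>"  # hypothetical placeholder for words we have never seen


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately crude tokenizer)."""
    return text.lower().split()


def build_vocab(texts):
    """Map each word to an integer id, most frequent first; id 0 is UNKNOWN."""
    counts = Counter(tok for text in texts for tok in tokenize(text))
    word2idx = {UNKNOWN: 0}
    for word, _ in counts.most_common():
        word2idx[word] = len(word2idx)
    return word2idx


def encode(text, word2idx):
    """Turn a string into a list of integer ids, using 0 for unknown words."""
    return [word2idx.get(tok, 0) for tok in tokenize(text)]


texts = ["the storm hit the coast", "the coast is clear"]
word2idx = build_vocab(texts)
ids = encode("the storm is huge", word2idx)
```

Real pipelines also handle punctuation and casing more carefully, but the idea is the same: text in, list of integers out.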
We want some way of representing
these words numerically.
Bengio coined the term
for this, word
embeddings, back in 2003,
but they were first
made popular by a team
of researchers at Google
when they released word2vec,
inspired by Boyz II Men.
Just kidding.
Word2vec is a two-layer neural
net trained on a big unlabeled
text corpus.
It’s a pre-trained
model you can download.
It takes a word as its
input and produces a vector
as its output, one
vector per word.
Creating word vectors lets us
analyze words mathematically.
So these high
dimensional vectors
represent words
and each dimension
encodes a different property,
like gender or title.
The magnitude along each
axis represents the relevance
of that property to a word.
So we could say king minus
man plus woman equals queen.
We can also find the
similarity between words,
which equates to distance.
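To make the vector arithmetic concrete, here is a toy sketch with hand-made two-dimensional vectors. Real word vectors have hundreds of dimensions and are learned, not written by hand, so the numbers below are purely illustrative.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1 means same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hand-made vectors: [royalty, gender], with gender +1 for male, -1 for female.
vectors = {
    "king": [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man": [0.0, 1.0],
    "woman": [0.0, -1.0],
}

# king - man + woman, computed dimension by dimension.
target = [
    k - m + w
    for k, m, w in zip(vectors["king"], vectors["man"], vectors["woman"])
]

# The word whose vector points in the most similar direction to the target.
nearest = max(vectors, key=lambda word: cosine_similarity(vectors[word], target))
```

In this toy space, subtracting man removes the male component and adding woman puts the female one in, so the result lands on the vector for queen.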
Word2vec offers a
predictive approach
to creating word vectors,
but another approach
is count based.
And a popular algorithm
for that is GloVe,
short for global vectors.
It first constructs a large
co-occurrence matrix of words
by context.
For each word, i.e.
row, it will count
how frequently it sees
it in some context, which
is the column.
Since the number of
contexts can be large,
it factorizes the matrix to
get a lower dimensional matrix,
which represents
words by features.
So each row is the feature
representation of one word.
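Here is a small sketch of just the counting step, assuming the context of a word is simply the words within a fixed window around it. The window size and the two-sentence corpus are made up, and the factorization step that GloVe performs afterwards is not shown.

```python
from collections import defaultdict


def cooccurrence_counts(sentences, window=2):
    """counts[word][context] = how often context appears within the window."""
    counts = defaultdict(lambda: defaultdict(int))
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            start = max(0, i - window)
            stop = min(len(tokens), i + window + 1)
            for j in range(start, stop):
                if j != i:  # a word is not its own context
                    counts[word][tokens[j]] += 1
    return counts


corpus = ["the cat sat on the mat", "the dog sat on the rug"]
counts = cooccurrence_counts(corpus, window=1)
```

Each outer key plays the role of a row and each inner key a column, so `counts["sat"]["on"]` is one cell of the matrix.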
And they also trained it
on a large text corpus.
Both perform similarly well,
but GloVe trains a little faster
so we’ll go with that.
We’ll download the
pre-trained GloVe word vectors
from this link and
save them to disk.
Then we’ll use them to
initialize an embedding matrix
with our tokenized vocabulary
from our training data.
We’ll initialize it
with random numbers, then
copy all the GloVe weights
of words that show up
in our training vocabulary.
And for every word outside
this embedding matrix,
we’ll find the closest
word inside the matrix
by measuring the cosine
distance of GloVe vectors.
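A rough sketch of those two steps, using a tiny hand-written table in place of the downloaded GloVe file. The function names, the random range, and the toy vectors are all assumptions for illustration.

```python
import math
import random


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def build_embedding_matrix(vocab, glove, dim, seed=0):
    """Start from small random rows, then copy in GloVe vectors where known."""
    rng = random.Random(seed)
    matrix = [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in vocab]
    for row, word in enumerate(vocab):
        if word in glove:
            matrix[row] = list(glove[word])
    return matrix


def nearest_in_vocab(word, vocab, glove):
    """Closest vocabulary word to an outside word, by GloVe cosine similarity."""
    candidates = [w for w in vocab if w in glove]
    return max(candidates, key=lambda w: cosine_similarity(glove[w], glove[word]))


# Toy stand-in for the pre-trained GloVe table (real vectors are much longer).
glove = {"cat": [1.0, 0.1], "car": [0.0, 1.0], "kitten": [0.9, 0.2]}
vocab = ["cat", "car", "siraj"]  # "siraj" has no vector in this toy table

matrix = build_embedding_matrix(vocab, glove, dim=2)
substitute = nearest_in_vocab("kitten", vocab, glove)
```

Here the word kitten is outside the vocabulary, so it gets mapped to cat, its closest neighbor inside the matrix. Highest cosine similarity is the same thing as smallest cosine distance.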
Now we’ve got this
matrix of word embeddings
that we could do so
many things with.
So how are we going to use
these word embeddings to create
a summary headline for a
novel article we feed it?
Let’s back up for a second.
[INAUDIBLE] first introduced
a neural architecture called
sequence to sequence in 2014.
That later inspired
the Google Brain team
to use it for text
summarization successfully.
It’s called sequence to sequence
because we are taking an input
sequence and outputting
not a single value,
but a sequence as well.
[SINGING] We gonna
encode, then we decode.
We gonna encode, then we decode.
When I feed it a book,
it gets vectorized,
and when I decode
that, I’m mesmerized.
So we use two
recurrent networks,
one for each sequence.
The first is the
encoder network.
It takes an input
sequence and creates
an encoded representation of it.
The second is the
decoder network.
We feed it as its input that
same encoded representation
and it will generate an output
sequence by decoding it.
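To show the shape of the idea, here is a toy encoder and decoder in plain Python. It uses a simple tanh recurrent step in place of an LSTM, random untrained weights, and a made-up size of 3 for both word vectors and hidden state, so it is a sketch of the data flow only, not a working summarizer.

```python
import math
import random

SIZE = 3  # toy size shared by word vectors and hidden state (an assumption)


def random_matrix(rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(SIZE)] for _ in range(SIZE)]


def rnn_step(x, h, w_in, w_rec):
    """One recurrent step: new_h = tanh(w_in @ x + w_rec @ h)."""
    return [
        math.tanh(
            sum(w * xj for w, xj in zip(w_in[i], x))
            + sum(w * hj for w, hj in zip(w_rec[i], h))
        )
        for i in range(SIZE)
    ]


def encode(inputs, w_in, w_rec):
    """Read the whole input sequence and return the final hidden state."""
    h = [0.0] * SIZE
    for x in inputs:
        h = rnn_step(x, h, w_in, w_rec)
    return h


def decode(encoded, steps, w_in, w_rec):
    """Start from the encoded state and emit one output vector per step."""
    h = encoded
    x = [0.0] * SIZE  # stand-in for a start-of-headline input
    outputs = []
    for _ in range(steps):
        h = rnn_step(x, h, w_in, w_rec)
        outputs.append(h)
        x = h  # feed the previous output back in as the next input
    return outputs


rng = random.Random(0)
enc_in, enc_rec = random_matrix(rng), random_matrix(rng)
dec_in, dec_rec = random_matrix(rng), random_matrix(rng)

article = [[0.1, 0.2, 0.3], [0.0, -0.4, 0.5], [0.7, 0.1, -0.2]]  # 3 word vectors
encoded = encode(article, enc_in, enc_rec)
headline_states = decode(encoded, steps=4, w_in=dec_in, w_rec=dec_rec)
```

The input has three steps and the output has four, which is the point: the two sequences do not need to be the same length.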
There are different ways we
can approach this architecture.
One approach would be to let
our encoder network learn
these embeddings from scratch
by feeding it our training data.
But we’re taking a less
computationally expensive
approach, because we already
have learned embeddings
from GloVe.
When we build our
encoder LSTM network,
we’ll set those
pre-trained embeddings
as our first layer’s weights.
The embedding layer is
meant to turn input integers
into fixed size vectors anyway.
We’ve just given it a huge
head start by doing this.
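Conceptually, an embedding layer is just a table lookup: each integer id selects one row of the weight matrix. Here is a minimal sketch with a made-up three-word matrix; in the real model the rows would be the GloVe-initialized weights.

```python
def embed(token_ids, embedding_matrix):
    """Replace each integer id with its row of the embedding matrix."""
    return [embedding_matrix[i] for i in token_ids]


# Illustrative weights: one fixed-size row per word id.
embedding_matrix = [
    [0.0, 0.0, 0.0],   # id 0
    [0.2, -0.1, 0.4],  # id 1
    [0.7, 0.3, -0.5],  # id 2
]

vectors = embed([2, 1, 2], embedding_matrix)
```

Training then nudges the rows of that matrix like any other weights.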
And when we train this
model, it will just
fine-tune our embeddings,
improving their accuracy.
We treat this as a supervised
classification problem where
the input data is
our set of vocab words
and the labels are
their associated headline words.
We’ll minimize the cross-entropy
loss using rmsprop.
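For a feel of what is being minimized, here is a small sketch of the cross-entropy loss over a softmax, plus a single RMSprop update on one weight. The scores, learning rate, and decay value are illustrative assumptions; in practice the framework computes all of this for you.

```python
import math


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, target_index):
    """Loss is low when the correct word gets high probability."""
    return -math.log(probs[target_index])


def rmsprop_step(weight, grad, cache, lr=0.001, decay=0.9, eps=1e-8):
    """Scale the step by a running average of recent squared gradients."""
    cache = decay * cache + (1 - decay) * grad * grad
    weight = weight - lr * grad / (math.sqrt(cache) + eps)
    return weight, cache


scores = [2.0, 0.5, -1.0]  # made-up scores for a 3-word vocabulary
probs = softmax(scores)
loss_if_word_0_is_correct = cross_entropy(probs, 0)
loss_if_word_2_is_correct = cross_entropy(probs, 2)

new_weight, new_cache = rmsprop_step(weight=0.5, grad=0.2, cache=0.0)
```

The model scored word 0 highest, so the loss is smaller when word 0 is the right answer than when word 2 is.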
Now, for our decoder.
Our decoder will
generate headlines.
It will have the same LSTM
architecture as our encoder
and we’ll initialize
its weights using
our same pre-trained
GloVe embeddings.
It will take as input
the vector representation
generated after feeding in the
last word of the input text.
So it will first generate
its own representation
using its embedding layer.
And the next step is to
convert this representation
into a word, but there is
actually one more step.
We need a way to decide
what part of the input
we need to remember,
like names and numbers.
We talked about the
importance of memory.
That’s why we use LSTM cells.
But another important aspect of
learning theory is attention.
Basically, what is the most
relevant data to memorize?
Our decoder will generate
a word as its output
and that same word
will be fed in
as input when generating
the next word until we
have a headline.
We use an attention mechanism
when outputting each word
in the decoder.
For each output word,
it computes a weight
over each of the
input words that
determines how much
attention should
be paid to that input word.
All the weights
sum up to 1 and are
used to compute a
weighted average
of the last hidden layers
generated after processing
each of the input words.
We’ll take that weighted average
and input it into the softmax
layer along with the last hidden
layer from the current step
of the decoder.
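Here is a minimal sketch of that attention step. The video does not say how each input word is scored, so this assumes a plain dot product between the decoder state and each encoder state; the tiny vectors are made up for illustration.

```python
import math


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(decoder_state, encoder_states):
    """Return attention weights and the weighted average of encoder states."""
    # Assumed scoring rule: dot product of decoder state with each encoder state.
    scores = [
        sum(d * e for d, e in zip(decoder_state, state))
        for state in encoder_states
    ]
    weights = softmax(scores)  # one weight per input word, summing to 1
    context = [
        sum(w * state[i] for w, state in zip(weights, encoder_states))
        for i in range(len(decoder_state))
    ]
    return weights, context


encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # one per input word
decoder_state = [1.0, 0.0]  # hidden state at the current decoder step

weights, context = attend(decoder_state, encoder_states)

# The weighted average and the decoder state go into the softmax layer together.
softmax_layer_input = context + decoder_state
```

The first and third input words line up with the decoder state, so they get more attention than the second.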
So let’s see what our model
generates for this article
after training.
All right, we’ve got this
headline generated beautifully.
And let’s do it once more
for a different article.
Couldn’t have said
it better myself.
So, to break it down, we can
use pre-trained word vectors
from a model like GloVe
to avoid having to create them
ourselves.
To generate an output sequence
of words given an input
sequence of words, we use
a neural encoder-decoder
architecture.
And by adding an attention
mechanism to our decoder,
we can help it decide which
token is most relevant
to focus on when
generating new text.
The winner of the coding
challenge from the last video
is Jie Xun See.
He wrote an AI composer
in 100 lines of code.
Last week’s challenge
was non-trivial
and he managed to get
a working demo up.
So definitely
check out his repo.
Wizard of the week.
The coding challenge
for this video
is to use a sequence
to sequence model
with Keras to summarize
a piece of text.
Post your GitHub
link in the comments
and I’ll announce the
winner next video.
Please subscribe for more
programming videos and for now,
I’ve got to remember
to pay attention.
So thanks for watching.