A Beginner's Guide to Understanding the Development of Conversational AI

By Marie Haynes
7 min read

How wild is it that we have learned to communicate with machines in natural language? I recently watched this talk by Jeff Dean, learned so much about the progression of AI in this area, and was inspired to write this article.

Throughout, I'll share some important papers to study if you want to learn more about how AI has developed over the years.

💡
Conversational AI is artificial intelligence that allows computers to understand and generate human-like text or speech.

Let's walk through some of the research that led to where we are today in conversational AI.

Word2Vec: Representing Words in Vector Space

Before computers could truly understand human language, they first needed a way to represent words in a way they could process.

Word2Vec, introduced by Mikolov et al. in 2013, provided a breakthrough by representing words as dense vectors in a continuous vector space.

📖
Recommended Reading: The Word2Vec paper.

Word embedding is a way of turning words into numbers so that computers can understand them. It's like giving each word a secret code. This code isn't random; it's designed to capture how words relate to each other. The system figures this out by looking at how often words appear together in large amounts of text. This is what is meant by "co-occurrence patterns."

Think of it this way: imagine each word as a point on a vast map. Words that are used in similar contexts, like "king" and "queen," would be located near each other, while words with different meanings, such as "king" and "table," would be farther apart. This is because the algorithm learns to associate words based on the company they keep.

I think it is wild that you can do math with these vectors. If you take the vector for "king", subtract the vector for "man", and add the vector for "woman", you end up remarkably close to the vector for "queen."
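If you want to try this yourself, here is a minimal sketch using the gensim library and its downloadable pretrained Google News Word2Vec vectors (an assumption on my part that you have gensim installed; the vectors are a large download, roughly 1.6 GB). The exact similarity score will vary, but "queen" comes out on top:

```python
import gensim.downloader as api

# Pretrained Word2Vec vectors trained on Google News articles (~1.6 GB download).
word_vectors = api.load("word2vec-google-news-300")

# king - man + woman ≈ queen
result = word_vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1)
print(result)  # e.g. [('queen', 0.71...)]
```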

Word2Vec uses a shallow neural network architecture with two main learning models:

  • Continuous Bag-of-Words (CBOW): This model predicts a target word based on the surrounding context words. For example, if the context is "the royal family included a ____", the model might predict "king" or "queen" as the target word.
  • Skip-gram: This model does the opposite, predicting the surrounding context words given a target word. If the target word is "king", it might predict "royal", "family", "crown", and other related words in the context.

By training these models on massive amounts of text, Word2Vec learns vector representations that capture semantic similarities and analogies between words. This means that the vectors themselves start to encode the meaning of the word.
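As a rough sketch of what training looks like in practice, here is how you might train tiny CBOW and skip-gram models with the gensim library. The toy sentences and parameters below are made up purely for illustration; a real model would be trained on millions of sentences:

```python
from gensim.models import Word2Vec

# Toy "corpus" - a real Word2Vec model is trained on massive amounts of text.
sentences = [
    ["the", "royal", "family", "included", "a", "king", "and", "a", "queen"],
    ["the", "king", "wore", "a", "golden", "crown"],
    ["the", "queen", "addressed", "the", "royal", "court"],
    ["the", "table", "was", "set", "for", "dinner"],
]

# sg=0 trains the CBOW model (predict a word from its surrounding context);
# sg=1 trains the skip-gram model (predict the context from a target word).
cbow_model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0)
skipgram_model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)

print(cbow_model.wv["king"][:5])               # the first few numbers in "king"'s code
print(skipgram_model.wv.most_similar("king"))  # words the tiny model relates to "king"
```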

Sequence-to-Sequence Learning: Modeling Sequential Data

While Word2Vec was a major step forward in representing words, it had limitations, notably its inability to capture the order of words in a sentence. To address this, sequence-to-sequence (seq2seq) learning was developed to explicitly model the sequential nature of language. This approach, popularized by Sutskever et al. in 2014, uses recurrent neural networks (RNNs), particularly Long Short-Term Memory (LSTM) networks, to process input sequences and generate output sequences of varying lengths.

A typical seq2seq model consists of two main components:

  • Encoder: The encoder takes an input, like a sentence, and turns it into a simplified summary that captures the main meaning. It's like squeezing all the important information into a single, compact package.
  • Decoder: The decoder takes that summary from the encoder and uses it to create an output, like a translated sentence. If you think of the encoder as "understanding" the input, the decoder is "speaking" the output, using that understanding as its guide.

Essentially, the encoder reads and understands the input, while the decoder speaks or writes the output based on that understanding. This process allows seq2seq models to handle complex tasks such as translation and text summarization.
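To make the two roles concrete, here is a minimal sketch of an encoder-decoder pair built from PyTorch LSTMs. The vocabulary size, dimensions, and start-token id are invented for illustration; real systems add things like attention, beam search, and proper tokenization:

```python
import torch
import torch.nn as nn

# Hypothetical sizes, for illustration only.
VOCAB_SIZE = 10_000   # number of tokens in the (assumed) vocabulary
EMBED_DIM = 256
HIDDEN_DIM = 512

class Encoder(nn.Module):
    """Reads the input sentence and compresses it into a single state (the 'summary')."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB_SIZE, EMBED_DIM)
        self.lstm = nn.LSTM(EMBED_DIM, HIDDEN_DIM, batch_first=True)

    def forward(self, src_tokens):
        embedded = self.embed(src_tokens)          # (batch, src_len, embed)
        _, (hidden, cell) = self.lstm(embedded)    # final state = the compact package
        return hidden, cell

class Decoder(nn.Module):
    """Generates the output sentence one token at a time, guided by the summary."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB_SIZE, EMBED_DIM)
        self.lstm = nn.LSTM(EMBED_DIM, HIDDEN_DIM, batch_first=True)
        self.out = nn.Linear(HIDDEN_DIM, VOCAB_SIZE)

    def forward(self, tgt_tokens, hidden, cell):
        embedded = self.embed(tgt_tokens)
        output, (hidden, cell) = self.lstm(embedded, (hidden, cell))
        return self.out(output), hidden, cell      # scores over the vocabulary

# Usage sketch: encode a (fake) source sentence, then decode a few tokens step by step.
encoder, decoder = Encoder(), Decoder()
src = torch.randint(0, VOCAB_SIZE, (1, 7))         # one sentence of 7 made-up token ids
hidden, cell = encoder(src)
next_token = torch.zeros(1, 1, dtype=torch.long)   # assumed <start> token id 0
for _ in range(5):                                 # generate 5 output tokens
    logits, hidden, cell = decoder(next_token, hidden, cell)
    next_token = logits.argmax(dim=-1)             # feed the prediction back in
```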

Seq2seq models have achieved remarkable success in machine translation, where they learn to map sentences from one language to another. For example, a seq2seq model can take an English sentence as input, encode it into a vector, and then decode that vector into its French equivalent. This was a significant leap from earlier approaches that could only represent individual words.

In essence, seq2seq learning moved the field from word-level representations to sentence-level representations, which made it possible to perform more complex language tasks, like translation. However, seq2seq models have limitations, such as struggling with long-range dependencies in language and the bottleneck that can occur when processing very long input sequences.

Neural Conversational Models: Closer to Human-Like Dialogue

Neural conversational models build upon the principles of sequence-to-sequence (seq2seq) learning to generate more natural and engaging dialogues. These models are trained on large conversational datasets, such as movie subtitles or chat logs, to learn patterns and structures in human conversations. In contrast to seq2seq models, which focus on tasks like translation, neural conversational models are explicitly designed to generate conversational responses.

📖
Recommended Reading: A Neural Conversational Model

One of the early influential neural conversational models was proposed by Vinyals and Le in 2015. This model uses a seq2seq architecture to predict the next sentence in a conversation, given the previous sentences. This approach demonstrated the potential of neural networks for generating human-like conversations. The model could find a solution to a technical problem through conversation on a domain-specific IT helpdesk dataset. It could also perform simple forms of common sense reasoning on a noisy open-domain movie transcript dataset.

These models use a recurrent network to process the input sequence. Recurrent models process input step by step, using the output of each step as input for the next, which gives them a kind of memory. However, this step-by-step processing limits their ability to do parallel processing, in other words, to perform multiple calculations at the same time.
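Here is a small sketch of that step-by-step behaviour using a PyTorch LSTM cell. The sizes are arbitrary; the point is simply that step t cannot begin until step t-1 has finished:

```python
import torch
import torch.nn as nn

# Illustration of why recurrent models are hard to parallelize:
# the hidden state at each step depends on the hidden state from the previous step.
cell = nn.LSTMCell(input_size=8, hidden_size=16)

tokens = torch.randn(5, 1, 8)       # five made-up word vectors (batch of 1, 8 dims each)
h = torch.zeros(1, 16)              # the model's "memory" starts out empty
c = torch.zeros(1, 16)

for t in range(tokens.size(0)):     # strictly one step after another
    h, c = cell(tokens[t], (h, c))  # step t reuses the memory produced at step t-1
    print(f"processed token {t}, memory shape {tuple(h.shape)}")
```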

Enter the next exciting development...

The Transformer Revolution: Attention is all you need

Google's Transformer paper, published in 2017, represents a significant shift in the field of AI: it replaced recurrent neural networks with attention mechanisms.

📖
Recommended Reading: Google's Attention is All You Need Paper.

Recurrent models process input sequentially, updating their state with each new token, with each step depending on the previous one. To process the third word in a sentence, you first need to process the second word, which in turn requires processing the first.

Transformer models, on the other hand, process all input data in parallel. Instead of a single state, they keep representations of all the words and use attention mechanisms to focus on the relevant parts of the input when generating the output.

Transformer models use a self-attention mechanism to understand the relationship between words in a sentence. Self-attention allows each word in a sentence to pay attention to every other word in the sentence. This helps the model to understand the context of a word in relation to the others around it.
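Here is a bare-bones sketch of self-attention in PyTorch. For simplicity it reuses the same vectors as queries, keys, and values; a real Transformer learns separate projection matrices for each and uses multiple attention heads:

```python
import torch
import torch.nn.functional as F

def self_attention(x):
    """Scaled dot-product self-attention over a sequence of word vectors.

    x has shape (sequence_length, model_dim): one vector per word.
    """
    d = x.size(-1)
    scores = x @ x.transpose(0, 1) / d ** 0.5  # how strongly each word attends to every other word
    weights = F.softmax(scores, dim=-1)        # each row sums to 1
    return weights @ x                         # every output mixes information from all the words

# Four "words", each represented by an 8-dimensional vector (random values for the demo).
words = torch.randn(4, 8)
contextualized = self_attention(words)
print(contextualized.shape)  # torch.Size([4, 8])
```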

This architecture led to dramatic improvements in the performance of conversational AI.

The T in ChatGPT stands for Transformer. In fact, most of the conversational AI we use today is built on the Transformer architecture.

Towards a Human-like Open-Domain Chatbot

This model, known as Meena, was introduced by Google in 2020. It marked a significant step in the evolution of conversational AI by using a transformer model instead of a recurrent one.

📖
Recommended Reading: Towards a Human-like Open-Domain Chatbot

The key feature of this model was that it was trained on conversational data, which resulted in responses that were both more sensible and specific. Unlike previous models that often generated generic or inconsistent responses, this model was able to produce more relevant and engaging replies. This was an important step toward creating open-domain chatbots capable of more human-like interactions.

Gemini

Google announced Gemini in 2023, calling it a significant milestone in the development of AI.

Gemini has multimodal capabilities. Unlike previous models that focused mainly on text, Gemini turns text, images, audio, and video into sequences of tokens on which a transformer-based model is trained.

Gemini's ability to integrate and understand multiple types of data marks a significant step towards more human-like interaction with AI, enabling machines to understand and respond to a wider range of inputs.

Gemini sets us up for all sorts of new ways of communicating with machines. I cannot wait to try Project Astra - AI in glasses that understands the world around us and allows us to learn and communicate in new ways.

2025 is expected to be the year of AI Agents - a whole new level in not only communicating with machines, but collaborating with them on complex tasks and goals. I believe we are heading into a time where the world changes in ways we cannot currently comprehend.

We've come a long way!

It's truly incredible to reflect on the fact that we've reached the point where we can communicate with machines using our own natural language. From the early days of representing words with numbers to building models that can carry on full conversations, the progress has been staggering.

The fact that we've taught computers to perform mathematical operations on vectors and that we can translate an idea into a new language with a neural network is still wild to me. The ability to communicate with machines in natural language was a science fiction dream for many years, and the journey we've been on as a species to achieve this is truly remarkable.

What does it mean for the future? The possibilities are endless and as a species we are on the cusp of many more amazing breakthroughs.

🤖
How was AI used in writing this article? I used AI extensively to help me write this article. I took my own notes on Jeff Dean's talk and then used those in Gemini Advanced Deep Research which analyzed many websites to add more to my notes. I then took all of that information plus Jeff's video and put it into NotebookLM. NotebookLM and I conversed back and forth to produce most of the text you have just read. Then, I used Gemini 2.0 in AI Studio to improve upon my article and help me determine how to finish it. I used Imagen in Gemini Advanced to create the featured image with the prompt, "An image of a King and Queen in a way that represents vector math." This article truly was a collaboration between me and machines.

Tagged in:

Learn, Gemini, Research, Google

Last Update: December 17, 2024

About the Author

Marie Haynes

I love learning and sharing about AI. I'm a former veterinarian; in 2008, understanding Google's search algorithms captivated me. In 2022, my focus shifted to understanding AI. AI is the future!
