“Enterprise to Computer (a Star Trek Chatbot)”: Grishma Jena with IBM (Video + Transcript)

Grishma Jena / Cognitive Software Engineer / IBM
Sukrutha Bhadouria / CTO & Co-Founder / Girl Geek X


Sukrutha Bhadouria: Hi everyone, I hope you’ve been having a great day so far. Hi, Grishma. Hi, so yes, we are ready for our next talk. I’m Sukrutha and Grishma is here to give the next talk. Just before we get started, the same set of housekeeping rules. First is, we’re recording. We’re gonna share in a week. Please post your questions, not in chat, but in the Q and A. So you see the Q and A button at the bottom? Click on that and post there. If for some reason we run out of time, and we can’t get to your questions, we’ll have a record of it and it’s easy for us to find later and get you your answers later.

Sukrutha Bhadouria: So please share on social media #GGXelevate and look for job postings on our website at girlgeek.io/opportunities. We’ve also been having, throughout the day, viewing parties at various companies. So shout-out to Zendesk, Strava, Guidewire, Climate, Grand Rounds, Netflix, Change.org, Blue Shield, Grio, and Salesforce Portland office.

Sukrutha Bhadouria: So now, on to Grishma. Grishma is a cognitive software engineer at IBM. She works on the data science for marketing team at IBM Watson. So today her talk is about Enterprise to Computer: a Star Trek chatbot. I’m sure there’s a lot of Star Trek fans out there because I know I am one, and I can’t wait to hear about your talk, Grishma.

Grishma Jena: Thank you, Sukrutha.

Sukrutha Bhadouria: Go ahead and get started. You can share your slides.

Grishma Jena: Okay, I’m gonna minimize this. Alright, can you see my slides? Okay. Hi, everyone, I’m Grishma. As Sukrutha mentioned, I work as a cognitive software engineer with IBM in San Francisco. So, a lot of my job duties involve dealing with a lot of data, trying to come up with proprietary data science or AI solutions for our Enterprise customers. My background is in machine learning and natural language processing, which is why I’m talking about a chatbot today.

Grishma Jena: I’ve also recently joined this non-profit called For Her, where we’re trying to create a chatbot that could act as a health center, as a resource center for people who are going through things like domestic abuse or sexual violence, so I’m very interested to see, you know, a totally different social application of chatbots. But for today we’ll focus on something fun. And before I begin, a very happy Women’s Day to all of you out there. So, yeah.

Grishma Jena: When was the last time you interacted with a chatbot? It could have been a few minutes before, when, you know, Akilah was talking and your Alexa probably got activated by mistake and you had to be like, “Alexa, stop.” It could be with Siri. We interact with Siri every day. It could be on a customer service chat or it could be on a customer service call.

Grishma Jena: Basically, there are so many different avenues and applications of chatbots today that sometimes it’s even hard to distinguish: are we talking to a human, or is it a chatbot disguised as a human? And it’s quite interesting to see how far chatbots have come in the past few years.

Grishma Jena: So, this was a grad school project that we did. Our idea was, okay, chatbots are amazing. We really like that they help take some of the workload off humans, but how can we make them seem a little more human, a little less mechanical? Could we give them some sort of a fun personality?

Grishma Jena: And we brainstormed for a bit and we finally came up with the idea, hey, why don’t we, I mean … Well, to be honest we weren’t that big fans of Star Trek, but we did become fans during the course of this project and we were like, “Okay, let’s think of Star Trek”. It has a wide fan base, so let’s try to not pick one single character from Star Trek but take all of the characters and make this huge mix of references and trademark dialogues and see what kind of personality the chatbot would have.

Grishma Jena: So, like I mentioned, the motivation was to make a chatbot a little more human-like. And we wanted to have a more engaging user experience. So the application of this could be, it doesn’t have to be something related to, you know, like an entertainment industry. It could be also something like a sports lover bot so that would be very chatty and extroverted and it would support your favorite sports team. Or it could be something a little more sober like a counselor bot who is very understanding and supportive and listens to you venting out or asks you about how your day was. So yeah, we chose Star Trek infused personality.

Grishma Jena: So our objective with Star Trek was that we wanted it to incorporate references from the show. [inaudible 00:05:17] wanted to [inaudible 00:05:20] Spock and live long and prosper. We wanted it to be a data-driven model: we did not want to feed in dialogues by hand, we wanted to just feed in a corpus and have it generate dialogues on its own. We obviously wanted it to give interesting responses and to keep the user engaged, because that is one of the things that a chatbot should do, right? So in really simple words, just think of a friend of yours, or it could be yourself, who is this, you know, absolutely big fan of Star Trek, and just transfer that personality to a chatbot.

Grishma Jena: So this is what the schema of our bot looked like. We had the user utterance, which is basically anything that you say or that you provide as input to the chatbot. And then we had a binary classifier. I’ll delve deeper into why exactly we wanted it, but the main point is that we wanted it to be able to distinguish whether what you’re saying to the chatbot is something related to Star Trek or something a little more like general conversation, such as “How are you feeling today?” or “What is the weather like?” And depending on that, we had two different routes which the bot would take to generate a response.

Grishma Jena: So before we begin, we obviously need some sort of data and we decided that we would take all of the data that was available for the different Star Trek movies and the TV series. You’d be surprised at how little data is available, actually. We initially thought of just doing a Spock bot, but Spock himself has very limited dialogues so we just expanded our search to the entire Star Trek universe. And that’s why we took dialogues from movies and TV series. We didn’t want to have any sort of limitations as far as the data was concerned. We ended up with a little over 100,000 pairs of dialogues.

Grishma Jena: Then we also went and got this database, which is known as the Cornell Movie Database. It was created by Cornell University and has a collection of raw movie scripts. It’s just a really good data set to train your bot on how humans interact, what kind of topics they talk about, and what the responses are like.

Grishma Jena: And finally, we also had a Twitter data set because we wanted some topics that were related to the ongoing affairs in the world, the current news topics. Because we envisioned that if you had a chatbot then people do like to talk to the chatbot or ask for the chatbot’s opinion on something that’s happening in real time.

Grishma Jena: So the very first component of our chatbot was a binary classifier. Like I mentioned, we had two different routes for our chatbot. One would be the Star Trek route and the other would be a general conversation route. So we had the binary classifier that would help us distinguish whether whatever the user is uttering, or whatever the user is giving as an input, is related to Star Trek or is general conversation, which was handled by the Cornell Movie Database. We used an 80:20 split between the training data set and the testing data set. And for features, we took the top 10,000 TF-IDF unigrams and bigrams.
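
A minimal sketch of the routing classifier described above. The talk does not name the classifier or its library, so this is an assumption-laden stand-in: a small Naive Bayes model over raw unigram and bigram counts (the talk used the top 10,000 TF-IDF features instead), with an 80:20 split. The example sentences, the label names `star_trek` and `general`, and all function names are invented for illustration.

```python
import math
import random
from collections import Counter

def features(text):
    """Unigrams plus bigrams of a lowercased, whitespace-split sentence."""
    tokens = text.lower().split()
    return tokens + [" ".join(pair) for pair in zip(tokens, tokens[1:])]

def split_80_20(examples, seed=0):
    """Shuffle a copy of the data and cut it 80:20 into train and test."""
    shuffled = examples[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(0.8 * len(shuffled))
    return shuffled[:cut], shuffled[cut:]

def train(examples):
    """Count features per label (assumed classifier: multinomial Naive Bayes)."""
    counts = {}
    label_counts = Counter()
    for text, label in examples:
        label_counts[label] += 1
        counts.setdefault(label, Counter()).update(features(text))
    vocab = set().union(*counts.values())
    return counts, label_counts, vocab

def predict(model, text):
    """Return the label with the highest add-one-smoothed log score."""
    counts, label_counts, vocab = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(label_counts):
        score = math.log(label_counts[label] / total)
        denom = sum(counts[label].values()) + len(vocab)
        for feat in features(text):
            score += math.log((counts[label][feat] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Invented toy data; the real corpora were far larger.
examples = [
    ("enterprise set course now", "star_trek"),
    ("spock scans the enterprise", "star_trek"),
    ("enterprise shields are down", "star_trek"),
    ("beam aboard the enterprise", "star_trek"),
    ("enterprise hails the captain", "star_trek"),
    ("how are you today", "general"),
    ("weather is nice today", "general"),
    ("today feels very long", "general"),
    ("lunch was good today", "general"),
    ("are you busy today", "general"),
]

train_set, test_set = split_80_20(examples)
model = train(train_set)
accuracy = sum(predict(model, text) == label for text, label in test_set) / len(test_set)
```

The predicted label would then decide which of the two routes generates the reply.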

Grishma Jena: TF-IDF stands for term frequency and inverse document frequency. Term frequency is nothing but how many times a given word occurs in your corpus, and inverse document frequency is kind of a weight that is attached to a word. So think of a textbook or think of a document that you have. Words like prepositions, like the, of, and, would occur multiple times. But the words that are really important, that have some sort of conceptual representation, perhaps like the topic of it, would be a little rarer in occurrence compared to prepositions and other commonly used words, and that’s why they should be given more weightage. So that’s the whole idea behind TF-IDF.
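
To make the weighting concrete, here is a small sketch of one common TF-IDF variant: term frequency as a word's share of its document, multiplied by the log of (number of documents / number of documents containing the word). The talk does not say which variant was used, so the exact formula, the function name `tf_idf`, and the three sample documents are assumptions.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {term: weight} dict per document."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            # Share of the document times the log-scaled rarity of the term.
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights

# Invented sample documents.
docs = [
    "the captain of the ship",
    "the engine of the shuttle",
    "the crew of the enterprise",
]
weights = tf_idf(docs)
```

Because "the" and "of" appear in every document, their inverse document frequency is log(1) = 0 and they drop out, while a rarer word such as "captain" keeps a positive weight.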

Grishma Jena: Unigrams and bigrams are nothing but what you get when you divide the entire document that you have into words. A unigram would be one word [inaudible 00:09:17]; a bigram would be a set of two consecutive words that occur in the document. There’s an example later on in the slides to explain it better. Stop words, when you consider them, are just filler words, similar to the prepositions I mentioned. And we were very happy with the performance of the binary classifier. We were able to get a 95% accuracy on the test set, and we decided that is good enough, let’s move on to the next one.

Grishma Jena: And finally, this is the main core of it, where deep learning comes into play. So with deep learning, we used a model called a Seq2seq which is a particular type of recurrent neural network. So if you can see the image on the right, it is a simplified version of a neural network where you give it an input and it gets an output and that output is also the input for the next cycle, so it’s kind of like a feedback looping mechanism.

Grishma Jena: First, the specific type of neural network that we use, Seq2seq. It was just two recurrent neural networks so just think of a really big component that has two smaller components, which is an encoder and a decoder.

Grishma Jena: So the encoder actually takes in the input from the user and tries to provide some sort of context. What do the words mean? What exactly is the semantics behind the sentence that the user has given? And the decoder generates the output based on the context that it has understood and also based on the previous inputs that were given to it, which is where the feedback mechanism comes into play.

Grishma Jena: So just to go a little deeper into it. This is a representation of what a Seq2seq with encoder and decoder would look like. So the input over here would be, “Are you free tomorrow?” and the encoder takes in that input and tries to understand what exactly is the context or the meaning of this sentence. And finally the decoder understands, okay, someone is asking either to make an appointment or about someone’s availability or someone’s schedule. And that’s why the reply is something like, “Yes, I am. What’s up?”
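
The encoder and decoder flow can be sketched with a toy, untrained model. Everything below is an assumption made for illustration: plain tanh recurrent cells, random weights, a nine-word vocabulary, and greedy decoding. Since nothing is trained, the reply is meaningless; the point is only how the encoder folds the input into a context vector and how the decoder feeds each output back in as its next input.

```python
import math
import random

def rnn_step(weights, hidden, inputs):
    """One vanilla RNN step: tanh(W_h * hidden + W_x * inputs)."""
    w_h, w_x = weights
    return [
        math.tanh(sum(a * h for a, h in zip(row_h, hidden)) +
                  sum(b * x for b, x in zip(row_x, inputs)))
        for row_h, row_x in zip(w_h, w_x)
    ]

def one_hot(index, size):
    return [1.0 if i == index else 0.0 for i in range(size)]

def random_matrix(rng, rows, cols):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

# Hypothetical toy vocabulary and sizes.
vocab = ["<start>", "<end>", "are", "you", "free", "tomorrow", "yes", "i", "am"]
V, H = len(vocab), 8
rng = random.Random(0)
encoder = (random_matrix(rng, H, H), random_matrix(rng, H, V))
decoder = (random_matrix(rng, H, H), random_matrix(rng, H, V))
output_layer = random_matrix(rng, V, H)

def encode(tokens):
    """Read the input one token at a time; the final hidden state is the context."""
    hidden = [0.0] * H
    for token in tokens:
        hidden = rnn_step(encoder, hidden, one_hot(vocab.index(token), V))
    return hidden

def decode(context, max_len=5):
    """Start from the context and feed each chosen token back in as the next input."""
    hidden, token, reply = context, "<start>", []
    for _ in range(max_len):
        hidden = rnn_step(decoder, hidden, one_hot(vocab.index(token), V))
        scores = [sum(w * h for w, h in zip(row, hidden)) for row in output_layer]
        token = vocab[scores.index(max(scores))]  # greedy choice
        if token == "<end>":
            break
        reply.append(token)
    return reply

context = encode(["are", "you", "free", "tomorrow"])
reply = decode(context)
```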

Grishma Jena: So these are some statistics about how exactly we went on training this on AWS. We used a p2.xlarge instance with one Nvidia Accelerator GPU and then we had the Star Trek Seq2seq. So we had one Seq2seq for just Star Trek dialogues and we had another one, the Cornell Seq2seq which is on Cornell data, which is more for just a general conversation purpose.

Grishma Jena: So we went ahead, we generated some sentences, but then we realized that the ones for Star Trek were really good because you’re giving it Star Trek as input so obviously the output is also going to be Star-trekky. But for the general conversation ones, for things like, “What is the weather like?”, “How are you doing today?”, “What is the time?” it was a little difficult for us because obviously the input is not Star Trek related, right? So the output also wouldn’t be Star Trek related, but we wanted this to be a Star Trek chatbot.

Grishma Jena: So we brainstormed a bit and we thought, “Hey, why don’t we try something called a style shifting?” Which is basically like you take a normal sentence, a sentence from the general conversation, and you try to shift it into the Star Trek domain.

Grishma Jena: And the way we did this was, we went through the entire corpus, the data set for Star Trek, and we created a word graph out of it. A word graph would be, just think of it as you parse different sentences in the data set and each of the words would form a node and the edges between them would tell how they occurred in relation to one another, so whether they occurred right next to each other or within the same sentence.

Grishma Jena: And along with the words in the node we also had a part of speech tag. So we indicated whether it was an adjective, or a noun, or a pronoun or a conjunction. So let’s say for example our sentence was, “Live long and prosper.” You break it down into four words which are the four different nodes and then we label them with a different part of speech tag and we connected them because they come one after the other in the sentence.
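
A rough sketch of such a word graph for the “Live long and prosper” example. It only records words that occur right next to each other, and the part-of-speech tags come from a tiny hand-written lookup table (`POS_TAGS`, a hypothetical stand-in for a real tagger, which the talk does not name).

```python
from collections import defaultdict

# Hypothetical hand-written tags; a real system would use a POS tagger.
POS_TAGS = {"live": "VERB", "long": "ADV", "and": "CONJ", "prosper": "VERB"}

def build_word_graph(sentences):
    """Map each (word, tag) node to its following nodes, with occurrence counts."""
    graph = defaultdict(lambda: defaultdict(int))
    for sentence in sentences:
        tokens = [tok.strip(".,!?") for tok in sentence.lower().split()]
        nodes = [(tok, POS_TAGS.get(tok, "UNK")) for tok in tokens if tok]
        # Connect words that come one after the other in the sentence.
        for left, right in zip(nodes, nodes[1:]):
            graph[left][right] += 1
    return graph

graph = build_word_graph(["Live long and prosper."])
edge_count = sum(len(targets) for targets in graph.values())
```

The four words give four nodes joined by three edges: live to long, long to and, and to prosper.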

Grishma Jena: So what we did was, after we built out this really huge word graph, we looked up in it which words could appropriately be inserted between two given words in the input. So once we had the sentence, we would check every two words in the sentence and see what words we could insert in between to give it more of a Star Trek feel, to just, you know, shift the domain into Star Trek.
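
One way the lookup could work, as a sketch under assumptions: for each pair of adjacent input words, look for a word the corpus has seen both directly after the first and directly before the second, and insert the most frequent such word. The scoring rule, the function names, and the three corpus lines are invented here; the talk does not spell out the exact procedure.

```python
from collections import Counter, defaultdict

def build_successors(sentences):
    """For every word, count which words directly follow it in the corpus."""
    successors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for left, right in zip(tokens, tokens[1:]):
            successors[left][right] += 1
    return successors

def bridge_candidates(successors, left, right):
    """Words w seen as 'left w' and as 'w right' in the corpus, best first."""
    scored = []
    for word, count_in in successors.get(left, {}).items():
        count_out = successors.get(word, {}).get(right, 0)
        if count_out:
            scored.append((min(count_in, count_out), word))
    return [word for _, word in sorted(scored, key=lambda pair: (-pair[0], pair[1]))]

def style_shift(successors, sentence):
    """Insert the best bridging word, if any, between each pair of adjacent words."""
    tokens = sentence.lower().split()
    shifted = tokens[:1]
    for left, right in zip(tokens, tokens[1:]):
        candidates = bridge_candidates(successors, left, right)
        if candidates:
            shifted.append(candidates[0])
        shifted.append(right)
    return " ".join(shifted)

# Invented stand-in for the Star Trek corpus.
corpus = ["set course for earth", "set a course for vulcan", "plot a course"]
successors = build_successors(corpus)
shifted = style_shift(successors, "set course for home")
```

Here the corpus has seen "set a" and "a course", so "a" is inserted between "set" and "course"; no bridging word exists for the other pairs, so they are left alone.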

Grishma Jena: We went ahead and we did that and these were the kind of results that we got. “I am sorry” was the input and then the word graph went ahead and inserted “Miranda” at the end. “I will go” and then it inserted “back” at the end of the sentence because “go” and “back” kind of occur very commonly with each other. And similarly for the start of the sentences, it tried to insert names like “Uhura” or “Captain”. So one thing we noticed was that it was really good at inserting names at the start and the end of the sentence, and using the character names from the show did end up giving it a slightly more Star Trek feel than before.

Grishma Jena: So we went ahead and we just randomly tried to insert words that occurred more frequently between two words but then we realized that some of the sentences were ungrammatical. So what do we do? We came up with this idea: let us use the word graph as it is, and then let’s apply some sort of a filter to our responses. So, like I said, we realized that the word graph was giving a few incoherent and incorrect responses. What we did was we went ahead and constructed an n-gram model.

Grishma Jena: So n over here would be unigram, bigram, trigram. You can see the example over here: if n is equal to one, which is a unigram, you break down the sentence into just different words, so “this” would be one unigram and “is” would be another unigram. If n is two, which is a bigram, you would take two words that co-occur together. So in this case the first bigram would be “This is,” the second one would be “is a,” and then similarly for a trigram it would be “This is a” and then “is a sentence”.
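
The slide's example can be reproduced with a few lines of code. This is a minimal sketch; the function name `ngrams` and whitespace tokenization are assumptions.

```python
def ngrams(sentence, n):
    """Return every run of n consecutive words in the sentence."""
    tokens = sentence.split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sentence = "This is a sentence"
unigrams = ngrams(sentence, 1)  # ["This", "is", "a", "sentence"]
bigrams = ngrams(sentence, 2)   # ["This is", "is a", "a sentence"]
trigrams = ngrams(sentence, 3)  # ["This is a", "is a sentence"]
```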

Grishma Jena: So we created an n-gram model which was just to understand what exactly is the kind of dataset that Star Trek has. And then finally we wanted to get a probability distribution over the sequence of words that we have had.

Grishma Jena: So once we got this, we started to filter the responses, and we ran the sentences through the bigram models that we trained on the Star Trek data set. Because of this we kind of got a reference for seeing which structures are grammatically correct. We went ahead and checked them, and the ones that were a little odd sounding or that didn’t really occur anywhere in the data set, we went ahead and removed.

Grishma Jena: Another metric that we used for this was perplexity. So just think of perplexity as some sort of an explainability metric. We went ahead and used that, which would help us tell how well a probability distribution was able to predict a given piece of text.
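
A small sketch of how a bigram model and perplexity could be combined into a filter. The add-one smoothing, the sentence boundary markers, the cutoff value, and the three training sentences are all assumptions for illustration; the talk does not give these details. Lower perplexity means the model finds the sentence less surprising.

```python
import math
from collections import Counter

def train_bigram_model(sentences):
    """Count unigrams and bigrams, with start and end markers on each sentence."""
    unigram_counts = Counter()
    bigram_counts = Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        unigram_counts.update(tokens)
        bigram_counts.update(zip(tokens, tokens[1:]))
    return unigram_counts, bigram_counts

def perplexity(sentence, unigram_counts, bigram_counts):
    """exp of the average negative log probability per bigram transition."""
    tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
    vocab_size = len(unigram_counts)
    log_prob = 0.0
    for prev, word in zip(tokens, tokens[1:]):
        # Add-one smoothing so unseen bigrams get a small non-zero probability.
        prob = (bigram_counts[(prev, word)] + 1) / (unigram_counts[prev] + vocab_size)
        log_prob += math.log(prob)
    return math.exp(-log_prob / (len(tokens) - 1))

# Invented stand-in for the Star Trek training data.
corpus = ["live long and prosper", "beam me up scotty", "live long captain"]
unigram_counts, bigram_counts = train_bigram_model(corpus)

candidates = ["live long and prosper", "prosper and long live"]
scores = {c: perplexity(c, unigram_counts, bigram_counts) for c in candidates}

# Hypothetical cutoff: keep only candidates the model finds unsurprising.
MAX_PERPLEXITY = 8.0
kept = [c for c in candidates if scores[c] <= MAX_PERPLEXITY]
```

The word order seen in training scores a lower perplexity than the scrambled version, so only the first candidate survives the filter.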

Grishma Jena: Finally, we have all of the things in place and we have to evaluate the performance of the chatbot. So we came up with two categories of evaluation metrics. The first one was quantitative metrics, where we used perplexity, which was mentioned on the first slide. And the second quantitative metric was that we wanted to see how often it was using words that were very particular to Star Trek, that you don’t really use in normal day-to-day life, you know, like maybe spaceship or engage.

Grishma Jena: And the second category was human evaluations, where we got a user group and we asked them to just read the input and the output and see how good it was in terms of grammar, whether the response actually made sense and was appropriate, and finally, the Star Trek style. Just how Star-trekky did it sound?

Grishma Jena: And we also came across another bot online, called the Fake Spock Pandora Bot, which was the opposite of ours. Our bot was data driven; this one was rule based, so it was actually given human-generated responses as input.

Grishma Jena: We wanted to see how good would a data driven model perform as compared to a human generated one. So this is just what the Fake Spock Pandora Bot looked like. And these were the kind of responses that the Pandora Bot gave. If you said, “I’m hungry, Captain” it said, “What will you be eating?” So it’s giving really good appropriate responses because humans were the back end for this.

Grishma Jena: And then, what we did was we went ahead and evaluated the results. And we saw that our bot was performing better for Star Trek style and it also was a little more coherent. For grammar, Pandora Bot was much better and that’s not surprising because humans were the ones who actually wrote it out. For perplexity, the Star Trek dialogues were at 65, so that was our baseline number, and we figured out that the responses our bot was generating, at 60.9, were a little closer to it, whereas Pandora was way far off at 45.

Grishma Jena: So we were pretty happy with our performance. I’m just gonna give you a few examples of what the different bots generated. So the yellow ones are the Pandora Bot and the blue ones are the E2Cbot. So let’s see, if the user says, “Beam me up, Scotty” the yellow one, that is the Fake Pandora Bot, gives, “I don’t have a teleportation device” which is a good answer. And the blue one is, “Aye, Sir” which is also a good answer. A little curt, but nothing wrong with it.

Grishma Jena: In the second example, if you see, our bot answered, “Bones, I like you.” So the “Bones” part actually comes from the word graph, which gives it a little more of a Star Trek feel. And the last one over here is the Fake Bot, the human generated one, which just says, “I am just an AI chatting on the internet,” which is kind of not the response that you are looking for.

Grishma Jena: A few more examples over here. The user says, “My name is Alex” and then the Fake Spock Bot says, “Yes, I know Christine.” I just told you my name was Alex, why would you call me Christine? But our bot says, “What do you want me to do, Doctor?”, which is a better response. And, yeah, these are the kind of responses.

Grishma Jena: I think some of our human focus group people said that they might be correct, appropriate responses, but they might not be factually correct, which was a challenge for us, as well as for the Fake Spock Bot. We didn’t really delve deeper into it because that would kind of dive more into having a question answering system and trying to check if it’s factually correct or not but we tried to make our focus group users understand that it’s just a bot at the end of the day.

Grishma Jena: So finally, we were able to generate Star Trek style text. We were very happy with that, we were able to use the data driven approach which meant we could automate it. And we did figure that it performed better than the human generated responses that Pandora Bot would give, at least on style and at least on the appropriateness. It still needs a little bit of improvement in grammar but we were pretty happy with it.

Grishma Jena: So that’s me. Live long and prosper. And feel free to reach out to me on Linkedin or on Twitter if you have any questions about this. Thank you.

Sukrutha Bhadouria: Thank you, Grishma. This was great. So just to close I just wanted to mention to everybody that you actually sent your speaker submission to us and that’s how we got connected. So thank you for doing that. We got a lot of comments from people who are Star Trek fans, but yeah, what inspired you to build this project?

Grishma Jena: Yes, so this was actually a grad school project. We were taking a deep learning course so all of us had to build a chatbot as an Alexa skill. We brainstormed a lot, and we actually thought that, because Star Trek has a really huge fan base, Spock would be a good idea to do. But Spock had very little dialogue in all of the movies and the television series and then we were like, “You know what, let’s not stick to just one character, let’s have the entire Star Trek universe.” And the bonus was that during my semester, I could continuously binge watch Star Trek and say that, “Yeah, I’m doing research because I want to see how well my chatbot works,” but I was just binge watching to be honest.

Sukrutha Bhadouria: Nice. That’s awesome. Well, thank you so much, Grishma, for your time. We really appreciate it and for your enthusiasm in signing up through our speaker submissions.

Grishma Jena: Thank you so much, Sukrutha.