0:02 Hi Mara.
0:03 Data science number seven.
0:04 Oh hey guys.
0:05 Hey Graham.
0:07 Hey, uh, Ingo, Ralph. How are you today?
0:09 Ingo, it's time for five minutes with Ingo.
0:12 And today also with Ralph, co-founder of RapidMiner and a true expert on text analytics.
0:16 We're talking about sentiment analysis for data scientist number seven. Ralph, how does that work?
0:22 Well, we've seen classification in the past. Now we want to apply that to text. The question is: how do we do that? Um, well, let's look at some statements.
0:30 Okay. What do people say about data
0:30 scientist number seven?
0:32 I just happen to have, like, a statement here.
0:34 How convenient.
0:36 Yeah, it is. So we have, for example, a positive statement: Unicorns are amazing.
0:42 Yeah. Yes, they are.
0:43 Yeah, of course they are.
0:44 Other people have some trouble. So let's check out what they're saying: Finding unicorns is difficult.
1:00 Okay. Two text sentences; well, unstructured data, that's what it really is. How can we bring this into a structured format? So, Ralph, what would the first step look like?
1:10 Well, grammar is always so difficult, and I think the most important part is in the words. So let's skip grammar.
1:15 That's basically pretty much exactly what I'm doing in English every day.
1:16 True, true.
1:19 Yeah, I know. Anyway.
1:23 Okay, so let's tear the text apart into components.
1:26 So we are, like, just breaking this down. We still have those two sentences, but every sentence is broken down into what we call tokens. So every word becomes a token. Okay.
1:34 Okay.
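(For readers following along: a minimal sketch of this tokenization step in Python. The two sentences come from the episode; the function name and the split-on-non-letters rule are illustrative assumptions, not RapidMiner's actual tokenizer.)

```python
import re

def tokenize(sentence):
    # Lowercase, then split on anything that is not a letter, so
    # punctuation disappears and every word becomes one token.
    return [t for t in re.split(r"[^a-z]+", sentence.lower()) if t]

sentences = ["Unicorns are amazing.", "Finding unicorns is difficult."]
tokens = [tokenize(s) for s in sentences]
print(tokens)
# [['unicorns', 'are', 'amazing'], ['finding', 'unicorns', 'is', 'difficult']]
```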
1:35 So what's the next step?
1:36 Well, there are still a lot of words, and some of them don't carry much information; 'is' and 'are' are not so central. Let's just get them out.
1:43 Okay. So we throw them away. Those are what we call stop words. We can remove them.
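(Continuing the sketch above, stop word removal is just a filter against a fixed word list. The two-entry stop list mirrors the episode; real stop word lists contain hundreds of entries.)

```python
STOP_WORDS = {"is", "are"}  # toy list; real stop word lists are much longer

def remove_stop_words(tokens):
    # Keep only the tokens that carry content; drop the stop words.
    return [t for t in tokens if t not in STOP_WORDS]

tokens = [["unicorns", "are", "amazing"], ["finding", "unicorns", "is", "difficult"]]
filtered = [remove_stop_words(t) for t in tokens]
print(filtered)  # [['unicorns', 'amazing'], ['finding', 'unicorns', 'difficult']]
```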
1:48 Well, now we still have the two sentences. Not really as structured as I would like. We're heading towards a table, so let's put in more structure. Since we threw away the grammar, we ignore the word order and basically only look at the words.
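(This "ignore the order, only look at the words" idea is usually called a bag of words; a one-liner with Python's Counter makes the point, as an illustration rather than anything from the episode.)

```python
from collections import Counter

# Word order is discarded: only which words occur, and how often, survives.
bag = Counter(["finding", "unicorns", "difficult"])
print(bag)  # Counter({'finding': 1, 'unicorns': 1, 'difficult': 1})
```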
2:02 Okay, so that is interesting. That almost looks like a good structure already. I think we need a pen. Um, whiteboard number one, can you give us a pen?
2:10 All right, perfect.
2:10 Thank you.
2:12 What a good whiteboard.
2:14 Um, so we could actually say, like, those two words occur in both sentences. So we can actually make a cut here,
2:21 another one here,
2:24 and another one here. So it's almost like columns in a table. You almost have a structure already. Okay. So what can we do now?
2:31 Well, you want to simplify things. So we just count whether a word occurs or not.
2:36 So here there's no word here.
2:40 True. But we have a one here for 'finding'.
2:47 Actually, we have two ones.
2:50 And for 'difficult' we have a single one.
2:53 So look at that. Now we actually have a table. Every token becomes a column, and the values are just the count of every word in each sentence. Now we can add another column, which is the positive or negative sentiment here. And now we have the label we usually want to predict with machine learning. So we can use any machine learning method. SVMs, by the way, are great for that: just take this table, train a model on this data, and predict whether the sentiment is positive or negative. And that's really about it for the general idea of text transformation.
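(A sketch of the whole table-building and training step, assuming scikit-learn is available. CountVectorizer builds exactly this one-column-per-token count table, and LinearSVC is one flavor of the SVMs Ingo mentions; the labels are the sentiment column from the whiteboard.)

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

docs = ["Unicorns are amazing", "Finding unicorns is difficult"]
labels = ["positive", "negative"]  # the sentiment column we want to predict

# One row per sentence, one column per token; the values are word counts.
vectorizer = CountVectorizer(stop_words=["is", "are"])
X = vectorizer.fit_transform(docs)
print(vectorizer.get_feature_names_out())  # ['amazing' 'difficult' 'finding' 'unicorns']
print(X.toarray())                         # [[1 0 0 1]
                                           #  [0 1 1 1]]

# Train an SVM on the table to predict the sentiment label.
model = LinearSVC()
model.fit(X, labels)
print(model.predict(vectorizer.transform(["Unicorns are amazing"])))
```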
3:28 The only problem is: what do you do if the texts become longer?
3:31 Well, the longer the text, the higher the values will be. So it's kind of unfair: longer texts are stronger. So we want to divide the counts by the length of the text. This one has two words, so I divide those by two. The other one has three words, so the length is three, and I divide by three. Afterwards it no longer depends on the length.
3:48 So that means 'unicorn', for example, is really more typical for the first text. Yeah.
3:51 Yeah.
3:52 Because it occurs in 50% of all the words, basically, and here it's only in 33% of all the words. So that makes it a more typical word.
4:00 Unfortunately, it occurs in both texts, so it's not a typical word for positive or negative. But take 'amazing' and 'difficult', for example: you can see there's some difference.
4:08 I can.
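(A sketch of this length normalization in plain Python. Dividing each count by the number of remaining tokens gives exactly the relative frequencies computed on the whiteboard: 1/2 = 50% for 'unicorns' in the first sentence, 1/3 ≈ 33% in the second.)

```python
from collections import Counter

def term_frequency(tokens):
    # Divide every raw count by the sentence length, so long texts
    # no longer get higher values than short ones.
    counts = Counter(tokens)
    return {word: count / len(tokens) for word, count in counts.items()}

print(term_frequency(["unicorns", "amazing"]))
# {'unicorns': 0.5, 'amazing': 0.5}
print(term_frequency(["finding", "unicorns", "difficult"]))
# {'finding': 0.333..., 'unicorns': 0.333..., 'difficult': 0.333...}
```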
4:09 Okay. That is excellent. Uh, that's really amazing. Now, of course, there's another problem in my opinion: what do you do with words that are just very frequent overall? We have been throwing away stop words like 'is' or 'are', but there are other words that are very frequent in all text documents. What can you do about them?
4:26 Yeah. So this is called term frequency: it's about how often the word occurs in the text. Oh, I think we need a second whiteboard. Where's whiteboard number two? Here it is. There we go. So this is term frequency: the count of the word divided by the length of the text. And then the other term takes care of words that are too frequent in too many documents. We count how many documents they appear in and take the inverse of that.
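(Together these two factors are the classic TF-IDF weighting. A minimal sketch, again assuming scikit-learn; note that TfidfVectorizer uses a smoothed logarithmic IDF rather than the plain inverse described here, but the effect is the same: words that appear in many documents get a smaller weight.)

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["Unicorns are amazing", "Finding unicorns is difficult"]

# TF-IDF = term frequency * inverse document frequency. scikit-learn
# uses a smoothed log IDF, a common variant of the plain inverse.
vectorizer = TfidfVectorizer(stop_words=["is", "are"])
X = vectorizer.fit_transform(docs)
print(vectorizer.get_feature_names_out())
print(X.toarray().round(2))
# 'unicorns' appears in both documents, so it gets a smaller weight than
# 'amazing' or 'difficult', which each occur in only one document.
```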
4:48 So that means we are normalizing the term frequency. The term frequency is what we see here, and by normalizing it we give terms that are very frequent across all documents a smaller weight. That's a perfect representation for text documents in general, and it's really a great way to transform unstructured information into structured information. And that's it for today.
5:10 Interesting. Thanks, Ingo. Thanks, Ralph.
5:14 And this has been your five minutes with Ingo.