+ 18

Working with natural languages

I would like to analyze large amounts of text for the vocabulary they contain. For that, I'd like a tool that recognizes the various inflected forms of a word and maps them back to the base word, so that each word is only counted once. For example, the words "counting", "count", "counted", and "counts" would all be recognized as "count" and... counted only once. Is there a framework with the appropriate databases that can do that sort of thing, preferably an easy-to-use one?

25th Dec 2021, 7:31 PM
HonFu
9 Answers
26th Dec 2021, 2:49 AM
Vitaly Sokol
+ 9
So you have a text and want to extract the word stems from its sentences? Did you try nltk (Python)? It should enable you to do something like that, for English at least...
25th Dec 2021, 8:22 PM
Lisa
+ 7
Simon Sauter, the reason is the ability to use Snowball for different languages.
26th Dec 2021, 3:18 AM
Vitaly Sokol
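For readers following along, here is a minimal sketch of how the counting described in the question could be combined with a Snowball stemmer. It is not the code from the thread: the function names `count_vocabulary` and `make_snowball_normalizer` are made up for illustration, and the sketch assumes NLTK is installed and that splitting on runs of letters is good enough as tokenization.

```python
import re
from collections import Counter


def count_vocabulary(text, normalize):
    """Count the words in `text` after mapping each one through `normalize`."""
    # [^\W\d_]+ matches runs of Unicode letters, so umlauts survive.
    # Assumption: words are separated by non-letters (not true for Japanese).
    words = re.findall(r"[^\W\d_]+", text.lower())
    return Counter(normalize(word) for word in words)


def make_snowball_normalizer(language="english"):
    """Return a word -> stem function backed by NLTK's SnowballStemmer."""
    # Imported lazily so count_vocabulary itself works without NLTK.
    from nltk.stem.snowball import SnowballStemmer
    return SnowballStemmer(language).stem
```

Calling `count_vocabulary(text, make_snowball_normalizer("german"))` would then count German stems, and `SnowballStemmer.languages` lists the languages NLTK ships stemmers for. Japanese is not among them, and since Japanese text has no spaces between words, it would need a dedicated tokenizer before any counting. Because `normalize` is just a function, a lemmatizer could be swapped in without touching the counting code.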
+ 5
I've never used it myself, but this looks like it might do what you're looking for: https://machinelearningknowledge.ai/learn-lemmatization-in-ntlk-with-examples/
25th Dec 2021, 8:38 PM
Simon Sauter
+ 4
Vitaly Sokol, wow, thank you, the example shows it clearly!
26th Dec 2021, 8:56 AM
HonFu
+ 4
Arif Dastager, that's exactly the code posted above, isn't it?
27th Dec 2021, 12:00 PM
Lisa
+ 2
Hm, cool, that does look like the general thing I need... Would be great if it worked for other languages, foremost German and Japanese. Thanks, I'll check that out!
25th Dec 2021, 8:43 PM
HonFu
+ 1
Vitaly Sokol, is there a reason why you used stemming instead of lemmatization?
26th Dec 2021, 3:15 AM
Simon Sauter
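To make the distinction in this question concrete: stemming strips affixes by rule and may return something that is not a real word, while lemmatization looks up the dictionary form. The sketch below is a toy illustration only. The suffix list and the lemma table are hypothetical, and neither function is NLTK's actual algorithm.

```python
# Toy illustration only: these are NOT the rules or data NLTK uses.
TOY_SUFFIXES = ("ing", "ed", "es", "s")  # hypothetical rule list
TOY_LEMMAS = {"studies": "study", "went": "go"}  # hypothetical dictionary


def toy_stem(word):
    """Strip the first matching suffix; the result need not be a real word."""
    for suffix in TOY_SUFFIXES:
        # Keep at least three characters so short words stay intact.
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def toy_lemmatize(word):
    """Look the word up in a dictionary; unknown words come back unchanged."""
    return TOY_LEMMAS.get(word, word)
```

Here `toy_stem("studies")` gives the non-word "studi", while `toy_lemmatize("studies")` gives "study"; an irregular form like "went" is out of reach for suffix rules but trivial for a lookup. The trade-off is that a lemmatizer needs a vocabulary for each language (NLTK's `WordNetLemmatizer`, for example, relies on the WordNet data being downloaded), whereas a rule-based stemmer does not.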
+ 1
That's real nice 👍
27th Dec 2021, 2:49 PM
₿ig Ray