+ 1
How to process the text in a document
Tokenization of a text document: I have the file, but I can't work out how to do it in PyCharm.
3 answers
+ 5
Do you simply want to break the text into words, or do you need a more complex analysis?
In the first case you can use the string method split() to break a string into a list of strings. You can also get rid of punctuation with replace().
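A minimal sketch of the split() and replace() approach, using only the standard library. The function name simple_tokenize and the sample sentence are made up for illustration, and using string.punctuation as the set of characters to strip is an assumption.

```python
import string

def simple_tokenize(text):
    """Split text into words after replacing ASCII punctuation with spaces."""
    # Replace each punctuation character with a space, so that
    # "word,word" still ends up as two separate tokens.
    for ch in string.punctuation:
        text = text.replace(ch, " ")
    # split() with no arguments splits on any run of whitespace.
    return text.split()

print(simple_tokenize("Hello, world! This is a test."))
# prints ['Hello', 'world', 'This', 'is', 'a', 'test']
```

Note that this also splits contractions: "don't" becomes "don" and "t", which is one reason to reach for a real tokenizer once the text gets messier.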
If you need more capable text-analysis tools, Python most likely has a module for the job. NLTK may be a good place to start:
http://www.nltk.org
+ 5
Exactly what Pedro said. NLTK's tokenizer is really good, and the library itself can carry you through the whole process. Plus, if you want to do semantic analysis, you can use word2vec, which works smoothly with NLTK's corpora.
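To give a rough idea of what a tokenizer does beyond split(), here is a small regex-based sketch built on the standard library's re module. It is not NLTK's API, just a stand-in for the general idea; the name regex_tokenize and the pattern are choices made for this example.

```python
import re

# \w+      matches a run of letters, digits, or underscores
# [^\w\s]+ matches a run of characters that are neither word
#          characters nor whitespace (i.e. punctuation)
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]+")

def regex_tokenize(text):
    """Return words and punctuation runs as separate tokens."""
    return TOKEN_PATTERN.findall(text)

print(regex_tokenize("Don't panic, it's fine."))
# prints ['Don', "'", 't', 'panic', ',', 'it', "'", 's', 'fine', '.']
```

Unlike plain split(), punctuation is kept as its own token instead of staying glued to the neighbouring word. A library tokenizer typically handles contractions and other edge cases more carefully than this.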
+ 2
Please explain a bit more; your question is a bit vague. Otherwise, I hope this helps.
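# Open the file; the with-block closes it automatically afterwards
with open('text_file.txt') as f:
    # readlines() returns a list with one string per line
    file_contents = f.readlines()
# Prints the list of lines read from 'text_file.txt'
print(file_contents)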
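If the goal is to go from the file straight to a list of words, the reading step can be combined with split(). This is only a sketch: tokenize_file is a made-up helper name, UTF-8 encoding is assumed, and the demo writes its own temporary sample file so it runs anywhere, including from PyCharm.

```python
import tempfile
from pathlib import Path

def tokenize_file(path):
    """Return all whitespace-separated tokens from a text file."""
    tokens = []
    with open(path, encoding="utf-8") as f:
        for line in f:                   # read the file line by line
            tokens.extend(line.split())  # split each line into words
    return tokens

# Demo: create a small sample file in a temporary directory and tokenize it.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "text_file.txt"
    sample.write_text("first line here\nsecond line\n", encoding="utf-8")
    print(tokenize_file(sample))
    # prints ['first', 'line', 'here', 'second', 'line']
```

For your own file, pass its path to tokenize_file instead of the temporary sample.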