+ 1

How to process the text in a document

I want to tokenize a text document. I know how to open the file, but I cannot get it to work in PyCharm.

1st Jun 2018, 7:40 AM
reza
3 Answers
+ 5
Do you simply want to break the text into words, or do you need a more complex analysis? In the first case, you can use the split() method to break a string into a list of strings, and strip punctuation with replace(). If you need more capable text-processing tools, Python has dedicated modules for that. This may help: http://www.nltk.org
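A minimal sketch of the split() and replace() approach, assuming plain English text and that the characters in string.punctuation are the ones to strip. The function name simple_tokenize is made up for illustration.

```python
import string


def simple_tokenize(text):
    """Split text into lowercase words after removing ASCII punctuation."""
    # Replace each punctuation character with a space so that
    # "end.Next" still separates into two words.
    for ch in string.punctuation:
        text = text.replace(ch, " ")
    # split() with no arguments splits on any run of whitespace.
    return text.lower().split()


print(simple_tokenize("Hello, world! This is a test."))
```

Note that this also splits contractions: "don't" becomes "don" and "t". If that matters, a real tokenizer is the better choice.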
1st Jun 2018, 7:52 AM
Pedro Demingos
+ 5
Exactly what Pedro said. NLTK's tokenizer is really good, and the library itself can take you through the whole process. If you also want to do semantic analysis, you can use word2vec, which works well with NLTK corpora.
1st Jun 2018, 8:06 AM
Kuba Siekierzyński
+ 2
Could you explain further? Your question is a bit vague. Otherwise, I hope this helps:

    with open('text_file.txt') as f:
        file_contents = f.readlines()

    # Prints the contents of 'text_file.txt' as a list of lines
    print(file_contents)
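To connect file reading with the tokenization part of the question, here is a rough sketch that reads a file and splits it into words on whitespace. The function name tokenize_file and the UTF-8 default encoding are assumptions, not something stated in the question.

```python
def tokenize_file(path, encoding="utf-8"):
    """Return a list of whitespace-separated tokens from the file at path."""
    tokens = []
    with open(path, encoding=encoding) as f:
        # Iterate line by line so a large file is not loaded all at once.
        for line in f:
            tokens.extend(line.split())
    return tokens
```

You would call it as tokenize_file('text_file.txt'). In PyCharm, a relative path like this is resolved against the working directory of the run configuration, so an absolute path may be easier while testing.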
1st Jun 2018, 7:49 AM
Mpho Mphego