+ 2

Word frequency

Please check my script and tell me how you would simplify or improve it. The script finds the 10 (or as many as you want) most frequent words in a `.txt` file.

To start the script, type in the console:

    python lang_frequency.py --name script_name.txt

The output will look similar to this:

    [('и', 1728), ('в', 1576), ('не', 1360), ('он', 1190), ('что', 1100), ('я', 1066), ('на', 1000), ('его', 690), ('это', 688), ('с', 663)]

To use your own `.txt` file, put it in the folder with the script.

    from collections import Counter
    import re
    import sys
    import argparse

    def create_parser():
        parser = argparse.ArgumentParser()
        parser.add_argument('-n', '--name', type=argparse.FileType())
        return parser

    parser = create_parser()
    namespace = parser.parse_args(sys.argv[1:])
    text = namespace.name.read()
    number = int(input("Type a number of most often words you want to know "))
    words = re.findall(r'\w+', text.lower())
    ten_most_frequent_words = Counter(words).most_common(number)
    print(ten_most_frequent_words)

1st Oct 2017, 3:49 PM
Dmitriy Yurkin
2 Answers
+ 7
A great start for a text analyzer. Now think of four variables:

1. word_count - how many times a particular word occurs in a document
2. total_word_count - how many words there are in a document
3. total_document_count - how many documents you analyze
4. word_document_count - how many documents contain the particular word

There is a method of text analysis called TF-IDF which, in brief, measures how important a word is to a document within a corpus of analyzed documents. It identifies the most important words for each document, which allows you to categorize those documents into thematic groups without having to read them :) And you just wrote 25% of the needed processing code for this ;)
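The four variables above can be combined roughly as follows. This is only a minimal sketch, assuming the plain textbook weighting (tf = word_count / total_word_count, idf = log(total_document_count / word_document_count)); other TF-IDF variants exist, and the function names `tokenize` and `tf_idf` and the sample sentences are made up for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into words, as the original script does."""
    return re.findall(r"\w+", text.lower())


def tf_idf(documents):
    """Return one {word: score} dict per document (hypothetical helper).

    Assumed weighting, one of several common variants:
        tf  = word_count / total_word_count
        idf = log(total_document_count / word_document_count)
    """
    tokenized = [tokenize(doc) for doc in documents]
    total_document_count = len(tokenized)

    # word_document_count: in how many documents each word appears
    word_document_count = Counter()
    for words in tokenized:
        word_document_count.update(set(words))

    scores = []
    for words in tokenized:
        word_count = Counter(words)
        total_word_count = len(words)
        scores.append({
            word: (count / total_word_count)
            * math.log(total_document_count / word_document_count[word])
            for word, count in word_count.items()
        })
    return scores


# Illustrative input only; real use would read several .txt files.
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat chased the dog",
]
for doc_scores in tf_idf(docs):
    top = sorted(doc_scores.items(), key=lambda item: item[1], reverse=True)[:2]
    print(top)
```

With this weighting, a word that occurs in every document (here "the") scores 0, while words unique to one document score highest, which is the sense in which TF-IDF picks out the words that characterize each document.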
1st Oct 2017, 5:39 PM
Kuba Siekierzyński
+ 2
I think it is very good! These days performance and memory use are not very important, so you can focus on making the code readable and reusable. One thing: you could improve the variable names. Another thing: it's a clear case for using a dictionary. Some details are a question of style, not a question of improvement.
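One possible reading of these suggestions, as a rough sketch rather than the only way to do it: the counting is done with a plain dictionary and split into small functions with descriptive names. The names `count_words` and `most_frequent_words` and the sample sentence are made up for illustration.

```python
import re


def count_words(text):
    """Return a dict mapping each lowercase word to its number of occurrences."""
    word_counts = {}
    for word in re.findall(r"\w+", text.lower()):
        word_counts[word] = word_counts.get(word, 0) + 1
    return word_counts


def most_frequent_words(text, limit):
    """Return the `limit` most frequent words as (word, count) pairs."""
    word_counts = count_words(text)
    # Sort by count, highest first; the sort is stable, so tied words
    # stay in the order in which they first appeared in the text.
    ranked = sorted(word_counts.items(), key=lambda pair: pair[1], reverse=True)
    return ranked[:limit]


print(most_frequent_words("to be or not to be", 2))
# prints [('to', 2), ('be', 2)]
```

Note that `Counter` from the original script is itself a dictionary subclass, so keeping `Counter(words).most_common(number)` and only renaming `ten_most_frequent_words` (which actually holds `number` words, not ten) would be an equally reasonable outcome.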
1st Oct 2017, 4:36 PM
Oma Falk