0

Text Search

How can I achieve real-time text search across multiple PDF documents? The preferred language is Python. There are many large PDF documents, and a naive search may degrade real-time performance. I know that indexing these documents would help, but I am not sure which packages can be used for this.

16th Jan 2019, 12:26 PM
Bhushan
3 Answers
+ 3
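I'll use Python to demonstrate this since you didn't specify a language. Python has lots of great file handling functions, and one of the most basic is opening and reading files. So in your example, all you have to do is open each file, read it, and check whether the text you're searching for appears in what you read. Here's a short and simple example:

file_list = ["my_pdf.pdf", "some_file.pdf"]
search_txt = "hi"
for name in file_list:
    # errors="ignore" skips bytes that can't be decoded as text
    with open(name, "r", errors="ignore") as f:
        if search_txt in f.read():
            print("Text found in", name)

Note that this only matches text stored literally in the file. Most PDFs compress their content, so the text usually has to be extracted with a PDF library first.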
16th Jan 2019, 1:40 PM
Tim Thuma
+ 2
Since I personally don't have any experience with this, I just did a Google search. Link: https://www.google.com/search?ie=UTF-8&q=python+fast+large+file+reading The most useful result was this Stack Overflow post: https://stackoverflow.com/questions/30294146/python-fastest-way-to-process-large-file Many other search hits mention the 'multiprocessing' library, so digging into it could be useful as well.
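As a rough illustration of the parallel approach mentioned above, here is a minimal sketch using the standard library's concurrent.futures module. The helper names file_contains and search_files are hypothetical, and the sketch assumes the files are plain text (for example, text already extracted from the PDFs). It defaults to a thread pool, which is usually enough for I/O-bound reads; passing ProcessPoolExecutor (which builds on multiprocessing) as executor_cls may help if the per-file work is CPU-heavy.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import partial


def file_contains(path, needle):
    """Return True if the text file at `path` contains `needle`."""
    # errors="ignore" skips bytes that cannot be decoded as text.
    with open(path, "r", encoding="utf-8", errors="ignore") as f:
        return needle in f.read()


def search_files(paths, needle, executor_cls=ThreadPoolExecutor):
    """Check every file concurrently and return the paths that match."""
    paths = list(paths)
    check = partial(file_contains, needle=needle)
    with executor_cls() as pool:
        # map() preserves input order, so results line up with `paths`.
        hits = list(pool.map(check, paths))
    return [p for p, hit in zip(paths, hits) if hit]
```

This still reads every file on every query, so it speeds up a single search but does not replace an index.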
17th Jan 2019, 3:05 PM
Tim Thuma
+ 1
Tim Thuma Yes, I forgot to mention that the preferred language is Python. Thanks for the answer, but I have many large PDF documents, and a naive search may degrade real-time performance. I know that indexing these documents would help, but I am not sure which packages can be used for this.
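To make the indexing idea concrete, here is a minimal sketch of an inverted index built with only the standard library. It is an illustration under assumptions, not a recommendation of a specific package: the function names tokenize, build_index, and search are hypothetical, and the text of each PDF is assumed to have been extracted beforehand into a dict of {document name: text}. The index is built once; each query then becomes a few dictionary lookups instead of a scan over every document.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map each word to the set of document names that contain it.

    `docs` is a dict of {name: extracted_text}; extracting the text from
    each PDF is assumed to have happened already (hypothetical input).
    """
    index = defaultdict(set)
    for name, text in docs.items():
        for word in tokenize(text):
            index[word].add(name)
    return index


def search(index, query):
    """Return the set of documents that contain every word of the query."""
    words = tokenize(query)
    if not words:
        return set()
    # .get() avoids adding empty entries to the defaultdict on a miss.
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


if __name__ == "__main__":
    docs = {
        "a.pdf": "Real time text search",
        "b.pdf": "Search engines index text",
    }
    index = build_index(docs)
    print(sorted(search(index, "text search")))  # ['a.pdf', 'b.pdf']
    print(sorted(search(index, "index")))        # ['b.pdf']
```

A dedicated full-text search package would add ranking, phrase queries, and on-disk persistence on top of this basic structure.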
17th Jan 2019, 12:02 PM
Bhushan