0
Text Search
How can I achieve real-time text search across multiple PDF documents? The preferred language is Python. There are many large PDF documents, and a normal search may degrade real-time performance. I know that indexing these documents would help, but I am not sure which packages can be used for this.
3 Answers
+ 3
I'll use Python to demonstrate this since you didn't specify a language.
Python has lots of great file-handling functions, and one of the most basic is opening and reading files. So in your example, all you have to do is open each file, read it, and then compare the text you read to what you're searching for.
Here's a short and simple example:
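from pypdf import PdfReader  # assumption: third-party package (pip install pypdf)

file_list = ["my_pdf.pdf", "some_file.pdf"]
search_txt = "hi"
for name in file_list:
    # PDFs are binary, so extract the text of each page instead of reading the raw file
    text = "\n".join(page.extract_text() or "" for page in PdfReader(name).pages)
    if search_txt in text:
        print("Text found in", name)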
+ 2
Since I personally don't have any experience with this, I just did a Google search. Link: https://www.google.com/search?ie=UTF-8&q=JUMP_LINK__&&__python__&&__JUMP_LINK+fast+large+file+reading
The most useful result was this Stack Overflow post: https://stackoverflow.com/questions/30294146/python-fastest-way-to-process-large-file
Many other search hits mention the 'multiprocessing' library, so digging into it could be useful as well.
+ 1
Tim Thuma Yes, I forgot to mention that the preferred language is Python. Thanks for the answer, but I have many large PDF documents, and a normal search may degrade real-time performance. I know that indexing these documents would help, but I am not sure which packages can be used for this.
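To make the indexing idea concrete, here is a minimal sketch of an inverted index that uses only the standard library. It assumes the text has already been extracted from each PDF; the docs dictionary, the file names, and the helper names tokenize, build_index and search are all made up for illustration. It is not a replacement for a real search package, just a way to see why lookups stay fast once the index is built.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map each word to the set of document names that contain it."""
    index = defaultdict(set)
    for name, text in docs.items():
        for word in tokenize(text):
            index[word].add(name)
    return index


def search(index, query):
    """Return the names of documents that contain every word of the query."""
    words = tokenize(query)
    if not words:
        return set()
    # Start from the documents for the first word, then narrow down.
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


# Hypothetical sample data standing in for text extracted from PDFs.
docs = {
    "a.pdf": "Real time search needs an index",
    "b.pdf": "Searching big files line by line is slow",
}
index = build_index(docs)
print(sorted(search(index, "real index")))
```

The index is built once, which is the slow part. After that, each query is only a few dictionary lookups and set intersections, so the size of the PDFs no longer affects search time. Note that this sketch matches whole words only: "search" will not match "Searching".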