[ale] Mining PDF's

Jeff Hubbs hbbs at attbi.com
Thu Dec 5 14:37:09 EST 2002


I was just poking through some PDFs I have here, and it seems that if
the objective is to mine arbitrary PDFs, you might have a problem on
your hands.  I might be wrong, but it looks as though some PDFs of text
are made from images of the text, not the text itself, which would mean
you'd have to OCR the images.  However, Google appears to search PDFs
somehow, so you might be able to engineer something similar, or perhaps
even make your PDF repository available to Google and use it to
search...
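
For the PDFs that do carry a text layer, here is a rough sketch of the
extract-and-search idea in Python.  It assumes the pdftotext tool (from
xpdf) is installed and on the PATH; the function names are made up for
illustration, and the "relevance" is nothing more than a word count.
An image-only PDF should come back with essentially empty text, which
is one way to flag the files that would need OCR.

```python
import re
import subprocess


def extract_text(pdf_path):
    """Return the text layer of a PDF using the external pdftotext tool."""
    # Passing "-" as the output file makes pdftotext write to stdout.
    result = subprocess.run(
        ["pdftotext", "-enc", "UTF-8", pdf_path, "-"],
        capture_output=True, encoding="utf-8", errors="replace", check=True,
    )
    return result.stdout


def needs_ocr(text):
    """Guess that a PDF is image-only if no visible text was extracted."""
    # pdftotext separates pages with form feeds, which strip() also removes.
    return not text.strip()


def summarize(text, max_chars=200):
    """Crude summary: collapse whitespace and keep the leading characters."""
    return " ".join(text.split())[:max_chars]


def relevance(text, term):
    """Count whole-word, case-insensitive occurrences of term in text."""
    pattern = r"\b%s\b" % re.escape(term)
    return len(re.findall(pattern, text, re.IGNORECASE))


def search(texts, term):
    """texts maps path -> extracted text.

    Returns (score, path, summary) tuples, highest score first.
    """
    hits = []
    for path, text in texts.items():
        score = relevance(text, term)
        if score > 0:
            hits.append((score, path, summarize(text)))
    # Sort by descending score, then by path so that ties are stable.
    hits.sort(key=lambda hit: (-hit[0], hit[1]))
    return hits
```

To use it, you would build the texts dictionary by calling
extract_text() on each PDF in the directory tree, setting aside any file
for which needs_ocr() is true.  File size can come from
os.path.getsize(), and the title would have to be read from the PDF's
metadata separately.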

- Jeff

On Thu, 2002-12-05 at 13:41, Kevin O'Neill Stoll wrote:
> Hey all,
> 
> I need to implement search functionality that can mine a URL
> directory structure containing PDFs. I was hoping that someone
> knew of an open source project that has already done some of
> the grunt work; otherwise, I'm open to ideas on how to
> accomplish this task.
> 
> In mining the PDFs, the search needs to grab a title, file
> size, a summary, and a relevance score based on a text search
> (e.g., if I search for 'dog', all PDFs containing the word
> 'dog' would be returned). I'm just not sure how to get the
> text out of a PDF.
> 
> Anywho, thanks in advance.
> 
> 
> 
> =====
> Kevin Stoll
> http://kevinstoll.org
> 
> OpenSource Software...FREE!
> Angering Bill Gates...priceless.
> ============================================================
> 
> _______________________________________________
> Ale mailing list
> Ale at ale.org
> http://www.ale.org/mailman/listinfo/ale

