Does anyone know whether search engine spiders currently index the text inside PDF files?
Thanks!
JPnyc
09-01-2006, 12:42 PM
Not so far as I'm aware, no.
The Old Sarge
09-01-2006, 05:42 PM
A PDF can be a good place to provide links, but I think JP is correct. The SEs don't crawl them.
Thanks for the replies! I know that there is software available for an organization to search PDF documents on a network. I wonder how long it will take this technology to reach the search engines?
My boss asked an interesting question: If the search engines can't search PDFs, why do they list them, and where do they get the description for the listing?
The Old Sarge
09-08-2006, 04:33 PM
Nora,
After you posed the original question, I got to thinking the same thing ...
Here's something I found:
PDF and Web Site Searching
As mentioned above, PDF files are hard on search engines, and HTML pages are much easier for them to deal with. However, if you must have PDF, please follow these procedures.
Preparing PDF Files for Searching
Make sure each PDF file has correct document properties, especially the title. An incorrect title makes it difficult for a person viewing search results to tell if this file is useful to them.
Check the PDF file format version number and make sure your search engine can read that version. Acrobat 5 uses the PDF 1.4 format.
If possible, break long PDF files into smaller single-subject files, such as book sections, chapters or even chapter sections. That way, no one will accidentally download a very long document just because a word has been matched.
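The version check described above can be sketched in a few lines of Python. This is only an illustration: it relies on the fact that a PDF file starts with a header such as %PDF-1.4, and the SUPPORTED_VERSIONS set and both function names are assumptions made up for the example, not part of any search engine's documentation.

```python
import re

# Assumption: the PDF versions your own search engine's indexer can read.
# Replace this set with whatever your indexer's documentation states.
SUPPORTED_VERSIONS = {"1.0", "1.1", "1.2", "1.3", "1.4"}


def pdf_version(data: bytes):
    """Return the version from a PDF header such as b'%PDF-1.4', or None."""
    match = re.match(rb"%PDF-(\d\.\d)", data[:16])
    return match.group(1).decode("ascii") if match else None


def is_indexable(data: bytes, supported=SUPPORTED_VERSIONS) -> bool:
    """True if the data starts with a PDF header whose version is supported."""
    version = pdf_version(data)
    return version is not None and version in supported
```

To check a real file, read its first 16 bytes in binary mode and pass them to is_indexable. Only the header is inspected, so this says nothing about whether the rest of the file is well formed.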
PDF and Metadata
Metadata is defined as "information about information". For simple search engines, that generally means the document title, description, keywords, file size and modification date, but it can be much richer than that, providing many more ways to describe an object and to search for that object. For more information, see the SearchTools Report on Metadata.
When search tools index PDF files, they can get the text from the PDF information fields, such as a document title and additional keywords. If the document creator didn't enter that information, the indexer may attempt to generate a title, or may just use the file name of the document.
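The fallback described above might look roughly like this in Python. It is a minimal sketch, not how any particular indexer works: listing_title is a hypothetical name, and it assumes the title from the document properties has already been extracted as a string (or None if the creator left it empty).

```python
from pathlib import PurePosixPath


def listing_title(info_title, file_path):
    """Pick the title shown in search results (hypothetical helper).

    Prefer the title from the PDF document properties; if it is missing
    or blank, fall back to the file name, as the text above describes.
    """
    title = (info_title or "").strip()
    if title:
        return title
    return PurePosixPath(file_path).name
```

With a blank title, a file at /docs/ar2006.pdf would be listed as ar2006.pdf, which is why filling in the document properties matters.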
Adobe XMP
With Acrobat 5.0 and new releases of other products, Adobe is supporting a new eXtensible Metadata Platform (XMP, previously called XAP). This allows the files to contain substantially more information about themselves, including Dublin Core data such as author, description, actual modification date and so on. This has not been widely used and we know of no search engines that take advantage of this metadata.
You can read the entire piece at http://www.searchtools.com/info/pdf.html
Glad you brought it up again. Very interesting reading. :)
autoecart
09-13-2006, 02:56 PM
If you want to make a PDF searchable, follow the above guidelines and break each section down to a file size no larger than 50 KB to 85 KB, as that is the largest file size a search engine spider will index in a single visit. Any larger than that and the spider will make two or more trips to the file to index it, which means a delay.
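For anyone who wants to check a folder against that figure, here is a rough Python sketch. The 85 KB limit is simply the upper number quoted in this post, not a documented search engine limit, and the function names are made up for the example.

```python
import os

# Assumption: the upper figure from the post above, not a documented limit.
SIZE_LIMIT_BYTES = 85 * 1024


def pdf_sizes(folder):
    """Map each PDF file name in the folder to its size in bytes."""
    return {
        name: os.path.getsize(os.path.join(folder, name))
        for name in os.listdir(folder)
        if name.lower().endswith(".pdf")
    }


def oversized(sizes, limit=SIZE_LIMIT_BYTES):
    """Return the names whose size exceeds the limit, sorted alphabetically."""
    return sorted(name for name, size in sizes.items() if size > limit)
```

Calling oversized(pdf_sizes("docs")) would list the PDFs in a docs folder that are candidates for splitting into smaller files.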
toniaxp
09-26-2006, 11:57 PM
PDF files are searchable by SEs, and I would agree with auto that you should break them down for faster indexing.