%!
% ACROBAT URL EXTRACTOR & LINK TESTER
% ===================================

% Copyright c 1999 by Don Lancaster and Synergetics, Box 809, Thatcher, AZ, 85552
% (520) 428-4073  don@tinaja.com  http://www.tinaja.com
% Consulting services available per http://www.tinaja.com/info01.html

% All commercial rights and all electronic media rights fully reserved.
% Personal use permitted provided header and entire file remains intact.
% Linking welcome. Reposting expressly forbidden.

% version 2.1

% This routine reads a specified Acrobat file and attempts to extract all
% PDFMark linked url's from that file. These url's can then be manually
% checked, used by a supervisory language, or converted to HTML where routines
% such as DOCTOR HTML can verify the links.

% An optional HTML document generator is included in this version.

% To use this program, enter the full path sourcefile and optional html
% target file names below and resave. Then distill the file.

% The url's are returned to the Distiller log file. Note that a NO FILE PRODUCED
% message is normal and expected, since you are after only the log file, not a pdf.

% IMPORTANT: Be sure to use "\\" when you mean "\" in any PostScript string!

% /sourcepdffilename (C:\\medocs\\Muse\\muse136\\muse136a.pdf) def
% /targethtmlfilename (C:\\medocs\\Muse\\muse136\\graburls.html) def

/sourcepdffilename (F:\\newblatmuse\\funstuff\\funstuff.pdf) def
/targethtmlfilename (F:\\newblatmuse\\funstuff\\graburls.html) def

/wanthtmloutput true def                % set flag here if html output wanted

/workstring 1000 string def            % temporary workstring
/auxworkstring 1000 string def         % rarely needed but must be unique

% checkline tests to see if a /URI followed by a space and string is present...

/checkline { (/URI \() search          % look for magic header
   {pop pop fixlongurl                 % repair long url
    truncateextra                      % if found, clean up line
    addurltohtmlfile                   % optionally add to html doc
    print (\n) print flush}            % and pretty print
   {pop} ifelse                        % otherwise do nothing
} def
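
% For reference, a url link made with pdfmark usually ends up in the distilled
% pdf on a text line containing something of this general form. This is only a
% representative sample; the surrounding dictionary entries vary with the
% Distiller version...
%
%     ... /A << /S /URI /URI (http://www.tinaja.com/) >> ...
%
% checkline keys only on the "/URI (" portion and discards everything before it.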

% mergestr merges the two top stack strings into one top stack string

/mergestr {2 copy length exch length add string dup dup 4 3 roll
   4 index length exch putinterval 3 1 roll exch 0 exch putinterval} def
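
% As an illustrative example, mergestr simply concatenates, so that...
%
%     (http://www.tinaja.com/) (info01.html) mergestr
%
% ...would leave the single string (http://www.tinaja.com/info01.html) on the
% stack.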

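% A long url can get split across two pdf lines by ending the first line with
% a "\" continuation character, so a wrapped link might look something like
% this (a made-up sample)...
%
%     /URI (http://www.tinaja.com/some/very/long/\
%     pathname.pdf) >>
%
% fixlongurl below detects the trailing "\", strips it, reads the next line,
% and splices the two pieces back together with mergestr.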
% fixlongurl grabs another line if the url line ends in "\"
% current limit is around 210 chars in url. Note need for second workstring.

/fixlongurl {dup dup length 1 sub get 92 eq   % if line ends in "\"
   {dup length 1 sub 0 exch getinterval       % remove "\"
    workfile auxworkstring readline           % get next string
    not {stringprocessingerror} if            % trap unlikely error
    mergestr                                  % and combine
   } if                                       % only when needed
} def

% truncateextra crops everything beyond the closing url parenthesis...

/truncateextra { (\)) search                  % look for closing parenthesis
   {exch pop exch pop }                       % get rid of it if found
   { (There was a string closing error.\n)    % this should never happen
     print flush stringclosingerror } ifelse  % generate intentional error
} def

% /starthtmlfile conditionally creates a html file object and writes a stock
% html header to it...

/ws {writefile exch writestring} def          % file writing utility

/starthtmlfile { wanthtmloutput {
   targethtmlfilename (w+) file /writefile exch def   % make a file to write
   (<HTML>\n) ws                              % write HTML header
   (<HEAD>\n) ws
   (<TITLE> Acrobat PDF url Extractor</TITLE>\n) ws
   (</HEAD>\n\n) ws
   (<BODY>\n) ws
   (<H3>) ws
   (\n\nLinked URL list extracted from ) ws
   sourcepdffilename ws
   (\n\n</H3>\n\n) ws
   (<UL>\n) ws                                % start the link list
   } if                                       % only when wanted
} def

% addurltohtmlfile conditionally adds the current url to the html file as a
% live link. The url string is left on the stack for later printing...

/addurltohtmlfile { wanthtmloutput {
   (<LI><A HREF=") ws
   dup ws                                     % url as the link target
   (">) ws
   dup ws                                     % url again as the visible text
   (</A>\n) ws
   } if                                       % only when wanted
} def

% endhtmlfile conditionally writes the closing html and then closes the file...

/endhtmlfile { wanthtmloutput {
   (</UL>\n\n) ws                             % outdent
   (</BODY>\n</HTML>\n\n) ws                  % and close
   writefile closefile
   } if                                       % only when wanted
} def

% this is the main loop. It reads one line of the source pdf file at a time
% for processing...

/graburls { sourcepdffilename (r) file /workfile exch def   % make a file to read
   starthtmlfile                              % start html file if wanted
   {mark workfile workstring readline         % read one line at a time
    {checkline}{exit} ifelse                  % test lines till done
    cleartomark } loop
   (\n\nDone extracting url's from ) print    % optional trailer
   sourcepdffilename print (\n) print
   (\n) print flush
   endhtmlfile                                % complete html file if wanted
   pop } def

% This actually does it...

graburls

%% EOF