%!

% POSTSCRIPT ACROBAT CATALOG INTERNAL DATA FORMATS
% ================================================

% by Don Lancaster  v1.2  April 13, 1997

% Copyright c. 1997 by Don Lancaster and Synergetics, Box 809,
% Thatcher AZ, 85552 (520) 428-4073. synergetics@tinaja.com
% All commercial rights and all electronic media rights *fully*
% reserved. Linking welcome. Reposting is expressly forbidden.

% Further support on http://www.tinaja.com
% Consulting services available via don@tinaja.com

% ====================================
% WARNING: Preliminary and partial code. Use at your own risk.
%          Report all problems to don@tinaja.com
%          *NOT* an official Verity or Adobe document.
%          Only warranty is "approximate quantity one"
% ====================================

% This PostScript file examines internal Acrobat data structures.
% It also includes a low level routine that expands data structure
% files for detailed study and further annotation.

% More specific practical uses are found in CATWORDS.PS, which extracts
% a list of indexed words; CATFREQS, which extracts a list of the
% indexed words and their usage frequencies; and CATINDEX, which creates
% a hard copy index of which words are on which page.

% For the fundamentals of using Acrobat Distiller as a general purpose
% PostScript computer, see DISTLANG.PS

% TUTORIAL INTRODUCTION
% ======================

% The PostScript language can be used to access or manipulate internal
% Acrobat Catalog files. This allows the custom creation of new features
% such as...
%     - extracting a list of all active keywords
%     - monitoring a "word frequency list" as a writing aid
%     - highlighting oddball or strange word usage
%     - providing additional spell checking and usage flagging
%     - creating a hard copy index of words and their pages
%     - finding sequential word groups for hard copy indexes
%     - removing or inserting keywords
%     - reducing file sizes through additional excluded words

% The usual starting points for information on Acrobat Catalog file
% formats are Adobe Fax number 331308 on "Files Associated with an
% Acrobat Catalog Index General Information", plus the contents of
% /index/Style/ for a given Acrobat document set.

% The analysis that follows is based on a "clean room" independent
% study of generated Acrobat Catalog files. It relies heavily on my
% "tearing method" found in my ENHANCING YOUR APPLE II, Volume I.

% The analysis specifically applies to version vdk103.dll and Style
% 3.0 ../Style/style.did Acrobat Catalog. Circa spring of 1997.

% In general, Acrobat Catalog is a simple word counter. Every
% group of ASCII characters ending with a space or followed by the
% end of a PS string is considered a word. These words are simply
% numbered in sequential OCCURRENCE ORDER from the start of a document.

% Acrobat Catalog does *NOT* pay direct attention to pages, page numbers,
% or .PDF internal objects. Nor to the exact sequence in which text and
% figures get put down on a page. As each word is imaged, it gets
% added to the word list. In order. No matter where it came from.

% However, track is kept of which part of the word list occurs on which
% page. Indirect methods can thus be used to extract word page numbers.

% A distinction is kept as to whether a word is "near" or "far". Words
% that occur within 127 words of each other are "near". Words spaced
% 128 or more words apart are considered "far".

% Positioning information is RELATIVE.
% A document's position is
% how far forward it is from a previous document in the master list. A
% word's position is how far forward it is from the previously mapped
% occurrence of the same word in the same document.

% Acrobat Catalog maintains three primary files that involve word
% position and frequency. The first (or "diw") file is a list of indexed
% words in alphabetical order. The second (or "div") file is an
% occurrence mapping list, showing where and how often each indexed
% word appears in the master document word lists. Only non-excluded
% words appear on this list. The third (or "dif") file links the
% position of a word on the word list to the position of the mapping
% info on the mapping list.

% The word list, the word map, and the link list are easily extracted
% using PostScript and can be used to generate new custom features.

% INTERNAL FILE STRUCTURES
% ========================

% The two most useful files to access are in the /index/parts/ folder.
% These are the "DDD" or "Document Dataset Descriptor" file and the
% "DID" or "Document Information Descriptor" file. The HIGHEST NUMBER
% file is the most current. As in 00000005.DDD or 00000005.DID.

% Both of these files consist of "n" pages of 1024 eight bit bytes each.
% A directory is always on page #0. "Active" pages are numbered starting
% with page one.

% A typical 44-byte master directory entry consists of....

%     - subfile name as three ASCII characters and a null
%     - 4-byte length of active portion of file, LSB first
%     - 4 mystery bytes, typically zeros.
%     - sequence of up to FIFTEEN 2-byte page numbers, LSB first.
%          The LSB start of the first page number is THIRTEEN bytes
%          (or value "12" starting with zero) into the directory
%          entry. Each page is 1024 bytes long.
%     - when a SIXTEENTH to TWENTY-THIRD non-zero 2-byte page
%          number exists, it points to a DIRECTORY CONTINUANCE PAGE.
%     - The format of a DIRECTORY CONTINUANCE PAGE is up to
%          512 additional non-zero 2-byte data page numbers.
%     - padding nulls to the start of the next directory listing
%          or the end of the page.

% The spacing and length of the directory entries may vary, so they
% are best found by searching on their three-byte-plus-null descriptor.

% In general, data pages are contiguous with occasional large jumps.
% These jumps apparently take place on later catalog revisions.

% DID SUBFILES
% ============

%  $$$ directory - Holds up to 256 characters of "style" information,
%                  telling how the rest of the database is formatted.
%                  Useful info ends on first null.

%  $$f directory - Unknown mystery function. Probably has to do with
%                  word stemming or sound index. Sometimes points to
%                  a page of all nulls.

%  $$x directory - Unknown mystery function. Points to a page of mostly
%                  run together word fragments. Probably has something
%                  to do with word stemming.

%  dif directory - Links the position of a word in the diw file to the
%                  position of the occurrence info in the div file. One
%                  record of a fixed 17 bytes for each word. Format...

%                  bytes 0-2   - word position in diw file, LSB first
%                  byte 3      - offset to next word position in diw
%                                file. Also ONE MORE THAN length of word.
%                  bytes 4-7   - map position in div file, LSB first
%                  bytes 8-10  - offset to next word position in div
%                                file, LSB first
%                  bytes 11-13 - stem index, LSB first
%                  bytes 14-16 - sound index, LSB first

%  dis directory - A list of the total words in each referenced file,
%                  including keywords. In the same order as the
%                  document list. LSB first. A then B is 256*B + A.
%                  Apparently changes somehow if more than 65535 words
%                  are in a given document.

%  $$v directory - Apparently empty and unused. Possibly involved with
%                  stemming.

%  diw directory - Mixed case listed words in ASCII alphabetical order,
%                  separated by single nulls.

%  div directory - The word occurrence file. This one needs the
%                  following detailed study...
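
% EXAMPLE: UNPACKING ONE dif RECORD
% =================================

% What follows is only a minimal sketch and not part of the original
% utility. It assumes the 17-byte dif record layout listed above and
% the "A then B is 256*B + A" LSB first convention. The names /lsbint
% and /difrecord are hypothetical helpers made up for illustration.
% A four-byte value beyond the PostScript integer range comes back
% as a real.

%    string start count lsbint --> integer

/lsbint {/licount exch store /listart exch store /listr exch store
   0                                  % running total
   licount 1 sub -1 0                 % most significant byte first
      {listart add listr exch get     % fetch one byte
       exch 256 mul add} for          % total = total*256 + byte
} def

%    17-byte-string difrecord -->
%       wordpos wordstep mappos mapstep stemindex soundindex

/difrecord {/direc exch store
   direc 0 3 lsbint        % bytes 0-2   word position in diw file
   direc 3 get             % byte 3      offset to next word in diw
   direc 4 4 lsbint        % bytes 4-7   map position in div file
   direc 8 3 lsbint        % bytes 8-10  offset to next position in div
   direc 11 3 lsbint       % bytes 11-13 stem index
   direc 14 3 lsbint       % bytes 14-16 sound index
} def

% Usage: (\001\002\000) 0 3 lsbint leaves 513 on the stack, since
% 1 + 2*256 + 0*65536 = 513.
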
% Word Occurrence File Details
% ============================

% The word occurrence file uses an oddball yet elegantly cute base 128
% number system. Its advantages are compact storage and easy marking of
% words "near" each other. It is important to fully understand this
% base 128 number system before word incidence files can be explored
% or understood.

% The system consists of seven numeric bits and one flag per byte.
% If the MSB is cleared, the numeric bits are used as is, ending the
% value being conveyed. If the MSB flag is set, at least one additional
% byte is required to complete the relative offset value.

% Specifically, offset values may be NEAR, FAR, or WAYOUT...

%   - A NEAR offset uses ONE byte and covers the range of 1 to 127.
%     Its MSB flag is cleared. The offset value is as read.
%
%   - A FAR offset uses TWO bytes A and then B. It covers the range
%     of 128 to 16383. The offset value is B*128 plus A - 128. The
%     first MSB flag is set; the second is cleared.
%
%   - A WAYOUT offset uses THREE bytes A, B, and then C. It covers
%     the range of 16,384 to 2,097,151. The offset value is C*16384
%     plus (B - 128)*128 plus A - 128. The first and second MSB flags
%     are set; the third and last is cleared.
%
%   - Extremely unlikely higher values can add a fourth byte.

% Each word incidence record consists of a group of document incidence
% sub records. One sub record is needed for each document in which the
% word occurs at least once. Each sub record ends with a null. The
% length of each sub record is variable, depending on the number of
% words in the current document and the number of bytes per offset.

% The first value in the sub record is the ABSOLUTE position of the
% current document in the document list.

% Once again, each value may be a NEAR single byte, a FAR two-byte
% pair, or a WAYOUT three-byte triad. Using flagged mod 128 arithmetic.

% The second offset in the sub record is the ABSOLUTE position of the
% first word in the DOCUMENT's word count list.
% You can also think of
% this as the RELATIVE position from the start of the word count list.

% If the word was only used once in the current document, a null follows,
% ending the sub record. If the word is used a second time, a RELATIVE
% offset position is read, again using one, two, or three bytes as needed.

% The process continues for as many occurrences of the word as are in
% the present document, ending with a null. Each null moves you to the
% next document in the document list that uses the current word.

% Note that there is no explicit word usage count per se. Instead, you
% have to add up each word incidence for each document.

% DDD SUBFILES
% ============

%  $$$ directory - Holds up to 256 characters of "style" information,
%                  telling how the rest of the database is formatted.
%                  Useful info ends on first null. Apparently the same
%                  as the $$$ directory in DID.

%  $$f directory - Mystery file of a few splattered integers over 455
%                  bytes. The "32" might be the offset into the word
%                  and occurrence files. Those "4" values may be lengths
%                  of words. Call it a "parameters and conventions"
%                  stash. Not understood.

%  _df directory - The version number and vdk103.dll header info. 112
%                  bytes on page two. Called "header info for partition
%                  management".

%  dkf directory - Converts document number to pathname. Fixed records
%                  of nine bytes each...

%                  bytes 0-2 - document number
%                  bytes 3-6 - start position on full filename stash
%                  bytes 7-8 - length of full filename, LSB first

%  ddf directory - Document data format directory holds key information
%                  about each indexed file as sequential records of
%                  125 bytes each.
%                  Partially explained in style.ddd

%                  byte 0        - exists flag (1 = exists)
%                  byte 1        - chunk flag (0 = not a chunk)
%                  byte 2        - largedoc data (187 value; not
%                                  understood)
%                  bytes 3-6     - start page
%                  bytes 7-10    - end page
%                  bytes 11-14   - start page from
%                  bytes 15-18   - end page at
%                  bytes 19-22   - number of pages in doc, LSB first
%                  bytes 23-55   - Permanent ID as 32 ASCII characters
%                  byte 56       - wxe version (2 - not understood)
%                  bytes 57-60   - creation date
%                  bytes 61-64   - modification date
%                  bytes 65-68   - pointer to document filename path
%                  bytes 69-70   - offset to document creator name
%                  bytes 71-74   - pointer to start of individual page
%                                  word counts
%                  bytes 75-76   - total number of page word count bytes
%                                  (equals number of pages * 2)
%                  bytes 77-80   - pointer to short xyb filename
%                  bytes 81-82   - length of xyb filename
%                  bytes 83-107  - nulls. Probably additional pointers
%                                  and lengths.
%                  bytes 108-111 - unknown mystery pointer
%                  bytes 112-113 - unknown mystery length
%                  byte 114      - unknown cleared flag? value = 0
%                  byte 115      - unknown cleared flag?
%                  byte 116      - unknown set flag? value = 255
%                  byte 117      - unknown set flag?
%                  byte 118      - unknown set flag?
%                  bytes 119-122 - pointer to full dkv filename
%                  bytes 123-124 - length of full dkv filename

%                  (( Above bytes repeat for each document indexed ))

%  uid directory - The user id directory. 33 byte ASCII key followed
%                  by a double null for each indexed document.

%  drd directory - A group of pointers to read the relative ()(.)
%                  file headers in the ddc directory. Apparently
%                  remaps docs between index space and listing space.
%                  A six doc index was in a 1-3-5-4-2-0 sequence.
%                  Each document has a nine byte record...
%
%                  bytes 0-2 - document number as used by word list
%                  bytes 3-6 - start of file header in ddc
%                  bytes 7-8 - length of file header in ddc

%  $$v directory - Apparently consists of the full first filename and
%                  the full last filename in the document collection.
%                  ASCII filename characters ending in null.
%                  First name starts after 32 nulls. Use not understood.

%  dkv directory - A listing of the full filename path for each indexed
%                  doc. Starts after 32 nulls. Each filename is followed
%                  by a null.

%  xya directory - Contains info on each indexed doc. Especially the
%                  relationship between page number and word count.

%                  First string ending in null is relative filename
%                  such as ()(.)(MUSE100D.PDF)

%                  Second string ending in null is doc program source
%                  such as Acrobat Distiller for Windows.

%                  Fixed data file of two bytes per page gives the
%                  number of total words on that page, LSB first, in
%                  page order.

%                  To find the range of words on any page, sum the
%                  totals of all previous pages to get a start word
%                  number. Add the words on the current page to get
%                  an end word number.

%                  This allows relating positions in the word list to
%                  locations on actual document pages.

%  xyb directory - Short document filenames in the order used by the
%                  div mapping. 32 null offset. ASCII strings ending
%                  in null. Such as MUSE103D.PDF
%                  This file relates document number to document name.

%  xyc directory - Apparently unused. Might be involved with word
%                  stemming or sounds like. Unmapped. Not understood.

%  xyd directory - Apparently unused. Might be involved with word
%                  stemming or sounds like. Unmapped. Not understood.

%  xye directory - Apparently unused. Might be involved with word
%                  stemming or sounds like. Unmapped. Not understood.

%  ddv directory - List of the full filename of each document, starting
%                  with C: Each ASCII string ends with a null.

%  ddc directory - List of relative prefixes for each short document
%                  filename. As in ()(.) Each ASCII string ends with
%                  a null. Use not fully understood.

% SELECTING FILES FOR STUDY
% ===========================

% Of all these subfiles, the most interesting ones appear to be...
%     - the xyb subfile in DDD of document names
%     - the diw subfile in DID of indexed words
%     - the div subfile in DID of indexed word occurrence
%     - the ddf subfile in DDD of document page word counts
%     - the dif subfile in DID linking the diw and div subfiles

% CATALOG FILE LENGTHS
% ====================

% The number of stored bytes needed per indexed word depends very
% much upon the documents being indexed. On the document side, there
% is the ratio of graphics to text and the style of compression used.

% Almost always, though, the catalog file sizes can be dramatically
% reduced by excluding more high frequency words of minor interest.

% There is a minimum of 25K or so of overhead. Mostly as two 8K blocks
% in the DDD and DID files. Each word needs its length plus one byte
% for the word list. Say nine bytes. Each word link on the link list
% needs an additional seventeen bytes. The position mapping depends on
% whether a word is "near" or "far", but averages somewhere between
% two and three bytes per occurrence.

% Thus, each eight character word used 25 times in a document requires
% something like 100 bytes of storage. Every ten high frequency useless
% words you can eliminate from your index knock 1K or so off the total.

%%%%%%%%%%%%%% LOW LEVEL DATA FILE EXAMINING UTILITY %%%%%%%%%%%%%%

% This PostScript-as-language routine examines data files, breaking
% them up into individual 1024 byte pages. Each byte is shown
% numerically in decimal, arranged 32 values per line.

% To activate, use an editor to change the source and target filenames
% below. Then send the file to Acrobat Distiller or Ghostscript.
% More details in DISTLANG.PS on www.tinaja.com/pslib01.html

% ALWAYS USE "\\" WHEN YOU MEAN "\" IN THE FILENAME STRINGS!!!!!

% The final results appear in your named destination file. They may
% be viewed and further annotated using an editor or word processor.

% Place the exact full filename you wish to expand in the /grabfilename
% string.
% In the case of Acrobat Catalog, you will usually be interested
% in a .DDD and a .DID file. Usually found in the INDEX\PARTS folder of
% the cataloged documents. The highest number files are the latest.

/grabfilename (c:\\Windows\\Desktop\\PDX.TRY1\\CAT6\\CAT6\\PARTS\\00000000.DDD) def   % data file source

% Place the exact full filename of where you wish to put the data
% file examination results in the /dumpfilename string.

/dumpfilename (c:\\Windows\\Desktop\\PDX.TRY1\\CAT6\\CAT6\\PARTS\\formsnop) def   % results destination

grabfilename (r) file /source exch store     % create source file object
dumpfilename (w) file /sink exch store       % create sink file object

% /proc32 grabs 32 data values and numerically formats them on one line.
% It sets /oktocont false once the end of the source file is reached...

/proc32 {source 32 string readstring
   {{3 string cvs sink exch writestring      % write each byte in decimal
     ( ) sink exch writestring} forall       % followed by a space
    (\n) sink exch writestring true}
   {pop false} ifelse                        % end of file: drop leftovers
   /oktocont exch store} def

% this is the main loop that examines one 1024 byte page at a time...

0 1 4000 {/page exch store                   % grab page number
   (\npage number ) sink exch writestring    % show page number
   page 4 string cvs sink exch writestring
   (\n\n) sink exch writestring
   32 {proc32} repeat                        % report 1024 values
   oktocont not {exit} if                    % exit at end of file
} for

source closefile                             % clean up
sink closefile

%%%%%%%%%%%%%%%%% end utility %%%%%%%%%%%%%%
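
% EXAMPLE: DECODING ONE BASE 128 OFFSET
% =====================================

% A minimal sketch only, and not part of the original utility. It reads
% one NEAR, FAR, or WAYOUT offset out of a string holding div data,
% ASSUMING that the first byte is the least significant one, so that a
% FAR pair A, B is worth (A - 128) + 128*B. The name /getoffset is a
% hypothetical helper made up for illustration. A string that runs out
% before a byte with a cleared MSB shows up causes a rangecheck error.

%    string index getoffset --> value nextindex

/getoffset {/gopos exch store /gostr exch store
   /goval 0 store                        % value built up so far
   /gomul 1 store                        % weight of the current byte
   {gostr gopos get                      % fetch the next byte
    /gopos gopos 1 add store
    dup 127 and gomul mul goval add      % add in its seven numeric bits
    /goval exch store
    /gomul gomul 128 mul store           % next byte is worth 128x more
    128 lt {exit} if                     % cleared MSB flag ends the value
   } loop
   goval gopos                           % leave value and next index
} def

% Usage: (\201\001) 0 getoffset leaves 129 2 on the stack. Bytes 129
% and 1 give (129 - 128) + 1*128 = 129, with the next unread byte at
% index 2. A returned value of zero is the null that ends a sub record.
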