%!

%   POSTSCRIPT ACROBAT CATALOG INTERNAL DATA FORMATS
%   ================================================
%   by  Don Lancaster            v1.2  April 13, 1997

%   Copyright c. 1997 by Don Lancaster and Synergetics, Box 809,
%   Thatcher AZ, 85552 (520) 428-4073. synergetics@tinaja.com
%   All commercial rights and all electronic media rights *fully*
%   reserved. Linking welcome. Reposting is expressly forbidden.

%   Further support on http://www.tinaja.com
%   Consulting services available via don@tinaja.com

%   ====================================

%   WARNING: Preliminary and partial code. Use at your own risk.
%            Report all problems to don@tinaja.com
%            *NOT* an official Verity or Adobe document.
%            Only warranty is "approximate quantity one"

%   ====================================

%  This PostScript file examines internal Acrobat data structures.
%  It also includes a low level routine that expands data structure
%  files for detailed study and further annotation. 

%  More specific practical uses are found in CATWORDS.PS, which extracts
%  a list of indexed words; CATFREQS, which extracts a list of the
%  indexed words and their usage frequencies; and CATINDEX, which creates
%  a hard copy index of which words are on which page.

%  For the fundamentals of using Acrobat Distiller as a general purpose
%  PostScript computer, see DISTLANG.PS


%  TUTORIAL INTRODUCTION
%  ======================

% The PostScript language can be used to access or manipulate internal
% Acrobat Catalog files. This allows the custom creation of new features
% such as...

%       - extracting a list of all active keywords
%       - monitoring a "word frequency list" as a writing aid
%       - highlighting oddball or strange word usage
%       - providing additional spell checking and usage flagging
%       - creating a hard copy index of words and their pages
%       - finding sequential word groups for hard copy indexes
%       - removing or inserting keywords 
%       - reducing file sizes through additional excluded words

% The usual starting points for information on Acrobat Catalog file
% formats are Adobe Fax number 331308 on "Files Associated with an
% Acrobat Catalog Index General Information", plus the contents of
% /index/Style/ for a given Acrobat document set.

% The analysis that follows is based on a "clean room" independent
% study of generated Acrobat Catalog files. It relies heavily on my
% "tearing method" found in my ENHANCING YOUR APPLE II, Volume I.

% The analysis specifically applies to version vdk103.dll and Style 
% 3.0  ../Style/style.did  Acrobat Catalog. Circa spring of 1997.

% In general, Acrobat Catalog is a simple word counter. Every
% group of ASCII characters ending with a space or followed by the
% end of a PS string is considered a word. These words are simply
% numbered in sequential OCCURRENCE ORDER from the start of a document.

% Acrobat Catalog does *NOT* pay direct attention to pages, page numbers,
% or .PDF internal objects. Nor to the exact sequence in which text and 
% figures get put down on a page. As each word is imaged, it gets 
% added to the word list. In order. No matter where it came from.

% However, track is kept of which part of the word list occurs on which
% page. Indirect methods can thus be used to extract word page numbers.

% A distinction is kept as to whether a word is "near" or "far". Words
% that occur within 127 words are "near" each other. Words spaced 128
% or more words apart are considered "far".

% Positioning information is RELATIVE. A document's position is
% how far forward it is from a previous document in the master list. A
% word's position is how far forward it is from the previously mapped
% occurrence of the same word in the same document.

% Acrobat Catalog maintains three primary files that involve word
% position and frequency. The first (or "diw") file is a list of indexed
% words in alphabetical order. The second (or "div") file is an
% occurrence mapping list, showing where and how often each indexed word
% appears in the master document word lists. Only non-excluded words
% appear on this list. The third (or "dif") file links the position of a
% word on the word list to the position of the mapping info on the
% mapping list.

% The word list, the word map, and the link list are easily extracted
% using PostScript and can be used to generate new custom features.


%  INTERNAL FILE STRUCTURES
%  ========================

% The two most useful files to access are in the /index/parts/ folder.
% These are the "DDD" or "Document Dataset Descriptor" file and the "DID"
% or "Document Information Descriptor" file. The HIGHEST NUMBERED
% file is the most current. As in 00000005.DDD or 00000005.DID.

% Both of these files consist of "n" pages of 1024 eight bit bytes each.
% A directory is always on page #0. "Active" pages are numbered starting
% with page one.

% A typical 44-byte master directory entry consists of....

%            - subfile name as three ASCII characters and a null

%            - 4-byte length of active portion of file, lsb first

%            - 4 mystery bytes, typically zeros. 

%            - sequence of up to FIFTEEN 2-byte page numbers, LSB first.
%              The LSB of the first page number is the THIRTEENTH byte
%              (offset "12" counting from zero) of the directory entry.
%              Each page is 1024 bytes long.

%            - when a SIXTEENTH to TWENTY-THIRD non-zero 2-byte page 
%              number exists, it points to a DIRECTORY CONTINUANCE PAGE.

%                 - The format of a DIRECTORY CONTINUANCE PAGE is up to
%                   512 additional non-zero 2-byte data page numbers. 

%            - padding nulls to the start of the next directory listing
%              or the end of the page.

% The spacing and length of the directory entries may vary, so they
% are best found by searching on their three-byte-plus-null descriptor.
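
% As a hedged sketch of the directory search above, /direntry looks for
% a four byte name such as (div\000) in a string holding page #0. It
% returns the 4-byte active length and the first 2-byte page number.
% The /direntry, /depage, /dename, and /deoff names are illustrative
% assumptions of mine and not Catalog terms. Only the first match gets
% used, and directory continuance pages are not followed.

/direntry {                       % page0 name -> length firstpage true
                                  %            -> false
   4 dict begin
   /dename exch def  /depage exch def
   depage dename search
     { /deoff exch length def     % bytes ahead of the match = offset
       pop pop                    % discard the match and the remainder
       depage deoff  4 add get
       depage deoff  5 add get 256 mul add
       depage deoff  6 add get 65536 mul add
       depage deoff  7 add get 16777216 mul add   % length, LSB first
       depage deoff 12 add get
       depage deoff 13 add get 256 mul add        % first page number
       true }
     { pop false } ifelse
   end
} def

% Typical use, with page #0 already read into a hypothetical 1024 byte
% string named page0string...
%
%     page0string (div\000) direntry {...found...} {...missing...} ifelse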

% In general, data pages are contiguous with occasional large jumps. 
% These jumps apparently take place on later catalog revisions.


%  DID SUBFILES
%  ============

%  $$$ directory - Reads up to 256 characters of "style" information,
%                  thus telling how the rest of the database is formatted.
%                  Useful info ends on first null.

%  $$f directory - Unknown mystery function. Probably has to do with
%                  word stemming or sound index. Sometimes points to
%                  page of all nulls.

%  $$x directory - Unknown mystery function. Points to a page of mostly
%                  run together word fragments. Probably has something
%                  to do with word stemming.

%  dif directory - Links the position of a word in the diw file to the
%                  position of the occurrence info in the div file. One
%                  record of a fixed 17 bytes for each word. Format...

%                  bytes 0-2   - word position in diw file, LSB first

%                  byte 3      - offset to next word position in diw
%                                file. Also ONE MORE THAN length of word.

%                  bytes 4-7   - map position in div file, LSB first

%                  bytes 8-10  - offset to next word position in div file,
%                                LSB first

%                  bytes 11-13 - stem index, LSB first

%                  bytes 14-16 - sound index, LSB first
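
% A minimal sketch of pulling one dif record apart, assuming the whole
% dif subfile already sits in a string. /difrecord takes that string
% and a record number counted from zero. It returns the diw word
% position, the word length, and the div map position. All of the
% /df... names are my own illustrations, not Catalog terms.

/difrecord {                      % difstring n -> wordpos wordlen mappos
   6 dict begin
   /dfbase exch 17 mul def        % records are a fixed 17 bytes
   /dfstr exch def
   /dfle {                        % offset count -> integer, LSB first
      /dfcnt exch def
      /dfoff exch dfbase add def
      0                           % build up from the top byte down
      dfcnt 1 sub -1 0
        { dfoff add dfstr exch get exch 256 mul add } for
   } def
   0 3 dfle                       % bytes 0-2, word position in diw
   3 1 dfle 1 sub                 % byte 3 is one more than word length
   4 4 dfle                       % bytes 4-7, map position in div
   end
} def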


%  dis directory - A list of the total words in each referenced file,
%                  including keywords. In the same order as the
%                  document list. LSB first: A then B is 256*B + A.
%                  Apparently changes somehow if more than 65535 words
%                  are in a given document.

%  $$v directory - Apparently empty and unused. Possibly involved with
%                  stemming.

%  diw directory - Mixed case listed words in ASCII alphabetical order,
%                  separated by single nulls.

%  div directory - The word occurrence file. This one needs the following
%                  detailed study...


% Word Occurrence File Details
% ============================

% The word occurrence file uses an oddball yet elegantly cute base 128
% number system. Its advantages are compact storage and easy marking of
% words "near" each other. It is important to fully understand this
% base 128 number system before word incidence files can be explored
% or understood.

% The system consists of seven numeric bits and one flag per byte.
% If the MSB is cleared, the numeric bits are used as is, ending the
% value being conveyed. If the MSB flag is set, at least one additional
% byte is required to complete the relative offset value.

% Specifically, offset values may be NEAR, FAR, or WAYOUT...

%     - A NEAR offset uses ONE byte and covers the range of 1 to 127.
%       Its MSB flag is cleared. The offset value is as read.
%
%     - A FAR offset uses TWO bytes A and then B. It covers the range
%       of 128 to 16383. The offset value is B*128 plus A - 128. The
%       first MSB flag is set; the second is cleared.
%
%     - A WAYOUT offset uses THREE bytes A, B, and then C. It covers
%       the range of 16,384 to 2,097,151. The offset value is C*16384
%       plus (B - 128)*128 plus A - 128. The first and second MSB flags
%       are set; the third and last is cleared.
%
%     - Extremely unlikely higher values can add a fourth byte.
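
% A hedged sketch of decoding one such value, assuming the later bytes
% are weighted by 128 and by 16384, which is what the quoted ranges
% require. /b128value and the other /b128... names are my own
% illustrations. The procedure takes a string and a start index. It
% returns the value and the index of the next unread byte. Running off
% the end of the string gives a rangecheck error.

/b128value {                      % string start -> value next
   5 dict begin
   /b128pos exch def  /b128str exch def
   /b128val 0 def     /b128mul 1 def
   {                              % one pass per byte
     /b128byte b128str b128pos get def
     /b128pos b128pos 1 add def
     /b128val b128byte 127 and b128mul mul b128val add def
     /b128mul b128mul 128 mul def
     b128byte 128 lt {exit} if    % a cleared MSB ends the value
   } loop
   b128val b128pos
   end
} def

% <05> 0 b128value leaves 5 and 1. <80 01> 0 b128value leaves 128 and 2.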

% Each word incidence record consists of a group of document incidence
% sub-records. One sub-record is needed for each document in which the
% word occurs at least once. Each sub-record ends with a null. The
% length of each sub-record is variable, depending on the number of
% times the word occurs in the current document and the number of bytes
% per offset.

% The first value in the sub-record is the ABSOLUTE position of the 
% current document in the document list.

% Once again, each value may be a NEAR single byte, a FAR two-byte
% pair, or a WAYOUT three-byte triad. Using flagged base 128 arithmetic.

% The second value in the sub-record is the ABSOLUTE position of the
% first use of the word in the DOCUMENT's word count list. You can also
% think of this as the RELATIVE position from the start of the word
% count list.

% If the word was only used once in the current document, a null follows,
% ending the sub-record. If the word is used a second time, a RELATIVE
% offset position is read, again using one, two, or three bytes as needed.

% The process continues for as many uses of the word as are in the
% present document, ending with a null. Each null moves you to the next
% document in the document list that uses the current word.

% Note that there is no explicit word usage count per se. Instead, you
% have to add up each word incidence for each document.
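
% A hedged sketch of walking one document sub-record, assuming the div
% bytes are in a string and that multi-byte values weight their later
% bytes by 128 and 16384. /subrecord and the /sr... names are my own
% illustrations. The result is an array holding the document position
% followed by the ABSOLUTE position of each use of the word, plus the
% index just past the ending null. The usage count for that document is
% the array length minus one.

/subrecord {                      % string start -> array next
   8 dict begin
   /srpos exch def  /srstr exch def
   /srnext {                      % -- value, stepping srpos forward
      /srval 0 def  /srmul 1 def
      { /srbyte srstr srpos get def
        /srpos srpos 1 add def
        /srval srbyte 127 and srmul mul srval add def
        /srmul srmul 128 mul def
        srbyte 128 lt {exit} if   % a cleared MSB ends this value
      } loop
      srval
   } def
   /srword 0 def
   [ srnext                       % the document position comes first
     { srstr srpos get 0 eq {exit} if    % a null ends the sub-record
       /srword srword srnext add def     % first absolute, rest relative
       srword
     } loop
   ]
   srpos 1 add                    % index just past the ending null
   end
} def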


%  DDD SUBFILES
%  ============

%  $$$ directory - Reads up to 256 characters of "style" information,
%                  thus telling how the rest of the database is formatted.
%                  Useful info ends on first null. Apparently the same
%                  as the $$$ directory in DID.

%  $$f directory - mystery file of a few splattered integers over 455 
%                  bytes. The "32" might be the offset into the word
%                  and occurrence files. Those "4" values may be lengths
%                  of words. Call it a "parameters and conventions" stash.
%                  Not understood.

%  _df directory - the version number and vdk103.dll header info. 112
%                  bytes on page two. Called "header info for partition
%                  management".

%  dkf directory - converts document number to pathname. Fixed records
%                  of nine bytes each...

%                  bytes 0-2 - document number
%                  bytes 3-6 - start position on full filename stash
%                  bytes 7-8 - length of full filename, lsb first

%  ddf directory - document data format directory holds key information
%                  about each indexed file as sequential records of
%                  125 bytes each. Partially explained in style.ddd

%                  byte 0      - exists flag (1 = exists)
%                  byte 1      - chunk flag (0 = not a chunk)
%                  byte 2      - largedoc data (187 value; not understood)

%                  bytes 3-6   - start page
%                  bytes 7-10  - end page
%                  bytes 11-14 - start page from
%                  bytes 15-18 - end page at
%                  bytes 19-22 - number of pages in doc, lsb first

%                  bytes 23-55 - Permanent ID as 32 ASCII characters
%                  byte 56     - wxe version (2 - not understood)  
%                  bytes 57-60 - creation date
%                  bytes 61-64 - modification date

%                  bytes 65-68 - pointer to document filename path
%                  bytes 69-70 - offset to document creator name
%                  bytes 71-74 - pointer to start of individual page
%                                word counts
%                  bytes 75-76 - total number of page word count bytes
%                                (equals number of pages * 2)
%                  bytes 77-80 - pointer to short xyb filename
%                  bytes 81-82 - length of xyb filename 

%                 bytes 83-107 - nulls. Probably additional pointers 
%                                and lengths.
%                bytes 108-111 - unknown mystery pointer
%                bytes 112-113 - unknown mystery length

%                byte 114      - unknown cleared flag? value = 0
%                byte 115      - unknown cleared flag? 
%                byte 116      - unknown set flag? value = 255 
%                byte 117      - unknown set flag?
%                byte 118      - unknown set flag?

%                bytes 119-122 - pointer to full dkv filename
%                bytes 123-124 - length of full dkv filename 

%            (( Above bytes repeat for each document indexed ))


%  uid directory - The user id directory. 33 byte ASCII key followed
%                  by a double null for each indexed document.

%  drd directory - A group of pointers to read the relative ()(.)
%                  file headers in the ddc directory. Apparently
%                  remaps docs between index space and listing space.
%                  A six doc index was in a 1-3-5-4-2-0 sequence.

%                  Each document has a nine byte record...
%
%                    bytes 0-2 - document number as used by word list
%                    bytes 3-6 - start of file header in ddc
%                    bytes 7-8 - length of file header in ddc

%  $$v directory - Apparently consists of the full first filename and
%                  the full last filename in the document collection.
%                  ASCII filename characters ending in null. First name
%                  starts after 32 nulls. Use not understood.

%  dkv directory - A listing of the full filename path for each indexed
%                  doc. Starts after 32 nulls. Each filename is followed
%                  by a null.

%  xya directory - Contains info on each indexed doc. Especially the
%                  relationship between page number and word count.

%                  First string ending in null is relative filename
%                  such as ()(.)(MUSE100D.PDF)

%                  Second string ending in null is doc program source
%                  such as Acrobat Distiller for Windows.

%                  Fixed data file of two bytes per page gives the
%                  number of total words on that page, lsb first in
%                  page order.

%                  To find the range of words on any page, sum the
%                  totals of all previous pages to get a start word
%                  number. Add the words on the current page to get
%                  an end word number.

%                  This allows relating positions in the word list to
%                  locations on actual document pages.
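
% A hedged sketch of the page range sum above, assuming the two byte
% per page counts are already in a string and that pages are numbered
% from one. /pagerange and the /pr... names are my own illustrations.
% It returns the total words on all earlier pages, then that total plus
% the words on the requested page.

/pagerange {                      % counts page -> startword endword
   4 dict begin
   /prpage exch def  /prstr exch def
   /prcount {                     % index from zero -> words on that page
      2 mul dup prstr exch get
      exch 1 add prstr exch get 256 mul add     % lsb first
   } def
   0                              % running total
   0 1 prpage 2 sub { prcount add } for         % sum all earlier pages
   dup prpage 1 sub prcount add   % then add the current page
   end
} def

% With page counts of 10, 20, and 256 words...
%     <0A00 1400 0001> 2 pagerange     leaves 10 and 30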

%  xyb directory - Short document filenames in the order used by the
%                  div mapping. 32 null offset. ASCII strings ending
%                  in null. Such as MUSE103D.PDF

%                  This file relates document number to document name.

%  xyc directory - Apparently unused. Might be involved with word
%                  stemming or sounds like. Unmapped. Not understood.

%  xyd directory - Apparently unused. Might be involved with word
%                  stemming or sounds like. Unmapped. Not understood.

%  xye directory - Apparently unused. Might be involved with word
%                  stemming or sounds like. Unmapped. Not understood.

%  ddv directory - List of the full filename of each document, starting
%                  with C: Each ASCII string ends with a null.

%  ddc directory - List of relative prefixes for each short document
%                  filename. As in ()(.) Each ASCII string ends with
%                  a null. Use not fully understood.


% SELECTING FILES FOR STUDY
% ===========================

%  Of all these subfiles, the most interesting ones appear to be...

%          - the xyb subfile in DDD of document names
%          - the diw subfile in DID of indexed words
%          - the div subfile in DID of indexed word occurrences
%          - the ddf subfile in DDD of document page word counts
%          - the dif subfile in DID linking the diw and div subfiles


%  CATALOG FILE LENGTHS
%  ====================

% The number of stored bytes needed per indexed word depends very
% much upon the documents being indexed. On the document side, there
% is the ratio of graphics to text and the style of compression used.

% Almost always, though, the catalog file sizes can be dramatically 
% reduced by excluding more high frequency words of minor interest.

% There is a minimum of 25K or so of overhead. Mostly as two 8K blocks
% in the DDD and DID files. Each word needs its length plus one byte
% for the word list. Say nine bytes. Each word link on the link list
% needs an additional seventeen bytes. The position mapping depends on
% whether a word is "near" or "far", but averages something between
% two or three bytes per occurrence.

% Thus, each eight character word used 25 times in a document requires
% something like 100 bytes of storage. Every ten useless high frequency
% words you can eliminate from your index knocks 1K or so off the total.
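
% The arithmetic above as a hedged sketch. /wordbytes is my own
% illustrative name. It takes the word length in characters, the number
% of uses, and an assumed average of mapping bytes per use. The document
% position and ending null of each sub-record are ignored.

/wordbytes {                      % chars uses bytesperuse -> bytes
   mul                            % position mapping bytes
   exch 1 add add                 % word list needs length plus one
   17 add                         % plus one link list record
} def

% 8 25 3 wordbytes gives 101, while 8 25 2 wordbytes gives 76.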



%%%%%%%%%%%%%%  LOW LEVEL DATA FILE EXAMINING UTILITY %%%%%%%%%%%%%% 

% This PostScript-as-language routine examines data files, breaking
% them up into individual 1024 byte pages. Each byte is shown 
% numerically in decimal, arranged 32 values per line.

% To activate, use an editor to change the source and target filenames 
% below. Then send the file to Acrobat Distiller or Ghostscript.

% More details in DISTLANG.PS on www.tinaja.com/pslib01.html

% ALWAYS USE "\\" WHEN YOU MEAN "\" IN THE FILENAME STRINGS!!!!!

% The final results appear in your named destination file. They may
% be viewed and further annotated using an editor or word processor.

% Place the exact full filename you wish to expand in the /grabfilename
% string. In the case of Acrobat Catalog, you will usually be interested
% in a .DDD and a .DID file. Usually found in the INDEX\PARTS folder of 
% the cataloged documents. The highest number files are the latest.

/grabfilename 
(c:\\Windows\\Desktop\\PDX.TRY1\\CAT6\\CAT6\\PARTS\\00000000.DDD) 
def % data file source

% Place the exact full filename of where you wish to put the data
% file examination results in the /dumpfilename string.

/dumpfilename 
(c:\\Windows\\Desktop\\PDX.TRY1\\CAT6\\CAT6\\PARTS\\formsnop)
 def % results file destination


grabfilename (r) file /source exch store  % create source file object
dumpfilename (w) file /sink exch store    % create sink file object

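% /proc32 grabs up to 32 data values and numerically formats them on one
% line. A short or empty read at the end of the file clears /oktocont...

/proc32 {source 32 string readstring       % -- substring bool
         /oktocont exch store               % false once the file runs out
         dup length 0 gt                    % any bytes to show?
           {{3 string cvs sink exch writestring
             ( ) sink exch writestring} forall
            (\n) sink exch writestring}
           {pop} ifelse                     % discard an empty leftover
        } def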

% this is the main loop that examines one 1024 byte page at a time...

0 1 4000 {/page exch store                        % grab page number

         (\npage number ) sink exch writestring   % show page number
         page 4 string cvs sink exch writestring 
         (\n\n) sink exch writestring

          32 {proc32} repeat                      % report 1024 values 

         oktocont not {exit} if                   % exit at end of file
         } for

source closefile                                  % clean up
sink closefile

%%%%%%%%%%%%%%%%% end utility %%%%%%%%%%%%%%

