Hawaii Island Kauai Maui
Home About Us
Join Us Contact Us LWV-US
newsletters position papers legislature reports testimony links

The archiving process

How the League newsletters were converted into electronic form for the website and archives
  1. Scanning. The first step is scanning the issue. (Missing issues recovered from the bound issues collection at the UH Hamilton Library Hawaiian Collection were photocopied at the library, and the scans were made from the photocopies.) In general, the archive scans were made of a two-page spread, using a large-format (12"×17") scanner [various, currently Epson GT-15000]. When the masthead or pages were in color, at least the first page was generally scanned in color. The scans were made as Adobe Acrobat pdf files [Create PDF from Scanner], one for each issue. The file name is the code for the newsletter plus the date, e.g. av6401.pdf is the January 1964 issue of Aloha Voter. (For consistency, "av" is used even for the earlier, different names of Aloha Voter.) We now have a photographic pdf copy of each issue.
  2. Masthead. A full-size image of the masthead is made by cropping the image of the first page [using HyperSnap software]. This image is then "cleaned up" where necessary (mailing labels, spotting and other defects are removed) [using Photoshop], and saved to provide the masthead image shown at the top of each article. (The pdf file retains the original image of the masthead.) The masthead shown with the articles is generally displayed at half-size, especially for early issues, but the underlying image is full size: right-clicking a masthead and selecting "display image" will show the full-size image. (A thumbnail image of the masthead is also created at this time.)
  3. Database. The issue's article information is input into the database [in this case, an MS Access database]. The input screen looks similar to this:

    [Screenshot: article input screen]

    This is a custom-designed input program for the database, written for the Hawaii League's newsletters, with each article's data input individually. A full copy of each (html) formatted article is stored in the database as well as information about the article and issue.

    Each article's information is entered on this screen. The title is sometimes changed slightly from the original: a leading "A", "An" or "The" is removed to facilitate alphabetization in the indexes. Sometimes the title is shortened or, where the original article had no title, one is supplied. On the actual webpage display, the original title is shown (not the shortened one), and a supplied title is shown in [Square Brackets].

    Multiple authors are listed separately, so they can appear separately in the indexes. Author names are "regularized" to facilitate indexing, while the actual signature of an article is generally shown on the article page. (For example, the signature on the president's message may be just a first name or nickname, but it is indexed as a full name.) Typos in names are corrected when recognized.

    Information from the masthead is stored as issue information, used for statistics.

  4. Article information input. The input screen includes buttons to control various functions. Once all the titles, authors, etc., for an issue have been added and saved, the indexes (title, author, subject), newsletter index page, and issue pages are generated by the custom software. These programs create an html "frame" page for each article (the masthead/header, top index, title, signature and footer), each in a file with a coded name: the issue name plus part of the article title (as shown in the filename textbox in the screenshot above).
  5. OCR (Optical Character Recognition). Once the database entries have been made and the page frames generated, the pdf file of the issue is analyzed by the OCR software (in this case, OmniPage Pro). This is the process whereby the photographic image in the pdf file is interpreted into editable characters and output as a text file. Unless the separate parts of each page are clearly defined, this usually requires marking each section of a page with a "frame" so that the software can isolate the separate articles. The process is usually very accurate for originals which were printed or photocopied (from type or electronically), but for older issues, which were mimeographed or used similar copying processes, or when the paper was colored, there are generally many errors. The OCR software includes proofreading functions.
  6. Proofreading. Once the OCR process is complete, the pages are copied to Microsoft Word, where Word's proofreading and editing functions can be used, especially the strong spell-checker. Typos are corrected, both those originally in the article and those resulting from the archiving process. However, particularly in earlier issues where the quality was relatively poor for machine readability, typos still get through into the final pages. (In some cases image quality is so poor that retyping the article is necessary.) Please help by notifying us of typos you spot, including minor ones. Just mention the article and typo... I'll find it!
  7. HTML formatting. Once the Word file is "clean", the separate articles are copied into the corresponding article "frame" pages in a programmer's text editor (similar to Notepad but more sophisticated; in this case UltraEdit-32). The articles are then formatted using HTML tags for display on the webpage. For the most part I try to match the original formatting. Before word processing, the only character formatting available on most typewriters was all-caps and underline. Often I have replaced all-caps with bold, as all-caps are usually very dense on webpages. I've generally retained underlining. In issues containing italicization and boldface, I use a (custom) macro to replace the bold, italic, and underlined words with their equivalent html tags. Budget pages are generally shown as images, rather than html. Many, probably most, images are retained with the articles, and photographs are also collected into a photo database and displayed together in the "League Photo Album".
  8. Browser proofing. The semi-final step is checking the page as displayed in the browser (Internet Explorer, Mozilla Firefox, etc.). Final formatting and proofreading changes are made in the html file based on this check, and the article pages are complete.
  9. Sequential paging. Once the pages are finished, the input screen is used once more: to read the formatted pages into the database, and to regenerate the article pages of the issues before and after the new issue, so that the "next" and "previous" arrows on each screen will link to the proper issues.
  10. Uploading. The final article pages and indexes are then uploaded using ftp (file transfer protocol) software (in this case WS-FTP Pro) to our web provider, Lunar Pages.
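
For readers who think in code, here is a rough Python sketch of two conventions described above: the file naming from step 1 and the title handling from step 3. It is only an illustration, not the custom Access program itself. The function names, the file-name pattern (letters, then a two-digit year and two-digit month) and especially the rule for guessing the century are assumptions made for this sketch.

```python
import re

# Assumed pattern: newsletter code in letters, then two-digit year and
# two-digit month, as in "av6401.pdf".
ISSUE_FILE = re.compile(r"^([a-z]+)(\d{2})(\d{2})\.pdf$")


def parse_issue_filename(name, pivot=30):
    """Split a name such as 'av6401.pdf' into (code, year, month).

    The century is a guess: two-digit years at or above `pivot` are read
    as 19xx, the rest as 20xx. The pivot is an assumption of this sketch.
    """
    match = ISSUE_FILE.match(name.lower())
    if match is None:
        raise ValueError(f"not an issue file name: {name!r}")
    code = match.group(1)
    yy, month = int(match.group(2)), int(match.group(3))
    if not 1 <= month <= 12:
        raise ValueError(f"month out of range in {name!r}")
    year = (1900 if yy >= pivot else 2000) + yy
    return code, year, month


def index_title(title):
    """Drop a leading 'A', 'An' or 'The' so the title sorts on its first real word."""
    return re.sub(r"^(?:a|an|the)\s+", "", title.strip(), flags=re.IGNORECASE)


print(parse_issue_filename("av6401.pdf"))
print(index_title("The Archiving Process"))
```

The article page would still show the original title; only the index entry uses the shortened form.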

And that's about all there is to it!
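
A footnote on step 9 for the technically curious: the reason the neighboring issues have to be regenerated can be shown in a few lines of Python. This is a hypothetical sketch, not the actual software. Representing an issue as a (year, month) pair and the function name `add_issue` are assumptions made for the illustration.

```python
from bisect import bisect_left


def add_issue(issues, new_issue):
    """Insert new_issue into the chronologically sorted list `issues`.

    Issues are (year, month) tuples, a representation assumed for this sketch.
    Returns (previous, following): the neighboring issues whose pages would
    need regenerating so that their arrows point at the new issue. Either
    value is None when the new issue is the earliest or the latest.
    """
    pos = bisect_left(issues, new_issue)
    if pos < len(issues) and issues[pos] == new_issue:
        raise ValueError(f"issue already present: {new_issue}")
    issues.insert(pos, new_issue)
    previous = issues[pos - 1] if pos > 0 else None
    following = issues[pos + 1] if pos + 1 < len(issues) else None
    return previous, following


archive = [(1964, 1), (1964, 3)]
print(add_issue(archive, (1964, 2)))  # prints ((1964, 1), (1964, 3))
```

Adding a recovered issue in the middle of the run changes the "next" link of the issue before it and the "previous" link of the issue after it, which is why both are rebuilt.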

