Imagine facing this challenge: digitizing millions, or even hundreds of millions, of microfilm images. How have other companies tackled such a daunting task?

One firm, Science Applications International Corporation (SAIC), has the answer. SAIC is using SunRise scanners and InputAccel to digitize more than 100 million government employee records stored on microfiche.

"The nature of the program is to take personnel files having to do with government investigations for security clearances and put them online in an internal network," explains Steve McCaughey of SAIC. "They have (the records of) about three million people on microfiche, with anywhere from one to five fiche per person, and, typically 60 images per fiche."

With a job of this magnitude, SAIC had to find both a heavy-duty scanner that could handle the sheer number of documents in the microfiche library and a document capture program powerful enough to keep all of the images it planned to make available online organized and easily accessible.
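The figures McCaughey quotes give a rough sense of that magnitude. The short calculation below treats them as exact, which they are not ("about," "typically"), so the result is only a ballpark range; the function and variable names are illustrative.

```python
# Figures quoted in the article; treated as exact for a ballpark estimate.
PEOPLE = 3_000_000          # "about three million people"
FICHE_PER_PERSON = (1, 5)   # "anywhere from one to five fiche per person"
IMAGES_PER_FICHE = 60       # "typically 60 images per fiche"


def image_count_range(people, fiche_range, images_per_fiche):
    """Return (low, high) bounds on the total number of images."""
    low_fiche, high_fiche = fiche_range
    return (
        people * low_fiche * images_per_fiche,
        people * high_fiche * images_per_fiche,
    )


low, high = image_count_range(PEOPLE, FICHE_PER_PERSON, IMAGES_PER_FICHE)
print(f"{low:,} to {high:,} images")  # 180,000,000 to 900,000,000 images
```

Even the low end of that range is consistent with the "more than 100 million" records cited for the project.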

The Digitizing Solution

SAIC's choice for the scanning portion of this massive job was SunRise's digital scanner, the only scanner capable of performing the job in the time frame available. Due to the large number of documents in the library, SAIC is using five scanners connected to a local area network (LAN), all running simultaneously, eight hours a day, six days a week, to convert the immense document backlog.

To ensure accuracy in indexing as well as in document identification, SAIC gave each microfiche a unique bar code number, usually tied to the Social Security number of the individual on file. The bar code is printed on a pick ticket separate from the microfiche and then moved to a file for scanning.

An operator scans the bar code from the pick ticket and then loads the microfiche into the scanner. Using RowScan, a SunRise software component, the scanner scans each row of the film, saving the row as one large grayscale or bitonal image on the network. The operator then unloads the film and loads another, starting the scanning process over again.

After the rows are scanned, they are sent across the LAN to the RowScan Segmenter workstations. These workstations automatically segment each image out of the rows, framing each one with a red box that the Extractor workstation uses to extract the images. If an error is made during the segmentation phase, the operator can manually adjust the parameters to correct the problem.
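The article does not describe how RowScan finds the frames. Purely as an illustration of the idea, one common approach is to look for runs of columns that contain dark pixels, separated by blank gaps. The sketch below assumes a bitonal row stored as nested lists (1 = dark, 0 = blank); the function names and the min_width parameter are hypothetical and are not part of RowScan.

```python
def segment_row(row_image, min_width=2):
    """Return (start, end) column ranges that contain dark pixels.

    row_image is a list of pixel lines; each line is a list of 0/1 values.
    Ranges narrower than min_width are treated as noise and dropped.
    """
    if not row_image:
        return []
    width = len(row_image[0])
    # A column counts as occupied if any pixel in it is dark.
    occupied = [any(line[x] for line in row_image) for x in range(width)]

    frames, start = [], None
    # The trailing False is a sentinel that closes a run ending at the edge.
    for x, filled in enumerate(occupied + [False]):
        if filled and start is None:
            start = x
        elif not filled and start is not None:
            if x - start >= min_width:
                frames.append((start, x))
            start = None
    return frames


def extract_frames(row_image, frames):
    """Cut each (start, end) column range out of the row as its own image."""
    return [[line[s:e] for line in row_image] for s, e in frames]


row = [
    [0, 1, 1, 0, 0, 1, 1, 1, 0],
    [0, 1, 0, 0, 0, 0, 1, 1, 0],
]
boxes = segment_row(row)
images = extract_frames(row, boxes)
```

In this toy model, raising or lowering min_width plays the role of the parameter an operator might adjust when the automatic segmentation goes wrong.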

The Extractor is a single workstation running the extraction part of RowScan. The basic function of the Extractor is to cut out all the segmented images and save them as digitized files to a user-selected directory on hard disk, compact disc or other storage device. In SAIC's case, all images are written to another network where they enter the InputAccel environment.

The Indexing Solution

InputAccel is an extremely flexible document capture application that is composed of two parts: 1) an open integration server that manages and controls the document capture process, and 2) a customizable set of plug-in software modules that perform specific document capture tasks, including OCR editing, rescanning and exporting to various imaging, workflow and/or full-text retrieval applications. SAIC uses approximately 20 workstations to perform the InputAccel end of the document capture process.

After the images have been extracted, they are saved to a file folder within InputAccel. Every few minutes, the InputAccel server checks this folder for new documents and automatically imports each new image, along with its associated naming file, into the InputAccel system.
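The polling step can be pictured with a small sketch. This is not InputAccel's importer; it only assumes, for illustration, that each image is a .tif file and that its naming file shares the same base name with a .txt extension. Both extensions and all function names are hypothetical.

```python
from pathlib import Path


def find_new_images(folder, already_imported, image_ext=".tif", name_ext=".txt"):
    """Return (image, naming_file) pairs not yet imported, sorted by file name."""
    pairs = []
    for image in sorted(Path(folder).glob(f"*{image_ext}")):
        if image.name in already_imported:
            continue
        naming_file = image.with_suffix(name_ext)
        if not naming_file.exists():
            continue  # companion file not written yet; pick it up next pass
        pairs.append((image, naming_file))
    return pairs


def poll_once(folder, already_imported):
    """Run one polling pass and record what was picked up."""
    new_pairs = find_new_images(folder, already_imported)
    for image, _naming_file in new_pairs:
        already_imported.add(image.name)
    return new_pairs
```

A scheduler would call poll_once every few minutes with the same already_imported set, so each image enters the system only once.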

"After the importer creates the appropriate InputAccel batch label, the images go through the normal InputAccel process which includes image enhancement and QA," McCaughey says. "We also add any images that have been converted via paper scanners if that's the type of thing we're working on." SAIC has created a custom image enhancement module within InputAccel to perform deskewing, hole removal and noise processing on the incoming images.

When the images are ready for viewing, they are indexed. Explains McCaughey: "The indexing/QA module we're using allows us to view many pages at once and then zoom in on them, so that our QA process goes very quickly." At the indexing stage, the operator confirms that the bar code number scanned matches the one on the document.

From there, the operator performs the first level of Quality Assurance (QA). The operator double-clicks on an image, which brings up a magnified version of it, then checks the image's text for legibility and overall image quality. If the image passes, it is marked as "good" and continues through the InputAccel process. If it fails, the image is tagged as "bad" and sent to a rescanning station. The physical images, meanwhile, are sent over to a high-speed, automatic feed Kodak 5500 paper scanner (connected to a workstation running an InputAccel scan module). If there are paper documents that need to be added into the document batch, they are scanned directly into it.

Flexible Indexing Structure

According to McCaughey, "InputAccel can maintain a tree structure of the document (batch), so that you can index by row or section. We find it very useful to keep track of the rows on the microfiche, so that if we have to rescan an image at a later time, we can find the appropriate image on the tree diagram where the images are indexed by fiche number, row and even column."
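A tree of this kind can be modeled as nested mappings keyed by fiche number, then row, then column. The class below is a minimal sketch of that idea, not InputAccel's internal representation; the class and method names are hypothetical.

```python
from collections import defaultdict


class BatchTree:
    """Images indexed by fiche number, then row, then column."""

    def __init__(self):
        self._tree = defaultdict(lambda: defaultdict(dict))

    def add(self, fiche, row, column, image_name):
        self._tree[fiche][row][column] = image_name

    def locate(self, image_name):
        """Return (fiche, row, column) for an image, or None if absent."""
        for fiche, rows in self._tree.items():
            for row, columns in rows.items():
                for column, name in columns.items():
                    if name == image_name:
                        return fiche, row, column
        return None

    def images_in_row(self, fiche, row):
        """Return the images of one row in column order."""
        columns = self._tree.get(fiche, {}).get(row, {})
        return [columns[c] for c in sorted(columns)]


tree = BatchTree()
tree.add("F001", 2, 3, "img_a")
tree.add("F001", 2, 1, "img_b")
print(tree.locate("img_a"))           # ('F001', 2, 3)
print(tree.images_in_row("F001", 2))  # ['img_b', 'img_a']
```

Given a bad image, locate tells the operator which fiche to pull and where on it the frame sits.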

If an image is marked as "bad" during the QA process, rather than continuing along the InputAccel workflow, it is rerouted to a rescan station which consists of a workstation connected to both a paper scanner and a Canon microfiche scanner. The proper microfiche or paper documents are pulled and rescanned to achieve the best possible image quality and then sent back to indexing/QA to continue through the process again.
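The good/bad routing amounts to a loop: check an image, and if it fails, rescan it and check again. The sketch below illustrates that loop under stated assumptions. The is_legible and rescan callables stand in for the human QA decision and the rescan station, and the max_rescans limit is an assumption added so the loop always ends; the article mentions no such limit.

```python
def run_qa(images, is_legible, rescan, max_rescans=3):
    """Route each image through QA, rescanning failures up to max_rescans times.

    Returns (accepted, unresolved). Unresolved images still fail QA after
    the allowed number of rescans.
    """
    accepted, unresolved = [], []
    for image in images:
        rescans = 0
        while True:
            if is_legible(image):
                accepted.append(image)      # marked "good"; continues onward
                break
            if rescans == max_rescans:
                unresolved.append(image)    # still "bad"; needs attention
                break
            image = rescan(image)           # sent to the rescan station
            rescans += 1
    return accepted, unresolved


# Toy stand-ins: an image is (name, quality score); rescanning adds 3 points.
batch = [("a", 6), ("b", 2), ("c", -10)]
good, stuck = run_qa(
    batch,
    is_legible=lambda img: img[1] >= 5,
    rescan=lambda img: (img[0], img[1] + 3),
    max_rescans=2,
)
```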

SAIC also added an extra QA step near the end of the InputAccel process. "Because of the way our storage system works, and the customer's desire to check everything before (the data) is committed to the storage towers, we created a Document QA step," explains McCaughey. During Document QA, the digitized images are compared to the actual documents they were scanned from to make sure that they are organized correctly online, have the appropriate content and are in the right location within the batch structure.

Exporter Processes TIFF Images

After the document batches have been accepted online, they are processed by InputAccel's Image Export module. "(Image Export) can group the documents within a batch into sections and export them into the storage system as multi-page TIFF images. In our case, within each document batch, section 1 is made up of the images that were taken from the microfiche, section 2, images scanned from paper, and section 3, administrative releases that have been signed by the individual being investigated. Section 3 images are routed to a separate storage tower." The final step is the Index Export module, which allows the users to input the type of information included in the index file, such as section number or the full name of the file.
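The grouping and routing McCaughey describes can be sketched as a small planning function. This is an illustration only, not the Image Export module: it plans which pages go into which multi-page file and where each file is sent, without writing any TIFF data. The file naming pattern, the destination labels and the function name are all hypothetical.

```python
from collections import defaultdict

# Section numbers as described in the article.
SECTION_BY_SOURCE = {"microfiche": 1, "paper": 2, "release": 3}
# Section 3 (signed administrative releases) goes to a separate tower.
SEPARATE_TOWER_SECTIONS = {3}


def plan_export(batch_id, pages):
    """Group (page_name, source) pairs into one export job per section."""
    sections = defaultdict(list)
    for page_name, source in pages:
        sections[SECTION_BY_SOURCE[source]].append(page_name)

    jobs = []
    for section in sorted(sections):
        separate = section in SEPARATE_TOWER_SECTIONS
        jobs.append({
            "file": f"{batch_id}_section{section}.tif",  # hypothetical name
            "pages": sections[section],
            "destination": "separate_tower" if separate else "main_tower",
        })
    return jobs


jobs = plan_export("case42", [
    ("p1", "microfiche"),
    ("p2", "release"),
    ("p3", "microfiche"),
    ("p4", "paper"),
])
```

Each job corresponds to one multi-page file; only the section 3 job is marked for the separate tower.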

"Eventually, we export the finished documents out of InputAccel and save them online in a 4 terabyte hard disk storage system where they can be accessed by their unique case file number or by the social security number of the person being investigated."

Asked who makes inquiries in this growing database, McCaughey replies, "Typically, it's Defense Investigative Service (DIS) people, usually after someone applies for clearance, for instance, or if a report comes in from the field, and an evaluation, whether good, bad or indifferent, has to be made concerning a person's background…they have to have quick access to that person's file." When the new system is fully in place, the retrieval process will take only a matter of minutes.

