Russian Digital Libraries Journal - 2000 - Vol 3 - Issue 1
User and Large-Scale Electronic Document Files Interaction in the PCBIRS System
Vitaliy Yu. Bugaev Scientific Research Institute of Physics
Introduction
The twentieth century solved the problem of global communication between people, but aggravated the problem of information chaos. At present, the world is literally about to choke us with electronic information. Suffice it to mention the Internet and the great number of CD-ROM publications. OCR developers publicly promise to convert all hard-copy documents into electronic equivalents. And given that a personal computer can store gigabytes of information, accumulating a large body of information at home is no longer a problem. The only question is what to do next.
The computer is probably the most suitable prosthesis invented by mankind to solve these problems. But judging by the main directions in which computer technologies are currently developing, one gets the impression that the ears are the most important part of the head.
Indeed, the lion's share of projects concerns the development of communications and access to huge information resources, which are growing at overwhelming speed. At the same time, the culture of information consumption has hardly changed and differs little from sorting files into folders. A user without a tool for processing information will most likely respond by seeking additional information from external sources (a possible but probably useless option). The creation of appropriate tools is therefore becoming more important: tools that not only extend access to different sources but also allow huge amounts of information to be processed.
How then to accumulate information so that it does not turn into garbage, but can be used in a beneficial way? How to build up an information environment interacting with other major information sources?
This is a problem not only for users but also for practically all organizations that own large information resources and provide a broad audience with access to them. They may prefer to steer clear of these issues (having enough problems of their own), but one has to realize that, as a result, the number of consumers who were initially pleased with access to new sources has decreased sharply.
It is obvious that the development of access should proceed in parallel with the development of methods that allow the end user to assimilate large amounts of information. This, in my opinion, should be taken into account when developing projects such as electronic libraries.
One of the most efficient ways of consuming information from external sources is for end users to build their own databases. It should be kept in mind that users often search external information sources not because they have nothing of value themselves. One of the purposes of accumulating information is to process it and, as a result, to build new information files for subsequent publication.
Database Management System (DBMS) and Data Retrieval System (DRS)
If the problem of information accumulation and storage is solved by hardware and the operating system, the problem of searching and handling information and incorporating it into decision-making depends on the relevant software.
The problem is substantially simplified when information is received and stored as structured databases. In this case most processing operations can be formalised, because the handling of data is replaced by the handling of names.
The ability to formalise the processing of structured information will in many ways define the further development of database management systems (DBMS). The many DBMSs on the market differ from one another in storage and access methods and in end-user service, but are united by the common idea of a structure. The best known model is the relational one, in which information is represented as a set of related tables (hierarchical and network architectures can probably be regarded as variants of the relational model).
But a very important question remains unanswered: how to transform free text messages into the database format?
Unfortunately there is no common formal procedure to solve this problem and the database developers have to rely on their intuition and experience.
It is no accident that the limitations of the relational model for storing and accessing information have recently come under discussion. This is mainly because the meaning of the word “data” is changing substantially. On the one hand, complex objects such as texts, tables, images and control functions are now called data. On the other hand, the structured-database paradigm is limited because it assumes that semantics are described at the development stage and can be stored separately from the data. On top of this, it is still uncertain whether any information can be represented as a finite set of interrelated simple data tables.
That is why attempts are being made to step outside the limits of the relational approach (post-relational, object-relational, object-oriented and other databases). Note that using complex objects as data does not automatically mean abandoning the relational model. Support for functions that reconstruct an object for viewing (for example, a multimedia file) is a very important DBMS feature. But none of this contradicts or extends the relational model as long as a considerable set of such objects can be described by their parameters and represented as a finite number of linked tables. In such cases we should speak of implementing the relational model rather than abandoning it.
The limitations of the relational approach become clearer when the data do not have clearly expressed, fixed semantics, and their meaning may depend on how a particular individual interprets the information. This is a very common situation in text processing. Each sentence of a coherent text can yield a number of data items whose semantics may be inseparable from the general context. The meaning of the data may depend on a subjective approach defined by the tasks to be solved (different people read texts in different ways and find different parts interesting). Although any text can theoretically be divided into many binary relations of the datum-value type, in practice relational database developers choose, out of all the possible relations, only those needed to solve a limited number of tasks. The scheme of operation of such a database is given below:

Fig. 1.
As a rule, database creation is an expert task. The expert develops a data model and processes the original messages according to this model. The messages are interpreted and converted into structures for further storage. When designing a database, the expert must be sure that as new information is added only the data values will change, while the structure remains the same.
A user who needs information can work only within the model defined by the expert and has no idea how the data were added to the information store, so the user's own interpretation of search results depends in many ways on the expert's opinion.
Despite the evident contradiction, this approach is in many cases fully justified for technologies with well-established and clear information processing patterns (accounting, banking operations, passenger transportation, etc.). But we should remember that such databases are secondary products of the preliminary processing of primary information, in which formalisation and generalisation may lead not only to some loss but also to distortion of the semantics of data taken out of the context of the original message.
Another approach to database formation is to store all the incoming messages in the format in which they are originally received.

Fig. 2.
This scheme looks simpler, but the simplicity is deceptive. In the former case a search returns a set of relevant data; in the latter it returns a set of original messages containing those data. If the data then have to be fed into subsequent processing algorithms, interpretation becomes considerably more complicated, because it requires additional, non-formalised mechanisms of data extraction.
On the other hand, this scheme eliminates the contradictions described above. The methods of information storage and search do not depend on the data model, which may vary with the user's objectives. This allows a great deal of flexibility, in particular when working with large information batches, and eliminates the need for preliminary structuring and formalisation, which are the most expensive stages of database creation. It should also be noted that the majority of information (letters, protocols, normative acts, decrees, laws, etc.) cannot be structured in any sensible way and can only be presented as free text.
DBMSs for structured data processing are usually classified as factographic systems. Software products implementing the second scheme are documentary data retrieval systems (DRS). The two types differ not only technically but also in the results of user interaction.
DRS developers try to convince us that the mere possibility of finding several thousand documents among millions is incredible luck. DBMS developers say: give us a structure, fill it with consistent data, and you will see how SQL queries can work wonders.
Indeed, in factographic systems users receive a set of so-called “tuples” containing the relevant data and can have their questions answered.
In documentary DRSs the user is given a list of relevant documents as the result of a context search, and only one question is answered: which documents contain the relevant data. Users have to extract and link the relevant data themselves, and only then do they have answers to their questions. When too many documents are found, the problem remains unsolved. The only solution is automated semantic analysis of the text, but we should admit that no acceptable results have yet been achieved in this area. At best, DRSs offer only statistical methods of ranking documents by relevance. It does not make much sense to assess the importance of words only by the frequency of their occurrence, without any regard to their value for the person submitting the request. However, sometimes these methods produce acceptable results.
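To illustrate the purely statistical ranking mentioned above, here is a minimal Python sketch that orders documents by the raw frequency of the query terms. The function names, the tokenizer and the sample documents are assumptions made for illustration only; they do not describe how any particular DRS works.

```python
import re
from collections import Counter

def tokenize(text):
    """Split text into lowercase word tokens."""
    return re.findall(r"\w+", text.lower())

def rank_by_frequency(documents, query):
    """Return (doc_id, score) pairs sorted by raw query-term frequency."""
    terms = set(tokenize(query))
    scores = []
    for doc_id, text in documents.items():
        counts = Counter(tokenize(text))
        score = sum(counts[t] for t in terms)
        if score > 0:
            scores.append((doc_id, score))
    # Highest score first; ties are broken by document id for stable output.
    return sorted(scores, key=lambda pair: (-pair[1], pair[0]))

# Hypothetical sample collection.
documents = {
    "d1": "The bank approved the loan. The bank raised its rates.",
    "d2": "A river bank is a poor place to keep a loan agreement.",
    "d3": "Passenger transportation schedules for the spring.",
}
ranking = rank_by_frequency(documents, "bank loan")
print(ranking)
```

Document d2 is ranked as relevant although it concerns a river bank, which is exactly the weakness described above: frequency alone says nothing about what a word means to the person asking.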
It should be noted that in practice the user has to handle both structured and full-text information. Different software and different processing standards cause a great deal of inconvenience. There is therefore a need for a software product that combines the features of a structured DBMS and a DRS.
PCBIRS = DBMS + DRS
Let us consider how the end user's information problem is addressed in the PCBIRS system.
The analytical retrieval system PCBIRS (see “PC World” 12/97, p. 54; “PC World” 8/99, p. 76; http://www.chat.ru/~birs) is designed for processing large amounts of information on a PC. The information may consist of an arbitrary set of text documents or of structured databases. For both types PCBIRS provides fast contextual search and a uniform standard of analysis and subsequent processing, based on automatic lexical indexing of texts and data structures.
In technical terms the PCBIRS is a management system for documentary and factographic databases. Unlike conventional (relational) DBMS's, the PCBIRS is designed for the processing of information that could be considered as raw material or primary source for further processing.
Information from external sources may be stored in user databases. Such databases can contain either documents or only references to them.
The PCBIRS implements the processing standards related to the search and analysis of information represented as free texts. The main idea is to store and manipulate information in the format in which it is received and to derive the required data structures dynamically.
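The idea of deriving data structures dynamically from messages stored as received can be sketched as follows. This is an illustration under stated assumptions, not PCBIRS code: the sample messages, the regular expressions `DATE` and `AMOUNT`, and the function `derive_rows` are all hypothetical.

```python
import re

# Messages are stored exactly as received; no schema is imposed at load time.
messages = [
    "Invoice 17 dated 12.03.1999: payment of 1500 rubles received.",
    "Meeting protocol of 05.04.1999, no financial matters discussed.",
    "On 20.04.1999 a payment of 320 rubles was returned.",
]

# Hypothetical patterns; a user with a different task would supply others.
DATE = re.compile(r"\b(\d{2}\.\d{2}\.\d{4})\b")
AMOUNT = re.compile(r"\b(\d+) rubles\b")

def derive_rows(texts):
    """Build (message index, date, amount) rows on demand from raw texts."""
    rows = []
    for i, text in enumerate(texts):
        date = DATE.search(text)
        amount = AMOUNT.search(text)
        # Only messages that contain both items contribute a row.
        if date and amount:
            rows.append((i, date.group(1), int(amount.group(1))))
    return rows

rows = derive_rows(messages)
print(rows)
```

The table exists only for the duration of the task; changing the patterns yields a different structure over the same unmodified messages.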
The main task that PCBIRS addresses is the transparent processing of large amounts of full-text or structured information, since any relatively large information batch is a "black box" for the user. A request may return no answer. Does this mean that the requested information is not available, or that the request was formulated incorrectly? Is it possible to get an idea of the contents of the whole body of information without reading all of it? It is. For this purpose PCBIRS has a number of tools that allow the user to extract and analyse the data contained in the retrieved documents.
One such tool is support for dynamic conceptual classifiers.
The user can formulate an area of interest as lists of concepts, which can be accumulated and applied to different information batches to obtain an overview of their content.
Since dynamic classification usually takes only a few seconds even for large amounts of information, the user may freely change point of view, fix various sets of documents and apply other lists of concepts to them, which makes it a powerful search navigation tool. On the one hand, before formulating search requests the user can see the contents of the information resource; on the other hand, having made a search request, the user can understand what the retrieved set contains.
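A dynamic classifier of this kind can be pictured as a set of named term lists applied to a document batch at query time. The sketch below is a simplified assumption of how such a classifier might behave; the function `classify`, the concept lists and the documents are invented for illustration and are not taken from PCBIRS.

```python
import re

def tokenize(text):
    """Return the set of lowercase words in a text."""
    return set(re.findall(r"\w+", text.lower()))

def classify(documents, concepts):
    """Map each concept name to the sorted ids of documents mentioning any of its terms."""
    doc_tokens = {doc_id: tokenize(text) for doc_id, text in documents.items()}
    overview = {}
    for name, terms in concepts.items():
        wanted = {t.lower() for t in terms}
        overview[name] = sorted(
            doc_id for doc_id, tokens in doc_tokens.items() if tokens & wanted
        )
    return overview

# Hypothetical document batch and user-defined concept lists.
documents = {
    1: "Decree on customs tariffs for imported equipment.",
    2: "Letter about the delivery schedule of laboratory equipment.",
    3: "Law on taxes and customs duties.",
}
concepts = {
    "customs": ["customs", "tariffs", "duties"],
    "logistics": ["delivery", "schedule"],
    "finance": ["taxes", "tariffs"],
}
overview = classify(documents, concepts)
print(overview)
```

Because the concept lists are kept separate from the documents, the same lists can be reapplied to any other batch, and a new list changes the view without reloading anything.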
Apart from the usual highlighting of keywords, PCBIRS allows document browsing modes to be varied flexibly and different lists to be built for presenting information, which saves the time needed to decide on the relevance of the retrieved documents.
Thus, navigation through big information resources from a PC is no longer an insoluble problem. The main thing is to design methods which will enable the user to describe his/her area of interest with the use of natural language.
PCBIRS allows the user to build secondary databases from original sources and to prepare, on this basis, different applications that can be published on CD-ROM or on the Internet.
Some features of PCBIRS 3.2
The basic unit of information storage and search is a document. Each document is indexed by its contents: a dictionary of keywords is created automatically to ensure a high search speed regardless of the database size. The average speed of indexing in batch mode on a 133 MHz Pentium computer is 10 MB/min; searching for specific terms in hundreds of megabytes of information takes 0.01-0.1 s (depending on how frequently the term occurs). The size of the dictionary is 1-3% for databases with over 100 megabytes of Russian text. Each database may store up to 500,000 documents; the database size is up to 4 GB if texts are stored directly in the database, or up to 16 GB if only references to original sources are stored. Several databases may be integrated at a logical level as sub-bases and appear to the user as a single base. The number of sub-bases per base is not limited.
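The text does not describe the internal layout of the keyword dictionary. A common way to obtain search times that depend little on collection size is an inverted index, sketched below as an assumption rather than as the actual PCBIRS data structure; `build_index`, `search` and the sample documents are hypothetical.

```python
import re
from collections import defaultdict

def build_index(documents):
    """Map every word to the set of ids of documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in re.findall(r"\w+", text.lower()):
            index[word].add(doc_id)
    return index

def search(index, term):
    """Look a term up in the dictionary; no document text is scanned."""
    return sorted(index.get(term.lower(), set()))

# Hypothetical documents.
documents = {
    1: "Laws and decrees",
    2: "Decrees of 1999",
    3: "Letters and protocols",
}
index = build_index(documents)
print(search(index, "Decrees"))
print(search(index, "and"))
```

The cost of indexing is paid once at load time; each later lookup is a single dictionary access followed by reading the stored list of document ids.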
The PCBIRS operates on IBM-compatible computers with processors higher than 486 under the Windows operating system (Windows 95, 98, NT). At least 32 MB of memory is needed for continuous indexing of information batches of up to 2 GB.
Documents
Documents may or may not have an internal structure of information storage; they can include:
- Arbitrary texts;
- Specific data;
- Graphic images;
- Tables;
- Push buttons;
- Arguments of external functions.
Information Sources
- Interactive text input from the keyboard or from external applications (file annotation technology);
- Batch import from files:
  - marked lines format (PCBIRS input format);
  - PCBIRS databases;
  - dBASE IV-format databases;
  - file indices of text fragments (PMT files);
  - linearly tagged texts in ASCII or ANSI codes;
  - arbitrary text files in ASCII or ANSI codes;
  - HTML files;
- Connection of functions for source reading.
Batch indexing of information files stores either the texts themselves or references to the information sources. When structured documents are loaded, the source structure can be flexibly mapped onto the database structure.
Database Operations
- Database creation and reorganisation in interactive mode;
- Author's rights protection (copyright, permitted operations with information);
- Access level assignment within LAN;
- Connection/disconnection of databases as sub-bases, creation of themes for joint work with a set of databases in a multi-window mode;
- Interactive and batch entry, correction and deletion of documents in the database with immediate or delayed text indexing and control of information entry;
- Export of documents from the database, database merger.
Query Language
- Includes search parameters (words, numbers, dates, special terms) and conditions of their search (AND, OR, NO, XOR, NOT logical operators; NEAR, CTX, SEGM context operators; search area restriction operators). Terms may be masked on the left, on the right and inside, and both exact and fuzzy variants of words may be specified;
- Search by numbers and numeric intervals includes automatic conversion of measurement units (meters, centimeters, etc.).
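The exact semantics of the PCBIRS operators are not given here. As a rough sketch of how logical and proximity operators of this general kind are commonly evaluated, the Python example below uses set operations over a positional index. The helper names (`build_positional_index`, `docs`, `near`) and the reading of NEAR as "within a given number of words" are assumptions.

```python
import re
from collections import defaultdict

def build_positional_index(documents):
    """Map word -> {doc_id: [positions of the word in that document]}."""
    index = defaultdict(dict)
    for doc_id, text in documents.items():
        for pos, word in enumerate(re.findall(r"\w+", text.lower())):
            index[word].setdefault(doc_id, []).append(pos)
    return index

def docs(index, term):
    """Set of ids of documents that contain the term."""
    return set(index.get(term.lower(), {}))

def near(index, a, b, distance):
    """Documents where a and b occur within `distance` words of each other."""
    hits = set()
    for doc_id in docs(index, a) & docs(index, b):
        pa = index[a.lower()][doc_id]
        pb = index[b.lower()][doc_id]
        if any(abs(x - y) <= distance for x in pa for y in pb):
            hits.add(doc_id)
    return hits

# Hypothetical documents.
documents = {
    1: "The decree sets customs tariffs for equipment.",
    2: "Customs officers inspected the cargo; tariffs were not discussed.",
    3: "Equipment delivery schedule.",
}
index = build_positional_index(documents)
all_ids = set(documents)

both = docs(index, "customs") & docs(index, "tariffs")        # AND
either = docs(index, "decree") | docs(index, "schedule")      # OR
without = all_ids - docs(index, "customs")                    # NOT
one_only = docs(index, "equipment") ^ docs(index, "customs")  # XOR
adjacent = near(index, "customs", "tariffs", 1)               # NEAR
print(both, either, without, one_only, adjacent)
```

Document 2 satisfies the AND query but not the NEAR query, which shows why context operators narrow a result set that plain logical operators leave too broad.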
Query Processing
- Query formation and execution in interactive mode;
- Direct query term selection in the browsed documents, from the database dictionary and frequency dictionary;
- Selection and execution of queries stored in the database;
- Creation and execution of permanent queries as concepts and hierarchical classes in the catalogues. Dynamic connection of query catalogues to the database and batch processing of queries;
- Query storage and refinement, automatic re-execution of queries during background import of documents;
- Simultaneous search in several databases distributed across a LAN.
Analysis of Search Results
- Presentation of documents in different forms: moving and switching of fragments, report tables, compact text, monitoring of relevant phrases from the retrieved document list, special forms, letterheads;
- Highlighting of matched query terms in any form of document representation;
- Browsing of the document frequency dictionary with filtering by subject and fast scrolling to relevant terms;
- Fast scrolling to document fragments;
- Document browsing in sorted order with selection of sorting criteria;
- Supplementary lists for quick browsing;
- Virtual lists of data from document texts, their representation as diagrams, charts, tables with possible calculation and additional selection of documents depending on the data contained in them;
- Storing document sets for subsequent queries;
- Inversion of document lists to browse the documents irrelevant to the query;
- Removal of classes and separate irrelevant documents from the list of search results;
- Output of documents and virtual lists to printer, files, export to other WINDOWS applications;
- Input restriction by copyright.
Applications Development
- Joining databases into a theme for co-processing in a multi-window mode;
- Content linking of documents from one or several databases (dynamic hypertext);
- Transfer of parameters from document texts and start-up of external DOS or WINDOWS applications;
- A tool for programming and interactive debugging (BML built-in macro language) of developed applications;
- Attachment of additional commands to PCBIRS interface push buttons, creation of navigators for information search and processing, preparation of electronic publications on CD-ROM or on the Internet.
About the Author
Vitaly Yu. Bugayev – Doctor of Physical and Mathematical Science, Head of the Laboratory under the Science Research Physical and Technical and Radio Technical Metrology Institute, author of the PCBIRS information and search analytical system (see “A PC World” 12/1997, “A PC World” 8/1999), http://www.chat.ru/~birs, tel.: +7 (095) 535-08-52, e-mail: bgv@ftri.extech.msk.su