Automatic Classification of Documents

Grid Computing Applications

Where does automatic classification come from?

I am extracting this text from one of the Google results. It focuses on why we may need to classify documents and how that is currently being done. It seems to me that the potential of this idea could be explored in an individual thesis as well. Anyhow, find some time to have a look at it:

Long before the Web hit the world, the problems of finding the right data, hidden knowledge or undiscovered correlations in a big data collection had been addressed. Different approaches were used for different types of data: Knowledge Discovery in Databases (KDD) techniques were used if the data were stored in a database as a complete set of entries, while Information Retrieval (IR) techniques were used for heterogeneous, unordered sets of data, like the WWW. In contrast to the deterministic approach of KDD (exact match), IR states probabilities. KDD uses a monothetic classification, IR a polythetic one, which means that the attributes of an object can make it a member of more than one class, so more than one fitting attribute is needed to decide which class the object should be put in. In KDD the retrieval language is artificial, i.e. a combination of Boolean operators, whereas the retrieval language of IR is natural and vaguer, which makes it more fault tolerant. The answer in IR will always be a probability and not an exact match.
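
To make the exact-match versus graded-answer contrast above a bit more concrete, here is a minimal sketch in Python. The function names and the scoring rule are my own assumptions for illustration; in particular, the "fraction of query terms found" score is only a crude stand-in for the probabilities a real IR system would estimate.

```python
def exact_match(record, required):
    """KDD-style Boolean query: every required attribute must match exactly."""
    return all(record.get(key) == value for key, value in required.items())


def match_score(doc_terms, query_terms):
    """IR-style graded answer: fraction of the query terms found in the document.

    Assumption: this ratio only illustrates a graded answer; it is not a
    calibrated probability.
    """
    query = set(query_terms)
    if not query:
        return 0.0
    return len(query & set(doc_terms)) / len(query)


# A database row either satisfies the query or it does not.
row = {"lang": "en", "year": 2003}
print(exact_match(row, {"year": 2003}))   # True
print(exact_match(row, {"year": 2004}))   # False

# A document can match a vague query partially.
print(match_score(["grid", "computing", "jobs"], ["grid", "documents"]))  # 0.5
```

The Boolean query tolerates no deviation, while the graded score still returns something useful when only part of the query fits.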

Data mining is the process of finding patterns in the data and is relevant to both mentioned techniques. There are different functions of data mining:

  • sequence analysis for time-dependent data,
  • link analysis, which tries to determine relations between the data,
  • summarization, which describes subsets of the datasets by computing the median and standard deviation,
  • classification, which maps datasets to one or more predefined classes, and
  • cluster analysis, which, similar to classification, groups datasets into clusters by means of similarity metrics.
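
Of the functions listed above, summarization is the easiest to show in a few lines. This is a small sketch using Python's standard `statistics` module; the `summarize` helper and the sample document lengths are made up for illustration.

```python
import statistics


def summarize(values):
    """Describe a subset of the data by its size, median and standard deviation."""
    return {
        "count": len(values),
        "median": statistics.median(values),
        # Population standard deviation; use statistics.stdev for a sample estimate.
        "stdev": statistics.pstdev(values),
    }


# Hypothetical document lengths (in words) for one subset of a collection.
lengths = [120, 80, 100, 90, 110]
print(summarize(lengths))
```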

Artificial neural networks are one means of representing the findings of data mining. For Internet data, Information Retrieval techniques fit well, especially if the data mining consists of probabilistic search methods, classification and clustering.

Before data mining techniques can be applied, the documents have to be pre-processed to create a document index that records the frequency of each term and a weighting for where it occurs in the document (title, abstract, keywords or body of the text). Different methods of term indexing exist:

  • Signature files, containing hashes of all words of the document, normally combined with a list of stop words ("a", "the", etc.) and a lemmatisation of the words, i.e. reducing all different word forms either to the stem of the word or to their basic form (nominative singular and infinitive). Signature files can be stored separately from the documents and can be searched very fast.
  • Inversion, a representation of the documents as an index of words, each with pointers to the documents where it can be found. Almost all commercial search engines use inversion together with Boolean retrieval. This technique is fast as well but needs considerably more storage than signature files.
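
The two indexing methods above can be sketched roughly as follows. This is only an illustration under my own assumptions: the stop word list is tiny, lemmatisation is skipped entirely, the signature sets one CRC32-hashed bit per word in a 64-bit integer, and all function names are hypothetical.

```python
import re
import zlib
from collections import defaultdict

STOP_WORDS = {"a", "an", "the", "of", "and", "is", "in", "to"}  # illustrative only


def tokenize(text):
    """Lowercase, split on non-letters and drop stop words (no lemmatisation)."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]


def build_inverted_index(docs):
    """Inversion: map each term to the sorted ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


def boolean_and(index, *terms):
    """Boolean retrieval: ids of the documents that contain every given term."""
    postings = [set(index.get(term, ())) for term in terms]
    return sorted(set.intersection(*postings)) if postings else []


def term_bit(term, bits=64):
    """Hash a term to a single bit position (CRC32 is an arbitrary choice here)."""
    return 1 << (zlib.crc32(term.encode("utf-8")) % bits)


def signature(text, bits=64):
    """Signature file entry: OR together the hashed bit of every word."""
    sig = 0
    for term in tokenize(text):
        sig |= term_bit(term, bits)
    return sig


def may_contain(sig, term, bits=64):
    """False means the term is definitely absent; True may be a false positive."""
    return sig & term_bit(term, bits) != 0


docs = {
    1: "The grid schedules computing jobs",
    2: "Documents are classified by the grid",
    3: "Neural networks cluster documents",
}
index = build_inverted_index(docs)
print(boolean_and(index, "grid", "documents"))   # [2]
print(may_contain(signature(docs[1]), "grid"))   # True
```

The trade-off from the text shows up here as well: the signature is one small integer per document, while the inverted index keeps a posting list for every distinct term.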

Instead of using stop word lists, the frequency of a word in a document (term frequency) can be used for selecting the words. It has been shown that a medium frequency points to the highest significance (high-frequency words are stop words, and low frequency means low significance).
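
A rough sketch of that idea: count the words and keep only those in a middle frequency band. The cut-off values `low` and `high` are arbitrary assumptions for this toy example; the text does not say how they should be chosen.

```python
import re
from collections import Counter


def select_mid_frequency_terms(text, low=2, high=4):
    """Keep the terms whose count in the document lies between low and high."""
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    return sorted(term for term, count in counts.items() if low <= count <= high)


text = "the the the the the the grid grid grid cluster cluster once"
print(select_mid_frequency_terms(text))   # ['cluster', 'grid']
```

Here "the" is dropped for being too frequent and "once" for being too rare, without any stop word list.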

Another method is to use thesauri for structuring words according to their meaning. A thesaurus is a collection of relevant terms ordered in a hierarchy of superordinate and subordinate concepts and homonyms. Thesauri are very costly to maintain and need special knowledge.

There are also different methods for the actual clustering after the documents have been pre-processed, i.e. different statistical methods to determine the similarity of documents. It has been shown that specific techniques are appropriate for specific subjects. Artificial neural networks are a non-linear extension of such classical statistical methods and are especially useful for classification and clustering. They can be self-organizing and self-expanding via different techniques, to name only two here: the Self Organizing Map (SOM), for visualizing relations in a multidimensional space, and Competitive Learning, which is used to minimize errors and maximize entropy to increase the number of possible matchings.
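
As a very small illustration of Competitive Learning, here is a plain winner-take-all sketch: for every input vector, the closest unit "wins" and is moved a little towards that vector. This is my own simplified version, not the method of any particular paper. It does no entropy maximization, it has no neighbourhood function (so it is not a SOM), and starting the units on the first few input vectors is an assumption made to keep the example deterministic.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def winner(units, vector):
    """Index of the unit closest to the vector (the first one wins ties)."""
    return min(range(len(units)), key=lambda i: sq_dist(units[i], vector))


def competitive_learning(vectors, n_units=2, rate=0.5, epochs=20):
    """Winner-take-all training: only the winning unit moves towards each input."""
    # Simplification: start the units on the first n_units input vectors.
    units = [list(v) for v in vectors[:n_units]]
    for _ in range(epochs):
        for v in vectors:
            w = winner(units, v)
            units[w] = [u + rate * (x - u) for u, x in zip(units[w], v)]
    return units


# Hypothetical two-term frequency vectors forming two obvious groups.
vectors = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
units = competitive_learning(vectors)
print([winner(units, v) for v in vectors])   # [0, 0, 0, 1, 1, 1]
```

After training, each unit sits near one group of vectors, so asking which unit wins for a document vector gives a simple cluster assignment.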

~Vishal

posted on Sunday, October 26, 2003 8:49 PM by sapna