Technique for classifying documents
Have a look at this description of classification; I think it gives us a clue about the direction we need to take:
The classification can be carried out with respect to the content of the documents to be classified, and is done in a two-step process:
- retrieval of keywords in the documents;
- classification of documents using a hierarchy of concepts.
Keywords in a document may be retrieved by counting the absolute and relative frequencies of a large number of character bigrams and extracting those that best characterize the document under consideration. This step can thus be considered typical of vector space representations.
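To make the bigram-counting step concrete, here is a rough sketch in Python. The description does not say how the "best" bigrams are chosen, so the scoring used here (how far a bigram's relative frequency in the document exceeds its relative frequency across the whole corpus) is only my assumption, and the function names are made up for illustration.

```python
from collections import Counter


def bigram_counts(text):
    """Absolute frequencies of character bigrams (lowercased, whitespace collapsed)."""
    normalized = " ".join(text.lower().split())
    return Counter(normalized[i:i + 2] for i in range(len(normalized) - 1))


def relative_frequencies(counts):
    """Turn absolute counts into relative frequencies that sum to 1."""
    total = sum(counts.values())
    return {bg: n / total for bg, n in counts.items()} if total else {}


def characteristic_bigrams(doc, corpus, top_k=5):
    """Return the top_k bigrams that are most over-represented in doc.

    Assumption: "best characterization" is approximated by the difference
    between the document's and the corpus's relative frequency.
    """
    doc_rel = relative_frequencies(bigram_counts(doc))
    corpus_counts = Counter()
    for other in corpus:
        corpus_counts.update(bigram_counts(other))
    corpus_rel = relative_frequencies(corpus_counts)
    scored = {bg: f - corpus_rel.get(bg, 0.0) for bg, f in doc_rel.items()}
    # Sort by descending score, breaking ties alphabetically for determinism.
    return sorted(scored, key=lambda bg: (-scored[bg], bg))[:top_k]
```

Each document then becomes a vector of bigram frequencies, which is what makes this step a vector space representation. Other weightings, such as TF-IDF, would fit the same structure.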
The second step, however, uses a semantic hierarchy on the keywords to obtain a hierarchical classification of the document set itself. This step is therefore typical of structured concept representations.
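A minimal sketch of the hierarchy step might look like the following. Everything here is hypothetical: the tiny `PARENT` hierarchy, the function names, and the assumption that the keywords from the first step have already been mapped to concept names in the hierarchy. A document is simply filed under every concept on the path from each of its keywords up to the root.

```python
from collections import defaultdict

# Hypothetical concept hierarchy: each concept maps to its parent (None marks the root).
PARENT = {
    "science": None,
    "physics": "science",
    "biology": "science",
    "quantum": "physics",
    "genetics": "biology",
}


def path_to_root(concept, parent=PARENT):
    """List of concepts from the root down to the given concept."""
    path = []
    while concept is not None:
        path.append(concept)
        concept = parent[concept]
    return path[::-1]


def classify(doc_keywords, parent=PARENT):
    """Map each concept to the set of documents filed under it.

    doc_keywords: dict of document id -> list of keywords.
    Keywords that are not in the hierarchy are ignored.
    """
    tree = defaultdict(set)
    for doc_id, keywords in doc_keywords.items():
        for keyword in keywords:
            if keyword in parent:
                for concept in path_to_root(keyword, parent):
                    tree[concept].add(doc_id)
    return dict(tree)
```

For example, `classify({"d1": ["quantum"], "d2": ["genetics"]})` files both documents under "science" but only "d1" under "physics", so the classification of the documents mirrors the hierarchy of the keywords.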
Combining these two approaches to classify a textual database by the semantic content of its documents has two advantages: it makes use of computationally efficient tools through the vector representation, and it integrates a large amount of semantic information through the pre-existing keyword hierarchy.
Regards
~Vishal