Need a customer (1 or 2 people). Sit down with them and have them ask us questions. Will help us shape our requirements and make our product customer focused.
First priority is a meeting with Prof. Williams
Need to enforce our own process.
Tighten requirements, estimate lines of code, timeline...
eprojects.com? Web based project management. Need to find one like it.
We need a requirements document - scenarios, web shots, storyboards, etc.
Ladder diagrams and such
Waterfall
optimal to get something simple running by mid-January, forcing two cycles.
Excel works for schedule updating
first time user - installation of web services
1. Web services at local machine
-some classification technique
- updates index every day in local directory
-send information to server
-frequency of update has to be decided
2. Searching through the grid
-user can connect to any machine in the grid
-machine should have web services running on it
-user selects a classification technique already implemented and offered through web services
-user specifies a document name or search parameters
-user searches throughout the grid and gets all the search results
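A rough sketch of the local side of workflow 1 above, assuming a file's modification time is enough to detect changes between two daily runs. The names `classify`, `scan_directory` and `build_update` are hypothetical, the toy classifier only looks at the file extension, and actually sending the payload to the server is left out.

```python
import os
import tempfile


def classify(path):
    """Placeholder classifier (assumption): label a file by its extension."""
    ext = os.path.splitext(path)[1].lower()
    return {".txt": "text", ".tex": "paper"}.get(ext, "unknown")


def scan_directory(root):
    """Return {relative_path: modification_time} for every file under root."""
    snapshot = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            snapshot[os.path.relpath(full, root)] = os.path.getmtime(full)
    return snapshot


def build_update(old_snapshot, new_snapshot):
    """Compare two snapshots and build the payload to send to the server."""
    changed = {
        path: classify(path)
        for path, mtime in new_snapshot.items()
        if old_snapshot.get(path) != mtime
    }
    removed = sorted(set(old_snapshot) - set(new_snapshot))
    return {"changed": changed, "removed": removed}


if __name__ == "__main__":
    # Small demo on a throwaway directory: the first run reports every file.
    with tempfile.TemporaryDirectory() as root:
        with open(os.path.join(root, "notes.txt"), "w") as handle:
            handle.write("grid computing notes")
        print(build_update({}, scan_directory(root)))
```

How often `build_update` runs is the "frequency of update" question from the list; the sketch itself does not care.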
Architecture in notes...
What are the key components of grid computing?
http://www-106.ibm.com/developerworks/grid/library/gr-overview/
-sapna
This is what Mohan and I discussed:
We build a web service that can classify all text-based documents in a given directory. The critical inputs to the web service are a remote machine name, the documents directory, and permission to access it. As a service, we will go through all the documents, classify them, assign some sort of index, restructure the directory based on the classification, and store the indexes on the server side. More importantly, we will add that machine to our grid and install a web service on it, so that later changes to the documents are picked up and the updated indexes are sent to the server. In a way, this will be an application of grid computing and web services. Classification techniques and algorithms need to be studied and explored in detail. Whenever a machine in the grid drops off the network, its indexes will not be available to users of the web services, whether at the server or at any other connected machine. A user can also use these web services to search for and locate a document based on a keyword or a classification technique implemented in the service. Classification techniques will be implemented as plug-ins, so that new ones can be added and current ones can be modified later to be more efficient and effective, if needed.
We are planning to meet tomorrow at 10am in the M.Engg. lab to explore further by starting a group discussion among ourselves. If we can all make it, let us meet there; otherwise let us pick some other date and time. It will not be feasible for all of us to meet every time, and that should be fine.
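One possible way to do the plug-in idea, sketched with a simple registry. Everything here is an assumption for illustration: the `register` decorator, the technique names, and the two toy classifiers are placeholders for whatever techniques we end up studying.

```python
from typing import Callable, Dict

# Registry of classification plug-ins: technique name -> function(text) -> label.
CLASSIFIERS: Dict[str, Callable[[str], str]] = {}


def register(name):
    """Decorator that adds a classifier function to the registry."""
    def decorator(func):
        CLASSIFIERS[name] = func
        return func
    return decorator


@register("length")
def by_length(text):
    """Toy technique: split documents into short and long ones."""
    return "short" if len(text.split()) < 50 else "long"


@register("keyword")
def by_keyword(text):
    """Toy technique: look for a few hand-picked marker words."""
    lowered = text.lower()
    if "abstract" in lowered:
        return "research paper"
    if "experience" in lowered:
        return "resume"
    return "other"


def classify(text, technique):
    """Run the plug-in the user selected; unknown names raise KeyError."""
    return CLASSIFIERS[technique](text)


if __name__ == "__main__":
    print(sorted(CLASSIFIERS))
    print(classify("Abstract: we study grid computing.", "keyword"))
```

Adding a new technique then only means writing one more decorated function; the service code that calls `classify` stays the same.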
~Vishal
Using grid computing to speed up searching
Issue of security and privacy? Do we want to lock these files from other people or just have it all open?
Go to website, submit a folder of documents (the documents you want archived). Press the “archive” button. Install some piece of code which is a web service.
When you go to search on the webpage, the program uses the grid of computers to search different parts of the index. Possibly using WSDL (publishing) and SOAP (messaging).
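A minimal sketch of searching different parts of the index in parallel, assuming each machine holds its own slice of the index. The grid is faked with an in-memory dict (`NODES` and the node names are made up); in the real thing `search_node` would presumably be a SOAP call to the web service on that machine.

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-in for the grid (assumption): each node holds part of the index,
# mapping a keyword to the documents stored on that node.
NODES = {
    "node-a": {"grid": ["a/intro.txt"], "soap": ["a/ws.txt"]},
    "node-b": {"grid": ["b/survey.txt"]},
    "node-c": None,  # None models a machine that has left the network
}


def search_node(name, keyword):
    """Search one node's partition; an offline node contributes nothing."""
    partition = NODES[name]
    if partition is None:
        return []
    return [(name, doc) for doc in partition.get(keyword, [])]


def grid_search(keyword):
    """Query all nodes in parallel and merge the partial results."""
    names = sorted(NODES)
    with ThreadPoolExecutor(max_workers=len(names)) as pool:
        partials = list(pool.map(lambda n: search_node(n, keyword), names))
    return [hit for partial in partials for hit in partial]


if __name__ == "__main__":
    print(grid_search("grid"))
```

Results come back tagged with the node name, which also touches the "where are the documents stored?" question: the index can say which machine to fetch a document from.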
Another issue to deal with: Where are the documents stored? Somewhat like P2P.
I am extracting this text from one of the Google results. It focuses on why we may need to classify documents and how it is being done currently. It seems to me that this idea has enough potential to be explored in individual theses as well. Anyhow, find some time to have a look at it:
Long before the Web hit the world, problems of finding the right data, hidden knowledge or undiscovered correlations in a big data collection had been addressed. For different types of data, different approaches were used: Knowledge Discovery in Databases (KDD) techniques were used if the data were stored in a database as a complete set of entries, Information Retrieval (IR) techniques were used for heterogeneous unordered sets of data, like the WWW. In contrast to the deterministic approach of KDD (exact match), IR rather states probabilities. KDD uses a monothetic classification, IR rather a polythetic one, which means attributes of an object can make the object a member of more than one class. Thus you need more than one fitting attribute to decide to which class the object should be put. In KDD the retrieval language is artificial, i.e. a combination of Boolean operators, whereas the retrieval language of IR is natural and vaguer, which makes it more fault tolerant. The answer will always be a probability and not an exact match.
Data mining is the process of finding patterns in the data and is relevant to both mentioned techniques. There are different functions of data mining:
- sequence analysis for time dependent data,
- link analysis which tries to determine relations between the data,
- summarization which describes subsets of the datasets by computing the median and standard deviation,
- classification which map datasets to one or more predefined classes and
- cluster analysis which, similar to classification, groups datasets into clusters, by means of similarity metrics
Artificial neural networks are a means of representing findings of data mining. For Internet data Information Retrieval techniques fit well, especially if the data mining consists of probabilistic search methods, classification and clustering.
Before data mining techniques can be applied the documents have to be pre-processed to create a document index with frequency and weightings of the location of the document terms (title, abstract, keywords or body of the text). Different methods of term indexing exist:
- Signature Files containing hashes of all words of the document normally combined with a list of stop words ("a", "the", etc.) and a lemmatisation of the words. This is done by reducing all different word forms either to the stem of a word or to their basic form (nominative singular and infinitive). Signature files can be stored separately from the documents and can be searched very fast.
- Inversion, a representation of the document as index of the words and pointers to the document where it can be found. Almost all commercial search engines use inversion together with Boolean retrieval. This technique is fast as well but needs considerably more storage memory than signature files.
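To make the inversion idea concrete, here is a small sketch of an inverted index with Boolean AND retrieval. The stop word list, the tokenizer and the function names are assumptions for illustration; there is no lemmatisation here, so "services" and "service" would count as different terms.

```python
from collections import defaultdict

STOP_WORDS = {"a", "the", "of", "and", "is"}


def tokenize(text):
    """Lowercase, split on whitespace, strip punctuation, drop stop words."""
    words = (w.strip(".,;:!?()\"'").lower() for w in text.split())
    return [w for w in words if w and w not in STOP_WORDS]


def build_inverted_index(documents):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def boolean_and(index, *terms):
    """Return ids of documents containing every query term."""
    postings = [index.get(t.lower(), set()) for t in terms]
    return set.intersection(*postings) if postings else set()


if __name__ == "__main__":
    docs = {
        "d1": "The grid is a network of machines.",
        "d2": "Web services and the grid.",
        "d3": "Web search engines.",
    }
    index = build_inverted_index(docs)
    print(sorted(boolean_and(index, "web", "grid")))
```

The index stores a set of pointers per word, which is where the extra storage compared with signature files comes from.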
Instead of using stop word lists, the frequency of a word in a document (term frequency) can be used for selecting the words. It has been shown that a medium frequency points to the highest significance (high-frequency words are stop words and low frequency means low significance).
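A tiny sketch of selecting words by term frequency instead of a stop word list. The cut-offs `low` and `high` are arbitrary assumptions; in practice they would have to be tuned to document length.

```python
from collections import Counter


def select_terms(text, low=2, high=4):
    """Keep words whose count falls in the medium band [low, high]."""
    counts = Counter(text.lower().split())
    return sorted(w for w, c in counts.items() if low <= c <= high)


if __name__ == "__main__":
    sample = "the the the the the grid grid grid index index search"
    print(select_terms(sample))
```

In the sample, "the" is dropped for being too frequent and "search" for being too rare, which leaves "grid" and "index".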
Another method is to use thesauri for structuring words according to their meaning. A thesaurus is a collection of relevant terms ordered in a hierarchy of superordinate and subordinate concepts and homonyms. Thesauri are very costly to maintain and need special knowledge. There are also different methods for the actual clustering after the documents have been pre-processed, i.e. the different statistical methods to determine similarity of documents. It has been shown that specific techniques are appropriate for specific subjects. Artificial neural networks are a non-linear extension of such classical statistical methods and are especially useful for classification and clustering. They can be self-organizing and self-expandable via different techniques, to name only two here: Self Organizing Map (SOM) for visualizing relations in a multidimensional space and Competitive Learning, which is used to minimize errors and maximize entropy to enhance the amount of possible matchings.
~Vishal
Using measures of association between keywords based on their frequency of co-occurrence (that is, the frequency with which any two keywords occur together in the same document), documents can be effectively classified. It has been shown that such related words can be used effectively to improve recall, that is, to increase the proportion of the relevant documents which are retrieved. Interestingly, the early ideas are still being developed and many automatic methods of characterisation are based on this kind of research work.
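As an illustration of a co-occurrence measure, the sketch below scores keyword pairs with the Dice coefficient, 2 * n_ab / (n_a + n_b), where n_ab is the number of documents containing both keywords. The choice of Dice is an assumption; the text above does not say which association measure is meant.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_scores(doc_keywords):
    """Dice coefficient for every keyword pair seen in the same document."""
    single = Counter()  # number of documents containing each keyword
    pair = Counter()    # number of documents containing both keywords
    for keywords in doc_keywords:
        unique = sorted(set(keywords))
        single.update(unique)
        pair.update(combinations(unique, 2))
    return {
        (a, b): 2 * n / (single[a] + single[b])
        for (a, b), n in pair.items()
    }


if __name__ == "__main__":
    docs = [["grid", "soap"], ["grid", "soap", "wsdl"], ["grid", "index"]]
    for keywords, score in sorted(cooccurrence_scores(docs).items()):
        print(keywords, round(score, 2))
```

Strongly associated keywords could then be added to a query, which is one way related words can raise recall.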
The term information structure (for want of better words) covers specifically a logical organisation of information, such as document representatives, for the purpose of information retrieval. The development in information structures has been fairly recent. The main reason for the slowness of development in this area of information retrieval is that for a long time no one realised that computers would not give an acceptable retrieval time with a large document set unless some logical structure was imposed on it. In fact, owners of large data-bases are still loath to try out new organisation techniques promising faster and better retrieval. The slowness to recognise and adopt new techniques is mainly due to the scantiness of the experimental evidence backing them. The earlier experiments with document retrieval systems usually adopted a serial file organisation which, although it was efficient when a sufficiently large number of queries was processed simultaneously in a batch mode, proved inadequate if each query required a short real time response. The popular organisation to be adopted instead was the inverted file. By some this has been found to be restrictive. More recently experiments have attempted to demonstrate the superiority of clustered files for on-line retrieval.
The organisation of these files is produced by an automatic classification method. Good and Fairthorne were among the first to suggest that automatic classification might prove useful in document retrieval. Not until several years later were serious experiments carried out in document clustering (Doyle; Rocchio). All experiments so far have been on a small scale. Since clustering only comes into its own when the scale is increased, it is hoped that this book may encourage some large scale experiments by bringing together many of the necessary tools.
Evaluation of retrieval systems has proved extremely difficult. Senko in an excellent survey paper states: 'Without a doubt system evaluation is the most troublesome area in ISR ...', and I am inclined to agree. Despite excellent pioneering work done by Cleverdon et al. in this area, and despite numerous measures of effectiveness that have been proposed (see Robertson for a substantial list), a general theory of evaluation had not emerged.
Enjoy
~Vishal
Look at this concept of classification; I guess it gives us a clue about the direction we need to take:
The classification can be carried out with respect to the content of the documents to be classified, and is done in a two-steps process:
- retrieval of keywords in the documents;
- classification of documents using a hierarchy of concepts.
The keyword retrieval in a document may be obtained by counting absolute and relative frequencies of a large number of character bigrams, to extract the ones that offer the best characterization for the document considered. That step can thus be considered as typical of vector space representations.
The second step however uses a semantic hierarchy on keywords in order to obtain a hierarchical classification of the set of documents itself. This step is therefore typical of structured concept representation.
The combination of these two approaches in order to classify a textual database with respect to the semantical content of the documents has the advantage of making use of computationally efficient tools through the vector representation, and integrating much semantic information with the pre-existing hierarchy of keywords.
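A rough sketch of the two steps, under heavy assumptions: step one computes relative character-bigram frequencies (the vector representation), step two walks a keyword up a concept hierarchy. The hierarchy contents are made up, and how the bigram profile is turned into keywords is not described in the excerpt, so the two functions are kept independent here.

```python
from collections import Counter


def bigram_profile(text):
    """Relative frequency of each character bigram in the text."""
    cleaned = "".join(ch for ch in text.lower() if ch.isalpha() or ch == " ")
    bigrams = [cleaned[i:i + 2] for i in range(len(cleaned) - 1)]
    counts = Counter(bigrams)
    total = sum(counts.values())
    return {bg: n / total for bg, n in counts.items()} if total else {}


# Hypothetical concept hierarchy: child concept -> parent concept.
HIERARCHY = {
    "soap": "web services",
    "wsdl": "web services",
    "web services": "distributed computing",
    "grid": "distributed computing",
}


def concept_path(keyword, hierarchy=HIERARCHY):
    """Walk from a keyword up to the root of the hierarchy."""
    path = [keyword]
    while path[-1] in hierarchy:
        path.append(hierarchy[path[-1]])
    return path


if __name__ == "__main__":
    print(bigram_profile("abab"))
    print(concept_path("soap"))
```

Documents whose keywords share a long path prefix from the root would end up in the same branch of the classification.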
Regards
~Vishal
http://www.gridcomputingplanet.com/features/article.php/11170_2234691_1
Describes the potential of grid computing.
How grid computing enables companies to share computing resources on demand.
http://www.gridcomputingplanet.com/features/article.php/3291_1140791
Describes .Net being fused with grid computing. This was the first generic system.
Good quote:
"Hook enough computers together and what do you get? A new kind of utility that offers supercomputer processing on tap."
Open Group Standard Architecture, large-scale simulation, processes held in a sandbox and need to be run on the machine garden, utilizing some of the available CPU, got this environment set up.
Application of “gridGarden.NET” architecture
Monte Carlo Simulation
Password cracking
How do we arrange molecules life trapped into different enclosing boxes?
Peer to peer computing like Napster
Here is another idea: the department has its own admission process and wants to make it an online process.
Processes having no interaction among them
Processes having a simple communication among them
There is a complex dependency among different processes.
One machine can be configured to have different domains and in a way a number of processors can be launched on same machine.
A database search technique through different machines
Automatic classification of documents
That's what utility computing will be like — much more like a grid of interlinked resources than a single outsourced data center. Though in terms of complexity, it will be more akin to the telecoms network, which carries many different forms of traffic and services, than the homogenized electricity grid. That's appropriate, of course, because it already runs on top of the telecoms network — the difference being that the utility computing grid will run at the level of applications infrastructure instead of down on the wire.
Adopting utility computing isn't going to be an either-or proposition. Most companies will start with selected services that meet particular needs. In some cases, just as householders in California today can generate their own solar power and sell it back to the grid, some customers will double-up as providers, offering resources where they have special skills or excess capacity. Naturally, all these arrangements will depend on a robust set of standards that guarantee interoperability, which is why it is so important to reach industry-wide agreement on the Web services stack.
With the emergence of this universal computing grid, there will be a need for a new breed of utility provider. Although companies and individuals will retain some local computing capabilities, there will be a growing opportunity for trusted neutral parties to operate and manage shared resources within the grid. This is a role that telecoms and hosting providers are well-suited to fulfill, but few have gone beyond the simplistic vision of outsourcing existing computing assets to a shared location.
Last week, U.K.-based telecoms carrier BT became one of the first to show it has grasped what the role of a utility provider needs to be in the Web services era. It announced a portfolio of hosted Web services offerings that fill the layers in between Internet infrastructure and business functionality. These new layers of application infrastructure will form the core of the utility computing grid.
http://www.aspnews.com/analysis/analyst_cols/article/0,2350,9921_1473711,00.html
how do we offer the functionality “classification of documents” using web services?
Web Services, Linux and grid computing are among the technologies researchers are using to develop a system of predicting and improving warning times for weather emergencies such as tornadoes and flash floods.
The Engineering Research Center for Collaborative Adaptive Sensing of the Atmosphere (CASA) says it hopes to overcome a shortcoming of existing weather forecasting and warning systems, which have difficulty monitoring conditions close to the ground because of the curvature of the Earth.
CASA plans to get around the curvature issue and obstructions such as mountains by setting up dense networks of short-range radars that are physically smaller than most existing meteorological radars, says UMass, which is a leader in the CASA project. The radars can be mounted on top of buildings or cell phone towers and supported by PC-sized computers - as opposed to today's high-power radars that often have 30-foot antennas and supercomputer accompaniments.
http://www.nwfusion.com/news/2003/1006radar.html
~Vishal
Sapna
Brandon
Vishal
Mohan
next meeting : Monday 10-11 am
see something similar at http://www.dspace.org/
DSpace is a product jointly developed by MIT Library and HP. It is a J2EE product that can create a digitally durable repository, and documents can be classified using it.
But it is not a web service; in terms of technology it uses J2EE and third-party software like PostgreSQL, Tomcat and Ant.
We want to provide a web service that can classify documents. Users submit all sorts of text documents; we go through all the newly submitted documents and try to work out what each document is. It may be classified as a resume, a research paper with specific research details, class notes or whatever. At the same time, the user will be able to search through documents by classification name or group.
we need to explore it further and add more details ....