Vacation: Week 3

Downloaded and tried out the GATE system as well as the OpenNLP libraries. GATE can save its output as XML or preserve the source format (for example, plain text input stays plain text). I also found examples online of how to use the OpenNLP libraries through the CLI (command-line interface). This looks useful: I should be able to call CreateProcess() in VC++ to start a console application that runs OpenNLP tools such as the PosTagger and Tokenizer. OpenNLP, like GATE, is considered state-of-the-art in NLP.

Am also looking into the possibility of compiling Python modules (from the Confucius project) into exe files, which could then be invoked from VC++. Some useful modules include the stemmer, word_similarity_measurement, subj_obj_parse_tree and verb_usage_forms.

Have come up with a flowchart of how the API project should work. (Note to self: one key concept is that instead of picking 3 different poetry lines from the database based on the 3 most important words in the input, pick only one poetry line. Then run the chosen line through the algorithm again to pick the next line, and so on. This should help ensure coherence between consecutive lines in the selected poetry.) During the meeting with Dr. Newton, Hoomun and Ken, the following pointers were obtained:
1) Verify whether the verb_usage_forms module has been tested extensively and whether it follows established methodologies/algorithms.
2) Before the output step (the selection of poetry lines), insert one more module, which should be a learning system. In summary, the input should go through NLP modules, then a learning system, so that more dynamic and meaningful output can be generated.
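The chained selection idea from the note to self can be sketched roughly as follows. This is only a minimal illustration in Python, not the project's actual algorithm: the keyword-overlap scoring, the tiny stopword list and the function names (keywords, pick_next_line, chain_lines) are all assumptions made for the example, standing in for the real NLP modules and the learning system.

```python
import re

# Tiny illustrative stopword list (an assumption, not the project's real list).
STOPWORDS = {"the", "a", "an", "of", "and", "in", "to", "is", "on", "at", "by", "i"}


def keywords(text):
    """Return the set of lowercased content words in text.

    A crude stand-in for the NLP modules (tokenizer, stemmer, etc.).
    """
    words = re.findall(r"[a-z']+", text.lower())
    return {w for w in words if w not in STOPWORDS}


def pick_next_line(query, candidates):
    """Return the candidate sharing the most keywords with query, or None."""
    query_words = keywords(query)
    best, best_score = None, 0
    for line in candidates:
        score = len(query_words & keywords(line))
        if score > best_score:
            best, best_score = line, score
    return best


def chain_lines(user_input, database, count=3):
    """Pick one line for the input, then feed each chosen line back in as the next query."""
    remaining = list(database)
    chosen = []
    query = user_input
    for _ in range(count):
        line = pick_next_line(query, remaining)
        if line is None:
            break  # nothing in the database relates to the current query
        chosen.append(line)
        remaining.remove(line)  # never pick the same line twice
        query = line            # the chosen line drives the next selection
    return chosen
```

Calling chain_lines(user_input, database) returns up to three lines in which each line after the first is selected by its overlap with the line before it rather than with the original input, which is where the hoped-for coherence would come from. The learning system suggested in the meeting would presumably replace or re-rank the simple overlap score in pick_next_line.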

Also discussed the project with Helga and obtained the contact details of Dr. Ng from SOC, an expert in NLP. Helga also suggested using the SVM in Weka for the learning system portion of the project. Will download Weka and explore its functionality, then figure out how to incorporate it into the project.