In this layer we primarily process the output of the Semantic Relatedness Measure and, eventually, the Word Sense Disambiguation Layer to summarize the topic of each poem line and input message. We use these topics to select the best possible lines faster. Before we go further, let us define the basic categories into which individual words will be summarized.

5.1 Possible Word Topics (Super Senses)

We use the 26 noun lexicographer classes, or super senses (26), to classify nouns. They are shown below.

Table 4: Noun Super Senses
noun.Tops, noun.act, noun.animal, noun.artifact, noun.attribute, noun.body, noun.cognition, noun.communication, noun.event, noun.feeling, noun.food, noun.group, noun.location, noun.motive, noun.object, noun.person, noun.phenomenon, noun.plant, noun.possession, noun.process, noun.quantity, noun.relation, noun.shape, noun.state, noun.substance, noun.time
In the same vein, there are 15 verb hierarchies (26), as noted below.

Table 5: Verb Super Senses
verb.body, verb.change, verb.cognition, verb.communication, verb.competition, verb.consumption, verb.contact, verb.creation, verb.emotion, verb.motion, verb.perception, verb.possession, verb.social, verb.stative, verb.weather
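These classes are exposed directly by WordNet interfaces. As a minimal sketch, here is how the lexicographer class of each sense of a word can be looked up with NLTK's WordNet corpus reader (the report does not name the toolkit it used, so NLTK here is an assumption):

# Minimal sketch: looking up the lexicographer class (super sense) of each
# sense of a word with NLTK's WordNet interface.
from nltk.corpus import wordnet as wn

def super_senses(word, pos):
    """Return the lexicographer class of every sense of `word` for the given POS."""
    return [s.lexname() for s in wn.synsets(word, pos=pos)]

print(super_senses("dog", wn.NOUN))  # e.g. ['noun.animal', 'noun.person', ...]
print(super_senses("run", wn.VERB))  # e.g. ['verb.motion', ...]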
For verbs that are based on nouns, like "to love" (based on "love"), we use the noun super senses to derive their topic when the verb does not belong to any of the above. Auxiliary verbs are not classified in this way. The above is not a random set of categories; they are in fact among the topmost levels in the WordNet "is a" hierarchy.

Pronouns are classified by the gender they refer to, the count they refer to (singular/plural), the case used (nominative like 'I' / possessive like 'my' / objective like 'me' / reflexive like 'myself' / demonstrative like 'that') and finally whether they refer to a location, a person or any other object. Prepositions are not classified, as they do not play a big role in topic summarization. Connector words are classified as coordinating connectors (and, or), negation connectors (not), opposition connectors (whereas), cause and effect connectors (because), time connectors (when, after), condition connectors (if), thought connectors (um) and exclamation connectors (oh).

5.2 Process of classifying words into their super sense

For this part we simply utilize the "is a" hierarchy of WordNet (hypernyms) for nouns and verbs to find whether a particular word belongs to a particular super sense. For nouns this is straightforward, as we just traverse up through the hypernyms; for verbs, adjectives and adverbs it is not, as we have to determine the noun form when there is no direct hypernym.

5.2.1 Process for a verb / adjective / adverb with no direct hypernym

For this part we utilize the other relations in WordNet, such as "also" (related to), "deri" (is a derived form of) and "pert" (pertains to), to figure out the noun form, and then we do a simple recursive hypernym check to see whether it belongs to a particular super sense or not; a sketch of this lookup is given below.
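The following sketch again assumes NLTK's WordNet interface, where the "deri" and "pert" links are exposed on lemmas and the "also" link on synsets; this mapping is our assumption, not the report's code.

# Sketch of 5.2.1: derive a noun super sense for a verb/adjective/adverb
# that has no usable hypernym, via WordNet's "deri", "pert" and "also" links.
from nltk.corpus import wordnet as wn

def derived_noun_super_sense(synset):
    # "deri" (derived form) and "pert" (pertains to) links live on lemmas in NLTK.
    for lemma in synset.lemmas():
        for rel in lemma.derivationally_related_forms() + lemma.pertainyms():
            if rel.synset().pos() == wn.NOUN:
                return rel.synset().lexname()
    # "also" (related-to) links live on the synset itself.
    for also in synset.also_sees():
        if also.pos() == wn.NOUN:
            return also.lexname()
    return None

print(derived_noun_super_sense(wn.synset("love.v.01")))
# e.g. 'noun.feeling' or 'noun.person', depending on link order in WordNet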
5.2.2 Process for pronouns (resolution)

For this part we utilize a simpler, modified Hobbs approach (27) to resolve each pronoun into the proper noun it refers to; the hypernyms and instances of this noun then effectively add to the noun's details. We chose a simpler approach mainly because, as Peter Bosch notes in "Some Good Reasons for Shallow Pronoun Processing" (28), the success rates of algorithms for pronoun resolution seem to demonstrate the overall futility of close linguistic analysis for efficient machine processing of natural language: very simple algorithms that take as their input no more than a parse tree easily yield a success rate of well over 80%.

Our algorithm is simply a probabilistic model that matches the gender and count of a pronoun with a similar noun. We look first through the subject of that particular sentence and then through the objects for such nouns. Also, if a similar pronoun occurs earlier, we try to match that with this pronoun, and so on; a minimal sketch of this matching is given below.
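In this sketch the Candidate type and its feature fields are hypothetical, introduced only to illustrate the agreement check described above:

# Hypothetical sketch of the gender/count matching; not the report's code.
from dataclasses import dataclass

@dataclass
class Candidate:
    word: str
    gender: str  # 'm', 'f' or 'n'
    number: str  # 'sg' or 'pl'
    role: str    # 'subject' or 'object'

def resolve_pronoun(gender, number, candidates):
    """Pick an agreeing noun, preferring the sentence subject over objects."""
    agreeing = [c for c in candidates
                if c.gender == gender and c.number == number]
    agreeing.sort(key=lambda c: 0 if c.role == "subject" else 1)
    return agreeing[0].word if agreeing else None

# "Mary gave John a book. She smiled."  ->  'she' resolves to Mary.
sentence = [Candidate("Mary", "f", "sg", "subject"),
            Candidate("John", "m", "sg", "object"),
            Candidate("book", "n", "sg", "object")]
print(resolve_pronoun("f", "sg", sentence))  # Mary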
For these we use the "antonym" (ants) relation in WordNet to derive the proper hypernyms.

5.2.4 Process for other word types

For other words we essentially use the dictionary definition directly to classify them.

5.2.5 Words for which no hypernym was found

We assign such words to the default "other" category, but we do not include this super sense in the calculations below; rather, we use the word directly.

5.3 Topic Summarizing Process

The process of summarizing begins with understanding the senses of each word, followed by classification of each word according to its super sense (or, when one does not exist, a specific topic). We then rank these super senses / specific topics by their salience measure in the content. Before we go on, let us define some terms used below. The term frequency is a measure of how often a particular term occurs in a poem line. For instance, if the word 'beautiful' appeared twice in a poem line consisting of 10 words, the term frequency of the word 'beautiful' would be 0.2. Mathematically, this can be described as follows:

Equation 9: Term Frequency

$$\mathrm{tf}(t, d) = \frac{n_{t,d}}{\sum_{k} n_{k,d}}$$

where $n_{t,d}$ is the number of occurrences of the term $t$ being considered in document $d$, and the denominator is the total number of occurrences of all terms in document $d$.

The salience measure is a combination of the following factors, with the associated weight shown in brackets (derived from experiments to produce the best results); a sketch of the weighted combination follows the list.

1. As Subject Term Frequency (15/140)
2. Co-occurrence Frequency (20/140)
3. Term Frequency (30/140)
4. Gloss Overlap Factor (20/140)

   Equation 10: Gloss Overlap Factor

5. Hypernym Relatedness Factor (35/140)

   Equation 11: Hypernym Relatedness Factor

   For example, this works out to (1 - (2/3)) * 2^2 = 4/3.

6. POS Frequency Factor (5/140)
7. POS Salience (15/140): the decreasing order of importance of word types is as follows:
   1. Nouns (including pronouns and prepositions) (salience = 1)
   2. Verbs (0.85), adverbs (0.7) and adjectives (0.7)
   3. Auxiliary verbs (0.4), connectors (0.3) (excluding purely coordinating and super-sense-less ones, for which it is 0.1) and prepositions (0.2)
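A minimal sketch of the weighted combination follows. The weights are the ones quoted above; since the formulas behind Equations 10 and 11 are not reproduced in this text, only term frequency (Equation 9) is implemented and the other factor scores are assumed to be precomputed values in [0, 1]:

# Sketch: combining the seven weighted factors into one salience score
# and keeping the five best super senses.

WEIGHTS = {
    "subject_tf": 15 / 140,
    "cooccurrence": 20 / 140,
    "tf": 30 / 140,
    "gloss_overlap": 20 / 140,
    "hypernym_relatedness": 35 / 140,
    "pos_frequency": 5 / 140,
    "pos_salience": 15 / 140,
}

def term_frequency(term, tokens):
    """Equation 9: tf(t, d) = n_{t,d} / total number of terms in d."""
    return tokens.count(term) / len(tokens)

def salience(factor_scores):
    """Weighted sum of the factor scores, each assumed to lie in [0, 1]."""
    return sum(WEIGHTS[name] * score for name, score in factor_scores.items())

def top_super_senses(scored, k=5):
    """Keep the k super senses with the highest salience."""
    return sorted(scored.items(), key=lambda kv: kv[1], reverse=True)[:k]

line = "the beautiful rose is a beautiful thing in a garden".split()
print(term_frequency("beautiful", line))  # 2 occurrences / 10 words = 0.2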
We sort the content by these salience measures and combine them to summarize the content into the five best super senses, each associated with a salience measure. As an example, for the phrase "If you have to be a man, do this", the first super sense is person. Please note that the terms topic and super sense are used interchangeably here. For each of the top five super senses we also assign a hyponym (found by traversing down the "is a" hierarchy) that has the highest frequency and is not a pronoun; this will be called the specific topic. This effectively accomplishes a basic topic modeling structure. In fact, it is quite interesting to note that this approach to topic modeling is the exact reverse of what several others have done (9): they apply other topic modeling measures to do Word Sense Disambiguation, while we do the reverse. The following is a brief overview of the process we just studied.

Figure 6: Topic Summarization Process

We did some preliminary testing on a set of 20 text files, each with at least a few hundred words. The results are shown below.

Table 6: Accuracy of Topic Summarizing
The above results are as expected: as the salience measure decreases, the accuracy of the corresponding super sense being a true topic also decreases. This accuracy in topic summarizing is good enough for us to apply recombination algorithms based on it. We shall discuss that in detail in the following chapter.