Week 11 Sem2

Started integration with the main system.

Made the topic modeling layer compatible with the envisioned recombination algorithms.

In this layer we primarily process the output of the Semantic Relatedness Measure and, eventually, the Word Sense Disambiguation layer to summarize the topic of each poem line and input message. We use these topics to select the best possible lines faster. Before we go further, let me define the basic categories into which individual words will be summarized.

5.1      Possible Word Topics (Super senses)

We use the 26 noun lexicographer classes (26), or super senses, to classify nouns. They are shown below.

Table 4: Noun Super Senses

Act            Food           Possession
Animal         Group          Process
Artifact       Location       Quantity
Attribute      Motive         Relation
Body           Object         Shape
Cognition      Other          State
Communication  Person         Substance
Event          Phenomenon     Time
Feeling        Plant

In the same vein, there are 15 verb hierarchies (26), as noted below.

Table 5: Verb Super Senses

Body           Consumption    Perception
Change         Contact        Possession
Cognition      Creation       Social
Communication  Emotion        Stative
Competition    Motion         Weather

For those verbs that are based on nouns, like 'to love' (based on 'love'), we use the noun super senses to derive their topic when they do not belong to any of the above. Auxiliary verbs are not classified this way.

 

The above is not a random set of categories; these are in fact among the topmost levels in the WordNet "is a" hierarchy.

 

Pronouns are classified by the gender they refer to, the number they refer to (singular/plural), the case used (nominative like 'I', possessive like 'my' or 'mine', objective like 'me', reflexive like 'myself', demonstrative like 'that') and finally whether they refer to a location, a person or any other object.

 

Prepositions are not classified in detail, as they do not play a big role in topic summarization. Connector words are classified as coordinating connectors (and, or), negation connectors (not), opposition connectors (whereas), cause-and-effect connectors (because), time connectors (when, after), condition connectors (if), thought connectors (um) and exclamation connectors (oh).
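The connector classification can be captured with a simple lookup; a minimal sketch, where the word sets contain only the examples mentioned above (a real implementation would use fuller lists):

```python
# Connector categories as described above; the word sets are only the
# illustrative examples from the text.
CONNECTOR_CLASSES = {
    "coordinating": {"and", "or"},
    "negation": {"not"},
    "opposition": {"whereas"},
    "cause and effect": {"because"},
    "time": {"when", "after"},
    "condition": {"if"},
    "thought": {"um"},
    "exclamation": {"oh"},
}

def classify_connector(word):
    """Return the connector category of `word`, or None if it is not a connector."""
    w = word.lower()
    for category, words in CONNECTOR_CLASSES.items():
        if w in words:
            return category
    return None
```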

5.2      Process of classifying words into their super sense

For this part we just utilize the "is a" hierarchy of WordNet (hypernyms) for nouns and verbs to find whether a particular word belongs to a particular super sense. For nouns this is straightforward, as we just traverse up through the hypernyms; for verbs, adjectives and adverbs it is not, as we have to determine the noun form when there is no direct hypernym.
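The noun case can be sketched as a walk up the hypernym chain. The tiny hand-coded hierarchy below stands in for a real WordNet query and is purely illustrative:

```python
# Tiny illustrative "is a" hierarchy; a real implementation would query WordNet.
HYPERNYMS = {
    "man": "male person",
    "male person": "person",
    "woman": "female person",
    "female person": "person",
}

# A small subset of the noun super senses from Table 4.
SUPER_SENSES = {"person", "animal", "artifact"}

def super_sense(word):
    """Traverse up the hypernym chain until a super sense is reached."""
    node = word
    while node is not None:
        if node in SUPER_SENSES:
            return node
        node = HYPERNYMS.get(node)
    return "other"  # no hypernym chain found (see Section 5.2.5)
```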

5.2.1    Process for verb / adjective / adverb with no direct hypernym

For this part we utilize the other relations in WordNet, like "also" (related to), "deri" (is a derived form of) and "pert" (pertains to), to figure out the noun form, and then do a simple recursive hypernym check to see whether it belongs to a particular super sense.

5.2.2    Process for pronouns (Resolution)

For this part we utilize a simpler, modified Hobbs approach (27) to resolve pronouns into the nouns they refer to; the hypernyms and instances of that noun then effectively add to the pronoun's details. We chose a simpler approach mainly because, as Peter Bosch notes in "Some Good Reasons for Shallow Pronoun Processing" (28), the success rates of pronoun-resolution algorithms seem to demonstrate the overall futility of close linguistic analysis for efficient machine processing of natural language: very simple algorithms that take as their input no more than a parse tree easily yield a success rate of well over 80%.

 

Our algorithm is simply a probabilistic model that matches the gender and number of a pronoun with a compatible noun. We look through the subject of the sentence first, and then the objects, for such nouns. If a similar pronoun occurs earlier, we also try to match it with this pronoun, and so on.
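A minimal sketch of that matching, assuming each candidate antecedent is already annotated with gender and number and ordered subjects-first (the data structures and names here are illustrative, not the actual implementation):

```python
def resolve_pronoun(pronoun_gender, pronoun_number, candidates):
    """Return the first preceding noun whose gender and number match.

    `candidates` is a list of (word, gender, number) tuples ordered with
    the sentence subject first, then the objects, as described above.
    """
    for word, gender, number in candidates:
        if gender == pronoun_gender and number == pronoun_number:
            return word
    return None

# "Mary met the boys; she smiled."  -> 'she' is feminine singular.
candidates = [("Mary", "f", "sg"), ("boys", "m", "pl")]
```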

5.2.3    Process for negation

For these we use the “antonym” (ants) relation in WordNet to derive the proper hypernyms.
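A minimal sketch of that step, with a hand-coded antonym map standing in for WordNet's "ants" relation and a tiny "is a" hierarchy standing in for the hypernym lookup (all names and entries are illustrative):

```python
# Illustrative stand-ins for WordNet's antonym ("ants") and "is a" relations.
ANTONYMS = {"hate": "love", "sadness": "joy"}
HYPERNYMS = {"love": "emotion", "joy": "emotion", "emotion": "feeling"}
SUPER_SENSES = {"feeling"}  # subset of Table 4

def negated_super_sense(word):
    """Map a negated word onto the super sense of its antonym's hypernyms."""
    node = ANTONYMS.get(word, word)
    while node is not None:
        if node in SUPER_SENSES:
            return node
        node = HYPERNYMS.get(node)
    return "other"
```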

5.2.4    Process for other word types

For other words we pretty much use the dictionary definition directly to classify them.

5.2.5    Words for which no hypernym was found

We assign such words to the default "other" super sense, but we do not include this super sense in the calculations below; instead we use the word directly.

5.3      Topic Summarizing Process

The process of summarizing begins with understanding the sense of each word, followed by classification of each word according to its super sense (or, when one doesn't exist, a specific topic). Then we rank these super senses / specific topics by their salience measure in the content.

5.3.1    Salience measure of a particular super sense (or specific topic without a super sense) in a document

Before we go on, let us define some terms used below. The term frequency is a measure of how often a particular term occurs in a poem line. For instance, if the word 'beautiful' appeared twice in a poem line consisting of 10 words, the term frequency of 'beautiful' would be 0.2. Mathematically, this can be described as follows:

 

Equation 9: Term Frequency

    tf(t, d) = n_{t,d} / Σ_k n_{k,d}

where n_{t,d} is the number of occurrences of the term t being considered in document d, and the denominator Σ_k n_{k,d} is the total number of occurrences of all terms in document d.
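The term-frequency computation can be sketched as:

```python
from collections import Counter

def term_frequency(term, tokens):
    """Fraction of the tokens in a poem line that equal `term`."""
    return Counter(tokens)[term] / len(tokens)

# 'beautiful' occurs twice in this 10-word line, so its tf is 0.2.
line = "beautiful skies and beautiful seas make the day so fine".split()
```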

The salience measure is a combination of the following factors (the weight associated with each, derived experimentally to produce the best results, is shown in brackets).

1.       As-Subject Term Frequency (15/140)
Term frequency of the index word as a subject. (Subjects are found by a very simple parse-tree algorithm; the logic is that the subject is usually the first noun of a sentence, barring any connector words, adjectives or prepositions. A statement lacks a subject if it starts with a verb, but even then we add a "you" in front of the imperative sentence to make it the subject.)

2.       Co-occurrence Frequency (20/140)
Summation of the term frequencies of the words to which the index word is syntactically related, over the total number of words. (Syntactic relatedness covers the following relationships: adjective to noun, adverb to verb, verb to noun [subjects and objects], connectors and prepositions to nouns, etc.)

3.       Term Frequency (30/140)
Simple term frequency of the index word.

4.       Gloss Overlap Factor (20/140)
How closely related are the meanings of the index word to the words in the content? This needs to increase exponentially with each word match, hence we square the number of overlaps. We also remove 'a', 'an', 'the' and other such common words from the calculation. Hence this is measured by the following formula:

Equation 10: Gloss Overlap Factor

    gloss_overlap = (number of overlapping non-common words between the index word's gloss and the content)²

5.       Hypernym Relatedness Factor (35/140)
How closely are the hypernym chains related? Take, for example's sake, the two words 'man' and 'woman'. This should be a combined factor of the number of levels that match (and this factor should contribute exponentially) and the depth of the lowest hypernym match, as this indicates further closeness.

Their hypernym chains can be assumed to be as follows:
a) man -> male person -> person…
b) woman -> female person -> person…
If the hypernym chains exist, as in the above case, and the belonging category (i.e. super sense) of the two words is the same (person), the formula is as follows:

Equation 11: Hypernym Relatedness Factor

    hypernym_relatedness = (1 - d/L) × 2^m

where d is the depth of the lowest hypernym match, L is the length of the hypernym chain and m is the number of matching levels.

So in the above example, with d = 2, L = 3 and m = 2, this works out to (1 - (2/3)) × 2² = 4/3.


In cases where the hypernym chain does not exist (as for a pronoun, etc.) but the belonging category does, then on a category match the relatedness is set to a default of 0.3 or 0.2, depending on whether the non-index word has a hypernym chain (available implies 0.3). If the index word has a hypernym chain, the above formula is always used.

6.       POS Frequency Factor (5/140)
How frequent is this word within its particular part of speech? This is measured by this factor.

7.       POS Salience (15/140): The decreasing order of importance of word types is as follows.

1.       Nouns (including pronouns & prepositions) (salience = 1)

2.       Verbs (0.85), Adverbs (0.7) and Adjectives (0.7)

3.       Auxiliary Verbs (0.4), Connectors (0.3) (except purely coordinating and super-sense-less ones, for which it is 0.1) and Prepositions (0.2)
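Putting the seven factors together, the final salience is a weighted sum using the bracketed weights. The sketch below assumes that combination (function and factor names are illustrative) and includes a hypernym-relatedness helper consistent with the worked man/woman example:

```python
# Experimentally derived weights from the list above (each over 140).
WEIGHTS = {
    "subject_tf": 15 / 140,
    "cooccurrence": 20 / 140,
    "tf": 30 / 140,
    "gloss_overlap": 20 / 140,
    "hypernym_relatedness": 35 / 140,
    "pos_frequency": 5 / 140,
    "pos_salience": 15 / 140,
}

def hypernym_relatedness(depth_of_match, chain_length, matching_levels):
    """Equation 11: (1 - d/L) * 2^m."""
    return (1 - depth_of_match / chain_length) * 2 ** matching_levels

def salience(factors):
    """Weighted combination of the seven factor values (keys as in WEIGHTS)."""
    return sum(WEIGHTS[name] * value for name, value in factors.items())

# man/woman: lowest match at depth 2 in a chain of length 3, 2 matching levels
# -> (1 - 2/3) * 2**2 = 4/3, as in the worked example.
```

Because the weights sum to 140/140, a word whose seven factor values are all 1.0 gets an overall salience of exactly 1.0, which keeps the measure on a convenient scale.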

 

We sort the content by the above salience measures and combine them to summarize the content into the 5 best super senses, each associated with a salience measure. As an example, for the phrase "If you have to be a man, do this", the first super sense is person. Please note that the terms topic and super sense are used interchangeably here.

 

For each of the top 5 super senses we also assign the hyponym (found by traversing down the "is a" hierarchy) that has the highest frequency and is not a pronoun. This will be called the specific topic. This accomplishes a basic topic-modeling structure. In fact, it is interesting to note that this approach to topic modeling is the exact reverse of what several others have done (9): they apply other topic-modeling measures to do Word Sense Disambiguation, while we do the reverse.

The following is a brief overview of the process we just studied.

Figure 6: Topic Summarization Process

5.3.2    Accuracy

We did some preliminary testing on a set of 20 text files, each with at least a few hundred words. Here are the results.

Table 6: Accuracy of Topic Summarizing

nth Topic    Accuracy %
1            90.00%
2            80.00%
3            65.00%
4            50.00%
5            35.00%

 

The above results are expected; as the salience measure goes down, the accuracy of the corresponding super sense being a topic is reduced.

 

This accuracy in topic summarizing is good enough for us to apply recombination algorithms based on it. We shall discuss that in detail in the following chapter.