Week 7 Sem 2

Did some research on how to use WSD and analysed the efficiency of the available measures.

As mentioned earlier, Word Sense Disambiguation (WSD) is the process of assigning a meaning to a word based on the context in which it occurs (21).

4.1      Preliminary knowledge required for Word Sense Disambiguation

Before proceeding, we must clarify the relationship between words, concepts and word senses. Concepts are real-world objects, notions or ideas that are represented in text or speech by words. For example, the concept of a stone would be represented by the word stone; it may equally be represented by the word rock or pebble. Hence, the same concept may be represented by different words. The concept need not be a solid object: it could be an abstract thing, like art, or an action, like walking. Each such concept has a number of words that represent it. Conversely, a single word may represent a number of concepts. For example, the word bank could mean the financial institution concept or the river bank concept. The different meanings of a word are known as word senses. A word may therefore correspond to a number of concepts, while a word sense corresponds to only a single concept. Because of this equivalence between word senses and concepts, in this thesis we use the terms concept and word sense interchangeably.

 

The reader may have noted that in our previous examples, dealing with semantic relatedness and human perception of relatedness, we have been showing words in the text but then referring to the relatedness of concepts or of word senses. This is simply a convenience. In general, measures of relatedness focus on underlying concepts or word senses, and when a pair of words are presented in work such as this, it is  presumed that the obvious senses are intended (unless otherwise indicated). For example, if we present the example pair bank and money, we would presume that we are referring to the most obvious senses, which are likely the financial ones.

 

Concepts may be related by virtue of being similar in appearance or function (e.g., chair and stool), being opposites (e.g., rich and poor), or having some other defined relationship between them, such as one concept being a type of or a part of the other. For example, a penguin is a type of bird, and a beak is a part of a penguin. However, two concepts can be related without there being a defined relationship between the words. They may be considered highly related if they just happen to occur together very often in everyday use. For example bargain and book may be considered related, simply because the collocation bargain book is relatively common.

 

In this thesis we will review various measures that are based on the lexical database WordNet (18). Some measures extend WordNet’s content with statistical information derived from large corpora (research done by others), while others employ different aspects of WordNet’s structure and content. Before we delve into the intricacies of measuring semantic relatedness, a quick introduction to WordNet is in order, since all the measures described here are based on WordNet.

4.2      WordNet

The creators of WordNet refer to it as an electronic lexical database (18). This is a convenient but over-simplified description of a very complex resource. WordNet can be visualized as a large graph or semantic network, where each node of the network represents a real world concept. For example, the concept could be an object like a house, or an entity like a teacher, or an abstract concept like art, and so on.

 

Every node consists of a set of words, each representing the real world concept associated with that node. Thus, each node is essentially a set of synonyms that represent the same concept. For example, the concept of a car may be represented by the set of words {car, auto, automobile, motorcar}. Such a set, in WordNet terminology, is known as a synset. A synset also has associated with it a short definition or description of the real world concept, known as a gloss. The synsets and the glosses in WordNet are comparable to the content of an ordinary dictionary.

 

What sets WordNet apart from dictionaries such as freedictionary.com (which was used in the previous system) is the presence of links between the synsets – the edges of the graph mentioned above. Each link or edge describes a relationship between the real world concepts represented by the synsets it connects. For example, relationships of the form “a vehicle is a kind of conveyance” or “a spoke is a part of a wheel” are defined. Other relationships include is opposite of, is a member of, causes, pertains to, etc. Table 1 lists the relations defined in WordNet.

Table 1: Relations between synsets defined in WordNet

Relation       | Description                  | Example
Hypernym       | is a generalization of       | furniture is a hypernym of chair
Hyponym        | is a kind of                 | chair is a hyponym of furniture
Troponym       | is a way to                  | amble is a troponym of walk
Meronym        | is part/substance/member of  | wheel is a (part) meronym of a bicycle
Holonym        | contains part                | bicycle is a holonym of a wheel
Antonym        | opposite of                  | ascend is an antonym of descend
Attribute      | attribute of                 | heavy is an attribute of weight
Entailment     | entails                      | ploughing entails digging
Cause          | cause to                     | to offend causes to resent
Also see       | related verb                 | to lodge is related to reside
Similar to     | similar to                   | dead is similar to assassinated
Participle of  | is participle of             | stored (adj) is the participle of "to store"
Pertainym of   | pertains to                  | radial pertains to radius

 

The network of relations between word senses present in WordNet encodes a vast amount of human knowledge. This gives rise to a great number of possibilities in the way it could be used for various Natural Language Processing (and other) tasks. Figure 4 focuses on a small portion of the structure of WordNet and illustrates the nodes and edges of the semantic network just described.

Figure 4:  An illustration of synsets and relations in WordNet, taken from Ted Pedersen’s paper (21)

 

The synsets in WordNet are divided into four distinct categories, each corresponding to one of the four parts of speech – nouns, verbs, adjectives and adverbs. Most of the relationships defined between synsets are restricted to a particular part of speech and do not cross part-of-speech boundaries. Exceptions are the ‘pertains to’ and ‘attribute’ relationships that exist between adjectives and nouns. Thus, the set of relations defined on the synsets in WordNet divides them into four almost disjoint regions.

 

One of the relations in WordNet of particular interest to us, because of its structure and its utility in measuring semantic relatedness, is the “is a kind of” relationship, or simply “is a”. This relationship between synsets is restricted to nouns and verbs. It organizes the noun and verb synsets into large hierarchies or trees, each with a single root node. More general concept nodes are ancestors of more specific concept nodes; we say that the more general concepts subsume the more specific concepts. For instance, entity is the most general concept in one of the noun hierarchies and is the root node of that tree. It subsumes other, more specific concepts such as furniture and bicycle, which are lower down in the tree. Similarly, furniture may subsume other concepts such as chair or table. There are 9 such hierarchies among the WordNet nouns, while there are about 628 hierarchies for verbs. The large number of verb hierarchies is due to the fact that they are, on average, much shorter and broader than the noun hierarchies. Each verb hierarchy therefore covers a much smaller portion of the synsets than a noun hierarchy does. This makes the verb hierarchies much less effective in the relatedness measures that we describe later in this section. Figure 5 shows an example of the “is-a” hierarchy in WordNet.

Figure 5: A schematic of the “is-a” hierarchy in WordNet, taken from Ted Pedersen’s paper (21)

For the purpose of our experiments we use WordNet 2.1. We also use the CPAN module from the WordNet::SenseRelate project (22), provided by Ted Pedersen, Siddharth Patwardhan and Satanjeev Banerjee, which enables us to try out the various measures and compare our results. Next we look at the general measures used to quantify semantic relatedness and, eventually, to disambiguate words.
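As an illustration, a minimal usage sketch along the lines of the module's documented synopsis (the constructor options and the example context are illustrative assumptions and may differ from our actual experimental setup):

    use WordNet::QueryData;
    use WordNet::SenseRelate::AllWords;

    # Load the WordNet database (version 2.1 in our experiments).
    my $qd = WordNet::QueryData->new;

    # Build a disambiguator that scores candidate senses with one of the
    # relatedness measures (the adapted Lesk measure here).
    my $wsd = WordNet::SenseRelate::AllWords->new(
        wordnet => $qd,
        measure => 'WordNet::Similarity::lesk',
    );

    # Assign to every word in a short context the sense that is most
    # related to the senses of its neighbouring words.
    my @context = qw/the bank approved the loan/;
    my @results = $wsd->disambiguate(window => 3, context => [@context]);
    print "@results\n";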

4.3      Measuring semantic relatedness between words in a context to disambiguate words

Measuring the semantic relatedness of concepts is an intriguing problem in Natural Language Processing. Various approaches that attempt to approximate human judgment of relatedness have been tried by researchers. In this section we look at a few WordNet–based measures of semantic relatedness that we propose to compare. Approaches to measuring semantic relatedness that we have not experimented with in this thesis are discussed in the section on related work (chapter 8).

 

It is important to note that, although semantic similarity and semantic relatedness are sometimes used interchangeably, the term relatedness is more general than similarity. Budanitsky and Hirst (23) discuss this point and they say that similarity usually refers to concepts that are related because they look alike. For example, table is similar to desk. On the other hand, dissimilar concepts like wheel and spoke may be semantically related. In this thesis, we deal with measures of semantic relatedness.

 

Given the vast store of human knowledge encoded in WordNet, it has been used by many researchers in developing measures of semantic relatedness. Some use only the structure and content of WordNet to measure semantic relatedness. Other approaches combine statistical data from large corpora with the structure of WordNet to give us a score of semantic relatedness. Details of these measures are quoted from “Using Measures of Semantic Relatedness for Word Sense Disambiguation” (21) and related papers and notes (24).

4.3.1    Path length

This measure uses a simple node-counting scheme (path). The relatedness score is inversely proportional to the number of nodes along the shortest path between the synsets. The shortest possible path occurs when the two synsets are the same, in which case the length is 1; the maximum relatedness value is therefore 1. Let c1 and c2 be the two concepts whose semantic relatedness is sought.

Equation 1: Path Length Relatedness measure
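In its standard form (our notation, with shortest_path counting the nodes on the shortest path between the two synsets), the measure is:

\[ \mathrm{rel}_{path}(c_1, c_2) = \frac{1}{\mathrm{shortest\_path}(c_1, c_2)} \]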

4.3.2    Leacock & Chodorow

An intuitive method to measure the semantic relatedness of word senses using WordNet, given its tree-like structure, would be to count up the number of links between the two synsets. The shorter the length of the path between them, the more related they are considered. The Leacock & Chodorow measure does this in WordNet. But it considers only the “is a” hierarchies of nouns in WordNet. Since only noun hierarchies are considered, this measure is restricted to finding relatedness between noun concepts. The noun hierarchies are all combined into a single hierarchy by imagining a single root node that subsumes all the noun hierarchies. This ensures that there exists a path between every pair of noun synsets in this single tree. To determine the semantic relatedness of two synsets, the shortest path between the two in the taxonomy is determined and is scaled by the depth of the taxonomy. The following formula is used to compute semantic relatedness:

Equation 2: Leacock & Chodorow Relatedness measure
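Using the same notation, the standard form of the measure is:

\[ \mathrm{rel}_{lch}(c_1, c_2) = -\log\left(\frac{\mathrm{shortest\_path}(c_1, c_2)}{2D}\right) \]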

Where c1 and c2 represent the two concepts, shortest_path(c1, c2) is the length of the shortest path between the two synsets, and D is the maximum depth of the taxonomy.

 

This method assumes the size or weight of every link in the taxonomy to be equal. This is a false assumption. It is observed that lower down in the hierarchy, concepts that are a single link away are more related than such pairs higher up in the hierarchy. This simple approach, however, does relatively well, despite its lack of complexity.

 

Some highly related approaches attempt to overcome this disadvantage of simple edge counting by augmenting the information present in WordNet with statistical information from large corpora.

4.3.3    Resnik Measure

Statistical information from large corpora is used to estimate the information content of concepts. The idea of information content was introduced by Resnik, in his paper that describes a novel method to compute semantic relatedness.

 

In brief, information content of a concept measures the specificity or the generality of that concept, i.e. how specific to a topic the concept is. For example, a concept like sprinkler is a highly topical concept and would have high information content. On the other hand, a more general concept such as artifact would have much lower information content.

 

Computation of information content is based on the subsuming property of the “is a” hierarchy. Suppose we come across concept c1 in a discourse, and c1 is subsumed by concept c2 in the “is a” hierarchy of WordNet. Then the occurrence of concept c1 in the text implies the occurrence of c2 in the text, since c1 is a kind of c2. For example, suppose we come across the concept chair in a given text. Because chair is a kind of furniture, we can always say that the text contains the concept of furniture. And further, without being wrong, we can say that it speaks of an object. Thus chair, furniture and object all represent the same concept in the text at varying degrees of specificity.

 

To find information content we first compute the frequency of occurrence of every concept in a large corpus of text. Every occurrence of a concept in the corpus adds to the frequency of the concept and to the frequency of every concept subsuming the concept encountered. We note that by this process the root node includes the frequency count of every concept in the taxonomy. It is incremented for every occurrence of every concept in the taxonomy.

 

Information content is defined as the negative logarithm of the probability of occurrence of the concept in a large corpus. Thus, with these frequency counts computed from a large corpus, we arrive at the following formula for information content:

Equation 3: Information Content Definition
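With freq denoting the corpus frequency counts just described, the standard form is:

\[ IC(c) = -\log\left(\frac{freq(c)}{freq(root)}\right) \]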

Where IC(c) is the information content of concept c, root is the root node of the “is a” hierarchy, and freq(c) and freq(root) are the frequency counts of these concepts.

 

Resnik defines the semantic relatedness of two concepts as the amount of information they share in common. He goes on to elaborate that the quantity of information common to two concepts is equal to the information content of their lowest common subsumer – the lowest node in the hierarchy that subsumes both concepts. For example, in figure 5 the lowest common subsumer of carrot and radish is plant root, while that of carrot and birdcage is object.

Equation 4: Resnik Relatedness measure
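In our notation, with lcs denoting the lowest common subsumer, the measure is:

\[ \mathrm{rel}_{res}(c_1, c_2) = IC\bigl(lcs(c_1, c_2)\bigr) \]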

Where IC determines the information content of a concept and lcs(c1, c2) finds the lowest common subsumer of concepts c1 and c2.

The Resnik measure depends completely upon the information content of the lowest common subsumer of the two concepts whose relatedness we wish to measure. It takes no account of the concepts themselves. This leads to somewhat “coarser” relatedness values. For example, the concept pair car and bicycle will have the same measure of semantic relatedness as the pair car and all-terrain bicycle, because both pairs of concepts have the same lowest common subsumer.

 

This measure has a lower bound of 0 and no upper bound.

4.3.4    Jiang and Conrath Measure

A measure introduced by Jiang and Conrath addresses the limitations of the Resnik measure. It incorporates the information content of the two concepts, along with that of their lowest common subsumer. The measure is a distance measure that specifies the extent of unrelatedness of two concepts. It combines features of simple edge counting with those of information content introduced in the Resnik measure. The Jiang–Conrath measure is given by the formula:

Equation 5: Jiang & Conrath Distance measure
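Using the notation introduced above, the standard form of this distance is:

\[ \mathrm{dist}_{jcn}(c_1, c_2) = IC(c_1) + IC(c_2) - 2 \cdot IC\bigl(lcs(c_1, c_2)\bigr) \]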

 

For the purpose of our experiments, to maintain a scale where a value of 0 indicates unrelated concepts, we invert the value to make it a measure of semantic relatedness.

Equation 6: Jiang & Conrath relatedness measure
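The inverted form used as a relatedness score is:

\[ \mathrm{rel}_{jcn}(c_1, c_2) = \frac{1}{\mathrm{dist}_{jcn}(c_1, c_2)} \]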

This measure works only with WordNet nouns, has a lower bound of 0 and no upper bound.

4.3.5    Lin Measure

Another measure, based on information content of concepts, is described by Lin. The measure is given by:

Equation 7: Lin relatedness measure
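In standard form, with the same notation as above, the Lin measure is:

\[ \mathrm{rel}_{lin}(c_1, c_2) = \frac{2 \cdot IC\bigl(lcs(c_1, c_2)\bigr)}{IC(c_1) + IC(c_2)} \]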

This measure has a lower bound of 0 and an upper bound of 1.

4.3.6    Wu & Palmer Measure

The Wu & Palmer measure (wup) calculates relatedness by considering the depths of the two synsets in the WordNet taxonomies, along with the depth of their lowest common subsumer (LCS). The formula is:

Equation 8: Wu & Palmer relatedness measure
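With depth denoting the depth of a synset in the taxonomy, the standard form is:

\[ \mathrm{rel}_{wup}(c_1, c_2) = \frac{2 \cdot depth\bigl(lcs(c_1, c_2)\bigr)}{depth(c_1) + depth(c_2)} \]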

This has a lower bound of 0 and upper bound of 1. The score is one if the two input synsets are the same.

4.3.7    Hirst & St-Onge Measure

This measure (hso) works by finding lexical chains linking the two word senses. There are three classes of relations that are considered: extra-strong, strong, and medium-strong. The maximum relatedness score is 16.

4.3.8    Adapted Lesk Measure

The Extended Gloss Overlaps measure (lesk) works by finding overlaps in the glosses of the two synsets. The relatedness score is the sum of the squares of the overlap lengths. For example, a single word overlap results in a score of 1. Two single word overlaps results in a score of 2. A two word overlap (i.e., two consecutive words) results in a score of 4. A three word overlap results in a score of 9.
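The scoring rule just described can be summarised as follows (our notation; overlaps denotes the set of phrasal overlaps between the two glosses, and |o| is the length of an overlap in words):

\[ \mathrm{rel}_{lesk}(c_1, c_2) = \sum_{o \,\in\, \mathrm{overlaps}(gloss(c_1),\, gloss(c_2))} |o|^2 \]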

4.3.9    Gloss Vector Measure

The Gloss Vector measure (vector) works by forming second-order co-occurrence vectors from the glosses or WordNet definitions of concepts. The relatedness of two concepts is determined as the cosine of the angle between their gloss vectors. In order to get around the data sparsity issues presented by extremely short glosses, this measure augments the glosses of concepts with glosses of adjacent concepts as defined by WordNet relations.
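The cosine step amounts to the following, where v1 and v2 are the second-order co-occurrence vectors built from the augmented glosses of the two concepts:

\[ \mathrm{rel}_{vector}(c_1, c_2) = \frac{\vec{v}_1 \cdot \vec{v}_2}{\lVert\vec{v}_1\rVert \, \lVert\vec{v}_2\rVert} \]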

4.4      Analysis of the various measures & their weaknesses

As Ted Pedersen notes in his paper “Using Measures of Semantic Relatedness for Word Sense Disambiguation” (21), the Path Length and Wup measures have inherent deficiencies, and the other measures are inherently superior to them. In the same paper the remaining measures, barring the vector measure, were compared. Twenty-nine words were taken as the key comparators; the results are shown in Table 2.

Table 2: Semantic Relatedness Measure Analysis

Word       | WN | Cand | Res  | Jcn  | Lin  | Lch   | Hso  | Lesk
art        |  4 |  14  | 0.41 | 0.54 | 0.42 | 0.44  | 0.4  | 0.61
authority  |  7 |   9  | 0.14 | 0.16 | 0.17 | 0.19  | 0.2  | 0.27
bar        | 13 |  21  | 0.21 | 0.23 | 0.25 | 0.18  | 0.25 | 0.21
bum        |  4 |   4  | 0.2  | 0.73 | 0.41 | 0.59  | 0.31 | 0.13
chair      |  4 |   7  | 0.37 | 0.33 | 0.46 | 0.21  | 0.44 | 0.84
channel    |  7 |  13  | 0.15 | 0.15 | 0.16 | 0.23  | 0.2  | 0.1
child      |  4 |   5  | 0.27 | 0.43 | 0.38 | 0.02  | 0.16 | 0.62
church     |  3 |   9  | 0.37 | 0.41 | 0.37 | 0.41  | 0.48 | 0.38
circuit    |  6 |  15  | 0.43 | 0.51 | 0.48 | 0.34  | 0.41 | 0.53
day        | 10 |  18  | 0.12 | 0.43 | 0.32 | 0.28  | 0.19 | 0.15
detention  |  2 |   5  | 0.61 | 0.81 | 0.61 | 0.52  | 0.63 | 0.88
dyke       |  2 |   2  | 0.73 | 0.86 | 0.77 | 0.46  | 0.61 | 0.89
facility   |  5 |   7  | 0.24 | 0.34 | 0.29 | 0.21  | 0.23 | 0.29
fatigue    |  4 |   6  | 0.16 | 0.42 | 0.22 | 0.77  | 0.44 | 0.77
feeling    |  6 |   7  | 0.22 | 0.55 | 0.27 | 0.53  | 0.26 | 0.49
grip       |  7 |   8  | 0.22 | 0.19 | 0.19 | 0.18  | 0.18 | 0.12
hearth     |  3 |   3  | 0.43 | 0.72 | 0.59 | 0.38  | 0.42 | 0.75
holiday    |  2 |   3  | 0.55 | 0.16 | 0.32 | 0.55  | 0.55 | 0.16
lady       |  3 |   8  | 0.36 | 0.17 | 0.19 | 0.42  | 0.36 | 0.17
material   |  5 |  10  | 0.44 | 0.55 | 0.44 | 0.4   | 0.38 | 0.29
mouth      |  8 |   8  | 0.12 | 0.12 | 0.11 | 0.055 | 0.2  | 0.46
nation     |  4 |   6  | 0.18 | 0.35 | 0.26 | 0.22  | 0.26 | 0.59
nature     |  5 |   6  | 0.1  | 0.11 | 0.05 | 0.11  | 0.18 | 0.16
post       |  8 |  12  | 0.16 | 0.35 | 0.19 | 0.09  | 0.15 | 0.31
restraint  |  6 |   7  | 0.31 | 0.4  | 0.33 | 0.36  | 0.3  | 0.16
sense      |  5 |  11  | 0.49 | 0.4  | 0.51 | 0.4   | 0.43 | 0.5
spade      |  3 |   4  | 0.7  | 0.15 | 0.56 | 0.21  | 0.4  | 0.59
stress     |  5 |   5  | 0.32 | 0.38 | 0.33 | 0.44  | 0.39 | 0.31
yew        |  2 |   3  | 0.66 | 0.79 | 0.73 | 0.57  | 0.7  | 0.86


From the above, the Adapted Lesk and JCN algorithms are clearly the most consistently accurate. An interesting point is that JCN and Lesk adopt two different approaches to measuring semantic relatedness: Lesk uses glosses, while JCN uses information content. In a later study, Siddharth Patwardhan (25) proposed a new measure, gloss vectors, described in detail in an earlier section, which combines corpus statistics and glosses into a vector representation and uses it to determine semantic relations. In theory this measure might therefore be more accurate; we shall determine whether this is the case later in the thesis.

 

In a simple analysis of the above measures over a sample of poem lines (over 500 words) and the existing database of input messages from BlogWall, we observed that over 50% of the words went unrecognized. This was the fault not of the measures but of WordNet, which does not store common function words such as “the”, “a” and “an”. This was a serious problem because the word sense disambiguation layer forms the basis for the topic summarizing layer; with such low accuracy our whole algorithm would suffer. Hence the need to modify these algorithms arose.

 

On deeper analysis, the flaws in the system that needed to be fixed were:

1)      Missing pronoun handling

2)      Missing conjunction handling

3)      Missing preposition handling

4)      Missing auxiliary verb handling

5)      Missing punctuation handling, especially for quotes

4.5      Our Modified Semantic Relatedness Measure & its performance

The missing layers listed above are added as dictionary-based layers; that is, we augment WordNet with a dictionary-like database covering these word classes. We studied the performance of the various measures on our sample data and found that the vector measure is in fact the best performer, so we build our solution on top of it. Thus, instead of the traditional WordNet tagging, which produces only noun, verb, adjective and adverb tags, we also tag pronouns, prepositions, connector words (conjunctions etc.) and auxiliary verbs (see Appendix A).
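A minimal, hypothetical sketch of this fallback idea (the word lists and function names are illustrative only, not our actual implementation):

    # Hypothetical sketch: closed-class words that WordNet cannot tag are
    # looked up in small dictionary-style lists before being reported as
    # unidentified.
    my %closed_class = (
        the => 'determiner',
        and => 'conjunction',
        of  => 'preposition',
        she => 'pronoun',
        is  => 'auxiliary verb',
    );

    sub tag_word {
        my ($word, $wordnet_tag) = @_;   # $wordnet_tag: noun/verb/adj/adv or undef
        return $wordnet_tag if defined $wordnet_tag;
        return exists $closed_class{ lc $word }
            ? $closed_class{ lc $word }
            : 'unidentified';
    }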

 

When all these missing layers were added, the proportion of unidentified words dropped from 50% to roughly 5%, a tenfold reduction. The following table compares the new modified algorithm with the other standard measures. Refer to Appendix B for a sample of the tagging of a simple sentence.

Table 3: Performance of the best relatedness measures and our modified vector measure for our sample set

Method            | % unidentified | % identified correctly (of those identified)
Vector            | 50.00%         | 70.00%
Lesk              | 50.00%         | 55.00%
Jcn               | 50.00%         | 45.00%
Modified Approach | 5.00%          | 85.00%

Another point to note is that the accuracy of the algorithm on the identified words has also increased, by 15 percentage points. This is essentially due to the pronoun handling and the punctuation handling, which cut down on mistakes and enhance accuracy.


An accuracy of 85% in disambiguating words and 95% in identifying them keeps us in good stead for the upcoming components – the topic summarizer and the recombination coherence generator.

 
