HyperDic: mapping

English > WordNet Mapping

At this moment, it is still not possible to map WordNets without applying uncertain heuristics. Because of this, errors are found even among the top-ranking results of all known WordNet mappings.

Since HyperDic version 3.0, we try to address this problem by moving towards a more theoretical approach to the question of WordNet mappings. While we cannot yet entirely eliminate the need for heuristics and uncertainties, we can at least considerably narrow their scope.

The mappings used in HyperDic are available as zip files.

With version 3.0, HyperDic started to integrate WordNets for languages other than English, specifically the GWA Grid versions of the Spanish and Catalan WordNets, which refer to WordNet 1.6 synset identifiers from 1998.

Synset identifiers are byte offsets into the original WordNet database files, i.e. an accidental by-product of the compilation process, so they are never compatible across different WordNet versions.
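A minimal Python sketch makes this fragility concrete. The data below is a hypothetical miniature stand-in for a WordNet data file, not real WordNet content: each "synset identifier" is simply the byte offset where its line starts, so any upstream edit shifts every later identifier.

```python
# Toy illustration (hypothetical data): synset identifiers are the byte
# offsets where each synset's line starts, so editing one entry invalidates
# all identifiers that follow it.

def synset_offsets(data: bytes) -> dict:
    """Map each line's starting byte offset to its content."""
    offsets = {}
    pos = 0
    for line in data.splitlines(keepends=True):
        offsets[pos] = line.rstrip(b"\n")
        pos += len(line)
    return offsets

v1 = b"dog entry\ncat entry\n"
v2 = b"dog entry, revised gloss\ncat entry\n"  # one gloss edit upstream...

print(synset_offsets(v1))  # {0: b'dog entry', 10: b'cat entry'}
print(synset_offsets(v2))  # ...moves the 'cat' synset from offset 10 to 25
```

The `cat` entry itself is unchanged, yet its identifier differs between the two "versions", which is exactly why offsets cannot serve as cross-version keys.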

Foreign-language WordNets have different development paces, so they refer to different WordNet versions. Hence the need for a mapping method to link the contents of these different WordNets together.

With version 2.0 in 2004, HyperDic abandoned all references to the arbitrary synset offsets. Since then, HyperDic has instead used the permanent WordNet sense keys, of which only approx. 1% change between consecutive WordNet versions.

Between version 1.6 and version 3.0, the number of changed sense keys accumulates to 5%, but we are still in a fortunate situation: 95% of the WordNet 1.6 data is already mapped to WordNet 3.0 by the sense keys alone.
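The sense-key approach can be sketched in a few lines of Python. The sense keys below are toy examples in WordNet's key syntax, not the real 1.6 or 3.0 inventories: keys present in both versions map for free, and only the residue needs heuristics.

```python
# Sketch of the sense-key mapping (toy data, hypothetical sense keys):
# senses whose key survives across versions are mapped by identity alone.

def key_mapping(old: set, new: set):
    stable = old & new    # mapped for free, by sense-key identity
    unmapped = old - new  # the residue that still needs heuristics
    return stable, unmapped

wn16 = {"c%1:10:01::", "dog%1:05:00::", "run%2:38:00::"}
wn30 = {"dog%1:05:00::", "run%2:38:00::", "blog%1:10:00::"}

stable, unmapped = key_mapping(wn16, wn30)
print(f"{len(stable)/len(wn16):.0%} mapped on sense keys alone")  # 67%
```

On the real data, this trivial intersection already accounts for the 95% figure above; everything else in this article is about the remaining set difference.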

In principle, a sense key always represents the same word sense, so word senses that share a sense key across different WordNet versions should be identical. So, while having 95% of our data already mapped is not totally satisfying, at least those 95% are mostly certain.

Still, we have to apply heuristics if we want to cover the remaining 5% of the data, and this remains a very muddy issue. But at least the scope of the uncertainty is reduced to that 5%, a considerable improvement over methods where all numbers are uncertain.

However, there are also some known exceptions, where a given sense key can be shown to represent diverging word senses across WordNet versions. For example, in the noun.communication file of WordNet 1.6, sense number 0 of the word "C" is the name of a programming language, while in WordNet 3.0 it is the third letter of the Roman alphabet, and the sense number of the programming language has changed to 1. At this moment, we know of no way to discover these cases other than focusing on the differences between the results of different mapping methods. But since sense keys are meant to be stable, such exceptions should be rare.
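One way to surface candidate exceptions, as just suggested, is to run two independent mapping methods and flag every sense key they send to different targets. The mappings and synset identifiers below are hypothetical toy data, not the actual HyperDic or FreeLing results.

```python
# Flag sense keys on which two independent mappings disagree (toy data,
# hypothetical target synset identifiers): these are the spots to inspect
# for diverging word senses like the "C" example above.

def disagreements(map_a: dict, map_b: dict) -> set:
    """Sense keys mapped to different targets by the two methods."""
    return {k for k in map_a.keys() & map_b.keys() if map_a[k] != map_b[k]}

method_a = {"c%1:10:00::": "06507377-n", "dog%1:05:00::": "02084071-n"}
method_b = {"c%1:10:00::": "06831177-n", "dog%1:05:00::": "02084071-n"}

print(disagreements(method_a, method_b))  # {'c%1:10:00::'}
```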

Another problem is that mapping synonym sets is much more complex than mapping word senses, because synsets are groups of word senses. Individual word senses can only be present or absent in a given WordNet version, but sense groupings can, in addition, split or merge.

A reason for merging synsets between WordNet versions can be substantial word overlap between the original synsets. As a result, the different senses of these words are unified, which raises the question of which of the source sense numbers to keep. If one of the source word senses was broader than the others, it becomes the target word sense, since it covers all the previous distinctions, and the sense numbers originating from the previously narrower synsets disappear. Otherwise, a reasonable heuristic may be to discard the higher sense numbers.
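The merge heuristic just described can be sketched as follows; the sense numbers are toy data, and the `broadest` parameter is a hypothetical stand-in for whatever analysis identifies a broader source sense.

```python
# Sketch of the merge heuristic (toy data): when several source synsets
# collapse into one target, keep the broadest source sense if one is known;
# otherwise fall back to discarding the higher sense numbers.

def merge_target(source_senses, broadest=None):
    """Pick the surviving sense number for a merged synset."""
    if broadest is not None and broadest in source_senses:
        return broadest        # the broader sense covers the others
    return min(source_senses)  # heuristic: keep the lowest sense number

print(merge_target([2, 5]))              # 2
print(merge_target([2, 5], broadest=5))  # 5
```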

Splitting synonym sets is a bigger problem than merging them, because it raises the question of how to distribute the source senses across the target groups. Some solve this problem by distributing all source senses to all target sets. For HyperDic 3.0, we chose to map them to a single, supposedly most representative set, while our 3.1 mapping uses a mixed approach. Both strategies are insufficient: ideally, these cases should be reviewed and retagged by human lexicographers.
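The two split strategies can be contrasted on toy data. The "most representative" criterion below, largest word overlap with the source synset, is an illustrative assumption, not necessarily the criterion HyperDic actually uses.

```python
# Two strategies for split synsets (toy data): distribute the source sense
# to every target group, or pick one "most representative" target. Here
# representativeness is approximated, hypothetically, by word overlap.

def split_all(source: str, targets: list) -> dict:
    """Strategy 1: map the source sense to every target set."""
    return {source: list(targets)}

def split_representative(source_words: set, targets: dict) -> str:
    """Strategy 2: pick the target synset with the largest word overlap."""
    return max(targets, key=lambda t: len(source_words & targets[t]))

src = {"bank", "cant", "camber"}
tgts = {"T1": {"bank", "cant"}, "T2": {"bank", "incline"}}
print(split_representative(src, tgts))  # 'T1'
```

A mixed approach, as in the 3.1 mapping, would apply one or the other strategy case by case; neither replaces review by human lexicographers.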

So, at this moment, a fully automatic WordNet mapping cannot be entirely trusted. However, we compared our previous mapping of the Spanish dataset with a similar mapping produced by the FreeLing project, included in their FreeLing 2.0 release. Although the mapping methods were different, both mappings agree on approx. 95% of the WordNet 3.0 targets. Thus, they tend to validate each other.
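The agreement figure is a simple ratio over the senses both mappings cover; a sketch with toy mappings (not the real HyperDic or FreeLing data):

```python
# Agreement between two mappings (toy data): the share of common source
# senses that both methods send to the same WordNet 3.0 target.

def agreement(a: dict, b: dict) -> float:
    common = a.keys() & b.keys()
    same = sum(1 for k in common if a[k] == b[k])
    return same / len(common)

m1 = {"s1": "t1", "s2": "t2", "s3": "t3", "s4": "t4"}
m2 = {"s1": "t1", "s2": "t2", "s3": "t3", "s4": "tX"}
print(f"{agreement(m1, m2):.0%}")  # 75%
```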

A finer analysis of the cases where the two methods yield different targets indicates, as expected, that the FreeLing mapping performs better on unstable sense keys, while the HyperDic mapping performs better on stable ones. The latter case is meant to be the rule and the former the exception, so in theory we expect the HyperDic targets to be more reliable overall. However, we would need to further review all the dubious cases before drawing a conclusion.
