WordNet-based lexical semantic classification for text corpus analysis
Source journal: Journal of Central South University (English Edition), 2015, Issue 5
Authors: LONG Jun, WANG Lu-da, LI Zu-de, ZHANG Zu-ping, YANG Liu
Pages: 1833-1840
Key words:document representation; lexical semantic content; classification; eigenvector
Abstract: Many text classification methods depend on statistical term measures for document representation. Such representations ignore the lexical semantic content of terms and the mutual information that can be distilled from it, leading to classification errors. This work proposed a document representation method, a WordNet-based lexical semantic VSM, to solve this problem. Using WordNet, the method constructed a data structure of semantic-element information to characterize lexical semantic content, and adapted EM modeling to disambiguate word stems. Then, in the lexical-semantic space of the corpus, a lexical-semantic eigenvector for document representation was built by calculating the weight of each synset, and was applied with the widely recognized NWKNN algorithm. On the Reuters-21578 text corpus and a version of it adjusted by lexical replacement, experimental results show that the lexical-semantic eigenvector outperforms the TF-IDF-based term-statistic eigenvector in both F1 measure and dimensionality. The formation of document representation eigenvectors gives the method wide prospects for classification applications in text corpus analysis.
LONG Jun(龙军)1, WANG Lu-da(王鲁达)1, LI Zu-de(李祖德)1, ZHANG Zu-ping(张祖平)1, YANG Liu(杨柳)2
(1. School of Information Science and Engineering, Central South University, Changsha 410075, China;
2. School of Software, Central South University, Changsha 410075, China)
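The synset-weighting step behind the lexical-semantic eigenvector can be sketched as follows. This is a minimal illustrative Python sketch, not the authors' implementation: a hypothetical mini-lexicon (`LEXICON`) stands in for WordNet synset lookups, and the paper's EM-based word-stem disambiguation is omitted. It shows how synonymous stems collapse into a single synset dimension before TF-IDF-style weighting.

```python
from collections import Counter
from math import log

# Hypothetical mini-lexicon standing in for WordNet: each word stem maps
# to a synset id, so synonyms share one vector dimension. The paper
# additionally disambiguates ambiguous stems with an EM-based model.
LEXICON = {
    "bank": "bank.n.01", "lender": "bank.n.01",
    "money": "money.n.01", "cash": "money.n.01",
    "rise": "rise.v.01", "climb": "rise.v.01",
}

def synset_tf(tokens):
    """Term frequencies in synset space: synonyms collapse together."""
    return Counter(LEXICON[t] for t in tokens if t in LEXICON)

def synset_tfidf(doc_tokens, corpus):
    """TF-IDF-style weight for each synset dimension of one document."""
    n_docs = len(corpus)
    df = Counter()                      # document frequency per synset
    for doc in corpus:
        df.update(set(synset_tf(doc)))
    tf = synset_tf(doc_tokens)
    return {s: f * log(n_docs / df[s]) for s, f in tf.items()}

corpus = [["bank", "money"], ["cash", "rise"], ["bank", "lender", "climb"]]
vec = synset_tfidf(["bank", "lender", "cash"], corpus)
# "bank" and "lender" fall into the single dimension "bank.n.01"
```

The resulting per-document weight dictionary plays the role of the lexical-semantic eigenvector, which the paper then feeds to an NWKNN classifier.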