Parallel naive Bayes algorithm for large-scale Chinese text classification based on Spark

Source journal: Journal of Central South University (English Edition), 2019, Issue 1

Authors: ZHU Zong-wei, LIU Peng, ZHAO Hui-han, TENG Jia-yu, YANG Yan-yan, LIU Ya-feng

Pages: 1–12

Key words: Chinese text classification; naive Bayes; Spark; Hadoop; resilient distributed dataset; parallelization

Abstract: The sharp increase in the volume of Chinese text data on the Internet has significantly prolonged the time needed to classify it. To address this problem, this paper proposes and implements a parallel naive Bayes algorithm (PNBA) for Chinese text classification based on Spark, a parallel in-memory computing platform for big data. The algorithm parallelizes the entire training and prediction process of the naive Bayes classifier, mainly by adopting the resilient distributed dataset (RDD) programming model. For comparison, a Hadoop-based PNBA is also implemented. Test results show that, in the same computing environment and on the same text sets, the Spark PNBA clearly outperforms the Hadoop PNBA on key indicators such as speedup ratio and scalability. Spark-based parallel algorithms can therefore better meet the requirements of large-scale Chinese text data mining.
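As a rough illustration of why naive Bayes training parallelizes well, the sketch below expresses it as a per-partition counting step followed by an associative merge, which is the kind of aggregation an RDD pipeline can distribute. It is not the authors' PNBA: it runs in plain Python rather than on Spark, and it assumes a multinomial model with Laplace smoothing and text that has already been segmented into words. The function names (`count_partition`, `merge`, `train`, `predict`) and the toy data are hypothetical.

```python
import math
from collections import Counter
from functools import reduce

def count_partition(docs):
    """Map side: count documents per class and (class, word) pairs in one partition."""
    class_counts, word_counts = Counter(), Counter()
    for label, tokens in docs:
        class_counts[label] += 1
        for tok in tokens:
            word_counts[(label, tok)] += 1
    return class_counts, word_counts

def merge(a, b):
    """Reduce side: add the counters of two partitions (associative and commutative)."""
    return a[0] + b[0], a[1] + b[1]

def train(partitions):
    """Combine per-partition counts into log priors and smoothed log likelihoods."""
    class_counts, word_counts = reduce(merge, map(count_partition, partitions))
    vocab = {word for (_, word) in word_counts}
    total_docs = sum(class_counts.values())
    tokens_per_class = Counter()
    for (label, _), n in word_counts.items():
        tokens_per_class[label] += n
    log_prior = {c: math.log(n / total_docs) for c, n in class_counts.items()}

    def log_likelihood(c, word):
        # Laplace (add-one) smoothing; an assumption, not taken from the paper.
        return math.log((word_counts[(c, word)] + 1) / (tokens_per_class[c] + len(vocab)))

    return log_prior, log_likelihood, vocab

def predict(model, tokens):
    """Return the class with the highest posterior log score; unseen words are skipped."""
    log_prior, log_likelihood, vocab = model
    scores = {
        c: lp + sum(log_likelihood(c, t) for t in tokens if t in vocab)
        for c, lp in log_prior.items()
    }
    return max(scores, key=scores.get)

# Toy data: two "partitions" of pre-segmented documents (hypothetical).
partitions = [
    [("sports", ["足球", "比赛"]), ("finance", ["股票", "市场"])],
    [("sports", ["比赛", "球队"]), ("finance", ["市场", "银行"])],
]
model = train(partitions)
print(predict(model, ["比赛", "足球"]))
```

Because `merge` only adds counters, the partitions can be counted independently and combined in any order. Prediction needs only the merged model, so each document can likewise be scored independently.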

Cite this article as: LIU Peng, ZHAO Hui-han, TENG Jia-yu, YANG Yan-yan, LIU Ya-feng, ZHU Zong-wei. Parallel naive Bayes algorithm for large-scale Chinese text classification based on spark [J]. Journal of Central South University, 2019, 26(1): 1–12. DOI: https://doi.org/10.1007/s11771-019-3978-x.
