Research on News Keyword Extraction Based on TF-IDF-MP Algorithm

Author biography:

Cao Yiqin (born 1964), male, professor; research interests: image processing and pattern recognition. E-mail: yqcao@ecjtu.edu.cn.

CLC number:

TP391

Fund project:

National Natural Science Foundation of China (61967006)

    Abstract:

    The TF-IDF algorithm uses term frequency and inverse document frequency to judge the importance of words in a document, but it discriminates poorly between categories. To improve classification, the TF-IDF-MP algorithm is proposed. First, the documents in the corpus are annotated by paragraph, and the jieba word segmentation tool is used to segment the text and tag parts of speech. Then the number of times a feature word occurs in a single document is compared with its average number of occurrences across all documents in the corpus, and an improved Sigmoid function is used to adjust the feature word's weight. At the same time, different position weights are assigned according to the importance of the paragraph position in the document. Finally, the feature words are ranked by weight and a Naive Bayes classifier is used to classify the documents. Experimental results show that, when applied to news classification, TF-IDF-MP improves precision, recall, and F1 score over TF-IDF and related improved algorithms.
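
The weighting steps described in the abstract can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: the abstract does not give the form of the improved Sigmoid function or the values of the position weights, so `sigmoid_adjust`, `POSITION_WEIGHT`, and the smoothed IDF below are all hypothetical. The input is assumed to be already segmented into tokens (for example by jieba), and the Naive Bayes classification step is omitted.

```python
import math
from collections import Counter

# Hypothetical position weights (not from the paper): the first and last
# paragraphs are assumed to matter more than the body paragraphs.
POSITION_WEIGHT = {"first": 1.5, "body": 1.0, "last": 1.2}


def sigmoid_adjust(count, mean_count):
    """Assumed adjustment factor in (0.5, 1.5).

    It exceeds 1 when a term occurs more often in this document than its
    average count per corpus document, and falls below 1 otherwise.
    """
    return 0.5 + 1.0 / (1.0 + math.exp(-(count - mean_count)))


def paragraph_position(index, total):
    """Label a paragraph as 'first', 'last', or 'body' by its index."""
    if index == 0:
        return "first"
    if index == total - 1:
        return "last"
    return "body"


def keyword_weights(docs):
    """Rank the terms of each document by an adjusted TF-IDF weight.

    docs: list of documents; each document is a list of paragraphs;
    each paragraph is a list of tokens.
    Returns one list of (term, weight) pairs per document, highest first.
    """
    n_docs = len(docs)
    counts = [Counter(tok for para in doc for tok in para) for doc in docs]

    # Document frequency and average occurrences per corpus document.
    doc_freq = Counter(tok for c in counts for tok in c)
    total = Counter()
    for c in counts:
        total.update(c)
    mean_count = {tok: total[tok] / n_docs for tok in total}

    ranked = []
    for doc, c in zip(docs, counts):
        doc_len = sum(c.values())

        # Each term takes the largest position weight among the
        # paragraphs in which it appears.
        pos = {}
        for i, para in enumerate(doc):
            w = POSITION_WEIGHT[paragraph_position(i, len(doc))]
            for tok in para:
                pos[tok] = max(pos.get(tok, 0.0), w)

        weights = {}
        for tok, n in c.items():
            tf = n / doc_len
            idf = math.log((n_docs + 1) / (doc_freq[tok] + 1)) + 1.0  # smoothed IDF
            weights[tok] = tf * idf * sigmoid_adjust(n, mean_count[tok]) * pos[tok]

        ranked.append(sorted(weights.items(), key=lambda kv: kv[1], reverse=True))
    return ranked
```

Calling `keyword_weights` on a small tokenized corpus returns, for each document, its terms in descending order of weight; the top few entries would serve as candidate keywords or as features for a downstream classifier.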

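Cite this article

Cao Yiqin, Sheng Wuping, Zhou Huixiang. Research on News Keyword Extraction Based on TF-IDF-MP Algorithm[J]. Journal of East China Jiaotong University, 2021, 37(1): 122-130.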

History
  • Published online: 2021-04-23