NLTK Essentials Study Notes (13)

Introduction:

Summarization applications also rest on another line of reasoning: important sentences tend to contain important words, and words that discriminate between documents across a corpus (discriminatory words) are mostly important words. A sentence that contains highly discriminative words is therefore itself important. This yields a very simple measure: compute the TF-IDF (term frequency-inverse document frequency) score of every word, then derive a normalized importance score for each sentence from the weights of the words it contains. That score can then serve as the criterion for selecting sentences for the summary.
TF-IDF (term frequency-inverse document frequency) is a weighting technique widely used in information retrieval and text mining. It is a statistical measure of how important a term is to one document within a collection or corpus. A term's importance increases in proportion to how often it appears in the document, but is offset by how frequently it appears across the corpus. Variants of TF-IDF weighting are commonly used by search engines as a measure of the relevance between a document and a user query. Besides TF-IDF, web search engines also use link-analysis-based ranking methods to determine the order in which documents appear in search results.
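The definition above can be made concrete with a tiny hand computation. The sketch below uses a toy three-document corpus and the textbook formulation tf * log(N / df); note that scikit-learn's default differs slightly (it smooths the idf and adds 1), so the numbers will not match the vectorizer's output exactly.

```python
from math import log

# Toy 3-document corpus for illustration only
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are pets",
]
tokenized = [d.split() for d in docs]
N = len(docs)

def tf_idf(term, doc_tokens):
    tf = doc_tokens.count(term) / len(doc_tokens)  # term frequency in one document
    df = sum(term in toks for toks in tokenized)   # number of documents containing the term
    return tf * log(N / df)                        # textbook tf * idf

# "the" appears in 2 of 3 documents, so its idf is low; "cat" appears
# in only one, so it scores higher even with a lower raw count.
print(round(tf_idf("the", tokenized[0]), 4))  # 0.1352
print(round(tf_idf("cat", tokenized[0]), 4))  # 0.1831
```

This is why discriminative words dominate: frequency inside the document is multiplied by rarity across the corpus.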
Rather than following the book exactly (it practices on only the first three sentences instead of a whole passage), I used the preceding paragraph of my text as input:

import nltk
from sklearn.feature_extraction.text import TfidfVectorizer

# Read the article and split it into sentences
with open('news.txt') as f:
    news_content = f.read()
sentences = nltk.sent_tokenize(news_content)

# One row per sentence, one column per term; min_df=1 replaces the
# book's min_df=0, which recent scikit-learn rejects (same result here)
vectorizer = TfidfVectorizer(norm='l2', min_df=1, use_idf=True,
                             smooth_idf=False, sublinear_tf=True)
sklearn_binary = vectorizer.fit_transform(sentences)
print(vectorizer.get_feature_names_out())  # get_feature_names() before scikit-learn 1.0
print(sklearn_binary.toarray())

Output:

['accept', 'accepting', 'altria', 'and', 'announce', 'approaches', 'arthur', 'as', 'at', 'be', 'birth', 'britain', 'british', 'by', 'caliburn', 'ceremonial', 'character', 'decides', 'despite', 'destined', 'dies', 'draws', 'ector', 'eligible', 'embedded', 'enters', 'entrusted', 'explaining', 'fearing', 'fifteen', 'following', 'for', 'full', 'gender', 'growing', 'hardships', 'heir', 'her', 'hesitation', 'his', 'however', 'if', 'in', 'inspired', 'invasion', 'is', 'king', 'knight', 'known', 'large', 'leadership', 'leaving', 'legends', 'legitimate', 'loyal', 'mantle', 'merlin', 'monarch', 'name', 'nativity', 'never', 'no', 'not', 'of', 'or', 'pendragon', 'people', 'period', 'preserving', 'publicly', 'pulling', 'raises', 'recognize', 'responsible', 'ruler', 'saber', 'saxons', 'she', 'shoulders', 'sir', 'slab', 'son', 'soon', 'stone', 'subjects', 'surrogate', 'sword', 'symbolic', 'that', 'the', 'this', 'threat', 'throne', 'to', 'turmoil', 'uther', 'welfare', 'when', 'who', 'will', 'withdraws', 'without', 'woman']
[[ 0.          0.          0.15095332  0.          0.          0.
   0.31622502  0.          0.          0.          0.          0.          0.
   0.20340954  0.          0.          0.31622502  0.          0.          0.
   0.          0.          0.          0.          0.          0.          0.
   0.          0.          0.          0.          0.          0.31622502
   0.          0.          0.          0.          0.          0.          0.
   0.          0.          0.          0.31622502  0.          0.17386773
   0.24504638  0.          0.          0.          0.          0.
   0.31622502  0.          0.          0.          0.          0.
   0.31622502  0.          0.          0.          0.          0.15095332
   0.          0.31622502  0.          0.          0.          0.          0.
   0.          0.          0.          0.          0.31622502  0.          0.
   0.          0.          0.          0.          0.          0.          0.
   0.          0.          0.          0.          0.15095332  0.          0.
   0.          0.          0.          0.          0.          0.          0.
   0.          0.          0.          0.        ]
 [ 0.23250474  0.          0.11098857  0.          0.23250474  0.          0.
   0.14955705  0.23250474  0.          0.23250474  0.          0.          0.
   0.          0.          0.          0.23250474  0.          0.          0.
   0.          0.          0.          0.          0.          0.          0.
   0.23250474  0.          0.          0.          0.          0.18017058
   0.          0.          0.          0.11098857  0.          0.23250474
   0.          0.          0.          0.          0.          0.          0.
   0.          0.          0.          0.          0.          0.
   0.23250474  0.          0.          0.          0.          0.
   0.23250474  0.23250474  0.          0.23250474  0.          0.23250474
   0.          0.          0.          0.          0.23250474  0.          0.
   0.          0.          0.18017058  0.          0.          0.          0.
   0.          0.          0.          0.          0.          0.23250474
   0.          0.          0.          0.          0.          0.          0.
   0.          0.14955705  0.          0.18017058  0.          0.          0.
   0.14955705  0.          0.          0.23250474]
 [ 0.          0.          0.          0.          0.          0.          0.
   0.18736875  0.          0.          0.          0.          0.
   0.18736875  0.          0.          0.          0.          0.          0.
   0.          0.          0.29128766  0.          0.          0.
   0.29128766  0.          0.          0.          0.          0.          0.
   0.          0.          0.          0.          0.13904921  0.          0.
   0.          0.          0.          0.          0.          0.1601566
   0.          0.29128766  0.          0.          0.          0.          0.
   0.          0.29128766  0.          0.22572213  0.          0.          0.
   0.          0.          0.          0.          0.          0.          0.
   0.          0.          0.          0.          0.29128766  0.          0.
   0.          0.          0.          0.18736875  0.          0.29128766
   0.          0.29128766  0.          0.          0.          0.29128766
   0.          0.          0.          0.          0.          0.          0.
   0.18736875  0.          0.          0.          0.          0.29128766
   0.          0.          0.          0.        ]
 [ 0.          0.          0.14155101  0.          0.          0.          0.
   0.          0.          0.          0.          0.          0.          0.
   0.          0.          0.          0.          0.          0.
   0.29652856  0.          0.          0.29652856  0.          0.          0.
   0.          0.          0.29652856  0.          0.          0.          0.
   0.          0.          0.29652856  0.          0.          0.          0.
   0.          0.          0.          0.          0.16303816  0.22978336
   0.          0.29652856  0.          0.          0.29652856  0.          0.
   0.          0.          0.          0.          0.          0.          0.
   0.29652856  0.          0.          0.          0.          0.          0.
   0.          0.          0.          0.          0.          0.          0.
   0.          0.          0.          0.          0.          0.          0.
   0.          0.          0.          0.          0.          0.          0.
   0.14155101  0.          0.          0.29652856  0.19073992  0.
   0.22978336  0.          0.29652856  0.          0.          0.          0.
   0.        ]
 [ 0.          0.          0.          0.          0.          0.          0.
   0.          0.          0.          0.          0.24121053  0.
   0.20022545  0.          0.          0.          0.          0.          0.
   0.          0.          0.          0.          0.          0.31127497
   0.          0.          0.          0.          0.31127497  0.          0.
   0.          0.31127497  0.          0.          0.          0.          0.
   0.          0.          0.          0.          0.31127497  0.          0.
   0.          0.          0.          0.          0.          0.          0.
   0.          0.          0.          0.          0.          0.          0.
   0.          0.          0.25158536  0.          0.          0.
   0.31127497  0.          0.          0.          0.          0.          0.
   0.          0.          0.31127497  0.          0.          0.          0.
   0.          0.          0.          0.          0.          0.          0.
   0.          0.25158536  0.          0.31127497  0.          0.
   0.31127497  0.          0.          0.          0.          0.          0.
   0.          0.        ]
 [ 0.          0.          0.10632924  0.          0.          0.22274414
   0.          0.14327861  0.          0.          0.          0.
   0.22274414  0.          0.17260697  0.22274414  0.          0.          0.
   0.22274414  0.          0.          0.          0.          0.22274414
   0.          0.          0.22274414  0.          0.          0.          0.
   0.          0.          0.          0.          0.          0.10632924
   0.          0.          0.          0.22274414  0.22274414  0.          0.
   0.          0.          0.          0.          0.22274414  0.          0.
   0.          0.          0.          0.          0.17260697  0.          0.
   0.          0.          0.          0.          0.10632924  0.          0.
   0.17260697  0.          0.          0.          0.          0.
   0.22274414  0.          0.17260697  0.          0.          0.14327861
   0.          0.          0.22274414  0.          0.22274414  0.22274414
   0.          0.          0.17260697  0.          0.22274414  0.10632924
   0.          0.          0.          0.          0.          0.          0.
   0.          0.          0.14327861  0.22274414  0.          0.        ]
 [ 0.          0.24521796  0.11705736  0.19002219  0.          0.          0.
   0.          0.          0.24521796  0.          0.          0.          0.
   0.          0.          0.          0.          0.          0.          0.
   0.          0.          0.          0.          0.          0.          0.
   0.          0.          0.          0.24521796  0.          0.          0.
   0.24521796  0.          0.11705736  0.          0.          0.24521796
   0.          0.          0.          0.          0.13482643  0.          0.
   0.          0.          0.          0.          0.          0.          0.
   0.          0.          0.24521796  0.          0.          0.          0.
   0.          0.24565801  0.          0.          0.19002219  0.
   0.24521796  0.          0.24521796  0.          0.          0.24521796
   0.          0.          0.          0.          0.          0.          0.
   0.          0.          0.          0.          0.          0.19002219
   0.24521796  0.          0.19819534  0.24521796  0.          0.          0.
   0.          0.          0.24521796  0.          0.          0.15773474
   0.          0.          0.        ]
 [ 0.          0.          0.          0.38872173  0.          0.          0.
   0.          0.          0.          0.          0.22958532  0.          0.
   0.22958532  0.          0.          0.          0.29627299  0.          0.
   0.29627299  0.          0.          0.          0.          0.          0.
   0.          0.          0.          0.          0.          0.22958532
   0.          0.          0.          0.14142901  0.29627299  0.          0.
   0.          0.          0.          0.          0.          0.          0.
   0.          0.          0.29627299  0.          0.          0.          0.
   0.29627299  0.          0.          0.          0.          0.          0.
   0.          0.14142901  0.          0.          0.          0.          0.
   0.          0.          0.          0.          0.          0.          0.
   0.          0.19057553  0.29627299  0.          0.          0.          0.
   0.          0.          0.          0.          0.          0.          0.
   0.          0.          0.          0.          0.          0.          0.
   0.          0.          0.          0.          0.29627299  0.        ]]
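The code stops at printing the matrix, but the goal stated in the introduction is to turn those per-word weights into sentence scores for summary selection. A minimal sketch of that final step, assuming a simple average-weight score and using placeholder sentences (news.txt is not reproduced here):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Placeholder sentences standing in for the news.txt content
sentences = [
    "Saber is the once and future king of Britain.",
    "She pulled the sword from the stone, fulfilling the prophecy.",
    "The people never learned that their king was a woman.",
    "Merlin raised her in secret far from the court.",
]

vectorizer = TfidfVectorizer(norm='l2', sublinear_tf=True)
arr = vectorizer.fit_transform(sentences).toarray()

# Score each sentence by its average TF-IDF weight, so a long
# sentence does not win simply by containing more terms
scores = arr.sum(axis=1) / (arr > 0).sum(axis=1)

# Keep the two highest-scoring sentences, restored to original order
top = sorted(sorted(range(len(sentences)), key=lambda i: -scores[i])[:2])
summary = " ".join(sentences[i] for i in top)
print(summary)
```

Averaging is only one of several reasonable normalizations; summing raw weights or counting words above a threshold are common alternatives.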

