Study Notes on the NLTK Basics Tutorial (Part 2)

night李 2018-01-31 23:51:04

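Python basics:
The dictionary is another of the most commonly used data structures. In other languages it is known as an associative array or associative memory. A dictionary is indexed by key, and a key can be of any immutable type; strings and numbers are the most common choices.
Python's dictionary is one implementation of a hash table. It is very easy to work with, and its strength is that quite complex data structures can be built with very little code.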
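
To make the "immutable key" rule concrete, here is a small sketch. The dictionary contents are made up for illustration; the point is only that hashable values (strings, numbers, tuples) are accepted as keys while a mutable list is rejected.

```python
# Keys must be hashable: strings, numbers and tuples of hashables all work.
ages = {"alice": 30, 42: "answer", ("x", "y"): "a tuple key"}

# A list is mutable and therefore unhashable, so Python refuses it as a key.
try:
    ages[["a", "list"]] = 1
    list_key_error = None
except TypeError as exc:
    list_key_error = str(exc)

print(ages[("x", "y")])
print(list_key_error)
```
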
Example: using a dictionary to count how often each word occurs in a text:

mystring="Monty Python! And the holy Grail !\n"
word_freq={}
# split on whitespace and count each token
for tok in mystring.split():
    if tok in word_freq:
        word_freq[tok]+=1
    else:
        word_freq[tok]=1
print(word_freq)

Output:

{'holy': 1, 'the': 1, 'Python!': 1, '!': 1, 'Grail': 1, 'And': 1, 'Monty': 1}
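
The same count can be written more compactly. The sketch below is not from the book; it shows two standard-library alternatives, collections.Counter and dict.get with a default, and checks that they agree on the same sample string.

```python
from collections import Counter

mystring = "Monty Python! And the holy Grail !\n"

# Counter does the "add one if present, else start at one" bookkeeping itself.
word_freq = Counter(mystring.split())

# dict.get with a default value is another common way to write the loop.
word_freq_manual = {}
for tok in mystring.split():
    word_freq_manual[tok] = word_freq_manual.get(tok, 0) + 1

print(dict(word_freq) == word_freq_manual)
print(word_freq["Monty"])
```

Note that splitting on whitespace keeps punctuation attached, so "Python!" and a bare "!" are counted as separate tokens.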

Getting started with NLTK:
The book first introduces a simple scraping example that fetches the text of the Python home page:

import urllib.request
response=urllib.request.urlopen('http://python.org/')
html=response.read()
print(len(html))

This differs from the book: I am using Python 3.5, where the urllib2 package no longer exists, so urllib.request is used instead.
Output:

48907

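Next comes some exploratory data analysis (EDA). For a piece of text, EDA can mean many things; here we only look at one simple case: which terms dominate the document and how often they occur.
For the text scraped from the Python home page, we first want to get rid of the HTML tags. One way is to use a regular expression to select only the tokens, made up of digits and letters, and turn them into a list.
Version 1 (plain whitespace split, tags still included):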

import urllib.request
response=urllib.request.urlopen('http://python.org/')
html=response.read()
#print(len(html))
tokens=[tok for tok in html.split()]
print("Total no of tokens:" +str (len(tokens)))
print(tokens[0:100])

Output:

Total no of tokens:2932
[b'<!doctype', b'html>', b'<!--[if', b'lt', b'IE', b'7]>', b'<html', b'class="no-js', b'ie6', b'lt-ie7', b'lt-ie8', b'lt-ie9">', b'<![endif]-->', b'<!--[if', b'IE', b'7]>', b'<html', b'class="no-js', b'ie7', b'lt-ie8', b'lt-ie9">', b'<![endif]-->', b'<!--[if', b'IE', b'8]>', b'<html', b'class="no-js', b'ie8', b'lt-ie9">', b'<![endif]-->', b'<!--[if', b'gt', b'IE', b'8]><!--><html', b'class="no-js"', b'lang="en"', b'dir="ltr">', b'<!--<![endif]-->', b'<head>', b'<meta', b'charset="utf-8">', b'<meta', b'http-equiv="X-UA-Compatible"', b'content="IE=edge">', b'<link', b'rel="prefetch"', b'href="//ajax.googleapis.com/ajax/libs/jquery/1.8.2/jquery.min.js">', b'<meta', b'name="application-name"', b'content="Python.org">', b'<meta', b'name="msapplication-tooltip"', b'content="The', b'official', b'home', b'of', b'the', b'Python', b'Programming', b'Language">', b'<meta', b'name="apple-mobile-web-app-title"', b'content="Python.org">', b'<meta', b'name="apple-mobile-web-app-capable"', b'content="yes">', b'<meta', b'name="apple-mobile-web-app-status-bar-style"', b'content="black">', b'<meta', b'name="viewport"', b'content="width=device-width,', b'initial-scale=1.0">', b'<meta', b'name="HandheldFriendly"', b'content="True">', b'<meta', b'name="format-detection"', b'content="telephone=no">', b'<meta', b'http-equiv="cleartype"', b'content="on">', b'<meta', b'http-equiv="imagetoolbar"', b'content="false">', b'<script', b'src="/static/js/libs/modernizr.js"></script>', b'<link', b'href="/static/stylesheets/style.css"', b'rel="stylesheet"', b'type="text/css"', b'title="default"', b'/>', b'<link', b'href="/static/stylesheets/mq.css"', b'rel="stylesheet"', b'type="text/css"', b'media="not', b'print,', b'braille,']

Version 2 (split with a regular expression):

import urllib.request
import re
response=urllib.request.urlopen('http://python.org/')
html=response.read()
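html=html.decode('utf-8')  # bytes -> str, needed before applying a str pattern
tokens=re.split(r'\W+',html)  # raw string avoids the invalid-escape warning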
print(len(tokens))
print(tokens[0:100])

Output:

6221
['', 'doctype', 'html', 'if', 'lt', 'IE', '7', 'html', 'class', 'no', 'js', 'ie6', 'lt', 'ie7', 'lt', 'ie8', 'lt', 'ie9', 'endif', 'if', 'IE', '7', 'html', 'class', 'no', 'js', 'ie7', 'lt', 'ie8', 'lt', 'ie9', 'endif', 'if', 'IE', '8', 'html', 'class', 'no', 'js', 'ie8', 'lt', 'ie9', 'endif', 'if', 'gt', 'IE', '8', 'html', 'class', 'no', 'js', 'lang', 'en', 'dir', 'ltr', 'endif', 'head', 'meta', 'charset', 'utf', '8', 'meta', 'http', 'equiv', 'X', 'UA', 'Compatible', 'content', 'IE', 'edge', 'link', 'rel', 'prefetch', 'href', 'ajax', 'googleapis', 'com', 'ajax', 'libs', 'jquery', '1', '8', '2', 'jquery', 'min', 'js', 'meta', 'name', 'application', 'name', 'content', 'Python', 'org', 'meta', 'name', 'msapplication', 'tooltip', 'content', 'The', 'official']
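
The first token in the output above is an empty string. That happens because re.split produces an empty piece whenever the text begins or ends with a non-word character. The sketch below uses a short made-up HTML string (not the real page) to show the effect, and re.findall as one way to avoid it.

```python
import re

# Small inline sample standing in for the downloaded page (made up for illustration).
html = '<!doctype html><p class="intro">Hello, NLTK world!</p>'

# re.split leaves '' at the edges when the text starts or ends with non-word characters.
split_tokens = re.split(r'\W+', html)

# re.findall returns only the runs of word characters, so no empty strings appear.
word_tokens = re.findall(r'\w+', html)

print(split_tokens)
print(word_tokens)
```

Filtering the split result with a list comprehension such as [t for t in split_tokens if t] gives the same list as re.findall here.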

Note: in Python 3 you need to add

html=html.decode('utf-8')

otherwise you get this error:

cannot use a string pattern on a bytes-like object
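
The error can be reproduced without touching the network. In this sketch the bytes literal stands in for what response.read() returns, and UTF-8 is assumed as the encoding, as in the code above.

```python
import re

raw = b'<p>Hello world</p>'    # stand-in for response.read(), which returns bytes

try:
    re.split(r'\W+', raw)       # str pattern applied to a bytes object
    error_message = None
except TypeError as exc:
    error_message = str(exc)

text = raw.decode('utf-8')      # bytes -> str (UTF-8 is an assumption here)
tokens = re.split(r'\W+', text)

print(error_message)
print(tokens)
```
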

Next, strip the tags as part of the NLTK workflow; the code below uses BeautifulSoup's get_text() to do it:

import nltk
import urllib.request
from bs4 import BeautifulSoup
response=urllib.request.urlopen('http://python.org/')
html=response.read()
html=html.decode('utf-8')
soup=BeautifulSoup(html,'lxml')
clean=soup.get_text()
tokens=[tok for tok in clean.split()]
print(tokens[:100])

Output:

['Welcome', 'to', 'Python.org', '{', '"@context":', '"http://schema.org",', '"@type":', '"WebSite",', '"url":', '"https://www.python.org/",', '"potentialAction":', '{', '"@type":', '"SearchAction",', '"target":', '"https://www.python.org/search/?q={search_term_string}",', '"query-input":', '"required', 'name=search_term_string"', '}', '}', 'var', '_gaq', '=', '_gaq', '||', '[];', "_gaq.push(['_setAccount',", "'UA-39055973-1']);", "_gaq.push(['_trackPageview']);", '(function()', '{', 'var', 'ga', '=', "document.createElement('script');", 'ga.type', '=', "'text/javascript';", 'ga.async', '=', 'true;', 'ga.src', '=', "('https:'", '==', 'document.location.protocol', '?', "'https://ssl'", ':', "'http://www')", '+', "'.google-analytics.com/ga.js';", 'var', 's', '=', "document.getElementsByTagName('script')[0];", 's.parentNode.insertBefore(ga,', 's);', '})();', 'Notice:', 'While', 'Javascript', 'is', 'not', 'essential', 'for', 'this', 'website,', 'your', 'interaction', 'with', 'the', 'content', 'will', 'be', 'limited.', 'Please', 'turn', 'Javascript', 'on', 'for', 'the', 'full', 'experience.', 'Skip', 'to', 'content', '▼', 'Close', 'Python', 'PSF', 'Docs', 'PyPI', 'Jobs', 'Community', '▲', 'The', 'Python', 'Network']
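
The tokens above still contain JavaScript, because the text inside script tags is returned along with the visible text. As a rough illustration of how that noise could be skipped, here is a sketch that uses only the standard library's html.parser instead of BeautifulSoup. The TextExtractor class and the sample HTML are made up for this example and are not from the book.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, ignoring anything inside <script> or <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.skip_depth = 0   # > 0 while we are inside a skipped element
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.parts.append(data)


# Made-up sample page, loosely modelled on the output above.
sample = ("<html><head><title>Welcome to Python.org</title>"
          "<script>var _gaq = _gaq || [];</script></head>"
          "<body><p>Python is a programming language</p></body></html>")

parser = TextExtractor()
parser.feed(sample)
parser.close()
tokens = " ".join(parser.parts).split()
print(tokens)
```

With BeautifulSoup, a comparable effect is usually obtained by removing the script and style elements from the soup before calling get_text().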

Next, use nltk to count word frequencies:

import nltk
import urllib.request
from bs4 import BeautifulSoup

response=urllib.request.urlopen('http://python.org/')
html=response.read()
html=html.decode('utf-8')
soup=BeautifulSoup(html,'lxml')
clean=soup.get_text()
tokens=[tok for tok in clean.split()]
#print(tokens[:100])
Freq_dist_nltk=nltk.FreqDist(tokens)
print(Freq_dist_nltk)
for k,v in Freq_dist_nltk.items():
    print(str(k)+':'+str(v))

Output:

<FreqDist with 614 samples and 1117 outcomes>
up:2
==:1
now:3
document.getElementsByTagName('script')[0];:1
Best:1
"url"::1
-:1
core:1
[];:1
Statements:1
ga.src:1
s.parentNode.insertBefore(ga,:1
tkInter,:1
international:1
Trac,:1
Legal:1
Beginner’s:1
While:1
'Apple'),:1
here.:2
FOSDEM:2
Brochure:2
2018-01-09:1
Windows:3
programmers:1
Stories:3
Essays:2
_gaq.push(['_setAccount',:1
Interpretation:1
Up:1
Chat:1
discussed:1
Logo:2
window.jQuery:1
Search:1
List:2
comprehensions:1
processing:1
?:1
b,:1
PyCon:2
'Banana'),:1
<:1
Easy:1
Diversity:3
1.:1
Contributing:1
Top:2
Learn:3
go.:1
User:4
even:1
Notice::1
for?:1
Arts:2
sure:1
programs:1
functions:2
it's:1
control:3
document.location.protocol:1
web2py:1
Government:2
Event:2
-,:1
tools:1
/:4
our:2
Industrial:1
A:2
Smaller:1
Scientific:3
compound:1
knows:1
About:2
Tru64,:1
rendering:1
Security:1
classic:1
Sign:3
have:1
Check:1
Solaris,:1
Copyright:1
quickly,:1
Javascript:2
=:14
Launch:1
Runs:1
987:1
Events:11
an:3
Whet:1
Getting:2
hire:1
Unicode):1
join:1
Mailing:2
Foundation:3
will:1
Roundup:1
Web:1
Hi,:1
learn.:1
Developer's:3
Submit:3
Django,:1
built-in:1
(PEPs)::1
GO:1
Types:1
I'm:2
lists.:1
Girls:1
alpha:1
used:2
Not:1
name):1
math:1
Started:3
Latest:1
Proposals:1
Python!"):1
21:1
to:17
55:1
expected;:1
syntax:2
Documentation:3
Latest::1
other:4
6,:1
}:2
programmers.:1
Engineering:2
limited.:1
Website:1
Talks:2
PSF:4
faster:1
can:3
lists:1
Numeric::1
3.6.4,:1
Implementations:2
377:1
Django:1
n::1
list:2
Data:1
pipeline.:1
—:2
Larger:1
relaunched:1
programming:4
PyGObject,:1
appetite:1
(1,:1
website,:1
Intuitive:1
structure:1
Python.:2
Linux,:1
©2001-2018.:1
argument:1
IPython:1
Forums:2
What:1
a,:2
Kivy,:1
SciPy,:1
Pyramid,:1
for,:1
output:1
For:1
available.:2
own:1
one:1
**:1
"WebSite",:1
//:1
Meetup:1
You’d:1
Skip:1
(with:1
144:1
arithmetic:1
running:1
Conduct:2
turn:1
motion:1
Archive:4
allows:1
['Banana',:1
is:',:1
Light:1
users:1
community-run:1
Contact:1
Compaq:1
speak:1
indentation:1
Issue:1
fruits:1
0:1
course.:1
operators:1
Upcoming:1
straightforward::1
re-code):1
Tracker:1
3.6.4:5
The:5
General:1
fruit:1
systems:1
Welcome:1
fib(n)::1
8:2
all:1
Become:1
versions!:1
release:1
job:1
you:1
'.google-analytics.com/ga.js';:1
daily.:1
languages:1
610:1
document.write('<script:1
manipulated:1
together:1
Quick:1
new:1
Powered:1
product):1
≡:1
print(a,:1
of:17
Calculations:1
Tim:1
language,:1
...:7
2018:2
Defined:1
input('What:1
essential:1
frames:1
2018-02-02:2
end=':1
Fibonacci:1
Flask,:1
use?:1
Lists:4
%s.':1
Please:1
functions.:2
'Lime']:1
development:1
Initiatives:1
standard:1
"potentialAction"::1
types:1
s:1
languages):1
[2,:1
enumerate:1
Donate:1
picture:1
Looking:1
"@type"::2
():1
fib(1000):1
protect,:1
Enhancement:1
in:8
more:2
Jobs:2
+:1
Register:1
Menu:1
Code:2
Hello,:1
thousands:1
experienced:1
document.createElement('script');:1
2018-01-23:1
Practices:1
(and:1
future:1
grouping.:1
by:3
[(0,:1
News:11
0,:1
['BANANA',:1
for…:1
true;:1
day.:1
library,:1
Flow:1
with:7
pick:1
number:2
Ansible,:1
4,:1
Software:6
(function():1
as:2
ga:1
testing.:1
OpenStack:1
easy:2
5.666666666666667:1
machines:1
last:1
find:1
Mac:2
environment:1
production:2
Source:2
print("Hello,:1
Special:2
3:8
'http://www'):1
'):1
fourth:2
keyword:1
docs.python.org:1
▲:3
'LIME']:1
ga.type:1
per:1
System:1
3.:1
Fortenberry:1
Bug:1
Success:3
Awards:2
Input,:1
Development::3
Pandas,:1
3.7.0a4:1
print('Hi,:1
statements:1
arrays:1
name?:1
the:19
"http://schema.org",:1
twists,:1
parentheses:1
Platforms:2
"@context"::1
community:1
This:1
usual:1
growth:1
lets:1
are:5
Books:2
Alternative:2
~800:1
Audio/Visual:2
effectively.:1
way:1
Back:2
89:1
Development:2
'Apple',:1
ILM:1
})();:1
arguments,:2
is::1
Our:1
some:1
Reset:1
"query-input"::1
*:2
Functions:1
four:1
"target"::1
Chelyabinsk:1
Rackspace:1
sliced:1
2:3
place:1
source:1
list(enumerate(fruits)):1
content:2
Privacy:1
def:1
License:2
key,:1
not:1
Simple:2
FAQ:2
very:1
PyPI:1
34:1
interaction:1
n:1
installers:1
Python:60
Conferences:2
float:1
numbers:1
Wiki:2
Guide:6
'Lime')]:1
Facebook:1
Buildbot,:1
||:2
extensible:1
PyQt,:1
::1
>>>:24
experience.:1
assignment:1
Community:7
Pythology:1
numbers::1
#:9
compositing:1
facilitate:1
quickly:1
series:1
IRC:3
advance:1
a+b:1
ga.async:1
arbitrary:1
Administration::1
var:3
2017-12-06:1
_gaq:2
for:11
Salt,:1
or:2
Core:1
PEP:2
Magic:1
Education:2
board:1
and:22
Legon:1
your:4
tens:1
language:2
mission:1
support:1
was:1
>_:1
Python,:1
while:2
Site:1
Applications:2
name?\n'):1
17:2
about:3
2017-12-19:2
that:5
version:1
Thousands:1
available:3
returns:1
jobs.python.org:1
position:1
'APPLE',:1
Interactive:1
_gaq.push(['_trackPageview']);:1
Groups:2
modeling,:1
promote,:1
X,:1
233:1
Merchandise:2
understands.:1
Non-English:2
Network:1
batch:1
3.5.5rc1:1
optional:1
Mentorship:1
product:5
floor:1
Expect:1
on:4
8]:1
Interest:2
%:1
which:1
'text/javascript';:1
print(loud_fruits):1
Socialize:1
0.5:1
Member:1
{:3
capable:1
One-Day:1
candidate:1
Whether:1
full:1
('https:':1
Group:4
Close:1
"https://www.python.org/search/?q={search_term_string}",:1
In:2
developer,:1
fruits]:1
function:1
data:1
loud_fruits:1
Experienced:1
Downloads:2
Status:1
you're:2
expression:1
flow:2
python-dev:1
online.:1
indexed,:1
3::4
Google+:1
(known:1
is:16
src="/static/js/libs/jquery-1.8.2.min.js"><\/script>'):1
Python's:1
Use:1
its:1
3.4.8rc1:1
trying:1
clean:1
use:1
13:1
2018-02-03:3
Policy:1
3.6.4rc1:1
download:1
"SearchAction",:1
Beginner's:2
(2,:1
Conference::1
work:3
+,:1
planned:1
code:4
s);:1
range:1
this:2
overview.:1
loop:1
RSS:1
be:3
All:3
Bottle,:1
Business:2
tutorials:1
Twitter:1
runs:1
beginners:1
defining:2
GUI:1
Speed:1
OS:3
provide:1
learn:1
division:2
IRIX,:1
'https://ssl':1
[fruit.upper():1
if,:1
▼:1
Start:1
diverse:1
Python.org:1
Shell:1
1:5
first:1
name:1
X:2
Download:1
releases:3
384:1
More:9
wxPython:1
Compound:1
Other:2
any:1
b:2
pipeline:1
along:1
related:1
integrate:1
PySide,:1
print('The:1
5:2
print():1
"required:1
Docs:6
&:3
Tornado,:1
Python!:1
name=search_term_string":1
"https://www.python.org/",:1
'UA-39055973-1']);:1
Get:1
simple:2
Help:3
guides,:1
Quotes:2
a:10
Index:2
mandatory:1
2.7.14:1
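
The raw counts above mix case variants, trailing punctuation and symbols such as "=" or ">>>". The sketch below shows one possible clean-up before counting: lowercase each token, strip surrounding punctuation and drop a few stop words. The token list and the tiny stop list are made up for illustration, and collections.Counter is used so the example runs without NLTK; as far as I know, NLTK 3's FreqDist is a Counter subclass, so most_common should work on it in the same way.

```python
from collections import Counter
import string

# A few tokens in the style of the output above (sample made up for illustration).
tokens = ["Python", "Python.", "python", "the", "The", "of", "=", ">>>",
          "Events", "events,", "News", "the"]

# Hypothetical mini stop list; a real one would be much longer.
stop_words = {"the", "of", "and", "to", "is", "a"}

cleaned = []
for tok in tokens:
    word = tok.strip(string.punctuation).lower()   # "Python." -> "python"
    if word and word not in stop_words:            # drops "=", ">>>" and stop words
        cleaned.append(word)

freq = Counter(cleaned)
print(freq.most_common(3))
```
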

Chart: [images figure_1 and timg not preserved]
