【Python爬虫6】表单交互

  1. 云栖社区>
  2. 博客>
  3. 正文

【Python爬虫6】表单交互

wu_being 2017-02-17 13:13:36 浏览797
展开阅读全文

严格来说,本篇表单交互和下一篇验证码处理不算是网络爬虫,而是广义上的网络机器人。使用网络机器人可以减少提取数据时需要表单交互的一道门槛。

1.手工处理发送POST请求提交登录表单

我们先在示例网站手工注册一个账号,注册这个账号需要验证码,下一篇会介绍处理验证码问题。

1.1分析表单内容

我们在登录网址http://127.0.0.1:8000/places/default/user/login 获得如下表单。在下面登录表单中包括几个重要的组成部分:
- form标签的action属性:用于设置表单数据提交的地址,本例中为#,也就是和登录表单同一个URL;
- form标签的enctype属性:用于设置数据提交的编码,本例中为application/x-www-form-urlencoded,表示所有非字母数字的字符都需要转换为十六进制的ASCII值;上传二进程文件最好用multipart/form-data编码类型,这种编码不会对输入进行编码从而不会影响效率,而是使用MIME协议将其作为多个部分进行发送,和邮件的传输标准相同。文档:http://www.w3.org/TR/html5/forms.html#selecting-a-form-submission-encoding
- form标签的method属性:本例中post表示通过请求体向服务器提交表单数据;
- imput标签的name属性:用于设定提交到服务器端时某个域的名称。

<form action="#" enctype="application/x-www-form-urlencoded" method="post">
    <table>
        <tr id="auth_user_email__row">
            <td class="w2p_fl"><label class="" for="auth_user_email" id="auth_user_email__label">E-mail: </label></td>
            <td class="w2p_fw"><input class="string" id="auth_user_email" name="email" type="text" value="" /></td>
            <td class="w2p_fc"></td>
        </tr>
        <tr id="auth_user_password__row">
            <td class="w2p_fl"><label class="" for="auth_user_password" id="auth_user_password__label">Password: </label></td>
            <td class="w2p_fw"><input class="password" id="auth_user_password" name="password" type="password" value="" /></td>
            <td class="w2p_fc"></td>
        </tr>
        <tr id="auth_user_remember_me__row">
            <td class="w2p_fl"><label class="" for="auth_user_remember_me" id="auth_user_remember_me__label">Remember me (for 30 days): </label></td>
            <td class="w2p_fw"><input class="boolean" id="auth_user_remember_me" name="remember_me" type="checkbox" value="on" /></td>
            <td class="w2p_fc"></td>
        </tr>
        <tr id="submit_record__row">
            <td class="w2p_fl"></td><td class="w2p_fw">
                <input type="submit" value="Log In" />
                <button class="btn w2p-form-button" onclick="window.location=&#x27;/places/default/user/register&#x27;;return false">Register</button>
            </td>
            <td class="w2p_fc"></td>
        </tr>
    </table>
    <div style="display:none;">
        <input name="_next" type="hidden" value="/places/default/index" />
        <input name="_formkey" type="hidden" value="7b1add4b-fa91-4301-975e-b6fbf7def3ac" />
        <input name="_formname" type="hidden" value="login" />
    </div>
</form>

1.2手工测试post请求提交表单

如果登录成功则跳到主页,否则回到登录页。下面是尝试自动登录的初始版本代码。显然登录失败!

>>> import urllib,urllib2
>>> LOGIN_URL='http://127.0.0.1:8000/places/default/user/login'
>>> LOGIN_EMAIL='1040003585@qq.com'
>>> LOGIN_PASSWORD='wu.com'
>>> data={'email':LOGIN_EMAIL,'password':LOGIN_PASSWORD}
>>> encoded_data=urllib.urlencode(data)
>>> request=urllib2.Request(LOGIN_URL,encoded_data)
>>> response=urllib2.urlopen(request)
>>> response.geturl()
'http://127.0.0.1:8000/places/default/user/login'
>>> 

因为登录时还需要添加隐藏的_formkey属性,这个唯一的ID用来避免表单多次提交。每次加载网页时,都会产生不同的ID,然后服务器端就可以通过这个给定的ID来判断表单是否已经通过提交过。下面是获得该属性值:

>>> 
>>> import lxml.html
>>> def parse_form(html):
...     tree=lxml.html.fromstring(html)
...     data={}
...     for e in tree.cssselect('form input'):
...             if e.get('name'):
...                     data[e.get('name')]=e.get('value')
...     return data
... 
>>> import pprint
>>> html=urllib2.urlopen(LOGIN_URL).read()
>>> form=parse_form(html)
>>> pprint.pprint(form)
{'_formkey': '437e4660-0c44-4187-af8d-36487c62ffce',
 '_formname': 'login',
 '_next': '/places/default/index',
 'email': '',
 'password': '',
 'remember_me': 'on'}
>>> 

下面是通过_formkey和其他隐藏域的新版本自动登录代码。发现还是不成功!

>>> 
>>> html=urllib2.urlopen(LOGIN_URL).read()
>>> data=parse_form(html)
>>> data['email']=LOGIN_EMAIL
>>> data['password']=LOGIN_PASSWORD
>>> encoded_data=urllib.urlencode(data)
>>> request=urllib2.Request(LOGIN_URL,encoded_data)
>>> response=urllib2.urlopen(request)
>>> response.geturl()
'http://127.0.0.1:8000/places/default/user/login'
>>> 

因为我们缺失了一个重要的组成部分——cookie。当普通用户加载登录表单时,_formkey的值将会保存在cookie中,然后该值会与提交的登录表单数据中的_formkey的值进行对比。下面是使用urllib2.HTTPCookieProcessor类增加了cookie支持之后的代码。最后登录成功了!

>>> 
>>> import cookielib
>>> cj=cookielib.CookieJar()
>>> opener=urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
>>> 
>>> html=opener.open(LOGIN_URL).read()      #opener
>>> data=parse_form(html)
>>> data['email']=LOGIN_EMAIL
>>> data['password']=LOGIN_PASSWORD
>>> encoded_data=urllib.urlencode(data)
>>> request=urllib2.Request(LOGIN_URL,encoded_data)
>>> response=opener.open(request)       #opener
>>> response.geturl()
'http://127.0.0.1:8000/places/default/index'
>>> 

1.3手工处理post请求登录的完整源代码:

# -*- coding: utf-8 -*-
import urllib
import urllib2
import cookielib
import lxml.html

LOGIN_EMAIL = '1040003585@qq.com'
LOGIN_PASSWORD = 'wu.com'
#LOGIN_URL = 'http://example.webscraping.com/user/login'
LOGIN_URL = 'http://127.0.0.1:8000/places/default/user/login'


def login_basic():
    """fails because not using formkey
    """
    data = {'email': LOGIN_EMAIL, 'password': LOGIN_PASSWORD}
    encoded_data = urllib.urlencode(data)
    request = urllib2.Request(LOGIN_URL, encoded_data)
    response = urllib2.urlopen(request)
    print response.geturl()

def login_formkey():
    """fails because not using cookies to match formkey
    """
    html = urllib2.urlopen(LOGIN_URL).read()
    data = parse_form(html)
    data['email'] = LOGIN_EMAIL
    data['password'] = LOGIN_PASSWORD
    encoded_data = urllib.urlencode(data)
    request = urllib2.Request(LOGIN_URL, encoded_data)
    response = urllib2.urlopen(request)
    print response.geturl()

def login_cookies():
    """working login
    """
    cj = cookielib.CookieJar()
    opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
    html = opener.open(LOGIN_URL).read()
    data = parse_form(html)
    data['email'] = LOGIN_EMAIL
    data['password'] = LOGIN_PASSWORD
    encoded_data = urllib.urlencode(data)
    request = urllib2.Request(LOGIN_URL, encoded_data)
    response = opener.open(request)
    print response.geturl()
    return opener

def parse_form(html):
    """extract all input properties from the form
    """
    tree = lxml.html.fromstring(html)
    data = {}
    for e in tree.cssselect('form input'):
        if e.get('name'):
            data[e.get('name')] = e.get('value')
    return data

def main():
    #login_basic()
    #login_formkey()
    login_cookies()

if __name__ == '__main__':
    main()

2.从FF浏览器加载cookie登录网站

我们先用手工执行登录,我们先在FF浏览器用手工执行登录,然后关闭FF浏览器,然后用python脚本复用之前得到的cookie,从而实现自动登录。

2.1session文件位置

FireFox在sqlist数据库中存储cookie,在json文件中存储session,这两种存储方式都可以直接通过Python获取。对于登录操作而言,我们只需要获致session即可。对于不同的操作系统,FireFox存储的session文件的位置不同:
- Linux系统:~/.mozilla/firefox/*.default/sessionstore.js
- OS X系统:~/Library/Application Support/Firefox/Profiles/*.default/sessionstore.js
- Windows Vista及以上版本系统:%APPDATA%/Roaming/Mozilla/Firefox/Profiles/*.default/sessionstore.js

下面是返回session文件路径的辅助函数代码:

def find_ff_sessions():
    paths = [
        '~/.mozilla/firefox/*.default',
        '~/Library/Application Support/Firefox/Profiles/*.default',
        '%APPDATA%/Roaming/Mozilla/Firefox/Profiles/*.default'
    ]
    for path in paths:
        filename = os.path.join(path, 'sessionstore.js')
        matches = glob.glob(os.path.expanduser(filename))
        if matches:
            return matches[0]

注:glob模块会返回指定路径中所有匹配的文件。

2.2FF浏览器cookie内容

下面是Linux系统火狐浏览器session文件内容:

wu_being@ubuntukylin64:~/.mozilla/firefox/78n340f7.default$ ls
addons.json           datareporting       key3.db             prefs.js                      storage
blocklist.xml         extensions          logins.json         revocations.txt               storage.sqlite
bookmarkbackups       extensions.ini      mimeTypes.rdf       saved-telemetry-pings         times.json
cert8.db              extensions.json     minidumps           search.json.mozlz4            webapps
compatibility.ini     features            permissions.sqlite  secmod.db                     webappsstore.sqlite
containers.json       formhistory.sqlite  places.sqlite       sessionCheckpoints.json       xulstore.json
content-prefs.sqlite  gmp                 places.sqlite-shm   sessionstore-backups
cookies.sqlite        gmp-gmpopenh264     places.sqlite-wal   sessionstore.js
crashes               healthreport        pluginreg.dat       SiteSecurityServiceState.txt
wu_being@ubuntukylin64:~/.mozilla/firefox/78n340f7.default$ more sessionstore.js 
{"version":["sessionrestore",1],
"windows":[{
    ...
    "cookies":[
        {"host":"127.0.0.1",
        "value":"127.0.0.1-aabe0222-d083-44ee-94c8-e9343eefb2e5",
        "path":"/",
        "name":"session_id_welcome",
        "httponly":true,
        "originAttributes":{"addonId":"","appId":0,"inIsolatedMozBrowser":false,"privateBrowsingId":0,"signedPkg":"","userContextId":0}},
        {"host":"127.0.0.1",
        "value":"True",
        "path":"/",
        "name":"session_id_places",
        "httponly":true,
        "originAttributes":{"addonId":"","appId":0,"inIsolatedMozBrowser":false,"privateBrowsingId":0,"signedPkg":"","userContextId":0}},
        {"host":"127.0.0.1",
        "value":"\":oJoAPvH-ODMFDXwk3U...su0Dxr7doAgu9yQiSEmgQiSy98Ga7C6K2tIQoZwzY0_4wBO0qHm-FlcBf-cPRk7GPAhix8yS4roOVIvMqP5I7ZB_uIA==\"",
        "path":"/",
        "name":"session_data_places",
        "originAttributes":{"addonId":"","appId":0,"inIsolatedMozBrowser":false,"privateBrowsingId":0,"signedPkg":"","userContextId":0}}
    ],
    "title":"Example web scraping website",
    "_shouldRestore":true,
    "closedAt":1485228738310
}],
"selectedWindow":0,
"_closedWindows":[],
"session":{"lastUpdate":1485228738927,"startTime":1485226675190,"recentCrashes":0},
"global":{}
}

wu_being@ubuntukylin64:~/.mozilla/firefox/78n340f7.default$ 

根据seesion存储结构,我们用下面代码把session解析到CookieJar对象中。

def load_ff_sessions(session_filename):
    cj = cookielib.CookieJar()
    if os.path.exists(session_filename):  
        try: 
            json_data = json.loads(open(session_filename, 'rb').read())
        except ValueError as e:
            print 'Error parsing session JSON:', str(e)
        else:
            for window in json_data.get('windows', []):
                for cookie in window.get('cookies', []):
                    import pprint; pprint.pprint(cookie)
                    c = cookielib.Cookie(0, cookie.get('name', ''), cookie.get('value', ''), 
                        None, False, 
                        cookie.get('host', ''), cookie.get('host', '').startswith('.'), cookie.get('host', '').startswith('.'), 
                        cookie.get('path', ''), False,
                        False, str(int(time.time()) + 3600 * 24 * 7), False, 
                        None, None, {})
                    cj.set_cookie(c)
    else:
        print 'Session filename does not exist:', session_filename
    return cj

2.3使用cookie测试加载登录

session_filename = find_ff_sessions()
cj = load_ff_sessions(session_filename)
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
html = opener.open(COUNTRY_URL).read()

tree = lxml.html.fromstring(html)
print tree.cssselect('ul#navbar li a')[0].text_content()

如果得到的结果是Login则说明没能正确加载。如果出现这样情况,你就需要确认一下FireFox中是否已经成功登录救命网站。如果得到下面结果,有Welcome 用户的first name,则登录表示成功。

wu_being@ubuntukylin64:~/GitHub/WebScrapingWithPython/6.表单交互$ python 2login_firefox.py 
{u'host': u'127.0.0.1',
 u'httponly': True,
 u'name': u'session_id_welcome',
 u'originAttributes': {u'addonId': u'',
                       u'appId': 0,
                       u'inIsolatedMozBrowser': False,
                       u'privateBrowsingId': 0,
                       u'signedPkg': u'',
                       u'userContextId': 0},
 u'path': u'/',
 u'value': u'127.0.0.1-406df419-ed33-4de5-bc46-cd2d9f3c431b'}
Log In
wu_being@ubuntukylin64:~/GitHub/WebScrapingWithPython/6.表单交互$ 
wu_being@ubuntukylin64:~/GitHub/WebScrapingWithPython/6.表单交互$ python 2login_firefox.py 
{u'host': u'127.0.0.1',
 u'httponly': True,
 u'name': u'session_id_welcome',
 u'originAttributes': {u'addonId': u'',
                       u'appId': 0,
                       u'inIsolatedMozBrowser': False,
                       u'privateBrowsingId': 0,
                       u'signedPkg': u'',
                       u'userContextId': 0},
 u'path': u'/',
 u'value': u'127.0.0.1-aabe0222-d083-44ee-94c8-e9343eefb2e5'}
{u'host': u'127.0.0.1',
 u'httponly': True,
 u'name': u'session_id_places',
 u'originAttributes': {u'addonId': u'',
                       u'appId': 0,
                       u'inIsolatedMozBrowser': False,
                       u'privateBrowsingId': 0,
                       u'signedPkg': u'',
                       u'userContextId': 0},
 u'path': u'/',
 u'value': u'True'}
{u'host': u'127.0.0.1',
 u'name': u'session_data_places',
 u'originAttributes': {u'addonId': u'',
                       u'appId': 0,
                       u'inIsolatedMozBrowser': False,
                       u'privateBrowsingId': 0,
                       u'signedPkg': u'',
                       u'userContextId': 0},
 u'path': u'/',
 u'value': u'"ef34329782d4efe136522cb44fc4bd21:oJoAPvH-ODM...QiSy98Ga7C6K2tIQoZwzY0_4wBO0qHm-FlcBf-cPRk7GPAhix8yS4roOVIvMqP5I7ZB_uIA=="'}
Welcome Wu
wu_being@ubuntukylin64:~/GitHub/WebScrapingWithPython/6.表单交互$ 

如果你想从其他浏览器的cookie,可以使用browsercookie模块。通过pip install browsercookie命令进行安装,文档:https://pypi.python.org/pypi/browsercookie

2.4使用cookie登录源代码

# -*- coding: utf-8 -*-
import urllib2
import glob
import os
import cookielib
import json
import time
import lxml.html

COUNTRY_URL = 'http://127.0.0.1:8000/places/default/edit/China-47'

def login_firefox():
    """load cookies from firefox
    """
    session_filename = find_ff_sessions()
    cj = load_ff_sessions(session_filename)
    opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
    html = opener.open(COUNTRY_URL).read()

    tree = lxml.html.fromstring(html)
    print tree.cssselect('ul#navbar li a')[0].text_content()
    return opener

def load_ff_sessions(session_filename):
    cj = cookielib.CookieJar()
    if os.path.exists(session_filename):  
        try: 
            json_data = json.loads(open(session_filename, 'rb').read())
        except ValueError as e:
            print 'Error parsing session JSON:', str(e)
        else:
            for window in json_data.get('windows', []):
                for cookie in window.get('cookies', []):
                    import pprint; pprint.pprint(cookie)
                    c = cookielib.Cookie(0, cookie.get('name', ''), cookie.get('value', ''), 
                        None, False, 
                        cookie.get('host', ''), cookie.get('host', '').startswith('.'), cookie.get('host', '').startswith('.'), 
                        cookie.get('path', ''), False,
                        False, str(int(time.time()) + 3600 * 24 * 7), False, 
                        None, None, {})
                    cj.set_cookie(c)
    else:
        print 'Session filename does not exist:', session_filename
    return cj

def find_ff_sessions():
    paths = [
        '~/.mozilla/firefox/*.default',
        '~/Library/Application Support/Firefox/Profiles/*.default',
        '%APPDATA%/Roaming/Mozilla/Firefox/Profiles/*.default'
    ]
    for path in paths:
        filename = os.path.join(path, 'sessionstore.js')
        matches = glob.glob(os.path.expanduser(filename))
        if matches:
            return matches[0]

def main():
    login_firefox()

if __name__ == '__main__':
    main()

3.使用高级模块Mechanize自动化处理表单提交

使用Mechanize模块可以简化表单提交,先如下安装:pip install mechanize

3.1用高级模块Mechanize自动化处理表单提交并支持登录后网页内容更新

# -*- coding: utf-8 -*-
import mechanize
import login

#COUNTRY_URL = 'http://example.webscraping.com/edit/United-Kingdom-239'
COUNTRY_URL = 'http://127.0.0.1:8000/places/default/edit/China-47'

def mechanize_edit():
    """Use mechanize to increment population
    """
    # login
    br = mechanize.Browser()
    br.open(login.LOGIN_URL)
    br.select_form(nr=0)
    print br.form
    br['email'] = login.LOGIN_EMAIL
    br['password'] = login.LOGIN_PASSWORD
    response = br.submit()

    # edit country
    br.open(COUNTRY_URL)
    br.select_form(nr=0)
    print 'Population before:', br['population']
    br['population'] = str(int(br['population']) + 1)
    br.submit()

    # check population increased
    br.open(COUNTRY_URL)
    br.select_form(nr=0)
    print 'Population after:', br['population']

if __name__ == '__main__':
    mechanize_edit()
wu_being@ubuntukylin64:~/GitHub/WebScrapingWithPython/6.表单交互$ python 3mechanize_edit.py 
<POST http://127.0.0.1:8000/places/default/user/login# application/x-www-form-urlencoded
  <TextControl(email=)>
  <PasswordControl(password=)>
  <CheckboxControl(remember_me=[on])>
  <SubmitControl(<None>=Log In) (readonly)>
  <SubmitButtonControl(<None>=) (readonly)>
  <HiddenControl(_next=/places/default/index) (readonly)>
  <HiddenControl(_formkey=72282515-8f0d-4af1-9500-f7ac6f0526a4) (readonly)>
  <HiddenControl(_formname=login) (readonly)>>
Population before: 1330044000
Population after: 1330044001
wu_being@ubuntukylin64:~/GitHub/WebScrapingWithPython/6.表单交互$ 

文档:http://www.search.sourceforge.net/mechanize/

3.2用普通方法支持登录后网页内容更新

# -*- coding: utf-8 -*-
import urllib
import urllib2
import login

#COUNTRY_URL = 'http://example.webscraping.com/edit/United-Kingdom-239'
COUNTRY_URL = 'http://127.0.0.1:8000/places/default/edit/China-47'

def edit_country():
    opener = login.login_cookies()
    country_html = opener.open(COUNTRY_URL).read()
    data = login.parse_form(country_html)
    import pprint; pprint.pprint(data)
    print 'Population before: ' + data['population']
    data['population'] = int(data['population']) + 1
    encoded_data = urllib.urlencode(data)
    request = urllib2.Request(COUNTRY_URL, encoded_data)
    response = opener.open(request)

    country_html = opener.open(COUNTRY_URL).read()
    data = login.parse_form(country_html)
    print 'Population after:', data['population']

if __name__ == '__main__':
    edit_country()
wu_being@ubuntukylin64:~/GitHub/WebScrapingWithPython/6.表单交互$ python 3edit_country.py 
http://127.0.0.1:8000/places/default/index
{'_formkey': '3773506a-ef5e-4c4a-871d-084cb8451659',
 '_formname': 'places/5087',
 'area': '9596960.00',
 'capital': 'Beijing',
 'continent': 'AS',
 'country': 'China',
 'currency_code': 'CNY',
 'currency_name': 'Yuan Renminbi',
 'id': '5087',
 'iso': 'CN',
 'languages': 'zh-CN,yue,wuu,dta,ug,za',
 'neighbours': 'LA,BT,TJ,KZ,MN,AF,NP,MM,KG,PK,KP,RU,VN,IN',
 'phone': '86',
 'population': '1330044001',
 'postal_code_format': '######',
 'postal_code_regex': '^(\\d{6})$',
 'tld': '.cn'}
Population before: 1330044001
Population after: 1330044002
wu_being@ubuntukylin64:~/GitHub/WebScrapingWithPython/6.表单交互$ 

Wu_Being 博客声明:本人博客欢迎转载,请标明博客原文和原链接!谢谢!
【Python爬虫系列】《【Python爬虫6】表单交互》http://blog.csdn.net/u014134180/article/details/55507020
Python爬虫系列的GitHub代码文件https://github.com/1040003585/WebScrapingWithPython

Wu_Being 吴兵博客接受赞助费二维码

如果你看完这篇博文,觉得对你有帮助,并且愿意付赞助费,那么我会更有动力写下去。

网友评论

登录后评论
0/500
评论
wu_being
+ 关注