Web Scraping with Ghost.py, a WebKit-Based Client [Crawler Example] - Alibaba Cloud Developer Community


2017-11-15


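Introduction:

For those of us who have to scrape web pages every now and then, this is painful.

Modern web development makes heavy use of AJAX, JavaScript and CSS, so important data on some sites is generated dynamically by Ajax or JavaScript and cannot be obtained just by parsing the HTML of the page (for example with urllib2, mechanize, lxml or Beautiful Soup). To crawl such pages, the crawler has to support JavaScript, the DOM and HTML parsing.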
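To make the problem concrete, here is a minimal sketch using only the Python 3 standard library. The page content and the `TextById` and `static_text` names are made up for illustration. The element that should hold the data is empty in the HTML source, because a script fills it in only when a browser runs the page:

```python
from html.parser import HTMLParser

# Hypothetical page: the price is written into the DOM by a script at load time.
PAGE = """
<html><body>
  <div id="price"></div>
  <script>
    document.getElementById('price').textContent = '42.50';
  </script>
</body></html>
"""


class TextById(HTMLParser):
    """Collects the text found inside the element with a given id."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.depth = 0          # > 0 while we are inside the target element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if self.depth:
            self.depth += 1
        elif dict(attrs).get("id") == self.target_id:
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def static_text(html, element_id):
    """Returns the text a static parser sees inside the given element."""
    parser = TextById(element_id)
    parser.feed(html)
    return "".join(parser.chunks).strip()


# The static parser sees an empty element: the value only exists after
# a JavaScript engine has executed the script.
print(repr(static_text(PAGE, "price")))
```

A static parser never runs the script, so the value stays out of reach. This is the gap a WebKit-based client closes.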

For example, monitoring data cannot be retrieved with plain curl or urllib.

The same goes for this Ajax-rendered page, which urllib2 cannot parse directly.

http://rfyiamcool.blog.51cto.com/blog/1030776/1287810

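Common ways of scraping data:

urllib2+urlparse+re

The most primitive approach: urllib2 is Python's HTTP library, urlparse handles URLs, and re is the regular-expression library. It is tedious to write, but also the most "down to earth".

urllib2+BeautifulSoup

The workhorse here is BeautifulSoup, which parses HTML pages very effectively and saves you from writing tedious regular expressions with re yourself.

Mechanize+BeautifulSoup

Mechanize replaces part of urllib2's functionality, so that connections other than HTTP can be opened as well, and it is more dynamically configurable.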
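As a rough sketch of the first approach in the list above, the snippet below extracts links with a regular expression and resolves them with the URL helpers. It is written for Python 3, where urlparse lives in `urllib.parse` and urllib2 became `urllib.request`. The HTML, the base URL and the function names are assumptions made up for illustration, and the regular expression only handles double-quoted `href` attributes:

```python
import re
from urllib.parse import urljoin, urlparse

# Hypothetical page content; in practice it would come from
# urllib.request.urlopen(url).read() (urllib2.urlopen on Python 2).
BASE_URL = "http://example.com/blog/index.html"
HTML = '''
<a href="/post/1">First</a>
<a href="post/2?page=3">Second</a>
<a href="http://other.example.org/x">External</a>
'''

# Deliberately simple: matches double-quoted href attributes in <a> tags only.
HREF_RE = re.compile(r'<a\s+[^>]*href="([^"]+)"', re.IGNORECASE)


def extract_links(html, base_url):
    """Returns absolute URLs for every double-quoted href in an <a> tag."""
    return [urljoin(base_url, href) for href in HREF_RE.findall(html)]


def same_host(links, base_url):
    """Keeps only the links that point at the same host as base_url."""
    host = urlparse(base_url).netloc
    return [link for link in links if urlparse(link).netloc == host]


links = extract_links(HTML, BASE_URL)
print(same_host(links, BASE_URL))
```

The fragility of the regular expression is the point: any change in quoting or attribute order breaks it, which is why a real HTML parser such as BeautifulSoup is usually preferred.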


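In fact, for a page like the one above, if you don't mind the trouble you can dig through the page for the underlying data interfaces. What they return is mostly XML, which you then have to parse painstakingly. It is simply too much hassle.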
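For reference, the "find the interface and parse the XML" route looks roughly like this with the standard library. The payload, its element names and the `parse_metrics` function are hypothetical, since every site exposes its own format:

```python
import xml.etree.ElementTree as ET

# Hypothetical response body from a data interface discovered in the page.
XML_BODY = """<?xml version="1.0"?>
<metrics host="web01">
  <metric name="cpu" unit="%">37.5</metric>
  <metric name="mem" unit="%">62.0</metric>
</metrics>
"""


def parse_metrics(xml_text):
    """Returns {metric name: float value} for the payload format above."""
    root = ET.fromstring(xml_text)
    return {m.get("name"): float(m.text) for m in root.iter("metric")}


print(parse_metrics(XML_BODY))
```

The parsing itself is short. The tedious part is discovering each interface and its format by hand, page by page.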

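This is where a web client built on the WebKit engine comes in. It parses the page the way a real browser does.

WebKit: Safari, Google Chrome, Maxthon 3, 360 Browser and others were all built on the WebKit engine when this was written (Chrome has since moved to Blink, a fork of WebKit).

We usually fetch the values from a terminal, and there are quite a few ready-made tools for that:

PyV8, PythonWebKit, Selenium, PhantomJS, Ghost.py and so on.

I recommend Ghost.py here because it is direct and practical.

There is very little material on WebKit in Chinese and even less on Ghost.py, so I will give a brief translation based on the official documentation.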


A small example to get a feel for Ghost:

 
    from ghost import Ghost

    ghost = Ghost()
    page, extra_resources = ghost.open("http://xiaorui.cc")
    assert page.http_status == 200 and 'xiaorui' in ghost.content

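Installing Ghost.py and its dependencies

To use WebKit we need PyQt or PySide.

Once those are installed, we can begin.

If you are lucky, a plain pip install Ghost.py is enough.

If you are not:

You will run into plenty of annoying problems along the way, so search around for answers.

If you still cannot solve them, leave a reply.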

 
    # SIP (required to build PyQt4)
    wget http://sourceforge.net/projects/pyqt/files/sip/sip-4.14.6/sip-4.14.6.tar.gz
    tar zxvf sip-4.14.6.tar.gz
    cd sip-4.14.6
    python configure.py
    make
    sudo make install

    # PyQt4 (Mac GPL source package)
    wget http://sourceforge.net/projects/pyqt/files/PyQt4/PyQt-4.10.1/PyQt-mac-gpl-4.10.1.tar.gz
    tar zxvf PyQt-mac-gpl-4.10.1.tar.gz
    cd PyQt-mac-gpl-4.10.1
    python configure.py
    make
    sudo make install

    # PySide (Mac installer package)
    wget http://pyside.markus-ullmann.de/pyside-1.1.1-qt48-py27apple.pkg
    open pyside-1.1.1-qt48-py27apple.pkg

    # Flask (used in the unit-test example)
    git clone https://github.com/mitsuhiko/flask.git
    cd flask
    sudo python setup.py install

    # Ghost.py itself
    git clone git://github.com/carrerasrodrigo/Ghost.py.git
    cd Ghost.py
    sudo python setup.py install

Create an instance:

    from ghost import Ghost

    ghost = Ghost()

Open a page:

    page, resources = ghost.open('http://my.web.page')

Execute JavaScript code in the page:

    result, resources = ghost.evaluate(
        "document.getElementById('my-input').getAttribute('value');")

Simulate a click event:

    page, resources = ghost.evaluate(
        "document.getElementById('link').click();", expect_loading=True)

Set the value of a form field with Ghost.set_field_value(selector, value, blur=True, expect_loading=False):

    result, resources = ghost.set_field_value("input[name=username]", "jeanphix")

If you set the optional parameter `blur` to False, focus is left on the field (useful for autocomplete tests).

To fill a file input field, simply pass the file path as `value`.

You can fill in a whole form with Ghost.fill(selector, values, expect_loading=False):

    result, resources = ghost.fill("form", {
        "username": "jeanphix",
        "password": "mypassword"
    })

Submit the form:

    page, resources = ghost.fire_on("form", "submit", expect_loading=True)

Some more advanced and very handy methods:

wait_for_page_loaded()

Waits until a new page is loaded.

    page, resources = ghost.wait_for_page_loaded()

This waits until the page has finished loading, similar to jQuery's $(document).ready(function() { ... }).

wait_for_selector(selector)

Waits until an element matches the given selector.

    result, resources = ghost.wait_for_selector("ul.results")

This waits for the DOM element you specify to appear.

wait_for_text(text)

Waits until the given text exists inside the frame.

    result, resources = ghost.wait_for_text("My result")

This waits for the text we want to appear.
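Conceptually, each of these wait methods polls a condition until it holds or a timeout expires. The sketch below shows that general pattern with the standard library only. `wait_for`, `WaitTimeout` and `text_present` are hypothetical names and this is not Ghost.py's actual implementation:

```python
import time


class WaitTimeout(Exception):
    """Raised when the condition does not become true in time."""


def wait_for(condition, timeout=5.0, interval=0.05):
    """Calls condition() until it returns a truthy value or timeout expires.

    Returns the truthy value. Hypothetical helper for illustration only;
    it is not part of Ghost.py.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise WaitTimeout("condition not met within %.1f s" % timeout)
        time.sleep(interval)


# Usage: simulate content that only "appears" on the third poll.
state = {"polls": 0}


def text_present():
    state["polls"] += 1
    return "My result" if state["polls"] >= 3 else None


found = wait_for(text_present, timeout=1.0, interval=0.01)
print(found, state["polls"])
```

With a real browser engine the condition would inspect the live DOM, for example whether a selector matches or whether some text is present in the frame.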

The official site has a Flask example:

You can unit test an application with Ghost.py and unittest:

 
    import unittest

    from flask import Flask
    from ghost import GhostTestCase

    app = Flask(__name__)

    @app.route('/')
    def home():
        return 'hello world'

    class MyTest(GhostTestCase):
        port = 5000

        @classmethod
        def create_app(cls):
            return app

        def test_open_home(self):
            self.ghost.open("http://localhost:%s/" % self.port)
            self.assertEqual(self.ghost.content, 'hello world')

    if __name__ == '__main__':
        unittest.main()

A complete small demo:

    from ghost import Ghost

    ghost = Ghost()

    # Opens the web page
    ghost.open('http://www.openstreetmap.org/')

    # Waits for the search form field
    ghost.wait_for_selector('input[name=query]')

    # Fills the form
    ghost.fill("#search_form", {'query': 'France'})

    # Submits the form
    ghost.fire_on("#search_form", "submit")

    # Waits for results (an XHR has been called here)
    ghost.wait_for_selector(
        '#search_osm_nominatim .search_results_entry a')

    # Clicks the first result link
    ghost.click(
        '#search_osm_nominatim .search_results_entry:first-child a')

    # Checks if the map has moved to the expected latitude
    lat, resources = ghost.evaluate("map.center.lat")
    assert float(lat.toString()) == 5860090.806537

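Now for a real example.

Let's simulate a browser going to Baidu, searching for xiaorui.cc, and then look at the content and the HTTP headers:

The steps in the terminal:

The resulting URL is: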

http://www.baidu.com/s?wd=xiaorui.cc&rsv_bp=0&ch=&tn=baidu&bar=&rsv_spt=3&ie=utf-8
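To see what the form submission actually sent, the URL above can be taken apart with the standard library. This is a small Python 3 sketch; the variable names are arbitrary:

```python
from urllib.parse import parse_qs, urlparse

URL = ("http://www.baidu.com/s?wd=xiaorui.cc&rsv_bp=0&ch=&tn=baidu"
       "&bar=&rsv_spt=3&ie=utf-8")

parts = urlparse(URL)
# keep_blank_values=True keeps empty parameters such as ch= and bar=.
params = parse_qs(parts.query, keep_blank_values=True)

print(parts.netloc, parts.path)
print(params["wd"], params["ie"])
```

The `wd` parameter carries the search term, and `ie` declares the input encoding.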

Let's visit it and look at its HTTP headers:

    In [10]: print page.headers
    {u'BDQID': u'0xf594a31a03344b4f', u'Content-Encoding': u'gzip',
     u'Set-Cookie': u'BDSVRTM=381; path=/\nH_PS_PSSID=2976_2981_3091; path=/; domain=.baidu.com',
     u'BDUSERID': u'0', u'Server': u'BWS/1.0', u'Connection': u'Keep-Alive',
     u'Cache-Control': u'private', u'Date': u'Tue, 03 Sep 2013 09:53:56 GMT',
     u'Content-Type': u'text/html;charset=utf-8', u'BDPAGETYPE': u'3'}
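Note that the `Set-Cookie` entry in this output holds two cookies joined by a newline. A minimal Python 3 sketch for splitting them back into individual cookies with the standard library follows. The value is copied from the output above, and the `cookies` dictionary layout is just one possible choice:

```python
from http.cookies import SimpleCookie

# The Set-Cookie value from the headers above: two cookies joined by "\n".
raw = ("BDSVRTM=381; path=/\n"
       "H_PS_PSSID=2976_2981_3091; path=/; domain=.baidu.com")

cookies = {}
for line in raw.split("\n"):
    jar = SimpleCookie()
    jar.load(line)  # parses one "name=value; attr=..." line
    for name, morsel in jar.items():
        cookies[name] = {
            "value": morsel.value,
            "path": morsel["path"],
            "domain": morsel["domain"],  # empty string when not set
        }

for name, info in sorted(cookies.items()):
    print(name, info)
```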
        
 
     

    
  

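Its content:

That's it for now. For more detailed features, see the official site.

This article is reposted from rfyiamcool's 51CTO blog. Original link: http://blog.51cto.com/rfyiamcool/1287810. To repost it, please contact the original author.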
