Scraping weather forecasts with Scrapy and storing them in MySQL

2017-11-14

Create the Scrapy project:

 
        scrapy startproject weather2

Define the Items (items.py):

 
        import scrapy


        class Weather2Item(scrapy.Item):
            # define the fields for your item here like:
            # name = scrapy.Field()
            weatherDate = scrapy.Field()
            weatherDate2 = scrapy.Field()
            weatherWea = scrapy.Field()
            weatherTem1 = scrapy.Field()
            weatherTem2 = scrapy.Field()
            weatherWin = scrapy.Field()

Write the Spider (spiders/weatherSpider.py):

 
        import scrapy
        from weather2.items import Weather2Item


        class CatchWeatherSpider(scrapy.Spider):
            name = 'CatchWeather2'
            allowed_domains = ['weather.com.cn']
            start_urls = [
                "http://www.weather.com.cn/weather/101280101.shtml"
            ]

            def parse(self, response):
                for sel in response.xpath('//*[@id="7d"]/ul/li'):
                    item = Weather2Item()
                    item['weatherDate'] = sel.xpath('h1/text()').extract()
                    item['weatherDate2'] = sel.xpath('h2/text()').extract()
                    item['weatherWea'] = sel.xpath('p[@class="wea"]/text()').extract()
                    item['weatherTem1'] = sel.xpath('p[@class="tem tem1"]/span/text()').extract() + sel.xpath('p[@class="tem tem1"]/i/text()').extract()
                    item['weatherTem2'] = sel.xpath('p[@class="tem tem2"]/span/text()').extract() + sel.xpath('p[@class="tem tem2"]/i/text()').extract()
                    item['weatherWin'] = sel.xpath('p[@class="win"]/i/text()').extract()
                    yield item

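name: the name of the spider.
allowed_domains: the base domains the spider is allowed to crawl.
start_urls: the list of URLs where the spider starts. It downloads the pages at these URLs, and any follow-up URLs are extracted from the downloaded data.

The data comes from http://www.weather.com.cn/weather/101280101.shtml, where 101280101 is the city code for Guangzhou.

XPath is used here to pick the fields out of the HTML, and it turns out to be quite simple.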
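To make the XPath step more concrete, here is a minimal sketch that runs the same kind of path expressions on a small page fragment. It uses the standard library's xml.etree.ElementTree in place of Scrapy's selectors so it runs without a crawl; ElementTree supports only a subset of XPath, so text is read from `.text` rather than `text()`. The fragment, its placeholder values, and the helper names `texts` and `parse_forecast` are assumptions made for illustration, not the real page or part of the original project.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified fragment shaped like the list the spider iterates over.
# All values are invented placeholders, not real forecast data.
SAMPLE_PAGE = """
<html><body>
<div id="7d">
  <ul>
    <li>
      <h1>day-1</h1>
      <h2>label-1</h2>
      <p class="wea">cloudy</p>
      <p class="tem tem1"><span>25</span><i>C</i></p>
      <p class="tem tem2"><span>18</span><i>C</i></p>
      <p class="win"><i>light breeze</i></p>
    </li>
    <li>
      <h1>day-2</h1>
      <h2>label-2</h2>
      <p class="wea">rain</p>
      <p class="tem tem1"><span>22</span><i>C</i></p>
      <p class="tem tem2"><span>17</span><i>C</i></p>
      <p class="win"><i>gentle breeze</i></p>
    </li>
  </ul>
</div>
</body></html>
"""


def texts(node, path):
    """Return the text of every element matching path, as a list (like extract())."""
    return [el.text for el in node.findall(path) if el.text]


def parse_forecast(page):
    """Collect one dict per <li> under the element whose id is "7d"."""
    root = ET.fromstring(page.strip())
    rows = []
    for li in root.findall(".//*[@id='7d']/ul/li"):
        rows.append({
            "weatherDate": texts(li, "h1"),
            "weatherDate2": texts(li, "h2"),
            "weatherWea": texts(li, "p[@class='wea']"),
            # Two lists concatenated, as in the spider: number first, then unit.
            "weatherTem1": texts(li, "p[@class='tem tem1']/span") + texts(li, "p[@class='tem tem1']/i"),
            "weatherTem2": texts(li, "p[@class='tem tem2']/span") + texts(li, "p[@class='tem tem2']/i"),
            "weatherWin": texts(li, "p[@class='win']/i"),
        })
    return rows


if __name__ == "__main__":
    for row in parse_forecast(SAMPLE_PAGE):
        print(row)
```

Every field ends up as a list of strings, which is why the pipeline later reads the first element of each field.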

Test run:

 
        scrapy crawl CatchWeather2

The crawl output shows that we now have the data we wanted.

Create the database table:

 
        CREATE TABLE `yunweiApp_weather` (
          `id` int(11) NOT NULL AUTO_INCREMENT,
          `weatherDate` varchar(10) DEFAULT NULL,
          `weatherDate2` varchar(10) NOT NULL,
          `weatherWea` varchar(10) NOT NULL,
          `weatherTem1` varchar(10) NOT NULL,
          `weatherTem2` varchar(10) NOT NULL,
          `weatherWin` varchar(10) NOT NULL,
          `updateTime` datetime NOT NULL,
          PRIMARY KEY (`id`)
        ) ENGINE=InnoDB AUTO_INCREMENT=15 DEFAULT CHARSET=utf8;

Create the pipeline (pipelines.py):

 
        import datetime

        import MySQLdb

        DEBUG = True

        if DEBUG:
            dbuser = 'lihuipeng'
            dbpass = 'lihuipeng'
            dbname = 'game_main'
            dbhost = '192.168.1.100'
            dbport = 3306  # MySQLdb expects the port as an integer
        else:
            dbuser = 'root'
            dbpass = 'lihuipeng'
            dbname = 'game_main'
            dbhost = '127.0.0.1'
            dbport = 3306


        class MySQLStorePipeline(object):
            def __init__(self):
                self.conn = MySQLdb.connect(user=dbuser, passwd=dbpass, db=dbname, host=dbhost, port=dbport, charset="utf8", use_unicode=True)
                self.cursor = self.conn.cursor()
                # Empty the table so that each crawl starts from a clean state
                self.cursor.execute("truncate table yunweiApp_weather;")
                self.conn.commit()

            def process_item(self, item, spider):
                curTime = datetime.datetime.now()
                try:
                    # Each field is a list returned by extract(); store its first element.
                    # The connection charset is utf8, so text values are passed as-is.
                    self.cursor.execute(
                        """INSERT INTO yunweiApp_weather (weatherDate, weatherDate2, weatherWea, weatherTem1, weatherTem2, weatherWin, updateTime)
                        VALUES (%s, %s, %s, %s, %s, %s, %s)""",
                        (
                            item['weatherDate'][0],
                            item['weatherDate2'][0],
                            item['weatherWea'][0],
                            item['weatherTem1'][0],
                            item['weatherTem2'][0],
                            item['weatherWin'][0],
                            curTime,
                        )
                    )
                    self.conn.commit()
                except MySQLdb.Error as e:
                    print("Error %d: %s" % (e.args[0], e.args[1]))
                return item
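For trying the pipeline logic without a MySQL server, the sketch below keeps the same shape (clear the table at start, one parameterized INSERT per item) but swaps in the standard library's sqlite3 module. This is an assumption-laden stand-in, not the article's pipeline: the class name `SQLiteStorePipeline`, the `db_path` parameter, and the `FIELDS` list are hypothetical, and sqlite3 uses `?` placeholders where MySQLdb uses `%s`. It also opens and closes the connection in `open_spider` and `close_spider`, the hooks Scrapy calls on pipelines at the start and end of a crawl.

```python
import datetime
import sqlite3

# Field names taken from Weather2Item, in the column order of the table.
FIELDS = ["weatherDate", "weatherDate2", "weatherWea",
          "weatherTem1", "weatherTem2", "weatherWin"]


class SQLiteStorePipeline(object):
    """Hypothetical stand-in for MySQLStorePipeline that writes to SQLite."""

    def __init__(self, db_path=":memory:"):
        self.db_path = db_path
        self.conn = None

    def open_spider(self, spider):
        # Called when the spider starts: open the connection and prepare the table.
        self.conn = sqlite3.connect(self.db_path)
        self.conn.execute(
            """CREATE TABLE IF NOT EXISTS yunweiApp_weather (
                   id INTEGER PRIMARY KEY AUTOINCREMENT,
                   weatherDate TEXT, weatherDate2 TEXT, weatherWea TEXT,
                   weatherTem1 TEXT, weatherTem2 TEXT, weatherWin TEXT,
                   updateTime TEXT)"""
        )
        # Same intent as the TRUNCATE in the MySQL version: start each crawl empty.
        self.conn.execute("DELETE FROM yunweiApp_weather")
        self.conn.commit()

    def process_item(self, item, spider):
        # First extracted value per field, or None when the XPath matched nothing.
        values = [(item.get(name) or [None])[0] for name in FIELDS]
        values.append(datetime.datetime.now().isoformat())
        self.conn.execute(
            """INSERT INTO yunweiApp_weather
                   (weatherDate, weatherDate2, weatherWea,
                    weatherTem1, weatherTem2, weatherWin, updateTime)
               VALUES (?, ?, ?, ?, ?, ?, ?)""",
            values,
        )
        self.conn.commit()
        return item

    def close_spider(self, spider):
        # Called when the spider finishes: release the connection.
        self.conn.close()
```

Falling back to None for an empty field is a design choice of this sketch; the MySQL version above indexes `[0]` directly, so an empty list there would raise an IndexError that its `except MySQLdb.Error` clause does not catch.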

Edit settings.py to enable the pipeline:

 
        ITEM_PIPELINES = {
            #'weather2.pipelines.Weather2Pipeline': 300,
            'weather2.pipelines.MySQLStorePipeline': 400,
        }

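The number after each pipeline is its priority: pipelines run in order from lower to higher values, and by convention the values fall in the 0-1000 range.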
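As a small illustration of what the numbers mean, the sketch below sorts a pipeline mapping by value to show the order in which items would pass through. It assumes both pipelines are enabled (the settings in this article comment the first one out), and `run_order` is a hypothetical helper that only illustrates the ordering rule, not how Scrapy loads pipelines internally.

```python
# Hypothetical settings with both pipelines enabled, for illustration only.
ITEM_PIPELINES = {
    'weather2.pipelines.MySQLStorePipeline': 400,
    'weather2.pipelines.Weather2Pipeline': 300,
}


def run_order(pipelines):
    """Return pipeline paths sorted from the lowest priority value to the highest."""
    return [path for path, priority in sorted(pipelines.items(), key=lambda kv: kv[1])]


if __name__ == "__main__":
    for path in run_order(ITEM_PIPELINES):
        print(path)
```

With these values an item would go through Weather2Pipeline (300) before MySQLStorePipeline (400), regardless of the order in which the entries are written.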

Run the test again:

 
        scrapy crawl CatchWeather2

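After this run the data is stored in MySQL, where it can be displayed in, for example, an ops admin backend.

Done, that's a wrap~~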

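This article is reposted from the 运维笔记 (Ops Notes) blog on 51CTO. Original link: http://blog.51cto.com/lihuipeng/1711852. To republish it, please contact the original author.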

lihuipeng
