Really I just wanted to try crawling some images. Looking at the page, there are two things to scrape: the cover image and the download links. Simple enough.
Item definition:
```python
import scrapy


class TiantianmeijuItem(scrapy.Item):
    name = scrapy.Field()
    image_urls = scrapy.Field()
    images = scrapy.Field()
    image_paths = scrapy.Field()
    episode = scrapy.Field()
    episode_url = scrapy.Field()
```
name stores the show's name.
image_urls and images are used by the image-download pipeline: one holds the image URLs to fetch, the other the stored-image metadata.
image_paths has no real purpose here; it just records the paths of the successfully downloaded images.
episode and episode_url store the episode names and their corresponding download links.
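As a rough sketch of the convention those two fields follow (plain dicts here, not real scrapy objects, and the URL is made up): the spider fills image_urls, and after downloading, the pipeline stores one metadata dict per image in images.

```python
# Hypothetical illustration of the ImagesPipeline input/output convention.
item = {
    'image_urls': ['http://cn163.net/wp-content/cover.jpg'],  # input: set by the spider
    'images': [],                                             # output: filled by the pipeline
}

# Shape of what the pipeline records for each successfully downloaded image:
downloaded = {
    'url': 'http://cn163.net/wp-content/cover.jpg',
    'path': 'full/3afec3b4765f8f0a07b78f98c07b83f013567a0a.jpg',
    'checksum': 'b9628c4ab9b595f72f280b90c4fd093d',
}
item['images'].append(downloaded)
```

The `path` field is where the per-show directory naming (below) comes into play.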
Spider:
```python
import scrapy
from tiantianmeiju.items import TiantianmeijuItem
import sys
reload(sys)  # Python 2.5+ removes sys.setdefaultencoding at startup; reload(sys) restores it
sys.setdefaultencoding('utf-8')


class CacthUrlSpider(scrapy.Spider):
    name = 'meiju'
    allowed_domains = ['cn163.net']
    start_urls = ["http://cn163.net/archives/{id}/".format(id=id)
                  for id in ['16355', '13470', '18766', '18805']]

    def parse(self, response):
        item = TiantianmeijuItem()
        item['name'] = response.xpath('//*[@id="content"]/div[2]/div[2]/h2/text()').extract()
        item['image_urls'] = response.xpath('//*[@id="entry"]/div[2]/img/@src').extract()
        item['episode'] = response.xpath('//*[@id="entry"]/p[last()]/a/text()').extract()
        item['episode_url'] = response.xpath('//*[@id="entry"]/p[last()]/a/@href').extract()
        yield item
```
The page structure is simple.
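For reference, the start_urls list comprehension expands to one archive URL per ID, which is easy to check outside Scrapy:

```python
# Reproduces the spider's start_urls construction (no scrapy needed):
ids = ['16355', '13470', '18766', '18805']
start_urls = ["http://cn163.net/archives/{id}/".format(id=i) for i in ids]
print(start_urls[0])  # http://cn163.net/archives/16355/
```

The spider itself is then run from the project directory with `scrapy crawl meiju`.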
Pipelines: two pipelines here. One saves the download links to a file, the other downloads the images.
```python
import json
import os
from scrapy.pipelines.images import ImagesPipeline
from scrapy.exceptions import DropItem
from scrapy.http import Request
from settings import IMAGES_STORE


class TiantianmeijuPipeline(object):
    def process_item(self, item, spider):
        return item


class WriteToFilePipeline(object):
    def process_item(self, item, spider):
        item = dict(item)
        FolderName = item['name'][0].replace('/', '')
        downloadFile = 'download_urls.txt'
        with open(os.path.join(IMAGES_STORE, FolderName, downloadFile), 'w') as file:
            for name, url in zip(item['episode'], item['episode_url']):
                file.write('{name}: {url}\n'.format(name=name, url=url))
        return item


class MyImagesPipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        for image_url in item['image_urls']:
            yield Request(image_url, meta={'item': item})

    def item_completed(self, results, item, info):
        image_paths = [x['path'] for ok, x in results if ok]
        if not image_paths:
            raise DropItem("Item contains no images")
        item['image_paths'] = image_paths
        return item

    def file_path(self, request, response=None, info=None):
        item = request.meta['item']
        FolderName = item['name'][0].replace('/', '')
        image_guid = request.url.split('/')[-1]
        filename = u'{}/{}'.format(FolderName, image_guid)
        return filename
```
get_media_requests and item_completed follow the standard ImagesPipeline pattern. The default image storage path is
<IMAGES_STORE>/full/3afec3b4765f8f0a07b78f98c07b83f013567a0a.jpg,
and I wanted to replace full with a directory named after the show, so I overrode file_path.
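That path logic can be sketched standalone (the show name and image URL here are made up for illustration):

```python
def show_file_path(show_name, image_url):
    # Mirrors the overridden file_path: <show name>/<basename of image URL>
    # instead of the default full/<sha1>.jpg layout.
    folder = show_name.replace('/', '')    # strip '/' so the name is a valid directory
    image_guid = image_url.split('/')[-1]  # last URL segment becomes the file name
    return u'{}/{}'.format(folder, image_guid)

print(show_file_path(u'Person of Interest', 'http://cn163.net/img/cover.jpg'))
# Person of Interest/cover.jpg
```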
Enable the pipelines in settings:
```python
ITEM_PIPELINES = {
    'tiantianmeiju.pipelines.WriteToFilePipeline': 2,
    'tiantianmeiju.pipelines.MyImagesPipeline': 1,
}
IMAGES_STORE = os.path.join(os.getcwd(), 'image')  # image storage path
IMAGES_EXPIRES = 90
IMAGES_MIN_HEIGHT = 110
IMAGES_MIN_WIDTH = 110
```
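One thing worth noting about the priorities: in ITEM_PIPELINES, lower numbers run first, so MyImagesPipeline (1) processes each item before WriteToFilePipeline (2), and the per-show folder exists by the time download_urls.txt is written into it. A defensive variant (a hypothetical helper, not part of the original code) would create the folder itself rather than rely on ordering:

```python
import os


def ensure_show_folder(images_store, show_name):
    # Hypothetical helper: make sure <IMAGES_STORE>/<show name> exists before
    # writing download_urls.txt into it, regardless of pipeline ordering.
    path = os.path.join(images_store, show_name.replace('/', ''))
    if not os.path.isdir(path):
        os.makedirs(path)
    return path
```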
After the crawl finishes, the result looks like this:
This post was reposted from the Yunwei Biji blog on 51CTO; original link: http://blog.51cto.com/lihuipeng/1713531. Please contact the original author before reprinting.
lihuipeng