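Trying Out Scrapy, a Python Crawler Framework

Scrapy is a crawler framework for Python. It makes scraping pages fairly convenient and works well for sites with a simple structure.

Official documentation

Key concepts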

  1. Selector
  2. Request and Response
  3. Item
  4. Logging

Installation and usage

First install Python; Python 3 is recommended. Then install Scrapy with pip3:

pip3 install scrapy

Create an example file test.py with the following code, which scrapes all posts related to python from a blog:

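import scrapy


class BlogSpider(scrapy.Spider):
    name = 'testspider'
    start_urls = ['https://www.xxx.com/?s=python']

    def parse(self, response):
        # Log once per page rather than once per heading.
        self.logger.info('A response from %s just arrived!', response.url)

        # Append each title/link pair on this page to result.txt.
        # The with-block closes the file even if parsing raises.
        with open('result.txt', 'a', encoding='utf-8') as result_file:
            for h2 in response.css('h2.article-title'):
                title = h2.css('a::text').get()
                href = h2.css('a').attrib.get('href')
                if title is None or href is None:
                    continue  # skip headings without a usable link
                result_file.write(f'{title} {href}\n')
                yield {'title': title}

        # Follow the pagination links and parse them with the same callback.
        for next_page in response.css('a.page-ens'):
            yield response.follow(next_page, self.parse)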

Run it:

scrapy runspider test.py

When the crawl finishes, result.txt in the same directory contains the titles and links of all the posts.
