Building Web Crawlers in Python: This Article Is All You Need

樂百川 2019-04-05

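Python is hugely popular right now and is widely used in web crawling, artificial intelligence, big data, and other fields. Today I'd like to introduce some basics of writing crawlers in Python and show how to use the most common libraries. I hope you find it helpful.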

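The idea behind a crawler is actually simple. It boils down to the following steps:

Send a network request
Fetch the web page
Parse the page to extract data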
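
The three steps above can be sketched with nothing but the standard library. This is only a rough illustration of the flow, not how the libraries introduced below work internally. The names fetch, LinkCollector, and extract_links, as well as the sample HTML, are made up for this example.

```python
from html.parser import HTMLParser
from urllib import request


def fetch(url, timeout=10):
    """Steps 1 and 2: send the request and return the page as text."""
    with request.urlopen(url, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or 'utf-8'
        return response.read().decode(charset)


class LinkCollector(HTMLParser):
    """Step 3: collect (text, href) pairs for every <a> tag that has an href."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self._href = dict(attrs).get('href')
            self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside an open <a href=...> tag
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == 'a' and self._href is not None:
            self.links.append((''.join(self._text).strip(), self._href))
            self._href = None


def extract_links(html):
    collector = LinkCollector()
    collector.feed(html)
    return collector.links


# Hypothetical sample page, so the parsing step can be tried offline
sample = ('<ul><li><a href="/p/1">First post</a></li>'
          '<li><a href="/p/2">Second post</a></li></ul>')
print(extract_links(sample))
```

In a real crawler the two halves would be chained, for example extract_links(fetch(some_url)). The rest of this article is about libraries that make each half much shorter.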

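For sending requests, the usual choices are the standard library's urllib and the popular third-party requests library. For parsing pages, BeautifulSoup is the common choice. The author of requests also wrote another handy library, requests-html, which combines requesting and parsing in one package and is very convenient for small crawlers. Beyond these there are dedicated crawling frameworks, the best known being scrapy. This article gives a brief tour of each of these libraries; I'll write a separate article on scrapy later.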

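The standard library: urllib

Let's start with urllib from the standard library. Its advantage is that it ships with Python, so there is nothing to install. Its drawback is that it is a fairly low-level library and cumbersome to use. Below is a simple example of sending a request with urllib; a quick skim is enough. Notice that even a simple request needs several objects (an opener, a Request, a ProxyHandler), which is tedious.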

import urllib.request as request
import requests  # not needed by urllib; used by the requests example later on

# Local proxy; adjust or remove this to match your own setup
proxies = {
    'https': 'https://127.0.0.1:1080',
    'http': 'http://127.0.0.1:1080'
}
headers = {
    'user-agent':
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:65.0) Gecko/20100101 Firefox/65.0'
}
print('--------------Using urllib--------------')
url = 'http://www.google.com'
# Route requests through the proxy and make this opener the global default
opener = request.build_opener(request.ProxyHandler(proxies))
request.install_opener(opener)
req = request.Request(url, headers=headers)
response = request.urlopen(req)
print(response.read().decode())
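
To get a feel for the objects involved, it can help to build a Request and inspect it before anything is sent. The snippet below is a small sketch along those lines; it never calls urlopen, so it runs without a network connection or a proxy. The form body b'name=yitian' is just an illustrative value.

```python
import urllib.request as request

headers = {
    'user-agent':
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:65.0) Gecko/20100101 Firefox/65.0'
}

# Building a Request does not touch the network; only urlopen() does
req = request.Request('http://www.google.com', headers=headers)
print(req.full_url)
print(req.get_method())              # GET, because no data is attached
print(req.get_header('User-agent'))  # urllib stores header names capitalized

# Attaching a body turns the request into a POST
post_req = request.Request('http://httpbin.org/post', data=b'name=yitian')
print(post_req.get_method())
```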

requests

requests is one of Kenneth Reitz's best-known works. Its strength is that it is extremely simple and pleasant to use. First, install it.

pip install requests

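Here is a simple example that does the same thing as the urllib code above (it reuses the headers and proxies defined there), with far less code that is also easier to read.

print('--------------Using requests--------------')
response = requests.get('https://www.google.com', headers=headers, proxies=proxies)
response.encoding = 'utf8'
print(response.text)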

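requests also makes it easy to send form data, for example to simulate a user login. The returned Response object carries plenty of useful information as well, including the status code, headers, raw, and cookies.

from pprint import pprint
import requests

data = {
    'name': 'yitian',
    'age': 22,
    'friends': ['zhang3', 'li4']
}
response = requests.post('http://httpbin.org/post', data=data)
# Dump every attribute of the Response object
pprint(response.__dict__)
print(response.text)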
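
If you are curious what the form data looks like on the wire, requests can build the request without sending it. The sketch below uses requests.Request and its prepare() method for that; it reuses the same data dictionary and needs no network access.

```python
import requests

data = {
    'name': 'yitian',
    'age': 22,
    'friends': ['zhang3', 'li4']
}

# prepare() builds the request as it would be sent, without sending it
prepared = requests.Request('POST', 'http://httpbin.org/post', data=data).prepare()
print(prepared.headers['Content-Type'])
print(prepared.body)
```

The list value is sent as a repeated key (friends=zhang3&friends=li4), which is how HTML forms encode multiple values for one field.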

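I won't go into more detail on requests because it has Chinese documentation. It lags the official docs by a few minor versions, but that does no harm and you can rely on it.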

http://cn.python-requests.org/zh_CN/latest/

beautifulsoup

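With requests we can easily fetch the HTML, but to find the data we want inside that HTML we also need an HTML/XML parsing library, and BeautifulSoup is a popular one. First, install it. The example below uses the lxml parser, which is a separate package, so install that too:

pip install beautifulsoup4 lxml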

This time I'll use my Jianshu (簡書) home page as the example and scrape my article list. First, fetch the page with requests.

from pprint import pprint
import bs4
import requests

headers = {
    'user-agent':
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:65.0) Gecko/20100101 Firefox/65.0'
}
url = 'https://www.jianshu.com/u/7753478e1554'
response = requests.get(url, headers=headers)

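Next comes the BeautifulSoup code. You first build an HTML tree, then look up nodes in it. BeautifulSoup offers two main ways to find nodes: the find and find_all methods, and the select method, which takes a CSS selector. Once you have a node, use contents to get its children; if a child is text, you get the text value. Note that this attribute returns a list, hence the [0].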

# Build the HTML tree with the lxml parser
html = bs4.BeautifulSoup(response.text, features='lxml')
# find_all returns a list; limit=1 stops at the first match
note_list = html.find_all('ul', class_='note-list', limit=1)[0]
# select takes a CSS selector
for a in note_list.select('li>div.content>a.title'):
    title = a.contents[0]
    link = f'https://www.jianshu.com{a["href"]}'
    print(f'《{title}》,{link}')
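
The two lookup styles and the contents attribute are easier to compare on a small, fixed piece of HTML. The snippet below is a sketch under a few assumptions: the markup is invented to mimic the structure the selectors above expect, and it uses Python's built-in html.parser instead of lxml so that it runs without the extra dependency.

```python
import bs4

# Hypothetical markup that mimics the structure the selectors above expect
sample = '''
<ul class="note-list">
  <li><div class="content"><a class="title" href="/p/abc">First post</a></div></li>
  <li><div class="content"><a class="title" href="/p/def"><b>Second</b> post</a></div></li>
</ul>
'''
html = bs4.BeautifulSoup(sample, features='html.parser')

# The same nodes, found two ways
by_find = html.find('ul', class_='note-list').find_all('a', class_='title')
by_select = html.select('ul.note-list > li > div.content > a.title')
print(by_find == by_select)

for a in by_select:
    # contents[0] is the first child, which may be a tag rather than text;
    # get_text() returns the combined text of the whole node
    print(repr(a.contents[0]), '|', a.get_text(), '|', a['href'])
```

For the second link, contents[0] is the b tag rather than a string, so get_text() is the safer choice when a title may contain nested markup.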

BeautifulSoup also has Chinese documentation. It likewise trails by a couple of minor versions, which hardly matters.

https://www.crummy.com/software/BeautifulSoup/bs4/doc.zh/

requests-html

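This library is a sibling of requests and is also by Kenneth Reitz. It combines requesting and parsing pages. With plain requests you can only fetch pages, and parsing them takes a separate library such as BeautifulSoup. With requests-html, a single library does both.

First, install it.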

pip install requests-html

Now let's rewrite the example above with requests-html.

from requests_html import HTMLSession

url = 'https://www.jianshu.com/u/7753478e1554'
headers = {
    'user-agent':
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:65.0) Gecko/20100101 Firefox/65.0'
}
session = HTMLSession()
r = session.get(url, headers=headers)
# first=True returns a single element instead of a list
note_list = r.html.find('ul.note-list', first=True)
for a in note_list.find('li>div.content>a.title'):
    title = a.text
    link = f'https://www.jianshu.com{a.attrs["href"]}'
    print(f'《{title}》,{link}')

Besides CSS selectors, requests-html can also search with XPath.

for a in r.html.xpath('//ul[@class="note-list"]/li/div[@class="content"]/a[@class="title"]'):
    title = a.text
    link = f'https://www.jianshu.com{a.attrs["href"]}'
    print(f'《{title}》,{link}')
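
The examples in this article build each link by prefixing the site root to the href, which works as long as every href is a root-relative path. If a page might mix relative and absolute links, one alternative (my suggestion, not something the examples here depend on) is urljoin from the standard library. The href values below are hypothetical.

```python
from urllib.parse import urljoin

base = 'https://www.jianshu.com/u/7753478e1554'

# Hypothetical href values of the kinds a page might contain
root_relative = urljoin(base, '/p/abc123')
already_absolute = urljoin(base, 'https://example.com/x')
scheme_relative = urljoin(base, '//cdn.example.com/a.js')

print(root_relative)     # https://www.jianshu.com/p/abc123
print(already_absolute)  # https://example.com/x
print(scheme_relative)   # https://cdn.example.com/a.js
```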

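Another very useful feature of requests-html is browser rendering. Some pages load their content asynchronously: the data is fetched by JavaScript running in the browser, so a plain crawler gets only an empty page. That is when you need browser rendering, and with requests-html it is very simple: just call render as well. The example below passes two of its arguments: scrolldown, the number of times to scroll the page down, and sleep, how long to pause, in seconds. The first time render runs, requests-html downloads a Chromium browser and then uses it to render the page.

My Jianshu article page is one such asynchronously loaded page. By default it shows only the most recent articles; by rendering it in the browser and simulating scrolling, we can get the full article list.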

session = HTMLSession()
r = session.get(url, headers=headers)
# render tells requests-html to render the page with the Chromium browser
r.html.render(scrolldown=50, sleep=0.2)
for a in r.html.xpath('//ul[@class="note-list"]/li/div[@class="content"]/a[@class="title"]'):
    title = a.text
    link = f'https://www.jianshu.com{a.attrs["href"]}'
    print(f'《{title}》,{link}')

Similarly, a Toutiao (今日頭條) user page is loaded asynchronously, so it also needs a call to render.

from requests_html import HTMLSession

headers = {
    'user-agent':
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:65.0) Gecko/20100101 Firefox/65.0'
}
session = HTMLSession()
r = session.get('https://www.toutiao.com/c/user/6662330738/#mid=1620400303194116', headers=headers)
r.html.render()
for i in r.html.find('div.rbox-inner a'):
    title = i.text
    link = f'https://www.toutiao.com{i.attrs["href"]}'
    print(f'《{title}》 {link}')

Finally, here are the requests-html website and its Chinese documentation.

https://html.python-requests.org/
https://cncert.github.io/requests-html-doc-cn/#/?id=requests-html

scrapy

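Each of the libraries above has its own job, and combining them is enough to write a crawler. But when it comes to a dedicated crawling framework, scrapy is the one to talk about. As a well-known framework, scrapy gives the crawler model a modular structure, and with it we can quickly build powerful crawlers.

scrapy has a lot of concepts, though, and covering them properly would take a separate article, so here is just a quick demonstration. First install scrapy; on Windows you also need to install pypiwin32.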

pip install scrapy
pip install pypiwin32

Then create a scrapy project and add a new spider.

scrapy startproject myproject
cd myproject
scrapy genspider my jianshu.com

Open the settings file settings.py and set the user agent; otherwise you will get a 403 error.

USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:65.0) Gecko/20100101 Firefox/65.0'

Then edit the spider.

# -*- coding: utf-8 -*-
import scrapy


class JianshuSpider(scrapy.Spider):
    # Must match the name passed to "scrapy genspider" and "scrapy crawl"
    name = 'my'
    allowed_domains = ['jianshu.com']
    start_urls = ['https://www.jianshu.com/u/7753478e1554']

    def parse(self, response):
        # Each div.content block holds one article entry
        for article in response.css('div.content'):
            yield {
                'title': article.css('a.title::text').get(),
                'link': 'https://www.jianshu.com' + article.xpath('a[@class="title"]/@href').get()
            }

Finally, run the spider.

scrapy crawl my

That's all for this article. I hope it helps.