How Python Web Crawlers on Major Websites Are Built: The Basics

Python, web crawler, HTML, browser, PyCharm, Mozilla, XML, Baidu Baike | 全能的程序員 | 2019-09-06

I. Requirements:

Target page to crawl: the Baidu Baike entry for Python, https://baike.baidu.com/item/Python/407313

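Analyze the HTML source of that page to work out how each field can be extracted:

Entry title (keyword): inside the first h1 tag of the dd element whose class is lemmaWgt-lemmaTitle-title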

Summary: the text content of the div whose class is lemma-summary

Links to related entries: a tags whose href begins with /item/

II. Crawl process flow chart:

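III. Analysis:

1. Web page downloader:

1. Purpose:

Downloads the web page at a given URL to the local machine as HTML.

Commonly used downloaders:

1. urllib2: the standard-library module in Python 2 (its functionality lives in urllib.request in Python 3)

2. requests: a third-party package with more features

2. Three ways to download a page with urllib2

(1) Pass the URL to urllib2.urlopen(url)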

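import urllib2  # Python 2 only; Python 3 moved this to urllib.request
# Send the request directly
response = urllib2.urlopen('http://www.baidu.com')
# Get the status code; 200 means success
code = response.getcode()
# Read the page content
cont = response.read()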

(2) Add data and HTTP headers

Pass the url, data, and headers to urllib2.Request,

then call urllib2.urlopen(request)

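import urllib
import urllib2  # Python 2 only
url = 'http://www.baidu.com'
# Create the Request object
request = urllib2.Request(url)
# Add form data (this turns the request into a POST)
request.add_data(urllib.urlencode({'a': '1'}))
# Add an HTTP header so the crawler presents itself as a Mozilla browser
request.add_header('User-Agent', 'Mozilla/5.0')
# Send the request and get the result
response = urllib2.urlopen(request)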
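
The snippet above targets Python 2. As a rough sketch of the same idea on Python 3, where these calls live in urllib.request and urllib.parse, the following builds the request object without sending it. The helper name build_request is hypothetical, and the URL and the form field a=1 are just the placeholder values from the example above.

```python
import urllib.parse
import urllib.request


def build_request(url, form_fields, user_agent="Mozilla/5.0"):
    """Return a urllib.request.Request carrying form data and a User-Agent header."""
    # In Python 3 the request body must be bytes, so encode the form string.
    data = urllib.parse.urlencode(form_fields).encode("utf-8")
    request = urllib.request.Request(url, data=data)
    request.add_header("User-Agent", user_agent)
    return request


request = build_request("http://www.baidu.com", {"a": "1"})
# Supplying data switches the method from GET to POST.
# Request stores header names capitalized, hence the "User-agent" lookup key.
print(request.get_method(), request.data, request.get_header("User-agent"))
# Sending it would be: response = urllib.request.urlopen(request)
```

Nothing is sent over the network here; the point is only to show where the data and the header end up on the request object.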

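(3) Add handlers for special scenarios

For pages that can only be accessed after the user logs in, add cookie handling;

for pages that must be reached through a proxy, use ProxyHandler;

or for pages that must be requested over HTTPS, add HTTPS handling.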
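
As a minimal sketch of how such handlers can be combined on Python 3, the following builds an opener with cookie support and an optional proxy. The helper name build_opener_with_handlers and the proxy address are assumptions made for illustration; HTTPS URLs are normally covered by the default opener when Python is built with SSL support.

```python
import http.cookiejar
import urllib.request


def build_opener_with_handlers(proxy_url=None):
    """Return (opener, cookie_jar) with cookie support and an optional HTTP proxy."""
    cookie_jar = http.cookiejar.CookieJar()
    handlers = [urllib.request.HTTPCookieProcessor(cookie_jar)]
    if proxy_url is not None:
        # Route plain-HTTP requests through the given proxy.
        handlers.append(urllib.request.ProxyHandler({"http": proxy_url}))
    opener = urllib.request.build_opener(*handlers)
    return opener, cookie_jar


opener, cookie_jar = build_opener_with_handlers()
opener.addheaders = [("User-Agent", "Mozilla/5.0")]
# opener.open(url) would now store and resend cookies across requests;
# urllib.request.install_opener(opener) would make urlopen() use it as well.
```

Keeping the cookie jar as a separate return value makes it possible to inspect or persist the cookies after a login request.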

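2. Web page parser

1. Purpose:

A tool that extracts valuable data from a web page.

It takes the HTML page string as input and outputs the extracted data plus a list of new URLs to crawl.

Types of web page parsers:

1. Regular expressions: match patterns directly against the downloaded HTML string. Suitable for simple pages; this is fuzzy matching on the raw string.

2. html.parser: a module that ships with Python

3. BeautifulSoup: a third-party library

4. lxml: a third-party library

Parsers 2 to 4 are structured parsers: they work by parsing the document into a DOM tree.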
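
To make the difference between fuzzy string matching and structured parsing concrete, here is a small sketch that collects /item/ links both ways. The sample HTML and the class name ItemLinkCollector are invented for illustration. Note that the standard-library HTMLParser reports tags as events rather than handing back a tree; libraries such as BeautifulSoup build their tree on top of a parser like this.

```python
import re
from html.parser import HTMLParser

SAMPLE_HTML = '<div><a href="/item/Guido">Guido</a> <a href="/view/1.html">old</a></div>'

# Approach 1: fuzzy matching on the raw string with a regular expression.
regex_links = re.findall(r'href="(/item/[^"]*)"', SAMPLE_HTML)


# Approach 2: structured parsing; the parser reports each tag with its attributes.
class ItemLinkCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href and href.startswith("/item/"):
            self.links.append(href)


collector = ItemLinkCollector()
collector.feed(SAMPLE_HTML)
print(regex_links, collector.links)
```

The regular expression depends on the exact quoting and attribute order in the markup, while the parser-based version only depends on the tag name and attribute value.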

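2. BeautifulSoup: introduction and usage:

1. Introduction:

BeautifulSoup: a third-party Python library for extracting data from HTML or XML.

Install and test beautifulsoup:

Method 1: - Install: pip install beautifulsoup4

- Test: import bs4

Method 2: in PyCharm, File > Settings > Project Interpreter, then add beautifulsoup4

2. Syntax overview:

A BeautifulSoup object is created from the HTML page string; once it has been created, the DOM tree is already loaded.

You can then search for nodes with find_all and find, which return all matching nodes and the first matching node respectively (the search can be by node name, attributes, or text).

Once you have a node, you can access its name, attributes, and text.

For example:

<a href="123.html" class="aaa">Python</a>

can be searched by:

Node name: a

Node attributes: href="123.html" class="aaa"

Node content: Python

Creating a BeautifulSoup object:

from bs4 import BeautifulSoup

# Create a BeautifulSoup object from the downloaded HTML page
soup = BeautifulSoup(
    html_doc,               # the HTML document (bytes or str)
    'html.parser',          # the HTML parser to use
    from_encoding='utf-8'   # document encoding; only used when html_doc is bytes
)

Searching for nodes:

Method: find_all(name, attrs, string)

import re

# Find all nodes whose tag is a
soup.find_all('a')
# Find all a nodes whose link is exactly /view/123.html
soup.find_all('a', href='/view/123.html')
# Find the first a node whose href matches a regular expression
soup.find('a', href=re.compile('aaa'))
# Find all div nodes whose class is abc and whose text is Python
soup.find_all('div', class_='abc', string='Python')

Because class is a Python keyword, the class attribute is written with a trailing underscore (class_) to avoid a conflict.

Accessing node information:

Given the node: <a href="123.html" class="aaa">Python</a>

# Get the tag name of the node that was found
node.name
# Get the href attribute of the node
node['href']
# Get the link text of the node
node.get_text()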
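
Putting these calls together, the following sketch applies them to the three extraction rules from the requirements section. The HTML fragment is invented to match those class names; the real Baidu Baike markup may differ or may have changed, so treat the selectors as assumptions. It requires the beautifulsoup4 package.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical page fragment shaped like the selectors described earlier.
html_doc = """
<dd class="lemmaWgt-lemmaTitle-title"><h1>Python</h1></dd>
<div class="lemma-summary">Python is a programming language.</div>
<a href="/item/Guido">Guido</a>
<a href="/view/123.html">legacy link</a>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# Title: the first h1 inside the dd with the title class.
title_node = soup.find("dd", class_="lemmaWgt-lemmaTitle-title").find("h1")
# Summary: the text of the div with the summary class.
summary_node = soup.find("div", class_="lemma-summary")
# Related entries: every a tag whose href starts with /item/.
item_links = soup.find_all("a", href=re.compile(r"^/item/"))

title = title_node.get_text()
summary = summary_node.get_text()
hrefs = [node["href"] for node in item_links]
print(title, summary, hrefs)
```

On a real page, find can return None when the expected element is missing, so production code would check each node before calling get_text.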

IV. Code implementation:

spider.py

# Crawler entry point and scheduler
from baike import url_manager, html_downloader, html_parser, html_outputer


class SpiderMain(object):
    def __init__(self):
        self.urlManager = url_manager.UrlManager()
        self.downloader = html_downloader.HtmlDownLoader()
        self.parser = html_parser.HtmpParser()
        self.outputer = html_outputer.HtmlOutpter()

    def craw(self, url):
        count = 1  # page counter; the crawl stops after 5 pages
        self.urlManager.add_new_url(url)
        while self.urlManager.has_new_url():
            try:
                # Take one URL from the manager
                new_url = self.urlManager.get_new_url()
                # Request the URL and get the data returned by the site
                html_content = self.downloader.download(new_url)
                new_urls, new_datas = self.parser.parse(new_url, html_content)
                self.urlManager.add_new_urls(new_urls)
                self.outputer.collect_data(new_datas)
                print(count)
                if count == 5:
                    break
                count = count + 1
            except Exception as e:
                print("An error occurred:", e)
        # Write the crawl results to an HTML file
        self.outputer.out_html()


if __name__ == "__main__":
    url = 'https://baike.baidu.com/item/Python/407313'
    sm = SpiderMain()
    sm.craw(url)

url_manager.py

# URL manager
class UrlManager(object):
    def __init__(self):
        # Two sets: one holds URLs not yet crawled, the other holds URLs already visited
        self.new_urls = set()
        self.old_urls = set()

    # Add a single URL
    def add_new_url(self, url):
        if url is None:
            return None
        if url not in self.new_urls and url not in self.old_urls:
            self.new_urls.add(url)

    # Check whether any URLs are still waiting to be crawled (based on the size of new_urls)
    def has_new_url(self):
        return len(self.new_urls) != 0

    # Get one new URL
    def get_new_url(self):
        if len(self.new_urls) > 0:
            # Pop one URL from new_urls and record it in old_urls
            new_url = self.new_urls.pop()
            self.old_urls.add(new_url)
            return new_url

    # Add a batch of URLs
    def add_new_urls(self, new_urls):
        if new_urls is None:
            return
        for url in new_urls:
            self.add_new_url(url)
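
One property of the class above is that set.pop() removes an arbitrary element, so the order in which pages are crawled is not defined. If a predictable, first-in-first-out order is wanted, one possible variant (a sketch, not the article's implementation; the class name OrderedUrlManager is hypothetical) pairs a deque with a set of URLs already seen:

```python
from collections import deque


class OrderedUrlManager:
    """Hypothetical variant of UrlManager that hands URLs back in insertion order."""

    def __init__(self):
        self.queue = deque()   # URLs waiting to be crawled, oldest first
        self.seen = set()      # every URL ever added, crawled or not

    def add_new_url(self, url):
        if url is None or url in self.seen:
            return
        self.seen.add(url)
        self.queue.append(url)

    def add_new_urls(self, urls):
        for url in urls or []:
            self.add_new_url(url)

    def has_new_url(self):
        return len(self.queue) != 0

    def get_new_url(self):
        return self.queue.popleft() if self.queue else None


manager = OrderedUrlManager()
manager.add_new_urls(["/item/a", "/item/b", "/item/a"])
first = manager.get_new_url()
manager.add_new_url("/item/a")   # already seen, so it is not queued again
remaining = list(manager.queue)
```

The method names mirror the original class, so this variant could be swapped in without changing the scheduler.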

html_downloader.py

# Class that downloads a web page
import urllib.request


class HtmlDownLoader(object):
    def download(self, url):
        if url is None:
            return
        # Request the URL
        response = urllib.request.urlopen(url)
        # A status code other than 200 means something went wrong
        if response.getcode() != 200:
            return
        return response.read()

html_parser.py

一、需求:

抓取主頁面：百度百科Python詞條 https://baike.baidu.com/item/Python/407313

分析上面的源碼格式，便於提取：

關鍵詞分析:位於class為lemmaWgt-lemmaTitle-title的dd元素的第一個h1標籤內

簡介分析(位於class為lemma-summary的div的text內容)

其他相關聯的標籤的分析(是a標籤，且href以/item/開頭)

二、抓取過程流程圖:

三、分析:

1. 網頁下載器:

1.作用:

將互聯網上URL對應的網頁以HTML形式下載到本地

常用的本地下載器

1、urllib2 Python官方基礎模塊

2、requests 第三方包，功能更強大

2.urllib下載網頁的三種方法

(1)URL傳入urllib2.urlopen(url)方法

import urllib2
#直接請求
response = urllib2.urlopen('http://www.baidu.com')
#獲取狀態碼，如果是200表示成功
code = response.getcode()
#讀取內容
cont = response.read()

(2)添加data、http header

將url、data、header傳入urllib.Request方法

然後 URLlib.urlopen(request)

import urllib2
#創建Request對象
request = urllin2.Request(url)
#添加數據
request.add_data('a'.'1')
#添加http的header 將爬蟲程序偽裝成Mozilla瀏覽器
request.add_header('User-Agent','Mozilla/5.0')
#發送請求獲取結果
response = urllib2.urlopen(request)

(3)添加特殊情景的處理器

處理用戶登錄才能訪問的情況，添加Cookie

或者需要代理才能訪問使用ProxyHandler

或者需要使用https請求

2.網頁解析器

1.作用:

從網頁中提取有價值數據的工具

以HTML網頁字符串為輸入信息，輸出有價值的數據和新的待爬取url列表

網頁解析器種類

1、正則表達式將下載好的HTML字符串用正則表達式匹配解析，適用於簡單的網頁解析字符串形式的模糊匹配

2、html.parser python自帶模塊

3、BeautifulSoup 第三方插件

4、xml 第三方插件

原理是解析成DOM樹:

2.BeautifulSoup簡介及使用方法:

1.簡介:

BeautifulSoup:Python第三方庫，用於從HTML或XML中提取數據

安裝並測試beautifulsoup

方法1：-安裝：pip install beautifulsoup4

-測試：import bs4

方法2：pycharm--File--settings--Project Interpreter--添加beautifulsoup4

2.語法介紹:

根據HTML網頁字符串可以創建BeautifulSoup對象，創建好之後已經加載完DOM樹

即可進行節點搜索：find_all、find。搜索出所有/第一個滿足要求的節點（可按照節點名稱、屬性、文字進行搜索）

得到節點之後可以訪問節點名稱、屬性、文字

如：

<a href="123.html" class="aaa">Python</a>

可根據：

節點名稱：a

節點屬性：href="123.html" class="aaa"

節點內容：Python

創建BeautifulSoup對象：

from bs4 import BeautifulSoup

#根據下載好的HTML網頁字符串創建BeautifulSoup對象

soup = BeautifulSoup(

html_doc,#HTML文檔字符串

'html.parser'#HTML解析器

from_encoding='utf-8'#HTML文檔編碼

)

搜索節點：

方法：find_all(name,attrs,string)

#查找所有標籤為a的節點

soup.find_all('a')

#查找所有標籤為a，鏈接符合/view/123.html形式的節點

soup.find_all('a',href='/view/123.html')

soup.find('a',href=re.compile('aaa')) #用正則表達式匹配內容

#查找所有標籤為div，class為abc，文字為Python的節點

soup.find_all('div',class_='abc',string='Python') #class是Python關鍵字避免衝突

由於class是python的關鍵字，所以講class屬性加了個下劃線。

Accessing node information:

Given the node: <a href="123.html" class="aaa">Python</a>

# get the tag name of the node
node.name

# get the href attribute of the node
node['href']

# get the link text of the node
node.get_text()
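
The create, search, and access steps can be tried end to end on a small document. The sketch below assumes beautifulsoup4 is installed; the html_doc string is invented for the example and reuses the sample node from the text.

```python
import re

from bs4 import BeautifulSoup

html_doc = """
<div class="abc">Python</div>
<a href="123.html" class="aaa">Python</a>
<a href="/view/123.html">Old entry</a>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# find returns the first match; here the regex is applied to the class attribute
node = soup.find("a", class_=re.compile("aaa"))
print(node.name)        # a
print(node["href"])     # 123.html
print(node.get_text())  # Python

# find_all returns every match as a list
exact = soup.find_all("a", href="/view/123.html")
divs = soup.find_all("div", class_="abc", string="Python")
print(len(exact), len(divs))  # 1 1
```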

IV. Implementation:

spider.py

# Entry point and scheduler of the crawler
from baike import url_manager, html_downloader, html_parser, html_outputer


class SpiderMain(object):
    def __init__(self):
        self.urlManager = url_manager.UrlManager()
        self.downloader = html_downloader.HtmlDownLoader()
        self.parser = html_parser.HtmpParser()
        self.outputer = html_outputer.HtmlOutpter()

    def craw(self, url):
        count = 1  # counts crawled pages; the loop stops after 5
        self.urlManager.add_new_url(url)
        while self.urlManager.has_new_url():
            try:
                # take one URL
                new_url = self.urlManager.get_new_url()
                # request the URL and get the data returned by the site
                html_content = self.downloader.download(new_url)
                new_urls, new_datas = self.parser.parse(new_url, html_content)
                self.urlManager.add_new_urls(new_urls)
                self.outputer.collect_data(new_datas)
                print(count)
                if count == 5:
                    break
                count = count + 1
            except Exception as e:
                print("An error occurred", e)
        # write the crawl results to an HTML file
        self.outputer.out_html()


if __name__ == "__main__":
    url = 'https://baike.baidu.com/item/Python/407313'
    sm = SpiderMain()
    sm.craw(url)

url_manager.py

# URL manager
class UrlManager(object):
    def __init__(self):
        # two sets: one for URLs not yet crawled, one for URLs already visited
        self.new_urls = set()
        self.old_urls = set()

    # add a single URL
    def add_new_url(self, url):
        if url is None:
            return None
        if url not in self.new_urls and url not in self.old_urls:
            self.new_urls.add(url)

    # check whether URLs remain to be crawled (based on the size of new_urls)
    def has_new_url(self):
        return len(self.new_urls) != 0

    # get one new URL
    def get_new_url(self):
        if len(self.new_urls) > 0:
            # pop one URL from new_urls and add it to old_urls
            new_url = self.new_urls.pop()
            self.old_urls.add(new_url)
            return new_url

    # add URLs in bulk
    def add_new_urls(self, new_urls):
        if new_urls is None:
            return
        for url in new_urls:
            self.add_new_url(url)
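
One property of the set-based manager is that set.pop() returns an arbitrary element, so the crawl order is not predictable. If a first-in, first-out order is preferred, a variant along the following lines could be used instead. This is a swapped-in alternative, not the article's implementation, and the class name FifoUrlManager is hypothetical.

```python
from collections import deque


class FifoUrlManager:
    """Hands out URLs in the order they were added and never repeats one."""

    def __init__(self):
        self.queue = deque()  # URLs waiting to be crawled, oldest first
        self.seen = set()     # every URL ever added, for de-duplication

    def add_new_url(self, url):
        if url is None or url in self.seen:
            return
        self.seen.add(url)
        self.queue.append(url)

    def add_new_urls(self, urls):
        for url in urls or []:
            self.add_new_url(url)

    def has_new_url(self):
        return bool(self.queue)

    def get_new_url(self):
        # popleft gives breadth-first order: pages found first are crawled first
        return self.queue.popleft() if self.queue else None


manager = FifoUrlManager()
manager.add_new_urls(["/item/a", "/item/b", "/item/a"])
print(manager.get_new_url())  # /item/a
print(manager.get_new_url())  # /item/b
print(manager.has_new_url())  # False
```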

html_downloader.py

# Class that downloads a web page
import urllib.request


class HtmlDownLoader(object):
    def download(self, url):
        if url is None:
            return
        # request the URL
        response = urllib.request.urlopen(url)
        # any status code other than 200 is treated as a failure
        if response.getcode() != 200:
            return
        return response.read()

html_parser.py

# Web page parser class
import re
import urllib.parse
from bs4 import BeautifulSoup


class HtmpParser(object):
    # parse a downloaded page
    def parse(self, new_url, html_content):
        if html_content is None:
            return
        soup = BeautifulSoup(html_content, 'html.parser', from_encoding='utf-8')
        new_urls = self.get_new_urls(new_url, soup)
        new_datas = self.get_new_datas(new_url, soup)
        return new_urls, new_datas

    # collect the new URLs found on the page
    def get_new_urls(self, new_url, soup):
        new_urls = set()
        # find the a tags whose href contains /item
        links = soup.find_all('a', href=re.compile(r'/item'))
        for link in links:
            # get the href attribute of the a tag
            url = link['href']
            # join the URLs so the crawled path becomes a full http://... address
            new_full_url = urllib.parse.urljoin(new_url, url)
            new_urls.add(new_full_url)
        return new_urls

    # collect the data of the page
    def get_new_datas(self, new_url, soup):
        new_datas = {}
        # get the title
        title_node = soup.find('dd', class_='lemmaWgt-lemmaTitle-title').find('h1')
        new_datas['title'] = title_node.get_text()
        # get the summary
        summary_node = soup.find('div', class_='lemma-summary')
        new_datas['summary'] = summary_node.get_text()
        new_datas['url'] = new_url
        return new_datas
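
Two details of get_new_urls are easy to check in isolation: how urljoin turns a relative href into a full address, and how an unanchored pattern such as r'/item' differs from one that requires the href to start with /item/. The hrefs below are made-up examples, and the stricter pattern is a suggestion rather than what the article's code uses.

```python
import re
from urllib.parse import urljoin

page_url = "https://baike.baidu.com/item/Python/407313"

# A root-relative href replaces the whole path of the page URL
print(urljoin(page_url, "/item/Guido"))
# https://baike.baidu.com/item/Guido

# An absolute href is returned unchanged
print(urljoin(page_url, "https://example.com/x"))
# https://example.com/x

# r'/item' matches anywhere in the href; anchoring with ^ keeps only
# hrefs that really start with /item/
loose = re.compile(r"/item")
strict = re.compile(r"^/item/")
href = "/wiki/item-list/item"
print(bool(loose.search(href)), bool(strict.search(href)))  # True False
```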

html_outputer.py
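
spider.py relies on html_outputer.HtmlOutpter and calls two of its methods, collect_data and out_html. A minimal sketch of a class that satisfies those calls is given below. Everything beyond the class and method names is an assumption: the output file name output.html, the out_path parameter, and the table layout are illustrative only.

```python
import html


class HtmlOutpter(object):
    def __init__(self, out_path="output.html"):
        self.out_path = out_path  # assumed output location
        self.datas = []

    def collect_data(self, data):
        # skip pages for which no data was extracted
        if data is None:
            return
        self.datas.append(data)

    def out_html(self):
        rows = []
        for data in self.datas:
            # escape the text so that page content cannot break the table markup
            cells = "".join(
                "<td>%s</td>" % html.escape(data[key])
                for key in ("url", "title", "summary")
            )
            rows.append("<tr>%s</tr>" % cells)
        page = (
            '<html><head><meta charset="utf-8"></head>'
            "<body><table>%s</table></body></html>" % "".join(rows)
        )
        with open(self.out_path, "w", encoding="utf-8") as f:
            f.write(page)
```

Each collected dictionary is expected to carry the url, title, and summary keys that get_new_datas fills in.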

Run the main function of spider.py (the extracted results are saved to an HTML file).
