Recognizing CAPTCHAs with Tesseract in Python

Python Google X86 半禿頭的程序猿 2019-07-18

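Preface

When scraping data from a website, you will often be asked to enter a CAPTCHA, whether because you are sending requests too frequently or for some other reason. There are generally three ways to deal with this:

First, the simplest but most time-consuming: enter the CAPTCHA by hand;
Second, use a third-party company's API to recognize and submit the CAPTCHA;
Third, use Tesseract to recognize the CAPTCHA.

Here we use Tesseract to recognize the CAPTCHA.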

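Introduction to Tesseract

Tesseract is an open-source OCR engine, long developed under Google's sponsorship. It can be trained on new languages and supports Chinese recognition (the language pack must be downloaded separately).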

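Using Tesseract in Python

Setting up Tesseract for use from Python takes three steps:

1. Install pytesseract and its dependencies with pip

pip install pytesseract

pytesseract needs to read images, so Pillow must be installed as well (pip install Pillow).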
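As a quick sanity check after the pip step, a minimal sketch like the following reports whether the two packages can be found. The helper name `check_packages` is made up for illustration; note that Pillow is imported under the module name `PIL`.

```python
import importlib.util


def check_packages(names=("pytesseract", "PIL")):
    """Return a dict mapping each module name to True if it can be found."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


if __name__ == "__main__":
    for name, found in check_packages().items():
        print(f"{name}: {'installed' if found else 'missing'}")
```

If either entry is reported as missing, rerun the corresponding pip install command in the same Python environment that will run the scraper.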

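2. Install Tesseract

Download and install: https://tesseract-ocr.googlecode.com/files/tesseract-ocr-setup-3.02.02.exe

Google Code has since been shut down, so this link may no longer work; the project is now hosted on GitHub under tesseract-ocr.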

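3. Modify pytesseract.py

To avoid the "no matching file" error, point tesseract_cmd at the installed executable:

# tesseract_cmd = 'tesseract'
tesseract_cmd = "C:/Program Files (x86)/Tesseract-OCR/tesseract.exe"  # Tesseract install location

To avoid Unicode encoding errors, open the output file as UTF-8:

# f = open(output_file_name)
f = open(output_file_name, encoding='utf-8')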
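Editing the installed library file is fragile, since the change is lost whenever the package is upgraded. As an alternative sketch, pytesseract exposes the same setting as the module attribute `pytesseract.pytesseract.tesseract_cmd`, which can be assigned at runtime. The helper names `find_tesseract` and `configure_pytesseract` and the candidate list are assumptions for illustration; the Windows path is simply the one used in the step above.

```python
import os
import shutil

# Install location from the step above; other machines may differ (assumption).
DEFAULT_CANDIDATES = (
    "C:/Program Files (x86)/Tesseract-OCR/tesseract.exe",
)


def find_tesseract(candidates=DEFAULT_CANDIDATES):
    """Return a path to the tesseract executable, or None if it cannot be found."""
    # Prefer whatever is already on PATH.
    on_path = shutil.which("tesseract")
    if on_path:
        return on_path
    # Otherwise fall back to the known install locations.
    for path in candidates:
        if os.path.isfile(path):
            return path
    return None


def configure_pytesseract(candidates=DEFAULT_CANDIDATES):
    """Point pytesseract at the executable without editing pytesseract.py."""
    import pytesseract  # imported lazily so find_tesseract works without it

    cmd = find_tesseract(candidates)
    if cmd is None:
        raise FileNotFoundError("tesseract executable not found")
    pytesseract.pytesseract.tesseract_cmd = cmd
    return cmd
```

Calling `configure_pytesseract()` once at program start, before any OCR call, would then take the place of the first manual edit.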

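With these three steps done, Tesseract's basic features are ready to use.

Now let's see how to use Tesseract for CAPTCHA recognition in real code.

The original CAPTCHA image is: [image: original CAPTCHA]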

The sample code for CAPTCHA recognition is:

# coding: utf-8
'''
CAPTCHA recognition
'''
from PIL import Image
import pytesseract

# Binarization lookup table: values below the threshold map to 0 (black), the rest to 1 (white)
threshold = 140
table = []
for i in range(256):
    if i < threshold:
        table.append(0)
    else:
        table.append(1)


# Recognize the CAPTCHA
def get_vcode():
    # Open the original image
    image = Image.open("getimgbysig.jpg")
    # image = Image.open("e:/a.jpg")
    # Convert the image to grayscale and save a copy
    bimage = image.convert('L')
    bimage.save('g' + "getimgbysig.jpg")
    # Binarize the grayscale image and save a copy
    out = bimage.point(table, '1')
    out.save('b' + "getimgbysig.jpg")
    # Run OCR on the original, grayscale, and binarized images
    icode = pytesseract.image_to_string(image)
    bcode = pytesseract.image_to_string(bimage)
    vcode = pytesseract.image_to_string(out)
    print(icode, bcode, vcode)


if __name__ == '__main__':
    get_vcode()

The output is 7364, so the CAPTCHA was recognized successfully.
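To make the binarization step in the code above concrete, here is a small sketch in plain Python that needs no imaging library. It builds the same 256-entry lookup table and applies it to a handful of made-up grayscale values; `build_table`, `binarize`, and the sample pixels are illustrative names and data only. In the real script, `point(table, '1')` performs this mapping for every pixel of the grayscale image.

```python
def build_table(threshold=140):
    """Lookup table: grayscale values below the threshold map to 0, others to 1."""
    return [0 if value < threshold else 1 for value in range(256)]


def binarize(pixels, threshold=140):
    """Apply the lookup table to an iterable of 0-255 grayscale values."""
    table = build_table(threshold)
    return [table[p] for p in pixels]


if __name__ == "__main__":
    sample = [0, 50, 139, 140, 200, 255]  # made-up grayscale values
    print(binarize(sample))  # [0, 0, 0, 1, 1, 1]
```

The threshold of 140 comes from the script above; a different CAPTCHA style may need a different value, which is easiest to tune by inspecting the saved binarized image.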

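For simple, clear digits, an untrained Tesseract can still recognize them quite accurately. For blurry or distorted digits, letters, or Chinese characters, Tesseract needs to be trained first, which is a topic for another time.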
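Short of training, two inexpensive measures may help with digit-only CAPTCHAs: telling Tesseract to expect a single line of digits, and stripping stray characters from the result. The sketch below is an illustration under assumptions, not the method used in this article. `--psm 7` and `tessedit_char_whitelist` are standard Tesseract options passed through pytesseract's `config` argument, but whether the whitelist is honored depends on the Tesseract version and engine. `clean_code` and `recognize_digits` are hypothetical helper names.

```python
import re

# Tesseract options: treat the image as one text line and allow digits only.
# "--psm" is the Tesseract 4+ spelling; 3.x releases use "-psm".
DIGITS_CONFIG = "--psm 7 -c tessedit_char_whitelist=0123456789"


def clean_code(raw):
    """Drop whitespace and any other non-digit characters from raw OCR output."""
    return re.sub(r"\D", "", raw)


def recognize_digits(image_path, config=DIGITS_CONFIG):
    """Run OCR on an image file and return only the digits found."""
    # Imported lazily so clean_code can be used without the OCR stack installed.
    from PIL import Image
    import pytesseract

    with Image.open(image_path) as image:
        raw = pytesseract.image_to_string(image, config=config)
    return clean_code(raw)
```

The cleanup step is useful on its own, because raw OCR output commonly carries trailing newlines or spaces between characters that would make a submitted CAPTCHA fail.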

