First Step into NLP: 6 Unique Ways to Tokenize Text Data

Python, Natural Language Processing, Falcon, Machine Learning, NLTK, English Reading, SpaceX · 讀芯術 · 2019-09-06

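Are you fascinated by the sheer amount of text data available on the internet? Are you looking for ways to work with that text data but unsure where to begin? After all, machines recognize numbers, not the letters of human language, and that makes text a tricky problem in machine learning.

So how can we manipulate and clean this text data to build a model? The answer lies in the wonderful world of natural language processing (NLP).

Solving an NLP problem is a multi-stage process. Before even thinking about the modeling stage, the unstructured text data has to be cleaned. Cleaning the data involves a few key steps:

• Word tokenization (also known as word segmentation)

• Predicting the part of speech of each token

• Lemmatizing the text

• Identifying and removing stop words, and so on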

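This article covers the first step: tokenization. We first look at what tokenization is and why NLP needs it, and then go through six unique ways to tokenize data in Python.

This article has no prerequisites; anyone interested in NLP or data science can follow along.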

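1. What is tokenization in NLP?

2. Why is tokenization needed in NLP?

3. Different ways to perform tokenization in Python

3.1 Tokenization using Python's split() function

3.2 Tokenization using regular expressions

3.3 Tokenization using NLTK

3.4 Tokenization using spaCy

3.5 Tokenization using Keras

3.6 Tokenization using Gensim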

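1. What is tokenization in NLP?

Tokenization is one of the most common tasks when working with text data. But what does the term "tokenization" actually mean?

Tokenization splits a piece of text, such as a phrase, a sentence, a paragraph or a whole document, into smaller units called tokens.

[Figure: illustration of the definition of tokenization]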

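A token can be a word, a number or a punctuation mark. Tokenization creates these smaller units by locating word boundaries. So what is a word boundary?

A word boundary is the point where one word ends and the next one begins. These tokens are considered the first step towards stemming and lemmatization.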
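
To make the idea of word boundaries concrete, here is a minimal sketch using Python's standard re module. The pattern and the helper name token_spans are illustrative assumptions rather than anything prescribed by the article; the pattern treats each run of word characters, or each single punctuation mark, as one token and reports where it starts and ends.

```python
import re

# Hypothetical helper for illustration only.
# \w+     matches a run of word characters (letters, digits, underscore)
# [^\w\s] matches one character that is neither a word character nor
#         whitespace, in other words a punctuation mark
def token_spans(text):
    """Return (token, start, end) for each word or punctuation mark."""
    return [(m.group(), m.start(), m.end())
            for m in re.finditer(r"\w+|[^\w\s]", text)]

for token, start, end in token_spans("This is a cat."):
    print(f"{token!r} spans characters {start} to {end}")
```

Each start and end pair marks a boundary in the sense described above: the position where one token stops and the next can begin.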

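2. Why is tokenization needed in NLP?

Think about the characteristics of the English language here. Pick any sentence that comes to mind and keep it in mind while reading this section; it will make the importance of tokenization easier to see.

Before text data can be processed, the words that make up a string of characters have to be identified, which is why tokenization is the most basic step in working with text data in NLP. This matters because the meaning of a text can be interpreted by analyzing the words it contains.

Take the following string as an example:

"This is a cat."

What happens after this string is tokenized? We get ['This', 'is', 'a', 'cat'].

This form is useful in many ways. With the tokens we can:

• Count the number of words in the text

• Count the frequency of a word, that is, the number of times a particular word occurs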
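
Both uses can be sketched in a few lines with the standard library. The sample sentence is an assumption chosen for illustration, and split() stands in for whichever tokenizer is eventually used.

```python
from collections import Counter

# Illustrative sample text, tokenized on whitespace
tokens = "the cat sat on the mat because the mat was warm".split()

word_count = len(tokens)        # total number of tokens in the text
frequencies = Counter(tokens)   # maps each token to its number of occurrences

print(word_count)
print(frequencies.most_common(2))
```

Counter is used here because it handles the counting loop and the sorting by frequency in one place; a plain dictionary would work equally well.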

Much more information can be extracted beyond this. Now it is time to dive into the different ways of tokenizing data in NLP.

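3. Ways to perform tokenization in Python

This article presents six unique ways to tokenize text data, with Python code for each method provided for reference.

3.1 Tokenization using Python's split() function

Let's start with the most basic one, the split() method, which returns a list of strings after breaking the given string at the specified separator. By default, split() breaks a string at whitespace (spaces, tabs and newlines). The separator can be changed to anything.

Word tokenization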

text = """Founded in 2002, SpaceX’s mission is to enable humans to become a spacefaring civilization and a multi-planet 
species by building a self-sustaining city on Mars. In 2008, SpaceX’s Falcon 1 became the first privately developed 
liquid-fuel launch vehicle to orbit the Earth."""
# Splits at space 
text.split()

Output : ['Founded', 'in', '2002,', 'SpaceX’s', 'mission', 'is', 'to', 'enable', 'humans', 
 'to', 'become', 'a', 'spacefaring', 'civilization', 'and', 'a', 'multi-planet', 
 'species', 'by', 'building', 'a', 'self-sustaining', 'city', 'on', 'Mars.', 'In', 
 '2008,', 'SpaceX’s', 'Falcon', '1', 'became', 'the', 'first', 'privately', 
 'developed', 'liquid-fuel', 'launch', 'vehicle', 'to', 'orbit', 'the', 'Earth.']

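Sentence tokenization

Sentence tokenization is similar to word tokenization. Looking at the structure of the sentences in the example, a sentence usually ends with a period (.), so the string can be split using a period followed by a space ('. ') as the separator: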

text = """Founded in 2002, SpaceX’s mission is to enable humans to become a spacefaring civilization and a multi-planet 
species by building a self-sustaining city on Mars. In 2008, SpaceX’s Falcon 1 became the first privately developed 
liquid-fuel launch vehicle to orbit the Earth."""
# Splits at '.' 
text.split('. ')

Output : ['Founded in 2002, SpaceX’s mission is to enable humans to become a spacefaring 
 civilization and a multi-planet \nspecies by building a self-sustaining city on 
 Mars', 
 'In 2008, SpaceX’s Falcon 1 became the first privately developed \nliquid-fuel 
 launch vehicle to orbit the Earth.']

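The main drawback of Python's split() function is that it can only use one separator at a time. Note also that in word tokenization, split() does not treat punctuation marks as separate tokens.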
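
One common workaround for the punctuation issue, sketched below under the assumption that punctuation can simply be discarded, is to strip punctuation characters from the ends of each token after splitting. The sample text is a shortened variant of the example, written with a straight apostrophe for illustration.

```python
import string

text = "Founded in 2002, SpaceX's mission is to reach Mars."

# string.punctuation holds the ASCII punctuation characters. strip()
# removes them only from the two ends of each token, so the apostrophe
# inside "SpaceX's" is kept.
tokens = [word.strip(string.punctuation) for word in text.split()]
tokens = [t for t in tokens if t]   # drop tokens that were pure punctuation

print(tokens)
```

This discards punctuation instead of emitting it as separate tokens, and it does not handle non-ASCII marks such as the typographic apostrophe, which is one reason the later sections turn to regular expressions and dedicated libraries.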

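3.2 Tokenization using regular expressions (RegEx)

First, let's understand what a regular expression is. A regular expression is essentially a special sequence of characters that describes a pattern to match in text.

Regular expressions are handled in Python with the re module, which is part of the standard library and ships with every Python installation.

Now let's implement word tokenization and sentence tokenization with RegEx.

Word tokenization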

import re
text = """Founded in 2002, SpaceX’s mission is to enable humans to become a spacefaring civilization and a multi-planet 
species by building a self-sustaining city on Mars. In 2008, SpaceX’s Falcon 1 became the first privately developed 
liquid-fuel launch vehicle to orbit the Earth."""
tokens = re.findall(r"[\w']+", text)
tokens

Output : ['Founded', 'in', '2002', 'SpaceX', 's', 'mission', 'is', 'to', 'enable', 
 'humans', 'to', 'become', 'a', 'spacefaring', 'civilization', 'and', 'a', 
 'multi', 'planet', 'species', 'by', 'building', 'a', 'self', 'sustaining', 
 'city', 'on', 'Mars', 'In', '2008', 'SpaceX', 's', 'Falcon', '1', 'became', 
 'the', 'first', 'privately', 'developed', 'liquid', 'fuel', 'launch', 'vehicle', 
 'to', 'orbit', 'the', 'Earth']

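The re.findall() function finds all the substrings that match the pattern passed to it and stores them in a list.

"\w" stands for "any word character", which usually means alphanumeric characters (letters and digits) and the underscore (_). "+" means one or more times. So [\w']+ tells the code to collect runs of word characters and straight apostrophes until any other character is encountered. Because the sample text uses the typographic apostrophe (’), which is not in the character class, "SpaceX’s" is split into 'SpaceX' and 's', and hyphenated words such as "multi-planet" are split at the hyphen.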
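
If possessives and hyphenated words should stay in one piece, the pattern can be extended. The following is a minimal sketch, not the article's own method: the pattern is an assumption that allows a hyphen, a straight apostrophe or a typographic apostrophe inside a token as long as word characters follow it.

```python
import re

text = "SpaceX’s Falcon 1 became the first privately developed liquid-fuel launch vehicle."

# \w+               a run of word characters
# (?:[-'’]\w+)*     optionally followed by a joiner (hyphen or apostrophe)
#                   and more word characters, repeated any number of times
# The group is non-capturing, so findall() returns whole matches.
pattern = r"\w+(?:[-'’]\w+)*"
tokens = re.findall(pattern, text)

print(tokens)
```

Whether "liquid-fuel" should be one token or two depends on the downstream task, so this is a design choice and not a correction of the simpler pattern.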

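Sentence tokenization

To perform sentence tokenization, the re.split() function (or the equivalent split() method of a compiled pattern) can be used. It splits the text into sentences wherever the pattern passed to it matches.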

import re
text = """Founded in 2002, SpaceX’s mission is to enable humans to become a spacefaring civilization and a multi-planet 
species by building a self-sustaining city on Mars. In 2008, SpaceX’s Falcon 1 became the first privately developed 
liquid-fuel launch vehicle to orbit the Earth."""
# Split wherever '.', '!' or '?' is followed by a space; the matched separator is dropped
sentences = re.compile('[.!?] ').split(text)
sentences

Output : ['Founded in 2002, SpaceX’s mission is to enable humans to become a spacefaring 
 civilization and a multi-planet \nspecies by building a self-sustaining city on 
 Mars', 
 'In 2008, SpaceX’s Falcon 1 became the first privately developed \nliquid-fuel 
 launch vehicle to orbit the Earth.']

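This approach has an advantage over split() because several separators can be handled at once. In the code above, re.compile() is given the character class [.!?] followed by a space, so the text is split wherever any of these characters is followed by a space.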
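
Splitting on a pattern removes whatever the pattern matches, including the sentence-ending mark. A possible variation, sketched here with an assumed sample text, uses a lookbehind so that the punctuation stays attached to its sentence.

```python
import re

text = "Is Mars next? SpaceX thinks so! Falcon 1 reached orbit in 2008."

# (?<=[.!?]) is a lookbehind: it requires a sentence-ending mark just
# before the split point without consuming it, so only the whitespace
# matched by \s+ is removed from the result.
sentences = re.split(r"(?<=[.!?])\s+", text)

print(sentences)
```

Like the simpler pattern, this is only a heuristic: abbreviations such as "Dr. Smith" would still be split in the wrong place.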

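3.3 Tokenization using NLTK

If you work with text data regularly, NLTK is a library worth using. NLTK, short for Natural Language ToolKit, is a library written in Python for symbolic and statistical natural language processing.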

NLTK can be installed with the following command:

pip install --user -U nltk

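NLTK contains a module called tokenize, which provides two kinds of tokenization:

• Word tokenization: the word_tokenize() method splits a sentence into tokens or words

• Sentence tokenization: the sent_tokenize() method splits a document or paragraph into sentences

The code for each of the two is shown below.

Word tokenization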

from nltk.tokenize import word_tokenize 
text = """Founded in 2002, SpaceX’s mission is to enable humans to become a spacefaring civilization and a multi-planet 
species by building a self-sustaining city on Mars. In 2008, SpaceX’s Falcon 1 became the first privately developed 
liquid-fuel launch vehicle to orbit the Earth."""
word_tokenize(text)

Output: ['Founded', 'in', '2002', ',', 'SpaceX', '’', 's', 'mission', 'is', 'to', 'enable', 
 'humans', 'to', 'become', 'a', 'spacefaring', 'civilization', 'and', 'a', 
 'multi-planet', 'species', 'by', 'building', 'a', 'self-sustaining', 'city', 'on', 
 'Mars', '.', 'In', '2008', ',', 'SpaceX', '’', 's', 'Falcon', '1', 'became', 
 'the', 'first', 'privately', 'developed', 'liquid-fuel', 'launch', 'vehicle', 
 'to', 'orbit', 'the', 'Earth', '.']

Notice how NLTK treats punctuation marks as tokens? For later tasks, the punctuation has to be removed from this initial list.
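
Removing the punctuation can be done with a simple filter. The sketch below is one possible approach, not an NLTK feature: the token list is hard-coded (a shortened version of the output above) so that it runs without NLTK installed, and the typographic apostrophe is added to the punctuation set by hand because string.punctuation covers ASCII only.

```python
import string

# Tokens as a tokenizer that separates punctuation might return them
tokens = ['Founded', 'in', '2002', ',', 'SpaceX', '’', 's', 'mission', '.']

# Treat ASCII punctuation plus the typographic apostrophe as punctuation
punctuation = set(string.punctuation) | {'’'}

# Keep a token unless every one of its characters is punctuation, so
# tokens such as 'multi-planet' would survive the filter
words = [t for t in tokens if not all(ch in punctuation for ch in t)]

print(words)
```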

Sentence tokenization

from nltk.tokenize import sent_tokenize
text = """Founded in 2002, SpaceX’s mission is to enable humans to become a spacefaring civilization and a multi-planet 
species by building a self-sustaining city on Mars. In 2008, SpaceX’s Falcon 1 became the first privately developed 
liquid-fuel launch vehicle to orbit the Earth."""
sent_tokenize(text)

Output: ['Founded in 2002, SpaceX’s mission is to enable humans to become a spacefaring 
 civilization and a multi-planet \nspecies by building a self-sustaining city on 
 Mars.', 
 'In 2008, SpaceX’s Falcon 1 became the first privately developed \nliquid-fuel 
 launch vehicle to orbit the Earth.']

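3.4 Tokenization using the spaCy library

spaCy is an open-source library for advanced natural language processing (NLP). It supports more than 49 languages and is built for fast processing.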

Installing spaCy on Linux:

pip install -U spacy

python -m spacy download en

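For installation on other operating systems, see https://spacy.io/usage. In spaCy 3.x the en shortcut is deprecated in favor of the full package name en_core_web_sm; the examples below use only the blank English class, which needs no downloaded model.

Now let's see how to use spaCy to perform tokenization. We will use spacy.lang.en, which supports English.

Word tokenization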

from spacy.lang.en import English
# Create a blank English pipeline; it contains only the tokenizer
nlp = English()
text = """Founded in 2002, SpaceX’s mission is to enable humans to become a spacefaring civilization and a multi-planet 
species by building a self-sustaining city on Mars. In 2008, SpaceX’s Falcon 1 became the first privately developed 
liquid-fuel launch vehicle to orbit the Earth."""
# The "nlp" object is used to create a Doc of tokens from the text
my_doc = nlp(text)
# Create list of word tokens
token_list = []
for token in my_doc:
    token_list.append(token.text)
token_list

Output : ['Founded', 'in', '2002', ',', 'SpaceX', '’s', 'mission', 'is', 'to', 'enable', 
 'humans', 'to', 'become', 'a', 'spacefaring', 'civilization', 'and', 'a', 
 'multi', '-', 'planet', '\n', 'species', 'by', 'building', 'a', 'self', '-', 
 'sustaining', 'city', 'on', 'Mars', '.', 'In', '2008', ',', 'SpaceX', '’s', 
 'Falcon', '1', 'became', 'the', 'first', 'privately', 'developed', '\n', 
 'liquid', '-', 'fuel', 'launch', 'vehicle', 'to', 'orbit', 'the', 'Earth', '.']

Sentence tokenization

from spacy.lang.en import English
# Create a blank English pipeline; it contains only the tokenizer
nlp = English()
# Create the pipeline 'sentencizer' component (spaCy 2.x API;
# in spaCy 3.x, call nlp.add_pipe('sentencizer') instead of the next two calls)
sbd = nlp.create_pipe('sentencizer')
# Add the component to the pipeline
nlp.add_pipe(sbd)
text = """Founded in 2002, SpaceX’s mission is to enable humans to become a spacefaring civilization and a multi-planet 
species by building a self-sustaining city on Mars. In 2008, SpaceX’s Falcon 1 became the first privately developed 
liquid-fuel launch vehicle to orbit the Earth."""
# The "nlp" object is used to create a Doc with sentence boundaries set
doc = nlp(text)
# Create list of sentence tokens
sents_list = []
for sent in doc.sents:
    sents_list.append(sent.text)
sents_list

Output : ['Founded in 2002, SpaceX’s mission is to enable humans to become a spacefaring 
 civilization and a multi-planet \nspecies by building a self-sustaining city on 
 Mars.', 
 'In 2008, SpaceX’s Falcon 1 became the first privately developed \nliquid-fuel 
 launch vehicle to orbit the Earth.']

spaCy is very fast compared with other libraries when performing NLP tasks (even faster than NLTK). The following DataHack Radio episode covers how spaCy was created and where it can be used:

• DataHack Radio #23: Ines Montani and Matthew Honnibal – The Brains behind spaCy

Link: https://www.analyticsvidhya.com/blog/2019/06/datahack-radio-ines-montani-matthew-honnibal-brains-behind-spacy/?utm_source=blog&utm_medium=how-get-started-nlp-6-unique-ways-perform-tokenization

Here is an in-depth introductory tutorial on spaCy:

• Natural Language Processing Made Easy – using SpaCy (in Python)

Link: https://www.analyticsvidhya.com/blog/2017/04/natural-language-processing-made-easy-using-spacy-%E2%80%8Bin-python/?utm_source=blog&utm_medium=how-get-started-nlp-6-unique-ways-perform-tokenization

3.5 Tokenization using Keras

Keras is currently one of the most popular deep learning frameworks in the industry. It is an open-source neural network library for Python that is easy to use and can run on top of TensorFlow.

In an NLP context, Keras can be used to clean the unstructured text data that is typically collected.

那麼如何操縱和清理這些文本數據來構建模型呢？答案就在自然語言處理（NLP）的奇妙世界裡。

解決NLP問題是一個多階段的過程。在考慮進入建模階段之前，需要先清理非結構化文本數據。清理數據包括以下幾個關鍵步驟：

• 詞標記（也稱分詞）

• 預測每個token的詞性

• 文本詞形還原

• 識別和刪除停用詞等等

本文將討論第一步——標記。首先看看什麼是標記以及NLP中需要標記的原因，並瞭解在Python中標記數據的六種獨特方法。

本文不設前提條件，任何對NLP或數據科學感興趣的人都能上手。

1. NLP中的標記指的是什麼？

2. 為什麼NLP中需要標記？

3. 在Python中執行標記的不同方法

3.1 使用Python split()函數進行標記

3.2 使用正則表達式進行標記

3.3 使用NLTK進行標記

3.4 使用Spacy進行標記

3.5 使用Keras進行標記

3.6 使用Gensim進行標記

1. NLP中的標記指的是什麼？

在處理文本數據時，標記是最常見的任務之一。但“標記”一詞究竟意味著什麼呢？

通過下面圖像，可以更直觀地瞭解該定義：

token可以是單詞、數字或標點符號。在分詞中，通過定位詞邊界來創建較小的單元。那麼詞邊界是什麼？

詞邊界是一個單詞的結束點和下一個單詞的開頭。這些token被認為是詞幹化和詞形還原的第一步。

2. 為什麼NLP中需要標記？

在此應考慮一下英語語言的特性。說出能想到的任何句子，並在閱讀本節時牢記這一點。這有助於用更簡單的方式理解標記的重要性。

舉個例子，思考以下字符串：

“This is a cat.”

在標記該字符串後會發生什麼？可得到['This'，'is'，'a'，cat']。

這樣做有很多用途，可以使用這種標記形式：

• 計算文本中的單詞數

• 計算單詞的頻率，即特定單詞的出現次數

除此之外，還可以提取更多信息。現在，是時候深入瞭解在NLP中標記數據的不同方法。

3. 在Python中標記的方法

本文介紹了標記文本數據的六種獨特方式，併為每種方法提供了Python代碼，僅供參考。

3.1 使用Python的split()函數進行標記

詞標記

text = """Founded in 2002, SpaceX’s mission is to enable humans to become a spacefaring civilization and a multi-planet 
species by building a self-sustaining city on Mars. In 2008, SpaceX’s Falcon 1 became the first privately developed 
liquid-fuel launch vehicle to orbit the Earth."""
# Splits at space 
text.split()

Output : ['Founded', 'in', '2002,', 'SpaceX’s', 'mission', 'is', 'to', 'enable', 'humans', 
 'to', 'become', 'a', 'spacefaring', 'civilization', 'and', 'a', 'multi-planet', 
 'species', 'by', 'building', 'a', 'self-sustaining', 'city', 'on', 'Mars.', 'In', 
 '2008,', 'SpaceX’s', 'Falcon', '1', 'became', 'the', 'first', 'privately', 
 'developed', 'liquid-fuel', 'launch', 'vehicle', 'to', 'orbit', 'the', 'Earth.']

句子標記

句子標記類似於詞標記。分析例子中句子的結構，發現句子通常以句號（.）結束，因此可以用“.”作為分隔符拆分字符串：

text = """Founded in 2002, SpaceX’s mission is to enable humans to become a spacefaring civilization and a multi-planet 
species by building a self-sustaining city on Mars. In 2008, SpaceX’s Falcon 1 became the first privately developed 
liquid-fuel launch vehicle to orbit the Earth."""
# Splits at '.' 
text.split('. ')

Output : ['Founded in 2002, SpaceX’s mission is to enable humans to become a spacefaring 
 civilization and a multi-planet \\nspecies by building a self-sustaining city on 
 Mars', 
 'In 2008, SpaceX’s Falcon 1 became the first privately developed \\nliquid-fuel 
 launch vehicle to orbit the Earth.']

使用Python的split()函數的主要缺點是一次只能使用一個分隔符。另外需要注意的是,在詞標記中，split()並未將標點符號視為單獨的標記。

3.2 使用正則表達式（RegEx）進行標記

首先要理解什麼是正則表達式。正則表達式基本上是一個特殊的字符序列，可以幫助匹配文本

可以使用Python中的RegEx庫來處理正則表達式，該庫預裝了Python安裝包。

現在，記住用RegEx實現詞標記和句子標記。

詞標記

import re
text = """Founded in 2002, SpaceX’s mission is to enable humans to become a spacefaring civilization and a multi-planet 
species by building a self-sustaining city on Mars. In 2008, SpaceX’s Falcon 1 became the first privately developed 
liquid-fuel launch vehicle to orbit the Earth."""
tokens = re.findall("[\\w']+", text)
tokens

Output : ['Founded', 'in', '2002', 'SpaceX', 's', 'mission', 'is', 'to', 'enable', 
 'humans', 'to', 'become', 'a', 'spacefaring', 'civilization', 'and', 'a', 
 'multi', 'planet', 'species', 'by', 'building', 'a', 'self', 'sustaining', 
 'city', 'on', 'Mars', 'In', '2008', 'SpaceX', 's', 'Falcon', '1', 'became', 
 'the', 'first', 'privately', 'developed', 'liquid', 'fuel', 'launch', 'vehicle', 
 'to', 'orbit', 'the', 'Earth']

re.findall()函數找到所有匹配其模式的單詞並將其存儲在列表中。

句子標記

要執行句子標記，可以使用re.split()函數。該函數通過輸入某個模式將文本拆分為句子。

import re
text = """Founded in 2002, SpaceX’s mission is to enable humans to become a spacefaring civilization and a multi-planet 
species by building a self-sustaining city on, Mars. In 2008, SpaceX’s Falcon 1 became the first privately developed 
liquid-fuel launch vehicle to orbit the Earth."""
sentences = re.compile('[.!?] ').split(text
sentences

Output : ['Founded in 2002, SpaceX’s mission is to enable humans to become a spacefaring 
 civilization and a multi-planet \\nspecies by building a self-sustaining city on 
 Mars.', 
 'In 2008, SpaceX’s Falcon 1 became the first privately developed \\nliquid-fuel 
 launch vehicle to orbit the Earth.']

3.3 使用NLTK進行標記

如果經常和文本數據打交道，則應使用NLTK庫。NLTK是Natural Language ToolKit的縮寫，是一個用Python編寫的用於符號和統計自然語言處理的庫。

可使用以下代碼安裝NLTK：

pip install --user -U nltk

NLTK包含一個名為tokenize()的模塊，進一步可分為兩個子類：

• 詞標記：使用word_tokenize()方法將句子拆分為token或單詞

• 句子標記：使用sent_tokenize()方法將文檔或段落拆分為句子

下面一一介紹這兩個子類的代碼

詞標記

from nltk.tokenize import word_tokenize 
text = """Founded in 2002, SpaceX’s mission is to enable humans to become a spacefaring civilization and a multi-planet 
species by building a self-sustaining city on Mars. In 2008, SpaceX’s Falcon 1 became the first privately developed 
liquid-fuel launch vehicle to orbit the Earth."""
word_tokenize(text)

Output: ['Founded', 'in', '2002', ',', 'SpaceX', '’', 's', 'mission', 'is', 'to', 'enable', 
 'humans', 'to', 'become', 'a', 'spacefaring', 'civilization', 'and', 'a', 
 'multi-planet', 'species', 'by', 'building', 'a', 'self-sustaining', 'city', 'on', 
 'Mars', '.', 'In', '2008', ',', 'SpaceX', '’', 's', 'Falcon', '1', 'became', 
 'the', 'first', 'privately', 'developed', 'liquid-fuel', 'launch', 'vehicle', 
 'to', 'orbit', 'the', 'Earth', '.']

看到NLTK是如何將標點符號視為token了嗎？對於之後的任務，需要從初始列表中刪除標點符號。

句子標記

from nltk.tokenize import sent_tokenize
text = """Founded in 2002, SpaceX’s mission is to enable humans to become a spacefaring civilization and a multi-planet 
species by building a self-sustaining city on Mars. In 2008, SpaceX’s Falcon 1 became the first privately developed 
liquid-fuel launch vehicle to orbit the Earth."""
sent_tokenize(text)

Output: ['Founded in 2002, SpaceX’s mission is to enable humans to become a spacefaring 
 civilization and a multi-planet \\nspecies by building a self-sustaining city on 
 Mars.', 
 'In 2008, SpaceX’s Falcon 1 became the first privately developed \\nliquid-fuel 
 launch vehicle to orbit the Earth.']

3.4 使用spaCy庫進行標記

spaCy是一個用於高級自然語言處理（NLP）的開源庫，支持超過49種語言，並提供最快的計算速度。

在Linux中安裝Spacy：

pip install -U spacy

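python -m spacy download en

(On spaCy v3 and later the "en" shortcut is no longer available; the small English model is downloaded with python -m spacy download en_core_web_sm. The examples below use only a blank English pipeline, which does not need a downloaded model.)

For installation on other operating systems, see https://spacy.io/usage.

Now let's see how to use spaCy to perform tokenization. We will use spacy.lang.en, which supports the English language.

Word tokenization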

from spacy.lang.en import English
# Create a blank English pipeline; it contains only the tokenizer
# (no tagger, parser or NER are loaded)
nlp = English()
text = """Founded in 2002, SpaceX’s mission is to enable humans to become a spacefaring civilization and a multi-planet 
species by building a self-sustaining city on Mars. In 2008, SpaceX’s Falcon 1 became the first privately developed 
liquid-fuel launch vehicle to orbit the Earth."""
# Calling the "nlp" object on a string returns a Doc made up of tokens
my_doc = nlp(text)
# Create list of word tokens
token_list = []
for token in my_doc:
    token_list.append(token.text)
token_list

Output : ['Founded', 'in', '2002', ',', 'SpaceX', '’s', 'mission', 'is', 'to', 'enable', 
 'humans', 'to', 'become', 'a', 'spacefaring', 'civilization', 'and', 'a', 
 'multi', '-', 'planet', '\n', 'species', 'by', 'building', 'a', 'self', '-', 
 'sustaining', 'city', 'on', 'Mars', '.', 'In', '2008', ',', 'SpaceX', '’s', 
 'Falcon', '1', 'became', 'the', 'first', 'privately', 'developed', '\n', 
 'liquid', '-', 'fuel', 'launch', 'vehicle', 'to', 'orbit', 'the', 'Earth', '.']

Sentence tokenization

from spacy.lang.en import English
# Create a blank English pipeline; it contains only the tokenizer
nlp = English()
# Create the rule-based 'sentencizer' pipeline component (spaCy v2 API)
sbd = nlp.create_pipe('sentencizer')
# Add the component to the pipeline
# (on spaCy v3 and later, these two steps become nlp.add_pipe('sentencizer'))
nlp.add_pipe(sbd)
text = """Founded in 2002, SpaceX’s mission is to enable humans to become a spacefaring civilization and a multi-planet 
species by building a self-sustaining city on Mars. In 2008, SpaceX’s Falcon 1 became the first privately developed 
liquid-fuel launch vehicle to orbit the Earth."""
# Calling the "nlp" object on a string returns a Doc with sentence boundaries set
doc = nlp(text)
# Create list of sentence strings
sents_list = []
for sent in doc.sents:
    sents_list.append(sent.text)
sents_list

Output : ['Founded in 2002, SpaceX’s mission is to enable humans to become a spacefaring 
 civilization and a multi-planet \nspecies by building a self-sustaining city on 
 Mars.', 
 'In 2008, SpaceX’s Falcon 1 became the first privately developed \nliquid-fuel 
 launch vehicle to orbit the Earth.']

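For NLP tasks, spaCy is very fast compared with other libraries (even faster than NLTK). The following DataHack Radio episode covers how spaCy was created and where it can be used:

• DataHack Radio #23: Ines Montani and Matthew Honnibal – The Brains behind spaCy

Here is an in-depth tutorial for getting started with spaCy: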

• Natural Language Processing Made Easy – using SpaCy (in Python)

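3.5 Tokenization using Keras

Keras is currently one of the most popular deep learning frameworks in the industry. It is an open-source neural network library for Python that is easy to use and can run on top of TensorFlow.

In an NLP context, Keras can be used to clean the unstructured text data that we typically collect.

Keras can be installed with a single command: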

pip install Keras

Word tokenization in Keras uses the text_to_word_sequence function from the keras.preprocessing.text module.

Word tokenization

from keras.preprocessing.text import text_to_word_sequence
# define
text = """Founded in 2002, SpaceX’s mission is to enable humans to become a spacefaring civilization and a multi-planet 
species by building a self-sustaining city on Mars. In 2008, SpaceX’s Falcon 1 became the first privately developed 
liquid-fuel launch vehicle to orbit the Earth."""
# tokenize
result = text_to_word_sequence(text)
result

Output : ['founded', 'in', '2002', 'spacex’s', 'mission', 'is', 'to', 'enable', 'humans', 
 'to', 'become', 'a', 'spacefaring', 'civilization', 'and', 'a', 'multi', 
 'planet', 'species', 'by', 'building', 'a', 'self', 'sustaining', 'city', 'on', 
 'mars', 'in', '2008', 'spacex’s', 'falcon', '1', 'became', 'the', 'first', 
 'privately', 'developed', 'liquid', 'fuel', 'launch', 'vehicle', 'to', 'orbit', 
 'the', 'earth']

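Keras converts all letters to lowercase before tokenizing the text, which saves a separate normalization step. As the output shows, it also drops the punctuation marks and splits hyphenated words.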
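
To make the lowercasing and punctuation handling concrete, here is a rough pure-Python approximation of what text_to_word_sequence does with its default arguments: lowercase the text, turn each filtered character into a space, then split on spaces. The function name simple_word_sequence and the FILTERS constant are assumptions made for this sketch (the constant mirrors the documented default filters string in Keras). It is not the Keras implementation itself.

```python
# Characters treated as separators; mirrors the documented default `filters`
# argument of Keras's text_to_word_sequence (an assumption of this sketch).
FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'

def simple_word_sequence(text, filters=FILTERS, lower=True, split=" "):
    """Lowercase, replace filtered characters with `split`, then split."""
    if lower:
        text = text.lower()
    # Map every filtered character to the split character
    table = str.maketrans({ch: split for ch in filters})
    text = text.translate(table)
    # Drop the empty strings produced by consecutive separators
    return [tok for tok in text.split(split) if tok]

sample = "In 2008, SpaceX’s Falcon 1 became the first privately developed\nliquid-fuel launch vehicle."
result = simple_word_sequence(sample)
print(result)
```

The curly apostrophe is not in the filter string, which is why 'spacex’s' stays in one piece here, just as it does in the Keras output above.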

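3.6 Tokenization using Gensim

The last tokenization method covered here uses the Gensim library. It is an open-source library for unsupervised topic modeling and natural language processing, designed to automatically extract semantic topics from a given document.

Here is how to install Gensim: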

pip install gensim

The tokenize function used for word tokenization can be imported from the gensim.utils module.

Word tokenization

from gensim.utils import tokenize
text = """Founded in 2002, SpaceX’s mission is to enable humans to become a spacefaring civilization and a multi-planet 
species by building a self-sustaining city on Mars. In 2008, SpaceX’s Falcon 1 became the first privately developed 
liquid-fuel launch vehicle to orbit the Earth."""
list(tokenize(text))

Output : ['Founded', 'in', 'SpaceX', 's', 'mission', 'is', 'to', 'enable', 'humans', 'to', 
 'become', 'a', 'spacefaring', 'civilization', 'and', 'a', 'multi', 'planet', 
 'species', 'by', 'building', 'a', 'self', 'sustaining', 'city', 'on', 'Mars', 
 'In', 'SpaceX', 's', 'Falcon', 'became', 'the', 'first', 'privately', 
 'developed', 'liquid', 'fuel', 'launch', 'vehicle', 'to', 'orbit', 'the', 
 'Earth']

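Sentence tokenization

Sentence tokenization uses the split_sentences function from the gensim.summarization.textcleaner module (the summarization package ships with Gensim 3.x and was removed in Gensim 4.0):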

from gensim.summarization.textcleaner import split_sentences
text = """Founded in 2002, SpaceX’s mission is to enable humans to become a spacefaring civilization and a multi-planet 
species by building a self-sustaining city on Mars. In 2008, SpaceX’s Falcon 1 became the first privately developed 
liquid-fuel launch vehicle to orbit the Earth."""
result = split_sentences(text)
result

Output : ['Founded in 2002, SpaceX’s mission is to enable humans to become a spacefaring 
 civilization and a multi-planet ', 
 'species by building a self-sustaining city on Mars.', 
 'In 2008, SpaceX’s Falcon 1 became the first privately developed ', 
 'liquid-fuel launch vehicle to orbit the Earth.']

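Gensim is quite strict with punctuation: it splits wherever a punctuation mark is encountered, and the word tokenizer above also left out the numeric tokens ('2002', '2008' and '1'). In sentence splitting, Gensim also breaks the text at "\n", whereas the other libraries shown here ignored it.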
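
The behaviour seen in the two Gensim outputs above, where digits and punctuation never appear as word tokens and a line break ends a sentence, can be imitated with the standard re module. This is only a sketch of the observed behaviour, not Gensim's own code: the function names and both patterns are assumptions chosen for illustration.

```python
import re

def alphabetic_tokens(text):
    """Return runs of letters only; digits and punctuation act as separators."""
    # [^\W\d_] matches a word character that is neither a digit nor "_"
    return re.findall(r"[^\W\d_]+", text)

def split_on_newlines_and_stops(text):
    """Split at line breaks and after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"\n|(?<=[.!?])\s+", text)
    return [p for p in parts if p]

sample = ("SpaceX’s Falcon 1 became the first privately developed \n"
          "liquid-fuel launch vehicle. It reached orbit in 2008.")
word_tokens = alphabetic_tokens(sample)
sentence_parts = split_on_newlines_and_stops(sample)
print(word_tokens)
print(sentence_parts)
```

Because the line break is treated as a boundary, the first sentence comes back in two pieces, which mirrors the four-element result shown for split_sentences.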

Compiled by: Chen Feng (陳楓), Jiang Xinyi (蔣馨怡)

Related link:

https://www.analyticsvidhya.com/blog/2019/07/how-get-started-nlp-6-unique-ways-perform-tokenization/
