A Comprehensive Guide to Text Summarization Using Deep Learning in Python, Part 2 (with Tutorial)

"


"


Python利用深度學習進行文本摘要的綜合指南下篇(附教程)

Author: ARAVIND PAI

Translator: 和中華

Proofreader: 申利彬

This article is about 7,500 words; the suggested reading time is 15 minutes.

It describes how to build a text summarization model using seq2seq, along with the attention mechanism involved, and provides a complete model implementation written in Keras.

Introduction

"I don't want a full report, just give me a summary of the results." I have found myself in this situation frequently, both in college and in the workplace. We prepare a comprehensive report, but the teacher/supervisor only has time to read the summary.

Sound familiar? Well, I decided to do something about it. Manually converting a report into a summary is too time-consuming, right? Could I rely on Natural Language Processing (NLP) techniques to help me out?

Natural Language Processing (NLP)

https://courses.analyticsvidhya.com/courses/natural-language-processing-nlp?utm_source=blog&utm_medium=comprehensive-guide-text-summarization-using-deep-learning-python

This is where text summarization using deep learning really helped me out. It solves the problem that had been troubling me: now our model can understand the context of the entire text. It is a dream come true for all of us who need a quick summary of a document!

"


Python利用深度學習進行文本摘要的綜合指南下篇(附教程)

作者:ARAVIND PAI

翻譯:和中華

校對:申利彬

本文約7500字,建議閱讀15分鐘。

本文介紹瞭如何利用seq2seq來建立一個文本摘要模型,以及其中的注意力機制。並利用Keras搭建編寫了一個完整的模型代碼。

介紹

“我不想要完整的報告,只需給我一個結果摘要”。我發現自己經常處於這種狀況——無論是在大學還是在職場中。我們準備了一份綜合全面的報告,但教師/主管卻僅僅有時間閱讀摘要。

聽起來很熟悉?好吧,我決定對此採取一些措施。手動將報告轉換為摘要太耗費時間了,對吧?那我可以依靠自然語言處理(NLP)技術來幫忙嗎?

自然語言處理(NLP)

https://courses.analyticsvidhya.com/courses/natural-language-processing-nlp?utm_source=blog&utm_medium=comprehensive-guide-text-summarization-using-deep-learning-python

這就是使用深度學習的文本摘要真正幫助我的地方。它解決了以前一直困擾著我的問題——現在我們的模型可以理解整個文本的上下文。對於所有需要把文檔快速摘要的人來說,這個夢想已成現實!

Python利用深度學習進行文本摘要的綜合指南下篇(附教程)


How well does text summarization using deep learning actually work? Remarkably well. So, in this article, we will walk step by step through the process of building a text summarizer using deep learning, covering all the concepts required to build it. We will then implement our first text summarization model in Python!

Note: this article requires a basic understanding of a few deep learning concepts. I recommend going through the following articles.

  • A Must-Read Introduction to Sequence Modelling (with use cases)

https://www.analyticsvidhya.com/blog/2018/04/sequence-modelling-an-introduction-with-practical-use-cases/?utm_source=blog&utm_medium=comprehensive-guide-text-summarization-using-deep-learning-python

  • Must-Read Tutorial to Learn Sequence Modeling (deeplearning.ai Course #5)

https://www.analyticsvidhya.com/blog/2019/01/sequence-models-deeplearning/?utm_source=blog&utm_medium=comprehensive-guide-text-summarization-using-deep-learning-python

  • Essentials of Deep Learning: Introduction to Long Short Term Memory

https://www.analyticsvidhya.com/blog/2017/12/fundamentals-of-deep-learning-introduction-to-lstm/?utm_source=blog&utm_medium=comprehensive-guide-text-summarization-using-deep-learning-python

Table of Contents

1. What is text summarization in NLP?

2. Introduction to sequence-to-sequence (Seq2Seq) modeling

3. Understanding the encoder-decoder architecture

4. Limitations of the encoder-decoder architecture

5. The intuition behind the attention mechanism

6. Understanding the problem statement

7. Implementing a text summarization model in Python using Keras

8. How does the attention mechanism work?

I have kept the "How does the attention mechanism work?" section for the end of this article. It is a math-heavy section and is not mandatory for understanding how the Python code works. However, I encourage you to read through it, because it will give you a solid grasp of this NLP concept.

Note: this part covers sections 7-8. For sections 1-6, please see: A Comprehensive Guide to Text Summarization Using Deep Learning in Python, Part 1 (with Tutorial).

7. Implementing Text Summarization in Python Using Keras

It's time to fire up our Jupyter notebooks! Let's dive right into the implementation details.

Custom attention layer

Keras does not officially support an attention layer. So we can either implement our own attention layer or use a third-party implementation. We go with the latter in this article.

from attention import AttentionLayer
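A note on this import (my assumption, not something stated in the original): attention.py is not part of Keras itself but a standalone file implementing a Bahdanau-style attention layer. A commonly used version is the attention.py from the open-source attention_keras project; place that file in the same directory as your notebook so that the import resolves.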

Import the libraries

import numpy as np
import pandas as pd
import re
from bs4 import BeautifulSoup
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from nltk.corpus import stopwords
from tensorflow.keras.layers import Input, LSTM, Embedding, Dense, Concatenate, TimeDistributed, Bidirectional
from tensorflow.keras.models import Model
from tensorflow.keras.callbacks import EarlyStopping
import warnings

pd.set_option("display.max_colwidth", 200)
warnings.filterwarnings("ignore")

Read the dataset

This dataset consists of reviews of fine foods from Amazon. The data span a period of more than 10 years and include all ~500,000 reviews up to October 2012. The reviews include product and user information, ratings, the plain-text review, and a summary. It also includes reviews from all other Amazon categories.

We will sample 100,000 reviews to shorten the model's training time. Feel free to use the entire dataset to train the model if your machine has that kind of computing power.

data = pd.read_csv("../input/amazon-fine-food-reviews/Reviews.csv", nrows=100000)

Drop duplicates and NA values

data.drop_duplicates(subset=['Text'], inplace=True)  # dropping duplicates
data.dropna(axis=0, inplace=True)                    # dropping na

Preprocessing

Performing basic preprocessing steps is very important before we get to the model-building part. Using messy and uncleaned text data is a potentially disastrous move. So in this step, we will drop all the unwanted symbols, characters, and so on from the text that do not affect the objective of our problem.

Here is the dictionary we will use to expand the contracted forms (a short example of how it is applied follows the dictionary):

contraction_mapping = {"ain't": "is not", "aren't": "are not","can't": "cannot", "'cause": "because", "could've": "could have", "couldn't": "could not",
                       "didn't": "did not", "doesn't": "does not", "don't": "do not", "hadn't": "had not", "hasn't": "has not", "haven't": "have not",
                       "he'd": "he would","he'll": "he will", "he's": "he is", "how'd": "how did", "how'd'y": "how do you", "how'll": "how will", "how's": "how is",
                       "I'd": "I would", "I'd've": "I would have", "I'll": "I will", "I'll've": "I will have","I'm": "I am", "I've": "I have", "i'd": "i would",
                       "i'd've": "i would have", "i'll": "i will", "i'll've": "i will have","i'm": "i am", "i've": "i have", "isn't": "is not", "it'd": "it would",
                       "it'd've": "it would have", "it'll": "it will", "it'll've": "it will have","it's": "it is", "let's": "let us", "ma'am": "madam",
                       "mayn't": "may not", "might've": "might have","mightn't": "might not","mightn't've": "might not have", "must've": "must have",
                       "mustn't": "must not", "mustn't've": "must not have", "needn't": "need not", "needn't've": "need not have","o'clock": "of the clock",
                       "oughtn't": "ought not", "oughtn't've": "ought not have", "shan't": "shall not", "sha'n't": "shall not", "shan't've": "shall not have",
                       "she'd": "she would", "she'd've": "she would have", "she'll": "she will", "she'll've": "she will have", "she's": "she is",
                       "should've": "should have", "shouldn't": "should not", "shouldn't've": "should not have", "so've": "so have","so's": "so as",
                       "this's": "this is","that'd": "that would", "that'd've": "that would have", "that's": "that is", "there'd": "there would",
                       "there'd've": "there would have", "there's": "there is", "here's": "here is","they'd": "they would", "they'd've": "they would have",
                       "they'll": "they will", "they'll've": "they will have", "they're": "they are", "they've": "they have", "to've": "to have",
                       "wasn't": "was not", "we'd": "we would", "we'd've": "we would have", "we'll": "we will", "we'll've": "we will have", "we're": "we are",
                       "we've": "we have", "weren't": "were not", "what'll": "what will", "what'll've": "what will have", "what're": "what are",
                       "what's": "what is", "what've": "what have", "when's": "when is", "when've": "when have", "where'd": "where did", "where's": "where is",
                       "where've": "where have", "who'll": "who will", "who'll've": "who will have", "who's": "who is", "who've": "who have",
                       "why's": "why is", "why've": "why have", "will've": "will have", "won't": "will not", "won't've": "will not have",
                       "would've": "would have", "wouldn't": "would not", "wouldn't've": "would not have", "y'all": "you all",
                       "y'all'd": "you all would","y'all'd've": "you all would have","y'all're": "you all are","y'all've": "you all have",
                       "you'd": "you would", "you'd've": "you would have", "you'll": "you will", "you'll've": "you will have",
                       "you're": "you are", "you've": "you have"}

We need to define two different functions for preprocessing the reviews and generating the summaries, since the preprocessing steps involved for the text and for the summaries differ slightly.

a) Text cleaning

Let's look at the first 10 reviews in our dataset to get an idea of how to go about the text preprocessing steps:

data['Text'][:10]

Output:

"


Python利用深度學習進行文本摘要的綜合指南下篇(附教程)

作者:ARAVIND PAI

翻譯:和中華

校對:申利彬

本文約7500字,建議閱讀15分鐘。

本文介紹瞭如何利用seq2seq來建立一個文本摘要模型,以及其中的注意力機制。並利用Keras搭建編寫了一個完整的模型代碼。

介紹

“我不想要完整的報告,只需給我一個結果摘要”。我發現自己經常處於這種狀況——無論是在大學還是在職場中。我們準備了一份綜合全面的報告,但教師/主管卻僅僅有時間閱讀摘要。

聽起來很熟悉?好吧,我決定對此採取一些措施。手動將報告轉換為摘要太耗費時間了,對吧?那我可以依靠自然語言處理(NLP)技術來幫忙嗎?

自然語言處理(NLP)

https://courses.analyticsvidhya.com/courses/natural-language-processing-nlp?utm_source=blog&utm_medium=comprehensive-guide-text-summarization-using-deep-learning-python

這就是使用深度學習的文本摘要真正幫助我的地方。它解決了以前一直困擾著我的問題——現在我們的模型可以理解整個文本的上下文。對於所有需要把文檔快速摘要的人來說,這個夢想已成現實!

Python利用深度學習進行文本摘要的綜合指南下篇(附教程)


我們使用深度學習完成的文本摘要結果如何呢?非常出色。因此,在本文中,我們將逐步介紹使用深度學習構建文本摘要器的過程,其中包含構建它所需的全部概念。然後將用Python實現我們的第一個文本摘要模型!

注意:本文要求對一些深度學習概念有基本的瞭解。 我建議閱讀以下文章。

  • A Must-Read Introduction to Sequence Modelling (with use cases)

https://www.analyticsvidhya.com/blog/2018/04/sequence-modelling-an-introduction-with-practical-use-cases/?

utm_source=blog&utm_medium=comprehensive-guide-text-summarization-using-deep-learning-python

  • Must-Read Tutorial to Learn Sequence Modeling (deeplearning.ai Course #5)

https://www.analyticsvidhya.com/blog/2019/01/sequence-models-deeplearning/?utm_source=blog&utm_medium=comprehensive-guide-text-summarization-using-deep-learning-python

  • Essentials of Deep Learning: Introduction to Long Short Term Memory

https://www.analyticsvidhya.com/blog/2017/12/fundamentals-of-deep-learning-introduction-to-lstm/?utm_source=blog&utm_medium=comprehensive-guide-text-summarization-using-deep-learning-python

目錄

1. NLP中的文本摘要是什麼?

2. 序列到序列(Seq2Seq)建模簡介

3. 理解編碼器(Encoder)-解碼器(Decoder)架構

4. 編碼器-解碼器結構的侷限性

5. 注意力機制背後的直覺

6. 理解問題陳述

7. 使用Keras在Python中實現文本摘要模型

8. 注意力機制如何運作?

我在本文的最後面保留了“注意力機制如何運作?”的部分。這是一個數學密集的部分,並不強制瞭解Python代碼的工作原理。但是,我鼓勵你通讀它,因為它會讓你對這個NLP概念有一個堅實的理解。

注:此篇包含內容7-8,1-6內容請見:Python利用深度學習進行文本摘要的綜合指南上篇(附教程)

7. 使用Keras在Python中實現文本摘要

現在是時候開啟我們的Jupyter notebook了!讓我們馬上深入瞭解實施細節。

自定義注意力層

Keras官方沒有正式支持注意力層。 因此,我們要麼實現自己的注意力層,要麼使用第三方實現。在本文中我們採用後者。

1. from attention import AttentionLayer

導入庫

1. import numpy as np 
2. import pandas as pd
3. import re
4. from bs4 import BeautifulSoup
5. from keras.preprocessing.text import Tokenizer
6. from keras.preprocessing.sequence import pad_sequences
7. from nltk.corpus import stopwords
8. from tensorflow.keras.layers import Input, LSTM, Embedding, Dense, Concatenate, TimeDistributed, Bidirectional
9. from tensorflow.keras.models import Model
10. from tensorflow.keras.callbacks import EarlyStopping
11. import warnings
12. pd.set_option("display.max_colwidth", 200)
13. warnings.filterwarnings("ignore")

讀取數據集

該數據集包括亞馬遜美食的評論。 這些數據涵蓋了超過10年的時間,截至2012年10月的所有約500,000條評論。這些評論包括產品和用戶信息,評級,純文本評論和摘要。它還包括來自所有其他亞馬遜類別的評論。

我們將抽樣出100,000個評論,以縮短模型的訓練時間。如果你的機器具有強大的計算能力,也可以使用整個數據集來訓練模型。

1. data=pd.read_csv("../input/amazon-fine-food-reviews/Reviews.csv",nrows=100000) 

刪除重複項和NA值

1. data.drop_duplicates(subset=['Text'],inplace=True) #dropping duplicates 
2. data.dropna(axis=0,inplace=True) #dropping na

預處理

在我們進入模型構建部分之前,執行基本的預處理步驟非常重要。使用髒亂和未清理的文本數據是一個潛在的災難性舉措。因此,在此步驟中,我們將從文本中刪除不影響問題目標的所有不需要的符號,字符等。

這是我們用於擴展縮略形式的字典:

1. contraction_mapping = {"ain't": "is not", "aren't": "are not","can't": "cannot", "'cause": "because", "could've": "could have", "couldn't": "could not", 
2.
3. "didn't": "did not", "doesn't": "does not", "don't": "do not", "hadn't": "had not", "hasn't": "has not", "haven't": "have not",
4.
5. "he'd": "he would","he'll": "he will", "he's": "he is", "how'd": "how did", "how'd'y": "how do you", "how'll": "how will", "how's": "how is",
6.
7. "I'd": "I would", "I'd've": "I would have", "I'll": "I will", "I'll've": "I will have","I'm": "I am", "I've": "I have", "i'd": "i would",
8.
9. "i'd've": "i would have", "i'll": "i will", "i'll've": "i will have","i'm": "i am", "i've": "i have", "isn't": "is not", "it'd": "it would",
10.
11. "it'd've": "it would have", "it'll": "it will", "it'll've": "it will have","it's": "it is", "let's": "let us", "ma'am": "madam",
12.
13. "mayn't": "may not", "might've": "might have","mightn't": "might not","mightn't've": "might not have", "must've": "must have",
14.
15. "mustn't": "must not", "mustn't've": "must not have", "needn't": "need not", "needn't've": "need not have","o'clock": "of the clock",
16.
17. "oughtn't": "ought not", "oughtn't've": "ought not have", "shan't": "shall not", "sha'n't": "shall not", "shan't've": "shall not have",
18.
19. "she'd": "she would", "she'd've": "she would have", "she'll": "she will", "she'll've": "she will have", "she's": "she is",
20.
21. "should've": "should have", "shouldn't": "should not", "shouldn't've": "should not have", "so've": "so have","so's": "so as",
22.
23. "this's": "this is","that'd": "that would", "that'd've": "that would have", "that's": "that is", "there'd": "there would",
24.
25. "there'd've": "there would have", "there's": "there is", "here's": "here is","they'd": "they would", "they'd've": "they would have",
26.
27. "they'll": "they will", "they'll've": "they will have", "they're": "they are", "they've": "they have", "to've": "to have",
28.
29. "wasn't": "was not", "we'd": "we would", "we'd've": "we would have", "we'll": "we will", "we'll've": "we will have", "we're": "we are",
30.
31. "we've": "we have", "weren't": "were not", "what'll": "what will", "what'll've": "what will have", "what're": "what are",
32.
33. "what's": "what is", "what've": "what have", "when's": "when is", "when've": "when have", "where'd": "where did", "where's": "where is",
34.
35. "where've": "where have", "who'll": "who will", "who'll've": "who will have", "who's": "who is", "who've": "who have",
36.
37. "why's": "why is", "why've": "why have", "will've": "will have", "won't": "will not", "won't've": "will not have",
38.
39. "would've": "would have", "wouldn't": "would not", "wouldn't've": "would not have", "y'all": "you all",
40.
41. "y'all'd": "you all would","y'all'd've": "you all would have","y'all're": "you all are","y'all've": "you all have",
42.
43. "you'd": "you would", "you'd've": "you would have", "you'll": "you will", "you'll've": "you will have",
44.
45. "you're": "you are", "you've": "you have"}

我們需要定義兩個不同的函數來預處理評論並生成摘要,因為文本和摘要中涉及的預處理步驟略有不同。

a)文字清理

讓我們看一下數據集中的前10個評論,以瞭解該如何進行文本預處理步驟:

1. data['Text'][:10] 

輸出:

Python利用深度學習進行文本摘要的綜合指南下篇(附教程)


We will perform the following preprocessing tasks on our data:

  • Convert everything to lowercase
  • Remove HTML tags
  • Contraction mapping
  • Remove ('s)
  • Remove any text inside parentheses ( )
  • Eliminate punctuation and special characters
  • Remove stopwords
  • Remove short words

Let's define the function (a quick example call follows it):

stop_words = set(stopwords.words('english'))

def text_cleaner(text):
    newString = text.lower()
    newString = BeautifulSoup(newString, "lxml").text            # strip HTML tags
    newString = re.sub(r'\([^)]*\)', '', newString)              # drop text inside parentheses
    newString = re.sub('"', '', newString)
    newString = ' '.join([contraction_mapping[t] if t in contraction_mapping else t for t in newString.split(" ")])
    newString = re.sub(r"'s\b", "", newString)                   # remove 's
    newString = re.sub("[^a-zA-Z]", " ", newString)              # keep letters only
    tokens = [w for w in newString.split() if w not in stop_words]
    long_words = []
    for i in tokens:
        if len(i) >= 3:                                          # removing short words
            long_words.append(i)
    return (" ".join(long_words)).strip()

cleaned_text = []
for t in data['Text']:
    cleaned_text.append(text_cleaner(t))
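As a quick sanity check (the input string is a made-up example, and the exact output depends on the NLTK stopword list in your environment), calling the function on a short review should produce something along these lines:

print(text_cleaner("I love this product! It's <b>great</b> and tasty."))
# expected output, roughly: love product great tasty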

b) Summary cleaning

Now we will look at the first 10 rows of the summary column to get an idea of the preprocessing steps needed for it:

data['Summary'][:10]

Output:

"


Python利用深度學習進行文本摘要的綜合指南下篇(附教程)

作者:ARAVIND PAI

翻譯:和中華

校對:申利彬

本文約7500字,建議閱讀15分鐘。

本文介紹瞭如何利用seq2seq來建立一個文本摘要模型,以及其中的注意力機制。並利用Keras搭建編寫了一個完整的模型代碼。

介紹

“我不想要完整的報告,只需給我一個結果摘要”。我發現自己經常處於這種狀況——無論是在大學還是在職場中。我們準備了一份綜合全面的報告,但教師/主管卻僅僅有時間閱讀摘要。

聽起來很熟悉?好吧,我決定對此採取一些措施。手動將報告轉換為摘要太耗費時間了,對吧?那我可以依靠自然語言處理(NLP)技術來幫忙嗎?

自然語言處理(NLP)

https://courses.analyticsvidhya.com/courses/natural-language-processing-nlp?utm_source=blog&utm_medium=comprehensive-guide-text-summarization-using-deep-learning-python

這就是使用深度學習的文本摘要真正幫助我的地方。它解決了以前一直困擾著我的問題——現在我們的模型可以理解整個文本的上下文。對於所有需要把文檔快速摘要的人來說,這個夢想已成現實!

Python利用深度學習進行文本摘要的綜合指南下篇(附教程)


我們使用深度學習完成的文本摘要結果如何呢?非常出色。因此,在本文中,我們將逐步介紹使用深度學習構建文本摘要器的過程,其中包含構建它所需的全部概念。然後將用Python實現我們的第一個文本摘要模型!

注意:本文要求對一些深度學習概念有基本的瞭解。 我建議閱讀以下文章。

  • A Must-Read Introduction to Sequence Modelling (with use cases)

https://www.analyticsvidhya.com/blog/2018/04/sequence-modelling-an-introduction-with-practical-use-cases/?

utm_source=blog&utm_medium=comprehensive-guide-text-summarization-using-deep-learning-python

  • Must-Read Tutorial to Learn Sequence Modeling (deeplearning.ai Course #5)

https://www.analyticsvidhya.com/blog/2019/01/sequence-models-deeplearning/?utm_source=blog&utm_medium=comprehensive-guide-text-summarization-using-deep-learning-python

  • Essentials of Deep Learning: Introduction to Long Short Term Memory

https://www.analyticsvidhya.com/blog/2017/12/fundamentals-of-deep-learning-introduction-to-lstm/?utm_source=blog&utm_medium=comprehensive-guide-text-summarization-using-deep-learning-python

目錄

1. NLP中的文本摘要是什麼?

2. 序列到序列(Seq2Seq)建模簡介

3. 理解編碼器(Encoder)-解碼器(Decoder)架構

4. 編碼器-解碼器結構的侷限性

5. 注意力機制背後的直覺

6. 理解問題陳述

7. 使用Keras在Python中實現文本摘要模型

8. 注意力機制如何運作?

我在本文的最後面保留了“注意力機制如何運作?”的部分。這是一個數學密集的部分,並不強制瞭解Python代碼的工作原理。但是,我鼓勵你通讀它,因為它會讓你對這個NLP概念有一個堅實的理解。

注:此篇包含內容7-8,1-6內容請見:Python利用深度學習進行文本摘要的綜合指南上篇(附教程)

7. 使用Keras在Python中實現文本摘要

現在是時候開啟我們的Jupyter notebook了!讓我們馬上深入瞭解實施細節。

自定義注意力層

Keras官方沒有正式支持注意力層。 因此,我們要麼實現自己的注意力層,要麼使用第三方實現。在本文中我們採用後者。

1. from attention import AttentionLayer

導入庫

1. import numpy as np 
2. import pandas as pd
3. import re
4. from bs4 import BeautifulSoup
5. from keras.preprocessing.text import Tokenizer
6. from keras.preprocessing.sequence import pad_sequences
7. from nltk.corpus import stopwords
8. from tensorflow.keras.layers import Input, LSTM, Embedding, Dense, Concatenate, TimeDistributed, Bidirectional
9. from tensorflow.keras.models import Model
10. from tensorflow.keras.callbacks import EarlyStopping
11. import warnings
12. pd.set_option("display.max_colwidth", 200)
13. warnings.filterwarnings("ignore")

讀取數據集

該數據集包括亞馬遜美食的評論。 這些數據涵蓋了超過10年的時間,截至2012年10月的所有約500,000條評論。這些評論包括產品和用戶信息,評級,純文本評論和摘要。它還包括來自所有其他亞馬遜類別的評論。

我們將抽樣出100,000個評論,以縮短模型的訓練時間。如果你的機器具有強大的計算能力,也可以使用整個數據集來訓練模型。

1. data=pd.read_csv("../input/amazon-fine-food-reviews/Reviews.csv",nrows=100000) 

刪除重複項和NA值

1. data.drop_duplicates(subset=['Text'],inplace=True) #dropping duplicates 
2. data.dropna(axis=0,inplace=True) #dropping na

預處理

在我們進入模型構建部分之前,執行基本的預處理步驟非常重要。使用髒亂和未清理的文本數據是一個潛在的災難性舉措。因此,在此步驟中,我們將從文本中刪除不影響問題目標的所有不需要的符號,字符等。

這是我們用於擴展縮略形式的字典:

1. contraction_mapping = {"ain't": "is not", "aren't": "are not","can't": "cannot", "'cause": "because", "could've": "could have", "couldn't": "could not", 
2.
3. "didn't": "did not", "doesn't": "does not", "don't": "do not", "hadn't": "had not", "hasn't": "has not", "haven't": "have not",
4.
5. "he'd": "he would","he'll": "he will", "he's": "he is", "how'd": "how did", "how'd'y": "how do you", "how'll": "how will", "how's": "how is",
6.
7. "I'd": "I would", "I'd've": "I would have", "I'll": "I will", "I'll've": "I will have","I'm": "I am", "I've": "I have", "i'd": "i would",
8.
9. "i'd've": "i would have", "i'll": "i will", "i'll've": "i will have","i'm": "i am", "i've": "i have", "isn't": "is not", "it'd": "it would",
10.
11. "it'd've": "it would have", "it'll": "it will", "it'll've": "it will have","it's": "it is", "let's": "let us", "ma'am": "madam",
12.
13. "mayn't": "may not", "might've": "might have","mightn't": "might not","mightn't've": "might not have", "must've": "must have",
14.
15. "mustn't": "must not", "mustn't've": "must not have", "needn't": "need not", "needn't've": "need not have","o'clock": "of the clock",
16.
17. "oughtn't": "ought not", "oughtn't've": "ought not have", "shan't": "shall not", "sha'n't": "shall not", "shan't've": "shall not have",
18.
19. "she'd": "she would", "she'd've": "she would have", "she'll": "she will", "she'll've": "she will have", "she's": "she is",
20.
21. "should've": "should have", "shouldn't": "should not", "shouldn't've": "should not have", "so've": "so have","so's": "so as",
22.
23. "this's": "this is","that'd": "that would", "that'd've": "that would have", "that's": "that is", "there'd": "there would",
24.
25. "there'd've": "there would have", "there's": "there is", "here's": "here is","they'd": "they would", "they'd've": "they would have",
26.
27. "they'll": "they will", "they'll've": "they will have", "they're": "they are", "they've": "they have", "to've": "to have",
28.
29. "wasn't": "was not", "we'd": "we would", "we'd've": "we would have", "we'll": "we will", "we'll've": "we will have", "we're": "we are",
30.
31. "we've": "we have", "weren't": "were not", "what'll": "what will", "what'll've": "what will have", "what're": "what are",
32.
33. "what's": "what is", "what've": "what have", "when's": "when is", "when've": "when have", "where'd": "where did", "where's": "where is",
34.
35. "where've": "where have", "who'll": "who will", "who'll've": "who will have", "who's": "who is", "who've": "who have",
36.
37. "why's": "why is", "why've": "why have", "will've": "will have", "won't": "will not", "won't've": "will not have",
38.
39. "would've": "would have", "wouldn't": "would not", "wouldn't've": "would not have", "y'all": "you all",
40.
41. "y'all'd": "you all would","y'all'd've": "you all would have","y'all're": "you all are","y'all've": "you all have",
42.
43. "you'd": "you would", "you'd've": "you would have", "you'll": "you will", "you'll've": "you will have",
44.
45. "you're": "you are", "you've": "you have"}

我們需要定義兩個不同的函數來預處理評論並生成摘要,因為文本和摘要中涉及的預處理步驟略有不同。

a)文字清理

讓我們看一下數據集中的前10個評論,以瞭解該如何進行文本預處理步驟:

1. data['Text'][:10] 

輸出:

Python利用深度學習進行文本摘要的綜合指南下篇(附教程)


我們將為我們的數據執行以下預處理任務:

  • 將所有內容轉換為小寫
  • 刪除HTML標籤
  • 縮略形式映射
  • 刪除('s)
  • 刪除括號內的任何文本()
  • 消除標點符號和特殊字符
  • 刪除停用詞
  • 刪除簡短的單詞

讓我們定義一下這個函數:

1. stop_words = set(stopwords.words('english')) 
2. def text_cleaner(text):
3. newString = text.lower()
4. newString = BeautifulSoup(newString, "lxml").text
5. newString = re.sub(r'\\([^)]*\\)', '', newString)
6. newString = re.sub('"','', newString)
7. newString = ' '.join([contraction_mapping[t] if t in contraction_mapping else t for t in newString.split(" ")])
8. newString = re.sub(r"'s\\b","",newString)
9. newString = re.sub("[^a-zA-Z]", " ", newString)
10. tokens = [w for w in newString.split() if not w in stop_words]
11. long_words=[]
12. for i in tokens:
13. if len(i)>=3: #removing short word
14. long_words.append(i)
15. return (" ".join(long_words)).strip()
16.
17. cleaned_text = []
18. for t in data['Text']:
19. cleaned_text.append(text_cleaner(t))

b)摘要清理

現在,我們將查看前10行評論,以瞭解摘要列的預處理步驟:

1. data['Summary'][:10] 

輸出:

Python利用深度學習進行文本摘要的綜合指南下篇(附教程)


Let's define the function for this task:

def summary_cleaner(text):
    newString = re.sub('"', '', text)
    newString = ' '.join([contraction_mapping[t] if t in contraction_mapping else t for t in newString.split(" ")])
    newString = re.sub(r"'s\b", "", newString)
    newString = re.sub("[^a-zA-Z]", " ", newString)
    newString = newString.lower()
    tokens = newString.split()
    newString = ''
    for i in tokens:
        if len(i) > 1:                        # drop single-character tokens
            newString = newString + i + ' '
    return newString

# Call the above function
cleaned_summary = []
for t in data['Summary']:
    cleaned_summary.append(summary_cleaner(t))

data['cleaned_text'] = cleaned_text
data['cleaned_summary'] = cleaned_summary
data['cleaned_summary'].replace('', np.nan, inplace=True)
data.dropna(axis=0, inplace=True)

Remember to add the special START and END tokens at the beginning and end of each summary:

data['cleaned_summary'] = data['cleaned_summary'].apply(lambda x: '_START_ ' + x + ' _END_')

Now, let's look at the first 5 reviews together with their summaries:

for i in range(5):
    print("Review:", data['cleaned_text'][i])
    print("Summary:", data['cleaned_summary'][i])
    print("\n")

Output:

"


Python利用深度學習進行文本摘要的綜合指南下篇(附教程)

作者:ARAVIND PAI

翻譯:和中華

校對:申利彬

本文約7500字,建議閱讀15分鐘。

本文介紹瞭如何利用seq2seq來建立一個文本摘要模型,以及其中的注意力機制。並利用Keras搭建編寫了一個完整的模型代碼。

介紹

“我不想要完整的報告,只需給我一個結果摘要”。我發現自己經常處於這種狀況——無論是在大學還是在職場中。我們準備了一份綜合全面的報告,但教師/主管卻僅僅有時間閱讀摘要。

聽起來很熟悉?好吧,我決定對此採取一些措施。手動將報告轉換為摘要太耗費時間了,對吧?那我可以依靠自然語言處理(NLP)技術來幫忙嗎?

自然語言處理(NLP)

https://courses.analyticsvidhya.com/courses/natural-language-processing-nlp?utm_source=blog&utm_medium=comprehensive-guide-text-summarization-using-deep-learning-python

這就是使用深度學習的文本摘要真正幫助我的地方。它解決了以前一直困擾著我的問題——現在我們的模型可以理解整個文本的上下文。對於所有需要把文檔快速摘要的人來說,這個夢想已成現實!

Python利用深度學習進行文本摘要的綜合指南下篇(附教程)


我們使用深度學習完成的文本摘要結果如何呢?非常出色。因此,在本文中,我們將逐步介紹使用深度學習構建文本摘要器的過程,其中包含構建它所需的全部概念。然後將用Python實現我們的第一個文本摘要模型!

注意:本文要求對一些深度學習概念有基本的瞭解。 我建議閱讀以下文章。

  • A Must-Read Introduction to Sequence Modelling (with use cases)

https://www.analyticsvidhya.com/blog/2018/04/sequence-modelling-an-introduction-with-practical-use-cases/?

utm_source=blog&utm_medium=comprehensive-guide-text-summarization-using-deep-learning-python

  • Must-Read Tutorial to Learn Sequence Modeling (deeplearning.ai Course #5)

https://www.analyticsvidhya.com/blog/2019/01/sequence-models-deeplearning/?utm_source=blog&utm_medium=comprehensive-guide-text-summarization-using-deep-learning-python

  • Essentials of Deep Learning: Introduction to Long Short Term Memory

https://www.analyticsvidhya.com/blog/2017/12/fundamentals-of-deep-learning-introduction-to-lstm/?utm_source=blog&utm_medium=comprehensive-guide-text-summarization-using-deep-learning-python

目錄

1. NLP中的文本摘要是什麼?

2. 序列到序列(Seq2Seq)建模簡介

3. 理解編碼器(Encoder)-解碼器(Decoder)架構

4. 編碼器-解碼器結構的侷限性

5. 注意力機制背後的直覺

6. 理解問題陳述

7. 使用Keras在Python中實現文本摘要模型

8. 注意力機制如何運作?

我在本文的最後面保留了“注意力機制如何運作?”的部分。這是一個數學密集的部分,並不強制瞭解Python代碼的工作原理。但是,我鼓勵你通讀它,因為它會讓你對這個NLP概念有一個堅實的理解。

注:此篇包含內容7-8,1-6內容請見:Python利用深度學習進行文本摘要的綜合指南上篇(附教程)

7. 使用Keras在Python中實現文本摘要

現在是時候開啟我們的Jupyter notebook了!讓我們馬上深入瞭解實施細節。

自定義注意力層

Keras官方沒有正式支持注意力層。 因此,我們要麼實現自己的注意力層,要麼使用第三方實現。在本文中我們採用後者。

1. from attention import AttentionLayer

導入庫

1. import numpy as np 
2. import pandas as pd
3. import re
4. from bs4 import BeautifulSoup
5. from keras.preprocessing.text import Tokenizer
6. from keras.preprocessing.sequence import pad_sequences
7. from nltk.corpus import stopwords
8. from tensorflow.keras.layers import Input, LSTM, Embedding, Dense, Concatenate, TimeDistributed, Bidirectional
9. from tensorflow.keras.models import Model
10. from tensorflow.keras.callbacks import EarlyStopping
11. import warnings
12. pd.set_option("display.max_colwidth", 200)
13. warnings.filterwarnings("ignore")

讀取數據集

該數據集包括亞馬遜美食的評論。 這些數據涵蓋了超過10年的時間,截至2012年10月的所有約500,000條評論。這些評論包括產品和用戶信息,評級,純文本評論和摘要。它還包括來自所有其他亞馬遜類別的評論。

我們將抽樣出100,000個評論,以縮短模型的訓練時間。如果你的機器具有強大的計算能力,也可以使用整個數據集來訓練模型。

1. data=pd.read_csv("../input/amazon-fine-food-reviews/Reviews.csv",nrows=100000) 

刪除重複項和NA值

1. data.drop_duplicates(subset=['Text'],inplace=True) #dropping duplicates 
2. data.dropna(axis=0,inplace=True) #dropping na

預處理

在我們進入模型構建部分之前,執行基本的預處理步驟非常重要。使用髒亂和未清理的文本數據是一個潛在的災難性舉措。因此,在此步驟中,我們將從文本中刪除不影響問題目標的所有不需要的符號,字符等。

這是我們用於擴展縮略形式的字典:

1. contraction_mapping = {"ain't": "is not", "aren't": "are not","can't": "cannot", "'cause": "because", "could've": "could have", "couldn't": "could not", 
2.
3. "didn't": "did not", "doesn't": "does not", "don't": "do not", "hadn't": "had not", "hasn't": "has not", "haven't": "have not",
4.
5. "he'd": "he would","he'll": "he will", "he's": "he is", "how'd": "how did", "how'd'y": "how do you", "how'll": "how will", "how's": "how is",
6.
7. "I'd": "I would", "I'd've": "I would have", "I'll": "I will", "I'll've": "I will have","I'm": "I am", "I've": "I have", "i'd": "i would",
8.
9. "i'd've": "i would have", "i'll": "i will", "i'll've": "i will have","i'm": "i am", "i've": "i have", "isn't": "is not", "it'd": "it would",
10.
11. "it'd've": "it would have", "it'll": "it will", "it'll've": "it will have","it's": "it is", "let's": "let us", "ma'am": "madam",
12.
13. "mayn't": "may not", "might've": "might have","mightn't": "might not","mightn't've": "might not have", "must've": "must have",
14.
15. "mustn't": "must not", "mustn't've": "must not have", "needn't": "need not", "needn't've": "need not have","o'clock": "of the clock",
16.
17. "oughtn't": "ought not", "oughtn't've": "ought not have", "shan't": "shall not", "sha'n't": "shall not", "shan't've": "shall not have",
18.
19. "she'd": "she would", "she'd've": "she would have", "she'll": "she will", "she'll've": "she will have", "she's": "she is",
20.
21. "should've": "should have", "shouldn't": "should not", "shouldn't've": "should not have", "so've": "so have","so's": "so as",
22.
23. "this's": "this is","that'd": "that would", "that'd've": "that would have", "that's": "that is", "there'd": "there would",
24.
25. "there'd've": "there would have", "there's": "there is", "here's": "here is","they'd": "they would", "they'd've": "they would have",
26.
27. "they'll": "they will", "they'll've": "they will have", "they're": "they are", "they've": "they have", "to've": "to have",
28.
29. "wasn't": "was not", "we'd": "we would", "we'd've": "we would have", "we'll": "we will", "we'll've": "we will have", "we're": "we are",
30.
31. "we've": "we have", "weren't": "were not", "what'll": "what will", "what'll've": "what will have", "what're": "what are",
32.
33. "what's": "what is", "what've": "what have", "when's": "when is", "when've": "when have", "where'd": "where did", "where's": "where is",
34.
35. "where've": "where have", "who'll": "who will", "who'll've": "who will have", "who's": "who is", "who've": "who have",
36.
37. "why's": "why is", "why've": "why have", "will've": "will have", "won't": "will not", "won't've": "will not have",
38.
39. "would've": "would have", "wouldn't": "would not", "wouldn't've": "would not have", "y'all": "you all",
40.
41. "y'all'd": "you all would","y'all'd've": "you all would have","y'all're": "you all are","y'all've": "you all have",
42.
43. "you'd": "you would", "you'd've": "you would have", "you'll": "you will", "you'll've": "you will have",
44.
45. "you're": "you are", "you've": "you have"}

我們需要定義兩個不同的函數來預處理評論並生成摘要,因為文本和摘要中涉及的預處理步驟略有不同。

a)文字清理

讓我們看一下數據集中的前10個評論,以瞭解該如何進行文本預處理步驟:

1. data['Text'][:10] 

輸出:

Python利用深度學習進行文本摘要的綜合指南下篇(附教程)


我們將為我們的數據執行以下預處理任務:

  • 將所有內容轉換為小寫
  • 刪除HTML標籤
  • 縮略形式映射
  • 刪除('s)
  • 刪除括號內的任何文本()
  • 消除標點符號和特殊字符
  • 刪除停用詞
  • 刪除簡短的單詞

讓我們定義一下這個函數:

1. stop_words = set(stopwords.words('english')) 
2. def text_cleaner(text):
3. newString = text.lower()
4. newString = BeautifulSoup(newString, "lxml").text
5. newString = re.sub(r'\\([^)]*\\)', '', newString)
6. newString = re.sub('"','', newString)
7. newString = ' '.join([contraction_mapping[t] if t in contraction_mapping else t for t in newString.split(" ")])
8. newString = re.sub(r"'s\\b","",newString)
9. newString = re.sub("[^a-zA-Z]", " ", newString)
10. tokens = [w for w in newString.split() if not w in stop_words]
11. long_words=[]
12. for i in tokens:
13. if len(i)>=3: #removing short word
14. long_words.append(i)
15. return (" ".join(long_words)).strip()
16.
17. cleaned_text = []
18. for t in data['Text']:
19. cleaned_text.append(text_cleaner(t))

b)摘要清理

現在,我們將查看前10行評論,以瞭解摘要列的預處理步驟:

1. data['Summary'][:10] 

輸出:

Python利用深度學習進行文本摘要的綜合指南下篇(附教程)


定義此任務的函數:

1. def summary_cleaner(text): 
2. newString = re.sub('"','', text)
3. newString = ' '.join([contraction_mapping[t] if t in contraction_mapping else t for t in newString.split(" ")])
4. newString = re.sub(r"'s\\b","",newString)
5. newString = re.sub("[^a-zA-Z]", " ", newString)
6. newString = newString.lower()
7. tokens=newString.split()
8. newString=''
9. for i in tokens:
10. if len(i)>1:
11. newString=newString+i+' '
12. return newString
13.
14. #Call the above function
15. cleaned_summary = []
16. for t in data['Summary']:
17. cleaned_summary.append(summary_cleaner(t))
18.
19. data['cleaned_text']=cleaned_text
20. data['cleaned_summary']=cleaned_summary
21. data['cleaned_summary'].replace('', np.nan, inplace=True)
22. data.dropna(axis=0,inplace=True)

請記住在摘要的開頭和結尾添加START和END特殊標記:

1. data['cleaned_summary'] = data['cleaned_summary'].apply(lambda x : '_START_ '+ x + ' _END_') 

現在,我們來看看前5個評論及其摘要:

1. for i in range(5): 
2. print("Review:",data['cleaned_text'][i])
3. print("Summary:",data['cleaned_summary'][i])
4. print("\\n")

輸出:

Python利用深度學習進行文本摘要的綜合指南下篇(附教程)


Understanding the distribution of the sequences

Here we will analyze the lengths of the reviews and the summaries to get an overall idea of the distribution of text lengths. This will help us fix the maximum sequence lengths:

import matplotlib.pyplot as plt

text_word_count = []
summary_word_count = []

# populate the lists with sentence lengths
for i in data['cleaned_text']:
    text_word_count.append(len(i.split()))

for i in data['cleaned_summary']:
    summary_word_count.append(len(i.split()))

length_df = pd.DataFrame({'text': text_word_count, 'summary': summary_word_count})
length_df.hist(bins=30)
plt.show()

Output: (histograms of the word counts of the reviews and the summaries)

"


Python利用深度學習進行文本摘要的綜合指南下篇(附教程)

作者:ARAVIND PAI

翻譯:和中華

校對:申利彬

本文約7500字,建議閱讀15分鐘。

本文介紹瞭如何利用seq2seq來建立一個文本摘要模型,以及其中的注意力機制。並利用Keras搭建編寫了一個完整的模型代碼。

介紹

“我不想要完整的報告,只需給我一個結果摘要”。我發現自己經常處於這種狀況——無論是在大學還是在職場中。我們準備了一份綜合全面的報告,但教師/主管卻僅僅有時間閱讀摘要。

聽起來很熟悉?好吧,我決定對此採取一些措施。手動將報告轉換為摘要太耗費時間了,對吧?那我可以依靠自然語言處理(NLP)技術來幫忙嗎?

自然語言處理(NLP)

https://courses.analyticsvidhya.com/courses/natural-language-processing-nlp?utm_source=blog&utm_medium=comprehensive-guide-text-summarization-using-deep-learning-python

這就是使用深度學習的文本摘要真正幫助我的地方。它解決了以前一直困擾著我的問題——現在我們的模型可以理解整個文本的上下文。對於所有需要把文檔快速摘要的人來說,這個夢想已成現實!

Python利用深度學習進行文本摘要的綜合指南下篇(附教程)


我們使用深度學習完成的文本摘要結果如何呢?非常出色。因此,在本文中,我們將逐步介紹使用深度學習構建文本摘要器的過程,其中包含構建它所需的全部概念。然後將用Python實現我們的第一個文本摘要模型!

注意:本文要求對一些深度學習概念有基本的瞭解。 我建議閱讀以下文章。

  • A Must-Read Introduction to Sequence Modelling (with use cases)

https://www.analyticsvidhya.com/blog/2018/04/sequence-modelling-an-introduction-with-practical-use-cases/?

utm_source=blog&utm_medium=comprehensive-guide-text-summarization-using-deep-learning-python

  • Must-Read Tutorial to Learn Sequence Modeling (deeplearning.ai Course #5)

https://www.analyticsvidhya.com/blog/2019/01/sequence-models-deeplearning/?utm_source=blog&utm_medium=comprehensive-guide-text-summarization-using-deep-learning-python

  • Essentials of Deep Learning: Introduction to Long Short Term Memory

https://www.analyticsvidhya.com/blog/2017/12/fundamentals-of-deep-learning-introduction-to-lstm/?utm_source=blog&utm_medium=comprehensive-guide-text-summarization-using-deep-learning-python

目錄

1. NLP中的文本摘要是什麼?

2. 序列到序列(Seq2Seq)建模簡介

3. 理解編碼器(Encoder)-解碼器(Decoder)架構

4. 編碼器-解碼器結構的侷限性

5. 注意力機制背後的直覺

6. 理解問題陳述

7. 使用Keras在Python中實現文本摘要模型

8. 注意力機制如何運作?

我在本文的最後面保留了“注意力機制如何運作?”的部分。這是一個數學密集的部分,並不強制瞭解Python代碼的工作原理。但是,我鼓勵你通讀它,因為它會讓你對這個NLP概念有一個堅實的理解。

注:此篇包含內容7-8,1-6內容請見:Python利用深度學習進行文本摘要的綜合指南上篇(附教程)

7. 使用Keras在Python中實現文本摘要

現在是時候開啟我們的Jupyter notebook了!讓我們馬上深入瞭解實施細節。

自定義注意力層

Keras官方沒有正式支持注意力層。 因此,我們要麼實現自己的注意力層,要麼使用第三方實現。在本文中我們採用後者。

1. from attention import AttentionLayer

導入庫

1. import numpy as np 
2. import pandas as pd
3. import re
4. from bs4 import BeautifulSoup
5. from keras.preprocessing.text import Tokenizer
6. from keras.preprocessing.sequence import pad_sequences
7. from nltk.corpus import stopwords
8. from tensorflow.keras.layers import Input, LSTM, Embedding, Dense, Concatenate, TimeDistributed, Bidirectional
9. from tensorflow.keras.models import Model
10. from tensorflow.keras.callbacks import EarlyStopping
11. import warnings
12. pd.set_option("display.max_colwidth", 200)
13. warnings.filterwarnings("ignore")

讀取數據集

該數據集包括亞馬遜美食的評論。 這些數據涵蓋了超過10年的時間,截至2012年10月的所有約500,000條評論。這些評論包括產品和用戶信息,評級,純文本評論和摘要。它還包括來自所有其他亞馬遜類別的評論。

我們將抽樣出100,000個評論,以縮短模型的訓練時間。如果你的機器具有強大的計算能力,也可以使用整個數據集來訓練模型。

1. data=pd.read_csv("../input/amazon-fine-food-reviews/Reviews.csv",nrows=100000) 

刪除重複項和NA值

1. data.drop_duplicates(subset=['Text'],inplace=True) #dropping duplicates 
2. data.dropna(axis=0,inplace=True) #dropping na

預處理

在我們進入模型構建部分之前,執行基本的預處理步驟非常重要。使用髒亂和未清理的文本數據是一個潛在的災難性舉措。因此,在此步驟中,我們將從文本中刪除不影響問題目標的所有不需要的符號,字符等。

這是我們用於擴展縮略形式的字典:

1. contraction_mapping = {"ain't": "is not", "aren't": "are not","can't": "cannot", "'cause": "because", "could've": "could have", "couldn't": "could not", 
2.
3. "didn't": "did not", "doesn't": "does not", "don't": "do not", "hadn't": "had not", "hasn't": "has not", "haven't": "have not",
4.
5. "he'd": "he would","he'll": "he will", "he's": "he is", "how'd": "how did", "how'd'y": "how do you", "how'll": "how will", "how's": "how is",
6.
7. "I'd": "I would", "I'd've": "I would have", "I'll": "I will", "I'll've": "I will have","I'm": "I am", "I've": "I have", "i'd": "i would",
8.
9. "i'd've": "i would have", "i'll": "i will", "i'll've": "i will have","i'm": "i am", "i've": "i have", "isn't": "is not", "it'd": "it would",
10.
11. "it'd've": "it would have", "it'll": "it will", "it'll've": "it will have","it's": "it is", "let's": "let us", "ma'am": "madam",
12.
13. "mayn't": "may not", "might've": "might have","mightn't": "might not","mightn't've": "might not have", "must've": "must have",
14.
15. "mustn't": "must not", "mustn't've": "must not have", "needn't": "need not", "needn't've": "need not have","o'clock": "of the clock",
16.
17. "oughtn't": "ought not", "oughtn't've": "ought not have", "shan't": "shall not", "sha'n't": "shall not", "shan't've": "shall not have",
18.
19. "she'd": "she would", "she'd've": "she would have", "she'll": "she will", "she'll've": "she will have", "she's": "she is",
20.
21. "should've": "should have", "shouldn't": "should not", "shouldn't've": "should not have", "so've": "so have","so's": "so as",
22.
23. "this's": "this is","that'd": "that would", "that'd've": "that would have", "that's": "that is", "there'd": "there would",
24.
25. "there'd've": "there would have", "there's": "there is", "here's": "here is","they'd": "they would", "they'd've": "they would have",
26.
27. "they'll": "they will", "they'll've": "they will have", "they're": "they are", "they've": "they have", "to've": "to have",
28.
29. "wasn't": "was not", "we'd": "we would", "we'd've": "we would have", "we'll": "we will", "we'll've": "we will have", "we're": "we are",
30.
31. "we've": "we have", "weren't": "were not", "what'll": "what will", "what'll've": "what will have", "what're": "what are",
32.
33. "what's": "what is", "what've": "what have", "when's": "when is", "when've": "when have", "where'd": "where did", "where's": "where is",
34.
35. "where've": "where have", "who'll": "who will", "who'll've": "who will have", "who's": "who is", "who've": "who have",
36.
37. "why's": "why is", "why've": "why have", "will've": "will have", "won't": "will not", "won't've": "will not have",
38.
39. "would've": "would have", "wouldn't": "would not", "wouldn't've": "would not have", "y'all": "you all",
40.
41. "y'all'd": "you all would","y'all'd've": "you all would have","y'all're": "you all are","y'all've": "you all have",
42.
43. "you'd": "you would", "you'd've": "you would have", "you'll": "you will", "you'll've": "you will have",
44.
45. "you're": "you are", "you've": "you have"}

我們需要定義兩個不同的函數來預處理評論並生成摘要,因為文本和摘要中涉及的預處理步驟略有不同。

a)文字清理

讓我們看一下數據集中的前10個評論,以瞭解該如何進行文本預處理步驟:

1. data['Text'][:10] 

輸出:

Python利用深度學習進行文本摘要的綜合指南下篇(附教程)


我們將為我們的數據執行以下預處理任務:

  • 將所有內容轉換為小寫
  • 刪除HTML標籤
  • 縮略形式映射
  • 刪除('s)
  • 刪除括號內的任何文本()
  • 消除標點符號和特殊字符
  • 刪除停用詞
  • 刪除簡短的單詞

讓我們定義一下這個函數:

1. stop_words = set(stopwords.words('english')) 
2. def text_cleaner(text):
3. newString = text.lower()
4. newString = BeautifulSoup(newString, "lxml").text
5. newString = re.sub(r'\\([^)]*\\)', '', newString)
6. newString = re.sub('"','', newString)
7. newString = ' '.join([contraction_mapping[t] if t in contraction_mapping else t for t in newString.split(" ")])
8. newString = re.sub(r"'s\\b","",newString)
9. newString = re.sub("[^a-zA-Z]", " ", newString)
10. tokens = [w for w in newString.split() if not w in stop_words]
11. long_words=[]
12. for i in tokens:
13. if len(i)>=3: #removing short word
14. long_words.append(i)
15. return (" ".join(long_words)).strip()
16.
17. cleaned_text = []
18. for t in data['Text']:
19. cleaned_text.append(text_cleaner(t))

b)摘要清理

現在,我們將查看前10行評論,以瞭解摘要列的預處理步驟:

1. data['Summary'][:10] 

輸出:

Python利用深度學習進行文本摘要的綜合指南下篇(附教程)


定義此任務的函數:

1. def summary_cleaner(text): 
2. newString = re.sub('"','', text)
3. newString = ' '.join([contraction_mapping[t] if t in contraction_mapping else t for t in newString.split(" ")])
4. newString = re.sub(r"'s\\b","",newString)
5. newString = re.sub("[^a-zA-Z]", " ", newString)
6. newString = newString.lower()
7. tokens=newString.split()
8. newString=''
9. for i in tokens:
10. if len(i)>1:
11. newString=newString+i+' '
12. return newString
13.
14. #Call the above function
15. cleaned_summary = []
16. for t in data['Summary']:
17. cleaned_summary.append(summary_cleaner(t))
18.
19. data['cleaned_text']=cleaned_text
20. data['cleaned_summary']=cleaned_summary
21. data['cleaned_summary'].replace('', np.nan, inplace=True)
22. data.dropna(axis=0,inplace=True)

請記住在摘要的開頭和結尾添加START和END特殊標記:

1. data['cleaned_summary'] = data['cleaned_summary'].apply(lambda x : '_START_ '+ x + ' _END_') 

現在,我們來看看前5個評論及其摘要:

1. for i in range(5): 
2. print("Review:",data['cleaned_text'][i])
3. print("Summary:",data['cleaned_summary'][i])
4. print("\\n")

輸出:

Python利用深度學習進行文本摘要的綜合指南下篇(附教程)


瞭解序列的分佈

在這裡,我們將分析評論和摘要的長度,以全面瞭解文本長度的分佈。這將幫助我們確定序列的最大長度:

1. import matplotlib.pyplot as plt 
2. text_word_count = []
3. summary_word_count = []
4.
5. # populate the lists with sentence lengths
6. for i in data['cleaned_text']:
7. text_word_count.append(len(i.split()))
8.
9. for i in data['cleaned_summary']:
10. summary_word_count.append(len(i.split()))
11.
12. length_df = pd.DataFrame({'text':text_word_count, 'summary':summary_word_count})
13. length_df.hist(bins = 30)
14. plt.show()

輸出:

Python利用深度學習進行文本摘要的綜合指南下篇(附教程)


Interesting. We can fix the maximum length of the reviews at 80, since that appears to cover the majority of them. Similarly, we can set the maximum summary length to 10 (a quick sanity check on these choices follows the snippet below):

max_len_text = 80
max_len_summary = 10
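As a quick check on these choices (a minimal sketch; it assumes the text_word_count and summary_word_count lists from the previous step are still in scope), you can compute what fraction of the sequences fit within the chosen limits:

# proportion of reviews with at most 80 words and of summaries with at most 10 words
print(sum(1 for c in text_word_count if c <= max_len_text) / len(text_word_count))
print(sum(1 for c in summary_word_count if c <= max_len_summary) / len(summary_word_count))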

We are getting closer and closer to the model-building part. Before that, we need to split our dataset into a training set and a validation set. We will use 90% of the dataset as training data and evaluate the performance on the remaining 10% (the holdout set):

from sklearn.model_selection import train_test_split

x_tr, x_val, y_tr, y_val = train_test_split(data['cleaned_text'], data['cleaned_summary'], test_size=0.1, random_state=0, shuffle=True)

Preparing the tokenizers

A tokenizer builds the vocabulary and converts word sequences into integer sequences. Go ahead and build tokenizers for the text and the summaries (a short usage example follows the two code blocks below):

a) Text tokenizer

# prepare a tokenizer for reviews on training data
x_tokenizer = Tokenizer()
x_tokenizer.fit_on_texts(list(x_tr))

# convert text sequences into integer sequences
x_tr = x_tokenizer.texts_to_sequences(x_tr)
x_val = x_tokenizer.texts_to_sequences(x_val)

# padding with zeros up to the maximum length
x_tr = pad_sequences(x_tr, maxlen=max_len_text, padding='post')
x_val = pad_sequences(x_val, maxlen=max_len_text, padding='post')

x_voc_size = len(x_tokenizer.word_index) + 1

b) Summary tokenizer

# prepare a tokenizer for summaries on training data
y_tokenizer = Tokenizer()
y_tokenizer.fit_on_texts(list(y_tr))

# convert summary sequences into integer sequences
y_tr = y_tokenizer.texts_to_sequences(y_tr)
y_val = y_tokenizer.texts_to_sequences(y_val)

# padding with zeros up to the maximum length
y_tr = pad_sequences(y_tr, maxlen=max_len_summary, padding='post')
y_val = pad_sequences(y_val, maxlen=max_len_summary, padding='post')

y_voc_size = len(y_tokenizer.word_index) + 1
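To make the tokenizer's behavior concrete, here is a small self-contained example (the toy sentences are hypothetical and separate from the training data) showing how the fitted word index turns text into padded integer sequences:

# toy example: fit a tokenizer, convert text to integer ids, then pad
demo_tokenizer = Tokenizer()
demo_tokenizer.fit_on_texts(["great product", "great taste"])
print(demo_tokenizer.word_index)                      # -> {'great': 1, 'product': 2, 'taste': 3}
seqs = demo_tokenizer.texts_to_sequences(["great taste"])
print(pad_sequences(seqs, maxlen=4, padding='post'))  # -> [[1 3 0 0]]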

Model building

We have finally arrived at the model-building part. Before we build it, though, we need to get familiar with a few terms the code relies on (a short sketch illustrating the first two follows this list):

  • Return Sequences = True: when return_sequences is set to True, the LSTM returns its hidden state (output) for every timestep rather than only for the last one
  • Return State = True: when return_state is set to True, the LSTM additionally returns the hidden state and cell state of the last timestep
  • Initial State: used to initialize the internal states of the LSTM for the first timestep
  • Stacked LSTM: a stacked LSTM has multiple LSTM layers stacked on top of one another. This leads to a better representation of the sequence. I encourage you to experiment with stacking several LSTM layers (it is a great way to learn)
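Here is a minimal sketch illustrating the first two flags (shapes shown as comments; it reuses the Input and LSTM layers imported earlier, and the leading None is the batch dimension):

demo_in = Input(shape=(80, 100))     # a dummy sequence: 80 timesteps, 100 features each
seq_out, h, c = LSTM(500, return_sequences=True, return_state=True)(demo_in)
# seq_out -> (None, 80, 500): one hidden state per timestep (return_sequences=True)
# h, c    -> (None, 500):     hidden and cell state of the last timestep (return_state=True)
last_out = LSTM(500)(demo_in)
# last_out -> (None, 500):    only the last hidden state when both flags are left at False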

Here, we build a 3-layer stacked LSTM for the encoder:

from keras import backend as K
K.clear_session()

latent_dim = 500

# Encoder
encoder_inputs = Input(shape=(max_len_text,))
enc_emb = Embedding(x_voc_size, latent_dim, trainable=True)(encoder_inputs)

# LSTM 1
encoder_lstm1 = LSTM(latent_dim, return_sequences=True, return_state=True)
encoder_output1, state_h1, state_c1 = encoder_lstm1(enc_emb)

# LSTM 2
encoder_lstm2 = LSTM(latent_dim, return_sequences=True, return_state=True)
encoder_output2, state_h2, state_c2 = encoder_lstm2(encoder_output1)

# LSTM 3
encoder_lstm3 = LSTM(latent_dim, return_state=True, return_sequences=True)
encoder_outputs, state_h, state_c = encoder_lstm3(encoder_output2)

# Set up the decoder
decoder_inputs = Input(shape=(None,))
dec_emb_layer = Embedding(y_voc_size, latent_dim, trainable=True)
dec_emb = dec_emb_layer(decoder_inputs)

# Decoder LSTM, using the encoder states as the initial state
decoder_lstm = LSTM(latent_dim, return_sequences=True, return_state=True)
decoder_outputs, decoder_fwd_state, decoder_back_state = decoder_lstm(dec_emb, initial_state=[state_h, state_c])

# Attention layer
attn_layer = AttentionLayer(name='attention_layer')
attn_out, attn_states = attn_layer([encoder_outputs, decoder_outputs])

# Concat attention output and decoder LSTM output
decoder_concat_input = Concatenate(axis=-1, name='concat_layer')([decoder_outputs, attn_out])

# Dense layer
decoder_dense = TimeDistributed(Dense(y_voc_size, activation='softmax'))
decoder_outputs = decoder_dense(decoder_concat_input)

# Define the model
model = Model([encoder_inputs, decoder_inputs], decoder_outputs)
model.summary()

Output: (the layer-by-layer summary printed by model.summary())

"


Python利用深度學習進行文本摘要的綜合指南下篇(附教程)

作者:ARAVIND PAI

翻譯:和中華

校對:申利彬

本文約7500字,建議閱讀15分鐘。

本文介紹瞭如何利用seq2seq來建立一個文本摘要模型,以及其中的注意力機制。並利用Keras搭建編寫了一個完整的模型代碼。

介紹

“我不想要完整的報告,只需給我一個結果摘要”。我發現自己經常處於這種狀況——無論是在大學還是在職場中。我們準備了一份綜合全面的報告,但教師/主管卻僅僅有時間閱讀摘要。

聽起來很熟悉?好吧,我決定對此採取一些措施。手動將報告轉換為摘要太耗費時間了,對吧?那我可以依靠自然語言處理(NLP)技術來幫忙嗎?

自然語言處理(NLP)

https://courses.analyticsvidhya.com/courses/natural-language-processing-nlp?utm_source=blog&utm_medium=comprehensive-guide-text-summarization-using-deep-learning-python

這就是使用深度學習的文本摘要真正幫助我的地方。它解決了以前一直困擾著我的問題——現在我們的模型可以理解整個文本的上下文。對於所有需要把文檔快速摘要的人來說,這個夢想已成現實!

Python利用深度學習進行文本摘要的綜合指南下篇(附教程)


我們使用深度學習完成的文本摘要結果如何呢?非常出色。因此,在本文中,我們將逐步介紹使用深度學習構建文本摘要器的過程,其中包含構建它所需的全部概念。然後將用Python實現我們的第一個文本摘要模型!

注意:本文要求對一些深度學習概念有基本的瞭解。 我建議閱讀以下文章。

  • A Must-Read Introduction to Sequence Modelling (with use cases)

https://www.analyticsvidhya.com/blog/2018/04/sequence-modelling-an-introduction-with-practical-use-cases/?

utm_source=blog&utm_medium=comprehensive-guide-text-summarization-using-deep-learning-python

  • Must-Read Tutorial to Learn Sequence Modeling (deeplearning.ai Course #5)

https://www.analyticsvidhya.com/blog/2019/01/sequence-models-deeplearning/?utm_source=blog&utm_medium=comprehensive-guide-text-summarization-using-deep-learning-python

  • Essentials of Deep Learning: Introduction to Long Short Term Memory

https://www.analyticsvidhya.com/blog/2017/12/fundamentals-of-deep-learning-introduction-to-lstm/?utm_source=blog&utm_medium=comprehensive-guide-text-summarization-using-deep-learning-python

目錄

1. NLP中的文本摘要是什麼?

2. 序列到序列(Seq2Seq)建模簡介

3. 理解編碼器(Encoder)-解碼器(Decoder)架構

4. 編碼器-解碼器結構的侷限性

5. 注意力機制背後的直覺

6. 理解問題陳述

7. 使用Keras在Python中實現文本摘要模型

8. 注意力機制如何運作?

我在本文的最後面保留了“注意力機制如何運作?”的部分。這是一個數學密集的部分,並不強制瞭解Python代碼的工作原理。但是,我鼓勵你通讀它,因為它會讓你對這個NLP概念有一個堅實的理解。

注:此篇包含內容7-8,1-6內容請見:Python利用深度學習進行文本摘要的綜合指南上篇(附教程)

7. 使用Keras在Python中實現文本摘要

現在是時候開啟我們的Jupyter notebook了!讓我們馬上深入瞭解實施細節。

自定義注意力層

Keras官方沒有正式支持注意力層。 因此,我們要麼實現自己的注意力層,要麼使用第三方實現。在本文中我們採用後者。

1. from attention import AttentionLayer

導入庫

1. import numpy as np 
2. import pandas as pd
3. import re
4. from bs4 import BeautifulSoup
5. from keras.preprocessing.text import Tokenizer
6. from keras.preprocessing.sequence import pad_sequences
7. from nltk.corpus import stopwords
8. from tensorflow.keras.layers import Input, LSTM, Embedding, Dense, Concatenate, TimeDistributed, Bidirectional
9. from tensorflow.keras.models import Model
10. from tensorflow.keras.callbacks import EarlyStopping
11. import warnings
12. pd.set_option("display.max_colwidth", 200)
13. warnings.filterwarnings("ignore")

讀取數據集

該數據集包括亞馬遜美食的評論。 這些數據涵蓋了超過10年的時間,截至2012年10月的所有約500,000條評論。這些評論包括產品和用戶信息,評級,純文本評論和摘要。它還包括來自所有其他亞馬遜類別的評論。

我們將抽樣出100,000個評論,以縮短模型的訓練時間。如果你的機器具有強大的計算能力,也可以使用整個數據集來訓練模型。

1. data=pd.read_csv("../input/amazon-fine-food-reviews/Reviews.csv",nrows=100000) 

刪除重複項和NA值

1. data.drop_duplicates(subset=['Text'],inplace=True) #dropping duplicates 
2. data.dropna(axis=0,inplace=True) #dropping na

預處理

在我們進入模型構建部分之前,執行基本的預處理步驟非常重要。使用髒亂和未清理的文本數據是一個潛在的災難性舉措。因此,在此步驟中,我們將從文本中刪除不影響問題目標的所有不需要的符號,字符等。

這是我們用於擴展縮略形式的字典:

1. contraction_mapping = {"ain't": "is not", "aren't": "are not","can't": "cannot", "'cause": "because", "could've": "could have", "couldn't": "could not", 
2.
3. "didn't": "did not", "doesn't": "does not", "don't": "do not", "hadn't": "had not", "hasn't": "has not", "haven't": "have not",
4.
5. "he'd": "he would","he'll": "he will", "he's": "he is", "how'd": "how did", "how'd'y": "how do you", "how'll": "how will", "how's": "how is",
6.
7. "I'd": "I would", "I'd've": "I would have", "I'll": "I will", "I'll've": "I will have","I'm": "I am", "I've": "I have", "i'd": "i would",
8.
9. "i'd've": "i would have", "i'll": "i will", "i'll've": "i will have","i'm": "i am", "i've": "i have", "isn't": "is not", "it'd": "it would",
10.
11. "it'd've": "it would have", "it'll": "it will", "it'll've": "it will have","it's": "it is", "let's": "let us", "ma'am": "madam",
12.
13. "mayn't": "may not", "might've": "might have","mightn't": "might not","mightn't've": "might not have", "must've": "must have",
14.
15. "mustn't": "must not", "mustn't've": "must not have", "needn't": "need not", "needn't've": "need not have","o'clock": "of the clock",
16.
17. "oughtn't": "ought not", "oughtn't've": "ought not have", "shan't": "shall not", "sha'n't": "shall not", "shan't've": "shall not have",
18.
19. "she'd": "she would", "she'd've": "she would have", "she'll": "she will", "she'll've": "she will have", "she's": "she is",
20.
21. "should've": "should have", "shouldn't": "should not", "shouldn't've": "should not have", "so've": "so have","so's": "so as",
22.
23. "this's": "this is","that'd": "that would", "that'd've": "that would have", "that's": "that is", "there'd": "there would",
24.
25. "there'd've": "there would have", "there's": "there is", "here's": "here is","they'd": "they would", "they'd've": "they would have",
26.
27. "they'll": "they will", "they'll've": "they will have", "they're": "they are", "they've": "they have", "to've": "to have",
28.
29. "wasn't": "was not", "we'd": "we would", "we'd've": "we would have", "we'll": "we will", "we'll've": "we will have", "we're": "we are",
30.
31. "we've": "we have", "weren't": "were not", "what'll": "what will", "what'll've": "what will have", "what're": "what are",
32.
33. "what's": "what is", "what've": "what have", "when's": "when is", "when've": "when have", "where'd": "where did", "where's": "where is",
34.
35. "where've": "where have", "who'll": "who will", "who'll've": "who will have", "who's": "who is", "who've": "who have",
36.
37. "why's": "why is", "why've": "why have", "will've": "will have", "won't": "will not", "won't've": "will not have",
38.
39. "would've": "would have", "wouldn't": "would not", "wouldn't've": "would not have", "y'all": "you all",
40.
41. "y'all'd": "you all would","y'all'd've": "you all would have","y'all're": "you all are","y'all've": "you all have",
42.
43. "you'd": "you would", "you'd've": "you would have", "you'll": "you will", "you'll've": "you will have",
44.
45. "you're": "you are", "you've": "you have"}

我們需要定義兩個不同的函數來預處理評論並生成摘要,因為文本和摘要中涉及的預處理步驟略有不同。

a)文字清理

讓我們看一下數據集中的前10個評論,以瞭解該如何進行文本預處理步驟:

1. data['Text'][:10] 

輸出:

Python利用深度學習進行文本摘要的綜合指南下篇(附教程)


我們將為我們的數據執行以下預處理任務:

  • 將所有內容轉換為小寫
  • 刪除HTML標籤
  • 縮略形式映射
  • 刪除('s)
  • 刪除括號內的任何文本()
  • 消除標點符號和特殊字符
  • 刪除停用詞
  • 刪除簡短的單詞

讓我們定義一下這個函數:

1. stop_words = set(stopwords.words('english')) 
2. def text_cleaner(text):
3. newString = text.lower()
4. newString = BeautifulSoup(newString, "lxml").text
5. newString = re.sub(r'\\([^)]*\\)', '', newString)
6. newString = re.sub('"','', newString)
7. newString = ' '.join([contraction_mapping[t] if t in contraction_mapping else t for t in newString.split(" ")])
8. newString = re.sub(r"'s\\b","",newString)
9. newString = re.sub("[^a-zA-Z]", " ", newString)
10. tokens = [w for w in newString.split() if not w in stop_words]
11. long_words=[]
12. for i in tokens:
13. if len(i)>=3: #removing short word
14. long_words.append(i)
15. return (" ".join(long_words)).strip()
16.
17. cleaned_text = []
18. for t in data['Text']:
19. cleaned_text.append(text_cleaner(t))

b)摘要清理

現在,我們將查看前10行評論,以瞭解摘要列的預處理步驟:

1. data['Summary'][:10] 

輸出:

Python利用深度學習進行文本摘要的綜合指南下篇(附教程)


定義此任務的函數:

1. def summary_cleaner(text): 
2. newString = re.sub('"','', text)
3. newString = ' '.join([contraction_mapping[t] if t in contraction_mapping else t for t in newString.split(" ")])
4. newString = re.sub(r"'s\\b","",newString)
5. newString = re.sub("[^a-zA-Z]", " ", newString)
6. newString = newString.lower()
7. tokens=newString.split()
8. newString=''
9. for i in tokens:
10. if len(i)>1:
11. newString=newString+i+' '
12. return newString
13.
14. #Call the above function
15. cleaned_summary = []
16. for t in data['Summary']:
17. cleaned_summary.append(summary_cleaner(t))
18.
19. data['cleaned_text']=cleaned_text
20. data['cleaned_summary']=cleaned_summary
21. data['cleaned_summary'].replace('', np.nan, inplace=True)
22. data.dropna(axis=0,inplace=True)

請記住在摘要的開頭和結尾添加START和END特殊標記:

1. data['cleaned_summary'] = data['cleaned_summary'].apply(lambda x : '_START_ '+ x + ' _END_') 

現在,我們來看看前5個評論及其摘要:

1. for i in range(5): 
2. print("Review:",data['cleaned_text'][i])
3. print("Summary:",data['cleaned_summary'][i])
4. print("\\n")

輸出:

Python利用深度學習進行文本摘要的綜合指南下篇(附教程)


瞭解序列的分佈

在這裡,我們將分析評論和摘要的長度,以全面瞭解文本長度的分佈。這將幫助我們確定序列的最大長度:

1. import matplotlib.pyplot as plt 
2. text_word_count = []
3. summary_word_count = []
4.
5. # populate the lists with sentence lengths
6. for i in data['cleaned_text']:
7. text_word_count.append(len(i.split()))
8.
9. for i in data['cleaned_summary']:
10. summary_word_count.append(len(i.split()))
11.
12. length_df = pd.DataFrame({'text':text_word_count, 'summary':summary_word_count})
13. length_df.hist(bins = 30)
14. plt.show()

輸出:

Python利用深度學習進行文本摘要的綜合指南下篇(附教程)


有趣。我們可以將評論的最大長度固定為80,因為這似乎是多數評論的長度。同樣,我們可以將最大摘要長度設置為10:

1. max_len_text=80 
2. max_len_summary=10

我們越來越接近模型的構建部分了。在此之前,我們需要將數據集拆分為訓練和驗證集。我們將使用90%的數據集作為訓練數據,並在其餘10%上評估(保留集)表現:

1. from sklearn.model_selection import train_test_split 
2. x_tr,x_val,y_tr,y_val=train_test_split(data['cleaned_text'],data['cleaned_summary'],test_size=0.1,random_state=0,shuffle=True)

準備分詞器(Tokenizer)

分詞器構建詞彙表並將單詞序列轉換為整數序列。繼續為文本和摘要構建分詞器:

  • a) 文本分詞器
b) #prepare a tokenizer for reviews on training data 
c) x_tokenizer = Tokenizer()
d) x_tokenizer.fit_on_texts(list(x_tr))
e)
f) #convert text sequences into integer sequences
g) x_tr = x_tokenizer.texts_to_sequences(x_tr)
h) x_val = x_tokenizer.texts_to_sequences(x_val)
i)
j) #padding zero upto maximum length
k) x_tr = pad_sequences(x_tr, maxlen=max_len_text, padding='post')
l) x_val = pad_sequences(x_val, maxlen=max_len_text, padding='post')
m)
n) x_voc_size = len(x_tokenizer.word_index) +1
  • b)摘要分詞器
1. #preparing a tokenizer for summary on training data 
2. y_tokenizer = Tokenizer()
3. y_tokenizer.fit_on_texts(list(y_tr))
4.
5. #convert summary sequences into integer sequences
6. y_tr = y_tokenizer.texts_to_sequences(y_tr)
7. y_val = y_tokenizer.texts_to_sequences(y_val)
8.
9. #padding zero upto maximum length
10. y_tr = pad_sequences(y_tr, maxlen=max_len_summary, padding='post')
11. y_val = pad_sequences(y_val, maxlen=max_len_summary, padding='post')
12.
13. y_voc_size = len(y_tokenizer.word_index) +1

模型構建

終於來到了模型構建的部分。但在構建之前,我們需要熟悉所需的一些術語。

  • Return Sequences = True:當return sequences參數設置為True時,LSTM為每個時間步生成隱藏狀態和單元狀態
  • Return State = True:當return state = True時,LSTM僅生成最後一個時間步的隱藏狀態和單元狀態
  • Initial State:用於在第一個時間步初始化LSTM的內部狀態
  • Stacked LSTM:Stacked LSTM具有多層LSTM堆疊在彼此之上。這能產生更好地序列表示。我鼓勵你嘗試將LSTM的多個層堆疊在一起(這是一個很好的學習方法)

在這裡,我們為編碼器構建一個3層堆疊LSTM:

1. from keras import backend as K 
2. K.clear_session()
3. latent_dim = 500
4.
5. # Encoder
6. encoder_inputs = Input(shape=(max_len_text,))
7. enc_emb = Embedding(x_voc_size, latent_dim,trainable=True)(encoder_inputs)
8.
9. #LSTM 1
10. encoder_lstm1 = LSTM(latent_dim,return_sequences=True,return_state=True)
11. encoder_output1, state_h1, state_c1 = encoder_lstm1(enc_emb)
12.
13. #LSTM 2
14. encoder_lstm2 = LSTM(latent_dim,return_sequences=True,return_state=True)
15. encoder_output2, state_h2, state_c2 = encoder_lstm2(encoder_output1)
16.
17. #LSTM 3
18. encoder_lstm3=LSTM(latent_dim, return_state=True, return_sequences=True)
19. encoder_outputs, state_h, state_c= encoder_lstm3(encoder_output2)
20.
21. # Set up the decoder.
22. decoder_inputs = Input(shape=(None,))
23. dec_emb_layer = Embedding(y_voc_size, latent_dim,trainable=True)
24. dec_emb = dec_emb_layer(decoder_inputs)
25.
26. #LSTM using encoder_states as initial state
27. decoder_lstm = LSTM(latent_dim, return_sequences=True, return_state=True)
28. decoder_outputs,decoder_fwd_state, decoder_back_state = decoder_lstm(dec_emb,initial_state=[state_h, state_c])
29.
30. #Attention Layer
31. Attention layer attn_layer = AttentionLayer(name='attention_layer')
32. attn_out, attn_states = attn_layer([encoder_outputs, decoder_outputs])
33.
34. # Concat attention output and decoder LSTM output
35. decoder_concat_input = Concatenate(axis=-1, name='concat_layer')([decoder_outputs, attn_out])
36.
37. #Dense layer
38. decoder_dense = TimeDistributed(Dense(y_voc_size, activation='softmax'))
39. decoder_outputs = decoder_dense(decoder_concat_input)
40.
41. # Define the model
42. model = Model([encoder_inputs, decoder_inputs], decoder_outputs)
43. model.summary()

輸出:

Python利用深度學習進行文本摘要的綜合指南下篇(附教程)


I am using sparse categorical cross-entropy as the loss function, since it converts the integer target sequences to one-hot vectors on the fly. This overcomes any memory issues.

model.compile(optimizer='rmsprop', loss='sparse_categorical_crossentropy')
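To see why this matters (the numbers here are illustrative, not from the original article): with sparse categorical cross-entropy the targets remain integer word ids of shape (batch_size, max_len_summary, 1), whereas plain categorical cross-entropy would require expanding them into one-hot vectors of shape (batch_size, max_len_summary, y_voc_size). For a batch of 512, a summary length of 10 and a vocabulary of, say, 30,000 words, that is 512 × 10 × 30,000 one-hot entries instead of just 512 × 10 integers.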

Remember the concept of early stopping? It is used to stop training a neural network at the right time by monitoring a user-specified metric. Here, I am monitoring the validation loss (val_loss). Our model will stop training once the validation loss starts increasing:

es = EarlyStopping(monitor='val_loss', mode='min', verbose=1)
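In practice the validation loss can fluctuate a little from epoch to epoch, so you may want to give it some slack before stopping. A variant of the same callback with a patience window (my addition, not part of the original code; restore_best_weights requires a reasonably recent version of tf.keras) would be:

es = EarlyStopping(monitor='val_loss', mode='min', verbose=1,
                   patience=2,                  # tolerate 2 epochs with no improvement
                   restore_best_weights=True)   # roll back to the weights of the best epoch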

我們將在批量大小為512的情況下訓練模型,並在保留集(我們數據集的10%)上驗證它:

1. history=model.fit([x_tr,y_tr[:,:-1]], y_tr.reshape(y_tr.shape[0],y_tr.shape[1], 1)[:,1:] ,epochs=50,callbacks=[es],batch_size=512, validation_data=([x_val,y_val[:,:-1]], y_val.reshape(y_val.shape[0],y_val.shape[1], 1)[:,1:]))

瞭解診斷圖

現在,我們將繪製一些診斷圖來了解模型隨時間的變化情況:

1. from matplotlib import pyplot 
2. pyplot.plot(history.history['loss'], label='train')
3. pyplot.plot(history.history['val_loss'], label='test')
4. pyplot.legend() 
5. pyplot.show()

輸出:

(圖:訓練集與驗證集的損失曲線)


我們可以推斷,在第10個週期(epoch)之後,驗證集損失略有增加。因此,我們將在此之後停止訓練模型。

接下來,讓我們構建字典,將目標和源詞彙表中的索引轉換為單詞:

1. reverse_target_word_index=y_tokenizer.index_word 
2. reverse_source_word_index=x_tokenizer.index_word
3. target_word_index=y_tokenizer.word_index

推理

設置編碼器和解碼器的推理:

1. # encoder inference 
2. encoder_model = Model(inputs=encoder_inputs,outputs=[encoder_outputs, state_h, state_c])
3.
4. # decoder inference
5. # Below tensors will hold the states of the previous time step
6. decoder_state_input_h = Input(shape=(latent_dim,))
7. decoder_state_input_c = Input(shape=(latent_dim,))
8. decoder_hidden_state_input = Input(shape=(max_len_text,latent_dim))
9.
10. # Get the embeddings of the decoder sequence
11. dec_emb2= dec_emb_layer(decoder_inputs)
12.
13. # To predict the next word in the sequence, set the initial states to the states from the previous time step
14. decoder_outputs2, state_h2, state_c2 = decoder_lstm(dec_emb2, initial_state=[decoder_state_input_h, decoder_state_input_c])
15.
16. #attention inference
17. attn_out_inf, attn_states_inf = attn_layer([decoder_hidden_state_input, decoder_outputs2])
18. decoder_inf_concat = Concatenate(axis=-1, name='concat')([decoder_outputs2, attn_out_inf])
19.
20. # A dense softmax layer to generate prob dist. over the target vocabulary
21. decoder_outputs2 = decoder_dense(decoder_inf_concat)
22.
23. # Final decoder model
24. decoder_model = Model(
25. [decoder_inputs] + [decoder_hidden_state_input,decoder_state_input_h, decoder_state_input_c],
26. [decoder_outputs2] + [state_h2, state_c2])

下面我們定義一個函數來實現推理過程(這個過程我們在前面的章節中已經介紹過):

1. def decode_sequence(input_seq): 
2.     # Encode the input as state vectors.
3.     e_out, e_h, e_c = encoder_model.predict(input_seq)
4.
5.     # Generate empty target sequence of length 1.
6.     target_seq = np.zeros((1,1))
7.
8.     # Choose the 'start' word as the first word of the target sequence
9.     target_seq[0, 0] = target_word_index['start']
10.
11.     stop_condition = False
12.     decoded_sentence = ''
13.     while not stop_condition:
14.         output_tokens, h, c = decoder_model.predict([target_seq] + [e_out, e_h, e_c])
15.
16.         # Sample a token
17.         sampled_token_index = np.argmax(output_tokens[0, -1, :])
18.         sampled_token = reverse_target_word_index[sampled_token_index]
19.
20.         if(sampled_token!='end'):
21.             decoded_sentence += ' '+sampled_token
22.
23.         # Exit condition: either hit max length or find stop word.
24.         if (sampled_token == 'end' or len(decoded_sentence.split()) >= (max_len_summary-1)):
25.             stop_condition = True
26.
27.         # Update the target sequence (of length 1).
28.         target_seq = np.zeros((1,1))
29.         target_seq[0, 0] = sampled_token_index
30.
31.         # Update internal states
32.         e_h, e_c = h, c
33.
34.     return decoded_sentence

我們來定義函數,用於將摘要和評論中的整數序列轉換為單詞序列:

1. def seq2summary(input_seq): 
2.     newString=''
3.     for i in input_seq:
4.         if((i!=0 and i!=target_word_index['start']) and i!=target_word_index['end']):
5.             newString=newString+reverse_target_word_index[i]+' '
6.     return newString
7.
8. def seq2text(input_seq):
9.     newString=''
10.     for i in input_seq:
11.         if(i!=0):
12.             newString=newString+reverse_source_word_index[i]+' '
13.     return newString

接著在驗證集上對比原始摘要與模型生成的摘要:

1. for i in range(len(x_val)):
2.     print("Review:",seq2text(x_val[i]))
3.     print("Original summary:",seq2summary(y_val[i]))
4.     print("Predicted summary:",decode_sequence(x_val[i].reshape(1,max_len_text)))
5.     print("\n")

以下是該模型生成的一些摘要:

Python利用深度學習進行文本摘要的綜合指南下篇(附教程)

Python利用深度學習進行文本摘要的綜合指南下篇(附教程)

"


Python利用深度學習進行文本摘要的綜合指南下篇(附教程)

作者:ARAVIND PAI

翻譯:和中華

校對:申利彬

本文約7500字,建議閱讀15分鐘。

本文介紹瞭如何利用seq2seq來建立一個文本摘要模型,以及其中的注意力機制。並利用Keras搭建編寫了一個完整的模型代碼。

介紹

“我不想要完整的報告,只需給我一個結果摘要”。我發現自己經常處於這種狀況——無論是在大學還是在職場中。我們準備了一份綜合全面的報告,但教師/主管卻僅僅有時間閱讀摘要。

聽起來很熟悉?好吧,我決定對此採取一些措施。手動將報告轉換為摘要太耗費時間了,對吧?那我可以依靠自然語言處理(NLP)技術來幫忙嗎?

自然語言處理(NLP)

https://courses.analyticsvidhya.com/courses/natural-language-processing-nlp?utm_source=blog&utm_medium=comprehensive-guide-text-summarization-using-deep-learning-python

這就是使用深度學習的文本摘要真正幫助我的地方。它解決了以前一直困擾著我的問題——現在我們的模型可以理解整個文本的上下文。對於所有需要把文檔快速摘要的人來說,這個夢想已成現實!

Python利用深度學習進行文本摘要的綜合指南下篇(附教程)


我們使用深度學習完成的文本摘要結果如何呢?非常出色。因此,在本文中,我們將逐步介紹使用深度學習構建文本摘要器的過程,其中包含構建它所需的全部概念。然後將用Python實現我們的第一個文本摘要模型!

注意:本文要求對一些深度學習概念有基本的瞭解。 我建議閱讀以下文章。

  • A Must-Read Introduction to Sequence Modelling (with use cases)

https://www.analyticsvidhya.com/blog/2018/04/sequence-modelling-an-introduction-with-practical-use-cases/?

utm_source=blog&utm_medium=comprehensive-guide-text-summarization-using-deep-learning-python

  • Must-Read Tutorial to Learn Sequence Modeling (deeplearning.ai Course #5)

https://www.analyticsvidhya.com/blog/2019/01/sequence-models-deeplearning/?utm_source=blog&utm_medium=comprehensive-guide-text-summarization-using-deep-learning-python

  • Essentials of Deep Learning: Introduction to Long Short Term Memory

https://www.analyticsvidhya.com/blog/2017/12/fundamentals-of-deep-learning-introduction-to-lstm/?utm_source=blog&utm_medium=comprehensive-guide-text-summarization-using-deep-learning-python

目錄

1. NLP中的文本摘要是什麼?

2. 序列到序列(Seq2Seq)建模簡介

3. 理解編碼器(Encoder)-解碼器(Decoder)架構

4. 編碼器-解碼器結構的侷限性

5. 注意力機制背後的直覺

6. 理解問題陳述

7. 使用Keras在Python中實現文本摘要模型

8. 注意力機制如何運作?

我在本文的最後面保留了“注意力機制如何運作?”的部分。這是一個數學密集的部分,並不強制瞭解Python代碼的工作原理。但是,我鼓勵你通讀它,因為它會讓你對這個NLP概念有一個堅實的理解。

注:此篇包含內容7-8,1-6內容請見:Python利用深度學習進行文本摘要的綜合指南上篇(附教程)

7. 使用Keras在Python中實現文本摘要

現在是時候開啟我們的Jupyter notebook了!讓我們馬上深入瞭解實施細節。

自定義注意力層

Keras官方沒有正式支持注意力層。 因此,我們要麼實現自己的注意力層,要麼使用第三方實現。在本文中我們採用後者。

1. from attention import AttentionLayer

導入庫

1. import numpy as np 
2. import pandas as pd
3. import re
4. from bs4 import BeautifulSoup
5. from keras.preprocessing.text import Tokenizer
6. from keras.preprocessing.sequence import pad_sequences
7. from nltk.corpus import stopwords
8. from tensorflow.keras.layers import Input, LSTM, Embedding, Dense, Concatenate, TimeDistributed, Bidirectional
9. from tensorflow.keras.models import Model
10. from tensorflow.keras.callbacks import EarlyStopping
11. import warnings
12. pd.set_option("display.max_colwidth", 200)
13. warnings.filterwarnings("ignore")

讀取數據集

該數據集包括亞馬遜美食的評論。 這些數據涵蓋了超過10年的時間,截至2012年10月的所有約500,000條評論。這些評論包括產品和用戶信息,評級,純文本評論和摘要。它還包括來自所有其他亞馬遜類別的評論。

我們將抽樣出100,000個評論,以縮短模型的訓練時間。如果你的機器具有強大的計算能力,也可以使用整個數據集來訓練模型。

1. data=pd.read_csv("../input/amazon-fine-food-reviews/Reviews.csv",nrows=100000) 

刪除重複項和NA值

1. data.drop_duplicates(subset=['Text'],inplace=True) #dropping duplicates 
2. data.dropna(axis=0,inplace=True) #dropping na

預處理

在我們進入模型構建部分之前,執行基本的預處理步驟非常重要。使用髒亂和未清理的文本數據是一個潛在的災難性舉措。因此,在此步驟中,我們將從文本中刪除不影響問題目標的所有不需要的符號,字符等。

這是我們用於擴展縮略形式的字典:

1. contraction_mapping = {"ain't": "is not", "aren't": "are not","can't": "cannot", "'cause": "because", "could've": "could have", "couldn't": "could not", 
2.
3. "didn't": "did not", "doesn't": "does not", "don't": "do not", "hadn't": "had not", "hasn't": "has not", "haven't": "have not",
4.
5. "he'd": "he would","he'll": "he will", "he's": "he is", "how'd": "how did", "how'd'y": "how do you", "how'll": "how will", "how's": "how is",
6.
7. "I'd": "I would", "I'd've": "I would have", "I'll": "I will", "I'll've": "I will have","I'm": "I am", "I've": "I have", "i'd": "i would",
8.
9. "i'd've": "i would have", "i'll": "i will", "i'll've": "i will have","i'm": "i am", "i've": "i have", "isn't": "is not", "it'd": "it would",
10.
11. "it'd've": "it would have", "it'll": "it will", "it'll've": "it will have","it's": "it is", "let's": "let us", "ma'am": "madam",
12.
13. "mayn't": "may not", "might've": "might have","mightn't": "might not","mightn't've": "might not have", "must've": "must have",
14.
15. "mustn't": "must not", "mustn't've": "must not have", "needn't": "need not", "needn't've": "need not have","o'clock": "of the clock",
16.
17. "oughtn't": "ought not", "oughtn't've": "ought not have", "shan't": "shall not", "sha'n't": "shall not", "shan't've": "shall not have",
18.
19. "she'd": "she would", "she'd've": "she would have", "she'll": "she will", "she'll've": "she will have", "she's": "she is",
20.
21. "should've": "should have", "shouldn't": "should not", "shouldn't've": "should not have", "so've": "so have","so's": "so as",
22.
23. "this's": "this is","that'd": "that would", "that'd've": "that would have", "that's": "that is", "there'd": "there would",
24.
25. "there'd've": "there would have", "there's": "there is", "here's": "here is","they'd": "they would", "they'd've": "they would have",
26.
27. "they'll": "they will", "they'll've": "they will have", "they're": "they are", "they've": "they have", "to've": "to have",
28.
29. "wasn't": "was not", "we'd": "we would", "we'd've": "we would have", "we'll": "we will", "we'll've": "we will have", "we're": "we are",
30.
31. "we've": "we have", "weren't": "were not", "what'll": "what will", "what'll've": "what will have", "what're": "what are",
32.
33. "what's": "what is", "what've": "what have", "when's": "when is", "when've": "when have", "where'd": "where did", "where's": "where is",
34.
35. "where've": "where have", "who'll": "who will", "who'll've": "who will have", "who's": "who is", "who've": "who have",
36.
37. "why's": "why is", "why've": "why have", "will've": "will have", "won't": "will not", "won't've": "will not have",
38.
39. "would've": "would have", "wouldn't": "would not", "wouldn't've": "would not have", "y'all": "you all",
40.
41. "y'all'd": "you all would","y'all'd've": "you all would have","y'all're": "you all are","y'all've": "you all have",
42.
43. "you'd": "you would", "you'd've": "you would have", "you'll": "you will", "you'll've": "you will have",
44.
45. "you're": "you are", "you've": "you have"}

我們需要定義兩個不同的函數來預處理評論並生成摘要,因為文本和摘要中涉及的預處理步驟略有不同。

a)文字清理

讓我們看一下數據集中的前10個評論,以瞭解該如何進行文本預處理步驟:

1. data['Text'][:10] 

輸出:

Python利用深度學習進行文本摘要的綜合指南下篇(附教程)


我們將為我們的數據執行以下預處理任務:

  • 將所有內容轉換為小寫
  • 刪除HTML標籤
  • 縮略形式映射
  • 刪除('s)
  • 刪除括號內的任何文本()
  • 消除標點符號和特殊字符
  • 刪除停用詞
  • 刪除簡短的單詞

讓我們定義一下這個函數:

stop_words = set(stopwords.words('english'))

def text_cleaner(text):
    newString = text.lower()
    newString = BeautifulSoup(newString, "lxml").text                # remove HTML tags
    newString = re.sub(r'\([^)]*\)', '', newString)                  # remove text inside parentheses
    newString = re.sub('"', '', newString)
    newString = ' '.join([contraction_mapping[t] if t in contraction_mapping else t for t in newString.split(" ")])
    newString = re.sub(r"'s\b", "", newString)
    newString = re.sub("[^a-zA-Z]", " ", newString)
    tokens = [w for w in newString.split() if not w in stop_words]   # remove stopwords
    long_words = []
    for i in tokens:
        if len(i) >= 3:                                              # removing short words
            long_words.append(i)
    return (" ".join(long_words)).strip()

cleaned_text = []
for t in data['Text']:
    cleaned_text.append(text_cleaner(t))
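
為了直觀感受清理的效果,下面給出一個簡單的用法示例(示例句子是虛構的,實際輸出取決於所用的NLTK停用詞表):

print(text_cleaner("Great coffee!!! (my kids love it) Highly recommended."))
# 大致會得到: great coffee highly recommended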

b)摘要清理

現在,我們來看看摘要(Summary)列的前10行,以瞭解摘要的預處理步驟:

data['Summary'][:10]

輸出:

(此處為輸出截圖)


定義此任務的函數:

def summary_cleaner(text):
    newString = re.sub('"', '', text)
    newString = ' '.join([contraction_mapping[t] if t in contraction_mapping else t for t in newString.split(" ")])
    newString = re.sub(r"'s\b", "", newString)
    newString = re.sub("[^a-zA-Z]", " ", newString)
    newString = newString.lower()
    tokens = newString.split()
    newString = ''
    for i in tokens:
        if len(i) > 1:                      # keep only words longer than one character
            newString = newString + i + ' '
    return newString

# Call the above function
cleaned_summary = []
for t in data['Summary']:
    cleaned_summary.append(summary_cleaner(t))

data['cleaned_text'] = cleaned_text
data['cleaned_summary'] = cleaned_summary
data['cleaned_summary'].replace('', np.nan, inplace=True)
data.dropna(axis=0, inplace=True)

請記住在摘要的開頭和結尾添加START和END特殊標記:

data['cleaned_summary'] = data['cleaned_summary'].apply(lambda x: '_START_ ' + x + ' _END_')
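
順帶說明一點:Keras Tokenizer的默認filters會過濾掉下劃線等符號,因此'_START_'和'_END_'在分詞後會變成'start'和'end',這也是後面推理代碼中查找target_word_index['start']而不是'_START_'的原因。可以用下面的小例子驗證(示意性代碼):

from keras.preprocessing.text import Tokenizer

t = Tokenizer()   # 默認filters包含'_',且默認轉為小寫
t.fit_on_texts(['_START_ good coffee _END_'])
print(t.word_index)   # 大致為 {'start': 1, 'good': 2, 'coffee': 3, 'end': 4}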

現在,我們來看看前5個評論及其摘要:

for i in range(5):
    print("Review:", data['cleaned_text'][i])
    print("Summary:", data['cleaned_summary'][i])
    print("\n")

輸出:

(此處為輸出截圖)


瞭解序列的分佈

在這裡,我們將分析評論和摘要的長度,以全面瞭解文本長度的分佈。這將幫助我們確定序列的最大長度:

import matplotlib.pyplot as plt

text_word_count = []
summary_word_count = []

# populate the lists with sentence lengths
for i in data['cleaned_text']:
    text_word_count.append(len(i.split()))

for i in data['cleaned_summary']:
    summary_word_count.append(len(i.split()))

length_df = pd.DataFrame({'text': text_word_count, 'summary': summary_word_count})
length_df.hist(bins=30)
plt.show()

輸出:

(此處為輸出截圖)


有趣。我們可以把評論的最大長度固定為80,因為大多數評論的長度都在這個範圍之內。同樣,我們可以把最大摘要長度設置為10:

max_len_text = 80
max_len_summary = 10
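
如果想驗證這兩個長度是否合理,可以順手檢查一下它們覆蓋了多少比例的樣本(下面是一個可選的小檢查,沿用上面定義的length_df):

print((length_df['text'] <= 80).mean())      # 文本長度不超過80個詞的樣本比例
print((length_df['summary'] <= 10).mean())   # 摘要長度不超過10個詞的樣本比例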

我們離模型構建部分越來越近了。在此之前,需要把數據集拆分為訓練集和驗證集。我們將使用90%的數據作為訓練數據,並在其餘10%(保留集)上評估表現:

from sklearn.model_selection import train_test_split

x_tr, x_val, y_tr, y_val = train_test_split(data['cleaned_text'], data['cleaned_summary'], test_size=0.1, random_state=0, shuffle=True)

準備分詞器(Tokenizer)

分詞器構建詞彙表並將單詞序列轉換為整數序列。繼續為文本和摘要構建分詞器:

a)文本分詞器

#prepare a tokenizer for reviews on training data
x_tokenizer = Tokenizer()
x_tokenizer.fit_on_texts(list(x_tr))

#convert text sequences into integer sequences
x_tr = x_tokenizer.texts_to_sequences(x_tr)
x_val = x_tokenizer.texts_to_sequences(x_val)

#padding zero upto maximum length
x_tr = pad_sequences(x_tr, maxlen=max_len_text, padding='post')
x_val = pad_sequences(x_val, maxlen=max_len_text, padding='post')

x_voc_size = len(x_tokenizer.word_index) + 1

b)摘要分詞器

#preparing a tokenizer for summary on training data
y_tokenizer = Tokenizer()
y_tokenizer.fit_on_texts(list(y_tr))

#convert summary sequences into integer sequences
y_tr = y_tokenizer.texts_to_sequences(y_tr)
y_val = y_tokenizer.texts_to_sequences(y_val)

#padding zero upto maximum length
y_tr = pad_sequences(y_tr, maxlen=max_len_summary, padding='post')
y_val = pad_sequences(y_val, maxlen=max_len_summary, padding='post')

y_voc_size = len(y_tokenizer.word_index) + 1
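
如果你對分詞器的行為還不太熟悉,下面這個與數據集無關的玩具示例可以幫助理解fit_on_texts、texts_to_sequences和pad_sequences各自做了什麼(輸出中的具體索引取決於詞頻和出現順序):

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

toy = ["good product", "really good coffee"]
toy_tokenizer = Tokenizer()
toy_tokenizer.fit_on_texts(toy)
print(toy_tokenizer.word_index)
# 大致為 {'good': 1, 'product': 2, 'really': 3, 'coffee': 4}
print(pad_sequences(toy_tokenizer.texts_to_sequences(toy), maxlen=4, padding='post'))
# [[1 2 0 0]
#  [3 1 4 0]]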

模型構建

終於來到了模型構建的部分。但在構建之前,我們需要熟悉所需的一些術語。

  • Return Sequences = True:當return_sequences參數設置為True時,LSTM會輸出每個時間步的隱藏狀態(即完整的輸出序列)
  • Return State = True:當return_state = True時,LSTM會額外返回最後一個時間步的隱藏狀態和單元狀態(這兩個參數對輸出形狀的影響,見本列表後面的小例子)
  • Initial State:用於在第一個時間步初始化LSTM的內部狀態
  • Stacked LSTM:堆疊LSTM由多層LSTM相互疊加而成,能得到更好的序列表示。我鼓勵你嘗試把多層LSTM堆疊在一起(這是一個很好的學習方法)
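
下面是一個獨立的小例子(與後面的模型無關,輸入全零僅作演示),用來展示return_sequences和return_state對輸出形狀的影響:

import numpy as np
from tensorflow.keras.layers import Input, LSTM
from tensorflow.keras.models import Model

inp = Input(shape=(5, 8))                         # 5個時間步,每步8維特徵
out, h, c = LSTM(4, return_sequences=True, return_state=True)(inp)
demo = Model(inp, [out, h, c])

o, h_val, c_val = demo.predict(np.zeros((1, 5, 8)))
print(o.shape, h_val.shape, c_val.shape)          # (1, 5, 4) (1, 4) (1, 4):每個時間步的隱藏狀態,以及最後的h和c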

在這裡,我們為編碼器構建一個3層堆疊LSTM:

from keras import backend as K
K.clear_session()

latent_dim = 500

# Encoder
encoder_inputs = Input(shape=(max_len_text,))
enc_emb = Embedding(x_voc_size, latent_dim, trainable=True)(encoder_inputs)

#LSTM 1
encoder_lstm1 = LSTM(latent_dim, return_sequences=True, return_state=True)
encoder_output1, state_h1, state_c1 = encoder_lstm1(enc_emb)

#LSTM 2
encoder_lstm2 = LSTM(latent_dim, return_sequences=True, return_state=True)
encoder_output2, state_h2, state_c2 = encoder_lstm2(encoder_output1)

#LSTM 3
encoder_lstm3 = LSTM(latent_dim, return_state=True, return_sequences=True)
encoder_outputs, state_h, state_c = encoder_lstm3(encoder_output2)

# Set up the decoder.
decoder_inputs = Input(shape=(None,))
dec_emb_layer = Embedding(y_voc_size, latent_dim, trainable=True)
dec_emb = dec_emb_layer(decoder_inputs)

#LSTM using encoder_states as initial state
decoder_lstm = LSTM(latent_dim, return_sequences=True, return_state=True)
decoder_outputs, decoder_fwd_state, decoder_back_state = decoder_lstm(dec_emb, initial_state=[state_h, state_c])

#Attention Layer
attn_layer = AttentionLayer(name='attention_layer')
attn_out, attn_states = attn_layer([encoder_outputs, decoder_outputs])

# Concat attention output and decoder LSTM output
decoder_concat_input = Concatenate(axis=-1, name='concat_layer')([decoder_outputs, attn_out])

#Dense layer
decoder_dense = TimeDistributed(Dense(y_voc_size, activation='softmax'))
decoder_outputs = decoder_dense(decoder_concat_input)

# Define the model
model = Model([encoder_inputs, decoder_inputs], decoder_outputs)
model.summary()

輸出:

(此處為輸出截圖)


我使用sparse categorical cross-entropy作為損失函數,因為它可以直接接受整數形式的目標序列,相當於在計算時即時(on the fly)完成獨熱(one-hot)轉換,從而避免了預先構建巨大獨熱目標矩陣帶來的內存問題。

model.compile(optimizer='rmsprop', loss='sparse_categorical_crossentropy')
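
為了直觀感受這裡的內存差別,可以做一個粗略的估算(其中詞彙量30,000只是一個示意性的假設,實際的y_voc_size取決於數據):

import numpy as np

batch, steps, vocab = 512, 10, 30000
sparse_target = np.zeros((batch, steps), dtype='int32')   # 整數標籤
print(sparse_target.nbytes)                               # 20480字節,約20 KB

# 若改用普通的categorical_crossentropy,目標需要是(batch, steps, vocab)的float32獨熱數組
print(batch * steps * vocab * 4)                          # 614400000字節,超過600 MB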

還記得early stopping的概念嗎?它通過監視用戶指定的度量指標,在合適的時機停止神經網絡的訓練。這裡我監視的是驗證集損失(val_loss),一旦它開始上升,訓練就會停止:

es = EarlyStopping(monitor='val_loss', mode='min', verbose=1)
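
如果想讓訓練對val_loss的短暫波動更寬容一些,也可以考慮下面這種變體(這不是原文的設置,僅供參考):

es = EarlyStopping(monitor='val_loss', mode='min', verbose=1,
                   patience=2,                  # 允許val_loss連續2個epoch沒有改善
                   restore_best_weights=True)   # 停止時回滾到val_loss最低的那組權重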

我們將以512的批量大小訓練模型,並在保留集(數據集的10%)上驗證它。注意這裡採用了teacher forcing:解碼器的輸入是去掉最後一個詞的摘要序列(y[:, :-1]),而擬合的目標是整體左移一位的序列(y[:, 1:]),即讓模型在每個時間步預測下一個詞:

history = model.fit([x_tr, y_tr[:, :-1]], y_tr.reshape(y_tr.shape[0], y_tr.shape[1], 1)[:, 1:],
                    epochs=50, callbacks=[es], batch_size=512,
                    validation_data=([x_val, y_val[:, :-1]], y_val.reshape(y_val.shape[0], y_val.shape[1], 1)[:, 1:]))

瞭解診斷圖

現在,我們將繪製一些診斷圖來了解模型隨時間的變化情況:

from matplotlib import pyplot

pyplot.plot(history.history['loss'], label='train')
pyplot.plot(history.history['val_loss'], label='test')
pyplot.legend()
pyplot.show()

輸出:

(此處為輸出截圖)


我們可以推斷,在第10個週期(epoch)之後,驗證集損失略有增加。因此,我們將在此之後停止訓練模型。

接下來,讓我們構建字典,將目標和源詞彙表中的索引轉換為單詞:

reverse_target_word_index = y_tokenizer.index_word
reverse_source_word_index = x_tokenizer.index_word
target_word_index = y_tokenizer.word_index

推理

設置編碼器和解碼器的推理:

# encoder inference
encoder_model = Model(inputs=encoder_inputs, outputs=[encoder_outputs, state_h, state_c])

# decoder inference
# Below tensors will hold the states of the previous time step
decoder_state_input_h = Input(shape=(latent_dim,))
decoder_state_input_c = Input(shape=(latent_dim,))
decoder_hidden_state_input = Input(shape=(max_len_text, latent_dim))

# Get the embeddings of the decoder sequence
dec_emb2 = dec_emb_layer(decoder_inputs)

# To predict the next word in the sequence, set the initial states to the states from the previous time step
decoder_outputs2, state_h2, state_c2 = decoder_lstm(dec_emb2, initial_state=[decoder_state_input_h, decoder_state_input_c])

#attention inference
attn_out_inf, attn_states_inf = attn_layer([decoder_hidden_state_input, decoder_outputs2])
decoder_inf_concat = Concatenate(axis=-1, name='concat')([decoder_outputs2, attn_out_inf])

# A dense softmax layer to generate prob dist. over the target vocabulary
decoder_outputs2 = decoder_dense(decoder_inf_concat)

# Final decoder model
decoder_model = Model(
    [decoder_inputs] + [decoder_hidden_state_input, decoder_state_input_h, decoder_state_input_c],
    [decoder_outputs2] + [state_h2, state_c2])

下面我們定義了一個函數,是推理過程的實現(我們在上一節中介紹過):

def decode_sequence(input_seq):
    # Encode the input as state vectors.
    e_out, e_h, e_c = encoder_model.predict(input_seq)

    # Generate empty target sequence of length 1.
    target_seq = np.zeros((1, 1))

    # Choose the 'start' word as the first word of the target sequence
    target_seq[0, 0] = target_word_index['start']

    stop_condition = False
    decoded_sentence = ''
    while not stop_condition:
        output_tokens, h, c = decoder_model.predict([target_seq] + [e_out, e_h, e_c])

        # Sample a token
        sampled_token_index = np.argmax(output_tokens[0, -1, :])
        sampled_token = reverse_target_word_index[sampled_token_index]

        if sampled_token != 'end':
            decoded_sentence += ' ' + sampled_token

        # Exit condition: either hit max length or find stop word.
        if sampled_token == 'end' or len(decoded_sentence.split()) >= (max_len_summary - 1):
            stop_condition = True

        # Update the target sequence (of length 1).
        target_seq = np.zeros((1, 1))
        target_seq[0, 0] = sampled_token_index

        # Update internal states
        e_h, e_c = h, c

    return decoded_sentence

我們來定義函數,用於將摘要和評論中的整數序列轉換為單詞序列:

def seq2summary(input_seq):
    newString = ''
    for i in input_seq:
        if (i != 0 and i != target_word_index['start']) and i != target_word_index['end']:
            newString = newString + reverse_target_word_index[i] + ' '
    return newString

def seq2text(input_seq):
    newString = ''
    for i in input_seq:
        if i != 0:
            newString = newString + reverse_source_word_index[i] + ' '
    return newString

for i in range(len(x_val)):
    print("Review:", seq2text(x_val[i]))
    print("Original summary:", seq2summary(y_val[i]))
    print("Predicted summary:", decode_sequence(x_val[i].reshape(1, max_len_text)))
    print("\n")

以下是該模型生成的一些摘要:

(此處為模型生成摘要的輸出截圖)

"


Python利用深度學習進行文本摘要的綜合指南下篇(附教程)

作者:ARAVIND PAI

翻譯:和中華

校對:申利彬

本文約7500字,建議閱讀15分鐘。

本文介紹瞭如何利用seq2seq來建立一個文本摘要模型,以及其中的注意力機制。並利用Keras搭建編寫了一個完整的模型代碼。

介紹

“我不想要完整的報告,只需給我一個結果摘要”。我發現自己經常處於這種狀況——無論是在大學還是在職場中。我們準備了一份綜合全面的報告,但教師/主管卻僅僅有時間閱讀摘要。

聽起來很熟悉?好吧,我決定對此採取一些措施。手動將報告轉換為摘要太耗費時間了,對吧?那我可以依靠自然語言處理(NLP)技術來幫忙嗎?

自然語言處理(NLP)

https://courses.analyticsvidhya.com/courses/natural-language-processing-nlp?utm_source=blog&utm_medium=comprehensive-guide-text-summarization-using-deep-learning-python

這就是使用深度學習的文本摘要真正幫助我的地方。它解決了以前一直困擾著我的問題——現在我們的模型可以理解整個文本的上下文。對於所有需要把文檔快速摘要的人來說,這個夢想已成現實!

Python利用深度學習進行文本摘要的綜合指南下篇(附教程)


我們使用深度學習完成的文本摘要結果如何呢?非常出色。因此,在本文中,我們將逐步介紹使用深度學習構建文本摘要器的過程,其中包含構建它所需的全部概念。然後將用Python實現我們的第一個文本摘要模型!

注意:本文要求對一些深度學習概念有基本的瞭解。 我建議閱讀以下文章。

  • A Must-Read Introduction to Sequence Modelling (with use cases)

https://www.analyticsvidhya.com/blog/2018/04/sequence-modelling-an-introduction-with-practical-use-cases/?

utm_source=blog&utm_medium=comprehensive-guide-text-summarization-using-deep-learning-python

  • Must-Read Tutorial to Learn Sequence Modeling (deeplearning.ai Course #5)

https://www.analyticsvidhya.com/blog/2019/01/sequence-models-deeplearning/?utm_source=blog&utm_medium=comprehensive-guide-text-summarization-using-deep-learning-python

  • Essentials of Deep Learning: Introduction to Long Short Term Memory

https://www.analyticsvidhya.com/blog/2017/12/fundamentals-of-deep-learning-introduction-to-lstm/?utm_source=blog&utm_medium=comprehensive-guide-text-summarization-using-deep-learning-python

目錄

1. NLP中的文本摘要是什麼?

2. 序列到序列(Seq2Seq)建模簡介

3. 理解編碼器(Encoder)-解碼器(Decoder)架構

4. 編碼器-解碼器結構的侷限性

5. 注意力機制背後的直覺

6. 理解問題陳述

7. 使用Keras在Python中實現文本摘要模型

8. 注意力機制如何運作?

我在本文的最後面保留了“注意力機制如何運作?”的部分。這是一個數學密集的部分,並不強制瞭解Python代碼的工作原理。但是,我鼓勵你通讀它,因為它會讓你對這個NLP概念有一個堅實的理解。

注:此篇包含內容7-8,1-6內容請見:Python利用深度學習進行文本摘要的綜合指南上篇(附教程)

7. 使用Keras在Python中實現文本摘要

現在是時候開啟我們的Jupyter notebook了!讓我們馬上深入瞭解實施細節。

自定義注意力層

Keras官方沒有正式支持注意力層。 因此,我們要麼實現自己的注意力層,要麼使用第三方實現。在本文中我們採用後者。

1. from attention import AttentionLayer

導入庫

1. import numpy as np 
2. import pandas as pd
3. import re
4. from bs4 import BeautifulSoup
5. from keras.preprocessing.text import Tokenizer
6. from keras.preprocessing.sequence import pad_sequences
7. from nltk.corpus import stopwords
8. from tensorflow.keras.layers import Input, LSTM, Embedding, Dense, Concatenate, TimeDistributed, Bidirectional
9. from tensorflow.keras.models import Model
10. from tensorflow.keras.callbacks import EarlyStopping
11. import warnings
12. pd.set_option("display.max_colwidth", 200)
13. warnings.filterwarnings("ignore")

讀取數據集

該數據集包括亞馬遜美食的評論。 這些數據涵蓋了超過10年的時間,截至2012年10月的所有約500,000條評論。這些評論包括產品和用戶信息,評級,純文本評論和摘要。它還包括來自所有其他亞馬遜類別的評論。

我們將抽樣出100,000個評論,以縮短模型的訓練時間。如果你的機器具有強大的計算能力,也可以使用整個數據集來訓練模型。

1. data=pd.read_csv("../input/amazon-fine-food-reviews/Reviews.csv",nrows=100000) 

刪除重複項和NA值

1. data.drop_duplicates(subset=['Text'],inplace=True) #dropping duplicates 
2. data.dropna(axis=0,inplace=True) #dropping na

預處理

在我們進入模型構建部分之前,執行基本的預處理步驟非常重要。使用髒亂和未清理的文本數據是一個潛在的災難性舉措。因此,在此步驟中,我們將從文本中刪除不影響問題目標的所有不需要的符號,字符等。

這是我們用於擴展縮略形式的字典:

1. contraction_mapping = {"ain't": "is not", "aren't": "are not","can't": "cannot", "'cause": "because", "could've": "could have", "couldn't": "could not", 
2.
3. "didn't": "did not", "doesn't": "does not", "don't": "do not", "hadn't": "had not", "hasn't": "has not", "haven't": "have not",
4.
5. "he'd": "he would","he'll": "he will", "he's": "he is", "how'd": "how did", "how'd'y": "how do you", "how'll": "how will", "how's": "how is",
6.
7. "I'd": "I would", "I'd've": "I would have", "I'll": "I will", "I'll've": "I will have","I'm": "I am", "I've": "I have", "i'd": "i would",
8.
9. "i'd've": "i would have", "i'll": "i will", "i'll've": "i will have","i'm": "i am", "i've": "i have", "isn't": "is not", "it'd": "it would",
10.
11. "it'd've": "it would have", "it'll": "it will", "it'll've": "it will have","it's": "it is", "let's": "let us", "ma'am": "madam",
12.
13. "mayn't": "may not", "might've": "might have","mightn't": "might not","mightn't've": "might not have", "must've": "must have",
14.
15. "mustn't": "must not", "mustn't've": "must not have", "needn't": "need not", "needn't've": "need not have","o'clock": "of the clock",
16.
17. "oughtn't": "ought not", "oughtn't've": "ought not have", "shan't": "shall not", "sha'n't": "shall not", "shan't've": "shall not have",
18.
19. "she'd": "she would", "she'd've": "she would have", "she'll": "she will", "she'll've": "she will have", "she's": "she is",
20.
21. "should've": "should have", "shouldn't": "should not", "shouldn't've": "should not have", "so've": "so have","so's": "so as",
22.
23. "this's": "this is","that'd": "that would", "that'd've": "that would have", "that's": "that is", "there'd": "there would",
24.
25. "there'd've": "there would have", "there's": "there is", "here's": "here is","they'd": "they would", "they'd've": "they would have",
26.
27. "they'll": "they will", "they'll've": "they will have", "they're": "they are", "they've": "they have", "to've": "to have",
28.
29. "wasn't": "was not", "we'd": "we would", "we'd've": "we would have", "we'll": "we will", "we'll've": "we will have", "we're": "we are",
30.
31. "we've": "we have", "weren't": "were not", "what'll": "what will", "what'll've": "what will have", "what're": "what are",
32.
33. "what's": "what is", "what've": "what have", "when's": "when is", "when've": "when have", "where'd": "where did", "where's": "where is",
34.
35. "where've": "where have", "who'll": "who will", "who'll've": "who will have", "who's": "who is", "who've": "who have",
36.
37. "why's": "why is", "why've": "why have", "will've": "will have", "won't": "will not", "won't've": "will not have",
38.
39. "would've": "would have", "wouldn't": "would not", "wouldn't've": "would not have", "y'all": "you all",
40.
41. "y'all'd": "you all would","y'all'd've": "you all would have","y'all're": "you all are","y'all've": "you all have",
42.
43. "you'd": "you would", "you'd've": "you would have", "you'll": "you will", "you'll've": "you will have",
44.
45. "you're": "you are", "you've": "you have"}

我們需要定義兩個不同的函數來預處理評論並生成摘要,因為文本和摘要中涉及的預處理步驟略有不同。

a)文字清理

讓我們看一下數據集中的前10個評論,以瞭解該如何進行文本預處理步驟:

1. data['Text'][:10] 

輸出:

Python利用深度學習進行文本摘要的綜合指南下篇(附教程)


我們將為我們的數據執行以下預處理任務:

  • 將所有內容轉換為小寫
  • 刪除HTML標籤
  • 縮略形式映射
  • 刪除('s)
  • 刪除括號內的任何文本()
  • 消除標點符號和特殊字符
  • 刪除停用詞
  • 刪除簡短的單詞

讓我們定義一下這個函數:

1. stop_words = set(stopwords.words('english')) 
2. def text_cleaner(text):
3. newString = text.lower()
4. newString = BeautifulSoup(newString, "lxml").text
5. newString = re.sub(r'\\([^)]*\\)', '', newString)
6. newString = re.sub('"','', newString)
7. newString = ' '.join([contraction_mapping[t] if t in contraction_mapping else t for t in newString.split(" ")])
8. newString = re.sub(r"'s\\b","",newString)
9. newString = re.sub("[^a-zA-Z]", " ", newString)
10. tokens = [w for w in newString.split() if not w in stop_words]
11. long_words=[]
12. for i in tokens:
13. if len(i)>=3: #removing short word
14. long_words.append(i)
15. return (" ".join(long_words)).strip()
16.
17. cleaned_text = []
18. for t in data['Text']:
19. cleaned_text.append(text_cleaner(t))

b)摘要清理

現在,我們將查看前10行評論,以瞭解摘要列的預處理步驟:

1. data['Summary'][:10] 

輸出:

Python利用深度學習進行文本摘要的綜合指南下篇(附教程)


定義此任務的函數:

1. def summary_cleaner(text): 
2. newString = re.sub('"','', text)
3. newString = ' '.join([contraction_mapping[t] if t in contraction_mapping else t for t in newString.split(" ")])
4. newString = re.sub(r"'s\\b","",newString)
5. newString = re.sub("[^a-zA-Z]", " ", newString)
6. newString = newString.lower()
7. tokens=newString.split()
8. newString=''
9. for i in tokens:
10. if len(i)>1:
11. newString=newString+i+' '
12. return newString
13.
14. #Call the above function
15. cleaned_summary = []
16. for t in data['Summary']:
17. cleaned_summary.append(summary_cleaner(t))
18.
19. data['cleaned_text']=cleaned_text
20. data['cleaned_summary']=cleaned_summary
21. data['cleaned_summary'].replace('', np.nan, inplace=True)
22. data.dropna(axis=0,inplace=True)

請記住在摘要的開頭和結尾添加START和END特殊標記:

1. data['cleaned_summary'] = data['cleaned_summary'].apply(lambda x : '_START_ '+ x + ' _END_') 

現在,我們來看看前5個評論及其摘要:

1. for i in range(5): 
2. print("Review:",data['cleaned_text'][i])
3. print("Summary:",data['cleaned_summary'][i])
4. print("\\n")

輸出:

Python利用深度學習進行文本摘要的綜合指南下篇(附教程)


瞭解序列的分佈

在這裡,我們將分析評論和摘要的長度,以全面瞭解文本長度的分佈。這將幫助我們確定序列的最大長度:

1. import matplotlib.pyplot as plt 
2. text_word_count = []
3. summary_word_count = []
4.
5. # populate the lists with sentence lengths
6. for i in data['cleaned_text']:
7. text_word_count.append(len(i.split()))
8.
9. for i in data['cleaned_summary']:
10. summary_word_count.append(len(i.split()))
11.
12. length_df = pd.DataFrame({'text':text_word_count, 'summary':summary_word_count})
13. length_df.hist(bins = 30)
14. plt.show()

輸出:

Python利用深度學習進行文本摘要的綜合指南下篇(附教程)


有趣。我們可以將評論的最大長度固定為80,因為這似乎是多數評論的長度。同樣,我們可以將最大摘要長度設置為10:

1. max_len_text=80 
2. max_len_summary=10

我們越來越接近模型的構建部分了。在此之前,我們需要將數據集拆分為訓練和驗證集。我們將使用90%的數據集作為訓練數據,並在其餘10%上評估(保留集)表現:

1. from sklearn.model_selection import train_test_split 
2. x_tr,x_val,y_tr,y_val=train_test_split(data['cleaned_text'],data['cleaned_summary'],test_size=0.1,random_state=0,shuffle=True)

準備分詞器(Tokenizer)

分詞器構建詞彙表並將單詞序列轉換為整數序列。繼續為文本和摘要構建分詞器:

  • a) 文本分詞器
b) #prepare a tokenizer for reviews on training data 
c) x_tokenizer = Tokenizer()
d) x_tokenizer.fit_on_texts(list(x_tr))
e)
f) #convert text sequences into integer sequences
g) x_tr = x_tokenizer.texts_to_sequences(x_tr)
h) x_val = x_tokenizer.texts_to_sequences(x_val)
i)
j) #padding zero upto maximum length
k) x_tr = pad_sequences(x_tr, maxlen=max_len_text, padding='post')
l) x_val = pad_sequences(x_val, maxlen=max_len_text, padding='post')
m)
n) x_voc_size = len(x_tokenizer.word_index) +1
  • b)摘要分詞器
1. #preparing a tokenizer for summary on training data 
2. y_tokenizer = Tokenizer()
3. y_tokenizer.fit_on_texts(list(y_tr))
4.
5. #convert summary sequences into integer sequences
6. y_tr = y_tokenizer.texts_to_sequences(y_tr)
7. y_val = y_tokenizer.texts_to_sequences(y_val)
8.
9. #padding zero upto maximum length
10. y_tr = pad_sequences(y_tr, maxlen=max_len_summary, padding='post')
11. y_val = pad_sequences(y_val, maxlen=max_len_summary, padding='post')
12.
13. y_voc_size = len(y_tokenizer.word_index) +1

模型構建

終於來到了模型構建的部分。但在構建之前,我們需要熟悉所需的一些術語。

  • Return Sequences = True:當return sequences參數設置為True時,LSTM為每個時間步生成隱藏狀態和單元狀態
  • Return State = True:當return state = True時,LSTM僅生成最後一個時間步的隱藏狀態和單元狀態
  • Initial State:用於在第一個時間步初始化LSTM的內部狀態
  • Stacked LSTM:Stacked LSTM具有多層LSTM堆疊在彼此之上。這能產生更好地序列表示。我鼓勵你嘗試將LSTM的多個層堆疊在一起(這是一個很好的學習方法)

在這裡,我們為編碼器構建一個3層堆疊LSTM:

1. from keras import backend as K 
2. K.clear_session()
3. latent_dim = 500
4.
5. # Encoder
6. encoder_inputs = Input(shape=(max_len_text,))
7. enc_emb = Embedding(x_voc_size, latent_dim,trainable=True)(encoder_inputs)
8.
9. #LSTM 1
10. encoder_lstm1 = LSTM(latent_dim,return_sequences=True,return_state=True)
11. encoder_output1, state_h1, state_c1 = encoder_lstm1(enc_emb)
12.
13. #LSTM 2
14. encoder_lstm2 = LSTM(latent_dim,return_sequences=True,return_state=True)
15. encoder_output2, state_h2, state_c2 = encoder_lstm2(encoder_output1)
16.
17. #LSTM 3
18. encoder_lstm3=LSTM(latent_dim, return_state=True, return_sequences=True)
19. encoder_outputs, state_h, state_c= encoder_lstm3(encoder_output2)
20.
21. # Set up the decoder.
22. decoder_inputs = Input(shape=(None,))
23. dec_emb_layer = Embedding(y_voc_size, latent_dim,trainable=True)
24. dec_emb = dec_emb_layer(decoder_inputs)
25.
26. #LSTM using encoder_states as initial state
27. decoder_lstm = LSTM(latent_dim, return_sequences=True, return_state=True)
28. decoder_outputs,decoder_fwd_state, decoder_back_state = decoder_lstm(dec_emb,initial_state=[state_h, state_c])
29.
30. #Attention Layer
31. Attention layer attn_layer = AttentionLayer(name='attention_layer')
32. attn_out, attn_states = attn_layer([encoder_outputs, decoder_outputs])
33.
34. # Concat attention output and decoder LSTM output
35. decoder_concat_input = Concatenate(axis=-1, name='concat_layer')([decoder_outputs, attn_out])
36.
37. #Dense layer
38. decoder_dense = TimeDistributed(Dense(y_voc_size, activation='softmax'))
39. decoder_outputs = decoder_dense(decoder_concat_input)
40.
41. # Define the model
42. model = Model([encoder_inputs, decoder_inputs], decoder_outputs)
43. model.summary()

輸出:

Python利用深度學習進行文本摘要的綜合指南下篇(附教程)


我使用sparse categorical cross-entropy作為損失函數,因為它在運行中將整數序列轉換為獨熱(one-hot)向量。這克服了任何內存問題。

1. model.compile(optimizer='rmsprop', loss='sparse_categorical_crossentropy')

還記得early stopping的概念嗎?它用於通過監視用戶指定的度量標準,在適當的時間停止訓練神經網絡。在這裡,我監視驗證集損失(val_loss)。一旦驗證集損失反彈,我們的模型就會停止訓練:

1. es = EarlyStopping(monitor='val_loss', mode='min', verbose=1)

我們將在批量大小為512的情況下訓練模型,並在保留集(我們數據集的10%)上驗證它:

1. history=model.fit([x_tr,y_tr[:,:-1]], y_tr.reshape(y_tr.shape[0],y_tr.shape[1], 1)[:,1:] ,epochs=50,callbacks=[es],batch_size=512, validation_data=([x_val,y_val[:,:-1]], y_val.reshape(y_val.shape[0],y_val.shape[1], 1)[:,1:]))

瞭解診斷圖

現在,我們將繪製一些診斷圖來了解模型隨時間的變化情況:

1. from matplotlib import pyplot 
2. pyplot.plot(history.history['loss'], label='train')
3. pyplot.plot(history.history['val_loss'], label='test')
4. pyplot.legend() pyplot.show()

輸出:

Python利用深度學習進行文本摘要的綜合指南下篇(附教程)


我們可以推斷,在第10個週期(epoch)之後,驗證集損失略有增加。因此,我們將在此之後停止訓練模型。

接下來,讓我們構建字典,將目標和源詞彙表中的索引轉換為單詞:

1. reverse_target_word_index=y_tokenizer.index_word 
2. reverse_source_word_index=x_tokenizer.index_word
3. target_word_index=y_tokenizer.word_index

推理

設置編碼器和解碼器的推理:

1. # encoder inference 
2. encoder_model = Model(inputs=encoder_inputs,outputs=[encoder_outputs, state_h, state_c])
3.
4. # decoder inference
5. # Below tensors will hold the states of the previous time step
6. decoder_state_input_h = Input(shape=(latent_dim,))
7. decoder_state_input_c = Input(shape=(latent_dim,))
8. decoder_hidden_state_input = Input(shape=(max_len_text,latent_dim))
9.
10. # Get the embeddings of the decoder sequence
11. dec_emb2= dec_emb_layer(decoder_inputs)
12.
13. # To predict the next word in the sequence, set the initial states to the states from the previous time step
14. decoder_outputs2, state_h2, state_c2 = decoder_lstm(dec_emb2, initial_state=[decoder_state_input_h, decoder_state_input_c])
15.
16. #attention inference
17. attn_out_inf, attn_states_inf = attn_layer([decoder_hidden_state_input, decoder_outputs2])
18. decoder_inf_concat = Concatenate(axis=-1, name='concat')([decoder_outputs2, attn_out_inf])
19.
20. # A dense softmax layer to generate prob dist. over the target vocabulary
21. decoder_outputs2 = decoder_dense(decoder_inf_concat)
22.
23. # Final decoder model
24. decoder_model = Model(
25. [decoder_inputs] + [decoder_hidden_state_input,decoder_state_input_h, decoder_state_input_c],
26. [decoder_outputs2] + [state_h2, state_c2])

下面我們定義了一個函數,是推理過程的實現(我們在上一節中介紹過):

1. def decode_sequence(input_seq): 
2. # Encode the input as state vectors.
3. e_out, e_h, e_c = encoder_model.predict(input_seq)
4.
5. # Generate empty target sequence of length 1.
6. target_seq = np.zeros((1,1))
7.
8. # Chose the 'start' word as the first word of the target sequence
9. target_seq[0, 0] = target_word_index['start']
10.
11. stop_condition = False
12. decoded_sentence = ''
13. while not stop_condition:
14. output_tokens, h, c = decoder_model.predict([target_seq] + [e_out, e_h, e_c])
15.
16. # Sample a token
17. sampled_token_index = np.argmax(output_tokens[0, -1, :])
18. sampled_token = reverse_target_word_index[sampled_token_index]
19.
20. if(sampled_token!='end'):
21. decoded_sentence += ' '+sampled_token
22.
23. # Exit condition: either hit max length or find stop word.
24. if (sampled_token == 'end' or len(decoded_sentence.split()) >= (max_len_summary-1)):
25. stop_condition = True
26.
27. # Update the target sequence (of length 1).
28. target_seq = np.zeros((1,1))
29. target_seq[0, 0] = sampled_token_index
30.
31. # Update internal states
32. e_h, e_c = h, c
33.
34. return decoded_sentence

我們來定義函數,用於將摘要和評論中的整數序列轉換為單詞序列:

 1. def seq2summary(input_seq): 
2. newString=''
3. for i in input_seq:
4. if((i!=0 and i!=target_word_index['start']) and i!=target_word_index['end']):
5. newString=newString+reverse_target_word_index[i]+' '
6. return newString
7.
8. def seq2text(input_seq):
9. newString=''
10. for i in input_seq:
11. if(i!=0):
12. newString=newString+reverse_source_word_index[i]+' '
13. return newString
1. for i in range(len(x_val)):
2. print("Review:",seq2text(x_val[i]))
3. print("Original summary:",seq2summary(y_val[i]))
4. print("Predicted summary:",decode_sequence(x_val[i].reshape(1,max_len_text)))
5. print("\\n")

以下是該模型生成的一些摘要:

Python利用深度學習進行文本摘要的綜合指南下篇(附教程)

Python利用深度學習進行文本摘要的綜合指南下篇(附教程)

Python利用深度學習進行文本摘要的綜合指南下篇(附教程)

Python利用深度學習進行文本摘要的綜合指南下篇(附教程)

"


Python利用深度學習進行文本摘要的綜合指南下篇(附教程)

作者:ARAVIND PAI

翻譯:和中華

校對:申利彬

本文約7500字,建議閱讀15分鐘。

本文介紹瞭如何利用seq2seq來建立一個文本摘要模型,以及其中的注意力機制。並利用Keras搭建編寫了一個完整的模型代碼。

介紹

“我不想要完整的報告,只需給我一個結果摘要”。我發現自己經常處於這種狀況——無論是在大學還是在職場中。我們準備了一份綜合全面的報告,但教師/主管卻僅僅有時間閱讀摘要。

聽起來很熟悉?好吧,我決定對此採取一些措施。手動將報告轉換為摘要太耗費時間了,對吧?那我可以依靠自然語言處理(NLP)技術來幫忙嗎?

自然語言處理(NLP)

https://courses.analyticsvidhya.com/courses/natural-language-processing-nlp?utm_source=blog&utm_medium=comprehensive-guide-text-summarization-using-deep-learning-python

這就是使用深度學習的文本摘要真正幫助我的地方。它解決了以前一直困擾著我的問題——現在我們的模型可以理解整個文本的上下文。對於所有需要把文檔快速摘要的人來說,這個夢想已成現實!

Python利用深度學習進行文本摘要的綜合指南下篇(附教程)


我們使用深度學習完成的文本摘要結果如何呢?非常出色。因此,在本文中,我們將逐步介紹使用深度學習構建文本摘要器的過程,其中包含構建它所需的全部概念。然後將用Python實現我們的第一個文本摘要模型!

注意:本文要求對一些深度學習概念有基本的瞭解。 我建議閱讀以下文章。

  • A Must-Read Introduction to Sequence Modelling (with use cases)

https://www.analyticsvidhya.com/blog/2018/04/sequence-modelling-an-introduction-with-practical-use-cases/?

utm_source=blog&utm_medium=comprehensive-guide-text-summarization-using-deep-learning-python

  • Must-Read Tutorial to Learn Sequence Modeling (deeplearning.ai Course #5)

https://www.analyticsvidhya.com/blog/2019/01/sequence-models-deeplearning/?utm_source=blog&utm_medium=comprehensive-guide-text-summarization-using-deep-learning-python

  • Essentials of Deep Learning: Introduction to Long Short Term Memory

https://www.analyticsvidhya.com/blog/2017/12/fundamentals-of-deep-learning-introduction-to-lstm/?utm_source=blog&utm_medium=comprehensive-guide-text-summarization-using-deep-learning-python

目錄

1. NLP中的文本摘要是什麼?

2. 序列到序列(Seq2Seq)建模簡介

3. 理解編碼器(Encoder)-解碼器(Decoder)架構

4. 編碼器-解碼器結構的侷限性

5. 注意力機制背後的直覺

6. 理解問題陳述

7. 使用Keras在Python中實現文本摘要模型

8. 注意力機制如何運作?

我在本文的最後面保留了“注意力機制如何運作?”的部分。這是一個數學密集的部分,並不強制瞭解Python代碼的工作原理。但是,我鼓勵你通讀它,因為它會讓你對這個NLP概念有一個堅實的理解。

注:此篇包含內容7-8,1-6內容請見:Python利用深度學習進行文本摘要的綜合指南上篇(附教程)

7. 使用Keras在Python中實現文本摘要

現在是時候開啟我們的Jupyter notebook了!讓我們馬上深入瞭解實施細節。

自定義注意力層

Keras官方沒有正式支持注意力層。 因此,我們要麼實現自己的注意力層,要麼使用第三方實現。在本文中我們採用後者。

1. from attention import AttentionLayer

導入庫

1. import numpy as np 
2. import pandas as pd
3. import re
4. from bs4 import BeautifulSoup
5. from keras.preprocessing.text import Tokenizer
6. from keras.preprocessing.sequence import pad_sequences
7. from nltk.corpus import stopwords
8. from tensorflow.keras.layers import Input, LSTM, Embedding, Dense, Concatenate, TimeDistributed, Bidirectional
9. from tensorflow.keras.models import Model
10. from tensorflow.keras.callbacks import EarlyStopping
11. import warnings
12. pd.set_option("display.max_colwidth", 200)
13. warnings.filterwarnings("ignore")

讀取數據集

該數據集包括亞馬遜美食的評論。 這些數據涵蓋了超過10年的時間,截至2012年10月的所有約500,000條評論。這些評論包括產品和用戶信息,評級,純文本評論和摘要。它還包括來自所有其他亞馬遜類別的評論。

我們將抽樣出100,000個評論,以縮短模型的訓練時間。如果你的機器具有強大的計算能力,也可以使用整個數據集來訓練模型。

1. data=pd.read_csv("../input/amazon-fine-food-reviews/Reviews.csv",nrows=100000) 

刪除重複項和NA值

1. data.drop_duplicates(subset=['Text'],inplace=True) #dropping duplicates 
2. data.dropna(axis=0,inplace=True) #dropping na

預處理

在我們進入模型構建部分之前,執行基本的預處理步驟非常重要。使用髒亂和未清理的文本數據是一個潛在的災難性舉措。因此,在此步驟中,我們將從文本中刪除不影響問題目標的所有不需要的符號,字符等。

這是我們用於擴展縮略形式的字典:

1. contraction_mapping = {"ain't": "is not", "aren't": "are not","can't": "cannot", "'cause": "because", "could've": "could have", "couldn't": "could not", 
2.
3. "didn't": "did not", "doesn't": "does not", "don't": "do not", "hadn't": "had not", "hasn't": "has not", "haven't": "have not",
4.
5. "he'd": "he would","he'll": "he will", "he's": "he is", "how'd": "how did", "how'd'y": "how do you", "how'll": "how will", "how's": "how is",
6.
7. "I'd": "I would", "I'd've": "I would have", "I'll": "I will", "I'll've": "I will have","I'm": "I am", "I've": "I have", "i'd": "i would",
8.
9. "i'd've": "i would have", "i'll": "i will", "i'll've": "i will have","i'm": "i am", "i've": "i have", "isn't": "is not", "it'd": "it would",
10.
11. "it'd've": "it would have", "it'll": "it will", "it'll've": "it will have","it's": "it is", "let's": "let us", "ma'am": "madam",
12.
13. "mayn't": "may not", "might've": "might have","mightn't": "might not","mightn't've": "might not have", "must've": "must have",
14.
15. "mustn't": "must not", "mustn't've": "must not have", "needn't": "need not", "needn't've": "need not have","o'clock": "of the clock",
16.
17. "oughtn't": "ought not", "oughtn't've": "ought not have", "shan't": "shall not", "sha'n't": "shall not", "shan't've": "shall not have",
18.
19. "she'd": "she would", "she'd've": "she would have", "she'll": "she will", "she'll've": "she will have", "she's": "she is",
20.
21. "should've": "should have", "shouldn't": "should not", "shouldn't've": "should not have", "so've": "so have","so's": "so as",
22.
23. "this's": "this is","that'd": "that would", "that'd've": "that would have", "that's": "that is", "there'd": "there would",
24.
25. "there'd've": "there would have", "there's": "there is", "here's": "here is","they'd": "they would", "they'd've": "they would have",
26.
27. "they'll": "they will", "they'll've": "they will have", "they're": "they are", "they've": "they have", "to've": "to have",
28.
29. "wasn't": "was not", "we'd": "we would", "we'd've": "we would have", "we'll": "we will", "we'll've": "we will have", "we're": "we are",
30.
31. "we've": "we have", "weren't": "were not", "what'll": "what will", "what'll've": "what will have", "what're": "what are",
32.
33. "what's": "what is", "what've": "what have", "when's": "when is", "when've": "when have", "where'd": "where did", "where's": "where is",
34.
35. "where've": "where have", "who'll": "who will", "who'll've": "who will have", "who's": "who is", "who've": "who have",
36.
37. "why's": "why is", "why've": "why have", "will've": "will have", "won't": "will not", "won't've": "will not have",
38.
39. "would've": "would have", "wouldn't": "would not", "wouldn't've": "would not have", "y'all": "you all",
40.
41. "y'all'd": "you all would","y'all'd've": "you all would have","y'all're": "you all are","y'all've": "you all have",
42.
43. "you'd": "you would", "you'd've": "you would have", "you'll": "you will", "you'll've": "you will have",
44.
45. "you're": "you are", "you've": "you have"}

我們需要定義兩個不同的函數來預處理評論並生成摘要,因為文本和摘要中涉及的預處理步驟略有不同。

a)文字清理

讓我們看一下數據集中的前10個評論,以瞭解該如何進行文本預處理步驟:

1. data['Text'][:10] 

輸出:

Python利用深度學習進行文本摘要的綜合指南下篇(附教程)


我們將為我們的數據執行以下預處理任務:

  • 將所有內容轉換為小寫
  • 刪除HTML標籤
  • 縮略形式映射
  • 刪除('s)
  • 刪除括號內的任何文本()
  • 消除標點符號和特殊字符
  • 刪除停用詞
  • 刪除簡短的單詞

讓我們定義一下這個函數:

1. stop_words = set(stopwords.words('english')) 
2. def text_cleaner(text):
3. newString = text.lower()
4. newString = BeautifulSoup(newString, "lxml").text
5. newString = re.sub(r'\\([^)]*\\)', '', newString)
6. newString = re.sub('"','', newString)
7. newString = ' '.join([contraction_mapping[t] if t in contraction_mapping else t for t in newString.split(" ")])
8. newString = re.sub(r"'s\\b","",newString)
9. newString = re.sub("[^a-zA-Z]", " ", newString)
10. tokens = [w for w in newString.split() if not w in stop_words]
11. long_words=[]
12. for i in tokens:
13. if len(i)>=3: #removing short word
14. long_words.append(i)
15. return (" ".join(long_words)).strip()
16.
17. cleaned_text = []
18. for t in data['Text']:
19. cleaned_text.append(text_cleaner(t))

b)摘要清理

現在,我們將查看前10行評論,以瞭解摘要列的預處理步驟:

1. data['Summary'][:10] 

輸出:

Python利用深度學習進行文本摘要的綜合指南下篇(附教程)


定義此任務的函數:

1. def summary_cleaner(text): 
2. newString = re.sub('"','', text)
3. newString = ' '.join([contraction_mapping[t] if t in contraction_mapping else t for t in newString.split(" ")])
4. newString = re.sub(r"'s\\b","",newString)
5. newString = re.sub("[^a-zA-Z]", " ", newString)
6. newString = newString.lower()
7. tokens=newString.split()
8. newString=''
9. for i in tokens:
10. if len(i)>1:
11. newString=newString+i+' '
12. return newString
13.
14. #Call the above function
15. cleaned_summary = []
16. for t in data['Summary']:
17. cleaned_summary.append(summary_cleaner(t))
18.
19. data['cleaned_text']=cleaned_text
20. data['cleaned_summary']=cleaned_summary
21. data['cleaned_summary'].replace('', np.nan, inplace=True)
22. data.dropna(axis=0,inplace=True)

請記住在摘要的開頭和結尾添加START和END特殊標記:

1. data['cleaned_summary'] = data['cleaned_summary'].apply(lambda x : '_START_ '+ x + ' _END_') 

現在,我們來看看前5個評論及其摘要:

1. for i in range(5): 
2. print("Review:",data['cleaned_text'][i])
3. print("Summary:",data['cleaned_summary'][i])
4. print("\\n")

輸出:

Python利用深度學習進行文本摘要的綜合指南下篇(附教程)


瞭解序列的分佈

在這裡,我們將分析評論和摘要的長度,以全面瞭解文本長度的分佈。這將幫助我們確定序列的最大長度:

1. import matplotlib.pyplot as plt 
2. text_word_count = []
3. summary_word_count = []
4.
5. # populate the lists with sentence lengths
6. for i in data['cleaned_text']:
7. text_word_count.append(len(i.split()))
8.
9. for i in data['cleaned_summary']:
10. summary_word_count.append(len(i.split()))
11.
12. length_df = pd.DataFrame({'text':text_word_count, 'summary':summary_word_count})
13. length_df.hist(bins = 30)
14. plt.show()

輸出:

Python利用深度學習進行文本摘要的綜合指南下篇(附教程)


有趣。我們可以將評論的最大長度固定為80,因為這似乎是多數評論的長度。同樣,我們可以將最大摘要長度設置為10:

1. max_len_text=80 
2. max_len_summary=10

我們越來越接近模型的構建部分了。在此之前,我們需要將數據集拆分為訓練和驗證集。我們將使用90%的數據集作為訓練數據,並在其餘10%上評估(保留集)表現:

1. from sklearn.model_selection import train_test_split 
2. x_tr,x_val,y_tr,y_val=train_test_split(data['cleaned_text'],data['cleaned_summary'],test_size=0.1,random_state=0,shuffle=True)

準備分詞器(Tokenizer)

分詞器構建詞彙表並將單詞序列轉換為整數序列。繼續為文本和摘要構建分詞器:

  • a) 文本分詞器
b) #prepare a tokenizer for reviews on training data 
c) x_tokenizer = Tokenizer()
d) x_tokenizer.fit_on_texts(list(x_tr))
e)
f) #convert text sequences into integer sequences
g) x_tr = x_tokenizer.texts_to_sequences(x_tr)
h) x_val = x_tokenizer.texts_to_sequences(x_val)
i)
j) #padding zero upto maximum length
k) x_tr = pad_sequences(x_tr, maxlen=max_len_text, padding='post')
l) x_val = pad_sequences(x_val, maxlen=max_len_text, padding='post')
m)
n) x_voc_size = len(x_tokenizer.word_index) +1
  • b)摘要分詞器
1. #preparing a tokenizer for summary on training data 
2. y_tokenizer = Tokenizer()
3. y_tokenizer.fit_on_texts(list(y_tr))
4.
5. #convert summary sequences into integer sequences
6. y_tr = y_tokenizer.texts_to_sequences(y_tr)
7. y_val = y_tokenizer.texts_to_sequences(y_val)
8.
9. #padding zero upto maximum length
10. y_tr = pad_sequences(y_tr, maxlen=max_len_summary, padding='post')
11. y_val = pad_sequences(y_val, maxlen=max_len_summary, padding='post')
12.
13. y_voc_size = len(y_tokenizer.word_index) +1

模型構建

終於來到了模型構建的部分。但在構建之前,我們需要熟悉所需的一些術語。

  • Return Sequences = True:當return sequences參數設置為True時,LSTM為每個時間步生成隱藏狀態和單元狀態
  • Return State = True:當return state = True時,LSTM僅生成最後一個時間步的隱藏狀態和單元狀態
  • Initial State:用於在第一個時間步初始化LSTM的內部狀態
  • Stacked LSTM:Stacked LSTM具有多層LSTM堆疊在彼此之上。這能產生更好地序列表示。我鼓勵你嘗試將LSTM的多個層堆疊在一起(這是一個很好的學習方法)

在這裡,我們為編碼器構建一個3層堆疊LSTM:

1. from keras import backend as K 
2. K.clear_session()
3. latent_dim = 500
4.
5. # Encoder
6. encoder_inputs = Input(shape=(max_len_text,))
7. enc_emb = Embedding(x_voc_size, latent_dim,trainable=True)(encoder_inputs)
8.
9. #LSTM 1
10. encoder_lstm1 = LSTM(latent_dim,return_sequences=True,return_state=True)
11. encoder_output1, state_h1, state_c1 = encoder_lstm1(enc_emb)
12.
13. #LSTM 2
14. encoder_lstm2 = LSTM(latent_dim,return_sequences=True,return_state=True)
15. encoder_output2, state_h2, state_c2 = encoder_lstm2(encoder_output1)
16.
17. #LSTM 3
18. encoder_lstm3=LSTM(latent_dim, return_state=True, return_sequences=True)
19. encoder_outputs, state_h, state_c= encoder_lstm3(encoder_output2)
20.
21. # Set up the decoder.
22. decoder_inputs = Input(shape=(None,))
23. dec_emb_layer = Embedding(y_voc_size, latent_dim,trainable=True)
24. dec_emb = dec_emb_layer(decoder_inputs)
25.
26. #LSTM using encoder_states as initial state
27. decoder_lstm = LSTM(latent_dim, return_sequences=True, return_state=True)
28. decoder_outputs,decoder_fwd_state, decoder_back_state = decoder_lstm(dec_emb,initial_state=[state_h, state_c])
29.
30. #Attention Layer
31. Attention layer attn_layer = AttentionLayer(name='attention_layer')
32. attn_out, attn_states = attn_layer([encoder_outputs, decoder_outputs])
33.
34. # Concat attention output and decoder LSTM output
35. decoder_concat_input = Concatenate(axis=-1, name='concat_layer')([decoder_outputs, attn_out])
36.
37. #Dense layer
38. decoder_dense = TimeDistributed(Dense(y_voc_size, activation='softmax'))
39. decoder_outputs = decoder_dense(decoder_concat_input)
40.
41. # Define the model
42. model = Model([encoder_inputs, decoder_inputs], decoder_outputs)
43. model.summary()

輸出:

Python利用深度學習進行文本摘要的綜合指南下篇(附教程)


我使用sparse categorical cross-entropy作為損失函數,因為它在運行中將整數序列轉換為獨熱(one-hot)向量。這克服了任何內存問題。

1. model.compile(optimizer='rmsprop', loss='sparse_categorical_crossentropy')

還記得early stopping的概念嗎?它用於通過監視用戶指定的度量標準,在適當的時間停止訓練神經網絡。在這裡,我監視驗證集損失(val_loss)。一旦驗證集損失反彈,我們的模型就會停止訓練:

1. es = EarlyStopping(monitor='val_loss', mode='min', verbose=1)

我們將在批量大小為512的情況下訓練模型,並在保留集(我們數據集的10%)上驗證它:

1. history=model.fit([x_tr,y_tr[:,:-1]], y_tr.reshape(y_tr.shape[0],y_tr.shape[1], 1)[:,1:] ,epochs=50,callbacks=[es],batch_size=512, validation_data=([x_val,y_val[:,:-1]], y_val.reshape(y_val.shape[0],y_val.shape[1], 1)[:,1:]))

瞭解診斷圖

現在,我們將繪製一些診斷圖來了解模型隨時間的變化情況:

1. from matplotlib import pyplot 
2. pyplot.plot(history.history['loss'], label='train')
3. pyplot.plot(history.history['val_loss'], label='test')
4. pyplot.legend() pyplot.show()

輸出:

Python利用深度學習進行文本摘要的綜合指南下篇(附教程)


我們可以推斷,在第10個週期(epoch)之後,驗證集損失略有增加。因此,我們將在此之後停止訓練模型。

接下來,讓我們構建字典,將目標和源詞彙表中的索引轉換為單詞:

1. reverse_target_word_index=y_tokenizer.index_word 
2. reverse_source_word_index=x_tokenizer.index_word
3. target_word_index=y_tokenizer.word_index

推理

設置編碼器和解碼器的推理:

1. # encoder inference 
2. encoder_model = Model(inputs=encoder_inputs,outputs=[encoder_outputs, state_h, state_c])
3.
4. # decoder inference
5. # Below tensors will hold the states of the previous time step
6. decoder_state_input_h = Input(shape=(latent_dim,))
7. decoder_state_input_c = Input(shape=(latent_dim,))
8. decoder_hidden_state_input = Input(shape=(max_len_text,latent_dim))
9.
10. # Get the embeddings of the decoder sequence
11. dec_emb2= dec_emb_layer(decoder_inputs)
12.
13. # To predict the next word in the sequence, set the initial states to the states from the previous time step
14. decoder_outputs2, state_h2, state_c2 = decoder_lstm(dec_emb2, initial_state=[decoder_state_input_h, decoder_state_input_c])
15.
16. #attention inference
17. attn_out_inf, attn_states_inf = attn_layer([decoder_hidden_state_input, decoder_outputs2])
18. decoder_inf_concat = Concatenate(axis=-1, name='concat')([decoder_outputs2, attn_out_inf])
19.
20. # A dense softmax layer to generate prob dist. over the target vocabulary
21. decoder_outputs2 = decoder_dense(decoder_inf_concat)
22.
23. # Final decoder model
24. decoder_model = Model(
25. [decoder_inputs] + [decoder_hidden_state_input,decoder_state_input_h, decoder_state_input_c],
26. [decoder_outputs2] + [state_h2, state_c2])

下面我們定義了一個函數,是推理過程的實現(我們在上一節中介紹過):

1. def decode_sequence(input_seq): 
2. # Encode the input as state vectors.
3. e_out, e_h, e_c = encoder_model.predict(input_seq)
4.
5. # Generate empty target sequence of length 1.
6. target_seq = np.zeros((1,1))
7.
8. # Chose the 'start' word as the first word of the target sequence
9. target_seq[0, 0] = target_word_index['start']
10.
11. stop_condition = False
12. decoded_sentence = ''
13. while not stop_condition:
14. output_tokens, h, c = decoder_model.predict([target_seq] + [e_out, e_h, e_c])
15.
16. # Sample a token
17. sampled_token_index = np.argmax(output_tokens[0, -1, :])
18. sampled_token = reverse_target_word_index[sampled_token_index]
19.
20. if(sampled_token!='end'):
21. decoded_sentence += ' '+sampled_token
22.
23. # Exit condition: either hit max length or find stop word.
24. if (sampled_token == 'end' or len(decoded_sentence.split()) >= (max_len_summary-1)):
25. stop_condition = True
26.
27. # Update the target sequence (of length 1).
28. target_seq = np.zeros((1,1))
29. target_seq[0, 0] = sampled_token_index
30.
31. # Update internal states
32. e_h, e_c = h, c
33.
34. return decoded_sentence

我們來定義函數,用於將摘要和評論中的整數序列轉換為單詞序列:

 1. def seq2summary(input_seq): 
2. newString=''
3. for i in input_seq:
4. if((i!=0 and i!=target_word_index['start']) and i!=target_word_index['end']):
5. newString=newString+reverse_target_word_index[i]+' '
6. return newString
7.
8. def seq2text(input_seq):
9. newString=''
10. for i in input_seq:
11. if(i!=0):
12. newString=newString+reverse_source_word_index[i]+' '
13. return newString
1. for i in range(len(x_val)):
2. print("Review:",seq2text(x_val[i]))
3. print("Original summary:",seq2summary(y_val[i]))
4. print("Predicted summary:",decode_sequence(x_val[i].reshape(1,max_len_text)))
5. print("\\n")

以下是該模型生成的一些摘要:

Python利用深度學習進行文本摘要的綜合指南下篇(附教程)

Python利用深度學習進行文本摘要的綜合指南下篇(附教程)

Python利用深度學習進行文本摘要的綜合指南下篇(附教程)

Python利用深度學習進行文本摘要的綜合指南下篇(附教程)

Python利用深度學習進行文本摘要的綜合指南下篇(附教程)


這真的很酷。雖然模型生成的摘要與實際摘要並不完全一致,但兩者傳達的含義是相同的。我們的模型能夠根據文本的上下文生成通順的摘要。

以上就是我們如何使用Python中的深度學習概念執行文本摘要。

我們如何進一步提高模型的性能?

你的學習並不止於此!你可以做更多的事情來嘗試模型:

  • 我建議你增大訓練數據集的規模後再構建模型。訓練數據越多,深度學習模型的泛化能力通常越強
  • 嘗試實現雙向LSTM,它能夠從兩個方向捕獲上下文,並產生更好的上下文向量
  • 使用集束搜索策略(beam search strategy)解碼測試序列,而不是使用貪婪方法(argmax)(本列表之後附有一個簡化的示意實現)
  • 根據BLEU分數評估模型的性能
  • 實現pointer-generator網絡和覆蓋機制(coverage mechanism)
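
關於集束搜索,下面給出一個簡化的示意實現(並非原文代碼;它沿用上文推理部分的encoder_model、decoder_model等接口,beam_width、對數概率平滑項等細節只是示意性的選擇):

def beam_search_decode(input_seq, beam_width=3):
    e_out, e_h, e_c = encoder_model.predict(input_seq)
    # 每個候選: (累計對數概率, 詞索引序列, h, c, 是否已生成end)
    beams = [(0.0, [target_word_index['start']], e_h, e_c, False)]
    for _ in range(max_len_summary - 1):
        candidates = []
        for score, tokens, h, c, done in beams:
            if done:                              # 已結束的候選直接保留
                candidates.append((score, tokens, h, c, True))
                continue
            target_seq = np.zeros((1, 1))
            target_seq[0, 0] = tokens[-1]
            out, h2, c2 = decoder_model.predict([target_seq] + [e_out, h, c])
            probs = out[0, -1, :]
            for idx in np.argsort(probs)[-beam_width:]:   # 用概率最高的beam_width個詞擴展候選
                idx = int(idx)
                word = reverse_target_word_index.get(idx, '')
                candidates.append((score + np.log(probs[idx] + 1e-10),
                                   tokens + [idx], h2, c2, word == 'end'))
        # 只保留總分最高的beam_width個候選
        beams = sorted(candidates, key=lambda x: x[0], reverse=True)[:beam_width]
        if all(b[4] for b in beams):
            break
    best_tokens = beams[0][1][1:]                 # 去掉開頭的start
    words = [reverse_target_word_index.get(i, '') for i in best_tokens]
    return ' '.join(w for w in words if w and w != 'end')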

8. 注意力機制如何運作?

現在,我們來談談注意力機制的內部運作原理。正如我在文章開頭提到的那樣,這是一個數學密集的部分,所以將其視為可選部分。不過我仍然強烈建議通讀來真正掌握注意力機制的運作方式。

編碼器會輸出源序列中每個時間步j的隱藏狀態h_j。

類似地,解碼器會輸出目標序列中每個時間步i的隱藏狀態s_i。

我們計算一個稱為對齊分數(alignment score)的量e_ij,源詞正是基於這個分數與目標詞對齊的。對齊分數由得分函數(score function)根據源隱藏狀態h_j和目標隱藏狀態s_i計算得到,公式如下:

e_ij = score(s_i, h_j)

其中e_ij表示目標時間步i和源時間步j的對齊分數。

根據所使用評分函數的類型,存在不同類型的注意力機制。我在下面提到了一些流行的注意力機制:
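
例如,Luong等人使用的點積(dot)和雙線性(general)評分函數,以及Bahdanau等人使用的加性(additive/concat)評分函數,都是常見的選擇。下面用一個最小的numpy草圖(示意性代碼,以最簡單的點積評分為例)展示對齊分數、注意力權重和上下文向量是如何一步步算出來的:

import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

h = np.random.rand(80, 500)    # 編碼器各時間步的隱藏狀態h_j(源序列長80,維度500)
s_i = np.random.rand(500)      # 解碼器在目標時間步i的隱藏狀態s_i

e_i = h @ s_i                  # 點積評分: e_ij = s_i · h_j,形狀(80,)
alpha_i = softmax(e_i)         # 注意力權重: 對所有e_ij做softmax
context_i = alpha_i @ h        # 上下文向量: 編碼器隱藏狀態的加權和,形狀(500,)
print(context_i.shape)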

"


Python利用深度學習進行文本摘要的綜合指南下篇(附教程)

作者:ARAVIND PAI

翻譯:和中華

校對:申利彬

本文約7500字,建議閱讀15分鐘。

本文介紹瞭如何利用seq2seq來建立一個文本摘要模型,以及其中的注意力機制。並利用Keras搭建編寫了一個完整的模型代碼。

介紹

“我不想要完整的報告,只需給我一個結果摘要”。我發現自己經常處於這種狀況——無論是在大學還是在職場中。我們準備了一份綜合全面的報告,但教師/主管卻僅僅有時間閱讀摘要。

聽起來很熟悉?好吧,我決定對此採取一些措施。手動將報告轉換為摘要太耗費時間了,對吧?那我可以依靠自然語言處理(NLP)技術來幫忙嗎?

自然語言處理(NLP)

https://courses.analyticsvidhya.com/courses/natural-language-processing-nlp?utm_source=blog&utm_medium=comprehensive-guide-text-summarization-using-deep-learning-python

這就是使用深度學習的文本摘要真正幫助我的地方。它解決了以前一直困擾著我的問題——現在我們的模型可以理解整個文本的上下文。對於所有需要把文檔快速摘要的人來說,這個夢想已成現實!

Python利用深度學習進行文本摘要的綜合指南下篇(附教程)


我們使用深度學習完成的文本摘要結果如何呢?非常出色。因此,在本文中,我們將逐步介紹使用深度學習構建文本摘要器的過程,其中包含構建它所需的全部概念。然後將用Python實現我們的第一個文本摘要模型!

注意:本文要求對一些深度學習概念有基本的瞭解。 我建議閱讀以下文章。

  • A Must-Read Introduction to Sequence Modelling (with use cases)

https://www.analyticsvidhya.com/blog/2018/04/sequence-modelling-an-introduction-with-practical-use-cases/?

utm_source=blog&utm_medium=comprehensive-guide-text-summarization-using-deep-learning-python

  • Must-Read Tutorial to Learn Sequence Modeling (deeplearning.ai Course #5)

https://www.analyticsvidhya.com/blog/2019/01/sequence-models-deeplearning/?utm_source=blog&utm_medium=comprehensive-guide-text-summarization-using-deep-learning-python

  • Essentials of Deep Learning: Introduction to Long Short Term Memory

https://www.analyticsvidhya.com/blog/2017/12/fundamentals-of-deep-learning-introduction-to-lstm/?utm_source=blog&utm_medium=comprehensive-guide-text-summarization-using-deep-learning-python

目錄

1. NLP中的文本摘要是什麼?

2. 序列到序列(Seq2Seq)建模簡介

3. 理解編碼器(Encoder)-解碼器(Decoder)架構

4. 編碼器-解碼器結構的侷限性

5. 注意力機制背後的直覺

6. 理解問題陳述

7. 使用Keras在Python中實現文本摘要模型

8. 注意力機制如何運作?

我在本文的最後面保留了“注意力機制如何運作?”的部分。這是一個數學密集的部分,並不強制瞭解Python代碼的工作原理。但是,我鼓勵你通讀它,因為它會讓你對這個NLP概念有一個堅實的理解。

注:此篇包含內容7-8,1-6內容請見:Python利用深度學習進行文本摘要的綜合指南上篇(附教程)

7. 使用Keras在Python中實現文本摘要

現在是時候開啟我們的Jupyter notebook了!讓我們馬上深入瞭解實施細節。

自定義注意力層

Keras官方沒有正式支持注意力層。 因此,我們要麼實現自己的注意力層,要麼使用第三方實現。在本文中我們採用後者。

1. from attention import AttentionLayer

導入庫

1. import numpy as np 
2. import pandas as pd
3. import re
4. from bs4 import BeautifulSoup
5. from keras.preprocessing.text import Tokenizer
6. from keras.preprocessing.sequence import pad_sequences
7. from nltk.corpus import stopwords
8. from tensorflow.keras.layers import Input, LSTM, Embedding, Dense, Concatenate, TimeDistributed, Bidirectional
9. from tensorflow.keras.models import Model
10. from tensorflow.keras.callbacks import EarlyStopping
11. import warnings
12. pd.set_option("display.max_colwidth", 200)
13. warnings.filterwarnings("ignore")

讀取數據集

該數據集包括亞馬遜美食的評論。 這些數據涵蓋了超過10年的時間,截至2012年10月的所有約500,000條評論。這些評論包括產品和用戶信息,評級,純文本評論和摘要。它還包括來自所有其他亞馬遜類別的評論。

我們將抽樣出100,000個評論,以縮短模型的訓練時間。如果你的機器具有強大的計算能力,也可以使用整個數據集來訓練模型。

1. data=pd.read_csv("../input/amazon-fine-food-reviews/Reviews.csv",nrows=100000) 

刪除重複項和NA值

1. data.drop_duplicates(subset=['Text'],inplace=True) #dropping duplicates 
2. data.dropna(axis=0,inplace=True) #dropping na

預處理

在我們進入模型構建部分之前,執行基本的預處理步驟非常重要。使用髒亂和未清理的文本數據是一個潛在的災難性舉措。因此,在此步驟中,我們將從文本中刪除不影響問題目標的所有不需要的符號,字符等。

這是我們用於擴展縮略形式的字典:

1. contraction_mapping = {"ain't": "is not", "aren't": "are not","can't": "cannot", "'cause": "because", "could've": "could have", "couldn't": "could not", 
2.
3. "didn't": "did not", "doesn't": "does not", "don't": "do not", "hadn't": "had not", "hasn't": "has not", "haven't": "have not",
4.
5. "he'd": "he would","he'll": "he will", "he's": "he is", "how'd": "how did", "how'd'y": "how do you", "how'll": "how will", "how's": "how is",
6.
7. "I'd": "I would", "I'd've": "I would have", "I'll": "I will", "I'll've": "I will have","I'm": "I am", "I've": "I have", "i'd": "i would",
8.
9. "i'd've": "i would have", "i'll": "i will", "i'll've": "i will have","i'm": "i am", "i've": "i have", "isn't": "is not", "it'd": "it would",
10.
11. "it'd've": "it would have", "it'll": "it will", "it'll've": "it will have","it's": "it is", "let's": "let us", "ma'am": "madam",
12.
13. "mayn't": "may not", "might've": "might have","mightn't": "might not","mightn't've": "might not have", "must've": "must have",
14.
15. "mustn't": "must not", "mustn't've": "must not have", "needn't": "need not", "needn't've": "need not have","o'clock": "of the clock",
16.
17. "oughtn't": "ought not", "oughtn't've": "ought not have", "shan't": "shall not", "sha'n't": "shall not", "shan't've": "shall not have",
18.
19. "she'd": "she would", "she'd've": "she would have", "she'll": "she will", "she'll've": "she will have", "she's": "she is",
20.
21. "should've": "should have", "shouldn't": "should not", "shouldn't've": "should not have", "so've": "so have","so's": "so as",
22.
23. "this's": "this is","that'd": "that would", "that'd've": "that would have", "that's": "that is", "there'd": "there would",
24.
25. "there'd've": "there would have", "there's": "there is", "here's": "here is","they'd": "they would", "they'd've": "they would have",
26.
27. "they'll": "they will", "they'll've": "they will have", "they're": "they are", "they've": "they have", "to've": "to have",
28.
29. "wasn't": "was not", "we'd": "we would", "we'd've": "we would have", "we'll": "we will", "we'll've": "we will have", "we're": "we are",
30.
31. "we've": "we have", "weren't": "were not", "what'll": "what will", "what'll've": "what will have", "what're": "what are",
32.
33. "what's": "what is", "what've": "what have", "when's": "when is", "when've": "when have", "where'd": "where did", "where's": "where is",
34.
35. "where've": "where have", "who'll": "who will", "who'll've": "who will have", "who's": "who is", "who've": "who have",
36.
37. "why's": "why is", "why've": "why have", "will've": "will have", "won't": "will not", "won't've": "will not have",
38.
39. "would've": "would have", "wouldn't": "would not", "wouldn't've": "would not have", "y'all": "you all",
40.
41. "y'all'd": "you all would","y'all'd've": "you all would have","y'all're": "you all are","y'all've": "you all have",
42.
43. "you'd": "you would", "you'd've": "you would have", "you'll": "you will", "you'll've": "you will have",
44.
45. "you're": "you are", "you've": "you have"}

我們需要定義兩個不同的函數來預處理評論並生成摘要,因為文本和摘要中涉及的預處理步驟略有不同。

a)文字清理

讓我們看一下數據集中的前10個評論,以瞭解該如何進行文本預處理步驟:

1. data['Text'][:10] 

輸出:

Python利用深度學習進行文本摘要的綜合指南下篇(附教程)


我們將為我們的數據執行以下預處理任務:

  • 將所有內容轉換為小寫
  • 刪除HTML標籤
  • 縮略形式映射
  • 刪除('s)
  • 刪除括號內的任何文本()
  • 消除標點符號和特殊字符
  • 刪除停用詞
  • 刪除簡短的單詞

讓我們定義一下這個函數:

1. stop_words = set(stopwords.words('english')) 
2. def text_cleaner(text):
3. newString = text.lower()
4. newString = BeautifulSoup(newString, "lxml").text
5. newString = re.sub(r'\\([^)]*\\)', '', newString)
6. newString = re.sub('"','', newString)
7. newString = ' '.join([contraction_mapping[t] if t in contraction_mapping else t for t in newString.split(" ")])
8. newString = re.sub(r"'s\\b","",newString)
9. newString = re.sub("[^a-zA-Z]", " ", newString)
10. tokens = [w for w in newString.split() if not w in stop_words]
11. long_words=[]
12. for i in tokens:
13. if len(i)>=3: #removing short word
14. long_words.append(i)
15. return (" ".join(long_words)).strip()
16.
17. cleaned_text = []
18. for t in data['Text']:
19. cleaned_text.append(text_cleaner(t))

b)摘要清理

現在,我們將查看前10行評論,以瞭解摘要列的預處理步驟:

1. data['Summary'][:10] 

輸出:

Python利用深度學習進行文本摘要的綜合指南下篇(附教程)


定義此任務的函數:

1. def summary_cleaner(text): 
2. newString = re.sub('"','', text)
3. newString = ' '.join([contraction_mapping[t] if t in contraction_mapping else t for t in newString.split(" ")])
4. newString = re.sub(r"'s\\b","",newString)
5. newString = re.sub("[^a-zA-Z]", " ", newString)
6. newString = newString.lower()
7. tokens=newString.split()
8. newString=''
9. for i in tokens:
10. if len(i)>1:
11. newString=newString+i+' '
12. return newString
13.
14. #Call the above function
15. cleaned_summary = []
16. for t in data['Summary']:
17. cleaned_summary.append(summary_cleaner(t))
18.
19. data['cleaned_text']=cleaned_text
20. data['cleaned_summary']=cleaned_summary
21. data['cleaned_summary'].replace('', np.nan, inplace=True)
22. data.dropna(axis=0,inplace=True)

請記住在摘要的開頭和結尾添加START和END特殊標記:

1. data['cleaned_summary'] = data['cleaned_summary'].apply(lambda x : '_START_ '+ x + ' _END_') 

現在,我們來看看前5個評論及其摘要:

1. for i in range(5): 
2. print("Review:",data['cleaned_text'][i])
3. print("Summary:",data['cleaned_summary'][i])
4. print("\\n")

輸出:

Python利用深度學習進行文本摘要的綜合指南下篇(附教程)


瞭解序列的分佈

在這裡,我們將分析評論和摘要的長度,以全面瞭解文本長度的分佈。這將幫助我們確定序列的最大長度:

1. import matplotlib.pyplot as plt 
2. text_word_count = []
3. summary_word_count = []
4.
5. # populate the lists with sentence lengths
6. for i in data['cleaned_text']:
7. text_word_count.append(len(i.split()))
8.
9. for i in data['cleaned_summary']:
10. summary_word_count.append(len(i.split()))
11.
12. length_df = pd.DataFrame({'text':text_word_count, 'summary':summary_word_count})
13. length_df.hist(bins = 30)
14. plt.show()

輸出:

Python利用深度學習進行文本摘要的綜合指南下篇(附教程)


有趣。我們可以將評論的最大長度固定為80,因為這似乎是多數評論的長度。同樣,我們可以將最大摘要長度設置為10:

1. max_len_text=80 
2. max_len_summary=10

我們越來越接近模型的構建部分了。在此之前,我們需要將數據集拆分為訓練和驗證集。我們將使用90%的數據集作為訓練數據,並在其餘10%上評估(保留集)表現:

1. from sklearn.model_selection import train_test_split 
2. x_tr,x_val,y_tr,y_val=train_test_split(data['cleaned_text'],data['cleaned_summary'],test_size=0.1,random_state=0,shuffle=True)

準備分詞器(Tokenizer)

分詞器構建詞彙表並將單詞序列轉換為整數序列。繼續為文本和摘要構建分詞器:

  • a) 文本分詞器
b) #prepare a tokenizer for reviews on training data 
c) x_tokenizer = Tokenizer()
d) x_tokenizer.fit_on_texts(list(x_tr))
e)
f) #convert text sequences into integer sequences
g) x_tr = x_tokenizer.texts_to_sequences(x_tr)
h) x_val = x_tokenizer.texts_to_sequences(x_val)
i)
j) #padding zero upto maximum length
k) x_tr = pad_sequences(x_tr, maxlen=max_len_text, padding='post')
l) x_val = pad_sequences(x_val, maxlen=max_len_text, padding='post')
m)
n) x_voc_size = len(x_tokenizer.word_index) +1
  • b)摘要分詞器
1. #preparing a tokenizer for summary on training data 
2. y_tokenizer = Tokenizer()
3. y_tokenizer.fit_on_texts(list(y_tr))
4.
5. #convert summary sequences into integer sequences
6. y_tr = y_tokenizer.texts_to_sequences(y_tr)
7. y_val = y_tokenizer.texts_to_sequences(y_val)
8.
9. #padding zero upto maximum length
10. y_tr = pad_sequences(y_tr, maxlen=max_len_summary, padding='post')
11. y_val = pad_sequences(y_val, maxlen=max_len_summary, padding='post')
12.
13. y_voc_size = len(y_tokenizer.word_index) +1

模型構建

終於來到了模型構建的部分。但在構建之前,我們需要熟悉所需的一些術語。

  • Return Sequences = True:當return sequences參數設置為True時,LSTM為每個時間步生成隱藏狀態和單元狀態
  • Return State = True:當return state = True時,LSTM僅生成最後一個時間步的隱藏狀態和單元狀態
  • Initial State:用於在第一個時間步初始化LSTM的內部狀態
  • Stacked LSTM:Stacked LSTM具有多層LSTM堆疊在彼此之上。這能產生更好地序列表示。我鼓勵你嘗試將LSTM的多個層堆疊在一起(這是一個很好的學習方法)

在這裡,我們為編碼器構建一個3層堆疊LSTM:

from keras import backend as K
K.clear_session()
latent_dim = 500

# Encoder
encoder_inputs = Input(shape=(max_len_text,))
enc_emb = Embedding(x_voc_size, latent_dim, trainable=True)(encoder_inputs)

# LSTM 1
encoder_lstm1 = LSTM(latent_dim, return_sequences=True, return_state=True)
encoder_output1, state_h1, state_c1 = encoder_lstm1(enc_emb)

# LSTM 2
encoder_lstm2 = LSTM(latent_dim, return_sequences=True, return_state=True)
encoder_output2, state_h2, state_c2 = encoder_lstm2(encoder_output1)

# LSTM 3
encoder_lstm3 = LSTM(latent_dim, return_state=True, return_sequences=True)
encoder_outputs, state_h, state_c = encoder_lstm3(encoder_output2)

# Set up the decoder
decoder_inputs = Input(shape=(None,))
dec_emb_layer = Embedding(y_voc_size, latent_dim, trainable=True)
dec_emb = dec_emb_layer(decoder_inputs)

# Decoder LSTM using the encoder states as the initial state
decoder_lstm = LSTM(latent_dim, return_sequences=True, return_state=True)
decoder_outputs, decoder_fwd_state, decoder_back_state = decoder_lstm(dec_emb, initial_state=[state_h, state_c])

# Attention layer
attn_layer = AttentionLayer(name='attention_layer')
attn_out, attn_states = attn_layer([encoder_outputs, decoder_outputs])

# Concat attention output and decoder LSTM output
decoder_concat_input = Concatenate(axis=-1, name='concat_layer')([decoder_outputs, attn_out])

# Dense layer
decoder_dense = TimeDistributed(Dense(y_voc_size, activation='softmax'))
decoder_outputs = decoder_dense(decoder_concat_input)

# Define the model
model = Model([encoder_inputs, decoder_inputs], decoder_outputs)
model.summary()

Output:

(Figure: model.summary() output)


I'm using sparse categorical cross-entropy as the loss function since it converts the integer sequences to one-hot vectors on the fly. This overcomes any memory issues.

model.compile(optimizer='rmsprop', loss='sparse_categorical_crossentropy')

Remember the concept of early stopping? It is used to stop training the neural network at the right time by monitoring a user-specified metric. Here, I'm monitoring the validation loss (val_loss). Our model will stop training once the validation loss increases:

es = EarlyStopping(monitor='val_loss', mode='min', verbose=1)
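
If you find that a noisy validation loss stops training too eagerly, a slightly more forgiving variant is common; the patience and restore_best_weights arguments below are optional additions on top of the callback used in this article, not part of the original setup:

from tensorflow.keras.callbacks import EarlyStopping

# wait 2 epochs after the last improvement before stopping,
# and roll the weights back to the best epoch seen so far
es = EarlyStopping(monitor='val_loss', mode='min', verbose=1,
                   patience=2, restore_best_weights=True)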

We'll train the model on a batch size of 512 and validate it on the holdout set (which is 10% of our dataset):

history = model.fit([x_tr, y_tr[:, :-1]], y_tr.reshape(y_tr.shape[0], y_tr.shape[1], 1)[:, 1:], epochs=50, callbacks=[es], batch_size=512, validation_data=([x_val, y_val[:, :-1]], y_val.reshape(y_val.shape[0], y_val.shape[1], 1)[:, 1:]))
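
The slicing in the fit call is what implements teacher forcing: the decoder is fed the summary without its last token, and is trained to predict the same summary shifted one step to the left. A toy illustration with a hypothetical padded summary sequence:

import numpy as np

seq = np.array([[4, 12, 7, 5, 0]])   # made-up ids for: start, good, dog, end, <pad>

decoder_input  = seq[:, :-1]         # [[4, 12, 7, 5]]  -> fed to the decoder
decoder_target = seq[:, 1:]          # [[12, 7, 5, 0]]  -> what the decoder must predict at each step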

Understanding the Diagnostic Plot

Now, we will plot a few diagnostic plots to understand the behaviour of the model over time:

from matplotlib import pyplot
pyplot.plot(history.history['loss'], label='train')
pyplot.plot(history.history['val_loss'], label='test')
pyplot.legend()
pyplot.show()

Output:

(Figure: training and validation loss curves)


We can infer that there is a slight increase in the validation loss after epoch 10. Therefore, we will stop training the model after that point.

Next, let's build the dictionaries to convert the indices of the target and source vocabularies back into words:

reverse_target_word_index = y_tokenizer.index_word
reverse_source_word_index = x_tokenizer.index_word
target_word_index = y_tokenizer.word_index

Inference

Set up the inference for the encoder and decoder:

# Encoder inference
encoder_model = Model(inputs=encoder_inputs, outputs=[encoder_outputs, state_h, state_c])

# Decoder inference
# Below tensors will hold the states of the previous time step
decoder_state_input_h = Input(shape=(latent_dim,))
decoder_state_input_c = Input(shape=(latent_dim,))
decoder_hidden_state_input = Input(shape=(max_len_text, latent_dim))

# Get the embeddings of the decoder sequence
dec_emb2 = dec_emb_layer(decoder_inputs)

# To predict the next word in the sequence, set the initial states to the states from the previous time step
decoder_outputs2, state_h2, state_c2 = decoder_lstm(dec_emb2, initial_state=[decoder_state_input_h, decoder_state_input_c])

# Attention inference
attn_out_inf, attn_states_inf = attn_layer([decoder_hidden_state_input, decoder_outputs2])
decoder_inf_concat = Concatenate(axis=-1, name='concat')([decoder_outputs2, attn_out_inf])

# A dense softmax layer to generate prob dist. over the target vocabulary
decoder_outputs2 = decoder_dense(decoder_inf_concat)

# Final decoder model
decoder_model = Model(
    [decoder_inputs] + [decoder_hidden_state_input, decoder_state_input_h, decoder_state_input_c],
    [decoder_outputs2] + [state_h2, state_c2])

Below, we define a function that implements the inference process (which we covered in the previous section):

def decode_sequence(input_seq):
    # Encode the input as state vectors.
    e_out, e_h, e_c = encoder_model.predict(input_seq)

    # Generate empty target sequence of length 1.
    target_seq = np.zeros((1, 1))

    # Choose the 'start' word as the first word of the target sequence
    target_seq[0, 0] = target_word_index['start']

    stop_condition = False
    decoded_sentence = ''
    while not stop_condition:
        output_tokens, h, c = decoder_model.predict([target_seq] + [e_out, e_h, e_c])

        # Sample a token
        sampled_token_index = np.argmax(output_tokens[0, -1, :])
        sampled_token = reverse_target_word_index[sampled_token_index]

        if sampled_token != 'end':
            decoded_sentence += ' ' + sampled_token

        # Exit condition: either hit max length or find the stop word.
        if sampled_token == 'end' or len(decoded_sentence.split()) >= (max_len_summary - 1):
            stop_condition = True

        # Update the target sequence (of length 1).
        target_seq = np.zeros((1, 1))
        target_seq[0, 0] = sampled_token_index

        # Update internal states
        e_h, e_c = h, c

    return decoded_sentence

Let's define the functions to convert an integer sequence into a word sequence for the summaries as well as the reviews:

def seq2summary(input_seq):
    newString = ''
    for i in input_seq:
        if i != 0 and i != target_word_index['start'] and i != target_word_index['end']:
            newString = newString + reverse_target_word_index[i] + ' '
    return newString

def seq2text(input_seq):
    newString = ''
    for i in input_seq:
        if i != 0:
            newString = newString + reverse_source_word_index[i] + ' '
    return newString

for i in range(len(x_val)):
    print("Review:", seq2text(x_val[i]))
    print("Original summary:", seq2summary(y_val[i]))
    print("Predicted summary:", decode_sequence(x_val[i].reshape(1, max_len_text)))
    print("\n")

Here are a few of the summaries generated by the model:

(Figures: sample reviews with their original and predicted summaries)


This is really cool stuff. Even though the summaries generated by our model do not exactly match the actual summaries, they convey the same meaning. Our model is able to generate legible summaries based on the context of the text.

This is how we can perform text summarization using deep learning concepts in Python.

How can we Improve the Model's Performance Even Further?

Your learning doesn't stop here! There's a lot more you can do to play around and experiment with the model:

  • I recommend you increase the size of the training dataset and build the model. The generalization capability of a deep learning model improves as the training dataset grows
  • Try implementing a bidirectional LSTM, which is capable of capturing context from both directions and results in a better context vector (see the sketch after this list)
  • Use the beam search strategy for decoding the test sequence instead of the greedy approach (argmax)
  • Evaluate the performance of your model based on the BLEU score
  • Implement pointer-generator networks and coverage mechanisms
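
As a starting point for the bidirectional suggestion, here is one possible sketch of swapping the final encoder layer for a bidirectional one. This is only an illustration, not the model trained above: the forward and backward states are concatenated, so the decoder LSTM has to be twice as wide to accept them as its initial state (the attention, concatenation and dense layers would then follow exactly as before):

from tensorflow.keras.layers import Bidirectional, Concatenate, LSTM

# Bidirectional final encoder layer (replaces encoder_lstm3 in the model above)
encoder_bi = Bidirectional(LSTM(latent_dim, return_sequences=True, return_state=True))
encoder_outputs, fwd_h, fwd_c, back_h, back_c = encoder_bi(encoder_output2)

# merge the forward and backward states so they can initialise the decoder
state_h = Concatenate()([fwd_h, back_h])
state_c = Concatenate()([fwd_c, back_c])

# the decoder must now be 2 * latent_dim wide to accept the concatenated states
decoder_lstm = LSTM(latent_dim * 2, return_sequences=True, return_state=True)
decoder_outputs, _, _ = decoder_lstm(dec_emb, initial_state=[state_h, state_c])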

8. How Does the Attention Mechanism Work?

Now, let's talk about the inner workings of the attention mechanism. As I mentioned at the start of the article, this is a math-heavy section, so consider it optional. I still highly recommend reading through it to truly grasp how the attention mechanism works.

The encoder outputs the hidden state (hj) for every time step j in the source sequence.

Similarly, the decoder outputs the hidden state (si) for every time step i in the target sequence.

We compute a score known as the alignment score (eij), based on which the source words are aligned with the target words. The alignment score is computed from the source hidden state hj and the target hidden state si using a score function, given by:

eij = score(si, hj)

where eij denotes the alignment score for target time step i and source time step j.

Different types of attention mechanisms exist depending on the type of score function used. I've mentioned a few popular ones below:

(Figure: table of popular attention score functions)
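
The original table cannot be recovered from this copy of the article, but as an aside, the two most commonly cited families of score functions are Bahdanau's additive score and Luong's multiplicative scores:

score(si, hj) = v_a^T · tanh(W_a·si + U_a·hj)   (Bahdanau / additive)
score(si, hj) = si^T · hj                        (Luong / dot)
score(si, hj) = si^T · W_a·hj                    (Luong / general)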


We normalize the alignment scores using the softmax function to obtain the attention weights (aij):

"


Python利用深度學習進行文本摘要的綜合指南下篇(附教程)

作者:ARAVIND PAI

翻譯:和中華

校對:申利彬

本文約7500字,建議閱讀15分鐘。

本文介紹瞭如何利用seq2seq來建立一個文本摘要模型,以及其中的注意力機制。並利用Keras搭建編寫了一個完整的模型代碼。

介紹

“我不想要完整的報告,只需給我一個結果摘要”。我發現自己經常處於這種狀況——無論是在大學還是在職場中。我們準備了一份綜合全面的報告,但教師/主管卻僅僅有時間閱讀摘要。

聽起來很熟悉?好吧,我決定對此採取一些措施。手動將報告轉換為摘要太耗費時間了,對吧?那我可以依靠自然語言處理(NLP)技術來幫忙嗎?

自然語言處理(NLP)

https://courses.analyticsvidhya.com/courses/natural-language-processing-nlp?utm_source=blog&utm_medium=comprehensive-guide-text-summarization-using-deep-learning-python

這就是使用深度學習的文本摘要真正幫助我的地方。它解決了以前一直困擾著我的問題——現在我們的模型可以理解整個文本的上下文。對於所有需要把文檔快速摘要的人來說,這個夢想已成現實!

Python利用深度學習進行文本摘要的綜合指南下篇(附教程)


我們使用深度學習完成的文本摘要結果如何呢?非常出色。因此,在本文中,我們將逐步介紹使用深度學習構建文本摘要器的過程,其中包含構建它所需的全部概念。然後將用Python實現我們的第一個文本摘要模型!

注意:本文要求對一些深度學習概念有基本的瞭解。 我建議閱讀以下文章。

  • A Must-Read Introduction to Sequence Modelling (with use cases)

https://www.analyticsvidhya.com/blog/2018/04/sequence-modelling-an-introduction-with-practical-use-cases/?

utm_source=blog&utm_medium=comprehensive-guide-text-summarization-using-deep-learning-python

  • Must-Read Tutorial to Learn Sequence Modeling (deeplearning.ai Course #5)

https://www.analyticsvidhya.com/blog/2019/01/sequence-models-deeplearning/?utm_source=blog&utm_medium=comprehensive-guide-text-summarization-using-deep-learning-python

  • Essentials of Deep Learning: Introduction to Long Short Term Memory

https://www.analyticsvidhya.com/blog/2017/12/fundamentals-of-deep-learning-introduction-to-lstm/?utm_source=blog&utm_medium=comprehensive-guide-text-summarization-using-deep-learning-python

目錄

1. NLP中的文本摘要是什麼?

2. 序列到序列(Seq2Seq)建模簡介

3. 理解編碼器(Encoder)-解碼器(Decoder)架構

4. 編碼器-解碼器結構的侷限性

5. 注意力機制背後的直覺

6. 理解問題陳述

7. 使用Keras在Python中實現文本摘要模型

8. 注意力機制如何運作?

我在本文的最後面保留了“注意力機制如何運作?”的部分。這是一個數學密集的部分,並不強制瞭解Python代碼的工作原理。但是,我鼓勵你通讀它,因為它會讓你對這個NLP概念有一個堅實的理解。

注:此篇包含內容7-8,1-6內容請見:Python利用深度學習進行文本摘要的綜合指南上篇(附教程)

7. 使用Keras在Python中實現文本摘要

現在是時候開啟我們的Jupyter notebook了!讓我們馬上深入瞭解實施細節。

自定義注意力層

Keras官方沒有正式支持注意力層。 因此,我們要麼實現自己的注意力層,要麼使用第三方實現。在本文中我們採用後者。

1. from attention import AttentionLayer

導入庫

1. import numpy as np 
2. import pandas as pd
3. import re
4. from bs4 import BeautifulSoup
5. from keras.preprocessing.text import Tokenizer
6. from keras.preprocessing.sequence import pad_sequences
7. from nltk.corpus import stopwords
8. from tensorflow.keras.layers import Input, LSTM, Embedding, Dense, Concatenate, TimeDistributed, Bidirectional
9. from tensorflow.keras.models import Model
10. from tensorflow.keras.callbacks import EarlyStopping
11. import warnings
12. pd.set_option("display.max_colwidth", 200)
13. warnings.filterwarnings("ignore")

讀取數據集

該數據集包括亞馬遜美食的評論。 這些數據涵蓋了超過10年的時間,截至2012年10月的所有約500,000條評論。這些評論包括產品和用戶信息,評級,純文本評論和摘要。它還包括來自所有其他亞馬遜類別的評論。

我們將抽樣出100,000個評論,以縮短模型的訓練時間。如果你的機器具有強大的計算能力,也可以使用整個數據集來訓練模型。

1. data=pd.read_csv("../input/amazon-fine-food-reviews/Reviews.csv",nrows=100000) 

刪除重複項和NA值

1. data.drop_duplicates(subset=['Text'],inplace=True) #dropping duplicates 
2. data.dropna(axis=0,inplace=True) #dropping na

預處理

在我們進入模型構建部分之前,執行基本的預處理步驟非常重要。使用髒亂和未清理的文本數據是一個潛在的災難性舉措。因此,在此步驟中,我們將從文本中刪除不影響問題目標的所有不需要的符號,字符等。

這是我們用於擴展縮略形式的字典:

1. contraction_mapping = {"ain't": "is not", "aren't": "are not","can't": "cannot", "'cause": "because", "could've": "could have", "couldn't": "could not", 
2.
3. "didn't": "did not", "doesn't": "does not", "don't": "do not", "hadn't": "had not", "hasn't": "has not", "haven't": "have not",
4.
5. "he'd": "he would","he'll": "he will", "he's": "he is", "how'd": "how did", "how'd'y": "how do you", "how'll": "how will", "how's": "how is",
6.
7. "I'd": "I would", "I'd've": "I would have", "I'll": "I will", "I'll've": "I will have","I'm": "I am", "I've": "I have", "i'd": "i would",
8.
9. "i'd've": "i would have", "i'll": "i will", "i'll've": "i will have","i'm": "i am", "i've": "i have", "isn't": "is not", "it'd": "it would",
10.
11. "it'd've": "it would have", "it'll": "it will", "it'll've": "it will have","it's": "it is", "let's": "let us", "ma'am": "madam",
12.
13. "mayn't": "may not", "might've": "might have","mightn't": "might not","mightn't've": "might not have", "must've": "must have",
14.
15. "mustn't": "must not", "mustn't've": "must not have", "needn't": "need not", "needn't've": "need not have","o'clock": "of the clock",
16.
17. "oughtn't": "ought not", "oughtn't've": "ought not have", "shan't": "shall not", "sha'n't": "shall not", "shan't've": "shall not have",
18.
19. "she'd": "she would", "she'd've": "she would have", "she'll": "she will", "she'll've": "she will have", "she's": "she is",
20.
21. "should've": "should have", "shouldn't": "should not", "shouldn't've": "should not have", "so've": "so have","so's": "so as",
22.
23. "this's": "this is","that'd": "that would", "that'd've": "that would have", "that's": "that is", "there'd": "there would",
24.
25. "there'd've": "there would have", "there's": "there is", "here's": "here is","they'd": "they would", "they'd've": "they would have",
26.
27. "they'll": "they will", "they'll've": "they will have", "they're": "they are", "they've": "they have", "to've": "to have",
28.
29. "wasn't": "was not", "we'd": "we would", "we'd've": "we would have", "we'll": "we will", "we'll've": "we will have", "we're": "we are",
30.
31. "we've": "we have", "weren't": "were not", "what'll": "what will", "what'll've": "what will have", "what're": "what are",
32.
33. "what's": "what is", "what've": "what have", "when's": "when is", "when've": "when have", "where'd": "where did", "where's": "where is",
34.
35. "where've": "where have", "who'll": "who will", "who'll've": "who will have", "who's": "who is", "who've": "who have",
36.
37. "why's": "why is", "why've": "why have", "will've": "will have", "won't": "will not", "won't've": "will not have",
38.
39. "would've": "would have", "wouldn't": "would not", "wouldn't've": "would not have", "y'all": "you all",
40.
41. "y'all'd": "you all would","y'all'd've": "you all would have","y'all're": "you all are","y'all've": "you all have",
42.
43. "you'd": "you would", "you'd've": "you would have", "you'll": "you will", "you'll've": "you will have",
44.
45. "you're": "you are", "you've": "you have"}

我們需要定義兩個不同的函數來預處理評論並生成摘要,因為文本和摘要中涉及的預處理步驟略有不同。

a)文字清理

讓我們看一下數據集中的前10個評論,以瞭解該如何進行文本預處理步驟:

1. data['Text'][:10] 

輸出:

Python利用深度學習進行文本摘要的綜合指南下篇(附教程)


我們將為我們的數據執行以下預處理任務:

  • 將所有內容轉換為小寫
  • 刪除HTML標籤
  • 縮略形式映射
  • 刪除('s)
  • 刪除括號內的任何文本()
  • 消除標點符號和特殊字符
  • 刪除停用詞
  • 刪除簡短的單詞

讓我們定義一下這個函數:

1. stop_words = set(stopwords.words('english')) 
2. def text_cleaner(text):
3. newString = text.lower()
4. newString = BeautifulSoup(newString, "lxml").text
5. newString = re.sub(r'\\([^)]*\\)', '', newString)
6. newString = re.sub('"','', newString)
7. newString = ' '.join([contraction_mapping[t] if t in contraction_mapping else t for t in newString.split(" ")])
8. newString = re.sub(r"'s\\b","",newString)
9. newString = re.sub("[^a-zA-Z]", " ", newString)
10. tokens = [w for w in newString.split() if not w in stop_words]
11. long_words=[]
12. for i in tokens:
13. if len(i)>=3: #removing short word
14. long_words.append(i)
15. return (" ".join(long_words)).strip()
16.
17. cleaned_text = []
18. for t in data['Text']:
19. cleaned_text.append(text_cleaner(t))

b)摘要清理

現在,我們將查看前10行評論,以瞭解摘要列的預處理步驟:

1. data['Summary'][:10] 

輸出:

Python利用深度學習進行文本摘要的綜合指南下篇(附教程)


定義此任務的函數:

1. def summary_cleaner(text): 
2. newString = re.sub('"','', text)
3. newString = ' '.join([contraction_mapping[t] if t in contraction_mapping else t for t in newString.split(" ")])
4. newString = re.sub(r"'s\\b","",newString)
5. newString = re.sub("[^a-zA-Z]", " ", newString)
6. newString = newString.lower()
7. tokens=newString.split()
8. newString=''
9. for i in tokens:
10. if len(i)>1:
11. newString=newString+i+' '
12. return newString
13.
14. #Call the above function
15. cleaned_summary = []
16. for t in data['Summary']:
17. cleaned_summary.append(summary_cleaner(t))
18.
19. data['cleaned_text']=cleaned_text
20. data['cleaned_summary']=cleaned_summary
21. data['cleaned_summary'].replace('', np.nan, inplace=True)
22. data.dropna(axis=0,inplace=True)

請記住在摘要的開頭和結尾添加START和END特殊標記:

1. data['cleaned_summary'] = data['cleaned_summary'].apply(lambda x : '_START_ '+ x + ' _END_') 

現在,我們來看看前5個評論及其摘要:

1. for i in range(5): 
2. print("Review:",data['cleaned_text'][i])
3. print("Summary:",data['cleaned_summary'][i])
4. print("\\n")

輸出:

Python利用深度學習進行文本摘要的綜合指南下篇(附教程)


瞭解序列的分佈

在這裡,我們將分析評論和摘要的長度,以全面瞭解文本長度的分佈。這將幫助我們確定序列的最大長度:

1. import matplotlib.pyplot as plt 
2. text_word_count = []
3. summary_word_count = []
4.
5. # populate the lists with sentence lengths
6. for i in data['cleaned_text']:
7. text_word_count.append(len(i.split()))
8.
9. for i in data['cleaned_summary']:
10. summary_word_count.append(len(i.split()))
11.
12. length_df = pd.DataFrame({'text':text_word_count, 'summary':summary_word_count})
13. length_df.hist(bins = 30)
14. plt.show()

輸出:

Python利用深度學習進行文本摘要的綜合指南下篇(附教程)


有趣。我們可以將評論的最大長度固定為80,因為這似乎是多數評論的長度。同樣,我們可以將最大摘要長度設置為10:

1. max_len_text=80 
2. max_len_summary=10

我們越來越接近模型的構建部分了。在此之前,我們需要將數據集拆分為訓練和驗證集。我們將使用90%的數據集作為訓練數據,並在其餘10%上評估(保留集)表現:

1. from sklearn.model_selection import train_test_split 
2. x_tr,x_val,y_tr,y_val=train_test_split(data['cleaned_text'],data['cleaned_summary'],test_size=0.1,random_state=0,shuffle=True)

準備分詞器(Tokenizer)

分詞器構建詞彙表並將單詞序列轉換為整數序列。繼續為文本和摘要構建分詞器:

  • a) 文本分詞器
b) #prepare a tokenizer for reviews on training data 
c) x_tokenizer = Tokenizer()
d) x_tokenizer.fit_on_texts(list(x_tr))
e)
f) #convert text sequences into integer sequences
g) x_tr = x_tokenizer.texts_to_sequences(x_tr)
h) x_val = x_tokenizer.texts_to_sequences(x_val)
i)
j) #padding zero upto maximum length
k) x_tr = pad_sequences(x_tr, maxlen=max_len_text, padding='post')
l) x_val = pad_sequences(x_val, maxlen=max_len_text, padding='post')
m)
n) x_voc_size = len(x_tokenizer.word_index) +1
  • b)摘要分詞器
1. #preparing a tokenizer for summary on training data 
2. y_tokenizer = Tokenizer()
3. y_tokenizer.fit_on_texts(list(y_tr))
4.
5. #convert summary sequences into integer sequences
6. y_tr = y_tokenizer.texts_to_sequences(y_tr)
7. y_val = y_tokenizer.texts_to_sequences(y_val)
8.
9. #padding zero upto maximum length
10. y_tr = pad_sequences(y_tr, maxlen=max_len_summary, padding='post')
11. y_val = pad_sequences(y_val, maxlen=max_len_summary, padding='post')
12.
13. y_voc_size = len(y_tokenizer.word_index) +1

模型構建

終於來到了模型構建的部分。但在構建之前,我們需要熟悉所需的一些術語。

  • Return Sequences = True:當return sequences參數設置為True時,LSTM為每個時間步生成隱藏狀態和單元狀態
  • Return State = True:當return state = True時,LSTM僅生成最後一個時間步的隱藏狀態和單元狀態
  • Initial State:用於在第一個時間步初始化LSTM的內部狀態
  • Stacked LSTM:Stacked LSTM具有多層LSTM堆疊在彼此之上。這能產生更好地序列表示。我鼓勵你嘗試將LSTM的多個層堆疊在一起(這是一個很好的學習方法)

在這裡,我們為編碼器構建一個3層堆疊LSTM:

1. from keras import backend as K 
2. K.clear_session()
3. latent_dim = 500
4.
5. # Encoder
6. encoder_inputs = Input(shape=(max_len_text,))
7. enc_emb = Embedding(x_voc_size, latent_dim,trainable=True)(encoder_inputs)
8.
9. #LSTM 1
10. encoder_lstm1 = LSTM(latent_dim,return_sequences=True,return_state=True)
11. encoder_output1, state_h1, state_c1 = encoder_lstm1(enc_emb)
12.
13. #LSTM 2
14. encoder_lstm2 = LSTM(latent_dim,return_sequences=True,return_state=True)
15. encoder_output2, state_h2, state_c2 = encoder_lstm2(encoder_output1)
16.
17. #LSTM 3
18. encoder_lstm3=LSTM(latent_dim, return_state=True, return_sequences=True)
19. encoder_outputs, state_h, state_c= encoder_lstm3(encoder_output2)
20.
21. # Set up the decoder.
22. decoder_inputs = Input(shape=(None,))
23. dec_emb_layer = Embedding(y_voc_size, latent_dim,trainable=True)
24. dec_emb = dec_emb_layer(decoder_inputs)
25.
26. #LSTM using encoder_states as initial state
27. decoder_lstm = LSTM(latent_dim, return_sequences=True, return_state=True)
28. decoder_outputs,decoder_fwd_state, decoder_back_state = decoder_lstm(dec_emb,initial_state=[state_h, state_c])
29.
30. #Attention Layer
31. Attention layer attn_layer = AttentionLayer(name='attention_layer')
32. attn_out, attn_states = attn_layer([encoder_outputs, decoder_outputs])
33.
34. # Concat attention output and decoder LSTM output
35. decoder_concat_input = Concatenate(axis=-1, name='concat_layer')([decoder_outputs, attn_out])
36.
37. #Dense layer
38. decoder_dense = TimeDistributed(Dense(y_voc_size, activation='softmax'))
39. decoder_outputs = decoder_dense(decoder_concat_input)
40.
41. # Define the model
42. model = Model([encoder_inputs, decoder_inputs], decoder_outputs)
43. model.summary()

輸出:

Python利用深度學習進行文本摘要的綜合指南下篇(附教程)


我使用sparse categorical cross-entropy作為損失函數,因為它在運行中將整數序列轉換為獨熱(one-hot)向量。這克服了任何內存問題。

1. model.compile(optimizer='rmsprop', loss='sparse_categorical_crossentropy')

還記得early stopping的概念嗎?它用於通過監視用戶指定的度量標準,在適當的時間停止訓練神經網絡。在這裡,我監視驗證集損失(val_loss)。一旦驗證集損失反彈,我們的模型就會停止訓練:

1. es = EarlyStopping(monitor='val_loss', mode='min', verbose=1)

我們將在批量大小為512的情況下訓練模型,並在保留集(我們數據集的10%)上驗證它:

1. history=model.fit([x_tr,y_tr[:,:-1]], y_tr.reshape(y_tr.shape[0],y_tr.shape[1], 1)[:,1:] ,epochs=50,callbacks=[es],batch_size=512, validation_data=([x_val,y_val[:,:-1]], y_val.reshape(y_val.shape[0],y_val.shape[1], 1)[:,1:]))

瞭解診斷圖

現在,我們將繪製一些診斷圖來了解模型隨時間的變化情況:

1. from matplotlib import pyplot 
2. pyplot.plot(history.history['loss'], label='train')
3. pyplot.plot(history.history['val_loss'], label='test')
4. pyplot.legend() pyplot.show()

輸出:

Python利用深度學習進行文本摘要的綜合指南下篇(附教程)


我們可以推斷,在第10個週期(epoch)之後,驗證集損失略有增加。因此,我們將在此之後停止訓練模型。

接下來,讓我們構建字典,將目標和源詞彙表中的索引轉換為單詞:

1. reverse_target_word_index=y_tokenizer.index_word 
2. reverse_source_word_index=x_tokenizer.index_word
3. target_word_index=y_tokenizer.word_index

推理

設置編碼器和解碼器的推理:

1. # encoder inference 
2. encoder_model = Model(inputs=encoder_inputs,outputs=[encoder_outputs, state_h, state_c])
3.
4. # decoder inference
5. # Below tensors will hold the states of the previous time step
6. decoder_state_input_h = Input(shape=(latent_dim,))
7. decoder_state_input_c = Input(shape=(latent_dim,))
8. decoder_hidden_state_input = Input(shape=(max_len_text,latent_dim))
9.
10. # Get the embeddings of the decoder sequence
11. dec_emb2= dec_emb_layer(decoder_inputs)
12.
13. # To predict the next word in the sequence, set the initial states to the states from the previous time step
14. decoder_outputs2, state_h2, state_c2 = decoder_lstm(dec_emb2, initial_state=[decoder_state_input_h, decoder_state_input_c])
15.
16. #attention inference
17. attn_out_inf, attn_states_inf = attn_layer([decoder_hidden_state_input, decoder_outputs2])
18. decoder_inf_concat = Concatenate(axis=-1, name='concat')([decoder_outputs2, attn_out_inf])
19.
20. # A dense softmax layer to generate prob dist. over the target vocabulary
21. decoder_outputs2 = decoder_dense(decoder_inf_concat)
22.
23. # Final decoder model
24. decoder_model = Model(
25. [decoder_inputs] + [decoder_hidden_state_input,decoder_state_input_h, decoder_state_input_c],
26. [decoder_outputs2] + [state_h2, state_c2])

下面我們定義了一個函數,是推理過程的實現(我們在上一節中介紹過):

1. def decode_sequence(input_seq): 
2. # Encode the input as state vectors.
3. e_out, e_h, e_c = encoder_model.predict(input_seq)
4.
5. # Generate empty target sequence of length 1.
6. target_seq = np.zeros((1,1))
7.
8. # Chose the 'start' word as the first word of the target sequence
9. target_seq[0, 0] = target_word_index['start']
10.
11. stop_condition = False
12. decoded_sentence = ''
13. while not stop_condition:
14. output_tokens, h, c = decoder_model.predict([target_seq] + [e_out, e_h, e_c])
15.
16. # Sample a token
17. sampled_token_index = np.argmax(output_tokens[0, -1, :])
18. sampled_token = reverse_target_word_index[sampled_token_index]
19.
20. if(sampled_token!='end'):
21. decoded_sentence += ' '+sampled_token
22.
23. # Exit condition: either hit max length or find stop word.
24. if (sampled_token == 'end' or len(decoded_sentence.split()) >= (max_len_summary-1)):
25. stop_condition = True
26.
27. # Update the target sequence (of length 1).
28. target_seq = np.zeros((1,1))
29. target_seq[0, 0] = sampled_token_index
30.
31. # Update internal states
32. e_h, e_c = h, c
33.
34. return decoded_sentence

我們來定義函數,用於將摘要和評論中的整數序列轉換為單詞序列:

 1. def seq2summary(input_seq): 
2. newString=''
3. for i in input_seq:
4. if((i!=0 and i!=target_word_index['start']) and i!=target_word_index['end']):
5. newString=newString+reverse_target_word_index[i]+' '
6. return newString
7.
8. def seq2text(input_seq):
9. newString=''
10. for i in input_seq:
11. if(i!=0):
12. newString=newString+reverse_source_word_index[i]+' '
13. return newString
1. for i in range(len(x_val)):
2. print("Review:",seq2text(x_val[i]))
3. print("Original summary:",seq2summary(y_val[i]))
4. print("Predicted summary:",decode_sequence(x_val[i].reshape(1,max_len_text)))
5. print("\\n")

以下是該模型生成的一些摘要:

Python利用深度學習進行文本摘要的綜合指南下篇(附教程)

Python利用深度學習進行文本摘要的綜合指南下篇(附教程)

Python利用深度學習進行文本摘要的綜合指南下篇(附教程)

Python利用深度學習進行文本摘要的綜合指南下篇(附教程)

Python利用深度學習進行文本摘要的綜合指南下篇(附教程)


這真的很酷。即使我們模型生成的摘要和實際摘要並不完全匹配,但它們都傳達了相同的含義。我們的模型能夠根據文本中的上下文生成清晰的摘要。

以上就是我們如何使用Python中的深度學習概念執行文本摘要。

我們如何進一步提高模型的性能?

你的學習並不止於此!你可以做更多的事情來嘗試模型:

  • 我建議你增加訓練數據集大小並構建模型。隨著訓練數據集大小的增加,深度學習模型的泛化能力增強
  • 嘗試實現雙向LSTM,它能夠從兩個方向捕獲上下文,併產生更好的上下文向量
  • 使用集束搜索策略(beam search strategy)解碼測試序列而不是使用貪婪方法(argmax)
  • 根據BLEU分數評估模型的性能
  • 實現pointer-generator網絡和覆蓋機制

8. 注意力機制如何運作?

現在,我們來談談注意力機制的內部運作原理。正如我在文章開頭提到的那樣,這是一個數學密集的部分,所以將其視為可選部分。不過我仍然強烈建議通讀來真正掌握注意力機制的運作方式。

編碼器輸出源序列中每個時間步j的隱藏狀態(hj)。

類似地,解碼器輸出目標序列中每個時間步i的隱藏狀態(si)。

我們計算一個被稱為對齊分數(eij)的分數,基於該分數,源詞與目標詞對齊。使用得分函數從源隱藏狀態hj和目標隱藏狀態si計算對齊得分。由下面公式給出:

eij =score(si,hj)

其中eij表示目標時間步i和源時間步j的對齊分數。

根據所使用評分函數的類型,存在不同類型的注意力機制。我在下面提到了一些流行的注意力機制:

Python利用深度學習進行文本摘要的綜合指南下篇(附教程)


我們使用softmax函數標準化對齊分數以獲得注意力權重(aij):

aij = exp(eij) / Σk exp(eik)


We compute the linear sum of the products of the attention weights aij and the encoder hidden states hj to produce the attended context vector (Ci):

Ci = Σj aij · hj


The attended context vector and the target hidden state of the decoder at time step i are concatenated to produce the attended hidden vector Si:

Si = concatenate([si; Ci])

The attended hidden vector Si is then fed into a dense layer to produce yi:

yi = dense(Si)

Let's understand the above attention mechanism steps with the help of an example. Consider the source sequence to be [x1, x2, x3, x4] and the target sequence to be [y1, y2].

  • The encoder reads the entire source sequence and outputs a hidden state for every time step, say h1, h2, h3, h4
"


Python利用深度學習進行文本摘要的綜合指南下篇(附教程)

作者:ARAVIND PAI

翻譯:和中華

校對:申利彬

本文約7500字,建議閱讀15分鐘。

本文介紹瞭如何利用seq2seq來建立一個文本摘要模型,以及其中的注意力機制。並利用Keras搭建編寫了一個完整的模型代碼。

介紹

“我不想要完整的報告,只需給我一個結果摘要”。我發現自己經常處於這種狀況——無論是在大學還是在職場中。我們準備了一份綜合全面的報告,但教師/主管卻僅僅有時間閱讀摘要。

聽起來很熟悉?好吧,我決定對此採取一些措施。手動將報告轉換為摘要太耗費時間了,對吧?那我可以依靠自然語言處理(NLP)技術來幫忙嗎?

自然語言處理(NLP)

https://courses.analyticsvidhya.com/courses/natural-language-processing-nlp?utm_source=blog&utm_medium=comprehensive-guide-text-summarization-using-deep-learning-python

這就是使用深度學習的文本摘要真正幫助我的地方。它解決了以前一直困擾著我的問題——現在我們的模型可以理解整個文本的上下文。對於所有需要把文檔快速摘要的人來說,這個夢想已成現實!

Python利用深度學習進行文本摘要的綜合指南下篇(附教程)


我們使用深度學習完成的文本摘要結果如何呢?非常出色。因此,在本文中,我們將逐步介紹使用深度學習構建文本摘要器的過程,其中包含構建它所需的全部概念。然後將用Python實現我們的第一個文本摘要模型!

注意:本文要求對一些深度學習概念有基本的瞭解。 我建議閱讀以下文章。

  • A Must-Read Introduction to Sequence Modelling (with use cases)

https://www.analyticsvidhya.com/blog/2018/04/sequence-modelling-an-introduction-with-practical-use-cases/?

utm_source=blog&utm_medium=comprehensive-guide-text-summarization-using-deep-learning-python

  • Must-Read Tutorial to Learn Sequence Modeling (deeplearning.ai Course #5)

https://www.analyticsvidhya.com/blog/2019/01/sequence-models-deeplearning/?utm_source=blog&utm_medium=comprehensive-guide-text-summarization-using-deep-learning-python

  • Essentials of Deep Learning: Introduction to Long Short Term Memory

https://www.analyticsvidhya.com/blog/2017/12/fundamentals-of-deep-learning-introduction-to-lstm/?utm_source=blog&utm_medium=comprehensive-guide-text-summarization-using-deep-learning-python

目錄

1. NLP中的文本摘要是什麼?

2. 序列到序列(Seq2Seq)建模簡介

3. 理解編碼器(Encoder)-解碼器(Decoder)架構

4. 編碼器-解碼器結構的侷限性

5. 注意力機制背後的直覺

6. 理解問題陳述

7. 使用Keras在Python中實現文本摘要模型

8. 注意力機制如何運作?

我在本文的最後面保留了“注意力機制如何運作?”的部分。這是一個數學密集的部分,並不強制瞭解Python代碼的工作原理。但是,我鼓勵你通讀它,因為它會讓你對這個NLP概念有一個堅實的理解。

注:此篇包含內容7-8,1-6內容請見:Python利用深度學習進行文本摘要的綜合指南上篇(附教程)

7. 使用Keras在Python中實現文本摘要

現在是時候開啟我們的Jupyter notebook了!讓我們馬上深入瞭解實施細節。

自定義注意力層

Keras官方沒有正式支持注意力層。 因此,我們要麼實現自己的注意力層,要麼使用第三方實現。在本文中我們採用後者。

1. from attention import AttentionLayer

導入庫

1. import numpy as np 
2. import pandas as pd
3. import re
4. from bs4 import BeautifulSoup
5. from keras.preprocessing.text import Tokenizer
6. from keras.preprocessing.sequence import pad_sequences
7. from nltk.corpus import stopwords
8. from tensorflow.keras.layers import Input, LSTM, Embedding, Dense, Concatenate, TimeDistributed, Bidirectional
9. from tensorflow.keras.models import Model
10. from tensorflow.keras.callbacks import EarlyStopping
11. import warnings
12. pd.set_option("display.max_colwidth", 200)
13. warnings.filterwarnings("ignore")

讀取數據集

該數據集包括亞馬遜美食的評論。 這些數據涵蓋了超過10年的時間,截至2012年10月的所有約500,000條評論。這些評論包括產品和用戶信息,評級,純文本評論和摘要。它還包括來自所有其他亞馬遜類別的評論。

我們將抽樣出100,000個評論,以縮短模型的訓練時間。如果你的機器具有強大的計算能力,也可以使用整個數據集來訓練模型。

1. data=pd.read_csv("../input/amazon-fine-food-reviews/Reviews.csv",nrows=100000) 

刪除重複項和NA值

1. data.drop_duplicates(subset=['Text'],inplace=True) #dropping duplicates 
2. data.dropna(axis=0,inplace=True) #dropping na

預處理

在我們進入模型構建部分之前,執行基本的預處理步驟非常重要。使用髒亂和未清理的文本數據是一個潛在的災難性舉措。因此,在此步驟中,我們將從文本中刪除不影響問題目標的所有不需要的符號,字符等。

這是我們用於擴展縮略形式的字典:

1. contraction_mapping = {"ain't": "is not", "aren't": "are not","can't": "cannot", "'cause": "because", "could've": "could have", "couldn't": "could not", 
2.
3. "didn't": "did not", "doesn't": "does not", "don't": "do not", "hadn't": "had not", "hasn't": "has not", "haven't": "have not",
4.
5. "he'd": "he would","he'll": "he will", "he's": "he is", "how'd": "how did", "how'd'y": "how do you", "how'll": "how will", "how's": "how is",
6.
7. "I'd": "I would", "I'd've": "I would have", "I'll": "I will", "I'll've": "I will have","I'm": "I am", "I've": "I have", "i'd": "i would",
8.
9. "i'd've": "i would have", "i'll": "i will", "i'll've": "i will have","i'm": "i am", "i've": "i have", "isn't": "is not", "it'd": "it would",
10.
11. "it'd've": "it would have", "it'll": "it will", "it'll've": "it will have","it's": "it is", "let's": "let us", "ma'am": "madam",
12.
13. "mayn't": "may not", "might've": "might have","mightn't": "might not","mightn't've": "might not have", "must've": "must have",
14.
15. "mustn't": "must not", "mustn't've": "must not have", "needn't": "need not", "needn't've": "need not have","o'clock": "of the clock",
16.
17. "oughtn't": "ought not", "oughtn't've": "ought not have", "shan't": "shall not", "sha'n't": "shall not", "shan't've": "shall not have",
18.
19. "she'd": "she would", "she'd've": "she would have", "she'll": "she will", "she'll've": "she will have", "she's": "she is",
20.
21. "should've": "should have", "shouldn't": "should not", "shouldn't've": "should not have", "so've": "so have","so's": "so as",
22.
23. "this's": "this is","that'd": "that would", "that'd've": "that would have", "that's": "that is", "there'd": "there would",
24.
25. "there'd've": "there would have", "there's": "there is", "here's": "here is","they'd": "they would", "they'd've": "they would have",
26.
27. "they'll": "they will", "they'll've": "they will have", "they're": "they are", "they've": "they have", "to've": "to have",
28.
29. "wasn't": "was not", "we'd": "we would", "we'd've": "we would have", "we'll": "we will", "we'll've": "we will have", "we're": "we are",
30.
31. "we've": "we have", "weren't": "were not", "what'll": "what will", "what'll've": "what will have", "what're": "what are",
32.
33. "what's": "what is", "what've": "what have", "when's": "when is", "when've": "when have", "where'd": "where did", "where's": "where is",
34.
35. "where've": "where have", "who'll": "who will", "who'll've": "who will have", "who's": "who is", "who've": "who have",
36.
37. "why's": "why is", "why've": "why have", "will've": "will have", "won't": "will not", "won't've": "will not have",
38.
39. "would've": "would have", "wouldn't": "would not", "wouldn't've": "would not have", "y'all": "you all",
40.
41. "y'all'd": "you all would","y'all'd've": "you all would have","y'all're": "you all are","y'all've": "you all have",
42.
43. "you'd": "you would", "you'd've": "you would have", "you'll": "you will", "you'll've": "you will have",
44.
45. "you're": "you are", "you've": "you have"}

我們需要定義兩個不同的函數來預處理評論並生成摘要,因為文本和摘要中涉及的預處理步驟略有不同。

a)文字清理

讓我們看一下數據集中的前10個評論,以瞭解該如何進行文本預處理步驟:

1. data['Text'][:10] 

輸出:

Python利用深度學習進行文本摘要的綜合指南下篇(附教程)


我們將為我們的數據執行以下預處理任務:

  • 將所有內容轉換為小寫
  • 刪除HTML標籤
  • 縮略形式映射
  • 刪除('s)
  • 刪除括號內的任何文本()
  • 消除標點符號和特殊字符
  • 刪除停用詞
  • 刪除簡短的單詞

讓我們定義一下這個函數:

1. stop_words = set(stopwords.words('english')) 
2. def text_cleaner(text):
3. newString = text.lower()
4. newString = BeautifulSoup(newString, "lxml").text
5. newString = re.sub(r'\\([^)]*\\)', '', newString)
6. newString = re.sub('"','', newString)
7. newString = ' '.join([contraction_mapping[t] if t in contraction_mapping else t for t in newString.split(" ")])
8. newString = re.sub(r"'s\\b","",newString)
9. newString = re.sub("[^a-zA-Z]", " ", newString)
10. tokens = [w for w in newString.split() if not w in stop_words]
11. long_words=[]
12. for i in tokens:
13. if len(i)>=3: #removing short word
14. long_words.append(i)
15. return (" ".join(long_words)).strip()
16.
17. cleaned_text = []
18. for t in data['Text']:
19. cleaned_text.append(text_cleaner(t))

b)摘要清理

現在,我們將查看前10行評論,以瞭解摘要列的預處理步驟:

1. data['Summary'][:10] 

輸出:

Python利用深度學習進行文本摘要的綜合指南下篇(附教程)


定義此任務的函數:

1. def summary_cleaner(text): 
2. newString = re.sub('"','', text)
3. newString = ' '.join([contraction_mapping[t] if t in contraction_mapping else t for t in newString.split(" ")])
4. newString = re.sub(r"'s\\b","",newString)
5. newString = re.sub("[^a-zA-Z]", " ", newString)
6. newString = newString.lower()
7. tokens=newString.split()
8. newString=''
9. for i in tokens:
10. if len(i)>1:
11. newString=newString+i+' '
12. return newString
13.
14. #Call the above function
15. cleaned_summary = []
16. for t in data['Summary']:
17. cleaned_summary.append(summary_cleaner(t))
18.
19. data['cleaned_text']=cleaned_text
20. data['cleaned_summary']=cleaned_summary
21. data['cleaned_summary'].replace('', np.nan, inplace=True)
22. data.dropna(axis=0,inplace=True)

請記住在摘要的開頭和結尾添加START和END特殊標記:

1. data['cleaned_summary'] = data['cleaned_summary'].apply(lambda x : '_START_ '+ x + ' _END_') 

現在,我們來看看前5個評論及其摘要:

1. for i in range(5): 
2. print("Review:",data['cleaned_text'][i])
3. print("Summary:",data['cleaned_summary'][i])
4. print("\\n")

輸出:

Python利用深度學習進行文本摘要的綜合指南下篇(附教程)


瞭解序列的分佈

在這裡,我們將分析評論和摘要的長度,以全面瞭解文本長度的分佈。這將幫助我們確定序列的最大長度:

1. import matplotlib.pyplot as plt 
2. text_word_count = []
3. summary_word_count = []
4.
5. # populate the lists with sentence lengths
6. for i in data['cleaned_text']:
7. text_word_count.append(len(i.split()))
8.
9. for i in data['cleaned_summary']:
10. summary_word_count.append(len(i.split()))
11.
12. length_df = pd.DataFrame({'text':text_word_count, 'summary':summary_word_count})
13. length_df.hist(bins = 30)
14. plt.show()

輸出:

Python利用深度學習進行文本摘要的綜合指南下篇(附教程)


有趣。我們可以將評論的最大長度固定為80,因為這似乎是多數評論的長度。同樣,我們可以將最大摘要長度設置為10:

1. max_len_text=80 
2. max_len_summary=10

我們越來越接近模型的構建部分了。在此之前,我們需要將數據集拆分為訓練和驗證集。我們將使用90%的數據集作為訓練數據,並在其餘10%上評估(保留集)表現:

1. from sklearn.model_selection import train_test_split 
2. x_tr,x_val,y_tr,y_val=train_test_split(data['cleaned_text'],data['cleaned_summary'],test_size=0.1,random_state=0,shuffle=True)

準備分詞器(Tokenizer)

分詞器構建詞彙表並將單詞序列轉換為整數序列。繼續為文本和摘要構建分詞器:

  • a) 文本分詞器
b) #prepare a tokenizer for reviews on training data 
c) x_tokenizer = Tokenizer()
d) x_tokenizer.fit_on_texts(list(x_tr))
e)
f) #convert text sequences into integer sequences
g) x_tr = x_tokenizer.texts_to_sequences(x_tr)
h) x_val = x_tokenizer.texts_to_sequences(x_val)
i)
j) #padding zero upto maximum length
k) x_tr = pad_sequences(x_tr, maxlen=max_len_text, padding='post')
l) x_val = pad_sequences(x_val, maxlen=max_len_text, padding='post')
m)
n) x_voc_size = len(x_tokenizer.word_index) +1
  • b)摘要分詞器
1. #preparing a tokenizer for summary on training data 
2. y_tokenizer = Tokenizer()
3. y_tokenizer.fit_on_texts(list(y_tr))
4.
5. #convert summary sequences into integer sequences
6. y_tr = y_tokenizer.texts_to_sequences(y_tr)
7. y_val = y_tokenizer.texts_to_sequences(y_val)
8.
9. #padding zero upto maximum length
10. y_tr = pad_sequences(y_tr, maxlen=max_len_summary, padding='post')
11. y_val = pad_sequences(y_val, maxlen=max_len_summary, padding='post')
12.
13. y_voc_size = len(y_tokenizer.word_index) +1

模型構建

終於來到了模型構建的部分。但在構建之前,我們需要熟悉所需的一些術語。

  • Return Sequences = True:當return sequences參數設置為True時,LSTM為每個時間步生成隱藏狀態和單元狀態
  • Return State = True:當return state = True時,LSTM僅生成最後一個時間步的隱藏狀態和單元狀態
  • Initial State:用於在第一個時間步初始化LSTM的內部狀態
  • Stacked LSTM:Stacked LSTM具有多層LSTM堆疊在彼此之上。這能產生更好地序列表示。我鼓勵你嘗試將LSTM的多個層堆疊在一起(這是一個很好的學習方法)

在這裡,我們為編碼器構建一個3層堆疊LSTM:

1. from keras import backend as K 
2. K.clear_session()
3. latent_dim = 500
4.
5. # Encoder
6. encoder_inputs = Input(shape=(max_len_text,))
7. enc_emb = Embedding(x_voc_size, latent_dim,trainable=True)(encoder_inputs)
8.
9. #LSTM 1
10. encoder_lstm1 = LSTM(latent_dim,return_sequences=True,return_state=True)
11. encoder_output1, state_h1, state_c1 = encoder_lstm1(enc_emb)
12.
13. #LSTM 2
14. encoder_lstm2 = LSTM(latent_dim,return_sequences=True,return_state=True)
15. encoder_output2, state_h2, state_c2 = encoder_lstm2(encoder_output1)
16.
17. #LSTM 3
18. encoder_lstm3=LSTM(latent_dim, return_state=True, return_sequences=True)
19. encoder_outputs, state_h, state_c= encoder_lstm3(encoder_output2)
20.
21. # Set up the decoder.
22. decoder_inputs = Input(shape=(None,))
23. dec_emb_layer = Embedding(y_voc_size, latent_dim,trainable=True)
24. dec_emb = dec_emb_layer(decoder_inputs)
25.
26. #LSTM using encoder_states as initial state
27. decoder_lstm = LSTM(latent_dim, return_sequences=True, return_state=True)
28. decoder_outputs,decoder_fwd_state, decoder_back_state = decoder_lstm(dec_emb,initial_state=[state_h, state_c])
29.
30. #Attention Layer
31. Attention layer attn_layer = AttentionLayer(name='attention_layer')
32. attn_out, attn_states = attn_layer([encoder_outputs, decoder_outputs])
33.
34. # Concat attention output and decoder LSTM output
35. decoder_concat_input = Concatenate(axis=-1, name='concat_layer')([decoder_outputs, attn_out])
36.
37. #Dense layer
38. decoder_dense = TimeDistributed(Dense(y_voc_size, activation='softmax'))
39. decoder_outputs = decoder_dense(decoder_concat_input)
40.
41. # Define the model
42. model = Model([encoder_inputs, decoder_inputs], decoder_outputs)
43. model.summary()

輸出:

Python利用深度學習進行文本摘要的綜合指南下篇(附教程)


我使用sparse categorical cross-entropy作為損失函數,因為它在運行中將整數序列轉換為獨熱(one-hot)向量。這克服了任何內存問題。

1. model.compile(optimizer='rmsprop', loss='sparse_categorical_crossentropy')

還記得early stopping的概念嗎?它用於通過監視用戶指定的度量標準,在適當的時間停止訓練神經網絡。在這裡,我監視驗證集損失(val_loss)。一旦驗證集損失反彈,我們的模型就會停止訓練:

1. es = EarlyStopping(monitor='val_loss', mode='min', verbose=1)

我們將在批量大小為512的情況下訓練模型,並在保留集(我們數據集的10%)上驗證它:

1. history=model.fit([x_tr,y_tr[:,:-1]], y_tr.reshape(y_tr.shape[0],y_tr.shape[1], 1)[:,1:] ,epochs=50,callbacks=[es],batch_size=512, validation_data=([x_val,y_val[:,:-1]], y_val.reshape(y_val.shape[0],y_val.shape[1], 1)[:,1:]))

瞭解診斷圖

現在,我們將繪製一些診斷圖來了解模型隨時間的變化情況:

1. from matplotlib import pyplot 
2. pyplot.plot(history.history['loss'], label='train')
3. pyplot.plot(history.history['val_loss'], label='test')
4. pyplot.legend() pyplot.show()

輸出:

Python利用深度學習進行文本摘要的綜合指南下篇(附教程)


我們可以推斷,在第10個週期(epoch)之後,驗證集損失略有增加。因此,我們將在此之後停止訓練模型。

接下來,讓我們構建字典,將目標和源詞彙表中的索引轉換為單詞:

1. reverse_target_word_index=y_tokenizer.index_word 
2. reverse_source_word_index=x_tokenizer.index_word
3. target_word_index=y_tokenizer.word_index

推理

設置編碼器和解碼器的推理:

1. # encoder inference 
2. encoder_model = Model(inputs=encoder_inputs,outputs=[encoder_outputs, state_h, state_c])
3.
4. # decoder inference
5. # Below tensors will hold the states of the previous time step
6. decoder_state_input_h = Input(shape=(latent_dim,))
7. decoder_state_input_c = Input(shape=(latent_dim,))
8. decoder_hidden_state_input = Input(shape=(max_len_text,latent_dim))
9.
10. # Get the embeddings of the decoder sequence
11. dec_emb2= dec_emb_layer(decoder_inputs)
12.
13. # To predict the next word in the sequence, set the initial states to the states from the previous time step
14. decoder_outputs2, state_h2, state_c2 = decoder_lstm(dec_emb2, initial_state=[decoder_state_input_h, decoder_state_input_c])
15.
16. #attention inference
17. attn_out_inf, attn_states_inf = attn_layer([decoder_hidden_state_input, decoder_outputs2])
18. decoder_inf_concat = Concatenate(axis=-1, name='concat')([decoder_outputs2, attn_out_inf])
19.
20. # A dense softmax layer to generate prob dist. over the target vocabulary
21. decoder_outputs2 = decoder_dense(decoder_inf_concat)
22.
23. # Final decoder model
24. decoder_model = Model(
25. [decoder_inputs] + [decoder_hidden_state_input,decoder_state_input_h, decoder_state_input_c],
26. [decoder_outputs2] + [state_h2, state_c2])

下面我們定義了一個函數,是推理過程的實現(我們在上一節中介紹過):

1. def decode_sequence(input_seq): 
2. # Encode the input as state vectors.
3. e_out, e_h, e_c = encoder_model.predict(input_seq)
4.
5. # Generate empty target sequence of length 1.
6. target_seq = np.zeros((1,1))
7.
8. # Chose the 'start' word as the first word of the target sequence
9. target_seq[0, 0] = target_word_index['start']
10.
11. stop_condition = False
12. decoded_sentence = ''
13. while not stop_condition:
14. output_tokens, h, c = decoder_model.predict([target_seq] + [e_out, e_h, e_c])
15.
16. # Sample a token
17. sampled_token_index = np.argmax(output_tokens[0, -1, :])
18. sampled_token = reverse_target_word_index[sampled_token_index]
19.
20. if(sampled_token!='end'):
21. decoded_sentence += ' '+sampled_token
22.
23. # Exit condition: either hit max length or find stop word.
24. if (sampled_token == 'end' or len(decoded_sentence.split()) >= (max_len_summary-1)):
25. stop_condition = True
26.
27. # Update the target sequence (of length 1).
28. target_seq = np.zeros((1,1))
29. target_seq[0, 0] = sampled_token_index
30.
31. # Update internal states
32. e_h, e_c = h, c
33.
34. return decoded_sentence

我們來定義函數,用於將摘要和評論中的整數序列轉換為單詞序列:

 1. def seq2summary(input_seq): 
2. newString=''
3. for i in input_seq:
4. if((i!=0 and i!=target_word_index['start']) and i!=target_word_index['end']):
5. newString=newString+reverse_target_word_index[i]+' '
6. return newString
7.
8. def seq2text(input_seq):
9. newString=''
10. for i in input_seq:
11. if(i!=0):
12. newString=newString+reverse_source_word_index[i]+' '
13. return newString

1. for i in range(len(x_val)):
2. print("Review:",seq2text(x_val[i]))
3. print("Original summary:",seq2summary(y_val[i]))
4. print("Predicted summary:",decode_sequence(x_val[i].reshape(1,max_len_text)))
5. print("\n")

以下是該模型生成的一些摘要:

(若干條評論、原始摘要與模型預測摘要的對比截圖,此處省略)


這真的很酷。雖然模型生成的摘要與實際摘要並不完全一致,但兩者傳達的含義是相同的。我們的模型能夠根據文本的上下文生成通順的摘要。

以上就是我們如何使用Python中的深度學習概念執行文本摘要。

我們如何進一步提高模型的性能?

你的學習並不止於此!你可以做更多的事情來嘗試模型:

  • 我建議你增加訓練數據集大小並構建模型。隨著訓練數據集大小的增加,深度學習模型的泛化能力增強
  • 嘗試實現雙向LSTM,它能夠從兩個方向捕獲上下文,併產生更好的上下文向量
  • 使用集束搜索策略(beam search strategy)解碼測試序列,而不是使用貪婪方法(argmax);列表之後給出了一個簡單的示意實現
  • 根據BLEU分數評估模型的性能
  • 實現pointer-generator網絡和覆蓋機制
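
以列表中提到的集束搜索為例,下面給出一個極簡的示意實現(僅作思路演示,並非原文代碼;沿用上文的encoder_model、decoder_model、target_word_index等名稱,beam寬度k是假設的超參數):

def beam_search_decode(input_seq, k=3):
    e_out, e_h, e_c = encoder_model.predict(input_seq)
    end_id = target_word_index['end']
    # 每個候選:(已生成的詞id序列, 累計對數概率, 解碼器狀態h, 解碼器狀態c)
    beams = [([target_word_index['start']], 0.0, e_h, e_c)]
    completed = []
    for _ in range(max_len_summary - 1):
        candidates = []
        for seq, score, h, c in beams:
            target_seq = np.zeros((1, 1))
            target_seq[0, 0] = seq[-1]
            output_tokens, h2, c2 = decoder_model.predict([target_seq] + [e_out, h, c])
            probs = output_tokens[0, -1, :]
            # 用概率最高的k個詞擴展當前候選,以累計對數概率作為得分
            for idx in np.argsort(probs)[-k:]:
                candidates.append((seq + [int(idx)], score + np.log(probs[idx] + 1e-10), h2, c2))
        # 只保留總得分最高的k個候選;生成end的候選視為已完成
        candidates.sort(key=lambda x: x[1], reverse=True)
        beams = []
        for cand in candidates[:k]:
            (completed if cand[0][-1] == end_id else beams).append(cand)
        if not beams:
            break
    best = max(completed + beams, key=lambda x: x[1])
    return ' '.join(reverse_target_word_index[i] for i in best[0][1:] if i not in (0, end_id))

相比貪婪解碼,集束搜索在每一步保留多個候選,通常能得到更通順的摘要,代價是解碼時間大約增加k倍。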

8. 注意力機制如何運作?

現在,我們來談談注意力機制的內部運作原理。正如我在文章開頭提到的那樣,這是一個數學密集的部分,所以將其視為可選部分。不過我仍然強烈建議通讀來真正掌握注意力機制的運作方式。

編碼器輸出源序列中每個時間步j的隱藏狀態(hj)。

類似地,解碼器輸出目標序列中每個時間步i的隱藏狀態(si)。

我們計算一個稱為對齊分數(eij)的量,用來衡量源詞與目標詞的對齊程度。對齊分數由得分函數(score function)根據源隱藏狀態hj和目標隱藏狀態si計算得到,公式如下:

eij = score(si, hj)

其中eij表示目標時間步i和源時間步j的對齊分數。

根據所使用評分函數的類型,存在不同類型的注意力機制。我在下面提到了一些流行的注意力機制:

(圖表:幾種常見的注意力評分函數對照表,此處省略)
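
作為參考,下面按Luong和Bahdanau兩篇經典論文列出幾種最常見的評分函數形式(與原圖中的表格未必一一對應,僅供對照):

  • 點積(dot):score(si, hj) = si · hj
  • 一般形式(general):score(si, hj) = siᵀ · W · hj
  • 加性/拼接(concat):score(si, hj) = vᵀ · tanh(W · [si; hj])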


我們使用softmax函數標準化對齊分數以獲得注意力權重(aij):

aij = exp(eij) / Σk exp(eik)


我們對注意力權重aij與對應編碼器隱藏狀態hj的乘積求線性和(即加權求和),得到參與的上下文向量(Ci):

Ci = ai1*h1 + ai2*h2 + … + aiT*hT


將參與的上下文向量Ci與解碼器在時間步i的目標隱藏狀態si連接,得到參與的隱藏向量Si:

Si= concatenate([si; Ci])

然後將參與的隱藏向量Si送入dense層(全連接層)以產生yi:

yi= dense(Si)
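
在進入具體的數值例子之前,可以先用幾行NumPy把上述步驟串起來(這只是一個示意性的玩具計算:向量取隨機值,評分函數用最簡單的點積,與模型中實際使用的注意力層無關):

import numpy as np

np.random.seed(0)
T, d = 4, 6                       # 源序列長度4,隱藏狀態維度6(假設的玩具數值)
h = np.random.randn(T, d)         # 編碼器各時間步的隱藏狀態 h1..h4
s_i = np.random.randn(d)          # 解碼器在時間步i的隱藏狀態 si

e = h @ s_i                       # 對齊分數 eij(這裡用點積作為score函數)
a = np.exp(e) / np.exp(e).sum()   # softmax歸一化,得到注意力權重 aij
C = (a[:, None] * h).sum(axis=0)  # 上下文向量 Ci = Σj aij*hj
S = np.concatenate([s_i, C])      # 參與的隱藏向量 Si = [si; Ci]
print(a.sum(), S.shape)           # 權重之和為1.0,S的維度為2d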

讓我們藉助一個例子來理解上面的注意力機制步驟。 將源序列視為[x1,x2,x3,x4],將目標序列視為[y1,y2]。

  • 編碼器讀取整個源序列並輸出每個時間步的隱藏狀態,如h1,h2,h3,h4




  • 解碼器讀取偏移一個時間步的整個目標序列,並輸出每個時間步的隱藏狀態,如s1,s2,s3


"


Python利用深度學習進行文本摘要的綜合指南下篇(附教程)

作者:ARAVIND PAI

翻譯:和中華

校對:申利彬

本文約7500字,建議閱讀15分鐘。

本文介紹瞭如何利用seq2seq來建立一個文本摘要模型,以及其中的注意力機制。並利用Keras搭建編寫了一個完整的模型代碼。

介紹

“我不想要完整的報告,只需給我一個結果摘要”。我發現自己經常處於這種狀況——無論是在大學還是在職場中。我們準備了一份綜合全面的報告,但教師/主管卻僅僅有時間閱讀摘要。

聽起來很熟悉?好吧,我決定對此採取一些措施。手動將報告轉換為摘要太耗費時間了,對吧?那我可以依靠自然語言處理(NLP)技術來幫忙嗎?

自然語言處理(NLP)

https://courses.analyticsvidhya.com/courses/natural-language-processing-nlp?utm_source=blog&utm_medium=comprehensive-guide-text-summarization-using-deep-learning-python

這就是使用深度學習的文本摘要真正幫助我的地方。它解決了以前一直困擾著我的問題——現在我們的模型可以理解整個文本的上下文。對於所有需要把文檔快速摘要的人來說,這個夢想已成現實!

Python利用深度學習進行文本摘要的綜合指南下篇(附教程)


我們使用深度學習完成的文本摘要結果如何呢?非常出色。因此,在本文中,我們將逐步介紹使用深度學習構建文本摘要器的過程,其中包含構建它所需的全部概念。然後將用Python實現我們的第一個文本摘要模型!

注意:本文要求對一些深度學習概念有基本的瞭解。 我建議閱讀以下文章。

  • A Must-Read Introduction to Sequence Modelling (with use cases)

https://www.analyticsvidhya.com/blog/2018/04/sequence-modelling-an-introduction-with-practical-use-cases/?

utm_source=blog&utm_medium=comprehensive-guide-text-summarization-using-deep-learning-python

  • Must-Read Tutorial to Learn Sequence Modeling (deeplearning.ai Course #5)

https://www.analyticsvidhya.com/blog/2019/01/sequence-models-deeplearning/?utm_source=blog&utm_medium=comprehensive-guide-text-summarization-using-deep-learning-python

  • Essentials of Deep Learning: Introduction to Long Short Term Memory

https://www.analyticsvidhya.com/blog/2017/12/fundamentals-of-deep-learning-introduction-to-lstm/?utm_source=blog&utm_medium=comprehensive-guide-text-summarization-using-deep-learning-python

目錄

1. NLP中的文本摘要是什麼?

2. 序列到序列(Seq2Seq)建模簡介

3. 理解編碼器(Encoder)-解碼器(Decoder)架構

4. 編碼器-解碼器結構的侷限性

5. 注意力機制背後的直覺

6. 理解問題陳述

7. 使用Keras在Python中實現文本摘要模型

8. 注意力機制如何運作?

我在本文的最後面保留了“注意力機制如何運作?”的部分。這是一個數學密集的部分,並不強制瞭解Python代碼的工作原理。但是,我鼓勵你通讀它,因為它會讓你對這個NLP概念有一個堅實的理解。

注:此篇包含內容7-8,1-6內容請見:Python利用深度學習進行文本摘要的綜合指南上篇(附教程)

7. 使用Keras在Python中實現文本摘要

現在是時候開啟我們的Jupyter notebook了!讓我們馬上深入瞭解實施細節。

自定義注意力層

Keras官方沒有正式支持注意力層。 因此,我們要麼實現自己的注意力層,要麼使用第三方實現。在本文中我們採用後者。

1. from attention import AttentionLayer

導入庫

1. import numpy as np 
2. import pandas as pd
3. import re
4. from bs4 import BeautifulSoup
5. from keras.preprocessing.text import Tokenizer
6. from keras.preprocessing.sequence import pad_sequences
7. from nltk.corpus import stopwords
8. from tensorflow.keras.layers import Input, LSTM, Embedding, Dense, Concatenate, TimeDistributed, Bidirectional
9. from tensorflow.keras.models import Model
10. from tensorflow.keras.callbacks import EarlyStopping
11. import warnings
12. pd.set_option("display.max_colwidth", 200)
13. warnings.filterwarnings("ignore")

讀取數據集

該數據集包括亞馬遜美食的評論。 這些數據涵蓋了超過10年的時間,截至2012年10月的所有約500,000條評論。這些評論包括產品和用戶信息,評級,純文本評論和摘要。它還包括來自所有其他亞馬遜類別的評論。

我們將抽樣出100,000個評論,以縮短模型的訓練時間。如果你的機器具有強大的計算能力,也可以使用整個數據集來訓練模型。

1. data=pd.read_csv("../input/amazon-fine-food-reviews/Reviews.csv",nrows=100000) 

刪除重複項和NA值

1. data.drop_duplicates(subset=['Text'],inplace=True) #dropping duplicates 
2. data.dropna(axis=0,inplace=True) #dropping na

預處理

在我們進入模型構建部分之前,執行基本的預處理步驟非常重要。使用髒亂和未清理的文本數據是一個潛在的災難性舉措。因此,在此步驟中,我們將從文本中刪除不影響問題目標的所有不需要的符號,字符等。

這是我們用於擴展縮略形式的字典:

1. contraction_mapping = {"ain't": "is not", "aren't": "are not","can't": "cannot", "'cause": "because", "could've": "could have", "couldn't": "could not", 
2.
3. "didn't": "did not", "doesn't": "does not", "don't": "do not", "hadn't": "had not", "hasn't": "has not", "haven't": "have not",
4.
5. "he'd": "he would","he'll": "he will", "he's": "he is", "how'd": "how did", "how'd'y": "how do you", "how'll": "how will", "how's": "how is",
6.
7. "I'd": "I would", "I'd've": "I would have", "I'll": "I will", "I'll've": "I will have","I'm": "I am", "I've": "I have", "i'd": "i would",
8.
9. "i'd've": "i would have", "i'll": "i will", "i'll've": "i will have","i'm": "i am", "i've": "i have", "isn't": "is not", "it'd": "it would",
10.
11. "it'd've": "it would have", "it'll": "it will", "it'll've": "it will have","it's": "it is", "let's": "let us", "ma'am": "madam",
12.
13. "mayn't": "may not", "might've": "might have","mightn't": "might not","mightn't've": "might not have", "must've": "must have",
14.
15. "mustn't": "must not", "mustn't've": "must not have", "needn't": "need not", "needn't've": "need not have","o'clock": "of the clock",
16.
17. "oughtn't": "ought not", "oughtn't've": "ought not have", "shan't": "shall not", "sha'n't": "shall not", "shan't've": "shall not have",
18.
19. "she'd": "she would", "she'd've": "she would have", "she'll": "she will", "she'll've": "she will have", "she's": "she is",
20.
21. "should've": "should have", "shouldn't": "should not", "shouldn't've": "should not have", "so've": "so have","so's": "so as",
22.
23. "this's": "this is","that'd": "that would", "that'd've": "that would have", "that's": "that is", "there'd": "there would",
24.
25. "there'd've": "there would have", "there's": "there is", "here's": "here is","they'd": "they would", "they'd've": "they would have",
26.
27. "they'll": "they will", "they'll've": "they will have", "they're": "they are", "they've": "they have", "to've": "to have",
28.
29. "wasn't": "was not", "we'd": "we would", "we'd've": "we would have", "we'll": "we will", "we'll've": "we will have", "we're": "we are",
30.
31. "we've": "we have", "weren't": "were not", "what'll": "what will", "what'll've": "what will have", "what're": "what are",
32.
33. "what's": "what is", "what've": "what have", "when's": "when is", "when've": "when have", "where'd": "where did", "where's": "where is",
34.
35. "where've": "where have", "who'll": "who will", "who'll've": "who will have", "who's": "who is", "who've": "who have",
36.
37. "why's": "why is", "why've": "why have", "will've": "will have", "won't": "will not", "won't've": "will not have",
38.
39. "would've": "would have", "wouldn't": "would not", "wouldn't've": "would not have", "y'all": "you all",
40.
41. "y'all'd": "you all would","y'all'd've": "you all would have","y'all're": "you all are","y'all've": "you all have",
42.
43. "you'd": "you would", "you'd've": "you would have", "you'll": "you will", "you'll've": "you will have",
44.
45. "you're": "you are", "you've": "you have"}

我們需要定義兩個不同的函數來預處理評論並生成摘要,因為文本和摘要中涉及的預處理步驟略有不同。

a)文字清理

讓我們看一下數據集中的前10個評論,以瞭解該如何進行文本預處理步驟:

1. data['Text'][:10] 

輸出:

Python利用深度學習進行文本摘要的綜合指南下篇(附教程)


我們將為我們的數據執行以下預處理任務:

  • 將所有內容轉換為小寫
  • 刪除HTML標籤
  • 縮略形式映射
  • 刪除('s)
  • 刪除括號內的任何文本()
  • 消除標點符號和特殊字符
  • 刪除停用詞
  • 刪除簡短的單詞

讓我們定義一下這個函數:

1. stop_words = set(stopwords.words('english')) 
2. def text_cleaner(text):
3. newString = text.lower()
4. newString = BeautifulSoup(newString, "lxml").text
5. newString = re.sub(r'\\([^)]*\\)', '', newString)
6. newString = re.sub('"','', newString)
7. newString = ' '.join([contraction_mapping[t] if t in contraction_mapping else t for t in newString.split(" ")])
8. newString = re.sub(r"'s\\b","",newString)
9. newString = re.sub("[^a-zA-Z]", " ", newString)
10. tokens = [w for w in newString.split() if not w in stop_words]
11. long_words=[]
12. for i in tokens:
13. if len(i)>=3: #removing short word
14. long_words.append(i)
15. return (" ".join(long_words)).strip()
16.
17. cleaned_text = []
18. for t in data['Text']:
19. cleaned_text.append(text_cleaner(t))

b)摘要清理

現在,我們將查看前10行評論,以瞭解摘要列的預處理步驟:

1. data['Summary'][:10] 

輸出:

Python利用深度學習進行文本摘要的綜合指南下篇(附教程)


定義此任務的函數:

1. def summary_cleaner(text): 
2. newString = re.sub('"','', text)
3. newString = ' '.join([contraction_mapping[t] if t in contraction_mapping else t for t in newString.split(" ")])
4. newString = re.sub(r"'s\\b","",newString)
5. newString = re.sub("[^a-zA-Z]", " ", newString)
6. newString = newString.lower()
7. tokens=newString.split()
8. newString=''
9. for i in tokens:
10. if len(i)>1:
11. newString=newString+i+' '
12. return newString
13.
14. #Call the above function
15. cleaned_summary = []
16. for t in data['Summary']:
17. cleaned_summary.append(summary_cleaner(t))
18.
19. data['cleaned_text']=cleaned_text
20. data['cleaned_summary']=cleaned_summary
21. data['cleaned_summary'].replace('', np.nan, inplace=True)
22. data.dropna(axis=0,inplace=True)

請記住在摘要的開頭和結尾添加START和END特殊標記:

1. data['cleaned_summary'] = data['cleaned_summary'].apply(lambda x : '_START_ '+ x + ' _END_') 

現在,我們來看看前5個評論及其摘要:

1. for i in range(5): 
2. print("Review:",data['cleaned_text'][i])
3. print("Summary:",data['cleaned_summary'][i])
4. print("\\n")

輸出:

Python利用深度學習進行文本摘要的綜合指南下篇(附教程)


瞭解序列的分佈

在這裡,我們將分析評論和摘要的長度,以全面瞭解文本長度的分佈。這將幫助我們確定序列的最大長度:

1. import matplotlib.pyplot as plt 
2. text_word_count = []
3. summary_word_count = []
4.
5. # populate the lists with sentence lengths
6. for i in data['cleaned_text']:
7. text_word_count.append(len(i.split()))
8.
9. for i in data['cleaned_summary']:
10. summary_word_count.append(len(i.split()))
11.
12. length_df = pd.DataFrame({'text':text_word_count, 'summary':summary_word_count})
13. length_df.hist(bins = 30)
14. plt.show()

輸出:

Python利用深度學習進行文本摘要的綜合指南下篇(附教程)


有趣。我們可以將評論的最大長度固定為80,因為這似乎是多數評論的長度。同樣,我們可以將最大摘要長度設置為10:

1. max_len_text=80 
2. max_len_summary=10

我們越來越接近模型的構建部分了。在此之前,我們需要將數據集拆分為訓練和驗證集。我們將使用90%的數據集作為訓練數據,並在其餘10%上評估(保留集)表現:

1. from sklearn.model_selection import train_test_split 
2. x_tr,x_val,y_tr,y_val=train_test_split(data['cleaned_text'],data['cleaned_summary'],test_size=0.1,random_state=0,shuffle=True)

準備分詞器(Tokenizer)

分詞器構建詞彙表並將單詞序列轉換為整數序列。繼續為文本和摘要構建分詞器:

  • a) 文本分詞器
b) #prepare a tokenizer for reviews on training data 
c) x_tokenizer = Tokenizer()
d) x_tokenizer.fit_on_texts(list(x_tr))
e)
f) #convert text sequences into integer sequences
g) x_tr = x_tokenizer.texts_to_sequences(x_tr)
h) x_val = x_tokenizer.texts_to_sequences(x_val)
i)
j) #padding zero upto maximum length
k) x_tr = pad_sequences(x_tr, maxlen=max_len_text, padding='post')
l) x_val = pad_sequences(x_val, maxlen=max_len_text, padding='post')
m)
n) x_voc_size = len(x_tokenizer.word_index) +1
  • b)摘要分詞器
1. #preparing a tokenizer for summary on training data 
2. y_tokenizer = Tokenizer()
3. y_tokenizer.fit_on_texts(list(y_tr))
4.
5. #convert summary sequences into integer sequences
6. y_tr = y_tokenizer.texts_to_sequences(y_tr)
7. y_val = y_tokenizer.texts_to_sequences(y_val)
8.
9. #padding zero upto maximum length
10. y_tr = pad_sequences(y_tr, maxlen=max_len_summary, padding='post')
11. y_val = pad_sequences(y_val, maxlen=max_len_summary, padding='post')
12.
13. y_voc_size = len(y_tokenizer.word_index) +1

模型構建

終於來到了模型構建的部分。但在構建之前,我們需要熟悉所需的一些術語。

  • Return Sequences = True:當return sequences參數設置為True時,LSTM為每個時間步生成隱藏狀態和單元狀態
  • Return State = True:當return state = True時,LSTM僅生成最後一個時間步的隱藏狀態和單元狀態
  • Initial State:用於在第一個時間步初始化LSTM的內部狀態
  • Stacked LSTM:Stacked LSTM具有多層LSTM堆疊在彼此之上。這能產生更好地序列表示。我鼓勵你嘗試將LSTM的多個層堆疊在一起(這是一個很好的學習方法)

在這裡,我們為編碼器構建一個3層堆疊LSTM:

1. from keras import backend as K 
2. K.clear_session()
3. latent_dim = 500
4.
5. # Encoder
6. encoder_inputs = Input(shape=(max_len_text,))
7. enc_emb = Embedding(x_voc_size, latent_dim,trainable=True)(encoder_inputs)
8.
9. #LSTM 1
10. encoder_lstm1 = LSTM(latent_dim,return_sequences=True,return_state=True)
11. encoder_output1, state_h1, state_c1 = encoder_lstm1(enc_emb)
12.
13. #LSTM 2
14. encoder_lstm2 = LSTM(latent_dim,return_sequences=True,return_state=True)
15. encoder_output2, state_h2, state_c2 = encoder_lstm2(encoder_output1)
16.
17. #LSTM 3
18. encoder_lstm3=LSTM(latent_dim, return_state=True, return_sequences=True)
19. encoder_outputs, state_h, state_c= encoder_lstm3(encoder_output2)
20.
21. # Set up the decoder.
22. decoder_inputs = Input(shape=(None,))
23. dec_emb_layer = Embedding(y_voc_size, latent_dim,trainable=True)
24. dec_emb = dec_emb_layer(decoder_inputs)
25.
26. #LSTM using encoder_states as initial state
27. decoder_lstm = LSTM(latent_dim, return_sequences=True, return_state=True)
28. decoder_outputs,decoder_fwd_state, decoder_back_state = decoder_lstm(dec_emb,initial_state=[state_h, state_c])
29.
30. #Attention Layer
31. Attention layer attn_layer = AttentionLayer(name='attention_layer')
32. attn_out, attn_states = attn_layer([encoder_outputs, decoder_outputs])
33.
34. # Concat attention output and decoder LSTM output
35. decoder_concat_input = Concatenate(axis=-1, name='concat_layer')([decoder_outputs, attn_out])
36.
37. #Dense layer
38. decoder_dense = TimeDistributed(Dense(y_voc_size, activation='softmax'))
39. decoder_outputs = decoder_dense(decoder_concat_input)
40.
41. # Define the model
42. model = Model([encoder_inputs, decoder_inputs], decoder_outputs)
43. model.summary()

輸出:

Python利用深度學習進行文本摘要的綜合指南下篇(附教程)


我使用sparse categorical cross-entropy作為損失函數,因為它在運行中將整數序列轉換為獨熱(one-hot)向量。這克服了任何內存問題。

1. model.compile(optimizer='rmsprop', loss='sparse_categorical_crossentropy')

還記得early stopping的概念嗎?它用於通過監視用戶指定的度量標準,在適當的時間停止訓練神經網絡。在這裡,我監視驗證集損失(val_loss)。一旦驗證集損失反彈,我們的模型就會停止訓練:

1. es = EarlyStopping(monitor='val_loss', mode='min', verbose=1)

我們將在批量大小為512的情況下訓練模型,並在保留集(我們數據集的10%)上驗證它:

1. history=model.fit([x_tr,y_tr[:,:-1]], y_tr.reshape(y_tr.shape[0],y_tr.shape[1], 1)[:,1:] ,epochs=50,callbacks=[es],batch_size=512, validation_data=([x_val,y_val[:,:-1]], y_val.reshape(y_val.shape[0],y_val.shape[1], 1)[:,1:]))

瞭解診斷圖

現在,我們將繪製一些診斷圖來了解模型隨時間的變化情況:

1. from matplotlib import pyplot 
2. pyplot.plot(history.history['loss'], label='train')
3. pyplot.plot(history.history['val_loss'], label='test')
4. pyplot.legend() pyplot.show()

輸出:

Python利用深度學習進行文本摘要的綜合指南下篇(附教程)


我們可以推斷,在第10個週期(epoch)之後,驗證集損失略有增加。因此,我們將在此之後停止訓練模型。

接下來,讓我們構建字典,將目標和源詞彙表中的索引轉換為單詞:

1. reverse_target_word_index=y_tokenizer.index_word 
2. reverse_source_word_index=x_tokenizer.index_word
3. target_word_index=y_tokenizer.word_index

推理

設置編碼器和解碼器的推理:

1. # encoder inference 
2. encoder_model = Model(inputs=encoder_inputs,outputs=[encoder_outputs, state_h, state_c])
3.
4. # decoder inference
5. # Below tensors will hold the states of the previous time step
6. decoder_state_input_h = Input(shape=(latent_dim,))
7. decoder_state_input_c = Input(shape=(latent_dim,))
8. decoder_hidden_state_input = Input(shape=(max_len_text,latent_dim))
9.
10. # Get the embeddings of the decoder sequence
11. dec_emb2= dec_emb_layer(decoder_inputs)
12.
13. # To predict the next word in the sequence, set the initial states to the states from the previous time step
14. decoder_outputs2, state_h2, state_c2 = decoder_lstm(dec_emb2, initial_state=[decoder_state_input_h, decoder_state_input_c])
15.
16. #attention inference
17. attn_out_inf, attn_states_inf = attn_layer([decoder_hidden_state_input, decoder_outputs2])
18. decoder_inf_concat = Concatenate(axis=-1, name='concat')([decoder_outputs2, attn_out_inf])
19.
20. # A dense softmax layer to generate prob dist. over the target vocabulary
21. decoder_outputs2 = decoder_dense(decoder_inf_concat)
22.
23. # Final decoder model
24. decoder_model = Model(
25. [decoder_inputs] + [decoder_hidden_state_input,decoder_state_input_h, decoder_state_input_c],
26. [decoder_outputs2] + [state_h2, state_c2])

下面我們定義了一個函數,是推理過程的實現(我們在上一節中介紹過):

1. def decode_sequence(input_seq): 
2. # Encode the input as state vectors.
3. e_out, e_h, e_c = encoder_model.predict(input_seq)
4.
5. # Generate empty target sequence of length 1.
6. target_seq = np.zeros((1,1))
7.
8. # Chose the 'start' word as the first word of the target sequence
9. target_seq[0, 0] = target_word_index['start']
10.
11. stop_condition = False
12. decoded_sentence = ''
13. while not stop_condition:
14. output_tokens, h, c = decoder_model.predict([target_seq] + [e_out, e_h, e_c])
15.
16. # Sample a token
17. sampled_token_index = np.argmax(output_tokens[0, -1, :])
18. sampled_token = reverse_target_word_index[sampled_token_index]
19.
20. if(sampled_token!='end'):
21. decoded_sentence += ' '+sampled_token
22.
23. # Exit condition: either hit max length or find stop word.
24. if (sampled_token == 'end' or len(decoded_sentence.split()) >= (max_len_summary-1)):
25. stop_condition = True
26.
27. # Update the target sequence (of length 1).
28. target_seq = np.zeros((1,1))
29. target_seq[0, 0] = sampled_token_index
30.
31. # Update internal states
32. e_h, e_c = h, c
33.
34. return decoded_sentence

我們來定義函數,用於將摘要和評論中的整數序列轉換為單詞序列:

 1. def seq2summary(input_seq): 
2. newString=''
3. for i in input_seq:
4. if((i!=0 and i!=target_word_index['start']) and i!=target_word_index['end']):
5. newString=newString+reverse_target_word_index[i]+' '
6. return newString
7.
8. def seq2text(input_seq):
9. newString=''
10. for i in input_seq:
11. if(i!=0):
12. newString=newString+reverse_source_word_index[i]+' '
13. return newString
1. for i in range(len(x_val)):
2. print("Review:",seq2text(x_val[i]))
3. print("Original summary:",seq2summary(y_val[i]))
4. print("Predicted summary:",decode_sequence(x_val[i].reshape(1,max_len_text)))
5. print("\\n")

以下是該模型生成的一些摘要:

Python利用深度學習進行文本摘要的綜合指南下篇(附教程)

Python利用深度學習進行文本摘要的綜合指南下篇(附教程)

Python利用深度學習進行文本摘要的綜合指南下篇(附教程)

Python利用深度學習進行文本摘要的綜合指南下篇(附教程)

Python利用深度學習進行文本摘要的綜合指南下篇(附教程)


這真的很酷。即使我們模型生成的摘要和實際摘要並不完全匹配,但它們都傳達了相同的含義。我們的模型能夠根據文本中的上下文生成清晰的摘要。

以上就是我們如何使用Python中的深度學習概念執行文本摘要。

我們如何進一步提高模型的性能?

你的學習並不止於此!你可以做更多的事情來嘗試模型:

  • 我建議你增加訓練數據集大小並構建模型。隨著訓練數據集大小的增加,深度學習模型的泛化能力增強
  • 嘗試實現雙向LSTM,它能夠從兩個方向捕獲上下文,併產生更好的上下文向量
  • 使用集束搜索策略(beam search strategy)解碼測試序列而不是使用貪婪方法(argmax)
  • 根據BLEU分數評估模型的性能
  • 實現pointer-generator網絡和覆蓋機制

8. 注意力機制如何運作?

現在,我們來談談注意力機制的內部運作原理。正如我在文章開頭提到的那樣,這是一個數學密集的部分,所以將其視為可選部分。不過我仍然強烈建議通讀來真正掌握注意力機制的運作方式。

編碼器輸出源序列中每個時間步j的隱藏狀態(hj)。

類似地,解碼器輸出目標序列中每個時間步i的隱藏狀態(si)。

我們計算一個被稱為對齊分數(eij)的分數,基於該分數,源詞與目標詞對齊。使用得分函數從源隱藏狀態hj和目標隱藏狀態si計算對齊得分。由下面公式給出:

eij =score(si,hj)

其中eij表示目標時間步i和源時間步j的對齊分數。

根據所使用評分函數的類型,存在不同類型的注意力機制。我在下面提到了一些流行的注意力機制:

Python利用深度學習進行文本摘要的綜合指南下篇(附教程)


我們使用softmax函數標準化對齊分數以獲得注意力權重(aij):

Python利用深度學習進行文本摘要的綜合指南下篇(附教程)


我們計算注意力權重aij和編碼器hj的隱藏狀態的乘積的線性和,以產生參與的上下文向量(Ci):

Python利用深度學習進行文本摘要的綜合指南下篇(附教程)


將參與的上下文向量和在時間步長i處的解碼器的目標隱藏狀態連接以產生參與的隱藏向量Si;

Si= concatenate([si; Ci])

然後將參與的隱藏向量Si送入dense層以產生yi;

yi= dense(Si)

讓我們藉助一個例子來理解上面的注意力機制步驟。 將源序列視為[x1,x2,x3,x4],將目標序列視為[y1,y2]。

  • 編碼器讀取整個源序列並輸出每個時間步的隱藏狀態,如h1,h2,h3,h4


Python利用深度學習進行文本摘要的綜合指南下篇(附教程)


  • 解碼器讀取偏移一個時間步的整個目標序列,並輸出每個時間步的隱藏狀態,如s1,s2,s3


Python利用深度學習進行文本摘要的綜合指南下篇(附教程)


目標時間步i = 1

  • 使用得分函數從源隱藏狀態hi和目標隱藏狀態s1計算對齊得分e1j:
e11= score(s1, h1)
e12= score(s1, h2)
e13= score(s1, h3)
e14= score(s1, h4)
  • 使用softmax標準化對齊分數e1j會產生注意力權重a1j:
a11= exp(e11)/((exp(e11)+exp(e12)+exp(e13)+exp(e14))
a12= exp(e12)/(exp(e11)+exp(e12)+exp(e13)+exp(e14))
a13= exp(e13)/(exp(e11)+exp(e12)+exp(e13)+exp(e14))
a14= exp(e14)/(exp(e11)+exp(e12)+exp(e13)+exp(e14))

參與的上下文向量C1由編碼器隱藏狀態hj和對齊分數a1j的乘積的線性和導出:

C1= h1 * a11 + h2 * a12 + h3 * a13 + h4 * a14

"


Python利用深度學習進行文本摘要的綜合指南下篇(附教程)

作者:ARAVIND PAI

翻譯:和中華

校對:申利彬

本文約7500字,建議閱讀15分鐘。

本文介紹瞭如何利用seq2seq來建立一個文本摘要模型,以及其中的注意力機制。並利用Keras搭建編寫了一個完整的模型代碼。

介紹

“我不想要完整的報告,只需給我一個結果摘要”。我發現自己經常處於這種狀況——無論是在大學還是在職場中。我們準備了一份綜合全面的報告,但教師/主管卻僅僅有時間閱讀摘要。

聽起來很熟悉?好吧,我決定對此採取一些措施。手動將報告轉換為摘要太耗費時間了,對吧?那我可以依靠自然語言處理(NLP)技術來幫忙嗎?

自然語言處理(NLP)

https://courses.analyticsvidhya.com/courses/natural-language-processing-nlp?utm_source=blog&utm_medium=comprehensive-guide-text-summarization-using-deep-learning-python

這就是使用深度學習的文本摘要真正幫助我的地方。它解決了以前一直困擾著我的問題——現在我們的模型可以理解整個文本的上下文。對於所有需要把文檔快速摘要的人來說,這個夢想已成現實!

Python利用深度學習進行文本摘要的綜合指南下篇(附教程)


我們使用深度學習完成的文本摘要結果如何呢?非常出色。因此,在本文中,我們將逐步介紹使用深度學習構建文本摘要器的過程,其中包含構建它所需的全部概念。然後將用Python實現我們的第一個文本摘要模型!

注意:本文要求對一些深度學習概念有基本的瞭解。 我建議閱讀以下文章。

  • A Must-Read Introduction to Sequence Modelling (with use cases)

https://www.analyticsvidhya.com/blog/2018/04/sequence-modelling-an-introduction-with-practical-use-cases/?

utm_source=blog&utm_medium=comprehensive-guide-text-summarization-using-deep-learning-python

  • Must-Read Tutorial to Learn Sequence Modeling (deeplearning.ai Course #5)

https://www.analyticsvidhya.com/blog/2019/01/sequence-models-deeplearning/?utm_source=blog&utm_medium=comprehensive-guide-text-summarization-using-deep-learning-python

  • Essentials of Deep Learning: Introduction to Long Short Term Memory

https://www.analyticsvidhya.com/blog/2017/12/fundamentals-of-deep-learning-introduction-to-lstm/?utm_source=blog&utm_medium=comprehensive-guide-text-summarization-using-deep-learning-python

目錄

1. NLP中的文本摘要是什麼?

2. 序列到序列(Seq2Seq)建模簡介

3. 理解編碼器(Encoder)-解碼器(Decoder)架構

4. 編碼器-解碼器結構的侷限性

5. 注意力機制背後的直覺

6. 理解問題陳述

7. 使用Keras在Python中實現文本摘要模型

8. 注意力機制如何運作?

我在本文的最後面保留了“注意力機制如何運作?”的部分。這是一個數學密集的部分,並不強制瞭解Python代碼的工作原理。但是,我鼓勵你通讀它,因為它會讓你對這個NLP概念有一個堅實的理解。

注:此篇包含內容7-8,1-6內容請見:Python利用深度學習進行文本摘要的綜合指南上篇(附教程)

7. 使用Keras在Python中實現文本摘要

現在是時候開啟我們的Jupyter notebook了!讓我們馬上深入瞭解實施細節。

自定義注意力層

Keras官方沒有正式支持注意力層。 因此,我們要麼實現自己的注意力層,要麼使用第三方實現。在本文中我們採用後者。

1. from attention import AttentionLayer

導入庫

1. import numpy as np 
2. import pandas as pd
3. import re
4. from bs4 import BeautifulSoup
5. from keras.preprocessing.text import Tokenizer
6. from keras.preprocessing.sequence import pad_sequences
7. from nltk.corpus import stopwords
8. from tensorflow.keras.layers import Input, LSTM, Embedding, Dense, Concatenate, TimeDistributed, Bidirectional
9. from tensorflow.keras.models import Model
10. from tensorflow.keras.callbacks import EarlyStopping
11. import warnings
12. pd.set_option("display.max_colwidth", 200)
13. warnings.filterwarnings("ignore")

讀取數據集

該數據集包括亞馬遜美食的評論。 這些數據涵蓋了超過10年的時間,截至2012年10月的所有約500,000條評論。這些評論包括產品和用戶信息,評級,純文本評論和摘要。它還包括來自所有其他亞馬遜類別的評論。

我們將抽樣出100,000個評論,以縮短模型的訓練時間。如果你的機器具有強大的計算能力,也可以使用整個數據集來訓練模型。

1. data=pd.read_csv("../input/amazon-fine-food-reviews/Reviews.csv",nrows=100000) 

刪除重複項和NA值

1. data.drop_duplicates(subset=['Text'],inplace=True) #dropping duplicates 
2. data.dropna(axis=0,inplace=True) #dropping na

預處理

在我們進入模型構建部分之前,執行基本的預處理步驟非常重要。使用髒亂和未清理的文本數據是一個潛在的災難性舉措。因此,在此步驟中,我們將從文本中刪除不影響問題目標的所有不需要的符號,字符等。

這是我們用於擴展縮略形式的字典:

1. contraction_mapping = {"ain't": "is not", "aren't": "are not","can't": "cannot", "'cause": "because", "could've": "could have", "couldn't": "could not", 
2.
3. "didn't": "did not", "doesn't": "does not", "don't": "do not", "hadn't": "had not", "hasn't": "has not", "haven't": "have not",
4.
5. "he'd": "he would","he'll": "he will", "he's": "he is", "how'd": "how did", "how'd'y": "how do you", "how'll": "how will", "how's": "how is",
6.
7. "I'd": "I would", "I'd've": "I would have", "I'll": "I will", "I'll've": "I will have","I'm": "I am", "I've": "I have", "i'd": "i would",
8.
9. "i'd've": "i would have", "i'll": "i will", "i'll've": "i will have","i'm": "i am", "i've": "i have", "isn't": "is not", "it'd": "it would",
10.
11. "it'd've": "it would have", "it'll": "it will", "it'll've": "it will have","it's": "it is", "let's": "let us", "ma'am": "madam",
12.
13. "mayn't": "may not", "might've": "might have","mightn't": "might not","mightn't've": "might not have", "must've": "must have",
14.
15. "mustn't": "must not", "mustn't've": "must not have", "needn't": "need not", "needn't've": "need not have","o'clock": "of the clock",
16.
17. "oughtn't": "ought not", "oughtn't've": "ought not have", "shan't": "shall not", "sha'n't": "shall not", "shan't've": "shall not have",
18.
19. "she'd": "she would", "she'd've": "she would have", "she'll": "she will", "she'll've": "she will have", "she's": "she is",
20.
21. "should've": "should have", "shouldn't": "should not", "shouldn't've": "should not have", "so've": "so have","so's": "so as",
22.
23. "this's": "this is","that'd": "that would", "that'd've": "that would have", "that's": "that is", "there'd": "there would",
24.
25. "there'd've": "there would have", "there's": "there is", "here's": "here is","they'd": "they would", "they'd've": "they would have",
26.
27. "they'll": "they will", "they'll've": "they will have", "they're": "they are", "they've": "they have", "to've": "to have",
28.
29. "wasn't": "was not", "we'd": "we would", "we'd've": "we would have", "we'll": "we will", "we'll've": "we will have", "we're": "we are",
30.
31. "we've": "we have", "weren't": "were not", "what'll": "what will", "what'll've": "what will have", "what're": "what are",
32.
33. "what's": "what is", "what've": "what have", "when's": "when is", "when've": "when have", "where'd": "where did", "where's": "where is",
34.
35. "where've": "where have", "who'll": "who will", "who'll've": "who will have", "who's": "who is", "who've": "who have",
36.
37. "why's": "why is", "why've": "why have", "will've": "will have", "won't": "will not", "won't've": "will not have",
38.
39. "would've": "would have", "wouldn't": "would not", "wouldn't've": "would not have", "y'all": "you all",
40.
41. "y'all'd": "you all would","y'all'd've": "you all would have","y'all're": "you all are","y'all've": "you all have",
42.
43. "you'd": "you would", "you'd've": "you would have", "you'll": "you will", "you'll've": "you will have",
44.
45. "you're": "you are", "you've": "you have"}

我們需要定義兩個不同的函數來預處理評論並生成摘要,因為文本和摘要中涉及的預處理步驟略有不同。

a)文字清理

讓我們看一下數據集中的前10個評論,以瞭解該如何進行文本預處理步驟:

1. data['Text'][:10] 

輸出:

Python利用深度學習進行文本摘要的綜合指南下篇(附教程)


我們將為我們的數據執行以下預處理任務:

  • 將所有內容轉換為小寫
  • 刪除HTML標籤
  • 縮略形式映射
  • 刪除('s)
  • 刪除括號內的任何文本()
  • 消除標點符號和特殊字符
  • 刪除停用詞
  • 刪除簡短的單詞

讓我們定義一下這個函數:

1. stop_words = set(stopwords.words('english')) 
2. def text_cleaner(text):
3. newString = text.lower()
4. newString = BeautifulSoup(newString, "lxml").text
5. newString = re.sub(r'\\([^)]*\\)', '', newString)
6. newString = re.sub('"','', newString)
7. newString = ' '.join([contraction_mapping[t] if t in contraction_mapping else t for t in newString.split(" ")])
8. newString = re.sub(r"'s\\b","",newString)
9. newString = re.sub("[^a-zA-Z]", " ", newString)
10. tokens = [w for w in newString.split() if not w in stop_words]
11. long_words=[]
12. for i in tokens:
13. if len(i)>=3: #removing short word
14. long_words.append(i)
15. return (" ".join(long_words)).strip()
16.
17. cleaned_text = []
18. for t in data['Text']:
19. cleaned_text.append(text_cleaner(t))

b)摘要清理

現在,我們將查看前10行評論,以瞭解摘要列的預處理步驟:

1. data['Summary'][:10] 

輸出:

Python利用深度學習進行文本摘要的綜合指南下篇(附教程)


定義此任務的函數:

1. def summary_cleaner(text): 
2. newString = re.sub('"','', text)
3. newString = ' '.join([contraction_mapping[t] if t in contraction_mapping else t for t in newString.split(" ")])
4. newString = re.sub(r"'s\\b","",newString)
5. newString = re.sub("[^a-zA-Z]", " ", newString)
6. newString = newString.lower()
7. tokens=newString.split()
8. newString=''
9. for i in tokens:
10. if len(i)>1:
11. newString=newString+i+' '
12. return newString
13.
14. #Call the above function
15. cleaned_summary = []
16. for t in data['Summary']:
17. cleaned_summary.append(summary_cleaner(t))
18.
19. data['cleaned_text']=cleaned_text
20. data['cleaned_summary']=cleaned_summary
21. data['cleaned_summary'].replace('', np.nan, inplace=True)
22. data.dropna(axis=0,inplace=True)

請記住在摘要的開頭和結尾添加START和END特殊標記:

1. data['cleaned_summary'] = data['cleaned_summary'].apply(lambda x : '_START_ '+ x + ' _END_') 

現在,我們來看看前5個評論及其摘要:

1. for i in range(5): 
2. print("Review:",data['cleaned_text'][i])
3. print("Summary:",data['cleaned_summary'][i])
4. print("\\n")

輸出:

Python利用深度學習進行文本摘要的綜合指南下篇(附教程)


瞭解序列的分佈

在這裡,我們將分析評論和摘要的長度,以全面瞭解文本長度的分佈。這將幫助我們確定序列的最大長度:

1. import matplotlib.pyplot as plt 
2. text_word_count = []
3. summary_word_count = []
4.
5. # populate the lists with sentence lengths
6. for i in data['cleaned_text']:
7. text_word_count.append(len(i.split()))
8.
9. for i in data['cleaned_summary']:
10. summary_word_count.append(len(i.split()))
11.
12. length_df = pd.DataFrame({'text':text_word_count, 'summary':summary_word_count})
13. length_df.hist(bins = 30)
14. plt.show()

輸出:

Python利用深度學習進行文本摘要的綜合指南下篇(附教程)


有趣。我們可以將評論的最大長度固定為80,因為這似乎是多數評論的長度。同樣,我們可以將最大摘要長度設置為10:

1. max_len_text=80 
2. max_len_summary=10

我們越來越接近模型的構建部分了。在此之前,我們需要將數據集拆分為訓練和驗證集。我們將使用90%的數據集作為訓練數據,並在其餘10%上評估(保留集)表現:

1. from sklearn.model_selection import train_test_split 
2. x_tr,x_val,y_tr,y_val=train_test_split(data['cleaned_text'],data['cleaned_summary'],test_size=0.1,random_state=0,shuffle=True)

準備分詞器(Tokenizer)

分詞器構建詞彙表並將單詞序列轉換為整數序列。繼續為文本和摘要構建分詞器:

  • a) 文本分詞器
b) #prepare a tokenizer for reviews on training data 
c) x_tokenizer = Tokenizer()
d) x_tokenizer.fit_on_texts(list(x_tr))
e)
f) #convert text sequences into integer sequences
g) x_tr = x_tokenizer.texts_to_sequences(x_tr)
h) x_val = x_tokenizer.texts_to_sequences(x_val)
i)
j) #padding zero upto maximum length
k) x_tr = pad_sequences(x_tr, maxlen=max_len_text, padding='post')
l) x_val = pad_sequences(x_val, maxlen=max_len_text, padding='post')
m)
n) x_voc_size = len(x_tokenizer.word_index) +1
  • b)摘要分詞器
1. #preparing a tokenizer for summary on training data 
2. y_tokenizer = Tokenizer()
3. y_tokenizer.fit_on_texts(list(y_tr))
4.
5. #convert summary sequences into integer sequences
6. y_tr = y_tokenizer.texts_to_sequences(y_tr)
7. y_val = y_tokenizer.texts_to_sequences(y_val)
8.
9. #padding zero upto maximum length
10. y_tr = pad_sequences(y_tr, maxlen=max_len_summary, padding='post')
11. y_val = pad_sequences(y_val, maxlen=max_len_summary, padding='post')
12.
13. y_voc_size = len(y_tokenizer.word_index) +1

模型構建

終於來到了模型構建的部分。但在構建之前,我們需要熟悉所需的一些術語。

  • Return Sequences = True:當return sequences參數設置為True時,LSTM為每個時間步生成隱藏狀態和單元狀態
  • Return State = True:當return state = True時,LSTM僅生成最後一個時間步的隱藏狀態和單元狀態
  • Initial State:用於在第一個時間步初始化LSTM的內部狀態
  • Stacked LSTM:Stacked LSTM具有多層LSTM堆疊在彼此之上。這能產生更好地序列表示。我鼓勵你嘗試將LSTM的多個層堆疊在一起(這是一個很好的學習方法)

在這裡,我們為編碼器構建一個3層堆疊LSTM:

1. from keras import backend as K 
2. K.clear_session()
3. latent_dim = 500
4.
5. # Encoder
6. encoder_inputs = Input(shape=(max_len_text,))
7. enc_emb = Embedding(x_voc_size, latent_dim,trainable=True)(encoder_inputs)
8.
9. #LSTM 1
10. encoder_lstm1 = LSTM(latent_dim,return_sequences=True,return_state=True)
11. encoder_output1, state_h1, state_c1 = encoder_lstm1(enc_emb)
12.
13. #LSTM 2
14. encoder_lstm2 = LSTM(latent_dim,return_sequences=True,return_state=True)
15. encoder_output2, state_h2, state_c2 = encoder_lstm2(encoder_output1)
16.
17. #LSTM 3
18. encoder_lstm3=LSTM(latent_dim, return_state=True, return_sequences=True)
19. encoder_outputs, state_h, state_c= encoder_lstm3(encoder_output2)
20.
21. # Set up the decoder.
22. decoder_inputs = Input(shape=(None,))
23. dec_emb_layer = Embedding(y_voc_size, latent_dim,trainable=True)
24. dec_emb = dec_emb_layer(decoder_inputs)
25.
26. #LSTM using encoder_states as initial state
27. decoder_lstm = LSTM(latent_dim, return_sequences=True, return_state=True)
28. decoder_outputs,decoder_fwd_state, decoder_back_state = decoder_lstm(dec_emb,initial_state=[state_h, state_c])
29.
30. #Attention Layer
31. Attention layer attn_layer = AttentionLayer(name='attention_layer')
32. attn_out, attn_states = attn_layer([encoder_outputs, decoder_outputs])
33.
34. # Concat attention output and decoder LSTM output
35. decoder_concat_input = Concatenate(axis=-1, name='concat_layer')([decoder_outputs, attn_out])
36.
37. #Dense layer
38. decoder_dense = TimeDistributed(Dense(y_voc_size, activation='softmax'))
39. decoder_outputs = decoder_dense(decoder_concat_input)
40.
41. # Define the model
42. model = Model([encoder_inputs, decoder_inputs], decoder_outputs)
43. model.summary()

輸出:

Python利用深度學習進行文本摘要的綜合指南下篇(附教程)


我使用sparse categorical cross-entropy作為損失函數,因為它在運行中將整數序列轉換為獨熱(one-hot)向量。這克服了任何內存問題。

1. model.compile(optimizer='rmsprop', loss='sparse_categorical_crossentropy')

還記得early stopping的概念嗎?它用於通過監視用戶指定的度量標準,在適當的時間停止訓練神經網絡。在這裡,我監視驗證集損失(val_loss)。一旦驗證集損失反彈,我們的模型就會停止訓練:

1. es = EarlyStopping(monitor='val_loss', mode='min', verbose=1)

我們將在批量大小為512的情況下訓練模型,並在保留集(我們數據集的10%)上驗證它:

1. history=model.fit([x_tr,y_tr[:,:-1]], y_tr.reshape(y_tr.shape[0],y_tr.shape[1], 1)[:,1:] ,epochs=50,callbacks=[es],batch_size=512, validation_data=([x_val,y_val[:,:-1]], y_val.reshape(y_val.shape[0],y_val.shape[1], 1)[:,1:]))

瞭解診斷圖

現在,我們將繪製一些診斷圖來了解模型隨時間的變化情況:

1. from matplotlib import pyplot 
2. pyplot.plot(history.history['loss'], label='train')
3. pyplot.plot(history.history['val_loss'], label='test')
4. pyplot.legend() pyplot.show()

輸出:

Python利用深度學習進行文本摘要的綜合指南下篇(附教程)


我們可以推斷,在第10個週期(epoch)之後,驗證集損失略有增加。因此,我們將在此之後停止訓練模型。

接下來,讓我們構建字典,將目標和源詞彙表中的索引轉換為單詞:

1. reverse_target_word_index=y_tokenizer.index_word 
2. reverse_source_word_index=x_tokenizer.index_word
3. target_word_index=y_tokenizer.word_index

推理

設置編碼器和解碼器的推理:

1. # encoder inference 
2. encoder_model = Model(inputs=encoder_inputs,outputs=[encoder_outputs, state_h, state_c])
3.
4. # decoder inference
5. # Below tensors will hold the states of the previous time step
6. decoder_state_input_h = Input(shape=(latent_dim,))
7. decoder_state_input_c = Input(shape=(latent_dim,))
8. decoder_hidden_state_input = Input(shape=(max_len_text,latent_dim))
9.
10. # Get the embeddings of the decoder sequence
11. dec_emb2= dec_emb_layer(decoder_inputs)
12.
13. # To predict the next word in the sequence, set the initial states to the states from the previous time step
14. decoder_outputs2, state_h2, state_c2 = decoder_lstm(dec_emb2, initial_state=[decoder_state_input_h, decoder_state_input_c])
15.
16. #attention inference
17. attn_out_inf, attn_states_inf = attn_layer([decoder_hidden_state_input, decoder_outputs2])
18. decoder_inf_concat = Concatenate(axis=-1, name='concat')([decoder_outputs2, attn_out_inf])
19.
20. # A dense softmax layer to generate prob dist. over the target vocabulary
21. decoder_outputs2 = decoder_dense(decoder_inf_concat)
22.
23. # Final decoder model
24. decoder_model = Model(
25. [decoder_inputs] + [decoder_hidden_state_input,decoder_state_input_h, decoder_state_input_c],
26. [decoder_outputs2] + [state_h2, state_c2])

下面我們定義了一個函數,是推理過程的實現(我們在上一節中介紹過):

1. def decode_sequence(input_seq): 
2. # Encode the input as state vectors.
3. e_out, e_h, e_c = encoder_model.predict(input_seq)
4.
5. # Generate empty target sequence of length 1.
6. target_seq = np.zeros((1,1))
7.
8. # Chose the 'start' word as the first word of the target sequence
9. target_seq[0, 0] = target_word_index['start']
10.
11. stop_condition = False
12. decoded_sentence = ''
13. while not stop_condition:
14. output_tokens, h, c = decoder_model.predict([target_seq] + [e_out, e_h, e_c])
15.
16. # Sample a token
17. sampled_token_index = np.argmax(output_tokens[0, -1, :])
18. sampled_token = reverse_target_word_index[sampled_token_index]
19.
20. if(sampled_token!='end'):
21. decoded_sentence += ' '+sampled_token
22.
23. # Exit condition: either hit max length or find stop word.
24. if (sampled_token == 'end' or len(decoded_sentence.split()) >= (max_len_summary-1)):
25. stop_condition = True
26.
27. # Update the target sequence (of length 1).
28. target_seq = np.zeros((1,1))
29. target_seq[0, 0] = sampled_token_index
30.
31. # Update internal states
32. e_h, e_c = h, c
33.
34. return decoded_sentence

我們來定義函數,用於將摘要和評論中的整數序列轉換為單詞序列:

 1. def seq2summary(input_seq): 
2. newString=''
3. for i in input_seq:
4. if((i!=0 and i!=target_word_index['start']) and i!=target_word_index['end']):
5. newString=newString+reverse_target_word_index[i]+' '
6. return newString
7.
8. def seq2text(input_seq):
9. newString=''
10. for i in input_seq:
11. if(i!=0):
12. newString=newString+reverse_source_word_index[i]+' '
13. return newString
1. for i in range(len(x_val)):
2. print("Review:",seq2text(x_val[i]))
3. print("Original summary:",seq2summary(y_val[i]))
4. print("Predicted summary:",decode_sequence(x_val[i].reshape(1,max_len_text)))
5. print("\\n")

以下是該模型生成的一些摘要:

Python利用深度學習進行文本摘要的綜合指南下篇(附教程)

Python利用深度學習進行文本摘要的綜合指南下篇(附教程)

Python利用深度學習進行文本摘要的綜合指南下篇(附教程)

Python利用深度學習進行文本摘要的綜合指南下篇(附教程)

Python利用深度學習進行文本摘要的綜合指南下篇(附教程)


這真的很酷。即使我們模型生成的摘要和實際摘要並不完全匹配,但它們都傳達了相同的含義。我們的模型能夠根據文本中的上下文生成清晰的摘要。

以上就是我們如何使用Python中的深度學習概念執行文本摘要。

我們如何進一步提高模型的性能?

你的學習並不止於此!你可以做更多的事情來嘗試模型:

  • 我建議你增加訓練數據集大小並構建模型。隨著訓練數據集大小的增加,深度學習模型的泛化能力增強
  • 嘗試實現雙向LSTM,它能夠從兩個方向捕獲上下文,併產生更好的上下文向量
  • 使用集束搜索策略(beam search strategy)解碼測試序列而不是使用貪婪方法(argmax)
  • 根據BLEU分數評估模型的性能
  • 實現pointer-generator網絡和覆蓋機制

8. 注意力機制如何運作?

現在,我們來談談注意力機制的內部運作原理。正如我在文章開頭提到的那樣,這是一個數學密集的部分,所以將其視為可選部分。不過我仍然強烈建議通讀來真正掌握注意力機制的運作方式。

編碼器輸出源序列中每個時間步j的隱藏狀態(hj)。

類似地,解碼器輸出目標序列中每個時間步i的隱藏狀態(si)。

我們計算一個被稱為對齊分數(eij)的分數,基於該分數,源詞與目標詞對齊。使用得分函數從源隱藏狀態hj和目標隱藏狀態si計算對齊得分。由下面公式給出:

eij =score(si,hj)

其中eij表示目標時間步i和源時間步j的對齊分數。

根據所使用評分函數的類型,存在不同類型的注意力機制。我在下面提到了一些流行的注意力機制:

Python利用深度學習進行文本摘要的綜合指南下篇(附教程)


我們使用softmax函數標準化對齊分數以獲得注意力權重(aij):

Python利用深度學習進行文本摘要的綜合指南下篇(附教程)


我們計算注意力權重aij和編碼器hj的隱藏狀態的乘積的線性和,以產生參與的上下文向量(Ci):

Python利用深度學習進行文本摘要的綜合指南下篇(附教程)


將參與的上下文向量和在時間步長i處的解碼器的目標隱藏狀態連接以產生參與的隱藏向量Si;

Si= concatenate([si; Ci])

然後將參與的隱藏向量Si送入dense層以產生yi;

yi= dense(Si)

讓我們藉助一個例子來理解上面的注意力機制步驟。 將源序列視為[x1,x2,x3,x4],將目標序列視為[y1,y2]。

  • 編碼器讀取整個源序列並輸出每個時間步的隱藏狀態,如h1,h2,h3,h4


Python利用深度學習進行文本摘要的綜合指南下篇(附教程)


  • 解碼器讀取偏移一個時間步的整個目標序列,並輸出每個時間步的隱藏狀態,如s1,s2,s3


Python利用深度學習進行文本摘要的綜合指南下篇(附教程)


目標時間步i = 1

  • 使用得分函數從源隱藏狀態hi和目標隱藏狀態s1計算對齊得分e1j:
e11= score(s1, h1)
e12= score(s1, h2)
e13= score(s1, h3)
e14= score(s1, h4)
  • 使用softmax標準化對齊分數e1j會產生注意力權重a1j:
a11= exp(e11)/((exp(e11)+exp(e12)+exp(e13)+exp(e14))
a12= exp(e12)/(exp(e11)+exp(e12)+exp(e13)+exp(e14))
a13= exp(e13)/(exp(e11)+exp(e12)+exp(e13)+exp(e14))
a14= exp(e14)/(exp(e11)+exp(e12)+exp(e13)+exp(e14))

參與的上下文向量C1由編碼器隱藏狀態hj和對齊分數a1j的乘積的線性和導出:

C1= h1 * a11 + h2 * a12 + h3 * a13 + h4 * a14

Python利用深度學習進行文本摘要的綜合指南下篇(附教程)


  • 將參與的上下文向量C1和目標隱藏狀態s1連接以產生參與的隱藏向量S1

S11= concatenate([s11; C1])

  • 然後將隱藏向量S1送到全連接層中以產生y1

y1= dense(S1)

目標時間步i = 2

  • 使用給出的得分函數從源隱藏狀態hi和目標隱藏狀態s2計算對齊分數e2j
e21= score(s2, h1)
e22= score(s2, h2)
e23= score(s2, h3)
e24= score(s2, h4)
  • 使用softmax標準化對齊分數e2j會產生注意力權重a2j:
a21= exp(e21)/(exp(e21)+exp(e22)+exp(e23)+exp(e24))
a22= exp(e22)/(exp(e21)+exp(e22)+exp(e23)+exp(e24))
a23= exp(e23)/(exp(e21)+exp(e22)+exp(e23)+exp(e24))
a24= exp(e24)/(exp(e21)+exp(e22)+exp(e23)+exp(e24))
  • 參與的上下文向量C2由編碼器隱藏狀態hi和對齊分數a2j的乘積的線性和導出:

C2= h1 * a21 + h2 * a22 + h3 * a23 + h4 * a24

"


Python利用深度學習進行文本摘要的綜合指南下篇(附教程)

作者:ARAVIND PAI

翻譯:和中華

校對:申利彬

本文約7500字,建議閱讀15分鐘。

本文介紹瞭如何利用seq2seq來建立一個文本摘要模型,以及其中的注意力機制。並利用Keras搭建編寫了一個完整的模型代碼。

介紹

“我不想要完整的報告,只需給我一個結果摘要”。我發現自己經常處於這種狀況——無論是在大學還是在職場中。我們準備了一份綜合全面的報告,但教師/主管卻僅僅有時間閱讀摘要。

聽起來很熟悉?好吧,我決定對此採取一些措施。手動將報告轉換為摘要太耗費時間了,對吧?那我可以依靠自然語言處理(NLP)技術來幫忙嗎?

自然語言處理(NLP)

https://courses.analyticsvidhya.com/courses/natural-language-processing-nlp?utm_source=blog&utm_medium=comprehensive-guide-text-summarization-using-deep-learning-python

這就是使用深度學習的文本摘要真正幫助我的地方。它解決了以前一直困擾著我的問題——現在我們的模型可以理解整個文本的上下文。對於所有需要把文檔快速摘要的人來說,這個夢想已成現實!

Python利用深度學習進行文本摘要的綜合指南下篇(附教程)


我們使用深度學習完成的文本摘要結果如何呢?非常出色。因此,在本文中,我們將逐步介紹使用深度學習構建文本摘要器的過程,其中包含構建它所需的全部概念。然後將用Python實現我們的第一個文本摘要模型!

注意:本文要求對一些深度學習概念有基本的瞭解。 我建議閱讀以下文章。

  • A Must-Read Introduction to Sequence Modelling (with use cases)

https://www.analyticsvidhya.com/blog/2018/04/sequence-modelling-an-introduction-with-practical-use-cases/?

utm_source=blog&utm_medium=comprehensive-guide-text-summarization-using-deep-learning-python

  • Must-Read Tutorial to Learn Sequence Modeling (deeplearning.ai Course #5)

https://www.analyticsvidhya.com/blog/2019/01/sequence-models-deeplearning/?utm_source=blog&utm_medium=comprehensive-guide-text-summarization-using-deep-learning-python

  • Essentials of Deep Learning: Introduction to Long Short Term Memory

https://www.analyticsvidhya.com/blog/2017/12/fundamentals-of-deep-learning-introduction-to-lstm/?utm_source=blog&utm_medium=comprehensive-guide-text-summarization-using-deep-learning-python

目錄

1. NLP中的文本摘要是什麼?

2. 序列到序列(Seq2Seq)建模簡介

3. 理解編碼器(Encoder)-解碼器(Decoder)架構

4. 編碼器-解碼器結構的侷限性

5. 注意力機制背後的直覺

6. 理解問題陳述

7. 使用Keras在Python中實現文本摘要模型

8. 注意力機制如何運作?

我在本文的最後面保留了“注意力機制如何運作?”的部分。這是一個數學密集的部分,並不強制瞭解Python代碼的工作原理。但是,我鼓勵你通讀它,因為它會讓你對這個NLP概念有一個堅實的理解。

注:此篇包含內容7-8,1-6內容請見:Python利用深度學習進行文本摘要的綜合指南上篇(附教程)

7. 使用Keras在Python中實現文本摘要

現在是時候開啟我們的Jupyter notebook了!讓我們馬上深入瞭解實施細節。

自定義注意力層

Keras官方沒有正式支持注意力層。 因此,我們要麼實現自己的注意力層,要麼使用第三方實現。在本文中我們採用後者。

1. from attention import AttentionLayer

導入庫

1. import numpy as np 
2. import pandas as pd
3. import re
4. from bs4 import BeautifulSoup
5. from keras.preprocessing.text import Tokenizer
6. from keras.preprocessing.sequence import pad_sequences
7. from nltk.corpus import stopwords
8. from tensorflow.keras.layers import Input, LSTM, Embedding, Dense, Concatenate, TimeDistributed, Bidirectional
9. from tensorflow.keras.models import Model
10. from tensorflow.keras.callbacks import EarlyStopping
11. import warnings
12. pd.set_option("display.max_colwidth", 200)
13. warnings.filterwarnings("ignore")

讀取數據集

該數據集包括亞馬遜美食的評論。 這些數據涵蓋了超過10年的時間,截至2012年10月的所有約500,000條評論。這些評論包括產品和用戶信息,評級,純文本評論和摘要。它還包括來自所有其他亞馬遜類別的評論。

我們將抽樣出100,000個評論,以縮短模型的訓練時間。如果你的機器具有強大的計算能力,也可以使用整個數據集來訓練模型。

1. data=pd.read_csv("../input/amazon-fine-food-reviews/Reviews.csv",nrows=100000) 

刪除重複項和NA值

1. data.drop_duplicates(subset=['Text'],inplace=True) #dropping duplicates 
2. data.dropna(axis=0,inplace=True) #dropping na

預處理

在我們進入模型構建部分之前,執行基本的預處理步驟非常重要。使用髒亂和未清理的文本數據是一個潛在的災難性舉措。因此,在此步驟中,我們將從文本中刪除不影響問題目標的所有不需要的符號,字符等。

這是我們用於擴展縮略形式的字典:

1. contraction_mapping = {"ain't": "is not", "aren't": "are not","can't": "cannot", "'cause": "because", "could've": "could have", "couldn't": "could not", 
2.
3. "didn't": "did not", "doesn't": "does not", "don't": "do not", "hadn't": "had not", "hasn't": "has not", "haven't": "have not",
4.
5. "he'd": "he would","he'll": "he will", "he's": "he is", "how'd": "how did", "how'd'y": "how do you", "how'll": "how will", "how's": "how is",
6.
7. "I'd": "I would", "I'd've": "I would have", "I'll": "I will", "I'll've": "I will have","I'm": "I am", "I've": "I have", "i'd": "i would",
8.
9. "i'd've": "i would have", "i'll": "i will", "i'll've": "i will have","i'm": "i am", "i've": "i have", "isn't": "is not", "it'd": "it would",
10.
11. "it'd've": "it would have", "it'll": "it will", "it'll've": "it will have","it's": "it is", "let's": "let us", "ma'am": "madam",
12.
13. "mayn't": "may not", "might've": "might have","mightn't": "might not","mightn't've": "might not have", "must've": "must have",
14.
15. "mustn't": "must not", "mustn't've": "must not have", "needn't": "need not", "needn't've": "need not have","o'clock": "of the clock",
16.
17. "oughtn't": "ought not", "oughtn't've": "ought not have", "shan't": "shall not", "sha'n't": "shall not", "shan't've": "shall not have",
18.
19. "she'd": "she would", "she'd've": "she would have", "she'll": "she will", "she'll've": "she will have", "she's": "she is",
20.
21. "should've": "should have", "shouldn't": "should not", "shouldn't've": "should not have", "so've": "so have","so's": "so as",
22.
23. "this's": "this is","that'd": "that would", "that'd've": "that would have", "that's": "that is", "there'd": "there would",
24.
25. "there'd've": "there would have", "there's": "there is", "here's": "here is","they'd": "they would", "they'd've": "they would have",
26.
27. "they'll": "they will", "they'll've": "they will have", "they're": "they are", "they've": "they have", "to've": "to have",
28.
29. "wasn't": "was not", "we'd": "we would", "we'd've": "we would have", "we'll": "we will", "we'll've": "we will have", "we're": "we are",
30.
31. "we've": "we have", "weren't": "were not", "what'll": "what will", "what'll've": "what will have", "what're": "what are",
32.
33. "what's": "what is", "what've": "what have", "when's": "when is", "when've": "when have", "where'd": "where did", "where's": "where is",
34.
35. "where've": "where have", "who'll": "who will", "who'll've": "who will have", "who's": "who is", "who've": "who have",
36.
37. "why's": "why is", "why've": "why have", "will've": "will have", "won't": "will not", "won't've": "will not have",
38.
39. "would've": "would have", "wouldn't": "would not", "wouldn't've": "would not have", "y'all": "you all",
40.
41. "y'all'd": "you all would","y'all'd've": "you all would have","y'all're": "you all are","y'all've": "you all have",
42.
43. "you'd": "you would", "you'd've": "you would have", "you'll": "you will", "you'll've": "you will have",
44.
45. "you're": "you are", "you've": "you have"}

我們需要定義兩個不同的函數來預處理評論並生成摘要,因為文本和摘要中涉及的預處理步驟略有不同。

a)文字清理

讓我們看一下數據集中的前10個評論,以瞭解該如何進行文本預處理步驟:

1. data['Text'][:10] 

輸出:

Python利用深度學習進行文本摘要的綜合指南下篇(附教程)


我們將為我們的數據執行以下預處理任務:

  • 將所有內容轉換為小寫
  • 刪除HTML標籤
  • 縮略形式映射
  • 刪除('s)
  • 刪除括號內的任何文本()
  • 消除標點符號和特殊字符
  • 刪除停用詞
  • 刪除簡短的單詞

讓我們定義一下這個函數:

1. stop_words = set(stopwords.words('english')) 
2. def text_cleaner(text):
3. newString = text.lower()
4. newString = BeautifulSoup(newString, "lxml").text
5. newString = re.sub(r'\\([^)]*\\)', '', newString)
6. newString = re.sub('"','', newString)
7. newString = ' '.join([contraction_mapping[t] if t in contraction_mapping else t for t in newString.split(" ")])
8. newString = re.sub(r"'s\\b","",newString)
9. newString = re.sub("[^a-zA-Z]", " ", newString)
10. tokens = [w for w in newString.split() if not w in stop_words]
11. long_words=[]
12. for i in tokens:
13. if len(i)>=3: #removing short word
14. long_words.append(i)
15. return (" ".join(long_words)).strip()
16.
17. cleaned_text = []
18. for t in data['Text']:
19. cleaned_text.append(text_cleaner(t))

b)摘要清理

現在,我們將查看前10行評論,以瞭解摘要列的預處理步驟:

1. data['Summary'][:10] 

輸出:

Python利用深度學習進行文本摘要的綜合指南下篇(附教程)


定義此任務的函數:

1. def summary_cleaner(text): 
2. newString = re.sub('"','', text)
3. newString = ' '.join([contraction_mapping[t] if t in contraction_mapping else t for t in newString.split(" ")])
4. newString = re.sub(r"'s\\b","",newString)
5. newString = re.sub("[^a-zA-Z]", " ", newString)
6. newString = newString.lower()
7. tokens=newString.split()
8. newString=''
9. for i in tokens:
10. if len(i)>1:
11. newString=newString+i+' '
12. return newString
13.
14. #Call the above function
15. cleaned_summary = []
16. for t in data['Summary']:
17. cleaned_summary.append(summary_cleaner(t))
18.
19. data['cleaned_text']=cleaned_text
20. data['cleaned_summary']=cleaned_summary
21. data['cleaned_summary'].replace('', np.nan, inplace=True)
22. data.dropna(axis=0,inplace=True)

請記住在摘要的開頭和結尾添加START和END特殊標記:

1. data['cleaned_summary'] = data['cleaned_summary'].apply(lambda x : '_START_ '+ x + ' _END_') 

現在,我們來看看前5個評論及其摘要:

1. for i in range(5): 
2. print("Review:",data['cleaned_text'][i])
3. print("Summary:",data['cleaned_summary'][i])
4. print("\\n")

輸出:

Python利用深度學習進行文本摘要的綜合指南下篇(附教程)


瞭解序列的分佈

在這裡,我們將分析評論和摘要的長度,以全面瞭解文本長度的分佈。這將幫助我們確定序列的最大長度:

1. import matplotlib.pyplot as plt 
2. text_word_count = []
3. summary_word_count = []
4.
5. # populate the lists with sentence lengths
6. for i in data['cleaned_text']:
7. text_word_count.append(len(i.split()))
8.
9. for i in data['cleaned_summary']:
10. summary_word_count.append(len(i.split()))
11.
12. length_df = pd.DataFrame({'text':text_word_count, 'summary':summary_word_count})
13. length_df.hist(bins = 30)
14. plt.show()

輸出:

Python利用深度學習進行文本摘要的綜合指南下篇(附教程)


有趣。我們可以將評論的最大長度固定為80,因為這似乎是多數評論的長度。同樣,我們可以將最大摘要長度設置為10:

1. max_len_text=80 
2. max_len_summary=10

我們越來越接近模型的構建部分了。在此之前,我們需要將數據集拆分為訓練和驗證集。我們將使用90%的數據集作為訓練數據,並在其餘10%上評估(保留集)表現:

1. from sklearn.model_selection import train_test_split 
2. x_tr,x_val,y_tr,y_val=train_test_split(data['cleaned_text'],data['cleaned_summary'],test_size=0.1,random_state=0,shuffle=True)

準備分詞器(Tokenizer)

分詞器構建詞彙表並將單詞序列轉換為整數序列。繼續為文本和摘要構建分詞器:

  • a) 文本分詞器
b) #prepare a tokenizer for reviews on training data 
c) x_tokenizer = Tokenizer()
d) x_tokenizer.fit_on_texts(list(x_tr))
e)
f) #convert text sequences into integer sequences
g) x_tr = x_tokenizer.texts_to_sequences(x_tr)
h) x_val = x_tokenizer.texts_to_sequences(x_val)
i)
j) #padding zero upto maximum length
k) x_tr = pad_sequences(x_tr, maxlen=max_len_text, padding='post')
l) x_val = pad_sequences(x_val, maxlen=max_len_text, padding='post')
m)
n) x_voc_size = len(x_tokenizer.word_index) +1
  • b)摘要分詞器
1. #preparing a tokenizer for summary on training data 
2. y_tokenizer = Tokenizer()
3. y_tokenizer.fit_on_texts(list(y_tr))
4.
5. #convert summary sequences into integer sequences
6. y_tr = y_tokenizer.texts_to_sequences(y_tr)
7. y_val = y_tokenizer.texts_to_sequences(y_val)
8.
9. #padding zero upto maximum length
10. y_tr = pad_sequences(y_tr, maxlen=max_len_summary, padding='post')
11. y_val = pad_sequences(y_val, maxlen=max_len_summary, padding='post')
12.
13. y_voc_size = len(y_tokenizer.word_index) +1

模型構建

終於來到了模型構建的部分。但在構建之前,我們需要熟悉所需的一些術語。

  • Return Sequences = True:當return sequences參數設置為True時,LSTM為每個時間步生成隱藏狀態和單元狀態
  • Return State = True:當return state = True時,LSTM僅生成最後一個時間步的隱藏狀態和單元狀態
  • Initial State:用於在第一個時間步初始化LSTM的內部狀態
  • Stacked LSTM:Stacked LSTM具有多層LSTM堆疊在彼此之上。這能產生更好地序列表示。我鼓勵你嘗試將LSTM的多個層堆疊在一起(這是一個很好的學習方法)

在這裡,我們為編碼器構建一個3層堆疊LSTM:

1. from keras import backend as K 
2. K.clear_session()
3. latent_dim = 500
4.
5. # Encoder
6. encoder_inputs = Input(shape=(max_len_text,))
7. enc_emb = Embedding(x_voc_size, latent_dim,trainable=True)(encoder_inputs)
8.
9. #LSTM 1
10. encoder_lstm1 = LSTM(latent_dim,return_sequences=True,return_state=True)
11. encoder_output1, state_h1, state_c1 = encoder_lstm1(enc_emb)
12.
13. #LSTM 2
14. encoder_lstm2 = LSTM(latent_dim,return_sequences=True,return_state=True)
15. encoder_output2, state_h2, state_c2 = encoder_lstm2(encoder_output1)
16.
17. #LSTM 3
18. encoder_lstm3=LSTM(latent_dim, return_state=True, return_sequences=True)
19. encoder_outputs, state_h, state_c= encoder_lstm3(encoder_output2)
20.
21. # Set up the decoder.
22. decoder_inputs = Input(shape=(None,))
23. dec_emb_layer = Embedding(y_voc_size, latent_dim,trainable=True)
24. dec_emb = dec_emb_layer(decoder_inputs)
25.
26. #LSTM using encoder_states as initial state
27. decoder_lstm = LSTM(latent_dim, return_sequences=True, return_state=True)
28. decoder_outputs,decoder_fwd_state, decoder_back_state = decoder_lstm(dec_emb,initial_state=[state_h, state_c])
29.
30. #Attention Layer
31. Attention layer attn_layer = AttentionLayer(name='attention_layer')
32. attn_out, attn_states = attn_layer([encoder_outputs, decoder_outputs])
33.
34. # Concat attention output and decoder LSTM output
35. decoder_concat_input = Concatenate(axis=-1, name='concat_layer')([decoder_outputs, attn_out])
36.
37. #Dense layer
38. decoder_dense = TimeDistributed(Dense(y_voc_size, activation='softmax'))
39. decoder_outputs = decoder_dense(decoder_concat_input)
40.
41. # Define the model
42. model = Model([encoder_inputs, decoder_inputs], decoder_outputs)
43. model.summary()

輸出:

Python利用深度學習進行文本摘要的綜合指南下篇(附教程)


我使用sparse categorical cross-entropy作為損失函數,因為它在運行中將整數序列轉換為獨熱(one-hot)向量。這克服了任何內存問題。

1. model.compile(optimizer='rmsprop', loss='sparse_categorical_crossentropy')

還記得early stopping的概念嗎?它用於通過監視用戶指定的度量標準,在適當的時間停止訓練神經網絡。在這裡,我監視驗證集損失(val_loss)。一旦驗證集損失反彈,我們的模型就會停止訓練:

1. es = EarlyStopping(monitor='val_loss', mode='min', verbose=1)

我們將在批量大小為512的情況下訓練模型,並在保留集(我們數據集的10%)上驗證它:

1. history=model.fit([x_tr,y_tr[:,:-1]], y_tr.reshape(y_tr.shape[0],y_tr.shape[1], 1)[:,1:] ,epochs=50,callbacks=[es],batch_size=512, validation_data=([x_val,y_val[:,:-1]], y_val.reshape(y_val.shape[0],y_val.shape[1], 1)[:,1:]))

瞭解診斷圖

現在,我們將繪製一些診斷圖來了解模型隨時間的變化情況:

1. from matplotlib import pyplot 
2. pyplot.plot(history.history['loss'], label='train')
3. pyplot.plot(history.history['val_loss'], label='test')
4. pyplot.legend() pyplot.show()

輸出:

Python利用深度學習進行文本摘要的綜合指南下篇(附教程)


我們可以推斷,在第10個週期(epoch)之後,驗證集損失略有增加。因此,我們將在此之後停止訓練模型。

接下來,讓我們構建字典,將目標和源詞彙表中的索引轉換為單詞:

1. reverse_target_word_index=y_tokenizer.index_word 
2. reverse_source_word_index=x_tokenizer.index_word
3. target_word_index=y_tokenizer.word_index

推理

設置編碼器和解碼器的推理:

1. # encoder inference 
2. encoder_model = Model(inputs=encoder_inputs,outputs=[encoder_outputs, state_h, state_c])
3.
4. # decoder inference
5. # Below tensors will hold the states of the previous time step
6. decoder_state_input_h = Input(shape=(latent_dim,))
7. decoder_state_input_c = Input(shape=(latent_dim,))
8. decoder_hidden_state_input = Input(shape=(max_len_text,latent_dim))
9.
10. # Get the embeddings of the decoder sequence
11. dec_emb2= dec_emb_layer(decoder_inputs)
12.
13. # To predict the next word in the sequence, set the initial states to the states from the previous time step
14. decoder_outputs2, state_h2, state_c2 = decoder_lstm(dec_emb2, initial_state=[decoder_state_input_h, decoder_state_input_c])
15.
16. #attention inference
17. attn_out_inf, attn_states_inf = attn_layer([decoder_hidden_state_input, decoder_outputs2])
18. decoder_inf_concat = Concatenate(axis=-1, name='concat')([decoder_outputs2, attn_out_inf])
19.
20. # A dense softmax layer to generate prob dist. over the target vocabulary
21. decoder_outputs2 = decoder_dense(decoder_inf_concat)
22.
23. # Final decoder model
24. decoder_model = Model(
25. [decoder_inputs] + [decoder_hidden_state_input,decoder_state_input_h, decoder_state_input_c],
26. [decoder_outputs2] + [state_h2, state_c2])

下面我們定義了一個函數,是推理過程的實現(我們在上一節中介紹過):

1. def decode_sequence(input_seq): 
2. # Encode the input as state vectors.
3. e_out, e_h, e_c = encoder_model.predict(input_seq)
4.
5. # Generate empty target sequence of length 1.
6. target_seq = np.zeros((1,1))
7.
8. # Chose the 'start' word as the first word of the target sequence
9. target_seq[0, 0] = target_word_index['start']
10.
11. stop_condition = False
12. decoded_sentence = ''
13. while not stop_condition:
14. output_tokens, h, c = decoder_model.predict([target_seq] + [e_out, e_h, e_c])
15.
16. # Sample a token
17. sampled_token_index = np.argmax(output_tokens[0, -1, :])
18. sampled_token = reverse_target_word_index[sampled_token_index]
19.
20. if(sampled_token!='end'):
21. decoded_sentence += ' '+sampled_token
22.
23. # Exit condition: either hit max length or find stop word.
24. if (sampled_token == 'end' or len(decoded_sentence.split()) >= (max_len_summary-1)):
25. stop_condition = True
26.
27. # Update the target sequence (of length 1).
28. target_seq = np.zeros((1,1))
29. target_seq[0, 0] = sampled_token_index
30.
31. # Update internal states
32. e_h, e_c = h, c
33.
34. return decoded_sentence

我們來定義函數,用於將摘要和評論中的整數序列轉換為單詞序列:

 1. def seq2summary(input_seq): 
2. newString=''
3. for i in input_seq:
4. if((i!=0 and i!=target_word_index['start']) and i!=target_word_index['end']):
5. newString=newString+reverse_target_word_index[i]+' '
6. return newString
7.
8. def seq2text(input_seq):
9. newString=''
10. for i in input_seq:
11. if(i!=0):
12. newString=newString+reverse_source_word_index[i]+' '
13. return newString
1. for i in range(len(x_val)):
2. print("Review:",seq2text(x_val[i]))
3. print("Original summary:",seq2summary(y_val[i]))
4. print("Predicted summary:",decode_sequence(x_val[i].reshape(1,max_len_text)))
5. print("\\n")

以下是該模型生成的一些摘要:

Python利用深度學習進行文本摘要的綜合指南下篇(附教程)

Python利用深度學習進行文本摘要的綜合指南下篇(附教程)

Python利用深度學習進行文本摘要的綜合指南下篇(附教程)

Python利用深度學習進行文本摘要的綜合指南下篇(附教程)

Python利用深度學習進行文本摘要的綜合指南下篇(附教程)


This is really cool. Even though the summaries generated by our model do not match the reference summaries exactly, they convey the same meaning. Our model is able to generate a coherent summary based on the context of the text.

This is how we can perform text summarization using deep learning concepts in Python.

How can we improve the model's performance even further?

Your learning doesn't stop here! There is plenty more you can do to experiment with the model:

  • I recommend increasing the size of the training dataset and building the model on it. The generalization capability of a deep learning model improves as the training dataset grows
  • Try implementing a bidirectional LSTM, which captures context in both directions and produces a richer context vector (see the sketch after this list)
  • Use a beam search strategy to decode the test sequences instead of the greedy approach (argmax)
  • Evaluate the performance of your model based on the BLEU score
  • Implement a pointer-generator network and a coverage mechanism
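As a pointer for the second suggestion, here is a minimal sketch of what a bidirectional first encoder layer could look like in Keras. It is illustrative only, not the architecture trained above: latent_dim and max_len_text reuse the article's values, x_voc_size is a stand-in (in the notebook it comes from the tokenizer), and the decoder LSTM would then need 2 * latent_dim units to accept the concatenated states.

from tensorflow.keras.layers import Input, Embedding, LSTM, Bidirectional, Concatenate

# Toy values purely for illustration; in the notebook these come from the data and tokenizer
latent_dim, max_len_text, x_voc_size = 500, 80, 10000

encoder_inputs = Input(shape=(max_len_text,))
enc_emb = Embedding(x_voc_size, latent_dim, trainable=True)(encoder_inputs)

# A Bidirectional LSTM with return_state=True returns forward and backward states separately
encoder_bi = Bidirectional(LSTM(latent_dim, return_sequences=True, return_state=True))
encoder_outputs, fwd_h, fwd_c, back_h, back_c = encoder_bi(enc_emb)

# Concatenate the two directions so the decoder can still be initialised with a single state pair;
# the decoder LSTM would need 2 * latent_dim units to match these shapes
state_h = Concatenate()([fwd_h, back_h])
state_c = Concatenate()([fwd_c, back_c])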

8. How Does the Attention Mechanism Work?

Now let's talk about the inner workings of the attention mechanism. As I mentioned at the beginning of the article, this is a math-heavy section, so treat it as optional. I still strongly recommend reading through it to get a real grip on how attention works.

The encoder outputs a hidden state hj for every time step j in the source sequence.

Similarly, the decoder outputs a hidden state si for every time step i in the target sequence.

We compute a score known as the alignment score (eij), based on which the source words are aligned with the target words. The alignment score is computed from the source hidden state hj and the target hidden state si using a score function, given by:

eij = score(si, hj)

where eij denotes the alignment score for target time step i and source time step j.

Different types of attention mechanisms exist depending on the type of score function used. A few popular ones are listed below:

(Table in the original post listing popular attention mechanisms and their score functions.)
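For reference, here is a toy NumPy sketch of three score functions that are commonly used in the literature (dot-product and general from Luong et al., additive/concat from Bahdanau et al.). The matrices W, W1, W2 and the vector v are random stand-ins for what would normally be learned parameters:

import numpy as np

np.random.seed(0)
d = 8                                   # hidden size (toy value)
s_i = np.random.randn(d)                # decoder hidden state at target step i
h_j = np.random.randn(d)                # encoder hidden state at source step j
W = np.random.randn(d, d)               # "learned" matrices, random here
W1, W2 = np.random.randn(d, d), np.random.randn(d, d)
v = np.random.randn(d)

score_dot = s_i @ h_j                               # dot-product score (Luong)
score_general = s_i @ W @ h_j                       # general score (Luong)
score_additive = v @ np.tanh(W1 @ s_i + W2 @ h_j)   # additive/concat score (Bahdanau)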


We normalize the alignment scores using a softmax function to obtain the attention weights (aij):

aij = exp(eij) / Σk exp(eik)


We compute a linear sum of the products of the attention weights aij and the encoder hidden states hj to produce the attended context vector (Ci):

Ci = Σj aij * hj


The attended context vector and the decoder's target hidden state at time step i are concatenated to produce the attended hidden vector Si:

Si = concatenate([si; Ci])

The attended hidden vector Si is then fed into a dense layer to produce yi:

yi = dense(Si)
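These five steps translate almost line for line into NumPy. The sketch below is purely illustrative: it assumes a simple dot-product score and random toy states, not the third-party AttentionLayer used in the Keras model above:

import numpy as np

np.random.seed(1)
T_src, d, vocab = 4, 8, 10            # toy sizes: 4 source steps, hidden size 8, 10 target words
h = np.random.randn(T_src, d)         # encoder hidden states h1..h4
s_i = np.random.randn(d)              # decoder hidden state si at target step i

e = h @ s_i                           # alignment scores e_ij = score(s_i, h_j), dot-product score
a = np.exp(e) / np.exp(e).sum()       # attention weights a_ij (softmax)
C_i = a @ h                           # context vector C_i = sum_j a_ij * h_j
S_i = np.concatenate([s_i, C_i])      # attended hidden vector S_i = [s_i; C_i]

W_dense = np.random.randn(S_i.size, vocab)                  # stand-in for the dense layer
y_i = np.exp(S_i @ W_dense) / np.exp(S_i @ W_dense).sum()   # probability over the target words
print(y_i.shape)                      # (10,)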

Let's understand the above attention steps with the help of an example. Consider the source sequence to be [x1, x2, x3, x4] and the target sequence to be [y1, y2, y3].

  • The encoder reads the entire source sequence and outputs a hidden state for every time step, say h1, h2, h3, h4


(Figure in the original post: the encoder reading the source sequence and producing hidden states h1, h2, h3, h4.)


  • The decoder reads the entire target sequence offset by one time step and outputs a hidden state for every time step, say s1, s2, s3


(Figure in the original post: the decoder producing hidden states s1, s2, s3 for the target sequence.)


Target time step i = 1

  • The alignment scores e1j are computed from the source hidden states hj and the target hidden state s1 using the score function:
e11 = score(s1, h1)
e12 = score(s1, h2)
e13 = score(s1, h3)
e14 = score(s1, h4)
  • Normalizing the alignment scores e1j with softmax produces the attention weights a1j:
a11 = exp(e11) / (exp(e11) + exp(e12) + exp(e13) + exp(e14))
a12 = exp(e12) / (exp(e11) + exp(e12) + exp(e13) + exp(e14))
a13 = exp(e13) / (exp(e11) + exp(e12) + exp(e13) + exp(e14))
a14 = exp(e14) / (exp(e11) + exp(e12) + exp(e13) + exp(e14))

The attended context vector C1 is derived as the linear sum of the products of the encoder hidden states hj and the attention weights a1j:

C1 = h1 * a11 + h2 * a12 + h3 * a13 + h4 * a14

(Figure in the original post: computing the context vector C1 as the weighted sum of h1, h2, h3, h4.)


  • The attended context vector C1 and the target hidden state s1 are concatenated to produce the attended hidden vector S1

S1 = concatenate([s1; C1])

  • The attended hidden vector S1 is then fed to a fully connected (dense) layer to produce y1

y1 = dense(S1)
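As a quick numeric sanity check of the softmax and weighted-sum steps (with made-up scores and one-dimensional toy states, purely for illustration):

import numpy as np

e1 = np.array([2.0, 1.0, 0.0, -1.0])   # made-up alignment scores e11..e14
a1 = np.exp(e1) / np.exp(e1).sum()     # attention weights a11..a14
print(a1.round(3))                     # -> [0.644 0.237 0.087 0.032], they sum to 1

h = np.array([0.5, -0.3, 0.8, 0.1])    # toy scalar encoder states h1..h4
print(round(float(a1 @ h), 3))         # C1 = 0.324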

Target time step i = 2

  • The alignment scores e2j are computed from the source hidden states hj and the target hidden state s2 using the score function:
e21 = score(s2, h1)
e22 = score(s2, h2)
e23 = score(s2, h3)
e24 = score(s2, h4)
  • Normalizing the alignment scores e2j with softmax produces the attention weights a2j:
a21 = exp(e21) / (exp(e21) + exp(e22) + exp(e23) + exp(e24))
a22 = exp(e22) / (exp(e21) + exp(e22) + exp(e23) + exp(e24))
a23 = exp(e23) / (exp(e21) + exp(e22) + exp(e23) + exp(e24))
a24 = exp(e24) / (exp(e21) + exp(e22) + exp(e23) + exp(e24))
  • The attended context vector C2 is derived as the linear sum of the products of the encoder hidden states hj and the attention weights a2j:

C2 = h1 * a21 + h2 * a22 + h3 * a23 + h4 * a24

(Figure in the original post: computing the context vector C2 as the weighted sum of h1, h2, h3, h4.)


  • The attended context vector C2 and the target hidden state s2 are concatenated to produce the attended hidden vector S2

S2 = concatenate([s2; C2])

  • The attended hidden vector S2 is then fed to a fully connected (dense) layer to produce y2

y2 = dense(S2)

We can carry out similar steps for target time step i = 3 to produce y3.

I know this was a heavy dose of math and theory, but understanding it will help you grasp the underlying idea behind the attention mechanism. It has spawned many of the recent developments in NLP, and now it's your turn!

End Notes

Take a deep breath: we covered a lot of ground in this article. And congratulations on building your first text summarization model using deep learning! We have seen how to build our own text summarizer using Seq2Seq modeling in Python.

If you have any feedback on this article or any questions, please share them in the comments section below and I will get back to you as soon as I can. Make sure you experiment with the model we built here and share your results with the community!

You can also take the following courses to learn or brush up your NLP skills:

  • Natural Language Processing (NLP) using Python

https://courses.analyticsvidhya.com/courses/natural-language-processing-nlp?utm_source=blog&utm_medium=comprehensive-guide-text-summarization-using-deep-learning-python

  • Introduction to Natural Language Processing (NLP)

https://courses.analyticsvidhya.com/courses/Intro-to-NLP?utm_source=blog&utm_medium=comprehensive-guide-text-summarization-using-deep-learning-python

You can also read this article on Analytics Vidhya's Android app.

Original title:

Comprehensive Guide to Text Summarization using Deep Learning in Python

Original link:

https://www.analyticsvidhya.com/blog/2019/06/comprehensive-guide-text-summarization-using-deep-learning-python/


Editor: 王菁

Proofreader: 林亦霖

About the Translator


和中華 holds a master's degree in software engineering obtained in Germany. Out of an interest in machine learning, he chose for his master's thesis to improve the traditional k-means algorithm with ideas from genetic algorithms. He is currently working on big data projects in Hangzhou. He joined 數據派THU hoping to contribute his part to fellow IT practitioners and to meet many like-minded friends.

— End —

Follow the official WeChat account of the Tsinghua-Qingdao Data Science Institute, "THU數據派", and its sister account "數據派THU", for more lecture benefits and quality content.
