自然语言处理秘籍-全-

龙哥盟 / 2024-10-06 / 原文

自然语言处理秘籍（全）

原文：Natural Language Processing Recipes

协议：CC BY-NC-SA 4.0

一、提取数据

本章涵盖了文本数据的各种来源以及提取文本数据的方法。文本数据可以为企业提供信息或见解。涵盖以下食谱。

食谱 1。使用 API 收集文本数据
食谱 2。用 Python 读取 PDF 文件
食谱三。阅读 Word 文档
食谱 4。读取 JSON 对象
食谱 5。读取 HTML 页面和 HTML 解析
食谱 6。正则表达式
食谱 7。字符串处理
食谱 8。网页抓取

介绍

在进入这本书的细节之前，让我们看看普遍可用的数据来源。我们需要确定能够帮助解决数据科学用例的潜在数据源。

客户数据

对于任何问题陈述，来源之一是已经存在的数据。企业决定要将数据存储在哪里。数据存储取决于业务类型、数据量以及与数据源相关的成本。以下是一些例子。

SQL 数据库
分布式文件系统
云存储
平面文件

免费资源

互联网上有大量的免费数据。你只需要简化问题，并开始探索多个免费的数据源。

像 Twitter 这样的免费 API
维基百科(一个基于 wiki 技术的多语言的百科全书协作计划ˌ也是一部用不同语言写成的网络百科全书ˌ 其目标及宗旨是为全人类提供自由的百科全书)ˌ开放性的百科全书
政府数据(如 http://data.gov )
人口普查数据(如 www.census.gov/data.html )
医疗保健索赔数据(如 www.healthdata.gov )
数据科学社区网站(如 www.kaggle.com )
谷歌数据集搜索(如 https://datasetsearch.research.google.com )

网页抓取

使用 Python 中的网络抓取包，从网站、博客、论坛和零售网站中提取内容/数据，以便在获得相应来源的许可后进行审查。

还有很多其他来源，如新闻数据和经济数据，可以用来进行分析。

配方 1-1。收集数据

有很多免费的 API，你可以通过它们收集数据，并用它来解决问题。让我们讨论一下 Twitter API。

问题

您希望使用 Twitter APIs 收集文本数据。

解决办法

Twitter 拥有海量数据，其中蕴含着巨大的价值。社交媒体营销人员以此为生。每天都有大量的推文，每条推文都有故事要讲。当所有这些数据被收集和分析时，它给企业提供了关于他们的公司、产品、服务等等的巨大洞察力。

现在让我们看看如何提取数据，然后在接下来的章节中探讨如何利用它。

它是如何工作的

步骤 1-1。登录 Twitter 开发者门户

在 https://developer.twitter.com 登录 Twitter 开发者门户。

在 Twitter 开发者门户中创建自己的应用，并获得以下密钥。一旦有了这些凭证，就可以开始提取数据了。

消费者密钥:与应用程序(Twitter、脸书等)相关联的密钥。)
消费者秘密:用于向认证服务器(Twitter、脸书等)认证的密码。)
访问令牌:成功认证密钥后给予客户端的密钥
访问令牌密码:访问密钥的密码

步骤 1-2。在 Python 中执行查询

一旦所有的凭证都准备好了，就使用下面的代码来获取数据。

# Install tweepy
!pip install tweepy
# Import the libraries
import numpy as np
import tweepy
import json
import pandas as pd
from tweepy import OAuthHandler
# credentials

consumer_key = "adjbiejfaaoeh"
consumer_secret = "had73haf78af"
access_token = "jnsfby5u4yuawhafjeh"
access_token_secret = "jhdfgay768476r"
# calling API

auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
api = tweepy.API(auth)
# Provide the query you want to pull the data. For example, pulling data for the mobile phone ABC
query ="ABC"
# Fetching tweets

Tweets = api.search(query, count = 10,lang='en',exclude='retweets',tweet_mode='extended')

当搜索产品 ABC 时，该查询提取前十条推文。API 提取英语推文，因为给定的语言是'en'。它不包括转发。

配方 1-2。从 pdf 收集数据

您的大部分数据存储在 PDF 文件中。您需要从这些文件中提取文本，并存储它以供进一步分析。

问题

你想阅读 PDF 文件。

解决办法

阅读 PDF 文件最简单的方法是使用 PyPDF2 库。

它是如何工作的

按照本节中的步骤从 PDF 文件中提取数据。

步骤 2-1。安装并导入所有必需的库

这是第一行代码。

!pip install PyPDF2
import PyPDF2
from PyPDF2 import PdfFileReader

Note

您可以从网上下载任何 PDF 文件，并将其放在运行这个 Jupyter 笔记本或 Python 脚本的位置。

第 2-2 步。从 PDF 文件中提取文本

现在我们来摘抄课文。

#Creating a pdf file object
pdf = open("file.pdf","rb")
#creating pdf reader object
pdf_reader = PyPDF2.PdfFileReader(pdf)
#checking number of pages in a pdf file
print(pdf_reader.numPages)
#creating a page object
page = pdf_reader.getPage(0)
#finally extracting text from the page
print(page.extractText())
#closing the pdf file
pdf.close()

请注意，该功能不适用于扫描的 pdf。

配方 1-3。从 Word 文件中收集数据

接下来我们来看另一个用 Python 读取 Word 文件的小菜谱。

问题

你想读 Word 文件。

解决办法

最简单的方法是使用 docx 库。

它是如何工作的

按照本节中的步骤从 Word 文件中提取数据。

步骤 3-1。安装并导入所有必需的库

下面是安装和导入 docx 库的代码。

#Install docx
!pip install docx
#Import library
from docx import Document

Note

您可以从网上下载任何 Word 文件，并将其放在运行 Jupyter 笔记本或 Python 脚本的位置。

第 3-2 步。从 Word 文件中提取文本

现在让我们来读课文。

#Creating a word file object
doc = open("file.docx","rb")
#creating word reader object
document = docx.Document(doc)
#create an empty string and call this document. #This document variable stores each paragraph in the Word document.
#We then create a "for" loop that goes through each paragraph in the Word document and appends the paragraph.
docu=""
for para in document.paragraphs.
       docu += para.text
#to see the output call docu
print(docu)

配方 1-4。从 JSON 收集数据

JSON 是一种开放的标准文件格式，代表 JavaScript 对象符号。当数据从服务器发送到网页时经常使用。这个方法解释了如何读取 JSON 文件/对象。

问题

你想读取一个 JSON 文件/对象。

解决办法

最简单的方法是使用请求和 JSON 库。

它是如何工作的

按照本节中的步骤从 JSON 中提取数据。

步骤 4-1。安装并导入所有必需的库

下面是导入库的代码。

import requests
import json

第 4-2 步。从 JSON 文件中提取文本

现在我们来摘抄课文。

#extracting the text from "https://quotes.rest/qod.json"
r = requests.get("https://quotes.rest/qod.json")
res = r.json()
print(json.dumps(res, indent = 4))
#output
{
    "success": {
        "total": 1
    },
    "contents": {
        "quotes": [
            {
                "quote": "Where there is ruin, there is hope for a treasure.",
                "length": "50",
                "author": "Rumi",
                "tags": [
                    "failure",
                    "inspire",
                    "learning-from-failure"
                ],
                "category": "inspire",
                "date": "2018-09-29",
                "permalink": "https://theysaidso.com/quote/dPKsui4sQnQqgMnXHLKtfweF/rumi-where-there-is-ruin-there-is-hope-for-a-treasure",
                "title": "Inspiring Quote of the day",
                "background": "https://theysaidso.com/img/man_on_the_mountain.jpg",
                "id": "dPKsui4sQnQqgMnXHLKtfweF"
            }
        ],
        "copyright": "2017-19 theysaidso.com"
    }

}
#extract contents
q = res['contents']['quotes'][0]
q
#output
{'author': 'Rumi',
 'background': 'https://theysaidso.com/img/man_on_the_mountain.jpg',
 'category': 'inspire',
 'date': '2018-09-29',
 'id': 'dPKsui4sQnQqgMnXHLKtfweF',
 'length': '50',
 'permalink': 'https://theysaidso.com/quote/dPKsui4sQnQqgMnXHLKtfweF/rumi-where-there-is-ruin-there-is-hope-for-a-treasure',
 'quote': 'Where there is ruin, there is hope for a treasure.',
 'tags': ['failure', 'inspire', 'learning-from-failure'],
 'title': 'Inspiring Quote of the day'}
#extract only quote

print(q['quote'], '\n--', q['author'])
#output
It wasn't raining when Noah built the ark....
-- Howard Ruff

配方 1-5。从 HTML 收集数据

HTML 是超文本标记语言的简称。它构建网页并在浏览器中显示它们。有各种 HTML 标签来构建内容。这个食谱着眼于阅读 HTML 页面。

问题

你想读解析/读 HTML 页面。

解决办法

最简单的方法是使用 bs4 库。

它是如何工作的

按照本节中的步骤从 web 中提取数据。

步骤 5-1。安装并导入所有必需的库

首先，导入库。

!pip install bs4
import urllib.request as urllib2
from bs4 import BeautifulSoup

第 5-2 步。获取 HTML 文件

你可以选择任何你想提取的网站。让我们在这个例子中使用维基百科。

response = urllib2.urlopen('https://en.wikipedia.org/wiki/Natural_language_processing')
html_doc = response.read()

第 5-3 步。解析 HTML 文件

现在我们来获取数据。

#Parsing
soup = BeautifulSoup(html_doc, 'html.parser')
# Formating the parsed html file
strhtm = soup.prettify()
# Print few lines
print (strhtm[:1000])
#output
<!DOCTYPE html>
<html class="client-nojs" dir="ltr" lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   Natural language processing - Wikipedia
  </title>
  <script>
   document.documentElement.className = document.documentElement.className.replace( /(^|\s)client-nojs(\s|$)/, "$1client-js$2" );
  </script>
  <script>

   (window.RLQ=window.RLQ||[]).push(function(){mw.config.set({"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":false,"wgNamespaceNumber":0,"wgPageName":"Natural_language_processing","wgTitle":"Natural language processing","wgCurRevisionId":860741853,"wgRevisionId":860741853,"wgArticleId":21652,"wgIsArticle":true,"wgIsRedirect":false,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Webarchive template wayback links","All accuracy disputes","Articles with disputed statements from June 2018","Wikipedia articles with NDL identifiers","Natural language processing","Computational linguistics","Speech recognition","Computational fields of stud

第 5-4 步。提取标签值

您可以使用以下代码从标记的第一个实例中提取标记的值。

print(soup.title)
print(soup.title.string)
print(soup.a.string)
print(soup.b.string)
#output
 <title>Natural language processing - Wikipedia</title>
Natural language processing - Wikipedia
None
Natural language processing

第 5-5 步。提取特定标签的所有实例

这里我们得到了我们感兴趣的标签的所有实例。

for x in soup.find_all('a'): print(x.string)
#sample output
 None
Jump to navigation
Jump to search
Language processing in the brain
None
None
automated online assistant
customer service
[1]
computer science
artificial intelligence
natural language
speech recognition
natural language understanding
natural language generation

第 5-6 步。从特定标签中提取所有文本

最后，我们得到文本。

for x in soup.find_all('p'): print(x.text)
#sample output
Natural language processing (NLP) is an area of computer science and artificial intelligence concerned with the interactions between computers and human (natural) languages, in particular how to program computers to process and analyze large amounts of natural language data.
Challenges in natural language processing frequently involve speech recognition, natural language understanding, and natural language generation.

The history of natural language processing generally started in the 1950s, although work can be found from earlier periods.
In 1950, Alan Turing published an article titled "Intelligence" which proposed what is now called the Turing test as a criterion of intelligence.

注意，p标签提取了页面上的大部分文本。

配方 1-6。使用正则表达式解析文本

这个菜谱讨论了正则表达式在处理文本数据时是如何有用的。当处理来自 web 的包含 HTML 标签、长文本和重复文本的原始数据时，正则表达式是必需的。在开发应用程序的过程中，以及在输出中，您不需要这样的数据。

您可以使用正则表达式进行各种基本和高级的数据清理。

问题

您希望使用正则表达式来解析文本数据。

解决办法

最好的方法是使用 Python 中的re库。

它是如何工作的

让我们看看在我们的任务中使用正则表达式的一些方法。

基本标志是 I，L，M，S，U，x。

re.I忽略大小写。
re.L找一个当地的家属。
re.M在多行中查找模式。
re.S查找点匹配。
re.U适用于 Unicode 数据。
以更易读的格式编写正则表达式。

下面描述正则表达式的功能。

查找出现一次的字符 a 和 b: [ab]
查找除 a 和 b 以外的字符:[^ab]
求 a 到 z 的字符范围:[a-z]
查找除 a 到 z : [^a-z]以外的字符范围
查找从 A 到 Z 和 A 到 Z 的所有字符:[a-zA-Z]
查找任意单个字符:[]
查找任何空白字符:\s
查找任何非空白字符:\S
查找任意数字:\d
查找任何非数字:\D
查找任何非单词:\W
查找任何单词:\w
查找 a 或 b: (a|b)
的出现次数要么为零，要么为一
- 匹配零个或不超过一个事件: a? ; ?
- a 的出现次数为零或更多次: a* ; * matches zero or more than that
- a 出现一次或多次: a+ ; + matches occurrences one or more than one time
匹配三个同时出现的: a{3 }
匹配三个或更多同时出现的 a: a{3,}
匹配三到六个同时出现的: a{3,6}
字符串的开头:^
字符串结尾:$
匹配单词边界:\b
非单词边界:\B

re.match()和re.search()函数查找模式，然后根据应用程序的要求进行处理。

我们来看看re.match()和re.search()的区别。

re.match()仅在字符串开头检查匹配。因此，如果它在输入字符串的开头找到一个模式，它将返回匹配的模式；否则，它返回一个名词。
re.search()检查字符串中任何地方的匹配。它在给定的输入字符串或数据中查找该模式的所有匹配项。

现在让我们看几个使用这些正则表达式的例子。

符号化

记号化就是把一个句子拆分成单词。一种方法是使用re.split。

# Import library
import re
#run the split query
re.split('\s+','I like this book.')
['I', 'like', 'this', 'book.']

关于正则表达式的解释，请参考主食谱。

提取电子邮件 id

提取电子邮件 id 最简单的方法是使用re.findall.

阅读/创建文档或句子。
执行re.findall功能。

doc = "For more details please mail us at: xyz@abc.com, pqr@mno.com"

addresses = re.findall(r'[\w\.-]+@[\w\.-]+', doc)
for address in addresses.
    print(address)
#Output
xyz@abc.com
pqr@mno.com

替换电子邮件 id

让我们将句子或文档中的电子邮件 id 替换为其他电子邮件 id。最简单的方法是使用re.sub。

阅读/创建文档或句子。

执行re.sub功能。

new_email_address = re.sub(r'([\w\.-]+)@([\w\.-]+)', r'pqr@mno.com', doc)
print(new_email_address)
#Output
For more details please mail us at pqr@mno.com

doc = "For more details please mail us at xyz@abc.com"

关于正则表达式的解释，请参考配方 1-6。

如果您在使用 regex 处理电子邮件时观察到了这两种情况，那么我们已经实现了一个非常基本的实例。我们声明由@分隔的单词有助于捕获电子邮件 id。然而，可能有许多边缘情况；例如，点(.)包含域名并处理数字、+(加号)等，因为它们可以是电子邮件 ID 的一部分。

以下是提取/查找/替换电子邮件 id 的高级正则表达式。

([a-zA-Z0-9+._-]+@[a-zA-Z0-9._-]+\.[a-zA-Z0-9_-]+)

还有更复杂的方法来处理所有的边缘情况(例如，“电子邮件 id”中的“. co.in”)。请试一试。

从电子书中提取数据并执行正则表达式

让我们通过使用您到目前为止所学的技术来解决一个从电子书中提取数据的案例研究。

从书中摘录内容。

# Import library
import re
import requests
#url you want to extract
url = 'https://www.gutenberg.org/files/2638/2638-0.txt'
#function to extract
def get_book(url).
 # Sends a http request to get the text from project Gutenberg
 raw = requests.get(url).text
 # Discards the metadata from the beginning of the book
 start = re.search(r"\*\*\* START OF THIS PROJECT GUTENBERG EBOOK .* \*\*\*",raw ).end()
 # Discards the metadata from the end of the book
 stop = re.search(r"II", raw).start()
 # Keeps the relevant text
 text = raw[start:stop]
 return text
# processing

def preprocess(sentence).
 return re.sub('[^A-Za-z0-9.]+' , ' ', sentence).lower()
#calling the above function
book = get_book(url)
processed_book = preprocess(book)
print(processed_book)
# Output
 produced by martin adamson david widger with corrections by andrew sly the idiot by fyodor dostoyevsky translated by eva martin part i i. towards the end of november during a thaw at nine o clock one morning a train on the warsaw and petersburg railway was approaching the latter city at full speed. the morning was so damp and misty that it was only with great difficulty that the day succeeded in breaking and it was impossible to distinguish anything more than a few yards away from the carriage windows. some of the passengers by this particular train were returning from abroad but the third class carriages were the best filled chiefly with insignificant persons of various occupations and degrees

picked up at the different stations nearer town. all of them seemed weary and most of them had sleepy eyes and a shivering expression while their complexions generally appeared to have taken on the colour of the fog outside. when da

使用 regex 对此数据执行探索性数据分析。

# Count number of times "the" is appeared in the book
len(re.findall(r'the', processed_book))
#Output
302
#Replace "i" with "I"
processed_book = re.sub(r'\si\s', " I ", processed_book)
print(processed_book)
#output
 produced by martin adamson david widger

with corrections by andrew sly the idiot by fyodor dostoyevsky translated by eva martin part I i. towards the end of november during a thaw at nine o clock one morning a train on the warsaw and petersburg railway was approaching the latter city at full speed. the morning was so damp and misty that it was only with great difficulty that the day succeeded in breaking and it was impossible to distinguish anything more than a few yards away from the carriage windows. some of the passengers by this particular train were returning from abroad but the third class carriages were the best filled chiefly with insignificant persons of various occupations and degrees picked up at the different stations nearer town. all of them seemed weary and most of them had sleepy eyes and a shivering expression while their complexions generally appeared to have taken on the colour of the fog outside. when da

#find all occurance of text in the format "abc--xyz"
re.findall(r'[a-zA-Z0-9]*--[a-zA-Z0-9]*', book)
#output
 ['ironical--it',
 'malicious--smile',
 'fur--or',
 'astrachan--overcoat',
 'it--the',
 'Italy--was',
 'malady--a',
 'money--and',
 'little--to',
 'No--Mr',
 'is--where',
 'I--I',

 'I--',
 '--though',
 'crime--we',
 'or--judge',
 'gaiters--still',
 '--if',
 'through--well',
 'say--through',
 'however--and',
 'Epanchin--oh',
 'too--at',
 'was--and',
 'Andreevitch--that',
 'everyone--that',
 'reduce--or',
 'raise--to',
 'listen--and',
 'history--but',
 'individual--one',
 'yes--I',
 'but--',

 't--not',
 'me--then',
 'perhaps--',
 'Yes--those',
 'me--is',
 'servility--if',
 'Rogojin--hereditary',
 'citizen--who',
 'least--goodness',
 'memory--but',
 'latter--since',
 'Rogojin--hung',
 'him--I',
 'anything--she',
 'old--and',

 'you--scarecrow',
 'certainly--certainly',
 'father--I',
 'Barashkoff--I',
 'see--and',
 'everything--Lebedeff',
 'about--he',
 'now--I',
 'Lihachof--',
 'Zaleshoff--looking',
 'old--fifty',
 'so--and',
 'this--do',
 'day--not',
 'that--',

 'do--by',
 'know--my',
 'illness--I',
 'well--here',
 'fellow--you']

配方 1-7。处理字符串

这个菜谱讨论了如何处理字符串和文本数据。您可以使用字符串操作进行各种基本的文本探索。

问题

你想探索处理字符串。

解决办法

最简单的方法是使用下面的字符串功能。

s.find(t)是 s 中字符串 t 的第一个实例的索引(如果找不到，则为–1)
s.rfind(t)是 s 中字符串 t 的最后一个实例的索引(如果找不到，则为–1)
s.index(t)类似于s.find(t),除了它在未找到时引发 ValueError
s.rindex(t)类似于s.rfind(t),除了它在未找到时引发 ValueError
使用 s 作为粘合剂将文本的单词组合成一个字符串
在找到 t 的地方将 s 分割成一个列表(默认为空白)
将 s 拆分成一个字符串列表，每行一个
s.lower()是字符串 s 的小写版本
s.upper()是字符串 s 的大写版本
s.title()是字符串 s 的带标题的大小写版本
s.strip()是没有前导或尾随空白的 s 的副本
s.replace(t, u)用 s 中的 u 替换 t 的实例

它是如何工作的

现在让我们来看几个例子。

替换内容

创建一个字符串并替换内容。创建字符串很容易。这是通过用单引号或双引号将字符括起来实现的。而要替换，可以使用replace功能。

创建一个字符串。

String_v1 = "I am exploring NLP"
#To extract particular character or range of characters from string
print(String_v1[0])
#output
"I"
#To extract the word “exploring”
print(String_v1[5:14])
#output
exploring

将前面字符串中的"exploring"替换为"learning"。

String_v2 = String_v1.replace("exploring", "learning")
print(String_v2)
#Output
I am learning NLP

连接两个字符串

下面是简单的代码。

s1 = "nlp"
s2 = "machine learning"
s3 = s1+s2
print(s3)

#output
'nlpmachine learning'

在字符串中搜索子字符串

使用find函数获取整个字符串中子串的起始索引值。

var="I am learning NLP"
f= "learn"
var.find(f)
#output
5

配方 1-8。从网上抓取文本

这个食谱讨论了如何从网上抓取数据。

Caution

在抓取任何网站、博客或电子商务网站之前，请确保您阅读了该网站是否允许数据抓取的条款和条件。通常，robots.txt 包含条款和条件(例如，参见 www.alixpartners.com/robots.txt )，而站点地图包含 URL 的地图(例如，参见 www.alixpartners.com/sitemap.xml )。

Web 抓取也称为 web 采集和 web 数据提取。它是一种从网站中提取大量数据并保存在数据库或本地的技术。您可以使用这些数据来提取与您的客户、用户或产品相关的信息，从而使企业受益。

对 HTML 有基本的了解是先决条件。

问题

你想通过抓取从网上提取数据。让我们以 IMDB.com 为例来搜集顶级电影。

解决办法

最简单的方法是使用 Python 漂亮的 Soup 或 Scrapy 库。让我们在这个食谱中使用美丽的汤。

它是如何工作的

按照本节中的步骤从 web 中提取数据。

步骤 8-1。安装所有必需的库

!pip install bs4

!pip install requests

步骤 8-2。导入库

from bs4 import BeautifulSoup

import requests
import pandas as pd
from pandas import Series, DataFrame
from ipywidgets import FloatProgress
from time import sleep
from IPython.display import display
import re
import pickle

步骤 8-3。识别提取数据的 URL

url = 'http://www.imdb.com/chart/top?ref_=nv_mv_250_6'

步骤 8-4。使用 Beautiful Soup 请求 URL 并下载内容

result = requests.get(url)
c = result.content

soup = BeautifulSoup(c,"lxml")

步骤 8-5。理解网站的结构以提取所需的信息

转到网站，右键单击页面内容，检查网站的 HTML 结构。

确定要提取的数据和字段。例如，您想要电影名称和 IMDB 分级。

检查 HTML 中的哪个 div 或类包含电影名称，并相应地解析这个美丽的 Soup。在这个例子中，你可以通过<table class ="chart full-width">和<td class="titleColumn">解析这个汤来提取电影名。

同样，您可以获取其他数据；请参考步骤 8-6 中的代码。

img/475440_2_En_1_Figa_HTML.jpg

步骤 8-6。使用 Beautiful Soup 从 HTML 标签中提取和解析数据

summary = soup.find('div',{'class':'article'})
# Create empty lists to append the extracted data

.
moviename = []
cast = []
description = []
rating = []
ratingoutof = []
year = []
genre = []
movielength = []
rot_audscore = []
rot_avgrating = []
rot_users = []
# Extracting the required data from the html soup.

rgx = re.compile('[%s]' % '()')
f = FloatProgress(min=0, max=250)
display(f)
for row,i in zip(summary.find('table').findAll('tr'),range(len(summary.find('table').findAll('tr')))):

    for sitem in row.findAll('span',{'class':'secondaryInfo'}).
        s = sitem.find(text=True)
        year.append(rgx.sub(", s))
    for ritem in row.findAll('td',{'class':'ratingColumn imdbRating'}).
        for iget in ritem.findAll('strong').
            rating.append(iget.find(text=True))
            ratingoutof.append(iget.get('title').split(' ', 4)[3])
    for item in row.findAll('td',{'class':'titleColumn'}).
        for href in item.findAll('a',href=True).
            moviename.append(href.find(text=True))
            rurl = 'https://www.rottentomatoes.com/m/'+ href.find(text=True)
            try.
                rresult = requests.get(rurl)
            except requests.exceptions.ConnectionError.
                status_code = "Connection refused"
            rc = rresult.content
            rsoup = BeautifulSoup(rc)
            try:

                rot_audscore.append(rsoup.find('div',{'class':'meter-value'}).find('span',{'class':'superPageFontColor'}).text)
                rot_avgrating.append(rsoup.find('div',{'class':'audience-info hidden-xs superPageFontColor'}).find('div').contents[2].strip())
                rot_users.append(rsoup.find('div',{'class':'audience-info hidden-xs superPageFontColor'}).contents[3].contents[2].strip())
            except AttributeError.
                rot_audscore.append("")
                rot_avgrating.append("")
                rot_users.append("")
            cast.append(href.get('title'))
            imdb = "http://www.imdb.com" + href.get('href')
            try.
                iresult = requests.get(imdb)
                ic = iresult.content

                isoup = BeautifulSoup(ic)
                description.append(isoup.find('div',{'class':'summary_text'}).find(text=True).strip())
                genre.append(isoup.find('span',{'class':'itemprop'}).find(text=True))
                movielength.append(isoup.find('time',{'itemprop':'duration'}).find(text=True).strip())

            except requests.exceptions.ConnectionError.
                description.append("")
                genre.append("")
                movielength.append("")
    sleep(.1)
    f.value = i

请注意，由于以下原因，您很有可能在执行该脚本时遇到错误。

您对 URL 的请求失败。如果是这样，请过一段时间再试一次。这在网络抓取中很常见。
网页是动态的，这意味着 HTML 标签不断变化。研究一下标签，并根据 HTML 对代码做一些小的修改，你就可以开始了。

步骤 8-7。将列表转换为数据框架，并执行符合业务要求的分析

# List to pandas series

moviename = Series(moviename)
cast = Series(cast)
description = Series(description)
rating = Series(rating)
ratingoutof = Series(ratingoutof)
year = Series(year)
genre = Series(genre)
movielength = Series(movielength)
rot_audscore = Series(rot_audscore)
rot_avgrating = Series(rot_avgrating)
rot_users = Series(rot_users)
# creating dataframe and doing analysis
imdb_df = pd.concat([moviename,year,description,genre,movielength,cast,rating,ratingoutof,rot_audscore,rot_avgrating,rot_users],axis=1)
imdb_df.columns = ['moviename','year','description','genre','movielength','cast','imdb_rating','imdb_ratingbasedon','tomatoes_audscore','tomatoes_rating','tomatoes_ratingbasedon']
imdb_df['rank'] = imdb_df.index + 1
imdb_df.head(1)
#output

img/475440_2_En_1_Figb_HTML.jpg

第八步。下载数据框

# Saving the file as CSV.
imdb_df.to_csv("imdbdataexport.csv")

本章实现了从数据源中提取文本数据的大部分技术。在接下来的章节中，您将了解如何探索、处理和清理数据。您还将学习功能工程和构建 NLP 应用程序。

二、探索和处理文本数据

本章讨论预处理文本数据和探索性数据分析的各种方法和技术。它涵盖了以下食谱。

食谱 1。用小写字体书写
食谱 2。删除标点符号
食谱三。停止单词删除
食谱 4。文本标准化
食谱 5。拼写纠正
食谱 6。标记化
食谱 7。堵塞物
食谱 8。词汇化
食谱 9。探索性数据分析
食谱 10。处理表情符号和表情符号
食谱 11。端到端处理流水线

在直接进入食谱之前，让我们先了解一下预处理文本数据的必要性。众所周知，世界上大约 90%的数据是非结构化的，可能以图像、文本、音频和视频的形式存在。文本可以有多种形式，从单个单词的列表到句子，再到包含特殊字符的多个段落(如 tweets 和其他标点符号)。它也可能以 web、HTML、文档等形式出现。而且这些数据从来都不是干净的，包含了大量的噪音。需要对其进行处理，然后执行一些预处理功能，以确保为特征工程和模型构建提供正确的输入数据。如果不对数据进行预处理，任何建立在这些数据之上的算法都不会给企业带来任何价值。这让我们想起了数据科学中非常流行的一句话:“垃圾进，垃圾出。”

预处理包括将原始文本数据转换成可理解的格式。真实世界的数据通常是不完整的、不一致的、充满了大量噪声，并且很可能包含许多错误。预处理是解决这类问题的一种行之有效的方法。数据预处理为进一步处理准备原始文本数据。

食谱 2-1。将文本数据转换为小写

这个菜谱讨论了如何将文本数据小写，以使所有数据具有统一的格式，并确保“nlp”和“NLP”被同等对待。

问题

你想要小写的文本数据。

解决办法

最简单的方法是使用 Python 中默认的lower()函数。

lower()方法将字符串中的所有大写字符转换成小写字符并返回它们。

它是如何工作的

按照本节中的步骤对给定的文本或文档进行小写。这里用的是 Python。

步骤 1-1。读取/创建文本数据

让我们创建一个字符串列表，并将其赋给一个变量。

text=['This is introduction to NLP','It is likely to be useful, to people ','Machine learning is the new electrcity','There would be less hype around AI and more action going forward','python is the best tool!','R is good langauage','I like this book','I want more books like this']
#convert list to data frame
import pandas as pd
df = pd.DataFrame({'tweet':text})
print(df)
#output
                                               tweet
0                        This is introduction to NLP
1               It is likely to be useful, to people
2             Machine learning is the new electrcity
3  There would be less hype around AI and more ac...
4                           python is the best tool!
5                                R is good langauage
6                                   I like this book
7                        I want more books like this

步骤 1-2。对文本数据执行 lower()函数

当只有一个字符串时，直接应用lower()函数如下。

x = 'Testing'
x2 = x.lower()
print(x2)
#output

'testing'

当您想要对数据框执行小写操作时，请按如下方式使用应用函数。

df['tweet'] = df['tweet'].apply(lambda x: " ".join(x.lower() for x in x.split()))
df['tweet']
#output
0                          this is introduction to nlp
1                 it is likely to be useful, to people
2               machine learning is the new electrcity
3    there would be less hype around ai and more ac...
4                             python is the best tool!
5                                  r is good langauage
6                                     i like this book
7                          i want more books like this

或者，您可以使用以下代码:

df['tweet'] = df['tweet'].str.lower()

仅此而已。整个 tweet 列被转换成小写。让我们看看在接下来的食谱中你还能做些什么。

食谱 2-2。删除标点符号

这个食谱讨论了如何从文本数据中删除标点符号。这一步非常重要，因为标点符号不会增加任何额外的信息或价值。因此，删除所有这样的实例减少了数据的大小并提高了计算效率。

问题

您希望从文本数据中删除标点符号。

解决办法

最简单的方法是使用 Python 中的 regex 和replace()函数。

它是如何工作的

按照本节中的步骤删除文本数据中的标点符号。

步骤 2-1。读取/创建文本数据

让我们创建一个字符串列表，并将其赋给一个变量。

text=['This is introduction to NLP','It is likely to be useful, to people ','Machine learning is the new electrcity','There would be less hype around AI and more action going forward','python is the best tool!','R is good langauage','I like this book','I want more books like this']
#convert list to dataframe
import pandas as pd
df = pd.DataFrame({'tweet':text})
print(df)
#output
 tweet
0 This is introduction to NLP
1 It is likely to be useful, to people
2 Machine learning is the new electrcity
3 There would be less hype around AI and more ac...
4 python is the best tool!
5 R is good langauage
6 I like this book
7 I want more books like this

第 2-2 步。对文本数据执行 replace()函数

使用 regex 和replace()函数，您可以删除标点符号，如下所示。

import re
s = "I. like. This book!"
s1 = re.sub(r'[^\w\s]',",s)
s1
#output

'I like This book'

或者:

df['tweet'] = df['tweet'].str.replace('[^\w\s]',")
df['tweet']
#output
0                          this is introduction to nlp
1                  it is likely to be useful to people
2               machine learning is the new electrcity
3    there would be less hype around ai and more ac...
4                              python is the best tool
5                                  r is good langauage
6                                     i like this book
7                          i want more books like this

或者:

import string
s = "I. like. This book!"
for c in string.punctuation:
      s= s.replace(c,"")
s
#output

'I like This book'

食谱 2-3。删除停用词

这个食谱讨论了如何删除停用词。停用词是非常常见的词，与其他关键字相比没有意义或意义较小。如果去掉不常用的词，就可以把重点放在重要的关键词上。例如，在一个搜索引擎中，如果您的搜索查询是“如何使用 Python 开发聊天机器人”，如果搜索引擎试图查找包含单词 how 、 to 、 develop 、 chatbot 、 using 和 python 的网页，搜索引擎找到的包含 how 和 to 的网页要比包含关于开发聊天机器人的信息的网页多得多因此，如果你删除这些词，搜索引擎可以专注于检索包含关键字开发、聊天机器人和 python 的页面，这将更接近真正感兴趣的页面。同样，你也可以去掉其他常用词和生僻字。

问题

你想删除停用词。

解决办法

最简单的方法是使用 NLTK 库或构建自己的停用词文件。

它是如何工作的

按照本节中的步骤从文本数据中删除停用词。

步骤 3-1。读取/创建文本数据

让我们创建一个字符串列表，并将其赋给一个变量。

text=['This is introduction to NLP','It is likely to be useful, to people ','Machine learning is the new electrcity','There would be less hype around AI and more action going forward','python is the best tool!','R is good langauage','I like this book','I want more books like this']
#convert list to data frame
import pandas as pd
df = pd.DataFrame({'tweet':text})
print(df)
#output
 tweet
0 This is introduction to NLP
1 It is likely to be useful, to people
2 Machine learning is the new electrcity
3 There would be less hype around AI and more ac...
4 python is the best tool!
5 R is good langauage

6 I like this book
7 I want more books like this

第 3-2 步。从文本数据中删除标点符号

使用 NLTK 库，您可以删除标点符号，如下所示。

#install and import libraries
!pip install nltk
import nltk
nltk.download()
from nltk.corpus import stopwords
#remove stop words
stop = stopwords.words('english')
df['tweet'] = df['tweet'].apply(lambda x: " ".join(x for x in x.split() if x not in stop))
df['tweet']
#output
0                                  introduction nlp
1                              likely useful people
2                   machine learning new electrcity
3    would less hype around ai action going forward
4                                  python best tool
5                                  r good langauage

6                                         like book
7                                   want books like

现在没有停止词了。这一步什么都去掉了。

食谱 2-4。标准化文本

这个食谱讨论了如何标准化文本。但在此之前，我们先讨论一下文本标准化以及为什么需要它。大多数文本数据存在于客户评论、博客或推文中，在这些地方，人们很有可能使用短词和缩写来表示相同的意思。这可以帮助下游过程容易地理解和解决文本的语义。

问题

你想要标准化文本。

解决办法

您可以编写自己的自定义词典来查找短单词和缩写。首先，您需要查看数据中是否存在任何短词和缩写。这可以通过查看我们文档中单词的频率分布或在创建自定义词典之前通过单词云可视化来实现。

它是如何工作的

按照本节中的步骤执行文本标准化。

步骤 4-1。创建自定义查找字典

该词典用于根据您的数据进行文本标准化。

lookup_dict = {'nlp':'natural language processing', 'ur':'your', "wbu" : "what about you"}
import re

第 4-2 步。为文本标准化创建自定义函数

代码如下:

def text_std(input_text):
 words = input_text.split()
 new_words = []
 for word in words:
     word = re.sub(r'[^\w\s]',",word)
     if word.lower() in lookup_dict:
         word = lookup_dict[word.lower()]
         new_words.append(word)
         new_text = " ".join(new_words)
 return new_text

第 4-3 步。运行文本 _ 标准函数

输出也需要检查。

text_std("I like nlp it's ur choice")
#output

'natural language processing your'
Here, nlp has standardized to 'natural language processing' and ur to 'your'.

食谱 2-5。纠正拼写

这个食谱讨论了如何进行拼写纠正。但在此之前，我们先来看看为什么这个拼写纠正很重要。大多数文本数据存在于客户评论、博客或推文中，在这些地方，人们很有可能使用短词并犯打字错误。这样就减少了代表同一个意思的单词的多个副本。例如,“processing”和“processing”被视为不同的词，即使在相同的意义上使用。

请注意，缩写应该在这一步之前处理，否则校正器有时会失败。比如说，“ur”(实际上是“你的”)会被更正为“or”

问题

你想做拼写纠正。

解决办法

最简单的方法是使用 TextBlob 库。

它是如何工作的

按照本节中的步骤进行拼写更正。

步骤 5-1。读取/创建文本数据

让我们创建一个字符串列表，并将其赋给一个变量。

text=['Introduction to NLP','It is likely to be useful, to people ','Machine learning is the new electrcity', 'R is good langauage','I like this book','I want more books like this']
#convert list to dataframe
import pandas as pd
df = pd.DataFrame({'tweet':text})
print(df)
#output
                                    tweet
0                     Introduction to NLP
1    It is likely to be useful, to people
2  Machine learning is the new electrcity
3                     R is good langauage
4                        I like this book
5             I want more books like this

第 5-2 步。对文本数据执行拼写纠正

使用 TextBlob，您可以进行拼写，如下所示。

#Install textblob library
!pip install textblob

#import libraries and use 'correct' function
from textblob import TextBlob
df['tweet'].apply(lambda x: str(TextBlob(x).correct()))
#output
0                        Introduction to NLP
1       It is likely to be useful, to people
2    Machine learning is the new electricity
3                         R is good language
4                           I like this book
5                I want more books like this

请注意，electricity的拼写和语言已被更正。

#You can also use autocorrect library as shown below
#install autocorrect
!pip install autocorrect
from autocorrect import spell
print(spell(u'mussage'))
print(spell(u'sirvice'))

#output
'message'
'service'

配方 2-6。标记文本

这个食谱着眼于标记化的方法。标记化是指将文本分割成最小的有意义的单元。有一个句子分词器和一个单词分词器。你在这个食谱中看到一个单词分词器。对于任何类型的分析，这都是文本预处理中必不可少的步骤。有许多库可以执行标记化，如 NLTK、spaCy 和 TextBlob。这里有几种方法可以实现。

问题

你想做记号化。

解决办法

最简单的方法是使用 TextBlob 库。

它是如何工作的

按照本节中的步骤执行标记化。

步骤 6-1。读取/创建文本数据

让我们创建一个字符串列表，并将其赋给一个变量。

text=['This is introduction to NLP','It is likely to be useful, to people ','Machine learning is the new electrcity','There would be less hype around AI and more action going forward','python is the best tool!','R is good langauage','I like this book','I want more books like this']
#convert list to dataframe
import pandas as pd
df = pd.DataFrame({'tweet':text})
print(df)
#output
 tweet
0 This is introduction to NLP

1 It is likely to be useful, to people
2 Machine learning is the new electrcity
3 There would be less hype around AI and more ac...
4 python is the best tool!
5 R is good langauage
6 I like this book
7 I want more books like this

第 6-2 步。将文本数据标记化

标记化的结果是一个标记列表。

#Using textblob
from textblob import TextBlob
TextBlob(df['tweet'][3]).words

#output
WordList(['would', 'less', 'hype', 'around', 'ai', 'action', 'going', 'forward'])
#using NLTK
import nltk
#create data
mystring = "My favorite animal is cat"
nltk.word_tokenize(mystring)
#output

['My', 'favorite', 'animal', 'is', 'cat']
#using split function from python
mystring.split()
#output
['My', 'favorite', 'animal', 'is', 'cat']

食谱 2-7。堵塞物

这个食谱讨论词干。词干提取是提取词根的过程。比如鱼、鱼、鱼被词干化为鱼。

问题

你想做词干分析。

解决办法

最简单的方法是使用 NLTK 或 TextBlob 库。

它是如何工作的

按照本节中的步骤执行词干分析。

步骤 7-1。读取文本数据

让我们创建一个字符串列表，并将其赋给一个变量。

text=['I like fishing','I eat fish','There are many fishes in pound']
#convert list to dataframe
import pandas as pd
df = pd.DataFrame({'tweet':text})
print(df)
#output
                            tweet
0                  I like fishing
1                      I eat fish
2  There are many fishes in pound

第 7-2 步。阻止文本

对文本数据执行以下代码。

#Import library

from nltk.stem import PorterStemmer
st = PorterStemmer()
df['tweet'][:5].apply(lambda x: " ".join([st.stem(word) for word in x.split()]))
#output
0                     I like fish
1                      I eat fish
2    there are mani fish in pound

注意，fish、fishing 和 fish 都被词干化为 fish。

食谱 2-8。词汇化

这个食谱讨论了词汇化，即通过考虑词汇表提取词根的过程。比如好的、更好的，或者最好的被词条化为好的。

词的词性是在词条化中确定的。它返回单词的字典形式，该形式必须有效。而词干提取只是提取词根。

简化处理匹配车到车以及匹配车到车。
炮泥手柄将车与车匹配。

引理化可以得到更好的结果。

叶的词干形式为叶。
叶的词干形式是叶。
叶的词汇化形式为叶。
叶的词汇化形式为叶。

问题

你想进行词汇化。

解决办法

最简单的方法是使用 NLTK 或 TextBlob 库。

它是如何工作的

按照本节中的步骤执行术语化。

步骤 8-1。读取文本数据

让我们创建一个字符串列表，并将其赋给一个变量。

text=['I like fishing','I eat fish','There are many fishes in pound', 'leaves and leaf']
#convert list to dataframe
import pandas as pd
df = pd.DataFrame({'tweet':text})
print(df)
                                tweet
0                      I like fishing
1                          I eat fish
2  There are multiple fishes in pound
3                     leaves and leaf

步骤 8-2。将数据符号化

对文本数据执行以下代码。

#Import library
from textblob import Word
#Code for lemmatize
df['tweet'] = df['tweet'].apply(lambda x: " ".join([Word(word).lemmatize() for word in x.split()]))
df['tweet']
#output

0                      I like fishing
1                          I eat fish
2    There are multiple fish in pound
3                       leaf and leaf

你可以观察到，鱼和鱼都被旅鼠化为鱼，叶子和叶子都被旅鼠化为叶子。

食谱 2-9。浏览文本数据

到目前为止，您应该已经熟悉了数据收集和文本预处理。让我们进行一些探索性的数据分析。

问题

你想要探索和理解文本数据。

解决办法

最简单的方法是使用 NLTK 或 TextBlob 库。

它是如何工作的

遵循此过程中的步骤。

步骤 9-1。读取文本数据

如果您还没有下载数据集，请执行以下代码下载数据集。

nltk.download().
#Importing data
import nltk
from nltk.corpus import webtext
nltk.download('webtext')
wt_sentences = webtext.sents('firefox.txt')
wt_words = webtext.words('firefox.txt')

步骤 9-2。导入必要的库

用于计算频率的导入库:

from nltk.probability import FreqDist
from nltk.corpus import stopwords
import string

步骤 9-3 检查数据中的字数

数数单词的数量:

len(wt_sentences)
#output
1142
len(wt_words)
#output
102457

步骤 9-4。计算评论中所有单词的频率

为所有单词生成频率:

frequency_dist = nltk.FreqDist(wt_words)
frequency_dist
#showing only top few results
FreqDist({'slowing': 1,
          'warnings': 6,
          'rule': 1,
          'Top': 2,
          'XBL': 12,
          'installation': 44,
          'Networking': 1,
          'inccorrect': 1,
          'killed': 3,
          ']"': 1,
          'LOCKS': 1,
          'limited': 2,
          'cookies': 57,
          'method': 12,
          'arbitrary': 2,
          'b': 3,
          'titlebar': 6,
sorted_frequency_dist =sorted(frequency_dist,key=frequency_dist.__getitem__, reverse=True)
sorted_frequency_dist

['.',
 'in',
 'to',
 '"',
 'the',
 "'",
 'not',
 '-',
 'when',
 'on',
 'a',
 'is',
 't',

 'and',
 'of',

第 9-5 步。考虑长度大于 3 的单词并绘图

我们只考虑频率大于三的词。

large_words = dict([(k,v) for k,v in frequency_dist.items() if len(k)>3])
frequency_dist = nltk.FreqDist(large_words)
frequency_dist.plot(50,cumulative=False)
#output

img/475440_2_En_2_Figa_HTML.jpg

第 9-6 步。构建单词云

单词云是最频繁重复的单词的图形表示。

#install library

!pip install wordcloud
#build wordcloud
from wordcloud import WordCloud
wcloud = WordCloud().generate_from_frequencies(frequency_dist)
#plotting the wordcloud
import matplotlib.pyplot as plt
plt.imshow(wcloud, interpolation="bilinear")
plt.axis("off")
(-0.5, 399.5, 199.5, -0.5)
plt.show()
#output

img/475440_2_En_2_Figb_HTML.jpg

接下来，让我们删除停用词，然后建立词云。输出应该类似于下一个菜谱中显示的内容。

img/475440_2_En_2_Figc_HTML.jpg

配方 2-10。处理表情符号和表情符号

什么是表情符号？表情符号这个词本质上是“形象人物”的意思(来源于日语 e 意为“形象:莫吉意为“字母人物”)。什么是表情？😃:-]表情符号代表一个人的面部表情，只使用键盘字符，如字母、数字和标点符号。

在今天的网络世界中，表情符号和表情符号是主要的语言，当我们需要快速准确地与全球任何人交流时，它们可以让我们与他们交流。表情符号和表情符号在文本分析中都起着重要的作用。它们最常用于社交媒体、电子邮件和短信，尽管它们可以在任何类型的电子通信中找到。在情感没有用的情况下，你可能需要把它们从你的文本分析中去除。另一方面，您应该保留它们，因为它们提供了有价值的信息，尤其是在情感分析中，删除它们可能不是一个合适的解决方案；例如，如果一家公司想知道人们对社交媒体上的新产品、新活动或品牌的感受。

表情符号通过想象用户的情绪、态度和观点，帮助识别消费者参与度需要提高的地方。这提供了重要的信息，对于任何企业来说，更好地理解客户的感受都是至关重要的。表情符号和表情符号数据的收集和分析为公司提供了有用的信息。它们可以转换成 word 格式，用于建模过程。在这本书里，你看到了如何用 Python 保存 word 格式的表情符号和表情符号。

这本书用了一个叫 emot 的库。

问题

你想用一个相关的、有意义的词来代替表情符号。

解决办法

最简单的方法就是使用 emot 库。

它是如何工作的

遵循以下步骤。

步骤 10-A1。读取文本数据

img/475440_2_En_2_Figd_HTML.png

步骤 10-A2。安装并导入必要的库

#Installing emot library

!pip install emot

#Importing libraries
import re
from emot.emo_unicode import UNICODE_EMO, EMOTICONS

步骤 10-A3。写一个把表情符号转换成单词的函数

# Function for converting emojis into word

def converting_emojis(text):
    for emot in UNICODE_EMO:
        text = text.replace(emot, "_".join(UNICODE_EMO[emot].replace(",","").replace(":","").split()))
    return text

步骤 10-A4。将带有表情符号的文本传递给函数

converting_emojis(text1) 

#output
What are you saying face_with_tears_of_joy. I am the boss smiling_face_with_sunglasses, and why are you so unamused_face

问题

你想移除表情符号

解决办法

最简单的方法是使用 Unicode。

它是如何工作的

遵循以下步骤。

步骤 10-B1。读取文本数据

img/475440_2_En_2_Fige_HTML.png

第十步-B2。安装并导入必要的库

#Importing libraries
import re

from emot.emo_unicode import UNICODE_EMO, EMOTICONS

第十步-B3。写一个移除表情符号的函数

def emoji_removal(string)

:
    emoji_unicodes = re.compile("["
                               u"\U0001F600-\U0001F64F"  # emoticons
                               u"\U0001F300-\U0001F5FF"  # symbols
                               u"\U0001F680-\U0001F6FF"  # transport
                               u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                               u"\U00002500-\U00002BEF"
                               u"\U00002702-\U000027B0"
                               u"\U000024C2-\U0001F251"
                               u"\U0001f926-\U0001f937"
                               u"\U00010000-\U0010ffff"
                               u"\u2640-\u2642"
                               u"\u2600-\u2B55"
                               u"\u200d"
                               u"\u23cf"
                               u"\u23e9"
                               u"\u231a"
                               u"\ufe0f"
                               u"\u3030"
                               "]+", flags=re.UNICODE)
    return emoji_unicodes.sub(r'', string)

步骤 10-B4。将带有表情符号的文本传递给函数

emoji_removal(text1)

#output

What are you saying. I am the boss, and why are you so emoji_removal

问题

你想用相关的，有意义的单词替换表情符号。

解决办法

最简单的方法就是使用 emot 库。

它是如何工作的

遵循以下步骤。

第十步-C1。读取文本数据

#create sample text data with emoticons

text2 = "Hey, how are you :-) how was your day, what are you doing?:-)"

第十步-C2。安装并导入必要的库

#Installing emot library

!pip install emot
#Importing libraries
import re
from emot.emo_unicode import UNICODE_EMO, EMOTICONS

第十步-C3。编写将表情符号转换为文字的功能

# Function to convert emoticons into word

def converting_emoticons(text):
    for emot in EMOTICONS:
        text = re.sub(u'('+emot+')', "_".join(EMOTICONS[emot].replace(",","").split()), text)
    return text

第十步-C4。将带有表情符号的文本传递给函数

converting_emoticons(text2)

#output

Hey, how are you Happy_face_smiley how was your day, what are you doing? Happy_face_smiley"

问题

您想删除表情图标吗

解决办法

最简单的方法是使用 Unicode。移除表情符号和表情符号的代码保持不变；但是，您需要添加各自的 Unicode。

它是如何工作的

遵循以下步骤。

步骤 10-D1 读取文本数据

#create sample text data with emoticons

text2 = "Hey, how are you :-) how was your day, what are you doing?:-)"

第十步-D2。安装并导入必要的库

#Importing libraries
import re
from emot.emo_unicode import UNICODE_EMO, EMOTICONS

步骤 10-D3。写函数去掉表情

def removing_emoticons(string)

:
    emot_unicodes = re.compile("["
                               u"\U0001F600-\U0001F64F"  # emoticons
                               u"\U0001F300-\U0001F5FF"  # symbols
                               u"\U0001F680-\U0001F6FF"  # transport
                               u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                               u"\U00002500-\U00002BEF"
                               u"\U00002702-\U000027B0"
                               u"\U000024C2-\U0001F251"
                               u"\U0001f926-\U0001f937"
                               u"\U00010000-\U0010ffff"
                               u"\u2640-\u2642"
                               u"\u2600-\u2B55"
                               u"\u200d"
                               u"\u23cf"
                               u"\u23e9"
                               u"\u231a"
                               u"\ufe0f"
                               u"\u3030"
                               "]+", flags=re.UNICODE)

    return emot_unicodes.sub(r'', string)

第十步-D4。将带有表情符号的文本传递给函数

removing_emoticons(text2)

#output

Hey, how are you how was your day, what are you doing?"

问题

找到检测表情符号并确定其含义的库。

解决办法

你可以使用 demoji 库。

它是如何工作的

遵循以下步骤。

第十步-E1。读取文本数据

img/475440_2_En_2_Figf_HTML.png

步骤 10-E2。安装并导入必要的库

#Installing & Importing libraries
!pip install demoji
import demoji
demoji.download_codes()

第十步-E3。找到所有表情符号并确定它们的含义

img/475440_2_En_2_Figg_HTML.png

食谱 2-11。构建文本预处理管道

到目前为止，您已经完成了大部分的文本操作和处理技术和方法。让我们在这个食谱中做一些有趣的事情。

问题

您希望构建一个端到端的文本预处理管道。每当您想要对任何 NLP 应用程序进行预处理时，您可以直接将数据插入这个管道函数，并获得所需的干净文本数据作为输出。

解决办法

最简单的方法是使用您到目前为止学到的所有技术创建一个自定义函数。

它是如何工作的

它的工作原理是将所有可能的处理技术放入一个包装器函数中，并通过它传递数据。

步骤 11-1。读取/创建文本数据

让我们创建一个字符串列表，并将其赋给一个变量——也许是一条样本 tweet。

tweet_sample= "How to take control of your #debt https://personal.vanguard.com/us/insights/saving-investing/debt-management.#Best advice for #family #financial #success (@PrepareToWin)"

你也可以使用你在第一章中提取的 Twitter 数据。

步骤 11-2。处理文本

执行下面的函数来处理 tweet。

def processRow(row):
    import re

    import nltk
    from textblob import TextBlob
    from nltk.corpus import stopwords
    from nltk.stem import PorterStemmer
    from textblob import Word
    from nltk.util import ngrams
    import re
    from wordcloud import WordCloud, STOPWORDS
    from nltk.tokenize import word_tokenize
    tweet = row
    #Lower case
    tweet.lower()
    #Removes unicode strings like "\u002c" and "x96"
    tweet = re.sub(r'(\\u[0-9A-Fa-f]+)', "r", tweet)
    tweet = re.sub(r'[^\x00-\x7f]', "r",tweet)
    #convert any url to URL
    tweet = re.sub('((www\.[^\s]+)|(https?://[^\s]+))','URL',tweet)
    #Convert any @Username to "AT_USER"
    tweet = re.sub('@[^\s]+','AT_USER',tweet)
    #Remove additional white spaces
    tweet = re.sub('[\s]+', ' ', tweet)
    tweet = re.sub('[\n]+', ' ', tweet)
    #Remove not alphanumeric symbols white spaces
    tweet = re.sub(r'[^\w]', ' ', tweet)
    #Removes hastag in front of a word """
    tweet = re.sub(r'#([^\s]+)', r'\1', tweet)
    #Replace #word with word
    tweet = re.sub(r'#([^\s]+)', r'\1', tweet)
    #Remove :( or :)

    tweet = tweet.replace(':)',")
    tweet = tweet.replace(':(',")
    #remove numbers
    tweet = ".join([i for i in tweet if not i.isdigit()])
    #remove multiple exclamation
    tweet = re.sub(r"(\!)\1+", ' ', tweet)
    #remove multiple question marks
    tweet = re.sub(r"(\?)\1+", ' ', tweet)
    #remove multistop

    tweet = re.sub(r"(\.)\1+", ' ', tweet)
    #lemma
    from textblob import Word
    tweet =" ".join([Word(word).lemmatize() for word in tweet.split()])
    #stemmer

    #st = PorterStemmer()
    #tweet=" ".join([st.stem(word) for word in tweet.split()])
    #Removes emoticons from text
    tweet = re.sub(':\)|;\)|:-\)|\(-:|:-D|=D|:P|xD|X-p|\^\^|:-*|\^\.\^|\^\-\^|\^\_\^|\,-\)|\)-:|:\'\(|:\(|:-\(|:\S|T\.T|\.\_\.|:<|:-\S|:-<|\*\-\*|:O|=O|=\-O|O\.o|XO|O\_O|:-\@|=/|:/|X\-\(|>\.<|>=\(|D:', ", tweet)
    #trim
    tweet = tweet.strip('\'"')
    row = tweet
    return row
#call the function with your data
processRow(tweet_sample)

#output
'How to take control of your debt URL Best advice for family financial success AT_USER'

到目前为止，您已经学习了如何读取文本数据，然后处理和清理它。下一章着眼于将文本转换成构建 NLP 应用程序的有意义的特征。

三、将文本转换为特征

本章涵盖了从基础到高级特征工程(文本到特征)的方法。到本章结束时，你会对下面的食谱感到舒适。

食谱 1。一键编码
食谱 2。计数矢量器
食谱三。n-grams
食谱 4。共生矩阵
食谱 5。哈希矢量化
食谱 6。术语频率-逆文档频率(TF-IDF)
食谱 7。单词嵌入
食谱 8。实现快速文本
食谱 9。使用最先进的嵌入技术将文本转换为特征

既然已经讨论了所有的文本预处理步骤，让我们来探索特征工程，它是自然语言处理的基础。如你所知，机器或算法无法理解字符、单词或句子。他们只能接受数字作为输入，包括二进制。但是文本数据的固有性质是非结构化和嘈杂的，这使得它无法与机器进行交互。

将原始文本转换成机器可理解的格式(数字)的过程称为特征工程。机器学习和深度学习算法的性能和准确性从根本上取决于特征工程技术。

本章讨论不同的特征工程方法和技术；它们的功能、优点和缺点；和示例来帮助您认识到特征工程的重要性。

配方 3-1。使用一键编码将文本转换为特征

一键编码是特征工程中使用的传统方法。任何了解机器学习基础的人都遇到过 one-hot 编码。它是将分类变量转换为特征或列，并为该特定类别编码 1 或 0 的过程。这里使用相同的逻辑，特征的数量是语料库中存在的总标记的数量。

问题

您希望使用一键编码将文本转换为特征。

解决办法

一键编码将字符或单词转换成二进制数，如下所示。

| |

我

爱

自然语言处理

存在

将来的

|
| --- | --- | --- | --- | --- | --- |
| 我爱 NLP | one | one | one | Zero | Zero |
| NLP 是未来 | Zero | Zero | one | one | one |

它是如何工作的

有许多函数可以生成一键编码特征。我们拿一个函数来深入讨论一下。

步骤 1-1。将文本存储在变量中

下面显示了一行。

Text = "I am learning NLP"

步骤 1-2。对文本数据执行函数

以下是来自pandas库中的一个将文本转换成特征的函数。

# Importing the library
import pandas as pd
# Generating the features
pd.get_dummies(Text.split())
Result :
   I  NLP  am  learning
0  1    0   0         0
1  0    0   1         0
2  0    0   0         1
3  0    1   0         0

输出有四个特征，因为输入中出现的不同单词的数量是 4。

食谱 3-2。使用计数矢量器将文本转换为特征

配方 3-1 中使用的方法有一个缺点。它不考虑单词的频率。如果某个特定的单词出现多次，那么如果不将其包括在分析中，就有可能丢失信息。一个计数矢量器解决了这个问题。这个食谱介绍了另一种将文本转换为特征的方法:计数矢量器。

问题

如何使用计数矢量器将文本转换为特征？

解决办法

计数矢量器类似于一键编码，但它不是检查特定的单词是否存在，而是对文档中存在的单词进行计数。

在下面的示例中，单词 I 和 NLP 在第一个文档中出现了两次。

| |

我

爱

自然语言处理

存在

将来的

将

学习

在…里

2 个月

|
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 我喜欢自然语言处理，我将在两个月后学习自然语言处理 | Two | one | Two | Zero | Zero | one | one | one | one |
| NLP 是未来 | Zero | Zero | one | one | one | Zero | Zero | Zero | Zero |

它是如何工作的

sklearn 具有从文本中提取特征的特征提取功能。让我们看看如何执行它。下面从 sklearn 导入CountVectorizer函数。

#importing the function
from sklearn.feature_extraction.text import CountVectorizer
# Text
text = ["I love NLP and I will learn NLP in 2month "]
# create the transform

vectorizer = CountVectorizer()
# tokenizing
vectorizer.fit(text)
# encode document
vector = vectorizer.transform(text)
# summarize & generating output
print(vectorizer.vocabulary_)
print(vector.toarray())
Result:

{'love': 4, 'nlp': 5, 'and': 1, 'will': 6, 'learn': 3, 'in': 2, '2month': 0}
[[1 1 1 1 1 2 1]]

第五个标记 nlp 在文档中出现了两次。

食谱 3-3。生成 n 元语法

在前面的方法中，每个单词都被视为一个特征。这种方法有一个缺点。它不考虑前面的单词和后面的单词来判断它是否给出正确和完整的意思。例如，考虑短语“不错”。如果它被分割成单个的单词，它就失去了传达好的意思，这就是这个短语的意思。

正如你所看到的，你可能会失去潜在的信息或见解，因为许多单词一旦放在一起就有意义了。n-grams 可以解决这个问题。

n -grams 是多个字母或多个单词的融合。它们以这样的方式形成，甚至上一个和下一个单词也被捕获。

单字是一个句子中唯一的单词。
二元结构是两个词的组合。
三元模型是三个词的组合。诸如此类。

比如看句子，“我在学 NLP。”

单字:“我”，“我”，“学习”，“自然语言处理”
二元模型:“我是”，“正在学习”，“学习 NLP”
三元模型:“我正在学习”，“我正在学习 NLP”

问题

为给定的句子生成 n 元语法。

解决办法

有很多生成 n 元语法的包。TextBlob 是最常用的。

它是如何工作的

按照本节中的步骤操作。

步骤 3-1。使用 TextBlob 生成 n 元语法

让我们看看如何使用 TextBlob 生成 n-grams。

Text = "I am learning NLP"

使用下面的TextBlob函数创建 n 元语法。使用已定义的文本，并根据要求陈述 n。

#Import textblob
from textblob import TextBlob
#For unigram : Use n = 1
TextBlob(Text).ngrams(1)

这是输出。

[WordList(['I']), WordList(['am']), WordList(['learning']), WordList(['NLP'])]
#For Bigram : For bigrams, use n = 2
TextBlob(Text).ngrams(2)
[WordList(['I', 'am']),
 WordList(['am', 'learning']),
 WordList(['learning', 'NLP'])]

一个实例中有三个包含两个单词的列表。

第 3-2 步。为文档生成基于二元模型的特征

就像上一个方法一样，计数矢量器生成特征。使用相同的函数，让我们生成 bigram 特性，看看输出是什么样子。

#importing the function
from sklearn.feature_extraction.text import CountVectorizer
# Text
text = ["I love NLP and I will learn NLP in 2month "]
# create the transform
vectorizer = CountVectorizer(ngram_range=(2,2))
# tokenizing
vectorizer.fit(text)
# encode document
vector = vectorizer.transform(text)
# summarize & generating output
print(vectorizer.vocabulary_)
print(vector.toarray())

这就是结果。

{'love nlp': 3, 'nlp and': 4, 'and will': 0, 'will learn': 6, 'learn nlp': 2, 'nlp in': 5, 'in 2month': 1}
[[1 1 1 1 1 1 1]]

输出中的特征包含二元模型；在本例中，所有令牌的计数都是 1。你同样可以使用三元模型。

配方 3-4。生成共生矩阵

让我们讨论一种称为共生矩阵的特征工程方法。

问题

你想要理解并生成一个共现矩阵。

解决办法

共生矩阵类似于计数矢量器；它计算一组单词的出现次数，而不是单个单词的出现次数。

它是如何工作的

让我们看看如何使用 NLTK , bigrams,和一些基本的 Python 编码技巧来生成这种矩阵。

步骤 4-1。导入必要的库

这是代码。

import numpy as np

import nltk
from nltk import bigrams
import itertools

第 4-2 步。为共生矩阵创建函数

下面是co_occurrence_matrix函数。

def co_occurrence_matrix(corpus):
    vocab = set(corpus)
    vocab = list(vocab)
    vocab_to_index = { word:i for i, word in enumerate(vocab) }
    # Create bigrams from all words in corpus
    bi_grams = list(bigrams(corpus))
    # Frequency distribution of bigrams ((word1, word2), num_occurrences)
    bigram_freq = nltk.FreqDist(bi_grams).most_common(len(bi_grams))
    # Initialise co-occurrence matrix
    # co_occurrence_matrix[current][previous]
    co_occurrence_matrix = np.zeros((len(vocab), len(vocab)))
    # Loop through the bigrams taking the current and previous word,
    # and the number of occurrences of the bigram.
    for bigram in bigram_freq:
        current = bigram[0][1]
        previous = bigram[0][0]
        count = bigram[1]

        pos_current = vocab_to_index[current]
        pos_previous = vocab_to_index[previous]
        co_occurrence_matrix[pos_current][pos_previous] = count
    co_occurrence_matrix = np.matrix(co_occurrence_matrix)
    # return the matrix and the index
    return co_occurrence_matrix,vocab_to_index

第 4-3 步。生成共生矩阵

下面是测试用的句子。

sentences = [['I', 'love', 'nlp'],
                   ['I', 'love','to' 'learn'],
                   ['nlp', 'is', 'future'],
                   ['nlp', 'is', 'cool']]
# create one list using many lists
merged = list(itertools.chain.from_iterable(sentences))
matrix = co_occurrence_matrix(merged)
# generate the matrix
CoMatrixFinal = pd.DataFrame(matrix[0], index=vocab_to_index, columns=vocab_to_index)
print(CoMatrixFinal)

           I   is  love  future  tolearn  cool  nlp
I        0.0  0.0   0.0     0.0      0.0   0.0  1.0
is       0.0  0.0   0.0     0.0      0.0   0.0  2.0
love     2.0  0.0   0.0     0.0      0.0   0.0  0.0
future   0.0  1.0   0.0     0.0      0.0   0.0  0.0
tolearn  0.0  0.0   1.0     0.0      0.0   0.0  0.0
cool     0.0  1.0   0.0     0.0      0.0   0.0  0.0
nlp      0.0  0.0   1.0     1.0      1.0   0.0  0.0

我，爱，是， nlp 一起出现了两次，其他几个词只出现了一次。

配方 3-5。哈希矢量化

计数矢量器和共生矩阵都有一个限制:词汇表会变得非常大，导致内存/计算问题。

一个散列矢量器是解决这个问题的一种方法。

问题

你想要理解并生成一个散列矢量器。

解决办法

哈希矢量器是内存高效的，矢量器不是将令牌存储为字符串，而是应用哈希技巧将它们编码为数字索引。缺点是它是单向的，一旦矢量化，就无法检索特征。

它是如何工作的

让我们看一个使用sklearn.的例子

步骤 5-1。导入必要的库并创建文档

这是代码。

from sklearn.feature_extraction.text import HashingVectorizer
# list of text documents
text = ["The quick brown fox jumped over the lazy dog."]

第 5-2 步。生成哈希矢量矩阵

让我们创建一个矢量大小为 10 的散列矢量矩阵(HashingVectorizer)。

# transform
vectorizer = HashingVectorizer(n_features=10)
# create the hashing vector
vector = vectorizer.transform(text)
# summarize the vector
print(vector.shape)
print(vector.toarray())
(1, 10)
[[ 0\.           0.57735027  0\.       0\.       0\.      0\.     0.
  -0.57735027  -0.57735027  0\.       ]]

它创建了一个大小为 10 的向量，现在它可以用于任何有人监督/无人监督的任务。

配方 3-6。使用 TF-IDF 将文本转换为特征

上述文本到特征的方法有一些缺点，因此引入了 TF-IDF。以下是一些缺点。

假设一个特定的单词出现在所有的语料库文档中。它在我们以前的方法中获得了更高的重要性，但这可能与你的情况无关。
TF-IDF 反映了一个单词对集合中的一个文档有多重要，因此对频繁出现在所有文档中的单词进行了规范化。

问题

您希望使用 TF-IDF 将文本转换为特征。

解决办法

词频(TF) 是一个句子中出现的特定单词的计数与同一个句子中单词总数的比值。TF 捕捉单词的重要性，而不考虑文档的长度。例如，在 10 个单词的句子中，频率为 3 的单词与句子的单词长度为 100 个单词时不同。它应该在第一个场景中更重要，这就是 TF 所做的。 TF (t) =(术语 t 在文档中出现的次数)/(文档中的总术语数)。

逆文档频率(IDF) 是总行数与出现单词的特定文档中的行数之比的对数。IDF = log(N/n)，其中 N 是总行数，N 是单词所在的行数。

IDF 衡量一个术语的稀有程度。像 a 和 the 这样的词出现在所有的语料库文档中，但是稀有词并不在所有的文档中。因此，如果一个单词出现在几乎所有的文档中，这个单词就没有用了，因为它对分类或信息检索没有帮助。以色列国防军解决了这个问题。

TF-IDF 是 TF 和 IDF 的简单产品，解决了两个缺点，使预测和信息检索相关。

TF-IDF = TF * IDF

它是如何工作的

按照本节中的步骤操作。

步骤 6-1。读取文本数据

以下是大家熟悉的一句话。

Text = ["The quick brown fox jumped over the lazy dog.",
"The dog.",
"The fox"]

第 6-2 步。创建特征

对文本数据执行以下代码。

#Import TfidfVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
#Create the transform
vectorizer = TfidfVectorizer()
#Tokenize and build vocab
vectorizer.fit(Text)
#Summarize
print(vectorizer.vocabulary_)
print(vectorizer.idf_)

这就是结果。

Text = ["The quick brown fox jumped over the lazy dog.",
"The dog.",
"The fox"]
{'the': 7, 'quick': 6, 'brown': 0, 'fox': 2, 'jumped': 3, 'over': 5, 'lazy': 4, 'dog': 1}
[ 1.69314718  1.28768207  1.28768207  1.69314718   1.69314718  1.69314718  1.69314718  1\.   ]

注意，在所有三个文档中都出现了，因此它没有增加多少价值。向量值为 1，小于所有其他标记。

到目前为止，您看到的所有方法或技术都是基于频率的。它们被称为基于频率的嵌入或特征。下一个方法着眼于基于预测的嵌入，通常称为单词嵌入。

食谱 3-7。实现单词嵌入

这个方法假设你对神经网络的工作原理和神经网络中的权值更新机制有一定的了解。如果你是神经网络的新手，我们建议你浏览第 6 章，对神经网络如何工作有一个基本的了解。

即使所有以前的方法解决了大多数问题，一旦你进入更复杂的问题，你想要捕捉单词之间的语义关系(上下文)，这些方法就不能执行了。

下面解释了到目前为止讨论的方法所面临的挑战。

这些技术没有抓住单词的上下文和含义。它们取决于单词的出现或频率。您需要知道如何捕捉上下文或语义关系。
1. 我正在吃一个苹果。
2. 我正在用一个苹果手机。

在例子中，苹果与不同(邻近)的相邻词吃和用连用，意思就不一样了。

对于像文档分类(图书馆的图书分类)这样的问题，一个文档很庞大，会产生很多令牌。在这些场景中，您的特性数量可能会失去控制(其中)，从而妨碍准确性和性能。

一个机器/算法可以匹配两个文档/文本，并说出它们是否相同。当你搜索 Dhoni 女士时，我们如何让机器谈论板球或 Virat Kohli？你怎么让机器明白“苹果是好吃的水果”中的苹果这个词是可以吃的水果而不是公司？

这些问题的答案在于为单词创建一种表示法，以捕捉它们的含义、语义关系以及它们在不同类型的上下文中的用法。

单词嵌入解决了这些挑战。单词嵌入是一种特征学习技术，其中词汇被映射到实数向量，捕捉上下文层次。

在下表中，每个单词都由四个数字表示，称为矢量。使用单词嵌入技术，我们导出了每个单词的那些向量，以便在将来的分析和构建应用中使用它们。在本例中，维度是 4，但您通常使用大于 100 的维度。

话

向量

|
| --- | --- |
| 文本 | Zero point three six | Zero point three six | -0.43 | Zero point three six |
| 想法 | -0.56 | -0.56 | Zero point seven two | -0.56 |
| 单词 | Zero point three five | -0.43 | Zero point one two | Zero point seven two |
| 编码 | Zero point one nine | Zero point one nine | Zero point one nine | Zero point four three |
| 文件 | -0.43 | Zero point one nine | -0.43 | Zero point four three |
| 克 | Zero point seven two | -0.43 | Zero point seven two | Zero point one two |
| 过程 | Zero point four three | Zero point seven two | Zero point four three | Zero point four three |
| 特征 | Zero point one two | Zero point four five | Zero point one two | Zero point eight seven |

问题

你想实现单词嵌入。

解决办法

单词嵌入是基于预测的，它们使用浅层神经网络来训练模型，该模型导致学习权重并将它们用作向量表示。

word2vec 是深度学习 Google 框架，用于训练单词嵌入。它使用整个语料库的所有单词并预测附近的单词。它为语料库中存在的所有单词创建一个向量，以便捕获上下文。在单词相似性和单词类比方面，它也优于任何其他方法。

word2vec 主要有两种。

船克
连续单词包(CBOW)

img/475440_2_En_3_Figa_HTML.png

它是如何工作的

上图显示了用于构建单词嵌入的 CBOW 和 skip-gram 算法的架构。让我们看看这些模型是如何工作的。

船克

跳格模型 ¹ 在给定一个或多个单词的上下文的情况下，预测该单词的概率。

我们来取一个小句子，了解一下它的工作原理。每个句子生成一个目标单词和上下文，即附近的单词。目标变量周围要考虑的单词数称为窗口大小。下表显示了窗口大小为 2 的所有可能的目标和上下文变量。窗口大小需要根据数据和您可支配的资源来选择。窗口越大，计算能力越高。

Text = "我喜欢 NLP，我将在两个月后学习 NLP "

| |

目标字

语境

|
| --- | --- | --- |
| 我爱 NLP | 我 | 爱你的，NLP |
| 我喜欢 NLP 和 | 爱 | 爱情，自然语言处理，和 |
| 我喜欢 NLP，我会学习的 | 自然语言处理 | 我，爱，还有，我 |
| … | … | … |
| 两个月后 | 月 | 在，2 |

因为它需要大量的文本和计算能力，所以让我们使用样本数据来构建一个跳格模型。

导入文本语料库并将其分解成句子。执行一些清理和预处理，如删除标点符号和数字，并将句子拆分为单词或标记。

#Example sentences
sentences = [['I', 'love', 'nlp'],
                  ['I', 'will', 'learn', 'nlp', 'in', '2','months'],
                  ['nlp', 'is', 'future'],
                  ['nlp', 'saves', 'time', 'and', 'solves', 'lot', 'of', 'industry', 'problems'],
                  ['nlp', 'uses', 'machine', 'learning']]
#import library
!pip install gensim
import gensim
from gensim.models import Word2Vec
from sklearn.decomposition import PCA
from matplotlib import pyplot
# training the model
skipgram = Word2Vec(sentences, size =50, window = 3, min_count=1,sg = 1)
print(skipgram)
# access vector for one word
print(skipgram['nlp'])
[ 0.00552227 -0.00723104  0.00857073  0.00368054 -0.00071274  0.00837146
  0.00179965 -0.0049786  -0.00448666 -0.00182289  0.00857488 -0.00499459
  0.00188365 -0.0093498   0.00174774 -0.00609793 -0.00533857 -0.007905
 -0.00176814 -0.00024082 -0.00181886 -0.00093836 -0.00382601 -0.00986026

  0.00312014 -0.00821249  0.00787507 -0.00864689 -0.00686584 -0.00370761
  0.0056183   0.00859488 -0.00163146  0.00928791  0.00904601  0.00443816
 -0.00192308  0.00941    -0.00202355 -0.00756564 -0.00105471  0.00170084
  0.00606918 -0.00848301 -0.00543473  0.00747958  0.0003408   0.00512787
 -0.00909613  0.00683905]

因为我们的向量大小参数是 50，所以模型为每个单词提供了一个大小为 50 的向量。

# access vector for another one word
print(skipgram['deep'])
KeyError: "word 'deep' not in vocabulary"

我们得到一个错误，说这个单词不存在，因为这个单词不在我们的输入训练数据中。我们需要在尽可能大的数据集上训练算法，这样我们就不会漏掉单词。

还有一种方法可以解决这个问题。阅读食谱 3-6 寻找答案。

# save model
skipgram.save('skipgram.bin')
# load model
skipgram = Word2Vec.load('skipgram.bin')

t-SNE 图是评估单词嵌入的方法之一。让我们生成它，看看它是什么样子。

# T – SNE plot
X = skipgram[skipgram.wv.vocab]
pca = PCA(n_components=2)
result = pca.fit_transform(X)
# create a scatter plot of the projection
pyplot.scatter(result[:, 0], result[:, 1])
words = list(skipgram.wv.vocab)
for i, word in enumerate(words):
       pyplot.annotate(word, xy=(result[i, 0], result[i, 1]))
pyplot.show()

img/475440_2_En_3_Figb_HTML.jpg

连续单词包(CBOW)

现在让我们看看如何构建 CBOW 模型。

#import library
from gensim.models import Word2Vec
from sklearn.decomposition import PCA
from matplotlib import pyplot
#Example sentences
sentences = [['I', 'love', 'nlp'],
                   ['I', 'will', 'learn', 'nlp', 'in', '2','months'],
                   ['nlp', 'is', 'future'],
                   ['nlp', 'saves', 'time', 'and', 'solves', 'lot', 'of', 'industry', 'problems'],
                   ['nlp', 'uses', 'machine', 'learning']]
# training the model
cbow = Word2Vec(sentences, size =50, window = 3, min_count=1,sg = 1)
print(cbow)
# access vector for one word
print(cbow['nlp'])
# save model

cbow.save('cbow.bin')
# load model
cbow = Word2Vec.load('cbow.bin')
# T – SNE plot
X = cbow[cbow.wv.vocab]
pca = PCA(n_components=2)
result = pca.fit_transform(X)
# create a scatter plot of the projection

pyplot.scatter(result[:, 0], result[:, 1])
words = list(cbow.wv.vocab)
for i, word in enumerate(words):
       pyplot.annotate(word, xy=(result[i, 0], result[i, 1]))
pyplot.show()

img/475440_2_En_3_Figc_HTML.jpg

训练这些模型需要巨大的计算能力。我们用 Google 的预训练模型，已经训练了 1000 多亿字。

从 https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit 下载模型并保存在您的本地存储器中。

https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit

导入gensim包，按照步骤学习 Google 的 word2vec。

# import gensim package
import gensim
# load the saved model
model = gensim.models.Word2Vec.load_word2vec_format('C:\\Users\\GoogleNews-vectors-negative300.bin', binary=True)
#Checking how similarity works.
print (model.similarity('this', 'is'))
Output:
0.407970363878
#Lets check one more.
print (model.similarity('post', 'book'))
Output:
0.0572043891977

这和 is 有很大的相似性，但是单词 post 和 book 之间的相似性很差。对于任何给定的单词集，它使用两个单词的向量，并计算它们之间的相似度。

# Finding the odd one out.
model.doesnt_match('breakfast cereal dinner lunch';.split())

输出是

'cereal'

在早餐、麦片、晚餐和午餐中，麦片这个词与其他三个词的关联最少。

# It is also finding the relations between words.
word_vectors.most_similar(positive=['woman', 'king'], negative=['man'])

这是输出。

queen: 0.7699

如果加上女人和国王减去男人，它以 77%的置信度预测女王为输出。这不是很神奇吗？

img/475440_2_En_3_Figd_HTML.png

让我们来看几个有趣的例子，使用 t-SNE 图进行单词嵌入，比如室内和室外。比如所有和电器配件有关的词都是近在咫尺的；同样，与浴室配件相关的词语也是彼此靠近，等等。这就是单词嵌入的美妙之处。

img/475440_2_En_3_Fige_HTML.jpg

食谱 3-8。实现快速文本

fastText 是脸书开发的另一个深度学习框架，用于捕捉上下文并生成特征向量。

问题

您想学习如何用 Python 实现 fastText。

解决办法

fastText 是 word2vec 的临时版本，它考虑用单词来构建表示。但是 fastText 在计算一个单词的表示时会取每个字符。

它是如何工作的

让我们看看如何构建一个 fastText 单词嵌入。

# Import FastText
from gensim.models import FastText
from sklearn.decomposition import PCA
from matplotlib import pyplot
#Example sentences
sentences = [['I', 'love', 'nlp'],
                   ['I', 'will', 'learn', 'nlp', 'in', '2','months'],
                   ['nlp', 'is', 'future'],
                   ['nlp', 'saves', 'time', 'and', 'solves', 'lot', 'of', 'industry', 'problems'],
                   ['nlp', 'uses', 'machine', 'learning']]
fast = FastText(sentences,size=20, window=1, min_count=1, workers=5, min_n=1, max_n=2)

# vector for word nlp
print(fast['nlp'])
[-0.00459182  0.00607472 -0.01119007  0.00555629 -0.00781679  -0.01376211
  0.00675235 -0.00840158 -0.00319737  0.00924599  0.00214165  -0.01063819
  0.01226836  0.00852781  0.01361119 -0.00257012  0.00819397  -0.00410289
 -0.0053979  -0.01360016]
# vector for word deep
print(fast['deep'])
[ 0.00271002 -0.00242539 -0.00771885 -0.00396854  0.0114902   -0.00640606
  0.00637542 -0.01248098 -0.01207364  0.01400793 -0.00476079  -0.00230879
  0.02009759 -0.01952532  0.01558956 -0.01581665  0.00510567  -0.00957186
 -0.00963234 -0.02059373]

这就是使用 fastText 的好处。单词 deep 在训练单词 2vec 中不存在，并且我们没有获得该单词的向量。但是由于 fastText 正在建立字符级别，它提供了结果——甚至是训练中没有的单词。你可以看到单词 deep 的向量。

# load model
fast = Word2Vec.load('fast.bin')
# visualize
X = fast[fast.wv.vocab]
pca = PCA(n_components=2)
result = pca.fit_transform(X)
# create a scatter plot of the projection
pyplot.scatter(result[:, 0], result[:, 1])
words = list(fast.wv.vocab)
for i, word in enumerate(words):
      pyplot.annotate(word, xy=(result[i, 0], result[i, 1]))
pyplot.show()

img/475440_2_En_3_Figf_HTML.jpg

上图显示了 fastText 的嵌入表示。如果你仔细观察，单词“爱”和“解决”在 fastText 中离得很近，但在 skip-gram 和 CBOW 中,“爱”和“学习”离得很近。这是因为字符级嵌入。

食谱 3-9。使用最先进的嵌入技术将文本转换为特征

让我们讨论并实现一些高级的基于上下文的特征工程方法。

问题

您希望使用最先进的嵌入技术将文本转换为特征。

解决办法

下面我们来讨论一下这七种方法。

手套嵌入
工程与后勤管理局
句子编码器
- doc2vec
- 句子-伯特
- 通用编码器
- InferSent
开放 AI GPT

GloVe 是一种替代的单词嵌入方法，用于创建单词的向量子空间。手套模型对单词的共现计数进行训练，并通过最小化最小平方误差来产生向量空间。

在 GloVe 中，首先构建共现:每行是一个单词，每列是上下文。这个矩阵通过上下文计算单词的频率。由于上下文维度非常大，所以你想减少上下文，学习单词嵌入的低维表示。这个过程可以看作是共生矩阵的重构问题，即重构损失。GloVe 的动机是基于共生矩阵显式地强制模型学习这样的关系。

word2vec、skip-gram 和 CBOW 是预测性的，忽略了一些上下文单词比其他单词出现得更频繁的事实。它们只考虑了当地的情况，因此没有抓住全球的情况。

虽然 word2vec 预测给定单词的上下文，但 GloVe 通过构建共现矩阵来学习。

word2vec 没有嵌入全局信息，而 GloVe 创建了一个全局共现矩阵，对每个单词的上下文频率进行计数。全球信息的出现让 GloVe 变得更好。

GloVe 不会像 word2vec 那样通过神经网络进行学习。相反，它具有单词嵌入的乘积和同现概率的对数之间的差的简单损失函数。

研究论文在 https://nlp.stanford.edu/pubs/glove.pdf 。

工程与后勤管理局

ELMo 向量是作为给定句子的函数的向量。这种方法的主要优点是在不同的上下文中可以有不同的词向量。

ELMo 是一个深度上下文化的单词表示模型。它着眼于单词的复杂特征(例如，句法和语义)，并研究它们如何在语言环境中变化(例如，模拟一词多义)。

词向量是深度双向语言模型(biLM)的内部状态的学习函数，该模型在大型文本语料库上预先训练。

不同句子中语境不同的词称为多义词。ELMo 可以成功地处理这种性质的单词，而 GloVe 和 fastText 无法捕捉到这些单词。

研究论文在 www.aclweb.org/anthology/N18-1202.pdf 。

研究论文链接: https://www.aclweb.org/anthology/N18-1202.pdf

句子编码器

为什么要学习句子嵌入？传统技术使用单词嵌入的平均值来形成句子嵌入。但是这种方法也有缺点，比如没有考虑单词的顺序，如果单词在句子中互换，通过平均单词向量获得的相似性是相同的。

doc2vec

doc2vec 基于 word2vec。单词保持一个语法结构，但是文档没有任何语法结构。为了解决这个问题，在 word2vec 模型中添加了另一个向量(段落 ID)。这是 word2vec 和 doc2vec 唯一的区别。

word2vec 计算单词表示的所有向量的均值，而 doc2vec 直接将一个句子表示为向量。与 word2vec 一样，有两种 doc2vec 型号可供选择。

段落向量的分布式记忆模型
段落向量分布式单词包版本(PV-DBOW)

分布式内存(DM)模型类似于 CBOW 模型。CBOW 在给定上下文作为输入的情况下预测目标单词，而在 doc2vec 中，添加一个段落 ID。

分布式单词包(DBOW)模型类似于 word2vec 中的 skip-gram 模型，它从目标单词预测上下文单词。该模型仅将段落 ID 作为输入，并从词汇表中预测上下文。

研究论文在 https://cs.stanford.edu/~quocle/paragraph_vector.pdf 。

句子-伯特

句子-BERT (SBERT)是预训练的 BERT 网络的修改，其使用连体和三元组网络结构来导出语义上有意义的句子嵌入，其可以使用余弦相似性来比较。这将寻找最相似对的工作从使用 BERT/RoBERTa 的 65 小时减少到使用 SBERT 的大约 5 秒，同时保持了 BERT 的准确性。

其中的领导者 Sentence-BERT 于 2018 年推出，并立即占据了句子嵌入的领先地位。这个基于 BERT 的模型的核心有四个关键概念。

注意力
变形金刚(电影名)
伯特
暹罗网络

句子被传递到 BERT 模型和池层以生成它们的嵌入。

研究论文在 www.aclweb.org/anthology/D19-1410.pdf 。

通用编码器

通用句子编码器模型专门针对 NLP 任务的迁移学习，并生成嵌入。

它在各种数据源上接受训练，以学习各种各样的任务。来源包括维基百科、网络新闻、网络问答页面和论坛。输入是可变长度的英文文本，输出是一个 512 维的向量。

通过平均句子中单词的所有嵌入来计算句子嵌入；然而，仅仅相加或平均有局限性，并且不适合推导句子的真实语义。通用句子编码器使得获得句子级嵌入变得容易。

TensorFlow 模型的两个变体允许在精确度和计算资源之间进行权衡。

变形金刚(电影名)
深度平均网络

研究论文在 https://arxiv.org/pdf/1803.11175v2.pdf 。

InferSent

2017 年，脸书引入了 InferSent，作为使用斯坦福自然语言推理数据集(SNLI)训练的句子表示模型。SNLI 是一个由 570，000 个英语句子组成的数据集，每个句子都是一个由前提、假设组成的句子对，这些句子被分为以下几类:蕴涵、矛盾或中性。

研究论文在 https://arxiv.org/pdf/1705.02364.pdf 。

开放 AI GPT

GPT(生成预训练变换器)架构实现了深度神经网络，特别是变换器模型，其使用注意力来代替先前的基于递归和基于卷积的架构。注意机制允许模型有选择地关注它预测为最相关的输入文本片段。

由于训练数据集的广度和方法的广度，GPT 能够执行多种多样的任务，而不仅仅是简单的文本生成:回答问题、总结，甚至在各种特定领域的语言之间进行翻译。

研究论文在 https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf 。

它是如何工作的

从 www.kaggle.com/rounakbanik/ted-talks 下载数据集，并保存在您的本地文件夹中。然后按照本节中的步骤操作。

数据集链接: https://www.kaggle.com/rounakbanik/ted-talks

步骤 9-1。将笔记本和数据导入 Google Colab

Google Colab 用于解决这个项目，因为 BERT 模型非常庞大，在 Colab 中构建它更容易、更快。

前往 https://colab.research.google.com/notebooks/intro.ipynb 的谷歌实验室。

https://colab.research.google.com/notebooks/intro.ipynb

转到“文件”并打开一个新笔记本，或者通过选择“上载笔记本”从本地上载笔记本。

img/475440_2_En_3_Figg_HTML.jpg

要导入数据，请转到文件，单击上传到会话存储选项，然后导入 csv 文件。

img/475440_2_En_3_Figh_HTML.jpg

步骤 9-2。安装和导入库

#If any of these libraries

are not installed, please install them using pip before importing.

import pandas as pd
import numpy as np
import scipy
import os

import nltk
nltk.download('stopwords')
nltk.download('punkt')
ltk.download('wordnet')
import string
import csv
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer # used for preprocessing
import warnings
warnings.filterwarnings('ignore')

from sklearn import preprocessing
import spacy
from tqdm import tqdm
import re
import matplotlib.pyplot as plt # our main display package
import plotly.graph_objects as go

import tensorflow_hub as hub

import tensorflow as tf
print(tf.__version__)

第 9-3 步。读取文本数据

df = pd.read_csv('Ted talks.csv')

df_sample=df.iloc[0:100,:]

步骤 9-4。处理文本数据

让我们实现您在第二章中学到的预处理步骤。

# remove urls, handles, and the hashtag from hashtags
def remove_urls(text):
    new_text = ' '.join(re.sub("(@[A-Za-z0-9]+)|([⁰-9A-Za-z \t])|(\w+:\/\/\S+)"," ",text).split())
    return new_text
# make all text lowercase
def text_lowercase(text):
    return text.lower()
# remove numbers
def remove_numbers(text):
    result = re.sub(r'\d+', '', text)
    return result
# remove punctuation
def remove_punctuation(text):
    translator = str.maketrans('', '', string.punctuation)
    return text.translate(translator)
# tokenize
def tokenize(text):
    text = word_tokenize(text)
    return text
# remove stopwords
stop_words = set(stopwords.words('english'))
def remove_stopwords(text):
    text = [i for i in text if not i in stop_words]
    return text
# lemmatize
lemmatizer = WordNetLemmatizer()
def lemmatize(text):
    text = [lemmatizer.lemmatize(token) for token in text]
    return text

def preprocessing(text):

    text = text_lowercase(text)
    text = remove_urls(text)
    text = remove_numbers(text)
    text = remove_punctuation(text)
    text = tokenize(text)
    text = remove_stopwords(text)
    text = lemmatize(text)
    text = ' '.join(text)
    return text

#preprocessing input
for i in range(df_sample.shape[0]):
    df_sample['description'][i]=preprocessing(str(df_sample['description'][i]))
#in case if description has next line character
for text in df_sample.description:
    text=text.replace('\n',' ')

第 9-5 步。生成特征向量

#Implementations of above methods

#GloVe:
#loading pre-trained glove model
#downloading and unzipping all word embeddings
!wget http://nlp.stanford.edu/data/glove.6B.zip
!unzip glove*.zip
!ls
!pwd

#importing 100-d glove model
glove_model_100vec = pd.read_table("glove.6B.100d.txt", sep=" ", index_col=0, header=None, quoting=csv.QUOTE_NONE)

# getting mean vector for each sentence

def get_mean_vector(glove_model, words):
    # remove out-of-vocabulary words
    #assuming 100-d vector
    words = [word for word in word_tokenize(words) if word in list(glove_model_100vec.index)] #if word is in vocab
    if len(words) >= 1:
        return np.mean(glove_model_100vec.loc[words].values, axis=0)
    else:
        return np.array([0]*100)

#creating empty list and appending all mean arrays for comparing cosine similarities
glove_vec=[]
for i in df_sample.description:
    glove_vec.append(list(get_mean_vector(glove_model_100vec, i)))

glove_vec=np.asarray(glove_vec)
glove_vec

输出

array([[-0.11690753,  0.17445151,  0.04606778, ..., -0.48718723,
         0.28744267,  0.16625453],
       [-0.12658561,  0.17125735,  0.44709804, ..., -0.18936391,
         0.51547109,  0.2958283 ],
       [-0.06018609,  0.12372995,  0.27105957, ..., -0.38565426,
         0.39135596,  0.2519755 ],
       ...,
       [-0.12469988,  0.11091088,  0.16328073, ..., -0.08730062,
         0.25822592,  0.12540627],
       [ 0.09014104,  0.09796044,  0.13403036, ..., -0.371885  ,
         0.19138244,  0.05781978],
       [ 0.00891036,  0.09064478,  0.22670132, ..., -0.26099886,
         0.47415786,  0.30951336]])

埃尔莫:

# Due to some open issue with TensorFlow Hub on latest version (2.x), we are degrading to tensorflow 1.x version
#!pip uninstall tensorflow
!pip install tensorflow==1.15

import tensorflow as tf
import tensorflow_hub as hub
print(tf.__version__)

#Load pre-trained model
embed_ = hub.Module("https://tfhub.dev/google/elmo/3")

#function to average word vectors of each sentence

def elmo_vectors_sentence(x):
  sentence_embeddings = embed_(x.tolist(), signature="default", as_dict=True)["elmo"]

  with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(tf.tables_initializer())
    # Average of each vector
    return sess.run(tf.reduce_mean(sentence_embeddings,1))

#if your data set is large , make a batch of 100 samples. Just remove comment and run the code given below. As we have just 100 samples, we are not doing this
 #samples= [df[i:i+100] for i in range(0,df.shape[0],100)]
 # elmo_vec = [elmo_vectors_sentence(x['description']) for x in samples]
 #elmo_vec_full= np.concatenate(elmo_vec, axis = 0)

#embeddings on our dataset
elmo_vec = elmo_vectors_sentence(df_sample['description'])

elmo_vec

输出

array([[ 0.0109894 , -0.16668989, -0.06553215, ...,  0.07014981,
         0.09196191,  0.04669906],
       [ 0.15317157, -0.19256656,  0.01390844, ...,  0.03459582,
         0.28029835,  0.11106762],
       [ 0.20210212, -0.13186318, -0.20647219, ..., -0.15281932,
         0.12729007,  0.17192583],
       ...,
       [ 0.29017407, -0.45098212,  0.0250571 , ..., -0.12281103,
         0.23303834,  0.15486737],
       [ 0.22871418,  0.12254314, -0.22637479, ...,  0.04150296,
         0.31900924,  0.28121516],
       [ 0.05940952,  0.01366339, -0.17502695, ...,  0.20946877,
         0.0020928 ,  0.1114894 ]], dtype=float32)

Doc2Vec：

#importing doc2vec and tagged document

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

#tokenizing data
tokenized_data=[word_tokenize(word) for word in df_sample.description]

train_data=[TaggedDocument(d, [i]) for i, d in enumerate(tokenized_data)]   #adding paragraph id as mentioned in explanation for training
train_data

输出

[TaggedDocument(words=['sir', 'ken', 'robinson', 'make', 'entertaining', 'profoundly', 'moving', 'case', 'creating', 'education', 'system', 'nurture', 'rather', 'undermines', 'creativity'], tags=[0]),
 TaggedDocument(words=['humor', 'humanity', 'exuded', 'inconvenient', 'truth', 'al', 'gore', 'spell', 'way', 'individual', 'address', 'climate', 'change', 'immediately', 'buying', 'hybrid', 'inventing', 'new', 'hotter', 'brand', 'name', 'global', 'warming'], tags=[1]),
 TaggedDocument(words=['new', 'york', 'time', 'columnist', 'david', 'pogue', 'take', 'aim', 'technology', 'worst', 'interface', 'design', 'offender', 'provides', 'encouraging', 'example', 'product', 'get', 'right', 'funny', 'thing', 'burst', 'song'], tags=[2]),
 TaggedDocument(words=['emotionally', 'charged', 'talk', 'macarthur', 'winning', 'activist', 'majora', 'carter', 'detail', 'fight', 'environmental', 'justice', 'south', 'bronx', 'show', 'minority', 'neighborhood', 'suffer', 'flawed', 'urban', 'policy'], tags=[3]),
 TaggedDocument(words=['never', 'seen', 'data', 'presented', 'like', 'drama', 'urgency', 'sportscaster', 'statistic', 'guru', 'han', 'rosling', 'debunks', 'myth', 'called', 'developing', 'world'], tags=[4])

……….

## Train doc2vec model
model = Doc2Vec(train_data, vector_size = 100, window = 2, min_count = 1, epochs = 100)

def get_vectors(model,words):
  words = [word for word in word_tokenize(words) if word in list(model.wv.vocab)]
  #words = [word for word in word_tokenize(words) if word in list(model.wv.index_to_key)] #if gensim version is >4.0.0 ,use this line
  if len(words)>=1:
    return model.infer_vector(words)
  else:
    return np.array([0]*100)

#defining empty list
doc2vec_vec=[]
for i in df_sample.description:
    doc2vec_vec.append(list(get_vectors(model, i)))

doc2vec_vec=np.asarray(doc2vec_vec)

doc2vec_vec

输出

array([[ 0.00505156, -0.582084  , -0.33430266, ...,  0.29665616,
        -0.5472022 ,  0.48537165],
       [ 0.05787622, -0.6559785 , -0.41140306, ...,  0.24132295,
        -0.73182726,  0.6089837 ],
       [ 0.02416484, -0.48238695, -0.29850838, ...,  0.2710957 ,
        -0.51971895,  0.4405582 ],
       ...,
       [ 0.0511999 , -0.5991625 , -0.34839907, ...,  0.29519215,
        -0.68761116,  0.4545323 ],
       [ 0.0180944 , -0.8318272 , -0.3488748 , ...,  0.30490136,
        -0.7558393 ,  0.56117946],
       [-0.04790357, -0.66188   , -0.3797214 , ...,  0.34476635,
        -0.7202311 ,  0.5834031 ]], dtype=float32)

句子-伯特

#BERT sentence transformer for sentence encoding
!pip install sentence-transformers

#importing bert-base model

from sentence_transformers import SentenceTransformer
sbert_model = SentenceTransformer('bert-base-nli-mean-tokens')

#one more model to try
#model = SentenceTransformer('paraphrase-MiniLM-L12-v2')

#embeding on description column
sentence_embeddings_BERT = sbert_model.encode(df_sample['description'])
print('Sample BERT embedding vector - length', len(sentence_embeddings_BERT[0]))

输出

Sample BERT embedding vector - length 768

sentence_embeddings_BERT
array([[-0.31804532,  0.6571422 ,  0.5327481 , ..., -0.76469   ,
        -0.4919126 ,  0.1543465 ],
       [-0.08962823,  1.0855986 ,  0.37181526, ..., -0.84685326,
         0.5427714 ,  0.32389015],
       [-0.13385592,  0.8280815 ,  0.76139224, ..., -0.33403403,
         0.2664094 , -0.05493931],
       ...,
       [ 0.05133615,  1.1150284 ,  0.75921553, ...,  0.5516633 ,
         0.46614835,  0.28827885],
       [-1.3568689 ,  0.2995725 ,  0.99510914, ...,  0.26881158,
        -0.1879525 ,  0.18646894],
       [-0.20679009,  0.8725009 ,  1.2933054 , ..., -0.44921246,
         0.14516312, -0.2050481 ]], dtype=float32)

sentence_embeddings_BERT.shape

输出

(100, 768)

通用编码器

加载预训练模型

module_url = "https://tfhub.dev/google/universal-sentence-encoder/4"
model_USE = hub.load(module_url)

embeddings_USE = model_USE(df_sample['description'])

embeddings_USE = tf.reshape(embeddings_USE,[100,512])

embeddings_USE.shape

输出

TensorShape([Dimension(100), Dimension(512)])

#output is tensor

Infersent

InferSent 有两个版本。版本 1 使用 GloVe，而版本 2 使用 fastText 向量。你可以选择和任何模型一起工作。我们使用的是版本 2，所以我们下载了推断模型和预先训练的单词向量。

! mkdir encoder
! curl -Lo encoder/infersent2.pkl https://dl.fbaipublicfiles.com/infersent/infersent2.pkl

! mkdir GloVe
! curl -Lo GloVe/glove.840B.300d.zip http://nlp.stanford.edu/data/glove.840B.300d.zip
! unzip GloVe/glove.840B.300d.zip -d GloVe/

! unzip GloVe/glove.840B.300d.zip -d GloVe/

from models import InferSent
import torch

V = 2
MODEL_PATH = '/content/drive/MyDrive/yolov3/encoder/infersent%s.pkl' % V
params_model = {'bsize': 64, 'word_emb_dim': 300, 'enc_lstm_dim': 2048,
                'pool_type': 'max', 'dpout_model': 0.0, 'version': V}
model_infer = InferSent(params_model)
model_infer.load_state_dict(torch.load(MODEL_PATH))

W2V_PATH = '/content/drive/MyDrive/yolov3/GloVe/glove.840B.300d.txt'
model_infer.set_w2v_path(W2V_PATH)

#building vocabulary
model_infer.build_vocab(df_sample.description, tokenize=True)

输出

Found 1266(/1294) words with w2v vectors
Vocab size : 1266

#encoding sample dataset

infersent_embed = model_infer.encode(df_sample.description,tokenize=True)

#shape of our vector

infersent_embed.shape

输出

(100, 4096)

get_embed(df_sample,'infersent')

输出

array([[ 0.00320979,  0.0560745 ,  0.11894835, ...,  0.04763867,
         0.02359796,  0.09751415],
       [ 0.00983471,  0.11757359,  0.12201475, ...,  0.06545023,
         0.04181211,  0.07941461],
       [-0.02874381,  0.18418473,  0.12211668, ...,  0.07526097,
         0.06728931,  0.1058861 ],
       ...,
       [ 0.00766308,  0.10781102,  0.13686652, ...,  0.08371441,
         0.01190174,  0.12111058],
       [-0.02874381,  0.20537955,  0.11543981, ...,  0.08811261,
         0.03787484,  0.08826952],
       [ 0.12408942,  0.30591702,  0.23708522, ...,  0.1063919 ,
         0.0908693 ,  0.14098585]], dtype=float32)

开放 AI GPT

#installing necessary model

!pip install pytorch_pretrained_bert

import torch
from pytorch_pretrained_bert import OpenAIGPTTokenizer, OpenAIGPTModel

tokenizer_openai = OpenAIGPTTokenizer.from_pretrained('openai-gpt')  #Construct a GPT Tokenizer. Based on Byte-Pair-Encoding with the following peculiarities:

model_openai = OpenAIGPTModel.from_pretrained('openai-gpt')
model_openai.eval()
print('Model Loaded')

#function to get embedding of each token
def Embedding_openai(Sentence):
  tokens = word_tokenize(Sentence)
  vectors = np.zeros((1,768))
  for word in tokens:
      subwords = tokenizer_openai.tokenize(word)
      indexed_tokens = tokenizer_openai.convert_tokens_to_ids(subwords)
      tokens_tensor = torch.tensor([indexed_tokens])
      with torch.no_grad():
          try:
            vectors += np.array(torch.mean(model_openai(tokens_tensor),1))
          except Exception as ex:
            continue
  vectors /= len(tokens)
  return vectors

# Initialize Matrix with dimension of numberof rows*vector dimension
open_ai_vec = np.zeros((df_sample.shape[0], 768))

# generating sentence embedding for each row
for iter in range(df_sample.shape[0]):
    text = df_sample.loc[iter,'description']
    open_ai_vec[iter] = Embedding_openai(text)

open_ai_vec

输出

array([[ 0.16126736,  0.14900037,  0.10306535, ...,  0.22078205,
        -0.38590393, -0.09898915],
       [ 0.17074709,  0.20849738,  0.14996684, ...,  0.21315758,
        -0.46983403,  0.02419061],
       [ 0.25158801,  0.12217634,  0.09847356, ...,  0.25541687,
        -0.44979091, -0.0174561 ],
       ...,
       [ 0.26624974,  0.15842849,  0.10565209, ...,  0.23473342,
        -0.40087843, -0.07652373],
       [ 0.22917288,  0.22115094,  0.09217898, ...,  0.18310198,
        -0.33768173, -0.16026535],
       [ 0.21503123,  0.21615047,  0.04715349, ...,  0.25044506,
        -0.42287723, -0.01473052]])

第 9-6 步。使用选择的嵌入方法自动生成特征向量函数

#takes input as dataframe and embedding model name as mentioned in function

def get_embed(df,model):

  if model=='Glove':
    return glove_vec
  if model=='ELMO':
    return elmo_vec
  if model=='doc2vec':
    return doc2vec_vec
  if model=='sentenceBERT':
    return sentence_embeddings_BERT
  if model=='USE':
    return embeddings_USE
  if model=='infersent':
    return infersent_embed
  if model=='Open-ai':
    return open_ai_vec

get_embed(df_sample,'ELMO')

#output
array([[ 0.0109894 , -0.16668989, -0.06553215, ...,  0.07014981,
         0.09196191,  0.04669906],
       [ 0.15317157, -0.19256656,  0.01390844, ...,  0.03459582,
         0.28029835,  0.11106762],
       [ 0.20210212, -0.13186318, -0.20647219, ..., -0.15281932,
         0.12729007,  0.17192583],
       ...,
       [ 0.29017407, -0.45098212,  0.0250571 , ..., -0.12281103,
         0.23303834,  0.15486737],
       [ 0.22871418,  0.12254314, -0.22637479, ...,  0.04150296,
         0.31900924,  0.28121516],
       [ 0.05940952,  0.01366339, -0.17502695, ...,  0.20946877,
         0.0020928 ,  0.1114894 ]], dtype=float32)

我们希望您现在已经熟悉了自然语言处理。既然已经清理和处理了数据，并且创建了特性，那么让我们开始构建解决业务问题的应用程序。

Footnotes 1

https://arxiv.org/abs/1310.4546

四、高级自然语言处理

本章涵盖了各种高级 NLP 技术，并利用机器学习算法从文本数据中提取信息，以及高级 NLP 应用程序的解决方案和实现。

食谱 1。名词短语抽取
食谱 2。文本相似度
食谱三。词性标注
食谱 4。信息提取-NER-实体识别
食谱 5。主题建模
食谱 6。文本分类
食谱 7。情感分析
食谱 8。词义消歧
食谱 9。语音识别和语音转文本
食谱 10。文本到语音
食谱 11。语言检测和翻译

在进入食谱之前，让我们先了解 NLP 管道和生命周期。这本书实现了许多概念，你可能会被它的内容淹没。为了使它更简单、更流畅，让我们看看 NLP 解决方案需要遵循的流程。

例如，让我们考虑对一个产品、一个品牌或一项服务的客户情感分析和预测。

定义问题。了解所有产品的客户情绪。
了解问题的深度和广度。了解整个产品的客户/用户情绪。我们为什么要这么做？业务影响是什么？
做 数据需求头脑风暴。进行头脑风暴活动，列出所有可能的数据点。
- 顾客在亚马逊、Flipkart 等电子商务平台上的所有评论
- 客户发送的电子邮件
- 保修索赔表
- 调查数据
- 使用语音转文本的呼叫中心对话
- 反馈表格
- Twitter、脸书和 LinkedIn 等社交媒体数据
数据收集:你在第一章中学习了不同的数据收集技术。根据数据和问题，您可能需要整合不同的数据收集方法。在这种情况下，您可以使用 web 抓取和 Twitter APIs。
文本预处理:你知道数据不会总是干净的。你需要花大量的时间来处理它，并使用第二章中讨论的方法提取洞察力。
文本是由字符组成的，机器很难理解它们。使用您在前面章节中学习的任何方法，将它们转换为机器和算法可以理解的功能。
机器学习/深度学习:机器学习和深度学习是人工智能保护伞的一部分，它使系统在没有被编程的情况下自动学习数据中的模式。大多数 NLP 解决方案都基于此。由于文本被转换为特征，因此您可以利用机器学习或深度学习算法来实现文本分类和自然语言生成等目标。
洞察力和部署:如果没有正确的洞察力与业务沟通，构建 NLP 解决方案是没有用的。总是花时间将模型/分析输出和业务之间的点连接起来，从而产生最大的影响。

配方 4-1。提取名词短语

这个菜谱从文本数据(一个句子或文档)中提取一个名词短语。

问题

你想提取一个名词短语。

解决办法

当你想分析一个句子中的谁时，名词短语提取是很重要的。让我们看一个使用TextBlob的例子。

它是如何工作的

执行下面的代码来提取名词短语。

#Import libraries
import nltk
from textblob import TextBlob
#Extract noun
blob = TextBlob("John is learning natural language processing")
for np in blob.noun_phrases:
    print(np)

这是输出。

john
natural language processing

食谱 4-2。寻找文本之间的相似性

这个菜谱讨论了如何找到两个文档或文本之间的相似性。有许多类似的度量标准，如欧几里德、余弦和雅克卡。文本相似性的应用可以在拼写校正、重复数据删除、简历筛选、跨各种领域的搜索应用以及基于内容的推荐系统中找到。

下面是一些相似性度量。

余弦相似度:计算两个向量之间角度的余弦值。
Jaccard 相似度:使用单词的交集或并集计算得分。
Jaccard Index :(两组中的数字)/(任一组中的数字)* 100。
Levenshtein 距离:将字符串 a 转换成字符串 b 需要最少的插入、删除和替换。
汉明距离:两个字符串中符号相同的位置个数。只能为长度相等的字符串定义它。

您希望找到文本和文档之间的相似之处。

解决办法

最简单的方法是使用 sklearn 库中的余弦相似度。

它是如何工作的

按照本节中的步骤计算文本文档之间的相似性得分。

步骤 2-1。创建/读取文本数据

这是数据。

documents = (
"I like NLP",
"I am exploring NLP",
"I am a beginner in NLP",
"I want to learn NLP",
"I like advanced NLP"
)

第 2-2 步。寻找相似之处

执行下面的代码来找出相似之处。

#Import libraries
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
#Compute tfidf : feature engineering(refer previous chapter – Recipe 3-4)
tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(documents)
tfidf_matrix.shape
#output
(5, 10)
#compute similarity for first sentence with rest of the sentences
cosine_similarity(tfidf_matrix[0:1],tfidf_matrix)
#output
array([[ 1\.       ,  0.17682765,  0.14284054,  0.13489366,  0.68374784]])

与其余句子相比，第一句和最后一句具有更高的相似性。

语音匹配

相似性检查的下一个版本是语音匹配，它粗略地匹配两个单词或句子，并创建一个字母数字字符串作为文本或单词的编码版本。这对于搜索大型文本语料库、纠正拼写错误和匹配相关名称非常有用。Soundex 和 Metaphone 是用于此目的的两种主要语音算法。最简单的方法是使用模糊库。

安装并导入库。
运行 Soundex 函数。

!pip install fuzzy
import fuzzy

生成语音形式。

soundex('natural')
#output
'N364'
soundex('natuaral')
#output
'N364'
soundex('language')
#output
'L52'
soundex('processing')
#output
'P625'

soundex = fuzzy.Soundex(4)

Soundex 对自然和天然一视同仁。两个字符串的语音代码都是 N364。而对于语言和处理，分别是 L52 和 P625。

食谱 4-3。词性标注

词性标注是自然语言处理的另一个重要部分，它涉及用词性(如名词、动词、形容词等)来标记单词。词性是命名实体解析、问答和词义消歧的基础。

问题

你想标记句子中的词类。

解决办法

有两种方法可以创建标记器。

基于规则的:人工创建的规则，标记属于特定词性的单词。
基于随机的:这些算法捕捉单词的序列，并使用隐马尔可夫模型标记序列的概率。

它是如何工作的

同样，NLTK 拥有最好的 POS 标记模块。nltk.pos_tag(word)是为任何给定单词生成词性标注的函数。使用 for 循环并为文档中出现的所有单词生成 POS。

步骤 3-1。将文本存储在变量中

这里是变量。

Text  =  "I love NLP and I will learn NLP in 2 month"

第 3-2 步。为 POS 导入 NLTK

这是代码。

# Importing necessary packages and

stopwords
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize
stop_words = set(stopwords.words('english'))
# Tokenize the text
tokens = sent_tokenize(text)
#Generate tagging for all the tokens using loop
for i in tokens:
    words = nltk.word_tokenize(i)
    words = [w for w in words if not w in stop_words]
    #  POS-tagger.
    tags = nltk.pos_tag(words)
tags

这些是结果。

[('I', 'PRP'),
 ('love', 'VBP'),
 ('NLP', 'NNP'),
 ('I', 'PRP'),
 ('learn', 'VBP'),
 ('NLP', 'RB'),
 ('2month', 'CD')]

以下是词性标注的简写形式和解释。爱这个词是 VBP，意思是唱。当前，非 3D 拍摄。

CC 并列连词
CD 基数
DT 限定词
EX 存在那里(如有...想象一下存在
FW 外来词
在介词/从属连词中
JJ 形容词例如:大
JJR 形容词，比较级例如:更大的
JJS 形容词，最高级例如:最大的
LS 列表标记 1)
MD modal 可能会
NN 名词，单数书桌
NNS 名词复数书桌
NNP 专有名词，单数哈里森
NNPS 专有名词，复数美国人
PDT 预定器所有孩子
词尾所有格父母
PRP 人称代词我，他，她
PRP$所有格代词我的，他的，她的
非常，无声地
RBR 副词，比较级更好
RBS 副词，最高级 best
RP 粒子放弃
去商店
呃感叹词
VB 动词，基本形式 take
VBD 动词，过去式带了
VBG 动词，动名词/现在分词取
VBN 动词，过去分词取
VBP 动词，唱，现在。非 3D 拍摄
VBZ 动词，第三人称演唱。呈现带走
WDT 疑问词 which
代词谁，什么
WP$所有格 wh-代词谁的
wh-副词 where，when

食谱 4-4。从文本中提取实体

这个菜谱讨论了如何从文本中识别和提取实体，称为命名实体识别。多个库执行这项任务，如 NLTK chunker、斯坦福 NER、spaCy、OpenNLP 和 NeuroNER。还有很多 API，像沃森 NLU，阿尔切米亚皮，呆子，谷歌云自然语言 API，等等。

问题

您希望从文本中识别和提取实体。

解决办法

最简单的方法是使用 NLTK 或 spaCy 中的ne_chunk。

它是如何工作的

按照本节中的步骤执行 NER。

步骤 4-1。读取/创建文本数据

这是正文。

sent = "John is studying at Stanford University in California"

第 4-2 步。提取实体

执行下面的代码。

使用 NLTK

#import libraries

import nltk
from nltk import ne_chunk
from nltk import word_tokenize
#NER
ne_chunk(nltk.pos_tag(word_tokenize(sent)), binary=False)
#output
Tree('S', [Tree('PERSON', [('John', 'NNP')]), ('is', 'VBZ'), ('studying', 'VBG'), ('at', 'IN'), Tree('ORGANIZATION', [('Stanford', 'NNP'), ('University', 'NNP')]), ('in', 'IN'), Tree('GPE', [('California', 'NNP')])])
Here "John" is tagged as "PERSON"
"Stanford" as "ORGANIZATION"
"California" as "GPE". Geopolitical entity, i.e. countries, cities, states.

使用空间

import spacy

nlp = spacy.load('en')
# Read/create a sentence
doc = nlp(u'Apple is ready to launch new phone worth $10000 in New york time square ')
for ent in doc.ents:
   print(ent.text, ent.start_char, ent.end_char, ent.label_)
#output

Apple 0 5 ORG
10000 42 47 MONEY
New york 51 59 GPE

按产量来说，苹果是个组织，10000 是钱，纽约是个地方。结果是准确的，可用于任何自然语言处理应用。

配方 4-5。从文本中提取主题

这个菜谱讨论了如何从文档中识别主题。例如，有一个在线图书馆，根据图书的种类/流派设有多个部门。你查看独特的关键词/主题来决定这本书可能属于哪个部门，并相应地放置它。在这种情况下，主题建模会派上用场。它被称为文档标记和聚类。

问题

您希望从文档中提取或识别主题。

解决办法

最简单的方法是使用 gensim 库。

它是如何工作的

按照本节中的步骤使用 genism 识别文档中的主题。

步骤 5-1。创建文本数据

这是正文。

doc1 = "I am learning NLP, it is very interesting and exciting. it includes machine learning and deep learning"
doc2 = "My father is a data scientist and he is nlp expert"
doc3 = "My sister has good exposure into android development"
doc_complete = [doc1, doc2, doc3]
doc_complete
#output
['I am learning NLP, it is very interesting and exciting. it includes machine learning and deep learning',
 'My father is a data scientist and he is nlp expert',
 'My sister has good exposure into android development']

第 5-2 步。清理和预处理数据

接下来，我们来清理一下。

# Install and import libraries
!pip install gensim
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
import string
# Text preprocessing as discussed in chapter 2
stop = set(stopwords.words('english'))
exclude = set(string.punctuation)
lemma = WordNetLemmatizer()
def clean(doc):
    stop_free = " ".join([i for i in doc.lower().split() if i not in stop])
    punc_free = ".join(ch for ch in stop_free if ch not in exclude)
    normalized = " ".join(lemma.lemmatize(word) for word in punc_free.split())
    return normalized
doc_clean = [clean(doc).split() for doc in doc_complete]
doc_clean
#output
[['learning',
  'nlp',
  'interesting',
  'exciting',

  'includes',
  'machine',
  'learning',
  'deep',
  'learning'],
 ['father', 'data', 'scientist', 'nlp', 'expert'],
 ['sister', 'good', 'exposure', 'android', 'development']]

第 5-3 步。准备文档术语矩阵

下面是代码。

# Importing gensim
import gensim
from gensim import corpora
# Creating the term dictionary of our corpus, where every unique term is assigned an index.
dictionary = corpora.Dictionary(doc_clean)
# Converting a list of documents (corpus) into Document-Term Matrix using dictionary prepared above.
doc_term_matrix = [dictionary.doc2bow(doc) for doc in doc_clean]
doc_term_matrix
#output
[[(0, 1), (1, 1), (2, 1), (3, 1), (4, 3), (5, 1), (6, 1)],
 [(6, 1), (7, 1), (8, 1), (9, 1), (10, 1)],
 [(11, 1), (12, 1), (13, 1), (14, 1), (15, 1)]]

第 5-4 步。创建 LDA 模型

最后一部分创建 LDA 模型。

# Creating the object for LDA model using gensim library
Lda = gensim.models.ldamodel.LdaModel
# Running and Training LDA model on the document term matrix for 3 topics.
ldamodel = Lda(doc_term_matrix, num_topics=3, id2word = dictionary, passes=50)
# Results
print(ldamodel.print_topics())
#output
[(0, '0.063*"nlp" + 0.063*"father" + 0.063*"data" + 0.063*"scientist" + 0.063*"expert" + 0.063*"good" + 0.063*"exposure" + 0.063*"development" + 0.063*"android" + 0.063*"sister"'), (1, '0.232*"learning" + 0.093*"nlp" + 0.093*"deep" + 0.093*"includes" + 0.093*"interesting" + 0.093*"machine" + 0.093*"exciting" + 0.023*"scientist" + 0.023*"data" + 0.023*"father"'), (2, '0.087*"sister" + 0.087*"good" + 0.087*"exposure" + 0.087*"development" + 0.087*"android" + 0.087*"father" + 0.087*"scientist" + 0.087*"data" + 0.087*"expert" + 0.087*"nlp"')]

所有与句子主题相关的权重看起来都差不多。您可以对大型文档执行此操作，以提取重要的主题。在样本数据上实现这一点的整个想法是让您熟悉它，并且您可以使用相同的代码片段在大量数据上执行，以获得重要的结果和见解。

配方 4-6。文本分类

文本分类根据预先训练的类别自动对文本文档进行分类。它有以下应用。

情感分析
文件分类
垃圾邮件/业余邮件分类
投诉分类
产品分类
假新闻检测

问题

使用机器学习的垃圾邮件/火腿分类。

解决办法

Gmail 有一个叫垃圾邮件的文件夹。它将你的电子邮件分类为垃圾邮件和火腿，这样你就不必阅读不必要的电子邮件。

它是如何工作的

按照循序渐进的方法构建分类器。

步骤 6-1。收集和理解数据

请从www.kaggle.com/uciml/sms-spam-collection -dataset#spam.csv下载数据并保存在您的工作目录中。

#Read the data
Email_Data = pd.read_csv("spam.csv",encoding ='latin1')
#Data undestanding
Email_Data.columns
#output
Index(['v1', 'v2', 'Unnamed: 2', 'Unnamed: 3', 'Unnamed: 4'], dtype="object")
Email_Data = Email_Data[['v1', 'v2']]
Email_Data = Email_Data.rename(columns={"v1":"Target", "v2":"Email"})
Email_Data.head()
#output
    Target   Email
0      ham   Go until jurong point, crazy.. Available only ...
1      ham   Ok lar... Joking wif u oni...
2      spam  Free entry in 2 a wkly comp to win FA Cup fina...
3      ham   U dun say so early hor... U c already then say...
4      ham   Nah I don't think he goes to usf, he lives aro...

第 6-2 步。文本处理和特征工程

下面是代码。

#import
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import string
from nltk.stem import SnowballStemmer
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
import os
from textblob import TextBlob
from nltk.stem import PorterStemmer
from textblob import Word
from sklearn.feature_extraction.text import CountVectorizer,TfidfVectorizer
import sklearn.feature_extraction.text as text
from sklearn import model_selection, preprocessing, linear_model, naive_bayes, metrics, svm
#pre processing steps like lower case, stemming and lemmatization
Email_Data['Email'] = Email_Data['Email'].apply(lambda x: " ".join(x.lower() for x in x.split()))
stop = stopwords.words('english')
Email_Data['Email'] = Email_Data['Email'].apply(lambda x: " ".join(x for x in x.split() if x not in stop))
st = PorterStemmer()
Email_Data['Email'] = Email_Data['Email'].apply(lambda x: " ".join([st.stem(word) for word in x.split()]))
Email_Data['Email'] = Email_Data['Email'].apply(lambda x: " ".join([Word(word).lemmatize() for word in x.split()]))
Email_Data.head()

#output
  Target                                              Email
0    ham  go jurong point, crazy.. avail bugi n great wo...
1    ham                        ok lar... joke wif u oni...
2    spam free entri 2 wkli comp win fa cup final tkt 21...
3    ham          u dun say earli hor... u c alreadi say...
4    ham              nah think goe usf, live around though
#Splitting data into train and validation

train_x, valid_x, train_y, valid_y = model_selection.train_test_split(Email_Data['Email'], Email_Data['Target'])
# TFIDF feature generation for a maximum of 5000 features
encoder = preprocessing.LabelEncoder()
train_y = encoder.fit_transform(train_y)
valid_y = encoder.fit_transform(valid_y)
tfidf_vect = TfidfVectorizer(analyzer='word', token_pattern=r'\w{1,}', max_features=5000)

tfidf_vect.fit(Email_Data['Email'])
xtrain_tfidf =  tfidf_vect.transform(train_x)
xvalid_tfidf =  tfidf_vect.transform(valid_x)
xtrain_tfidf.data
#output
array([0.39933971, 0.36719906, 0.60411187, ..., 0.36682939, 0.30602539, 0.38290119])

第 6-3 步。模特培训

这是用于训练任何给定模型的通用函数。

def train_model(classifier, feature_vector_train, label, feature_vector_valid, is_neural_net=False):
    # fit the training dataset on the classifier
    classifier.fit(feature_vector_train, label)
    # predict the labels on validation dataset
    predictions = classifier.predict(feature_vector_valid)
    return metrics.accuracy_score(predictions, valid_y)
# Naive Bayes trainig
accuracy = train_model(naive_bayes.MultinomialNB(alpha=0.2), xtrain_tfidf, train_y, xvalid_tfidf)
print ("Accuracy: ", accuracy)
#output
Accuracy:  0.985642498205
# Linear Classifier on Word Level TF IDF Vectors
accuracy = train_model(linear_model.LogisticRegression(), xtrain_tfidf, train_y, xvalid_tfidf)
print ("Accuracy: ", accuracy)
#output

Accuracy:  0.970567121321

朴素贝叶斯分类器比线性分类器提供更好的结果。您应该尝试其他几个分类器，然后选择最好的一个。

配方 4-7。进行情感分析

这个食谱讨论了一个特定句子或语句的情感。情感分析是各行业广泛使用的技术之一，用于了解客户/用户对产品/服务的情感。情感分析给出倾向于正面或负面的句子/语句的情感分数。

问题

你想做一个情感分析。

解决办法

最简单的方法是使用 TextBlob 或 VADER。

它是如何工作的

按照本节中的步骤使用 TextBlob 进行情感分析。它有两个指标。

极性位于[–1，1]的范围内，其中 1 表示肯定的陈述，而–1 表示否定的陈述。
主观性[0，1]是一种观点，而不是事实信息。

步骤 7-1。创建示例数据

这是样本数据。

review = "I like this phone. screen quality and camera

clarity is really good."
review2 = "This tv is not good. Bad quality, no clarity, worst experience"

第 7-2 步。清理和预处理数据

该步骤请参考第二章，配方 2-10。

第 7-3 步。获取情感分数

使用预先训练的 TextBlob 来获得情感分数。

#import libraries
from textblob import TextBlob
#TextBlob has a pre trained sentiment prediction model
blob = TextBlob(review)
blob.sentiment
#output
Sentiment(polarity=0.7, subjectivity=0.6000000000000001)

这似乎是一个非常积极的评论。

#now lets look at the sentiment of review2
blob = TextBlob(review2)
blob.sentiment
#output

Sentiment(polarity=-0.6833333333333332, subjectivity=0.7555555555555555)

这是一个负面评价，因为极性为–0.68。

Note

在下一章的菜谱 5-2 中会介绍一个端到端实现的实时情感分析用例。

食谱 4-8。消除文本歧义

歧义的产生是由于单词在不同的上下文中有不同的含义。

举个例子，

Text1 = 'I went to the bank to deposit my money'
Text2 = 'The river bank was full of dead fish'

在文本中，单词 bank 根据句子的上下文有不同的含义。

问题

你想要理解消除词义的歧义。

解决办法

Lesk 算法是词义消歧的最佳算法之一。让我们看看如何使用pywsd和nltk包来解决它。

它是如何工作的

以下是实现结果的步骤。

步骤 8-1。导入库

首先，导入库。

#Install pywsd
!pip install pywsd
#Import functions
from nltk.corpus import wordnet as wn
from nltk.stem import PorterStemmer
from itertools import chain
from pywsd.lesk import simple_lesk

步骤 8-2。消除词义的歧义

这是代码。

# Sentences
bank_sents = ['I went to the bank to deposit my money',
'The river bank was full of dead fishes']
# calling the lesk function and printing results for both the sentences
print ("Context-1:", bank_sents[0])
answer = simple_lesk(bank_sents[0],'bank')
print ("Sense:", answer)
print ("Definition : ", answer.definition())
print ("Context-2:", bank_sents[1])
answer = simple_lesk(bank_sents[1],'bank','n')
print ("Sense:", answer)
print ("Definition : ", answer.definition())
#Result:

Context-1: I went to the bank to deposit my money
Sense: Synset('depository_financial_institution.n.01')
Definition :  a financial institution that accepts deposits and channels the money into lending activities

Context-2: The river bank was full of dead fishes
Sense: Synset('bank.n.01')
Definition :  sloping land (especially the slope beside a body of water)

观察在上下文-1 中，“银行”是金融机构，而在上下文-2 中，“银行”是坡地。

食谱 4-9。将语音转换为文本

将语音转换成文本是一种非常有用的自然语言处理技术。

问题

你想把语音转换成文本。

解决办法

最简单的方法是使用语音识别和 PyAudio。

它是如何工作的

按照本节中的步骤实现语音到文本转换。

步骤 9-1。定义业务问题

与机器的互动正趋向于语音，这是人类交流的常用方式。常见的例子有苹果 Siri、亚马逊 Alexa 和谷歌 Home。

步骤 9-2。安装并导入必要的库

这里是图书馆。

!pip install SpeechRecognition
!pip install PyAudio
import speech_recognition as sr

第 9-3 步。运行代码

现在，在您运行下面的代码片段之后，您在麦克风上说的任何话(使用recognize_google函数)都将被转换成文本。

r=sr.Recognizer()
with sr.Microphone() as source:
    print("Please say something")
    audio = r.listen(source)
    print("Time over, thanks")
try:
    print("I think you said: "+r.recognize_google(audio));
except:
    pass;
#output
Please say something
Time over, thanks
I think you said: I am learning natural language processing

此代码适用于默认的英语语言。如果您说的是另一种语言，例如印地语，则文本将以英语解释，如下所示。

#code snippet
r=sr.Recognizer()
with sr.Microphone() as source:
    print("Please say something")
    audio = r.listen(source)
    print("Time over, thanks")
try:
    print("I think you said: "+r.recognize_google(audio));
except:
    pass;
#output
Please say something
Time over, thanks
I think you said: aapka naam kya hai

如果您希望文本出现在口语中，请运行下面的代码片段。对recognize_google –language ( 'hi-IN'，意为印地语)做了一个小改动。

#code snippet
r=sr.Recognizer()
with sr.Microphone() as source:
    print("Please say something")
    audio = r.listen(source)
    print("Time over, thanks")
try:

    print("I think you said: "+r.recognize_google(audio, language ='hi-IN'));
except sr.UnknownValueError:
    print("Google Speech Recognition could not understand audio")
except sr.RequestError as e:
    print("Could not request results from Google Speech Recognition service; {0}".format(e))
except:
    pass;

配方 4-10。将文本转换为语音

将文本转换成语音是另一种有用的 NLP 技术。

问题

你想把文本转换成语音。

解决办法

最简单的方法是使用 gTTs 库。

它是如何工作的

按照本节中的步骤将文本转换为语音。

步骤 10-1。安装并导入必要的库

这里是图书馆。

!pip install gTTS
from gtts import gTTS

步骤 10-2。用 gTTs 函数运行代码

现在，在您运行下面的代码片段之后，您在 text 参数中输入的任何内容都会被转换成音频。

#chooses the language, English('en')
convert = gTTS(text='I like this NLP book', lang="en", slow=False)
# Saving the converted audio in a mp3 file named
myobj.save("audio.mp3")
#output

Please play the audio.mp3 file saved in your local machine to hear the audio.

食谱 4-11。翻译演讲

语言检测和翻译。

问题

每当您试图分析来自全球各地的博客的数据时，尤其是来自像中国这样的中文占主导地位的国家的网站，分析这样的数据或对这样的数据执行 NLP 任务将是困难的。这就是语言翻译的用处。你想把一种语言翻译成另一种语言。

解决办法

最简单的方法是使用 goslate 库。

它是如何工作的

按照本节中的步骤用 Python 实现语言翻译。

步骤 11-1。安装并导入必要的库

这里是图书馆。

!pip install goslate
import goslate

步骤 11-2。输入文本

输入以下简单的短语。

text = "Bonjour le monde"

步骤 11-3。运行 goslate 函数

运行翻译功能。

gs = goslate.Goslate()
translatedText = gs.translate(text,'en')
print(translatedText)
#output

Hi world

Note

你也可以使用多语言库。它有各种多语言应用程序，并在 NLP 任务中支持 100 多种语言，如语言检测、标记化、NER、词性标注和情感分析。

嗯，感觉很圆满。您已经实现了许多高级的 NLP 应用程序和技术。但这还不是全部，伙计们。前面有更多有趣的章节，您将看到围绕 NLP 的工业应用、它们的解决方案和端到端的实现。

五、实现行业应用

本章为 NLP 的一些行业应用实现了端到端的解决方案。

食谱 1。消费者投诉分类
食谱 2。客户评论情绪预测
食谱三。使用记录链接的数据拼接
食谱 4。主题笔记的文本摘要
食谱 5。文档聚类
食谱 6。搜索引擎和学习排名
食谱 7。假新闻检测
食谱 8。电影类型标记

我们相信，经过四章的学习，您已经熟悉了自然语言处理的概念，并准备好解决一些业务问题。你需要记住所有的四个章节，并思考解决这些问题的方法。它可以是用于构建应用程序的一个概念或一系列概念。

所以，让我们一个接一个地看一下，以便更好地理解端到端的实现。

配方 5-1。实现多类分类

让我们通过解决金融行业的消费者投诉分类来了解如何在 Python 中对文本数据进行多类分类。

问题

每周，美国消费者金融保护局( www.consumerfinance.gov )都会向公司发送数千份消费者对金融产品和服务的投诉，要求公司做出回应。您希望使用投诉描述将这些消费者投诉归入其所属的产品类别。

解决办法

该项目的目标是将投诉分类到特定的产品类别中。由于它有多个类别，所以它成为一个多类分类，可以通过许多机器学习算法来解决。

一旦算法到位，每当有新的投诉时，您可以轻松地对其进行分类，并将其重定向到相关人员。这可以节省大量时间，因为您可以最大限度地减少人工干预来决定该向谁投诉。

它是如何工作的

让我们探索数据，并使用许多机器学习算法建立分类问题，看看哪一个给出更好的结果。

步骤 1-1。从 Kaggle 那里获取数据

前往 www.kaggle.com/subhassing/exploring-consumer-complaint-data/data 下载数据。

步骤 1-2。导入库

这里是图书馆。

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import string
from nltk.stem import SnowballStemmer
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
import os
from textblob import TextBlob
from nltk.stem import PorterStemmer
from textblob import Word
from sklearn.feature_extraction.text import CountVectorizer,TfidfVectorizer
import sklearn.feature_extraction.text as text
from sklearn import model_selection, preprocessing, linear_model, naive_bayes, metrics, svm
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score
from io import StringIO
import seaborn as sns

步骤 1-3。导入数据

导入在上一步中下载的数据。

Data = pd.read_csv("/Consumer_Complaints.csv",encoding='latin-1')

步骤 1-4。分析日期

我们来分析一下柱子。

Data.dtypes
date_received                   object
product                         object
sub_product                     object
issue                           object
sub_issue                       object
consumer_complaint_narrative    object
company_public_response         object
company                         object
state                           object
zipcode                         object
tags                            object
consumer_consent_provided       object
submitted_via                   object
date_sent_to_company            object
company_response_to_consumer    object
timely_response                 object
consumer_disputed?              object
complaint_id                     int64
# Selecting required columns and rows
Data = Data[['product', 'consumer_complaint_narrative']]
Data = Data[pd.notnull(Data['consumer_complaint_narrative'])]
# See top 5 rows
Data.head()
                product                        consumer_complaint_narrative
190126  Debt collection    XXXX has claimed I owe them {$27.00} for XXXX ...
190135    Consumer Loan    Due to inconsistencies in the amount owed that...
190155         Mortgage    In XX/XX/XXXX my wages that I earned at my job...
190207         Mortgage    I have an open and current mortgage with Chase...
190208         Mortgage    XXXX was submitted XX/XX/XXXX. At the time I s...
# Factorizing the category column

Data['category_id'] = Data['product'].factorize()[0]
Data.head()
                product    consumer_complaint_narrative   \
190126  Debt collection    XXXX has claimed I owe them {$27.00} for XXXX ...
190135    Consumer Loan    Due to inconsistencies in the amount owed that...
        category_id
190126            0
190135            1
# Check the distriution of complaints by category
Data.groupby('product').consumer_complaint_narrative.count()
product
Bank account or service     5711
Consumer Loan               3678
Credit card                 7929
Credit reporting           12526
Debt collection            17552
Money transfers              666
Mortgage                   14919
Other financial service      110
Payday loan                  726
Prepaid card                 861
Student loan                2128
# Lets plot it and see
fig = plt.figure(figsize=(8,6))
Data.groupby('product').consumer_complaint_narrative.count().plot.bar(ylim=0)
plt.show()

img/475440_2_En_5_Figa_HTML.jpg

收债和抵押贷款的投诉数量最高。

步骤 1-5。拆分数据

将数据分为训练和验证。

train_x, valid_x, train_y, valid_y = model_selection.train_test_split(Data['consumer_complaint_narrative'], Data['product'])

步骤 1-6。使用 TF-IDF 进行特征工程

创建 TF-IDF 向量，如第三章所述。这里我们认为最大特征是 5000。

encoder = preprocessing.LabelEncoder()
train_y = encoder.fit_transform(train_y)
valid_y = encoder.fit_transform(valid_y)
tfidf_vect = TfidfVectorizer(analyzer='word', token_pattern=r'\w{1,}', max_features=5000)
tfidf_vect.fit(Data['consumer_complaint_narrative'])
xtrain_tfidf =  tfidf_vect.transform(train_x)
xvalid_tfidf =  tfidf_vect.transform(valid_x)

步骤 1-7。建立模型并评估

让我们在单词级 TF-IDF 向量上建立一个线性分类器。默认的超参数用于分类器。可以像 C、max_iter 或 solver 一样更改参数，以获得更好的结果。

model = linear_model.LogisticRegression().fit(xtrain_tfidf, train_y)
# Model summary
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class="ovr", n_jobs=1,
          penalty='l2', random_state=None, solver="liblinear", tol=0.0001,
          verbose=0, warm_start=False)
# Checking accuracy
accuracy = metrics.accuracy_score(model.predict(xvalid_tfidf), valid_y)
print ("Accuracy: ", accuracy)
Accuracy:  0.845048497186
# Classification report
print(metrics.classification_report(valid_y, model.predict(xvalid_tfidf),target_names=Data['product'].unique()))
                        precision    recall  f1-score   support
        Debt collection    0.81      0.79      0.80       1414
          Consumer Loan    0.81      0.56      0.66        942
               Mortgage    0.80      0.82      0.81       1997
            Credit card    0.85      0.85      0.85       3162
       Credit reporting    0.82      0.90      0.86       4367
           Student loan    0.77      0.48      0.59        151
Bank account or service    0.92      0.96      0.94       3717
            Payday loan    0.00      0.00      0.00         26
        Money transfers    0.76      0.23      0.35        172

Other financial service    0.77      0.57      0.65        209
           Prepaid card    0.92      0.76      0.83        545
            avg / total    0.84      0.85      0.84      16702
#confusion matrix
conf_mat = confusion_matrix(valid_y, model.predict(xvalid_tfidf))
# Vizualizing confusion matrix
category_id_df = Data[['product', 'category_id']].drop_duplicates().sort_values('category_id')
category_to_id = dict(category_id_df.values)
id_to_category = dict(category_id_df[['category_id', 'product']].values)
fig, ax = plt.subplots(figsize=(8,6))
sns.heatmap(conf_mat, annot=True, fmt="d", cmap="BuPu",
            xticklabels=category_id_df[['product']].values, yticklabels=category_id_df[['product']].values)
plt.ylabel('Actual')
plt.xlabel('Predicted')
plt.show()

img/475440_2_En_5_Figb_HTML.jpg

85%的准确度对于基线模型来说是好的。除了“发薪日贷款”之外，其他类别的精确度和召回率都相当不错如果你寻找有效负载贷款，大多数错误的预测是债务催收和信用卡，这可能是因为在这一类别的样本数量较少。听起来也像是信用卡的一个子类。您可以将这些样本添加到任何其他组中，以使模型更加稳定。让我们看看一个例子的预测是什么样的。

# Prediction example
texts = ["This company refuses to provide me verification and validation of debt"+ "per my right under the FDCPA. I do not believe this debt is mine."]
text_features = tfidf_vect.transform(texts)
predictions = model.predict(text_features)
print(texts)
print("  - Predicted as: '{}'".format(id_to_category[predictions[0]]))
Result :
['This company refuses to provide me verification and validation of debt per my right under the FDCPA. I do not believe this debt is mine.']
  - Predicted as: 'Credit reporting'

为了提高准确性，您可以做以下事情。

用不同的算法重复这个过程，比如随机森林，SVM，GBM，神经网络，朴素贝叶斯。
也可以使用像 RNN 和 LSTM 这样的深度学习技术(在第六章讨论)。
在每一种算法中，都有许多参数需要调整以获得更好的结果。通过网格搜索可以很容易地做到这一点，网格搜索会尝试所有可能的组合并给出最佳结果。

食谱 5-2。实现情感分析

这个菜谱实现了一个流行的 NLP 工业应用:情绪分析。从商业的角度来看，了解客户对他们提供的产品/服务的反馈是非常重要的，以便对产品/服务进行即兴创作以满足客户的需求。

问题

你想实现情感分析。

解决办法

最简单的方法是使用TextBlob或vaderSentiment库。既然已经用了 TextBlob，现在就用 Vader 吧。

它是如何工作的

按照本节中的步骤对业务问题实现情感分析。

步骤 2-1。定义业务问题

了解产品在市场上的表现。客户对特定产品的反应如何？消费者跨产品的情绪是怎样的？使用情感分析可以回答更多类似的问题。

第 2-2 步。识别潜在的数据来源并提取洞察力

我们有亚马逊美食评论的数据集。让我们利用这些数据，从中获得真知灼见。您可以从 www.kaggle.com/snap/amazon-fine-food-reviews 下载数据。

# Import necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
#Read the data
df = pd.read_csv('Reviews.csv')
# Look at the top 5 rows of the data
df.head(5)
#output

img/475440_2_En_5_Figc_HTML.jpg

# Understand the data types of the columns
df.info()
# Output
Data columns (total 10 columns):
Id                        5 non-null int64
ProductId                 5 non-null object
UserId                    5 non-null object
ProfileName               5 non-null object
HelpfulnessNumerator      5 non-null int64
HelpfulnessDenominator    5 non-null int64
Score                     5 non-null int64
Time                      5 non-null int64
Summary                   5 non-null object
Text                      5 non-null object
dtypes: int64(5), object(5)
# Looking at the summary of the reviews.
df.Summary.head(5)
# Output
0    Good Quality Dog Food
1        Not as Advertised
2    "Delight" says it all
3           Cough Medicine
4              Great taffy
# Looking at the description of the reviews
df.Text.head(5)
#output

0    I have bought several of the Vitality canned d...
1    Product arrived labeled as Jumbo Salted Peanut...
2    This is a confection that has been around a fe...
3    If you are looking for the secret ingredient i...
4    Great taffy at a great price.  There was a wid...

第 2-3 步。预处理数据

你知道这一步的重要性。执行预处理任务，如第二章所述。

# Import libraries
from nltk.corpus import stopwords
from textblob import TextBlob
from textblob import Word
# Lower casing and removing punctuations
df['Text'] = df['Text'].apply(lambda x: " ".join(x.lower() for x in x.split()))
df['Text'] = df['Text'].str.replace('[^\w\s]',")
df.Text.head(5)
# Output
0    i have bought several of the vitality canned d...
1    product arrived labeled as jumbo salted peanut...
2    this is a confection that has been around a fe...
3    if you are looking for the secret ingredient i...
4    great taffy at a great price there was a wide ...
# Removal of stop words
stop = stopwords.words('english')
df['Text'] = df['Text'].apply(lambda x: " ".join(x for x in x.split() if x not in stop))
df.Text.head(5)
# Output
0    bought several vitality canned dog food produc...
1    product arrived labeled jumbo salted peanutsth...
2    confection around centuries light pillowy citr...
3    looking secret ingredient robitussin believe f...
4    great taffy great price wide assortment yummy ...
# Spelling correction
df['Text'] = df['Text'].apply(lambda x: str(TextBlob(x).correct()))
df.Text.head(5)
# Output
0    bought several vitality canned dog food produc...
1    product arrived labelled lumbo halted peanutst...
2    connection around centuries light pillow citie...
3    looking secret ingredient robitussin believe f...
4    great staff great price wide assortment mummy ...
# Lemmatization

df['Text'] = df['Text'].apply(lambda x: " ".join([Word(word).lemmatize() for word in x.split()]))
df.Text.head(5)
# Output
0    bought several vitality canned dog food produc...
1    product arrived labelled lumbo halted peanutst...
2    connection around century light pillow city ge...
3    looking secret ingredient robitussin believe f...
4    great staff great price wide assortment mummy ...

第 2-4 步。分析数据

这一步与预测情绪没有任何联系；我们在这里试图做的是更深入地挖掘数据并理解它。

# Create a new data frame "reviews" to perform exploratory data analysis upon that
reviews = df
# Dropping null values
reviews.dropna(inplace=True)
# The histogram reveals this dataset is highly unbalanced towards high rating.
reviews.Score.hist(bins=5,grid=False)
plt.show()
print(reviews.groupby('Score').count().Id)

img/475440_2_En_5_Figd_HTML.jpg

# To make it balanced data, we sampled each score by the lowest n-count from above. (i.e. 29743 reviews scored as '2')
score_1 = reviews[reviews['Score'] == 1].sample(n=29743)
score_2 = reviews[reviews['Score'] == 2].sample(n=29743)
score_3 = reviews[reviews['Score'] == 3].sample(n=29743)
score_4 = reviews[reviews['Score'] == 4].sample(n=29743)
score_5 = reviews[reviews['Score'] == 5].sample(n=29743)
# Here we recreate a 'balanced' dataset.
reviews_sample = pd.concat([score_1,score_2,score_3,score_4,score_5],axis=0)
reviews_sample.reset_index(drop=True,inplace=True)

如果您正在从头开始训练自己的情感分类器，可以使用这个数据集。要做到这一点，你可以遵循与文本分类相同的步骤(方法 5-1)。这里，我们的目标变量将是使用 score 创建的正数、负数和中性变量。

分数< = 2:负
得分= 3:中性
得分> =4:正

说到这里，让我们回到我们的探索性数据分析。

# Printing count by 'Score' to check dataset is now balanced.
print(reviews_sample.groupby('Score').count().Id)
# Output
Score
1    29743
2    29743
3    29743
4    29743
5    29743
# Let's build a word cloud looking at the 'Summary'  text
from wordcloud import WordCloud
from wordcloud import STOPWORDS
# Wordcloud function's input needs to be a single string of text.
# Here I'm concatenating all Summaries into a single string.
# similarly you can build for Text column
reviews_str = reviews_sample.Summary.str.cat()
wordcloud = WordCloud(background_color='white').generate(reviews_str)
plt.figure(figsize=(10,10))
plt.imshow(wordcloud,interpolation='bilinear')
plt.axis("off")
plt.show()

img/475440_2_En_5_Fige_HTML.jpg

# Now let's split the data into Negative (Score is 1 or 2) and Positive (4 or #5) Reviews

.
negative_reviews = reviews_sample[reviews_sample['Score'].isin([1,2]) ]
positive_reviews = reviews_sample[reviews_sample['Score'].isin([4,5]) ]
# Transform to single string
negative_reviews_str = negative_reviews.Summary.str.cat()
positive_reviews_str = positive_reviews.Summary.str.cat()
# Create wordclouds
wordcloud_negative = WordCloud(background_color='white').generate(negative_reviews_str)
wordcloud_positive = WordCloud(background_color='white').generate(positive_reviews_str)
# Plot
fig = plt.figure(figsize=(10,10))
ax1 = fig.add_subplot(211)
ax1.imshow(wordcloud_negative,interpolation='bilinear')
ax1.axis("off")
ax1.set_title('Reviews with Negative Scores',fontsize=20)

img/475440_2_En_5_Figf_HTML.jpg

ax2 = fig.add_subplot(212)
ax2.imshow(wordcloud_positive,interpolation='bilinear')

ax2.axis("off")
ax2.set_title('Reviews with Positive Scores',fontsize=20)
plt.show()
#output

img/475440_2_En_5_Figg_HTML.jpg

第 2-5 步。使用预先训练好的模型

这一步不是必需的，因为我们不是从零开始构建模型；相反，我们使用来自 Vader perspection 库的预训练模型。

如果您想从头开始构建模型，您可以利用在探索时创建的正类和负类作为目标变量，然后训练模型。你可以按照配方 5-1 中解释的文本分类的相同步骤从头开始构建一个情感分类器。

第 2-6 步。做情感分析

在情感分析中，预训练模型从文本描述中获取输入，并为每个句子输出范围从–1 到+1 的情感得分。

#Importing required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import re
import os
import sys
import ast
plt.style.use('fivethirtyeight')
# Function for getting the sentiment
cp = sns.color_palette()
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
analyzer = SentimentIntensityAnalyzer()
# Generating sentiment for all the sentence present in the dataset
emptyline=[]
for row in df['Text']:
    vs=analyzer.polarity_scores(row)
    emptyline.append(vs)

# Creating new dataframe with sentiments
df_sentiments=pd.DataFrame(emptyline)
df_sentiments.head(5)
# Output
      compound    neg    neu    pos
0     0.9413      0.000  0.503  0.497
1    -0.5719      0.258  0.644  0.099
2     0.8031      0.133  0.599  0.268
3     0.4404      0.000  0.854  0.146
4     0.9186      0.000  0.455  0.545
# Merging the sentiments back to reviews dataframe
df_c = pd.concat([df.reset_index(drop=True), d], axis=1)
df_c.head(3)
#output sample

img/475440_2_En_5_Figh_HTML.jpg

# Convert scores into positive and negetive sentiments

using some threshold
df_c['Sentiment'] = np.where(df_c['compound'] >= 0 , 'Positive', 'Negative')
df_c.head(5)
#output sample

img/475440_2_En_5_Figi_HTML.jpg

第 2-7 步。获得业务洞察力

让我们看看整体情绪是如何使用我们生成的情绪的。

result=df_c['Sentiment'].value_counts()
result.plot(kind='bar', rot=0,color='br');

img/475440_2_En_5_Figj_HTML.jpg

我们只是抽取了 1000 条评论的样本，并完成了情感分析。如果你看，超过 900 (>90%)的评论是积极的，这对任何企业都有好处。

还可以对副产品进行分组，即情绪副产品，以了解高层客户对产品的反馈。

#Sample code snippet
result=df_c.groupby('ProductId')['Sentiment'].value_counts().unstack()
result[['Negative','Positive']].plot(kind='bar', rot=0,color='rb')

类似地，您可以使用时间列和许多其他类似的属性按月分析情绪。

食谱 5-3。应用文本相似度函数

这个食谱包括使用文本相似性的数据拼接。

问题

数据库中有多个表，有时没有一个公共的 ID 或键来连接它们，包括如下场景。

分散在多个表和系统中的客户信息
没有全局键将它们链接在一起
名字和地址有很多变化

解决办法

这可以通过对人口统计的列(如名字、姓氏、地址等)应用文本相似性函数来解决。根据几个公共列的相似性得分，您可以决定记录对是匹配还是不匹配。

它是如何工作的

按照本节中的步骤链接记录。

以下是技术挑战。

需要链接/拼接/消除重复数据的大量记录。
记录来自具有不同模式的不同系统。

没有要合并的全局关键字或客户 ID。数据拼接或链接记录有两种可能的情况。

同一个客户在同一个表中有多条记录，并且您希望进行重复数据删除。
需要合并多个表中相同客户的记录。

对于方案 3a，我们来解决方案 1，即重复数据消除，作为方案 3b 的一部分，我们来解决方案 2，即多个表的记录链接。

同一表中的重复数据删除

步骤 3a-1。阅读并理解数据

首先，你需要数据。

# Import package
!pip install recordlinkage
import recordlinkage
#For this demo let us use the built-in dataset from recordlinkage library
#import data set
from recordlinkage.datasets import load_febrl1
#create a dataframe - dfa
dfA = load_febrl1()
dfA.head()
#output

img/475440_2_En_5_Figk_HTML.jpg

步骤 3a-2。提取阻止密钥

缩小比较窗口并创建记录对。

有非常多的记录；比方说，1 亿条记录意味着(1 亿条选择 2) ≈ 10¹⁶ 可能对
需要一种启发式方法来快速击倒 10¹⁶ 而不输掉很多比赛

这可以通过提取阻止密钥来实现。下面是一个例子。

记录:名:约翰，姓:罗伯茨，地址:马萨诸塞州梅因街 20 号，邮编:01111
封锁键:名字-约翰
搭配:约翰·雷...011
不会搭配:弗兰克·辛纳屈...07030
仅为同一块中的记录生成对

这里，阻塞是在“Sndx-SN”列上完成的，这是姓氏列的 Soundex 值，如第四章中所讨论的。

img/475440_2_En_5_Figl_HTML.jpg

还有许多高级的阻塞技术，如下所示。

标准阻塞
- 单柱的
- 多列
排序邻域
Q-gram:模糊阻塞
激光冲徊化
树冠集群

这可能是一个全新的话题，但是现在，让我们使用名字作为阻塞索引来构建配对。

indexer = recordlinkage.BlockIndex(on='given_name')
pairs = indexer.index(dfA)
print (len(pairs))
#output
2082

步骤 3a-3。进行相似性匹配和评分

这里，我们计算上一步中生成的记录对之间的列(如名、姓和地址)的相似性得分。对于像出生日期、郊区和州这样的列，我们使用精确匹配，因为该列拥有精确的记录是很重要的。

我们使用 Jaro-Winkler，但是你可以使用在第四章中讨论的任何其他相似性度量。

# This cell can take some time to compute.
compare_cl = recordlinkage.Compare()
compare_cl.string('given_name', 'given_name',method='jarowinkler', label="given_name")
compare_cl.string('surname', 'surname', method="jarowinkler", label="surname")
compare_cl.exact('date_of_birth', 'date_of_birth', label="date_of_birth")
compare_cl.exact('suburb', 'suburb', label="suburb")
compare_cl.exact('state', 'state', label="state")
compare_cl.string('address_1', 'address_1',method='jarowinkler', label="address_1")
features = compare_cl.compute(pairs, dfA)
features.sample(5)
#output

img/475440_2_En_5_Figm_HTML.jpg

这里，记录“rec-115-dup-0”与“rec-120-dup-0”进行了比较因为他们的名字(阻止列)匹配，所以在这些对的公共列上计算相似性分数。

步骤 3a-4。使用 ECM 分类器预测记录是否匹配

以下是计算记录匹配概率的无监督学习方法。

# select all the features except for given_name since its our blocking key
features1 = features[['suburb','state','surname','date_of_birth','address_1']]
# Unsupervised learning – probabilistic
ecm = recordlinkage.ECMClassifier()
result_ecm = ecm.learn((features1).astype(int),return_type = 'series')
result_ecm
#output
rec_id rec_id
rec-122-org rec-183-dup-0 0
 rec-248-org 0
 rec-469-org 0
 rec-74-org 0
 rec-183-org 0
 rec-360-dup-0 0
 rec-248-dup-0 0
 rec-469-dup-0 0
rec-183-dup-0 rec-248-org 0
 rec-469-org 0
 rec-74-org 0
 rec-183-org 1
 rec-360-dup-0 0
 rec-248-dup-0 0
 rec-469-dup-0 0
rec-248-org rec-469-org 0
 rec-74-org 0
 rec-360-dup-0 0
 rec-469-dup-0 0
rec-122-dup-0 rec-122-org 1
 rec-183-dup-0 0
 rec-248-org 0
 rec-469-org 0
 rec-74-org 0
 rec-183-org 0
 rec-360-dup-0 0
 rec-248-dup-0 0
 rec-469-dup-0 0
rec-469-org rec-74-org 0
rec-183-org rec-248-org 0

 ..
rec-208-dup-0 rec-208-org 1
rec-363-dup-0 rec-363-org 1
rec-265-dup-0 rec-265-org 1
rec-315-dup-0 rec-315-org 1
rec-410-dup-0 rec-410-org 1
rec-290-org rec-93-org 0
rec-460-dup-0 rec-460-org 1
rec-499-dup-0 rec-499-org 1
rec-11-dup-0 rec-11-org 1
rec-97-dup-0 rec-97-org 1
rec-213-dup-0 rec-421-dup-0 0
rec-349-dup-0 rec-376-dup-0 0
rec-371-dup-0 rec-371-org 1
rec-129-dup-0 rec-129-org 1
rec-462-dup-0 rec-462-org 1
rec-328-dup-0 rec-328-org 1
rec-308-dup-0 rec-308-org 1
rec-272-org rec-308-dup-0 0
 rec-308-org 0
rec-5-dup-0 rec-5-org 1
rec-407-dup-0 rec-407-org 1
rec-367-dup-0 rec-367-org 1
rec-103-dup-0 rec-103-org 1
rec-195-dup-0 rec-195-org 1
rec-184-dup-0 rec-184-org 1
rec-252-dup-0 rec-252-org 1

rec-48-dup-0 rec-48-org 1
rec-298-dup-0 rec-298-org 1
rec-282-dup-0 rec-282-org 1
rec-327-org rec-411-org 0

输出清楚地显示“rec-183-dup-0”与“rec-183-org”匹配，并且可以链接到一个全局 ID。到目前为止，我们所做的被称为重复数据删除:从单个表中识别相同用户的多个记录。

多个表中相同客户的记录

接下来，让我们看看如果记录在多个表中，而没有唯一的 id 来合并，如何解决这个问题。

步骤 3b-1。阅读并理解数据

让我们使用 recordlinkage 库中的内置数据集。

from recordlinkage.datasets import load_febrl4
dfA, dfB = load_febrl4()
dfA.head()
#output

img/475440_2_En_5_Fign_HTML.jpg

dfB.head()
#output

img/475440_2_En_5_Figo_HTML.jpg

步骤 3b-2。块来缩小比较窗口并创建记录对

这与前面的解释相同，将 given_name 视为一个阻塞索引。

indexer = recordlinkage.BlockIndex(on='given_name')
pairs = indexer.index(dfA, dfB)

步骤 3b-3。做相似性匹配

解释是一样的。

compare_cl = recordlinkage.Compare()
compare_cl.string('given_name', 'given_name',method='jarowinkler', label="given_name")
compare_cl.string('surname', 'surname', method="jarowinkler", label="surname")
compare_cl.exact('date_of_birth', 'date_of_birth', label="date_of_birth")
compare_cl.exact('suburb', 'suburb', label="suburb")
compare_cl.exact('state', 'state', label="state")
compare_cl.string('address_1', 'address_1',method='jarowinkler', label="address_1")
features = compare_cl.compute(pairs, dfA, dfB)

features.head(10)
#output

img/475440_2_En_5_Figp_HTML.jpg

因此，这里将记录“rec-1070-org”与“rec-3024-dup-0”、“rec-2371-dup-0”、“rec-4652-dup-0”、“rec-4795-dup-0”和“rec-1314-dup-0”进行比较，因为它们的名字(阻止列)是匹配的，并且在这些对的公共列上计算相似性得分。

步骤 3b-4。使用 ECM 分类器预测记录是否匹配

下面是一个无监督的学习方法来计算记录匹配的概率。

# select all the features except for given_name since its our blocking key
features1 = features[['suburb','state','surname','date_of_birth','address_1']]
# unsupervised learning - probablistic
ecm = recordlinkage.ECMClassifier()
result_ecm = ecm.learn((features1).astype(int),return_type = 'series')
result_ecm
#output sample
rec_id        rec_id
rec-1070-org  rec-3024-dup-0    0
              rec-2371-dup-0    0
              rec-4652-dup-0    0
              rec-4795-dup-0    0
              rec-1314-dup-0    0
rec-2371-org  rec-3024-dup-0    0
              rec-2371-dup-0    1
              rec-4652-dup-0    0
              rec-4795-dup-0    0
              rec-1314-dup-0    0
rec-3582-org  rec-3024-dup-0    0
              rec-2371-dup-0    0
              rec-4652-dup-0    0
              rec-4795-dup-0    0
              rec-1314-dup-0    0
rec-3024-org  rec-3024-dup-0    1
              rec-2371-dup-0    0
              rec-4652-dup-0    0
              rec-4795-dup-0    0
              rec-1314-dup-0    0
rec-4652-org  rec-3024-dup-0    0
              rec-2371-dup-0    0
              rec-4652-dup-0    1
              rec-4795-dup-0    0
              rec-1314-dup-0    0
rec-4795-org  rec-3024-dup-0    0
              rec-2371-dup-0    0
              rec-4652-dup-0    0
              rec-4795-dup-0    1
              rec-1314-dup-0    0
                               ..
rec-2820-org  rec-2820-dup-0    1
              rec-991-dup-0     0
rec-1984-org  rec-1984-dup-0    1
rec-1662-org  rec-1984-dup-0    0
rec-4415-org  rec-1984-dup-0    0
rec-1920-org  rec-1920-dup-0    1
rec-303-org   rec-303-dup-0     1
rec-1915-org  rec-1915-dup-0    1
rec-4739-org  rec-4739-dup-0    1
              rec-4865-dup-0    0
rec-681-org   rec-4276-dup-0    0

rec-4603-org  rec-4848-dup-0    0
              rec-4603-dup-0    1
rec-3122-org  rec-4848-dup-0    0
              rec-4603-dup-0    0
rec-3711-org  rec-3711-dup-0    1
rec-4912-org  rec-4912-dup-0    1
rec-664-org   rec-664-dup-0     1
              rec-1311-dup-0    0
rec-4031-org  rec-4031-dup-0    1
rec-1413-org  rec-1413-dup-0    1
rec-735-org   rec-735-dup-0     1
rec-1361-org  rec-1361-dup-0    1
rec-3090-org  rec-3090-dup-0    1
rec-2571-org  rec-2571-dup-0    1
rec-4528-org  rec-4528-dup-0    1
rec-4887-org  rec-4887-dup-0    1
rec-4350-org  rec-4350-dup-0    1
rec-4569-org  rec-4569-dup-0    1
rec-3125-org  rec-3125-dup-0    1

输出清楚地显示“rec-122-dup-0”与“rec-122-org”匹配，并且可以链接到一个全局 ID。

通过这种方式，您可以创建一个由惟一的全局 ID 和跨表的一致数据组成的数据湖，并执行统计分析。

食谱 5-4。汇总文本数据

如果你只是四处看看，有很多文章和书籍。让我们假设你想学习 NLP 中的一个概念，如果你谷歌它，你找到一篇文章。你喜欢这篇文章的内容，但它太长了，不能再看一遍。您希望对文章进行总结，并将其保存在某个地方，以便以后阅读。

NLP 有一个解决方案。文本摘要有助于做到这一点。你不必每次都阅读整篇文章或整本书。

问题

在 Python 中使用不同算法的文章/文档的文本摘要。

解决办法

文本摘要是在不丢失上下文的情况下将大文档变成小文档的过程，最终节省了读者的时间。这可以使用不同的技术来完成，如下所示。

TextRank:一种基于图的排序算法
基于特征的文本摘要
基于图的算法的 TF-IDF
基于主题的
使用句子嵌入
编码器-解码器模型:深度学习技术

它是如何工作的

探索这个食谱中的前两种方法。

步骤 4-1。使用文本排名

TextRank 是 NLP 的基于图的排序算法。它的灵感来自于 PageRank，这是谷歌搜索引擎中使用的，但专门为文本设计的。它提取主题，从中创建节点，并捕获节点之间的关系来总结文本。

让我们看看如何使用 Python 的 gensim 包来实现。功能是summarize。

首先，让我们导入注释。假设你正在看一篇关于自然语言处理的维基百科文章。

# Import BeautifulSoup and urllib libraries to fetch data from Wikipedia.
from bs4 import BeautifulSoup
from urllib.request import urlopen
# Function to get data from Wikipedia
def get_only_text(url):
 page = urlopen(url)
 soup = BeautifulSoup(page)
 text = ' '.join(map(lambda p: p.text, soup.find_all('p')))
 print (text)
 return soup.title.text, text
# Mention the Wikipedia url
url="https://en.wikipedia.org/wiki/Natural_language_processing"
# Call the function created above
text = get_only_text(url)

# Count the number of letters
len(".join(text))
Result:
Out[74]: 8519
# Lets see first 1000 letters from the text
text[:1000]
Result :
Out[72]: '(\'Natural language processing

- Wikipedia\', \'Natural language processing (NLP) is an area of computer science and artificial intelligence concerned with the interactions between computers and human (natural) languages, in particular how to program computers to process and analyze large amounts of natural language\\xa0data.\\n Challenges in natural language processing frequently involve speech recognition, natural language understanding, and natural language generation.\\n The history of natural language processing generally started in the 1950s, although work can be found from earlier periods.\\nIn 1950, Alan Turing published an article titled "Intelligence" which proposed what is now called the Turing test as a criterion of intelligence.\\n The Georgetown experiment in 1954 involved fully automatic translation of more than sixty Russian sentences into English. The authors claimed that within three or five years, machine translation would be a solved problem.[2]  However, real progress was '
# Import summarize from gensim
from gensim.summarization.summarizer import summarize
from gensim.summarization import keywords
# Convert text to string format
text = str(text)
#Summarize the text with ratio 0.1 (10% of the total words.)
summarize(text, ratio=0.1)
Result:
Out[77]: 'However, part-of-speech tagging introduced the use of hidden Markov models to natural language processing, and increasingly, research has focused on statistical models, which make soft, probabilistic decisions based on attaching real-valued weights to the features making up the input data.\nSuch models are generally more robust when given unfamiliar input, especially input that contains errors (as is very common for real-world data), and produce more reliable results when integrated into a larger system comprising multiple subtasks.\\n Many

of the notable early successes occurred in the field of machine translation, due especially to work at IBM Research, where successively more complicated statistical models were developed.'

就是这样。生成的摘要就是这么简单。如果你看了这个总结和整篇文章，就足够接近了。但是，仍然有很大的改进空间。

#keywords
print(keywords(text, ratio=0.1))
Result:
learning
learn
languages
process
systems
worlds
world
real
natural language processing
research
researched
results
result
data
statistical
hand
generation
generally
generic
general
generated
tasks
task
large
human
intelligence

input
called
calling
calls
produced
produce
produces
producing
possibly
possible
corpora
base
based

第 4-2 步。使用基于特征的文本摘要

基于特征的文本摘要方法从句子中提取一个特征，并检查其重要性。位置、长度、术语频率、命名实体和许多其他特征用于计算分数。

Luhn 的算法是基于特征的算法之一。让我们看看如何使用 sumy 库来实现它。

# Install sumy
!pip install sumy
# Import the packages
from sumy.parsers.html import HtmlParser
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.lsa import LsaSummarizer
from sumy.nlp.stemmers import Stemmer
from sumy.utils import get_stop_words
from sumy.summarizers.luhn import LuhnSummarizer
# Extracting and summarizing
LANGUAGE = "english"
SENTENCES_COUNT = 10
url="https://en.wikipedia.org/wiki/Natural_language_processing"
parser = HtmlParser.from_url(url, Tokenizer(LANGUAGE))
summarizer = LsaSummarizer()
summarizer = LsaSummarizer(Stemmer(LANGUAGE))
summarizer.stop_words = get_stop_words(LANGUAGE)
for sentence in summarizer(parser.document, SENTENCES_COUNT):
    print(sentence)
Result :
[2] However, real progress was much slower, and after the ALPAC report in 1966, which found that ten-year-long research had failed to fulfill the expectations, funding for machine translation was dramatically reduced.
However, there is an enormous amount of non-annotated data available (including, among other things, the entire content of the World Wide Web ), which can often make up for the inferior results if the algorithm used has a low enough time complexity to be practical

, which some such as Chinese Whispers do.
Since the so-called "statistical revolution"
in the late 1980s and mid 1990s, much natural language processing research has relied heavily on machine learning .
Increasingly, however, research has focused on statistical models , which make soft, probabilistic decisions based on attaching real-valued weights to each input feature.
Natural language understanding

Convert chunks of text into more formal representations such as first-order logic structures that are easier for computer programs to manipulate.
[18] ^ Implementing an online help desk system based on conversational agent Authors: Alisa Kongthon, Chatchawal Sangkeettrakarn, Sarawoot Kongyoung and Choochart Haruechaiyasak.
[ self-published source ] ^ Chomskyan linguistics encourages the investigation of " corner cases " that stress the limits of its theoretical models (comparable to pathological phenomena in mathematics), typically created using thought experiments , rather than the systematic investigation of typical phenomena that occur in real-world data, as is the case in corpus linguistics .
^ Antonio Di Marco - Roberto Navigili, "Clustering and Diversifying Web Search Results with Graph Based Word Sense Induction" , 2013 Goldberg, Yoav (2016).
Scripts, plans, goals, and understanding: An inquiry into human knowledge structures ^ Kishorjit, N., Vidya Raj RK., Nirmal Y., and Sivaji B.
^ PASCAL Recognizing Textual Entailment Challenge (RTE-7) https://tac.nist.gov//2011/RTE/ ^ Yi, Chucai; Tian, Yingli (2012), "Assistive Text Reading from Complex Background for Blind Persons" , Camera-Based Document Analysis and Recognition , Springer Berlin Heidelberg, pp.

问题解决了。现在你不必读所有的笔记；只要在我们时间不多的时候读一下摘要就可以了。

你可以使用深度学习技术来获得更好的准确性和更好的结果，比如编码器-解码器模型。你将在下一章看到如何去做。

食谱 5-5。聚类文档

文档聚类，也叫文本聚类，是对文本文档的聚类分析。一个典型的应用是文档管理。

问题

您希望根据模式和相似性对文档进行聚类或分组。

解决办法

文档聚类包括类似的步骤。

符号化。
词干化和词汇化。
删除停用词和标点符号。
计算术语频率或 TF-IDF。
聚类:k 均值/层次。然后，您可以使用任何聚类算法，根据我们生成的特征对不同的文档进行聚类
评估和可视化:可以通过将聚类绘制到二维空间中来可视化聚类结果。

它是如何工作的

步骤 5-1。导入数据和库

下面是库，后面是数据。

!pip install mpld3
import numpy as np
import pandas as pd
import nltk
from nltk.stem.snowball import SnowballStemmer
from bs4 import BeautifulSoup
import re
import os
import codecs
from sklearn import feature_extraction
import mpld3
from sklearn.metrics.pairwise import cosine_similarity
import os
import matplotlib.pyplot as plt
import matplotlib as mpl
from sklearn.manifold import MDS
#Lets use the same complaint dataset we use for classification
Data = pd.read_csv("/Consumer_Complaints.csv",encoding='latin-1')
#selecting required columns and rows
Data = Data[['consumer_complaint_narrative']]
Data = Data[pd.notnull(Data['consumer_complaint_narrative'])]
# lets do the clustering for just 200 documents

. Its easier to interpret.
Data_sample=Data.sample(200)

第 5-2 步。预处理和使用 TF-IDF 特征工程

现在我们来预处理一下。

# Remove unwanted symbol
Data_sample['consumer_complaint_narrative'] = Data_sample['consumer_complaint_narrative'].str.replace('XXXX',")
# Convert dataframe to list
complaints = Data_sample['consumer_complaint_narrative'].tolist()
# create the rank of documents – we will use it later
ranks = []
for i in range(1, len(complaints)+1):
    ranks.append(i)
# Stop Words
stopwords = nltk.corpus.stopwords.words('english')
# Load 'stemmer'
stemmer = SnowballStemmer("english")
# Functions for sentence tokenizer, to remove numeric tokens and raw #punctuation
def tokenize_and_stem(text):
    tokens = [word for sent in nltk.sent_tokenize(text) for word in nltk.word_tokenize(sent)]
    filtered_tokens = []
    for token in tokens:
        if re.search('[a-zA-Z]', token):
            filtered_tokens.append(token)
    stems = [stemmer.stem(t) for t in filtered_tokens]
    return stems
def tokenize_only(text):
    tokens = [word.lower() for sent in nltk.sent_tokenize(text) for word in nltk.word_tokenize(sent)]

    filtered_tokens = []
    for token in tokens:
        if re.search('[a-zA-Z]', token):
            filtered_tokens.append(token)
    return filtered_tokens
from sklearn.feature_extraction.text import TfidfVectorizer
# tfidf vectorizer
tfidf_vectorizer = TfidfVectorizer(max_df=0.8, max_features=200000,
                                 min_df=0.2, stop_words="english",
                                 use_idf=True, tokenizer=tokenize_and_stem, ngram_range=(1,3))
#fit the vectorizer to data
tfidf_matrix = tfidf_vectorizer.fit_transform(complaints)
terms = tfidf_vectorizer.get_feature_names()
print(tfidf_matrix.shape)
(200, 30)

第 5-3 步。使用 k-均值聚类

让我们开始聚类。

#Import Kmeans
from sklearn.cluster import KMeans
# Define number of clusters
num_clusters = 6
#Running clustering algorithm
km = KMeans(n_clusters=num_clusters)
km.fit(tfidf_matrix)
#final clusters
clusters = km.labels_.tolist()
complaints_data = { 'rank': ranks, 'complaints': complaints, 'cluster': clusters }
frame = pd.DataFrame(complaints_data, index = [clusters] , columns = ['rank', 'cluster'])
#number of docs per cluster

frame['cluster'].value_counts()
0 42
1 37
5 36
3 36
2 27
4 22

第 5-4 步。识别集群行为

确定最接近聚类质心的前五个单词。

totalvocab_stemmed = []
totalvocab_tokenized = []
for i in complaints:
    allwords_stemmed = tokenize_and_stem(i)
    totalvocab_stemmed.extend(allwords_stemmed)
    allwords_tokenized = tokenize_only(i)
    totalvocab_tokenized.extend(allwords_tokenized)
vocab_frame = pd.DataFrame({'words': totalvocab_tokenized}, index = totalvocab_stemmed)
#sort cluster centers by proximity to centroid
order_centroids = km.cluster_centers_.argsort()[:, ::-1]
for i in range(num_clusters):
    print("Cluster %d words:" % i, end=")
    for ind in order_centroids[i, :6]:
        print(' %s' % vocab_frame.ix[terms[ind].split(' ')].values.tolist()[0][0].encode('utf-8', 'ignore'), end=',')
    print()
Cluster 0 words: b'needs', b'time', b'bank', b'information', b'told'
Cluster 1 words: b'account', b'bank', b'credit', b'time', b'months'
Cluster 2 words: b'debt', b'collection', b'number', b'credit', b"n't"
Cluster 3 words: b'report', b'credit', b'credit', b'account', b'information'

Cluster 4 words: b'loan', b'payments', b'pay', b'months', b'state'
Cluster 5 words: b'payments', b'pay', b'told', b'did', b'credit'

第 5-5 步。在 2D 图上画出聚类图

最后，绘制聚类图。

#Similarity
similarity_distance = 1 - cosine_similarity(tfidf_matrix)
# Convert two components as we're plotting points in a two-dimensional plane
mds = MDS(n_components=2, dissimilarity="precomputed", random_state=1)
pos = mds.fit_transform(similarity_distance)  # shape (n_components, n_samples)
xs, ys = pos[:, 0], pos[:, 1]
#Set up colors per clusters using a dict
cluster_colors = {0: '#1b9e77', 1: '#d95f02', 2: '#7570b3', 3: '#e7298a', 4: '#66a61e', 5: '#D2691E'}
#set up cluster names using a dict
cluster_names = {0: 'property, based, assist',
                 1: 'business, card',
                 2: 'authorized, approved, believe',
                 3: 'agreement, application,business',
                 4: 'closed, applied, additional',
                 5: 'applied, card'}
# Finally plot it
%matplotlib inline
#Create data frame that has the result of the MDS and the cluster
df = pd.DataFrame(dict(x=xs, y=ys, label=clusters))
groups = df.groupby('label')
# Set up plot
fig, ax = plt.subplots(figsize=(17, 9)) # set size
for name, group in groups:
    ax.plot(group.x, group.y, marker="o", linestyle=", ms=20,

            label=cluster_names[name], color=cluster_colors[name],
            mec='none')
    ax.set_aspect('auto')
    ax.tick_params(\
        axis= 'x',
        which='both',
        bottom='off',
        top='off',
        labelbottom='off')
    ax.tick_params(\
        axis= 'y',
        which='both',
        left='off',
        top='off',
        labelleft='off')
ax.legend(numpoints=1)
plt.show()

img/475440_2_En_5_Figq_HTML.jpg

就是这样。使用 k-means 聚类将 200 个投诉分成六组。它使用 TF-IDF 将类似的投诉分为六类。你也可以使用嵌入这个词，并解决这个问题，以实现更好的集群。2D 图提供了一个易于理解的集群行为视图。您会看到相同颜色的点(文档)彼此更靠近。

配方 5-6。搜索引擎中的自然语言处理

这个菜谱从 NLP 的角度讨论了构建一个搜索引擎需要什么。然而，实现超出了本书的范围。

问题

你想知道构建搜索引擎的架构和 NLP 管道。

解决办法

图 5-1 显示了整个过程。每个步骤都在“工作原理”一节中进行了解释。

img/475440_2_En_5_Fig1_HTML.jpg

图 5-1

搜索引擎中的自然语言处理过程

它是如何工作的

从 NLP 的角度来看，一步一步地遵循这个架构来构建一个搜索引擎。

步骤 6-1。预处理

每当用户输入搜索查询时，它都会被传递到 NLP 预处理管道。

噪音和停用词的去除
标记化
堵塞物
词汇化

第 6-2 步。使用实体提取模型

来自管道的输出被输入到实体提取模型中。您可以通过使用任何像 StanfordNER 或 NLTK 这样的库来构建定制的实体识别模型。

或者，您可以使用条件随机场或马尔可夫模型从头开始构建实体识别模型。

例如，假设你正在为一个电子商务巨头建立一个搜索引擎。以下是可以用来训练模型的实体。

性别
颜色
品牌
积范畴
产品类型
价格
大小

你可以使用像 RNN 和 LSTM 这样的深度学习框架来构建命名实体消歧。这对实体提取器理解使用实体的内容是非常重要的。比如，粉色可以是一种颜色，也可以是一个品牌。内德有助于消除这种歧义。

书呆子模型构建步骤。

数据清理和预处理
培训 NER 模型
测试和验证
部署

培训/建立书呆子模型的方法

命名实体识别和消歧
斯坦福 NER 定制
循环神经网络(RNN)–LSTM(长短期记忆)使用上下文来消除歧义
联合命名实体识别和消歧

第 6-3 步。进行查询增强/扩展

理解实体可能的同义词非常重要，以确保搜索结果不会错过潜在的相关性。比如男鞋也可以叫做男鞋、男式运动鞋、男式正装鞋、男式休闲鞋，或者男式运动鞋。

使用本地训练的单词嵌入(使用单词 2vec/ 手套模型)来实现这一点。

第 6-4 步。使用搜索平台

Solr 或 Elasticsearch 等搜索平台的主要功能包括全文搜索点击突出显示、分面搜索、实时索引、动态聚类和数据库集成。这与 NLP 无关。从端到端应用的角度来看，我们刚刚向您介绍了这是什么。

第 6-5 步。学会排名

一旦从 Solr 或 Elasticsearch 获取了搜索结果，就应该使用过去的行为根据用户的偏好对它们进行排序。

食谱 5-7。检测假新闻

在社交媒体时代，假新闻带来了很多问题。当然，没有人会想到假新闻会成为这个时代最大的问题之一。但是，像所有问题一样，数据科学也有一个解决这个问题的方法，如果不是非常准确，也有一些可接受的准确性。

问题

为什么假新闻是一个需要解决的重要问题？近年来，假新闻的制造和传播显著增加，主要是在社交媒体平台上，如 WhatsApp、脸书和 Instagram。假新闻可以出现在任何类型的新闻中——政治、地方、健康、娱乐、技术相关问题等等。许多错误的决定都是基于未经证实的假新闻。区分真假新闻很有挑战性。

目标是建立一个二元分类模型，可以根据内容区分真实新闻和虚假新闻。

解决办法

这是一个二元分类问题。假新闻一类，真新闻一类。除了这些类，还有一组文本文档。以下是解决问题的步骤。

首先，将原始数据转换为数据帧，以便进一步处理。
预处理和清理文本数据。
清洗完数据后，进行特征工程。
对数据进行探索以发掘潜在的洞察力。
这些特征构建了模型，在本例中，是一个使用逻辑回归、随机森林、朴素贝叶斯、SVM 等算法的分类器模型。
模型的超参数被调整和评估。最后，对模型的结果进行了验证。

它是如何工作的

步骤 7-1。收集数据

让我们使用 http://web.eecs.umich.edu/~mihalcea/downloads/fakeNewsDatasets.zip/ 的免费源数据集。一小部分数据已经被标记。

“假新闻”数据集包含六个不同的领域:商业、技术、政治、教育、体育和娱乐。数据集中包含的合法新闻来自各种新闻网站，如 ABC 新闻、今日美国、CNN、纽约时报、福克斯新闻频道、彭博等等。该数据集中的假新闻由使用 Mechanical Turk 编写的真实新闻的虚假版本组成。

所有的原始新闻都被聚合起来放在 fakenews_dataset.csv 中，这是本书用到的。

第 7-2 步。安装库

让我们导入该任务所需的所有库。

# Data Manipulation
import pandas as pd
import numpy as np

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Plot the Figures Inline
%matplotlib inline

#NLP
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from textblob import Word
from sklearn.feature_extraction.text import TfidfVectorizer

#Machine Learning

from sklearn.ensemble import RandomForestClassifier
from sklearn import model_selection, preprocessing, linear_model, naive_bayes, metrics, svm
from sklearn.model_selection import train_test_split

第 7-3 步。分析数据

让我们导入数据集。

#importing dataset
df = pd.read_csv(' fakenews_dataset.csv')
df.shape

(980, 3)

df.sample(5)

img/475440_2_En_5_Figr_HTML.jpg

数据框由 980 个观测值和三列组成。

类别是新闻文章所属的类别(即名人、教育(edu)等。).
新闻是整个新闻内容
假的是标签；1 表示新闻文章是假的，0 表示是真的。

第 7-4 步。做探索性数据分析

我们来深入分析一下数据。

df['Category'].value_counts()

Celebrity    500
biz           80
tech          80
sports        80
entmt         80
polit         80
edu           80
Name: Category, dtype: int64

#no. of news per category
sns.countplot(x='Category', data=df)

img/475440_2_En_5_Figs_HTML.jpg

#adding additional column for length

df['length']=df['News'].str.len()
df.head()

img/475440_2_En_5_Figt_HTML.jpg

#Cheking max and min length of the News articles

maxlength = df['length'].max()
minlength = df['length'].min()
maxlength,minlength

(76490, 144)

合法新闻与假新闻的文章长度柱状图。

df.hist(column='length', by='Fake', bins=50,figsize=(12,4),color='orange')

img/475440_2_En_5_Figu_HTML.jpg

假新闻文章大多不到 5000 字。

各种类别的文章长度。

df.hist(column='length', by='Category', bins=50,figsize=(20,10),color='orange')

img/475440_2_En_5_Figv_HTML.jpg

关于名人的文章比其他新闻文章要长。

第 7-5 步。预处理数据

数据预处理包括以下四个步骤。

将转换为小写。

df['News'] = df['News'].apply(lambda x: " ".join(x.lower() for x in x.split()))
print(df['News'].head())

去掉标点符号。

df['News'] = df['News'].str.replace('[^\w\s]','')
print(df['News'].head())

去掉停止的话。停用词从 nltk 库中导入。

stop = stopwords.words('english')
df['News'] = df['News'].apply(lambda x: " ".join(x for x in x.split() if x not in stop))

做词汇化。

from nltk.stem import WordNetLemmatizer
df['News'] = df['News'].apply(lambda x: " ".join([Word(word).lemmatize() for word in x.split()]))

df['News'].head()

0    jennifer aniston hint friend reunionjennifer a...
1    brad pitt text jennifer aniston nonstop seek i...
2    jennifer aniston pregnant miracle baby 47 repo...
3    heartbroken jennifer aniston run border adoptj...
4    jennifer aniston cheated brad pitt matt leblan...
Name: News, dtype: object

第 7-6 步。使用训练测试分割

从 sklearn.model_selection 导入 train_test_split 库，将数据帧分成两部分。

# splitting into training and testing data
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.3, random_state=100, stratify=y)

#validate the shape of train and test dataset
print (X_train.shape)
print (y_train.shape)

print (X_test.shape)
print (y_test.shape)

(686, 3)
(686,)
(294, 3)
(294,)

第 7-7 步。做特征工程

tfidf 矢量器库(来自sklearn.feature_extraction.text)从文本中生成特征。

tfidf_vect = TfidfVectorizer(analyzer='word', token_pattern=r'\w{1,}', max_features=5000)
tfidf_vect.fit(df['News'])
xtrain_tfidf =  tfidf_vect.transform(X_train['News'])
xtest_tfidf =  tfidf_vect.transform(X_test['News'])

这些特征是从文本中提取出来的。下一步让我们构建模型。

第 7-8 步。建立一个模型

这是一个以假新闻和真新闻为类的分类算法。文本已被转换为特征。我们来试试各种机器学习算法，看看哪种准确率更好。

建立模型时考虑了各种算法。

朴素贝叶斯分类器
线性分类器(逻辑回归)
支持向量机分类器
随机森林分类器

让我们构建一个可以用于各种算法的通用函数。

def train_model(classifier, feature_vector_train, label, feature_vector_test):

    # fitting
    classifier.fit(feature_vector_train, label)

    # predicting
    predictions = classifier.predict(feature_vector_test)

        return metrics.accuracy_score(predictions, y_test)

先说朴素贝叶斯。

accuracy = train_model(naive_bayes.MultinomialNB(), xtrain_tfidf, y_train, xtest_tfidf)
print ("Accuracy of Naive Bayes: ", accuracy)

Accuracy:  0.47959183673469385

下面是逻辑回归。

accuracy = train_model(linear_model.LogisticRegression(), xtrain_tfidf, y_train, xtest_tfidf)
print ("Accuracy of logistic regression: ", accuracy)

Accuracy:  0.5374149659863946

SVM。

accuracy = train_model(svm.SVC(), xtrain_tfidf, y_train, xtest_tfidf)
print ("Accuracy of SVM: ", accuracy)

Accuracy:  0.5306122448979592

下面是随机森林。

accuracy = train_model(RandomForestClassifier(), xtrain_tfidf, y_train, xtest_tfidf)
print ("Accuracy of RandomForest: ", accuracy)

Accuracy:  0.5884353741496599

观察:二项式线性分类器(logistics regression)被认为是这个问题中最合适的算法。让我们再次构建模型，并了解其他精度参数

LogisticRegression()函数存在于 sklearn 的 linear_model 库中。

#fit the model
model=linear_model.LogisticRegression()
nb = model.fit(xtrain_tfidf,y_train)
nb

分类器模型的结果以二进制数组的形式显示预测，其中 1 表示虚假，0 表示合法。

## Model Results:

predictions = nb.predict(xtest_tfidf)
predictions

img/475440_2_En_5_Figw_HTML.jpg

模型评估

可以基于两个参数来评估分类器模型。

| 长吨 | 冰点 | | 【数学】函数 | 东帝汶的网络域名代号 |

混淆矩阵

以下是模型的混淆矩阵报告。

from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
print(classification_report(y_test, predictions,target_names=["Legit", "Fake"]))

              precision    recall  f1-score   support

      Legit       0.54      0.56      0.55       147
       Fake       0.54      0.52      0.53       147

avg / total       0.54      0.54      0.54       294

数据的 f1 总分是 54%。

AUC(曲线下面积)分数

img/475440_2_En_5_Figx_HTML.jpg

AUC for the built model.
from sklearn.metrics import roc_auc_score
nb_auc = roc_auc_score(y_test, nb.predict_proba(xtest_tfidf)[:, 1])
print("AUC for Model: {:.3f}".format(nb_auc))

AUC for Model: 0.559

曲线下面积为 56%。

可以看到有高达 55%的准确率。您可以通过更多的标记数据收集和调整参数来提高准确性。

让我们看看超参数调整能否提高精度。

第 7-9 步。调整超参数

让我们使用网格搜索参数调整，它有条不紊地为网格中指定的每个参数组合构建和评估模型。

以下是超参数调整的步骤。

导入参数网格。
符合模型。

from sklearn.model_selection import GridSearchCV
param_grid = {'C': [0.001, 0.01, 0.1, 1, 10, 15,20,100]}

进行网格搜索。

cf_model=linear_model.LogisticRegression()

将模型拟合到训练数据上。

grid_search = GridSearchCV(cf_model, param_grid, cv=5)

评估。

grid_search.fit(xtrain_tfidf, y_train)

找到模型的最佳参数。

print("Test set score: {:.2f}".format(grid_search.score(xtest_tfidf, y_test)))

Test set score: 0.54

使用这些调整后的参数重新构建模型。

print("Best parameters: {}".format(grid_search.best_params_))
print("Best cross-validation score: {:.2f}".format(grid_search.best_score_))

Best parameters: {'C': 20}
Best cross-validation score: 0.55

print("Best estimator:\n{}".format(grid_search.best_estimator_))
Best estimator:
LogisticRegression(C=20, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

model=linear_model.LogisticRegression(C=20, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)
nb=model.fit(xtrain_tfidf, y_train)
nb

LogisticRegression(C=20, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

第 7-10 步。生效

一旦模型被最佳参数拟合，它就被交叉验证以检查模型在文本数据上的准确性。验证参数是相同的。

以下是模型的混淆矩阵报告。

from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report

print(classification_report(y_test, predictions,target_names=["Legit", "Fake"]))

              precision    recall  f1-score   support

       Legit       0.53      0.54      0.53       147
        Fake       0.53      0.52      0.53       147

    accuracy                           0.53       294
   macro avg       0.53      0.53      0.53       294
weighted avg       0.53      0.53      0.53       294

AUC for the tuned model.
from sklearn.metrics import roc_auc_score
nb_auc = roc_auc_score(y_test, nb.predict_proba(xtest_tfidf)[:, 1])
print("AUC for tuned SVC model: {:.3f}".format(nb_auc))

AUC for tuned SVC model: 0.525

摘要

这个食谱为假新闻分类建立了一个基线模型，并使用了一个来自多个新闻频道的免费源数据集。首先，您学习了如何使用 TF-IDF 创建数据集、预处理数据以及执行特征工程。后来你看到了不同的机器学习方法来训练模型。最后，您观察到，与其他分类器相比，线性分类器的性能更好。

使用不同的方法来提高精确度还有很大的空间。这只是基线模型，您可以使用高级功能工程技术(如单词嵌入)来提高准确性。你也可以用深度学习的方法来调查是否有准确度的提高。

食谱 5-8。电影类型标记

多标记/标签分类源于对文本分类问题的研究，每个问题可以同时属于多个预定义的主题。例如，这可以基于电影的绘图摘要找到电影所属的流派类型。在多标签分类中，学习集由与一组标签相关联的实例组成。任务是通过分析具有已知标签集的学习实例来预测不可见的实例标签集。

多类别分类假设每个样本被分配给一个且仅一个标签:客户可以是流失客户，也可以不是，但不能两者都是。例如，多标签分类可能是文本可能同时涉及宗教、政治、金融、教育或任何这些内容。

问题

将电影分类是经典的人工智能问题之一。像 IMDb 这样的在线电影预订平台和评论网站将电影分为不同的类型。类型可以是动作片、冒险片、喜剧片、爱情片等等。

在 IMDb，每部电影都有一个与之相关的流派。但挑战在于，一部电影也可以有多种类型。例如，复仇者联盟:残局被标记为动作、冒险和幻想类型。

img/475440_2_En_5_Figy_HTML.jpg

我们的目标是在给定电影描述的情况下标记电影类型。该模型必须预测一部电影可能属于的所有可能的类别(类型)。虽然你以前已经看过简单的多类分类，但是这个食谱探索了如何解决多标签学习和分类问题。

解决办法

你遇到过许多分类问题，所有这些问题要么是二元类问题，要么是多类问题。现在让我们探索一种独特的问题，称为多标签学习解决电影类型标签。

那么，什么是多标记分类或者俗称的多标记学习呢？数据集中的每个观察值(x)都有一个包含多个类别的目标变量(y)。例如，带有描述的电影名称是你的独立特征，相关的类型(动作、冒险、幻想)是因变量。

以下是解决多标签问题的一些方法。

一个对另一个
二元相关
分类链
powerset 标签

迎面流

但在此之前，你需要处理一些与文本相关的日常任务。以下是端到端解决这个问题的步骤。

数据预处理和清洗
特征生成
探索性数据分析
多标记分类器
验证和预测

它是如何工作的

步骤 8-1。收集数据

我们使用的是已经标记的免费源数据集。然而，只有一小部分数据被标记。数据量越大，模型的准确性和泛化能力就越强。

从 www.kaggle.com/cryptexcode/mpst-movie-plot-synopses-with-tags#mpst_full_data.csv 下载数据集。

步骤 8-2。安装库

导入文本挖掘任务所需的所有基本库。

# Data Manipulation
import numpy as np
import pandas as pd

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Plot the Figures Inline
%matplotlib inline

#Natural language processing
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from textblob import Word
from sklearn.feature_extraction.text import TfidfVectorizer

#Machine Learning

from sklearn.model_selection import train_test_split

步骤 8-3。分析数据

导入之前下载的数据集。

#Importing dataset
df = pd.read_csv('movies_genres1.csv',encoding = 'ISO-8859-1')
df.shape

(9502, 30)

#Top 5 rows to see the data
df.head()

img/475440_2_En_5_Figz_HTML.jpg

df.columns

Index(['title', 'plot', 'Action', 'Adult', 'Adventure', 'Animation',
       'Biography', 'Comedy', 'Crime', 'Documentary', 'Drama', 'Family',
       'Fantasy', 'Game-Show', 'History', 'Horror', 'Music', 'Musical',
       'Mystery', 'News', 'Reality-TV', 'Romance', 'Sci-Fi', 'Short', 'Sport',
       'Talk-Show', 'Thriller', 'War', 'Western', 'count', 'length', 'desc'],
      dtype='object')

步骤 8-4。做探索性数据分析

连接电影标题和情节。

df["desc"] = df["title"].map(str) + df["plot"]
df = df[df['desc'].notnull()]

以下是按流派分类的标签。

print (df.apply(pd.to_numeric, errors='coerce').sum())

title                0.0
plot                 0.0
Action             659.0
Adult                7.0
Adventure          547.0
Animation          689.0
Biography           71.0
Comedy            2217.0
Crime              914.0
Documentary       1377.0
Drama             3150.0
Family            1075.0
Fantasy            377.0
Game-Show          103.0
History            166.0
Horror             146.0
Music              194.0
Musical             29.0
Mystery            633.0
News              1069.0
Reality-TV        1008.0
Romance           1338.0
Sci-Fi             477.0
Short               61.0
Sport              286.0
Talk-Show          733.0
Thriller           488.0
War                 67.0
Western            201.0
count            18082.0
length         3623621.0
desc                 0.0
Total_tags       18082.0
dtype: float64

大多数电影被归入戏剧类，其次是喜剧类。

df.groupby(['Total_tags']).size()

Total_tags
1     4540
2     2969
3     1038
4      510
5      279
6      121
7       27
8       15
10       2
dtype: int64

4540 部电影只被标记为一种类型，大约 1000 部电影属于三种类型。

步骤 8-5。预处理数据

数据预处理包括以下四个步骤。

将转换为小写。

df['desc'] = df['desc'].apply(lambda x:" ".join(x.lower() for x in x.split()))

去掉标点符号。

df['desc'] = df['desc'].str.replace('\d+', '')
df['desc'] = df['desc'].str.replace('[^\w\s]','')

移除停止字。

The stopwords are imported from the nltk library.

stop = stopwords.words('english')
df['desc'] = df['desc'].apply(lambda x: " ".join(x for x in x.split() if x not in stop))

词汇化。

from nltk.stem import WordNetLemmatizer
df['desc'] = df['desc'].apply(lambda x: " ".join([Word(word).lemmatize() for word in x.split()]))

df['desc'].head()

0    dayslater dayslater interactive comedy series ...
1    lawstinwoods happening gang discus shock reali...
2    mykpop kpop growing check mnet america newest ...
3    allo allo general conference general gather co...
4    til death mixed double steph introduces eddie ...
Name: desc, dtype: object

在完成预处理步骤后，文本是干净的，可以进入下一步了。

步骤 8-6。使用训练测试分割

#create input dataset without including target variable( class column)
X = df['desc']

X.shape

(9501,)

#Create the target data with only class column
y = df.iloc[:,2:29]
y.shape

(9501, 27)

# splitting into training and testing data
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.3, random_state=100, stratify=y)

#validate the shape of train and test dataset
print (X_train.shape)
print (y_train.shape)

print (X_test.shape)
print (y_test.shape)

(6650,)
(6650, 27)
(2851,)
(2851, 27)

步骤 8-7。做特征工程

tfidf 矢量器库(来自sklearn.feature_extraction.text)从文本中生成特征。

#Generating features using tfidf

tfidf_vect = TfidfVectorizer(analyzer='word', token_pattern=r'\w{1,}', max_features=5000)
tfidf_vect.fit(df['desc'])

xtrain_tfidf =  tfidf_vect.transform(X_train)
xtest_tfidf =  tfidf_vect.transform(X_test)

第八步。做模型建立和预测

有两种方法可以解决这个问题。让我们看看所有的方法，并使用不同的库来实现它们。

问题转化
算法适应

先说问题转化方法。

问题转化

问题转换方法背后的基本思想是使用数据操作技术将多标签转换成单个标签。这是使用 skmultilearn 包完成的，它支持不同类型的转换。

有三种可能的转换类型。让我们讨论这三个并使用下面的库实现它们:二进制相关性，

二元相关性

二元相关性将每个标签转换成不同的单类分类问题。如果有五个标签，这种方法会为一个标签创建五个新数据集。这些创建的数据集分别用于单标签分类器。

img/475440_2_En_5_Figaa_HTML.png

让我们实现这个方法，看看结果是什么样的。

# Import the packages

from skmultilearn.problem_transform import BinaryRelevance
from sklearn.naive_bayes import GaussianNB

# initialize binary relevance multi-label classifier with a gaussian naive Bayes base classifier

classifier = BinaryRelevance(GaussianNB())

# train the algorithms
classifier.fit(xtrain_tfidf, y_train)

# predict
predictions = classifier.predict(xtest_tfidf)

#output
print(predictions)
(16, 0)        1
  (38, 0)      1
  (42, 0)      1
  (49, 0)      1
  (52, 0)      1
  (56, 0)      1
  (62, 0)      1
  (72, 0)      1
  (141, 0)     1
  (171, 0)     1
  (173, 0)     1
  (198, 0)     1
  (243, 0)     1
  (258, 0)     1
  (292, 0)     1
  (343, 0)     1
  (371, 0)     1
  (398, 0)     1
  (429, 0)     1

#Evaluation

from sklearn.metrics import accuracy_score
accuracy_score(y_test,predictions)

0.21220624342336022

metrics.hamming_loss(y_test, predictions)

0.07472361874326097

请注意，我们的准确率非常低，只有 21%。然而，它会随着更多的训练数据而增加。你可以多尝试几个算法，看能不能得出更好的数字。

现在让我们尝试二进制相关的 SVM，看看它对多标记分类的性能。

#import the library
from skmultilearn.problem_transform import BinaryRelevance
from sklearn.svm import SVC

# define the classifier

classifier = BinaryRelevance(
    classifier=SVC(),
    require_dense=[False, True]
)

# train

classifier.fit(xtrain_tfidf, y_train)

# predict
predictions = classifier.predict(xtest_tfidf)

# evaluation
accuracy_score(y_test,predictions)

0.09

即使是 SVM 也不能产生好的结果。

让我们转到下一种称为分类器链的多标记算法，看看它是否能提高精确度。

分类器链

分类器链是一个分类器链，因此得名。第一个分类器位于 x(独立特征)和其中一个标签之间。对于下一个分类器，甚至来自前一个分类器的标签也被用作输入。让我们通过下面的例子来理解这一点。

这里，x 是独立特征(文本)，y 是标签/目标。

img/475440_2_En_5_Figab_HTML.png

让我们使用多重学习库来实现。

# using classifier chains
from skmultilearn.problem_transform import ClassifierChain
from sklearn.naive_bayes import GaussianNB

# initialize classifier chains multi-label classifier
# with a gaussian naive Bayes base classifier

classifier = ClassifierChain(GaussianNB())

# train
classifier.fit(xtrain_tfidf, y_train)

# predict

predictions = classifier.predict(xtest_tfidf)

accuracy_score(y_test,predictions)

0.21255699754472115

即使这种方法也不能给出有希望的结果。让我们继续下一个类型，看看算法是如何执行的。

Powerset 标签

此方法将多标签分类器转换为多类分类器。在所有唯一标签组合上创建一个标签。

让我们用下面的例子来看看。

img/475440_2_En_5_Figac_HTML.png

这就是实现。

# using Label Powerset

from skmultilearn.problem_transform import LabelPowerset
from sklearn.naive_bayes import GaussianNB

# initialize Label Powerset multi-label classifier
# with a gaussian naive bayes base classifier

classifier = LabelPowerset(GaussianNB())

# train

classifier.fit(xtrain_tfidf, y_train)

# predict

predictions = classifier.predict(xtest_tfidf)
accuracy_score(y_test,predictions)

0.32620133286566116

与以前的方法相比，精确度略有提高。让我们试试超参数调优，看看你能不能让它变得更好。

众所周知，超参数调整是提高精度和为算法选择最佳参数的一个重要方面。您可以利用 GridSearchCV 进行超参数调优，如下所示。

#Import libraries
from skmultilearn.problem_transform import LabelPowerset
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier

#define the parameters

parameters = [
    {
        'classifier': [MultinomialNB()],
        'classifier__alpha': [0.7, 1.0],
    },
    {
        'classifier': [RandomForestClassifier()],
        'classifier__criterion': ['gini', 'entropy'],
        'classifier__n_estimators': [10, 20, 50],
    },
]

# tuning
clf = GridSearchCV(LabelPowerset(), parameters, scoring='accuracy')
clf.fit(xtrain_tfidf, y_train)

#print the best accuracy
print (clf.best_params_, clf.best_score_)
{'classifier': RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=50, n_jobs=None,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False), 'classifier__criterion': 'gini', 'classifier__n_estimators': 50}

0.4548872180451128

这比以前的方法有了很大的改进。通过调优，可以达到 45%的准确率。

适应算法

适配的算法适配机器学习算法以直接执行多标签学习。例如，多标签版本的 kNN 由 MLkNN 表示。

来做实现吧。

#convert the data to matrix
y_train = y_train.as_matrix()

#import the package
from skmultilearn.adapt import MLkNN
classifier = MLkNN(k=20)

# train
classifier.fit(xtrain_tfidf, y_train)

# predict
predictions = classifier.predict(xtest_tfidf)

accuracy_score(y_test,predictions)

0.364784286215363

# import
from skmultilearn.adapt import BRkNNaClassifier
classifier = BRkNNaClassifier(k=3)

# train
classifier.fit(xtrain_tfidf, y_train)

# predict
predictions = classifier.predict(xtest_tfidf)
print(predictions)

(0, 7)         1
  (1, 5)       1
  (1, 8)       1
  (2, 8)       1
  (2, 19)      1
  (3, 17)      1
  (3, 23)      1
  (4, 23)      1
  (5, 5)       1
  (5, 9)       1
  (6, 5)       1
  (6, 20)      1
  (9, 19)      1
  (10, 7)      1
  (11, 17)     1
  (11, 23)     1
  (12, 8)      1
  (13, 2)      1
  (13, 7)      1
  (14, 8)      1
  (14, 19)     1
  (15, 5)      1
  (15, 9)      1
  (16, 5)      1
  (18, 6)      1

accuracy_score(y_test,predictions)

0.45001753770606806

类似地，您可以利用支持向量机，如下所示。

from skmultilearn.adapt import MLTSVM
classifier = MLTSVM(c_k = 2**-1)

# train
classifier.fit(xtrain_tfidf, y_train)

# predict

predictions = classifier.predict(xtest_tfidf)

accuracy_score(y_test,predictions)

你已经实现了一个几端 - 到 - 端的项目。不刺激吗？

下一章看在用 NLP 和深度学习解决问题。****

六、面向自然语言处理的深度学习

在本章中，您将实现 NLP 的深度学习。涵盖以下食谱。

食谱 1。使用深度学习的信息检索
食谱 2。使用 CNN 的文本分类，RNN，LSTM
食谱三。使用电子邮件 LSTM 预测下一个单词/单词序列
食谱 4。堆栈溢出问题推荐

深度学习简介

深度学习是受大脑功能启发的机器学习的一个子领域。就像神经元在大脑中相互连接一样，神经网络也以同样的方式工作。每个神经元接受输入，在神经元内进行一些操作，并产生更接近预期输出的输出(在标记数据的情况下)。

神经元内发生的事情是我们感兴趣的，以获得最准确的结果。它为每个输入赋予权重，并生成一个函数来累积所有这些权重，并将它们传递给下一层，最终是输出层。

该网络有三个组成部分。

输入层
隐藏层
输出层

img/475440_2_En_6_Figa_HTML.jpg

基于问题或数据，函数可以是不同的类型。这些也被称为激活功能 s 。下面描述了这些类型。

线性激活函数:线性神经元采用加权输入的线性组合，输出可以采用负无穷大到无穷大之间的任何值。
非线性激活函数:这些是最常用的，它们使输出限制在某个范围内。
- sigmoid 或 logit 激活函数通过应用对数函数在 0 和 1 之间缩小输出，使分类问题变得更容易。
- softmax 函数类似于 sigmoid，但它计算 n 个不同类上的事件概率，这有助于确定多类分类问题中的目标。
- 双曲正切函数是–1 比 1；否则与乙状结肠相同。
- 整流线性单位激活函数将任何小于零的值转换为零。所以，范围变成了 0 到无穷大。

我们仍然没有讨论在神经网络中如何进行训练。让我们用卷积神经网络来做这件事。

卷积神经网络

卷积神经网络(CNN)类似于普通的神经网络，但有多个隐藏层，以及一个称为卷积层的过滤器。CNN 成功识别人脸、物体和交通标志，也用于自动驾驶汽车。

数据

算法基本上对数字数据起作用。图像和文本数据是非结构化数据，甚至在我们开始任何事情之前，它们就需要被转换成数值。

图像:计算机将一幅图像作为像素值的数组。根据图像的分辨率和大小，它会看到一个 X * Y * Z 的数字数组。例如，有一幅彩色图像，其大小为 480×480 像素。该数组的表示形式为 480×480×3，其中 3 是颜色的 RGB 值。这些数字中的每一个都从 0 到 255 不等，这描述了该点的像素强度/密度。这个概念是，如果给计算机和这个数组的数字，它输出的概率图像是某一类的情况下，分类问题。
Text :我们已经讨论过如何从文本中创建特征。您可以使用任何一种技术将文本转换为特征。RNN 和 LSTM 更适合文本相关的解决方案，我们将在下一节中讨论。

img/475440_2_En_6_Figb_HTML.jpg

体系结构

CNN 是具有输入层、输出层和多个隐藏层的神经网络的特例。隐藏层有四个不同的程序来完成网络。每一个都有详细的解释。

盘旋

img/475440_2_En_6_Figc_HTML.jpg

卷积层是卷积神经网络的核心，它执行大多数计算操作。该名称来源于从输入图像中提取特征的“卷积”运算符。这些也被称为滤镜(一个橙色的 33 矩阵)。通过在整个图像上滑动滤波器并计算这两个矩阵之间的点积而形成的矩阵被称为卷积特征、激活图或特征图*。例如，假设在表数据中计算不同类型的特征，例如从“出生日期”计算“年龄”直边、简单颜色和曲线是过滤器可以从图像中提取的一些特征。

在 CNN 的训练过程中，它学习过滤器中存在的数字或值，并将其用于测试数据。特征的数量越多，提取的图像特征就越多，并识别看不见的图像中的模式。

非线性(ReLU)

img/475440_2_En_6_Figd_HTML.jpg

一个整流线性单元 ( ReLU )是 CNN 架构中卷积层之后使用的非线性函数。它用零替换矩阵中的所有负值。ReLU 的目的是在 CNN 中引入非线性以更好地执行。

联营

img/475440_2_En_6_Fige_HTML.jpg

池化或二次采样在不丢失重要信息的情况下降低了特征的维数。这样做是为了减少全连接层的大量输入和处理模型所需的计算。它还减少了模型的过拟合。它使用一个 2×2 的窗口，在图像上滑动，并在每个区域取最大值，如图所示。这就是它降低维度的方式。

展平、完全连接和 Softmax 层

最后一层是需要特征向量作为输入的密集层。但是池层的输出不是 1D 特征向量。这个将卷积输出转换为特征向量的过程称为展平。完全连接层从展平层获取输入，并给出一个 n 维向量，其中 n 是类的数量。全连接层的功能是使用这些特征，根据训练数据集的损失函数将输入图像分类为各种类别。在最后使用 softmax 函数将这些 n 维向量转换成每个类别的概率，最终将图像分类到特定类别。

反向传播:训练神经网络

在正常的神经网络中，您进行前向传播以获得输出，并检查该输出是否正确并计算误差。在反向传播中，你通过你的网络反向寻找误差相对于每个权重的偏导数。

让我们看看它到底是如何工作的。

输入图像被输入网络。它完成前向传播，即卷积、ReLU 和池化操作，在全连接层中进行前向传播，并为每个类生成输出概率。根据前馈规则，权重被随机分配，并完成训练和输出随机概率的第一次迭代。第一步结束后，网络使用以下公式计算输出层的误差

总误差= ∑(目标概率-输出概率) ²

现在，反向传播开始计算网络中所有权重的误差梯度，并使用梯度下降来更新所有滤波器值和权重，最终使输出误差最小化。过滤器数量、过滤器尺寸和网络架构等参数在构建网络时就已确定。过滤器矩阵和连接权重在每次运行时都会更新。对整个训练集重复整个过程，直到误差最小。

循环神经网络

CNN 用于计算机视觉问题，但无法解决序列模型。序列模型是那些甚至实体的序列也很重要的模型。例如，在文章中，单词的顺序对于创造有意义的句子很重要。这就是 rnn 发挥作用的地方，它对顺序数据很有用，因为每个神经元都可以用自己的记忆来记住上一步的信息。

img/475440_2_En_6_Figf_HTML.jpg

要理解 RNN 到底是如何运作的相当复杂。如果你看到上图，循环神经网络从隐藏层获取输出，并在给出预测之前将其发送回同一层。

训练 RNN:穿越时间的反向传播(BPTT)

你知道前馈和反向传播在 CNN 是如何工作的，所以让我们看看 RNN 是如何进行培训的。

img/475440_2_En_6_Figg_HTML.jpg

如果我们只讨论隐藏层，它不仅从隐藏层获取输入，而且您还可以向同一个隐藏层添加另一个输入。现在反向传播就像你以前见过的任何其他训练一样发生；现在，它依赖于时间。这里，通过展开隐藏层，误差从最后一个时间戳反向传播到第一个时间戳。这允许计算每个时间戳的误差并更新权重。在隐藏单元之间具有循环连接的循环网络读取整个序列，然后产生所需的输出。

当梯度值太小时，模型需要太长时间来学习，这被称为消失梯度。LSTM 解决了这个问题。

长短期记忆(LSTM)

LSTMs 类似于 rnn，但具有更好的方程和反向传播，这使它们的性能更好。LSTMs 的工作方式类似于 rnn，但是它们可以学习很长时间间隔的东西，并且它们可以像计算机一样存储信息。

img/475440_2_En_6_Figh_HTML.jpg

该算法通过加权方法学学习单词或字符的重要性，并决定是否存储它。为此，它使用被称为门的调节结构，可以向细胞中添加或删除信息。这些细胞有一个 sigmoid 层，决定应该传递多少信息。它有三层——输入、遗忘和输出——来执行这个过程。

对 CNN 和 RNN 工作的深入讨论超出了本书的范围。如果你有兴趣了解更多，书的末尾有参考资料。

食谱 6-1。检索信息

信息检索是自然语言处理中使用频率很高的应用之一，并且非常棘手。单词或句子的意思取决于所用的确切单词以及上下文和意思。两个句子可能是完全不同的词，但可以传达相同的意思。你应该能捕捉到。

信息检索(IR)系统允许用户基于搜索文本/查询有效地搜索文档并检索有意义的信息。

img/475440_2_En_6_Figi_HTML.jpg

问题

使用词嵌入的信息检索。

解决办法

有多种方法可以进行信息检索。使用单词嵌入是非常有效的，因为它也考虑了上下文。我们在第三章中讨论了单词嵌入是如何构建的。预训练的 word2vec 只在这种情况下使用。

让我们举一个简单的例子，看看如何使用查询输入构建文档检索。假设数据库中有四个文档，如下所示。(这展示了它是如何工作的。现实世界的应用程序中有许多文档。)

Doc1 = ["With the Union cabinet approving the amendments to the Motor Vehicles Act, 2016, those caught for drunken driving will have to have really deep pockets, as the fine payable in court has been enhanced to Rs 10,000 for first-time offenders." ]
Doc2 = ["Natural language processing (NLP) is an area of computer science and artificial intelligence concerned with the interactions between computers and human (natural) languages, in particular how to program computers to process and analyze large amounts of natural language data."]
Doc3 = ["He points out that public transport is very good in Mumbai and New Delhi, where there is a good network of suburban and metro rail systems."]
Doc4 = ["But the man behind the wickets at the other end was watching just as keenly. With an affirmative nod from Dhoni, India captain Rohit Sharma promptly asked for a review. Sure enough, the ball would have clipped the top of middle and leg."]

假设有许多这样的文档。并且您想要检索与查询“cricket”最相关的一个让我们来看看如何建立它。

query = "cricket"

它是如何工作的

步骤 1-1。导入库

这里是图书馆。

import gensim
from gensim.models import Word2Vec
import numpy as np
import nltk
import itertools
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize, word_tokenize
import scipy
from scipy import spatial
from nltk.tokenize.toktok import ToktokTokenizer
import re
tokenizer = ToktokTokenizer()
stopword_list = nltk.corpus.stopwords.words('english')

步骤 1-2。创建或导入文档

从网上随机抽取句子。

Doc1 = ["With the Union cabinet approving the amendments to the Motor Vehicles Act, 2016, those caught for drunken driving will have to have really deep pockets, as the fine payable in court has been enhanced to Rs 10,000 for first-time offenders." ]
Doc2 = ["Natural language processing (NLP) is an area of computer science and artificial intelligence concerned with the interactions between computers and human (natural) languages, in particular how to program computers to process and analyze large amounts of natural language data."]
Doc3 = ["He points out that public transport is very good in Mumbai and New Delhi, where there is a good network of suburban and metro rail systems."]
Doc4 = ["But the man behind the wickets at the other end was watching just as keenly. With an affirmative nod from Dhoni, India captain Rohit Sharma promptly asked for a review. Sure enough, the ball would have clipped the top of middle and leg."]

# Put all the documents in one list
fin= Doc1+Doc2+Doc3+Doc4

步骤 1-3。下载 word2vec

接下来，让我们使用单词嵌入来解决这个问题。从 https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit 下载 word2vec。

#load the model
Model= gensim.models.KeyedVectors.load_word2vec_format('/GoogleNews-vectors-negative300.bin', binary=True)

步骤 1-4。创建一个 IR 系统

现在，建立信息检索系统。

#Preprocessing
def remove_stopwords(text, is_lower_case=False):
    pattern = r'[^a-zA-z0-9\s]'
    text = re.sub(pattern, ", ".join(text))
    tokens = tokenizer.tokenize(text)
    tokens = [token.strip() for token in tokens]
    if is_lower_case:
        filtered_tokens = [token for token in tokens if token not in stopword_list]
    else:
        filtered_tokens = [token for token in tokens if token.lower() not in stopword_list]
    filtered_text = ' '.join(filtered_tokens)
    return filtered_text
# Function to get the embedding vector for n dimension, we have used "300"
def get_embedding(word):
    if word in model.wv.vocab:
        return model[x]
    else:
        return np.zeros(300)

对于每一个文档，我们都会得到很多基于单词数量的向量。您需要通过取所有单词向量的平均值来计算文档的平均向量。

# Getting average vector for each document
out_dict =  {}
for sen in fin:
    average_vector = (np.mean(np.array([get_embedding(x) for x in nltk.word_tokenize(remove_stopwords(sen))]), axis=0))
    dict = { sen : (average_vector) }
    out_dict.update(dict)
# Function to calculate the similarity between the query vector and document vector
def get_sim(query_embedding, average_vector_doc):
    sim = [(1 - scipy.spatial.distance.cosine(query_embedding, average_vector_doc))]
    return sim
# Rank all the documents based on the similarity to get relevant docs
def Ranked_documents(query):
    query_words =  (np.mean(np.array([get_embedding(x) for x in nltk.word_tokenize(query.lower())],dtype=float), axis=0))
    rank = []
    for k,v in out_dict.items():
        rank.append((k, get_sim(query_words, v)))
    rank = sorted(rank,key=lambda t: t[1], reverse=True)
    print('Ranked Documents :')
    return rank

步骤 1-5。结果和应用

让我们通过几个例子来看看我们构建的信息检索系统。

# Call the IR function with a query
Ranked_documents("cricket")
Result :
[('But the man behind the wickets at the other end was watching just as keenly. With an affirmative nod from Dhoni, India captain Rohit Sharma promptly asked for a review. Sure enough, the ball would have clipped the top of middle and leg.',
  [0.44954327116871795]),
 ('He points out that public transport is very good in Mumbai and New Delhi, where there is a good network of suburban and metro rail systems.',
  [0.23973446569030055]),
 ('With the Union cabinet approving the amendments to the Motor Vehicles Act, 2016, those caught for drunken driving will have to have really deep pockets, as the fine payable in court has been enhanced to Rs 10,000 for first-time offenders.',
  [0.18323712012013349]),
 ('Natural language processing (NLP) is an area of computer science and artificial intelligence concerned with the interactions between computers and human (natural) languages, in particular how to program computers to process and analyze large amounts of natural language data.',
  [0.17995060855459855])]

Doc4(在结果的顶部)与查询“cricket”最相关，尽管单词 cricket 一次也没有提到，相似度为 0.449。

让我们再举一个可能开车的例子。

Ranked_documents("driving")
[('With the Union cabinet approving the amendments to the Motor Vehicles Act, 2016, those caught for drunken driving will have to have really deep pockets, as the fine payable in court has been enhanced to Rs 10,000 for first-time offenders.',
  [0.35947287723800669]),
 ('But the man behind the wickets at the other end was watching just as keenly. With an affirmative nod from Dhoni, India captain Rohit Sharma promptly asked for a review. Sure enough, the ball would have clipped the top of middle and leg.',
  [0.19042556935316801]),
 ('He points out that public transport is very good in Mumbai and New Delhi, where there is a good network of suburban and metro rail systems.',
  [0.17066536985237601]),
 ('Natural language processing (NLP) is an area of computer science and artificial intelligence concerned with the interactions between computers and human (natural) languages, in particular how to program computers to process and analyze large amounts of natural language data.',
  [0.088723080005327359])]

同样，由于驾驶与运输和机动车辆法案相关联，它会将最相关的文档放在最上面。前两个文档与查询相关。

您可以使用相同的方法，并将其扩展到尽可能多的文档。为了获得更高的准确性，你可以为特定的行业构建你自己的嵌入，正如你在第三章中所学的，因为我们使用的是通用的。

这是可用于许多应用程序的基本方法，如下所示。

搜索引擎
文档检索
段落检索
问答

img/475440_2_En_6_Figj_HTML.jpg

事实证明，当查询越长，结果越短时，结果越好。这就是为什么当搜索查询的字数较少时，你在搜索引擎中得不到很好的结果。

食谱 6-2。利用深度学习对文本进行分类

这个食谱使用深度学习方法构建了一个文本分类器。

问题

您想要使用 CNN、RNN 和 LSTM 建立一个文本分类模型。

解决办法

方法和 NLP 管道保持不变。唯一的变化是，我们不再使用机器学习算法，而是使用深度学习算法建立模型。

它是如何工作的

按照本节中的步骤，使用深度学习方法构建电子邮件分类器。

步骤 2-1。定义业务问题

电子邮件分类(垃圾邮件或垃圾邮件)。您需要根据电子邮件内容对垃圾邮件进行分类。

第 2-2 步。确定潜在的数据来源并收集

使用第四章中配方 4-6 中的相同数据。

#read file
file_content = pd.read_csv('spam.csv', encoding = "ISO-8859-1")
#check sample content in the email
file_content['v2'][1]
#output
'Ok lar... Joking wif u oni...'

第 2-3 步。预处理文本

让我们对数据进行预处理。

#Import library
from nltk.corpus import stopwords
from nltk import *
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.stem import WordNetLemmatizer
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
# Remove stop words
stop = stopwords.words('english')
file_content['v2'] = file_content['v2'].apply(lambda x: " ".join(x for x in x.split() if x not in stop))
# Delete unwanted columns
Email_Data = file_content[['v1', 'v2']]
# Rename column names
Email_Data = Email_Data.rename(columns={"v1":"Target", "v2":"Email"})
Email_Data.head()
 #output

    Target Email
0   ham    Go jurong point, crazy.. Available bugis n gre...
1   ham    Ok lar... Joking wif u oni...
2   spam   Free entry 2 wkly comp win FA Cup final tkts 2...
3   ham    U dun say early hor... U c already say...
4   ham    Nah I think goes usf, lives around though
#Delete punctuations, convert text in lower case and delete the double space
Email_Data['Email'] = Email_Data['Email'].apply(lambda x: re.sub('[!@#$:).;,?&]', ", x.lower()))
Email_Data['Email'] = Email_Data['Email'].apply(lambda x: re.sub(' ', ' ', x))
Email_Data['Email'].head(5)
#output
0 go jurong point crazy available bugis n great ...
1 ok lar joking wif u oni
2 free entry 2 wkly comp win fa cup final tkts 2...
3 u dun say early hor u c already say
4 nah i think goes usf lives around though
Name: Email, dtype: object

#Separating text(input) and target classes
list_sentences_rawdata = Email_Data["Email"].fillna("_na_").values
list_classes = ["Target"]
target = Email_Data[list_classes].values
To_Process=Email_Data[['Email', 'Target']]

第 2-4 步。为建模准备数据

接下来，准备数据。

#Train and test split with 80:20 ratio
train, test = train_test_split(To_Process, test_size=0.2)
# Define the sequence lengths, max number of words and embedding dimensions
# Sequence length of each sentence. If more, truncate. If less, pad with zeros
MAX_SEQUENCE_LENGTH = 300 #user_transform
# Top 20000 frequently occurring words
MAX_NB_WORDS = 20000
# Get the frequently occurring words
 tokenizer = Tokenizer(num_words=MAX_NB_WORDS)
tokenizer.fit_on_texts(train.Email)
train_sequences = tokenizer.texts_to_sequences(train.Email)
test_sequences = tokenizer.texts_to_sequences(test.Email)
# dictionary containing words and their index
word_index = tokenizer.word_index
# print(tokenizer.word_index)
# total words in the corpus
print('Found %s unique tokens.' % len(word_index))
# get only the top frequent words on train
train_data = pad_sequences(train_sequences, maxlen=MAX_SEQUENCE_LENGTH)
# get only the top frequent words on test
test_data = pad_sequences(test_sequences, maxlen=MAX_SEQUENCE_LENGTH)
print(train_data.shape)
print(test_data.shape)
#output
Found 8443 unique tokens.
(4457, 300)
(1115, 300)
train_labels = train['Target']
test_labels = test['Target'] 

#import library
from sklearn.preprocessing import LabelEncoder
# converts the character array to numeric array. Assigns levels to unique labels.
le = LabelEncoder()
le.fit(train_labels)
train_labels = le.transform(train_labels)
test_labels = le.transform(test_labels)
print(le.classes_)
print(np.unique(train_labels, return_counts=True))
print(np.unique(test_labels, return_counts=True))
#output
['ham' 'spam']
(array([0, 1]), array([3889, 568]))
(array([0, 1]), array([936, 179]))
# changing data types
labels_train = to_categorical(np.asarray(train_labels))
labels_test = to_categorical(np.asarray(test_labels))
print('Shape of data tensor:', train_data.shape)
print('Shape of label tensor:', labels_train.shape)
print('Shape of label tensor:', labels_test.shape)
#output
Shape of data tensor: (4457, 300)
Shape of label tensor: (4457, 2)
Shape of label tensor: (1115, 2)
EMBEDDING_DIM = 100
print(MAX_SEQUENCE_LENGTH)
#output
300

第 2-5 步。建模和预测

我们正在使用不同的深度学习方法构建模型，如 CNN、RNN、LSTM 和双向 LSTM，并使用不同的准确性指标比较每个模型的性能。首先，我们来定义一下 CNN 模型。

这里我们定义了一个有 128 个存储单元的隐藏层。网络使用概率为 0.5 的辍学。输出图层是使用 softmax 激活函数输出概率预测的密集图层。

# Import Libraries
import sys, os, re, csv, codecs, numpy as np, pandas as pd
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical
from keras.layers import Dense, Input, LSTM, Embedding, Dropout, Activation
from keras.layers import Bidirectional, GlobalMaxPool1D, Conv1D, SimpleRNN
from keras.models import Model
from keras.models import Sequential
from keras import initializers, regularizers, constraints, optimizers, layers
from keras.layers import Dense, Input, Flatten, Dropout, BatchNormalization
from keras.layers import Conv1D, MaxPooling1D, Embedding
from keras.models import Sequential
 print('Training CNN 1D model.')
model = Sequential()
model.add(Embedding(MAX_NB_WORDS,
 EMBEDDING_DIM,
 input_length=MAX_SEQUENCE_LENGTH

 ))
model.add(Dropout(0.5))
model.add(Conv1D(128, 5, activation="relu"))
model.add(MaxPooling1D(5))
model.add(Dropout(0.5))
model.add(BatchNormalization())
model.add(Conv1D(128, 5, activation="relu"))
model.add(MaxPooling1D(5))
model.add(Dropout(0.5))
model.add(BatchNormalization())
model.add(Flatten())
model.add(Dense(128, activation="relu"))
model.add(Dense(2, activation="softmax"))
model.compile(loss='categorical_crossentropy',
 optimizer="rmsprop",
 metrics=['acc'])

我们现在正在使我们的模型符合数据。有五个时期和 64 个模式的批量大小。

model.fit(train_data, labels_train,
 batch_size=64,
 epochs=5,
 validation_data=(test_data, labels_test))
#output

img/475440_2_En_6_Figk_HTML.jpg

#predictions on test data
predicted=model.predict(test_data)
predicted
#output
array([[0.5426713 , 0.45732868],
 [0.5431667 , 0.45683333],
 [0.53082496, 0.46917507],
 ...,
 [0.53582424, 0.46417573],
 [0.5305845 , 0.46941552],
 [0.53102577, 0.46897423]], dtype=float32)
#model evaluation
import sklearn
from sklearn.metrics import precision_recall_fscore_support as score
precision, recall, fscore, support = score(labels_test, predicted.round())
print('precision: {}'.format(precision))
print('recall: {}'.format(recall))
print('fscore: {}'.format(fscore))
print('support: {}'.format(support))
print("############################")

print(sklearn.metrics.classification_report(labels_test, predicted.round()))
#output

img/475440_2_En_6_Figl_HTML.jpg

我们现在可以定义我们的 RNN 模型。

#import library
from keras.layers.recurrent import SimpleRNN
#model training
print('Training SIMPLERNN model.')
model = Sequential()
model.add(Embedding(MAX_NB_WORDS,
 EMBEDDING_DIM,
 input_length=MAX_SEQUENCE_LENGTH
 ))
model.add(SimpleRNN(2, input_shape=(None,1)))
model.add(Dense(2,activation='softmax'))
model.compile(loss = 'binary_crossentropy', optimizer="adam",metrics = ['accuracy']) 

model.fit(train_data, labels_train,
 batch_size=16,
 epochs=5,
 validation_data=(test_data, labels_test))
#output

img/475440_2_En_6_Figm_HTML.jpg

# prediction on test data
predicted_Srnn=model.predict(test_data)
predicted_Srnn
#output
array([[0.9959137 , 0.00408628],
 [0.99576926, 0.00423072],
 [0.99044365, 0.00955638],
 ...,
 [0.9920791 , 0.00792089],
 [0.9958105 , 0.00418955],
 [0.99660563, 0.00339443]], dtype=float32)
#model evaluation
from sklearn.metrics import precision_recall_fscore_support as score
precision, recall, fscore, support = score(labels_test, predicted_Srnn.round())
print('precision: {}'.format(precision))
print('recall: {}'.format(recall)) 

print('fscore: {}'.format(fscore))
print('support: {}'.format(support))
print("############################")
print(sklearn.metrics.classification_report(labels_test, predicted_Srnn.round()))
#output

img/475440_2_En_6_Fign_HTML.jpg

这是我们的长期短期记忆(LSTM)模型。

#model training
print('Training LSTM model.')
model = Sequential()
model.add(Embedding(MAX_NB_WORDS,
 EMBEDDING_DIM,
 input_length=MAX_SEQUENCE_LENGTH
 ))
model.add(LSTM(output_dim=16, activation="relu", inner_activation="hard_sigmoid",return_sequences=True))
model.add(Dropout(0.2))
model.add(BatchNormalization())
model.add(Flatten())
model.add(Dense(2,activation='softmax'))
model.compile(loss = 'binary_crossentropy', optimizer="adam",metrics = ['accuracy'])
model.fit(train_data, labels_train,
 batch_size=16,
 epochs=5,
 validation_data=(test_data, labels_test))
#output

img/475440_2_En_6_Figo_HTML.jpg

#prediction on text data
predicted_lstm=model.predict(test_data)
predicted_lstm
array([[1.0000000e+00, 4.0581045e-09],
 [1.0000000e+00, 8.3188789e-13],
 [9.9999976e-01, 1.8647323e-07],
 ...,
 [9.9999976e-01, 1.8333606e-07],
 [1.0000000e+00, 1.7347950e-09],
 [9.9999988e-01, 1.3574694e-07]], dtype=float32) 

#model evaluation
from sklearn.metrics import precision_recall_fscore_support as score
precision, recall, fscore, support = score(labels_test, predicted_lstm.round())
print('precision: {}'.format(precision))
print('recall: {}'.format(recall))
print('fscore: {}'.format(fscore))
print('support: {}'.format(support))
print("############################")
print(sklearn.metrics.classification_report(labels_test, predicted_lstm.round()))
#output

img/475440_2_En_6_Figp_HTML.jpg

最后，让我们讨论双向 LSTM 并实现它。

LSTM 使用隐藏状态保存输入的信息。在双向 LSTMs 中，输入有两种方式:一种是从过去到未来，另一种是从未来到过去，帮助学习未来的表示。众所周知，双向 LSTMs 可以产生非常好的结果，因为它们可以更好地理解上下文。

#model training
print('Training Bidirectional LSTM model.') 

model = Sequential()
model.add(Embedding(MAX_NB_WORDS,
 EMBEDDING_DIM,
 input_length=MAX_SEQUENCE_LENGTH
 ))
model.add(Bidirectional(LSTM(16, return_sequences=True, dropout=0.1, recurrent_dropout=0.1)))
model.add(Conv1D(16, kernel_size = 3, padding = "valid", kernel_initializer = "glorot_uniform"))
model.add(GlobalMaxPool1D())
model.add(Dense(50, activation="relu"))
model.add(Dropout(0.1))
model.add(Dense(2,activation='softmax'))
model.compile(loss = 'binary_crossentropy', optimizer="adam",metrics = ['accuracy'])
model.fit(train_data, labels_train,
 batch_size=16,
 epochs=3,
 validation_data=(test_data, labels_test))
#output

img/475440_2_En_6_Figq_HTML.jpg

# prediction on test data
predicted_blstm=model.predict(test_data) 

predicted_blstm
#output
array([[9.9999976e-01, 2.6086647e-07],
 [9.9999809e-01, 1.9633851e-06],
 [9.9999833e-01, 1.6918856e-06],
 ...,
 [9.9999273e-01, 7.2622524e-06],
 [9.9999964e-01, 3.3541210e-07],
 [9.9999964e-01, 3.5427794e-07]], dtype=float32)
#model evaluation
from sklearn.metrics import precision_recall_fscore_support as score
precision, recall, fscore, support = score(labels_test, predicted_blstm.round())
print('precision: {}'.format(precision))
print('recall: {}'.format(recall))
print('fscore: {}'.format(fscore))
print('support: {}'.format(support))
print("############################")
print(sklearn.metrics.classification_report(labels_test, predicted_blstm.round()))
#output

img/475440_2_En_6_Figr_HTML.jpg

你可以看到双向 LSTM 优于其他算法。

食谱 6-3。下一个单词预测

自动填充/显示潜在的单词序列节省了大量写电子邮件的时间，并使用户乐于在任何产品中使用它。

问题

您希望使用电子邮件数据，根据前面的单词序列建立一个模型来预测/建议下一个单词。

如下图所示，语言被建议为下一个单词。

img/475440_2_En_6_Figs_HTML.jpg

解决办法

本节构建了一个 LSTM 模型，用于从电子邮件数据中学习单词序列。这个模型预测下一个单词。

它是如何工作的

按照本节中的步骤，使用深度学习方法建立下一个单词预测模型。

步骤 3-1。定义业务问题

根据单词或句子的顺序预测下一个单词。

第 3-2 步。确定潜在的数据来源并收集

对于这个问题，让我们使用第四章中配方 4-6 的相同电子邮件数据。这里的数据要少得多，但是为了展示工作流程，我们对这些数据很满意。数据越多，准确性越好。

file_content = pd.read_csv('spam.csv', encoding = "ISO-8859-1")
# Just selecting emails and connverting it into list
Email_Data = file_content[[ 'v2']]
list_data = Email_Data.values.tolist()
list_data
#output
[['Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...'],
 ['Ok lar... Joking wif u oni...'],
 ["Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005\. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's"],
 ['U dun say so early hor... U c already then say...'],
 ["Nah I don't think he goes to usf, he lives around here though"],
 ["FreeMsg Hey there darling it's been 3 week's now and no word back! I'd like some fun you up for it still? Tb ok! XxX std chgs to send, å£1.50 to rcv"],
 ['Even my brother is not like to speak with me. They treat me like aids patent.'],
 ["As per your request 'Melle Melle (Oru Minnaminunginte Nurungu Vettam)' has been set as your callertune for all Callers. Press *9 to copy your friends Callertune"],
 ['WINNER!! As a valued network customer you have been selected to receivea å£900 prize reward! To claim call 09061701461\. Claim code KL341\. Valid 12 hours only.'] 

,
 ['Had your mobile 11 months or more? U R entitled to Update to the latest colour mobiles with camera for Free! Call The Mobile Update Co FREE on 08002986030'],

第 3-3 步。导入并安装必要的库

这里是图书馆。

import numpy as np
import random
import pandas as pd
import sys
import os
import time
import codecs
import collections
import numpy
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Dropout
from keras.layers import LSTM
from keras.callbacks import ModelCheckpoint
from keras.utils import np_utils
from nltk.tokenize import sent_tokenize, word_tokenize
import scipy
from scipy import spatial
from nltk.tokenize.toktok import ToktokTokenizer
import re
tokenizer = ToktokTokenizer()

第 3-4 步。处理数据

接下来，处理数据。

#Converting list to string
from collections import Iterable
def flatten(items):
    """Yield items from any nested iterable"""
    for x in items:
        if isinstance(x, Iterable) and not isinstance(x, (str, bytes)):
            for sub_x in flatten(x):
                yield sub_x
        else:
            yield x
TextData=list(flatten(list_data))
TextData = ".join(TextData)
# Remove unwanted lines and converting into lower case
TextData = TextData.replace('\n',")
TextData = TextData.lower()
pattern = r'[^a-zA-z0-9\s]'
TextData = re.sub(pattern, ", ".join(TextData))
# Tokenizing
tokens = tokenizer.tokenize(TextData)
tokens = [token.strip() for token in tokens]
# get the distinct words and sort it

word_counts = collections.Counter(tokens)
word_c = len(word_counts)
print(word_c)
distinct_words = [x[0] for x in word_counts.most_common()]
distinct_words_sorted = list(sorted(distinct_words))
# Generate indexing for all words
word_index = {x: i for i, x in enumerate(distinct_words_sorted)}
# decide on sentence length
sentence_length = 25

第 3-5 步。为建模准备数据

这一步将邮件分成固定长度为十个单词的单词序列。(您可以根据业务问题和计算能力选择任何内容。)文本由单词序列分割。当创建这些序列时，这个窗口沿着整个文档滑动，一次一个单词，允许每个单词学习前一个单词。

#prepare the dataset of input to output pairs encoded as integers
# Generate the data for the model

#input = the input sentence to the model with index
#output = output of the model with index
InputData = []
OutputData = []
for i in range(0, word_c - sentence_length, 1):
    X = tokens[i:i + sentence_length]
    Y = tokens[i + sentence_length]
    InputData.append([word_index[char] for char in X])
    OutputData.append(word_index[Y])
print (InputData[:1])
print ("\n")
print(OutputData[:1])
#output

[[5086, 12190, 6352, 9096, 3352, 1920, 8507, 5937, 2535, 7886, 5214, 12910, 6541, 4104, 2531, 2997, 11473, 5170, 1595, 12552, 6590, 6316, 12758, 12087, 8496]]
[4292]
# Generate  X
X = numpy.reshape(InputData, (len(InputData), sentence_length, 1))
# One hot encode the output variable
Y = np_utils.to_categorical(OutputData)
Y
#output
array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

第 3-6 步。建立模型

接下来，让我们定义 LSTM 模型，一个具有 256 个存储单元的单一隐藏 LSTM 层。这个模型使用 0.2 的 dropout。输出层使用 softmax 激活函数。这里我们使用 ADAM 优化器。

# define the LSTM model
model = Sequential()
model.add(LSTM(256, input_shape=(X.shape[1], X.shape[2])))
model.add(Dropout(0.2))
model.add(Dense(Y.shape[1], activation="softmax"))
model.compile(loss='categorical_crossentropy', optimizer="adam")
#define the checkpoint
file_name_path="weights-improvement-{epoch:02d}-{loss:.4f}.hdf5"
checkpoint = ModelCheckpoint(file_name_path, monitor="loss", verbose=1, save_best_only=True, mode="min")
callbacks = [checkpoint]

现在，您可以根据数据拟合模型。这里我们使用五个时期和 128 个模式的批量大小。为了获得更好的结果，您可以使用更多的纪元，如 50 或 100。当然，你可以在更多的数据上使用它们。

#fit the model
model.fit(X, Y, epochs=5, batch_size=128, callbacks=callbacks)

Note

我们没有将数据分为训练和测试数据。我们对精确的模型不感兴趣。深度学习模型需要大量的数据进行训练，并且需要大量的时间进行训练，因此我们使用模型检查点来捕获所有的模型权重以进行归档。我们使用最佳的权重集进行预测。

#output

img/475440_2_En_6_Figt_HTML.jpg

运行代码后，您的本地目录中就有了权重检查点文件。选择保存在工作目录中的网络权重文件。当我们运行这个示例时，我们获得了下面的检查点，它的损失最小，这是我们在五个时期内实现的。

# load the network weights
file_name = "weights-improvement-05-6.8213.hdf5"
model.load_weights(file_name)
model.compile(loss='categorical_crossentropy', optimizer="adam")

第 3-7 步。预测下一个单词

随机生成一系列单词并输入到模型中，看看它能预测什么。

# Generating random sequence
start = numpy.random.randint(0, len(InputData))
input_sent = InputData[start]
# Generate index of the next word of the email
X = numpy.reshape(input_sent, (1, len(input_sent), 1))
predict_word = model.predict(X, verbose=0)
index = numpy.argmax(predict_word)
print(input_sent)
print ("\n")
print(index)
# Output
[9122, 1920, 8187, 5905, 6828, 9818, 1791, 5567, 1597, 7092, 11606, 7466, 10198, 6105, 1837, 4752, 7092, 3928, 10347, 5849, 8816, 7092, 8574, 7092, 1831] 

5849
# Convert these indexes back to words
word_index_rev = dict((i, c) for i, c in enumerate(tokens))
result = word_index_rev[index]
sent_in = [word_index_rev[value] for value in input_sent]
print(sent_in)
print ("\n")
print(result)
Result :
['us', 'came', 'use', 'respecthe', 'would', 'us', 'are', 'it', 'you', 'to', 'pray', 'because', 'you', 'do', 'me', 'out', 'youre', 'thk', 'where', 'are', 'mrng', 'minutes', 'long', '500', 'per']
shut

因此，给定 25 个输入单词，它预测单词“shut”是下一个单词。当然，这没有多大意义，因为它是在更少的数据和时期上训练的。确保你有强大的计算能力，并对大量的数据进行训练。

食谱 6-4。堆栈溢出问题推荐

Stack Overflow 是一个面向专业人士和狂热开发者的问答网站。 It 是程序员学习、分享知识和建立职业生涯的最大在线社区

问题

每天都有成千上万的问题在 Stack Overflow 社区中被提问和回答。很有可能会问相同类型的问题，这会在系统中产生不必要的重复问题。如果当用户提出新问题时，系统能够推荐一个类似的问题，这就减少了大量的重复。

解决办法

我们需要将提出的问题与现有的所有其他问题进行比较。这是一个句子对比任务。我们使用预先训练的嵌入来获得句子级的平均向量，以计算相似性得分。这些分数随后用于对问题进行排名。

让我们使用各种预先训练的模型，如伯特句子变形金刚、开放人工智能 GPT 和 GloVe 来解决这个项目和传统的 TFIDF 方法。

它是如何工作的

步骤 4-1。收集数据

如果存在堆栈溢出问题，请使用免费的源数据集。原始数据集位于 www.kaggle.com/c/predict-closed-questions-on-stack-overflow/data?select=train-sample.csv 。

下载该数据集，并将其保存在一个文件夹中以备将来使用。

第 4-2 步。将笔记本和数据导入 Google Colab

鉴于 BERT 模型很大，Google Colab 用于解决这个项目，在 Colab 中构建它更容易、更快。

去谷歌 Colab ( https://colab.research.google.com/notebooks/intro.ipynb )。

然后转到“文件”,打开一个新笔记本或使用“上传笔记本”从本地上传笔记本。跟随下面截图。

img/475440_2_En_6_Figu_HTML.jpg

要导入数据，请转到“文件”，然后单击“上传到会话存储”选项。示例如下图所示。

img/475440_2_En_6_Figv_HTML.jpg

第 4-3 步。导入库

让我们导入必要的库。

#importing necessary libraries
import pandas as pd
import numpy as np
import pickle
import time

import re
from bs4 import BeautifulSoup
import nltk
from nltk.tokenize import ToktokTokenizer
from nltk.stem.wordnet import WordNetLemmatizer

from string import punctuation
import matplotlib.pyplot as plt
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
import re
from nltk.stem.porter import PorterStemmer
from nltk.stem import WordNetLemmatizer

from sklearn.metrics.pairwise import cosine_similarity
from scipy.sparse import coo_matrix, hstack,csr_matrix

第 4-4 步。导入数据和 EDA

在解决实际问题之前，让我们导入数据并做一些快速 EDA。

#importing training data
df=pd.read_csv('train-sample.csv')

#counting all null values

df.isnull().sum()

PostId                                      0
PostCreationDate                            0
OwnerUserId                                 0
OwnerCreationDate                           0
ReputationAtPostCreation                    0
OwnerUndeletedAnswerCountAtPostTime         0
Title                                       0
BodyMarkdown                                0
Tag1                                       10
Tag2                                    27251
Tag3                                    64358
Tag4                                   100622
Tag5                                   124558
PostClosedDate                          70136
OpenStatus                                  0
dtype: int64

#Replacing missing values with blank space

df['Tag1']=df['Tag1'].replace(np.NaN,'')
df['Tag2']=df['Tag2'].replace(np.NaN,'')
df['Tag3']=df['Tag3'].replace(np.NaN,'')
df['Tag4']=df['Tag4'].replace(np.NaN,'')
df['Tag5']=df['Tag5'].replace(np.NaN,'')

#converting column type into string

df['Title']=df['Title'].astype(str)
df['BodyMarkdown']=df['BodyMarkdown'].astype(str)

#checking top 10 most common words from the Body column

from collections import Counter
cnt = Counter()
for text in df["BodyMarkdown"].values:
    for word in text.split():
        cnt[word] += 1

cnt.most_common(10)  #top 10 common words

# Result

[('the', 514615),
 ('to', 410003),
 ('I', 358442),
 ('a', 285706),
 ('and', 212914),
 ('is', 192019),
 ('in', 182699),
 ('=', 180765),
 ('of', 164698),
 ('that', 121863)]

这些是停用词，在数据清理步骤中会被删除。

第 4-5 步。清理文本数据

让我们做一些标准的数据清理过程。

#importing tokenizer for sentence tokenization
token=ToktokTokenizer()

#stop words removing function
def stopWords(text):

    stop_words = set(stopwords.words("english"))   #importing stopwords dictionary
    #text = re.sub('[^a-zA-Z]', ' ', text)
    words=token.tokenize(text)                   # tokenizing sentences

    filtered = [w for w in words if not w in stop_words] #filtering words which are not in stopwords

    return ' '.join(map(str, filtered))  #creating string combining all filtered words

#function to remove punctuations
def remove_punctuations(text):
    punct = '!"#$%&\'()*+,./:;<=>?@[\\]^_`{|}~'   #list of punctuation marks
    for punctuation in punct:
        text = text.replace(punctuation, '')   #replacing punctuation mark with blank space
    return text

#function to remove frequent words but they were mostly from stopwords

FREQWORDS = set([w for (w, wc) in cnt.most_common(10)])
def remove_freqwords(text):

    return " ".join([word for word in str(text).split() if word not in FREQWORDS])

#cleaning the text

def clean_text(text):
    text = text.lower()
    text = re.sub(r"what's", "what is ", text)
    text = re.sub(r"\'s", " ", text)
    text = re.sub(r"n't", " not ", text)
    text = re.sub(r"i'm", "i am ", text)
    text = re.sub(r"\'re", " are ", text)
    text = re.sub(r"\'d", " would ", text)
    text = re.sub(r"\'ll", " will ", text)
    text = re.sub(r"\'scuse", " excuse ", text)
    text = re.sub(r"\'\n", " ", text)
    text = re.sub('\s+', ' ', text)
    text = re.sub(r"\'ve", " have ", text)x`
    text = re.sub(r"can't", "can not ", text)
    text = text.strip(' ')
    return text

#URL removing function
def remove_urls(text):
    url_pattern = re.compile(r'https?://\S+|www\.\S+')
    return url_pattern.sub(r'', text)

#function to remove html tag and replacing with blank space
def remove_html(text):
    html_pattern = re.compile('<.*?>')
    return html_pattern.sub(r'', text)

让我们对数据集中的 BodyMarkdown 列使用上面创建的函数。

#Applying all preprocessing steps defined above on both Body

df['BodyMarkdown']=df['BodyMarkdown'].apply(lambda x: clean_text(x))
df['BodyMarkdown'] = df['BodyMarkdown'].apply(remove_punctuations)
df['BodyMarkdown'] = df['BodyMarkdown'].apply(remove_urls)
df['BodyMarkdown'] = df['BodyMarkdown'].apply(remove_html)
df['BodyMarkdown'] = df['BodyMarkdown'].apply(lambda x:stopWords(x))

第 4-6 步。使用 TFIDF 进行特征工程

有多种方法可以将文本转换为特征。先说 TFIDF。

#Importing TFIDF vector as tfidf_vectorizer

tfidf_vectorizer = TfidfVectorizer()

#applying tfidf on Body  column

tfidf_matrix2 = tfidf_vectorizer.fit_transform(df['BodyMarkdown'])

这个矩阵稍后用于提取类似的问题。

第 4-7 步。将手套嵌入用于特征工程

让我们导入手套预训练模型。

!wget http://nlp.stanford.edu/data/glove.6B.zip
!unzip glove*.zip
!ls
!pwd

Note

提取和计算平均向量需要几个小时。

为整个数据集提取特征需要花费大量时间。首先，考虑减少开发和测试代码的样本数量。一旦一切正常，就可以考虑整个数据集来提取特征并找到类似的问题。

#creating sample data set with 100 rows for testing. Comment this line to run it on the whole dataset.

dfg=df.iloc[0:100,:]

# load the glove model

glove_model = pd.read_table("glove.6B.100d.txt", sep=" ", index_col=0, header=None, quoting=csv.QUOTE_NONE)

# getting mean vector for each sentence

def get_mean_vector(glove_model, words):
    # remove out-of-vocabulary words
    words = [word for word in word_tokenize(words) if word in list(glove_model.index)] #if word is in vocab
    if len(words) >= 1:
        return np.mean(glove_model.loc[words].values, axis=0)
    else:
        return np.array([0]*100)

#Defining empty list and appending array to the list

glove_embeddings=[]
for i in dfg.BodyMarkdown:
    glove_embeddings.append(list(get_mean_vector(glove_model, i)))

glove_embeddings 拥有数据集中所有句子的向量。运行一次并保存向量以避免每次都运行，因为这会花费很多时间。

让我们将它导出到一个 CSV 文件中，以便您可以在需要时使用它。

# Saving vectors of each abstract in the #data frame so that we can use them directly while running code again

glove_embeddings_t=pd.DataFrame(K1).transpose()
glove_embeddings_t.to_csv('glove-vec.csv')

下面的代码导入 CSV 文件，该文件包含我们刚刚导出的向量。当新会话打开时，您需要运行代码来加载该数据集的向量。

#Loading our pre-trained vectors of each abstract

K=pd.read_csv('glove-vec.csv')
glove_embeddings_loaded=[]

#transforming data frame into a required array-#like structure as we did in the above step

for i in range(dfg.shape[0]):
  glove_embeddings_loaded.append(K[str(i)].values)
glove_embeddings_loaded=np.asarray(glove_embeddings_loaded)

第 4-8 步。使用 GPT 进行特征工程

让我们安装开放人工智能的 GPT 模型。

!pip install pytorch_pretrained_bert

# importing necessary libraries for GPT

import torch
from pytorch_pretrained_bert import OpenAIGPTTokenizer, OpenAIGPTModel

tokenizer = OpenAIGPTTokenizer.from_pretrained('openai-gpt')
model = OpenAIGPTModel.from_pretrained('openai-gpt')
model.eval()
print('Model Loaded')

#function to get embedding of each token
def returnEmbedding(pSentence):
  tokens = pSentence.split(' ')
  hidden_states = np.zeros((1,768))
  for token in tokens:
      subwords = tokenizer.tokenize(token)
      indexed_tokens = tokenizer.convert_tokens_to_ids(subwords)
      tokens_tensor = torch.tensor([indexed_tokens])

      with torch.no_grad():
          try:
            hidden_states += np.array(torch.mean(model(tokens_tensor),1))
          except Exception as ex:
            continue
  hidden_states /= len(tokens)
  return hidden_states

# Initialize Matrix with number of dataset records as rows and 768 columns as embedding dimension
X = np.zeros((df_gpt.shape[0], 768))

# Generate sentence level embedding by calculating average of all word embedding
for iter in range(df_gpt.shape[0]):
    text = df_gpt.loc[iter,'BodyMarkdown']
    #print(iter)
    X[iter] = returnEmbedding(text)

embeddings_GPT = X

嵌入 _GPT 稍后用于查找前 n 个类似的问题。

步骤 4-9。使用句子-BERT 进行特征工程

# Install BERT sentence transformer for sentence encoding
!pip install sentence-transformers

#running on 100 rows only for testing. Later comment this line
df_bert=df.iloc[0:100,:]

#importing bert-base model

from sentence_transformers import SentenceTransformer
sbert_model = SentenceTransformer('bert-base-nli-mean-tokens')

#embeding on Body column
sentence_embeddings = sbert_model.encode(df['BodyMarkdown'])
print('Sample BERT embedding vector - length', len(sentence_embeddings[0]))

#output
Sample BERT embedding vector - length 768

步骤 4-10。创建获取热门问题的函数

现在我们已经为数据集中的每个问题提取了向量或嵌入，让我们创建函数来

寻找余弦相似性
返回 top n 相似问题(排名)

#defining function to derive cosine similarity
from numpy import dot
from numpy.linalg import norm
def cos_sim(a,b):

    return dot(a, b)/(norm(a)*norm(b))

#Function which returns Top N similar sentence from data frame directly

def top_n(user,p,df):

    #Converting cosine similarities of overall data set with input queries into LIST
    x=cosine_similarity(user,p).tolist()[0]

 #store list in temp file to retrieve index
    tmp=list(x)

 #sort the list
    x.sort(reverse=True)

    print( x[0:5])

 #get index of top 5
    L=[]
    for i in x[0:5]:

        L.append(tmp.index(i))
    return df.iloc[L, [6,7]]

这两个函数和所有嵌入将在下一节中使用。

步骤 4-11。预处理用户输入

用户可以输入任何问题。您需要对用户输入文本进行预处理，并根据所选的一种模型提取嵌入内容。下面的函数可以做到这一点。

#function to pre-process and extract embeddings for the user input text

def user_transform(query,model):
    query= clean_text(query)
    query= remove_punctuations(query)
    query= remove_urls(query)
    query= remove_html(query)
    query= stopWords(query)
    print(query)
    if model=='TFIDF':
      k=tfidf_vectorizer.transform([str(query)])
    elif model=='BERT':
      k=sbert_model.encode(str(query))
    elif model=='glove_model':
      k=get_mean_vector(glove_model,query)
      k=k.reshape(1,-1)
    elif model=='GPT':
      k=returnEmbedding(query)

    return k

 pd.set_option("display.max_colwidth", -1)       #this function will display full text from each column

步骤 4-12。查找类似问题

使用 TFIDF 查找类似问题。

# Getting top 5 similar questions for user input query using TFIDF

input=user_transform('do we have any other Q&A platform like stackoverflow which is free source?','TFIDF')

top_n(input,tfidf_matrix2,df)

img/475440_2_En_6_Figw_HTML.jpg

如果查看结果，第一个问题类似于用户输入的查询。

使用 GloVe 查找类似问题。

# Getting top 5 similar questions using Glove model

input=user_transform('do we have any other Q&A platform like stackoverflow which is free source?','glove_model')   #query

top_n(input,glove_embeddings_loaded,df)

img/475440_2_En_6_Figx_HTML.jpg

如果查看结果，第一个问题类似于用户输入的查询。

用 GPT 找出类似的问题。

#similar questions from GPT (from 100 rows)

input=user_transform('do we have any other Q&A platform like stackoverflow which is free source?','GPT')   #query

top_n(input,embeddings_GPT,df)

img/475440_2_En_6_Figy_HTML.jpg

这个输出表明 GPT 模型并不完全适合我们的用例。

使用 BERT 查找类似问题。

#similar questions from BERT

input=user_transform('do we have any other Q&A platform like stackoverflow which is free source?','BERT')   #query

top_n(input,sentence_embeddings,df)

img/475440_2_En_6_Figz_HTML.jpg

如果查看结果，第一个问题类似于用户输入的查询。

七、总结和下一代自然语言处理

本章总结了各种过去、现在和未来的自然语言处理方法和技术。本章以关于 NLP 和深度学习的优秀研究论文的信息结束。

到目前为止，您已经了解了以下内容。

如何收集、读取、清理和处理文本数据
如何将文本转换为特征
如何使用 NLP 技术结合机器学习和深度学习来构建应用程序

现在让我们看看 NLP 的最新进展和未来。

我们讨论以下食谱。

食谱 1。文本到特征或分布式表示的最新进展
食谱 2。面向自然语言处理的高级深度学习
配方 3:强化学习在 NLP 中的应用
食谱 4。转移学习和预培训模型
食谱 5。自然语言处理中的元学习
食谱 6。面向自然语言处理的胶囊网络

但是在开始食谱之前，让我们快速回顾一下。

到目前为止，你已经在人工智能和 NLP 的保护伞下解决了一些最有趣的项目。

您还看到了 NLP 在与机器学习和深度学习相结合时，如何帮助解决跨行业和领域的复杂业务问题。

大约 50 年前，人类首次开始使用计算方法来分析人类语言，尽管这些技术中的大多数最近都取得了成功。

NLP 是 Siri 和 Alexa 背后的声音。类似地，客户服务聊天机器人利用 NLP 的力量在电子商务、医疗保健和公用事业部门生成个性化的响应。一些最流行的 NLP 应用程序是虚拟助手、情感分析、客户服务和翻译。

随着技术的发展，NLP 的未来变得更加以用户为中心。例如，虚拟助理可以回答更复杂的问题，并评估问题的含义和字面意义。(问:今天天气怎么样？甲:下雨了。你需要一把伞。)未来，公司将能够提供各种专业的客户服务，接听电话，向真人传递问题。

NLP 的应用不仅限于解决客户问题或提供个性化建议；这主要是技术援助。目前，如果你用 NLP 问，“我的网络怎么了？”，您可以训练它提供一个错误列表。未来，NLP 将能够理解用户的真实意图。NLP 的未来令人兴奋，因为 NLP 的进步将使人类从问题转向结果。当 NLP 理解用户评论并为他们的真实意图提供更复杂的解决方案时，这将是一个巨大的飞跃。

NLP 将能够理解人类的情感。随着 NLP 技术的发展，计算机将把它们目前的处理能力扩展到对人类语言的整体理解。到目前为止，NLP 仅限于解释有限的人类情感，包括喜悦或愤怒。最终，NLP 将被编程为理解更复杂的人类语言元素，如幽默、讽刺等等。

在一个激动人心的时刻，NLP 将结合人脸和手势识别等其他技术为企业创造收入，使其更加灵活和高效。

有了 Alexa、Siri 和 Google Duplex，下一代 NLP 才刚刚开始。

NLP 对于教导机器执行复杂的自然语言任务也是有用的，例如机器翻译和对话生成。

现在，让我们通过菜谱来揭示一些最先进的下一代算法将如何在未来的 NLP 时代发挥重要作用。

配方 7-1。文本到特征或分布式表示的最新进展

本食谱讨论了文本到特征或分布式表示的最新进展。

问题

在文本到特征或分布式表示方面，有哪些超出你所学的新进展(例如，单词嵌入、GloVe、fastText 等)。)?

解决办法

分布式表示在过去已经被广泛用于研究各种 NLP 任务，但是 CBOW 和 skip-gram 模型的流行度已经增加。第三章讨论了大多数最新的嵌入技术。但是请保持警惕，因为这个领域正在快速发展。

配方 7-2。面向自然语言处理的高级深度学习

这个食谱着眼于 NLP 的一些高级深度学习技术。

问题

你想了解 NLP 深度学习技术的最新进展。

解决办法

让我们讨论循环神经网络和深度生成模型。

循环神经网络

循环神经网络的基本形式，网络函数将组件向上组合，以计算更高级句子的表示。循环神经网络用于各种应用中，如下所示。

从语法上分析
使用短语级表示的情感分析
语义关系的分类(例如，主题消息)

深度生成模型

变分自编码器(VAE)和生成对抗网络(GAN)等深度生成模型被应用于自然语言处理，以发现丰富的自然语言结构。众所周知，由于不受限制的潜在空间，标准句子自编码器不能生成真实的句子。VAE 执行隐藏空间的先验分布，允许模型生成适当的样本。VAE 由编码到潜在空间的编码器和生成器网络组成，然后从该空间生成样本。训练旨在最大化在生成的模型中观察到的数据的对数概率的变化下限。

配方 7-3。强化学习在自然语言处理中的应用

这个食谱讨论了强化学习在 NLP 中的应用。

问题

你想了解强化学习在 NLP 领域的作用。

解决办法

强化学习利用行为心理学，软件代理在环境中执行操作，增加代理的累积回报。该系统试图通过在模拟器环境中反复试验来理解行为随时间的变化。

我们来讨论一下强化学习方法。

勘探与开发的权衡

在这里，代理必须对文本进行分类。智能体通过不同的神经网络层来决定下一步的动作状态。探索和利用之间的权衡涉及到代理必须探索可能的行为状态的困境，这将有助于合法地对文本进行分类，并利用当前的行为状态来获得最佳结果。在 NLP 中，我们应该通过使用 softmax 函数计算置信限来解决这个问题。即使存在未知的不确定性，softmax 和更高的置信限也有助于获得最高的回报。

时间差异

时差概念涉及一种无模型的强化学习方法。它是基于时间 t + 1 的下一个行动状态可能比时间 t 的结果更好的概念，这就像蒙特卡罗方法。您可以根据结果调整权重。但是在这里，使用称为启动的概念，在结果已知之前，您根据当前输入调整结果以获得最佳结果。例如，您希望对所有未知数据进行排名。在时间 t + 1，你应该可以预测标签，同理，在 t + 2，你应该可以确定下一个动作状态。下一步是您应该能够对属于该标签的所有数据进行分类，并通过行为克隆找到所有其他类似的标签。

配方 7-4。转移学习和预培训模型

这份食谱讨论了迁移学习和预训练模型是如何改变 NLP 前景的。

问题

您希望深入了解迁移学习和 NLP 预训练模型的最新进展。

解决办法

简单来说，迁移学习就是在大规模数据集上形成一个模型，然后用这个预先训练好的模型去学习另一个下游任务(如目标任务)的过程。通过 ImageNet 数据集，迁移学习在计算机视觉领域获得了广泛的应用。在这里，我们不讨论计算机视觉。相反，让我们关注这些概念如何应用于自然语言处理领域。

迁移学习旨在使用源领域中有价值的知识来帮助模拟目标领域中的表现。

为什么需要迁移学习 NLP？

在 NLP 应用中，特别是当我们没有足够的数据集来解决任务(称为 T-target tasks )时，我们希望从其他任务中转移知识，以避免过拟合，提高 T 的性能。

将知识转移到语义相似/相同但数据集不同的任务。

NLP 中神经传递的学习很大程度上取决于源数据集和目标数据集的语义相似度。

对于 NLP 和 ML 的研究人员和爱好者来说，这些都是激动人心的时刻。最近，预训练的语言模型在广泛的 NLP 任务中获得了最新的结果，例如序列标签和句子分类。最近使用预训练语言模型的工作包括 ULMFit、ELMo、GLoMo 和 OpenAI 转换。这些语言建模系统使用分层表示或图形表示来执行整个模型的大量预训练。这个概念改变了多年来用于处理许多 NLP 任务的一元嵌套单词的使用，支持更复杂和抽象的表示。

NLP 中的前期训练是如何进行的？为什么前期培训有用？它允许模型从大规模语料库中捕捉和学习各种语言现象，如长期依赖和否定。然后使用(转移)这些知识来初始化，然后形成另一个模型来执行特定的 NLP 任务，例如情感分类。

这在 NLP 中是可行的；例如，否定是从文本信息中检测情感极性的重要属性。此外，否定也可能是有用的，例如，检测情绪或讽刺，这是最复杂和未解决的 NLP 任务之一。否定在许多 NLP 任务中是有用的，因此预训练模型具有共同的属性。

在 NLP 研究中缺少带注释的数据集或语言资源时，具有通用属性的语言模型可能是有用的。这个想法令人兴奋，因为我们正试图建立一个通用模型，并解决 NLP 研究的一些困难挑战:数据和语言资源的可用性。

到目前为止，我们知道从预先形成的语言模型(如嵌入单词的形式)中获得的知识适用于许多 NLP 任务。这里的问题是，这种潜在的知识形式的知识不够广泛或不足以正确地执行目标任务或下游任务。对此有许多解释——有些事情我们知道，有些我们不知道——但现在，我们简要介绍一种解决这些限制的最新方法。

ELMo 是一种最近流行的方法，它被描述为“使用跨神经层堆栈的深度上下文表示来预训练整个模型”，而不是简单地使用嵌套单词(一键编码特征表示)作为初始化。

伯特、埃尔莫、乌尔姆菲特。(NLP 破解迁移学习的方式)

2018 年是基于文本的机器学习模型(更准确地说，是自然语言处理或 NLP)的转折点。我们对如何最好地表达单词和短语以更好地理解潜在含义和关系的概念理解正在迅速发展。此外，NLP 社区强调，你可以免费下载所有像 BERT 这样的预训练模型，并在你自己的模板和管道(例如 ImageNet)中使用它们，这表明了类似的开发如何加速了具有令人难以置信的强大组件的机器学习的发展。

这一发展的最新里程碑之一是 BERT 的发布，它被描述为标志着 NLP 新时代的开始。BERT 是一个模型，它打破了一些记录，并确定了模型如何处理基于语言的任务。在模型文档发布后不久，该团队还开放了模型代码，将预先训练好的模型版本下载到大型数据集。这是一个突破，任何建立涉及语言处理的机器学习模型的人都可以将这个强大的工具作为现成的组件使用——节省时间和可以形成语言处理模型的精力、知识和资源。

BERT 基于 NLP 社区中出现的一些最新发展，包括半监督学习序列(Andrew Wells Dale)、ELMo (Matthew Peters 和研究人员 AI2 和 UW CSE)、ULMFiT(由创始人 Fast.ai、杰瑞米·霍华德和塞巴斯蒂安·罗德斯开发)、OpenAI transformer(研究人员 OpenAI Redford、Nara Singham 和 Salimans Sutskever)和 transformers (Vaswani 等)。).

你需要了解一些概念才能理解 BERT 是什么。因此，在查看模型本身所涉及的概念之前，让我们先看看使用 BERT 的不同方式。

BERT 是一个转换器编码器堆栈，它有两种型号。

BERT-Base: OpenAI 变压器
BERT-Large:一个非常大的模型，具有最先进的结果

两种 BERT 模型都有许多编码器层(也称为变换块):12 的基本版本和 24 的大型版本。它们还具有更高的预测网络和更多的注意力头(分别为 12 和 16)，而不是基础纸上的 transformer 实现的默认配置(分别为 768 和 1024 个隐藏单元)(六层编码器、512 个隐藏单元和八个注意力头)。

嵌入的新时代

这些新的发展导致了单词编码方式的新变化。到目前为止，主流 NLP 模型如何对待语言，嵌入词一直是主力。word2vec 和 Glove 等方法已广泛用于此类任务。在指出变化之前，让我们回顾一下它们的用法。

该领域很快意识到，使用大量文本数据比形成与这种通常很小的数据集模型并行的预集成要好。因此，word2vec 或 GloVe 可以下载预训练期间生成的单词列表及其组合。

ULMFiT:自然语言处理中的迁移学习

ULMFiT 引入了一种有效利用模型在预训练期间学习的大部分内容的方法——这不仅仅是情境化的组合。ULMFiT 使用语言模型和过程来有效地使语言模型适应各种任务。

NLP 最终会找到让迁移学习成为可能的方法，就像计算机视觉一样。

变形金刚:超越 LSTM

《变形金刚》文档、代码和机器翻译的发布让一些人相信他们正在取代 LSTM。此外,《变形金刚》在管理长期依赖关系方面比 LSTM 更胜一筹。

天资

flair 框架在解决 NER (POS)、词义消歧和文本分类等自然语言处理问题方面提供了一流的性能。这个 NLP 框架直接建立在 PyTorch 之上。

当今大多数先进的方法都是基于一种叫做文本嵌入的技术。它将文本转换成大空间的数字表示。它允许文档、句子、单词、字符在这个大空间中作为向量来表达它们自己。

flair 是 NLP 的一个令人兴奋的新成员，因为 Zalando Research 最近的文章“用于序列标记的上下文字符串嵌入”( http://alanakbik.github.io/papers/coling2018.pdf )描述了一种总是比尖端解决方案更好的方法。它在 flair 中得到实现和完全支持，可以创建文本分类器。

为什么是伯特？

随着网络的大规模增长，我们有了大量的数据。并且只有一些文本数据被注释。对于像自然语言处理这样的任务，我们需要大量带注释的数据来学习有监督的或无注释的数据来进行无监督学习。各种研究人员更喜欢无监督学习。他们在网上使用大量无注释的文本(称为预训练)来突出一些常见的语言表示模型训练技术。

BERT 是谷歌开发的这些预训练模型之一，可以适应新的数据，并创建 NLP 系统，如回答问题，生成文本，排序文本，文本合成和情感分析。由于 BERT 是在大量数据的基础上形成的，它简化了语言建模过程。使用预先形成的 BERT 模型的主要优点是，与这些数据集的训练相比，精确度显著提高。

BERT 是基于在训练之前在上下文表示中的最近工作。这是先前使用纯文本语料库形成的第一个深度双向无监督语言表示。BERT 表示具有左右上下文的上下文表示。概念简单，经验丰富。BERT 优于其他方法，因为它是第一个具有域自适应特征的用于 NLP 预训练的无监督深度双向系统。从 BERT 文档中可以确定，通过适当的语言模型训练方法，基于转换器(自关注)的编码器可以潜在地用作语言模型的替代。

GPT 使用短语分隔符([SEP])和分类标记([CLS])，它们只在特定时间输入。
伯特在整个预训练过程中学习 SEP、CLS 句子和句子的整合。
GPT 对所有的拟合实验都使用类似的 5 到 5 的学习率。BERT 在开发集上选择精确的、特定任务的和最有效的开发学习率。

以下是一些挑战。

因为我们有大量的训练数据，即使用一个 GPU 也很难训练；因此，可以使用谷歌 TPU。
推理需要的时间长。因此，我们修改了超参数，以使系统准确并尽快得到结果。它对每个超参数都有一个记录，并选择了超参数的优化组合。

我们发现元学习语言模型可以形成记忆最近条目的文章，并且是预测文章下一部分的有用起点。

n-gram 卷积层是标准的卷积层，通过各种卷积滤波器提取句子中不同位置的 n-gram 特征。
主胶囊层为第一胶囊层。该胶囊用矢量输出胶囊代替 CNN 标量输出特征检测器，以保留诸如单词的局部顺序和单词的语义表示等参数。
在卷积胶囊层中，每个胶囊仅连接到下层中的局部区域。区域中的这些胶囊乘以一个变换矩阵来学习父子关系。然后执行弦路由以在该层中创建父胶囊。胶囊层完全连接。层下面的胶囊在胶囊列表中被展平，并被引入完全连接的胶囊层。该封装通过一个转换矩阵，然后通过一个路由协议产生一个倍增的最终封装及其类别。
动态路由的基本思想是设计非线性映射。通过确保每个胶囊的输出被发送到下一层中适当的父节点，非线性映射被迭代地构造。对于每个潜在的父代，胶囊网络可以增加或减少与动态路由的连接，这比原始路由策略更有效，例如 CNN 中的最大累积，它检测文本中是否存在某个特征。但是，关于实体的空间信息会丢失。研究人员探索了三种策略，通过减少一些嘈杂胶囊的不便来提高路由过程的准确性。不与特定类别相关联的词有助于胶囊网络更有效地建立亲子关系模型。

研究人员已经证明了胶囊网络在文本分类中的有效性。更重要的是，胶囊网络还显示了从多类到多标签标签的文本分类的显著改进。