NLP Study Notes 1

qbning / 2024-09-18 / Original post

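1. Common libraries
2. Handling strings in Python
- 1. The str type
- 2. The bytes type
- 3. The StringIO class
3. Processing corpora in Python
- 1. Reading a corpus
- 2. Deduplication
- 3. Stop words
- 4. Edit distance
- 5. Text normalization
- 6. Word segmentation
- 7. Term frequency-inverse document frequency (TF-IDF)
- 8. One-hot encoding
4. Installing PyTorch & Transformers
- NVIDIA GPUs
- AMD GPUs
- CPU
- Verify
- Transformers
- Verify
5. Basic PyTorch usage
- Creating tensors
  - From a list or numpy.array
  - All-zeros, all-ones, or random tensors
  - Filled tensors
- Transforming tensors
  - Concatenate (cat) and stack (stack)
  - Split (chunk/split)
  - Reshape (reshape)
  - Swap dimensions (transpose)
  - Insert/remove dimensions (unsqueeze/squeeze)
  - Expand dimensions (expand)
  - Repeat (repeat)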

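I am working through the book 《pytorch自然语言处理入门与实战》 (PyTorch Natural Language Processing: Introduction and Practice), which probably suits a complete beginner like me. Later I will definitely move on to Li Mu's material, or to other books.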

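1. Common libraries

numpy: scientific computing

matplotlib: charts and visualization

scikit-learn: data mining and data analysis

nltk: bundles more than 50 corpora along with common algorithms

spacy: named entity recognition and pretrained word vectors; the model for the target language must be installed first

jieba: Chinese word segmentation

pkuseg: Chinese word segmentation from a Peking University (PKU) paper

wn: a package for loading and using WordNet

pandas: data processing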

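2. Handling strings in Python

1. The str type

A str is an immutable object

ord() returns the code point of a character

chr() converts a code point back into a character

split turns a string into a list, and join turns the list back into a string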
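A quick sketch of the points above, just to see them in action. The sample sentence is made up; everything else is built into Python.

```python
# ord() and chr() are inverse mappings between characters and code points.
code = ord("A")
char = chr(code)

# str is immutable, so an "edit" goes through a list and back.
sentence = "natural language processing"
tokens = sentence.split(" ")       # string -> list of words
tokens[0] = tokens[0].upper()      # lists are mutable, so this is allowed
rebuilt = " ".join(tokens)         # list -> string

print(code, char)   # 65 A
print(tokens)       # ['NATURAL', 'language', 'processing']
print(rebuilt)      # NATURAL language processing
```

Trying `sentence[0] = "N"` directly would raise a TypeError, which is what "immutable" means in practice.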

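Common methods

find: index of the first occurrence (-1 if not found)
rfind: index of the last occurrence (-1 if not found)
count: number of occurrences
startswith: whether the string starts with a given prefix
endswith: whether the string ends with a given suffix
isdigit: whether every character is a digit
isalpha: whether every character is a letter
isupper: whether all letters are uppercase
lstrip: remove the given characters from the start (whitespace by default)
rstrip: remove the given characters from the end (whitespace by default)
strip: remove the given characters from both ends (whitespace by default)
replace: replace a substring
center: center the string within a given width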
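The methods listed above are easier to remember after running them once. A small sketch on a made-up sample string:

```python
s = "--hello world--"

first_o = s.find("o")          # index of the first "o"
last_o = s.rfind("o")          # index of the last "o"
num_l = s.count("l")           # how many times "l" occurs

left = s.lstrip("-")           # strip only the leading dashes
right = s.rstrip("-")          # strip only the trailing dashes
both = s.strip("-")            # strip dashes from both ends

swapped = both.replace("world", "NLP")
boxed = "hello".center(9, "*")  # pad to width 9 with "*"

print(first_o, last_o, num_l)   # 6 9 3
print(left, right, both)        # hello world-- --hello world hello world
print(swapped, boxed)           # hello NLP **hello**
print(s.startswith("--"), s.endswith("!"))                   # True False
print("123".isdigit(), "abc".isalpha(), "ABC".isupper())     # True True True
```

All of these return new values; `s` itself is never modified.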

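2. The bytes type

>>> byte1 = b"hello"

Converting to and from str: str() only returns the printable representation of the bytes object, while decode() performs the actual conversion (UTF-8 by default)

>>> print(str(byte1))
b'hello'
>>> print(byte1.decode())
hello

In the other direction, a str is turned into bytes with encode, which takes the name of the encoding to use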
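A minimal round-trip sketch. The sample text is arbitrary; the point is that the same string produces different byte sequences under different encodings, so decoding has to use the matching one.

```python
text = "中文分词"

utf8_bytes = text.encode("utf-8")   # 3 bytes per character here
gbk_bytes = text.encode("gbk")      # 2 bytes per character here

print(len(text), len(utf8_bytes), len(gbk_bytes))  # 4 12 8

# Decoding with the matching encoding restores the original string.
restored = utf8_bytes.decode("utf-8")
print(restored == text)  # True

# Decoding with the wrong encoding fails or yields garbage.
try:
    gbk_bytes.decode("utf-8")
    wrong_decode_failed = False
except UnicodeDecodeError:
    wrong_decode_failed = True
print(wrong_decode_failed)  # True
```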

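3. The StringIO class

Mutable, unlike str: a StringIO object is an in-memory text buffer that can be written to piece by piece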

>>> import io
>>> sio = io.StringIO()
>>> sio.write('hello')
5
>>> sio.write(' ')
1
>>> sio.write('world')
5
>>> print(sio.getvalue())
hello world
>>> sio.close()

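3. Processing corpora in Python

1. Reading a corpus

Plain text (txt)

f = open('text.txt', encoding='utf8')  # open the file as UTF-8
words = []  # one list of tokens per line
for l in f:
    word = l.strip().split(' ')  # drop the trailing newline, then split on spaces
    words.append(word)
f.close()  # close the file

csv

import csv
f = open('file.csv', encoding='utf8', newline='')  # open the file as UTF-8
reader = csv.reader(f)
lines = []
for l in reader:
    lines.append(l)  # each row is a list of strings
f.close()  # close the file

json

import json
f = open('file.json', 'r', encoding='utf8')  # open the file as UTF-8 in read mode
data = json.load(f)  # parse the whole JSON file in one call
f.close()  # close the file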
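The same three reads can also be written with `with` blocks, which close the file automatically even if an exception is raised. This is only a sketch: the sample files and their contents are made up and written to a temporary directory so the snippet runs on its own.

```python
import csv
import json
import os
import tempfile

# Hypothetical sample files, created only for this demo.
tmp_dir = tempfile.mkdtemp()
txt_path = os.path.join(tmp_dir, "text.txt")
csv_path = os.path.join(tmp_dir, "file.csv")
json_path = os.path.join(tmp_dir, "file.json")

with open(txt_path, "w", encoding="utf8") as f:
    f.write("I love NLP\nNLP is fun\n")
with open(csv_path, "w", encoding="utf8", newline="") as f:
    csv.writer(f).writerows([["id", "text"], ["1", "hello"]])
with open(json_path, "w", encoding="utf8") as f:
    json.dump({"lang": "zh", "tokens": ["你好", "世界"]}, f, ensure_ascii=False)

# Reading: each file is closed as soon as its with block ends.
with open(txt_path, encoding="utf8") as f:
    words = [line.strip().split(" ") for line in f]
with open(csv_path, encoding="utf8", newline="") as f:
    rows = list(csv.reader(f))
with open(json_path, encoding="utf8") as f:
    data = json.load(f)

print(words)     # [['I', 'love', 'NLP'], ['NLP', 'is', 'fun']]
print(rows)      # [['id', 'text'], ['1', 'hello']]
print(data)      # {'lang': 'zh', 'tokens': ['你好', '世界']}
print(f.closed)  # True
```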

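2. Deduplication

Use a set to remove duplicates (add inserts an element, in tests membership). For very large datasets, use a BitMap or a Bloom filter instead.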
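A small sketch of set-based deduplication that also keeps the first-seen order, which plain `set(lines)` does not guarantee. The helper name `dedupe` and the sample sentences are made up for illustration.

```python
def dedupe(lines):
    """Return lines with duplicates removed, keeping first-seen order."""
    seen = set()      # everything encountered so far
    unique = []
    for line in lines:
        if line not in seen:   # average O(1) membership test
            seen.add(line)
            unique.append(line)
    return unique


corpus = ["今天天气好", "我爱NLP", "今天天气好", "hello", "我爱NLP"]
print(dedupe(corpus))  # ['今天天气好', '我爱NLP', 'hello']
```

The set holds every distinct line in memory, which is why approximate structures such as a Bloom filter become attractive once the corpus no longer fits.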

3. Stop words

Ready-made stop word lists can be found on GitHub (search for stopwords)
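Once a list is downloaded, filtering is just a membership test against a set. In this sketch the tiny `stopwords` set is a stand-in I made up; a real list would be loaded from the downloaded file, one word per line.

```python
# Hypothetical, tiny stop word list; a real one would come from a file.
stopwords = {"the", "a", "is", "of"}

tokens = "the cat is a friend of the dog".split()
filtered = [t for t in tokens if t not in stopwords]

print(filtered)  # ['cat', 'friend', 'dog']
```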

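4. Edit distance

Edit distance measures how different two strings are. Three operations are defined: insert a character, delete a character, and replace a character. The edit distance is the minimum number of such operations needed to turn one string into the other, and it can be computed with dynamic programming (DP).

def minDistance(word1: str, word2: str) -> int:
    n = len(word1)
    m = len(word2)
    # dp[i][j] is the edit distance between word1[:i] and word2[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for j in range(m + 1):
        dp[0][j] = j  # empty string -> word2[:j] takes j insertions
    for i in range(n + 1):
        dp[i][0] = i  # word1[:i] -> empty string takes i deletions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if word1[i - 1] == word2[j - 1]:
                dp[i][j] = dp[i - 1][j - 1]  # characters match, no extra cost
            else:
                # insert, delete, or replace, whichever is cheapest
                dp[i][j] = min(dp[i][j - 1], dp[i - 1][j], dp[i - 1][j - 1]) + 1
    return dp[-1][-1]  # the bottom-right cell is the answer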
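Since each row of the DP table only depends on the row above it, the table can be shrunk to two rows. The sketch below is my own variant of the same recurrence (the name `edit_distance` is not from the book), with a few inputs whose answers are easy to check by hand.

```python
def edit_distance(a: str, b: str) -> int:
    """Edit distance between a and b, keeping only two DP rows in memory."""
    prev = list(range(len(b) + 1))          # distances from "" to b[:j]
    for i in range(1, len(a) + 1):
        curr = [i] + [0] * len(b)           # distance from a[:i] to "" is i
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                curr[j] = prev[j - 1]       # match: carry the diagonal
            else:
                # insertion, deletion, replacement
                curr[j] = 1 + min(curr[j - 1], prev[j], prev[j - 1])
        prev = curr
    return prev[-1]


print(edit_distance("kitten", "sitting"))  # 3
print(edit_distance("horse", "ros"))       # 3
print(edit_distance("", "abc"))            # 3
print(edit_distance("abc", "abc"))         # 0
```

For "kitten" to "sitting" the three operations are: replace k with s, replace e with i, and insert g at the end.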

5. Text normalization

6. Word segmentation

7. Term frequency-inverse document frequency (TF-IDF)

8. One-hot encoding

4. Installing PyTorch & Transformers

PyTorch

[布客] PyTorch Chinese translation (apachecn.org)

NVIDIA GPUs

pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

AMD GPUs

Sadly, Windows is not supported:

ROCm is not available on Windows

CPU

pip3 install torch torchvision torchaudio

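Verify

Note that torch.version on its own is just a module and torch.cuda.is_available without parentheses is just a function object, so use the version attribute and actually call the function:

>>> import torch
>>> print(torch.__version__)  # version string of the installed build
>>> print(torch.cuda.is_available())  # True if a CUDA GPU can be used, otherwise False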

Transformers

pip install transformers

Verify

>>> from transformers import pipeline
>>> print(pipeline('sentiment-analysis')('I love you'))
No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
[{'label': 'POSITIVE', 'score': 0.9998656511306763}]

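If this raises an error like the one below, the cause is probably the network (the model could not be downloaded from huggingface.co):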

requests.exceptions.SSLError: (MaxRetryError("HTTPSConnectionPool(host='huggingface.co', port=443): Max retries exceeded with url: /distilbert/distilbert-base-uncased-finetuned-sst-2-english/resolve/main/config.json (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1006)')))"), '(Request ID: cf626477-ad07-40c9-b4ce-dcf8371fe213)')

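5. Basic PyTorch usage

The basic data type is the tensor.

A tensor is a multi-dimensional array that can be created and operated on directly on the GPU. (For background on GPUs, see 龚大's video series, starting with 上帝视角看GPU（1）：图形流水线基础, "A God's-Eye View of the GPU (1): Graphics Pipeline Basics".)

Running on the GPU lets computations exploit GPGPU parallelism, which makes them fast.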
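A minimal sketch of what "running on the GPU" looks like in code, assuming a standard PyTorch install. It falls back to the CPU when no CUDA device is available, so it runs on any machine; the tensor values are arbitrary.

```python
import torch

# Pick the GPU if one is usable, otherwise stay on the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Create the tensor directly on the chosen device.
t = torch.tensor([[1.0, 2.0], [3.0, 4.0]], device=device)

# The matrix product is computed on that device; .cpu() brings the result back.
result = (t @ t).cpu()

print(device)
print(result)  # tensor([[ 7., 10.], [15., 22.]])
```

Tensors on different devices cannot be mixed in one operation, so inputs are normally moved to the same device first with `.to(device)`.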

Creating tensors

From a list or numpy.array

>>> import torch
>>> t = torch.tensor([[1,2,3],[4,5,6]],dtype=torch.float32)
>>> print(t,t.shape,t.dtype)
tensor([[1., 2., 3.],
        [4., 5., 6.]]) torch.Size([2, 3]) torch.float32

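All-zeros, all-ones, or random tensors

>>> import torch
>>> rand_t = torch.rand((3,3)) # uniform on [0, 1); see also randint (integers in a given range), randn (standard normal), normal (Gaussian with a given mean and std)
>>> ones_t = torch.ones((2,2)) # arange(x) creates a 1-D tensor holding 0 to x-1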
>>> zeros_t = torch.zeros((1,8))
>>> print(rand_t)
tensor([[0.5173, 0.6960, 0.7608],
        [0.6487, 0.5882, 0.0938],
        [0.7563, 0.0548, 0.2958]])
>>> print(ones_t)
tensor([[1., 1.],
        [1., 1.]])
>>> print(zeros_t)
tensor([[0., 0., 0., 0., 0., 0., 0., 0.]])
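The other constructors mentioned in the comments look roughly like this. The shapes, ranges, mean and std are arbitrary choices for the sketch; the random values differ between runs and PyTorch versions, so only shapes and ranges are printed.

```python
import torch

ints = torch.randint(0, 10, (2, 3))                      # integers in [0, 10)
std_normal = torch.randn((2, 3))                         # mean 0, std 1
gauss = torch.normal(mean=5.0, std=2.0, size=(2, 3))     # mean 5, std 2
row = torch.arange(5)                                    # 0, 1, 2, 3, 4

print(ints.shape, ints.dtype)      # torch.Size([2, 3]) torch.int64
print(std_normal.shape, gauss.shape)
print(row)                         # tensor([0, 1, 2, 3, 4])
```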

Filled tensors

>>> import torch
>>> t = torch.full((4,5),9)
>>> print(t)
tensor([[9, 9, 9, 9, 9],
        [9, 9, 9, 9, 9],
        [9, 9, 9, 9, 9],
        [9, 9, 9, 9, 9]])

Transforming tensors

import torch
t = torch.tensor([[1,2,3],[4,5,6]])
print(t,t.shape)
s = t.repeat(1,1,2,2)
print(s,s.shape)

Concatenate (cat) and stack (stack)

>>> import torch
>>> t1 = torch.tensor([1,2,3]) # 1-D tensor, so its only dimension is dim 0
>>> t2 = torch.tensor([4,5,6])
>>> t3 = torch.cat([t1,t2]) # dim defaults to 0
>>> print(t3)
tensor([1, 2, 3, 4, 5, 6])
>>> t4 = torch.tensor([[1,2,3],[4,5,6]])
>>> t5 = torch.tensor([[4,5,6],[7,8,9]])
>>> t6 = torch.cat([t4,t5])
>>> t7 = torch.cat([t4,t5],dim = 1) # concatenate along dim 1 (side by side)
>>> print(t6)
tensor([[1, 2, 3],
        [4, 5, 6],
        [4, 5, 6],
        [7, 8, 9]])
>>> print(t7)
tensor([[1, 2, 3, 4, 5, 6],
        [4, 5, 6, 7, 8, 9]])
>>> t8 = torch.stack([t1,t2])
>>> print(t8)
tensor([[1, 2, 3],
        [4, 5, 6]])
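The difference between the two, as I understand it: cat joins along a dimension that already exists, while stack creates a new one. A short sketch comparing the resulting shapes, including stack with dim=1, which is not shown above:

```python
import torch

t1 = torch.tensor([1, 2, 3])
t2 = torch.tensor([4, 5, 6])

joined = torch.cat([t1, t2], dim=0)    # stays 1-D: shape (6,)
rows = torch.stack([t1, t2], dim=0)    # new dim in front: shape (2, 3)
cols = torch.stack([t1, t2], dim=1)    # new dim at the end: shape (3, 2)

print(joined.shape, rows.shape, cols.shape)
print(cols)
# tensor([[1, 4],
#         [2, 5],
#         [3, 6]])
```

stack requires all inputs to have exactly the same shape, whereas cat only requires them to agree on the dimensions that are not being joined.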

Split (chunk/split)

>>> import torch
>>> t1 = torch.tensor([1,2,3,4,5])
>>> print(torch.chunk(t1,1))
(tensor([1, 2, 3, 4, 5]),)
>>> print(torch.chunk(t1,2))
(tensor([1, 2, 3]), tensor([4, 5]))
>>> print(torch.chunk(t1,3))
(tensor([1, 2]), tensor([3, 4]), tensor([5]))
>>> print(torch.chunk(t1,4))
(tensor([1, 2]), tensor([3, 4]), tensor([5]))
>>> print(torch.chunk(t1,5))
(tensor([1]), tensor([2]), tensor([3]), tensor([4]), tensor([5]))
>>> t2 = torch.tensor([[1,2,3],[4,5,6],[7,8,9]])
>>> print(torch.split(t2,2,0))
(tensor([[1, 2, 3],
        [4, 5, 6]]), tensor([[7, 8, 9]]))
>>> print(torch.split(t2,2,1))
(tensor([[1, 2],
        [4, 5],
        [7, 8]]), tensor([[3],
        [6],
        [9]]))
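My reading of the outputs above: chunk takes the number of pieces, split takes the size of each piece. split also accepts a list of sizes, which the examples above do not show. A sketch on a 10-element tensor:

```python
import torch

t = torch.arange(10)

by_count = torch.chunk(t, 3)           # 3 pieces, each of size ceil(10 / 3) = 4 except the last
by_size = torch.split(t, 4)            # pieces of size 4, the last one smaller
by_list = torch.split(t, [2, 3, 5])    # explicit sizes, must sum to 10

print([len(c) for c in by_count])  # [4, 4, 2]
print([len(c) for c in by_size])   # [4, 4, 2]
print([len(c) for c in by_list])   # [2, 3, 5]
```

This also explains why `torch.chunk(t1, 4)` above returned only three pieces: with 5 elements the chunk size is ceil(5 / 4) = 2, and three chunks of that size already cover the tensor.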

Reshape (reshape)

>>> import torch
>>> t = torch.tensor([1,2,3,4,5,6])
>>> print(torch.reshape(t,(2,3)))
tensor([[1, 2, 3],
        [4, 5, 6]])

Swap dimensions (transpose)

>>> import torch
>>> t = torch.tensor([[1,2,3],[4,5,6]])
>>> print(t)
tensor([[1, 2, 3],
        [4, 5, 6]])
>>> print(torch.transpose(t,0,1))
tensor([[1, 4],
        [2, 5],
        [3, 6]])
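reshape and transpose can produce the same shape with different contents, which is easy to mix up. A small sketch of the difference: reshape keeps the elements in their original reading order, transpose actually swaps the axes.

```python
import torch

t = torch.tensor([[1, 2, 3], [4, 5, 6]])   # shape (2, 3)

reshaped = t.reshape(3, 2)        # same element order, new row length
transposed = t.transpose(0, 1)    # rows become columns
flat = t.reshape(-1)              # -1 lets PyTorch infer the size

print(reshaped)
# tensor([[1, 2],
#         [3, 4],
#         [5, 6]])
print(transposed)
# tensor([[1, 4],
#         [2, 5],
#         [3, 6]])
print(flat)  # tensor([1, 2, 3, 4, 5, 6])
```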

Insert/remove dimensions (unsqueeze/squeeze)

>>> import torch
>>> t = torch.tensor([[1,2,3],[4,5,6],[7,8,9],[10,11,12]])
>>> print(t,t.shape)
tensor([[ 1,  2,  3],
        [ 4,  5,  6],
        [ 7,  8,  9],
        [10, 11, 12]]) torch.Size([4, 3])
>>> t1 = torch.unsqueeze(t,0)
>>> print(t1,t1.shape)
tensor([[[ 1,  2,  3],
         [ 4,  5,  6],
         [ 7,  8,  9],
         [10, 11, 12]]]) torch.Size([1, 4, 3])
>>> t2 = torch.unsqueeze(t,1)
>>> print(t2,t2.shape)
tensor([[[ 1,  2,  3]],

        [[ 4,  5,  6]],

        [[ 7,  8,  9]],

        [[10, 11, 12]]]) torch.Size([4, 1, 3])
>>> t3 = t2.squeeze()
>>> print(t3,t3.shape)
tensor([[ 1,  2,  3],
        [ 4,  5,  6],
        [ 7,  8,  9],
        [10, 11, 12]]) torch.Size([4, 3])

Expand dimensions (expand)

>>> import torch
>>> t = torch.tensor([[1,2,3],[4,5,6]])
>>> print(t,t.shape)
tensor([[1, 2, 3],
        [4, 5, 6]]) torch.Size([2, 3])
>>> s = t.expand(1,2,2,3) # the target shape
>>> print(s,s.shape)
tensor([[[[1, 2, 3],
          [4, 5, 6]],

         [[1, 2, 3],
          [4, 5, 6]]]]) torch.Size([1, 2, 2, 3])

Repeat (repeat)

>>> import torch
>>> t = torch.tensor([[1,2,3],[4,5,6]])
>>> print(t,t.shape)
tensor([[1, 2, 3],
        [4, 5, 6]]) torch.Size([2, 3])
>>> s = t.repeat(1,1,2,2) # number of copies along each dimension
>>> print(s,s.shape)
tensor([[[[1, 2, 3, 1, 2, 3],
          [4, 5, 6, 4, 5, 6],
          [1, 2, 3, 1, 2, 3],
          [4, 5, 6, 4, 5, 6]]]]) torch.Size([1, 1, 4, 6])
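expand and repeat can give identical-looking results, but as far as I understand they differ in memory use: expand returns a view of the original data and only works on size-1 (or newly added leading) dimensions, while repeat makes a real copy. A sketch that checks this on a small made-up tensor:

```python
import torch

t = torch.tensor([[1], [2]])    # shape (2, 1)

e = t.expand(2, 3)              # arguments are the target shape
r = t.repeat(1, 3)              # arguments are copies per dimension

print(e)
# tensor([[1, 1, 1],
#         [2, 2, 2]])
print(torch.equal(e, r))                 # True: same values
print(e.data_ptr() == t.data_ptr())      # True: expand shares t's memory
print(r.data_ptr() == t.data_ptr())      # False: repeat allocated new memory
print(e.stride())                        # (1, 0): stride 0 re-reads the same element
```

Because the expanded tensor shares storage, writing to it in place is unsafe; calling `.clone()` or using repeat is the usual way to get an independent copy.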