
A Summary of Word Vectorization for Natural Language Processing

(2017-04-04 23:52:47)
Tags: nlp, word2vec, fasttext, glove, tsne

Category: Data Mining / Natural Language Processing

1. Word Vector Representations

distributional representation vs. distributed representation: a distributional representation is a family of representations grounded in co-occurrence statistics, while a distributed representation maps words from a high-dimensional space X into a low-dimensional dense space Y. The distributional hypothesis provides the theoretical basis: words that occur in similar contexts tend to have similar meanings.

Word vectorization, i.e. turning text into numbers, is the foundation of natural language processing; once text is numeric, downstream tasks such as classification and clustering can be handled like any other data-mining problem.

1.1 one-hot encoding

In vector space terms, each word is a vector with a single 1 and zeros everywhere else:

[0 0 0 0 0 0 0 0 0 0 1 0 0 0 0]
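
A minimal sketch of one-hot encoding for a toy vocabulary (the vocabulary below is invented for illustration):

import numpy as np

# toy vocabulary; in practice it is built from the training corpus
vocab = ["i", "like", "natural", "language", "processing"]
word_to_idx = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    # a single 1 at the word's index, zeros everywhere else
    vec = np.zeros(len(vocab), dtype=int)
    vec[word_to_idx[word]] = 1
    return vec

print(one_hot("language"))  # [0 0 0 1 0]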

1.2 count-based

tf-idf

SVD (latent semantic analysis over a co-occurrence or term-document matrix)

GloVe (trained on global co-occurrence counts)
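
A rough sketch of the count-based route with scikit-learn, assuming a tiny invented corpus: build a tf-idf term-document matrix and reduce it with truncated SVD (essentially LSA):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

# toy corpus, purely illustrative
corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are animals",
]

# tf-idf weighted term-document matrix
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(corpus)  # shape: (n_docs, n_terms)

# low-rank projection via truncated SVD (LSA)
svd = TruncatedSVD(n_components=2)
X_lsa = svd.fit_transform(X)
print(X_lsa.shape)  # (3, 2)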

1.3 word embedding

Neural-network-based word representations. word2vec offers 2 × 2 = 4 training configurations:

Network architecture: CBOW or skip-gram

Training method: hierarchical softmax or negative sampling
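
A small gensim sketch of these 2 × 2 choices (the corpus is a placeholder and the hyper-parameters are illustrative; the size keyword assumes the gensim 2.x/3.x-era API):

from gensim.models import Word2Vec

# placeholder corpus: a list of tokenized sentences
sentences = [
    ["natural", "language", "processing", "is", "fun"],
    ["word", "vectors", "capture", "meaning"],
]

# sg=0 -> CBOW, sg=1 -> skip-gram
# hs=1 -> hierarchical softmax; hs=0 with negative > 0 -> negative sampling
model = Word2Vec(sentences, size=100, window=5, min_count=1,
                 sg=1, hs=0, negative=5)

print(model.wv["word"].shape)  # (100,)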

2. Word Vector Tools

word2vec

https://code.google.com/archive/p/word2vec/

gensim

https://github.com/RaRe-Technologies/gensim

fasttext

https://github.com/facebookresearch/fastText
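
The snippet below loads the pre-trained English fastText vectors (wiki.en.vec, word2vec text format) with gensim's KeyedVectors and prints the nearest neighbours of "car":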


from gensim.models import KeyedVectors

# load pre-trained vectors in word2vec text format
model = KeyedVectors.load_word2vec_format("wiki.en.vec")
words = list(model.vocab)

print("word count: {}".format(len(words)))

print("Dimensions of word: {}".format(len(model[words[0]])))

demo_word = "car"

for similar_word in model.similar_by_word(demo_word):
    print("Word: {0}, Similarity: {1:.2f}".format(
        similar_word[0], similar_word[1]
    ))

word count: 2519370
Dimensions of word: 300
Word: cars, Similarity: 0.83
Word: automobile, Similarity: 0.72
Word: truck, Similarity: 0.71
Word: motorcar, Similarity: 0.70
Word: vehicle, Similarity: 0.70
Word: driver, Similarity: 0.69
Word: drivecar, Similarity: 0.69
Word: minivan, Similarity: 0.67
Word: roadster, Similarity: 0.67
Word: racecars, Similarity: 0.67

Glove

https://github.com/stanfordnlp/GloVe
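
The script below loads the pre-trained glove.6B.100d.txt vectors into a NumPy matrix, L2-normalizes the rows for cosine-similarity queries, implements a nearest-neighbour lookup, and plots a t-SNE projection of part of the vocabulary: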



import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE  # optional alternative to bh_sne
from tsne import bh_sne

def read_glove(glove_file):
    # parse a GloVe text file into a word index, an embedding matrix, and a row-normalized copy
    embeddings_index = {}
    embeddings_vector = []
    f = open(glove_file, "rb")
    word_idx = 0
    for line in f:
        values = line.decode("utf-8").split()
        word = values[0]
        vector = np.asarray(values[1:], dtype="float64")
        embeddings_index[word] = word_idx
        embeddings_vector.append(vector)
        word_idx += 1
    f.close()
    inv_index = {v : k for k, v in embeddings_index.items()}
    glove_embeddings = np.vstack(embeddings_vector)
    glove_norms = np.linalg.norm(glove_embeddings, axis=-1, keepdims=True)
    glove_embeddings_normed = glove_embeddings / glove_norms

    return embeddings_index, glove_embeddings, glove_embeddings_normed, inv_index

def get_emb(word, embeddings_index, glove_embeddings):
    idx = embeddings_index.get(word)
    if idx is None:
        return None
    else:
        return glove_embeddings[idx]

def get_normed_emb(word, embeddings_index, glove_embeddings_normed):
    idx = embeddings_index.get(word)
    if idx is None:
        return None
    else:
        return glove_embeddings_normed[idx]


def most_similar(words, inv_index, embeddings_index, glove_embeddings, glove_embeddings_normed, topn=10):
    query_emb = 0

    if isinstance(words, list):
        for word in words:
            query_emb += get_emb(word, embeddings_index, glove_embeddings)
    else:
        query_emb = get_emb(words, embeddings_index, glove_embeddings)

    query_emb = query_emb / np.linalg.norm(query_emb)

    cosin = np.dot(glove_embeddings_normed, query_emb)

    idxs = np.argsort(cosin)[::-1][:topn]

    return [(inv_index[idx], cosin[idx]) for idx in idxs]


def plot_tsne(glove_embeddings_normed, inv_index, perplexity, img_file_name, word_cnt=100):
    #word_emb_tsne = TSNE(perplexity=30).fit_transform(glove_embeddings_normed[:word_cnt])
    word_emb_tsne = bh_sne(glove_embeddings_normed[:word_cnt], perplexity=perplexity)
    plt.figure(figsize=(40, 40))
    axis = plt.gca()
    np.set_printoptions(suppress=True)
    plt.scatter(word_emb_tsne[:, 0], word_emb_tsne[:, 1], marker=".", s=1)
    for idx in range(word_cnt):
        plt.annotate(inv_index[idx],
                     xy=(word_emb_tsne[idx, 0], word_emb_tsne[idx, 1]),
                     xytext=(0, 0), textcoords='offset points')
    plt.savefig(img_file_name)
    plt.show()

def main():
    glove_input_file = "glove.6B.100d.txt"
    embeddings_index, glove_embeddings, glove_embeddings_normed, inv_index = read_glove(glove_input_file)
    print(np.isfinite(glove_embeddings_normed).all())
    print(glove_embeddings.shape)
    print(get_emb("computer", embeddings_index, glove_embeddings))
    print(most_similar("cpu", inv_index, embeddings_index, glove_embeddings, glove_embeddings_normed))
    print(most_similar(["river", "chinese"], inv_index, embeddings_index, glove_embeddings, glove_embeddings_normed))
    # plot tsne viz
    plot_tsne(glove_embeddings_normed, inv_index, 30.0, "tsne.png", word_cnt=1000)

if __name__ == "__main__":
    main()
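
Note: bh_sne is provided by the standalone tsne package; if it is not installed, the commented-out sklearn.manifold.TSNE call in plot_tsne is an alternative that produces an equivalent 2-D projection.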


3. Papers

  1. Neural Word Embeddings as Implicit Matrix Factorization
  2. Linguistic Regularities in Sparse and Explicit Word Representations
  3. Random Walks on Context Spaces: Towards an Explanation of the Mysteries of Semantic Word Embeddings
  4. word2vec Explained: Deriving Mikolov et al.'s Negative-Sampling Word-Embedding Method
  5. Linking GloVe with word2vec
  6. Word Embedding Revisited: A New Representation Learning and Explicit Matrix Factorization Perspective
  7. Hierarchical Probabilistic Neural Network Language Model
  8. Notes on Noise Contrastive Estimation and Negative Sampling
  9. Noise-contrastive estimation: A new estimation principle for unnormalized statistical models
  10. Distributed Representations of Words and Phrases and their Compositionality
  11. Efficient Estimation of Word Representations in Vector Space
  12. GloVe: Global Vectors for Word Representation
  13. Neural probabilistic language models
  14. Natural language processing (almost) from scratch
  15. Learning word embeddings efficiently with noise contrastive estimation
  16. A scalable hierarchical distributed language model
  17. Three new graphical models for statistical language modelling
  18. Improving word representations via global context and multiple word prototypes
  19. A Primer on Neural Network Models for Natural Language Processing
  20. Joulin, Armand, et al. "Bag of tricks for efficient text classification." FAIR 2016
  21. P. Bojanowski, E. Grave, A. Joulin, T. Mikolov, Enriching Word Vectors with Subword Information

https://github.com/oxford-cs-deepnlp-2017/lectures

wget http://nlp.stanford.edu/data/glove.6B.zip

wget https://s3-us-west-1.amazonaws.com/fasttext-vectors/wiki.zh.zip

wget https://s3-us-west-1.amazonaws.com/fasttext-vectors/wiki.en.zip

https://blog.manash.me/how-to-use-pre-trained-word-vectors-from-facebooks-fasttext-a71e6d55f27

4. NLP Applications

The foundation of natural language processing is vectorizing text; on top of word vectors one can build classification, clustering, sentiment analysis, and the other applications listed below.

Natural Language Processing

Topic Classification

Topic modeling

Sentiment Analysis

Google Translate

Chatbots / dialogue systems

Natural language query understanding (Google Now, Apple Siri, Amazon Alexa)

Summarization
