A Summary of Word Vectorization in Natural Language Processing

(2017-04-04 23:52:47)

Natural Language Processing

1. Word vector representations

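Distributional representation vs. distributed representation: a distributional representation is a family of representations built from statistical information about how words are used, while a distributed representation maps a word from a high-dimensional space X to a low-dimensional space Y. The distributional hypothesis provides the theoretical basis for this idea: words that occur in similar contexts have similar meanings.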
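As a rough illustration of the distributional hypothesis, the sketch below builds context-count vectors from a tiny made-up corpus and compares them with cosine similarity. The corpus, the window size of 1, and the helper names `context_counts` and `cosine` are all assumptions made for this example.

```python
import math
from collections import Counter, defaultdict

# Toy corpus (hypothetical, for illustration only).
corpus = [
    "the cat drinks milk",
    "the dog drinks water",
    "the cat chases mice",
    "the dog chases cats",
]


def context_counts(sentences, window=1):
    """Count, for every word, the words that appear within `window` positions of it."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.split()
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[word][tokens[j]] += 1
    return counts


def cosine(a, b):
    """Cosine similarity between two sparse count vectors stored as Counters."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


counts = context_counts(corpus)
sim_cat_dog = cosine(counts["cat"], counts["dog"])
sim_cat_milk = cosine(counts["cat"], counts["milk"])
print("cat vs dog: {:.2f}".format(sim_cat_dog))
print("cat vs milk: {:.2f}".format(sim_cat_milk))
```

In this toy corpus "cat" and "dog" share exactly the same neighbours, so their context vectors point in the same direction, while "cat" and "milk" overlap far less.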

1.1 one-hot encoding

In vector space terms, this is a vector with a single 1 and zeroes everywhere else:

[0 0 0 0 0 0 0 0 0 0 1 0 0 0 0]
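A minimal sketch of one-hot encoding, assuming a tiny made-up vocabulary. The `one_hot` helper and the word list are illustrative and do not come from any library.

```python
def one_hot(word, vocabulary):
    """Return a list with a single 1 at the position of `word` in `vocabulary`."""
    vector = [0] * len(vocabulary)
    vector[vocabulary.index(word)] = 1
    return vector


# Hypothetical five-word vocabulary.
vocabulary = ["apple", "banana", "car", "dog", "elephant"]

car_vector = one_hot("car", vocabulary)
dog_vector = one_hot("dog", vocabulary)
print(car_vector)  # [0, 0, 1, 0, 0]

# The dot product of two different one-hot vectors is always 0.
dot = sum(a * b for a, b in zip(car_vector, dog_vector))
print(dot)  # 0
```

Because every pair of distinct one-hot vectors has a dot product of 0, this encoding carries no information about how similar two words are, and its length grows with the vocabulary size.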

TF-IDF
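A small sketch of TF-IDF weighting on three made-up documents. It assumes one common variant (term frequency divided by document length, and idf = log(N / df)); libraries often add smoothing, so their numbers will differ. The `tf_idf` helper is hypothetical.

```python
import math
from collections import Counter

# Toy documents (hypothetical), already tokenized.
docs = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "cats and dogs are pets".split(),
]


def tf_idf(documents):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    n_docs = len(documents)
    # Document frequency: in how many documents each term occurs.
    df = Counter()
    for doc in documents:
        df.update(set(doc))
    weights = []
    for doc in documents:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


weights = tf_idf(docs)
for term, weight in sorted(weights[0].items(), key=lambda kv: -kv[1]):
    print("{:>4}: {:.3f}".format(term, weight))
```

In the first document "the" occurs twice but also appears in another document, so it ends up with a lower weight than "cat", which occurs once and only in that document.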

SVD
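One way SVD is used for word vectors is to factorize a word-by-context co-occurrence matrix and keep only the top k components. The sketch below assumes a small hand-made count matrix and k = 2; the names `X`, `words`, and `cosine` are hypothetical.

```python
import numpy as np

# Hypothetical word-by-context co-occurrence counts (rows follow `words`).
words = ["cat", "dog", "car", "truck"]
X = np.array([
    [4.0, 3.0, 0.0, 0.0],
    [3.0, 4.0, 0.0, 1.0],
    [0.0, 0.0, 5.0, 4.0],
    [0.0, 1.0, 4.0, 5.0],
])

# Thin SVD: X = U @ diag(S) @ Vt, singular values sorted in descending order.
U, S, Vt = np.linalg.svd(X, full_matrices=False)

# Keep the top-k components as dense word vectors.
k = 2
word_vectors = U[:, :k] * S[:k]


def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


sim_cat_dog = cosine(word_vectors[0], word_vectors[1])
sim_cat_car = cosine(word_vectors[0], word_vectors[2])
print("cat vs dog:", round(sim_cat_dog, 3))
print("cat vs car:", round(sim_cat_car, 3))
```

Each word is now described by 2 numbers instead of 4, and words with similar count rows stay close to each other in the reduced space.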

GloVe
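For reference, GloVe fits word vectors to the logarithm of global co-occurrence counts. The objective below is written as I understand it from the GloVe paper listed in the papers section, so treat the notation as a sketch: $X_{ij}$ is the co-occurrence count of words $i$ and $j$, $w_i$ and $\tilde{w}_j$ are the word and context vectors, $b_i$ and $\tilde{b}_j$ are bias terms, and $V$ is the vocabulary size.

```latex
J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^{\top} \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2,
\qquad
f(x) =
\begin{cases}
  (x / x_{\max})^{\alpha} & \text{if } x < x_{\max}, \\
  1 & \text{otherwise.}
\end{cases}
```

The weighting function $f$ keeps very frequent pairs from dominating the loss and gives zero weight to pairs that never co-occur. The paper reports using $x_{\max} = 100$ and $\alpha = 3/4$.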

2. Tools for working with word vectors

word2vec

gensim

https://github.com/RaRe-Technologies/gensim

fasttext

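```
from gensim.models import KeyedVectors

# Assumption: the original post omits the loading step. The path is a
# placeholder for vectors stored in word2vec text format, for example the
# .vec file from one of the fastText archives downloaded in section 3.
model = KeyedVectors.load_word2vec_format("wiki.en.vec")

# gensim >= 4.0 exposes the vocabulary as key_to_index
# (older releases used model.vocab).
words = list(model.key_to_index)

print("word count: {}".format(len(words)))
print("Dimensions of word: {}".format(len(model[words[0]])))

demo_word = "car"

# similar_by_word returns (word, cosine similarity) pairs, most similar first.
for similar_word in model.similar_by_word(demo_word):
    print("Word: {0}, Similarity: {1:.2f}".format(
        similar_word[0], similar_word[1]
    ))
```

Output:

```
word count: 2519370
Dimensions of word: 300
Word: cars, Similarity: 0.83
Word: automobile, Similarity: 0.72
Word: truck, Similarity: 0.71
Word: motorcar, Similarity: 0.70
Word: vehicle, Similarity: 0.70
Word: driver, Similarity: 0.69
Word: drivecar, Similarity: 0.69
Word: minivan, Similarity: 0.67
Word: racecars, Similarity: 0.67
```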

GloVe

https://github.com/stanfordnlp/GloVe

```
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE


def read_glove(glove_input_file):
    """Load GloVe vectors from a text file (one word and its vector per line)."""
    embeddings_index = {}   # word -> row index
    embeddings_vector = []
    with open(glove_input_file, "r", encoding="utf-8") as f:
        for word_idx, line in enumerate(f):
            values = line.split()
            word = values[0]
            vector = np.asarray(values[1:], dtype="float64")
            embeddings_index[word] = word_idx
            embeddings_vector.append(vector)
    inv_index = {v: k for k, v in embeddings_index.items()}   # row index -> word
    glove_embeddings = np.vstack(embeddings_vector)
    # L2-normalize each row so that a dot product equals cosine similarity.
    glove_norms = np.linalg.norm(glove_embeddings, axis=-1, keepdims=True)
    glove_embeddings_normed = glove_embeddings / glove_norms
    return embeddings_index, glove_embeddings, glove_embeddings_normed, inv_index


def get_emb(word, embeddings_index, glove_embeddings):
    """Return the raw vector for `word`, or None if it is not in the vocabulary."""
    idx = embeddings_index.get(word)
    if idx is None:
        return None
    return glove_embeddings[idx]


def get_normed_emb(word, embeddings_index, glove_embeddings_normed):
    """Return the unit-length vector for `word`, or None if it is not in the vocabulary."""
    idx = embeddings_index.get(word)
    if idx is None:
        return None
    return glove_embeddings_normed[idx]


def most_similar(words, inv_index, embeddings_index, glove_embeddings,
                 glove_embeddings_normed, topn=10):
    """Return the topn (word, cosine similarity) pairs closest to the query.

    `words` is a single word or a list of words whose vectors are summed.
    """
    if isinstance(words, str):
        words = [words]
    query_emb = np.zeros(glove_embeddings.shape[1])
    for word in words:
        emb = get_emb(word, embeddings_index, glove_embeddings)
        if emb is None:
            raise KeyError("word not in vocabulary: {}".format(word))
        query_emb += emb
    query_emb = query_emb / np.linalg.norm(query_emb)
    # Rows are unit length, so this dot product is the cosine similarity.
    cosin = np.dot(glove_embeddings_normed, query_emb)
    idxs = np.argsort(cosin)[::-1][:topn]
    return [(inv_index[idx], cosin[idx]) for idx in idxs]


def plot_tsne(glove_embeddings_normed, inv_index, perplexity, img_file_name, word_cnt=100):
    """Project the first word_cnt vectors to 2-D with t-SNE and save a labelled scatter plot."""
    # The original called bh_sne from the `tsne` package; scikit-learn's TSNE
    # (the alternative the post had commented out) is used here instead.
    word_emb_tsne = TSNE(perplexity=perplexity).fit_transform(glove_embeddings_normed[:word_cnt])
    plt.figure(figsize=(40, 40))
    plt.scatter(word_emb_tsne[:, 0], word_emb_tsne[:, 1], marker=".", s=1)
    for idx in range(word_cnt):
        plt.annotate(inv_index[idx],
                     xy=(word_emb_tsne[idx, 0], word_emb_tsne[idx, 1]),
                     xytext=(0, 0), textcoords="offset points")
    plt.savefig(img_file_name)
    plt.show()


def main():
    glove_input_file = "glove.6B.100d.txt"
    embeddings_index, glove_embeddings, glove_embeddings_normed, inv_index = read_glove(glove_input_file)
    np.set_printoptions(suppress=True)
    # Sanity check: normalization produced no NaN or inf values.
    print(np.isfinite(glove_embeddings_normed).all())
    print(glove_embeddings.shape)
    print(get_emb("computer", embeddings_index, glove_embeddings))
    print(most_similar("cpu", inv_index, embeddings_index, glove_embeddings, glove_embeddings_normed))
    print(most_similar(["river", "chinese"], inv_index, embeddings_index, glove_embeddings, glove_embeddings_normed))
    # Plot a t-SNE visualization of the first 1000 words.
    plot_tsne(glove_embeddings_normed, inv_index, 30.0, "tsne.png", word_cnt=1000)


if __name__ == "__main__":
    main()
```

3. Papers

1. Neural Word Embedding as Implicit Matrix Factorization
2. Linguistic Regularities in Sparse and Explicit Word Representations
3. Random Walks on Context Spaces: Towards an Explanation of the Mysteries of Semantic Word Embeddings
4. word2vec Explained: Deriving Mikolov et al.'s Negative-Sampling Word-Embedding Method
5. Word Embedding Revisited: A New Representation Learning and Explicit Matrix Factorization Perspective
6. Hierarchical Probabilistic Neural Network Language Model
7. Notes on Noise Contrastive Estimation and Negative Sampling
8. Noise-contrastive estimation: A new estimation principle for unnormalized statistical models
9. Distributed Representations of Words and Phrases and their Compositionality
10. Efficient Estimation of Word Representations in Vector Space
11. GloVe: Global Vectors for Word Representation
12. Neural probabilistic language models
13. Natural language processing (almost) from scratch
14. Learning word embeddings efficiently with noise-contrastive estimation
15. A scalable hierarchical distributed language model
16. Three new graphical models for statistical language modelling
17. Improving word representations via global context and multiple word prototypes
18. A Primer on Neural Network Models for Natural Language Processing
19. Joulin, Armand, et al. "Bag of tricks for efficient text classification." FAIR 2016
20. P. Bojanowski, E. Grave, A. Joulin, T. Mikolov. "Enriching Word Vectors with Subword Information."

https://github.com/oxford-cs-deepnlp-2017/lectures

wget http://nlp.stanford.edu/data/glove.6B.zip

wget https://s3-us-west-1.amazonaws.com/fasttext-vectors/wiki.zh.zip

wget https://s3-us-west-1.amazonaws.com/fasttext-vectors/wiki.en.zip

4. Applications of natural language processing

Natural Language Processing

Topic Classification

Topic modeling

Sentiment Analysis

Chatbots / dialogue systems

Natural language query understanding (Google Now, Apple Siri, Amazon Alexa)

Summarization


