Latent Semantic Analysis (Python)
(2015-05-13 16:10:06)
Tags: python
gensim documentation – Corpora and Vector Spaces
http://cloga.info/python/2014/01/27/Gensim_Corpora_and_Vector_Spaces/
How to compute the similarity of two documents (Part 2)
http://www.52nlp.cn/如何计算两个文档的相似度二
Experiments on the English Wikipedia
http://radimrehurek.com/gensim/wiki.html#latent-dirichlet-allocation
Topic modeling with gensim
gensim documentation – Topics and Transformations
A typical cleanDoc lowercases and tokenizes a document, drops stopwords and very short tokens, and stems what remains (this sketch assumes import nltk, from nltk.corpus import stopwords and from nltk.tokenize import WordPunctTokenizer):
>>> def cleanDoc(doc):
...     stopset = set(stopwords.words('english'))
...     stemmer = nltk.PorterStemmer()
...     tokens = WordPunctTokenizer().tokenize(doc)
...     clean = [token.lower() for token in tokens if token.lower() not in stopset and len(token) > 2]
...     final = [stemmer.stem(word) for word in clean]
...     return final
>>> from gensim import corpora
>>> dictionary = corpora.Dictionary(line.lower().split() for line in open('corpus.txt'))
>>> from nltk.corpus import stopwords
>>> stop = stopwords.words('english')
>>> sentence = "this is a foo bar sentence"
>>> print [i for i in sentence.split() if i not in stop]
['foo', 'bar', 'sentence']
[hd@localhost ~]$ from gensim import corpora, models, similarities
bash: from: command not found...
[hd@localhost ~]$ python
Python 2.7.5 (default, Apr 10 2015, 08:09:05)
[GCC 4.8.3 20140911 (Red Hat 4.8.3-7)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from gensim import corpora, models, similarities
>>> import logging
>>> logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
>>> documents = ["Shipment of gold damaged in a fire",
... "Delivery of silver arrived in a silver truck",
... "Shipment of gold arrived in a truck"]
>>> texts = [[word for word in document.lower().split()] for document in documents]
>>> print texts
[['shipment', 'of', 'gold', 'damaged', 'in', 'a', 'fire'], ['delivery', 'of', 'silver', 'arrived', 'in', 'a', 'silver', 'truck'], ['shipment', 'of', 'gold', 'arrived', 'in', 'a', 'truck']]
>>> dictionary = corpora.Dictionary(texts)
2015-05-13 13:33:39,048 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2015-05-13 13:33:39,049 : INFO : built Dictionary(11 unique tokens: [u'a', u'damaged', u'gold', u'fire', u'of']...) from 3 documents (total 22 corpus positions)
>>> print dictionary
Dictionary(11 unique tokens: [u'a', u'damaged', u'gold', u'fire', u'of']...)
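The integer ids behind these tokens (0 through 10) can be inspected with the dictionary's token2id mapping, which explains the (id, count) pairs printed for the corpus below:
>>> print dictionary.token2id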
>>> corpus = [dictionary.doc2bow(text) for text in texts]
>>> print corpus
[[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1)], [(0, 1), (4, 1), (5, 1), (7, 1), (8, 1), (9, 2), (10, 1)], [(0, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1), (10, 1)]]
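As in the gensim tutorial, this bag-of-words corpus can be serialized to disk in Matrix Market format and streamed back later (the path here is arbitrary):
>>> corpora.MmCorpus.serialize('/tmp/corpus.mm', corpus)
>>> corpus = corpora.MmCorpus('/tmp/corpus.mm')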
>>> tfidf = models.TfidfModel(corpus)
2015-05-13 13:53:00,075 : INFO : collecting document frequencies
2015-05-13 13:53:00,076 : INFO : PROGRESS: processing document #0
2015-05-13 13:53:00,076 : INFO : calculating IDF weights for 3 documents and 10 features (21 matrix non-zeros)
>>> corpus_tfidf = tfidf[corpus]
>>> for doc in corpus_tfidf:
...     print doc
...
[(1, 0.6633689723434505), (2, 0.6633689723434505), (3, 0.2448297500958463), (6, 0.2448297500958463)]
[(7, 0.16073253746956623), (8, 0.4355066251613605), (9, 0.871013250322721), (10, 0.16073253746956623)]
[(3, 0.5), (6, 0.5), (7, 0.5), (10, 0.5)]
>>> print tfidf.dfs
{0: 3, 1: 1, 2: 1, 3: 2, 4: 3, 5: 3, 6: 2, 7: 2, 8: 1, 9: 1, 10: 2}
>>> print tfidf.idfs
{0: 0.0, 1: 1.5849625007211563, 2: 1.5849625007211563, 3: 0.5849625007211562, 4: 0.0, 5: 0.0, 6: 0.5849625007211562, 7: 0.5849625007211562, 8: 1.5849625007211563, 9: 1.5849625007211563, 10: 0.5849625007211562}
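These values follow gensim's default IDF weighting, idf(t) = log2(N / df(t)) with N = 3 documents: a word occurring in one document scores log2(3) ≈ 1.585, in two documents log2(3/2) ≈ 0.585, and words such as "a", "of" and "in" that appear in all three score 0, which is why they drop out of the tf-idf vectors above. A quick check:
>>> from math import log
>>> log(3.0 / 1, 2), log(3.0 / 2, 2), log(3.0 / 3, 2)
(1.584962500721156, 0.5849625007211562, 0.0)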
>>> lsi = models.LsiModel(corpus_tfidf, id2word=dictionary, num_topics=2)
2015-05-13 13:59:59,665 : INFO : using serial LSI version on this node
2015-05-13 13:59:59,665 : INFO : updating model with new documents
2015-05-13 13:59:59,666 : INFO : preparing a new chunk of documents
/usr/lib64/python2.7/site-packages/numpy/core/fromnumeric.py:2507: VisibleDeprecationWarning
2015-05-13 13:59:59,668 : INFO : using 100 extra samples and 2 power iterations
2015-05-13 13:59:59,669 : INFO : 1st phase: constructing (11, 102) action matrix
2015-05-13 13:59:59,669 : INFO : orthonormalizing (11, 102) action matrix
2015-05-13 13:59:59,695 : INFO : 2nd phase: running dense svd on (11, 3) matrix
2015-05-13 13:59:59,703 : INFO : computing the final decomposition
2015-05-13 13:59:59,704 : INFO : keeping 2 factors (discarding 23.571% of energy spectrum)
2015-05-13 13:59:59,704 : INFO : processed documents up to #3
2015-05-13 13:59:59,704 : INFO : topic #0(1.137): 0.438*"shipment" + 0.438*"gold" + 0.366*"truck" + 0.366*"arrived" + 0.345*"fire" + 0.345*"damaged" + 0.297*"silver" + 0.149*"delivery" + 0.000*"a" + -0.000*"in"
2015-05-13 13:59:59,704 : INFO : topic #1(1.000): 0.728*"silver" + 0.364*"delivery" + -0.364*"fire" + -0.364*"damaged" + 0.134*"truck" + 0.134*"arrived" + -0.134*"shipment" + -0.134*"gold" + -0.000*"a" + 0.000*"in"
>>> lsi.print_topics(2)
2015-05-13 14:01:14,112 : INFO : topic #0(1.137): 0.438*"shipment" + 0.438*"gold" + 0.366*"truck" + 0.366*"arrived" + 0.345*"fire" + 0.345*"damaged" + 0.297*"silver" + 0.149*"delivery" + -0.000*"a" + -0.000*"of"
2015-05-13 14:01:14,112 : INFO : topic #1(1.000): 0.728*"silver" + -0.364*"fire" + -0.364*"damaged" + 0.364*"delivery" + -0.134*"shipment" + -0.134*"gold" + 0.134*"truck" + 0.134*"arrived" + 0.000*"a" + 0.000*"of"
[u'0.438*"shipment" + 0.438*"gold" + 0.366*"truck" + 0.366*"arrived" + 0.345*"fire" + 0.345*"damaged" + 0.297*"silver" + 0.149*"delivery" + -0.000*"a" + -0.000*"of"', u'0.728*"silver" + -0.364*"fire" + -0.364*"damaged" + 0.364*"delivery" + -0.134*"shipment" + -0.134*"gold" + 0.134*"truck" + 0.134*"arrived" + 0.000*"a" + 0.000*"of"']
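The trained model can be persisted and reloaded with gensim's standard save/load (the path is arbitrary):
>>> lsi.save('/tmp/model.lsi')
>>> lsi = models.LsiModel.load('/tmp/model.lsi')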
>>> corpus_lsi = lsi[corpus_tfidf]
>>> for doc in corpus_lsi:
...     print doc
...
[(0, 0.6721146880987855), (1, -0.54880682119356061)]
[(0, 0.44124825208697899), (1, 0.83594920480338941)]
[(0, 0.80401378963792736)]
>>> lda = models.LdaModel(corpus_tfidf, id2word=dictionary, num_topics=2)
2015-05-13 14:08:59,068 : INFO : using symmetric alpha at 0.5
2015-05-13 14:08:59,068 : INFO : using serial LDA version on this node
2015-05-13 14:08:59,068 : INFO : running online LDA training, 2 topics, 1 passes over the supplied corpus of 3 documents, updating model once every 3 documents, evaluating perplexity every 3 documents, iterating 50x with a convergence threshold of 0.001000
2015-05-13 14:08:59,069 : WARNING : too few updates, training might not converge; consider increasing the number of passes or iterations to improve accuracy
2015-05-13 14:08:59,086 : INFO : -4.149 per-word bound, 17.7 perplexity estimate based on a held-out corpus of 3 documents with 5 words
2015-05-13 14:08:59,086 : INFO : PROGRESS: pass 0, at document #3/3
2015-05-13 14:08:59,094 : INFO : topic #0 (0.500): 0.148*silver + 0.105*delivery + 0.102*truck + 0.099*arrived + 0.094*gold + 0.094*shipment + 0.086*fire + 0.082*damaged + 0.063*of + 0.063*in
2015-05-13 14:08:59,095 : INFO : topic #1 (0.500): 0.119*damaged + 0.118*shipment + 0.118*gold + 0.115*fire + 0.103*arrived + 0.100*truck + 0.082*silver + 0.071*delivery + 0.059*of + 0.059*in
2015-05-13 14:08:59,095 : INFO : topic diff=0.556293, rho=1.000000
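As the warning says, a single pass over three tiny documents gives too few updates for online LDA to converge; the LdaModel passes keyword raises the number of sweeps over the corpus, e.g.:
>>> lda = models.LdaModel(corpus_tfidf, id2word=dictionary, num_topics=2, passes=20)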
>>> corpus_lda = lda[corpus_tfidf]
>>> for doc in corpus_lda:
...     print doc
...
[(0, 0.22307833898252546), (1, 0.77692166101747462)]
[(0, 0.76326581247359115), (1, 0.23673418752640893)]
[(0, 0.26091325840303231), (1, 0.73908674159696774)]
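Unlike the unconstrained LSI coordinates above, each LDA row is a probability distribution over the two topics (the two weights in each line sum to 1), so the first document belongs to topic 1 with probability about 0.78.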
>>> index = similarities.MatrixSimilarity(lsi[corpus])
2015-05-13 14:13:26,605 : WARNING : scanning corpus to determine the number of features (consider setting `num_features` explicitly)
2015-05-13 14:13:26,606 : INFO : creating matrix for 3 documents and 2 features
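The warning can be avoided by passing the vector dimensionality explicitly, which for this index is just the number of LSI topics:
>>> index = similarities.MatrixSimilarity(lsi[corpus], num_features=2)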
>>> query = "gold silver truck"
>>> query_bow = dictionary.doc2bow(query.lower().split())
>>> print query_bow
[(3, 1), (9, 1), (10, 1)]
>>> query_lsi = lsi[query_bow]
>>> print query_lsi
[(0, 1.1012835748628489), (1, 0.72812283398049393)]
>>> sims = index[query_lsi]
>>> print list(enumerate(sims))
[(0, 0.40757114), (1, 0.93163693), (2, 0.83416492)]
>>> sort_sims = sorted(enumerate(sims), key=lambda item: -item[1])
>>> print sort_sims
[(1, 0.93163693), (2, 0.83416492), (0, 0.40757114)]
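Mapping the ranked indices back into the documents list makes the result readable: the best match for "gold silver truck" is document 1, "Delivery of silver arrived in a silver truck" (0.93), followed by "Shipment of gold arrived in a truck" (0.83) and the fire document (0.41):
>>> for idx, score in sort_sims:
...     print documents[idx], score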
>>> courses = [line.strip() for line in file('coursera_corpus')]
>>> courses_name = [course.split('\t')[0] for course in courses]
>>> print courses_name[0:10]
['Writing II: Rhetorical Composing', 'Genetics and Society: A Course for Educators', 'General Game Playing', 'Genes and the Human Condition (From Behavior to Biotechnology)', 'A Brief History of Humankind', 'New Models of Business in Society', 'Analyse Num\xc3\xa9rique pour Ing\xc3\xa9nieurs', 'Evolution: A Course for Educators', 'Coding the Matrix: Linear Algebra through Computer Science Applications', 'The Dynamic Earth: A Course for Educators']
>>> from nltk.corpus import brown
>>> brown.words()[0:10]
[u'The', u'Fulton', u'County', u'Grand', u'Jury', u'said', u'Friday', u'an', u'investigation', u'of']
>>> brown.tagged_words()[0:10]
[(u'The', u'AT'), (u'Fulton', u'NP-TL'), (u'County', u'NN-TL'), (u'Grand', u'JJ-TL'), (u'Jury', u'NN-TL'), (u'said', u'VBD'), (u'Friday', u'NR'), (u'an', u'AT'), (u'investigation', u'NN'), (u'of', u'IN')]
>>> texts_lower = [[word for word in document.lower().split()] for document in courses]
>>> print texts_lower[0]
['writing', 'ii:', 'rhetorical', 'composing', 'rhetorical', 'composing', 'engages', 'you', 'in', 'a', 'series', 'of', 'interactive', 'reading,', 'research,', 'and', 'composing', 'activities', 'along', 'with', 'assignments', 'designed', 'to', 'help', 'you', 'become', 'more', 'effective', 'consumers', 'and', 'producers', 'of', 'alphabetic,', 'visual', 'and', 'multimodal', 'texts.', 'join', 'us', 'to', 'become', 'more', 'effective', 'writers...', 'and', 'better', 'citizens.', 'rhetorical', 'composing', 'is', 'a', 'course', 'where', 'writers', 'exchange', 'words,', 'ideas,', 'talents,', 'and', 'support.', 'you', 'will', 'be', 'introduced', 'to', 'a', 'variety', 'of', 'rhetorical', 'concepts\xe2\x80\x94that', 'is,', 'ideas', 'and', 'techniques', 'to', 'inform', 'and', 'persuade', 'audiences\xe2\x80\x94that', 'will', 'help', 'you', 'become', 'a', 'more', 'effective', 'consumer', 'and', 'producer', 'of', 'written,', 'visual,', 'and', 'multimodal', 'texts.', 'the', 'class', 'includes', 'short', 'videos,', 'demonstrations,', 'and', 'activities.', 'we', 'envision', 'rhetorical', 'composing', 'as', 'a', 'learning', 'community', 'that', 'includes', 'both', 'those', 'enrolled', 'in', 'this', 'course', 'and', 'the', 'instructors.', 'we', 'bring', 'our', 'expertise', 'in', 'writing,', 'rhetoric', 'and', 'course', 'design,', 'and', 'we', 'have', 'designed', 'the', 'assignments', 'and', 'course', 'infrastructure', 'to', 'help', 'you', 'share', 'your', 'experiences', 'as', 'writers,', 'students,', 'and', 'professionals', 'with', 'each', 'other', 'and', 'with', 'us.', 'these', 'collaborations', 'are', 'facilitated', 'through', 'wex,', 'the', 'writers', 'exchange,', 'a', 'place', 'where', 'you', 'will', 'exchange', 'your', 'work', 'and', 'feedback']
>>> from nltk.tokenize import word_tokenize
>>> texts_tokenized = [[word.lower() for word in word_tokenize(document.decode('utf-8'))] for document in courses]
>>> print texts_tokenized[0]
[u'writing', u'ii', u':', u'rhetorical', u'composing', u'rhetorical', u'composing', u'engages', u'you', u'in', u'a', u'series', u'of', u'interactive', u'reading', u',', u'research', u',', u'and', u'composing', u'activities', u'along', u'with', u'assignments', u'designed', u'to', u'help', u'you', u'become', u'more', u'effective', u'consumers', u'and', u'producers', u'of', u'alphabetic', u',', u'visual', u'and', u'multimodal', u'texts', u'.', u'join', u'us', u'to', u'become', u'more', u'effective', u'writers', u'...', u'and', u'better', u'citizens', u'.', u'rhetorical', u'composing', u'is', u'a', u'course', u'where', u'writers', u'exchange', u'words', u',', u'ideas', u',', u'talents', u',', u'and', u'support', u'.', u'you', u'will', u'be', u'introduced', u'to', u'a', u'variety', u'of', u'rhetorical', u'concepts\u2014that', u'is', u',', u'ideas', u'and', u'techniques', u'to', u'inform', u'and', u'persuade', u'audiences\u2014that', u'will', u'help', u'you', u'become', u'a', u'more', u'effective', u'consumer', u'and', u'producer', u'of', u'written', u',', u'visual', u',', u'and', u'multimodal', u'texts', u'.', u'the', u'class', u'includes', u'short', u'videos', u',', u'demonstrations', u',', u'and', u'activities', u'.', u'we', u'envision', u'rhetorical', u'composing', u'as', u'a', u'learning', u'community', u'that', u'includes', u'both', u'those', u'enrolled', u'in', u'this', u'course', u'and', u'the', u'instructors', u'.', u'we', u'bring', u'our', u'expertise', u'in', u'writing', u',', u'rhetoric', u'and', u'course', u'design', u',', u'and', u'we', u'have', u'designed', u'the', u'assignments', u'and', u'course', u'infrastructure', u'to', u'help', u'you', u'share', u'your', u'experiences', u'as', u'writers', u',', u'students', u',', u'and', u'professionals', u'with', u'each', u'other', u'and', u'with', u'us', u'.', u'these', u'collaborations', u'are', u'facilitated', u'through', u'wex', u',', u'the', u'writers', u'exchange', u',', u'a', u'place', u'where', u'you', u'will', u'exchange', u'your', u'work', u'and', u'feedback']
>>> from nltk.corpus import stopwords
>>> english_stopwords = stopwords.words('english')
>>> print english_stopwords
[u'i', u'me', u'my', u'myself', u'we', u'our', u'ours', u'ourselves', u'you', u'your', u'yours', u'yourself', u'yourselves', u'he', u'him', u'his', u'himself', u'she', u'her', u'hers', u'herself', u'it', u'its', u'itself', u'they', u'them', u'their', u'theirs', u'themselves', u'what', u'which', u'who', u'whom', u'this', u'that', u'these', u'those', u'am', u'is', u'are', u'was', u'were', u'be', u'been', u'being', u'have', u'has', u'had', u'having', u'do', u'does', u'did', u'doing', u'a', u'an', u'the', u'and', u'but', u'if', u'or', u'because', u'as', u'until', u'while', u'of', u'at', u'by', u'for', u'with', u'about', u'against', u'between', u'into', u'through', u'during', u'before', u'after', u'above', u'below', u'to', u'from', u'up', u'down', u'in', u'out', u'on', u'off', u'over', u'under', u'again', u'further', u'then', u'once', u'here', u'there', u'when', u'where', u'why', u'how', u'all', u'any', u'both', u'each', u'few', u'more', u'most', u'other', u'some', u'such', u'no', u'nor', u'not', u'only', u'own', u'same', u'so', u'than', u'too', u'very', u's', u't', u'can', u'will', u'just', u'don', u'should', u'now']
>>> texts_filtered_stopwords = [[word for word in document if not word in english_stopwords] for document in texts_tokenized]
>>> print texts_filtered_stopwords[0]
[u'writing', u'ii', u':', u'rhetorical', u'composing', u'rhetorical', u'composing', u'engages', u'series', u'interactive', u'reading', u',', u'research', u',', u'composing', u'activities', u'along', u'assignments', u'designed', u'help', u'become', u'effective', u'consumers', u'producers', u'alphabetic', u',', u'visual', u'multimodal', u'texts', u'.', u'join', u'us', u'become', u'effective', u'writers', u'...', u'better', u'citizens', u'.', u'rhetorical', u'composing', u'course', u'writers', u'exchange', u'words', u',', u'ideas', u',', u'talents', u',', u'support', u'.', u'introduced', u'variety', u'rhetorical', u'concepts\u2014that', u',', u'ideas', u'techniques', u'inform', u'persuade', u'audiences\u2014that', u'help', u'become', u'effective', u'consumer', u'producer', u'written', u',', u'visual', u',', u'multimodal', u'texts', u'.', u'class', u'includes', u'short', u'videos', u',', u'demonstrations', u',', u'activities', u'.', u'envision', u'rhetorical', u'composing', u'learning', u'community', u'includes', u'enrolled', u'course', u'instructors', u'.', u'bring', u'expertise', u'writing', u',', u'rhetoric', u'course', u'design', u',', u'designed', u'assignments', u'course', u'infrastructure', u'help', u'share', u'experiences', u'writers', u',', u'students', u',', u'professionals', u'us', u'.', u'collaborations', u'facilitated', u'wex', u',', u'writers', u'exchange', u',', u'place', u'exchange', u'work', u'feedback']
>>> english_punctuations = [',', '.', ':', ';', '?', '(', ')', '[', ']', '&', '!', '*', '@', '#', '$', '%']
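Following the 52nlp walkthrough, these punctuation tokens are filtered out exactly as the stopwords were, after which the courses corpus is ready for the same Dictionary / TfidfModel / LsiModel pipeline used on the toy documents above:
>>> texts_filtered = [[word for word in document if not word in english_punctuations] for document in texts_filtered_stopwords]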