潜在语义分析（python）

(2015-05-13 16:10:06)

标签：

python

gensim文档-语料库与向量空间

http://cloga.info/python/2014/01/27/Gensim_Corpora_and_Vector_Spaces/

如何计算两个文档的相似度（二）

http://www.52nlp.cn/如何计算两个文档的相似度二

Experiments on the English Wikipedia

http://radimrehurek.com/gensim/wiki.html#latent-dirichlet-allocation

gensim做主题模型

gensim文档-主题与转换

>>> def cleanDoc(doc):
...     stopset = set(stopwords.words('english'))
...     stemmer = nltk.PorterStemmer()
...     tokens = WordPunctTokenizer().tokenize(doc)
...     clean = [token.lower() for token in tokens if token.lower() not in stopset and len(token) > 2]
...     final = [stemmer.stem(word) for word in clean]
...     return final
...
>>> dictionary = corpora.Dictionary(line.lower().split() for line in open('corpus.txt'))
Traceback (most recent call last):
File "", line 1, in
NameError: name 'corpora' is not defined
>>> from nltk.corpus import stopwords
>>> sentence = "this is a foo bar sentence"
>>> print [i for i in sentence.split() if i not in stop]
Traceback (most recent call last):
File "", line 1, in
NameError: name 'stop' is not defined
>>> stop = stopwords.words('english')
>>> print [i for i in sentence.split() if i not in stop]
['foo', 'bar', 'sentence']
>>>
[hd@localhost ~]$ from gensim import corpora, models, similarities
bash: from: 未找到命令...
[hd@localhost ~]$ python
Python 2.7.5 (default, Apr 10 2015, 08:09:05)
[GCC 4.8.3 20140911 (Red Hat 4.8.3-7)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from gensim import corpora, models, similarities
>>> import logging
>>> logging.basicConfig(format=’%(asctime)s : %(levelname)s : %(message)s’, level=logging.INFO)
File "", line 1
    logging.basicConfig(format=’%(asctime)s : %(levelname)s : %(message)s’, level=logging.INFO)
                               ^
SyntaxError: invalid syntax
>>> logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
>>> documents = [“Shipment of gold damaged in a fire”,
File "", line 1
    documents = [“Shipment of gold damaged in a fire”,
                 ^
SyntaxError: invalid syntax
>>> … “Delivery of silver arrived in a silver truck”,
File "", line 1
    … “Delivery of silver arrived in a silver truck”,
    ^
SyntaxError: invalid syntax
>>> documents = ["Shipment of gold damaged in a fire",
... "Delivery of silver arrived in a silver truck",
... "Shipment of gold arrived in a truck"]
>>> texts = [[word for word in document.lower().split()] for document in documents]
>>> print texts
[['shipment', 'of', 'gold', 'damaged', 'in', 'a', 'fire'], ['delivery', 'of', 'silver', 'arrived', 'in', 'a', 'silver', 'truck'], ['shipment', 'of', 'gold', 'arrived', 'in', 'a', 'truck']]
>>> dictionary = corpora.Dictionary(texts)
2015-05-13 13:33:39,048 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2015-05-13 13:33:39,049 : INFO : built Dictionary(11 unique tokens: [u'a', u'damaged', u'gold', u'fire', u'of']...) from 3 documents (total 22 corpus positions)
>>> print dictionary
Dictionary(11 unique tokens: [u'a', u'damaged', u'gold', u'fire', u'of']...)
>>> corpus = [dictionary.doc2bow(text) for text in texts]
>>> print corpus
[[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1)], [(0, 1), (4, 1), (5, 1), (7, 1), (8, 1), (9, 2), (10, 1)], [(0, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1), (10, 1)]]
>>> tfidf = models.TfidfModel(corpus)
2015-05-13 13:53:00,075 : INFO : collecting document frequencies
2015-05-13 13:53:00,076 : INFO : PROGRESS: processing document #0
2015-05-13 13:53:00,076 : INFO : calculating IDF weights for 3 documents and 10 features (21 matrix non-zeros)
>>> corpus_tfidf = tfidf[corpus]
>>> for doc in corpus_tfidf:
... print doc
File "", line 2
    print doc
        ^
IndentationError: expected an indented block
>>> for doc in corpus_tfidf:
...     print doc
...
[(1, 0.6633689723434505), (2, 0.6633689723434505), (3, 0.2448297500958463), (6, 0.2448297500958463)]
[(7, 0.16073253746956623), (8, 0.4355066251613605), (9, 0.871013250322721), (10, 0.16073253746956623)]
[(3, 0.5), (6, 0.5), (7, 0.5), (10, 0.5)]
>>> print tfidf.dfs
File "", line 1
    print tfidf.dfs
    ^
IndentationError: unexpected indent
>>> print tfidf.dfs
{0: 3, 1: 1, 2: 1, 3: 2, 4: 3, 5: 3, 6: 2, 7: 2, 8: 1, 9: 1, 10: 2}
>>> print tfidf.idfs
{0: 0.0, 1: 1.5849625007211563, 2: 1.5849625007211563, 3: 0.5849625007211562, 4: 0.0, 5: 0.0, 6: 0.5849625007211562, 7: 0.5849625007211562, 8: 1.5849625007211563, 9: 1.5849625007211563, 10: 0.5849625007211562}
>>> lsi = models.LsiModel(corpus_tfidf, id2word=dictionary, num_topics=2)
2015-05-13 13:59:59,665 : INFO : using serial LSI version on this node
2015-05-13 13:59:59,665 : INFO : updating model with new documents
2015-05-13 13:59:59,666 : INFO : preparing a new chunk of documents
/usr/lib64/python2.7/site-packages/numpy/core/fromnumeric.py:2507: VisibleDeprecationWarning: `rank` is deprecated; use the `ndim` attribute or function instead. To find the rank of a matrix see `numpy.linalg.matrix_rank`.
VisibleDeprecationWarning)
2015-05-13 13:59:59,668 : INFO : using 100 extra samples and 2 power iterations
2015-05-13 13:59:59,669 : INFO : 1st phase: constructing (11, 102) action matrix
2015-05-13 13:59:59,669 : INFO : orthonormalizing (11, 102) action matrix
2015-05-13 13:59:59,695 : INFO : 2nd phase: running dense svd on (11, 3) matrix
2015-05-13 13:59:59,703 : INFO : computing the final decomposition
2015-05-13 13:59:59,704 : INFO : keeping 2 factors (discarding 23.571% of energy spectrum)
2015-05-13 13:59:59,704 : INFO : processed documents up to #3
2015-05-13 13:59:59,704 : INFO : topic #0(1.137): 0.438*"shipment" + 0.438*"gold" + 0.366*"truck" + 0.366*"arrived" + 0.345*"fire" + 0.345*"damaged" + 0.297*"silver" + 0.149*"delivery" + 0.000*"a" + -0.000*"in"
2015-05-13 13:59:59,704 : INFO : topic #1(1.000): 0.728*"silver" + 0.364*"delivery" + -0.364*"fire" + -0.364*"damaged" + 0.134*"truck" + 0.134*"arrived" + -0.134*"shipment" + -0.134*"gold" + -0.000*"a" + 0.000*"in"
>>> lda.print_topics(2)
Traceback (most recent call last):
File "", line 1, in
NameError: name 'lda' is not defined
>>> lsi = models.LsiModel(corpus_tfidf, id2word=dictionary, num_topics=2)
2015-05-13 14:01:05,888 : INFO : using serial LSI version on this node
2015-05-13 14:01:05,888 : INFO : updating model with new documents
2015-05-13 14:01:05,888 : INFO : preparing a new chunk of documents
2015-05-13 14:01:05,889 : INFO : using 100 extra samples and 2 power iterations
2015-05-13 14:01:05,889 : INFO : 1st phase: constructing (11, 102) action matrix
2015-05-13 14:01:05,889 : INFO : orthonormalizing (11, 102) action matrix
2015-05-13 14:01:05,891 : INFO : 2nd phase: running dense svd on (11, 3) matrix
2015-05-13 14:01:05,892 : INFO : computing the final decomposition
2015-05-13 14:01:05,892 : INFO : keeping 2 factors (discarding 23.571% of energy spectrum)
2015-05-13 14:01:05,892 : INFO : processed documents up to #3
2015-05-13 14:01:05,893 : INFO : topic #0(1.137): 0.438*"shipment" + 0.438*"gold" + 0.366*"truck" + 0.366*"arrived" + 0.345*"fire" + 0.345*"damaged" + 0.297*"silver" + 0.149*"delivery" + -0.000*"a" + -0.000*"of"
2015-05-13 14:01:05,893 : INFO : topic #1(1.000): 0.728*"silver" + -0.364*"fire" + -0.364*"damaged" + 0.364*"delivery" + -0.134*"shipment" + -0.134*"gold" + 0.134*"truck" + 0.134*"arrived" + 0.000*"a" + 0.000*"of"
>>> lsi.print_topics(2)
2015-05-13 14:01:14,112 : INFO : topic #0(1.137): 0.438*"shipment" + 0.438*"gold" + 0.366*"truck" + 0.366*"arrived" + 0.345*"fire" + 0.345*"damaged" + 0.297*"silver" + 0.149*"delivery" + -0.000*"a" + -0.000*"of"
2015-05-13 14:01:14,112 : INFO : topic #1(1.000): 0.728*"silver" + -0.364*"fire" + -0.364*"damaged" + 0.364*"delivery" + -0.134*"shipment" + -0.134*"gold" + 0.134*"truck" + 0.134*"arrived" + 0.000*"a" + 0.000*"of"
[u'0.438*"shipment" + 0.438*"gold" + 0.366*"truck" + 0.366*"arrived" + 0.345*"fire" + 0.345*"damaged" + 0.297*"silver" + 0.149*"delivery" + -0.000*"a" + -0.000*"of"', u'0.728*"silver" + -0.364*"fire" + -0.364*"damaged" + 0.364*"delivery" + -0.134*"shipment" + -0.134*"gold" + 0.134*"truck" + 0.134*"arrived" + 0.000*"a" + 0.000*"of"']
>>> lda = models.LdaModel(copurs_tfidf, id2word=dictionary, num_topics=2)
Traceback (most recent call last):
File "", line 1, in
NameError: name 'copurs_tfidf' is not defined
>>> corpus_lsi = lsi[corpus_tfidf]
>>> for doc in corpus_lsi:
... print doc
File "", line 2
    print doc
        ^
IndentationError: expected an indented block
>>> for doc in corpus_lsi:
...      print doc
...
[(0, 0.6721146880987855), (1, -0.54880682119356061)]
[(0, 0.44124825208697899), (1, 0.83594920480338941)]
[(0, 0.80401378963792736)]
>>> lda = models.LdaModel(copurs_tfidf, id2word=dictionary, num_topics=2)
Traceback (most recent call last):
File "", line 1, in
NameError: name 'copurs_tfidf' is not defined
>>> lda = models.LdaModel(copus_tfidf, id2word=dictionary, num_topics=2)
Traceback (most recent call last):
File "", line 1, in
NameError: name 'copus_tfidf' is not defined
>>> lda = models.LdaModel(corpus_tfidf, id2word=dictionary, num_topics=2)
2015-05-13 14:08:59,068 : INFO : using symmetric alpha at 0.5
2015-05-13 14:08:59,068 : INFO : using serial LDA version on this node
2015-05-13 14:08:59,068 : INFO : running online LDA training, 2 topics, 1 passes over the supplied corpus of 3 documents, updating model once every 3 documents, evaluating perplexity every 3 documents, iterating 50x with a convergence threshold of 0.001000
2015-05-13 14:08:59,069 : WARNING : too few updates, training might not converge; consider increasing the number of passes or iterations to improve accuracy
2015-05-13 14:08:59,086 : INFO : -4.149 per-word bound, 17.7 perplexity estimate based on a held-out corpus of 3 documents with 5 words
2015-05-13 14:08:59,086 : INFO : PROGRESS: pass 0, at document #3/3
2015-05-13 14:08:59,094 : INFO : topic #0 (0.500): 0.148*silver + 0.105*delivery + 0.102*truck + 0.099*arrived + 0.094*gold + 0.094*shipment + 0.086*fire + 0.082*damaged + 0.063*of + 0.063*in
2015-05-13 14:08:59,095 : INFO : topic #1 (0.500): 0.119*damaged + 0.118*shipment + 0.118*gold + 0.115*fire + 0.103*arrived + 0.100*truck + 0.082*silver + 0.071*delivery + 0.059*of + 0.059*in
2015-05-13 14:08:59,095 : INFO : topic diff=0.556293, rho=1.000000
>>> for doc in corpus_lda:
...      print doc
...
Traceback (most recent call last):
File "", line 1, in
NameError: name 'corpus_lda' is not defined
>>> corpus_lda = lda[corpus_tfidf]
>>> for doc in corpus_lda:
...      print doc
...
[(0, 0.22307833898252546), (1, 0.77692166101747462)]
[(0, 0.76326581247359115), (1, 0.23673418752640893)]
[(0, 0.26091325840303231), (1, 0.73908674159696774)]
>>> index = similarities.MatrixSimilarity(lsi[corpus])
2015-05-13 14:13:26,605 : WARNING : scanning corpus to determine the number of features (consider setting `num_features` explicitly)
2015-05-13 14:13:26,606 : INFO : creating matrix for 3 documents and 2 features
>>> query = “gold silver truck”
File "", line 1
    query = “gold silver truck”
            ^
SyntaxError: invalid syntax
>>> query = "gold silver truck"
>>> query_bow = dictionary.doc2bow(query.lower().split())
>>> print query_bow
[(3, 1), (9, 1), (10, 1)]
>>> query_lsi = lsi[query_bow]
>>> print query_lsi
[(0, 1.1012835748628489), (1, 0.72812283398049393)]
>>> sims = index[query_lsi]
>>> print list(enumerate(sims))
[(0, 0.40757114), (1, 0.93163693), (2, 0.83416492)]
>>> sort_sims = sorted(enumerate(sims), key=lambda item: -item[1])
>>> print sort_sims
[(1, 0.93163693), (2, 0.83416492), (0, 0.40757114)]
>>> courses = [line.strip() for line in file(‘coursera_corpus’)]
File "", line 1
    courses = [line.strip() for line in file(‘coursera_corpus’)]
                                             ^
SyntaxError: invalid syntax
>>> courses = [line.strip() for line in file('coursera_corpus')]
>>> courses_name = [course.split(‘\t’)[0] for course in courses]
File "", line 1
    courses_name = [course.split(‘\t’)[0] for course in courses]
                                 ^
SyntaxError: invalid syntax
>>> courses_name = [course.split('\t')[0] for course in courses]
>>> print courses_name[0:10]
['Writing II: Rhetorical Composing', 'Genetics and Society: A Course for Educators', 'General Game Playing', 'Genes and the Human Condition (From Behavior to Biotechnology)', 'A Brief History of Humankind', 'New Models of Business in Society', 'Analyse Num\xc3\xa9rique pour Ing\xc3\xa9nieurs', 'Evolution: A Course for Educators', 'Coding the Matrix: Linear Algebra through Computer Science Applications', 'The Dynamic Earth: A Course for Educators']
>>> from nltk.corpus import brown
>>> brown.words()[0:10]
[u'The', u'Fulton', u'County', u'Grand', u'Jury', u'said', u'Friday', u'an', u'investigation', u'of']
>>> brown.tagged_words()[0:10]
[(u'The', u'AT'), (u'Fulton', u'NP-TL'), (u'County', u'NN-TL'), (u'Grand', u'JJ-TL'), (u'Jury', u'NN-TL'), (u'said', u'VBD'), (u'Friday', u'NR'), (u'an', u'AT'), (u'investigation', u'NN'), (u'of', u'IN')]
>>> texts_lower = [[word for word in document.lower().split()] for document in courses]
>>> print texts_lower[0]
['writing', 'ii:', 'rhetorical', 'composing', 'rhetorical', 'composing', 'engages', 'you', 'in', 'a', 'series', 'of', 'interactive', 'reading,', 'research,', 'and', 'composing', 'activities', 'along', 'with', 'assignments', 'designed', 'to', 'help', 'you', 'become', 'more', 'effective', 'consumers', 'and', 'producers', 'of', 'alphabetic,', 'visual', 'and', 'multimodal', 'texts.', 'join', 'us', 'to', 'become', 'more', 'effective', 'writers...', 'and', 'better', 'citizens.', 'rhetorical', 'composing', 'is', 'a', 'course', 'where', 'writers', 'exchange', 'words,', 'ideas,', 'talents,', 'and', 'support.', 'you', 'will', 'be', 'introduced', 'to', 'a', 'variety', 'of', 'rhetorical', 'concepts\xe2\x80\x94that', 'is,', 'ideas', 'and', 'techniques', 'to', 'inform', 'and', 'persuade', 'audiences\xe2\x80\x94that', 'will', 'help', 'you', 'become', 'a', 'more', 'effective', 'consumer', 'and', 'producer', 'of', 'written,', 'visual,', 'and', 'multimodal', 'texts.', 'the', 'class', 'includes', 'short', 'videos,', 'demonstrations,', 'and', 'activities.', 'we', 'envision', 'rhetorical', 'composing', 'as', 'a', 'learning', 'community', 'that', 'includes', 'both', 'those', 'enrolled', 'in', 'this', 'course', 'and', 'the', 'instructors.', 'we', 'bring', 'our', 'expertise', 'in', 'writing,', 'rhetoric', 'and', 'course', 'design,', 'and', 'we', 'have', 'designed', 'the', 'assignments', 'and', 'course', 'infrastructure', 'to', 'help', 'you', 'share', 'your', 'experiences', 'as', 'writers,', 'students,', 'and', 'professionals', 'with', 'each', 'other', 'and', 'with', 'us.', 'these', 'collaborations', 'are', 'facilitated', 'through', 'wex,', 'the', 'writers', 'exchange,', 'a', 'place', 'where', 'you', 'will', 'exchange', 'your', 'work', 'and', 'feedback']
>>> from nltk.tokenize import word_tokenize
>>> texts_tokenized = [[word.lower() for word in word_tokenize(document.decode('utf-8'))] for document in courses]
>>> print texts_tokenized[0]
[u'writing', u'ii', u':', u'rhetorical', u'composing', u'rhetorical', u'composing', u'engages', u'you', u'in', u'a', u'series', u'of', u'interactive', u'reading', u',', u'research', u',', u'and', u'composing', u'activities', u'along', u'with', u'assignments', u'designed', u'to', u'help', u'you', u'become', u'more', u'effective', u'consumers', u'and', u'producers', u'of', u'alphabetic', u',', u'visual', u'and', u'multimodal', u'texts', u'.', u'join', u'us', u'to', u'become', u'more', u'effective', u'writers', u'...', u'and', u'better', u'citizens', u'.', u'rhetorical', u'composing', u'is', u'a', u'course', u'where', u'writers', u'exchange', u'words', u',', u'ideas', u',', u'talents', u',', u'and', u'support', u'.', u'you', u'will', u'be', u'introduced', u'to', u'a', u'variety', u'of', u'rhetorical', u'concepts\u2014that', u'is', u',', u'ideas', u'and', u'techniques', u'to', u'inform', u'and', u'persuade', u'audiences\u2014that', u'will', u'help', u'you', u'become', u'a', u'more', u'effective', u'consumer', u'and', u'producer', u'of', u'written', u',', u'visual', u',', u'and', u'multimodal', u'texts', u'.', u'the', u'class', u'includes', u'short', u'videos', u',', u'demonstrations', u',', u'and', u'activities', u'.', u'we', u'envision', u'rhetorical', u'composing', u'as', u'a', u'learning', u'community', u'that', u'includes', u'both', u'those', u'enrolled', u'in', u'this', u'course', u'and', u'the', u'instructors', u'.', u'we', u'bring', u'our', u'expertise', u'in', u'writing', u',', u'rhetoric', u'and', u'course', u'design', u',', u'and', u'we', u'have', u'designed', u'the', u'assignments', u'and', u'course', u'infrastructure', u'to', u'help', u'you', u'share', u'your', u'experiences', u'as', u'writers', u',', u'students', u',', u'and', u'professionals', u'with', u'each', u'other', u'and', u'with', u'us', u'.', u'these', u'collaborations', u'are', u'facilitated', u'through', u'wex', u',', u'the', u'writers', u'exchange', u',', u'a', u'place', u'where', u'you', u'will', u'exchange', u'your', u'work', u'and', u'feedback']
>>> from nltk.corpus import stopwords
>>> english_stopwords = stopwords.words('english')
>>> print english_stopwords
[u'i', u'me', u'my', u'myself', u'we', u'our', u'ours', u'ourselves', u'you', u'your', u'yours', u'yourself', u'yourselves', u'he', u'him', u'his', u'himself', u'she', u'her', u'hers', u'herself', u'it', u'its', u'itself', u'they', u'them', u'their', u'theirs', u'themselves', u'what', u'which', u'who', u'whom', u'this', u'that', u'these', u'those', u'am', u'is', u'are', u'was', u'were', u'be', u'been', u'being', u'have', u'has', u'had', u'having', u'do', u'does', u'did', u'doing', u'a', u'an', u'the', u'and', u'but', u'if', u'or', u'because', u'as', u'until', u'while', u'of', u'at', u'by', u'for', u'with', u'about', u'against', u'between', u'into', u'through', u'during', u'before', u'after', u'above', u'below', u'to', u'from', u'up', u'down', u'in', u'out', u'on', u'off', u'over', u'under', u'again', u'further', u'then', u'once', u'here', u'there', u'when', u'where', u'why', u'how', u'all', u'any', u'both', u'each', u'few', u'more', u'most', u'other', u'some', u'such', u'no', u'nor', u'not', u'only', u'own', u'same', u'so', u'than', u'too', u'very', u's', u't', u'can', u'will', u'just', u'don', u'should', u'now']
>>> texts_filtered_stopwords = [[word for word in document if not word in english_stopwords] for document in texts_tokenized]
>>> print texts_filtered_stopwords[0]
[u'writing', u'ii', u':', u'rhetorical', u'composing', u'rhetorical', u'composing', u'engages', u'series', u'interactive', u'reading', u',', u'research', u',', u'composing', u'activities', u'along', u'assignments', u'designed', u'help', u'become', u'effective', u'consumers', u'producers', u'alphabetic', u',', u'visual', u'multimodal', u'texts', u'.', u'join', u'us', u'become', u'effective', u'writers', u'...', u'better', u'citizens', u'.', u'rhetorical', u'composing', u'course', u'writers', u'exchange', u'words', u',', u'ideas', u',', u'talents', u',', u'support', u'.', u'introduced', u'variety', u'rhetorical', u'concepts\u2014that', u',', u'ideas', u'techniques', u'inform', u'persuade', u'audiences\u2014that', u'help', u'become', u'effective', u'consumer', u'producer', u'written', u',', u'visual', u',', u'multimodal', u'texts', u'.', u'class', u'includes', u'short', u'videos', u',', u'demonstrations', u',', u'activities', u'.', u'envision', u'rhetorical', u'composing', u'learning', u'community', u'includes', u'enrolled', u'course', u'instructors', u'.', u'bring', u'expertise', u'writing', u',', u'rhetoric', u'course', u'design', u',', u'designed', u'assignments', u'course', u'infrastructure', u'help', u'share', u'experiences', u'writers', u',', u'students', u',', u'professionals', u'us', u'.', u'collaborations', u'facilitated', u'wex', u',', u'writers', u'exchange', u',', u'place', u'exchange', u'work', u'feedback']
>>> english_punctuations = [',', '.', ':', ';', '?', '(', ')', '[', ']', '&', '!', '*','@','#','$','%']>>> texts_filtered = [[word for word in document if not word in english_punctuations] for document in texts_filtered_stopwords]
File "", line 1
    texts_filtered = [[word for word in document if not word in english_punctuations] for document in texts_filtered_stopwords]
    ^
>>> texts_filtered = [[word for word in document if not word in english_punctuations] for document in texts_filtered_stopwords]
>>> print texts_filtered[0]
[u'writing', u'ii', u'rhetorical', u'composing', u'rhetorical', u'composing', u'engages', u'series', u'interactive', u'reading', u'research', u'composing', u'activities', u'along', u'assignments', u'designed', u'help', u'become', u'effective', u'consumers', u'producers', u'alphabetic', u'visual', u'multimodal', u'texts', u'join', u'us', u'become', u'effective', u'writers', u'...', u'better', u'citizens', u'rhetorical', u'composing', u'course', u'writers', u'exchange', u'words', u'ideas', u'talents', u'support', u'introduced', u'variety', u'rhetorical', u'concepts\u2014that', u'ideas', u'techniques', u'inform', u'persuade', u'audiences\u2014that', u'help', u'become', u'effective', u'consumer', u'producer', u'written', u'visual', u'multimodal', u'texts', u'class', u'includes', u'short', u'videos', u'demonstrations', u'activities', u'envision', u'rhetorical', u'composing', u'learning', u'community', u'includes', u'enrolled', u'course', u'instructors', u'bring', u'expertise', u'writing', u'rhetoric', u'course', u'design', u'designed', u'assignments', u'course', u'infrastructure', u'help', u'share', u'experiences', u'writers', u'students', u'professionals', u'us', u'collaborations', u'facilitated', u'wex', u'writers', u'exchange', u'place', u'exchange', u'work', u'feedback']
>>> from nltk.stem.lancaster import LancasterStemmer
>>> st = LancasterStemmer()
>>> st.stem('stemmed')
'stem'
>>> texts_stemmed = [[st.stem(word) for word in docment] for docment in texts_filtered]
>>> print texts_stemmed[0]
[u'writ', u'ii', u'rhet', u'compos', u'rhet', u'compos', u'eng', u'sery', u'interact', u'read', u'research', u'compos', u'act', u'along', u'assign', u'design', u'help', u'becom', u'effect', u'consum', u'produc', u'alphabet', u'vis', u'multimod', u'text', u'join', u'us', u'becom', u'effect', u'writ', u'...', u'bet', u'cit', u'rhet', u'compos', u'cours', u'writ', u'exchang', u'word', u'idea', u'tal', u'support', u'introduc', u'vary', u'rhet', u'concepts\u2014that', u'idea', u'techn', u'inform', u'persuad', u'audiences\u2014that', u'help', u'becom', u'effect', u'consum', u'produc', u'writ', u'vis', u'multimod', u'text', u'class', u'includ', u'short', u'video', u'demonst', u'act', u'envid', u'rhet', u'compos', u'learn', u'commun', u'includ', u'enrol', u'cours', u'instruct', u'bring', u'expert', u'writ', u'rhet', u'cours', u'design', u'design', u'assign', u'cours', u'infrastruct', u'help', u'shar', u'expery', u'writ', u'stud', u'profess', u'us', u'collab', u'facilit', u'wex', u'writ', u'exchang', u'plac', u'exchang', u'work', u'feedback']
>>> all_stems = sum(texts_stemmed, [])
>>> stems_once = set(stem for stem in set(all_stems) if all_stems.count(stem) == 1)
>>> texts = [[stem for stem in text if stem not in stems_once] for text in texts_stemmed]
File "", line 1
    texts = [[stem for stem in text if stem not in stems_once] for text in texts_stemmed]
    ^
IndentationError: unexpected indent
>>> texts = [[stem for stem in text if stem not in stems_once] for text in texts_stemmed]
File "", line 1
    texts = [[stem for stem in text if stem not in stems_once] for text in texts_stemmed]
    ^
IndentationError: unexpected indent
>>> texts = [[stem for stem in text if stem not in stems_once] for text in texts_stemmed]
>>> print texts[0]
[u'writ', u'ii', u'rhet', u'compos', u'rhet', u'compos', u'eng', u'sery', u'interact', u'read', u'research', u'compos', u'act', u'along', u'assign', u'design', u'help', u'becom', u'effect', u'consum', u'produc', u'vis', u'multimod', u'text', u'join', u'us', u'becom', u'effect', u'writ', u'...', u'bet', u'cit', u'rhet', u'compos', u'cours', u'writ', u'exchang', u'word', u'idea', u'tal', u'support', u'introduc', u'vary', u'rhet', u'idea', u'techn', u'inform', u'help', u'becom', u'effect', u'consum', u'produc', u'writ', u'vis', u'multimod', u'text', u'class', u'includ', u'short', u'video', u'demonst', u'act', u'envid', u'rhet', u'compos', u'learn', u'commun', u'includ', u'enrol', u'cours', u'instruct', u'bring', u'expert', u'writ', u'rhet', u'cours', u'design', u'design', u'assign', u'cours', u'infrastruct', u'help', u'shar', u'expery', u'writ', u'stud', u'profess', u'us', u'collab', u'facilit', u'writ', u'exchang', u'plac', u'exchang', u'work', u'feedback']
>>> from gensim import corpora, models, similarities
>>> import logging
>>> logging.basicConfig(format=’%(asctime)s : %(levelname)s : %(message)s’, level=logging.INFO)
File "", line 1
    logging.basicConfig(format=’%(asctime)s : %(levelname)s : %(message)s’, level=logging.INFO)
                               ^
SyntaxError: invalid syntax
>>> logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
>>> dictionary = corpora.Dictionary(texts)
2015-05-13 14:56:54,085 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2015-05-13 14:56:54,217 : INFO : built Dictionary(3008 unique tokens: [u"d'ex\xe9cution", u'four', u'protest', u'circuitry', u'proficy']...) from 379 documents (total 46989 corpus positions)
>>> corpus = [dictionary.doc2bow(text) for text in texts]
>>> tfidf = models.TfidfModel(corpus)
2015-05-13 14:57:20,516 : INFO : collecting document frequencies
2015-05-13 14:57:20,517 : INFO : PROGRESS: processing document #0
2015-05-13 14:57:20,536 : INFO : calculating IDF weights for 379 documents and 3007 features (28717 matrix non-zeros)
>>> corpus_tfidf = tfidf[corpus]
>>> lsi = models.LsiModel(corpus_tfidf, id2word=dictionary, num_topics=10)
2015-05-13 15:00:19,528 : INFO : using serial LSI version on this node
2015-05-13 15:00:19,528 : INFO : updating model with new documents
2015-05-13 15:00:19,587 : INFO : preparing a new chunk of documents
2015-05-13 15:00:19,597 : INFO : using 100 extra samples and 2 power iterations
2015-05-13 15:00:19,597 : INFO : 1st phase: constructing (3008, 110) action matrix
2015-05-13 15:00:19,605 : INFO : orthonormalizing (3008, 110) action matrix
2015-05-13 15:00:19,834 : INFO : 2nd phase: running dense svd on (110, 379) matrix
2015-05-13 15:00:19,918 : INFO : computing the final decomposition
2015-05-13 15:00:19,919 : INFO : keeping 10 factors (discarding 72.559% of energy spectrum)
2015-05-13 15:00:19,923 : INFO : processed documents up to #379
2015-05-13 15:00:19,926 : INFO : topic #0(3.798): -0.232*"teach" + -0.153*"nbsp" + -0.138*"program" + -0.121*"learn" + -0.117*"mus" + -0.105*"heal" + -0.105*"comput" + -0.100*"network" + -0.096*"stud" + -0.094*"design"
2015-05-13 15:00:19,927 : INFO : topic #1(2.720): 0.600*"de" + 0.371*"la" + 0.218*"y" + 0.210*"en" + 0.161*"los" + 0.160*"à" + 0.148*"que" + 0.143*"cour" + 0.137*"un" + 0.127*"les"
2015-05-13 15:00:19,928 : INFO : topic #2(2.489): -0.613*"teach" + -0.207*"portfolio" + -0.193*"assist" + 0.171*"mus" + -0.134*"improv" + -0.120*"learn" + -0.118*"undertak" + -0.117*"strongly" + -0.112*"found" + 0.111*"network"
2015-05-13 15:00:19,929 : INFO : topic #3(2.343): 0.811*"mus" + 0.170*"sound" + -0.167*"heal" + 0.095*"art" + 0.094*"audio" + 0.089*"digit" + -0.079*"econom" + -0.078*"glob" + 0.074*"teach" + 0.073*"program"
2015-05-13 15:00:19,930 : INFO : topic #4(2.200): -0.370*"heal" + 0.295*"program" + 0.219*"comput" + -0.209*"mus" + 0.193*"algorithm" + 0.187*"dat" + 0.181*"langu" + -0.169*"glob" + 0.149*"network" + -0.124*"food"
>>> index = similarities.MatrixSimilarity(lsi[corpus])
2015-05-13 15:00:58,271 : WARNING : scanning corpus to determine the number of features (consider setting `num_features` explicitly)
2015-05-13 15:00:58,297 : INFO : creating matrix for 379 documents and 10 features
>>> print courses_name[210]
Machine Learning
>>> ml_course = texts[210]
>>> ml_bow = dicionary.doc2bow(ml_course)
Traceback (most recent call last):
File "", line 1, in
NameError: name 'dicionary' is not defined
>>> ml_bow = dictionary.doc2bow(ml_course)
>>> ml_lsi = lsi[ml_bow]
>>> print ml_lsi
[(0, -8.7406913741774339), (1, -0.70293489704845991), (2, -0.9462620880713023), (3, 0.16366151418170138), (4, 4.5606439468587183), (5, 0.6647037843452156), (6, 1.9441650868832108), (7, 1.8344021623900915), (8, -0.026276046442867157), (9, 1.0736792916525693)]
>>> sims = index[ml_lsi]
>>> sort_sims = sorted(enumerate(sims), key=lambda item: -item[1])
>>> print sort_sims[0:10]
[(210, 1.0000001), (174, 0.97911292), (63, 0.96206093), (74, 0.95905471), (141, 0.95170236), (189, 0.94995236), (238, 0.94848543), (184, 0.94354957), (203, 0.93663675), (219, 0.92838192)]
>>> print courses_name[174]
Machine Learning
>>> print courses_name[238]
Probabilistic Graphical Models
>>> history
Traceback (most recent call last):
File "", line 1, in
NameError: name 'history' is not defined

阅读┊ 收藏 ┊ 喜欢 ▼ ┊打印┊举报/Report

前一篇：word中不能输入中文只能输入英文，其他地方正常的问题解决办法

后一篇：tfidf for 情感分析

新浪BLOG意见反馈留言板　欢迎批评指正