高志军_PKU

N-gram Models: the n-gram (1)

(2016-05-05 11:33:36)
Tags: nlp, miscellany
Category: Natural Language Processing
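I'm planning a series of articles that introduces the basic principles of natural language processing to students with no programming background, or only a little, so that they can try things out hands-on in Python. Today's first installment: N-gram Models: the n-gram.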

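N-grams
Wikipedia's definition: in computational linguistics, an n-gram is a contiguous sequence of n items from a text (an item can be a phoneme, syllable, letter, word, or base pair).

An n-gram with n=1 is a unigram, with n=2 a bigram, and with n=3 a trigram. From n=4 upward they are simply named by number, e.g. 4-gram, 5-gram.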
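
To make the definition concrete, here is a minimal sketch that treats the "item" first as a word and then as a letter. The helper name sliding_windows and the sample strings are made up for this illustration; they do not come from any library.

```python
def sliding_windows(items, n):
    # Every run of n consecutive items; a sequence of N items yields N - n + 1 runs
    return [items[i:i + n] for i in range(len(items) - n + 1)]

words = "to be or not to be".split()
word_bigrams = sliding_windows(words, 2)        # items are words
letter_trigrams = sliding_windows("ngram", 3)   # items are letters

print(word_bigrams)
print(letter_trigrams)
# ['ngr', 'gra', 'ram']
```

The same slicing works for lists and strings alike, which is why the choice of item (word, letter, syllable) is mostly a matter of how the text is split beforehand.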

Example
Take the sentence "I will go to United States." as an example. Its bigrams are:
I will
will go
go to
to United
United States

The most basic idea and implementation:

Python:
sent="I will go to United States."
lst_sent=sent.split(" ")
bigram=[]
for i in range(len(lst_sent)-1):
   bigram.append(lst_sent[i] + " " + lst_sent[i+1])

>>> bigram
['I will', 'will go', 'go to', 'to United', 'United States.']

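Handling punctuation
That is the basic idea, but plenty of details still need attention. For example, how should the punctuation mark at the end of sent be handled? A common practice at present is simply to delete it.
The regular expression module can be used for this:
import re
punctuation_pattern=re.compile(r"""[.,!?'"]""")
no_punctuation_sent=re.sub(punctuation_pattern, "", sent)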

The revised code:

Python:
import re
punctuation_pattern=re.compile(r"""[.,!?'"]""")

sent="I will go to United States."
no_punctuation_sent=re.sub(punctuation_pattern, "", sent)
lst_sent=no_punctuation_sent.split(" ")
bigram=[]
for i in range(len(lst_sent)-1):
   bigram.append(lst_sent[i] + " " + lst_sent[i+1])
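
The output of the revised code is not shown, so here is a small sketch that wraps the same steps in a function and prints the result. The function name bigrams_without_punctuation is hypothetical, and the second sentence is chosen purely for illustration: it shows a side effect of this pattern, namely that periods and apostrophes inside words are deleted too.

```python
import re

punctuation_pattern = re.compile(r"""[.,!?'"]""")

def bigrams_without_punctuation(sent):
    # Delete the listed punctuation marks, then pair each word with its successor
    words = punctuation_pattern.sub("", sent).split(" ")
    return [words[i] + " " + words[i + 1] for i in range(len(words) - 1)]

print(bigrams_without_punctuation("I will go to United States."))
# ['I will', 'will go', 'go to', 'to United', 'United States']

# Side effect: punctuation inside a word is deleted as well
print(bigrams_without_punctuation("I don't live in the U.S. now."))
# ['I dont', 'dont live', 'live in', 'in the', 'the US', 'US now']
```

Whether "dont" and "US" are acceptable depends on the task; deleting punctuation outright is the simplest choice, not the only one.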


How would a trigram be implemented?

Python:
import re
punctuation_pattern=re.compile(r"""[.,!?'"]""")

sent="I will go to United States."
no_punctuation_sent=re.sub(punctuation_pattern, "", sent)
lst_sent=no_punctuation_sent.split(" ")
trigram=[]
for i in range(len(lst_sent)-2):
   trigram.append(lst_sent[i] + " " + lst_sent[i+1]+ " " + lst_sent[i+2])

>>> trigram
['I will go', 'will go to', 'go to United', 'to United States']
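
The index-based loops above can also be written with zip, which some readers may find easier to follow. This is only an alternative sketch, not the approach used in the rest of this post; the variable names are chosen for the example.

```python
words = "I will go to United States".split(" ")

# zip stops at the shortest input, so only full windows are produced
bigram = [" ".join(pair) for pair in zip(words, words[1:])]
trigram = [" ".join(triple) for triple in zip(words, words[1:], words[2:])]

print(trigram)
# ['I will go', 'will go to', 'go to United', 'to United States']
```

Each extra slice (words[1:], words[2:]) shifts the list by one position, so the tuples zip yields are exactly the consecutive word groups.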



Now read in a large file and see the result:

Python:
# -*- coding: utf-8 -*-

import re
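from tkinter import Tk  # Python 3 name; in Python 2 the module is Tkinter
# Use a dialog box to choose the file
from tkinter.filedialog import askopenfilename  # Python 2: tkFileDialog
Tk().withdraw()
filename=askopenfilename()

# The with statement closes the file once it has been read
with open(filename,"r") as fileToProcess:
    fileContent=fileToProcess.read()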



def remove_punctuation(strings):
    punctuation_pattern=re.compile(r"""[.,;:!?'"\n]""")
    no_punctuation_string=re.sub(punctuation_pattern,"",strings)
    return no_punctuation_string

def bigram(lst_sent):
    bigram=[]
    for i in range(len(lst_sent)-1):
        bigram.append(lst_sent[i] + " " + lst_sent[i+1])
    return bigram


def trigram(lst_sent):
    trigram=[]
    for i in range(len(lst_sent)-2):
        trigram.append(lst_sent[i] + " " + lst_sent[i+1]+ " " + lst_sent[i+2])
    return trigram

clean_content=remove_punctuation(fileContent)
lst_clean_content=clean_content.split(" ")
bigramLst=bigram(lst_clean_content)
trigramLst=trigram(lst_clean_content)

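Here are the first 50 trigrams. They look reasonable on the whole, apart from some words that are fused together where the original text had a line break: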

>>> trigramLst[:50]
['[Emma by Jane', 'by Jane Austen', 'Jane Austen 1816]VOLUME', 'Austen 1816]VOLUME ICHAPTER', '1816]VOLUME ICHAPTER IEmma', 'ICHAPTER IEmma Woodhouse', 'IEmma Woodhouse handsome', 'Woodhouse handsome clever', 'handsome clever and', 'clever and rich', 'and rich with', 'rich with a', 'with a comfortable', 'a comfortable homeand', 'comfortable homeand happy', 'homeand happy disposition', 'happy disposition seemed', 'disposition seemed to', 'seemed to unite', 'to unite some', 'unite some of', 'some of the', 'of the best', 'the best blessingsof', 'best blessingsof existence', 'blessingsof existence and', 'existence and had', 'and had lived', 'had lived nearly', 'lived nearly twenty-one', 'nearly twenty-one years', 'twenty-one years in', 'years in the', 'in the worldwith', 'the worldwith very', 'worldwith very little', 'very little to', 'little to distress', 'to distress or', 'distress or vex', 'or vex herShe', 'vex herShe was', 'herShe was the', 'was the youngest', 'the youngest of', 'youngest of the', 'of the two', 'the two daughters', 'two daughters of', 'daughters of a']
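
The fused words in the output above, such as 'homeand' and 'herShe', come from remove_punctuation deleting "\n" outright, so the last word of one line runs straight into the first word of the next. One possible fix, sketched below on the assumption that any run of whitespace should count as a single separator, is to leave the newline out of the pattern and call split() with no argument. The function name tokenize and the sample text are illustrative only.

```python
import re

punctuation_pattern = re.compile(r"""[.,;:!?'"]""")  # the newline is no longer in the pattern

def tokenize(text):
    # Strip punctuation, then split on any run of whitespace (spaces, tabs, newlines)
    return punctuation_pattern.sub("", text).split()

sample = "with a comfortable home\nand happy disposition, seemed to unite"
tokens = tokenize(sample)
print(tokens)
# ['with', 'a', 'comfortable', 'home', 'and', 'happy', 'disposition', 'seemed', 'to', 'unite']
```

Calling split() without an argument also drops the empty strings that split(" ") produces when two spaces appear in a row.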

Try it out for yourself.

A general-purpose nGram

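sent=['i', 'love', 'china.', 'i', 'love', 'suzhou']
def nGram(lst,n):
    # Collect every run of n consecutive items from lst
    ngram=[]
    for i in range(len(lst)):
        if i+n <= len(lst):  # a full window of n items still fits
            ngram.append(lst[i:i+n])
        else:
            print("Finish the process")
            break
    return ngram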
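
As a rough illustration of how a general-purpose function might be used, the sketch below defines a self-contained variant that joins each window into a string, so the results have the same format as the earlier bigram and trigram lists. The name ngram_strings is a hypothetical helper for this example.

```python
def ngram_strings(tokens, n):
    # Join each window of n consecutive tokens into one space-separated string
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sent = ['i', 'love', 'china.', 'i', 'love', 'suzhou']

print(ngram_strings(sent, 1))  # unigrams: the tokens themselves
print(ngram_strings(sent, 2))
# ['i love', 'love china.', 'china. i', 'i love', 'love suzhou']
print(ngram_strings(sent, 4))
# ['i love china. i', 'love china. i love', 'china. i love suzhou']
```

Note that 'china.' still carries its period here, because this sample list was not passed through the punctuation step.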

