bicloud
Computing n-grams in Python

(2017-04-02 12:32:04)

python ngram

```
# -*- coding: utf-8 -*-
# @DATE    : 2017/4/1 10:39
# @Author  :
# @File    : ngram.py
from collections import defaultdict


def gen_n_gram(text, sep=" ", n=2):
    """Count the n-grams in one string of sep-separated tokens."""
    tokens = text.split(sep)
    output = {}
    # range works on both Python 2 and Python 3
    for i in range(len(tokens) - n + 1):
        # Tokens are joined without a separator, e.g. "a", "j" -> "aj"
        gram = "".join(tokens[i: i + n])
        output.setdefault(gram, 0)
        output[gram] += 1
    return output


def dict_sum(*dicts):
    """Add up the counts of several dictionaries key by key."""
    ret = defaultdict(int)
    for d in dicts:
        for k, v in d.items():
            ret[k] += v
    return ret


def sum_n_gram(inputs, sep=" ", n=2):
    """Total the n-gram counts over all inputs, highest count first."""
    output_sum = defaultdict(int)
    for text in inputs:
        # Forward sep and n so that non-default settings take effect
        output_sum = dict_sum(output_sum, gen_n_gram(text, sep=sep, n=n))
    output_sum = sorted(output_sum.items(), key=lambda x: x[1], reverse=True)
    return output_sum


if __name__ == "__main__":
    inputs = ["a a a j 9 3 h d e", "a j 9 3 h", "g g h 9 3"]
    print(gen_n_gram("a a a j 9 3 h d e"))
    output = sum_n_gram(inputs)
    print(output)
    output_file = "dict.txt"
    cnt = len(output)
    with open(output_file, "w") as out:
        for i, value in enumerate(output):
            # One "gram:count" entry per line, no trailing newline after the last
            if i + 1 < cnt:
                out.write("{}:{}\n".format(value[0], value[1]))
            else:
                out.write("{}:{}".format(value[0], value[1]))
```

```
{'aa': 2, 'de': 1, 'j9': 1, 'aj': 1, '3h': 1, '93': 1, 'hd': 1}
[('93', 3), ('aa', 2), ('aj', 2), ('j9', 2), ('3h', 2), ('de', 1), ('gg', 1), ('h9', 1), ('hd', 1), ('gh', 1)]

Process finished with exit code 0
```
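
As a side note, the same counts can probably be sketched more compactly with `collections.Counter` from the standard library. The helper names `ngram_counts` and `total_ngram_counts` below are hypothetical and not part of the script above. Keeping each n-gram as a tuple of tokens instead of a joined string is also an assumption made here: it avoids collisions when tokens are longer than one character (for example, "ab c" and "a bc" would both join to "abc").

```python
from collections import Counter


def ngram_counts(text, sep=" ", n=2):
    """Count n-grams in one string; each n-gram is kept as a tuple of tokens."""
    tokens = text.split(sep)
    # Zipping n staggered views of the token list yields consecutive n-grams;
    # zip stops at the shortest view, so short inputs simply give no n-grams.
    return Counter(zip(*(tokens[i:] for i in range(n))))


def total_ngram_counts(texts, sep=" ", n=2):
    """Sum n-gram counts over several strings, most frequent first."""
    total = Counter()
    for text in texts:
        total.update(ngram_counts(text, sep=sep, n=n))
    return total.most_common()


if __name__ == "__main__":
    inputs = ["a a a j 9 3 h d e", "a j 9 3 h", "g g h 9 3"]
    for gram, count in total_ngram_counts(inputs):
        print("{}:{}".format(" ".join(gram), count))
```

`Counter.update` adds counts rather than replacing them, so it plays the role of `dict_sum`, and `most_common()` returns the `(gram, count)` pairs sorted by descending count, much like the `sorted(...)` call in `sum_n_gram`.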
