NLP自然语言处理分词之一元模型_裴大帅2021

http://blog.sina.com.cn/u/1653392055

首页博文目录关于我

个人资料

微博

加好友发纸条

写留言加关注

博客等级：
博客积分：

博客访问：
关注人气：
获赠金笔：0支
赠出金笔：0支
荣誉徽章：

正文字体大小：大中小

NLP自然语言处理分词之一元模型

(2017-08-07 23:02:14)

标签：

nlp

三叉树

一元模型

分类：机器学习

一元模型缘由

之前介绍过使用后缀三叉Trie树寻找词典中最长匹配词的方法进行句子分词，当然也说过前缀三叉Trie树，但是因为中文中主体后置所以后缀的效果基本优于前缀。但是通过寻找最长匹配词也存在着弊端，比如：

1、不一定每次分词最长效果就最好的

2、每次分词后面最长匹配词可能不应该是这个句子的分词，这个最长匹配词可能由前一个分词的后半部分，以及后一个分词的前半部分组成

举一个实际的例子，比如有句子：见证大上海银行业间合作。我们有可能通过最长匹配方式拿到分词“上海银行”，事实上这个句子和"上海银行"没有关系。这个就是最长匹配方式导致的歧义问题。

采用什么方式可以优化分词方式，尽量避免歧义问题的发生，这里我们就介绍一元模型的分词方法。

一元模型方法简述

首先，我们将句子转化成切分词图，比如“见证大上海银行业间合作”对应的切分词图，每条边对应这一个权值（也可以叫做成本）：

接着，我们需要寻找从0到11最合适的路径，使得总成本最小。

一元模型实现步骤

还是以“见证大上海银行业间合作”为例进行完整剖析。

1、获取每个词的出现概率（取log后的值）

2、画出切分词图

3、将切分词图转换成邻接表，因为需要计算每个节点的达到成本，所以邻接表的纵向list是以达到节点进行组织的，而横向则是具有相同达到节点的词语。得到邻接表如下所示：

4、接下来从上往下计算每个节点的最小成本，并将此节点作为那一行的节点。怎么计算最小成本呢？

假设我们已经得到"大"那一行的最小成本为-18.6(只保留一位小数)，"上"那一行的最小成本为-24.3，接下来计算"海"->"上海" 这一行的最小成本。

（1）“海"的成本为：-7.2540 + (-24.3)=-31.5，其中-24.3就是"海"上一个字"上"的最小成本

（2）"上海"的成本为：-6.9532 + (-18.6) = -25.6，其中-18.6就是"上海"上一个字“大”的最小成本

比较上面两个数字我们就知道"上海"的成本低于"海"，所以我们选择"上海"。其他的计算同理，从而我们就计算出了每一行的最小成本，以及相应的词。我们计算得到的结果是：

5、从后向前寻找每个切分词

第1个是"合作"，这个词占了2个坑；

所以第2个是序号9即"间"，这个词占了1个坑；

所以第3个是序号8即"银行业"，这个词占了3个坑；

所以第4个是序号5即"上海"，这个词占了2个坑；

所以第5个是序号3即"大"，这个词占了1个坑；

所以第6个是序号2即"见证"，这个词占了2个坑；

最后第7个是序号0，不好意思没有这个坑了。

通过以上计算我们得到了切词为：见证/大/上海/银行业/间/合作

一元模型代码实现

（1）标记切分词图中每个节点的具体信息CnToken.java

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26

package segment.unigrammodel;

public class CnToken {
    public String termText; // 词
    public int start; // 词的开始位置
    public int end; // 词的结束位置
    public double freqProb; // 词在语料库中出现的概率(log后的数值)

    public CnToken(int vertexFrom, int vertexTo, String word, double freqProb){
        start = vertexFrom;
        end = vertexTo;
        termText = word;
        this.freqProb = freqProb;
    }

    @Override
    public String toString() {
        return "text:" + termText + " start:" + start + " end:" + end
                + " freqProb:" + freqProb;
    }
}

（2）节点Node.java

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15

package segment.unigrammodel;

public class Node {
    public CnToken item; // 链表中的节点
    public Node next; // 记录下一个元素

    Node(CnToken item){
        this.item = item;
        next = null;
    }
}

（3）用于构建邻接表的单向链表TokenLinkedList.java

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79

package segment.unigrammodel;

import java.util.Iterator;
import java.util.NoSuchElementException;

public class TokenLinkedList implements Iterable{
    public Node head = null; // 链表的头


    public void put(CnToken item){
        Node n = new Node(item); // 新建一个节点
        n.next = head; // 新节点指向原来的头节点
        head = n; // 新节点放在链表头
    }


    @Override
    public String toString(){
        StringBuilder buf = new StringBuilder();
        Node cur = head; // 从头开始

        // 如果当前节点不是空就往下遍历
        while(cur != null){
            buf.append(cur.item.toString());
            buf.append('\t');
            cur = cur.next; // 找下一个节点
        }

        return buf.toString();
    }

    @Override
    public Iterator iterator() {
        return new LinkIterator(head);
    }


    private class LinkIterator implements Iterator {
        Node itr;

        public LinkIterator(Node begin){
            itr = begin; // 遍历的开始节点
        }

        @Override
        public boolean hasNext() {
            return itr != null;
        }

        // 向下遍历
        @Override
        public CnToken next() {
            if(itr == null){
                throw new NoSuchElementException();
            }

            Node cur = itr;
            itr = itr.next;
            return cur.item;
        }

        @Override
        public void remove() {
            throw new UnsupportedOperationException(); // 不支持这个操作
        }
    }// end of LinkIterator
}

（4）邻接表AdjList.java

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56

package segment.unigrammodel;

import java.util.Iterator;

public class AdjList {
    private TokenLinkedList list[]; // AdjList的图结构

    // 构造方法，分配空间
    public AdjList(int verticesNum){
        list = new TokenLinkedList[verticesNum];

        // 初始化数组中所有的链表
        for(int index = 0; index < verticesNum; index++){
            list[index] = new TokenLinkedList();
        }
    }

    public int getVerticesNum(){
        return list.length;
    }

    // 增加一个边到图中, 放在同一个行中的都是相同终点的
    public void addEdge(CnToken newEdge){
        list[newEdge.end].put(newEdge);
    }

    // 返回一个迭代器，包含以指定点结尾的所有的边
    public Iterator getAdjacencies(int vertex){
        TokenLinkedList ll = list[vertex];
        if(ll == null){
            return null;
        }
        return ll.iterator();
    }

    // 输出逆向邻接表
    @Override
    public String toString(){
        StringBuilder temp = new StringBuilder();
        for(int index = 0; index < getVerticesNum(); index++){
            if(list[index] == null){
                continue;
            }
            temp.append("node:");
            temp.append(index);
            temp.append(": ");
            temp.append(list[index].toString());
            temp.append("\n");
        }
        return temp.toString();
    }
}

（5）一元模型分词器UnigramSegmenter.java

package segment.unigrammodel;

import algorithm.TST.PrefixTSTNode2;

import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.Iterator;

public class UnigramSegmenter {
    String[] bestPrevWords; // 最佳前驱词
    double[] prob;  // 节点概率


    public AdjList getSplitGraph(String sentence){
        PrefixTSTNode2 dict = new PrefixTSTNode2("D:\\workspace\\IDEA\\NLPSystem\\src\\data\\SDIC.txt");
        int sLen;
        sLen = sentence.length();
        AdjList g = new AdjList(sLen + 1); // 存储所有被切分的可能的词
        int j;
        ArrayList sucWords = new ArrayList();

        // 形成切分词图
        for(int i = 0; i < sLen;){
            boolean match = dict.matchAll(sentence, i, sucWords);

            // 已经匹配上
            if(match){
                for(WordEntry word : sucWords){
                    j = i + word.word.length();
                    double logProb = Math.log(word.freq) - Math.log(dict.totalFreq);

                    // 把词放入词图
                    g.addEdge(new CnToken(i, j, word.word, logProb));
                }
                i++;
            }else{
                j = i + 1;
                double logProb = Math.log(1) - Math.log(dict.totalFreq);

                // 把单字放入词图
                g.addEdge(new CnToken(i, j, sentence.substring(i, j), logProb));
                i = j;
            }
        }
        return g;
    }


    public Deque split(String sentence){
        PrefixTSTNode2 dict = new PrefixTSTNode2("D:\\workspace\\IDEA\\NLPSystem\\src\\data\\SDIC.txt");
        int sLen;
        sLen = sentence.length();
        AdjList g = new AdjList(sLen + 1); // 存储所有被切分的可能的词
        int j;
        ArrayList sucWords = new ArrayList();

        // 形成切分词图
        for(int i = 0; i < sLen;){
            boolean match = dict.matchAll(sentence, i, sucWords);

            // 已经匹配上
            if(match){
                for(WordEntry word : sucWords){
                    j = i + word.word.length();
                    double logProb = Math.log(word.freq) - Math.log(dict.totalFreq);
                    System.out.println("word: " + word.word + ",prob: " + logProb );

                    // 把词放入词图
                    g.addEdge(new CnToken(i, j, word.word, logProb));
                }
                i++;
            }else{
                j = i + 1;
                double logProb = Math.log(1) - Math.log(dict.totalFreq);

                // 把单字放入词图
                g.addEdge(new CnToken(i, j, sentence.substring(i, j), logProb));
                i = j;
            }
        }
        return maxProb(g);
    }


    private Deque maxProb(AdjList g){
        bestPrevWords = new String[g.getVerticesNum()]; // 最佳前驱词数组
        prob = new double[g.getVerticesNum()]; // 节点概率
        prob[0] = 0; // 节点0的初始概率是1，取log后是0

        // 按节点求最大前驱
        for(int index = 1; index < g.getVerticesNum(); index++){
            // 求出最大前驱
            getBestPrev(g, index);
        }

        return bestPath(g);
    }


    private void getBestPrev(AdjList g, int i){
        double maxProb = Double.NEGATIVE_INFINITY;
        String bestPrevWord = null; // 候选最佳前驱词

        Iterator it = g.getAdjacencies(i); // 得到前驱词集合，从中挑选最佳前驱词
        while(it.hasNext()){
            CnToken itr = it.next();
            double nodeProb = prob[itr.start] + itr.freqProb; // 候选节点概率

            // 概率最大的算作最佳前驱
            if(nodeProb > maxProb){
                bestPrevWord = itr.termText;
                maxProb = nodeProb;
            }
        }

        prob[i] = maxProb; // 节点概率
        bestPrevWords[i] = bestPrevWord; // 最佳前驱词
    }

    // 根据最佳前驱节点数组回溯求解词序列
    private Deque bestPath(AdjList g){
        Deque path = new ArrayDeque(); // 最佳词序列
        int i = g.getVerticesNum() - 1;
        while(bestPrevWords[i] != null){
            System.out.println("节点" + i + "的最佳前驱词：" + bestPrevWords[i]);
            path.push(bestPrevWords[i]); // 压栈
            i = i - bestPrevWords[i].length();
        }

        return path;
    }
}

（6）前缀三叉树PrefixTSTNode2.java，主要负责词典表的读入，构建三叉树，计算每个词的概率

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366

package algorithm.TST;

import segment.unigrammodel.WordEntry;

import java.io.*;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.StringTokenizer;

public class PrefixTSTNode2 {
    public WordEntry data; // 存放数据
    public PrefixTSTNode2 loNode; // 左边节点
    public PrefixTSTNode2 eqNode; // 中间节点
    public PrefixTSTNode2 hiNode; // 右边节点
    public char splitchar; // 本节点表示的字符
    public long totalFreq = 0; // 语料库中的词的总数
    public int totalNode = 0; // 节点总数

    public PrefixTSTNode2 rootNode;

    public PrefixTSTNode2(){
    }


    public PrefixTSTNode2(char splitchar){
        this.splitchar = splitchar;
    }

    public String toString(){
        return "splitchar:" + splitchar;
    }


    public PrefixTSTNode2(String fileName){
        try {
            FileInputStream file = new FileInputStream(fileName);
            BufferedReader in = new BufferedReader(new InputStreamReader(file, "utf-8"));
            String line;

            try{
                System.out.println("begin to load model...");
                while((line = in.readLine()) != null) {
                    if("".equals(line)){
                        continue;
                    }

                    StringTokenizer st = new StringTokenizer(line, "\t");
                    String key = st.nextToken();

                    int freq;
                    try{
                        freq = Integer.parseInt(st.nextToken());
                    }catch(NumberFormatException e){
                        freq = 1;
                    }

                    addWord(key, freq);
                    totalFreq += freq;
                }
                System.out.println("load model done");
            }catch(IOException e){
                e.printStackTrace();
            }finally{
                try {
                    in.close();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
        } catch (FileNotFoundException e) {
            e.printStackTrace();
        } catch (UnsupportedEncodingException e) {
            e.printStackTrace();
        }
    }


    public void addWord(String key, int freq) {
        if (rootNode == null) {
            rootNode = new PrefixTSTNode2(key.charAt(0));
        }

        int charIndex = 0;
        PrefixTSTNode2 currentNode = rootNode;
        while (true) {
            int compa = (key.charAt(charIndex) - currentNode.splitchar);
            if (compa == 0) {
                charIndex++;
                if (charIndex == key.length()) {
                    currentNode.data = new WordEntry(key, freq);
                    break;
                }
                if (currentNode.eqNode == null) {
                    currentNode.eqNode = new PrefixTSTNode2(key.charAt(charIndex));
                    totalNode++;
                }
                currentNode = currentNode.eqNode;
            } else if (compa < 0) {
                if (currentNode.loNode == null) {
                    currentNode.loNode = new PrefixTSTNode2(key.charAt(charIndex));
                    totalNode++;
                }
                currentNode = currentNode.loNode;
            } else {
                if (currentNode.hiNode == null) {
                    currentNode.hiNode = new PrefixTSTNode2(key.charAt(charIndex));
                    totalNode++;
                }
                currentNode = currentNode.hiNode;
            }
        }
    }


    public boolean matchAll(String sentence, int offset, ArrayList sucWords){
        sucWords.clear();  // 清空后继词集合中的词
        if(null == sentence || null == rootNode || "".equals(sentence)){
            return false;
        }

        PrefixTSTNode2 currentNode = rootNode;
        int charIndex = offset;
        while(true){
            if(null == currentNode){
                if(sucWords.size() > 0){
                    return true;
                }
                return false;
            }

            int charComp = sentence.charAt(charIndex) - currentNode.splitchar;
            if(charComp == 0){
                charIndex++;
                if(currentNode.data != null){
                    System.out.println("data = " + currentNode.data);
                    sucWords.add(currentNode.data);
                }

                if(charIndex == sentence.length()){
                    if(sucWords.size() > 0){
                        return true;
                    }
                    return false;
                }

                currentNode = currentNode.eqNode;
            }else if(charComp < 0){
                currentNode = currentNode.loNode;
            }else{
                currentNode = currentNode.hiNode;
            }
        }
    }


    public void compileDic(File file){
        try {
            FileOutputStream file_output = new FileOutputStream(file);
            BufferedOutputStream buffer = new BufferedOutputStream(file_output);
            DataOutputStream data_out = new DataOutputStream(buffer);

            PrefixTSTNode2 currNode = rootNode;
            if(currNode == null){
                return;
            }

            int currNodeNo = 1; // 当前节点编号
            int maxNodeNo = currNodeNo;

            // 用于存放节点数据的队列
            Deque queueNode = new ArrayDeque();
            queueNode.addFirst(currNode);

            // 用于存放节点编号的队列
            Deque queueNodeIndex = new ArrayDeque();
            queueNodeIndex.addFirst(currNodeNo);

            data_out.writeInt(totalNode); // Trie树节点总数
            data_out.writeLong(totalFreq); // 词频总数

            Charset charset = Charset.forName("utf-8");

            // 广度优先遍历所有节点，将其加入至数组中
            while(!queueNodeIndex.isEmpty()){
                // 取出队列第一个节点
                currNode = queueNode.pollFirst();
                currNodeNo = queueNodeIndex.pollFirst();

                // 处理左子节点
                int leftNodeNo = 0; // 当前节点的左孩子节点编号
                if(currNode.loNode != null){
                    maxNodeNo++;
                    leftNodeNo = maxNodeNo;
                    queueNode.addLast(currNode.loNode);
                    queueNodeIndex.addLast(leftNodeNo);
                }

                // 处理中间子节点
                int middleNodeNo = 0; // 当前节点的中间孩子节点编号
                if(currNode.eqNode != null){
                    maxNodeNo++;
                    middleNodeNo = maxNodeNo;
                    queueNode.addLast(currNode.eqNode);
                    queueNodeIndex.addLast(middleNodeNo);
                }

                // 处理右子节点
                int rightNodeNo = 0; // 当前节点的右孩子节点编号
                if(currNode.hiNode != null){
                    maxNodeNo++;
                    rightNodeNo = maxNodeNo;
                    queueNode.addLast(currNode.hiNode);
                    queueNodeIndex.addLast(rightNodeNo);
                }

                // 写入本节点的编号信息
                data_out.writeInt(currNodeNo);

                // 写入左孩子节点的编号信息
                data_out.writeInt(leftNodeNo);

                // 写入中孩子节点的编号信息
                data_out.writeInt(middleNodeNo);

                // 写入右孩子节点的编号信息
                data_out.writeInt(rightNodeNo);

                byte[] splitChar = String.valueOf(currNode.splitchar).getBytes("utf-8");

                // 记录byte数组的长度
                data_out.writeInt(splitChar.length);

                // 写入splitChar
                data_out.write(splitChar);

                // 是结束节点，data域不为空
                if(currNode.data != null && currNode.data.word != null){
                    CharBuffer cBuffer = CharBuffer.wrap(currNode.data.word);
                    ByteBuffer bb = charset.encode(cBuffer);

                    // 写入词的长度
                    data_out.writeInt(bb.limit());
                    // 写入词的内容
                    for(int i = 0; i < bb.limit(); i++){
                        data_out.write(bb.get(i));
                    }
                }else{
                    // 不是结束节点，data域为空
                    data_out.writeInt(0); // 写入字符串的长度
                }
            }

            data_out.close();
            file_output.close();
        } catch (FileNotFoundException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }


    public void loadBinaryFile(File file){
        Charset charset = Charset.forName("utf-8"); // 得到字符集
        try {
            InputStream file_input = new FileInputStream(file);
            BufferedInputStream buffer = new BufferedInputStream(file_input);
            DataInputStream data_in = new DataInputStream(buffer);

            // 获取节点个数
            totalNode = data_in.readInt();
            PrefixTSTNode2[] nodeList = new PrefixTSTNode2[totalNode + 2];  // 调试过程中发现需要+2

            // 要预先创建出来所有的节点，因为当前节点要指向后续节点
            for(int i = 0; i < nodeList.length; i++){
                nodeList[i] = new PrefixTSTNode2();
            }

            // 读入词典中目前词的个数
            totalFreq = data_in.readLong();

            for(int index = 1; index <= totalNode; index++){
                int currNodeIndex = data_in.readInt(); // 获得当前节点编号
                int leftNodeIndex = data_in.readInt(); // 获得当前节点左子节点编号
                int middleNodeIndex = data_in.readInt(); // 获得当前节点中子节点编号
                int rightNodeIndex = data_in.readInt(); // 获得当前节点右子节点编号

                PrefixTSTNode2 currentNode = nodeList[currNodeIndex];  // 获得当前节点

                // 获取splitchar值
                int length = data_in.readInt();
                byte[] bytebuf = new byte[length];
                data_in.read(bytebuf);
                currentNode.splitchar = charset.decode(ByteBuffer.wrap(bytebuf)).charAt(0);

                // 获取字典中词的内容
                length = data_in.readInt();
                if(length > 0){
                    bytebuf = new byte[length];
                    data_in.read(bytebuf);
                    String key = new String(bytebuf, "utf-8"); // 记录每一个词语
                    currentNode.data = new WordEntry();
                    currentNode.data.word = key;
                }

                // 生成树节点之间的对应关系，左、中、右子树
                if(leftNodeIndex >= 0){
                    currentNode.loNode = nodeList[leftNodeIndex];
                }

                if(middleNodeIndex >= 0){
                    System.out.println(middleNodeIndex);
                    currentNode.eqNode = nodeList[middleNodeIndex];
                }

                if(rightNodeIndex >= 0){
                    currentNode.hiNode = nodeList[rightNodeIndex];
                }
            }

            data_in.close();
            buffer.close();
            file_input.close();

            rootNode = nodeList[1]; //设置根节点

        } catch (FileNotFoundException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

}

阅读┊ 收藏 ┊ 喜欢 ▼ ┊打印┊举报/Report

前一篇：AI中各种模型的二进制存放

后一篇：推荐系统之贝叶斯过滤算法

新浪BLOG意见反馈留言板　欢迎批评指正