Extracting Text Content, Part 1: Extracting Content from PDF Documents_行云流水

http://blog.sina.com.cn/u/1740221837

Extracting Text Content, Part 1: Extracting Content from PDF Documents

(2012-06-10 20:35:15)

Tags:

extracting document content

extract pdf

extract word

extract ppt

it

Category: Technical knowledge

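Extracting text from any format other than plain txt files depends on third-party tools. To parse a given document format, you first need to download the corresponding library.

1) Extracting content from PDF documents. I stick to the most popular and widely used tools, since they are easier to learn and are safe and reliable. The library most commonly used for parsing PDF is PDFBox. Download it from http://www.pdfbox.org/ and add PDFBox.jar to your project.

The code is as follows: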

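import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

import org.pdfbox.cos.COSDocument;
import org.pdfbox.pdfparser.PDFParser;
import org.pdfbox.pdmodel.PDDocument;
import org.pdfbox.util.PDFTextStripper;

public class PdfHandler {

    public String getDocument(InputStream is) throws Exception {
        COSDocument cosDoc = null;
        try {
            // Step 2: parse the PDF with PDFBox's parser and keep the result as a COSDocument
            cosDoc = parseDocument(is);
        } catch (IOException e) {
            e.printStackTrace();
            return ""; // nothing to extract if parsing failed
        }
        String docText = "";
        try {
            // Step 3: extract the document text with PDFBox's PDFTextStripper
            PDFTextStripper stripper = new PDFTextStripper();
            docText = stripper.getText(new PDDocument(cosDoc));
            // docText now holds the text extracted from the PDF
            System.out.println(" doctext :: " + docText);
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            cosDoc.close(); // release the parsed document's resources
        }
        return docText;
    }

    private static COSDocument parseDocument(InputStream is) throws Exception {
        // PDF parser
        PDFParser parser = new PDFParser(is);
        parser.parse();
        return parser.getDocument();
    }

    public static void main(String[] args) throws Exception {
        PdfHandler ph = new PdfHandler();
        // Step 1: pass the PDF document in as a stream
        InputStream is = new FileInputStream(new File("D:\\test\\2.pdf"));
        try {
            ph.getDocument(is); // getDocument returns the extracted text as a String
        } finally {
            is.close();
        }
    }
}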

The code is simple and can be run as is, but running it produces this error:

Exception in thread "main" java.lang.NoClassDefFoundError: org/fontbox/cmap/CMapParser
    at org.pdfbox.pdmodel.font.PDFont.parseCmap(PDFont.java:534)
    at org.pdfbox.pdmodel.font.PDFont.encode(PDFont.java:387)
    at org.pdfbox.util.PDFStreamEngine.showString(PDFStreamEngine.java:325)
    at org.pdfbox.util.operator.ShowText.process(ShowText.java:64)
    at org.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:452)
    at org.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:215)
    at org.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:174)
    at org.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:336)
    at org.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:259)
    at org.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:216)
    at com.ty.test.testread.geText(testread.java:83)
    at com.ty.test.testread.main(testread.java:25)

Cause: one more JAR is missing. Find FontBox.jar in the downloaded PDFBox folder and add it to the project as well.
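To confirm this diagnosis before running the extraction, a small classpath check can help. The following is only a sketch: the class `ClasspathCheck` and its method `isClassAvailable` are hypothetical names made up for this example (they are not part of PDFBox), and the class name being looked up is the one taken from the stack trace above.

```java
/** Hypothetical helper (not part of PDFBox): reports whether a class can be loaded. */
class ClasspathCheck {

    // Returns true if the named class can be found through this class's class loader.
    static boolean isClassAvailable(String className) {
        try {
            // initialize=false: only locate the class, do not run its static initializers
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Class name copied from the NoClassDefFoundError in the stack trace
        String fontBoxClass = "org.fontbox.cmap.CMapParser";
        if (!isClassAvailable(fontBoxClass)) {
            System.err.println("Missing " + fontBoxClass + ": add FontBox.jar to the classpath.");
        }
    }
}
```

If the message is printed, FontBox.jar has not been picked up by the project yet.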

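Note: the PDF being parsed must not be encrypted; parsing an encrypted PDF will fail. The program cannot decrypt a PDF unless you know the password. One option is to use a PDF password removal tool to clear the password first, and then run the extraction.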
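It can be useful to detect an encrypted file up front instead of discovering it through a failed extraction. The sketch below is a rough heuristic that uses only the JDK, not PDFBox: an encrypted PDF carries an `/Encrypt` entry in its trailer dictionary, so the code simply scans the raw bytes for that name. The class `EncryptionCheck` and the method `looksEncrypted` are hypothetical names for this example, and the check can give a false positive if the same byte sequence happens to appear elsewhere in the file.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

/** Hypothetical helper (not part of PDFBox): heuristic check for an encrypted PDF. */
class EncryptionCheck {

    // Returns true if the raw PDF bytes contain the name "/Encrypt".
    static boolean looksEncrypted(byte[] pdfBytes) {
        // ISO-8859-1 maps each byte to exactly one char, so binary data survives the conversion
        String raw = new String(pdfBytes, StandardCharsets.ISO_8859_1);
        return raw.contains("/Encrypt");
    }

    public static void main(String[] args) throws IOException {
        // Same sample path as in the article; adjust to your own file
        byte[] bytes = Files.readAllBytes(Paths.get("D:\\test\\2.pdf"));
        if (looksEncrypted(bytes)) {
            System.err.println("The PDF appears to be encrypted; text extraction will likely fail.");
        }
    }
}
```

Reading the whole file into memory keeps the sketch short; for very large PDFs a streaming scan would be a better fit.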
