Removing PDFont caching with Apache Tika

Updated: 2024-10-21 15:32:57

Problem description

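I am trying to extract only the text from a number of different documents (RTF, DOC, PDF). I naturally turned to Apache Tika because it can auto-detect the document type and extract the text accordingly. I am only interested in the text, not in the formatting.

My application ends up with a big memory leak, and on investigation it comes from caching in the PDFont class of the PDFBox dependency. I am not interested in caching font metrics or other font formatting data from PDFs, since I only want to extract the text.

I am using Tika 1.12. Does anyone know how to get around this caching issue? This is how I am using auto-detection: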

AutoDetectParser parser = new AutoDetectParser();
BodyContentHandler handler = new BodyContentHandler(-1); // -1 disables the write limit
Metadata metadata = new Metadata();
FileInputStream inputstream = new FileInputStream(new File(child.getPath()));
ParseContext context = new ParseContext();
parser.parse(inputstream, handler, metadata, context);
String s = handler.toString();
handler = null;
context = null;
inputstream.close();
PDFont.clearResources(); // attempt to drop PDFBox's static font caches
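The leak pattern described in the question, and what an explicit "clear resources" call is meant to achieve, can be illustrated with a small standalone sketch. This is not PDFBox's implementation: the class `StaticCacheSketch`, its `parse` method, and the 64 KB entry size are all hypothetical and exist only to show how a static cache keeps growing until something empties it.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a library class with a static cache; NOT PDFBox code.
class StaticCacheSketch {
    // A static map stays reachable for as long as the class is loaded,
    // so its entries are never garbage collected while they remain in the map.
    private static final Map<String, byte[]> CACHE = new HashMap<>();

    // Simulates parsing a document that references a font; unseen fonts get cached.
    public static void parse(String fontName) {
        CACHE.computeIfAbsent(fontName, k -> new byte[64 * 1024]);
    }

    public static int cachedEntries() {
        return CACHE.size();
    }

    // Analogous in spirit to an explicit "clear resources" call on the library.
    public static void clearResources() {
        CACHE.clear();
    }

    public static void main(String[] args) {
        for (int i = 0; i < 100; i++) {
            parse("font-" + i); // every new font name adds an entry
        }
        System.out.println("entries after parsing: " + cachedEntries());
        clearResources();
        System.out.println("entries after clearing: " + cachedEntries());
    }
}
```

If a cache of this kind is held through strong references, only an explicit clear releases it; whether that matches the behaviour of the PDFBox version bundled with Tika 1.12 would need to be checked against its source.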

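Solution

So I fudged a workaround and simply called System.gc() every time a file had finished being processed. It works a treat, but it doesn't really answer the question.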
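A minimal sketch of that workaround, written without a Tika dependency so it can run on its own. The `TextExtractor` interface and the `extractAll` helper are hypothetical names introduced here; in the setup from the question, the extractor would wrap the `AutoDetectParser.parse(...)` call.

```java
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

class GcAfterEachFileSketch {
    // Hypothetical extraction hook; with Tika this would wrap AutoDetectParser.parse(...).
    interface TextExtractor {
        String extract(InputStream in) throws Exception;
    }

    public static List<String> extractAll(List<Path> files, TextExtractor extractor) throws Exception {
        List<String> texts = new ArrayList<>();
        Runtime rt = Runtime.getRuntime();
        for (Path file : files) {
            // try-with-resources closes the stream even if extraction throws.
            try (InputStream in = Files.newInputStream(file)) {
                texts.add(extractor.extract(in));
            }
            // The workaround: request a collection after each file.
            // System.gc() is only a hint; the JVM is free to ignore it.
            System.gc();
            long usedMb = (rt.totalMemory() - rt.freeMemory()) / (1024 * 1024);
            System.out.println(file.getFileName() + ": ~" + usedMb + " MB in use");
        }
        return texts;
    }

    public static void main(String[] args) throws Exception {
        Path tmp = Files.createTempFile("sample", ".txt");
        Files.write(tmp, "hello".getBytes(StandardCharsets.UTF_8));
        // Placeholder extractor that just reads the bytes as UTF-8 text.
        TextExtractor plain = in -> new String(in.readAllBytes(), StandardCharsets.UTF_8);
        System.out.println(extractAll(List.of(tmp), plain));
        Files.delete(tmp);
    }
}
```

Logging heap usage after each file makes it easier to see whether memory actually levels off. Because System.gc() is a request rather than a guarantee, the result can differ between JVMs and garbage collector settings.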


Published: 2023-11-01 01:33:15
Article link: https://www.elefans.com/category/jswz/34/1547858.html