I am using PDFBox in Java to extract text from PDF files. Some of the input files provided are not valid PDFs, and PDFTextStripper halts on them. Is there a clean way to check whether a provided file is indeed a valid PDF?
Recommended answer
You can determine the MIME type of a file (or byte array), so you don't have to rely blindly on the extension. I do it with Aperture's MimeExtractor (aperture.sourceforge/), or there is a library I saw a few days ago made just for that (sourceforge/projects/mime-util).
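To illustrate the idea of inspecting content rather than the extension, here is a minimal sketch that looks for the "%PDF-" header near the start of the data. It uses only the JDK and is not the Aperture or mime-util API; the class name PdfSniffer, the method looksLikePdf and the 1024-byte search window are assumptions made for this example. A match only means the file claims to be a PDF, so a damaged file can still pass and fail later in PDFBox.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

// Hypothetical helper for illustration; not part of PDFBox, Aperture or mime-util.
class PdfSniffer {
    // PDF files are expected to begin with "%PDF-" followed by a version number.
    private static final byte[] MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);
    // Assumption: tolerate a little leading junk by scanning a small window.
    private static final int WINDOW = 1024;

    // Returns true if the header marker occurs within the first WINDOW bytes.
    static boolean looksLikePdf(byte[] head) {
        int limit = Math.min(head.length, WINDOW) - MAGIC.length;
        for (int i = 0; i <= limit; i++) {
            boolean match = true;
            for (int j = 0; j < MAGIC.length; j++) {
                if (head[i + j] != MAGIC[j]) {
                    match = false;
                    break;
                }
            }
            if (match) {
                return true;
            }
        }
        return false;
    }

    // Reads at most WINDOW bytes from the file and applies the same check.
    static boolean looksLikePdf(Path file) throws IOException {
        try (InputStream in = Files.newInputStream(file)) {
            byte[] buffer = new byte[WINDOW];
            int read = in.readNBytes(buffer, 0, WINDOW); // Java 9+
            return looksLikePdf(Arrays.copyOf(buffer, read));
        }
    }
}
```

Calling PdfSniffer.looksLikePdf(path) before handing the file to PDFTextStripper lets you skip files that are obviously something else, such as an HTML error page saved with a .pdf extension.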
I use Aperture to extract text from a variety of files, not only PDFs, but I have to tweak things for PDFs, for example (Aperture uses PDFBox, but I added another library as a fallback for when PDFBox fails).
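The fallback arrangement described above could look roughly like the following sketch. It is written against a hypothetical TextExtractor interface rather than the real PDFBox or Aperture classes, so every name here (FallbackExtraction, TextExtractor, extractWithFallback) is an assumption; in practice each extractor would wrap one library's own calls.

```java
import java.util.List;
import java.util.Optional;

// Hypothetical sketch; none of these types come from PDFBox or Aperture.
class FallbackExtraction {
    // Assumed abstraction: one implementation per extraction library.
    interface TextExtractor {
        String extract(byte[] data) throws Exception;
    }

    // Tries each extractor in order and returns the first successful result.
    static Optional<String> extractWithFallback(byte[] data, List<TextExtractor> extractors) {
        for (TextExtractor extractor : extractors) {
            try {
                String text = extractor.extract(data);
                if (text != null) {
                    return Optional.of(text);
                }
            } catch (Exception e) {
                // This extractor rejected the input; move on to the next one.
            }
        }
        // Every extractor failed, so the input is probably not a usable PDF.
        return Optional.empty();
    }
}
```

This only helps when a library fails by throwing. If a parser hangs on a bad file instead, the call would also need a timeout, for example by running it on a separate thread.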