I subclassed PDFStreamEngine and overrode processTextPosition. I can now reconstruct the text the way PDFTextStripper does, but I don't want to process transparent text, which is often garbage.
How can I tell whether some text is transparent?
Recommended answer

As it turned out, the transparent text was not actually transparent at all but merely covered by an image: in 201103 Key Smoking Statistic for SA 2010 FINAL.pdf, the text "Key Smoking Statistics for SA --- 2004" has been covered by an image showing a TC logo.
The following proof of concept shows a text stripper class that ignores text covered by images.
    // Package names below are those of PDFBox 1.8.x.
    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.Collection;
    import java.util.List;
    import java.util.Map;

    import org.apache.commons.logging.Log;
    import org.apache.commons.logging.LogFactory;
    import org.apache.pdfbox.cos.COSBase;
    import org.apache.pdfbox.cos.COSName;
    import org.apache.pdfbox.cos.COSStream;
    import org.apache.pdfbox.pdmodel.PDResources;
    import org.apache.pdfbox.pdmodel.graphics.xobject.PDXObject;
    import org.apache.pdfbox.pdmodel.graphics.xobject.PDXObjectForm;
    import org.apache.pdfbox.pdmodel.graphics.xobject.PDXObjectImage;
    import org.apache.pdfbox.util.Matrix;
    import org.apache.pdfbox.util.PDFOperator;
    import org.apache.pdfbox.util.PDFTextStripper;
    import org.apache.pdfbox.util.TextPosition;
    import org.apache.pdfbox.util.operator.OperatorProcessor;

    public class VisibleTextStripper extends PDFTextStripper
    {
        public VisibleTextStripper() throws IOException
        {
            super();
            // Handle the "Do" operator ourselves so that we see every image being painted.
            registerOperatorProcessor("Do", new Invoke());
        }

        //
        // Hiding operations
        //

        // Removes all characters collected so far that overlap the image painted with the current CTM.
        void hide(String name)
        {
            Matrix ctm = getGraphicsState().getCurrentTransformationMatrix();
            float x = ctm.getXPosition();
            float y = ctm.getYPosition();
            float scaledWidth = ctm.getXScale();
            float scaledHeight = ctm.getYScale();

            for (List<TextPosition> characters : charactersByArticle)
            {
                Collection<TextPosition> toRemove = new ArrayList<TextPosition>();
                for (TextPosition character : characters)
                {
                    Matrix matrix = character.getTextPos();
                    float cx = matrix.getXPosition();
                    float cy = matrix.getYPosition();
                    float cw = character.getWidth();
                    float ch = character.getHeight();
                    // Note: the vertical test uses cw, as in the original post; ch may have been intended.
                    if (overlaps(x, scaledWidth, cx, cw) && overlaps(y, scaledHeight, cy, cw))
                    {
                        System.out.printf("Hidden by '%s': X: %f; Y: %f; Width: %f; Height: %f; Char: '%s'\n",
                                name, cx, cy, cw, ch, character.getCharacter());
                        toRemove.add(character);
                    }
                }
                characters.removeAll(toRemove);
            }
        }

        // One-dimensional interval overlap test; negative widths are normalized first.
        private boolean overlaps(float start1, float width1, float start2, float width2)
        {
            if (width1 < 0)
            {
                start1 += width1;
                width1 = -width1;
            }
            if (width2 < 0)
            {
                start2 += width2;
                width2 = -width2;
            }

            if (start1 < start2)
            {
                return start1 + width1 >= start2;
            }
            else
            {
                return start2 + width2 >= start1;
            }
        }

        //
        // operator processors
        //

        public static class Invoke extends OperatorProcessor
        {
            /**
             * Log instance.
             */
            private static final Log LOG = LogFactory.getLog(Invoke.class);

            /**
             * process : Do : Paint the specified XObject (section 4.7).
             * @param operator The operator that is being executed.
             * @param arguments List
             * @throws IOException If there is an error invoking the sub object.
             */
            @Override
            public void process(PDFOperator operator, List<COSBase> arguments) throws IOException
            {
                VisibleTextStripper drawer = (VisibleTextStripper) context;
                COSName objectName = (COSName) arguments.get(0);
                Map<String, PDXObject> xobjects = drawer.getResources().getXObjects();
                PDXObject xobject = (PDXObject) xobjects.get(objectName.getName());
                if (xobject == null)
                {
                    LOG.warn("Can't find the XObject for '" + objectName.getName() + "'");
                }
                else if (xobject instanceof PDXObjectImage)
                {
                    // An image is painted: drop the text it covers.
                    drawer.hide(objectName.getName());
                }
                else if (xobject instanceof PDXObjectForm)
                {
                    PDXObjectForm form = (PDXObjectForm) xobject;
                    COSStream formContentstream = form.getCOSStream();
                    // if there is an optional form matrix, we have to map the form space to the user space
                    Matrix matrix = form.getMatrix();
                    if (matrix != null)
                    {
                        Matrix xobjectCTM = matrix.multiply(context.getGraphicsState().getCurrentTransformationMatrix());
                        context.getGraphicsState().setCurrentTransformationMatrix(xobjectCTM);
                    }
                    // find some optional resources, instead of using the current resources
                    PDResources pdResources = form.getResources();
                    context.processSubStream(context.getCurrentPage(), pdResources, formContentstream);
                }
            }
        }
    }
It works well with your sample document.
The check

    if (overlaps(x, scaledWidth, cx, cw) && overlaps(y, scaledHeight, cy, cw))

unfortunately assumes that no rotations are involved (with all transformations aggregated), neither of the text nor of the image.
For a generic solution, you have to change this test to one that checks whether the 1x1 square transformed by the Matrix ctm = getGraphicsState().getCurrentTransformationMatrix() overlaps the character box transformed by the Matrix matrix = character.getTextPos(), with fixed width and height cw = character.getWidth() and ch = character.getHeight(). Simple overlap may also not suffice; you might want to require that the character box is covered sufficiently.
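The generic test described above could be sketched roughly as follows. This is not PDFBox code: it uses only java.awt.geom, and the names CoverageCheck, coveredFraction and polygonArea are made up for illustration. It assumes the image CTM and the character placement are already available as AffineTransform objects and that the character box is the rectangle [0, cw] x [0, ch] in the space the text transform maps from; how the PDFBox values map onto that is left open here.

```java
import java.awt.Shape;
import java.awt.geom.AffineTransform;
import java.awt.geom.Area;
import java.awt.geom.PathIterator;
import java.awt.geom.Rectangle2D;

// Hypothetical helper, not part of PDFBox.
final class CoverageCheck {

    /** Area of a shape made of straight segments (shoelace formula per subpath). */
    static double polygonArea(Shape shape) {
        PathIterator it = shape.getPathIterator(null);
        double[] c = new double[6];
        double total = 0, sum = 0;
        double startX = 0, startY = 0, prevX = 0, prevY = 0;
        while (!it.isDone()) {
            switch (it.currentSegment(c)) {
                case PathIterator.SEG_MOVETO:
                    startX = prevX = c[0];
                    startY = prevY = c[1];
                    sum = 0;
                    break;
                case PathIterator.SEG_LINETO:
                    sum += prevX * c[1] - c[0] * prevY;
                    prevX = c[0];
                    prevY = c[1];
                    break;
                case PathIterator.SEG_CLOSE:
                    // Close the ring back to its start and add this subpath's area.
                    sum += prevX * startY - startX * prevY;
                    total += Math.abs(sum) / 2;
                    sum = 0;
                    prevX = startX;
                    prevY = startY;
                    break;
                default:
                    // Curves are not expected for transformed rectangles.
                    break;
            }
            it.next();
        }
        return total;
    }

    /**
     * Fraction (0..1) of the transformed character box that lies inside
     * the unit square transformed by the image CTM.
     */
    static double coveredFraction(AffineTransform imageCtm,
                                  AffineTransform textTransform,
                                  double cw, double ch) {
        Shape imageQuad = imageCtm.createTransformedShape(new Rectangle2D.Double(0, 0, 1, 1));
        Shape charQuad = textTransform.createTransformedShape(new Rectangle2D.Double(0, 0, cw, ch));
        double charArea = polygonArea(charQuad);
        if (charArea == 0) {
            return 0; // degenerate character box
        }
        Area intersection = new Area(charQuad);
        intersection.intersect(new Area(imageQuad));
        return Math.min(1.0, polygonArea(intersection) / charArea);
    }

    public static void main(String[] args) {
        // Image unit square rotated by 45 degrees and scaled by 10.
        AffineTransform image = AffineTransform.getRotateInstance(Math.PI / 4);
        image.scale(10, 10);
        // A 1x1 character box placed at (5, 0).
        AffineTransform text = AffineTransform.getTranslateInstance(5, 0);
        System.out.println(coveredFraction(image, text, 1, 1));
    }
}
```

A character would then be dropped when the fraction exceeds some threshold of your choosing, instead of when the two axis-aligned intervals merely touch. In the main method, the character box lies inside the bounding box of the rotated image but outside the image itself, which is the kind of case the simple interval test gets wrong.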
Furthermore, this test ignores image masks, i.e. transparency of the image.