How to check whether text is transparent using PDFBox

Updated: 2024-10-12 01:31:55

Question

I subclassed PDFStreamEngine and overrode processTextPosition. I can now reconstruct the text the way PDFTextStripper does, but I don't want to process transparent text, which is often garbage.

How can I tell whether some text is transparent?

Recommended answer

As it turned out, the transparent text was not actually transparent at all but merely covered by an image: in 201103 Key Smoking Statistic for SA 2010 FINAL.pdf, the text "Key Smoking Statistics for SA --- 2004" is covered by an image showing a TC logo.

The following is a proof of concept of a text stripper class that ignores text covered by images.

// Imports assume the PDFBox 1.8.x package layout.
import java.io.IOException;
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.Map;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.pdfbox.cos.COSBase;
import org.apache.pdfbox.cos.COSName;
import org.apache.pdfbox.cos.COSStream;
import org.apache.pdfbox.pdmodel.PDResources;
import org.apache.pdfbox.pdmodel.graphics.xobject.PDXObject;
import org.apache.pdfbox.pdmodel.graphics.xobject.PDXObjectForm;
import org.apache.pdfbox.pdmodel.graphics.xobject.PDXObjectImage;
import org.apache.pdfbox.util.Matrix;
import org.apache.pdfbox.util.PDFOperator;
import org.apache.pdfbox.util.PDFTextStripper;
import org.apache.pdfbox.util.TextPosition;
import org.apache.pdfbox.util.operator.OperatorProcessor;

public class VisibleTextStripper extends PDFTextStripper
{
    public VisibleTextStripper() throws IOException
    {
        super();
        // Intercept "Do" so that every painted XObject (image or form) can be inspected.
        registerOperatorProcessor("Do", new Invoke());
    }

    //
    // Hiding operations
    //

    // Removes all characters collected so far that the image named 'name' overlaps.
    void hide(String name)
    {
        // An image is painted into the unit square transformed by the current transformation matrix.
        Matrix ctm = getGraphicsState().getCurrentTransformationMatrix();
        float x = ctm.getXPosition();
        float y = ctm.getYPosition();
        float scaledWidth = ctm.getXScale();
        float scaledHeight = ctm.getYScale();

        for (List<TextPosition> characters : charactersByArticle)
        {
            Collection<TextPosition> toRemove = new ArrayList<TextPosition>();
            for (TextPosition character : characters)
            {
                Matrix matrix = character.getTextPos();
                float cx = matrix.getXPosition();
                float cy = matrix.getYPosition();
                float cw = character.getWidth();
                float ch = character.getHeight();
                // The vertical test uses the character height ch (the original passed cw twice).
                if (overlaps(x, scaledWidth, cx, cw) && overlaps(y, scaledHeight, cy, ch))
                {
                    System.out.printf("Hidden by '%s': X: %f; Y: %f; Width: %f; Height: %f; Char: '%s'\n",
                            name, cx, cy, cw, ch, character.getCharacter());
                    toRemove.add(character);
                }
            }
            characters.removeAll(toRemove);
        }
    }

    // Checks whether the ranges [start1, start1 + width1] and [start2, start2 + width2] intersect.
    private boolean overlaps(float start1, float width1, float start2, float width2)
    {
        // Normalize negative widths so that each start is the lower bound.
        if (width1 < 0)
        {
            start1 += width1;
            width1 = -width1;
        }
        if (width2 < 0)
        {
            start2 += width2;
            width2 = -width2;
        }

        if (start1 < start2)
        {
            return start1 + width1 >= start2;
        }
        else
        {
            return start2 + width2 >= start1;
        }
    }

    //
    // operator processors
    //
    public static class Invoke extends OperatorProcessor
    {
        /**
         * Log instance.
         */
        private static final Log LOG = LogFactory.getLog(Invoke.class);

        /**
         * process : Do : Paint the specified XObject (section 4.7).
         * @param operator The operator that is being executed.
         * @param arguments List
         * @throws IOException If there is an error invoking the sub object.
         */
        public void process(PDFOperator operator, List<COSBase> arguments) throws IOException
        {
            VisibleTextStripper drawer = (VisibleTextStripper)context;
            COSName objectName = (COSName)arguments.get( 0 );
            Map<String, PDXObject> xobjects = drawer.getResources().getXObjects();
            PDXObject xobject = (PDXObject)xobjects.get( objectName.getName() );
            if ( xobject == null )
            {
                LOG.warn("Can't find the XObject for '"+objectName.getName()+"'");
            }
            else if( xobject instanceof PDXObjectImage )
            {
                // An image is painted: drop the text it covers.
                drawer.hide(objectName.getName());
            }
            else if(xobject instanceof PDXObjectForm)
            {
                PDXObjectForm form = (PDXObjectForm)xobject;
                COSStream formContentstream = form.getCOSStream();
                // if there is an optional form matrix, we have to map the form space to the user space
                Matrix matrix = form.getMatrix();
                if (matrix != null)
                {
                    Matrix xobjectCTM = matrix.multiply( context.getGraphicsState().getCurrentTransformationMatrix());
                    context.getGraphicsState().setCurrentTransformationMatrix(xobjectCTM);
                }
                // find some optional resources, instead of using the current resources
                PDResources pdResources = form.getResources();
                context.processSubStream( context.getCurrentPage(), pdResources, formContentstream );
            }
        }
    }
}

It works well with your sample document.

The check

if (overlaps(x, scaledWidth, cx, cw) && overlaps(y, scaledHeight, cy, ch))

unfortunately assumes that no rotations are involved (with all transformations aggregated), neither of the text nor of the image.
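To make the axis-aligned assumption concrete, here is a minimal standalone sketch of the same interval logic without any PDFBox types. The class name IntervalOverlap and the helper boxesOverlap are hypothetical names introduced only for this illustration; overlaps mirrors the method from the proof of concept.

```java
final class IntervalOverlap {

    /** Do the ranges [start1, start1 + width1] and [start2, start2 + width2] intersect? */
    static boolean overlaps(float start1, float width1, float start2, float width2) {
        // A negative width means the range extends to the left of (or below) its start.
        if (width1 < 0) { start1 += width1; width1 = -width1; }
        if (width2 < 0) { start2 += width2; width2 = -width2; }
        if (start1 < start2) {
            return start1 + width1 >= start2;
        }
        return start2 + width2 >= start1;
    }

    /**
     * Hypothetical helper: two axis-aligned boxes overlap only if their x ranges
     * and their y ranges both overlap. This is only valid without rotation or skew.
     */
    static boolean boxesOverlap(float x, float y, float w, float h,
                                float cx, float cy, float cw, float ch) {
        return overlaps(x, w, cx, cw) && overlaps(y, h, cy, ch);
    }
}
```

For example, an image occupying x 0..100 and y 0..50 overlaps a 12x10 character box at (90, 40) but not one at (90, 60), since the y ranges no longer meet. Once either matrix contains a rotation, position plus scale no longer describes the painted area, so this test can report wrong results.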

For a generic solution, this test has to be changed to check whether the 1x1 square transformed by Matrix ctm = getGraphicsState().getCurrentTransformationMatrix() overlaps the character box transformed by Matrix matrix = character.getTextPos(), with fixed width and height cw = character.getWidth() and ch = character.getHeight(). Simple overlap may also not suffice; you might want to require that the character box is sufficiently covered.
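One way such a rotation-aware coverage test could look is sketched below with java.awt.geom instead of PDFBox types. It assumes both matrices have already been converted to AffineTransform, and it takes the description above literally: the image is the unit square under the CTM, and the character box is the rectangle of size cw x ch under the text matrix. The class name CoverageCheck and its methods are hypothetical, and the sketch has not been tried against real PDFs.

```java
import java.awt.Shape;
import java.awt.geom.AffineTransform;
import java.awt.geom.Area;
import java.awt.geom.PathIterator;
import java.awt.geom.Rectangle2D;

final class CoverageCheck {

    /** Fraction (0..1) of the transformed character box that lies inside the transformed unit square. */
    static double coveredFraction(AffineTransform imageCtm, AffineTransform textMatrix,
                                  double cw, double ch) {
        Shape image = imageCtm.createTransformedShape(new Rectangle2D.Double(0, 0, 1, 1));
        Shape box = textMatrix.createTransformedShape(new Rectangle2D.Double(0, 0, cw, ch));
        double boxArea = polygonArea(new Area(box));
        if (boxArea == 0) {
            return 0; // degenerate character box, nothing to cover
        }
        Area common = new Area(image);
        common.intersect(new Area(box));
        return polygonArea(common) / boxArea;
    }

    /** Shoelace formula over a straight-edged area; intended for a single convex outline. */
    static double polygonArea(Area area) {
        double sum = 0, startX = 0, startY = 0, prevX = 0, prevY = 0;
        double[] c = new double[6];
        for (PathIterator it = area.getPathIterator(null); !it.isDone(); it.next()) {
            switch (it.currentSegment(c)) {
                case PathIterator.SEG_MOVETO:
                    startX = prevX = c[0];
                    startY = prevY = c[1];
                    break;
                case PathIterator.SEG_LINETO:
                    sum += prevX * c[1] - c[0] * prevY;
                    prevX = c[0];
                    prevY = c[1];
                    break;
                case PathIterator.SEG_CLOSE:
                    // Add the edge back to the start of the outline.
                    sum += prevX * startY - startX * prevY;
                    prevX = startX;
                    prevY = startY;
                    break;
                default:
                    throw new IllegalArgumentException("curved segments are not supported");
            }
        }
        return Math.abs(sum) / 2;
    }
}
```

A caller could then hide a character only when coveredFraction(...) exceeds some threshold such as 0.5; the threshold is a tuning choice, not something the original answer specifies. Like the original test, this sketch still ignores image masks.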

Furthermore, this test ignores image masks, i.e. transparency of the image.

Published: 2023-11-10 14:37:32
Article link: https://www.elefans.com/category/jswz/34/1575648.html