Pragmatically convert PDF images to 8 bit

I have a set of PDFs in normal RGB colour. They would benefit from conversion to 8 bit to reduce file sizes. Are there any APIs or tools that would allow me to do this whilst retaining non-raster elements in the PDF?

Best answer

This is a fun one. Atalasoft dotImage with the PDF Rasterizer and dotPdf can do this (disclaimer: I work for Atalasoft and wrote most of the PDF tools). I'd start off first by finding candidate pages:

    List<int> GetCandidatePages(Stream pdf, string password)
    {
        List<int> retVal = new List<int>();
        using (PageCollection pages = new PageCollection(pdf, password))
        {
            for (int i = 0; i < pages.Count; i++)
            {
                // keep only pages whose sole content is a single image
                if (pages[i].SingleImageOnly())
                    retVal.Add(i);
            }
        }
        pdf.Seek(0, SeekOrigin.Begin); // restore file pointer
        return retVal;
    }

Next, I'd rasterize only those pages, turning them into 8-bit images, but to keep things efficient, I'd use an ImageSource which manages memory well:

    public class SelectPageImageSource : RandomAccessImageSource
    {
        private List<int> _pages;
        private Stream _stm;

        public SelectPageImageSource(Stream stm, List<int> pages)
        {
            _stm = stm;
            _pages = pages;
        }

        protected override ImageSourceNode LowLevelAcquire(int index)
        {
            PdfDecoder decoder = new PdfDecoder();
            _stm.Seek(0, SeekOrigin.Begin);
            // Read is called on the decoder instance, not on the type
            AtalaImage image = decoder.Read(_stm, _pages[index], null);
            // change to 8 bit
            if (image.PixelFormat != PixelFormat.Pixel8bppIndexed)
            {
                AtalaImage changed = image.GetChangedPixelFormat(PixelFormat.Pixel8bppIndexed);
                image.Dispose();
                image = changed;
            }
            return new FileReloader(image, new PngEncoder());
        }

        protected override int LowLevelTotalImages()
        {
            return _pages.Count;
        }
    }

Next you need to create a new PDF from this:

    public void Make8BitImagePdf(Stream pdf, Stream outPdf, List<int> pages)
    {
        PdfEncoder encoder = new PdfEncoder();
        SelectPageImageSource source = new SelectPageImageSource(pdf, pages);
        encoder.Save(outPdf, source, null);
    }

Next you need to replace the original pages with the new ones:

    public void ReplaceOriginalPages(Stream pdf, Stream image8Bit, Stream outPdf, List<int> pages)
    {
        PdfDocument docOrig = new PdfDocument(pdf);
        PdfDocument doc8Bit = new PdfDocument(image8Bit);
        for (int i = 0; i < pages.Count; i++)
        {
            // page i of the 8-bit PDF replaces original page pages[i]
            docOrig.Pages[pages[i]] = doc8Bit.Pages[i];
        }
        docOrig.Save(outPdf); // this is your final
    }

This will do what you want, more or less. The less-than-ideal part is that the image pages have been rasterized, which is probably not what you want. The nice thing about rasterizing is that generating output is easy, but the result might not be at the resolution of the original image. Preserving the original resolution can be done, but it is significantly more work: you need to extract the image from each SingleImageOnly page and then change its pixel format. The problem is that SingleImageOnly does NOT imply that the image fills the entire page, nor that the image is placed in any particular location. In addition to the PixelFormat change (actually, before the change), you would want to apply the matrix that places the image on the page to the image itself, and use PdfEncoder with an appropriate set of margins and the original page size to get the image where it should be. This is all cut-and-dried, but it is a substantial amount of code.
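To give a feel for the margin and resolution bookkeeping mentioned above, here is a minimal sketch in plain C#. It is not Atalasoft API: the class and method names are hypothetical. It relies only on the PDF convention that an image is drawn into the unit square and positioned by a six-number matrix [a b c d e f], with 72 points per inch.

```csharp
using System;

// Hypothetical helper, not part of any Atalasoft library.
public static class ImagePlacement
{
    // Bounding box (in PDF points) of the unit square after applying the
    // placement matrix [a b c d e f], where x' = a*x + c*y + e and
    // y' = b*x + d*y + f.
    public static (double Left, double Bottom, double Width, double Height)
        GetBounds(double a, double b, double c, double d, double e, double f)
    {
        // Images of the corners (0,0), (1,0), (0,1), (1,1).
        double[] xs = { e, a + e, c + e, a + c + e };
        double[] ys = { f, b + f, d + f, b + d + f };

        double left = Math.Min(Math.Min(xs[0], xs[1]), Math.Min(xs[2], xs[3]));
        double right = Math.Max(Math.Max(xs[0], xs[1]), Math.Max(xs[2], xs[3]));
        double bottom = Math.Min(Math.Min(ys[0], ys[1]), Math.Min(ys[2], ys[3]));
        double top = Math.Max(Math.Max(ys[0], ys[1]), Math.Max(ys[2], ys[3]));

        return (left, bottom, right - left, top - bottom);
    }

    // Effective resolution in dots per inch for a run of pixels that spans
    // the given length in points (72 points = 1 inch).
    public static double EffectiveDpi(int pixels, double points)
    {
        return pixels * 72.0 / points;
    }
}
```

For example, a 1500 x 2000 pixel image placed with the matrix [540 0 0 720 36 36] on a 612 x 792 point page occupies a 540 x 720 point box whose lower-left corner is at (36, 36). That leaves 36-point margins on every side and an effective resolution of 200 dpi in both directions. Those are the numbers you would need to reproduce when re-encoding the page.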

There is another approach that might also work using our PDF generation API. It involves opening the document and swapping out the image resources for the document with 8-bit ones. This is also doable, but is not entirely trivial. You would do something like this:

    public void ReplaceImageResources(Stream pdf, Stream outPdf, List<int> pages)
    {
        PdfGeneratedDocument doc = new PdfGeneratedDocument(pdf);
        doc.Resources.Images.Compressors.Insert(0, new AtalaImageCompressor());
        foreach (int page in pages)
        {
            // GetSinglePageImage uses PageCollection, as above, to
            // pull a single image from the page (no need to use the matrix)
            // then converts it to 8 bpp indexed and returns it or null if it
            // is already 8 bpp indexed (or 4bpp or 1bpp).
            using (AtalaImage image = GetSinglePageImage(pdf, page))
            {
                if (image == null) continue;
                foreach (string resName in doc.Pages[page].ImportedImages)
                {
                    doc.Resources.Images.Remove(resName);
                    doc.Resources.Images.Add(resName, image);
                    break;
                }
            }
        }
        doc.Save(outPdf);
    }

As I said, this is tricky - the PDF generation suite was made for making new PDFs from whole cloth or adding new pages to an existing PDF (in the future, we want to add full editing). But PDF manages all of its images as resources within the document and we have the ability to replace those resources entirely. So to make life easier, we add an ImageCompressor to the Image resource collection that handles AtalaImage objects and remove the existing image resources and replace them with the new ones.
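For a rough sense of what the 8 bpp swap buys, here is a back-of-the-envelope sketch in plain C# (hypothetical names, not Atalasoft API). It compares uncompressed sample sizes only. The images inside a real PDF are usually compressed, so actual savings depend on the filter used and on the image content.

```csharp
// Hypothetical helper for estimating raw (uncompressed) image sizes.
public static class RawImageSize
{
    // 24 bpp RGB: three bytes per pixel.
    public static long Rgb24(int width, int height)
    {
        return (long)width * height * 3;
    }

    // 8 bpp indexed: one byte per pixel plus a palette of up to
    // 256 RGB entries at three bytes each.
    public static long Indexed8(int width, int height, int paletteEntries = 256)
    {
        return (long)width * height + (long)paletteEntries * 3;
    }
}
```

A 1500 x 2000 pixel page image is 9,000,000 bytes as raw 24 bpp RGB and 3,000,768 bytes as raw 8 bpp indexed, roughly a threefold reduction before any compression is applied.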

Now I'm going to do something that you probably won't see any vendor do when talking about their own products - I'm going to be critical of them on a number of levels. First, they aren't super cheap. Sorry. You might get sticker shock when you look at the price, but the price includes technical support from a staff that is honestly second to none.

You can probably do a lot of this with iTextSharp, Bit Miracle's Docotic PDF library, or the Tall Components PDF libraries. The latter two also cost money. Bit Miracle's engineers have proven to be pretty helpful and you're likely to see them here (HI!). Maybe they can help you out too. iTextSharp is problematic in that you really need to understand the PDF spec to do the right thing or you're likely to output garbage PDF - I've done this experiment with my own library side-by-side with iTextSharp and found a number of pain points for common tasks that require an in-depth knowledge of the PDF spec to fix. I tried to make decisions in my high-level tools such that you don't need to know the PDF spec, nor do you need to worry about creating bad PDF.

I don't particularly like the fact that there are several apparently different tools in our code base that do similar things. PageCollection is part of our PDF rasterizer for historical reasons. PdfDocument is made strictly for manipulating pages and tries to be lightweight and stingy with memory. PdfGeneratedDocument is made for manipulating/creating page content. PdfDecoder is for generating raster images from existing PDF. PdfEncoder is for generating image-only PDF from images. It can be daunting to have all these apparently overlapping niche tools, but there is a logic to them and their relationship to each other.
