PyPDF2.utils.PdfReadError: Unexpected destination '__WKANCHOR_2'


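Before working through this article, readers with no Python background may want to look at my earlier blog posts, which are also notes from my own learning. If you are interested, see http://www.flybi/blog/seng.

I wrote this article because, while studying Web Scraping with Python (《Python网络数据采集》), I came across the book's method for reading PDFs and wondered whether I could convert my Tianshan (天善) blog posts into a PDF document. That would be useful for looking things up later.

Approach: the command-line tool wkhtmltopdf converts HTML to PDF, and the Python package python-pdfkit makes it easy to drive wkhtmltopdf from Python. However, wkhtmltopdf currently handles only a limited number of pages per run, so a large blog produces several PDFs that then have to be merged. I tried many ways of merging and finally settled on ghostscript.

The details follow.

1 wkhtmltopdf overview

Download page: http://wkhtmltopdf/downloads.html

I used the 0.12.3 Linux build; after downloading, simply extract it.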

tar -xvf wkhtmltox-0.13.0-alpha-7b36694_linux-centos6-amd64.rpm.part
sudo ln -s "$(pwd)/wkhtmltox/bin/wkhtmltopdf" /usr/bin/wkhtmltopdf

Example usage:

wkhtmltopdf  'http://www.flybi/blog/seng/3645' 'http://www.flybi/blog/seng/3599'  sengblog.pdf

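Note: the Tianshan site uses JavaScript, which seems to interfere with the conversion: only one page was generated. wkhtmltopdf has to wait for the JavaScript to finish, which the option --javascript-delay 2000 takes care of. Alternatively, JavaScript can be turned off with --disable-javascript.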

wkhtmltopdf  --javascript-delay 2000 'http://www.flybi/blog/seng/3645' 'http://www.flybi/blog/seng/3599'  sengblog.pdf
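The same invocation can be assembled from Python before handing it to a subprocess. This is a minimal sketch, assuming wkhtmltopdf is on the PATH; the helper name build_wkhtmltopdf_cmd and its parameters are hypothetical, while the two flags are the ones described above.

```python
def build_wkhtmltopdf_cmd(urls, output, js_delay_ms=2000, disable_js=False):
    """Return the argument list for one wkhtmltopdf run (hypothetical helper)."""
    cmd = ["wkhtmltopdf"]
    if disable_js:
        # Skip JavaScript entirely instead of waiting for it
        cmd.append("--disable-javascript")
    else:
        # Give the page's JavaScript time to finish before rendering
        cmd += ["--javascript-delay", str(js_delay_ms)]
    return cmd + list(urls) + [output]


cmd = build_wkhtmltopdf_cmd(
    ["http://www.flybi/blog/seng/3645", "http://www.flybi/blog/seng/3599"],
    "sengblog.pdf",
)
print(" ".join(cmd))

# To actually run it (requires wkhtmltopdf to be installed):
#   import subprocess
#   subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems with URLs that contain special characters.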

If something goes wrong, dump the outline and inspect it:

wkhtmltopdf  --dump-outline out.xsl  toc 'http://www.flybi/blog/seng/3645' 'http://www.flybi/blog/seng/3599'  sengblog.pdf

2 Calling wkhtmltopdf through pdfkit

Reference: https://pypi.python/pypi/pdfkit

Install python-pdfkit:
 $ pip install pdfkit
 A simple example:
 import pdfkit
 pdfkit.from_url([ 'http://www.flybi/blog/seng/3645','http://www.flybi/blog/seng/3599'], 'sengblog.pdf')
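The JavaScript delay from the previous section can also be passed through pdfkit. A minimal sketch, assuming the options argument of pdfkit.from_url, whose keys are wkhtmltopdf flag names without the leading dashes; the helper names pdfkit_options and render are hypothetical.

```python
def pdfkit_options(js_delay_ms=2000):
    """Map the wkhtmltopdf flag --javascript-delay to a pdfkit options dict."""
    # pdfkit option keys are the wkhtmltopdf flag names without the leading dashes.
    return {"javascript-delay": str(js_delay_ms)}


def render(urls, output, js_delay_ms=2000):
    """Render the given URLs into one PDF (hypothetical wrapper)."""
    import pdfkit  # imported here so pdfkit_options() works without pdfkit installed
    return pdfkit.from_url(list(urls), output, options=pdfkit_options(js_delay_ms))


print(pdfkit_options())
```

Calling render(['http://www.flybi/blog/seng/3645'], 'sengblog.pdf') would then behave like the command-line example with --javascript-delay 2000.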

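3 Generating the blog PDF files

I will use my own blog as the example.

My home page:

http://www.flybi/people/seng

My blog has only 5 listing pages, so it is enough to fetch the links from these: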

http://www.flybi/blog/id-seng__page-1
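Collecting the post links from those listing pages might look like the sketch below. It assumes that the listing pages link to posts with hrefs of the form /blog/seng/<id>, as in the example URLs earlier; the regular expression, the helper names and the sample HTML are illustrative only, and downloading the pages is left out.

```python
import re

LISTING = "http://www.flybi/blog/id-seng__page-{}"
# Assumed link format: relative or absolute hrefs ending in /blog/seng/<digits>
POST_RE = re.compile(r'href="(?:http://www\.flybi)?(/blog/seng/\d+)"')


def listing_urls(pages=5):
    """URLs of the listing pages 1..pages."""
    return [LISTING.format(n) for n in range(1, pages + 1)]


def extract_post_urls(html):
    """Return the unique post URLs found in one listing page, in page order."""
    found = []
    for path in POST_RE.findall(html):
        url = "http://www.flybi" + path
        if url not in found:
            found.append(url)
    return found


sample = (
    '<a href="/blog/seng/3645">a</a> '
    '<a href="http://www.flybi/blog/seng/3599">b</a> '
    '<a href="/blog/seng/3645">duplicate</a>'
)
print(listing_urls()[0])
print(extract_post_urls(sample))
```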

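The core logic is as follows:

(1) Collect the URLs of all individual blog posts.

(2) Sort them. I took a shortcut for now and used the default string ordering.

(3) Generate the PDF documents. In my tests a single run could only include a limited number of pages, so I currently use 20 per file, named in the style sengblog20160419_1_20.

(4) Merge the PDF documents.

There is no ready-made code for the merge step yet. At first I merged by hand with http://www.pdfmerge/; later I found a better way, described in detail in the next section.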
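Steps (2) and (3) can be sketched in a few lines: sort the URLs as plain strings, cut them into groups of 20, and derive a file name in the sengblog20160419_1_20 style for each group. The function name chunk_jobs and its parameters are hypothetical, and the example URLs are made up.

```python
from datetime import date


def chunk_jobs(urls, prefix="sengblog", size=20, day=None):
    """Sort URLs as plain strings and split them into (filename, urls) jobs."""
    stamp = (day or date.today()).strftime("%Y%m%d")
    ordered = sorted(urls)
    jobs = []
    for start in range(0, len(ordered), size):
        batch = ordered[start:start + size]
        # File name records the 1-based range of posts it contains
        name = "{}{}_{}_{}.pdf".format(prefix, stamp, start + 1, start + len(batch))
        jobs.append((name, batch))
    return jobs


urls = ["http://www.flybi/blog/seng/{}".format(n) for n in range(3600, 3645)]
jobs = chunk_jobs(urls, day=date(2016, 4, 19))
for name, batch in jobs:
    print(name, len(batch))
```

Each (filename, urls) pair can then be passed to wkhtmltopdf or pdfkit as one run.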

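4 Merging the PDF documents

Because of the wkhtmltopdf limit, a large blog has to be rendered into several PDF files. I originally merged them with the online service http://www.pdfmerge/, but that was far from ideal and became tedious with more files. After looking at several tools I found a solution. I tested PyPDF2, pdftk and ghostscript; all three can merge, but PyPDF2 and pdftk have trouble merging the outlines of documents generated by wkhtmltopdf, and PyPDF2 in particular needs a code change before it can merge at all. So in the end I chose ghostscript.

Tool versions used:

ghostscript: 9.19

wkhtmltopdf: 0.12.3 (with patched qt)

Installing and using ghostscript:

Download:

http://ghostscript/download/gsdnld.html

Documentation:

http://www.ghostscript/doc/9.19/Use.htm

Example commands: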

gs-919-linux_x86_64 -dBATCH -dNOPAUSE -q -sDEVICE=pdfwrite -sOutputFile=./output/all_coffee.pdf coffee1_20.pdf coffee21_40.pdf
gs-919-linux_x86_64 -dBATCH -dNOPAUSE -q -sDEVICE=pdfwrite -sOutputFile=all_seng.pdf seng*.pdf
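One caveat with the wildcard form: the shell expands seng*.pdf in plain string order, so once there are more than five chunks a name such as seng_101_120.pdf sorts before seng_21_40.pdf. The sketch below builds the same ghostscript command with the inputs in numeric order. The helper names and the sample file names are hypothetical; the executable name and flags are the ones from the commands above.

```python
import re


def natural_key(name):
    """Split out digit runs so that '21_40' sorts before '101_120'."""
    return [int(part) if part.isdigit() else part
            for part in re.split(r"(\d+)", name)]


def build_gs_merge_cmd(inputs, output, gs="gs-919-linux_x86_64"):
    """Return the ghostscript argument list that merges inputs into output."""
    return [gs, "-dBATCH", "-dNOPAUSE", "-q", "-sDEVICE=pdfwrite",
            "-sOutputFile=" + output] + sorted(inputs, key=natural_key)


files = ["seng_101_120.pdf", "seng_1_20.pdf", "seng_21_40.pdf"]
gs_cmd = build_gs_merge_cmd(files, "all_seng.pdf")
print(" ".join(gs_cmd))

# To actually run it (requires ghostscript):
#   import subprocess
#   subprocess.run(gs_cmd, check=True)
```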

Installing and using PyPDF2:

PyPDF2 version: 1.25.1

https://pypi.python/pypi/PyPDF2/1.25.1

or

https://github/mstamy2/PyPDF2

Install:

pip install PyPDF2

Example usage:

from PyPDF2 import PdfFileMerger

merger = PdfFileMerger()
# Keep the inputs open until the merged file has been written
with open("hql_1_20.pdf", "rb") as input1, open("hql_21_40.pdf", "rb") as input2:
    merger.append(input1)
    merger.append(input2)
    # Write to an output PDF document
    with open("hql_all.pdf", "wb") as output:
        merger.write(output)

--------------------

Note that merging files generated by wkhtmltopdf fails with:

PdfReadError: Unexpected destination '/__WKANCHOR_2'

Fix for the PyPDF2 outline merge problem

Reference: https://github/mstamy2/PyPDF2/issues/193

I had the same problem. For a quick and dirty fix I commented out the lines 1225 and 1226 in the file pdf.py of the package PyPDF2 which raise the exception:

    # if destination found, then create outline
    if dest:
        if isinstance(dest, ArrayObject):
            outline = self._buildDestination(title, dest)
        elif isinstance(dest, Str) and dest in self._namedDests:
            outline = self._namedDests[dest]
            outline[NameObject("/Title")] = title
        # else:
        #     raise utils.PdfReadError("Unexpected destination %r" % dest)
    return outline

That is, comment out lines 1225-1226 of pdf.py as follows:
# if destination found, then create outline
if dest:
    if isinstance(dest, ArrayObject):
        outline = self._buildDestination(title, dest)
    elif isinstance(dest, Str) and dest in self._namedDests:
        outline = self._namedDests[dest]
        outline[NameObject("/Title")] = title
    #### else:
    ####     raise utils.PdfReadError("Unexpected destination %r" % dest)
return outline
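A less invasive alternative that leaves the installed package untouched is to retry a failing file without importing its bookmarks. This is only a sketch: it assumes the installed PyPDF2 version accepts the import_bookmarks keyword on PdfFileMerger.append, it is not the fix from the issue above, and the merged file loses the outline of every input that takes the fallback path. The helper name append_with_fallback is hypothetical.

```python
def append_with_fallback(merger, fileobj, error_type):
    """Append fileobj to merger; if error_type is raised, retry without bookmarks.

    Returns True when bookmarks were imported, False when the fallback ran.
    """
    try:
        merger.append(fileobj)
        return True
    except error_type:
        fileobj.seek(0)  # rewind before the second attempt
        merger.append(fileobj, import_bookmarks=False)
        return False


# Intended use (not run here; needs PyPDF2 and the input files):
#   from PyPDF2 import PdfFileMerger
#   from PyPDF2.utils import PdfReadError
#   merger = PdfFileMerger()
#   with open("hql_1_20.pdf", "rb") as f:
#       append_with_fallback(merger, f, PdfReadError)
#       with open("hql_all.pdf", "wb") as out:
#           merger.write(out)
```

If the outline matters, ghostscript remains the better choice.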

Installing and using pdftk:

Install:

sudo yum install libgcj
sudo rpm -i pdftk-2.02-1.*.rpm

Usage:

pdftk a1*.pdf cat output combined.pdf

Repairing the outline by hand:

pdftk supports editing the outline; see this question for reference:

http://stackoverflow/questions/296****79/merge-pdfs-with-pdftk-with-bookmarks

pdftk hql_1_20.pdf dump_data > in1.info

# edit in1.info by hand

pdftk hql_1_20.pdf update_info in1.info output out.pdf
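Part of the manual edit can be scripted. When the bookmarks of a later file are reused in a merged document, their page numbers have to be shifted by the number of pages that come before that file. The sketch below assumes the dump_data format of pdftk 2.02, where each bookmark carries a line of the form "BookmarkPageNumber: N", and assumes that a page number of 0 means the bookmark has no target page; the function name and the sample text are hypothetical.

```python
def shift_bookmarks(info_text, offset):
    """Add offset to every positive 'BookmarkPageNumber: N' in a dump_data file."""
    out = []
    for line in info_text.splitlines():
        key, sep, value = line.partition(": ")
        if key == "BookmarkPageNumber" and value.strip().isdigit():
            page = int(value)
            # 0 is left alone (assumed to mean "no target page")
            if page > 0:
                line = "{}{}{}".format(key, sep, page + offset)
        out.append(line)
    return "\n".join(out) + "\n"


sample_info = (
    "BookmarkBegin\n"
    "BookmarkTitle: Post A\n"
    "BookmarkLevel: 1\n"
    "BookmarkPageNumber: 3\n"
)
print(shift_bookmarks(sample_info, 40), end="")
```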

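The program still has some shortcomings:

1. UTF-8 file names work on Linux but come out garbled on Windows.

2. When wkhtmltopdf gets a 503 error while fetching the HTML, the page needs to be checked and fetched again.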
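For the second point, a small retry wrapper is one possible starting point. It is a generic sketch, not part of the original program: the name retry and its parameters are hypothetical, and which exception signals a 503 depends on how the page is rendered, so the exception types are left to the caller.

```python
import time


def retry(func, attempts=3, delay_s=5.0, retry_on=(Exception,), sleep=time.sleep):
    """Call func() until it succeeds or attempts run out; re-raise the last error."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except retry_on:
            if attempt == attempts:
                raise
            sleep(delay_s)  # wait before trying again


# Example with a stand-in function that fails twice before succeeding
calls = []


def flaky():
    calls.append(1)
    if len(calls) < 3:
        raise IOError("503 Service Unavailable")
    return "ok"


result = retry(flaky, retry_on=(IOError,), sleep=lambda seconds: None)
print(result, len(calls))
```

Wrapping the rendering of one chunk in retry(lambda: ...) would re-run only that chunk rather than the whole blog.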


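Copyright notice: this article, titled "PyPDF2.utils.PdfReadError: Unexpected destination '__WKANCHOR_2'", was contributed by a community user and reflects only the author's views. To repost it, contact the author and cite the source: https://www.elefans.com/xitong/1729215710a1190407.html. This site only provides storage for the content, does not own it, and assumes no legal liability for it. Content found to involve plagiarism, infringement or illegal material will be removed once verified.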
