Here is what I have done, but the resulting text comes out garbled. Thanks in advance.
1. Use CGPDFStringCopyTextString to get the text from the PDF.
2. Encode the NSString to char*:

    NSStringEncoding enc = CFStringConvertEncodingToNSStringEncoding(kCFStringEncodingGB_18030_2000);
    const char *char_content = [self.currentData cStringUsingEncoding:enc];

Below is how I get currentData:

    // Callback for the "TJ" operator: appends every string in the operand array.
    void arrayCallback(CGPDFScannerRef inScanner, void *userInfo)
    {
        BIDViewController *pp = (__bridge BIDViewController *)userInfo;
        CGPDFArrayRef array;
        if (!CGPDFScannerPopArray(inScanner, &array)) {
            return; // no array operand on the stack
        }
        size_t count = CGPDFArrayGetCount(array);
        for (size_t n = 0; n < count; n += 1) {
            CGPDFStringRef string;
            // Numeric entries (kerning adjustments) are not strings and are skipped.
            if (CGPDFArrayGetString(array, n, &string)) {
                // __bridge_transfer hands the copied CFString to ARC so it is not leaked.
                NSString *data = (__bridge_transfer NSString *)CGPDFStringCopyTextString(string);
                [pp.currentData appendFormat:@"%@", data];
            }
        }
    }

    - (IBAction)press:(id)sender
    {
        // stringCallback (for "Tj") and pagerf (the CGPDFPageRef) are defined elsewhere.
        table = CGPDFOperatorTableCreate();
        CGPDFOperatorTableSetCallback(table, "TJ", arrayCallback);
        CGPDFOperatorTableSetCallback(table, "Tj", stringCallback);
        self.currentData = [NSMutableString string];
        CGPDFContentStreamRef contentStream = CGPDFContentStreamCreateWithPage(pagerf);
        CGPDFScannerRef scanner = CGPDFScannerCreate(contentStream, table, (__bridge void *)(self));
        bool ret = CGPDFScannerScan(scanner);
        CGPDFScannerRelease(scanner);
        CGPDFContentStreamRelease(contentStream);
    }

Recommended answer
According to the Mac Developer Library, CGPDFStringCopyTextString returns a CFString object that represents a PDF string as a text string. The PDF string is passed in as a CGPDFString, which is a series of bytes (unsigned integer values in the range 0 to 255). The function therefore already decodes those bytes according to some character encoding.
No encoding is passed to it explicitly, so it has to assume one. Most likely that is PDFDocEncoding or the UTF-16BE Unicode character encoding scheme, which are the two encodings that may be used to represent text strings in a PDF document outside the document's content streams; cf. section 7.9.2.2 Text String Type and Table D.1, Annex D in the PDF specification.
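To illustrate the rule from section 7.9.2.2, here is a minimal sketch in plain C of how one might tell the two text string encodings apart. It relies on the fact that a UTF-16BE text string begins with the byte order mark 0xFE 0xFF. The enum and the function name are made up for this example, and this is not a claim about how CGPDFStringCopyTextString is actually implemented.

```c
#include <stddef.h>

/* The two text string encodings named in section 7.9.2.2. */
typedef enum {
    PDF_TEXT_PDFDOCENCODING,
    PDF_TEXT_UTF16BE
} pdf_text_encoding;

/* Hypothetical helper: guess the encoding of a PDF text string from its raw
   bytes. A UTF-16BE text string starts with the byte order mark 0xFE 0xFF;
   anything else is treated as PDFDocEncoding. */
pdf_text_encoding pdf_text_string_encoding(const unsigned char *bytes, size_t length)
{
    if (length >= 2 && bytes[0] == 0xFE && bytes[1] == 0xFF) {
        return PDF_TEXT_UTF16BE;
    }
    return PDF_TEXT_PDFDOCENCODING;
}
```

With Core Graphics, the raw bytes and their count would come from CGPDFStringGetBytePtr and CGPDFStringGetLength. Note that this test is only meaningful for text strings outside content streams.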
You have not told us where your CGPDFString came from. I assume, though, that you received it from inside one of the document's content streams. Strings there, unlike text strings elsewhere in the document, can be encoded with any imaginable encoding. The encoding actually used is determined by the embedded data of the font the string is to be displayed with.
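As a rough illustration of what font-dependent decoding means, the sketch below assumes the current font uses fixed two-byte, big-endian character codes (a common setup for CJK fonts, but by no means the only one) and that a code-to-Unicode table has already been built from the font's data, for instance from its ToUnicode CMap. The struct, the function name, and the table contents are all hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical entry of a per-font lookup table: character code -> Unicode.
   In a real PDF this table would be derived from the font's embedded data. */
typedef struct {
    uint16_t code;
    uint32_t unicode;
} code_mapping;

/* Assumption: every character code is two bytes, big-endian.
   Writes at most out_capacity code points to out and returns how many were
   written. Codes missing from the table become U+FFFD. */
size_t decode_two_byte_string(const unsigned char *bytes, size_t length,
                              const code_mapping *map, size_t map_count,
                              uint32_t *out, size_t out_capacity)
{
    size_t written = 0;
    for (size_t i = 0; i + 1 < length && written < out_capacity; i += 2) {
        uint16_t code = (uint16_t)((bytes[i] << 8) | bytes[i + 1]);
        uint32_t unicode = 0xFFFD; /* replacement character for unmapped codes */
        for (size_t m = 0; m < map_count; m++) {
            if (map[m].code == code) {
                unicode = map[m].unicode;
                break;
            }
        }
        out[written++] = unicode;
    }
    return written;
}
```

The point of the sketch is that the same bytes yield different text under a different table, which is why re-encoding the result of CGPDFStringCopyTextString as GB 18030 cannot repair strings taken from a content stream.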
For more information on this, you may want to read "CGPDFScannerPopString returning strange result" and have a look at PDFKitten.