Oh, thank you! (But I'm afraid that tonight my brain feels like marmalade...)
Would you mind if I asked you for a little extra work? I cannot read Swedish, and I have some difficulty finding where the ligatures are in the text. Could you please tell me where in the original document I could find one or two of them (page number and paragraph number, or paragraph contents)?
Regarding hex - dec - UTF-8 conversions: I meant the CodePointToUtf8 function to be called from everywhere, whenever the PostScript-like instructions want to output a character. There is no other conversion in other parts of the code.
However, PostScript-like instructions may take several forms to specify the text to be output. For example:
    (hello world) Tj
will simply output "hello world". Such a text is enclosed within parentheses, and "Tj" is an instruction that says "output the text that has previously been put on the stack" at the current text position (which itself depends on lots of things; moving to the beginning of the next "virtual" line is done by other operators).
In the above example, we assume that the current font (the one that is to be used to draw text) either has an associated character map (translation table) which handles single-byte characters, or uses one of the 4 predefined Adobe encodings (WinAnsi, Mac Roman and I don't remember what). So the following:
    (ABCD) Tj
will output "ABCD", as you may have guessed.
But imagine now that the current font has a Unicode character map with a width of 2 bytes (i.e., each code you specify in the text to be output must be 2 bytes long, and each 2-byte code is associated with a Unicode character). In such a situation, "(ABCD)" will be interpreted differently: since you need two characters to get the index of the resulting character in the character map, the character map will be looked up using these indexes:
- 0x4142 (ASCII codes for "A" and "B")
- 0x4344 ("C" and "D")
Of course, the output will depend on which characters are associated with entries #4142 and #4344 in the character map. But in this case, the output will consist of only 2 characters, not 4.
Additionally, you can find another notation:
    <01020304>
Here, indexes into character maps are specified as hex values. Again, they can be 1 or 2 bytes long each, depending on the width of the character map associated with the current font.
The above tedious explanation had only one goal: to explain why there is a CodePointToUtf8 function. It is called whenever a character is to be output. If the current font has no character map, then the ordinal ASCII value of the character is passed to this function. If there is a character map, then the Unicode value that is mapped to the input character id (the one specified in the text to be drawn) is passed to this function. In fact, whether the current font has an associated character map or not, the CodePointToUtf8 function is guaranteed to receive a Unicode codepoint.
Not sure I was clear, but it's a little bit late now!
With kind regards,
Christian.
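The explanation above can be sketched in a few lines of PHP. This is only an illustration under stated assumptions: the function names (`splitCodes`, `codePointToUtf8`, `decodeText`) and the sample character map are hypothetical and are not the actual code of the PdfToText class.

```php
<?php
// Hypothetical sketch: all names and the sample map below are assumptions.

// Split a raw string operand into integer codes of $width bytes each.
// A trailing chunk shorter than $width is simply read as a smaller number.
function splitCodes(string $raw, int $width): array
{
    if ($raw === '') {
        return [];
    }
    $codes = [];
    foreach (str_split($raw, $width) as $chunk) {
        $code = 0;
        foreach (str_split($chunk) as $byte) {
            $code = ($code << 8) | ord($byte);
        }
        $codes[] = $code;
    }
    return $codes;
}

// Encode one Unicode code point as a UTF-8 byte sequence.
function codePointToUtf8(int $cp): string
{
    if ($cp < 0x80) {
        return chr($cp);
    }
    if ($cp < 0x800) {
        return chr(0xC0 | ($cp >> 6)) . chr(0x80 | ($cp & 0x3F));
    }
    if ($cp < 0x10000) {
        return chr(0xE0 | ($cp >> 12))
             . chr(0x80 | (($cp >> 6) & 0x3F))
             . chr(0x80 | ($cp & 0x3F));
    }
    return chr(0xF0 | ($cp >> 18))
         . chr(0x80 | (($cp >> 12) & 0x3F))
         . chr(0x80 | (($cp >> 6) & 0x3F))
         . chr(0x80 | ($cp & 0x3F));
}

// Decode a string operand. Without a character map the code itself is used
// as the code point; with a map, the code is looked up first (U+FFFD if absent).
function decodeText(string $raw, int $width = 1, ?array $map = null): string
{
    $out = '';
    foreach (splitCodes($raw, $width) as $code) {
        $cp = ($map === null) ? $code : ($map[$code] ?? 0xFFFD);
        $out .= codePointToUtf8($cp);
    }
    return $out;
}

// "(ABCD) Tj" with a single-byte font: four characters.
echo decodeText('ABCD'), "\n";

// Same operand with a made-up 2-byte map: only two characters come out.
$map = [0x4142 => 0x00E5, 0x4344 => 0xFB02];
echo decodeText('ABCD', 2, $map), "\n";

// Hex notation <01020304>: the same splitting applies to the decoded bytes.
print_r(splitCodes(hex2bin('01020304'), 2)); // codes 0x0102 and 0x0304
```

Whatever the notation or the width, the last step is always the same single conversion from a code point to UTF-8, which matches the role described for CodePointToUtf8.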
Hello Christian,
I think I finally found a simple modification of the class that is a good solution and works as well as the original class with my older PHP version. The output handles the ligatures that I had issues with, and the output character encoding is UTF-8.
Replace line 141:
    $entity = '&#x' . sprintf ( '%x', ( $code & 0xFFFF ) ) . ';' ;
with:
    $entity = '&#' . sprintf ( "%d", ( $code & 0xFFFF ) ) . ';' ;
Previously I had only used
    $entity = '&#' . $code . ';' ;
but this gives garbage characters when there are ligatures in the text.
Best regards,
Aryan
Thanks for the feedback; I do indeed recall that for PHP versions < 5.2.11 there were some problems with HTML entities expressed in hex notation.
Hello Christian,
Yes, you are right. I think we discussed this in messages 20 and 21 of this thread. What I really meant to write is that I (finally) discovered that ($code & 0xFFFF) is essential in function CodePointToUtf8 (at line 141), where the conversion from $code to HTML entities is done. Without '& 0xFFFF' the ligatures in the PDF file are not converted correctly. For some reason that I do not quite understand, for the lowercase l character in an fl ligature, for example, $code = 6684780 instead of 108.
    $code = 6684780;
    $entity = '&#' . $code . ';' ;
$entity is now &#6684780; (which displays as a garbage character) and is not correct, but this
    $entity = '&#' . sprintf ( "%d", ( $code & 0xFFFF ) ) . ';' ;
or a seemingly faster alternative
    $entity = '&#' . ($code & 65535) . ';' ;
fixed this error: $entity is now &#108; (= l) and correct :-)
As a side note, don't you think that replacing (line 141)
    $entity = '&#x' . sprintf ( '%x', ( $code & 0xFFFF ) ) . ';' ;
with
    $entity = '&#x' . ($code & 65535) . ';' ;
should be slightly faster? As you pointed out earlier, the first alternative adds the extra expense of a function call for the dec to hex conversion.
Best regards,
Aryan
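A quick arithmetic check may shed some light on the value 6684780. The sketch below only splits the number into its upper and lower 16 bits. Reading those halves as two separate 16-bit characters packed into one integer is an assumption on my part; nothing in the thread confirms that this is how the class or the PDF's character map produced the value.

```php
<?php
$code = 6684780;

// Show the value in hexadecimal.
printf("0x%X\n", $code);            // prints 0x66006C

// Split it into two 16-bit halves.
$high = ($code >> 16) & 0xFFFF;     // 0x0066 = 102
$low  = $code & 0xFFFF;             // 0x006C = 108

// Interpreted as ASCII, the halves are the two letters of the ligature.
echo chr($high), ' ', chr($low), "\n";   // prints: f l
```

Under that reading, `$code & 0xFFFF` keeps the lower half (the "l") and discards the upper half (the "f"); whether the "f" is produced elsewhere is something the thread does not settle.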
Well, there are still some mysteries, which seem to come from the 4th dimension, when it comes to ligatures! In my current version (for PHP 5.6) they are returned as distinct characters (for example, "fl" is returned as the characters "f" and "l").
Of course, the above remark does not concern the PDF sample you sent me; I still have this good old issue to solve so that it displays the correct characters, and this is completely unrelated to the way ligatures are translated (see above reply number... well, I can't remember!).
Regarding avoiding calls to sprintf(), you're definitely right: it has a slight overhead compared to the second method, because it calls a function. However, in the second version, don't forget to remove the 'x' in:
    $entity = '&#x' . ($code & 65535) . ';' ;
With kind regards,
Christian.
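To make the point about the stray 'x' concrete, here is a small sketch comparing the entity forms discussed so far. The helper `decodeEntity` is a hypothetical name introduced for this example, and `html_entity_decode` is used only to show what each entity decodes to; it is not claimed to be what the class does internally.

```php
<?php
// Hypothetical helper: decode an HTML entity to a UTF-8 string.
function decodeEntity(string $entity): string
{
    return html_entity_decode($entity, ENT_QUOTES | ENT_HTML5, 'UTF-8');
}

$code = 108; // lowercase "l"

$viaSprintf = '&#x' . sprintf('%x', $code & 0xFFFF) . ';'; // hex entity:     &#x6c;
$viaDecimal = '&#'  . ($code & 65535) . ';';               // decimal entity: &#108;
$mixedUp    = '&#x' . ($code & 65535) . ';';               // decimal digits after "x": &#x108;

echo decodeEntity($viaSprintf), "\n";        // l
echo decodeEntity($viaDecimal), "\n";        // l
echo bin2hex(decodeEntity($mixedUp)), "\n";  // c488, i.e. U+0108, not "l"
```

The third form is still a syntactically valid entity, so nothing fails loudly; it just decodes to a different character.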
Hi Christian,
Thank you for your reply. Are you really sure that PHP causes the ligatures to be presented as distinct characters, and that the reason we see separate characters is not the way they are stored in the PDF file? I did a simple test:
    $entity = '&#' . 64258 . ';' ; // fl ligature
    echo $entity . 'era '; // flera ("several" in Swedish)
    $entity = mb_convert_encoding ( $entity, 'UTF-8', 'HTML-ENTITIES' );
    echo $entity . 'ickor'; // flickor ("girls" in Swedish)
In my browser the output was as expected: "flera flickor" with two real ligatures, not distinct 'f' and 'l' characters. I saved the output to a PDF file (print to PDF in Mac OS X) and the PDF file still contained the fl ligatures. Then I converted the PDF file back to text with your class, and the ligatures were still intact as one character, with no separation. I also created a similar PDF file directly from LibreOffice (not using the built-in Mac OS X print-to-PDF function) and repeated the test; that too kept the ligatures intact when I converted the PDF back to text.
Thank you for pointing out the typo in the second version (of line 141) in my previous post; of course it should have been without the x for decimal entities:
    $entity = '&#' . ($code & 65535) . ';' ;
Otherwise, one could also convert to hex with or without sprintf, but there might be a performance penalty for doing that:
    $entity = '&#x' . dechex($code & 65535) . ';' ;
Kind regards,
Aryan
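For anyone who wants to repeat the ligature test from the command line rather than in a browser, a minimal sketch follows. It uses `html_entity_decode` instead of `mb_convert_encoding` so that it does not depend on the mbstring extension; that substitution is my choice for the example, not something taken from the class.

```php
<?php
// Decimal and hex entities for code point 64258 (U+FB02, the "fl" ligature).
$decEntity = '&#'  . (64258 & 65535) . ';';          // &#64258;
$hexEntity = '&#x' . dechex(64258 & 65535) . ';';    // &#xfb02;

$fromDec = html_entity_decode($decEntity, ENT_QUOTES | ENT_HTML5, 'UTF-8');
$fromHex = html_entity_decode($hexEntity, ENT_QUOTES | ENT_HTML5, 'UTF-8');

// Both forms decode to the same single character, three bytes long in UTF-8.
echo bin2hex($fromDec), "\n";        // efac82
var_dump($fromDec === $fromHex);     // bool(true)

// The two Swedish words from the test, each starting with the ligature.
echo $fromDec . 'era ' . $fromDec . 'ickor', "\n";
```

The byte dump makes it easy to see that the ligature is one code point rather than a separate "f" and "l".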
Hello Aryan,
Well, to tell the truth, I'm not sure of anything any more! I think I'm mixing up results from various intermediate versions and various sample files I've received, so I feel a little bit confused. As time goes by, I think it will become clearer in my mind, after fixing a few existing issues (and, especially, the BIG one that makes me ignore font references specified at the page level...). However, I'm really happy to read that my class left the ligatures intact!
You're right: as with sprintf(), using dechex() will have a performance penalty (although a smaller one than for sprintf(), since it requires less code and far less interpretation).
Christian.
Hi Christian,
Thanks. I wish I knew how I could help you with the BIG font issue. Could it be helpful to look at how poppler does this? The poppler-utils package seems useful if you want to check which fonts are in the PDF file; see http://stackoverflow.com/questions/614619/how-to-find-out-which-fonts-are-referenced-and-which-are-embedded-in-a-pdf-docum
With kind regards,
Aryan
Hi Aryan,
I thank you very much for your help! Actually, I'm using the poppler tools for comparison when someone sends me a weird result obtained with my class. And I could even say that sometimes my class performs better and faster - but the poppler tools are not designed for simple text extraction; they are based on a C++ framework and are designed for a much more complex task: analyzing the contents of any PDF document and allowing you to deal with it, including creating documents.
In fact, at the time I developed the first versions of the PdfToText class, I was not aware that font references could be defined at the page level; I thought they were defined at the document level, so I was surprised to discover this fact this summer (and, guess what? It was a document written in German that was using ligatures such as "ffi" or "ffl").
So the big issue now is not understanding how it works! The big issue is in fact an internal design issue of my class that I'm trying to solve progressively. Along with the last releases I have published since then, I added some kind of "provisional" code to prepare the ground, verifying that it did not break existing code and that the samples that were sent to me were still being correctly interpreted.
The next step for me is to take the "big jump": fix my design issue without altering the results that I get with my database of PDF samples... Well, to tell the truth, I'm procrastinating a little bit here...
Hi Christian,
I'm happy that, thanks to your help, I got it to work on my old PHP version as well as on any other PHP version. I just hope that I didn't distract you too much with my questions.
With kind regards,
Aryan
PS: if you need a PDF that hardly outputs any correct text at all, then I have one for you! I also tried opening it with LibreOffice and I see the same junk text where the correct text should be, so it probably is not an easy file… The PDF file info tells me: Adobe PDF Library 9.9, Adobe InDesign CS5 (7.0). With the terminal command
    strings horrorfile.pdf | grep FontName
I get:
    /FontBBox [-198 -246 1108 984] /FontName /RVBQFG+HelveticaNeueLTStd-Roman
    /FontBBox [-32 -43 684 742] /FontName /TQKEDN+MyriadPro-Regular /ItalicAngle
    /FontBBox [-189 -282 1158 984] /FontName /PGFHDR+MyriadPro-Regular /ItalicAngle
    /FontBBox [-198 -250 1110 1007] /FontName /FSFPLE+HelveticaNeueLTStd-Bd /ItalicAngle
    /FontBBox [-32 -43 684 742] /FontName /TQKEDN+MyriadPro-Regular /ItalicAngle
    /FontBBox [-32 -43 684 742] /FontName /YQPBCP+MyriadPro-Regular /ItalicAngle
    /FontBBox [-196 -244 1032 964] /FontName /UGLSAW+HelveticaNeueLTStd-Cn /ItalicAngle
    /FontBBox [-189 -282 1158 984] /FontName /SBKXJW+MyriadPro-Regular /ItalicAngle
    /FontBBox [-197 -264 1133 985] /FontName /MVJURV+HelveticaNeueLTStd-Blk /ItalicAngle
    /FontBBox [-32 -43 684 742] /FontName /ETVFOH+MyriadPro-Regular /ItalicAngle
    /FontBBox [-198 -246 1108 984] /FontName /IJQWHK+HelveticaNeueLTStd-Roman
    /FontBBox [-197 -253 1098 984] /FontName /GUFIQT+HelveticaNeueLTStd-Md /ItalicAngle
    /FontBBox [-189 -282 1158 984] /FontName /RDHEYN+MyriadPro-Regular /ItalicAngle
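The six uppercase letters followed by "+" in front of each font name are the tag that PDF uses for embedded font subsets, so the same base font can appear several times under different tags. As a rough sketch, the listing can be reduced to its distinct base fonts as follows. The function name `baseFontNames` is hypothetical, and only a few lines of the output above are copied in as sample data.

```php
<?php
// A few lines copied from the "strings ... | grep FontName" output above.
$output = <<<'TXT'
/FontBBox [-198 -246 1108 984] /FontName /RVBQFG+HelveticaNeueLTStd-Roman
/FontBBox [-32 -43 684 742] /FontName /TQKEDN+MyriadPro-Regular /ItalicAngle
/FontBBox [-198 -250 1110 1007] /FontName /FSFPLE+HelveticaNeueLTStd-Bd /ItalicAngle
/FontBBox [-189 -282 1158 984] /FontName /PGFHDR+MyriadPro-Regular /ItalicAngle
TXT;

// Hypothetical helper: collect the distinct font names after /FontName,
// dropping an optional subset tag (six uppercase letters and a plus sign).
function baseFontNames(string $text): array
{
    preg_match_all('~/FontName\s*/(?:[A-Z]{6}\+)?([^\s/\[\]<>()]+)~', $text, $matches);
    $names = array_unique($matches[1]);
    sort($names);   // sort() also reindexes the array from 0
    return $names;
}

print_r(baseFontNames($output));
// HelveticaNeueLTStd-Bd, HelveticaNeueLTStd-Roman, MyriadPro-Regular
```

This only tidies up the listing; it says nothing about why the text of this particular file comes out as junk.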
