Thank you for all the help and advice. I downloaded the zip package today, thinking it contained the latest version, but it is version 1.2.49.
I then downloaded 1.2.50 and updated my code with the changed lines, but the output did not change. I'll email both versions of the modified PdfToText source code to you if you want to have a look.
Could you please also attach the PDF file you are testing?
Hello Aryan,
I have received your email containing your modified version 1.2.49. I tested it as is with my current PHP version, 5.6.25, and everything was OK. So I did further tests with older versions. I was not able to find any distribution of PHP 5.2.4 for my Windows system, but I found several versions from 5.2.6 to 5.2.17.

STARTING from version 5.2.11, the output is OK. BEFORE version 5.2.11, I get the same output as you.

Note that during my tests I used exactly the same php.ini file (the one supplied with PHP 5.2.17, in which I enabled the mbstring extension), so this is not a question of different settings across versions. The garbage characters in your output are apparently due to a change in the mbstring extension between versions 5.2.10 and 5.2.11, but I cannot tell you which change (the news.txt file bundled with the distribution does not seem to describe it). What I can say is that it is not OS-specific: I am using Windows, you are using a Mac, and we have the same results for versions < 5.2.11. It is definitely a version-specific issue.

The only part of my code that uses an mbstring function is the CodePointToUtf8() function, which calls mb_convert_encoding() around line 148:

    mb_convert_encoding ( $entity, 'UTF-8', 'HTML-ENTITIES' )

This is the central point where I convert a Unicode code point to UTF-8 (and this is the only mb_xxx function I call). It seems that before PHP 5.2.11 such conversions behaved a little differently, but honestly I cannot tell you what happened between versions 5.2.10 and 5.2.11. Maybe a default value changed internally in the mbstring extension? Or maybe the handling of conversions, especially those related to UTF-8, was completely reworked? I don't know. What I do know is that the mbstring extension bundled with PHP 5.2.11 works correctly.
I tried loading the mbstring.dll extension (mbstring.so for Unix users) of version 5.2.11 into version 5.2.9: it fails to load. I'm afraid the same will happen if you try to install the mbstring extension for PHP 5.2.11 on your Mac OS system. So you will be a little bit stuck, unless you are more familiar than I am with the mbstring extension and character sets (and if you are using a Mac, chances are good that you are).

In conclusion: you correctly ported my class to PHP 5.2.4, but the mb_convert_encoding() function behaves differently for versions < 5.2.11. If you find any information about it, I will be really curious to hear it!

With kind regards, Christian.

PS: it's no longer necessary to send me your PDF sample, as I asked in my previous message; we can now be sure that we have the same PDF file.
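For readers who want to check their own installation, here is a minimal probe, assuming only that the symptom is the one described above (hexadecimal entities not being decoded). The function name hexEntityDecodingWorks() is made up for this sketch and is not part of PdfToText.

```php
<?php
// Hypothetical helper: reports whether mb_convert_encoding() turns a
// hexadecimal numeric entity into the expected UTF-8 bytes on this PHP build.
function hexEntityDecodingWorks()
{
    if (!function_exists('mb_convert_encoding')) {
        return false; // the mbstring extension is not loaded at all
    }

    // "@" hides the deprecation notice that recent PHP versions emit
    // for the 'HTML-ENTITIES' pseudo-encoding.
    $decoded = @mb_convert_encoding('&#xE9;', 'UTF-8', 'HTML-ENTITIES');

    // U+00E9 (e acute) is the two bytes C3 A9 in UTF-8.
    return $decoded === "\xC3\xA9";
}

echo 'PHP ', PHP_VERSION, ': hex entity decoding ',
     hexEntityDecodingWorks() ? 'works' : 'is broken or unavailable', "\n";
```

Running it under several PHP versions with the same php.ini should reproduce the kind of comparison described in the post, without needing the PDF sample.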
Hello Christian,
Thank you very much for your good advice and for testing the modifications I made. I'm not sure this is the correct way to solve it, but I found a workaround that mostly works as far as I can see. In the mb_convert_encoding() call on line 148 I changed the source encoding (the third argument) from HTML-ENTITIES to AUTO.

In the browser it is almost perfect, but sometimes é (e acute) is not translated correctly. For example, "rangés" in the second line of the first poem becomes "rangps", but in other instances é is correct. When I look at the page source I can see that the characters that used to be wrong are now hex coded. For example, the first line of the poem Les Hiboux is in ASCII, but the second line in the source contains hex-coded entities, and in the browser it reads "Les hiboux se tiennent rangps". So the next-to-last character, &#x70; (p), should have been &#xE9; (é). Just above, in "Bruce Demaug&#xE9;-Bost" (Bruce Demaugé-Bost), it is correct.

Best regards
Aryan

In php.ini, default_charset has not been set:

    default_charset                              no value

and the mbstring settings are:

    Multibyte Support                            enabled
    Multibyte string engine                      libmbfl
    Multibyte (japanese) regex support           enabled
    Multibyte regex (oniguruma) version          4.4.4
    Multibyte regex (oniguruma) backtrack check  On
    mbstring.detect_order                        no value
    mbstring.encoding_translation                Off
    mbstring.func_overload                       0
    mbstring.http_input                          pass
    mbstring.http_output                         pass
    mbstring.internal_encoding                   no value
    mbstring.language                            neutral
    mbstring.strict_detection                    Off
    mbstring.substitute_character                no value

Kind regards and many thanks!
Aryan
Hello Aryan,
Thanks for your feedback! I'm happy that you found a solution that runs on your PHP version. Regarding the "é" being sometimes replaced with a "p": this is one of the 40 issues I still have to fix in my class. I will try to explain why.

The PostScript-like language used in a PDF file to draw text on the output device allows you to specify which font to use for the next chunk of text. For example, "/F1" tells the reader that you want to use font #1. But /F1 is just a shortcut; somewhere in the PDF file, something says "you will find the description of font #1 in object #x1". Object #x1 describes the font characteristics, the most important of which is whether it has an associated character map.

In the sample PDF file we're talking about, object #x1 also says: "you will find the Unicode character map for this font in object #y1". Object #y1 describes the character map itself. What does it contain? Just a mapping table that says: "every time a character c1 is referenced by a text-drawing instruction, substitute character s1 for it". The character s1 will be either a Unicode code point or a single byte value, depending on the "character width" declared in the character map.

But hold on: our sample contains a second font alias, say "/F2". Somewhere in the PDF file, a declaration states that "/F2" maps to object #x2, which says that the associated character map is object #y2. In our sample, character maps #y1 and #y2 are almost identical, but for a given index, character map #y1 maps it to "é" while #y2 maps it to "p".

So far, so good. Unfortunately (and this is one case I'm not handling yet), font aliases such as /F1 and /F2 are not defined at the document level: they are defined at the page level (and I suspect they can also be redefined in other subparts of the document).
Currently, I'm only handling font references at the document level. So for our sample, my class assumes: "the description of /F1 can be found in object #x1, and that of /F2 in object #x2". However, at the page level, this PDF sample sometimes plays tricks on me; it says: "the /F1 description is in object #x2, and the /F2 description in object #x1". Hence the "p" instead of the "é".

This is a little bit tricky to fix given the current state of my class, but each time I publish a new version I also include some "provisional code" meant to ease the work when I address this issue (for example, some internal structures may change and some private functions may be added to prepare this future fix, without changing the current behavior of the class). I expect this issue to be fixed before the end of November.

Note that this does not seem to affect many PDF samples: apart from the one we're talking about, I only have 2 or 3 other samples showing the same defect. So be patient!

With kind regards, Christian.
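The alias mix-up described above can be pictured with a small sketch. Everything in it is invented for illustration (the object ids, the tables, the index 0x1A and the function resolveCharacter()); it is not how PdfToText stores fonts, only one possible way to let a page-level alias override the document-level one.

```php
<?php
// All ids and table contents below are made up for illustration.
$documentFonts = array('/F1' => 'x1', '/F2' => 'x2');   // document-level aliases
$pageFonts     = array(
    1 => array(),                                       // page 1: no override
    2 => array('/F1' => 'x2', '/F2' => 'x1'),           // page 2: aliases swapped
);
$fontObjects   = array('x1' => 'y1', 'x2' => 'y2');     // font object => character map object
$charMaps      = array(
    'y1' => array(0x1A => "\xC3\xA9"),                  // index 0x1A => "é" (UTF-8 bytes)
    'y2' => array(0x1A => 'p'),                         // same index  => "p"
);

// Resolve a character index for a font alias, preferring the page-level
// alias table and falling back to the document-level one.
function resolveCharacter($page, $alias, $index, $documentFonts, $pageFonts, $fontObjects, $charMaps)
{
    $fontId = isset($pageFonts[$page][$alias])
        ? $pageFonts[$page][$alias]
        : $documentFonts[$alias];
    $mapId  = $fontObjects[$fontId];

    return isset($charMaps[$mapId][$index]) ? $charMaps[$mapId][$index] : '?';
}

echo resolveCharacter(1, '/F1', 0x1A, $documentFonts, $pageFonts, $fontObjects, $charMaps), "\n";
echo resolveCharacter(2, '/F1', 0x1A, $documentFonts, $pageFonts, $fontObjects, $charMaps), "\n";
```

With document-level lookup only, both calls would go through map y1; with the page-level override, the same alias on page 2 goes through map y2, which is the kind of difference that turns an "é" into a "p".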
Hello Christian,
Thank you very much for your explanation. I'm very impressed by what your class can do already. I wonder whether your original code replaced the Unicode hex characters in the output with plain ASCII where possible? Not that this is a big issue, but the source code would be cleaner and leaner that way.

Best regards
Aryan
Hello Aryan,
Well, I'm afraid I did not quite understand the question! Do you mean using 1-byte character codes in the UTF-8 output whenever possible? Or do you mean replacing things like accented letters with their 7-bit ASCII equivalents?

For the first question, this is already done, since I'm relying on the mb_convert_encoding() function. In your case, with your PHP version and your modifications, you are apparently returning HTML entities, so I cannot do much about that. The normal version, which is designed for PHP versions >= 5.6, only returns pure UTF-8, without any HTML entities.

If you mean converting to 7-bit ASCII, I have already developed something to do that. Maybe I could integrate this feature into my class. Please let me know if you're interested.

Regarding making the code cleaner, I agree with you as far as the parsing of the PostScript instructions that draw text and the handling of character substitution are concerned. I have stacked up many kludges to handle special cases, and I'm waiting a little to see whether more special cases will be submitted to me (after all, I received two new samples presenting two additional special cases during the past week...). Once I'm confident that everything is more or less stabilized, and that I have covered almost all the quirks that can arise when interpreting a PDF file, I will review this portion of my code to provide something more elegant. But for now, it's still "under construction". We have to live with that!

Please let me know whether I answered your concerns or not,
Christian.
Hi Christian,
I realize I did not express myself clearly. I meant 1-byte character codes in the UTF-8 output whenever possible. I haven't found out what goes wrong with mb_convert_encoding ( $entity, 'UTF-8', 'HTML-ENTITIES' ) in my case, but I made some more modifications that work.

First I made an array (skipping '#' !!) mapping hex entities to the printable ASCII characters (character codes 32-127), like this:

    $HexASCIICharacterMap = array(
        '&#x20;' => ' ',   # Space
        '&#x21;' => '!',   # Exclamation mark
        '&#x22;' => '"',   # Double quotes (or speech marks)
        '&#x24;' => '$',   # Dollar
        '&#x25;' => '%',   # Percent sign
        '&#x26;' => '&',   # Ampersand
        '&#x27;' => "'",   # Single quote
        '&#x28;' => '(',   # Open parenthesis (or open bracket)
        '&#x29;' => ')',   # Close parenthesis (or close bracket)
        '&#x2A;' => '*',   # Asterisk
        '&#x2B;' => '+',   # Plus
        '&#x2C;' => ',',   # Comma
        '&#x2D;' => '-',   # Hyphen
        '&#x2E;' => '.',   # Period, dot or full stop
        '&#x2F;' => '/',   # Slash or divide
        '&#x30;' => '0',   # Zero
        '&#x31;' => '1',   # One
        '&#x32;' => '2',   # Two
        '&#x33;' => '3',   # Three
        '&#x34;' => '4',   # Four
        '&#x35;' => '5',   # Five
        '&#x36;' => '6',   # Six
        '&#x37;' => '7',   # Seven
        '&#x38;' => '8',   # Eight
        '&#x39;' => '9',   # Nine
        '&#x3A;' => ':',   # Colon
        '&#x3B;' => ';',   # Semicolon
        '&#x3C;' => '<',   # Less than (or open angled bracket)
        '&#x3D;' => '=',   # Equals
        '&#x3E;' => '>',   # Greater than (or close angled bracket)
        '&#x3F;' => '?',   # Question mark
        '&#x40;' => '@',   # At symbol
        '&#x41;' => 'A',   # Uppercase A
        '&#x42;' => 'B',   # Uppercase B
        '&#x43;' => 'C',   # Uppercase C
        '&#x44;' => 'D',   # Uppercase D
        '&#x45;' => 'E',   # Uppercase E
        '&#x46;' => 'F',   # Uppercase F
        '&#x47;' => 'G',   # Uppercase G
        '&#x48;' => 'H',   # Uppercase H
        '&#x49;' => 'I',   # Uppercase I
        '&#x4A;' => 'J',   # Uppercase J
        '&#x4B;' => 'K',   # Uppercase K
        '&#x4C;' => 'L',   # Uppercase L
        '&#x4D;' => 'M',   # Uppercase M
        '&#x4E;' => 'N',   # Uppercase N
        '&#x4F;' => 'O',   # Uppercase O
        '&#x50;' => 'P',   # Uppercase P
        '&#x51;' => 'Q',   # Uppercase Q
        '&#x52;' => 'R',   # Uppercase R
        '&#x53;' => 'S',   # Uppercase S
        '&#x54;' => 'T',   # Uppercase T
        '&#x55;' => 'U',   # Uppercase U
        '&#x56;' => 'V',   # Uppercase V
        '&#x57;' => 'W',   # Uppercase W
        '&#x58;' => 'X',   # Uppercase X
        '&#x59;' => 'Y',   # Uppercase Y
        '&#x5A;' => 'Z',   # Uppercase Z
        '&#x5B;' => '[',   # Opening bracket
        '&#x5C;' => '\\',  # Backslash
        '&#x5D;' => ']',   # Closing bracket
        '&#x5E;' => '^',   # Caret - circumflex
        '&#x5F;' => '_',   # Underscore
        '&#x60;' => '`',   # Grave accent
        '&#x61;' => 'a',   # Lowercase a
        '&#x62;' => 'b',   # Lowercase b
        '&#x63;' => 'c',   # Lowercase c
        '&#x64;' => 'd',   # Lowercase d
        '&#x65;' => 'e',   # Lowercase e
        '&#x66;' => 'f',   # Lowercase f
        '&#x67;' => 'g',   # Lowercase g
        '&#x68;' => 'h',   # Lowercase h
        '&#x69;' => 'i',   # Lowercase i
        '&#x6A;' => 'j',   # Lowercase j
        '&#x6B;' => 'k',   # Lowercase k
        '&#x6C;' => 'l',   # Lowercase l
        '&#x6D;' => 'm',   # Lowercase m
        '&#x6E;' => 'n',   # Lowercase n
        '&#x6F;' => 'o',   # Lowercase o
        '&#x70;' => 'p',   # Lowercase p
        '&#x71;' => 'q',   # Lowercase q
        '&#x72;' => 'r',   # Lowercase r
        '&#x73;' => 's',   # Lowercase s
        '&#x74;' => 't',   # Lowercase t
        '&#x75;' => 'u',   # Lowercase u
        '&#x76;' => 'v',   # Lowercase v
        '&#x77;' => 'w',   # Lowercase w
        '&#x78;' => 'x',   # Lowercase x
        '&#x79;' => 'y',   # Lowercase y
        '&#x7A;' => 'z',   # Lowercase z
        '&#x7B;' => '{',   # Opening brace
        '&#x7C;' => '|',   # Vertical bar
        '&#x7D;' => '}',   # Closing brace
        '&#x7E;' => '~',   # Equivalency sign - tilde
        '&#x7F;' => "\x7F" # Delete
    );

Then I changed

    $entity = '&#x' . sprintf ( '%x', ( $code & 0xFFFF ) ) . ';' ;

to

    $entity = '&#x' . sprintf ( '%X', ( $code & 0xFFFF ) ) . ';' ; // uppercase hex code

so I didn't need to change the hex codes in my array, and added

    $entity = str_replace(array_keys($HexASCIICharacterMap), array_values($HexASCIICharacterMap), $entity);

just before

    $result = mb_convert_encoding ( $entity, 'UTF-8', 'AUTO' ) . $result ;

The str_replace can also be moved to the return line after the while loop:

    return str_replace(array_keys($HexASCIICharacterMap), array_values($HexASCIICharacterMap), $result);

I'm not sure which is more efficient; both make for cleaner and leaner source code. The first part of the source of the output now looks like this:

    Extracted file contents :<br />
    <br />
    v01 – Bruce Demaugé-Bost – http://bdemauge.free.fr <br />
    <br />
    Les hiboux <br />
    Charles Baudelaire Cycle 3 <br />
    * POÉSIE <br />
    Sous les ifs noirs qui les abritent <br />
    Les hiboux se tiennent rangps Ainsi que des dieux ptrangers <br />
    Dardant leur œil rouge. Ils mpditent. <br />
    <br />
    Sans remuer ils se tiendront <br />
    Jusqu'à l'heure mplancolique <br />
    Où, poussant le soleil oblique, <br />
    Les tpnqbres s'ptabliront. <br />

Then I tried adding extended ASCII codes (character codes 128-255) to my array:

    '&#xE8;' => 'è', # Latin small letter e with grave
    '&#xE9;' => 'é', # Latin small letter e with acute
    '&#xEA;' => 'ê', # Latin small letter e with circumflex

and so on, and that worked too! First part of the source of the generated output:

    Extracted file contents :<br />
    <br />
    v01 – Bruce Demaugé-Bost – http://bdemauge.free.fr <br />
    <br />
    Les hiboux <br />
    Charles Baudelaire Cycle 3 <br />
    * POÉSIE <br />
    Sous les ifs noirs qui les abritent <br />
    Les hiboux se tiennent rangps Ainsi que des dieux ptrangers <br />
    Dardant leur œil rouge. Ils mpditent. <br />
    <br />
    Sans remuer ils se tiendront <br />
    Jusqu'à l'heure mplancolique <br />
    Où, poussant le soleil oblique, <br />
    Les tpnqbres s'ptabliront. <br />

So if I included a longer array with all characters, that would work, but I thought it should be possible to solve this in a better way.
After some more trial and error and Google searching, I found a solution that needs neither a character array nor mb_convert_encoding. The while $code loop (where mb_convert_encoding was) looks like this:

    while ( $code ) {
        // Encode one 16-bit unit per pass, as the original loop did with ( $code & 0xFFFF )
        $chunk = $code & 0xFFFF ;
        if ( $chunk < 128 ) {
            // 1 byte: plain ASCII
            $utf8 = chr ( $chunk ) ;
        } else if ( $chunk < 2048 ) {
            // 2 bytes: 110xxxxx 10xxxxxx
            $utf8  = chr ( 192 + ( ( $chunk - ( $chunk % 64 ) ) / 64 ) ) ;
            $utf8 .= chr ( 128 + ( $chunk % 64 ) ) ;
        } else {
            // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
            $utf8  = chr ( 224 + ( ( $chunk - ( $chunk % 4096 ) ) / 4096 ) ) ;
            $utf8 .= chr ( 128 + ( ( ( $chunk % 4096 ) - ( $chunk % 64 ) ) / 64 ) ) ;
            $utf8 .= chr ( 128 + ( $chunk % 64 ) ) ;
        }
        $result = $utf8 . $result ;
        $code >>= 16 ;
    }
    return $result;

The first part of the source of the output now looks like this:

    Extracted file contents :<br />
    <br />
    v01 – Bruce Demaugé-Bost – http://bdemauge.free.fr <br />
    <br />
    Les hiboux <br />
    Charles Baudelaire Cycle 3 <br />
    * POÉSIE <br />
    Sous les ifs noirs qui les abritent <br />
    Les hiboux se tiennent rangps Ainsi que des dieux ptrangers <br />
    Dardant leur œil rouge. Ils mpditent. <br />
    <br />
    Sans remuer ils se tiendront <br />
    Jusqu'à l'heure mplancolique <br />
    Où, poussant le soleil oblique, <br />
    Les tpnqbres s'ptabliront. <br />

So this seems to work OK, but I haven't tested it extensively.

Best regards
Aryan
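For comparison, the same arithmetic can be written with bit operations. This is only a sketch under one assumption that differs from the loop above: the argument is treated as a single Unicode code point (up to U+10FFFF) rather than as a sequence of 16-bit units, so a 4-byte branch is included. The name codePointToUtf8Bytes() is invented here and is not part of PdfToText.

```php
<?php
// Hypothetical stand-alone encoder: one Unicode code point in, UTF-8 bytes out.
// Returns an empty string for values that are not valid code points.
function codePointToUtf8Bytes($cp)
{
    if ($cp < 0 || $cp > 0x10FFFF || ($cp >= 0xD800 && $cp <= 0xDFFF)) {
        return ''; // out of range, or a UTF-16 surrogate
    }
    if ($cp < 0x80) {                       // 0xxxxxxx
        return chr($cp);
    }
    if ($cp < 0x800) {                      // 110xxxxx 10xxxxxx
        return chr(0xC0 | ($cp >> 6))
             . chr(0x80 | ($cp & 0x3F));
    }
    if ($cp < 0x10000) {                    // 1110xxxx 10xxxxxx 10xxxxxx
        return chr(0xE0 | ($cp >> 12))
             . chr(0x80 | (($cp >> 6) & 0x3F))
             . chr(0x80 | ($cp & 0x3F));
    }
    return chr(0xF0 | ($cp >> 18))          // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
         . chr(0x80 | (($cp >> 12) & 0x3F))
         . chr(0x80 | (($cp >> 6) & 0x3F))
         . chr(0x80 | ($cp & 0x3F));
}

echo bin2hex(codePointToUtf8Bytes(0xE9)), "\n";   // prints "c3a9", the bytes of é
```

Shifts and masks give the same bytes as the division and modulo version for code points below 0x10000; which form to keep is mostly a matter of taste.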
Hi Aryan,
Thanks for this clear and really detailed feedback. Clearly, the problem comes from the mbstring extension bundled with PHP 5.2.4. As far as I understand, in your case mb_convert_encoding() still returns HTML entities after the conversion, which is not correct behavior!

I've already seen the code you tried instead of mb_convert_encoding. I think I even tried it myself one day, when I was struggling to figure out why I had unexpected conversion results (before discovering that in some places of my code I called CodePointToUtf8 with an integer value, and in other places with a string value, which produced unexpected results, as you might guess). In conclusion, if it produces better results with your current PHP version, keep it; definitely.

Regarding your question about where to put the str_replace() in your previous version, well... even if your PDF document contains 500 000 characters (for information, the RTF 1.9 specification contains around 550 000 characters), most of them will be characters in the range 0..255; only a minority will be code points that need more than 2 bytes. This means that the while() loop will be executed only once in most cases (for code points that fit in 2 bytes), so putting the str_replace inside or outside the loop will not have a great impact.

As an example, suppose that your document contains 500 000 characters and that 10% of them take more than 2 bytes, which is unlikely unless you're handling documents written in rare languages: the str_replace function will be called 550 000 times if you put the call inside the loop, and 500 000 times if outside the loop. But, of course, as a rule of thumb, you should put loop invariants outside the loop, so it's definitely better to call str_replace outside of the loop.
However, one optimization that could have been done in your "str_replace" version would have been to avoid too many calls to array_keys() and array_values(); this can be done in a very simple way:

    function CodePointToUtf8 ( $code )
       {
        global $HexASCIICharacterMap ;   // assumes the map is defined at global scope
        static $searches = false, $replacements = false ;

        if ( $searches === false )
           {
            $searches     = array_keys ( $HexASCIICharacterMap ) ;
            $replacements = array_values ( $HexASCIICharacterMap ) ;
           }
        ...
       }

Now, instead of calling:

    $entity = str_replace(array_keys($HexASCIICharacterMap), array_values($HexASCIICharacterMap), $entity);

you will have:

    $entity = str_replace ( $searches, $replacements, $entity ) ;

Let's have a closer look at what a function like array_keys is doing:

1) Initialize a new array, which is represented internally as a hash table
2) Loop through each key of the array you specified; for each key:
   2.1) Allocate a ZVAL (the most basic entity in PHP; a ZVAL can hold an integer, a string, a float, a boolean, a pointer to a hash table that represents an array, etc.)
   2.2) Allocate enough memory to store a copy of the current array key
   2.3) Copy the value of the current array key to the memory allocated in 2.2)
   2.4) Set the value of the ZVAL allocated in 2.1) to the value copied in 2.3)
   2.5) Add the value to the array initialized in 1)
3) Return the newly populated array

Of course, each step (and each internal step performed by the internal PHP functions for ZVAL and array management) performs lots of sanity checks for things such as running out of memory, parameter inconsistencies, sudden rises in cosmic radiation, etc. This can lead to performance issues if your code is called really heavily.

The second version I'm suggesting above performs all this mess only once, at the very first invocation of CodePointToUtf8. After that, a simple if() test avoids rebuilding the $searches and $replacements arrays.
Of course, you might say: "hey, my if() test will succeed only once for a 500 000-character document, and will fail for the other 499 999 characters". I know, but it costs only a few CPU cycles, compared with the mess of calling array_keys and array_values upon each invocation. Even if they are builtin functions designed to operate as fast as possible, they have their own cost, which is very high compared to the cost of a really simple if() test on a variable.

The ultimate optimization would be to get rid of $HexASCIICharacterMap; simply initialize $searches with its array keys, such as:

    static $searches = array ( '&#x20;', '&#x21;', '&#x22;', ... ) ;

and $replacements:

    static $replacements = array ( ' ', '!', '"', ... ) ;

but personally, I would be really unhappy to have to maintain such code!

You can easily test the difference in performance between version 1 (yours) and version 2 (mine); just take a BIG PDF document, such as the Adobe PDF specification, then run this code with both versions:

    $t1  = microtime ( true ) ;
    $pdf = new PdfToText ( 'your BIG pdf document.pdf' ) ;
    $t2  = microtime ( true ) ;
    echo "Processing took " . round ( $t2 - $t1, 2 ) . " seconds\n" ;

Sorry that your version of the PHP mbstring extension is so buggy; but I think that the code you found by googling is your best alternative anyway!

Christian.
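The comparison can also be tried in isolation, without a PDF file. The sketch below assumes a three-entry stand-in for the full map, and the function names (replaceUncached, replaceCached, timeIt) are invented for the example. With such a small table the difference will be smaller than with the full one, and the timings depend on the machine and PHP version.

```php
<?php
// Hypothetical three-entry map standing in for the full table.
$map = array('&#x41;' => 'A', '&#x42;' => 'B', '&#x43;' => 'C');

// Version 1: rebuilds the key and value arrays on every call.
function replaceUncached($text, $map)
{
    return str_replace(array_keys($map), array_values($map), $text);
}

// Version 2: builds them once and keeps them in static variables.
function replaceCached($text, $map)
{
    static $searches = false, $replacements = false;

    if ($searches === false) {
        $searches     = array_keys($map);
        $replacements = array_values($map);
    }

    return str_replace($searches, $replacements, $text);
}

// Calls the named function $iterations times and returns the elapsed seconds.
function timeIt($function, $map, $iterations)
{
    $start = microtime(true);
    for ($i = 0; $i < $iterations; $i++) {
        $function('&#x41;&#x42;&#x43;', $map);
    }
    return microtime(true) - $start;
}

printf("uncached: %.4f s\n", timeIt('replaceUncached', $map, 100000));
printf("cached:   %.4f s\n", timeIt('replaceCached', $map, 100000));
```

Note that the cached version ignores any later change to the map, since the static arrays are filled only on the first call; that is acceptable here because the table is constant.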
Hello Christian,
Thank you very much for your answer and the interesting explanation about optimizing PHP code. About unexpected results with my code: would it help to put this at the end of the while loop?

    $utf8 = mb_convert_encoding ( $utf8, 'UTF-8', 'UTF-8' );

just before

    $result = $utf8 . $result ;

Normally this would simply do nothing, but if there are any "wrong" non-UTF-8 characters, they would be removed, I suppose.

I did some more tests with mb_convert_encoding and HTML-ENTITIES and discovered that my PHP version does not correctly convert hex-coded entities like &#xE9; but works fine with named entities like &eacute; or decimal entities like &#233;. So now I have found a very simple solution: instead of creating hexadecimal entities

    $entity = '&#x' . sprintf ( '%X', ( $code & 0xFFFF ) ) . ';' ;

I could just as well create decimal entities

    $entity = '&#' . $code . ';' ;

or like so

    $entity = sprintf( "&#%d;", $code);

This way your original code with mb_convert_encoding ( $entity, 'UTF-8', 'HTML-ENTITIES' ) works just as intended! Do you think there are any drawbacks to this solution?

Best regards
Aryan
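As a rough way to look at the drawback question, the sketch below builds both entity forms for the same code and checks that they decode to the same bytes. It deliberately uses html_entity_decode() instead of mb_convert_encoding(), so that it runs without the mbstring extension; that substitution, and the helper names, are choices made for this example only. One difference worth keeping in mind: the hexadecimal line in the post masks the value with 0xFFFF, while the decimal lines use $code as is, so the two only build the same entity when $code fits in 16 bits.

```php
<?php
// Hypothetical helpers, not part of PdfToText.
function hexEntity($code)
{
    return sprintf('&#x%X;', $code);   // for example &#xE9;
}

function decimalEntity($code)
{
    return sprintf('&#%d;', $code);    // for example &#233;
}

// Stand-in decoder used so the example does not depend on mbstring.
function decodeEntity($entity)
{
    return html_entity_decode($entity, ENT_QUOTES, 'UTF-8');
}

foreach (array(0xE9, 0x153) as $code) {
    $hex = hexEntity($code);
    $dec = decimalEntity($code);
    echo $hex, ' and ', $dec,
         decodeEntity($hex) === decodeEntity($dec) ? ' decode identically' : ' decode differently',
         "\n";
}
```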
