PHP PDF to Text > All threads > error
Hello Aryan,
I've had a quick look at the mbstring extension source code for version 5.2.4, for the following call : mb_convert_encoding ( $utf8, 'UTF-8', 'UTF-8' ); Apparently, it is authorized to specify the same source and destination character sets ; what I can say is that at the end of the function, another internal mb function is called, which wipes out bad characters from the conversion result. So it may help to add this extra call ; however, I would suggest to wait a little bit before : test your version without this call against a maximum number of pdf samples, to see what happens. The reason for that is simple : if you put the code-you-googled + the-call-to-mb_convert_encoding at the same time, AND an unexpected result occurs, you will have to struggle to figure out whether it comes from your code or from mb_string_convert_encoding. And after that, when you'll have identified the faulty code, you will need to perform regression testing to verify that the samples you've processed so far still give the same result. So wait using the simplest version , until some problem occurs... Regarding the problems with entities expressed as hex values : I can remember that one day I found some code converting to html entities before calling mb_convert_encoding ; the developer used the decimal notation for that and, during a few milliseconds, I told myself "hey, why using decimal numbers fo characters, when we all use the hexadecimal base for that ?". Now I better understand why he did it this way ! So , if it works with the decimal notation, this is a really good new for you ! It's surprising that your version of the mbstring extension gets crazy with hex numbers. I don't think there are any drawbacks with it (unless this version of mb_convert_encoding has not yet unveiled all its bugs...). A final note about optimization : between the two following forms : $entity = '&#' . $code . 
';’ ; and : $entity = sprintf( "&#%d;", $code); prefer the first one ; the second one adds the extra expense of a function call. Of course, if "$code" came from an html form, I would say : prefer the second one, because it will sanitize your input ! but we're not in this case here... Christian.
Oh, I almost forgot: you did great work investigating these problems!
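To make the comparison above concrete, here is a minimal sketch of the two entity-building forms; the example value (233, the codepoint of "é") is mine, not from the class:

```php
<?php
// Two equivalent ways of building a decimal HTML entity for a codepoint.
// Concatenation avoids the function-call overhead of sprintf(); sprintf()
// additionally coerces $code to an integer, which only matters for
// untrusted input such as form data.
$code = 233 ;	// example: codepoint of "é"

$byConcat  = '&#' . $code . ';' ;
$bySprintf = sprintf ( "&#%d;", $code ) ;

var_dump ( $byConcat === $bySprintf ) ;				// bool(true)
echo html_entity_decode ( $byConcat, ENT_QUOTES, 'UTF-8' ) ;	// é
```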
Hello Christian,
Thank you very much for your tips and good advice. I've done some more testing and I can't find any significant performance difference between your code modified for decimal entities and the direct conversion to UTF-8 with the "chr if code < 127 else code < 2048" construction. I could indeed see that sprintf ( "&#%d;", $code ) is not as efficient as '&#' . $code . ';'. Just as a test, I also tried '&#' . 1 * $code . ';' and that seemed to be only slightly slower than without the 1 *, but as you say, in this case there is no need to sanitize the input.

I had some issues with larger pdf files; probably my php.ini settings are the limiting factor. As a workaround, I split pdf files into individual pages / pdf files (file1.pdf, file2.pdf, file3.pdf, etc.) and loop through these when running PDF to Text, gathering the text from these files into one output.

The only conversion issue I've seen so far is a pdf file that -I think- uses ligatures for combinations of fl and fi, etc. The funny thing is that the output in the browser is ok when I keep hex-encoded entities in the output:

$entity = '&#x' . sprintf ( '%X', ( $code & 0xFFFF ) ) . ';' ;
$result = mb_convert_encoding ( $entity, 'UTF-8', 'AUTO' ) . $result ;

With this, the output is ok in the browser, with fl and fi characters that didn't come out correctly with the other two conversion methods. If you like, I can email you the pdf file where this occurs; it is not too big.

Best regards,
Aryan
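The "chr if code < 127 else code < 2048" construction mentioned above can be sketched as a small helper; this is my own illustration of the branching (the function name is hypothetical, not the code from the class):

```php
<?php
// Hypothetical sketch of manual UTF-8 encoding of a Unicode codepoint,
// matching the "chr if code < 127 else code < 2048" branching described above.
// Handles codepoints up to 0xFFFF (the Basic Multilingual Plane).
function codepoint_to_utf8 ( $code )
   {
	if  ( $code < 0x80 )			// 1-byte sequence (ASCII)
		return ( chr ( $code ) ) ;
	else if  ( $code < 0x800 )		// 2-byte sequence
		return ( chr ( 0xC0 | ( $code >> 6 ) ) .
			 chr ( 0x80 | ( $code & 0x3F ) ) ) ;
	else					// 3-byte sequence
		return ( chr ( 0xE0 | ( $code >> 12 ) ) .
			 chr ( 0x80 | ( ( $code >> 6 ) & 0x3F ) ) .
			 chr ( 0x80 | ( $code & 0x3F ) ) ) ;
    }

echo codepoint_to_utf8 ( 0xE9 ) ;	// é
```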
Hello Aryan,
You're right, my suggestion was just related to using array_keys(); I did not make any comparison with the "chr if code < 127 else code < 2048" approach (in fact, I didn't even think of it, but I'm not surprised that it shows similar performance). Nor am I surprised about the results with and without the sprintf function!

If you have issues with larger pdf files, then you need to consider 3 settings in php.ini:

- memory_limit: defines the maximum amount of memory a php script can use. Normally, the default value is "128M" (128 megabytes). On my Windows system, I set it to "1G" (but it's my personal system, so...).
- upload_max_filesize (if your pdf files are uploaded through a form): this is the maximum authorized size for a file upload. I've taken a look at the php.ini file bundled with php 5.2.4, and it's only "2M"!
- post_max_size: this sets the maximum size of data that can be sent through a form using the POST method. The default value on PHP 5.2.4 should be "8M".

I suggest you increase the value of upload_max_filesize (and maybe post_max_size); just preserve the following inequality: upload_max_filesize < post_max_size < memory_limit. (Note: you can specify integer quantities, such as 1048576, or kilobytes: 1024K, or megabytes: 1M.)

Ah, the ligatures! They drove me crazy a few months ago, but they helped me understand that my class was trying to perform strange conversions... They are converted almost normally in PHP 5.6 environments, despite a small inconsistency in mb_convert_encoding, which substitutes them with several characters; for example, the single-character "fi" ligature gives two characters, "f" and "i". So don't hesitate to send me your pdf sample ([email protected]). I will at least be able to tell you what it gives with PHP 5.6; if I get some strange output, I may not be able to fix this issue if it is related to one of the 5 or 6 big issues I still have to fix on my side!
By the way, could you also send me in your email, together with the pdf file, the output of the PdfToText class, so that I can see what "didn't come out correct" and compare results? With kind regards, Christian.
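The three php.ini settings discussed above could look like this; the sizes are illustrative values I chose, not recommendations, and they respect the stated inequality:

```ini
; Illustrative php.ini values for the settings discussed above.
; Keep: upload_max_filesize < post_max_size < memory_limit.
memory_limit        = 256M
post_max_size       = 64M
upload_max_filesize = 32M
```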
Hello Aryan,
I received your sample PDF file by mail and... what a surprise! My class did not extract anything. After a few investigations, I found out that the pdf file has Mac-style line endings, i.e. "\r" (while Windows uses "\r\n" and Unix "\n"). Unfortunately, PCRE functions like preg_match_all() go crazy with that, at least on Windows systems. This makes the following call, located at line 887, fail:

if ( ! preg_match_all ( '/(?P<object_id> \d+) \s+ \d+ \s+ obj (?P<object> .*?) endobj/imsx', $contents, $matches ) )
	return ( false ) ;

The PCRE port on Mac OS works just fine; this explains why you can successfully extract text from your sample on your system. I have tried several alternatives, for example:

'/(?P<object_id> \d+) \s+ \d+ \s+ obj (?P<object> (\r\n | \n | \r | [^\r\n])*?) endobj/imsx'

but this makes the php executable cause a memory access violation. I also tried some embedded modifiers, such as (*ANYCRLF) (see https://nikic.github.io/2011/12/10/PCRE-and-newlines.html), without success.

I recall having seen one pdf sample which used carriage returns as line breaks, EXCEPT after the "obj" and "endobj" keywords, which were followed by a carriage return/line feed. This may explain why I didn't notice the problem before. But it's ok: I'll write a method to preprocess pdf contents to replace \r line endings with \n, so that PCRE will be happy. However, I do not want to do that systematically, because it could have side effects on the existing non-Mac samples I've tested so far. This is why I need to know the value of the PHP_OS constant on your Mac; could you run the following PHP script:

echo "PHP_OS = " . PHP_OS . "\n" ;
echo "php_uname() = " . php_uname ( ) . "\n" ;

and tell me the result? (For your information, PHP_OS is set to "WINNT" on Windows platforms, and to "Linux" on... Linux platforms.)

Thanks for your help,
Christian.
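The preprocessing step described above (replacing \r line endings with \n) could be sketched like this; this is my own minimal version, not the method from the class, and the function name is hypothetical:

```php
<?php
// Sketch of a preprocessing step that normalizes Mac-style line endings
// ("\r") to Unix ones ("\n"), so that PCRE patterns behave consistently.
// "\r\n" pairs are converted first, so they do not become "\n\n".
function normalize_line_endings ( $contents )
   {
	return ( str_replace ( [ "\r\n", "\r" ], "\n", $contents ) ) ;
    }

echo normalize_line_endings ( "1 0 obj\r...\rendobj\r" ) ;
```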
Hi Christian,
Thanks for testing; that was indeed unexpected. I did not create the PDF myself and didn't even know it was done on Mac OS, so I do not really know on what version of Mac OS this PDF was created. If you like, I can send some more PDFs that I can create on Mac OS X (10.4, 10.6, 10.10). The old php server I test the class on is running Mac OS X 10.4 PPC, and Mac OS X translates in PHP_OS to "Darwin" (in this case Darwin Kernel Version 8.11.0: Wed Oct 10 18:26:00 PDT 2007).

About the ligatures: I thought that the ligature character in the pdf could possibly be replaced/presented by a vector-based image, so that it looks like a ligature in a font, while the hidden text (which you can, for example, search and highlight in the pdf) still consists of the separate characters that the writer of the text entered (fl, fi, ffl, etc.)?

Best regards,
Aryan
I was indeed expecting some new issue with your sample; this is why I asked for it!
Thanks for providing me with the results of PHP_OS and php_uname(); the conclusion they lead me to is that I cannot rely on the information returned by Mac OSes. Yours says "Darwin", and I suspect that other versions will spell out different names, without mentioning something like "Mac OS", which could have been of real help to developers... OK, I will rely on the strings "WIN" and "Linux" for the PHP_OS constant.

Regarding ligatures: there is a kind of font in the PDF file format that is called a "CID font" (CID stands for "character id"). It has exactly the behavior you described: you have a character id which points to a font where the vector-based image for this character is described. The CID does not correspond to anything known: it is not a Unicode codepoint, nor a UTF-8 sequence; it's just an ID that is internal to Adobe. Adobe has been using it mainly for far-east languages such as Chinese or Japanese. I suspect they implemented it long before the Unicode standard emerged, because they needed to really be "international". However, my class does not handle CID fonts yet; you would need tables that map character ids to their Unicode equivalents in order to do that (such information is not contained in the CID font structure). So, if ligatured characters such as "ff", "fi", "ffl", etc. were described by a CID font, my class would leave the character id as is, without interpretation; in such situations, chances are great that it will produce garbage output!

Anyway, I will implement a workaround and come back to you when it is ready!

Christian.
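Once actual Unicode codepoints are available (i.e. outside the CID-font case), the ligature substitution described above can be mimicked with a small mapping table. A sketch, covering only the common Latin ligature codepoints U+FB00..U+FB06 (the function name is mine):

```php
<?php
// Sketch: decompose the common Latin ligature codepoints into their
// constituent letters, mimicking the substitution that mb_convert_encoding
// is described as performing (e.g. the single character "fi" -> "f" + "i").
function decompose_ligatures ( $text )
   {
	static $map = [
		"\u{FB00}" => 'ff',
		"\u{FB01}" => 'fi',
		"\u{FB02}" => 'fl',
		"\u{FB03}" => 'ffi',
		"\u{FB04}" => 'ffl',
		"\u{FB05}" => 'st',	// long s + t ligature
		"\u{FB06}" => 'st'
	    ] ;

	return ( strtr ( $text, $map ) ) ;
    }

echo decompose_ligatures ( "e\u{FB03}cient \u{FB02}ight" ) ;	// efficient flight
```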
Hello Christian,
I think all versions of Mac OS X will say "Darwin"; see https://en.wikipedia.org/wiki/Darwin_(operating_system). On Mac OS X, LF ("\n") is normally used, just like in other UNIX and UNIX-like systems, but the old "classic" Mac OS (version ≤ 9) used CR ("\r"). It is possible that the PDF I sent you was created with an old classic Mac OS. I now realize that I myself did not use the pdf file I sent you for my own tests, because it was too big; I had split it into individual pages before converting it. These PDFs have possibly been converted to "normal" line feeds... I'll mail you these now as well, so you can see if they are indeed different.

Thank you for explaining the way ligatures are done; that makes sense. If the test file I sent was indeed created on a classic Mac OS, then Unicode did not exist yet, and special characters like ligatures, small caps and lowercase digits were usually done with another font of the same font family.

Best regards,
Aryan
Hello Aryan,
Thanks for the tip about Darwin; I was not aware of that! Anyway, I won't need it, because I found out that this is not an OS issue. I made a wrong diagnosis of the problem; in fact, the pdf file you supplied (boken-for-dig-om-adhd-faktabok.pdf) drives the preg_match_all() function crazy, not because of carriage returns, but because it reaches the limit given by the pcre.backtrack_limit setting in php.ini, which defaults to 1000000. I set it to 2000000 and everything worked fine. So if you modify the memory settings I described in one of my previous replies, plus the pcre.backtrack_limit one, everything should be ok and you won't have to split your initial document into individual pages. I sent you the output of the PdfToText class using this new setting value.

There is also a version 1.2.51 of the class, which throws an exception when such a limit is reached, instead of silently ignoring it.

Christian.
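The failure mode described above can be detected explicitly: when the backtrack limit is hit, preg_match_all() returns false and preg_last_error() reports PREG_BACKTRACK_LIMIT_ERROR. A sketch of that check, using the pattern quoted earlier in the thread on a toy input (not the actual code of version 1.2.51):

```php
<?php
// Sketch: raise pcre.backtrack_limit at runtime and detect when
// preg_match_all() fails because the limit was reached, instead of
// silently ignoring the failure.
ini_set ( 'pcre.backtrack_limit', '2000000' ) ;		// default is 1000000

$contents = "1 0 obj << /Type /Page >> endobj" ;	// toy sample, not a real pdf body
$status   = preg_match_all ( '/(?P<object_id> \d+) \s+ \d+ \s+ obj (?P<object> .*?) endobj/imsx',
			     $contents, $matches ) ;

if  ( $status === false  &&  preg_last_error ( ) === PREG_BACKTRACK_LIMIT_ERROR )
	throw new RuntimeException ( 'pcre.backtrack_limit reached; increase it in php.ini' ) ;

echo $matches [ 'object_id' ] [ 0 ] ;	// 1
```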
Brilliant, well done! I merged your changes from version 1.2.51 into my adapted version of your PdfToText php class and discovered that the pcre.backtrack_limit setting in php.ini was the only reason why I couldn't read any larger pdf documents; now it works for me too.

The only issue I (still) have is that the ligatures don't get converted correctly; otherwise my output is more or less identical to yours, apart from some extra spaces that your output has. If I understand correctly, in the class hex-coded characters in the PDF file are converted to decimal code and then back to hexadecimal (which is what I do not do), and then to UTF-8. Do you think it might be worth trying this: to add the conversion back to hexadecimal-encoded characters that you have, and then convert those back once again to decimal-encoded characters, so I can run my incomplete mb_convert_encoding on it?

Thank you for all the work you've done with this!

Best regards,
Aryan
