 Carin - 2016-07-18 12:16:14
Hi,
I get this error: Allowed memory size of 134217728 bytes exhausted (tried to allocate 7080 bytes) in PdfToText.phpclass on line 3446
It's got something to do with the images in my pdf.
How can I exclude images from the conversion? I only need the text to search through.
Thanks for this great code!
 Christian Vigh - 2016-07-18 12:28:49 - In reply to message 1 from Carin
Wow! This is a really interesting case. I (naively) assumed that everything would fit into memory.
Your suggestion of not processing images is really good, and I will introduce an explicit flag to say that images have to be processed (the default behavior will be: do not process the images).
However, to allow me to test my modifications, would it be possible for you to send me your pdf file at the following address :
[email protected]
It would really be of great help to me!
Christian.
 Carin - 2016-07-18 12:49:01 - In reply to message 2 from Christian Vigh
Thank you! I just emailed you.
 Rolf Kellner - 2016-07-20 16:16:57 - In reply to message 3 from Carin
This is not a ConvertCharset.class issue. I have already converted 25 MB documents without problems, regardless of text and image size. But the converter does require a lot of memory while converting PDFs.
Your php.ini file is the bottleneck. Increase the value of
memory_limit=128M
in this file. Afterwards, restart your server.
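For reference, the 134217728 bytes in the error message is exactly 128 × 1024 × 1024, i.e. the memory_limit=128M shown above. A raised limit might look like the fragment below; the 256M value is only an illustrative assumption, not a figure recommended anywhere in this thread, and the right value depends on your documents and server.

```ini
; php.ini (sketch; 256M is an assumed example value)
; Maximum amount of memory a single PHP script may allocate.
memory_limit = 256M
```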
 Christian Vigh - 2016-07-20 16:56:22 - In reply to message 4 from Rolf Kellner
Rolf,
Yes, that's exactly the workaround I suggested until I fix the issue.
In fact there are two issues:
- The first one is that the previous version of my class tried to automatically extract JPEG images; before PHP 5.6, the GD extension regularly complained with a memory allocation error when you tried to handle JPEG images larger than (approximately) 2 MB. Carin's PDF files have images between 2 and 3 MB, and she is using PHP 5.5.12. So I disabled automatic image extraction by default.
- The second issue is due to really big character map tables in the PDF file AND to my way of storing them, which can in turn cause a memory allocation error. In this particular case, changing the memory_limit setting is the right solution.
The really good news for you, Rolf, is that the sample Carin sent me, and the issue it revealed, helped me find out why there are sometimes bad Unicode conversions (well, I hope so...).
I just have to completely rework my way of handling character maps, as well as my way of handling Unicode codepoints, which is inappropriate in some cases for European languages and nearly fails for most Middle Eastern and Far Eastern languages.
A new version (which I'll call V1.3, and not V1.2.x) should be available within one or two weeks.
 Christian Vigh - 2016-08-03 20:17:48 - In reply to message 3 from Carin
Hello Carin,
I have finally managed to optimize memory usage for samples such as the one you provided (“31 August 1992.pdf”).
Initially it required the memory_limit setting in php.ini to be greater than 128 MB. Now it can run even if this setting is less than 32 MB.
This modification is available in the latest release, 1.2.29.
If you still have issues with files much larger than the one you sent me, please send me a sample; I will be happy to have a look at it.