The simplest way to parse an XML file is simplexml_load_file, which converts the XML into an object. The problem is that simplexml_load_file loads the whole file into memory, which is undesirable when dealing with large XML documents.
For more information about simplexml_load_file memory consumption, see “Get the real amount of memory allocated by the PHP – including the resource types”.
XMLReader provides a memory-efficient way to read an XML file. XMLReader is a streaming pull parser: it is very low-level and fetches the next fragment of the document only when told to do so. This makes XMLReader very memory efficient, but not very programmer friendly. Fortunately, XMLReader and SimpleXML can be combined.
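To make the "pull" idea concrete, here is a minimal sketch in which the caller asks for one node at a time and decides what to do with it. The helper name count_elements is hypothetical and used only for illustration; it is not part of the article's test scripts.

```php
<?php

/**
 * Hypothetical helper (illustration only): counts the elements named $name
 * by pulling one node at a time from the stream.
 */
function count_elements(string $uri, string $name): int
{
    $reader = new XMLReader();
    if (!$reader->open($uri)) {
        throw new RuntimeException("Unable to open $uri");
    }

    $count = 0;
    // Each read() call advances the cursor by exactly one node;
    // nothing is parsed until we ask for it.
    while ($reader->read()) {
        // Count opening tags only, not the matching end tags.
        if ($reader->nodeType === XMLReader::ELEMENT && $reader->name === $name) {
            $count++;
        }
    }

    $reader->close();
    return $count;
}
```

Because the loop never builds a tree, memory use should stay roughly flat regardless of how many nodes the document contains.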
Test
Large XML file: feed_big.xml.gz. It contains around 40000 nodes and is 109 MB uncompressed on disk. The XML is very simple: a long list of <prod>…</prod> nodes.
For example:
<cafProductFeed>
  <datafeed id="xxxx" merchantId="xxxx" merchantName="xxxxxxxxxxxxxxxxxxxxx">
    <prod id="750924782" in_stock="no" is_for_sale="yes" lang="en" pre_order="no" stock_quantity="0" web_offer="no">
      <brand>
        <brandName>Maxxis</brandName>
      </brand>
      <cat>
        <awCatId>252</awCatId>
        <awCat>Cycling</awCat>
        <mCat>Wheels & Tyres > Tyres</mCat>
      </cat>
      <price curr="GBP">
        <buynow>43.99</buynow>
        <delivery>0.00</delivery>
        <rrp>53.99</rrp>
        <store>0.00</store>
      </price>
      <text>
        <name>Maxxis Crossmark Tyre - LUST</name>
        <desc>Maxxis Crossmark Tyre - LUSTDesigned with World Champion Christoph Sauser, the CrossMark is the dramatic evolution of the Cross Country racing tire. The nearly continuous center ridge flies on hardpack, yet has enough spacing to grab wet roots and rocks The slightly raised ridge of side knobs offers cornering precision never before seen on a tire this fast Features:LUST TechnologyFast rolling center ridgeRaised side knobs for better corneringSize: 26" x 2.1"TPI: 120Max PSI: 60Durometer: 70aBuy Maxxis Tyres from xxxxx, the World’s Largest Online Bike Store.</desc>
      </text>
      <uri>
        <awTrack>http://www.awin1.com/pclick.php?p=750924782&a=181769&m=2698</awTrack>
        <awImage>http://images2.productserve.com/?w=200&h=200&bg=white&trim=5&t=letterbox&url=media.xxxxxxxxxxxx.com%2Fis%2Fimage%2Fxxxxxxxxxxxx%2Fprod17336_Black_NE_01%3F%24productfeedlarge%24&feedId=2698&k=0a98fd83cd569b80406b92333bd3ad46e49ccb50</awImage>
        <awThumb>http://images2.productserve.com/?w=70&h=70&bg=white&trim=5&t=letterbox&url=media.xxxxxxxxxxxx.com%2Fis%2Fimage%2Fxxxxxxxxxxxx%2Fprod17336_Black_NE_01%3F%24productfeedlarge%24&feedId=2698&k=0a98fd83cd569b80406b92333bd3ad46e49ccb50</awThumb>
        <mImage>http://media.xxxxxxxxxxxx.com/is/image/xxxxxxxxxxxx/prod17336_Black_NE_01?$productfeedlarge$</mImage>
        <mLink>http://click.pump.to/fm-d0151/NY49D4IwEED~SnODUwsiKtLBxcn4sahxYWlKDY1Am7YGiPG~e2C8pcnL6717w8vVwKEKwfIiLuKu6yJZCd06JWTQppWDrJWPpGmKuBF9rz2TznjfCPdkYXCK1S8fithZZp0pkyxN10DhATxZJRQ08GydUbDAFxsKEjGFFvgSkduZUmE8meOktwOM7CyakZ2mFNn9U-SKKcLI8Xa5Tp6WqC3TKM-nTSLgp3ulVO3JbJI92f7e8RqLwr5EBT5f</mLink>
      </uri>
      <vertical />
      <pId>100003UK</pId>
      <colour>Black</colour>
      <delTime>UK Free Standard Delivery - 3-4 working days</delTime>
      <lastUpdated>2017-09-18 20:16:31</lastUpdated>
      <mpn>TB72545000</mpn>
    </prod>
    <prod id="750924792" in_stock="yes" is_for_sale="yes" lang="en" pre_order="no" stock_quantity="16" web_offer="no">
      <brand>
        <brandName>DMR</brandName>
      </brand>
      <cat>
        <awCatId>252</awCatId>
        <awCat>Cycling</awCat>
        <mCat>Components > Derailleurs</mCat>
      </cat>
      <price curr="GBP">
        <buynow>12.49</buynow>
        <delivery>0.00</delivery>
        <rrp>17.99</rrp>
        <store>0.00</store>
      </price>
      <text>
        <name>DMR Chain Tugs</name>
        <desc>DMR Chain Tugs Available for single speed rear wheels – BMX or MTB- Invaluable asset for anyone who stretches chains or knocks their rear wheel out of alignment - CNC machined to fit 10mm axles - Made in the UKBuy DMR Frames & Forks from xxxxx, the World's Largest Online Bike Store.</desc>
      </text>
      <uri>
        <awTrack>http://www.awin1.com/pclick.php?p=750924792&a=181769&m=2698</awTrack>
        <awImage>http://images2.productserve.com/?w=200&h=200&bg=white&trim=5&t=letterbox&url=media.xxxxxxxxxxxx.com%2Fis%2Fimage%2Fxxxxxxxxxxxx%2Fprod216_Black_NE_01%3F%24productfeedlarge%24&feedId=2698&k=2566b3a761af626a65afe7524356054963064fc8</awImage>
        <awThumb>http://images2.productserve.com/?w=70&h=70&bg=white&trim=5&t=letterbox&url=media.xxxxxxxxxxxx.com%2Fis%2Fimage%2Fxxxxxxxxxxxx%2Fprod216_Black_NE_01%3F%24productfeedlarge%24&feedId=2698&k=2566b3a761af626a65afe7524356054963064fc8</awThumb>
        <mImage>http://media.xxxxxxxxxxxx.com/is/image/xxxxxxxxxxxx/prod216_Black_NE_01?$productfeedlarge$</mImage>
        <mLink>http://click.pump.to/fm-d0151/HY09D4IwFEX~SvPmApYgagcXWIzRwejWpSlFmtCPlBJijP~dB28899z7vjDHETgMKQUuClEsy5KrQRoXtVTJeKc-atRTrrwVRWdjtoVZmt-TKGLIQvRdyWqg0ANne0bBAD~UBwoBeHmkoBBTcMArRLHxncZ3bIf3usKK7tKuqL09SLNukydub4lRGLAyr05bVSbUGm-Dd9qliZxJq6M046jnuBb6gMqlQwl-fw__</mLink>
      </uri>
      <vertical />
      <pId>10000UK</pId>
      <colour>Black</colour>
      <delTime>UK Free Standard Delivery - 3-4 working days</delTime>
      <lastUpdated>2017-09-18 20:17:47</lastUpdated>
      <mpn>DMR-CT-K</mpn>
    </prod>
    ...
Example 1: simplexml_load_file
test01.php
<?php

if (empty($argv[1])) {
    die("Please specify xml file to parse.\n");
}

$countIx = 0;

// The compress.zlib:// wrapper lets SimpleXML read the gzipped file directly.
$xml = simplexml_load_file('compress.zlib://' . $argv[1]);
if ($xml === false) {
    die('Unable to load and parse the xml file: ' . error_get_last()['message']);
}

// The whole document is already in memory at this point.
foreach ($xml->datafeed->prod as $element) {
    $prod = array(
        'name'     => strval($element->text->name),
        'price'    => strval($element->price->buynow),
        'currency' => strval($element->price->attributes()->curr)
    );

    print_r($prod);
    echo "\n";
    $countIx++;
}

print "Number of items=$countIx\n";
print "memory_get_usage() =" . memory_get_usage()/1024 . "kb\n";
print "memory_get_usage(true) =" . memory_get_usage(true)/1024 . "kb\n";
print "memory_get_peak_usage() =" . memory_get_peak_usage()/1024 . "kb\n";
print "memory_get_peak_usage(true) =" . memory_get_peak_usage(true)/1024 . "kb\n";
print "custom memory_get_process_usage() =" . memory_get_process_usage() . "kb\n";

/**
 * Returns memory usage from /proc/<PID>/status in kB (Linux only).
 *
 * @return int|bool sum of VmRSS and VmSwap in kB. On error returns false.
 */
function memory_get_process_usage()
{
    $status = file_get_contents('/proc/' . getmypid() . '/status');

    $matchArr = array();
    preg_match_all('~^(VmRSS|VmSwap):\s*([0-9]+).*$~im', $status, $matchArr);

    if (!isset($matchArr[2][0]) || !isset($matchArr[2][1])) {
        return false;
    }

    return intval($matchArr[2][0]) + intval($matchArr[2][1]);
}
Example 2: XMLReader and SimpleXMLElement
A better way to process a large XML file is to combine XMLReader with SimpleXMLElement, which makes the programmer’s life a little easier:
test02.php
<?php

if (empty($argv[1])) {
    die("Please specify xml file to parse.\n");
}

$countIx = 0;

$xml = new XMLReader();
// open() returns false on failure, so stop early instead of reading nothing.
if (!$xml->open('compress.zlib://' . $argv[1])) {
    die("Unable to open the xml file.\n");
}

// Skip everything until the first <prod> node (or the end of the file).
while ($xml->read() && $xml->name != 'prod') {
    ;
}

while ($xml->name == 'prod') {
    // Only the current <prod> node is turned into a SimpleXMLElement.
    $element = new SimpleXMLElement($xml->readOuterXML());

    $prod = array(
        'name'     => strval($element->text->name),
        'price'    => strval($element->price->buynow),
        'currency' => strval($element->price->attributes()->curr)
    );

    print_r($prod);
    print "\n";
    $countIx++;

    $xml->next('prod');
    unset($element);
}

print "Number of items=$countIx\n";
print "memory_get_usage() =" . memory_get_usage()/1024 . "kb\n";
print "memory_get_usage(true) =" . memory_get_usage(true)/1024 . "kb\n";
print "memory_get_peak_usage() =" . memory_get_peak_usage()/1024 . "kb\n";
print "memory_get_peak_usage(true) =" . memory_get_peak_usage(true)/1024 . "kb\n";
print "custom memory_get_process_usage() =" . memory_get_process_usage() . "kb\n";

$xml->close();

/**
 * Returns memory usage from /proc/<PID>/status in kB (Linux only).
 *
 * @return int|bool sum of VmRSS and VmSwap in kB. On error returns false.
 */
function memory_get_process_usage()
{
    $status = file_get_contents('/proc/' . getmypid() . '/status');

    $matchArr = array();
    preg_match_all('~^(VmRSS|VmSwap):\s*([0-9]+).*$~im', $status, $matchArr);

    if (!isset($matchArr[2][0]) || !isset($matchArr[2][1])) {
        return false;
    }

    return intval($matchArr[2][0]) + intval($matchArr[2][1]);
}
Opens the XML document. Since the document is gzipped, the ‘compress.zlib://’ compression wrapper is used:
$xml->open('compress.zlib://'.$argv[1]);
Skips all nodes until the first product is reached:
while($xml->read() && $xml->name != 'prod'){;}
When the loop above finishes, XMLReader has either reached the first product or the end of the file. If the first product was reached, the stream cursor is positioned on the first product node in the XML document, and we enter the while loop below.
while($xml->name == 'prod')
{
    $element = new SimpleXMLElement($xml->readOuterXML());
    ...
    $xml->next('prod');
    unset($element);
}
XMLReader::readOuterXML() returns the current node, including its own markup, as a string, so only one product node is parsed into a SimpleXMLElement at a time. When we are finished with the node, it is destroyed with unset so that PHP can free its memory.
XMLReader::next('prod') skips the rest of the current node’s subtree and jumps to the next product node.
Finally, close the input that XMLReader is parsing:
$xml->close();
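The steps above can also be packaged as a generator, so that calling code only sees a plain foreach. This is a sketch under assumptions rather than part of the original test scripts: the function name iterate_elements is hypothetical, and it assumes, as in the example feed, that all matching nodes are siblings under the same parent.

```php
<?php

/**
 * Hypothetical wrapper (illustration only): yields each element named $name
 * as a SimpleXMLElement, one at a time. Assumes the matching elements are
 * siblings, as the <prod> nodes are in the example feed.
 */
function iterate_elements(string $uri, string $name): Generator
{
    $reader = new XMLReader();
    if (!$reader->open($uri)) {
        throw new RuntimeException("Unable to open $uri");
    }

    try {
        // Skip ahead to the first matching node (or to the end of the file).
        while ($reader->read() && $reader->name !== $name) {
            // nothing to do, just advance
        }

        while ($reader->name === $name) {
            // Only the current node is turned into a SimpleXMLElement.
            yield new SimpleXMLElement($reader->readOuterXml());
            // Jump over the node just processed to the next sibling match.
            $reader->next($name);
        }
    } finally {
        // Runs when the loop completes or the generator is discarded.
        $reader->close();
    }
}
```

With this in place, the body of test02.php would reduce to something like foreach (iterate_elements('compress.zlib://' . $argv[1], 'prod') as $element) { ... }, and each $element becomes collectable as soon as the next iteration overwrites it.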
Comparison
method | memory (kB), custom memory_get_process_usage()
---|---
simplexml_load_file (test01.php) | 478400
XMLReader and SimpleXMLElement (test02.php) | 14008
XMLReader combined with SimpleXMLElement used about 34 times less memory (14008 kB versus 478400 kB), and its memory consumption does not depend on the size of the XML document (the number of nodes we want to process).
I used simplexml_load_file only on a very large XML – processing lasts about 90 minutes.
With your solution only 90 Seconds 🙂
Wow! That is amazing. Thank you for the feedback! 🙂
I am facing an issue because my XML is much larger (500 MB). I have to check two XML tags: one is “gs-local-feed”, and under it there is also a “school” tag. When I use
while($xml->read() && $xml->name != 'gs-local-feed'){;}
while($xml->name == 'gs-local-feed'){
$element = new SimpleXMLElement($xml->readOuterXML());
…
$xml->next('gs-local-feed');
unset($element);
}
it only takes the first “school” value, but I need all the values of the “school” tag under “gs-local-feed”,
and when I use
while($xml->read() && $xml->name != 'school'){;}
while($xml->name == 'school'){
$element = new SimpleXMLElement($xml->readOuterXML());
…
$xml->next('school');
unset($element);
}
then it only gives me the values of “school” from the first “gs-local-feed” tag and then stops.
I hope that explains the issue. I need your help.
Hi Wasid,
I think I understand the problem, but I don’t see what is wrong with the first loop. The $element should contain all the content of gs-local-feed (including the schools).
Did you try posting the question to Stack Overflow, including a small sample of the XML and a simple (but complete) test PHP script that demonstrates the problem?
Also, is your XML valid (check it with one of the many online validators)?
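One possible explanation for the second loop, offered as an assumption since no sample XML was posted: next('school') steps over the whole subtree of every node it passes, so “school” elements nested inside a later “gs-local-feed” sibling are never visited. A sketch that descends with read() instead, using the hypothetical helper name iterate_nested:

```php
<?php

/**
 * Hypothetical helper (illustration only): yields every element named $name
 * at any depth, including matches that live under different parent elements.
 */
function iterate_nested(string $uri, string $name): Generator
{
    $reader = new XMLReader();
    if (!$reader->open($uri)) {
        throw new RuntimeException("Unable to open $uri");
    }

    try {
        $more = $reader->read();
        while ($more) {
            if ($reader->nodeType === XMLReader::ELEMENT && $reader->name === $name) {
                yield new SimpleXMLElement($reader->readOuterXml());
                // Skip the subtree that was just handed out.
                $more = $reader->next();
            } else {
                // Not a match: step into it so nested matches are found.
                $more = $reader->read();
            }
        }
    } finally {
        $reader->close();
    }
}
```

The trade-off is that read() visits every node between matches, so this is somewhat slower than next('school') on documents where all matches are siblings.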
I am facing the same issue: after finding the first tag it stops.
Thank you, this is a neat solution. It gets “the best of both worlds” of XMLReader and SimpleXMLElement 🙂
Thanks! 🙂
It does not seem like the script is available, there are no links to download
Hi, the complete example is included in the page (below the label test02.php); you can copy and paste it into your editor.
There is only a link to download the example XML file.
It does not delete nodes. Don’t you have to save the file before close?
Hi, I am not sure what you mean. The article example is only about reading a large XML file.
Absolutely great piece of article. Thanks! Keep up the good work! 🙂
Thank you! 🙂
Thank you I finally understood XML parsing with your examples!
Thx! 🙂
The post is almost 3 years old and it still saved my day.
Thank you so much.
How can I print all the elements if one element is present more than once?
What if there are two sets of records and both need to be shown?
How would I use your code in that case?
Hi,
I am not sure that I understand the structure of your XML. Can you please provide an example of such XML?
Hi Damir.
This saved a lot of frustrating work in a project I am working on for a client (HUUUGE XML files). Your solution looks like it will work; it is slower, but it isn’t crashing the server 🙂
Great work!
Great to hear that! Thx!
You’re a life saver, I really appreciate your work!
My XML file of 520 MB takes only 20 MB now.
Thank you
Great! Thx for the feedback!
Thank you again for your work, is there a way to make an offset of the XML loop?
For example start at element 500 and end at 1000.
Thank you very much!
Hi,
Thx for the comment. I don’t think it would be possible with this technique, since we are reading node by node from the file without actually knowing what is ahead.
But since this works fast enough, I would just run through the first 500 nodes and ignore them, and then process the rest till the end.
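A sketch of that idea, with a hypothetical function name iterate_slice and hypothetical $offset and $limit parameters: skipped nodes are stepped over with next() and never turned into SimpleXMLElement objects, and reading stops once the requested window has been delivered.

```php
<?php

/**
 * Hypothetical helper (illustration only): yields the elements named $name
 * starting at position $offset (0-based), at most $limit of them.
 * Assumes the matching elements are siblings, as in the example feed.
 */
function iterate_slice(string $uri, string $name, int $offset, int $limit): Generator
{
    $reader = new XMLReader();
    if (!$reader->open($uri)) {
        throw new RuntimeException("Unable to open $uri");
    }

    try {
        // Skip ahead to the first matching node.
        while ($reader->read() && $reader->name !== $name) {
            // nothing to do, just advance
        }

        $index = 0;
        while ($reader->name === $name && $index < $offset + $limit) {
            if ($index >= $offset) {
                // Inside the window: build the SimpleXMLElement.
                yield new SimpleXMLElement($reader->readOuterXml());
            }
            // Before the window we only step over the node.
            $index++;
            $reader->next($name);
        }
    } finally {
        $reader->close();
    }
}
```

For the range in the question, iterate_slice($file, 'prod', 500, 500) would cover elements 500 to 999 (counting from 0); the first 500 nodes still have to be read from the stream, but they are not converted or kept.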