How to parse large XML files with PHP

The simplest way to parse an XML file is simplexml_load_file, which converts the XML into an object. The problem is that simplexml_load_file loads the whole document into memory, which is undesirable when dealing with large XML documents.

For more about the memory consumption of simplexml_load_file, see “Get the real amount of memory allocated by the PHP – including the resource types”.

XMLReader provides a memory-efficient way to read an XML file. It is a streaming pull parser: it is very low-level and fetches the next fragment of the document only when told to do so. This makes XMLReader very memory efficient, but not very programmer friendly. Fortunately, XMLReader and SimpleXML can be combined.
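To make the "pull" idea concrete, here is a minimal sketch that walks a tiny in-memory document one node at a time. The sample string and variable names are assumptions made for illustration; they are not taken from the feed used in the test below.

```php
<?php
// Hypothetical sample document, not the article's feed.
$xmlString = '<feed><prod id="1"><name>A</name></prod><prod id="2"><name>B</name></prod></feed>';

$reader = new XMLReader();
$reader->XML($xmlString);

$seen = [];
// Each read() call advances the cursor by exactly one node; nothing is
// fetched until we ask for it.
while ($reader->read()) {
    if ($reader->nodeType === XMLReader::ELEMENT) {
        // Indent by depth so the nesting is visible in the output.
        $seen[] = str_repeat('  ', $reader->depth) . $reader->name;
    }
}
$reader->close();

echo implode("\n", $seen), "\n";
```

The loop only ever holds the current node, which is why memory use stays flat, and also why working with XMLReader alone quickly becomes tedious for nested data.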

Test

Large XML file: feed_big.xml.gz. Around 40,000 nodes, uncompressed size on disk 109 MB. The XML is very simple: lots of <prod>…</prod> nodes.

For example:


<datafeed id="xxxx" merchantId="xxxx"
merchantName="xxxxxxxxxxxxxxxxxxxxx">
  <prod id="750924782" in_stock="no" is_for_sale="yes" lang="en"
  pre_order="no" stock_quantity="0" web_offer="no">
    <brand>
      <brandName>Maxxis</brandName>
    </brand>
    <cat>
      <awCat>Cycling</awCat>
      <mCat>Wheels & Tyres > Tyres</mCat>
    </cat>
    <price curr="…">
      <buynow>…</buynow>
    </price>
    <text>
      <name>Maxxis Crossmark Tyre - LUST</name>
      <desc>Maxxis Crossmark Tyre - LUSTDesigned with World
      Champion Christoph Sauser, the CrossMark is the dramatic
      evolution of the Cross Country racing tire. The nearly
      continuous center ridge flies on hardpack, yet has enough
      spacing to grab wet roots and rocks The slightly raised
      ridge of side knobs offers cornering precision never before
      seen on a tire this fast Features:LUST TechnologyFast
      rolling center ridgeRaised side knobs for better
      corneringSize: 26" x 2.1"TPI: 120Max PSI: 60Durometer:
      70aBuy Maxxis Tyres from xxxxx, the World’s Largest Online
      Bike Store.</desc>
    </text>
    <uri>
      <awTrack>http://www.awin1.com/pclick.php?p=750924782&a=181769&m=2698</awTrack>
      <awImage>http://images2.productserve.com/?w=200&h=200&bg=white&trim=5&t=letterbox&url=media.xxxxxxxxxxxx.com%2Fis%2Fimage%2Fxxxxxxxxxxxx%2Fprod17336_Black_NE_01%3F%24productfeedlarge%24&feedId=2698&k=0a98fd83cd569b80406b92333bd3ad46e49ccb50</awImage>
      <awThumb>http://images2.productserve.com/?w=70&h=70&bg=white&trim=5&t=letterbox&url=media.xxxxxxxxxxxx.com%2Fis%2Fimage%2Fxxxxxxxxxxxx%2Fprod17336_Black_NE_01%3F%24productfeedlarge%24&feedId=2698&k=0a98fd83cd569b80406b92333bd3ad46e49ccb50</awThumb>
      <mImage>http://media.xxxxxxxxxxxx.com/is/image/xxxxxxxxxxxx/prod17336_Black_NE_01?$productfeedlarge$</mImage>
      <mLink>http://click.pump.to/fm-d0151/NY49D4IwEED~SnODUwsiKtLBxcn4sahxYWlKDY1Am7YGiPG~e2C8pcnL6717w8vVwKEKwfIiLuKu6yJZCd06JWTQppWDrJWPpGmKuBF9rz2TznjfCPdkYXCK1S8fithZZp0pkyxN10DhATxZJRQ08GydUbDAFxsKEjGFFvgSkduZUmE8meOktwOM7CyakZ2mFNn9U-SKKcLI8Xa5Tp6WqC3TKM-nTSLgp3ulVO3JbJI92f7e8RqLwr5EBT5f</mLink>
    </uri>
    <colour>Black</colour>
    <delTime>UK Free Standard Delivery - 3-4 working
    days</delTime>
  </prod>
  <prod id="750924792" in_stock="yes" is_for_sale="yes" lang="en"
  pre_order="no" stock_quantity="16" web_offer="no">
    <brand>
    </brand>
    <cat>
      <awCat>Cycling</awCat>
      <mCat>Components > Derailleurs</mCat>
    </cat>
    <price curr="…">
      <buynow>…</buynow>
    </price>
    <text>
      <name>DMR Chain Tugs</name>
      <desc>DMR Chain Tugs Available for single speed rear wheels
      – BMX or MTB- Invaluable asset for anyone who stretches
      chains or knocks their rear wheel out of alignment - CNC
      machined to fit 10mm axles - Made in the UKBuy DMR Frames
      & Forks from xxxxx, the World's Largest Online Bike
      Store.</desc>
    </text>
    <uri>
      <awTrack>http://www.awin1.com/pclick.php?p=750924792&a=181769&m=2698</awTrack>
      <awImage>http://images2.productserve.com/?w=200&h=200&bg=white&trim=5&t=letterbox&url=media.xxxxxxxxxxxx.com%2Fis%2Fimage%2Fxxxxxxxxxxxx%2Fprod216_Black_NE_01%3F%24productfeedlarge%24&feedId=2698&k=2566b3a761af626a65afe7524356054963064fc8</awImage>
      <awThumb>http://images2.productserve.com/?w=70&h=70&bg=white&trim=5&t=letterbox&url=media.xxxxxxxxxxxx.com%2Fis%2Fimage%2Fxxxxxxxxxxxx%2Fprod216_Black_NE_01%3F%24productfeedlarge%24&feedId=2698&k=2566b3a761af626a65afe7524356054963064fc8</awThumb>
      <mImage>http://media.xxxxxxxxxxxx.com/is/image/xxxxxxxxxxxx/prod216_Black_NE_01?$productfeedlarge$</mImage>
      <mLink>http://click.pump.to/fm-d0151/HY09D4IwFEX~SvPmApYgagcXWIzRwejWpSlFmtCPlBJijP~dB28899z7vjDHETgMKQUuClEsy5KrQRoXtVTJeKc-atRTrrwVRWdjtoVZmt-TKGLIQvRdyWqg0ANne0bBAD~UBwoBeHmkoBBTcMArRLHxncZ3bIf3usKK7tKuqL09SLNukydub4lRGLAyr05bVSbUGm-Dd9qliZxJq6M046jnuBb6gMqlQwl-fw__</mLink>
    </uri>
    <colour>Black</colour>
    <delTime>UK Free Standard Delivery - 3-4 working
    days</delTime>
  </prod>
  ...

Example 1: simplexml_load_file

test01.php

<?php
if(empty($argv[1]))
{
    die("Please specify xml file to parse.\n");
}

$countIx = 0;

// Loads and parses the whole (gzipped) document into memory at once.
$xml = simplexml_load_file('compress.zlib://'.$argv[1]);
if($xml === false)
{
    die('Unable to load and parse the xml file: ' . error_get_last()['message'] . "\n");
}

// In the sample above <datafeed> is the root element, which $xml already
// represents, so the <prod> children are read from it directly.
foreach($xml->prod as $element)
{
    $prod = array(
        'name' => strval($element->text->name),
        'price' => strval($element->price->buynow),
        'currency' => strval($element->price->attributes()->curr)
    );
    print_r($prod);
    echo "\n";
    $countIx++;
}

print "Number of items=$countIx\n";
print "memory_get_usage() =" . memory_get_usage()/1024 . "kb\n";
print "memory_get_usage(true) =" . memory_get_usage(true)/1024 . "kb\n";
print "memory_get_peak_usage() =" . memory_get_peak_usage()/1024 . "kb\n";
print "memory_get_peak_usage(true) =" . memory_get_peak_usage(true)/1024 . "kb\n";
print "custom memory_get_process_usage() =" . memory_get_process_usage() . "kb\n";

/**
 * Returns memory usage from /proc/<PID>/status in kilobytes (Linux only).
 * @return int|bool sum of VmRSS and VmSwap in kilobytes. On error returns false.
 */
function memory_get_process_usage()
{
    $status = file_get_contents('/proc/' . getmypid() . '/status');
    $matchArr = array();
    preg_match_all('~^(VmRSS|VmSwap):\s*([0-9]+).*$~im', $status, $matchArr);
    if(!isset($matchArr[2][0]) || !isset($matchArr[2][1]))
    {
        return false;
    }
    return intval($matchArr[2][0]) + intval($matchArr[2][1]);
}

Example 2: XMLReader and SimpleXMLElement

The right way to process a large XML file is to use XMLReader together with SimpleXMLElement (to make the programmer's life a little easier):

test02.php

<?php
if(empty($argv[1]))
{
    die("Please specify xml file to parse.\n");
}

$countIx = 0;

$xml = new XMLReader();
if(!$xml->open('compress.zlib://'.$argv[1]))
{
    die("Unable to open the xml file.\n");
}

// Skip everything until the first <prod> node (or the end of the file).
while($xml->read() && $xml->name != 'prod')
{
    ;
}

while($xml->name == 'prod')
{
    // Only the current <prod> node is parsed into a SimpleXMLElement.
    $element = new SimpleXMLElement($xml->readOuterXML());
    $prod = array(
        'name' => strval($element->text->name),
        'price' => strval($element->price->buynow),
        'currency' => strval($element->price->attributes()->curr)
    );
    print_r($prod);
    print "\n";
    $countIx++;

    // Move to the next <prod> node and release the one just processed.
    $xml->next('prod');
    unset($element);
}

print "Number of items=$countIx\n";
print "memory_get_usage() =" . memory_get_usage()/1024 . "kb\n";
print "memory_get_usage(true) =" . memory_get_usage(true)/1024 . "kb\n";
print "memory_get_peak_usage() =" . memory_get_peak_usage()/1024 . "kb\n";
print "memory_get_peak_usage(true) =" . memory_get_peak_usage(true)/1024 . "kb\n";
print "custom memory_get_process_usage() =" . memory_get_process_usage() . "kb\n";

$xml->close();

/**
 * Returns memory usage from /proc/<PID>/status in kilobytes (Linux only).
 * @return int|bool sum of VmRSS and VmSwap in kilobytes. On error returns false.
 */
function memory_get_process_usage()
{
    $status = file_get_contents('/proc/' . getmypid() . '/status');
    $matchArr = array();
    preg_match_all('~^(VmRSS|VmSwap):\s*([0-9]+).*$~im', $status, $matchArr);
    if(!isset($matchArr[2][0]) || !isset($matchArr[2][1]))
    {
        return false;
    }
    return intval($matchArr[2][0]) + intval($matchArr[2][1]);
}

Open the XML document. Since the document is gzipped, the ‘compress.zlib://’ compression wrapper is used:

$xml->open('compress.zlib://'.$argv[1]);
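The wrapper can be tried in isolation with a throwaway file. The following is only a sketch: it assumes the zlib extension is available (for gzencode), and the temporary file and its two-element content are made up for the demonstration.

```php
<?php
// Write a small gzipped XML file to a temporary location (hypothetical data).
$path = tempnam(sys_get_temp_dir(), 'feed');
file_put_contents($path, gzencode('<datafeed><prod id="1"/><prod id="2"/></datafeed>'));

$reader = new XMLReader();
// The compress.zlib:// prefix makes PHP decompress the stream on the fly,
// so the file never has to be unpacked on disk.
if (!$reader->open('compress.zlib://' . $path)) {
    die("Unable to open $path\n");
}

$count = 0;
while ($reader->read()) {
    if ($reader->nodeType === XMLReader::ELEMENT && $reader->name === 'prod') {
        $count++;
    }
}
$reader->close();
unlink($path);

echo "prod elements: $count\n";
```

Checking the return value of open() is worthwhile here, because a wrong path would otherwise only show up later as a loop that never finds a node.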

Skip all nodes until the first product is reached:

while($xml->read() && $xml->name != 'prod'){;}

When the above while loop finishes, XMLReader has either reached the first product or the end of the file. If the first product was reached, the document stream cursor is at the first product node in the XML document, and we enter the while loop below.

while($xml->name == 'prod')
{
    $element = new SimpleXMLElement($xml->readOuterXML());
    ...
    $xml->next('prod');
    unset($element);
}

XMLReader::readOuterXML() returns the current node, including its own markup, as a string, so only one node at a time is parsed. When we are finished with the node, it is destroyed with unset so that PHP can free its memory.

XMLReader::next('prod') skips the rest of the current node's subtree and jumps to the next product node.
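The read/readOuterXML/next pattern can also be wrapped in a generator so that calling code only sees a plain foreach. This is a sketch under assumptions: iterateElements() is a hypothetical helper name, not part of the scripts above, and the inline sample stands in for the real feed.

```php
<?php
/**
 * Yields one SimpleXMLElement per <$name> element found by the reader.
 * Hypothetical helper, not part of the original scripts.
 */
function iterateElements(XMLReader $reader, string $name): Generator
{
    // Advance to the first matching element (or to the end of the document).
    while ($reader->read() && $reader->name !== $name) {
    }
    while ($reader->name === $name) {
        // Parse only the current element into a SimpleXMLElement.
        yield new SimpleXMLElement($reader->readOuterXml());
        // Skip the subtree just processed and move to the next element with this name.
        $reader->next($name);
    }
}

$reader = new XMLReader();
$reader->XML(
    '<datafeed>'
    . '<prod id="1"><text><name>Tyre</name></text></prod>'
    . '<prod id="2"><text><name>Chain</name></text></prod>'
    . '</datafeed>'
);

$names = [];
foreach (iterateElements($reader, 'prod') as $prod) {
    $names[] = (string) $prod['id'] . ': ' . (string) $prod->text->name;
}
$reader->close();

print_r($names);
```

Because a generator hands out one element at a time, each SimpleXMLElement can be released as soon as the loop body moves on, which keeps the memory profile of the original loop.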

Finally, close the input that XMLReader is parsing:

$xml->close();

Comparison

method	script	memory (kb), custom memory_get_process_usage()
simplexml_load_file	test01.php	478400
XMLReader and SimpleXMLElement	test02.php	14008

XMLReader and SimpleXMLElement used about 34 times less memory (478400 kb vs. 14008 kb), and memory consumption does not depend on the size of the XML document (the number of nodes we want to process).


25 Comments

Oliver Iost 7 years ago

I used only simplexml_load_file on a very large XML file; processing took about 90 minutes.
With your solution, only 90 seconds 🙂
- drib 7 years ago
  
  Wow! That is amazing. Thank you for the feedback! 🙂
Wasid 7 years ago

I am facing an issue as my XML is much larger (500 MB). I have to check two XML tags: one is “gs-local-feed”, and under that I also have a “school” tag. When I use
while($xml->read() && $xml->name != 'gs-local-feed'){;}
while($xml->name == 'gs-local-feed'){
$element = new SimpleXMLElement($xml->readOuterXML());
…
$xml->next('gs-local-feed');
unset($element);
}
it only takes the first “school” value, but I need all the values of the “school” tag under “gs-local-feed”.
And when I use
while($xml->read() && $xml->name != 'school'){;}
while($xml->name == 'school'){
$element = new SimpleXMLElement($xml->readOuterXML());
…
$xml->next('school');
unset($element);
}
then it only gives me the values of “school” from the first “gs-local-feed” tag and then stops.
I hope I have made the issue clear. I need your help.
- drib 7 years ago
  
  Hi Wasid,
  I think I understand the problem, but I don’t see what is wrong with the first loop. The $element should contain all the content of gs-local-feed (including the schools).
  Did you try posting the question on Stack Overflow, including a small sample of the XML and a simple (but complete) test PHP script that demonstrates the problem?
  Also, is your XML valid (check it with one of the many online validators)?
  - Praveen Mourya 5 years ago
    
    Facing the same issue: after finding the first tag, it stops.
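For the nested case described in this thread, one possible approach is to keep the outer loop on the parent element and iterate the repeated children with foreach. The sketch below is only a guess at the structure: the sample XML, the <feeds> root and the <name> child are assumptions, since the real document was not posted.

```php
<?php
// Hypothetical sample shaped like the structure described in this thread.
$xmlString = '<feeds>'
    . '<gs-local-feed><school><name>A</name></school><school><name>B</name></school></gs-local-feed>'
    . '<gs-local-feed><school><name>C</name></school></gs-local-feed>'
    . '</feeds>';

$reader = new XMLReader();
$reader->XML($xmlString);

$schools = [];
while ($reader->read() && $reader->name !== 'gs-local-feed') {
}
while ($reader->name === 'gs-local-feed') {
    $feed = new SimpleXMLElement($reader->readOuterXml());
    // Reading $feed->school as a single value gives the first <school> only;
    // foreach visits every <school> child of this parent.
    foreach ($feed->school as $school) {
        $schools[] = (string) $school->name;
    }
    $reader->next('gs-local-feed');
    unset($feed);
}
$reader->close();

print_r($schools);
```

Looping on the parent also sidesteps the second symptom: next() skips the subtree of the element it lands on, so a next('school') loop is not a reliable way to reach schools nested inside later parents.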
Chairos 7 years ago

Thank you, this is a neat solution. It gets “the best of both worlds” of XMLReader and SimpleXMLElement 🙂
- drib 7 years ago
  
  Thanks! 🙂
Danny 7 years ago

It does not seem like the script is available; there are no links to download it.
- drib 7 years ago
  
  Hi, the complete example is included in the page (below the label test02.php); you can copy and paste it into your editor.
  There is only a link to download the example XML file.
Flavio Avelar Cambraia 7 years ago

It does not delete nodes. Don’t you have to save the file before closing?
- drib 7 years ago
  
  Hi, I am not sure what you mean. The article example is only about reading a large XML file.
ZedB 7 years ago

Absolutely great article. Thanks! Keep up the good work! 🙂
- drib 7 years ago
  
  Thank you! 🙂
Chris 7 years ago

Thank you, I finally understood XML parsing with your examples!
- drib 7 years ago
  
  Thx! 🙂
Tony 5 years ago

The post is almost 3 years old and it still saved my day.
Thank you so much.
Praveen Mourya 5 years ago

How can I print all the elements if one element is present more than once?
wasi 5 years ago

What if there are two sets of records and both need to be shown?

How would I use your code in that case?
- drib 5 years ago
  
  Hi,
  I am not sure that I understand the structure of your XML. Can you please provide an example of such XML?
Kim 5 years ago

Hi Damir.

This saved a lot of frustrating work in a project I am working on for a client (HUUUGE XML files). Your solution looks like it will work; it is slower but isn’t crashing the server 🙂

Great work!
- drib 5 years ago
  
  Great to hear that! Thx!
Mr D 4 years ago

You’re a lifesaver, I really appreciate your work!

My XML file of 520 MB is only 20 MB now.

Thank you
- drib 4 years ago
  
  Great! Thx for the feedback!
Mr D 4 years ago

Thank you again for your work. Is there a way to apply an offset to the XML loop?

For example, start at element 500 and end at 1000.

Thank you very much!
- drib 4 years ago
  
  Hi,
  Thx for the comment. I don’t think it would be possible with this technique, since we are reading node by node from the file without actually knowing what is ahead.
  But since this works fast enough, I would just run through the first 500 nodes and ignore them, and then process the rest till the end.
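The "run through and ignore" idea from the reply above can be sketched with a simple counter. The sample document, the $offset and $limit values and the variable names are assumptions made for illustration; a real feed would be opened with open() as in test02.php.

```php
<?php
// Hypothetical sample: 10 <prod> elements with ids 0..9.
$xmlString = '<datafeed>';
for ($i = 0; $i < 10; $i++) {
    $xmlString .= "<prod id=\"$i\"><name>P$i</name></prod>";
}
$xmlString .= '</datafeed>';

$offset = 3; // number of leading elements to skip
$limit  = 4; // number of elements to process after the offset

$reader = new XMLReader();
$reader->XML($xmlString);
while ($reader->read() && $reader->name !== 'prod') {
}

$index = 0;
$ids = [];
while ($reader->name === 'prod' && $index < $offset + $limit) {
    if ($index >= $offset) {
        // Only elements inside the window are turned into SimpleXMLElement objects.
        $element = new SimpleXMLElement($reader->readOuterXml());
        $ids[] = (string) $element['id'];
        unset($element);
    }
    $index++;
    $reader->next('prod');
}
$reader->close();

echo implode(',', $ids), "\n";
```

Skipped elements still have to be read from the stream, but they are never converted to SimpleXMLElement, so the skip is cheap, and the loop stops as soon as the window is full instead of reading to the end of the file.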