How to parse large XML files with PHP

The simplest way to parse an XML file is simplexml_load_file, which converts the XML into an object. The problem with simplexml_load_file is that it parses the whole file into memory, which is undesirable when we are dealing with large XML documents.

For more information about simplexml_load_file memory consumption, see “Get the real amount of memory allocated by the PHP – including the resource types”.

XMLReader provides a memory-efficient way to read an XML file. It is a streaming pull parser, which means it is very low-level and fetches the next fragment of the document only when it is told to do so. This makes XMLReader very memory efficient, but not very programmer friendly. Fortunately, XMLReader and SimpleXML can be combined.

Test

Large XML file: feed_big.xml.gz. It contains around 40,000 nodes and takes 109 MB on disk when uncompressed. The XML is very simple: a long list of <prod>…</prod> nodes.

For example:

 

Example 1: simplexml_load_file

test01.php
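
The listing for test01.php is not reproduced here. As a minimal sketch of the whole-document approach, the snippet below builds a tiny sample file and loads it in one go. The `<products>` root and the `<name>` child element are assumptions made for illustration; they are not taken from feed_big.xml.

```php
<?php
// Build a tiny sample file; the <products> root and the <name> child
// are assumptions for illustration, not the real feed structure.
$path = tempnam(sys_get_temp_dir(), 'feed');
file_put_contents(
    $path,
    '<products><prod><name>A</name></prod><prod><name>B</name></prod></products>'
);

// simplexml_load_file() parses the entire document into memory at once.
$xml = simplexml_load_file($path);
if ($xml === false) {
    unlink($path);
    exit("Could not parse $path\n");
}

$names = [];
foreach ($xml->prod as $prod) {
    // Each $prod is a SimpleXMLElement; cast a child to string to read its text.
    $names[] = (string) $prod->name;
}

unlink($path);
echo implode(', ', $names), "\n";
```

With a 109 MB input, the single simplexml_load_file call is where all the memory goes, because every node exists as an object before the loop even starts.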

Example 2: XMLReader and SimpleXMLElement

The right way to process a large XML file is to use XMLReader together with SimpleXMLElement (to make the programmer's life a little easier):

test02.php

Opens the XML document. Since the document is gzipped, the ‘compress.zlib://’ compression wrapper is used:
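
A minimal sketch of this step, assuming the zlib extension is available. It creates its own small gzipped sample (the temporary file and the `<products>` root are illustrative assumptions) so that it can be run without feed_big.xml.gz:

```php
<?php
// Create a small gzipped sample so the sketch is self-contained.
// The XML structure here is an assumption made for illustration.
$path = tempnam(sys_get_temp_dir(), 'feed');
file_put_contents($path, gzencode('<products><prod><name>A</name></prod></products>'));

$reader = new XMLReader();
// The compress.zlib:// wrapper decompresses the stream on the fly,
// so the file never has to be unpacked on disk.
if ($reader->open('compress.zlib://' . $path) === false) {
    unlink($path);
    exit("Could not open $path\n");
}

// Nothing is parsed until read() is called; the first read() lands on the root element.
$reader->read();
$rootName = $reader->name;

$reader->close();
unlink($path);
echo $rootName, "\n";
```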

Skips all nodes until the first product is reached:

When the while loop above finishes, XMLReader has reached either the first product or the end of the file. If the first product was reached, the document stream cursor is at the first product node in the XML document, and we enter the while loop below.
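
The skipping step can be sketched as follows. To keep the example short it parses an inline string instead of the gzipped feed, and the `<info>` element in front of the first product is an invented placeholder for whatever precedes the products in a real document:

```php
<?php
// Inline sample; the <products> root and the <info> element are illustrative assumptions.
$reader = new XMLReader();
$reader->XML('<products><info>x</info><prod><name>A</name></prod></products>');

// read() advances one node at a time and returns false at the end of input,
// so the loop stops at the first <prod> or when the document is exhausted.
while ($reader->read() && $reader->name !== 'prod') {
    // Intentionally empty: we only want to move the cursor.
}

// Distinguish "cursor is on a <prod> element" from "end of file reached".
$found = ($reader->nodeType === XMLReader::ELEMENT && $reader->name === 'prod');
echo $found ? "At first <prod>\n" : "No <prod> found\n";
$reader->close();
```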

XMLReader::readOuterXML() returns the current node, including its own tags, as a string, so only one node at a time is parsed. When we are finished with the node, it is destroyed with unset so that PHP can free its memory.

XMLReader::next() skips the rest of the current node and jumps to the next product node.
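
Put together, the processing loop might look like the sketch below. It again uses a small inline document, and the `<name>` child element is an assumption made for illustration:

```php
<?php
$reader = new XMLReader();
$reader->XML(
    '<products><prod><name>A</name></prod><prod><name>B</name></prod></products>'
);

// Move the cursor to the first <prod>.
while ($reader->read() && $reader->name !== 'prod') {
}

$names = [];
while ($reader->name === 'prod') {
    // Only this one <prod> subtree is turned into a SimpleXMLElement.
    $node = new SimpleXMLElement($reader->readOuterXml());
    $names[] = (string) $node->name;

    // Drop the reference so the node can be freed before the next one is built.
    unset($node);

    // next('prod') skips the current subtree and moves to the next <prod>;
    // when none is left, the name no longer matches and the loop ends.
    $reader->next('prod');
}

$reader->close();
echo implode(', ', $names), "\n";
```

Because each SimpleXMLElement is built from a single node and released before the next one is read, only one product is held in memory at any moment.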

Finally, close the input that XMLReader is parsing:
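
One way to make sure the input is always closed is to wrap the steps in a generator with try/finally. This is a design variation rather than the article's test02.php: `streamElements()` is a hypothetical helper name, the sample data is invented, and the type declarations assume PHP 7 or later.

```php
<?php
/**
 * Yields each <$tag> element of $uri as a SimpleXMLElement, one at a time.
 * streamElements() is a hypothetical helper, not part of PHP.
 */
function streamElements(string $uri, string $tag): Generator
{
    $reader = new XMLReader();
    if ($reader->open($uri) === false) {
        throw new RuntimeException("Could not open $uri");
    }
    try {
        // Skip to the first matching element.
        while ($reader->read() && $reader->name !== $tag) {
        }
        // Hand out one element at a time, then advance to the next match.
        while ($reader->name === $tag) {
            yield new SimpleXMLElement($reader->readOuterXml());
            $reader->next($tag);
        }
    } finally {
        // Runs whether iteration finishes normally or an exception is thrown.
        $reader->close();
    }
}

// Usage with a small gzipped sample (structure assumed for illustration).
$path = tempnam(sys_get_temp_dir(), 'feed');
file_put_contents(
    $path,
    gzencode('<products><prod><name>A</name></prod><prod><name>B</name></prod></products>')
);

$count = 0;
foreach (streamElements('compress.zlib://' . $path, 'prod') as $prod) {
    $count++;
}
unlink($path);
echo $count, "\n";
```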

Comparison

Method                          Script      Memory (KB), custom memory_get_process_usage()
simplexml_load_file             test01.php  478400
XMLReader and SimpleXMLElement  test02.php  14008

XMLReader and SimpleXMLElement used roughly 34 times less memory (14008 KB versus 478400 KB), and its memory consumption does not depend on the size of the XML document (the number of nodes we want to process).
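
The custom memory_get_process_usage() used for these numbers is not defined in this post. As a rough stand-in, PHP's built-in memory_get_peak_usage() can be called at the end of a script. It only reports memory handled by PHP's own allocator, so its figures should not be expected to match the ones above. The workload below is a placeholder, and intdiv() assumes PHP 7 or later.

```php
<?php
// Stand-in workload; replace with the parsing code being measured.
$data = range(1, 100000);

// Passing true asks for the memory reserved from the system,
// not only the part currently in use.
$peakKb = intdiv(memory_get_peak_usage(true), 1024);
echo "Peak memory: {$peakKb} KB\n";
```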

25 Comments

  1. I used simplexml_load_file only on a very large XML – processing lasts about 90 minutes.
    With your solution only 90 Seconds 🙂

  2. I am facing an issue, as my XML is much larger (500 MB). I have to check two XML tags: one is “gs-local-feed”, and under that I also have a “school” tag. When I use
    while($xml->read() && $xml->name != 'gs-local-feed'){;}
    while($xml->name == 'gs-local-feed'){
    $element = new SimpleXMLElement($xml->readOuterXML());

    $xml->next('gs-local-feed');
    unset($element);
    }
    it only takes the first “school” value, but I need all the values of the “school” tag under “gs-local-feed”.
    And when I use
    while($xml->read() && $xml->name != 'school'){;}
    while($xml->name == 'school'){
    $element = new SimpleXMLElement($xml->readOuterXML());

    $xml->next('school');
    unset($element);
    }
    then it only gives me all the values of “school” from the first “gs-local-feed” tag and stops the process.
    I hope I have made the issue understandable. I need your help.

    • Hi Wasid,
      I think I do understand the problem. But I don’t see what is wrong with the first loop. The $element should contain all the content of gs-local-feed (including the schools).
      Did you try posting the question to Stack Overflow, including a small sample of the XML and a simple (but complete) test PHP script which demonstrates the problem?
      Also, is your XML valid (check it with one of many online validators)?

      • facing the same issue, after find the first tag it stops

  3. Thank you, this is a neat solution. It gets “the best of both worlds” of XMLReader and SimpleXMLElement 🙂

  4. It does not seem like the script is available, there are no links to download

    • Hi, the complete example is included in the page (below label test02.php), you can copy and paste it to your editor.
      There is only link to download example xml file.

    • Hi, I am not sure what do you mean. The article example is only about reading the large xml file.

  5. Absolutely great piece of article. Thanks! Keep up the good work! 🙂

  6. The post is almost 3 years old and it still saved my day.
    Thank you so much.

  7. How can I print all the elements if one element is present more than once?

  8. what if there are two sets of records and both needs to show

    how i will use your code in that case

    • Hi,
      I am not sure that I understand the structure of your xml. Can you please provide a example of such xml?

  9. Hi Damir .

    This saved a lot of frustrating work in a project I am working on for a client (HUUUGE XML files). Your solution look like it will work, it is slower but isn’t crashing the server 🙂

    Great work!

  10. You’re a life saver, I really appreciate your work!

    My XML file of 520 mb is 20 mb only now.

    Thank you

  11. Thank you again for your work, is there a way to make an offset of the XML loop?

    For example start at element 500 and end at 1000.

    Thank you very much!

    • Hi,
      Thx for the comment. I don’t think it would be possible with this technique since we are reading node by node from the file – without actually knowing what is ahead.
      But since this is working fast enough, I would just run through the first 500 nodes and ignore them, and then process the rest till the end.
