PHP parse and process HTML

I prefer using one of the native XML extensions, like DOM or XMLReader.
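To make the DOM route concrete, here is a minimal sketch that collects every link from a piece of markup. The `$html` string and the `$links` array are made up for illustration; real input would come from a file or an HTTP response, and real-world pages are usually messier than this.

```php
<?php
// Hypothetical sample markup, used only for illustration.
$html = '<html><body><a href="/a">First</a> <p>text <a href="/b">Second</a></p></body></html>';

$dom = new DOMDocument();

// Collect libxml parse warnings internally instead of emitting them,
// since real-world HTML is frequently malformed.
libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();

// Map each link's href attribute to its text content.
$links = [];
foreach ($dom->getElementsByTagName('a') as $a) {
    $links[$a->getAttribute('href')] = trim($a->textContent);
}

print_r($links);
```

Because the document is parsed into a tree first, the nested link inside the paragraph is found the same way as the top-level one.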

If you prefer a third-party library, I’d suggest not using SimpleHtmlDom, but one that actually uses DOM/libxml underneath instead of string parsing.

You can use the above for parsing HTML5, but there can be quirks due to the markup HTML5 allows. So for HTML5 you may want to consider using a dedicated parser, like html5lib.

Or use a web service like YQL or ScraperWiki.

If you want to spend some money, have a look at PHP Architects Guide to Webscraping with PHP.

Last and least recommended, you can extract data from HTML with Regular Expressions. In general, using Regular Expressions on HTML is discouraged. The snippets you will usually find on the web to match markup are brittle. In most cases they only work for one very particular piece of HTML. Once the markup changes, the Regex fails.

You can write more reliable parsers, but writing a complete and reliable custom parser with Regular Expressions is a waste of time when the aforementioned libraries already exist and do a much better and likely faster job.
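To illustrate the brittleness, here is a small sketch comparing a typical regex snippet with DOM on two equivalent versions of the same link. The markup, the pattern, and the helper names `hrefViaRegex` and `hrefViaDom` are all hypothetical, chosen only for this example.

```php
<?php
// Two hypothetical variants of the same link: only attribute order and quoting differ.
$before = '<a href="/page" class="nav">Home</a>';
$after  = "<a class='nav' href='/page'>Home</a>";

// A typical brittle pattern: assumes href comes first and is double-quoted.
$pattern = '/<a href="([^"]*)"/';

// Returns the first captured href, or null when the pattern does not match.
function hrefViaRegex(string $html, string $pattern): ?string
{
    return preg_match($pattern, $html, $m) === 1 ? $m[1] : null;
}

// Returns the href of the first <a> element found by the DOM parser, or null.
function hrefViaDom(string $html): ?string
{
    $dom = new DOMDocument();
    libxml_use_internal_errors(true); // keep parse warnings quiet
    $dom->loadHTML($html);
    libxml_clear_errors();

    $a = $dom->getElementsByTagName('a')->item(0);
    return $a instanceof DOMElement ? $a->getAttribute('href') : null;
}

var_dump(hrefViaRegex($before, $pattern)); // string(5) "/page"
var_dump(hrefViaRegex($after, $pattern));  // NULL
var_dump(hrefViaDom($before));             // string(5) "/page"
var_dump(hrefViaDom($after));              // string(5) "/page"
```

You could of course extend the pattern to handle single quotes and arbitrary attribute order, but every such fix only covers one more variation, whereas the parser already handles them all.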

Also see Parsing Html The Cthulhu Way
