
Use PHP DOM Parser for more robust screen scraping

December 5, 2009

I’d just like to put this out there, since I just “failed” a “do-at-home” interview assignment: implementing a screen scraper in Java or PHP. I had written screen scrapers in PHP a year or two earlier, so I did this assignment the same way, using regexes. Little did I know that the regexes would be one of the weak points of my submission; they wanted me to use a DOM parser instead. In hindsight, I should have looked into that, but it never occurred to me because regexes had worked for me in the past.

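So the moral of the story: use a DOM parser when writing screen scrapers. It should be more robust than regex parsing in most cases, because it works on the parsed structure of the page rather than on the exact text of the markup, so small changes in whitespace, quoting, or attribute order are less likely to break it. Here is an example tutorial.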
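To make that concrete, here is a minimal sketch of what the DOM approach can look like with PHP's built-in DOMDocument and DOMXPath classes. This is not the code from my assignment: the sample markup, the `products` id, the `price` class and the `scrapeProducts()` function are all made up for illustration, and a real scraper would fetch the page over HTTP instead of using an inline string.

```php
<?php
// Hypothetical sample page. Note the inconsistent quoting, the extra
// attribute, and the missing </li> on the second item: the kind of
// variation that tends to break a hand-written regex.
$html = <<<'HTML'
<html><body>
<ul id="products">
  <li class="item"><a href="/p/1">Widget</a> <span class="price">$9.99</span></li>
  <li class='item' ><a class="x" href="/p/2">Gadget</a><span class="price">$19.50</span>
</ul>
</body></html>
HTML;

// Hypothetical helper: returns one array per product found in the list.
function scrapeProducts($html)
{
    $doc = new DOMDocument();

    // Real-world HTML is rarely valid, so collect libxml's parse warnings
    // internally instead of letting them be printed.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $products = array();

    // Select by structure (list items inside the products list),
    // not by the exact characters of the markup.
    foreach ($xpath->query('//ul[@id="products"]/li') as $li) {
        $link  = $xpath->query('.//a', $li)->item(0);
        $price = $xpath->query('.//span[@class="price"]', $li)->item(0);
        if ($link === null || $price === null) {
            continue; // skip items that do not have the expected shape
        }
        $products[] = array(
            'name'  => trim($link->textContent),
            'url'   => $link->getAttribute('href'),
            'price' => trim($price->textContent),
        );
    }

    return $products;
}

$products = scrapeProducts($html);
```

With the sample markup above, `$products` should contain one entry each for Widget and Gadget, even though the two list items are written differently. The XPath queries only care about where the elements sit in the tree, which is the robustness I was missing with regexes.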