Parse HTML with preg_match and preg_match_all

For most web developers who already use preg_match, preg_match_all is a small step, but for everyone else it may be hard to understand. The biggest difference between preg_match_all and the regular preg_match is that preg_match_all stores all matched values in a multi-dimensional array, so it can hold any number of matches, whereas preg_match stops after the first one. The first part of this preg_match_all tutorial shows how to “collect” the image source values inside a web page. For many other parts of an HTML document the preg_match function is more useful, which is why I added two other examples: a PHP backlink checker and an example that extracts the title and META description from a web page.
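To make the difference concrete, here is a minimal sketch that runs both functions with the same pattern. The sample markup and the href pattern are made up for illustration and are not part of the tutorial's examples.

```php
<?php
// Hypothetical sample input, not taken from the tutorial.
$html = '<a href="/one">One</a> <a href="/two">Two</a>';
$pattern = '/href="([^"]+)"/i';

// preg_match() stops after the first match; $first is a flat array:
// $first[0] is the full match, $first[1] the first capture group.
preg_match($pattern, $html, $first);

// preg_match_all() collects every match; $all is indexed by group,
// then by match number, and the return value is the match count.
$count = preg_match_all($pattern, $html, $all);

print_r($first);
print_r($all);
```

With this input, $first[1] holds only /one, while $all[1] holds both /one and /two.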

preg_match_all tutorial

Let’s take a closer look at the regular expression pattern:

"/src=[\"']?([^\"']?.*(png|jpg|gif))[\"']?/i"

The first part (src=[\"']?) and the last part ([\"']?) search for everything that starts with src= followed by an optional single or double quote, and that ends with an optional single or double quote. This could be a long string because the outer rule is very broad. Next, look at the rule that starts inside the first bracket:

"/src=[\"']?([^\"']?.*(png|jpg|gif))[\"']?/i"

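Inside the long string from the outer rule, [^\"']?.* matches one optional character that is not a single or double quote, followed by any characters. Because .* is greedy, it runs to the last place on the line where the rest of the pattern can still match. The last part inside the inner brackets is the magic: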

"/src=[\"']?([^\"']?.*(png|jpg|gif))[\"']?/i"

Finally, (png|jpg|gif) requires the captured string to end with one of these file extensions. If there is a match, preg_match_all retrieves all the matching paths from the HTML file.

I need all the rules to isolate the string parts (image paths) from the rest of the HTML. The result looks like this (access the array $images with these indexes, or just use print_r($images)):

$images[0][0] -> src="/images/english.gif"
$images[1][0] -> /images/english.gif
$images[2][0] -> gif

Index [1] holds the information I need. Try this preg_match_all example with other parts of HTML code and experiment for a better understanding.
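The walk-through above can be tried with a short script like the following. It is only a sketch: the sample markup is invented, and the stricter pattern in $strict is my own assumption, not the pattern from this tutorial. It is included to show what the greedy .* does when two tags share one line.

```php
<?php
// Hypothetical sample markup; each <img> sits on its own line.
$html = <<<HTML
<p><img src="/images/english.gif" alt="English"></p>
<p><img src='/images/dutch.png' alt="Dutch"></p>
<p><img src="/images/logo.svg" alt="Logo"></p>
HTML;

// The pattern discussed above, in a single-quoted PHP string so that only
// the single quote inside the character classes needs escaping.
$pattern = '/src=["\']?([^"\']?.*(png|jpg|gif))["\']?/i';
$found = preg_match_all($pattern, $html, $images);

// $images[0] = full matches, $images[1] = paths, $images[2] = extensions.
foreach ($images[1] as $i => $path) {
    echo $path, ' (', $images[2][$i], ")\n";
}

// Caveat: .* is greedy, so two tags on one line end up in a single match.
$oneLine = '<img src="a.png"><img src="b.gif">';
preg_match_all($pattern, $oneLine, $greedy);

// A stricter variant (an assumption, not the tutorial's pattern) that
// stops at the closing quote and requires a dot before the extension.
$strict = '/src=["\']([^"\']+\.(?:png|jpg|gif))["\']/i';
preg_match_all($strict, $oneLine, $tight);

print_r($greedy[1]);
print_r($tight[1]);
```

The loop prints the .gif and .png paths and skips the .svg file. On the one-line input, $greedy[1] contains one long string that covers both tags, while $tight[1] contains a.png and b.gif as separate entries.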

A PHP link checker “powered” by preg_match

This function is useful for people who run a link directory website where the links are stored in a (MySQL) database. The PHP script checks whether a reciprocal (external) link still exists on a specific web page. It takes care of the trailing slash in URLs and can be used with (almost) every URL. Inside the function, preg_match is used to check for the link against a regular expression.
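A minimal sketch of such a check might look like this. The function name has_backlink, its signature, and the sample page are hypothetical; the sketch works on an HTML string that is assumed to have been fetched already (for example with file_get_contents), and it is not the original script.

```php
<?php
/**
 * Returns true if $html contains an <a> tag whose href points at $url.
 * Hypothetical helper; the name and signature are assumptions.
 */
function has_backlink(string $html, string $url): bool
{
    // Drop a trailing slash from the URL we are looking for and escape
    // regex metacharacters, using ~ as the pattern delimiter.
    $base = preg_quote(rtrim($url, '/'), '~');

    // Accept the link with or without a trailing slash, and require a
    // quote, whitespace or > right after it so longer URLs do not match.
    $pattern = '~<a\s[^>]*href=["\']?' . $base . '/?["\'\s>]~i';

    return preg_match($pattern, $html) === 1;
}

// Hypothetical page content.
$page = '<p>Partner: <a class="ext" href="https://example.com/">Example</a></p>';

var_dump(has_backlink($page, 'https://example.com'));   // bool(true)
var_dump(has_backlink($page, 'https://example.org/'));  // bool(false)
```

Using preg_quote keeps the dots and other special characters in the URL from being read as regex syntax, which matters when the URLs come straight from a database.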

Extract Page title and META description from a webpage

With this script it’s possible to fetch the first part of a remote file and parse its HTML elements in a local script. The title element and the meta description are parsed with the preg_match() function. Additional HTML head elements can be handled by adding some extra rules and regex patterns. The script reads only the first part (the head) of a remote file for better performance.
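The idea can be sketched roughly as follows. The helper name extract_head_info, the sample markup, and the 4096-byte limit in the commented-out fetch are all assumptions for illustration; the sketch also assumes that name="description" comes directly before content="..." in the meta tag.

```php
<?php
/**
 * Extracts the <title> text and the meta description from an HTML fragment.
 * Hypothetical helper; entries stay null when nothing is found.
 */
function extract_head_info(string $html): array
{
    $info = ['title' => null, 'description' => null];

    // The s modifier lets . match newlines inside a multi-line <title>.
    if (preg_match('~<title[^>]*>(.*?)</title>~is', $html, $m)) {
        $info['title'] = trim($m[1]);
    }

    // Group 1 captures the opening quote; \1 requires the same closing quote.
    $meta = '~<meta\s+name=["\']description["\']\s+content=(["\'])(.*?)\1~is';
    if (preg_match($meta, $html, $m)) {
        $info['description'] = trim($m[2]);
    }

    return $info;
}

// Reading only the first bytes of a remote page could look like this
// (the URL and the 4096-byte limit are assumptions):
// $head = file_get_contents('https://example.com/', false, null, 0, 4096);

// Hypothetical sample input used instead of a remote request.
$head = "<html><head>\n<title> Demo page </title>\n"
      . "<meta name=\"description\" content=\"A short summary.\">\n</head>";

print_r(extract_head_info($head));
```

Other head elements, such as the meta keywords, could be added with one more preg_match call per element.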

7 thoughts on “Parse html with preg_match and preg_match_all”

      • That’s not an answerable question, because invalid HTML might or might not create a valid DOM tree. In general, it is not possible to parse “invalid HTML” because there are too many ways for it to be invalid.

        Regexps can’t work to parse valid HTML because its grammar is too complex. Most of the complexity happens because attributes and other pieces of HTML can appear in any order. If there are n ordered items, then the complexity of the grammar is proportional to two to the n, which is exponential in size.

  1. I see there are some complaints about the examples used for this tutorial. The purpose of this tutorial was to show how to use these functions. Parsing HTML is maybe not the best thing you can do with preg_match and preg_match_all, but it’s much easier to understand (in my opinion). I’m sure that someone who is looking for a tool that can parse HTML in a real case will try the tools mentioned by the comment authors.

    @David, I’ve used the Simple HTML DOM Parser in several projects, and yes, this tool was able to parse “some” invalid HTML as well. Maybe not all invalid structures, but it worked very well.
