Parse HTML with preg_match and preg_match_all

For most web developers who already use the function preg_match, the function preg_match_all is only a small step further, but for everyone else it may be hard to understand. The biggest difference between preg_match_all and the regular preg_match is that all matched values are stored inside a multi-dimensional array, so it can hold an unlimited number of matches. The first part of this preg_match_all tutorial is about how to “collect” the image source values inside a web page. For many other parts of an HTML document the preg_match function is more useful, which is why I added two other examples: a PHP backlink checker and an example that extracts the title and META description from a webpage.
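To make the difference concrete, here is a minimal sketch. The $html sample string and the simplified pattern are made up for illustration; only the two functions themselves come from this tutorial.

```php
<?php
// Hypothetical sample markup, not taken from a real page.
$html = '<img src="/images/a.png"><img src="/images/b.gif">';
$pattern = '/src="([^"]+)"/i';

// preg_match stops after the first match; $first is a flat array.
preg_match($pattern, $html, $first);

// preg_match_all keeps searching; $all is a multi-dimensional array
// and the return value is the number of full matches.
$count = preg_match_all($pattern, $html, $all);

echo $first[1], "\n";               // /images/a.png
echo $count, "\n";                  // 2
echo implode(", ", $all[1]), "\n";  // /images/a.png, /images/b.gif
```

With preg_match the second image is never seen, while preg_match_all returns one sub-array per capture group with an entry for every match.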

preg_match_all tutorial

$data = file_get_contents("https://www.finalwebsites.com");
$pattern = "/src=[\"']?([^\"']?.*(png|jpg|gif))[\"']?/i";
preg_match_all($pattern, $data, $images);

Let’s take a closer look at the regular expression pattern:

"/src=[\"']?([^\"']?.*(png|jpg|gif))[\"']?/i"

The first part (src=[\"']?) and the last part ([\"']?) match everything that starts with src= and ends with an optional single or double quote. This can be a long string, because the .* in the middle is greedy and the outer rule is therefore very broad. Next, look at where the rule inside the first parentheses starts:

"/src=[\"']?([^\"']?.*(png|jpg|gif))[\"']?/i"

Inside the long string from the outer rule, the part [^\"']?.* matches one optional character that is not a quote, followed by any characters. The last part inside the inner parentheses is the magic:

"/src=[\"']?([^\"']?.*(png|jpg|gif))[\"']?/i"

The group (png|jpg|gif) requires the string to end with one of these file extensions. If there is a match, preg_match_all retrieves all the paths from the HTML file.

I need all the rules to isolate the string parts (image paths) from the rest of the HTML. The result looks like this (access the array $images with these indexes, or just use print_r($images)):

$images[0][0] -> src="/images/english.gif"
$images[1][0] -> /images/english.gif
$images[2][0] -> gif

The index [1] holds the information I need. Try this preg_match_all example with other parts of HTML code and experiment for a better understanding.
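Because .* is greedy, the pattern from this tutorial can swallow more than one tag when several images sit on the same line. The following sketch shows one possible tightening; it is not the pattern used above. It only allows characters that are not quotes, spaces or > inside the path, and it uses the PREG_SET_ORDER flag so that each match is grouped together. The sample $html is hypothetical.

```php
<?php
// Hypothetical sample: two images on one line, with different quote styles.
$html = '<p><img src="/images/english.gif" alt=""> <img src=\'/images/logo.png\'></p>';

// The path may not contain quotes, whitespace or ">", so a match cannot
// run past the end of its own attribute value.
$pattern = '/src=["\']?([^"\'\s>]+\.(png|jpg|gif))["\']?/i';

// PREG_SET_ORDER returns one array per match instead of one array per group.
preg_match_all($pattern, $html, $images, PREG_SET_ORDER);

$paths = [];
foreach ($images as $match) {
    // $match[0] is the full match, $match[1] the path, $match[2] the extension.
    $paths[] = $match[1];
}
print_r($paths);
```

Whether the stricter character class is acceptable depends on the markup: unquoted paths containing spaces, for example, would no longer match.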

A PHP link checker “powered” by preg_match

This function is useful for people with a link directory website where the links are stored inside a (MySQL) database. The PHP script below checks whether a reciprocal (external) link still exists on a specific webpage. The script takes care of the trailing slash in URLs and can be used with (almost) every URL. Inside the function, preg_match is used to check for the link against a regular expression.

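function check_back_link($remote_url, $your_link) {
    // Escape the link for the regex; the trailing slash is matched optionally below.
    $match_pattern = preg_quote(rtrim($your_link, "/"), "/");
    $found = false;
    if ($handle = @fopen($remote_url, "r")) {
        $buffer = "";
        while (!feof($handle)) {
            // Keep the last 1024 bytes already read so a link split across two reads is still found.
            $buffer = substr($buffer, -1024).fread($handle, 1024);
            if (preg_match("/<a(.*)href=[\"']".$match_pattern.
"(\/?)[\"'](.*)>(.*)<\/a>/", $buffer)) {
                $found = true;
                break;
            }
        }
        fclose($handle);
    }
    return $found;
}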
// example:
//if (check_back_link("https://www.web-development-blog.com", "https://www.finalwebsites.com")) echo "link exists";
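The role of rtrim() and preg_quote() is easier to see without a network request. The sketch below builds a simplified pattern of the same kind (it only looks at the opening a tag) and tests it against made-up HTML fragments. build_link_pattern() is a helper name invented for this illustration.

```php
<?php
// Hypothetical helper: builds a pattern similar to the one in check_back_link().
function build_link_pattern($your_link) {
    // rtrim() drops a trailing slash; preg_quote() escapes ".", "/" and
    // the other characters that have a special meaning in a regex.
    $quoted = preg_quote(rtrim($your_link, "/"), "/");
    // "\/?" makes the trailing slash optional on the remote page.
    return "/<a[^>]*href=[\"']" . $quoted . "\/?[\"'][^>]*>/i";
}

$pattern = build_link_pattern("https://www.finalwebsites.com/");

// Made-up fragments of a remote page.
$with_slash    = '<a href="https://www.finalwebsites.com/">Final Websites</a>';
$without_slash = "<a class='ext' href='https://www.finalwebsites.com'>Final Websites</a>";
$other_site    = '<a href="https://wwwXfinalwebsites.com">Not the same host</a>';

var_dump(preg_match($pattern, $with_slash));    // int(1)
var_dump(preg_match($pattern, $without_slash)); // int(1)
var_dump(preg_match($pattern, $other_site));    // int(0)
```

Without preg_quote() the dots in the URL would match any character, so the third fragment would be reported as a valid backlink.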


Extract Page title and META description from a webpage

With this script it’s possible to fetch the first part of a remote file and parse its HTML elements in a local script. The title element and the meta description are extracted with the preg_match() function. Additional HTML head elements can be handled by adding some extra rules and regex patterns. For better performance, the script reads only the first part (the head) of the remote file.

<?php
$page_title = "n/a";
$meta_descr = "n/a";
	
if ($handle = fopen("https://www.finalwebsites.com", "r")) {
	$content = '';
	while (!feof($handle)) {
		$content .= fread($handle, 1024);
		// Test the whole buffer: the closing head tag can be split across two reads.
		if (stripos($content, '</head>') !== false) break;
	}
	fclose($handle);
	$lines = preg_split('/\r?\n|\r/', $content);
	$is_title = false;
	$is_descr = false;
	foreach ($lines as $val) {
		// The title element is expected on a single line.
		if (preg_match('/<title>(.*)<\/title>/i', $val, $title)) {
			$page_title = $title[1];
			$is_title = true;
		}
		// (.*?) stops at the first quote that closes the tag, not the last one on the line.
		if (preg_match('/<meta name="description" content="(.*?)"\s?\/?>/i', $val, $descr)) {
			$meta_descr = $descr[1];
			$is_descr = true;
		}
		if ($is_title && $is_descr) break;
	}
	}
}
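
Adding further head elements comes down to adding one pattern per element. The sketch below works on a hypothetical $head string instead of a remote file. The keywords pattern and the extract_head_fields() helper are assumptions added for illustration, and like the script above it expects the name attribute before the content attribute, both in double quotes.

```php
<?php
// Hypothetical helper: returns the fields found in a chunk of HTML head markup.
function extract_head_fields($head) {
    // One pattern per element; the first capture group holds the value.
    $patterns = [
        'title'       => '/<title>(.*?)<\/title>/is',
        'description' => '/<meta\s+name="description"\s+content="(.*?)"\s*\/?>/is',
        'keywords'    => '/<meta\s+name="keywords"\s+content="(.*?)"\s*\/?>/is',
    ];
    $fields = [];
    foreach ($patterns as $name => $pattern) {
        // Fall back to "n/a" when nothing matches.
        $fields[$name] = preg_match($pattern, $head, $m) ? trim($m[1]) : "n/a";
    }
    return $fields;
}

// Made-up markup for the demonstration.
$head = "<head>\n<title>Example page</title>\n"
      . "<meta name=\"description\" content=\"A short summary.\" />\n</head>";

$fields = extract_head_fields($head);
print_r($fields);
```

Here the sample has no keywords element, so that field keeps the "n/a" fallback while the other two are filled in.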

Published in: PHP Scripts

11 Comments

      1. Yes, you *can* use regex to parse HTML. In some circumstances it might work. My point was it’s probably not the best example to use to demo regex…

      1. That’s not an answerable question, because invalid HTML might or might not create a valid DOM tree. In general, it is not possible to parse “invalid HTML” because there are too many ways for it to be invalid.

        Regexps can’t work to parse valid HTML because its grammar is too complex. Most of the complexity happens because attributes and other pieces of HTML can appear in any order. If there are n ordered items, then the complexity of the grammar is proportional to two to the n, which is exponential in size.

  1. I see there are some complaints about the examples used for this tutorial. The reason for this tutorial was to understand how to use these functions. Parsing HTML is maybe not the best thing you can do with preg_match and preg_match_all, but it’s much easier to understand (in my opinion). I’m sure that someone who is looking for a tool that can parse HTML in a real case will try the tools mentioned by the comment authors.

    @David, I’ve used the Simple HTML DOM Parser in several projects, and yes, this tool was able to parse “some” invalid HTML as well. Maybe not all invalid structures, but it worked very well.

  2. How to check for an exact string in a file’s contents?
    $string = "C0DB-9700-WP";

    The impossible entries are
    C0DB-9700-W
    C0DB-9700-
    C0DB-9700
    C0DB-

    I have tried

    if(!preg_match('/\b'.$string.'\b/',$file)){
    echo "error";
    }

    But it is not validating all the above conditions.
    How to validate all the above entries?

    1. Hi Ann,

      this is not how it works, you need to “load” the file first into a string (with file_get_contents() for example).
      You know that using this function is a task for a programmer?

      1. Hi Olaf Lederer,
        I have added file_get_contents() and all. The $file contains the path to the file with file_get_contents(). I didn’t paste my entire code here. I just added only the required line here.

      2. You need to load the file first into a string or array. After that you can check the string against the regex pattern. Maybe you need to use a loop if your data is stored in an array. It’s not that simple.
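
One likely reason the check in the question above accepts the shortened entries: \b treats the hyphen as a word boundary, so C0DB-9700 still matches inside C0DB-9700-WP. The sketch below is one way around that, assuming the entries in the file are separated by whitespace or other characters that are neither word characters nor hyphens. contains_exact_entry() and the sample $file contents are hypothetical.

```php
<?php
// Hypothetical helper: true only if $entry appears as a complete entry in $contents.
function contains_exact_entry($entry, $contents) {
    // preg_quote() escapes the hyphens and any other special characters in the entry.
    $quoted = preg_quote($entry, '/');
    // The lookarounds reject a match that is directly preceded or followed
    // by a word character or a hyphen.
    return preg_match('/(?<![\w-])' . $quoted . '(?![\w-])/', $contents) === 1;
}

// Made-up file contents; in practice this would come from file_get_contents().
$file = "A1B2-0001-XY\nC0DB-9700-WP\n";

var_dump(contains_exact_entry("C0DB-9700-WP", $file)); // bool(true)
var_dump(contains_exact_entry("C0DB-9700-W", $file));  // bool(false)
var_dump(contains_exact_entry("C0DB-9700-", $file));   // bool(false)
var_dump(contains_exact_entry("C0DB-9700", $file));    // bool(false)
var_dump(contains_exact_entry("C0DB-", $file));        // bool(false)
```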

Comments are closed.