Parse HTML with preg_match_all & preg_match

Most web developers already use the function preg_match, so for them preg_match_all is only a small step further, but for everyone else it may be hard to understand. The biggest difference between preg_match_all and the regular preg_match is that preg_match_all does not stop after the first match: all matched values are stored inside a multi-dimensional array, so it can hold an unlimited number of matches. The first part of this preg_match_all tutorial is about how to “collect” the image source values inside a web page. For many other parts of an HTML document the preg_match function is more useful, which is why I added two other examples: a PHP backlink checker and an example that extracts the title and META description from a webpage.
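To make the difference concrete, here is a minimal sketch that runs both functions on the same input. The sample string and the pattern are made up for this illustration and are not part of the tutorial code.

```php
<?php
// Sample string and pattern invented for this illustration.
$text = 'a.png b.jpg c.gif';
$pattern = '/(\w+)\.(png|jpg|gif)/';

// preg_match stops after the first match; $first is a flat array:
// index 0 is the whole match, 1 and 2 are the captured groups.
preg_match($pattern, $text, $first);

// preg_match_all keeps searching; $all has one sub-array per group,
// and the return value is the number of full matches.
$count = preg_match_all($pattern, $text, $all);

print_r($first);   // one match only
print_r($all);     // every match, grouped by capture index
echo $count, "\n";
```

With the default ordering, $all[0] holds the full matches, $all[1] the first group of every match, and so on.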

preg_match_all tutorial

$data = file_get_contents("https://www.finalwebsites.com");
$pattern = "/src=[\"']?([^\"']?.*(png|jpg|gif))[\"']?/i";
preg_match_all($pattern, $data, $images);

Let’s take a closer look at the regular expression pattern:

"/src=[\"']?([^\"']?.*(png|jpg|gif))[\"']?/i"

The first part (src=[\"']?) and the last part ([\"']?) search for everything that starts with src= and ends with an optional single or double quote. This could be a long string because the outer rule is very broad. Next I look at the rule that starts inside the first bracket:

"/src=[\"']?([^\"']?.*(png|jpg|gif))[\"']?/i"

Inside the long string from the outer rule, [^\"']?.* matches one optional character that is not a quote, followed by any characters up to the end of the line. The last part inside the inner brackets is the magic:

"/src=[\"']?([^\"']?.*(png|jpg|gif))[\"']?/i"

The group (png|jpg|gif) requires that the string ends with one of these file extensions. If there is a match, I retrieve all the paths from the HTML file.

I need all the rules to isolate the string parts (image paths) from the rest of the HTML. The result looks like this (access the array $images with these indexes, or just use print_r($images)):

$images[0][0] -> src="/images/english.gif"
$images[1][0] -> /images/english.gif
$images[2][0] -> gif

The index [1] holds the information I need. Try this preg_match_all example with other parts of HTML code and experiment for a better understanding.
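If you want to experiment without fetching a live page, a small inline string works just as well. The sketch below uses made-up sample markup and compares the tutorial pattern with a stricter variant. The stricter pattern is my own assumption, not part of the original script: it does not allow quotes, whitespace or ">" inside the path, so two images on one line are not merged into one match.

```php
<?php
// Inline sample markup, invented for this illustration.
$data = '<p><img src="/images/english.gif" alt="en"></p>' . "\n"
      . '<p><img src="/images/a.png"> <img src="/images/b.jpg"></p>';

// The pattern from the tutorial: ".*" is greedy and runs up to the
// last file extension it can find on the same line.
$loose = "/src=[\"']?([^\"']?.*(png|jpg|gif))[\"']?/i";
preg_match_all($loose, $data, $images);

// Assumed stricter variant: the path may not contain quotes,
// whitespace or ">", and the extension must follow a literal dot.
$strict = "/src=[\"']?([^\"'\s>]+\.(png|jpg|gif))[\"']?/i";
preg_match_all($strict, $data, $strictImages);

print_r($images[1]);        // paths found by the tutorial pattern
print_r($strictImages[1]);  // paths found by the stricter variant
```

With this sample, the tutorial pattern returns the two images on the second line as a single long match, while the stricter variant returns three separate paths.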

A PHP link checker “powered” by preg_match

This function is useful for people who run a link directory website where the links are stored inside a (MySQL) database. The PHP script below checks whether a reciprocal (external) link still exists on a specific webpage. The script handles the trailing slash in URLs and can be used with (almost) every URL. Inside the function, preg_match is used to check for the link against a regular expression.

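function check_back_link($remote_url, $your_link) {
    // Escape the link for use in a regex; the trailing slash is matched optionally below.
    $match_pattern = preg_quote(rtrim($your_link, "/"), "/");
    $found = false;
    $content = "";
    if ($handle = @fopen($remote_url, "r")) {
        while (!feof($handle)) {
            // Append each chunk so a link split across two reads is still found.
            $content .= fread($handle, 1024);
            // Match an opening <a> tag whose href is the link, with or without a trailing slash.
            if (preg_match("/<a\b[^>]*href=[\"']".$match_pattern."\/?[\"'][^>]*>/i", $content)) {
                $found = true;
                break;
            }
        }
        fclose($handle);
    }
    return $found;
}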
// example:
//if (check_back_link("https://www.web-development-blog.com", "https://www.finalwebsites.com")) echo "link exists";
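To try the matching rule without a network request, the same idea can be applied to HTML that is already in a string. This is only a sketch: html_has_link() is a hypothetical helper name, the sample markup is made up, and the pattern only looks at the opening <a> tag.

```php
<?php
// Hypothetical helper: the same check as check_back_link(), but on a
// string that is already in memory instead of a remote URL.
function html_has_link(string $html, string $your_link): bool {
    // Escape regex characters in the link and drop a trailing slash.
    $quoted = preg_quote(rtrim($your_link, "/"), "/");
    // Opening <a> tag whose href equals the link, with or without
    // a trailing slash, in single or double quotes.
    $pattern = "/<a\b[^>]*\bhref=[\"']" . $quoted . "\/?[\"'][^>]*>/i";
    return preg_match($pattern, $html) === 1;
}

// Sample markup invented for this illustration.
$html = '<p>Visit <a class="ext" href="https://www.finalwebsites.com/">Final Websites</a></p>';

var_dump(html_has_link($html, "https://www.finalwebsites.com"));
var_dump(html_has_link($html, "https://www.finalwebsites.com/other"));
```

Because a quote has to follow the (optional) slash, a link to a deeper page on the same domain does not count as a backlink to the homepage.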

Extract Page title and META description from a webpage

With this script it’s possible to obtain the first part of a remote file and parse its HTML elements in a local script. The title element and the meta description are parsed using the preg_match() function. Additional HTML head elements can be supported by adding some extra rules and regex patterns. The script reads only the first part (the head) of a remote file for better performance.

<?php
$page_title = "n/a";
$meta_descr = "n/a";
	
if ($handle = fopen("https://www.finalwebsites.com", "r")) {
	$content = '';
	while (!feof($handle)) {
		$part = fread($handle, 1024);
		$content .= $part;
		// Test the accumulated content so a closing tag split across two reads is still seen.
		if (preg_match('/<\/head>/i', $content)) break;
	}
	fclose($handle);
	// Split the collected head section into lines and scan each one.
	$lines = preg_split('/\r?\n|\r/', $content);
	$is_title = false;
	$is_descr = false;
	foreach ($lines as $val) {
		if (preg_match('/\<title\>(.*)\<\/title\>/', $val, $title)) {
			$page_title = $title[1];
			$is_title = true;
		} 
		if (preg_match('/\<meta name\="description" content\="(.*)"\s?\/?\>/', $val, $descr)) {
			$meta_descr = $descr[1];
			$is_descr = true;
		}
		if ($is_title && $is_descr) break;
	}
}
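The line-by-line scan above only finds a title that opens and closes on the same line. As a rough alternative, the sketch below matches against the whole string at once. extract_head_info() is a hypothetical helper name and the sample markup is made up. The meta pattern still assumes that name comes before content and that both attributes use double quotes.

```php
<?php
// Hypothetical helper: pulls the title and the meta description out of
// HTML that is already held in a string.
function extract_head_info(string $html): array {
    $info = ["title" => "n/a", "description" => "n/a"];
    // "s" lets "." cross line breaks, "i" ignores case, and ".*?"
    // stops at the first closing title tag.
    if (preg_match('/<title[^>]*>(.*?)<\/title>/is', $html, $m)) {
        $info["title"] = trim($m[1]);
    }
    // Assumes name precedes content and both values are double-quoted.
    if (preg_match('/<meta\s+name="description"\s+content="([^"]*)"/i', $html, $m)) {
        $info["description"] = $m[1];
    }
    return $info;
}

// Sample markup invented for this illustration; the title spans lines.
$html = "<head>\n<title>\n  Example page\n</title>\n"
      . "<meta name=\"description\" content=\"A short summary.\" />\n</head>";

print_r(extract_head_info($html));
```

If either element is missing, the helper keeps the "n/a" default, just like the script above.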

Published in: PHP Scripts

11 Comments

LiquorVicar says:
More than 1 year ago at 1:13 pm
Might be better to pick a different example, using regex to parse HTML is really not recommended: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454
1. Olaf Lederer says:
  More than 1 year ago at 2:03 pm
  Hi,
  you’re right parsing XHTML is often a quick and dirty way to solve stuff quickly. It works very well if the HTML structure is valid right? A more secure (and complex) way is to use this HTML DOM parser:
  http://sourceforge.net/projects/simplehtmldom/
  1. LiquorVicar says:
    More than 1 year ago at 2:09 pm
    Yes, you *can* use regex to parse HTML. In some circumstances it might work. My point was it’s probably not the best example to use to demo regex…
Sean Mumford says:
More than 1 year ago at 4:33 pm
You’re much better off using PHP’s built-in DOMDocument classes to handle this. I whipped up a quick demonstration here: https://gist.github.com/thepsion5/2f00beed4b622fffe98e
1. Olaf Lederer says:
  More than 1 year ago at 7:38 am
  Hi Sean,
  thanks for the PHP code example, does it work for invalid HTML too?
  1. David Spector says:
    More than 1 year ago at 4:01 am
    That’s not an answerable question, because invalid HTML might or might not create a valid DOM tree. In general, it is not possible to parse “invalid HTML” because there are too many ways for it to be invalid.
    Regexps can’t work to parse valid HTML because its grammar is too complex. Most of the complexity happens because attributes and other pieces of HTML can appear in any order. If there are n ordered items, then the complexity of the grammar is proportional to two to the n, which is exponential in size.
Olaf Lederer says:
More than 1 year ago at 7:21 am
I see there are some complaints about the examples used for this tutorial. The purpose of this tutorial was to show how to use these functions. Parsing HTML is maybe not the best thing you can do with preg_match and preg_match_all, but it’s much easier to understand (in my opinion). I’m sure that someone who is looking for a tool that can parse HTML in a real case will try the tools mentioned by the comment authors.
@David, I’ve used the Simple HTML DOM Parser in several projects, and yes, this tool was able to parse “some” invalid HTML as well. Maybe not all invalid structures, but it worked very well.
Ann says:
More than 1 year ago at 12:23 pm
How to check for an exact string in the contents of a file
$string = "C0DB-9700-WP";
The impossible entries are
C0DB-9700-W
C0DB-9700-
C0DB-9700
C0DB-
I have tried
if(!preg_match('/\b'.$string.'\b/',$file)){
echo "error";
}
But it is not validating all the above conditions.
How to validate all the above entries?
1. Olaf Lederer says:
  More than 1 year ago at 12:37 pm
  Hi Ann,
  this is not how it works, you need to “load” the file first into a string (with file_get_contents() for example).
  You know that using this function is a task for a programmer?
  1. Ann says:
    More than 1 year ago at 12:42 pm
    Hi Olaf Lederer,
    I have added file_get_contents() and all. The $file contains the path to the file with file_get_contents(). I didn’t paste my entire code here. I just added only the required line here.
  2. Olaf Lederer says:
    More than 1 year ago at 7:50 am
You need to load the file first into a string or array. After that you can check the string against the regex pattern. Maybe you need to use a loop if your data is stored in an array. It’s not that simple.
