Parse HTML with preg_match and preg_match_all

For most web developers who already use preg_match, the function preg_match_all offers only a small advantage, but for everyone else it may be hard to understand. The biggest difference between preg_match_all and the regular preg_match is that preg_match stops after the first match, while preg_match_all stores all matched values inside a multi-dimensional array that can hold an unlimited number of matches. The first part of this preg_match_all tutorial is about how to “collect” the image source values inside a web page. For many other parts of an HTML document the preg_match function is more useful, which is why I added two other examples: a PHP backlink checker and an example that extracts the title and META description from a webpage.

preg_match_all tutorial
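
A minimal sketch of the idea, assuming the page has already been loaded into a string. The sample markup and the variable names $html and $count are illustrative; the pattern is the one explained in this tutorial.

```php
<?php
// Sample markup standing in for a downloaded page (illustrative only).
$html = <<<'EOT'
<p><img src="/images/english.gif" alt="English"></p>
<p><img src='/images/dutch.png' alt="Dutch"></p>
<p><img src="/images/logo.jpg" alt="Logo"></p>
EOT;

// The pattern from this tutorial; the "i" modifier makes it case-insensitive.
$pattern = "/src=[\"']?([^\"']?.*(png|jpg|gif))[\"']?/i";

// $images becomes a multi-dimensional array:
// [0] full matches, [1] image paths, [2] file extensions.
$count = preg_match_all($pattern, $html, $images);

// Print every image path that was found.
foreach ($images[1] as $path) {
    echo $path, PHP_EOL;
}
```

Note that .* is greedy and does not cross line breaks by default, so this sketch works best when each image tag sits on its own line, as in the sample.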


Let’s take a closer look at the regular expression pattern:

"/src=[\"']?([^\"']?.*(png|jpg|gif))[\"']?/i"

The first and the last part search for everything that starts with src= and an optional single or double quote, and ends with an optional single or double quote. This can be a long string because the outer rule is very broad. Next, look at how the rule inside the first pair of brackets starts:

"/src=[\"']?([^\"']?.*(png|jpg|gif))[\"']?/i"

Inside the long string from the outer rule, this part matches one optional character that is not a single or double quote, followed by any characters. The last part inside the inner brackets is the magic:

"/src=[\"']?([^\"']?.*(png|jpg|gif))[\"']?/i"

This part requires that the string ends with one of the file extensions png, jpg or gif; the i modifier makes the whole pattern case-insensitive. If there is a match, I retrieve all the paths from the HTML file.

I need all of these rules to isolate the string parts (the image paths) from the rest of the HTML. The result looks like this (access the array $images with these indexes, or just use print_r($images)):

$images[0][0] -> src="/images/english.gif"
$images[1][0] -> /images/english.gif
$images[2][0] -> gif

Index [1] holds the information I need. Try this preg_match_all example with other parts of HTML code and experiment for a better understanding.

A PHP link checker “powered” by preg_match

This function is useful for people who run a link directory website where the links are stored inside a (MySQL) database. The PHP script below is used to check whether a reciprocal (external) link still exists on a specific webpage. The script takes care of the trailing slash in URLs and can be used with (almost) every URL. Inside the function, preg_match is used to check for a link against a regular expression.
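
A rough sketch of how such a check could look. The function name hasBacklink and the example URLs are hypothetical, and the HTML is passed in as a string; in practice it would be fetched from the partner page first, for example with file_get_contents().

```php
<?php
/**
 * Return true if $html contains an anchor whose href points to $url.
 * A trailing slash on the URL is treated as optional.
 * (hasBacklink is a hypothetical helper name used for illustration.)
 */
function hasBacklink(string $html, string $url): bool
{
    // Normalise: drop any trailing slash, then escape regex metacharacters.
    $quoted = preg_quote(rtrim($url, '/'), '~');

    // href, optional quote, the URL, optional slash, then a closing quote,
    // whitespace or ">" so that longer URLs do not count as a match.
    $pattern = '~<a\s[^>]*href\s*=\s*["\']?' . $quoted . '/?["\'\s>]~i';

    return preg_match($pattern, $html) === 1;
}

// Illustrative markup standing in for the remote page.
$page = '<p>Friends: <a href="https://www.example.com/">Example</a></p>';

var_dump(hasBacklink($page, 'https://www.example.com'));  // same site, no trailing slash
var_dump(hasBacklink($page, 'https://www.example.org/')); // different domain
```

Using preg_quote() keeps dots and other special characters in the URL from being interpreted as regex syntax.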

You can try this PHP link checker here.

Extract Page title and META description from a webpage

With this script it’s possible to fetch the first part of a remote file and parse its HTML elements in a local script. The title element and the meta description are parsed using the preg_match() function. Additional HTML head elements can be handled by adding some extra rules and regex patterns. The script reads only the first part (the head) of a remote file for better performance.
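
A minimal sketch of this approach. The function names readStart and extractHeadInfo are hypothetical, the 4096-byte limit is an arbitrary guess at how much is needed to cover the head, and the meta pattern assumes that the name attribute comes directly before content. Reading a URL with fopen() also assumes allow_url_fopen is enabled.

```php
<?php
// Read only the first bytes of a file or URL (hypothetical helper).
function readStart(string $source, int $bytes = 4096): string
{
    $handle = fopen($source, 'r');
    if ($handle === false) {
        return '';
    }
    $data = stream_get_contents($handle, $bytes);
    fclose($handle);
    return $data === false ? '' : $data;
}

// Extract the title and meta description from a chunk of HTML.
function extractHeadInfo(string $head): array
{
    $info = ['title' => null, 'description' => null];

    // "s" lets "." span line breaks in case the title is wrapped.
    if (preg_match('~<title[^>]*>(.*?)</title>~is', $head, $m)) {
        $info['title'] = trim($m[1]);
    }

    // \1 refers back to the opening quote, so the value ends at the same quote type.
    if (preg_match('~<meta\s+name=["\']description["\']\s+content=(["\'])(.*?)\1~is', $head, $m)) {
        $info['description'] = trim($m[2]);
    }

    return $info;
}

// Illustrative markup; with a real page this would be readStart($url).
$sample = <<<'EOT'
<html><head>
<title>Parse HTML with preg_match</title>
<meta name="description" content="Examples for preg_match and preg_match_all">
</head>
EOT;

$info = extractHeadInfo($sample);
echo $info['title'], PHP_EOL;
echo $info['description'], PHP_EOL;
```

Other head elements, such as the keywords meta tag, could be added with one more preg_match() call per element.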

DEMO: Get page title and META
