Parse HTML with PHP: a preg_match_all tutorial

Web developers who use preg_match or preg_replace frequently will find preg_match_all only a small step further, but for everyone else it may be hard to understand. The biggest difference between preg_match_all and the regular preg_match is that preg_match_all stores every match in a multi-dimensional array, so it can hold an unlimited number of matches. This preg_match_all tutorial shows how to “collect” the image source values inside a web page:

$data = file_get_contents("http://www.finalwebsites.com");
$pattern = "/src=[\"']?([^\"']?.*(png|jpg|gif))[\"']?/i";
preg_match_all($pattern, $data, $images);
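
To make the difference between the two functions concrete, here is a minimal sketch that runs both on the same hard-coded string. The sample markup and the simplified pattern are assumptions made for this illustration; they are not the pattern used in the rest of this tutorial.

```php
<?php
// Sample markup invented for this illustration.
$html = '<img src="/images/a.gif"> <img src="/images/b.png">';

// A deliberately simple pattern: one capture group for a double-quoted path.
$simple = '/src="([^"]+)"/i';

// preg_match() stops after the first match.
// $first[0] is the whole match, $first[1] the capture group.
preg_match($simple, $html, $first);

// preg_match_all() keeps going and collects every match.
// $all[0] lists the whole matches, $all[1] lists the captured paths.
$count = preg_match_all($simple, $html, $all);

print_r($first);
print_r($all);
```

preg_match fills a flat array for a single match, while preg_match_all returns the number of matches and fills one sub-array per capture group.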

Let’s take a closer look at the regular expression pattern:

"/src=[\"']?([^\"']?.*(png|jpg|gif))[\"']?/i"

The first part (src=[\"']?) and the last part ([\"']?) match everything that starts with src=, followed by an optional single or double quote, and ends with an optional single or double quote. This could be a long string because the outer rule is very broad. Next, look at the rule that starts inside the first bracket:

"/src=[\"']?([^\"']?.*(png|jpg|gif))[\"']?/i"

Inside the long string from the outer rule, this part matches one optional character that is not a single or double quote ([^\"']?), followed by any characters (.*). The last part inside the inner brackets is the magic:

"/src=[\"']?([^\"']?.*(png|jpg|gif))[\"']?/i"

Finally, the inner brackets (png|jpg|gif) require the string to end with one of these file extensions. Every match yields one image path from the HTML file.
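
Because .* is greedy, the pattern can run past the end of the first path when two image tags share one line. The sketch below shows that behaviour and compares it with a tighter variant. The sample markup and the $tightPattern expression are assumptions for illustration, not part of the original tutorial, and the tighter variant will not handle paths that contain spaces or query strings.

```php
<?php
// Sample markup invented for this illustration: two tags on ONE line.
$html = '<img src="/a.gif" alt="x"> <img src="/b.png">';

// The pattern from this tutorial: ".*" is greedy and runs up to the
// LAST extension it can find on the line.
$tutorialPattern = "/src=[\"']?([^\"']?.*(png|jpg|gif))[\"']?/i";
preg_match_all($tutorialPattern, $html, $greedy);

// Hypothetical tighter variant: the path may not contain quotes,
// whitespace or ">", and a literal dot must precede the extension.
$tightPattern = '/src=["\']?([^"\'\s>]+\.(png|jpg|gif))["\']?/i';
preg_match_all($tightPattern, $html, $tight);

print_r($greedy[1]); // one long string spanning both tags
print_r($tight[1]);  // one entry per image path
```

Excluding the quote characters from the path is one way to stop a match at the closing quote. Making the quantifier lazy (.*?) is another option worth experimenting with.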

I need all the rules to isolate the string parts (image paths) from the rest of the HTML. The result looks like this (access the array $images with these indexes, or just use print_r($images)):

$images[0][0] -> src="/images/english.gif"
$images[1][0] -> /images/english.gif
$images[2][0] -> gif

Index [1] holds the information I need. Try this preg_match_all example with other pieces of HTML code and experiment to get a better understanding. Check my website finalwebsites.com for more PHP scripts and code examples.
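
As a starting point for such experiments, here is a minimal sketch that applies the tutorial's pattern to a small hard-coded string and loops over the result. The sample markup is invented for this illustration. Each tag sits on its own line because "." does not match a newline by default.

```php
<?php
// Sample markup invented for this illustration, one tag per line.
$data = "<img src=\"/images/english.gif\">\n"
      . "<img src='/images/logo.png'>\n"
      . "<img src=/images/photo.jpg>";

$pattern = "/src=[\"']?([^\"']?.*(png|jpg|gif))[\"']?/i";

if (preg_match_all($pattern, $data, $images)) {
    // $images[1] holds the paths, $images[2] the matching extensions.
    foreach ($images[1] as $i => $path) {
        echo $path, ' (', $images[2][$i], ')', PHP_EOL;
    }
}
```

The loop uses the same index for $images[1] and $images[2], since entries at the same position belong to the same match.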

Comments

  1. I modified it to pull the image path out of “content” brought in from a db. It works fine if only one image is in the “content”, but when I tried with 2 images the pattern grabs everything from src in the first image until the second image’s closing ‘”‘. I have tried a lot of different variations, but nothing pulls just the paths for the images. Do you have any ideas?

  2. @geredfds,

    this is about 3 short rows of code…?

    Please replace the single and double quotes using your keyboard (if needed).

  3. Hi, can you post a plaintext version of this code? The formatting of your quotes seems to be stopping it working?

  4. when I enter this the php script

    "/src=["']?([^"']?.*(png|jpg|gif))["']?/i"

    prints the following error:

    Parse error: syntax error, unexpected T_CONSTANT_ENCAPSED_STRING

    What I am writing wrong?

  5. Hi,

    note that WP will replace some characters; check that the single and double quotes are the right ones.

  6. @stelabouras:

    add backslashes before the double quotes:

    "/src=[\"']?([^\"']?.*(png|jpg|gif))[\"']?/i"

  7. :)

    I see the problem: I escaped the double quotes too, but WP has stripped the backslashes off!

    Thanks Iiro for pointing that out to me!

    The code doesn’t work for single quotes (the closing quote doesn’t work properly)

    src='image_path/some_path.image.gif'

  9. Hi Max,

    just did a (second) test, but it’s working fine, do you have an example page (URL) where it’s not working?

    Olaf

  10. put the code in as such and i get error

    Parse error: syntax error, unexpected T_CONSTANT_ENCAPSED_STRING in /home/hardwork/public_html/imageparse.php on line 3

    any help please?

  11. Hello George,

    the problem is the way this WordPress blog parses PHP code :(
    I removed the double backslashes and the error is gone, but if you copy and paste the code above you may have to replace the single/double quotes.

  12. Hi,
    is your question about single quotes used inside the image tag?

    the above pattern matches:
    src=image.gif
    src='image.gif'
    src="image.gif"

    I hope that helps

  13. Does this code work for single quotes??

    I have tried to use this code, it does not work for me for single ones.

    Please help me with this one, really urgent. Cant find a way to allow both double quotes and single quotes.

  14. Olaf,

    Been browsing through your site, and again thanks for the tutorials. Didn’t know web development had existed this long. Have just bookmarked your site; hope you’ll help, because I’m on my way to creating a better WP theme for my site.

  15. You’re welcome Charleston, sure, web development has existed as long as the web ;)

  16. I need help to extract the url which does not contain the primary domain.

  17. Hi ssultan,

    which part of the URL do you need? Please provide some examples (or use the function parse_url())

  18. Hi ssultan,
    Try this:

    $regexp = '#href=["|\']http://(.[^/]+)([^"|\']+)#i';
    $url = 'a hyperlink';

    if (preg_match($regexp, $url, $matches)) {
        echo 'Domain: '. $matches[1];
        echo 'Rest: '. $matches[2];
    }

  19. Hi,

    your example doesn’t work since your example URL is not valid ;)

    @ssultan, if you need to extract the domain or hostname from some URL, try parse_url()

  20. hello, thank for sharing.

    i want to get one of my html. My code like that:
    $url = "index.html";
    $content = file_get_contents($url);
    $patten = '/(.*)/';
    preg_match_all($patten, $content, $data);
    print($data);

    result return empty. Help me. Thanks a lot

  21. Hello Lan,
    the pattern is wrong, it’s not specific (you accept the whole string inside the $content var)

  22. Hello, very good job here!

    But, not working for me.
    I have this directory (http://familiagrissi.com/images/arvore/) and want to select only the name of the pictures with the .jpg extension, but without the extension in an array.

    the result should be something like:
    $image[0] = Adelia Grissi Borgo;
    $image[1] = Andrea BergaminiReis;

    and so on…

    thanks! :D

  23. Hi,
    It’s possible that you need to change the regex pattern a little. This tutorial is just an example of how to use the function preg_match_all().
    Compare your HTML with the pattern you’re using at the moment, I’m sure you will see the difference ;)

  24. Thanks, fellow, you saved my life.
    I edited your code to fit my needs; here it is:

    $pattern = preg_match_all('!imgurl=.+.(?:jpe?g|png|gif)!Ui', $data, $images);
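
For the question raised in the comments about pulling the domain out of a URL, parse_url() was suggested as an alternative to a hand-written pattern. A minimal sketch of that suggestion follows; the example URL is invented for this illustration.

```php
<?php
// Example URL invented for this illustration.
$url = 'http://www.example.com/images/english.gif?size=large';

// parse_url() splits a URL into its components and returns an array.
$parts = parse_url($url);

echo 'Host: ', $parts['host'], PHP_EOL;
echo 'Path: ', $parts['path'], PHP_EOL;

// A single component can also be requested directly.
$host = parse_url($url, PHP_URL_HOST);
```

Paths collected with preg_match_all can be passed through parse_url() in the same way when they are full URLs.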
