Parse HTML with PHP preg_match_all

For most PHP developers who use preg_match or preg_replace frequently, preg_match_all is only a small step further, but for everyone else it may be hard to understand. The biggest difference between preg_match_all and the regular preg_match is that preg_match_all collects every match in the subject, not just the first one, and stores them all in a multi-dimensional array. The following example shows how to extract the image paths from a web page:

$data = file_get_contents("http://www.finalwebsites.com");
$pattern = "/src=[\"']?([^\"']?.*(png|jpg|gif))[\"']?/i";
preg_match_all($pattern, $data, $images);

Let's take a closer look at the pattern:

"/src=[\"']?([^\"']?.*(png|jpg|gif))[\"']?/i"

The first and the last part match everything that starts with src=, followed by an optional single or double quote, and that ends with an optional single or double quote. This can be a long string, because .* is greedy and the outer rule is very broad. Next, we check the rule that starts inside the first bracket:

"/src=[\"']?([^\"']?.*(png|jpg|gif))[\"']?/i"

Inside this long string from the outer rule, we now look for an optional character that is not a single or double quote, followed by any number of characters. The last part, inside the inner brackets, is the magic:

"/src=[\"']?([^\"']?.*(png|jpg|gif))[\"']?/i"

Finally, we require the string to end with one of the file extensions png, jpg or gif. With this match we get all the image paths from the HTML file.
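Because .* is greedy, the match can run past the first image when two img tags sit on the same line. The following is a minimal sketch of that behaviour, together with one possible tighter pattern. The sample HTML and the $strict pattern are illustrative assumptions and not part of the original script.

```php
<?php
// Hypothetical sample: two images on a single line.
$html = '<img src="/images/a.gif"><img src="/images/b.png">';

// The pattern from the article: .* is greedy and backtracks only as far
// as the LAST extension on the line.
$greedy = "/src=[\"']?([^\"']?.*(png|jpg|gif))[\"']?/i";
preg_match_all($greedy, $html, $greedyMatches);

// An assumed alternative: the path may not contain quotes, whitespace
// or ">", so each match stays inside a single attribute value.
$strict = "/src=[\"']?([^\"'\s>]+\.(png|jpg|gif))/i";
preg_match_all($strict, $html, $strictMatches);

print_r($greedyMatches[1]); // one entry that spans both tags
print_r($strictMatches[1]); // /images/a.gif and /images/b.png
```

The tighter pattern trades generality for precision: it no longer accepts paths that contain spaces, which may or may not matter for your pages.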


We need all of these rules to isolate the string parts (the image paths) from the rest of the HTML. The result looks like this (access the array $images with these indexes, or just use print_r($images)):

$images[0][0] -> src="/images/english.gif"
$images[1][0] -> /images/english.gif
$images[2][0] -> gif

Index 1 holds the information we need. Try this example with other pieces of HTML code for a better understanding. Check our partner finalwebsites.com for more scripts and snippets.
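To experiment without fetching a live page, the same pattern can be run against an inline string. This is only a sketch: the three img tags below are made-up sample data, chosen to cover double-quoted, single-quoted and unquoted src values.

```php
<?php
// Hypothetical inline HTML standing in for the downloaded page.
$data = <<<HTML
<p><img src="/images/english.gif" alt="English"></p>
<p><img src='/images/logo.png' alt="Logo"></p>
<p><img src=/images/photo.jpg></p>
HTML;

// Same pattern as in the article; the double quotes are escaped
// because the pattern is written as a double-quoted PHP string.
$pattern = "/src=[\"']?([^\"']?.*(png|jpg|gif))[\"']?/i";

preg_match_all($pattern, $data, $images);

// $images[0]: full matches, $images[1]: paths, $images[2]: extensions.
foreach ($images[1] as $path) {
    echo $path, PHP_EOL;
}
```

Each img tag sits on its own line here, so the greedy .* cannot run into a neighbouring tag, and the loop prints the three paths one per line.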

 

Comments

Thanks a LOT!

Just what I was looking for!
Digged

I modified it to pull the image path out of “content” brought in from a db. It works fine if only one image is in the “content”, but when I tried with 2 images the pattern grabs everything from src in the first image up to the second image’s closing ". I have tried a lot of different variations, but nothing pulls just the image paths. Do you have any ideas?

Hi Jim,

Sure, I will help, but not here. Please join the Webdigity Webmaster Forum and post your question there.

Hi, can you post a plaintext version of this code? The formatting of your quotes seems to be stopping it from working.

@geredfds,

this is about 3 short rows of code…?

just replace the single and double quotes using your keyboard.

When I enter this in the PHP script

"/src=["']?([^"']?.*(png|jpg|gif))["']?/i"

prints the following error:

Parse error: syntax error, unexpected T_CONSTANT_ENCAPSED_STRING

What am I writing wrong?

Hi,

Note that WP will replace some characters; check that the single and double quotes are the right ones.

@stelabouras:

add backslashes before the double quotes:

"/src=[\"']?([^\"']?.*(png|jpg|gif))[\"']?/i"

:)

I see the problem: I escaped the double quotes too, but WP stripped the backslashes!

Thanks Iiro for pointing that out!

The code doesn't work for single quotes (the closing quote isn't handled properly):

src='image_path/some_path.image.gif'

Hi Max,

just did a (second) test, but it’s working fine, do you have an example page (URL) where it’s not working?

Olaf

How can I get the info between the quotes in

user_agent=" info "

I want to extract the "info" part from a file.

Can you tell me how to add that to the $pattern?

Please post further questions via the forum:
http://www.finalwebsites.com/forums/
