0

I'm looking to write a script in PHP that scans an HTML document and adds new markup to an element based on what it finds. More specifically, I want it to scan the document and, for every element, look for the CSS declaration "float: right/left"; if it finds one, it should add align="right/left" (matching the value it found). Example:

<img alt="steve" src="../this/that" style="height: 12px; width: 14px; float: right"/>

becomes

<img alt="steve" src="../this/that" align="right" style="height: 12px; width: 14px; float: right"/>

Ghjnut
  • 4
    http://www.codinghorror.com/blog/2009/11/parsing-html-the-cthulhu-way.html – deinst Aug 05 '10 at 21:08
  • possible duplicate of [How to parse and process HTML with PHP?](http://stackoverflow.com/questions/3577641/how-to-parse-and-process-html-with-php) – PeeHaa Jan 16 '12 at 20:01

2 Answers

7
 $dom = new DOMDocument();
 $dom->loadHTML($htmlstring);
 $x = new DOMXPath($dom);
 foreach($x->query("//img[contains(@style,'float: right')]") as $node) $node->setAttribute('align','right');
 foreach($x->query("//img[contains(@style,'float: left')]") as $node) $node->setAttribute('align','left');

edit:

When the amount of whitespace between 'float:' and 'right' is not certain, there are several options:

  1. Use the XPath 1.0: //img[starts-with(normalize-space(substring-after(@style,'float:')),'right')]
  2. Just do a simple check for float like //img[contains(@style,'float:')], and check with $node->getAttribute() what actually comes afterwards.
  3. Import preg_match into the equation (which was just recently pointed out to me (thanks Gordon), but in this case is imho the least favorite solution):

 $dom = new DOMDocument();
 $dom->loadHTML($htmlstring);
 $x = new DOMXPath($dom);
 $x->registerNamespace("php", "http://php.net/xpath");
 $x->registerPHPFunctions('preg_match');

 foreach($x->query("//img[php:functionString('preg_match','/float\s*:\s*right/',@style)]") as $node) $node->setAttribute('align','right');
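
Option 2 from the list above could look roughly like the sketch below. The sample HTML and the exact regular expression are assumptions made for illustration: XPath only pre-filters on the word float, and PHP then inspects the style attribute itself, so variations such as 'float:left ;' or 'float :  left' are tolerated.

```php
<?php
// Hypothetical sample input; $htmlstring mirrors the variable name used above.
$htmlstring = '<p><img alt="steve" src="../this/that" style="height: 12px; width: 14px; float:right ;"/>'
            . '<img alt="bob" src="../this/other" style="float :  left"/>'
            . '<img alt="plain" src="../this/plain" style="height: 12px"/></p>';

$dom = new DOMDocument();
$dom->loadHTML($htmlstring);
$x = new DOMXPath($dom);

// Cheap pre-filter in XPath, precise whitespace-tolerant check in PHP.
foreach ($x->query("//img[contains(@style,'float')]") as $node) {
    $style = $node->getAttribute('style');
    // Match a float declaration at the start of the attribute or after a ';'.
    if (preg_match('/(?:^|;)\s*float\s*:\s*(left|right)\s*(?:;|$)/i', $style, $m)) {
        $node->setAttribute('align', strtolower($m[1]));
    }
}

echo $dom->saveHTML();
```

This handles left and right in one pass and leaves the style attribute untouched. Declarations with extra tokens (for example an `!important` flag) would not match this particular pattern and would need a looser one.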
Wrikken
  • Will this work with variations in the syntax of float? (I'm using CKeditor and I don't know how consistent its output is; I might get 'float:left ;' or 'float: left;'.) – Ghjnut Aug 05 '10 at 21:59
  • Not directly, no. For that there is some trickery involved (unfortunately, XPath 2.0's `matches()` function cannot be used). One could fiddle around with `substring-after()` and the like; I'll edit something in a moment. – Wrikken Aug 05 '10 at 22:08
  • 1
    +1 Do this. It is so much faster than simple_html_dom it isn't even funny. – Byron Whitlock Aug 05 '10 at 22:10
  • echo preg_replace('%(\)%', '\1 align="\3"\2\3\4', $htmlstring); Looks complicated, but robust (I think). Is there a specific reason it's suggested not to use regexp for parsing? – Ghjnut Aug 06 '10 at 17:10
  • Yes: test containing 'look at this ` _without_ float, but an arbitrary element following it (span, div, another img, etc.) _with_ a float, the mention of `float: ` in the text itself, etc. A lot of things _can_ go wrong, which is why we rely more on parsers than on _best case scenario_ regexes. The regex doesn't look nearly as complicated as the 'best efforts' I've seen; actually, it is one of the more naïve ones. A better one would be a regex which at least _tries_ to validate it is still within a tag, which this one utterly lacks. – Wrikken Aug 06 '10 at 21:40
  • .. which doesn't mean it doesn't work for all circumstances your particular code comes across, it may very well be _just enough_ sufficient. There are however clearly known bugs for anyone who knows both html & regexes. – Wrikken Aug 06 '10 at 21:45
  • A small hint if you _really_ want to go the regex route (which you shouldn't): the `.*` is utterly, catastrophically wrong: it should match _not >_ (`[^>]`), with the exception that a `>` could appear inside an attribute value, in which case it would be: _if we're still in a tag, a quoting character may have started, but that is not guaranteed in some HTML, and we're not even sure we're in a `style` attribute_, etc., etc. – Wrikken Aug 06 '10 at 21:56
  • Believe you me: I've taken the regex road to HTML before in my days, and came up with a perfect solution for all available test cases, until I was routed by just the right user brainfart in HTML to make it break. If you utterly and _completely_ control your HTML, there's a chance you will succeed, but it is **not** in any way reliable for all possible situations. – Wrikken Aug 06 '10 at 21:57
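
The greedy `.*` problem described in the comments above can be reproduced with a small sketch. The pattern below is a deliberately naive one made up for this demonstration (it is not the regex from the earlier comment), and the input HTML is hypothetical: an `<img>` without a float, followed by a `<span>` with one.

```php
<?php
// Hypothetical input: the <img> has no float, but the <span> after it does.
$htmlstring = '<img src="a.png"><span style="float: right">x</span>';

// Deliberately naive pattern: the greedy (.*) runs past the end of the <img> tag.
$naive = preg_replace(
    '/<img(.*)float:\s*(left|right)/',
    '<img align="${2}"${1}float: ${2}',
    $htmlstring
);
echo $naive, "\n"; // the <img> has picked up an align attribute it should not have

// The same input through DOM + XPath: only the <img>'s own style attribute is inspected.
$dom = new DOMDocument();
$dom->loadHTML($htmlstring);
$x = new DOMXPath($dom);
foreach ($x->query("//img[contains(@style,'float')]") as $node) {
    $node->setAttribute('align', 'right'); // not reached for this input
}
$img = $dom->getElementsByTagName('img')->item(0);
var_dump($img->hasAttribute('align'));
```

Tightening the pattern to `[^>]*` fixes this particular input, but as the comments point out, a `>` inside an attribute value or the text "float: right" in ordinary content would still trip it up, whereas the parser-based version does not care.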
2

Please please, don't use a regexp to parse HTML.

Use simple_html_dom instead.

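// Assumes the library file simple_html_dom.php is on the include path.
require_once 'simple_html_dom.php';

$dom = new simple_html_dom();
$dom->load($html);
// find() yields element objects, so $fragment is used directly (no [0] index).
foreach ($dom->find('[style*=float]') as $fragment)
{
   // Tolerate optional whitespace around the colon.
   if (preg_match('/float\s*:\s*left/', $fragment->style))
   {
      $fragment->align = 'left';
      $fragment->style = '';   // note: this clears the whole style attribute
   }
   // ...
}
echo $dom;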
Byron Whitlock
  • 4
    Use **a** parser, of which `simplehtmldom` is one, but I've never understood the fascination people have for an outside, PHP-coded package instead of the fast & reliable built-in methods `SimpleXML` or `DOM`, which are implemented at C level. – Wrikken Aug 05 '10 at 21:08
  • @wrikken because simple_html_dom is simpler and easier. Have you ever tried it? It will blow you away ;) – Byron Whitlock Aug 05 '10 at 21:11
  • 3
    Maybe I'm a fool for having learned `DOM` (mainly using javascript earlier on) and `XPath`, but I don't find it a lick easier, and even so, most of those methods could be simply implemented creating a few helper functions in extending `DOM`. – Wrikken Aug 05 '10 at 21:14
  • 4
    Suggested third party alternatives that actually use DOM instead of String Parsing: [phpQuery](http://code.google.com/p/phpquery/), [Zend_Dom](http://framework.zend.com/manual/en/zend.dom.html), [QueryPath](http://querypath.org/) and [FluentDom](http://www.fluentdom.org). – Gordon Aug 05 '10 at 21:15
  • @Wrikken, fair enough. I've worked with tcldom and the expat C lib, and was not easy. Sounds like I need to add another tool to my PHP box! – Byron Whitlock Aug 05 '10 at 21:16
  • But reading this back, I seem to be attacking you. That is not my intention. Please, go ahead and use `simplehtmldom`; it was just some venting / inability to understand its popularity. – Wrikken Aug 05 '10 at 21:16
  • @Wrikken, no worries, seeing your example, I think you have a point. Thanks for the heads up! – Byron Whitlock Aug 05 '10 at 21:20
  • @Byron: thanks for not taking it the wrong way :) @Gordon: nice ones, will check some of them out / compare them to my own toolbox. – Wrikken Aug 05 '10 at 21:22
  • @Wrikken I have to thank you. There is finally someone who shares my thoughts about SimpleHtmlDom. You don't know how lonely I was, blankly staring at answers getting massive upvotes for just suggesting SimpleHtmlDom without even giving examples, like it was the holy grail. Now I know I am not alone. For that I `define('A_TOKEN_OF_APPRECIATION', '♥')` for you. – Gordon Aug 05 '10 at 21:35
  • @Wrikken, @Gordon, you just found another convert. I am working on a scraping project and for the current website, I tried using the PHP DOM as Wrikken suggested. Holy smokes, that sucker is fast! AND I can use Firebug's "copy XPATH" instead of counting by hand. You just saved me at least an hour, sirs! THANK YOU VERY MUCH! And thank you for the rant, Wrikken. I wish I could buy you a beer!!!!! – Byron Whitlock Aug 05 '10 at 22:09
  • 2
    Ah, good, that means more ammunition :) I'll take a beer from the fridge & pretend it's been given, cheers. – Wrikken Aug 05 '10 at 22:28