Wednesday, November 11, 2009

Literal Regular Expression Safe Regular Expression

... sorry for the redundant title, but that is exactly what this post is about ... after yesterday's explanation of the problem, logic, and solution to grab valid strings inside JS code, here I am with the literal RegExp able to grab literal RegExps in generic JavaScript code.

Why Not Just Add A "/" To The Strings RegExp

One comment gave me the hint to write this second post about RegExps. While time is a bit short these days, the answer is simple, but not obvious!
The differences between strings and literal regular expressions are basically these:
  • there must be at least one char, or the parser will consider the literal RegExp an inline comment //
  • the slash does NOT necessarily need to be escaped. If we have a slash inside a range such as [a/b], it won't break the RegExp: the slash is considered just one valid char in that range
  • there can be chars after the closing slash: the flags i (ignore case), g (match all), and m (multi line), which the RegExp accepts in any order and number
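The three points above can be checked in any console. This is only a minimal sketch: the variable names are mine, chosen for illustration, and the empty group (?:) is just the closest valid stand-in for an "empty" literal.

```javascript
// 1. An empty literal cannot be written: "//" starts an inline comment,
//    so an explicit empty group is used here instead.
var emptyLike = /(?:)/;

// 2. Inside a range the slash needs no escaping.
var range = /[a/b]/;
var rangeHits = "a/b".match(/[a/b]/g); // each of the three chars is in the range

// 3. Flags follow the closing slash.
var flagged = /abc/gi;
var flagState = [flagged.global, flagged.ignoreCase, flagged.multiline];

console.log(emptyLike.test(""), range.test("/"), rangeHits.length, flagState.join(","));
```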

The last point is not truly a problem, since this syntax will break the code in any case:

function igm(){}
// SyntaxError: nothing but an operator or a terminator may follow a string
var a = "string"igm();

Still, we need to deal with the first couple of points.

The RegExp Safe Regular Expression



// WebReflection Solution
/\/(?:\[(?:(?=(\\?))\1.)*?\]|(?=(\\?))\2.)+?\/[igm]*/g

Yesterday somebody pointed me to another solution after 10 seconds, and I bet this will happen again. So far I have tested the little monster above enough to say it should work without problems, but obviously only if the code is valid. Otherwise we don't even need to waste our time trying to parse it.
As an example, yesterday somebody told me: look, it does not work with this

a = \"string"

Well, consider that an escaped char could be anywhere in the code but, again, these regular expressions are not code sanitizers, and such code is improbable in any case, since:

// tell me what do you expect and WHY!
a = "string\"
b = \"other"
\"
" c = what?!"

So any kind of weird combination won't work, but if the regular expression is valid, escaped or not, the previous solution should work like a charm.

Explanation

I won't go step by step through the entire RegExp this time. Things are the same as described in my previous post, so please read there if you want to know more. The emulated look-behind pattern has been included in this RegExp to skip groups of possible ranges present in the literal. When a range is encountered, starting with the char "[", it is skipped until its end. If there is no end, the literal RegExp is theoretically broken and the code won't execute. The same strategy is used for the other case: where no [ is encountered, if a char is preceded by a backslash the pair is consumed together, and we go on as described in the other post. In this way we should be sure that, whatever the content is, we'll find the end of the chars included in the RegExp. I did not spend too much time ensuring consistency for the flags, since "/whatever/ii" will be part of inconsistent code, which is a syntax parser problem and not mine.
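To make the two branches easier to see, here is a sketch that rebuilds the same pattern from named pieces and compares it with the literal version. The helper escapedChar and the names rangePart, charPart, and composed are hypothetical, introduced only for this illustration.

```javascript
// Hypothetical helper: an optional backslash captured inside a lookahead,
// replayed through a back reference, then any single char.
// The lookahead is not re-entered on backtracking, so a backslash
// always stays glued to the char it escapes.
function escapedChar(group) {
  return "(?=(\\\\?))\\" + group + ".";
}

// A range: "[", as few (possibly escaped) chars as possible, then "]".
var rangePart = "\\[(?:" + escapedChar(1) + ")*?\\]";

// Anything else: a single, possibly escaped, char.
var charPart = escapedChar(2);

var composed = new RegExp(
  "\\/(?:" + rangePart + "|" + charPart + ")+?\\/[igm]*",
  "g"
);

var literal = /\/(?:\[(?:(?=(\\?))\1.)*?\]|(?=(\\?))\2.)+?\/[igm]*/g;

// A range holding both an escaped "]" and a bare "/".
var sample = "var d = /[\\]/]+/g;";
var fromComposed = sample.match(composed);
var fromLiteral = sample.match(literal);

console.log(fromComposed[0] === fromLiteral[0], fromComposed[0]);
```

In the sample, the escaped "]" does not close the range and the bare "/" inside it does not close the literal, which is exactly the case the range branch exists for.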

Test Cases


//comment <-- should not be matched at all
var a = /a/;
var b = /\//i;
var c = /[/]/;

I bet there are hundreds of RegExps or minifiers out there that fail with the last one, since even different editors have problems trying to understand what is going on.
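For those who prefer a console to a browser page, the cases above can be run in one shot. A minimal sketch, where source and found are names I made up for the example:

```javascript
// The four test lines, joined as they would appear in a file.
var source = [
  "//comment <-- should not be matched at all",
  "var a = /a/;",
  "var b = /\\//i;",
  "var c = /[/]/;"
].join("\n");

var literalRegExp = /\/(?:\[(?:(?=(\\?))\1.)*?\]|(?=(\\?))\2.)+?\/[igm]*/g;

// With the g flag, match returns only the complete matches.
var found = source.match(literalRegExp);

// logs the three literals, one per line: /a/, /\//i and /[/]/
console.log(found.join("\n"));
```

The comment line is skipped here only because it contains no further slash: the dot does not cross line breaks, so the search gives up at the end of that line.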

The Test Case

Same code I posted yesterday, except the alert will show all arguments. I know I have used an empty replace, which is bad practice, but that was good enough for testing purposes:

onload = function(){
    document.body.appendChild(
        document.createElement("textarea")
    ).onchange = function(){
        this.value.replace(
            // WebReflection Solution Test
            /\/(?:\[(?:(?=(\\?))\1.)*?\]|(?=(\\?))\2.)+?\/[igm]*/g,
            function(){
                alert([].slice.call(arguments).join("\n"));
            }
        );
    };
};

Please let me know if you find a better solution or any gotcha via the test case, considering that arguments[0] should be exactly the matched RegExp. Thanks.

P.S. About the inline comment: it's not worth avoiding that case, for two reasons. We can always test that match.charAt(1) !== "/", and the problem is still: who comes first? If we have a string inside a RegExp or vice versa, there is no way to exclude these cases in a single, reasonable RegExp. As I have said, as soon as I find some time, I will explain how to create a proper function able to manage all JavaScript cases. Stay tuned!
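Both halves of this P.S. can be seen in a few lines. This is just a sketch under my own naming (commented, kept, inString are hypothetical), not the proper function promised above:

```javascript
var literalRegExp = /\/(?:\[(?:(?=(\\?))\1.)*?\]|(?=(\\?))\2.)+?\/[igm]*/g;

// An inline comment with a slash in it is grabbed as if it were a literal ...
var commented = "// a / b\nvar r = /x/;";
var all = commented.match(literalRegExp);

// ... but the charAt(1) check is enough to drop it afterwards.
var kept = [];
for (var i = 0; i < all.length; i++) {
  if (all[i].charAt(1) !== "/") {
    kept.push(all[i]);
  }
}

// "Who comes first?": slashes inside a string still look like a literal,
// and no check on the match alone can tell the difference.
var inString = 'var s = "a/b/c";'.match(literalRegExp);

console.log(all.length, kept.join(","), inString.join(","));
```

The second case is why a single RegExp is not enough: the string would have to be recognized first, which is a job for a function that walks the code in order.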
