Regular expression understanding

From: Radio 31 Oct 2011 10:35
To: ALL 1 of 57

Anyone here good with regular expressions?
We've picked up the following one from Microsoft, for validating the structure of email addresses before they're applied:
^[\w-\.]+@([\w-]+\.)+[\w-]{2,4}$

 

The only problem is that this excludes bob@123.com and bob@456.com, both of which use valid domains. The expression seems to require that a letter be present in the domain, which obviously isn't a requirement in the real world.

 

Can anyone confirm this is the case, and perhaps even better, suggest a modification to the expression?

From: Radio 31 Oct 2011 10:50
To: Radio 2 of 57

And for anyone that /can/ understand the above, is this any better?
( \w+([-+.']\w+)*@\w+([-.]\w+)*\.\w+([-.]\w+)* )

 

(Should I be putting these in code tags I wonder?)

From: 99% of gargoyles look like (MR_BASTARD) 31 Oct 2011 10:53
To: Radio 3 of 57

Bob's an arse, I wouldn't bother sending him email.

 

(and I don't know, regex has always confused me)

EDITED: 31 Oct 2011 10:54 by MR_BASTARD
From: Mizzy 31 Oct 2011 11:01
To: Radio 4 of 57

\w in [\w-\.] will match 0-9, A-Z, a-z and underscore chars without modification.

 

Explanation of the whole thing.

 

^[\w-\.]
Anything before the @ sign as long as it's alphanumeric + dot

 

+@
Match the @ symbol

 

([\w-]+\.)
Match any alphanumeric domain prefixes and the dot

 

+[\w-]{2,4}$
Match the domain suffixes (.uk, .com) making sure the length is between 2 and 4 chars.

 


Get RegexBuddy, it's useful :-)

EDITED: 31 Oct 2011 11:05 by MIZZY
From: Radio 31 Oct 2011 11:33
To: Mizzy 5 of 57
Cheers, so that looks like it should work fine for numerical domains, meaning that it's probably been typed in wrong in the actual code.
From: Mizzy 31 Oct 2011 12:40
To: Radio 6 of 57

That is likely, although there are some differences between implementations of regex, so check the regex against the flavour your language uses.

From: 99% of gargoyles look like (MR_BASTARD) 31 Oct 2011 13:17
To: Mizzy 7 of 57
All of which stimulated me to check out regex editors. Anything to take out the hit-n-miss of regex.
From: Peter (BOUGHTONP) 31 Oct 2011 13:48
To: Radio 8 of 57
The regex you've given does not require a letter. In every regex implementation I've encountered where \w has meaning, it will match (as a minimum) [0-9A-Za-z_] - some implementations will also match accented/etc letters, but they all match at least those 63 chars.
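
A quick way to confirm this is to run the pattern against the numeric domains from the first post. The sketch below uses Python's re module purely as an illustration; the thread never says which language the original code is in. Python 3 refuses the class as written, [\w-\.], with a "bad character range" error, so the hyphen is escaped here. That adjustment, and the helper name compiles, are assumptions and not part of the Microsoft pattern.

```python
import re

# The pattern from the first post, with the hyphen escaped inside each class.
# Escaping is an adjustment for Python, which rejects [\w-\.] as a bad range.
EMAIL_RE = re.compile(r"^[\w\-.]+@([\w\-]+\.)+[\w\-]{2,4}$")


def compiles(pattern: str) -> bool:
    """Return True if Python's re module accepts the pattern."""
    try:
        re.compile(pattern)
    except re.error:
        return False
    return True


# \w on its own matches digits as well as letters and underscore.
digits_match_w = re.fullmatch(r"\w+", "123") is not None

# Both addresses from the first post have purely numeric domain labels.
numeric_domains_ok = all(
    EMAIL_RE.match(address) is not None
    for address in ("bob@123.com", "bob@456.com")
)

if __name__ == "__main__":
    # Whether the original, unescaped spelling compiles in this engine.
    print(compiles(r"^[\w-\.]+@([\w-]+\.)+[\w-]{2,4}$"))
    print(digits_match_w, numeric_domains_ok)
```

If the addresses are being rejected in the real application, this suggests the cause lies somewhere other than \w, for example a mistyped pattern or a difference between engines.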

But what's the context of this?

To correctly validate all email formats with regex is actually a hugely complex task (due to what syntax is allowed by various email software), and it's probably not worth the effort to do more than check that there's a single @ and at least one dot.

Which is usually easier to do with simple text checks:

code:
if ( Text.contains('@') and Text.contains('.') )


But if you're forced to use a regex for something then:

code:
[^@]++@[^@]+\.[^@]+
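
For comparison, here is a rough Python sketch of both checks. The function names are made up for illustration. The dot is escaped so it only matches a literal dot, the possessive ++ is written as a plain + because Python only gained possessive quantifiers in 3.11 (it makes no difference to what this pattern matches), and fullmatch() is an assumption about how the pattern would be applied.

```python
import re


def has_at_and_dot(text: str) -> bool:
    """Plain text check: the string contains an @ and a dot somewhere."""
    return "@" in text and "." in text


# One or more non-@ chars, a single @, then non-@ chars around a literal dot.
SIMPLE_EMAIL_RE = re.compile(r"[^@]+@[^@]+\.[^@]+")


def has_email_shape(text: str) -> bool:
    """Regex check: exactly one @, with a dot somewhere after it."""
    # fullmatch() requires the whole string to fit the pattern.
    return SIMPLE_EMAIL_RE.fullmatch(text) is not None


if __name__ == "__main__":
    for sample in ("bob@123.com", "bob", "a@b@c.com", "bob.smith@localhost"):
        print(sample, has_at_and_dot(sample), has_email_shape(sample))
```

The two checks are not equivalent: the text check accepts anything containing both characters, while the regex also insists on a single @ with the dot after it.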
From: Drew (X3N0PH0N) 31 Oct 2011 13:51
To: Peter (BOUGHTONP) 9 of 57
As someone who regularly gets my valid email address rejected as not a valid email address, I strongly agree :Y
From: Peter (BOUGHTONP) 31 Oct 2011 13:58
To: 99% of gargoyles look like (MR_BASTARD) 10 of 57
Regex editors don't take the hit and miss out of it, though they can create the illusion of knowing what you're doing, which may or may not be a good thing.

To really take the hit and miss out of it, just learn a bit of regex syntax and how regex matching takes place, and you'll find it much easier.

It's not actually as hard as people often make out - there are only a few basic constructs to understand, and the rest can be extrapolated (and anything you don't use often can be checked in a reference easily enough).
From: Peter (BOUGHTONP) 31 Oct 2011 14:00
To: Drew (X3N0PH0N) 11 of 57
Um, which bit are you disagreeing with?
From: Peter (BOUGHTONP) 31 Oct 2011 14:18
To: Mizzy 12 of 57
You've split the regex up wrong (the + quantifiers apply to the items which precede them, not the items after them), and given incorrect descriptions for some parts. :/

code:
^            # start of input
 
[\w-\.]+     # incorrect syntax. It should be [\w\-.]+
             # since - needs escaping inside a char class, and . doesn't.
             # Matches a single alphanumeric, underscore, hyphen, or dot.
             # The + says "as many as possible, but at least one"
 
@            # a literal @ sign
 
(            # begin a capturing group
             # though should be a non-capturing group
             # but syntax is slightly more complex (?: ... ) vs ( ... )
 
    [\w-]+   # this time the syntax _might_ work; some regex engines
             # will auto-escape hyphen if it's the last character, but since
             # someone that doesn't know regex might put another char afterwards,
             # it's always recommended to manually escape hyphen, to
             # avoid inadvertently creating an unwanted range.
 
    \.       # since this . is outside the class it's been escaped.
 
)+           # end the capturing group.
             # match the pattern inside the group as many times as 
             # possible, at least once.
 
[\w-]{2,4}   # as above, but this time match up to 4 characters, at least 2.
             # this is not recommended for emails because 
             # it will break for emails that end in .museum and similar
             # and also for numerical IP addresses ending in a single digit
             # and it allows things like ___ or --- which are not valid emails
 
$            # end of input
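
The limitations listed in the comments on the final {2,4} part can be checked directly. This is a sketch in Python, using the escaped-hyphen spelling and a non-capturing group as suggested in the comments. The sample addresses are made up for illustration.

```python
import re

# Same structure as the commented breakdown, with hyphens escaped and the
# group made non-capturing.
EMAIL_RE = re.compile(r"^[\w\-.]+@(?:[\w\-]+\.)+[\w\-]{2,4}$")

SAMPLES = [
    "bob@example.com",     # ordinary address
    "bob@example.museum",  # final part longer than 4 chars
    "bob@10.0.0.1",        # numeric address ending in a single digit
    "bob@example.___",     # underscores where the final part should be
    "bob@example.---",     # hyphens where the final part should be
]

# Map each sample to whether the pattern accepts it.
results = {address: EMAIL_RE.match(address) is not None for address in SAMPLES}

if __name__ == "__main__":
    for address, accepted in results.items():
        print(f"{address:22} {'accepted' if accepted else 'rejected'}")
```

With this pattern, the .museum and single-digit cases are rejected, while the ___ and --- cases are accepted.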
EDITED: 31 Oct 2011 14:19 by BOUGHTONP
From: Drew (X3N0PH0N) 31 Oct 2011 14:38
To: Peter (BOUGHTONP) 13 of 57
quote: me
I strongly agree :Y


You just expect disagreement (hug)
From: Drew (X3N0PH0N) 31 Oct 2011 14:40
To: Peter (BOUGHTONP) 14 of 57
I think the problem with regex is not that it's hard as such (as you say, it's not); it's just that it has a large vocabulary and it's quite arbitrary.

I can do pretty complex things with regex after a bit of reminding myself what's what. But then 20 minutes later I've forgotten it all again. That's the problem with regex :(
From: Peter (BOUGHTONP) 31 Oct 2011 15:15
To: Drew (X3N0PH0N) 15 of 57
But it doesn't really have a large vocabulary. Well, not sure how you're defining vocabulary, but there's really only four or five types of things - quantifiers, character classes, positions, groups, and alternation, and none of those have more than a handful of variants.

(Then, to reduce having to type {...} and [...] and (...) as much, there's shorthand quantifiers, shorthand classes, shorthand positions.)

It is a bit of a pain that the syntax uses the same symbols for different meanings, but - once you understand when \ and ? mean different things, and a few other bits - then the rest isn't so bad, and far less arbitrary than it seems on the surface.

And it's also annoying that we've got at least five major programming variants (Perl/PCRE/Python/.NET/Java) which all have slight differences/benefits and then cut-down versions in JavaScript/grep/awk/etc.

But it still doesn't deserve the bad reputation a lot of people assign it.


withregardstoremembering/understanding,themostimportantthing,is
notwritingregexesthatlooklikethis-becauseitdoesn'thelpanyonewith
figuringoutwhat'sgoingonwhenyouremoveallformattinginformation.

It's just a pity that extended/comment mode (where unescaped whitespace is ignored, and # starts a comment) is not the default one in almost all implementations, so people think they must squish it all on a single line.
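
As an illustration of extended/comment mode, here is a sketch using Python's re.VERBOSE flag (other engines spell the flag differently). The pattern is the email regex from earlier in the thread with hyphens escaped; the names COMPACT and COMMENTED are made up for the example.

```python
import re

# Everything squished onto a single line.
COMPACT = re.compile(r"^[\w\-.]+@(?:[\w\-]+\.)+[\w\-]{2,4}$")

# The same pattern in verbose mode: unescaped whitespace outside character
# classes is ignored, and # starts a comment that runs to the end of the line.
COMMENTED = re.compile(
    r"""
    ^                   # start of input
    [\w\-.]+            # word chars, hyphens, or dots, at least one
    @                   # a literal @ sign
    (?: [\w\-]+ \. )+   # one or more labels, each followed by a dot
    [\w\-]{2,4}         # final label, 2 to 4 chars
    $                   # end of input
    """,
    re.VERBOSE,
)

SAMPLES = ["bob@123.com", "bob@example.museum", "not an email", "a.b@c-d.co.uk"]

# Both spellings should agree on every sample.
same_results = all(
    (COMPACT.match(s) is None) == (COMMENTED.match(s) is None) for s in SAMPLES
)

if __name__ == "__main__":
    print(same_results)
```

The two compiled patterns are meant to behave identically; only the readability differs.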
EDITED: 31 Oct 2011 15:22 by BOUGHTONP
From: 99% of gargoyles look like (MR_BASTARD) 31 Oct 2011 15:18
To: Peter (BOUGHTONP) 16 of 57

Of course, you're probably right.

 

My problem is simply that I usually turn to regex when I need to get something done (and learning it just gets in the way of doing something more interesting), rather than sitting down and taking the time to learn it properly.

From: Drew (X3N0PH0N) 31 Oct 2011 15:23
To: Peter (BOUGHTONP) 17 of 57
quote:
But it doesn't really have a large vocabulary. Well, not sure how you're defining vocabulary, but there's really only four or five types of things - quantifiers, character classes, positions, groups, and alternation, and none of those have more than a handful of variants.


Yeah, that's grammar/syntax which I agree is pretty neat.

The vocabulary isn't that large, but it's quite large, and that combined with its arbitrariness (i.e. everything is one character, so can't be differentiated/remembered that way, and the characters often don't obviously relate to their subjects and so on, which makes memorising hard) is what makes it difficult.

And it genuinely is complex when you get into back/forward references and have to worry about greediness and that kinda stuff. That's a genuine headfuck.
EDITED: 31 Oct 2011 15:24 by X3N0PH0N
From: Peter (BOUGHTONP) 31 Oct 2011 15:37
To: Drew (X3N0PH0N) 18 of 57

I'm only half sure what you're on about with that middle paragraph. :S

 


Most times when people worry about greediness, they should actually be using lazy quantifiers, or a negative character class.
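
A small sketch of the difference, in Python, with a made-up input string. The greedy version grabs too much, while the lazy quantifier and the negative (negated) character class both stop at the first closing tag.

```python
import re

TEXT = "<b>one</b> and <b>two</b>"

# Greedy: .* takes as much as it can, so it runs on to the last </b>.
greedy = re.findall(r"<b>(.*)</b>", TEXT)
# -> ['one</b> and <b>two']

# Lazy: .*? takes as little as it can, stopping at the first </b>.
lazy = re.findall(r"<b>(.*?)</b>", TEXT)
# -> ['one', 'two']

# Negative character class: [^<]* cannot cross a < at all.
negated = re.findall(r"<b>([^<]*)</b>", TEXT)
# -> ['one', 'two']

if __name__ == "__main__":
    print(greedy)
    print(lazy)
    print(negated)
```

The character class version says exactly what may appear between the tags, so it does not rely on backtracking to find the right stopping point.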

 

(If I was designing regex from scratch, I'd either make lazy the default, or have no default, so that people had to learn there are three different modes - greedy, lazy, and possessive - and when each is appropriate.)

 


If you're using back references a lot, you're getting into the territory where a simple parser is likely the better choice (probably using a number of smaller, more basic regexes).

EDITED: 31 Oct 2011 15:37 by BOUGHTONP
From: 99% of gargoyles look like (MR_BASTARD) 31 Oct 2011 16:11
To: Peter (BOUGHTONP) 19 of 57
I am both a lazy quantifier AND a negative character class.
From: 99% of gargoyles look like (MR_BASTARD) 31 Oct 2011 16:11
To: 99% of gargoyles look like (MR_BASTARD) 20 of 57
I am also putting off performance reviews. God the tedium! :(