Regular expression understanding

From: Radio 31 Oct 2011 10:35

To: ALL 1 of 57

Anyone here good with regular expressions?
We've picked up the following one from Microsoft for validating the structure of email addresses before they're applied:
^[\w-\.]+@([\w-]+\.)+[\w-]{2,4}$

The only problem is that it seems to exclude bob@123.com or bob@456.com, both of which use valid domains. The expression seems to require a letter in the domain, which obviously isn't a requirement in the real world.

Can anyone confirm this is the case, and perhaps even better, suggest a modification to the expression?

From: Radio 31 Oct 2011 10:50

To: Radio 2 of 57

39068.2 In reply to 39068.1

And for anyone that /can/ understand the above, is this any better?
( \w+([-+.']\w+)*@\w+([-.]\w+)*\.\w+([-.]\w+)* )

(Should I be putting these in code tags I wonder?)

From: 99% of gargoyles look like (MR_BASTARD) 31 Oct 2011 10:53

To: Radio 3 of 57

39068.3 In reply to 39068.1

Bob's an arse, I wouldn't bother sending him email.

(and I don't know, regex has always confused me)

EDITED: 31 Oct 2011 10:54 by MR_BASTARD

From: Mizzy 31 Oct 2011 11:01

To: Radio 4 of 57

39068.4 In reply to 39068.1

\w in [\w-\.] will match 0-9, A-Z and a-z chars (plus underscore) without modification.

Explanation of the whole thing.

^[\w-\.]
Anything before the @ sign as long as it's alphanumeric + dot

+@
Match the @ symbol

([\w-]+\.)
Match any alphanumeric domain prefixes and the dot

+[\w-]{2,4}$
Match the domain suffixes (.uk, .com) making sure the length is between 2 and 4 chars.

get regexbuddy it's useful :-)

EDITED: 31 Oct 2011 11:05 by MIZZY

From: Radio 31 Oct 2011 11:33

To: Mizzy 5 of 57

39068.5 In reply to 39068.4

Cheers, so that looks like it should work fine for numerical domains, meaning that it's probably been typed in wrong in the actual code.

From: Mizzy 31 Oct 2011 12:40

To: Radio 6 of 57

39068.6 In reply to 39068.5

That is likely, although there are some differences between implementations of regex, so check the regex against the documentation for your language.
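
As a rough illustration of how implementations differ, here's a small Python sketch (Python is just the example language here, and try_compile is a made-up helper name). Python's re module refuses to compile the pattern exactly as posted, because it reads \w-\. as a character range starting at \w, whereas a variant with the hyphens escaped compiles fine. Other engines may well accept the original as it stands.

```python
import re

ORIGINAL = r"^[\w-\.]+@([\w-]+\.)+[\w-]{2,4}$"    # as posted at the top of the thread
ESCAPED = r"^[\w\-.]+@([\w\-]+\.)+[\w\-]{2,4}$"   # same pattern with the hyphens escaped


def try_compile(pattern):
    """Return the compiled pattern, or None if this engine rejects it."""
    try:
        return re.compile(pattern)
    except re.error:
        return None


original_ok = try_compile(ORIGINAL) is not None
escaped_ok = try_compile(ESCAPED) is not None

print("original compiles:", original_ok)
print("escaped compiles: ", escaped_ok)
```

So a pattern copied from one platform's documentation is worth test-compiling on the platform you actually run it on.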

From: 99% of gargoyles look like (MR_BASTARD) 31 Oct 2011 13:17

To: Mizzy 7 of 57

39068.7 In reply to 39068.4

All of which stimulated me to check out regex editors. Anything to take out the hit-n-miss of regex.

From: Peter (BOUGHTONP) 31 Oct 2011 13:48

To: Radio 8 of 57

39068.8 In reply to 39068.1

The regex you've given does not require a letter. In every regex implementation I've encountered where \w has meaning, it will match (as a minimum) [0-9A-Za-z_] - some implementations will also match accented/etc letters, but they all match at least those 63 chars.
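
A quick way to check what \w covers is to ask the engine directly. This is a minimal sketch using Python's re module (an assumption - the thread doesn't say which language the original code uses); the re.ASCII flag restricts \w to the ASCII minimum, and ascii_word_chars is just an illustrative name.

```python
import re
import string

# Every ASCII character that \w matches when Unicode matching is switched off.
ascii_word_chars = [c for c in map(chr, range(128)) if re.fullmatch(r"\w", c, re.ASCII)]

# In ASCII order: digits, upper case, underscore, lower case.
expected = string.digits + string.ascii_uppercase + "_" + string.ascii_lowercase

print(len(ascii_word_chars))                      # 10 digits + 52 letters + underscore = 63
print("".join(ascii_word_chars) == expected)      # True
print(re.fullmatch(r"\w+", "123") is not None)    # True: digits alone satisfy \w+
```

Without re.ASCII, Python's \w on a str also matches accented letters and other Unicode word characters, which is the "some implementations" case.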

But what's the context of this?

To correctly validate all email formats with regex is actually a hugely complex task (due to what syntax is allowed by various email software), and it's probably not worth the effort to do more than check that there's a single @ and at least one dot.

Which is usually easier to do with simple text checks:

code:

if ( Text.contains('@') and Text.contains('.') )

But if you're forced to use a regex for something then:

code:

[^@]++@[^@]+\.[^@]+
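
For comparison, here are both approaches as a minimal Python sketch. The function names are made up for illustration, and the text check is slightly stricter than the one-liner above (it wants exactly one @ and a dot somewhere after it). The regex uses a plain + rather than the possessive ++, since possessive quantifiers aren't available in every engine (Python's re only gained them in 3.11), and the dot is escaped so that it matches a literal dot rather than any character.

```python
import re

SIMPLE_EMAIL = re.compile(r"[^@]+@[^@]+\.[^@]+")


def plausible_by_text(address: str) -> bool:
    """Exactly one @, something before it, and a dot somewhere after it."""
    local, sep, domain = address.partition("@")
    return bool(local) and sep == "@" and "@" not in domain and "." in domain


def plausible_by_regex(address: str) -> bool:
    """Same idea as a regex: one @, and a dot with text on both sides after it."""
    return SIMPLE_EMAIL.fullmatch(address) is not None


for candidate in ["bob@123.com", "bob@localhost", "a@b@c.com", "@example.com"]:
    print(candidate, plausible_by_text(candidate), plausible_by_regex(candidate))
```

Neither check proves an address is deliverable; they only filter out obvious typos, which is about as far as syntax checking can usefully go.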

From: Drew (X3N0PH0N) 31 Oct 2011 13:51

To: Peter (BOUGHTONP) 9 of 57

39068.9 In reply to 39068.8

As someone who regularly gets my valid email address rejected as not a valid email address, I strongly agree :Y

From: Peter (BOUGHTONP) 31 Oct 2011 13:58

To: 99% of gargoyles look like (MR_BASTARD) 10 of 57

39068.10 In reply to 39068.7

Regex editors don't take the hit and miss out of it, though they can create the illusion of knowing what you're doing, which may or may not be a good thing.

To really take the hit and miss out of it, just learn a bit of regex syntax and how regex matching takes place, and you'll find it much easier.

It's not actually as hard as people often make out - there are only a few basic constructs to understand, and the rest can be extrapolated (and, if you don't use it often, looked up in a reference easily enough).

From: Peter (BOUGHTONP) 31 Oct 2011 14:00

To: Drew (X3N0PH0N) 11 of 57

39068.11 In reply to 39068.9

Um, which bit are you disagreeing with?

From: Peter (BOUGHTONP) 31 Oct 2011 14:18

To: Mizzy 12 of 57

39068.12 In reply to 39068.4

You've split the regex up wrong (the + quantifiers apply to the items which precede them, not the items after them), and given incorrect descriptions for some parts. :/

code:

^            # start of input
 
[\w-\.]+     # incorrect syntax. It should be [\w\-.]+
             # since - needs escaping inside a char class, and . doesn't.
             # Matches a single alphanumeric, underscore, hyphen, or dot.
             # The + says "as many as possible, but at least one"
 
@            # a literal @ sign
 
(            # begin a capturing group
             # though should be a non-capturing group
             # but syntax is slightly more complex (?: ... ) vs ( ... )
 
    [\w-]+   # this time the syntax _might_ work; some regex engines
             # will auto-escape hyphen if it's the last character, but since
             # someone who doesn't know regex might put another char afterwards,
             # it's always recommended to manually escape hyphen, to
             # avoid inadvertently creating an unwanted range.
 
    \.       # since this . is outside the class it's been escaped.
 
)+           # end the capturing group.
             # match the pattern inside the group as many times as 
             # possible, at least once.
 
[\w-]{2,4}   # as above, but this time match up to 4 characters, at least 2.
             # this is not recommended for emails because 
             # it will break for emails that end in .museum and similar
             # and also for numerical IP addresses ending in a single digit
             # and it allows things like ___ or --- which are not valid emails
 
$            # end of input
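
To see those points in practice, here's a small Python sketch (Python is an assumption, as is the name CORRECTED) that runs the pattern with the hyphens escaped and a non-capturing group against a few sample addresses. The samples are invented for illustration.

```python
import re

# The thread's pattern with hyphens escaped and the group made non-capturing.
CORRECTED = re.compile(r"^[\w\-.]+@(?:[\w\-]+\.)+[\w\-]{2,4}$")

samples = [
    "bob@123.com",             # all-numeric domain label
    "bob@456.com",
    "curator@example.museum",  # suffix longer than 4 characters
    "bob@192.168.0.1",         # last part is a single digit
    "___@---.--",              # nonsense that still fits the pattern
]

results = {s: CORRECTED.search(s) is not None for s in samples}
for address, matched in results.items():
    print(f"{address:24} {matched}")
```

The numeric domains pass, the .museum and single-digit cases are rejected by the {2,4} limit, and the underscore/hyphen nonsense is accepted.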

EDITED: 31 Oct 2011 14:19 by BOUGHTONP

From: Drew (X3N0PH0N) 31 Oct 2011 14:38

To: Peter (BOUGHTONP) 13 of 57

39068.13 In reply to 39068.11

quote: me

I strongly agree :Y

You just expect disagreement (hug)

From: Drew (X3N0PH0N) 31 Oct 2011 14:40

To: Peter (BOUGHTONP) 14 of 57

39068.14 In reply to 39068.10

I think the problem with regex is not that it's hard as such (as you say, it's not); it's just that it's a large vocabulary and it's quite arbitrary.

I can do pretty complex things with regex after a bit of reminding myself what's what. But then 20 minutes later I've forgotten it all again. That's the problem with regex :(

From: Peter (BOUGHTONP) 31 Oct 2011 15:15

To: Drew (X3N0PH0N) 15 of 57

39068.15 In reply to 39068.14

But it doesn't really have a large vocabulary. Well, not sure how you're defining vocabulary, but there's really only four or five types of things - quantifiers, character classes, positions, groups, and alternation, and none of those have more than a handful of variants.

(Then, to reduce having to type {...} and [...] and (...) as much, there's shorthand quantifiers, shorthand classes, shorthand positions.)

It is a bit of a pain that the syntax uses the same symbols for different meanings, but - once you understand when \ and ? mean the different things, and a few other bits - then the rest isn't so bad, and far less arbitrary than it seems on the surface.

And it's also annoying that we've got at least five major programming variants (Perl/PCRE/Python/.NET/Java) which all have slight differences/benefits and then cut-down versions in JavaScript/grep/awk/etc.

But it still doesn't deserve the bad reputation a lot of people assign it.

withregardstoremembering/understanding,themostimportantthing,is
notwritingregexesthatlooklikethis-becauseitdoesn'thelpanyonewith
figuringoutwhat'sgoingonwhenyouremoveallformattinginformation.

It's just a pity that extended/comment mode (where unescaped whitespace is ignored, and # starts a comment) is not the default one in almost all implementations, so people think they must squish it all on a single line.
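
As a sketch of what comment mode looks like, here's the second pattern posted near the top of the thread written out with Python's re.VERBOSE flag. Python, the name VERBOSE_EMAIL, and the switch to non-capturing groups are all choices made for this example; the comments are one reading of the pattern, not the original author's.

```python
import re

VERBOSE_EMAIL = re.compile(r"""
    \w+                 # local part starts with one or more word characters
    (?: [-+.'] \w+ )*   # then any number of separator + word-character runs
    @                   # a literal @ sign
    \w+                 # first piece of the domain
    (?: [-.] \w+ )*     # further pieces joined by a hyphen or dot
    \. \w+              # a literal dot followed by more word characters
    (?: [-.] \w+ )*     # and optionally more pieces after that
""", re.VERBOSE)

for candidate in ["bob@123.com", "o'brien+tag@mail.example.co.uk", "bob@localhost"]:
    print(candidate, VERBOSE_EMAIL.fullmatch(candidate) is not None)
```

Whitespace inside a character class is still significant in this mode, so [-+.'] is written without spaces inside the brackets.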

EDITED: 31 Oct 2011 15:22 by BOUGHTONP

From: 99% of gargoyles look like (MR_BASTARD) 31 Oct 2011 15:18

To: Peter (BOUGHTONP) 16 of 57

39068.16 In reply to 39068.10

Of course, you're probably right.

My problem is simply that I usually turn to regex when I need to get something done (and learning it just gets in the way of doing something more interesting), rather than sitting down and taking the time to learn it properly.

From: Drew (X3N0PH0N) 31 Oct 2011 15:23

To: Peter (BOUGHTONP) 17 of 57

39068.17 In reply to 39068.15

quote:

But it doesn't really have a large vocabulary. Well, not sure how you're defining vocabulary, but there's really only four or five types of things - quantifiers, character classes, positions, groups, and alternation, and none of those have more than a handful of variants.

Yeah, that's grammar/syntax which I agree is pretty neat.

The vocabulary isn't that large but it's quite large and that combined with its arbitrariness (i.e. everything is one character, so can't be differentiated/remembered that way and the characters often don't obviously relate to their subjects and so on - makes memorising hard) which makes it difficult.

And it genuinely is complex when you get into back/forward references and have to worry about greediness and that kinda stuff. That's a genuine headfuck.

EDITED: 31 Oct 2011 15:24 by X3N0PH0N

From: Peter (BOUGHTONP) 31 Oct 2011 15:37

To: Drew (X3N0PH0N) 18 of 57

39068.18 In reply to 39068.17

I'm only half sure what you're on about with that middle paragraph. :S

Most times when people worry about greediness, they should actually be using lazy quantifiers, or a negative character class.

(If I was designing regex from scratch, I'd either make lazy the default, or have no default, so that people had to learn there are three different modes, and when each is appropriate.)

If you're using back references a lot, you're probably getting into territory where a simple parser is the better choice (probably using a number of smaller, more basic regexes).
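
The greedy/lazy/negated-class difference is easiest to see side by side. A minimal Python sketch, with a made-up snippet of markup as the input:

```python
import re

text = '<a href="x">link</a>'

greedy = re.findall(r"<.+>", text)      # .+ grabs as much as it can, then backtracks
lazy = re.findall(r"<.+?>", text)       # .+? stops at the first > that lets the match succeed
negated = re.findall(r"<[^>]+>", text)  # [^>]+ can never run past a >

print(greedy)   # ['<a href="x">link</a>']
print(lazy)     # ['<a href="x">', '</a>']
print(negated)  # ['<a href="x">', '</a>']
```

The lazy and negated-class versions give the same result here, but the negated class says what is actually meant ("anything that isn't a >") and doesn't rely on backtracking to get there.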

EDITED: 31 Oct 2011 15:37 by BOUGHTONP

From: 99% of gargoyles look like (MR_BASTARD) 31 Oct 2011 16:11

To: Peter (BOUGHTONP) 19 of 57

39068.19 In reply to 39068.18

I am both a lazy quantifier AND a negative character class.

From: 99% of gargoyles look like (MR_BASTARD) 31 Oct 2011 16:11

To: 99% of gargoyles look like (MR_BASTARD) 20 of 57

39068.20 In reply to 39068.19

I am also putting off performance reviews. God the tedium! :(


Beehive Forum 1.5.2 © 2024 Project Beehive Forum