Regular expression understanding

From:	Radio	31 Oct 2011 10:50
To:	Radio	2 of 57

39068.2 In reply to 39068.1

And for anyone that /can/ understand the above, is this any better?
( \w+([-+.']\w+)*@\w+([-.]\w+)*\.\w+([-.]\w+)* )

(Should I be putting these in code tags I wonder?)

From:	99% of gargoyles look like (MR_BASTARD)	31 Oct 2011 10:53
To:	Radio	3 of 57

39068.3 In reply to 39068.1

Bob's an arse, I wouldn't bother sending him email.

(and I don't know, regex has always confused me)

bastard by name, bastard by nature

From:	Mizzy	31 Oct 2011 11:01
To:	Radio	4 of 57

39068.4 In reply to 39068.1

\w in [\w-\.] will match 0-9, A-Z and a-z chars without modification.

Explanation of the whole thing.

^[\w-\.]
Anything before the @ sign as long as it's alphanumeric + dot

+@
Match the @ symbol

([\w-]+\.)
Match any alphanumeric domain prefixes and the dot

+[\w-]{2,4}$
Match the domain suffixes (.uk, .com) making sure the length is between 2 and 4 chars.

get regexbuddy it's useful :-)

From:	Radio	31 Oct 2011 11:33
To:	Mizzy	5 of 57

39068.5 In reply to 39068.4

Cheers, so that looks like it should work fine for numerical domains, meaning that it's probably been typed in wrong in the actual code.

From:	Mizzy	31 Oct 2011 12:40
To:	Radio	6 of 57

39068.6 In reply to 39068.5
That is likely, although there are some differences between implementations of regex, so check the regex against your language standard.

From:	99% of gargoyles look like (MR_BASTARD)	31 Oct 2011 13:17
To:	Mizzy	7 of 57

39068.7 In reply to 39068.4
All of which stimulated me to check out regex editors. Anything to take out the hit-n-miss of regex.

bastard by name, bastard by nature

From:	Peter (BOUGHTONP)	31 Oct 2011 13:48
To:	Radio	8 of 57

39068.8 In reply to 39068.1

The regex you've given does not require a letter. In every regex implementation I've encountered where \w has meaning, it will match (as a minimum) [0-9A-Za-z_] - some implementations will also match accented/etc letters, but they all match at least those 63 chars.
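
As a quick check of that minimum set, here is a small sketch in Python (Python and its re.ASCII flag are my choice for illustration, not something from the original question). It counts the ASCII characters that \w matches and shows that a digits-only string satisfies \w+ without containing a single letter.

```python
import re

# Assumption: Python's re module with re.ASCII, so \w is limited to [0-9A-Za-z_].
ascii_word_chars = [
    chr(code) for code in range(128)
    if re.fullmatch(r"\w", chr(code), re.ASCII)
]

# 10 digits + 26 upper case + 26 lower case + underscore
print(len(ascii_word_chars))  # 63

# A digits-only string passes \w+, so \w on its own never requires a letter.
digits_only_ok = re.fullmatch(r"\w+", "12345", re.ASCII) is not None
print(digits_only_ok)  # True
```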

But what's the context of this?

To correctly validate all email formats with regex is actually a hugely complex task (due to what syntax is allowed by various email software), and it's probably not worth the effort to do more than check that there's a single @ and at least one dot.

Which is usually easier to do with simple text checks:

code:

if ( Text.contains('@') and Text.contains('.') )

But if you're forced to use a regex for something then:

code:

[^@]++@[^@]+\.[^@]+
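
For comparison, both loose checks could look roughly like this in Python. This is only a sketch: the function names are made up for illustration, the dot is escaped so that it only matches a literal dot, and plain + is used instead of the possessive ++ because Python's re module only accepts possessive quantifiers from version 3.11.

```python
import re

# Assumption: plain + instead of possessive ++ so this also runs on Python < 3.11.
LOOSE_EMAIL = re.compile(r"[^@]+@[^@]+\.[^@]+")


def looks_like_email_text(text: str) -> bool:
    """Plain text check: exactly one '@' and at least one dot."""
    return text.count("@") == 1 and "." in text


def looks_like_email_regex(text: str) -> bool:
    """Regex check: something, one '@', something, a literal dot, something."""
    return LOOSE_EMAIL.fullmatch(text) is not None


for candidate in ["bob@example.com", "bob@123.45", "bob.example.com", "a@b@c.com"]:
    print(candidate, looks_like_email_text(candidate), looks_like_email_regex(candidate))
```

Neither check says the address is deliverable; they only reject input that is obviously not an email address.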

3.1415P265E589T932E846R64338

From:	Drew (X3N0PH0N)	31 Oct 2011 13:51
To:	Peter (BOUGHTONP)	9 of 57

39068.9 In reply to 39068.8
As someone who regularly gets my valid email address rejected as not a valid email address, I strongly agree :Y

From:	Peter (BOUGHTONP)	31 Oct 2011 13:58
To:	99% of gargoyles look like (MR_BASTARD)	10 of 57

39068.10 In reply to 39068.7
Regex editors don't take the hit and miss out of it, though they can create the illusion of knowing what you're doing, which may or may not be a good thing.

To really take the hit and miss out of it, just learn a bit of regex syntax and how regex matching takes place, and you'll find it much easier.

It's not actually as hard as people often make out - there are only a few basic constructs to understand, and the rest can be extrapolated (and, if you don't use it often, checked in a reference easily enough).

3.1415P265E589T932E846R64338

From:	Peter (BOUGHTONP)	31 Oct 2011 14:00
To:	Drew (X3N0PH0N)	11 of 57

39068.11 In reply to 39068.9
Um, which bit are you disagreeing with?

3.1415P265E589T932E846R64338

From:	Peter (BOUGHTONP)	31 Oct 2011 14:18
To:	Mizzy	12 of 57

39068.12 In reply to 39068.4

You've split the regex up wrong (the + quantifiers apply to the items which precede them, not the items after them), and given incorrect descriptions for some parts. :/

code:

^            # start of input
 
[\w-\.]+     # incorrect syntax. It should be [\w\-.]+
             # since - needs escaping inside a char class, and . doesn't.
             # Matches a single alphanumeric, underscore, hyphen, or dot.
             # The + says "as many as possible, but at least one"
 
@            # a literal @ sign
 
(            # begin a capturing group
             # though should be a non-capturing group
             # but syntax is slightly more complex (?: ... ) vs ( ... )
 
    [\w-]+   # this time the syntax _might_ work; some regex engines
             # will auto-escape hyphen if it's the last character, but since
             # someone who doesn't know regex might put another char afterwards,
             # it's always recommended to manually escape hyphen, to
             # avoid inadvertently creating an unwanted range.
 
    \.       # since this . is outside the class it's been escaped.
 
)+           # end the capturing group.
             # match the pattern inside the group as many times as 
             # possible, at least once.
 
[\w-]{2,4}   # as above, but this time match up to 4 characters, at least 2.
             # this is not recommended for emails because 
             # it will break for emails that end in .museum and similar
             # and also for numerical IP addresses ending in a single digit
             # and it allows things like ___ or --- which are not valid emails
 
$            # end of input
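
To see the breakdown above in action, here is a sketch of the same regex with the suggested corrections applied (escaped hyphens, non-capturing group), written for Python's re.VERBOSE mode so the comments can stay inside the pattern. Python is my choice for illustration, and the names EMAIL_VERBOSE and matches are hypothetical.

```python
import re

# Hypothetical name; the pattern follows the corrections described above.
EMAIL_VERBOSE = re.compile(
    r"""
    ^                # start of input
    [\w\-.]+         # one or more word chars, hyphens, or dots
    @                # a literal @ sign
    (?:              # non-capturing group
        [\w\-]+      # one or more word chars or hyphens
        \.           # a literal dot
    )+               # repeat the group at least once
    [\w\-]{2,4}      # final part: 2 to 4 word chars or hyphens
    $                # end of input
    """,
    re.VERBOSE,
)


def matches(address: str) -> bool:
    """Return True if the whole address fits the pattern."""
    return EMAIL_VERBOSE.match(address) is not None


print(matches("bob@example.com"))     # True
print(matches("bob@example.museum"))  # False: suffix longer than 4 chars
print(matches("bob@1.2.3.4"))         # False: last part is a single digit
print(matches("bob@example.___"))     # True, although it is not a valid email
```

The last three results are the weaknesses listed for the {2,4} part, which is why the looser check is the safer default.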

3.1415P265E589T932E846R64338

From:	Drew (X3N0PH0N)	31 Oct 2011 14:38
To:	Peter (BOUGHTONP)	13 of 57

39068.13 In reply to 39068.11
quote: me
I strongly agree :Y

You just expect disagreement (hug)

From:	Drew (X3N0PH0N)	31 Oct 2011 14:40
To:	Peter (BOUGHTONP)	14 of 57

39068.14 In reply to 39068.10
I think the problem with regex is not that it's hard as such (as you say, it's not); it's just that it has a large vocabulary and it's quite arbitrary.

I can do pretty complex things with regex after a bit of reminding myself what's what. But then 20 minutes later I've forgotten it all again. That's the problem with regex :(

From:	Peter (BOUGHTONP)	31 Oct 2011 15:15
To:	Drew (X3N0PH0N)	15 of 57

39068.15 In reply to 39068.14
But it doesn't really have a large vocabulary. Well, not sure how you're defining vocabulary, but there's really only four or five types of things - quantifiers, character classes, positions, groups, and alternation, and none of those have more than a handful of variants. (Then, to reduce having to type {...} and [...] and (...) as much, there's shorthand quantifiers, shorthand classes, shorthand positions.)

It is a bit of a pain that the syntax uses the same symbols for different meanings, but - once you understand when \ and ? mean different things, and a few other bits - then the rest isn't so bad, and far less arbitrary than it seems on the surface.

And it's also annoying that we've got at least five major programming variants (Perl/PCRE/Python/.NET/Java) which all have slight differences/benefits and then cut-down versions in JavaScript/grep/awk/etc.

But it still doesn't deserve the bad reputation a lot of people assign it.

withregardstoremembering/understanding,themostimportantthing,is
notwritingregexesthatlooklikethis-becauseitdoesn'thelpanyonewith
figuringoutwhat'sgoingonwhenyouremoveallformattinginformation.

It's just a pity that extended/comment mode (where unescaped whitespace is ignored, and # starts a comment) is not the default one in almost all implementations, so people think they must squish it all on a single line.

3.1415P265E589T932E846R64338

From:	99% of gargoyles look like (MR_BASTARD)	31 Oct 2011 15:18
To:	Peter (BOUGHTONP)	16 of 57

39068.16 In reply to 39068.10

Of course, you're probably right.

My problem is simply that I usually turn to regex when I need to get something done (and learning it just gets in the way of doing something more interesting), rather than sitting down and taking the time to learn it properly.

bastard by name, bastard by nature

From:	Drew (X3N0PH0N)	31 Oct 2011 15:23
To:	Peter (BOUGHTONP)	17 of 57

39068.17 In reply to 39068.15
quote:
But it doesn't really have a large vocabulary. Well, not sure how you're defining vocabulary, but there's really only four or five types of things - quantifiers, character classes, positions, groups, and alternation, and none of those have more than a handful of variants.

Yeah, that's grammar/syntax which I agree is pretty neat.

The vocabulary isn't that large but it's quite large and that combined with its arbitrariness (i.e. everything is one character, so can't be differentiated/remembered that way and the characters often don't obviously relate to their subjects and so on - makes memorising hard) which makes it difficult.

And it genuinely is complex when you get into back/forward references and have to worry about greediness and that kinda stuff. That's a genuine headfuck.

From:	Peter (BOUGHTONP)	31 Oct 2011 15:37
To:	Drew (X3N0PH0N)	18 of 57

39068.18 In reply to 39068.17

I'm only half sure what you're on about with that middle paragraph. :S

Most times when people worry about greediness, they should actually be using lazy quantifiers, or a negative character class.

(If I was designing regex from scratch, I'd either make lazy the default, or have no default, so that people had to learn there are three different modes, and when each is appropriate.)
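
A rough illustration of the difference, using Python's re module and a made-up input string (both are assumptions for the sake of the example):

```python
import re

text = "<a><b>"

greedy = re.findall(r"<.+>", text)      # .+ takes as much as it can
lazy = re.findall(r"<.+?>", text)       # .+? stops at the first possible >
negated = re.findall(r"<[^>]+>", text)  # [^>]+ can never run past a >

print(greedy)   # ['<a><b>']
print(lazy)     # ['<a>', '<b>']
print(negated)  # ['<a>', '<b>']
```

The lazy and negated-class versions give the same result here; the negated class gets there without backtracking, which is usually why it is preferred when the delimiter is a single character.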

If you're using back references a lot, you're probably getting into the territory where a simple parser is the better choice (probably using a number of smaller, more basic regexes).

3.1415P265E589T932E846R64338

From:	99% of gargoyles look like (MR_BASTARD)	31 Oct 2011 16:11
To:	Peter (BOUGHTONP)	19 of 57

39068.19 In reply to 39068.18
I am both a lazy quantifier AND a negative character class.

bastard by name, bastard by nature

From:	99% of gargoyles look like (MR_BASTARD)	31 Oct 2011 16:11
To:	99% of gargoyles look like (MR_BASTARD)	20 of 57

39068.20 In reply to 39068.19
I am also putting off performance reviews. God the tedium! :(

bastard by name, bastard by nature

From:	Mizzy	31 Oct 2011 16:34
To:	Peter (BOUGHTONP)	21 of 57

39068.21 In reply to 39068.12

Sigh, yes, I know. I wasn't paying attention to where I chopped up the code, and I was generalising broadly :-(( Sorry I wasn't up to PB standard, but I didn't have time to write a dissertation :-P

PS have you considered a career as a QSA ? :-)

I still stand by regexbuddy; it does a nice job of validating regex against different regex flavours.
