Anyone here good with regular expressions?
We've picked up the following one from Microsoft, with regards to validating email structures before they're applied:
^[\w-\.]+@([\w-]+\.)+[\w-]{2,4}$
The only problem is that this excludes bob@123.com or bob@456.com, both of which use valid domains. The expression seems to require a letter be present in the domain, which obviously isn't a requirement in the real world.
Can anyone confirm this is the case, and perhaps even better, suggest a modification to the expression?
And for anyone that /can/ understand the above, is this any better?
( \w+([-+.']\w+)*@\w+([-.]\w+)*\.\w+([-.]\w+)* )
(Should I be putting these in code tags I wonder?)
Bob's an arse, I wouldn't bother sending him email.
(and I don't know, regex has always confused me)
\w in [\w-\.] will match 0-9, A-Z, a-z and underscore chars without modification.
Explanation of the whole thing.
^[\w-\.]
Anything before the @ sign, as long as it's alphanumeric, an underscore, a hyphen or a dot
+@
Match the @ symbol
([\w-]+\.)
Match any alphanumeric domain prefixes and the dot
+[\w-]{2,4}$
Match the domain suffixes (.uk, .com) making sure the length is between 2 and 4 chars.
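If anyone wants to check the breakdown above rather than take it on trust, here is a minimal sketch in Python. Python is my choice, the thread never names a language. The hyphen in the first class is escaped because Python's re module does not accept [\w-\.] as written, and the helper name and sample addresses are made up for illustration.

```python
import re

# Pattern from the thread, with the hyphens escaped so Python's re accepts it.
PATTERN = re.compile(r'^[\w\-.]+@([\w\-]+\.)+[\w\-]{2,4}$')

def looks_valid(address):
    """Return True if the address matches the thread's pattern (hypothetical helper)."""
    return PATTERN.match(address) is not None

# Sample addresses are illustrative only.
for sample in ["bob@123.com", "bob@456.com", "bob@example.museum", "___@---.--"]:
    print(sample, looks_valid(sample))
```

The numeric domains pass because \w covers digits, the .museum address fails on the {2,4} limit, and the underscore-and-hyphen nonsense still gets through.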
Get RegexBuddy, it's useful :-)
That is likely, although there are some differences between implementations of regex, so check the regex against the documentation for your language.
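A quick way to see one such difference, sketched in Python (again my choice of language, not something from the thread): ask the engine whether it will even compile the character class as posted. The compiles helper is a made-up name.

```python
import re

def compiles(pattern):
    """Return True if Python's re module accepts the pattern (hypothetical helper)."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

print(compiles(r'[\w-\.]+'))   # character class as posted in the thread
print(compiles(r'[\w\-.]+'))   # same class with the hyphen escaped
```

As far as I know, Python's re rejects the first form with a "bad character range" error and accepts the second, while other engines may take the original as is.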
if ( Text.contains('@') and Text.contains('.') )
[^@]++@[^@]+\.[^@]+
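For comparison, a rough Python sketch of that looser check. Two assumptions on my part: the possessive ++ is swapped for a plain +, since Python only gained possessive quantifiers in 3.11, and the dot is escaped on the guess that a literal dot was intended. The function name is made up.

```python
import re

# Loose check: something, an @, something, a literal dot, something.
# Nothing on either side of the @ may itself contain an @.
LOOSE = re.compile(r'[^@]+@[^@]+\.[^@]+')

def loosely_valid(address):
    """Return True if the whole string fits the loose shape (hypothetical helper)."""
    return LOOSE.fullmatch(address) is not None

print(loosely_valid("bob@123.com"))
print(loosely_valid("bob@localhost"))
print(loosely_valid("a@b@c.com"))
```

This only confirms the rough shape of an address; it makes no attempt to check which characters are used.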
^               # start of input
[\w-\.]+        # incorrect syntax. It should be [\w\-.]+
                # since - needs escaping inside a char class, and . doesn't.
                # Matches a single alphanumeric, underscore, hyphen, or dot.
                # The + says "as many as possible, but at least one"
@               # a literal @ sign
(               # begin a capturing group
                # though should be a non-capturing group
                # but syntax is slightly more complex (?: ... ) vs ( ... )
[\w-]+          # this time the syntax _might_ work; some regex engines
                # will auto-escape hyphen if it's the last character, but if someone
                # that doesn't know regex might put another char afterwards,
                # it's always recommended to manually escape hyphen, to
                # avoid inadvertently creating an unwanted range.
\.              # since this . is outside the class it's been escaped.
)+              # end the capturing group.
                # match the pattern inside the group as many times as
                # possible, at least once.
[\w-]{2,4}      # as above, but this time match 4 characters, at least 2.
                # this is not recommended for emails because
                # it will break for emails that end in .museum and similar
                # and also for numerical IP addresses ending in a single digit
                # and it allows things like ___ or --- which are not valid emails
$               # end of input
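Putting those corrections together, this is roughly how it might look in Python using re.VERBOSE so the comments can live inside the pattern. Treat it as a sketch only: the language is my choice, and dropping the upper limit on the last label (so .museum passes) is my own change, not something the thread settled on.

```python
import re

EMAIL = re.compile(r"""
    ^                   # start of input
    [\w\-.]+            # local part: word chars, hyphen or dot, at least one
    @                   # a literal @ sign
    (?: [\w\-]+ \. )+   # non-capturing group: one or more labels, each followed by a dot
    [\w\-]{2,}          # last label: at least 2 chars, no upper limit (my assumption)
    $                   # end of input
""", re.VERBOSE)

# Sample addresses are illustrative only.
for sample in ["bob@123.com", "bob@example.museum", "bob@com"]:
    print(sample, EMAIL.match(sample) is not None)
```

It still lets through things like ___@---.-- since the character classes themselves are unchanged.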
Of course, you're probably right.
My problem is simply that I usually turn to regex when I need to get something done (and learning it just gets in the way of doing something more interesting), rather than sitting down and taking the time to learn it properly.
I'm only half sure what you're on about with that middle paragraph. :S
Most times when people worry about greediness, they should actually be using lazy quantifiers, or a negative character class.
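A small Python illustration of the three options, pulling quoted strings out of a made-up sentence (the sample text and variable names are mine):

```python
import re

text = 'say "one" and "two"'

greedy = re.findall(r'".*"', text)      # .* grabs as much as it can
lazy = re.findall(r'".*?"', text)       # .*? stops at the first closing quote
negated = re.findall(r'"[^"]*"', text)  # [^"]* can never run past a quote

print(greedy)   # ['"one" and "two"']
print(lazy)     # ['"one"', '"two"']
print(negated)  # ['"one"', '"two"']
```

The lazy and negated-class versions give the same answer here; the negated class is usually the one I'd reach for, since it states outright what is not allowed between the quotes.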
(If I was designing regex from scratch, I'd either make lazy the default, or have no default, so that people had to learn there are three different modes, and when each is appropriate.)
If you're using back references a lot, you're probably getting into territory where a simple parser is the better choice (probably using a number of smaller, more basic regexes).
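Applied to the email question from earlier in the thread, the "simple parser plus small regexes" idea might look something like this in Python. It is a sketch under my own assumptions: the rules for what counts as a valid local part or label are simplified guesses, not a full implementation of the mail standards, and all the names are made up.

```python
import re

# Small, single-purpose patterns (simplified rules, assumed for illustration).
LOCAL = re.compile(r'[\w.+\-]+')
LABEL = re.compile(r'[A-Za-z0-9](?:[A-Za-z0-9\-]*[A-Za-z0-9])?')

def check_address(address):
    """Split the address by hand, then test each piece with a basic regex."""
    local, sep, domain = address.rpartition('@')
    if not sep or not local or not domain:
        return False                      # no @, or one side is empty
    if not LOCAL.fullmatch(local):
        return False                      # odd characters before the @
    labels = domain.split('.')
    if len(labels) < 2:
        return False                      # need at least name.tld
    # Every label must start and end with a letter or digit.
    return all(LABEL.fullmatch(label) for label in labels)

for sample in ["bob@123.com", "bob@example.museum", "___@---.--", "bob@com"]:
    print(sample, check_address(sample))
```

Each pattern stays short enough to read at a glance, and the splitting logic is ordinary code rather than one long expression.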