Text data analysis

From: Ally27 Aug 2010 05:42
To: ALL1 of 9

Hullo. (If you are in a hurry, just read the last paragraph)

 

My workplace (an accommodation listing web site) has a problem with scammers. They use stolen credit cards to sign up for accounts on the site, then spam users with fraudulent messages persuading them to send money via Western Union. They are depressingly successful. Given the proportion of users who have their password set to "password" this perhaps isn't surprising, but I digress.

 

It has occurred to me that we have a potentially massive dataset that could help us predict whether a person is a scammer or not. There are a number of criteria I know how to check against (time it takes for them to send their first message, the time of day they send these messages, etc.) but one big one is the text content of the ads themselves.

 

I'd like to run an analysis of which words appear in scammer ads, but not in genuine ads. It's stored as a (MS SQL) varchar field- my current plan is to use a cursor (...urgh) to go through each row, splitting the string and UPDATE-ing a count of the word occurrence in a separate table as I go. This sounds hellish, performance wise. Does anyone have any ideas of a better way to do this?

From: 99% of gargoyles look like (MR_BASTARD)27 Aug 2010 07:44
To: Ally 2 of 9
I assume that your scripting language as ASP (or variant thereof) and I cheerfully admit to knowing SFA about ASP. But is there an equivalent to preg_replace? I was just thinking that you could use your blacklist as an array for the pattern and the scammer's text as the subject. Count the number of replacements. Robert, your mother's brother, is.
From: Kenny J (WINGNUTKJ)27 Aug 2010 11:30
To: Ally 3 of 9
What version/level of SQL Server are you using? That's the sort of thing that you can do in SSIS. If you've got Enterprise Edition, you get all kinds of fuzzy logic and data mining goodness.
From: Ally27 Aug 2010 14:35
To: 99% of gargoyles look like (MR_BASTARD) 4 of 9
We already do that, to an extent. But in this case I don't already have a blacklist- I want to assemble one from the data we already have.
From: Ally27 Aug 2010 14:38
To: Kenny J (WINGNUTKJ) 5 of 9
We're on 2005, but not Enterprise. Just Standard, or whatever it's called. I suppose it won't be a critical problem to do it manually- it's not going to run often, after all.
From: 99% of gargoyles look like (MR_BASTARD)27 Aug 2010 14:53
To: Ally 6 of 9
quote: Ally
it's not going to run often, after all.

<employs legions of scammers to bust Ally's database />
From: Kenny J (WINGNUTKJ)27 Aug 2010 15:04
To: Ally 7 of 9

In that case, you won't have the fun stuff for doing pattern matching, etc. No matter! You can still do what you want in SQL rather than .net code:

 

http://stackoverflow.com/questions/881913/sql-server-function-for-displaying-word-frequency-in-a-column

 

is a good place to start

From: Ally27 Aug 2010 15:06
To: 99% of gargoyles look like (MR_BASTARD) 8 of 9
Hah. It'll run every 24 hours to establish a watch list of words. That watch list will then be used constantly.
From: Ally27 Aug 2010 15:07
To: Kenny J (WINGNUTKJ) 9 of 9
Aha. Looks perfect-thanks.