Blacklist not marked for deletion

eouvrie · Thu Aug 12, 2010 5:37 am

Hi,
I received an email whose sender is blacklisted, but it is not marked for deletion.
In fact, the Bayesian score is +135 and the default spam rating value for the blacklist is -50, giving a total of +85, which is not considered spam.
Here is the Bayesian debug:
[code]DECODED_BASE64_SUBJECT: 0
SUBJECT_WORDS: 0
FROM_WORDS: 2
CONTENT_TYPE: multipart/alternative; boundary="ea7763016137cad67a63e81280a2968e"
NUM_PARTS: 3
TYPE: multipart/alternative; boundary="ea7763016137cad67a63e81280a2968e"
ENCODING: quoted-printable
TYPE: text/plain; charset=iso-8859-1
DISPOSITION: inline
BYTES: 17567
NUM_RAW_WORDS: 38
ENCODING: quoted-printable
TYPE: text/html; charset=iso-8859-1
DISPOSITION: inline
BYTES: 35792
NUM_RAW_WORDS: 206
URL: HTTP://TR2.VIRTUALTARGET.COM.BR
URL: http://busca.extra.com.br
URL: http://imagens.extra.com
URL: http://imagens.extra.com.br
URL: http://notebook.extra.com
URL: http://tr2.virtualtarget.com
URL: http://tr2.virtualtarget.com.br
URL: http://www.extra.com.br
MULT: 3.217742E-026
COMB: 8.660676E-015
RAWSPAMYNESSE: -1.000000E+000
RAWSPAMYNESS: -1.00000
SPAMYNESSE: -9.090000E-001
SPAMYNESS: -0.90900
GOODCOUNT: 20
BADCOUNT: 10
GOODWORDCOUNT: 909
GGAIN: 0.909000
BADWORDCOUNT: 975
BGAIN: 0.975000
INTRESTINGWORDCOUNT: 20
WORDCOUNT: 101
TOTAL_WORDCOUNT_FACTOR: 1.0000
INTERESTING_WORDCOUNT_FACTOR: 1.0000
WORD: conseguir prob=0.990000 occurrences=2
WORD: deste prob=0.990000 occurrences=1
WORD: from_"extra.com.br prob=0.990000 occurrences=1
WORD: imagens prob=0.990000 occurrences=1
WORD: nosso prob=0.990000 occurrences=1
WORD: televendas prob=0.990000 occurrences=2
WORD: voc&ecirc prob=0.990000 occurrences=2
WORD: acesse prob=0.011000 occurrences=1
WORD: anunciados prob=0.011000 occurrences=1
WORD: aqui prob=0.011000 occurrences=1
WORD: cancele prob=0.011000 occurrences=1
WORD: contato prob=0.011000 occurrences=2
WORD: custo prob=0.011000 occurrences=3
WORD: deseja prob=0.011000 occurrences=2
WORD: durarem prob=0.011000 occurrences=1
WORD: e-mails prob=0.011000 occurrences=2
WORD: entrar prob=0.011000 occurrences=2
WORD: essa prob=0.011000 occurrences=1
WORD: esta prob=0.011000 occurrences=2
WORD: estado prob=0.011000 occurrences=1[/code]
In this case, I think the Bayesian score is positive because this is Portuguese spam.
Is there another way to configure MW2010 to mark blacklisted mail for deletion, independently of the Bayesian calculation?
Eric

Thu Aug 12, 2010 5:40 am

Yes, create a filter with the "Spam tool rule" = "Sender is in blacklist"; you can then add up to another -200 to the spam score, etc.

Thu Aug 12, 2010 5:53 am

You can also change the default value for the Blacklist slider to -200 if you trust your Blacklist entries.
I would also suggest changing the Bayesian Range setting to Low.
If you need help locating these, post back.
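To make the arithmetic behind these suggestions concrete, here is a minimal sketch. It assumes the individual ratings are simply added together, which is what the "+135 and -50 gives +85" figures in the first post suggest. The function name is made up for illustration and is not part of MailWasher, and the threshold at which a mail is marked for deletion is not stated in this thread, so it is left out.

```python
BAYESIAN_SCORE = 135     # Bayesian rating reported in the first post
BLACKLIST_DEFAULT = -50  # default blacklist rating reported in the first post


def total_score(*ratings: int) -> int:
    """Hypothetical helper: assume the ratings are simply summed."""
    return sum(ratings)


# Default settings: the positive Bayesian rating outweighs the blacklist.
default_total = total_score(BAYESIAN_SCORE, BLACKLIST_DEFAULT)

# Blacklist slider moved to -200.
slider_total = total_score(BAYESIAN_SCORE, -200)

# Default blacklist rating plus a filter that adds another -200.
filter_total = total_score(BAYESIAN_SCORE, BLACKLIST_DEFAULT, -200)

print(default_total, slider_total, filter_total)
```

Under that assumption, either change is enough to push the total below zero for this particular mail.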

Thu Aug 12, 2010 6:31 am

You can also create a filter that looks for the charset, if you don't get good mail that uses it.
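For anyone who wants to see which charsets a message declares before writing such a filter, here is a rough sketch using Python's standard email module. This runs outside MailWasher and is not how its filters are defined; the function name and the sample message are made up for illustration.

```python
from email import message_from_string


def charsets_used(raw_message: str) -> set[str]:
    """Return the charsets declared by any part of a raw message."""
    msg = message_from_string(raw_message)
    found = set()
    for part in msg.walk():
        charset = part.get_content_charset()  # None if the part declares no charset
        if charset:
            found.add(charset)
    return found


# Made-up sample with the same layout as the mail in the first post.
sample = (
    "MIME-Version: 1.0\n"
    'Content-Type: multipart/alternative; boundary="b"\n'
    "\n"
    "--b\n"
    "Content-Type: text/plain; charset=iso-8859-1\n"
    "\n"
    "hi\n"
    "--b\n"
    "Content-Type: text/html; charset=iso-8859-1\n"
    "\n"
    "<p>hi</p>\n"
    "--b--\n"
)

print(charsets_used(sample))
```

Note that iso-8859-1 is a very common charset, so a filter on it only makes sense if none of your good mail uses it.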

rusticdog · Thu Aug 12, 2010 7:15 am

I think one of the big problems here is that very common words in Portuguese are being included for Bayesian analysis. In our testing we found that common words caused more problems as they tended to end up scoring as either very good words or very spammy words and would negatively influence the results.

You can see all the very good words here:

[code]WORD: acesse                 prob=0.011000          occurrences=1
WORD: anunciados             prob=0.011000          occurrences=1
WORD: aqui                   prob=0.011000          occurrences=1
WORD: cancele                prob=0.011000          occurrences=1
WORD: contato                prob=0.011000          occurrences=2
WORD: custo                  prob=0.011000          occurrences=3
WORD: deseja                 prob=0.011000          occurrences=2
WORD: durarem                prob=0.011000          occurrences=1
WORD: e-mails                prob=0.011000          occurrences=2
WORD: entrar                 prob=0.011000          occurrences=2
WORD: essa                   prob=0.011000          occurrences=1
WORD: esta                   prob=0.011000          occurrences=2
WORD: estado                 prob=0.011000          occurrences=1[/code]

I know it's a hassle, but assuming you get a lot of Portuguese emails that are both good and spam, you could consider adding the very common words to your Bayesian Exclusion list. Under Help >> Common Files >> look for the mwp_exw.dat file. If you add some of the more common words to this file, the Bayesian will ignore them and instead look at words that are considered more interesting.
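To illustrate how much those 13 "very good" words outweigh the 7 spammy ones, here is a small sketch. The MULT and COMB figures in the debug output match the product of the 20 word probabilities and the product of their complements. The last step, dividing MULT by MULT + COMB, is the usual naive Bayes combination and is an assumption here: the thread does not show how MailWasher turns these figures into RAWSPAMYNESS or the final score. The function name is made up.

```python
from math import prod

# Word probabilities from the debug output: 7 words at 0.99, 13 at 0.011.
spammy_words = [0.99] * 7
good_words = [0.011] * 13


def combine(probs: list[float]) -> tuple[float, float, float]:
    """Return (mult, comb, combined) for a list of word probabilities."""
    mult = prod(probs)                   # product of p
    comb = prod(1.0 - p for p in probs)  # product of (1 - p)
    return mult, comb, mult / (mult + comb)


mult, comb, combined = combine(spammy_words + good_words)
print(f"MULT={mult:.6E} COMB={comb:.6E} combined={combined:.3E}")

# What-if: the 13 common words are excluded and nothing replaces them.
# In practice other words would be picked instead, so this is only a rough
# illustration of the direction of the effect.
_, _, combined_without = combine(spammy_words)
print(f"combined without the common words={combined_without:.6f}")
```

With all 20 words the combined value is practically zero (looks like good mail), while the 7 spammy words on their own would push it to practically one.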

Thu Aug 12, 2010 11:46 am

Might be something to put on the programmers' to-do list for later in the project: add a language-specific exclusion list, similar to the English one, that folks could selectively add to the default one from a menu in MW.

mcullet · Thu Aug 12, 2010 12:57 pm

Hi Folks,

Thanks for the post - especially the code you included. Until now, I'd not seen anything from under MWP's skirt. Got all trembley too

RusticDog's comments (as I understood them) refer to the risk of false positives. (Feel free to correct me here RD).

Unless I am wildly wrong, the Bayesian score (index?) refers to probabilities. Used in MWP, it refers to the probability that a particular email is SPAM. Nothing on Earth is perfect - so far as I have been able to determine - and so as with all things related to probability, there is a risk of false positives and negatives: high score but a clean email - low score but a bomb.

(Anyone interested in such things can happily look here for a layman's explanation of Bayesian probability and here for Bayes' Theorem. The mathematics of probabilities is a large church containing lots of practical and interesting theorems - but that's off topic.)

The maths and coding skill behind spam detection / creation is high level - spammers are good at what they do but let's give them no free help.

I'm a bit concerned that spammers visit sites like this and gain very useful intel. I KNOW they are regular visitors of spambaiting forums. Our approach is one of risk management - users require password access to the forum and moderators can direct suspects to a sandpit, PLUS we have sections which are closed to the general public. It works.

Back to your post ...

Probabilities are what this comes down to - that and common sense. I can only speak English (poorly at best) and I have no idea which countries use MWP - hopefully lots. I can't speak Portuguese so it would be pointless for someone to send me anything in that language - delete it, even if it happens to be non-spam. I have no idea how to treat languages which are based on pictures (like Chinese or Japanese) because, so far as I understand it, the same character might mean several things depending upon context.

MWP (any spam filter product) cannot replace your 'suss' sense (no references to Kiwi's). We sometimes see something that just looks suss - but passes a SPAM filter: maybe we delete it, maybe we look out of curiosity. I prefer caution and if I suspect something is suss then it goes to the bin. If you alter default spam filter settings (a personal choice) then do so as long as you understand the consequences and risks. You might be deleting / blacklisting lots of innocents (generating loads of false positives) or neutering MWP to the point where it is of limited value.

Every time a word is added to a 'look-up' list on a programmer's reference, it affects product performance - sometimes really badly - depending upon HOW things are done. (Algorithms - a whole other topic)

My points:

(1) We might need a safe section to discuss this sort of thing which is kept away from spammers. (Maybe it exists and I have yet to discover it ....)
(2) Probabilities - are an assigned value of likelihood of an event / state. While valuable, nothing can replace the 'suss' sense - if something seems suss then it probably is so delete it.
(3) Change default settings carefully - everything has a consequence (false positives / false negatives) - but don't assume a high setting offers bigger, better protection. That is not how probabilities work.
(4) I can't talk about some other things here because I have no interest in aiding spammers (and it would absolutely BORE most members to tears).

HTH

Thu Aug 12, 2010 1:40 pm

I don't think anything said here would be of much help to the spammers, they could learn a lot more by spending an hour poking into the guts of MW using the free trial version.

The neat thing about Bayesian is that it adjusts to each person based on their good and bad e-mail, making it impossible for a spammer to target a message to bypass its detection. What you do see are very short spams like "naked ladies at -link-" that are too short to process well, and ones that drag in a whole paragraph of junk words to try and fool the Bayesian. Neither is a very effective method of getting folks to visit the link, so spammers aren't happy with the results.

The key point of 2010 is that it applies all of the tools against each e-mail, so that if they do beat the Bayesian they still run up against the other tools. It isn't even hard to put up a filter to look for short messages if you want to give them some bad karma. Say -200 for less than 30 characters, -100 for under 50 and -50 for under 100. Put them near the bottom of your filter stack and they will only be run if none of your other filters hit.
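The tiers above could be sketched like this. It is only an illustration of the logic, not MailWasher filter syntax; the function name is made up, and counting characters of the stripped body text is an assumption.

```python
def short_message_penalty(body: str) -> int:
    """Hypothetical penalty for very short messages, using the tiers above."""
    length = len(body.strip())  # assumption: count characters of the trimmed body
    if length < 30:
        return -200
    if length < 50:
        return -100
    if length < 100:
        return -50
    return 0  # long enough: no penalty


print(short_message_penalty("naked ladies at -link-"))
```

The checks run from the shortest tier upward, so each message gets only the single largest penalty that applies.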

Firetrust Support Forums
