Regex help request: Filter Only English characters.

Forum for MailWasher Pro 7 and/or older 2011/2012 versions.
DanAtCCD
Travelling Tuatara
Posts: 26
Joined: Wed Aug 24, 2016 4:39 am


Thu Dec 06, 2018 4:47 am

We're getting a lot of spam using Chinese character sets, so I set out to devise a subject line filter that passes only if it contains only English characters. The problem is that it is tripping occasionally on 100% pure English subject lines. I'm looking for help with this.

The RegEx is as follows:

[^\cA-\cZa-z0-9A-Z `~!@#$%^&*()_=+[{}|;:'",<.>/?\\\-\]\u2013]

Which in simple English is: when the text contains anything that is *NOT*
- a Control Character (char code 1 to 26)
- a lowercase a-z
- a numeral 0-9
- an Uppercase A-Z
- any of the following punctuation: <space>`~!@#$%^&*()_=+[{}|;:'",<.>/?
- any of the following special characters that require escaping: \-]
- the special unicode character 2013 (a dash used by some mail programs.)
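
To make the list above concrete, here is a rough Python translation of the character class. It is only a sketch for cross-checking: Python's re module is not the DEELX engine MailWasher uses, it has no \cX escape (so \cA-\cZ is written as \x01-\x1A here), and the function name is made up for illustration.

```python
import re

# Python translation of the character class from the post.
# Assumption: \cA-\cZ (control codes 1-26) becomes \x01-\x1A,
# because Python's re module has no \cX escape.
# The literal '[' is escaped to avoid a "possible nested set" warning.
NON_ENGLISH = re.compile(
    r"""[^\x01-\x1Aa-z0-9A-Z `~!@#$%^&*()_=+\[{}|;:'",<.>/?\\\-\]\u2013]"""
)


def has_disallowed_char(subject: str) -> bool:
    """Return True if the subject contains a character outside the allowed set."""
    return NON_ENGLISH.search(subject) is not None


print(has_disallowed_char("New: Manage Avira Antivirus With O&O Syspectr"))
print(has_disallowed_char("Sale \u2013 today only"))
print(has_disallowed_char("\u7279\u4ef7\u4f18\u60e0"))
```

In Python the pattern leaves a plain-English subject and one containing the en dash (U+2013) alone, and flags a subject made of Chinese characters. A different result in another tool would point at engine differences rather than at the character list itself.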

However, the following subject line gets flagged by this:

New: Manage Avira Antivirus With O&O Syspectr

I have tested it with both the online tool RegExr (https://regexr.com/) and Regex Match Tracer 2.1, and neither matched:
[Screenshot: RegExr.png]
[Screenshot: Foreign Subject.rws - Regex Match Tracer-2_1.png]
However, Match Tracer v4 matches on the first letter of the string?
[Screenshot: Foreign Subject.rws - Regex Match Tracer-3.png]
Can someone please assist? As you can see, I'm willing to put the time and effort in to get this right, and to share my results...
rusticdog
Firetrust Monkey
Posts: 15864
Joined: Mon Jun 13, 2005 6:27 pm

Re: Regex help request: Filter Only English characters.

Thu Dec 06, 2018 9:54 am

The DEELX engine we use does support POSIX sets (http://www.regexlab.com/en/deelx/syntax/bas_set.htm), so you ought to be able to simplify the filter by just using

[:ascii:]

which covers the characters listed at http://www.asciitable.com/

Does that do the same?
DanAtCCD
Travelling Tuatara
Posts: 26
Joined: Wed Aug 24, 2016 4:39 am

Re: Regex help request: Filter Only English characters.

Thu Dec 06, 2018 10:50 am

[^:ascii:] seems to help a lot -- I'll continue testing with the special characters (like © and ® and special dashes) and see if I can get it to be reliable.

Other ideas also appreciated...
rusticdog
Firetrust Monkey
Posts: 15864
Joined: Mon Jun 13, 2005 6:27 pm

Re: Regex help request: Filter Only English characters.

Thu Dec 06, 2018 11:02 am

We have some built-in language filters that apply when an email declares certain character sets. These won't catch UTF-8 encoded messages, however.

When you click the drop-down beside 'Add Rule' in the filter, you can choose Character Set rule, then click the next drop-down to choose which languages you want to filter on.

e.g.
[Screenshot: Untitled.jpg]
DanAtCCD
Travelling Tuatara
Posts: 26
Joined: Wed Aug 24, 2016 4:39 am

Re: Regex help request: Filter Only English characters.

Thu Dec 06, 2018 1:04 pm

We're already using _all_ of those filters (we have no clients in non-English countries), and it's still not filtering these crafty e-mails.

By limiting the Subject line to use only English characters, with only a few exceptions, we hope to avoid non-English messages as well as any message that attempts to use unicode lookalikes to bypass filters. No reasonable person attempts to use an 'a' that isn't really a char(95) (or was it 60? Oh, nevermind.)
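
A small Python sketch of the lookalike problem, for anyone following along. The Cyrillic letter is just one example I picked; it is not taken from an actual spam message.

```python
import unicodedata

latin_a = "a"          # U+0061, decimal 97
cyrillic_a = "\u0430"  # U+0430, looks like Latin 'a' in many fonts

# Print the code point, the official Unicode name, and whether it is ASCII.
for ch in (latin_a, cyrillic_a):
    print(f"U+{ord(ch):04X}", unicodedata.name(ch), ch.isascii())

# A word that reads as "paypal" but has the Cyrillic letter in second place.
spoofed = "p" + cyrillic_a + "ypal"
print(spoofed == "paypal", spoofed.isascii())
```

The two letters usually look the same on screen, but only the Latin one is ASCII, so an ASCII-only subject rule would flag the spoofed word.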
DanAtCCD
Travelling Tuatara
Posts: 26
Joined: Wed Aug 24, 2016 4:39 am

Re: Regex help request: Filter Only English characters.

Sat Dec 08, 2018 7:47 am

I tried using [:ascii:], but it doesn't seem to be working. Can I use Regex Match Tracer to test this?
[Screenshot: Foreign Subject.rws - Regex Match Tracer.png]
I would have expected that the first English character would have matched...? (I should have used [:ascii:].* if I wanted all of them, but that doesn't work either.)
rusticdog
Firetrust Monkey
Posts: 15864
Joined: Mon Jun 13, 2005 6:27 pm

Re: Regex help request: Filter Only English characters.

Sat Dec 08, 2018 11:16 am

Are you hitting F3 (the Match button) to trigger the search? I'm not having any issue getting a hit with my example text.

I think to match all ASCII you would use
[:ascii:]+
DanAtCCD
Travelling Tuatara
Posts: 26
Joined: Wed Aug 24, 2016 4:39 am

Re: Regex help request: Filter Only English characters.

Sat Dec 08, 2018 12:38 pm

Doh! And here I was thinking the output was immediate.

Given that I only need to confirm the presence of even one non-ASCII character, I believe a rule of "Subject Containing [:^ascii:]" should work.
Fingers crossed, I'll report back once I have a few more garbage e-mails, likely after the weekend...
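
For comparison outside MailWasher, the same "contains at least one non-ASCII character" test can be sketched in Python. Python's re module has no POSIX classes, so [^\x00-\x7F] is used here as an assumed stand-in for [:^ascii:], and the function name is hypothetical.

```python
import re

# Assumption: [^\x00-\x7F] plays the role of DEELX's [:^ascii:];
# Python's re module does not support POSIX character classes.
NON_ASCII = re.compile(r"[^\x00-\x7F]")


def subject_has_non_ascii(subject: str) -> bool:
    """Return True as soon as one character above code point 127 is found."""
    return NON_ASCII.search(subject) is not None


print(subject_has_non_ascii("New: Manage Avira Antivirus With O&O Syspectr"))
print(subject_has_non_ascii("Acme\u00AE Widgets"))
print(subject_has_non_ascii("\u7279\u4ef7\u4f18\u60e0"))
```

Because search() stops at the first hit, no quantifier such as + or .* is needed for a simple "containing" rule.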
rusticdog
Firetrust Monkey
Posts: 15864
Joined: Mon Jun 13, 2005 6:27 pm

Re: Regex help request: Filter Only English characters.

Sat Dec 08, 2018 2:40 pm

Yeah, that program does do some odd highlighting that gives you that impression. Fingers crossed MW works as expected :)
DanAtCCD
Travelling Tuatara
Posts: 26
Joined: Wed Aug 24, 2016 4:39 am

Re: Regex help request: Filter Only English characters.

Tue Dec 11, 2018 5:35 am

[:^ascii:] didn't catch a \u00AE (decimal 174) (https://unicode-table.com/en/#00AE). Looks like I'll have to deal with that one, and with \u00A9 (decimal 169) (https://unicode-table.com/en/#00A9), separately. I haven't gotten a glut of foreign subject lines this weekend like I usually do, so I haven't been able to compare against the regular corpus of foreign spam.
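
A quick way to inspect those two characters is to print their code points and byte encodings in Python. This is only a diagnostic sketch with a made-up helper name. It does not explain what DEELX or MailWasher actually sees, since I don't know whether the rule runs on decoded text or on raw bytes.

```python
def describe(ch: str) -> tuple:
    """Return (code point, Latin-1 bytes, UTF-8 bytes) for one character."""
    return ord(ch), ch.encode("latin-1"), ch.encode("utf-8")


for ch in ("\u00AE", "\u00A9"):
    code_point, latin1, utf8 = describe(ch)
    print(f"U+{code_point:04X} decimal={code_point} latin-1={latin1!r} utf-8={utf8!r}")
```

Both code points are above 127, so they fall outside 7-bit ASCII. Each is a single byte in Latin-1 and two bytes in UTF-8, which may matter if the engine compares bytes rather than characters.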
