The corpus of junk emails

laohong · 2007-04-19

A linguistic investigation of junk emails

Purpose of the project
The goal of this project is to investigate the linguistic features of junk emails and, possibly, to design a junk email filter based on linguistic information rather than on a "bag-of-words" approach.

People
Ramesh Krishnamurthy
Constantin Orasan

Resources
The corpus of junk emails can be downloaded from here. This corpus consists of 1563 messages received by us in the last few years, but they are not necessarily unique messages.

Because we are interested in the linguistic features of junk emails, we thought it would be better to eliminate duplicates. The corpus without duplicates can be downloaded from here. Duplicates were eliminated automatically, and the procedure matched not only identical messages but also messages that differ only in small formatting details. More details about the method will be available in our forthcoming paper at LREC2002: "A corpus-based investigation of junk emails".
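The page does not describe the actual deduplication procedure (that is left to the paper), so the following is only a minimal sketch of one common way to treat messages with small formatting differences as duplicates. It assumes that "formatting differences" means letter case and whitespace; the function names normalise and deduplicate are hypothetical and not taken from the project.

```python
import hashlib
import re


def normalise(message: str) -> str:
    """Reduce a message to a canonical form (assumed: case and whitespace only)."""
    text = message.lower()
    # Collapse every run of spaces, tabs and newlines into a single space.
    text = re.sub(r"\s+", " ", text)
    return text.strip()


def deduplicate(messages):
    """Keep the first message seen for each normalised form, preserving order."""
    seen = set()
    unique = []
    for message in messages:
        # Hash the normalised text so the set stores short keys, not full bodies.
        key = hashlib.sha1(normalise(message).encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(message)
    return unique
```

For example, deduplicate(["Buy now!\n\nCheap", "buy  now! cheap", "Other"]) keeps the first and third messages. A real procedure might also ignore headers or HTML markup, which this sketch does not attempt.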

A frequency list generated from the corpus without duplicates can be downloaded, as well as a lemmatised list.
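As a rough illustration of what a frequency list is, the sketch below counts word forms across a list of messages. The tokenisation rule (runs of ASCII letters and apostrophes, lowercased) and the function name frequency_list are assumptions made for the example, not the settings used to build the downloadable lists. A lemmatised list would additionally map each word form to its lemma before counting, which requires a lemmatiser and is not shown here.

```python
import re
from collections import Counter


def frequency_list(messages):
    """Return (word, count) pairs sorted by descending count."""
    counts = Counter()
    for message in messages:
        # Assumed tokenisation: lowercase runs of letters and apostrophes.
        counts.update(re.findall(r"[a-z']+", message.lower()))
    return counts.most_common()
```

Calling frequency_list on the deduplicated messages rather than on the full corpus avoids inflating the counts of words that occur in repeated mailings.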

Papers
C. Orasan and R. Krishnamurthy (2002) "A corpus-based investigation of junk emails", In Proceedings of Language Resources and Evaluation Conference (LREC-2002), Las Palmas, Spain (pdf)

Other resources
Corpus of emails including junk mail built by Ion Androutsopoulos
Extensive information about lawsuits, news and opinions about junk emails can be found at www.junkemail.org and spam.abuse.net

http://clg.wlv.ac.uk/projects/junk-email/
