Malware detection is an important problem today. New
malware appears every day, and detecting it requires
recognizing the families of existing malware. Data mining
techniques are helpful in this context; in particular,
unsupervised learning methods are well suited to the task.
This work presents a comparison of the
behaviour of two representations for malware
executables, a set of twelve distances for comparing them,
and three variants of the hierarchical agglomerative
clustering algorithm when used to capture the structure of
different malware families and subfamilies. We propose a
way to carry out this comparison in an unsupervised
learning setting. Several conclusions can be drawn from
the work. Concerning the algorithms, the best option is
average-linkage, which seems to capture the structure
represented by the distance better than the other variants.
The evaluation of the distances is more complex, but some
of them can be discarded because they behave clearly
worse than the rest, and the group of best-behaving
distances can be identified; the computational cost
analysis can then help in selecting the most convenient
one.
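
To illustrate how linkage variants of hierarchical agglomerative clustering differ, the following is a minimal sketch and not the implementation used in this work. The abstract names only average-linkage; single and complete linkage are assumed here as the other two variants. The toy one-dimensional points and the absolute-difference distance are placeholders and are not among the twelve distances evaluated. The function name `agglomerative` and its parameters are hypothetical.

```python
from itertools import combinations


def agglomerative(dist, k, linkage="average"):
    """Merge clusters bottom-up until only k clusters remain.

    dist    -- symmetric matrix (list of lists) of pairwise distances
    k       -- number of clusters to stop at
    linkage -- 'single', 'complete' or 'average' (assumed variants)
    """
    reducers = {
        "single": min,        # distance between the closest members
        "complete": max,      # distance between the farthest members
        "average": lambda values: sum(values) / len(values),
    }
    reduce_fn = reducers[linkage]
    clusters = [[i] for i in range(len(dist))]  # one cluster per item

    def cluster_distance(a, b):
        # Combine all cross-cluster pairwise distances with the linkage rule.
        return reduce_fn([dist[i][j] for i in a for j in b])

    while len(clusters) > k:
        # Pick the pair of clusters that is closest under the linkage rule.
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: cluster_distance(clusters[p[0]], clusters[p[1]]),
        )
        # combinations() yields a < b, so deleting index b is safe.
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return sorted(sorted(c) for c in clusters)


# Toy data: placeholder values standing in for malware samples.
points = [0.0, 1.0, 2.0, 10.0, 11.0]
dist = [[abs(p - q) for q in points] for p in points]

for name in ("single", "complete", "average"):
    print(name, agglomerative(dist, 2, name))
```

On data this well separated, all three variants recover the same two groups; differences between linkages appear when clusters are elongated, overlapping, or of uneven size, which is the kind of behaviour a comparison across distances and algorithms has to measure.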
Cite as: Gurrutxaga, I., Arbelaitz, O., Ma Perez, J., Muguerza, J., Martin, J.I. and Perona, I. (2008). Evaluation of Malware clustering based on its dynamic behaviour. In Proc. Seventh Australasian Data Mining Conference (AusDM 2008), Glenelg, South Australia. CRPIT, 87. Roddick, J. F., Li, J., Christen, P. and Kennedy, P. J., Eds. ACS. 163-170.