Intrusion Detection

    Intrusion Detection Systems, Columbia, Fall 2009

      Created and analyzed user behavior models based on captured network packets using Nmap, Snort, tcpdump and R


Project Report

1. Introduction

1.1 Tools

I use tcpdump to capture the data, sed to convert it to a tabular format, and awk and MATLAB for the statistical analysis.


1.2 Model

Most of my traffic uses ports 80 and 443, and very few connections use other services. My model is therefore based on the distribution of the per-minute packet count for port pairs that involve neither port 80 nor port 443.

2. Data Capture and Preparation


2.1 Capture Raw Traffic

Traffic was captured with tcpdump from 10/7/2009 to 10/11/2009 and written to the file normal.log with the script:


sudo tcpdump -w normal.log




The resulting normal.log file is 3.2 GB.

Figure: hour (y axis) versus day (x axis) for the continuously collected data.

In total, a time span of 105 hours is covered.

2.2 Extract TCP Record



Only TCP data is of interest, so the TCP records are extracted with the script:


tcpdump -r normal.log -n tcp > normal_tcp.txt


The content of normal_tcp.txt looks like:


05:51:10.023276 IP 66.249.81.91.80 > 160.39.187.145.55179: Flags [.], seq 371608:373026, ack 906, win 117, options [nop,nop,TS val 3022862199 ecr 2076061], length 1418

05:51:10.023288 IP 160.39.187.145.55179 > 66.249.81.91.80: Flags [.], ack 373026, win 2672, options [nop,nop,TS val 2076067 ecr 3022862199], length 0

05:51:10.024792 IP 66.249.81.91.80 > 160.39.187.145.55179: Flags [.], seq 373026:374444, ack 906, win 117, options [nop,nop,TS val 3022862199 ecr 2076061], length 1418

05:51:10.024803 IP 160.39.187.145.55179 > 66.249.81.91.80: Flags [.], ack 374444, win 2718, options [nop,nop,TS val 2076068 ecr 3022862199], length 0

05:51:10.026046 IP 66.249.81.91.80 > 160.39.187.145.55179: Flags [.], seq 374444:375862, ack 906, win 117, options [nop,nop,TS val 3022862199 ecr 2076061], length 1418

05:51:10.026057 IP 160.39.187.145.55179 > 66.249.81.91.80: Flags [.], ack 375862, win 2741, options [nop,nop,TS val 2076068 ecr 3022862199], length 0

05:51:10.027297 IP 66.249.81.91.80 > 160.39.187.145.55179: Flags [.], seq 375862:377280, ack 906, win 117, options [nop,nop,TS val 3022862199 ecr 2076061], length 1418

05:51:10.028893 IP 66.249.81.91.80 > 160.39.187.145.55179: Flags [.], seq 377280:378698, ack 906, win 117, options [nop,nop,TS val 3022862199 ecr 2076061], length 1418

05:51:10.028905 IP 160.39.187.145.55179 > 66.249.81.91.80: Flags [.], ack 378698, win 2741, options [nop,nop,TS val 2076068 ecr 3022862199], length 0

05:51:10.029893 IP 66.249.81.91.80 > 160.39.187.145.55179: Flags [.], seq 378698:380116, ack 906, win 117, options [nop,nop,TS val 3022862199 ecr 2076061], length 1418

05:51:10.031496 IP 66.249.81.91.80 > 160.39.187.145.55179: Flags [.], seq 380116:381534, ack 906, win 117, options [nop,nop,TS val 3022862199 ecr 2076061], length 1418

05:51:10.031507 IP 160.39.187.145.55179 > 66.249.81.91.80: Flags [.], ack 381534, win 2741, options [nop,nop,TS val 2076068 ecr 3022862199], length 0

05:51:10.032747 IP 66.249.81.91.80 > 160.39.187.145.55179: Flags [.], seq 381534:382952, ack 906, win 117, options [nop,nop,TS val 3022862199 ecr 2076061], length 1418

05:51:10.033994 IP 66.249.81.91.80 > 160.39.187.145.55179: Flags [.], seq 382952:384370, ack 906, win 117, options [nop,nop,TS val 3022862199 ecr 2076061], length 1418

05:51:10.034006 IP 160.39.187.145.55179 > 66.249.81.91.80: Flags [.], ack 384370, win 2741, options [nop,nop,TS val 2076069 ecr 3022862199], length 0

05:51:10.035245 IP 66.249.81.91.80 > 160.39.187.145.55179: Flags [.], seq 384370:385788, ack 906, win 117, options [nop,nop,TS val 3022862199 ecr 2076061], length 1418



Each line represents one TCP packet, with its timestamp, source and destination IP address and port, packet length, and so on.


My IP address is 160.39.187.145.


2.3 Simplification of the Tabular Record

sed is used to reduce each record to the following format, with 9 space-separated fields per line:


1 2 3 4 5 6 7 8 9

hour min sec fract srcIP srcPort destIP destPort length


Script:


sed -e 's/Flags.*length\ /length/' -e 's/\([0-9]*\)\:\([0-9]*\)\:\([0-9]*\)\./\1 \2 \3 /' -e 's/IP \([0-9]*\.[0-9]*\.[0-9]*\.[0-9]*\)\.\([0-9]*\)/\1 \2/' -e 's/> \([0-9]*\.[0-9]*\.[0-9]*\.[0-9]*\)\.\([0-9]*\)/\1 \2/' -e 's/\: length/ /' < normal_tcp.txt > normal_tcp.csv


The content of normal_tcp.csv looks like:


05 59 59 815133 160.39.187.145 33390 72.32.58.166 80 585

05 59 59 819793 72.32.58.166 80 160.39.187.145 33393 0

05 59 59 820414 72.32.58.166 80 160.39.187.145 33394 289

05 59 59 820427 160.39.187.145 33394 72.32.58.166 80 0
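The same field extraction can be sketched in Python (used here only for illustration; the report itself uses sed). This is a minimal sketch that assumes every input line has the layout of the tcpdump sample in section 2.2; the names LINE_RE and simplify are hypothetical.

```python
import re

# One line of `tcpdump -n tcp` text output, as in the sample in section 2.2:
# time, "IP", source address.port, ">", destination address.port, ..., length.
LINE_RE = re.compile(
    r"^(\d+):(\d+):(\d+)\.(\d+) IP "
    r"(\d+\.\d+\.\d+\.\d+)\.(\d+) > "
    r"(\d+\.\d+\.\d+\.\d+)\.(\d+): "
    r".*length (\d+)"
)


def simplify(line):
    """Return the 9 space-separated fields, or None if the line does not match."""
    m = LINE_RE.match(line)
    return " ".join(m.groups()) if m else None


sample = ("05:51:10.023288 IP 160.39.187.145.55179 > 66.249.81.91.80: "
          "Flags [.], ack 373026, win 2672, "
          "options [nop,nop,TS val 2076067 ecr 3022862199], length 0")
print(simplify(sample))
```

Unlike the sed pipeline, which passes unmatched lines through unchanged, this sketch returns None for them so that malformed records can be dropped explicitly.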




2.4 Port Pair Statistics

A port pair is represented by field 6 (source port) and field 8 (destination port).

Script:

awk '{pp[$6"_"$8]++} END {for (i in pp) print pp[i],i}' < normal_tcp.csv | sort -n -r > pp_bi.csv


pp_bi.csv then holds the port pair counts, sorted in descending order. The first few lines are:


150279 80_35327

136853 80_42756

78719 35327_80

75342 42756_80

57054 80_39498


They are all port pairs involving port 80, which is expected because port 80 is the web (HTTP) service.
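The port pair count can also be sketched in Python (for illustration only; the report uses awk). The sketch assumes 9-field records as in section 2.3. The first four records are the sample rows shown there and the last one is invented so that one pair occurs twice. The name port_pair_counts is hypothetical.

```python
from collections import Counter


def port_pair_counts(lines):
    """Count packets per 'srcPort_destPort' key from 9-field records."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) != 9:
            continue  # skip malformed records
        counts[fields[5] + "_" + fields[7]] += 1  # fields 6 and 8 (1-based)
    return counts


records = [
    "05 59 59 815133 160.39.187.145 33390 72.32.58.166 80 585",
    "05 59 59 819793 72.32.58.166 80 160.39.187.145 33393 0",
    "05 59 59 820414 72.32.58.166 80 160.39.187.145 33394 289",
    "05 59 59 820427 160.39.187.145 33394 72.32.58.166 80 0",
    "05 59 59 820500 72.32.58.166 80 160.39.187.145 33394 100",  # invented row
]
# Same layout as pp_bi.csv: count first, then the port pair.
for pair, count in port_pair_counts(records).most_common():
    print(count, pair)
```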


Script to exclude port 80:


sed -e '/80\_/d' -e '/\_80/d' < pp_bi.csv > pp_bi_no_80.csv

The first few lines of pp_bi_no_80.csv are:

4657 443_40902

3486 443_44775

3473 44775_443

3446 59126_38133

3096 443_42330


Port 443 is used for HTTPS (HTTP over SSL/TLS).


Script to exclude port 443:

sed -e '/443\_/d' -e '/\_443/d' < pp_bi_no_80.csv > pp_bi_no_80_443.csv

The first few lines of pp_bi_no_80_443.csv are:

3446 59126_38133

2846 1935_57089

1385 38133_59126

950 1935_50308

854 1935_35101

823 19246_49092


Then a script counts, for each non-ephemeral port, the number of port pairs and the total number of packets:


count.sh

ports="21 22 80 90 110 143 443 1080 1935"

# port as source: number of port pairs and total packets

for p in $ports; do

grep " ${p}_" pp_bi.csv | awk '{packets_count+=$1} END {print NR, "\t" ,packets_count}'

done


echo


# port as destination: number of port pairs and total packets

for p in $ports; do

grep "_${p}\$" pp_bi.csv | awk '{packets_count+=$1} END {print NR, "\t" ,packets_count}'

done


echo


sed -e '/ 21_/d' -e '/ 22_/d' -e '/ 80_/d' -e '/ 90_/d' -e '/ 110_/d' -e '/ 139_/d' -e '/ 143_/d' -e '/ 443_/d' -e '/ 993_/d' -e '/ 1080_/d' -e '/ 1935_/d' -e '/_21$/d' -e '/_22$/d' -e '/_80$/d' -e '/_90$/d' -e '/_110$/d' -e '/_139$/d' -e '/_143$/d' -e '/_443$/d' -e '/_993$/d' -e '/_1080$/d' -e '/_1935$/d' -e '/\:$/d' < pp_bi.csv > temp.txt


awk '{packets_count+=$1} END {print NR, "\t" ,packets_count}' < temp.txt
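What count.sh computes per port can be sketched in Python (for illustration only): for each service port, the number of distinct port pairs and the total packet count, once with the port as source and once as destination. The counts in the example are taken from the pp_bi.csv excerpts above. The name summarize is hypothetical.

```python
def summarize(pair_counts, ports):
    """For each port, return (#pairs, #packets) as source and as destination."""
    summary = {}
    for port in ports:
        key = str(port)
        # Exact match on each side of the pair, like grep " 80_" and grep "_80$".
        src = [n for pair, n in pair_counts.items() if pair.split("_")[0] == key]
        dst = [n for pair, n in pair_counts.items() if pair.split("_")[1] == key]
        summary[port] = {"src": (len(src), sum(src)), "dst": (len(dst), sum(dst))}
    return summary


# Counts copied from the pp_bi.csv excerpts shown in this section.
pair_counts = {
    "80_35327": 150279, "80_42756": 136853, "35327_80": 78719,
    "42756_80": 75342, "80_39498": 57054, "443_40902": 4657,
    "443_44775": 3486, "44775_443": 3473, "59126_38133": 3446,
}
result = summarize(pair_counts, [80, 443])
print(result[80])
print(result[443])
```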


2.5 Ports and Services Summary

Identified ports and services [1] are:

22/TCP,UDP Secure Shell (SSH), used for secure logins, file transfers (scp, sftp) and port forwarding

80/TCP,UDP Hypertext Transfer Protocol (HTTP)

90/TCP,UDP dnsix (DoD Network Security for Information Exchange) Security Attribute Token Map

90/TCP,UDP PointCast (unofficial)

110/TCP Post Office Protocol 3 (POP3)

143/TCP,UDP Internet Message Access Protocol (IMAP), used for retrieving, organizing, and synchronizing e-mail messages

443/TCP,UDP Hypertext Transfer Protocol over TLS/SSL (HTTPS)

1080/TCP SOCKS proxy

1935/TCP Adobe Systems Macromedia Flash Real Time Messaging Protocol (RTMP) "plain" protocol




Traffic on ports 80 and 443 is almost always paired with an ephemeral port [2].


The ephemeral ports are evenly distributed.

A small number of connections pair port 80 with a lower-numbered port (< 32768). In the two lists below, the first column is the number of packets, followed by the source port and the destination port.

80 > *

4 80 32731

2 80 2239

2 80 11744

1 80 594

1 80 418

1 80 23953


* > 80

5 32731 80

2 418 80

2 2239 80

2 11744 80

1 594 80

1 23953 80


I could not identify the usage of these ports [1].

All ports associated with 443 are > 32768.
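The selection of low-numbered peers of a service port can be sketched in Python (for illustration only). The 32768 threshold is the one used in this report; the actual ephemeral range depends on the operating system. The example counts are a small subset taken from this section, and the name low_port_peers is hypothetical.

```python
EPHEMERAL_MIN = 32768  # threshold used in this report; OS-dependent in general


def low_port_peers(pair_counts, service_port, threshold=EPHEMERAL_MIN):
    """Return {other_port: packets} for pairs 'service_port > other_port'
    whose other port is below the ephemeral threshold."""
    peers = {}
    for pair, packets in pair_counts.items():
        src, dst = (int(p) for p in pair.split("_"))
        if src == service_port and dst < threshold:
            peers[dst] = peers.get(dst, 0) + packets
    return peers


pair_counts = {
    "80_35327": 150279,  # ephemeral peer, ignored
    "80_32731": 4,
    "80_2239": 2,
    "32731_80": 5,       # opposite direction, ignored here
    "443_40902": 4657,
}
print(low_port_peers(pair_counts, 80))
```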

2.5.1 Overall Port Pair Table


The table shows the overall port pair statistics.

"other" denotes an ephemeral port.

2.5.2 Overall Port Pair Diagram

Panels: # of port pairs, # of packets.

The diagram shows bar charts of the absolute amount and of the percentage.

As denoted in the table, blue bars represent the pattern port_* (data going out of the port) and red bars represent *_port (data going into the port).

2.5.3 Overall Packet Size Distribution

Histograms of the total and per-day statistics are shown.

Most packets are smaller than 2000 bytes, while a few are very large, up to 10^4 bytes.



2.5.4 Overall Traffic Rate and Volume


The x axis is the hour. The temporal behavior of the traffic rate (packet count) matches that of the volume (sum of packet sizes).


The four peaks should correspond to my periods of intensive Internet use each day.
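The hourly rate and volume can be sketched in Python (for illustration only; the report computes them in MATLAB, see Appendix A). The sketch assumes time-ordered 9-field records and starts a new bucket whenever the hour field changes. The last record is invented to show a second hour, and the name hourly_rate_volume is hypothetical.

```python
from itertools import groupby


def hourly_rate_volume(records):
    """Group consecutive records by their hour field.

    Returns a list of (hour, packet_count, total_bytes) tuples.
    """
    result = []
    rows = (r.split() for r in records)
    for hour, group in groupby(rows, key=lambda f: f[0]):
        sizes = [int(f[8]) for f in group]  # field 9: packet length
        result.append((hour, len(sizes), sum(sizes)))
    return result


records = [
    "05 59 59 815133 160.39.187.145 33390 72.32.58.166 80 585",
    "05 59 59 819793 72.32.58.166 80 160.39.187.145 33393 0",
    "05 59 59 820414 72.32.58.166 80 160.39.187.145 33394 289",
    "06 00 00 000100 160.39.187.145 33394 72.32.58.166 80 100",  # invented row
]
for hour, rate, volume in hourly_rate_volume(records):
    print(hour, rate, volume)
```

Because only consecutive records are grouped, the same hour on different days produces separate buckets, which is what a time series over the whole capture needs.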


3. Inbound Traffic

The same analysis as in section 2 is performed on the inbound data, which is selected by keeping the rows whose 7th field (destIP) equals 160.39.187.145 (my IP).


in.sh

awk '$7=="160.39.187.145"' < normal_tcp.csv > in.txt

cat in.txt | awk '{pp[$6"_"$8]++} END {for (i in pp) print pp[i],i}' | sort -n -r > pp_bi.csv
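The direction filter, both the inbound one here and the outbound one in section 4, can be sketched in Python (for illustration only). The records are the sample rows from section 2.3, and the name split_by_direction is hypothetical.

```python
MY_IP = "160.39.187.145"  # the monitored host, as stated in section 2.2


def split_by_direction(records, my_ip=MY_IP):
    """Return (inbound, outbound) lists of 9-field records."""
    inbound, outbound = [], []
    for line in records:
        fields = line.split()
        if len(fields) != 9:
            continue  # skip malformed records
        if fields[6] == my_ip:      # field 7: destIP
            inbound.append(line)
        elif fields[4] == my_ip:    # field 5: srcIP
            outbound.append(line)
    return inbound, outbound


records = [
    "05 59 59 815133 160.39.187.145 33390 72.32.58.166 80 585",
    "05 59 59 819793 72.32.58.166 80 160.39.187.145 33393 0",
    "05 59 59 820414 72.32.58.166 80 160.39.187.145 33394 289",
    "05 59 59 820427 160.39.187.145 33394 72.32.58.166 80 0",
]
inbound, outbound = split_by_direction(records)
print(len(inbound), len(outbound))
```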



3.1 Inbound Port Pair Table



3.2 Inbound Port Pair Diagram





Panels: # of port pairs, # of packets.

Ports 80 and 443 stand out. Blue bars (for inbound) dominate.

3.3 Inbound Packet Size Distribution

Histograms of the total and per-day statistics are shown.

The inbound histogram is very different from the overall histogram. There are no larger packets, and most inbound packets are exactly 1387 or 73 bytes.

3.4 Inbound Traffic Rate and Volume


The x axis is the hour. The temporal behavior of the traffic rate (packet count) matches that of the volume (sum of packet sizes), as in the overall diagram in section 2.5.4.


4. Outbound Traffic

The same analysis as in section 2 is performed on the outbound data, which is selected by keeping the rows whose 5th field (srcIP) equals 160.39.187.145 (my IP).

out.sh

awk '$5=="160.39.187.145"' < normal_tcp.csv > out.txt

cat out.txt | awk '{pp[$6"_"$8]++} END {for (i in pp) print pp[i],i}' | sort -n -r > pp_bi.csv

4.1 Outbound Port Pair Table



4.2 Outbound Port Pair Diagram




Panels: # of port pairs, # of packets.


Ports 80 and 443 stand out. Red bars (for outbound) dominate.


4.3 Outbound Packet Size Distribution

Histograms of the total and per-day statistics are shown.

The outbound histogram is similar to the overall histogram, so the larger packets come from outbound traffic.

4.4 Outbound Traffic Rate and Volume


The x axis is the hour. The temporal behavior of the traffic rate (packet count) does not match that of the volume (sum of packet sizes). This is consistent with the uneven outbound packet size distribution: some large packets are going out.






5. Anomaly Detector

Most of my traffic uses ports 80 and 443, and very few connections use other services. My model is therefore based on the distribution of the per-minute packet count for port pairs that involve neither port 80 nor port 443.
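The per-minute count behind this model can be sketched in Python (for illustration only; the script that produced the Appendix B data is not shown in the report). The key is the hour concatenated with the minute, as in Appendix B. It has no day component, so the sketch assumes records from a single day. The records with address 192.0.2.10 are invented, and the name per_minute_counts is hypothetical.

```python
from collections import Counter

EXCLUDED = {"80", "443"}  # ports left out of the model


def per_minute_counts(records, excluded=EXCLUDED):
    """Count packets per minute for port pairs that touch no excluded port."""
    counts = Counter()
    for line in records:
        f = line.split()
        if len(f) != 9:
            continue  # skip malformed records
        if f[5] in excluded or f[7] in excluded:
            continue  # source or destination port is 80 or 443
        counts[f[0] + f[1]] += 1  # key such as "0559"
    return counts


records = [
    "05 59 59 815133 160.39.187.145 33390 72.32.58.166 80 585",   # port 80, skipped
    "05 59 59 900000 192.0.2.10 1935 160.39.187.145 57089 120",   # invented row
    "05 59 59 950000 192.0.2.10 1935 160.39.187.145 57089 80",    # invented row
    "06 00 01 000100 160.39.187.145 57089 192.0.2.10 1935 0",     # invented row
]
for minute, count in sorted(per_minute_counts(records).items()):
    print(minute, count)
```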


5.1 Distribution of Packets # per Minute for non 80, 443 Ports

The distribution is generated from the data in Appendix B.

Most counts fall below 500 per minute. The mean and standard deviation are:

m_m = 78.3352

m_std = 440.7354
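With these two parameters, the detector's score for a one-minute sample is the value of the normal density at that sample's packet count. A minimal Python sketch follows (for illustration only; the report uses MATLAB's pdf('norm', ...) in Appendix A). The name normal_pdf is hypothetical.

```python
import math

MEAN = 78.3352   # m_m reported in this section
STD = 440.7354   # m_std reported in this section


def normal_pdf(x, mean=MEAN, std=STD):
    """Density of the normal distribution N(mean, std^2) at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


print(normal_pdf(2))      # a quiet minute from Appendix B
print(normal_pdf(1315))   # one of the smaller scan samples in Appendix C
print(normal_pdf(22965))  # the first scan sample in Appendix C
```

Counts far from the mean get a density that is orders of magnitude smaller, and for the largest scan counts the value underflows to zero in double precision. This is why a low pdf threshold separates the two data sets.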



6. Testing Detector

6.1 Normal and Scan Data Set

I performed Zenmap scans; the observed data is in Appendix C.


The probability density function (pdf) value is plotted for the normal data set and the scan data set.

The x axis is the one-minute sample index and the y axis is the pdf value.

There is a significant difference between the two data sets: the normal data set has much larger pdf values.





6.2 ROC Diagram

By varying the pdf threshold used to classify the two data sets, the ROC diagram is plotted:

The curve approaches 1 very quickly because the two data sets are significantly different.
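The threshold sweep can be sketched in Python (for illustration only; the MATLAB version is in Appendix A). A sample is flagged as a scan when its pdf value is below the threshold. The counts are small subsets of Appendix B and Appendix C, and the three thresholds are chosen for the example. The names normal_pdf and roc_points are hypothetical.

```python
import math

MEAN, STD = 78.3352, 440.7354  # m_m and m_std from section 5.1


def normal_pdf(x, mean=MEAN, std=STD):
    """Density of the normal distribution N(mean, std^2) at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def roc_points(normal_scores, scan_scores, thresholds):
    """Return one (FPR, TPR) point per threshold; score < threshold means 'scan'."""
    points = []
    for t in thresholds:
        tp = sum(s < t for s in scan_scores)      # scans flagged
        fn = len(scan_scores) - tp                # scans missed
        fp = sum(s < t for s in normal_scores)    # normal minutes flagged
        tn = len(normal_scores) - fp              # normal minutes passed
        points.append((fp / (fp + tn), tp / (tp + fn)))
    return points


normal_counts = [139, 74, 18, 30, 5674]   # subset of Appendix B
scan_counts = [22965, 5660, 1286, 1315]   # subset of Appendix C
p_m = [normal_pdf(c) for c in normal_counts]
p_scan = [normal_pdf(c) for c in scan_counts]

for fpr, tpr in roc_points(p_m, p_scan, [1e-8, 1e-4, 1e-3]):
    print(fpr, tpr)
```

In this subset, the one normal minute with 5674 packets is flagged at every threshold, which illustrates how a single heavy but legitimate minute shows up as a false positive.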

References

1. List of TCP and UDP port numbers, http://en.wikipedia.org/wiki/List_of_TCP_and_UDP_port_numbers

2. Ephemeral port, http://en.wikipedia.org/wiki/Ephemeral_port

Appendix A: MATLAB Code

%% model.m, Main entry point of program

prepare_data;

% 1 2 3 4 5 6 7 8

% hour min sec srcIP srcPort destIP destPort length

%% hour - day

plot(n(:,1));

%%

plot_rate_volume(n,201,'Overall');

plot_rate_volume(n_in,202,'Inbound');

plot_rate_volume(n_out,203, 'Outbound');

%%

plot_packet_size;

%%

ports = [21,22,90,110,139,143,993,1080,1935];

for i=1:length(ports)

inds{i} = filter_port(n,ports(i), 0);

end

%% 80, 443 distribution

a80t{1} = dlmread('80_.txt');a80t{2} = dlmread('_80.txt');

a80t{3} = dlmread('443_.txt');a80t{4} = dlmread('_443.txt');

a80t_title={'80 > *','* > 80','443 > *','* > 443'};

figure(444);

for i=1:4

subplot(2,2,i);hist(a80t{i});title(a80t_title{i});

end

%% no 80 443 mincount

mc = dlmread('no80mincount.txt',' ');m = mc(:,2);hist(m);

% distribution parameter

m_m = mean(m); m_std = std(m);

%

scan_m=[

1730 22965

1731 5660

1727 8625

1729 2733

1730 1286

1731 20501

1732 16398

1728 1772

1986 1933

1987 1949

1988 2037

1989 19074

1985 28355

1987 6268

1988 21278

1989 7087

2207 16222

2203 26098

2205 7943

2206 9542

2207 19419

2208 16317

2204 1315

2334 8942

2335 116879

2336 229549

2337 109804

2333 1315

];

%

p_scan = pdf('norm', scan_m(:,2) ,m_m ,m_std);

p_m = pdf('norm', m ,m_m ,m_std);

%%

subplot(211);plot(p_m);title('normal data pdf');

subplot(212);plot(p_scan);title('scan data pdf');

%% RoC

p = 10.^([-8,-7.5,-7,-6.5,-6,-5.8,-5.5,-5,-4,-3,-2,-1]);x=[];y=[];

for i=1:length(p)

[TP, TN, FP, FN] = calc_roc (p(i),p_m,p_scan);

TPR = TP / (TP + FN);FPR = FP / (FP + TN);

x=[x;FPR];y=[y;TPR];

end

plot(x,y,'.-');title('ROC');xlabel('FPR');ylabel('TPR');



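function [TP,TN,FP,FN]=calc_roc(p,p_m,p_scan)

% a sample is classified as a scan (positive) when its pdf value is below the threshold p

TP=sum(p_scan<p);   % scan samples flagged as scans

FN=sum(p_scan>=p);  % scan samples missed

FP=sum(p_m<p);      % normal samples flagged as scans

TN=sum(p_m>=p);     % normal samples passed

end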



%% prepare_data.m, read n

%clear;n = dlmread('data.csv', ' ');save;

% 1 2 3 4 5 6 7 8

% hour min sec srcIP srcPort destIP destPort length

%% load n

clear; load;

%%

n_out = n(find(n(:,4)),:);

n_in = n(find(n(:,6)),:);

split = find_split(n);

split_in = find_split(n_in);

split_out = find_split(n_out);

%% dn

for i=1:5

dn{i}=get_data_from_day(n,i,split);

dn_in{i}=get_data_from_day(n_in,i,split_in);

dn_out{i}=get_data_from_day(n_out,i,split_out);

end



%% plot_packet_size.m, Packet Size Distribution

%% n

figure(101);

subplot(6,1,1);hist(n(:,8));title('Overall total','Color','b');

%

for i=1:5

subplot(6,1,i+1);hist(dn{i}(:,8));title(['day ',num2str(i)]);

end

%% n_in

figure(102);

subplot(6,1,1);hist(n_in(:,8));title('Inbound total','Color','b');

%

for i=1:5

subplot(6,1,i+1);hist(dn_in{i}(:,8));title(['day ',num2str(i)]);

end

%% n_out

figure(103);

subplot(6,1,1);hist(n_out(:,8));title('Outbound total','Color','b');

%

for i=1:5

subplot(6,1,i+1);hist(dn_out{i}(:,8));title(['day ',num2str(i)]);

end

%% 6 * 3

figure(1000);

subplot(6,3,1);hist(n(:,8));title('Overall total','Color','b');

subplot(6,3,2);hist(n_in(:,8));title('Inbound total','Color','b');

subplot(6,3,3);hist(n_out(:,8));title('Outbound total','Color','b');

%

for i=1:5

subplot(6,3,3*i+1);hist(dn{i}(:,8));title(['day ',num2str(i)]);

subplot(6,3,3*i+2);hist(dn_in{i}(:,8));title(['day ',num2str(i)]);

subplot(6,3,3*i+3);hist(dn_out{i}(:,8));title(['day ',num2str(i)]);

end



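function split = find_split(n)

% rows split(k)+1 .. split(k+1) belong to day k (5 days assumed)

split = zeros(6,1);

split(1) = 0;

split(6) = size(n,1);

j=0;

% stop one row early so that n(i+1,1) never indexes past the last row

for i=1:size(n,1)-1

% a day boundary is an hour change from 23 to 0

if(n(i,1)==23 && n(i+1,1)==0)

j=j+1;

split(j+1)= i;

end

end

end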



function data = get_data_from_day(n,day,split)

data = n(split(day)+1:split(day+1),:);

end


%% plot_rate_volume.m, Rate and Volume

function plot_rate_volume(n,g,t)

[h,s] = get_hours(n);

for i=1:length(h)

data{i} = get_data_from_split(n,i,s);

rate(i) = length(data{i});

volume(i) = sum(data{i}(:,8));

end

%%

figure(g)

subplot(2,1,1);plot(1:length(h),rate);title([t,' Rate (hourly packet #)']);

subplot(2,1,2);plot(1:length(h),volume);title([t,' Volume (hourly packet size total)']);

end



function [hours, split]= get_hours(n)

hours(1) = n(1,1);

split(1) = 0;

for i=1:length(n)

read_next = n(i,1);

if hours(length(hours)) ~= read_next

hours = [hours; read_next];

split = [split; i-1];

end

end

split=[split;length(n)];

end





Appendix B:

Non 80,443 Port per Minute Packet Count on Normal Data

First column is hour concatenated with minute, second column is the packet count in that minute.

This data is used to fit the detector's Gaussian distribution, that is, to find its mean and standard deviation.

1614 139

0053 74

0742 18

1630 30

0055 86

0743 13

1619 28

0745 40

1633 10

1406 6

1847 4

1848 1

1409 2

2021 11

0110 21

2237 1

2024 4

2252 2

2025 1

1655 2

1902 6

0113 2

0555 2

0342 19

0803 22

1905 2

0804 2

0116 9

0117 11

1659 83

0130 43

0131 64

0118 44

0806 30

0346 119

0119 60

0821 2

0133 130

0808 122

0134 46

1710 369

0135 38

1711 491

1250 2

0824 2

0136 39

1928 2

1731 184

1945 2

1719 2

0631 2

0405 6

2331 74

2319 2

0420 441

1735 6

1736 3

0408 2

2121 2

1750 4

1751 6

0651 2

2351 15

0440 1

1314 6

2140 2

2141 3

1758 2

2143 2

1759 6

1332 2

0446 28

2146 3

2147 8

0449 16

0236 1

0237 4

0238 4

1352 4

0927 2

1129 2

0502 18

1603 2

2201 7

2203 60

0505 235

0507 50

2206 6

1835 4

0508 35

0509 5

1838 6

1625 154

0753 11

0101 2

0102 53

1418 2

0103 2

1859 144

2017 1

1647 2

2031 3

2018 4

2032 2

0106 2

0120 84

2034 4

0335 58

0121 78

0109 6

2035 3

0122 65

0810 11

0123 22

0124 100

0125 38

2052 5

0814 2

0126 28

1701 2

0600 38

0354 1

0127 106

0815 4

0601 135

0128 2

0602 54

0142 2

0129 78

0357 2

0143 2

0604 54

2304 2

0144 18

0833 4

0359 110

1724 33

2323 33

0411 151

1727 3

1513 2

0413 52

1956 2

1515 2

0641 6

0414 127

2327 11

0415 422

2328 32

0416 89

2116 5

0418 267

2130 2

0419 106

1306 2

2132 7

0434 1661

1749 6

2133 34

1535 4

0649 2

0435 5

2134 2

0450 3

0451 5

1328 2

0240 54

0915 2

1053 6

0702 2

0241 3

0703 1

2156 5674

1805 4

0459 277

1820 82

0245 7

0720 2

1058 2

0246 28

1059 2

1349 2



Appendix C:

Non 80,443 Port per Minute Packet Count on Scan Data

First column is hour concatenated with minute, second column is the packet count in that minute.

1730 22965

1731 5660

1727 8625

1729 2733

1730 1286

1731 20501

1732 16398

1728 1772

1986 1933

1987 1949

1988 2037

1989 19074

1985 28355

1987 6268

1988 21278

1989 7087

2207 16222

2203 26098

2205 7943

2206 9542

2207 19419

2208 16317

2204 1315

2334 8942

2335 116879

2336 229549

2337 109804

2333 1315