Intrusion Detection Systems, Columbia, Fall 2009
- Created and analyzed user behavior models based on captured network packets using Nmap, Snort, tcpdump and R
Project Report
1. Introduction
1.1 Tools
I use tcpdump to capture the data, sed to convert it into a tabular format, and awk and MATLAB for the statistical analysis.
1.2 Model
Most of my traffic is on ports 80 and 443, and very few connections use other services. My model is therefore based on the distribution of the packet count (per minute) associated with port pairs other than 80 and 443.
2. Data Capture and Preparation
2.1 Capture Raw Traffic
tcpdump data was captured from 10/7/2009 to 10/11/2009 and written to the file normal.log with the script:
sudo tcpdump -w normal.log

The normal.log file is 3.2 GB.
Figure: hour (y axis) versus day (x axis) plot of the collected continuous data.

In total, a time span of 105 hours is covered.
2.2 Extract TCP Records
Only TCP data is of interest, so the TCP records are extracted with the script:
tcpdump -r normal.log -n tcp > normal_tcp.txt
The content of normal_tcp.txt looks like:
05:51:10.023276 IP 66.249.81.91.80 > 160.39.187.145.55179: Flags [.], seq 371608:373026, ack 906, win 117, options [nop,nop,TS val 3022862199 ecr 2076061], length 1418
05:51:10.023288 IP 160.39.187.145.55179 > 66.249.81.91.80: Flags [.], ack 373026, win 2672, options [nop,nop,TS val 2076067 ecr 3022862199], length 0
05:51:10.024792 IP 66.249.81.91.80 > 160.39.187.145.55179: Flags [.], seq 373026:374444, ack 906, win 117, options [nop,nop,TS val 3022862199 ecr 2076061], length 1418
05:51:10.024803 IP 160.39.187.145.55179 > 66.249.81.91.80: Flags [.], ack 374444, win 2718, options [nop,nop,TS val 2076068 ecr 3022862199], length 0
05:51:10.026046 IP 66.249.81.91.80 > 160.39.187.145.55179: Flags [.], seq 374444:375862, ack 906, win 117, options [nop,nop,TS val 3022862199 ecr 2076061], length 1418
05:51:10.026057 IP 160.39.187.145.55179 > 66.249.81.91.80: Flags [.], ack 375862, win 2741, options [nop,nop,TS val 2076068 ecr 3022862199], length 0
05:51:10.027297 IP 66.249.81.91.80 > 160.39.187.145.55179: Flags [.], seq 375862:377280, ack 906, win 117, options [nop,nop,TS val 3022862199 ecr 2076061], length 1418
05:51:10.028893 IP 66.249.81.91.80 > 160.39.187.145.55179: Flags [.], seq 377280:378698, ack 906, win 117, options [nop,nop,TS val 3022862199 ecr 2076061], length 1418
05:51:10.028905 IP 160.39.187.145.55179 > 66.249.81.91.80: Flags [.], ack 378698, win 2741, options [nop,nop,TS val 2076068 ecr 3022862199], length 0
05:51:10.029893 IP 66.249.81.91.80 > 160.39.187.145.55179: Flags [.], seq 378698:380116, ack 906, win 117, options [nop,nop,TS val 3022862199 ecr 2076061], length 1418
05:51:10.031496 IP 66.249.81.91.80 > 160.39.187.145.55179: Flags [.], seq 380116:381534, ack 906, win 117, options [nop,nop,TS val 3022862199 ecr 2076061], length 1418
05:51:10.031507 IP 160.39.187.145.55179 > 66.249.81.91.80: Flags [.], ack 381534, win 2741, options [nop,nop,TS val 2076068 ecr 3022862199], length 0
05:51:10.032747 IP 66.249.81.91.80 > 160.39.187.145.55179: Flags [.], seq 381534:382952, ack 906, win 117, options [nop,nop,TS val 3022862199 ecr 2076061], length 1418
05:51:10.033994 IP 66.249.81.91.80 > 160.39.187.145.55179: Flags [.], seq 382952:384370, ack 906, win 117, options [nop,nop,TS val 3022862199 ecr 2076061], length 1418
05:51:10.034006 IP 160.39.187.145.55179 > 66.249.81.91.80: Flags [.], ack 384370, win 2741, options [nop,nop,TS val 2076069 ecr 3022862199], length 0
05:51:10.035245 IP 66.249.81.91.80 > 160.39.187.145.55179: Flags [.], seq 384370:385788, ack 906, win 117, options [nop,nop,TS val 3022862199 ecr 2076061], length 1418
Each line represents one TCP packet and carries its timestamp, source and destination IP address and port, packet length, and other fields.
My IP address is 160.39.187.145.
2.3 Simplification of the Tabular Record
Use sed to simplify each record to the following format, with 9 space-separated fields per line (fract is the fractional part of the second, in microseconds):
1    2   3   4     5     6       7      8        9
hour min sec fract srcIP srcPort destIP destPort length
Script:
sed -e 's/Flags.*length\ /length/' \
    -e 's/\([0-9]*\)\:\([0-9]*\)\:\([0-9]*\)\./\1 \2 \3 /' \
    -e 's/IP \([0-9]*\.[0-9]*\.[0-9]*\.[0-9]*\)\.\([0-9]*\)/\1 \2/' \
    -e 's/> \([0-9]*\.[0-9]*\.[0-9]*\.[0-9]*\)\.\([0-9]*\)/\1 \2/' \
    -e 's/\: length/ /' < normal_tcp.txt > normal_tcp.csv
The content of normal_tcp.csv looks like:
05 59 59 815133 160.39.187.145 33390 72.32.58.166 80 585
05 59 59 819793 72.32.58.166 80 160.39.187.145 33393 0
05 59 59 820414 72.32.58.166 80 160.39.187.145 33394 289
05 59 59 820427 160.39.187.145 33394 72.32.58.166 80 0
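As a cross-check of the sed step, the same nine fields can be extracted with a single regular expression. The following is a minimal Python sketch, not the script used for this report; the names LINE_RE and parse_line are illustrative, and it assumes the `tcpdump -n` TCP line layout shown in section 2.2.

```python
import re

# Matches tcpdump -n TCP lines of the form shown in section 2.2:
# "hh:mm:ss.fract IP a.b.c.d.port > a.b.c.d.port: Flags ..., length N"
LINE_RE = re.compile(
    r"^(\d+):(\d+):(\d+)\.(\d+) IP "
    r"(\d+\.\d+\.\d+\.\d+)\.(\d+) > "
    r"(\d+\.\d+\.\d+\.\d+)\.(\d+): "
    r".*length (\d+)"
)

def parse_line(line):
    """Return the 9 fields (hour, min, sec, fract, srcIP, srcPort,
    destIP, destPort, length) as strings, or None if the line does not match."""
    m = LINE_RE.match(line)
    return m.groups() if m else None

# First record of the normal_tcp.txt excerpt in section 2.2.
sample = ("05:51:10.023276 IP 66.249.81.91.80 > 160.39.187.145.55179: "
          "Flags [.], seq 371608:373026, ack 906, win 117, "
          "options [nop,nop,TS val 3022862199 ecr 2076061], length 1418")
print(" ".join(parse_line(sample)))
```

Lines that do not match (for example non-IPv4 records) return None and can be skipped by the caller.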
2.4 Port Pair Statistics
A port pair is represented by field 6 (source port) and field 8 (destination port).
Script:
awk '{pp[$6"_"$8]++} END {for (i in pp) print pp[i],i}' < normal_tcp.csv | sort -n -r > pp_bi.csv
pp_bi.csv then holds the packet count of each port pair, sorted in descending order. The first few lines are:
150279 80_35327
136853 80_42756
78719 35327_80
75342 42756_80
57054 80_39498
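The awk one-liner above is a plain frequency count keyed on "srcPort_destPort". A minimal Python sketch of the same idea follows; the function name count_port_pairs is illustrative, and the repeated row in the usage lines is made up to show a count above 1.

```python
from collections import Counter

def count_port_pairs(rows):
    """Count packets per "srcPort_destPort" key.

    rows: iterable of 9-field, space-separated records as in normal_tcp.csv.
    Returns (pair, count) tuples sorted by descending count.
    """
    pairs = Counter()
    for row in rows:
        fields = row.split()
        if len(fields) != 9:
            continue  # skip malformed records
        pairs[fields[5] + "_" + fields[7]] += 1
    return pairs.most_common()

rows = [
    "05 59 59 815133 160.39.187.145 33390 72.32.58.166 80 585",
    "05 59 59 819793 72.32.58.166 80 160.39.187.145 33393 0",
    "05 59 59 819999 72.32.58.166 80 160.39.187.145 33393 0",  # made-up repeat
]
for pair, count in count_port_pairs(rows):
    print(count, pair)
```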
They are all port pairs involving port 80, which is expected because port 80 is the web service (HTTP).
Script to exclude port 80:
sed -e '/80\_/d' -e '/\_80/d' < pp_bi.csv > pp_bi_no_80.csv
The first few lines of pp_bi_no_80.csv are:
4657 443_40902
3486 443_44775
3473 44775_443
3446 59126_38133
3096 443_42330
Port 443 is used for HTTP Secure (HTTPS, i.e. HTTP over SSL, as in https:// URLs).
Script to exclude port 443:
sed -e '/443\_/d' -e '/\_443/d' < pp_bi_no_80.csv > pp_bi_no_80_443.csv
The first few lines of pp_bi_no_80_443.csv are:
3446 59126_38133
2846 1935_57089
1385 38133_59126
950 1935_50308
854 1935_35101
823 19246_49092
Then a script automatically reports, for each non-ephemeral port, the number of port pairs and the total number of packets:
count.sh
# port as source: number of port pairs and total packets
for p in 21 22 80 90 110 143 443 1080 1935; do
    cat pp_bi.csv | grep " ${p}_" | awk '{packets_count+=$1} END {print NR, "\t" ,packets_count}'
done
echo
# port as destination: number of port pairs and total packets
for p in 21 22 80 90 110 143 443 1080 1935; do
    cat pp_bi.csv | grep "_${p}\$" | awk '{packets_count+=$1} END {print NR, "\t" ,packets_count}'
done
echo
# everything else: drop all lines involving a known service port, then total the rest
sed -e '/ 21_/d' -e '/ 22_/d' -e '/ 80_/d' -e '/ 90_/d' -e '/ 110_/d' -e '/ 139_/d' \
    -e '/ 143_/d' -e '/ 443_/d' -e '/ 993_/d' -e '/ 1080_/d' -e '/ 1935_/d' \
    -e '/_21$/d' -e '/_22$/d' -e '/_80$/d' -e '/_90$/d' -e '/_110$/d' -e '/_139$/d' \
    -e '/_143$/d' -e '/_443$/d' -e '/_993$/d' -e '/_1080$/d' -e '/_1935$/d' \
    -e '/\:$/d' < pp_bi.csv > temp.txt
awk '{packets_count+=$1} END {print NR, "\t" ,packets_count}' < temp.txt
2.5 Ports and Services Summary
Identified ports and services [1] are:
22/TCP,UDP Secure Shell (SSH), used for secure logins, file transfers (scp, sftp) and port forwarding
80/TCP,UDP Hypertext Transfer Protocol (HTTP)
90/TCP,UDP dnsix (DoD Network Security for Information Exchange) Security Attribute Token Map
90/TCP,UDP Pointcast (unofficial)
110/TCP Post Office Protocol 3 (POP3)
143/TCP,UDP Internet Message Access Protocol (IMAP), used for retrieving, organizing, and synchronizing e-mail messages
443/TCP,UDP Hypertext Transfer Protocol over TLS/SSL (HTTPS)
1080/TCP SOCKS proxy
1935/TCP Adobe Systems Macromedia Flash Real Time Messaging Protocol (RTMP) "plain" protocol
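The per-service tallies that count.sh prints for these ports (number of port pairs and total packets, once with the port as source and once as destination) can be sketched in Python as follows. This is an illustrative sketch only; summarize_port is a hypothetical helper, and the input is the first five lines of pp_bi.csv quoted in section 2.4.

```python
def summarize_port(pairs, port):
    """Tally one service port.

    pairs: list of (packet_count, "src_dst") tuples as in pp_bi.csv.
    Returns ((pairs_as_source, packets_as_source),
             (pairs_as_destination, packets_as_destination)).
    """
    port = str(port)
    as_src = [count for count, pp in pairs if pp.split("_")[0] == port]
    as_dst = [count for count, pp in pairs if pp.split("_")[1] == port]
    return (len(as_src), sum(as_src)), (len(as_dst), sum(as_dst))

# First five lines of pp_bi.csv from section 2.4.
top_pairs = [
    (150279, "80_35327"),
    (136853, "80_42756"),
    (78719, "35327_80"),
    (75342, "42756_80"),
    (57054, "80_39498"),
]
print(summarize_port(top_pairs, 80))
```

Comparing the port exactly after splitting on "_" avoids the partial matches that an unanchored pattern such as 80_ would produce for ports like 1080.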
Traffic on ports 80 and 443 is almost always paired with an ephemeral port [2].
The distribution of the ephemeral ports is even.
A small number of connections use lower-numbered ports. The following lists the port pairs involving port 80 whose other port is below 32768; the first column is the number of packets.
80 > *       | * > 80
4 80 32731   | 5 32731 80
2 80 2239    | 2 418 80
2 80 11744   | 2 2239 80
1 80 594     | 2 11744 80
1 80 418     | 1 594 80
1 80 23953   | 1 23953 80
I could not identify the usage of these lower-numbered ports [1].
Ports associated with 443 are all above 32768.
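The check behind the observation above is a simple filter on the peer port. A minimal Python sketch, assuming the 32768 threshold used in the text as the lower bound of the ephemeral range; low_port_peers is an illustrative name, and the sample entries are taken from sections 2.4 and 2.5.

```python
EPHEMERAL_MIN = 32768  # threshold used in the text; the actual range is OS dependent

def low_port_peers(pairs, service_port, threshold=EPHEMERAL_MIN):
    """Return the (count, src, dst) entries where one side is service_port
    and the other side is below the ephemeral threshold."""
    result = []
    for count, src, dst in pairs:
        if src == service_port and dst < threshold:
            result.append((count, src, dst))
        elif dst == service_port and src < threshold:
            result.append((count, src, dst))
    return result

pairs = [
    (150279, 80, 35327),
    (4, 80, 32731),
    (5, 32731, 80),
    (2, 418, 80),
    (4657, 443, 40902),
]
for entry in low_port_peers(pairs, 80):
    print(*entry)
```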
2.5.1 Overall Port Pair Table

The table shows the overall port pair statistics.
"other" denotes an ephemeral port.
2.5.2 Overall Port Pair Diagram
Figure: bar charts of the number of port pairs and the number of packets.

The diagram shows the absolute counts and the percentages as bar charts.
As denoted in the table, blue bars represent the pattern port_* (data goes out of the port), and red bars represent *_port (data goes into the port).
2.5.3 Overall Packet Size Distribution
Histograms of the total and per-day statistics are shown.

Most packets are smaller than 2000 bytes, while a few are very large, up to 10^4 bytes.
2.5.4 Overall Traffic Rate and Volume

The x axis is the hour. The temporal behavior of the traffic rate (packet count) and of the volume (sum of packet sizes) match.
The four peaks should correspond to my periods of intensive Internet use each day.
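The hourly rate and volume plotted here are a count and a sum over each consecutive run of records with the same hour. The following Python sketch illustrates that grouping; it is not the MATLAB code of Appendix A, hourly_rate_volume is an illustrative name, and the sample records are made up.

```python
from itertools import groupby

def hourly_rate_volume(records):
    """records: list of (hour, length) tuples in capture order.

    Returns one (hour, packet_count, byte_total) tuple per consecutive run
    of the same hour, so the same hour on different days stays separate.
    """
    out = []
    for hour, group in groupby(records, key=lambda r: r[0]):
        lengths = [length for _, length in group]
        out.append((hour, len(lengths), sum(lengths)))
    return out

# Made-up records: two packets in hour 5, three in hour 6.
records = [(5, 1418), (5, 0), (6, 585), (6, 0), (6, 289)]
for hour, rate, volume in hourly_rate_volume(records):
    print(hour, rate, volume)
```

When rate and volume rise and fall together, the average packet size is roughly constant over time; a mismatch points to hours dominated by unusually large or small packets.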
3. Inbound Traffic
Perform the same analysis as in section 2 after extracting the inbound data, i.e. the rows whose 7th field (destIP) equals 160.39.187.145 (my IP).
in.sh
awk '$7=="160.39.187.145"' < normal_tcp.csv > in.txt
cat in.txt | awk '{pp[$6"_"$8]++} END {for (i in pp) print pp[i],i}' | sort -n -r > pp_bi.csv
3.1 Inbound Port Pair Table
3.2 Inbound Port Pair Diagram
Figure: bar charts of the number of port pairs and the number of packets.

Ports 80 and 443 stand out. Blue bars (for inbound) dominate.
3.3 Inbound Packet Size Distribution
Histograms of the total and per-day statistics are shown.

The inbound histogram is very different from the overall histogram. There are no larger packets, and most inbound packets are exactly 1387 or 73 bytes.
3.4 Inbound Traffic Rate and Volume

The x axis is the hour. The temporal behavior of the traffic rate (packet count) and of the volume (sum of packet sizes) match, similar to the overall diagram in section 2.5.4.
4. Outbound Traffic
Perform the same analysis as in section 2 after extracting the outbound data, i.e. the rows whose 5th field (srcIP) equals 160.39.187.145 (my IP).
out.sh
awk '$5=="160.39.187.145"' < normal_tcp.csv > out.txt
cat out.txt | awk '{pp[$6"_"$8]++} END {for (i in pp) print pp[i],i}' | sort -n -r > pp_bi.csv
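The in.sh and out.sh filters differ only in which address field is compared with the local IP. A minimal Python sketch of the split, using the sample rows of normal_tcp.csv from section 2.3; split_direction is an illustrative name.

```python
MY_IP = "160.39.187.145"  # local address stated in section 2.2

def split_direction(rows, my_ip=MY_IP):
    """Split 9-field records into (inbound, outbound) lists.

    Inbound: field 7 (destIP) is my_ip. Outbound: field 5 (srcIP) is my_ip.
    """
    inbound, outbound = [], []
    for row in rows:
        fields = row.split()
        if fields[6] == my_ip:
            inbound.append(row)
        if fields[4] == my_ip:
            outbound.append(row)
    return inbound, outbound

rows = [
    "05 59 59 815133 160.39.187.145 33390 72.32.58.166 80 585",
    "05 59 59 819793 72.32.58.166 80 160.39.187.145 33393 0",
    "05 59 59 820414 72.32.58.166 80 160.39.187.145 33394 289",
    "05 59 59 820427 160.39.187.145 33394 72.32.58.166 80 0",
]
inbound, outbound = split_direction(rows)
print(len(inbound), "inbound,", len(outbound), "outbound")
```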
4.1 Outbound Port Pair Table
4.2 Outbound Port Pair Diagram
Figure: bar charts of the number of port pairs and the number of packets.

Ports 80 and 443 stand out. Red bars (for outbound) dominate.
4.3 Outbound Packet Size Distribution
Histograms of the total and per-day statistics are shown.

The outbound histogram is similar to the overall histogram, so the larger packets come from outbound traffic.
4.4 Outbound Traffic Rate and Volume

The x axis is the hour. The temporal behavior of the traffic rate (packet count) and of the volume (sum of packet sizes) do not match. This is consistent with the uneven distribution of outbound packet sizes: some large packets are going out.
5. Anomaly Detector
Most of my traffic is on ports 80 and 443, and very few connections use other services. My model is therefore based on the distribution of the packet count (per minute) associated with port pairs other than 80 and 443.
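The feature used by the model, the number of packets per minute on port pairs that involve neither 80 nor 443, can be sketched in Python as below. This is an illustration rather than the script that produced Appendix B; per_minute_counts is a hypothetical name, the "hhmm" key follows the layout of Appendix B, and rows marked as made up use documentation addresses.

```python
from collections import Counter

EXCLUDED = {"80", "443"}

def per_minute_counts(rows):
    """Count packets per "hhmm" key, skipping any record whose source or
    destination port is 80 or 443."""
    counts = Counter()
    for row in rows:
        fields = row.split()
        if fields[5] in EXCLUDED or fields[7] in EXCLUDED:
            continue
        counts[fields[0] + fields[1]] += 1
    return counts

rows = [
    "05 59 59 815133 160.39.187.145 33390 72.32.58.166 80 585",  # port 80, skipped
    "16 14 01 000001 160.39.187.145 59126 192.0.2.1 38133 100",  # made up
    "16 14 30 000002 192.0.2.1 38133 160.39.187.145 59126 0",    # made up
    "16 15 00 000003 192.0.2.1 1935 160.39.187.145 57089 500",   # made up
]
for minute, count in sorted(per_minute_counts(rows).items()):
    print(minute, count)
```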
5.1 Distribution of Packets # per Minute for non 80, 443 Ports
The distribution is generated from the data in Appendix B.

We can see that most counts are below 500 per minute. The mean and standard deviation are
m_m = 78.3352
m_std = 440.7354
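With these two parameters the detector scores a minute by the value of a normal density at its packet count. A minimal Python sketch of the fit and of the score, written with the standard library only; fit_gaussian and normal_pdf are illustrative names, and the sample standard deviation (N-1 denominator) is assumed because that is what MATLAB's std computes by default.

```python
import math
import statistics

def fit_gaussian(counts):
    """Return (mean, sample standard deviation) of the per-minute counts."""
    return statistics.mean(counts), statistics.stdev(counts)

def normal_pdf(x, mean, std):
    """Density of a normal distribution with the given mean and std at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2 * math.pi))

# Parameters reported in section 5.1.
M_M, M_STD = 78.3352, 440.7354

# A typical normal minute (Appendix B) versus a scan minute (Appendix C).
print(normal_pdf(139, M_M, M_STD))
print(normal_pdf(22965, M_M, M_STD))
```

A low density means the count is far from what the normal traffic model expects, so minutes whose density falls below a chosen threshold are flagged as anomalous.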
6. Testing Detector
6.1 Normal and Scan Data Set
I performed zenmap scans; the observed data is in Appendix C.
Figure: probability density function (pdf) values for the normal data set and the scan data set.

The x axis is the one-minute sample index, and the y axis is the pdf value.
There is a significant difference between the two data sets: the normal data set has much larger pdf values than the scan data set.
6.2 ROC Diagram
By varying the pdf threshold used to classify these two data sets, the ROC diagram is plotted:

The true positive rate approaches 1 very quickly because the two data sets are significantly different.
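Each ROC point comes from one threshold: a minute is classified as a scan when its pdf value is below the threshold. The Python sketch below mirrors the counting done by calc_roc in Appendix A; roc_point and roc_curve are illustrative names, and the pdf values in the usage lines are toy numbers, not the measured ones.

```python
def roc_point(threshold, p_normal, p_scan):
    """Return (FPR, TPR) when samples with pdf < threshold are flagged as scans."""
    tp = sum(1 for p in p_scan if p < threshold)    # scans flagged
    fn = len(p_scan) - tp                           # scans missed
    fp = sum(1 for p in p_normal if p < threshold)  # normal minutes flagged
    tn = len(p_normal) - fp                         # normal minutes passed
    return fp / (fp + tn), tp / (tp + fn)

def roc_curve(thresholds, p_normal, p_scan):
    """Evaluate roc_point for every threshold, in the given order."""
    return [roc_point(t, p_normal, p_scan) for t in thresholds]

# Toy pdf values for illustration only.
p_normal = [9e-4, 8e-4, 5e-4, 1e-6]
p_scan = [0.0, 1e-9, 2e-5]
for fpr, tpr in roc_curve([1e-12, 1e-5, 1e-3], p_normal, p_scan):
    print(fpr, tpr)
```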
References
1. List of TCP and UDP port numbers, http://en.wikipedia.org/wiki/List_of_TCP_and_UDP_port_numbers
2. Ephemeral port, http://en.wikipedia.org/wiki/Ephemeral_port
Appendix A: MATLAB Code
%% model.m, Main entry point of program
prepare_data;
% 1    2   3   4     5       6      7        8
% hour min sec srcIP srcPort destIP destPort length
%% hour - day
plot(n(:,1));
%%
plot_rate_volume(n,201,'Overall');
plot_rate_volume(n_in,202,'Inbound');
plot_rate_volume(n_out,203,'Outbound');
%%
plot_packet_size;
%% ports
ports = [21,22,90,110,139,143,993,1080,1935];
for i=1:length(ports)
    inds{i} = filter_port(n,ports(i), 0);
end
%% 80, 443 distribution
a80t{1} = dlmread('80_.txt');a80t{2} = dlmread('_80.txt');
a80t{3} = dlmread('443_.txt');a80t{4} = dlmread('_443.txt');
a80t_title={'80 > *','* > 80','443 > *','* > 443'};
figure(444);
for i=1:4
    subplot(2,2,i);hist(a80t{i});title(a80t_title{i});
end
%% no 80 443 mincount
mc = dlmread('no80mincount.txt',' ');m = mc(:,2);hist(m);
% distribution parameter
m_m = mean(m); m_std = std(m);
%
scan_m=[
1730 22965
1731 5660
1727 8625
1729 2733
1730 1286
1731 20501
1732 16398
1728 1772
1986 1933
1987 1949
1988 2037
1989 19074
1985 28355
1987 6268
1988 21278
1989 7087
2207 16222
2203 26098
2205 7943
2206 9542
2207 19419
2208 16317
2204 1315
2334 8942
2335 116879
2336 229549
2337 109804
2333 1315
];
%
p_scan = pdf('norm', scan_m(:,2) ,m_m ,m_std);
p_m = pdf('norm', m ,m_m ,m_std);
%%
subplot(211);plot(p_m);title('normal data pdf');
subplot(212);plot(p_scan);title('scan data pdf');
%% RoC
p = 10.^([-8,-7.5,-7,-6.5,-6,-5.8,-5.5,-5,-4,-3,-2,-1]);x=[];y=[];
for i=1:length(p)
    [TP, TN, FP, FN] = calc_roc (p(i),p_m,p_scan);
    TPR = TP / (TP + FN);FPR = FP / (FP + TN);
    x=[x;FPR];y=[y;TPR];
end
plot(x,y,'.-');title('ROC');xlabel('FPR');ylabel('TPR');
function [TP,TN,FP,FN]=calc_roc(p,p_m,p_scan)
% a sample is flagged as a scan when its pdf value is below the threshold p
TP=sum(p_scan<p);
FN=sum(p_scan>=p);
FP=sum(p_m<p);
TN=sum(p_m>=p);
end
%% prepare_data.m, read n
%clear;n = dlmread('data.csv', ' ');save;
% 1    2   3   4     5       6      7        8
% hour min sec srcIP srcPort destIP destPort length
%% load n
clear;
load;
%%
n_out = n(find(n(:,4)),:);
n_in = n(find(n(:,6)),:);
split = find_split(n);
split_in = find_split(n_in);
split_out = find_split(n_out);
%% dn
for i=1:5
    dn{i}=get_data_from_day(n,i,split);
    dn_in{i}=get_data_from_day(n_in,i,split_in);
    dn_out{i}=get_data_from_day(n_out,i,split_out);
end
%% plot_packet_size.m, Packet Size Distribution
%% n
figure(101);
subplot(6,1,1);hist(n(:,8));title('Overall total','Color','b');
%
for i=1:5
    subplot(6,1,i+1);hist(dn{i}(:,8));title(['day ',num2str(i)]);
end
%% n_in
figure(102);
subplot(6,1,1);hist(n_in(:,8));title('Inbound total','Color','b');
%
for i=1:5
    subplot(6,1,i+1);hist(dn_in{i}(:,8));title(['day ',num2str(i)]);
end
%% n_out
figure(103);
subplot(6,1,1);hist(n_out(:,8));title('Outbound total','Color','b');
%
for i=1:5
    subplot(6,1,i+1);hist(dn_out{i}(:,8));title(['day ',num2str(i)]);
end
%% 6 * 3
figure(1000);
subplot(6,3,1);hist(n(:,8));title('Overall total','Color','b');
subplot(6,3,2);hist(n_in(:,8));title('Inbound total','Color','b');
subplot(6,3,3);hist(n_out(:,8));title('Outbound total','Color','b');
%
for i=1:5
    subplot(6,3,3*i+1);hist(dn{i}(:,8));title(['day ',num2str(i)]);
    subplot(6,3,3*i+2);hist(dn_in{i}(:,8));title(['day ',num2str(i)]);
    subplot(6,3,3*i+3);hist(dn_out{i}(:,8));title(['day ',num2str(i)]);
end
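function split = find_split(n)
% split(k) is the index of the last row before day k (5 days, 6 boundaries)
split = zeros(6,1);
split(1) = 0;
split(6) = length(n);
j=0;
% stop one row early so that n(i+1,1) stays within bounds
for i=1:length(n)-1
    % a day boundary is where hour 23 is followed by hour 0
    if(n(i,1)==23 && n(i+1,1)==0)
        j=j+1;
        split(j+1)= i;
    end
end
end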
function data = get_data_from_day(n,day,split)
data = n(split(day)+1:split(day+1),:);
end
%% plot_rate_volume.m, Rate and Volume
function plot_rate_volume(n,g,t)
[h,s] = get_hours(n);
for i=1:length(h)
    data{i} = get_data_from_split(n,i,s);
    rate(i) = length(data{i});
    volume(i) = sum(data{i}(:,8));
end
%%
figure(g)
subplot(2,1,1);plot(1:length(h),rate);title([t,' Rate (hourly packet #)']);
subplot(2,1,2);plot(1:length(h),volume);title([t,' Volume (hourly packet size total)']);
end
function [hours, split]= get_hours(n)
hours(1) = n(1,1);
split(1) = 0;
for i=1:length(n)
    read_next = n(i,1);
    if hours(length(hours)) ~= read_next
        hours = [hours; read_next];
        split = [split; i-1];
    end
end
split=[split;length(n)];
end
Appendix B: Non 80,443 Port per Minute Packet Count on Normal Data
The first column is the hour concatenated with the minute; the second column is the packet count in that minute.
This data is used to fit the detector's Gaussian distribution, i.e. to find its mean and standard deviation.
1614 139
0053 74
0742 18
1630 30
0055 86
0743 13
1619 28
0745 40
1633 10
1406 6
1847 4
1848 1
1409 2
2021 11
0110 21
2237 1
2024 4
2252 2
2025 1
1655 2
1902 6
0113 2
0555 2
0342 19
0803 22
1905 2
0804 2
0116 9
0117 11
1659 83
0130 43
0131 64
0118 44
0806 30
0346 119
0119 60
0821 2
0133 130
0808 122
0134 46
1710 369
0135 38
1711 491
1250 2
0824 2
0136 39
1928 2
1731 184
1945 2
1719 2
0631 2
0405 6
2331 74
2319 2
0420 441
1735 6
1736 3
0408 2
2121 2
1750 4
1751 6
0651 2
2351 15
0440 1
1314 6
2140 2
2141 3
1758 2
2143 2
1759 6
1332 2
0446 28
2146 3
2147 8
0449 16
0236 1
0237 4
0238 4
1352 4
0927 2
1129 2
0502 18
1603 2
2201 7
2203 60
0505 235
0507 50
2206 6
1835 4
0508 35
0509 5
1838 6
1625 154
0753 11
0101 2
0102 53
1418 2
0103 2
1859 144
2017 1
1647 2
2031 3
2018 4
2032 2
0106 2
0120 84
2034 4
0335 58
0121 78
0109 6
2035 3
0122 65
0810 11
0123 22
0124 100
0125 38
2052 5
0814 2
0126 28
1701 2
0600 38
0354 1
0127 106
0815 4
0601 135
0128 2
0602 54
0142 2
0129 78
0357 2
0143 2
0604 54
2304 2
0144 18
0833 4
0359 110
1724 33
2323 33
0411 151
1727 3
1513 2
0413 52
1956 2
1515 2
0641 6
0414 127
2327 11
0415 422
2328 32
0416 89
2116 5
0418 267
2130 2
0419 106
1306 2
2132 7
0434 1661
1749 6
2133 34
1535 4
0649 2
0435 5
2134 2
0450 3
0451 5
1328 2
0240 54
0915 2
1053 6
0702 2
0241 3
0703 1
2156 5674
1805 4
0459 277
1820 82
0245 7
0720 2
1058 2
0246 28
1059 2
1349 2
Appendix C: Non 80,443 Port per Minute Packet Count on Scan Data
The first column is the hour concatenated with the minute; the second column is the packet count in that minute.
1730 22965
1731 5660
1727 8625
1729 2733
1730 1286
1731 20501
1732 16398
1728 1772
1986 1933
1987 1949
1988 2037
1989 19074
1985 28355
1987 6268
1988 21278
1989 7087
2207 16222
2203 26098
2205 7943
2206 9542
2207 19419
2208 16317
2204 1315
2334 8942
2335 116879
2336 229549
2337 109804
2333 1315