Intrusion Detection Systems, Columbia, Fall 2009 - Created and analyzed user behavior models based on captured network packets using Nmap, Snort, tcpdump and R
1. Introduction
1.1 Tools
I use tcpdump to capture the data, sed to convert it to a tabular format, and awk and MATLAB for the statistical analysis.
1.2 Model
Most of my traffic is on ports 80 and 443, and very few connections use other services, so my model is based on the distribution of the per-minute packet count for port pairs that involve neither 80 nor 443.
2. Data Capture and Preparation
2.1 Capture Raw Traffic
tcpdump data was captured from 10/7/2009 to 10/11/2009 and written to the file normal.log with Script:
sudo tcpdump -w normal.log
The normal.log file is 3.2 GB.
The hour (y axis) - day (x axis) plot of collected continuous data.
A total time span of 105 hours is covered.
2.2 Extract TCP Record
Since only TCP data is of interest, the TCP records are extracted with Script:
tcpdump -r normal.log -n tcp > normal_tcp.txt
The content of normal_tcp.txt looks like:
05:51:10.023276 IP 66.249.81.91.80 > 160.39.187.145.55179: Flags [.], seq 371608:373026, ack 906, win 117, options [nop,nop,TS val 3022862199 ecr 2076061], length 1418
05:51:10.023288 IP 160.39.187.145.55179 > 66.249.81.91.80: Flags [.], ack 373026, win 2672, options [nop,nop,TS val 2076067 ecr 3022862199], length 0
05:51:10.024792 IP 66.249.81.91.80 > 160.39.187.145.55179: Flags [.], seq 373026:374444, ack 906, win 117, options [nop,nop,TS val 3022862199 ecr 2076061], length 1418
05:51:10.024803 IP 160.39.187.145.55179 > 66.249.81.91.80: Flags [.], ack 374444, win 2718, options [nop,nop,TS val 2076068 ecr 3022862199], length 0
05:51:10.026046 IP 66.249.81.91.80 > 160.39.187.145.55179: Flags [.], seq 374444:375862, ack 906, win 117, options [nop,nop,TS val 3022862199 ecr 2076061], length 1418
05:51:10.026057 IP 160.39.187.145.55179 > 66.249.81.91.80: Flags [.], ack 375862, win 2741, options [nop,nop,TS val 2076068 ecr 3022862199], length 0
05:51:10.027297 IP 66.249.81.91.80 > 160.39.187.145.55179: Flags [.], seq 375862:377280, ack 906, win 117, options [nop,nop,TS val 3022862199 ecr 2076061], length 1418
05:51:10.028893 IP 66.249.81.91.80 > 160.39.187.145.55179: Flags [.], seq 377280:378698, ack 906, win 117, options [nop,nop,TS val 3022862199 ecr 2076061], length 1418
05:51:10.028905 IP 160.39.187.145.55179 > 66.249.81.91.80: Flags [.], ack 378698, win 2741, options [nop,nop,TS val 2076068 ecr 3022862199], length 0
05:51:10.029893 IP 66.249.81.91.80 > 160.39.187.145.55179: Flags [.], seq 378698:380116, ack 906, win 117, options [nop,nop,TS val 3022862199 ecr 2076061], length 1418
05:51:10.031496 IP 66.249.81.91.80 > 160.39.187.145.55179: Flags [.], seq 380116:381534, ack 906, win 117, options [nop,nop,TS val 3022862199 ecr 2076061], length 1418
05:51:10.031507 IP 160.39.187.145.55179 > 66.249.81.91.80: Flags [.], ack 381534, win 2741, options [nop,nop,TS val 2076068 ecr 3022862199], length 0
05:51:10.032747 IP 66.249.81.91.80 > 160.39.187.145.55179: Flags [.], seq 381534:382952, ack 906, win 117, options [nop,nop,TS val 3022862199 ecr 2076061], length 1418
05:51:10.033994 IP 66.249.81.91.80 > 160.39.187.145.55179: Flags [.], seq 382952:384370, ack 906, win 117, options [nop,nop,TS val 3022862199 ecr 2076061], length 1418
05:51:10.034006 IP 160.39.187.145.55179 > 66.249.81.91.80: Flags [.], ack 384370, win 2741, options [nop,nop,TS val 2076069 ecr 3022862199], length 0
05:51:10.035245 IP 66.249.81.91.80 > 160.39.187.145.55179: Flags [.], seq 384370:385788, ack 906, win 117, options [nop,nop,TS val 3022862199 ecr 2076061], length 1418
Each line represents one TCP packet, with its timestamp, source and destination IP and port, packet length, etc.
My IP address is 160.39.187.145.
2.3 Simplification of the Tabular Record
Use sed to simplify the tabular record to the following format, with 9 space-separated fields per line:
1    2   3   4     5     6       7      8        9
hour min sec fract srcIP srcPort destIP destPort length
Script:
sed -e 's/Flags.*length\ /length/' -e 's/\([0-9]*\)\:\([0-9]*\)\:\([0-9]*\)\./\1 \2 \3 /' -e 's/IP \([0-9]*\.[0-9]*\.[0-9]*\.[0-9]*\)\.\([0-9]*\)/\1 \2/' -e 's/> \([0-9]*\.[0-9]*\.[0-9]*\.[0-9]*\)\.\([0-9]*\)/\1 \2/' -e 's/\: length/ /' < normal_tcp.txt > normal_tcp.csv
The content of normal_tcp.csv looks like:
05 59 59 815133 160.39.187.145 33390 72.32.58.166 80 585
05 59 59 819793 72.32.58.166 80 160.39.187.145 33393 0
05 59 59 820414 72.32.58.166 80 160.39.187.145 33394 289
05 59 59 820427 160.39.187.145 33394 72.32.58.166 80 0
2.4 Port Pair Statistics
A port pair is represented by field 6 (source port) and field 8 (destination port).
Script:
awk '{pp[$6"_"$8]++} END {for (i in pp) print pp[i],i}' < normal_tcp.csv | sort -n -r > pp_bi.csv
pp_bi.csv then holds the port pair counts sorted in descending order; the first few lines are:
150279 80_35327
136853 80_42756
78719 35327_80
75342 42756_80
57054 80_39498
These are all port pairs involving port 80, which is expected since port 80 is the web (HTTP) service.
Script to exclude port 80:
sed -e '/80\_/d' -e '/\_80/d' < pp_bi.csv > pp_bi_no_80.csv
The first few lines of pp_bi_no_80.csv are:
4657 443_40902
3486 443_44775
3473 44775_443
3446 59126_38133
3096 443_42330
Port 443 is for HTTP Secure (HTTPS, i.e. HTTP over SSL/TLS).
Script to exclude port 443:
sed -e '/443\_/d' -e '/\_443/d' < pp_bi_no_80.csv > pp_bi_no_80_443.csv
The first few lines of pp_bi_no_80_443.csv are:
3446 59126_38133
2846 1935_57089
1385 38133_59126
950 1935_50308
854 1935_35101
823 19246_49092
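Note that sed patterns such as /80\_/ match substrings, so they also delete rows for ports like 1080 or 8080. As a rough sketch of a stricter filter, the port pair can be split on "_" and each side compared exactly. The helper name drop_ports is hypothetical and not part of the original scripts, and the 1080 row in the demonstration is made up for illustration.

```shell
#!/bin/sh
# Drop every "count src_dst" row whose source or destination port is exactly
# one of the given ports. Hypothetical helper, not one of the original scripts.
# usage: drop_ports "80 443" < pp_bi.csv
drop_ports() {
    awk -v ports="$1" '
        BEGIN { n = split(ports, p, " "); for (i = 1; i <= n; i++) drop[p[i]] = 1 }
        {
            split($2, pair, "_")   # pair[1] = source port, pair[2] = destination port
            if (!(pair[1] in drop) && !(pair[2] in drop)) print
        }'
}

# Demonstration: three rows taken from the listings above plus one made-up
# 1080 row; only the last two rows survive the exact-match filter.
printf '%s\n' '150279 80_35327' '4657 443_40902' '3446 59126_38133' '12 1080_40000' |
    drop_ports "80 443"
```

With the real data this would be run as drop_ports "80 443" < pp_bi.csv > pp_bi_no_80_443.csv, assuming the same file names as above.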
Then use a script to count, for each non-ephemeral port, the number of port pairs and the total number of packets:
count.sh
# Port as source (port_*): number of port pairs and total packets
cat pp_bi.csv | grep " 21_" | awk '{packets_count+=$1} END {print NR, "\t" ,packets_count}'
cat pp_bi.csv | grep " 22_" | awk '{packets_count+=$1} END {print NR, "\t" ,packets_count}'
cat pp_bi.csv | grep " 80_" | awk '{packets_count+=$1} END {print NR, "\t" ,packets_count}'
cat pp_bi.csv | grep " 90_" | awk '{packets_count+=$1} END {print NR, "\t" ,packets_count}'
cat pp_bi.csv | grep " 110_" | awk '{packets_count+=$1} END {print NR, "\t" ,packets_count}'
cat pp_bi.csv | grep " 143_" | awk '{packets_count+=$1} END {print NR, "\t" ,packets_count}'
cat pp_bi.csv | grep " 443_" | awk '{packets_count+=$1} END {print NR, "\t" ,packets_count}'
cat pp_bi.csv | grep " 1080_" | awk '{packets_count+=$1} END {print NR, "\t" ,packets_count}'
cat pp_bi.csv | grep " 1935_" | awk '{packets_count+=$1} END {print NR, "\t" ,packets_count}'
echo
# Port as destination (*_port): number of port pairs and total packets
cat pp_bi.csv | grep "_21$" | awk '{packets_count+=$1} END {print NR, "\t" ,packets_count}'
cat pp_bi.csv | grep "_22$" | awk '{packets_count+=$1} END {print NR, "\t" ,packets_count}'
cat pp_bi.csv | grep "_80$" | awk '{packets_count+=$1} END {print NR, "\t" ,packets_count}'
cat pp_bi.csv | grep "_90$" | awk '{packets_count+=$1} END {print NR, "\t" ,packets_count}'
cat pp_bi.csv | grep "_110$" | awk '{packets_count+=$1} END {print NR, "\t" ,packets_count}'
cat pp_bi.csv | grep "_143$" | awk '{packets_count+=$1} END {print NR, "\t" ,packets_count}'
cat pp_bi.csv | grep "_443$" | awk '{packets_count+=$1} END {print NR, "\t" ,packets_count}'
cat pp_bi.csv | grep "_1080$" | awk '{packets_count+=$1} END {print NR, "\t" ,packets_count}'
cat pp_bi.csv | grep "_1935$" | awk '{packets_count+=$1} END {print NR, "\t" ,packets_count}'
echo
# Everything else: remove all rows involving a known port, then count the remainder
sed -e '/ 21_/d' -e '/ 22_/d' -e '/ 80_/d' -e '/ 90_/d' -e '/ 110_/d' -e '/ 139_/d' -e '/ 143_/d' -e '/ 443_/d' -e '/ 993_/d' -e '/ 1080_/d' -e '/ 1935_/d' -e '/_21$/d' -e '/_22$/d' -e '/_80$/d' -e '/_90$/d' -e '/_110$/d' -e '/_139$/d' -e '/_143$/d' -e '/_443$/d' -e '/_993$/d' -e '/_1080$/d' -e '/_1935$/d' -e '/\:$/d' < pp_bi.csv > temp.txt
awk '{packets_count+=$1} END {print NR, "\t" ,packets_count}' < temp.txt
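The repeated grep/awk pipelines in count.sh could also be folded into a single awk pass. The following is a minimal sketch, assuming the input has the same "count src_dst" rows as pp_bi.csv; the function name count_ports and the tab-separated output layout are assumptions, not the original script.

```shell
#!/bin/sh
# For each requested port, print one tab-separated row:
#   port, pairs with it as source, packets with it as source,
#         pairs with it as destination, packets with it as destination
# Hypothetical helper; input rows are "count src_dst" as in pp_bi.csv.
count_ports() {
    awk -v ports="$1" '
        BEGIN { n = split(ports, p, " ") }
        {
            split($2, pair, "_")
            src_pairs[pair[1]]++; src_pkts[pair[1]] += $1
            dst_pairs[pair[2]]++; dst_pkts[pair[2]] += $1
        }
        END {
            for (i = 1; i <= n; i++)
                printf "%s\t%d\t%d\t%d\t%d\n", p[i],
                       src_pairs[p[i]], src_pkts[p[i]],
                       dst_pairs[p[i]], dst_pkts[p[i]]
        }'
}

# Demonstration on a few rows copied from the listings in this section.
printf '%s\n' '150279 80_35327' '136853 80_42756' '78719 35327_80' \
              '4657 443_40902' '3473 44775_443' |
    count_ports "80 443"
```

On the full data set the call would look like count_ports "21 22 80 90 110 143 443 1080 1935" < pp_bi.csv. Ports that never occur simply produce a row of zeros.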
2.5 Ports and Services Summary
Identified ports and services [1] are:
22/TCP,UDP Secure Shell (SSH), used for secure logins, file transfers (scp, sftp) and port forwarding
80/TCP,UDP Hypertext Transfer Protocol (HTTP)
90/TCP,UDP dnsix (DoD Network Security for Information Exchange) Security Attribute Token Map
90/TCP,UDP Pointcast (unofficial)
110/TCP Post Office Protocol 3 (POP3)
143/TCP,UDP Internet Message Access Protocol (IMAP), used for retrieving, organizing, and synchronizing e-mail messages
443/TCP,UDP Hypertext Transfer Protocol over TLS/SSL (HTTPS)
1080/TCP SOCKS proxy
1935/TCP Adobe Systems Macromedia Flash Real Time Messaging Protocol (RTMP) "plain" protocol
Traffic on ports 80 and 443 is almost always paired with an ephemeral port [2].
There are a small number of connections in which port 80 is paired with a lower-numbered port (< 32768); I could not identify their usage [1]. In the table, the first column is the number of packets and the other port (< 32768) is the one paired with port 80.
Ports associated with 443 are all > 32768.
2.5.1 Overall Port Pair Table
The table shows the overall port pair statistics; "other" denotes an ephemeral port.
2.5.2 Overall Port Pair Diagram
# of port pairs # of packets
This diagram shows bar charts of the absolute amounts and percentages. As denoted in the table, blue bars represent the pattern port_* (data goes out of the port) and red bars represent *_port (data goes into the port).
2.5.3 Overall Packet Size Distribution
Histograms of the total and per-day statistics are shown. Most packets are smaller than 2000 bytes, while a few are very large, up to 10^4.
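The histograms themselves were produced in MATLAB (Appendix A). As a rough command-line sketch, the packet size distribution can also be binned directly from normal_tcp.csv with awk; the function name size_hist and the default bin width of 500 are assumptions chosen for illustration.

```shell
#!/bin/sh
# Histogram of packet length (field 9) in fixed-width bins.
# Prints "bin_start count" sorted by bin. Hypothetical helper.
# usage: size_hist [BIN_WIDTH] < normal_tcp.csv
size_hist() {
    awk -v w="${1:-500}" '
        { bin = int($9 / w) * w; h[bin]++ }
        END { for (b in h) print b, h[b] }' | sort -n
}

# Demonstration on the four sample rows of normal_tcp.csv shown in section 2.3.
printf '%s\n' \
    '05 59 59 815133 160.39.187.145 33390 72.32.58.166 80 585' \
    '05 59 59 819793 72.32.58.166 80 160.39.187.145 33393 0' \
    '05 59 59 820414 72.32.58.166 80 160.39.187.145 33394 289' \
    '05 59 59 820427 160.39.187.145 33394 72.32.58.166 80 0' |
    size_hist 500
```

For these four rows, three lengths (0, 289, 0) fall in the bin starting at 0 and one (585) in the bin starting at 500.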
2.5.4 Overall Traffic Rate and Volume
The x axis is the hour. The temporal behavior of the traffic rate (packet count) and of the volume (sum of packet sizes) match.
The four peaks should correspond to the periods of intensive Internet use each day.
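For reference, the hourly rate and volume can be sketched in awk as well. This assumes the 9-field normal_tcp.csv layout and, like get_hours in Appendix A, treats every change of the hour field as the start of a new hourly bucket (the records carry no date). The name hourly_rate_volume is hypothetical.

```shell
#!/bin/sh
# Per-hour traffic rate (packet count) and volume (sum of field 9).
# Prints "bucket_index hour rate volume"; a new bucket starts whenever
# the hour field changes. Hypothetical helper.
hourly_rate_volume() {
    awk '
        NR == 1 || $1 != hour {
            if (NR > 1) print idx, hour, rate, volume
            idx++; hour = $1; rate = 0; volume = 0
        }
        { rate++; volume += $9 }
        END { if (NR > 0) print idx, hour, rate, volume }'
}

# Demonstration on the four sample rows of normal_tcp.csv (all in hour 05).
printf '%s\n' \
    '05 59 59 815133 160.39.187.145 33390 72.32.58.166 80 585' \
    '05 59 59 819793 72.32.58.166 80 160.39.187.145 33393 0' \
    '05 59 59 820414 72.32.58.166 80 160.39.187.145 33394 289' \
    '05 59 59 820427 160.39.187.145 33394 72.32.58.166 80 0' |
    hourly_rate_volume
```

The four sample rows form a single bucket with 4 packets and a volume of 585 + 0 + 289 + 0 = 874.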
3. Inbound Traffic
Perform the same analysis as in section 2 on the inbound data only, selected by keeping the rows whose 7th field (destIP) equals 160.39.187.145 (my IP).
in.sh
awk '$7=="160.39.187.145"' < normal_tcp.csv > in.txt
cat in.txt | awk '{pp[$6"_"$8]++} END {for (i in pp) print pp[i],i}' | sort -n -r > pp_bi.csv
3.1 Inbound Port Pair Table
3.2 Inbound Port Pair Diagram
# of port pairs # of packets
Ports 80 and 443 stand out. Blue bars (for inbound) dominate.
3.3 Inbound Packet Size Distribution
Histograms of the total and per-day statistics are shown. This inbound histogram is very different from the overall histogram: there are no large packets, and most inbound packets have a length of exactly 1387 or 73.
3.4 Inbound Traffic Rate and Volume
The x axis is the hour. The temporal behavior of the traffic rate (packet count) and of the volume (sum of packet sizes) match, similar to the overall diagram in section 2.5.4.
4. Outbound Traffic
Perform the same analysis as in section 2 on the outbound data only, selected by keeping the rows whose 5th field (srcIP) equals 160.39.187.145 (my IP).
out.sh
awk '$5=="160.39.187.145"' < normal_tcp.csv > out.txt
cat out.txt | awk '{pp[$6"_"$8]++} END {for (i in pp) print pp[i],i}' | sort -n -r > pp_bi.csv
4.1 Outbound Port Pair Table
4.2 Outbound Port Pair Diagram
# of port pairs # of packets
Ports 80 and 443 stand out. Red bars (for outbound) dominate.
4.3 Outbound Packet Size Distribution
Histograms of the total and per-day statistics are shown. This outbound histogram is similar to the overall histogram, so the larger packets come from outbound traffic.
4.4 Outbound Traffic Rate and Volume
The x axis is the hour. The temporal behavior of the traffic rate (packet count) and of the volume (sum of packet sizes) do not match. This is consistent with the uneven outbound packet size distribution: some large packets are going out.
5. Anomaly Detector
Most of my traffic is on ports 80 and 443, and very few connections use other services, so my model is based on the distribution of the per-minute packet count for port pairs that involve neither 80 nor 443.
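The script that produced the per-minute counts is not listed in this report. The following is a minimal sketch of how such counts could be derived from normal_tcp.csv, assuming the 9-field layout from section 2.3 and the "HHMM count" output layout of Appendix B. The function name minute_counts is hypothetical, and minutes from different days that share the same HHMM key are merged in this sketch.

```shell
#!/bin/sh
# Count packets per minute for port pairs in which neither the source port
# (field 6) nor the destination port (field 8) is 80 or 443.
# Prints "HHMM count" in no particular order. Hypothetical helper.
# usage: minute_counts < normal_tcp.csv > no80mincount.txt
minute_counts() {
    awk '$6 != 80 && $6 != 443 && $8 != 80 && $8 != 443 { c[$1 $2]++ }
         END { for (k in c) print k, c[k] }'
}

# Demonstration: the first row is a sample row from section 2.3 (port 80, so
# it is skipped); the other two rows are made up for illustration.
printf '%s\n' \
    '05 59 59 815133 160.39.187.145 33390 72.32.58.166 80 585' \
    '05 59 59 900000 160.39.187.145 59126 10.0.0.1 38133 10' \
    '06 00 02 000001 10.0.0.1 1935 160.39.187.145 57089 10' |
    minute_counts | sort
```

The output file name no80mincount.txt in the usage comment is the one read by model.m in Appendix A; whether the original file was generated this way is an assumption.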
5.1 Distribution of Packets # per Minute for non 80, 443 Ports
The distribution is generated using the data in Appendix B.
Most counts are below 500 per minute. The mean and standard deviation are:
m_m =
78.3352
m_std =
440.7354
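The two parameters come from MATLAB's mean and std (see model.m in Appendix A). As a hedged sketch, the same statistics can be computed from an "HHMM count" file with awk. The sketch uses the sample standard deviation (division by n-1), which is also MATLAB's default for std; the function name mean_std is hypothetical.

```shell
#!/bin/sh
# Mean and sample standard deviation of the second column ("HHMM count" rows).
# Prints "mean std" with four decimals. Hypothetical helper.
# usage: mean_std < no80mincount.txt
mean_std() {
    awk '{ n++; s += $2; ss += $2 * $2 }
         END {
             if (n < 2) exit 1                        # std needs at least two samples
             mean = s / n
             var = (ss - n * mean * mean) / (n - 1)   # sample variance (n-1 divisor)
             if (var < 0) var = 0                     # guard against rounding error
             printf "%.4f %.4f\n", mean, sqrt(var)
         }'
}

# Demonstration on three made-up rows: counts 1, 2, 3 give mean 2 and std 1.
printf '%s\n' '0001 1' '0002 2' '0003 3' | mean_std
```

The sum-of-squares formula is fine for counts of this magnitude; for much larger values a two-pass computation would be numerically safer.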
6. Testing Detector
6.1 Normal and Scan Data Set
I performed zenmap scans; the observed data is in Appendix C.
Plot of the probability density function (pdf) value for the normal data set and the scan data set. The x axis is the one-minute sample index and the y axis is the pdf value. There is a significant difference between the two data sets: the normal data set has much larger pdf values.
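To make the pdf values concrete, here is a minimal awk sketch of the Gaussian density f(x) = exp(-(x - m)^2 / (2 s^2)) / (s sqrt(2 pi)) applied to "HHMM count" rows, together with the fraction of rows whose pdf falls below a threshold (the same p < threshold rule used by calc_roc in Appendix A). The function names gauss_pdf and flag_rate are hypothetical, and the cutoff of -700 on the exponent is an assumption used only to avoid floating-point underflow.

```shell
#!/bin/sh
# Gaussian pdf of the second column. Prints "count pdf". Hypothetical helper.
# usage: gauss_pdf MEAN STD < counts.txt
gauss_pdf() {
    awk -v m="$1" -v s="$2" '
        BEGIN { pi = atan2(0, -1) }
        {
            e = -((($2 - m) ^ 2) / (2 * s * s))
            # exp() underflows to 0 for very negative exponents; return 0 directly
            pdf = (e < -700) ? 0 : exp(e) / (s * sqrt(2 * pi))
            printf "%s %.6e\n", $2, pdf
        }'
}

# Fraction of rows whose pdf is below THRESHOLD (i.e. flagged as anomalous).
# usage: flag_rate THRESHOLD MEAN STD < counts.txt
flag_rate() {
    gauss_pdf "$2" "$3" |
        awk -v t="$1" '($2 + 0) < (t + 0) { f++ } END { if (NR) printf "%.3f\n", f / NR }'
}

# Demonstration with the fitted parameters from section 5.1 and two scan
# samples taken from Appendix C.
printf '%s\n' '1730 22965' '1730 1286' | gauss_pdf 78.3352 440.7354
```

Applied to the scan data, flag_rate gives the true positive rate for a given threshold; applied to the normal data it gives the false positive rate.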
6.2 ROC Diagram
The ROC diagram is plotted by varying the pdf threshold used to classify the two data sets. The curve approaches 1 very quickly because the two data sets are significantly different.
Reference
1. List of TCP and UDP port numbers, http://en.wikipedia.org/wiki/List_of_TCP_and_UDP_port_numbers
2. Ephemeral port, http://en.wikipedia.org/wiki/Ephemeral_port
Appendix A: MATLAB Code
%% model.m, Main entry point of program
prepare_data;
% 1    2   3   4     5       6      7        8
% hour min sec srcIP srcPort destIP destPort length
%% hour - day
plot(n(:,1));
%%
plot_rate_volume(n,201,'Overall');
plot_rate_volume(n_in,202,'Inbound');
plot_rate_volume(n_out,203, 'Outbound');
%%
plot_packet_size;
%%
ports = [21,22,90,110,139,143,993,1080,1935];
for i=1:length(ports)
    inds{i} = filter_port(n,ports(i), 0);
end
%% 80, 443 distribution
a80t{1} = dlmread('80_.txt');a80t{2} = dlmread('_80.txt');
a80t{3} = dlmread('443_.txt');a80t{4} = dlmread('_443.txt');
a80t_title={'80 > *','* > 80','443 > *','* > 443'};
figure(444);
for i=1:4
    subplot(2,2,i);hist(a80t{i});title(a80t_title{i});
end
%% no 80 443 mincount
mc = dlmread('no80mincount.txt',' ');m = mc(:,2);hist(m);
% distribution parameter
m_m = mean(m);
m_std = std(m);
%
scan_m=[
1730 22965
1731 5660
1727 8625
1729 2733
1730 1286
1731 20501
1732 16398
1728 1772
1986 1933
1987 1949
1988 2037
1989 19074
1985 28355
1987 6268
1988 21278
1989 7087
2207 16222
2203 26098
2205 7943
2206 9542
2207 19419
2208 16317
2204 1315
2334 8942
2335 116879
2336 229549
2337 109804
2333 1315
];
%
p_scan = pdf('norm', scan_m(:,2) ,m_m ,m_std);
p_m = pdf('norm', m ,m_m ,m_std);
%%
subplot(211);plot(p_m);title('normal data pdf');
subplot(212);plot(p_scan);title('scan data pdf');
%% RoC
p = 10.^([-8,-7.5,-7,-6.5,-6,-5.8,-5.5,-5,-4,-3,-2,-1]);x=[];y=[];
for i=1:length(p)
    [TP, TN, FP, FN] = calc_roc (p(i),p_m,p_scan);
    TPR = TP / (TP + FN);FPR = FP / (FP + TN);
    x=[x;FPR];y=[y;TPR];
end
plot(x,y,'.-');title('ROC');xlabel('FPR');ylabel('TPR');

% A sample is classified as a scan when its pdf value is below the threshold p
function [TP,TN,FP,FN]=calc_roc(p,p_m,p_scan)
TP=sum(p_scan<p);
FN=sum(p_scan>=p);
FP=sum(p_m<p);
TN=sum(p_m>=p);
end
%% prepare_data.m, read n
%clear;n = dlmread('data.csv', ' ');save;
% 1    2   3   4     5       6      7        8
% hour min sec srcIP srcPort destIP destPort length
%% load n
clear;
load;
%%
n_out = n(find(n(:,4)),:);
n_in = n(find(n(:,6)),:);
split = find_split(n);
split_in = find_split(n_in);
split_out = find_split(n_out);
%% dn
for i=1:5
    dn{i}=get_data_from_day(n,i,split);
    dn_in{i}=get_data_from_day(n_in,i,split_in);
    dn_out{i}=get_data_from_day(n_out,i,split_out);
end

%% plot_packet_size.m, Packet Size Distribution
%% n
figure(101);
subplot(6,1,1);hist(n(:,8));title('Overall total','Color','b');
%
for i=1:5
    subplot(6,1,i+1);hist(dn{i}(:,8));title(['day ',num2str(i)]);
end
%% n_in
figure(102);
subplot(6,1,1);hist(n_in(:,8));title('Inbound total','Color','b');
%
for i=1:5
    subplot(6,1,i+1);hist(dn_in{i}(:,8));title(['day ',num2str(i)]);
end
%% n_out
figure(103);
subplot(6,1,1);hist(n_out(:,8));title('Outbound total','Color','b');
%
for i=1:5
    subplot(6,1,i+1);hist(dn_out{i}(:,8));title(['day ',num2str(i)]);
end
%% 6 * 3
figure(1000);
subplot(6,3,1);hist(n(:,8));title('Overall total','Color','b');
subplot(6,3,2);hist(n_in(:,8));title('Inbound total','Color','b');
subplot(6,3,3);hist(n_out(:,8));title('Outbound total','Color','b');
%
for i=1:5
    subplot(6,3,3*i+1);hist(dn{i}(:,8));title(['day ',num2str(i)]);
    subplot(6,3,3*i+2);hist(dn_in{i}(:,8));title(['day ',num2str(i)]);
    subplot(6,3,3*i+3);hist(dn_out{i}(:,8));title(['day ',num2str(i)]);
end
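function split = find_split(n)
split = zeros(6,1);
split(1) = 0;
split(6) = length(n);
j=0;
% stop one row early so that n(i+1,1) never indexes past the last row
for i=1:size(n,1)-1
    % a day boundary is where hour 23 is followed by hour 0
    if(n(i,1)==23 && n(i+1,1)==0)
        j=j+1;
        split(j+1)= i;
    end
end
end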
function data = get_data_from_day(n,day,split)
data = n(split(day)+1:split(day+1),:);
end
%% plot_rate_volume.m, Rate and Volume
function plot_rate_volume(n,g,t)
[h,s] = get_hours(n);
for i=1:length(h)
    data{i} = get_data_from_split(n,i,s);
    rate(i) = length(data{i});
    volume(i) = sum(data{i}(:,8));
end
%%
figure(g)
subplot(2,1,1);plot(1:length(h),rate);title([t,' Rate (hourly packet #)']);
subplot(2,1,2);plot(1:length(h),volume);title([t,' Volume (hourly packet size total)']);
end

function [hours, split]= get_hours(n)
hours(1) = n(1,1);
split(1) = 0;
for i=1:length(n)
    read_next = n(i,1);
    if hours(length(hours)) ~= read_next
        hours = [hours; read_next];
        split = [split; i-1];
    end
end
split=[split;length(n)];
end
Appendix B: Non 80,443 Port per Minute Packet Count on Normal Data
First column is the hour concatenated with the minute; second column is the packet count in that minute. This data is used to fit the detector's Gaussian distribution (mean and standard deviation).
1614 139
0053 74
0742 18
1630 30
0055 86
0743 13
1619 28
0745 40
1633 10
1406 6
1847 4
1848 1
1409 2
2021 11
0110 21
2237 1
2024 4
2252 2
2025 1
1655 2
1902 6
0113 2
0555 2
0342 19
0803 22
1905 2
0804 2
0116 9
0117 11
1659 83
0130 43
0131 64
0118 44
0806 30
0346 119
0119 60
0821 2
0133 130
0808 122
0134 46
1710 369
0135 38
1711 491
1250 2
0824 2
0136 39
1928 2
1731 184
1945 2
1719 2
0631 2
0405 6
2331 74
2319 2
0420 441
1735 6
1736 3
0408 2
2121 2
1750 4
1751 6
0651 2
2351 15
0440 1
1314 6
2140 2
2141 3
1758 2
2143 2
1759 6
1332 2
0446 28
2146 3
2147 8
0449 16
0236 1
0237 4
0238 4
1352 4
0927 2
1129 2
0502 18
1603 2
2201 7
2203 60
0505 235
0507 50
2206 6
1835 4
0508 35
0509 5
1838 6
1625 154
0753 11
0101 2
0102 53
1418 2
0103 2
1859 144
2017 1
1647 2
2031 3
2018 4
2032 2
0106 2
0120 84
2034 4
0335 58
0121 78
0109 6
2035 3
0122 65
0810 11
0123 22
0124 100
0125 38
2052 5
0814 2
0126 28
1701 2
0600 38
0354 1
0127 106
0815 4
0601 135
0128 2
0602 54
0142 2
0129 78
0357 2
0143 2
0604 54
2304 2
0144 18
0833 4
0359 110
1724 33
2323 33
0411 151
1727 3
1513 2
0413 52
1956 2
1515 2
0641 6
0414 127
2327 11
0415 422
2328 32
0416 89
2116 5
0418 267
2130 2
0419 106
1306 2
2132 7
0434 1661
1749 6
2133 34
1535 4
0649 2
0435 5
2134 2
0450 3
0451 5
1328 2
0240 54
0915 2
1053 6
0702 2
0241 3
0703 1
2156 5674
1805 4
0459 277
1820 82
0245 7
0720 2
1058 2
0246 28
1059 2
1349 2
Appendix C: Non 80,443 Port per Minute Packet Count on Scan Data
First column is the hour concatenated with the minute; second column is the packet count in that minute.
1730 22965
1731 5660
1727 8625
1729 2733
1730 1286
1731 20501
1732 16398
1728 1772
1986 1933
1987 1949
1988 2037
1989 19074
1985 28355
1987 6268
1988 21278
1989 7087
2207 16222
2203 26098
2205 7943
2206 9542
2207 19419
2208 16317
2204 1315
2334 8942
2335 116879
2336 229549
2337 109804
2333 1315