J. Cent. South Univ. Technol. (2011) 18: 2061-2067
DOI: 10.1007/s11771-011-0943-8
An early recognition algorithm for BitTorrent traffic based on
improved K-means
RONG Hui-gui(荣辉桂), LI Ming-wei(李明伟), CAI Li-jun(蔡立军)
School of Information Science and Engineering, Hunan University, Changsha 410082, China
? Central South University Press and Springer-Verlag Berlin Heidelberg 2011
Abstract:
In response to the deficiencies of BitTorrent, the concept of density radius was proposed, and the distance from the maximum point of radius density to cluster center as a cluster radius was taken to solve the too large cluster radius resulted from the discrete points and to reduce the false positive rate of early recognition algorithms. Simulation results show that in the actual network environment, the improved algorithm, compared with K-means, will reduce the false positive rate of early identification algorithm from 6.3% to 0.9% and has a higher operational efficiency.
Key words:
traffic identification; early recognition algorithm; cluster radius; false positive/negative rate;
1 Introduction
Recently, the network has greatly promoted the development and popularity of peer-to-peer (P2P) business. According to the research report of Sandine and Cachelogic company [1-2], P2P applications have become the killer applications of network and P2P traffic has also become an important component of internet traffic. Unlike traditional C/S network service model, P2P applications can be more rapid and reliable for delivery of documents and information. At the same time, they also bring great pressure to network operators. Studies show that the characteristics of traditional network traffic have changed due to the popularity of P2P, such as the network traffic hosted by backbone connection shifts from the previous WWW application traffic to P2P traffic [3-4]. So, the popularity of P2P applications gives internet service providers (ISPs) tremendous pressure [5]. BitTorrent (BT), as a typical P2P application, was invented by COHEN in 2002. The current BT users worldwide have reached 45 million. The latest research shows that BT traffic has accounted for more than 50% of P2P traffic by 2007 [2]. Therefore, it is an urgent need for studying P2P applications, especially for analyzing and studying BT behavioral characteristics in large-scale internet and the impact of the breadth and depth on Internet, analyzing the causes of the impact, improving and optimizing the protocol and its application mode, then reducing the negative impacts on the network. The successful identification of the BT network traffic is the basis of these studies, and is of great significance. So, this becomes a hot issue in domestic and international research currently.
Currently, the widely used payload-based identification methods of BT traffic [6-8] have a higher identification accuracy. However, the biggest problem is that it is unable to recognize the encrypted packet. P2P traffic identification methods based on transport layer [9-10] may overcome the mentioned insufficiency of the payload-based identification methods, while they can only distinguish P2P and non-P2P protocols, and are unable to distinguish all P2P protocols, which mean that these methods cannot recognize a single P2P protocol. So, it cannot be used for a single BT protocol identification. Some traffic identification methods based on machine learning [11-14] may get some indicators, such as the duration time, the number of packets, the average packet size, and the average arrival time. By studying the traffic characteristics, the cluster algorithms for data classification were used to achieve the purpose of traffic identification. Nevertheless, these methods may identify indicators of one flow only after the data flow is finished. Therefore, they are not be used for online traffic identification.
To compensate the shortcomings of the above methods, BERNAILLE [15-17] took the sizes and directions of the first few packets in a flow as the identification indicators, and proposed a type of early identification methods. K-means algorithm, as a simple and efficient early identification method, has been widely used. This algorithm identifies the online traffic effectively by applying machine learning principles to traffic identification at the beginning stage of the specified flow.
2 Problem formulations
Early identification methods are based on data clustering. Cluster analysis is based on the similarity, and there are more similarities in the modes of same cluster than in those of not being in the same cluster. K-means algorithm is simple and has high efficiency when used for the early identification process of network traffic, while there are also some deficiencies.
2.1 K-means algorithm
K-means algorithm is widely used as a kind of clustering algorithm, in which Euclidean distance function is often used as clustering criteria function. The Euclidean distance function formula is
(1)
where D is the Euclidean distance; n is the number of data; s is the number of dimensions; Xij is described as the j dimensionality of the i-th data; Cj is the j dimensionality of cluster center.
K-means algorithm procedure is described as follows:
1) Randomly select K data objects from set S according to a selection algorithm as the initial cluster center;
2) Calculate Euclidean distances from the remaining data objects in S to the cluster center, and join them to the cluster with the shortest European distance;
3) Recalculate the medium value for all data objects in each cluster, and use the mean point of data objects as a new cluster center;
4) Repeat steps 2) and 3), until each cluster center will not change.
The sketch map of K-means algorithm is shown in Fig.1. Figure 1(a) illustrates the data objects before clustering. Initially, three random points are selected as the cluster center. Calculate the distances from each point to the three cluster centers and join them to the nearest cluster. After first clustering, the data objects are shown in Fig.1(b), where the three coils represent three clusters, and the arrows denote initial three randomly selected cluster centers.
After first clustering, the new cluster center was calculated, as shown in Fig.1(c), where the squares are described as new cluster centers. Subsequently, recalculate the distances from each point to the new cluster centers and join them to the nearest cluster. The second clustering result is shown in Fig.1(c), where three coils represent the three clusters.
Fig.1 Sketch map of K-means algorithm: (a) Clustering before; (b) First clustering; (c) Clustering after recalculating cluster center; (d) Final cluster center after clustering
After repeated clustering, the cluster center will no longer change and the cluster objects also keep stable, as shown in Fig.1(d).
It is clear that K-means algorithm is easily implemented and has a less computing cost due to the quick calculating Euclidean distance supported by a cluster structure. Secondly, the cluster generated according to the algorithm is of good quality and the association between cluster data is higher. Finally, the learning time is short due to the algorithm simplicity.
2.2 Limitations of K-means algorithm
Although K-means algorithm is simple and has a high efficiency, there exist many deficiencies when this algorithm is used as an early identification method of BT traffic. K-means algorithm requires that the user must first give the number of clusters to be generated. However, the cluster division is not easy, because different cluster numbers have a great influence on clustering effect and identification accuracy. Secondly, since the initialization of cluster centers is randomly selected, this will make the clustering process very unstable, which means that sometimes the initial points selected are relatively even, and sometimes the selected points are relatively isolated, resulting in many iterations for final clustering. Moreover, the cluster division is unstable after each iteration.
At the learning stage of early recognition algorithm, the first five packets of each flow are treated as a data object for clustering after filtering out the transimission control protocol (TCP) control packets, as far as the off-line flows are concerned. When K-means algorithm is used for clustering, some discrete points often appear in the case of no uniform size of packets in network. In K-means algorithm, the value of cluster radius is the distance from the cluster center to the outermost point, and the cluster radius will be very large for those clusters with discrete points, as shown in Fig.2.
Fig.2 Cluster of K-means algorithm with discrete points
From Fig.2, most of the data in this cluster is concentrated in a small circle. The arrows indicate the location of the cluster center. However, in the cluster, there exists a discrete point, which makes the cluster radius the distance from the discrete point to the cluster center point. In this case, when the cluster in the network is used for BT traffic identification, other protocol objects will have a larger probability to be grouped into this cluster, which increases the erroneous judgment and the false positive.
Non-BT traffic being identified mistakenly as BT traffic is called false positive (FP), and BT traffic being identified mistakenly as non-BT traffic is called false negative (FN). K denotes the total non-BT traffic; K1 denotes the total traffic that non-BT traffic is identified mistakenly as BT traffic. L1 denotes the total traffic that BT traffic is identified mistakenly as non-BT traffic and L denotes the total practical BT traffic. Then, the formulas of FP and FN are
VFP=K1/K (2)
VFN=L1/L (3)
where VFP and VFN denote the values of false positive and false negative, respectively.
In the early recognition process of network traffic, the lower FP and FN are very important for the identification correctness.
As seen from Table 1, among the 84 objects in the cluster, the distance between the vast majority of objects and the cluster center is 90, and only one point to the cluster center distance is 6 570. If the point does not exist, the cluster radius should be 90, and any objects with a distance larger than 90 should not belong to this cluster. While the cluster radius is increased by two orders of magnitude because of the existence of the discrete point, and the cluster radius turns into 6 570. So, the cluster objects with a distance from 90 to 6 570 will be grouped into this cluster, which greatly increases the false positive.
Table 1 Distance from cluster objects to cluster center
According to the shortages of K-means algorithm, this work improves the K-means algorithm and presents a new method for determining the cluster radius.
3 Improved algorithm of K-means
The above analysis shows that K-means algorithm will generate a greater false positive when it is applied to the early reorganization process of BT traffic, due to the presence of discrete points. This work improves K-means algorithm by proposing the concept of radius density and presents a better method of determining the cluster radius for further reducing the false positives of network traffic early identification.
3.1 Implementation of improved algorithm
As for traffic identification, an appropriate cluster radius should meet the following criteria:
1) Cluster radius should cover the majority of the traffic, which is usually over 90% of the traffic.
2) Objects within the clusters radius should be more concentrated, and the situation should be avoided as much as possible, as shown in Fig.2.
As for the clustering operation of traffic recognition algorithm, a data object represents a data flow, and the object properties are described as some indicators of a data flow, and these indicators refer to the size of top five packets except TCP control packets and direction. The sizes of the five packets denote the directions with plus or minus, which constitute a five dimensional property of one data object. In order to meet the two criteria mentioned earlier, this work proposes the concept of radius density that is the size of dada packets in unit length along the radius direction. Assuming that all data objects within the cluster are ordered from least to largest according to the distance to the cluster center, CN denotes the radius distance from the first N data objects to the cluster center, Qi denotes the packet size of the first i data objects flow, and DN represents the radius distance from the first N data objects to the cluster center, then the cluster radius density may be defined as
(4)
That is to say, the radius density of the N-th point represents the size of all data object flows in a circle with a radius distance from the N-th point to the cluster center, and also describes the density from the N-th point to the cluster center. In a word, it reflects the clustering degree of all data objects.[微软用户1]
Then, corresponding to the above mentioned two standards, regulations are set as follows:
1) Cluster radius should cover the vast majority of the traffic, and specifically should be over 80%.
2) The cluster radius is the largest point that satisfies the first point distance in cluster.
So, the improved algorithm of cluster radius selection is as follows:
1) Assume that there are K objects in the cluster and they are ranged from the near to the distance away from the cluster center.
2) The total traffic in the cluster is described as Qtotal.
3) For the first N data objects, get the statistical
value and then calculate the ratio of total traffic:
(5)
4) When RN>80%, calculate the radius density CN of first N data objects.
5) From the first N data objects, calculate CX in turn (X=N, N+1, N+2, …, K).
6) Select the distance from the maximum point CX (X=N, N+1, N+2, …, K) to cluster center as the cluster radius.
Figure 3 shows that the discrete points are no longer included in the cluster and then the cluster radius decreases greatly, so the false positive reduces greatly.
3.2 Off-line learning stages
Traffic early identification methods based on machine learning identify the traffic to a great extent according to the sizes and directions of the first few packets of each data flow. Early identification algorithms are divided into two stages: off-line learning stage and online identification stage.
Fig.3 Improved effect of cluster radius selection
In off-line learning stage, the early identification algorithm uses the improved K-means algorithm to study the off-line BitTorrent data that are divided into multiple clusters, and then multiple cluster centers and cluster radius appear. The specific implementation may refer to Refs.[14-15] for determining the following indicators:
1) The number of the first few packets in a flow is five.
2) The data payload of each flow defines a direction attributed with plus or minus. For instance, the payload of the packet from source to destination is positive, and that of the packet from destination to source is negative.
3) The cluster number is determined to be 25.
This off-line learning process is shown in Fig.4.
Fig.4 Schematic diagram of off-line learning stage
It can be seen from Fig.4 that, in off-line learning stage, off-line packets are as inputs, and then the TCP control packets (three TCP handshake packets, SYN packets, empty ACK packets, etc) will be filtered out by the packet analyzer. As for other packets, we first find the flow information table. If there exist flow records of the packet, then add the flow size and packets number to the flow information table. In addition, if the packet resides in the first five packets, we also record the direction of the packet in data sheet. If the flow information table does not exist, then create a flow record and record the five tuples (source IP, purpose IP, source port, destination port, protocol number), the sizes and the number of the flow. Here, the two directions of one connection are denoted as plus or minus of the packet size, where the launch direction denotes the positive direction. The five tuples of the flow only record flow information in the positive direction, and the flow in negative direction is also described as a five-tuple flow. When a certain connection including the two directions has collected five packets, the packet analyzer will send the payload size of first five packets to the packet counter. First five packets and payload size of each connection are stored in the packet counter. When all the packets in the off-line data collection finish inputting, packet counter will dispatch the information to the improved K-means algorithm analyzer. The analyzer will get all the parameters including each parameter definition, dimensions, the number of packets and clusters, and then will obtain the sum of payload absolute value of all packets in a flow from the flow information table. With the above information, the data from the packet counter are clustered, finally the clustering results are dispatched to clustering results counter, which will save all the results.
In particular, because many flows in data sets may not be captured at the beginning, which may generate some irregular clusters, for example, first five data sizes of one BT flow are 68, -1 268, 1 228, 6, -4. Obviously, the second and the third packet are not any kind of control packets of BT flows, and they are most likely data packets, and the existence of such packets will greatly increase the false positive rate. In fact, if the early recognition algorithm including these cluster rules is used for the actual traffic identification, the false positive rate can reach a staggering 9.53%. Thus, it is necessary to process such a cluster.
As for the traditional K-means algorithm, these clusters are excluded by manual type, and for K-means++ algorithms, the cluster radius is usually set to be 0. In fact, this cluster contains a small amount of data, generally only tens of kb. The original cluster radius is often greater than the data covered by cluster radius; at the same time, such clusters are few. Therefore, the cluster radius is determined by calculating the radius density and comparing the cluster radius with the amount of data contained in the cluster. If the former is larger than the latter, the cluster radius will be set to be 0.
Here, the storage for cluster centers and cluster radius by clustering results counter is implemented by XML structure by setting corresponding tags in XML document. It is convenient to realize the storage for cluster centers and cluster radius and acquire cluster centers and cluster radius in online identification stage.
3.3 Online learning stages
By off-line learning, the size characteristics of first five packets are indicated by 25 cluster centers and corresponding cluster radius. Thus, in online identification stage, we may determine whether one flow is BT flow through calculating first five packet sizes of a flow and 25 cluster centers, then comparing with the corresponding cluster radius.
Figure 5 describes the process of the online identification stage.
Fig.5 Schematic diagram of online identification stage
It can be seen from Fig.5 that, in online identification stage, the packets captured online are treated as inputs, then the TCP control packets (three TCP handshake packets, SYN packets, empty ACK packets, etc) will be filtered out by the packet analyzer, and other packets are sent to the flow recorder. Just like off-line analysis, after the flow recorder gets the packets from packets analyzer, we first find the flow information table. If there exist flow records of the packet, then we add the flow size and packets number to the flow information table. In addition, if the packet resides in first five packets, we also record the direction of the packet in information sheet. If there is no flow record in flow information table, then we create a flow record and record the five tuples (source IP, purpose IP, source port, destination port, protocol number), the sizes and the number of the flow. Here, the two directions of one connection are denoted as plus or minus of the packet size, where the launch direction denotes the positive direction. The five tuples of the stream only record flow information in the positive direction, and the negative direction flow is also described as a five-tuple flow. When the flow recorder has collected first five packets of one connection, it will send them to BT protocol judgment device. Then, BT protocol judgment device takes first five packet sizes of this connection as units of analysis, reads 25 cluster centers and corresponding cluster radius from the cluster information table, calculates the Euclidean distances from this object to each cluster center and carries out a comparison. If Euclidean distance away from one cluster center is in the range of cluster radius, this object should belong to this cluster, and then the connection is BT flow. When the connection object does not belong to any cluster, the connection is non-BT flow. Thus, BT protocol judgment device marks with labels on each analyzed connection that records whether a connection is BT protocol data in the flow information table. By this, we may successfully distinguish the online BT and non-BT flows.
4 Simulations
In this section, a comparison is carried out between BT traffic early identification algorithm (ECKM) based on improved K-means algorithm and BT traffic early identification algorithm (EKM) based on normal K-means algorithm, and the accuracy and effect are compared by using the false positive and false negative descriptions.
4.1 Description of data collection
Seven data sets are used in the experiment and they are labeled as T1, T2, T3, T4, T5, T6 and D1. Among them, T1, T2 and T3 are collected in a controlled environment without BT flows, and the three data sets are used to measure the false positive. T4, T5 and T6 are used to test false negative and they are also collected in a controlled environment with only BT traffic. D1 is collected in the exit router in the Hunan University, and it includes a variety of protocols.
The specific data sets are shown in Table 2.
Table 2 Description of data sets
4.2 False positive analysis
In order to determine the BT traffic of data set D1, the BT identification technology based on multi-feature string was used to construct a traffic packet analysis tool (PAT). Because the identification accuracy of PAT method is very high, the recognized BT traffic is total BT traffic in D1.
Specifically, in order to get the false positive of EKM in D1, first of all, BT traffic and non-BT traffic are analyzed with X denoting the size of BT traffic, and Y denoting the size of non-BT traffic. Next, the packets are classified by use of EKM with X1 denoting the BT traffic size identified by this method, and Y1 denoting the non-BT traffic size identified by this method. Finally, the data identified as BT traffic with the PAT method are checked, with X2 used for BT traffic size and Y2 the size of non-BT traffic. Thus, according to Eq.(2): K1=Y2, K= Y, then we may calculate the false positive rate in D1 by EKM. Similarly, we may calculate the false positive rate in D1 by ECKM.
Figure 6 shows the false positive rate of the two methods of EKM and ECKM in each data set.
Fig.6 False positives rate of two methods of EKM and ECKM
From Fig.6, it can be seen that as for the five data sets of D1, T1, T2 and T3, the false positive rate of EKM is larger than that of ECKM, and the false positive rate of ECKM algorithm is close to zero. In T1, T2 and T3, the false positive rate of EKM is much higher compared with that of ECKM, which is mainly due to a small amount of data and few kinds of agreement contained in T1, T2, T3. The traffic is mainly from Thunder and Emule traffic that belongs to P2P protocol, whose data have higher similarity to BT data. So, there is a higher probability with erroneous identification. In D1, there exist many types of protocols, and the cluster radius is very large. Relatively speaking, there is a relatively small probability of erroneous identification. However, even in D1, the false positive rate of EKM is still higher and reaches more than 6%, which means more than 60 MB data are treated mistakenly for BT traffic. Therefore, it is reputed that EKM is difficult for effectively distinguishing BT and non-BT flows.
4.3 False negative analysis
The false-negative of EKM in D1 may be calculated by PAT. According to Eq.(3): L1=X1-X2, L=X1, the false negative results may be calculated. Similarly, the false negative results of ECKM may be obtained.
Figure 7 shows the false negative results of EKM and ECKM algorithm. From Fig.7, it can be seen that regardless of the small data sets such as T4, T5 and T6 or time network traffic data set like D1, the false negative rate of EKM is obviously lower than that of ECKM. In addition, the false negative rate of T4, T5 and T6 is even close to zero. In ordinary K-means algorithm, there exists a bigger cluster radius, which makes the BT data flow easily be classified as one cluster, while in the improved K-means algorithm, the decrease of cluster radius makes some BT flows not be classified as a single cluster.
Fig.7 False negative results of EKM and ECKM algorithm
Considering the positive and negative rate together, for small data sets, the false positive rate of EKM is very high and almost all data are identified as BT traffic. Therefore, this cannot distinguish between BT and non-BT traffic. Although there exist some false negative rates in ECKM, they may play a role in identifying BT traffic because of their low false positive rates. As for the actual network data set D1, in a total almost 1 GB data set, EKM algorithm identifies about 67 MB data mistakenly as BT traffic, where these erroneous identification data are the majority of P2P traffic, and filter about 0.1 MB BT data. While ECKM only identifies about 7.5 MB data mistakenly as BT traffic, and filter only less than 1 MB BT data. In sum, EKM is almost not able to distinguish between BT and non-BT traffic in P2P traffic. Although ECKM has a higher false negative rate, it is still in an acceptable range, and its low false positive rate makes a good play to distinguish between BT and non-BT traffic.
In summary, EKM has a higher erroneous judgment rate that reduces its application value, and ECKM reduces the erroneous judgment rate of EKM with a little increase of missing judgment rate. Therefore, integrated ECKM has a more practical application value compared with EKM.
5 Conclusions
1) EKM has a higher erroneous judgment rate and ECKM reduces the erroneous judgment rate of EKM with a little increase of missing judgment rate.
2) In actual network environment, ECKM will reduce the false positive rate from 6.3% to 0.9% and has a higher operational efficiency, compared with K-means algorithm.
References
[1] SEN S, WANG J. Analyzing peer-to-peer trafficacross large networks [J]. IEEE/ACM Transactions on Networking, 2004, 12(2): 219-232.
[2] Ernesto. Bittorrent continues to dominate internet traffic [EB/OL]. 2009-04-20. http://torrentfreak.com/bittorrent-dominates-internet- traffic-070901/.
[3] KARAGIANNIS T, BROIDO A, BROWNLEE N, CLAFFY K, FALOUTSOS M. Is P2P dying or just hiding? [C]// Proceeding of Global Telecommunications Conference 2004. New York: ACM Press, 2004: 1532-1538.
[4] LI Jiang-tao, JIANG Yong-ling. Survey of P2P traffic identification and engineering technology [J]. Telecommunications Science, 2005, 21(3): 57-61. (in Chinese)
[5] MADHUKAR A, WILLIAMSON C. A longitudinal study of P2P traffic classification [C]// Proceedings of the International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems[微软用户2]. Heidelberg: Springer, 2006: 179-188.
[6] SEN S, SPATSCHECK O, WANG D. Accurate, scalable in-network identification of P2P traffic using application signatures [C]// Proceedings of the 13th International Conference on World Wide Web. New York: ACM Press, 2004: 512-521.
[7] LIU Gang, FANG Bin-xing, HU Ming-zeng, ZHANG Hong-li. Capture method and self-similarity assess of bittorrent traffic [J]. Application Research of Computers, 2006, 23(5): 205-206. (in Chinese)
[8] DU Li, YU Jun-xiang, WANG Xiao-jing. A coarse-grained differentiated routing algorithm inmulti-protocol label switching traffic engineering [J]. Journal of Central South University: Science and Technology, 2010, 17(6): 1258-1263. (in Chinese)
[9] ZANDER S, NGUYEN T, ARMITAGE G. Automated traffic classification and application identification using machine learning [C]// Proceedings of the IEEE Conference on Local Computer Networks 30th Anniversary. Sydney: IEEE Society, 2005: 250-257.
[10] EKMAN J, ARLITT M, MAHANTI A. Traffic classification using clustering algorithms [C]// Proceedings of the Special Interest Group on Data Communication[微软用户3]. New York: ACM Press, 2006: 281-286.
[11] KARAGIANNIS T, PAPAGIANNAKI D, FALOUTSOS M. Blinc: Multilevel traffic classification in the dark [C]// Proceedings of the 2005 SIGCOMM Conference on Applications, Technologies, Architectures, and Protocols for Computer Communication. New York: ACM Press, 2005: 229-240.
[12] MCGREGOR A, HALL M, LORIER P, BRUNSKILL J. Flow clustering using machine learning techniques [C]// Proceedings of the Personalised Ambient Monitoring[微软用户4]. Heidelberg: Springer, 2005: 205-214.
[13] JUNIOR G P S, MAIA J E B, HOLANDA R, DE SOUSA J N.P2P Traffic identification using cluster analysis [C]// Proceedings of the International Symposium on Global Information Infrastructure. New York, 2007: 128-133.
[14] LU Gang, ZHANG Hong-Li, YE Lin. P2P traffic identification [J]. Journal of Software, 2011, 22(6): 1281-1298. (in Chinese)
[15] BERNAILLE L, TEIXEIRA R. Early recognition of encrypted applications [C]// Proceedings of the International Symposium on Passive and Active Measurement[微软用户5]. Heidelberg: Springer, 2007: 165-175.
[16] BERNAILLE L, TEIXEIRA R, SALAMATIAN K. Early application identification [C]// Proceedings of the 2006 ACM CoNEXT Conference. New York: ACM Press, 2006: 70-82.
[17] BERNAILLE L, TEIXEIRA R, AKODKENOU I, SOULE A, SALAMATIAN K. Traffic classification on the fly [C]// Proceedings of the Special Interest Group on Data Communication. New York: ACM Press, 2006: 23-26.
(Edited by HE Yun-bin)
Foundation item: Project(2011FJ3034) supported by the Planned Science and Technology Program of Hunan Province, China; Project(61070194) supported by the National Natural Science Foundation of China
Received date: 2011-06-21; Accepted date: 2011-10-17
Corresponding author: CAI Li-jun, Professor, PhD; Tel: +86-731-88664160; E-mail: ljcai@hnu.edu.cn
Abstract: In response to the deficiencies of BitTorrent, the concept of density radius was proposed, and the distance from the maximum point of radius density to cluster center as a cluster radius was taken to solve the too large cluster radius resulted from the discrete points and to reduce the false positive rate of early recognition algorithms. Simulation results show that in the actual network environment, the improved algorithm, compared with K-means, will reduce the false positive rate of early identification algorithm from 6.3% to 0.9% and has a higher operational efficiency.