Fine-Grained Identification for Large-Scale IoT Devices: A Smart Probe-Scheduling Approach Based on Information Feedback

Liang, Chen; Yu, Bo; Xie, Wei; Wang, Baosheng; Peng, Wei

doi:10.3390/app12168335


College of Computer, National University of Defense Technology, Changsha 410073, China

* Author to whom correspondence should be addressed.

Appl. Sci. 2022, 12(16), 8335; https://doi.org/10.3390/app12168335

Submission received: 18 July 2022 / Revised: 12 August 2022 / Accepted: 19 August 2022 / Published: 20 August 2022

Abstract

A large number of IoT devices access the Internet. While enriching our lives, IoT devices bring potential security risks. Device identification is one effective way to mitigate security risks and manage IoT assets. Typical identification algorithms generally separate data capture and target identification into two parts. As a result, it is inefficient and coarse-grained to evaluate the results only once the identification process is complete and then adjust the data capture strategy afterward. To solve this problem, we propose a fine-grained probe-scheduling approach based on information feedback. First, we model the probe surface as three layers for IoT devices and define their relationships. Then, we improve the policy gradient algorithm to optimize the probe policy and generate the optimal probe sequence for the target device. We implement a prototype system and evaluate it on 53,000 IoT devices across various categories to show its wide applicability. The results indicate that our approach can achieve success rates of 96.89%, 93.43%, and 83.71% for device brand, model, and firmware version, respectively, and reduce the identification time by 55.96%.

Keywords:

Internet of Things; fine-grained fingerprints; device identification; reinforcement learning

1. Introduction

With the rapid development of the Internet of Things (IoT), a large number of devices have accessed the Internet. According to Global System for Mobile Communications Association (GSMA) statistics, the number of IoT devices will reach almost 25 billion globally by 2025, up from 10.3 billion in 2018 [1]. While enriching our lives, IoT devices bring potential security risks to cyberspace, such as information leakage, authentication bypass, and lagging firmware upgrades [2]. As one method of cyberspace mapping, device identification can help to mitigate security risks and reduce attack surfaces.

The device fingerprint refers to device features that can be used to identify a device. According to the identification granularity, the device fingerprint includes the device’s brand, type, model, firmware version, and other features. Common cyber search engines, such as Shodan [3] and Censys [4], generally adopt a banner-based approach to identify online IoT devices. While Shodan can reach 95% identification accuracy, the low recall results in coarse identification granularity.

In general, existing identification methods have the following three disadvantages. First, manual identification is time-consuming and insufficiently accurate, making it difficult to establish comprehensive information from the physical layer to the application layer. Second, traversing all the device features increases the communication overhead and may trigger the intrusion detection mechanism [5,6,7]. Third, typical identification algorithms separate data capture and target identification into two parts. It is inefficient and coarse-grained to evaluate the results only after the target identification process is complete and then adjust the data collection strategy.

When identifying large-scale IoT devices, protocol handshake and data transmission (typically milliseconds) consume most of the runtime compared with the identification time (typically microseconds). In the standard reinforcement learning (RL) setting, the agent receives feedback from the environment at every step and chooses an action based on that feedback [8]. Interactions between the agent and the environment allow determining whether the information received is sufficient to complete the identification task and adjust the data collection policy accordingly. In this way, data capture and target identification can be performed simultaneously, avoiding sending additional probe requests. Therefore, a probe-scheduling approach with information feedback can balance the success rate and the communication overhead of device identification.

In this paper, we propose a fine-grained probe-scheduling approach based on information feedback to identify large-scale IoT devices. We aim to improve identification success rates and efficiency for large-scale IoT devices. First, we model the probe surface as three layers for IoT devices and define their relationships. Then, we improve the policy gradient (PG) algorithm to optimize the probe policy and generate the optimal probe sequence for the target device.

We implement a prototype system and evaluate it through real-world experiments to validate our approach. We use the Shodan API [3] to collect response data (open ports, protocol response data, and web feature information) from 53,000 real IoT devices. The dataset covers a wide range of device categories. Thus, our approach has wide applicability, i.e., different types of IoT devices can achieve high identification efficiency and success rates using our approach.

Overall, our contributions are summarized as follows:

We model the probe surface as three layers by analyzing the characteristics of IoT devices and define their sequential relationships.
We propose a fine-grained probe-scheduling approach based on information feedback to achieve high identification efficiency and success rates. Using the improved RL algorithm, we update the identification state dynamically and select the next action with the greatest benefit.
We implement a prototype system and evaluate it on 53,000 IoT devices across various categories. The results show that our approach can achieve success rates of 96.89%, 93.43%, and 83.71% for device brand, model, and firmware version, respectively. Furthermore, our approach reduces the identification time by 55.96% compared with that of the protocol-popularity method.
We have released all data and the analysis script to replicate the results of this work and to encourage further studies: https://github.com/sherlocklchen/real-IoT-device-assets.

The remainder of this paper is organized as follows. Section 2 discusses the related work. Section 3 introduces our motivation. Section 4 describes the framework and algorithm for large-scale IoT devices. Section 5 presents the experimental evaluation. Section 6 discusses the capabilities and limitations of our approach. Finally, Section 7 concludes.

2. Related Work

In network security, IoT device identification has been used for more than two decades, and there are many related works. On the one hand, device identification can help operators sort out the devices running in the network to find information leaked due to configuration errors. On the other hand, vulnerabilities in IoT devices are usually related to the properties of the device (brand, type, model, etc.) and identifying the device correctly will help operators block known vulnerable devices [9] before they do harm to the network. The main solutions to identify devices can be divided into three categories.

According to the identification granularity, a device fingerprint includes the brand, product model, and firmware version. For example, an IoT device is produced by a brand (e.g., Cisco, Sony), has a product model (e.g., ASR-900 or ASA-5520), and several firmware versions (e.g., 1.04, 3.40). With numerous types of IoT devices, it is difficult to enumerate all fingerprints manually. In prior works [10,11,12,13,14,15,16], traditional, traffic-based, and banner-based approaches have been used to discover and manage IoT devices.

2.1. Traditional Detection Methods

Traditional detection methods focus on identifying the operating system by analyzing TCP/IP protocol characteristics. Nmap [17] sends detection packets to the target device [18] and constructs device fingerprints based on the characteristics of the response data. Zmap [19] is a fast single-packet network scanner that can scan the entire public Internet in less than an hour, displaying information about nearly four billion online devices.

Cheng et al. [20] relied on the hardware differences between the CPU modules of different devices to detect and identify them; Park et al. [21] distinguished different devices based on the inherent hardware characteristics of embedded systems; Sanchez-Rola et al. [22] computed a hardware fingerprint by timing the execution of instruction sequences readily available in API functions.

The identification success rate for a limited number of operating systems is acceptable, while the rate will drop significantly for the wide variety of IoT devices.

2.2. Traffic-Based Methods

To identify devices, some researchers have collected and analyzed traffic data. Miettinen et al. [10] used machine learning to distinguish the types of smart devices. The fingerprint is represented by n data packets and 23 binary features (such as packet length, port number, and protocol used by the packet), which can achieve high accuracy. Wang et al. [12] designed a port scanning strategy that combines multiple weak classifiers into multiple classifiers. Each classifier is responsible for analyzing specific port data, which greatly shortens the cycle of device identification and increases the identification accuracy by 46.67%. Yu et al. [11] used Convolutional Neural Networks (CNN) and Long Short-Term Memory Networks (LSTM) to extract and construct the characteristic fingerprints of HTTP and TCP cross-layer data packets to achieve high-precision and fine-grained IoT device identification. In [13,23,24,25,26,27,28,29,30], an inspection of data packets was used to extract device features.

The authors of [10,31,32] proposed mechanisms for analyzing encrypted traffic. The mechanism proposed in IoT Sentinel [10] uses a flow attribute vector of 276 dimensions (12 groups × 23 features), which incurs an excessively high computational cost [33]. The mechanism proposed in [31] requires 49 traffic attributes and 30,000 frames to identify the device, and it takes a long time to capture the traffic.

Although applying machine learning to network traffic analysis improves the automation of device identification, these methods cannot identify the device model, so their granularity remains coarse.

2.3. Banner-Based Methods

The term banner refers to the device attributes contained in the protocol packets, which typically include the device type, brand, and model [16,32,34,35]. Obtaining protocol banners requires first sending probe packets for specific services (i.e., ports) to the target device. If the target device runs the particular service, it will return the response packets containing device information. Since IoT devices run a large number of services, we can obtain a wide variety of banners to improve the accuracy of device identification.

Yu et al. [11] proposed a cross-layer protocol fingerprinting technique for fine-grained device identification. This approach utilized a convolutional neural network (CNN) and a long short-term memory network (LSTM) to extract and construct feature fingerprints. Nevertheless, the proactive identification method increases the identification time because it depends on the network state.

Qiang et al. [15] proposed an approach for generating fine-grained fingerprints based on the subtle differences between the file systems of various firmware images. They leveraged natural language processing to process the file content and the document object model to obtain firmware fingerprints. The recall and precision of the firmware fingerprints exceeded 90%. However, this approach requires an average of 75 HTTP packets to identify the firmware version of a single device, which is inefficient for large-scale device identification.

Xuan et al. [36] proposed a scalable framework for physical device profiling that leverages banner grabbing to identify device types and running services before using clock skew to determine a device ID. Although they used multiple protocols to improve the identification accuracy, the approach only ranks the popularity of application layer protocols to identify device types, which increases the communication costs. By contrast, our method balances the success rate with the communication overhead by scheduling multiple probe methods.

However, there are two shortcomings of banner-based device identification: (1) The target device may not return the corresponding protocol banner after sending a protocol probe packet to the target; (2) The device type can be easily extracted from the banners, but most banners contain incomplete information about device properties, such as brand and model. Combining the various protocol banners supported by the device for device identification will increase the communication and time overhead considerably.

3. Motivation

By observing and analyzing fingerprint features, we model the probe surface for IoT devices as three layers and define their relationships.

3.1. Port Layer

The port layer refers to the open ports, which are associated with the host’s communication protocol and specific service types. Unlike traditional hosts, most IoT devices are made for specific tasks, which can reflect the device fingerprint. In Appendix A, we show the default open ports for the main IoT device manufacturers. For example, webcams need to receive control data and send image data via the network. Therefore, they use the Real Time Streaming Protocol (RTSP) and the Onvif protocol.

Table 1 shows three typical categories of devices: webcam, network printer, and network-attached storage (NAS). Several unique ports are used by only one type of device. For example, Dahua uses port 37777 to run its private protocol. Thus, we can perform coarse-grained identification for IoT devices based on the extracted port features as the first step.

3.2. Protocol-Response Layer

The protocol-response layer contains response data such as HTTP and SSH protocols. Obtaining protocol responses requires sending probe packets to the target device. If the target device opens the particular service, it will return a response packet containing a device fingerprint. For example, as shown in Figure 1, a Cisco device opens the 80_HTTP and 23_Telnet ports. The Telnet protocol response contains information about the device model and firmware version, while the information about the device brand appears in the HTTP protocol response. Therefore, both 80_HTTP and 23_Telnet responses should be captured to obtain the complete device fingerprint.

3.3. Web-Feature Layer

Web features refer to the features of web applications (e.g., special URLs, SSL certificates). IoT devices typically use the Linux-based file system that contains tens of thousands of files. The “WWW” directory contains files that can be accessed through the web. These files can be used to identify the model and firmware version. As shown in Figure 2, the feature URLs and SSL certificate provide information about the device brand and model.

3.4. Dependencies among the Probe Surfaces

There are hierarchical dependencies among the three-layer probe surface. Individual ports or port combinations can be used to identify the device brand but not the model or firmware version. The protocol-response layer must determine whether the port is open before probing the protocol response. Protocol responses can be used to identify the model and firmware version. The web-feature layer must determine whether the target device opens the HTTP/HTTPS protocol port (such as 80 and 443) and runs a web service. Then, web features such as special URLs, page content, and SSL certificates can be further gathered.
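The dependency rule above can be sketched as a simple prerequisite check: a protocol or web-feature probe is only worth sending once the port layer has shown that a suitable port is open. The probe names and the port sets in `PROBE_PREREQS` are hypothetical and chosen for illustration; they are not taken from the prototype.

```python
from typing import Dict, FrozenSet, List, Set

# Hypothetical probe catalogue (an assumption for illustration): each probe
# lists the ports of which at least one must be open before it can be sent.
PROBE_PREREQS: Dict[str, FrozenSet[int]] = {
    "http_banner": frozenset({80, 8080}),
    "https_banner": frozenset({443, 8443}),
    "telnet_banner": frozenset({23}),
    "special_url": frozenset({80, 443, 8080, 8443}),  # web-feature layer
    "ssl_certificate": frozenset({443, 8443}),        # web-feature layer
}


def available_probes(open_ports: Set[int]) -> List[str]:
    """Return the probes whose port prerequisite is satisfied, sorted by name."""
    return sorted(
        name for name, ports in PROBE_PREREQS.items() if ports & open_ports
    )


if __name__ == "__main__":
    # A device exposing only Telnet and plain HTTP.
    print(available_probes({23, 80}))
```

Keeping the prerequisites as data rather than control flow makes it straightforward to add or drop probes for a given deployment.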

Device identification is a dynamic process. We can improve the RL algorithm to update the identification status and obtain the optimal probe sequence. Furthermore, interactions between the agent and the environment enable evaluation of whether the information received is sufficient to complete the identification task and adjust the capture policy accordingly. In this way, data capture and device identification can be performed simultaneously, preventing the need to send additional probe requests.

4. Framework

In this section, we present the framework and algorithm of fine-grained device identification. The framework has three main modules (see Figure 3): probing knowledge base module, data analysis module, and probe-scheduling module.

4.1. Probing Knowledge Base Module

The probing knowledge base consists of several probe plugins and a multidimensional fingerprint library. We use plugins to probe open ports, different protocol responses, and web features (e.g., special URLs, SSL certificates). We construct the fingerprint library automatically and update it regularly based on the structural features of device fingerprints displayed on websites.

The probing knowledge base module supports both the data analysis and the probe-scheduling modules.

4.2. Data Analysis Module

The data analysis module analyzes the response data from the three layers. We show the default open ports for the main IoT device manufacturers in Appendix A. The port-scan algorithm is detailed in Algorithm 1. When scanning open ports, we detect the characteristic ports first. If open, the device brand can be determined directly (e.g., 37777_Dahua, 2020_TP-Link). Then, we detect the characteristic port combinations. If open, the device brand can be determined (e.g., 81,82_Hikvision, 81,21_Axis). Finally, we detect the typical protocols. If the target device runs the typical protocols, the device type can be determined (e.g., RTSP_webcam, IPP_printer).

Algorithm 1: Port Scan Algorithm.

Input: $Obj$: IoT device to be identified;

Variables: $P_{c}$: characteristic ports for $Obj$; $P_{combine}$: characteristic port combinations for $Obj$; $P_{tp}$: typical protocols for $Obj$;

Output: $c$: device category;

Initialise $c = Null$;
if Isopen($Obj$, $P_{c}$) then
    $c$ = Devicebrand($P_{c}$)
else if Isopen($Obj$, $P_{combine}$) then
    $c$ = Devicebrand($P_{combine}$)
else if Isopen($Obj$, $P_{tp}$) then
    $c$ = Devicetype($P_{tp}$)
end
return $c$
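A minimal Python sketch of Algorithm 1 follows. It assumes that the set of open ports has already been collected, so no packets are sent here. The lookup tables contain only the examples mentioned in the text (37777 for Dahua, 2020 for TP-Link, the 81/82 and 81/21 combinations, and the RTSP and IPP ports from Appendix A). The function and table names are hypothetical.

```python
from typing import Dict, FrozenSet, Optional, Set

# Example entries taken from the text; a real fingerprint library would be larger.
CHARACTERISTIC_PORTS: Dict[int, str] = {37777: "Dahua", 2020: "TP-Link"}
PORT_COMBINATIONS: Dict[FrozenSet[int], str] = {
    frozenset({81, 82}): "Hikvision",
    frozenset({81, 21}): "Axis",
}
# Typical protocols, keyed by their default port (RTSP 554, IPP 631).
TYPICAL_PROTOCOL_PORTS: Dict[int, str] = {554: "webcam", 631: "printer"}


def port_scan(open_ports: Set[int]) -> Optional[str]:
    """Return a brand or device type derived from open ports, else None."""
    # Step 1: a single characteristic port determines the brand directly.
    for port, brand in CHARACTERISTIC_PORTS.items():
        if port in open_ports:
            return brand
    # Step 2: a characteristic combination must be open in full.
    for combination, brand in PORT_COMBINATIONS.items():
        if combination <= open_ports:
            return brand
    # Step 3: a typical protocol only reveals the device type.
    for port, device_type in TYPICAL_PROTOCOL_PORTS.items():
        if port in open_ports:
            return device_type
    return None


if __name__ == "__main__":
    print(port_scan({80, 554, 37777}))
    print(port_scan({80, 631}))
```

The checks run from most to least specific, mirroring the if / else if order of the pseudocode, so a device with port 37777 open is labelled by brand even though its RTSP port would also match.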

Figure 4 illustrates how the data analysis module processes the response data from the target device, including protocol responses and web-feature data. First, the detector actively sends probe messages to target devices to obtain different types of responses. Then, we process the response data and generate segmentation lists during the data processing stage. Finally, we obtain the device fingerprint from the segmentation lists using regular-expression and keyword matching.
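The matching step can be illustrated with a short sketch. The three patterns below are hypothetical stand-ins for entries in the fingerprint library, and the sketch matches against the raw response string rather than a segmentation list, which is a simplification.

```python
import re
from typing import Dict, Optional

# Hypothetical patterns (assumptions for illustration, not the paper's library).
FINGERPRINT_PATTERNS = {
    "brand": re.compile(r"\b(cisco|hikvision|dahua|tp-link)\b", re.IGNORECASE),
    "model": re.compile(r"\b(ASR-\d+|ASA-\d+)\b"),
    "firmware": re.compile(
        r"\b(?:version|firmware)[ :]+v?(\d+(?:\.\d+)+)", re.IGNORECASE
    ),
}


def extract_fingerprint(response: str) -> Dict[str, Optional[str]]:
    """Return brand, model, and firmware found in a response (None if absent)."""
    fingerprint: Dict[str, Optional[str]] = {}
    for field, pattern in FINGERPRINT_PATTERNS.items():
        match = pattern.search(response)
        fingerprint[field] = match.group(1) if match else None
    return fingerprint


if __name__ == "__main__":
    print(extract_fingerprint("Cisco ASA-5520 Software Version 3.40"))
```

Fields that come back as `None` indicate which parts of the fingerprint are still missing and therefore which further probes may be worth sending.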

We can flexibly extend and customize the data processing module according to the application scenario. For example, when deployed on edge devices with scarce computational resources, some infrequent probe requests can be removed based on the possible device types in the current environment.

4.3. Probe-Scheduling Module

While identifying device fingerprints, we need to send N probe requests one by one to obtain the device information. The optimal probe sequence for IoT devices can obtain device responses containing more fingerprint information while sending as few probe requests as possible. Different probe methods have different benefits for device identification, and the order in which probe requests are sent can affect the identification benefits dynamically.

We model the scheduling problem as a Markov decision process denoted as $(S, A, r, \gamma, T)$ [8]. The goal is to maximize the expected discounted return:

$$J = E_{\tau} \left[ \sum_{t = 0}^{T - 1} \gamma^{t} r_{t} \right] \quad (1)$$

where $\tau$ is the trajectory $(s_{0}, a_{0}, r_{0}, s_{1}, \dots, s_{T - 1}, a_{T - 1}, r_{T - 1})$ and $r_{t} = r(s_{t}, a_{t})$. The core idea behind the PG algorithm is to obtain the policy gradient $\nabla_{\theta} J$ of the expected discounted return with respect to the policy parameter $\theta$.

$$g_{policy} = E_{\tau} \left[ \nabla_{\theta} \log \pi_{\theta}(a_{\tau} | s_{\tau}) G_{\tau} \right] = E_{\tau} \left[ \nabla_{\theta} \sum_{t = 0}^{T - 1} \log \pi_{\theta}(a_{t} | s_{t}) G_{t} \right] \quad (2)$$

where $G_{t} = \sum_{k = 0}^{\infty} \gamma^{k} r_{t + k}$ denotes the discounted return following time $t$.
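As a small illustration of the return that weights each log-probability term in Equation (2), the sketch below computes the discounted return for every step of a finite episode, treating rewards after the last step as zero. The discount factor 0.5 in the example is an arbitrary assumption; the text does not state the value used in the experiments.

```python
from typing import List


def discounted_returns(rewards: List[float], gamma: float) -> List[float]:
    """Compute G_t = r_t + gamma * G_{t+1} for each step, working backwards."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


if __name__ == "__main__":
    # Three probes: brand found, nothing new, firmware version found.
    print(discounted_returns([10, -5, 100], gamma=0.5))
```

Working backwards avoids recomputing the inner sum for every step, so the cost is linear in the episode length.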

The scheduling algorithm is detailed in Algorithm 2. The state set S refers to the identification state of the device, including the brand, model, and firmware version. The executable action set A includes probe actions such as probing SNMP responses and probing special URLs.

Algorithm 2: Scheduling Algorithm.

The symbol $r$ indicates the immediate reward provided by the change in identification status after sending the probe request. According to the identification granularity and the model convergence in the experiment, we set the reward function $r$. If the device identification state remains unchanged, $r = -5$; if the brand information of the device is added, $r = 10$; if the model information is added, $r = 50$; if the firmware version information is added, $r = 100$.
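The reward rule above can be written as a small function over the identification state before and after a probe. The text does not say how a probe that adds several fields at once is rewarded; summing the individual rewards is an assumption made here, and the dictionary-based state representation is hypothetical.

```python
from typing import Dict, Optional

State = Dict[str, Optional[str]]

# Reward values stated in the text.
FIELD_REWARDS = {"brand": 10, "model": 50, "firmware": 100}
NO_CHANGE_REWARD = -5


def reward(before: State, after: State) -> int:
    """Reward for the change in identification state caused by one probe."""
    gained = [
        field
        for field in FIELD_REWARDS
        if before.get(field) is None and after.get(field) is not None
    ]
    if not gained:
        return NO_CHANGE_REWARD
    # Assumption: rewards for several newly identified fields are summed.
    return sum(FIELD_REWARDS[field] for field in gained)


if __name__ == "__main__":
    print(reward({}, {"brand": "Cisco"}))
    print(reward({"brand": "Cisco"}, {"brand": "Cisco"}))
```

The negative reward for an unchanged state is what pushes the policy away from redundant probes.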

The agent’s task is to learn a strategy $\pi : s \to a$ to choose the next action $a_{t}$ based on the current state $s_{t}$, i.e., $\pi(s_{t}) = a_{t}$. At each discrete time $t$, the agent perceives the current state $s_{t}$ and chooses the current action $a_{t}$ according to $s_{t}$. After obtaining the reward $r_{t} = R(s_{t}, a_{t})$, the agent generates the subsequent state $s_{t + 1} = \delta(s_{t}, a_{t})$, which is related to only the current state. Moreover, the optional probe actions differ among devices due to the different open ports. Therefore, we use the MASK [37] to obtain the available action set $A_{valid}$ from the action set $A$ for valid policy gradient updates.
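One common way to realise such a mask is to exclude invalid actions before normalising the policy's scores, so that they receive zero probability and contribute nothing to the gradient. The sketch below shows this masked softmax in plain Python. Whether the prototype applies the mask in exactly this way is an assumption, and `masked_softmax` is a hypothetical helper name.

```python
import math
from typing import List


def masked_softmax(logits: List[float], valid: List[bool]) -> List[float]:
    """Softmax over valid actions only; invalid actions get probability 0."""
    if len(logits) != len(valid):
        raise ValueError("logits and valid must have the same length")
    if not any(valid):
        raise ValueError("at least one action must be valid")
    # Subtract the largest valid logit for numerical stability.
    peak = max(logit for logit, ok in zip(logits, valid) if ok)
    weights = [
        math.exp(logit - peak) if ok else 0.0 for logit, ok in zip(logits, valid)
    ]
    total = sum(weights)
    return [weight / total for weight in weights]


if __name__ == "__main__":
    # Three probe actions; the second is unavailable because its port is closed.
    print(masked_softmax([1.0, 2.0, 3.0], [True, False, True]))
```

Sampling the next probe from these probabilities guarantees that a probe for a closed port is never selected.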

5. Implementation and Evaluation

We implemented a prototype system and conducted real-world experiments to validate the identification capability. We collected response data (port open, protocol response data, and web feature information) from 53,000 real IoT devices using the Shodan API [3]. As shown in Table 2, the dataset captures common IoT device brands well, covering a wide range of device categories. The ratio of training data to test data in our experiments was 9:1.

As mentioned earlier, cyber search engines can identify device type with an accuracy of over 95%. Our approach focuses on fine-grained identification, such as device brand, model, and firmware version. We utilize the dataset to implement model training and to evaluate the success rate and efficiency of identification. Then, we validate the identification capability of our approach by comparing it with another method based on protocol popularity.

5.1. Data Set Analysis

We first calculate the information acquisition rate for different probe methods. The probability equals $n_{i} / N$ $(i \in [1, 12])$, where $n_{i}$ is the number of devices successfully identified by probe method $i$, and $N$ is the total number of devices. Figure 5 shows the probability of obtaining device fingerprints using different probe methods. We found the following patterns in identifying devices:

A single method cannot identify complete fingerprints. For example, probing using the HTTP/HTTPS protocol has a high probability of obtaining brand and model information but a lower probability of obtaining firmware version.
Different response data contain different device fingerprint information. If a protocol response does not contain the firmware version, it is more likely that the rest of the protocol responses will contain this information.
Different combinations of probe methods have different complementarity in identifying device fingerprints. For example, the probability of identifying firmware version via the SNMP, HTTPS, and SIP protocols is 73.02%, 15.90%, and 50.82%, respectively. Although the probability of identifying the firmware version is only 15.90% using HTTPS, it can reach 80.31% when combined with SNMP, which is higher than the probability of 78.17% achieved by combining the SNMP and SIP protocols.

In other words, the complementarity between the SNMP and HTTPS protocols is higher than that between the SNMP and SIP protocols because the HTTPS and SNMP protocols’ responses have less duplicate fingerprint information. Therefore, the communication overhead can be significantly reduced by optimizing the probe methods and their order.
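The complementarity argument can be checked with inclusion-exclusion arithmetic on the reported percentages. The sketch assumes that the combined probability is the share of devices for which at least one of the two methods returns the firmware version; the function names are hypothetical.

```python
def overlap(p_a: float, p_b: float, p_combined: float) -> float:
    """Share of devices identified by both methods (inclusion-exclusion)."""
    return p_a + p_b - p_combined


def marginal_gain(p_combined: float, p_first: float) -> float:
    """Extra coverage obtained by adding the second method to the first."""
    return p_combined - p_first


if __name__ == "__main__":
    snmp, https, sip = 73.02, 15.90, 50.82  # firmware-version rates in percent
    print("SNMP+HTTPS overlap:", round(overlap(snmp, https, 80.31), 2))
    print("SNMP+SIP overlap:", round(overlap(snmp, sip, 78.17), 2))
    print("HTTPS gain over SNMP:", round(marginal_gain(80.31, snmp), 2))
    print("SIP gain over SNMP:", round(marginal_gain(78.17, snmp), 2))
```

Under that assumption, SIP duplicates far more of what SNMP already provides than HTTPS does, which is why HTTPS adds more coverage despite its lower stand-alone rate.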

5.2. Evaluation

For 53,000 devices, we examine the identification capability via 10-fold cross-validation. Figure 6 shows how the success rate changes with the maximum number of probe methods. The X-axis represents the maximum number of probe methods, and the Y-axis represents the success rate. The success rate equals $N_{get} / N_{all}$, where $N_{all}$ is the number of target devices, and $N_{get}$ is the number of devices that return responses containing device fingerprint information.

Our approach selects the optimal probe method in the first step of identification and achieves a high success rate. Furthermore, the success rate stabilizes when the maximum number of probe methods reaches five. At this stage, the identification success rates for device brand, model, and firmware version reach 96.89%, 93.43%, and 83.71%, respectively. Therefore, when performing large-scale device identification, we can ignore the probe methods in the tail of the optimal probe sequence to improve efficiency.

Different types of IoT devices can use our approach to obtain high efficiency and success rates, i.e., our approach has wide applicability.

5.3. Success Rate and Time Efficiency Compared with Other Work

5.3.1. Success Rate Performance

Table 3 compares our approach with another approach based on protocol popularity. The results show that we can identify device fingerprints at a finer granularity (device model and firmware version) than [36]. In addition, we use no more than five probe methods, which reduces the communication overhead significantly.

5.3.2. Time Efficiency Performance

To validate that our approach can successfully balance the success rate with communication overhead, we compute the identification time of 5292 real IoT devices. We compare our approach with another approach based on protocol popularity. The idea of ranking protocol popularity is inspired by previous work [36]. We rank the probe methods as HTTP, HTTPS, RTSP, FTP, SSH, TELNET, SNMP, CWMP, and PPTP based on the number of responses.

Figure 7 shows how the identification success rate changes with the maximum number of probe methods for the two approaches. Figure 7a–c show how the success rate changes when identifying brand, model, and firmware version, respectively. Our scheduling policy achieves a higher identification success rate with fewer probe methods. Especially for firmware version identification, our approach reaches a success rate of 83.71% with a combination of no more than five probes. By contrast, the approach based on protocol popularity requires a combination of nine probes to achieve a similar success rate.

In addition, we calculate the time to identify 5292 real IoT devices. Our approach requires 10.89 min to achieve success rates of 96.89%, 93.43%, and 83.71% for the brand, model, and firmware version, respectively, while the popularity-based approach requires 24.73 min. Our approach can reduce the identification time by 55.96% when identifying large-scale IoT devices.
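The reported reduction follows directly from the two measured times, as the short check below shows. The function name is hypothetical.

```python
def relative_reduction(baseline: float, ours: float) -> float:
    """Percentage by which `ours` is lower than `baseline`."""
    return (baseline - ours) / baseline * 100.0


if __name__ == "__main__":
    # Minutes needed to identify 5292 devices: popularity-based vs. scheduled.
    print(round(relative_reduction(24.73, 10.89), 2))
```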

6. Discussion and Limitations

In this section, we discuss the capabilities and limitations of our approach and explore directions for future improvement.

6.1. Ranks of Probing Actions

Our evaluation shows that the reward function r in Algorithm 2 is appropriate. However, the reward function can be extended to describe the benefits of a probing action more accurately. For example, different probing actions take different amounts of communication time. In this case, the action execution time, the success rate, and other factors can be used as parameters of the reward function.

6.2. Scheduling Policy

There are three basic ways to implement an RL algorithm: value-based, policy-based, and model-based. Our evaluation shows that the policy-based algorithm is a reasonable choice. In future work, other RL algorithms such as the Deep Q-Network (DQN) can be applied to our approach to compare the effects of different algorithms experimentally.

7. Conclusions

In this paper, we propose a fine-grained probe-scheduling approach based on information feedback to identify large-scale IoT devices. First, we model the probe surface as three layers for IoT devices and define their relationships. Then, we improve the policy gradient algorithm to optimize the probe policy and generate the optimal probe sequence for the target device. We implement a prototype system and evaluate its effectiveness through real-world experiments. Our approach can achieve success rates of 96.89%, 93.43%, and 83.71% for device brand, model, and firmware version, respectively, and it reduces the identification time by 55.96%.

Author Contributions

Conceptualization, C.L. and B.Y.; methodology, C.L.; software, W.X.; validation, C.L. and B.W.; data curation, C.L.; writing—original draft preparation, W.P.; writing—review and editing, C.L., B.Y., W.X., B.W. and W.P. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Natural Science Foundation of China (61902416, 61902412) and the Natural Science Foundation of Hunan Province in China (2019JJ50729).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

We will release all data and the analysis script to replicate the results of this work and to encourage further studies: https://github.com/sherlocklchen/real-IoT-device-assets.

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:

IoT	Internet of Things
RL	Reinforcement Learning
PG	policy gradient
GSMA	Global System for Mobile Communications Association
SNMP	Simple Network Management Protocol
HTTP	Hyper Text Transfer Protocol
HTTPS	Hyper Text Transfer Protocol over Secure Socket Layer
RTSP	Real Time Streaming Protocol
FTP	File Transfer Protocol
SSH	Secure Shell Protocol
TELNET	Telecommunication Network Protocol
CWMP	CPE WAN Management Protocol
PPTP	Point to Point Tunneling Protocol
IPP	Internet Printing Protocol
UPnP	Universal Plug and Play Protocol
Onvif	Open Network Video Interface Forum
NDMP	Network Data Management Protocol
NAS	Network Attached Storage
URL	Uniform Resource Locator
SSL	Secure Sockets Layer
CNN	Convolutional Neural Networks
LSTM	Long Short-Term Memory Networks
API	Application Programming Interface
DQN	Deep Q-Network

Appendix A. Special Port for IoT Devices

Table A1. Special Port For Webcam.

Device Brand	Protocol Type	Ports
HIKVISION	HTTP/HTTPS	81, 80, 82, 443, 8443
	RTSP	554
	DateService	8000
	Onvif	80
Dahua	HTTP/HTTPS	80, 8080, 443, 8443
	RTSP	554
	DateService	37777
	Onvif	80
TP-Link	HTTP/HTTPS	80, 443, 8080, 443
	RTSP	554
	Onvif	2020, 80
D-Link	HTTP/HTTPS	80, 443, 8080
	RTSP	554
	Onvif	80
VIVOTEK	HTTP/HTTPS	80, 443, 8080
	RTSP	554
	Onvif	80
	FTP	21
AXIS	HTTP/HTTPS	80, 81, 8081, 8080
	RTSP	554
	Onvif	80
	FTP	21
Panasonic	HTTP/HTTPS	80, 443, 81
	RTSP	554
	Onvif	80
	FTP	21
Cisco	HTTP/HTTPS	80, 443
	RTSP	554
	FTP	21

Table A2. Special Port For Firewall.

Device Brand	Protocol Type	Ports
Cisco ASA	HTTP/HTTPS	443, 80, 8443
	SSH	22
	Telnet	23
Fortinet Gate	HTTP/HTTPS	10443, 443, 80, 8443
	SSH	22
	Telnet	23
Huawei	HTTP/HTTPS	443, 8443, 80, 8888
	SSH	22
	Telnet	23
	SNMP	161
D-Link DFL	HTTP/HTTPS	80, 443, 8080, 8443
	Telnet	23
Ruijie	HTTP/HTTPS	443, 80
	Telnet	23
	SNMP	161
	SSH	22

Table A3. Special Port For Router.

Device Brand	Protocol Type	Ports
Cisco	HTTP/HTTPS	80, 8080, 8081, 443
	SNMP	161
	UPnP	1900
	FTP	21
	SSH	22
	Telnet	23
Netcore	HTTP/HTTPS	8080, 8081, 443
	SNMP	161
	UPnP	1900
	FTP	21
	SSH	22
	Telnet	23
Juniper	HTTP/HTTPS	80, 443
	SNMP	161
	UPnP	1900
	FTP	21
	SSH	22
	Telnet	23

Table A4. Special Port For Printer.

Device Brand	Protocol Type	Ports
Samsung	HTTP/HTTPS	80, 8080, 8081, 443
	IPP	631
	FTP	21
	Telnet	23
	SNMP	161
	PJL	9100, 9101, 9102
LexMark	HTTP/HTTPS	80, 8000, 8080, 443
	IPP	631
	FTP	21, 9600
	Telnet	23
	SNMP	161
	PJL	9100, 515
	Finger	79
Dell	HTTP/HTTPS	80, 443
	IPP	631
	FTP	21
	Telnet	9000, 23
	SNMP	161
	PJL	9100, 9101, 9102

References

GSM Association. IoT Connections Forecast: The Rise of Enterprise. Available online: https://www.gsma.com/iot/resources/iot-connections-forecast-the-riseof-enterprise/ (accessed on 15 November 2020).
Park, M.; Oh, H.; Lee, K. Security risk measurement for information leakage in IoT-based smart homes from a situational awareness perspective. Sensors 2019, 19, 2148.
Matherly, J. Complete Guide to Shodan; Shodan LLC: Pflugerville, TX, USA, 2015; Volume 1.
Ribeiro, T.; Vala, M.; Paiva, A. Censys: A model for distributed embodied cognition. In Lecture Notes in Computer Science, Proceedings of the International Workshop on Intelligent Virtual Agents, Edinburgh, UK, 29–31 August 2013; Springer: Berlin/Heidelberg, Germany, 2013; pp. 58–67.
Feng, X.; Li, Q.; Wang, H.; Sun, L. Characterizing industrial control system devices on the Internet. In Proceedings of the International Conference on Network Protocols (ICNP), Singapore, 8–11 November 2016.
Wang, S.; Bi, J.; Wu, J.; Vasilakos, A.V.; Fan, Q. VNE-TD: A virtual network embedding algorithm based on temporal-difference learning. Comput. Netw. 2019, 161, 251–263.
Huang, M.; Liu, A.; Xiong, N.N.; Wang, T.; Vasilakos, A.V. A low-latency communication scheme for mobile wireless sensor control systems. IEEE Trans. Syst. Man Cybern. Syst. 2019, 49, 317–332.
Sutton, R.S.; Barto, A.G. Reinforcement Learning: An Introduction; MIT Press: Cambridge, MA, USA, 2018.
Cisco. Big Security in a Small Business World 10 Myth Busters for SMB Cybersecurity; Cisco: San Jose, CA, USA, 2020.
Miettinen, M.; Marchal, S.; Hafeez, I.; Asokan, N.; Sadeghi, A.R.; Tarkoma, S. Iot sentinel: Automated device-type identification for security enforcement in iot. In Proceedings of the 2017 IEEE 37th International Conference on Distributed Computing Systems (ICDCS), Atlanta, GA, USA, 5–8 June 2017; pp. 2177–2184.
Yu, D.; Xin, H.; Chen, Y.; Ma, Y.; Chen, J. Cross-Layer Protocol Fingerprint for Large-Scale Fine-Grain Devices Identification. IEEE Access 2020, 8, 176294–176303.
Wang, X.; Huang, J.; Qi, C. FDI: A Fast IoT Device Identification Approach. In Proceedings of the 2020 International Conference on Cyberspace Innovation of Advanced Technologies, Guangzhou, China, 4–6 December 2020; pp. 277–282.
Sivanathan, A.; Gharakheili, H.H.; Loi, F.; Radford, A.; Wijenayake, C.; Vishwanath, A.; Sivaraman, V. Classifying IoT devices in smart environments using network traffic characteristics. IEEE Trans. Mob. Comput. 2018, 18, 1745–1759.
Antonakakis, M.; April, T.; Bailey, M.; Bernhard, M.; Bursztein, E.; Cochran, J.; Durumeric, Z.; Halderman, J.A.; Invernizzi, L.; Kallitsis, M.; et al. Understanding the mirai botnet. In Proceedings of the 26th USENIX security symposium (USENIX Security 17), Vancouver, BC, Canada, 16–18 August 2017; pp. 1093–1110.
Li, Q.; Feng, X.; Wang, R.; Li, Z.; Sun, L. Towards fine-grained fingerprinting of firmware in online embedded devices. In Proceedings of the IEEE INFOCOM 2018-IEEE Conference on Computer Communications, Honolulu, HI, USA, 16–19 April 2018; pp. 2537–2545.
Feng, X.; Li, Q.; Wang, H.; Sun, L. Acquisitional rule-based engine for discovering Internet-of-Thing devices. In Proceedings of the 27th USENIX Security Symposium, Baltimore, MD, USA, 15–17 August 2018; pp. 327–341.
Duarte, F.S.L.G.; Sikansi, F.; Fatore, F.M.; Fadel, S.G.; Paulovich, F.V. Nmap: A novel neighborhood preservation space-filling algorithm. IEEE Trans. Vis. Comput. Graph. 2014, 20, 2063–2071.
Yang, K.; Li, Q.; Sun, L. Towards automatic fingerprinting of IoT devices in the cyberspace. Comput. Netw. 2019, 148, 318–327.
Durumeric, Z.; Wustrow, E.; Halderman, J.A. ZMap: Fast Internet-wide Scanning and Its Security Applications. In Proceedings of the 22nd USENIX Security Symposium (USENIX Security 13), Washington, DC, USA, 14–16 August 2013; pp. 605–620.
Cheng, Y.; Ji, X.; Zhang, J.; Xu, W.; Chen, Y.C. DemicPU: Device fingerprinting with magnetic signals radiated by CPU. In Proceedings of the ACM Conference on Computer and Communications Security, London, UK, 11–15 November 2019; pp. 1149–1162.
Park, S.Y.; Lim, S.; Jeong, D.; Lee, J.; Yang, J.S.; Lee, H. PUFSec: Device fingerprint-based security architecture for Internet of Things. In Proceedings of the IEEE INFOCOM, Atlanta, GA, USA, 1–4 May 2017.
Sanchez-Rola, I.; Santos, I.; Balzarotti, D. Clock around the clock: Time-based device fingerprinting. In Proceedings of the ACM Conference on Computer and Communications Security, Toronto, ON, Canada, 15–19 October 2018; pp. 1502–1514.
Meidan, Y.; Bohadana, M.; Shabtai, A.; Guarnizo, J.D.; Ochoa, M.; Tippenhauer, N.O.; Elovici, Y. ProfilIoT: A machine learning approach for IoT device identification based on network traffic analysis. In Proceedings of the ACM Symposium on Applied Computing, Pisa, Italy, 21–24 March 2017; Association for Computing Machinery: New York, NY, USA, 2017; pp. 506–509.
Meidan, Y.; Bohadana, M.; Shabtai, A.; Ochoa, M.; Tippenhauer, N.O.; Guarnizo, J.D.; Elovici, Y. Detection of unauthorized IoT devices using machine learning techniques. arXiv 2017, arXiv:1709.04647.
Sivanathan, A.; Sherratt, D.; Gharakheili, H.H.; Radford, A.; Wijenayake, C.; Vishwanath, A.; Sivaraman, V. Characterizing and classifying IoT traffic in smart cities and campuses. In Proceedings of the 2017 IEEE Conference on Computer Communications Workshops, INFOCOM WKSHPS, Atlanta, GA, USA, 1–4 May 2017; pp. 559–564.
Santos, M.R.; Andrade, R.M.; Gomes, D.G.; Callado, A.C. An efficient approach for device identification and traffic classification in IoT ecosystems. In Proceedings of the IEEE Symposium on Computers and Communications, Natal, Brazil, 25–28 June 2018; pp. 304–309.
Fki, Z.; Ammar, B.; Ayed, M.B. Machine learning with Internet of Things data for risk prediction: Application in ESRD. In Proceedings of the International Conference on Research Challenges in Information Science, Barcelona, Spain, 17–20 May 2018; pp. 1–6.
Shen, Y.Z.; Gu, C.X.; Chen, X.; Zhang, X.L.; Lu, Z.Y. Vulnerability analysis of OpenVPN system based on model learning. Ruan Jian Xue Bao/J. Softw. 2019, 30, 3750–3764.
Shaikh, F.; Bou-Harb, E.; Crichigno, J.; Ghani, N. A Machine Learning Model for Classifying Unsolicited IoT Devices by Observing Network Telescopes. In Proceedings of the 2018 14th International Wireless Communications and Mobile Computing Conference, IWCMC 2018, Limassol, Cyprus, 25–29 June 2018; pp. 938–943.
Thangavelu, V.; Divakaran, D.M.; Sairam, R.; Bhunia, S.S.; Gurusamy, M. DEFT: A Distributed IoT Fingerprinting Technique. IEEE Internet Things J. 2019, 6, 940–952.
Maiti, R.R.; Siby, S.; Sridharan, R.; Tippenhauer, N.O. Link-layer device type classification on encrypted wireless traffic with COTS radios. In Lecture Notes in Computer Science, Proceedings of the European Symposium on Research in Computer Security, Oslo, Norway, 11–15 September 2017; Springer: Berlin/Heidelberg, Germany, 2017; Volume 10493, pp. 247–264.
Apthorpe, N.; Reisman, D.; Sundaresan, S.; Narayanan, A.; Feamster, N. Spying on the smart home: Privacy attacks and defenses on encrypted iot traffic. arXiv 2017, arXiv:1708.05044.
Clarke, M.R.B.; Duda, R.O.; Hart, P.E. Pattern Classification and Scene Analysis. J. R. Stat. Soc. Ser. A Gen. 1974, 137, 442.
Zhu, F.; Liu, L.; Meng, W.; Lv, T.; Hu, S.; Ye, R. SCAFFISD: A scalable framework for fine-grained identification and security detection of wireless routers. In Proceedings of the 2020 IEEE 19th International Conference on Trust, Security and Privacy in Computing and Communications (TrustCom), Guangzhou, China, 29 December–1 January 2021; pp. 1194–1199.
Samtani, S.; Yu, S.; Zhu, H.; Patton, M.; Matherly, J.; Chen, H. Identifying SCADA systems and their vulnerabilities on the internet of things: A text-mining approach. IEEE Intell. Syst. 2018, 33, 63–73.
Feng, X.; Li, Q.; Han, Q.; Zhu, H.; Liu, Y.; Cui, J.; Sun, L. Active profiling of physical devices at internet scale. In Proceedings of the 2016 25th International Conference on Computer Communications and Networks, ICCCN 2016, Waikoloa, HI, USA, 1–4 August 2016.
Huang, S.; Ontañón, S. A closer look at invalid action masking in policy gradient algorithms. arXiv 2020, arXiv:2006.14171.

Figure 1. Examples of protocol responses.

Figure 2. Examples of web features.

Figure 3. Overview of our framework.

Figure 4. The process of data analysis.

Figure 5. The probability of obtaining device fingerprints.

Figure 6. The success rate of device fingerprint identification.

Figure 7. Time efficiency compared with another work: (a) success rate of brand; (b) success rate of model; (c) success rate of version.

Table 1. Typical open port combinations.

Device Type	Protocol Type	Default Ports
Webcam	HTTP/HTTPS	81, 80, 82, 8080, 443, 8443
	RTSP	554
	Data Service	8000, 37777
	Onvif	80, 2020, 3702
Printer	HTTP/HTTPS	80, 8000, 8080, 8081, 443, 8443
	IPP	631
	FTP	21, 9600
	Telnet	23
	PJL	9100, 9101, 9102, 515
Firewall	HTTP/HTTPS	443, 10443, 8443, 80, 8888, 8080
	SSH	22
	Telnet	23
	SNMP	161
NAS	HTTP/HTTPS	80, 8080, 443, 8443, 5000, 5001, 8000
	UPnP	1900
	FTP	21
	NDMP	10000
	Rpcbind	111
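
As an illustration of how the port combinations in Table 1 could serve as a coarse first-stage hint, the following is a minimal sketch, not the scheduling policy described in this paper. It assumes a hypothetical dictionary `TYPICAL_PORTS` that transcribes Table 1 (with the NDMP port written as 10000) and a hypothetical helper `rank_device_types` that scores each device type by the fraction of its protocol rows with at least one open port.

```python
# Hypothetical transcription of Table 1: device type -> protocol -> default ports.
TYPICAL_PORTS = {
    "Webcam": {
        "HTTP/HTTPS": {81, 80, 82, 8080, 443, 8443},
        "RTSP": {554},
        "Data Service": {8000, 37777},
        "Onvif": {80, 2020, 3702},
    },
    "Printer": {
        "HTTP/HTTPS": {80, 8000, 8080, 8081, 443, 8443},
        "IPP": {631},
        "FTP": {21, 9600},
        "Telnet": {23},
        "PJL": {9100, 9101, 9102, 515},
    },
    "Firewall": {
        "HTTP/HTTPS": {443, 10443, 8443, 80, 8888, 8080},
        "SSH": {22},
        "Telnet": {23},
        "SNMP": {161},
    },
    "NAS": {
        "HTTP/HTTPS": {80, 8080, 443, 8443, 5000, 5001, 8000},
        "UPnP": {1900},
        "FTP": {21},
        "NDMP": {10000},
        "Rpcbind": {111},
    },
}


def rank_device_types(open_ports):
    """Rank device types by the share of protocol rows with an open port.

    This scoring rule is an assumption made for illustration only.
    """
    observed = set(open_ports)
    scores = {}
    for device_type, protocols in TYPICAL_PORTS.items():
        # Count protocol rows that share at least one port with the scan result.
        matched = sum(1 for ports in protocols.values() if ports & observed)
        scores[device_type] = matched / len(protocols)
    # Highest score first; sorted() is stable, so ties keep table order.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    for device_type, score in rank_device_types([80, 554, 8000]):
        print(f"{device_type}: {score:.2f}")
```

Because ports such as 80 and 443 appear under every device type, a ranking of this kind can only narrow the candidate set; finer identification still depends on the protocol-response and web-feature layers.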

Table 2. Sample device models for the real-world test.

Brands	Number of Models	Number of Devices	Main Model Series	Type
Cisco	946	15,186	CSR, DPQ, ASR, RV	Router, Switch
			TANDBERG, Codian	Webcam
			ASA	Firewall
Huawei (H3C)	776 (114)	20,977	AR, HG, EG, Quidway, WX, CR, SI	Router, Switch
			Secoway, Eudemon, ASG, SecPath	Firewall
			IPC, HiSilicon	Webcam
D-Link	666	12,270	DI, DCM, DSL, DIR	Router, Switch
			DCS, DSH	Webcam
Juniper	483	1333	MX, ERS, EX	Router, Switch
			SRX	Firewall
HP	370	850	ProCurve, SuperStack, NR	Router, Switch
			LaserJet, OfficeJet	Printer
Synology	191	718	DiskStation, CubeStation	NAS
Dahua	117	1219	DHI, HCVR, NVR	Webcam
Fortinet	86	553	FortiGate, FortiManager	Firewall

Table 3. Success rate compared with other work.

Work	Approach	Features	Success Rate (Type)	Success Rate (Brand)	Success Rate (Model)	Success Rate (Version)
[36]	Protocol Banner Fingerprints	Protocol Banners	over 90%	NA	NA	NA
Our Work	Probe Scheduling	Multi-layer features	93.43%	96.89%	93.43%	83.71%

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Liang, C.; Yu, B.; Xie, W.; Wang, B.; Peng, W. Fine-Grained Identification for Large-Scale IoT Devices: A Smart Probe-Scheduling Approach Based on Information Feedback. Appl. Sci. 2022, 12, 8335. https://doi.org/10.3390/app12168335

