The purpose of industrial control system (ICS) cybersecurity is to ensure that the industrial process performs safely and as expected. It should only perform at the right time, for the right people, and for the purposes for which it was designed. Anything outside those conditions is often considered a cybersecurity incident. Small improvements to the system design, network architecture, monitoring strategy, and maintenance policies can solve many problems before they become larger issues.
Reliability and ROI
When broaching the topic of cybersecurity with management, it is important to show some return on investment (ROI). Generally, organizations are not prepared to invest in cybersecurity for cybersecurity’s sake. It may be easier to introduce cybersecurity improvements by looking toward the overall reliability and uptime of the system instead. The reliability and uptime of an ICS are a function of safety, security, and performance. A failure in any of those areas affects the overall reliability of the system, which in turn affects uptime and production efficiency.
For many industrial processes, safety is king. Companies have learned the hard lessons of not responding to safety issues right away through a number of serious incidents. Organizations have also designed their systems to operate safely or with safer processes to lower their potential risk. By reducing the consequences in areas of their systems, organizations can reduce the complexity of the countermeasures that need to be applied to the system.
Security can have a negative or positive effect on reliability and uptime, depending on how it is implemented. When implemented well, for example, it can segment the network, reduce the attack surface of legacy systems, and limit the spread of an incident.
Performance seems like a natural aspect of reliability and uptime, but the root causes of performance degradation or failures may actually be overlooked. Performance problems often present themselves as inconsistent data delivery, halted human-machine interface screens, or jitter in data values. They may be indicators of network infrastructure problems and not the result of malfunctioning devices.
Risk management is an integral part of industrial processes. Balancing the process risks with those for production quantity, quality, and safety is important for industrial organizations. When considering how to manage ICS security risks, learn from existing risk management systems.
Organizations have often analyzed financial, safety, physical security, and business information technology (IT) security risks. The consequences and risk calculations made during those efforts are similar to those for ICS cybersecurity. Generally, the consequences will be the same for the different risk management systems, although the root causes may be different.
When comparing ICS cybersecurity to other risk management systems, consider people, devices, and systems not acting as they should or as they were configured, either through unintentional events or intentional actions. The failure modes associated with ICS security are slightly different as well:
- Loss of view = condition where a device or system is not receiving information from another device or system
- Manipulation of view = actions by an attacker to change the information between devices or systems
- Denial of control = condition where a device or system is not receiving control signals from another device or system
- Manipulation of control = actions by an attacker to change the control signals sent between devices or systems
- Loss of control = actions by an attacker to combine some or all of the above and deny information and control signals from reaching the proper devices or systems correctly
For greenfield (new) ICS, security should be factored in from the start. When designing the control system, organizations should consider the security of components and communication paths. ICS cybersecurity should be included in the normal hazard and operability study, safety instrumented system (SIS) designs, and basic process control system designs. Consider possible single points of failure and systems that require extra protection due to potential consequences or their importance to the process.
For brownfield (retrofit/upgrade) projects, security should be factored into all future designs. The organization should consider adding or modifying security countermeasures during maintenance outages. These upgrades will require more planning, because maintenance outages are limited in duration and resources may not be available. Any improvements should be designed, procured, and tested with enough lead time, possibly months in advance, to implement them without delay.
In a perfect world, organizations have enough personnel and funding to implement ICS security for all their systems once management approves. In the real world, capital expenditures are limited, personnel are almost always overloaded, and systems cannot be shut down at a moment’s notice. Organizations need to prioritize their ICS security countermeasures. One way to do this is by looking at the ability to implement versus the time for planned outages and making three categories: easily actionable improvements, near-term improvements, and long-term plans (see table 1).
Easily actionable improvements are not on the critical path where a change requires shutting down the main process. An example is removing unused or unauthorized software from operator workstations. Another example is upgrading a network interface on equipment only used periodically. A third example is adding a test system capable of validating patches and updates before they are applied to the production equipment.
Near-term achievable improvements are countermeasures that can be fully developed, procured, tested, and ready to implement before the next planned outage. For example, if a plant is planning a network infrastructure change, it will require downtime to change equipment. The network change can be planned; the equipment can be procured and preconfigured; and the personnel can be trained on the new equipment before the outage. This could possibly require months of preparation.
Long-term plans are countermeasures that may take multiple planned outages to accomplish. They can be done in different ways. One way is to break down the long-term plans into a series of near-term improvements that can then be implemented during multiple planned outages. Another way is to develop the plans in parallel with the existing system. Once finished and tested, the production system can be switched over during a planned outage. Depending on the size of the long-term plans, it may be necessary to use a combination of both methods.
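The three categories can be sketched as a simple triage function. This is only an illustration of the idea, not a prescribed method: the `Countermeasure` fields, the `categorize` name, and the rule that compares preparation time with the time until the next planned outage are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Countermeasure:
    # All fields are hypothetical; adapt them to the organization's own planning data.
    name: str
    requires_outage: bool   # does the change require shutting down the main process?
    outages_needed: int     # planned outages needed to complete the work
    prep_months: float      # time to design, procure, and test beforehand


def categorize(cm: Countermeasure, months_to_next_outage: float) -> str:
    """Place a countermeasure into one of the three planning categories."""
    if not cm.requires_outage:
        # Not on the critical path, so it can be done without a shutdown.
        return "easily actionable"
    if cm.outages_needed <= 1 and cm.prep_months <= months_to_next_outage:
        # Can be fully prepared before the next planned outage.
        return "near-term"
    # Needs several outages, or cannot be ready in time for the next one.
    return "long-term"
```

For instance, `categorize(Countermeasure("Replace core switch", True, 1, 4), months_to_next_outage=6)` lands in the near-term category, because the work fits in one outage and can be prepared in time.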
Network segmentation is one of the biggest factors in ICS network reliability. Properly designed segmentation can be a natural barrier to performance and security issues. A poorly designed network architecture may expose ICS to unnecessary network traffic, expand the potential attack surface of ICS equipment, and reduce the overall effectiveness of security practices within the organization.
Technology is only part of the solution
Network segmentation is more than just adding technology to a network. It is a process to understand:
- what devices communicate on the network
- how fast or often those devices communicate
- where the information flows throughout the network
- what form that information takes
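The four questions above can be approached by summarizing captured traffic into one record per conversation. The sketch below assumes the capture has already been reduced to simple tuples of `(timestamp_s, src, dst, protocol, size_bytes)`; that record layout and the `summarize_flows` name are assumptions for illustration, not the output format of any particular capture tool.

```python
from collections import defaultdict


def summarize_flows(records):
    """Group packet records into flows keyed by (src, dst, protocol).

    records: iterable of (timestamp_s, src, dst, protocol, size_bytes).
    Returns, per flow, the packet count, byte count, and average packet rate.
    """
    flows = defaultdict(lambda: {"packets": 0, "bytes": 0, "first": None, "last": None})
    for ts, src, dst, proto, size in records:
        flow = flows[(src, dst, proto)]
        flow["packets"] += 1
        flow["bytes"] += size
        flow["first"] = ts if flow["first"] is None else min(flow["first"], ts)
        flow["last"] = ts if flow["last"] is None else max(flow["last"], ts)

    summary = {}
    for key, flow in flows.items():
        duration = flow["last"] - flow["first"]
        # The rate is undefined for a single packet or a zero-length window.
        rate = (flow["packets"] - 1) / duration if duration > 0 else None
        summary[key] = {
            "packets": flow["packets"],
            "bytes": flow["bytes"],
            "packets_per_s": rate,
        }
    return summary
```

The flow keys show which devices communicate and where the information flows, the protocol field hints at the form the information takes, and the packet rate indicates how often the devices talk.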
Understanding how the ICS devices and systems interact is key to designing for robustness and reliability. In the example ICS network architecture (see figure 1), major areas have been segmented by purpose and physical location. Buildings 2 and 3 are tightly coupled, requiring real-time ICS protocol traffic with control cycles in milliseconds. To reduce the ICS core network overhead and improve performance, these networks are joined into a single segment. Building 4 needs access to building 3’s information, but that information resides on an OPC server in the ICS servers segment.
Security zones are areas within a system that share similar security requirements. In figure 1, building 1 contains two distinct areas with different sets of security requirements, and area 1 has been assigned its own security zone and network segment. Examples of this situation include an SIS, a legacy system, or vendor-proprietary equipment.
Simplicity of design
Figure 1 may look complex enough with multiple network topology layers and security zones. When overlaying these concepts on an operational ICS network, the architecture only gets more complex. In reality, though, the process of segmenting portions of the network and creating security zones makes things easier in the long run.
For networks similar to the one in figure 1, it is common to use a layer 3 switch at the ICS core and separate IP subnets for each of the network segments shown. Improvements to reliability and maintenance will probably overshadow the added initial cost of the network hardware.
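One way to plan separate IP subnets for each segment is with Python's standard `ipaddress` module. The address block `10.10.0.0/16`, the /24 subnet size, and the segment names below are assumptions chosen for illustration; they are not taken from figure 1.

```python
import ipaddress


def plan_subnets(supernet, segments, new_prefix=24):
    """Assign consecutive subnets of the given size to the named segments."""
    candidates = ipaddress.ip_network(supernet).subnets(new_prefix=new_prefix)
    plan = {}
    for name in segments:
        try:
            plan[name] = next(candidates)
        except StopIteration:
            raise ValueError("supernet is too small for the requested segments") from None
    return plan


# Hypothetical segment names loosely modeled on the areas discussed above.
SEGMENTS = [
    "ics-core",
    "ics-servers",
    "building-1",
    "building-1-area-1",
    "buildings-2-3",
    "building-4",
]

if __name__ == "__main__":
    for segment, subnet in plan_subnets("10.10.0.0/16", SEGMENTS).items():
        print(f"{segment:20} {subnet}")
```

Each subnet would then be routed at the layer 3 switch, so that traffic between segments crosses a point where it can be filtered and monitored.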
ICS devices are sensitive to the amount of network traffic they are exposed to on a network. Reducing the amount of network traffic not associated with an ICS device’s specific function increases its overall performance.
Maintenance reports and network alarms may only include a device’s IP address and an error. If the IP address subnet directly relates to a physical area within the facility, maintenance personnel will be able to identify systems and devices more easily. A host-numbering scheme can help further. For example, assign a host number from 10–29 for all controllers and 30–99 for all I/O devices. Another example is to assign host numbers from 10–49 for process 1 and 50–89 for process 2.
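A small lookup can turn the IP address from an alarm into a location and device type. The sketch below uses the first host-numbering example above (10–29 for controllers, 30–99 for I/O devices); the subnet-to-area mapping and the /24 subnet size are assumptions for illustration.

```python
import ipaddress

# Hypothetical mapping of subnets to physical areas in the facility.
AREA_BY_SUBNET = {
    ipaddress.ip_network("10.10.2.0/24"): "Building 1",
    ipaddress.ip_network("10.10.4.0/24"): "Buildings 2 and 3",
}

# Host-number ranges from the first example in the text.
ROLE_BY_HOST_RANGE = [
    (range(10, 30), "controller"),
    (range(30, 100), "I/O device"),
]


def describe(ip_text):
    """Translate an IP address into a physical area and a device role."""
    ip = ipaddress.ip_address(ip_text)
    area = next((name for net, name in AREA_BY_SUBNET.items() if ip in net),
                "unknown area")
    host = int(ip) & 0xFF  # last octet; only meaningful for /24 subnets
    role = next((role for hosts, role in ROLE_BY_HOST_RANGE if host in hosts),
                "unassigned")
    return f"{area}, {role} (host {host})"
```

With this scheme, an alarm that reports only `10.10.2.12` can be read as a controller in building 1 without consulting a separate asset list.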
Adding security zones on top of network segments may also seem like extra, unnecessary complexity. In many cases, there will be only three security zones assigned to the ICS. The business/ICS demilitarized zone (DMZ) is responsible for information passed between the business and ICS networks. It needs to restrict direct access through the DMZ, while still allowing information to pass securely in both directions. The ICS core and servers zone generally consolidates and processes information going through the DMZ between the business and ICS systems. The ICS process networks contain the bulk of the systems within the ICS environment. These are the controllers, I/O systems, sensors, actuators, process equipment, and other devices that make up the actual process.
Some subnets, systems, or subsystems within the ICS network have higher security requirements than the rest of the network (e.g., SIS, legacy systems, vendor-proprietary systems, and tightly coupled network devices). In these cases, an additional security zone can be created to isolate them from the rest of the network.
If a system is connected to a network and not monitored, then there is no guarantee that it is safe, secure, or performing properly. Monitoring can be done with specialized, designed systems and services or by observing the behavior of the system. Four main things to monitor are network segmentation devices, ingress/egress filtering, intrusion detection systems (IDSs), and network performance indicators.
Network segmentation device monitoring
Segmenting networks using devices such as layer 3 switches, firewalls, routers, and data diodes should be part of most network designs. The rule sets, configurations, and logs for the segmentation devices should be subject to strict change management policies and be monitored regularly. Changes to access privileges, management interfaces, or segmentation rule sets should receive prior approval. Even small changes can have drastic effects on the architecture of the network. An automated tool for monitoring these changes is recommended given the length and complexity of the files.
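Automated change monitoring can be as simple as comparing the current configuration export with the last approved copy. The following is a minimal sketch using only the Python standard library; it assumes the device configuration is available as plain text, and the rule syntax in the usage example is generic and hypothetical rather than that of any specific vendor.

```python
import difflib
import hashlib


def fingerprint(config_text):
    """SHA-256 digest of a configuration, useful for a quick equality check."""
    return hashlib.sha256(config_text.encode("utf-8")).hexdigest()


def config_changes(approved, current):
    """Return the added (+) and removed (-) lines relative to the approved config."""
    if fingerprint(approved) == fingerprint(current):
        return []
    diff = list(difflib.unified_diff(
        approved.splitlines(), current.splitlines(),
        fromfile="approved", tofile="current", lineterm="", n=0,
    ))
    # Skip the two file-header lines and the "@@" hunk markers.
    return [line for line in diff[2:] if not line.startswith("@@")]
```

Given an approved rule set of `permit tcp host 10.10.1.5 host 10.20.0.8 eq 443` followed by `deny ip any any`, inserting a `permit ip any any` line between them would be reported as a single added line, which could then be checked against the change management records.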
It is also important to monitor the types of traffic flowing in and out of the ICS network. If the configuration of the segmentation device is changed without the knowledge of the ICS network administrator, the first indication may be unknown network traffic going across the business/ICS network boundary.
Monitoring the ingress and egress of traffic during factory acceptance testing (FAT), site acceptance testing (SAT), or process startup establishes a baseline of the network traffic. Facilities should monitor traffic periodically to look for new network communication paths to addresses within or outside the organization. If unknown traffic is detected, there may be a problem in a particular ICS or network segmentation device.
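Comparing periodic observations with the baseline amounts to a set difference over communication paths. In this sketch a path is represented as a `(source, destination, destination_port)` tuple; that representation and the addresses in the example are assumptions for illustration.

```python
def new_paths(baseline, observed):
    """Return communication paths seen now that were absent from the baseline."""
    return sorted(set(observed) - set(baseline))


# Hypothetical baseline recorded during FAT, SAT, or startup.
BASELINE = {
    ("10.10.4.12", "10.10.1.20", 502),   # controller to ICS server
    ("10.10.1.20", "10.20.0.8", 443),    # ICS server to DMZ host
}

# Hypothetical periodic observation containing one unexpected path.
OBSERVED = {
    ("10.10.4.12", "10.10.1.20", 502),
    ("10.10.1.20", "10.20.0.8", 443),
    ("10.10.4.12", "203.0.113.7", 80),
}

if __name__ == "__main__":
    for path in new_paths(BASELINE, OBSERVED):
        print("Unknown path:", path)
```

Any path reported here is not necessarily malicious, but it is a prompt to check the ICS device involved and the configuration of the segmentation devices.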
Intrusion detection systems
IDSs are valuable tools capable of monitoring traffic continuously to look for known conditions based upon a set of rules. In many cases, additional rules can be generated for known good traffic. IDSs also generate alarm and event data that can be integrated into other systems, like security information and event management systems.
Network performance indicators
Network performance indicators are less well defined than the other types of monitoring and are very process dependent. The network streams that are most sensitive to network anomalies are heavily dependent on the ICS network architecture, devices, protocols, and environmental conditions. Some network performance metrics and methods have been developed to aid organizations.
An ICS network traffic stream can be characterized by the cyclic jitter on periodic traffic or by the latency of command/response traffic. Which metric is more important depends on the ICS being analyzed. Summary statistics about these streams may not be useful, however: in many cases, the mean, minimum, maximum, and standard deviation values do not indicate any problems. Network performance indicators may only be observable when looking at time plots for each traffic stream.
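As a rough sketch of how cyclic jitter might be derived from a capture, the functions below compute inter-arrival intervals from packet timestamps, the usual summary statistics, and a per-packet deviation series suitable for a time plot. The function names are assumptions, and the input is assumed to be the arrival times, in seconds, of one periodic stream.

```python
import statistics


def cycle_intervals_ms(timestamps_s):
    """Inter-arrival times, in ms, between consecutive packets of one stream."""
    return [(later - earlier) * 1000.0
            for earlier, later in zip(timestamps_s, timestamps_s[1:])]


def summary(intervals_ms):
    """Summary statistics; these can look healthy even when the stream is not."""
    return {
        "mean": statistics.fmean(intervals_ms),
        "min": min(intervals_ms),
        "max": max(intervals_ms),
        "stdev": statistics.stdev(intervals_ms),
    }


def jitter_series(timestamps_s, nominal_ms):
    """(arrival time, deviation from nominal interval) pairs for a time plot."""
    intervals = cycle_intervals_ms(timestamps_s)
    return [(arrival, interval - nominal_ms)
            for arrival, interval in zip(timestamps_s[1:], intervals)]
```

For a synthetic stream whose intervals alternate between 19.5 ms and 20.5 ms, the mean is 20 ms, which looks correct on its own; only the series returned by `jitter_series`, plotted against time, shows that no cycle actually arrives on the nominal interval.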
Tools such as MATLAB, Microsoft Excel, and the Kenexis Gemini tool can be used to generate time plots. In the following two examples, the desired cyclic interval is 20 ms. In figure 2, the device produced packets delayed to approximately 40 ms at random, with no recognizable pattern, which the receiving device interpreted as missing packets. In figure 3, the device was able to transmit at approximately 19.5 and 20.5 ms, but not at exactly 20 ms. In addition, an angular pattern in the plot indicated a skew between two or more of the device’s internal clocks.
Both examples came from prototype devices during development. The data captures and traffic graphs gave the vendor valuable information it used to improve its products before production. Had these graphs been from devices in live production, the results may have indicated an anomaly in the device. That anomaly could be from a vendor development issue, but it could also be from a network performance problem or security incident.
Using network performance tools during FAT, SAT, and startup to create baselines is an important way to judge the continuing health of the network. By comparing the baseline diagrams to ones collected periodically, early indications of network performance issues can be determined before they lead to larger problems.
Whitelisting is not a new technology, but it has not gained much traction in the IT environment. Whitelisting is the process of restricting the applications and libraries that can run on a system to a previously approved list. When an application starts, it is checked against the approved list to determine whether it can run. This is the opposite of blacklisting software, such as antivirus and antimalware applications, which restricts known bad behavior. The check at startup makes starting an application slower, but once started, the application runs much faster in memory, because whitelisting checks are much less intrusive.
For IT desktop computers, whitelisting is not common, because the software on them changes regularly with the introduction of new operating system patches, software updates, antivirus signatures, etc. After each change, a system administrator would have to verify that the new applications and libraries are valid and whitelist each one of them. For more than a couple of systems, this would be too time consuming to be practical.
For systems in the ICS environment that do not change regularly and where changes have to be approved through a change management process, whitelisting makes sense. An administrator will already be going through the action of approving the specific changes. After each change has been made, the final step in the change management process would be to approve the change in the whitelisting software.
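The approve-then-check cycle can be illustrated with a hash-based approved list. This is a conceptual sketch only, not how any particular whitelisting product works: the `Whitelist` class and its methods are hypothetical, and application files are represented as byte strings to keep the example self-contained.

```python
import hashlib


def digest(binary):
    """SHA-256 digest identifying one exact version of an application or library."""
    return hashlib.sha256(binary).hexdigest()


class Whitelist:
    """Hypothetical approved list keyed by file digest."""

    def __init__(self):
        self._approved = {}

    def approve(self, label, binary):
        """Final step of change management: record the approved version."""
        self._approved[digest(binary)] = label

    def may_run(self, binary):
        """Check performed when an application starts."""
        return digest(binary) in self._approved
```

After `approve("hmi-client 1.0", old_binary)`, a call to `may_run(new_binary)` for an updated build returns `False` until the update has passed change management and been approved in turn.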
In summary, there are many things an ICS organization can do to manage its cybersecurity. In many cases, improving the reliability and uptime of the systems has much more return on investment for the organization. Security is one aspect of reliability and uptime, as are performance and safety, and many aspects of improving the performance, safety, and security of systems interrelate. Small improvements to design, architecture, monitoring and maintenance policies, and personnel responsiveness can solve many problems before they become large issues.