Increasing Output: The Same Problem in Two Worlds

Nikunj Mehta
Apr 21, 2022

Key takeaways:

Capital-intensive operations face special challenges when extending capabilities or capacity, but smarter use of data is one way to address those challenges.
Operational data can be better utilized by mapping OEE factors to specific tasks that analytics can enhance.
Time series AI and working “in the now” are important concepts in achieving value at scale.

The steel and semiconductor manufacturing industries seemingly have little in common. Yet they are both capital-intensive operations that can make use of time series AI to increase their production at a lower cost.

A steel mill and a 200mm semiconductor fab face a similar challenge: how to improve manufacturing output, whether measured in Tons Per Year (TPY) or Wafer Starts Per Month (WSPM), while saddled with a huge capital equipment footprint that cannot be readily replaced or easily expanded. Steel plants and wafer fabs both have very large built-in investments, so extending capabilities by starting over isn’t a realistic option. Adding incremental capacity is also difficult. For example, there may be no physical space; older equipment that matches the installed base may no longer be available; or harmonizing the capabilities of new equipment with the old may be a significant challenge.

The promise of Smart Manufacturing – bits are cheap but powerful

When equipment can’t be replaced, the only option is to make better use of existing resources. The inherent challenge is that older operations have been exhaustively tuned and characterized through years of constant use and maintenance. Under these conditions, getting more feels like squeezing water from a stone. As it turns out, a smarter solution is possible. Data and analytics can be a cost-effective way to increase production, but what exactly does that mean?

A good way to think about where data can help is by mapping OEE (Overall Equipment Effectiveness) factors to specific tasks that analytics can enhance. The table below lists some of these areas and the reasoning behind each.

OEE Factor	Use Case Areas
Availability (reduce planned and unplanned downtime)	(1) Anticipate significant failures to avoid high-impact events that take a long time to recover from. (2) Anticipate maintenance events with enough lead time to (a) optimally schedule limited maintenance resources, thereby minimizing recovery delays due to unexpected issues, and (b) move from scheduled to condition-based maintenance, thereby reducing total planned maintenance time. (3) Improve process understanding from data to identify improvement opportunities: (a) signal patterns, and the importance of specific signals within them, which allow SMEs (subject-matter experts) to find root causes more quickly and reduce the total time to recovery; (b) differences between the equipment with the highest and lowest maintenance needs, which point to best practices that can increase asset reliability. (4) Replace noisy alarm systems with learning systems that produce fewer but more understandable alarms, lowering the time lost to dispositioning such alerts.
Performance (reduce cycle time variability and small stops)	Detect and characterize patterns of equipment behavior to better understand and control the equipment. This understanding can be used to avoid small delays or to update rules of thumb that result in non-optimal cycle times.
Quality (increase good units out per unit time)	Increase the capture rate of quality escapes, thereby reducing the number of misprocessed parts.

For example, a steel manufacturer needed to increase production output at an existing plant. The traditional way of achieving this involved a very expensive purchase of additional casting equipment. Instead of making that capital expenditure, the plant’s leadership calculated that a relatively small increase in availability would be enough, especially if that increase could be replicated at multiple plants. They decided to focus on increasing the availability of the existing casting equipment by using data to better identify potential failures, reducing the outsized impact of unplanned downtime. While this project is ongoing, new smart approaches to finding events of interest, combined with a newly developing culture of expectations around using those approaches, have shown promising results.
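The plant’s own calculation is not given here. As a rough sketch of the arithmetic, assuming the standard OEE definition (OEE = Availability × Performance × Quality) and entirely hypothetical figures, the effect of a small availability gain can be worked out as follows. The class and function names are illustrative only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OeeFactors:
    """The three OEE factors, each expressed as a fraction between 0 and 1."""
    availability: float  # run time / planned production time
    performance: float   # actual rate / ideal rate
    quality: float       # good units / total units

    def oee(self) -> float:
        # Standard definition: OEE is the product of the three factors.
        return self.availability * self.performance * self.quality


def relative_output_gain(before: OeeFactors, after: OeeFactors) -> float:
    """Fractional change in good output for the same planned production time."""
    return after.oee() / before.oee() - 1.0


# Hypothetical figures for illustration only; not taken from any real plant.
baseline = OeeFactors(availability=0.80, performance=0.90, quality=0.95)
improved = OeeFactors(availability=0.84, performance=0.90, quality=0.95)

print(f"Baseline OEE: {baseline.oee():.3f}")
print(f"Improved OEE: {improved.oee():.3f}")
print(f"Output gain:  {relative_output_gain(baseline, improved):.1%}")
```

With these made-up numbers, raising availability from 80% to 84% lifts good output by 5% on the same equipment, because the other two factors are unchanged and output scales in proportion to availability.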

This experience was echoed by a semiconductor customer who recently discovered patterns of etch equipment behavior that indicated the impending failure of the vacuum chuck. The patterns were found in two steps. First, the time series AI engine was trained to recognize the differences between wafers with normal and abnormal after-etch CD measurements. Second, the key signals identified by the software were compared against maintenance records, which showed that chuck-related signals were involved around the time that chuck replacement was performed. These models could be used to detect future chuck failures from etcher signal data alone, reducing unplanned downtime due to failures during processing.
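The idea of finding key signals can be illustrated with a much simpler stand-in than a time series AI engine. The sketch below is not the method used in this case; it assumes each wafer has already been reduced to one summary value per signal, and it ranks signals by how far apart the normal and abnormal groups sit. The signal names and values are hypothetical.

```python
from statistics import mean, pstdev


def separation_score(normal: list[float], abnormal: list[float]) -> float:
    """Absolute difference of group means, scaled by the overall spread."""
    spread = pstdev(normal + abnormal)
    if spread == 0.0:
        return 0.0  # a constant signal cannot distinguish the groups
    return abs(mean(abnormal) - mean(normal)) / spread


def rank_signals(
    wafers: list[dict[str, float]], is_abnormal: list[bool]
) -> list[tuple[str, float]]:
    """Return (signal, score) pairs, most discriminating signal first."""
    scores = {}
    for name in wafers[0]:
        normal = [w[name] for w, bad in zip(wafers, is_abnormal) if not bad]
        abnormal = [w[name] for w, bad in zip(wafers, is_abnormal) if bad]
        scores[name] = separation_score(normal, abnormal)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical per-wafer signal summaries; names and values are invented.
wafers = [
    {"chuck_helium_flow": 10.0, "chamber_pressure": 50.0, "rf_forward_power": 300.0},
    {"chuck_helium_flow": 10.1, "chamber_pressure": 50.2, "rf_forward_power": 301.0},
    {"chuck_helium_flow": 9.9, "chamber_pressure": 49.9, "rf_forward_power": 299.0},
    {"chuck_helium_flow": 12.0, "chamber_pressure": 50.1, "rf_forward_power": 300.0},
    {"chuck_helium_flow": 12.2, "chamber_pressure": 49.8, "rf_forward_power": 300.0},
    {"chuck_helium_flow": 11.9, "chamber_pressure": 50.1, "rf_forward_power": 301.0},
]
# True marks a wafer whose after-etch CD measurement was abnormal.
is_abnormal = [False, False, False, True, True, True]

ranked = rank_signals(wafers, is_abnormal)
for name, score in ranked:
    print(f"{name}: {score:.2f}")
```

In this toy data the chuck-related signal ranks first. A ranking like this is only a starting point: as in the case above, it still has to be checked against maintenance records before anyone concludes that the chuck is the cause.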

Another semiconductor customer was able to improve quality by using time series AI to detect operational patterns in its deposition equipment. The patterns indicated that delamination risk would be high for wafers processed during that time. Affected wafers could be stripped and reprocessed to avoid wasting time and consumables involved in putting them through subsequent steps. Further, equipment and process engineers used explanation scores to understand the root cause of the delamination. With this knowledge, they were able to make changes to the process recipe that drastically reduced the occurrence of delamination.

The reality of Smart Manufacturing – Using bits effectively isn’t trivial, but there is a better way

One major challenge with Smart Manufacturing is moving from examples like these to production, that is, escaping the dreaded POC (proof-of-concept) purgatory. There is insufficient value at low scale, and getting to scale can be very difficult. In our experience, the way to overcome this barrier is to work “in the now.” This means having the operations teams (the equipment engineers, process engineers and maintenance teams) lead the project themselves. Doing so requires two important factors that are frequently missing from POCs:

Analytic tools must be fast and directly usable by reliability and process engineers.
Analyzed data should come from recent operations (as opposed to months- or years-old data).

“Analytic tools usable by non-data scientists” does not mean training engineers to be citizen data scientists. That kind of training still takes significant time, effort, and focus that the SME doesn’t have to spare. Nor does it mean abandoning the plant’s data scientists, whose skills are valuable and hard to obtain. The data science approach has definite, important applications; it is just difficult to scale. Instead, the analytic tools must be designed specifically to automate tricky or tedious tasks (like data engineering) and to simplify choices to a level that is manageable by someone with limited time for training.

Using data from recent, ongoing operations is important because SME engagement is crucial to understanding and solving problems, and SMEs are too busy to work on problems from six months or two years ago. Two-year-old data is useful for controlled experiments: it lets analysts compare many options and prepare an academic brief for the data scientists modeling the problem. It is irrelevant, however, to the people who need results most: the operations teams where the SMEs work. An entirely different approach to evaluation and learning is therefore beneficial, one that recognizes the limitations traditional POCs face in the smart, data-driven world.