Looking Down from the Clouds

05 August, 2018 02:00PM

Where It All Began

The use of generic personal computers running application software as control system HMIs (aka SCADA) is a practice that began during the 1980s and became the preferred method of human machine interface with a control system by the 1990s, to such a degree that customised user interface devices are now used only in a handful of situations, usually when ruggedness and harsh conditions drive a requirement for physically hardened screens or physical buttons that a personal computer cannot provide. The economics drove a move away from custom HMI devices, since an organisation could pay for a SCADA software licence once, then a relatively generic PC could run that application and be inexpensively replaced as needed.

As the decades have passed, the scale of control systems has also grown to a point where some control system networks equal or exceed the scale of their IT counterparts. Wherever that balance may lie for any given organisation, a divide has developed between two parts of organisations with control systems infrastructure: the control system group and the IT department. The IT department have been in cycles of insourcing and outsourcing for some time, with their usage and support patterns having common equipment, common applications, common scripts and so on. IT have been wrestling with Service Level Agreements (SLAs) with IT support companies for years now to varying degrees. Conversely the control systems groups (sometimes referred to as OT: Operational Technology) have to learn every subtle variation of the highly customised equipment they are entrusted with maintaining. In this world in-house knowledge becomes precious, outsourcing is considered too high a risk, vendor relationships are critical for survival and SLAs are few and far between.

IT Pull Ahead

As decades have passed IT departments have come under increasing pressure from cyber-security threats to keep on top of the obsolescence of their equipment. Businesses demand internet connectivity, which creates a huge attack surface for those wishing to do harm from the outside, and it must be addressed. In addition the IT sector has been offered a simplification in the last decade: cloud hosting. The old staple of IT departments, budgeting for 3, 4 or 5 year refresh cycles for their server equipment in particular, could be changed to a single flat monthly fee as an operating expense, with Microsoft Azure or Amazon Web Services (or others) hosting their virtualised machines, or through a progressive migration of their services to platforms hosted in the cloud (eg Office 365). Whilst the raw cost economics of this move are still subject to some debate, there is no doubt that this leaves IT departments with only standardised physical end point machines (laptops/desktops) to maintain and replace, which is scriptable and regularly outsourced.

This structure means IT can focus on application updates and have fewer internal networking requirements, and by signing up to an SLA with the cloud hosting provider, ensuring the availability of core IT services is no longer directly their problem. If there are outages in the cloud, they are generally brief and the business impact, whilst non-zero, is ultimately recoverable with minimal risk to the organisation or at least at a risk organisations are prepared to accept.

OT Left Behind

If control systems HMIs had evolved into customised HMI devices that weren’t PCs then there would be no discussion to be had. In that reality, there would be no choice but to pay the higher prices for the customised HMI device to operate and maintain the facility. However the server, desktop and network switch hardware used for OT is often the same as, or very similar to, that used by IT, which inevitably leads to the periodic question: can IT and OT converge?

Control systems that operate chemical plants, oil refineries, gas compression facilities and water treatment plants are designed to have a very high availability and a long service lifetime. In some cases controllers can run without incident for 30 years. PCs on the other hand tend to struggle beyond 10 years and that’s if a top of the line model is purchased initially, kept in a temperature and humidity controlled environment and fed from filtered, clean mains power for that time.

Equipment that OT groups are charged with maintaining is highly customised and the outcomes of its failure are often dire, hence there is a great level of caution about making changes without understanding all of the detail first. For this reason great care is taken using Management of Change (MoC) reviews, with technical design reviews of detail down to the code level in some cases, to ensure that no individuals or operating equipment are put at risk before a change is made. Indeed even rebooting a network switch can lead to a lack of visibility of an operational site and, depending on how well designed that site is, it may cause the site to shut down unexpectedly.

This leads to a drive away from regular updates of software, firmware and hardware, which ultimately means that the PCs that run the HMIs are rarely, if ever, updated at the hardware, firmware, operating system or application software level. When they are updated it is carefully undertaken, often running HMI PCs in parallel during an upgrade, or upgrades are only permitted during complete facility shutdowns when the risk of a loss of visibility is minimised.

Beyond these things HMIs are the windows into the controller and control system, without which no human can see the entire process at a glance. Worse than that, trends in recent decades have shown that there is a growing trust in SCADA HMIs, and more traditional local indicators on instruments like flow meters and pressure transmitters, as well as purely mechanical check gauges, are being removed from designs to cut costs, simplify and keep people physically out of dangerous areas. This trend has put an almost absolute reliance on HMI visibility for many plant components. Where in the past instrument position was carefully planned to ensure a field operator could physically “walk the boards,” this is becoming impossible to perform without the HMI at most facilities.

To address the risks posed by putting so much faith in the HMI there have been two predominant design patterns in the control system space in recent decades: redundancy of SCADA, and co-location of SCADA and controller.

Redundancy

The easiest remedy for relying too much on a single window into the system is to create a hot standby HMI. A Secondary machine runs in step with the Primary machine, ready to take over at a moment’s notice. Beyond redundancy, risk is often further reduced by using Client machines that communicate with the Server machine, whilst the Server machine handles the interaction with the controllers themselves. This means operators use the client machines, often much cheaper desktop PCs, and the servers (the most critical devices) are housed in temperature controlled environments and are increasingly being put on server grade machines chasing higher reliability. To hedge bets even further, multi-client setups are increasingly common with two redundant servers connecting to two or more client machines, some clients being specifically web-based clients for remote access, which is increasingly becoming a requirement.
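The hot standby idea can be sketched in a few lines. This is only an illustration under assumed names: ServerStatus, select_active and the 3 second heartbeat timeout are hypothetical, and real SCADA platforms implement redundancy with their own vendor-specific mechanisms.

```python
from dataclasses import dataclass


@dataclass
class ServerStatus:
    """Hypothetical record of when a SCADA server last reported in."""
    name: str
    last_heartbeat: float  # seconds, taken from a monotonic clock


def select_active(primary, secondary, now, timeout=3.0):
    """Return the name of the server that should be active.

    The Primary is preferred while its heartbeat is fresh; otherwise the
    Secondary takes over. None means both are stale (no window at all).
    The 3 second timeout is an assumed value for illustration.
    """
    if now - primary.last_heartbeat <= timeout:
        return primary.name
    if now - secondary.last_heartbeat <= timeout:
        return secondary.name
    return None


# Primary last seen 5 s ago (stale), Secondary 0.5 s ago (fresh).
primary = ServerStatus("primary", last_heartbeat=100.0)
secondary = ServerStatus("secondary", last_heartbeat=104.5)
print(select_active(primary, secondary, now=105.0))  # prints "secondary"
```

The None case is the one the rest of this article worries about: with both servers unreachable, the operator has no view of the process.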

Co-location

The larger the attack surface the more likely a disruption will occur given sufficient time. In OT networks, to reduce the risk of disrupting communication between the HMI server machine and the controllers for which that server provides the (sometimes) only window, the SCADA server is placed on the same local network as the controllers it communicates with. Often this is a closed network or subnet (for cyber-security reasons) and it is often self-contained within a physical facility with fencing around a defined boundary. Strict controls for planned digging at site and MoC are applied to changes to prevent any disruptions. Typically Uninterruptible Power Supplies continue to power the local server machines, their local client machines and controllers to ensure maximum uptime as well.

What About the Cloud

The risk of losing communication between the controller and the SCADA system is significant if you place the server equipment outside the plant boundary. Even if AWS have a relatively physically proximate instance (data center) there’s still a significant amount of copper and/or fiber between the facility and the cloud infrastructure. The organisation then must place considerable faith in the telecommunications companies to rapidly fix cable breaks, switch failures and such in an effectively completely uncontrolled environment. Beyond that, much of SCADA relies heavily on accurate time stamping of messages and near-real time command/responses between SCADA and the controller, and latency can cause erratic operation in most systems in use today. Unless the SCADA system has been developed specifically to be cloud hosted and to handle the additional latency, it is unlikely to perform reliably if it is shifted to the cloud. Of course that’s a function of time and industry pressure, and whilst more cloud SCADA platforms are becoming available coming in to 2020, migrating from incumbent SCADA software platforms to alternative platforms can be an expensive exercise even if that risk is accepted.

One size never fits all. There are however some scenarios where cloud-based SCADA might work:

Full Local Visibility: The control system is indicated completely locally as the ultimate backup for a complete loss of SCADA visibility
Full Automation: A window into the controller has been completely designed out with automation and SIS (Safety Instrumented System) taking care of the plant/process under all operational conditions
Full Diversity and Redundancy: The cloud server infrastructure is distributed through multiple paths and data centers with full redundancy between all locations and no single point of failure exists

Full Local

For simple plants, highly segmented components of a plant or simple processes, this is always a possibility. Whilst the trend is towards cost reduction and away from local operation panels, having a manual override is always a good idea even if it comes at a price. That said, the larger the facility the less likely this will be the case. Unless it is designed in to begin with, such systems are difficult and expensive to retrofit.

Full Automation

There are some highly repetitive, greatly populous engineering tasks for which a sufficient amount of time and money can be invested into full automation, since the cost of that technology can be recovered through ongoing sales volumes and design reuse. A good example of this is vehicles, from planes down to cars. The unfortunate part of control systems and OT networks is that they are almost always custom designed and built. Whilst not always the case, for the vast majority of customised plants the depth of automation stops at a line drawn between cost and risk.

In many cases HAZOPs and LOPAs are performed to determine the risk and mitigation measures to ensure that a Safety Controller (SIS) can prevent Major Accident Events (MAEs) by shutting down a system or process before the worst can happen. Ultimately this isn’t cost effective or even possible for every potential circumstance and the control system will rely on an operator at some point to intervene. The only way operators can be alerted to a condition is via the SCADA, either visually or audibly. If the window into the system isn’t there, then the operator can’t be alerted to a situation and, just as critically, can’t directly intervene to prevent an incident from occurring.

It’s true that there are some installations where control systems utilise direct paging from the controller to get an operator’s attention, however the paging systems’ lack of flexibility and their cost have ultimately driven these out of favour in recent decades. This leaves full automation highly improbable for facilities that are custom designed, which is unfortunately most of them.

Diversity

To satisfy diversity, cloud providers would need to offer a fully diverse hosting platform with multiple data centers in different parts of the same city or in different cities, each interconnected and cross-connected via different physical/geographical routes, with independent power supplies for each and with application redundancy applied geographically and forced via independent paths. The key is that there can be no common failure mode. As unlikely as it sounds, common failure modes include denial of service via network overload and a telecommunication carrier’s loss of core switching functions, which although unlikely has still occurred. The far more likely common failure mode is a large-scale brownout or blackout: there are many points between the data center and site, each of which requires power to carry the data traffic, so it doesn’t take much for this to fail.
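A rough way to see why a long chain of powered hops and a shared failure mode matter is the textbook series/parallel availability calculation. The sketch below assumes independent failures, and every number in it (ten hops, 99.9% per hop, a 99.5% shared power grid) is made up purely for illustration.

```python
from math import prod


def series_availability(availabilities):
    """All elements must work: multiply the individual availabilities."""
    return prod(availabilities)


def parallel_availability(availabilities):
    """Any one element is enough: 1 minus the chance that all fail together."""
    return 1 - prod(1 - a for a in availabilities)


# Assumed figures: ten powered hops between site and data center,
# each available 99.9% of the time.
path = series_availability([0.999] * 10)

# Two fully diverse paths of the same quality.
two_paths = parallel_availability([path, path])

# The same two paths, but both depend on one power grid (99.5% assumed).
with_shared_power = series_availability([two_paths, 0.995])

print(f"{path:.4f} {two_paths:.4f} {with_shared_power:.4f}")
# prints "0.9900 0.9999 0.9949"
```

With these assumed figures the single shared element drags the diverse pair back below the shared element’s own availability, which is why the absence of any common failure mode is the key requirement.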

A Hybrid Approach

Rather than give up entirely on the OT cloud idea, given how unlikely the above three are to hold for large scale facilities, can a hybrid approach reach a compromise? Provided there are redundant server machines co-located at the facility where the physical controllers are located, could all other server machines be cloud-based, and does that provide the best of both worlds? There will inevitably need to be physical machines, either set up as clients or as thin clients RDPing into virtualised client machines. However they’re implemented, these also need to be co-located unless the risk of letting operators use the server machines is considered to be acceptable - which is unlikely.

Beyond SCADA there is also the requirement for local programming via what is often called an EWS: Engineering WorkStation. This also needs to be locally connected to allow for programming and diagnostics to be performed on the local network and often comes with specialised software and a copy of the site’s control system code and project files. At this point co-located physical equipment is a requirement due to plant complexity and the lack of local indications and controls, and hence we are up for physical host server machines co-located at site.

If this is inevitable then we can make these servers as large as practical, potentially on a Hyper-converged architecture (eg Cisco HyperFlex or HPE SimpliVity) with multiple levels of redundancy and entry-level desktop thin clients RDPing to the VMs they host. Any machines used as data repositories and historians can handle a short-term loss of communications with simple local data buffering, so they could be moved to the cloud, and anything that operators don’t require to operate the facility on a minute by minute basis could also be migrated to the cloud. However, no matter how you look at it, the Hybrid approach can only go so far, and the cost savings come more from hyper-convergence of infrastructure, and only if the scale makes it cost-effective.
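The “simple local data buffering” that lets a historian tolerate a short communications loss is often called store-and-forward. A minimal sketch follows, assuming a hypothetical send callable that returns True when the cloud historian accepts a sample; the class name, tag name and buffer size are illustrative and not taken from any particular product.

```python
from collections import deque


class StoreAndForwardBuffer:
    """Hold samples locally while the link is down, forward them in order later."""

    def __init__(self, send, max_samples=10000):
        self._send = send  # hypothetical uplink: returns True on success
        # A bounded deque silently drops the OLDEST sample once it is full,
        # so a long outage loses data; the size is an assumed design choice.
        self._pending = deque(maxlen=max_samples)

    def record(self, timestamp, tag, value):
        """Queue a new sample, then try to drain the queue."""
        self._pending.append((timestamp, tag, value))
        self.flush()

    def flush(self):
        """Send queued samples oldest first; stop at the first failure."""
        while self._pending:
            if not self._send(self._pending[0]):
                return False
            self._pending.popleft()
        return True


# Simulated link: down for the first two samples, then restored.
link = {"up": False}
received = []


def send(sample):
    if link["up"]:
        received.append(sample)
    return link["up"]


buffer = StoreAndForwardBuffer(send)
buffer.record(1, "FT-101", 12.5)
buffer.record(2, "FT-101", 12.7)
link["up"] = True
buffer.record(3, "FT-101", 12.6)
print([t for t, _, _ in received])  # prints [1, 2, 3]
```

Samples keep their original timestamps, so the historian record is complete once the link returns. This works for historical data precisely because nobody is operating the plant from it minute by minute.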

Conclusion

As IT are looking down on OT from the Cloud above, OT are left to grapple with the issues that many IT departments have long left behind, with physical machines co-located by necessity not by choice. New plants may be designed (in some instances) from the ground up with local controls and indications and sufficient risk treatments to allow for cloud based hosting of HMIs. That said, the true end-to-end costs of such a decision need to be considered: the cost reduction from cloud hosting must be significant to justify the additional local instrumentation, design and maintenance costs and the personnel risk in the field needed for full OT cloud hosting to be done safely.

However companies that drive for IT/OT convergence at existing facilities that were never designed for it take on significant risk in attempting to push OT equipment into the cloud. Considering all of the implications is critical to success, and neglecting to do so properly can only end in failure.

WhitePaper Version

Peer Reviewed: Peyman Radnia RPEQ, FIEAust, CPEng, TUV FS Eng

Design Reviews in Name Only

05 August, 2018 01:00PM

Milestones are often tied to customer design reviews, factory acceptance tests and site acceptance tests. Having spent much of my career on the design side it’s been interesting representing a client and my perspective can’t help but be refined in the process.

Reviews are often considered to be onerous tasks that “have to be done” in order to meet an often arbitrary schedule milestone. They are regularly treated with contempt and those seeking changes are often muffled, given token concessions or even silenced completely. “Reviews for Review’s Sake” are thus essentially a questionable use of everyone’s time other than the designer’s (presumably also the meeting chair). Reviews are in place to ensure that the content is up to scratch, not just to put a tick in a box, but budgetary and schedule pressures often make them ineffective.

The decisions about who attends and the format of the review are the keys to a review’s success or failure. There are two opposing perspectives surrounding these choices. Each shall be examined in turn:

Restrict the number of people attending

Cynic: Too many people means too many people to explain the design to. Too many people means too much feedback that creates unnecessary redesign.

Optimist: Clients (or other departments) often have multiple representatives and the view of what is ACTUALLY required will vary from person to person. It’s always better to have a smaller group of client representatives to act as a focal point for all feedback.

Restrict the experience of the people invited to attend

Cynic: Attendees with less experience in the area under review, who can still ‘represent’ some component of it, are less likely to have meaningful feedback of any significance.

Optimist: Attendees with the right kind of experience in the area under review will have useful and helpful feedback and not just, “The grammar is incorrect…” type feedback.

Restrict the different types of feedback i.e. We’re not here to discuss the colour scheme, the font etc.

Cynic: Reject all feedback that doesn’t fit within the established feedback guidelines as the point of review is about something specific and not about making the overall design better.

Optimist: Noting any feedback that falls outside the feedback guidelines can still make the design better, while stopping the review from getting bogged down in less-critical details.

Minimal or closed reviews during early development stages

Cynic: Reduces outside influences during initial development and once the design is well-developed then inform those suggesting changes that it’s too late to change anything at this late stage.

Optimist: Early design development needs to be kept in-house to reduce excessive external feedback before the design is fleshed out enough for a meaningful review.

Effective Reviews

The best approach for an effective review must be considered from both perspectives: The Designer and The Reviewer.

The Designer

Don’t call the review until you have a design with an agreed level of completeness that is suitable for review. If you haven’t gone through it thoroughly yourself and perhaps had at least one other person’s informal feedback then releasing it for a full design review is likely to be premature and a waste of everyone’s time.
Invite people to the review that have knowledge about what you’re designing. Be clear in the invitation that you’re trying to keep numbers down and that only those invited should attend. Be willing to accept alternatives if your “chosen” individuals are otherwise busy on that day, but keep the numbers on the small side.
Circulate the design prior to the review amongst the attendees to allow them the chance to get caught up on the design under review.
Organise a minute-taker who is knowledgeable about the subject and preferably also involved in the design. If not, take detailed minutes yourself. Reviews need to be traceable in case future design decisions require rework and these changes must be identifiable as variations to the main contract.
Acknowledge and accept all feedback as valid, initially. If some feedback is way off-base, politely inform them where you see the disconnect and note their feedback and your response in the minutes.
Progress through the design in a methodical way. Solicit feedback on specifics one section/functional area at a time. Opening the floor up to just “any comments you like from anywhere in the design” is a recipe for a drawn-out review.
Respect the time of everyone who attends. They are taking time out of their work schedule to attend and provide comments to make your design better.
If it’s a long meeting then schedule regular breaks and stick to them. Not everyone has a bladder like a balloon and people can’t concentrate without a leg-stretch once in a while.
If it’s a long meeting keep a bowl of mints or lollies in the centre of the table so that people can keep their blood-sugar levels up during the meeting. Low blood-sugar affects concentration and can affect people’s moods, making them more or less critical.

The Reviewer

It’s easy to be nit-picky when you’re not the designer (or perhaps not ‘A’ designer at all) so keep your feedback focussed and relevant.
Respect the designer and their design where possible. Naturally if it’s a terrible design state the specific issues you have with it and why, however keep in mind that the designer is exposing their credibility and competence for all to see and potentially hack to pieces. Be kind.
Take the time to review the design BEFORE the meeting and make notes to discuss during the meeting. Pre-warming your brain to the design makes a huge difference and means you’re not wasting everyone else’s time reviewing during the review when you should be listening and interacting with other people involved.
Pay attention during the review meeting. Distractions such as phones, laptops and side-conversations mean that critical discussions may be misunderstood or missed entirely and this wastes everyone else’s time and reduces the overall effectiveness of the review.
Be thorough with your review and review comments. Design isn’t easy and you were asked to provide input on the design. Your help can provide a better end result.

Original Article

Nobody is Competent, We Are All Human

05 August, 2018 12:00PM

Engineering is about the design, construction and operation of engines, machines and structures. Engineers are bound by a code of ethics and legislation that can end careers if they are proven to be negligent in carrying out engineering work that results in death, injury or loss of property. As engineers create things that the public use (and the public at large are mostly not Engineers themselves) the quality of the engineer’s work needs to be scrutinised to ensure that it is of an acceptable standard. In essence the challenge is to prove that the work of one or more engineers is competent and, by inference, that the individual engineers are themselves competent. Whilst the following piece focuses on Engineering it applies to many other professions, if not all.

Before attempting to understand how one might measure competence, it’s good to go over the issues that we all face as humans.

1) Humans Forget

Our brain is continuously bombarded with new information, sights, sounds, smells and events that push other knowledge aside inside our minds. Sometimes the information that is pushed aside is critical to the task at hand and can either slow down or stop work until sources can be cross-referenced to confirm what was once a known fact and has since become a more vague recollection.

2) Humans Lack Focus

Our bodies require nourishment, we get tired, we get sick, we become distracted by both work and non-work related issues. In short our emotions and our stresses drive us to lose focus on the task at hand, and we sometimes have a bad day when we can’t focus at all. In every job, time (and money) is measured the same irrespective of how much focus someone has on any given day.

3) Humans Are Driven By The Need to Survive

In many developed countries, money drives people to work as it ensures survival and the ability to have those things that we would choose to have. Seldom does it ever fully satisfy, however the need to survive and to have security is a primary driving force in us all (the so-called survival instinct). It indirectly causes issues by equating time to money (usually in the far too short term), leading people to cut corners and bypass established processes to save money. The higher in the corporate chain, the bigger the financial reward for saving money. Fighting money/greed/survival driven distraction is too hard for most to manage objectively. Does one take the extra time and thoroughly recheck the design, or just send it off to be built as it is so that the manager who’s pressing a delivery deadline leaves them alone, whether that deadline is real or manufactured? People will usually cut the corner and submit the design, and mistakes creep in.

4) Not All Humans Are Equal

Some adults tell their children: “You can be anything you want to be.” In truth it should be “You can be very good at anything you have a talent for if you work hard at it,” but that’s not as easy for a child to digest, unfortunately. Not everyone has a talent for problem solving, critical thinking, a critical eye for detail or inter-personal communication. These traits in particular enable some people to be more effective engineers and hence not everyone is playing on a level playing field. Some realise this during their careers and the self-aware ones often change their career with some success. The less self-aware sometimes go into management and this is not always a bad thing so long as they don’t interfere with the activity of engineering.

5) Humans Form Relationships With Each Other

We are social animals and when survival needs are met we generally enjoy the company of others. Whether it’s to share a common complaint, regale a story or discuss the topic of the day, socialising is normal behaviour. As relationships grow, friendships can form that change the dynamic and drivers of the engineers involved with a design. Objectivity can be lost and confused when emotions affect judgement. A critical part of professional development is to provide critical feedback when mistakes are made. All too often feedback is excessively softened due to relationships between the people giving or receiving feedback for fear of offending someone with whom they’ve developed a relationship.

If we agree that the above is true then we can begin to address our own shortcomings. Before that, let’s explore the evolution of design as projects increase in size beyond the capacity for a single person to deliver.

One Man Band

Beware the sole design engineer. With no checks or reviews internally, a failure in any of the first four traits means mistakes will creep into their work. No matter how amazing they may be they are human and will make mistakes. This may be fine for a smaller project with a smaller budget where the cost of rectifying mistakes is small, but when companies invest multiple millions of dollars into a project to build a water or gas pipeline or a new manufacturing facility, it’s reasonable to expect that one engineer alone could not assure that such a massive design is flawless.

Design Check

From the sole operator we introduce a checker whose sole purpose is to confirm that all calculations and design details are accurate and correct. Who is best to check design? To be thorough it must be someone with more experience than the lead designer. This will likely add significant cost by attracting a higher hourly rate and a thorough design check will take at least 50% of the design time (in recent experience). To increase throughput we add a second design engineer now fully utilising the design checker. Percentages can be argued but the concept is that for a genuine design check, a dedicated resource needs to be accounted for and one for each discipline being designed (electrical, mechanical, process, civil etc). Clients also often have their own engineers (on larger projects) checking design documentation and such client review can often provide additional design clarifications (or scope creep).

Iterative Design

Rather than produce a single design we will now add several gates to pass through. Not literal gates of course, but checkpoints in the progress of the design. Typical values are 35%, 70% and 100% however these can vary from project to project. A great deal of the detail in a design occurs in the first two steps but the idea is to get preliminary feedback (client review) before the design is fully fleshed out. This provides multiple opportunities to review the design to catch mistakes and improve our check effectiveness. Unfortunately each step needs to be reviewed and this takes additional time for both the designers and the client. Gates are often used as payment milestones as well, breaking down the total cost of the job into smaller, more regular chunks improving cash flow for the design company and ensuring a design is delivered.

Version and Document Control

Even in smaller projects with fewer people it can be dangerous working from design documentation without version control. Being certain that the client has the correct version, and avoiding clients claiming they never received documentation for client review, is vital on larger projects, especially if milestone payments are at stake. This usually involves traditional wet signatures approving engineering design documents (even though they are subsequently scanned into soft copy afterward) and additional personnel to ensure that version numbering is adhered to and to confirm that the client and internal reviewers received the documentation for their review so the design can progress.

Contract Management

On larger projects with many terms and conditions, milestone payments and contractual conditions, a contract manager with contract law experience is employed so that the design engineers can focus on their design.

Project Management

There are few things in engineering that are as nebulous as the concept of a project manager. With document control, engineering design and design checking going on, as well as a contract manager handling financials, someone needs to keep an eye on the budget available and negotiate with the client regarding progress on the project. It’s very, very hard to find, and to be, a good project manager as they must essentially know a bit about everything that is happening. Inevitably they motivate others and remind the designers that there is a fixed amount of money left or deadlines to finish the project, so they should hurry up and finish. Sometimes this can introduce mistakes driven by trait three.

Measuring Competence

Competence By Real World Performance

In reliability engineering we learn that a test is a screen. Different test types have different test effectiveness. We can apply the same concept to a design checker by attempting to measure a checker’s “check effectiveness”. In the following two scenarios, assume the design checker is more experienced than the designer.

Scenario 1: Designer “X” is new at this and introduces 100 mistakes into every design document. If design checker “A” has a check effectiveness of 70%, then 30 mistakes will pass through the first check. Assuming there are three gates in the design, no new mistakes are added on each design cycle, and check effectiveness remains the same each time, 9 mistakes will remain after the second gate and about 3 after the third.

Scenario 2: Designer “Y” is young but talented and introduces only 27 mistakes per document. Due to budget cuts the design checker “B” is less experienced and achieves only 53.5% check effectiveness. If we calculate this through we end up with about 3 mistakes in the final design.
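The arithmetic behind both scenarios can be sketched in a few lines of Python. This is only an illustration under the assumptions already stated (three gates, no new mistakes added between gates, the same check effectiveness at every gate); the function name `residual_mistakes` is made up for this example.

```python
def residual_mistakes(initial_mistakes, check_effectiveness, gates=3):
    """Mistakes expected to survive `gates` checks.

    Assumes each gate removes the same fraction (check_effectiveness)
    of whatever mistakes remain, and that no new mistakes are added.
    """
    remaining = float(initial_mistakes)
    for _ in range(gates):
        remaining *= 1.0 - check_effectiveness
    return remaining


# Scenario 1: designer "X" (100 mistakes), checker "A" (70% effective)
scenario_1 = residual_mistakes(100, 0.70)
# Scenario 2: designer "Y" (27 mistakes), checker "B" (53.5% effective)
scenario_2 = residual_mistakes(27, 0.535)

print(f"Scenario 1: {scenario_1:.2f} mistakes remain")
print(f"Scenario 2: {scenario_2:.2f} mistakes remain")
```

Both scenarios land at roughly 2.7 mistakes, which is why the final count alone cannot tell us whether the designer or the checker did the better job.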

In real life designers and design checkers are all human and all have good and bad days - irrespective of talent or experience. Where this leaves us is with variable check effectiveness and variable design effectiveness, further muddying the waters. On a large project with large volumes of documentation the numbers should average out, but the above scenarios demonstrate a major flaw in the design check ethos: how can we prove the design checker’s competence based on their checking performance if they are paired with an excellent designer? For that matter, how do we determine the competence of the designer if the design checker is not competent?

In practice the designer regularly gets the blame for mistakes, but that is often simplistic. If a design checker is more experienced and employed specifically to check designs then ownership of missed design flaws surely must lie with the checker just as much as the designer.

Familiarity Breeds Mediocrity

In larger teams it’s natural that relationships will form between the people involved. Where relationships grow, familiarity grows and this can be good for team cohesion but can also be very dangerous, with under-performing engineers protected (to an extent) by their friendships with those elsewhere in the team. Suddenly feedback becomes less direct and more subtle, and designers don’t get the critical feedback they need to improve the quality of their work. In addition too much conversation during work periods erodes everyone’s productivity.

Age Does Not Equal Wisdom

Faced with the two scenarios described above, how would an external manager judge the performance of their staff, given that in both scenarios the designer and the design checker each blame the other for letting 3 mistakes out in the final design? As the design checker is more experienced in these examples (usually meaning older) preference goes to the older, more experienced engineer as with age and experience comes wisdom. Apparently.

An example from recent experience: two design teams from different countries are designing a pump station. Team A uses an older technology that is proven but has known inefficiencies. Team B proposes a new approach that eliminates those inefficiencies and, whilst it has a higher upfront cost, presents cost savings after only 5 years of service life. Team A have been implementing older systems for much longer and are given approval for the older design despite the fact that Team B has documented examples of the newer technology being successfully used at multiple other plants around the world.

Age and experience mean nothing if your experience is outdated or worse, you’ve just been doing it wrong all that time and no-one gave feedback to that effect. Automatically trusting the design checker is flawed reasoning, and it makes measuring their competence more critical than measuring that of the more junior staff.

Real World Performance is Only About Opinion

In the final analysis one cannot reliably measure an engineer’s competence based solely on past performance because all measures of real-world performance are based on opinion. The question is posed to the ‘senior’ engineer: “How did they perform?” Since the real world wasn’t a standardised exam there is no fair benchmark and hence there is no fair answer. Perhaps if you were to obtain enough opinions from enough qualified people you might reach a consensus among them, but the result is more likely to be confusing than conducive to a single judgement. Relying on a smaller handful of opinions that you trust is flawed, since those opinions are tainted by the relationships between yourself and those ’trusted’ people. This affects judgement, and hence the final conclusion cannot be trusted either.

Competence Based On Examination

To remove the element of emotional bias, the only true way to measure competence is by a standardised exam. Exams have either right or wrong answers (in everything except artistic pursuits) and are essentially impartial. If set up correctly the testing and marking can be done blindly (i.e. the name of the individual under test is not known to the marker) to remove any potential bias. For this reason universities, colleges and schools have used standardised testing for a very long time to determine competence when learning a new subject.

Theory vs Practical

It is easier to write an exam that is purely theory. That is to say a collection of facts that test the student’s knowledge retention on the subject - usually with no reference material. The problem is that in Engineering that is not how engineers solve problems or do most parts of their job. Whilst it is handy to know which engineering standard has which information in it, becoming a walking encyclopaedia is less and less useful given that internet search exists and that documentation is now available in soft copy that can be easily searched by keyword. In essence, engineers are tested regularly on their ability to find information and then apply it rather than being a fountain of knowledge on the trivialities of their discipline. For those reasons when it comes to proving competency, theory exams are essentially worthless.

Practical exams require that the student apply certain rules or formulae to determine various design parameters: for example, how thick does the beam need to be to support that weight, how thick does the cable need to be to carry that current and so on. To do this we refer to a series of standards and textbooks (hard or soft copies) and reference those in our calculations to provide traceability. In effect, most days in engineering design are practical exams, and practical exams should be written as though they are day to day engineering design activities. If they aren’t they’re less likely to be useful in determining competency in the practical execution of engineering.

Degrees and Certificates

Revisiting the human traits previously described suggests a problem: theory or practical knowledge gets pushed aside with the passage of time. Even if all college or university degrees had purely practical exams, in 20 years’ time the engineer with many years’ experience in the field would likely fail a great number of exams were they made to sit them again without warning. In early years out of university when applying for jobs, without much to go on employers heavily scrutinise the marks on the degree, but as time passes more care and attention is placed on recent experience. This is a big reason why.

To be clear: formal qualifications still have a place but the value that they bring is merely a snapshot in time that suggests potential ability. Some time ago an engineer obtaining their qualification having crammed for dozens of exams mostly proved they could pass and obtain a degree. Two decades later it’s common to question the relevance and usefulness of that degree as it relates to current employment, and use of that degree as a measure of competence for engineers of several years’ experience is essentially invalid.

I’m an Engineer. Yes But Which Kind?

One of the problems in Engineering as a profession is the sheer breadth of the engineering profession. Pre-industrial revolution “Engineering” was just about building roads and bridges and houses but then there were steam trains and electricity and computers and now it’s seemingly endless. By degree as an Electrical Engineer there remains Instrumentation, Control Systems, Low Voltage Electrical Design, High Voltage Electrical Design and Software just to name a small handful. Even those sub-definitions are too broad especially when you consider software. There’s real-time systems, single and multi-threaded programming, graphics, networking, firmware and driver software. Even those can be broken down even further by programming language structures such as object oriented programming and memory safe programming languages.

Saying one has a degree in “insert kind of engineering here” and that makes them competent is somewhat disingenuous irrespective of the amount of experience. Qualifications need to be specific, current with technology and relevant to the job required.

Continuing Professional Development

One solution in local industry is the Registered Professional Engineer (RPE) and the Chartered Professional Engineer (CPEng) qualification. Essentially to qualify one must have 5 years of relevant experience, write several essays about projects you’ve worked on and in what capacity, with confirmation that what you have written is correct signed off by someone who was supervising you directly who is also either a CPEng or RPE, and then hand over money to IEAust (the Institution of Engineers Australia) for the initial application and then again each subsequent year.

Once you have RPE/CPEng status you need to prove sufficient CPD (Continuing Professional Development) has been undertaken in the past year in order to maintain that qualification. They suggest training courses but also accept many different kinds of development, none of which is tested. Critically they only break down the qualification by high level discipline. In short, an electrical engineer doing LV Electrical Design can go to a training course on instrumentation, do no such work day to day in their job, and this counts as valid CPD.

10 years ago it was not a requirement to have a CPEng/RPE (the concept of the RPE was introduced in the early 2000s locally) involved in your design project, however now it has become a requirement for almost every client. It’s a requirement that the CPEng/RPE is directly involved with the design from start to finish if they are to legally sign off on the design at any stage. This seems to offer a reassurance that the engineer working on your project is competent (or their design checker is at least) but it doesn’t stop people from treating their CPD with contempt, nor does it stop people from leaving companies mid-design, with a new CPEng/RPE who wasn’t involved with the design then forced into signing off on a design they had nothing to do with. With high industry turnover, this is a regular occurrence on large projects.

Again to be clear: CPEng/RPE qualifications are better to have than to not. For the most part, Engineering practice locally is a better place as a result of their existence. The issues with the system stem from the validation of CPD not being an accurate enough method for determining ongoing competence and the entry into the ranks is subjective in the first place. There are better ways but these are more costly to administer.

To Summarise

Humans forget and lack focus: cover this by design checking and multiple design stages to increase the probability that mistakes will be found. The fact that humans form relationships can be tackled by employing design checkers from outside of the team to examine work with no emotional bias. Businesses need to focus on quality as well as cost, and this is a more difficult subject for another time. Finally, not all humans are created equal and designers as well as design checkers should be tested to ensure their competence. If design checkers check multiple engineers’ work it is vital that they have their own competency validated regularly and thoroughly.

Whilst we currently only need to rely on CPEng/RPE qualifications, without regulations and client demands to go beyond that, all responsible design organisations need to take additional steps to ensure the quality of their design checker staff and inevitably their competence.

All Is Not Lost

Proving competence is a balancing act between regular, detailed examination and its cost and the ensuing frustration of engineers having to re-prove their capabilities. Exams should be a regular occurrence to ensure ongoing competence, with different practical questions each cycle. The cycle time between exams is likely to be a subject for debate, as clearly there is an inherent cost overhead in preparing, conducting and marking the exams. Since each exam is a snapshot of capability at a given moment in time, holding them more than 12 months apart will diminish their value.

Exams must also be split by specific discipline. In other words, to be a design checker on a project with High Voltage design, the checker must have currently assessed competence by exam for that specific subject. This would mean that engineers would need to choose the strands they wish to be qualified for since it is conceivable that too many exams would lead to too little time actually reviewing and earning money for the company. Hence a limit would need to be set as to how many strands an engineer could take in certain circumstances.

The exam questions would need to be unique each year and not recycled, and should be set by personnel that are not related to the department under test. Without an external standards body like IEAust taking ownership of such a system it may be advisable to employ one external consultant to create these questions and another to review them. The key is to remove as much bias as possible and create examinations by true peers in the engineering field under test. Budgetary constraints would likely restrict the layers of separation required to ensure minimal bias. Inevitably a governing body would be the better approach and would ensure better question control and consistency.

Of course there are always the same examination risks such as foreknowledge and other methods of cheating, however the test environment should be the same as that which designers face every day - in other words fully open book with full access to the internet and all required, applicable standards in soft or hard copy available for reference. Not a memory test.

From these results companies could then maintain a competency matrix showing areas where they are strong and weak and resource accordingly. Many companies have such matrices, however they have all been self-assessed which inevitably leads to egotistical, untrustworthy data.

That Amount of Testing Is Over The Top Isn’t It?

Suggesting adding more layers of regulation seems like building a competency bureaucracy, however these steps will surely improve the profession and reduce the number of less-competent under-performers. That said the scale of the problem is huge. There are examples of million dollar pipes that started rusting due to poor Cathodic Protection design, treatment plants that were massively undersized and overloaded when first turned on after their ‘upgrade,’ and power supplies specified that couldn’t possibly power their own load, just to name a few. These were all design errors that should have been caught during design review. Worse than that, these cases had all the measures like multiple gates, design checks, independent reviews and client reviews, and mistakes that were very costly to rectify still made it through.

The goal of this breakdown is to explore how we can stop this from happening.

Justification of Balance

The sad truth is that large companies are primarily interested in making money and in the risk of losing it: if they weren’t they wouldn’t exist for long after all. No lives were lost and no-one was injured in the design errors cited above. The cost of rolling out such an exam qualification program would be a guaranteed, ongoing expense for the company and whilst it may improve design check effectiveness significantly, humans aren’t perfect and a mistake could still make it through. There are no guarantees of perfection. If it costs $10 million over 10 years to run a qualification program and a replacement pipe costs $1 million then the company is ahead without the competency program as an ongoing expense. The numbers may well be baseless estimates, however this is how larger companies generally think.
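That line of thinking can be written down as a simple break-even check. The sketch below uses only the illustrative figures from the paragraph above (which are themselves rough estimates), and the function name `break_even_failures` is made up for this example. It deliberately ignores reputational damage, which is much harder to put a number on.

```python
def break_even_failures(program_cost, cost_per_failure):
    """Number of failures the program must prevent to pay for itself."""
    return program_cost / cost_per_failure


program_cost = 10_000_000  # qualification program over 10 years (illustrative)
pipe_cost = 1_000_000      # one replacement pipe (illustrative)

failures_needed = break_even_failures(program_cost, pipe_cost)
print(f"Failures to prevent over the period: {failures_needed:.0f}")
```

On these numbers the program would need to prevent ten such failures in ten years before it paid for itself on direct rectification cost alone.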

The biggest problem with reputation is that it is intangible making it difficult to genuinely assess relative risk as a result of damage to it. The other issue is that although reputation is not greatly expensive to build, it is time-consuming. With the average tenure of high-level executives in companies today being so short, long-term reputational damage generally isn’t their primary concern.

Companies are made up of people and those people determine the success or failure of that company, and despite this it’s still easier to think of companies as being either “good” or “bad” as a whole. The thinking is that if a company performs badly on a project then the company (not the individual designers) would be blacklisted and get a bad reputation. Sometimes this is just on an individual client basis but other times this bad reputation can leak out into the industry and across different markets. The bad designer(s) or bad design checker(s) on the offending project may well be sacked but then the company carries the reputational burden long after their departure.

Reputation-driven Companies Will Test Their Competencies

If we accept that nobody is competent all of the time and stop relying on the established methods of assessing competence then things can improve. Companies that truly care about their engineering reputation will take additional measures to ensure the on-going competence of their key engineering employees. If correctly balanced against cost they can still remain competitive and in time that will mean more good engineers are attracted to the company, they will win more work as a result of this and will ultimately prosper.

WhitePaper Version

Peer Reviewed: Peyman Radnia RPEQ, FIEAust, CPEng, TUV FS Eng

Original Article Published: 5th July, 2013
