General thoughts about patching

Patching. I’m not writing a travel report about visiting Patching near Worthing in the United Kingdom. It’s the question of the amount of patching and the timing of patching. First and foremost the timing of patching. Or: the question of proactive versus reactive patching. A topic as contested as the war over the best editor in the last century, which was clearly won by vi …
This blog entry comes at an interesting point in time: I had been working on this article for a while when, last weekend, the story about WannaCry went through all the media channels. This included listening to and watching a lot of “experts” in the media that put you straight into “fremdschäm” mode. I don’t know what some of these people do for money, but judging by their words, computer security is an unlikely profession for them. Nevertheless: in the end this event is a classic example showing that the timing of patching is a pivotal factor in preventing you from being susceptible to an attack. A good example to show people why patching, and fast patching, is important.

I don’t want to write in this article about all the points that make the case of the vulnerability behind WannaCry special: an issue in one of the most widely used products out there, long known to government actors but never made public, that fell into the hands of criminals after being used by agencies unable to keep their tool secret. Given that some stories link the event to North Korea, it seems reality is not without a sense of irony.

The story is interesting in itself, as all the tin foil hats out there always assume that agencies force backdoors into software. Now we know that they just ride piggy-back on the inevitable occurrence of bugs in software. Perfectly plausible deniability. We didn’t introduce it, we just used it and didn’t report it.

This is interesting, but I think many others will dig into this topic, so I won’t comment on it. I want to write about the mitigation of such problems by keeping your systems current. Or to say it differently: about patching your systems.

My blog entry is about patching in general. About keeping your systems current. It’s not about Solaris in particular, not even about Oracle products in general. Given how many devices around us contain software, it’s a problem for everyone out there. It’s something to think about for your internet-enabled coffee machine or fridge, your heating control system. Or your internet-enabled orange juice machine.

However, this article is influenced by my own experience working with customers to solve their business problems on Solaris. So a lot of these considerations come from sitting right in the middle of the area of conflict between the various stakeholders.

Patch early, patch often

In principle there are two major camps: the “Patch early, patch often” camp on one side and the, what I call, “stable state” faction on the other side.

There are many installations that work with the first model. As soon as a patch arrives, it goes into the process of being introduced into production. Perhaps not directly into production, but the process starts right at the publication of the patch.

The differences between customers are often in the process. I’ve seen a lot of variants here, from “Hey, no bad news about this patch for a week on the support portal of my software vendor. Rollout time. Let’s patch the unimportant systems first, then production. Today.” to elaborate processes with test, preproduction, Production A and Production B and a rolling upgrade of all systems over a span of several weeks.

But both have something in common: the basic idea is to get bug fixes into the system as soon as possible, by starting the process as soon as possible and by starting it at all.

Well, the extreme in this regard is probably my notebook and my other Apple devices. I’m pretty much on the “Cool, there has been a new software version for five minutes already, I’m late with updating” schedule. So far I haven’t been hit by any problems. But such problems are the reasoning behind the second model.

Stable state

The second model is much more conservative: as soon as a system has gone through the acceptance process into production, it is almost never patched, except when an error occurs that needs to be fixed. The system is frozen. Every little patch and every little configuration change is a change fought through various change boards and stakeholder meetings even before going through the customer’s quality assurance. A fight that is best fought with a CVE number or a recent availability issue in mind to speed things up.

In this case the minimum necessary amount of patches is applied to the system. Let’s call it the “stable state/minimum change” model. When the stable state is lost to an external disturbing force, like a suddenly appearing performance, availability or security issue, the least amount of energy is put into the system to bring it back into the stable state. Sometimes security issues are not even seen as a disturbing force, as security problems are assumed to be mitigated by other technical means surrounding the system. I think this line of thought is dangerous, because you cannot afford a weak link in the chain, but this is the way people think.

The basic assumption is: the system proved to work in the acceptance test, and the longer the time since the acceptance test and the larger the change, the larger the risk of problems that weren’t there before. So whenever you hit a problem or an issue, you just drop in enough patches to get rid of it. Often this is backed by anecdotal evidence of a patch that went bad or was revoked or something like that.

This approach is not without merit. Patches are software, software is developed by humans, humans are not without errors, errors in software are bugs, no matter how well you test it, and thus code changes may introduce errors. Assuming that patches are an exception to this would be unwarranted by reality. There is a chance; it may be almost 0%, but it is never 0%. It’s like 0 Kelvin: you can specify a target probability of 0% for fscking things up, but you can’t reach it.

Of course you can counter the risk of introducing new bugs by limiting the patch to the minimal change needed to counter the bug. There is a high probability that you won’t introduce any new errors if the fix is on the level of “Dang, why is this line of code commented out?”, but that ends when you have to revise the code to a large extent to solve more complex errors.

In my experience the “stable state” model is most often not introduced by the technology side. Most technical people are aware of the importance of keeping a system up to date. Often the non-technical stakeholders introduce such a concept because they look at the problem from a different perspective, because of this never-0% thing I just described two paragraphs ago. And it’s a partly understandable one: the perception of non-technical people is that it’s running fine now, so why should they risk change when at the current moment everything looks to be running perfectly. Or to get back to the energy analogy: the system is perceived to be in a stable state, and kept in a stable state by the assumption that this perception is correct.

However: in reality you have incurred a significant amount of technical debt by not patching the system. You can be thrown into the hot water of needing an emergency patch day as soon as someone shatters your assumption that the state of your system is stable and healthy.

However, the tactic of “no fixes until it breaks” generates an impression of stability, as in “Hey, no planned or unplanned outages for a long time”. The most obvious sign: the “stable state” model leads to machines with spectacular uptimes.

When I was young I thought “coooool … that’s a demonstration of stability”. And it is indeed: the system was able to cope with the load for a long time without any hiccup. But nowadays I think “Are you fscking insane, that system is totally downrev”, which translates into a customer-facing “In the light of the multitude of factors like adherence to security best practices and a proactive (BINGO) management of the inevitable issues in products, a different patching model may be a better choice for your system in order to fulfill all business objectives a modern IT operation has to fulfill for all stakeholders (DOUBLE-BINGO)”.

But we have traded the currency of the system, with all its implications, for that stability and availability. I get a little bit cautious when I see an uptime of more than 90 or 180 days. Not only because of patching, but also because nobody has checked for that long whether all the configuration changes have really been made boot-persistent.
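A tiny script can make this caution actionable by flagging the suspects. This is only a minimal sketch, assuming hosts reachable via ssh where a Linux-style uptime -s prints the boot time; the host names and the 90-day threshold are made up for illustration.

#!/usr/bin/env python3
# Flag hosts whose uptime suggests they haven't been patched (and rebooted) for a while.
import subprocess
from datetime import datetime

HOSTS = ["web01", "db01", "app01"]   # hypothetical inventory
THRESHOLD_DAYS = 90                  # pick your own comfort level

def uptime_days(host: str) -> float:
    # "uptime -s" prints the boot time, e.g. "2017-01-12 03:14:07"
    out = subprocess.check_output(["ssh", host, "uptime", "-s"], text=True).strip()
    boot = datetime.strptime(out, "%Y-%m-%d %H:%M:%S")
    return (datetime.now() - boot).total_seconds() / 86400

for host in HOSTS:
    days = uptime_days(host)
    if days > THRESHOLD_DAYS:
        print(f"{host}: up {days:.0f} days -- check patch level and boot persistence")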

And the in-between

And of course there is the large in-between. Environments that want to patch often and fast but simply don’t have the people, and thus the knowledge or the time, to do it. Or where other heaps of technical debt look more important. But for the sake of the length of this article I won’t concentrate on them, besides admitting that I know reality isn’t as simple as I paint it by describing only two models of patching.

Consensus

However, why is it so hard to come to a consensus on when to patch, for what reason and at what scale? Because everyone is looking at the problem from a different perspective, with different objectives.

The security guys define nanoseconds as the time between the availability of a security patch and the moment they would like to see that patch running on a system. I think live patching was mainly born out of this thinking, the assumption being that if you don’t interrupt the business, the business will accept it and we get security fixes into the system really fast. But a change is a change even when you apply it live, and while for some organizations a change is okay as long as no interruption to operations is customer-visible, for others the change in and by itself is problematic and the interruption to operations is only of secondary interest.

The business stakeholders want their business running, because they are the ones who have to explain to their users why a service isn’t available and who take fire when it isn’t. And the normal users have minimal interest in the details of IT operations as long as they arrive at their workplace to a perfectly running IT environment. And they are right, users shouldn’t have to think about it. It’s our job as IT people to think about it. However, the business stakeholders get flak from management when a system suddenly behaves differently or stops. So they are hesitant to allow patches, as nobody can guarantee them that problems won’t happen, especially as “should not happen, with a level of confidence an angstroem short of 100%” is sometimes not enough confidence for the business stakeholder. Combined with the impression that everything is fine, we end up with a lack of allowed changes, as people whose only interest is that the service just keeps working are naturally not interested in risks to this state that may be hidden in a change.

The admins have to do their job of keeping the systems running in a secure manner, without any performance or availability issues. And these different perspectives make life harder for everyone. The admins are somewhat stuck in the middle of the situation.

I could probably name a number of additional stakeholders depending on your company.

Another totally different perspective

Interestingly, there are also the people who think they don’t have to think about patching at all, as they just throw away their environments and create freshly minted new ones.

But those are not the focus of this blog entry, and they would justify a long article of their own, starting with the point that both models of patching are applicable here as well, just on much bigger chunks. You patch such installations by replacing one large blob with another large blob. But the considerations are essentially the same, because you have to create the large blob and you have to decide why you need a new blob. I think this approach has its own set of advantages, but it has disadvantages as well.

Recipe for disaster

As a side note: from my point of view, the stable state model is the model of appliances. Who really updates his or her satellite receiver? Only when you hit a bug do you look at the website to see whether there is a new firmware (or your satellite receiver does this for you). Who has ever heard someone say “Yeah, a software update came out for my washing machine, I think I’ll install it”, then tell their husband/wife “I need a patch window on the washing machine”, followed by the discussion “Can you guarantee me that the washing machine will behave exactly as before? Do you have a roll-back strategy for when the patch wrecks the washing machine? Why can’t you do the update while the washing machine is running?” I’ve never heard anyone talk like this, albeit I must admit it would be quite an interesting relationship.

Nevertheless: it is the same mindset that led to the video tape recorders blinking “00:00”. People just didn’t invest the time and thought to put the system into a proper operating state. In my opinion, a proper operating state includes keeping the software of your appliance up to date. How can we expect those people to ensure that the software in their TV set is always current, based on the idea that, if they don’t, their TV set may be involved in DDoSing Facebook while they are watching a Finnish experimental movie about the history of muck-hills, with Russian subtitles, German commentary and an expert discussion afterwards on ARTE? From their perspective the television set is working perfectly fine. It produces sound and moving pictures. These are the same people who kept their video recorders at 0:00 from unboxing to trashcan.

This behaviour wasn’t that dangerous when all these appliances were isolated in the basement or in the living room. I have had a washing machine with a software update capability advertised on its front since 2001, and the device has received zero updates. Because it had no wifi and no ethernet port, this was okay. The appliance did its job and thus everything was fine. And when it didn’t do its job and colored my white shirts a nice pink, it was my error, because I hadn’t properly sanitized the input of the appliance.

However, more and more of these devices get a TCP/IP stack. More and more run a full-fledged operating system, sometimes barely hardened and minimized. The success of Linux in the appliance area essentially means that you get a kind of Unix into your home. The same goes for the OSes derived from BSD. Just consider that a common household will have UNIX systems by the dozen in a few years. Ikea has started to sell networked lightbulbs (okay, they aren’t using TCP/IP, but still), which you could consider an important step towards mass market adoption. Everything in the home is starting to be networked. I have moved my appliances into their own /24 network because address space was getting a little bit tight recently.

Companies have admins to handle their Unix systems, working full time on them. At home you have only yourself and the spare time left after work and family. I’m really quite opinionated that we need forced updates for devices and a guaranteed supply of security patches for at least five years.

Why automatic and forced? Because if you don’t force it, nobody will do it, as nobody willingly tries to negotiate a maintenance window for the television with all stakeholders in the family (“No, you can’t patch it on Sunday morning, we will miss a mediocre computer-animated cartoon”) and invests the time. I really think that when you force the updates, users will adapt, and given all the data those devices collect about our behaviour, it should be quite easy for them to find a minimally disruptive moment for the update.

Currently, with the cumbersome upgrade procedure of downloading a patch and uploading it via a web form, we might as well tell the customer to do an aptitude update/aptitude upgrade on a shell, with the same level of acceptance. Just by the way … I would really like to update my devices that way … then I could just shell script it.
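Just to illustrate what I mean by scripting it, here is a sketch of that wishful thinking in Python. It assumes the appliances exposed an ssh login and a Debian-style package manager, which of course they don’t; the host names are invented for illustration.

#!/usr/bin/env python3
# Wishful thinking: patching my home devices like any other Unix box.
import subprocess

DEVICES = ["tv.home", "fridge.home", "receiver.home"]  # hypothetical inhabitants of my /24

for device in DEVICES:
    print(f"=== patching {device} ===")
    # non-interactive update/upgrade, as one would do on any Debian-ish system
    subprocess.run(
        ["ssh", device, "sudo apt-get update && sudo apt-get -y upgrade"],
        check=False,  # a bricked coffee machine shouldn't stop the fridge from getting its patches
    )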

But in general we need automatic and forced updates for everything that even gets near a default gateway to the internet, or we will never be able to stop large attacks from happening.

We are in a world where we see CVEs for dishwasher appliances, like CVE-2017-7240. I’m pretty sure that the internet will one day be brought down by hordes of appliances and IoT devices not patched by home users, and I often joke that the only thing between us and a nuclear holocaust like in Terminator 3 is the fact that the systems controlling the missiles are older than most things out there … including possibly 90% of all developers still active today.

But I’m digressing. Especially as forced automated updates are out of the question for servers.

My thoughts

I have a strong opinion on the topic of patching. It’s my personal opinion; you may ask someone else and get a totally different one. Possibly our patching experts inside Oracle have another opinion. I’m not a patching process expert, as I don’t concentrate on patching. I just work with the systems that are created by patching or by the absence of patching. But this is my personal blog, and this is my professional but still personal opinion. Just in case you’d like to hear it.

I’m a strong proponent of the “Patch early/Patch often” model, backed by a process of testing the patches as they move through the classes of systems.

This opinion arises from a number of thoughts, stemming primarily from the fact that I consider the reasoning behind the model of keeping the state of the system stable by only patching when something is broken to be fundamentally flawed. Not because the reasoning of the concept is wrong, but because an important assumption behind it is incorrect.

I think the “stable state” model has some basic shortfalls. In this part I will totally ignore the security aspect, which would somewhat shortcut the discussion altogether: today, from a security perspective alone, no matter what device or system you are using, the “don’t patch until it breaks” or “stable state” model is out of the question and won’t cut it any longer. You may just jump to the “From the security perspective” section now; I will write about it later.

But the stable state model has shortfalls even without factoring security into the equation. I think the stable state model assumes that your load is in a stable state as well. The stable state model assumes that the probability of hitting a bug is static and basically zero as soon as you have put your system through the acceptance tests and thus proven that your load doesn’t trigger any bugs.

Let’s just assume for the sake of the argument that you can perfectly simulate the production load on the test/dev/QA systems. This is in and by itself not an easy feat, but let’s assume you have such a load simulation.

This may be correct for a machine that processes at most a fixed number of events, limited for example by the physical construction of the machine the system is controlling: the central management of a railway control center, or the MRI at your radiologist’s office.

You know that it always does the same thing. You know the maximum load exactly, and you probably operate at high risk when something fails: a worker may be decapitated by a robot running amok, or one train may crash into another. And when you have a magnet weighing hundreds of kilograms rotating around you at high speed, you want to be very sure that a patch a technician applied doesn’t do any harm. And you don’t want the imaging diagnosed incorrectly just because someone dropped in a wrong library delivering incorrect results.

But on the other hand: in these rare cases you can test the load exactly. You are forced by regulation to test the load exactly. I think the stable state model can be a good one in such situations, if you can mitigate the security implications of a system that isn’t patched up to date.

But the loads that are a good match for the stable state model are much rarer than the occurrences of the stable state model in the wild would suggest. I have seen the stable state model in places where no MRI, no robot and no rail traffic control was connected to the system.

The reality is that most loads on servers are different and totally unsuitable for a stable state patching model. Load is something transient, something fluid. It changes all the time. Loads grow over time, loads use more and more subsystems in different ways. You use the system differently than you anticipated when you did the acceptance tests. Your software developer may forget to tell you that he or she changed something in the software you are using and is thus exerting a completely different load on your system.

And there are always bugs and issues and problems whose probability of occurring increases with the load you put on the system. For example, locking in your application: it works well enough with your current load, but may have disastrous results when the load increases slightly. I think everyone has written code that looked good on their desktop but was a disaster when used multithreaded.
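To make that point a bit more tangible, here is a minimal sketch of such a lock problem in Python: one coarse lock that nobody notices at low concurrency, while at higher thread counts all the additional threads just queue up behind it and the aggregate throughput stays flat. The numbers and the fake “work” are made up for illustration.

#!/usr/bin/env python3
# One coarse lock: harmless under light load, a serialization point under heavy load.
import threading
import time

counter = 0
lock = threading.Lock()

def worker(iterations: int) -> None:
    global counter
    for _ in range(iterations):
        with lock:                 # every thread serializes on this single lock
            counter += 1
            time.sleep(0.0001)     # stand-in for real work done while holding the lock

def run(threads: int, iterations: int = 200) -> float:
    global counter
    counter = 0
    pool = [threading.Thread(target=worker, args=(iterations,)) for _ in range(threads)]
    start = time.time()
    for t in pool:
        t.start()
    for t in pool:
        t.join()
    return time.time() - start

# Adding threads doesn't add throughput: the lock caps it, the queue just gets longer.
for n in (2, 8, 32):
    elapsed = run(n)
    print(f"{n:3d} threads: {n * 200 / elapsed:8.0f} increments/s")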

The assumption that your application is an exception to this is, in my experience, often unwarranted. The assumption that the acceptance tests of an environment have covered each and every situation is often unwarranted.

I have been called into too many performance escalations that were root-caused, for example, to locks in the application slowly going into congestion over time. Where the admins told me “nothing changed”, but when you looked at the historical data the customer had recorded, the load wasn’t the same game, it wasn’t the same ballpark, and it wasn’t even the same sport as at the acceptance test.

So the basic assumption of the stable state model is somewhat shaky. Well, not only shaky. The assumption that the probability of running into an issue is constant, and constantly zero if you haven’t hit it in the past, is wrong if the load isn’t constant.

I have another proof point that the assumption of a stable state is incorrect most of the time: I don’t know how many angry calls I have gotten about a problem, just to find out that the corresponding problem had been patched six months or a year ago. But this implies: the system ran without problems with the issue in the code. It kept running after the issue was discovered and after the issue was fixed. But suddenly the issue hits you. Something must have changed. And almost universally it isn’t just the time that has passed.

When they hit a problem that had been patched a year earlier, the customers were hit by perfectly preventable unplanned downtime. In their attempt to reduce the risk of changes, they increased the risk by not fixing known problems.

In the end, the basic assumption of the stable state model is the same as that of the crude old tactic of sweeping mines by letting people run over them until they hit one. Effective, maybe, but extremely cruel, and you generate a lot of debris and a lot of unsatisfied people using your system.

Fearing unknown errors

The other basic assumption in the reasoning behind the “stable state” model is more or less that the change will introduce an unknown error leading to an availability issue, or that the system won’t work exactly as before. I understand this in a way. In every product lifecycle there may have been a bad patch, a buggy firmware version, a new piece of code that showed different issues with different use cases, a new piece of code that behaved a little bit differently. We all have stories like that. I once bricked a device with a bad patch. Yeah, this happens.

But basically this is still a somewhat strange assumption and a strange argument against patching: you assume a bug or issue in the patch that neither you nor anybody else knows about, because otherwise the bug or issue would have been known, reported and fixed. On the other side, you know about a lot of bugs and issues in your current system. You know about them because you can simply look into the release notes of your patch, of the new firmware version and so on.

So the question is more or less: why are you more afraid of the assumed issues than of the known ones? Documented and very real issues (obviously a software company wouldn’t fix them if they didn’t exist) for which you can seldom professionally quantify the risk of being hit. It’s perhaps a simple task when you can just say that you don’t have the hardware that a driver with a patched issue supports, but for anything else it’s significantly harder to conclude from the description of the problem that you can’t be affected by the issue, and thus don’t need the fix, when it sits in an arcane subsystem of your operating system, database, video recorder, text processor or disk firmware.

For this reason I simply don’t believe that “don’t fix it if it isn’t broken” is a valid strategy, because you just assume that your system is not broken, against the knowledge of the bugs you have gained from the release notes.

Adapt to change

I think it’s quite important to have procedures in place that allow change; that reduces the risk of change. For example, a process for getting patches into production by moving them through the different classes of systems, from dev to production. At a certain point in time you will hit the problem anyway, or you will be haunted by a CVE, and then you have to do the change anyway, but without proper preparation and process, in a fire drill.
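To give an idea of what such a process can look like, here is a minimal sketch of a staged rollout in Python. The class names, host lists and soak times are invented for illustration; the real process and its tooling will look different in every shop.

#!/usr/bin/env python3
# A staged patch rollout: walk a patch through the classes of systems, stop on trouble.
from dataclasses import dataclass
from typing import List

@dataclass
class Stage:
    name: str
    hosts: List[str]
    soak_days: int   # how long to watch this class before moving on

ROLLOUT = [
    Stage("dev",           ["dev01", "dev02"],     soak_days=2),
    Stage("test",          ["test01"],             soak_days=5),
    Stage("preproduction", ["preprod01"],          soak_days=7),
    Stage("production A",  ["prodA01", "prodA02"], soak_days=7),
    Stage("production B",  ["prodB01", "prodB02"], soak_days=0),
]

def stage_is_healthy(stage: Stage) -> bool:
    # Placeholder: hook up your monitoring, smoke tests or ticket system here.
    return True

def roll_out(patch_id: str) -> None:
    for stage in ROLLOUT:
        print(f"applying {patch_id} to {stage.name}: {', '.join(stage.hosts)}")
        # ... apply the patch, run the smoke tests, watch the monitoring ...
        if not stage_is_healthy(stage):
            print(f"problems in {stage.name} -- halting rollout of {patch_id}")
            return
        print(f"soaking for {stage.soak_days} day(s) before the next class")
    print(f"{patch_id} rolled out everywhere")

roll_out("patch-2017-05")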

Of course there is always the discussion that some systems can’t be updated because you need the software on them and it isn’t available for a more current environment with patches available. But then you should really ask yourself why you have painted yourself into the corner of using software that only runs in an unsupported environment. The reality, however, is that almost everyone has such a system in the data center; but then you should use tactics like isolation, or keep the software running in VMs or on servers that you only power up when you really need it.

From the security perspective

There is a different reason why you should patch fast and get rid of the “as long as it’s not broken, don’t fix it” model. From my point of view this reason is even more important.

The 2008 Verizon DBIR (Link) already stated on page 15 that for 71% of the breaches exploiting a known vulnerability, the patch had been available for over a year, and that 0% of the breaches involved a vulnerability whose patch had been available for a month or less. This led to the following statement:

For the overwhelming majority of attacks exploiting known vulnerabilities, the patch had been available for months prior to the breach. This is clearly illustrated in Figure 12. Also worthy of mention is that no breaches were caused by exploits of vulnerabilities patched within a month or less of the attack. This strongly suggests that a patch deployment strategy focusing on coverage and consistency is far more effective at preventing data breaches than “fire drills” attempting to patch particular systems as soon as patches are released.



The 2016 Verizon DBIR (Link) went even further. The report stated that practically all CVEs have an available exploit within a year (see the section “From Pub to Pwn” on page 16). So when you don’t apply a patch that fixes a CVE, it is almost guaranteed that you will significantly increase your attack surface over time.

By the way: even more telling about the state of unpatched vulnerabilities is the point that there are still exploits in use that target vulnerabilities from the last century.

It seems that many people don’t patch their systems even in the light of really nasty security problems. I would just like to point to the Heartbleed Report (2017-01). Three years have gone by and still 200,000+ systems are susceptible to this problem. And I don’t want to know how many systems are hidden behind firewalls and still sporting this security issue.

And the conclusion?

I think in a world where essentially all systems are connected to some network, we can’t afford to have systems that aren’t patched or that are kept in a minimum change model. We need to patch systems at least to a level that fixes all CVEs (remember, after a year almost 100% of them have an exploit), and we have to keep systems at least at a level where we could install any CVE-addressing patch without starting a massive project to upgrade the system, including software updates. And when you really have applications that can’t be updated at all, you need to keep them in an intensive care unit: monitor them more rigidly than other systems, isolate them more rigidly than other systems, and keep the number of such systems as low as possible. And WannaCry is perhaps a good event for convincing stakeholders of such procedures.

And I’m not even talking about the patches necessary to fix non-security-related issues. But the security-related implications alone should be enough to get someone onto a patch-early/patch-often schedule.