Outage / Fault - Network outage
PMO REF : PMO-10420
Service Area(s) : Network Core, Network Transit
Start Date : 10/01/2018 13:27
End Date : 10/01/2018 14:30
Risk : Medium - Plan for the Impact to happen
Impact : Medium - Brief or minor service disruption
Status : Service Restored
Overview - An Internet transit outage and network core re-routing have occurred, and routing table updates are in progress.
Actions and Event History
11/01/2018 13:59 - Fault Closure Report
Two separate issues occurred within a short period. They may be related, with the first possibly triggering the second. The first issue was at the Manchester datacentre, where a routine configuration change inadvertently reversed changes made the previous evening as part of an upgrade. This caused a network-wide recalculation of routes across the MPLS core.
This took down several key links within the infrastructure (initially appearing to be Vodafone failures). The configuration issue was spotted immediately and the configuration was reinstated. The outage was limited, and most services were restored and operational within several minutes. Where devices disconnected (e.g. broadband), additional downtime was incurred while those devices waited before reconnecting.
Once the network had stabilised, the London core router chassis reported a failure of its redundant routing engine. The chassis dropped all connections and re-established them, causing another network-wide MPLS-level recalculation.
This transmission outage lasted around 20 seconds, after which data traffic continued and calls for voice customers on Ethernet connections will have resumed (with only a brief period of silence). Other connections may have rebooted again, delaying the restoration of service.
This is the first incident of this scale since the new core network went into service more than 3 years ago, and all restoration SLAs were met. All backup and tertiary connections and services failed over and restored service within acceptable times for public IP failover.
We are examining the London core chassis and may, as a precaution, replace the redundant routing engine even though the chassis has returned it to service as good. Details of any works will be published in advance, and the works will be performed outside normal hours.
10/01/2018 14:35 - The fault is now being closed, as all links have recovered and routing has returned to its normal routes and connections. The move back onto the circuits caused no disruption, as this can be done gracefully.
10/01/2018 14:14 - We are leaving the fault open for the moment but downgrading it to degraded service until we are sure that stability has returned. We are awaiting a report from Vodafone on what occurred.
10/01/2018 14:13 - Following investigation, it appears that we have lost multiple 10G core links provided via Vodafone. Other providers are not affected, but due to the size and locations of these links, major BGP core routing changes were triggered, including an MPLS core rebuild and re-routing. The routing tables currently hold around 1.5 million routes, and the re-routing and subsequent rebuilding of the mesh to route around the failures is taking a few minutes to complete. The links have dropped several times, so several re-routes have occurred. The latest re-route appears to have fully cleared, and all locations and destinations should once again be available.
10/01/2018 13:59 - Correction: the issue has returned; we are investigating.
10/01/2018 13:47 - There was a blip in overall network connectivity lasting approximately 5 minutes at around 1:30pm. We are still identifying the cause; however, it appears that everything has recovered on its own.
* Risk and Impact attributes were introduced recently, so some planned maintenance and outage records may not contain values for risk or impact.
Internet Central regrets any inconvenience this essential maintenance may cause our customers. Should you wish to discuss this matter further, please do not hesitate to contact our Support Desk on 01782 667766. Please quote the above PMO ticket reference, as this will enable us to respond more easily.
Internet Central encourages all customers to subscribe to our Network Status RSS feed: /maintenance-outage-rss/
Tip: If you are using the Google Chrome browser, you will need to add an RSS feed reader plugin first.