How a coding error caused a Rogers outage that left millions without service

People use wifi at the Fairview Mall in Toronto on July 8. Yader Guzman/The Globe and Mail

Rogers Communications Inc. engineers began the sixth step of a seven-step process to upgrade the core infrastructure that supports the company’s wireless and broadband networks at 2:27 a.m. on July 8.

Two hours and 16 minutes later, a coding error was introduced, setting off a cascade of events that led to a massive outage and left millions of Canadians without cellphone, Internet or home phone service for at least a day.

The shutdown of one of Canada’s dominant telecommunications networks created widespread chaos. Rogers failed to deliver four emergency alerts to its wireless customers in Saskatchewan, including three tornado warnings and one dangerous person report.

Rogers customers were unable to call 911, and Interac’s debit system was also affected, causing problems for consumers and businesses alike. In Toronto, the outage forced Canadian singer-songwriter The Weeknd to postpone a concert scheduled for that evening at the Rogers Centre.

At first, even Rogers itself wasn’t sure what was causing the outage. But weeks later, in a detailed submission responding to questions from the Canadian Radio-television and Telecommunications Commission, the company gave a full account of its version of events.

Those documents, which the CRTC made public in redacted form on Friday, provide new details about the outage and an early look at the facts Rogers executives will cite on Monday, when they are expected to testify about the incident at a public hearing of the House of Commons Standing Committee on Industry and Technology.

Like many of its peers, Rogers currently has one core network that supports all the services it provides. The core is essentially the brain of the network. It receives, processes, transmits and connects all voice, wireless data, Internet and TV traffic.

The telco began the seven-phase core upgrade process back in February, following what the company described in a CRTC filing as a comprehensive planning process that included budget and project approvals, risk assessment and testing.

The first five phases went smoothly. But at 4:43 a.m. on July 8, a piece of code was introduced that deleted a routing filter. In telecommunications networks, data packets are directed by devices called routers, and filters prevent those routers from becoming overloaded by limiting the number of possible routes presented to them.

Deleting the filter caused all possible routes to the Internet to be passed to the routers, exceeding the memory and processing capacity of several of the devices. This caused the core network to shut down.
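The role of the filter can be illustrated with a toy model. The sketch below is not Rogers’ configuration or any vendor’s software; the `Router` class, the capacity figure and the prefixes are all hypothetical, chosen only to show how a device that holds a limited number of routes behaves with and without a filter in front of it.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, Optional


@dataclass
class Router:
    """Hypothetical router that can hold only `capacity` routes in memory."""
    name: str
    capacity: int
    # Optional predicate deciding which advertised routes are accepted.
    route_filter: Optional[Callable[[str], bool]] = None
    table: set = field(default_factory=set)
    crashed: bool = False

    def receive(self, routes: Iterable[str]) -> None:
        for prefix in routes:
            if self.crashed:
                return
            # With a filter in place, routes it rejects never reach the table.
            if self.route_filter is not None and not self.route_filter(prefix):
                continue
            self.table.add(prefix)
            # Exceeding memory takes the device down (illustrative behaviour).
            if len(self.table) > self.capacity:
                self.crashed = True
                self.table.clear()


def internet_routes(n: int) -> list:
    """Generate n made-up prefixes standing in for routes to the Internet."""
    return [f"10.{i // 256}.{i % 256}.0/24" for i in range(n)]


def only_internal(prefix: str) -> bool:
    """Illustrative filter: accept only a small, known block of prefixes."""
    return prefix.startswith("10.0.")


if __name__ == "__main__":
    filtered = Router("core-a", capacity=300, route_filter=only_internal)
    filtered.receive(internet_routes(1000))
    print(filtered.name, "crashed:", filtered.crashed, "routes:", len(filtered.table))

    unfiltered = Router("core-b", capacity=300)  # filter deleted
    unfiltered.receive(internet_routes(1000))
    print(unfiltered.name, "crashed:", unfiltered.crashed)
```

In this model the filtered router accepts 256 of the 1,000 advertised routes and stays up, while the unfiltered one runs past its 300-route limit and fails. Real routers differ in how they respond to overload, which is the vendor difference the filings point to.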

Rogers uses equipment from different manufacturers in the core of its network, and the two vendors from which the company buys routers have different designs and approaches to managing traffic and protecting equipment from overload. Those differences are at the heart of the outage Rogers experienced, company filings say.

But in the early hours, the company’s technicians had not yet determined the cause of the crash. Rogers apparently considered the possibility that its networks had been attacked by cybercriminals. At 6 a.m., Jorge Fernandes, who was the company’s chief technology officer at the time, called his counterparts at Telus Corp. and BCE Inc.’s Bell Canada to inform them of the outage and warn them to watch for cyberattacks, the company says in its filing.

Although Bell and Telus offered to help, Rogers quickly decided that it would not be able to transfer its customers to its competitors’ networks because some elements of Rogers’ network, such as its centralized customer database, were unavailable as a result of the outage. In any case, rival networks would not have been able to handle the sudden spike in traffic from Rogers’ 10.2 million wireless subscribers, the telco said.

Mr. Fernandes was in Portugal when the outage began and immediately started preparing to return to Canada, according to two sources familiar with his whereabouts. The Globe is not identifying the sources because they were not authorized to speak publicly on the matter.

Meanwhile, Rogers’ network team gathered at the company’s network operations center in Brampton, Ontario, restored network access and began trying to figure out the cause of the outage.

To communicate with each other and coordinate recovery efforts, some employees began swapping out their SIM cards for Bell or Telus SIM cards they had received in 2015 as part of a contingency plan created between the wireless carriers.

It wasn’t until 8:54 a.m. — roughly four hours after the outage began — that the company publicly acknowledged the situation. “We know how important it is for our customers to stay connected,” the telco tweeted via its customer service account. “We are aware of the issues currently affecting our networks and our teams are fully committed to resolving the issue as soon as possible. We will continue to keep you updated as we have more information to share.”

The company’s disclosures to the CRTC indicate that the delayed response may be related to problems logging into online accounts used to communicate with customers. The telco said going forward it will ensure its crisis response teams have alternative methods of accessing social media accounts that are protected by two-factor authentication linked to Rogers devices.

It took the network team all day to restore the network. They had to shut down the equipment causing the problem, reroute traffic and confirm network stability before slowly bringing services back online. The process had to be done methodically to prevent overloading the network and causing another outage, the company said.

“Our wireless services are beginning to be restored and our technical teams are working hard to get everyone back online as quickly as possible,” the company tweeted shortly before 10 p.m.

The next morning, Rogers announced that it had restored service to the “vast majority” of its customers. But intermittent problems continued throughout the weekend.

In an open letter to customers on Sunday, Rogers CEO Tony Staffieri pledged to invest more in testing, monitoring and artificial intelligence to improve the reliability of the company’s networks. He put the cost of the changes at about $10 billion over three years.

The wireless giant will also physically separate its wireless and cable core networks to ensure that any future outages do not affect both services, Mr. Staffieri said.

The company last week replaced Mr. Fernandes, a former Vodafone chief, with veteran telecoms executive Ron McKenzie. Mr. McKenzie was previously president of Rogers for Business, the division that offers wireless and Internet services to corporate customers.

Mr. McKenzie will begin his new role by appearing before the House of Commons committee investigating the outage. The committee, which is made up of MPs from the four major federal parties, is expected to question him, Mr. Staffieri and Rogers’ chief regulatory officer Ted Woodhead about the five-day billing credit the company is offering to compensate its customers for the outage. The committee may also ask about network and operational changes the telco plans to make to prevent future outages.

As all this is happening, Rogers is awaiting regulatory approval for its controversial $26 billion takeover of Shaw Communications Inc., ahead of a July 31 deadline. The Competition Bureau is trying to block the merger, arguing it will lead to worse service and higher prices for mobile phone customers.
