Automated Driving: Scalable Data Analysis in a DevOps Framework

DevOps shifts the focus to software as a value-adding product

Quirin Kögl, IT consultant, works as a DevOps engineer in a customer project on automated driving, where he and his colleagues follow a DevOps approach. In this interview, Martin Scholz, principal IT consultant, and Quirin Kögl discuss two tasks, incident management and alerting, and explain why DevOps shifts the focus to software as a value-adding product.

Martin Scholz: Quirin, what is this DevOps project about?

Quirin Kögl: We are developing data products for highly automated driving for our customer. We continuously analyze the events reported by the vehicle fleet and aggregate them into information for the vehicles and into products for external users. One example is our road works service, which supports vehicles during highly automated driving on the road. With highly automated driving, the responsibility lies with the vehicle and no longer with the driver. In this way, customers and users can experience highly automated driving.

Martin Scholz: Why did the customer want to implement this as a DevOps project?

Quirin Kögl: The specialist department wants a single point of contact for both the development and the operation of the products. That's why we work directly with the specialist department, without an IT department being involved in operations. We only received our Azure subscriptions from the IT department. Development, provisioning, infrastructure and operation are entirely our responsibility, as the customer requested.

Martin Scholz: DevOps is a generic term with a broad meaning. Let’s focus on incident management: When an incident occurs, how are you notified about it? Who among you responds?

Quirin Kögl: For all of our services, we have central mailboxes as official communication channels. In the event of incidents, we receive automated alerts there, that is, e-mails sent by the system. To ensure that nothing gets lost in the mailboxes, we have set up a rolling ops responsibility within the team (two people). This responsibility includes regularly checking the mailbox and passing on new items, such as new incidents, to the entire team.

Martin Scholz: How do you make sure that everything relevant gets documented?

Quirin Kögl: We create a ticket for every incident to prevent any loss of information. All the necessary information is documented in this ticket. In addition, we have an online documentation page in Confluence for each product, which provides an overview of all incidents. Each entry links to the respective ticket. We use this overview as a knowledge base.

For each product, we have documented the first steps for analyzing and troubleshooting the different components. This way, each team member can address problems in any product, even one they do not normally work on. In addition, communication with the customer takes place only via mailboxes that every team member can view. This way, no information is lost if someone is absent.

Martin Scholz: Your project is very well organized and has an overview of the current incidents at all times. Metrics and alerts are essential to know the current status of the products in live operation. How do you determine these metrics for incidents and system messages to fix malfunctions quickly?

Quirin Kögl: We work with two different customer Azure environments. The first is where we run our big data applications and most of our Azure resources. There, we use Azure’s own services to display the technical and functional metrics of our products and components in several product-specific dashboards, so we can see their status at a glance. In the second environment, provided by the customer, our interfaces to the vehicle backend run as microservices. Here, we rely on the logging system provided by the customer, in which we define alerts and dashboards based on Micrometer metrics. In addition, we use Azure services that are approved for use with the data. On this basis, we set up various alerts that automatically inform us when a product misbehaves.

Martin Scholz: When is an alert useful to you?

Quirin Kögl: Our goal is to use alerts to learn about potential problems in our products as quickly as possible, and before users do. Alerts need to be useful and not overused, so that the important issues don't get drowned out by the noise. Therefore, we have defined the following questions in the project, which we answer for each alert:

What happens if we ignore an alert?
If we can ignore the alert and nothing bad happens, then we don’t need the alert.

How can we analyze the problem that occurred?
Documentation helps us to get to the cause of the problem faster, especially if the problem occurs rarely.

How can we fix the problem that occurred?
This information saves us from working out the solution from scratch every time, so we get back to normal operation faster. If the solution is to restart the affected product, we can also automate that.
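
As an illustration only, the three questions could be recorded alongside each alert so that an alert without answers is easy to spot. The sketch below is not the project's tooling; the class `AlertDefinition`, its field names and the `auto_remediation` field are assumptions made for this example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class AlertDefinition:
    """Hypothetical record of one alert plus the answers to the three questions."""

    name: str
    impact_if_ignored: str  # What happens if we ignore the alert?
    analysis_steps: str     # How can we analyze the problem that occurred?
    fix_steps: str          # How can we fix the problem that occurred?
    auto_remediation: Optional[str] = None  # e.g. "restart" if the fix can be automated

    def missing_answers(self) -> List[str]:
        """Return the names of the questions that have no (non-blank) answer yet."""
        answers = {
            "impact_if_ignored": self.impact_if_ignored,
            "analysis_steps": self.analysis_steps,
            "fix_steps": self.fix_steps,
        }
        return [key for key, value in answers.items() if not value.strip()]

    def is_justified(self) -> bool:
        """Treat an alert as justified only when all three questions are answered."""
        return not self.missing_answers()
```

An alert whose `impact_if_ignored` is empty would then show up in `missing_answers()`, which mirrors the rule that an alert nobody would miss is not needed.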

Martin Scholz: I guess you started with zero alerts. How many alerts do you currently have? In what cases did you decide to create new alerts and do you ever delete alerts?

Quirin Kögl: We started with a few standard alerts, such as heartbeat alerts, which check the availability of the application. We are now at close to a hundred alerts across all products. We mostly create alerts iteratively: when a new issue comes up and we expect that it may occur again, we add an alert for it.
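
The idea behind a heartbeat alert can be sketched in a few lines: a service reports in regularly, and the alert fires when the last report is older than a tolerated silence. This is a minimal sketch under assumptions; the function names and the five-minute default are invented for illustration, and in the project such alerts are defined in the monitoring systems described above rather than in custom code.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List


def heartbeat_missed(
    last_heartbeat: datetime,
    now: datetime,
    max_silence: timedelta = timedelta(minutes=5),  # assumed threshold
) -> bool:
    """Return True if the last heartbeat is older than the tolerated silence."""
    return now - last_heartbeat > max_silence


def silent_services(
    last_seen: Dict[str, datetime],
    now: datetime,
    max_silence: timedelta = timedelta(minutes=5),
) -> List[str]:
    """Return the sorted names of all services whose heartbeat is overdue."""
    return sorted(
        name
        for name, timestamp in last_seen.items()
        if heartbeat_missed(timestamp, now, max_silence)
    )


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    last_seen = {
        "rest-service": now - timedelta(minutes=1),    # hypothetical service names
        "spark-pipeline": now - timedelta(minutes=12),
    }
    print(silent_services(last_seen, now))
```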

Recently, six alerts were actually triggered. However, the problem was not with us, but with various interface partners. We see potential here to automatically forward such issues to the affected product, if the non-availability of a neighboring system affects us.
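
One possible reading of this forwarding idea is a small routing step: if a triggered alert can be attributed to a neighboring system, the notification also goes to that system's contact. The sketch below is hypothetical throughout; the class `TriggeredAlert`, the function `route_alert`, the dependency name and the mailbox addresses are assumptions, not part of the project.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class TriggeredAlert:
    """Hypothetical representation of an alert that has fired."""

    name: str
    product: str
    failing_dependency: Optional[str]  # None if the cause lies in our own product


def route_alert(
    alert: TriggeredAlert,
    dependency_contacts: Dict[str, str],
    own_mailbox: str,
) -> List[str]:
    """Return the recipients for an alert.

    The team's own mailbox is always notified. If the alert names a failing
    neighboring system with a known contact, that contact is notified as well.
    """
    recipients = [own_mailbox]
    if alert.failing_dependency is not None:
        contact = dependency_contacts.get(alert.failing_dependency)
        if contact is not None:
            recipients.append(contact)
    return recipients


if __name__ == "__main__":
    contacts = {"vehicle-backend": "backend-team@example.org"}  # assumed mapping
    alert = TriggeredAlert("map-fetch-failed", "topology-map", "vehicle-backend")
    print(route_alert(alert, contacts, "ops@example.org"))
```

Keeping the own mailbox in every recipient list preserves the central mailbox as the single place where all incidents arrive.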

Martin Scholz: Can you give an example of a central service in the products?

Quirin Kögl: We are talking about products that can each consist of several components. For us, these are interface services or distributed computing pipelines – in our case, Spark pipelines. One important product is our Topology Map. This is the base map on which both the route clearances for highly automated driving and, for example, recognized road works are stored. The product consists of a Spark pipeline that generates the map on a regular basis and a REST service that provides the map to external systems.

Martin Scholz: What is the conclusion from this project?

Quirin Kögl: We have achieved a lot. We have a fully automated CI/CD process and provision our infrastructure, monitoring and alerts automatically with Terraform (infrastructure as code). This high level of automation results in fewer errors and allows us to focus more on product development with the customer.

As a result, we have gained a lot of trust from the customer. Through this trust, we take on a lot of responsibility. In return, however, we are also given the freedom and authority to live up to this responsibility. For example, we take on the functional design of the products.

Martin Scholz: Thank you for your insight into the complex topic of DevOps using incidents and alerting as an example. In the next part of the interview, we want to talk about efficient team organization of operations in a large DevOps project. I’m looking forward to it.