SpaceX Software Lead Reveals Starlink Details in Reddit AMA

SpaceX Director of Starlink Software Matt Monson revealed some new details about the company’s mysterious Starlink constellation during an “Ask Me Anything” (AMA) session on Reddit on June 6. Joined by several colleagues who worked on the Crew Dragon mission, Monson said that the Low-Earth Orbit (LEO) constellation runs on “a ton of software to make it work,” and that improvements in software could have a huge impact on the quality of service provided by the constellation.

Monson also addressed cybersecurity concerns and explained why SpaceX is trying to reduce the amount of data each Starlink satellite has to transmit while scaling up the size of the constellation itself. When asked whether the satellites would be able to communicate with each other over laser links (which President and COO Gwynne Shotwell said in a CNN interview would happen later this year), Monson did not answer.

There has been active discussion in the industry about Starlink’s potential inter-satellite links, ground systems, and ability to reduce latency. Monson provided some clues by detailing the work of his software engineering team. Via Satellite has “unrolled” Monson’s comments about Starlink from the discussion and compiled them below. (Note: these questions were asked by public participants in the Reddit AMA.)

With a few hundred Starlink satellites in orbit, are there parts of individual satellite or constellation-related operations that you’ve come to realize are not well covered in testing?

Monson: For Starlink, we need to think of our satellites more like servers in a data center than special one-of-a-kind vehicles. There are some things that we need to be absolutely sure of (commanding, software update, power and hardware safety), and therefore deserve to have specific test cases around. But there’s also a lot of things we can be more flexible about — for these things we can take an approach that’s more similar to the way that web services are developed. We can deploy a test build to a small subset of our vehicles, and then compare how it performs against the rest of the fleet. If it doesn’t do what we want, we can tweak it and try again before merging it. If we see a problem when rolling it out, we can pause, roll back, and try again. This is a hugely powerful change in how we think about space vehicles, and is absolutely critical to being able to iterate quickly on our system.

We’ve definitely found places where our test cases had holes. Having hundreds of satellites in space 24/7 will find edge cases in every system, and will mean that you see the crazy edges of the bell curve. The important thing is to be confident about the core that keeps the hardware safe, tells you about the problem, and then gives you time to recover. We’ve had many instances where a satellite on orbit had a failure we’d never even conceived of before, but was able to keep itself safe long enough for us to debug it, figure out a fix or a workaround, and push up a software update. And yes, we do a lot of custom ASIC development work on the Starlink project.
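The staged rollout Monson describes (deploy a test build to a small subset, compare it against the rest of the fleet, then promote or roll back) resembles the canary deployments used for web services. The sketch below is only an illustration of that general pattern, not SpaceX's tooling: the function names, the one-in-twenty canary share, the per-vehicle "score" metric, and the tolerance are all assumptions made for the example.

```python
from statistics import mean


def pick_canaries(fleet, one_in=20):
    """Pick a small, deterministic subset of vehicle IDs for the test build.

    `one_in=20` (5% of the fleet) is an assumed value for illustration.
    """
    count = max(1, len(fleet) // one_in)
    return sorted(fleet)[:count]


def canary_decision(canary_scores, baseline_scores, tolerance=0.02):
    """Compare canaries against the rest of the fleet.

    Scores are a hypothetical per-vehicle health metric (higher is better).
    Returns "promote" if the canaries are no worse than the baseline
    by more than `tolerance`, otherwise "rollback".
    """
    if mean(canary_scores) >= mean(baseline_scores) - tolerance:
        return "promote"
    return "rollback"


# Hypothetical fleet of 400 vehicles.
fleet = [f"sat-{i:03d}" for i in range(400)]
canaries = pick_canaries(fleet)

print(len(canaries), "canaries, first:", canaries[0])
print(canary_decision([0.97, 0.98], [0.97, 0.97]))  # canaries match baseline
print(canary_decision([0.80, 0.82], [0.97, 0.97]))  # canaries clearly worse
```

A real system would compare many metrics and would keep the safety-critical core Monson mentions (commanding, software update, power and hardware safety) outside this looser process.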

How did creating the Crew Display software affect the development of the Starlink interface for SpaceX operations (map views, data visualizations, etc.)?

Monson: The tech from the crew displays (especially the map and alerts) formed the basis of our UI for the first couple Starlink satellites (Tintin). It’s grown a ton since then, but it was awesome to see Bob and Doug using something that somehow felt familiar to us too.

What level of rigor is being put into Starlink security? How can we, as normal citizens, become comfortable with the idea of a private company flying thousands of internet satellites in a way that’s safe enough for them to not be remote controlled by a bad actor?

Monson: In general with security, there are many layers to this. For starters, we designed the system to use end-to-end encryption for our users’ data, to make breaking into a satellite or gateway less useful to an attacker who wants to intercept communications. Every piece of hardware in our system (satellites, gateways, user terminals) is designed to only run software signed by us, so that even if an attacker breaks in, they won’t be able to gain a permanent foothold. And then we harden the insides of the system (including services in our data centers) to make it harder for an exploited vulnerability in one area to be leveraged somewhere else. We’re continuing to work hard to ensure our overall system is properly hardened, and still have a lot of work ahead of us (we’re hiring!), but it’s something we take very seriously.

I am sure there are tons of redundancy strategies you guys implemented. Care to share some?

Monson: On Starlink, we’ve designed the system so that satellites will quickly passively deorbit due to atmospheric drag in the case of failure (though we fight hard to actively deorbit them if possible). We still have some redundancy inside the vehicle, where it is easy and makes sense, but we primarily trust in having system-level fault tolerance: multiple satellites in view that can serve a user. Launching more satellites is our core competency, so we generally use that kind of fault tolerance wherever we can, and it allows us to provide even better service most of the time when there aren’t problems.
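System-level fault tolerance of this kind can be pictured as a selection problem: if several satellites are in view, a failed one is simply skipped. The following is a minimal sketch under stated assumptions, not a description of how Starlink actually assigns users. The `Satellite` fields, the elevation-based ranking, and the 25-degree minimum elevation are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Satellite:
    name: str
    elevation_deg: float  # assumed: angle above the user's horizon
    healthy: bool         # assumed: result of some health check


def choose_serving_satellite(
    in_view: List[Satellite], min_elevation_deg: float = 25.0
) -> Optional[Satellite]:
    """Return the highest healthy satellite in view, or None if there is none.

    Unhealthy satellites are skipped, so a single failure does not
    interrupt service as long as another satellite is usable.
    """
    candidates = [
        s for s in in_view
        if s.healthy and s.elevation_deg >= min_elevation_deg
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda s: s.elevation_deg)


in_view = [
    Satellite("A", elevation_deg=70.0, healthy=False),  # best geometry, but failed
    Satellite("B", elevation_deg=55.0, healthy=True),
    Satellite("C", elevation_deg=30.0, healthy=True),
]
chosen = choose_serving_satellite(in_view)
print(chosen.name if chosen else "no coverage")
```

When every satellite is healthy, the same pool of candidates gives the user more and better options, which matches Monson's point that this approach also improves service when nothing is wrong.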

What’s the amount of telemetry (in GBs) you usually get from Starlink? Do you run some machine learning and/or data analysis tools on it?

Monson: For Starlink, we’re currently generating more than 5 TB a day of data! We’re actively reducing the amount each device sends, but we’re also rapidly scaling up the number of satellites (and users) in the system. As far as analysis goes, doing the detection of problems onboard is one of the best ways to reduce how much telemetry we need to send and store (only send it when it’s interesting). The alerting system we use for this is shared between Starlink and Dragon.

For some level of scope on Starlink, each launch of 60 satellites contains more than 4,000 Linux computers. The constellation has more than 30,000 Linux nodes (and more than 6,000 microcontrollers) in space right now. And because we share a lot of our Linux platform infrastructure with Falcon and Dragon, they get the benefit of our more than 180 vehicle-years of on-orbit test time.
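Monson's point about onboard detection ("only send it when it's interesting") can be illustrated with a simple filter that drops nominal samples before they are transmitted. This is a toy sketch with assumed names, thresholds, and data. It is not the alerting system shared by Starlink and Dragon, whose design is not described in the AMA.

```python
def filter_telemetry(samples, low, high):
    """Keep only (timestamp, value) samples outside the nominal band [low, high]."""
    return [(t, v) for (t, v) in samples if v < low or v > high]


# Hypothetical bus-voltage readings as (seconds, volts).
readings = [(0, 28.1), (1, 28.0), (2, 31.7), (3, 27.9), (4, 24.2)]

# The nominal band of 26.0 to 30.0 volts is an assumption for this example.
alerts = filter_telemetry(readings, low=26.0, high=30.0)

print(alerts)
print(f"sent {len(alerts)} of {len(readings)} samples")
```

Only the out-of-band readings would be sent to the ground in this sketch; everything else is discarded onboard, which is how detection at the source reduces both downlink and storage.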

How different are the development experience and the rate of change on production software between the rarely flown, NASA-scrutinized Dragon and Starlink?

Monson: The tools and concepts are the same, and many of the engineers on the team have worked on both projects (myself included), but being our own customer on Starlink allows us to do things a bit differently.

How often do you remotely upgrade the software on the satellites that are in orbit?

Monson: The Starlink hardware is quite flexible – it takes a ton of software to make it work, and small improvements in the software can have a huge impact on the quality of service we provide and the number of people we can serve. On this kind of project, pace of innovation is everything. We’ve spent a bunch of time making it easier, safer, and faster to update our constellation. We tend to update the software running on all the Starlink satellites about once a week, with a bunch of smaller test deployments happening as well. By the time we launch a batch of satellites, they’re usually on a build that is already older than what’s on the rest of the constellation! Our ground services are a big part of this story as well – they’re a huge part of making the system work, and we tend to deploy them a couple times a week or more.

Are Starlink satellites programmed to de-orbit themselves in case they aren’t able to communicate back for a given amount of time?

Monson: The satellites are programmed to go into a high-drag state if they haven’t heard from the ground in a long time. This lets atmospheric drag pull them down in a very predictable way.
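The behavior Monson describes is a form of watchdog timer: if no ground contact arrives within some window, the vehicle switches to a safe configuration by itself. The sketch below shows the general idea only. The class name, the mode strings, the seven-day timeout, and the choice to latch the high-drag mode until operators intervene are all assumptions, since the AMA gives no such details.

```python
class GroundContactWatchdog:
    """Switch to a high-drag mode after a long silence from the ground."""

    def __init__(self, timeout_s, start_s=0.0):
        self.timeout_s = timeout_s
        self.last_contact_s = start_s
        self.mode = "nominal"

    def heard_from_ground(self, now_s):
        """Record a ground contact; this restarts the silence timer."""
        self.last_contact_s = now_s

    def tick(self, now_s):
        """Periodic check. Latches "high_drag" once the timeout is exceeded."""
        if now_s - self.last_contact_s > self.timeout_s:
            # Assumption: leaving this mode requires operator action,
            # so the sketch never switches back automatically.
            self.mode = "high_drag"
        return self.mode


DAY_S = 24 * 60 * 60
watchdog = GroundContactWatchdog(timeout_s=7 * DAY_S)  # 7 days is an assumed value

watchdog.heard_from_ground(now_s=1 * DAY_S)
print(watchdog.tick(now_s=5 * DAY_S))  # 4 days of silence: within the timeout
print(watchdog.tick(now_s=9 * DAY_S))  # 8 days of silence: timeout exceeded
```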

The full AMA with Monson’s colleagues about the Crew Dragon mission can be read here.