On October 14, 2011, Apple released the new iPhone 4S. One of its major new features was Siri, a personal assistant application. Siri uses natural language processing to interact with the user.
Interestingly, Apple explained that Siri works by sending data to a remote server (that’s probably why Siri only works over 3G or WiFi). As soon as we could get our hands on the new iPhone 4S, we decided to take a sneak peek at how it really works.
Today, we managed to crack open Siri’s protocol. As a result, we are able to use Siri’s recognition engine from any device. Yes, that means anyone could now write an Android app that uses the real Siri! Or use Siri on an iPad! And we’re going to share this know-how with you.
The best demo is probably Siri’s speech-to-text feature. We made a simple recording of us saying “Applidium vous souhaite une bonne journée”, and got a perfect result!
This sound sample never went through any iPhone, but nonetheless we got Siri to analyze it for us.
Understanding the protocol – A brief technical history
At Applidium we’re used to building mobile applications. The best way to chat with a remote server is HTTP, as it’s the protocol that is the most likely to work in any case.
The easiest way to sniff HTTP traffic is to set up a proxy server, configure your iPhone to use it, and look at what goes through the proxy. Surprisingly, when we did, we couldn’t capture any traffic when using Siri. So we resorted to using tcpdump on a network gateway, and we realised Siri’s traffic was TCP, on port 443, to a server at 22.214.171.124. Opening https://126.96.36.199/ on a desktop machine, we noticed that this server was presenting a certificate for guzzoni.apple.com. So it seemed like Siri was communicating with a server named guzzoni.apple.com over HTTPS.
As you know, the “S” in HTTPS stands for “secure”: all traffic between a client and an HTTPS server is encrypted, so we couldn’t read it using a sniffer. In that case, the simplest solution is to fake an HTTPS server, use a fake DNS server, and see what the incoming requests are. Unfortunately, the people behind Siri did things right: they check that guzzoni’s certificate is valid, so you cannot fake it. Well… they did check that it was valid, but the thing is, you can add your own “root certificate”, which lets you mark any certificate you want as valid.
So basically all we had to do was to set up a custom SSL certification authority, add it to our iPhone 4S, and use it to sign our very own certificate for a fake “guzzoni.apple.com”. And it worked: Siri was sending commands to our own HTTPS server! Seems like someone at Apple missed something!
That’s when we realised how opaque Siri’s protocol is. Let’s have a look at a Siri HTTP request. The request’s body is binary (we’ll get into that later), and here are the headers:
ACE /ace HTTP/1.0
Host: guzzoni.apple.com
User-Agent: Assistant(iPhone/iPhone4,1; iPhone OS/5.0/9A334) Ace/1.0
Content-Length: 2000000000
X-Ace-Host: 4620a9aa-88f4-4ac1-a49d-e2012910921
A few interesting things:
- The request uses a custom “ACE” method, instead of a more usual GET.
- The URL requested is “/ace”.
- The Content-Length is nearly 2GB, which obviously does not conform to the HTTP standard.
- X-Ace-Host is some form of GUID. After trying with several iPhone 4Ses, it seems to be tied to the actual device (pretty much like a UDID).
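Because of the non-standard method, an off-the-shelf HTTP server may refuse such a request, so when experimenting it can be simpler to split the request head by hand. Here is a minimal sketch in Python, assuming the head has already been captured as bytes. The names parse_ace_head and SAMPLE_HEAD are ours, for illustration only; they are not part of Siri or of any library.

```python
def parse_ace_head(head: bytes):
    """Split a raw request head into (method, path, version, headers)."""
    lines = head.decode("iso-8859-1").split("\r\n")
    method, path, version = lines[0].split(" ", 2)
    headers = {}
    for line in lines[1:]:
        if not line:  # blank line: end of the headers, the binary body follows
            break
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return method, path, version, headers


# The headers quoted in this article, reassembled with CRLF line endings.
SAMPLE_HEAD = (
    b"ACE /ace HTTP/1.0\r\n"
    b"Host: guzzoni.apple.com\r\n"
    b"User-Agent: Assistant(iPhone/iPhone4,1; iPhone OS/5.0/9A334) Ace/1.0\r\n"
    b"Content-Length: 2000000000\r\n"
    b"X-Ace-Host: 4620a9aa-88f4-4ac1-a49d-e2012910921\r\n"
    b"\r\n"
)

method, path, version, headers = parse_ace_head(SAMPLE_HEAD)
```

Header names are lower-cased so that lookups such as headers["x-ace-host"] do not depend on the capitalisation used by the client.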
Now let’s move on to the body. The body is some raw binary content. When we first looked at it with a hex editor, we noticed it started with 0xAACCEE. Oh, that looks like a header! Unfortunately, we couldn’t understand anything of what came after that.
That’s when we took some time to think. As people who are used to designing mobile applications, we know there’s one thing which is very important when talking over a network: compression. The bandwidth is often limited, so it’s usually a very good idea to compress your data. And what is the most ubiquitous compression library around? zlib (http://zlib.net/). It’s a very solid library, really efficient and powerful (makes sense, it’s half French!). So we tried to pipe that binary data through zlib. But nothing came out: we were missing a zlib header. That’s when we thought “hmm, so there’s already this AACCEE header in the request body. Maybe there’s some more?”. We developers like to keep things packed. 3 bytes is not a good length for a header. 4 would be. So we tried unzipping after the 4th byte. And it worked!
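The trick described above can be sketched in a few lines of Python using the standard zlib module. This is only an illustration under the assumptions stated in the text (a 0xAACCEE marker, then one more byte, then a zlib stream). The names inflate_ace_body and fake_body are ours, and the value of the fourth byte in the synthetic sample is an arbitrary placeholder.

```python
import zlib

ACE_MAGIC = bytes.fromhex("aaccee")  # the three bytes seen at the start of the body
HEADER_LEN = 4  # per the text, the zlib stream starts after the 4th byte


def inflate_ace_body(body: bytes) -> bytes:
    """Skip the 4-byte header and inflate whatever zlib data follows."""
    if not body.startswith(ACE_MAGIC):
        raise ValueError("body does not start with 0xAACCEE")
    # A streaming decompressor tolerates a capture that ends mid-stream.
    inflater = zlib.decompressobj()
    return inflater.decompress(body[HEADER_LEN:])


# Synthetic body for demonstration only (not real Siri traffic).
payload = b"bplist00 ... pretend this is the inner binary stream"
fake_body = ACE_MAGIC + b"\x00" + zlib.compress(payload)
inflated = inflate_ace_body(fake_body)
```

A streaming decompressobj is used rather than zlib.decompress because a sniffed connection may be cut off before the compressed stream ends, in which case the one-shot function would raise an error instead of returning the data received so far.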
Now when we unzipped the content, we got some new binary data. Not very understandable either, but some parts were text. Among them, one caught our attention: bplist00. Hurray, it seems like the data is some binary plist. After fiddling a little bit with that binary stream, we figured out it was made out of chunks:
- Chunks starting with 0x020000xxxx are “plist” packets, xxxx being the size of the binary plist data that follows the header.
- Chunks starting with 0x030000xxxx are “ping” packets, sent by the iPhone to Siri’s servers to keep the connection alive. Here xxxx is the ping sequence number.
- Chunks starting with 0x040000xxxx are “pong” packets, sent by Siri’s server as a reply to ping packets. Unsurprisingly, xxxx is the pong sequence number.
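To make the chunk layout concrete, here is a rough Python sketch that walks an already-inflated stream. It rests on an assumption of ours: we read each 5-byte header as one type byte followed by a 4-byte big-endian integer, which matches the 0x0200 00xx xx pattern above whenever the two middle bytes are zero. The names iter_chunks and CHUNK_TYPES are illustrative only.

```python
import struct

CHUNK_TYPES = {0x02: "plist", 0x03: "ping", 0x04: "pong"}
HEADER = struct.Struct(">BI")  # 1 type byte + 4-byte big-endian value = 5 bytes


def iter_chunks(stream: bytes):
    """Yield (kind, payload) pairs from an inflated stream.

    payload is the plist bytes for "plist" chunks and the sequence
    number (an int) for "ping" and "pong" chunks.
    """
    offset = 0
    while offset + HEADER.size <= len(stream):
        type_byte, value = HEADER.unpack_from(stream, offset)
        kind = CHUNK_TYPES.get(type_byte)
        if kind is None:
            raise ValueError(f"unknown chunk type 0x{type_byte:02x} at offset {offset}")
        offset += HEADER.size
        if kind == "plist":
            if offset + value > len(stream):
                return  # truncated capture: stop before the incomplete plist
            yield kind, stream[offset:offset + value]
            offset += value
        else:
            yield kind, value


# Synthetic stream for demonstration only: ping #1, a 3-byte "plist", pong #1.
demo_stream = (
    bytes.fromhex("0300000001")
    + bytes.fromhex("0200000003") + b"abc"
    + bytes.fromhex("0400000001")
)
chunks = list(iter_chunks(demo_stream))
```

Only plist chunks carry a payload after the header, so they are the only ones for which the parser skips ahead by the announced size.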
Deciphering the content of binary plists is very easy: you can do it on Mac OS X with the “plutil” command-line tool, or in Ruby with the CFPropertyList gem on any platform.
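Python’s standard library can read binary plists as well, through the plistlib module. The sketch below builds a small binary plist and decodes it again. The dictionary keys are made up for the example and say nothing about what Siri actually sends, and decode_plist_chunk is a name of ours.

```python
import plistlib


def decode_plist_chunk(data: bytes):
    """Decode the payload of a "plist" chunk into Python objects."""
    if not data.startswith(b"bplist00"):
        raise ValueError("not a binary plist (missing bplist00 signature)")
    return plistlib.loads(data)


# Hypothetical content, serialised as a binary plist for demonstration only.
example = {"class": "Example", "properties": {"sequence": 1, "text": "bonjour"}}
sample_chunk = plistlib.dumps(example, fmt=plistlib.FMT_BINARY)
decoded = decode_plist_chunk(sample_chunk)
```

The decoded value is an ordinary Python dictionary, so it can be pretty-printed or searched for interesting keys directly.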
What we learned
We learned a few interesting things about how the iPhone 4S talks to Apple’s servers:
The audio data
The iPhone 4S really does send the audio itself. It’s compressed using the Speex audio codec, which makes sense as it’s a codec specifically tailored for VoIP.
The iPhone 4S sends identifiers everywhere. So if you want to use Siri on another device, you still need the identifier of at least one iPhone 4S. Of course we’re not publishing ours, but it’s very easy to retrieve one using the tools we’ve written. Apple could blacklist an identifier, but as long as you’re keeping it for personal use, that should be all right!
The actual content
The protocol is actually very, very chatty. Your iPhone sends tons of things to Apple’s servers, and those servers reply with an incredible amount of information. For example, when you’re using speech-to-text, Apple’s servers even reply with a confidence score and a timestamp for each word.
What’s next?
Here’s a collection of tools we wrote to help us understand the protocol. They’re written mostly in Ruby (because that’s a wonderfully simple language); some parts are in C and some in Objective-C. They aren’t really finished, but they should be more than enough for anyone technically inclined to write a Siri-enabled application.
Let’s see what fun applications you guys get to build with it! And let’s see how long it’ll take Apple to change their security scheme!