Cracking Siri

On October 14, 2011, Apple introduced the new iPhone 4S. One of its major new features was Siri, a personal assistant application. Siri uses natural language processing to interact with the user.

Interestingly, Apple explained that Siri works by sending data to a remote server (that's probably why Siri only works over 3G or WiFi). As soon as we could get our hands on the new iPhone 4S, we decided to take a peek at how it really works.

Today, we managed to crack open Siri's protocol. As a result, we are able to use Siri's recognition engine from any device. Yes, that means anyone could now write an Android app that uses the real Siri! Or use Siri on an iPad! And we're going to share this know-how with you.

Demo

The best demo is probably Siri's speech-to-text feature. We made a simple recording of us saying "Applidium vous souhaite une bonne journée" ("Applidium wishes you a good day"), and got a perfect result!

Sample_Siri_speech_to_text.zip (70.78 KB)

This sound sample never went through any iPhone, but nonetheless we got Siri to analyze it for us.

Understanding the protocol: a brief technical history

At Applidium we're used to building mobile applications. The best way to chat with a remote server is HTTP, as it's the protocol that is most likely to work in any situation.

The easiest way to sniff HTTP traffic is to set up a proxy server, configure your iPhone to use it, and look at what goes through the proxy. Surprisingly, when we did, we saw no traffic at all when using Siri. So we resorted to running tcpdump on a network gateway, and we realised Siri's traffic was TCP, on port 443, to a server at 17.174.4.4.

Going to https://17.174.4.4/ on a desktop machine we noticed that this server was presenting a certificate for guzzoni.apple.com. So it seemed like Siri was communicating with a server named guzzoni.apple.com over HTTPS.

As you know, the "S" in HTTPS stands for "secure": all traffic between a client and an HTTPS server is encrypted, so we couldn't read it using a sniffer. In that case, the simplest solution is to set up a fake HTTPS server, use a fake DNS server, and see what the incoming requests are. Unfortunately, the people behind Siri did things right: they check that guzzoni's certificate is valid, so you cannot simply fake it. Well... they do check that it is valid, but the thing is, you can add your own "root certificate", which lets you mark any certificate you want as valid.

So basically all we had to do was set up a custom SSL certificate authority, add it to our iPhone 4S, and use it to sign our very own certificate for a fake "guzzoni.apple.com". And it worked: Siri was sending commands to our own HTTPS server! Seems like someone at Apple missed something!

That's when we realised how opaque Siri's protocol is. Let's have a look at a Siri HTTP request. The request's body is binary (we'll get into that later), and here are the headers:

            ACE /ace HTTP/1.0
            Host: guzzoni.apple.com
            User-Agent: Assistant(iPhone/iPhone4,1; iPhone OS/5.0/9A334) Ace/1.0
            Content-Length: 2000000000
            X-Ace-Host: 4620a9aa-88f4-4ac1-a49d-e2012910921

A few interesting things:

The request uses a custom "ACE" method, instead of a more usual GET.
The URL requested is "/ace".
The Content-Length is nearly 2 GB, which obviously does not conform to the HTTP standard.
X-Ace-Host is some form of GUID. After trying with several iPhone 4Ses, it seems to be tied to the actual device (pretty much like a UDID).
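To make those observations concrete, here is a minimal Python sketch that splits a request head like the one above into its method, path, and headers. The function name `parse_request_head` is ours, and the CRLF line endings are an assumption (the capture above only shows the header text, not the exact bytes on the wire).

```python
def parse_request_head(raw: bytes):
    """Split a raw HTTP-style request head into request line and headers."""
    # Everything before the first blank line is the head; the rest is the body.
    head, _, _body = raw.partition(b"\r\n\r\n")
    lines = head.decode("ascii").split("\r\n")
    method, path, version = lines[0].split(" ", 2)
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        # Header names are case-insensitive, so store them lower-cased.
        headers[name.strip().lower()] = value.strip()
    return method, path, version, headers


# The header values below are copied from the capture shown above.
SAMPLE_HEAD = (
    b"ACE /ace HTTP/1.0\r\n"
    b"Host: guzzoni.apple.com\r\n"
    b"User-Agent: Assistant(iPhone/iPhone4,1; iPhone OS/5.0/9A334) Ace/1.0\r\n"
    b"Content-Length: 2000000000\r\n"
    b"X-Ace-Host: 4620a9aa-88f4-4ac1-a49d-e2012910921\r\n"
    b"\r\n"
)

method, path, version, headers = parse_request_head(SAMPLE_HEAD)
print(method, path, version)
print(int(headers["content-length"]))
```

A stock HTTP library may well reject the non-standard "ACE" method, which is one reason to parse the head by hand when experimenting.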

Now let's move on to the body. The body is raw binary content. When we first looked at it with a hex editor, we noticed it started with 0xAACCEE. Oh, that looks like a header! Unfortunately, we couldn't make sense of anything after that.

That's when we took some time to think. As people who are used to designing mobile applications, we know there's one thing which is very important when talking over a network: compression. Bandwidth is often limited, so it's usually a very good idea to compress your data. And what is the most ubiquitous compression library around? zlib (http://zlib.net/). It's a very solid library, really efficient and powerful (makes sense, it's half French!). So we tried to pipe that binary data through zlib. But nothing came out: we were missing a zlib header. That's when we thought "hmm, so there's already this AACCEE header in the request body. Maybe there's some more?". We developers like to keep things packed. 3 bytes is not a good length for a header. 4 would be. So we tried inflating everything after the 4th byte. And it worked!
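The "skip four bytes, then inflate" step can be sketched in a few lines of Python using the standard zlib module. This is only an illustration under assumptions: the sample body is synthetic (built here, not captured from Siri), the value of the fourth header byte is a placeholder since the text does not say what it contains, and `inflate_ace_body` is a name we made up.

```python
import zlib

ACE_MAGIC = bytes.fromhex("AACCEE")  # the three bytes seen in the hex editor
HEADER_LEN = 4  # 3 magic bytes plus 1 more byte, as guessed in the text


def inflate_ace_body(body: bytes) -> bytes:
    """Skip the 4-byte header and inflate whatever zlib data follows."""
    if not body.startswith(ACE_MAGIC):
        raise ValueError("body does not start with 0xAACCEE")
    # A decompressobj copes with a stream that has not ended yet, which
    # suits a long-lived connection; zlib.decompress() would raise on a
    # truncated stream.
    inflater = zlib.decompressobj()
    return inflater.decompress(body[HEADER_LEN:])


# Synthetic stand-in for a request body: magic, one placeholder byte
# (its real value is unknown to us), then a zlib stream.
payload = b"bplist00 (stand-in data, not a real Siri capture)"
fake_body = ACE_MAGIC + b"\x00" + zlib.compress(payload)

print(inflate_ace_body(fake_body))
```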

Once we had inflated the content, we got some new binary data. Not very understandable either, but some parts were text. Among them, one caught our attention: bplist00. Hurray, it seems like the data is a binary plist. After fiddling a little bit with that binary stream, we figured out it was made of chunks:

Chunks starting with 0x020000xxxx are "plist" packets, xxxx being the size of the binary plist data that follows the header.
Chunks starting with 0x030000xxxx are "ping" packets, sent by the iPhone to Siri's servers to keep the connection alive. Here xxxx is the ping sequence number.
Chunks starting with 0x040000xxxx are "pong" packets, sent by Siri's servers as a reply to ping packets. Unsurprisingly, xxxx is the pong sequence number.
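A chunk splitter for the layout just described might look like the Python sketch below. It reads the notation literally: one type byte, two zero bytes, then a 16-bit value. The big-endian byte order and the 16-bit width are our assumptions from the "xxxx" notation, and `iter_chunks` and the sample stream are hypothetical, not taken from a real capture.

```python
import struct

PLIST, PING, PONG = 0x02, 0x03, 0x04

# 5-byte chunk header: type byte, two zero bytes, 16-bit value.
# Big-endian order is an assumption, not something the text states.
HEADER = struct.Struct(">BHH")


def iter_chunks(stream: bytes):
    """Yield (kind, value, payload) for each chunk in an inflated stream."""
    offset = 0
    while offset + HEADER.size <= len(stream):
        kind, _zero, value = HEADER.unpack_from(stream, offset)
        offset += HEADER.size
        if kind == PLIST:
            # For plist chunks the value is the size of the plist data.
            payload = stream[offset:offset + value]
            if len(payload) < value:
                break  # incomplete chunk: more data has yet to arrive
            offset += value
            yield "plist", value, payload
        elif kind in (PING, PONG):
            # For ping and pong chunks the value is a sequence number.
            yield ("ping" if kind == PING else "pong"), value, b""
        else:
            raise ValueError(f"unknown chunk type 0x{kind:02x}")


# Synthetic stream: ping #1, a 16-byte placeholder "plist", pong #1.
fake_plist = b"bplist00" + b"\x00" * 8  # placeholder bytes, not a valid plist
stream = (
    HEADER.pack(PING, 0, 1)
    + HEADER.pack(PLIST, 0, len(fake_plist)) + fake_plist
    + HEADER.pack(PONG, 0, 1)
)

for kind, value, payload in iter_chunks(stream):
    print(kind, value, len(payload))
```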

And decoding the content of binary plists is very easy: you can do it on Mac OS X with the "plutil" command-line tool, or in Ruby with the CFPropertyList gem on any platform.
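Python's standard library can read binary plists too, through the plistlib module. The round trip below is a small illustration; the dictionary keys are invented for the example and are not actual Siri message fields.

```python
import plistlib

# Hypothetical message; these keys are made up for illustration.
message = {"class": "ExampleCommand", "properties": {"text": "hello"}}

# Encode as a binary plist, the format spotted in the inflated stream.
blob = plistlib.dumps(message, fmt=plistlib.FMT_BINARY)
print(blob[:8])  # the "bplist00" signature mentioned above

# plistlib.loads() detects the binary format on its own.
decoded = plistlib.loads(blob)
print(decoded["class"])
```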

What we learned

We learned a few really interesting things about how the iPhone 4S talks to Apple's servers:

The audio data

The iPhone 4S really does send the audio data itself. It's compressed using the Speex audio codec, which makes sense as it's a codec specifically tailored for VoIP.

Signature

The iPhone 4S sends identifiers everywhere. So if you want to use Siri on another device, you still need the identifier of at least one iPhone 4S. Of course we're not publishing ours, but it's very easy to retrieve one using the tools we've written. Apple could of course blacklist an identifier, but as long as you're keeping it for personal use, that should be all right!

The actual content

The protocol is actually very, very chatty. Your iPhone sends tons of things to Apple's servers, and those servers reply with an incredible amount of information. For example, when you're using speech-to-text, Apple's servers even reply with a confidence score and a timestamp for each word.

What's next?

Here's a collection of tools we wrote to help us understand the protocol. They're written mostly in Ruby (because that's a wonderfully simple language), with some parts in C and some in Objective-C. They aren't really finished, but they should be more than sufficient for anyone technically inclined to write a Siri-enabled application.

Let's see what fun applications you guys get to build with it! And let's see how long it'll take Apple to change their security scheme!

Source: http://applidium.com/en/news/cracking_siri/