Privacy Papers Write-up
I was enrolled in a privacy class for my masters degree recently. As part of this class we had to read a metric buttload of papers (but at least not an old English buttload) and then write a short review of each, trying to find at least three critical points we could make or at least points of interest. These are my write-ups (with a few spelling fixes) and links to the papers in question. Sometimes I may come off as overly critical, but we were asked to find perceived weak points. Sometimes I just did not understand what the author was trying to get at, either because of my lack of background or a lack of explaining. Mostly what you will get out of this page is me being a curmudgeon about academic papers vs. hackers/infosec practitioners. Enjoy, or don't. :)
Low-Cost Traffic Analysis of Tor
http://www.cl.cam.ac.uk/~sjm217/papers/oakland05torta.pdf
The focus of Tor is to protect against less-than-global attackers; a global attacker could easily see both ends of the traffic and correlate who is who by looking at timing and amounts of data. The researchers in this paper come up with ways they think may reveal the identity of users while requiring less power on the part of the snooper (in other words, the snooper does not necessarily have to see both the first and last hop).
I’d not thought about choosing paths and messing with latency by modulating how my Tor client sends data, then looking for these effects elsewhere to tie together who is talking to whom.
What about other things that affect latency? Could the snooper’s latency-based suspicions themselves be suspect because other things cause latency as well? By this I mean on the Tor nodes they don’t control; the ones they do control have measures in place to help mitigate this. This could be something as simple as the Tor server having a higher load for a bit (say, if the Tor server is someone’s home box and they decide to encode some video), throwing things off. I could also see two researchers trying the same attack at the same time with similar paths, perhaps messing with each other’s results.
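Just to make the correlation idea concrete to myself, here is a quick toy sketch (my own illustration, not anything from the paper) of comparing a traffic pattern an attacker modulates on their own connection against latency probes of a relay; the unrelated-load noise I mention above is exactly what would weaken the correlation.

# Toy illustration: correlate an induced on/off traffic pattern with
# latency probes of a suspect relay vs. an unrelated relay.
import random

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

random.seed(1)
# Pattern the snooper modulates on its own circuit: 1 = burst, 0 = idle.
pattern = [random.choice([0, 1]) for _ in range(300)]
# Latency probes of a relay that IS carrying the modulated stream:
on_path = [50 + 30 * p + random.gauss(0, 10) for p in pattern]
# Latency probes of an unrelated relay: baseline plus noise only.
off_path = [50 + random.gauss(0, 10) for _ in pattern]

print("relay on the path:  %.2f" % pearson(pattern, on_path))
print("unrelated relay:    %.2f" % pearson(pattern, off_path))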
I can understand that it could be easy to mess with a Tor relay, but is it also expected that you have to make the victim talk to a corrupted server (a web server in this case) that you control outside of Tor as well? I can think of ways to do it, but I’m not sure how reliable they would be. This is somewhat covered in section 5.3.
While the attack in the paper may let a snooper see who is talking to whom, I’m not sure that is useful. If the attacker does not control the end point, they can’t see what the content is, and won’t know if they want to investigate (the Tor user could just be web surfing for LOL cats). If the snooper controls the entry Tor node, they already know the user is using Tor, so this does not seem to add an extra threat from repressive regimes that may not like their citizens using Tor. It seems that without being able to see what the traffic is, being able to tell who is talking to whom may not be that useful. Then again, I could be missing something. The snooper does own the corrupted server, so they can see that the victim is using it, but one would assume that may be of little interest to the attacker unless some sort of entrapment is OK (as in: I know you were downloading pirated media since I own the server it was pirated from). I could however see a legal entity seizing a server with contraband on it, and leaving it up so they could track users back to their real identities.
Crowds
http://www.cs.unc.edu/~reiter/papers/1998/TISSEC.pdf
This article describes the base problem of anonymity well. It predates many of the other anonymizing networks out there, and helps introduce the idea. That said, I can’t see that much has been done directly with Crowds (the network, not the idea) in practice, though I imagine modern darknets in use today take advantage of some of its advances/ideas. Some of the ideas we will see extensively in later papers:
1. Mentions the global type attacker.
2. Brings up the problem of sniffing an exit node.
The core idea is to mix up user traffic and route it between the members, so everyone has plausible deniability. In a pinch, they could say "That was not me, just someone routing through me". Depending on the laws where they live, I’m not sure that would be enough.
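The forwarding rule itself is simple enough to sketch: each member flips a biased coin, forwards to another random crowd member with probability p_f, and otherwise submits the request to the destination. A toy version (p_f = 0.75 picked just as an example):

# Sketch of Crowds-style probabilistic forwarding: each jondo flips a
# biased coin; with probability p_f it hands the request to another
# random crowd member, otherwise it submits it to the destination.
import random

def route(crowd, start, p_f=0.75):
    path = [start]
    current = start
    while random.random() < p_f:
        current = random.choice(crowd)   # may even pick itself
        path.append(current)
    return path                          # the last member submits to the server

crowd = ["alice", "bob", "carol", "dave", "me"]
print(route(crowd, "me"))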
I suppose a completely mixed net would scale better than Tor, where you have different classes of nodes (directory, client, Tor server), but a good selection method would have to be chosen to make sure things are routed to those that have the resources to handle it (I’d hate to be peered with users that only have a 56k modem). With Tor, people self-select, and for better or worse not everyone is a server (though according to a Defcon talk I just watched, there was a bug in an old version that could make any client an inadvertent server as well).
I’m still a little confused about the terminology used between the different papers. Is there a hard and fast difference between the concepts of mixes and crowds?
The article also spelled out application layer problems that could reveal real identities. For more information on these sorts of attacks see:
http://www.decloak.net/
I wish the crypto, and who has whose keys, was covered better. Does it do wrapping like Tor? From the descriptions I read, it seems to be from beginning to end, with only the first and last nodes using the encryption keys. There’s the hostile regimes problem, and like most papers this does not seem to be covered. The paper made me think about the whole timing of requests because of the nature of the application layer: for example web browsing, where your browser requests one page, then automatically grabs an image right afterwards. Crowds seems to take care of this by understanding the application layer somewhat, and making the requests for both objects before sending them on to the end user. I wonder how Tor handles this?
The article mentions a closed vs. open version of the darknet. The problem I see with the closed idea is that you have fewer people to spread the "plausible deniability" across, and it might also make for some interesting conspiracy charges depending on what the darknet is being used for.
The article mentions a Diffie–Hellman public key. I thought Diffie–Hellman was used to negotiate a shared secret while in view of a hostile observer? I could be splitting hairs here. Also, how does the current version negotiate keys? It’s indicated this happens when the blender is contacted, but not much seems to be said past that unless I missed it.
The paper also seems mostly web focused; what about applying the ideas to other applications?
Tor: The Second-Generation Onion Router
http://www.usenix.org/events/sec04/tech/full_papers/dingledine/dingledine.pdf
I'm sending my thoughts on this article first, since I have to moderate it.
I found it interesting the number of problems they had to solve not just for the sake of privacy, but to keep users from bringing down the system by hogging resources. Think of the number of people who would want to use high bandwidth applications like BitTorrent over a system such as Tor, just to avoid being found by the MPAA/RIAA. Many articles that talk about Tor only seem to mention its out-proxy functionality, but the hidden services section is what interests me the most, and I’m glad to see it covered here. I wish they would do more work on that feature, as I2P seems faster for that functionality last I tested.
The biggest weakness of the article might be its age, but you can’t fault the authors for the passage of time. Tor has grown a lot in the last 6 years: they mention 30 nodes in the article, and there are over 100 now.
A few things I could see as criticisms, at least things I’d like covered more:
How do you decide what exit node to use based on policy? For example, what if I’m a user that wants to use SMTP or another protocol that is banned by most exit node rules? The paper seems to indicate it’s possible, but I don’t remember reading how that is negotiated.
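My guess at how it would have to work is that the client filters the directory’s relay list by each relay’s advertised exit policy before building the circuit; a toy illustration (the relay list and policy format here are invented for the example, not Tor’s actual descriptor format):

# Toy illustration of choosing exit nodes by advertised exit policy.
# The relay data and policy format are made up for this example.
relays = [
    {"name": "exitA",   "accept_ports": {80, 443}},
    {"name": "exitB",   "accept_ports": {80, 443, 25}},  # rare: allows SMTP
    {"name": "middleC", "accept_ports": set()},           # no exit at all
]

def candidate_exits(relays, dest_port):
    return [r["name"] for r in relays if dest_port in r["accept_ports"]]

print(candidate_exits(relays, 25))   # only exitB could carry SMTP traffic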
I wish they had gone more into the weaknesses of the directory server model they chose. Before I read the article I had not thought of the problems with flood-style systems like the ones I2P and other darknets use. Still, about a year ago China blocked access to these Tor directory servers, which was a big problem for those that did not know about bridge nodes.
The legal aspects of running an exit node are something I’d like to see covered better, but that may be out of scope for this article. I loved the term "jurisdictional arbitrage". Also, I’d be concerned about using the system in some repressive regimes because even if they can’t see what I’m sending, or who I’m talking to, they may object to the very fact that I’m hiding my communications.
They also mention the "leaky pipe" topology for helping solve the problem
with end point traffic analysis. I’ve used Tor a fair amount, and did not know
about this feature. Wonder how I can set my config to use it?
Also, I’d like to talk more about hostile exit points. This guy did some interesting stuff with that concept:
http://www.theregister.co.uk/2007/09/10/misuse_of_tor_led_to_embassy_password_breach/
Untraceable Electronic Mail, Return Addresses, and Digital Pseudonyms
http://doi.acm.org/10.1145/358549.358563
I found this article to be the quickest read, but I had less comprehension of what the author was trying to say versus the other articles.
It’s very mail focused, which is understandable back in 1981. At that time, email would be the major "killer app" of the networked networks (not quite the Internet as we know it today, since the migration to TCP/IP was not finished till 1983 as I understand it). Latency is not as big of an issue with email, especially in 1981 with expectations being lower (I read my email several times per hour, and am used to a constant connection). A delay of a few minutes to a few hours between when a message is sent and when it gets through the system is not a deal breaker, and if it staves off traffic analysis so be it.
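That delay is tolerable because batching is what buys the resistance to traffic analysis; a toy threshold-style mix to illustrate the tradeoff (my simplification; Chaum’s actual design also strips a layer of encryption and repads each item, which I leave out here):

# Toy threshold mix: collect a batch of messages, shuffle them, then
# flush them together, so an observer cannot match inputs to outputs
# by order or timing. Layered decryption and repadding are omitted.
import random

class ThresholdMix:
    def __init__(self, threshold=5):
        self.threshold = threshold
        self.pool = []

    def submit(self, message):
        self.pool.append(message)
        if len(self.pool) >= self.threshold:
            batch, self.pool = self.pool, []
            random.shuffle(batch)      # break input/output ordering
            return batch               # flushed as a batch, not one by one
        return None

mix = ThresholdMix()
for i in range(7):
    out = mix.submit("msg%d" % i)
    if out:
        print("flushed:", out)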
I do have a question about something I don’t think was covered in the article where it pertains to voting. If someone compromises the private key of someone else, they can use that to impersonate them. That key could perhaps be revoked later, but I imagine the person trying to get his "digital identity" back would have to reveal his real identity and prove he belongs in the system to have that done. I’m interested in seeing more of how "Digital Pseudonyms" would work for voting based systems.
Chord: A Scalable Peer-to-peer Lookup Protocol for Internet Applications
http://pdos.csail.mit.edu/chord/papers/paper-ton.pdf
I’d like more details on reliability over anonymity. The article states: "Chord does not provide anonymity", so I assume its use in privacy/free speech might be to allow lookups for resources without the need for a central server that can be taken down?
How does it handle hash collisions if two values hash to the same thing? I understand SHA-1 is pretty good about distribution, but what happens when it fails?
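My reading is that Chord sidesteps "collisions" in the usual sense: node IDs and keys live on the same identifier circle, and a key simply goes to its successor (the first node whose ID is equal to or follows it), so keys that hash near each other just land on the same node. A rough sketch of that assignment:

# Sketch of Chord-style consistent hashing: node IDs and keys share one
# identifier circle, and each key is stored at its successor node.
import hashlib

BITS = 16                      # tiny ring for illustration (Chord uses 160)
RING = 2 ** BITS

def ident(name):
    return int(hashlib.sha1(name.encode()).hexdigest(), 16) % RING

nodes = sorted(ident("node-%d" % i) for i in range(8))

def successor(key_id):
    for n in nodes:
        if n >= key_id:
            return n
    return nodes[0]            # wrap around the circle

for key in ["cat-video.avi", "thesis.pdf"]:
    k = ident(key)
    print(key, "->", k, "stored at node", successor(k))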
I see the paper mentions using "distributed indexes" to allow for keyword searches, but I wonder how efficient this would be? For a few words using the "or" clause, I don’t think it would be a big problem, but searching using the "and" clause may be an issue since you have to find nodes with one keyword, then filter out ones that don’t have the others.
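For what it’s worth, my mental model of the "and" case is fetching each keyword’s posting list from whichever node holds it and intersecting them locally, which is where the cost comes from; a toy version (the index contents are made up):

# Toy distributed keyword index: each keyword's posting list lives on
# some node; an OR query just unions whatever lists come back, while an
# AND query must fetch every list and intersect them.
index = {                       # keyword -> set of document IDs (hypothetical)
    "privacy": {1, 2, 5, 9},
    "tor":     {2, 3, 9},
    "dht":     {2, 7, 9, 11},
}

def search_or(keywords):
    return set().union(*(index.get(k, set()) for k in keywords))

def search_and(keywords):
    lists = [index.get(k, set()) for k in keywords]
    return set.intersection(*lists) if lists else set()

print(search_or(["privacy", "dht"]))          # cheap: any match counts
print(search_and(["privacy", "tor", "dht"]))  # must fetch and filter all lists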
There are a few times in the paper where they imply that they are using a "non-adversarial model." The nature of privacy systems is such that many would be adversarial to them. I wonder what would happen if an attacker joined, got their node IDs to be the ones closest to the kind of data they want to get rid of, then removed all their nodes at once. Would this destroy the data, effectively deleting it?
Kademlia: A Peer-to-peer Information System Based on the XOR Metric
http://pdos.csail.mit.edu/~petar/papers/maymounkov-kademlia-lncs.pdf
I’m not sure I know enough to really analyze the weaknesses of these systems. I’m new to the concept of hash tables, much less distributed ones. I know the authors have done some talks on the subject (I’ve found the slides online), so I’ve contacted them to see if anyone recorded one of their sessions.
The XOR metric makes judging distance easy to understand. As for organization, I take it the node whose ID is closest to the key of a data value is the one that stores it? When it comes to implementation, how do you handle searching for keywords? There would not be a one-to-one mapping, but from reading elsewhere it seems this can be implemented on top of the Kad protocol at a higher level (I assume this is what eMule is doing).
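The metric really is just the XOR of the two IDs treated as an integer, so "closeness" falls out of shared high-order bits; a quick sketch:

# Kademlia distance: XOR the two 160-bit IDs and treat the result as an
# integer. The node(s) with the smallest distance to a value's key store it.
import hashlib

def node_id(name):
    return int(hashlib.sha1(name.encode()).hexdigest(), 16)

def distance(a, b):
    return a ^ b

key = node_id("some-file-keyword")
peers = {name: node_id(name) for name in ["peer1", "peer2", "peer3", "peer4"]}

# The peer "closest" to the key (smallest XOR distance) is responsible for it.
closest = min(peers, key=lambda name: distance(peers[name], key))
print("key stored at:", closest)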
What happens to someone that is unlucky enough to be the one that stores contraband data and is found? They may not have meant to host it, they have it just because they happen to be the person whose user hash is closest to the value’s hash.
Is it possible to cause a netsplit, where nodes that would normally be in a k-bucket and forwarding to other nodes all disappear at once, causing two separate kad networks to be created?
Why Kad Lookup Fails
http://www-users.cs.umn.edu/~hopper/kad.pdf
I was kind of surprised to find other users of Kad outside of the eMule family. I assume they are separate networks, just using the same lookup scheme? The statement is made that the improvements in the paper can be implemented incrementally into existing clients, but I wonder how many clients fully follow the protocol and don’t extend it in some odd way?
It seems that the same nodes are used most of the time to look up an object, and the queries are not evenly spread to other nodes that could answer them. I wonder if a company/organization trying to block the distribution of something could become one of those nodes?
The limit of 300 returned results may also be exploitable, if you can increase the seeming popularity of fakes.
ShadowWalker: Peer-to-peer Anonymous Communication Using Redundant Structured Topologies
https://netfiles.uiuc.edu/mittal2/www/shadowwalker-ccs09.pdf
One of the major problems with Tor is its central directory. ShadowWalker says they need to use a P2P system to help alleviate this. As I stated in another review, a good route selection method would have to be chosen to make sure things are routed to those that have the resources to handle requests (Example: I’d hate to be peered with users that only have a 56k modem). With Tor, people self-select, and not everyone is a server. I don’t see the paper addressing this.
That does bring up an idea for an attack. Have your malicious node be honest, forward traffic, and sign things properly, but do everything so slowly that it’s unusable. A security system that is inconvenient is a security system that does not get used. I guess they could tack on to ShadowWalker things from other systems that are already designed to solve this.
The paper mentions the nodes knowing the public keys for all of the other nodes involved. If it’s truly all of the nodes in the network, then I’m not sure how this will scale because of problems with key distribution. Maybe they mean just the public keys of the nodes they peer with, and their shadows? I found the paper hard to follow, but I’d like more details on how a node bootstraps into the network and decides who to have "fingers" to.
Anonymity, Unlinkability, Undetectability, Unobservability, Pseudonymity, and Identity Management — A Consolidated Proposal for Terminology
http://dud.inf.tu-dresden.de/literatur/Anon_Terminology_v0.31.pdf
As this paper is mostly about setting forth a standard terminology, it will be hard to criticize, but I’ll try.
The old joke is: the great thing about standards is there are so many to choose from. I’m not sure how many people will use this as a standard naming convention paper vs. just using their own terms with their own connotations.
On the subject of connotation, one of their terms struck me: unobservability. The examples given for unobservability sometimes literally seem to mean "it’s not possible to know it’s there", and sometimes seem to be used as just a stronger version of anonymity. For example, they mentioned the use of spread spectrum, and while with spread spectrum you may not observe all of it, you can certainly observe some of it if you happen to be at the right frequency while it’s channel hopping.
During our discussion of the Chaum paper we mentioned the problem of tying a pseudonym to a single real person for the purpose of voting. Sections 10 and 12 cover this issue, though I don’t know if we get a truly workable solution, merely suggestions on how it can be done (identity brokers for example) depending on what tradeoffs we wish to make.
Case in point about terminology: Sometimes it seems like we use the term Sybil to mean any malicious node, but the direct meaning based on the names referenced is one node (or controlling entity) posing as many. Sock puppets is another term that is used for much the same thing in online forums.
SybilGuard: defending against sybil attacks via social networks
http://www.comp.nus.edu.sg/~yuhf/sybilguard-sigcomm06.pdf
OK, I’m not sure I can make three concise points because I’m not sure what system this paper is trying to defend from Sybil attacks. I’m thinking from the standpoint of darknets and peer-to-peer file sharing, and trying to see how these principles would apply to them.
SybilGuard seems to be based around human-initiated trust, but how do you do that without revealing your real identity? For a Facebook voting application (like a poll that asks who is the best singer), that could be fine because people are letting each other know who they are and people want to make connections. In P2P file sharing and darknets, people wish to stay anonymous, and if they are bright they don’t want connections that can be tracked back to their real ID.
The article mentions out-of-band communications to set up the peering. Does this mean something like calling someone and leaving real information about yourself? Not something most file sharers/darknet users would want to do. Side note: the VoIP encryption application Zfone does something similar to this. It uses a Diffie–Hellman key exchange to set up encryption keys, but then shows the shared secret to both ends so they can confirm by voice who they are talking to.
If people only peer with others they know, does this not limit the total size of the network? If I only peer with people I trust, that’s not going to be as many people, and it may make it harder for me to "find files" or "establish connections that don’t link back to me". I suppose it is fine for some applications where most people don’t want to hide themselves (Facebook apps for example), but I’m not sure it would scale in P2P file sharing and darknet systems.
SybilLimit: A Near-Optimal Social Network Defense against Sybil Attacks
http://www.comp.nus.edu.sg/~yuhf/yuh-sybillimit.pdf
As this article shares three out of four of the same authors as the SybilGuard paper, and seems to be a refinement of the previous work, many of my same implementation critiques apply.
As I alluded to before, p2p sharing systems are not really "social" in the sense the paper seems to use the word. Darknets can be social, as in "I use a private darknet just to communicate with a subset of people, and we all implemented a Hamachi type network together", but once one node is compromised (let’s say raided) all of the nodes could be easily linked. Guess this goes back to the old saying: If more than one person knows about it, it’s not a secret.
To return to the terminology paper ("Anonymity, Unlinkability, Undetectability, Unobservability, Pseudonymity"), the SybilLimit paper highlights the need for a common language. The article is hard to follow unless you have read A LOT of related work. I’m suggesting that since they are using a social network model, actually using real human-to-human communications for examples might make things clearer. Lots of papers do this with the classic Alice/Bob examples.
This point may be covered in the paper, but I did not catch it. With a random walk inside of a social network, would you get a truly random final node, or would you have a serious clustering problem? For example: I’m friends with Bob, Alice and Chuck. Alice is friends with me, Chuck and Dave. Bob is friends with me, Dave and Alice. It seems like Dave has a good chance to be in control of some sensitive operations, or we may not hop very far away from our little cluster. This may be OK for some applications, but for implementing something like Crowds/Tor/I2P it would be problematic, as the person I use as my out connection will be closely tied to me.
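To see how clustered short walks can get, here is a quick simulation over the little friendship graph from my example (the graph and walk length are mine, purely for illustration):

# Simulate short random walks on the small friendship graph from the
# example above and count where they end up; with walks this short,
# the endpoints stay bunched inside the little cluster.
import random
from collections import Counter

graph = {
    "me":    ["bob", "alice", "chuck"],
    "alice": ["me", "chuck", "dave"],
    "bob":   ["me", "dave", "alice"],
    "chuck": ["me", "alice"],
    "dave":  ["alice", "bob"],
}

def walk(start, steps):
    node = start
    for _ in range(steps):
        node = random.choice(graph[node])
    return node

endings = Counter(walk("me", steps=3) for _ in range(10000))
print(endings)   # endpoints concentrate on the near neighbors, not a uniform pick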
SybilInfer: Detecting Sybil Nodes using Social Networks
https://netfiles.uiuc.edu/mittal2/www/sybilinfer-ndss09.pdf
Once again, many of the same objections apply as all the other solutions that rely on a Social Network, but I’ll see if I can put a different focus on them.
People will add anyone. Just because I decided to trust another node, do I really trust them or am I just being nice? Let’s use Facebook as an example; people add others they barely know all of the time. I’m not sure we can really use this as a trust model, as humans are not very reliable gate keepers under many circumstances.
To further the point above, some people may have different levels of trust because they have different levels of sensitivity about the data they have on the network (or perhaps just paranoia). Let’s say Alice is doing some shady stuff on the network, but she really does trust Bob, so she adds him to her network. Bob is squeaky clean, and thinks he has nothing to hide, so he adds just about anyone who asks, even Dave the local narc. Can careful Alice now be compromised by Dave because Bob was not so cautious?
Even cautious groups can be compromised with the right type of social engineering. Read about the case of "Robin Sage": http://www.darkreading.com/insiderthreat/security/privacy/showArticle.jhtml?articleID=225702468
To summarize the "Robin Sage" incident:
1. Have a pretty woman ask to be added to the social network of a bunch of guys in a mostly male occupation (infosec).
2. Target high profile people, since if you can get them to add you it enhances the value of the trust relationship.
3. See what data can be spidered after you are added.
The Ephemerizer: Making Data Disappear
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.133.1047&rep=rep1&type=pdf
Many say "once it’s on the Internet, it never goes away". I can’t even find a definitive source for who first said that, but I imagine the gist of the phrase was co-invented over time by many people. This paper is definitely better than the ones from last week, at least from a clear writing standpoint.
1. While the paper mentions this, it’s worth dwelling on more. It is very hard in real life to make sure data does not slip out of the application and someplace else: temp files, swap, etc. I’ve done a fair amount of research into anti-forensics, and it can get tough to destroy all of the footprints that operating systems squirrel away into various nooks and crannies.
2. The paper makes mention of some of the steps needed to alleviate the problems brought up in point one, but will anyone really use the system after all of these precautions have been implemented? One of the lines from the paper: "Bob’s must be capable of communicating with the ephemerizer and doing the cryptography necessary to get the message decrypted, and not to store the decrypted message in stable storage. It is possible in many operating systems to have a process that will not be swapped out to stable storage, so the ephemerization software at the client should use that feature to ensure that the decrypted message never appear on stable storage." (A bare-bones sketch of that OS feature follows this point.)
By necessity, the software will require some configuration, and may not be intimately tied to the OS and standard applications. Will most users want to bother to use it? We can see by the current low usage of full disk encryption and PGP/GPG that few people will take the more private, but harder to set up route.
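For reference, the OS feature the quoted passage leans on is memory locking (mlock on POSIX systems); a bare-bones sketch of pinning a buffer so a decrypted message can’t be swapped out, then wiping it (Linux/Unix only, and subject to the RLIMIT_MEMLOCK limit):

# Sketch: keep a buffer holding a decrypted message out of swap with
# mlock(), then zero it before releasing. POSIX only; the per-user
# locked-memory limit (RLIMIT_MEMLOCK) may make mlock() fail.
import ctypes
import ctypes.util

libc = ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)
libc.mlock.argtypes = libc.munlock.argtypes = [ctypes.c_void_p, ctypes.c_size_t]

buf = ctypes.create_string_buffer(4096)          # will hold the plaintext
addr, size = ctypes.addressof(buf), ctypes.sizeof(buf)

if libc.mlock(addr, size) != 0:
    raise OSError(ctypes.get_errno(), "mlock failed")
try:
    buf.value = b"the ephemeral decrypted message"
    # ... use the plaintext only inside this locked buffer ...
finally:
    ctypes.memset(addr, 0, size)                 # wipe before unlocking
    libc.munlock(addr, size)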
3. More should be dealt with concerning the law. Even if you can get rid of the data, there are legal ramifications. A while back I did some research on the topic, and if the court knows the data existed and it has since disappeared (aka been permanently encrypted) the judge may slap you with a "destruction of evidence" or "spoliation of evidence" charge. More info can be found in:
Federal Rules of Civil Procedure Chapter V
Zubulake v. UBS Warburg LLC
Qualcomm v Broadcom
It’s been about a year since I read those laws/cases, so I may have to refresh before class. Sebastien Boucher's case may also be of interest. Point two also brings up that few people may use the system, so those that do may be looked on with suspicion and an attitude of "What do you have to hide?", which will influence the judge/jurors.
Vanish: Increasing Data Privacy with Self-Destructing Data
http://vanish.cs.washington.edu/pubs/usenixsec09-geambasu.pdf
First, I have to say I like the idea of using the natural occurring bitrot that happens in P2P DHTs to one’s advantage. It’s not a bug, it’s a "feature".
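My understanding of the mechanism is that Vanish encrypts the data locally, splits the key into shares with threshold (Shamir) secret sharing, and scatters the shares across the DHT so that once enough of them churn out, the key is unrecoverable. The sketch below cheats and uses a simple all-or-nothing XOR split just to show the shape of the flow:

# Simplified Vanish-like flow: generate a key locally, split it into
# shares, scatter shares into a (here: fake, in-memory) DHT. The real
# system uses Shamir threshold sharing; this toy uses an XOR n-of-n split.
import os
from functools import reduce

def xor_bytes(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def split_key(key, n):
    shares = [os.urandom(len(key)) for _ in range(n - 1)]
    shares.append(reduce(xor_bytes, shares, key))   # last share completes the set
    return shares

def recover_key(shares):
    return reduce(xor_bytes, shares)

key = os.urandom(16)                                # would encrypt the message
shares = split_key(key, 5)
fake_dht = {os.urandom(20): s for s in shares}      # index -> share, will "rot"

assert recover_key(list(fake_dht.values())) == key
fake_dht.pop(next(iter(fake_dht)))                  # one share ages out...
print(recover_key(list(fake_dht.values())) == key)  # ...and the key is gone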
1. It seems Vanish is a software band-aid to a social problem. If people don’t want something to get out, then they should not post it online in the first place. Can’t Ann just call Carla, or better yet, if she is worried about wiretapping, just meet her in person to talk? Other legal use cases may also be suspect, but I’ll go into that in the next point. People may also get a false sense of security, and post data they normally would not because they think Vanish will take care of it. The paper does mention this, but to me it would seem to be a bigger concern than they make it out to be. Also, one example given in the article is a Facebook post; that seems like an incredibly bad idea, as if I saw that a post used Vanish (which you can tell by the headers) I’d automatically save the message expecting it to be pretty juicy.
2. Many of the legal issues I mentioned from the first article also apply here. If the person is using Vanish to destroy data they think may be of legal significance later, don’t they already have a "duty to preserve"? For more details on the legal term "duty to preserve" please see this link:
As the linked article explains, the "duty to preserve" begins 'When Litigation or an Investigation is "Reasonably Anticipated"'. There may still be some legal wrangling that can be done, but it is a legitimate worry. The introduction to section 6 covers this very lightly.
3. The previous article at least gives lip service to having to use non-standard applications to help keep forensic footprints to a minimum. As Vanish has implemented Firefox plugins, I wonder what data is leaked out by these plugins? They could of course implement these tools another way, but without integration into existing applications how many people will adopt them? I’d also worry about Gmail’s drafts functionality saving a copy before the Vanish plugin is used.
To be concise, I’ll stick with just three points; however, I’m also suspicious of the assumption that the party you are sending to can be trusted not to make a copy.
Defeating Vanish with Low-Cost Sybil Attacks Against Large DHTs
http://www.cse.umich.edu/~jhalderm/pub/papers/unvanish-ndss10-web.pdf
To be critical of this article, perhaps I’ll need to defend the Vanish system? I’m not sure quite how to do this review, but I’ll bring up some things I found interesting from the paper, and some ideas to look into further.
1. An implicit assumption (in both this paper and the previous one) that does not seem to be explicitly stated is that the VDO will not be known about by the attacker until after the keys have expired from the DHT. However, I can see some person who does not understand the system making a semi-public post someplace with the Vanish protocol, and the people who come across it automatically looking up the keys and storing them for personal amusement or to use as leverage later. Also, you still have to trust a single party in some cases: who’s to say Google could not choose to parse all content that appears to use a VDO, and grab the keys within the time limit?
2. As the Vuze system bases its node hashes on the IP and the port number, could this be a tracking issue? As it is implemented here, I don’t think it will be an issue as the IPs will already have to be known to be queried, but I wonder if the predictable nature could be used to attack other systems using this scheme for identity. 32 IP bits plus 16 port bits means 2^48 possible combinations. Assuming no collisions, storing a rainbow table of all possible hashes would take 5,242,880 GB more or less, which I guess would be prohibitive at this point. Then again, maybe there are ways to optimize this, as not all possible IPs are really in use. (I spell out the arithmetic after these points.)
3. The attacks specified require someone to be watching the Vuze network for items that are likely to be stored Vanish keys. They show in the paper that this is not nearly as expensive as the original paper claimed. However, I’m not sure who will run these sorts of continuous lookups just on the hope of recovering data later. Totalitarian governments perhaps. A private business might be created that offered the service of recovering expired VDOs for a fee, but you would have a bit of a chicken and egg problem there: until there are enough people using the Vanish system, the recovery business would likely not get enough sales to stay open. If Vanish got very popular, that could perhaps become its own downfall, as at that point some group could make money by providing the VDO recovery service.
4. I’ll add a fourth point, as point two is only weakly tied to the paper. Since the protocol is UDP based, why not look into packet spoofing, since you won’t have the same problems that may exist in TCP based protocols? I’m not sure all attacks would require getting an answer back, and if you can manipulate the DHTs with spoofed packets from non-existent hosts it might be an even lower cost way to attack the system. I don’t understand the Vuze protocol well enough to make firm suggestions, but perhaps there is something that could be spoofed to either make the keys unreadable prematurely (essentially a denial of service attack that makes people not use the Vanish system because of reliability issues), or to fake refresh commands that cause the data not to expire fast enough.
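Going back to point two, here is the arithmetic spelled out (the 20 bytes per SHA-1 digest is my assumption, not a figure from the paper):

# Back-of-the-envelope size of a full precomputed table of Vuze-style
# node hashes: 2^(32+16) inputs, assuming a 20-byte SHA-1 digest each.
entries = 2 ** (32 + 16)          # every IPv4 address x every port
bytes_total = entries * 20        # 20 bytes per SHA-1 hash (my assumption)
print(entries)                    # 281,474,976,710,656 combinations
print(bytes_total / 2 ** 30)      # 5,242,880 GB (about 5 PB), ignoring unused IPs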
Privacy-preserving P2P data sharing with OneSwarm
http://www.cs.washington.edu/homes/piatek/papers/oneswarm_SIGCOMM.pdf
I went ahead and installed OneSwarm just to play with it; until I get some "friends" it does not look like it will be of much use to me personally. On with the critique:
1. One of the use cases was someone wanting to download a political podcast, but not wanting others to know their political affiliation. Maybe for that case the system would be useful, but really, how many people is this a concern for who would also be in a position to use this system (have a computer, a decent internet connection, and not be firewalled from the rest of the world in other ways)? I imagine in some repressive regimes that might be an issue, and if you were "friends" with someone in another country you could grab the files from them with less concern about consequences. In the case of the United States, it seems anonymous enough for a person to just download the podcast directly. I listen to podcasts I don’t want to explain to some of my coworkers, but they are not going to find out unless:
a. Someone is doing a lot of invasive traffic analysis of what sites I go to.
b. My coworkers look at my computer (in which case OneSwarm would not help anyway).
c. The person who hosts the podcast leaks out information about who is downloading it.
Most of my coworkers don’t even care enough to Google my name, so I doubt they would go to the steps above. The use case they give seems somewhat disingenuous. The real use for such a system would be sharing pirated media while being worried about the RIAA/MPAA. Not that I have a problem with that, it just seems disingenuous to lead with a different use case. With that in mind, I’m going to think of the RIAA/MPAA as my adversaries for this analysis. I guess if I have a friend in another country with less strict copyright laws, I could use them to download the next Hannah Montana CD, as they are not likely to have legal hassles (more on this in the next point).
2. From a legal pain perspective, does it really matter if my adversaries can’t be sure I’m hosting the file or merely relaying as the exit point? I’m probably still going to get a cease and desist from my ISP, or a letter from some lawyers. I may be able to establish doubt that it was really me hosting the data, but my understanding is that in non-criminal cases (like when one person/company sues another) the burden of proof is substantially lower, so doubt may not be enough. I’d also be curious to know if the various safe harbor provisions that ISPs enjoy would be applied equally to a home user providing such a service.
3. It is stated in the paper that "OneSwarm assumes that users are conservative when specifying trust in peers, as trusted peers can view files for which they have permissions" and that OneSwarm’s aim is to let users control their level of anonymity. I’m not sure how much faith I would put in that assumption, and until more people use OneSwarm and can make peer groups, most of the data will be coming in via the traditional BitTorrent route. I know in my limited testing, I could not find much of interest when I connected to the public community servers (I did see some security lectures however). The friends of mine who I trust, and who are tech savvy enough to want to run OneSwarm, could just as soon set up their own private SFTP site. Then again, OneSwarm would be easier for the average user, but I wonder how often the average user thinks about privacy at this level?
4. Just as a side note, while connected to the UW CSE Community Server, I noticed the node names seem to be the same as the PCs' host names. Granted, the IP is already given if you care to look with netstat or a sniffer, but having the node name be the same as the computer’s host name seems to be a bad design decision from an anonymity standpoint, especially based on their use case example. I can’t subpoena a school for who had a given IP at a certain time, but I can certainly look at the host name and do a little Googling to figure out whose node that is. Example names: ATHENA, alsoan, alves, AP-PC, Benet-Copelands-MacBook-Pro, catjam, Cou2rde, dunkirker, Fred-PC, hannong, jonfil, marc-PC and so on. By the way, I do this all the time via reverse DNS for people who visit my site from school.
Drac: An Architecture for Anonymous Low-Volume Communications
https://www.cosic.esat.kuleuven.be/publications/article-1422.pdf
1. First, I like the idea of padding to make things resistant to traffic analysis. It is costly bandwidth wise, but if you think of all the bandwidth that is wasted on YouTube and Facebook it’s not so bad. That said, not all places in the world have that sort of bandwidth/technology to spare, as pointed out in the Adrian Hong talk I linked to a while back.
2. They bring up the idea of social networks as a model for building trust. I think we have spoken enough on why that could be a problem, but as long as things stay end to end encrypted I suppose it will be ok. One line that did catch my attention was "Yet we are the first to propose boldly making use of a social network as the backbone of anonymous paths." As best as I can find, the Drac paper was published this year (2010); did they not know of all of the work done on using social network based trust models against Sybil attacks?
3. As has been established elsewhere, modern crypto is generally pretty tough to break directly, but because of implementation mistakes cracks abound (WEP’s misuse of initialization vectors with RC4 is a prime example). That said, this section of the Drac paper caught my attention: "In this work we are not overly concerned with the cryptographic details of Drac. There exist well established, provably secure, cryptographic constructions to support relaying anonymized messages and extending anonymous connections. Similarly we assume that a padding regime is established that makes the output channels traffic statistically independent of the input channels. This can be done simply by sampling a traffic schedule for the output channel independently and before even seeing the input channels, and sticking to it by adding cover traffic if there is not enough, or dropping messages if the queues become too long." Good as far as it goes, but hopefully good padding material will be chosen to avoid "known cleartext" attacks against it. (A toy sketch of that padding regime follows.)
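The padding regime they describe is easy enough to sketch: pick the output send slots up front, fill empty slots with cover traffic, and cap the queue so backlog can’t leak load information either. A toy version (mine, not from the paper):

# Toy version of the padding regime described in the quote above: the
# output schedule is fixed in advance, independent of real traffic;
# cover messages fill empty slots, and the queue is capped so it drops
# rather than grows.
import os
from collections import deque

MAX_QUEUE = 8

class PaddedChannel:
    def __init__(self):
        self.queue = deque()

    def enqueue(self, msg):
        if len(self.queue) >= MAX_QUEUE:
            self.queue.popleft()          # drop oldest rather than back up
        self.queue.append(msg)

    def tick(self):
        """Called at every pre-sampled send slot, no matter what."""
        if self.queue:
            return ("real", self.queue.popleft())
        return ("cover", os.urandom(16))  # dummy cell, same size as a real one

chan = PaddedChannel()
chan.enqueue(b"hello")
for _ in range(3):
    print(chan.tick()[0])                 # real, cover, cover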
Privacy Preserving Social Networking Over Untrusted Networks
http://www.cl.cam.ac.uk/~fms27/papers/2009-AndersonDiaBonETAL-privacy.pdf
The best way I can describe this paper is "what wonderland are they living in?"
1. Ok, first question: what is the incentive for SNOs (Social Network Operators) to implement (or allow others to implement and just use them as the backbone) better privacy systems? The average SNO's business is based on selling ads that market things they think you might buy based on your private data. Anything that hinders that is not in the SNO's/marketer's best interest.
If you can’t get the SNOs to develop and use it, you have to make your own social network infrastructure system, and best of luck getting people to leave Facebook/Myspace/etc. for your system.
2. An extension to point one: I suppose SNOs might have an interest in extra privacy for their users’ data if they fear their users leaving in droves because of privacy breaches. Unfortunately, I seriously doubt most social network users care enough, or think enough about it, that they will leave in droves. Social network users want their Farmville and quick access to message their friends. Without the economic incentive for better privacy that comes with customer dissatisfaction, it seems better profit-wise for an SNO to upset a few privacy enthusiasts to the point of leaving, and keep the vast majority, who don’t care as much about privacy and can be sent targeted ads.
3. I like this line from the paper: "We propose a client-server social networking architecture in which the server is untrusted, providing only availability, and clients are responsible for their own confidentiality and integrity." That already exists, it’s called: Don’t post things online that you don’t want public. Sometimes I feel like the anonymity/privacy research community seems to be obsessed with making better band aids to help people who like to play Russian roulette for fun.
The Anatomy of a Large Scale Social Search Engine
http://vark.com/aardvarkFinalWWW2010.pdf
As this paper does not directly concern privacy, I’ll concentrate on potential abuse and usability problems I see. I imagine more discussion of privacy/security will take place in the follow up articles. First, I’ve got to agree with the paper that some questions are better answered by asking a person, and are not well suited towards general web search engines. This is especially true for location/opinion based questions ("What is the best place in the Louisville area to buy parts for electronic projects" for example). That said, I’ve played with the Aardvark service now and so far I’m not impressed.
1. Let’s talk about malicious responders to questions. I suppose marketing via this system could be a problem. A vendor could sit there and try to steer people their way if they find a suitable question, but that’s a lot of effort and I doubt it would be profitable unless there are a lot of questions along the lines of "where can I buy x/who should I buy x from". I can easily see people abusing this for entertainment however, also known as "doing it for the lulz" in chan culture. For example, someone asks if there is a cure for the common cold, and someone tells them to take a bottle of Ex-lax. Things would need to be done to be able to efficiently ban people. Doing it by IP would be problematic, but if they only accept people with a certain number of social network ties that may help. A good reputation system would also be beneficial, but I’m not sure of Aardvark’s details when it comes to user reputation.
2. So far, the questions I’ve asked have been answered by people that do not understand the topic well enough to be able to answer.
Examples:
Question: Anyone know how to change the DHCP vendor ID Windows 7 uses when it tries to get an IP?
Answer from Max R.: tried "ipconfig /setclassid adapter_name class_id" ? btw, welcome to aardvark
Me: I'm trying to set the Vendor id, as in change it to something other than "MSFT 5.0"
Answer from Max R.: so have you tried to use that command?
Me: Yes, I tried it. No change as viewed in Wireshark: Option: (t=60,l=8) Vendor class identifier = "MSFT 5.0"
That one at least shows some understanding. The next one, not so much (I left in the spelling mistakes):
Answer from Pablo O.: Dhcp is asigned by yiur router or dhcp server, not windows.
Me: DHCP Vendor ID is a fixed string in a DLL, I can edit the DLL in a hex editor, but I'm looking for a better way. It's all about getting around passive OS fingerprinting.
You might be thinking I’m expecting too much, and perhaps I am. However, this still points out a problem with the system: people will try to answer things even when they have no idea. This can cause the following problems:
a. Misinformation. Not a big concern for my question, but imagine if someone asked something medically/legally related? Granted, people should not ask for medical/legal advice on Aardvark, but I’m pretty sure some will.
b. If enough people answer the question, Aardvark stops the search until you tell it to find more answers. This could mean many retries to get an answer.
When a human on the other end fails the Turing test, that’s also an issue.
3. Back on a security related topic, and something the next papers will talk about I imagine, is the information the person gives away about themselves. For example, let’s say I ask the question "What is the best way to secure Apache with modules x, y and z". Now an attacker watching the questions can Google up where I work based on my profile/username and knows some useful information for compromising one of my systems. A different example might be "I’m leaving for Florida Friday, what is the best route to take from Bloomington, Indiana?" Now a potential attacker knows where the person lives, when they will be gone, and with a little extra cyber stalking the attacker can most likely figure out an exact street address.
Anonymous Opinion Exchange over Untrusted Social Networks
http://www.inf.unibz.it/~mkacimi/eurosys2009.pdf
As mentioned before, I like the idea of being able to ask a question of my peer group without tying it back to me. For example, asking the question from my last review "What is the best way to secure Apache with modules x, y and z" without giving too much information away about who I am and where I’m running these boxes. Here are my top three thoughts on this paper:
1. Building an anonymity system on top of a preexisting social network seems like building a house on soft ground. Would it not be better to just use something equivalent to Tor hidden services/an I2P eepSite running forum software for the questions? It’s also not clear to me from the paper whether they are completely decrypting and re-encrypting between each hop, or doing some sort of layered encryption like onion routing.
2. The paper states "However, it does not protect requestor's anonymity against the platform. Recall that the platform is a very powerful attacker since it knows all relations between all users, and it can monitor and store all exchanged messages. Moreover, it can create fake accounts; however, it cannot create fake relations since they are based on real friendships. " Why can’t it create fake relations? As mention in other reviews, it seems that people will add just about anyone who asks, and I doubt if they would notice added "friends". Then again, maybe the paper’s authors define "relations" as the out of band exchanging of keys specified on page three.
3. This is not something that is the responsibility of the system the authors are proposing, but it’s still something I would have liked to see mentioned. Care still needs to be taken in how the questions are asked, and how the responses are phrased. Depending on language and the details given in the questions/responses identity information may still be leaked.
What Do People Ask Their Social Networks and Why? A Survey Study of Status Message Q&A Behavior
http://people.csail.mit.edu/teevan/work/publications/papers/chi10-social.pdf
As this paper is more about sociology than straight technology, I’m not quite sure how to critique it. Here are issues I could potentially see with the data:
1. The demographics were rather skewed, though this is admitted by the authors. Only 25.5% of the participants were women, and all worked for Microsoft in one way or another. This would definitely have effects on the kinds of questions asked and answered. I would imagine the technology questions category would be especially skewed because all of the participants were Microsoft employees. People would either ask a lot of tech questions because that is what they work on, or no tech questions because they did not want to seem technologically illiterate. Also, I’d be interested in seeing more details on the average number of Twitter followers most users have. I have pretty good luck asking questions via Twitter, but I also have 1,777 followers, which skews things greatly.
2. I would say that how the authors asked questions in their survey, and how they chose to categorize responses, made a big difference in the interpreted outcome. One of my biggest fears with the social "sciences" is skewing of data based on the researcher’s belief system, and confirmation bias. As Ernest Rutherford supposedly said: "All science is either physics or stamp collecting."
3. Related to the point above, I wonder how honest people were when asked regarding their motivation for bothering to answer someone else’s question. I truly think the "ego", "social capital" and the "free time" categories would be higher. Then again, maybe I just have that bias because I’m self-centered.
"I’ve Got Nothing to Hide" and Other Misunderstandings of Privacy
http://papers.ssrn.com/sol3/papers.cfm?abstract_id=998565
1. Before I even read the paper, just after seeing the title, I thought: the paper must be framing the problem wrong, everyone has something to hide. I’m glad to see the paper does mention this concept, even quoting Aleksandr Solzhenitsyn: "Everyone is guilty of something or has something to conceal. All one has to do is look hard enough to find what it is." Perhaps the statement "everyone has something to hide" comes off as too acerbic to be useful in a debate. The author mentions a comment on his blog that said: "If you have nothing to hide, then that quite literally means you are willing to let me photograph you naked? And I get full rights to that photograph—so I can show it to your neighbors?" Framed that way, it may sound too much like a straw-man argument, which is perhaps why the author wishes to move away from it. I still think "everyone has something to hide" is the best, and perhaps only, counter argument, and even though the author tries to get away from it, he ultimately just gives reasons why someone would want to hide something.
2. Framing privacy as a banner term for related concepts seems like a good idea. I could perhaps argue about some of his taxonomy. That said, as privacy is used as an umbrella term for many related rights concepts, it means many different arguments for and against need to be made for all of the sub-concepts. I see much arguing and splitting of hairs in the future, but since the author is a law academic I imagine that’s great for job security. Also, I don’t see the whole "individual rights" vs. "utilitarian principles" debate being resolved. Thinking "what is best for the group" to the exclusion of the individual is a dangerous road.
3. Let’s see if I can tear apart this quote from the paper: "Many commentators who respond to the argument attempt a direct refutation by trying to point to things that people would want to hide. But the problem with the nothing to hide argument is the underlying assumption that privacy is about hiding bad things. Agreeing with this assumption concedes far too much ground and leads to an unproductive discussion of information people would likely want or not want to hide." For example, the chilling effects argument still seems to boil down to:
a. People fear what would happen if their political/philosophical/religious beliefs were exposed.
b. Because of point a, they wish to hide some of their political/philosophical/religious beliefs.
c. Because of this, fewer opinions are expressed in relation to the controversial matter.
Granted, what the author may care about is the chilling effect, but ultimately it is caused because someone wants to hide something. In the case of future misuse, unintended consequences and lack of control:
a. The individual fears what someone will do with the information.
b. As such, the individual has something to hide.
To sum up my thoughts, "everyone has
something to hide" is still the best argument for privacy, but to be effective
you must give examples of why someone would want to hide things that others
would agree with (and perhaps not see as "bad"). I have tons of things to hide,
and understand others do as well, and I’m fine with that.
Saving Facebook
http://works.bepress.com/cgi/viewcontent.cgi?article=1019&context=james_grimmelmann
Let’s phrase this a different way: is Facebook worth saving, and how much do we care to protect people from themselves?
1. On page 1141 there is this quote: "Giving users "ownership" over the information that they enter on Facebook is the worst idea of all; it empowers them to run roughshod over others’ privacy." Perhaps this is just an overstatement by the author. If it were easy to export your data, it would be easier for new social networking sites to spring up that may have more concern for privacy than Facebook. How about just the option to export only the data items you created, not those of your "friends", plus a list of your current contacts? This to me would be a nice feature to offer, and for those concerned with others carrying off their data elsewhere, they can already do that with a page scraper if they have the desire. On balance I’d like for people to be able to save what they have created in case Facebook goes belly up like GeoCities. Jason Scott did a great talk on this earlier in the year:
http://ascii.textfiles.com/archives/2416
http://www.youtube.com/watch?v=hUzIy9UXmuc
2. On page 1159 they mention the "tremendous politics" involved, especially around "top 8 spaces". The same kind of politics comes up with who you add and don’t add to your network. Seems like all the more reason just not to use Facebook, and avoid the drama.
3. The points made about changing/undependable privacy policies are interesting. What was not a concern at one time now is. Two examples come to mind: the front wall "news stream" that so many hated (even though their data was already out there) and the addition of more types of people to the network (no longer having to be part of a university, now your mom can add you as a friend and see your drunken pictures). I guess you have the option of not adding your mom (as I have chosen to exclude all of my family), but people look at you weird. One of my main reasons for dropping 3/4ths of my "friends" was concerns about changing policies and how they might affect future disclosures (see the applications section at the bottom of this write-up).
4. Is it just me, or does saying "Privacy Policies" and "Technical Controls" won’t work, but "Reliable Opt-Out" and "Predictability" may work seem a little silly? The last two seem to be subsets of the first two.
All in all, I’d like to see more people take responsibility and just realize not to put some stuff online. "User-Driven Education" may be the answer, but in my experience most users won’t learn. I started looking into what data your friends can give up about you just by adding an application. Here are the items you can set under "Control what information is available to applications and websites when your friends use them":
Bio
My videos
Birthday
My links
Family and relationships
My notes
Interested in and looking for
Photos and videos I'm tagged in
Religious and political views
Hometown
My website
Current city
If I'm online
Education and work
My status updates
Activities, interests, things I like
My photos
Places I check in to
Only "Interested in and looking for" and "Religious and political views" are off by default. Made me wonder if application creators were selling this data, and it looks like they were:
http://arstechnica.com/web/news/2010/11/facebook-punishes-app-developers-found-selling-user-data.ars
http://www.pcworld.com/article/209444/surprise_your_facebook_data_is_for_sale.html
Looks like there were/are plenty of apps doing it. Not that I care much; I ensure my privacy by being boring. I guess they could get my email address elsewhere and do better targeted spam. I have a good business model for you: write the next Farm-vile (not a spelling error), scrape all the data you can from all the people you can (including their friends lists), wait 5 to 20 years for one of them to run for political office, then sell the juicy stuff to reporters.
l-diversity: Privacy beyond k-anonymity
http://www.cs.cornell.edu/~mvnak/pubs/ldiversity-icde06.pdf
This is another one of those papers that wishes to define terms and make a new way of categorizing privacy. The most useful thing I think I got out of it was the idea that sometimes anonymized data sets are so homogeneous that facts can be easily inferred. Now for my main thoughts on the paper:
1. I’m curious how useful the data will be after making it so you can’t identify the individuals. I’d love to have a better use case than the hospital records example they used. If I were a researcher, depending on the topic, I’d like to have as much data as I can so I can make the very correlations they are trying to prevent. Getting rid of the homogeneous nature of some data may inhibit its usefulness: if all of the people from the same area code, of the same age, had the same disease, that could be useful in figuring out what was going on.
2. This goes back to the example use case: could they not find a better one? Why should a hospital be posting this kind of data publicly? Researchers may want to know more than what was posted, and concerned parties for a sick individual should just care about the single patient. I’m not sure why the hospital would want to post this data publicly, so it seems like a weak case example.
3. What is the threshold for a property being well represented? I guess that depends on the situation.
Why not keep all of the data "need to know" only? Sure, some with the "need to know" may also leak the data, but so could the person putting out the anonymized data sets. Maybe applying this to a more real-world example would help, like the AOL search incident from several years back. I’d also like to hear more about how this can be used in a useful database.
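As a side note on the terminology itself, the homogeneity problem is easy to show: k-anonymity only says each quasi-identifier group has at least k rows, while l-diversity additionally demands at least l distinct sensitive values per group. A minimal check over a made-up table:

# Minimal check of k-anonymity vs. l-diversity on a made-up record set:
# rows are (quasi-identifier, sensitive value). A group can satisfy
# k-anonymity yet leak everything if all its sensitive values agree.
from collections import defaultdict

rows = [
    ("ZIP 478**, age 30-40", "flu"),
    ("ZIP 478**, age 30-40", "flu"),
    ("ZIP 478**, age 30-40", "flu"),      # homogeneous: k=3 but l=1
    ("ZIP 402**, age 20-30", "flu"),
    ("ZIP 402**, age 20-30", "cancer"),
    ("ZIP 402**, age 20-30", "asthma"),   # k=3 and l=3
]

groups = defaultdict(list)
for quasi, sensitive in rows:
    groups[quasi].append(sensitive)

for quasi, values in groups.items():
    print(quasi, " k =", len(values), " l =", len(set(values)))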
Privacy in the Clouds: Risks to Privacy and Confidentiality from Cloud Computing
http://www.worldprivacyforum.org/pdf/WPF_Cloud_Privacy_Report.pdf
This is a useful paper as a survey of mostly US laws and how they may apply to cloud computing. Also, let me take a moment to say that I dislike the term "cloud" as it’s pretty much just a marketing buzz term at this point which means too many different things to different people. The only commonality is pretty much: "someone else hosts it, not us".
1. From the anti-forensics research I did awhile back, I had a rough idea that your expectation of privacy and protections under the 4th amendment were somewhat less with outside hosting than if you had the data on media in your physical possession. I like that this paper points out real case law and how it may apply.
2. Seems like the individual data that has the most protection may be healthcare information. I’ve heard a lot of people in the security community say they have yet to see anyone really get hit hard by HIPAA from a privacy standpoint, but perhaps it does have some useful teeth. I wonder if there is a way to mix HIPAA related data with other information so as to give it more legal protection? Perhaps mention some of my medical maladies in my Google docs?
3. Politicians make laws and lawyers/judges interpret them (thus making new laws in the form of case law) without really understanding the technology, unforeseen effects, or if it’s even feasible. Much of the meaning of the various laws is still up in the air until more trials take place and case law is established.
I wonder how feasible it would be to make a Firefox plugin that allows you to encrypt the data in any form element before it is submitted to a cloud provider? I imagine getting the plugin to work with an AJAX web application that uses a lot of JavaScript to modify the fields as they are being typed, and not send any intermediate data to the web app server, would be a pain. If a workable way could be found, users could have the convenience of having their data "out there" in the cloud where they can get to it, but still have it secured from the prying eyes of the cloud provider and others. The user would still have to install the plugin, or carry around a portable browser with it already installed, but it could be an interesting project. Think something similar to FireGPG.
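As a rough sketch of the core of such a plugin (ignoring all the hard AJAX integration parts), the client-side step would just be encrypting each field’s value under a key only the user holds before it ever leaves the browser. Here is the idea in Python using the third-party cryptography package, purely to illustrate the data flow:

# Illustration of the data flow only: encrypt form field values locally
# before they are submitted, so the cloud provider only ever stores
# ciphertext. Requires the third-party "cryptography" package.
from cryptography.fernet import Fernet

user_key = Fernet.generate_key()      # stays with the user/plugin, never uploaded
box = Fernet(user_key)

form = {"title": "grocery list", "body": "things I would rather keep private"}

# What actually gets POSTed to the provider:
submitted = {field: box.encrypt(value.encode()).decode() for field, value in form.items()}
print(submitted)

# Later, the plugin decrypts on the way back down:
recovered = {field: box.decrypt(token.encode()).decode() for field, token in submitted.items()}
print(recovered)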