Cryptography

The History and Mathematics of Codes and Code Breaking

Tag: statistics

Needle in a Haystack

Cracking codes seems like it should be a relatively straightforward task. Codes are not designed to be indecipherable; people encrypt text with the intention that someone else, somewhere, will be able to turn it back into a meaningful message. To make that possible, the code maker employs an agreed-upon pattern so that the intended recipient can later translate the text easily. But for someone without knowledge of the key, cracking a code proves much harder. In chapter 3 of The Code Book, which covers more advanced ciphers, Singh shows how a cryptanalyst might begin deciphering a message. Using a piece of text encrypted with a keyword under the Vigenère polyalphabetic cipher, he demonstrates how testing common words at various points in the ciphertext can uncover fragments of the plaintext and, ultimately, the entire message.
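This tactic is often called "crib dragging": slide a guessed plaintext word (a "crib") along the ciphertext and read off the key letters that would have produced it at each position. Here is a minimal sketch; the ciphertext is my own toy example ("THEENEMYATTACKSATDAWN" encrypted with the keyword "KEY"), not one of Singh's.

```python
# Crib dragging against a Vigenere ciphertext: for each position, compute
# the key letters that would be needed for the crib to appear there.

def derive_key_fragment(ciphertext: str, crib: str, pos: int) -> str:
    """Key letters implied by assuming `crib` appears at `pos`."""
    fragment = []
    for c, p in zip(ciphertext[pos:pos + len(crib)], crib):
        # Vigenere: cipher = (plain + key) mod 26, so key = cipher - plain
        fragment.append(chr((ord(c) - ord(p)) % 26 + ord("A")))
    return "".join(fragment)

ciphertext = "DLCORCWCYDXYMOQKXBKAL"  # "THEENEMYATTACKSATDAWN" under key "KEY"
for pos in range(len(ciphertext) - len("THE") + 1):
    # A fragment that repeats, or looks like part of a word, hints that the
    # crib is correctly placed; here position 0 yields "KEY".
    print(pos, derive_key_fragment(ciphertext, "THE", pos))
```

The tedium Singh's readers feel is visible even here: every position produces *some* key fragment, and the cryptanalyst must judge by eye which ones look like pieces of a real keyword.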

This method, while theoretically plausible, poses an incredibly tedious task for a codebreaker, even for relatively short messages. It is, at bottom, a mathematical problem. In the English language alone, 26 letters form over 170,000 words (Oxford Dictionaries). These words can be arranged in a number of ways that grows explosively as the length of the text increases, and it is practically impossible for any human, or any computer that exists today, to test every possible combination. Although there are some typical patterns to look for, they may not always be obvious, or present at all. Singh uses short examples that he designed for the purpose of demonstrating such tactics; real codes are not built to work so nicely. In reality, sifting through a ciphertext is like looking for a needle in a haystack: although the pile of letters lies right in front of you, finding what you are actually looking for may prove nearly impossible.

Misleading Statistics

An interesting point Cory Doctorow brought up in his novel, Little Brother, is the idea of the "false positive." He writes, "Say you have a new disease, called Super-AIDS. Only one in a million people gets Super-AIDS. You develop a test for Super-AIDS that's 99 percent accurate... You give the test to a million people. One in a million people have Super-AIDS. One in a hundred people that you test will generate a 'false positive' -- the test will say he has Super-AIDS even though he doesn't. That's what '99 percent accurate' means: one percent wrong... If you test a million random people, you'll probably only find one case of real Super-AIDS. But your test won't identify one person as having Super-AIDS. It will identify 10,000 people as having it" (128).
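Doctorow's numbers are easy to check. A quick sanity calculation, reading "99 percent accurate" as a 1% false-positive rate, reproduces his 10,000 figure:

```python
# Checking the Super-AIDS arithmetic from Little Brother.
population = 1_000_000
prevalence = 1 / 1_000_000      # one real case per million people
false_positive_rate = 0.01      # "one percent wrong"

true_cases = population * prevalence                      # about 1 person
false_positives = (population - true_cases) * false_positive_rate

print(int(true_cases))          # 1 real case
print(round(false_positives))   # 10000 false alarms
```

So for every genuine case the test finds, it wrongly accuses roughly ten thousand healthy people, which is exactly the paradox Doctorow is driving at.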

This idea can be linked to Michael Morris's essay on student data mining. Critics of Morris argue that looking at students' data would not be an effective method of school shooting prevention, since many innocent behaviors can be read as "suspicious." Even if mining student data were deemed 99% effective at detecting threatening individuals (which it is not; it is most likely nowhere near that figure), the false-positive paradox means that far more innocent students would be marked as suspicious than actual threats. However, one can argue that the pros of these "threat tracking" methods outweigh the cons. If data surveillance can prevent a dangerous school attack, then it is worth identifying a few innocent people as suspicious. (This opinion can be seen as a bit Machiavellian.)

The paradox of the false positive can be applied beyond data encryption; one can use it to examine how misleading statistics are in general. For example, hand sanitizer claims to kill 99.9% of bacteria. There are about 1,500 bacterial cells living on each square centimeter of your hands. If 99.9% of those cells are killed off, one or two per square centimeter still survive, which adds up to hundreds across your hands, and the survivors are probably the hardy ones capable of making you sick.
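The rough arithmetic behind that claim is simple; the 450 cm² figure for the surface area of both hands is my own ballpark assumption:

```python
# Survivors of a "kills 99.9% of bacteria" hand sanitizer.
cells_per_cm2 = 1_500   # bacterial cells per square centimeter of skin
hand_area_cm2 = 450     # assumed surface area of both hands
kill_rate = 0.999

survivors = cells_per_cm2 * hand_area_cm2 * (1 - kill_rate)
print(round(survivors))   # 675 cells still alive
```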

The Paradox

In Doctorow’s Little Brother, Marcus Yallow is a young boy who is falsely accused and interrogated on the grounds of being a terrorist. He decides to wage war against the DHS, the organization that kidnapped him, by creating more instances of suspicious behavior in order to make their security systems seem wildly inaccurate. He explains it by saying, “the more people [the security system] catches, the more it gets brittle. If it catches too many people, it dies”. He uses the paradox of the false positive to help him achieve this.

So, what is the paradox of the false positive? Well, let’s say 1 in every 100,000 college students commits suicide, and universities have a system that can predict these tragic events with 99% accuracy based on students’ web behavior. At first glance this seems pretty accurate, right? Wrong. A 99% accurate system still flags 1% of everyone it screens, and 1% of 100,000 is 1,000 students. That is far larger than the actual number of students who commit suicide. Therefore, if only 1 of these 1,000 flagged students is a real case, 99.9% of the flags are false positives. This is known as the paradox of the false positive.
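A few lines of arithmetic make the campus example concrete, using the same hypothetical numbers as above:

```python
# The campus false-positive paradox: 1 real case per 100,000 students,
# a system that flags 1% of everyone it screens.
students = 100_000
real_cases = 1                      # actual cases per 100,000 students
flagged = int(students * 0.01)      # students flagged by the system

# Assuming the one real case is among the flagged students, the share of
# flags that are false positives is:
false_positive_share = (flagged - real_cases) / flagged

print(flagged)                # 1000 students flagged
print(false_positive_share)   # 0.999, i.e. 99.9% of flags are wrong
```

The posterior probability that a flagged student is actually at risk is only 1 in 1,000, which is why the system "gets brittle" as it catches more people.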

Yallow’s explanation of this paradox caught my eye. I found it very interesting because it highlights just how easily data can be manipulated to portray a certain story. For example, a test for XYZ disease could be 99% accurate, yet that figure doesn’t paint the whole picture of how reliable the product actually is. It could lead consumers who falsely tested positive for the disease not only to worry but also to pay money for medication that they don’t necessarily need. This applies to many other products and services as well, and it has made me think twice before blindly accepting data.

Statistically Suspicious

Cory Doctorow's novel Little Brother is a rousing look at a society where security has become dominant over privacy and liberty. In the aftermath of a terrorist attack on San Francisco, Marcus Yallow sparks and leads an underground movement to take back the rights of citizens from an oppressive police-state government.

I've read Little Brother before, but the issues and solutions regarding online security always fascinate me. I was particularly interested in the problem of encrypted traffic standing out. In the novel's universe, an agent monitoring web traffic who noticed a large amount of encrypted information passing to a single machine, against a backdrop of mostly unencrypted data, would grow suspicious. In some investigations, you don't need to know what someone is hiding, only that they are hiding something. But a large amount of encrypted traffic from a computer doesn't always mean that illegal activity is occurring, just as walking down the street with your hands in your coat pockets doesn't mean you're hiding drugs, stolen goods, or a weapon.

A statistically significant outlier might be an individual with malicious intent, but it might also be one errant yet innocent data point. The beginning of chapter 8 mentions histograms and Bayesian analysis being used to find abnormal behavior, flagging people who were "not guilty people, but people with secrets." Privacy, and the ability to keep some aspects of one's life out of the public eye, is close to an inalienable natural right. Keeping some things secret is a normal part of everyday life, so it makes little sense to treat a desire for privacy as statistically abnormal.
