Does Too Much Data=Bad Predictions?
There's no shortage of data available to police, meteorologists, and other soothsayers. But could there be a point where more data means worse predictions? Yes, sayeth camera skeptic and BoingBoing impresario Cory Doctorow:
Take London: cover every square inch of the city with CCTVs and you'll get so much information that you'll never make any sense of it. Scotland Yard says that CCTVs help solve fewer than 3% of all crimes, while a study in San Francisco found that at best, criminals simply move out of camera range, while at worst they assume no one is watching.
Similarly, if you take fingerprints from every person who applies for a visa – or worse still, from every person in Britain who has to carry one of the proposed new biometric cards – you will fill the databases with chaff that slows down searches, generates endless false matches, and threatens everyone in the database with the worst kind of identity theft.
Check out Doctorow's recommendations for political books for young adults. And check out his book, Little Brother.
You're not talking about predictions.
In other words, getting data is never a bad thing (assuming the data is accurate). You can always filter it if you need a more manageable search. You lose information that way, but you're never worse off than you would be if you didn't have any of this data at all.
This is more of a case of technology not improving things as much as was hoped. If all I was concerned with was raw crime solving ability, I'd prefer having cameras everywhere to having cameras nowhere. Even if I have to ignore all the camera video because it's useless, I'm not any worse off than I would be if there were no cameras.
Since you invoked meteorology, I have to throw in my two cents...
I don't see any way in which "too much" data would make predictions worse. To be sure, you can reach a point at which you simply cannot use or compute all the data, but that doesn't negatively impact your predictions; it just means you've wasted your time/resources collecting excess data. You could also argue that an effort to collect large amounts of data may compromise the quality of said data, but that's not an inherent problem.
I guess ultimately I'm expanding on Chris's point: matching suspects is different than making predictions. I suppose technically, if your data resolution was greater than your actual model, you'd have to use processing resources to interpolate a dataset with matching resolution, but you already do that with data sets that are too small. Bottom line is, if you're talking about numerical predictions, the excess data is much more likely to simply be ignored rather than clogging things up.
Even if I have to ignore all the camera video because it's useless, I'm not any worse off than I would be if there were no cameras.
You mean all that stuff costs $0.00?
Good point, Finger, but that doesn't mean that the data itself hampers efforts to fight crime, as KMW is arguing. But yes, the costs and threats to privacy do have to be considered in the final analysis.
Perhaps the data doesn't actually hurt the investigation, but it does seem that drinking from the firehose is unproductive compared with more intelligent approaches to, um, intelligence.
What are we talking about here? Initially Doctorow asks whether or not too much data causes us to make worse predictions, but veers off course to make the point that CCTV is ineffective in criminal prosecution.
Leaving aside whether or not either of those statements happens to be true and in what context, the entire post is a bit of a non sequitur.
For a perspective on the history and culture of data collection, try to find Ian Hacking's "Biopower and the Avalanche of Printed Numbers".
Doctorow is (probably incorrectly) assuming that the cams are designed with best intentions, rather than their probable goal of just asserting authority over the proles and letting them know they're being watched.
And, since this post extols Doctorow, let me offer an OT comment about his fellow BB poster XeniJardin.
Here's a couple of my comments she deleted from BoingBoing. That follows (and might be related to) an incident from a few years ago when I added some negative information to her WP page about some pictures she posted to BB that supposedly involved the MMP. But one of them was an inflammatory poster not from the MMP, and she failed to note that.
That was eventually deleted from her WP entry, as you can read about on this archived WP talk page:
tinyurl.com/54xdvn
Chris Potter: Doctorow was fairly on-target about the ways in which too much data can be a bad thing: false positives and increased handling time/costs.
Figuring out that false positives are false takes time and/or money, and it's time and/or money that could have been spent doing something productive in terms of solving the case. Similarly, if each trawl through your database takes a half hour because it's a few terabytes of data, you not only slow down investigations, you probably limit the extent to which the data can be/will be deployed. Nobody's going to check everyone getting onto an airplane for being a terrorist if it takes 30 minutes (or, really, even 10 minutes) to check each person.
Now, of course, it's trivially easy to imagine scenarios in which having more data helps solve cases, and, on balance, it probably is true that they more often help than hurt in the narrow, focused sense. I think that the arguments about privacy, abusability, etc. are better than the "too much data hurts" one. But there is something to that argument.
Doctorow is (probably incorrectly) assuming that the cams are designed with best intentions, rather than their probable goal of just asserting authority over the proles and letting them know they're being watched.
In my (pretty large) city, there's a traffic intersection that has SIXTEEN CAMERAS (4 pointing in each direction). There can't be enough cops to watch 'em all, so the objective must be simple intimidation -- or "deterrence" of whatever the poobahs think should be deterred at that corner.
Chris Potter-
Do you think, in the final analysis, that Tony Blair and Gordon Brown have given much consideration to privacy?
What right does the state have to view one's comings and goings? There are too many people, including, it would appear, far too many libertarians, who accept omnipresent cameras as inevitable.
I think Orwell might have preferred his own imagery of a boot stomping on a human face.
Doesn't that get the point across better?
far too many libertarians, who accept omnipresent cameras as inevitable.
I have no problem with ever-present cameras; in fact I wholeheartedly approve of having a true record of what's happening out in the public world. What's objectionable is the state owning and operating them.
Let Little Brother watch with a thousand prying eyes, and train an especially watchful glare on those entrusted with the state's monopoly on violence.
liberty mike,
There is no expectation of privacy in public spaces. Nor should there be. How is this any more of a privacy violation than posting a policeman at every corner?
I keep trying to post a huge screed on why "more data is not better"
It keeps getting sucked up in the intertubes. I think it's the matrix fucking with me.
Short answer why the "more data" approach isn't at all good for focused problems: see the Will Rogers quote here =
http://www.bobcongdon.net/blog/2004/06/boil-ocean.html
Doctorow was fairly on-target about the ways in which too much data can be a bad thing: false positives and increased handling time/costs.
That's not the fault of the data collection, it's the fault of a particular way of using the collected data. If searches are taking too long or turning up false positives, filter the data set, as I suggested before. You'll still be better off than if you had no data at all.
And Doctorow's quoted point had nothing to do with the costs, but rather the claim that more data directly hampers investigations.
Also= read the Black Swan book by Taleb
And Lonewacko, you're a douche
Chris Potter | June 17, 2008, 8:14pm | #
Doctorow was fairly on-target about the ways in which too much data can be a bad thing: false positives and increased handling time/costs.
That's not the fault of the data collection, it's the fault of a particular way of using the collected data.
You dont consider data collection a cost?
Or the time spent eliminating an unlimited amount of negatives, a cost either?
What do you do for a living?
maybe you're confusing "investigations" with "tangible results".
I mean, it's great that they have 200billion hours of tv footage of guys pissing in doorways, but it aint exactly doing anything about it.
the panopticon effect is useless when you dont have a controlled population. Soon, we'll need to facial-scan everyone to make sure the SYSTEM works.
Chris Potter-
Posting a policeman at every corner is obnoxiously repugnant to a free society. I have a right to not have my image and likeness video-taped without my consent. I have a right to determine who will examine, use and/or distribute my image.
What is really lame is the proposition that one "consents" to having his image videotaped, analyzed, used and/or distributed just because he is locomoting on a public street. That is totalitarian clap trap.
If the framers had intended to give the state the right to establish permanent surveillance, they would have provided for such. They did not.
Chris Potter-
This place was birthed by extremely radical folk who, overwhelmingly, were adherents to natural rights philosophy. Posting a cop at every corner is utterly inconsistent with natural rights philosophy. Is your philosophy more appealing than that of the framers?
GILMORE,
I, um, collect...uh, data.
But not that kind. 😉
Listen, I'm down with the concern that the costs might be high enough that it's not worth it. That's not what the claim was. Read the friggin title of the post. KMW and Doctorow are saying that collecting large amts of data inherently leads to bad results.
I have a right to not have my image and likeness video-taped without my consent. I have a right to determine who will examine, use and/or distribute my image.
You have a right to life, bodily integrity, liberty, and the use of your property. Where do these supposed rights fit in?
What is really lame is the proposition that one "consents" to having his image videotaped, analyzed, used and/or distributed just because he is locomoting on a public street. That is totalitarian clap trap.
More like a necessary guiding principle for a high-tech society. Are you saying that if a mom videotapes her family celebrating one of the children's graduation in a public space, and you happen to walk through the background, she has to track you down and ask your permission before she can show the video to anyone?
If the framers had intended to give the state the right to establish permanent surveillance, they would have provided for such. They did not.
The Constitution doesn't enumerate powers for state govts, just the federal one. State govts have all the powers that are not forbidden to them in the US Constitution or in their own constitutions.
Chris-
It is axiomatic that the natural rights philosophy undergirded the "american experiment in ordered liberty" and that "the language of the Declaration of Independence provided the standard American expression of that philosophy." Kimberly C. Shankman and Roger Pilon, Reviving the Privileges or Immunities Clause to Redress the Balance Among States, Individuals, and the Federal Government, 3 Texas Rev. of Law & Politics 1, 12 (1998).
Posting a cop at every corner is utterly inconsistent with natural rights philosophy.
How so? You keep claiming this, I don't find it convincing. In any case, despite my religious views, I'm becoming more of a utilitarian than a natural rights philosopher in legal matters these days. Natural rights legalism leads to some pretty hideous conclusions.
liberty mike,
The Founders weren't God. Nor is Ron Paul, but we've already discussed that on other threads and you seem recalcitrant on that point.
No offense to the framers, they did a good job for their time, but I'm not going to surrender my will to wife-beating slaveholders any time soon.
Chris -
No they do not.
1. Read the state constitutions.
2. Have you ever heard of the 9th amendment?
Do you understand natural rights philosophy? You do know that John Adams was a self-described natural rights adherent? Are you familiar with the writs of assistance cases argued by James Otis in 1761? Do you know of the relationship between Otis and Adams?
Our rights do not come from the state or what some majority ordains. They inhere; they are god given. That is what Mr. Adams believed. Ditto Mr. Jefferson.
The natural rights philosophy undergirds the ninth amendment. Thus, the question of where in the constitution it says one has a right of privacy is not the question, as the framers' conception of rights included the proposition that the sum of all of our rights could never be catalogued - thus the ninth amendment. No 9th amendment = no ratification = no USA.
Nor am I going to surrender my rights to pusillanimous pussies who want uniformed thugs on every corner.
Chris-
Of course they were not gods. As I have often been forced to admit, they sure didn't always practice what they preached.
That's not the fault of the data collection, it's the fault of a particular way of using the collected data. If searches are taking too long or turning up false positives, filter the data set, as I suggested before. You'll still be better off than if you had no data at all.
If you could filter false positives out of a data-set, you wouldn't get false positives.
Your argument is, "If we ignore the cases where large data sets cause bad results, large data sets cause only good results." That's... true, I guess.
You still haven't shown that having a police officer* on every corner conflicts with the Bill of Rights or natural rights theory, or explained why, given that the Framers were not God, I should care what their philosophy was.
* I mean of course an officer bound by the same laws as everyone else, not the unaccountable gods-in-their-own-eyes that infest our PDs and other LEAs today.
Not quite, Mr Sullivan. I was saying filter on some quicker basis, even just randomly throw out huge chunks if that's what you need to do to get the data set down to size. True, you might wind up throwing out data that would actually be helpful in the process, but in no case do you wind up worse off than you would if you had no data at all to work with.
Let me give an example: Detectives Smith and Jones are old-school detectives who don't like all this new high-tech stuff. They prefer the old-fashioned ways of questioning eyewitnesses and sniffing around the crime scene after the fact. However, the mayor decides to hedge his bets by investing in a camera system to blanket public places and a geek to run facial recognition and tracking software on the collected video, while still letting Smith and Jones do their investigation their way.
Now, maybe the geek turns up false positives left and right, has to throw out 3/4 of the video randomly to be able to search properly, and runs up the electricity bill. But there's no way he actually makes things worse than if there was just Smith and Jones doing their detective work.
* I mean of course an officer bound by the same laws as everyone else, not the unaccountable gods-in-their-own-eyes that infest our PDs and other LEAs today.
Chris --
There's the rub. What the framers understood (better than most) is the potential for abuse once the means of abuse are granted to the government. There'd be no objection to cameras everywhere if we could be absolutely assured that those videos would never be used to further the personal political or military or power agenda of any particular individual or group.
Fat chance.
Chris-
You do not have a right to impose your fear on me.
To your point, yes I have. Read the 9th amendment. It says "the enumeration in the constitution of certain rights, shall not be construed to deny or disparage others retained by the people."
Our government is one of limited power or enumerated powers. There is no grant of authority in the federal constitution or in any constitution of the original 13 that enables government to maintain permanent surveillance over the citizenry. Period.
On this point, I'll go with the framers over those that think they or the majority have the right to impose their fear on the rest of us.
Only a radical fearmongering pussy would argue that the state has a right to maintain permanent surveillance on the citizenry. It is entirely unreasonable and utterly at odds with first principles. Oh, of course, those who stood to benefit by the permanent surveillance state would be in its favor. Yes, my position is first a moral one and superior to the "moral" position that some mob can maintain a permanent surveillance state. Second, there is no practical justification either-unless one believes that there is an Atta in every attic waiting for the chance to do some evil.
Chris-
But I forget-you are not a libertarian. If a libertarian does not understand that libertarianism is rooted in natural rights philosophy, then he is ready to be hannitized.
Crusader Rabbit-
I agree except that you are not speaking for me in your hypothetical-as I do not consent to one pence of what I produce being confiscated for rent seekers, state actors or other parasites.
This point has probably already been made, and doesn't touch on the implications to liberty, but there's a difference between data amount and data flow (namely, its first derivative). The rate is what really matters; this is why there is an optimum number of gauges on a car's dashboard. As another example, it did not matter that there were about 1000 different alarms on the control panel at Three Mile Island. What mattered is that at any given time, about a third of them were locked in (i.e. continuously in an alarming condition but silenced). So a true problem was difficult to ascertain.
Our government is one of limited power or enumerated powers.
The federal govt, yes.
The state govts, no.
Does your state's constitution specifically grant it power to set speed limits and parking rules? And while you are doing a great job of repeating that surveillance violates your natural rights, you have given no evidence for this viewpoint.
Surveillance in public places, that is.
liberty mike, you must really enjoy having a drink!
Now that small remote control helicopters are available in most toy stores, it's only a matter of time before they can be equipped with paint-ball guns that can be aimed at surveillance camera lenses.
In my (pretty large) city, there's a traffic intersection that has SIXTEEN CAMERAS (4 pointing in each direction). There can't be enough cops to watch 'em all, so the objective must be simple intimidation -- or "deterrence" of whatever the poobahs think should be deterred at that corner.
Any idea if they are police- or transportation-run? Our local intersections often have four cameras, which are used to automatically adjust traffic light timing. Those are not "real" cameras, in the sense that they do not transmit images of the vehicles elsewhere. There are other spots where there are transportation cameras hooked up for remote viewing to watch for crashes, traffic backups, etc. I think there is even a website from our state DoT.
Or maybe your state has red-light cameras that send automatic tickets.
Zubon --
These are so-called red light cameras, with elaborate flash and other accoutrements to go with them. Lots of cool (and expensive) equipment that no doubt got the juices of the city procurement folks flowing.
But I assume the police can use them for other purposes, too. Once the image is captured, imagination is the only limit on what it can be used for.
yes, and no.
if you're trying to do an analysis that for example needs to draw some conclusion from all, or a combination of this data, then yes more data is more problems. usually some methods are needed to try to achieve a more useful set of data.
if you're just monitoring, recording, or archiving for retrieval when necessary, all you need is more resources.
.. so in conclusion, yes you really cant gather a ton of data and think you can extract something useful with it automatically.
It has been my experience that most people inadequately appreciate the importance of sensory and perceptual adaptations to environment in creating a successful species. That is to say, species that are dominant in their environments appear to have evolved sensory apparatus that is well-tailored for the environment and the species' role in it. And, as much as sensory mechanisms provide ways to experience the world, they also entail built-in filters that shape the organism's experience, sieving out irrelevant or distracting data points, even before the brain can try to make sense of them. The basic sensory limitations that an individual has have PROVEN, through evolution, to admit just enough information for the success of the species, and to reject the rest so as not to overtax (or confuse) the brain's ability to discern patterns in the input.
I observe, on the other hand, that classical paranoids seem overly preoccupied with very tiny details, picked from a broad canvas of as much information as they can hold in their heads at one time. Perhaps a touch of this is conducive to survival, but I can easily imagine a situation of "too much" information, in which the paranoid mind connects many dots he might otherwise ignore, to create "patterns" where there really are none. The worldview caused by recognition of such bogus patterns would lead to behavior that were inappropriate and counterproductive for the envrionment, probably to the point of negatively affecting the individual's chances at survival or reproduction.
It is not much of a stretch to imagine that a situation of "too much information," which is potentially bad for individuals and species might also be bad for their organizations and institutions, as well.
Once the work of Jeff Hawkins and associates (http://www.numenta.com/, http://www.onintelligence.com/) attains some maturity, and artificial intelligences in the mold he prescribes are hooked into camera networks, for example, it will be interesting to see which cameras need to be enhanced (and how), and which need to be turned off entirely, to get optimum pattern-recognition results. I think we will see that "legitimate" peacekeeping and public-safety functions will require only a few cameras, and that too many will simply serve to confuse or overload the AI.
"behavior that were inappropriate and counterproductive FOR the envrionment" should be "... IN the environment..."
If searches are taking too long or turning up false positives, filter the data set,
You're presuming the data can be meaningfully filtered.
For instance, ATF records instances when they trace the ownership of a firearm. Apparently the individual records do not contain a data field for the reason a law enforcement agency requested the firearm be traced. Therefore it's impossible to filter either for or against "the firearm was used in a crime" or "the firearm was found somewhere" or "the firearm was unusual" or "it was a slow day and we had nothing else to do."
Filters also take some time to run. If, for instance, you take a database that stores the images of bullets for comparison and expand it from 500,000 bullets found at crime scenes (each with six images to store) by adding 200,000,000 bullets from legally-owned firearms, just filtering out the 99.75% unlikely matches could get to be at least enough of a problem to reconsider adding the extra data.
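The arithmetic behind that 99.75% figure can be checked with a rough sketch. It assumes, hypothetically, that each added bullet would be stored with the same six images as a crime-scene bullet; the function and variable names are illustrative only.

```python
# Back-of-the-envelope check of the bullet database expansion described above.
CRIME_SCENE_BULLETS = 500_000
LEGAL_FIREARM_BULLETS = 200_000_000
IMAGES_PER_BULLET = 6  # assumption: added bullets also get six images each


def expansion_summary(existing, added, images_per_record):
    """Summarize how much a database grows when `added` records join `existing`."""
    total = existing + added
    return {
        "share_added": added / total,        # fraction of records that are new
        "growth_factor": total / existing,   # how many times larger the database gets
        "images_before": existing * images_per_record,
        "images_after": total * images_per_record,
    }


summary = expansion_summary(CRIME_SCENE_BULLETS, LEGAL_FIREARM_BULLETS, IMAGES_PER_BULLET)
print(f"share of records from legal firearms: {summary['share_added']:.2%}")
print(f"database growth factor: {summary['growth_factor']:.0f}x")
print(f"images stored: {summary['images_before']:,} -> {summary['images_after']:,}")
```

Under those assumptions every search has to get past roughly 400 times as many records before it reaches the crime-scene bullets it actually cares about.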
Not to mention funding the data collection and storage space.
Now, maybe the geek turns up false positives left and right, has to throw out 3/4 of the video randomly to be able to search properly, and runs up the electricity bill. But there's no way he actually makes things worse than if there was just Smith and Jones doing their detective work.
Unless the mayor orders Smith and Jones to get back to the office and watch the video.
Not quite, Mr Sullivan. I was saying filter on some quicker basis, even just randomly throw out huge chunks if that's what you need to do to get the data set down to size.
The point is, you don't know when you need to do this until you know if you've gotten false positives. And you don't know if a positive is false until you've already spent resources following up on it.
Sure, it might be that you can make some reasonable guesses. If you run a search against a national database for "Michael Sullivan," you'll get thousands of results, and obviously in most circumstances you could say, "Wait, clearly there are a ton of false positives, here."
But that doesn't really save you. Okay, so you chop the data-set down to the point where it's only 2 Michael Sullivans, and then you use that data-set from then forward. But then John Smith shows up, and as common a name as Michael Sullivan is, there are more John Smiths, and you're back in false positive land.
And, in practice of course, that's not how it works. How it works is I try to get on a flight or get a job, and someone asks a national database, "Uh, hey, I've got a Michael Sullivan here. Is he a terrorist?" And the database says, "Well, there's a Michael Sullivan who's connected with the IRA. Cavity search the sucker! Or don't give him a job!" Nobody's winnowing down the data-set because the person making the search doesn't look and see, "Oh, hey, there are thousands of Michael Sullivans in this data-set," they just see a flag that says, "IRA."
Well...sorta. Suppose you have a large database with say, 3,400,000 people in it. The probability of a false positive is 1 in 1.1 million. The probability that you'll get more than one match is going to be about 0.81 or 81%. So if you do use such a database then there will indeed be instances where you have false positives. With things like DNA or fingerprints you'll have a problem, since only one of your multiple hits is the subject you really want.
People often think that things like DNA evidence are "slam dunk" evidence. In terms of indicating innocence it's pretty good. In terms of pointing towards guilt it can be highly misleading. Consider the case of John Puckett, true case. They had a partial DNA profile. Chances of a random person matching that profile, 1 in 1.1 million. The cops ran it through a database of 338,000 DNA profiles. What is the probability you'll get one hit, irrespective of that person actually being guilty or not--i.e. what is the unconditional probability of a match? About 0.23 or 23%. In other words, give me 100 such databases and about 23 of them will spit out a match. So given a match does that mean the person has to be guilty? No. Do prosecutors tell jurors this kind of information? Not in Puckett's case. Should they have? I'd say so.
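For anyone who wants to check those numbers, here is a minimal sketch. It assumes every profile in the database is an independent trial with the same 1-in-1.1-million chance of a coincidental match (a plain binomial model), which is a simplification; the function name is made up for illustration.

```python
def match_probabilities(n_profiles, p_match):
    """Binomial chances of coincidental hits when searching n_profiles records.

    Assumes each record matches independently with probability p_match.
    """
    p_none = (1.0 - p_match) ** n_profiles
    p_exactly_one = n_profiles * p_match * (1.0 - p_match) ** (n_profiles - 1)
    return {
        "none": p_none,
        "exactly_one": p_exactly_one,
        "at_least_one": 1.0 - p_none,
        "more_than_one": 1.0 - p_none - p_exactly_one,
    }


P_RANDOM_MATCH = 1 / 1_100_000

small_db = match_probabilities(338_000, P_RANDOM_MATCH)
large_db = match_probabilities(3_400_000, P_RANDOM_MATCH)

print(f"338,000 profiles, exactly one hit:     {small_db['exactly_one']:.2f}")
print(f"338,000 profiles, at least one hit:    {small_db['at_least_one']:.2f}")
print(f"3,400,000 profiles, more than one hit: {large_db['more_than_one']:.2f}")
```

Under this model the 23% figure is the chance of exactly one coincidental hit; the chance of at least one is a little higher, around 26%. The 81% figure for the larger database is the chance of two or more hits.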
Now, does it help you narrow down your list of suspects? Maybe. Keep in mind you could just have gotten a hit via dumb luck. You could be spending your time investigating an innocent person while the guilty party gets away.
Uhhhmmm, no. The issue of false positives isn't just about filtering. Filtering just makes the database "smaller" thus limiting the number of "trials" and thus reducing the chance of a false positive, it does not eliminate it. And it depends on the probabilities in question. Filtering can help, but it can't eliminate the problem, and the larger the database, the greater the chance of a false positive. Your solution of filtering is basically saying, "Yeah you're right so make the database smaller." Granted it is doing so in a logical manner, but you are basically conceding the point.
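To put rough numbers on "reduces but does not eliminate", here is a small sketch. It assumes independent records and reuses the 1-in-1.1-million random match probability from the DNA example; the database sizes are arbitrary illustrations.

```python
def p_any_false_match(n_records, p_false):
    """Chance of at least one coincidental match among n_records independent records."""
    return 1.0 - (1.0 - p_false) ** n_records


P_FALSE = 1 / 1_100_000
SIZES = (10_000, 100_000, 338_000, 3_400_000)  # hypothetical filtered and unfiltered sizes

chances = [p_any_false_match(n, P_FALSE) for n in SIZES]
for n, chance in zip(SIZES, chances):
    print(f"{n:>9,} records -> {chance:.3f} chance of a false match")
```

The chance shrinks as the searched set shrinks, but it only reaches zero when there is nothing left to search.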
It can if you don't understand how to utilize the results and aren't careful in your analysis. Most cops don't know crap about probability on a basic level. Introduce things like conditional probability and Bayes' theorem and 99.9% of them have probably just fallen asleep. And prosecutors don't like the idea that their preferred suspect isn't the guilty party due to some arcane theorem in a textbook somewhere.
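Here is what that Bayes' theorem step looks like as a sketch, under loudly stated assumptions: before the DNA result, every person in some pool of possible sources is equally likely to be the real one, the real source always matches, and anyone else matches with the random match probability. The pool sizes below are hypothetical, not figures from any case.

```python
def posterior_is_source(pool_size, p_random_match):
    """Bayes' theorem for 'this matching person is the real source'.

    Assumptions (illustrative only): uniform prior over pool_size people,
    the real source matches with certainty, and everyone else matches
    independently with probability p_random_match.
    """
    prior = 1.0 / pool_size
    likelihood_if_source = 1.0
    likelihood_if_not_source = p_random_match
    evidence = prior * likelihood_if_source + (1.0 - prior) * likelihood_if_not_source
    return prior * likelihood_if_source / evidence


P_RANDOM_MATCH = 1 / 1_100_000

for pool in (10, 10_000, 1_000_000):  # hypothetical pools of possible sources
    print(f"pool of {pool:>9,}: posterior {posterior_is_source(pool, P_RANDOM_MATCH):.4f}")
```

With a handful of plausible suspects the match is close to decisive; with a pool of a million people and no other evidence, the same match leaves it near a coin flip. That gap is the point about conditional probability.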
And there are plenty of geeks to point out these problems with current methods and current databases. Guess what, police, prosecutors, etc. don't like them. Why? I don't really know. Turf protection, don't like some pencil necked geek telling them they are wrong. For whatever reason, when someone points out the problem, they hunker down and insist they got the right guy. Maybe they have, maybe they haven't.
You do realize you could be throwing out the actual guilty party while keeping nothing but the false positives? At this point you've just made the case for the position you are trying to refute.
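A rough sketch of that risk, assuming the real source is somewhere in the database, records are discarded uniformly at random, and innocent records match independently. The 1/4 keep fraction echoes the "throw out 3/4" example upthread; the other numbers are borrowed from the DNA case purely for illustration.

```python
def random_discard_risk(n_records, p_false_match, keep_fraction):
    """Risk from keeping only a random fraction of a database.

    Returns (chance the real source is discarded,
             chance the real source is discarded AND a false match survives).
    Assumes the real source is one of the records and matches are independent.
    """
    p_source_discarded = 1.0 - keep_fraction
    kept_records = int(keep_fraction * n_records)  # all innocent if the source is gone
    p_false_hit_survives = 1.0 - (1.0 - p_false_match) ** kept_records
    return p_source_discarded, p_source_discarded * p_false_hit_survives


p_lost, p_only_false_hits = random_discard_risk(338_000, 1 / 1_100_000, 0.25)
print(f"chance the real source is thrown out: {p_lost:.2f}")
print(f"chance of only false hits remaining:  {p_only_false_hits:.3f}")
```

Random discarding cuts the expected number of false hits by the same factor it cuts the chance of keeping the real source, so the hits that do come back are no more trustworthy than before.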