Data Scraping Is Not a Crime
South Carolina's NAACP and ACLU are challenging the state's ban on automated data collection.

South Carolina has the highest eviction rate in the country, and the state chapter of the NAACP wanted to find out why. Given the difficulty of tracking down every case by hand, the organization hoped to use a software program called a "scraper" to collect data from South Carolina's online repository of legal filings.
Researchers, academics, and investigative journalists frequently use scrapers to automate this kind of laborious, large-scale project. But the South Carolina Court Administration categorically bans such automated data collection.
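For readers who have never seen one, a scraper can be quite simple. Here is a minimal sketch in Python; the URL and the page structure it assumes are hypothetical, not the actual layout of South Carolina's repository:

```python
# Minimal scraper sketch. The URL and page structure are hypothetical;
# requires the third-party "requests" and "beautifulsoup4" packages.
import requests
from bs4 import BeautifulSoup

INDEX_URL = "https://courts.example.gov/filings"  # hypothetical endpoint

def fetch_case_links(page_number):
    """Download one index page and return the filing links it lists."""
    response = requests.get(INDEX_URL, params={"page": page_number}, timeout=30)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    # Assumes each filing appears as an <a class="case-link"> element.
    return [link["href"] for link in soup.select("a.case-link")]

if __name__ == "__main__":
    for page in range(1, 4):  # first few index pages, as a demonstration
        for href in fetch_case_links(page):
            print(href)
```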
Now the American Civil Liberties Union (ACLU) of South Carolina and the South Carolina NAACP are challenging the state's scraping ban in federal court. In a lawsuit they filed in the U.S. District Court for the District of South Carolina in March, the groups argue that the policy unreasonably restricts their First Amendment rights. "This case is about ensuring core First Amendment principles, like the right to access public court filings, are applied in a way that meets our rapidly expanding digital reality," Allen Chaney, the ACLU of South Carolina's legal director, said in a press release.
The NAACP says collecting eviction filings would allow it to research the issue and contact affected tenants to ensure they have meaningful access to the courts. But scraping has numerous other legitimate uses.
In 2018, for example, I wanted to find out how often Texas police used a loophole in the state's public record law to hide information on deaths in custody. So I wrote code to scrape more than 300,000 pages of public-record rulings that the Texas Attorney General's Office had posted on its website. Then I filtered the results for those that cited the specific provision I was investigating.
That would have been impossible without a bot to do the heavy lifting. By scraping data, I identified more than 80 cases in which Texas police withheld information about deaths in custody from families, lawyers, and journalists.
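My actual code isn't reproduced here, but the filtering step might look something like this sketch; the directory layout and the citation string are placeholders:

```python
# Sketch of the filtering step: scan saved ruling pages for a specific
# statutory citation. The directory layout and citation string are
# placeholders, not the code actually used for the Texas project.
from pathlib import Path

CITATION = "552.108"  # placeholder provision number

def find_citing_rulings(ruling_dir):
    """Return saved ruling files that mention the target citation."""
    matches = []
    for path in Path(ruling_dir).glob("*.html"):
        if CITATION in path.read_text(errors="ignore"):
            matches.append(path)
    return matches

if __name__ == "__main__":
    hits = find_citing_rulings("rulings")
    print(f"{len(hits)} rulings cite the provision")
```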
The South Carolina lawsuit is the latest challenge to state anti-hacking laws and the federal Computer Fraud and Abuse Act (CFAA). The U.S. Court of Appeals for the 9th Circuit issued a landmark ruling in April holding that scraping publicly available data from websites does not constitute "unauthorized access" under the CFAA. While it's true that scrapers can bog down websites, ethical coders add courtesy delays to their programs to avoid that problem and include identifying information in their HTTP requests so that website administrators can contact them.
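Those courtesies amount to a few lines of code. A sketch, with hypothetical contact details:

```python
# Sketch of the courtesies described above: a pause between requests and
# an identifying User-Agent so administrators know who is scraping and
# how to reach them. The contact details are hypothetical.
import time
import requests

HEADERS = {
    "User-Agent": "eviction-research-scraper/1.0 (contact: researcher@example.org)"
}
COURTESY_DELAY_SECONDS = 2.0  # pause between requests to avoid load spikes

def polite_get(url):
    """Fetch a URL, then wait so the server isn't hammered."""
    response = requests.get(url, headers=HEADERS, timeout=30)
    time.sleep(COURTESY_DELAY_SECONDS)
    return response
```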
Banning scrapers is not about preventing unauthorized hacking. It just makes it harder for the public to know what the government is doing.
It’s ok when we do it.
"scraping publicly available data" is not hacking. But that does not mean all the data is free to use. Copyright does exist for most information online.
What do you mean by "free to use"? That seems to defy logic. Why publicly post data online that cannot be used by the public?
Clicks, dude, clicks.
Don't you know how the internet works?
Right. I briefly forgot the internet was for clicks, porn and government surveillance.
And non-fungibles. Get with the times!
Oh, I may have missed that while I was looking at wedding photos that show too much.
As a general principle, that's true. In this specific instance, that's utterly irrelevant. Court documents are public records by definition. They are not and cannot ever be copyrighted.
Fair use is a thing. Copyright doesn't mean complete control over information by the holder. In general you can cite copyrighted material in research.
We are talking about government data here. US copyright law specifically prohibits government copyrights in government work product.
That includes software, right?
Well, well.
Common sense laws relating to the first amendment are different from laws relating to the second amendment.
What if you had to get a permit from the sheriff to do data scraping? We can call it 'high capacity reading'. Just pay for a background check and fingerprinting, and then file an application with the appropriate fees. After the requisite delays, you might get the permit. If not, no appeal. Of course, you can't use certain kinds of programs to do the scraping, no matter how efficient, only the ones approved by the state.
Doesn't that sound better than allowing just anyone to execute programs that run amok through the public data?
I'm really not sure how data scraping falls under the 1st Amendment anyway. Is this one of those "penumbra" thingies?
I don't either, but that is the position of the ACLU and their co-conspirators, so that is how I snarked.
Free press? If you can't collect information, you can't do journalism. Whether the information is collected by reading a piece of paper or making a piece of software that reads lots of documents shouldn't matter.
What caliber is the bot, and how many data points does it scrape per minute?
Common sense scraping control. Nobody needs that much data.
"In 2018, for example, I wanted to find out how often Texas police used a loophole in the state's public record law to hide information on deaths in custody. So I wrote code to scrape more than 300,000 pages of public-record rulings that the Texas Attorney General's Office had posted on its website. Then I filtered the results for those that cited the specific provision I was investigating."
Is this data (and your code) posted somewhere, like GitHub? That assertion (you wrote code and definitively showed 80 instances) of yours is not verifiable, CJ Ciaramella. Can you prove the data exists?
Is this Reason journalism....making completely unsubstantiated statements like that?
How is that any different from doing research by less automated means? Journalists don't, in general, post complete source materials and references.
So the group that wrote a defamatory article for AH is now "helping" the NAACP with this. One question, what does this have to do with 1A? It's not assembly, it's not religion, it's not restricting their speech and they can still petition the government for redress, unlike Republicans and those are restrictions Reason enthusiastically endorses.
It's connected to the First Amendment in the same way that the right to record on-duty police is. Court documents are public records. The government has no authority to impede their collection or use.*
* The government can impede some use through redaction orders or sealing orders but a) those are already well-established with compensating controls and b) they apply at the point of document creation and are not uniquely applied based on whether the document is in hardcopy or electronic form.
So scraping bots have a negative impact on the systems they search, which gives a plausibly legitimate rationale for limiting their use, since it sounds like they may be the sort of thing that could be used in a denial-of-service attack. On the other hand, the government may be using that practical concern as a smokescreen to limit the ability to sift through compromising data efficiently.
The author noted that code can be written to minimize interference with the website's stated purpose. It seems that government could simply set standards for the code used and for the times when it could be used, eliminating data scraping during periods of high website traffic.
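There is already a rough convention for this sort of standard: a site's robots.txt file declares which automated agents may fetch which pages, and well-behaved scrapers check it before running. A sketch using Python's standard library, with a hypothetical URL:

```python
# Sketch: checking a site's robots.txt before scraping, using only the
# standard library. The URL is hypothetical.
from urllib import robotparser

parser = robotparser.RobotFileParser()
parser.set_url("https://courts.example.gov/robots.txt")
parser.read()

AGENT = "eviction-research-scraper"
if parser.can_fetch(AGENT, "https://courts.example.gov/filings"):
    # Some sites also publish a Crawl-delay directive; honor it if present.
    delay = parser.crawl_delay(AGENT)
    print(f"Scraping permitted; requested delay: {delay or 'none'}")
else:
    print("Scraping disallowed; contact the site administrator instead")
```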
Scraping is the wrong way to access the data. The data should simply be available for direct download (as an archive) and/or available upon request on a hard drive.
I was listening to the radio yesterday about a state representative here in Maine who was trying to pass a law prohibiting the police from doing the same thing. Apparently they've got bots that scour Facebook and such, and even grab your credit scores. He thought that was creepy, but the DA and State Police shut him down.
Exactly. And I believe Reason has printed articles in the past whinging about law enforcement gathering data from the public Facebook profiles of people who were too stupid not to broadcast their felonies. It seems data 'scraping' is one of those concepts you are either for or against depending on who is doing the scraping and who or what is being scraped.
Well, a law limiting what police can do is in the proper purview of government. Limiting what people can do with public records, not so much.
Close but not quite. Limiting what police can do implies that they have unlimited power that is only restrained by laws saying what they cannot do. That's backwards. They have limited powers which means they cannot do anything unless authorized. I just wish the courts felt the same way.
The difference is who the scraping is done TO. It's perfectly legitimate for private individuals to examine government records for whatever reason they want with whatever tools they want. The government examining private records is a different matter.
I'm going to have to read about this more carefully, because I'm getting a whiff of Fusion GPS. Like the ACLU supported this before they were against it.
" While it's true that scrapers can bog down websites, ethical coders add courtesy delays to their programs that avoid that problem and include identifying information in their HTTP requests to government website administrators."
Have you ever noticed that most coders are not ethical coders? Ever been online? (yes I know, rhetorical)
So, in other words, there is no "scraping ban" and scraping per se is not a crime in South Carolina.
It sounds like this particular site, the "SC repository of legal filings" has a ban on scraping in its terms of service. And such a ban makes sense because scraping can place a big load on interactive services. So, talk of a "scraping ban" is a red herring.
The real issue here is that legal filings should be publicly available without scraping, via direct downloads.
Yeah, that seems like the real answer here. If the system can't handle the inquiries people want to make of it, then they should make a better system that works with the way people are actually using the records.
"A better system" consists of copying the data onto a 20 TB drive and sending it to the ACLU, at the ACLU's expense (a few hundred dollars).
Why is the ACLU interested in evictions?