Sharing our Public Content Policy and a New Subreddit for Researchers

https://www.reddit.com/r/reddit/comments/1co0xnu/sharing_our_public_content_policy_and_a_new/

created by traceroo on 09/05/2024 at 16:08 UTC*

9 upvotes, 39 top-level comments (showing 25)

1: https://support.reddithelp.com/hc/articles/26410290525844

Hi, redditors - I’m u/Traceroo, Reddit’s Chief Legal Officer, and today I’m sharing more about how we protect content on Reddit.

Reddit is an inherently public platform, and we want to keep it that way. Although we’ve shared our POV before[2], we’re publishing this policy to give you all (whether you are a redditor, moderator, researcher, or developer) a better sense of how we think about access to public content and the protections that should exist for users against misuse of public content.

2: https://support.reddithelp.com/hc/en-us/articles/24722157271188-Data-Licensing-Privacy

This is distinct from our Privacy Policy[3], which covers how we handle the minimal private/personal information users provide to us (such as email). It’s not our Content Policy[4], which sets out our rules for what content and behavior is allowed on the platform.

3: https://www.reddit.com/policies/privacy-policy

4: https://www.redditinc.com/policies/content-policy

Public content includes all of the content – like posts and comments, usernames and profiles, public karma scores, etc. (for a longer list, you can check out our public API) – that Reddit distributes and makes publicly available to redditors, visitors who use the service, and developers. To be extra clear, it doesn’t include stuff we don’t make public, such as private messages or mod mail, or non-public account information, such as email address, browsing history, IP address, etc. (this is stuff we don’t and would never license or distribute, because we believe Privacy is a Right[5]).

5: https://www.reddit.com/r/reddit/comments/suvhyq/reddit_community_values/

Unfortunately, we see more and more commercial entities using unauthorized access or misusing authorized access to collect public data in bulk, including Reddit public content. Worse, these entities perceive they have no limitation on their usage of that data, and they do so with no regard for user rights or privacy, ignoring reasonable legal, safety, and user removal requests. While we will continue our efforts to block known bad actors, we can’t continue to assume good intentions. We need to do more to restrict access to Reddit public content at scale to trusted actors who have agreed to abide by our policies. But we also need to continue to ensure that users, mods, researchers, and other good-faith, non-commercial actors have access.

Our policy outlines the information partners can access via any public-content licensing agreements. It also outlines the commitments we make to users about usage of this content, explaining how:

Anyone accessing Reddit content must abide by our policies, and we are selective about who we work with and trust with large-scale access to Reddit content. We will block access to those that don’t agree to our policies, and we will continue to enhance our capabilities to hunt down and catch bad actors. We don’t want to but, if necessary, we’ll also take legal action.

Nothing changes for redditors. You can continue using Reddit logged in, logged out, on mobile, etc.

Users get protections against misuse of public content. Also, commercial agreements allow us to invest more in making Reddit better as a platform and product.

In addition to those we have agreements with, Reddit Data API access remains free for non-commercial researchers and academics under our published usage threshold. It also remains accessible for organizations like the Internet Archive.

It’s important to us that we continue to preserve public access to Reddit content[6] for researchers and those who believe in responsible non-commercial use of public data. We believe in and recognize the value that public Reddit content provides to researchers and academics. Academics contribute meaningful and important research that helps shape our understanding of how people interact online. To continue studying the impacts of how behavioral patterns evolve online, access to public data is essential.

6: https://support.reddithelp.com/hc/en-us/articles/14945211791892-Developer-Platform-Accessing-Reddit-Data

That’s why we’re building tools and an environment to help researchers access Reddit content. If you're an academic or researcher, and interested in learning more, head over to r/reddit4researchers and check out u/KeyserSosa’s first post.

EDIT: Formatting and fighting markdown.

Comments

Comment by kerovon at 09/05/2024 at 16:21 UTC

97 upvotes, 3 direct replies

So if I am reading this right, reddit will still bundle and sell bulk user data, but there will at least be some privacy restrictions and respect for EU and California privacy laws. What is changing is that random groups that may or may not care about all of the laws will not be allowed to scrape and sell Reddit data.

I am glad that researchers will still be supported though. There actually is valid research that is done, and supporting that is valuable.

Of course, reddit bulk user data will only be valuable for another year or two, and then chatgpt bots will have so thoroughly polluted it that it becomes more or less worthless.

Comment by WalkingEars at 09/05/2024 at 18:33 UTC

57 upvotes, 2 direct replies

Can I opt out of my personal stories and conversations on Reddit being sold to AI chatbot developers?

Comment by James20k at 09/05/2024 at 17:32 UTC

40 upvotes, 2 direct replies

This is nice in theory, but let's say we have an AI being trained on reddit users' data - which we do. Our comments and content are part of that dataset. We've seen that AI models can be used to output their training data in many cases, because they encode a lot of that training data inherently in the model.

So with this in mind:

> We require our partners to uphold the privacy of redditors and their communities. This includes respecting users’ decisions to delete their content and any content we remove for violating our Content Policy.

If I delete content from my reddit account, are you saying that these companies will be forced to delete that content from their training data, and retrain their models?

> Partners are not allowed to use Reddit content to conduct background checks, facial recognition, government surveillance, or help law enforcement do any of the above.

Similarly, if I train an AI model on reddit content, and that AI model is then put into the public for other people to use, someone might ask it "Does the reddit user /u/james20k have any questionable information in their background I should know before I hire them for a job?". That AI model will have been trained on a dataset that contains a significant amount of information on me, and it will have an answer

Does the no-background-checks etc encompass a commitment to prevent partnered large language models trained on reddit being used by downstream third parties for these purposes, or does it only encompass the immediate third parties themselves using it directly for these purposes?

Comment by N1ghtshade3 at 09/05/2024 at 18:08 UTC

12 upvotes, 2 direct replies

Can you explain what's meant when you say partners have to respect user decisions to delete their content? Like, suppose they've bulk downloaded a bunch of info containing my posts. How would they ever know if I deleted my Reddit account later?

Comment by SarahAGilbert at 09/05/2024 at 17:47 UTC

18 upvotes, 1 direct replies

Hi traceroo,

First off, I just want to say how happy I am to see a public data policy, particularly one that forefronts user privacy (unlike some other platforms *cough cough*[1]). I know this is something you all have been thinking about for a while, but given that one of Reddit's key assets right now *is* its data, making those internal policies and values public is even more important now than ever.

1: https://www.tomshardware.com/tech-industry/artificial-intelligence/stack-overflow-bans-users-en-masse-for-rebelling-against-openai-partnership-users-banned-for-deleting-answers-to-prevent-them-being-used-to-train-chatgpt

I have a couple of questions about details:

1. Does Reddit consider moderated data public or private? On the one hand, it's not visible in the communities it's moderated from, but on the other, it's still visible on users' profile pages. For what it's worth, I see pros and cons to classifying it as either/or. Some pros: moderated data is an important data source for understanding, well, lots of questions about content moderation and training AI-assisted moderation tools. Some cons: it might feel more private to users/mods, it might inadvertently put mods at risk (especially in communities with small moderation teams), it might be used to train shitty moderation AIs, or be used to develop bots/tools to subvert moderation.

2. Are there plans for added transparency about who's licensing Reddit data and/or who's violated the policy? Obviously the Google deal is very public, but I can imagine lots of smaller deals that wouldn't make the news.

Comment by Halaku at 09/05/2024 at 16:12 UTC

44 upvotes, 0 direct replies

> as a lawyer, I am not allowed to be brief

Yet as a lawyer, you are allowed to prepare briefs. Ironic, no?

On a more serious note, thanks for keeping us updated on Reddit's efforts to protect our privacy.

Comment by Ghigs at 09/05/2024 at 16:21 UTC

43 upvotes, 2 direct replies

Tools to access deleted posts are crucial to modding. Banning such tools will cripple us.

Comment by InfectedBananas at 10/05/2024 at 13:43 UTC

7 upvotes, 0 direct replies

> Unfortunately, we see more and more commercial entities using unauthorized access or misusing authorized access to collect public data in bulk, including Reddit public content. Worse, these entities perceive they have no limitation on their usage of that data, and they do so with no regard for user rights or privacy, ignoring reasonable legal, safety, and user removal requests.

Comment by [deleted] at 09/05/2024 at 16:28 UTC

17 upvotes, 1 direct replies

[deleted]

Comment by VladWard at 09/05/2024 at 19:21 UTC

10 upvotes, 0 direct replies

> Partners are not allowed to use content to identify individuals or their personal information, including for ad targeting purposes.

When you say personal information here, what exactly qualifies? Are you aligned with GDPR's definition of personal information, CCPA/CPRA's, or is this section referring only to PII?

Are there limits on how partners use *anonymized* personal information that they collect from Reddit? For example, could Google construct a machine learning model that uses my Reddit personal information to conclude that "BIPOC men like plastic robots" without identifying me personally?

If Google then independently identifies me as a BIPOC man using its own data collection and targets ads to me accordingly outside of the Reddit platform, is this a violation of the policy?

> We need to do more to restrict access to Reddit public content at scale to trusted actors who have agreed to abide by our policies. But we also need to continue to ensure that users, mods, researchers, and other good-faith, non-commercial actors have access.

Can you expand a bit on what this might look like?

Comment by Watchful1 at 09/05/2024 at 19:01 UTC

5 upvotes, 1 direct replies

u/traceroo two questions.

1. How can I determine when content is deleted without re-accessing it from the API each time? I'm fairly sure your commercial partners have access to a feed of deleted object IDs to remove from their data set, but that's not available to the rest of us.

2. If content is public on reddit, does that mean we can keep using it even if the author doesn't want us to (outside things like copyright)?

Comment by Jakeable at 10/05/2024 at 03:37 UTC

3 upvotes, 0 direct replies

Is anything happening with the "allow my data to be used for research purposes" preference[1]? It still shows up in preferences (at least on old.reddit), but it doesn't seem to have any effect on this

1: https://old.reddit.com/prefs/#research

Comment by abrownn at 09/05/2024 at 16:22 UTC

15 upvotes, 1 direct replies

Thanks for including us in the process!

Comment by Full_Stall_Indicator at 09/05/2024 at 16:42 UTC*

15 upvotes, 2 direct replies

Thanks for working to protect Redditors and for seeking out user/mod feedback as part of the process!

If you’re reading this and are interested in giving Reddit feedback on various aspects of the platform, consider joining one of Reddit’s collaborative programs. Check out the User Feedback Collective[1] and the Mod Council[2]. 🎉

1: https://support.reddithelp.com/hc/en-us/articles/24066967580180-What-is-the-Reddit-User-Feedback-Collective

2: https://support.reddithelp.com/hc/en-us/articles/15484058898196-Reddit-Mod-Council

Edit: fixed a typo

Comment by Lil_SpazJoekp at 10/05/2024 at 02:50 UTC

3 upvotes, 0 direct replies

Thanks for the invite to the roundtable discussion!

Comment by nerdshark at 11/05/2024 at 01:43 UTC*

3 upvotes, 0 direct replies

I'm curious what implications this[1] might have for this new policy? It looks like the judge is ruling that:

1: https://arstechnica.com/tech-policy/2024/05/elon-musks-x-tried-and-failed-to-make-its-own-copyright-system-judge-says/

The conflict seems to arise from trying to claim both Section 230 safe harbor protections *and* ownership and exclusive control of platform content:

> The judge found that X Corp's argument exposed a tension between the platform's desire to control user data while also enjoying the safe harbor of Section 230 of the Communications Decency Act, which allows X to avoid liability for third-party content. If X owned the data, it could perhaps argue it has exclusive rights to control the data, but then it wouldn't have safe harbor.
> "X Corp. wants it both ways: to keep its safe harbors yet exercise a copyright owner’s right to exclude, wresting fees from those who wish to extract and copy X users’ content," Alsup wrote.
> If X got its way, Alsup warned, "X Corp. would entrench its own private copyright system that rivals, even conflicts with, the actual copyright system enacted by Congress" and "yank into its private domain and hold for sale information open to all, exercising a copyright owner’s right to exclude where it has no such right."
> That "would upend the careful balance Congress struck between what copyright owners own and do not own," Alsup wrote, potentially shrinking the public domain.
> "Applying general principles, this order concludes that the extent to which public data may be freely copied from social media platforms, even under the banner of scraping, should generally be governed by the Copyright Act, not by conflicting, ubiquitous terms," Alsup wrote.

So, how does this affect reddit? It seems to me like the judge is saying that platforms don't get to charge for access to public data without losing access to certain legal protections. Here's the judge's order,[2] for anyone who's interested.

2: https://cdn.arstechnica.net/wp-content/uploads/2024/05/X-Corp-v-Bright-Data-Order-Dismissing-Complaint-5-9-24.pdf

Comment by shiruken at 09/05/2024 at 17:03 UTC

9 upvotes, 0 direct replies

Thanks for involving us in the process! Are there any plans to improve the "make your content non-public" process? Right now it's extremely tedious to bulk delete posts and comments on accounts with extensive histories. Many users have to rely upon (and trust) third-party scripts or websites. Would Reddit ever consider implementing an automatic content deletion setting in the user profile similar to that offered on Mastodon[1]?

1: https://fedi.tips/deleting-posts-automatically-in-mastodon-after-a-certain-time-period/

Comment by quirkycurlygirly at 25/05/2024 at 06:16 UTC

2 upvotes, 0 direct replies

Why won't certain moderators tell me how I violated the rules? They haven't given any explanation and when I ask for one they mute me for a week without answering. I did not point out any race or ethnicity when I said I'd experienced begging in developing countries and that got me banned without warning. I thought moderators were required to give some sort of rationale. This is not a report on a specific subreddit.

Comment by Dahl0_0 at 12/07/2024 at 15:45 UTC

2 upvotes, 1 direct replies

WHAT IS KARMA AND HOW DO I GET IT i’m new and barely on this app but want to ask questions in certain groups BUT THEY WONT LET ME BC I DONT HAVE ENOUGH “KARMA” HELP😭😂

Comment by EnglishMobster at 11/05/2024 at 20:29 UTC*

2 upvotes, 2 direct replies

So if Reddit says they "own" the content produced by their users on this site (by choosing who can and cannot view it), isn't that a violation of Section 230 and Reddit is giving up their safe harbor protections? Because that's what a judge says.

> According to Alsup, X failed to state a claim while arguing that companies like Bright Data should have to pay X to access public data posted by X users.
> "To the extent the claims are based on access to systems, they fail because X Corp. has alleged no more than threadbare recitals," parroting laws and findings in other cases without providing any supporting evidence, Alsup wrote. "To the extent the claims are based on scraping and selling of data, they fail because they are preempted by federal law," specifically standing as an "obstacle to the accomplishment and execution of" the Copyright Act.
> The judge found that X Corp's argument exposed a tension between the platform's desire to control user data while also enjoying the safe harbor of Section 230 of the Communications Decency Act, which allows X to avoid liability for third-party content. If X owned the data, it could perhaps argue it has exclusive rights to control the data, but then it wouldn't have safe harbor.
> "X Corp. wants it both ways: to keep its safe harbors yet exercise a copyright owner’s right to exclude, wresting fees from those who wish to extract and copy X users’ content," Alsup wrote.
> If X got its way, Alsup warned, "X Corp. would entrench its own private copyright system that rivals, even conflicts with, the actual copyright system enacted by Congress" and "yank into its private domain and hold for sale information open to all, exercising a copyright owner’s right to exclude where it has no such right."

I don't see how this policy is legal given the above. Either I own the copyright to my comments as a third-party (at which point Reddit cannot deny access to others, as they do not control my copyright), or by me posting here Reddit takes the copyright of my comment and in turn loses Section 230 privileges.

Comment by skeddles at 09/05/2024 at 23:01 UTC

3 upvotes, 0 direct replies

"*We are, unfortunately, seeing more and more commercial entities collecting public data,*"

You mean like YOU? So you can sell it to Google without notifying anyone?

Comment by keyjan at 17/05/2024 at 19:13 UTC

1 upvotes, 0 direct replies

OpenAI strikes Reddit deal to train its AI on your posts

https://www.theverge.com/2024/5/16/24158529/reddit-openai-chatgpt-api-access-advertising

you're welcome, steve.

Comment by LooseSwing88 at 21/05/2024 at 14:13 UTC*

1 upvotes, 0 direct replies

it's fine though you can hand my ip to the quantum computer to keep it safe and use all my personlity data to assemble legally "entitled" bot clones it's "totally" cool bro "entities" aren't doesn't even

bt if i buy reddit gold

Comment by drainthoughts at 21/05/2024 at 15:13 UTC

1 upvotes, 0 direct replies

Racism seems alive and well in the r/combatfootage sub and the moderators allow it. How do I take the next step?

Comment by Jesyka_ at 23/05/2024 at 20:36 UTC

1 upvotes, 0 direct replies

Hello, while I understand the posts made are public and anyone can view them, are there any efforts being made to prevent YouTube creators from using a member's post to create videos that they subsequently earn revenue on?