[META] Taken together, many recent questions seem consistent with generating human content to train AI?

https://www.reddit.com/r/AskHistorians/comments/1d5ggze/meta_taken_together_many_recent_questions_seems/

created by RockyIV on 01/06/2024 at 06:12 UTC

566 upvotes, 18 top-level comments (showing 18)

Pretty much what the title says.

I understand that with a “no dumb questions” policy, it’s to be expected that there be plenty of simple questions about easily reached topics, and that’s ok.

But it does seem like, on balance, we’re seeing a lot of questions about relatively common and easily researched topics. That in itself isn’t suspicious, but often these include details that make it difficult to understand how someone could come to learn the details but not the answers to the broader question.

What’s more, many of these questions are coming from users that are so well-spoken that it seems hard to believe such a person wouldn’t have even consulted an encyclopedia or Wikipedia before posting here.

I don’t want to single out any individual poster - many of whom are no doubt sincere - so as some hypotheticals:

“Was there any election in which a substantial number of American citizens voted for a communist presidential candidate in the primary or general election?”

“Were there any major battles during World War II in the pacific theater between the US and Japanese navies?”

I know individually nearly all of the questions seem fine; it’s really the combination of all of them - call it the trend line if you wish - that makes me suspicious.

Comments

Comment by AutoModerator at 01/06/2024 at 06:12 UTC

1 upvotes, 0 direct replies

Hello, it appears you have posted a META thread. While there are always new questions or suggestions which can be made, there are many which have been previously addressed. As a rule, we allow META threads[1] to stand even if they are repeats, but we would nevertheless encourage you to check out the META Section[2] of our FAQ, as it is possible that your query is addressed there. Frequent META questions include:

1: https://www.reddit.com/r/AskHistorians/comments/4re4rw/rules_roundtable_no_14_make_your_own_meta_post/

2: https://www.reddit.com/r/AskHistorians/wiki/faq/meta

You may also be interested in the AskHistorians Browser Extension[6] for a more accurate comment count, or subscribing to the weekly roundup[7]. Twitter[8], Facebook[9], and the Sunday Digest[10] also highlight content already written. This isn't intended to be the last and final word, and we encourage you to bring up any further questions you might have which are not addressed there as well, but we hope that this will at least provide you some additional information until a moderator is able to show up and respond further!

3: https://www.reddit.com/r/AskHistorians/wiki/faq/meta#wiki_why_is_everything_deleted.21.3F

4: https://www.reddit.com/r/AskHistorians/comments/g3ph8c/rules_roundtable_xi_answered_answered_flair_and/

5: https://www.reddit.com/r/AskHistorians/comments/h8aefx/rules_roundtable_xviii_removed_curation_and_why/

6: https://www.reddit.com/r/AskHistorians/comments/d6dzi7/tired_of_clicking_to_find_only_removed_comments/

7: https://www.reddit.com/r/AskHistorians/comments/di8t2i/tired_of_clicking_over_to_a_thread_too_early_so/

8: https://twitter.com/askhistorians

9: https://www.facebook.com/askhistorians/

10: https://www.reddit.com/r/AskHistorians/search?q=title%3A%22Sunday+Digest%22&restrict_sr=on&sort=new&t=all

11: /message/compose/?to=/r/AskHistorians

Comment by [deleted] at 01/06/2024 at 08:50 UTC

589 upvotes, 4 direct replies

[deleted]

Comment by crrpit at 01/06/2024 at 10:10 UTC*

195 upvotes, 2 direct replies

While we do have a zero tolerance policy towards use of AI to answer questions, we don't have such a strict policy against using it to generate questions (with an important caveat below). While it's not exactly something we love, we can see the use case in terms of formulating clearer questions for people with limited subject matter background, non-native speakers, etc. There's at least one user we know of who actually built a simple question-generating bot with the worthy goal of diversifying the geographical spread of questions that get asked. Ultimately, if it's a sensible question that can allow someone to share knowledge not just with OP but with a large number of other readers, then the harm is broadly not great enough to try and police.

Where we are more concerned is the use of bot accounts to spam or farm karma. It's broadly more common to see such bots repost popular questions or comments, but using AI to generate "new" content is obviously an emerging option in this space. Here, the AI-ness of a question text is one thing we can note in a broader pattern of posting behaviour. We do regularly spot and ban this kind of account.

Comment by IAmDotorg at 01/06/2024 at 11:52 UTC

48 upvotes, 1 direct replies

I don't think they seem especially different. Going back as long as it has existed, it seems like 90% of the questions in here are from students trying to do their homework.

Generally speaking, it would be uncommon to do directed training of an LLM that way, and if you're going to that level of effort (and there are companies doing it), you're going to be far more directed about the training data. As solid as this sub is, it wouldn't be a useful training set for knowledge-based LLM training.

Comment by jazzjazzmine at 01/06/2024 at 08:42 UTC

125 upvotes, 2 direct replies

The answer to

“Was there any election in which a substantial number of American citizens voted for a communist presidential candidate in the primary or general election?”

Is not just yes or no, though. Asking it here (ideally) means you also get a lot of background info that is much harder to find on your own, if it is findable for a layman at all.

That worry seems a bit far-fetched, to be honest. A single book contains much more good text than the answers here amount to in a full week, I'd guess.

Comment by [deleted] at 01/06/2024 at 13:22 UTC

26 upvotes, 3 direct replies

[deleted]

Comment by LexanderX at 01/06/2024 at 08:30 UTC

39 upvotes, 1 direct replies

What’s more, many of these questions are coming from users that are so well-spoken that it seems hard to believe such a person wouldn’t have even consulted an encyclopedia or Wikipedia before posting here.

Perhaps these questions are sincere and human derived, but polished by the emery of AI tools such as Grammarly, Copilot, and Writefull, many of which can be installed as browser extensions.

Comment by Xaeryne at 01/06/2024 at 19:39 UTC

6 upvotes, 0 direct replies

Doesn't this same thing happen every year around this time, because people have term papers due and think they can get away with being lazy?

Comment by symmetry81 at 01/06/2024 at 12:36 UTC

21 upvotes, 1 direct replies

Modern high-end AIs are trained on hundreds of TB of data. I just looked at a recent, well-answered post[1] and found that it contained 25 KB of text. The two scales are so drastically at odds that I can't see it being worth the effort.

1: https://www.reddit.com/r/AskHistorians/comments/1cuvs50/why_did_rome_import_so_much_grain_from_egypt/
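To put rough numbers on that comparison, here is a minimal back-of-the-envelope sketch. The 25 KB figure comes from the comment above; the 100 TB corpus size is only an assumed conservative stand-in for "hundreds of TB", and decimal units (1 KB = 1,000 bytes) are assumed as well.

```python
# Back-of-the-envelope scale comparison.
# Assumptions (not from the thread): decimal units, and 100 TB as a
# conservative stand-in for "hundreds of TB" of training data.
KB = 10**3
TB = 10**12

post_bytes = 25 * KB       # size of the one well-answered post cited above
corpus_bytes = 100 * TB    # assumed lower bound on training corpus size

# How many posts of this size would be needed to fill the assumed corpus
posts_per_corpus = corpus_bytes // post_bytes

# Fraction of the assumed corpus that a single post represents
share = post_bytes / corpus_bytes

print(f"posts of this size needed to fill the corpus: {posts_per_corpus:,}")
print(f"share of the corpus from one post: {share:.1e}")
```

Under those assumptions, it would take about four billion answers of that length to match the corpus, so a single thread contributes on the order of a few parts in ten billion.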

Comment by PublicFurryAccount at 01/06/2024 at 17:10 UTC

4 upvotes, 0 direct replies

The examples seem consistent with how people “formalize” their natural questions to be more like the questions asked on exams. Given that a large chunk of Reddit has been doing exams, a shift toward that style wouldn’t surprise me.

Comment by Neutronenster at 01/06/2024 at 21:50 UTC

4 upvotes, 0 direct replies

Honestly speaking, I don’t really see what use these kinds of posts would have for training AI (when compared to already existing information).

What’s important to realize here is that ChatGPT is essentially a language model and not a knowledge database. So if you ask it a medical question, it will be able to use this language model to come up with an answer that may seem great and plausible at first glance, but this answer is likely to contain factual mistakes. That’s because it basically predicts the most likely words and sentences in such an answer, rather than look up facts. No amount of extra training will increase the factual accuracy, since ChatGPT remains a language algorithm.
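As a toy illustration of "predicting the most likely next word", the sketch below counts which word follows which in a tiny made-up corpus and returns the most frequent follower. This is only a bigram counter built for illustration; real systems like ChatGPT use large neural networks over tokens, and the corpus and the `predict_next` helper are hypothetical.

```python
from collections import Counter, defaultdict

# Tiny made-up corpus, purely for illustration (not real training data).
corpus = (
    "the war began in 1939 . the war ended in 1945 . "
    "the war began in europe ."
).split()

# Count how often each word follows each other word.
following = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    following[prev][nxt] += 1

def predict_next(word):
    """Return the most frequent follower of `word`, or None if unseen."""
    counts = following.get(word)
    return counts.most_common(1)[0][0] if counts else None

# The prediction reflects word frequencies in the corpus, not a fact lookup.
print(predict_next("war"))
```

Here "war" is followed by "began" twice and "ended" once, so the counter picks "began" simply because it is more frequent, with no notion of whether the resulting sentence is true.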

Of course AI companies are currently researching ways to combine a language-based AI with some kind of “fact-checking AI”. However, this is really high level research that requires access to huge datasets. Because of that, it is limited to a few large companies like Google. These companies have their own ways for legitimately obtaining their data, so they won’t resort to tactics like churning out bot questions here. Small companies also don’t need the extra data from this subreddit, because their use of AI is much more limited.

In conclusion, I think that “actual people creating these low quality Reddit posts” is the most plausible explanation.

Comment by -p-e-w- at 01/06/2024 at 16:33 UTC

9 upvotes, 0 direct replies

You're making a rather bold claim by implication (that there is a – presumably coordinated – effort to farm the experts in this sub for training data).

Yet you haven't presented even a shred of actual evidence to support that claim.

You don't even link to any actual questions that you believe fall into that category (not that pointing out a few such questions would be "evidence" of anything).

You don't explain what you believe the problem is, if any. In fact, you admit that "individually nearly all of the questions seem fine". You don't propose any action to be taken.

What exactly are you trying to achieve here?

Comment by Master-Dex at 02/06/2024 at 05:21 UTC

2 upvotes, 0 direct replies

Unfortunately, being hostile to data useful for training and being useful to the community for being able to answer arbitrary questions seem somewhat at odds. I'd say the forum should just double down on quality rules and ignore odd behavior.

Comment by deltree711 at 01/06/2024 at 15:09 UTC

2 upvotes, 0 direct replies

How do you know it's not just confirmation bias?

Comment by erobin37 at 01/06/2024 at 16:12 UTC

2 upvotes, 0 direct replies

I'm pretty certain that this post, for example, is generated by AI:

https://www.reddit.com/r/AskHistorians/comments/1d5puvu/how_did_the_meiji_restoration_change_japans/

Comment by LordBecmiThaco at 01/06/2024 at 13:11 UTC

0 upvotes, 3 direct replies

What's the worst case scenario, that the AI is fed well researched information? Is that so horrible?

Comment by Acadia_Clean at 02/06/2024 at 04:35 UTC

1 upvotes, 0 direct replies

I would like to believe I am well spoken. You are saying it's suspect that well spoken individuals aren't researching some of these questions that seem to have easily searchable answers. The way I see it, AskHistorians is full of experts who have much of the information related to their area of expertise readily available, whether from memory or their own research. Logically, it makes no sense to do hours of research looking to answer a question to which someone else may already know the answer. For example, I'm an electrician. I have a wealth of training and experience that allows me to complete my job in a timely and workmanlike manner that a novice would have difficulty achieving. If someone walked up to me and asked me an electrical question, I would answer it to the best of my ability. I would not tell them that their question was easily researchable and then accuse them of training an AI. The short of it: people are busy. If I have a historical question, even if it seems relatively simple, I would rather just ask a historian and get the answer. Many times some of the seemingly simple questions on here have had deeply complex answers that I don't believe I would have found if I had researched them myself.

Comment by TheyTukMyJub at 01/06/2024 at 12:25 UTC

-7 upvotes, 0 direct replies

Equating the quality of a Wikipedia or encyclopedia article to an academically sourced answer here is kinda silly. Many Wikipedia articles are absolutely atrocious when it comes to providing context, or are based on outdated scholarship and lack the newer or differing insights that the historians here offer us readers.