
LLMs for research

Aug 29, 2023

LLMs are for research

Pretty much everyone has dabbled with LLMs by now, and most have found them nigh-unusable. My own experience is like working with a very, very dumb and lazy research assistant whose only saving grace is that they can justify any of their half-assed answers. It's a profoundly frustrating interaction.

But if you need an idiot, it might be the right thing.

In research, we tend to outsource tasks to strangers. Typically, this is either annotation (we get people to read some text and extract some information) or participation in studies (the tasks are part of an experimental protocol meant to illuminate something). These strangers can be pretty much anyone: we don't expect any special skill or knowledge, except basic linguistic abilities. Want to know if people are talking about covid-19? Whether the texts in your corpus are about one conspiracy or another? Some LLMs can do just fine.

Some experiments

To be clear, I don't want to talk about ChatGPT itself. ChatGPT is completely closed-source; I'm not very comfortable sending all my datasets to OpenAI, and paying a fee on top of that. I'm talking about models for which we have the weights, like Vicuna, Llama 2, or, in my case, WizardLM. (Why WizardLM? Basically, it ranks high on AlpacaEval, which rates instruction-following LLMs, and my limited experience shows it does better than Vicuna and Llama 2 for my tasks.)

WizardLM

But here are some tasks where it has helped me so far:

Your mileage will vary. The more common the concepts in your query, the better it does. So, a query on Canadian official languages will not do as well as, say, a query about islamophobia (in fact, the former is unusably bad, while the latter is near-perfect).

Furthermore, well, LLMs are really slow. When I use one for work, I have to run it on CPU (a good one, but still), so it takes about 30s per entry for a 13B model, more if the answer is fairly long; when I run it on my GPU (RTX 4070), it's about 10s per entry. That might not sound so bad, but humans typically take 3-10s for similar tasks.

But the issue is also that LLMs are very sensitive to how you phrase your prompt. Reframing your question can turn a dud into outstanding performance. This is, of course, super frustrating, because you have to check a bunch of times before either finding the best way to do your task or confirming that it won't work. And it's especially frustrating when it takes 30-45s every time to get an output.

What I've learned so far

The ways of making a good prompt are largely mysterious; I could not really find any good compendium of wisdom on the subject. There are a few things under the title "Prompt engineering", but it's mostly about making queries clear, something I'd already learned in philosophy and while refining tasks for humans. And at this point, I can tell that prompting humans and prompting an LLM are two very different things. Here are a few things I've learned.

1. Always check against human-made annotations

This should go without saying, after what I just shared. My daughter is almost two, and while she's good with language for her age, I always check how she understood what I told her, because it can be wildly different from what I meant. Well, I trust LLMs a lot less than my toddler.

Use standard information retrieval measures for evaluation: precision, recall, F1 and the Matthews correlation coefficient, and perhaps Cohen's kappa if you're lucky enough to have more than one human annotator. You don't necessarily need more than 50 examples (although sometimes you do), but you should never assume that, just because the output looks alright, the model is doing the task the way you want it done.
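
As a minimal sketch of the kind of check I mean (assuming the LLM's answers have already been mapped to the same label set as the human annotations; scikit-learn is my choice here, nothing special about the task):

```python
# Compare LLM annotations against human-made gold labels.
# Both are lists of 0/1 decisions for the same ~50+ examples (toy data below).
from sklearn.metrics import (precision_score, recall_score, f1_score,
                             matthews_corrcoef, cohen_kappa_score)

gold = [1, 0, 1, 1, 0, 0, 1, 0]   # human annotations
llm  = [1, 0, 1, 0, 0, 1, 1, 0]   # LLM annotations for the same items

print("precision:", precision_score(gold, llm))
print("recall:   ", recall_score(gold, llm))
print("F1:       ", f1_score(gold, llm))
print("MCC:      ", matthews_corrcoef(gold, llm))

# Cohen's kappa measures agreement between two annotators; here it is
# computed between the human and the LLM as a rough sanity check.
print("kappa:    ", cohen_kappa_score(gold, llm))
```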

2. Use the model's formatting

It's not something people say explicitly, but LLMs are trained on dialogues, and if you don't format your query the way the training data was formatted, the LLM won't know which part it is playing. In WizardLM, it looks like this:

<s>USER: In the following sentence, what is the main action?
Text: """Sing, o goddess, the wrath of Achilles son of Peleus, wrath that brought countless woes on Achaeans and sent forth to Hades many valiant souls of heroes and made them spoils for dogs and birds of every type."""</s>
ASSISTANT:

In Llama-cpp, you have to give a list of stop tokens; a good one here would be "</s>".
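
For instance, with llama-cpp-python (a sketch; the model path and generation parameters are placeholders of mine, not a recommended setup):

```python
from llama_cpp import Llama

# Load a local WizardLM checkpoint (the path is hypothetical).
llm = Llama(model_path="./wizardlm-13b.ggmlv3.q4_0.bin", n_ctx=2048)

# Same dialogue format as above, abbreviated here.
prompt = (
    '<s>USER: In the following sentence, what is the main action?\n'
    'Text: """Sing, o goddess, the wrath of Achilles son of Peleus."""</s>\n'
    'ASSISTANT:'
)

# The stop list keeps the model from continuing the dialogue past its turn.
out = llm(prompt, max_tokens=128, stop=["</s>"])
print(out["choices"][0]["text"].strip())
```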

3. Try several models

I had trouble getting consistent formatting from Vicuna, even when I asked for it. So I tried a few models on LMSys (see below), and it turned out WizardLM was pretty good at following instructions.

LMSys' "Chat with Open Large Language Models"

LMSys has less variety nowadays, but there are demos for a lot of LLMs if you look for them.

4. When using forced choice, prefer ratings

LLMs are pleasers; they don't like to say "no". To counter that, instead of asking yes-no questions, I ask for ratings on a 1 to 10 scale. This way, you can choose the cutoff that gives you the best performance.
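
To pick the cutoff, something like this small sketch works (assuming you already have gold yes/no labels and the model's 1-10 ratings; using MCC as the selection criterion is my own choice here):

```python
from sklearn.metrics import matthews_corrcoef

gold    = [1, 1, 0, 0, 1, 0, 1, 0]     # human yes/no labels (toy data)
ratings = [9, 7, 4, 6, 8, 2, 10, 5]    # LLM ratings on a 1-10 scale

# Try every possible cutoff and keep the one with the best MCC.
best = max(
    ((cut, matthews_corrcoef(gold, [int(r >= cut) for r in ratings]))
     for cut in range(1, 11)),
    key=lambda pair: pair[1],
)
print("best cutoff:", best[0], "MCC:", round(best[1], 3))
```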

5. Give the LLM time to think

Asking for 1-word or 1-token answers is convenient, and it also saves a lot of time, but it might not give you the best results. In one task, I had the LLM tell me whether a word in the sentence meant "to know" (because that's what I'm studying). The performance was rather poor... but if I ask the LLM to look at every word, say whether it means "to know", and then give me a final answer for the sentence, it does a lot better. Apparently, this is a common mistake in evaluation methods.

The video in which I learned that

To be clear, it probably won't make much of a difference for a fairly intuitive, one-step kind of task, but if the task can be decomposed, as in my example, then it's probably worthwhile to do so.
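
To illustrate, here is a rough reconstruction of what the decomposed prompt looks like (not my exact wording, and the example sentence is a stand-in):

```
<s>USER: For each word in the following sentence, say whether it means "to know".
Then answer yes or no: does the sentence contain a word that means "to know"?
Text: """I realized too late that she knew the answer all along."""</s>
ASSISTANT:
```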

Other things I want to try

I've seen that some people get better results by setting a high temperature (e.g. 1.2-1.5), then asking for several answers and taking the average or most consensual one. It does seem slow to me, so I haven't tried it seriously, but I'm still curious about it. I wonder, for example, whether it works as well when you already have a very good prompt and you give your LLM a lot of time to think.
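
If I were to try it, it would look roughly like this (a sketch only; the temperature, the number of samples, and the majority vote are my assumptions, reusing the llm object from the earlier snippet):

```python
from collections import Counter

def sample_answers(prompt, n=5, temperature=1.3):
    # Draw several short answers at a high temperature.
    answers = []
    for _ in range(n):
        out = llm(prompt, max_tokens=8, temperature=temperature, stop=["</s>"])
        answers.append(out["choices"][0]["text"].strip().lower())
    return answers

def consensus(prompt):
    # Keep the most common answer across the n samples.
    votes = Counter(sample_answers(prompt))
    return votes.most_common(1)[0][0]
```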

When a model and prompt setup works well, I assume I can fairly easily generate enough examples to train a classifier that finishes the annotation a lot faster. The obvious candidate here would be fine-tuning a BERT-like model: still computationally expensive, but nothing compared to the LLMs above.
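
Something like this is what I have in mind (a sketch with Hugging Face transformers and datasets; the model name, hyperparameters, and toy data are placeholders, not a tested recipe):

```python
from datasets import Dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

# Texts annotated by the LLM once a good prompt is found (toy placeholders).
data = Dataset.from_dict({
    "text": ["first annotated sentence", "second annotated sentence"],
    "label": [1, 0],
})

tok = AutoTokenizer.from_pretrained("distilbert-base-uncased")
data = data.map(lambda x: tok(x["text"], truncation=True, padding="max_length",
                              max_length=128), batched=True)

model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", num_train_epochs=3,
                           per_device_train_batch_size=8),
    train_dataset=data,
)
trainer.train()
```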

I also want to try using LLMs as a proxy for human understanding of language. In the examples I gave, the LLM is mostly used as an annotator, but in this case, I would consider it more like a test subject. I no longer have any affiliation, and while I can probably write to my former advisor to test some things together, experimental work is long and time-consuming. I'd rather leave it to the professionals for the moment.

🏷 llm, AI

To share a thought or add a comment: locha at rawtext dot club.

--EOF--