Regarding the histogram issue, I worked on a project that had a few hundred histograms based on data from over 3 billion data points. It turns out that after a few thousand data points many histograms will stop changing significantly.
So, unless you really need to show exactly how many data points each bucket contains, it's much easier to run the analysis once offline, then serve just the histogram percentage data. From that you can make an SVG and overlay additional user-specific data on top. The point is that this histogram data is small and easy to cache.
You can then rerun the histogram analysis later if you'd like. However, for this project I never saw anything change with more data. It was overkill even to run it as a cron job.
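To make that concrete, here's a rough sketch of the offline step, assuming the raw times are already available as an array of seconds. The bucket width, range, and names are illustrative, not taken from the project above:

```typescript
// Sketch: compute per-bucket percentages offline so clients only ever
// download the small summary, never the raw data points.
// Bucket width and range are illustrative choices.

interface HistogramSummary {
  bucketWidth: number;   // seconds per bucket
  minValue: number;      // lower edge of the first bucket
  percentages: number[]; // share of samples in each bucket, 0..1
}

function summarize(times: number[], minValue = 0, maxValue = 120, bucketWidth = 1): HistogramSummary {
  const bucketCount = Math.ceil((maxValue - minValue) / bucketWidth);
  const counts = new Array<number>(bucketCount).fill(0);

  for (const t of times) {
    if (t < minValue || t >= maxValue) continue;        // drop out-of-range samples
    counts[Math.floor((t - minValue) / bucketWidth)] += 1;
  }

  const total = counts.reduce((a, b) => a + b, 0) || 1; // avoid divide-by-zero
  return { bucketWidth, minValue, percentages: counts.map(c => c / total) };
}
```

The resulting summary is a few hundred numbers at most: cheap to cache, cheap to serve, and enough to draw the SVG and overlay a user's own time on top.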
This is a good example of why understanding the principles of statistics can come in handy.
Right now this project is on the scale of ~100k points, but I'm starting to see a drop in percentage change as you mentioned. In the beginning, though, the trends weren't as clear so I wanted to keep it updating.
Wow, that's >1600 person-hours dedicated to this waiting task.
You can do what Prometheus does and pre-aggregate: size the buckets at 1s or 0.5s, store a count per bucket, and increment the counts as data points come in. You can store the data points individually too, and regenerate the histogram if you really need to, but it's far more efficient to store the histogram's aggregated buckets.
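Roughly, the idea looks like this (a sketch of the pre-aggregation pattern, not Prometheus's actual implementation; the bucket width and names are just for illustration):

```typescript
// Fixed-width buckets whose counters are incremented as each data point
// arrives. Only the counters are stored, so the cost of keeping the
// histogram is independent of how many points have been seen.

class BucketedHistogram {
  private counts = new Map<number, number>(); // bucket index -> count

  constructor(private readonly bucketWidth = 0.5) {}

  observe(seconds: number): void {
    const bucket = Math.floor(seconds / this.bucketWidth);
    this.counts.set(bucket, (this.counts.get(bucket) ?? 0) + 1);
  }

  // Bucket lower edges and their counts, ready to render or persist.
  snapshot(): Array<{ from: number; count: number }> {
    return [...this.counts.entries()]
      .sort(([a], [b]) => a - b)
      .map(([bucket, count]) => ({ from: bucket * this.bucketWidth, count }));
  }
}
```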
You were sending the entire database (all times from all users) to each individual end user to compute histograms client-side after they finished? Ah, yeah, that could get expensive on Firebase.
I guess it wasn't top of mind at the small scale I planned to operate at, but definitely a facepalm when you put it that way.
Eh, YAGNI is a valid development strategy. When you get thousands of users you can make it more efficient, and you did.
I have to brag here. I silently visualized a wall clock ticking off the seconds and got 60.02 seconds.
2 hundredths of a second off from reality!
I could try it again to validate, but I can't be bothered. As far as I'm concerned, I'm super accurate judging the passage of time. No need to find out if it was a lucky random pick in the interval [58, 62] :)
60.38s for me, which also surprised me.
But the real surprise was just how close a lot of people get. I was fully expecting a bell curve that was offset high or low due to some imagined bias people would have to count fast or slow.
Unless you try it again, we have no quantitative basis for estimating your variance. You can brag all you like, but I'm not listening.
Ok, I just did it again right now. This result was 59.48s.
I'm going to acknowledge that result as an outlier, exclude it from my analysis, and stick with my median result of 60.02.
$ np.median([59.48, 60.02])
59.75
$ stats.median_abs_deviation([59.48, 60.02])
0.27
;-)
Just some feedback: I started the challenge, but the constantly changing text like "I'm bored" etc. was enough to throw off my internal clock for judging when to stop it.
Close your eyes.
All I did was count from 20 to 80. Why from 20? It's a trick I learned from driver's license practice. In my language, the words for 1 to 20 are too short, so they aren't suitable for timing off seconds.
I pushed a quick fix to the issue by freezing the data being sent to the client, thereby halting the rapid growth in data consumption.
What do you mean by "freezing the data"?
Regarding the excessive download problem, my first instinct is to periodically (for example, every hour) compute summary statistics for the bar chart and store that in Firebase. This, of course, would require an additional script/service to perform these periodic jobs.
I'm not sure if that's what you ended up doing and I'm curious what your solution is.
> What do you mean by "freezing the data"?
https://github.com/JinayJain/just-a-minute/blob/master/app.j...
As others have mentioned, my solution in the moment was to download the JSON file from Firebase, compute histogram statistics manually, and hardcode the histogram into the JS itself.
Obviously not a scalable solution, and I think I would have done something very similar to the periodic updates you mentioned (if I had more experience with cloud functions etc.)
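For context, the one-off "freeze" step amounts to something like this (a guess at the workflow, not the project's actual code; the export filename and JSON shape are assumptions):

```typescript
// Read the JSON export downloaded from Firebase, bucket the times, and print
// a constant that can be pasted into the client JS in place of the live
// database read. Run once, offline.

import { readFileSync } from 'fs';

const raw = JSON.parse(readFileSync('times-export.json', 'utf8'));
const times: number[] = Object.values(raw);   // assumes { pushId: seconds, ... }

const counts: Record<number, number> = {};
for (const t of times) {
  const bucket = Math.floor(t);               // 1s buckets
  counts[bucket] = (counts[bucket] ?? 0) + 1;
}

console.log(`const FROZEN_HISTOGRAM = ${JSON.stringify(counts)};`);
```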
> This, of course, would require an additional script/service to perform these periodic jobs.
Also worth noting that Firebase has built-in “cloud functions” which have access to the database API. It would be pretty easy to run one on a schedule.
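Something along these lines (a sketch only; the /times and /histogram paths and the 1-second buckets are assumptions about the schema, not taken from the project):

```typescript
// Scheduled Cloud Function (1st-gen firebase-functions API) that rebuilds the
// histogram summary hourly and writes it back to the Realtime Database, so
// clients read the small /histogram node instead of every raw time.

import * as functions from 'firebase-functions';
import * as admin from 'firebase-admin';

admin.initializeApp();

export const rebuildHistogram = functions.pubsub
  .schedule('every 60 minutes')
  .onRun(async () => {
    const snapshot = await admin.database().ref('/times').once('value');
    const times: number[] = Object.values(snapshot.val() ?? {});

    // Count 1-second buckets server-side.
    const counts: Record<number, number> = {};
    for (const t of times) {
      const bucket = Math.floor(t);
      counts[bucket] = (counts[bucket] ?? 0) + 1;
    }

    await admin.database().ref('/histogram').set(counts);
    return null;
  });
```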
Probably (and I'm guessing, not the author here) took a snapshot of the data and hardcoded that to be sent to the client instead of live data.
There used to be a very old DOS program that did this, but for 5 seconds.
is there a way to see the results without going through the challenge?
I filter out any data points <5 seconds (as seen in the graph), so completing your attempt in under 5 seconds should do it.
Ideally, people would do the challenge first and then see where they lie on the graph before seeing the data itself.
that's what I ended up doing. I was looking at it, clicked to see your other projects, and then went back and couldn't view the results
Nice bell curve around 60s, but a little higher on the left than the right (proportionally more people underestimate), and there's a spike at 0-8 seconds (from people who just wanted to see the results or decided to quit quickly).
I think there's also a good number of people who didn't get what the site was actually asking them to do and just did what was prompted.