Wikipedia as the data source: taming the irregular, pt.1

Author: Tomte

Score: 7

Comments: 4

Date: 2021-11-30 20:47:10

________________________________________________________________________________

rocheio wrote at 2021-11-30 21:46:36:

> What I like most about it is how easy it is to achieve something useful with a very moderate amount of code.

100%. That's one of the best things about both Wikipedia and Python, IMO: neither may deliver perfect results, but they get you WORKABLE results very quickly.

I was also delighted reading this article about writing a Python parser for Wikipedia on a Jekyll blog... because I did an eerily similar thing ~5 years ago and it's still my most starred repo:

https://roche.io/2016/05/scrape-wikipedia-with-python

Small world :)

Best of luck with the project! On one hand, it seems impossible, given all the irregularities in article structure and the difficulty of QAing the long tail of niche topics. But on the other, if you can manage to wrangle 99% of it into a reliable query language... that could mean a lot to many other side projects!

zaik wrote at 2021-11-30 21:37:49:

Wikidata is the way to go. If you manage to get a machine readable form of Wikipedia knowledge which is not yet present in Wikidata, please consider contributing to Wikidata.

rocheio wrote at 2021-11-30 21:49:55:

Good point on this too. I think there's value in allowing and exploring one-off / alternative views into Wikipedia, especially where the data isn't accessible already, but long term any serious effort should be merged back into (or at least offered to) the official source.

zverok wrote at 2021-11-30 22:10:18:

I actually hope that WikipediaQL, once it becomes a bit more mature, will be helpful in parsing Wikipedia data _into_ Wikidata. As of now, Wikidata still lacks a lot of knowledge (the article talks about that, too).