💾 Archived View for tilde.pink › ~ssb22 › ph-corpus.gmi captured on 2022-04-28 at 17:30:00. Gemini links have been rewritten to link to archived content

Is the PH Corpus really public domain?

The PH Corpus of Mandarin Chinese, compiled by Guo Jin, is distributed as “public domain”. It is a potentially-useful source of word-segmentation examples, but it has no pinyin and I found some errors in its segmentation. Before publishing my corrected version, I need to double-check that the text is *really* public domain (not just that Guo Jin *thought* it was public domain), otherwise *I’d* be infringing Xinhua’s copyright. 

China joined the Berne Copyright Convention in 1992, but it *did* have its own copyright law in 1991. The PH Corpus includes text published by Xinhua News Agency in 1990 and 1991, and it does not say which *months* of 1991. That is a problem, because China’s 1990 copyright law came into force in June 1991. If all of the corpus’s 1991 text came from January through May, then it dates from a period during which the People’s Republic of China had no copyright law in force: the 1928 ROC law was abolished by the PRC in 1949, and it seems the PRC did not immediately replace it, leaving all texts within its territory public domain until the 1990 law came into effect. But if the PH Corpus includes text from June through December 1991, that part *might* have been subject to China’s 1990 law, which we should therefore consider.

China’s 1990 copyright law

Article 5, point 2 of the 1990 law excludes “news on current affairs” from copyright, but that can’t have meant *all* newspaper content, otherwise it wouldn’t have been necessary to add a special provision in Article 22 point 4 for newspapers to reprint each other’s “editorials or commentator’s articles” (this point gave such rights only to newspapers, not corpus compilers). 

So it seems this law divided newspaper content into “news on current affairs” (which was public domain), “editorials or commentator’s articles” (which could be reprinted, but only if you’re a newspaper), and, possibly, other material (newspaper crosswords don’t work in Chinese, but they might have had quizzes or something). 

Therefore, the PH Corpus could be redistributable if:

1. we can confirm that all the articles it draws from were published before June 1991, or

2. we can confirm that all articles published by Xinhua News Agency (or at least the subset of them that made it into the PH Corpus) counted as “news on current affairs” and not as “editorials or commentator’s articles” or any other category, or

3. we can confirm that Xinhua at some point gave the public permission to copy their text—at least all the text they published in 1991.
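
The three conditions above can be written out as a small check. This is only a sketch of the reasoning in these notes, not legal advice and not a tool that exists anywhere: the `Article` record, its `category` labels and the `xinhua_permission` flag are all hypothetical, and the cut-off is taken here as 1 June 1991. The PH Corpus itself does not record publication months or article categories, so in practice the unknowns stay unknown and the check fails.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional

# Assumed cut-off: the 1990 law coming into force in June 1991
# (taken here as the 1st of the month).
LAW_IN_FORCE = date(1991, 6, 1)


@dataclass
class Article:
    """Hypothetical record for one source article; not a PH Corpus format."""
    published: Optional[date]  # None when the publication date is unknown
    category: Optional[str]    # e.g. "news" or "editorial"; None when unknown


def redistributable(articles: List[Article], xinhua_permission: bool = False) -> bool:
    """Return True only if at least one of the three conditions is confirmed."""
    # Condition 3: Xinhua gave the public permission to copy the text.
    if xinhua_permission:
        return True
    # Condition 1: every article is confirmed to predate the law.
    all_before_law = all(
        a.published is not None and a.published < LAW_IN_FORCE for a in articles
    )
    # Condition 2: every article is confirmed to be "news on current affairs".
    all_news = all(a.category == "news" for a in articles)
    return all_before_law or all_news


if __name__ == "__main__":
    unknown = [Article(published=None, category=None)]
    print(redistributable(unknown))  # unknown month and category: not confirmed
```

Anything unknown is treated as unconfirmed, which matches the cautious reading above: the corpus counts as redistributable only when one of the conditions can actually be shown to hold.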

That’s as far as I got. If anyone knows of any more evidence showing whether or not the PH Corpus is really public domain, please let me know. 

I am not a lawyer; use these notes at your own risk.

Legal

All material © Silas S. Brown unless otherwise stated.