
Adding Chinese web pages to Baidu

From 2001 to 2019, my personal web space was on official University of Cambridge servers in the cam.ac.uk domain.

My University-provided personal web space had always been indexed by the globally-dominant “Google” search engine, but as a public service I thought I should also check if the few Chinese-language pages I have are indexed by China’s dominant “Baidu” search engine. These notes are posted in case anyone finds them useful.

Dreadful machine translation

To begin with, I tried to find out if Baidu had already indexed my English home page, by searching Baidu for the quoted sentence “I am a partially-sighted computer scientist at Cambridge University in the UK.” The first hit (in mid-2017) was a Chinese site, reportedly a Baidu ‘spin-off’, called zybang (short for homework help) which had evidently scraped the first three sentences from my home page in 2011, added no fewer than 25 advertising spaces (the obvious aim was to intercept search traffic for English sentences—do they try to scrape every English sentence on the web?—and earn money from advertisement impressions), and added an awful machine translation: they rendered “partially sighted” as “without any future”. (What? Chinese robots officially declared partially-sighted people to have no future? Is that because USTC’s “robot goddess” project, which replied “no, not at all” to the question “do you like robots or humans” in a publicity stunt, has no use for the visually impaired as we can’t be seduced by the bot’s visual acts? I’d rather believe they are simple programs with no idea what they’re saying.) In all fairness, an alternate translation suggested further down zybang’s page was “partly farsighted”, as in part of me can predict the future, so I tried asking that part “when will zybang fix their translations” but got no answer. In mid-2021 zybang hid their alternate translations behind some kind of paywall, so only the “without any future” translation remained (and “Dorset” became the Egyptian god “Set” which is equally wrong—if you don’t know, don’t just make stuff up). Their scraper had not been clever enough to find my own translation on the linked Chinese version of my home page—OK so my Chinese skills are not as good as a real Chinese person’s, but I hope they’re not as bad as *that* machine translation. I explained this in a Chinese email to zybang (and asked if they could at least link to my real site) but I don’t know if they got it.

(Incidentally, USTC’s “Jiajia” robot subsequently had a preprogrammed response that accidentally gave an English lesson instead of the scriptwriters’ philosophy: “The question ‘why do you think the meaning of life is’ cannot be answered; ‘why’ is an improper word to use”. Indeed: the proper word “what” had been shortened in the interviewer’s speech, so mishearing it as “why” and telling her off was a semi-believable story. But I’m digressing.)

Baidu itself took the second place with a similar “scrape some sentences and auto-translate” page, once again mistranslating “partially sighted” as “no future prospects” and completely missing my own translation. At this point I gave up on the idea of emailing in corrections as it was obvious I wasn’t going to figure out who was *actually* responsible for the machine-translation technology in use.

After that the results lost relevance, and my real site was nowhere to be seen.

Baidu Mobile “English results”

The above test was conducted on Baidu’s desktop version. A search for the same phrase on Baidu’s mobile version (well, the first 9 words of it anyway: the input field was limited to 64 characters) did not show the zybang site, but did give Baidu’s own machine translation inline (this time they rendered “partially sighted” as “lazy eye”), and also offered a separate page of “English results”, which didn’t *at first* include my page. But a search for “Silas Brown Cambridge” *did* find my page (along with an old copy of my stolen Weibo account), and a search for “I come from rural West Dorset on the South-West peninsula” put my page second in the ‘English results’ (the Chinese results were scraped machine translations again). The following day, a repeat search for the *original* phrase (which had previously not found me at all) put me at the *top* of the ‘English results’ (finally)—had my previous day’s tests caused it to do some reprocessing, or was it just a matter of which part of their cluster happened to be handling my requests that day?

Anyway, the *desktop* site was *not* showing this “English results” option. Or if it was, it wasn’t making it obvious enough for *me* to find. Was that option pioneered by their mobile team and not yet fully integrated into their older desktop interface? Who knows.

Indexed via Weibo forum?

The above preliminary tests were checking what Baidu had done with my English pages, but what I *really* wanted them to do was to list my Chinese-language pages in their Chinese results. They weren’t doing that: the Chinese equivalents of these test searches (on both desktop and mobile) weren’t finding my Chinese home page at all.

However, I *could* find the Chinese version of my Gradint page on Baidu’s Chinese results, and I could also find the *English* version (but *not* the Chinese version) of my Xu Zhimo page (which does contain some Chinese text) on Baidu’s Chinese results. Why had just *these* pages been listed?

A Baidu search for the Gradint page’s URL found a mention of it on a scraped copy of a Weibo forum, on which someone had once asked for recommendations of free websites to learn English. At that time I was still using my Weibo account, and Weibo placed this question in my inbox so I thought the questioner was asking *me* (I realised later that Weibo had an algorithm for placing public questions into inboxes; the human questioner had not singled me out as I’d thought). I’d replied saying I might not be the best person to ask, because as a native speaker I hadn’t learned English online myself and didn’t know which sites were blocked in their country, but suggested they try howjsay, Forvo, the BBC or my Gradint, and I invited the questioner to PM me to discuss their specific interests in case I could suggest other appropriate material. (They didn’t PM me, and it’s too late now that my Weibo’s been stolen.) It now seemed this one mention of Gradint on a Weibo forum had been sufficient to get that page—and *only* that page—onto Baidu’s Chinese index. Similarly, a few people had linked to my Xu Zhimo page from some Chinese forum sites, and those links had to go to the English version as the Chinese version wasn’t yet published.

So apparently Baidu were listing Chinese-language pages *one* hop away from big Chinese sites, but not *two* hops. They had listed my Chinese Gradint page that had been mentioned on the Weibo forum, but they had not followed the Chinese Gradint page’s link to my Chinese home page. And they had listed the English version of my Xu Zhimo page (calling it a Chinese result because it does contain *some* Chinese text), but had not followed that page’s link to the version with all-Chinese notes.

Why was my Chinese Gradint page, which Baidu had correctly identified as a Chinese-language page, not regarded as carrying as much authority as a Weibo forum for giving Baidu additional Chinese-language links, such as its link to my Chinese home page? I don’t know, but I guessed it might be because Baidu has a webmaster registration scheme and they might be setting their cutoff for Chinese pages at “one hop away from registered Chinese sites”. So I reasoned I should try to make my Chinese home page a “registered Chinese site”, or, if this wasn’t possible, at least try to get a registered Chinese site elsewhere to link to *all* my Chinese pages, so they’re all only one hop away from a registered site.

Registering with Baidu

Firstly I needed a mainland China mobile number. So I asked a Chinese friend who’d collaborated on my Chinese computer-voice downloads. Presumably he’s the one they’d call if there’s a legal problem. I’m not planning on creating any problems, but they want an identifiable Chinese citizen on their books for every site, and it could be a heavy responsibility if you don’t know what you’re vouching for. So I asked someone who’d had some input into my material (and therefore a legitimate interest in registering it), and I will not accept requests to link random other sites via this registration.

Then it turned out Baidu don’t accept registrations of user subdirectories on a university server: you have to own a whole domain. So I used my SRCF subdomain, which also had the benefit of giving me access to the Apache request logs, which, I hoped, could give more insight into what Baidu was doing. (I wasn’t allowed to read logs for my subdirectory on the DS server, but I *can* read them for my SRCF subdomain.)

I tried a ‘301-redirect’ from the top of my SRCF subdomain to my Chinese home page, but Baidu’s registration validator said “unknown error 301” (it pretended to be Firefox 20 on Windows 7 but hadn’t been programmed to understand 301-redirects). So I placed the registration-validation file on the SRCF itself, and then the registration process worked.
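For reference, the kind of ‘301-redirect’ I mean takes one line of Apache mod_alias configuration at the top of the subdomain, something like this in .htaccess (the target URL below is illustrative, not necessarily my real address):

Redirect 301 / https://people.ds.cam.ac.uk/ssb22/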

But I didn’t want to move my actual homepage unnecessarily, so I made a simple Chinese site-map for Baidu to crawl (I generally avoided including my English-only pages in that list, because Baidu’s English results didn’t seem to have the “one hop away from a registered domain” restriction, and I wanted my “registered domain” to be ‘considered’ Chinese if they ever made statistics on it). If Baidu was going to treat my SRCF subdomain in the same way as it treated the Weibo forum, a list of links on the SRCF *should* be sufficient to make it index my existing pages on the DS server, and meanwhile I hope Google et al won’t count these as “bad links”.
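The “site map” needn’t be anything fancy: a plain Chinese-language page of links is the sort of thing I mean. Here is a minimal sketch (the filenames are illustrative, not my real ones):

<!DOCTYPE html>
<html lang="zh-Hans">
<head><meta charset="utf-8"><title>Site map</title></head>
<body><ul>
<li><a href="https://people.ds.cam.ac.uk/ssb22/index-zh.html">中文主页 (Chinese home page)</a></li>
<li><a href="https://people.ds.cam.ac.uk/ssb22/gradint/index-zh.html">Gradint 中文说明 (Gradint in Chinese)</a></li>
</ul></body></html>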

robots.txt issues?

Three days after submission, the SRCF server saw a GET /robots.txt from a Baidu-registered IP pretending to be Firefox 6 on Windows XP. (Note however that not all of the IP addresses Baidu uses are necessarily registered to Baidu: the IPs it had used during validation were registered to China Telecom Beijing and China Mobile.)
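For anyone repeating this kind of check: requests like these can be picked out of a standard Apache access log with something along the following lines (the log path is whatever your host provides, and the IP address is a documentation-range placeholder, not a real crawler address):

# list every robots.txt fetch, showing client IP and User-Agent
grep '"GET /robots.txt' ~/logs/access.log

# then look up which organisation a given IP address is registered to
whois 203.0.113.5 | grep -i -E 'netname|descr|org'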

Unfortunately I had made a mistake in that robots.txt: I had put Disallow: /*.cgi in it. The Robots Exclusion Standard does not define what * does in a Disallow line (it specifies only that * can be used as a universal selector in a User-Agent line), so it’s up to the programmers of each individual robot how to interpret * in a Disallow line. Baidu did not send any further requests. Had it stopped parsing at the * and abandoned its crawl?
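To illustrate the problem (the /cgi-bin/ path below is just an example, not necessarily what I used):

# what I had: "*" in a Disallow line is undefined in the original standard
User-agent: *
Disallow: /*.cgi

# safer: use only literal path prefixes, which every robot understands
User-agent: *
Disallow: /cgi-bin/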

Four days later the SRCF server saw two more requests, both arriving in the same second. These came from IPs registered to China Unicom Beijing and to ChinaNet Beijing. The first asked for /robots.txt in the name of Baiduspider/2.0, and the second asked for / in the name of a Samsung Galaxy S3 phone configured to the UK locale. Well that’s odd—I thought the part of Baidu that correctly identifies itself was the English-results department, developed separately from the Chinese department that sends out the old browser headers I’d seen during validation, so I didn’t expect a correctly-identified request to be followed so rapidly by one that wasn’t. (And it’s rather unlikely to be a *real* consumer browser that just *happened* to arrive in the same second: who on earth walks around Beijing with a UK-configured 5-year-old phone and just randomly decides to load up the root page of my SRCF subdomain, which I wasn’t using before, with no referring page, once only, coincidentally just after the robot?) So had some department of Baidu actually indexed / despite the fact that I hadn’t yet fixed the dubious * in robots.txt?

But then the next day saw a repeat request for robots.txt from “Firefox 6 on XP” at a Baidu-registered IP, 5 days after the first. I still hadn’t fixed that *, and Baidu (or at least *that* department) still didn’t fetch anything else.

At that point I fixed the * issue; 6 days later a Baidu-registered IP again fetched robots.txt from “Firefox 6 on XP” and then after 4½ hours a ChinaNet Beijing IP once again fetched / from the “Galaxy S3”—had the robots.txt finally worked, or was that just a repeat request from the other department?

Language tagging?

Allegedly Baidu still requires the old method of language tagging at the document level, i.e.

<meta http-equiv="content-language" content="zh-Hans">
or for a bilingual page
<meta http-equiv="content-language" content="zh-Hans, en">

and I have now done this to my Chinese-language pages “just in case”, but I’m not *really* sure it’s necessary because Baidu *had* indexed my Chinese Gradint page when it had just the more-modern lang attributes. Of course I still use modern lang attributes *as well*—they can specify the language of individual elements and text spans, which is also good for some screen readers, and for font selection in some visual browsers, when including phrases not in the page’s main language.
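So a bilingual page now carries both forms, along these lines (an illustrative fragment):

<html lang="zh-Hans">
<head><meta http-equiv="content-language" content="zh-Hans, en"></head>
<body>
<p>这是一句中文。 (a Chinese sentence)</p>
<p lang="en">This paragraph is in English.</p>
</body>
</html>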

All that still didn’t work

For the next 4 months I continued to see occasional checks from Baidu IPs (perhaps to find out how often my site changes), but my Chinese homepage still failed to appear in their search results. It’s likely that Baidu update some parts of their index more frequently than others and have given me a low priority, but I was surprised they were taking *this* long. They *did* list my Chinese-English page saying Xu Zhimo’s poem is not about Lin Huiyin (although giving no results for that URL in a link search), but they did not list the Chinese version of the poem page itself. It *can’t* be that they de-list Chinese versions of foreign-hosted pages that also have English versions listed in their link hreflang markup, as that wouldn’t explain why they *did* list my Chinese Gradint page which has the same thing. It wasn’t making much sense.
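The hreflang markup in question is the standard kind: each language version links to the other from its head element, like this (URLs illustrative):

<link rel="alternate" hreflang="en" href="https://people.ds.cam.ac.uk/ssb22/zhimo.html">
<link rel="alternate" hreflang="zh-Hans" href="https://people.ds.cam.ac.uk/ssb22/zhimo-zh.html">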

“Baidu Knows” didn’t work either

In November 2017 we tried something else: posting a question on Baidu’s public question-and-answer forum “Baidu Knows” (zhidao.baidu.com). This also required going through a sign-up process where you needed a mainland China mobile number (and annoyingly you’re told this only *after* you’ve gone to the trouble of typing out your Chinese question and optionally associating your QQ or Weibo account), so I asked the same Chinese friend to post on my behalf.

The question (in Chinese) basically said “why can’t Baidu find these Chinese pages on a Cambridge University server” and then listed the URLs I wanted to be indexed. I wasn’t expecting any serious answers (I’m told Baidu’s forum has a reputation for low quality), but I hoped the URL list *in the question itself* might finally prompt Baidu to index them, as they are now just one hop away from Baidu’s own forum (if they count Weibo’s forum, surely they should count their *own*, I thought). Of course I would also be happy if I got a good answer, so it wasn’t *entirely* done under false pretences, but I hope posting questions with hidden agendas like that is not in breach of some Terms and Conditions of which I’m unaware.

(I thought the ‘hidden’ purpose of that URL list was obvious, but my Chinese friend—also a programmer—didn’t see it until I pointed it out; his initial reaction had been to ask if I’m sure the low-quality forum would be good enough for my question. So perhaps I was suffering from “hindsight bias”: a plan is less obvious if you don’t already know. Perhaps in the interests of ‘transparency’ we should have added a postscript to the question. Still, it’s not as if I’m trying to cheat on a page-ranking system: I’m just trying to be *in* Baidu’s index as opposed to *not* being in Baidu’s index.)

I waited a month, and Baidu *still* hadn’t indexed these pages. “Baidu Knows” didn’t seem to be helping.

Pinging the API URL didn’t work

In December 2017, after finding the URL http://api.share.baidu.com/s.gif in Baidu’s Javascript, I tried pinging it once for each of my pages:

xargs -n 1 curl http://api.share.baidu.com/s.gif -e < urllist.txt

(curl’s -e option sets the Referer header, so each invocation requests Baidu’s tracking gif with one URL from urllist.txt as the referring page.) But I doubted it would prompt the indexing of pages on domains that are not registered with Baidu’s webmaster tools. Indeed it had not given me any listings by February 2018.

Mid-2018: index of titles only?

In July 2018 I discovered that a search for my Chinese name on Baidu *did* list my Chinese homepage, on the second page of Baidu results. So it *was* being indexed. It is *possible* that this reflects a change of Baidu policy, since at the same time I also noticed Baidu was listing certain other foreign-hosted Chinese-language pages that it didn’t previously list. It’s also possible that it had been indexed for some time and I simply hadn’t tried the right search.

My page was found when I searched for my Chinese name, but it was not found when I searched for *other* Chinese phrases I’d used. Even phrases which appeared in Baidu’s summary of my site (so must have been “seen” by Baidu), and which don’t occur as exact phrases on any of the other sites Baidu retrieves, still didn’t put my Chinese homepage into the first ten pages of Baidu results.

At this point I suspected it was searching on only the title. A search for my exact title put my page very near the top of the results. I had *not* included the full title in the “sitemap” links I’d left on my SRCF subdomain, so it must have been searching on the actual page title, not just the link text on the SRCF domain. And yet it didn’t seem to search on any of the other text in the page (although it would *highlight* words in the summary, if they happened to occur there too after the page was retrieved using, apparently, only the title).

Unfortunately, many of my Chinese pages, and especially Chinese-with-English pages, did not have Chinese in the HTML title. (Searching in English would put Baidu into the wrong mode for this test.) I knew my Gradint page was already indexed, so the only other meaningful test I could do at this point was on my Chinese Xu Zhimo notes page. This was now retrieved by Baidu when searching for part of its title, but it was *not* retrieved by searching for its first sentence.

So at this point it seemed full-text search was still reserved for the preferentially-treated China-hosted sites, so if you want to add something to Baidu, you had better hope to match your title to the keywords people will use, like you had to do in the pre-Web days of Gopher and Veronica.

Full text indexed by end-2018

My Chinese home page finally appeared in Baidu’s results for one of the phrases it used (not the title) on 28th December 2018, some 18 months after I started trying to get it there.

Searching for variations of my phrase (reordering the words etc) gave inconsistent results: sometimes my page wouldn’t be there at all, and once it appeared as the third result, with its summary taken from my own text and its title taken from the Baidu Knows question that had been posted the previous November. But at least a decent result was obtained by searching for my Chinese name and the Chinese word for Cambridge.

De-listed by server outages

The University personal web-page server had several major outages in 2019.

After the last of these outages, I migrated my site to the SRCF server (replacing the previous “made for Baidu” sitemap I’d put there); a few days later the original server’s DNS was updated to point to a “service has been decommissioned” page that was served with HTTP code 301 but no Location header, and Google de-listed my pages at this point. The “decommissioned” page said I could request a redirect, which was put in place on Tuesday 8th October, and the same day Baidu re-listed me but pointed to the *original* URL (not the redirected one). Google meanwhile re-listed my home page—again at its original URL—by October 22nd, and started pointing to the new URL during November.
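(Incidentally, that kind of malformed redirect is easy to check with curl; hostnames below are illustrative:

curl -sI http://old-server.example/ssb22/
# a proper 301 response should include a line like:
#   Location: https://new-home.example/ssb22/
# the decommissioned server's 301 had no Location line at all

)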

Baidu then:

* de-listed me by 1st December 2019
* re-listed me at the new URL by 10th December
* changed back to the original URL by 20th December
* de-listed me again by 2nd January 2020
* re-listed me (with lower ranking) on 25th September 2020, again at the original URL, not the new SRCF one
* de-listed me from 3rd to 19th November 2020
* de-listed me from 23rd November 2020 to 26th January 2022
* de-listed me from 11th February to 6th March 2022
* and de-listed me again from 23rd March 2022

I’m not sure how to explain all this.

Legal

All material © Silas S. Brown unless otherwise stated. Apache is a registered trademark of The Apache Software Foundation. Baidu is a trademark of Baidu Online Network Technology (Beijing) Co. Ltd. Firefox is a registered trademark of The Mozilla Foundation. Google is a trademark of Google LLC. Javascript is a trademark of Oracle Corporation in the US. Samsung is a registered trademark of Samsung. Weibo is a trademark of Sina.Com Technology (China) Co. Ltd. Windows is a registered trademark of Microsoft Corp. Any other trademarks I mentioned without realising are trademarks of their respective holders.