From 2001 to 2019, my personal web space was on official University of Cambridge servers in the cam.ac.uk domain.
My University-provided personal web space had always been indexed by the globally-dominant "Google" search engine, but as a public service I thought I should also check if the few Chinese-language pages I have are indexed by China's dominant "Baidu" search engine. These notes are posted in case anyone finds them useful.
To begin with, I tried to find out if Baidu had already indexed my English home page, by searching Baidu for the quoted sentence "I am a partially-sighted computer scientist at Cambridge University in the UK." The first hit (in mid-2017) was a Chinese site, reportedly a Baidu "spin-off", called zybang (short for homework help) which had evidently scraped the first three sentences from my home page in 2011, added no fewer than 25 advertising spaces (the obvious aim was to intercept search traffic for English sentences - do they try to scrape every English sentence on the web? - and earn money from advertisement impressions), and added an awful machine translation: they rendered "partially sighted" as "without any future". (What? Chinese robots officially declared partially-sighted people to have no future? Is that because USTC's "robot goddess" project, which replied "no, not at all" to the question "do you like robots or humans" in a publicity stunt, has no use for the visually impaired as we can't be seduced by the bot's visual acts? I'd rather believe they are simple programs with no idea what they're saying.) In all fairness, an alternate translation suggested further down zybang's page was "partly farsighted", as in part of me can predict the future, so I tried asking that part "when will zybang fix their translations" but got no answer. In mid-2021 zybang hid their alternate translations behind some kind of paywall, so only the "without any future" translation remained (and "Dorset" became the Egyptian god "Set", which is equally wrong - if you don't know, don't just make stuff up). Their scraper had not been clever enough to find my own translation on the linked Chinese version of my home page - OK, so my Chinese skills are not as good as a real Chinese person's, but I hope they're not as bad as *that* machine translation. I explained this in a Chinese email to zybang (and asked if they could at least link to my real site) but I don't know if they got it.
(Incidentally, USTC's "Jiajia" robot subsequently had a preprogrammed response that accidentally gave an English lesson instead of the scriptwriters' philosophy: "The question 'why do you think the meaning of life is' cannot be answered; 'why' is an improper word to use". Indeed: the proper word "what" had been shortened in the interviewer's speech, so mishearing it as "why" and telling her off was a semi-believable story. But I'm digressing.)
Baidu itself took second place with a similar "scrape some sentences and auto-translate" page, once again mistranslating "partially sighted" as "no future prospects" and completely missing my own translation. At this point I gave up on the idea of emailing in corrections, as it was obvious I wasn't going to figure out who was *actually* responsible for the machine-translation technology in use.
After that the results lost relevance, and my real site was nowhere to be seen.
The above test was conducted on Baidu's desktop version. A search for the same phrase on Baidu's mobile version (well, the first 9 words of it anyway: the input field was limited to 64 characters) did not show the zybang site, but did give Baidu's own machine translation inline (this time they rendered "partially sighted" as "lazy eye"), and also gave an offer to look at a separate page of "English results" which didn't *at first* include my page. But a search for "Silas Brown Cambridge" *did* find my page (along with an old copy of my stolen Weibo account), and a search for "I come from rural West Dorset on the South-West peninsula" put me in the second hit of the "English results" (the Chinese results were scraped machine translations again). The following day, a repeat search for the *original* phrase (which had previously not found me at all) put me at the *top* of the "English results" (finally) - had my previous day's tests caused it to do some reprocessing, or was it just a matter of which part of their cluster happened to be handling my requests that day?
Anyway, the *desktop* site was *not* showing this "English results" option. Or if it was, it wasn't making it obvious enough for *me* to find. Was that option pioneered by their mobile team and not yet fully integrated into their older desktop interface? Who knows.
The above preliminary tests were checking what Baidu had done with my English pages, but what I *really* wanted them to do was to list my Chinese-language pages in their Chinese results. They weren't doing that: the Chinese equivalents of these test searches (on both desktop and mobile) weren't finding my Chinese home page at all.
However, I *could* find the Chinese version of my Gradint page in Baidu's Chinese results, and I could also find the *English* version (but *not* the Chinese version) of my Xu Zhimo page (which does contain some Chinese text) in Baidu's Chinese results. Why had just *these* pages been listed?
A Baidu search for the Gradint page's URL found a mention of it on a scraped copy of a Weibo forum, on which someone had once asked for recommendations of free websites to learn English. At that time I was still using my Weibo account, and Weibo placed this question in my inbox so I thought the questioner was asking *me* (I realised later that Weibo had an algorithm for placing public questions into inboxes; the human questioner had not singled me out as I'd thought). I'd replied saying I might not be the best person to ask because as a native speaker I didn't learn English online myself and I don't know what sites are blocked in your country, but you could try howjsay, Forvo, BBC or my Gradint, and I invited the questioner to PM me to discuss their specific interests to see if I could suggest other appropriate material. (They didn't PM me, and it's too late now my Weibo's been stolen.) It now seemed this one mention of Gradint on a Weibo forum had been sufficient to get that page - and *only* that page - onto Baidu's Chinese index. Similarly, a few people had linked to my Xu Zhimo page from some Chinese forum sites, and those links had to go to the English version as the Chinese version wasn't yet published.
So apparently Baidu were listing Chinese-language pages *one* hop away from big Chinese sites, but not *two* hops. They had listed my Chinese Gradint page that had been mentioned on the Weibo forum, but they had not followed the Chinese Gradint page's link to my Chinese home page. And they had listed the English version of my Xu Zhimo page (calling it a Chinese result because it does contain *some* Chinese text), but had not followed that page's link to the version with all-Chinese notes.
Why was my Chinese Gradint page, which Baidu had correctly identified as a Chinese-language page, not regarded as carrying as much authority as a Weibo forum for giving Baidu additional Chinese-language links, such as its link to my Chinese home page? I don't know, but I guessed it might be because Baidu has a webmaster registration scheme and they might be setting their cutoff for Chinese pages at "one hop away from registered Chinese sites". So I reasoned I should try to make my Chinese home page a "registered Chinese site", or, if this wasn't possible, at least try to get a registered Chinese site elsewhere to link to *all* my Chinese pages, so they're all only one hop away from a registered site.
Firstly I needed a mainland China mobile number. So I asked a Chinese friend who'd collaborated on my Chinese computer-voice downloads. Presumably he's the one they'd call if there's a legal problem. I'm not planning on creating any problems, but they want an identifiable Chinese citizen on their books for every site, and it could be a heavy responsibility if you don't know what you're vouching for. So I asked someone who'd had some input into my material (and therefore a legitimate interest in registering it), and I will not accept requests to link random other sites via this registration.
Then it turned out Baidu don't accept registrations of user subdirectories on a university server: you have to own a whole domain. So I used my SRCF subdomain, which also had the benefit of giving me access to the Apache request logs, which, I hoped, could give more insight into what Baidu was doing. (I wasn't allowed to read logs for my subdirectory on the DS server, but I *can* read them for my SRCF subdomain.)
I tried a "301-redirect" from the top of my SRCF subdomain to my Chinese home page, but Baidu's registration validator said "unknown error 301" (it pretended to be Firefox 20 on Windows 7 but hadn't been programmed to understand 301-redirects). So I placed the registration-validation file on the SRCF itself, and then the registration process worked.
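(The redirect itself was nothing exotic; a minimal sketch, assuming an Apache .htaccess file at the top of the subdomain and using a placeholder target URL rather than my real page:)

```
# Sketch of an .htaccess rule at the root of the subdomain;
# the target URL below is a placeholder, not my real Chinese home page.
Redirect 301 / https://example.org/~user/index-zh.html
```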
But I didn't want to move my actual homepage unnecessarily, so I made a simple Chinese site-map for Baidu to crawl (I generally avoided including my English-only pages in that list, because Baidu's English results didn't seem to have the "one hop away from a registered domain" restriction, and I wanted my "registered domain" to be "considered" Chinese if they ever made statistics on it). If Baidu was going to treat my SRCF subdomain in the same way as it treated the Weibo forum, a list of links on the SRCF *should* be sufficient to make it index my existing pages on the DS server, and meanwhile I hope Google et al won't count these as "bad links".
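(The "site-map" was nothing sophisticated; a sketch of the kind of page I mean, with placeholder URLs rather than my real ones:)

```
<!-- a bare list of links to the Chinese pages on the DS server
     (these URLs are placeholders) -->
<ul lang="zh-Hans">
<li><a href="https://example.org/~user/index-zh.html">中文主页</a></li>
<li><a href="https://example.org/~user/xu-zhimo-notes-zh.html">徐志摩诗笔记</a></li>
</ul>
```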
Three days after submission, the SRCF server saw a GET /robots.txt from a Baidu-registered IP pretending to be Firefox 6 on Windows XP. (Note however that not all of the IP addresses Baidu uses are necessarily registered to Baidu: the IPs it had used during validation were registered to China Telecom Beijing and China Mobile.)
Unfortunately I had made a mistake in that robots.txt: I had put Disallow: /*.cgi in it. The Robots Exclusion Standard does not define what * does in a Disallow line (it specifies only that * can be used as a universal selector in a User-Agent line), so it's up to the programmers of each individual robot how to interpret * in a Disallow line. Baidu did not send any further requests. Had it stopped parsing at the * and abandoned its crawl?
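(For reference, the relevant part of the file looked something like the first stanza below; one way to avoid the undefined wildcard is to disallow a literal path prefix instead, as in the second stanza, where the exact path is only an example:)

```
# problematic: * in a Disallow line is undefined in the original standard
User-agent: *
Disallow: /*.cgi

# wildcard-free alternative (the path here is an example only)
User-agent: *
Disallow: /cgi-bin/
```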
Four days later the SRCF server saw two more requests, both arriving in the same second. These came from IPs registered to China Unicom Beijing and to ChinaNet Beijing. The first asked for /robots.txt in the name of Baiduspider/2.0, and the second asked for / in the name of a Samsung Galaxy S3 phone configured to the UK locale. Well, that's odd - I thought the part of Baidu that correctly identifies itself was the English-results department, developed separately from the Chinese department that sends out the old browser headers I'd seen during validation, so I didn't expect a correctly-identified request to be followed so rapidly by one that wasn't. (And it's rather unlikely to be a *real* consumer browser that just *happened* to arrive in the same second: who on earth walks around Beijing with a UK-configured 5-year-old phone and just randomly decides to load up the root page of my SRCF subdomain, which I wasn't using before, with no referring page, once only, coincidentally just after the robot?) So had some department of Baidu actually indexed / despite the fact that I hadn't yet fixed the dubious * in robots.txt?
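(Anyone repeating these checks can get the same information straight from the access log plus a whois lookup; a sketch, with an example log path and an example address:)

```
# requests claiming to be Baiduspider (log path is an example)
grep -i baiduspider /var/log/apache2/access.log

# who an address is actually registered to (address is an example only)
whois 180.76.15.1 | grep -iE 'netname|orgname|descr'
```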
But then the next day saw a repeat request for robots.txt from "Firefox 6 on XP" at a Baidu-registered IP, 5 days after the first. I still hadn't fixed that *, and Baidu (or at least *that* department) still didn't fetch anything else.
At that point I fixed the * issue; 6 days later a Baidu-registered IP again fetched robots.txt from "Firefox 6 on XP" and then, after 4½ hours, a ChinaNet Beijing IP once again fetched / from the "Galaxy S3" - had the robots.txt finally worked, or was that just a repeat request from the other department?
Allegedly Baidu still requires the old method of language tagging at the document level, i.e.
<meta http-equiv="content-language" content="zh-Hans">
or for a bilingual page
<meta http-equiv="content-language" content="zh-Hans, en">
and I have now done this to my Chinese-language pages "just in case", but I'm not *really* sure it's necessary because Baidu *had* indexed my Chinese Gradint page when it had just the more-modern lang attributes. Of course I still use modern lang attributes *as well* - they can specify the language of individual elements and text spans, which is also good for some screen readers, and for font selection in some visual browsers, when including phrases not in the page's main language.
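(To illustrate how the two methods can coexist on the same page; a sketch, not my actual markup:)

```
<html lang="zh-Hans">
<head>
<meta http-equiv="content-language" content="zh-Hans, en">
<title>示例页面</title>
</head>
<body>
<!-- the lang attribute on the span marks an English phrase inside the Chinese text -->
<p>这个程序叫 <span lang="en">Gradint</span>。</p>
</body>
</html>
```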
For the next 4 months I continued to see occasional checks from Baidu IPs (perhaps to find out how often my site changes), but my Chinese homepage still failed to appear in their search results. It's likely that Baidu update some parts of their index more frequently than others and have given me a low priority, but I was surprised they were taking *this* long. They *did* list my Chinese-English page saying Xu Zhimo's poem is not about Lin Huiyin (although giving no results for that URL in a link search), but they did not list the Chinese version of the poem page itself. It *can't* be that they de-list Chinese versions of foreign-hosted pages that also have English versions listed in their link hreflang markup, as that wouldn't explain why they *did* list my Chinese Gradint page which has the same thing. It wasn't making much sense.
In November 2017 we tried something else: posting a question on Baidu's public question-and-answer forum "Baidu Knows" (zhidao.baidu.com). This also required going through a sign-up process where you needed a mainland China mobile number (and annoyingly you're told this only *after* you've gone to the trouble of typing out your Chinese question and optionally associating your QQ or Weibo account), so I asked the same Chinese friend to post on my behalf.
The question (in Chinese) basically said "why can't Baidu find these Chinese pages on a Cambridge University server" and then listed the URLs I wanted to be indexed. I wasn't expecting any serious answers (I'm told Baidu's forum has a reputation for low quality), but I hoped the URL list *in the question itself* might finally prompt Baidu to index them, as they are now just one hop away from Baidu's own forum (if they count Weibo's forum, surely they should count their *own*, I thought). Of course I would also be happy if I got a good answer, so it wasn't *entirely* done under false pretences, but I hope posting questions with hidden agendas like that is not in breach of some Terms and Conditions of which I'm unaware.
(I thought the "hidden" purpose of that URL list was obvious, but my Chinese friend - also a programmer - didn't see it until I pointed it out; his initial reaction had been to ask if I was sure the low-quality forum would be good enough for my question. So perhaps I was suffering from "hindsight bias": a plan is less obvious if you don't already know it. Perhaps in the interests of "transparency" we should have added a postscript to the question. Still, it's not as if I'm trying to cheat on a page-ranking system: I'm just trying to be *in* Baidu's index as opposed to *not* being in Baidu's index.)
I waited a month, and Baidu *still* hadn't indexed these pages. "Baidu Knows" didn't seem to be helping.
In December 2017 I tried xargs -n 1 curl http://api.share.baidu.com/s.gif -e < urllist.txt after finding that URL in Baidu's Javascript, but I doubted it would prompt the indexing of pages on domains that are not registered with Baidu's webmaster tools. Indeed it had not given me any listings by February 2018.
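(That one-liner just fetches the s.gif logging image once per URL in urllist.txt, with curl's -e option setting the Referer header to that URL; it is roughly equivalent to the following, with the responses discarded:)

```
while read url; do
  # -e sets the Referer header, so each request names one of my pages
  curl -s -o /dev/null -e "$url" http://api.share.baidu.com/s.gif
done < urllist.txt
```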
In July 2018 I discovered that a search for my Chinese name on Baidu *did* list my Chinese homepage, on the second page of Baidu results. So it *was* being indexed. It is *possible* that this reflects a change of Baidu policy, since at the same time I also noticed Baidu was listing certain other foreign-hosted Chinese-language pages that it didn't previously list. It's also possible that it had been indexed for some time and I simply hadn't tried the right search.
My page was found when I searched for my Chinese name, but it was not found when I searched for *other* Chinese phrases I'd used. Even phrases which appeared in Baidu's summary of my site (so must have been "seen" by Baidu), and which don't occur as exact phrases on any of the other sites Baidu retrieves, still didn't put my Chinese homepage into the first ten pages of Baidu results.
At this point I suspected it was searching on only the title. A search for my exact title put my page very near the top of the results. I had *not* included the full title in the "sitemap" links I'd left on my SRCF subdomain, so it must have been searching on the actual page title, not just the link text on the SRCF domain. And yet it didn't seem to search on any of the other text in the page (although it would *highlight* words in the summary, if they happened to occur there too after the page was retrieved using, apparently, only the title).
Unfortunately, many of my Chinese pages, and especially Chinese-with-English pages, did not have Chinese in the HTML title. (Searching in English would put Baidu into the wrong mode for this test.) I knew my Gradint page was already indexed, so the only other meaningful test I could do at this point was on my Chinese Xu Zhimo notes page. This was now retrieved by Baidu when searching for part of its title, but it was *not* retrieved by searching for its first sentence.
So at this point it seemed full-text search was still reserved for the preferentially-treated China-hosted sites: if you want to add something to Baidu, you had better hope to match your title to the keywords people will use, as you had to do in the pre-Web days of Gopher and Veronica.
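(In practical terms that would have meant putting the phrases people are likely to search for into the title element itself; an illustration only, not my actual titles:)

```
<!-- findable only via English searches -->
<title>Xu Zhimo notes</title>

<!-- puts searchable Chinese keywords into the title itself -->
<title>徐志摩《再别康桥》剑桥笔记</title>
```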
My Chinese home page finally appeared in Baidu's results for one of the phrases it used (not the title) on 28th December 2018, some 18 months after I started trying to get it there.
Searching for variations of my phrase (reordering the words etc) gave inconsistent results: sometimes my page wouldn't be there at all, and once it appeared as the third result, with its summary taken from my own text and its title taken from the Baidu Knows question that had been posted the previous November. But at least a decent result was obtained by searching for my Chinese name and the Chinese word for Cambridge.
The University personal web-page server had several major outages in 2019.
After the final outage, I migrated my site to the SRCF server (replacing the previous "made for Baidu" sitemap I'd put there); a few days later the original server's DNS was updated to point to a "service has been decommissioned" page that was served with HTTP code 301 but no Location header, and Google de-listed my pages at this point. The "decommissioned" page said I could request a redirect, which was put in place on Tuesday 8th October, and the same day Baidu re-listed me but pointed to the *original* URL (not the redirected one). Google meanwhile re-listed my home page - again at its original URL - by October 22nd, and started pointing to the new URL during November.
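(Incidentally, the "decommissioned" page's problem was easy to see from its response headers; a sketch of the check, with a placeholder URL:)

```
# show just the status line and any Location header (URL is a placeholder)
curl -sI http://example.org/~user/ | grep -iE '^(HTTP|Location)'
# at the time this gave a 301 status line with no Location header after it
```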
Baidu then de-listed me by 1st December, re-listed me at the new URL by 10th December, changed back to the original URL by 20th December, de-listed me again by 2nd January 2020, re-listed me (with lower ranking) on September 25th 2020 (again at the original URL, not the new SRCF one), and de-listed me from 3rd to 19th November 2020, from 23rd November 2020 to 26th January 2022, from 11th February to 6th March 2022, and from 23rd March 2022 onwards. I'm not sure how to explain all this.
All material © Silas S. Brown unless otherwise stated. Apache is a registered trademark of The Apache Software Foundation. Baidu is a trademark of Baidu Online Network Technology (Beijing) Co. Ltd. Firefox is a registered trademark of The Mozilla Foundation. Google is a trademark of Google LLC. Javascript is a trademark of Oracle Corporation in the US. Samsung is a registered trademark of Samsung. Weibo is a trademark of Sina.Com Technology (China) Co. Ltd. Windows is a registered trademark of Microsoft Corp. Any other trademarks I mentioned without realising are trademarks of their respective holders.