2016, Aug 27 - Dimitri Merejkowsky License: CC By 4.0
Quick! What's the bug fixed by this commit?
-def compare_links(a_html, b_html): - a_soup = bs4.BeautifulSoup(a_html, "lxml") - b_soup = bs4.BeautifulSoup(b_html, "lxml") +def compare_links(local_html, remote_html): + local_soup = bs4.BeautifulSoup(local_html, "lxml") + remote_soup = bs4.BeautifulSoup(remote_html, "lxml") - a_links = set(a_soup.find_all("a")) - b_links = set(b_soup.find_all("a")) + local_links = set(local_soup.find_all("a")) + remote_links = set(remote_soup.find_all("a")) - old_hrefs = [x.attrs["href"] for x in a_links] + old_hrefs = [x.attrs["href"] for x in remote_links] - new_links = [x for x in a_links if x.attrs["href"] not in old_hrefs] + new_links = [x for x in local_links if x.attrs["href"] not in old_hrefs]
Not that obvious, right?
Please read on to find out how this bug could never had happened, and how this commit should never have been made.
The above is a real commit I made recently.
It's a script I use to automatically tweet about new links, like here
The first version of the algorithm was quite simple:
That's where the `compare_links` function is used:
import requests def main(): build_blog() local_links_path = os.path.join(BLOG_SRC_PATH, "public", "pages", "links", "index.html") with open(local_links_path, "r") as fp: local_links_html = fp.read() remote_links_html = requests.get("https://dmerej.info/links.html").text new_links = compare_links(local_links_html, remote_links_html) if not new_links: return tweet_new_links(new_links)
In the first version I was just comparing two html files, so I named the parameters with `a_` and `b_` prefixes:
def compare_links(a_html, b_html): a_soup = bs4.BeautifulSoup(a_html, "lxml") b_soup = bs4.BeautifulSoup(b_html, "lxml") a_links = set(a_soup.find_all("a")) b_links = set(b_soup.find_all("a")) new_links = a_links - b_links return new_links
Simple enough, right?
I was happy with this for a while, until I made a change in the *description* of a link:
--- a/content/pages/links.md +++ b/content/pages/links.md - [First description](http://example.com/website) + [Second description](http://example.com/website)
This caused my script to tweet about `http://example.com/website` twice!
I thought: "OK, the fix is simple: instead of intersecting the two lists (old and new), build a new list containing all the links which URLs are not in the old list of links"
So this is the patch I made:
- new_links = a_links - b_links + old_hrefs = [x.attrs["href"] for x in a_links] + + new_links = [x for x in a_links if x.attrs["href"] not in old_hrefs]
Found the bug yet?
Let me help you: instead of trying to fix the bug directly, let us think about the variable names for a moment.
The name with `a_` and `b_` prefixes suggest that the variables can be swapped, and it's not obvious that `a_` refers to the "local" links, i.e the new ones that we just added to the `links.md` files, and `b_` refers to the "remote" links, i.e the one we just fetched from `dmerej.info`.
Let's re-write the function to use meaningful names instead:
def compare_links(local_html, remote_html): local_soup = bs4.BeautifulSoup(local_html, "lxml") remote_soup = bs4.BeautifulSoup(remote_html, "lxml") local_links = set(local_soup.find_all("a")) remote_links = set(remote_soup.find_all("a")) new_links = local_links - remote_links return new_links
Then when we want to re-do the patch, we naturally write:
- new_links = a_links - b_links + old_hrefs = [x.attrs["href"] for x in old_ref] + + new_links = [x for x in new_links if x.attrs["href"] not in old_hrefs] + retur new_links
I hope that by now you found the bug. Instead of using the `a_links` list to create the list of new links, we should have used the `b_links` list, but the mistake was not obvious to spot because of the bad naming.
This rule comes from an old article[2] from our friend Uncle Bob.
2: http://programmer.97things.oreilly.com/wiki/index.php/The_Boy_Scout_Rule
Paraphrasing a little bit, it says:
Always push a cleaner code than the one you pulled.
Have I followed this rule, I would have first made a commit to fix the variable names *before* changing the code to introduce a new feature, and then the bug would not have happened.
Instead I was lazy, so I introduced a bug, and the commit that fixes it is not obvious to read.
(At this point it does not really matter because I'm the only one writing this code, but still ...)
That's all for today, see you soon!
----