DNS Debugging

One problem with DNS debugging is that useful information may not have been collected before the problem showed itself; in particular, the old SOA record serial number, and whether it has recently changed, may not be available. Some sites encode the year, month, day, and two more digits into the serial number,

    $ host -t SOA example.org
    example.org has SOA record ns.icann.org. noc.dns.icann.org. 2024081402 7200 3600 1209600 3600

                                                                ^^^^^^^^^^

which does let you know when the last change was made (2024081402 translates to August 14th, 2024, second change of the day), assuming they are still using the YYYYMMDDNN form and not simply incrementing the serial number by one. Some sites may use a hybrid approach where manual edits use YYYYMMDDNN but automated DNS updates increment the serial number, in which case the YYYYMMDD... portion records the last manual change, or when they switched over to automatic increments. Most sites are probably not doing manual zone file edits these days, most of the time.
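A minimal sketch of decoding such a serial number, in Python. The function name parse_dated_serial is made up for illustration, and the check is only a heuristic: a plain counter that happens to look like a valid date would be misread, and nothing here can tell whether the site still follows the YYYYMMDDNN convention.

```python
from datetime import date

def parse_dated_serial(serial):
    """Return (date, revision) if serial looks like YYYYMMDDNN, else None."""
    text = str(serial)
    if len(text) != 10 or not text.isdigit():
        return None
    try:
        day = date(int(text[0:4]), int(text[4:6]), int(text[6:8]))
    except ValueError:
        # Month 13, day 32 and so forth: probably a plain counter instead.
        return None
    return day, int(text[8:10])

print(parse_dated_serial(2024081402))  # the serial from the SOA record above
print(parse_dated_serial(42))          # a simple incrementing serial
```

The first call yields August 14th, 2024 with revision 2; the second returns None, as 42 is not in the dated form.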

Why is knowing whether the serial number has changed important? If you made a DNS change via some interface or even the zone file directly, and the serial number reported by a DNS query does not change, then the problem is likely in the zone file or whatever interface has been put in front of that. If the serial number has changed, and your DNS change isn't showing up, then it is more likely (but not 100% certain) that the problem is with caching or a client, in which case waiting for some amount of time might help clear any cached values.
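Checking whether the serial number has moved forward is slightly more subtle than a plain greater-than test, because serial numbers are 32-bit values that are compared with wrap-around arithmetic (RFC 1982). The following is a rough Python sketch of that comparison; the function name serial_is_newer is an assumption of this example, and the serial numbers would need to have been recorded before and after the change.

```python
def serial_is_newer(new, old):
    """True if `new` is ahead of `old` under RFC 1982 arithmetic (32-bit)."""
    diff = (new - old) % 2**32
    # A difference of exactly 2**31 is undefined by RFC 1982; treated
    # here as "not newer".
    return 0 < diff < 2**31

print(serial_is_newer(2024081402, 2024081401))  # incremented
print(serial_is_newer(2024081401, 2024081401))  # unchanged
print(serial_is_newer(5, 4294967290))           # wrapped past 2**32
```

An unchanged serial points back at the zone file or the interface in front of it; a newer serial points more towards caching or the client.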

DNS in particular caches negative results, so if you look up some new host, foo.example.org, and then request that foo.example.org be created, the new foo.example.org will not appear (at least not via the resolver that cached your original query) until the negative result from that query expires. If the SOA record serial number has been incremented (it may however wrap around to a lower value in a weird edge case I've never seen in practice) then it is likely that you are running into the negative cache, or that the new zone file hasn't gotten out to all the DNS servers yet, or something like that.
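How long a negative result may linger can be estimated from the SOA record: under RFC 2308 the negative cache TTL is the smaller of the SOA record's own TTL and its last field (MINIMUM). Below is a small Python sketch; the helper names are invented for this example, and the SOA TTL of 86400 is a made-up value, since the host output does not show it (dig does). A resolver may also apply its own cap, so treat the result as an upper bound.

```python
def parse_soa_rdata(rdata):
    """Split SOA record data into its seven fields."""
    mname, rname, serial, refresh, retry, expire, minimum = rdata.split()
    return {
        "mname": mname,
        "rname": rname,
        "serial": int(serial),
        "refresh": int(refresh),
        "retry": int(retry),
        "expire": int(expire),
        "minimum": int(minimum),
    }

def negative_cache_ttl(soa_ttl, soa_minimum):
    """Seconds a negative answer may be cached, per RFC 2308."""
    return min(soa_ttl, soa_minimum)

soa = parse_soa_rdata(
    "ns.icann.org. noc.dns.icann.org. 2024081402 7200 3600 1209600 3600"
)
# 86400 is an assumed TTL for the SOA record itself, not taken from real data.
print(negative_cache_ttl(86400, soa["minimum"]))
```

With these numbers a mistimed lookup of a new host could be answered from the negative cache for up to an hour.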

Being able to do a full zone transfer is also handy for debugging, but most sites will not allow that. The full zone would let you see whether your new record is in the zone file or not, at least according to the DNS server you did the zone transfer from. Authoritative DNS servers may have different zone files at different times; ideally they should all end up with the same zone file, but there can be delays or other issues to debug that prevent that.

P.S. if you have impatient customers then it might be a good idea to clear any caches after adding a new record so that any negative cache entries are thrown out on all your DNS servers (but probably not their DNS server). This is less efficient than just waiting for the old cache records to expire, but how long will that take, and do the customers have that much patience?

P.P.S. various modern sites set the DNS time-to-live to crazy low values so that things can be changed more rapidly (probably "because cloud"); one downside here is more frequent DNS queries, and a risk that if someone can knock your DNS servers (or any upstream DNS servers) offline for long enough, your hosts will vanish from the internet. I generally favor very long DNS TTLs, as this results in less DNS traffic, and with a long TTL the DNS servers can be offline for longer before the cached values start expiring from other DNS servers. The current DNS provider I'm using has a maximum TTL of one day, which I consider to be a very low value, not long enough to ride through a weekend outage. Maybe they expect you to come in on a Saturday?