Trouble with decoding byte 0xFA

What kind of cryptic title is this? I have written a small tool to process my ADIF log, which I collect daily from LoTW. This tool uses a few standard Python libraries. One of them has been giving me an error message for the last two days, causing my tool to fail. The error message is initially as cryptic as the title of this article, namely: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xfa in position 2585339: invalid start byte. That position does indeed contain byte 0xFA, which in ISO-8859-1 (Latin-1) is the character ú. In UTF-8 that same character is written as the two bytes 0xC3 0xBA, and a lone 0xFA is not valid, hence the error.
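
The error is easy to reproduce in isolation. The sketch below assumes the text was written in a single-byte encoding such as ISO-8859-1 (Latin-1); the sample bytes are made up for illustration and are not taken from the actual log.

```python
# Sample bytes as they would appear in a Latin-1 encoded file (assumed encoding).
raw = "Respúblika Komi".encode("latin-1")

# In Latin-1, ú is the single byte 0xFA.
assert raw[4] == 0xFA

try:
    raw.decode("utf-8")
    error_message = None
except UnicodeDecodeError as exc:
    # 0xFA can never start a valid UTF-8 sequence, so decoding fails here.
    error_message = str(exc)

print(error_message)

# Decoding with the encoding the bytes were written in works fine.
print(raw.decode("latin-1"))

# In UTF-8 the same character takes two bytes.
print("ú".encode("utf-8"))
```

The only difference from the real error is the position: 4 in this short sample instead of 2585339 in the full log.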

At the moment I am not the right person to start debugging or patching Python's codec library, so I have to look for a workaround. The workaround I found uses the iconv program available on Linux and Unix. This program can convert between different character encodings via the command line or via an API.

I added my workaround to the bash script that takes care of getting the latest LoTW ADIF log, calling my tool and uploading the results to this website. After retrieving the ADIF log file I call the following in my script, where the -c option tells iconv to discard characters that cannot be converted:

for adif in *.adi
do
  # -c drops bytes that are not valid UTF-8; quoting keeps odd file names intact
  iconv -f utf-8 -t utf-8 -c "$adif" -o "$adif-new"
  rm -f "$adif"
  mv "$adif-new" "$adif"
done
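
The same clean-up could also be done on the Python side. A minimal sketch, assuming the tool decodes the log itself: the errors argument of bytes.decode tells the decoder what to do with invalid bytes. errors="ignore" drops them, much like iconv -c, while errors="replace" substitutes the replacement character U+FFFD. The function name decode_lenient and the sample data are hypothetical.

```python
def decode_lenient(data: bytes, errors: str = "ignore") -> str:
    """Decode UTF-8, handling invalid bytes instead of raising."""
    return data.decode("utf-8", errors=errors)


# Hypothetical sample line containing the stray Latin-1 byte 0xFA.
sample = b"<STATE:2>KO // Resp\xfablika Komi (Komi Republic)"

dropped = decode_lenient(sample)               # invalid byte removed
replaced = decode_lenient(sample, "replace")   # invalid byte becomes U+FFFD

print(dropped)
print(replaced)
```

When reading straight from a file, open(path, encoding="utf-8", errors="ignore") has the same effect.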

Now my log files have been modified and my tool doesn't choke on this 'strange' character. But what caused this problem? When I look at the QSO record around position 2585339 I see the following ADIF field:

<STATE:2>KO // Respúblika Komi (Komi Republic)

There, after 'Resp', is the infamous character, the ú, on which the Python codec library chokes. It is part of a comment that LoTW adds after the field and that my tool does not use. In the near future I will modify the tool to remove all comment fields before the data is further processed. Hopefully there will be no more choking when this is done.
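
One way a tool could sidestep the comments altogether, sketched here as an assumption and not as my tool's actual code: ADIF fields carry their own length (<STATE:2> announces two characters of data), so a reader that takes exactly that many bytes after each tag never looks at the trailing comment. The function name parse_fields and the sample record are hypothetical.

```python
import re

# Matches an ADIF data specifier such as <STATE:2> or <QSO_DATE:8:D>.
FIELD = re.compile(rb"<([A-Za-z0-9_]+):(\d+)(?::[A-Za-z])?>")


def parse_fields(record: bytes) -> dict:
    """Return {field name: value} using the declared field lengths."""
    fields = {}
    pos = 0
    while True:
        match = FIELD.search(record, pos)
        if match is None:
            break
        length = int(match.group(2))
        start = match.end()
        value = record[start:start + length]
        # Assumption: the length-delimited data itself is plain ASCII.
        name = match.group(1).decode("ascii").upper()
        fields[name] = value.decode("ascii")
        # Continue after the data; anything up to the next tag is skipped.
        pos = start + length
    return fields


# Made-up record; N0CALL is a placeholder callsign.
record = (
    b"<CALL:6>N0CALL\n"
    b"<STATE:2>KO // Resp\xfablika Komi (Komi Republic)\n"
    b"<eor>\n"
)
print(parse_fields(record))
```

Because the record is handled as bytes and only the declared data is decoded, the 0xFA in the comment is never passed to a decoder.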

Usually I deal with a problem at its source, but I can't get any closer to the source than my log file: as far as I know, a QSO record in LoTW can't be modified. And even then, the comments are the result of the processing at LoTW.


Update 12 January 2021

In hindsight, it is easier and wiser to just delete all comment fields in the ADIF log rather than convert the character encoding of the complete log file. So I swapped the iconv command in the for-loop shown above for a sed solution. The sed solution used is:

sed 's/ \/\/.*//' "$adif" > "$adif-new"

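This also looks somewhat cryptic, but what it does is remove everything from the space before a double slash (//) up to the end of the line, or rather replace it with nothing. It simply deletes the comment after an ADIF field. The result is written to a new file, which is renamed later. One caveat: in a UTF-8 locale, GNU sed's . does not match an invalid byte such as 0xFA, so the match may stop short of the end of the line; running sed with LC_ALL=C makes it treat the input as plain bytes. For the full explanation of the sed syntax there are plenty of examples on the internet, and a useful tool is the GNU sed live editor.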
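
For readers more at home in Python than in sed, the same substitution can be sketched with the re module. This is an illustration of the pattern, not part of my script; it works on bytes, so the stray 0xFA never has to be decoded. The function name strip_comments and the sample data are hypothetical.

```python
import re

# Same pattern as the sed expression: a space, two slashes, then the rest of the line.
COMMENT = re.compile(rb" //.*")


def strip_comments(data: bytes) -> bytes:
    """Remove the ' // comment' tail from every line."""
    # In a bytes pattern, '.' matches any byte except a newline,
    # so each match ends at the end of its line.
    return COMMENT.sub(b"", data)


# Made-up sample with one commented field and one plain field.
sample = b"<STATE:2>KO // Resp\xfablika Komi (Komi Republic)\n<CQZ:2>17\n"

cleaned = strip_comments(sample)
print(cleaned)

# With the comment gone, the remaining bytes decode as UTF-8 without complaint.
print(cleaned.decode("utf-8"))
```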