Thrilling Afternoon: A Broken Netbox Upgrade

It was the afternoon of Oct 27, 2020, one day after netbox-docker released version 0.26.0. I was working remotely from home. The release notes listed no breaking changes or compatibility issues. I replaced the Docker image tag in docker-compose.yml and did the upgrade: I ran docker-compose pull and docker-compose up -d.

Well, Netbox was unable to boot. The logs reported 'NoneType' object has no attribute 'lower'. I double-checked the configuration and the release documentation; everything was fine. It seemed there were bugs in the new netbox-docker release. This was completely unexpected: it was a minor version release, and the upgrade should have been seamless.

Reverting the Docker image did not recover from the error. So there were 3 options:

  1. Roll back to the old version and restore from backup.
  2. Wait for the netbox-docker team to fix it.
  3. Fix it myself.

For option 1, Netbox was not a crucial service, so it was OK for it to be out of service for several hours. That gave me time to try a fix before restoring from backup. For option 2, I did not want to rely on the netbox-docker team. It had been only one day since the broken version was released, and I had no idea whether anyone else with the same bad luck had reported it to the team. So I chose option 3.

Problem 1

The error 'NoneType' object has no attribute 'lower' was easy to locate in the source code, just by searching for the keyword lower. The buggy line lowercased the value of an environment variable, but that variable might not be defined. In that case the value was None, and calling the lower method on None raised the error.
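The bug and its guard can be shown with a minimal sketch. The variable name EXAMPLE_FLAG is hypothetical; it is not the one used in netbox-docker, and the actual fix in the pull request may differ.

```python
import os

# EXAMPLE_FLAG is a hypothetical variable name, used only for illustration.
VAR = "EXAMPLE_FLAG"


def read_flag_buggy(env=os.environ):
    # env.get() returns None when the variable is not defined,
    # so calling .lower() on the result raises AttributeError.
    return env.get(VAR).lower() == "true"


def read_flag_fixed(env=os.environ):
    # Supplying an empty-string default keeps .lower() safe
    # even when the variable is not defined.
    return env.get(VAR, "").lower() == "true"
```

With the variable unset, read_flag_buggy fails with the same message seen in the logs, while read_flag_fixed simply treats the flag as off.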

I created a pull request to fix it.

Problem 2

After manually fixing problem 1, another error showed up: No such file or directory: '/tmp/metrics/counter_26.db'

So I searched for counter_26.db and found nothing. Searching for /tmp/metrics instead, I found that gunicorn_config.py contained it. I had no idea what this file was for, so I just set raw_env to an empty string and restarted Netbox. The error disappeared.

Luckily, someone else created a pull request for it: just remove raw_env.

Problem 3

Netbox was running this time and the UI opened in the browser. But LDAP authentication was broken: it reported missing LDAP configuration. I double-checked the configuration. It was in the right place with the correct values.

Netbox is written in Python, so I could change the Python files inside the Docker container and add logging, which made it quick to try a fix and verify it. About 2 hours later, I found where the problem was and created another pull request.

Lesson learned: do the upgrade in a test environment first.

Why Do Software and Systems Need Timely Upgrades?

I have a question: do the operating system and software need to be upgraded to the latest stable version in a timely manner? The short answer is yes. But in reality, with limited time and resources, developers do not want to touch the production environment if it is running well. They are afraid of making changes. Any component upgrade may cause a failure.

Yet tech debt keeps growing even when nothing changes in the production environment, because things are changing in the outside world.

Why Upgrade the Operating System in Time?

The most important reason to upgrade the operating system in time is to fix security bugs, followed by new hardware support, new features for development, and performance enhancements.

In 2005/2006, I was in college and my laptop was running Windows XP. To keep the OS secure, I installed Windows XP from the official ISO, then installed anti-virus software. Most importantly, I turned on Windows Update and kept the system up to date.

Most of my classmates turned off Windows Update and left their OS with a lot of security vulnerabilities. Once, one of my roommates used hacking tools to scan for vulnerable computers on the local network. He found a lot. He tried scanning my IP and found no way to hack my system.

Nowadays, operating system security vulnerabilities spread much more quickly over the internet. The time window for safely applying patches is much narrower. It is too late once data has been lost to a hack.

Why Upgrade Software Components in Time?

A software system uses many third-party libraries. Upgrading third-party libraries to the latest stable version has benefits similar to those of upgrading the operating system: bug fixes, new features, and performance enhancements. Another important benefit is development support.

Most third-party libraries are open source. Active open source projects are under rapid development. They provide limited support for free. Most of the time, premium support is not available. The latest major version is well supported for about one year. Worse, old versions are forgotten on the internet: documentation, community, even StackOverflow. So leaving third-party libraries outdated grows the tech debt of an active software project.

Practice

Good Test Coverage

The main reason people are afraid of software changes is a lack of testing, or poor test coverage. Without good test coverage, no one has the confidence to upgrade the OS or libraries without problems. Why testing is missing is another story, one of profit and management.

gorm is the ORM library used in the trading management system. The V2 release brings many nice features and performance enhancements, but also breaking changes. The upgrade migration was smooth because not only was almost every database operation covered by tests, some specific behaviors (like hooks) were covered as well. After fixing the compile errors, the failures reported by the unit tests pointed to the run-time behavior changes that also needed to be migrated.

Subscribe to Release Notes

Subscribe to the release notes of the OS and third-party libraries, analyze compatibility, and plan upgrades early.

There are two main subscription methods: email and RSS. GitHub, GitLab and artifacthub.io support email notifications. Some project websites support RSS and can be subscribed to with an RSS reader.

Docker Hub has no notification support for a Docker image repository. Docker Hub RSS can be used to generate an RSS feed.
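For projects hosted on GitHub, a small script can stand in for an RSS reader. The following is a minimal sketch, assuming the project publishes a standard Atom feed of its releases (GitHub serves one at the repository's /releases.atom path). The function names are my own, and only the Python standard library is used.

```python
import urllib.request
import xml.etree.ElementTree as ET

# XML namespace used by Atom feeds.
ATOM_NS = {"atom": "http://www.w3.org/2005/Atom"}


def parse_releases(atom_xml):
    """Return a list of (title, updated) pairs from an Atom feed document."""
    root = ET.fromstring(atom_xml)
    releases = []
    for entry in root.findall("atom:entry", ATOM_NS):
        title = entry.findtext("atom:title", default="", namespaces=ATOM_NS)
        updated = entry.findtext("atom:updated", default="", namespaces=ATOM_NS)
        releases.append((title.strip(), updated.strip()))
    return releases


def fetch_releases(feed_url, timeout=10):
    """Download an Atom feed and parse its entries."""
    with urllib.request.urlopen(feed_url, timeout=timeout) as resp:
        return parse_releases(resp.read())
```

For example, calling fetch_releases("https://github.com/netbox-community/netbox-docker/releases.atom") from a scheduled job and comparing the first entry with the version in production would show when a new release is waiting, assuming the feed is reachable from that machine.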

Test Environment

If there are breaking changes in a new version, do the upgrade in a test environment first to make sure there are no issues.

The netbox-docker development team has limited resources for fully testing new versions. They may change the configuration layout and dynamic loading. Such changes may not be fully tested in every area, such as LDAP integration. So do the upgrade in a test environment first.

Turn To-Do into Done

Once an upgrade plan is made, finish the upgrade in time.

Mikrotik WiFi Trick: Firmware Upgrade

The access point controller (APC), a Mikrotik hEX router, had been upgraded to the latest stable version: 6.48.4. All other APs were at 6.47.10. How could I upgrade all the APs easily?

CAPsMAN can set an upgrade policy to let APs upgrade to the same version as the APC. After I set the policy to suggest-same-version and waited for a while, none of the APs had upgraded.

I noticed the description of the package-path field: "If empty string is set, CAPsMAN can use built-in RouterOS packages, note that in this case only CAPs with the same architecture as CAPsMAN will be upgraded." Same architecture means same CPU architecture. The hEX router is MMIPS, the mAP Lite is MIPSBE, and the wAP ac is ARM. So I needed to download the MIPSBE and ARM packages and put them on the router for the APs to download. It is easy to find each device's packages on the download page.

The router, hEX, has 16MB of storage, with only several MB left after system usage. Each upgrade package is about 12MB, so there was no way to upload the packages to internal storage. I formatted a USB drive as FAT32 and plugged it into the router. The router auto-mounted the USB storage as the CF folder. After I uploaded the two packages under CF/upgrade and set package-path to CF/upgrade, all APs rebooted automatically into the latest version within a short time. It was fast and amazing.
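The reasoning above boils down to two small checks: which AP architectures differ from the controller's, and whether their packages fit in the free storage. Here is a minimal sketch of that arithmetic; the function names are my own, and the free-space figure of 4 MB is a hypothetical stand-in for "several MB".

```python
def archs_needing_packages(controller_arch, ap_archs):
    """Return the distinct AP architectures that differ from the controller's.

    These are the architectures the controller's built-in packages cannot
    serve, so a separate package has to be supplied for each of them.
    """
    controller = controller_arch.lower()
    return sorted({arch.lower() for arch in ap_archs.values()} - {controller})


def fits_in_storage(package_count, package_mb, free_mb):
    """Check whether the packages fit into the available storage."""
    return package_count * package_mb <= free_mb


# Devices from this post: AP model -> CPU architecture.
aps = {"mAP Lite": "MIPSBE", "wAP ac": "ARM"}
needed = archs_needing_packages("MMIPS", aps)

# Each package is about 12 MB; 4 MB of free space is an assumed value.
fits_internally = fits_in_storage(len(needed), 12, 4)
```

Two architectures need their own package and the packages do not fit internally, which is why the USB storage was needed.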

Finally Flashed the Phone Successfully

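This morning, on some impulse, I decided to flash my i7500 once more so that it would support VPN.

The guru Drakaz has already built a 1.6.3 ROM for the i7500, and countless guinea pigs in China have tested it and come through unharmed, so I only needed to follow the path they had walked.

However, flashing Android is much more complicated than flashing Windows Mobile. To begin with, there is more to download.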

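The first item is Recovery, which acts as the flashing software. Unlike on Windows Mobile, it does not run on Windows but on the phone, and it is also independent of Android. So the first step is to flash Recovery onto the phone. The flashing tool still has to come from Google's Android development kit; fortunately, other users have already extracted the tool, so there was no need to download the rather large kit. But then another problem came up: when the i7500 was connected to the PC, the PC could not recognize it, so I had to download the ADP driver from a forum. Installing it was easy enough. Yet that was not the biggest problem.

Flashing Recovery requires booting the phone into fastboot mode, and this mode lasts only 15 seconds. Once the phone was in fastboot mode, the PC again failed to recognize it, so the driver had to be installed again, and within those 15 seconds. On Windows XP, I tried and failed again and again. Today I tried on Windows 7: by the time the driver finished installing, the 15 seconds had passed. I rebooted into fastboot again and, unexpectedly, Win7 now recognized the phone, so I hurried to flash it with the flash tool. To my surprise, the flashing was so quick that it reported completed before I had time to react.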

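Good. I booted the phone into recovery, and I have to admit this software is powerful. Following the tutorial, I backed up the system. I also realized the ROM had not yet been put on the internal SD card. Fortunately, recovery can mount the SD card to a PC, so without rebooting I mounted it on my laptop and copied the ROM over just like using a USB drive.

But one piece of software was still missing: JC6 PDA. According to its description, it is used to unpack the ROM, but Drakaz did not provide a download link. I found an address on a Chinese forum, only to discover that the site was under maintenance, which was really frustrating. So I searched the web for the software's tar file name and found that a file hosting site had it, exactly the one I was looking for. I downloaded it right away and put it on the SD card. Everything was ready...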

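In Recovery I flashed the ROM and then restored the Google Apps. I was on edge the whole time, worried that something would go wrong. After it finally reported complete, I rebooted. The most nerve-racking moment had come: I hoped it would boot into the system.

Bingo, it booted, and the speed was not bad. I quickly explored it: not much different from version 1.5, though Market has improved. I tested VPN for an hour and could not get it to connect no matter what. I then tried to get root access through recovery, but it kept reporting an error.

In the end I could not tinker any further, so I left it at that. The good news is that the flashing succeeded; the pity is that root access and VPN do not work properly.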