MTU, huh?


One of the things I constantly regret is not taking my uni courses seriously. Courses such as database theory, algorithms & data structures, AI and so on. There is, however, one course in particular that still haunts me: networks. Yes, I'm struggling now with basic network principles, but I'm working on that (yellow smile goes here).

Why the need for this long introduction? Here it goes.

What's the Story?

Two months ago, a project I was working on got launched, and some major faults/errors affected the software's functionality. I can't get into details due to an NDA, but the project used several technologies such as Elasticsearch, Kafka, Java, and more, and was provided by a third-party vendor.

You might be asking: what kind of faults/errors were found? Most of the issues were refused or dropped connections, such as:

- RemoteTransportException
- Connection refused
- exception caught on transport layer [Netty4TcpChannel]
- NodeDisconnectedException
- ConnectException

It was a tough situation, as there was no clear trace of the root cause. The initial assumption was that it was network-related, so I had to get assistance from my infra friends :)

tcpdump and Wireshark

"Let's do tcpdump!"
Okaaay?! But what the hell is tcpdump?

tcpdump is a command-line program for capturing and analyzing packets transmitted through the network.

We selected several connections based on the system behavior and captured their traffic with tcpdump (be warned: the capture file gets big if you let it run for more than a few minutes). Later, we analyzed the capture using Wireshark.
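To give a feel for what a capture file contains, here is a minimal sketch that reads a classic pcap file (the format tcpdump writes with its `-w` option) and reports how many frames exceed a size threshold. It is not what we actually ran; we used Wireshark. The function names, the 1514-byte default (a 1500-byte MTU plus a 14-byte Ethernet header) and the file name in the usage note are my own assumptions for illustration.

```python
import struct

# Magic numbers of the classic pcap format, mapped to the struct byte-order prefix.
PCAP_MAGICS = {
    b"\xd4\xc3\xb2\xa1": "<",  # little-endian, microsecond timestamps
    b"\xa1\xb2\xc3\xd4": ">",  # big-endian, microsecond timestamps
    b"\x4d\x3c\xb2\xa1": "<",  # little-endian, nanosecond timestamps
    b"\xa1\xb2\x3c\x4d": ">",  # big-endian, nanosecond timestamps
}


def packet_lengths(pcap_bytes):
    """Yield the original on-the-wire length of every record in the capture."""
    endian = PCAP_MAGICS.get(pcap_bytes[:4])
    if endian is None:
        raise ValueError("not a classic pcap file (pcapng is not handled here)")
    offset = 24  # the global file header is 24 bytes long
    while offset + 16 <= len(pcap_bytes):
        # Record header: ts_sec, ts_frac, captured length, original length.
        _sec, _frac, incl_len, orig_len = struct.unpack_from(
            endian + "IIII", pcap_bytes, offset
        )
        yield orig_len
        offset += 16 + incl_len  # skip the header and the captured bytes


def summarize(pcap_bytes, threshold=1514):
    """Count records and how many are larger than `threshold` bytes.

    The default threshold assumes Ethernet framing (1500-byte MTU + 14-byte header).
    """
    lengths = list(packet_lengths(pcap_bytes))
    return {
        "packets": len(lengths),
        "largest": max(lengths, default=0),
        "over_threshold": sum(1 for n in lengths if n > threshold),
    }
```

With a hypothetical capture file named `capture.pcap`, usage would look like `summarize(open("capture.pcap", "rb").read())`.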

Again, what the hell is this one?

Wireshark is a network protocol analyzer. In our case, it showed two types of protocols: TCP and Elasticsearch.

Fun fact: Elasticsearch has its own custom native transport protocol.

The Elasticsearch packets were large compared with the plain TCP packets in the capture: they reached about 65k bytes, while the TCP packets were approximately 1500 bytes. 65k bytes is a large packet. As far as I know, there is no single standard packet size; the limit depends on your network.

At that point, I was introduced to a new concept: "We need to check our MTU". Again, what does that even mean?

MTU

The maximum transmission unit (MTU) is the size of the largest protocol data unit (PDU) that can be communicated in a single network-layer transaction.

Large packets occupy a slow link for more time than smaller packets, causing greater delays to subsequent packets and increasing network delay and delay variation.
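To put a number on that delay, here is a tiny sketch of the arithmetic: the time a packet occupies a link is its size in bits divided by the link speed. The 1 Mbit/s link speed is a made-up figure for illustration, not a measurement from our network, and the function name is my own.

```python
def serialization_delay_ms(packet_bytes, link_bits_per_second):
    """Time (in milliseconds) a packet of the given size occupies the link."""
    return packet_bytes * 8 * 1000 / link_bits_per_second


# Compare a 1500-byte packet with a 65,000-byte one on a hypothetical 1 Mbit/s link.
for size in (1500, 65000):
    print(size, "bytes ->", serialization_delay_ms(size, 1_000_000), "ms")
# prints:
# 1500 bytes -> 12.0 ms
# 65000 bytes -> 520.0 ms
```

Anything queued behind the big packet on that link waits roughly half a second, compared with 12 ms behind the small one.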

Based on that, any packet larger than the MTU has to be split. So, what was the exact issue?

  • Were the large packets split? Yes.
  • Was the splitting done properly? No.
  • How do you know this? Fragmentation errors appeared when we analyzed the tcpdump capture.

We learned that our OS had an MTU of 1500 bytes. Any larger packet needed to be split. And for some reason, the splitting wasn't being done properly.
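To show how much splitting a 65k-byte packet needs with a 1500-byte MTU, here is a minimal sketch that assumes plain IPv4 fragmentation with a 20-byte header and no IP options. I don't know for sure which layer did the splitting in our case, so treat this as an illustration of the arithmetic only; the function name is my own.

```python
IPV4_HEADER_BYTES = 20  # assumption: IPv4 header without options


def ipv4_fragment_sizes(total_packet_bytes, mtu=1500):
    """Return the size (header + data) of each fragment of one IPv4 packet."""
    if total_packet_bytes <= mtu:
        return [total_packet_bytes]  # fits in one piece, no fragmentation
    payload = total_packet_bytes - IPV4_HEADER_BYTES
    # Every fragment except the last must carry a multiple of 8 data bytes.
    per_fragment = (mtu - IPV4_HEADER_BYTES) // 8 * 8
    sizes = []
    while payload > 0:
        chunk = min(per_fragment, payload)
        sizes.append(chunk + IPV4_HEADER_BYTES)  # each fragment gets its own header
        payload -= chunk
    return sizes


fragments = ipv4_fragment_sizes(65000)
print(len(fragments), fragments[0], fragments[-1])  # prints: 44 1500 1360
```

So one 65,000-byte packet becomes 44 fragments, and under this model the receiver needs every one of them to rebuild the original packet.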

We had no option to increase the MTU in our network, so we had to go back to the software system. Bear in mind that until this moment we weren't sure this was the actual cause of the issue. Fortunately, there was an option to decrease the MTU at the application level. A new deployment was rolled out with this change and, surprisingly, our problem was resolved :)

Lessons learned:

1- While solving such issues, you may not find any clues on the internet. Build an assumption and use a trial-and-error approach. Hopefully, one way or another, it will eventually lead you to the solution.

2- Because there is no clear approach, you will be pressured by management/system users to find the solution urgently. You will be asked to provide a deadline!! OK Mr, I know you want a functioning system, but how can I provide a deadline for something I don't even understand yet?? In this case, try to list all the possible approaches/measures you will take, with a time plan. Also keep your stakeholders updated on the current progress as much as possible, to the point that they stop asking about a deadline.

3- You will never know everything about anything. Be humble and take your troubleshooting journey as an opportunity to learn.

This is it :)