Cas (@PuppyGames) (of PuppyGames fame – Titan Attacks, Revenge of the Titans, etc) is working on an awesome new MMO RTS … thing.
He had a weird networking problem, and was looking for suggestions on possible causes. I used to do a lot of MMO dev, so we ran through the “typical” problems. This isn’t exhaustive, but as a quick-n-dirty checklist, I figured it could help some other indie gamedevs doing multiplayer/network code.
Scenario
- Java, Windows 7
- DataInputStream.readFully() from a socket… a local socket at that… taking 4.5 seconds to read the first bytes
- and I’ve already read a few bits and bobs out of it
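To pin a number on a stall like this, it can help to time the read in isolation. Here is a minimal sketch, assuming a plain (non-SSL) loopback socket. The class and method names are made up for illustration and are not from Cas’s code.

```java
import java.io.DataInputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;

// Hypothetical helper: times DataInputStream.readFully() over a loopback socket.
final class LoopbackReadTimer {

    /** Sends payloadSize bytes over loopback and returns how long readFully() took, in ms. */
    static long timeReadFullyMillis(int payloadSize) throws IOException, InterruptedException {
        InetAddress loopback = InetAddress.getLoopbackAddress();
        // Port 0 asks the OS for any free port.
        try (ServerSocket server = new ServerSocket(0, 1, loopback)) {
            Thread writer = new Thread(() -> {
                try (Socket peer = server.accept();
                     OutputStream out = peer.getOutputStream()) {
                    out.write(new byte[payloadSize]);
                    out.flush();
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            });
            writer.start();

            try (Socket client = new Socket(loopback, server.getLocalPort());
                 DataInputStream in = new DataInputStream(client.getInputStream())) {
                byte[] buffer = new byte[payloadSize];
                long start = System.nanoTime();
                in.readFully(buffer); // blocks until the whole payload has arrived
                long elapsedNanos = System.nanoTime() - start;
                writer.join();
                return elapsedNanos / 1_000_000;
            }
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("readFully took " + timeReadFullyMillis(1024) + " ms");
    }
}
```

On a healthy machine this should report a tiny number. If the bare version is fast and the real code is not, the delay is probably coming from something layered on top (SSL, name lookups, etc) rather than from the socket read itself.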
Initial thoughts
First thought: in networking, everything has a realm of durations. The speed of light means that a network connection from New York to London takes tens of milliseconds (unavoidably: it’s roughly 5,600 km, which is about 19 ms one way even in a vacuum), so over a network everything is measured in ms. On a local machine, the OS internally works in microseconds (almost too small to measure). Since Cas’s problem is measured in SECONDS – but is on localhost – it’s probably something external to the OS, external to the networking itself.
Gut feel would be something like Nagle’s algorithm, although I’m sure everyone’s already configured that appropriately ;).
That, or a routing table that’s changing dynamically – e.g. the first packet is triggering a “let me start the dialup connection for you”, causing everything to pause. (Note that “4 seconds” is in the realm of time it takes for modems to connect; it’s outside the realm of Ethernet delays, as already noted.)
My general advice for “bizarre server-networking bugs”: start doing the things you know are “wrong” from a performance view, and prove that each makes it worse. If one does not, then de facto that one is somehow not configured as intended.
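As one instance of that “prove it’s configured the way you think” check, here is a minimal sketch for Nagle. In Java, Nagle’s algorithm is controlled through `Socket.setTcpNoDelay` (`true` disables Nagle, `false` enables it). The class and method names below are hypothetical. The sketch only confirms that the option sticks on a loopback socket; it does not measure latency.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;

// Hypothetical helper: sets TCP_NODELAY and reports what the socket says afterwards.
final class NagleCheck {

    /** Applies the requested TCP_NODELAY value and returns the value read back from the socket. */
    static boolean applyAndReadBackNoDelay(boolean noDelay) throws IOException {
        InetAddress loopback = InetAddress.getLoopbackAddress();
        // The connection completes via the listen backlog, so no accept() is needed here.
        try (ServerSocket server = new ServerSocket(0, 1, loopback);
             Socket client = new Socket(loopback, server.getLocalPort())) {
            client.setTcpNoDelay(noDelay); // true = Nagle disabled, false = Nagle enabled
            return client.getTcpNoDelay();
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("asked for noDelay=true,  socket reports " + applyAndReadBackNoDelay(true));
        System.out.println("asked for noDelay=false, socket reports " + applyAndReadBackNoDelay(false));
    }
}
```

If the value read back does not match what you asked for, that setting is a suspect. If it does match, the next step is the one described above: flip it the “wrong” way in the real code and confirm things get measurably worse.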
Common causes / things to check
1. Contention: what’s the CPU + C libraries doing in background? If the system has any significant (but small) load, you could be contended on I/O; the CPU might not be interrupting to shunt data from hardware up to software (kernel), up to software (OS user), up to the JVM
Classic example: MySQL DB running on a system can block I/O even with small CPU load
2. file-level locking in OS FS. Check using lsof which files your JVM is accessing, and what other processes in the machine may have same files open.
(NB: there’s a bunch of tools like “lsof” which let you see which files are in use by which processes in a unix/linux system. As a network programmer, you should learn them all – they can save a lot of time in development. As a network programmer, you need to be a competent SysAdmin!)
Classic problem: something unexpected has an outstanding read (or a pre-emptive write lock) which is causing silliness. I remember several times having a text editor (or similar) open that was inadvertently slowing access to local files (doh).
3. Can you snoop the traffic? (i.e. it’s a socket, not a pipe)
Classic problem: some unrelated service is bombarding the loopback with crap, e.g. Windows networking (Samba on Linux) going nuts
Try running Wireshark, filtered to show only 127.* addresses, and see what’s going through at the same time
Also: this is a good way to check where the delay is happening. Is it at point of send, or point of receive? … both?
There’s “net traffic” and there’s “net traffic”. Wireshark often shows traffic that OS normally filters out from monitoring apps…
4. Check your routing table? AND: check it’s not changing before / after the attempted read?
5. Try enabling Nagle, see if it has any effect?
My point is: use this as a check: it ought to make things worse. If not … perhaps the disabling wasn’t working?
6. Have you done ANY traffic shaping (or firewalling) on this machine at any time in the past?
Linux: in particular, check the iptables output. Might be an old iptables rule still stuck in there – or even a firewall rule.
Linux + Windows: disable all firewalls, completely.
7. Similarly: do you have any Anti-Virus software?
Disconnect your test machines from the internet, and remove all AV.
AV software can even corrupt your files when it mistakenly thinks that the binary files used by the compiler/linker are “suspicious” (IIRC, that happened to us with early PlayStation3 devkits).
8. On a related note: security / IPS tools installed? They will often insert artificial delays silently.
CHECK YOUR SYSTEM LOG FILES! …whatever is causing the delay is quite possibly reporting at least a “warning” of some kind.
9. (in this case, Cas’s socket is using SSL): Perhaps something is looking up a certificate remotely over the web?
…checked your certificate chain? (if there’s an unknown / suspicious cert in the chain, your OS might be trying to check / resolve it before it allows the connection)
10. (in this case, Cas is using custom SSL code in Java to “hack” it): Get a real SSL cert from somewhere, see if it behaves any differently
(I think it would be very reasonable for your OS to detect any odd SSL stuff and delay it, as part of anti-malware / virus protection!)
11. “Is it cos I is custom?” – Set up an off-the-shelf webserver on the same machine and check if YOUR READING CODE can read from that SSL localhost fast?
(it often helps to get an answer to “is it the writer, the reader, both … is it the hacked SSL, or all SSL?” etc)
12. (getting specific now, as the general ideas didn’t help): what happens if you bind to the machine’s local IP address instead of 127.*?
13. How about reverse-DNS? (again, 4 seconds is in the realm of a failed DNS lookup)
Might be a reverse DNS issue … e.g. something in OS (kernel or virus protection), or in JVM SSL library, that’s trying to reverse-lookup the 127 address in order to resolve some aspect of SSL.
I’ve had to put fake entries into the hosts file in the past to speed up this stuff; some libs just need a quick answer.
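A quick way to test the reverse-DNS theory from inside the JVM is to time a reverse lookup directly. This is a minimal sketch with a hypothetical class name; it uses the standard `InetAddress.getCanonicalHostName()`, which attempts a reverse lookup and falls back to the textual IP address if nothing resolves. The addresses in `main` are just examples of loopback addresses that may not have hosts entries.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

// Hypothetical helper: measures how long a reverse lookup takes for an IP literal.
final class ReverseLookupTimer {

    /** Returns the time, in ms, that a reverse lookup of the given IP literal takes. */
    static long timeReverseLookupMillis(String ipLiteral) throws UnknownHostException {
        // For a literal IP, getByName only validates the format; no lookup happens yet.
        InetAddress address = InetAddress.getByName(ipLiteral);
        long start = System.nanoTime();
        String name = address.getCanonicalHostName(); // this is the call that does the reverse lookup
        long elapsedMillis = (System.nanoTime() - start) / 1_000_000;
        System.out.println(ipLiteral + " -> " + name + " in " + elapsedMillis + " ms");
        return elapsedMillis;
    }

    public static void main(String[] args) throws UnknownHostException {
        for (String ip : new String[] {"127.0.0.1", "127.0.0.2", "127.0.0.3"}) {
            timeReverseLookupMillis(ip);
        }
    }
}
```

If one address comes back in microseconds and another takes seconds, the slow one is probably missing from the hosts file and is falling through to a real (and failing) DNS query.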
Result
Turns out we were able to stop here…
Cas: “Well, whaddya know. It was a hosts lookup after all – because I’m using 127.0.0.2 and 127.0.0.3 it was doing a hostname lookup. I added those to hosts and all is solved.”
Me: “(dance)” (yeah. It was Skype. The dance smiley comes in handy)