xmpp: Session establishment can get stuck with certain malfunctioning servers despite context with timeout or deadline #124
Labels
No Label
bug
CI
documentation
duplicate
enhancement
good first issue
help wanted
i18n
invalid
needs-investigation
ops
proposal
proposal-accepted
proposal-declined
question
refactor
security
testing
upstream
wontfix
Kind: Breaking
Kind: Bug
Kind: Documentation
Kind: Enhancement
Kind: Feature
Kind: Maintenance
Kind: Question
Kind: Security
Kind: Testing
Priority: Critical
Priority: High
Priority: Low
Priority: Medium
Reviewed: Confirmed
Reviewed: Duplicate
Reviewed: Invalid
Status: Blocked
Status: Completed
Status: Help wanted
Status: In progress
Status: Needs feedback
Status: Stale
No Milestone
No Assignees
1 Participants
Notifications
Due Date
No due date set.
Dependencies
No dependencies set.
Reference: mellium/xmpp#124
Loading…
Reference in New Issue
No description provided.
Delete Branch "%!s(<nil>)"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
Minimal reproducer:
The domains used are configured as follows:
When running the above example, the first two tests terminate with an error (i/o error and connection refused respectively) and the third one gets stuck, despite the context timeout.
This was discovered because mellium does an implicit fallback to port 80 if port 5222 is not used and the domain in question had a strangely configured httpd which probably waited for the first newline or whatever. It did not reply at all to the stream header.
NB: If you ever get
Connection refused
in the third test, let me know. Then my makeshift "do not send any reply" server broke :).So this appears to be getting stuck reading the reply stream header (which of course is never read if this is a black hole that returns nothing). This is expected behavior (more or less) because it's up to the user to set a deadline on the connection. We could also make it respect the ctx possibly, but I'm not sure if that makes sense or if it would be overriding the users intent if they already set a deadline on the conn. It would also require spawning a goroutine before any attempt at reading the network, which doesn't feel great. I'm really not sure what the expected Go-like thing to do is here. I'll think about it and make a decision, but in the mean time you can use
conn.SetDeadline
to work around this (and as a general rule of thumb you should always set deadlines on everything, I should add this to the examples).Hi and thanks for the response.
Using
conn.SetDeadline
during login seems very appropriate! I wasn’t aware of this golang way of doing things.Nevertheless, if a method like NewSession takes a context, I would expect it to adhere to that deadline on its own, without me having to set a deadline on the connection. If the conn is the only thing involving a deadline, then maybe not taking a context there is the way to go?
Maybe my expectation is not go-idiomatic in that regard though.
I tend to agree, it feels odd that it can lock up even after the context is expired.
The context is also used to break out of the negotiate loop, so we'd need a context either way.
This is something I hadn't thought about before and I'm really not sure what the idiomatic way to do this is. I generally think the connection deadline should be controlled by the user, not my library, but context of course is something I have more control over. It's sort of Go's idioms clashing with POSIX or C idioms (or wherever this socket behavior is defined, it's not strictly a Go thing).
I talked this out on chat earlier and came to a conclusion (though I am of course still open to other opinions).
I was forgetting that some of the convenience functions that wrap
NewSession
such asDialClientSession
create the conn themselves and the user has no chance to set a deadline before passing one in. That makes this an actual bug because there's no way we want these to block forever with no way to cancel them.If we make the context cancellation respected in those methods, it makes sense not to cancel it (this part is still debatable) in the methods that take a conn, partially because these actually take an
io.ReadWriter
so we'd have to do a type check which always feels jank and partially because this way these methods give the user maximum flexibility to set deadlines differently if they want.So the current solution would be to add a goroutine that cancels connection reads/writes to:
DialSession
,DialClientSession
, andDialServerSession
and to leave the other methods alone (and maybe just add a reminder to the documentation that if you're using a conn you should always set a deadline and come up with a strategy for resetting it as necessary).
That sounds very reasonable!
@horazont any chance you'd be interested in implementing this? It should be pretty straight forward (one goroutine that gets copied into those three functions just after the conn is created and an update to the changelog). No pressure of course, I can get to it later this weekend if not.
And I guess we'd want to have a pre-declared time that's used for the actual cancellation, something like
aLongTimeAgo
fromnet
:I’d rather not. My golang fu is not very solid yet and I don’t use either of those functions (I need to be on a lower level in my code), so I’d not be confident in that.
(Not to mention the huge pile of other stuff which needs to be done ;))
Not to worry, I'll get to it later. Thanks for pointing this DOS out!
I just pushed up a branch with a fix for this (and for a related but that was discovered and fixed incidentally). It still feels wrong to me somehow, but I can't quite put my finger on why. Going to merge it for now (assuming tests pass) and we can always re-visit later if other problems arise.