xmpp: Session establishment can get stuck with certain malfunctioning servers despite context with timeout or deadline #124

Closed
opened 2021-04-01 14:02:29 +00:00 by horazont · 11 comments
horazont commented 2021-04-01 14:02:29 +00:00 (Migrated from github.com)

Minimal reproducer:

package main

import (
	"context"
	"crypto/tls"
	"fmt"
	"time"
	"mellium.im/xmpp"
	"mellium.im/xmpp/dial"
	"mellium.im/xmpp/jid"
	"mellium.im/sasl"
)


func test(address string) {
	ctx, _ := context.WithTimeout(context.Background(), 2 * time.Second)

	fmt.Printf("%s: testing ...\n", address)

	addr, err := jid.Parse(address)
	if err != nil {
		fmt.Printf("%s: failed to parse: %s\n", address, err)
		return
	}

	dialer := dial.Dialer{
		NoTLS:     true,
		S2S: false,
	}
	dialer.Deadline, _ = ctx.Deadline()
	conn, err := dialer.Dial(ctx, "tcp", addr)
	if err != nil {
		fmt.Printf("%s: failed to dial: %s\n", address, err)
		return
	}
	defer conn.Close()

	tls_config := &tls.Config{}

	features := []xmpp.StreamFeature{
		xmpp.StartTLS(tls_config),
		xmpp.SASL(
			addr.Localpart(),
			"foobar2342",
			sasl.ScramSha256, sasl.ScramSha1, sasl.Plain,
		),
	}

	session, err := xmpp.NewSession(
		ctx,
		addr.Domain(),
		addr,
		conn,
		0,
		xmpp.NewNegotiator(
			xmpp.StreamConfig{
				Lang: "en",
				Features: func(s *xmpp.Session, _features ...xmpp.StreamFeature) []xmpp.StreamFeature {
					return features;
				},
			},
		),
	)

	if session != nil {
		defer session.Close()
	}

	if err != nil {
		fmt.Printf("%s: session failed: %s\n", address, err)
	} else {
		fmt.Printf("%s: session succedeed!\n", address)
	}
}


func main() {
	test("foo@blackhole.dreckshal.de")
	test("foo@reject.dreckshal.de")
	test("foo@http-blackhole.dreckshal.de")
}

The domains used are configured as follows:

  • blackhole.dreckshal.de: drops traffic to the XMPP port (courtesy badxmpp.eu). This is to demonstrate the expected behaviour.
  • reject.dreckshal.de: replies with connection refused on the XMPP port (also courtesy badxmpp.eu). This is to demonstrate a random test case.
  • http-blackhole.dreckshal.de (bad name, but whatever): accepts the connection and then … does nothing.

When running the above example, the first two tests terminate with an error (i/o error and connection refused respectively) and the third one gets stuck, despite the context timeout.


This was discovered because mellium does an implicit fallback to port 80 if port 5222 is not used and the domain in question had a strangely configured httpd which probably waited for the first newline or whatever. It did not reply at all to the stream header.

Minimal reproducer: ```golang package main import ( "context" "crypto/tls" "fmt" "time" "mellium.im/xmpp" "mellium.im/xmpp/dial" "mellium.im/xmpp/jid" "mellium.im/sasl" ) func test(address string) { ctx, _ := context.WithTimeout(context.Background(), 2 * time.Second) fmt.Printf("%s: testing ...\n", address) addr, err := jid.Parse(address) if err != nil { fmt.Printf("%s: failed to parse: %s\n", address, err) return } dialer := dial.Dialer{ NoTLS: true, S2S: false, } dialer.Deadline, _ = ctx.Deadline() conn, err := dialer.Dial(ctx, "tcp", addr) if err != nil { fmt.Printf("%s: failed to dial: %s\n", address, err) return } defer conn.Close() tls_config := &tls.Config{} features := []xmpp.StreamFeature{ xmpp.StartTLS(tls_config), xmpp.SASL( addr.Localpart(), "foobar2342", sasl.ScramSha256, sasl.ScramSha1, sasl.Plain, ), } session, err := xmpp.NewSession( ctx, addr.Domain(), addr, conn, 0, xmpp.NewNegotiator( xmpp.StreamConfig{ Lang: "en", Features: func(s *xmpp.Session, _features ...xmpp.StreamFeature) []xmpp.StreamFeature { return features; }, }, ), ) if session != nil { defer session.Close() } if err != nil { fmt.Printf("%s: session failed: %s\n", address, err) } else { fmt.Printf("%s: session succedeed!\n", address) } } func main() { test("foo@blackhole.dreckshal.de") test("foo@reject.dreckshal.de") test("foo@http-blackhole.dreckshal.de") } ``` The domains used are configured as follows: * blackhole.dreckshal.de: drops traffic to the XMPP port (courtesy badxmpp.eu). This is to demonstrate the *expected* behaviour. * reject.dreckshal.de: replies with connection refused on the XMPP port (also courtesy badxmpp.eu). This is to demonstrate a random test case. * http-blackhole.dreckshal.de (bad name, but whatever): accepts the connection and then … does nothing. When running the above example, the first two tests terminate with an error (i/o error and connection refused respectively) and the third one gets stuck, despite the context timeout. --- This was discovered because mellium does an implicit fallback to port 80 if port 5222 is not used and the domain in question had a strangely configured httpd which probably waited for the first newline or whatever. It did not reply at all to the stream header.
horazont commented 2021-04-01 15:45:11 +00:00 (Migrated from github.com)

NB: If you ever get Connection refused in the third test, let me know. Then my makeshift "do not send any reply" server broke :).

NB: If you ever get `Connection refused` in the third test, let me know. Then my makeshift "do not send any reply" server broke :).

So this appears to be getting stuck reading the reply stream header (which of course is never read if this is a black hole that returns nothing). This is expected behavior (more or less) because it's up to the user to set a deadline on the connection. We could also make it respect the ctx possibly, but I'm not sure if that makes sense or if it would be overriding the users intent if they already set a deadline on the conn. It would also require spawning a goroutine before any attempt at reading the network, which doesn't feel great. I'm really not sure what the expected Go-like thing to do is here. I'll think about it and make a decision, but in the mean time you can use conn.SetDeadline to work around this (and as a general rule of thumb you should always set deadlines on everything, I should add this to the examples).

So this appears to be getting stuck reading the reply stream header (which of course is never read if this is a black hole that returns nothing). This is expected behavior (more or less) because it's up to the user to set a deadline on the connection. We could also make it respect the ctx possibly, but I'm not sure if that makes sense or if it would be overriding the users intent if they already set a deadline on the conn. It would also require spawning a goroutine before any attempt at reading the network, which doesn't feel great. I'm really not sure what the expected Go-like thing to do is here. I'll think about it and make a decision, but in the mean time you can use `conn.SetDeadline` to work around this (and as a general rule of thumb you should always set deadlines on everything, I should add this to the examples).
horazont commented 2021-04-02 09:15:13 +00:00 (Migrated from github.com)

Hi and thanks for the response.

Using conn.SetDeadline during login seems very appropriate! I wasn’t aware of this golang way of doing things.

Nevertheless, if a method like NewSession takes a context, I would expect it to adhere to that deadline on its own, without me having to set a deadline on the connection. If the conn is the only thing involving a deadline, then maybe not taking a context there is the way to go?

Maybe my expectation is not go-idiomatic in that regard though.

Hi and thanks for the response. Using `conn.SetDeadline` during login seems very appropriate! I wasn’t aware of this golang way of doing things. Nevertheless, if a method like NewSession takes a context, I would expect it to adhere to that deadline on its own, without me having to set a deadline on the connection. If the conn is the only thing involving a deadline, then maybe not taking a context there is the way to go? Maybe my expectation is not go-idiomatic in that regard though.

Nevertheless, if a method like NewSession takes a context, I would expect it to adhere to that deadline on its own

I tend to agree, it feels odd that it can lock up even after the context is expired.

If the conn is the only thing involving a deadline, then maybe not taking a context there is the way to go?

The context is also used to break out of the negotiate loop, so we'd need a context either way.

Maybe my expectation is not go-idiomatic in that regard though.

This is something I hadn't thought about before and I'm really not sure what the idiomatic way to do this is. I generally think the connection deadline should be controlled by the user, not my library, but context of course is something I have more control over. It's sort of Go's idioms clashing with POSIX or C idioms (or wherever this socket behavior is defined, it's not strictly a Go thing).

> Nevertheless, if a method like NewSession takes a context, I would expect it to adhere to that deadline on its own I tend to agree, it feels odd that it can lock up even after the context is expired. > If the conn is the only thing involving a deadline, then maybe not taking a context there is the way to go? The context is also used to break out of the negotiate loop, so we'd need a context either way. > Maybe my expectation is not go-idiomatic in that regard though. This is something I hadn't thought about before and I'm really not sure what the idiomatic way to do this is. I generally think the connection deadline should be controlled by the user, not my library, but context of course is something I have more control over. It's sort of Go's idioms clashing with POSIX or C idioms (or wherever this socket behavior is defined, it's not strictly a Go thing).

I talked this out on chat earlier and came to a conclusion (though I am of course still open to other opinions).

I was forgetting that some of the convenience functions that wrap NewSession such as DialClientSession create the conn themselves and the user has no chance to set a deadline before passing one in. That makes this an actual bug because there's no way we want these to block forever with no way to cancel them.

If we make the context cancellation respected in those methods, it makes sense not to cancel it (this part is still debatable) in the methods that take a conn, partially because these actually take an io.ReadWriter so we'd have to do a type check which always feels jank and partially because this way these methods give the user maximum flexibility to set deadlines differently if they want.

So the current solution would be to add a goroutine that cancels connection reads/writes to:

  • DialSession,
  • DialClientSession, and
  • DialServerSession

and to leave the other methods alone (and maybe just add a reminder to the documentation that if you're using a conn you should always set a deadline and come up with a strategy for resetting it as necessary).

I talked this out on chat earlier and came to a conclusion (though I am of course still open to other opinions). I was forgetting that some of the convenience functions that wrap `NewSession` such as `DialClientSession` create the conn themselves and the user has no chance to set a deadline before passing one in. That makes this an actual bug because there's no way we want these to block forever with no way to cancel them. If we make the context cancellation respected in those methods, it makes sense *not* to cancel it (this part is still debatable) in the methods that take a conn, partially because these actually take an `io.ReadWriter` so we'd have to do a type check which always feels jank and partially because this way these methods give the user maximum flexibility to set deadlines differently if they want. So the current solution would be to add a goroutine that cancels connection reads/writes to: - `DialSession`, - `DialClientSession`, and - `DialServerSession` and to leave the other methods alone (and maybe just add a reminder to the documentation that if you're using a conn you should always set a deadline and come up with a strategy for resetting it as necessary).
horazont commented 2021-04-03 15:17:26 +00:00 (Migrated from github.com)

That sounds very reasonable!

That sounds very reasonable!

@horazont any chance you'd be interested in implementing this? It should be pretty straight forward (one goroutine that gets copied into those three functions just after the conn is created and an update to the changelog). No pressure of course, I can get to it later this weekend if not.

@horazont any chance you'd be interested in implementing this? It should be pretty straight forward (one goroutine that gets copied into those three functions just after the conn is created and an update to the changelog). No pressure of course, I can get to it later this weekend if not.

And I guess we'd want to have a pre-declared time that's used for the actual cancellation, something like aLongTimeAgo from net:

// aLongTimeAgo is a convenient way to cancel dials.                             
var aLongTimeAgo = time.Unix(1, 0)
And I guess we'd want to have a pre-declared time that's used for the actual cancellation, something like `aLongTimeAgo` from `net`: ``` // aLongTimeAgo is a convenient way to cancel dials. var aLongTimeAgo = time.Unix(1, 0) ```
horazont commented 2021-04-03 15:27:33 +00:00 (Migrated from github.com)

@horazont any chance you'd be interested in implementing this? It should be pretty straight forward (one goroutine that gets copied into those three functions just after the conn is created and an update to the changelog). No pressure of course, I can get to it later this weekend if not.

I’d rather not. My golang fu is not very solid yet and I don’t use either of those functions (I need to be on a lower level in my code), so I’d not be confident in that.

(Not to mention the huge pile of other stuff which needs to be done ;))

> @horazont any chance you'd be interested in implementing this? It should be pretty straight forward (one goroutine that gets copied into those three functions just after the conn is created and an update to the changelog). No pressure of course, I can get to it later this weekend if not. I’d rather not. My golang fu is not very solid yet and I don’t use either of those functions (I need to be on a lower level in my code), so I’d not be confident in that. (Not to mention the huge pile of other stuff which needs to be done ;))

Not to worry, I'll get to it later. Thanks for pointing this DOS out!

Not to worry, I'll get to it later. Thanks for pointing this DOS out!

I just pushed up a branch with a fix for this (and for a related but that was discovered and fixed incidentally). It still feels wrong to me somehow, but I can't quite put my finger on why. Going to merge it for now (assuming tests pass) and we can always re-visit later if other problems arise.

I just pushed up a branch with a fix for this (and for a related but that was discovered and fixed incidentally). It still feels wrong to me somehow, but I can't quite put my finger on why. Going to merge it for now (assuming tests pass) and we can always re-visit later if other problems arise.
Sign in to join this conversation.
No Milestone
No Assignees
1 Participants
Notifications
Due Date
The due date is invalid or out of range. Please use the format 'yyyy-mm-dd'.

No due date set.

Dependencies

No dependencies set.

Reference: mellium/xmpp#124
No description provided.