api-gateway: nginx will cache resolved DNS addresses

Description

nginx resolves upstream host names when it loads its configuration and caches the resulting addresses. As a result, when a number of backend services are restarted, nginx may direct proxied requests to the wrong place.

The reason is that each service may get a different IP address when it starts. In the best case, docker assigns a new (previously unused) IP address to the recreated service, and nginx has no reachable upstream to direct the requests to. In the worst case, a service that had one address before the restart gets an IP address that was previously used by another service. nginx will then direct requests to the wrong service, and accessing endpoints will return 404.

Possible solutions:

try to figure out if proxy_pass with a variable and a resolver fixes the problem
provide a helper that sends SIGHUP to nginx if addresses change
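
A rough sketch of what the first option could look like, not a tested configuration for this gateway. The snippet writes a hypothetical nginx fragment (meant to be included inside a server block) to a scratch file. The location path, the port 8080 and the variable name are assumptions made for illustration; 127.0.0.11 is the address of Docker's embedded DNS server on user-defined networks.

```shell
#!/bin/sh
# Sketch only: write a hypothetical nginx fragment to a scratch file.
# Assumptions: location path, port 8080 and the variable name are illustrative.
CONF="${CONF:-$(mktemp)}"

# The quoted 'EOF' keeps the shell from expanding the nginx variable.
cat > "$CONF" <<'EOF'
# Docker's embedded DNS; valid=10s caps how long nginx keeps an answer.
resolver 127.0.0.11 valid=10s;

location /api/management/v1/useradm/ {
    # With a variable in proxy_pass, nginx resolves the name at request
    # time through the resolver above instead of once at startup.
    set $useradm_upstream mender-useradm:8080;
    proxy_pass http://$useradm_upstream;
}
EOF

echo "wrote $CONF"
```

Because proxy_pass carries no URI part here, the original request URI is passed to the backend unchanged.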

Affects versions

1.0.0

1.0.1

1.1.0

Environment

None

Activity

Marcin Chalczynski, August 8, 2017 at 8:48 AM

PR https://github.com/mendersoftware/mender-api-gateway-docker/pull/71

based on MaciekB's example.

verified to work both when scaling up, and scaling down, including killing off all instances of a service and restarting.

I noticed a few non-critical quirks, but I suspect they are caused by my local environment (which has been acting up lately):

when a service goes away and pops back up, it never reappears in the compose output (the one you get attached to at ./demo up)
- it's visible if you do ./demo logs ... in a different terminal though, and everything works correctly
sometimes when killing/restarting the service, there's an HTTP connection timeout and you get detached from the compose output
- again, everything still works, as confirmed by reattaching or dumping logs

I'm mentioning this because I've never seen it with any previous POC, and I've done a lot of fiddling around with this.

Maciej Borzecki, August 7, 2017 at 12:09 PM

this might be enough

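# Service names to watch.
NAMES='mender-useradm mender-device-auth mender-device-adm mender-device-auth'
while true; do
    # Keep only the answer records: drop dig comments, blank lines and root entries.
    # $NAMES is left unquoted on purpose so it splits into one query per name.
    # sort keeps a changed record order from looking like an address change.
    dig $NAMES | grep -v -e '^;' -e '^$' -e '^\.' | sort > /tmp/addrs.new
    if test -e /tmp/addrs; then
        if ! cmp /tmp/addrs.new /tmp/addrs; then
            echo '-- reload'    # placeholder: nginx would be reloaded here
        else
            echo '-- no reload'
        fi
    fi
    # The new snapshot becomes the baseline for the next round.
    mv /tmp/addrs.new /tmp/addrs
    sleep 10
done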
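
The '-- reload' placeholder in the loop above could be filled in along these lines. This is only a sketch: the function names and the default pid file path are hypothetical, and it assumes the helper runs where it can signal the nginx master process.

```shell
#!/bin/sh
# Sketch: decide whether a reload is needed and signal nginx.
# Assumptions: function names and the default pid file path are hypothetical.

# Succeeds only when a previous snapshot exists and differs from the new one.
addrs_changed() {
    new="$1"
    old="$2"
    [ -e "$old" ] && ! cmp -s "$new" "$old"
}

# SIGHUP makes the nginx master re-read its configuration, which also
# re-resolves the upstream host names.
reload_nginx() {
    pidfile="${1:-/var/run/nginx.pid}"
    [ -r "$pidfile" ] && kill -HUP "$(cat "$pidfile")"
}

# Example wiring with the snapshot files used by the loop:
#   if addrs_changed /tmp/addrs.new /tmp/addrs; then reload_nginx; fi
```

As in the loop, the very first run (no previous snapshot) does not trigger a reload.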

Marcin Chalczynski, August 7, 2017 at 11:35 AM

ok, but I'd propose to have this optimisation in right from the start:

have a primitive python script that does nslookup on known services
- and dumps the IPs to a file
- and compares the current IPs to the previous ones; if any new IPs are detected, it reloads nginx
have cron run it however often we want

Maciej, August 7, 2017 at 10:37 AM

@Maciej Borzecki for sure, probably doing the whole graceful shutdown and so on. @Marcin Chalczynski anything sane would probably be minutes, due to the cost

as mentioned this is pretty much a hack; if we are out of options we can try this - still slightly more convenient than resetting the whole container
we could also be a bit smart and detect whether it needs a reload, if we want to spend more time on optimising

Marcin Chalczynski, August 7, 2017 at 10:21 AM (edited)

Well, what worries me is that the reload interval probably couldn't be on the order of seconds.

What would be a sane value here? 1 min, 5 mins? Anyway, the longer the better, but this means longer downtime for the user (EDIT: what I mean is - they're likely to get frustrated and just restart the whole setup in the meantime).

Fixed

Details
Assignee: Marcin Chalczynski
Reporter: Maciej Borzecki
Labels: BackendSaaS
Story Points: 8
Priority: Medium
Sprint: None
Backlog: yes

Created May 24, 2017 at 5:44 AM

Updated August 9, 2017 at 7:00 AM

Resolved August 9, 2017 at 7:00 AM
