A short introduction story. Here at Orange Sputnik we got an unusual request from one of our clients. The client’s product is a complex software project that includes a simple promo website with landing pages, an admin dashboard, a web interface for the client area, mobile applications for iOS and Android, a private API, and a public API. The client came to us with a question: is it possible to speed up the mobile apps without changing the apps themselves? Most of the client’s concerns were about the API, because the server took anywhere from 500 ms to 2 seconds to respond. For a high-load project, such a response time is literally a nightmare. Flash forward to the results: we were able to drop the response time to 2-5 ms.
Ok, let’s proceed to the resolution.
First of all, we got access to the server and the source code of the backend, including the APIs. In the scope of this project, we had very little interest in the actual implementation of the mobile apps. When the server API responds that slowly, the gap is most likely on the backend side.
So we reviewed the whole system structure: which database is used, whether there is any caching mechanism, and which programming language the backend is written in.
Technical stack of the project
We discovered that the technical stack of the project is:
- Memcached together with Redis
- Firebase for sending the notifications to the mobile apps
- Socket.IO, WebSockets
- SendInBlue integration for sending all marketing and automation emails
The way to analyze
To find all the bottlenecks in the project, we used services that are publicly available over the Internet and tools on the server side:
- PageSpeed Insights from Google
- WebPageTest. It seems they are not going to win any awards for their UI 😄, but it is one of the best tools
- Tools from Pingdom
- ab (Apache Benchmark)
We ran a series of benchmarks for our client. Before proceeding to the server optimization, we gave a couple of hints about protecting their API and about removing Memcached from the technical stack and staying with Redis, since Redis is the more flexible solution and it is really ridiculous to see both of them in the same stack. Then we had to involve their developer in this process to understand why they went that way.
Anyway, we’re here today pursuing a different aim.
Actually, we built quite a trusting relationship with our client, so we got root access to the production instance running on AWS. We also asked for a backup of this instance, which is a common precaution for this type of work. 🤞🤞🤞
In most cases, you don’t really care about the “default setup” and “default installed applications”, but you should. 🤷
I stick to Midnight Commander (MC) and its editor (mcedit) rather than nano or vim. When MC cannot be used in an environment, I will of course use nano or vim. But since we have root access, we go with MC.
Install MC with the command:
sudo apt install mc
Since Nginx is the web server here, we optimized it first.
Enable HTTP/2 on Nginx
The first step in tuning Nginx for faster TTFB/latency with HTTPS is to make sure HTTP/2 is enabled.
HTTP/2 support first shipped in Nginx 1.9.5, where it replaced the SPDY protocol.
Enabling the HTTP/2 module in Nginx is simple. You just need to add the keyword http2 to the listen directive in the server block of the Nginx config file (e.g. /etc/nginx/sites-enabled/sitename). Bear in mind that browsers only use HTTP/2 over HTTPS, so HTTPS must be enabled. On Nginx 1.25.1 and later, the http2 parameter of listen is deprecated in favor of a separate http2 directive.
listen 443 ssl http2;
Then reload the Nginx service and that’s it!
service nginx reload
Enable SSL session cache
Why do we need to enable the SSL session cache? An HTTPS connection needs an extra TLS handshake on top of the usual TCP round trip. Using HTTP/2 and enabling the Nginx ssl_session_cache lets returning clients resume a previous session instead of performing a full handshake, which speeds up HTTPS connection setup and page loads.
You can configure Nginx to share this cache between all workers. On a multi-CPU instance this is highly recommended.
1 MB can store about 4000 sessions. Since we expect a high load on the server, we set the cache to 20 MB and the cache TTL (the time during which a session may be reused) to 2 hours:
ssl_session_cache shared:SSL:20m; # holds approx 80000 sessions
ssl_session_timeout 2h; # 2 hours for Cache TTL
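To size the cache for your own traffic, the rule of thumb above can be turned into a quick calculation. This is only a sketch: the constant and the function names are ours, and the 4000-sessions-per-megabyte figure is an approximation, not a guarantee.

```python
# Rule of thumb from the text: 1 MB of shared cache holds about 4000 sessions.
SESSIONS_PER_MB = 4000


def estimated_sessions(cache_mb: int) -> int:
    """Approximate number of TLS sessions that fit in cache_mb megabytes."""
    return cache_mb * SESSIONS_PER_MB


def cache_mb_needed(expected_sessions: int) -> int:
    """Smallest whole number of megabytes that holds expected_sessions."""
    # Ceiling division without floats.
    return -(-expected_sessions // SESSIONS_PER_MB)


if __name__ == "__main__":
    print(estimated_sessions(20))    # the 20m zone used above -> 80000
    print(cache_mb_needed(100_000))  # -> 25
```

If you expect more concurrent sessions than the estimate, raise the size of the shared zone accordingly.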
TLS (Transport Layer Security) is a cryptographic protocol designed to provide communications security over a computer network. It replaced SSL. I would even say that TLS is the next generation of SSL, which is deprecated and vulnerable to POODLE attacks. The TLS protocol comprises two layers: the TLS record protocol and the TLS handshake protocol.
It is important to understand that TLS operations are quite expensive in terms of resources, which is why it makes sense to run several worker processes on multiprocessor systems. The TLS handshake is the most “difficult” operation in terms of both load and execution time. Therefore, the best solution is to optimize the session: persistent connections, caching, static keys, and OCSP Stapling. That’s why we started the optimization by enabling the SSL session cache.
Keepalive – persistent connection
Using persistent connections makes it possible to process several requests over a single connection, so the handshake cost is not paid for every request.
Session tickets
The TLS protocol can use session tickets to resume a session if the client supports them (the Chromium and Firefox browser families do).
To do this, the TLS server sends the client a session ticket, encrypted with the server’s own key, along with the key identifier. The client resumes the secure session by sending the last ticket to the server during the initialization of the TLS handshake. The server then resumes the session according to the saved parameters.
OCSP Stapling
Online Certificate Status Protocol (OCSP) is an SSL certificate validation mechanism that replaces the slower Certificate Revocation List (CRL) approach.
When using a CRL, the browser downloads the certificate revocation list and checks the current certificate against it, which increases the connection time. With OCSP, the browser sends a verification request to the OCSP address and receives the certificate status in response, which can heavily load the certification authority (CA) servers.
To solve this problem, OCSP Stapling is used: the owner of the certificate polls the OCSP server at a certain interval and caches the signed response. The response is then attached to the TLS handshake through the Certificate Status Request extension. That way the CA servers don’t get a huge number of requests, which would also contain sensitive information about the user’s browsing.
To enable OCSP Stapling, simply add a few lines:
Strict Transport Security
To force the browser to use the HTTPS protocol, there is a mechanism called Strict Transport Security (HSTS).
After receiving this header from the server, the browser understands that it must use the HTTPS protocol, even when following an HTTP link.
This helps to prevent some attacks, especially if the server does not redirect from HTTP to HTTPS.
add_header Strict-Transport-Security "max-age=31536000; includeSubDomains";
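The max-age value in this header is a number of seconds, and 31536000 is simply one year. A small sketch that builds the header value from a number of days makes that explicit; the function name and its parameters are hypothetical helpers, not part of Nginx.

```python
def hsts_header(days: int, include_subdomains: bool = True) -> str:
    """Build a Strict-Transport-Security header value for the given lifetime."""
    max_age = days * 24 * 60 * 60  # max-age is expressed in seconds
    value = f"max-age={max_age}"
    if include_subdomains:
        value += "; includeSubDomains"
    return value


if __name__ == "__main__":
    print(hsts_header(365))  # -> max-age=31536000; includeSubDomains
```

While testing, a much shorter lifetime is safer, because browsers remember the policy for the whole max-age period.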
Reduce SSL buffer size
The Nginx ssl_buffer_size option sets the size of the buffer used for sending data over HTTPS. By default, ssl_buffer_size is set to 16k. This is a good one-size-fits-all default geared toward big responses.
However, to minimize TTFB (Time To First Byte) it is often better to use a smaller value; the right value depends on your setup. On our client’s instance we set:
SSL Ciphers and deploying Diffie-Hellman
Following this link you can find the explanation why you need to use a Strong Diffie-Hellman Group.
Modern browsers, including Google Chrome, Mozilla Firefox, and Microsoft Internet Explorer have increased the minimum group size to 1024-bit. We recommend you generate a 2048-bit group, but if you’re paranoid go with 4096:
openssl dhparam -out dhparams.pem 2048
In the Nginx config, the ssl_ciphers option is placed in the server block:
I would say that’s all… 😊 It’s not! We have just finished tuning Nginx. The next step is to optimize the server itself for high load and protect it from DDoS attacks.
You can find a lot of optimization tips for Linux kernel configuration on the Internet, but not all of them are explained well. Here is our optimization setup.
We will go step by step and explain what each setting does and why we change it.
All these configuration lines go into /etc/sysctl.conf. To apply them without a reboot, run sudo sysctl -p.
net.ipv4.conf.all.accept_redirects = 0
net.ipv4.conf.all.secure_redirects = 0
net.ipv4.conf.all.send_redirects = 0
These settings tell the Linux kernel not to accept or send ICMP redirect packets. An attacker can use ICMP redirects to modify routing tables.
So it sounds reasonable to disable them (set to zero/false/0).
Enabling these options is only meaningful for hosts that are used as routers. For a regular server, they are not needed.
net.ipv4.tcp_max_orphans = 65536
The tcp_max_orphans parameter specifies the maximum number of TCP sockets allowed in the system that are not attached to any user file handle.
When this threshold is reached, orphaned connections are immediately dropped with a warning. This threshold only helps prevent simple DoS attacks. It is better not to lower the threshold (rather, raise it to meet system requirements, for example after adding memory). Each orphaned connection consumes about 64 KB of unswappable memory. So if you set 65536 here, you need up to 4 GB of RAM for these orphans.
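The memory estimate is easy to recompute for other limits. The sketch below assumes the roughly 64 KB per orphan quoted above; the function name is ours and the result is a worst-case figure, not a measurement.

```python
ORPHAN_KB = 64  # approximate unswappable memory per orphaned socket


def orphan_memory_gib(max_orphans: int, per_orphan_kb: int = ORPHAN_KB) -> float:
    """Worst-case memory (in GiB) used if max_orphans sockets are orphaned."""
    total_bytes = max_orphans * per_orphan_kb * 1024
    return total_bytes / 1024**3


if __name__ == "__main__":
    print(orphan_memory_gib(65536))  # -> 4.0
    print(orphan_memory_gib(16384))  # -> 1.0
```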
net.ipv4.tcp_fin_timeout = 10
The tcp_fin_timeout parameter determines how long sockets are kept in the FIN-WAIT-2 state after our (server) side has closed the connection. The client side (a remote browser, etc.) may never close the connection, so it has to be closed once the timeout expires. The default timeout is 60 seconds; in the 2.2 kernel series, a value of 180 seconds was typically used. You can keep the default, but bear in mind that on a heavily loaded web server you risk wasting a lot of memory on half-closed dead connections.
FIN-WAIT-2 sockets are less dangerous than FIN-WAIT-1 sockets, since they consume at most 1.5 KB of memory, but they also tend to “live” longer.
net.ipv4.tcp_keepalive_time = 1800
net.ipv4.tcp_keepalive_intvl = 15
net.ipv4.tcp_keepalive_probes = 5
How long should a connection stay idle before the kernel checks whether it is still alive? That is set by the tcp_keepalive_time parameter (in seconds). This value is meaningful only for sockets created with the SO_KEEPALIVE flag. The integer variable tcp_keepalive_intvl defines the interval between probes. The product tcp_keepalive_probes * tcp_keepalive_intvl gives the time after which an unresponsive connection is dropped once probing has started. By default, the interval is 75 seconds and 9 probes are sent, so the connection is dropped after approximately 11 minutes of probing. With the values above, it is dropped after 5 * 15 = 75 seconds of probing.
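To compare our values with the kernel defaults, here is a small sketch of the arithmetic. The function name is ours, and the defaults used in the second call (7200 seconds idle time, 75 seconds interval, 9 probes) are the usual Linux defaults, so check them on your own system.

```python
def keepalive_timings(time_s: int, intvl_s: int, probes: int) -> tuple:
    """Return (seconds spent probing, total seconds until a dead peer is dropped)."""
    probing = intvl_s * probes   # all probes go unanswered
    total = time_s + probing     # idle time before the first probe plus probing
    return probing, total


if __name__ == "__main__":
    print(keepalive_timings(1800, 15, 5))  # values from this article -> (75, 1875)
    print(keepalive_timings(7200, 75, 9))  # usual kernel defaults -> (675, 7875)
```

So with the settings above, a dead peer is detected after roughly 31 minutes instead of more than two hours.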
net.ipv4.tcp_max_syn_backlog = 4096
tcp_max_syn_backlog defines the maximum number of connection requests kept in memory for which we received no acknowledgment from the connecting client. If you find the server is experiencing overloads, you can try increasing this value.
net.ipv4.tcp_synack_retries = 1
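tcp_synack_retries controls the number of SYN-ACK retransmissions for passive TCP connections. The value should not exceed 255. The default is 5, which, with an initial RTO of 1 second, corresponds to 31 seconds until the last retransmission and a final timeout after 63 seconds. A value of 1 shortens the final timeout to roughly 3 seconds, assuming the same initial RTO.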
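The effect of this setting can be estimated with a sketch of the retransmission backoff. It assumes an initial retransmission timeout (RTO) of 1 second that doubles after every retransmission, as described in the kernel documentation; the function name is ours and real timings can differ slightly.

```python
def synack_timeline(retries: int, initial_rto: float = 1.0) -> tuple:
    """Return (time of last SYN-ACK retransmission, time the kernel gives up)."""
    rto = initial_rto
    elapsed = 0.0
    last_retransmit = 0.0
    for _ in range(retries):
        elapsed += rto            # timer expires, SYN-ACK is sent again
        last_retransmit = elapsed
        rto *= 2                  # exponential backoff
    give_up = elapsed + rto       # one more timeout with no retransmission left
    return last_retransmit, give_up


if __name__ == "__main__":
    print(synack_timeline(5))  # kernel default -> (31.0, 63.0)
    print(synack_timeline(1))  # value used here -> (1.0, 3.0)
```

A low value frees half-open connections quickly, at the cost of giving clients on slow or lossy networks fewer chances to complete the handshake.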
net.ipv4.tcp_mem = 50576 64768 98152
The vector variable tcp_mem (minimum, load mode, and maximum) contains the general settings for memory consumption of the TCP protocol. It is measured in pages (usually 4 KB), not bytes.
Minimum: While the total memory used for TCP structures is below this number of pages, the operating system does nothing.
Load mode: As soon as the number of memory pages allocated for TCP reaches this value, the under-load (memory pressure) mode is activated. In this mode, the operating system tries to limit memory allocations. The mode remains active until memory consumption returns to the minimum level.
Maximum: This is the maximum number of memory pages allowed for all TCP sockets.
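Because tcp_mem is expressed in pages, it helps to convert the three thresholds into megabytes before comparing them with the RAM of the instance. A minimal sketch, assuming a 4096-byte page (on a real system you can check the page size with getconf PAGESIZE); the names are ours.

```python
PAGE_SIZE = 4096  # bytes; an assumption, verify on the target system


def pages_to_mib(pages: int, page_size: int = PAGE_SIZE) -> float:
    """Convert a number of memory pages to mebibytes."""
    return pages * page_size / 2**20


if __name__ == "__main__":
    # The tcp_mem vector used in this article: minimum, load mode, maximum.
    for name, pages in (("minimum", 50576), ("load mode", 64768), ("maximum", 98152)):
        print(f"{name}: {pages_to_mib(pages):.1f} MiB")
```

With these values TCP may use a bit under 400 MiB in total, so the vector should be scaled up on instances with much more memory.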
net.ipv4.tcp_rmem = 4096 87380 16777216
Another vector variable (minimum, default, maximum) is tcp_rmem. It contains 3 integers specifying the size of the TCP socket receive buffer.
Minimum: every TCP socket has the right to use this memory upon creation. The availability of such a buffer is guaranteed even under moderate memory pressure. Older kernel documentation lists the default minimum as 8 KB (8192); current kernels use 4 KB.
Default: The initial size of the receive buffer for a TCP socket. This value overrides the /proc/sys/net/core/rmem_default parameter used by other protocols. The default is usually 87380 bytes. This results in a window of 65535 with the default tcp_adv_win_scale and tcp_app_win = 0, and a slightly smaller window with the default tcp_app_win.
Maximum: The maximum buffer size that can be automatically allocated for receiving on a TCP socket. This value does not override the maximum set in /proc/sys/net/core/rmem_max. It does not apply when the buffer size is set “statically” with SO_RCVBUF.
net.ipv4.tcp_wmem = 4096 65536 16777216
Yet another vector variable is tcp_wmem. It contains 3 integer values that define the minimum, default, and maximum amount of memory reserved for TCP socket send buffers.
Minimum: every TCP socket has the right to use this memory upon creation. The default minimum buffer size is 4 KB (4096).
Default: The initial size of the send buffer for a TCP socket. This value overrides the /proc/sys/net/core/wmem_default parameter used by other protocols and is usually lower than it. The default is usually 16 KB (16384).
Maximum: The maximum amount of memory that can be automatically allocated for the TCP socket send buffer. This value does not override the maximum specified in /proc/sys/net/core/wmem_max. It does not apply when the buffer size is set “statically” with SO_SNDBUF.
net.ipv4.tcp_orphan_retries = 0
The tcp_orphan_retries value specifies how many times the kernel retries before destroying a TCP connection that was closed from the server side. The default value is 7, which corresponds to roughly 50 seconds to 16 minutes depending on the RTO. On heavily loaded servers, it makes sense to decrease this value, since closed connections can consume a lot of resources.
net.ipv4.tcp_syncookies = 0
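tcp_syncookies controls SYN cookies, a fallback protection against SYN flood attacks. The kernel documentation warns that SYN cookies must not be used to help a highly loaded server cope with a legitimate connection rate, so we put 0 here.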
net.ipv4.netfilter.ip_conntrack_max = 16777216
The maximum number of connections tracked by the connection tracking mechanism (used, for example, by iptables). If the value is too low, the kernel rejects incoming connections and writes a corresponding entry to the system log. On current kernels this setting is exposed as net.netfilter.nf_conntrack_max.
net.ipv4.tcp_timestamps = 1
Enables TCP timestamps (RFC 1323). They make it possible to control the operation of the protocol under high-load conditions (see tcp_congestion_control for details).
net.ipv4.tcp_sack = 1
Allows TCP selective acknowledgments. This option is effectively required for efficient use of all the available bandwidth on some networks. Say hello to AWS and GCP! 👋
net.ipv4.tcp_congestion_control = htcp
This option selects the congestion control algorithm used to manage traffic on TCP connections. The default bic and cubic implementations contain bugs in most versions of the RedHat kernel and its clones. It is recommended to use htcp.
net.ipv4.tcp_no_metrics_save = 1
Tells the kernel not to store TCP connection metrics in the cache when a connection is closed. Sometimes this helps to improve performance. Just play with this option for better results.
net.ipv4.route.flush = 1
This option is relevant for 2.4 kernels. For some strange reason, if a retransmission with a reduced window size occurs within a TCP session on a 2.4 kernel, all subsequent connections to that host in the next 10 minutes get the same reduced window size. This option simply flushes that setting. Current Ubuntu versions ship kernel 4.15 and higher, so just bear it in mind.
net.ipv4.conf.all.rp_filter = 1
net.ipv4.conf.lo.rp_filter = 1
net.ipv4.conf.eth0.rp_filter = 1
net.ipv4.conf.default.rp_filter = 1
These options enable protection against IP address spoofing.
net.ipv4.conf.all.accept_source_route = 0
net.ipv4.conf.lo.accept_source_route = 0
net.ipv4.conf.eth0.accept_source_route = 0
net.ipv4.conf.default.accept_source_route = 0
Next, we disable source routing.
net.ipv4.ip_local_port_range = 1024 65535
With this option, we increase the range of local ports available for establishing outgoing connections.
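To see how much this range actually buys, count the ports. A small sketch with a hypothetical helper; the range 32768 to 60999 used for comparison is the usual default on recent kernels, so verify it on your system.

```python
def port_count(range_value: str) -> int:
    """Number of ports in an ip_local_port_range value such as '1024 65535'."""
    low, high = (int(part) for part in range_value.split())
    if low > high:
        raise ValueError("lower bound must not exceed upper bound")
    return high - low + 1  # both ends are inclusive


if __name__ == "__main__":
    print(port_count("1024 65535"))   # range from this article -> 64512
    print(port_count("32768 60999"))  # usual default -> 28232
```

The wider range more than doubles the number of simultaneous outgoing connections to the same destination address and port, which matters when the backend talks to services such as Redis.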
net.ipv4.tcp_tw_reuse = 1
Allows reuse of TIME-WAIT sockets for new connections when the protocol considers it safe.
net.ipv4.tcp_window_scaling = 1
Enables TCP window scaling (RFC 1323), which allows the TCP window to grow beyond 64 KB.
net.ipv4.tcp_rfc1337 = 1
Enables protection against TIME-WAIT assassination attacks (RFC 1337).
net.ipv4.ip_forward = 0
Disables packet forwarding, since we’re still not a router.
net.ipv4.icmp_echo_ignore_broadcasts = 1
Tells the kernel not to respond to ICMP ECHO requests sent to broadcast addresses.
net.ipv4.icmp_echo_ignore_all = 1
We can also completely disable responses to ICMP ECHO requests, so that the server does not respond to ping at all. Decide for yourself whether you need this.
net.ipv4.icmp_ignore_bogus_error_responses = 1
Ignores bogus ICMP error responses. Some routers violate RFC 1122 by sending bogus responses to broadcast frames. Such violations are normally logged via a kernel warning. If this is set to TRUE, the kernel will not log such warnings, which avoids log file clutter.
net.core.somaxconn = 65535
The maximum number of connections that can wait in the queue of a listening socket to be accepted. It makes sense to raise the default to improve server responsiveness.
net.core.netdev_max_backlog = 1000
The parameter defines the maximum number of packets put in the queue for processing if the network interface receives packets faster than the kernel can process them.
net.core.rmem_default = 65536
net.core.wmem_default = 65536
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
The last values are the default receive buffer size, the default send buffer size, the maximum receive buffer size, and the maximum send buffer size. These settings apply to all connections.
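With this many settings, it is easy to lose track of what the file really contains. The sketch below parses sysctl.conf-style text into a dictionary and maps each key to its path under /proc/sys, where the live value can be read. The function names are ours and it covers only simple key = value lines.

```python
from typing import Dict


def parse_sysctl(text: str) -> Dict[str, str]:
    """Parse 'key = value' lines, skipping blank lines and # or ; comments."""
    settings: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", ";")):
            continue
        key, sep, value = line.partition("=")
        if not sep:
            continue  # not a key = value line
        # Collapse repeated whitespace so vector values compare cleanly.
        settings[key.strip()] = " ".join(value.split())
    return settings


def proc_path(key: str) -> str:
    """Map a sysctl key to the file that exposes its live value."""
    return "/proc/sys/" + key.replace(".", "/")


if __name__ == "__main__":
    sample = "# TCP tuning\nnet.ipv4.tcp_fin_timeout = 10\nnet.core.somaxconn = 65535\n"
    for key, value in parse_sysctl(sample).items():
        print(proc_path(key), "should contain", value)
```

Comparing the parsed values with the contents of those files is a quick way to confirm that the settings were applied after a reload.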
These are not all the improvements we made on the client’s server. Our next steps were about database optimization. Together with the optimizations described above, they helped us dramatically reduce TTFB: the server response time dropped to 2-5 ms per request on average.