Wednesday, March 7, 2012

"MySQL server has gone away" or "Lost connection to MySQL server" when using Amazon RDS for backup or whatev's

Hey,
OMFG what a PITA.
You run long MySQL jobs, especially on an Amazon RDS instance, especially e.g. backups, and they randomly fail; more often as the db gets bigger, and especially after several hours.
Well FML!
The answer is all over Google; it's either

a) [most common] the "net_write_timeout" variable on the source server (e.g. the one you're dumping from). The AWS default is like 60s, which is sensibly cautious if your clients are some crashy bullshit (hanging result sets/cursors are obviously expensive to keep lying around). 60s is fine for a web app, but not if your client occasionally/randomly stalls for a really long time (think: EBS, or an InnoDB reindex on the far end if you're streaming direct via a pipe from mysqldump | mysql to do server->server backups, which can be a fine idea depending on yr needs).

OR

b) [less common] the destination server you're unpacking/writing to has timed out for similar reasons: you were in the middle of providing [usually a metric fuck-ton of] data in an INSERT or UPDATE, the server doing the insert got hung up on something once in a blue moon (e.g. a system backup), and boom; the mysql box you're writing to tells you to f-off.
The typical AWS timeout here (net_read_timeout) is like 30s.

Ok so there you go; AWS defaults are sensible for a high-volume web db trying to protect itself from bad clients. Backups occasionally qualify as 'bad clients'. FML again.

------THE FIX:

--EITHER (easy,global)
a) Change your RDS instance settings (Amazon control panel) to set 'net_read_timeout' (and write) to something bigger globally across all connections. You might pick e.g. 20 mins. If you have a lot of crappy/dropping DB connections this might be an issue.
--OR-- (usually easy for client code, not for mysqldump)
b) If you're running client sql code (e.g. php, python) simply do "SET net_write_timeout=3600" or whatever (or net_read_timeout, depending on whether yr problem is in SELECTs or INSERT/UPDATEs) on each conn after you open it, and bingo.
The var is set per-connection. This works perfectly for client code but not for mysqldump backups, which I'll get to below.
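In case it helps, here's a minimal sketch of that per-connection fix in python. The helper just builds the SET statements; the driver calls at the bottom are illustrative placeholders (any DB-API-style MySQL client works the same way), not part of the original setup:

```python
# Sketch of the per-connection timeout fix. The names below are mine,
# not from any particular library.

def session_timeout_sql(seconds=3600):
    """SQL to run on each fresh connection before long-running work."""
    return [
        "SET SESSION net_write_timeout=%d" % seconds,  # long SELECTs / dumps
        "SET SESSION net_read_timeout=%d" % seconds,   # long INSERT/UPDATEs
    ]

# Usage with a freshly opened connection `conn` (hypothetical):
#   cur = conn.cursor()
#   for stmt in session_timeout_sql(3600):
#       cur.execute(stmt)
```

Both vars have session scope, so this only affects the connection you run it on; the server's global defaults stay put for everyone else.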
--OR-- (middling, fixes mysqldump)
c) If your problem is with mysqldump piped dumps, e.g.
mysqldump yadda | ... | mysql yadda
(I do this using 'tee' to pipe a copy through gzip into a backup file at the same time as copying a db direct from one host to another; compressed copy at the same time, no limit on db size. ..actually that is why I'm writing this..)
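For what it's worth, one bash spelling of that tee-plus-gzip split is `mysqldump ... | tee >(gzip > dump.sql.gz) | mysql ...`, and the same idea can be sketched in pure python (a stand-in I wrote for illustration, not the original setup):

```python
import gzip

def tee_to_gzip(src, dst, gz_path, chunk=1 << 16):
    """Copy the byte stream src -> dst (like the pipe into mysql) while
    also writing a gzip'd copy to gz_path (the backup file). A pure-Python
    stand-in for `mysqldump | tee >(gzip > dump.sql.gz) | mysql`."""
    total = 0
    with gzip.open(gz_path, "wb") as gz:
        while True:
            buf = src.read(chunk)
            if not buf:
                break
            dst.write(buf)   # onward to the destination server
            gz.write(buf)    # compressed copy to disk
            total += len(buf)
    return total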

The issue you see is that when the target DB stalls (which can happen for e.g. 1-10 min at bad times) the source DB times out (quickly; 60s default) on the write socket. This is not, AFAIK, fixable with stock mysqldump without changing the server's global timeouts for everyone.

The obvious fix is for mysqldump to send "SET net_write_timeout=blah" after opening the conn it uses for dumping. Weirdly I cannot make it do that regardless of options, so I hit up the src code.

Basically I patched mysqldump so it sends "SET net_write_timeout=somebigtime" to the source server AND prefixes "SET net_read_timeout=blah" into the SQL dump it outputs.

This solves both ends of the problem, especially when piping from one server to another with mysqldump | mysql: both source and dest servers set the timeout _temporarily on that connection_ to a nice long value, e.g. 1hr.
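Aside: if you only need the destination-side half (the net_read_timeout prefix) you can fake it without patching anything, by injecting the SET into the stream mid-pipe. A hedged sketch (script name and helper are mine; this does nothing for the source server's net_write_timeout, which is what the mysqldump patch is for):

```python
import sys

def inject_read_timeout(src, dst, seconds=3600):
    """Prepend a SET net_read_timeout to a SQL dump stream, then pass
    the rest through unchanged. Meant as a pipe filter, roughly:
      mysqldump ... | python inject.py | mysql ...
    Only helps the destination server; the source-side write timeout
    still needs the patch (or a global setting)."""
    dst.write(("SET net_read_timeout=%d;\n" % seconds).encode())
    while True:
        buf = src.read(1 << 16)
        if not buf:
            break
        dst.write(buf)

if __name__ == "__main__":
    inject_read_timeout(sys.stdin.buffer, sys.stdout.buffer)
```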


Fixing mysqldump so it sends "net_write_timeout"

So I basically grabbed the mysql src and compiled it (google it) on an aws box.
(actually "yum install mysql-devel ncurses-devel" may be handy, prolly a few more)
You don't need all the mysql stuff, just mysqldump,
..but I compiled everything (./configure {google for opts} and make)

mysqldump lives in client/ and all the guts are in mysqldump.c
The hack is utterly trivial;

at the end of "connect_to_db", before the DBUG_RETURN, I added


#ifdef MUNKY_HACK
  /* bump the write timeout on the dump connection only */
  my_snprintf(buff, sizeof(buff), "SET net_write_timeout=3600");
  if (mysql_query_with_error_report(mysql, 0, buff))
    DBUG_RETURN(1);
#endif


..which is the most important bit; it tells the src server to use longer timeouts just for the dumping connection.

If you're being fancy and piping from one db to another, you may want the output dump to include a prefix telling the target db to use longer timeouts.
I added this near the top of 'dump_table':

fprintf(md_result_file, "/* Munky Hack */\nSET net_read_timeout=3600;\n");


It's that easy (it seems) to make this stuff work properly and reliably.