Category Archives: Programming

Connection aborted error using python elasticsearch with large files on AWS ES

AWS ES has an upload limit of 10MB. If you are using the bulk helpers or reindex and a request goes over this limit, you will get the error ConnectionError: ('Connection aborted.', error(32, 'Broken pipe')).

To solve it, use the max_chunk_bytes argument, which can be passed to reindex through bulk_kwargs like so:

from elasticsearch import helpers
helpers.reindex(
    es, source_index, target_index, chunk_size=100,
    bulk_kwargs={'max_chunk_bytes': 10048576})  # 10MB AWS ES upload limit

Ideally, set chunk_size to the average number of documents that fit in 10MB. If some larger documents push a chunk over 10MB, the elasticsearch library handles it by splitting the chunk at max_chunk_bytes.
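As a rough illustration of picking that chunk size, here is a minimal sketch. The helper name docs_per_chunk and the 100 KB average document size are made up for the example; measure your own documents to get a real figure.

```python
def docs_per_chunk(avg_doc_bytes, limit_bytes=10 * 1024 * 1024):
    """Number of average-sized documents that fit under the upload limit."""
    if avg_doc_bytes <= 0:
        raise ValueError("avg_doc_bytes must be positive")
    # Always send at least one document per chunk.
    return max(1, limit_bytes // avg_doc_bytes)


# Hypothetical average: documents of roughly 100 KB each.
chunk_size = docs_per_chunk(100 * 1024)
print(chunk_size)  # 102
```

The result is only a starting point for chunk_size; max_chunk_bytes is still what protects you from the occasional oversized batch.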

Handling HTTP status code 100 in Scrapy

You might have some problems handling the 100 response code in Scrapy. Scrapy uses Twisted on the backend, which itself does not handle status code 100 properly yet.

The remote server will first send a response with the 100 status code, then a response with the 200 status code. In order to get the 200 response, send the following header in your spider:

'Connection': 'close'

If your 200 response is also gzipped, Scrapy might not gunzip it, in which case you need to set the following header as well:

'Accept-Encoding': ''

And if Scrapy will not do anything with the responses at all, you might need to set the following Spider attribute:

handle_httpstatus_list = [100]
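To keep the two headers above in one place, a small helper like the following could be used. This is only a sketch: the function name headers_for_status_100 and its disable_gzip flag are invented for illustration, and it builds a plain dict rather than touching Scrapy itself.

```python
def headers_for_status_100(disable_gzip=False):
    """Build the request headers described above as a plain dict."""
    headers = {"Connection": "close"}
    if disable_gzip:
        # An empty Accept-Encoding tells the server not to use any content coding.
        headers["Accept-Encoding"] = ""
    return headers


print(headers_for_status_100(disable_gzip=True))
```

The returned dict can then be passed as the headers argument when building a scrapy.Request in the spider.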

Postfix queue management bash scripts

Couple of scripts I used while cleaning up a mail server. I’m sure they can be improved, and the last one is quite specific to my own requirements, but I’ll put them here anyway.

Move emails with a particular subject from the hold queue to the deferred queue:

# change directory to postfix's hold queue directory
cd $(postconf -h queue_directory)/hold
# loop over queue files
for i in * ; do
    # postcat the file, grep for subject "test" and if found
    # run postsuper -H to release the queued message from hold
    postcat "$i" | grep -q '^Subject: test' && postsuper -H "$i"
done

Delete emails in the hold queue that are being sent to a recipient that has already received an email (is in the mail log) or duplicate emails (with the same email/subject):

cd $(postconf -h queue_directory)/hold
NUM=0
# loop over queue files
for i in * ; do
   if [ -f "$i" ]; then
       # the To: line plus the line after it identifies an email
       IDENT=$(postcat "$i" | grep -A 1 "To:")
       RECIPIENT=$(postcat "$i" | grep "To:" | cut -c 5- )
       if grep -q "$RECIPIENT" /root/postfixtmp/logs/mailsent.log; then
           echo "* already sent to $RECIPIENT, deleting $i " | tee -a /root/postfixtmp/queueclean.log
           echo $IDENT | tee -a /root/postfixtmp/queueclean.log
           NUM=$((NUM + 1))
           postsuper -d "$i"
           echo "----" | tee -a /root/postfixtmp/queueclean.log
       else
           # not sent yet: delete every other queued email with the same To: lines
           for o in * ; do
              if [ -f "$o" ]; then
                  if [ "$o" != "$i" ]; then
                     CURRENT=$(postcat "$o" | grep -A 1 "To:")
                     if [ "$CURRENT" = "$IDENT" ]; then
                        echo " duplicate email, deleting $o *" | tee -a /root/postfixtmp/queueclean.log
                        echo $CURRENT | tee -a /root/postfixtmp/queueclean.log
                        NUM=$((NUM + 1))
                        postsuper -d "$o"
                        echo "----" | tee -a /root/postfixtmp/queueclean.log
                     fi
                  fi
              fi
           done
       fi
   fi
done
echo "Deleted $NUM emails" | tee -a /root/postfixtmp/queueclean.log
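The decision logic of that last script can be easier to follow outside the shell. The sketch below is not the script itself: it uses a single pass with a set instead of the nested loop, works on in-memory data rather than calling postcat or postsuper, and the function name, queue IDs and addresses are all invented for illustration.

```python
def plan_deletions(queue, already_sent):
    """Decide which held messages to delete.

    queue maps a queue ID to a (recipient, subject) pair; already_sent is the
    set of recipients found in the mail log. Returns the queue IDs to delete.
    """
    to_delete = []
    seen = set()
    for queue_id, (recipient, subject) in queue.items():
        if recipient in already_sent:
            to_delete.append(queue_id)      # recipient already got an email
        elif (recipient, subject) in seen:
            to_delete.append(queue_id)      # duplicate of an earlier message
        else:
            seen.add((recipient, subject))  # first copy: keep it
    return to_delete


# Hypothetical queue contents and mail log, for illustration only.
queue = {
    "A1": ("bob@example.com", "Invoice"),
    "B2": ("carol@example.com", "Invoice"),
    "C3": ("carol@example.com", "Invoice"),
    "D4": ("dave@example.com", "Welcome"),
}
print(plan_deletions(queue, already_sent={"bob@example.com"}))  # ['A1', 'C3']
```

Computing the list first and deleting afterwards also makes it easy to do a dry run before anything is removed from the queue.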

Rails namespaced models NameError uninitialized constant

I’m using Rails v3.2.3.

When using namespaced models, specify the full class names on database associations to avoid this error.


class Assets::Resource < ActiveRecord::Base
  has_many :assets_resource_users, :class_name => "::Assets::ResourceUser"
end

class Assets::ResourceUser < ActiveRecord::Base
  belongs_to :asset_resource, :class_name => "::Assets::Resource"
end

Put :: at the beginning to specify the namespace from the root.

Also, you should set the foreign key on your associations, or Rails gets confused. For example, if you use resource:references in the migration that creates the ResourceUser above, it will create a column “resource_id”, but Rails looks for “assets_resource_id” by default.

Bash script command not found while command line works

Lots of results on Google say to make sure you have #!/bin/bash at the top of your file, and to make sure the file is not in Windows format (the line endings could mess it up).

In my case, however, I had created a variable called PATH accidentally. This overwrites the built-in PATH environment variable that is responsible for giving the script access to commands, and without the commands available, you’ll get the command not found error.

Make sure to name your variables something else!
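The effect is easy to reproduce. The following sketch runs a tiny two-line script through bash from Python; the script body and the /home/me/uploads value are made up, and it assumes bash is installed and that directory does not contain an ls executable.

```python
import subprocess

# Hypothetical script: PATH is reused as an ordinary variable name,
# so the shell can no longer find external commands such as ls.
script = """
PATH=/home/me/uploads
ls
"""

result = subprocess.run(["bash", "-c", script], capture_output=True, text=True)
print(result.returncode)      # 127 is the shell's "command not found" status
print(result.stderr.strip())  # the error message printed by bash
```

Renaming the variable to something like UPLOAD_PATH makes the same script find ls again.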

Paramiko channel hangs

When sending a command via ssh using paramiko, the script would hang, e.g.:

def ssh_connect(self):
    """ connects to the remote server using paramiko """
    ssh = paramiko.SSHClient()
    ssh.connect(self.hostname, self.remoteport, self.remoteuser, None, None, self.keypath)
    return ssh

def ssh_command(self, command):
    """ executes long command on remote server """
    ssh = self.ssh_connect()
    channel = ssh.invoke_shell()
    stdin = channel.makefile('wb')
    stdout = channel.makefile('rb')
    stdin.write(command)
    ssh_out = stdout.read()  # blocks until the remote shell exits
    stdout.close(); stdin.close(); ssh.close()
    return ssh_out

command = """
    tar -zxvf {rp} || echo 'deploy-copy-untar-error {rp}'
    rm {rp} || echo 'deploy-copy-delete-error {rp}'
    echo 'deploy-copy-success {rp}'
    """.format(rp = remotepath)
ssh_out = self.ssh_command(command)

You need to add 'exit' to the end of the command so the channel quits and the script continues. Like so:

command = """
    tar -zxvf {rp} || echo 'deploy-copy-untar-error {rp}'
    rm {rp} || echo 'deploy-copy-delete-error {rp}'
    echo 'deploy-copy-success {rp}'
    exit
    """.format(rp = remotepath)
ssh_out = self.ssh_command(command)
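The same behaviour can be seen without SSH at all. The sketch below is a local analogy, not paramiko code: it starts sh with piped stdin and stdout (assuming a POSIX sh is available) and shows that read() returns once the shell has been told to exit.

```python
import subprocess

# Start a shell that reads commands from a pipe, similar in spirit to the
# remote shell opened by invoke_shell(). This runs locally; no SSH is involved.
shell = subprocess.Popen(["sh"], stdin=subprocess.PIPE, stdout=subprocess.PIPE)

# The trailing "exit" makes the shell terminate, which closes its stdout.
shell.stdin.write(b"echo deploy-copy-success\nexit\n")
shell.stdin.flush()

# read() only returns at end-of-file, i.e. once the shell has exited.
output = shell.stdout.read()
shell.stdin.close()
shell.wait()
print(output.decode())
```

Without the exit line (and with stdin left open), the read() call would wait forever, which is the hang described above.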

Installing python pip on Ubuntu 10.04 LTS

Just a note: if you try to apt-get install pip, it will get the wrong package.
If you try to apt-get install python-pip, it will get a very old version of pip.
The best thing to do is to download and install it manually.

Ubuntu 11.04 does not have this problem.



Alternatively, use apt-get install python-pip
Then upgrade it:
pip install --upgrade pip
If you do pip --version, it will probably still show 0.3.1
apt-get puts pip into /usr/bin/pip, and upgrading adds the new version to /usr/local/bin/pip (if I remember correctly), so what you can do:
mv /usr/bin/pip /usr/bin/pip-0.3.1
pip --version again should show you 1.2.1, or whatever the latest version is
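If you are not sure which copies of pip are installed, a short script can list every one on the search path in lookup order. This is a minimal sketch; find_on_path is a made-up helper name, not part of pip or Ubuntu.

```python
import os


def find_on_path(name, path=None):
    """Return every executable called `name` on the search path, in lookup order."""
    if path is None:
        path = os.environ.get("PATH", "")
    matches = []
    for directory in path.split(os.pathsep):
        # An empty PATH entry means the current directory.
        candidate = os.path.join(directory or ".", name)
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            matches.append(candidate)
    return matches


print(find_on_path("pip"))
```

The first entry in the list is the one a fresh shell would normally run, so an old copy appearing ahead of the upgraded one explains a stale version number.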

Picking out a part of a string

A limited use-case, but in case you get into the situation where you need only the leading /home/*/public_html part of a longer path (the asterisk matching any word).

And here’s the slightly messy solution I used; in this case, I needed the first three directories of the path the script was running in:

pubpath=`echo $fullpath | rev | cut -c13- | rev`

So I’ve reversed the string, so that I know the exact number of characters from the start that I need to skip; used cut to select part of the string, from character 13 to the end (the reversed form of /home/*/public_html); then reversed it back again.
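For comparison, here is the same trick sketched in Python. The path is hypothetical and was chosen so that the trailing part is exactly 12 characters long, matching the cut -c13- above.

```python
# Hypothetical path: the trailing "/app/scripts" is exactly 12 characters long.
fullpath = "/home/alice/public_html/app/scripts"

# Same idea as rev | cut -c13- | rev: drop the last 12 characters.
pubpath = fullpath[:-12]
print(pubpath)  # /home/alice/public_html

# Alternative that does not depend on the length of the tail:
# keep the leading empty field plus the first three directory names.
first_three = "/".join(fullpath.split("/")[:4])
print(first_three)  # /home/alice/public_html
```

The second form keeps working when the script is moved to a deeper or shallower directory, which the fixed character count does not.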