Example `kWIP` Analysis Protocols¶

These protocols should work on any computer running a unix-like OS with sufficient resources (8GB RAM, two cores, 100GB disk). If any step fails, please re-try that step. If you are unable to complete a protocol, this is probably a bug in either kWIP or this protocol, so please let me know by creating an issue on GitHub.

Oryza sativa grouping¶

This example uses data from the 3000 Rice Genomes Project. The rice genome isn’t tiny, so you’ll want a machine with at least 8GB RAM, and a least 2 cores.

Requirements¶

virutalenvwrapper
git
the SRA toolkit
kWIP, installed such that the kwip binary is in $PATH
We also need the img.R script which comes with kWIP, but isn’t installed. Hereafter I assume that you have this script in the analysis directory, ~/rice_kwip. Copy it with cp util/img.R ~/rice_kwip from the directory you git clone-d or unziped kWIP into.

Setup¶

Create a working directory

mkdir ~/rice_kwip
cd ~/rice_kwip

Install SRApy and khmer using pip. We will require the master branch of khmer until the 2.0 release, as bugs exist in the latest release uploaded to PyPI.

mkvirtualenv rice-kwip
pip install srapy
pip install -e git+https://github.com/dib-lab/khmer.git

Create a list of SRA identifiers in file sra_run_ids.txt using the following command:

cat >sra_run_ids.txt <<EOF
ERR626208
ERR626209
ERR626210
ERR626211
ERR619511
ERR619512
ERR619513
ERR619514
EOF

Download the above sra files:

mkdir sra_runs
get-run.py -s -f sra_run_ids.txt -d sra_files

And export them to FASTQ files:

mkdir fastqs
for srr in $(cat sra_run_ids.txt)
do
    fastq-dump                              \
        --split-spot                        \
        --skip-technical                    \
        --stdout                            \
        --readids                           \
        --defline-seq '@$sn/$ri'            \
        --defline-qual '+'                  \
        sra_files/${srr}.sra |              \
        gzip -1 > fastqs/${srr}.fastq.gz
done

Don’t worry about the full details of this fastq-dump, but it produces a gzipped interleaved fastq file.

Hashing¶

We directly utilise khmer‘s load-into-counting.py to hash reads to a hash (Countgraph).

mkdir hashes
for srr in $(cat sra_run_ids.txt)
do
    load-into-counting.py       \
        -N 1                    \
        -x 1e9                  \
        -k 20                   \
        -b                      \
        -f                      \
        -s tsv                  \
        hashes/${srr}.ct.gz     \
        fastqs/${srr}.fastq.gz
done

This creates a hash with a single table and a billion bins for each run. Hashes are saved, with gzip compression, to the *.ct.gz files under ./hashes. These hashes are the direct input to kwip. Note that this hash is probably a bit small for this dataset, but we will go ahead anyway so this works on most modern laptops.

Distance Calculation¶

So here’s the core of the protocol: calculating the pairwise distances between these samples, which are from the two major groups of rice, Indica and Japonica.

kwip                            \
    -t 2                        \
    -k rice.kern                \
    -d rice.dist                \
    hashes/*.ct.gz

This should calculate the weighted distance matrix between these samples, using two threads.

Now, we plot these results using the R script img.R. This creates plots of the distance and kernel matrices, as well as a cluster dendrogram and multi-dimensional scaling plot.

Rscript img.R rice

This should create rice.pdf. Inspect, and you should see two large groupings corresponding to the two rice families.

Example `kWIP` Analysis Protocols¶

Oryza sativa grouping¶

Requirements¶

Setup¶

Hashing¶

Distance Calculation¶

Table Of Contents

Related Topics

This Page

Example kWIP Analysis Protocols¶

Oryza sativa grouping¶

Requirements¶

Setup¶

Hashing¶

Distance Calculation¶

Example `kWIP` Analysis Protocols¶