# Example kWIP Analysis Protocols¶

These protocols should work on any computer running a unix-like OS with sufficient resources (8GB RAM, two cores, 100GB disk). If any step fails, please re-try that step. If you are unable to complete a protocol, this is probably a bug in either kWIP or this protocol, so please let me know by creating an issue on GitHub.

## Oryza sativa grouping¶

This example uses data from the 3000 Rice Genomes Project. The rice genome isn’t tiny, so you’ll want a machine with at least 8GB RAM, and a least 2 cores.

### Requirements¶

• virutalenvwrapper
• git
• the SRA toolkit
• kWIP, installed such that the kwip binary is in $PATH • We also need the img.R script which comes with kWIP, but isn’t installed. Hereafter I assume that you have this script in the analysis directory, ~/rice_kwip. Copy it with cp util/img.R ~/rice_kwip from the directory you git clone-d or unziped kWIP into. ### Setup¶ Create a working directory mkdir ~/rice_kwip cd ~/rice_kwip  Install SRApy and khmer using pip. We will require the master branch of khmer until the 2.0 release, as bugs exist in the latest release uploaded to PyPI. mkvirtualenv rice-kwip pip install srapy pip install -e git+https://github.com/dib-lab/khmer.git  Create a list of SRA identifiers in file sra_run_ids.txt using the following command: cat >sra_run_ids.txt <<EOF ERR626208 ERR626209 ERR626210 ERR626211 ERR619511 ERR619512 ERR619513 ERR619514 EOF  Download the above sra files: mkdir sra_runs get-run.py -s -f sra_run_ids.txt -d sra_files  And export them to FASTQ files: mkdir fastqs for srr in$(cat sra_run_ids.txt)
do
fastq-dump                              \
--split-spot                        \
--skip-technical                    \
--stdout                            \
--defline-seq '@$sn/$ri'            \
--defline-qual '+'                  \
sra_files/${srr}.sra | \ gzip -1 > fastqs/${srr}.fastq.gz
done


Don’t worry about the full details of this fastq-dump, but it produces a gzipped interleaved fastq file.

### Hashing¶

We directly utilise khmer‘s load-into-counting.py to hash reads to a hash (Countgraph).

mkdir hashes
for srr in $(cat sra_run_ids.txt) do load-into-counting.py \ -N 1 \ -x 1e9 \ -k 20 \ -b \ -f \ -s tsv \ hashes/${srr}.ct.gz     \
fastqs/\${srr}.fastq.gz
done


This creates a hash with a single table and a billion bins for each run. Hashes are saved, with gzip compression, to the *.ct.gz files under ./hashes. These hashes are the direct input to kwip. Note that this hash is probably a bit small for this dataset, but we will go ahead anyway so this works on most modern laptops.

### Distance Calculation¶

So here’s the core of the protocol: calculating the pairwise distances between these samples, which are from the two major groups of rice, Indica and Japonica.

kwip                            \
-t 2                        \
-k rice.kern                \
-d rice.dist                \
hashes/*.ct.gz


This should calculate the weighted distance matrix between these samples, using two threads.

Now, we plot these results using the R script img.R. This creates plots of the distance and kernel matrices, as well as a cluster dendrogram and multi-dimensional scaling plot.

Rscript img.R rice


This should create rice.pdf. Inspect, and you should see two large groupings corresponding to the two rice families.