Last week I wrote an essay on electron microscopy (EM) and X-ray crystallography. I originally thought X-ray’s superiority in resolution has always been unrivalled, but the situation is changing—the best resolution achived by cryo-EM is now comparable with the average/median resolution achived by X-ray crystallography, which is around 2.3Å (this resolution allows for unambiguous identification of most amino acid residues in proteins). Recently it’s even reported that 1.75Å resolution has been achieved using cryo-EM, which really struck me. Thus, I was motivated to visualise the resolution revolution of cryo-EM. This article is a record of how I downloaded all PDB files from the protein data bank and did some simple analyses on them.
1 Getting All the PDB Files
All the PDB files are available via PDB’s FTP server. Simply cd
to a directory and use wget
to download all of them.
mkdir pdb; cd pdb; mkdir compressed decompressed; cd compressed
wget ftp://ftp.wwpdb.org/pub/pdb/data/structures/all/pdb/*
Depending on the traffic, The first download will take 2~4 days. After the first download, to sync the local database with the FTP server, add the -N
option to wget
(essentially this means downloading “new” files only).
The *.ent.gz
files downloaded are compressed. You can use gunzip
to decompress them to ready-to-use *.ent
files (which is written in PDB format), but that will remove the *.gz
files, which is inconvenient if you want to keep your database in sync with the remote. Use gzcat
(or equivalently gunzip -c
) instead to preserve the *.gz
files while decompressing:
find . -name "*.gz" | xargs -I{} -n1 bash -c '
src={}; dst=../decompressed/$(basename ${src%*.gz})
[ -f $dst ] && echo "$dst already exists, skipping..." || ( gzcat -cv $src > $dst )'
You can monitor the progress from the verbose output of gzcat
:
./pdb1byf.ent.gz: 78.1%
./pdb1byg.ent.gz: 76.3%
./pdb1byh.ent.gz: 77.3%
./pdb1byi.ent.gz: 75.2%
...
2 Parsing PDB files
The decompressed *.ent
files are in accordance with the PDB format specification, which is accessible here. Many ready-to-use PDB parsers are available, such as in BioPython’s PDB module. However, I am only interested in three variables: experimental technique, resolution and date, and it would be a overkill to parse entire PDB files using those standard parsers. Instead, I manually wrote a parser (with python) with reference to the format specification:
#!/usr/bin/env python3.8
import sys
for fn in sys.argv[1:]:
with open(fn) as f:
HEAD = f.readline()
date, id = HEAD[50:59], HEAD[62:66]
while True:
line = f.readline()
if line[:6] == 'EXPDTA':
method = line[10:].rstrip()
if line[:6] == 'REMARK':
if line[11:22] == 'RESOLUTION.':
resolution = line[23:30] if line[31:40] == 'ANGSTROMS' else 'NA'
break
print(','.join([id, date, resolution, method]))
Then I parsed the data, 100 files at a time (took 25 min in total):
echo "id,date,resolution,method" > date_method_resolution.csv
find decompressed -name '*.ent' | xargs -n100 ./parse >> date_method_resolution.csv
To see whether it is Python that causes this operation slow, I used Go to rewrite the parser:
Expand to view
package main
import (
"bufio"
"fmt"
"os"
"strings"
)
func check(e error) {
if e != nil {
panic(e)
}
}
func main() {
for _, arg := range os.Args[1:] {
f, err := os.Open(arg)
check(err)
b1 := make([]byte, 81)
f.Read(b1)
check(err)
fmt.Print(string(b1[50:59]) + "," + string(b1[62:66]) + ",")
scanner := bufio.NewScanner(f)
var method string
var resolution string
for scanner.Scan() {
var line string
line = scanner.Text()
if line[:6] == "EXPDTA" {
method = strings.TrimSpace(line[10:])
}
if line[:6] == "REMARK" {
if line[11:22] == "RESOLUTION." {
if line[31:40] == "ANGSTROMS" {
resolution = strings.TrimSpace(line[23:30])
} else {
resolution = "NA"
}
break
}
}
}
fmt.Println(resolution + "," + method)
if err := scanner.Err(); err != nil {
panic(err)
}
}
}
And it turns out that Go achived a similar speed as Python, meaning the speed is limited by the file system I/O instead.
The csv file looks like this:
| id | date | resolution | method |
| ---- | --------- | ---------- | ----------------- |
| 100D | 05-DEC-94 | 1.90 | X-RAY DIFFRACTION |
| 101D | 14-DEC-94 | 2.25 | X-RAY DIFFRACTION |
| 101M | 13-DEC-97 | 2.07 | X-RAY DIFFRACTION |
3 Data Analysis with R
3.0.1 Prerequisites
First, load the required packages:
library(tidyverse)
## ── Attaching packages ──────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.2.1 ✓ purrr 0.3.3
## ✓ tibble 2.1.3 ✓ dplyr 0.8.4
## ✓ tidyr 1.0.2 ✓ stringr 1.4.0
## ✓ readr 1.3.1 ✓ forcats 0.4.0
## ── Conflicts ─────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(lubridate)
##
## Attaching package: 'lubridate'
## The following object is masked from 'package:base':
##
## date
library(scales)
##
## Attaching package: 'scales'
## The following object is masked from 'package:purrr':
##
## discard
## The following object is masked from 'package:readr':
##
## col_factor
3.1 Load Data
Parse the csv data file with readr::read_csv()
. It is a good practice to specify the data type of each column (c
means ‘character’, d
means ‘double precision floating number’). Note the date column cannot be directly parsed, and it should be read as a character column and parsed by lubridate::date()
in a separate step. The entries without resolution data are filtered out.
pdb <- read_csv('./src/pdb/date_method_resolution.csv',
col_types = 'ccdc') %>%
mutate(date = dmy(date)) %>%
filter(!is.na(resolution))
3.2 Tidying Data
Note that some entries have more than one experimental methods, and individual columns are separated by semicolons, meaning there could be something like X-RAY DIFFRACTION; ELECTRON MICROSCOPY; SOLUTION NMR
in the method
column. The following code convert such entries into entries each with the first method of the original, dropping the rest:
single_method <- pdb %>% filter(!str_detect(method, ';'))
multiple_method <- pdb %>% filter(str_detect(method, ';'))
pdb <- bind_rows(single_method, do.call(bind_rows,
lapply(1:length(multiple_method[[1]]),function(i) {
original_row = multiple_method[i,]
mutate(original_row, method=str_split(original_row[['method']], '; ')[[1]][1])
}
)
))
You can also use a for loop (shown below) to bind new rows to pdb
, but this is noticeably slower than the lapply()
and do.call()
equivalent (shown above). In R, use those vectorised and highly efficient functions whenever possible (they should be able to replace almost all for loops).
pdb <- single_method
for (i in 1:length(multiple_method[[1]])) {
original_row = multiple_method[i,]
bind_rows(pdb, mutate(original_row, method=str_split(original_row[['method']], '; ')[[1]][1]))
}
3.3 Plotting
I used the ggplot2
package for plotting.
3.3.1 Plot Setup
ggplot2::theme_set()
sets global themes for all plots.
ggplot2
doesn’t have a built-in reverse_log10
scaling, which will be needed, so I used scales::trans_new()
to make this.
theme_set(
theme_bw()+
theme(
text = element_text(family = 'Courier'),
plot.title = element_text(hjust = 0.5)
)
)
reverselog_trans <- function(base = exp(1)) {
trans <- function(x) -log(x, base)
inv <- function(x) base^(-x)
trans_new(paste0("reverselog-", format(base)), trans, inv,
log_breaks(base = base),
domain = c(1e-100, Inf))
}
3.3.2 Median Resolution of Different Methods Over Time
This plot shows the median resolution of different structure-solving methods every year. Size represents log(number_of_cases)
.
pdb %>% mutate(year = year(date)) %>%
group_by(year, method) %>%
summarise(
median = median(resolution, na.rm = TRUE),
n = n()) %>%
mutate( alpha = ifelse(method=='X-RAY DIFFRACTION', 'A', 'B')) %>%
ggplot(aes(year, median, color=method, size=log(n), alpha=alpha))+
geom_point()+
scale_y_continuous(breaks = seq(0,25,5))+
scale_alpha_discrete(range=c(0.7, 0.9), guide=FALSE)+
labs(title='Median Resolution of Different Methods Over Time')+
scale_color_brewer(palette = 'Set2')
3.3.3 All Cases of X-Ray and EM Structures
These plots show the resolution and date of all cases as dots of EM and X-Ray Crystallography, respectively. Linear fits are added.
ggplot(filter(pdb, method =='ELECTRON MICROSCOPY', date>ymd(20000101), resolution < 20), aes(date, resolution))+
geom_point(alpha=0.1)+
labs(title = 'Resolution of EM Structures Over Time')+
scale_y_continuous(limits = c(0, 20), breaks = c(2, 3, 4, seq(0, 20, 5)), minor_breaks = c(7.5, 12.5, 17.5, seq(2, 5, 0.5)))+
geom_smooth(method = 'lm')
ggplot(filter(pdb, method =='X-RAY DIFFRACTION'), aes(date, resolution))+
geom_point(alpha=0.1)+
scale_y_continuous(limits = c(0, 10), breaks = seq(0, 10, 2), minor_breaks = c(seq(1,9,2), seq(1,3,0.1)))+
geom_smooth(method = 'lm')+
labs(title = 'Resolution of X-Ray Crystallography Structures Over Time')
EM structures show a strong trend of improvement in resolution over time. For X-Ray crystallography, however, although the maximum resolution improves over time, the number of published low-resolution structures also increases, making the median virtually unchanged over time.
3.3.4 Boxplots
Boxplots summarise the resolution, grouped by year, of different structural biology methods. Here the reverse_log10
scale is used, so that lower resolution values correspond to ‘high positions’ in the plot, which is more intuitive:
boxPlotScale = scale_y_continuous(
minor_breaks = c(1:10, seq(10, 100, 10), seq(0.1, 1, 0.1)),
breaks=c(0.5, 1, 2, 3, 5, 10, 20, 60),
trans=reverselog_trans(10))
ggplot(mutate(pdb, year = year(date)), aes(year, resolution, group=year))+
geom_boxplot()+
facet_wrap(~method)+
boxPlotScale+
scale_x_reverse()+
coord_flip()+
labs(title = 'Comparison of Resolution of Different Methods Over Time')
To compare EM and X-ray crystallography:
ggplot(mutate(pdb, year = year(date)) %>% filter(method %in% c('X-RAY DIFFRACTION', 'ELECTRON MICROSCOPY')), aes(year, resolution, group=year))+
geom_boxplot()+
facet_wrap(~method, ncol = 1)+
boxPlotScale+
labs(title = 'Comparison of Resolution of EM and X-Ray Crystallography Over Time')