抱歉,您的浏览器无法访问本站
本页面需要浏览器支持(启用)JavaScript
了解详情 >

项目目的

利用CBE的碱基编辑能将正常氨基酸密码子转换成终止密码子的性能,设计出针对人类、猪、小鼠的全部基因的CBE-STOP芯片。通过TRAP系统的细胞内测试,检测分析所有gRNA介导的STOP效率,最终建立人类、猪、小鼠的CBE-STOP的gRNA效率数据库,供做base-editing相关研究的科研人员使用。

算法

  1. 找出human、pig、mouse的所有已知基因及其全长序列,建立3个基因数据集(所有protein coding gene,不包含非编码RNA基因),这里直接使用物种的exon序列即可;
  2. 提取所有基因的前3个exon,如果只有1-3个exon的,保留所有exon(需要过滤掉在UTR上的外显子,还需要过滤掉外显子长度小于23bp的);
  3. 找出前3个外显子中所有正义和反义链上的gRNA(20 bp序列).
  4. 做一定程度的过滤:
    • 去除23bp中有BsmBI(5‘-CGTCTC-3‘,5‘-GAGACG-3‘)酶切位点的gRNA
    • 去除23bp中有4个及以上连续T的gRNA
    • GC content(20bp)在40%~90%之间
    • 记录脱靶数目
  5. 在每条剩余的gRNA(4-8)bp序列上寻找CGA,CAG,CAA,TGG(反向互补),并判断其是否为潜在的综止密码子
  6. 组装

具体实现步骤

step_1

过滤外显子,从Ensembl上获取所有编码基因的所有外显子序eg:AllExonSeq.fa,外显子信息eg:AllExonInfo.txt首先从信息文件入手,每个外显子去除UTR部分同时要保证序列长度是大于23bp的并且只保留前3个外显子并和序列merge在一起

首先,用R简单过滤信息表格较为方便

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
library(readr)
library(dplyr)
exon <-
read_delim(
"AllExonInfo.txt",
delim = "\t",
col_names = FALSE,
col_types = cols("n", "n", "c", "n", "n", "c", "n", "n", "c")
)
names(exon) <-
c(
"Exon region start (bp)",
"Exon region end (bp)",
"Gene name",
"5' UTR start",
"5' UTR end",
"Strand",
"3' UTR start",
"3' UTR end",
"Chromosome"
)
##先去掉UTR,在过滤掉长度小于32bp的,去重,按照每个基因保留前3个exon
need <-
exon %>% filter(is.na(`5' UTR start`) | is.na(`3' UTR end`)) %>%
mutate(
seqLength = `Exon region end (bp)` - `Exon region start (bp)` + 1,
seqName = paste0(
">",
`Gene name`,
"|",
`Exon region start (bp)`,
"|",
`Exon region end (bp)`,
"|",
Strand
)
) %>%
filter(seqLength > 32) %>%
distinct(seqName, .keep_all = TRUE) %>%
group_by(`Gene name`) %>%
slice(1:3) %>%
select(3, 9, 1, 2, 6, 11) %>%
arrange(Chromosome) %>%
ungroup()
write.table(
need,
"AllExonInfoTop3.txt",
sep = "\t",
row.names = FALSE,
quote = FALSE
)

然后,用python脚本读取fasta序列,并利用pandas库和信息表整合

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
import pandas as pd

name_lst = []
allseq_lst = []
seq_lst = ["test"]
with open("AllExonSeq.fa", "r") as f:
for line in f:
if line.startswith(">"):
allseq_lst.append("".join(seq_lst))
seq_lst = []
name = line.strip("\n")
name_lst.append(name)
else:
seq_lst.append(line.strip("\n"))
allseq_lst.remove("test")
allseq_lst.append(seq_lst)
df = pd.DataFrame({"seqName": name_lst, "sequence": allseq_lst})
df2 = pd.read_csv("AllExonInfoTop3.txt", sep="\t", header=0)
df = df.set_index("seqName")
df2 = df2.set_index("seqName")
df3 = df2.join(df, how="inner")
df3 = df3.reset_index()
df3.drop_duplicates("seqName", keep="first", inplace=True)
df3.to_csv("exonWithAnno.txt", sep="\t", index=None)
ss = df3["seqName"].tolist()
sn = df3["sequence"].tolist()
with open("AllGeneTop3Exon.fa", "a") as l:
for name, seq in zip(ss, sn):
l.write("{}\n{}\n".format(name, seq))

step_2

exon序列小调
这里获取的序列如果是反向序列的话,就会造成后期寻找gRNA的不便,为此,我们先把所有的exon序列转换为统一的方向

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
import pandas as pd


def complement(seq):
return seq.translate(str.maketrans("ACGT", "TGCA"))


def revcomp(seq):
return complement(seq)[::-1]


df = pd.read_csv("exonWithAnno.txt", sep="\t", header=0)
df2 = df[df["Strand"] < 0]
df1 = df[df["Strand"] > 0]
df2["sequence"] = df2["sequence"].apply(revcomp)
fin = pd.concat([df1, df2], axis=0, ignore_index=True)
ss = fin["seqName"].tolist()
nn = fin["sequence"].tolist()
with open("allexoninzhengliantop3.fa", "a") as f:
for name, seq in zip(ss, nn):
f.write(f"{name}\n{seq}\n")

接着,外显子序列数目巨大,很难放在一起寻找gRNA,因此,我们先将该序列拆分,每分留6000条序列

1
seqkit split2 allexoninzhengliantop3.fa -s 6000 -f

事实证明,做这一步反而让后面的分析变得麻烦了

step_3

运行FlashFry软件发现所有的gRNA并打分,
首先,要在ENSEMBL上下载完整的HG38基因组并构建索引

1
2
3
4
5
6
7
8
9
axel -n 50 ftp://ftp.ensembl.org/pub/release-100/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa.gz
mv Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa.gz hg38.fa

java -Xmx4g -jar /dellfsqd2/ST_LBI/USER/panxiaoguang/app/FlashFry/bin/flashfry \
index \
--tmpLocation ./GranTmp \
--database hg38 \
--reference ./hg38.fa \
--enzyme spcas9ngg

然后,分别对10份序列设计gRNA并打分

1
2
3
4
5
6
7
8
9
10
11
12
java -Xmx10g -jar /dellfsqd2/ST_LBI/USER/panxiaoguang/app/FlashFry/bin/flashfry \
discover \
--database hg38 \
--fasta ./test/allexoninzhengliantop3.fa.split/allexoninzhengliantop3.part_001.fa \
--output hg38_exon.part_001.out

java -Xmx10g -jar /dellfsqd2/ST_LBI/USER/panxiaoguang/app/FlashFry/bin/flashfry \
score \
--input hg38_exon.part_001.out \
--output hg38_exon.part_001.scored \
--scoringMetrics doench2016cfd,moreno2015,rank,minot \
--database hg38

step_4

对打分结果做初步过滤,主要包括GC含量,BSMB1位点,4个以上连续T

用R语言内置函数对数据表过滤很方便,这里额外获取了gRNA的基因,正负链以及在基因组上的位置信息

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
library(readr)
library(dplyr)
library(pystr)
args = commandArgs(T)
name_header<-args[1]
df <-
read_delim(
pystr_format("{1}.scored",name_header),
delim = "\t",
col_types = cols(
"c",
"d",
"d",
"c",
"c",
"c",
"c",
"c",
"c",
"c",
"c",
"c",
"c",
"c",
"c",
"c",
"c"
)
)
df2 <- read_delim("./test/exonWithAnno.txt", delim = "\t")
df2 <- df2 %>% select(`Gene name`, Chromosome) %>% distinct(`Gene name`, .keep_all = TRUE)%>%setNames(c("gene","chromosome"))
calGC <- function(x) {
x <- stringr::str_sub(x, 1, 20)
stringr::str_count(x, "[GC]") / 20
}
df <-
df %>% select(contig, start, stop, target, orientation, `0-1-2-3-4_mismatch`) %>%
##记录0-4bp mismatch
tidyr::separate(
`0-1-2-3-4_mismatch`,
into = c("mis0", "mis1", "mis2", "mis3", "mis4"),
sep = ","
) %>%
##计算gRNA在基因组的起始和结束位点
mutate(
genomestart = purrr::map_dbl(contig, function(x)
as.numeric(unlist(strsplit(
x, "\\|"
))[2])),
genomestart = genomestart + start,
genomeend = genomestart + 22
) %>%
##标记正负链
mutate(
strand = purrr::map_dbl(contig, function(x)
as.numeric(unlist(strsplit(
x, "\\|"
))[4])),
strand = if_else(strand > 0, "+", "-")
) %>%
mutate(gene = purrr::map_chr(contig, function(x)
unlist(strsplit(x, "\\|"))[1])) %>%
##记录是否具有BSMB1位点
mutate(bsmb1 = case_when(
stringr::str_detect(target, "CGTCTC") ~ TRUE,
stringr::str_detect(target, "GAGACG") ~ TRUE,
TRUE ~ FALSE
)) %>%
##记录是否具有4个以上的连续T
mutate(nt = if_else(stringr::str_detect(target, "TTTT+"), TRUE, FALSE)) %>%
##记录GC含量
mutate(gcContent = purrr::map_dbl(target, function(x)
calGC(x))) %>%
filter(bsmb1 == FALSE &
nt == FALSE & gcContent >= 0.4 & gcContent <= 0.9) %>%
select(-2, -3)
df <- df %>% left_join(df2, by = "gene")
write.table(
df,
pystr_format("./tmp/{1}.first_filter.txt",name_header),
sep = "\t",
row.names = FALSE,
quote = FALSE
)

step_5

最后一步过滤,要求在gRNA4-8bp空间内具有stop位点,这里采用python脚本去下载好的ATG开头的CDS基因集中反向MAP,正反向 ** (这里的正反向不是基因的正反向,而是gRNA和cds方向的一致性,如果基因在正链上,那么gRNA正向设计的为顺,反向设计为逆;如果基因在负链上,那么gRNA正向设计的为逆,反向设计为顺) ** gRNA要分别运算,最后获取可用的gRNA和其对应的stop 位点。
为了加速最后一步过滤,分成三步完成,最核心的一步采用pypy

step_1.py

1
2
3
4
5
6
7
8
import pandas as pd
import sys
name_header = sys.argv[1]
gRNA_filer = pd.read_csv("./tmp/{}.first_filter.txt".format(name_header), sep="\t", header=0)
gRNA_Z = gRNA_filer[((gRNA_filer["orientation"] == "FWD") & (gRNA_filer["strand"] == "+")) | ((gRNA_filer["orientation"] == "RVS") & (gRNA_filer["strand"] == "-"))].reset_index()[['gene','target']]
gRNA_F = gRNA_filer[((gRNA_filer["orientation"] == "RVS")&(gRNA_filer["strand"] == "+")) | ((gRNA_filer["orientation"] == "FWD")&(gRNA_filer["strand"] == "-"))].reset_index()[['gene','target']]
gRNA_Z.to_csv("./tmp/{}.part_1.txt".format(name_header), sep="\t", index=None, header=None)
gRNA_F.to_csv("./tmp/{}.part_2.txt".format(name_header), sep="\t", index=None, header=None)

step_2.py

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
import re
import sys
name_header = sys.argv[1]
def read_cds(fs):
name_lst = []
seq_lst=[]
with open(fs, 'r') as f:
for line in f:
if line.startswith(">"):
name = line.strip("\n").replace(">", "")
name_lst.append(name)
else:
seq_lst.append(line.strip("\n"))
fin=[(key,value) for key,value in zip(name_lst,seq_lst) if 'unavailable' not in value]
return fin

def complement(seq):
return seq.translate(str.maketrans('ACGT', 'TGCA'))

def revcomp(seq):
return complement(seq)[::-1]

def motif_Z(gRNA, site):
if gRNA[:20][site: site + 3] == "CAG":
return "CAG"
elif gRNA[:20][site: site + 3] == "CGA":
return "CGA"
elif gRNA[:20][site: site + 3] == "CAA":
return "CAA"
else:
return 0
def stop_test_for_Z(n, cds_lst, gRNA):
new_cds_lst = [cds for cds in cds_lst if gRNA in cds]
score_lst = []
if len(new_cds_lst) > 0:
for haha in new_cds_lst:
m = haha.find(gRNA) + 1
score = (m - 1 + n - 1)
score_lst.append(score)
if len([new_score for new_score in score_lst if score % 3 == 0]) > 0:
return 1
else:
return 0
else:
return 0
def motif_F(gRNA, site):
if gRNA[:20][site: site + 3] == "CCA":
return 1
elif gRNA[:20][site-1: site + 2] == "CCA":
return 2
else:
return 0
def stop_test_for_F(n, cds_lst, gRNA):
gRNA = revcomp(gRNA)
new_cds_lst = [cds for cds in cds_lst if gRNA in cds]
score_lst = []
if len(new_cds_lst) > 0:
for haha in new_cds_lst:
m = haha.find(gRNA) + 1 - 1 + 23
score = (m - n)
score_lst.append(score)
if len([new_score for new_score in score_lst if score % 3 == 0]) > 0:
return 1
else:
return 0
else:
return 0
def detective_Z(gRNA, Gene):
#cds_lst=all_cds[all_cds['Gene']==Gene]['seq'].tolist()
cds_lst=[value for (key,value) in all_cds if key==Gene]
stopSitelst=[]
for m in re.finditer("C", gRNA[:20]):
if m.start() + 1 >= 4 and m.start() + 1 <= 8:
if motif_Z(gRNA, m.start()):
if stop_test_for_Z((m.start() + 1), cds_lst, gRNA):
stopSitelst.append(str(m.start() + 1))
if len(stopSitelst) > 0:
fin="/".join(stopSitelst)
else:
fin="None"
return fin

def detective_F(gRNA, Gene):
#cds_lst=all_cds[all_cds['Gene']==Gene]['seq'].tolist()
cds_lst=[value for (key,value) in all_cds if key==Gene]
stopSitelst=[]
for m in re.finditer("C", gRNA[:20]):
if m.start() + 1 >= 4 and m.start() + 1 <= 8:
if motif_F(gRNA, m.start()):
if stop_test_for_F(((gRNA[:20][m.start() - 1 : m.start() + 3]).find("CCA") + 1 + 2), cds_lst, gRNA):
stopSitelst.append(str(m.start() + 1))
if len(stopSitelst) > 0:
fin="/".join(stopSitelst)
else:
fin="None"
return fin

all_cds = read_cds("AllGeneCDS2.fa")
shunshi=[]
with open("./tmp/{}.part_1.txt".format(name_header),'r') as f:
for line in f:
gene,target=line.split("\t")
target=target.strip("\n")
shunshi.append((gene,target))

new_lst=[(key,value,detective_Z(value,key)) for (key,value) in shunshi]

with open("./tmp/{}.shunshi.out".format(name_header),"a") as f:
for (gene,target,site) in new_lst:
f.write("{}\t{}\t{}\n".format(gene,target,site))

fanshi=[]
with open("./tmp/{}.part_2.txt".format(name_header),'r') as f:
for line in f:
gene,target=line.split("\t")
target=target.strip("\n")
fanshi.append((gene,target))

new_lst2=[(key,value,detective_F(value,key)) for (key,value) in fanshi]

with open("./tmp/{}.fanshi.out".format(name_header),"a") as f:
for (gene,target,site) in new_lst2:
f.write("{}\t{}\t{}\n".format(gene,target,site))

step_3.py

1
2
3
4
5
6
7
8
9
import pandas as pd
import sys
name_header = sys.argv[1]
df1 = pd.read_csv("./tmp/{}.shunshi.out".format(name_header), sep="\t", header=None)
df2 = pd.read_csv("./tmp/{}.fanshi.out".format(name_header), sep="\t", header=None)
fin = pd.concat([df2, df1], ignore_index=True)
fin.columns = ['gene', 'target', 'site']
fin2 = fin[fin['site'] != "None"]
fin2.to_csv("./tmp/{}.second_filter.txt".format(name_header),sep="\t",index=None)

最后整合的shell为:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
#!/usr/bin/bash 
name_header="hg38_exon.part_001"
/dellfsqd2/ST_LBI/USER/panxiaoguang/app/miniconda3/envs/Rlibs2/bin/Rscript first_filter.R $name_header
echo "first filter Done! $(date "+%Y-%m-%d %H:%M:%S")"
echo "first step: cut into two groups! $(date "+%Y-%m-%d %H:%M:%S")"
/dellfsqd2/ST_LBI/USER/panxiaoguang/app/miniconda3/envs/pylibs/bin/python3 step_1.py $name_header
echo "step1 done! $(date "+%Y-%m-%d %H:%M:%S")"
echo "second step: filter stop condon site! $(date "+%Y-%m-%d %H:%M:%S")"
/dellfsqd2/ST_LBI/USER/panxiaoguang/app/miniconda3/envs/pypy/bin/pypy3 step_2.py $name_header
echo "step2 done! $(date "+%Y-%m-%d %H:%M:%S")"
echo "second step: merge data! $(date "+%Y-%m-%d %H:%M:%S")"
/dellfsqd2/ST_LBI/USER/panxiaoguang/app/miniconda3/envs/pylibs/bin/python3 step_3.py $name_header
echo "ALL done! $(date "+%Y-%m-%d %H:%M:%S")"

csvtk -t join -f "gene,target" ./tmp/${name_header}.second_filter.txt ./tmp/${name_header}.first_filter.txt --left-join \
|csvtk -t cut -f gene,target,site,strand,orientation,chromosome,genomestart,genomeend,mis0,mis1,mis2,mis3,mis4,gcContent \
|csvtk pretty > ./tmp/${name_header}.finally.txt

/dellfsqd2/ST_LBI/USER/panxiaoguang/app/miniconda3/envs/Rlibs2/bin/Rscript getbed.R $name_header

bedtools getfasta -fi hg38.fa -bed ./tmp/${name_header}.bed -name -tab -s -fo ./tmp/${name_header}.trap.fa

组装

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
library(readr)
library(dplyr)
library(pystr)
library(purrr)
library(openxlsx)
merge_data<-function(x){
name_header<-pystr_format("hg38_exon.part_0{1}",x)
df1<-read_delim(pystr_format("{1}.finally.txt",name_header),delim="\t")
df2<-read_delim(pystr_format("{1}.trap.fa",name_header),delim="\t",col_names=FALSE)
names(df2)<-c("gene","trapSeq")
fin<-bind_cols(df1,df2['trapSeq'])
}
fs_lst<-c("01","02","03","04","05","06","07","08","09","10")
test<-map_dfr(fs_lst,merge_data)
wocao<-test%>%
transmute(Gene=gene,
Target=target,
TrapSeq=trapSeq,
GeneStrand=strand,
gRNAstrand=orientation,
Loc=paste0("chr",chromosome,":",genomestart,"-",genomeend),
`Mismatch(0-1-2-3-4)`=paste0(mis0,"-",mis1,"-",mis2,"-",mis3,"-",mis4),
GCcontent=gcContent,
StopSite=site)%>%
arrange(Gene)
wb <- createWorkbook("test")
addWorksheet(wb, "humanstop")
writeData(wb, sheet = "humanstop", wocao, rowNames = FALSE)
saveWorkbook(wb, "HumanStopCBE.xlsx", overwrite = TRUE)

评论