【Survey】BLINK ＆ PLINK

這是前一陣子（2019-08-28）的筆記了…，那一陣子從「全基因體定序」、「全基因體圖譜繪製」到「全基因體關聯分析（GWAS）」都稍微沾了點，但計畫最終不了了之，只好先擱置一旁…

不過沒留下點紀錄真的很可惜，所以趁著兩個專案間的空檔，把工作時留下的隨手筆記從草稿夾抓出來整理整理。

不過不是立即記錄下的筆記，難免缺三漏四的…

大量複製貼上

由於這篇這我的隨手筆記重新排版而成，大部分的內容可能是直接複製貼上...。當然我最後有附上參考資料...不過，不確定這樣可不可行...。

凌亂注意

此篇就只是隨手筆記，內容非常的不連續。

全基因體關聯分析（GWAS）

全基因體關體分析（Genome Wide Association Study）是指在人類全基因體範圍內找出存在的序列變異，即單核苷酸多態性（SNP），從中篩選出與疾病相關的 SNPs。

白話一點的說法就是，先決定一個想研究的性狀/疾病，然後去找到兩群人，一群有這性狀/疾病，另外一群做對照。考慮可能的混淆因素，並比對這兩群人所有/可能的 SNPs，最後依照基因型頻率分布來找到與該性狀/疾病相關的遺傳位點。之後的應用可以在搭配醫療資料的大數據分析，以進行疾病預防及治療…等。

而全基因體關聯研究（GWAS）效能取決於三個因素：computing speed, memory requirements, and statistical power，這三個因素取決於所使用工具的統計方法與工具本身設計。

在 GWAS 常用演算法有三種常用：BLINK、FarmCPU 和 PLINK，不過這三種演算法偏向統計方法的改進，這篇的目的就是為了 Survey 這些工具。

PLINK

常見格式型別

PLINK軟體輸入檔案的常見格式型別：

一般格式（PLINK text）：PED/MAP
1

$ plink --file toy
會找名為 toy.ped and toy.map 的檔案，如要指定不同的 ped 與 map，則使用
1
2

--ped <filename> --map <filename>
二進位制格式（PLINK 1 binary）：BED/BIM/FAM
1
2
3
4

--bfile [prefix] --bed <filename> --bim <filename> --fam <filename>
PLINK 1 binary is PLINK 1.9’s preferred input format.. In fact, PLINK 1.9 automatically converts most other formats to PLINK 1 binary before the main loading sequence.

所以可以直接用來把其他檔案轉換成二進位制格式
1

$ plink --file text_fileset --out binary_fileset
或是加上 –make-bed 做過濾

如要保留執行過程中的轉換結果，則是用--keep-autoconv
1

$ plink --file text_fileset --freq --keep-autoconv --out results
Variant Call Format
1
2

--vcf <filename> --bcf <filename>
VCF / BCF 的 filename 要打全稱，也就是要加上副檔名。此外提供不同的參數需依照資料格式來進行調整。

VCF reference alleles are set to A2 by the autoconverter even when they appear to be minor. However, to maintain backwards compatibility with PLINK 1.07, PLINK 1.9 normally forces major alleles to A2 during its loading sequence. One workaround is permanently keeping the .bim file generated during initial conversion, for use as –a2-allele input whenever the reference sequence needs to be recovered. (If you use this method, note that, when your initial conversion step invokes –make-bed instead of just –out, you also need –keep-allele-order to avoid losing track of reference alleles before the very first write, because –make-bed triggers the regular loading sequence.)

其他：

transposed text fileset：TPED/TFAM

--tfile [prefix]  
--tped <filename>  
--tfam <filename>

long-format fileset：LGEN

 --lfile [prefix]  
 --lgen <filename>  
 --reference <filename>  
 --allele-count

Irregularly-formatted PLINK text files

輸入過濾

GWAS 粗略流程圖（圖片來源: 知乎）

GWAS基本分析内容（圖片來源: 簡書）

準備

# SNP完整度在95%
$ plink --file wgas1 --make-bed --out wgas1 --mind 0.05

結果：

Logging to wgas1.log.
Options in effect:
  --file wgas1
  --make-bed
  --mind 0.05
  --out wgas1

15920 MB RAM detected; reserving 7960 MB for main workspace.
.ped scan complete (for binary autoconversion).
Performing single-pass .bed write (228694 variants, 90 people).
--file: wgas1-temporary.bed + wgas1-temporary.bim + wgas1-temporary.fam
written.
228694 variants loaded from .bim file.
90 people (45 males, 45 females) loaded from .fam.
90 phenotype values loaded from .fam.
1 person removed due to missing genotype data (--mind).
ID written to wgas1.irem .
Using 1 thread (no multithreaded calculations invoked).
Before main variant filters, 89 founders and 0 nonfounders present.
Calculating allele frequencies... done.
Total genotyping rate in remaining samples is 0.995473.
228694 variants and 89 people pass filters and QC.
Among remaining phenotypes, 48 are cases and 41 are controls.
--make-bed to wgas1.bed + wgas1.bim + wgas1.fam ... done.

過濾

Missingness per individual –mind 0.1 >10% 排除 ###樣本缺失率
Missingness per marker –geno 0.1 > 10% 排除 ###位點缺失率
Allele frequency –maf 0.05 MAF <= 0.05 排除 ###最小等位基因頻率

樣本缺失率 +位點缺失率

 $ plink --bfile wgas1 --missing --out miss_stat

結果

 Logging to miss_stat.log.
 Options in effect:
 --bfile wgas1
 --missing
 --out miss_stat

 15920 MB RAM detected; reserving 7960 MB for main workspace.
 228694 variants loaded from .bim file.
 89 people (44 males, 45 females) loaded from .fam.
 89 phenotype values loaded from .fam.
 Using 1 thread (no multithreaded calculations invoked).
 Before main variant filters, 89 founders and 0 nonfounders present.
 Calculating allele frequencies... done.
 Total genotyping rate is 0.995473.
 --missing: Sample missing data report written to miss_stat.imiss, and
 variant-based missing data report written to miss_stat.lmiss.

.imiss : 樣本缺失率
.lmiss : 位點缺失率

最小等位基因頻率

$ plink --bfile wgas1 --freq --out freq_stat

結果

Logging to freq_stat.log.
Options in effect:
  --bfile wgas1
  --freq
  --out freq_stat

15920 MB RAM detected; reserving 7960 MB for main workspace.
228694 variants loaded from .bim file.
89 people (44 males, 45 females) loaded from .fam.
89 phenotype values loaded from .fam.
Using 1 thread (no multithreaded calculations invoked).
Before main variant filters, 89 founders and 0 nonfounders present.
Calculating allele frequencies... done.
Total genotyping rate is 0.995473.
--freq: Allele frequencies (founders only) written to freq_stat.frq .

區分群組的計算MAF

$ plink --bfile wgas1 --within pop.cov --freq --out freq_stat

Logging to freq_stat.log.
Options in effect:
  --bfile wgas1
  --freq
  --out freq_stat
  --within pop.cov

15920 MB RAM detected; reserving 7960 MB for main workspace.
228694 variants loaded from .bim file.
89 people (44 males, 45 females) loaded from .fam.
89 phenotype values loaded from .fam.
--within: 2 clusters loaded, covering a total of 89 people.
Using 1 thread (no multithreaded calculations invoked).
Before main variant filters, 89 founders and 0 nonfounders present.
Calculating allele frequencies... done.
Total genotyping rate is 0.995473.
--freq: Cluster-stratified allele frequencies (founders only) written to
freq_stat.frq.strat 

主成分分析

在全基因體關聯性研究(Genome-Wide Association Study, GWAS)的分析中，干擾因子(confounder)的控制是重要的課題，其中最重要的干擾因子為族群分層(population stratification)。性狀(trait)或疾病的分布在不同族群中會不一樣，SNP的頻率在不同族群的分布也會不一樣，因此在探討兩者的關係時，若沒有考慮到族群分層，可能會出現偽陽性或偽陰性的結果，即使分析的研究對象都是來自單一族群，仍可能存在次分群的狀況。所以在完成 QC 後，進行相關性檢定前，需對研究族群進行主成分分析(Principle Component Analysis, PCA)，隨後將PCA結果作為共變數(covariate)，加入迴歸模型中以控制族群分層的影響。

./plink --bfile wgas1 --pca 10 --out wgas1_pca

結果

Logging to wgas1_pca.log.
Options in effect:
--bfile wgas1
--out wgas1_pca
--pca 10

15920 MB RAM detected; reserving 7960 MB for main workspace.
228694 variants loaded from .bim file.
89 people (44 males, 45 females) loaded from .fam.
89 phenotype values loaded from .fam.
Using up to 4 threads (change this with --threads).
Before main variant filters, 89 founders and 0 nonfounders present.
Calculating allele frequencies... done.
Total genotyping rate is 0.995473.
228694 variants and 89 people pass filters and QC.
Among remaining phenotypes, 48 are cases and 41 are controls.
Relationship matrix calculation complete.
--pca: Results saved to wgas1_pca.eigenval and wgas1_pca.eigenvec .

關聯分析
https://zhuanlan.zhihu.com/p/72490817
http://zzz.bwh.harvard.edu/plink/anal.shtml
https://www.cog-genomics.org/plink/1.9/assoc

 $ plink --bfile wgas1 --assoc --out as1

 Logging to as1.log.
 Options in effect:
 --assoc
 --bfile wgas1
 --out as1

 15920 MB RAM detected; reserving 7960 MB for main workspace.
 228694 variants loaded from .bim file.
 89 people (44 males, 45 females) loaded from .fam.
 89 phenotype values loaded from .fam.
 Using 1 thread (no multithreaded calculations invoked).
 Before main variant filters, 89 founders and 0 nonfounders present.
 Calculating allele frequencies... done.
 Total genotyping rate is 0.995473.
 228694 variants and 89 people pass filters and QC.
 Among remaining phenotypes, 48 are cases and 41 are controls.
 Writing C/C --assoc report to as1.assoc ... done.

小樣本考慮 FDR 校正

 $ plink --bfile wgas1 --assoc --adjust --out as2

 Logging to as2.log.
 Options in effect:
 --adjust
 --assoc
 --bfile wgas1
 --out as2

 15920 MB RAM detected; reserving 7960 MB for main workspace.
 228694 variants loaded from .bim file.
 89 people (44 males, 45 females) loaded from .fam.
 89 phenotype values loaded from .fam.
 Using 1 thread (no multithreaded calculations invoked).
 Before main variant filters, 89 founders and 0 nonfounders present.
 Calculating allele frequencies... done.
 Total genotyping rate is 0.995473.
 228694 variants and 89 people pass filters and QC.
 Among remaining phenotypes, 48 are cases and 41 are controls.
 Writing C/C --assoc report to as2.assoc ... done.
 --adjust: Genomic inflation est. lambda (based on median chisq) = 1.3727.
 --adjust values (188187 variants) written to as2.assoc.adjusted .

關聯分析 - 經過一定篩選的特定位點的細化分析

 ./plink --bfile wgas1 --model --out mod1

 CHR         SNP   A1   A2     TEST            AFF          UNAFF        CHISQ   DF            P
 rs3094315    G    A     GENO        0/14/33         0/7/34           NA   NA           NA
 rs3094315    G    A    TREND          14/80           7/75        1.948    1       0.1628
 rs3094315    G    A  ALLELIC          14/80           7/75        1.684    1       0.1944
 rs3094315    G    A      DOM          14/33           7/34           NA   NA           NA
 rs3094315    G    A      REC           0/47           0/41           NA   NA           NA
 rs6672353    A    G     GENO         0/1/47         0/0/40           NA   NA           NA
 rs6672353    A    G    TREND           1/95           0/80       0.8429    1       0.3586
 rs6672353    A    G  ALLELIC           1/95           0/80       0.8381    1       0.3599
 rs6672353    A    G      DOM           1/47           0/40           NA   NA           NA

分層分析

 ./plink --bfile wgas1 --cluster --mc 2 --ppc 0.05 --out str1

 $ cat str1.cluster1 
 SOL-0	 CH18526_NA18526 CH18637_NA18637
 SOL-1	 CH18524_NA18524 JA18976_NA18976
 SOL-2	 CH18529_NA18529 CH18603_NA18603
 SOL-3	 CH18558_NA18558 CH18623_NA18623
 SOL-4	 CH18532_NA18532 CH18622_NA18622
 SOL-5	 CH18561_NA18561 CH18566_NA18566
 SOL-6	 CH18562_NA18562 CH18550_NA18550
 SOL-7	 CH18537_NA18537 CH18635_NA18635
 SOL-8	 CH18540_NA18540 CH18563_NA18563
 SOL-9	 CH18605_NA18605 CH18594_NA18594

 $ cat str1.cluster2
 CH18526 NA18526	0
 CH18524 NA18524	1
 CH18529 NA18529	2
 CH18558 NA18558	3
 CH18532 NA18532	4
 CH18561 NA18561	5
 CH18562 NA18562	6
 CH18537 NA18537	7
 CH18603 NA18603	2
 CH18540 NA18540	8

 $ cat str1.cluster3
 CH18526 NA18526	0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 00 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 CH18524 NA18524	1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 00 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 CH18529 NA18529	2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 0 0 0 0 0 0 0 00 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 CH18558 NA18558	3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 0 0 0 0 0 0 0 00 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 CH18603 NA18603	8 8 8 8 8 8 8 8 8 8 8 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 0 0 0 0 0 0 0 00 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

關聯分析 - 分組

 ./plink --bfile wgas1 --mh --within str1.cluster2 --adjust --out aac1

 $ cat aac1.cmh.adjusted
 CHR         SNP      UNADJ         GC       BONF       HOLM   SIDAK_SS   SIDAK_SD     FDR_BH     FDR_BY
 rs1323127  5.312e-05  0.0001136          1          1     0.9999     0.9999     0.8349          1 
 rs6692436  0.0001412  0.0002783          1          1          1          1     0.8349          1 
 rs2856028  0.0001478  0.0002902          1          1          1          1     0.8349          1 
 rs2856030  0.0001478  0.0002902          1          1          1          1     0.8349          1 
 rs2869480  0.0002386  0.0004501          1          1          1          1     0.8349          1 
 rs1860020  0.0002386  0.0004501          1          1          1          1     0.8349          1 
rs13344436  0.0002386  0.0004501          1          1          1          1     0.8349          1 
 rs1296754  0.0002457  0.0004624          1          1          1          1     0.8349          1 
 rs7101088  0.0002566  0.0004811          1          1          1          1     0.8349          1 

刪除/選定特定snp或樣本

 ./$ plink --file data --extract mysnps.txt

參考資料

研之有物│中央研究院 (2018/05/18)。全基因組分析能讓我們知道多少事？。檢自 PanSci 泛科學 (2020-03-03)。
知乎日報 (2016-12-20)。聽起來很厲害的「全基因組關聯分析」，能算命嗎？。檢自每日頭條 (2020-03-03)。
協同撰寫。全基因組關聯分析。檢自維基百科 (2020-03-03)。
刘思远 (2019-07-29)。华盛顿州立大学张志武教授来基因组所做作学术报告。檢自中国农业科学院农业基因组研究所 (2020-03-03)。
心心向農隊 (2019-06-14)。關聯分析的特點和優勢。檢自每日頭條 (2020-03-03)。
rapunzel0103 (2018-01-08)。临床生物信息学中的GWAS分析。檢自生信技能树 (2020-03-03)。
刘小磊 (2016)。一种交替运用固定效应和随机效应模型优化全基因组关联分析的算法开发。檢自中國知網 (2020-03-03)。
PLINK 1.9。檢自 PLINK (2020-03-03)。
File format reference - PLINK 1.9。檢自 PLINK (2020-03-03)。
(2018-12-09)。利用PLINK進行GWAS分析。檢自 IT閱讀 (2020-03-03)。
Beccaliii (2019-09-11)。用plink做一套GWAS分析。檢自知乎 (2020-03-03)。
river1023 (2019-03-16)。【GWAS】使用PLINK評估族群分層的現象。檢自走過必留下痕跡痞客邦 (2020-03-03)。
Tommy Huang (2018-04-20)。機器/統計學習:主成分分析(Principal Component Analysis, PCA)。檢自 Tommy Huang - Medium (2020-03-03)。
橙子牛奶糖 (2018-12-17)。GWAS分析基本流程及分析思路。檢自橙子牛奶糖 - 博客园 (2020-03-03)。
rapunzel0103 (2018-02-04)。跑通GWAS，用这些脚本就够了。檢自簡書 (2020-03-03)。
rapunzel0103 (2018-01-20)。GWAS基本分析内容。檢自簡書 (2020-03-03)。
Ochi (2017-01-04)。plink进行GWAS分析)。檢自 Ochi - 新浪博客 (2020-03-03)。
bcftools。檢自 bcftools (2020-03-03)。
PLINK assoc input: can it be two vcf files case/control?。檢自 biostars (2020-03-03)。

【Survey】BLINK ＆ PLINK

全基因體關聯分析（GWAS）

相關閱讀

Fixed and random model Circulating Probability Unification（FarmCPU）

BLINK

PLINK

常見格式型別

輸入過濾

準備

過濾

參考資料

全基因體關聯分析 （GWAS）

相關閱讀

Fixed and random model Circulating Probability Unification（FarmCPU）

BLINK

PLINK

常見格式型別

輸入過濾

準備

過濾

參考資料

全基因體關聯分析（GWAS）