oscar3
Senior Member
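I found a regular expression online that removes markup delimited by the less-than sign < and the greater-than sign >, such as HTML or XML tags. I tried it in EditPlus and it worked quite well, so I am sharing it here: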
<[^{><}]*>
For example:
<?xml version="1.0" encoding="US-ASCII"?><!DOCTYPE gda>
<gda>
<NP FCTN="HLN"><LST>UI</LST> - 93135830</NP>
<NP><LST>TI</LST> - <NP>A human putative lymphocyte G0/G1 switch gene </NP><ADJP>homologous <PP>to <NP><NP>a rodent gene </NP><VP>encoding <NP>a zinc-binding potential transcription factor</NP></VP></NP></PP></ADJP>.</NP>
<S><LST>AB</LST> - <NP-SBJ>G0S24 </NP-SBJ><VP>is <NP-PRD><NP>a member </NP><PP>of <NP><NP>a set </NP><PP>of <NP><NP><NP>genes </NP><PRN>(<NP>putative G0/G1 switch regulatory genes</NP>) </PRN></NP><SBAR><WHNP id="i87">that </WHNP><S><NP-SBJ id="i88" NULL="T" ref="i87"/><VP>are <VP>expressed <NP NULL="NONE" ref="i88"/><ADVP SEM="TMP">transiently </ADVP><PP SEM="TMP">within <NP><NP><QP>1-2 </QP>hr </NP><PP>of <NP><NP>the addition </NP><PP>of <NP SYN="COOD"><NP>lectin </NP>or <NP>cycloheximide </NP></NP></PP><PP>to <NP>human blood mononuclear cells</NP></PP></NP></PP></NP></PP></VP></VP></S></SBAR></NP></PP></NP></PP></NP-PRD></VP>.</S>
<S><NP-SBJ><NP>Comparison </NP><PP>of <NP>a full-length cDNA sequence </NP></PP><PP>with <NP>the corresponding genomic sequence </NP></PP></NP-SBJ><VP>reveals <NP><NP>an open reading frame </NP><PP>of <NP>326 amino acids</NP></PP>, <VP>distributed <NP NULL="NONE"/><PP>across <NP>two exons</NP></PP></VP></NP></VP>.</S>
<S><NP-SBJ>Potential phosphorylation sites </NP-SBJ><VP>include <NP><NP>the sequence PSPTSPT</NP>, <SBAR><WHNP id="i89">which </WHNP><S><NP-SBJ NULL="T" ref="i89"/><VP>resembles <NP><NP>an RNA polymerase II repeat </NP><VP>reported <S><NP-SBJ NULL="NONE"/><VP>to <VP>be <NP-PRD><NP>a target </NP><PP>of <NP>the cell cycle control kinase cdc2</NP></PP></NP-PRD></VP></VP></S></VP></NP></VP></S></SBAR></NP></VP>.</S>
<S><NP-SBJ><NP>Comparison </NP><PP>of <NP>the derived protein sequence </NP></PP><PP>with <NP><NP>those </NP><PP>of <NP>rodent homologs </NP></PP></NP></PP></NP-SBJ><VP>allows <NP><NP>classification </NP><PP>into <NP>three groups</NP></PP></NP></VP>.</S>
<S><NP-SBJ>Group 1 </NP-SBJ><VP>contains <NP SYN="COOD"><NP>G0S24 </NP>and <NP><NP>the <NP SYN="COOD"><NP>rat </NP>and <NP>mouse </NP></NP>TIS11 genes </NP>(<VP><ADVP>also </ADVP>known <NP NULL="NONE"/><PP>as <NP SYN="COOD"><NP>TTP</NP>, <NP>Nup475</NP>, and <NP>Zfp36</NP></NP></PP></VP>)</NP></NP></VP>.</S>
<S><NP-SBJ><NP>Members </NP><PP>of <NP>this group </NP></PP></NP-SBJ><VP>have <NP>three tetraproline repeats</NP></VP>.</S>
<S><NP-SBJ>Groups <NP SYN="COOD"><NP>1 </NP>and <NP>2 </NP></NP></NP-SBJ><VP>have <NP SYN="COOD"><NP>a serine-rich region </NP>and <NP><NP>an " arginine element " </NP><PRN>(<NP>RRLPIF</NP>) </PRN></NP></NP><PP>at <NP>the carboxyl terminus</NP></PP></VP>.</S>
<S><NP-SBJ>All groups </NP-SBJ><VP>contain <NP SYN="COOD"><NP><ADJP SYN="COOD"><ADJP><NP>cysteine- </NP><ADJP NULL="QSTN"/></ADJP>and <ADJP>histidine-rich </ADJP></ADJP>putative zinc finger domains </NP>and <NP><NP>a serine-phenylalanine " SFS " domain </NP><ADJP>similar <PP>to <NP><NP>part </NP><PP>of <NP><NP>the large subunit </NP><PP>of <NP>eukaryotic RNA polymerase II</NP></PP></NP></PP></NP></PP></ADJP></NP></NP></VP>.</S>
<S><NP-SBJ><NP>Comparison </NP><PP>of <NP>group 1 <UCP SYN="COOD"><ADJP>human </ADJP>and <NP>mouse </NP></UCP>genomic sequences </NP></PP></NP-SBJ><VP>shows <NP><NP>high conservation </NP><PP>in <NP SYN="COOD"><NP>the 5' flank </NP>and <NP>exons</NP></NP></PP></NP></VP>.</S>
<S><NP-SBJ>A CpG island </NP-SBJ><VP>suggests <NP><NP>expression </NP><PP>in <NP>the germ line</NP></PP></NP></VP>.</S>
<S><S><NP-SBJ>G0S24 </NP-SBJ><VP>has <NP><NP>potential sites </NP><PP>for <NP>transcription factors </NP></PP></NP><PP>in <NP SYN="COOD"><NP>the 5' flank </NP>and <NP>intron</NP></NP></PP></VP>;</S> <S><NP-SBJ>these </NP-SBJ><VP>include <NP>a serum response element</NP></VP>.</S></S>
<S><NP-SBJ><UCP SYN="COOD"><NP>Protein </NP>and <ADJP>genomic </ADJP></UCP>sequences </NP-SBJ><VP>show <NP><NP>similarities </NP><PP>with <NP><NP>those </NP><PP>of <NP><NP>a variety </NP><PP>of <NP><NP>proteins </NP><VP>involved <NP NULL="NONE"/><PP>in <NP>transcription</NP></PP></VP></NP></PP></NP></PP></NP></PP></NP>, <S><NP-SBJ NULL="NONE"/><VP>suggesting <SBAR>that <S><NP-SBJ>the G0S24 product </NP-SBJ><VP>has <NP>a similar role</NP></VP></S></SBAR></VP></S></VP>.</S></gda>
After the tags above are removed, the result is:
UI - 93135830
TI - A human putative lymphocyte G0/G1 switch gene homologous to a rodent gene encoding a zinc-binding potential transcription factor.
AB - G0S24 is a member of a set of genes (putative G0/G1 switch regulatory genes) that are expressed transiently within 1-2 hr of the addition of lectin or cycloheximide to human blood mononuclear cells.
Comparison of a full-length cDNA sequence with the corresponding genomic sequence reveals an open reading frame of 326 amino acids, distributed across two exons.
Potential phosphorylation sites include the sequence PSPTSPT, which resembles an RNA polymerase II repeat reported to be a target of the cell cycle control kinase cdc2.
Comparison of the derived protein sequence with those of rodent homologs allows classification into three groups.
Group 1 contains G0S24 and the rat and mouse TIS11 genes (also known as TTP, Nup475, and Zfp36).
Members of this group have three tetraproline repeats.
Groups 1 and 2 have a serine-rich region and an " arginine element " (RRLPIF) at the carboxyl terminus.
All groups contain cysteine- and histidine-rich putative zinc finger domains and a serine-phenylalanine " SFS " domain similar to part of the large subunit of eukaryotic RNA polymerase II.
Comparison of group 1 human and mouse genomic sequences shows high conservation in the 5' flank and exons.
A CpG island suggests expression in the germ line.
G0S24 has potential sites for transcription factors in the 5' flank and intron; these include a serum response element.
Protein and genomic sequences show similarities with those of a variety of proteins involved in transcription, suggesting that the G0S24 product has a similar role.
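
For anyone who wants to try the same pattern outside EditPlus, here is a minimal sketch in Python. The function name `strip_tags` and the variable names are my own, not from the original source, and I am assuming that replacing each match with an empty string is the same as what the editor's replace dialog did. Note that the braces inside the character class are literal characters, so a tag containing { or } would be left alone; for the sample above that makes no difference.

```python
import re

# The pattern from the post, unchanged. Inside [...] the characters
# { > < } are all literals, so a match is "<", then any run of
# characters other than those four, then ">".
TAG_PATTERN = re.compile(r"<[^{><}]*>")


def strip_tags(text: str) -> str:
    """Delete every match of the pattern (hypothetical helper name)."""
    return TAG_PATTERN.sub("", text)


# Two lines taken from the sample document in this post.
line_ui = '<NP FCTN="HLN"><LST>UI</LST> - 93135830</NP>'
line_cpg = (
    "<S><NP-SBJ>A CpG island </NP-SBJ><VP>suggests <NP><NP>expression </NP>"
    "<PP>in <NP>the germ line</NP></PP></NP></VP>.</S>"
)

print(strip_tags(line_ui))   # UI - 93135830
print(strip_tags(line_cpg))  # A CpG island suggests expression in the germ line.
```

This is only a text-level cleanup, not an XML parser: it does not decode entities such as `&amp;`, and a `>` inside an attribute value or a comment would end the match early. For well-behaved markup like the sample it should be enough.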