Python Text Processing Examples
This article walks through 25 practical Python text processing examples: extracting content from PDF, Word, and web pages; cleaning and tokenizing text; stemming, lemmatization, and frequency analysis; vectorization; sentiment analysis; and translation. The snippets are short, simple to run, and easy to adapt to your own data.
1. Extract PDF content

# pip install PyPDF2
import PyPDF2
from PyPDF2 import PdfFileReader

# Creating a pdf file object.
pdf = open("test.pdf", "rb")

# Creating pdf reader object.
pdf_reader = PyPDF2.PdfFileReader(pdf)

# Checking total number of pages in a pdf file.
print("Total number of Pages:", pdf_reader.numPages)

# Creating a page object.
page = pdf_reader.getPage(200)

# Extract data from a specific page number.
print(page.extractText())

# Closing the object.
pdf.close()
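The PdfFileReader API above is PyPDF2's legacy interface; the project has since been folded into pypdf, where the equivalent calls are renamed. A minimal sketch of the same extraction with the modern API, assuming the same test.pdf:

# pip install pypdf
from pypdf import PdfReader

reader = PdfReader("test.pdf")

# pages is a plain list, so len() gives the page count.
print("Total number of Pages:", len(reader.pages))

# Pages are zero-indexed; extract the text of the first page.
print(reader.pages[0].extract_text())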
2. Extract Word content

# pip install python-docx
import docx

def main():
    try:
        doc = docx.Document('test.docx')  # Creating word reader object.
        data = ""
        fullText = []
        for para in doc.paragraphs:
            fullText.append(para.text)
        data = '\n'.join(fullText)
        print(data)
    except IOError:
        print('There was an error opening the file!')
        return

if __name__ == '__main__':
    main()
3. Extract web page content

# pip install bs4
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup

req = Request('http://www.cmegroup.com/trading/products/#sortField=oi&sortAsc=false&venues=3&page=1&cleared=1&group=1',
              headers={'User-Agent': 'Mozilla/5.0'})
webpage = urlopen(req).read()

# Parsing
soup = BeautifulSoup(webpage, 'html.parser')

# Formatting the parsed html file
strhtm = soup.prettify()

# Print the first 500 characters
print(strhtm[:500])

# Extract meta tag value
print(soup.title.string)
print(soup.find('meta', attrs={'property': 'og:description'}))

# Extract anchor tag value
for x in soup.find_all('a'):
    print(x.string)

# Extract paragraph tag value
for x in soup.find_all('p'):
    print(x.text)
4. Read JSON data

import requests
import json

r = requests.get("https://support.oneskyapp.com/hc/en-us/article_attachments/202761727/example_2.json")
res = r.json()

# Extract specific node content.
print(res['quiz']['sport'])

# Dump data as string
data = json.dumps(res)
print(data)
5. Read CSV data

import csv

with open('test.csv', 'r') as csv_file:
    reader = csv.reader(csv_file)
    next(reader)  # Skip first row
    for row in reader:
        print(row)
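When the skipped first row is a header, csv.DictReader consumes it automatically and yields each row as a dict keyed by column name; a small sketch on the same hypothetical test.csv:

import csv

# DictReader reads the header row itself and maps each data row to it.
with open('test.csv', 'r', newline='') as csv_file:
    for row in csv.DictReader(csv_file):
        print(row)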
6. Remove punctuation from a string

import re
import string

# Sample Amazon review; the misspellings are part of the original data.
data = "Stuning even for the non-gamer: This sound track was beautiful! It paints the senery in your mind so well I would recomend it even to people who hate vid. game music! I have played the game Chrono Cross but out of all of the games I have ever played it has the best music! It backs away from crude keyboarding and takes a fresher step with grate guitars and soulful orchestras. It would impress anyone who cares to listen!"

# Method 1: Regex
# Remove the special characters from the read string.
no_specials_string = re.sub('[!#?,.:";]', '', data)
print(no_specials_string)

# Method 2: translate()
# Make translator object
translator = str.maketrans('', '', string.punctuation)
data = data.translate(translator)
print(data)
7. Remove stop words with NLTK

import nltk
from nltk.corpus import stopwords

# nltk.download('stopwords')  # uncomment and run once if the corpus is missing

data = ['Stuning even for the non-gamer: This sound track was beautiful! It paints the senery in your mind so well I would recomend it even to people who hate vid. game music! I have played the game Chrono Cross but out of all of the games I have ever played it has the best music! It backs away from crude keyboarding and takes a fresher step with grate guitars and soulful orchestras. It would impress anyone who cares to listen!']

# Remove stop words
stopwords = set(stopwords.words('english'))

output = []
for sentence in data:
    temp_list = []
    for word in sentence.split():
        if word.lower() not in stopwords:
            temp_list.append(word)
    output.append(' '.join(temp_list))

print(output)
8. Correct spelling with TextBlob

from textblob import TextBlob

# The input deliberately misspells "central", "interesting", and "languages".
data = "Natural language is a cantral part of our day to day life, and it's so antresting to work on any problem related to langages."

output = TextBlob(data).correct()
print(output)
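For single words, TextBlob's Word.spellcheck() also exposes the candidate corrections along with a confidence score, which is handy when you want to inspect the suggestions rather than accept the top one; a quick sketch:

from textblob import Word

# Returns (candidate, confidence) pairs, best candidate first.
print(Word('langages').spellcheck())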
9. Word tokenization with NLTK and TextBlob

import nltk
from textblob import TextBlob

# nltk.download('punkt')  # uncomment and run once if the tokenizer model is missing

data = "Natural language is a central part of our day to day life, and it's so interesting to work on any problem related to languages."

nltk_output = nltk.word_tokenize(data)
textblob_output = TextBlob(data).words

print(nltk_output)
print(textblob_output)
Output:
['Natural', 'language', 'is', 'a', 'central', 'part', 'of', 'our', 'day', 'to', 'day', 'life', ',', 'and', 'it', "'s", 'so', 'interesting', 'to', 'work', 'on', 'any', 'problem', 'related', 'to', 'languages', '.']
['Natural', 'language', 'is', 'a', 'central', 'part', 'of', 'our', 'day', 'to', 'day', 'life', 'and', 'it', "'s", 'so', 'interesting', 'to', 'work', 'on', 'any', 'problem', 'related', 'to', 'languages']
10. Stem the words of a sentence or phrase with NLTK

from nltk.stem import PorterStemmer

st = PorterStemmer()
text = ['Where did he learn to dance like that?',
        'His eyes were dancing with humor.',
        'She shook her head and danced away',
        'Alex was an excellent dancer.']

output = []
for sentence in text:
    output.append(" ".join([st.stem(i) for i in sentence.split()]))

for item in output:
    print(item)

print("-" * 50)
print(st.stem('jumping'), st.stem('jumps'), st.stem('jumped'))
Output:
where did he learn to danc like that?
hi eye were danc with humor.
she shook her head and danc away
alex wa an excel dancer.
--------------------------------------------------
jump jump jump
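PorterStemmer is the classic choice, but NLTK also ships the Snowball ("Porter2") stemmer, an updated algorithm with support for several languages; a minimal sketch:

from nltk.stem import SnowballStemmer

st = SnowballStemmer('english')

# Snowball's rules differ slightly from Porter's on some words.
print(st.stem('dancing'), st.stem('generously'), st.stem('fairly'))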
11. Lemmatize a sentence or phrase with NLTK

import nltk
from nltk.stem import WordNetLemmatizer

# nltk.download('wordnet')  # uncomment and run once if the corpus is missing

wnl = WordNetLemmatizer()
text = ['She gripped the armrest as he passed two cars at a time.',
        'Her car was in full view.',
        'A number of cars carried out of state license plates.']

output = []
for sentence in text:
    output.append(" ".join([wnl.lemmatize(i) for i in sentence.split()]))

for item in output:
    print(item)

print("*" * 10)
print(wnl.lemmatize('jumps', 'n'))
print(wnl.lemmatize('jumping', 'v'))
print(wnl.lemmatize('jumped', 'v'))
print("*" * 10)
print(wnl.lemmatize('saddest', 'a'))
print(wnl.lemmatize('happiest', 'a'))
print(wnl.lemmatize('easiest', 'a'))
Output:
She gripped the armrest a he passed two car at a time.
Her car wa in full view.
A number of car carried out of state license plates.
**********
jump
jump
jump
**********
sad
happy
easy
12. Find the frequency of each word in a text file with NLTK

import nltk
from nltk.corpus import webtext
from nltk.probability import FreqDist

nltk.download('webtext')
wt_words = webtext.words('testing.txt')
data_analysis = nltk.FreqDist(wt_words)

# Keep only the words longer than three characters.
filter_words = dict([(m, n) for m, n in data_analysis.items() if len(m) > 3])

for key in sorted(filter_words):
    print("%s: %s" % (key, filter_words[key]))

data_analysis = nltk.FreqDist(filter_words)

data_analysis.plot(25, cumulative=False)
Output:
[nltk_data] Downloading package webtext to
[nltk_data] C:\Users\amit\AppData\Roaming\nltk_data...
[nltk_data] Unzipping corpora\webtext.zip.
1989: 1
Accessing: 1
Analysis: 1
Anyone: 1
Chapter: 1
Coding: 1
Data: 1
...
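When you only need the top entries rather than the full sorted listing, FreqDist.most_common() does the ranking for you; a self-contained sketch on a toy word list:

import nltk

words = "the cat sat on the mat near the door".split()
fd = nltk.FreqDist(words)

# Top 3 (word, count) pairs, most frequent first.
print(fd.most_common(3))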
13. Create a word cloud from a corpus

import nltk
from nltk.corpus import webtext
from nltk.probability import FreqDist
from wordcloud import WordCloud
import matplotlib.pyplot as plt

nltk.download('webtext')
wt_words = webtext.words('testing.txt')  # Sample data
data_analysis = nltk.FreqDist(wt_words)

filter_words = dict([(m, n) for m, n in data_analysis.items() if len(m) > 3])

wcloud = WordCloud().generate_from_frequencies(filter_words)

# Plotting the word cloud
plt.imshow(wcloud, interpolation="bilinear")
plt.axis("off")
plt.show()
14. NLTK lexical dispersion plot

import nltk
from nltk.corpus import webtext
import matplotlib.pyplot as plt

words = ['data', 'science', 'dataset']

nltk.download('webtext')
wt_words = webtext.words('testing.txt')  # Sample data

points = [(x, y) for x in range(len(wt_words))
          for y in range(len(words)) if wt_words[x] == words[y]]

if points:
    x, y = zip(*points)
else:
    x = y = ()

plt.plot(x, y, "rx", scalex=.1)
plt.yticks(range(len(words)), words, color="b")
plt.ylim(-1, len(words))
plt.title("Lexical Dispersion Plot")
plt.xlabel("Word Offset")
plt.show()
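NLTK can draw essentially the same chart itself via the dispersion_plot method of nltk.Text, skipping the manual point computation; a sketch assuming the same testing.txt corpus:

import nltk
from nltk.corpus import webtext

nltk.download('webtext')
wt_words = webtext.words('testing.txt')

# Text.dispersion_plot computes the word offsets and draws the chart directly.
nltk.Text(wt_words).dispersion_plot(['data', 'science', 'dataset'])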
15. Convert text to numbers with CountVectorizer

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Sample data for analysis
data1 = "Java is a language for programming that develops a software for several platforms. A compiled code or bytecode on Java application can run on most of the operating systems including Linux, Mac operating system, and Linux. Most of the syntax of Java is derived from the C++ and C languages."
data2 = "Python supports multiple programming paradigms and comes up with a large standard library, paradigms included are object-oriented, imperative, functional and procedural."
data3 = "Go is typed statically compiled language. It was created by Robert Griesemer, Ken Thompson, and Rob Pike in 2009. This language offers garbage collection, concurrency of CSP-style, memory safety, and structural typing."

# 'Go' reuses data2 here, which is why the Go and Python columns in the output
# below come out identical; swap in data3 to vectorize the real Go text.
df1 = pd.DataFrame({'Java': [data1], 'Python': [data2], 'Go': [data2]})

# Initialize
vectorizer = CountVectorizer()
doc_vec = vectorizer.fit_transform(df1.iloc[0])

# Create dataFrame
# Note: on scikit-learn >= 1.2, use vectorizer.get_feature_names_out() instead.
df2 = pd.DataFrame(doc_vec.toarray().transpose(),
                   index=vectorizer.get_feature_names())

# Change column headers
df2.columns = df1.columns
print(df2)
Output:
Go Java Python
and 2 2 2
application 0 1 0
are 1 0 1
bytecode 0 1 0
can 0 1 0
code 0 1 0
comes 1 0 1
compiled 0 1 0
derived 0 1 0
develops 0 1 0
for 0 2 0
from 0 1 0
functional 1 0 1
imperative 1 0 1
...
16. Create a document-term matrix with TF-IDF

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Sample data for analysis
data1 = "Java is a language for programming that develops a software for several platforms. A compiled code or bytecode on Java application can run on most of the operating systems including Linux, Mac operating system, and Linux. Most of the syntax of Java is derived from the C++ and C languages."
data2 = "Python supports multiple programming paradigms and comes up with a large standard library, paradigms included are object-oriented, imperative, functional and procedural."
data3 = "Go is typed statically compiled language. It was created by Robert Griesemer, Ken Thompson, and Rob Pike in 2009. This language offers garbage collection, concurrency of CSP-style, memory safety, and structural typing."

# As in the previous example, 'Go' reuses data2, so the Go and Python columns
# below match; swap in data3 for the real Go text.
df1 = pd.DataFrame({'Java': [data1], 'Python': [data2], 'Go': [data2]})

# Initialize
vectorizer = TfidfVectorizer()
doc_vec = vectorizer.fit_transform(df1.iloc[0])

# Create dataFrame
# Note: on scikit-learn >= 1.2, use vectorizer.get_feature_names_out() instead.
df2 = pd.DataFrame(doc_vec.toarray().transpose(),
                   index=vectorizer.get_feature_names())

# Change column headers
df2.columns = df1.columns
print(df2)
Output:
Go Java Python
and 0.323751 0.137553 0.323751
application 0.000000 0.116449 0.000000
are 0.208444 0.000000 0.208444
bytecode 0.000000 0.116449 0.000000
can 0.000000 0.116449 0.000000
code 0.000000 0.116449 0.000000
comes 0.208444 0.000000 0.208444
compiled 0.000000 0.116449 0.000000
derived 0.000000 0.116449 0.000000
develops 0.000000 0.116449 0.000000
for 0.000000 0.232898 0.000000
...
17. Generate N-grams for a given sentence

Natural Language Toolkit: NLTK
import nltk
from nltk.util import ngrams

# Function to generate n-grams from sentences.
def extract_ngrams(data, num):
    n_grams = ngrams(nltk.word_tokenize(data), num)
    return [' '.join(grams) for grams in n_grams]

data = 'A class is a blueprint for the object.'

print("1-gram: ", extract_ngrams(data, 1))
print("2-gram: ", extract_ngrams(data, 2))
print("3-gram: ", extract_ngrams(data, 3))
print("4-gram: ", extract_ngrams(data, 4))
Text processing tool: TextBlob
from textblob import TextBlob

# Function to generate n-grams from sentences.
def extract_ngrams(data, num):
    n_grams = TextBlob(data).ngrams(num)
    return [' '.join(grams) for grams in n_grams]

data = 'A class is a blueprint for the object.'

print("1-gram: ", extract_ngrams(data, 1))
print("2-gram: ", extract_ngrams(data, 2))
print("3-gram: ", extract_ngrams(data, 3))
print("4-gram: ", extract_ngrams(data, 4))
Output:
1-gram: ['A', 'class', 'is', 'a', 'blueprint', 'for', 'the', 'object']
2-gram: ['A class', 'class is', 'is a', 'a blueprint', 'blueprint for', 'for the', 'the object']
3-gram: ['A class is', 'class is a', 'is a blueprint', 'a blueprint for', 'blueprint for the', 'for the object']
4-gram: ['A class is a', 'class is a blueprint', 'is a blueprint for', 'a blueprint for the', 'blueprint for the object']
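For pre-tokenized input you don't strictly need a library at all; the sliding windows can be built with zip. A minimal dependency-free sketch:

# Pure-Python n-grams over an already-split token list.
def ngrams(tokens, n):
    return [' '.join(g) for g in zip(*[tokens[i:] for i in range(n)])]

tokens = 'A class is a blueprint for the object.'.split()
print(ngrams(tokens, 2))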
18. Build a bigram vocabulary with sklearn CountVectorizer

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Sample data for analysis
data1 = "Machine language is a low-level programming language. It is easily understood by computers but difficult to read by people. This is why people use higher level programming languages. Programs written in high-level languages are also either compiled and/or interpreted into machine language so that computers can execute them."
data2 = "Assembly language is a representation of machine language. In other words, each assembly language instruction translates to a machine language instruction. Though assembly language statements are readable, the statements are still low-level. A disadvantage of assembly language is that it is not portable, because each platform comes with a particular Assembly Language"

df1 = pd.DataFrame({'Machine': [data1], 'Assembly': [data2]})

# Initialize with bigrams only: ngram_range=(2, 2)
vectorizer = CountVectorizer(ngram_range=(2, 2))
doc_vec = vectorizer.fit_transform(df1.iloc[0])

# Create dataFrame
df2 = pd.DataFrame(doc_vec.toarray().transpose(),
                   index=vectorizer.get_feature_names())

# Change column headers
df2.columns = df1.columns
print(df2)
Output:
Assembly Machine
also either 0 1
and or 0 1
are also 0 1
are readable 1 0
are still 1 0
assembly language 5 0
because each 1 0
but difficult 0 1
by computers 0 1
by people 0 1
can execute 0 1
...
19. Extract noun phrases with TextBlob

from textblob import TextBlob

# Extract noun phrases
blob = TextBlob("Canada is a country in the northern part of North America.")
for nouns in blob.noun_phrases:
    print(nouns)
Output:
canada
northern part
america
20. How to compute a word-word co-occurrence matrix

import numpy as np
import nltk
from nltk import bigrams
import itertools
import pandas as pd

def generate_co_occurrence_matrix(corpus):
    vocab = set(corpus)
    vocab = list(vocab)
    vocab_index = {word: i for i, word in enumerate(vocab)}

    # Create bigrams from all words in corpus
    bi_grams = list(bigrams(corpus))

    # Frequency distribution of bigrams ((word1, word2), num_occurrences)
    bigram_freq = nltk.FreqDist(bi_grams).most_common(len(bi_grams))

    # Initialise co-occurrence matrix
    # co_occurrence_matrix[current][previous]
    co_occurrence_matrix = np.zeros((len(vocab), len(vocab)))

    # Loop through the bigrams taking the current and previous word,
    # and the number of occurrences of the bigram.
    for bigram in bigram_freq:
        current = bigram[0][1]
        previous = bigram[0][0]
        count = bigram[1]
        pos_current = vocab_index[current]
        pos_previous = vocab_index[previous]
        co_occurrence_matrix[pos_current][pos_previous] = count
    co_occurrence_matrix = np.matrix(co_occurrence_matrix)

    # Return the matrix and the index
    return co_occurrence_matrix, vocab_index

text_data = [['Where', 'Python', 'is', 'used'],
             # The missing comma below concatenates 'Python' 'used' into
             # 'Pythonused', which is why that token shows up in the output.
             ['What', 'is', 'Python' 'used', 'in'],
             ['Why', 'Python', 'is', 'best'],
             ['What', 'companies', 'use', 'Python']]

# Create one list using many lists
data = list(itertools.chain.from_iterable(text_data))
matrix, vocab_index = generate_co_occurrence_matrix(data)

data_matrix = pd.DataFrame(matrix, index=vocab_index,
                           columns=vocab_index)
print(data_matrix)
Output:
best use What Where ... in is Python used
best 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 1.0
use 0.0 0.0 0.0 0.0 ... 0.0 1.0 0.0 0.0
What 1.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0
Where 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0
Pythonused 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 1.0
Why 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 1.0
companies 0.0 1.0 0.0 1.0 ... 1.0 0.0 0.0 0.0
in 0.0 0.0 0.0 0.0 ... 0.0 0.0 1.0 0.0
is 0.0 0.0 1.0 0.0 ... 0.0 0.0 0.0 0.0
Python 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0
used 0.0 0.0 1.0 0.0 ... 0.0 0.0 0.0 0.0
[11 rows x 11 columns]
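For a quick cross-check, pandas can produce the same previous-word/current-word counts with crosstab on a shifted Series; a small sketch on a toy token list:

import pandas as pd

tokens = "where python is used what is python used in".split()
s = pd.Series(tokens)

# Crosstab of (previous word, current word); the leading NaN pair is dropped.
print(pd.crosstab(s.shift(), s))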
21. Sentiment analysis with TextBlob

from textblob import TextBlob

def sentiment(polarity):
    # Classify the polarity value passed in.
    if polarity < 0:
        print("Negative")
    elif polarity > 0:
        print("Positive")
    else:
        print("Neutral")

blob = TextBlob("The movie was excellent!")
print(blob.sentiment)
sentiment(blob.sentiment.polarity)

blob = TextBlob("The movie was not bad.")
print(blob.sentiment)
sentiment(blob.sentiment.polarity)

blob = TextBlob("The movie was ridiculous.")
print(blob.sentiment)
sentiment(blob.sentiment.polarity)
Output:
Sentiment(polarity=1.0, subjectivity=1.0)
Positive
Sentiment(polarity=0.3499999999999999, subjectivity=0.6666666666666666)
Positive
Sentiment(polarity=-0.3333333333333333, subjectivity=1.0)
Negative
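TextBlob's scores come from a pattern-based lexicon; as an alternative, NLTK bundles the VADER sentiment analyzer, which is tuned for short, informal text. A minimal sketch:

import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download('vader_lexicon')  # run once

sia = SentimentIntensityAnalyzer()

# polarity_scores returns neg/neu/pos proportions plus a normalized compound score.
print(sia.polarity_scores("The movie was excellent!"))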
22. Language translation with Goslate

# pip install goslate
# Note: goslate scrapes an old Google Translate endpoint and may no longer work reliably.
import goslate

text = "Comment vas-tu?"

gs = goslate.Goslate()

translatedText = gs.translate(text, 'en')
print(translatedText)

translatedText = gs.translate(text, 'zh')
print(translatedText)

translatedText = gs.translate(text, 'de')
print(translatedText)
23. Language detection and translation with TextBlob

from textblob import TextBlob

blob = TextBlob("Comment vas-tu?")

# Note: detect_language() and translate() only work on older TextBlob releases;
# see the alternative sketched after the output below.
print(blob.detect_language())

print(blob.translate(to='es'))
print(blob.translate(to='en'))
print(blob.translate(to='zh'))
Output:
fr
¿Como estas tu?
How are you?
你好吗?
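detect_language() and translate() were removed from recent TextBlob versions after the Google endpoint they relied on was retired. One possible replacement, sketched here under the assumption that the third-party deep-translator package keeps its current GoogleTranslator API:

# pip install deep-translator
from deep_translator import GoogleTranslator

text = "Comment vas-tu?"

# source='auto' lets the service detect the input language.
print(GoogleTranslator(source='auto', target='en').translate(text))
print(GoogleTranslator(source='auto', target='zh-CN').translate(text))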
24. Get definitions and synonyms with TextBlob

from textblob import TextBlob
from textblob import Word

text_word = Word('safe')

print(text_word.definitions)

synonyms = set()
for synset in text_word.synsets:
    for lemma in synset.lemmas():
        synonyms.add(lemma.name())

print(synonyms)
Output:
['strongbox where valuables can be safely kept', 'a ventilated or refrigerated cupboard for securing provisions from pests', 'contraceptive device consisting of a sheath of thin rubber or latex that is worn over the penis during intercourse', 'free from danger or the risk of harm', '(of an undertaking) secure from risk', 'having reached a base without being put out', 'financially sound']
{'secure', 'rubber', 'good', 'safety', 'safe', 'dependable', 'condom', 'prophylactic'}
25. Get a list of antonyms with TextBlob

from textblob import TextBlob
from textblob import Word

text_word = Word('safe')

antonyms = set()
for synset in text_word.synsets:
    for lemma in synset.lemmas():
        if lemma.antonyms():
            antonyms.add(lemma.antonyms()[0].name())

print(antonyms)
Output:
{'dangerous', 'out'}
That wraps up the tour of Python text processing examples; the best way to make them stick is to try them on your own data.