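Data overview
Pick one fairly long academic paper in each of Chinese, Japanese, Korean, English, and Russian. Convert each paper to JPG pages, then split every page into smaller JPGs. The result is a large set of JPG tiles in 5 languages, each labeled with its source language. Data link: extraction code 8848.
Convert the PDFs to JPG.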
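```python
import os
import aspose.words as aw

# Folders 1..5 hold one paper per language; render every page to its own JPG.
for i in range(1, 6):
    os.makedirs(f"dataset/{i}", exist_ok=True)
    doc = aw.Document(f"data/{i}/{i}.pdf")
    for page in range(0, doc.page_count):
        extractedPage = doc.extract_pages(page, 1)
        extractedPage.save(f"dataset/{i}/{page+1}.jpg")
```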
Check whether all JPGs have the same size. The result is False.
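```python
from PIL import Image
import os

# Collect the (width, height) of every page image.
sizes = []
for i in range(1, 6):
    for filename in os.listdir(f"dataset/{i}"):
        if filename.endswith(".jpg"):
            with Image.open(os.path.join(f"dataset/{i}", filename)) as img:
                sizes.append(img.size)

# flag stays True only if every image has the same size as the first one.
flag = True
for i in sizes:
    if i != sizes[0]:
        flag = False
        break
print(flag)
```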
First crop each JPG to the 400*800 pixels (width*height) at its center. This works because every JPG is larger than 400*800.
```python
from PIL import Image
import os

for i in range(1, 6):
    os.makedirs(f"dataset_new/{i}", exist_ok=True)
    for filename in os.listdir(f"dataset/{i}"):
        if filename.endswith(".jpg"):
            with Image.open(os.path.join(f"dataset/{i}", filename)) as img:
                width, height = img.size
                # Integer division keeps the box on whole pixels.
                left = (width - 400) // 2
                top = (height - 800) // 2
                right = left + 400
                bottom = top + 800
                cropped_img = img.crop((left, top, right, bottom))
                cropped_img.save(f"dataset_new/{i}/{filename}")
```
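The crop-box arithmetic can be checked without opening any image. The following is a minimal sketch under the assumption of integer pixel coordinates; `center_box` is a hypothetical helper that is not part of the original script, and the page size used in the example is made up.

```python
def center_box(width, height, crop_w=400, crop_h=800):
    """Return (left, top, right, bottom) of a centered crop_w x crop_h box."""
    if width < crop_w or height < crop_h:
        raise ValueError("image is smaller than the requested crop")
    left = (width - crop_w) // 2
    top = (height - crop_h) // 2
    return (left, top, left + crop_w, top + crop_h)


# Illustrative page size only: 1241 x 1754 pixels.
box = center_box(1241, 1754)
print(box)  # (420, 477, 820, 1277)
```

Deriving `right` and `bottom` from `left` and `top` guarantees the box is exactly 400*800 even when the page width or height is odd.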
Split each 400*800 JPG into 32 JPGs of 100*100.
```python
from PIL import Image
import os

for i in range(1, 6):
    os.makedirs(f"dataset_last_temp/{i}", exist_ok=True)
    for filename in os.listdir(f"dataset_new/{i}"):
        if filename.endswith(".jpg"):
            with Image.open(os.path.join(f"dataset_new/{i}", filename)) as img:
                # 4 columns x 8 rows of 100*100 tiles.
                for x in range(0, 400, 100):
                    for y in range(0, 800, 100):
                        box = (x, y, x + 100, y + 100)
                        tile = img.crop(box)
                        # Suffix is the column index followed by the row index, e.g. "_37".
                        tile.save(f"dataset_last_temp/{i}/{filename[:-4]}" + f"_{x//100}{y//100}" + ".jpg")
```
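To see which tiles the nested loops produce, the boxes and filename suffixes can be listed on their own. This is a small sketch; `tile_boxes` is a hypothetical helper introduced here for illustration only.

```python
def tile_boxes(width=400, height=800, tile=100):
    """Yield (suffix, box) pairs in the same order as the nested x/y loops."""
    for x in range(0, width, tile):
        for y in range(0, height, tile):
            yield f"_{x // tile}{y // tile}", (x, y, x + tile, y + tile)


tiles = list(tile_boxes())
print(len(tiles))  # 32
print(tiles[0])    # ('_00', (0, 0, 100, 100))
print(tiles[-1])   # ('_37', (300, 700, 400, 800))
```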
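Manually delete the tiles that contain no text and save the rest in dataset_last.
Some of the data: from top to bottom, Chinese, Japanese, Korean, English, Russian.
Experiment
Wavelet decomposition
To make the results easier to display, LL2, LH2, HL2, HH2, LH1, HL1, and HH1 are cropped here. No cropping is done in the actual experiment.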
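```python
from PIL import Image
import os
import numpy as np
import pywt

def fc(LL, LH, HL, HH, x):
    # Crop each sub-band to x*x and tile the four into one 2x*2x mosaic.
    LL = LL[:x, :x]
    LH = LH[:x, :x]
    HL = HL[:x, :x]
    HH = HH[:x, :x]
    image = np.zeros((LL.shape[0] + LH.shape[0], LL.shape[1] + HL.shape[1]))
    image[:LL.shape[0], :LL.shape[1]] = LL   # top-left
    image[LL.shape[0]:, :LL.shape[1]] = LH   # bottom-left
    image[:LL.shape[0], LL.shape[1]:] = HL   # top-right
    image[LL.shape[0]:, LL.shape[1]:] = HH   # bottom-right
    return image

for i in range(1, 6):
    os.makedirs(f"temp/{i}", exist_ok=True)
    for filename in os.listdir(f"dataset_last/{i}"):
        if filename.endswith(".jpg"):
            with Image.open(os.path.join(f"dataset_last/{i}", filename)) as img:
                img = img.convert('L')  # grayscale
                # Two levels of 2-D DWT with the db4 wavelet.
                coeffs1 = pywt.dwt2(img, 'db4')
                LL1, (LH1, HL1, HH1) = coeffs1
                coeffs2 = pywt.dwt2(LL1, 'db4')
                LL2, (LH2, HL2, HH2) = coeffs2
                # Level-2 mosaic (50*50) sits in the top-left of the level-1 mosaic (100*100).
                image = fc(fc(LL2, LH2, HL2, HH2, 25), LH1, HL1, HH1, 50)
                # For display only: values outside 0..255 are not rescaled before the cast.
                image = Image.fromarray(image.astype('uint8'))
                image.save(f"temp/{i}/{filename}")
```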
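The crop sizes 25 and 50 follow from the sub-band sizes. The sketch below assumes PyWavelets' default `symmetric` boundary mode, where one DWT level on a signal of length n with a filter of length f yields floor((n + f - 1) / 2) coefficients, and uses the fact that db4 has 8 filter taps. The helper here is a plain-Python stand-in written for this note (PyWavelets offers the same calculation as `pywt.dwt_coeff_len`).

```python
def dwt_coeff_len(data_len, filter_len):
    """Coefficient length of one DWT level for non-periodized boundary modes."""
    return (data_len + filter_len - 1) // 2


DB4_FILTER_LEN = 8  # db4 has 8 filter taps

level1 = dwt_coeff_len(100, DB4_FILTER_LEN)     # sub-band side after level 1
level2 = dwt_coeff_len(level1, DB4_FILTER_LEN)  # sub-band side after level 2
print(level1, level2)  # 53 30
```

Under these assumptions a 100*100 tile gives 53*53 sub-bands at level 1 and 30*30 at level 2. Cropping level 2 to 25 produces a 50*50 mosaic, which then fits next to the level-1 sub-bands cropped to 50, so the displayed image is 100*100 again.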
Some of the results: from top to bottom, Chinese, Japanese, Korean, English, Russian.
Results
The standard pipeline.
```python
from PIL import Image
import os
import numpy as np
import pywt

def fc(matrix):
    # Mean energy of a sub-band: the average of its squared coefficients.
    return np.mean(np.asarray(matrix, dtype=float) ** 2)

def metric1(LH, HL, HH):
    # Energy features of the three detail sub-bands.
    return [fc(LH), fc(HL), fc(HH)]

def metric2(LH, HL, HH):
    # Proportion features: each energy divided by the total detail energy.
    a, b, c = metric1(LH, HL, HH)
    d = a + b + c
    return [a / d, b / d, c / d]

# lt1[k] / lt2[k]: level-1 / level-2 detail sub-bands of every tile of language k.
lt1 = [[] for _ in range(5)]
lt2 = [[] for _ in range(5)]
for i in range(1, 6):
    for filename in os.listdir(f"dataset_last/{i}"):
        if filename.endswith(".jpg"):
            with Image.open(os.path.join(f"dataset_last/{i}", filename)) as img:
                img = img.convert('L')
                LL1, (LH1, HL1, HH1) = pywt.dwt2(img, 'db4')
                LL2, (LH2, HL2, HH2) = pywt.dwt2(LL1, 'db4')
                lt1[i-1].append([LH1, HL1, HH1])
                lt2[i-1].append([LH2, HL2, HH2])

zd = {1: "Chinese", 2: "Japanese", 3: "Korean", 4: "English", 5: "Russian"}

def report(label, metrics, weighted):
    # Assign each sample to the nearest class mean and print per-language accuracy.
    mean = [np.mean(metrics[i], axis=0) for i in range(5)]
    var = [np.var(metrics[i], axis=0) for i in range(5)]
    print(f"{label:<14}", end=" ")
    for i in range(5):
        count = 0
        for j in metrics[i]:
            if weighted:
                # Squared distance with each term divided by the squared class variance.
                d = [sum(((np.array(j) - mean[k]) ** 2) / (var[k] ** 2)) for k in range(5)]
            else:
                # Plain squared Euclidean distance.
                d = [sum((np.array(j) - mean[k]) ** 2) for k in range(5)]
            if np.argmin(d) == i:
                count += 1
        print(zd[i+1], end="")
        print(" :{:06.2f}%".format(int(count / len(metrics[i]) * 10000) / 100), end=" ")
    print()

for level, lt in ((1, lt1), (2, lt2)):
    metrics_e = [[metric1(_[0], _[1], _[2]) for _ in lt[i]] for i in range(5)]
    metrics_p = [[metric2(_[0], _[1], _[2]) for _ in lt[i]] for i in range(5)]
    report(f"Level {level} DEMW:", metrics_e, False)
    report(f"Level {level} DPMW:", metrics_p, False)
    report(f"Level {level} DEMWV:", metrics_e, True)
    report(f"Level {level} DPMWV:", metrics_p, True)
```
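The features and the two decision rules can be tried on toy numbers without any image data. The sketch below uses plain Python lists; the function names, the 2*2 sub-bands, the class means, and the variances are all made up for illustration and are not taken from the experiment. The weighted rule divides each squared difference by the squared class variance, mirroring the script above.

```python
def mean_energy(band):
    """Average squared coefficient of a 2-D sub-band given as a list of rows."""
    values = [v * v for row in band for v in row]
    return sum(values) / len(values)


def energy_features(LH, HL, HH):
    return [mean_energy(LH), mean_energy(HL), mean_energy(HH)]


def proportion_features(LH, HL, HH):
    e = energy_features(LH, HL, HH)
    total = sum(e)
    return [v / total for v in e]


def nearest_class(x, means, variances=None):
    """Index of the closest class mean; with variances, weight terms by 1/var**2."""
    dists = []
    for k, m in enumerate(means):
        if variances is None:
            d = sum((a - b) ** 2 for a, b in zip(x, m))
        else:
            d = sum((a - b) ** 2 / v ** 2 for a, b, v in zip(x, m, variances[k]))
        dists.append(d)
    return dists.index(min(dists))


# Toy sub-bands (illustrative values only).
LH = [[1, -1], [1, -1]]
HL = [[2, 0], [0, 2]]
HH = [[0, 0], [0, 2]]
print(energy_features(LH, HL, HH))      # [1.0, 2.0, 1.0]
print(proportion_features(LH, HL, HH))  # [0.25, 0.5, 0.25]

# Two hypothetical classes: class 0 is tight, class 1 is spread out.
means = [[1.0, 2.0, 1.0], [3.0, 1.0, 0.5]]
variances = [[0.5, 0.5, 0.5], [2.0, 2.0, 2.0]]
x = [2.0, 1.5, 0.8]
print(nearest_class(x, means))             # 0
print(nearest_class(x, means, variances))  # 1
```

In this toy case the sample is slightly closer to class 0 in plain Euclidean distance, but the weighted rule picks class 1 because that class has a much larger spread. This is the kind of difference that separates the DEMW/DPMW rows from the DEMWV/DPMWV rows.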
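Analysis of results
This is a 5-class task, so random guessing is right 20% of the time. According to the results above, for every language except English at least one of the decision methods classifies it correctly more than 80% of the time. My bold guess for why English is recognized poorly is that the papers I picked in the other languages all contain some English to a greater or lesser degree, since English is the international lingua franca.
Error analysis
I used only one paper per language, so the experimental data is clearly insufficient. With so few samples the conditions of the Glivenko-Cantelli theorem are not met, meaning the empirical distribution cannot be expected to approximate the true one, so inaccurate results are to be expected. The image quality itself is also poor: for example, there are all kinds of watermarks, and odd figures from the papers that have nothing to do with text.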