Extracting specific patterns from a text file and importing them into a DataFrame

Question

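I have several text files in the format shown below. The number of lines starting with "CONTOUR" differs for each "OBJECT" block. I uploaded a sample file to my GitHub page (link). For each section starting with "OBJECT", I need to extract every line starting with "CONTOUR" and import it into a pandas DataFrame. After processing, my goal is a DataFrame with three columns, headed Object, Points, and Length.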

Sample text:

OBJECT 1
NAME:  MT1(SP1)
       3 contours
       object uses open contours.
       color (red, green, blue) = (0, 1, 0)

    CONTOUR #1,1,0  5 points    length = 3.07e+006 pm
    CONTOUR #2,1,0  6 points    length = 3.51e+006 pm
    CONTOUR #3,1,0  5 points    length = 3.50e+006 pm

OBJECT 2
NAME:  MT2(SP3)
       4 contours
       object uses open contours.
       color (red, green, blue) = (0, 1, 1)

    CONTOUR #1,2,0  4 points    length = 1.86e+006 pm
    CONTOUR #2,2,0  4 points    length = 2.29e+006 pm
    CONTOUR #3,2,0  5 points    length = 2.47e+006 pm
    CONTOUR #3,2,0  5 points    length = 2.47e+006 pm

OBJECT 3
NAME:  MT3(SP2)
       1 contours
       object uses open contours.
       color (red, green, blue) = (1, 0, 1)

    CONTOUR #1,3,0  6 points    length = 2.74e+006 pm

Result:

Object | Points | Length
1 | 5 | 3.07e+006   
1 | 6 | 3.51e+006
1 | 5 | 3.50e+006
2 | 4 | 1.86e+006
2 | 4 | 2.29e+006
2 | 5 | 2.47e+006
2 | 5 | 2.47e+006
3 | 6 | 2.74e+006

What I have tried

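With the loop below I managed to extract the one block between "OBJECT 1" and "OBJECT 2". I am now trying to use some kind of counter to extend this to the following blocks (2-3, 3-4, and so on), but I am not sure how to proceed. This approach also has the problem that the last block will not be read, because there is no final "OBJECT" line after it.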

with open('textfile.txt', 'r') as infile, open('new_textfile.txt', 'w') as outfile:
    copy = False  # define the flag before the first "OBJECT 1" line is seen
    for line in infile:
        if line.strip() == "OBJECT 1":
            copy = True
            continue
        elif line.strip() == "OBJECT 2":
            copy = False
            continue
        elif copy:
            outfile.write(line)

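I have also been playing around with regular expressions, but I run into the same problem of how to identify and import each block separately.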

import re
pattern = r'OBJECT\s\d.*OBJECT\s\d'
match = re.findall(pattern, text, re.DOTALL)

Any help or pointers on how to proceed would be much appreciated! Please ask if anything needs clarification.

2 Answers

  1. Upvotes: 1

    You can use this code:

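    import pandas as pd

    with open('test_sample.txt') as datfile:
        data = []
        for line in datfile:
            line = line.strip()

            # An OBJECT header sets the id for the CONTOUR lines that follow
            if line.startswith('OBJECT'):
                obj_id = int(line.split()[1])

            # e.g. "CONTOUR #1,1,0  5 points    length = 3.07e+006 pm"
            elif line.startswith('CONTOUR'):
                args = line.split()
                points = int(args[2])    # token before "points"
                length = float(args[6])  # token after "="
                data.append({'Object': obj_id,
                             'Points': points,
                             'Length': length})

    df = pd.DataFrame(data)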
    

    Output (tested with your GitHub sample):

    >>> df
         Object  Points     Length
    0         1       5  3074020.0
    1         1       6  3511060.0
    2         1       5  3509020.0
    3         1       5  3505450.0
    4         1       5  3423030.0
    ..      ...     ...        ...
    166      16       4  1461300.0
    167      16       2  1372990.0
    168      16       3  1471150.0
    169      16       3  1392340.0
    170      16       4  1381150.0
    
    [171 rows x 3 columns]
    

    Corralien
  2. Upvotes: 0

    Try:

    import pandas as pd

    df = pd.DataFrame(columns=["Object", "Points", "Length"])
    with open('textfile.txt', 'r') as infile:
        for line in infile:
            if "OBJECT" in line:
                # "OBJECT 12" -> 12
                object_num = int(line.strip()[7:])

            if "CONTOUR" in line:
                tokens = line.split(" ")
                # count sits just before "points"; value sits two tokens after "length"
                index_points = tokens.index("points") - 1
                index_length = tokens.index("length") + 2
                # note: Length is stored as a formatted string, not a float
                df.loc[len(df)] = {"Object": object_num, "Points": int(tokens[index_points]),
                                   "Length": "{:.2E}".format(float(tokens[index_length]))}

    print(df)
    

    Output:

      Object Points    Length
    0      1      5  3.07E+06
    1      1      6  3.51E+06
    2      1      5  3.50E+06
    3      2      4  1.86E+06
    4      2      4  2.29E+06
    5      2      5  2.47E+06
    6      2      5  2.47E+06
    7      3      6  2.74E+06
    

    ESI
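
Since the question also tried regular expressions, here is a minimal sketch of a line-by-line regex variant. It is not taken from either answer. The helper name `parse_contours` and both patterns are assumptions based only on the sample text shown in the question, so they may need adjusting for files that deviate from that layout. Matching one line at a time avoids having to capture a whole block between two "OBJECT" headers, so the last block is handled like any other.

```python
import re

# Assumed line formats, inferred from the sample in the question.
OBJECT_RE = re.compile(r"^OBJECT\s+(\d+)\s*$")
CONTOUR_RE = re.compile(r"^\s*CONTOUR\s+\S+\s+(\d+)\s+points\s+length\s*=\s*(\S+)")


def parse_contours(text):
    """Return one dict per CONTOUR line (hypothetical helper)."""
    rows = []
    obj_id = None  # id of the most recent OBJECT header
    for line in text.splitlines():
        obj_match = OBJECT_RE.match(line)
        if obj_match:
            obj_id = int(obj_match.group(1))
            continue
        contour_match = CONTOUR_RE.match(line)
        # Skip CONTOUR lines that appear before any OBJECT header.
        if contour_match and obj_id is not None:
            rows.append({
                "Object": obj_id,
                "Points": int(contour_match.group(1)),
                "Length": float(contour_match.group(2)),
            })
    return rows
```

The returned list of dicts can be passed straight to `pd.DataFrame(...)`, as in the first answer. Keeping Length as a float leaves the column numeric, and the scientific notation can be applied only when displaying it.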