[Solved] Python lxml wrapping elements

This is somewhat similar to the question: Stack Overflow: Python lxml wrapping elements

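On some web pages the main content is not wrapped in a <div>, so extracting the content returned only a single paragraph rather than the content as a whole. At first I simply changed the <body> element into a <div>, but that had side effects, because other code checks whether a tag is body. So I decided instead to wrap the children of <body> in an extra <div> by default.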

Suppose we have the following HTML fragment:

<html>
<head>
</head>
<body>
    Text
    <h1>Title</h1>
    Tail1
    <div>p1<p>inner text</p></div>
    <p>p2</p>
    Tail2
</body>

I want to convert it into the following fragment:

<html>
<head>
</head>
<body>
    <div>
        Text
        <h1>Title</h1>
        Tail1
        <div>p1<p>inner text</p></div>
        <p>p2</p>
        Tail2
    </div>
</body>

Using the following code:

from lxml.html import html5parser
from lxml.html import fragment_fromstring
from lxml.html import tostring
html = """
<html>
<head>
</head>
<body>
    Text
    <h1>Title</h1>
    Tail1
    <div>p1<p>inner text</p></div>
    <p>p2</p>
    Tail2
</body>
"""
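# Parse with the HTML5 parser (requires the html5lib package),
# then re-parse the serialized tree with lxml.html.
html5doc = html5parser.document_fromstring(html, guess_charset=False)
root = fragment_fromstring(tostring(html5doc))
for elem in root.findall(".//body"):
    # 1. Wrap the leading text in <p>, i.e. turn Text into <p>Text</p>
    if elem.text and elem.text.strip():
        p = fragment_fromstring('<p/>')
        p.text = elem.text
        elem.text = None
        elem.insert(0, p)
    # 2. Wrap the children of body in a <div>
    div = fragment_fromstring("<div/>")
    # Iterate over a snapshot, because append() moves each child out of body
    for e in list(elem.iterchildren()):
        print(e, e.text)
        div.append(e)  # the child's tail text moves with it
        print(tostring(div))
    elem.insert(0, div)

print(tostring(root))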

Output:

<Element p at 0x37b3990> Text
b'<div><p>Text</p></div>'
<Element h1 at 0x37b39e8> Title
b'<div><p>Text</p><h1>Title</h1>Tail1</div>'
<Element div at 0x37b3830> p1
b'<div><p>Text</p><h1>Title</h1>Tail1<div>p1<p>inner text</p></div></div>'
<Element p at 0x37b3938> p2
b'<div><p>Text</p><h1>Title</h1>Tail1<div>p1<p>inner text</p></div><p>p2</p>Tail2</div>'
b'<html xmlns:html="http://www.w3.org/1999/xhtml"><head></head><body><div><p>Text</p><h1>Title</h1>Tail1<div>p1<p>inner text</p></div><p>p2</p>Tail2</div></body></html>'

The same result, formatted as HTML:

<html>
<head>
</head>
<body>
    <div>
        <p>Text</p>
        <h1>Title</h1>
        Tail1
        <div>p1<p>inner text</p></div>
        <p>p2</p>
        Tail2
    </div>
</body>
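
The result above wraps Text in a <p>, which differs slightly from the target fragment, where Text sits directly inside the new <div>. Below is a minimal sketch of one possible alternative that moves body's leading text into the wrapper's .text instead. wrap_children is a hypothetical helper name, not an lxml function. The input is parsed as XML with whitespace removed to keep the output short, and the fallback import of the standard library's ElementTree is an assumption added only so that the sketch runs without lxml installed.

```python
try:
    from lxml import etree  # the library used in this post
except ImportError:
    # Assumption: fall back to the standard library, which shares the
    # .text/.tail element API, so the sketch still runs without lxml.
    import xml.etree.ElementTree as etree


def wrap_children(parent, tag="div"):
    """Move parent's leading text and child elements into one new <tag> child.

    Hypothetical helper, not part of lxml.
    """
    wrapper = etree.Element(tag)
    # Leading text is stored on parent.text, not on a child, so move it by hand.
    wrapper.text = parent.text
    parent.text = None
    # Iterate over a snapshot so that moving children cannot disturb the loop.
    for child in list(parent):
        parent.remove(child)   # the child keeps its .tail (Tail1, Tail2)
        wrapper.append(child)
    parent.append(wrapper)
    return wrapper


# Parsed as XML here for brevity; whitespace is omitted to keep the output short.
body = etree.fromstring(
    "<body>Text<h1>Title</h1>Tail1"
    "<div>p1<p>inner text</p></div><p>p2</p>Tail2</body>"
)
wrap_children(body)
result = etree.tostring(body, encoding="unicode")
print(result)
```

With this variant no extra <p> is introduced, so code that later looks at the wrapper sees Text as the wrapper's own leading text. Whether that or the <p> wrapping is preferable depends on how the content extraction treats bare text.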

Step 1 is a preprocessing step. Without it, the output would be:

<Element h1 at 0x38e7990> Title
b'<div><h1>Title</h1>Tail1</div>'
<Element div at 0x38e7938> p1
b'<div><h1>Title</h1>Tail1<div>p1<p>inner text</p></div></div>'
<Element p at 0x38e7830> p2
b'<div><h1>Title</h1>Tail1<div>p1<p>inner text</p></div><p>p2</p>Tail2</div>'
b'<html xmlns:html="http://www.w3.org/1999/xhtml"><head></head><body>Text<div><h1>Title</h1>Tail1<div>p1<p>inner text</p></div><p>p2</p>Tail2</div></body></html>'

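This is because Text is not included in elem.iterchildren(): it is stored as the .text of body itself, so it stays behind in body. Tail1 and Tail2, by contrast, are stored as the .tail of h1 and p, so they move together with those elements.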
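
To make the text/tail distinction concrete, here is a small sketch that prints where each piece of text in the example is stored. It assumes the fragment is parsed as XML with the whitespace removed, and the fallback to the standard library's ElementTree is an assumption added only so that it runs without lxml.

```python
try:
    from lxml import etree  # the library used in this post
except ImportError:
    # Assumption: the standard library exposes the same .text/.tail attributes.
    import xml.etree.ElementTree as etree

body = etree.fromstring(
    "<body>Text<h1>Title</h1>Tail1"
    "<div>p1<p>inner text</p></div><p>p2</p>Tail2</body>"
)

# Text before the first child belongs to the parent itself.
print(repr(body.text))  # 'Text'

# Text after a child's closing tag is stored on that child as .tail.
children = [(child.tag, child.text, child.tail) for child in body]
for row in children:
    print(row)
# ('h1', 'Title', 'Tail1')
# ('div', 'p1', None)
# ('p', 'p2', 'Tail2')
```

Iterating over the children never yields Text as an item of its own, which is why it has to be handled separately before the children are moved.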


[Solved] Using lxml to filter HTML elements whose class or id matches a given regex

Using lxml to filter HTML elements whose class or id matches a given regex


[To learn] matplotlib-related

matplotlib-related


[Learning] Parsing HTML with lxml

Parsing HTML with lxml