Notice
Recent Posts
Recent Comments
관리 메뉴

안까먹을라고 쓰는 블로그

[Scrapy] Selector 본문

Language/Python

[Scrapy] Selector

YawnsDuzin 2020. 3. 13. 19:28

 

반응형

 

웹 페이지를 가지고 올때, 일반적으로 수행해야 할 작업은 HTML소스에서 데이터를 추출하는 것입니다. 이를 위해 Scrapy에는 데이터 추출을 위한 자체 메커니즘이 있습니다. XPath 또는 CSS 표현식으로 지정 된 HTML 문서의 특정 부분을 선택하기 때문에 Selector 라고 합니다.

 

 예제 사이트

https://docs.scrapy.org/en/latest/_static/selectors-sample1.html

<html>
 <head>
  <base href='http://example.com/' />
  <title>Example website</title>
 </head>
 <body>
  <div id='images'>
   <a href='image1.html'>Name: My image 1 <br /><img src='image1_thumb.jpg' /></a>
   <a href='image2.html'>Name: My image 2 <br /><img src='image2_thumb.jpg' /></a>
   <a href='image3.html'>Name: My image 3 <br /><img src='image3_thumb.jpg' /></a>
   <a href='image4.html'>Name: My image 4 <br /><img src='image4_thumb.jpg' /></a>
   <a href='image5.html'>Name: My image 5 <br /><img src='image5_thumb.jpg' /></a>
  </div>
 </body>
</html>

 

Scrapy Shell 여는 명령어
scrapy shell https://docs.scrapy.org/en/latest/_static/selectors-sample1.html

 

 Selector 명령
기본 명령 
>>> response.xpath('//title/text()')
[<Selector xpath='//title/text()' data='Example website'>]
>>> response.xpath('//title/text()').getall()
['Example website']
>>> response.xpath('//title/text()').get()
'Example website'
>>> response.css('title::text').get()
'Example website'
>>> response.css('img').xpath('@src').getall()
['image1_thumb.jpg',
 'image2_thumb.jpg',
 'image3_thumb.jpg',
 'image4_thumb.jpg',
 'image5_thumb.jpg']
>>> response.xpath('//div[@id="images"]/a/text()').get()
'Name: My image 1 '
>>> response.xpath('//div[@id="not-exists"]/text()').get() is None
True
>>> response.xpath('//div[@id="not-exists"]/text()').get(default='not-found')
'not-found'
>>> [img.attrib['src'] for img in response.css('img')]
['image1_thumb.jpg',
 'image2_thumb.jpg',
 'image3_thumb.jpg',
 'image4_thumb.jpg',
 'image5_thumb.jpg']
>>> response.css('img').attrib['src']
'image1_thumb.jpg'

 

>>> response.css('base').attrib['href']
'http://example.com/'
>>> response.xpath('//base/@href').get()
'http://example.com/'
>>> response.css('base::attr(href)').get()
'http://example.com/'
>>> response.css('base').attrib['href']
'http://example.com/'

 

>>> response.xpath('//a[contains(@href, "image")]/@href').getall()
['image1.html',
 'image2.html',
 'image3.html',
 'image4.html',
 'image5.html']
>>> response.css('a[href*=image]::attr(href)').getall()
['image1.html',
 'image2.html',
 'image3.html',
 'image4.html',
 'image5.html']
>>> response.xpath('//a[contains(@href, "image")]/img/@src').getall()
['image1_thumb.jpg',
 'image2_thumb.jpg',
 'image3_thumb.jpg',
 'image4_thumb.jpg',
 'image5_thumb.jpg']
>>> response.css('a[href*=image] img::attr(src)').getall()
['image1_thumb.jpg',
 'image2_thumb.jpg',
 'image3_thumb.jpg',
 'image4_thumb.jpg',
 'image5_thumb.jpg']

 

CSS 선택기의 확장 
>>> response.css('title::text').get()
'Example website'

title::text - <title>요소의 하위 텍스트 노드를 선택합니다.

 

>>> response.css('#images *::text').getall()
['\n   ',
 'Name: My image 1 ',
 '\n   ',
 'Name: My image 2 ',
 '\n   ',
 'Name: My image 3 ',
 '\n   ',
 'Name: My image 4 ',
 '\n   ',
 'Name: My image 5 ',
 '\n  ']

*::text - 현재 선택기 컨텍스트의 모든 하위 텍스트 노드를 선택합니다.

 

>>> response.css('img::text').getall()
[]

 

>>> response.css('img::text').get()
>>> response.css('img::text').get(default='')
''

default='' - 항상 문자열을 원하면 사용

 

>>> response.css('a::attr(href)').getall()
['image1.html',
 'image2.html',
 'image3.html',
 'image4.html',
 'image5.html']

a::attr(href) - 하위 링크의 href 속성값을 선택합니다

 

 중첩 선택기
>>> links = response.xpath('//a[contains(@href, "image")]')
>>> links.getall()
['<a href="image1.html">Name: My image 1 <br><img src="image1_thumb.jpg"></a>',
 '<a href="image2.html">Name: My image 2 <br><img src="image2_thumb.jpg"></a>',
 '<a href="image3.html">Name: My image 3 <br><img src="image3_thumb.jpg"></a>',
 '<a href="image4.html">Name: My image 4 <br><img src="image4_thumb.jpg"></a>',
 '<a href="image5.html">Name: My image 5 <br><img src="image5_thumb.jpg"></a>']
>>> for index, link in enumerate(links):
...     args = (index, link.xpath('@href').get(), link.xpath('img/@src').get())
...     print('Link number %d points to url %r and image %r' % args)
Link number 0 points to url 'image1.html' and image 'image1_thumb.jpg'
Link number 1 points to url 'image2.html' and image 'image2_thumb.jpg'
Link number 2 points to url 'image3.html' and image 'image3_thumb.jpg'
Link number 3 points to url 'image4.html' and image 'image4_thumb.jpg'
Link number 4 points to url 'image5.html' and image 'image5_thumb.jpg'

 

 요소 속성 선택하기
>>> response.xpath("//a/@href").getall()
['image1.html', 'image2.html', 'image3.html', 'image4.html', 'image5.html']
>>> response.css('a::attr(href)').getall()
['image1.html', 'image2.html', 'image3.html', 'image4.html', 'image5.html']
>>> [a.attrib['href'] for a in response.css('a')]
['image1.html', 'image2.html', 'image3.html', 'image4.html', 'image5.html']

 

>>> response.css('base').attrib
{'href': 'http://example.com/'}
>>> response.css('base').attrib['href']
'http://example.com/'
>>> response.css('foo').attrib
{}

 

 정규표현식 에 Selector 사용하기

 - Selector 의 .re() 정규식을 사용하여 데이터를 추출하는 방법도 있습니다.

>>> response.xpath('//a[contains(@href, "image")]/text()').re(r'Name:\s*(.*)')
['My image 1',
 'My image 2',
 'My image 3',
 'My image 4',
 'My image 5']
>>> response.xpath('//a[contains(@href, "image")]/text()').re_first(r'Name:\s*(.*)')
'My image 1'

 

>>> response.css('a::attr(href)').get()
'image1.html'
>>> response.css('a::attr(href)').extract_first()
'image1.html'

 SelectorList.get() = SelectorList.extract_first() 는 동일하다

 

>>> response.css('a::attr(href)').getall()
['image1.html', 'image2.html', 'image3.html', 'image4.html', 'image5.html']
>>> response.css('a::attr(href)').extract()
['image1.html', 'image2.html', 'image3.html', 'image4.html', 'image5.html']

SelectorList.getall() = SelectorList.extract() 는 동일하다.

 

>>> response.css('a::attr(href)')[0].get()
'image1.html'
>>> response.css('a::attr(href)')[0].extract()
'image1.html'
>>> response.css('a::attr(href)')[0].getall()
['image1.html']

Selector.get() = Selector.extract() 는 동일하다.

 

 Xpath로 작업하기

 

 

 


출처 : https://docs.scrapy.org/en/latest/topics/selectors.html

 

Selectors — Scrapy 2.0.0 documentation

Here are some tips which may help you to use XPath with Scrapy selectors effectively. If you are not much familiar with XPath yet, you may want to take a look first at this XPath tutorial. Using text nodes in a condition When you need to use the text conte

docs.scrapy.org

 

반응형
Comments